Website Scrape

Scrape website content and bring it into Galaxy for processing. Galaxy fetches each page, then extracts its structure, text, and entities.

How to Connect

  1. Select Website Scrape as the Source type
  2. Provide base URL: Enter the base URL of the website to scrape
  3. Configure scraping options:
    • Max depth: Maximum depth to crawl (optional)
    • Max pages: Maximum number of pages to scrape (optional)
    • Allowed domains: Specific domains to allow (optional)
    • Delay between requests: Delay in milliseconds between requests (optional)
    • Respect robots.txt: Whether to respect robots.txt (optional)
    • User agent: Custom user agent string (optional)
  4. Name the Source: Give your Website Scrape Source a name
Once created, Galaxy begins scraping the specified URL and processing the content.
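
As a rough illustration of what the Respect robots.txt and User agent options in step 3 mean in practice, the sketch below checks URLs against a robots.txt file using Python's standard urllib.robotparser module. It is not Galaxy's implementation; the user agent string, the sample robots.txt content, and the build_checker helper are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical user agent string; Galaxy's default is not documented here.
USER_AGENT = "ExampleScraper/1.0"

# Sample robots.txt content, inlined so the example runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def build_checker(robots_txt: str, user_agent: str):
    """Return a function that reports whether user_agent may fetch a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())

    def can_fetch(url: str) -> bool:
        return parser.can_fetch(user_agent, url)

    return can_fetch


can_fetch = build_checker(ROBOTS_TXT, USER_AGENT)
print(can_fetch("https://example.com/docs/"))         # allowed by the rules above
print(can_fetch("https://example.com/private/page"))  # blocked by Disallow: /private/
```

A scraper that respects robots.txt would skip any URL for which such a check returns False.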

Configuration Options

URL Configuration

  • Base URL: Starting URL for the scrape

Scraping Options

  • Max Depth: Maximum depth to crawl from the base URL
  • Max Pages: Maximum number of pages to scrape
  • Allowed Domains: List of domains that are allowed to be scraped (leave empty to allow all)
  • Delay Between Requests: Delay in milliseconds between requests (helps avoid overwhelming servers)
  • Respect robots.txt: Whether to respect robots.txt rules
  • User Agent: Custom user agent string to identify the scraper
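
To make the interaction between these limits concrete, here is a minimal sketch of a breadth-first crawl that applies a depth limit, a page limit, a domain allow-list, and a delay. It is only an illustration under stated assumptions: the ScrapeOptions field names are hypothetical and do not reflect Galaxy's actual schema, the base URL is assumed to be depth 0, and subdomains of an allowed domain are assumed to be allowed. Page fetching is replaced by a get_links callback so the example runs offline.

```python
import time
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional
from urllib.parse import urljoin, urlparse


@dataclass
class ScrapeOptions:
    # Hypothetical field names mirroring the options listed above.
    base_url: str
    max_depth: Optional[int] = None   # None = no depth limit
    max_pages: Optional[int] = None   # None = no page limit
    allowed_domains: List[str] = field(default_factory=list)  # empty = allow all
    delay_ms: int = 0


def is_allowed(url: str, allowed_domains: List[str]) -> bool:
    """An empty allow-list permits every domain (assumption: subdomains match too)."""
    if not allowed_domains:
        return True
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


def crawl(options: ScrapeOptions,
          get_links: Callable[[str], Iterable[str]]) -> List[str]:
    """Visit pages breadth-first; get_links(url) returns the links on that page."""
    visited: List[str] = []
    seen = {options.base_url}
    queue = deque([(options.base_url, 0)])
    while queue:
        if options.max_pages is not None and len(visited) >= options.max_pages:
            break
        url, depth = queue.popleft()
        visited.append(url)
        # Pages at the depth limit are visited, but their links are not followed.
        if options.max_depth is not None and depth >= options.max_depth:
            continue
        for link in get_links(url):
            absolute = urljoin(url, link)
            if absolute not in seen and is_allowed(absolute, options.allowed_domains):
                seen.add(absolute)
                queue.append((absolute, depth + 1))
        if options.delay_ms:
            time.sleep(options.delay_ms / 1000)  # pause before the next page
    return visited


# A tiny in-memory "site" standing in for real HTTP responses.
SITE = {
    "https://example.com/": ["/a", "/b", "https://other.org/x"],
    "https://example.com/a": ["/c"],
}
opts = ScrapeOptions("https://example.com/", max_depth=1,
                     allowed_domains=["example.com"])
pages = crawl(opts, lambda url: SITE.get(url, []))
print(pages)
```

With max_depth=1 the page /c is never reached, and the link to other.org is dropped by the allow-list.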

Content Processing

Galaxy processes scraped website content with:
  • Text extraction: Extracts text content from web pages, preserving structure and layout
  • Content normalization: Normalizes scraped content for consistency
  • Entity extraction: Automatically extracts and normalizes semantic entities including:
    • Dates and times (normalized to standard formats)
    • Email addresses and URLs
    • Phone numbers
    • Measurements, money, and percentages
    • Serial numbers, model numbers, and part numbers
    • IP addresses and version numbers
    • Technical measurements (temperature, pressure, voltage, current, frequency)
  • Entity normalization: Converts each extracted entity to a standardized format
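
The extract-then-normalize idea can be sketched with a few regular expressions. This is a simplified illustration, not Galaxy's extraction pipeline: the patterns, the output shape, and the choice of normalized formats (lowercase emails, ISO 8601 dates, dates read as MM/DD/YYYY) are assumptions made for the example, and it covers only four of the entity types listed above.

```python
import ipaddress
import re
from datetime import date

EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s?%")
US_DATE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")  # assumed MM/DD/YYYY


def extract_entities(text: str) -> list:
    """Return dicts with the entity type, the raw match, and a normalized value."""
    entities = []
    for m in EMAIL.finditer(text):
        entities.append({"type": "email", "raw": m.group(), "value": m.group().lower()})
    for m in IPV4.finditer(text):
        try:
            # ip_address rejects out-of-range octets such as 999.1.1.1
            value = str(ipaddress.ip_address(m.group()))
        except ValueError:
            continue
        entities.append({"type": "ipv4", "raw": m.group(), "value": value})
    for m in PERCENT.finditer(text):
        entities.append({"type": "percentage", "raw": m.group(), "value": m.group(1)})
    for m in US_DATE.finditer(text):
        month, day, year = (int(g) for g in m.groups())
        try:
            value = date(year, month, day).isoformat()
        except ValueError:
            continue  # skip impossible dates such as 13/45/2024
        entities.append({"type": "date", "raw": m.group(), "value": value})
    return entities


sample = "Contact Support@Example.com by 03/07/2024. Server 10.0.0.1 is at 95.5% load."
for entity in extract_entities(sample):
    print(entity["type"], entity["raw"], "->", entity["value"])
```

Keeping the raw match next to the normalized value makes it possible to trace each entity back to the original page text.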