
How to Connect
- Select Browserbase as the Source type
- Name the Source: Give your Website Scrape Source a name
- Provide base URL: Enter the base URL of the website to scrape
- (Optional) Add a Prompt: Provide additional instructions to guide extraction (e.g., focus on product documentation, ignore navigation, prioritize structured data)
- Choose Credentials 5. Galaxy managed (Recommended) — Galaxy handles Browserbase credentials automatically 5. Bring your own API key — Connect your own Browserbase account
- (Optional) Configure Max URLs to Scrape: Set the maximum number of URLs Galaxy should process from the starting page.
COMING SOON More Configuration Options
- Max Depth: Maximum depth to crawl from the base URL
- Max Pages: Maximum number of pages to scrape
- Allowed Domains: List of domains that are allowed to be scraped (leave empty to allow all)
- Delay Between Requests: Delay in milliseconds between requests (helps avoid overwhelming servers)
- Respect robots.txt: Whether to respect robots.txt rules
- User Agent: Custom user agent string to identify the scraper
Content Processing
Galaxy processes scraped website content with:- Text extraction: Extracts text content from web pages, preserving structure and layout
- Content normalization: Normalizes scraped content for consistency
- Entity extraction: Automatically extracts and normalizes semantic entities including:
- Dates and times (normalized to standard formats)
- Email addresses and URLs
- Phone numbers
- Measurements, money, and percentages
- Serial numbers, model numbers, and part numbers
- IP addresses and version numbers
- Technical measurements (temperature, pressure, voltage, current, frequency)
- Normalization: Extracted entities are normalized to standardized formats
Don’t see a source type you’re looking for? We connect to hundreds of systems - reach out to support@getgalaxy.io to request access.