How it works
The crawling process has several steps, described in the sections below.

Only one crawl job is allowed per agent at a time. You cannot submit a new crawl while one is already running or queued.
Crawl modes
Single page (default)
By default, only the URL you enter is scraped. No links are followed: the crawler extracts text content from that one page only.
Recursive
Toggle Crawl Website on to enable recursive crawling. The crawler starts at your URL and follows same-domain links it finds on each page. Links are only followed if they fall under the same path as the starting URL. For example, if you start at https://example.com/docs, the crawler will follow links to https://example.com/docs/getting-started but not https://example.com/blog.
Recursive crawling goes up to 5 levels deep from the starting URL.
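In code terms, the scoping rule can be sketched like this. This is a minimal illustration in Python; the function name and the prefix check are assumptions for clarity, not the product's actual implementation:

```python
from urllib.parse import urlparse

MAX_DEPTH = 5  # recursive crawling goes up to 5 levels deep

def in_crawl_scope(start_url: str, link_url: str, depth: int) -> bool:
    """Illustrative check: should link_url be followed at the given depth?"""
    if depth > MAX_DEPTH:
        return False
    start, link = urlparse(start_url), urlparse(link_url)
    # Same domain only
    if link.netloc != start.netloc:
        return False
    # Must fall under the same path as the starting URL
    return link.path.startswith(start.path)
```

With the example above, `in_crawl_scope("https://example.com/docs", "https://example.com/docs/getting-started", 1)` passes, while a link to `https://example.com/blog` does not.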
Sitemap
Toggle Sitemap on to switch to sitemap mode. The crawler fetches your XML sitemap and extracts all listed URLs. This is useful when you want to crawl specific pages or when your site’s link structure doesn’t connect all pages. Enter your sitemap URL (e.g., https://example.com/sitemap.xml). The crawler supports both individual sitemaps and sitemap index files that reference multiple sitemaps.
In sitemap mode, any URL on the same domain is in scope (not restricted to the starting path like recursive mode).
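A sketch of how a sitemap and a sitemap index can be told apart and their URLs extracted, using Python's standard XML parser. The function is illustrative; the real crawler's parsing may differ:

```python
import xml.etree.ElementTree as ET

# Standard sitemap namespace defined by the sitemaps.org protocol
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def extract_sitemap_urls(xml_text: str) -> tuple[str, list[str]]:
    """Return ("index" | "urlset", urls) for a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text)
    if root.tag.endswith("sitemapindex"):
        # A sitemap index lists child sitemaps, each fetched in turn
        return "index", [loc.text for loc in root.findall("sm:sitemap/sm:loc", NS)]
    return "urlset", [loc.text for loc in root.findall("sm:url/sm:loc", NS)]
```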
Configuration
Advanced settings
Toggle Advanced to access these settings:

| Setting | Description | Default |
|---|---|---|
| Crawl Website | Enable to follow links and crawl multiple pages recursively. When off, only the entered URL is scraped. | Off |
| Sitemap | Switch to sitemap mode instead of recursive crawling. | Off |
| Concurrent Crawling | Number of pages crawled simultaneously. Higher values are faster but put more load on the target server. | 10 |
| Selectors to Exclude | CSS selectors for elements to remove before extracting text (e.g., header, footer, .sidebar, nav). | header, #header, .header, footer, #footer, .footer, form |
Page limit
Each plan has a maximum number of pages per crawl job. The crawler stops once it reaches your plan’s limit or runs out of pages to process, whichever comes first.

| Plan | Pages per crawl |
|---|---|
| Free | 100 |
| Basic AI | 5,000 |
| Standard AI | 10,000 |
| Pro AI | 10,000 |
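The stopping rule is simple: crawl until the queue of discovered pages is empty or the plan limit is reached, whichever comes first. A sketch (limits copied from the table above; the function shape is illustrative):

```python
PLAN_PAGE_LIMITS = {"Free": 100, "Basic AI": 5000, "Standard AI": 10000, "Pro AI": 10000}

def crawl(queue, plan):
    """Process queued pages, stopping at the plan limit or an empty queue."""
    limit = PLAN_PAGE_LIMITS[plan]
    crawled = []
    while queue and len(crawled) < limit:
        crawled.append(queue.pop(0))  # a real crawler would fetch the page here
    return crawled
```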
Content extraction
For each page, the crawler:

- Loads the page in a headless browser
- Removes `<script>`, `<style>`, and `<noscript>` elements
- Removes hidden elements (`display: none`, `visibility: hidden`, `[hidden]`)
- Removes elements matching your Selectors to Exclude
- Extracts the remaining visible text content
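A stripped-down version of this extraction pipeline, using Python's built-in HTML parser. This is illustrative only: it handles tag-name excludes such as header or footer, real CSS selectors like .sidebar would need a proper selector engine, and the headless-browser step is omitted:

```python
from html.parser import HTMLParser

SKIP_ALWAYS = {"script", "style", "noscript"}

class TextExtractor(HTMLParser):
    """Collects visible text, skipping excluded and hidden subtrees."""
    def __init__(self, exclude_tags=()):
        super().__init__()
        self.skip_tags = SKIP_ALWAYS | set(exclude_tags)
        self.depth = 0       # nesting depth inside a skipped subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        style = attrs.get("style") or ""
        hidden = ("hidden" in attrs or "display: none" in style
                  or "visibility: hidden" in style)
        if tag in self.skip_tags or hidden or self.depth:
            self.depth += 1  # enter (or stay inside) a skipped subtree

    def handle_endtag(self, tag):
        if self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html, exclude_tags=("header", "footer", "nav")):
    parser = TextExtractor(exclude_tags)
    parser.feed(html)
    return " ".join(parser.chunks)
```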
Reviewing results
After crawling completes, you’ll see a list of all discovered pages with:

- Page title
- URL
- Text length (characters)
- Status (success or failed)

From this list you can:

- Select individual pages to add as Knowledge Base resources
- Select all pages to bulk-add everything
- Review page content before adding
Troubleshooting
Crawl stuck in queue
Only one crawl runs at a time per workspace. If another crawl is in progress, your job will wait. If a crawl stays queued for more than 30 minutes with no other active jobs, try submitting the URL again.
Pages showing as failed
Individual pages can fail for several reasons:
- The page returned a non-HTML response (e.g., a PDF or image URL)
- The server returned an error (403, 404, 500)
- The page took too long to load
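These failure cases can be thought of as a simple classification of each fetch result. The labels and checks below are illustrative, not the product's exact rules:

```python
def classify_page_result(status_code, content_type, timed_out):
    """Map a fetch outcome to the failure reasons listed above (illustrative)."""
    if timed_out:
        return "failed: page took too long to load"
    if status_code >= 400:
        return f"failed: server returned {status_code}"
    if "text/html" not in content_type:
        return f"failed: non-HTML response ({content_type})"
    return "success"
```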
Missing pages in results
- Recursive mode: Only pages linked from the starting URL (and its descendants) are discovered. Pages that aren’t linked from any crawled page won’t be found. Try sitemap mode instead.
- Sitemap mode: Only URLs listed in the sitemap are crawled. Make sure your sitemap is up to date.
- Check that you haven’t hit your max pages limit.
Extracted content looks wrong or incomplete
- Try adding exclude selectors for navigation menus, sidebars, or cookie banners that add noise
- SPA sites that render content via JavaScript may not be fully captured
- Very large pages may be truncated
Crawl blocked by hosting provider or firewall
Some hosting providers or firewalls may block requests from our crawlers. If all pages are failing or the crawl produces no results, your server may be rejecting the crawler’s requests. Contact support to request our crawler IP addresses so you can add them to your server’s allowlist.