The website crawler lets you import content from any publicly accessible website into your Knowledge Base. You can scrape a single page, crawl an entire site by following links, or use a sitemap to target specific URLs.

How it works

The crawling process has several steps:
  1. Submit a URL: Enter a starting URL and configure crawl settings.
  2. Job queued: The crawl job enters a processing queue.
  3. Crawling: Pages are fetched, content is extracted, and links are followed (if enabled).
  4. Review results: Browse the crawled pages and their extracted content.
  5. Create resources: Select individual pages or all pages to add to your Knowledge Base.
  6. Train: Click Train to process and index the content so your agent can search it.
Only one crawl job is allowed per agent at a time. You cannot submit a new crawl while one is already running or queued.
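The one-job-at-a-time rule amounts to a simple state check before submission. The sketch below is illustrative only: the state names and the `can_submit` helper are assumptions, not the product's actual job-status values or API.

```python
from enum import Enum, auto

class CrawlState(Enum):
    """Hypothetical job states mirroring the steps above; the real
    product's status values may differ."""
    QUEUED = auto()
    CRAWLING = auto()
    REVIEW = auto()
    TRAINED = auto()

# A job counts as active while it is queued or crawling.
ACTIVE = {CrawlState.QUEUED, CrawlState.CRAWLING}

def can_submit(existing_jobs) -> bool:
    """A new crawl is allowed only when no job for this agent is active."""
    return not any(state in ACTIVE for state in existing_jobs)

can_submit([CrawlState.TRAINED])   # True: the previous job finished
can_submit([CrawlState.CRAWLING])  # False: a crawl is still running
```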

Crawl modes

Single page (default)

By default, only the URL you enter is scraped. No links are followed — the crawler extracts text content from that one page only.

Recursive

Toggle Crawl Website on to enable recursive crawling. The crawler starts at your URL and follows same-domain links it finds on each page. Links are only followed if they fall under the same path as the starting URL. For example, if you start at https://example.com/docs, the crawler will follow links to https://example.com/docs/getting-started but not https://example.com/blog. Recursive crawling goes up to 5 levels deep from the starting URL.
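The same-path scoping rule can be expressed as a host-and-path-prefix check. This is a minimal sketch of the rule as described above, not the crawler's actual implementation:

```python
from urllib.parse import urlparse

def in_scope(start_url: str, link: str) -> bool:
    """A link is followed only if it shares the starting URL's host
    and falls under the starting URL's path."""
    start, target = urlparse(start_url), urlparse(link)
    if target.netloc != start.netloc:
        return False
    # Normalize so /docs matches /docs/getting-started but not /docs-old
    base = start.path.rstrip("/")
    return target.path == base or target.path.startswith(base + "/")

in_scope("https://example.com/docs", "https://example.com/docs/getting-started")  # True
in_scope("https://example.com/docs", "https://example.com/blog")                  # False
```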

Sitemap

Toggle Sitemap on to switch to sitemap mode. The crawler fetches your XML sitemap and extracts all listed URLs. This is useful when you want to crawl specific pages or when your site’s link structure doesn’t connect all pages. Enter your sitemap URL (e.g., https://example.com/sitemap.xml). The crawler supports both individual sitemaps and sitemap index files that reference multiple sitemaps. In sitemap mode, any URL on the same domain is in scope (not restricted to the starting path like recursive mode).
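Both sitemap flavors use the standard sitemaps.org XML schema: a `<urlset>` lists pages directly, while a `<sitemapindex>` lists other sitemap files. A minimal stdlib sketch of distinguishing the two and pulling out the `<loc>` entries (the sample URLs are illustrative):

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_urls(xml_text: str) -> tuple[str, list[str]]:
    """Return ('index', sitemap URLs) for a sitemap index file,
    or ('urlset', page URLs) for a regular sitemap."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == f"{NS}sitemapindex" else "urlset"
    return kind, [loc.text.strip() for loc in root.iter(f"{NS}loc")]

sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
</urlset>"""
kind, urls = extract_urls(sitemap)  # ('urlset', [...both page URLs...])
```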

Configuration

Advanced settings

Toggle Advanced to access these settings:
Setting | Description | Default
Crawl Website | Enable to follow links and crawl multiple pages recursively. When off, only the entered URL is scraped. | Off
Sitemap | Switch to sitemap mode instead of recursive crawling. | Off
Concurrent Crawling | Number of pages crawled simultaneously. Higher values are faster but put more load on the target server. | 10
Selectors to Exclude | CSS selectors for elements to remove before extracting text (e.g., header, footer, .sidebar, nav). | header, #header, .header, footer, #footer, .footer, form

Page limit

Each plan has a maximum number of pages per crawl job. The crawler stops once it reaches your plan’s limit or runs out of pages to process, whichever comes first.
Plan | Pages per crawl
Free | 100
Basic AI | 5,000
Standard AI | 10,000
Pro AI | 10,000
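The page limit and the 5-level depth cap from recursive mode interact as a bounded breadth-first traversal: the crawl stops as soon as either bound is hit. The sketch below models this over a pre-built link map standing in for live fetches; it is an illustrative model, not the crawler's actual implementation.

```python
from collections import deque

def crawl(start, links, max_pages=100, max_depth=5):
    """Breadth-first crawl over a link map. Stops at the plan's page
    limit or when the frontier is exhausted, whichever comes first."""
    seen, order = {start}, []
    queue = deque([(start, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)  # "fetch" the page
        if depth == max_depth:
            continue       # don't follow links beyond the depth cap
        for link in links.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order

site = {
    "/docs": ["/docs/a", "/docs/b"],
    "/docs/a": ["/docs/a/deep"],
}
crawl("/docs", site, max_pages=3)  # → ['/docs', '/docs/a', '/docs/b']
```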

Content extraction

For each page, the crawler:
  1. Loads the page in a headless browser
  2. Removes <script>, <style>, and <noscript> elements
  3. Removes hidden elements (display: none, visibility: hidden, [hidden])
  4. Removes elements matching your Selectors to Exclude
  5. Extracts the remaining visible text content
The crawler extracts text content only. Images, videos, and embedded media are not processed. JavaScript-rendered content (SPAs) may not be fully captured if the content loads asynchronously after the initial page render.
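Steps 2-4 above can be approximated with the standard library's HTML parser. This is a simplified sketch: the real pipeline runs in a headless browser with full CSS selector support, whereas here the "selectors" are plain tag names and the input is assumed to be well-formed HTML (every start tag has a matching end tag).

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "noscript"}

class TextExtractor(HTMLParser):
    def __init__(self, exclude=("header", "footer", "form")):
        super().__init__()
        self.exclude = set(exclude)
        self.skip_depth = 0  # > 0 while inside a removed subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        hidden = "hidden" in a or "display: none" in (a.get("style") or "")
        # Drop scripts/styles, hidden elements, and excluded tags,
        # along with everything nested inside them.
        if self.skip_depth or tag in SKIP_TAGS or tag in self.exclude or hidden:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

page = ('<body><header>Site nav</header><div style="display: none">secret</div>'
        '<p>Hello <b>world</b></p><script>var x = 1;</script></body>')
extract_text(page)  # → "Hello world"
```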

Reviewing results

After crawling completes, you’ll see a list of all discovered pages with:
  • Page title
  • URL
  • Text length (characters)
  • Status (success or failed)
From here you can:
  • Select individual pages to add as Knowledge Base resources
  • Select all pages to bulk-add everything
  • Review page content before adding
Once you’ve selected your pages, click Train to process and index the content. Only after training will the content be searchable by your agent.

Troubleshooting

Only one crawl runs at a time per workspace. If another crawl is in progress, your job will wait. If a crawl stays queued for more than 30 minutes with no other active jobs, try submitting the URL again.
Individual pages can fail for several reasons:
  • The page returned a non-HTML response (e.g., a PDF or image URL)
  • The server returned an error (403, 404, 500)
  • The page took too long to load
Failed pages are retried up to 5 times before being marked as permanently failed. You can still create resources from the pages that succeeded.
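The retry behavior described above can be sketched as a bounded retry loop. The `fetch` callable and the exponential backoff are assumptions for illustration, not the crawler's internals; only the 5-attempt cap comes from the documentation.

```python
import time

def fetch_with_retries(fetch, url, max_attempts=5, base_delay=0.0):
    """Try a page up to max_attempts times before marking it
    permanently failed."""
    for attempt in range(1, max_attempts + 1):
        try:
            return {"url": url, "status": "success", "content": fetch(url)}
        except Exception:
            if attempt < max_attempts:
                # Illustrative exponential backoff between attempts
                time.sleep(base_delay * 2 ** (attempt - 1))
    return {"url": url, "status": "failed"}

# A flaky fetcher that times out twice, then succeeds on the third try.
calls = {"count": 0}
def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("page took too long to load")
    return "<html>ok</html>"

result = fetch_with_retries(flaky_fetch, "https://example.com/docs")
```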
If expected pages are missing from the results:
  • Recursive mode: Only pages linked from the starting URL (and its descendants) are discovered. Pages that aren’t linked from any crawled page won’t be found. Try sitemap mode instead.
  • Sitemap mode: Only URLs listed in the sitemap are crawled. Make sure your sitemap is up to date.
  • Check that you haven’t hit your plan’s page limit.
If extracted content is noisy or incomplete:
  • Add exclude selectors for navigation menus, sidebars, or cookie banners that add noise.
  • SPA sites that render content via JavaScript may not be fully captured.
  • Very large pages may be truncated.
Some hosting providers or firewalls may block requests from our crawlers. If all pages are failing or the crawl produces no results, your server may be rejecting the crawler’s requests. Contact support to request our crawler IP addresses so you can add them to your server’s allowlist.