The website crawler lets you import content from any publicly accessible website into your Knowledge Base. You can scrape a single page, crawl an entire site by following links, or use a sitemap to target specific URLs.

How it works

The crawling process has several steps:
  1. Submit a URL: Enter a starting URL and configure crawl settings.
  2. Job queued: The crawl job enters a processing queue.
  3. Crawling: Pages are fetched, content is extracted, and links are followed (if enabled).
  4. Review results: Browse the crawled pages and their extracted content.
  5. Create resources: Select individual pages or all pages to add to your Knowledge Base.
  6. Train: Click Train to process and index the content so your agent can search it.
Only one crawl job is allowed per agent at a time. You cannot submit a new crawl while one is already running or queued.
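The one-job-at-a-time rule amounts to a simple state check before submission. The sketch below is illustrative only: the state names and the `can_submit` helper are assumptions, not the product's actual job-status values or API.

```python
from enum import Enum, auto

class CrawlState(Enum):
    """Hypothetical job states mirroring the steps above; the real
    product's status values may differ."""
    QUEUED = auto()
    CRAWLING = auto()
    REVIEW = auto()
    TRAINED = auto()

# A job counts as active while it is queued or crawling.
ACTIVE = {CrawlState.QUEUED, CrawlState.CRAWLING}

def can_submit(existing_jobs) -> bool:
    """A new crawl is allowed only when no job for this agent is active."""
    return not any(state in ACTIVE for state in existing_jobs)

can_submit([CrawlState.TRAINED])   # True: the previous job finished
can_submit([CrawlState.CRAWLING])  # False: a crawl is still running
```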

Crawl modes

Single page (default)

By default, only the URL you enter is scraped. No links are followed — the crawler extracts text content from that one page only.

Recursive

Toggle Crawl Website on to enable recursive crawling. The crawler starts at your URL and follows same-domain links it finds on each page. Links are only followed if they fall under the same path as the starting URL. For example, if you start at https://example.com/docs, the crawler will follow links to https://example.com/docs/getting-started but not https://example.com/blog. Recursive crawling goes up to 5 levels deep from the starting URL.
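The same-path scoping rule can be expressed as a host-and-path-prefix check. This is a minimal sketch of the rule as described above, not the crawler's actual implementation:

```python
from urllib.parse import urlparse

def in_scope(start_url: str, link: str) -> bool:
    """A link is followed only if it shares the starting URL's host
    and falls under the starting URL's path."""
    start, target = urlparse(start_url), urlparse(link)
    if target.netloc != start.netloc:
        return False
    # Normalize so /docs matches /docs/getting-started but not /docs-old
    base = start.path.rstrip("/")
    return target.path == base or target.path.startswith(base + "/")

in_scope("https://example.com/docs", "https://example.com/docs/getting-started")  # True
in_scope("https://example.com/docs", "https://example.com/blog")                  # False
```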

Sitemap

Toggle Sitemap on to switch to sitemap mode. The crawler fetches your XML sitemap and extracts all listed URLs. This is useful when you want to crawl specific pages or when your site’s link structure doesn’t connect all pages. Enter your sitemap URL (e.g., https://example.com/sitemap.xml). The crawler supports both individual sitemaps and sitemap index files that reference multiple sitemaps. In sitemap mode, any URL on the same domain is in scope (not restricted to the starting path like recursive mode).
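Both sitemap flavors use the standard sitemaps.org XML schema: a `<urlset>` lists pages directly, while a `<sitemapindex>` lists other sitemap files. A minimal stdlib sketch of distinguishing the two and pulling out the `<loc>` entries (the sample URLs are illustrative):

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def extract_urls(xml_text: str) -> tuple[str, list[str]]:
    """Return ('index', sitemap URLs) for a sitemap index file,
    or ('urlset', page URLs) for a regular sitemap."""
    root = ET.fromstring(xml_text)
    kind = "index" if root.tag == f"{NS}sitemapindex" else "urlset"
    return kind, [loc.text.strip() for loc in root.iter(f"{NS}loc")]

sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
</urlset>"""
kind, urls = extract_urls(sitemap)  # ('urlset', [...both page URLs...])
```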

Configuration

Advanced settings

Toggle Advanced to access these settings:
Setting | Description | Default
Crawl Website | Enable to follow links and crawl multiple pages recursively. When off, only the entered URL is scraped. | Off
Sitemap | Switch to sitemap mode instead of recursive crawling. | Off
Concurrent Crawling | Number of pages crawled simultaneously. Higher values are faster but put more load on the target server. | 10
Selectors to Exclude | CSS selectors for elements to remove before extracting text (e.g., header, footer, .sidebar, nav). | header, #header, .header, footer, #footer, .footer, form

Page limit

Each plan has a maximum number of pages per crawl job. The crawler stops once it reaches your plan’s limit or runs out of pages to process, whichever comes first.
Plan | Pages per crawl
Free | 100
Basic AI | 5,000
Standard AI | 10,000
Pro AI | 10,000
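The page limit and the 5-level depth cap from recursive mode interact as a bounded breadth-first traversal: the crawl stops as soon as either bound is hit. The sketch below models this over a pre-built link map standing in for live fetches; it is an illustrative model, not the crawler's actual implementation.

```python
from collections import deque

def crawl(start, links, max_pages=100, max_depth=5):
    """Breadth-first crawl over a link map. Stops at the plan's page
    limit or when the frontier is exhausted, whichever comes first."""
    seen, order = {start}, []
    queue = deque([(start, 0)])
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append(url)  # "fetch" the page
        if depth == max_depth:
            continue       # don't follow links beyond the depth cap
        for link in links.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order

site = {
    "/docs": ["/docs/a", "/docs/b"],
    "/docs/a": ["/docs/a/deep"],
}
crawl("/docs", site, max_pages=3)  # → ['/docs', '/docs/a', '/docs/b']
```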

Content extraction

For each page, the crawler:
  1. Loads the page in a headless browser
  2. Removes <script>, <style>, and <noscript> elements
  3. Removes hidden elements (display: none, visibility: hidden, [hidden])
  4. Removes elements matching your Selectors to Exclude
  5. Extracts the remaining visible text content
The crawler extracts text content only. Images, videos, and embedded media are not processed. JavaScript-rendered content (SPAs) may not be fully captured if the content loads asynchronously after the initial page render.
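Steps 2-4 above can be approximated with the standard library's HTML parser. This is a simplified sketch: the real pipeline runs in a headless browser with full CSS selector support, whereas here the "selectors" are plain tag names and the input is assumed to be well-formed HTML (every start tag has a matching end tag).

```python
from html.parser import HTMLParser

SKIP_TAGS = {"script", "style", "noscript"}

class TextExtractor(HTMLParser):
    def __init__(self, exclude=("header", "footer", "form")):
        super().__init__()
        self.exclude = set(exclude)
        self.skip_depth = 0  # > 0 while inside a removed subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        hidden = "hidden" in a or "display: none" in (a.get("style") or "")
        # Drop scripts/styles, hidden elements, and excluded tags,
        # along with everything nested inside them.
        if self.skip_depth or tag in SKIP_TAGS or tag in self.exclude or hidden:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

page = ('<body><header>Site nav</header><div style="display: none">secret</div>'
        '<p>Hello <b>world</b></p><script>var x = 1;</script></body>')
extract_text(page)  # → "Hello world"
```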

Reviewing results

After crawling completes, you’ll see a list of all discovered pages with:
  • Page title
  • URL
  • Text length (characters)
  • Status (success or failed)
From here you can:
  • Select individual pages to add as Knowledge Base resources
  • Select all pages to bulk-add everything
  • Review page content before adding
Once you’ve selected your pages, click Train to process and index the content. Only after training will the content be searchable by your agent.

Troubleshooting

Only one crawl runs at a time per workspace. If another crawl is in progress, your job will wait. If a crawl stays queued for more than 30 minutes with no other active jobs, try submitting the URL again.
Individual pages can fail for several reasons:
  • The page returned a non-HTML response (e.g., a PDF or image URL)
  • The server returned an error (403, 404, 500)
  • The page took too long to load
Failed pages are retried up to 5 times before being marked as permanently failed. You can still create resources from the pages that succeeded.
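The retry behavior described above can be sketched as a bounded retry loop. The `fetch` callable and the exponential backoff are assumptions for illustration, not the crawler's internals; only the 5-attempt cap comes from the documentation.

```python
import time

def fetch_with_retries(fetch, url, max_attempts=5, base_delay=0.0):
    """Try a page up to max_attempts times before marking it
    permanently failed."""
    for attempt in range(1, max_attempts + 1):
        try:
            return {"url": url, "status": "success", "content": fetch(url)}
        except Exception:
            if attempt < max_attempts:
                # Illustrative exponential backoff between attempts
                time.sleep(base_delay * 2 ** (attempt - 1))
    return {"url": url, "status": "failed"}

# A flaky fetcher that times out twice, then succeeds on the third try.
calls = {"count": 0}
def flaky_fetch(url):
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("page took too long to load")
    return "<html>ok</html>"

result = fetch_with_retries(flaky_fetch, "https://example.com/docs")
```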
If expected pages are missing from the results:
  • Recursive mode: Only pages linked from the starting URL (and its descendants) are discovered. Pages that aren’t linked from any crawled page won’t be found. Try sitemap mode instead.
  • Sitemap mode: Only URLs listed in the sitemap are crawled. Make sure your sitemap is up to date.
  • Check that you haven’t hit your plan’s page limit.
If extracted content is noisy or incomplete:
  • Add exclude selectors for navigation menus, sidebars, or cookie banners that add noise.
  • SPA sites that render content via JavaScript may not be fully captured.
  • Very large pages may be truncated.
Some hosting providers or firewalls may block requests from our crawlers. If all pages are failing or the crawl produces no results, your server may be rejecting the crawler’s requests. Contact support to request our crawler IP addresses so you can add them to your server’s allowlist.