
# Web crawler

K-AI ingests public web pages via the **K-AI Web Crawler** service — a headless-browser crawler that fetches pages, classifies them, extracts structured content, and detects changes via a hash diff on the extracted content. The crawler is used for ingesting public marketing sites, regulatory portals, supplier documentation, and similar URL-addressable corpora.

The crawler is managed independently of the K-AI Instance through the `crawler.kai-studio.ai` admin UI. K-AI exposes it as a connector and the Instance pulls extracted content via the crawler's public read-only API.

## Architecture

The web crawler is a managed service exposing an admin UI for managing URL bases and a public read-only API the K-AI Instance pulls from. Workers are stateless and scale horizontally. The K-AI Instance treats the crawler as a source through the generic ingestion pathway: the connector pulls successful URLs via the crawler's public API and POSTs them to the Orchestrator endpoint of the [Instance API](/knowledge-ai/sources-and-ingestion/instance-api/instance-api.md).

### Multi-stage extraction

Each URL is loaded in a headless browser, classified (CAPTCHA / paywall / login-wall pages exit early), cleaned of boilerplate (cookie banners, navigation, footers, ads), and converted to structured Markdown chunks with breadcrumb context preserved. A quality-scoring pass filters out boilerplate and applies selective LLM refinement to weak chunks. Change detection uses a content hash over the structured output, so heading or page-title changes trigger a re-index while ads and timestamps are ignored.
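
The change-detection step can be illustrated with a minimal sketch. It assumes the structured output is a page title plus a list of chunks with `breadcrumb`, `heading`, and `text` fields; those field names, the whitespace normalization, and the choice of SHA-256 are assumptions for illustration, not the crawler's documented format. Ads and timestamps do not affect the hash here simply because they are presumed to have been stripped before this stage.

```python
import hashlib
import json


def content_hash(title: str, chunks: list[dict]) -> str:
    """Hash the structured output only (hypothetical chunk schema)."""
    canonical = {
        "title": title,
        "chunks": [
            {
                "breadcrumb": c.get("breadcrumb", []),
                "heading": c.get("heading", ""),
                # Collapse whitespace so pure reflows do not change the hash.
                "text": " ".join(c.get("text", "").split()),
            }
            for c in chunks
        ],
    }
    # sort_keys gives a stable serialization regardless of dict ordering.
    payload = json.dumps(canonical, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


page_v1 = [{"breadcrumb": ["Docs"], "heading": "Pricing", "text": "Plans start at 10 EUR."}]
page_v2 = [{"breadcrumb": ["Docs"], "heading": "Pricing 2025", "text": "Plans start at 10 EUR."}]
page_v1_reflowed = [{"breadcrumb": ["Docs"], "heading": "Pricing", "text": "Plans  start at\n10 EUR."}]

heading_changed = content_hash("Acme", page_v1) != content_hash("Acme", page_v2)
reflow_ignored = content_hash("Acme", page_v1) == content_hash("Acme", page_v1_reflowed)
```

In this sketch a heading edit or a title edit produces a new hash, while a whitespace-only reflow of the same text does not.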

## Authentication

Two layers:

* **Source-side** (between crawler and the web): anonymous for public web pages. Basic auth and cookie headers are supported for protected sites and are configured per URL base.
* **K-AI side** (between Instance and crawler): customers receive an API key for each URL base via the crawler admin UI; revocation is one-click. The K-AI Instance authenticates against the crawler's public API with an `X-API-Key` header.
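
As a rough illustration of the K-AI side, the sketch below builds (but does not send) a request carrying the `X-API-Key` header. The host `crawler.example.invalid` is a placeholder because the public API host is not stated here, the `Accept: application/json` header assumes a JSON response, and `build_crawler_request` is a hypothetical helper name. The path `/api/public/base` is the one the connector calls during setup.

```python
import urllib.request


def build_crawler_request(
    base_url: str, api_key: str, path: str = "/api/public/base"
) -> urllib.request.Request:
    """Prepare an authenticated GET request for the crawler's public API."""
    req = urllib.request.Request(base_url.rstrip("/") + path, method="GET")
    # The per-URL-base key travels in the X-API-Key header.
    req.add_header("X-API-Key", api_key)
    # Assumption: the API returns JSON.
    req.add_header("Accept", "application/json")
    return req


# Placeholder host and key; nothing is sent over the network here.
req = build_crawler_request("https://crawler.example.invalid", "demo-key")
```

Passing `req` to `urllib.request.urlopen` would perform the call. Note that `urllib` stores header names in a normalized capitalization (`X-api-key`), which is harmless because HTTP header names are case-insensitive.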

## Configuration

| Field     | Type   | Required | Description                                                                                      |
| --------- | ------ | -------- | ------------------------------------------------------------------------------------------------ |
| `api_key` | string | yes      | Public API key issued by the crawler for a specific URL base. Encrypted at rest by the platform. |

The URL base itself — seed URLs, allow-list, deny-list, crawl depth (single-page only, no recursion), and re-crawl interval — is configured in the crawler admin UI, not in the K-AI Instance.

Fields available on a URL base (managed in the crawler UI):

| Field                    | Description                                                                                  |
| ------------------------ | -------------------------------------------------------------------------------------------- |
| `name`                   | URL base name (organization-scoped).                                                         |
| `description`            | Free-text description.                                                                       |
| `urls`                   | List of URLs to crawl. Importable via CSV. Each URL is crawled independently — no recursion. |
| `recrawl_interval_hours` | Cadence for re-crawling each URL. Range: 3–720.                                              |
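
The fields above could be modeled and validated as in the following sketch. The `UrlBase` class, the default interval of 24 hours, and the scheme check on URLs are assumptions for illustration; only the field names and the 3–720 range come from the table.

```python
from dataclasses import dataclass, field

MIN_RECRAWL_HOURS = 3
MAX_RECRAWL_HOURS = 720


@dataclass
class UrlBase:
    """Hypothetical in-memory model of a URL base."""

    name: str
    description: str = ""
    urls: list[str] = field(default_factory=list)
    recrawl_interval_hours: int = 24  # assumed default, not documented

    def __post_init__(self) -> None:
        if not self.name.strip():
            raise ValueError("name must not be empty")
        if not MIN_RECRAWL_HOURS <= self.recrawl_interval_hours <= MAX_RECRAWL_HOURS:
            raise ValueError(
                f"recrawl_interval_hours must be within "
                f"{MIN_RECRAWL_HOURS}-{MAX_RECRAWL_HOURS}"
            )
        for url in self.urls:
            # Assumption: only absolute http(s) URLs are accepted.
            if not url.startswith(("http://", "https://")):
                raise ValueError(f"not an http(s) URL: {url!r}")


base = UrlBase(
    name="supplier-docs",
    urls=["https://example.com/manual", "https://example.com/faq"],
    recrawl_interval_hours=72,
)
```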

## Document types ingested

Each URL produces a single document containing the extracted text content. Natively ingested content types:

* HTML pages (the primary case).
* PDFs linked from a page — fetched and routed through the [indexation pipeline](/knowledge-ai/sources-and-ingestion/indexation-pipeline.md). PDFs hosted directly at a URL (not behind an HTML wrapper) follow the same path.

Images, video, and binary media are not ingested by the crawler; references are recorded as metadata pointers.

## Sync mode

Scheduled re-crawl. Each URL has its own `recrawl_interval_hours` and is re-fetched independently. Workers pick up URLs whose last successful crawl is older than the interval.
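
The pick-up rule can be sketched as a small predicate. The function name `is_due` and the treatment of never-crawled URLs as immediately due are assumptions; the comparison itself follows the rule that a URL is re-fetched once its last successful crawl is older than its interval.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_due(last_success: Optional[datetime], interval_hours: int, now: datetime) -> bool:
    """Return True when a URL should be re-crawled."""
    if last_success is None:
        # Assumption: a URL with no successful crawl yet is always due.
        return True
    return now - last_success > timedelta(hours=interval_hours)


now = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
stale = is_due(now - timedelta(hours=5), interval_hours=3, now=now)
fresh = is_due(now - timedelta(hours=1), interval_hours=3, now=now)
never_crawled = is_due(None, interval_hours=3, now=now)
```

Because the interval is evaluated per URL, two URLs in the same base can become due at different times.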

Change detection is a content hash on the structured output. The K-AI Instance treats a hash change as a content modification and re-runs the [indexation pipeline](/knowledge-ai/sources-and-ingestion/indexation-pipeline.md). Unchanged content is a no-op.

Tombstone logic: a URL that returns `404` for several consecutive crawls is flipped to `status='gone'` and removed from the public API. A subsequent `200` brings it back automatically. A `410 Gone` response (RFC 9110) tombstones the URL immediately.
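
The tombstone rules can be read as a small state machine, sketched below. The threshold of three consecutive `404` responses, the `'active'` status name, and leaving the state untouched on other status codes are all assumptions; only `status='gone'`, the immediate `410` tombstone, and the revival on `200` come from the description above.

```python
from dataclasses import dataclass

# Hypothetical threshold: "several consecutive crawls" is not quantified.
GONE_AFTER_CONSECUTIVE_404 = 3


@dataclass(frozen=True)
class UrlState:
    status: str = "active"  # 'active' is an assumed name; 'gone' is documented
    consecutive_404: int = 0


def record_crawl(state: UrlState, http_status: int) -> UrlState:
    """Return the next state of a URL after one crawl attempt."""
    if http_status == 200:
        # A successful fetch revives the URL and resets the counter.
        return UrlState("active", 0)
    if http_status == 410:
        # 410 Gone tombstones immediately.
        return UrlState("gone", state.consecutive_404)
    if http_status == 404:
        misses = state.consecutive_404 + 1
        status = "gone" if misses >= GONE_AFTER_CONSECUTIVE_404 else state.status
        return UrlState(status, misses)
    # Assumption: other statuses (for example 5xx) leave the state unchanged.
    return state


state = UrlState()
history = []
for code in (404, 404, 404, 200, 410):
    state = record_crawl(state, code)
    history.append(state.status)
```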

## ACL handling

Web pages have no native ACL semantics — they are public by definition. The connector applies the configured **fallback group** at registration time (typically a `public_web` group readable by everyone, or a more restrictive group when the customer wants to gate access to ingested web content). The mapping is one-shot at the source level and identical for every document under the URL base.
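
A minimal sketch of the one-shot mapping: every document under a URL base receives the same fallback group. The `acl_groups` key and the `apply_fallback_group` helper are hypothetical names used only for illustration.

```python
def apply_fallback_group(documents: list[dict], fallback_group: str) -> list[dict]:
    """Attach the same fallback group to every document of a URL base."""
    # Returns new dicts so the input documents are not mutated.
    return [{**doc, "acl_groups": [fallback_group]} for doc in documents]


docs = [
    {"url": "https://example.com/a"},
    {"url": "https://example.com/b"},
]
tagged = apply_fallback_group(docs, "public_web")
```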

For sites behind basic auth or cookie sessions, the same fallback rule applies; access control is enforced upstream by the source site, and the K-AI Instance simply mirrors the resulting allowlist.

## Rate limits

Worker concurrency, per-host politeness delay, and page size cap are configurable per deployment. Pages exceeding the size cap are flagged in the crawl result and not extracted.
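
To make the per-host politeness delay and the page size cap concrete, here is a small sketch. The class and function names, the use of seconds for the delay, and measuring the cap in bytes are assumptions; the actual settings and their units depend on the deployment.

```python
from urllib.parse import urlsplit


class HostPoliteness:
    """Track the earliest time each host may be fetched again."""

    def __init__(self, delay_seconds: float) -> None:
        self.delay_seconds = delay_seconds
        self._next_allowed: dict[str, float] = {}

    def wait_time(self, url: str, now: float) -> float:
        """Seconds to wait before fetching this URL (0.0 if allowed now)."""
        host = urlsplit(url).netloc
        return max(0.0, self._next_allowed.get(host, 0.0) - now)

    def mark_fetched(self, url: str, now: float) -> None:
        """Record a fetch so later requests to the same host are delayed."""
        host = urlsplit(url).netloc
        self._next_allowed[host] = now + self.delay_seconds


def exceeds_size_cap(body: bytes, cap_bytes: int) -> bool:
    """Pages over the cap are flagged and not extracted."""
    return len(body) > cap_bytes


politeness = HostPoliteness(delay_seconds=2.0)
politeness.mark_fetched("https://a.example/page-1", now=100.0)
same_host_wait = politeness.wait_time("https://a.example/page-2", now=101.0)
other_host_wait = politeness.wait_time("https://b.example/", now=101.0)
```

The delay is keyed by host, so URLs on different hosts do not block each other.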

## Known limitations

* **JavaScript-heavy SPAs** may need extended wait windows; the adaptive wait covers most cases but some apps require manual tuning per URL base.
* **CAPTCHA / WAF** challenges (DataDome, Cloudflare, hCaptcha) cause the page to be flagged at the detect stage and exit before extraction. No CAPTCHA-solving is performed.
* **Deep recursive crawls** are not supported — each URL is crawled in isolation. To ingest a site tree, the URLs must be enumerated upfront (manual entry or CSV import).
* **Dynamic per-user content** is not supported; the crawler authenticates as one identity and ingests the resulting view.
* **LLM-refined chunks** are recorded as such in document metadata for downstream audit.
* **Cost** — the refine stage triggers LLM calls, billed per organization (see [Monitoring](/knowledge-ai/operate/monitoring.md#cost-events)). High-volume bases should be sized accordingly.

## Setup walkthrough

1. **Sign in to the crawler admin UI** at `https://crawler.kai-studio.ai` with a K-AI account in the target organization.
2. **Create a URL base**: name, description, recrawl interval. Add seed URLs manually or via CSV import.
3. **Wait for the initial crawl** to complete (the worker picks up new URLs on its next poll). Inspect crawl results per URL in the admin UI to confirm extraction quality.
4. **Generate a public API key** for the URL base (one key per base; rotatable, revocable). Capture the token displayed once at generation.
5. **Add the source in the K-AI admin portal**: select Web crawler, paste the `api_key`. The connector calls `GET /api/public/base` to resolve the base ID and confirm credentials.
6. **Trigger a test sync** from the K-AI Instance. Confirm the document count via the [Documents endpoint](/knowledge-ai/sources-and-ingestion/instance-api/documents.md) of the K-AI Instance.

