Indexation pipeline
When a document is sent to K-AI, the platform extracts its content, makes it searchable, and prepares it for the Neural Semantic Graph. The customer sees the document move from INITIAL_SAVED to INDEXED (ON_CONTENT_EXTRACT may be observed along the way); everything else in between is managed by the platform.
What the pipeline does
Receives the document via the Orchestrator endpoints of the Instance API.
Extracts content from the major document families: PDF, Word, spreadsheets, presentations, email, images, HTML and plain text. Structural cues (headings, tables, page numbers) are preserved.
Indexes the extracted content into a per-instance vector index and into the Neural Semantic Graph. Vectors and graph nodes are scoped to a single K-AI instance — see Security & isolation.
Stores the original file in object storage so consumers can retrieve it on demand.
Records a cost event for the billing system.
What the customer sees
A document state observable via the Instance API Documents endpoints. States visible to integrators: INITIAL_SAVED, ON_CONTENT_EXTRACT, INDEXED, plus an error state PARSING_ERROR for retries.
A per-document deterministic outcome: re-indexing the same source produces the same result. Re-indexation requests are addressed through the Orchestrator (reindex-document, differential-indexation, retry-documents-parsing-error).
A per-instance isolation guarantee: vectors from one customer are never shared with, nor comparable across, another customer's instance.
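Because the document state is the main signal an integrator can observe, a client will typically poll it until indexation finishes. The following is a minimal sketch of such a loop, not the platform's client: `wait_for_indexation`, `fetch_state`, and the timing parameters are hypothetical names introduced for illustration, and the caller is assumed to supply `fetch_state` as a function that reads the current state string from the Instance API Documents endpoints. Only the four state names come from this page.

```python
import time
from enum import Enum
from typing import Callable


class DocumentState(str, Enum):
    """States this page lists as visible to integrators."""
    INITIAL_SAVED = "INITIAL_SAVED"
    ON_CONTENT_EXTRACT = "ON_CONTENT_EXTRACT"
    INDEXED = "INDEXED"
    PARSING_ERROR = "PARSING_ERROR"


# Polling stops on success or on the retryable error state.
TERMINAL_STATES = {DocumentState.INDEXED, DocumentState.PARSING_ERROR}


def wait_for_indexation(
    fetch_state: Callable[[], str],
    poll_interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> DocumentState:
    """Poll until the document is INDEXED or PARSING_ERROR.

    fetch_state is a hypothetical, caller-supplied function returning the
    current state string. Raises TimeoutError if no terminal state is
    reached in time, and ValueError if an unknown state string is returned.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = DocumentState(fetch_state())
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"document still in {state.value} after {timeout_s}s"
            )
        sleep(poll_interval_s)
```

A caller that gets `DocumentState.PARSING_ERROR` back would then decide whether to request a retry through the Orchestrator; the request itself is left out here because its shape is not described on this page.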
Where vectors land
Where vectors are stored depends on the deployment model:
SaaS: a managed vector index, one per K-AI instance.
Snowflake Native App: VECTOR columns inside the customer's own Snowflake account. Data never leaves Snowflake.
On-premise: a bundled vector index running inside the customer's Kubernetes cluster. See On-premise installation.
Next steps
Document state machine — what each state means and how to observe transitions.
Instance API — Orchestrator — endpoints to drive indexation.
Operate — Monitoring — observability for ingestion at scale.