Document state machine
Every document ingested by K-AI moves through a small set of states from registration to availability. State is observable via the Documents endpoints of the Instance API. State transitions are durable and idempotent, so retries are safe.
Three principal states — INITIAL_SAVED, ON_CONTENT_EXTRACT, INDEXED — are what a Consumer or Steward encounters. A handful of finer-grained transient states exist between extraction and graph write; they appear in the API responses but are not surfaced as user-facing milestones.
INITIAL_SAVED
The document has been registered with its source metadata: source URL, ACLs (mirrored from the upstream Document Repository), MIME type, last-modified date, and Owner / Authority pointers when known. Content has not yet been extracted.
Entering this state queues the document for indexation.
ON_CONTENT_EXTRACT
Extraction is in progress: content is parsed (per MIME type) and semantically structured, and embeddings are computed.
In this state the document is searchable on its metadata, but its content does not yet contribute to vector search or to the Neural Semantic Graph.
INDEXED
Embeddings have been written to the per-instance vector store. Semantic graph nodes (concept, subject, actor, dependency) have been written to the instance's database. The document is fully available to Consumers — both for vector search and for Neural Semantic Graph traversal.
INDEXED is the only terminal state for a healthy document. A re-indexation request moves the document back to INITIAL_SAVED and re-runs the pipeline; the result is deterministic for an unchanged source document.
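The lifecycle described so far can be sketched as a small transition table. This is an illustrative model only, not K-AI's implementation: the `DocState` enum, the `ALLOWED` table, and `can_transition` are hypothetical names, the finer-grained transient states are collapsed into a direct ON_CONTENT_EXTRACT to INDEXED step, and the assumption that a retried PARSING_ERROR document (see Failures & retries) restarts at INITIAL_SAVED is not confirmed by this page.

```python
from enum import Enum


class DocState(Enum):
    INITIAL_SAVED = "INITIAL_SAVED"
    ON_CONTENT_EXTRACT = "ON_CONTENT_EXTRACT"
    INDEXED = "INDEXED"
    PARSING_ERROR = "PARSING_ERROR"


# Simplified transition table. Transient states between extraction and
# graph write are omitted, so this is coarser than what the API reports.
ALLOWED = {
    DocState.INITIAL_SAVED: {DocState.ON_CONTENT_EXTRACT},
    DocState.ON_CONTENT_EXTRACT: {DocState.INDEXED, DocState.PARSING_ERROR},
    # A re-indexation request moves an indexed document back to the start.
    DocState.INDEXED: {DocState.INITIAL_SAVED},
    # Assumption: a retry re-queues the document from INITIAL_SAVED.
    DocState.PARSING_ERROR: {DocState.INITIAL_SAVED},
}


def can_transition(src: DocState, dst: DocState) -> bool:
    """Return True if the simplified model allows moving from src to dst."""
    return dst in ALLOWED[src]


def contributes_to_vector_search(state: DocState) -> bool:
    """Only INDEXED documents contribute content to vector search."""
    return state is DocState.INDEXED
```

A client could use such a table to sanity-check the states it observes while polling, for example flagging a document that appears to jump from INITIAL_SAVED straight to INDEXED between two polls as having passed through states it did not see.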
Failures & retries
Failures during extraction transition the document to PARSING_ERROR. The most common causes are: source unreachable at fetch time (expired credentials, deleted file, permission revoked upstream), unsupported or corrupted file format, or extraction timeout on very large files.
Documents in PARSING_ERROR are visible via POST /api/documents/list-docs with a state filter. To re-queue, call POST /api/orchestrator/retry-documents-parsing-error (which re-queues every document currently in error) or POST /api/orchestrator/reindex-document with a specific document id. Both calls are idempotent: re-indexing the same document produces the same vectors and graph nodes, unless the content has changed upstream.
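The two re-queue calls might be prepared as follows. This is a minimal sketch under stated assumptions: the base URL is a placeholder, the `document_id` body field is a hypothetical name (the real request schema is in the Instance API reference), and authentication headers are omitted entirely. The functions only build the requests; nothing is sent.

```python
import json
from urllib.request import Request

BASE_URL = "https://kai.example.com"  # placeholder, not a real instance URL


def build_retry_all_request(base_url: str = BASE_URL) -> Request:
    """Build the call that re-queues every document currently in PARSING_ERROR."""
    return Request(
        url=f"{base_url}/api/orchestrator/retry-documents-parsing-error",
        data=b"{}",  # assumption: an empty JSON body is accepted
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_reindex_request(document_id: str, base_url: str = BASE_URL) -> Request:
    """Build the call that re-indexes one specific document."""
    # "document_id" is a hypothetical field name, used for illustration only.
    body = json.dumps({"document_id": document_id}).encode("utf-8")
    return Request(
        url=f"{base_url}/api/orchestrator/reindex-document",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Either request could then be sent with `urllib.request.urlopen(request)` once the instance's real URL, credentials, and body schema are filled in. Because both endpoints are idempotent, a client that times out can send the same request again without first checking whether the earlier attempt landed.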
Observing state
A Steward or Engineer (DocOps) tracking indexation progress at scale should:
Poll POST /api/documents/list-docs with a state filter to enumerate documents in a given state.
Call POST /api/documents/count-documents (optionally with a state filter) to track aggregate progress without paginating the full list.
Call POST /api/orchestrator/count-back-tasks to see queue pressure on the indexation pipeline.
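A progress loop built on the count endpoint could look like the sketch below. It assumes a caller-supplied `fetch_counts` function (hypothetical) that calls POST /api/documents/count-documents once without a filter and once each for INDEXED and PARSING_ERROR, and returns the three numbers in a dictionary. The dictionary keys, the function names, and the polling interval are all illustrative choices, not part of the API.

```python
import time
from typing import Callable, Dict

Counts = Dict[str, int]


def in_flight(counts: Counts) -> int:
    """Documents that are neither INDEXED nor in PARSING_ERROR.

    Counting by subtraction also covers the transient states between
    extraction and graph write, which are not enumerated here.
    """
    settled = counts.get("INDEXED", 0) + counts.get("PARSING_ERROR", 0)
    return counts.get("total", 0) - settled


def poll_until_settled(
    fetch_counts: Callable[[], Counts],
    interval_s: float = 30.0,
    max_polls: int = 120,
    sleep: Callable[[float], None] = time.sleep,
) -> Counts:
    """Poll until no document is in flight, or until max_polls is reached."""
    counts = fetch_counts()
    for _ in range(max_polls - 1):
        if in_flight(counts) == 0:
            break
        sleep(interval_s)
        counts = fetch_counts()
    return counts
```

Passing `fetch_counts` and `sleep` as parameters keeps the loop independent of any HTTP client and lets it be exercised with canned data. A non-zero PARSING_ERROR count in the final snapshot is the signal to inspect those documents and re-queue them.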
See Instance API — Documents for full request and response schemas.