# Scaling & quotas

K-AI is designed to scale horizontally without customer-side tuning. This page summarises what scales automatically, what is physically isolated, and which quotas you can configure.

## How K-AI scales

* **Stateless services scale horizontally.** The management plane, the audit and retrieval services, the authentication service, the web crawler, and the billing surface are stateless and run as multi-replica deployments. There are no sticky sessions; every request carries its own identity, so any replica can serve it.
* **LLM serving is tier-managed.** K-AI operates the LLM service with automatic failover between an on-cluster primary, an on-cluster fallback, and a managed endpoint of last resort. Customers do not configure tiers; the service routes requests transparently.
* **Indexation is ephemeral per document.** Each document is parsed in its own short-lived workload, then released. A malformed file cannot poison a shared worker, and concurrency scales with available capacity rather than a fixed pool.
* **Per-instance physical isolation prevents noisy-neighbour effects.** Each K-AI Instance has its own compute, its own vector index, and its own object storage. A bulk re-ingest on one instance cannot block queries on another.
* **Bulk-ingestion throughput is governed centrally.** For large initial backfills (multi-million-document estates), K-AI raises ingestion capacity for the duration of the backfill and returns to baseline afterwards. Coordinate timing with your account team.
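
The tiered failover described above can be pictured with a small sketch. This is an illustration only, not K-AI's implementation: the function `route_with_failover`, the tier callables, and the rule "try each tier in order and move on when one raises" are assumptions made for the example. The real routing logic is internal to the service and is not exposed to customers.

```python
from typing import Callable, Sequence


class AllTiersFailed(Exception):
    """Raised when every tier in the chain has failed."""


def route_with_failover(
    prompt: str,
    tiers: Sequence[tuple[str, Callable[[str], str]]],
) -> tuple[str, str]:
    """Try each tier in order; return (tier_name, answer) from the first that succeeds."""
    errors = []
    for name, call in tiers:
        try:
            return name, call(prompt)
        except Exception as exc:  # hypothetical: any error triggers failover
            errors.append(f"{name}: {exc}")
    raise AllTiersFailed("; ".join(errors))


# Hypothetical stand-ins for the three tiers named in the text.
def primary(prompt: str) -> str:
    raise ConnectionError("primary unavailable")


def fallback(prompt: str) -> str:
    return f"fallback answered: {prompt}"


def managed(prompt: str) -> str:
    return f"managed answered: {prompt}"


TIERS = [
    ("on-cluster primary", primary),
    ("on-cluster fallback", fallback),
    ("managed endpoint", managed),
]

tier_used, answer = route_with_failover("hello", TIERS)
print(tier_used, "->", answer)
```

In this sketch the caller sees only the answer and, optionally, which tier served it. That mirrors the point in the list above that customers do not select tiers themselves.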

## Quotas you can configure

K-AI enforces quotas at three levels. SaaS deployments come pre-configured for the typical mid-to-large customer envelope; on-premise and Snowflake Native App deployments inherit the same defaults and can be tuned per environment.

| Level                        | What it controls                                                                                                   | How to change it                                                     |
| ---------------------------- | ------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------- |
| **Per-organisation KCU cap** | Total monthly consumption across all of an org's instances and cost types.                                         | Configurable on request via your account team.                       |
| **LLM rate limits**          | Per API key (Instance API), per user (Retrieval / Audit), per IP (auth surfaces). Protects against runaway agents. | Defaults are sized for production workloads; adjustments on request. |
| **Vision rate limits**       | Per-key call rate and image-size cap for multimodal requests.                                                      | Defaults are sized for production workloads; adjustments on request. |

Quota events surface in the K-AI Consumption Monitoring System as standard cost events, so your dashboards reflect them without extra configuration.
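
To make the idea of a per-key rate limit concrete, here is a minimal token-bucket sketch. The page does not say which algorithm K-AI uses, so the token bucket is a common technique swapped in for illustration. The class names `TokenBucket` and `PerKeyLimiter` and the rate and burst values are hypothetical. The same structure would apply to a per-user or per-IP limit by changing what is used as the key.

```python
class TokenBucket:
    """Token bucket: refills at `rate_per_sec`, holds at most `capacity` tokens."""

    def __init__(self, rate_per_sec: float, capacity: float) -> None:
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity  # start full so an initial burst is allowed
        self.last = 0.0         # timestamp of the previous call, in seconds

    def allow(self, now: float) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate_per_sec)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class PerKeyLimiter:
    """Keeps one independent bucket per key (API key, user ID, or IP address)."""

    def __init__(self, rate_per_sec: float, capacity: float) -> None:
        self.rate_per_sec = rate_per_sec
        self.capacity = capacity
        self.buckets: dict[str, TokenBucket] = {}

    def allow(self, key: str, now: float) -> bool:
        if key not in self.buckets:
            self.buckets[key] = TokenBucket(self.rate_per_sec, self.capacity)
        return self.buckets[key].allow(now)


# Hypothetical limits: 1 request per second with a burst of 2, per key.
limiter = PerKeyLimiter(rate_per_sec=1.0, capacity=2.0)
print([limiter.allow("key-a", now=0.0) for _ in range(3)])  # third call is rejected
print(limiter.allow("key-b", now=0.0))  # a different key has its own bucket
```

Because each key owns its bucket, one runaway agent exhausts only its own allowance. Passing the timestamp explicitly keeps the sketch deterministic; a real limiter would read a monotonic clock.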

## What this means in practice

For customers, scaling is transparent in the steady state. The cases that benefit from coordination with K-AI are:

* A planned backfill of more than a few hundred thousand documents.
* An expected step-change in MCP traffic (new agent rollout, new copilot deployment).
* A change to your monthly KCU envelope.

For all of these, contact your account team in advance so the platform team can size accordingly.

