Scaling & quotas
K-AI is designed to scale horizontally without customer-side tuning. This page summarises what scales automatically, what is physically isolated, and which quotas you can configure.
How K-AI scales
Stateless services scale horizontally. The management plane, the audit and retrieval services, the authentication service, the web crawler, and the billing surface are stateless and run as multi-replica deployments. There are no sticky sessions: identity travels with every request, so any replica can serve any call.
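To illustrate what "identity travels with every request" can look like, here is a minimal sketch of a stateless handler. It is not K-AI's actual mechanism, which this page does not describe. The signing key, the token layout, and the names `sign_identity` and `handle_request` are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical shared signing key; a real deployment would load this from a secret store.
SECRET = b"demo-secret"


def sign_identity(user_id: str) -> str:
    """Return a 'user_id.signature' token that any replica can verify on its own."""
    sig = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    return f"{user_id}.{sig}"


def handle_request(token: str, payload: str) -> str:
    """Stateless handler: no session lookup, the request carries everything needed."""
    user_id, _, sig = token.rpartition(".")
    expected = hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how much of the signature matched.
    if not hmac.compare_digest(sig, expected):
        raise PermissionError("invalid identity token")
    return f"{user_id}:{payload}"
```

Because the handler keeps no per-user state between calls, two consecutive requests from the same user can land on different replicas without any session affinity.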
LLM serving is tier-managed. K-AI operates the LLM service with automatic failover between an on-cluster primary, an on-cluster fallback, and a managed endpoint of last resort. Customers do not configure tiers; the service routes requests transparently.
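The tiered failover described above can be pictured with a short sketch. This is an illustration of ordered failover in general, not K-AI's routing logic. The function `route`, the exception `AllTiersFailed`, and the stand-in backends are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple

Backend = Callable[[str], str]


class AllTiersFailed(RuntimeError):
    """Raised when every tier, including the last resort, has failed."""


def route(prompt: str, tiers: Sequence[Tuple[str, Backend]]) -> Tuple[str, str]:
    """Try each tier in order and return (tier_name, response) from the first success."""
    errors: List[str] = []
    for name, call in tiers:
        try:
            return name, call(prompt)
        except Exception as exc:  # any failure moves on to the next tier
            errors.append(f"{name}: {exc}")
    raise AllTiersFailed("; ".join(errors))


# Hypothetical stand-ins for the three tiers.
def primary_down(prompt: str) -> str:
    raise ConnectionError("primary unavailable")


def fallback_ok(prompt: str) -> str:
    return prompt.upper()


def managed_ok(prompt: str) -> str:
    return prompt.lower()


TIERS = [("primary", primary_down), ("fallback", fallback_ok), ("managed", managed_ok)]
```

With these stand-ins, `route("Hello", TIERS)` skips the failing primary and is answered by the fallback. The caller passes only a prompt and never names a tier, which mirrors the transparency described above.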
Indexation is ephemeral per document. Each document is parsed in its own short-lived workload, then released. A malformed file cannot poison a shared worker, and concurrency scales with available capacity rather than a fixed pool.
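As a rough sketch of the per-document isolation idea, the example below parses each document in its own short-lived child process, so a failure is confined to that document. K-AI's actual workload mechanism is not specified on this page. The toy parser, `CHILD_CODE`, and `parse_in_isolation` are assumptions made for illustration.

```python
import subprocess
import sys
from typing import Optional

# Hypothetical parser that runs inside the child process.
# It counts words and exits with a non-zero status on empty (malformed) input.
CHILD_CODE = """
import sys
data = sys.stdin.read()
if not data.strip():
    sys.exit(1)
print(len(data.split()))
"""


def parse_in_isolation(document: str, timeout: float = 10.0) -> Optional[int]:
    """Parse one document in a fresh process; return None if the child fails or hangs."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", CHILD_CODE],
            input=document,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None  # the hung child is killed; the parent keeps running
    if result.returncode != 0:
        return None  # a malformed document only fails its own process
    return int(result.stdout.strip())
```

Each call starts and discards its own process, so a bad document leaves nothing behind for the next one to trip over.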
Per-instance physical isolation prevents noisy-neighbour effects. Each K-AI Instance has its own compute, its own vector index, and its own object storage. A bulk re-ingest on one instance cannot block queries on another.
Bulk-ingestion throughput is governed centrally. For large initial backfills (multi-million-document estates), K-AI raises ingestion capacity for the duration of the backfill and returns to baseline afterwards. Coordinate timing with your account team.
Quotas you can configure
K-AI enforces quotas at three levels. SaaS deployments come pre-configured for the typical mid-to-large customer envelope; on-premise and Snowflake Native App deployments inherit the same defaults and can be tuned per environment.
Per-organisation KCU cap: Total monthly consumption across all of an org's instances and cost types. Configurable on request via your account team.
LLM rate limits: Per API key (Instance API), per user (Retrieval / Audit), per IP (auth surfaces). Protects against runaway agents. Defaults are sized for production workloads; adjustments on request.
Vision rate limits: Per-key call rate and image-size cap for multimodal requests. Defaults are sized for production workloads; adjustments on request.
Quota events surface in PICSOU as standard cost events — your dashboards reflect them without extra configuration.
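For readers who want a mental model of per-key rate limiting, the sketch below uses a token bucket, a common technique chosen here for illustration. This page does not state which algorithm K-AI uses. The class names and the rate and capacity values are hypothetical.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    """Allows bursts up to `capacity` and refills at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float, now: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.updated = now

    def allow(self, now: float) -> bool:
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class KeyedLimiter:
    """One bucket per key (API key, user, or IP), created on first use."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, key: str, now: Optional[float] = None) -> bool:
        if now is None:
            now = time.monotonic()
        bucket = self.buckets.get(key)
        if bucket is None:
            bucket = self.buckets[key] = TokenBucket(self.rate, self.capacity, now)
        return bucket.allow(now)
```

Keeping a separate bucket per key means one runaway agent exhausts only its own allowance, while other keys are unaffected.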
What this means in practice
For customers, scaling is transparent in the steady state. The cases that benefit from coordination with K-AI are:
A planned backfill of more than a few hundred thousand documents.
An expected step-change in MCP traffic (new agent rollout, new copilot deployment).
A change to your monthly KCU envelope.
For all of these, contact your account team in advance so the platform team can size accordingly.