# Snowflake Stage

K-AI ingests documents from a Snowflake Stage via the Snowflake Python connector. The connector authenticates with a Snowflake user and role, lists the files in an internal or external stage, and syncs incrementally based on each file's stage metadata. It is the natural ingest path for **K-AI Snowflake Native App** deployments, where customer documents live in Stages by design rather than in an object store outside the data warehouse.

## Supported source versions

* **Snowflake** — all currently supported Snowflake editions (Standard, Enterprise, Business Critical, VPS). The connector targets the standard SQL surface and works against any region or cloud provider Snowflake operates on (AWS, Azure, GCP).
* **Stage types** — both **internal stages** (Snowflake-managed storage) and **external stages** (S3, Azure Blob, GCS backing) are supported through the same `LIST @stage` interface.
* **Connection modes** — direct user/password (`EXTERNAL` type) for off-Snowflake K-AI deployments, and Snowpark Container Services OAuth token (`INTERNAL` type) for K-AI running inside Snowflake as a Native App.

## Authentication

Two modes, selected by the `type` field:

* **`EXTERNAL`** — Snowflake username + password, scoped to a Snowflake role. K-AI opens a standard `snowflake.connector.connect()` session with `(user, password, account, role, database, schema)`. Use this when K-AI runs outside the Snowflake account that hosts the data.
* **`INTERNAL`** — Snowpark Container Services session token. K-AI reads the token from `/snowflake/session/token` (injected by SPCS into the container), and connects with `authenticator='oauth'` against the SPCS-provided host. Use this when K-AI runs as a Snowflake Native App inside the customer's account — no shared credentials are exchanged.

The role supplied (or the SPCS service role) must hold at minimum:

* `USAGE` on the database and schema hosting the stage.
* `READ` on the stage.
* `USAGE` on a warehouse if the connector resolves stage metadata that requires compute.

Key-pair authentication is not exposed in the current connector.
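
The two modes differ only in the arguments handed to `snowflake.connector.connect()`. The following is a minimal sketch of that selection, not the connector's actual code: `build_connect_kwargs` and `read_spcs_token` are hypothetical helpers, and the sketch assumes that the SPCS-provided host is passed in by the caller (SPCS typically exposes it as the `SNOWFLAKE_HOST` environment variable) and that the role is left to the service identity in `INTERNAL` mode.

```python
from pathlib import Path
from typing import Optional

# Path taken from the text above; SPCS injects the token file into the container.
SPCS_TOKEN_PATH = Path("/snowflake/session/token")


def read_spcs_token(path: Path = SPCS_TOKEN_PATH) -> str:
    """Read the session token that SPCS mounts into the container."""
    return path.read_text().strip()


def build_connect_kwargs(cfg: dict, token: Optional[str] = None,
                         host: Optional[str] = None) -> dict:
    """Return keyword arguments for snowflake.connector.connect().

    `cfg` uses the field names from the Configuration table.
    """
    common = {
        "account": cfg["account"],
        "database": cfg["database"],
        "schema": cfg["schema_"],  # the config field carries a trailing underscore
    }
    if cfg["type"] == "EXTERNAL":
        return {**common, "user": cfg["user"], "password": cfg["password"],
                "role": cfg["role"]}
    if cfg["type"] == "INTERNAL":
        if not token or not host:
            raise ValueError("INTERNAL mode needs the SPCS token and host")
        # Assumption: the session runs under the SPCS service role,
        # so no user, password, or role is sent.
        return {**common, "host": host, "authenticator": "oauth", "token": token}
    raise ValueError(f"unknown connection type: {cfg['type']!r}")
```

Inside a container the call would look like `build_connect_kwargs(cfg, token=read_spcs_token(), host=os.environ["SNOWFLAKE_HOST"])`, with the result unpacked into `snowflake.connector.connect(**kwargs)`.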

## Configuration

| Field        | Type                           | Required | Description                                                                                                                   |
| ------------ | ------------------------------ | -------- | ----------------------------------------------------------------------------------------------------------------------------- |
| `type`       | enum (`EXTERNAL` / `INTERNAL`) | yes      | Connection mode. `EXTERNAL` uses user/password; `INTERNAL` uses the SPCS session token.                                       |
| `account`    | string                         | yes      | Snowflake account identifier, e.g. `acme-prod` (the part before `.snowflakecomputing.com`).                                   |
| `user`       | string                         | yes      | Snowflake user name. For `INTERNAL` mode this is the service account configured on the SPCS service.                          |
| `password`   | string                         | yes      | Snowflake password. Encrypted at rest by the platform. Ignored in `INTERNAL` mode, but the credential check still requires a non-empty value. |
| `role`       | string                         | yes      | Snowflake role used by the session.                                                                                           |
| `database`   | string                         | yes      | Database that hosts the stage.                                                                                                |
| `schema_`    | string                         | yes      | Schema that hosts the stage (note the trailing underscore — `schema` is a reserved name in some serialisers).                 |
| `stage_name` | string                         | yes      | Name of the stage to ingest, e.g. `kai_documents` for `@kai_documents`.                                                       |

The connector derives the Snowflake host (`{account}.snowflakecomputing.com`) and registers it in the K-AI host allow-list.
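
As an illustration of the field set and the host derivation, here is a small sketch of a typed configuration object. The class name `SnowflakeStageConfig` and its `host` and `stage_ref` properties are hypothetical; only the field names and the `{account}.snowflakecomputing.com` pattern come from this page.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class SnowflakeStageConfig:
    type: str        # "EXTERNAL" or "INTERNAL"
    account: str
    user: str
    password: str
    role: str
    database: str
    schema_: str     # trailing underscore, as in the table above
    stage_name: str

    def __post_init__(self) -> None:
        if self.type not in ("EXTERNAL", "INTERNAL"):
            raise ValueError(f"type must be EXTERNAL or INTERNAL, got {self.type!r}")
        # Every field is required, including password in INTERNAL mode.
        for f in fields(self):
            if not getattr(self, f.name):
                raise ValueError(f"{f.name} must not be empty")

    @property
    def host(self) -> str:
        """Host derived from the account identifier."""
        return f"{self.account}.snowflakecomputing.com"

    @property
    def stage_ref(self) -> str:
        """Stage reference as used in LIST and GET statements."""
        return f"@{self.stage_name}"
```

For example, a config with `account="acme-prod"` and `stage_name="kai_documents"` yields the host `acme-prod.snowflakecomputing.com` and the stage reference `@kai_documents`.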

## Document types ingested

Anything the [indexation pipeline](/knowledge-ai/sources-and-ingestion/indexation-pipeline.md) can extract from a Stage file:

* PDF (`.pdf`)
* Word (`.docx`, `.doc`)
* Excel (`.xlsx`, `.xls`, `.csv`)
* PowerPoint (`.pptx`, `.ppt`)
* Plain text and Markdown (`.txt`, `.md`)
* JSON, XML, and HTML
* Images (`.png`, `.jpg`, `.tif`) — OCR via the indexation pipeline.

Snowflake-native file formats used for table loading (Parquet, Avro, ORC) are listed but not interpreted as document content; they are skipped during extraction.
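
The split between extracted and skipped files can be pictured as a lookup on the file extension. This is a sketch only: the function name `classify` is hypothetical, the `.json`, `.xml`, `.html`, `.parquet`, `.avro`, and `.orc` extensions are assumed from the format names above, and the `"unknown"` result for unlisted extensions is an assumption rather than documented behaviour.

```python
from pathlib import PurePosixPath

# Extensions listed in this section (those for JSON, XML, HTML are assumed).
DOCUMENT_EXTENSIONS = {
    ".pdf", ".docx", ".doc", ".xlsx", ".xls", ".csv", ".pptx", ".ppt",
    ".txt", ".md", ".json", ".xml", ".html", ".png", ".jpg", ".tif",
}
# Table-loading formats: listed by the connector but never extracted.
TABLE_LOAD_EXTENSIONS = {".parquet", ".avro", ".orc"}


def classify(stage_path: str) -> str:
    """Return "extract", "skip", or "unknown" for a file name from LIST."""
    ext = PurePosixPath(stage_path).suffix.lower()
    if ext in DOCUMENT_EXTENSIONS:
        return "extract"
    if ext in TABLE_LOAD_EXTENSIONS:
        return "skip"
    return "unknown"
```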

## Sync mode

The connector runs `LIST @{stage_name}` to enumerate files. Each row carries the file `name`, `size`, `md5`, and `last_modified`. Incremental sync is **MD5-based with a last-modified-time (LMT) fallback**: a file's MD5 (or, when no MD5 is available, its `last_modified` value) is compared against the cursor persisted from the previous run. New or changed files are downloaded via `GET @stage/file ...` (external mode) or read through the SPCS file system mount (internal mode).

Deletions are detected when a file present in the previous snapshot disappears from the listing. The Stage interface does not expose a delete event stream — the connector relies on the full listing every cycle.

The first sync is a full crawl. Subsequent syncs download only new files and files whose MD5 (or `last_modified`, when no MD5 is available) has changed.
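
The comparison described in this section can be sketched as a diff between the persisted cursor and the current listing. The names `StageFile`, `fingerprint`, and `diff_listing`, and the cursor shape (a mapping from file name to fingerprint), are hypothetical; the sketch only illustrates the MD5-first, `last_modified`-fallback rule and snapshot-based deletion detection.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class StageFile(NamedTuple):
    name: str
    size: int
    md5: Optional[str]
    last_modified: str


def fingerprint(f: StageFile) -> Tuple[str, str]:
    """Prefer the MD5; fall back to last_modified when it is missing."""
    return ("md5", f.md5) if f.md5 else ("lmt", f.last_modified)


def diff_listing(previous: Dict[str, Tuple[str, str]], current: List[StageFile]):
    """Compare the current LIST output with the cursor from the last run.

    Returns (new, changed, deleted, next_cursor).
    """
    new, changed = [], []
    next_cursor: Dict[str, Tuple[str, str]] = {}
    for f in current:
        fp = fingerprint(f)
        next_cursor[f.name] = fp
        if f.name not in previous:
            new.append(f.name)
        elif previous[f.name] != fp:
            changed.append(f.name)
    # A file that was in the previous snapshot but is no longer listed is deleted.
    deleted = sorted(set(previous) - set(next_cursor))
    return new, changed, deleted, next_cursor
```

On the first sync the previous cursor is empty, so every file is reported as new, which matches the full crawl described above.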

## ACL handling

Snowflake's access control sits at the stage level — there is no per-file ACL inside a Stage. The connector treats the entire stage as a single access boundary and applies the configured **mapped** ACL strategy: a group rule attached at source registration time is propagated to every document ingested from this stage.

For tenant-aware access control, customers should partition documents into separate stages (one per tenant or business unit), register one K-AI source per stage, and apply distinct group rules. The Snowflake role that K-AI uses must hold `READ` on the stage; access to the underlying data is governed by the same role-based grants that govern any other Snowflake consumer.

At retrieval time, [K-AI MCP](/knowledge-ai/k-ai-mcp/mcp.md) replays the mapped ACL against the calling identity.
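
A minimal sketch of the mapped strategy follows, assuming documents are plain dictionaries and the group rule is a single group name. The function names and the `allowed_groups` field are hypothetical and do not reflect the K-AI data model.

```python
from typing import Iterable, List


def apply_mapped_acl(documents: Iterable[dict], group_rule: str) -> List[dict]:
    """Stamp every document from the stage with the source's group rule."""
    return [{**doc, "allowed_groups": [group_rule]} for doc in documents]


def visible_documents(documents: Iterable[dict], caller_groups: Iterable[str]) -> List[dict]:
    """Keep only documents whose groups intersect the caller's groups."""
    caller = set(caller_groups)
    return [doc for doc in documents if caller & set(doc["allowed_groups"])]
```

With one source per stage, two tenants end up with disjoint `allowed_groups` values, so a caller in one tenant's group never sees the other tenant's documents.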

## Rate limits & throttling

Snowflake throttles by warehouse concurrency rather than by HTTP requests per second. The connector keeps a single warehouse session open per sync cycle, paginates `LIST` results through the connector cursor, and caps file downloads at 8 concurrent fetches by default. For external stages, the backing object store (S3, Azure Blob, or GCS) imposes its own rate limits, which the connector respects through the same back-off curve used by the [Azure Blob Storage](/knowledge-ai/sources-and-ingestion/connectors/azure-blob.md) connector.
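
Bounded concurrency with retry can be sketched with a thread pool. The exact back-off curve is not specified on this page, so the exponential delay below is an assumption, as are the helper names `fetch_with_backoff` and `download_all` and the caller-supplied `fetch` callable. A real implementation would retry only on throttling errors rather than on every exception.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable


def fetch_with_backoff(fetch: Callable[[str], bytes], name: str, retries: int = 4,
                       base_delay: float = 0.5,
                       sleep: Callable[[float], None] = time.sleep):
    """Call fetch(name), retrying with exponentially growing delays."""
    for attempt in range(retries + 1):
        try:
            return fetch(name)
        except Exception:
            if attempt == retries:
                raise  # give up after the last retry
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, 4s by default


def download_all(names: Iterable[str], fetch: Callable[[str], bytes],
                 max_workers: int = 8,
                 sleep: Callable[[float], None] = time.sleep) -> Dict[str, bytes]:
    """Fetch all files with at most max_workers downloads in flight."""
    names = list(names)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(lambda n: fetch_with_backoff(fetch, n, sleep=sleep), names)
        return dict(zip(names, results))
```

The pool size of 8 mirrors the default stated above; lowering it reduces pressure on both the warehouse and the backing object store.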

## Known limitations

* **No native per-file ACL** — access is governed at the stage level only. Partition stages or use group rules to scope access.
* **No key-pair authentication** in the current connector — password or SPCS OAuth token only.
* **Stages with > 10,000 files** require a sustained `LIST` cursor; the initial crawl must complete in one run or be resumed from the last checkpoint.
* **External stage credentials** (S3 IAM role, Azure SAS, GCS service account) are managed by Snowflake — K-AI never sees the underlying storage credentials.
* **Snowflake table data** is not ingested by this connector; only files inside Stages. To bring tabular data into K-AI, expose it through a [Generic HTTP](/knowledge-ai/sources-and-ingestion/connectors/generic-http.md) driver.
* **File formats used for COPY INTO** (Parquet, Avro, ORC) are listed but not extracted as documents.

## Setup walkthrough

1. **Create a Snowflake stage** in the chosen database/schema:

   ```sql
   CREATE STAGE acme_db.public.kai_documents
     FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'AUTO');
   ```

   Or attach an external stage backed by S3/Azure Blob/GCS.
2. **Create a Snowflake role and user** for K-AI:

   ```sql
   CREATE ROLE kai_reader;
   GRANT USAGE ON DATABASE acme_db TO ROLE kai_reader;
   GRANT USAGE ON SCHEMA acme_db.public TO ROLE kai_reader;
   GRANT READ ON STAGE acme_db.public.kai_documents TO ROLE kai_reader;
   CREATE USER kai_svc PASSWORD = '...' DEFAULT_ROLE = kai_reader;
   GRANT ROLE kai_reader TO USER kai_svc;
   ```

   For SPCS deployments, attach `kai_reader` to the service identity instead of creating a password user.
3. **Add the source in the K-AI admin portal**: select Snowflake Stage, set `type` (`EXTERNAL` or `INTERNAL`), fill `account`, `user`, `password`, `role`, `database`, `schema_`, and `stage_name`. Set the sync schedule.
4. **Assign a group rule** for ACL mapping — every document ingested from this stage will inherit that group on the K-AI side.
5. **Trigger a test sync** and confirm the document count via the [Documents endpoint](/knowledge-ai/sources-and-ingestion/instance-api/documents.md) of the K-AI Instance.
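
Before triggering the first sync, it can help to confirm that the new role can actually list the stage. The sketch below takes any DB-API style cursor (for example one obtained from `snowflake.connector.connect(...).cursor()`); `list_stage` is a hypothetical helper, and the column order follows the output of Snowflake's `LIST` command.

```python
import re

# Only plain unquoted identifiers are accepted, to avoid building unsafe SQL.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_$]*$")

LIST_COLUMNS = ("name", "size", "md5", "last_modified")


def list_stage(cursor, database: str, schema: str, stage_name: str) -> list:
    """Run LIST on the stage and return one dict per file."""
    for part in (database, schema, stage_name):
        if not _IDENTIFIER.match(part):
            raise ValueError(f"unsupported identifier: {part!r}")
    cursor.execute(f"LIST @{database}.{schema}.{stage_name}")
    return [dict(zip(LIST_COLUMNS, row)) for row in cursor.fetchall()]
```

If the call raises a permissions error, revisit the `GRANT` statements in step 2; if it returns an empty list, the stage is reachable but holds no files yet.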

