
# Azure Blob Storage

K-AI ingests documents from an Azure Blob Storage container via the Azure Storage REST API. The connector authenticates with either a SAS URL or a storage account connection string, enumerates blobs inside a single container, and syncs incrementally based on each blob's Last-Modified Time (LMT). It is the recommended path when documents are already dropped into a customer-owned Azure container — a common pattern, since K-AI SaaS itself uses Azure Blob Storage for its internal storage layer.

## Supported source versions

* **Azure Blob Storage**: all current Azure Storage account kinds that serve blobs (general-purpose v2, BlockBlobStorage, BlobStorage). The connector uses the Azure Storage REST API at its latest service version.
* **Azure Stack Hub** blob endpoints: supported when reachable through a compatible connection string.
* **Container scope only**: the connector targets one container per source registration. To ingest multiple containers, register one source per container.

## Authentication

Two modes are supported, picked automatically by the connector based on which credential field is populated:

* **SAS URL** — a pre-signed container-scoped Shared Access Signature URL, e.g. `https://contoso.blob.core.windows.net/mycontainer?sv=...&se=...&sr=c&sp=rl&sig=...`. Use this mode when the customer wants to scope K-AI's read access narrowly without sharing the account key. The SAS must grant at minimum the `Read` and `List` permissions on the container.
* **Connection string** — the full Azure Storage connection string (`DefaultEndpointsProtocol=https;AccountName=...;AccountKey=...;EndpointSuffix=core.windows.net`). Grants full account-level access and is intended for accounts dedicated to K-AI ingestion.

Either `sasUrl` or `connectionString` must be set. Azure AD service principal authentication is not exposed in the current connector — customers that require it should request the extended driver from the K-AI integration team.
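The mode selection and the minimum SAS permissions described above can be sketched with the standard library alone. This is an illustration, not the connector's code: the function names `select_auth_mode` and `check_container_sas` are hypothetical, and preferring `sasUrl` when both fields are populated is an assumption the documentation does not state.

```python
from typing import List
from urllib.parse import urlparse, parse_qs


def select_auth_mode(config: dict) -> str:
    """Return 'sas' or 'connection_string' depending on which field is populated."""
    # Assumption: sasUrl wins if both fields happen to be set.
    if config.get("sasUrl"):
        return "sas"
    if config.get("connectionString"):
        return "connection_string"
    raise ValueError("either sasUrl or connectionString must be set")


def check_container_sas(sas_url: str) -> List[str]:
    """Return a list of problems found in a container SAS URL (empty if none)."""
    query = parse_qs(urlparse(sas_url).query)
    problems = []
    # sr=c marks a container-scoped SAS.
    if query.get("sr", [""])[0] != "c":
        problems.append("SAS is not container-scoped (sr=c)")
    # sp holds the permission letters; r = Read, l = List.
    permissions = query.get("sp", [""])[0]
    for letter, name in (("r", "Read"), ("l", "List")):
        if letter not in permissions:
            problems.append(f"SAS is missing the {name} permission")
    return problems
```

Running `check_container_sas` before registering a source is one way to catch a SAS that lacks `List`, which would otherwise only show up as a failed first sync.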

## Configuration

| Field              | Type   | Required    | Description                                                                                              |
| ------------------ | ------ | ----------- | -------------------------------------------------------------------------------------------------------- |
| `containerName`    | string | yes         | Name of the blob container to ingest, e.g. `kai-documents`.                                              |
| `sasUrl`           | string | conditional | SAS URL for the container. Required when `connectionString` is not set.                                  |
| `connectionString` | string | conditional | Storage account connection string. Required when `sasUrl` is not set. Encrypted at rest by the platform. |

The connector derives the storage host (e.g. `contoso.blob.core.windows.net`) from the supplied credential and registers it in the K-AI host allow-list automatically.
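As a rough illustration of how a host could be derived from either credential, the sketch below parses a SAS URL or a connection string with the standard library. The helper name `storage_host` is hypothetical, and the handling of an explicit `BlobEndpoint` key and the `core.windows.net` default suffix are assumptions about how such a derivation might work, not a description of the connector's internals.

```python
from typing import Optional
from urllib.parse import urlparse


def storage_host(sas_url: Optional[str] = None,
                 connection_string: Optional[str] = None) -> str:
    """Derive the blob storage host from whichever credential is supplied."""
    if sas_url:
        return urlparse(sas_url).hostname or ""
    if not connection_string:
        raise ValueError("either sas_url or connection_string is required")
    # A connection string is a ';'-separated list of Key=Value pairs.
    # Split on the first '=' only: account keys are base64 and may end in '='.
    parts = dict(
        pair.split("=", 1) for pair in connection_string.split(";") if "=" in pair
    )
    # Assumption: an explicit BlobEndpoint (e.g. for Azure Stack Hub) takes precedence.
    if "BlobEndpoint" in parts:
        return urlparse(parts["BlobEndpoint"]).hostname or ""
    suffix = parts.get("EndpointSuffix", "core.windows.net")
    return f"{parts['AccountName']}.blob.{suffix}"
```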

## Document types ingested

Anything the [indexation pipeline](/knowledge-ai/sources-and-ingestion/indexation-pipeline.md) can extract from a blob:

* PDF (`.pdf`)
* Word (`.docx`, `.doc`)
* Excel and CSV (`.xlsx`, `.xls`, `.csv`)
* PowerPoint (`.pptx`, `.ppt`)
* Plain text and Markdown (`.txt`, `.md`)
* Email archives (`.msg`, `.eml`)
* Images (`.png`, `.jpg`, `.jpeg`, `.gif`, `.tif`) — OCR via the indexation pipeline.
* HTML pages.

The MIME type is inferred from the blob's `Content-Type` property when set; otherwise it is detected from the file extension and content sniffing at extraction time.
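The fallback order above can be approximated as follows. This is a minimal sketch: `resolve_mime_type` is a hypothetical name, treating `application/octet-stream` as "not set" is an assumption, and the content-sniffing step is left out because the documentation does not describe it.

```python
import mimetypes
from typing import Optional

# Assumption: a generic binary type carries no more information than an unset one.
GENERIC_TYPES = {"", "application/octet-stream"}


def resolve_mime_type(blob_name: str, content_type: Optional[str]) -> Optional[str]:
    """Prefer the blob's Content-Type, then fall back to the file extension."""
    # Drop parameters such as '; charset=utf-8' before comparing.
    declared = (content_type or "").split(";")[0].strip().lower()
    if declared not in GENERIC_TYPES:
        return declared
    guessed, _encoding = mimetypes.guess_type(blob_name)
    return guessed  # None means content sniffing would have to decide
```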

## Sync mode

Incremental sync is **LMT-based**. On each cycle the connector calls `list_blobs()` on the container and compares each blob's `last_modified` timestamp against the cursor it persists between runs. Only blobs whose LMT exceeds the cursor are downloaded and pushed to the Orchestrator.
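The cursor comparison can be sketched without the Azure SDK. The `BlobInfo` type and `select_changed` function below are hypothetical stand-ins for whatever the listing returns; a cursor of `None` is treated as the initial full crawl, and the strict `>` mirrors "exceeds the cursor".

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class BlobInfo:
    name: str
    last_modified: datetime


def select_changed(
    blobs: Iterable[BlobInfo], cursor: Optional[datetime]
) -> Tuple[List[BlobInfo], Optional[datetime]]:
    """Return blobs newer than the cursor, plus the cursor to persist next."""
    changed = [b for b in blobs if cursor is None or b.last_modified > cursor]
    # Advance the cursor to the newest LMT seen; keep it unchanged if nothing moved.
    new_cursor = max((b.last_modified for b in changed), default=cursor)
    return changed, new_cursor
```

One consequence of a strict comparison is that a blob whose LMT equals the cursor exactly is not picked up again, which avoids re-ingesting the newest blob on every cycle.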

Deletions are detected on full re-listing: a blob present in the previous snapshot but absent in the current listing is marked deleted in the K-AI Instance. Soft-deleted blobs (when soft delete is enabled on the account) are ignored unless their soft-delete window has expired.
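Deletion detection by re-listing amounts to a set difference between two snapshots of blob names. The sketch below assumes snapshots are held as plain sets of names; `detect_deleted` is a hypothetical helper, and how the connector actually stores its snapshot is not documented here.

```python
from typing import Iterable, Set


def detect_deleted(previous: Iterable[str], current: Iterable[str]) -> Set[str]:
    """Names present in the previous snapshot but absent from the current listing."""
    return set(previous) - set(current)


# Example: 'old.docx' disappeared between two listings.
previous_snapshot = {"report.pdf", "old.docx", "notes.txt"}
current_listing = {"report.pdf", "notes.txt", "new.pptx"}
deleted = detect_deleted(previous_snapshot, current_listing)
```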

The first sync is a full crawl — every blob in the container is enumerated and ingested. Subsequent syncs are LMT-incremental.

## ACL handling

Azure Blob Storage exposes no per-object ACL beyond the container-level access (account key, SAS scope, or RBAC role on the container). The connector treats the entire container as a single access boundary and applies the configured **mapped** ACL strategy: the K-AI admin assigns a group rule at source-registration time (e.g. `azure-blob-readers`), which is then attached to every document ingested from this container.

For tenant-aware access control, customers should partition documents into separate containers, register one K-AI source per container, and apply distinct group rules. At retrieval time, [K-AI MCP](/knowledge-ai/k-ai-mcp/mcp.md) replays the mapped ACL against the calling identity.
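To make the mapped strategy concrete, the sketch below attaches one group rule to every document from a container and checks it against a caller's groups. All names (`Document`, `apply_mapped_acl`, `can_read`) are hypothetical, and the real evaluation performed by K-AI MCP may differ.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Document:
    blob_name: str
    allowed_groups: FrozenSet[str]


def apply_mapped_acl(blob_names: Iterable[str], group_rule: str) -> List[Document]:
    """Attach the source's single group rule to every ingested document."""
    return [Document(name, frozenset({group_rule})) for name in blob_names]


def can_read(doc: Document, caller_groups: Iterable[str]) -> bool:
    """A caller may read a document if it shares at least one group with it."""
    return bool(doc.allowed_groups & set(caller_groups))
```

With one source per container, two containers registered under different group rules yield disjoint readable sets for callers who belong to only one of the groups.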

## Rate limits & throttling

Azure Storage enforces scalability targets at two levels: by default, up to 20,000 requests/s per storage account and up to 500 requests/s against a single blob. The connector caps concurrent downloads (default: 16) and honours `Retry-After` headers on `429` and `503` responses with exponential back-off and jitter. Listing requests use server-side pagination via the SDK's `by_page()` iterator (default page size: 5,000 blobs).
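A delay calculation that combines exponential back-off, jitter, and a server-supplied `Retry-After` value might look like the sketch below. The function name `retry_delay`, the `base` and `cap` defaults, and the choice of "full jitter" are assumptions for illustration; the connector's actual parameters are not documented.

```python
import random
from typing import Callable, Optional


def retry_delay(attempt: int,
                retry_after: Optional[float] = None,
                base: float = 0.5,
                cap: float = 30.0,
                rng: Callable[[], float] = random.random) -> float:
    """Seconds to wait before retry number `attempt` (0-based)."""
    # Exponential back-off with full jitter: a random point in [0, ceiling].
    ceiling = min(cap, base * (2 ** attempt))
    delay = rng() * ceiling
    # Never retry sooner than the server asked for.
    if retry_after is not None:
        delay = max(delay, retry_after)
    return delay
```

Passing `rng` as a parameter keeps the function deterministic under test while defaulting to real randomness in use.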

## Known limitations

* **Single container per source** — multi-container ingest requires multiple source registrations.
* **No native ACL** — access control is container-level only. Use group rules at the K-AI side or partition documents across containers.
* **No Azure AD authentication** in the current connector — SAS URL or connection string only.
* **Append blobs** and **page blobs** are listed but extracted only when their content type matches a supported document MIME type; binary disk images, VM disks, and database backups are skipped.
* **Blob versions and snapshots** are ignored — only the current version of each blob is ingested.
* **Hierarchical namespace** (ADLS Gen2) containers are supported as flat blob containers; ACLs from the Data Lake POSIX-style permission model are not surfaced.

## Setup walkthrough

1. **Provision a container** in the Azure Storage account that will host the documents, e.g. `kai-documents`.
2. **Generate credentials** — either:
   * Open the container in the Azure portal → **Shared access tokens** → grant `Read` and `List` permissions, set an expiry, and copy the generated SAS URL; or
   * Open the storage account → **Access keys** → copy the connection string.
3. **Add the source in the K-AI admin portal**: select Azure Blob Storage, then enter `containerName` and either `sasUrl` or `connectionString`. Set the sync schedule as a cron expression (e.g. `*/15 * * * *`, every 15 minutes).
4. **Assign a group rule** for ACL mapping — every document ingested from this container will inherit that group on the K-AI side.
5. **Trigger a test sync** from the portal. The connector lists the container, registers blobs into `INITIAL_SAVED`, and the indexation pipeline takes over. Confirm the document count via the [Documents endpoint](/knowledge-ai/sources-and-ingestion/instance-api/documents.md) of the K-AI Instance.

