Blob Storage Indexing with Azure Cognitive Search
Azure Blob Storage is one of the most common data sources for Azure Cognitive Search. Whether you’re building a document search solution, a knowledge base, or a content management system, understanding how to effectively index blob content is essential. Let me walk you through the key concepts and best practices.
Understanding Blob Indexing
When indexing blobs, the indexer performs several operations:
- Enumerate blobs - Find blobs matching your criteria
- Download content - Retrieve blob data
- Document cracking - Extract text and metadata
- AI enrichment - Apply cognitive skills (optional)
- Index mapping - Map fields to search index
Setting Up Blob Data Source
Basic Configuration
{
"name": "documents-blob-datasource",
"type": "azureblob",
"credentials": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=mydocs;AccountKey=xxx;EndpointSuffix=core.windows.net"
},
"container": {
"name": "documents",
"query": null
}
}
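Data sources are created through the REST API or the azure-search-documents SDK. A minimal REST sketch with the requests library (the endpoint, admin key, and api-version are assumed placeholders; data_source is the JSON definition above loaded as a dict):
import requests

# Assumed placeholders: your search service endpoint and an admin API key
endpoint = "https://mysearch.search.windows.net"
headers = {"Content-Type": "application/json", "api-key": "<admin-api-key>"}

# PUT is idempotent: it creates the data source or updates an existing one
response = requests.put(
    f"{endpoint}/datasources/documents-blob-datasource?api-version=2023-11-01",
    headers=headers,
    json=data_source,
)
response.raise_for_status()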
Using Managed Identity
{
"name": "documents-blob-datasource",
"type": "azureblob",
"credentials": {
"connectionString": "ResourceId=/subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/mydocs;"
},
"container": {
"name": "documents"
}
}
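For the managed identity connection to work, the search service's identity needs read access to the storage account. One way to grant it is the Storage Blob Data Reader role (the principal ID and scope below are assumptions to fill in):
az role assignment create \
  --assignee <search-service-principal-id> \
  --role "Storage Blob Data Reader" \
  --scope /subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/mydocs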
Filtering with a Query
The optional query property restricts indexing to blobs whose names begin with the given prefix (a virtual folder path):
{
"container": {
"name": "documents",
"query": "legal/contracts/"
}
}
Supported Document Formats
Azure Cognitive Search can extract text from many formats:
Office Documents:
- Microsoft Word (.docx, .doc)
- Microsoft Excel (.xlsx, .xls)
- Microsoft PowerPoint (.pptx, .ppt)
PDF: .pdf (text-based PDFs natively; scanned PDFs require OCR through a skillset)
Plain Text:
- .txt
- .csv
- .tsv
Web:
- .html, .htm
- .xml
- .json
Email: .eml, .msg
Images (text extraction requires the OCR skill): .jpg, .png, .bmp, .tiff, .gif
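By default, the indexer treats each blob as a single document, including CSV and JSON files. To emit one search document per record instead, set a parsingMode in the indexer configuration; a sketch for CSV (delimitedText; use jsonArray for JSON arrays):
{
  "parameters": {
    "configuration": {
      "parsingMode": "delimitedText",
      "firstLineContainsHeaders": true
    }
  }
}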
Indexer Configuration for Blobs
Full Configuration Example
{
"name": "documents-indexer",
"dataSourceName": "documents-blob-datasource",
"targetIndexName": "documents-index",
"skillsetName": "document-skillset",
"schedule": {
"interval": "PT1H"
},
"parameters": {
"batchSize": 10,
"maxFailedItems": -1,
"maxFailedItemsPerBatch": -1,
"configuration": {
"dataToExtract": "contentAndMetadata",
"imageAction": "generateNormalizedImages",
"normalizedImageMaxWidth": 2000,
"normalizedImageMaxHeight": 2000,
"indexStorageMetadataOnlyForOversizedDocuments": true,
"failOnUnsupportedContentType": false,
"failOnUnprocessableDocument": false,
"indexedFileNameExtensions": ".pdf,.docx,.pptx,.xlsx,.txt,.html",
"excludedFileNameExtensions": ".zip,.exe,.dll"
}
},
"fieldMappings": [
{
"sourceFieldName": "metadata_storage_path",
"targetFieldName": "id",
"mappingFunction": {
"name": "base64Encode"
}
},
{
"sourceFieldName": "metadata_storage_name",
"targetFieldName": "fileName"
},
{
"sourceFieldName": "metadata_storage_path",
"targetFieldName": "url"
},
{
"sourceFieldName": "metadata_storage_size",
"targetFieldName": "fileSize"
},
{
"sourceFieldName": "metadata_storage_last_modified",
"targetFieldName": "lastModified"
},
{
"sourceFieldName": "metadata_content_type",
"targetFieldName": "contentType"
}
]
}
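The indexer definition is created the same way as the data source. A sketch that uploads it and triggers an immediate run (indexer_definition being the JSON above loaded as a dict; endpoint and key are the same assumed placeholders):
import requests

endpoint = "https://mysearch.search.windows.net"
headers = {"Content-Type": "application/json", "api-key": "<admin-api-key>"}

# Create or update the indexer from the JSON definition above
response = requests.put(
    f"{endpoint}/indexers/documents-indexer?api-version=2023-11-01",
    headers=headers,
    json=indexer_definition,
)
response.raise_for_status()

# Run it now instead of waiting for the hourly schedule
requests.post(f"{endpoint}/indexers/documents-indexer/run?api-version=2023-11-01",
              headers=headers)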
Handling Different Content Types
PDF Documents
{
"parameters": {
"configuration": {
"dataToExtract": "contentAndMetadata",
"pdfTextRotationAlgorithm": "detectAngles"
}
}
}
Scanned PDFs with OCR
{
"name": "ocr-skillset",
"skills": [
{
"@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
"name": "ocr",
"description": "Extract text from images",
"context": "/document/normalized_images/*",
"lineEnding": "Space",
"defaultLanguageCode": "en",
"detectOrientation": true,
"inputs": [
{
"name": "image",
"source": "/document/normalized_images/*"
}
],
"outputs": [
{
"name": "text",
"targetName": "ocrText"
}
]
},
{
"@odata.type": "#Microsoft.Skills.Text.MergeSkill",
"name": "merge-text",
"description": "Merge OCR text with content",
"context": "/document",
"insertPreTag": " ",
"insertPostTag": " ",
"inputs": [
{
"name": "text",
"source": "/document/content"
},
{
"name": "itemsToInsert",
"source": "/document/normalized_images/*/ocrText"
}
],
"outputs": [
{
"name": "mergedText",
"targetName": "mergedContent"
}
]
}
]
}
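The merged text does not reach the index on its own; wire the enriched path into a field through the indexer's outputFieldMappings (assuming the content field defined in the schema below):
{
  "outputFieldMappings": [
    {
      "sourceFieldName": "/document/mergedContent",
      "targetFieldName": "content"
    }
  ]
}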
Office Documents
{
"fieldMappings": [
{
"sourceFieldName": "metadata_author",
"targetFieldName": "author"
},
{
"sourceFieldName": "metadata_title",
"targetFieldName": "title"
},
{
"sourceFieldName": "metadata_creation_date",
"targetFieldName": "createdDate"
}
]
}
Index Schema for Document Search
{
"name": "documents-index",
"fields": [
{
"name": "id",
"type": "Edm.String",
"key": true,
"searchable": false
},
{
"name": "fileName",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"sortable": true
},
{
"name": "content",
"type": "Edm.String",
"searchable": true,
"analyzer": "en.microsoft"
},
{
"name": "url",
"type": "Edm.String",
"searchable": false,
"retrievable": true
},
{
"name": "contentType",
"type": "Edm.String",
"filterable": true,
"facetable": true
},
{
"name": "fileSize",
"type": "Edm.Int64",
"filterable": true,
"sortable": true
},
{
"name": "lastModified",
"type": "Edm.DateTimeOffset",
"filterable": true,
"sortable": true
},
{
"name": "author",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"facetable": true
},
{
"name": "keyPhrases",
"type": "Collection(Edm.String)",
"searchable": true,
"filterable": true,
"facetable": true
},
{
"name": "organizations",
"type": "Collection(Edm.String)",
"searchable": true,
"filterable": true,
"facetable": true
}
],
"suggesters": [
{
"name": "suggester",
"searchMode": "analyzingInfixMatching",
"sourceFields": ["fileName", "keyPhrases"]
}
]
}
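With the suggester in place, the index supports typeahead queries over fileName and keyPhrases. A minimal sketch with the Python SDK (endpoint and key are the assumed placeholders used elsewhere):
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://mysearch.search.windows.net",
    index_name="documents-index",
    credential=AzureKeyCredential("<query-api-key>"),
)

# Return up to five suggestions for a partial query
suggestions = search_client.suggest(search_text="contr",
                                    suggester_name="suggester", top=5)
for suggestion in suggestions:
    print(suggestion["text"])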
Change Detection with Blob Storage
Using Blob Metadata
The indexer's built-in change detection watches each blob's LastModified timestamp. Storing a content hash in custom metadata additionally lets you check whether the bytes actually changed, for example before overwriting a blob:
import hashlib
from datetime import datetime

from azure.storage.blob import BlobServiceClient

def update_blob_metadata_for_change_detection(container_name, blob_name, connection_string):
    blob_service = BlobServiceClient.from_connection_string(connection_string)
    blob_client = blob_service.get_blob_client(container_name, blob_name)
    # Download and hash the blob content
    content = blob_client.download_blob().readall()
    content_hash = hashlib.md5(content).hexdigest()
    # Note: setting metadata also bumps LastModified, which triggers reindexing
    blob_client.set_blob_metadata({
        'contenthash': content_hash,
        'indexdate': datetime.utcnow().isoformat()
    })
Native Soft Delete for Deletion Detection
{
"dataDeletionDetectionPolicy": {
"@odata.type": "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
}
}
Enable soft delete on the storage account:
az storage account blob-service-properties update \
--account-name mystorageaccount \
--enable-delete-retention true \
--delete-retention-days 7
Handling Large Documents
Configure Size Limits
{
"parameters": {
"configuration": {
"indexStorageMetadataOnlyForOversizedDocuments": true,
"failOnUnprocessableDocument": false
}
}
}
Chunking Large Documents
{
"skills": [
{
"@odata.type": "#Microsoft.Skills.Text.SplitSkill",
"name": "split-skill",
"description": "Split content into chunks",
"context": "/document",
"textSplitMode": "pages",
"maximumPageLength": 5000,
"pageOverlapLength": 500,
"inputs": [
{
"name": "text",
"source": "/document/content"
}
],
"outputs": [
{
"name": "textItems",
"targetName": "pages"
}
]
}
]
}
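Each chunk lands at /document/pages/*. One way to keep the chunks on the same search document is to map them into a collection field via outputFieldMappings (a sketch, assuming a hypothetical chunks field of type Collection(Edm.String) in the index):
{
  "outputFieldMappings": [
    {
      "sourceFieldName": "/document/pages/*",
      "targetFieldName": "chunks"
    }
  ]
}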
Organizing Blob Content
Folder-Based Organization
documents/
├── legal/
│ ├── contracts/
│ └── compliance/
├── hr/
│ ├── policies/
│ └── training/
└── finance/
├── reports/
└── invoices/
Create a separate data source and indexer per folder. A sketch using the azure-search-documents SDK (connection_string being the storage connection string from earlier):
from datetime import timedelta

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    IndexingSchedule,
    SearchIndexer,
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
)

indexer_client = SearchIndexerClient(
    endpoint="https://mysearch.search.windows.net",
    credential=AzureKeyCredential("<admin-api-key>"),
)

folders = ["legal/contracts", "legal/compliance", "hr/policies", "finance/reports"]
for folder in folders:
    name = folder.replace("/", "-")
    # One data source per folder prefix, all pointing at the same container
    data_source = SearchIndexerDataSourceConnection(
        name=f"{name}-datasource",
        type="azureblob",
        connection_string=connection_string,
        container=SearchIndexerDataContainer(name="documents", query=folder),
    )
    # One indexer per data source, all writing into the same index
    indexer = SearchIndexer(
        name=f"{name}-indexer",
        data_source_name=data_source.name,
        target_index_name="documents-index",
        schedule=IndexingSchedule(interval=timedelta(hours=1)),
    )
    indexer_client.create_data_source_connection(data_source)
    indexer_client.create_indexer(indexer)
Metadata-Based Organization
from datetime import datetime

from azure.storage.blob import BlobServiceClient

def upload_with_metadata(container, file_path, department, document_type, connection_string):
    blob_service = BlobServiceClient.from_connection_string(connection_string)
    blob_client = blob_service.get_blob_client(container, file_path)
    with open(file_path, "rb") as data:
        blob_client.upload_blob(
            data,
            overwrite=True,  # replace the blob if it already exists
            metadata={
                "department": department,
                "documentType": document_type,
                "uploadedBy": "system",
                "uploadDate": datetime.utcnow().isoformat()
            }
        )
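For example, for a hypothetical HR policy document (connection_string being the storage connection string used above):
upload_with_metadata("documents", "policies/handbook.pdf", "hr", "policy", connection_string)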
Map the metadata in the indexer. Note that custom blob metadata is exposed under its own key (no metadata_ prefix), so explicit mappings like these are only required when the metadata key and the index field name differ:
{
  "fieldMappings": [
    {
      "sourceFieldName": "department",
      "targetFieldName": "department"
    },
    {
      "sourceFieldName": "documentType",
      "targetFieldName": "documentType"
    }
  ]
}
Searching Indexed Documents
from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential
search_client = SearchClient(
endpoint="https://mysearch.search.windows.net",
index_name="documents-index",
credential=AzureKeyCredential("api-key")
)
# Search with filters and facets
results = search_client.search(
search_text="contract renewal",
filter="contentType eq 'application/pdf'",
facets=["author", "contentType"],
highlight_fields="content",
top=10,
include_total_count=True
)
print(f"Total results: {results.get_count()}")
for result in results:
    print(f"File: {result['fileName']}")
    print(f"Score: {result['@search.score']}")
    # Highlights are only present when the query matched inside 'content'
    if result.get('@search.highlights'):
        print(f"Highlights: {result['@search.highlights']['content']}")
    print("---")
# Get facet counts
for facet in results.get_facets()['contentType']:
    print(f"{facet['value']}: {facet['count']}")
Best Practices
- Filter by file extension - Only index the file types you need
- Configure error handling - Don’t let bad documents stop indexing
- Enable soft delete - Properly handle document deletions
- Organize content logically - Use folders and metadata
- Monitor indexer health - Check for warnings and errors (see the sketch after this list)
- Test with representative content - Validate before production
- Use managed identity - More secure than connection strings
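Indexer health is exposed through the status API; a minimal monitoring sketch with the Python SDK (same assumed endpoint and key placeholders):
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient

indexer_client = SearchIndexerClient(
    endpoint="https://mysearch.search.windows.net",
    credential=AzureKeyCredential("<admin-api-key>"),
)

status = indexer_client.get_indexer_status("documents-indexer")
print(f"Status: {status.status}")
if status.last_result:
    print(f"Last run: {status.last_result.status}, "
          f"{status.last_result.item_count} items, "
          f"{status.last_result.failed_item_count} failed")
    # Surface per-document errors from the last run
    for error in status.last_result.errors:
        print(f"Error: {error.error_message}")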
Conclusion
Blob storage indexing with Azure Cognitive Search enables powerful document search scenarios. By understanding the configuration options, handling different content types, and implementing proper change detection, you can build robust search solutions that keep your index synchronized with your document repository. Combining document cracking, OCR, and AI enrichment lets you search across a wide range of document formats.