Blob Storage Indexing with Azure Cognitive Search
Azure Blob Storage is one of the most common data sources for Azure Cognitive Search. Whether you’re building a document search solution, a knowledge base, or a content management system, understanding how to effectively index blob content is essential. Let me walk you through the key concepts and best practices.
Understanding Blob Indexing
When indexing blobs, the indexer performs several operations:
- Enumerate blobs - Find blobs matching your criteria
- Download content - Retrieve blob data
- Document cracking - Extract text and metadata
- AI enrichment - Apply cognitive skills (optional)
- Index mapping - Map fields to search index
Setting Up Blob Data Source
Basic Configuration
{
"name": "documents-blob-datasource",
"type": "azureblob",
"credentials": {
"connectionString": "DefaultEndpointsProtocol=https;AccountName=mydocs;AccountKey=xxx;EndpointSuffix=core.windows.net"
},
"container": {
"name": "documents",
"query": null
}
}
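Data sources are created through the REST API or the azure-search-documents SDK. A minimal REST sketch with the requests library (the endpoint, admin key, and api-version are assumed placeholders; data_source is the JSON definition above loaded as a dict):
import requests

# Assumed placeholders: your search service endpoint and an admin API key
endpoint = "https://mysearch.search.windows.net"
headers = {"Content-Type": "application/json", "api-key": "<admin-api-key>"}

# PUT is idempotent: it creates the data source or updates an existing one
response = requests.put(
    f"{endpoint}/datasources/documents-blob-datasource?api-version=2023-11-01",
    headers=headers,
    json=data_source,
)
response.raise_for_status()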
Using Managed Identity
{
"name": "documents-blob-datasource",
"type": "azureblob",
"credentials": {
"connectionString": "ResourceId=/subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/mydocs;"
},
"container": {
"name": "documents"
}
}
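For the managed identity connection to work, the search service's identity needs read access to the storage account. One way to grant it is the Storage Blob Data Reader role (the principal ID and scope below are assumptions to fill in):
az role assignment create \
  --assignee <search-service-principal-id> \
  --role "Storage Blob Data Reader" \
  --scope /subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.Storage/storageAccounts/mydocs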
Filtering with a Query
The optional query property restricts indexing to blobs whose names begin with the given prefix (a virtual folder path):
{
"container": {
"name": "documents",
"query": "legal/contracts/"
}
}
Supported Document Formats
Azure Cognitive Search can extract text from many formats:
Office Documents:
- Microsoft Word (.docx, .doc)
- Microsoft Excel (.xlsx, .xls)
- Microsoft PowerPoint (.pptx, .ppt)
PDF: .pdf (text-based PDFs natively; scanned PDFs require OCR through a skillset)
Plain Text:
- .txt
- .csv
- .tsv
Web:
- .html, .htm
- .xml
- .json
Email: .eml, .msg
Images (text extraction requires the OCR skill): .jpg, .png, .bmp, .tiff, .gif
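By default, the indexer treats each blob as a single document, including CSV and JSON files. To emit one search document per record instead, set a parsingMode in the indexer configuration; a sketch for CSV (delimitedText; use jsonArray for JSON arrays):
{
  "parameters": {
    "configuration": {
      "parsingMode": "delimitedText",
      "firstLineContainsHeaders": true
    }
  }
}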
Indexer Configuration for Blobs
Full Configuration Example
{
"name": "documents-indexer",
"dataSourceName": "documents-blob-datasource",
"targetIndexName": "documents-index",
"skillsetName": "document-skillset",
"schedule": {
"interval": "PT1H"
},
"parameters": {
"batchSize": 10,
"maxFailedItems": -1,
"maxFailedItemsPerBatch": -1,
"configuration": {
"dataToExtract": "contentAndMetadata",
"imageAction": "generateNormalizedImages",
"normalizedImageMaxWidth": 2000,
"normalizedImageMaxHeight": 2000,
"indexStorageMetadataOnlyForOversizedDocuments": true,
"failOnUnsupportedContentType": false,
"failOnUnprocessableDocument": false,
"indexedFileNameExtensions": ".pdf,.docx,.pptx,.xlsx,.txt,.html",
"excludedFileNameExtensions": ".zip,.exe,.dll"
}
},
"fieldMappings": [
{
"sourceFieldName": "metadata_storage_path",
"targetFieldName": "id",
"mappingFunction": {
"name": "base64Encode"
}
},
{
"sourceFieldName": "metadata_storage_name",
"targetFieldName": "fileName"
},
{
"sourceFieldName": "metadata_storage_path",
"targetFieldName": "url"
},
{
"sourceFieldName": "metadata_storage_size",
"targetFieldName": "fileSize"
},
{
"sourceFieldName": "metadata_storage_last_modified",
"targetFieldName": "lastModified"
},
{
"sourceFieldName": "metadata_content_type",
"targetFieldName": "contentType"
}
]
}
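The indexer definition is created the same way as the data source. A sketch that uploads it and triggers an immediate run (indexer_definition being the JSON above loaded as a dict; endpoint and key are the same assumed placeholders):
import requests

endpoint = "https://mysearch.search.windows.net"
headers = {"Content-Type": "application/json", "api-key": "<admin-api-key>"}

# Create or update the indexer from the JSON definition above
response = requests.put(
    f"{endpoint}/indexers/documents-indexer?api-version=2023-11-01",
    headers=headers,
    json=indexer_definition,
)
response.raise_for_status()

# Run it now instead of waiting for the hourly schedule
requests.post(f"{endpoint}/indexers/documents-indexer/run?api-version=2023-11-01",
              headers=headers)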
Handling Different Content Types
PDF Documents
{
"parameters": {
"configuration": {
"dataToExtract": "contentAndMetadata",
"pdfTextRotationAlgorithm": "detectAngles"
}
}
}
Scanned PDFs with OCR
{
"name": "ocr-skillset",
"skills": [
{
"@odata.type": "#Microsoft.Skills.Vision.OcrSkill",
"name": "ocr",
"description": "Extract text from images",
"context": "/document/normalized_images/*",
"lineEnding": "Space",
"defaultLanguageCode": "en",
"detectOrientation": true,
"inputs": [
{
"name": "image",
"source": "/document/normalized_images/*"
}
],
"outputs": [
{
"name": "text",
"targetName": "ocrText"
}
]
},
{
"@odata.type": "#Microsoft.Skills.Text.MergeSkill",
"name": "merge-text",
"description": "Merge OCR text with content",
"context": "/document",
"insertPreTag": " ",
"insertPostTag": " ",
"inputs": [
{
"name": "text",
"source": "/document/content"
},
{
"name": "itemsToInsert",
"source": "/document/normalized_images/*/ocrText"
}
],
"outputs": [
{
"name": "mergedText",
"targetName": "mergedContent"
}
]
}
]
}
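The merged text does not reach the index on its own; wire the enriched path into a field through the indexer's outputFieldMappings (assuming the content field defined in the schema below):
{
  "outputFieldMappings": [
    {
      "sourceFieldName": "/document/mergedContent",
      "targetFieldName": "content"
    }
  ]
}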
Office Documents
{
"fieldMappings": [
{
"sourceFieldName": "metadata_author",
"targetFieldName": "author"
},
{
"sourceFieldName": "metadata_title",
"targetFieldName": "title"
},
{
"sourceFieldName": "metadata_creation_date",
"targetFieldName": "createdDate"
}
]
}
Index Schema for Document Search
{
"name": "documents-index",
"fields": [
{
"name": "id",
"type": "Edm.String",
"key": true,
"searchable": false
},
{
"name": "fileName",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"sortable": true
},
{
"name": "content",
"type": "Edm.String",
"searchable": true,
"analyzer": "en.microsoft"
},
{
"name": "url",
"type": "Edm.String",
"searchable": false,
"retrievable": true
},
{
"name": "contentType",
"type": "Edm.String",
"filterable": true,
"facetable": true
},
{
"name": "fileSize",
"type": "Edm.Int64",
"filterable": true,
"sortable": true
},
{
"name": "lastModified",
"type": "Edm.DateTimeOffset",
"filterable": true,
"sortable": true
},
{
"name": "author",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"facetable": true
},
{
"name": "keyPhrases",
"type": "Collection(Edm.String)",
"searchable": true,
"filterable": true,
"facetable": true
},
{
"name": "organizations",
"type": "Collection(Edm.String)",
"searchable": true,
"filterable": true,
"facetable": true
}
],
"suggesters": [
{
"name": "suggester",
"searchMode": "analyzingInfixMatching",
"sourceFields": ["fileName", "keyPhrases"]
}
]
}
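With the suggester in place, the index supports typeahead queries over fileName and keyPhrases. A minimal sketch with the Python SDK (endpoint and key are the assumed placeholders used elsewhere):
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient

search_client = SearchClient(
    endpoint="https://mysearch.search.windows.net",
    index_name="documents-index",
    credential=AzureKeyCredential("<query-api-key>"),
)

# Return up to five suggestions for a partial query
suggestions = search_client.suggest(search_text="contr",
                                    suggester_name="suggester", top=5)
for suggestion in suggestions:
    print(suggestion["text"])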
Change Detection with Blob Storage
Using Blob Metadata
The indexer's built-in change detection watches each blob's LastModified timestamp. Storing a content hash in custom metadata additionally lets you check whether the bytes actually changed, for example before overwriting a blob:
import hashlib
from datetime import datetime

from azure.storage.blob import BlobServiceClient

def update_blob_metadata_for_change_detection(container_name, blob_name, connection_string):
    blob_service = BlobServiceClient.from_connection_string(connection_string)
    blob_client = blob_service.get_blob_client(container_name, blob_name)
    # Download and hash the blob content
    content = blob_client.download_blob().readall()
    content_hash = hashlib.md5(content).hexdigest()
    # Note: setting metadata also bumps LastModified, which triggers reindexing
    blob_client.set_blob_metadata({
        'contenthash': content_hash,
        'indexdate': datetime.utcnow().isoformat()
    })
Native Soft Delete for Deletion Detection
{
"dataDeletionDetectionPolicy": {
"@odata.type": "#Microsoft.Azure.Search.NativeBlobSoftDeleteDeletionDetectionPolicy"
}
}
Enable soft delete on the storage account:
az storage account blob-service-properties update \
--account-name mystorageaccount \
--enable-delete-retention true \
--delete-retention-days 7
Handling Large Documents
Configure Size Limits
{
"parameters": {
"configuration": {
"indexStorageMetadataOnlyForOversizedDocuments": true,
"failOnUnprocessableDocument": false
}
}
}
Chunking Large Documents
{
"skills": [
{
"@odata.type": "#Microsoft.Skills.Text.SplitSkill",
"name": "split-skill",
"description": "Split content into chunks",
"context": "/document",
"textSplitMode": "pages",
"maximumPageLength": 5000,
"pageOverlapLength": 500,
"inputs": [
{
"name": "text",
"source": "/document/content"
}
],
"outputs": [
{
"name": "textItems",
"targetName": "pages"
}
]
}
]
}
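Each chunk lands at /document/pages/*. One way to keep the chunks on the same search document is to map them into a collection field via outputFieldMappings (a sketch, assuming a hypothetical chunks field of type Collection(Edm.String) in the index):
{
  "outputFieldMappings": [
    {
      "sourceFieldName": "/document/pages/*",
      "targetFieldName": "chunks"
    }
  ]
}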
Organizing Blob Content
Folder-Based Organization
documents/
├── legal/
│ ├── contracts/
│ └── compliance/
├── hr/
│ ├── policies/
│ └── training/
└── finance/
├── reports/
└── invoices/
Create a separate data source and indexer per folder. A sketch using the azure-search-documents SDK (connection_string being the storage connection string from earlier):
from datetime import timedelta

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    IndexingSchedule,
    SearchIndexer,
    SearchIndexerDataContainer,
    SearchIndexerDataSourceConnection,
)

indexer_client = SearchIndexerClient(
    endpoint="https://mysearch.search.windows.net",
    credential=AzureKeyCredential("<admin-api-key>"),
)

folders = ["legal/contracts", "legal/compliance", "hr/policies", "finance/reports"]
for folder in folders:
    name = folder.replace("/", "-")
    # One data source per folder prefix, all pointing at the same container
    data_source = SearchIndexerDataSourceConnection(
        name=f"{name}-datasource",
        type="azureblob",
        connection_string=connection_string,
        container=SearchIndexerDataContainer(name="documents", query=folder),
    )
    # One indexer per data source, all writing into the same index
    indexer = SearchIndexer(
        name=f"{name}-indexer",
        data_source_name=data_source.name,
        target_index_name="documents-index",
        schedule=IndexingSchedule(interval=timedelta(hours=1)),
    )
    indexer_client.create_data_source_connection(data_source)
    indexer_client.create_indexer(indexer)
Metadata-Based Organization
from datetime import datetime

from azure.storage.blob import BlobServiceClient

def upload_with_metadata(container, file_path, department, document_type, connection_string):
    blob_service = BlobServiceClient.from_connection_string(connection_string)
    blob_client = blob_service.get_blob_client(container, file_path)
    with open(file_path, "rb") as data:
        blob_client.upload_blob(
            data,
            overwrite=True,  # replace the blob if it already exists
            metadata={
                "department": department,
                "documentType": document_type,
                "uploadedBy": "system",
                "uploadDate": datetime.utcnow().isoformat()
            }
        )
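For example, for a hypothetical HR policy document (connection_string being the storage connection string used above):
upload_with_metadata("documents", "policies/handbook.pdf", "hr", "policy", connection_string)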
Map the metadata in the indexer. Note that custom blob metadata is exposed under its own key (no metadata_ prefix), so explicit mappings like these are only required when the metadata key and the index field name differ:
{
  "fieldMappings": [
    {
      "sourceFieldName": "department",
      "targetFieldName": "department"
    },
    {
      "sourceFieldName": "documentType",
      "targetFieldName": "documentType"
    }
  ]
}
Searching Indexed Documents
from azure.search.documents import SearchClient
from azure.core.credentials import AzureKeyCredential
search_client = SearchClient(
endpoint="https://mysearch.search.windows.net",
index_name="documents-index",
credential=AzureKeyCredential("api-key")
)
# Search with filters and facets
results = search_client.search(
search_text="contract renewal",
filter="contentType eq 'application/pdf'",
facets=["author", "contentType"],
highlight_fields="content",
top=10,
include_total_count=True
)
print(f"Total results: {results.get_count()}")
for result in results:
    print(f"File: {result['fileName']}")
    print(f"Score: {result['@search.score']}")
    # Highlights are only present when the query matched inside 'content'
    if result.get('@search.highlights'):
        print(f"Highlights: {result['@search.highlights']['content']}")
    print("---")
# Get facet counts
for facet in results.get_facets()['contentType']:
    print(f"{facet['value']}: {facet['count']}")
Best Practices
- Filter by file extension - Only index the file types you need
- Configure error handling - Don’t let bad documents stop indexing
- Enable soft delete - Properly handle document deletions
- Organize content logically - Use folders and metadata
- Monitor indexer health - Check for warnings and errors (see the sketch after this list)
- Test with representative content - Validate before production
- Use managed identity - More secure than connection strings
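Indexer health is exposed through the status API; a minimal monitoring sketch with the Python SDK (same assumed endpoint and key placeholders):
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient

indexer_client = SearchIndexerClient(
    endpoint="https://mysearch.search.windows.net",
    credential=AzureKeyCredential("<admin-api-key>"),
)

status = indexer_client.get_indexer_status("documents-indexer")
print(f"Status: {status.status}")
if status.last_result:
    print(f"Last run: {status.last_result.status}, "
          f"{status.last_result.item_count} items, "
          f"{status.last_result.failed_item_count} failed")
    # Surface per-document errors from the last run
    for error in status.last_result.errors:
        print(f"Error: {error.error_message}")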
Conclusion
Blob storage indexing with Azure Cognitive Search enables powerful document search scenarios. By understanding the configuration options, handling different content types, and implementing proper change detection, you can build robust search solutions that keep your index synchronized with your document repository. Combining document cracking, OCR, and AI enrichment lets you search across a wide range of document formats.