September 1, 2020

Azure Cognitive Search Indexers for Document Processing

Azure Cognitive Search AI Document Processing

Azure Cognitive Search is more than a search service: with indexers and skillsets, it works as a document processing pipeline with built-in AI enrichment.

The Indexer Pipeline

Data Source → Skillset (AI Enrichment) → Index → Search Queries
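Each stage in the flow above is a separate resource on the search service, and the indexer refers to the other three by name, so they need to exist before it is created. The following is a minimal sketch of that creation order. It assumes REST API version 2020-06-30, a hypothetical service name `my-search-service`, and placeholder definitions (only the resource names come from this post). The requests are built but not sent.

```python
import json
import urllib.request

API_VERSION = "2020-06-30"  # assumption: an API version that supports skillsets

# Dependency order: the indexer comes last because it names the other three.
CREATION_ORDER = ["datasources", "indexes", "skillsets", "indexers"]


def build_put(service, collection, definition, api_key):
    """Build (but do not send) a create-or-update request for one resource."""
    url = (
        f"https://{service}.search.windows.net/"
        f"{collection}/{definition['name']}?api-version={API_VERSION}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(definition).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="PUT",
    )


def build_pipeline_requests(service, definitions, api_key):
    """Return the PUT requests in an order that satisfies the name references."""
    return [
        build_put(service, collection, definitions[collection], api_key)
        for collection in CREATION_ORDER
    ]


# Placeholder definitions: real ones need connection details, fields, and so on.
definitions = {
    "datasources": {"name": "blob-documents", "type": "azureblob"},
    "indexes": {"name": "documents-index"},
    "skillsets": {"name": "document-skills"},
    "indexers": {"name": "document-indexer"},
}

requests_in_order = build_pipeline_requests(
    "my-search-service", definitions, api_key="<admin-key>"
)
for req in requests_in_order:
    print(req.get_method(), req.full_url)
```

Passing each request to `urllib.request.urlopen` would submit it. That step is left out here so the sketch runs without a service or key.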

Creating an Indexer

{
  "name": "document-indexer",
  "dataSourceName": "blob-documents",
  "targetIndexName": "documents-index",
  "skillsetName": "document-skills",
  "schedule": {
    "interval": "PT2H"
  },
  "parameters": {
    "configuration": {
      "dataToExtract": "contentAndMetadata",
      "parsingMode": "default"
    }
  },
  "fieldMappings": [
    {"sourceFieldName": "metadata_storage_path", "targetFieldName": "id", "mappingFunction": {"name": "base64Encode"}},
    {"sourceFieldName": "metadata_storage_name", "targetFieldName": "filename"}
  ],
  "outputFieldMappings": [
    {"sourceFieldName": "/document/content", "targetFieldName": "content"},
    {"sourceFieldName": "/document/keyphrases", "targetFieldName": "keyphrases"},
    {"sourceFieldName": "/document/entities", "targetFieldName": "entities"}
  ]
}
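One detail to watch when mapping `metadata_storage_path` to the key field: document keys may contain only letters, digits, dashes, underscores, and equal signs, while a blob path contains `/`, `:`, and `.`. The usual remedy is the `base64Encode` mapping function on that field mapping. The sketch below illustrates the idea with Python's standard URL-safe Base64. It is an approximation, since the service's own encoding may handle padding differently, and the blob URL is hypothetical.

```python
import base64
import re

# Characters accepted in a document key: letters, digits, _, -, =
VALID_KEY = re.compile(r"^[A-Za-z0-9_\-=]+$")


def encode_key(blob_path: str) -> str:
    """URL-safe Base64 of the path; an approximation of base64Encode."""
    return base64.urlsafe_b64encode(blob_path.encode("utf-8")).decode("ascii")


def decode_key(key: str) -> str:
    """Recover the original blob path from an encoded key."""
    return base64.urlsafe_b64decode(key.encode("ascii")).decode("utf-8")


# Hypothetical blob URL, used only for illustration.
path = "https://myaccount.blob.core.windows.net/documents/report.pdf"
key = encode_key(path)

print("raw path usable as key:", VALID_KEY.match(path) is not None)
print("encoded path usable as key:", VALID_KEY.match(key) is not None)
print("round-trips:", decode_key(key) == path)
```

Because the encoding is reversible, the original path can still be recovered from the key when displaying search results.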

AI Skillset

{
  "name": "document-skills",
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
      "inputs": [{"name": "text", "source": "/document/content"}],
      "outputs": [{"name": "keyPhrases", "targetName": "keyphrases"}]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
      "categories": ["Organization", "Person", "Location"],
      "inputs": [{"name": "text", "source": "/document/content"}],
      "outputs": [{"name": "entities", "targetName": "entities"}]
    }
  ]
}
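Each skill writes its output into the enriched document tree at its context (which defaults to `/document`) followed by the output's `targetName`. Those are the paths the indexer's `outputFieldMappings` must reference. A small sketch of a local consistency check follows; the helper names `produced_paths` and `unresolved_sources` are made up for this example and are not part of any SDK.

```python
def produced_paths(skillset):
    """Enrichment-tree paths written by the skills' outputs."""
    paths = set()
    for skill in skillset["skills"]:
        context = skill.get("context", "/document")  # default context
        for output in skill["outputs"]:
            node = output.get("targetName", output["name"])
            paths.add(f"{context}/{node}")
    return paths


def unresolved_sources(skillset, mappings, builtin=("/document/content",)):
    """Mapping sources that no skill (or document cracking) produces."""
    available = produced_paths(skillset) | set(builtin)
    return [
        m["sourceFieldName"]
        for m in mappings
        if m["sourceFieldName"] not in available
    ]


skillset = {
    "name": "document-skills",
    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
            "inputs": [{"name": "text", "source": "/document/content"}],
            "outputs": [{"name": "keyPhrases", "targetName": "keyphrases"}],
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
            "categories": ["Organization", "Person", "Location"],
            "inputs": [{"name": "text", "source": "/document/content"}],
            "outputs": [{"name": "entities", "targetName": "entities"}],
        },
    ],
}

output_field_mappings = [
    {"sourceFieldName": "/document/content", "targetFieldName": "content"},
    {"sourceFieldName": "/document/keyphrases", "targetFieldName": "keyphrases"},
    {"sourceFieldName": "/document/entities", "targetFieldName": "entities"},
]

print(sorted(produced_paths(skillset)))
print(unresolved_sources(skillset, output_field_mappings))
```

A mapping source with a typo, such as `/document/keyPhrases` (the skill's output name rather than its `targetName`), would show up in the second list.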

With the indexer on its two-hour schedule (PT2H), each run picks up new or changed documents in blob storage and indexes them together with the AI-extracted key phrases and entities.
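To confirm the pipeline is working, the indexer's status can be checked and the index queried. The sketch below builds both requests, assuming REST API version 2020-06-30, a hypothetical service name `my-search-service`, and that the `filename` and `keyphrases` fields are retrievable in the index. The search text is just an example, and nothing is sent.

```python
import json
import urllib.request

API_VERSION = "2020-06-30"     # assumption: same API version as the definitions
SERVICE = "my-search-service"  # hypothetical service name
BASE_URL = f"https://{SERVICE}.search.windows.net"


def build_status_request(indexer_name, api_key):
    """GET the execution status of an indexer."""
    url = f"{BASE_URL}/indexers/{indexer_name}/status?api-version={API_VERSION}"
    return urllib.request.Request(url, headers={"api-key": api_key}, method="GET")


def build_search_request(index_name, text, api_key, top=5):
    """POST a full-text query and return only a few fields per hit."""
    url = f"{BASE_URL}/indexes/{index_name}/docs/search?api-version={API_VERSION}"
    body = {"search": text, "select": "filename,keyphrases", "top": top}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
        method="POST",
    )


status_req = build_status_request("document-indexer", api_key="<admin-key>")
search_req = build_search_request(
    "documents-index", "contract renewal", api_key="<query-key>"
)

print(status_req.get_method(), status_req.full_url)
print(search_req.get_method(), search_req.full_url)
print(search_req.data.decode("utf-8"))
```

Sending either request with `urllib.request.urlopen` and parsing the response with `json.load` returns the result. In the status response, the `lastResult` object reports how the most recent run went.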