Building Multi-Modal AI Applications with GPT-4 Vision and Audio

Multi-modal AI applications that combine text, images, and audio are becoming increasingly powerful. Azure OpenAI’s GPT-4 with vision capabilities enables sophisticated document understanding, visual analysis, and cross-modal reasoning.

Processing Images with GPT-4 Vision

GPT-4 Vision can analyze images directly, enabling applications like document processing, visual inspection, and accessibility features:

import base64
import json
from openai import AzureOpenAI
from pathlib import Path

class MultiModalProcessor:
    def __init__(self, client: AzureOpenAI, deployment_name: str = "gpt-4-vision"):
        # deployment_name is whatever you named your Azure deployment; it must
        # point to a vision-capable model (e.g. GPT-4 Turbo with Vision or GPT-4o).
        self.client = client
        self.deployment = deployment_name

    def encode_image(self, image_path: str) -> str:
        """Encode an image file to base64 for API submission."""
        return base64.b64encode(Path(image_path).read_bytes()).decode("utf-8")

    def analyze_document(self, image_path: str, extraction_prompt: str) -> dict:
        """Extract structured data from document images."""

        base64_image = self.encode_image(image_path)

        response = self.client.chat.completions.create(
            model=self.deployment,
            messages=[
                {
                    "role": "system",
                    "content": "You are a document analysis assistant. Extract information accurately and return structured JSON."
                },
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": extraction_prompt},
                        {
                            "type": "image_url",
                            "image_url": {
                                "url": f"data:image/png;base64,{base64_image}",
                                "detail": "high"
                            }
                        }
                    ]
                }
            ],
            max_tokens=2000,
            # JSON mode requires a deployment that supports response_format
            # (e.g. GPT-4 Turbo 2024-04-09 or GPT-4o); older vision-preview
            # deployments reject this parameter.
            response_format={"type": "json_object"}
        )

        # JSON mode guarantees parseable output, so decode it here to match
        # the declared dict return type.
        return json.loads(response.choices[0].message.content)

Practical Application: Invoice Processing

Combine vision capabilities with structured extraction for business automation:

def process_invoice(processor: MultiModalProcessor, invoice_image: str) -> dict:
    """Extract invoice data using multi-modal AI."""

    extraction_prompt = """
    Analyze this invoice image and extract the following information as JSON:
    {
        "vendor_name": "string",
        "invoice_number": "string",
        "invoice_date": "YYYY-MM-DD",
        "due_date": "YYYY-MM-DD",
        "line_items": [{"description": "string", "quantity": number, "unit_price": number}],
        "subtotal": number,
        "tax": number,
        "total": number
    }
    If any field is not visible or unclear, use null.
    """

    result = processor.analyze_document(invoice_image, extraction_prompt)
    return result

# Usage example (assumes AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY are set)
import os

azure_openai_client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)
processor = MultiModalProcessor(azure_openai_client)
invoice_data = process_invoice(processor, "/documents/invoice_001.png")
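
Extractions can still be wrong or incomplete, so it pays to sanity-check the output before it enters a billing workflow. Here is a minimal sketch: the field names mirror the extraction prompt above, and the tolerance value is an arbitrary choice for rounding slack.

def totals_consistent(invoice: dict, tolerance: float = 0.02) -> bool:
    """Check that line items, tax, and total add up before trusting them."""
    line_items = invoice.get("line_items") or []
    subtotal = sum(
        item["quantity"] * item["unit_price"]
        for item in line_items
        if item.get("quantity") is not None and item.get("unit_price") is not None
    )
    expected_total = subtotal + (invoice.get("tax") or 0)
    total = invoice.get("total")
    # Fields the model marked null are skipped rather than treated as errors.
    return total is None or abs(total - expected_total) <= tolerance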

Performance Considerations

Multi-modal requests consume more tokens and incur higher latency than text-only calls, since each image is tokenized alongside the prompt. Use the detail parameter deliberately: "low" processes a single downscaled version of the image for fast, cheap analysis; "high" tiles the image at higher resolution for fine-grained extraction such as invoice line items; "auto" lets the service choose based on the input image.
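
One way to make that choice systematic is to map task types to detail levels up front. The task names in this sketch are illustrative assumptions, not part of the API:

def detail_for_task(task: str) -> str:
    """Pick an image detail level; the task categories here are hypothetical."""
    if task in ("thumbnail_triage", "content_moderation"):
        return "low"   # single downscaled pass: cheapest and fastest
    if task in ("invoice_extraction", "fine_print_reading"):
        return "high"  # higher-resolution tiles for small text
    return "auto"      # let the service decide from the image itself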

Michael John Peña

Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.