GPT-4o Vision: Building Image Analysis Applications

GPT-4o accepts images alongside text in the same chat completions request, so you can add image understanding to an application without a separate vision service. Typical uses include document extraction, visual inspection, and chart interpretation.

Basic Image Analysis

from openai import AzureOpenAI
import base64
import mimetypes
import os

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_KEY"],
    api_version="2024-08-01-preview",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"]
)

def encode_image(image_path: str) -> str:
    """Read an image file and return its contents as a base64 string."""
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode("utf-8")

def analyze_image(image_path: str, prompt: str) -> str:
    """Send a local image and a text prompt to GPT-4o and return the reply."""
    base64_image = encode_image(image_path)
    # Match the data URL's MIME type to the file (e.g. image/png for .png);
    # fall back to JPEG when the extension is not recognized.
    mime_type = mimetypes.guess_type(image_path)[0] or "image/jpeg"

    response = client.chat.completions.create(
        model="gpt-4o",  # on Azure, this is the name of your model deployment
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:{mime_type};base64,{base64_image}",
                            "detail": "high"  # or "low" for faster processing
                        }
                    }
                ]
            }
        ],
        max_tokens=1000
    )

    return response.choices[0].message.content
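
The `detail` setting trades accuracy against cost and latency. As a rough guide, the sketch below estimates how many tokens one image consumes, using the tiling rules published for GPT-4o: a flat 85 tokens at low detail, and at high detail 170 tokens per 512-pixel tile plus 85 base tokens after resizing. The function name `estimate_image_tokens` is hypothetical, and the constants are assumptions that may differ between model versions, so treat the result as an estimate rather than a billing figure.

```python
import math

def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Rough token estimate for one image (assumed GPT-4o tiling rules)."""
    if detail == "low":
        return 85  # assumed flat cost for low detail

    # Step 1: scale down to fit within a 2048 x 2048 square, keeping aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale

    # Step 2: scale down so the shortest side is at most 768 px.
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale

    # Step 3: count 512 px tiles; each costs 170 tokens, plus an 85-token base.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85
```

Under these assumptions a 1024 x 1024 image at high detail comes to 765 tokens, against 85 at low detail, which is why "low" is worth trying first when the task does not depend on fine print or small defects.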

Practical Applications

# Document extraction
invoice_data = analyze_image(
    "invoice.jpg",
    "Extract all line items, amounts, and the total from this invoice. Return as JSON."
)

# Quality inspection
defect_report = analyze_image(
    "product_photo.jpg",
    "Inspect this product image for manufacturing defects. List any issues found."
)

# Chart interpretation
chart_summary = analyze_image(
    "sales_chart.png",
    "Describe the trends shown in this chart and identify key insights."
)

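# Multiple images for comparison
# (file names are placeholders; encode_image is defined above)
img1 = encode_image("design_a.jpg")
img2 = encode_image("design_b.jpg")

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Compare these two product designs"},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img1}"}},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{img2}"}}
        ]
    }]
)
comparison = response.choices[0].message.content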
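
The invoice prompt asks for JSON, but `analyze_image` returns plain text, and a model may wrap the JSON in a Markdown code fence or put a sentence in front of it. A minimal sketch of a tolerant parser follows. `extract_json` is a hypothetical helper written for this post, not part of the OpenAI SDK, and it only handles the fenced and bare cases.

```python
import json
import re

# Matches a Markdown code fence (three backticks), optionally tagged "json".
_FENCE = re.compile(r"`{3}(?:json)?\s*(.*?)\s*`{3}", re.DOTALL | re.IGNORECASE)

def extract_json(reply: str):
    """Parse JSON from a model reply, tolerating a surrounding code fence."""
    match = _FENCE.search(reply)
    # Use the fenced body when there is one, otherwise the whole reply.
    candidate = match.group(1) if match else reply.strip()
    return json.loads(candidate)  # raises json.JSONDecodeError if not valid JSON
```

Calling `extract_json(invoice_data)` would then yield a Python object, or raise `json.JSONDecodeError`, which is a useful signal to retry or flag the image for manual review.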

GPT-4o vision handles a wide range of images well, but it can misread text and numbers, so validate its output before relying on it in critical applications. Where accuracy matters most, combine it with traditional computer vision techniques.
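
For the invoice example, one inexpensive validation is an arithmetic cross-check: the extracted line items should add up to the extracted total. The sketch below assumes the parsed reply is a dict with a `line_items` list (each entry carrying an `amount`) and a `total`. Those field names are hypothetical, since the actual shape depends on the prompt and on what the model returns, and the check deliberately ignores tax, shipping, and discounts.

```python
from decimal import Decimal, InvalidOperation

def invoice_is_consistent(invoice: dict, tolerance: str = "0.01") -> bool:
    """Return True if the line item amounts sum to the stated total.

    Assumes hypothetical keys: "line_items" (list of {"amount": ...}) and "total".
    """
    try:
        # Convert via str so floats like 19.99 do not pick up binary rounding noise.
        amounts = [Decimal(str(item["amount"])) for item in invoice["line_items"]]
        total = Decimal(str(invoice["total"]))
    except (KeyError, TypeError, InvalidOperation):
        return False  # missing or non-numeric fields count as a failed check
    return abs(sum(amounts, Decimal("0")) - total) <= Decimal(tolerance)
```

An invoice that fails the check is not necessarily wrong, but it is a good candidate for a second pass or a human reviewer.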

Michael John Peña

Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.