
The Future of Multimodal AI: What's Next After GPT-4 Vision

Multimodal AI is evolving rapidly. GPT-4 Vision showed us the power of combining text and image understanding. Let's explore what's possible today and what the future might hold.

Current Multimodal Architecture

Traditional multimodal systems work like this:

Audio -> Speech-to-Text -> LLM -> Text-to-Speech -> Audio
Image -> Vision Encoder -> LLM -> Text

Each modality is handled separately, then combined. This works, but has latency and integration challenges.
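
To make the latency cost concrete, here is a minimal sketch of that pipeline. The speech_to_text, run_llm, and text_to_speech functions are hypothetical stand-ins for whichever services you actually use:

import time

def speech_to_text(audio: bytes) -> str:
    return "transcribed user request"  # placeholder for a real STT service

def run_llm(prompt: str) -> str:
    return f"response to: {prompt}"  # placeholder for a real LLM call

def text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")  # placeholder for a real TTS service

def voice_pipeline(audio: bytes) -> bytes:
    start = time.perf_counter()
    transcript = speech_to_text(audio)  # hop 1: audio -> text
    reply = run_llm(transcript)         # hop 2: text -> text
    speech = text_to_speech(reply)      # hop 3: text -> audio
    print(f"three sequential hops took {time.perf_counter() - start:.3f}s")
    return speech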

GPT-4 Vision Today

GPT-4V can analyze images alongside text:

from openai import AzureOpenAI
import base64
import os

client = AzureOpenAI(
    api_version="2024-02-15-preview",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_KEY"]
)

def analyze_image(image_path, prompt):
    # Encode the image as base64 so it can be sent inline as a data URL
    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4-vision-preview",  # with Azure OpenAI, this is your deployment name
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{image_data}",
                            "detail": "high"
                        }
                    }
                ]
            }
        ],
        max_tokens=1000
    )
    return response.choices[0].message.content
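
For example, calling the helper with a hypothetical image file:

print(analyze_image("architecture-diagram.png", "Describe what this diagram shows."))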

Image Detail Levels

GPT-4V supports different detail levels:

Detail | Tokens         | Use Case
------ | -------------- | ---------------------
low    | 85             | Quick classification
high   | 85 + 170/tile  | Detailed analysis
auto   | Varies         | Let the model decide
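
You can estimate the token cost of an image from these rules. The function below follows the documented scaling behavior (longest side capped at 2048 px, shortest side at 768 px, then 512 px tiles); treat it as an approximation.
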
def estimate_tokens(width, height, detail="auto"):
    if detail == "low":
        return 85

    # Cap the longest side at 2048px
    if max(width, height) > 2048:
        ratio = 2048 / max(width, height)
        width = int(width * ratio)
        height = int(height * ratio)

    # Then cap the shortest side at 768px
    if min(width, height) > 768:
        ratio = 768 / min(width, height)
        width = int(width * ratio)
        height = int(height * ratio)

    # 85 base tokens plus 170 per 512x512 tile (ceiling division)
    tiles_x = (width + 511) // 512
    tiles_y = (height + 511) // 512

    return 85 + (170 * tiles_x * tiles_y)
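
For example, a 1024x1024 image at high detail is scaled to 768x768 and covered by four tiles:

print(estimate_tokens(1024, 1024, detail="high"))  # 85 + 170 * 4 = 765 tokens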

Document Processing Pipeline

Processing invoices, receipts, and forms:

import asyncio
from dataclasses import dataclass
from typing import List
import json

@dataclass
class InvoiceData:
    vendor: str
    invoice_number: str
    date: str
    total: float
    line_items: List[dict]

async def process_invoice(image_path: str) -> InvoiceData:
    prompt = """Extract invoice data in JSON format:
    {
        "vendor": "company name",
        "invoice_number": "INV-XXX",
        "date": "YYYY-MM-DD",
        "total": 0.00,
        "line_items": [
            {"description": "", "quantity": 0, "unit_price": 0.00, "total": 0.00}
        ]
    }
    Return only valid JSON."""

    with open(image_path, "rb") as f:
        image_data = base64.b64encode(f.read()).decode("utf-8")

    response = await asyncio.to_thread(
        client.chat.completions.create,
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{image_data}",
                            "detail": "high"
                        }
                    }
                ]
            }
        ],
        max_tokens=1000
    )

    content = response.choices[0].message.content
    # The model may wrap its JSON in a Markdown code fence; keep only the JSON object
    start, end = content.find("{"), content.rfind("}")
    data = json.loads(content[start:end + 1])
    return InvoiceData(**data)
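
Because each request runs in a worker thread, a batch of invoices can be processed concurrently. A minimal sketch, using hypothetical file names:

async def process_batch(paths: List[str]) -> List[InvoiceData]:
    # Fan out the per-invoice calls and wait for all of them
    return await asyncio.gather(*(process_invoice(p) for p in paths))

invoices = asyncio.run(process_batch(["invoice-001.png", "invoice-002.png"]))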

The Future: Unified Multimodal Models

The industry is moving toward models that process all modalities natively:

Audio/Image/Text -> Single Unified Model -> Audio/Image/Text

Training end to end across modalities means these models can capture cross-modal relationships that separate pipeline components miss.

Expected Benefits

  1. Lower latency: No pipeline overhead
  2. Better understanding: Cross-modal relationships preserved
  3. Simpler integration: One API for everything
  4. Cost efficiency: Fewer separate services

Preparing for the Future

Build abstractions that can adapt:

class MultimodalClient:
    """Abstraction for multimodal AI that can evolve"""

    def __init__(self, client, vision_model: str = "gpt-4-vision-preview"):
        self.client = client
        self.vision_model = vision_model

    def analyze_with_image(self, text: str, image_base64: str) -> str:
        response = self.client.chat.completions.create(
            model=self.vision_model,
            messages=[
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": text},
                        {
                            "type": "image_url",
                            "image_url": {"url": f"data:image/png;base64,{image_base64}"}
                        }
                    ]
                }
            ],
            max_tokens=1000
        )
        return response.choices[0].message.content

    def upgrade_model(self, new_model: str):
        """Ready for future multimodal models"""
        self.vision_model = new_model
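
Call sites stay the same even as the underlying model changes. A usage sketch (the image file and future model name are placeholders):

with open("chart.png", "rb") as f:
    image_base64 = base64.b64encode(f.read()).decode("utf-8")

mm = MultimodalClient(client)
print(mm.analyze_with_image("What does this chart show?", image_base64))

# Later, point at a newer multimodal deployment without touching call sites
mm.upgrade_model("your-next-multimodal-deployment")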

Best Practices Today

  1. Optimize image size - Resize before sending to reduce tokens (see the sketch after this list)
  2. Use appropriate detail level - Low for classification, high for extraction
  3. Batch requests - Process multiple images concurrently
  4. Cache results - Store analysis results to avoid reprocessing
  5. Handle errors gracefully - Images may fail validation
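
As a rough illustration of points 1 and 4, here is one way to downscale images with Pillow and cache analyses by content hash; the helper names and cache strategy are my own, not part of any SDK:

import hashlib
from io import BytesIO
from PIL import Image

_analysis_cache: dict = {}

def prepare_image(path: str, max_side: int = 2048) -> bytes:
    # Downscale so we never pay for more tiles than the longest-side cap allows
    img = Image.open(path)
    img.thumbnail((max_side, max_side))
    buf = BytesIO()
    img.save(buf, format="PNG")
    return buf.getvalue()

def cached_analysis(path: str, prompt: str) -> str:
    image_bytes = prepare_image(path)
    key = hashlib.sha256(image_bytes + prompt.encode("utf-8")).hexdigest()
    if key not in _analysis_cache:
        image_b64 = base64.b64encode(image_bytes).decode("utf-8")
        _analysis_cache[key] = MultimodalClient(client).analyze_with_image(prompt, image_b64)
    return _analysis_cache[key]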

What’s Next

The AI landscape is evolving rapidly. Keep an eye on:

  • Microsoft Build announcements (May 21-23)
  • OpenAI updates
  • Azure OpenAI new model deployments

Tomorrow I’ll cover current best practices for voice AI integration.

Michael John Peña

Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.