Structured Output with JSON Mode: Reliable Data Extraction from LLMs

Extracting structured data from LLMs has historically been unreliable. JSON mode guarantees syntactically valid JSON, and structured outputs go further by constraining the response to a schema you supply. Here’s how to use these features effectively in data extraction pipelines.

Using OpenAI Structured Outputs

The structured output feature ensures responses match your exact schema:

from openai import AsyncAzureOpenAI  # async client: the calls below are awaited
from pydantic import BaseModel
from typing import Optional
from enum import Enum

class Priority(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"

class ExtractedTicket(BaseModel):
    title: str
    description: str
    priority: Priority
    category: str
    # Optional fields may come back as None when the email does not mention them
    affected_system: Optional[str]
    estimated_hours: Optional[float]
    tags: list[str]

client = AsyncAzureOpenAI(...)

async def extract_ticket_info(email_content: str) -> ExtractedTicket:
    response = await client.beta.chat.completions.parse(
        model="gpt-4o",  # on Azure OpenAI, pass your deployment name here
        messages=[
            {
                "role": "system",
                "content": """Extract support ticket information from customer emails.
                Infer priority from urgency language.
                Categories: billing, technical, feature-request, general"""
            },
            {
                "role": "user",
                "content": email_content
            }
        ],
        response_format=ExtractedTicket
    )

    return response.choices[0].message.parsed

# Usage
email = """
Subject: URGENT - Production server down!

Our main application server crashed this morning and we can't
process any orders. This is affecting our entire business.
The error logs mention database connection timeouts.

Please help immediately!
"""

# Top-level await works in a notebook; in a script, wrap this call in asyncio.run()
ticket = await extract_ticket_info(email)
# Returns: ExtractedTicket(
#     title="Production server down",
#     priority=Priority.CRITICAL,
#     category="technical",
#     affected_system="application server",
#     ...
# )
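
The function above returns message.parsed directly. In the OpenAI Python SDK that attribute is None when the model declines the request, in which case message.refusal carries the explanation. A minimal sketch of a guard, assuming a hypothetical helper named unwrap_parsed and a hypothetical ExtractionRefused error (neither is part of the SDK); fake_response is a stand-in so the snippet runs without an API call:

```python
from types import SimpleNamespace


class ExtractionRefused(Exception):
    """Hypothetical error type for this sketch; not part of the OpenAI SDK."""


def unwrap_parsed(response):
    """Return the parsed object, or raise if the model refused or returned nothing."""
    message = response.choices[0].message
    refusal = getattr(message, "refusal", None)
    if refusal:
        # The model declined to answer; surface its explanation to the caller
        raise ExtractionRefused(refusal)
    if message.parsed is None:
        raise ValueError("response contained no parsed object")
    return message.parsed


def fake_response(parsed=None, refusal=None):
    """Stand-in with the same attribute shape as a parsed chat completion."""
    message = SimpleNamespace(parsed=parsed, refusal=refusal)
    return SimpleNamespace(choices=[SimpleNamespace(message=message)])


ticket = unwrap_parsed(fake_response(parsed={"title": "Production server down"}))
print(ticket["title"])
```

In extract_ticket_info, the last line would then become return unwrap_parsed(response), so callers never receive None in place of a ticket.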

Handling Complex Nested Structures

Define complex schemas with nested objects:

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class Invoice(BaseModel):
    invoice_number: str
    vendor_name: str
    issue_date: str
    due_date: str
    line_items: list[LineItem]
    subtotal: float
    tax_rate: float
    tax_amount: float
    total_amount: float
    payment_terms: Optional[str]

async def extract_invoice(document_text: str) -> Invoice:
    # Await the completion before reading its fields
    response = await client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Extract invoice data:\n\n{document_text}"}],
        response_format=Invoice
    )
    return response.choices[0].message.parsed

Validation and Error Handling

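Structured outputs eliminate JSON parsing errors, but not every failure: the model can still refuse a request, and a response cut off by the token limit cannot be parsed. The schema also says nothing about whether the extracted values are right. Add business validation for semantic correctness - the schema ensures format, you ensure meaning.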
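
As one illustration of business validation, the invoice fields can be checked against each other after extraction. The sketch below is an assumption about what such checks might look like, not a prescribed method: validate_invoice and TOLERANCE are hypothetical names, and the dataclasses are trimmed stand-ins for the Invoice and LineItem models so the snippet runs without Pydantic or an API call. It skips tax_rate because the schema does not say whether the rate is a fraction or a percentage.

```python
from dataclasses import dataclass
from math import isclose

TOLERANCE = 0.01  # assumed: allow one cent of rounding drift


@dataclass
class LineItem:  # stand-in for the LineItem model
    description: str
    quantity: int
    unit_price: float
    total: float


@dataclass
class Invoice:  # stand-in with only the fields the checks read
    line_items: list[LineItem]
    subtotal: float
    tax_amount: float
    total_amount: float


def validate_invoice(invoice) -> list[str]:
    """Return human-readable problems; an empty list means the numbers agree."""
    problems = []
    for i, item in enumerate(invoice.line_items):
        if not isclose(item.quantity * item.unit_price, item.total, abs_tol=TOLERANCE):
            problems.append(f"line item {i}: quantity * unit_price != total")
    items_sum = sum(item.total for item in invoice.line_items)
    if not isclose(items_sum, invoice.subtotal, abs_tol=TOLERANCE):
        problems.append("subtotal does not match the sum of line items")
    expected_total = invoice.subtotal + invoice.tax_amount
    if not isclose(expected_total, invoice.total_amount, abs_tol=TOLERANCE):
        problems.append("total_amount != subtotal + tax_amount")
    return problems


invoice = Invoice(
    line_items=[LineItem("Widget", 2, 10.0, 20.0), LineItem("Gadget", 1, 5.5, 5.5)],
    subtotal=25.5,
    tax_amount=2.55,
    total_amount=30.0,  # deliberately wrong
)
print(validate_invoice(invoice))
```

Because the function only reads attributes, it should work unchanged on the Pydantic Invoice returned by extract_invoice. Returning a list instead of raising lets a pipeline decide whether to reject the document, retry the extraction, or route it to manual review.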

Michael John Peña

Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.