Structured Output with JSON Mode: Reliable Data Extraction from LLMs
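Extracting structured data from LLMs has historically been unreliable: models wrapped JSON in prose, dropped fields, or emitted malformed output. JSON mode guarantees syntactically valid JSON, and structured outputs go further by constraining the response to a schema you supply. Here's how to use these features effectively in data extraction pipelines.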
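To make the difference between valid JSON and schema-conformant JSON concrete, here is a small sketch that uses only the standard library. The `REQUIRED_KEYS` set and the `missing_keys` helper are hypothetical names for illustration, not part of any SDK. It shows a reply that parses cleanly yet is still unusable downstream.

```python
import json

# Hypothetical required fields for a ticket; not part of any SDK.
REQUIRED_KEYS = {"title", "description", "priority", "category"}


def missing_keys(raw_reply: str) -> set[str]:
    """Parse a model reply and report which required keys are absent."""
    data = json.loads(raw_reply)  # raises json.JSONDecodeError on invalid JSON
    return REQUIRED_KEYS - set(data)


# Well-formed JSON with the wrong shape: parsing succeeds, the data is incomplete.
reply = '{"title": "Server down", "urgency": "very high"}'
print(sorted(missing_keys(reply)))  # ['category', 'description', 'priority']
```

Constraining the response to a schema removes the need for this kind of shape check, because required fields and types are enforced at generation time.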
Using OpenAI Structured Outputs
With structured outputs, you pass a Pydantic model as the response format and the SDK returns a parsed instance of that model:
from enum import Enum
from typing import Optional

from openai import AsyncAzureOpenAI  # async client, required for the await below
from pydantic import BaseModel


class Priority(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"
    CRITICAL = "critical"


class ExtractedTicket(BaseModel):
    title: str
    description: str
    priority: Priority
    category: str
    # Optional fields are nullable: the model returns null when the email gives no value.
    affected_system: Optional[str]
    estimated_hours: Optional[float]
    tags: list[str]


client = AsyncAzureOpenAI(...)


async def extract_ticket_info(email_content: str) -> ExtractedTicket:
    response = await client.beta.chat.completions.parse(
        model="gpt-4o",  # on Azure, this is your deployment name
        messages=[
            {
                "role": "system",
                "content": (
                    "Extract support ticket information from customer emails.\n"
                    "Infer priority from urgency language.\n"
                    "Categories: billing, technical, feature-request, general"
                ),
            },
            {
                "role": "user",
                "content": email_content,
            },
        ],
        # Passing the Pydantic class makes the SDK request and parse this schema.
        response_format=ExtractedTicket,
    )
    return response.choices[0].message.parsed
# Usage
email = """
Subject: URGENT - Production server down!
Our main application server crashed this morning and we can't
process any orders. This is affecting our entire business.
The error logs mention database connection timeouts.
Please help immediately!
"""
# Top-level await works in a notebook; in a script, wrap the call in asyncio.run().
ticket = await extract_ticket_info(email)
# Returns something like:
# ExtractedTicket(
#     title="Production server down",
#     priority=Priority.CRITICAL,
#     category="technical",
#     affected_system="application server",
#     ...
# )
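One gap worth noting in the ticket schema: `category` is typed as a plain `str`, so the four categories named in the system prompt are not enforced by the schema itself. Below is a minimal sketch of a post-extraction check. The `normalize_category` helper and the `"general"` fallback are illustrative assumptions, not part of the original pipeline.

```python
# Categories listed in the system prompt.
ALLOWED_CATEGORIES = {"billing", "technical", "feature-request", "general"}


def normalize_category(raw: str, fallback: str = "general") -> str:
    """Map a model-supplied category onto the allowed set.

    Hypothetical helper: lowercases, trims, and treats spaces or
    underscores as hyphens before checking membership.
    """
    cleaned = raw.strip().lower().replace("_", "-").replace(" ", "-")
    return cleaned if cleaned in ALLOWED_CATEGORIES else fallback


print(normalize_category("Feature Request"))  # feature-request
print(normalize_category("outage"))           # general
```

Another option is to declare `category` as an `Enum`, the way `Priority` already is, which moves the constraint into the schema and makes a helper like this unnecessary.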
Handling Complex Nested Structures
Pydantic models can nest, so a schema can contain lists of other models. This example reuses the imports and client from the previous section:
class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float


class Invoice(BaseModel):
    invoice_number: str
    vendor_name: str
    issue_date: str
    due_date: str
    line_items: list[LineItem]  # nested list of models
    subtotal: float
    tax_rate: float
    tax_amount: float
    total_amount: float
    payment_terms: Optional[str]


async def extract_invoice(document_text: str) -> Invoice:
    # Await the API call first, then read the parsed result off the response.
    # Chaining .choices onto the un-awaited call would fail on the coroutine.
    response = await client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Extract invoice data:\n\n{document_text}"}],
        response_format=Invoice,
    )
    return response.choices[0].message.parsed
Validation and Error Handling
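Structured outputs eliminate JSON parsing errors, but not semantic ones: a schema-valid invoice can still have line items that do not sum to the subtotal. The model can also refuse a request, in which case the message carries a refusal instead of a parsed object, so check for that before using the result. Add business validation for semantic correctness. The schema ensures format; you ensure meaning.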
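As a sketch of what business validation might look like for the invoice schema, the checks below verify the arithmetic the schema cannot. To keep the example runnable without an API call, it uses stand-in dataclasses (`LineItemData`, `InvoiceData`) that mirror the numeric fields of the Pydantic models. Those names, the `validate_invoice` helper, and the 0.01 tolerance are all assumptions for illustration. The check on `tax_rate` is left out because the schema does not say whether it is a fraction or a percentage.

```python
from dataclasses import dataclass


@dataclass
class LineItemData:
    """Stand-in for the numeric fields of LineItem."""
    quantity: int
    unit_price: float
    total: float


@dataclass
class InvoiceData:
    """Stand-in for the numeric fields of Invoice (tax_rate omitted)."""
    line_items: list[LineItemData]
    subtotal: float
    tax_amount: float
    total_amount: float


def validate_invoice(inv: InvoiceData, tol: float = 0.01) -> list[str]:
    """Return a list of human-readable problems; an empty list means consistent."""
    problems = []
    for i, item in enumerate(inv.line_items):
        if abs(item.quantity * item.unit_price - item.total) > tol:
            problems.append(f"line {i}: quantity * unit_price != total")
    if abs(sum(item.total for item in inv.line_items) - inv.subtotal) > tol:
        problems.append("line item totals do not sum to subtotal")
    if abs(inv.subtotal + inv.tax_amount - inv.total_amount) > tol:
        problems.append("subtotal + tax_amount != total_amount")
    return problems


items = [LineItemData(2, 10.0, 20.0), LineItemData(1, 5.5, 5.5)]
good = InvoiceData(items, subtotal=25.5, tax_amount=2.55, total_amount=28.05)
bad = InvoiceData(items, subtotal=25.5, tax_amount=2.55, total_amount=30.0)

print(validate_invoice(good))  # []
print(validate_invoice(bad))   # ['subtotal + tax_amount != total_amount']
```

Because the function only reads attributes, the same checks should work on a parsed `Invoice` instance. Returning a list of problems rather than raising on the first one lets a pipeline log every inconsistency or route the document to manual review.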