
Prompt Caching with Claude and Azure OpenAI: Reducing Latency and Costs

Prompt caching is a game-changer for AI applications with repetitive context. Both Claude and Azure OpenAI now support prompt caching (explicit cache breakpoints on Claude, automatic prefix caching on Azure OpenAI), reducing input-token costs on cached content by up to 90% and cutting latency substantially for qualifying requests. Here’s how to architect your applications to maximize cache hits.

Understanding Cache Mechanics

Caching works on prefix matching: the system caches the processed prefix of a prompt and reuses it when subsequent requests share that same prefix. Structure your prompts with static content first so the cacheable portion stays identical across calls:

from anthropic import AsyncAnthropic

client = AsyncAnthropic()

# Static system context that will be cached
SYSTEM_CONTEXT = """You are an expert financial analyst assistant.
You have access to the following company data:

<company_financials>
{financial_data}  # 50KB of structured financial data
</company_financials>

<analysis_guidelines>
{guidelines}  # 10KB of analysis guidelines
</analysis_guidelines>

Always cite specific numbers from the data."""

async def analyze_query(user_query: str, financial_data: str, guidelines: str):
    response = await client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=2048,
        system=[
            {
                "type": "text",
                "text": SYSTEM_CONTEXT.format(
                    financial_data=financial_data,
                    guidelines=guidelines
                ),
                "cache_control": {"type": "ephemeral"}
            }
        ],
        messages=[{"role": "user", "content": user_query}]
    )

    # Check cache performance
    usage = response.usage
    print(f"Cache read tokens: {usage.cache_read_input_tokens}")
    print(f"Cache write tokens: {usage.cache_creation_input_tokens}")

    return response.content[0].text
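
A rough illustration of the write/read cycle, assuming financials.json and guidelines.md are hypothetical local files whose combined content exceeds the minimum cacheable prefix length (on the order of 1,024 tokens for Sonnet models):

import asyncio

async def main():
    financial_data = open("financials.json").read()  # hypothetical data file
    guidelines = open("guidelines.md").read()        # hypothetical guidelines file

    # First call processes the full prefix and writes it to the cache
    # (non-zero cache_creation_input_tokens).
    await analyze_query("What was Q3 revenue growth?", financial_data, guidelines)

    # A second call sharing the identical prefix, made within the cache lifetime
    # (roughly five minutes for ephemeral entries), should report
    # cache_read_input_tokens instead.
    await analyze_query("How has gross margin trended year over year?", financial_data, guidelines)

asyncio.run(main())

Any change to the cached prefix, even a single character, breaks the match, so keep variable content in the user message rather than the system block.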

Azure OpenAI Caching Strategy

For Azure OpenAI, caching happens automatically for sufficiently long prompts (currently a minimum of 1,024 tokens). Optimize by ensuring your system message and initial context remain byte-for-byte identical across requests:

from openai import AzureOpenAI

client = AzureOpenAI(...)

# Keep the system message identical across requests so the cached prefix matches
SYSTEM_MESSAGE = {"role": "system", "content": static_context}  # static_context: the large, shared context string

def get_completion(user_message: str, conversation_history: list):
    messages = [SYSTEM_MESSAGE] + conversation_history + [
        {"role": "user", "content": user_message}
    ]
    return client.chat.completions.create(
        model="gpt-4o",  # for Azure OpenAI this is your deployment name
        messages=messages
    )
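
To confirm cache hits on the Azure side, newer API versions expose cached token counts on the usage object. A minimal check, assuming prompt_tokens_details is available in your API version:

response = get_completion("Summarize Q3 performance.", conversation_history=[])
usage = response.usage

# cached_tokens is the portion of the prompt served from the cache
details = usage.prompt_tokens_details
cached = details.cached_tokens if details else 0
print(f"Prompt tokens: {usage.prompt_tokens}, cached: {cached}")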

Measuring Impact

Track cache hit rates in your monitoring. Applications with high context reuse typically see 60-80% cache hit rates, dramatically reducing per-request costs.
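
A minimal sketch of that metric, assuming you log the usage objects returned above (field names differ between the two providers):

def claude_cache_hit_ratio(usage) -> float:
    # Anthropic reports cache reads and writes separately from input_tokens
    cached = usage.cache_read_input_tokens or 0
    written = usage.cache_creation_input_tokens or 0
    total = usage.input_tokens + cached + written
    return cached / total if total else 0.0

def azure_cache_hit_ratio(usage) -> float:
    # OpenAI-style usage counts cached tokens as a subset of prompt_tokens
    details = getattr(usage, "prompt_tokens_details", None)
    cached = details.cached_tokens if details else 0
    return cached / usage.prompt_tokens if usage.prompt_tokens else 0.0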

Michael John Peña

Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.