Azure OpenAI: The Hidden Costs Nobody Talks About

Your Azure OpenAI bill is higher than expected. Let me guess—you thought you had it figured out, you calculated the token costs, and then reality hit.

Let me show you where the costs hide.

The Obvious Costs

These you probably know:

  • Input tokens: What you send to the model
  • Output tokens: What the model generates
  • Model tier: GPT-4o costs more than GPT-4o-mini

Simple math, right? Wrong.

The Hidden Cost #1: Context Accumulation

Every chat conversation includes the entire history. That “helpful” feature where the AI remembers your conversation? You’re paying for it.

Example conversation:

User: "Summarize this document" (100 tokens + 5000 token document)
AI: "Here's a summary..." (300 tokens)
User: "Now translate it to Spanish"

That second request? You’re paying for:

  • Original 100 tokens
  • The 5000 token document AGAIN
  • Previous 300 token response
  • The new 10 token request
  • The new response (probably 400 tokens)

Total: 5,810 tokens, not 410.
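
If you want to see the accumulation in your own logs, count prompt tokens before every call. A rough sketch using tiktoken (this assumes a recent tiktoken with the o200k_base encoding GPT-4o uses; exact accounting adds a few tokens of per-message framing):

import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # GPT-4o's encoding

def count_prompt_tokens(messages: list[dict]) -> int:
    # Counts only message content; the real prompt adds a few tokens of
    # per-message overhead on top of this
    return sum(len(enc.encode(m["content"])) for m in messages)

Log that number per request and the growth curve is hard to miss.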

The Fix

from openai import AzureOpenAI

client = AzureOpenAI(azure_endpoint="...", api_key="...", api_version="...")

# Bad: accumulating context, so the full document is re-sent on every turn
messages = [{"role": "user", "content": long_document}]
response1 = client.chat.completions.create(messages=messages, model="gpt-4o")
summary = response1.choices[0].message.content

messages.append({"role": "assistant", "content": summary})
messages.append({"role": "user", "content": "Translate to Spanish"})
response2 = client.chat.completions.create(messages=messages, model="gpt-4o")

# Good: passing only the context the follow-up actually needs
messages = [
    {"role": "user", "content": "Translate this summary to Spanish: " + summary}
]
response2 = client.chat.completions.create(messages=messages, model="gpt-4o")

Savings: Potentially 90% on follow-up requests.

The Hidden Cost #2: Embeddings at Scale

“Embeddings are cheap!” Until you embed a million documents. Twice. Because you changed your chunking strategy.

Real scenario from a client:

  • 1M documents
  • Average 1000 tokens each
  • Using text-embedding-ada-002
  • Cost: ~$100

Then they realized their chunks were too large. Re-embedding:

  • Another $100

Then they wanted to support multiple languages. Per language:

  • Another $100

Total: $300 and counting, for what they thought would be a one-off $100.
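
A thirty-second estimate before a bulk embedding job makes these surprises visible up front. A sketch using the roughly $0.10 per million tokens implied by the figures above (an assumption, not a quoted price; check current Azure pricing for your model and region):

def estimate_embedding_cost(num_docs: int, avg_tokens: int,
                            usd_per_million_tokens: float = 0.10) -> float:
    # Rate is derived from the numbers above, not official pricing
    return num_docs * avg_tokens / 1_000_000 * usd_per_million_tokens

# 1M docs x 1,000 tokens ≈ $100 per full pass; multiply by every pass you expect
print(estimate_embedding_cost(1_000_000, 1_000))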

The Fix

import hashlib
import json

import redis
from openai import AzureOpenAI

# Wire these up for your environment
redis_client = redis.Redis()
openai_client = AzureOpenAI(azure_endpoint="...", api_key="...", api_version="...")

def get_embedding_with_cache(text: str, model: str = "text-embedding-ada-002") -> list[float]:
    # Hash the content so identical text maps to the same cache entry
    content_hash = hashlib.sha256(text.encode()).hexdigest()
    cache_key = f"embedding:{model}:{content_hash}"

    # Check cache first
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)

    # Generate, extract the vector, and cache it
    response = openai_client.embeddings.create(input=text, model=model)
    embedding = response.data[0].embedding

    redis_client.set(cache_key, json.dumps(embedding), ex=86400 * 30)  # 30 days
    return embedding

Only re-embed what actually changed.

The Hidden Cost #3: Retries and Errors

Your code has retry logic. That’s good engineering. That’s also expensive.

from retry import retry

@retry(tries=3, delay=1, backoff=2)
def call_openai(prompt):
    return client.chat.completions.create(...)

This seems reasonable. But not every failure costs the same. A request rejected outright with a 429 generally isn't billed, because nothing was processed. The expensive retries are the ones where the model does the work and you throw it away: your client times out, the model finishes generating anyway, and the retry sends the same 10K-token prompt again:

  • Attempt 1: 10K tokens (client gave up; response generated anyway)
  • Attempt 2: 10K tokens (client gave up; response generated anyway)
  • Attempt 3: 10K tokens (success)

You paid for 30K prompt tokens, plus two responses nobody read, to get one answer.

The Fix

from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from openai import RateLimitError

@retry(
    retry=retry_if_exception_type(RateLimitError),  # Only retry rate limits
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10)
)
def call_openai(prompt):
    # Anything else (content-filter rejections, bad requests, auth errors)
    # propagates immediately instead of burning paid retries
    return client.chat.completions.create(...)

Don’t retry errors that won’t succeed on retry, and give the client enough time to finish long responses instead of abandoning them and paying again.

The Hidden Cost #4: Development vs Production

Your dev environment uses the same API keys as production. Developers are running tests, experimenting, iterating.

One developer’s afternoon of debugging:

  • 50 test runs
  • Average 5K tokens per run
  • $2.50 at GPT-4o rates

Times 10 developers: $25/day or $500/month just for development.

The Fix

Use separate deployments per environment, and monitor them separately:

import os

def get_deployment():
    # Returns the Azure deployment name to call; use your own deployment names
    env = os.getenv("ENVIRONMENT", "dev")

    if env == "production":
        return "gpt-4o"
    else:
        return "gpt-4o-mini"  # A fraction of GPT-4o's cost; plenty for staging and dev

Use cheaper models for development. Reserve GPT-4o for production.
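
The call sites then stay identical across environments; only the deployment name changes (this assumes your Azure deployment names match the strings above):

response = client.chat.completions.create(
    model=get_deployment(),  # resolved per environment
    messages=[{"role": "user", "content": prompt}],
)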

The Hidden Cost #5: Streaming Responses

Streaming feels responsive. Users love it. But if they navigate away mid-stream, you still pay for the full response.

async def stream_response(prompt):
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=4000
    )

    # If the user closes the browser at token 100, the model keeps generating
    # and you pay for everything it produces, up to the full 4000 tokens
    async for chunk in stream:
        yield chunk

The Fix

Monitor user engagement and stop generation:

async def stream_response(prompt, stop_event):
    stream = await client.chat.completions.create(...)

    async for chunk in stream:
        if stop_event.is_set():
            # User disconnected: close the stream so the request ends
            # instead of generating tokens nobody will read
            await stream.close()
            break
        yield chunk

Stop generating when no one’s listening.
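
How stop_event gets set depends on your web framework. A rough sketch assuming FastAPI, where Starlette's request object exposes is_disconnected(), reusing the stream_response generator above:

import asyncio

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/chat")
async def chat(request: Request, prompt: str):
    stop_event = asyncio.Event()

    async def body():
        async for chunk in stream_response(prompt, stop_event):
            # Flag the generator to stop as soon as the client goes away
            if await request.is_disconnected():
                stop_event.set()
                break
            # Forward only the text delta to the browser
            yield chunk.choices[0].delta.content or ""

    return StreamingResponse(body(), media_type="text/plain")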

Real Numbers

From a recent client project:

  • Expected monthly cost: $2,000
  • Actual first month: $8,000
  • After optimizations: $2,500

Savings: $5,500/month or $66,000/year.

Changes made (approximate share of the total savings):

  • Aggressive context pruning (50% savings)
  • Caching embeddings (20% savings)
  • Using GPT-4o-mini for classification (15% savings)
  • Proper retry logic (10% savings)
  • Development environment limits (5% savings)

Best Practices

  1. Log everything - Track tokens per request, per user, per feature
  2. Set budgets - Alert when spending exceeds thresholds (see the sketch after this list)
  3. Use PTU for predictable load - Provisioned throughput can be cheaper at scale
  4. Right-size your model - Don’t use GPT-4o for classification tasks
  5. Cache aggressively - Identical requests should hit cache
  6. Monitor your prompts - Long system prompts multiply across all requests
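
For point 2, even a crude in-process counter catches runaway spend early. A minimal sketch (the threshold is a placeholder; in production you'd persist spend in Redis or a database and alert through your monitoring stack):

from collections import defaultdict
from datetime import date

DAILY_BUDGET_USD = 50.0          # hypothetical threshold
_spend = defaultdict(float)      # ISO date -> dollars spent

def record_spend(cost_usd: float) -> None:
    today = date.today().isoformat()
    _spend[today] += cost_usd
    if _spend[today] > DAILY_BUDGET_USD:
        # Swap for a real alert: email, Teams, Azure Monitor, ...
        print(f"WARNING: OpenAI spend ${_spend[today]:.2f} is over budget for {today}")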

Tools I Use

# Cost tracking decorator
import time
from functools import wraps

# calculate_cost maps token counts to dollars for the model's pricing;
# log_cost ships the record to whatever logging/metrics stack you use
def track_openai_cost(func):
    @wraps(func)
    async def wrapper(*args, **kwargs):
        start_time = time.time()
        result = await func(*args, **kwargs)

        cost = calculate_cost(
            result.usage.prompt_tokens,
            result.usage.completion_tokens,
            result.model
        )

        log_cost(
            function=func.__name__,
            cost=cost,
            duration=time.time() - start_time,
            model=result.model
        )

        return result
    return wrapper

Track every call. You can’t optimize what you don’t measure.
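
Applied at a call site it looks like this (a sketch assuming an async client named client and a deployment called gpt-4o):

@track_openai_cost
async def summarize(text: str):
    return await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )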

The Bottom Line

Azure OpenAI is incredibly powerful. It’s also easy to rack up costs without realizing it. Most teams overspend by 3-4x in their first few months.

Pay attention to:

  • Context management
  • Caching strategies
  • Model selection
  • Error handling
  • Development practices

Your finance team will thank you.

Michael John Peña

Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.