Azure OpenAI: The Hidden Costs Nobody Talks About
Your Azure OpenAI bill is higher than expected. Let me guess—you thought you had it figured out, you calculated the token costs, and then reality hit.
Let me show you where the costs hide.
The Obvious Costs
These you probably know:
- Input tokens: What you send to the model
- Output tokens: What the model generates
- Model tier: GPT-4o costs more than GPT-4o-mini
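That part really is a one-liner. A back-of-the-envelope version (the per-million-token prices below are placeholders, not a current rate card):

# Hypothetical request: 1,200 input tokens, 350 output tokens on GPT-4o
# Prices per 1M tokens are illustrative; check the Azure pricing page for your region
input_price, output_price = 2.50, 10.00
cost = (1_200 * input_price + 350 * output_price) / 1_000_000
print(f"${cost:.4f}")  # ≈ $0.0065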
Simple math, right? Wrong.
The Hidden Cost #1: Context Accumulation
Every chat conversation includes the entire history. That “helpful” feature where the AI remembers your conversation? You’re paying for it.
Example conversation:
User: "Summarize this document" (100 tokens + 5000 token document)
AI: "Here's a summary..." (300 tokens)
User: "Now translate it to Spanish"
That second request? You’re paying for:
- Original 100 tokens
- The 5000 token document AGAIN
- Previous 300 token response
- The new 10 token request
- The new response (probably 400 tokens)
Total: 5,810 tokens, not 410.
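You can watch the history grow by counting tokens before every call. A quick sketch using tiktoken (o200k_base is the encoding GPT-4o uses; the ~4 tokens of per-message framing is an approximation):

import tiktoken

enc = tiktoken.get_encoding("o200k_base")  # encoding used by GPT-4o

def count_message_tokens(messages: list[dict]) -> int:
    # Content tokens plus a few tokens of per-message framing
    return sum(len(enc.encode(m["content"])) + 4 for m in messages)

Log that number per request. If it only ever goes up, you're paying for accumulation.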
The Fix
from openai import AzureOpenAI

client = AzureOpenAI(azure_endpoint="...", api_key="...", api_version="2024-06-01")

# Bad: accumulating context
messages = []
messages.append({"role": "user", "content": long_document})
response1 = client.chat.completions.create(messages=messages, model="gpt-4o")
summary = response1.choices[0].message.content
messages.append({"role": "assistant", "content": summary})
messages.append({"role": "user", "content": "Translate to Spanish"})
response2 = client.chat.completions.create(messages=messages, model="gpt-4o")

# Good: passing only the necessary context
messages = [
    {"role": "user", "content": "Translate this summary to Spanish: " + summary}
]
response2 = client.chat.completions.create(messages=messages, model="gpt-4o")
Savings: Potentially 90% on follow-up requests.
The Hidden Cost #2: Embeddings at Scale
“Embeddings are cheap!” Until you embed a million documents. Twice. Because you changed your chunking strategy.
Real scenario from a client:
- 1M documents
- Average 1000 tokens each
- Using text-embedding-ada-002
- Cost: ~$100
Then they realized their chunks were too large. Re-embedding:
- Another $100
Then they wanted to support multiple languages. Per language:
- Another $100
Total: $300 for what they thought would be $100.
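Run the arithmetic before you kick off a batch job; it's one line (the price here is ada-002's list price per 1K tokens at the time, so treat it as illustrative):

docs, avg_tokens = 1_000_000, 1_000
price_per_1k = 0.0001  # text-embedding-ada-002, USD per 1K tokens (illustrative)
per_pass = docs * avg_tokens / 1_000 * price_per_1k  # 100.0
# Every re-chunk and every extra language repeats this full pass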
The Fix
import hashlib
import json

import redis
from openai import AzureOpenAI

redis_client = redis.Redis()
openai_client = AzureOpenAI(azure_endpoint="...", api_key="...", api_version="2024-06-01")

def get_embedding_with_cache(text: str, model: str = "text-embedding-ada-002"):
    # Hash the content so identical text maps to the same cache key
    content_hash = hashlib.sha256(text.encode()).hexdigest()
    cache_key = f"embedding:{model}:{content_hash}"

    # Check cache first
    cached = redis_client.get(cache_key)
    if cached:
        return json.loads(cached)

    # Generate and cache the raw vector
    response = openai_client.embeddings.create(input=text, model=model)
    embedding = response.data[0].embedding
    redis_client.set(cache_key, json.dumps(embedding), ex=86400 * 30)  # 30 days
    return embedding
Only re-embed what actually changed.
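With the cache in place, a re-run only pays for text that actually changed; everything else is a Redis hit. A hypothetical usage sketch (new_chunks is whatever your chunker now produces):

# Re-embedding after a pipeline change: chunk text identical to a previous run costs nothing
vectors = [get_embedding_with_cache(chunk) for chunk in new_chunks]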
The Hidden Cost #3: Retries and Errors
Your code has retry logic. That’s good engineering. That’s also expensive.
from retry import retry

@retry(tries=3, delay=1, backoff=2)
def call_openai(prompt):
    return client.chat.completions.create(...)
This seems reasonable. But if your prompt is 10K tokens and the first two attempts fail mid-generation (timeouts, dropped connections, 5xx errors):
- Attempt 1: 10K tokens (failed)
- Attempt 2: 10K tokens (failed)
- Attempt 3: 10K tokens (success)
You paid for 30K input tokens to get one response. (Requests rejected outright with a 429 generally aren't billed, but anything that fails after generation has started is.)
The Fix
from tenacity import retry, stop_after_attempt, wait_exponential, retry_if_exception_type
from openai import RateLimitError, APITimeoutError

@retry(
    # Only retry transient failures: throttling and timeouts
    retry=retry_if_exception_type((RateLimitError, APITimeoutError)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=4, max=10),
)
def call_openai(prompt):
    # Content-filter rejections (BadRequestError) never hit the retry predicate:
    # they fail the same way every time, so retrying only wastes tokens
    return client.chat.completions.create(...)
Don’t retry errors that won’t succeed on retry.
The Hidden Cost #4: Development vs Production
Your dev environment uses the same API keys as production. Developers are running tests, experimenting, iterating.
One developer’s afternoon of debugging:
- 50 test runs
- Average 5K tokens per run
- $2.50 at GPT-4o rates
Times 10 developers: $25/day or $500/month just for development.
The Fix
Separate deployments and monitoring:
import os

def get_deployment():
    env = os.getenv("ENVIRONMENT", "dev")
    if env == "production":
        return "gpt-4o"
    # Staging and dev both run against the cheapest chat deployment;
    # nothing outside production should burn GPT-4o rates
    return "gpt-4o-mini"
Use cheaper models for development. Reserve GPT-4o for production.
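You can go one step further and cap non-production spend outright. A minimal sketch of a daily token budget per environment, reusing the Redis client from the embedding cache (the limits and key names are made up):

import datetime
import os

DAILY_TOKEN_BUDGET = {"dev": 200_000, "staging": 500_000}

def check_token_budget(tokens_used: int) -> None:
    env = os.getenv("ENVIRONMENT", "dev")
    if env == "production":
        return
    key = f"tokens:{env}:{datetime.date.today().isoformat()}"
    total = redis_client.incrby(key, tokens_used)
    redis_client.expire(key, 86400)  # roll the window daily
    if total > DAILY_TOKEN_BUDGET.get(env, 200_000):
        raise RuntimeError(f"{env} has used up today's token budget")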
The Hidden Cost #5: Streaming Responses
Streaming feels responsive. Users love it. But if they navigate away mid-stream and nothing tells your backend to stop reading, you still pay for the full response.
async def stream_response(prompt):
    stream = await client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=4000
    )
    # User closes the browser at token 100...
    # ...but you still pay for all 4000 tokens the model keeps generating
The Fix
Monitor user engagement and stop generation:
async def stream_response(prompt, stop_event):
    stream = await client.chat.completions.create(...)
    async for chunk in stream:
        if stop_event.is_set():
            # User disconnected: close the stream so generation actually stops
            await stream.close()
            break
        yield chunk
Stop generating when no one’s listening.
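How the stop event gets set depends on your stack. With FastAPI/Starlette, for example, the request object can tell you the client is gone. A sketch wiring it to the stream_response generator above (the route and endpoint shape are made up):

import asyncio

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse

app = FastAPI()

@app.post("/chat")
async def chat(request: Request, prompt: str):
    stop_event = asyncio.Event()

    async def guarded():
        async for chunk in stream_response(prompt, stop_event):
            if await request.is_disconnected():  # client navigated away
                stop_event.set()
                break
            if chunk.choices and chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content

    return StreamingResponse(guarded(), media_type="text/plain")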
Real Numbers
From a recent client project:
- Expected monthly cost: $2,000
- Actual first month: $8,000
- After optimizations: $2,500
Savings: $5,500/month or $66,000/year.
Changes made (each percentage is that change's share of the total reduction):
- Aggressive context pruning (50% savings)
- Caching embeddings (20% savings)
- Using GPT-4o-mini for classification (15% savings)
- Proper retry logic (10% savings)
- Development environment limits (5% savings)
Best Practices
- Log everything - Track tokens per request, per user, per feature
- Set budgets - Alert when spending exceeds thresholds
- Use PTU for predictable load - Provisioned throughput can be cheaper at scale
- Right-size your model - Don’t use GPT-4o for classification tasks
- Cache aggressively - Identical requests should hit cache
- Monitor your prompts - Long system prompts multiply across all requests
Tools I Use
import time
from functools import wraps

# Cost tracking decorator; calculate_cost and log_cost are your own helpers
def track_openai_cost(func):
    @wraps(func)
    async def wrapper(*args, **kwargs):
        start_time = time.time()
        result = await func(*args, **kwargs)
        cost = calculate_cost(
            result.usage.prompt_tokens,
            result.usage.completion_tokens,
            result.model
        )
        log_cost(
            function=func.__name__,
            cost=cost,
            duration=time.time() - start_time,
            model=result.model
        )
        return result
    return wrapper
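calculate_cost is nothing fancy, just a lookup against your rate card; log_cost is whatever your metrics stack provides. A sketch of the former (the per-1M-token prices are placeholders, use the current Azure pricing for your deployments):

# Illustrative USD prices per 1M tokens; substitute your region's rate card
PRICES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def calculate_cost(prompt_tokens: int, completion_tokens: int, model: str) -> float:
    price = PRICES.get(model, PRICES["gpt-4o"])  # unknown models priced pessimistically
    return (prompt_tokens * price["input"] + completion_tokens * price["output"]) / 1_000_000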
Track every call. You can’t optimize what you don’t measure.
The Bottom Line
Azure OpenAI is incredibly powerful. It’s also easy to rack up costs without realizing it. Most teams overspend by 3-4x in their first few months.
Pay attention to:
- Context management
- Caching strategies
- Model selection
- Error handling
- Development practices
Your finance team will thank you.