Monitoring AI Applications in Production
Your AI app works in staging. You deploy to production. Then silence. No errors. No alerts. Is it working? You don’t know.
This is how most AI deployments start. Here’s how to fix it.
Why AI Monitoring Is Different
Traditional apps: request comes in, response goes out. Monitor latency, errors, throughput. Done.
AI apps: request comes in, LLM generates something unpredictable, response goes out. The response might be wrong, harmful, or irrelevant—and your monitoring won’t catch it because there’s no error code for “bad answer.”
The Four Layers of AI Monitoring
Layer 1: Infrastructure
The basics. These are the same as for any other application.
# Track in Application Insights or your APM of choice
# counter/histogram/gauge are whatever instrument objects your APM SDK provides.
metrics = {
    "request_count": counter,          # monotonic count of incoming requests
    "response_latency_ms": histogram,  # distribution, so P50/P95/P99 stay readable
    "error_rate": gauge,               # current error percentage
    "active_connections": gauge        # current open connections
}
If the app is down, nothing else matters. Monitor infrastructure first.
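A minimal sketch of creating those instruments with the OpenTelemetry Python metrics API (one option among many; any APM SDK works the same way). The meter name and attribute keys are placeholders, and exporter setup is assumed to happen elsewhere.
from opentelemetry import metrics

# Exporter/provider setup (Azure Monitor, Prometheus, ...) is assumed to be
# configured elsewhere; this only creates the instruments named above.
meter = metrics.get_meter("ai-app")

request_count = meter.create_counter("request_count")
response_latency_ms = meter.create_histogram("response_latency_ms", unit="ms")
active_connections = meter.create_up_down_counter("active_connections")

def record_request(latency_ms, error):
    # Call once per request; error rate can be derived from the "error" attribute.
    request_count.add(1, {"error": str(error)})
    response_latency_ms.record(latency_ms)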
Layer 2: LLM-Specific Metrics
def track_llm_call(response, latency_ms):
    # 'metrics' is your telemetry client; most SDKs don't expose latency on the
    # response object, so time the call yourself and pass it in as latency_ms.
    metrics.record({
        "model": response.model,
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "total_tokens": response.usage.total_tokens,
        "latency_ms": latency_ms,
        "finish_reason": response.choices[0].finish_reason,
        "estimated_cost": calculate_cost(response.usage, response.model)
    })
Track tokens, costs, and latency per model. You’ll want this when the bill arrives.
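calculate_cost isn't defined anywhere above; a minimal sketch follows, with the per-1,000-token prices as illustrative placeholders rather than current list prices. Keep them in configuration, since pricing changes.
# Illustrative prices per 1,000 tokens -- placeholders, not a real price sheet.
PRICE_PER_1K_TOKENS = {
    "gpt-4o":      {"prompt": 0.005, "completion": 0.015},
    "gpt-4o-mini": {"prompt": 0.0002, "completion": 0.0006},
}

def calculate_cost(usage, model):
    prices = PRICE_PER_1K_TOKENS.get(model)
    if prices is None:
        return None  # unknown model: surface the gap instead of guessing
    return (usage.prompt_tokens / 1000) * prices["prompt"] \
         + (usage.completion_tokens / 1000) * prices["completion"]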
Layer 3: Quality Metrics
This is where it gets interesting.
def track_response_quality(input_text, output_text, user_feedback=None):
    metrics.record({
        "response_length": len(output_text),
        "contains_refusal": detect_refusal(output_text),
        "safety_score": content_safety_check(output_text),
        "relevance_score": quick_relevance_check(input_text, output_text),
        "user_thumbs_up": user_feedback == "positive",
        "user_thumbs_down": user_feedback == "negative"
    })
Automated quality checks catch obvious problems. User feedback catches the subtle ones.
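None of those helpers exist out of the box, and cheap heuristics go a long way as a first pass. A sketch, where the refusal phrases and the word-overlap relevance score are crude placeholders you would tune for your domain; content_safety_check would typically wrap the Content Safety API.
REFUSAL_MARKERS = (
    "i can't help with", "i cannot help with", "i'm sorry, but",
    "as an ai", "i am unable to",
)

def detect_refusal(output_text):
    text = output_text.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)

def quick_relevance_check(input_text, output_text):
    # Crude lexical overlap: share of input words that reappear in the output.
    # Enough to flag wildly off-topic answers, not a real relevance model.
    input_words = set(input_text.lower().split())
    output_words = set(output_text.lower().split())
    if not input_words:
        return 0.0
    return len(input_words & output_words) / len(input_words)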
Layer 4: Business Metrics
def track_business_impact():
    # count/rate/duration are placeholders for values aggregated by your
    # product analytics over the reporting window (e.g., the last day).
    metrics.record({
        "tasks_completed_with_ai": count,
        "tasks_completed_without_ai": count,
        "ai_suggestions_accepted": rate,
        "ai_suggestions_rejected": rate,
        "time_saved_estimate": duration
    })
If AI isn’t improving business outcomes, the technical metrics don’t matter.
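How those numbers get computed depends on your product analytics; one cheap approach is to derive the rates from raw accept/reject events at report time. The event names here are made up.
def suggestion_acceptance_rate(events):
    # 'events' is an iterable of event-name strings from your analytics store.
    accepted = sum(1 for e in events if e == "ai_suggestion_accepted")
    rejected = sum(1 for e in events if e == "ai_suggestion_rejected")
    total = accepted + rejected
    return accepted / total if total else None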
Alerting Strategy
Immediate alerts (page someone):
- Error rate > 5%
- Latency > 30 seconds
- Safety filter triggered > 10 times in an hour
- Cost spike > 200% of daily average (see the sketch after these lists)
Daily digest:
- Quality score trends
- User feedback summary
- Cost breakdown by model/feature
- Top refused or failed queries
Weekly review:
- Random sample of conversations for human review
- Drift detection in response patterns
- Feature adoption metrics
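The "page someone" thresholds are easy to encode as a scheduled check. A sketch of the cost-spike rule, assuming spend is already aggregated per day somewhere; page_oncall is a placeholder for whatever actually pages you (PagerDuty, Opsgenie, an Azure Monitor action group).
COST_SPIKE_THRESHOLD = 2.0  # 200% of the trailing daily average

def check_cost_spike(today_spend, trailing_daily_avg):
    # Run on a schedule (e.g., hourly); both inputs come from your metrics store.
    if trailing_daily_avg > 0 and today_spend > COST_SPIKE_THRESHOLD * trailing_daily_avg:
        page_oncall(  # placeholder for your paging integration
            f"AI spend spike: ${today_spend:.2f} today vs "
            f"${trailing_daily_avg:.2f} daily average"
        )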
The Dashboard
Every AI app needs a dashboard with:
- Cost tracker - Real-time spend vs budget
- Quality gauge - Rolling average of quality scores
- User satisfaction - Thumbs up/down ratio
- Latency distribution - P50, P95, P99
- Safety alerts - Content filter triggers
Azure-Specific Tools
Application Insights: Custom metrics and distributed tracing for the full pipeline.
Azure Monitor: Alerts and dashboards.
Content Safety API: Built-in content filtering metrics.
Azure OpenAI diagnostics: Token usage and throttling metrics out of the box.
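One way to get the custom metrics from the sketches above into Application Insights is the Azure Monitor OpenTelemetry distro; a minimal sketch, assuming the connection string lives in the standard environment variable.
import os

from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import metrics

# Routes OpenTelemetry metrics, traces, and logs to Application Insights.
configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"]
)

meter = metrics.get_meter("ai-app")
estimated_cost_usd = meter.create_counter("estimated_cost_usd")

# Anywhere in the request path:
estimated_cost_usd.add(0.0123, {"model": "gpt-4o", "feature": "chat"})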
The Minimum
If you can only do three things:
- Track cost - Know what you’re spending, daily
- Track quality - Automated checks plus user feedback
- Log everything - With PII redaction, so you can debug after the fact (see the sketch below)
Everything else is optimization. These three are survival.
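A sketch of "log everything, with PII redaction": regex scrubbing before the prompt/response pair is written. The two patterns below only catch emails and phone-like numbers; real deployments usually layer a dedicated PII detection service (Azure AI Language offers one) on top.
import json
import logging
import re

logger = logging.getLogger("ai_conversations")

# Minimal scrubbing -- extend, or replace with a proper PII detection service.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text):
    text = EMAIL_RE.sub("[EMAIL]", text)
    text = PHONE_RE.sub("[PHONE]", text)
    return text

def log_interaction(request_id, prompt, response):
    logger.info(json.dumps({
        "request_id": request_id,
        "prompt": redact(prompt),
        "response": redact(response),
    }))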
The Reality
You can’t monitor AI quality the same way you monitor uptime. There’s no binary “working/broken” for AI responses.
Accept the ambiguity. Build monitoring around it. Review regularly.
Your AI app is only as good as your ability to know when it’s not good.