Prometheus with Azure Monitor: Unified Observability
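This post walks through running Prometheus on Azure Monitor's managed service: creating a workspace, connecting AKS, customizing scrape configuration, writing recording and alert rules, and instrumenting a Python application.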
The Challenge
Self-managed Prometheus requires:
- Scaling for high-cardinality metrics
- Long-term storage solutions
- High availability configuration
- Alertmanager setup
- Security management
Azure Monitor managed service for Prometheus takes over these operational tasks while remaining compatible with PromQL and Prometheus remote write.
Azure Monitor Managed Prometheus
Creating a Workspace
# Create Azure Monitor Workspace
az monitor account create \
  --name my-prometheus-workspace \
  --resource-group my-rg \
  --location eastus

# Get workspace ID
WORKSPACE_ID=$(az monitor account show \
  --name my-prometheus-workspace \
  --resource-group my-rg \
  --query id -o tsv)
Connecting AKS
# Enable Prometheus metrics on AKS
az aks update \
  --name my-aks-cluster \
  --resource-group my-rg \
  --enable-azure-monitor-metrics \
  --azure-monitor-workspace-resource-id "$WORKSPACE_ID"
This deploys:
- Metrics collection agents
- Recording rules processor
- Remote write endpoint
Prometheus Scrape Configuration
Customize what gets scraped:
# ama-metrics-prometheus-config ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: ama-metrics-prometheus-config
  namespace: kube-system
data:
  prometheus-config: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
      # Kubernetes API servers
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https
      # Kubernetes nodes
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)
      # Application pods with prometheus.io annotations
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            replacement: $1:$2
            target_label: __address__
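The last relabel rule in the kubernetes-pods job is the least obvious one, so here is a minimal Python sketch of what it does to a target address. It assumes the documented Prometheus behavior of joining source label values with `;` and anchoring the regex at both ends. The helper name `rewrite_address` and the sample addresses are illustrative only.

```python
import re
from typing import Optional

# Same pattern as the relabel rule: host, optional existing port, then the annotation port.
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def rewrite_address(address: str, port_annotation: Optional[str]) -> str:
    """Mimic the replace action on __address__ (hypothetical helper, not a library call)."""
    # Source label values are concatenated with ";" before the regex is applied.
    joined = f"{address};{port_annotation or ''}"
    match = ADDRESS_RE.fullmatch(joined)  # relabel regexes are fully anchored
    if match is None:
        # No match: a replace action leaves the target label unchanged.
        return address
    return f"{match.group(1)}:{match.group(2)}"


print(rewrite_address("10.0.0.5:8080", "8000"))  # 10.0.0.5:8000
print(rewrite_address("10.0.0.5", "8000"))       # 10.0.0.5:8000
print(rewrite_address("10.0.0.5:8080", None))    # 10.0.0.5:8080
```

In other words, the port from the prometheus.io/port annotation wins over whatever port service discovery found, and pods without the annotation keep their discovered address.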
Querying with PromQL
Use Azure Monitor to query Prometheus metrics:
# Install Azure CLI extension
az extension add --name monitor-control-service

# Query metrics
az monitor metrics list \
  --resource $WORKSPACE_ID \
  --metric-namespace "prometheus" \
  --metric "up" \
  --start-time 2021-05-15T00:00:00Z \
  --end-time 2021-05-15T01:00:00Z
In Grafana, add the Azure Monitor workspace as a Prometheus data source, then query it with PromQL:
# Container CPU usage
sum(rate(container_cpu_usage_seconds_total{namespace="production"}[5m])) by (pod)
# Memory usage percentage
sum(container_memory_working_set_bytes{namespace="production"}) by (pod)
/ sum(container_spec_memory_limit_bytes{namespace="production"}) by (pod)
* 100
# Request rate
sum(rate(http_requests_total{job="web-api"}[5m])) by (status_code)
# Error rate
sum(rate(http_requests_total{job="web-api",status_code=~"5.."}[5m]))
/ sum(rate(http_requests_total{job="web-api"}[5m]))
* 100
# P99 latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="web-api"}[5m])) by (le))
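The P99 query is an estimate, not an exact percentile: histogram_quantile interpolates linearly inside the bucket that contains the requested rank. The sketch below is a simplified Python model of that idea, assuming cumulative buckets sorted by upper bound with a final +Inf bucket. The bucket counts are made up for illustration, and edge cases that Prometheus handles (such as histograms with fewer than two buckets) are left out.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound, count) buckets.

    Simplified model: buckets are sorted, counts are cumulative, and the
    last bucket has an upper bound of +Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for upper, count in buckets:
        if count >= rank:
            if math.isinf(upper):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (upper - prev_bound) * fraction
        prev_bound, prev_count = upper, count
    return math.nan


# Hypothetical cumulative counts for buckets le=0.1, 0.5, 1.0, +Inf.
sample = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
print(f"{histogram_quantile(0.50, sample):.3f}")  # 0.100
print(f"{histogram_quantile(0.99, sample):.3f}")  # 1.000
```

The practical consequence is that bucket boundaries set the resolution of the result, so it helps to place boundaries close to the latency targets you care about.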
Recording Rules
Define recording rules for expensive queries:
apiVersion: v1
kind: ConfigMap
metadata:
  name: ama-metrics-prometheus-config
  namespace: kube-system
data:
  recording-rules: |
    groups:
      - name: kubernetes-apps
        interval: 30s
        rules:
          - record: namespace:container_cpu_usage_seconds_total:sum_rate
            expr: |
              sum by (namespace) (
                rate(container_cpu_usage_seconds_total{image!=""}[5m])
              )
          - record: namespace:container_memory_usage_bytes:sum
            expr: |
              sum by (namespace) (
                container_memory_usage_bytes{image!=""}
              )
          - record: namespace:http_request_rate:sum
            expr: |
              sum by (namespace) (
                rate(http_requests_total[5m])
              )
      - name: application-slos
        interval: 30s
        rules:
          - record: job:http_request_duration_seconds:p99
            expr: |
              histogram_quantile(0.99,
                sum by (job, le) (
                  rate(http_request_duration_seconds_bucket[5m])
                )
              )
          - record: job:http_request_error_rate:ratio
            expr: |
              sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
              /
              sum by (job) (rate(http_requests_total[5m]))
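To make the error-ratio rule concrete, here is a rough Python sketch of the arithmetic behind it. It assumes a simplified rate that divides the counter increase by the elapsed time and ignores counter resets and the window extrapolation that the real rate() performs. The sample values are hypothetical.

```python
from typing import Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def simple_rate(first: Sample, last: Sample) -> float:
    """Per-second increase between two counter samples.

    Simplification: no handling of counter resets or extrapolation.
    """
    (t0, v0), (t1, v1) = first, last
    return (v1 - v0) / (t1 - t0)


# Hypothetical http_requests_total samples taken 5 minutes (300 s) apart.
errors_5xx = simple_rate((0.0, 120.0), (300.0, 150.0))       # 30 errors / 300 s
all_requests = simple_rate((0.0, 9000.0), (300.0, 9600.0))   # 600 requests / 300 s

error_ratio = errors_5xx / all_requests
print(f"{error_ratio:.2%}")  # 5.00%
```

Because the recorded value is a ratio between 0 and 1, any threshold applied to it later should be written as a fraction (0.05), not a percentage (5).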
Alert Rules
Configure alerting via Azure Monitor:
apiVersion: v1
kind: ConfigMap
metadata:
  name: ama-metrics-prometheus-config
  namespace: kube-system
data:
  alerting-rules: |
    groups:
      - name: kubernetes-alerts
        rules:
          - alert: PodCrashLooping
            # Non-zero restart rate over a 15m window, sustained for 15m
            expr: |
              rate(kube_pod_container_status_restarts_total[15m]) * 60 * 5 > 0
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
              description: "Container restart count has kept increasing over the last 15 minutes"
          - alert: HighMemoryUsage
            expr: |
              sum by (namespace, pod) (container_memory_usage_bytes)
              / sum by (namespace, pod) (container_spec_memory_limit_bytes)
              > 0.9
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High memory usage in {{ $labels.namespace }}/{{ $labels.pod }}"
              description: "Memory usage is above 90%"
          - alert: HighErrorRate
            expr: |
              job:http_request_error_rate:ratio > 0.05
            for: 5m
            labels:
              severity: critical
            annotations:
              summary: "High error rate for {{ $labels.job }}"
              description: "Error rate is above 5%"
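The `for` field is what keeps these alerts from firing on a single bad evaluation. The sketch below is a simplified Python model of the usual Prometheus behavior (inactive, then pending, then firing once the condition has held for the full duration). The function name and the evaluation timeline are made up for illustration, and the managed rule evaluator may differ in details.

```python
from typing import Iterable, List, Optional, Tuple


def alert_states(
    evaluations: Iterable[Tuple[float, bool]], for_seconds: float
) -> List[str]:
    """Return the alert state after each (timestamp, condition_true) evaluation."""
    states: List[str] = []
    active_since: Optional[float] = None
    for timestamp, condition_true in evaluations:
        if not condition_true:
            # Any false evaluation resets the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held_for = timestamp - active_since
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


# Hypothetical evaluations every 60 s with a 5-minute `for` duration.
timeline = [(0, True), (60, True), (120, False), (180, True), (240, True),
            (300, True), (360, True), (420, True), (480, True)]
print(alert_states(timeline, for_seconds=300))
```

In this timeline the dip at 120 s resets the timer, so the alert only fires at 480 s, five minutes after the condition became true again at 180 s.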
Python Application with Prometheus
Instrument your Python application:
from prometheus_client import Counter, Histogram, Gauge, start_http_server
from functools import wraps
import time

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status_code']
)
REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)
ACTIVE_REQUESTS = Gauge(
    'http_requests_active',
    'Active HTTP requests',
    ['method', 'endpoint']
)
IN_PROGRESS = Gauge(
    'background_jobs_in_progress',
    'Background jobs currently running',
    ['job_type']
)


def track_requests(method, endpoint):
    """Decorator to track request metrics"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            ACTIVE_REQUESTS.labels(method=method, endpoint=endpoint).inc()
            start_time = time.perf_counter()
            try:
                result = func(*args, **kwargs)
                # Plain return values (dicts, strings) have no status_code; assume 200
                status_code = getattr(result, 'status_code', 200)
                REQUEST_COUNT.labels(
                    method=method,
                    endpoint=endpoint,
                    status_code=status_code
                ).inc()
                return result
            except Exception:
                # Count unhandled exceptions as 500s, then let Flask handle them
                REQUEST_COUNT.labels(
                    method=method,
                    endpoint=endpoint,
                    status_code=500
                ).inc()
                raise
            finally:
                # Runs on success and failure, so latency and the gauge stay accurate
                REQUEST_LATENCY.labels(
                    method=method,
                    endpoint=endpoint
                ).observe(time.perf_counter() - start_time)
                ACTIVE_REQUESTS.labels(method=method, endpoint=endpoint).dec()
        return wrapper
    return decorator


# Flask example
from flask import Flask, Response
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)


@app.route('/metrics')
def metrics():
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)


@app.route('/api/users')
@track_requests('GET', '/api/users')
def get_users():
    # Your logic here
    return {'users': []}


if __name__ == '__main__':
    # Optional: also serve metrics on a dedicated port (8001).
    # The /metrics route above already exposes them on the app port (8000).
    start_http_server(8001)
    app.run(host='0.0.0.0', port=8000)
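When checking the /metrics output by hand or in a smoke test, it can help to pull a single sample apart. The following is a small standard-library sketch of a parser for one line of the Prometheus text exposition format. It is a simplification (escape sequences in label values are not unescaped, and optional timestamps are ignored), and the sample line only illustrates the shape of what the instrumented app would return.

```python
import re
from typing import Dict, Tuple

# metric_name{label="value",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str) -> Tuple[str, Dict[str, str], float]:
    """Split one exposition-format sample line into name, labels, and value."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels") or ""))
    return match.group("name"), labels, float(match.group("value"))


# Illustrative line, shaped like the counter defined in the app.
line = 'http_requests_total{method="GET",endpoint="/api/users",status_code="200"} 3.0'
name, labels, value = parse_sample(line)
print(name, labels["status_code"], value)  # http_requests_total 200 3.0
```

For anything beyond a quick check, the parser that ships with the official client library is the safer choice.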
Kubernetes Deployment with Metrics
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8000"
    prometheus.io/path: "/metrics"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
        - name: web-api
          image: myregistry.azurecr.io/web-api:latest
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "100m"
              memory: "128Mi"
            limits:
              cpu: "500m"
              memory: "512Mi"
Federation with Self-Managed Prometheus
If you have existing Prometheus instances, they can forward their samples to Azure Monitor with remote write:
# Remote write to Azure Monitor
remote_write:
  - url: "https://<workspace-id>.prometheus.monitor.azure.com/api/v1/write"
    azure_ad:
      cloud: AzurePublic
      managed_identity:
        client_id: "<managed-identity-client-id>"
Resources
- Azure Monitor Managed Prometheus
- PromQL Documentation
- Prometheus Client Libraries