Prometheus Metrics Collection in AKS
Prometheus has become the de facto standard for metrics collection in Kubernetes. In this post, we’ll explore how to set up Prometheus in AKS and integrate it with Azure Monitor for a comprehensive monitoring solution.
Prometheus Architecture
Prometheus uses a pull-based model:
- Applications expose metrics on an HTTP endpoint (a sample of this output is shown after this list)
- Prometheus scrapes these endpoints at regular intervals
- Metrics are stored in a time-series database
- PromQL queries extract insights from the data
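A scrape target is nothing more than an HTTP endpoint that returns plain text in the Prometheus exposition format. As a rough illustration (the metric names mirror the application examples later in this post; the numbers are made up), a scrape of /metrics returns something like:

# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",endpoint="/api/data",status="200"} 42.0
# HELP http_request_duration_seconds HTTP request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="GET",endpoint="/api/data",le="0.1"} 40.0
http_request_duration_seconds_bucket{method="GET",endpoint="/api/data",le="+Inf"} 42.0
http_request_duration_seconds_sum{method="GET",endpoint="/api/data"} 1.7
http_request_duration_seconds_count{method="GET",endpoint="/api/data"} 42.0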
Deploying Prometheus with Helm
# Add Prometheus community Helm repo
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Create namespace
kubectl create namespace monitoring
# Install Prometheus stack (includes Grafana)
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set prometheus.prometheusSpec.podMonitorSelectorNilUsesHelmValues=false
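Once the release is installed, it is worth confirming that the stack is running and reachable. A quick check, assuming the release name prometheus used above (the chart derives service names from the release name, so adjust them if yours differs):

# Verify the monitoring pods are up
kubectl get pods -n monitoring

# Port-forward the Prometheus UI to localhost:9090
kubectl port-forward -n monitoring svc/prometheus-kube-prometheus-prometheus 9090:9090

# Port-forward Grafana to localhost:3000
# (the chart's default credentials are admin / prom-operator unless overridden)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80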
Exposing Application Metrics
Python Flask Example
from flask import Flask
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
import time

app = Flask(__name__)

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint']
)

@app.route('/api/data')
def get_data():
    start_time = time.time()

    # Your business logic here
    result = {"data": "example"}

    # Record metrics
    REQUEST_COUNT.labels(method='GET', endpoint='/api/data', status='200').inc()
    REQUEST_LATENCY.labels(method='GET', endpoint='/api/data').observe(time.time() - start_time)

    return result

@app.route('/metrics')
def metrics():
    return generate_latest(), 200, {'Content-Type': CONTENT_TYPE_LATEST}

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)
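The example only needs the flask and prometheus_client packages. A quick local sanity check before deploying, assuming the code is saved as app.py (the filename is arbitrary):

pip install flask prometheus_client
python app.py

# In another terminal: generate a request, then inspect the exposed metrics
curl http://localhost:8080/api/data
curl http://localhost:8080/metrics | grep http_requests_total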
.NET Core Example
using Prometheus;

var builder = WebApplication.CreateBuilder(args);

// Add Prometheus metrics
builder.Services.AddSingleton<Counter>(
    Metrics.CreateCounter("http_requests_total", "Total HTTP requests",
        new CounterConfiguration
        {
            LabelNames = new[] { "method", "endpoint", "status" }
        }));

builder.Services.AddSingleton<Histogram>(
    Metrics.CreateHistogram("http_request_duration_seconds", "HTTP request latency",
        new HistogramConfiguration
        {
            LabelNames = new[] { "method", "endpoint" },
            Buckets = Histogram.ExponentialBuckets(0.001, 2, 10)
        }));

var app = builder.Build();

// Enable Prometheus metrics endpoint
app.UseMetricServer();
app.UseHttpMetrics();

app.MapGet("/api/data", (Counter counter, Histogram histogram) =>
{
    using (histogram.WithLabels("GET", "/api/data").NewTimer())
    {
        counter.WithLabels("GET", "/api/data", "200").Inc();
        return Results.Ok(new { data = "example" });
    }
});

app.Run();
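This example relies on the prometheus-net.AspNetCore package, which provides UseMetricServer and UseHttpMetrics. Note that UseHttpMetrics also exports its own built-in HTTP metrics; in practice it is safer to give custom metrics names that don't overlap with the built-in ones, since registering the same metric name with a different label set is rejected by prometheus-net.

# Add the prometheus-net ASP.NET Core integration to the project
dotnet add package prometheus-net.AspNetCore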
Creating ServiceMonitors
ServiceMonitors tell Prometheus which services to scrape:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app-monitor
  namespace: monitoring
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: my-app
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: http
      interval: 30s
      path: /metrics
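The ServiceMonitor above matches on the app: my-app label and the named port http, so the Service in front of the application needs both. A minimal sketch (the name, label values, and port numbers are assumptions that should match your Deployment):

apiVersion: v1
kind: Service
metadata:
  name: my-app
  namespace: default
  labels:
    app: my-app
spec:
  selector:
    app: my-app
  ports:
    - name: http        # must match the port name referenced by the ServiceMonitor
      port: 8080
      targetPort: 8080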
Creating PodMonitors
For pods without services:
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: batch-job-monitor
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: batch-processor
  namespaceSelector:
    matchNames:
      - batch
  podMetricsEndpoints:
    - port: metrics
      interval: 60s
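The port: metrics value refers to a named container port, so the pod template needs to declare it. A sketch of the relevant part of the pod spec (the image and port number are placeholders):

spec:
  containers:
    - name: batch-processor
      image: myregistry.azurecr.io/batch-processor:latest   # placeholder image
      ports:
        - name: metrics        # referenced by the PodMonitor above
          containerPort: 8080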
PromQL Queries
Request Rate
# Requests per second over 5 minutes
rate(http_requests_total[5m])
# Requests per second by endpoint
sum(rate(http_requests_total[5m])) by (endpoint)
Latency Percentiles
# 95th percentile latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# 99th percentile by endpoint
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))
Error Rate
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
Resource Usage
# Container CPU usage
sum(rate(container_cpu_usage_seconds_total{namespace="default"}[5m])) by (pod)
# Container memory usage
sum(container_memory_working_set_bytes{namespace="default"}) by (pod)
Azure Monitor Integration
Container Insights can scrape Prometheus metrics:
apiVersion: v1
kind: ConfigMap
metadata:
  name: container-azm-ms-agentconfig
  namespace: kube-system
data:
  prometheus-data-collection-settings: |
    [prometheus_data_collection_settings.cluster]
    interval = "1m"
    monitor_kubernetes_pods = true
    monitor_kubernetes_pods_namespaces = ["default", "app"]
    [prometheus_data_collection_settings.node]
    interval = "1m"
    urls = ["http://localhost:9100/metrics"]
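With monitor_kubernetes_pods enabled, Container Insights only scrapes pods that opt in via the standard prometheus.io annotations. A sketch of the pod template metadata (the port and path should match wherever your application actually exposes metrics):

metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8080"
    prometheus.io/path: "/metrics"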
Recording Rules
Pre-compute expensive queries with recording rules. Note the release: prometheus label on the PrometheusRule below: with the install command used earlier, the operator only loads rules that carry the Helm release label.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: recording-rules
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: http_requests
      interval: 30s
      rules:
        - record: http:requests:rate5m
          expr: sum(rate(http_requests_total[5m])) by (endpoint)
        - record: http:latency:p95
          expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, endpoint))
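Dashboards and alerts can then reference the pre-computed series directly, which is far cheaper than re-evaluating the raw expressions on every refresh. For example:

# Same result as the raw rate() expression, but served from the recorded series
topk(5, http:requests:rate5m)

# 95th percentile latency per endpoint from the recorded rule
http:latency:p95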
Alerting Rules
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: alerting-rules
  namespace: monitoring
  labels:
    release: prometheus
spec:
  groups:
    - name: http_alerts
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m])) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate detected"
            description: "Error rate is {{ $value | humanizePercentage }}"
        - alert: HighLatency
          expr: |
            histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le)) > 1
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High latency detected"
            description: "95th percentile latency is {{ $value }}s"
Federation for Multi-Cluster
Configure federation to aggregate metrics from multiple clusters:
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"http_.*"}'
    static_configs:
      - targets:
          - 'prometheus-cluster1:9090'
          - 'prometheus-cluster2:9090'
Storage Considerations
For production, configure persistent storage:
prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: managed-premium
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi
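These values can be applied to the existing release with a helm upgrade. Assuming they are saved in a file called prometheus-values.yaml (the filename is arbitrary); --reuse-values keeps the --set flags from the original install from being reset:

helm upgrade prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --reuse-values \
  -f prometheus-values.yaml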
Conclusion
Prometheus provides powerful metrics collection and alerting capabilities for Kubernetes workloads. Combined with Azure Monitor integration, you get the best of both worlds: detailed application metrics and centralized Azure monitoring.
Tomorrow, we’ll build Grafana dashboards to visualize these Prometheus metrics effectively.