Prometheus with Azure Monitor: Unified Observability

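This post walks through running Prometheus on Azure Monitor: creating a workspace, connecting AKS, customizing scrape configuration, querying with PromQL, defining recording and alert rules, and instrumenting a Python application.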

The Challenge

Self-managed Prometheus requires:

  • Scaling for high-cardinality metrics
  • Long-term storage solutions
  • High availability configuration
  • Alertmanager setup
  • Security management

Azure Monitor managed service for Prometheus takes over these operational tasks while staying compatible with PromQL, standard scrape configuration, and remote write.

Azure Monitor Managed Prometheus

Creating a Workspace

# Create Azure Monitor Workspace
az monitor account create \
    --name my-prometheus-workspace \
    --resource-group my-rg \
    --location eastus

# Get workspace ID
WORKSPACE_ID=$(az monitor account show \
    --name my-prometheus-workspace \
    --resource-group my-rg \
    --query id -o tsv)

Connecting AKS

# Enable Prometheus metrics on AKS
az aks update \
    --name my-aks-cluster \
    --resource-group my-rg \
    --enable-azure-monitor-metrics \
    --azure-monitor-workspace-resource-id "$WORKSPACE_ID"

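This deploys:

  • The ama-metrics collection agents in the kube-system namespace
  • A data collection rule and data collection endpoint that route scraped metrics to the workspace
  • Default recording rule groups for the cluster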

Prometheus Scrape Configuration

Customize what gets scraped:

# ama-metrics-prometheus-config ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: ama-metrics-prometheus-config
  namespace: kube-system
data:
  prometheus-config: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s

    scrape_configs:
      # Kubernetes API servers
      - job_name: 'kubernetes-apiservers'
        kubernetes_sd_configs:
          - role: endpoints
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - source_labels: [__meta_kubernetes_namespace, __meta_kubernetes_service_name, __meta_kubernetes_endpoint_port_name]
            action: keep
            regex: default;kubernetes;https

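      # Kubernetes nodes (kubelet metrics are served over HTTPS)
      - job_name: 'kubernetes-nodes'
        kubernetes_sd_configs:
          - role: node
        scheme: https
        tls_config:
          ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
        relabel_configs:
          - action: labelmap
            regex: __meta_kubernetes_node_label_(.+)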

      # Application pods with prometheus.io annotations
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
            action: replace
            regex: ([^:]+)(?::\d+)?;(\d+)
            # The Azure Monitor agent expects $ to be escaped as $$ in replacements
            replacement: $$1:$$2
            target_label: __address__
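
To make the last relabel rule in the kubernetes-pods job concrete, here is a minimal Python sketch of what the replace action does to the scrape address. It assumes standard Prometheus relabel semantics as I understand them: the source label values are joined with ";", the regex has to match the whole joined string, and the replacement is built from the capture groups. The function name relabel_address and the sample addresses are hypothetical, chosen only for illustration.

```python
import re

# Same pattern as the relabel rule. Prometheus anchors relabel regexes,
# which re.fullmatch reproduces here.
ADDRESS_RE = re.compile(r"([^:]+)(?::\d+)?;(\d+)")


def relabel_address(address: str, port_annotation: str) -> str:
    """Return the scrape address after applying the replace rule.

    `address` stands in for __address__ and `port_annotation` for the
    prometheus.io/port annotation (empty string when it is not set).
    """
    joined = f"{address};{port_annotation}"  # default separator is ';'
    match = ADDRESS_RE.fullmatch(joined)
    if match is None:
        # No match: the replace action leaves the target label unchanged.
        return address
    # Host from group 1, port from the annotation in group 2.
    return f"{match.group(1)}:{match.group(2)}"


if __name__ == "__main__":
    print(relabel_address("10.0.0.5:8080", "8000"))
    print(relabel_address("10.0.0.5", "8000"))
    print(relabel_address("10.0.0.5:8080", ""))
```

The third call shows why the annotation is optional: without a port annotation the pattern does not match, so the discovered address is kept as-is.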

Querying with PromQL

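Query Prometheus metrics through the workspace's query endpoint, which implements the Prometheus HTTP API and accepts Microsoft Entra tokens. The caller needs read access to the workspace data (for example the Monitoring Data Reader role):

# Get the workspace query endpoint
QUERY_ENDPOINT=$(az monitor account show \
    --name my-prometheus-workspace \
    --resource-group my-rg \
    --query metrics.prometheusQueryEndpoint -o tsv)

# Get a token for the Prometheus query API
TOKEN=$(az account get-access-token \
    --resource https://prometheus.monitor.azure.com \
    --query accessToken -o tsv)

# Run an instant PromQL query
curl -s -G "$QUERY_ENDPOINT/api/v1/query" \
    -H "Authorization: Bearer $TOKEN" \
    --data-urlencode 'query=up'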
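
The workspace's query endpoint speaks the standard Prometheus HTTP API, so the same kind of query can be issued from Python. The following is a minimal sketch using only the standard library. It assumes you already have the query endpoint URL and a Microsoft Entra bearer token; the helper names build_query_request and run_query, and the placeholder endpoint, are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_query_request(query_endpoint: str, promql: str, token: str) -> urllib.request.Request:
    """Build (but do not send) an instant-query request for /api/v1/query."""
    params = urllib.parse.urlencode({"query": promql})
    url = f"{query_endpoint.rstrip('/')}/api/v1/query?{params}"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def run_query(request: urllib.request.Request) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


if __name__ == "__main__":
    # Placeholder values: substitute your workspace's query endpoint and a real token,
    # then pass the request to run_query().
    req = build_query_request("https://example.invalid", 'up{job="web-api"}', "<token>")
    print(req.full_url)
```

Keeping request construction separate from sending makes the URL encoding easy to check without network access.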

In Grafana, add the workspace as a Prometheus data source, then use PromQL queries such as:

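# Container CPU usage (container!="" skips the pod-level aggregate series)
sum(rate(container_cpu_usage_seconds_total{namespace="production",container!=""}[5m])) by (pod)

# Memory usage percentage (pods without a memory limit are excluded)
sum(container_memory_working_set_bytes{namespace="production",container!=""}) by (pod)
/ (sum(container_spec_memory_limit_bytes{namespace="production",container!=""}) by (pod) > 0)
* 100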

# Request rate
sum(rate(http_requests_total{job="web-api"}[5m])) by (status_code)

# Error rate
sum(rate(http_requests_total{job="web-api",status_code=~"5.."}[5m]))
/ sum(rate(http_requests_total{job="web-api"}[5m]))
* 100

# P99 latency
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket{job="web-api"}[5m])) by (le))
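
The P99 query relies on histogram_quantile, which estimates a quantile from cumulative bucket counts by interpolating linearly inside the bucket where the target rank falls. Here is a simplified Python sketch of that idea, assuming classic cumulative "le" buckets. It is not the Prometheus implementation and ignores edge cases such as negative bucket bounds; the bucket counts in the example are made up.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Buckets must be sorted by upper bound and end with the +Inf bucket.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # Find the first bucket whose cumulative count reaches the rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if index == len(buckets) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    upper, count = buckets[index]
    lower, below = (0.0, 0.0) if index == 0 else buckets[index - 1]
    in_bucket = count - below
    if in_bucket == 0:
        # Only reachable for rank 0 with an empty first bucket.
        return lower
    # Linear interpolation inside the selected bucket.
    return lower + (upper - lower) * (rank - below) / in_bucket


if __name__ == "__main__":
    # Hypothetical request-duration histogram: 100 observations in total.
    sample = [(0.1, 50), (0.5, 90), (1.0, 99), (math.inf, 100)]
    for quantile in (0.5, 0.95, 0.99):
        print(quantile, histogram_quantile(quantile, sample))
```

Because the estimate is interpolated, its accuracy depends on how well the bucket boundaries bracket the latencies you care about.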

Recording Rules

Define recording rules for expensive queries. Azure Monitor managed Prometheus does not read rules from the agent ConfigMap; it evaluates them as Prometheus rule group resources (Microsoft.AlertsManagement/prometheusRuleGroups) scoped to the workspace. The rules below use standard Prometheus rule-file syntax and must be converted to a rule group, for example in an ARM or Bicep template, before they take effect:

# recording-rules.yaml
groups:
  - name: kubernetes-apps
    interval: 30s
    rules:
      - record: namespace:container_cpu_usage_seconds_total:sum_rate
        expr: |
          sum by (namespace) (
            rate(container_cpu_usage_seconds_total{image!=""}[5m])
          )

      - record: namespace:container_memory_usage_bytes:sum
        expr: |
          sum by (namespace) (
            container_memory_usage_bytes{image!=""}
          )

      - record: namespace:http_request_rate:sum
        expr: |
          sum by (namespace) (
            rate(http_requests_total[5m])
          )

  - name: application-slos
    interval: 30s
    rules:
      - record: job:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

      - record: job:http_request_error_rate:ratio
        expr: |
          sum by (job) (rate(http_requests_total{status_code=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
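
To show what the job:http_request_error_rate:ratio rule computes, here is a rough Python sketch that works on raw counter samples. It is a simplification under stated assumptions: simple_rate handles counter resets but skips the window-edge extrapolation that Prometheus' rate() performs, and the series layout (dicts with "job", "status_code", and "samples" keys) is invented for this example.

```python
import re
from collections import defaultdict

Sample = tuple[float, float]  # (unix timestamp in seconds, counter value)


def simple_rate(samples: list[Sample]) -> float:
    """Per-second increase between the first and last sample in the window."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop in value means the counter was reset, so count from zero.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


def error_ratio_by_job(series: list[dict]) -> dict[str, float]:
    """Share of 5xx request rate in the total request rate, per job."""
    errors: dict[str, float] = defaultdict(float)
    totals: dict[str, float] = defaultdict(float)
    for item in series:
        rate = simple_rate(item["samples"])
        totals[item["job"]] += rate
        # Mirrors the status_code=~"5.." matcher, which is fully anchored.
        if re.fullmatch(r"5..", item["status_code"]):
            errors[item["job"]] += rate
    # Jobs with no traffic produce no result, like a division with no samples.
    return {job: errors[job] / total for job, total in totals.items() if total > 0}


if __name__ == "__main__":
    window = [
        {"job": "web-api", "status_code": "200", "samples": [(0, 0), (300, 285)]},
        {"job": "web-api", "status_code": "500", "samples": [(0, 0), (300, 15)]},
    ]
    print(error_ratio_by_job(window))
```

Precomputing this ratio as a recorded series means dashboards and alerts read one cheap series per job instead of re-aggregating every request counter.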

Alert Rules

Configure alerting via Azure Monitor. Alert rules are deployed as Prometheus rule group resources, not through the agent ConfigMap, and notifications are sent through Azure Monitor action groups rather than Alertmanager. In standard rule-file syntax, the alerts look like this:

# alerting-rules.yaml
groups:
  - name: kubernetes-alerts
    rules:
      - alert: PodCrashLooping
        expr: |
          increase(kube_pod_container_status_restarts_total[15m]) > 5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.namespace }}/{{ $labels.pod }} is crash looping"
          description: "Container has restarted more than 5 times in the last 15 minutes"

      - alert: HighMemoryUsage
        # Pods without a memory limit are excluded to avoid dividing by zero
        expr: |
          sum by (namespace, pod) (container_memory_usage_bytes{container!=""})
          / (sum by (namespace, pod) (container_spec_memory_limit_bytes{container!=""}) > 0)
          > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage in {{ $labels.namespace }}/{{ $labels.pod }}"
          description: "Memory usage is above 90%"

      - alert: HighErrorRate
        expr: |
          job:http_request_error_rate:ratio > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate for {{ $labels.job }}"
          description: "Error rate is above 5%"
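
The "for" clause in these alerts delays firing until the expression has been true at every evaluation for the given duration. Here is a small Python sketch of that state machine, assuming standard Prometheus semantics (inactive, pending, firing); the function name alert_states and the evaluation sequence are made up for illustration.

```python
def alert_states(condition: list[bool], step_seconds: int, for_seconds: int) -> list[str]:
    """Return 'inactive', 'pending' or 'firing' for each rule evaluation.

    `condition` holds the result of the alert expression at each evaluation,
    spaced `step_seconds` apart.
    """
    states = []
    active_since = None  # index of the evaluation where the condition became true
    for i, is_true in enumerate(condition):
        if not is_true:
            # Any false evaluation resets the timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = i
        held_for = (i - active_since) * step_seconds
        states.append("firing" if held_for >= for_seconds else "pending")
    return states


if __name__ == "__main__":
    # Evaluated every 60 s with for: 2m; the dip at the fourth evaluation resets the timer.
    print(alert_states([True, True, True, False, True], step_seconds=60, for_seconds=120))
```

A longer "for" value trades detection speed for fewer alerts on short spikes.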

Python Application with Prometheus

Instrument your Python application:

from prometheus_client import Counter, Histogram, Gauge, start_http_server
from functools import wraps
import time

# Define metrics
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status_code']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[.005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]
)

ACTIVE_REQUESTS = Gauge(
    'http_requests_active',
    'Active HTTP requests',
    ['method', 'endpoint']
)

IN_PROGRESS = Gauge(
    'background_jobs_in_progress',
    'Background jobs currently running',
    ['job_type']
)

def track_requests(method, endpoint):
    """Decorator to track request metrics"""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            ACTIVE_REQUESTS.labels(method=method, endpoint=endpoint).inc()
            start_time = time.time()

            try:
                result = func(*args, **kwargs)
                status_code = getattr(result, 'status_code', 200)
                REQUEST_COUNT.labels(
                    method=method,
                    endpoint=endpoint,
                    status_code=status_code
                ).inc()
                return result
            except Exception:
                REQUEST_COUNT.labels(
                    method=method,
                    endpoint=endpoint,
                    status_code=500
                ).inc()
                raise
            finally:
                REQUEST_LATENCY.labels(
                    method=method,
                    endpoint=endpoint
                ).observe(time.time() - start_time)
                ACTIVE_REQUESTS.labels(method=method, endpoint=endpoint).dec()

        return wrapper
    return decorator

# Flask example
from flask import Flask, Response
from prometheus_client import generate_latest, CONTENT_TYPE_LATEST

app = Flask(__name__)

@app.route('/metrics')
def metrics():
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)

@app.route('/api/users')
@track_requests('GET', '/api/users')
def get_users():
    # Your logic here
    return {'users': []}

if __name__ == '__main__':
    # /metrics is served by the Flask route above on port 8000,
    # the same port the pod annotations point the scraper at
    app.run(host='0.0.0.0', port=8000)
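
For orientation, the /metrics endpoint returns plain text in the Prometheus exposition format. The sketch below renders one counter family in that format using only the standard library. It is an illustration of the format, not how prometheus_client is implemented, and it leaves out extras such as the _created series; the function names are hypothetical.

```python
def escape_label_value(value: str) -> str:
    """Escape backslashes, double quotes and newlines in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def render_counter(name: str, help_text: str, samples: list[tuple[dict, float]]) -> str:
    """Render one counter family as exposition-format text.

    `samples` is a list of (labels, value) pairs.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        # Sort labels so the output is stable between scrapes.
        label_str = ",".join(
            f'{key}="{escape_label_value(val)}"' for key, val in sorted(labels.items())
        )
        lines.append(f"{name}{{{label_str}}} {float(value)}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_counter(
        "http_requests_total",
        "Total HTTP requests",
        [({"method": "GET", "endpoint": "/api/users", "status_code": "200"}, 3)],
    ), end="")
```

Each distinct label combination becomes its own line, which is why high-cardinality labels such as user IDs are best kept out of metric labels.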

Kubernetes Deployment with Metrics

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "8000"
    prometheus.io/path: "/metrics"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "8000"
        prometheus.io/path: "/metrics"
    spec:
      containers:
      - name: web-api
        image: myregistry.azurecr.io/web-api:latest
        ports:
        - containerPort: 8000
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "512Mi"

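Remote Write from Self-Managed Prometheus

If you have existing Prometheus instances, forward their samples to the workspace with remote write. Use the metrics ingestion endpoint shown on the workspace overview page, and grant the managed identity the Monitoring Metrics Publisher role on the workspace's data collection rule:

# prometheus.yml: remote write to Azure Monitor
remote_write:
  # The metrics ingestion endpoint has the form
  # https://<dce-host>/dataCollectionRules/<dcr-immutable-id>/streams/Microsoft-PrometheusMetrics/api/v1/write?api-version=2023-04-24
  - url: "<metrics-ingestion-endpoint>"
    azure_ad:
      cloud: AzurePublic
      managed_identity:
        client_id: "<managed-identity-client-id>"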

Michael John Peña

Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.