Building Effective Grafana Dashboards for AKS
Grafana transforms your Prometheus metrics into actionable visualizations. In this post, we’ll create effective dashboards for monitoring AKS clusters and the applications running on them.
Deploying Grafana
If you installed kube-prometheus-stack, Grafana is already included. Otherwise:
helm install grafana grafana/grafana \
--namespace monitoring \
--set persistence.enabled=true \
--set persistence.size=10Gi \
--set adminPassword='your-secure-password'
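This assumes the Grafana Helm repository has already been added to your machine; if it hasn't:
# Add the Grafana Helm repo (standard charts endpoint) and refresh the index
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update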
Accessing Grafana
# Port forward to access Grafana (kube-prometheus-stack names the service prometheus-grafana; use svc/grafana for a standalone install)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Get admin password (if using kube-prometheus-stack)
kubectl get secret -n monitoring prometheus-grafana -o jsonpath="{.data.admin-password}" | base64 --decode
Configuring Data Sources
Prometheus Data Source
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus-kube-prometheus-prometheus:9090
    access: proxy
    isDefault: true
Azure Monitor Data Source
apiVersion: 1
datasources:
  - name: Azure Monitor
    type: grafana-azure-monitor-datasource
    jsonData:
      cloudName: azuremonitor
      tenantId: ${TENANT_ID}
      clientId: ${CLIENT_ID}
      subscriptionId: ${SUBSCRIPTION_ID}
    secureJsonData:
      clientSecret: ${CLIENT_SECRET}
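With kube-prometheus-stack (or the Grafana chart's sidecar enabled), provisioning files like these can be shipped as a labelled ConfigMap. A minimal sketch, assuming the sidecar's default grafana_datasource label:
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
  labels:
    grafana_datasource: "1"   # default label watched by the Grafana datasource sidecar
data:
  prometheus-datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: Prometheus
        type: prometheus
        url: http://prometheus-kube-prometheus-prometheus:9090
        access: proxy
        isDefault: true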
Building a Cluster Overview Dashboard
JSON Dashboard Definition
{
"dashboard": {
"title": "AKS Cluster Overview",
"panels": [
{
"title": "Cluster CPU Usage",
"type": "gauge",
"gridPos": {"h": 8, "w": 6, "x": 0, "y": 0},
"targets": [
{
"expr": "sum(rate(container_cpu_usage_seconds_total{namespace!=\"kube-system\"}[5m])) / sum(machine_cpu_cores) * 100",
"legendFormat": "CPU %"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
},
"unit": "percent",
"max": 100
}
}
},
{
"title": "Cluster Memory Usage",
"type": "gauge",
"gridPos": {"h": 8, "w": 6, "x": 6, "y": 0},
"targets": [
{
"expr": "sum(container_memory_working_set_bytes{namespace!=\"kube-system\"}) / sum(machine_memory_bytes) * 100",
"legendFormat": "Memory %"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 70},
{"color": "red", "value": 90}
]
},
"unit": "percent",
"max": 100
}
}
}
]
}
}
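Rather than pasting this into the UI, the same JSON can be pushed to Grafana's dashboard API. A minimal sketch, assuming the port-forward from earlier, the definition saved as cluster-overview.json, and an API token exported as GRAFANA_TOKEN (both placeholders):
# POST the dashboard JSON to the Grafana HTTP API
curl -s -X POST http://localhost:3000/api/dashboards/db \
  -H "Authorization: Bearer $GRAFANA_TOKEN" \
  -H "Content-Type: application/json" \
  -d @cluster-overview.json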
Creating Dashboard Panels
Node Resource Panel
# CPU usage by node
sum(rate(container_cpu_usage_seconds_total{id="/"}[5m])) by (node)
# Memory usage by node
sum(container_memory_working_set_bytes{id="/"}) by (node)
Pod Status Panel
# Running pods by namespace (kube_pod_status_phase is a 0/1 gauge per pod and phase, so sum rather than count)
sum(kube_pod_status_phase{phase="Running"}) by (namespace)
# Pending pods
sum(kube_pod_status_phase{phase="Pending"}) by (namespace)
Network Traffic Panel
# Network receive rate by pod
sum(rate(container_network_receive_bytes_total[5m])) by (pod)
# Network transmit rate by pod
sum(rate(container_network_transmit_bytes_total[5m])) by (pod)
Application Dashboard Template
{
"dashboard": {
"title": "Application Metrics",
"templating": {
"list": [
{
"name": "namespace",
"type": "query",
"query": "label_values(kube_pod_info, namespace)",
"datasource": "Prometheus"
},
{
"name": "deployment",
"type": "query",
"query": "label_values(kube_deployment_labels{namespace=\"$namespace\"}, deployment)",
"datasource": "Prometheus"
}
]
},
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total{namespace=\"$namespace\"}[5m])) by (endpoint)",
"legendFormat": "{{endpoint}}"
}
]
},
{
"title": "Latency P95",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{namespace=\"$namespace\"}[5m])) by (le, endpoint))",
"legendFormat": "{{endpoint}}"
}
]
},
{
"title": "Error Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(http_requests_total{namespace=\"$namespace\", status=~\"5..\"}[5m])) / sum(rate(http_requests_total{namespace=\"$namespace\"}[5m])) * 100",
"legendFormat": "Error %"
}
]
}
]
}
}
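These panels only work if Prometheus is actually scraping your application's http_requests_total and http_request_duration_seconds metrics. A minimal ServiceMonitor sketch with placeholder names and labels:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                  # placeholder name
  namespace: monitoring
  labels:
    release: prometheus         # placeholder: must match the kube-prometheus-stack release selector
spec:
  namespaceSelector:
    matchNames:
      - my-app-namespace        # placeholder: namespace where the app runs
  selector:
    matchLabels:
      app: my-app               # placeholder: label on the app's Service
  endpoints:
    - port: metrics             # named Service port exposing /metrics
      interval: 30s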
SLA/SLO Dashboard
Availability Panel
# Uptime percentage (30d)
(1 - sum(rate(http_requests_total{status=~"5.."}[30d])) / sum(rate(http_requests_total[30d]))) * 100
Error Budget Panel
# Error budget remaining (99.9% SLO)
((1 - 0.999) - (sum(rate(http_requests_total{status=~"5.."}[30d])) / sum(rate(http_requests_total[30d])))) / (1 - 0.999) * 100
Latency SLO Panel
# Percentage of requests under 200ms
sum(rate(http_request_duration_seconds_bucket{le="0.2"}[5m])) / sum(rate(http_request_duration_seconds_count[5m])) * 100
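Range selectors like [30d] are expensive to evaluate on every dashboard refresh. Precomputing the error ratio with a recording rule keeps these panels fast; a sketch using the Prometheus Operator's PrometheusRule CRD (rule and label names are illustrative):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: slo-recording-rules
  namespace: monitoring
  labels:
    release: prometheus          # placeholder: must match your stack's ruleSelector
spec:
  groups:
    - name: slo.rules
      rules:
        # Precompute the 5m error ratio; longer-window panels can approximate the
        # 30d ratio with avg_over_time over this series instead of raw samples
        - record: http_requests:error_ratio:rate5m
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m]))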
Provisioning Dashboards via ConfigMap
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-dashboards
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
  cluster-overview.json: |
    {
      "dashboard": {
        "title": "AKS Cluster Overview",
        "uid": "aks-cluster-overview",
        "panels": [...]
      }
    }
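kube-prometheus-stack ships a dashboard sidecar that watches for this label; if you installed Grafana standalone, you may need to enable it in the chart values. A sketch of the relevant values (defaults can differ by chart version; nest under grafana: when using kube-prometheus-stack):
# values.yaml excerpt for the Grafana chart
sidecar:
  dashboards:
    enabled: true
    label: grafana_dashboard   # matches the ConfigMap label above
    searchNamespace: ALL       # watch ConfigMaps in all namespaces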
Alert Annotations
Show alerts on dashboards:
{
"annotations": {
"list": [
{
"name": "Alerts",
"datasource": "Prometheus",
"enable": true,
"expr": "ALERTS{alertstate=\"firing\"}",
"titleFormat": "{{alertname}}",
"textFormat": "{{description}}"
}
]
}
}
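The ALERTS series only exists for rules loaded into Prometheus, and the {{description}} text comes from the alert's annotations. A minimal PrometheusRule sketch to give this annotation something to display (thresholds are illustrative):
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: app-alerts
  namespace: monitoring
  labels:
    release: prometheus          # placeholder: must match your stack's ruleSelector
spec:
  groups:
    - name: app.alerts
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.05
          for: 10m
          annotations:
            description: "More than 5% of requests are failing"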
Azure Managed Grafana
For production, consider Azure Managed Grafana:
# Create Azure Managed Grafana instance
az grafana create \
--name myGrafana \
--resource-group myResourceGroup \
--location eastus
# Add an Azure Monitor data source to the managed Grafana instance
az grafana data-source create \
--name myGrafana \
--resource-group myResourceGroup \
--definition '{
"name": "Azure Monitor",
"type": "grafana-azure-monitor-datasource",
"access": "proxy"
}'
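If you also use the managed Prometheus add-on, the cluster's metrics can be wired directly into the instance. A sketch with placeholder resource names (requires a recent az CLI and the Azure Monitor metrics add-on):
# Link managed Prometheus metrics from the cluster to the Grafana instance
az aks update \
  --name myAKSCluster \
  --resource-group myResourceGroup \
  --enable-azure-monitor-metrics \
  --grafana-resource-id $(az grafana show --name myGrafana --resource-group myResourceGroup --query id -o tsv)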
Dashboard Best Practices
- Use variables - Make dashboards reusable with template variables
- Layer information - Overview at top, details below
- Consistent time ranges - Use dashboard time picker, not per-panel
- Color coding - Use consistent colors for status (green=good, red=bad)
- Include context - Add text panels explaining metrics
- Link dashboards - Create drill-down links between dashboards (see the sketch below)
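For the last point, dashboard-level links can carry the current time range and variables into a drill-down dashboard. A sketch of the links array in the dashboard JSON (the target UID and variable name are placeholders):
"links": [
  {
    "title": "Pod details",
    "type": "link",
    "url": "/d/pod-details/pod-details?var-namespace=${namespace}",
    "keepTime": true,
    "targetBlank": false
  }
]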
Conclusion
Effective Grafana dashboards provide immediate visibility into cluster and application health. By combining Prometheus metrics with thoughtful visualization, you can quickly identify and troubleshoot issues.
Tomorrow, we’ll explore Azure Monitor for containers and how it integrates with your existing monitoring setup.