Cost Optimization with AKS Spot Node Pools
Azure Spot VMs offer significant cost savings by using unused Azure capacity. Combined with AKS node pools, they can cut compute costs by up to 90% compared with pay-as-you-go pricing. Let’s explore how to use spot node pools effectively in your Kubernetes workloads.
Understanding Spot VMs
Spot VMs are ideal for workloads that can handle interruptions:
- Batch processing jobs
- Dev/test environments
- CI/CD build agents
- Stateless web applications with redundancy
- Big data processing
Creating a Spot Node Pool
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name spotpool \
    --node-count 3 \
    --node-vm-size Standard_D4s_v3 \
    --priority Spot \
    --eviction-policy Delete \
    --spot-max-price -1 \
    --labels kubernetes.azure.com/scalesetpriority=spot \
    --node-taints kubernetes.azure.com/scalesetpriority=spot:NoSchedule
Key parameters:
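- --priority Spot - Creates a spot node pool
- --eviction-policy Delete - Deletes the VM when it is evicted (default)
- --spot-max-price -1 - Pay up to the on-demand price, so nodes are never evicted because of price (they can still be evicted when Azure needs the capacity back)
- --labels and --node-taints - Match the kubernetes.azure.com/scalesetpriority=spot label and NoSchedule taint that AKS applies to spot node pools automatically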
Setting a Maximum Price
To control costs further, set a maximum price per hour in US dollars. Nodes are evicted when the spot price rises above this value:
az aks nodepool add \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name spotpool \
    --node-count 3 \
    --node-vm-size Standard_D4s_v3 \
    --priority Spot \
    --eviction-policy Delete \
    --spot-max-price 0.05
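The effect of --spot-max-price can be sketched in a few lines of Python. This is an illustration only: the function name and the sample prices are hypothetical, and capacity-driven evictions, which can happen at any price, are not modeled.

```python
def evicted_for_price(current_spot_price, max_price, on_demand_price):
    """Return True if a spot node would be evicted because of price alone.

    max_price == -1 mirrors `--spot-max-price -1`: the cap becomes the
    on-demand price instead of a fixed hourly amount.
    """
    cap = on_demand_price if max_price == -1 else max_price
    return current_spot_price > cap


# Hypothetical hourly prices in USD, for illustration only
print(evicted_for_price(0.038, 0.05, 0.192))  # False: spot price is below the cap
print(evicted_for_price(0.061, 0.05, 0.192))  # True: spot price exceeds the cap
print(evicted_for_price(0.061, -1, 0.192))    # False: capped at the on-demand price
```

A low cap saves money only while the spot price stays under it, so a cap set too close to the current spot price trades lower cost for more frequent evictions.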
Handling Evictions Gracefully
Spot VMs can be evicted with only about 30 seconds' notice. Design your applications to handle this:
Pod Disruption Budgets
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: batch-job-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: batch-processor
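To see what minAvailable: 2 means in practice, here is a small Python sketch of the arithmetic behind a PDB. It is a simplified model, not the Kubernetes controller's code, and the function name is hypothetical. Keep in mind that a PDB limits voluntary disruptions such as node drains; a spot eviction that removes the VM without draining it first is not held back by the budget.

```python
def allowed_disruptions(healthy_pods, min_available):
    """Number of pods that may be evicted voluntarily right now."""
    return max(0, healthy_pods - min_available)


print(allowed_disruptions(5, 2))  # 3: three pods can be drained at once
print(allowed_disruptions(2, 2))  # 0: any further eviction request is refused
print(allowed_disruptions(1, 2))  # 0: already below the budget
```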
Graceful Shutdown Handler
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  replicas: 5
  selector:
    matchLabels:
      app: batch-processor
  template:
    metadata:
      labels:
        app: batch-processor
    spec:
      nodeSelector:
        kubernetes.azure.com/scalesetpriority: spot
      tolerations:
      - key: "kubernetes.azure.com/scalesetpriority"
        operator: "Equal"
        value: "spot"
        effect: "NoSchedule"
      terminationGracePeriodSeconds: 25
      containers:
      - name: processor
        image: myregistry.azurecr.io/batch-processor:v1
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 5 && /app/graceful-shutdown.sh"]
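When Kubernetes terminates the pod, it runs the preStop hook and then sends SIGTERM to the container, and all of that has to finish within terminationGracePeriodSeconds. The contents of /app/graceful-shutdown.sh are not shown in this post, so the following Python sketch is only one hypothetical way a worker process could react to SIGTERM: finish the current item, then stop. The names BatchWorker and process_item are assumptions.

```python
import signal


class BatchWorker:
    """Processes items one at a time and stops cleanly on SIGTERM."""

    def __init__(self, items):
        self.items = list(items)
        self.processed = []
        self.stop_requested = False

    def request_stop(self, signum=None, frame=None):
        # Signal handler: only set a flag so the loop exits at a safe point
        self.stop_requested = True

    def process_item(self, item):
        # Placeholder for the real work (hypothetical)
        self.processed.append(item)

    def run(self):
        while self.items and not self.stop_requested:
            self.process_item(self.items.pop(0))
        # Whatever is left should be checkpointed or re-queued by the caller
        return self.items


if __name__ == "__main__":
    worker = BatchWorker(range(10))
    signal.signal(signal.SIGTERM, worker.request_stop)
    remaining = worker.run()
    print(f"processed={len(worker.processed)} remaining={len(remaining)}")
```

Setting a flag instead of exiting inside the handler keeps each item atomic: the item in flight completes, and the unprocessed remainder is returned so it can be handed to another replica.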
Spot Node Pool with Autoscaling
Combine spot nodes with cluster autoscaler:
az aks nodepool update \
    --resource-group myResourceGroup \
    --cluster-name myAKSCluster \
    --name spotpool \
    --enable-cluster-autoscaler \
    --min-count 0 \
    --max-count 20
Hybrid Architecture: Spot + Regular Nodes
For reliability, combine spot and regular nodes:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            preference:
              matchExpressions:
              - key: kubernetes.azure.com/scalesetpriority
                operator: In
                values:
                - spot
          - weight: 1
            preference:
              matchExpressions:
              - key: kubernetes.azure.com/scalesetpriority
                operator: DoesNotExist
      tolerations:
      - key: "kubernetes.azure.com/scalesetpriority"
        operator: "Equal"
        value: "spot"
        effect: "NoSchedule"
      containers:
      - name: web
        image: myregistry.azurecr.io/web-app:v1
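For each candidate node, the scheduler adds up the weights of the preferred terms that the node satisfies, and nodes with a higher total are favored. The Python sketch below mimics that sum for the two terms in the manifest. It is a simplified model (the real scheduler combines this score with other scoring plugins), and the helper names and the sample regular-node label are hypothetical.

```python
SPOT_KEY = "kubernetes.azure.com/scalesetpriority"

# (weight, predicate over node labels), mirroring the two preferred terms
PREFERENCES = [
    (100, lambda labels: labels.get(SPOT_KEY) in {"spot"}),  # operator: In
    (1, lambda labels: SPOT_KEY not in labels),              # operator: DoesNotExist
]


def affinity_score(node_labels):
    """Sum the weights of every preferred term the node satisfies."""
    return sum(weight for weight, matches in PREFERENCES if matches(node_labels))


spot_node = {SPOT_KEY: "spot"}
regular_node = {"agentpool": "nodepool1"}  # hypothetical label on a regular node

print(affinity_score(spot_node))     # 100
print(affinity_score(regular_node))  # 1
```

Because both terms are preferences rather than requirements, pods still land on regular nodes when no spot capacity is available.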
Terraform Configuration
resource "azurerm_kubernetes_cluster_node_pool" "spot" {
  name                  = "spot"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
  vm_size               = "Standard_D4s_v3"
  priority              = "Spot"
  eviction_policy       = "Delete"
  spot_max_price        = -1
  enable_auto_scaling   = true
  node_count            = 3
  min_count             = 0
  max_count             = 20

  node_labels = {
    "kubernetes.azure.com/scalesetpriority" = "spot"
  }

  node_taints = [
    "kubernetes.azure.com/scalesetpriority=spot:NoSchedule"
  ]

  tags = {
    Environment = "Production"
    NodeType    = "Spot"
  }
}
Monitoring Spot Evictions
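Create an alert for spot evictions. Check that the metric named in --condition is actually exposed for your cluster before relying on it: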
# Query for spot eviction events in Azure Monitor
az monitor metrics alert create \
    --name spot-eviction-alert \
    --resource-group myResourceGroup \
    --scopes /subscriptions/{subscription-id}/resourceGroups/myResourceGroup/providers/Microsoft.ContainerService/managedClusters/myAKSCluster \
    --condition "count VMEvictionCount > 0" \
    --window-size 5m \
    --evaluation-frequency 1m \
    --action myActionGroup
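Evictions can also be detected from inside a node. Azure's Scheduled Events endpoint in the Instance Metadata Service reports an upcoming spot eviction as an event whose EventType is Preempt. The sketch below assumes the 2020-07-01 API version and uses only the standard library; the helper names are hypothetical, and the sample payload is illustrative rather than captured from a real VM.

```python
import json
import urllib.request

SCHEDULED_EVENTS_URL = (
    "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
)


def fetch_scheduled_events(timeout=2):
    """Query the endpoint. This only works from inside an Azure VM."""
    request = urllib.request.Request(
        SCHEDULED_EVENTS_URL, headers={"Metadata": "true"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def find_preempt_events(document):
    """Return the events that announce a spot eviction."""
    return [
        event for event in document.get("Events", [])
        if event.get("EventType") == "Preempt"
    ]


# Illustrative payload with made-up identifiers
sample = {
    "DocumentIncarnation": 1,
    "Events": [
        {"EventId": "example-1", "EventType": "Preempt",
         "Resources": ["aks-spotpool-vmss_0"], "NotBefore": "example-timestamp"},
        {"EventId": "example-2", "EventType": "Reboot",
         "Resources": ["aks-spotpool-vmss_1"], "NotBefore": "example-timestamp"},
    ],
}

for event in find_preempt_events(sample):
    print(event["EventId"], event["Resources"])
```

On a real node you would call fetch_scheduled_events() in a polling loop and pass its result to find_preempt_events(), then drain the node or emit a metric when a match appears.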
Cost Comparison Script
import json
import subprocess


def get_vm_sku_details(vm_size, region):
    # `az vm list-skus` returns SKU availability and capabilities, not prices.
    # Look up the hourly spot and regular prices separately (for example, from
    # the Azure Retail Prices API) and pass them to calculate_savings().
    result = subprocess.run([
        'az', 'vm', 'list-skus',
        '--location', region,
        '--size', vm_size,
        '--output', 'json'
    ], capture_output=True, text=True, check=True)
    return json.loads(result.stdout)


def calculate_savings(regular_price, spot_price, hours_per_month=730):
    # Prices are hourly rates; 730 is the average number of hours in a month
    monthly_regular = regular_price * hours_per_month
    monthly_spot = spot_price * hours_per_month
    savings = monthly_regular - monthly_spot
    percentage = (savings / monthly_regular) * 100
    return {
        'monthly_regular': monthly_regular,
        'monthly_spot': monthly_spot,
        'savings': savings,
        'percentage': percentage
    }


# Example usage
# savings = calculate_savings(0.192, 0.038)
# print(f"Monthly savings: ${savings['savings']:.2f} ({savings['percentage']:.1f}%)")
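The same arithmetic extends to a mixed cluster. The sketch below estimates the monthly bill for a pool of regular nodes plus a pool of spot nodes, reusing the hourly prices from the example usage; the function name and node counts are assumptions for illustration, not measured figures.

```python
def blended_monthly_cost(regular_nodes, spot_nodes, regular_price, spot_price,
                         hours_per_month=730):
    """Monthly cost of a cluster that mixes regular and spot nodes."""
    regular_cost = regular_nodes * regular_price * hours_per_month
    spot_cost = spot_nodes * spot_price * hours_per_month
    return regular_cost + spot_cost


# Six nodes in total: all regular versus two regular plus four spot
all_regular = blended_monthly_cost(6, 0, 0.192, 0.038)
hybrid = blended_monthly_cost(2, 4, 0.192, 0.038)

print(f"All regular: ${all_regular:.2f}")                          # $840.96
print(f"Hybrid:      ${hybrid:.2f}")                               # $391.28
print(f"Savings:     {(1 - hybrid / all_regular) * 100:.1f}%")     # 53.5%
```

The blended figure is lower than the headline spot discount because the regular baseline nodes are still billed at full price.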
Best Practices
- Never run stateful workloads on spot - Use regular nodes for databases and persistent workloads
- Implement proper health checks - Ensure quick pod replacement
- Use multiple VM sizes - Increase availability across different spot pools
- Set appropriate PDBs - Maintain minimum availability during evictions
- Monitor eviction rates - Track and respond to high eviction periods
- Combine with regular nodes - Ensure baseline capacity for critical workloads
Conclusion
Spot node pools are an excellent way to reduce Kubernetes costs for appropriate workloads. By designing for interruption and implementing proper failover strategies, you can achieve significant savings while maintaining reliability.
Tomorrow, we’ll explore virtual nodes for serverless Kubernetes scaling.