Azure Spot Instance Strategies for Cost Savings
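Azure Spot VMs offer up to 90% savings compared to pay-as-you-go pricing. The catch? They can be evicted with only 30 seconds’ notice, either when Azure needs the capacity back or when the current price exceeds the maximum you set. Let’s explore how to use them effectively.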
Understanding Spot VMs
Spot VMs use Azure’s excess capacity at significant discounts. Key characteristics:
- Up to 90% discount
- Can be evicted anytime
- 30-second eviction notice
- Best for fault-tolerant, interruptible workloads
Ideal Use Cases
spot_vm_suitability = {
    "excellent": [
        "Batch processing jobs",
        "CI/CD build agents",
        "Development/testing environments",
        "Big data processing (Spark, Hadoop)",
        "Machine learning training",
        "Rendering workloads",
        "Simulation and modeling",
    ],
    "good_with_design": [
        "Web applications (with proper scaling)",
        "Stateless microservices",
        "Queue processors",
        "Scheduled tasks",
    ],
    "not_recommended": [
        "Production databases",
        "Stateful applications without checkpointing",
        "Real-time systems with SLA requirements",
        "Single-instance critical workloads",
    ],
}
Basic Spot VM Deployment
// Single Spot VM
// location, adminUsername, adminPassword, and nic are declared elsewhere in the template
resource spotVm 'Microsoft.Compute/virtualMachines@2022-03-01' = {
  name: 'spot-worker-01'
  location: location
  properties: {
    hardwareProfile: {
      vmSize: 'Standard_D4s_v3'
    }
    priority: 'Spot'
    evictionPolicy: 'Deallocate' // or 'Delete'
    billingProfile: {
      // -1: never evict for price reasons; pay up to the on-demand price
      maxPrice: -1
      // Or cap the hourly price in USD; decimals need json() in Bicep:
      // maxPrice: json('0.05')
    }
    storageProfile: {
      imageReference: {
        publisher: 'Canonical'
        offer: '0001-com-ubuntu-server-jammy'
        sku: '22_04-lts'
        version: 'latest'
      }
      osDisk: {
        createOption: 'FromImage'
        managedDisk: {
          storageAccountType: 'Standard_LRS'
        }
      }
    }
    osProfile: {
      computerName: 'spot-worker-01'
      adminUsername: adminUsername
      adminPassword: adminPassword
    }
    networkProfile: {
      networkInterfaces: [
        {
          id: nic.id
        }
      ]
    }
  }
}
Spot VM Scale Sets
// Spot VMSS for scalable workloads
// Note: rolling upgrades and automatic repairs both require health monitoring
// (a load balancer health probe or the Application Health extension), not shown here
resource spotVmss 'Microsoft.Compute/virtualMachineScaleSets@2022-03-01' = {
  name: 'spot-vmss'
  location: location
  sku: {
    name: 'Standard_D4s_v3'
    tier: 'Standard'
    capacity: 10
  }
  properties: {
    upgradePolicy: {
      mode: 'Rolling'
      rollingUpgradePolicy: {
        maxBatchInstancePercent: 20
        maxUnhealthyInstancePercent: 20
        pauseTimeBetweenBatches: 'PT5S'
      }
    }
    virtualMachineProfile: {
      priority: 'Spot'
      evictionPolicy: 'Delete'
      billingProfile: {
        maxPrice: -1
      }
      osProfile: {
        computerNamePrefix: 'spot'
        adminUsername: adminUsername
        adminPassword: adminPassword
      }
      storageProfile: {
        imageReference: {
          publisher: 'Canonical'
          offer: '0001-com-ubuntu-server-jammy'
          sku: '22_04-lts'
          version: 'latest'
        }
        osDisk: {
          createOption: 'FromImage'
          managedDisk: {
            storageAccountType: 'Standard_LRS'
          }
        }
      }
      networkProfile: {
        networkInterfaceConfigurations: [
          {
            name: 'nic'
            properties: {
              primary: true
              ipConfigurations: [
                {
                  name: 'ipconfig'
                  properties: {
                    subnet: {
                      id: subnet.id
                    }
                  }
                }
              ]
            }
          }
        ]
      }
    }
    automaticRepairsPolicy: {
      enabled: true
      gracePeriod: 'PT10M'
    }
  }
}
Handling Evictions
Metadata Service Polling
import sys
import time

import requests

METADATA_URL = "http://169.254.169.254/metadata/scheduledevents"

# save_checkpoint, notify_controller, complete_current_task, and
# process_next_job are application-specific placeholders defined elsewhere.


def check_eviction():
    """Check for scheduled eviction via Azure Metadata Service."""
    try:
        response = requests.get(
            METADATA_URL,
            params={"api-version": "2020-07-01"},
            headers={"Metadata": "true"},
            timeout=2,
        )
        response.raise_for_status()
        events = response.json().get("Events", [])
        for event in events:
            # Spot evictions are reported as "Preempt" events
            if event.get("EventType") == "Preempt":
                return True, event.get("NotBefore")
        return False, None
    except (requests.RequestException, ValueError) as e:
        print(f"Error checking eviction: {e}")
        return False, None


def graceful_shutdown():
    """Handle graceful shutdown on eviction."""
    print("Eviction detected! Starting graceful shutdown...")
    # Save checkpoint first, while there is still time
    save_checkpoint()
    # Notify orchestrator
    notify_controller("eviction")
    # Complete current task if possible
    complete_current_task()
    sys.exit(0)


def main():
    while True:
        is_evicting, eviction_time = check_eviction()
        if is_evicting:
            graceful_shutdown()
        # Do work; keep each unit well under the 30-second notice,
        # because eviction is only checked between jobs
        process_next_job()
        time.sleep(5)


if __name__ == "__main__":
    main()
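To make the polling logic easier to test without a live metadata endpoint, the response handling can be separated from the HTTP call. The sketch below assumes the Scheduled Events response shape used above (an `Events` list whose items carry `EventType` and an RFC 1123 `NotBefore` timestamp); the helper names `find_preempt_events` and `seconds_until` are illustrative, not part of any Azure SDK.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def find_preempt_events(document: dict) -> list:
    """Return the Preempt (Spot eviction) events from a response body."""
    return [
        event
        for event in document.get("Events", [])
        if event.get("EventType") == "Preempt"
    ]


def seconds_until(not_before: str, now: datetime) -> float:
    """Seconds left before the event may start.

    Returns 0.0 when NotBefore is empty or already in the past.
    `now` must be timezone-aware.
    """
    if not not_before:
        return 0.0
    deadline = parsedate_to_datetime(not_before)  # e.g. "Mon, 19 Sep 2016 18:29:47 GMT"
    return max(0.0, (deadline - now).total_seconds())


# Hypothetical sample payload for illustration only
sample = {
    "DocumentIncarnation": 1,
    "Events": [
        {
            "EventId": "example-event-1",
            "EventType": "Preempt",
            "NotBefore": "Mon, 19 Sep 2016 18:29:47 GMT",
        }
    ],
}

for event in find_preempt_events(sample):
    budget = seconds_until(event["NotBefore"], datetime.now(timezone.utc))
    print(f"Eviction {event['EventId']}: {budget:.0f}s left to checkpoint")
```

Knowing the remaining time budget lets the shutdown path decide whether finishing the current task is realistic or whether it should checkpoint and exit immediately.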
Kubernetes Spot Node Handling
# Spot node pool in AKS
# Node pools are not Kubernetes objects, so they cannot be created from a manifest.
# They are created through the AKS API, for example with the Azure CLI:
#   az aks nodepool add --resource-group <resource-group> --cluster-name <cluster> \
#     --name spotpool --priority Spot --eviction-policy Delete --spot-max-price -1
---
# Pod with spot node tolerance
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  replicas: 10
  selector:
    matchLabels:
      app: batch-processor
  template:
    metadata:
      labels:
        app: batch-processor
    spec:
      tolerations:
        - key: "kubernetes.azure.com/scalesetpriority"
          operator: "Equal"
          value: "spot"
          effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "kubernetes.azure.com/scalesetpriority"
                    operator: In
                    values:
                      - spot
      containers:
        - name: processor
          image: myregistry/batch-processor:latest
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "/app/save-checkpoint.sh"]
      terminationGracePeriodSeconds: 25  # Less than 30s notice
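Inside the container, the application still has to stop cleanly within the grace period: Kubernetes runs the preStop hook and then sends SIGTERM to the container's main process. A minimal sketch of a worker loop that honors SIGTERM follows; `process_one_item` and `save_checkpoint` are hypothetical callables supplied by the application.

```python
import signal
import threading

# Set by the signal handler, checked by the work loop
stop_requested = threading.Event()


def request_stop(signum, frame):
    """Signal handler: only set a flag; cleanup happens in the work loop."""
    stop_requested.set()


def install_sigterm_handler():
    """Register the handler. Must be called from the main thread."""
    signal.signal(signal.SIGTERM, request_stop)


def run_worker(process_one_item, save_checkpoint):
    """Process items until a stop is requested, then checkpoint and return.

    Returns the number of items processed. Each item should take far less
    than the pod's termination grace period, because the flag is only
    checked between items.
    """
    processed = 0
    while not stop_requested.is_set():
        process_one_item()
        processed += 1
    save_checkpoint()
    return processed
```

Keeping the handler to a single flag assignment avoids doing I/O inside a signal handler; the checkpoint is written from the normal control flow once the current item finishes.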
Mixed Workload Strategy
// Combine spot and regular VMs for reliability
// Uses Spot Priority Mix: a single Flexible scale set holds both regular and
// Spot instances (there is no separate per-priority profile resource)
resource mixedVmss 'Microsoft.Compute/virtualMachineScaleSets@2023-03-01' = {
  name: 'mixed-vmss'
  location: location
  sku: {
    name: 'Standard_D4s_v3'
    tier: 'Standard'
    capacity: 20
  }
  properties: {
    orchestrationMode: 'Flexible'
    platformFaultDomainCount: 1
    priorityMixPolicy: {
      // 5 regular instances for baseline capacity
      baseRegularPriorityCount: 5
      // Everything above the baseline is Spot:
      // 15 spot instances for cost-effective scaling
      regularPriorityPercentageAboveBase: 0
    }
    virtualMachineProfile: {
      priority: 'Spot'
      evictionPolicy: 'Delete'
      billingProfile: {
        maxPrice: -1
      }
      // osProfile, storageProfile, and networkProfile omitted for brevity
    }
  }
}
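The baseline-plus-burst split above (5 regular, 15 Spot out of 20) can be sanity-checked with a small calculation. This sketch models a rule of "a fixed regular base, then a percentage of the remainder as regular, the rest as Spot"; the function name, parameter names, and the round-up behavior are assumptions for illustration and may not match how Azure rounds in practice.

```python
def split_capacity(total: int, base_regular: int, regular_percent_above_base: int):
    """Return (regular, spot) instance counts for a mixed scale set.

    Assumption: fractional regular instances above the base are rounded up.
    """
    if total < 0 or base_regular < 0:
        raise ValueError("total and base_regular must be non-negative")
    if not 0 <= regular_percent_above_base <= 100:
        raise ValueError("regular_percent_above_base must be between 0 and 100")

    # The base is filled with regular instances first
    regular = min(total, base_regular)
    above_base = total - regular

    # Integer ceiling of above_base * percent / 100
    regular += (above_base * regular_percent_above_base + 99) // 100
    return regular, total - regular


print(split_capacity(20, 5, 0))   # (5, 15): the split described above
print(split_capacity(20, 5, 20))  # (8, 12): 20% of the 15 above base stay regular
print(split_capacity(3, 5, 50))   # (3, 0): below the base, everything is regular
```

Running the numbers for a few capacities before deploying makes it clear how much of the fleet would survive a region-wide Spot eviction.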
Batch Processing Pattern
from azure.batch import BatchServiceClient
from azure.batch.models import (
    ImageReference,
    PoolAddParameter,
    VirtualMachineConfiguration,
)


def create_spot_batch_pool(batch_client: BatchServiceClient):
    """Create Azure Batch pool with spot VMs."""
    pool = PoolAddParameter(
        id='spot-processing-pool',
        vm_size='Standard_D8s_v3',
        # Fixed targets (target_low_priority_nodes / target_dedicated_nodes)
        # must not be set when enable_auto_scale=True. The formula below sets
        # both instead: up to 50 spot nodes ("low-priority" in Batch
        # terminology) and at least 5 dedicated nodes for reliability.
        virtual_machine_configuration=VirtualMachineConfiguration(
            image_reference=ImageReference(
                publisher='Canonical',
                offer='0001-com-ubuntu-server-jammy',
                sku='22_04-lts'
            ),
            node_agent_sku_id='batch.node.ubuntu 22.04'
        ),
        task_slots_per_node=4,
        enable_auto_scale=True,
        auto_scale_formula='''
            $totalNodes = (
                $PendingTasks.GetSamplePercent(TimeInterval_Minute * 15) < 70 ?
                max(0, $ActiveTasks.GetSample(1)) :
                max($PendingTasks.GetSample(1), $ActiveTasks.GetSample(1))
            );
            $spotNodes = $totalNodes * 0.9;
            $regularNodes = $totalNodes * 0.1;
            $TargetLowPriorityNodes = min(50, $spotNodes);
            $TargetDedicatedNodes = max(5, $regularNodes);
        '''
    )
    batch_client.pool.add(pool)
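The 90/10 split in the autoscale formula, together with the floor of 5 dedicated nodes, is easy to misread, so it can help to mirror the arithmetic in plain Python. This is only a model of the formula's last four lines; `batch_targets` is a hypothetical helper, and it does not reproduce how the Batch service samples task counts or rounds node targets.

```python
def batch_targets(total_nodes: int, spot_percent: int = 90, min_dedicated: int = 5):
    """Mirror of the formula's split: returns (spot_target, dedicated_target)."""
    spot = total_nodes * spot_percent / 100
    dedicated = max(min_dedicated, total_nodes * (100 - spot_percent) / 100)
    return spot, dedicated


for total in (0, 20, 100):
    spot, dedicated = batch_targets(total)
    print(f"total={total}: spot={spot}, dedicated={dedicated}, sum={spot + dedicated}")
```

One consequence worth noticing: while the dedicated floor is active (fewer than 50 total nodes with these defaults), the two targets add up to more than the computed total, so an idle pool still carries 5 dedicated nodes.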
Cost Monitoring
def analyze_spot_savings(subscription_id: str, resource_group: str):
    """Analyze actual savings from spot VMs."""
    # get_vm_usage and get_regular_price are placeholder helpers (not shown);
    # get_vm_usage is expected to return a list of per-VM usage records.
    # Get spot VM usage
    spot_usage = get_vm_usage(
        subscription_id,
        resource_group,
        priority="Spot"
    )
    # Calculate what it would cost at regular price
    regular_equivalent_cost = sum(
        vm["hours"] * get_regular_price(vm["size"])
        for vm in spot_usage
    )
    # Actual spot cost
    actual_cost = sum(vm["actual_cost"] for vm in spot_usage)
    # Savings (guard against division by zero when there is no usage)
    savings = regular_equivalent_cost - actual_cost
    savings_percentage = (
        savings / regular_equivalent_cost * 100 if regular_equivalent_cost else 0.0
    )
    # Eviction impact
    eviction_count = sum(1 for vm in spot_usage if vm["was_evicted"])
    eviction_rate = eviction_count / len(spot_usage) * 100 if spot_usage else 0.0
    return {
        "regular_equivalent_cost": regular_equivalent_cost,
        "actual_cost": actual_cost,
        "savings": savings,
        "savings_percentage": savings_percentage,
        "eviction_count": eviction_count,
        "eviction_rate": eviction_rate
    }
Best Practices
spot_best_practices:
  design:
    - Design for interruption from the start
    - Implement checkpointing for long-running jobs
    - Use queues for work distribution
    - Keep tasks short (< 1 hour ideal)
  deployment:
    - Mix spot with regular VMs for reliability
    - Use multiple VM sizes for better availability
    - Spread across availability zones
    - Set appropriate max prices
  monitoring:
    - Track eviction rates
    - Monitor actual savings
    - Alert on high eviction rates
    - Track job completion rates
  operations:
    - Automate recreation of evicted instances
    - Implement automatic retry for failed tasks
    - Use managed services where available (Batch, AKS)
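The checkpointing practice from the list above can be sketched in a few lines. This example assumes work is an ordered list of items and that progress is a single index stored as JSON; the function names and the file layout are illustrative. The checkpoint is written to a temporary file and renamed into place, so an eviction in the middle of a write never leaves a half-written checkpoint behind.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Atomically write the checkpoint: temp file first, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle)
            handle.flush()
            os.fsync(handle.fileno())
        os.replace(tmp_path, path)  # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path: str, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default


def process_items(items, path: str, handle_item) -> None:
    """Process items in order, resuming from the last checkpoint."""
    state = load_checkpoint(path, {"next_index": 0})
    for index in range(state["next_index"], len(items)):
        handle_item(items[index])
        # Record progress only after the item has fully completed
        save_checkpoint(path, {"next_index": index + 1})
```

With the `Delete` eviction policy the VM's disks are removed on eviction, so in practice the checkpoint path should point at durable storage (a mounted file share, for example) rather than the local disk. An item that was in flight when the eviction hit is processed again on resume, which means `handle_item` should be idempotent.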
Conclusion
Spot VMs offer massive savings for appropriate workloads. The key is designing for interruption from the start. Use them for batch processing, CI/CD, and stateless services. Combine with regular VMs for hybrid reliability. Monitor eviction rates and adjust strategies accordingly.