December 21, 2022

Azure Spot Instance Strategies for Cost Savings

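Azure Spot VMs offer discounts of up to 90% compared to pay-as-you-go pricing. The catch? Azure can evict them whenever it needs the capacity back, or when the Spot price rises above your maximum price, with only 30 seconds' notice. Let's explore how to use them effectively.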

Understanding Spot VMs

Spot VMs use Azure’s excess capacity at significant discounts. Key characteristics:

Up to 90% discount
Can be evicted at any time, for capacity or price reasons
30-second eviction notice, delivered through Scheduled Events
Best for fault-tolerant, interruptible workloads

Ideal Use Cases

spot_vm_suitability = {
    "excellent": [
        "Batch processing jobs",
        "CI/CD build agents",
        "Development/testing environments",
        "Big data processing (Spark, Hadoop)",
        "Machine learning training",
        "Rendering workloads",
        "Simulation and modeling"
    ],
    "good_with_design": [
        "Web applications (with proper scaling)",
        "Stateless microservices",
        "Queue processors",
        "Scheduled tasks"
    ],
    "not_recommended": [
        "Production databases",
        "Stateful applications without checkpointing",
        "Real-time systems with SLA requirements",
        "Single-instance critical workloads"
    ]
}

Basic Spot VM Deployment

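// Single Spot VM
// Assumes the template declares location, adminUsername, a @secure() adminPassword,
// and a network interface resource named nic.
resource spotVm 'Microsoft.Compute/virtualMachines@2022-03-01' = {
  name: 'spot-worker-01'
  location: location
  properties: {
    hardwareProfile: {
      vmSize: 'Standard_D4s_v3'
    }
    priority: 'Spot'
    evictionPolicy: 'Deallocate'  // or 'Delete'
    billingProfile: {
      maxPrice: -1  // Pay up to on-demand price; never evicted for price reasons
      // Or set a specific hourly cap. Bicep has no decimal literals, so use:
      // maxPrice: json('0.05')
    }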
    storageProfile: {
      imageReference: {
        publisher: 'Canonical'
        offer: '0001-com-ubuntu-server-jammy'
        sku: '22_04-lts'
        version: 'latest'
      }
      osDisk: {
        createOption: 'FromImage'
        managedDisk: {
          storageAccountType: 'Standard_LRS'
        }
      }
    }
    osProfile: {
      computerName: 'spot-worker-01'
      adminUsername: adminUsername
      adminPassword: adminPassword
    }
    networkProfile: {
      networkInterfaces: [
        {
          id: nic.id
        }
      ]
    }
  }
}
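
The maxPrice setting decides whether a price change can trigger an eviction. As a rough sketch of that rule (the function and parameter names are hypothetical, and this only models the documented behavior rather than calling any Azure API): with -1 the VM is never evicted for price reasons, otherwise it is evicted once the current Spot price rises above the cap. Capacity-based evictions can still happen in both cases.

```python
def evicted_for_price(max_price: float, current_spot_price: float) -> bool:
    """Return True if the current Spot price would trigger a price-based eviction.

    max_price == -1 means "pay up to the on-demand price and never evict for price".
    Any other value is treated as an hourly cap in the same currency as the Spot price.
    """
    if max_price == -1:
        return False
    if max_price <= 0:
        raise ValueError("max_price must be -1 or a positive hourly price")
    # Eviction happens only when the price goes above the cap, not when it equals it.
    return current_spot_price > max_price


if __name__ == "__main__":
    print(evicted_for_price(-1, 0.20))    # uncapped: price never evicts
    print(evicted_for_price(0.05, 0.06))  # price rose above the cap
```

A low cap saves money only if your workload tolerates the extra evictions it causes, so it is worth pairing with the eviction handling described below.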

Spot VM Scale Sets

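// Spot VMSS for scalable workloads
// Note: Rolling upgrades and automatic repairs both require health monitoring
// (a load balancer health probe or the Application Health extension), not shown here.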
resource spotVmss 'Microsoft.Compute/virtualMachineScaleSets@2022-03-01' = {
  name: 'spot-vmss'
  location: location
  sku: {
    name: 'Standard_D4s_v3'
    tier: 'Standard'
    capacity: 10
  }
  properties: {
    upgradePolicy: {
      mode: 'Rolling'
      rollingUpgradePolicy: {
        maxBatchInstancePercent: 20
        maxUnhealthyInstancePercent: 20
        pauseTimeBetweenBatches: 'PT5S'
      }
    }
    virtualMachineProfile: {
      priority: 'Spot'
      evictionPolicy: 'Delete'
      billingProfile: {
        maxPrice: -1
      }
      osProfile: {
        computerNamePrefix: 'spot'
        adminUsername: adminUsername
        adminPassword: adminPassword
      }
      storageProfile: {
        imageReference: {
          publisher: 'Canonical'
          offer: '0001-com-ubuntu-server-jammy'
          sku: '22_04-lts'
          version: 'latest'
        }
        osDisk: {
          createOption: 'FromImage'
          managedDisk: {
            storageAccountType: 'Standard_LRS'
          }
        }
      }
      networkProfile: {
        networkInterfaceConfigurations: [
          {
            name: 'nic'
            properties: {
              primary: true
              ipConfigurations: [
                {
                  name: 'ipconfig'
                  properties: {
                    subnet: {
                      id: subnet.id
                    }
                  }
                }
              ]
            }
          }
        ]
      }
    }
    automaticRepairsPolicy: {
      enabled: true
      gracePeriod: 'PT10M'
    }
  }
}

Handling Evictions

Metadata Service Polling

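import sys
import threading
import time

import requests

SCHEDULED_EVENTS_URL = "http://169.254.169.254/metadata/scheduledevents"
POLL_INTERVAL_SECONDS = 1  # The notice is only ~30 seconds, so poll often

eviction_detected = threading.Event()


def check_eviction():
    """Check for scheduled eviction via Azure Metadata Service."""
    try:
        response = requests.get(
            SCHEDULED_EVENTS_URL,
            params={"api-version": "2020-07-01"},
            headers={"Metadata": "true"},
            timeout=2,
        )
        response.raise_for_status()
        events = response.json().get("Events", [])
    except (requests.RequestException, ValueError) as e:
        print(f"Error checking eviction: {e}")
        return False, None

    for event in events:
        if event.get("EventType") == "Preempt":
            return True, event.get("NotBefore")

    return False, None


def watch_for_eviction():
    """Poll in the background so a long-running job cannot delay detection."""
    while not eviction_detected.is_set():
        is_evicting, eviction_time = check_eviction()
        if is_evicting:
            print(f"Eviction scheduled, not before: {eviction_time}")
            eviction_detected.set()
            return
        time.sleep(POLL_INTERVAL_SECONDS)


def graceful_shutdown():
    """Handle graceful shutdown on eviction."""
    print("Eviction detected! Starting graceful shutdown...")

    # save_checkpoint, notify_controller and process_next_job are
    # application-specific hooks that you supply.

    # Save checkpoint first: this is the step that must finish in time
    save_checkpoint()

    # Notify orchestrator so the remaining work can be rescheduled
    notify_controller("eviction")

    sys.exit(0)


def main():
    threading.Thread(target=watch_for_eviction, daemon=True).start()

    # The job in flight is allowed to finish before shutdown starts, so keep
    # each unit of work well under 30 seconds or checkpoint inside it.
    while not eviction_detected.is_set():
        process_next_job()

    graceful_shutdown()


if __name__ == "__main__":
    main()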
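
Knowing that an eviction is coming is only half of the picture; it also helps to know how much time is left. The sketch below works on an already-fetched Scheduled Events payload and returns the seconds remaining until the earliest Preempt event. The helper name is hypothetical, and it assumes `NotBefore` uses the HTTP-date format seen in Scheduled Events responses (for example `Mon, 19 Sep 2016 18:29:47 GMT`), with an empty value meaning the event has already started.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def seconds_until_preempt(payload: dict, now: datetime) -> Optional[float]:
    """Return seconds until the earliest Preempt event, or None if there is none."""
    remaining = []
    for event in payload.get("Events", []):
        if event.get("EventType") != "Preempt":
            continue
        not_before = event.get("NotBefore")
        if not not_before:
            # Assumption: an empty NotBefore means the event is already in progress.
            remaining.append(0.0)
            continue
        deadline = parsedate_to_datetime(not_before)
        remaining.append(max(0.0, (deadline - now).total_seconds()))
    return min(remaining) if remaining else None


if __name__ == "__main__":
    sample = {
        "DocumentIncarnation": 1,
        "Events": [
            {"EventType": "Preempt", "NotBefore": "Mon, 19 Sep 2016 18:29:47 GMT"}
        ],
    }
    print(seconds_until_preempt(sample, datetime.now(timezone.utc)))
```

A worker can use the returned value to decide whether there is time to finish the current unit of work or whether it should checkpoint immediately.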

Kubernetes Spot Node Handling

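# Spot node pool in AKS
# AKS node pools are Azure resources, not Kubernetes objects, so there is no
# manifest for them. Create the pool with the Azure CLI instead (the resource
# group and cluster names are placeholders):
#
#   az aks nodepool add \
#     --resource-group myResourceGroup \
#     --cluster-name myAKSCluster \
#     --name spotpool \
#     --priority Spot \
#     --eviction-policy Delete \
#     --spot-max-price -1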

---
# Deployment that tolerates and targets spot nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
spec:
  replicas: 10
  selector:
    matchLabels:
      app: batch-processor
  template:
    metadata:
      labels:
        app: batch-processor
    spec:
      # Pod-level setting that covers the preStop hook plus SIGTERM handling;
      # kept below the 30-second eviction notice
      terminationGracePeriodSeconds: 25
      tolerations:
        - key: "kubernetes.azure.com/scalesetpriority"
          operator: "Equal"
          value: "spot"
          effect: "NoSchedule"
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: "kubernetes.azure.com/scalesetpriority"
                    operator: In
                    values:
                      - spot
      containers:
        - name: processor
          image: myregistry/batch-processor:latest
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "/app/save-checkpoint.sh"]
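
Inside the container, Kubernetes runs the preStop hook and then sends SIGTERM, and both must fit within the termination grace period. A minimal sketch of a worker loop that cooperates with this, assuming hypothetical `process` and `save_checkpoint` callables supplied by your application: it stops picking up new jobs once SIGTERM arrives and records how far it got.

```python
import signal
import threading

# Set by the signal handler; checked by the worker loop between jobs.
stop_requested = threading.Event()


def _handle_sigterm(signum, frame):
    """Only flip a flag here; the real cleanup happens in the worker loop."""
    stop_requested.set()


def run_worker(jobs, process, save_checkpoint):
    """Process jobs in order until they run out or SIGTERM is received.

    Returns the number of jobs completed, which is also what gets checkpointed.
    """
    signal.signal(signal.SIGTERM, _handle_sigterm)
    completed = 0
    for job in jobs:
        if stop_requested.is_set():
            break  # Do not start work that may not finish before the node goes away
        process(job)
        completed += 1
    save_checkpoint(completed)
    return completed


if __name__ == "__main__":
    done = run_worker(["a", "b", "c"], print, lambda n: print(f"checkpoint at {n}"))
    print(f"completed {done} jobs")
```

The handler deliberately does no work of its own, which keeps it safe to run at any point and leaves the loop in control of when the checkpoint is written.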

Mixed Workload Strategy

// Combine spot and regular VMs for reliability using Spot Priority Mix.
// Scale sets have no per-priority child profiles. A Flexible scale set takes one
// Spot profile plus a priorityMixPolicy that describes the regular share.
resource mixedVmss 'Microsoft.Compute/virtualMachineScaleSets@2022-08-01' = {
  name: 'mixed-vmss'
  location: location
  sku: {
    name: 'Standard_D4s_v3'
    tier: 'Standard'
    capacity: 20
  }
  properties: {
    orchestrationMode: 'Flexible'
    platformFaultDomainCount: 1  // Max spreading
    priorityMixPolicy: {
      // Regular instances for baseline capacity: the first 5 VMs
      baseRegularPriorityCount: 5
      // Spot instances for burst capacity: everything above the base (15 of 20)
      regularPriorityPercentageAboveBase: 0
    }
    virtualMachineProfile: {
      priority: 'Spot'
      evictionPolicy: 'Delete'
      billingProfile: {
        maxPrice: -1
      }
      // Add osProfile, storageProfile and networkProfile as for any scale set.
      // Flexible mode also requires networkApiVersion inside networkProfile.
    }
  }
}
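
The baseline-plus-burst split can be reasoned about with a small helper. The sketch below assumes a fixed floor of regular instances and a percentage of regular instances above that floor, which is the shape of rule Azure's Spot Priority Mix uses. The function and parameter names are hypothetical, and rounding down above the base is an assumption made for illustration.

```python
def split_capacity(capacity: int, base_regular_count: int, regular_pct_above_base: int):
    """Return (regular, spot) instance counts for a mixed scale set.

    The first `base_regular_count` instances are regular. Above that floor,
    `regular_pct_above_base` percent are regular and the rest are Spot.
    """
    if not 0 <= regular_pct_above_base <= 100:
        raise ValueError("regular_pct_above_base must be between 0 and 100")
    base = min(capacity, base_regular_count)
    above_base = capacity - base
    # Assumption: fractional instances above the base are rounded down.
    regular_above = above_base * regular_pct_above_base // 100
    regular = base + regular_above
    return regular, capacity - regular


if __name__ == "__main__":
    # 20 instances with a floor of 5 regular VMs and everything else on Spot
    print(split_capacity(20, 5, 0))
```

Raising the percentage trades savings for stability: more of each scale-out lands on regular VMs that cannot be evicted.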

Batch Processing Pattern

from azure.batch import BatchServiceClient
from azure.batch.models import (
    ImageReference,
    PoolAddParameter,
    VirtualMachineConfiguration,
)

def create_spot_batch_pool(batch_client: BatchServiceClient):
    """Create Azure Batch pool with spot VMs."""

    pool = PoolAddParameter(
        id='spot-processing-pool',
        vm_size='Standard_D8s_v3',
        # Node counts come from the autoscale formula below. Batch rejects
        # explicit target_low_priority_nodes / target_dedicated_nodes values
        # when enable_auto_scale is True.
        virtual_machine_configuration=VirtualMachineConfiguration(
            image_reference=ImageReference(
                publisher='Canonical',
                offer='0001-com-ubuntu-server-jammy',
                sku='22_04-lts',
                version='latest'
            ),
            node_agent_sku_id='batch.node.ubuntu 22.04'
        ),
        task_slots_per_node=4,
        enable_auto_scale=True,
        # $TargetLowPriorityNodes is Batch's name for Spot/low-priority capacity;
        # $TargetDedicatedNodes keeps at least 5 regular VMs for reliability.
        auto_scale_formula='''
            $totalNodes = (
                $PendingTasks.GetSamplePercent(TimeInterval_Minute * 15) < 70 ?
                max(0, $ActiveTasks.GetSample(1)) :
                max($PendingTasks.GetSample(1), $ActiveTasks.GetSample(1))
            );
            $spotNodes = $totalNodes * 0.9;
            $regularNodes = $totalNodes * 0.1;
            $TargetLowPriorityNodes = $spotNodes;
            $TargetDedicatedNodes = max(5, $regularNodes);
        '''
    )

    batch_client.pool.add(pool)
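
Autoscale formulas are awkward to test against a live pool, so it can help to mirror the logic in plain Python for quick what-if checks. The sketch below reproduces the formula above under stated assumptions: the function name and arguments are hypothetical, the task counts are passed in as single numbers rather than sample vectors, and rounding to whole nodes is an approximation of whatever conversion Batch applies.

```python
def autoscale_targets(pending_tasks: int, active_tasks: int, sample_percent: float,
                      spot_share: float = 0.9, min_dedicated: int = 5) -> dict:
    """Approximate the node targets the autoscale formula would produce."""
    if sample_percent < 70:
        # Too few samples to trust the pending count; fall back to active tasks.
        total_nodes = max(0, active_tasks)
    else:
        total_nodes = max(pending_tasks, active_tasks)

    spot_nodes = round(total_nodes * spot_share)
    regular_nodes = round(total_nodes * (1 - spot_share))
    return {
        "low_priority": spot_nodes,
        "dedicated": max(min_dedicated, regular_nodes),
    }


if __name__ == "__main__":
    print(autoscale_targets(pending_tasks=100, active_tasks=40, sample_percent=90))
    print(autoscale_targets(pending_tasks=100, active_tasks=40, sample_percent=50))
```

Running a few scenarios this way shows that the formula sizes the pool at roughly one node per task. With four task slots per node, you may want to divide the task count by the slot count before splitting it between Spot and dedicated nodes.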

Cost Monitoring

def analyze_spot_savings(subscription_id: str, resource_group: str):
    """Analyze actual savings from spot VMs.

    get_vm_usage and get_regular_price are placeholders for your own
    cost-data and price-lookup helpers; they are not Azure SDK functions.
    """

    # Get spot VM usage (materialized as a list because it is iterated several times)
    spot_usage = list(get_vm_usage(
        subscription_id,
        resource_group,
        priority="Spot"
    ))

    # Calculate what it would cost at regular price
    regular_equivalent_cost = sum(
        vm["hours"] * get_regular_price(vm["size"])
        for vm in spot_usage
    )

    # Actual spot cost
    actual_cost = sum(vm["actual_cost"] for vm in spot_usage)

    # Savings (guard against division by zero when there is no usage)
    savings = regular_equivalent_cost - actual_cost
    savings_percentage = (
        savings / regular_equivalent_cost * 100 if regular_equivalent_cost else 0.0
    )

    # Eviction impact
    eviction_count = sum(1 for vm in spot_usage if vm["was_evicted"])
    eviction_rate = eviction_count / len(spot_usage) * 100 if spot_usage else 0.0

    return {
        "regular_equivalent_cost": regular_equivalent_cost,
        "actual_cost": actual_cost,
        "savings": savings,
        "savings_percentage": savings_percentage,
        "eviction_count": eviction_count,
        "eviction_rate": eviction_rate
    }
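
To try the calculation without any Azure access, the same arithmetic can be run over in-memory records. The sketch below is a self-contained variant with a hypothetical function name; the usage records and the hourly price are made-up sample values, not real Azure prices.

```python
def summarize_spot_usage(usage: list, regular_prices: dict) -> dict:
    """Compute savings and eviction rate from usage records and a price table.

    Each record needs: size, hours, actual_cost, was_evicted.
    regular_prices maps a VM size to its regular hourly price.
    """
    regular_cost = sum(vm["hours"] * regular_prices[vm["size"]] for vm in usage)
    actual_cost = sum(vm["actual_cost"] for vm in usage)
    savings = regular_cost - actual_cost
    evictions = sum(1 for vm in usage if vm["was_evicted"])
    return {
        "regular_equivalent_cost": regular_cost,
        "actual_cost": actual_cost,
        "savings": savings,
        "savings_percentage": savings / regular_cost * 100 if regular_cost else 0.0,
        "eviction_rate": evictions / len(usage) * 100 if usage else 0.0,
    }


if __name__ == "__main__":
    # Illustrative numbers only
    sample_usage = [
        {"size": "Standard_D4s_v3", "hours": 100, "actual_cost": 4.0, "was_evicted": True},
        {"size": "Standard_D4s_v3", "hours": 50, "actual_cost": 2.0, "was_evicted": False},
    ]
    print(summarize_spot_usage(sample_usage, {"Standard_D4s_v3": 0.2}))
```

Keeping the calculation separate from data retrieval also makes it easy to unit test against known inputs.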

Best Practices

spot_best_practices:
  design:
    - Design for interruption from the start
    - Implement checkpointing for long-running jobs
    - Use queues for work distribution
    - Keep tasks short (< 1 hour ideal)

  deployment:
    - Mix spot with regular VMs for reliability
    - Use multiple VM sizes for better availability
    - Spread across availability zones
    - Set appropriate max prices

  monitoring:
    - Track eviction rates
    - Monitor actual savings
    - Alert on high eviction rates
    - Track job completion rates

  operations:
    - Automate recreation of evicted instances
    - Implement automatic retry for failed tasks
    - Use managed services where available (Batch, AKS)
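
The checkpointing and retry advice above can be sketched in a few lines. This is a minimal illustration, not a production design: the function names and the JSON file format are assumptions, and it presumes each item is safe to reprocess if an eviction lands between handling it and writing the checkpoint.

```python
import json
import os
import tempfile


def save_checkpoint(path: str, state: dict) -> None:
    """Write the checkpoint atomically so an eviction never leaves a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
        handle.flush()
        os.fsync(handle.fileno())
    os.replace(tmp_path, path)  # Atomic rename over the previous checkpoint


def load_checkpoint(path: str, default):
    """Return the saved state, or `default` if no checkpoint exists yet."""
    try:
        with open(path) as handle:
            return json.load(handle)
    except FileNotFoundError:
        return default


def process_items(items: list, path: str, handle_item) -> None:
    """Process items in order, resuming after the last checkpointed index."""
    state = load_checkpoint(path, {"next_index": 0})
    for index in range(state["next_index"], len(items)):
        handle_item(items[index])
        save_checkpoint(path, {"next_index": index + 1})


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        process_items(["a", "b", "c"], os.path.join(workdir, "ckpt.json"), print)
```

On a real Spot VM the checkpoint would live on storage that survives eviction, such as a blob container or a shared file system, rather than on the local disk.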

Conclusion

Spot VMs can cut compute costs by up to 90% for appropriate workloads. The key is designing for interruption from the start. Use them for batch processing, CI/CD, and stateless services. Combine them with regular VMs so that a baseline of capacity survives evictions. Monitor eviction rates and adjust your strategy accordingly.