
Azure Machine Learning Managed Endpoints: Advanced Deployment Patterns

Azure Machine Learning managed endpoints have matured significantly, offering sophisticated deployment patterns that balance performance, cost, and operational simplicity. Let’s explore advanced deployment strategies that go beyond basic model serving.

Blue-Green Deployments with Traffic Splitting

Managed endpoints support traffic splitting across multiple deployments, enabling safe rollouts of new model versions:

from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Model,
    Environment,
    CodeConfiguration,
    OnlineRequestSettings
)
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="your-subscription-id",
    resource_group_name="rg-mlops-prod",
    workspace_name="mlw-production"
)

# Create the endpoint first; traffic can only reference deployments
# that already exist, so it is assigned after the deployments are created
endpoint = ManagedOnlineEndpoint(
    name="fraud-detection-endpoint",
    description="Production fraud detection with blue-green deployment",
    auth_mode="key"
)

ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Deploy new model version as green deployment
green_deployment = ManagedOnlineDeployment(
    name="green-v2",
    endpoint_name="fraud-detection-endpoint",
    model=Model(path="./models/fraud_model_v2"),
    environment=Environment(
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
        conda_file="./environment/conda.yml"
    ),
    instance_type="Standard_DS3_v2",
    instance_count=2,
    request_settings=OnlineRequestSettings(
        request_timeout_ms=3000,
        max_concurrent_requests_per_instance=100
    )
)

ml_client.online_deployments.begin_create_or_update(green_deployment).result()

# Route 10% canary traffic to the new version alongside the existing blue-v1
endpoint = ml_client.online_endpoints.get("fraud-detection-endpoint")
endpoint.traffic = {"blue-v1": 90, "green-v2": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

Implementing Gradual Traffic Shifting

Automate traffic shifting based on deployment health metrics:

import time
from azure.monitor.query import MetricsQueryClient

def gradual_traffic_shift(
    ml_client: MLClient,
    metrics_client: MetricsQueryClient,
    endpoint_name: str,
    new_deployment: str,
    old_deployment: str,
    error_threshold: float = 0.01
):
    """Gradually shift traffic while monitoring error rates."""

    traffic_steps = [10, 25, 50, 75, 100]

    for target_traffic in traffic_steps:
        # Update traffic split
        endpoint = ml_client.online_endpoints.get(endpoint_name)
        endpoint.traffic = {
            new_deployment: target_traffic,
            old_deployment: 100 - target_traffic
        }
        ml_client.online_endpoints.begin_create_or_update(endpoint).result()

        # Wait and monitor
        time.sleep(300)  # 5 minute observation window

        error_rate = get_deployment_error_rate(
            metrics_client, endpoint_name, new_deployment
        )

        if error_rate > error_threshold:
            # Roll back on high error rate and wait for the update to finish
            endpoint.traffic = {old_deployment: 100, new_deployment: 0}
            ml_client.online_endpoints.begin_create_or_update(endpoint).result()
            raise RuntimeError(f"Rollback triggered: error rate {error_rate:.4f}")

        print(f"Traffic at {target_traffic}%, error rate: {error_rate:.4f}")
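The `get_deployment_error_rate` helper called above isn't defined; here is a minimal sketch of one possibility. Note that Azure Monitor queries take the endpoint's full ARM resource ID rather than its name, and the metric name (`RequestsPerMinute`), the `deployment` and `statusCodeClass` dimension filters, and the 5xx error definition are all assumptions — verify them against the metrics your endpoint actually emits before relying on this.

```python
from datetime import timedelta


def compute_error_rate(total_requests: float, error_requests: float) -> float:
    """Fraction of failed requests; 0.0 when no traffic was observed."""
    if total_requests <= 0:
        return 0.0
    return error_requests / total_requests


def get_deployment_error_rate(
    metrics_client,
    endpoint_resource_id: str,
    deployment_name: str,
    window_minutes: int = 5,
) -> float:
    """Query Azure Monitor for a deployment's recent error rate (sketch)."""

    def total_for(filter_expr: str) -> float:
        # Sum the metric's total values over the observation window
        result = metrics_client.query_resource(
            endpoint_resource_id,
            metric_names=["RequestsPerMinute"],  # assumed metric name
            timespan=timedelta(minutes=window_minutes),
            filter=filter_expr,
        )
        return sum(
            point.total or 0.0
            for metric in result.metrics
            for series in metric.timeseries
            for point in series.data
        )

    total = total_for(f"deployment eq '{deployment_name}'")
    errors = total_for(
        f"deployment eq '{deployment_name}' and statusCodeClass eq '5xx'"
    )
    return compute_error_rate(total, errors)
```

Splitting out `compute_error_rate` keeps the rate calculation trivially unit-testable, including the no-traffic edge case, without mocking the metrics client.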

Cost Optimization with Autoscaling

Managed online deployments don't scale themselves; scaling is driven by Azure Monitor autoscale settings attached to the deployment's ARM resource. Metric-based rules (for example, CPU utilization) together with sensible instance-count bounds let a deployment absorb traffic bursts without breaking latency objectives, while releasing idle instances to keep cost down.
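As a sketch of what such a configuration looks like, the function below builds a `Microsoft.Insights/autoscaleSettings` ARM payload targeting a deployment. The CPU metric name, thresholds, time windows, and cooldowns are illustrative assumptions to tune for your workload, and the payload would be submitted via the Azure Monitor management API or an ARM/Bicep template.

```python
def build_autoscale_payload(
    deployment_resource_id: str,
    location: str = "australiaeast",
    min_instances: int = 2,
    max_instances: int = 10,
    scale_out_cpu: float = 70,
    scale_in_cpu: float = 25,
) -> dict:
    """Build an autoscaleSettings payload for a managed online deployment."""

    def rule(direction: str, operator: str, threshold: float) -> dict:
        # One metric-trigger/scale-action pair per direction
        return {
            "metricTrigger": {
                "metricName": "CpuUtilizationPercentage",  # assumed metric
                "metricResourceUri": deployment_resource_id,
                "timeGrain": "PT1M",
                "statistic": "Average",
                "timeWindow": "PT5M",
                "timeAggregation": "Average",
                "operator": operator,
                "threshold": threshold,
            },
            "scaleAction": {
                "direction": direction,
                "type": "ChangeCount",
                "value": "1",
                "cooldown": "PT5M",
            },
        }

    return {
        "location": location,
        "properties": {
            "enabled": True,
            "targetResourceUri": deployment_resource_id,
            "profiles": [
                {
                    "name": "default",
                    "capacity": {
                        "minimum": str(min_instances),
                        "maximum": str(max_instances),
                        "default": str(min_instances),
                    },
                    "rules": [
                        rule("Increase", "GreaterThan", scale_out_cpu),
                        rule("Decrease", "LessThan", scale_in_cpu),
                    ],
                }
            ],
        },
    }
```

Keeping the floor at two instances preserves availability during node maintenance; the asymmetric thresholds (scale out at 70% CPU, in at 25%) leave a dead band that prevents the deployment from oscillating between instance counts.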

Michael John Peña

Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.