Azure Machine Learning Managed Endpoints: Advanced Deployment Patterns

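In this post I share practical, production-minded guidance on deploying models to Azure Machine Learning managed online endpoints: blue-green rollouts with traffic splitting, automated gradual traffic shifting, and autoscaling for cost control.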

Blue-Green Deployments with Traffic Splitting

Managed endpoints support traffic splitting across multiple deployments, enabling safe rollouts of new model versions:

from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    ManagedOnlineEndpoint,
    ManagedOnlineDeployment,
    Model,
    Environment,
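    CodeConfiguration,
    OnlineRequestSettings,
)
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="your-subscription-id",
    resource_group_name="rg-mlops-prod",
    workspace_name="mlw-production"
)

# Create the endpoint first. Traffic can only reference deployments that
# already exist, so the split is assigned after the green deployment is up.
endpoint = ManagedOnlineEndpoint(
    name="fraud-detection-endpoint",
    description="Production fraud detection with blue-green deployment",
    auth_mode="key"
)

ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Deploy new model version as green deployment
green_deployment = ManagedOnlineDeployment(
    name="green-v2",
    endpoint_name="fraud-detection-endpoint",
    model=Model(path="./models/fraud_model_v2"),
    environment=Environment(
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04",
        conda_file="./environment/conda.yml"
    ),
    # Hypothetical scoring script location; adjust to your project layout
    code_configuration=CodeConfiguration(code="./src", scoring_script="score.py"),
    instance_type="Standard_DS3_v2",
    instance_count=2,
    request_settings=OnlineRequestSettings(
        request_timeout_ms=3000,
        max_concurrent_requests_per_instance=100
    )
)

ml_client.online_deployments.begin_create_or_update(green_deployment).result()

# Route 10% canary traffic to green. This assumes a "blue-v1" deployment
# already exists on the endpoint and currently serves production traffic.
endpoint.traffic = {"blue-v1": 90, "green-v2": 10}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()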
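
A traffic update is rejected by the service if the split is malformed, so it can help to check the dictionary locally before calling the SDK. The sketch below is a hypothetical helper, `validate_traffic`, and is not part of the Azure SDK. It assumes the rule, as I understand it from the endpoint schema, that percentages must sum to 100 (or to 0 when the endpoint should serve no traffic), and that you pass in the names of deployments you know exist.

```python
def validate_traffic(traffic: dict[str, int], existing_deployments: set[str]) -> None:
    """Raise ValueError if the traffic split is not usable (hypothetical pre-flight check)."""
    # Every key must name a deployment that already exists on the endpoint.
    unknown = set(traffic) - existing_deployments
    if unknown:
        raise ValueError(f"Unknown deployments in traffic split: {sorted(unknown)}")

    # Each share must be a percentage between 0 and 100.
    for name, percent in traffic.items():
        if not 0 <= percent <= 100:
            raise ValueError(f"Traffic for {name!r} must be 0-100, got {percent}")

    # Assumed service rule: the shares sum to 100, or to 0 for "no traffic".
    total = sum(traffic.values())
    if total not in (0, 100):
        raise ValueError(f"Traffic must sum to 0 or 100, got {total}")


# Usage: passes silently for a valid 90/10 canary split.
validate_traffic({"blue-v1": 90, "green-v2": 10}, {"blue-v1", "green-v2"})
```

Running this check in a CI step before the deployment script keeps a typo in a deployment name from surfacing only as a failed long-running operation.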

Implementing Gradual Traffic Shifting

Automate traffic shifting based on deployment health metrics:

import time
from azure.monitor.query import MetricsQueryClient

def gradual_traffic_shift(
    ml_client: MLClient,
    metrics_client: MetricsQueryClient,
    endpoint_name: str,
    new_deployment: str,
    old_deployment: str,
    error_threshold: float = 0.01
):
    """Gradually shift traffic while monitoring error rates."""

    traffic_steps = [10, 25, 50, 75, 100]

    for target_traffic in traffic_steps:
        # Update traffic split
        endpoint = ml_client.online_endpoints.get(endpoint_name)
        endpoint.traffic = {
            new_deployment: target_traffic,
            old_deployment: 100 - target_traffic
        }
        ml_client.online_endpoints.begin_create_or_update(endpoint).result()

        # Wait and monitor
        time.sleep(300)  # 5 minute observation window

        # get_deployment_error_rate is a helper that is not shown here; it is
        # expected to return the failed-request fraction as a float in [0, 1].
        error_rate = get_deployment_error_rate(
            metrics_client, endpoint_name, new_deployment
        )

        if error_rate > error_threshold:
            # Roll back and wait for the traffic update to finish before raising
            endpoint.traffic = {old_deployment: 100, new_deployment: 0}
            ml_client.online_endpoints.begin_create_or_update(endpoint).result()
            raise RuntimeError(f"Rollback triggered: error rate {error_rate:.4f}")

        print(f"Traffic at {target_traffic}%, error rate: {error_rate:.4f}")
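
The function above relies on `get_deployment_error_rate`, which is not defined in this post. The metrics query itself depends on your monitoring setup, so the sketch below covers only the arithmetic: turning per-status-class request counts into an error rate. The name `error_rate_from_counts`, the `(status_class, count)` input shape, and the choice to count only `5xx` responses as errors by default are all assumptions made for illustration.

```python
from collections.abc import Iterable


def error_rate_from_counts(
    samples: Iterable[tuple[str, float]],
    error_classes: tuple[str, ...] = ("5xx",),
) -> float:
    """Return failed requests / total requests for the observation window.

    samples: (status_class, request_count) pairs, e.g. ("2xx", 480.0).
    error_classes: status classes treated as failures (assumption: 5xx only).
    """
    total = 0.0
    failed = 0.0
    for status_class, count in samples:
        total += count
        if status_class.lower() in error_classes:
            failed += count

    # With no requests in the window there is nothing to measure; report 0.0
    # rather than dividing by zero.
    if total == 0:
        return 0.0
    return failed / total


# Usage: 10 server errors out of 1000 requests.
rate = error_rate_from_counts([("2xx", 970), ("4xx", 20), ("5xx", 10)])
print(f"{rate:.4f}")
```

Returning 0.0 for an empty window is a design choice: at a 10% canary with low traffic, you may prefer to treat "no data" as a reason to extend the observation window instead of proceeding.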

Cost Optimization with Autoscaling

Managed online endpoints can scale through Azure Monitor autoscale. Configure rules that add instances when load rises and remove them when it falls, so production deployments stay responsive and within their service level objectives without paying for idle capacity.
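
To make the trade-off in an autoscaling rule concrete, here is a minimal sketch of the decision such a rule encodes: scale out above one CPU threshold, scale in below another, and always stay within a minimum and maximum instance count. This is plain Python for illustration, not the Azure Monitor autoscale API; `ScalePolicy`, `desired_instance_count`, and every threshold value are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalePolicy:
    """Illustrative autoscale settings (all values are assumptions)."""
    min_instances: int = 2       # floor that protects availability
    max_instances: int = 6       # ceiling that caps cost
    scale_out_cpu: float = 70.0  # add capacity above this average CPU %
    scale_in_cpu: float = 30.0   # remove capacity below this average CPU %
    step: int = 1                # instances added or removed per decision


def desired_instance_count(current: int, avg_cpu_percent: float, policy: ScalePolicy) -> int:
    """Return the instance count the policy asks for, clamped to its bounds."""
    if avg_cpu_percent > policy.scale_out_cpu:
        target = current + policy.step
    elif avg_cpu_percent < policy.scale_in_cpu:
        target = current - policy.step
    else:
        # Between the two thresholds the count is left alone, which avoids
        # flapping when load hovers around a single cut-off.
        target = current
    return max(policy.min_instances, min(policy.max_instances, target))


# Usage: a busy deployment with 2 instances is asked to grow by one.
print(desired_instance_count(2, 85.0, ScalePolicy()))
```

The gap between the scale-out and scale-in thresholds is what balances responsiveness against cost: a narrow gap reacts faster but changes instance counts more often.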

Michael John Peña

Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.