February 22, 2021 1 min read

Proactive Monitoring with Azure Monitor Alerts

Azure Monitoring Alerts DevOps Observability

Azure Monitor Alerts proactively notify you when important conditions are found in your monitoring data. They allow you to identify and address issues before users notice them, enabling a proactive approach to application and infrastructure management.

Types of Alerts

Azure Monitor supports several alert types:

Metric alerts - Based on metric values crossing thresholds
Log alerts - Based on Log Analytics query results
Activity log alerts - Based on Azure activity log events
Smart detection alerts - AI-powered anomaly detection

Creating Metric Alerts

# Create an action group for notifications
az monitor action-group create \
    --name ag-ops-team \
    --resource-group rg-monitoring \
    --short-name ops-team \
    --action email ops-email ops@company.com \
    --action webhook ops-webhook "https://hooks.slack.com/services/xxx"

# Create a metric alert for high CPU
az monitor metrics alert create \
    --name "High CPU Alert" \
    --resource-group rg-monitoring \
    --scopes "/subscriptions/{sub-id}/resourceGroups/rg-app/providers/Microsoft.Web/sites/mywebapp" \
    --condition "avg Percentage CPU > 80" \
    --window-size 5m \
    --evaluation-frequency 1m \
    --action ag-ops-team \
    --severity 2 \
    --description "Alert when CPU exceeds 80% for 5 minutes"

# Create alert for multiple conditions
az monitor metrics alert create \
    --name "App Health Alert" \
    --resource-group rg-monitoring \
    --scopes "/subscriptions/{sub-id}/resourceGroups/rg-app/providers/Microsoft.Web/sites/mywebapp" \
    --condition "avg Http5xx > 10" \
    --condition "avg ResponseTime > 5" \
    --window-size 5m \
    --evaluation-frequency 1m \
    --action ag-ops-team \
    --severity 1

Log-Based Alerts

Creating Log Alerts with ARM

{
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
    "contentVersion": "1.0.0.0",
    "resources": [
        {
            "type": "microsoft.insights/scheduledqueryrules",
            "apiVersion": "2021-08-01",
            "name": "High Error Rate Alert",
            "location": "eastus",
            "properties": {
                "displayName": "High Error Rate Alert",
                "description": "Alert when error rate exceeds threshold",
                "severity": 1,
                "enabled": true,
                "evaluationFrequency": "PT5M",
                "windowSize": "PT15M",
                "scopes": [
                    "[resourceId('Microsoft.OperationalInsights/workspaces', 'myloganalytics')]"
                ],
                "criteria": {
                    "allOf": [
                        {
                            "query": "AppRequests | where ResultCode >= 500 | summarize ErrorCount = count() by bin(TimeGenerated, 5m)",
                            "timeAggregation": "Total",
                            "metricMeasureColumn": "ErrorCount",
                            "operator": "GreaterThan",
                            "threshold": 50,
                            "failingPeriods": {
                                "numberOfEvaluationPeriods": 3,
                                "minFailingPeriodsToAlert": 2
                            }
                        }
                    ]
                },
                "actions": {
                    "actionGroups": [
                        "[resourceId('microsoft.insights/actionGroups', 'ag-ops-team')]"
                    ],
                    "customProperties": {
                        "Team": "Platform",
                        "Priority": "P1"
                    }
                }
            }
        }
    ]
}

Common Log Alert Queries

// High exception rate
AppExceptions
| where TimeGenerated > ago(15m)
| summarize ExceptionCount = count() by AppRoleInstance
| where ExceptionCount > 100

// Slow database queries
AppDependencies
| where TimeGenerated > ago(15m)
| where DependencyType == "SQL"
| where DurationMs > 5000
| summarize SlowQueries = count(), AvgDuration = avg(DurationMs)
| where SlowQueries > 10

// Memory pressure
Perf
| where TimeGenerated > ago(5m)
| where ObjectName == "Memory" and CounterName == "% Committed Bytes In Use"
| summarize AvgMemory = avg(CounterValue) by Computer
| where AvgMemory > 90

// Failed logins
SigninLogs
| where TimeGenerated > ago(1h)
| where ResultType != 0
| summarize FailedAttempts = count() by UserPrincipalName, IPAddress
| where FailedAttempts > 10

Activity Log Alerts

# Alert on service health incidents
az monitor activity-log alert create \
    --name "Service Health Alert" \
    --resource-group rg-monitoring \
    --condition category=ServiceHealth \
    --action-group ag-ops-team \
    --scope "/subscriptions/{sub-id}"

# Alert on resource deletion
az monitor activity-log alert create \
    --name "Resource Deletion Alert" \
    --resource-group rg-monitoring \
    --condition category=Administrative \
    --condition operationName="Microsoft.Resources/subscriptions/resourceGroups/delete" \
    --action-group ag-ops-team \
    --scope "/subscriptions/{sub-id}"

Smart Detection Alerts

Configure Application Insights smart detection:

from azure.identity import DefaultAzureCredential
from azure.mgmt.applicationinsights import ApplicationInsightsManagementClient

credential = DefaultAzureCredential()
client = ApplicationInsightsManagementClient(credential, subscription_id)

# Get smart detection rules
rules = client.proactive_detection_configurations.list(
    resource_group_name="rg-app",
    resource_name="myappinsights"
)

for rule in rules:
    print(f"Rule: {rule.name}")
    print(f"  Enabled: {rule.enabled}")
    print(f"  Send emails to subscription owners: {rule.send_emails_to_subscription_owners}")

# Update a rule
client.proactive_detection_configurations.update(
    resource_group_name="rg-app",
    resource_name="myappinsights",
    configuration_id="degradationindependencyduration",
    proactive_detection_configuration_properties={
        "enabled": True,
        "send_emails_to_subscription_owners": True,
        "custom_emails": ["oncall@company.com"]
    }
)

Action Groups Configuration

{
    "type": "microsoft.insights/actionGroups",
    "apiVersion": "2021-09-01",
    "name": "ag-comprehensive",
    "location": "Global",
    "properties": {
        "groupShortName": "ops-all",
        "enabled": true,
        "emailReceivers": [
            {
                "name": "ops-team-email",
                "emailAddress": "ops@company.com",
                "useCommonAlertSchema": true
            }
        ],
        "smsReceivers": [
            {
                "name": "oncall-sms",
                "countryCode": "1",
                "phoneNumber": "5551234567"
            }
        ],
        "webhookReceivers": [
            {
                "name": "slack-webhook",
                "serviceUri": "https://hooks.slack.com/services/xxx",
                "useCommonAlertSchema": true,
                "useAadAuth": false
            },
            {
                "name": "pagerduty-webhook",
                "serviceUri": "https://events.pagerduty.com/integration/xxx/enqueue",
                "useCommonAlertSchema": true
            }
        ],
        "azureFunctionReceivers": [
            {
                "name": "auto-remediation",
                "functionAppResourceId": "/subscriptions/{sub-id}/resourceGroups/rg-app/providers/Microsoft.Web/sites/func-alerts",
                "functionName": "HandleAlert",
                "httpTriggerUrl": "https://func-alerts.azurewebsites.net/api/HandleAlert",
                "useCommonAlertSchema": true
            }
        ],
        "logicAppReceivers": [
            {
                "name": "incident-workflow",
                "resourceId": "/subscriptions/{sub-id}/resourceGroups/rg-app/providers/Microsoft.Logic/workflows/IncidentWorkflow",
                "callbackUrl": "https://xxx.logic.azure.com:443/workflows/xxx",
                "useCommonAlertSchema": true
            }
        ],
        "armRoleReceivers": [
            {
                "name": "notify-owners",
                "roleId": "8e3af657-a8ff-443c-a75c-2fe8c4bcb635",
                "useCommonAlertSchema": true
            }
        ]
    }
}

Auto-Remediation with Azure Functions

# Azure Function for auto-remediation
import azure.functions as func
from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient
from azure.mgmt.web import WebSiteManagementClient
import json
import logging

def main(req: func.HttpRequest) -> func.HttpResponse:
    logging.info('Alert handler triggered')

    try:
        alert_data = req.get_json()
        alert_type = alert_data.get('data', {}).get('essentials', {}).get('alertRule')
        affected_resource = alert_data.get('data', {}).get('essentials', {}).get('alertTargetIDs', [])[0]

        logging.info(f"Alert: {alert_type}, Resource: {affected_resource}")

        credential = DefaultAzureCredential()

        if "High CPU" in alert_type:
            remediation_result = scale_out_app_service(credential, affected_resource)
        elif "Memory Pressure" in alert_type:
            remediation_result = restart_vm(credential, affected_resource)
        elif "Disk Space" in alert_type:
            remediation_result = clear_temp_files(credential, affected_resource)
        else:
            remediation_result = {"action": "none", "reason": "Unknown alert type"}

        return func.HttpResponse(
            json.dumps(remediation_result),
            mimetype="application/json"
        )

    except Exception as e:
        logging.error(f"Error: {str(e)}")
        return func.HttpResponse(str(e), status_code=500)

def scale_out_app_service(credential, resource_id):
    """Scale out App Service Plan."""
    parts = resource_id.split('/')
    subscription_id = parts[2]
    resource_group = parts[4]
    site_name = parts[8]

    client = WebSiteManagementClient(credential, subscription_id)

    # Get current site
    site = client.web_apps.get(resource_group, site_name)
    plan_parts = site.server_farm_id.split('/')
    plan_name = plan_parts[-1]

    # Get current plan
    plan = client.app_service_plans.get(resource_group, plan_name)

    # Scale out
    new_capacity = min(plan.sku.capacity + 1, 10)
    plan.sku.capacity = new_capacity

    client.app_service_plans.begin_create_or_update(
        resource_group,
        plan_name,
        plan
    ).result()

    return {
        "action": "scale_out",
        "resource": plan_name,
        "new_capacity": new_capacity
    }

def restart_vm(credential, resource_id):
    """Restart a virtual machine."""
    parts = resource_id.split('/')
    subscription_id = parts[2]
    resource_group = parts[4]
    vm_name = parts[8]

    client = ComputeManagementClient(credential, subscription_id)
    client.virtual_machines.begin_restart(resource_group, vm_name).result()

    return {
        "action": "restart_vm",
        "resource": vm_name
    }

Alert Management Dashboard

Create a workbook for alert visibility:

{
    "version": "Notebook/1.0",
    "items": [
        {
            "type": 1,
            "content": {
                "json": "# Alert Management Dashboard"
            }
        },
        {
            "type": 3,
            "content": {
                "version": "KqlItem/1.0",
                "query": "AlertsManagementResources\n| where type == 'microsoft.alertsmanagement/alerts'\n| where properties.essentials.startDateTime > ago(24h)\n| summarize Count = count() by Severity = tostring(properties.essentials.severity)\n| order by Severity",
                "size": 0,
                "title": "Alerts by Severity (24h)",
                "queryType": 1,
                "resourceType": "microsoft.resourcegraph/resources",
                "visualization": "piechart"
            }
        },
        {
            "type": 3,
            "content": {
                "version": "KqlItem/1.0",
                "query": "AlertsManagementResources\n| where type == 'microsoft.alertsmanagement/alerts'\n| where properties.essentials.alertState == 'New'\n| project \n    Name = name,\n    Severity = properties.essentials.severity,\n    Resource = properties.essentials.targetResourceName,\n    Time = properties.essentials.startDateTime\n| order by Time desc",
                "size": 0,
                "title": "Active Alerts",
                "queryType": 1,
                "resourceType": "microsoft.resourcegraph/resources",
                "visualization": "table"
            }
        }
    ]
}

Best Practices

Start with critical alerts - Don’t alert on everything
Use appropriate thresholds - Avoid alert fatigue
Configure action groups thoughtfully - Right people, right channel
Implement auto-remediation where possible
Review and tune alerts regularly
Use dynamic thresholds for adaptive alerting

Conclusion

Azure Monitor Alerts are essential for proactive system management. By combining metric, log, and activity log alerts with appropriate action groups and auto-remediation, you can maintain system health and minimize downtime.

Start with essential alerts for your most critical resources and expand coverage as you understand your system’s behavior patterns.