Real-Time AI Inference with Azure Container Apps: Dynamic Scaling Patterns

This post shares practical, production-minded guidance on running real-time AI inference on Azure Container Apps, with a focus on dynamic scaling patterns.

Container Configuration for ML Workloads

Structure your inference container for fast cold starts and efficient resource utilization:

# inference_server.py
from contextlib import asynccontextmanager

import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "microsoft/Phi-3-mini-4k-instruct"

model = None
tokenizer = None


class GenerateRequest(BaseModel):
    # Sent as a JSON body rather than as query parameters
    prompt: str
    max_tokens: int = 256


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Load the model once at startup so requests never pay the load cost
    global model, tokenizer
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        torch_dtype=torch.float16,
        device_map="auto",  # requires the accelerate package
        trust_remote_code=True,
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    yield
    # Drop references and release cached GPU memory on shutdown
    model = tokenizer = None
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


app = FastAPI(lifespan=lifespan)


# Plain `def`: FastAPI runs it in a worker thread, so the blocking
# generate() call does not stall the event loop.
@app.post("/generate")
def generate(req: GenerateRequest):
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    with torch.inference_mode():  # no autograd bookkeeping during inference
        outputs = model.generate(
            **inputs,
            max_new_tokens=req.max_tokens,
            do_sample=True,
            temperature=0.7,
        )
    return {"response": tokenizer.decode(outputs[0], skip_special_tokens=True)}

KEDA Scaling Configuration

Azure Container Apps builds its scale rules on KEDA. Scale real-time endpoints on HTTP load and batch workloads on queue depth. The configuration below covers the HTTP case:

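# containerapp.yaml
# Container Apps does not take a Kubernetes Deployment manifest; scale rules
# live under properties.template.scale in the container app definition.
properties:
  template:
    scale:
      minReplicas: 0
      maxReplicas: 20
      cooldownPeriod: 300   # seconds; needs a recent Container Apps API version
      pollingInterval: 15   # seconds; needs a recent Container Apps API version
      rules:
        - name: http-scaler
          http:
            metadata:
              # The built-in HTTP rule scales on concurrent requests per replica
              concurrentRequests: "100"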
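
To build intuition for what these numbers do, here is a minimal sketch of the replica arithmetic. It assumes the usual KEDA/HPA-style rule of dividing the observed load by the per-replica target and rounding up, then clamping to the configured bounds. The function name `desired_replicas` and its parameters are hypothetical, and the real autoscaler also applies polling and cooldown behavior that this sketch ignores.

```python
import math


def desired_replicas(
    current_load: float,
    target_per_replica: float,
    min_replicas: int = 0,
    max_replicas: int = 20,
) -> int:
    """Estimate the replica count for a given load (illustrative only)."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    # Round up so a partially loaded replica still counts as a full one
    raw = math.ceil(current_load / target_per_replica)
    # Clamp to the configured minimum and maximum
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    for load in (0, 1, 100, 250, 5000):
        print(load, desired_replicas(load, target_per_replica=100))
```

With a target of 100 and a ceiling of 20 replicas, a load of 250 maps to 3 replicas, and anything above 2,000 is capped at 20. Zero load maps to zero replicas, which is where cold starts come from.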

Cost Optimization

Use spot capacity for non-critical workloads where your platform offers it, batch requests to raise throughput per replica, and set scale-to-zero policies that match your traffic. Scaling to zero trades cost for cold start latency, so monitor cold starts and raise the minimum replica count during peak hours.
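
Request batching is the least obvious of these, so here is a minimal sketch of one way to do it with `asyncio`: collect concurrent requests for a short window, then make a single model call for the whole group. The `MicroBatcher` class, its parameters, and the `fake_model` stand-in are all hypothetical names for illustration, not part of any library. A real server would pass a function that tokenizes with padding and calls the model once per batch.

```python
import asyncio
from typing import Callable, List, Optional


class MicroBatcher:
    """Groups concurrent requests so one model call serves many callers."""

    def __init__(
        self,
        run_batch: Callable[[List[str]], List[str]],
        max_batch_size: int = 8,
        max_wait_s: float = 0.02,
    ) -> None:
        self._run_batch = run_batch
        self._max_batch_size = max_batch_size
        self._max_wait_s = max_wait_s
        self._queue: Optional[asyncio.Queue] = None
        self._worker: Optional[asyncio.Task] = None

    async def submit(self, prompt: str) -> str:
        # Start the worker lazily so it binds to the running event loop
        if self._worker is None:
            self._queue = asyncio.Queue()
            self._worker = asyncio.create_task(self._loop())
        future = asyncio.get_running_loop().create_future()
        await self._queue.put((prompt, future))
        return await future

    async def _loop(self) -> None:
        loop = asyncio.get_running_loop()
        while True:
            # Wait for the first request, then fill the batch until it is
            # full or the wait budget is spent
            batch = [await self._queue.get()]
            deadline = loop.time() + self._max_wait_s
            while len(batch) < self._max_batch_size:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    item = await asyncio.wait_for(self._queue.get(), remaining)
                except asyncio.TimeoutError:
                    break
                batch.append(item)
            prompts = [prompt for prompt, _ in batch]
            try:
                # Run the blocking batch call in a thread, off the event loop
                results = await asyncio.to_thread(self._run_batch, prompts)
                for (_, future), result in zip(batch, results):
                    future.set_result(result)
            except Exception as exc:
                # Fail every caller in the batch rather than leave them hanging
                for _, future in batch:
                    if not future.done():
                        future.set_exception(exc)


def fake_model(prompts: List[str]) -> List[str]:
    # Stand-in for a batched model.generate() call
    return [p.upper() for p in prompts]


async def demo() -> List[str]:
    batcher = MicroBatcher(fake_model, max_batch_size=4, max_wait_s=0.05)
    return await asyncio.gather(*(batcher.submit(f"p{i}") for i in range(4)))


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The two knobs pull against each other: a larger `max_wait_s` produces fuller batches and better throughput, but every request can wait that long before inference starts, so keep it well below your latency budget.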

Michael John Peña

Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.