Windows AI Evolution: NPU, Recall, and On-Device Intelligence

This post shares practical, production-minded guidance on the Windows AI stack: NPUs, Recall, and on-device models. Some of the APIs shown are illustrative rather than shipping, and those are marked as such.

The Windows AI Stack

Applications
    │
    ▼
┌────────────────────────────────────────┐
│  Windows Copilot + AI APIs             │
├────────────────────────────────────────┤
│  Windows ML + ONNX Runtime             │
├────────────────────────────────────────┤
│  DirectML                              │
├────────────────────────────────────────┤
│  Hardware: NPU / GPU / CPU             │
└────────────────────────────────────────┘

Neural Processing Units (NPUs)

NPUs are dedicated accelerators for neural network inference. With ONNX Runtime, you choose the hardware through execution providers:

import onnxruntime as ort
import numpy as np

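# Check available execution providers
providers = ort.get_available_providers()
print(f"Available: {providers}")
# e.g. ['DmlExecutionProvider', 'CPUExecutionProvider'] with onnxruntime-directml installed
# DML = DirectML

# Load the model, preferring DirectML and falling back to CPU
session_options = ort.SessionOptions()
# The DirectML provider needs memory pattern optimization off and sequential execution
session_options.enable_mem_pattern = False
session_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
session = ort.InferenceSession(
    "model.onnx",
    sess_options=session_options,
    providers=['DmlExecutionProvider', 'CPUExecutionProvider']
)

# Run inference on whichever provider the session selected
def run_inference(input_data: np.ndarray):
    # Read the input name from the model instead of assuming it is "input"
    input_name = session.get_inputs()[0].name
    return session.run(None, {input_name: input_data})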

# On-device inference on an NPU provides:
# - Lower latency than a cloud round trip
# - No network dependency
# - Better privacy (data stays local)
# - Typically lower power consumption than a GPU
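
Because the list of available providers differs from machine to machine, it can help to compute the provider order rather than hard-code it. The following is a minimal sketch: `pick_providers` and the `PREFERRED` tuple are hypothetical names, not part of ONNX Runtime, and the function takes the list you would get from `ort.get_available_providers()` as a plain argument so it can be tested without any hardware.

```python
from typing import Iterable, List

# Preference order assumed for this sketch: DirectML first, CPU last.
PREFERRED = ("DmlExecutionProvider", "CPUExecutionProvider")


def pick_providers(available: Iterable[str],
                   preferred: Iterable[str] = PREFERRED) -> List[str]:
    """Return the preferred providers that are available, in preference order.

    Hypothetical helper: the result always ends with the CPU provider so
    that a session can still be created when no accelerator is present.
    """
    available_set = set(available)
    chosen = [name for name in preferred if name in available_set]
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen
```

The result can be passed as the `providers` argument of `ort.InferenceSession`, which keeps the same code working on machines with and without an accelerator.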

Windows Recall for Developers

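Windows Recall, available on Copilot+ PCs, periodically captures snapshots of your screen and indexes them locally so you can search past activity. The API below is speculative and only illustrates what developer access could look like: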

# Hypothetical: Recall API for developers (speculative)

from windows.ai import RecallClient

recall = RecallClient()

# Search your activity history
results = recall.search(
    query="data pipeline architecture diagram",
    time_range="last_week",
    app_filter=["PowerPoint", "Visio", "Browser"]
)

for result in results:
    print(f"Found at: {result.timestamp}")
    print(f"App: {result.application}")
    print(f"Screenshot: {result.screenshot_path}")
    print(f"Extracted text: {result.text}")

# For data professionals:
# - Find that query you ran last Tuesday
# - Recover the dashboard configuration you were viewing
# - Search for specific data patterns you saw

On-Device Language Models

Small language models can run entirely on the local machine. The `windows.ai` module below is illustrative, not a shipping API:

from windows.ai import LocalLLM

# Initialize local model (runs on NPU)
llm = LocalLLM(
    model="phi-3-mini",  # Small but capable
    device="npu"
)

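# Placeholder input for the example
transcript_text = "Alice: The pipeline migration is on track. Bob: Testing starts Monday."

# Local inference - no cloud required
response = llm.generate(
    prompt="Summarize this meeting transcript:",
    context=transcript_text,
    max_tokens=200
)

# Benefits:
# - Works offline
# - Lower latency (illustrative figures: ~100 ms locally vs ~500 ms for a cloud round trip)
# - No API costs
# - Data never leaves device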

# Trade-offs:
# - Smaller model = less capable
# - Limited context window
# - Fixed model (no easy updates)
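
One common way to live with a limited context window is to split long input into pieces and summarize each piece separately. The sketch below shows only the splitting step. It assumes whitespace-separated words are an acceptable stand-in for tokens, which a real tokenizer would count differently, and `chunk_by_word_budget` is a hypothetical helper name.

```python
from typing import List


def chunk_by_word_budget(text: str, max_words: int) -> List[str]:
    """Split text into consecutive chunks of at most max_words words.

    Word count is a rough proxy for token count (an assumption of this
    sketch); leave headroom when mapping it to a model's real limit.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    words = text.split()
    return [
        " ".join(words[start:start + max_words])
        for start in range(0, len(words), max_words)
    ]
```

Each chunk can then be sent to the local model on its own, and the partial summaries combined in a final pass.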

Hybrid Local + Cloud AI

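# Hypothetical API: HybridAI is illustrative, not a shipping Windows API.
# azure_openai_client is assumed to be an already configured Azure OpenAI client.
from windows.ai import HybridAI

ai = HybridAI(
    local_model="phi-3-mini",
    cloud_model="gpt-4o",
    cloud_client=azure_openai_client
)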

async def smart_process(text: str, complexity: str = "auto"):
    """Route to appropriate model based on task."""

    if complexity == "auto":
        # Let the system decide
        return await ai.process(text)

    elif complexity == "simple":
        # Use local NPU
        return await ai.local.generate(text)

    else:
        # Use cloud for complex tasks
        return await ai.cloud.generate(text)

# Automatic routing based on:
# - Task complexity
# - Network availability
# - Latency requirements
# - Cost constraints
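
To make the four routing factors above concrete, here is a minimal rule-based sketch. Everything in it is an assumption for illustration: the `RoutingContext` fields, the use of prompt length as a proxy for complexity, and the default thresholds. A production router would likely measure these values rather than hard-code them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingContext:
    """Inputs for the routing decision (all fields are illustrative)."""
    prompt_words: int              # rough proxy for task complexity
    network_available: bool
    latency_budget_ms: int         # how long the caller is willing to wait
    cloud_budget_remaining: float  # remaining spend, in any currency unit


def choose_route(ctx: RoutingContext,
                 max_local_words: int = 500,
                 min_cloud_latency_ms: int = 500) -> str:
    """Return "local" or "cloud" by applying simple rules in order."""
    # Without a network connection or remaining budget, cloud is not an option.
    if not ctx.network_available or ctx.cloud_budget_remaining <= 0:
        return "local"
    # Latency budgets below the assumed cloud round trip favour the local model.
    if ctx.latency_budget_ms < min_cloud_latency_ms:
        return "local"
    # Otherwise escalate only when the task looks too large for the local model.
    return "cloud" if ctx.prompt_words > max_local_words else "local"
```

Ordering the rules from hard constraints (network, budget) to soft preferences (latency, size) keeps the decision easy to explain and to test.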

Windows ML for Data Processing

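import numpy as np
import onnxruntime as ort
import pandas as pd

# Train a model, export it to ONNX, and run it locally.
# The exported model is loaded with ONNX Runtime here; model.onnx is the
# same file you would hand to the Windows ML APIs.
class LocalMLPipeline:
    def __init__(self, model_path: str = "model.onnx"):
        self.model_path = model_path
        self.session = None

    def train(self, data: pd.DataFrame, target: str):
        """Train model and export to ONNX."""
        from sklearn.ensemble import RandomForestClassifier
        from skl2onnx import convert_sklearn
        from skl2onnx.common.data_types import FloatTensorType

        # All feature columns are assumed to be numeric
        X = data.drop(columns=[target]).to_numpy(dtype=np.float32)
        y = data[target]

        clf = RandomForestClassifier()
        clf.fit(X, y)

        # Convert to ONNX; the input tensor is named "input" and takes float32 rows
        onnx_model = convert_sklearn(
            clf,
            initial_types=[('input', FloatTensorType([None, X.shape[1]]))]
        )

        with open(self.model_path, "wb") as f:
            f.write(onnx_model.SerializeToString())

    def deploy_local(self):
        """Load the exported model for local inference."""
        # Tree ensembles use classical ML operators, so this sketch runs them on the CPU provider
        self.session = ort.InferenceSession(
            self.model_path,
            providers=['CPUExecutionProvider']
        )

    def predict(self, features: np.ndarray) -> np.ndarray:
        """Run inference and return the predicted labels."""
        if self.session is None:
            raise RuntimeError("Call deploy_local() before predict()")
        outputs = self.session.run(None, {"input": features.astype(np.float32)})
        # skl2onnx classifiers return the labels first, then the probabilities
        return outputs[0]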

DirectML for Custom AI

import torch
import torch_directml

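# Use the DirectML backend for PyTorch (targets DirectX 12 devices, typically the GPU)
device = torch_directml.device()

# MyCustomModel, dataloader, criterion, and optimizer are assumed to be defined elsewhere
model = MyCustomModel().to(device)

# Training on the DirectML device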
for batch in dataloader:
    inputs = batch['input'].to(device)
    targets = batch['target'].to(device)

    outputs = model(inputs)
    loss = criterion(outputs, targets)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# DirectML supports:
# - Training and inference
# - PyTorch and TensorFlow
# - Works on AMD, Intel, and NVIDIA hardware

Privacy and Security

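# On-device AI can enhance privacy
# Hypothetical API: SecureAI illustrates the idea and is not a shipping Windows API

from windows.ai import SecureAI

secure_ai = SecureAI(
    model="phi-3-mini",
    data_encryption=True,
    secure_enclave=True  # Hardware security
)

# await is only valid inside a coroutine, so wrap the call in one
async def summarize_locally(sensitive_document: str):
    # Sensitive data processing stays local
    return await secure_ai.process(
        data=sensitive_document,
        task="summarize",
        allow_cloud=False  # Data never leaves the device
    )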

# Security properties this design aims for:
# - Data encryption at rest and in transit
# - Secure enclave for model weights
# - No telemetry on processed data
# - Audit logging for compliance
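
Audit logging and "data stays local" can pull in opposite directions if the log itself stores sensitive text. One way to reconcile them is to record a fingerprint of the input instead of the input. The sketch below uses only the Python standard library; the `audit_record` name and the field names are assumptions for illustration, not part of any Windows API.

```python
import hashlib
import json
from datetime import datetime, timezone


def audit_record(document: str, task: str, allow_cloud: bool) -> str:
    """Build a JSON audit entry that identifies the input by hash, not content."""
    digest = hashlib.sha256(document.encode("utf-8")).hexdigest()
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "task": task,
        "allow_cloud": allow_cloud,
        "input_sha256": digest,          # fingerprint only; the text is not stored
        "input_chars": len(document),
    }
    return json.dumps(entry, sort_keys=True)
```

A plain hash can still be guessed for short, predictable inputs, so treat this as a starting point rather than a complete compliance control.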

Developer Considerations

When to Use On-Device AI

use_cases = {
    "on_device_preferred": [
        "Real-time text suggestions",
        "Local document search",
        "Privacy-sensitive analysis",
        "Offline capabilities needed",
        "Low-latency requirements"
    ],
    "cloud_preferred": [
        "Complex reasoning tasks",
        "Large context windows needed",
        "Access to latest models",
        "Heavy compute requirements",
        "Multi-modal advanced tasks"
    ],
    "hybrid": [
        "Start local, escalate to cloud",
        "Cloud train, local inference",
        "Batch cloud, real-time local"
    ]
}

Getting Started

  1. Check hardware: Verify NPU availability
  2. Install Windows ML SDK: Enable local AI development
  3. Export models to ONNX: Standard format for Windows ML
  4. Optimize for NPU: Use quantization and optimization tools
  5. Test offline: Ensure graceful degradation
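
Step 4 mentions quantization. As a rough illustration of the idea, and not the vendor-specific toolchain you would use for a particular NPU, the sketch below applies symmetric per-tensor int8 quantization to a list of weights in plain Python. The function names are hypothetical.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization: w is approximated by q * scale,
    with q an integer in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Avoid dividing by zero when every weight is 0 (or the list is empty).
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: List[int], scale: float) -> List[float]:
    """Map int8 values back to approximate float weights."""
    return [q * scale for q in quantized]
```

Each restored weight differs from the original by at most half of `scale`, which is the precision traded for a 4x smaller representation than float32.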

Windows AI brings intelligence to the edge. For data professionals, this means faster, more private analytics on your desktop.

Michael John Peña

Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.