May 9, 2024

The Rise of AI PCs: NPUs and Local AI Processing

The PC industry is undergoing a transformation with the emergence of AI-optimized hardware. Neural Processing Units (NPUs) are becoming standard in new laptops, enabling local AI inference without cloud dependency. Let’s explore what this means for developers.

What is an NPU?

NPUs (Neural Processing Units) are dedicated silicon for AI inference:

Workload Distribution:

CPU: General computing, complex logic
- Good at: Sequential tasks, branching logic
- AI performance: 1-5 TOPS
- Power: 15-65W (laptop)

GPU: Parallel compute, graphics
- Good at: Matrix operations, training
- AI performance: 50-200 TOPS
- Power: 30-150W (discrete)

NPU: AI inference, always-on AI
- Good at: Efficient inference, low power
- AI performance: 10-40+ TOPS
- Power: 5-15W

Current NPU Landscape

Several chips now include NPUs:

Chip	NPU Performance	Platform
Intel Core Ultra	10+ TOPS	Windows
AMD Ryzen AI	10+ TOPS	Windows
Apple M3	18 TOPS	macOS
Qualcomm Snapdragon X	45 TOPS	Windows (coming)

Developing for NPU

Windows AI APIs

using System;
using Microsoft.AI.MachineLearning;

// Windows ML does not enumerate devices; you request a device kind instead.
// DirectXHighPerformance asks for the highest-performance DirectX adapter,
// which is usually a GPU and is not guaranteed to be an NPU.
var device = new LearningModelDevice(LearningModelDeviceKind.DirectXHighPerformance);
Console.WriteLine("DirectX device selected");

// Load model (LoadFromFilePath is synchronous, so there is nothing to await)
var model = LearningModel.LoadFromFilePath("model.onnx");
var session = new LearningModelSession(model, device);

DirectML Integration

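using System.Collections.Generic;
using System.Linq;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

// Create session with DirectML execution provider (device 0).
// The DirectML provider requires memory patterns off and sequential execution.
var sessionOptions = new SessionOptions();
sessionOptions.EnableMemoryPattern = false;
sessionOptions.ExecutionMode = ExecutionMode.ORT_SEQUENTIAL;
sessionOptions.AppendExecutionProvider_DML(0);

using var session = new InferenceSession("model.onnx", sessionOptions);

// Example input; the name "input" and the shape must match your model
var inputTensor = new DenseTensor<float>(new[] { 1, 3, 224, 224 });

// Run inference
var inputs = new List<NamedOnnxValue>
{
    NamedOnnxValue.CreateFromTensor("input", inputTensor)
};

using var results = session.Run(inputs);
var output = results.First().AsTensor<float>();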

Python with ONNX Runtime

import onnxruntime as ort
import numpy as np

# Check available providers
print(ort.get_available_providers())
# With the onnxruntime-directml package installed, this typically includes:
# ['DmlExecutionProvider', 'CPUExecutionProvider']

# Create session with DirectML (GPU, or NPU where supported), falling back to CPU
session = ort.InferenceSession(
    "model.onnx",
    providers=['DmlExecutionProvider', 'CPUExecutionProvider']
)

# Run inference; read the input name from the model rather than hard-coding it
input_name = session.get_inputs()[0].name
input_data = np.random.randn(1, 3, 224, 224).astype(np.float32)
outputs = session.run(None, {input_name: input_data})
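
The provider list can also be built defensively instead of hard-coded. The following is a minimal sketch, assuming the caller passes in the names reported by `ort.get_available_providers()`; `choose_providers` and `PREFERRED` are hypothetical names for illustration, not part of ONNX Runtime.

```python
# Hypothetical helper: order execution providers by preference and keep only
# those the installed ONNX Runtime build reports as available.
PREFERRED = ["DmlExecutionProvider", "CPUExecutionProvider"]


def choose_providers(available, preferred=PREFERRED):
    """Return the preferred providers that are available, with CPU as the last resort."""
    chosen = [p for p in preferred if p in available]
    # The CPU provider ships with every ONNX Runtime build, so always keep it.
    if "CPUExecutionProvider" not in chosen:
        chosen.append("CPUExecutionProvider")
    return chosen


# Example with a CPU-only build (the list would normally come from
# ort.get_available_providers()).
print(choose_providers(["CPUExecutionProvider"]))
```

The returned list can be passed as the `providers` argument when creating the session.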

Model Optimization for NPU

Quantization

from onnxruntime.quantization import quantize_dynamic, QuantType

# Quantize model for better NPU performance
quantize_dynamic(
    model_input="model_fp32.onnx",
    model_output="model_int8.onnx",
    weight_type=QuantType.QInt8
)
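
To show what int8 quantization does to the numbers, here is a sketch of symmetric quantization for a single list of weights. It illustrates the arithmetic only, under the assumption of one scale per tensor; it is not a description of how `quantize_dynamic` works internally, and the function names are illustrative.

```python
def quantize_int8(weights):
    """Map floats to int8 values using one symmetric per-tensor scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest step and clamp to the symmetric int8 range.
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized, scale):
    """Recover approximate floats from int8 values."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# Each restored value lies within half a quantization step of the original.
for original, approx in zip(weights, restored):
    print(f"{original:+.4f} -> {approx:+.4f}")
```

Each weight is stored in one byte instead of four, at the cost of a rounding error of at most half the scale.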

ONNX Model Optimization

from onnxruntime.transformers import optimizer

# Optimize a transformer model for inference.
# num_heads=12 and hidden_size=768 match BERT-base; adjust them for other models.
optimized_model = optimizer.optimize_model(
    "model.onnx",
    model_type='bert',
    num_heads=12,
    hidden_size=768
)

optimized_model.save_model_to_file("model_optimized.onnx")

Performance Comparison

Running small language models on different hardware:

Hardware	Tokens/sec	Power (W)	Efficiency (tokens/sec/W)
CPU only	8-12	25	0.4
GPU (integrated)	25-35	35	0.9
NPU	20-35	8	3.5

The NPU delivers throughput comparable to the integrated GPU at under a quarter of the power, which is what matters for battery life.
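
The efficiency column can be approximated from the other two. This sketch assumes the midpoint of each tokens/sec range is representative, which the table does not state, so the results match the listed values only approximately.

```python
# Rows from the table above: (hardware, tokens/sec low, tokens/sec high, watts)
ROWS = [
    ("CPU only", 8, 12, 25),
    ("GPU (integrated)", 25, 35, 35),
    ("NPU", 20, 35, 8),
]


def efficiency(low, high, watts):
    """Tokens per second per watt, using the midpoint of the throughput range."""
    return ((low + high) / 2) / watts


for name, low, high, watts in ROWS:
    print(f"{name}: {efficiency(low, high, watts):.2f} tokens/sec/W")
```

Under the midpoint assumption the NPU comes out at roughly four times the integrated GPU's tokens per watt.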

Building NPU-Aware Applications

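using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.ML.OnnxRuntime;
using Microsoft.ML.OnnxRuntime.Tensors;

public class InferenceResult
{
    public float[] Output { get; set; }
    public bool UsedNPU { get; set; }
}

public class AIWorkloadManager
{
    // True when the DirectML provider was registered. DirectML may still pick
    // a GPU rather than an NPU, so read this as "hardware-accelerated".
    private readonly bool _npuAvailable;
    private readonly SessionOptions _options;

    public AIWorkloadManager()
    {
        _options = new SessionOptions();

        // Try to use DirectML (NPU/GPU)
        try
        {
            _options.AppendExecutionProvider_DML(0);
            // Settings the DirectML provider requires
            _options.EnableMemoryPattern = false;
            _options.ExecutionMode = ExecutionMode.ORT_SEQUENTIAL;
            _npuAvailable = true;
        }
        catch (Exception)
        {
            // Fall back to CPU
            _npuAvailable = false;
        }
    }

    public Task<InferenceResult> RunInference(string modelPath, float[] input)
    {
        // session.Run is synchronous, so run it on a worker thread.
        // For repeated calls, cache the InferenceSession instead of reloading the model.
        return Task.Run(() =>
        {
            using var session = new InferenceSession(modelPath, _options);

            var inputTensor = new DenseTensor<float>(input, new[] { 1, input.Length });
            var inputs = new List<NamedOnnxValue>
            {
                NamedOnnxValue.CreateFromTensor("input", inputTensor)
            };

            using var results = session.Run(inputs);
            return new InferenceResult
            {
                Output = results.First().AsTensor<float>().ToArray(),
                UsedNPU = _npuAvailable
            };
        });
    }
}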

Use Cases for Local AI

Document Processing

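import onnxruntime as ort

def process_document_locally(image_path: str) -> dict:
    """Process a document image using a local AI model."""
    # Load optimized model (in production, create the session once and reuse it)
    session = ort.InferenceSession(
        "document_ai_int8.onnx",
        providers=['DmlExecutionProvider', 'CPUExecutionProvider']
    )

    # Preprocess image (preprocess_image is an application-specific helper)
    image = preprocess_image(image_path)

    # Run inference locally; "image" must match the model's input name
    outputs = session.run(None, {"image": image})

    # parse_document_output is an application-specific helper
    return parse_document_output(outputs)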

Real-Time Transcription

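import numpy as np
import onnxruntime as ort

# Create the session once; reloading the model for every audio chunk
# would dominate latency in a real-time pipeline.
_whisper_session = ort.InferenceSession(
    "whisper_tiny_int8.onnx",
    providers=['DmlExecutionProvider', 'CPUExecutionProvider']
)

def transcribe_audio_locally(audio_chunk: np.ndarray) -> str:
    """Transcribe audio using a local Whisper model."""
    # Process audio (extract_audio_features and decode_tokens are
    # application-specific helpers)
    features = extract_audio_features(audio_chunk)
    outputs = _whisper_session.run(None, {"audio": features})

    return decode_tokens(outputs[0])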

What Developers Should Know

Not all ops are NPU-accelerated - Complex ops may fall back to CPU
Batch size matters - Performance varies with batch size, so benchmark the sizes you plan to ship
Model format - ONNX with int8 quantization works best
Power awareness - NPU saves battery, important for laptops
Fallback strategy - Always have CPU fallback for compatibility
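
The fallback point can be sketched as a small loop that tries providers in order. This is an illustration under stated assumptions: `create_with_fallback` and `fake_factory` are hypothetical names, and the factory is injected so the example runs without ONNX Runtime installed. Whether a given runtime raises or silently falls back when a provider is missing depends on the runtime and version.

```python
def create_with_fallback(factory, provider_chain):
    """Try each provider in order; return (session, provider) for the first that works.

    `factory` is any callable that takes a provider name and returns a session,
    for example: lambda p: ort.InferenceSession("model.onnx", providers=[p])
    """
    errors = []
    for provider in provider_chain:
        try:
            return factory(provider), provider
        except Exception as exc:  # provider missing, or model unsupported on it
            errors.append(f"{provider}: {exc}")
    raise RuntimeError("No execution provider could load the model: " + "; ".join(errors))


# Stand-in factory for demonstration: pretend DirectML is unavailable.
def fake_factory(provider):
    if provider == "DmlExecutionProvider":
        raise RuntimeError("DirectML not available")
    return f"session on {provider}"


session, used = create_with_fallback(
    fake_factory, ["DmlExecutionProvider", "CPUExecutionProvider"]
)
print(used)
```

Returning the provider name alongside the session lets the application log or display which hardware path was actually taken.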

The Future

Microsoft and partners are expected to announce more AI PC initiatives at Build 2024 (May 21-23). We anticipate:

New Copilot features leveraging local AI
Enhanced Windows AI APIs
More NPU-optimized applications
Better developer tools for edge AI

Conclusion

AI PCs with NPUs represent a significant shift toward local AI processing. Start experimenting with ONNX Runtime and DirectML to prepare your applications for this hardware evolution.