
Phi-3 Family: Microsoft's Small Language Models

Microsoft’s Phi-3 family represents a significant shift in thinking about language models. Instead of assuming that bigger is always better, Phi-3 shows that smaller, well-trained models can achieve impressive results.

The Phi-3 Family

| Model | Parameters | Context | Use Case |
|---|---|---|---|
| Phi-3-mini | 3.8B | 4K/128K | Mobile, edge devices |
| Phi-3-small | 7B | 8K/128K | Balanced performance |
| Phi-3-medium | 14B | 4K/128K | Complex reasoning |
| Phi-3-vision | 4.2B | 128K | Multimodal tasks |
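
For reference, these are the Hugging Face repo IDs the examples below pull from (the names as I know them; double-check the exact context-length variant you want on the Hub):

# Hugging Face repo IDs for the Phi-3 family
PHI3_MODELS = {
    "mini": "microsoft/Phi-3-mini-4k-instruct",        # also: Phi-3-mini-128k-instruct
    "small": "microsoft/Phi-3-small-8k-instruct",      # also: Phi-3-small-128k-instruct
    "medium": "microsoft/Phi-3-medium-4k-instruct",    # also: Phi-3-medium-128k-instruct
    "vision": "microsoft/Phi-3-vision-128k-instruct",
}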

Why Phi-3 Matters

Quality per Parameter

Phi-3-mini (3.8B) outperforms many 7B models on benchmarks:

| Benchmark | Phi-3-mini | Llama-3-8B | Mixtral 8x7B |
|---|---|---|---|
| MMLU | 68.8 | 66.6 | 70.6 |
| GSM8K | 82.5 | 79.6 | 74.4 |
| HumanEval | 58.5 | 62.2 | 40.2 |

Efficient Training

Phi-3 is trained on high-quality data (textbook-style synthetic content and heavily filtered web data) rather than on raw internet-scale scrapes.

Getting Started with Phi-3

Using Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model_id = "microsoft/Phi-3-mini-4k-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=500
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain the CAP theorem in distributed systems."}
]

output = pipe(messages)
print(output[0]['generated_text'][-1]['content'])
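
If you want tokens to appear as they are generated rather than all at once, transformers' TextStreamer works with the same model and tokenizer; a minimal sketch:

from transformers import TextStreamer

# Stream decoded tokens to stdout as they are generated
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt"
).to(model.device)

model.generate(input_ids, max_new_tokens=500, streamer=streamer)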

Using Azure AI

from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="https://your-endpoint.inference.ai.azure.com",
    credential=AzureKeyCredential("your-key")
)

response = client.complete(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Python function to calculate factorial."}
    ],
    model="Phi-3-mini-4k-instruct",
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)
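
In a real deployment you wouldn't hardcode the endpoint or key; pulling them from environment variables is the simplest fix (the variable names here are my own convention, not anything the SDK requires):

import os

# Read connection details from the environment instead of hardcoding secrets
client = ChatCompletionsClient(
    endpoint=os.environ["AZURE_AI_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["AZURE_AI_KEY"])
)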

Using ONNX Runtime

import onnxruntime_genai as og

# Load quantized model
model = og.Model("./phi-3-mini-4k-instruct-onnx")
tokenizer = og.Tokenizer(model)

# Create generator
params = og.GeneratorParams(model)
params.set_search_options(max_length=500, temperature=0.7)

prompt = "<|user|>\nExplain REST APIs<|end|>\n<|assistant|>\n"
tokens = tokenizer.encode(prompt)
params.input_ids = tokens

generator = og.Generator(model, params)

output_tokens = []
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    output_tokens.append(generator.get_next_tokens()[0])

response = tokenizer.decode(output_tokens)
print(response)
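
Wrapping that loop in a helper that applies Phi-3's chat template makes it easier to reuse; this is just the code above refactored, nothing new API-wise:

def generate(prompt: str, max_length: int = 500) -> str:
    """Run the ONNX generation loop with Phi-3's chat format applied."""
    formatted = f"<|user|>\n{prompt}<|end|>\n<|assistant|>\n"
    params = og.GeneratorParams(model)
    params.set_search_options(max_length=max_length, temperature=0.7)
    params.input_ids = tokenizer.encode(formatted)

    generator = og.Generator(model, params)
    tokens = []
    while not generator.is_done():
        generator.compute_logits()
        generator.generate_next_token()
        tokens.append(generator.get_next_tokens()[0])
    return tokenizer.decode(tokens)

print(generate("What is a message queue?"))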

Phi-3 Vision

Multimodal capabilities in a small package:

from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

model_id = "microsoft/Phi-3-vision-128k-instruct"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

def analyze_image(image_path: str, question: str) -> str:
    image = Image.open(image_path)

    messages = [
        {"role": "user", "content": f"<|image_1|>\n{question}"}
    ]

    prompt = processor.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    inputs = processor(prompt, [image], return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=500,
            do_sample=True,
            temperature=0.7
        )

    response = processor.batch_decode(
        outputs[:, inputs['input_ids'].shape[1]:],
        skip_special_tokens=True
    )[0]

    return response

# Usage
result = analyze_image("architecture_diagram.png", "Describe this system architecture.")
print(result)

Quantization for Edge

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)

# Model now uses ~2GB RAM instead of ~8GB
print(f"Model size: {model.get_memory_footprint() / 1e9:.2f} GB")

Fine-tuning Phi-3

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
import torch

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    trust_remote_code=True
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Training configuration
training_args = TrainingArguments(
    output_dir="./phi3-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_steps=100,
    logging_steps=10,
    save_steps=500,
    fp16=True
)

# Trainer (train_dataset is assumed to be an instruction dataset you've already prepared)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_seq_length=2048
)

trainer.train()
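
Once training finishes, you can keep just the small LoRA adapter or merge it back into the base model for standalone serving; a sketch using peft's standard merge flow:

from peft import PeftModel

# Save only the adapter weights (a few tens of MB)
trainer.save_model("./phi3-finetuned/adapter")

# Reload the base model and fold the adapter into it for deployment
base = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
merged = PeftModel.from_pretrained(base, "./phi3-finetuned/adapter")
merged = merged.merge_and_unload()
merged.save_pretrained("./phi3-finetuned/merged")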

Use Cases for Phi-3

Code Assistant

def code_completion(partial_code: str) -> str:
    prompt = f"""Complete this Python code:

```python
{partial_code}
```

Completed code:"""

    response = pipe([{"role": "user", "content": prompt}])
    return response[0]['generated_text'][-1]['content']

# Usage
code = """def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:"""

completed = code_completion(code)
print(completed)

Document Q&A

def answer_question(context: str, question: str) -> str:
    prompt = f"""Context:
{context}

Question: {question}

Answer based only on the provided context:"""

    response = pipe([
        {"role": "system", "content": "Answer questions based only on the provided context."},
        {"role": "user", "content": prompt}
    ])
    return response[0]['generated_text'][-1]['content']
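
Usage follows the same pattern as the code assistant:

context = """Phi-3-mini is a 3.8B parameter model available in 4K and 128K
context-length variants, and it can run on-device via ONNX Runtime."""

print(answer_question(context, "Which context lengths does Phi-3-mini support?"))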

Structured Output

import json

def extract_entities(text: str) -> dict:
    prompt = f"""Extract entities from this text as JSON:
{{
    "people": [],
    "organizations": [],
    "locations": [],
    "dates": []
}}

Text: {text}

JSON:"""

    response = pipe([{"role": "user", "content": prompt}])
    output = response[0]['generated_text'][-1]['content']

    # Parse JSON from response
    try:
        return json.loads(output)
    except json.JSONDecodeError:
        return {"error": "Failed to parse", "raw": output}

Phi-3 vs GPT-4o: When to Use Which

| Scenario | Phi-3 | GPT-4o |
|---|---|---|
| Offline required | Yes | No |
| Cost-sensitive | Yes | Depends |
| Complex reasoning | Limited | Yes |
| Multimodal | Phi-3-vision | Yes |
| Fine-tuning needed | Yes (easy) | No |
| Data privacy | Yes | Requires Azure |
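
That table translates directly into a routing rule. A hedged sketch; `call_phi3` and `call_gpt4o` are placeholders for whichever local pipeline and hosted client you're using:

def route_request(prompt: str, *, offline: bool = False,
                  sensitive_data: bool = False,
                  complex_reasoning: bool = False) -> str:
    """Pick a model based on the trade-offs in the table above."""
    if offline or sensitive_data:
        return call_phi3(prompt)    # placeholder: local Phi-3 inference
    if complex_reasoning:
        return call_gpt4o(prompt)   # placeholder: hosted GPT-4o call
    return call_phi3(prompt)        # default to the cheaper local model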

Deployment Options

  1. Local device - Direct inference (see the sketch after this list)
  2. Azure AI - Serverless API
  3. Azure ML - Managed endpoint
  4. Edge devices - IoT Hub integration
  5. Mobile - ONNX Runtime Mobile
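
As a concrete sketch of option 1, here's a minimal local HTTP wrapper around the transformers pipeline from earlier (FastAPI is my choice for the example, not something Phi-3 requires):

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    message: str

@app.post("/chat")
def chat(req: ChatRequest):
    # `pipe` is the text-generation pipeline from the Transformers section
    output = pipe([{"role": "user", "content": req.message}])
    return {"response": output[0]["generated_text"][-1]["content"]}

# Run with: uvicorn app:app --port 8000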

What’s Next

Tomorrow I’ll cover Small Language Models more broadly and their role in enterprise AI.

Michael John Peña

Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.