# Phi-3 Family: Microsoft's Small Language Models
Microsoft’s Phi-3 family represents a significant shift in how we think about language models. Instead of assuming that bigger is always better, Phi-3 shows that smaller, carefully trained models can deliver impressive results.
## The Phi-3 Family
| Model | Parameters | Context (tokens) | Use Case |
|---|---|---|---|
| Phi-3-mini | 3.8B | 4K/128K | Mobile, edge devices |
| Phi-3-small | 7B | 8K/128K | Balanced performance |
| Phi-3-medium | 14B | 4K/128K | Complex reasoning |
| Phi-3-vision | 4.2B | 128K | Multimodal tasks |
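Each variant ships as separate Hugging Face checkpoints per context length. Below is a minimal sketch of picking a repo ID by size and context need; the repo names are the ones published on the Hub at launch, and `pick_checkpoint` is just an illustrative helper, not part of any SDK.

```python
# Hugging Face repo IDs for the Phi-3 variants above.
# The short-context and 128K-context models are separate checkpoints.
PHI3_CHECKPOINTS = {
    ("mini", "4k"): "microsoft/Phi-3-mini-4k-instruct",
    ("mini", "128k"): "microsoft/Phi-3-mini-128k-instruct",
    ("small", "8k"): "microsoft/Phi-3-small-8k-instruct",
    ("small", "128k"): "microsoft/Phi-3-small-128k-instruct",
    ("medium", "4k"): "microsoft/Phi-3-medium-4k-instruct",
    ("medium", "128k"): "microsoft/Phi-3-medium-128k-instruct",
    ("vision", "128k"): "microsoft/Phi-3-vision-128k-instruct",
}

def pick_checkpoint(size: str, long_context: bool = False) -> str:
    """Return a repo ID for a model size, preferring the 128K variant if requested."""
    short = {"mini": "4k", "small": "8k", "medium": "4k", "vision": "128k"}
    return PHI3_CHECKPOINTS[(size, "128k" if long_context else short[size])]

print(pick_checkpoint("mini"))           # microsoft/Phi-3-mini-4k-instruct
print(pick_checkpoint("medium", True))   # microsoft/Phi-3-medium-128k-instruct
```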
## Why Phi-3 Matters
### Quality per Parameter
Phi-3-mini, at just 3.8B parameters, competes with and often beats much larger models on standard benchmarks:
| Benchmark | Phi-3-mini | Llama-3-8B | Mixtral 8x7B |
|---|---|---|---|
| MMLU | 68.8 | 66.6 | 70.6 |
| GSM8K | 82.5 | 79.6 | 74.4 |
| HumanEval | 58.5 | 62.2 | 40.2 |
### Efficient Training
Phi-3 gets there through data quality rather than raw internet scale: it is trained on heavily filtered web content and synthetic, textbook-style material.
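Microsoft hasn't published the exact curation pipeline, but the idea is easy to picture: score documents for "textbook-like" quality and keep only the good ones. Here is a toy sketch, with invented signals and thresholds purely for illustration:

```python
import re

def looks_educational(text: str, min_words: int = 50) -> bool:
    """Toy quality filter: keep text that reads like explanatory prose.

    The real Phi-3 curation pipeline is not public; these signals and
    thresholds are invented for illustration.
    """
    words = text.split()
    if len(words) < min_words:
        return False
    # Prefer prose over link farms and navigation debris.
    alpha_ratio = sum(w.isalpha() for w in words) / len(words)
    if alpha_ratio < 0.6:
        return False
    # Reward explanatory cue phrases typical of textbook-like writing.
    cues = re.findall(r"\b(because|therefore|for example|which means)\b", text.lower())
    return len(cues) >= 1

spam = "Click here! Buy now!! http://spam.example 12345"
good = ("A binary search halves the interval at each step, which means it runs "
        "in O(log n) time. For example, searching a million items takes about "
        "20 steps. ") * 3
kept = [doc for doc in (spam, good) if looks_educational(doc)]
print(f"kept {len(kept)} of 2 documents")
```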
## Getting Started with Phi-3
### Using Transformers
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

model_id = "microsoft/Phi-3-mini-4k-instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    max_new_tokens=500
)

messages = [
    {"role": "system", "content": "You are a helpful AI assistant."},
    {"role": "user", "content": "Explain the CAP theorem in distributed systems."}
]

output = pipe(messages)
print(output[0]['generated_text'][-1]['content'])
```
### Using Azure AI
```python
from azure.ai.inference import ChatCompletionsClient
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="https://your-endpoint.inference.ai.azure.com",
    credential=AzureKeyCredential("your-key")
)

response = client.complete(
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Write a Python function to calculate factorial."}
    ],
    model="Phi-3-mini-4k-instruct",
    temperature=0.7,
    max_tokens=500
)

print(response.choices[0].message.content)
```
### Using ONNX Runtime
```python
import onnxruntime_genai as og

# Load quantized model
model = og.Model("./phi-3-mini-4k-instruct-onnx")
tokenizer = og.Tokenizer(model)

# Create generator
params = og.GeneratorParams(model)
params.set_search_options(max_length=500, temperature=0.7)

prompt = "<|user|>\nExplain REST APIs<|end|>\n<|assistant|>\n"
tokens = tokenizer.encode(prompt)
params.input_ids = tokens

generator = og.Generator(model, params)

output_tokens = []
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    output_tokens.append(generator.get_next_tokens()[0])

response = tokenizer.decode(output_tokens)
print(response)
```
## Phi-3 Vision
Multimodal capabilities in a small package:
```python
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image
import torch

model_id = "microsoft/Phi-3-vision-128k-instruct"

processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)

def analyze_image(image_path: str, question: str) -> str:
    image = Image.open(image_path)

    messages = [
        {"role": "user", "content": f"<|image_1|>\n{question}"}
    ]
    prompt = processor.tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    inputs = processor(prompt, [image], return_tensors="pt").to(model.device)

    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=500,
            do_sample=True,
            temperature=0.7
        )

    response = processor.batch_decode(
        outputs[:, inputs['input_ids'].shape[1]:],
        skip_special_tokens=True
    )[0]
    return response

# Usage
result = analyze_image("architecture_diagram.png", "Describe this system architecture.")
print(result)
```
## Quantization for Edge
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# 4-bit quantization
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    quantization_config=quantization_config,
    device_map="auto",
    trust_remote_code=True
)

# Model now uses ~2GB RAM instead of ~8GB
print(f"Model size: {model.get_memory_footprint() / 1e9:.2f} GB")
```
## Fine-tuning Phi-3
```python
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model
from trl import SFTTrainer
import torch

# Load base model
model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    trust_remote_code=True
)

# Configure LoRA (Phi-3 fuses Q/K/V into a single qkv_proj module)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["qkv_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Training configuration
training_args = TrainingArguments(
    output_dir="./phi3-finetuned",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=2e-4,
    warmup_steps=100,
    logging_steps=10,
    save_steps=500,
    fp16=True
)

# Trainer (train_dataset is your own instruction dataset, prepared separately)
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    tokenizer=tokenizer,
    max_seq_length=2048
)
trainer.train()
```
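After training, the LoRA adapter has to be reloaded on top of the base model for inference. Here's a sketch using `peft`; the adapter path assumes the `output_dir` above, and in practice you would point it at the checkpoint folder the trainer actually wrote.

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Reload the base model, then attach the saved LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-4k-instruct",
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
tuned = PeftModel.from_pretrained(base, "./phi3-finetuned")  # adjust to your checkpoint path
tuned = tuned.merge_and_unload()  # optional: fold LoRA weights into the base model

tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3-mini-4k-instruct", trust_remote_code=True)
inputs = tokenizer("Summarize the CAP theorem in one sentence.", return_tensors="pt").to(tuned.device)
with torch.no_grad():
    out = tuned.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```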
## Use Cases for Phi-3
### Code Assistant
````python
def code_completion(partial_code: str) -> str:
    prompt = f"""Complete this Python code:
```python
{partial_code}
```
Completed code:"""
    response = pipe([{"role": "user", "content": prompt}])
    return response[0]['generated_text'][-1]['content']

# Usage
code = """def binary_search(arr, target):
    left, right = 0, len(arr) - 1
    while left <= right:"""
completed = code_completion(code)
print(completed)
````
### Document Q&A
```python
def answer_question(context: str, question: str) -> str:
    prompt = f"""Context:
{context}
Question: {question}
Answer based only on the provided context:"""
    response = pipe([
        {"role": "system", "content": "Answer questions based only on the provided context."},
        {"role": "user", "content": prompt}
    ])
    return response[0]['generated_text'][-1]['content']
```
### Structured Output
```python
import json

def extract_entities(text: str) -> dict:
    prompt = f"""Extract entities from this text as JSON:
{{
  "people": [],
  "organizations": [],
  "locations": [],
  "dates": []
}}
Text: {text}
JSON:"""
    response = pipe([{"role": "user", "content": prompt}])
    output = response[0]['generated_text'][-1]['content']

    # Parse JSON from response
    try:
        return json.loads(output)
    except json.JSONDecodeError:
        return {"error": "Failed to parse", "raw": output}
```
## Phi-3 vs GPT-4o: When to Use Which
| Scenario | Phi-3 | GPT-4o |
|---|---|---|
| Offline required | Yes | No |
| Cost-sensitive | Yes | Depends |
| Complex reasoning | Limited | Yes |
| Multimodal | Phi-3-vision | Yes |
| Fine-tuning needed | Yes (easy) | No |
| Data privacy | Yes | Requires Azure |
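One way to act on this table is a small router that keeps offline or privacy-sensitive requests on a local Phi-3 and escalates complex reasoning to GPT-4o. Here's a sketch with placeholder callables; the `Request` fields and both client functions are illustrative, not a real API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Request:
    prompt: str
    needs_complex_reasoning: bool = False
    contains_sensitive_data: bool = False
    must_run_offline: bool = False

def route(request: Request,
          phi3_local: Callable[[str], str],
          gpt4o_cloud: Callable[[str], str]) -> str:
    """Route per the table above: keep offline/private/cheap requests on local
    Phi-3, escalate complex reasoning to GPT-4o when the cloud is allowed."""
    if request.must_run_offline or request.contains_sensitive_data:
        return phi3_local(request.prompt)
    if request.needs_complex_reasoning:
        return gpt4o_cloud(request.prompt)
    return phi3_local(request.prompt)  # default to the cheaper local model

# Usage with stand-in callables (swap in the pipelines/clients from earlier sections)
reply = route(
    Request("Draft a polite meeting reminder.", contains_sensitive_data=True),
    phi3_local=lambda p: f"[phi-3] {p}",
    gpt4o_cloud=lambda p: f"[gpt-4o] {p}",
)
print(reply)
```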
## Deployment Options
- Local device - Direct inference
- Azure AI - Serverless API
- Azure ML - Managed endpoint
- Edge devices - IoT Hub integration
- Mobile - ONNX Runtime Mobile
## What’s Next
Tomorrow I’ll cover Small Language Models more broadly and their role in enterprise AI.