GPT-4 vs GPT-3.5: A Practical Comparison
After a day of hands-on testing with GPT-4, I ran systematic comparisons against GPT-3.5. Here are concrete examples showing where GPT-4 excels and where GPT-3.5 remains sufficient.
Test Setup
```python
import openai
import time
from dataclasses import dataclass

@dataclass
class TestResult:
    model: str
    prompt: str
    response: str
    tokens: int
    latency: float
    cost: float

def test_both_models(prompt: str) -> tuple[TestResult, TestResult]:
    """Test the same prompt on both models."""
    results = []
    for model, deployment in [("gpt-35-turbo", "gpt-35-turbo"), ("gpt-4", "gpt-4")]:
        start = time.time()
        response = openai.ChatCompletion.create(
            engine=deployment,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1,
        )
        latency = time.time() - start
        tokens = response.usage.total_tokens

        # Calculate cost
        if model == "gpt-4":
            cost = (response.usage.prompt_tokens * 0.03
                    + response.usage.completion_tokens * 0.06) / 1000
        else:
            cost = tokens * 0.002 / 1000

        results.append(TestResult(
            model=model,
            prompt=prompt,
            response=response.choices[0].message.content,
            tokens=tokens,
            latency=latency,
            cost=cost,
        ))
    return tuple(results)
```
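Each test below was run through this harness. A typical invocation looks like the following (the printed summary is my own sketch; it assumes the Azure OpenAI credentials and both deployments are already configured on the `openai` module):

```python
# Hypothetical usage of the harness above
gpt35_result, gpt4_result = test_both_models(
    "Write a SQL query to find customers who made purchases in 3 consecutive months..."
)

for r in (gpt35_result, gpt4_result):
    print(f"{r.model}: {r.latency:.1f}s, {r.tokens} tokens, ${r.cost:.4f}")
    print(r.response[:200])  # first 200 characters of the answer
```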
Test 1: Complex SQL Generation
Prompt: “Write a SQL query to find customers who made purchases in 3 consecutive months, had average order value above $100, and whose first purchase was in 2022. Tables: customers(id, name, signup_date), orders(id, customer_id, order_date, amount)”
GPT-3.5 Response:
```sql
SELECT c.id, c.name
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE YEAR(o.order_date) = 2022
GROUP BY c.id, c.name
HAVING AVG(o.amount) > 100
   AND COUNT(DISTINCT MONTH(o.order_date)) >= 3;
```
GPT-4 Response:
```sql
WITH monthly_orders AS (
    SELECT
        customer_id,
        DATE_TRUNC('month', order_date) as order_month,
        SUM(amount) as monthly_total,
        COUNT(*) as order_count
    FROM orders
    GROUP BY customer_id, DATE_TRUNC('month', order_date)
),
consecutive_months AS (
    SELECT
        customer_id,
        order_month,
        order_month - INTERVAL '1 month' * ROW_NUMBER() OVER (
            PARTITION BY customer_id ORDER BY order_month
        ) as group_id
    FROM monthly_orders
),
customers_with_consecutive AS (
    SELECT customer_id
    FROM consecutive_months
    GROUP BY customer_id, group_id
    HAVING COUNT(*) >= 3
),
first_purchase AS (
    SELECT customer_id, MIN(order_date) as first_order_date
    FROM orders
    GROUP BY customer_id
    HAVING EXTRACT(YEAR FROM MIN(order_date)) = 2022
),
avg_order_value AS (
    SELECT customer_id
    FROM orders
    GROUP BY customer_id
    HAVING AVG(amount) > 100
)
SELECT DISTINCT c.id, c.name
FROM customers c
JOIN customers_with_consecutive cc ON c.id = cc.customer_id
JOIN first_purchase fp ON c.id = fp.customer_id
JOIN avg_order_value aov ON c.id = aov.customer_id;
```
Analysis: GPT-3.5 missed the “consecutive months” requirement: it counted three distinct months but never checked that they were adjacent, and it also filtered every order to 2022 instead of checking only the first purchase date. GPT-4 correctly identified the need for window functions (the gaps-and-islands pattern) to detect consecutive periods.
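The consecutive-month logic GPT-4 produced relies on the gaps-and-islands trick: subtract a row number from each month, and every run of consecutive months collapses to the same group key. A minimal Python sketch of the same idea, using made-up sample months, shows why it works:

```python
from datetime import date

# Hypothetical sample: one customer's order months (Jan-Mar consecutive, then a gap)
months = [date(2022, 1, 1), date(2022, 2, 1), date(2022, 3, 1), date(2022, 6, 1)]

def month_index(d: date) -> int:
    """Count months linearly so consecutive calendar months differ by exactly 1."""
    return d.year * 12 + d.month

# Subtract the row number from each month's index; consecutive months share a key
groups: dict[int, list[date]] = {}
for row_number, m in enumerate(sorted(set(months)), start=1):
    key = month_index(m) - row_number  # constant within a consecutive run
    groups.setdefault(key, []).append(m)

longest_run = max(len(g) for g in groups.values())
print(longest_run >= 3)  # True: Jan-Mar 2022 form a run of three consecutive months
```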
Test 2: Code Debugging
Prompt: “This Python code has a bug. Find it: def get_avg(nums): return sum(nums) / len(nums)”
GPT-3.5: “The function will raise a ZeroDivisionError if nums is empty. Add a check: if not nums: return 0”
GPT-4: “The function has several issues:
- ZeroDivisionError: Empty list causes division by zero
- TypeError: Non-numeric elements will cause sum() to fail
- None handling: If nums is None, both sum() and len() will fail
Improved version:
```python
def get_avg(nums):
    if not nums:
        return 0.0  # or raise ValueError('Cannot average empty sequence')
    try:
        total = sum(nums)
        return total / len(nums)
    except TypeError:
        raise ValueError('All elements must be numeric')
```
Consider also whether you want integer division or float division, and whether NaN should be handled.”
Analysis: GPT-4 thinks more comprehensively about edge cases.
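If you adopt GPT-4's hardened version, the edge cases it listed are easy to pin down with a couple of quick tests (a minimal sketch using pytest; the function body is copied from GPT-4's answer above):

```python
import pytest

def get_avg(nums):
    """GPT-4's hardened version from the answer above."""
    if not nums:
        return 0.0
    try:
        total = sum(nums)
        return total / len(nums)
    except TypeError:
        raise ValueError('All elements must be numeric')

def test_get_avg_edge_cases():
    assert get_avg([1, 2, 3]) == 2.0   # normal case
    assert get_avg([]) == 0.0          # empty input no longer raises
    assert get_avg(None) == 0.0        # None is treated like empty input
    with pytest.raises(ValueError):
        get_avg([1, "two", 3])         # non-numeric input surfaces a clear error
```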
Test 3: Simple Classification
Prompt: “Classify this text as positive, negative, or neutral: ‘The product arrived on time’”
GPT-3.5: “Positive”
GPT-4: “Neutral. The statement is factual and describes the delivery meeting expectations, but doesn’t express explicit satisfaction or dissatisfaction.”
Analysis: For nuanced classification, GPT-4 provides better reasoning. For simple sentiment, both work.
Test 4: Summarization
Prompt: [500-word technical document about Azure Synapse]
GPT-3.5: Accurate, 80-word summary
GPT-4: Accurate, 85-word summary with better organization
Analysis: Minimal difference for straightforward summarization.
Test 5: Mathematical Reasoning
Prompt: “A bat and ball cost $1.10 total. The bat costs $1 more than the ball. How much does the ball cost?”
GPT-3.5: “The ball costs $0.10” (Wrong - this is the common intuitive but incorrect answer)
GPT-4: “Let’s work through this:
- Let ball = x
- Bat = x + $1
- Total: x + (x + $1) = $1.10
- 2x + $1 = $1.10
- 2x = $0.10
- x = $0.05
The ball costs $0.05 (5 cents).”
Analysis: GPT-4 correctly solves the classic cognitive reflection test problem.
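The algebra is trivial to verify numerically:

```python
ball = 0.05
bat = ball + 1.00

assert abs((ball + bat) - 1.10) < 1e-9  # total is $1.10
assert abs((bat - ball) - 1.00) < 1e-9  # bat costs exactly $1 more than the ball
```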
Cost-Benefit Summary
| Task Type | GPT-3.5 Sufficient? | GPT-4 Verdict |
|---|---|---|
| Simple classification | Yes | Not worth 30x cost |
| Summarization | Yes | Marginal improvement |
| Code generation (simple) | Yes | Minor quality gain |
| Code generation (complex) | No | Worth it |
| SQL (simple) | Yes | Not needed |
| SQL (complex) | No | Necessary for accuracy |
| Math/reasoning | No | Required |
| Multi-step analysis | No | Required |
Decision Framework
```python
def select_model(
    task_type: str,
    complexity: str,
    accuracy_critical: bool,
    budget_constrained: bool,
) -> str:
    """Select the optimal model for a task."""
    # Accuracy-critical work goes to GPT-4 unless the budget rules it out
    if accuracy_critical and not budget_constrained:
        return "gpt-4"

    # Task-specific routing
    gpt4_required = {
        "complex_sql",
        "multi_step_reasoning",
        "code_review",
        "mathematical",
        "legal_analysis",
    }
    gpt35_sufficient = {
        "summarization",
        "simple_classification",
        "extraction",
        "translation",
        "simple_qa",
    }

    if task_type in gpt4_required:
        return "gpt-4" if not budget_constrained else "gpt-35-turbo"
    if task_type in gpt35_sufficient:
        return "gpt-35-turbo"

    # Default based on complexity
    if complexity == "high":
        return "gpt-4" if not budget_constrained else "gpt-35-turbo"
    return "gpt-35-turbo"
```
Performance Comparison
| Metric | GPT-3.5 Turbo | GPT-4 |
|---|---|---|
| Latency (simple) | ~1s | ~3s |
| Latency (complex) | ~2s | ~8s |
| Cost per 1K tokens | $0.002 | $0.03 (prompt) / $0.06 (completion) |
| Context window | 4K | 8K/32K |
| Reasoning accuracy | ~70% | ~90% |
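To make the cost row concrete, here is a rough back-of-the-envelope estimate (my own sketch; it assumes ~1,000 total tokens per request, 30 days per month, and a blended GPT-4 price of about $0.045 per 1K tokens, which is an assumption rather than a published figure):

```python
def monthly_cost(requests_per_day: int, avg_tokens: int, price_per_1k: float) -> float:
    """Rough monthly spend using a blended per-1K-token price."""
    return requests_per_day * 30 * avg_tokens / 1000 * price_per_1k

# 10,000 requests per day at ~1,000 tokens each
print(monthly_cost(10_000, 1_000, 0.002))  # GPT-3.5 Turbo: ~$600/month
print(monthly_cost(10_000, 1_000, 0.045))  # GPT-4 (blended price, assumed): ~$13,500/month
```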
Practical Recommendations
- Start with GPT-3.5 for all tasks
- Upgrade to GPT-4 when quality is insufficient
- Use GPT-4 directly for reasoning-heavy tasks
- Monitor costs closely - GPT-4 bills add up fast
- Hybrid approach - use GPT-3.5 for pre-processing, GPT-4 for final analysis (see the sketch below)
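A minimal sketch of that hybrid pattern, reusing the ChatCompletion-style Azure client from the test setup (deployment names and prompts are illustrative):

```python
import openai  # configured for Azure OpenAI as in the test setup

def hybrid_analyze(document: str) -> str:
    """Cheap GPT-3.5 pass to condense, then a GPT-4 pass to reason over the result."""
    # Stage 1: GPT-3.5 extracts the key facts (fast and cheap)
    summary = openai.ChatCompletion.create(
        engine="gpt-35-turbo",
        messages=[{"role": "user",
                   "content": f"Extract the key facts as bullet points:\n\n{document}"}],
        temperature=0.1,
    ).choices[0].message.content

    # Stage 2: GPT-4 reasons over the much shorter summary (fewer expensive tokens)
    analysis = openai.ChatCompletion.create(
        engine="gpt-4",
        messages=[{"role": "user",
                   "content": f"Using these facts, identify risks and recommendations:\n\n{summary}"}],
        temperature=0.1,
    ).choices[0].message.content
    return analysis
```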
The models complement each other. Use both strategically based on task requirements and budget constraints.