GPT-4 vs GPT-3.5: A Practical Comparison

After my first day with GPT-4 access, I compared it with GPT-3.5 in a deliberately practical way rather than with benchmarks. I ran the same set of prompts (tasks from actual production applications, not academic evaluations) against gpt-35-turbo and gpt-4 and compared the outputs qualitatively.

GPT-4 was consistently better in three areas:

  • Code explanation accuracy: GPT-4 correctly identified subtle bugs in multi-step reasoning scenarios where GPT-3.5 gave plausible-sounding but incorrect explanations.
  • Complex instruction adherence: GPT-4 followed multi-part instructions with conditional logic more reliably. An instruction such as “if the severity is high, add an escalation note; if it’s low, omit the escalation section” was followed correctly at a much higher rate.
  • Structured output consistency: GPT-4 produced valid JSON without escape errors on complex nested structures more consistently than GPT-3.5.

GPT-3.5 remained sufficient for simple text classification, sentiment analysis, straightforward summarisation, basic question answering from provided context, and standard code generation for common patterns. For these tasks, GPT-4’s quality improvement didn’t justify the significant price premium.

The practical cost guidance: choose the model per task category, not per application. An application that uses GPT-3.5 for retrieval-augmented FAQ responses and GPT-4 only for complex document analysis gets substantially better cost performance than one that uses GPT-4 uniformly.

Test Setup

import openai
import time
from dataclasses import dataclass

@dataclass
class TestResult:
    model: str
    prompt: str
    response: str
    tokens: int
    latency: float
    cost: float

def test_both_models(prompt: str) -> tuple[TestResult, TestResult]:
    """Test same prompt on both models."""
    results = []

    for model, deployment in [("gpt-35-turbo", "gpt-35-turbo"), ("gpt-4", "gpt-4")]:
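        # perf_counter() is monotonic, so it suits interval timing better than time.time()
        start = time.perf_counter()

        # Legacy (pre-1.0) openai SDK call; `engine` is the Azure OpenAI deployment name
        response = openai.ChatCompletion.create(
            engine=deployment,
            messages=[{"role": "user", "content": prompt}],
            temperature=0.1
        )

        latency = time.perf_counter() - start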
        tokens = response.usage.total_tokens

        # Cost in USD from per-1K-token prices; GPT-4 bills prompt and completion tokens at different rates
        if model == "gpt-4":
            cost = (response.usage.prompt_tokens * 0.03 + response.usage.completion_tokens * 0.06) / 1000
        else:
            cost = tokens * 0.002 / 1000

        results.append(TestResult(
            model=model,
            prompt=prompt,
            response=response.choices[0].message.content,
            tokens=tokens,
            latency=latency,
            cost=cost
        ))

    return tuple(results)

Test 1: Complex SQL Generation

Prompt: “Write a SQL query to find customers who made purchases in 3 consecutive months, had average order value above $100, and whose first purchase was in 2022. Tables: customers(id, name, signup_date), orders(id, customer_id, order_date, amount)”

GPT-3.5 Response:

SELECT c.id, c.name
FROM customers c
JOIN orders o ON c.id = o.customer_id
WHERE YEAR(o.order_date) = 2022
GROUP BY c.id, c.name
HAVING AVG(o.amount) > 100
  AND COUNT(DISTINCT MONTH(o.order_date)) >= 3;

GPT-4 Response:

WITH monthly_orders AS (
    SELECT
        customer_id,
        DATE_TRUNC('month', order_date) as order_month,
        SUM(amount) as monthly_total,
        COUNT(*) as order_count
    FROM orders
    GROUP BY customer_id, DATE_TRUNC('month', order_date)
),
consecutive_months AS (
    SELECT
        customer_id,
        order_month,
        order_month - INTERVAL '1 month' * ROW_NUMBER() OVER (
            PARTITION BY customer_id ORDER BY order_month
        ) as group_id
    FROM monthly_orders
),
customers_with_consecutive AS (
    SELECT customer_id
    FROM consecutive_months
    GROUP BY customer_id, group_id
    HAVING COUNT(*) >= 3
),
first_purchase AS (
    SELECT customer_id, MIN(order_date) as first_order_date
    FROM orders
    GROUP BY customer_id
    HAVING EXTRACT(YEAR FROM MIN(order_date)) = 2022
),
avg_order_value AS (
    SELECT customer_id
    FROM orders
    GROUP BY customer_id
    HAVING AVG(amount) > 100
)
SELECT DISTINCT c.id, c.name
FROM customers c
JOIN customers_with_consecutive cc ON c.id = cc.customer_id
JOIN first_purchase fp ON c.id = fp.customer_id
JOIN avg_order_value aov ON c.id = aov.customer_id;

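Analysis: GPT-3.5 missed the “consecutive months” requirement entirely: COUNT(DISTINCT MONTH(o.order_date)) >= 3 only checks for orders in three distinct months. It also filtered to orders placed in 2022 instead of checking that the first purchase was in 2022. GPT-4 correctly identified the need for window functions to detect consecutive periods and handled each condition in its own CTE.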
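The consecutive-period trick in GPT-4's query (subtracting a row number from the month so that every streak collapses to one group id) is easier to see outside SQL. Here is a minimal Python sketch of the same idea. The function names and the sample order dates are mine, made up for illustration, and are not part of either model's output.

```python
from datetime import date


def month_index(d: date) -> int:
    """Map a date to a running month number so adjacent months differ by 1."""
    return d.year * 12 + d.month


def longest_month_run(order_dates: list[date]) -> int:
    """Length of the longest streak of consecutive calendar months with an order."""
    # Deduplicate first: several orders in one month count as one month.
    months = sorted({month_index(d) for d in order_dates})

    # Same trick as the SQL: month minus row number is constant within a streak.
    runs: dict[int, int] = {}
    for row_number, month in enumerate(months):
        group_id = month - row_number
        runs[group_id] = runs.get(group_id, 0) + 1

    return max(runs.values(), default=0)


# Hypothetical customer: orders in Nov, Dec, Jan (a streak across a year boundary), then Mar.
orders = [date(2022, 11, 3), date(2022, 12, 9), date(2023, 1, 15), date(2023, 3, 1)]
print(longest_month_run(orders) >= 3)
```

Counting distinct months, as the GPT-3.5 query does, would also accept a customer who ordered in January, May, and September, which is exactly the case the grouping step rules out.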

Test 2: Code Debugging

Prompt: “This Python code has a bug. Find it: def get_avg(nums): return sum(nums) / len(nums)”

GPT-3.5: “The function will raise a ZeroDivisionError if nums is empty. Add a check: if not nums: return 0”

GPT-4: “The function has several issues:

  1. ZeroDivisionError: Empty list causes division by zero
  2. TypeError: Non-numeric elements will cause sum() to fail
  3. None handling: If nums is None, both sum() and len() will fail

Improved version:

def get_avg(nums):
    if not nums:
        return 0.0  # or raise ValueError('Cannot average empty sequence')

    try:
        total = sum(nums)
        return total / len(nums)
    except TypeError:
        raise ValueError('All elements must be numeric')

Consider also whether you want integer division or float division, and whether NaN should be handled.”

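Analysis: GPT-3.5 stopped at the empty-list case. GPT-4 also covered non-numeric elements and a None input, and flagged float division and NaN handling as open questions, so it thinks more comprehensively about edge cases.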
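A quick way to confirm those edge cases is to run GPT-4's improved version against them. The function below is copied from the response above; the inputs are made up for illustration.

```python
def get_avg(nums):
    if not nums:
        return 0.0  # or raise ValueError('Cannot average empty sequence')

    try:
        total = sum(nums)
        return total / len(nums)
    except TypeError:
        raise ValueError('All elements must be numeric')


print(get_avg([2, 4, 6]))  # normal case
print(get_avg([]))         # empty list is caught by the guard
print(get_avg(None))       # None is falsy, so the same guard covers it

try:
    get_avg([1, "two", 3])
except ValueError as exc:
    print(f"rejected: {exc}")
```

Note that returning 0.0 for an empty or missing input hides the problem from the caller. Whether that is acceptable depends on the application, which is why the response offers raising ValueError as an alternative.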

Test 3: Simple Classification

Prompt: “Classify this text as positive, negative, or neutral: ‘The product arrived on time’”

GPT-3.5: “Positive”

GPT-4: “Neutral. The statement is factual and describes the delivery meeting expectations, but doesn’t express explicit satisfaction or dissatisfaction.”

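Analysis: The two models disagreed on this borderline sentence: GPT-3.5 labelled it positive, while GPT-4 chose neutral and explained why. For nuanced classification, GPT-4 provides better reasoning. For clear-cut sentiment, both work.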

Test 4: Summarization

Prompt: [500-word technical document about Azure Synapse]

GPT-3.5: Accurate, 80-word summary

GPT-4: Accurate, 85-word summary with better organization

Analysis: Minimal difference for straightforward summarization.

Test 5: Mathematical Reasoning

Prompt: “A bat and ball cost $1.10 total. The bat costs $1 more than the ball. How much does the ball cost?”

GPT-3.5: “The ball costs $0.10” (Wrong - this is the common intuitive but incorrect answer)

GPT-4: “Let’s work through this:

  • Let ball = x
  • Bat = x + $1
  • Total: x + (x + $1) = $1.10
  • 2x + $1 = $1.10
  • 2x = $0.10
  • x = $0.05

The ball costs $0.05 (5 cents).”

Analysis: GPT-4 correctly solves the classic cognitive reflection test problem.
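The arithmetic is easy to verify in a few lines. This is just a sanity check of the algebra above, working in cents to avoid floating-point rounding; the variable names are mine.

```python
total_cents = 110       # bat + ball
difference_cents = 100  # bat - ball

# Subtracting the second equation from the first leaves 2 * ball.
ball_cents = (total_cents - difference_cents) // 2
bat_cents = ball_cents + difference_cents

print(ball_cents, bat_cents)

# The intuitive answer fails the check: a 10-cent ball implies a 110-cent bat.
intuitive_total = 10 + (10 + difference_cents)
print(intuitive_total == total_cents)
```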

Cost-Benefit Summary

| Task Type | GPT-3.5 Sufficient? | GPT-4 Cost/Quality Ratio |
| --- | --- | --- |
| Simple classification | Yes | Not worth up to 30x the cost |
| Summarization | Yes | Marginal improvement |
| Code generation (simple) | Yes | Minor quality gain |
| Code generation (complex) | No | Worth it |
| SQL (simple) | Yes | Not needed |
| SQL (complex) | No | Necessary for accuracy |
| Math/reasoning | No | Required |
| Multi-step analysis | No | Required |

Decision Framework

def select_model(
    task_type: str,
    complexity: str,
    accuracy_critical: bool,
    budget_constrained: bool
) -> str:
    """Select optimal model for task."""

    # Accuracy-critical work uses GPT-4 whenever the budget allows it
    if accuracy_critical and not budget_constrained:
        return "gpt-4"

    # Task-specific routing
    gpt4_required = {
        "complex_sql",
        "multi_step_reasoning",
        "code_review",
        "mathematical",
        "legal_analysis",
    }

    gpt35_sufficient = {
        "summarization",
        "simple_classification",
        "extraction",
        "translation",
        "simple_qa",
    }

    # A tight budget overrides the GPT-4 preference, at the expense of quality
    if task_type in gpt4_required:
        return "gpt-35-turbo" if budget_constrained else "gpt-4"

    if task_type in gpt35_sufficient:
        return "gpt-35-turbo"

    # Unknown task types fall back to the complexity hint
    if complexity == "high":
        return "gpt-35-turbo" if budget_constrained else "gpt-4"

    return "gpt-35-turbo"

Performance Comparison

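| Metric | GPT-3.5 Turbo | GPT-4 |
| --- | --- | --- |
| Latency (simple) | ~1s | ~3s |
| Latency (complex) | ~2s | ~8s |
| Cost per 1K tokens | $0.002 | $0.03 (prompt) / $0.06 (completion) |
| Context window | 4K tokens | 8K or 32K tokens |
| Reasoning accuracy | ~70% | ~90% |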
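Because GPT-4 charges different rates for prompt and completion tokens, the real cost multiplier depends on the shape of the request. The sketch below uses the per-1K prices from the test setup; the function name and the sample token counts are assumptions for illustration, not measurements from these tests.

```python
GPT35_PRICE_PER_1K = 0.002            # USD, applied to all tokens
GPT4_PROMPT_PRICE_PER_1K = 0.03       # USD, prompt tokens
GPT4_COMPLETION_PRICE_PER_1K = 0.06   # USD, completion tokens


def cost_multiplier(prompt_tokens: int, completion_tokens: int) -> float:
    """How many times more GPT-4 costs than GPT-3.5 for the same token counts."""
    total_tokens = prompt_tokens + completion_tokens
    if total_tokens <= 0:
        raise ValueError("At least one token is needed to compare costs")

    gpt4_cost = (
        prompt_tokens * GPT4_PROMPT_PRICE_PER_1K
        + completion_tokens * GPT4_COMPLETION_PRICE_PER_1K
    ) / 1000
    gpt35_cost = total_tokens * GPT35_PRICE_PER_1K / 1000
    return gpt4_cost / gpt35_cost


# Hypothetical request shapes: prompt-heavy, balanced, completion-heavy
for prompt_tokens, completion_tokens in [(1000, 50), (500, 500), (50, 1000)]:
    ratio = cost_multiplier(prompt_tokens, completion_tokens)
    print(f"{prompt_tokens:>5} prompt / {completion_tokens:>5} completion: {ratio:.1f}x")
```

At these prices the multiplier sits between 15x (all prompt tokens) and 30x (all completion tokens), so prompt-heavy tasks such as classification or retrieval-augmented answers are the cheapest ones to move to GPT-4.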

Practical Recommendations

  1. Start with GPT-3.5 for all tasks
  2. Upgrade to GPT-4 when quality is insufficient
  3. Use GPT-4 directly for reasoning-heavy tasks
  4. Monitor costs closely - GPT-4 bills add up fast
  5. Hybrid approach - use GPT-3.5 for pre-processing, GPT-4 for final analysis
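Recommendations 1 and 2 can be combined into an escalation pattern: try GPT-3.5 first and call GPT-4 only when the answer fails a check. This is a minimal sketch under stated assumptions. `call_model` and `is_good_enough` are hypothetical callables you would supply yourself (for example, a wrapper around the API call from the test setup and a JSON validator), and the fake implementations below exist only so the example runs without an API key.

```python
from typing import Callable


def answer_with_escalation(
    prompt: str,
    call_model: Callable[[str, str], str],
    is_good_enough: Callable[[str], bool],
) -> tuple[str, str]:
    """Try GPT-3.5 first and fall back to GPT-4 only when the check fails.

    Returns (model_used, response).
    """
    draft = call_model("gpt-35-turbo", prompt)
    if is_good_enough(draft):
        return "gpt-35-turbo", draft
    return "gpt-4", call_model("gpt-4", prompt)


# Stand-ins for illustration only: a fake model call and a crude JSON-shape check
def fake_call_model(model: str, prompt: str) -> str:
    return '{"severity": "high"}' if model == "gpt-4" else "severity is high"


def looks_like_json(text: str) -> bool:
    stripped = text.strip()
    return stripped.startswith("{") and stripped.endswith("}")


print(answer_with_escalation("Classify the incident", fake_call_model, looks_like_json))
```

The trade-off is that an escalated request pays for both calls and waits for both, so this only saves money when most requests pass the check on the first attempt.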

The models complement each other. Use both strategically based on task requirements and budget constraints.

Michael John Pena


Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.