Claude 3.5 Sonnet vs GPT-4o: A Technical Comparison
With Claude 3.5 Sonnet and GPT-4o both available, choosing the right model for your application requires understanding their differences. I’ve been testing both extensively for data engineering and AI development tasks. Here’s what I’ve learned.
Benchmark Comparison
| Benchmark | Claude 3.5 Sonnet | GPT-4o |
|---|---|---|
| MMLU | 88.7% | 88.7% |
| HumanEval | 92.0% | 90.2% |
| MATH | 71.1% | 76.6% |
| GPQA | 59.4% | 53.6% |
| MGSM | 91.6% | 90.5% |
The benchmark scores are close, but they tell different stories: Claude edges ahead on coding (HumanEval) and graduate-level reasoning (GPQA), while GPT-4o leads on math (MATH). Benchmarks only go so far, though, so I ran my own comparisons.
Code Generation Test
I gave both models the same task:
Create a Python class that implements a connection pool for Azure SQL Database
with automatic retry logic, connection health checks, and graceful shutdown.
Claude 3.5 Sonnet Output:
```python
import pyodbc
import threading
import time
from queue import Queue, Empty
from contextlib import contextmanager
from typing import Optional
import logging

logger = logging.getLogger(__name__)


class AzureSQLConnectionPool:
    def __init__(
        self,
        connection_string: str,
        min_connections: int = 2,
        max_connections: int = 10,
        connection_timeout: int = 30,
        max_retries: int = 3,
        retry_delay: float = 1.0,
        health_check_interval: int = 60
    ):
        self.connection_string = connection_string
        self.min_connections = min_connections
        self.max_connections = max_connections
        self.connection_timeout = connection_timeout
        self.max_retries = max_retries
        self.retry_delay = retry_delay
        self.health_check_interval = health_check_interval

        self._pool: Queue = Queue(maxsize=max_connections)
        self._lock = threading.Lock()
        self._active_connections = 0
        self._shutdown = False
        self._health_check_thread: Optional[threading.Thread] = None

        self._initialize_pool()
        self._start_health_check()

    def _create_connection(self) -> pyodbc.Connection:
        """Create a new connection with retry logic."""
        last_exception = None
        for attempt in range(self.max_retries):
            try:
                conn = pyodbc.connect(
                    self.connection_string,
                    timeout=self.connection_timeout
                )
                conn.autocommit = False
                return conn
            except pyodbc.Error as e:
                last_exception = e
                logger.warning(f"Connection attempt {attempt + 1} failed: {e}")
                if attempt < self.max_retries - 1:
                    time.sleep(self.retry_delay * (2 ** attempt))
        raise ConnectionError(
            f"Failed to connect after {self.max_retries} attempts"
        ) from last_exception

    def _initialize_pool(self):
        """Pre-populate pool with minimum connections."""
        for _ in range(self.min_connections):
            try:
                conn = self._create_connection()
                self._pool.put(conn)
                with self._lock:
                    self._active_connections += 1
            except ConnectionError as e:
                logger.error(f"Failed to initialize connection: {e}")

    def _is_connection_healthy(self, conn: pyodbc.Connection) -> bool:
        """Check if connection is still valid."""
        try:
            cursor = conn.cursor()
            cursor.execute("SELECT 1")
            cursor.fetchone()
            cursor.close()
            return True
        except pyodbc.Error:
            return False

    def _start_health_check(self):
        """Start background health check thread."""
        def health_check_loop():
            while not self._shutdown:
                time.sleep(self.health_check_interval)
                if self._shutdown:
                    break
                self._check_and_repair_connections()

        self._health_check_thread = threading.Thread(
            target=health_check_loop,
            daemon=True
        )
        self._health_check_thread.start()

    def _check_and_repair_connections(self):
        """Check all pooled connections and replace unhealthy ones."""
        checked = 0
        while checked < self._pool.qsize():
            try:
                conn = self._pool.get_nowait()
                if self._is_connection_healthy(conn):
                    self._pool.put(conn)
                else:
                    logger.info("Replacing unhealthy connection")
                    try:
                        conn.close()
                    except Exception:
                        pass
                    with self._lock:
                        self._active_connections -= 1
                    try:
                        new_conn = self._create_connection()
                        self._pool.put(new_conn)
                        with self._lock:
                            self._active_connections += 1
                    except ConnectionError as e:
                        logger.error(f"Failed to replace connection: {e}")
                checked += 1
            except Empty:
                break

    @contextmanager
    def get_connection(self):
        """Get a connection from the pool."""
        if self._shutdown:
            raise RuntimeError("Connection pool is shut down")

        conn = None
        try:
            conn = self._pool.get(timeout=self.connection_timeout)
            if not self._is_connection_healthy(conn):
                try:
                    conn.close()
                except Exception:
                    pass
                with self._lock:
                    self._active_connections -= 1
                conn = self._create_connection()
                with self._lock:
                    self._active_connections += 1
            yield conn
            conn.commit()
        except Empty:
            # Pool exhausted: open a new connection if we are still under the cap.
            with self._lock:
                if self._active_connections >= self.max_connections:
                    raise TimeoutError("No connections available")
                self._active_connections += 1
            try:
                conn = self._create_connection()
            except ConnectionError:
                with self._lock:
                    self._active_connections -= 1
                raise
            yield conn
            conn.commit()
        except Exception:
            if conn:
                try:
                    conn.rollback()
                except Exception:
                    pass
            raise
        finally:
            # Return the connection to the pool exactly once.
            if conn and not self._shutdown:
                self._pool.put(conn)

    def shutdown(self, timeout: float = 30.0):
        """Gracefully shutdown the connection pool."""
        logger.info("Initiating connection pool shutdown")
        self._shutdown = True

        if self._health_check_thread:
            self._health_check_thread.join(timeout=5.0)

        deadline = time.time() + timeout
        while time.time() < deadline:
            try:
                conn = self._pool.get_nowait()
                try:
                    conn.close()
                except Exception:
                    pass
                with self._lock:
                    self._active_connections -= 1
            except Empty:
                break

        logger.info(f"Connection pool shutdown complete. Remaining: {self._active_connections}")
```
GPT-4o Output:
GPT-4o produced similar-quality code with slightly different design choices: it used async/await patterns and leaned on Azure SDK retry policies. Both were production-viable.
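I won't reproduce the full GPT-4o response here, but the async shape it favored looks roughly like this minimal sketch (not its verbatim output; `connect_fn` stands in for a hypothetical coroutine that opens one connection through an async driver):

```python
import asyncio
from contextlib import asynccontextmanager


class AsyncConnectionPool:
    """Skeleton of an asyncio-based pool; retries and health checks would layer on top."""

    def __init__(self, connect_fn, max_connections: int = 10):
        self._connect_fn = connect_fn                    # hypothetical async "open connection" coroutine
        self._idle: asyncio.Queue = asyncio.Queue(maxsize=max_connections)
        self._sem = asyncio.Semaphore(max_connections)   # caps total connections handed out

    @asynccontextmanager
    async def acquire(self):
        async with self._sem:
            try:
                conn = self._idle.get_nowait()           # reuse an idle connection if available
            except asyncio.QueueEmpty:
                conn = await self._connect_fn()          # otherwise open a new one
            try:
                yield conn
            finally:
                self._idle.put_nowait(conn)              # return to the idle queue
```

Neither style is wrong; the asyncio version fits naturally if the rest of your pipeline is already async.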
Key Differences I’ve Noticed
Instruction Following
Claude 3.5 Sonnet tends to follow instructions more precisely. When I say “only return JSON,” it returns only JSON. GPT-4o sometimes adds explanatory text.
```python
# Claude follows format instructions strictly
claude_response = """
{
  "tables": ["customers", "orders"],
  "relationships": [{"from": "orders.customer_id", "to": "customers.id"}]
}
"""

# GPT-4o might add context
gpt4o_response = """
Here's the JSON representation:

{
  "tables": ["customers", "orders"],
  "relationships": [{"from": "orders.customer_id", "to": "customers.id"}]
}

This shows the basic structure...
"""
```
Context Window Usage
Claude 3.5 Sonnet offers a 200K-token context window and GPT-4o offers 128K, but the more interesting difference is how they behave with long inputs:
- GPT-4o: Better at maintaining coherence across very long documents
- Claude: Better at finding specific details in long contexts (“needle in a haystack”); a quick probe for this is sketched below
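One way to run that probe yourself is to bury a synthetic fact in filler text and ask both models to retrieve it (a minimal sketch; the needle, filler, and question are invented for illustration):

```python
def build_haystack_prompt(needle: str, filler: str, repeat: int = 2000) -> str:
    """Bury a single fact in the middle of a long block of filler text."""
    body = "\n".join(filler for _ in range(repeat))
    midpoint = len(body) // 2
    question = (
        "\n\nQuestion: What tracking value is mentioned for order 4812? "
        "Answer with the value only."
    )
    return body[:midpoint] + "\n" + needle + "\n" + body[midpoint:] + question


prompt = build_haystack_prompt(
    needle="Order 4812 shipped with tracking value QX-7741.",
    filler="Routine log line with no relevant information.",
)
# Send `prompt` to both models and check whether each answer contains "QX-7741".
```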
Code Style
Claude tends toward more verbose, explicit code with comprehensive error handling. GPT-4o often produces more concise solutions. Neither is objectively better - it depends on your team’s preferences.
Testing Framework
Here’s how I compare models systematically:
```python
import anthropic
from openai import AzureOpenAI
import time
import json


class ModelComparator:
    def __init__(self):
        # Both clients read their API keys from environment variables
        # (ANTHROPIC_API_KEY and AZURE_OPENAI_API_KEY respectively).
        self.claude = anthropic.Anthropic()
        self.openai = AzureOpenAI(
            azure_endpoint="https://your-resource.openai.azure.com/",
            api_version="2024-05-01-preview"
        )

    def compare_response(self, prompt: str, evaluation_criteria: list[str]):
        # evaluation_criteria is collected for downstream scoring; the raw run
        # only records responses, latency, and token usage.
        results = {}

        # Claude
        start = time.time()
        claude_response = self.claude.messages.create(
            model="claude-3-5-sonnet-20240620",
            max_tokens=4096,
            messages=[{"role": "user", "content": prompt}]
        )
        results["claude"] = {
            "response": claude_response.content[0].text,
            "latency": time.time() - start,
            "input_tokens": claude_response.usage.input_tokens,
            "output_tokens": claude_response.usage.output_tokens
        }

        # GPT-4o
        start = time.time()
        gpt_response = self.openai.chat.completions.create(
            model="gpt-4o",
            max_tokens=4096,
            messages=[{"role": "user", "content": prompt}]
        )
        results["gpt4o"] = {
            "response": gpt_response.choices[0].message.content,
            "latency": time.time() - start,
            "input_tokens": gpt_response.usage.prompt_tokens,
            "output_tokens": gpt_response.usage.completion_tokens
        }

        return results

    def run_benchmark_suite(self, test_cases: list[dict]):
        all_results = []
        for test in test_cases:
            result = self.compare_response(
                test["prompt"],
                test.get("criteria", [])
            )
            result["test_name"] = test["name"]
            all_results.append(result)
        return all_results


# Example usage
comparator = ModelComparator()

tests = [
    {
        "name": "sql_generation",
        "prompt": "Generate a SQL query for Azure Synapse that calculates running totals partitioned by region",
        "criteria": ["correctness", "efficiency", "readability"]
    },
    {
        "name": "error_diagnosis",
        "prompt": "Diagnose this error: 'Cannot insert duplicate key row in object dbo.Users'",
        "criteria": ["accuracy", "actionability", "completeness"]
    }
]

results = comparator.run_benchmark_suite(tests)
```
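The dictionaries returned by `run_benchmark_suite` are easiest to compare side by side with a small reporting loop:

```python
# Print latency and token usage per model for each test case.
for result in results:
    name = result["test_name"]
    for model in ("claude", "gpt4o"):
        r = result[model]
        print(
            f"{name} | {model}: {r['latency']:.2f}s, "
            f"{r['input_tokens']} tokens in / {r['output_tokens']} tokens out"
        )
```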
Cost Analysis
| Model | Input ($/1M tokens) | Output ($/1M tokens) |
|---|---|---|
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| GPT-4o | $5.00 | $15.00 |
Claude 3.5 Sonnet has lower input costs, which matters for applications with large context windows or RAG systems with lengthy retrieved documents.
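To make that concrete, here is a rough per-request cost calculation at the prices in the table (the token counts are illustrative; plug in the usage numbers from the comparison harness above):

```python
# Per-1M-token prices from the table above, in dollars.
PRICES = {
    "claude": {"input": 3.00, "output": 15.00},
    "gpt4o": {"input": 5.00, "output": 15.00},
}


def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000


# Example: a RAG-style request with a large retrieved context.
print(request_cost("claude", input_tokens=50_000, output_tokens=1_000))  # ~$0.165
print(request_cost("gpt4o", input_tokens=50_000, output_tokens=1_000))   # ~$0.265
```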
My Recommendations
Use Claude 3.5 Sonnet for:
- Code generation and review
- Precise instruction following
- Long document analysis with specific extraction
- Safety-critical applications
Use GPT-4o for:
- Real-time voice applications (Realtime API)
- Math-heavy computations
- Existing OpenAI/Azure OpenAI infrastructure
- Multimodal applications with audio
Use Both: The best strategy is often to use both models: route simpler tasks to the cheaper option and complex tasks to whichever model performs better for that specific use case, as in the sketch below.
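A minimal version of that routing idea (the task-to-model mapping is a placeholder you would tune from your own evaluation results; both clients read their keys and the Azure endpoint from environment variables):

```python
from anthropic import Anthropic
from openai import AzureOpenAI

claude = Anthropic()                                   # reads ANTHROPIC_API_KEY
azure = AzureOpenAI(api_version="2024-05-01-preview")  # reads AZURE_OPENAI_ENDPOINT / AZURE_OPENAI_API_KEY

# Placeholder routing table: task types that go to Claude; everything else goes to GPT-4o.
CLAUDE_TASKS = {"code_generation", "code_review", "long_doc_extraction"}


def ask(prompt: str, task: str) -> str:
    """Route a request to whichever model handles this task type better."""
    if task in CLAUDE_TASKS:
        msg = claude.messages.create(
            model="claude-3-5-sonnet-20240620",
            max_tokens=4096,
            messages=[{"role": "user", "content": prompt}],
        )
        return msg.content[0].text
    resp = azure.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```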
Conclusion
Both models are excellent. The “best” choice depends on your specific use case, existing infrastructure, and cost constraints. I recommend building abstraction layers that let you switch models easily - the landscape changes rapidly, and flexibility is valuable.