Claude 2.1 vs GPT-4: A Technical Comparison While We Await Claude 3
With Claude 3 expected soon, now is a good time to compare the current state of play between Claude 2.1 and GPT-4. Let’s dive into a technical comparison of these two frontier models.
Current Benchmark Performance
Based on published benchmarks and specs, both models perform impressively:
| Metric | Claude 2.1 | GPT-4 |
|---|---|---|
| MMLU | 78.5% | 86.4% |
| HumanEval | 70.0% | 67.0% |
| Context window | 200K tokens | 128K tokens (Turbo) |
API Comparison
Claude 2.1
```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-2.1",
    max_tokens=4096,
    messages=[
        {"role": "user", "content": "Write a binary search in Python"}
    ]
)
```
GPT-4
```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    max_tokens=4096,
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a binary search in Python"}
    ]
)
```
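The response objects differ too: Anthropic returns a list of content blocks, while OpenAI returns a list of choices. Here's how you pull the generated text out of each:
```python
# Anthropic Messages API: content is a list of blocks; the text lives on the first block
claude_text = response.content[0].text

# OpenAI Chat Completions: choices is a list; the text lives on the message
gpt4_text = response.choices[0].message.content
```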
Context Window Comparison
```python
# Claude 2.1: 200K tokens
# GPT-4 Turbo: 128K tokens

# Example: processing a large document with Claude
import anthropic

def process_large_document(document: str) -> str:
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-2.1",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": f"Summarize this document:\n\n{document}"
            }
        ]
    )
    return response.content[0].text
```
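Even 200K tokens has limits. For documents beyond either window, a common pattern is map-reduce summarization: split the document, summarize each chunk, then summarize the summaries. A minimal sketch, assuming a rough four-characters-per-token heuristic (the chunk size and helper are illustrative, not from either SDK):
```python
def summarize_in_chunks(document: str, chunk_chars: int = 400_000) -> str:
    """Map-reduce summarization for documents exceeding the context window.

    chunk_chars ~ 100K tokens at a rough 4-chars-per-token estimate.
    """
    chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]
    # Map: summarize each chunk independently
    partial_summaries = [process_large_document(chunk) for chunk in chunks]
    # Reduce: summarize the concatenated partial summaries
    return process_large_document("\n\n".join(partial_summaries))
```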
Pricing Comparison
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Claude 2.1 | $8.00 | $24.00 |
| GPT-4 Turbo | $10.00 | $30.00 |
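Those per-token prices translate into per-request costs. A quick back-of-the-envelope helper, with prices hardcoded from the table above:
```python
PRICES = {  # USD per 1M tokens: (input, output)
    "claude-2.1": (8.00, 24.00),
    "gpt-4-turbo": (10.00, 30.00),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000

# Example: a 100K-token document plus a 1K-token summary
print(f"Claude 2.1:  ${request_cost('claude-2.1', 100_000, 1_000):.2f}")   # $0.82
print(f"GPT-4 Turbo: ${request_cost('gpt-4-turbo', 100_000, 1_000):.2f}")  # $1.03
```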
Practical Test: Code Generation
Let’s test both models with a real coding task:
```python
# Test prompt: "Create a REST API endpoint for user authentication"
# Claude 2.1 typically produces clean, well-documented code:

from datetime import datetime, timedelta

import jwt  # PyJWT
from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer
from pydantic import BaseModel

app = FastAPI()
security = HTTPBearer()
SECRET_KEY = "your-secret-key"  # load from environment in real deployments

class UserLogin(BaseModel):
    username: str
    password: str

class Token(BaseModel):
    access_token: str
    token_type: str

@app.post("/auth/login", response_model=Token)
async def login(user: UserLogin):
    # Validate credentials (simplified)
    if user.username == "admin" and user.password == "password":
        token = jwt.encode(
            {
                "sub": user.username,
                "exp": datetime.utcnow() + timedelta(hours=24),
            },
            SECRET_KEY,
            algorithm="HS256",
        )
        return Token(access_token=token, token_type="bearer")
    raise HTTPException(status_code=401, detail="Invalid credentials")

@app.get("/auth/verify")
async def verify(credentials: HTTPAuthorizationCredentials = Depends(security)):
    try:
        payload = jwt.decode(
            credentials.credentials,
            SECRET_KEY,
            algorithms=["HS256"],
        )
        return {"username": payload["sub"], "valid": True}
    except jwt.ExpiredSignatureError:
        raise HTTPException(status_code=401, detail="Token expired")
    except jwt.InvalidTokenError:
        raise HTTPException(status_code=401, detail="Invalid token")
```
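To sanity-check the endpoints, here's a quick exercise using FastAPI's TestClient (assuming the code above is saved as main.py):
```python
from fastapi.testclient import TestClient

from main import app  # assumes the endpoint code above lives in main.py

client = TestClient(app)

# Log in and grab a token
login = client.post("/auth/login", json={"username": "admin", "password": "password"})
token = login.json()["access_token"]

# Verify the token against the protected endpoint
verified = client.get("/auth/verify", headers={"Authorization": f"Bearer {token}"})
print(verified.json())  # {"username": "admin", "valid": True}
```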
Key Differences
- Context length: Claude 2.1 offers the larger window (200K vs 128K tokens)
- System prompts: GPT-4 uses a dedicated system message role, while Claude 2.1 takes a top-level system parameter (see the sketch below)
- Vision: GPT-4V accepts image inputs; Claude 2.1 is text-only
- Function calling: GPT-4's tool use capabilities are more mature
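The system-prompt difference in practice, using the same clients as earlier:
```python
import anthropic
from openai import OpenAI

# Claude 2.1: the system prompt is a top-level parameter on the request
claude_response = anthropic.Anthropic().messages.create(
    model="claude-2.1",
    max_tokens=1024,
    system="You are a helpful coding assistant.",
    messages=[{"role": "user", "content": "Write a binary search in Python"}],
)

# GPT-4: the system prompt is just another message with the "system" role
gpt4_response = OpenAI().chat.completions.create(
    model="gpt-4-turbo-preview",
    max_tokens=1024,
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a binary search in Python"},
    ],
)
```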
What Claude 3 Might Change
When Claude 3 releases, we expect:
- Potential vision capabilities
- Improved benchmark scores
- Possibly multiple model tiers for different use cases
- Better reasoning on complex tasks
Recommendation
Choose based on your current needs:
- Claude 2.1: Best for long documents, instruction following, cost-conscious deployments
- GPT-4: Best for established ecosystem, multimodal needs, Azure integration
Conclusion
Both models are excellent choices today. The best approach is often a multi-provider strategy, using each model for its strengths. Stay tuned for Claude 3’s release, which could shift this comparison significantly.