
How I Evaluate LLM Changes: building eval sets from real user queries

I worked on smoothing the handoff between data engineering and AI teams—standardizing feature contracts, embedding validation, and adding lightweight integration tests.
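To make "feature contracts" concrete, here is a minimal sketch of what one of those contract checks can look like. The feature names, expected types, and nullability rules below are hypothetical, invented purely for illustration; the point is that the contract is data the validator can enforce at the handoff boundary.

```python
from typing import Any

# Hypothetical contract: feature name -> expected type and nullability.
CONTRACT = {
    "user_id": {"type": int, "nullable": False},
    "embedding_dim": {"type": int, "nullable": False},
    "last_seen": {"type": str, "nullable": True},
}

def validate_row(row: dict[str, Any], contract: dict = CONTRACT) -> list[str]:
    """Return a list of contract violations for one feature row (empty = valid)."""
    errors = []
    for name, spec in contract.items():
        if name not in row:
            errors.append(f"missing feature: {name}")
            continue
        value = row[name]
        if value is None:
            if not spec["nullable"]:
                errors.append(f"{name} may not be null")
        elif not isinstance(value, spec["type"]):
            errors.append(
                f"{name}: expected {spec['type'].__name__}, "
                f"got {type(value).__name__}"
            )
    return errors
```

A conforming row returns no errors, and a lightweight integration test is then just an assertion that `validate_row` is empty for every row crossing the boundary.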

The friction I kept seeing was simple: teams over-rotate on tooling when alignment is the real bottleneck.

Instead of adding more moving parts, I tested a single-path implementation before introducing alternatives.

April is where Q2 intentions either become systems or remain slideware.

What I changed today

  • I clarified ownership for one high-impact surface so escalations are faster.
  • I replaced a vague process step with a concrete, testable checkpoint.
  • I cut one source of rework by tightening upstream validation.
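The second and third bullets amount to the same move: replace a vague "check the data" step with a checkpoint that splits records before anything downstream runs. A minimal sketch, assuming records arrive as dicts and invalid ones should be quarantined rather than passed along (the field names and validity rule are hypothetical):

```python
def checkpoint(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into (valid, quarantined) before downstream steps run."""
    valid, quarantined = [], []
    for rec in records:
        # Hypothetical validity rule: an integer id and a non-empty payload.
        ok = isinstance(rec.get("id"), int) and rec.get("payload") not in (None, "")
        (valid if ok else quarantined).append(rec)
    return valid, quarantined
```

Because the checkpoint is a pure function, it is directly testable: feed it a known-bad record and assert it lands in the quarantine list, which is what makes the process step concrete rather than aspirational.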

What I want to keep doing

Delivery speed held while ambiguity dropped, and in real teams that counts as a win. When assumptions are visible, teams move faster with fewer expensive surprises.

Tomorrow’s focus

I want to tighten the metrics so improvements are obvious without interpretation.

Michael John Peña

Senior Data Engineer based in Sydney. Writing about data, cloud, and technology.