LLM Practice · 1 min read

Evals are the new tests

How I measure whether my agent workflows actually work — and the small habit that catches regressions before they ship.

Unit tests check that a function returns what you wrote it to return. Evals check whether a probabilistic system still behaves the way you need it to — across the messy, ambiguous inputs it’ll actually see in production.

Why evals matter for agents

When the system under test is a language model deciding which tool to call, regular tests don’t catch the failure modes that hurt you most: the model picks the wrong tool 4% of the time; a prompt change makes it overconfident on edge cases; a new model version subtly drifts toward verbose answers.

You don’t notice any of this until someone files a ticket.
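A rough sketch of the arithmetic shows why a single assertion rarely surfaces something like that 4% wrong-tool rate. It assumes each run fails independently with the same probability, which is a simplification, and the function name `chanceOfSeeingFailure` is only illustrative.

```typescript
// Probability that at least one of `runs` trials fails, given a per-run
// failure rate. Assumes trials are independent (a simplification).
function chanceOfSeeingFailure(failureRate: number, runs: number): number {
  if (failureRate < 0 || failureRate > 1) {
    throw new RangeError("failureRate must be in [0, 1]");
  }
  if (!Number.isInteger(runs) || runs < 0) {
    throw new RangeError("runs must be a non-negative integer");
  }
  return 1 - Math.pow(1 - failureRate, runs);
}

const single = chanceOfSeeingFailure(0.04, 1);  // one test run
const fifty = chanceOfSeeingFailure(0.04, 50);  // fifty runs
console.log(single.toFixed(2), fifty.toFixed(2)); // prints "0.04 0.87"
```

Under that assumption, one run exposes the problem about 4% of the time, while fifty runs expose it about 87% of the time. That gap is the argument for running a set of cases rather than one.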

The smallest habit that helps

Keep one fixture file per agent: 20–50 representative inputs covering the obvious happy paths plus the cases you’ve already seen go wrong. Re-run them on every prompt change and every model upgrade. Track the pass rate over time.

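// fixtures/scheduling-agent.eval.ts
// Each case pairs a raw user input with the tool call the agent is expected to make.
type EvalCase = { input: string; expect: { tool: string; tz?: boolean } };

const cases: EvalCase[] = [
  { input: "book a 30 min slot tomorrow afternoon", expect: { tool: "find_slot", tz: true } },
  { input: "cancel my last meeting",                expect: { tool: "cancel_event" } },
  // ...
];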

That’s it. Not a framework, not a service: just a file you can run against any prompt revision, so you can read the diff in results before you ship.
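Here is a minimal sketch of what running that file and reading the diff could look like. Everything beyond the shape of the cases is an assumption. `Agent` is a hypothetical interface for whatever calls your model, the two stub agents stand in for two prompt revisions, and a case passes only if the chosen tool matches `expect.tool`. The `tz` flag is carried in the type but not checked, since its meaning is specific to the fixture.

```typescript
// Shape of one fixture entry; only the expected tool is checked in this sketch.
type EvalCase = { input: string; expect: { tool: string; tz?: boolean } };
// Hypothetical agent interface: takes user input, resolves to the tool it chose.
type Agent = (input: string) => Promise<{ tool: string }>;
type Report = { passRate: number; failed: string[] };

// Run every case once and record the inputs whose tool choice was wrong.
async function runEvals(cases: EvalCase[], runAgent: Agent): Promise<Report> {
  const failed: string[] = [];
  for (const c of cases) {
    const result = await runAgent(c.input);
    if (result.tool !== c.expect.tool) failed.push(c.input);
  }
  const passed = cases.length - failed.length;
  // An empty fixture counts as passing so the rate is never NaN.
  const passRate = cases.length === 0 ? 1 : passed / cases.length;
  return { passRate, failed };
}

// Inputs that passed on the baseline but fail on the candidate.
function newFailures(baseline: Report, candidate: Report): string[] {
  const before = new Set(baseline.failed);
  return candidate.failed.filter((input) => !before.has(input));
}

const demoCases: EvalCase[] = [
  { input: "book a 30 min slot tomorrow afternoon", expect: { tool: "find_slot", tz: true } },
  { input: "cancel my last meeting", expect: { tool: "cancel_event" } },
];

// Stub agents standing in for two prompt revisions; a real one would call the model.
const oldPrompt: Agent = async (input) =>
  ({ tool: input.startsWith("cancel") ? "cancel_event" : "find_slot" });
const newPrompt: Agent = async () => ({ tool: "find_slot" });

async function main(): Promise<string[]> {
  const baseline = await runEvals(demoCases, oldPrompt);
  const candidate = await runEvals(demoCases, newPrompt);
  console.log(`pass rate: ${baseline.passRate} -> ${candidate.passRate}`); // 1 -> 0.5
  return newFailures(baseline, candidate);
}

// Logs the one input that regressed between the two stub revisions.
main().then((regressions) => console.log(regressions));
```

Storing each `Report` alongside the prompt revision that produced it is one way to track the pass rate over time. The list of newly failing inputs is usually the more useful thing to read before shipping.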

What this article isn’t

It’s not a guide to eval frameworks (there are good ones — promptfoo, braintrust, langfuse). It’s the case for having an eval set at all, which is the part most teams skip.

Tags #agents #evals #llm-eng