Evals are the new tests
How I measure whether my agent workflows actually work — and the small habit that catches regressions before they ship.
Unit tests check that a function returns what you wrote it to return. Evals check whether a probabilistic system still behaves the way you need it to — across the messy, ambiguous inputs it’ll actually see in production.
Why evals matter for agents
When the system under test is a language model deciding which tool to call, conventional tests don’t catch the failure modes that hurt you most: the model picks the wrong tool 4% of the time, a prompt change makes it overconfident on edge cases, a new model version subtly drifts toward verbose answers.
You don’t notice any of this until someone files a ticket.
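A rough calculation shows why a one-off check rarely surfaces a failure like that 4% wrong-tool rate. The sketch below assumes every run fails independently with the same probability, which is a simplification, and `chanceOfCatching` is a name made up for this illustration.

```typescript
// Probability of seeing at least one failure in n runs, given a per-run
// failure rate p. Assumes runs are independent (a simplifying assumption).
function chanceOfCatching(p: number, n: number): number {
  return 1 - Math.pow(1 - p, n);
}

// Compare a single spot check with larger batches of runs.
for (const n of [1, 20, 50]) {
  const pct = (chanceOfCatching(0.04, n) * 100).toFixed(0);
  console.log(`${n} runs: ${pct}% chance of seeing a failure`);
}
```

Under those assumptions, a single run surfaces the problem about 4% of the time, while 50 runs surface it roughly 87% of the time. Repetition is what makes a rare failure visible.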
The smallest habit that helps
Keep one fixture file per agent: 20–50 representative inputs covering the obvious happy paths plus the cases you’ve already seen go wrong. Re-run them on every prompt change and every model upgrade. Track the pass rate over time.
// fixtures/scheduling-agent.eval.ts
// Each case pairs a user input with what the agent is expected to do with it.
const cases = [
  { input: "book a 30 min slot tomorrow afternoon", expect: { tool: "find_slot", tz: true } },
  { input: "cancel my last meeting", expect: { tool: "cancel_event" } },
  // ...
];
That’s it. Not a framework, not a service — just a file you can run against any prompt revision and read the diff before you ship.
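Running the file can be as small as a loop. The following is a minimal sketch, not a prescribed harness: `Agent`, `runEvals`, and `fakeAgent` are hypothetical names, the agent is assumed to return the name of the tool it chose, and only `expect.tool` is checked (other expectation fields such as `tz` are ignored here).

```typescript
// Hypothetical shapes, modeled on the fixture file above.
type EvalCase = { input: string; expect: { tool: string; [key: string]: unknown } };
type ToolCall = { tool: string };
// Stand-in for however your agent is invoked; not a real library API.
type Agent = (input: string) => Promise<ToolCall>;
type EvalResult = { input: string; expected: string; actual: string; pass: boolean };

async function runEvals(
  agent: Agent,
  cases: EvalCase[],
): Promise<{ passRate: number; results: EvalResult[] }> {
  const results: EvalResult[] = [];
  for (const c of cases) {
    let actual: string;
    try {
      actual = (await agent(c.input)).tool;
    } catch (err) {
      // A crash counts as a failed case rather than aborting the whole run.
      actual = `error: ${String(err)}`;
    }
    results.push({ input: c.input, expected: c.expect.tool, actual, pass: actual === c.expect.tool });
  }
  const passed = results.filter((r) => r.pass).length;
  return { passRate: cases.length === 0 ? 0 : passed / cases.length, results };
}

// Toy agent so the sketch runs end to end; swap in the real call.
const fakeAgent: Agent = async (input) => ({
  tool: input.includes("cancel") ? "cancel_event" : "find_slot",
});

runEvals(fakeAgent, [
  { input: "book a 30 min slot tomorrow afternoon", expect: { tool: "find_slot", tz: true } },
  { input: "cancel my last meeting", expect: { tool: "cancel_event" } },
]).then(({ passRate, results }) => {
  console.log(`pass rate: ${(passRate * 100).toFixed(0)}%`);
  for (const r of results.filter((x) => !x.pass)) {
    console.log(`FAIL "${r.input}": expected ${r.expected}, got ${r.actual}`);
  }
});
```

Saving the pass rate and the list of failing inputs alongside each prompt revision gives you something concrete to diff between runs.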
What this article isn’t
It’s not a guide to eval frameworks (there are good ones: promptfoo, Braintrust, Langfuse). It’s the case for having an eval set at all, which is the part most teams skip.