JANUARY 22, 2026 · AI · 9 min

LLM evals that actually predict production behavior.

Maxed Out
Author

Most eval harnesses tell you the model is good. Production tells you something different, usually at 2am. Here’s the eval setup we deploy on every AI engagement.

Almost every LLM eval we see in the wild is wrong in the same way. The harness runs the model against a fixed dataset, grades it on a rubric, and spits out a number. The number goes up over time. The product ships. And then production tells a completely different story, usually around two in the morning, usually to the on-call engineer who had nothing to do with building the thing.

The problem is not that the evals are bad. The problem is that the evals are measuring the wrong axis. They measure "is the model good at the task" when they should be measuring "is the system good at the job."

Four things our evals measure

First: drift. We version our eval set and we rerun old versions every week. Not because the test cases change — because the model providers change things on us. A quiet update to the underlying model can swing an eval score by ten percent in a weekend. If you are not tracking this, your production quality is a weather system you do not control.
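A minimal sketch of what that weekly rerun looks like. All names here are hypothetical (`run_model` stands in for your actual model call, and the eval set and baseline scores would live in version control); the point is that a swing in either direction past a threshold counts as drift, even when the score goes up, because the behavior changed under you.

```python
def score_eval_set(eval_set, run_model):
    """Fraction of pinned cases the current model still passes."""
    passed = sum(1 for case in eval_set if run_model(case["input"]) == case["expected"])
    return passed / len(eval_set)

def detect_drift(baseline_score, current_score, threshold=0.05):
    """Flag provider-side changes: any swing beyond the threshold is drift."""
    delta = current_score - baseline_score
    return abs(delta) > threshold, delta

# Toy example: a quiet "provider update" that changed an answer under our feet.
eval_set = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "refund window", "expected": "30 days"},
]
last_week = lambda q: {"2+2": "4", "capital of France": "Paris", "refund window": "30 days"}[q]
this_week = lambda q: {"2+2": "4", "capital of France": "Paris", "refund window": "60 days"}[q]

baseline = score_eval_set(eval_set, last_week)   # 1.0
current = score_eval_set(eval_set, this_week)    # ~0.67
drifted, delta = detect_drift(baseline, current)
print(drifted, round(delta, 2))  # True -0.33
```

The threshold is a judgment call per product; the non-negotiable part is pinning the eval set so that any score movement is attributable to the model, not the tests.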

Second: failure modes, named and ranked. A 92% pass rate is meaningless until you characterize the 8%. Are they hallucinations? Refusals? Off-topic answers? Each of those has a completely different cost profile in production, and the aggregate score hides that.
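To make that concrete, here is one way to break down the failing remainder. The taxonomy and cost weights are illustrative assumptions, not a standard; what matters is that two runs with identical pass rates can rank their failure modes very differently once you weight by production cost.

```python
from collections import Counter

# Hypothetical per-mode cost weights: a hallucination in production is
# assumed far more expensive than an off-topic answer.
COST_WEIGHT = {"hallucination": 10.0, "refusal": 2.0, "off_topic": 1.0}

def failure_report(results):
    """results: dicts with 'passed' (bool) and, for failures, a 'mode' label.
    Returns the aggregate pass rate plus failures ranked by weighted impact."""
    failures = [r["mode"] for r in results if not r["passed"]]
    counts = Counter(failures)
    pass_rate = 1 - len(failures) / len(results)
    ranked = sorted(counts.items(),
                    key=lambda kv: kv[1] * COST_WEIGHT[kv[0]],
                    reverse=True)
    return pass_rate, ranked

# A 92% run where the rarer failure mode is still the riskier one.
run = ([{"passed": True}] * 92
       + [{"passed": False, "mode": "hallucination"}] * 2
       + [{"passed": False, "mode": "refusal"}] * 6)
rate, ranked = failure_report(run)
print(round(rate, 2), ranked)  # 0.92 [('hallucination', 2), ('refusal', 6)]
```

Note that the two hallucinations outrank the six refusals once weighted; the aggregate 92% tells you none of this.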

"Most eval harnesses tell you the model is good. Production tells you something different. The eval is there to predict the production, not to make you feel better on a Tuesday."

Third: cost per successful interaction. Not cost per call — cost per call that actually solved the user's problem. A model that costs twice as much but resolves in one turn instead of three is cheaper by every measure that matters. We have yet to see a client whose eval tracked this before we arrived.
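The arithmetic behind that claim, with made-up numbers: a cheap model that needs three turns and resolves 70% of problems against a model at twice the per-call price that resolves 90% in one turn.

```python
def cost_per_resolution(cost_per_call, avg_turns, resolution_rate):
    """Expected spend to get one interaction that actually solved
    the user's problem, not merely one API call."""
    return (cost_per_call * avg_turns) / resolution_rate

# Illustrative prices only.
cheap = cost_per_resolution(cost_per_call=0.01, avg_turns=3, resolution_rate=0.70)
pricey = cost_per_resolution(cost_per_call=0.02, avg_turns=1, resolution_rate=0.90)
print(round(cheap, 4), round(pricey, 4))  # 0.0429 0.0222
```

The per-call bill says the second model is twice as expensive; the per-resolution number says it is nearly half the price.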

Fourth: human agreement rate on ambiguous cases. For anything consequential, we run a subset through human review and measure not just whether the model was right, but whether two humans even agree on what "right" means. If your humans disagree at 20%, you cannot expect the model to do better than 80%, and no amount of prompt engineering will fix that.
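A sketch of the ceiling calculation, using raw pairwise agreement on a doubly-labeled subset (a chance-corrected statistic like Cohen's kappa is the stricter version; this is the simplest form of the idea). The rater labels below are invented for illustration.

```python
def agreement_rate(labels_a, labels_b):
    """Fraction of ambiguous cases where two human reviewers agree."""
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

# If humans disagree on 20% of cases, "right" is undefined on those cases,
# so measured model accuracy tops out around the agreement rate.
rater_1 = ["yes", "yes", "no", "no", "yes", "no", "yes", "no", "yes", "yes"]
rater_2 = ["yes", "no",  "no", "no", "yes", "yes", "yes", "no", "yes", "yes"]
ceiling = agreement_rate(rater_1, rater_2)
print(ceiling)  # 0.8
```

When this number comes back low, the fix is upstream of the model: tighten the labeling guidelines until the humans converge, then re-grade.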

What this looks like in practice

We deploy this eval stack on every AI engagement, and it usually catches something in the first week. A prompt regression from a colleague. A silent model downgrade on the provider side. A category of inputs nobody had thought to test. Finding those things on a Tuesday afternoon, in staging, is the entire point. Finding them in production is the thing we are trying to prevent.

None of this is cutting-edge. The cutting-edge is in the model weights. The eval harness is plumbing. But plumbing is what lets a product go from "impressive demo" to "reliable enough to build a business on," and we have not yet met a team that regretted building it properly.
