AI evals are broken, but builders still need them

AI benchmarks are useful but incomplete. Here is a practical way builders can use product-specific evals to catch failures before users do.

#AI

#AI Evaluation

#LLM Ops

#Developer Workflows

#AI Engineering

✍️jenuel.dev

Jun. 08, 2026. 8:03 AM

The uncomfortable truth about AI in 2026 is that the demo is getting easier while the measurement is getting harder. A model can pass a polished benchmark, produce a beautiful product video, and still fail on the boring task your team actually needs every Tuesday morning.

That is why the current conversation around AI evals matters. In the last 48 hours, Hacker News surfaced a blunt essay arguing that AI has a measurement problem. Google News also showed fresh discussion around evals being broken but still essential, while Hugging Face recently highlighted EVA-Bench Data 2.0 with 121 tool-use scenarios across three domains. The signal is clear: builders are realizing that leaderboard scores are not enough.

My take is simple: evals are not a magic truth machine. They are a seatbelt. They will not make your AI system safe, reliable, or useful by themselves, but shipping without them is asking your users to become your QA department.

Why benchmark scores feel less useful now

Classic benchmarks were helpful when the question was mostly, "Which model is smarter at a known task?" That question still matters, but it is no longer the whole game. Modern AI products are chains of decisions: retrieve the right context, call the right tool, follow policy, write the output, recover from mistakes, and know when to stop.

A single model score does not tell you whether your support bot will cite the correct refund policy. It does not tell you whether your coding assistant will edit the right file in a messy repo. It does not tell you whether your internal agent will call a destructive tool when the user only wanted a preview.

This is where many teams get fooled. They compare two models on a public leaderboard, swap the winner into production, and then wonder why support tickets spike. The benchmark was not fake. It was just answering a narrower question than the product needed.

What useful evals look like for real products

Good evals should look boringly close to your real work. If you are building an AI assistant for a church admin team, test the actual flows: summarize a meeting, draft a volunteer message, extract dates from a messy announcement, and refuse to expose private member details. If you are building a developer tool, test real repositories, real style rules, real failing tests, and real rollback behavior.

A practical eval suite should usually include three layers:

Task success: Did the system complete the job the user asked for?
Constraint following: Did it stay inside the rules, permissions, tone, budget, and tool limits?
Failure behavior: When it was uncertain, did it ask, stop, escalate, or make up an answer?

That last layer is underrated. A model that fails loudly and safely is often more useful than a model that sounds confident while quietly corrupting a workflow.

Do not turn evals into theater

The danger is that evals can become a ritual. A team adds a few golden prompts, gets a green check, and treats it like proof. That is not measurement. That is decoration.

If your eval set never changes, the system will eventually overfit to it. If your judges are vague, your scores will drift. If you only test happy paths, your product will break exactly where users are most stressed. The goal is not to have an eval dashboard. The goal is to find the ugly failure before a customer does.

One useful habit is to add examples from production every week. Take confusing user prompts, bad model responses, tool-call mistakes, and support escalations, then turn them into regression tests. Over time, your eval suite becomes a memory of what your product has learned the hard way.

What builders should do this week

If you do not have evals yet, start small. Pick 20 examples from real user intent. Include a mix of easy, normal, edge-case, and malicious prompts. Write down what a good answer means before you run the model. Then compare outputs from your current model and one alternative.

Do not chase a perfect score. Look for the failures that would cost trust: wrong citations, invented facts, unsafe tool use, ignored instructions, bad privacy judgment, or responses that sound good but skip the actual work.

If you already have evals, inspect whether they still match the product. Many eval suites are built during prototype week and then forgotten. Your product changed. Your users changed. Your model changed. Your evals should change too.

The practical bottom line

AI evals are broken in the same way maps are imperfect. They leave things out, they get outdated, and they can give false confidence. But the answer is not to drive blind. The answer is to use better maps, update them often, and still keep your eyes on the road.

For developers, the winning pattern is not "trust the benchmark" or "ignore the benchmark." It is: use public benchmarks for a first filter, use product-specific evals for real decisions, and keep human review close to the places where failure hurts.

The next wave of AI products will not be won only by whoever plugs in the newest model first. It will be won by teams that can measure whether the model is actually helping.

References

Thanks for reading! If you enjoyed this article and like this kind of content, you're always welcome to buy me a little coffee, but only if you'd like to. No pressure at all, and either way I'm truly grateful you stopped by. ☕