The Eval Gap: Measuring AI Outputs While Outcomes Go Dark
Enterprise AI evals measure outputs, not outcomes — 83% of agentic-AI papers track only technical metrics. The fix: wire your eval pipeline to CFO KPIs.
Koundinya Lanka
Researchers who combed the agentic-AI evaluation literature found that 83% of papers measure only technical metrics like accuracy, latency, and task-completion rates, while just 15% account for both technical and human dimensions ([arXiv study](https://arxiv.org/abs/2506.02064)). The economic outcome, whether the system moved a number anyone in the business actually tracks, goes almost entirely unmeasured.
That is the eval gap, and it is the most expensive blind spot in enterprise AI today. Teams have gotten very good at answering "does the model work?" and have built almost no apparatus for the only question that funds next quarter: "did anyone's job change?"
The two questions feel adjacent. They are structurally different. Standard benchmarks were built to answer whether a model works in general; an enterprise needs to know whether it delivers value in its specific context — a different question requiring different instrumentation ([Kili Technology](https://kili-technology.com/blog/the-evaluation-gap-why-ai-breaks-in-reality-even-when-it-works-in-the-lab)). A model that scores well on an internal benchmark and a model that measurably reduces the rework in a claims adjuster's week are not the same achievement. Most eval pipelines cannot tell them apart, because they were never built to.
Outputs are not outcomes
An output is something the model produced: a correct answer, a passing test case, a completed task. An outcome is something that changed in the business because of it: a shorter cycle time, a lower cost per case, time a person got back. Output metrics are cheap to collect, fast to compute, and owned entirely by engineering. Outcome metrics are slow, messy, and live in someone else's system. So teams instrument what is convenient and call it evaluation.
How AI teams are graded reinforces the habit. Most are measured on adoption and model quality, not on the P&L. So they build the instrument that reports the thing they are accountable for. The MIT analysis that found 95% of enterprise generative-AI pilots failing to deliver measurable business impact attributed those failures to poor workflow integration and misaligned incentives, not model quality ([Fortune, on the MIT report](https://fortune.com/2025/08/18/mit-report-95-percent-generative-ai-pilots-at-companies-failing-cfo/)). The incentive to measure outputs and the failure to produce outcomes are the same problem seen from two angles.
Confidence theater
Walk into most enterprise AI program reviews and the dashboards look reassuring. Adoption is up. Token volume is up. The eval suite is green. Leadership nods.
None of that proves value was created. High adoption rates do not guarantee productivity improvement — users can engage with a tool superficially, never fold it into a workflow, and generate what one framework calls the "Adoption Illusion": activity that masks zero business value ([Larridin](https://larridin.com/blog/ai-roi-measurement)). A green eval dashboard in that world is confidence theater. It certifies that the model answered correctly, not that the answer changed anything downstream.
Why better benchmarks won't save you
The reflex, when a measurement fails, is to add more metrics. That reflex is wrong here, because the gap is structural rather than a tooling deficiency. As Snorkel's research team put it, "our ability to measure AI has been outpaced by our ability to develop it" ([Snorkel AI](https://benchmarks.snorkel.ai/closing-the-evaluation-gap-in-agentic-ai/)). A structural gap does not close by adding rows to the dashboard you already have.
There is a sharper reason, too. Benchmark-driven optimization shapes what gets built ([arXiv study](https://arxiv.org/html/2506.02064v2)). When a team is rewarded for scoring well on an eval, the system gets designed to score well on that eval. Goodhart's Law at enterprise scale. The measure becomes the target, and once it is the target it stops measuring anything real. Optimizing harder against a benchmark that doesn't track outcomes doesn't shrink the gap; it widens it while making the dashboard look better.
This is not theoretical. Systems in healthcare, finance, and retail that excelled on technical benchmarks went on to fail in real-world deployment, not because of capability gaps, but because of human, temporal, and contextual factors no benchmark had measured ([arXiv study](https://arxiv.org/abs/2506.02064)). The model that aces the lab is not the model that survives the floor, and the standard eval suite cannot see the difference.
Three diagnoses, no consensus
Ask why the gap exists and the literature offers three answers without ranking them. One says it is technical: the metrics are simply wrong, and custom, context-specific evals fix it. One says it is organizational: no single role owns the outcome, so it falls between engineering and finance and never gets measured. One says it is an incentive problem: the vendors selling these systems profit from deployment, not from outcomes, so the market has little reason to instrument the harder number. The three are not mutually exclusive, and the honest reading is that they compound. But no source establishes which dominates, so a team copying someone else's fix may be solving the wrong layer. A custom eval suite does nothing if the AI team is still graded on adoption rather than the P&L. Before buying a tool, a program should decide which of the three it actually has.
The body count
Measuring the wrong thing surfaces downstream as abandoned projects. One industry playbook estimates that for every 33 AI proof-of-concepts an enterprise starts, only 4 reach production ([IDC/Lenovo, via Kili](https://kili-technology.com/blog/the-evaluation-gap-why-ai-breaks-in-reality-even-when-it-works-in-the-lab)). Gartner has attributed 85% of GenAI project failures to bad data or models that were not properly tested ([via AlignX AI](https://medium.com/@AlignX_AI/why-standard-llm-benchmarks-fail-enterprises-and-how-custom-evaluations-drive-real-business-79a272bcb62c)). And one ROI analysis estimates that 72% of AI investments are actively destroying value through waste, with 42% of companies abandoning most of their AI projects in 2025, up from 17% the prior year ([Larridin](https://larridin.com/blog/ai-roi-measurement)).
A caution on stacking these numbers: they measure different failure modes at different stages, and no source reconciles them. The 95%-of-pilots figure and the 42%-abandonment figure are not additive — read them as independent readings on the same disease, not a cumulative tally. The direction is unambiguous even where the arithmetic isn't: a large majority of enterprise AI spend is not converting into outcomes anyone can measure.
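For scale, the arithmetic behind two of the figures above can be checked directly. This is a small sketch that only restates the cited numbers (4 of 33 proofs of concept reaching production, abandonment rising from 17% to 42%); the variable names are illustrative and no new data is introduced.

```python
# Figures taken from the sources cited above; nothing here is new data.
poc_started = 33
poc_in_production = 4

# Share of proofs of concept that reach production.
production_rate = poc_in_production / poc_started

# Share of companies abandoning most AI projects, prior year vs. 2025.
abandon_prior = 0.17
abandon_2025 = 0.42
point_change = abandon_2025 - abandon_prior   # change in percentage points
ratio = abandon_2025 / abandon_prior          # relative increase

print(f"Reach production: {production_rate:.1%}")            # 12.1%
print(f"Abandonment change: {point_change * 100:.0f} points")  # 25 points
print(f"Abandonment ratio: {ratio:.2f}x")                    # 2.47x
```

The two results describe different populations at different stages, so they are printed side by side rather than combined, in line with the caution above.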
The fix is a wiring problem
If the gap is structural, the fix has to change the structure of what gets measured. The most credible proposal in the literature is a four-axis evaluation model: technical performance, human-centered outcomes, safety, and economic impact, scored together rather than treating the last three as someone else's job ([arXiv study](https://arxiv.org/html/2506.02064v2)). Its authors are explicit that this demands a paradigm shift across the field, not a better leaderboard.
For an enterprise team, the operative axis is economic impact, and the practical move is unglamorous: wire the eval pipeline to the KPIs the CFO already uses to decide whether a project lives or dies. Not a bespoke "AI success score" invented by the AI team. The actual numbers finance tracks. Cycle time. Cost per case. Revenue per rep. Hours reclaimed. If the eval dashboard cannot show movement in at least one number that already appears in a finance review, the program is being graded on a rubric no one with a budget recognizes.
The translation is mechanical once the metric is chosen. The eval still runs its technical pass: did the system produce a correct, safe output? A second pass asks whether that correct output is actually used in the workflow it was built for, because an unused correct answer is the Adoption Illusion in miniature. A third pass ties that usage to the finance number over the chosen horizon. Technical, human, economic: the researchers' four axes, with safety folded into the technical pass, collapsed into a pipeline an enterprise team can actually run.
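The three passes described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: `CaseRecord`, its field names, and the choice of cycle time as the finance number are all hypothetical, and the comparison against a fixed baseline is a crude before/after reading, not a causal estimate.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List, Optional


@dataclass
class CaseRecord:
    """One evaluated case. All field names are hypothetical."""
    output_correct: bool     # pass 1: did the technical eval pass?
    output_used: bool        # pass 2: was the output used in the workflow?
    cycle_time_hours: float  # pass 3: the finance-owned number for this case


def three_pass_eval(cases: List[CaseRecord], baseline_cycle_time_hours: float) -> dict:
    """Run the technical, usage, and economic passes in sequence."""
    # Pass 1: technical. Keep only cases with a correct output.
    correct = [c for c in cases if c.output_correct]
    # Pass 2: human. Keep only correct outputs that were actually used.
    used = [c for c in correct if c.output_used]
    # Pass 3: economic. Compare the KPI on used cases to the baseline.
    kpi_delta: Optional[float] = None
    if used:
        kpi_delta = mean(c.cycle_time_hours for c in used) - baseline_cycle_time_hours
    return {
        "technical_pass_rate": len(correct) / len(cases) if cases else 0.0,
        "usage_rate": len(used) / len(correct) if correct else 0.0,
        "kpi_delta_hours": kpi_delta,  # negative means cycle time went down
    }


# Hypothetical sample: four cases against a 10-hour baseline cycle time.
sample = [
    CaseRecord(True, True, 6.0),
    CaseRecord(True, False, 10.0),
    CaseRecord(False, False, 10.0),
    CaseRecord(True, True, 8.0),
]
report = three_pass_eval(sample, baseline_cycle_time_hours=10.0)
print(report)
```

In this sample the technical pass rate is 75%, but only two of the three correct outputs were used, and the economic reading rests on those two cases alone. A dashboard that stops at the first number would never show that narrowing.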
This is harder than swapping a metric, and it is worth being honest about why. The outcome data usually lives in finance; the eval pipeline lives with the AI team. Closing the gap means a handoff across an org boundary most companies have never built, and no source yet offers a clean technical pattern for it. That handoff, not the choice of metric, is where the real work sits. A team that solves the wiring and picks a crude outcome number will beat a team that builds an elegant eval suite wired to nothing.
One more trap to name: the measurement horizon. Applying a traditional ROI window to AI may be a category error, because the value of knowledge work tends to compound over time rather than appear as a quarterly earnings spike ([UC Berkeley Executive Education](https://exec-ed.berkeley.edu/2025/09/beyond-roi-are-we-using-the-wrong-metric-in-measuring-ai-success/)). A six-month outcome eval that comes back flat may be reading a curve too early, not a failure. Honesty requires admitting the correct horizon is unsettled — the same studies warning that windows are too short are themselves built on bounded timeframes. Pick a horizon deliberately, write it down, and revisit, rather than killing a program on its first flat quarter.
What to do Monday
The diagnostic is fast. Pull up the dashboard the AI program reports against and ask one question of every metric on it: does this number appear anywhere in a finance review? If the answer is no across the board, the program is measuring outputs and reporting them as outcomes — confidence theater with a green checkmark.
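As a rough sketch, the one-question diagnostic above amounts to a set intersection. The metric names and both lists below are hypothetical placeholders; in practice the finance list would come from whatever review deck or report finance actually uses, and name matching would need more care than lowercasing.

```python
from typing import Iterable, List


def finance_overlap(dashboard_metrics: Iterable[str],
                    finance_review_metrics: Iterable[str]) -> List[str]:
    """Return the dashboard metrics that also appear in a finance review."""
    finance = {name.strip().lower() for name in finance_review_metrics}
    return [m for m in dashboard_metrics if m.strip().lower() in finance]


# Hypothetical example lists.
dashboard = ["Adoption rate", "Token volume", "Eval pass rate", "Cost per case"]
finance_review = ["cost per case", "Cycle time", "Revenue per rep"]

shared = finance_overlap(dashboard, finance_review)
if shared:
    print("Metrics finance would recognize:", shared)
else:
    print("No overlap: the dashboard reports outputs only.")
```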
Then do three things. Add one economic-impact metric drawn from a number finance already owns, even a crude one. Name a single accountable owner for that metric who sits close enough to the P&L to be believed. And set the measurement horizon on purpose instead of defaulting to the quarter.
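One way to make those three decisions hard to forget is to write them down as a single record. The sketch below is illustrative only: `OutcomeMetricPlan`, its fields, and the example values (metric, owner title, dates, a 365-day horizon) are assumptions, not a recommendation for any specific horizon.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass(frozen=True)
class OutcomeMetricPlan:
    """Hypothetical record of the three decisions: metric, owner, horizon."""
    metric: str             # a number finance already owns
    owner: str              # the single accountable owner
    start: date             # when measurement begins
    horizon_days: int       # chosen on purpose, not defaulted to a quarter
    review_every_days: int  # interval for interim check-ins

    def horizon_end(self) -> date:
        return self.start + timedelta(days=self.horizon_days)

    def verdict_due(self, today: date) -> bool:
        """True only once the written-down horizon has elapsed."""
        return today >= self.horizon_end()

    def review_dates(self) -> List[date]:
        """Interim check-in dates that fall within the horizon."""
        count = self.horizon_days // self.review_every_days
        return [self.start + timedelta(days=self.review_every_days * i)
                for i in range(1, count + 1)]


# Hypothetical example values.
plan = OutcomeMetricPlan(
    metric="cost per case",
    owner="VP Claims Operations",
    start=date(2025, 1, 1),
    horizon_days=365,
    review_every_days=90,
)
print(plan.horizon_end(), plan.verdict_due(date(2025, 4, 1)))
```

Separating interim reviews from the final verdict is the point of the design: a flat first quarter triggers a check-in, not a kill decision.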
None of this requires a better model. It requires pointing the measurement at the outcome instead of the output, and accepting that a model that aces every benchmark while changing no one's job has failed the only eval that funds the next one.
Koundinya Lanka
Founder of The Production Line, writing weekly intelligence on enterprise AI adoption, agentic systems, and the future of work.