The 200K Context Window Is a Sales Number
Frontier LLMs lose 30+ points of accuracy long before they reach their advertised context window. This is the procurement discipline buyers need before signing long-context deals.
Koundinya Lanka
The number on the product page is what fits in the model's input buffer. The number that survives a real enterprise task is often a small fraction of that.
GPT-4o answers retrieval questions with 99.3% accuracy at under 1,000 tokens. Push that same task to 32,000 tokens, and accuracy drops to 69.7% on the [NoLiMa benchmark](https://arxiv.org/pdf/2502.05167), which forces models to do latent inference instead of keyword matching. That is a collapse of nearly thirty points at a context length well inside what most enterprise document workloads use.
The collapse is not a GPT-4o problem. It is a frontier-LLM property. [Every model tested in the current generation shows monotonically decreasing accuracy as context length grows](https://research.trychroma.com/context-rot), and no production model maintains uniform retrieval accuracy across its full advertised window. The pattern is universal, not vendor-specific.
This is the gap that should be sitting in the middle of every enterprise long-context procurement conversation. It almost never is. Buyers sign on the marketed window. The workload runs on the reliable window. The space between them is where the failures live, and the failures stay invisible until somebody audits the output against ground truth. In most long-context deployments, nobody is doing that.
This is the discipline gap that separates production-ready AI procurement from a pilot that fails quietly six months in.
Why does the advertised window break down inside a real workload?
The architecture itself biases against the middle.
The [Lost in the Middle](https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00638/119630/Lost-in-the-Middle-How-Language-Models-Use-Long) study found that information placed in the middle of a long context window suffers more than 30% accuracy degradation compared to the same information placed at the start or end. The pattern is a U-shaped performance curve driven by primacy and recency biases inside transformer attention. It is not a prompt-engineering problem. It is a property of how the layers were trained.
The middle of a 200,000-token context is exactly where the body of a credit agreement, the middle three years of a deposition, or the actual logic of a 12,000-line pull request lives. The retrieval the buyer needs is the retrieval the architecture is worst at. Moving the important sections to the front of the prompt helps on toy benchmarks; it does not help when the workload itself has structure that puts the load-bearing content in the middle.
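One way to check for the U-shaped curve on your own documents is to sweep a known fact through the context and score each position separately. The sketch below is a minimal harness, not the protocol used in the Lost in the Middle paper. `ask_model` is a hypothetical callable standing in for whatever vendor client you use, and the substring check is a deliberately crude scoring assumption.

```python
from typing import Callable, Dict, List


def build_prompt(filler: List[str], needle: str, depth: float, question: str) -> str:
    """Insert the needle at a relative depth (0.0 = start, 1.0 = end) of the filler."""
    position = round(depth * len(filler))
    passages = filler[:position] + [needle] + filler[position:]
    return "\n".join(passages) + "\n\nQuestion: " + question


def depth_sweep(
    ask_model: Callable[[str], str],  # hypothetical: prompt in, answer text out
    filler: List[str],
    needle: str,
    question: str,
    expected: str,
    depths: List[float],
) -> Dict[float, bool]:
    """Return, per depth, whether the expected answer appears in the model output."""
    results = {}
    for depth in depths:
        answer = ask_model(build_prompt(filler, needle, depth, question))
        # Crude scoring assumption: case-insensitive substring match.
        results[depth] = expected.lower() in answer.lower()
    return results
```

Running the sweep at several total context lengths, with filler drawn from the buyer's own corpus, shows whether the middle positions fail first on the workload that matters.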
What is the 64x gap between marketed and reliable capacity?
The NoLiMa benchmark, which strips out the literal keyword overlap that makes traditional needle-in-haystack tests easy, measured [GPT-4.1's effective reliable context length at approximately 16,000 tokens](https://arxiv.org/pdf/2502.05167). The product claim is 1,000,000. That is roughly a 64x gap (62.5x on the round figures) between what is sold and what reliably works on tasks requiring inference rather than substring matching.
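The arithmetic behind the gap is worth making explicit, because the same calculation applies to any vendor once a reliable window has been measured. The sketch below uses the two figures quoted above. The power-of-two reading (2^20 marketed, 2^14 reliable) is an assumption about how the 64x figure could arise, not something the source states.

```python
def marketed_to_reliable_ratio(marketed_tokens: int, reliable_tokens: int) -> float:
    """How many times larger the advertised window is than the measured reliable one."""
    if reliable_tokens <= 0:
        raise ValueError("reliable_tokens must be positive")
    return marketed_tokens / reliable_tokens


# Round figures as quoted in the text.
round_figures = marketed_to_reliable_ratio(1_000_000, 16_000)
# Assumption: both figures read as powers of two (2^20 and 2^14).
binary_figures = marketed_to_reliable_ratio(2**20, 2**14)

print(f"round figures: {round_figures:.1f}x")    # round figures: 62.5x
print(f"binary figures: {binary_figures:.0f}x")  # binary figures: 64x
```

Either way the order of magnitude is the point: the reliable window is under 2% of the marketed one.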
The same gap shows up across vendors. [Claude 200K, Gemini 1M, and GPT-4 128K all exhibit 30 to 35% performance degradation](https://arxiv.org/html/2602.14188) as context approaches their advertised limits on real multi-document tasks.
Marketed-to-reliable gap: roughly 64x. GPT-4.1: 1,000,000 tokens sold vs ~16,000 reliably accurate on inference tasks, per the NoLiMa benchmark.
Degradation at advertised limits: 30 to 35%. Measured across Claude 200K, Gemini 1M, and GPT-4 128K on real multi-document tasks as context approaches their advertised windows.
GPT-4o accuracy collapse: about 30 points. From 99.3% at under 1,000 tokens to 69.7% at 32,000 tokens, a context length inside typical enterprise document workloads.
A buyer choosing between vendors on context-window size is choosing on a metric that does not survive contact with the workload. The vendor with the longer marketed number is not necessarily the vendor with the longer reliable number. They are not the same axis.
Does 100% retrieval mean the model will get the answer right?
No. And this is the result that most vendor decks have not caught up to.
A recent paper showed that [even when a model achieves perfect retrieval of the relevant tokens](https://arxiv.org/html/2510.05381v1) (when the right passages are demonstrably in the window and the model can identify them), downstream task accuracy on math, QA, and code generation still degrades as total context length grows.
Retrieval and reasoning are separate degradation axes. A model that finds the relevant clause in a 180,000-token contract bundle can still produce a wrong synthesis of what that clause implies. The needle-in-haystack tests that vendors publish only measure the first axis. The second axis is the one that drives the actual enterprise output, and it is not on the product page.
A procurement that benchmarks only retrieval has tested half the system.
What vendors benchmark (Retrieval axis): model finds the relevant passage. Needle-in-haystack passes. One degradation axis tested. Published on the product page.
What drives enterprise outcomes (Reasoning axis): model correctly synthesizes meaning across clauses, deposition years, or code files. Degrades independently of retrieval. Absent from the product page.
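Keeping the two axes apart in an evaluation harness can be as simple as scoring every test case twice. The sketch below assumes a hypothetical record format in which each case stores whether the model cited the right passage and whether its final answer matched ground truth. The field names are illustrative, not taken from any benchmark.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CaseResult:
    context_tokens: int
    retrieved_correct_passage: bool  # retrieval axis
    answer_correct: bool             # reasoning axis (final answer vs ground truth)


def axis_scores(results: List[CaseResult]) -> Dict[str, float]:
    """Report retrieval accuracy and, separately, answer accuracy given correct retrieval."""
    if not results:
        raise ValueError("no results to score")
    retrieved = [r for r in results if r.retrieved_correct_passage]
    retrieval_accuracy = len(retrieved) / len(results)
    # Conditioning on correct retrieval isolates the reasoning axis.
    reasoning_given_retrieval = (
        sum(r.answer_correct for r in retrieved) / len(retrieved) if retrieved else 0.0
    )
    return {
        "retrieval_accuracy": retrieval_accuracy,
        "reasoning_given_retrieval": reasoning_given_retrieval,
    }
```

A result such as perfect retrieval with much lower `reasoning_given_retrieval` is the pattern a needle-in-haystack score alone would hide.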
Why do structured documents make the problem worse, not better?
Chroma's context-rot study turned up a counterintuitive result. [Structurally coherent documents, the kind with logical narrative flow, cross-references, and clear section headers, performed worse than randomly shuffled haystacks](https://research.trychroma.com/context-rot) across every model tested.
The reading is uncomfortable. Models appear to exploit surface-level positional cues rather than semantic structure. The shuffled haystack provides no false signal; the coherent document provides several. Cross-references and forward pointers in a well-written contract or policy doc give the model patterns to latch onto that are not actually load-bearing for the answer.
This matters because the enterprise corpus is the structured case. Master service agreements, regulatory filings, internal policy documents, codebases, board packs. All of them have the structure that breaks the model worst. Random web text is the easy case. The buyer's actual workload is the hard case. The product demos that pass on academic haystacks are passing the easy version of the test.
What happens when a long-context model fails?
It does not say so. It returns a fluent, confident, incorrect answer.
Expected failure path: model uncertain → low-confidence flag or refusal → human review triggered → error caught before production.
Actual failure path: model wrong at 180K tokens → fluent, confident output → surface monitoring shows healthy → wrong contract clause summarized, wrong patient record flagged, or wrong code recommended → error only caught in audit.
Long-context degradation does not surface as a refusal or a low-confidence flag. [Models silently fail](https://www.digitalapplied.com/blog/long-context-retrieval-needle-in-haystack-2026). They produce output that reads like a successful synthesis and is wrong in ways that require ground-truth evaluation to detect.
This is the failure mode that should worry a buyer most. A model that refuses when uncertain is a model an enterprise can build a human-in-the-loop around. A model that hallucinates with confidence at 180,000 tokens (but not at 18,000) produces a class of error that surface monitoring does not catch. The wrong contract clause is summarized correctly. The wrong patient record is flagged. The wrong line of code is recommended for refactor. The system looks healthy until somebody audits the output against ground truth.
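Because the failure is silent, the only way to see it is to audit a sample of outputs against ground truth and break the error rate out by context length. The sketch below is one minimal way to do that. The bucket edges and the `(context_tokens, is_correct)` record shape are assumptions chosen for illustration.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def error_rate_by_bucket(
    audited: Iterable[Tuple[int, bool]],  # (context_tokens, is_correct) per audited output
    bucket_lower_edges: List[int],        # ascending, first edge should be 0
) -> Dict[int, float]:
    """Group audited outputs by context-length bucket and return each bucket's error rate."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for context_tokens, is_correct in audited:
        index = bisect_right(bucket_lower_edges, context_tokens) - 1
        if index < 0:
            raise ValueError("context_tokens is below the first bucket edge")
        edge = bucket_lower_edges[index]
        totals[edge] += 1
        errors[edge] += 0 if is_correct else 1
    return {edge: errors[edge] / totals[edge] for edge in sorted(totals)}
```

An aggregate error rate can look acceptable while the longest bucket carries almost all of the errors, which is the 18,000-versus-180,000 pattern described above.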
[Atlan's review of LLM context-window limitations](https://atlan.com/know/llm-context-window-limitations/) attributes a large share of enterprise AI failures to context drift or memory loss during multi-step reasoning. The primary methodology behind that estimate is not public, so it should be read as a directional signal rather than a measured rate. The directional reading is consistent with the rest of the evidence: when long-context systems break, they break quietly, and they break in the middle of multi-step work where the failure compounds.
What should an enterprise buyer run before signing the long-context contract?
1. **Multi-needle retrieval on your own data.** Benchmark on customer-supplied documents at your operating context length. Multi-needle tests, which require the model to combine several pieces of evidence, show 30 to 60 point gaps versus single-needle tests at the same length. Vendors who refuse customer-data benchmarks are signaling the gap.
2. **Degradation curve across your operating range.** Get the accuracy curve from 50K to 500K tokens on your corpus, with the inflection point clearly marked. Most enterprise workloads live in this range. A single number at 1M tokens is the wrong number.
3. **Separate reasoning evaluation.** Test the downstream task, not just retrieval. Contract analysis: can the model synthesize obligations across clauses? Code review: does it identify cross-file dependencies? Retrieval and reasoning degrade on different axes.
The vendor demo is structured to pass. It is single-needle, lexically overlapping retrieval at a context length the vendor knows the model handles. The real procurement test is the inverse.
There are three benchmarks the buyer should require before signing, all run on the buyer's data, not the vendor's.
The first is **multi-needle retrieval on the buyer's own documents at the buyer's operating context length.** Single-needle tests are systematically misleading; [multi-needle retrieval, which matches real enterprise tasks where the answer requires combining several pieces of evidence, shows 30 to 60 point performance gaps versus single-needle scores](https://www.digitalapplied.com/blog/long-context-retrieval-needle-in-haystack-2026) at the same context length. If the vendor will not provide multi-needle benchmarks on customer-supplied data, that refusal is the signal.
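A multi-needle case can be assembled from the buyer's own documents by scattering several required facts through the context and passing the case only when the answer uses all of them. The sketch below is an illustration under simple assumptions: passages are plain strings, placement is random but seeded for repeatability, and scoring is a substring check rather than a graded rubric.

```python
import random
from typing import List, Sequence


def scatter_needles(filler: Sequence[str], needles: Sequence[str], seed: int = 0) -> List[str]:
    """Insert each needle at a distinct random position, keeping filler and needle order."""
    rng = random.Random(seed)
    positions = sorted(rng.sample(range(len(filler) + 1), k=len(needles)))
    passages = list(filler)
    # Insert from the back so the earlier insert positions stay valid.
    for position, needle in reversed(list(zip(positions, needles))):
        passages.insert(position, needle)
    return passages


def all_needles_used(answer: str, required_facts: Sequence[str]) -> bool:
    """Pass only if every required fact shows up in the answer (crude substring scoring)."""
    lowered = answer.lower()
    return all(fact.lower() in lowered for fact in required_facts)
```

Scoring the same documents once with a single needle and once with several gives the buyer their own version of the single-needle versus multi-needle gap.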
The second is **a degradation curve across the operating range, not the model's maximum.** The typical enterprise long-context workload — code review, contract analysis, knowledge-base Q&A — sits well inside the advertised window, and [that operating range is also where the steepest multi-hop reasoning degradation occurs](https://research.trychroma.com/context-rot). A vendor that reports a single accuracy number at 1M tokens is reporting the wrong number. The buyer needs the curve from 50,000 to 500,000 on their own corpus, with the inflection point clearly marked.
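Once the curve exists, two numbers are worth pulling out of it: where accuracy falls fastest, and the longest context that still clears the buyer's accuracy bar. The sketch below treats the steepest drop between adjacent measurements as a simple proxy for the inflection point. The function names and the threshold are assumptions for illustration.

```python
from typing import List, Tuple

Curve = List[Tuple[int, float]]  # (context_tokens, accuracy) measurements


def steepest_drop(curve: Curve) -> Tuple[int, int, float]:
    """Return (from_tokens, to_tokens, drop) for the largest fall between adjacent points."""
    if len(curve) < 2:
        raise ValueError("need at least two measurements")
    points = sorted(curve)
    (lo_tokens, lo_acc), (hi_tokens, hi_acc) = max(
        zip(points, points[1:]), key=lambda pair: pair[0][1] - pair[1][1]
    )
    return lo_tokens, hi_tokens, lo_acc - hi_acc


def reliable_window(curve: Curve, min_accuracy: float) -> int:
    """Largest measured context length before accuracy first dips below the bar (0 if none)."""
    reliable = 0
    for tokens, accuracy in sorted(curve):
        if accuracy < min_accuracy:
            break
        reliable = tokens
    return reliable
```

The `reliable_window` result, measured on the buyer's corpus, is the number to compare across vendors in place of the marketed window.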
The third is **a separate evaluation for reasoning, not just retrieval.** Because retrieval and reasoning degrade on different axes, the buyer needs to test the downstream task the long-context window will actually serve — not just whether the model can find the relevant passage. A contract-analysis pilot should test whether the model can correctly synthesize obligations across clauses. A code-review pilot should test whether the model identifies cross-file dependencies, not whether it can echo back a particular function when asked.
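For the code-review case, one way to score the downstream task is to compare the dependencies the model reports against a hand-labeled set. The sketch below assumes dependencies are represented as `(source_file, target_file)` pairs, which is an illustrative choice rather than a standard format.

```python
from typing import Set, Tuple

Dependency = Tuple[str, str]  # (source_file, target_file), an assumed representation


def dependency_precision_recall(
    predicted: Set[Dependency], expected: Set[Dependency]
) -> Tuple[float, float]:
    """Precision and recall of model-reported dependencies against labeled ground truth."""
    true_positives = len(predicted & expected)
    # Convention chosen here: an empty denominator scores 0.0 rather than raising.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall
```

Low recall at long context with unchanged function-lookup accuracy would be the reasoning-axis degradation showing up in a form that retrieval tests miss.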
These three tests are not exotic. They are what separates the procurement that survives the audit from the one that becomes a sunk-cost pilot. Vendor selection on context-window length without them is a procurement decision dressed up as a technical one — and the failure surface is not visible until production.
The buyer's framing
The 200,000-token, 1,000,000-token, multi-million-token context window is real as a number. It is not real as a capability guarantee. Buyers who sign on the advertised window are paying for capacity the architecture cannot reliably deliver on the workload that drove the purchase.
The fix is not to abandon long-context models. It is to procure them the way an operator procures any production system: with a benchmark that matches the workload, a degradation curve that covers the operating range, and a separate test for the downstream task that matters. Without that discipline, the long-context window is a sales number. And the silent-failure mode behind it is exactly the class of risk that turns [a promising pilot into one of the four debts that keep enterprise AI stuck](https://theproductionline.ai/blog/four-debts-enterprise-ai-pilot-purgatory).
The wider [build-vs-buy decision across the AI stack runs on the same discipline](https://theproductionline.ai/blog/vendor-concentration-scorecard-enterprise-ai-stacks): the vendor's marketed number is the start of the conversation. The buyer's benchmark is the end.
Frequently asked questions
Why do long-context LLMs lose accuracy as the prompt gets longer?
Transformer attention has primacy and recency biases that cause information in the middle of a long context window to suffer more than 30% accuracy degradation compared to the same information at the start or end. This is an architectural property of how the layers were trained, not a prompt-engineering problem. Separately, every frontier model tested shows accuracy declining monotonically as context length grows.
What is the difference between single-needle and multi-needle benchmarks?
Single-needle tests ask the model to retrieve one piece of information from a long context, usually with high lexical overlap between the question and the answer. Multi-needle tests require combining several pieces of evidence — which is what real enterprise tasks look like — and they show 30 to 60 point performance gaps versus single-needle scores at the same context length. Vendor demos almost always use single-needle, which is why they pass while real workloads fail.
How should an enterprise buyer test a long-context model before signing?
Run three benchmarks on the buyer's own documents: multi-needle retrieval at the operating context length, a degradation curve from 50,000 to 500,000 tokens that shows where accuracy inflects, and a separate evaluation for the downstream reasoning task (not just retrieval). If the vendor will not support multi-needle benchmarks on customer-supplied data, that refusal is itself the signal.
What is the lost-in-the-middle effect?
Lost in the middle is a U-shaped performance curve where language models reliably use information placed at the start or end of a long context but lose more than 30% accuracy on identical information placed in the middle. It was documented by researchers at Stanford, UC Berkeley, and Samaya AI in a paper published in the Transactions of the Association for Computational Linguistics, and it is driven by transformer attention biases that cannot be fixed through prompt engineering alone.
Koundinya Lanka
Founder of The Production Line, writing weekly intelligence on enterprise AI adoption, agentic systems, and the future of work.