Context Window Limits: What Testing Taught Me

by Alien Brain Trust AI Learning

TL;DR: Every tutorial shows you how to call an LLM API. None of them explain what happens when your context window fills up mid-task in a production agent workflow — and why that failure mode looks nothing like a typical software error. Here’s what I learned by breaking it repeatedly.


I’ve spent 25 years designing systems that fail predictably. In security, unpredictable failure is the enemy. You design for known failure modes, instrument for unknown ones, and test until the blast radius of any single component failure is bounded and understood.

LLMs break that model. And no course, tutorial, or documentation page prepared me for exactly how they break it.

The specific lesson: context window exhaustion is a silent, partial, and misleading failure. It doesn’t throw an exception. It doesn’t return a 500. In many cases, it quietly degrades the quality of the output and keeps going — and you won’t notice until something downstream is wrong in a way that’s hard to trace back.

This post is about what I learned building a multi-step agent workflow that processes security policy documents. Long inputs, structured outputs, chained prompts. Exactly the kind of task where context limits bite you.


What the Documentation Tells You About Context Windows

The documentation is accurate. It just doesn’t tell you what you need to know.

You’ll read that a model has a 200K token context window. You’ll run a quick calculation. Your documents are X words, your prompts are Y tokens, you’re well inside the limit. You ship it.

What the documentation doesn’t cover:

  • Token counting is not word counting. A 10,000-word policy document with dense legal language and defined terms can tokenize to 16,000–18,000 tokens depending on vocabulary. Code and structured data tokenize differently than prose.
  • The context window includes everything: system prompt, conversation history, tool call results, structured output schemas, and any retrieved context you inject. Your “document” is not the only thing in there.
  • Models don’t fail cleanly at the limit. Behavior degrades before you hit the hard limit. Outputs get shorter. Instructions from early in the prompt get dropped. The model starts optimizing for completion over accuracy.

None of that is in the quickstart guide.
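To make the "everything shares one window" point concrete, here is a minimal budgeting sketch. The component names and every token count below are illustrative assumptions, not measurements from my workflow; in practice the numbers should come from your provider's tokenizer or the usage data it reports.

```python
from dataclasses import dataclass


@dataclass
class ContextBudget:
    """Token counts for everything that shares one request's context window."""
    system_prompt: int
    history: int
    tool_results: int
    output_schema: int
    retrieved_context: int
    document: int
    reserved_for_output: int  # room the model needs to write its answer

    def total(self) -> int:
        return (self.system_prompt + self.history + self.tool_results
                + self.output_schema + self.retrieved_context
                + self.document + self.reserved_for_output)

    def utilization(self, context_limit: int) -> float:
        """Fraction of the context window this request would consume."""
        return self.total() / context_limit


# Hypothetical numbers for a single step late in a chained workflow.
budget = ContextBudget(
    system_prompt=1_200,
    history=9_000,
    tool_results=22_000,
    output_schema=800,
    retrieved_context=30_000,
    document=17_000,
    reserved_for_output=8_000,
)

print(f"total tokens: {budget.total()}")
print(f"utilization of a 200K window: {budget.utilization(200_000):.0%}")
print(f"document share of total: {budget.document / budget.total():.0%}")
```

In this made-up example the document is under a fifth of what actually occupies the window, which is why a calculation based on the document alone is misleading.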


What Actually Happened in Production Testing

I was building a workflow to process security policy documents — extract controls, map them to a framework, flag gaps. Multi-step: ingest, extract, classify, compare.

Step one worked fine in isolation. Step two worked fine in isolation. When I chained them — passing the output of step one as input to step two, plus the original document as reference — something subtle broke.

The classification output looked reasonable. It was shorter than expected, but the structure was correct. The problem surfaced three steps later when a comparison function returned results that didn’t match the source document. Gaps were being flagged that didn’t exist. Controls that were clearly present in the policy were missing from the extracted set.

I spent two hours debugging what I assumed was a logic error in my comparison code. It wasn’t. The extraction step had silently dropped roughly 20% of the controls from a long policy section — the section that happened to fall in the latter half of the context window. The model completed the task. The output was structurally valid. It was just wrong.
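One diagnostic that can surface this pattern is to look at where in the source the extracted items come from. The sketch below is not what I ran at the time; it assumes each extracted item carries a character offset pointing at its supporting passage, which your extraction prompt would have to request. The function names and offsets are hypothetical.

```python
def coverage_by_position(offsets: list[int], document_length: int,
                         slices: int = 4) -> list[int]:
    """Count extracted items per equal-width slice of the source document."""
    if document_length <= 0:
        raise ValueError("document_length must be positive")
    counts = [0] * slices
    for offset in offsets:
        if offset < 0:
            continue  # ignore items with no usable offset
        # Clamp so an offset at or past the end lands in the last slice.
        index = min(offset * slices // document_length, slices - 1)
        counts[index] += 1
    return counts


def empty_slices(counts: list[int]) -> list[int]:
    """Indexes of slices that produced no extracted items at all."""
    return [index for index, count in enumerate(counts) if count == 0]


# Hypothetical offsets from one extraction run over a 100,000-character policy.
offsets = [2_000, 9_000, 15_000, 24_000, 31_000, 40_000, 48_000, 61_000]
counts = coverage_by_position(offsets, document_length=100_000)
print(counts)                # items per quarter of the document
print(empty_slices(counts))  # quarters with nothing extracted
```

A count that thins out toward the end of the document does not prove the model dropped anything, since policies are not uniformly dense, but it tells you where to look first.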


The Security Implication Nobody Mentions

Here’s what makes this a security problem, not just an engineering problem.

In a compliance or audit context, silent omission is worse than visible failure. If my agent throws an error, I investigate. If my agent confidently returns a control mapping that is only 94% complete but looks finished, I trust it, and I’m wrong.

After 25 years of IAM work, I know this as a false negative: the system says you’re covered when you’re not. It’s the failure mode that gets organizations breached. The audit passes. The gap stays open.

An LLM agent running near its context limit is a systematic false-negative generator. And because the output is coherent and structured, it bypasses the sanity checks that would catch a more obvious failure.

This isn’t hypothetical. If you’re using an AI agent to review access controls, evaluate policy documents, or summarize security findings, and that agent is processing content that approaches its effective context limit, you need to validate completeness — not just correctness of what it returned.


What I Changed After Breaking It

1. Measure actual token usage, not estimated word counts

Most API clients return token usage in the response. Log it. Set an alert threshold at 70% of the model’s context limit. If a request exceeds that threshold, it goes to a review queue before the output is used downstream.

This is the same principle as disk space monitoring. You don’t wait for the disk to fill up to notice you have a problem.
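A minimal sketch of that threshold check, assuming the usage data arrives as a dictionary with `input_tokens` and `output_tokens` keys. Those field names vary by provider (some use `prompt_tokens` and `completion_tokens`), and `route` with its queue names is a hypothetical helper, so adapt both to your client.

```python
import logging

logger = logging.getLogger("context_monitor")

ALERT_FRACTION = 0.70  # alert threshold: 70% of the model's context limit


def context_utilization(usage: dict, context_limit: int) -> float:
    """Fraction of the window consumed by one request (input plus output)."""
    used = usage.get("input_tokens", 0) + usage.get("output_tokens", 0)
    return used / context_limit


def route(usage: dict, context_limit: int) -> str:
    """Decide whether a response can flow downstream or needs review first."""
    utilization = context_utilization(usage, context_limit)
    logger.info("context utilization: %.1f%%", utilization * 100)
    if utilization > ALERT_FRACTION:
        logger.warning("above %.0f%% of context limit, sending to review",
                       ALERT_FRACTION * 100)
        return "review_queue"
    return "downstream"


# Hypothetical usage payloads against a 200K-token limit.
print(route({"input_tokens": 150_000, "output_tokens": 4_000}, 200_000))
print(route({"input_tokens": 60_000, "output_tokens": 2_000}, 200_000))
```

The check runs after the call because that is when the real numbers are available; a pre-flight estimate is a useful addition but not a substitute.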

2. Chunk and verify, don’t pass and hope

For documents over a certain size, I now chunk the input and process sections independently, then merge and reconcile. It’s more API calls and more cost. It’s also the only approach that gives me consistent completeness.

The reconciliation step matters: after merging section outputs, I do a final pass that compares extracted item counts against rough expected counts based on document structure. If the merge result looks suspiciously smaller than the source, that’s a flag.
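The chunk, merge, and reconcile flow can be sketched roughly as follows. Two loud assumptions: sections are recognized by numbered headings such as "4.2 Access Control", and `extract` stands in for whatever function calls the model on one section. The "at least one item per section" expectation is a deliberately crude placeholder for the rough expected counts.

```python
import re
from typing import Callable

# Assumption: a section starts at a line beginning with a numbered heading.
HEADING_BREAK = re.compile(r"\n(?=\d+(?:\.\d+)*\s+\S)")


def split_into_sections(document: str) -> list[str]:
    """Split the document at numbered headings, dropping empty pieces."""
    return [part.strip() for part in HEADING_BREAK.split(document) if part.strip()]


def extract_and_reconcile(document: str,
                          extract: Callable[[str], list[str]],
                          min_items_per_section: int = 1) -> dict:
    """Run extraction per section, merge without duplicates, flag thin results."""
    sections = split_into_sections(document)
    merged, seen, empty_sections = [], set(), []
    for index, section in enumerate(sections):
        items = extract(section)
        if not items:
            empty_sections.append(index)
        for item in items:
            key = " ".join(item.lower().split())  # normalize case and whitespace
            if key not in seen:
                seen.add(key)
                merged.append(item.strip())
    expected_minimum = len(sections) * min_items_per_section
    return {
        "items": merged,
        "empty_sections": empty_sections,
        "suspiciously_small": len(merged) < expected_minimum,
    }


# Stand-in extractor for demonstration only: treats "must" sentences as controls.
def fake_extract(section: str) -> list[str]:
    return [line for line in section.splitlines() if " must " in line]


doc = (
    "1 Purpose\nThis policy defines baseline controls.\n"
    "2 Access Control\nUsers must use MFA.\nAdmins must use hardware keys.\n"
    "3 Logging\nSystems must retain logs for one year."
)
report = extract_and_reconcile(doc, fake_extract)
print(report)
```

An empty section is not automatically an error (a purpose statement may contain no controls), so the report lists empty sections for review instead of failing outright.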

3. Add a completeness probe

After any extraction task on a long document, I run a second lightweight prompt: “Given this original document, list any sections or topics that do not appear to be represented in this summary.” It’s not foolproof, but it catches the most egregious omissions.

Think of it as a control test. In security auditing, you don’t just check that controls exist — you verify they work. Same principle applied to your AI pipeline.
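A sketch of how such a probe might be wired up. `ask_model` is a hypothetical callable that sends a prompt to whatever model client you use and returns its text reply; the "one item per line, or NONE" reply format and the tag wrappers are assumptions added here to make the reply parseable.

```python
from typing import Callable

PROBE_TEMPLATE = (
    "Given this original document, list any sections or topics that do not "
    "appear to be represented in this summary. "
    "Reply with one item per line, or the single word NONE.\n\n"
    "<document>\n{document}\n</document>\n\n"
    "<summary>\n{summary}\n</summary>"
)


def completeness_probe(document: str, summary: str,
                       ask_model: Callable[[str], str]) -> list[str]:
    """Return the topics the model believes are missing from the summary."""
    reply = ask_model(PROBE_TEMPLATE.format(document=document, summary=summary))
    # Strip bullet characters and whitespace, then drop blank lines.
    lines = [line.strip(" -•\t") for line in reply.splitlines()]
    lines = [line for line in lines if line]
    if len(lines) == 1 and lines[0].upper() == "NONE":
        return []
    return lines


# Fake model replies for demonstration; a real call would go through your client.
missing = completeness_probe(
    "full policy text", "extracted controls",
    ask_model=lambda prompt: "- Incident Response\n- Vendor Management\n",
)
print(missing)
```

Note that the probe sends the full document again, so on very long inputs it is exposed to the same context pressure as the original extraction; running it per chunk is one way to limit that.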

4. Validate output density

If I’m extracting controls from a 50-page policy, and the previous run on a similar document returned 40 controls, and this run returns 12, that’s anomalous. I added basic output density checks — expected range based on document length — as a fast sanity filter before downstream processing.
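A minimal version of that density check, using the numbers from the example above. The plus-or-minus 50% tolerance and the per-page baseline are assumptions chosen for illustration; tune them against your own history of runs.

```python
def density_check(item_count: int, page_count: int,
                  baseline_items_per_page: float,
                  tolerance: float = 0.5) -> bool:
    """True if the item count falls inside the expected range for this length."""
    expected = baseline_items_per_page * page_count
    low = expected * (1 - tolerance)
    high = expected * (1 + tolerance)
    return low <= item_count <= high


# Baseline from a previous run: 40 controls extracted from a 50-page policy.
baseline = 40 / 50

print(density_check(item_count=12, page_count=50, baseline_items_per_page=baseline))
print(density_check(item_count=38, page_count=50, baseline_items_per_page=baseline))
```

With these settings the expected range for a 50-page policy is roughly 20 to 60 controls, so a run returning 12 is held back while a run returning 38 passes.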


What No Tutorial Teaches You About LLM Failure Modes

Courses and tutorials teach you the happy path. They show you a working API call, a clean output, a working chain. They don’t show you what quiet degradation looks like.

The failure modes worth understanding for production LLM systems:

  • Context exhaustion degradation (what I described above)
  • Instruction priority decay: instructions given early in a long prompt are followed less reliably than instructions given near the end, especially when the context window is close to full
  • Confident fabrication under pressure: when a model is working with incomplete context and instructed to complete a structured task, it will sometimes invent plausible-sounding entries rather than report incompleteness
  • Schema compliance vs. accuracy tradeoff: models optimized to return valid JSON will return valid JSON even when the content is incomplete or wrong

All of these have analogues in security. A system that logs “success” when it failed silently is a monitoring gap. A system that fabricates valid-looking output to satisfy a schema is the LLM equivalent of a honeypot response — it looks right and is completely wrong.


Key Takeaways

  • Context window limits cause silent, partial failure — not clean errors. Instrument for this explicitly.
  • Token counts are not word counts. Measure actual token usage in production and set thresholds.
  • For any extraction task on long documents, validate completeness, not just structure. Missing items are worse than malformed items.
  • Add a completeness probe step after major extraction tasks. A second lightweight check catches what the primary pass misses.
  • Output density checks — comparing returned item counts against expected ranges — are a fast, cheap sanity filter.
  • If your AI agent produces compliance or security outputs, silent omission is a false-negative risk. Treat it like one.

The fundamentals from 25 years of security engineering still apply: instrument for failure, validate outputs, never trust a system that can fail silently. The tools changed. The discipline didn’t.

Tags: #building-and-learning #enterprise-ai #llm-security #implementation #case-study
