Context Engineering: Making AI Agents Useful on Codebases That Fight Back

Jun 09, 2026

Chroma tested 18 frontier models and every one degrades as input context grows. Stanford measured a 30% accuracy drop for information in the middle of the window. The question is not whether your context window is big enough. It is whether the context inside it is good enough.

Last week I wrote about my AI coding setup. The tools, the CLAUDE.md architecture, the skills, the guardrails. The response I got most was some version of the same question: “This works on your services, but what about when the codebase is massive? What about when the agent needs to understand interactions across systems? What about the stuff that is not written down anywhere?”

This is the post that answers those questions. The answers are not clean. Context engineering on large, legacy codebases is the hardest problem in AI-assisted development right now. The tools can be set up in a day. The context problem is ongoing, and it never fully resolves.

Context Is a Budget, Not a Library

I am going to spend two paragraphs on the research and then move to the technique, because the research matters but the technique is why you are reading.

Chroma’s 2025 study tested 18 frontier models and found every one degrades as input length increases, well before the window fills. Stanford demonstrated a U-shaped accuracy curve: 30% or more accuracy loss for information positioned in the middle of the context. Cognition measured it in coding agents specifically: agents spend 60% or more of their first turn just searching, and doubling task duration quadruples the failure rate.

You have felt this. The first hour of a Claude Code session is sharp. By the third hour, the agent forgets decisions it made earlier. Code suggestions drift from the codebase’s patterns. The context is full of noise and the signal is getting lost. The research confirms what your experience already told you: more context is not better. Better context is better. Everything that follows is about engineering for that reality.

Subagent Delegation: The Core Pattern

This is the most important technique in my workflow and the one that took the longest to get right.

Claude Code supports subagents: separate agent instances with their own context window, system prompt, and tool permissions. The parent delegates a task, the subagent does the work in isolation, and the parent gets back a compressed summary. The subagent’s verbose investigation, every file it read, every dead end it explored, stays in its context and gets discarded when it is done.

The parent session’s context stays clean for the work that matters.

Why subagents do not suffer the same degradation as long main sessions: the answer is scope. A subagent investigating a test failure reads three or four files, reasons about one specific question, and produces a focused summary. It never accumulates the sprawling context that degrades a main session over hours. Each subagent is a short, sharp interaction. Context rot requires time and accumulation. Subagents have neither.

I use three patterns daily.

The investigator. Before the main session implements anything, I delegate the diagnosis to a subagent: “Read the test failure in AllocationServiceTests. Read the test, the service, and the repository. Report back what is causing the assertion failure and which files need to change.” The subagent comes back with a focused diagnosis. The main session implements the fix with clean context.

The reviewer. Before I review a large PR, a subagent reads through the changes: “Summarise what was changed, what the likely intent was, which files contain the most complex logic, and whether any validation rules or API contracts were modified.” I review the actual code with the briefing as context.

The scout. When I need to understand how a pattern is used across the codebase: “Find every service that implements the audit logging pattern. Show me the three most recent examples and note any inconsistencies.” The scout reports back. The main session uses the summary to implement the same pattern correctly.

The critical verification step: subagent output is not trusted output. The investigator might produce a diagnosis that is technically coherent but wrong because it does not know the history behind a design decision. I treat subagent summaries the same way I treat a junior engineer’s analysis: useful context that I verify against my own understanding before acting on it. The subagent saves me the time of reading six files. It does not save me the time of thinking about whether the analysis is correct. If I cannot verify the summary, I read the files myself. That is the escape hatch, and using it is not a failure of the workflow. It is the workflow working as designed.

The cost trade-off: subagents consume additional API tokens. The investigator pattern means two context windows for a debugging session instead of one. The reviewer pattern means an AI pass before every human review. This costs real money at scale. In my experience, the trade-off is overwhelmingly positive because a clean main session produces better output on the first pass, which means less rework, which means fewer total tokens spent than a degraded main session that produces bad output you have to correct and regenerate. But this is worth tracking. If you are running Claude Code across a team of twelve engineers, the subagent token cost is a line item you should measure, not assume.

Multi-Service Context: The Enterprise Problem

Here is where generic AI advice stops being useful.

I work across multiple services that interact through APIs, shared databases, and message queues. A change in one service’s validation layer can break a downstream service’s allocation logic. A schema change in a shared database affects every service that reads from it. The services are too large to fit in the context window simultaneously, and the interactions between them are where most of the bugs live.

The naive approach is to load both codebases. This fails. Two services at summary level still consume context faster than you expect, and the agent’s reasoning about their interaction degrades as the window fills.

My approach is the contract-first pattern. Instead of loading both services, I load the contract between them.

Specifically: I use a subagent to extract the contract from the actual running code, not from documentation. “Read the controller endpoints in this service. List every public endpoint, its request and response shapes, and any documented constraints.” The subagent reads the real code and produces a contract summary. That summary, typically a few hundred tokens, goes into the main session alongside the code I am modifying.

The distinction between extracted contracts and documented contracts matters. Documented contracts drift from implementation within weeks. Extracted contracts reflect what the code actually does right now. The subagent reads the source of truth, not a wiki page that was last updated eight months ago.

This handles roughly 80% of cross-service interactions. The remaining 20% are implicit dependencies: situations where service B relies on a timing behaviour of service A that was never documented because it was never intended to be a feature. Those break in ways that no contract extraction can prevent. They require the institutional knowledge of an engineer who has been on-call when the implicit dependency failed. I do not have a tooling solution for those. I handle them by knowing the systems well enough to spot the risk manually.

When to Navigate vs When to Pre-load

One of the mistakes I made early on was pre-loading everything the agent might need. File trees, database schemas, configuration files, every piece of context I thought could be relevant.

The output got worse, not better. The agent was drowning.

The rule I have settled on: pre-load only what the agent cannot find on its own.

What I pre-load: CLAUDE.md files (loaded automatically). The specific files I want modified. Contract definitions for cross-service work. Regulatory constraints not expressed in the code.

What I let the agent find: project structure, similar implementations, test files, utility classes, configuration. Anything discoverable through the agent’s file navigation.

Every token I pre-load is a token the agent carries for the entire session, whether it turns out to be relevant or not. When the agent navigates to a file on its own, the context cost is paid only when the information is actually needed. The research is clear: truncated inputs outperform full-text inputs. The less you load, the sharper the agent stays.

The Knowledge That Is Never Written Down

Every complex software system accumulates institutional knowledge that lives in the heads of the engineers who built it. Why the retry logic uses a specific ceiling. Why the audit logging differs between two services built by different teams three years apart. Why a database index exists even though the query planner appears not to use it.

This knowledge is not in the code, not in the comments, not in the documentation. It transfers to other engineers through proximity: pair programming, code review conversations, the offhand comment during a PR review that explains why the seemingly redundant check exists.

AI agents do not absorb any of this. Every session is day one.

My approach is partial. When I discover institutional knowledge that affects how the agent should behave, I write it into the relevant CLAUDE.md file. One line. “The retry ceiling is 30 seconds because the downstream service times out at 45 and we need headroom.” “Audit logging in this service uses synchronous writes because the compliance audit requires guaranteed delivery order.”

Over time, these notes accumulate. Twenty notes cover twenty situations where the agent would otherwise guess wrong. It is not comprehensive. It never will be. New discoveries happen faster than I can document them. But partial is better than absent.

The honest assessment: context engineering on enterprise codebases in 2026 is roughly 70% solvable. The CLAUDE.md architecture, skills, subagent delegation, contract-first patterns, and selective pre-loading cover the majority. The remaining 30% is institutional knowledge that resists documentation and that no AI tool handles well. Working productively means working within that gap, not pretending the tools will close it.

A Session That Went Sideways

Let me show you what this workflow actually looks like, including the part where it broke.

I needed to add a new validation rule to a service. Regulatory change. A specific field must be present on records that meet certain criteria.

I started in Claude Desktop. Spent fifteen minutes clarifying the requirement: which records, which criteria, what happens to existing records without the field, any downstream effects. Good practice. Got a clean specification.

Opened Claude Code in the terminal (enterprise Java, JetBrains, terminal-first). Sent a scout subagent to find how existing validation rules are implemented. Scout came back with two recent examples and the test patterns.

Asked the main session to implement the new rule following the scout’s examples. The implementation looked clean. Validation rule, service layer integration, test. I reviewed it.

Here is where it went wrong. The test passed. But the test was wrong. The agent had written a test that asserted the validation rule fired for records matching the criteria, which was correct. But it had not written a test for the negative case: records that do not match the criteria should not be affected. The scout’s examples had included negative case tests. The main session had not followed the pattern fully.

I caught it because I was checking the test against the specification, not just checking whether the test passed. If I had been glancing rather than reviewing, the incomplete test coverage would have shipped, and the validation rule might have been applied too broadly in production, rejecting records that should have been accepted.

The fix took five minutes. I asked the main session to add the negative case test, referencing the scout’s examples. It did. Everything passed.

But the lesson is the one I keep re-learning: the agent’s output always looks correct. The tests always pass. The code always compiles. Verification is not about checking whether the agent produced working code. It is about checking whether the agent produced the right code. That check requires understanding the requirement well enough to notice what is missing, which is harder than noticing what is wrong.

Total time: about fifty minutes, including the course correction. Without the context engineering, without the subagent delegation, without the specification work in Claude Desktop, it would take longer and produce worse output. Not because the agent is incapable. Because the agent without context is confident, fast, and incomplete.

What I Am Still Getting Wrong

Long-running refactoring sessions still degrade. Even with subagents, a refactoring that touches twenty files eventually exhausts the main session’s context. My workaround is breaking refactors into multiple sessions with a coordination document (a markdown file listing what has been done and what remains) that I load into each new session. It works. It is clumsy.

Cross-language context is weak. When a change spans Java on the backend and Angular on the frontend, the agent struggles to reason about both simultaneously. The contract-first pattern handles API boundaries, but UI integration still requires human judgement about whether the frontend behaviour matches the backend change.

The CLAUDE.md files are getting long. Some push 2,000 tokens. The institutional knowledge notes are starting to crowd out the structural information. I need a pruning strategy and I do not have one yet.

Team adoption is uneven. I have shared these patterns with colleagues. Some adopted the subagent patterns immediately. Others still dump entire files into the main session. The learning curve is real, and I have not found the right way to standardise this across a team without it feeling like a mandate. If you are a manager reading this and thinking about rolling it out to your teams, my honest advice is to start with one pattern (the investigator subagent) and let the value demonstrate itself before adding complexity.

These are engineering problems with engineering solutions I have not built yet. The workflow is good. It is not finished.

The Senior Engineer's Compass

Discussion about this post

Ready for more?