
Context Engineering

Context engineering is the practice of deciding what information goes into a language model's context window on every call, so the model has exactly what it needs to produce a correct answer and nothing that distracts it. It has emerged as the central skill in building reliable AI systems because a model can only reason over what is in its context, and what belongs there changes with every request. Where prompt engineering tunes the fixed wording of an instruction, context engineering builds the system that assembles instructions, retrieved knowledge, memory, tool outputs, and history into the right window at the right moment, then prunes everything else.

Why Context Engineering Matters

A language model has no access to anything except the tokens you put in its context window. It does not remember your last conversation, it cannot see your database, and it does not know what your tools returned unless that information is present in the prompt for this specific call. Everything the model appears to know about your task, your user, and your data is there because some part of your system decided to put it there. Context engineering is the discipline of making that decision well, on every request, under a hard token budget.

The reason this has become the dominant concern in production AI is that model quality is now rarely the bottleneck. Frontier models are extremely capable when they have the right information in front of them, and they fail in predictable ways when they do not. A support agent that gives a wrong refund answer usually had the correct policy somewhere in its knowledge base and simply did not retrieve it into context. A coding assistant that edits the wrong function usually was never shown the file that mattered. These are not reasoning failures; they are context failures, and no amount of prompt wording fixes a window that is missing the one document the answer depended on.

The constraint that makes this hard is that the context window is finite and that more context is not free. Even models that advertise very large windows degrade as the window fills, because relevant tokens get diluted by irrelevant ones and the model's attention is spread thinner. Every token you add also costs money and latency on every call. So context engineering is an optimization under a budget: get the few thousand tokens that actually matter for this request into the window, leave everything else out, and do it fast enough and cheaply enough to run in production. A team that treats the prompt as a static string and stuffs everything it has into the window will pay more, respond more slowly, and get worse answers than a team that assembles a lean, relevant context for each call.

There is also a reliability argument that gets context engineering funded when the quality argument alone does not. Systems that assemble context ad hoc tend to fail silently and unpredictably: the same question works on Monday and fails on Friday because the retrieval index grew, or because the conversation got long enough to push the system instructions out of the window. A deliberate context pipeline, with explicit rules for what gets included and a fixed budget for each part of the window, turns these silent failures into something you can measure, test, and debug. That is the difference between a demo that impresses and a system that holds up with real users over months.

What Context Engineering Is

Context engineering is the set of systems and decisions that determine the exact contents of the context window for each model call. The context window for a single request is typically assembled from several distinct sources: the system instructions that define the model's role and rules, the immediate user input, relevant history from earlier in the conversation, knowledge retrieved from documents or a database, long-term memory about the user or task, the definitions of any tools the model can call, and the results those tools returned. Each of these competes for the same limited budget, and context engineering is the practice of deciding how much of each to include and in what form.
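As a rough illustration of that assembly step, the sketch below joins the sources into one labeled window. The `ContextSources` class, the `assemble_window` function, the section labels, and the ordering (instructions first, user input last) are all assumptions made for this example, not a standard layout.

```python
from dataclasses import dataclass, field


@dataclass
class ContextSources:
    """Hypothetical container for the sources that feed one model call."""
    system_instructions: str
    user_input: str
    history: list[str] = field(default_factory=list)
    retrieved: list[str] = field(default_factory=list)
    memory: list[str] = field(default_factory=list)
    tool_definitions: list[str] = field(default_factory=list)
    tool_results: list[str] = field(default_factory=list)


def assemble_window(src: ContextSources) -> str:
    """Join the non-empty sources into one labeled window.

    The order is an assumption of this sketch: instructions first,
    the user's input last, everything else in between.
    """
    sections = [
        ("SYSTEM", [src.system_instructions]),
        ("TOOLS", src.tool_definitions),
        ("MEMORY", src.memory),
        ("KNOWLEDGE", src.retrieved),
        ("HISTORY", src.history),
        ("TOOL RESULTS", src.tool_results),
        ("USER", [src.user_input]),
    ]
    blocks = []
    for label, items in sections:
        kept = [item for item in items if item.strip()]
        if kept:  # skip sources that contributed nothing to this request
            blocks.append(f"[{label}]\n" + "\n".join(kept))
    return "\n\n".join(blocks)
```

Empty sources are skipped entirely, so a request with no history or tool results produces a window without those sections rather than with empty headers.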

The work is dynamic rather than static. A prompt template is written once and reused, but the context for a request is built at runtime from whatever is relevant now. When a user asks a follow-up question, the system has to decide which earlier turns still matter and which can be summarized or dropped. When a question requires external knowledge, the system has to retrieve the right passages and fit them in. When a user has a history with the product, the system has to recall the facts about them that bear on this request and leave out the ones that do not. None of this can be hardcoded because the relevant set changes with every input. This is why context engineering is described as building a system rather than writing a prompt.

The goal of that system can be stated precisely: maximize the relevance density of the window. Relevance density is the fraction of the tokens in context that actually bear on the current request. A window padded with an entire user manual to answer one question about shipping has low relevance density and will produce worse answers than a window holding just the shipping section, even though the padded window contains the answer too. The entire craft reduces to raising this ratio, which means being aggressive about what to include and equally aggressive about what to leave out or remove.
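To make the ratio concrete, here is a minimal sketch that computes relevance density for a window whose chunks have already been labeled relevant or not. The whitespace-based `count_tokens` is a crude stand-in for a real tokenizer, and in practice the relevance labels would come from a human review or an evaluation step; both are assumptions of this example.

```python
def count_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (not a real tokenizer)."""
    return len(text.split())


def relevance_density(chunks: list[tuple[str, bool]]) -> float:
    """Fraction of tokens in the window that belong to chunks marked relevant."""
    total = sum(count_tokens(text) for text, _ in chunks)
    if total == 0:
        return 0.0
    relevant = sum(count_tokens(text) for text, is_relevant in chunks if is_relevant)
    return relevant / total


# A padded window: one relevant chunk plus two that do not bear on a shipping question.
padded = [
    ("Shipping takes three to five business days.", True),
    ("Returns are accepted within thirty days of delivery.", False),
    ("The warranty covers manufacturing defects for one year.", False),
]
lean = padded[:1]  # the shipping section alone

print(relevance_density(padded))  # 7 relevant tokens out of 23
print(relevance_density(lean))    # every token is relevant
```

Both windows contain the answer, but only the lean one spends its whole budget on it.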

Key Takeaway

Context engineering builds the runtime system that assembles instructions, history, retrieved knowledge, memory, and tool results into the context window for each call. Its single objective is relevance density: the highest possible fraction of tokens that actually matter for the request, within a fixed budget.

How It Differs From Prompt Engineering

Prompt engineering is a subset of context engineering. Prompt engineering is the craft of writing the instruction itself: choosing the wording, the examples, the output format, and the reasoning structure that gets the best response from a model. It operates on the part of the window that is relatively fixed and authored by hand. Context engineering is the larger discipline of managing the entire window, including everything that is assembled dynamically around that instruction. A perfectly worded prompt still fails if the retrieval step did not surface the document the answer needed, and that failure is a context engineering problem, not a prompt wording problem.

The reason the field shifted its language from prompting to context engineering is that real applications spend most of their effort on the dynamic parts. In a chatbot that answers questions about your product, the system instruction is a small, stable block of text that you write once. The hard, ongoing work is retrieving the right knowledge, managing a conversation that grows past the window, remembering what the user told you earlier, and fitting tool results back into the prompt. That is where the failures happen and where the engineering time goes. Calling the whole activity prompt engineering undersold the part that actually determines whether the system works. The page on context engineering versus prompt engineering works through this distinction with concrete before-and-after examples.

The Four Context Strategies

The practical techniques of context engineering group into four strategies, and a mature system uses all four together. They are write, select, compress, and isolate.

Write means putting information outside the context window so it can be brought back later, rather than trying to hold everything in the window at once. This includes saving notes or a scratchpad during a long task, and persisting facts to long-term memory at the end of a session. Writing context out is what makes a system able to operate over more information than fits in a single window, because it can store now and retrieve the relevant part later.
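A minimal sketch of the write strategy, assuming notes are simple key-value facts persisted to a JSON file between sessions. The `save_notes` and `load_notes` helpers and the file layout are hypothetical; a production system would more likely use a database or a dedicated memory store.

```python
import json
from pathlib import Path


def load_notes(path: Path) -> dict[str, str]:
    """Read previously written notes, or return an empty dict if none exist yet."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_notes(path: Path, notes: dict[str, str]) -> None:
    """Merge new notes into the stored ones so they survive past the current window."""
    stored = load_notes(path)
    stored.update(notes)  # newer notes overwrite older ones with the same key
    path.write_text(json.dumps(stored, indent=2), encoding="utf-8")
```

A session might call `save_notes(path, {"preferred_contact": "email"})` when it ends, and a later session would call `load_notes(path)` and pass only the entries that matter into its window.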

Select means choosing what to pull back into the window for the current request. This is the retrieval step: semantic search over a knowledge base, a query against stored memory, fetching the relevant rows from a database, or picking which earlier conversation turns to include. Selection is where most context quality is won or lost, because including the right three passages and excluding the irrelevant fifty is the difference between a grounded answer and a diluted one. The guide on retrieving the right context covers how to do this well.
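The selection step can be sketched as score, filter, and keep the top few. The example below uses word overlap as a deliberately simple stand-in for semantic search; the scoring function, the `min_score` threshold, and the default `k` are illustrative assumptions, and a real system would typically score with embeddings or a search index.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase word set with punctuation stripped."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap_score(query: str, passage: str) -> float:
    """Share of query words that also appear in the passage (stand-in for semantic similarity)."""
    query_words = tokenize(query)
    if not query_words:
        return 0.0
    return len(query_words & tokenize(passage)) / len(query_words)


def select_passages(query: str, passages: list[str], k: int = 3, min_score: float = 0.2) -> list[str]:
    """Return at most k passages scoring at least min_score, best first."""
    scored = [(overlap_score(query, p), p) for p in passages]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in kept[:k]]
```

The threshold matters as much as `k`: without it, a query with no good match would still pull its `k` least-bad passages into the window.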

Compress means reducing information to fit more signal into fewer tokens. The most common form is summarizing a long conversation history into a short recap that preserves the facts that still matter. Other forms include trimming retrieved documents to the relevant span, deduplicating repeated information, and replacing verbose tool output with its essential result. Compression buys window space without dropping the information that earns its place, and the guide to compressing context covers when and how to apply it safely.
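As a sketch of the most common form, the function below keeps the most recent turns verbatim and replaces everything older with a single summary line. The `summarize` callable is an assumption: in a real pipeline it would usually be a model call, and the summary prefix and `keep_recent` parameter are names chosen for this example.

```python
from typing import Callable


def compress_history(
    turns: list[str],
    keep_recent: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Keep the last keep_recent turns verbatim and fold older turns into one summary."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    if len(turns) <= keep_recent:
        return list(turns)  # nothing old enough to compress
    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]
    return [f"Summary of earlier conversation: {summarize(older)}"] + recent
```

Passing the summarizer in as a parameter keeps the trimming logic testable without a model, and makes it easy to swap in a stricter summarizer for conversations where dropped details are costly.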

Isolate means splitting context across boundaries so that each model call sees only what it needs. A multi-agent system isolates context by giving each sub-agent its own focused window for its sub-task, rather than running everything through one bloated context. Sandboxing tool outputs and keeping certain state out of the model's window entirely are also forms of isolation. Isolation keeps any single window small and focused, which is the precondition for high relevance density. The full set is laid out in the principles of context engineering.
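A minimal sketch of isolation, assuming each sub-task comes with its own documents and that `call_model` is whatever function sends a window to a model and returns text. Both the `run_isolated` name and the window layout are hypothetical.

```python
from typing import Callable


def run_isolated(
    subtasks: dict[str, list[str]],
    call_model: Callable[[str], str],
) -> dict[str, str]:
    """Run each sub-task in its own window built only from that sub-task's documents."""
    results: dict[str, str] = {}
    for task, documents in subtasks.items():
        # Each window is built from scratch: nothing from other sub-tasks leaks in.
        window = f"Task: {task}\n" + "\n".join(documents)
        results[task] = call_model(window)
    return results
```

The coordinating call then sees only the short results in the returned dict, not the documents each sub-task had to read to produce them.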

How Context Fails

Understanding context engineering means understanding the specific ways context degrades, because each failure mode calls for a different fix. The most discussed is context rot, the steady decline in answer quality as a window fills up, even when the window is technically within the model's limit. As more tokens accumulate, the model's attention spreads across them, relevant details get harder to locate, and response quality drops. Context rot is why a long-running conversation gives worse answers in its fiftieth turn than in its fifth, and it is the reason large windows do not eliminate the need for context engineering. The explanation of context rot covers the evidence and the mitigations.

Beyond rot, a few distinct failures recur. Context distraction happens when irrelevant but plausible information in the window pulls the model off the correct answer, which is the cost of low relevance density. Context confusion happens when the window contains contradictory information, such as a current fact and a stale one, and the model cannot tell which to trust. Context clash happens when retrieved passages or tool definitions conflict with the system instructions, producing inconsistent behavior. The lost in the middle effect, where models attend more to the beginning and end of a long window than its middle, means that placement within the window matters and that the most important content should not be buried in the center of a large context. Each of these is a reason to keep windows lean and curated rather than full.
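One way to act on the placement point is to order retrieved passages so the strongest sit at the edges of the window and the weakest end up in the middle. The sketch below assumes the input is already ranked best first; `place_at_edges` is a hypothetical helper, and whether this ordering helps for a given model is something to verify empirically.

```python
def place_at_edges(ranked: list[str]) -> list[str]:
    """Reorder best-first passages so the top ones sit at the start and end of the window."""
    front: list[str] = []
    back: list[str] = []
    for index, passage in enumerate(ranked):
        # Alternate: ranks 1, 3, 5... fill from the front, ranks 2, 4... from the back.
        (front if index % 2 == 0 else back).append(passage)
    return front + back[::-1]


print(place_at_edges(["A", "B", "C", "D", "E"]))  # ['A', 'C', 'E', 'D', 'B']
```

With five passages ranked A through E, the best (A) opens the window, the second best (B) closes it, and the weakest (E) lands in the center.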

The Token Budget

Every decision in context engineering happens under a token budget, and making that budget explicit is what separates a deliberate pipeline from one that overflows unpredictably. The window has a hard ceiling, the model needs room within it to generate its response, and quality degrades as the window approaches capacity, so the practical budget is smaller than the advertised limit. Within that practical budget, the parts of the window (instructions, history, retrieved knowledge, memory, and tool results) compete for space, and the act of allocating tokens across them is a core engineering decision rather than an afterthought.

The allocation should follow what the application most depends on. A support assistant grounded in policy gives most of its budget to retrieved knowledge and memory, while a coding assistant gives most to retrieved files and symbols. Setting the allocation in advance forces the priority question (what matters most for this request?) and prevents the common failure where an unbounded history or an oversized retrieval result silently crowds out the system instructions. When a section exceeds its allocation, that is the signal to apply compression to it rather than to let it spill over and push something important out of the window.
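An explicit allocation can be as small as a table of percentages. The sketch below reserves room for the model's response, splits the remainder across sections, and reports which sections exceed their share. The section names, the percentages in the usage line, and the whitespace-based `count_tokens` are illustrative assumptions.

```python
def count_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (not a real tokenizer)."""
    return len(text.split())


def allocate(total_budget: int, reserve_for_output: int, shares_pct: dict[str, int]) -> dict[str, int]:
    """Split the usable budget across sections according to whole-number percentages."""
    if sum(shares_pct.values()) != 100:
        raise ValueError("shares must add up to 100 percent")
    usable = total_budget - reserve_for_output
    if usable <= 0:
        raise ValueError("nothing left after reserving room for the response")
    return {name: usable * pct // 100 for name, pct in shares_pct.items()}


def over_budget(sections: dict[str, str], allocation: dict[str, int]) -> list[str]:
    """Names of sections whose content exceeds its allocation and needs compressing."""
    return [
        name for name, text in sections.items()
        if count_tokens(text) > allocation.get(name, 0)
    ]
```

For example, `allocate(8000, 1000, {"instructions": 10, "history": 20, "knowledge": 50, "memory": 10, "tool_results": 10})` leaves 7,000 usable tokens and gives retrieved knowledge 3,500 of them. Any section returned by `over_budget` is a candidate for compression before the window is sent.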

Budgeting also keeps cost and latency in check, because every token in the window is paid for on every call. A pipeline without a budget tends to grow over time as prompts accumulate additions and retrieval results creep larger, and the cost rises with no corresponding gain in quality. A fixed budget makes this visible: when a section wants more room, the cost of giving it more is explicit, and the trade against the other sections is deliberate. This is the discipline that ties context engineering to AI cost optimization, since the leanest window that still answers well is usually both the highest-quality and the cheapest.

Context Engineering for Agents

Agents make context engineering both harder and more important. An agent runs a loop: it reads its context, decides on an action, calls a tool, and feeds the result back into context for the next step. Every iteration adds tokens, so an agent's window grows continuously over a task, and without active management it hits the window limit or succumbs to context rot well before the task is done. The longer and more autonomous the agent, the more its success depends on managing what stays in the window across many steps.

The strategies that keep agents working are the four above applied across the loop. Agents write intermediate results to a scratchpad or to memory so they can clear them from the active window and recall them when needed. They compress, often by summarizing completed sub-tasks into a short result before moving on, so finished work does not consume the budget for remaining work. They isolate by spawning sub-agents with their own clean windows for distinct sub-tasks, so the main agent's context stays focused on coordination. And they select carefully which tool results and which parts of history to keep as the task proceeds. The dedicated guide on context engineering for AI agents goes through these patterns with concrete agent loops, and the memory for AI agents pillar covers the persistence layer that makes them possible.
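The loop and two of the strategies, write and compress, can be sketched together. In the example below, each tool result is appended to the window, and when the window passes a limit the full entries are written to a scratchpad and replaced by one summary. `run_tool` and `summarize` are stand-ins supplied by the caller (in practice a tool dispatcher and a model call), and the limit and token counter are assumptions of the sketch.

```python
from typing import Callable


def count_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (not a real tokenizer)."""
    return len(text.split())


def run_agent(
    steps: list[str],
    run_tool: Callable[[str], str],
    summarize: Callable[[list[str]], str],
    window_limit: int,
) -> tuple[list[str], list[str]]:
    """Run steps in order, compacting the window whenever it exceeds window_limit tokens."""
    window: list[str] = []
    scratchpad: list[str] = []
    for step in steps:
        window.append(f"{step}: {run_tool(step)}")
        if sum(count_tokens(entry) for entry in window) > window_limit:
            scratchpad.extend(window)       # write: keep the full record outside the window
            window = [summarize(window)]    # compress: one short entry replaces finished work
    return window, scratchpad
```

The scratchpad keeps everything that was compacted away, so a later step can still select a specific earlier result back in if the summary turns out to be too lossy.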

Memory as the Context Layer

Memory is the part of context engineering that handles information persisting across requests and sessions. Retrieval over a static knowledge base answers questions about documents, but it does not capture what a specific user told you last week, what an agent learned during a previous task, or which facts have been confirmed and which contradicted over time. A memory layer stores these facts outside the window and selects the relevant ones back into context when they bear on the current request, which is exactly the write and select strategies applied to information that the system itself accumulates rather than information that came pre-written in documents.

This is where context engineering and a memory system meet directly. Adaptive Recall is a memory layer built for this job: it stores facts with confidence scores that rise as information is independently corroborated and fall when it is contradicted, so the select step can prefer well-supported memories and avoid pulling stale or conflicting ones into the window. Because memories carry these scores and can be queried by relevance, the system raises relevance density automatically, bringing back the few facts that matter for a request and leaving the rest in storage. The page on whether memory is part of context engineering works through this relationship, and treating memory as the persistent half of your context pipeline is what lets a system carry knowledge across sessions without drowning every window in history.
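The general idea of confidence-scored memory can be illustrated with a small sketch. This is not Adaptive Recall's API or scoring method: the `Memory` class, the step sizes, and the threshold are hypothetical, and relevance scoring against the current request is left out to keep the example short.

```python
from dataclasses import dataclass


@dataclass
class Memory:
    """A stored fact with a confidence score between 0 and 1 (illustrative only)."""
    fact: str
    confidence: float = 0.5

    def corroborate(self, step: float = 0.1) -> None:
        """Raise confidence when an independent source agrees."""
        self.confidence = min(1.0, self.confidence + step)

    def contradict(self, step: float = 0.2) -> None:
        """Lower confidence when new information conflicts with the fact."""
        self.confidence = max(0.0, self.confidence - step)


def select_memories(memories: list[Memory], min_confidence: float = 0.6, k: int = 3) -> list[str]:
    """Return up to k facts at or above the confidence threshold, most trusted first."""
    trusted = [m for m in memories if m.confidence >= min_confidence]
    trusted.sort(key=lambda m: m.confidence, reverse=True)
    return [m.fact for m in trusted[:k]]
```

A fact that has been corroborated twice clears the default threshold, while one that has been contradicted once falls below it and stays in storage instead of entering the window.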

Key Takeaway

Context engineering is not one technique but a system: write information out of the window, select the relevant part back in, compress what stays to fit more signal, and isolate context so each call is focused. Memory is the persistent layer that makes selection work across sessions, and agents are where all four strategies become non-negotiable.

Common Mistakes in Context Engineering

A handful of mistakes recur often enough to be worth naming, because avoiding them does more for system quality than any single advanced technique. The first and most common is stuffing the window: including everything that might be relevant on the assumption that more context is safer. This lowers relevance density, invites context rot, and raises cost, and it produces worse answers than a lean window even though the answer is technically present. The discipline is to be as aggressive about exclusion as about inclusion.

The second mistake is treating the prompt as static. Teams that write a fixed prompt template and never build the dynamic assembly around it cap their system at what one authored block can do, and they hit a wall the moment the application becomes multi-turn or personalized. The third is ignoring history growth, letting a conversation accumulate verbatim until it overflows the window or displaces the system instructions, rather than summarizing old turns. The fourth is having no memory layer, so the system resets to blank every session and cannot use what users told it before, no matter how good its document retrieval is.

The fifth and most insidious mistake is operating the pipeline blind. A team that cannot see what entered the window for a given request cannot diagnose why an answer was wrong, so it tunes the prompt at random while the real failure sits in retrieval or compression. Making the assembled window observable, logging what each source contributed and correlating it with answer quality, is what turns context engineering from guesswork into a measurable engineering practice. Avoiding these five mistakes is mostly a matter of treating the window as a managed system rather than a string, which is the whole premise of the discipline.
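Making the window observable can start with one structured log line per request. The sketch below records how many tokens each source contributed; the field names, the `window_report` helper, and the whitespace-based token count are assumptions, and a real pipeline would send this record to its existing logging or tracing system.

```python
import json


def count_tokens(text: str) -> int:
    """Crude token estimate: whitespace-separated words (not a real tokenizer)."""
    return len(text.split())


def window_report(request_id: str, sections: dict[str, str]) -> str:
    """Build a JSON log line showing what each source contributed to the window."""
    counts = {name: count_tokens(text) for name, text in sections.items()}
    record = {
        "request_id": request_id,
        "tokens_by_source": counts,
        "total_tokens": sum(counts.values()),
    }
    return json.dumps(record, sort_keys=True)
```

Joining these records with answer-quality labels by `request_id` is what lets a team see, for instance, that wrong answers cluster in requests where retrieved knowledge contributed almost nothing.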

Key Takeaway

The recurring failures are stuffing the window, treating the prompt as static, ignoring history growth, having no memory layer, and operating blind. Each is a failure to treat the window as a managed system, and avoiding them matters more than any single advanced technique.
