How to Compress Context Without Losing What Matters

Compressing context means reducing the information in a window so it occupies fewer tokens while keeping everything that matters for the request. The reliable method is to set a budget for each part of the window, summarize old conversation history into a compact recap, trim retrieved documents to their relevant spans, deduplicate and condense tool output, and explicitly protect critical facts from being summarized away. Done in this order, compression buys window space and raises relevance density without discarding the details a later step depends on.

Compression is one of the four context strategies, and it is the one that keeps long-running systems from drowning in their own accumulated content. The risk that makes it tricky is that aggressive summarization can throw away a detail that turns out to matter, so the goal is not maximum shrinkage but maximum shrinkage of the low-value parts while the high-value facts are preserved exactly. The steps below order the work so the safe, high-yield compression happens first and the risky parts are protected throughout.

Step 1: Set a token budget per window section

Before compressing anything, decide how many tokens each part of the window is allowed: the system instructions, the conversation history, the retrieved knowledge, and the tool output. A budget turns compression from a vague instinct into a target: you now know that history must fit in, say, a thousand tokens and retrieved knowledge in two thousand. It also forces the priority decision of which parts matter most for your application, which is the real design choice. Without a budget, compression is reactive: you act only once the window overflows, which is too late and produces uneven results.
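A minimal sketch of a per-section budget check follows. The section names, the `BUDGETS` values for system instructions and tool output, and the `estimate_tokens` helper are all illustrative assumptions; in particular, counting whitespace-separated words is only a rough stand-in for whatever tokenizer your model actually uses.

```python
def estimate_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: counts whitespace-separated words."""
    return len(text.split())


# Hypothetical budgets, in tokens. The history and retrieved figures echo the
# "say, a thousand / two thousand" example; the others are placeholders.
BUDGETS = {
    "system": 500,
    "history": 1000,
    "retrieved": 2000,
    "tool_output": 1000,
}


def over_budget(sections: dict[str, str], budgets: dict[str, int]) -> dict[str, int]:
    """Return, for each section that exceeds its budget, the number of excess tokens.

    A section with no entry in `budgets` is treated as having a budget of zero.
    """
    overages = {}
    for name, text in sections.items():
        excess = estimate_tokens(text) - budgets.get(name, 0)
        if excess > 0:
            overages[name] = excess
    return overages
```

Calling `over_budget` before each model request tells you which section to compress and by how much, which is what makes the later steps targeted rather than reactive.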

Step 2: Summarize old conversation history

The largest and safest compression target in most systems is old conversation history. Keep the most recent turns verbatim, because they are most likely to be referenced directly, and replace older turns with a running recap. The recap should preserve the durable outcomes (decisions made, facts the user stated, questions still open) and drop the verbatim phrasing of the exchange. Update the recap as the conversation grows so it stays current. This single step often reclaims the majority of a bloated window, because raw turn-by-turn history is mostly low-value tokens once a conversation is long.
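The recap pattern can be sketched as below. This is an illustration under assumptions: `compress_history`, the `keep_recent` default of 4, and the `summarize` callable are hypothetical names, and `summarize` stands in for whatever model call you use to fold old turns into the existing recap.

```python
from typing import Callable


def compress_history(
    turns: list[str],
    recap: str,
    summarize: Callable[[str, list[str]], str],
    keep_recent: int = 4,
) -> tuple[str, list[str]]:
    """Fold all but the last `keep_recent` turns into the running recap.

    `summarize(recap, old_turns)` is assumed to return an updated recap that
    keeps decisions, user-stated facts, and open questions.
    Returns the (possibly updated) recap and the turns to keep verbatim.
    """
    if len(turns) <= keep_recent:
        return recap, turns  # nothing old enough to summarize yet
    split = len(turns) - keep_recent
    old, recent = turns[:split], turns[split:]
    return summarize(recap, old), recent
```

Passing the previous recap into `summarize` is a deliberate choice in this sketch: the recap is updated incrementally, so older turns never need to be re-read once they have been folded in.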

Step 3: Trim retrieved documents to the relevant span

Retrieved documents are frequently included whole when only a section is relevant. Trim each retrieved item to the passage that actually matches the request, and cut boilerplate like navigation text, repeated headers, and legal footers. If your retrieval returns large chunks, consider a second pass that extracts the relevant sentences from each chunk before placing them in the window. This raises relevance density directly, since every trimmed token was diluting the ones that mattered. The retrieval side of this is covered in how to retrieve the right context.
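One way to picture the second extraction pass is the sketch below, which keeps only the sentences of a chunk that share words with the request. The function name and the keyword-overlap scoring are assumptions chosen for brevity; a production system would more plausibly score sentences with embeddings or a reranker, and the naive sentence splitter here will mis-split abbreviations.

```python
import re


def extract_relevant(chunk: str, query: str, max_sentences: int = 3) -> str:
    """Keep up to `max_sentences` sentences that share the most words with the query.

    Scoring is plain word overlap (an illustrative assumption), and the kept
    sentences are returned in their original order.
    """
    query_terms = set(re.findall(r"\w+", query.lower()))
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    scored = []
    for index, sentence in enumerate(sentences):
        overlap = len(set(re.findall(r"\w+", sentence.lower())) & query_terms)
        if overlap:
            scored.append((overlap, index, sentence))
    # Highest overlap first; ties go to the earlier sentence.
    top = sorted(scored, key=lambda item: (-item[0], item[1]))[:max_sentences]
    # Restore document order so the extract still reads naturally.
    return " ".join(sentence for _, _, sentence in sorted(top, key=lambda item: item[1]))
```

Sentences with no overlap are dropped entirely, which is how navigation text and footers fall out without a separate boilerplate filter in this simplified version.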

Step 4: Deduplicate and condense tool output

Tool and function results are often verbose and repetitive, especially when an agent calls the same tool multiple times. Remove duplicated information across results, and replace raw output with its essential values. A query that returned a hundred rows, for example, can often be condensed to the few rows or the aggregate that the model actually needs. For agents, condense the output of a completed sub-task into its result before moving on, so the steps it took do not linger in the window. This is the same compression principle applied to the parts of the window that grow fastest in tool-using systems.
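The two operations described above, dropping repeated results and reducing a large result set to an aggregate plus a few rows, might look like the following. The function names, the shape of the results (lists of JSON-serializable dicts), and the choice of "top rows by a numeric field" as the condensed form are assumptions for illustration.

```python
import json


def dedupe_results(results: list[dict]) -> list[dict]:
    """Drop tool results that are exact repeats, keeping the first occurrence.

    Assumes each result is JSON-serializable so it can be compared by content.
    """
    seen = set()
    unique = []
    for result in results:
        key = json.dumps(result, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(result)
    return unique


def condense_rows(rows: list[dict], value_field: str, top_n: int = 3) -> dict:
    """Reduce a large row set to its count, a total, and the top few rows."""
    top = sorted(rows, key=lambda row: row[value_field], reverse=True)[:top_n]
    return {
        "row_count": len(rows),
        "total": sum(row[value_field] for row in rows),
        "top_rows": top,
    }
```

Which aggregate to keep depends on what the model needs next; a total and the largest rows are just one plausible choice.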

Step 5: Protect critical facts from compression

The final and most important step runs throughout the others: identify the facts that must never be lost and protect them from summarization. Identifiers, exact numbers, names, explicit decisions, and any state a later step depends on should be pinned and carried forward verbatim, not folded into a summary that might paraphrase or drop them. A practical pattern is to maintain a small, protected facts block that survives every compression pass untouched, separate from the summarized narrative. This is what lets you compress aggressively elsewhere without the failure mode where the recap reads well but has quietly dropped the one number the task hinged on.
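A protected facts block can be as simple as a structure that keeps pinned facts in a separate field the summarizer never receives. The class below is a hypothetical sketch: `CompressibleContext`, its method names, and the rendered layout are illustrative assumptions, and `summarize` again stands in for a model call.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class CompressibleContext:
    """Holds a narrative that may be summarized and facts that must not be."""

    pinned_facts: dict[str, str] = field(default_factory=dict)
    narrative: str = ""

    def pin(self, key: str, value: str) -> None:
        """Record a fact that must be carried forward verbatim."""
        self.pinned_facts[key] = value

    def compress(self, summarize: Callable[[str], str]) -> None:
        """Summarize only the narrative; pinned facts are never passed in."""
        self.narrative = summarize(self.narrative)

    def render(self) -> str:
        """Build the text placed in the window: facts first, then the summary."""
        facts = "\n".join(f"- {key}: {value}" for key, value in self.pinned_facts.items())
        return f"PROTECTED FACTS (verbatim):\n{facts}\n\nSUMMARY:\n{self.narrative}"
```

Because `compress` only ever touches `narrative`, a summary that paraphrases or drops a number cannot remove it from the rendered window.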

Key Takeaway

Compress in order of safety: budget each section, summarize old history, trim documents, condense tool output, and protect critical facts throughout. The goal is to shrink the low-value parts hard while carrying the high-value facts forward exactly, so the window stays lean without losing what a later step needs.

When to Compress, and When Not To

Compression has two costs: the tokens and latency of running a summarization call, and the risk of dropping a detail. It should therefore be triggered, not constant. Good triggers are crossing a window-size threshold, completing a sub-task, or transitioning between phases of a long interaction. Below the threshold, leaving content verbatim is safer and cheaper. The aim is to keep the window in a healthy range, not to compress for its own sake.
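The three triggers can be expressed as a single predicate. The function name, the parameter names, and the 0.75 default threshold are assumptions for illustration; the right threshold depends on your model's window and how much headroom a response needs.

```python
def should_compress(
    used_tokens: int,
    window_limit: int,
    subtask_done: bool = False,
    phase_changed: bool = False,
    threshold: float = 0.75,
) -> bool:
    """Return True when any trigger fires: size threshold, finished sub-task, or new phase.

    `threshold` is the fraction of the window at which compression starts
    (a hypothetical default, not a recommended value).
    """
    return subtask_done or phase_changed or used_tokens >= threshold * window_limit
```

With an 8,000-token window and the default threshold, a 5,000-token window is left alone, while a 6,500-token window or a just-completed sub-task triggers a compression pass.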

For information that needs to persist across sessions rather than just within a long conversation, compression is the wrong tool and memory is the right one. Summarizing a conversation keeps its gist available within the current session, but a memory layer extracts the durable facts and stores them so a future session can recall exactly the relevant ones, which is cleaner than carrying an ever-growing summary forward forever. The relationship between in-session compression and cross-session memory is covered in whether memory is part of context engineering.

Lossless Compression Before Lossy

Not all compression carries the risk of dropping information, and the safe practice is to exhaust the lossless techniques before reaching for the lossy ones. Lossless compression removes tokens that carry no information: duplicated content, boilerplate, redundant formatting, verbose tool output that can be reduced to its values, and repeated restatements of the same fact across turns. None of this loses anything the model needed, so it can be applied freely and aggressively, and in many bloated windows it alone reclaims a large fraction of the budget. Doing it first means you may not need lossy compression at all, or you need far less of it.
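A lossless pass over text content might be sketched as follows, assuming you maintain a set of known boilerplate lines. The function name and the line-level granularity are illustrative assumptions; treating exact repeated lines as redundant is safe for things like repeated headers but would need more care for content where repetition is meaningful.

```python
import re


def lossless_pass(text: str, boilerplate: set[str]) -> str:
    """Remove blank lines, known boilerplate lines, and exact repeated lines.

    Whitespace runs inside each line are collapsed to a single space before
    comparison, so lines that differ only in spacing count as duplicates.
    """
    seen = set()
    kept = []
    for line in text.splitlines():
        normalized = re.sub(r"\s+", " ", line).strip()
        if not normalized or normalized in boilerplate or normalized in seen:
            continue
        seen.add(normalized)
        kept.append(normalized)
    return "\n".join(kept)
```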

Lossy compression, primarily summarization, is where judgment and risk enter, because condensing a narrative necessarily chooses what to keep and what to drop. Reserve it for content where the gist genuinely suffices, such as old conversational back-and-forth or the narrative of how a completed sub-task was done, and protect the concrete facts within that content from the summary. Ordering the work lossless first and lossy second minimizes the amount of risky compression you do and confines it to the places where it is safest. That discipline is what lets a system compress hard without a summary quietly losing the detail a later step needed.
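The ordering itself, lossless first and lossy only if the result is still over budget, can be sketched as a small wrapper. Every name here is a hypothetical placeholder: `lossless`, `lossy`, and `count_tokens` are callables you would supply, with `lossy` typically being a summarization call.

```python
from typing import Callable


def compress_in_order(
    text: str,
    budget: int,
    count_tokens: Callable[[str], int],
    lossless: Callable[[str], str],
    lossy: Callable[[str], str],
) -> tuple[str, bool]:
    """Apply the lossless step, then the lossy step only if still over budget.

    Returns the compressed text and a flag saying whether lossy compression ran.
    """
    cleaned = lossless(text)
    if count_tokens(cleaned) <= budget:
        return cleaned, False  # lossless alone was enough
    return lossy(cleaned), True
```

Returning the flag makes it easy to log how often the risky path runs, which is a useful signal when tuning budgets.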

Verifying Compression Did Not Hurt

Because lossy compression can silently drop something important, a mature pipeline checks rather than assumes. The simplest verification is to evaluate answer quality with and without a compression change on a fixed set of representative requests, including ones that depend on details from older context, so a regression from over-aggressive summarization shows up as a measurable drop rather than as a surprise in production. If quality holds while tokens fall, the compression is safe, and if it drops, you know to preserve more before compressing. This ties compression to the measurement discipline covered in the LLM evaluation pillar, and it is what separates compression you can trust from compression you hope is working.
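A with-and-without comparison could be sketched like this. The case format (dicts with `context`, `question`, and `expected` keys), the substring-match scoring, and the word-count token estimate are all simplifying assumptions; `answer` stands in for a call to your model, and real evaluations usually need a more forgiving scorer.

```python
from typing import Callable


def evaluate(
    cases: list[dict],
    compress: Callable[[str], str],
    answer: Callable[[str, str], str],
) -> dict:
    """Run every case through `compress` then `answer`; report accuracy and size.

    A case counts as correct when its `expected` string appears in the answer.
    Token size is approximated by whitespace-separated word count.
    """
    if not cases:
        raise ValueError("need at least one evaluation case")
    correct = 0
    tokens = 0
    for case in cases:
        context = compress(case["context"])
        tokens += len(context.split())
        if case["expected"] in answer(context, case["question"]):
            correct += 1
    return {"accuracy": correct / len(cases), "avg_tokens": tokens / len(cases)}
```

Run it once with an identity function as `compress` to get the baseline and once with the candidate compression; a drop in `accuracy` alongside the fall in `avg_tokens` is the regression signal described above.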

A practical refinement is to keep the original alongside the compressed version when storage allows, so a later step that finds the summary insufficient can fall back to the full content rather than failing. This hybrid pattern, a compact summary in the active window with the full text retained in cheap external storage, gives you the token savings of compression without permanently committing to the lossy version. An agent that summarized a document into the window can re-fetch the original if a later question needs a detail the summary dropped. This is compression and the write strategy working together: you write the full content out, keep only the compressed form in context, and selectively pull the original back when the summary proves too thin.

Designed this way, compression stops being a one-way loss of information and becomes a reversible trade between window space and retrieval cost, which is the safest form it can take in a long-running system. The cost is the extra storage and the occasional re-fetch. Both are cheap relative to the alternatives of bloating the window with full content or losing a needed detail to an irreversible summary, which is why retaining the original is the default worth adopting wherever your storage budget allows it.
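The fallback pattern can be illustrated with a small store that hands back a summary for the window while retaining the full text. `OriginalStore` and its methods are hypothetical names, and the in-memory dict is only a stand-in for whatever external storage (a file, database, or object store) you would actually use.

```python
from typing import Callable


class OriginalStore:
    """Keeps full documents out of the window while summaries go in."""

    def __init__(self) -> None:
        # In-memory stand-in for cheap external storage.
        self._originals: dict[str, str] = {}

    def add(self, doc_id: str, full_text: str, summarize: Callable[[str], str]) -> str:
        """Retain the original under `doc_id` and return its summary for the window."""
        self._originals[doc_id] = full_text
        return summarize(full_text)

    def fetch_original(self, doc_id: str) -> str:
        """Return the full text when the summary turns out to be too thin."""
        return self._originals[doc_id]
```

Keeping the `doc_id` next to the summary in the window is what lets a later step ask for the original by name.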