
Summarization vs Compression: Which Is Better?

Summarization rewrites content into a shorter version that captures the key points. Compression removes redundancy while preserving the original wording. Summarization achieves higher reduction ratios (70 to 90%) but loses specific details unpredictably. Compression preserves details exactly but achieves lower ratios (20 to 40%). The right choice depends on whether your application needs exact details or narrative coherence.

How Summarization Works

Summarization uses an LLM to read the full text and produce a shorter version that captures the essential information. The output is new text written by the model, not a selection of original sentences. This means summarization can consolidate information from multiple paragraphs into a single statement, rephrase complex ideas more concisely, and omit details that the model judges to be less important.

The compression ratio is high because summarization can eliminate entire topics, merge related points, and express ideas at a higher level of abstraction. A 10-paragraph discussion about database migration options might summarize to "The team decided to use PostgreSQL with a staged migration over three sprints." That single sentence captures the decision while eliminating the reasoning, alternatives considered, and implementation details.

The danger is that summarization is lossy in unpredictable ways. The model decides what to keep and what to discard, and its judgment may not match what future queries need. If a later question asks "why did we reject MySQL?" the answer existed in the original text but was discarded by the summarizer. There is no way to recover it from the summary alone.

Abstractive vs Extractive Summarization

Abstractive summarization generates new text that paraphrases the original. This is what LLMs naturally produce when asked to summarize. It achieves the highest compression ratios but introduces the most risk of information loss and potential inaccuracies. The model might subtly change the meaning while paraphrasing, introducing factual errors that are difficult to detect.

Extractive summarization selects the most important sentences from the original text and concatenates them. No new text is generated. This preserves exact wording and eliminates the risk of paraphrasing errors, but achieves lower compression ratios (typically 30 to 50%) because it cannot merge or rephrase.
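A minimal extractive summarizer can be sketched without an LLM at all: score each sentence by the average document-wide frequency of its words, keep the top-scoring sentences, and re-emit them in their original order. The scoring heuristic below is illustrative, not a production algorithm; real systems use stronger signals such as TF-IDF or embeddings.

```python
import re
from collections import Counter

def extractive_summary(text: str, keep_ratio: float = 0.5) -> str:
    """Select the top-scoring sentences verbatim; no new text is generated."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"\w+", text.lower()))

    def score(sentence: str) -> float:
        # Average word frequency, so long sentences are not favored.
        tokens = re.findall(r"\w+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    keep = max(1, round(len(sentences) * keep_ratio))
    top = sorted(sentences, key=score, reverse=True)[:keep]
    # Re-emit kept sentences in original order to preserve narrative flow.
    return " ".join(s for s in sentences if s in top)
```

Because the output is a subset of the original sentences, wording is preserved exactly; the trade-off is that nothing can be merged or rephrased, which caps the compression ratio.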

How Compression Works

Compression reduces token count without rewriting the content. Several techniques operate at different levels:

Syntactic compression removes filler words, redundant modifiers, and verbose phrasing. "In order to successfully complete the process of migrating the database" becomes "To migrate the database." The information is identical; the expression is more compact. Typical reduction: 15 to 25%.
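Syntactic compression can be implemented as a list of rewrite rules applied in sequence. The rules below are a small illustrative sample (the phrases chosen are assumptions, not a standard rule set); a production system would maintain a much larger, carefully tested list.

```python
import re

# Illustrative filler-phrase rewrites; a real rule set would be far larger.
RULES = [
    (r"\bin order to\b", "to"),
    (r"\bthe process of\b", ""),
    (r"\bsuccessfully\s+", ""),        # redundant modifier
    (r"\bat this point in time\b", "now"),
    (r"\s{2,}", " "),                  # collapse whitespace left by removals
]

def syntactic_compress(text: str) -> str:
    for pattern, repl in RULES:
        text = re.sub(pattern, repl, text, flags=re.IGNORECASE)
    text = text.strip()
    # Restore sentence-initial capitalization lost by a leading rewrite.
    return text[:1].upper() + text[1:] if text else text
```

Since the rules only delete or shorten fixed phrases, the semantic content is untouched; the reduction comes entirely from more compact expression.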

Deduplication identifies sentences or passages that repeat the same information and keeps only one instance. In retrieved context from RAG, where multiple documents may describe the same concept, deduplication can remove 20 to 40% of the content. The key challenge is setting the similarity threshold correctly so that near-duplicates are caught without accidentally merging related but distinct statements.
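One simple way to sketch deduplication is with token-set Jaccard similarity and a tunable threshold: the first occurrence of each near-duplicate is kept and later ones are dropped. The 0.8 default below is an assumption for illustration; the paragraph above explains why this threshold is the part that needs careful tuning.

```python
import re

def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def deduplicate(sentences: list, threshold: float = 0.8) -> list:
    """Keep the first occurrence of each near-duplicate sentence."""
    kept, kept_tokens = [], []
    for s in sentences:
        tokens = set(re.findall(r"\w+", s.lower()))
        # Drop the sentence if it is too similar to anything already kept.
        if all(jaccard(tokens, t) < threshold for t in kept_tokens):
            kept.append(s)
            kept_tokens.append(tokens)
    return kept
```

Set the threshold too low and related-but-distinct statements collapse into one; set it too high and trivial rewordings slip through as duplicates.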

Semantic pruning uses embedding similarity between each sentence and the query to identify and remove sentences that contribute little to answering the current question. This is query-dependent compression, meaning the same document is compressed differently for different queries. A document about database performance would retain connection pooling details for a query about pool sizing but prune them for a query about query optimization. Typical reduction: 30 to 50%.
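The query-dependent filtering described above can be sketched as follows. A bag-of-words cosine similarity stands in for real embedding similarity here so the example stays self-contained; in practice you would embed each sentence and the query with an embedding model and compare those vectors instead.

```python
import math
import re
from collections import Counter

def bow_vector(text: str) -> Counter:
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def semantic_prune(sentences: list, query: str, threshold: float = 0.2) -> list:
    """Drop sentences whose similarity to the query falls below threshold.
    Bag-of-words cosine is a stand-in for embedding similarity."""
    qv = bow_vector(query)
    return [s for s in sentences if cosine(bow_vector(s), qv) >= threshold]
```

Because the filter depends on the query, the same sentence list produces different compressed output for different questions, exactly as described above.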

Side-by-Side Comparison

Dimension             | Summarization                                | Compression
----------------------|----------------------------------------------|------------------------------------------------
Compression ratio     | 70-90%                                       | 20-40%
Detail preservation   | Low (model decides what to keep)             | High (original wording preserved)
Factual accuracy risk | Moderate (paraphrasing can introduce errors) | None (no text is generated)
Computational cost    | High (requires LLM call)                     | Low (heuristic or embedding-based)
Latency               | 300-2000 ms (LLM inference)                  | 5-50 ms (local computation)
Best for              | Conversation history, narrative content      | Technical docs, code context, retrieved results

When to Use Each

Use Summarization When

Conversation history is the ideal use case for summarization. A 20-turn discussion about API design decisions can be summarized to "The team agreed to use REST with pagination, JWT authentication, and rate limiting at 100 requests per minute." Future queries about the API design get the decision without the deliberation.
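A common pattern is a rolling compactor: older turns are collapsed into one synthetic summary turn while the most recent turns stay verbatim. In the sketch below, `summarize_fn` is a placeholder for an LLM summarization call (injected so the structure is testable without a model); the function name and turn format are assumptions for illustration.

```python
def compact_history(turns: list, summarize_fn, keep_recent: int = 4) -> list:
    """Collapse older (role, content) turns into one summary turn.
    `summarize_fn` stands in for an LLM summarization call."""
    if len(turns) <= keep_recent:
        return list(turns)
    older, recent = turns[:-keep_recent], turns[-keep_recent:]
    summary = summarize_fn("\n".join(f"{role}: {content}" for role, content in older))
    # Recent turns are kept verbatim; only the older tail is lossy.
    return [("system", f"Summary of earlier conversation: {summary}")] + list(recent)
```

Run after every few turns, this keeps the window bounded: the conversation's decisions survive in the summary turn while the deliberation is discarded, matching the trade-off described above.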

Use Compression When

Retrieved context from RAG is the ideal use case for compression. When you retrieve 5 documents about the same topic, deduplication removes the overlapping content while preserving the unique details from each. Semantic pruning further reduces the content to only the sentences relevant to the current query. No information is rewritten, so technical accuracy is maintained.

Use Both When

Many production applications use both techniques for different parts of the context: summarization for conversation history (where decisions matter more than details), compression for retrieved context (where details matter), and neither for the system prompt (which should be optimized once at design time rather than compressed at runtime).

The External Memory Approach

Both summarization and compression are strategies for fitting more information into a fixed context window. External memory eliminates the need for most of this compression by storing persistent knowledge outside the window entirely. When knowledge lives in a memory system, only the specific memories relevant to the current query enter the context. There is no accumulated history to summarize and no redundant retrieval results to compress.

Adaptive Recall's cognitive scoring ensures that only the most relevant memories are retrieved for each query. The retrieval is already selective, so the context it produces is already compact and focused. The remaining compression needs (if any) are minimal and can be handled with simple deduplication.

Skip the compression pipeline. Adaptive Recall retrieves only what matters for each query, keeping context lean without lossy compression.

Get Started Free