Dynamic Pruning Strategies for Token Management
What Dynamic Pruning Is
Static compression applies the same reduction to context regardless of what the user is asking. It removes filler words, deduplicates sentences, or summarizes old conversation turns using fixed rules. The result is a compressed context that is smaller but still generic, containing information relevant to many possible queries but optimized for none.
Dynamic pruning is different. It evaluates each piece of context against the current query and keeps only the content that is relevant to this specific question. The same document might be pruned to 30% of its length for one query and to 70% for another, depending on which sections are relevant. This query-dependent approach produces a context that is not just smaller but more focused, giving the model exactly the information it needs without distraction.
Pruning Techniques
Sentence-Level Relevance Filtering
The simplest form of dynamic pruning embeds both the query and each sentence in the context, computes similarity scores, and removes sentences below a relevance threshold. This works well for retrieved documents where only portions are relevant to the query.
```python
from sentence_transformers import SentenceTransformer
import numpy as np
import re

model = SentenceTransformer('all-MiniLM-L6-v2')

def prune_by_relevance(text, query, threshold=0.3):
    # Naive sentence splitting; swap in a proper sentence tokenizer
    # (e.g., nltk or spaCy) for production text.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    query_emb = model.encode([query])[0]
    sent_embs = model.encode(sentences)
    # Cosine similarity between the query and every sentence at once.
    sims = sent_embs @ query_emb / (
        np.linalg.norm(sent_embs, axis=1) * np.linalg.norm(query_emb)
    )
    # Keep sentences at or above the threshold, in original order.
    return ' '.join(s for s, sim in zip(sentences, sims) if sim >= threshold)
```

The threshold controls the aggressiveness of pruning. A threshold of 0.2 keeps most content and removes only clearly irrelevant sentences. A threshold of 0.4 is aggressive, keeping only sentences with strong relevance to the query. Start with 0.3 and adjust based on your evaluation results.
Section-Level Pruning
For structured documents with headers and sections, pruning at the section level is faster and preserves the document's logical structure. Compute relevance for each section and include or exclude entire sections based on their score. This avoids the problem of keeping isolated sentences that lack context because they are surrounded by pruned content.
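A minimal sketch of the idea, reusing the embedding model from the previous example and assuming sections are delimited by markdown `## ` headers (the split pattern is an assumption; adapt it to your document format):

```python
import re

def prune_by_section(document, query, threshold=0.3):
    # Split at markdown headers, keeping each header with its body.
    sections = [s for s in re.split(r'(?m)^(?=## )', document) if s.strip()]
    query_emb = model.encode([query])[0]
    section_embs = model.encode(sections)
    # Cosine similarity between the query and each whole section.
    sims = section_embs @ query_emb / (
        np.linalg.norm(section_embs, axis=1) * np.linalg.norm(query_emb)
    )
    # Include or exclude entire sections, preserving document order.
    return ''.join(s for s, sim in zip(sections, sims) if sim >= threshold)
```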
Conversation Turn Pruning
In multi-turn conversations, not every previous turn is relevant to the current query. Dynamic pruning of conversation history embeds each turn against the current query and keeps only the turns that are relevant, regardless of their position in the conversation. This is more effective than a sliding window because it preserves relevant old turns while discarding irrelevant recent turns.
```python
def count_tokens(text):
    # Rough whitespace approximation; use your model's tokenizer
    # (e.g., tiktoken) for accurate counts.
    return len(text.split())

def prune_history(messages, current_query, max_tokens=8000):
    query_emb = model.encode([current_query])[0]
    scored = []
    for i, msg in enumerate(messages):
        msg_emb = model.encode([msg["content"]])[0]
        relevance = np.dot(query_emb, msg_emb) / (
            np.linalg.norm(query_emb) * np.linalg.norm(msg_emb)
        )
        # Small boost so more recent turns win ties.
        recency_boost = 0.1 * (i / len(messages))
        scored.append((i, msg, relevance + recency_boost))
    # Greedily keep the highest-scoring turns that fit the budget.
    scored.sort(key=lambda item: -item[2])
    kept = []
    total_tokens = 0
    for i, msg, score in scored:
        msg_tokens = count_tokens(msg["content"])
        if total_tokens + msg_tokens <= max_tokens:
            kept.append((i, msg))
            total_tokens += msg_tokens
    # Restore original conversation order before returning.
    kept.sort(key=lambda item: item[0])
    return [msg for _, msg in kept]
```

Attention-Based Pruning
More advanced approaches use the model's own attention patterns to identify which tokens it finds important. After a forward pass with the full context, the attention weights indicate which tokens the model attended to most. Tokens with consistently low attention across all heads can be pruned from subsequent calls. This is computationally expensive (it requires an extra forward pass) and needs white-box access to attention weights, which hosted model APIs generally do not expose, but it produces highly accurate pruning because it uses the model's own judgment about what matters.
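A minimal sketch of the signal, using Hugging Face transformers with GPT-2 as a stand-in; the model choice, `keep_ratio` parameter, and token-level decoding are all illustrative assumptions rather than a production recipe:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModel.from_pretrained("gpt2")

def prune_by_attention(text, keep_ratio=0.5):
    inputs = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = lm(**inputs, output_attentions=True)
    # out.attentions is one (batch, heads, seq, seq) tensor per layer.
    # Average the attention each token *receives* across layers, heads,
    # and query positions to get a per-token importance score.
    stacked = torch.stack(out.attentions)        # (layers, batch, heads, seq, seq)
    importance = stacked.mean(dim=(0, 2, 3))[0]  # (seq,)
    k = max(1, int(importance.numel() * keep_ratio))
    keep = importance.topk(k).indices.sort().values  # preserve token order
    return tok.decode(inputs["input_ids"][0][keep])
```

Decoding a subset of tokens produces disjointed text, so a real system would aggregate these per-token scores over sentence spans and drop whole sentences; the sketch just shows where the signal comes from.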
Pruning vs Other Reduction Techniques
| Technique | Query-Dependent? | Reduction Ratio | Latency Cost | Best For |
|---|---|---|---|---|
| Syntactic compression | No | 15-25% | Negligible | System prompts |
| Summarization | No | 70-90% | 1-3 seconds | Conversation history |
| Deduplication | No | 20-40% | 10-50ms | Retrieved documents |
| Dynamic pruning | Yes | 40-60% | 20-100ms | All context types |
Dynamic pruning can be combined with other techniques. Apply syntactic compression to the system prompt (once, at design time), dynamic pruning to retrieved context and conversation history (per query), and summarization to very old conversation history (when pruning alone is not enough to fit the budget).
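A rough sketch of how the layers might compose, reusing the functions defined above (the budget split between documents and history is an arbitrary assumption):

```python
def assemble_context(system_prompt, docs, history, query, budget=8000):
    # Layer 1: the system prompt is assumed to be syntactically
    # compressed once, at design time, so it is used as-is here.
    parts = [system_prompt]
    # Layer 2: dynamic pruning, per query, for retrieved documents
    # and conversation history.
    parts += [prune_by_relevance(doc, query) for doc in docs]
    turns = prune_history(history, query, max_tokens=budget // 2)
    parts += [msg["content"] for msg in turns]
    context = "\n\n".join(parts)
    # Layer 3: if pruning alone still exceeds the budget, this is the
    # point where a summarization pass over the oldest turns would run.
    return context
```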
When Dynamic Pruning Falls Short
Dynamic pruning requires the query to be known before context is assembled. In applications where the system needs to anticipate multiple possible queries (proactive agents, multi-step reasoning chains), the pruning decision cannot be made upfront because the relevant context depends on decisions the model has not yet made.
For these cases, external memory systems are more effective because they provide on-demand retrieval. Instead of loading and pruning a large context once, the model can call a retrieval tool multiple times during its reasoning process, getting fresh, query-specific context at each step. Adaptive Recall's MCP integration enables exactly this pattern, where the model calls the recall tool with different queries as its reasoning evolves, getting precisely the right context at each step.
The practical answer is dynamic pruning per query, or better yet, on-demand retrieval that loads only what each step needs. Adaptive Recall supports both patterns.