Dynamic Pruning Strategies for Token Management
What Dynamic Pruning Is
Static compression applies the same reduction to context regardless of what the user is asking. It removes filler words, deduplicates sentences, or summarizes old conversation turns using fixed rules. The result is a compressed context that is smaller but still generic, containing information relevant to many possible queries but optimized for none.
Dynamic pruning is different. It evaluates each piece of context against the current query and keeps only the content that is relevant to this specific question. The same document might be pruned to 30% of its length for one query and to 70% for another, depending on which sections are relevant. This query-dependent approach produces a context that is not just smaller but more focused, giving the model exactly the information it needs without distraction.
Pruning Techniques
Sentence-Level Relevance Filtering
The simplest form of dynamic pruning embeds both the query and each sentence in the context, computes similarity scores, and removes sentences below a relevance threshold. This works well for retrieved documents where only portions are relevant to the query.
```python
from sentence_transformers import SentenceTransformer
import numpy as np
import re

model = SentenceTransformer('all-MiniLM-L6-v2')

def prune_by_relevance(text, query, threshold=0.3):
    # Naive sentence splitting; swap in a proper sentence tokenizer
    # (e.g., nltk or spaCy) for production text.
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    query_emb = model.encode([query])[0]
    sent_embs = model.encode(sentences)
    # Cosine similarity between the query and every sentence at once.
    sims = sent_embs @ query_emb / (
        np.linalg.norm(sent_embs, axis=1) * np.linalg.norm(query_emb)
    )
    # Keep sentences at or above the threshold, in original order.
    return ' '.join(s for s, sim in zip(sentences, sims) if sim >= threshold)
```

The threshold controls the aggressiveness of pruning. A threshold of 0.2 keeps most content and removes only clearly irrelevant sentences. A threshold of 0.4 is aggressive, keeping only sentences with strong relevance to the query. Start with 0.3 and adjust based on your evaluation results.
Section-Level Pruning
For structured documents with headers and sections, pruning at the section level is faster and preserves the document's logical structure. Compute relevance for each section and include or exclude entire sections based on their score. This avoids the problem of keeping isolated sentences that lack context because they are surrounded by pruned content.
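A minimal sketch of the idea, reusing the embedding model from the previous example and assuming sections are delimited by markdown `## ` headers (the split pattern is an assumption; adapt it to your document format):

```python
import re

def prune_by_section(document, query, threshold=0.3):
    # Split at markdown headers, keeping each header with its body.
    sections = [s for s in re.split(r'(?m)^(?=## )', document) if s.strip()]
    query_emb = model.encode([query])[0]
    section_embs = model.encode(sections)
    # Cosine similarity between the query and each whole section.
    sims = section_embs @ query_emb / (
        np.linalg.norm(section_embs, axis=1) * np.linalg.norm(query_emb)
    )
    # Include or exclude entire sections, preserving document order.
    return ''.join(s for s, sim in zip(sections, sims) if sim >= threshold)
```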
Conversation Turn Pruning
In multi-turn conversations, not every previous turn is relevant to the current query. Dynamic pruning of conversation history embeds each turn against the current query and keeps only the turns that are relevant, regardless of their position in the conversation. This is more effective than a sliding window because it preserves relevant old turns while discarding irrelevant recent turns.
```python
def count_tokens(text):
    # Rough whitespace approximation; use your model's tokenizer
    # (e.g., tiktoken) for accurate counts.
    return len(text.split())

def prune_history(messages, current_query, max_tokens=8000):
    query_emb = model.encode([current_query])[0]
    scored = []
    for i, msg in enumerate(messages):
        msg_emb = model.encode([msg["content"]])[0]
        relevance = np.dot(query_emb, msg_emb) / (
            np.linalg.norm(query_emb) * np.linalg.norm(msg_emb)
        )
        # Small boost so more recent turns win ties.
        recency_boost = 0.1 * (i / len(messages))
        scored.append((i, msg, relevance + recency_boost))
    # Greedily keep the highest-scoring turns that fit the budget.
    scored.sort(key=lambda item: -item[2])
    kept = []
    total_tokens = 0
    for i, msg, score in scored:
        msg_tokens = count_tokens(msg["content"])
        if total_tokens + msg_tokens <= max_tokens:
            kept.append((i, msg))
            total_tokens += msg_tokens
    # Restore original conversation order before returning.
    kept.sort(key=lambda item: item[0])
    return [msg for _, msg in kept]
```

Attention-Based Pruning
More advanced approaches use the model's own attention patterns to identify which tokens it finds important. After a forward pass with the full context, the attention weights indicate which tokens the model attended to most. Tokens with consistently low attention across all heads can be pruned from subsequent calls. This is computationally expensive (it requires an extra forward pass) and needs white-box access to attention weights, which hosted model APIs generally do not expose, but it produces highly accurate pruning because it uses the model's own judgment about what matters.
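A minimal sketch of the signal, using Hugging Face transformers with GPT-2 as a stand-in; the model choice, `keep_ratio` parameter, and token-level decoding are all illustrative assumptions rather than a production recipe:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModel.from_pretrained("gpt2")

def prune_by_attention(text, keep_ratio=0.5):
    inputs = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = lm(**inputs, output_attentions=True)
    # out.attentions is one (batch, heads, seq, seq) tensor per layer.
    # Average the attention each token *receives* across layers, heads,
    # and query positions to get a per-token importance score.
    stacked = torch.stack(out.attentions)        # (layers, batch, heads, seq, seq)
    importance = stacked.mean(dim=(0, 2, 3))[0]  # (seq,)
    k = max(1, int(importance.numel() * keep_ratio))
    keep = importance.topk(k).indices.sort().values  # preserve token order
    return tok.decode(inputs["input_ids"][0][keep])
```

Decoding a subset of tokens produces disjointed text, so a real system would aggregate these per-token scores over sentence spans and drop whole sentences; the sketch just shows where the signal comes from.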
Pruning vs Other Reduction Techniques
| Technique | Query-Dependent? | Reduction Ratio | Latency Cost | Best For |
|---|---|---|---|---|
| Syntactic compression | No | 15-25% | Negligible | System prompts |
| Summarization | No | 70-90% | 1-3 seconds | Conversation history |
| Deduplication | No | 20-40% | 10-50ms | Retrieved documents |
| Dynamic pruning | Yes | 40-60% | 20-100ms | All context types |
Dynamic pruning can be combined with other techniques. Apply syntactic compression to the system prompt (once, at design time), dynamic pruning to retrieved context and conversation history (per query), and summarization to very old conversation history (when pruning alone is not enough to fit the budget).
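A rough sketch of how the layers might compose, reusing the functions defined above (the budget split between documents and history is an arbitrary assumption):

```python
def assemble_context(system_prompt, docs, history, query, budget=8000):
    # Layer 1: the system prompt is assumed to be syntactically
    # compressed once, at design time, so it is used as-is here.
    parts = [system_prompt]
    # Layer 2: dynamic pruning, per query, for retrieved documents
    # and conversation history.
    parts += [prune_by_relevance(doc, query) for doc in docs]
    turns = prune_history(history, query, max_tokens=budget // 2)
    parts += [msg["content"] for msg in turns]
    context = "\n\n".join(parts)
    # Layer 3: if pruning alone still exceeds the budget, this is the
    # point where a summarization pass over the oldest turns would run.
    return context
```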
When Dynamic Pruning Falls Short
Dynamic pruning requires the query to be known before context is assembled. In applications where the system needs to anticipate multiple possible queries (proactive agents, multi-step reasoning chains), the pruning decision cannot be made upfront because the relevant context depends on decisions the model has not yet made.
For these cases, external memory systems are more effective because they provide on-demand retrieval. Instead of loading and pruning a large context once, the model can call a retrieval tool multiple times during its reasoning process, getting fresh, query-specific context at each step. Adaptive Recall's MCP integration enables exactly this pattern, where the model calls the recall tool with different queries as its reasoning evolves, getting precisely the right context at each step.
The practical answer is dynamic pruning per query, or better yet, on-demand retrieval that loads only what each step needs. Adaptive Recall supports both patterns.