Can You Extend a Context Window After the Fact?
Why Context Windows Are Fixed
The context window is determined by the positional encoding system the model learned during training. Transformers use positional encodings to understand the order and position of tokens. These encodings are defined for a fixed range (e.g., positions 0 through 127,999 for a 128k model). The model has never seen tokens beyond this range during training, so it cannot process them at inference time. Attempting to extend the window by simply passing more tokens results in either an error or, if somehow forced, meaningless output because the model has no positional information for those positions.
Research techniques such as YaRN and ALiBi allow limited context extension beyond the training window, but they require changes to the model itself (its position-encoding scheme and, typically, additional fine-tuning) and are applied by model providers before deployment, not by API users at runtime. As an API consumer, you work within the provider's published context limit.
Alternatives That Simulate a Larger Window
External Memory
The most effective alternative to a larger context window is external memory. Instead of holding all knowledge in the context, store it in a memory system and retrieve the relevant subset for each query. An application with 100,000 stored memories and a 16k-token context window has far more knowledge capacity than an application with no external memory and a 1M-token context window, because the memory system can store and retrieve from an effectively unlimited knowledge base.
External memory also solves the attention quality problem. A 16k-token context with 5 highly relevant memories focuses the model's attention on exactly the right information. A 1M-token context with 100,000 memories buried in it dilutes the model's attention across mostly irrelevant content.
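To make the pattern concrete, here is a minimal sketch assuming a simple keyword-overlap retriever. The function names (retrieve_memories, build_prompt) are illustrative, and a production system would typically use embeddings and a vector store instead of word matching.
def retrieve_memories(query, memories, top_k=5):
    """Return the top_k stored memories most relevant to the query."""
    query_words = set(query.lower().split())
    scored = [(len(query_words & set(m.lower().split())), m) for m in memories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for score, m in scored[:top_k] if score > 0]

def build_prompt(query, memories):
    """Place only the retrieved subset of memories into the context window."""
    relevant = retrieve_memories(query, memories)
    lines = "\n".join(f"- {m}" for m in relevant)
    return f"Relevant memories:\n{lines}\n\nQuestion: {query}"
The point is the shape of the flow: the full memory store can grow without bound, while the prompt only ever carries the handful of entries that matter for the current query.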
Model Routing
When most requests fit in a smaller model's window but occasional requests need more capacity, route dynamically. Use a 128k model for routine queries and automatically switch to a 200k or 1M model for requests that exceed the smaller window. This gives you the cost benefit of the smaller model for most calls and the capacity of the larger model when needed.
def route_to_model(messages, default="claude-haiku-4-5"):
    total_tokens = count_message_tokens(messages)
    if total_tokens < 50000:
        return default  # fast, cheap
    elif total_tokens < 180000:
        return "claude-sonnet-4-6"  # 200k window
    else:
        return "gemini-1.5-pro"  # 1M window
Sliding Window with Persistent Storage
For conversations that would eventually exceed any context window, implement a sliding window that keeps recent messages in context and stores older messages in a persistent database. When the current query references something from earlier in the conversation, retrieve the relevant older messages from storage. This simulates an infinite conversation history within a fixed context window.
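A minimal sketch of this pattern follows, assuming an in-memory archive and keyword-based recall. The helper names (archive_old_messages, recall_related, build_context) are illustrative; a real implementation would persist older messages to a database and budget by token count rather than message count.
RECENT_LIMIT = 20  # messages kept verbatim in the context window

def archive_old_messages(messages, archive):
    """Move everything except the most recent messages into the archive."""
    if len(messages) > RECENT_LIMIT:
        archive.extend(messages[:-RECENT_LIMIT])
        messages = messages[-RECENT_LIMIT:]
    return messages

def recall_related(query, archive, top_k=3):
    """Pull archived messages that share terms with the current query."""
    terms = set(query.lower().split())
    hits = [m for m in archive if terms & set(m["content"].lower().split())]
    return hits[:top_k]

def build_context(query, messages, archive):
    """Recent messages plus any older ones the query appears to reference."""
    messages = archive_old_messages(messages, archive)
    return recall_related(query, archive) + messages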
Multi-Pass Processing
For documents that exceed the context window, process them in multiple passes. Split the document into chunks that fit in the window, process each chunk independently, and combine the results. This is the map-reduce pattern, and it works well for tasks like extraction, summarization, and analysis where each chunk can be processed independently.
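Here is a minimal map-reduce sketch for summarization. call_model stands in for whatever API client you use, and the chunk size is an illustrative word budget rather than an exact token limit.
def call_model(prompt):
    """Placeholder for a real API call that returns the model's text response."""
    raise NotImplementedError

def chunk_document(text, chunk_words=3000):
    """Split the document into pieces small enough to fit in the window."""
    words = text.split()
    return [" ".join(words[i:i + chunk_words]) for i in range(0, len(words), chunk_words)]

def summarize_long_document(text):
    # Map: summarize each chunk independently within the context window.
    partials = [call_model(f"Summarize this section:\n\n{chunk}") for chunk in chunk_document(text)]
    # Reduce: combine the partial summaries into one final result.
    combined = "\n\n".join(partials)
    return call_model(f"Combine these section summaries into one summary:\n\n{combined}")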
What Does Not Work
- Passing more tokens than the limit: The API rejects the request with an error.
- Setting max_tokens higher than the model supports: max_tokens controls output length, not the context window. The output budget cannot exceed the model's limit minus the input length.
- Using a different tokenizer: The tokenizer is fixed for each model. You cannot change how the model tokenizes text to fit more content into the same number of tokens.
- Prompt engineering to "compress" in-context: Telling the model to "treat the following as compressed text" does not change the token count. Every token still consumes one position in the context window regardless of the instruction.
You do not need a bigger window. Adaptive Recall gives any model access to unlimited persistent memory, retrieved on demand.
Try It Free