What Is a Context Window and Why It Matters

A context window is the maximum number of tokens a large language model can process in a single request. It includes everything the model reads (system instructions, conversation history, retrieved documents) and everything it writes (the response). Understanding context windows is essential because they are the hard constraint that determines what your AI application can and cannot do in a single interaction.

Tokens: The Unit of Context

LLMs do not process text as characters or words. They process tokens, which are subword units that the model learned during training. The tokenizer splits text into these units before the model sees it. In English, one token averages about 3 to 4 characters, or roughly 0.75 words. The sentence "How do I reset my password?" is about 8 tokens. A paragraph of 100 words is roughly 130 tokens. A full page of text is approximately 500 to 700 tokens.

Tokenization is not uniform across languages or content types. Chinese, Japanese, and Korean text typically uses more tokens per character than English. Code tokenizes differently than prose because variable names, syntax characters, and indentation all consume tokens. JSON and XML are particularly token-expensive because of their verbose syntax. A JSON object that a human reads as 50 words might tokenize to 200 or more tokens because of all the braces, quotes, and colons.

This variability matters because developers often estimate context usage in words and then discover in production that their actual token usage is 30 to 50% higher than expected. Always use the tokenizer that matches your model to count tokens precisely rather than relying on word-count estimates.
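
For a concrete check, here is a minimal sketch using OpenAI's tiktoken library; other model families ship their own tokenizers, so substitute the one that matches your model:

```python
# pip install tiktoken
import tiktoken

# cl100k_base is the encoding used by several OpenAI models;
# swap in the encoding that matches the model you actually call.
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

prose = "How do I reset my password?"
payload = '{"user": {"name": "Ada", "roles": ["admin", "editor"]}}'

print(count_tokens(prose))    # ~7-8 tokens, close to the estimate above
print(count_tokens(payload))  # far more tokens per word: braces, quotes,
                              # and colons all consume tokens
```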

What Goes Into the Context Window

The context window is shared between input and output. Everything the model needs to read and everything it writes must fit within the limit. For a typical LLM application, the input includes several components that compete for space:

The system prompt defines the model's behavior, personality, guardrails, and output format. This is typically 500 to 5,000 tokens depending on complexity. It is included in every API call and is usually the most stable component.

The conversation history includes all previous messages in the current conversation. This grows linearly with each turn. A 10-turn conversation might use 4,000 to 8,000 tokens. A 50-turn conversation might use 20,000 to 40,000 tokens. Without management, this component will eventually consume the entire window.

Retrieved context from RAG pipelines, memory systems, or tool calls brings external information into the prompt. Each retrieved document or memory adds to the token count. Retrieving 5 documents of 500 tokens each adds 2,500 tokens.

Tool definitions describe the functions the model can call. Each tool definition consumes tokens for the name, description, and parameter schema. Applications with 10 to 20 tools might spend 2,000 to 4,000 tokens on tool definitions alone.

The response is generated by the model and also counts against the context window. A detailed response might use 500 to 2,000 tokens. The maximum response length is limited by whatever capacity remains after the input components.
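
Putting these components together, the budget arithmetic is simple but unforgiving. The numbers below are illustrative assumptions drawn from the ranges above, not measurements:

```python
# Illustrative token budget for a 128,000-token window.
# Every component size here is an assumption for the sake of the arithmetic.
CONTEXT_WINDOW = 128_000

budget = {
    "system_prompt": 2_000,
    "tool_definitions": 3_000,          # roughly 10-20 tools
    "conversation_history": 30_000,     # roughly a 40-turn conversation
    "retrieved_context": 2_500,         # 5 documents x 500 tokens
}

input_tokens = sum(budget.values())
max_response = CONTEXT_WINDOW - input_tokens

print(f"Input: {input_tokens:,} tokens")          # Input: 37,500 tokens
print(f"Room for response: {max_response:,}")     # Room for response: 90,500
```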

Why Context Windows Are Fixed

The context window is fixed because of how transformer models work internally. The attention mechanism in transformers computes relationships between every pair of tokens in the context. The computational cost of this attention grows quadratically with the number of tokens: doubling the context length quadruples the computation. The memory required to store the attention matrices also grows quadratically.
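
To make the quadratic growth concrete, the sketch below counts attention-score entries per head, per layer. Modern kernels avoid materializing this full matrix, but the pairwise computation it represents is the underlying cost:

```python
# Quadratic growth of the attention score matrix, ignoring batch and
# head dimensions for simplicity. Sizes are illustrative.
def attention_matrix_entries(n_tokens: int) -> int:
    # Full self-attention scores one relationship per token pair.
    return n_tokens * n_tokens

for n in (4_000, 8_000, 128_000):
    entries = attention_matrix_entries(n)
    # At 2 bytes per fp16 entry, per head, per layer:
    print(f"{n:>7} tokens -> {entries:,} entries "
          f"(~{entries * 2 / 1e6:,.0f} MB in fp16)")

# Doubling 4,000 to 8,000 tokens quadruples the entries (32 MB -> 128 MB),
# and 128,000 tokens needs ~32,768 MB per head, per layer.
```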

Model providers choose context window sizes that balance capability against cost. A larger window lets the model consider more information but requires more GPU memory and more computation per request. This is why larger context windows cost more per token in API pricing: the provider is using more hardware to process each request.

Recent advances like grouped-query attention, sliding-window attention, and FlashAttention have reduced the memory and compute overhead of long contexts, enabling the jump from 4k to 128k and even 1M token windows. But the fundamental constraint remains: more context means more computation, which means more cost and more latency.

The Practical vs Advertised Window

A model's advertised context window is its theoretical maximum. The practical window, the amount of context the model uses effectively, is often much smaller. Several studies have documented this gap.

The "lost in the middle" phenomenon, identified by Liu et al. in 2023, shows that models pay the most attention to information at the beginning and end of the context. Information placed in the middle of a long prompt is significantly less likely to influence the response. In one benchmark, accuracy on questions about information in the middle of a 20-document context was 20 percentage points lower than accuracy on information at the beginning or end.

Attention degradation at scale means that filling a 128k context window with text does not give you 128k tokens worth of useful context. Studies suggest that effective utilization starts declining around 30 to 40% of the maximum window size. Beyond that point, adding more context produces diminishing returns and can even reduce quality by introducing noise that dilutes the signal from relevant content.

This does not mean large context windows are useless. They are valuable for tasks that genuinely require processing large amounts of text, like long-document summarization or code repository analysis. But for most applications, curating a smaller, more relevant context produces better results than filling a larger window with everything available.

Context Windows and Memory

The context window is analogous to working memory in human cognition. Just as you can hold about seven items in working memory at once, an LLM can effectively attend to a limited amount of context. The rest of human knowledge lives in long-term memory, accessed on demand through recall, not held in working memory simultaneously.

External memory systems bring this same architecture to LLM applications. Instead of trying to hold all knowledge in the context window, persistent knowledge is stored externally and retrieved when relevant. The context window holds only the system instructions, the current query, and the specific memories needed for this particular interaction. This keeps token usage low, cost predictable, and attention focused on what matters.
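
A hedged sketch of that assembly step follows. The retrieve function is a placeholder for whatever memory backend you use; its signature is an assumption, not a real API:

```python
from collections.abc import Callable

def build_prompt(
    system_prompt: str,
    query: str,
    retrieve: Callable[..., list[str]],
    k: int = 5,
) -> list[dict]:
    """Assemble a lean context: system instructions, the current query,
    and only the memories relevant to this interaction.
    """
    # Hypothetical signature: fetch the k most relevant stored memories.
    memories = retrieve(query, top_k=k)
    memory_block = "\n".join(f"- {m}" for m in memories)
    return [
        {"role": "system",
         "content": f"{system_prompt}\n\nRelevant memories:\n{memory_block}"},
        {"role": "user", "content": query},
    ]
```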

Adaptive Recall implements this by providing seven tools that the LLM calls through the Model Context Protocol (MCP). The recall tool retrieves relevant memories using cognitive scoring that accounts for semantic similarity, recency, frequency, entity connections, and confidence. The result is an LLM that operates within a modest context window but has access to a vast store of persistent knowledge, much like an expert who does not hold everything in their head but can recall what they need when they need it.
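
Adaptive Recall's scoring lives inside the product, but purely as an illustration of the idea, a weighted blend of those five signals might look like the following. Every weight and field name here is hypothetical:

```python
# Purely illustrative: a weighted blend of the five signals named above.
# Weights and field names are assumptions, not Adaptive Recall's
# actual implementation.
def cognitive_score(memory: dict, weights: dict | None = None) -> float:
    w = weights or {
        "similarity": 0.40,  # semantic similarity to the query
        "recency": 0.20,     # how recently the memory was touched
        "frequency": 0.15,   # how often it has been recalled
        "entities": 0.15,    # entity overlap with the query
        "confidence": 0.10,  # how reliable the memory is judged to be
    }
    # Each signal is assumed to be pre-normalized to the range [0, 1].
    return sum(w[key] * memory[key] for key in w)
```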

Give your LLM long-term memory that lives outside the context window. Conversations never hit the token limit because knowledge is stored and retrieved on demand.