What Is a Context Window and Why It Matters When Using LLMs in Production
If you've worked with LLMs beyond casual conversation, you've probably hit the context window limit: the model starts forgetting things, responses degrade in quality, or you get an error because you've exceeded the maximum allowed tokens.
Understanding what the context window is and how to manage it isn't a minor technical detail — it's the difference between an application that works well and one that fails in ways that are hard to debug.
What Is the Context Window
The context window is the maximum amount of text an LLM can process in a single call. Everything the model can "see" at any given moment — the system prompt, conversation history, documents you've passed in, the current question — all has to fit within that limit.
The limit is measured in tokens, not words or characters. A token is roughly 0.75 words in English. The sentence "the context window is important" is approximately 5 tokens.
You can experiment with tokenization of any text using the official OpenAI tokenizer, which uses the same tokenization system as their models.
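Before reaching for a real tokenizer, you can get a rough budget from the 0.75 words-per-token rule of thumb. The sketch below applies only that heuristic; `estimate_tokens` is a hypothetical helper, not part of any library. Real counts will differ: short, common English words are often a single token each, so the heuristic tends to overestimate on simple sentences.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb for English text, not an exact ratio


def estimate_tokens(text: str) -> int:
    """Rough token estimate based on a whitespace word count."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


print(estimate_tokens("word " * 750))  # 750 words -> 1000 estimated tokens
```

Use an estimate like this for early sizing only, and count with the provider's tokenizer before enforcing hard limits.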
Current Limits of the Main Models
Limits vary significantly between models:
- GPT-4o: 128,000 tokens (~300 pages of text)
- GPT-4o-mini: 128,000 tokens
- Claude 3.5 Sonnet: 200,000 tokens (~470 pages)
- Gemini 1.5 Pro: up to 1,000,000 tokens (~2,350 pages)
- Gemini 1.5 Flash: 1,000,000 tokens
- Llama 3.1 8B and 70B: 128,000 tokens
- Mistral 7B: 32,000 tokens
These values are updated frequently — always check the official documentation for current values: OpenAI, Anthropic, Google.
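One way to put these numbers to work is a small lookup that tells you whether a request fits before you send it. This is a minimal sketch: the dictionary keys are illustrative labels rather than official API model IDs, the limits are a snapshot of the list above, and the 4,000-token output reserve is an assumed default. It also assumes the generated answer counts against the same window as the input, which is typical but worth confirming per provider.

```python
# Context limits as listed above; a snapshot, not a source of truth.
CONTEXT_LIMITS = {
    "gpt-4o": 128_000,
    "gpt-4o-mini": 128_000,
    "claude-3.5-sonnet": 200_000,
    "gemini-1.5-pro": 1_000_000,
    "gemini-1.5-flash": 1_000_000,
    "llama-3.1-70b": 128_000,
    "mistral-7b": 32_000,
}


def fits_in_context(model: str, input_tokens: int, reserved_output: int = 4_000) -> bool:
    """True if the input plus the room reserved for the answer fits the window."""
    return input_tokens + reserved_output <= CONTEXT_LIMITS[model]


print(fits_in_context("mistral-7b", 30_000))  # False: 30,000 + 4,000 > 32,000
print(fits_in_context("gpt-4o", 30_000))      # True
```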
The Difference Between Context Window and Memory
This is the most common misunderstanding: the context window is not persistent memory. It is only the text the model receives in the current call.
When you make an API call, the model processes all the text you send at that moment and generates a response. It remembers nothing from previous calls unless you explicitly include that history in the current call.
Think of it as a sheet of paper with limited space. On each conversation turn, you write everything relevant — the full history plus the new question. When the sheet fills up, you have to decide what to erase to make room.
from openai import OpenAI

client = OpenAI()

# This has NO memory between calls
response_1 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "My name is Alex."}]
)

response_2 = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What is my name?"}]
    # The model doesn't know your name — you didn't include the history
)

# This WORKS — the history is in the context
response_3 = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "My name is Alex."},
        {"role": "assistant", "content": "Hi Alex, how can I help you?"},
        {"role": "user", "content": "What is my name?"}
    ]
)
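In practice you rarely build the message list by hand on every turn. A small wrapper can accumulate the history and resend it each time. The sketch below is one possible shape, not a prescribed pattern: `Conversation` and `send` are hypothetical names, and `fake_send` stands in for the real API call so the example runs without a key.

```python
from typing import Callable

Message = dict[str, str]


class Conversation:
    """Keeps the running history and resends all of it on every turn."""

    def __init__(self, send: Callable[[list[Message]], str], system_prompt: str = ""):
        self.send = send  # hypothetical callable wrapping the real API call
        self.messages: list[Message] = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def ask(self, user_text: str) -> str:
        self.messages.append({"role": "user", "content": user_text})
        reply = self.send(self.messages)  # the full history goes out each time
        self.messages.append({"role": "assistant", "content": reply})
        return reply


# Stand-in for a model: it only reports how many messages it was given.
def fake_send(messages: list[Message]) -> str:
    return f"received {len(messages)} messages"


chat = Conversation(fake_send)
chat.ask("My name is Alex.")
print(chat.ask("What is my name?"))  # received 3 messages
```

With the real client, `send` could wrap `client.chat.completions.create(...)` and return `response.choices[0].message.content`. Note that the history grows by two messages per turn, which is exactly why it eventually needs managing.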
How the Context Window Affects Your Applications
Chatbots and Conversational Assistants
In a long conversation, the history grows with each turn. If you don't manage it, you'll eventually exceed the context window limit and get an error.
The more subtle problem is that before hitting the limit, response quality can degrade. Models tend to pay more attention to recent text and text at the beginning of the context (the system prompt). Information in the "middle" of the context receives less attention — a phenomenon documented in the Stanford paper Lost in the Middle.
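A mitigation sometimes used in response to this finding (not something the paper guarantees will help in every setup) is to order the material so that the most relevant pieces sit at the edges of the prompt. The sketch below assumes you already have documents sorted from most to least relevant; `place_best_at_edges` is a hypothetical helper.

```python
def place_best_at_edges(docs_by_relevance: list[str]) -> list[str]:
    """Reorder documents (most relevant first) so the strongest sit at the
    start and end of the prompt and the weakest end up in the middle."""
    front, back = [], []
    for i, doc in enumerate(docs_by_relevance):
        if i % 2 == 0:
            front.append(doc)
        else:
            back.append(doc)
    return front + back[::-1]


print(place_best_at_edges(["A", "B", "C", "D", "E"]))
# ['A', 'C', 'E', 'D', 'B']
```

Here "A" (most relevant) opens the context, "B" closes it, and "E" (least relevant) lands in the middle, where reduced attention costs the least.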
Document Processing
If you want the model to analyze a document, that document has to fit in the context window along with the system prompt and the question. A 100-page PDF can have 50,000-80,000 tokens — perfectly manageable with GPT-4o or Claude, but impossible with small-window models.
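When a document does not fit, one fallback is to split it into pieces that each leave room for the system prompt, the question, and the answer. The following is a rough sketch under several assumptions: token counts come from the 0.75 words-per-token heuristic rather than a real tokenizer, paragraphs are separated by blank lines, and `split_to_fit` and `overhead` are names invented for this example.

```python
import math


def estimate_tokens(text: str) -> int:
    # Rule-of-thumb estimate (about 0.75 words per token); swap in a real tokenizer.
    return math.ceil(len(text.split()) / 0.75)


def split_to_fit(document: str, window: int, overhead: int) -> list[str]:
    """Group paragraphs into chunks whose estimated size fits window - overhead.

    `overhead` covers the system prompt, the question, and room for the answer.
    A single paragraph larger than the budget is kept whole, as its own chunk.
    """
    budget = window - overhead
    if budget <= 0:
        raise ValueError("overhead leaves no room for the document")
    chunks, current, used = [], [], 0
    for paragraph in document.split("\n\n"):
        size = estimate_tokens(paragraph)
        if current and used + size > budget:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(paragraph)
        used += size
    if current:
        chunks.append("\n\n".join(current))
    return chunks


doc = "\n\n".join(["word " * 300] * 10)  # 10 paragraphs, about 400 tokens each
print(len(split_to_fit(doc, window=32_000, overhead=2_000)))  # 1: fits in one call
print(len(split_to_fit(doc, window=3_000, overhead=2_000)))   # 5: two paragraphs per chunk
```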
Pipelines with History
In pipelines where you accumulate results from multiple steps, context grows quickly. A common mistake is passing the full output of each step to the next without filtering, hitting the limit sooner than expected.
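One way to avoid that mistake is to have each step declare which fields the next step actually needs, and forward only those. This is a minimal sketch: `run_pipeline`, `extract`, and `classify` are hypothetical, and the two step functions return canned data where a real pipeline would call a model.

```python
from typing import Callable

Step = tuple[Callable[[dict], dict], list[str]]  # (step function, fields to pass on)


def run_pipeline(steps: list[Step], state: dict) -> dict:
    """Run each step, forwarding only the fields the next step needs."""
    for step_fn, forward in steps:
        result = step_fn(state)
        state = {key: result[key] for key in forward if key in result}
    return state


# Hypothetical steps standing in for LLM calls.
def extract(state: dict) -> dict:
    return {"entities": ["Acme", "2024"], "raw_text": state["document"], "debug": "..."}


def classify(state: dict) -> dict:
    return {"label": "contract", "entities": state["entities"]}


final = run_pipeline(
    [(extract, ["entities"]), (classify, ["label", "entities"])],
    {"document": "long document text " * 1000},
)
print(final)  # {'label': 'contract', 'entities': ['Acme', '2024']}
```

The bulky `raw_text` and `debug` fields never reach the second step, so they never count against its context.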
Strategies for Managing the Context Window
1. Truncation with Sliding Window
The simplest strategy: when history exceeds a threshold, remove the oldest messages.
import tiktoken

def count_tokens(messages: list, model: str = "gpt-4o") -> int:
    encoder = tiktoken.encoding_for_model(model)
    total = 0
    for message in messages:
        total += len(encoder.encode(message["content"]))
        total += 4  # per-message overhead
    return total

def truncate_history(messages: list, max_tokens: int = 100000) -> list:
    # Always preserve the system prompt (first message)
    system = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]

    # Remove oldest messages until we fit within the limit
    while count_tokens(system + conversation) > max_tokens and len(conversation) > 2:
        conversation.pop(0)

    return system + conversation
Limitation: if you remove important messages from the beginning, the model loses relevant context.
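A possible way to soften that limitation is to let the application mark certain messages as protected so truncation skips them. The sketch below is an illustration, not a standard feature of any API: the `"pinned"` key is a hypothetical flag you would strip before sending, and the token counter is passed in so the example can run with a simple word count.

```python
from typing import Callable


def truncate_keeping_pinned(
    messages: list[dict],
    max_tokens: int,
    count: Callable[[list[dict]], int],
) -> list[dict]:
    """Drop the oldest unprotected messages until the list fits.

    System messages, messages with a truthy "pinned" key, and the latest
    message are never dropped.
    """
    kept = list(messages)
    while count(kept) > max_tokens:
        for i, m in enumerate(kept[:-1]):  # never consider the latest message
            if m["role"] != "system" and not m.get("pinned"):
                del kept[i]
                break
        else:
            break  # only protected messages remain; nothing more to drop
    return kept


def word_count(msgs: list[dict]) -> int:
    # Stand-in counter for the example; use a real tokenizer in practice.
    return sum(len(m["content"].split()) for m in msgs)


history = [
    {"role": "system", "content": "be brief"},
    {"role": "user", "content": "my name is Alex", "pinned": True},
    {"role": "assistant", "content": "hi Alex how can I help"},
    {"role": "user", "content": "tell me about tokens"},
    {"role": "assistant", "content": "tokens are pieces of text"},
    {"role": "user", "content": "what is my name"},
]
trimmed = truncate_keeping_pinned(history, max_tokens=15, count=word_count)
print([m["content"] for m in trimmed])
```

Dropping messages from the middle can leave two messages with the same role next to each other; some providers require strict user/assistant alternation, so check that before adopting this approach.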
2. History Summarization
Instead of truncating, periodically summarize the previous conversation and replace the history with that summary.
def summarize_history(history: list, client: OpenAI) -> str:
    history_text = "\n".join(
        f"{m['role'].upper()}: {m['content']}" for m in history
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # Economical model for summarization
        messages=[
            {
                "role": "system",
                "content": "Summarize this conversation preserving important details: names, decisions made, key information mentioned."
            },
            {
                "role": "user",
                "content": history_text
            }
        ]
    )
    return response.choices[0].message.content

def manage_context(messages: list, client: OpenAI, threshold: int = 80000) -> list:
    if count_tokens(messages) <= threshold:
        return messages

    system = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]
    if len(conversation) <= 4:
        return messages  # nothing older than the last 4 messages to summarize

    # Summarize everything except the last 4 messages
    to_summarize = conversation[:-4]
    recent = conversation[-4:]

    summary = summarize_history(to_summarize, client)
    summary_message = {
        "role": "system",
        "content": f"Summary of previous conversation: {summary}"
    }
    return system + [summary_message] + recent
3. RAG Instead of Full Context
For large documents, instead of putting the entire document in context, use RAG — retrieve only the relevant fragments for each question. This dramatically reduces context usage and improves precision.
We covered how to implement RAG in detail in the article RAG Explained: How to Connect Your Own Documents to an LLM.
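To make the idea concrete without repeating that article, here is a deliberately tiny retrieval sketch. It ranks chunks by how many words they share with the question, which is a stand-in for the embedding-based similarity search a real RAG system would normally use. The function names and the sample contract text are invented for the example.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks sharing the most words with the question."""
    q = tokenize(question)
    ranked = sorted(chunks, key=lambda c: len(q & tokenize(c)), reverse=True)
    return ranked[:top_k]


chunks = [
    "Payment is due within 30 days of the invoice date.",
    "Either party may terminate the agreement with 60 days notice.",
    "The supplier warrants the goods for 12 months.",
]
context = retrieve("When is payment due?", chunks, top_k=1)
print(context)  # ['Payment is due within 30 days of the invoice date.']
```

Only the retrieved chunk goes into the prompt, so context usage depends on `top_k` and chunk size rather than on the length of the whole document.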
4. Prompt Caching
OpenAI and Anthropic offer prompt caching: if the beginning of your context is identical between calls (for example, a long system prompt or fixed reference documents), the provider caches that part and charges less for it.
With Anthropic, prompt caching can reduce cost by up to 90% for the cached portion. At OpenAI, caching is applied automatically to contexts that start the same way.
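Whichever provider you use, caching only helps if the unchanging content comes first and the variable content comes last. As an illustration, the sketch below builds the arguments for an Anthropic Messages API call with the static reference text marked via `cache_control`. The helper name `build_cached_request`, the model alias, and the `max_tokens` value are assumptions for the example; check the current documentation before relying on them.

```python
def build_cached_request(reference_docs: str, question: str) -> dict:
    """Build keyword arguments for an Anthropic Messages API call in which the
    long, unchanging prefix is marked as cacheable and the question is not."""
    return {
        "model": "claude-3-5-sonnet-latest",  # assumed model alias; check current docs
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": reference_docs,
                # Marks the prefix up to and including this block as cacheable.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Only this part changes between calls, so it stays after the cached prefix.
        "messages": [{"role": "user", "content": question}],
    }


first = build_cached_request("...long reference text...", "What is the refund policy?")
second = build_cached_request("...long reference text...", "Who do I contact for support?")
print(first["system"] == second["system"])  # True: identical prefix, so it can be reused
```

The resulting dictionary could be passed as `anthropic.Anthropic().messages.create(**first)`. Any change to the prefix, even a timestamp inside the system prompt, would produce a different prefix and miss the cache.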
Cost and Context Window
The context window doesn't just affect what the model can process — it also affects cost. Commercial models charge per input token (what you send) and output token (what they generate).
If your application sends 10,000 tokens of context per call and makes 1,000 calls per day, that's 10 million input tokens daily. At GPT-4o's ~$2.50 per million input tokens, that's $25 per day just in input tokens — $750 per month.
Optimizing context window usage isn't just a technical problem; at scale, it's a cost problem.
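The arithmetic above is easy to turn into a small calculator so you can test the effect of trimming context. The function name and the 30-day month are assumptions; the $2.50 price is the approximate figure quoted above and will change over time.

```python
def monthly_input_cost(
    tokens_per_call: int,
    calls_per_day: int,
    price_per_million: float,
    days: int = 30,
) -> float:
    """Input-token cost in dollars over `days` days."""
    daily_tokens = tokens_per_call * calls_per_day
    return daily_tokens / 1_000_000 * price_per_million * days


print(monthly_input_cost(10_000, 1_000, 2.50))  # 750.0
print(monthly_input_cost(3_000, 1_000, 2.50))   # 225.0
```

In this example, cutting the context from 10,000 to 3,000 tokens per call lowers the monthly input bill from $750 to $225 at the same call volume.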
When a Large Context Window Is Actually Useful
You don't always need the largest possible context window. These are the cases where it genuinely makes a difference:
- Full document analysis: analyzing an 80-page contract, auditing a complete code repository, reviewing a full financial report.
- Very long conversations: multi-hour working sessions where accumulated context is extensive.
- Few-shot with many examples: when you need to include 50-100 examples in the prompt so the model learns the correct pattern.
- Multi-source synthesis: combining and synthesizing information from many documents in a single call.
For most chatbot and automation applications, 32,000-128,000 tokens is more than enough. The jump to Gemini's 1 million tokens only adds value for specific use cases.