March 25, 2026

What Is an AI Context Window? Why It Matters and How to Use It

AI context windows explained — what a context window is, how 2026 models compare (Gemini 1M vs Claude 200K vs GPT-4o 128K), and 5 practical tips for using context effectively.

Technical AI

What Is an AI Context Window?

Why the context window is the most important number you're not paying attention to — and how to use it strategically.

1M tokens: Gemini 1.5 Pro's context window — enough to load an entire codebase or all of Shakespeare's works at once [Google]
~750 words per 1,000 tokens: a rough rule of thumb for English text (varies by writing style and complexity)
~40% performance degradation from the "lost in the middle" problem: LLMs recall the start and end of their context better than the middle [Liu et al.]

The context window is the maximum amount of text a model can process in a single interaction — input plus output combined. Everything outside this window is invisible to the model. It has no memory of previous conversations, no access to documents you haven't provided, no awareness of anything beyond its current context.

Context windows are measured in tokens, not words or characters. A token is typically 3-4 characters of English text: a common word like "the" is a single token, longer or rarer words split into two or more, and punctuation marks are usually one token each. APIs charge per token, and context size directly determines what tasks are feasible.
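For quick capacity planning, the 3-4 characters-per-token rule can be turned into a back-of-the-envelope estimator. This is a heuristic sketch only; real tokenizers (such as OpenAI's tiktoken) give exact counts that vary by model.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for English text: ~4 characters per token.

    A heuristic for planning only; actual tokenizer output varies with
    vocabulary, language, and formatting.
    """
    return max(1, len(text) // 4)
```

At 4 characters per token, a 4,000-character passage estimates to about 1,000 tokens — consistent with the ~750-words-per-1K-tokens rule of thumb above.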

Context windows have grown dramatically since GPT-3's 4K limit in 2020. In 2026, leading models sit at roughly 1M tokens for Gemini 1.5 Pro, 200K for Claude, and 128K for GPT-4o.

Important caveat: having a 1M token context window doesn't mean you should fill it. Cost scales linearly with tokens. A 200K-token prompt with Gemini costs 200x more than a 1K-token prompt. And the "lost in the middle" problem means retrieval quality degrades when context is enormous.
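Because the window covers input plus output combined, it is worth checking that a prompt leaves room for the response before sending it. A minimal sketch — the reserve size is illustrative, not tied to any particular model:

```python
def fits_in_window(prompt_tokens: int, window_tokens: int,
                   reserved_output_tokens: int = 4_000) -> bool:
    """Check whether a prompt fits a context window.

    The window covers input plus output combined, so we reserve room
    for the model's response rather than filling the window entirely.
    """
    return prompt_tokens + reserved_output_tokens <= window_tokens
```

For example, a 126K-token prompt does not fit a 128K window once 4K tokens are reserved for the reply.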

Real-world equivalents to help you plan what you can actually load into a context window.

~750 words per 1K tokens
~100K tokens in a full novel (e.g. Harry Potter book 1)
~50K tokens in a full academic thesis
~10K tokens in a 40-page PDF report
~3K tokens in a long blog post like this one
~500 tokens in a one-page email
~200K tokens in a full codebase (small-medium project)
~1M tokens in all 154 Shakespeare sonnets + full plays
1. Put the most important content at the start or end
LLMs demonstrably recall content from the start and end of their context better than from the middle. If you're loading 50 pages of a report, put the sections you most need the model to reason about at the beginning or end of your prompt — not buried in the middle.
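One way to apply this in code is to reorder document sections before building the prompt, so the section you care about sits at an edge. A hypothetical sketch — the section names and prompt layout are made up for illustration:

```python
def assemble_prompt(sections: dict[str, str], critical: str, question: str) -> str:
    """Build a prompt with the critical section first and the question last.

    Both then sit at the well-recalled edges of the context, countering
    the "lost in the middle" effect, instead of being buried mid-prompt.
    """
    rest = [text for name, text in sections.items() if name != critical]
    return "\n\n".join([sections[critical], *rest, question])
```

Usage: `assemble_prompt({"intro": ..., "findings": ..., "appendix": ...}, critical="findings", question="Summarise the findings.")` puts the findings section at the very start and the question at the very end.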
2. Use RAG instead of stuffing raw documents
Retrieving 5 relevant pages from a 500-page document is usually better than loading all 500 pages. RAG reduces cost, reduces noise, and avoids the lost-in-the-middle problem. Large context windows are best for tasks where the entire document is relevant — like code review or contract analysis.
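The retrieval step can be sketched without any dependencies by scoring chunks on word overlap with the query. A real RAG pipeline would use embedding similarity and a vector store; overlap scoring just keeps the example self-contained:

```python
def retrieve(chunks: list[str], query: str, k: int = 5) -> list[str]:
    """Return the k chunks with the most words in common with the query.

    A toy stand-in for embedding-based retrieval: only the top-scoring
    chunks go into the context, not the whole document.
    """
    query_words = set(query.lower().split())
    ranked = sorted(
        chunks,
        key=lambda c: len(query_words & set(c.lower().split())),
        reverse=True,
    )
    return ranked[:k]
```

Feeding only the retrieved chunks to the model keeps the prompt small, which is exactly the cost and noise reduction the tip describes.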
3. Be explicit about what to focus on
If you load a long document, tell the model exactly where the relevant information is: "The policy you need is in section 4.2 — page 23." Don't assume the model will naturally weight that section appropriately.
4. Track token usage for cost management
Claude 3.5 Sonnet costs $3 per million input tokens. A 200K-token context filled with documents costs $0.60 per query. For high-volume applications, compressing context or using smaller models for initial retrieval can cut costs by 80%+.
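That arithmetic is easy to codify. A small helper — the $3-per-million default is the Claude 3.5 Sonnet input price cited above; output-token pricing, which is typically higher, is ignored for simplicity:

```python
def input_cost_usd(input_tokens: int, price_per_million: float = 3.00) -> float:
    """Input-side cost of one request; cost scales linearly with tokens."""
    return input_tokens / 1_000_000 * price_per_million
```

At that price a 200K-token prompt costs $0.60 per query, while retrieving a 20K-token subset first cuts the input cost to $0.06 — the kind of 80%+ saving the tip refers to.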
5. Remember: context doesn't persist across sessions
Each new conversation starts with an empty context window. The model has no memory of your previous session. If you're building an application that needs persistent memory, you need to implement it explicitly — either by storing conversation history or using a memory system like MemGPT.
The practical takeaway
Bigger context windows unlock new use cases — entire codebase review, full contract analysis, comprehensive document summarisation. But bigger isn't always better for everyday tasks. Match context to your actual needs, put critical content at the edges, and use RAG when the document is larger than what you actually need to query.
Does a bigger context window make the AI smarter?
No — it gives the model access to more information per query, but doesn't improve its core reasoning capabilities. A 1M-token window doesn't help you if the model makes reasoning errors on simple logic problems. Capability and context are separate properties of a model.
What happens when you exceed the context window?
Different things depending on the system. Some APIs return an error. Others silently truncate the input — either from the beginning (discarding oldest context) or from the middle. Consumer products like ChatGPT typically manage this transparently, but you may notice the model "forgetting" things from early in a very long conversation.
Is the context window the same as memory?
No. Memory implies persistence across sessions. The context window only covers the current conversation. When you close the chat, the model forgets everything. Products that offer "memory" (like ChatGPT's persistent memory feature) implement this separately — typically by injecting a summary of past interactions into the context at the start of new sessions.

Sources

[Google] Google DeepMind — Gemini 1.5 Pro technical report (2024)
[Liu et al.] Liu et al. — "Lost in the Middle: How Language Models Use Long Contexts" (2023)
[Anthropic] Anthropic — Claude 3.5 Sonnet model card and pricing documentation

Written by Luke Madden, founder of Veltrix Collective. Data synthesis and analysis by Vel.