How Does ChatGPT Work?
The non-technical explanation — tokens, training, why it's helpful (not just statistically likely), and why it still gets things wrong. Plus: how ChatGPT differs from Claude and Gemini under the hood.
00 — The short answer
ChatGPT is a large language model trained to predict the most useful next token given everything you've written. That single sentence explains almost everything about how it works — and why it sometimes fails.
ChatGPT is built on GPT-4o (as of 2026), a transformer-based language model developed by OpenAI. It doesn't "think" in the way humans do. It reads your input, converts it into a numerical representation, and generates a response token by token — where each token is roughly 3/4 of a word. The model has learned patterns from an enormous corpus of text: books, websites, code, academic papers, and much more. That learning gives it the appearance of knowledge and reasoning. And in many domains, it performs remarkably well.
But "prediction" is not the same as "understanding." This distinction explains both ChatGPT's impressive capabilities and its specific failure modes.
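The generate-one-token-at-a-time loop can be sketched with a toy model. Everything here (the tiny vocabulary, the probabilities) is invented for illustration; a real model computes next-token probabilities with a neural network over a vocabulary of hundreds of thousands of tokens, but the outer loop is the same.

```python
# Toy autoregressive generation: pick the most likely next token,
# append it to the context, and repeat. The probability table below
# is hand-written for illustration; a transformer computes it.
NEXT_TOKEN_PROBS = {
    ("The",): {"cat": 0.6, "dog": 0.4},
    ("The", "cat"): {"sat": 0.7, "ran": 0.3},
    ("The", "cat", "sat"): {"<end>": 1.0},
}

def generate(prompt, max_tokens=10):
    tokens = list(prompt)
    for _ in range(max_tokens):
        probs = NEXT_TOKEN_PROBS.get(tuple(tokens), {})
        if not probs:
            break
        next_token = max(probs, key=probs.get)  # greedy decoding
        if next_token == "<end>":
            break
        tokens.append(next_token)
    return " ".join(tokens)

print(generate(["The"]))  # → "The cat sat"
```

Real systems usually sample from the distribution rather than always taking the most likely token, which is why the same prompt can produce different answers.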
01 — Tokens, not words
ChatGPT doesn't read words. It reads tokens. This isn't a trivial distinction — it explains everything from why it occasionally misspells words to why it struggles with some maths problems.
Words don't split cleanly: "ChatGPT" becomes 3 tokens, while common words like "is" and "a" are single tokens each. Numbers, code, and unusual words often become several tokens apiece. ChatGPT processes roughly 75 words per 100 tokens — meaning a 1,000-word article is about 1,333 tokens.
Why does this matter? Because the model has no concept of letters or spelling at the token level. It operates on token sequences, not characters. When it "gets spelling wrong," it's actually producing a token pattern that's statistically reasonable but character-level incorrect. This also explains why ChatGPT can sometimes fail at simple counting tasks: it doesn't perceive words, it perceives tokens — and counting requires seeing individual units it wasn't designed to see.
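A toy tokenizer makes the word-splitting behaviour concrete. The vocabulary below is invented, and real tokenizers (byte-pair encoding) learn their pieces from data rather than matching greedily, but the effect is similar: familiar words survive as single pieces while unfamiliar ones shatter.

```python
# Toy greedy tokenizer: at each position, match the longest piece
# that exists in the vocabulary. The vocabulary is invented for
# illustration; real BPE tokenizers learn ~200k pieces from data.
VOCAB = {"Chat", "G", "PT", "is", "a", "token", "izer", " "}

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if piece in VOCAB:
                tokens.append(piece)
                i += length
                break
        else:
            tokens.append(text[i])  # unknown character: its own token
            i += 1
    return tokens

print(tokenize("ChatGPT is a tokenizer"))
# → ['Chat', 'G', 'PT', ' ', 'is', ' ', 'a', ' ', 'token', 'izer']
```

Note that "ChatGPT" comes out as three pieces while "is" and "a" stay whole — exactly the pattern described above. The model only ever sees the piece IDs, never the letters inside them.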
02 — Training: what happened before you typed anything
The model you interact with today is the result of a multi-stage training process that took OpenAI years and cost hundreds of millions of dollars. You're using the compressed output of that process.
Pre-training. The base model is trained on a massive dataset — hundreds of billions of tokens from Common Crawl (web scrapes), books, Wikipedia, GitHub code, and more. The task: given the tokens so far, predict the next one. Do this trillions of times, adjust the model weights each time you get it wrong, and you end up with a model that has internalised patterns in language, facts, reasoning structures, and code. This is where ChatGPT's knowledge comes from.
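The "predict the next token" objective can be illustrated with the crudest possible model: one that just counts which token follows which in a corpus. A real model generalises with billions of learned weights instead of memorising counts, but the training signal is the same question: given the context, what comes next?

```python
from collections import Counter, defaultdict

# Toy "pre-training": count which token follows which token in a
# tiny corpus, then predict the most frequent continuation.
# Illustrative only; real models learn weights, not lookup tables.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1  # how often does `nxt` follow `prev`?

def predict_next(token):
    return counts[token].most_common(1)[0][0]

print(predict_next("sat"))  # → "on" (follows "sat" in both sentences)
print(predict_next("on"))   # → "the"
```

Even this trivial model "knows" that "sat" is followed by "on" — not because it understands sitting, but because that pattern dominates its data. Scale the same idea up enormously and you get something that looks like knowledge.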
Supervised fine-tuning. OpenAI took the pre-trained model and trained it further on human-written example conversations — specifically crafted by contractors to demonstrate helpful, accurate, appropriately formatted responses. This is what makes ChatGPT feel like a helpful assistant rather than a text completion engine. The model learns what "good answers look like."
Reinforcement learning from human feedback (RLHF). Human trainers ranked multiple model responses from best to worst. A separate "reward model" was trained on these rankings — learning to predict which responses humans prefer. The main model was then fine-tuned using reinforcement learning to produce responses that score highly on the reward model. This is why ChatGPT sounds helpful, not just statistically likely.
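The reward model is commonly trained with a pairwise comparison loss: push the score of the human-preferred response above the score of the rejected one. A minimal sketch of that loss, with invented scores standing in for a neural network's output:

```python
import math

# Pairwise preference loss used to train reward models:
# loss = -log(sigmoid(r_preferred - r_rejected)).
# The reward values below are invented for illustration.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_preferred, r_rejected):
    return -math.log(sigmoid(r_preferred - r_rejected))

# Reward model already ranks the pair correctly: small loss.
print(preference_loss(2.0, -1.0))  # ≈ 0.049
# Reward model ranks them the wrong way round: large loss.
print(preference_loss(-1.0, 2.0))  # ≈ 3.049
```

Minimising this loss over thousands of human comparisons teaches the reward model what people prefer; the main model is then tuned to maximise that learned reward.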
Red-teaming and safety training. Before release, the model is tested by adversarial testers (a "red team") trying to get it to produce harmful outputs. Additional training is applied to reduce those outputs. This is an ongoing process — every major ChatGPT update includes safety improvements. It's also why ChatGPT refuses some requests: those refusals were trained in, not hard-coded.
03 — What ChatGPT doesn't know
There are four structural limitations that follow directly from how ChatGPT is built. Understanding them makes you a significantly better user.
Knowledge cutoff. ChatGPT's training data has a cutoff date — events after that date didn't exist in the training set, so the model can't know about them. GPT-4o's training data cuts off in early 2024. When you ask about "the latest" anything, you may be getting outdated information. Always verify time-sensitive facts.
Context window limits. ChatGPT can only "see" the tokens in its current context window. GPT-4o has a 128,000-token window — about 96,000 words. Long conversations eventually push earlier content out of the window, and the model effectively "forgets" it. This isn't a memory system; it's a sliding window over tokens.
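The sliding-window behaviour can be sketched as a trimming function: when the conversation exceeds the token budget, the oldest messages fall out first. The budget and messages below are invented, and token counts use the rough words-to-tokens rule of thumb from earlier.

```python
# Sketch of a sliding context window: keep the newest messages that
# fit within the token budget; older ones are silently dropped.
# Token counts are approximated as words / 0.75 (rule of thumb).
def approx_tokens(text):
    return int(len(text.split()) / 0.75)

def fit_to_window(messages, budget):
    kept, used = [], 0
    for msg in reversed(messages):  # newest first
        cost = approx_tokens(msg)
        if used + cost > budget:
            break
        kept.append(msg)
        used += cost
    return list(reversed(kept))

history = ["first question " * 50, "second question " * 50, "latest question"]
print(fit_to_window(history, budget=150))
# the oldest message no longer fits and is dropped
```

Real chat products use more sophisticated strategies (summarising old turns, keeping the system prompt pinned), but the core constraint is the same: anything outside the window simply does not exist for the model.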
Hallucinations. Because ChatGPT generates tokens based on statistical patterns rather than verified knowledge retrieval, it can produce confident-sounding text that is factually wrong. Fake citations are a common example — the model produces a plausible-looking citation format because that's what citations look like in training data, even if the specific paper doesn't exist.
No persistent memory by default. Each ChatGPT conversation starts fresh (unless you've enabled the Memory feature in ChatGPT settings). The model has no access to previous conversations unless you paste them in. It doesn't "know you" — it knows the content of the current context window.
04 — ChatGPT vs Claude vs Gemini: how different architectures produce different outputs
All three are transformer-based language models. But training choices, safety approaches, and architecture decisions produce meaningfully different capabilities and personalities.
| Capability | ChatGPT (GPT-4o) | Claude (3.7 Sonnet) | Gemini (2.0 Pro) |
|---|---|---|---|
| Coding | Strong | Very strong | Good |
| Long document analysis | Good | Excellent (200K ctx) | Strong (2M ctx) |
| Image understanding | Strong | Good | Very strong (native) |
| Factual accuracy | Good (some hallucinations) | Very good (lower hallucination rate) | Good (web-grounded) |
| Following instructions | Very strong | Very strong | Good |
| Free tier quality | GPT-4o mini (limited) | Claude Haiku (limited) | Gemini Flash (generous) |
The differences aren't just performance — they're philosophical. Anthropic built Claude around Constitutional AI, a training approach that gives the model explicit principles to reason about rather than just human preference rankings. This tends to produce a model that's more willing to say "I'm not sure" and less likely to hallucinate confidently. OpenAI's RLHF approach optimises heavily for helpfulness and user satisfaction scores, which produces a model that's often more immediate and versatile.
05 — Frequently asked questions
**Does ChatGPT remember what I've said?**

Within a single conversation, yes — everything you've said is in the context window. Across conversations, only if you've enabled the Memory feature in ChatGPT settings. By default, each new conversation starts fresh. The model doesn't "know you" from previous sessions unless you've explicitly enabled persistent memory.
**Can ChatGPT be wrong?**

Yes, and this is important to understand. ChatGPT generates text that is statistically likely given its training, not text that it has verified against ground truth. It can state incorrect facts confidently. It can invent citations that don't exist. It can produce subtly wrong code that looks correct. Always verify any factual claim, especially in high-stakes contexts, and don't trust citations without independently confirming they exist.
**How is ChatGPT different from a search engine?**

A search engine retrieves existing documents and ranks them. ChatGPT generates new text based on patterns learned from documents. A search engine shows you what others have written. ChatGPT synthesises something new — which is more useful for many tasks (summarising, drafting, explaining) but also means you can't trace its output to a source the way you can with search results.
**Does ChatGPT get smarter over time?**

The model itself isn't learning from your conversations in real time. But OpenAI releases new model versions regularly (GPT-3.5 → GPT-4 → GPT-4o → GPT-4o with reasoning capabilities) that represent genuinely new trained models. Each major release is a distinct model with a different training run. So the product gets smarter over time — but not because of conversations you're having today.
**So is ChatGPT just predicting text?**

In a sense, yes: it's a prediction engine trained to sound maximally helpful.
That's not a criticism. It's an extraordinarily useful tool when you understand what it is. The users who get the most out of ChatGPT are the ones who treat it as a capable collaborator with specific known failure modes: knowledge cutoffs, hallucinations on obscure facts, and no persistent memory.
These models are improving rapidly. GPT-4o's capabilities in 2026 would have seemed impossible in 2022, and because the models improve at this pace, the limitations above shrink over time. But they don't disappear — and understanding the underlying mechanism is the best way to use any of these models well.