March 16, 2026

How Does ChatGPT Work? The Non-Technical Explanation (With Benchmarks)

Tokens, training, RLHF, and why ChatGPT sounds helpful rather than just statistically likely. Plus how it compares to Claude and Gemini.

ChatGPT is a large language model trained to predict the most useful next token given everything you've written. That single sentence explains almost everything about how it works — and why it sometimes fails.

ChatGPT is built on GPT-4o (as of 2026), a transformer-based language model developed by OpenAI [OAI]. It doesn't "think" in the way humans do. It reads your input, converts it into a numerical representation, and generates a response token by token — where each token is roughly 3/4 of a word. The model has learned patterns from an enormous corpus of text: books, websites, code, academic papers, and much more. That learning gives it the appearance of knowledge and reasoning. And in many domains, it performs remarkably well.

But "prediction" is not the same as "understanding." This distinction explains both ChatGPT's impressive capabilities and its specific failure modes.

1.8T: estimated parameters in GPT-4, according to leaked reports [SEMI]

128K: token context window in GPT-4o, roughly 96,000 words of simultaneous context [OAI]

122M: daily active users on ChatGPT as of February 2025 [OAI]

ChatGPT doesn't read words. It reads tokens. This isn't a trivial distinction — it explains everything from why it occasionally misspells words to why it struggles with some maths problems.

How "ChatGPT is a language model" becomes tokens
Chat G PT is a language model

Each coloured block is one token. Words don't split cleanly — "ChatGPT" becomes 3 tokens. Common words like "is" and "a" become single tokens. Numbers, code, and unusual words often become several. ChatGPT processes roughly 75 words per 100 tokens — meaning a 1,000-word article is about 1,333 tokens.

Why does this matter? Because the model has no concept of letters or spelling at the token level. It operates on token sequences, not characters. When it "gets spelling wrong," it's actually producing a token pattern that's statistically reasonable but character-level incorrect. This also explains why ChatGPT can sometimes fail at simple counting tasks: it doesn't perceive words, it perceives tokens — and counting requires seeing individual units it wasn't designed to see.
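The splitting above can be sketched with a toy greedy tokenizer. The vocabulary here is a handful of made-up entries for illustration only — it is not OpenAI's real merge table (GPT-4o's actual tokenizer, o200k_base, uses byte-pair encoding over roughly 200,000 entries):

```python
# Toy tokenizer: greedy longest-match against a tiny, invented vocabulary.
# Real BPE merges byte pairs instead, but the visible effect is the same:
# text fragments into sub-word units, not letters.
VOCAB = {"Chat", "G", "PT", " is", " a", " language", " model"}

def tokenize(text, vocab):
    """At each position, take the longest vocabulary entry that matches;
    fall back to a single character when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest match first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:                               # unknown text: one character
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("ChatGPT is a language model", VOCAB))
# ['Chat', 'G', 'PT', ' is', ' a', ' language', ' model']
```

Note that the model downstream only ever sees the token IDs — it has no view of the letters inside 'Chat', which is exactly why character-level tasks like spelling and counting are hard for it.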

The model you interact with today is the result of a multi-stage training process that took OpenAI years and cost hundreds of millions of dollars. You're using the compressed output of that process.

1. Pre-training

The base model is trained on a massive dataset — hundreds of billions of tokens from Common Crawl (web scrapes), books, Wikipedia, GitHub code, and more. The task: given the tokens so far, predict the next one. Do this trillions of times, adjust the model weights each time you get it wrong, and you end up with a model that has internalised patterns in language, facts, reasoning structures, and code. This is where ChatGPT's knowledge comes from.
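The objective above can be illustrated with a drastically simplified stand-in: counting which token follows which in a toy corpus, then predicting the most frequent follower. A real model learns billions of transformer weights by gradient descent rather than keeping counts, but the task is the same one sketched here:

```python
from collections import Counter, defaultdict

# Toy next-token predictor: tally which token follows which in a tiny
# corpus. Pre-training optimises the same objective - predict the next
# token - but over hundreds of billions of tokens, with a neural network
# in place of this frequency table.
corpus = "the cat sat on the mat the cat ate the fish".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(token):
    """Return the continuation seen most often after this token."""
    return follows[token].most_common(1)[0][0]

print(predict_next("the"))  # 'cat' - seen twice, vs 'mat'/'fish' once each
```

The interesting part is what falls out of doing this at scale: to keep getting the next token right across all of Wikipedia, GitHub, and millions of books, the model is forced to internalise grammar, facts, and reasoning patterns.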

2. Supervised fine-tuning (SFT)

OpenAI took the pre-trained model and trained it further on human-written example conversations — specifically crafted by contractors to demonstrate helpful, accurate, appropriately formatted responses. This is what makes ChatGPT feel like a helpful assistant rather than a text completion engine. The model learns what "good answers look like."

3. RLHF — Reinforcement Learning from Human Feedback

Human trainers ranked multiple model responses from best to worst. A separate "reward model" was trained on these rankings — learning to predict which responses humans prefer. The main model was then fine-tuned using reinforcement learning to produce responses that score highly on the reward model. This is why ChatGPT sounds helpful, not just statistically likely [RLHF].
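The reward-model step can be sketched as the pairwise loss described in the InstructGPT paper [RLHF]: the reward model is penalised whenever it scores a human-rejected response above the human-chosen one. The scores below are made-up numbers standing in for a neural reward model's outputs:

```python
import math

# Pairwise preference loss: -log sigmoid(chosen - rejected).
# Low when the reward model already ranks the human-preferred response
# higher; high (so a big training signal) when it ranks them backwards.
def preference_loss(score_chosen, score_rejected):
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Reward model agrees with the human ranking: small loss.
print(round(preference_loss(2.0, -1.0), 3))  # 0.049
# Reward model disagrees: large loss, pushing the scores to flip.
print(round(preference_loss(-1.0, 2.0), 3))  # 3.049
```

Once trained this way, the reward model becomes an automated stand-in for the human raters, which is what lets the main model be fine-tuned against it at scale.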

4. Safety training and red-teaming

Before release, the model is tested by adversarial testers (red team) trying to get it to produce harmful outputs. Additional training is applied to reduce those outputs. This is an ongoing process — every major ChatGPT update includes safety improvements. It's also why ChatGPT refuses some requests: those refusals were trained in, not hard-coded.

There are four structural limitations that follow directly from how ChatGPT is built. Understanding them makes you a significantly better user.

Knowledge cutoff. ChatGPT's training data has a cutoff date — events after that date didn't exist in the training set, so the model can't know about them. GPT-4o's training data cuts off in early 2024. When you ask about "the latest" anything, you may be getting outdated information. Always verify time-sensitive facts.

Context window limits. ChatGPT can only "see" the tokens in its current context window. GPT-4o has a 128,000-token window — about 96,000 words. Long conversations eventually push earlier content out of the window, and the model effectively "forgets" it. This isn't a memory system; it's a sliding window over tokens.
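The sliding-window effect can be sketched in a few lines. The window size and "conversation" below are illustrative toys — GPT-4o's real window is 128,000 tokens, not 8:

```python
# Why long chats "forget": the model only sees the most recent tokens
# that fit in its context window. Everything earlier simply isn't there.
CONTEXT_WINDOW = 8  # tokens; tiny on purpose, for illustration

conversation = ["my", "name", "is", "Ada", "and", "I", "like",
                "graphs", "what", "is", "my", "name"]

visible = conversation[-CONTEXT_WINDOW:]  # oldest tokens fall off the front
print(visible)
# ['and', 'I', 'like', 'graphs', 'what', 'is', 'my', 'name']
# 'Ada' has slid out of the window, so the model can no longer see it.
```

This is why pasting a key detail back into a long conversation "reminds" the model: you are putting the tokens back inside the window, not jogging a memory.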

Hallucinations. Because ChatGPT generates tokens based on statistical patterns rather than verified knowledge retrieval, it can produce confident-sounding text that is factually wrong. Fake citations are a common example — the model produces a plausible-looking citation format because that's what citations look like in training data, even if the specific paper doesn't exist [HALL].

No persistent memory by default. Each ChatGPT conversation starts fresh (unless you've enabled the Memory feature in ChatGPT settings). The model has no access to previous conversations unless you paste them in. It doesn't "know you" — it knows the content of the current context window.

ChatGPT, Claude, and Gemini are all transformer-based language models. But training choices, safety approaches, and architecture decisions produce meaningfully different capabilities and personalities.

Capability             | ChatGPT (GPT-4o)           | Claude (3.7 Sonnet)                 | Gemini (2.0 Pro)
Coding                 | Strong                     | Very strong                         | Good
Long document analysis | Good                       | Excellent (200K ctx)                | Strong (2M ctx)
Image understanding    | Strong                     | Good                                | Very strong (native)
Factual accuracy       | Good (some hallucinations) | Very good (lower hallucination rate)| Good (web-grounded)
Following instructions | Very strong                | Very strong                         | Good
Free tier quality      | GPT-4o mini (limited)      | Claude Haiku (limited)              | Gemini Flash (generous)

The differences aren't just performance — they're philosophical. Anthropic built Claude around Constitutional AI, a training approach that gives the model explicit principles to reason about rather than just human preference rankings. This tends to produce a model that's more willing to say "I'm not sure" and less likely to hallucinate confidently. OpenAI's RLHF approach optimises heavily for helpfulness and user satisfaction scores, which produces a model that's often more immediate and versatile. Compare all major LLMs with full benchmark data →

Does ChatGPT remember our conversation?

Within a single conversation, yes — everything you've said is in the context window. Across conversations, only if you've enabled the Memory feature in ChatGPT settings. By default, each new conversation starts fresh. The model doesn't "know you" from previous sessions unless you've explicitly enabled persistent memory.

Can ChatGPT be wrong?

Yes, and this is important to understand. ChatGPT generates text that is statistically likely given its training, not text that it has verified against ground truth. It can state incorrect facts confidently. It can invent citations that don't exist. It can produce subtly wrong code that looks correct. Always verify any factual claim, especially in high-stakes contexts, and don't trust citations without independently confirming they exist.

How is ChatGPT different from a search engine?

A search engine retrieves existing documents and ranks them. ChatGPT generates new text based on patterns learned from documents. A search engine shows you what others have written. ChatGPT synthesises something new — which is more useful for many tasks (summarising, drafting, explaining) but also means you can't trace its output to a source the way you can with search results.

Is ChatGPT getting smarter over time?

The model itself isn't learning from your conversations in real time. But OpenAI releases new model versions regularly (GPT-3.5 → GPT-4 → GPT-4o → GPT-4o with reasoning capabilities) that represent genuinely new trained models. Each major release is a distinct model with a different training run. So the product gets smarter over time — but not because of conversations you're having today.

Sources
[OAI] OpenAI — GPT-4 Technical Report, 2023. openai.com/research/gpt-4
[RLHF] OpenAI — Training language models to follow instructions with human feedback, 2022. arxiv.org/abs/2203.02155
[SEMI] SemiAnalysis — GPT-4 Architecture, Infrastructure, Inference, 2023. semianalysis.com
[HALL] Maynez et al. — On Faithfulness and Factuality in Abstractive Summarization, 2020. arxiv.org/abs/2005.00661
ChatGPT is a prediction machine trained to sound maximally helpful.

That's not a criticism. It's an extraordinarily useful tool when you understand what it is. The users who get the most out of ChatGPT are the ones who treat it as a capable collaborator with specific known failure modes: knowledge cutoffs, hallucinations on obscure facts, and no persistent memory.

These models are improving rapidly. GPT-4o's capabilities in 2026 would have seemed impossible in 2022. As the models improve, the limitations above shrink. But they don't disappear, and understanding the underlying mechanism is the best way to use any of them well, regardless of which model you're using.


Veltrix Collective · Published April 2026. Sources: OpenAI, SemiAnalysis, arXiv. Benchmark comparisons reflect publicly available evaluations as of Q1 2026. Model capabilities change with each release — check veltrixcollective.com/compare for current rankings.

Written by Luke Madden, founder of Veltrix Collective. Data synthesis and analysis by Vel.