March 16, 2026

How Does ChatGPT Work? The Non-Technical Explanation (With Benchmarks)

Tokens, training, RLHF, and why ChatGPT sounds helpful rather than just statistically likely. Plus how it compares to Claude and Gemini.

ChatGPT is a large language model trained to predict the most useful next token given everything you've written. That single sentence explains almost everything about how it works — and why it sometimes fails.

ChatGPT is built on GPT-4o (as of 2026), a transformer-based language model developed by OpenAI [OAI]. It doesn't "think" in the way humans do. It reads your input, converts it into a numerical representation, and generates a response token by token — where each token is roughly 3/4 of a word. The model has learned patterns from an enormous corpus of text: books, websites, code, academic papers, and much more. That learning gives it the appearance of knowledge and reasoning. And in many domains, it performs remarkably well.

But "prediction" is not the same as "understanding." This distinction explains both ChatGPT's impressive capabilities and its specific failure modes.

1.8T: estimated parameters in GPT-4, according to leaked reports [SEMI]

128K: token context window in GPT-4o, roughly 96,000 words of simultaneous context [OAI]

122M: daily active users on ChatGPT as of February 2025 [OAI]

ChatGPT doesn't read words. It reads tokens. This isn't a trivial distinction — it explains everything from why it occasionally misspells words to why it struggles with some maths problems.

How "ChatGPT is a language model" becomes tokens
Chat G PT is a language model

Each coloured block is one token. Words don't split cleanly — "ChatGPT" becomes 3 tokens. Common words like "is" and "a" become single tokens. Numbers, code, and unusual words often become several. ChatGPT processes roughly 75 words per 100 tokens — meaning a 1,000-word article is about 1,333 tokens.

Why does this matter? Because the model has no concept of letters or spelling at the token level. It operates on token sequences, not characters. When it "gets spelling wrong," it's actually producing a token pattern that's statistically reasonable but character-level incorrect. This also explains why ChatGPT can sometimes fail at simple counting tasks: it doesn't perceive words, it perceives tokens — and counting requires seeing individual units it wasn't designed to see.
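The splitting above can be sketched with a toy greedy tokenizer. The vocabulary here is a handful of made-up entries for illustration only — it is not OpenAI's real merge table (GPT-4o's actual tokenizer, o200k_base, uses byte-pair encoding over roughly 200,000 entries):

```python
# Toy tokenizer: greedy longest-match against a tiny, invented vocabulary.
# Real BPE merges byte pairs instead, but the visible effect is the same:
# text fragments into sub-word units, not letters.
VOCAB = {"Chat", "G", "PT", " is", " a", " language", " model"}

def tokenize(text, vocab):
    """At each position, take the longest vocabulary entry that matches;
    fall back to a single character when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):   # try the longest match first
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:                               # unknown text: one character
            tokens.append(text[i])
            i += 1
    return tokens

print(tokenize("ChatGPT is a language model", VOCAB))
# ['Chat', 'G', 'PT', ' is', ' a', ' language', ' model']
```

Note that the model downstream only ever sees the token IDs — it has no view of the letters inside 'Chat', which is exactly why character-level tasks like spelling and counting are hard for it.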

The model you interact with today is the result of a multi-stage training process that took OpenAI years and cost hundreds of millions of dollars. You're using the compressed output of that process.

1. Pre-training

The base model is trained on a massive dataset — hundreds of billions of tokens from Common Crawl (web scrapes), books, Wikipedia, GitHub code, and more. The task: given the tokens so far, predict the next one. Do this trillions of times, adjust the model weights each time you get it wrong, and you end up with a model that has internalised patterns in language, facts, reasoning structures, and code. This is where ChatGPT's knowledge comes from.
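The objective above can be illustrated with a drastically simplified stand-in: counting which token follows which in a toy corpus, then predicting the most frequent follower. A real model learns billions of transformer weights by gradient descent rather than keeping counts, but the task is the same one sketched here:

```python
from collections import Counter, defaultdict

# Toy next-token predictor: tally which token follows which in a tiny
# corpus. Pre-training optimises the same objective - predict the next
# token - but over hundreds of billions of tokens, with a neural network
# in place of this frequency table.
corpus = "the cat sat on the mat the cat ate the fish".split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

def predict_next(token):
    """Return the continuation seen most often after this token."""
    return follows[token].most_common(1)[0][0]

print(predict_next("the"))  # 'cat' - seen twice, vs 'mat'/'fish' once each
```

The interesting part is what falls out of doing this at scale: to keep getting the next token right across all of Wikipedia, GitHub, and millions of books, the model is forced to internalise grammar, facts, and reasoning patterns.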

2. Supervised fine-tuning (SFT)

OpenAI took the pre-trained model and trained it further on human-written example conversations — specifically crafted by contractors to demonstrate helpful, accurate, appropriately formatted responses. This is what makes ChatGPT feel like a helpful assistant rather than a text completion engine. The model learns what "good answers look like."

3. RLHF — Reinforcement Learning from Human Feedback

Human trainers ranked multiple model responses from best to worst. A separate "reward model" was trained on these rankings — learning to predict which responses humans prefer. The main model was then fine-tuned using reinforcement learning to produce responses that score highly on the reward model. This is why ChatGPT sounds helpful, not just statistically likely [RLHF].
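The reward-model step can be sketched as the pairwise loss described in the InstructGPT paper [RLHF]: the reward model is penalised whenever it scores a human-rejected response above the human-chosen one. The scores below are made-up numbers standing in for a neural reward model's outputs:

```python
import math

# Pairwise preference loss: -log sigmoid(chosen - rejected).
# Low when the reward model already ranks the human-preferred response
# higher; high (so a big training signal) when it ranks them backwards.
def preference_loss(score_chosen, score_rejected):
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Reward model agrees with the human ranking: small loss.
print(round(preference_loss(2.0, -1.0), 3))  # 0.049
# Reward model disagrees: large loss, pushing the scores to flip.
print(round(preference_loss(-1.0, 2.0), 3))  # 3.049
```

Once trained this way, the reward model becomes an automated stand-in for the human raters, which is what lets the main model be fine-tuned against it at scale.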

4. Safety training and red-teaming

Before release, the model is tested by adversarial testers (red team) trying to get it to produce harmful outputs. Additional training is applied to reduce those outputs. This is an ongoing process — every major ChatGPT update includes safety improvements. It's also why ChatGPT refuses some requests: those refusals were trained in, not hard-coded.

There are four structural limitations that follow directly from how ChatGPT is built. Understanding them makes you a significantly better user.

Knowledge cutoff. ChatGPT's training data has a cutoff date — events after that date didn't exist in the training set, so the model can't know about them. GPT-4o's training data cuts off in early 2024. When you ask about "the latest" anything, you may be getting outdated information. Always verify time-sensitive facts.

Context window limits. ChatGPT can only "see" the tokens in its current context window. GPT-4o has a 128,000-token window — about 96,000 words. Long conversations eventually push earlier content out of the window, and the model effectively "forgets" it. This isn't a memory system; it's a sliding window over tokens.
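The sliding-window effect can be sketched in a few lines. The window size and "conversation" below are illustrative toys — GPT-4o's real window is 128,000 tokens, not 8:

```python
# Why long chats "forget": the model only sees the most recent tokens
# that fit in its context window. Everything earlier simply isn't there.
CONTEXT_WINDOW = 8  # tokens; tiny on purpose, for illustration

conversation = ["my", "name", "is", "Ada", "and", "I", "like",
                "graphs", "what", "is", "my", "name"]

visible = conversation[-CONTEXT_WINDOW:]  # oldest tokens fall off the front
print(visible)
# ['and', 'I', 'like', 'graphs', 'what', 'is', 'my', 'name']
# 'Ada' has slid out of the window, so the model can no longer see it.
```

This is why pasting a key detail back into a long conversation "reminds" the model: you are putting the tokens back inside the window, not jogging a memory.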

Hallucinations. Because ChatGPT generates tokens based on statistical patterns rather than verified knowledge retrieval, it can produce confident-sounding text that is factually wrong. Fake citations are a common example — the model produces a plausible-looking citation format because that's what citations look like in training data, even if the specific paper doesn't exist [HALL].

No persistent memory by default. Each ChatGPT conversation starts fresh (unless you've enabled the Memory feature in ChatGPT settings). The model has no access to previous conversations unless you paste them in. It doesn't "know you" — it knows the content of the current context window.

ChatGPT, Claude, and Gemini are all transformer-based language models. But training choices, safety approaches, and architecture decisions produce meaningfully different capabilities and personalities.

Capability             | ChatGPT (GPT-4o)           | Claude (3.7 Sonnet)                 | Gemini (2.0 Pro)
Coding                 | Strong                     | Very strong                         | Good
Long document analysis | Good                       | Excellent (200K ctx)                | Strong (2M ctx)
Image understanding    | Strong                     | Good                                | Very strong (native)
Factual accuracy       | Good (some hallucinations) | Very good (lower hallucination rate)| Good (web-grounded)
Following instructions | Very strong                | Very strong                         | Good
Free tier quality      | GPT-4o mini (limited)      | Claude Haiku (limited)              | Gemini Flash (generous)

The differences aren't just performance — they're philosophical. Anthropic built Claude around Constitutional AI, a training approach that gives the model explicit principles to reason about rather than just human preference rankings. This tends to produce a model that's more willing to say "I'm not sure" and less likely to hallucinate confidently. OpenAI's RLHF approach optimises heavily for helpfulness and user satisfaction scores, which produces a model that's often more immediate and versatile. Compare all major LLMs with full benchmark data →

Does ChatGPT remember our conversation?

Within a single conversation, yes — everything you've said is in the context window. Across conversations, only if you've enabled the Memory feature in ChatGPT settings. By default, each new conversation starts fresh. The model doesn't "know you" from previous sessions unless you've explicitly enabled persistent memory.

Can ChatGPT be wrong?

Yes, and this is important to understand. ChatGPT generates text that is statistically likely given its training, not text that it has verified against ground truth. It can state incorrect facts confidently. It can invent citations that don't exist. It can produce subtly wrong code that looks correct. Always verify any factual claim, especially in high-stakes contexts, and don't trust citations without independently confirming they exist.

How is ChatGPT different from a search engine?

A search engine retrieves existing documents and ranks them. ChatGPT generates new text based on patterns learned from documents. A search engine shows you what others have written. ChatGPT synthesises something new — which is more useful for many tasks (summarising, drafting, explaining) but also means you can't trace its output to a source the way you can with search results.

Is ChatGPT getting smarter over time?

The model itself isn't learning from your conversations in real time. But OpenAI releases new model versions regularly (GPT-3.5 → GPT-4 → GPT-4o → GPT-4o with reasoning capabilities) that represent genuinely new trained models. Each major release is a distinct model with a different training run. So the product gets smarter over time — but not because of conversations you're having today.

Sources
[OAI] OpenAI — GPT-4 Technical Report, 2023. openai.com/research/gpt-4
[RLHF] OpenAI — Training language models to follow instructions with human feedback, 2022. arxiv.org/abs/2203.02155
[SEMI] SemiAnalysis — GPT-4 Architecture, Infrastructure, Inference, 2023. semianalysis.com
[HALL] Maynez et al. — On Faithfulness and Factuality in Abstractive Summarization, 2020. arxiv.org/abs/2005.00661
ChatGPT is a prediction machine trained to sound maximally helpful.

That's not a criticism. It's an extraordinarily useful tool when you understand what it is. The users who get the most out of ChatGPT are the ones who treat it as a capable collaborator with specific known failure modes: knowledge cutoffs, hallucinations on obscure facts, and no persistent memory.

These models are improving rapidly. GPT-4o's capabilities in 2026 would have seemed impossible in 2022. As the models improve, the limitations above shrink. But they don't disappear, and understanding the underlying mechanism is the best way to use any of them well, regardless of which model you're using.


Veltrix Collective · Published April 2026. Sources: OpenAI, SemiAnalysis, arXiv. Benchmark comparisons reflect publicly available evaluations as of Q1 2026. Model capabilities change with each release — check veltrixcollective.com/compare for current rankings.

Written by Luke Madden, founder of Veltrix Collective. Data synthesis and analysis by Vel.