The scale of it
1.8T: estimated parameters in GPT-4, each one a tiny numerical value learned during training [Semianalysis]
45TB: approximate training data for GPT-4, roughly equivalent to 45 million novels' worth of text [OpenAI]
$100M+: estimated compute cost to train a frontier model in 2025, rising to $1B+ for next-generation systems [Epoch AI]
An LLM is, at its core, a very large function that takes text as input and produces the most statistically likely continuation of that text. The "large" in large language model isn't marketing — it refers to the sheer number of learned parameters, which encode the statistical patterns of human language at a scale that produces genuinely impressive outputs.
But calling it a "text predictor" undersells what's actually happening. To predict text well, a model must implicitly learn grammar, facts, reasoning patterns, cultural context, and something that looks surprisingly like understanding. Whether that constitutes genuine intelligence is a philosophical debate. Whether it's practically useful is not.
Step 1: Tokenisation
LLMs don't read words. They read tokens — chunks of text that can be whole words, parts of words, or even single characters. This is why they sometimes struggle with spelling and character counting.
Example: How "The transformer architecture changed AI" gets tokenised
[The] [transform] [er] [architecture] [changed] [AI]
6 tokens for 5 words. "transformer" splits into "transform" + "er" — which is why asking an LLM how many letters are in "transformer" can produce wrong answers. It never sees individual letters, only token chunks.
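The splitting behaviour above can be sketched with a toy greedy longest-match tokeniser. The vocabulary here is hand-picked for the example and the matching rule is simplified; real tokenisers such as GPT-4's byte-pair encoder learn their ~100,000-entry vocabulary from data.

```python
# Toy subword tokeniser: greedy longest-match against a tiny, hand-picked
# vocabulary. Illustrative only -- real BPE vocabularies are learned.
TOY_VOCAB = {"the", "transform", "er", "architecture", "changed", "ai"}

def tokenise(text: str) -> list[str]:
    """Split each word into the longest matching vocabulary chunks."""
    tokens = []
    for word in text.lower().split():
        i = 0
        while i < len(word):
            # Try the longest possible chunk first, then shorter ones.
            for j in range(len(word), i, -1):
                if word[i:j] in TOY_VOCAB:
                    tokens.append(word[i:j])
                    i = j
                    break
            else:
                tokens.append(word[i])  # unknown character: emit on its own
                i += 1
    return tokens

print(tokenise("The transformer architecture changed AI"))
# -> ['the', 'transform', 'er', 'architecture', 'changed', 'ai']
```

Note how "transformer" comes out as two chunks: the model never sees its individual letters, which is exactly why letter-counting questions trip it up.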
GPT-4 uses a vocabulary of around 100,000 tokens. Each token gets converted into a high-dimensional numerical vector — a list of numbers that represents the token's position in semantic space. Words that mean similar things cluster together in this space, which is why the model can reason about semantic relationships.
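"Clustering in semantic space" is usually measured with cosine similarity between embedding vectors. The 3-dimensional vectors below are invented for illustration; real embeddings are learned and have thousands of dimensions.

```python
import math

# Hand-picked 3-d "embeddings" for illustration only -- real models learn
# high-dimensional vectors during training.
embeddings = {
    "river":   [0.90, 0.10, 0.00],
    "stream":  [0.85, 0.15, 0.05],
    "finance": [0.05, 0.90, 0.20],
}

def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# "river" sits much closer to "stream" than to "finance" in this toy space.
print(cosine(embeddings["river"], embeddings["stream"]))
print(cosine(embeddings["river"], embeddings["finance"]))
```

The model exploits this geometry: tokens with similar meanings end up near each other, so operations on vectors become operations on meaning.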
Step 2: The transformer
The transformer architecture, introduced by Google in the 2017 paper "Attention Is All You Need", is the foundation of every major LLM. Its key innovation: attention.
Attention: the model looks back at all previous tokens to decide what matters for the next prediction
When predicting what comes after "muddy" in a sentence like "the steep, muddy bank", the model attends most strongly to "bank" and "steep", using context to infer this is a riverbank, not a financial institution. This is what makes transformers context-aware rather than just pattern-repeating.
A transformer stacks multiple attention "heads" — each one learning to attend to different relationship types simultaneously. One head might track pronouns, another might track subject-verb relationships. GPT-4 uses 120+ layers of stacked attention blocks, each one refining the representation of the input.
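The mechanism behind a single head can be written down in a few lines: scaled dot-product attention scores the query against every key, softmaxes the scores into weights, and returns a weighted sum of the values. The 2-d vectors below are made up for the sketch; each real head applies its own learned projections before this computation.

```python
import math

def softmax(xs):
    """Numerically stable softmax: turns scores into weights summing to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query over previous tokens."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    # Weighted sum of value vectors: the context-aware representation.
    out = [sum(w * v[i] for w, v in zip(weights, values))
           for i in range(len(values[0]))]
    return out, weights

# Toy 2-d vectors standing in for two previous tokens, e.g. "the" and "bank".
keys   = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
out, weights = attention([0.1, 2.0], keys, values)
print(weights)  # the second token gets most of the attention weight
```

A multi-head layer simply runs many copies of this, each with different learned query/key/value projections, and concatenates the results.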
Step 3: Training
Training an LLM happens in three phases. Each phase shapes a different aspect of the model's behaviour.
Pre-training — learning language
The model trains on billions of web pages, books, code, and academic papers. The task: predict the next token. Simple objective, staggering scale. Training GPT-4 reportedly used 25,000 A100 GPUs for 90-100 days. The model learns grammar, facts, reasoning, and coding patterns — all as a side effect of getting good at next-token prediction.
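The next-token objective can be made concrete with a count-based bigram model, the simplest possible next-token predictor. The corpus and probabilities below are toy stand-ins; pre-training does the same thing with a neural network over trillions of tokens.

```python
import math

# Toy corpus; real pre-training data is billions of documents.
corpus = "the cat sat on the mat the cat ran".split()

# Count how often each token follows each other token.
counts = {}
for prev, nxt in zip(corpus, corpus[1:]):
    counts.setdefault(prev, {})
    counts[prev][nxt] = counts[prev].get(nxt, 0) + 1

def next_token_probs(prev):
    """Predicted distribution over the next token, given the previous one."""
    total = sum(counts[prev].values())
    return {tok: c / total for tok, c in counts[prev].items()}

probs = next_token_probs("the")
print(probs)  # "cat" is the most likely continuation of "the" here

# Training minimises cross-entropy: the negative log-probability
# the model assigned to the token that actually came next.
loss = -math.log(probs["cat"])
```

Getting this loss low at scale is the "simple objective, staggering scale" the text describes; everything else the model learns is a side effect.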
Supervised fine-tuning — learning to follow instructions
Human labellers write example conversations and ideal responses. The model trains on these examples, learning the format of helpful dialogue. Without this phase, the model would complete your sentence — not answer your question.
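The shape of an SFT example can be sketched as a prompt/response pair turned into one training sequence, where the loss is computed only on the response tokens. The field names and whitespace tokenisation here are invented for the sketch.

```python
# Illustrative SFT example: field names are made up for this sketch.
example = {
    "prompt": "What is the capital of France ?",
    "response": "The capital of France is Paris .",
}

def build_training_tokens(ex):
    """Concatenate prompt and response; mask the prompt out of the loss."""
    prompt_toks = ex["prompt"].split()      # stand-in for real tokenisation
    response_toks = ex["response"].split()
    tokens = prompt_toks + response_toks
    # 0 = ignored by the loss (prompt), 1 = trained on (response).
    loss_mask = [0] * len(prompt_toks) + [1] * len(response_toks)
    return tokens, loss_mask

tokens, mask = build_training_tokens(example)
print(list(zip(tokens, mask)))
```

Masking is the key design choice: the model learns to *produce* answers, not to parrot questions, which is how completion behaviour becomes answering behaviour.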
RLHF — learning what humans prefer
Reinforcement Learning from Human Feedback. Human raters compare model outputs and rank them. A reward model learns what humans prefer. The LLM is then optimised to maximise this reward signal. This is what makes ChatGPT give helpful, non-toxic, well-structured responses instead of technically-correct-but-useless ones. It's also what encodes the model's values and safety behaviours.
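The reward model at the heart of this process is typically trained with a Bradley-Terry style pairwise loss, as in the InstructGPT paper [InstructGPT]: the probability that a rater prefers output A over output B depends only on the difference of their scalar rewards. The reward values below are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def prob_a_preferred(reward_a, reward_b):
    """Bradley-Terry preference probability from two scalar rewards."""
    return sigmoid(reward_a - reward_b)

# Made-up rewards: the reward model scores output A well above output B,
# so it predicts a rater would very likely prefer A.
print(prob_a_preferred(2.0, 0.5))   # clearly above 0.5

# Equal rewards mean the model is indifferent: probability exactly 0.5.
print(prob_a_preferred(1.0, 1.0))
```

Training pushes the reward model to assign higher scores to the outputs humans ranked higher; the LLM is then tuned to maximise that learned reward.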
What this means practically
RLHF is why different models feel different even with similar architectures. OpenAI, Anthropic, and Google each have different rater pools, different preference guidelines, and different safety thresholds. The same base architecture produces very different products depending on the RLHF process applied.
Model scale
| Model | Parameters | Context window | Architecture |
| --- | --- | --- | --- |
| GPT-3 | 175B | 4K tokens | Dense transformer |
| GPT-4 | ~1.8T (MoE) | 128K tokens | Mixture-of-experts |
| Claude 3.5 Sonnet | Undisclosed | 200K tokens | Transformer + constitutional AI |
| Gemini 1.5 Pro | Undisclosed | 1M tokens | Multimodal transformer |
| LLaMA 3.1 405B | 405B | 128K tokens | Dense transformer (open weights) |
| DeepSeek-V3 | 671B (MoE) | 128K tokens | Mixture-of-experts |
More parameters doesn't automatically mean better performance — DeepSeek-V3 matches GPT-4 class performance at a fraction of the training cost through architectural efficiency. For a live comparison of current models, see the Veltrix LLM directory.
FAQ
Does an LLM "understand" what it's saying?
Depends on your definition of understanding. LLMs demonstrably model context, make inferences, resolve ambiguity, and apply abstract reasoning. Whether that constitutes genuine understanding in a philosophical sense is hotly debated. Practically: they understand enough to be useful, but not reliably enough for critical decisions without human review.
Why do LLMs make up facts?
Because they're trained to produce the most statistically likely continuation of text — not to retrieve verified facts. When a model "knows" that academic papers include citations, it will produce plausible-looking citations even when no real reference exists. This is hallucination: the model completing a pattern, not lying intentionally.
What's the difference between an LLM and a chatbot?
An LLM is the underlying model. A chatbot is a product built on top of an LLM with a conversation interface, safety filters, and sometimes additional tools. ChatGPT is a chatbot product. GPT-4o is the underlying LLM. You can access LLMs directly via API or through various products that each apply different guardrails and interfaces.
Sources
[Semianalysis] Semianalysis — "GPT-4 Architecture, Infrastructure, Training Dataset, Costs, Vision, MoE" (2023)
[Vaswani] Vaswani et al. — "Attention Is All You Need", Google Brain (2017)
[Epoch AI] Epoch AI — "Trends in Machine Learning Hardware" (2025)
[InstructGPT] Ouyang et al. — "Training language models to follow instructions with human feedback" (OpenAI, 2022)