Veltrix
March 18, 2026

What Is a Large Language Model (LLM)? How They Work and How They're Ranked

LLMs defined — scale, training data, pre-training and RLHF. How current leading models compare and how Veltrix ranks them.

A large language model is a deep learning system trained on vast quantities of text to predict and generate language. The "large" refers to scale — billions or trillions of parameters — that turns out to be the difference between unremarkable text completion and genuinely useful AI.

Scale matters in LLMs in a way that surprised researchers. When you increase the size of language models past certain thresholds, they don't just get marginally better — they develop qualitatively new capabilities that smaller models completely lack. The ability to reason through multi-step problems, write working code, and engage in nuanced analysis all emerge only at sufficient scale. [EMRG]

This "emergent capabilities" phenomenon is both fascinating and unsettling: researchers can't fully predict what a model will be able to do until they've built and trained it. GPT-3 couldn't reliably count objects in a sentence. GPT-4 passes professional exams. The difference is mostly scale, training data quality, and training methodology — not a fundamentally different architecture.

Parameter count — leading models
- GPT-3 (2020), OpenAI baseline: 175B
- LLaMA 3.1 405B (2024), Meta open source: 405B
- GPT-4 (2023), OpenAI, leaked estimate: ~1.8T
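A useful way to make these parameter counts concrete is a back-of-envelope formula for a GPT-style decoder-only transformer. The sketch below is an approximation, not any lab's published accounting: it assumes the standard ~12·d² parameters per layer (4·d² for attention projections, 8·d² for a 4× MLP) and ignores biases, layer norms, and positional embeddings, which are comparatively tiny. The GPT-3 config used in the example (96 layers, d_model 12,288, ~50k vocabulary) is from the published GPT-3 paper.

```python
def approx_transformer_params(n_layers, d_model, vocab_size, tied_embeddings=True):
    """Rough parameter count for a GPT-style decoder-only transformer.

    Per layer: ~4*d^2 for attention (Q, K, V, output projections)
             + ~8*d^2 for a 4x-width MLP  = 12*d^2.
    Embedding: vocab*d (doubled if the unembedding matrix is untied).
    Ignores biases, layer norms, and positional embeddings (all small).
    """
    per_layer = 12 * d_model ** 2
    embed = vocab_size * d_model * (1 if tied_embeddings else 2)
    return n_layers * per_layer + embed

# GPT-3-like config: 96 layers, d_model 12288, ~50k vocabulary.
print(approx_transformer_params(96, 12288, 50257) / 1e9)  # ~174.6 — close to the reported 175B
```

The estimate lands within about half a percent of GPT-3's reported 175B, which is why this formula is a common sanity check when reading model specs.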

One thing worth clarifying: more parameters doesn't automatically mean better. Efficiency in training, quality of training data, and instruction tuning matter as much as raw size. Google's Gemini Flash is smaller than GPT-4 but competitive on many benchmarks. DeepSeek V3 achieved GPT-4-level performance at a fraction of the training cost, demonstrating that architectural efficiency improvements can partially substitute for raw scale. [DSK]

Building a useful LLM takes three distinct phases. The raw capability comes from phase 1. The usefulness comes from phases 2 and 3.

1. Pre-training: learning language at scale

The model is trained on hundreds of billions or trillions of tokens from internet text, books, code repositories, academic papers, and more. The task is simple: given a sequence of tokens, predict the next one. Repeat this billions of times, adjust the model's weights each time it's wrong, and you end up with a model that has internalised enormous amounts of information about language, facts, reasoning patterns, and the structure of knowledge. This is the expensive part — training GPT-4 reportedly cost over $100 million. [COST]
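The "predict the next token" objective can be shown with the simplest possible model: a count-based bigram table. This is a toy illustration, not how LLMs are implemented — a real LLM replaces the count table with a neural network over long contexts — but the training signal is the same: observe which token follows which, and predict accordingly.

```python
from collections import Counter, defaultdict

def train_bigram(corpus_tokens):
    """Count-based next-token model: the simplest instance of the
    'given a sequence, predict the next token' objective. Real LLMs
    swap the count table for a neural network, but the goal is identical."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, token):
    """Return the most frequently observed token after `token`."""
    if token not in counts:
        return None
    return counts[token].most_common(1)[0][0]

tokens = "the cat sat on the mat and the cat slept".split()
model = train_bigram(tokens)
print(predict_next(model, "the"))  # "cat" — it followed "the" twice, "mat" only once
```

Scale up the context from one token to thousands, and the table from counts to ~10¹² learned weights, and you have the pre-training recipe described above.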

2. Supervised fine-tuning: learning to be helpful

The pre-trained model is a text prediction engine, not a helpful assistant. Fine-tuning on curated examples of good responses transforms it. Human contractors write exemplary question-and-answer pairs, the model is trained on these, and it learns to produce responses that match the style, format, and quality of those examples. This is where the "assistant" personality comes from.
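A common recipe for this stage (a general pattern, not any specific lab's pipeline) is to concatenate prompt and response into one token sequence, then mask the loss so the model is only trained to reproduce the response — it learns to *answer*, not to parrot the question. A minimal sketch:

```python
def make_sft_example(prompt_tokens, response_tokens):
    """Build one supervised fine-tuning example.

    The model sees prompt + response as a single sequence, but the loss
    mask is 0 over the prompt so gradients only flow from response tokens.
    (Hypothetical helper for illustration; real pipelines also add special
    role/formatting tokens.)"""
    tokens = prompt_tokens + response_tokens
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

tokens, mask = make_sft_example(
    ["User:", "What", "is", "an", "LLM?"],
    ["A", "large", "language", "model."],
)
print(mask)  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```

Thousands of such curated examples, written by human contractors, are what turn a raw text predictor into something that responds in the expected assistant format.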

3. RLHF: optimising for what humans prefer

Reinforcement Learning from Human Feedback is the final polish. Human raters compare multiple model outputs and rank them. A reward model learns to predict human preference scores. The main LLM is then fine-tuned using reinforcement learning to produce outputs that score highly. This is why ChatGPT sounds like it's trying to be helpful rather than just generating plausible text. It was trained to maximise a human-preference signal — which has its own failure modes (e.g., models that sound confident even when they shouldn't be). [RLHF]
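The reward-model step has a simple mathematical core: for each human-ranked pair, train the reward model to score the chosen output above the rejected one using the pairwise loss −log σ(r_chosen − r_rejected), the Bradley-Terry form used in the InstructGPT line of work [RLHF]. A minimal sketch of that loss:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss for training an RLHF reward model:
    -log(sigmoid(r_chosen - r_rejected)).
    The loss shrinks as the reward model scores the human-preferred
    output further above the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A wide margin in the right direction is cheap; a narrow one costs more.
print(preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0))  # True
```

During the subsequent RL stage, the LLM's outputs are scored by this trained reward model, and the policy is updated to push those scores up — which is exactly the "maximise a human-preference signal" dynamic (and its overconfidence failure mode) described above.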

As of April 2026, these are the models that matter. Rankings are based on publicly available benchmarks and Veltrix's own evaluation methodology.

| Model | Company | Strengths | Best for | Price tier |
| --- | --- | --- | --- | --- |
| Claude 3.7 Sonnet | Anthropic | Coding, long documents, nuanced reasoning, lower hallucination rate | Complex analysis, software engineering, long-context tasks | Mid-premium |
| GPT-4o | OpenAI | Versatility, multimodal (image + text), wide tool ecosystem | General tasks, image analysis, tool-using applications | Mid-premium |
| Gemini 2.0 Pro | Google | Longest context (2M tokens), Google Workspace integration, real-time search | Very long documents, research with real-time data, Google ecosystem | Mid-premium |
| DeepSeek V3 / R1 | DeepSeek | Strong performance at very low cost, open weights available | Cost-sensitive applications, self-hosted deployments | Budget |
| LLaMA 3.1 405B | Meta | Open source, self-hostable, competitive performance | Privacy-sensitive deployments, custom fine-tuning | Open source |
| Mistral Large 2 | Mistral AI | European-built, strong multilingual, efficient | EU compliance requirements, multilingual tasks | Mid-range |

The landscape shifts every few months. For current, up-to-date rankings with benchmark data across reasoning, coding, writing, and instruction-following categories, see veltrixcollective.com/llms.

Are all AI chatbots LLMs?

Most modern AI chatbots are built on LLMs, but not all. Rule-based chatbots (like many older customer service bots) use scripted decision trees and keyword matching — no LLM involved. Voice assistants like Siri and Alexa have historically used a mix of smaller language models and rule-based systems, though they're increasingly adding LLM capabilities. When you're talking to a modern AI assistant (ChatGPT, Claude, Gemini, Copilot), you're interacting with an LLM.

Can LLMs reason?

This is genuinely contested. LLMs can produce outputs that look like reasoning — following logical steps, working through problems, identifying errors in arguments. Whether this constitutes "reasoning" in the cognitive science sense, or sophisticated pattern matching that mimics reasoning, is an active debate. What's clear is that performance on reasoning benchmarks has improved dramatically with scale and with "chain of thought" prompting techniques. Whether the mechanism is true reasoning or not matters less than whether it works — and for many practical tasks, it does.
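The "chain of thought" technique mentioned above is, at its simplest, just a change to the prompt: instead of asking for the answer directly, you ask the model to lay out intermediate steps first, which measurably improves multi-step accuracy [EMRG]. A minimal sketch of the two prompt styles (illustrative strings, not any provider's API):

```python
question = ("A train leaves at 3:15 pm and the journey takes 2 h 50 min. "
            "When does it arrive?")

# Direct prompt: the model must produce the answer in one shot.
direct_prompt = f"{question}\nAnswer:"

# Chain-of-thought prompt: eliciting intermediate steps before the answer
# tends to improve performance on multi-step reasoning problems.
cot_prompt = f"{question}\nLet's think step by step:"

print(cot_prompt)
```

Whether the resulting step-by-step text reflects the model's actual computation or is a post-hoc rationalisation is part of the same open debate — but empirically, the technique works.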

What is the best LLM?

It depends entirely on the task. Claude 3.7 Sonnet consistently leads on coding and long-context tasks. GPT-4o is strongest for versatility and multimodal work. Gemini excels when you need very long context or real-time information. DeepSeek V3 offers GPT-4-level performance at a fraction of the cost. See our full comparison with benchmark scores for every major task type →

Sources
[EMRG] Wei et al. — "Emergent Abilities of Large Language Models" (2022), arxiv.org/abs/2206.07682
[DSK] DeepSeek — "DeepSeek-V3 Technical Report" (2024), arxiv.org/abs/2412.19437
[COST] The Information — "OpenAI Spent $100M Training GPT-4" (2023), theinformation.com
[RLHF] InstructGPT — "Training Language Models to Follow Instructions" (2022), arxiv.org/abs/2203.02155
LLMs are the infrastructure behind nearly every AI tool you use.

Understanding what an LLM is — and isn't — makes you a dramatically better user of AI tools. You'll know why it sometimes gets facts wrong (training data cutoffs, probabilistic generation), when to trust it (well-established facts, reasoning tasks), and when to verify (time-sensitive data, citations, calculations).

The model isn't magic. It's a very large mathematical function that was trained to produce human-preferred text. That description undersells the genuinely impressive capabilities that emerge from it. But it also grounds your expectations correctly — which is the most useful thing any explanation of AI can do.

Compare every major LLM

Which model wins for your use case?

Benchmark scores for every major LLM across reasoning, coding, writing, and instruction-following. Updated with every major model release.

See full LLM rankings →

Veltrix Collective · Sources: Wei et al (2022), DeepSeek (2024), The Information, InstructGPT paper. Published April 2026. Model comparisons reflect publicly available benchmarks as of Q1 2026.

Written by Luke Madden, founder of Veltrix Collective. Data synthesis and analysis by Vel.