Veltrix
March 18, 2026

What Is a Large Language Model (LLM)? How They Work and How They're Ranked

LLMs defined — scale, training data, pre-training and RLHF. How current leading models compare and how Veltrix ranks them.

A large language model is a deep learning system trained on vast quantities of text to predict and generate language. The "large" refers to scale — billions or trillions of parameters — that turns out to be the difference between unremarkable text completion and genuinely useful AI.

Scale matters in LLMs in a way that surprised researchers. When you increase the size of language models past certain thresholds, they don't just get marginally better — they develop qualitatively new capabilities that smaller models completely lack. The ability to reason through multi-step problems, write working code, and engage in nuanced analysis all emerge only at sufficient scale. [EMRG]

This "emergent capabilities" phenomenon is both fascinating and unsettling: researchers can't fully predict what a model will be able to do until they've built and trained it. GPT-3 couldn't reliably count objects in a sentence. GPT-4 passes professional exams. The difference is mostly scale, training data quality, and training methodology — not a fundamentally different architecture.

Parameter count — leading models
- GPT-3 (2020), OpenAI baseline: 175B
- LLaMA 3.1 405B (2024), Meta open source: 405B
- GPT-4 (2023), OpenAI, leaked estimate: ~1.8T
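A useful way to make these parameter counts concrete is a back-of-envelope formula for a GPT-style decoder-only transformer. The sketch below is an approximation, not any lab's published accounting: it assumes the standard ~12·d² parameters per layer (4·d² for attention projections, 8·d² for a 4× MLP) and ignores biases, layer norms, and positional embeddings, which are comparatively tiny. The GPT-3 config used in the example (96 layers, d_model 12,288, ~50k vocabulary) is from the published GPT-3 paper.

```python
def approx_transformer_params(n_layers, d_model, vocab_size, tied_embeddings=True):
    """Rough parameter count for a GPT-style decoder-only transformer.

    Per layer: ~4*d^2 for attention (Q, K, V, output projections)
             + ~8*d^2 for a 4x-width MLP  = 12*d^2.
    Embedding: vocab*d (doubled if the unembedding matrix is untied).
    Ignores biases, layer norms, and positional embeddings (all small).
    """
    per_layer = 12 * d_model ** 2
    embed = vocab_size * d_model * (1 if tied_embeddings else 2)
    return n_layers * per_layer + embed

# GPT-3-like config: 96 layers, d_model 12288, ~50k vocabulary.
print(approx_transformer_params(96, 12288, 50257) / 1e9)  # ~174.6 — close to the reported 175B
```

The estimate lands within about half a percent of GPT-3's reported 175B, which is why this formula is a common sanity check when reading model specs.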

One thing worth clarifying: more parameters doesn't automatically mean better. Efficiency in training, quality of training data, and instruction tuning matter as much as raw size. Google's Gemini Flash is smaller than GPT-4 but competitive on many benchmarks. DeepSeek V3 achieved GPT-4-level performance at a fraction of the training cost, demonstrating that architectural efficiency improvements can partially substitute for raw scale. [DSK]

Building a useful LLM takes three distinct phases. The raw capability comes from phase 1. The usefulness comes from phases 2 and 3.

1. Pre-training: learning language at scale

The model is trained on hundreds of billions or trillions of tokens from internet text, books, code repositories, academic papers, and more. The task is simple: given a sequence of tokens, predict the next one. Repeat this billions of times, adjust the model's weights each time it's wrong, and you end up with a model that has internalised enormous amounts of information about language, facts, reasoning patterns, and the structure of knowledge. This is the expensive part — training GPT-4 reportedly cost over $100 million. [COST]
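The "predict the next token" objective can be shown with the simplest possible model: a count-based bigram table. This is a toy illustration, not how LLMs are implemented — a real LLM replaces the count table with a neural network over long contexts — but the training signal is the same: observe which token follows which, and predict accordingly.

```python
from collections import Counter, defaultdict

def train_bigram(corpus_tokens):
    """Count-based next-token model: the simplest instance of the
    'given a sequence, predict the next token' objective. Real LLMs
    swap the count table for a neural network, but the goal is identical."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(corpus_tokens, corpus_tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, token):
    """Return the most frequently observed token after `token`."""
    if token not in counts:
        return None
    return counts[token].most_common(1)[0][0]

tokens = "the cat sat on the mat and the cat slept".split()
model = train_bigram(tokens)
print(predict_next(model, "the"))  # "cat" — it followed "the" twice, "mat" only once
```

Scale up the context from one token to thousands, and the table from counts to ~10¹² learned weights, and you have the pre-training recipe described above.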

2. Supervised fine-tuning: learning to be helpful

The pre-trained model is a text prediction engine, not a helpful assistant. Fine-tuning on curated examples of good responses transforms it. Human contractors write exemplary question-and-answer pairs, the model is trained on these, and it learns to produce responses that match the style, format, and quality of those examples. This is where the "assistant" personality comes from.
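A common recipe for this stage (a general pattern, not any specific lab's pipeline) is to concatenate prompt and response into one token sequence, then mask the loss so the model is only trained to reproduce the response — it learns to *answer*, not to parrot the question. A minimal sketch:

```python
def make_sft_example(prompt_tokens, response_tokens):
    """Build one supervised fine-tuning example.

    The model sees prompt + response as a single sequence, but the loss
    mask is 0 over the prompt so gradients only flow from response tokens.
    (Hypothetical helper for illustration; real pipelines also add special
    role/formatting tokens.)"""
    tokens = prompt_tokens + response_tokens
    loss_mask = [0] * len(prompt_tokens) + [1] * len(response_tokens)
    return tokens, loss_mask

tokens, mask = make_sft_example(
    ["User:", "What", "is", "an", "LLM?"],
    ["A", "large", "language", "model."],
)
print(mask)  # [0, 0, 0, 0, 0, 1, 1, 1, 1]
```

Thousands of such curated examples, written by human contractors, are what turn a raw text predictor into something that responds in the expected assistant format.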

3. RLHF: optimising for what humans prefer

Reinforcement Learning from Human Feedback is the final polish. Human raters compare multiple model outputs and rank them. A reward model learns to predict human preference scores. The main LLM is then fine-tuned using reinforcement learning to produce outputs that score highly. This is why ChatGPT sounds like it's trying to be helpful rather than just generating plausible text. It was trained to maximise a human-preference signal — which has its own failure modes (e.g., models that sound confident even when they shouldn't be). [RLHF]
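The reward-model step has a simple mathematical core: for each human-ranked pair, train the reward model to score the chosen output above the rejected one using the pairwise loss −log σ(r_chosen − r_rejected), the Bradley-Terry form used in the InstructGPT line of work [RLHF]. A minimal sketch of that loss:

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise preference loss for training an RLHF reward model:
    -log(sigmoid(r_chosen - r_rejected)).
    The loss shrinks as the reward model scores the human-preferred
    output further above the rejected one."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A wide margin in the right direction is cheap; a narrow one costs more.
print(preference_loss(2.0, 0.0) < preference_loss(0.5, 0.0))  # True
```

During the subsequent RL stage, the LLM's outputs are scored by this trained reward model, and the policy is updated to push those scores up — which is exactly the "maximise a human-preference signal" dynamic (and its overconfidence failure mode) described above.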

As of April 2026, these are the models that matter. Rankings are based on publicly available benchmarks and Veltrix's own evaluation methodology.

| Model | Company | Strengths | Best for | Price tier |
| --- | --- | --- | --- | --- |
| Claude 3.7 Sonnet | Anthropic | Coding, long documents, nuanced reasoning, lower hallucination rate | Complex analysis, software engineering, long-context tasks | Mid-premium |
| GPT-4o | OpenAI | Versatility, multimodal (image + text), wide tool ecosystem | General tasks, image analysis, tool-using applications | Mid-premium |
| Gemini 2.0 Pro | Google | Longest context (2M tokens), Google Workspace integration, real-time search | Very long documents, research with real-time data, Google ecosystem | Mid-premium |
| DeepSeek V3 / R1 | DeepSeek | Strong performance at very low cost, open weights available | Cost-sensitive applications, self-hosted deployments | Budget |
| LLaMA 3.1 405B | Meta | Open source, self-hostable, competitive performance | Privacy-sensitive deployments, custom fine-tuning | Open source |
| Mistral Large 2 | Mistral AI | European-built, strong multilingual, efficient | EU compliance requirements, multilingual tasks | Mid-range |

The landscape shifts every few months. For current, up-to-date rankings with benchmark data across reasoning, coding, writing, and instruction-following categories, see veltrixcollective.com/llms.

Are all AI chatbots LLMs?

Most modern AI chatbots are built on LLMs, but not all. Rule-based chatbots (like many older customer service bots) use scripted decision trees and keyword matching — no LLM involved. Voice assistants like Siri and Alexa have historically used a mix of smaller language models and rule-based systems, though they're increasingly adding LLM capabilities. When you're talking to a modern AI assistant (ChatGPT, Claude, Gemini, Copilot), you're interacting with an LLM.

Can LLMs reason?

This is genuinely contested. LLMs can produce outputs that look like reasoning — following logical steps, working through problems, identifying errors in arguments. Whether this constitutes "reasoning" in the cognitive science sense, or sophisticated pattern matching that mimics reasoning, is an active debate. What's clear is that performance on reasoning benchmarks has improved dramatically with scale and with "chain of thought" prompting techniques. Whether the mechanism is true reasoning or not matters less than whether it works — and for many practical tasks, it does.
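The "chain of thought" technique mentioned above is, at its simplest, just a change to the prompt: instead of asking for the answer directly, you ask the model to lay out intermediate steps first, which measurably improves multi-step accuracy [EMRG]. A minimal sketch of the two prompt styles (illustrative strings, not any provider's API):

```python
question = ("A train leaves at 3:15 pm and the journey takes 2 h 50 min. "
            "When does it arrive?")

# Direct prompt: the model must produce the answer in one shot.
direct_prompt = f"{question}\nAnswer:"

# Chain-of-thought prompt: eliciting intermediate steps before the answer
# tends to improve performance on multi-step reasoning problems.
cot_prompt = f"{question}\nLet's think step by step:"

print(cot_prompt)
```

Whether the resulting step-by-step text reflects the model's actual computation or is a post-hoc rationalisation is part of the same open debate — but empirically, the technique works.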

What is the best LLM?

It depends entirely on the task. Claude 3.7 Sonnet consistently leads on coding and long-context tasks. GPT-4o is strongest for versatility and multimodal work. Gemini excels when you need very long context or real-time information. DeepSeek V3 offers GPT-4-level performance at a fraction of the cost. See our full comparison with benchmark scores for every major task type →

Sources
[EMRG] Wei et al. — "Emergent Abilities of Large Language Models" (2022), arxiv.org/abs/2206.07682
[DSK] DeepSeek — "DeepSeek-V3 Technical Report" (2024), arxiv.org/abs/2412.19437
[COST] The Information — "OpenAI Spent $100M Training GPT-4" (2023), theinformation.com
[RLHF] InstructGPT — "Training Language Models to Follow Instructions" (2022), arxiv.org/abs/2203.02155
LLMs are the infrastructure behind nearly every AI tool you use.

Understanding what an LLM is — and isn't — makes you a dramatically better user of AI tools. You'll know why it sometimes gets facts wrong (training data cutoffs, probabilistic generation), when to trust it (well-established facts, reasoning tasks), and when to verify (time-sensitive data, citations, calculations).

The model isn't magic. It's a very large mathematical function that was trained to produce human-preferred text. That description undersells the genuinely impressive capabilities that emerge from it. But it also grounds your expectations correctly — which is the most useful thing any explanation of AI can do.

Compare every major LLM

Which model wins for your use case?

Benchmark scores for every major LLM across reasoning, coding, writing, and instruction-following. Updated with every major model release.

See full LLM rankings →

Veltrix Collective · Sources: Wei et al (2022), DeepSeek (2024), The Information, InstructGPT paper. Published April 2026. Model comparisons reflect publicly available benchmarks as of Q1 2026.

Written by Luke Madden, founder of Veltrix Collective. Data synthesis and analysis by Vel.