March 23, 2026

What Is a Transformer Model? The Architecture Powering Modern AI

What is a transformer model? How the attention mechanism replaced RNNs, the five key components of transformer architecture, and why it powers every major AI system in 2026.

Technical AI

What Is a Transformer Model?

The architecture that powers ChatGPT, Claude, Gemini, and nearly every leading AI system — explained from attention mechanisms to real-world applications.

2017
"Attention Is All You Need" published by Google Brain — the paper that introduced transformer architecture [Vaswani et al.]
100K+
Citations for the original transformer paper — one of the most cited papers in computer science history
99%
Of frontier AI models in 2026 use transformer architecture or a direct derivative of it

Before 2017, the dominant approach to natural language processing was recurrent neural networks (RNNs) — specifically LSTMs and GRUs. They were sequential: they processed text one word at a time, left to right. This created a fundamental bottleneck. Long-range dependencies — where the meaning of a word depends on context from 50 words earlier — were difficult to capture because information degraded with distance.

The transformer architecture threw out the sequential assumption entirely. Instead of processing left to right, transformers process all tokens simultaneously and use attention to decide which tokens matter for each prediction. This made training massively parallel, which meant it could scale. And scale is what made modern AI possible.

Understanding why transformers won requires understanding what they replaced — and what the tradeoffs were.

Old approach
RNNs / LSTMs

Sequential processing — each token processed in order. Hidden state carries information forward.

  • Can't parallelise training across the sequence; each step waits on the previous one
  • Gradient vanishing over long sequences
  • Slow to train at scale
  • Poor at long-range dependencies
  • Max useful context: ~512 tokens
New approach
Transformers

Parallel processing — all tokens processed simultaneously. Attention connects any two tokens directly.

  • Fully parallelisable — scales with GPU count
  • No gradient vanishing problem
  • Dramatically faster training
  • Handles long-range dependencies natively
  • Context windows now up to 1M+ tokens

A transformer has five core components. Each one serves a specific purpose in the processing pipeline.

EMBEDDING
Token embeddings
Each token is converted into a high-dimensional vector (typically 512-4096 dimensions). Tokens with similar meanings end up with similar vectors — "king" and "queen" are closer together than "king" and "bicycle". Positional encoding adds information about where in the sequence each token sits, since unlike RNNs, transformers have no inherent sense of order.
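The idea can be sketched in a few lines of NumPy. This is a toy illustration with made-up dimensions (a 10-token vocabulary, 8-dimensional vectors), using the fixed sinusoidal positional encoding from the original paper; production models are far larger and many use learned position schemes instead.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal positional encodings from "Attention Is All You Need"."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model)[None, :]            # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angle[:, 0::2])      # even dimensions: sine
    enc[:, 1::2] = np.cos(angle[:, 1::2])      # odd dimensions: cosine
    return enc

# Toy embedding table: vocabulary of 10 tokens, 8-dimensional vectors.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(10, 8))

token_ids = [3, 1, 4, 1]                       # a 4-token sequence
x = embedding_table[token_ids] + sinusoidal_positions(4, 8)
print(x.shape)  # (4, 8): one position-aware vector per token
```

Note that token 1 appears twice but gets two different final vectors, because its positional encoding differs — exactly the order information the attention layers would otherwise lack.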
ATTENTION
Self-attention mechanism
For each token, attention calculates how much to focus on every other token in the sequence. It does this by computing Query, Key, and Value vectors for each token, then producing a weighted sum. The model learns what to attend to during training — not via hand-coded rules. Multi-head attention runs this process in parallel with different learned projections, letting each head capture different relationship types.
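A single attention head reduces to a few matrix operations. The sketch below uses random weights and tiny dimensions purely to show the mechanics; real models learn the Q/K/V projection matrices during training and run many heads in parallel.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a token sequence x."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv           # project every token to Query/Key/Value
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # token-to-token relevance scores
    weights = softmax(scores, axis=-1)         # each row is a distribution over tokens
    return weights @ V, weights                # output = weighted sum of Values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                    # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one updated representation per token
```

Every row of `weights` sums to 1, and nothing in the computation depends on sequential order — which is why the whole thing parallelises across tokens.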
FFN
Feed-forward network
After attention, each token's representation passes through a small feed-forward neural network independently. This is where much of the model's factual knowledge is thought to be stored — researchers can identify specific neurons that activate for specific concepts like "the Eiffel Tower is in Paris".
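The per-token independence is the key property. A minimal sketch of the standard expand-activate-project FFN, with illustrative dimensions (hidden layers are typically ~4x the model width):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: expand, apply ReLU, project back. Each token is
    processed independently -- no mixing across positions happens here."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                          # hidden layer is 4x the model width
x = rng.normal(size=(4, d_model))              # 4 tokens
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
y = feed_forward(x, W1, b1, W2, b2)
print(y.shape)  # (4, 8): same shape in as out
```

Because each token passes through the same weights in isolation, processing one token alone gives the same result as processing it inside a batch — attention is the only place tokens interact.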
NORM
Layer normalisation
After each sub-layer, the output is normalised to prevent values from growing too large or too small. Without this, deep stacks of attention layers would produce unstable gradients during training. This is why you can stack 120 transformer layers and still train successfully.
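Layer normalisation itself is simple: rescale each token's vector to zero mean and unit variance. A minimal version (with scalar gain/bias for brevity; real models learn per-dimension parameters):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise each token's vector to zero mean, unit variance,
    then rescale with learned gamma/beta (scalars here for simplicity)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [100.0, 200.0, 300.0, 400.0]])   # two tokens at wildly different scales
y = layer_norm(x)
print(y.mean(axis=-1))  # ~0 for every token
print(y.std(axis=-1))   # ~1 for every token
```

Both rows come out on the same scale, which is what keeps activations stable as they flow through a 120-layer stack.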
OUTPUT
Linear + softmax projection
The final hidden state is projected onto the full vocabulary size (100K+ tokens), then softmax produces a probability distribution. The model samples from this distribution to choose the next token. Temperature controls how sharp or flat this distribution is — higher temperature = more creative and unpredictable output.
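Temperature's effect is easy to see numerically. A toy 4-token vocabulary with made-up logits (real vocabularies have 100K+ entries):

```python
import numpy as np

def next_token_probs(logits, temperature=1.0):
    """Divide logits by temperature, then softmax into a probability distribution."""
    z = np.asarray(logits) / temperature
    z = z - z.max()                            # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5, 0.1]                  # toy scores over a 4-token vocabulary
sharp = next_token_probs(logits, temperature=0.5)  # low temp: top token dominates
flat = next_token_probs(logits, temperature=2.0)   # high temp: closer to uniform
print(sharp, flat)

# Sampling the next token from the distribution:
rng = np.random.default_rng(0)
token = rng.choice(len(logits), p=next_token_probs(logits))
```

At temperature 0.5 the top token takes most of the probability mass; at 2.0 the distribution flattens and sampling becomes far less predictable.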

The transformer architecture turned out to be general-purpose. It's now the dominant architecture in vision, audio, protein folding, and more.

Language
Text generation and understanding
GPT-4, Claude, Gemini, LLaMA, BERT, T5
Vision
Image classification and generation
ViT (Vision Transformer), DALL-E 3, Stable Diffusion, Sora
Audio
Speech recognition and synthesis
Whisper (OpenAI), MusicGen, AudioCraft
Biology
Protein structure prediction
AlphaFold2, ESM-2, RoseTTAFold
Code
Code generation and completion
GitHub Copilot, Claude Code, Codestral, CodeGemma
Multimodal
Cross-modality reasoning
GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet
Why it generalises
Images can be treated as sequences of patches. Audio can be treated as sequences of spectral frames. Proteins can be treated as sequences of amino acids. Once you have a good architecture for sequences plus enough compute, the architecture transfers across domains with minimal modification.
Are all language models transformers?
Almost all frontier models are. Some newer architectures are emerging — Mamba (state space models) uses a different approach that may be more efficient for very long sequences. But as of 2026, every major production LLM (GPT-4o, Claude, Gemini, LLaMA, Mistral) is built on transformer architecture or a close variant.
What's the difference between encoder-only, decoder-only, and encoder-decoder transformers?
Encoder-only models (like BERT) read the full input at once — good for classification and retrieval. Decoder-only models (like GPT) generate text left-to-right — good for generation tasks. Encoder-decoder models (like T5, original translation models) use both — good for sequence-to-sequence tasks like translation and summarisation. Modern chat models are typically decoder-only.
What's a mixture-of-experts (MoE) transformer?
A standard transformer activates all parameters for every token. MoE architecture routes each token through only a subset of "expert" sub-networks — typically 2 of 8-64 experts per layer. GPT-4 reportedly uses MoE, allowing a model with 1.8T total parameters to operate at roughly 280B active parameters per forward pass, dramatically reducing inference cost without sacrificing capability.
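The routing logic can be sketched for a single token. This is an illustrative top-k router with 4 tiny linear "experts" standing in for full FFNs — not the router of any particular production model:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_layer(x, router_W, experts, k=2):
    """Route one token through its top-k experts and mix their outputs,
    weighted by the renormalised router scores."""
    scores = softmax(x @ router_W)             # one score per expert
    top = np.argsort(scores)[-k:]              # indices of the k highest-scoring experts
    gate = scores[top] / scores[top].sum()     # renormalise over the chosen experts
    return sum(g * experts[i](x) for g, i in zip(gate, top))

rng = np.random.default_rng(0)
d = 8
router_W = rng.normal(size=(d, 4))             # router scores 4 experts
expert_Ws = [rng.normal(size=(d, d)) for _ in range(4)]
experts = [lambda x, W=W: x @ W for W in expert_Ws]  # linear maps as stand-in experts

x = rng.normal(size=d)
y = moe_layer(x, router_W, experts, k=2)       # only 2 of the 4 experts execute
print(y.shape)  # (8,)
```

All four experts' parameters exist, but only two run per token — which is the whole trick: total parameter count grows while per-token compute stays roughly fixed.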

Sources

[Vaswani] Vaswani et al. — "Attention Is All You Need", Google Brain (NeurIPS 2017)
[Semianalysis] Semianalysis — GPT-4 architecture deep-dive (2023)
[Elhage] Elhage et al. — "A Mathematical Framework for Transformer Circuits", Anthropic (2021)
[Dosovitskiy] Dosovitskiy et al. — "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (2020)

Written by Luke Madden, founder of Veltrix Collective. Data synthesis and analysis by Vel.