March 23, 2026

What Is a Transformer Model? The Architecture Powering Modern AI

What is a transformer model? How the attention mechanism replaced RNNs, the five key components of transformer architecture, and why it powers every major AI system in 2026.

Technical AI

What Is a Transformer Model?

The architecture that powers ChatGPT, Claude, Gemini, and nearly every leading AI system — explained from attention mechanisms to real-world applications.

2017
"Attention Is All You Need" published by Google Brain — the paper that introduced transformer architecture [Vaswani et al.]
100K+
Citations for the original transformer paper — one of the most cited papers in computer science history
99%
Of frontier AI models in 2026 use transformer architecture or a direct derivative of it

Before 2017, the dominant approach to natural language processing was recurrent neural networks (RNNs) — specifically LSTMs and GRUs. They were sequential: they processed text one word at a time, left to right. This created a fundamental bottleneck. Long-range dependencies — where the meaning of a word depends on context from 50 words earlier — were difficult to capture because information degraded with distance.

The transformer architecture threw out the sequential assumption entirely. Instead of processing left to right, transformers process all tokens simultaneously and use attention to decide which tokens matter for each prediction. This made training massively parallel, which meant it could scale. And scale is what made modern AI possible.

Understanding why transformers won requires understanding what they replaced — and what the tradeoffs were.

Old approach
RNNs / LSTMs

Sequential processing — each token processed in order. Hidden state carries information forward.

  • Can't parallelise training across the sequence; each step waits on the previous one
  • Gradient vanishing over long sequences
  • Slow to train at scale
  • Poor at long-range dependencies
  • Max useful context: ~512 tokens
New approach
Transformers

Parallel processing — all tokens processed simultaneously. Attention connects any two tokens directly.

  • Fully parallelisable — scales with GPU count
  • No gradient vanishing problem
  • Dramatically faster training
  • Handles long-range dependencies natively
  • Context windows now up to 1M+ tokens

A transformer has five core components. Each one serves a specific purpose in the processing pipeline.

EMBEDDING
Token embeddings
Each token is converted into a high-dimensional vector (typically 512-4096 dimensions). Tokens with similar meanings end up with similar vectors — "king" and "queen" are closer together than "king" and "bicycle". Positional encoding adds information about where in the sequence each token sits, since unlike RNNs, transformers have no inherent sense of order.
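The idea can be sketched in a few lines of NumPy. This is a toy illustration with made-up dimensions (a 10-token vocabulary, 8-dimensional vectors), using the fixed sinusoidal positional encoding from the original paper; production models are far larger and many use learned position schemes instead.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal positional encodings from "Attention Is All You Need"."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model)[None, :]            # (1, d_model)
    angle = pos / np.power(10000, (2 * (i // 2)) / d_model)
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angle[:, 0::2])      # even dimensions: sine
    enc[:, 1::2] = np.cos(angle[:, 1::2])      # odd dimensions: cosine
    return enc

# Toy embedding table: vocabulary of 10 tokens, 8-dimensional vectors.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(10, 8))

token_ids = [3, 1, 4, 1]                       # a 4-token sequence
x = embedding_table[token_ids] + sinusoidal_positions(4, 8)
print(x.shape)  # (4, 8): one position-aware vector per token
```

Note that token 1 appears twice but gets two different final vectors, because its positional encoding differs — exactly the order information the attention layers would otherwise lack.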
ATTENTION
Self-attention mechanism
For each token, attention calculates how much to focus on every other token in the sequence. It does this by computing Query, Key, and Value vectors for each token, then producing a weighted sum. The model learns what to attend to during training — not via hand-coded rules. Multi-head attention runs this process in parallel with different learned projections, letting each head capture different relationship types.
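A single attention head reduces to a few matrix operations. The sketch below uses random weights and tiny dimensions purely to show the mechanics; real models learn the Q/K/V projection matrices during training and run many heads in parallel.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)    # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over a token sequence x."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv           # project every token to Query/Key/Value
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # token-to-token relevance scores
    weights = softmax(scores, axis=-1)         # each row is a distribution over tokens
    return weights @ V, weights                # output = weighted sum of Values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                    # 4 tokens, 8-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out, weights = self_attention(x, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one updated representation per token
```

Every row of `weights` sums to 1, and nothing in the computation depends on sequential order — which is why the whole thing parallelises across tokens.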
FFN
Feed-forward network
After attention, each token's representation passes through a small feed-forward neural network independently. This is where much of the model's factual knowledge is thought to be stored — researchers can identify specific neurons that activate for specific concepts like "the Eiffel Tower is in Paris".
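The per-token independence is the key property. A minimal sketch of the standard expand-activate-project FFN, with illustrative dimensions (hidden layers are typically ~4x the model width):

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise FFN: expand, apply ReLU, project back. Each token is
    processed independently -- no mixing across positions happens here."""
    return np.maximum(0, x @ W1 + b1) @ W2 + b2

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32                          # hidden layer is 4x the model width
x = rng.normal(size=(4, d_model))              # 4 tokens
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
y = feed_forward(x, W1, b1, W2, b2)
print(y.shape)  # (4, 8): same shape in as out
```

Because each token passes through the same weights in isolation, processing one token alone gives the same result as processing it inside a batch — attention is the only place tokens interact.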
NORM
Layer normalisation
After each sub-layer, the output is normalised to prevent values from growing too large or too small. Without this, deep stacks of attention layers would produce unstable gradients during training. This is why you can stack 120 transformer layers and still train successfully.
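Layer normalisation itself is simple: rescale each token's vector to zero mean and unit variance. A minimal version (with scalar gain/bias for brevity; real models learn per-dimension parameters):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise each token's vector to zero mean, unit variance,
    then rescale with learned gamma/beta (scalars here for simplicity)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [100.0, 200.0, 300.0, 400.0]])   # two tokens at wildly different scales
y = layer_norm(x)
print(y.mean(axis=-1))  # ~0 for every token
print(y.std(axis=-1))   # ~1 for every token
```

Both rows come out on the same scale, which is what keeps activations stable as they flow through a 120-layer stack.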
OUTPUT
Linear + softmax projection
The final hidden state is projected onto the full vocabulary size (100K+ tokens), then softmax produces a probability distribution. The model samples from this distribution to choose the next token. Temperature controls how sharp or flat this distribution is — higher temperature = more creative and unpredictable output.
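Temperature's effect is easy to see numerically. A toy 4-token vocabulary with made-up logits (real vocabularies have 100K+ entries):

```python
import numpy as np

def next_token_probs(logits, temperature=1.0):
    """Divide logits by temperature, then softmax into a probability distribution."""
    z = np.asarray(logits) / temperature
    z = z - z.max()                            # numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5, 0.1]                  # toy scores over a 4-token vocabulary
sharp = next_token_probs(logits, temperature=0.5)  # low temp: top token dominates
flat = next_token_probs(logits, temperature=2.0)   # high temp: closer to uniform
print(sharp, flat)

# Sampling the next token from the distribution:
rng = np.random.default_rng(0)
token = rng.choice(len(logits), p=next_token_probs(logits))
```

At temperature 0.5 the top token takes most of the probability mass; at 2.0 the distribution flattens and sampling becomes far less predictable.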

The transformer architecture turned out to be general-purpose. It's now the dominant architecture in vision, audio, protein folding, and more.

Language
Text generation and understanding
GPT-4, Claude, Gemini, LLaMA, BERT, T5
Vision
Image classification and generation
ViT (Vision Transformer), DALL-E 3, Stable Diffusion, Sora
Audio
Speech recognition and synthesis
Whisper (OpenAI), MusicGen, AudioCraft
Biology
Protein structure prediction
AlphaFold2, ESM-2, RoseTTAFold
Code
Code generation and completion
GitHub Copilot, Claude Code, Codestral, CodeGemma
Multimodal
Cross-modality reasoning
GPT-4o, Gemini 1.5 Pro, Claude 3.5 Sonnet
Why it generalises
Images can be treated as sequences of patches. Audio can be treated as sequences of spectral frames. Proteins can be treated as sequences of amino acids. Once you have a good architecture for sequences plus enough compute, the architecture transfers across domains with minimal modification.
Are all language models transformers?
Almost all frontier models are. Some newer architectures are emerging — Mamba (state space models) uses a different approach that may be more efficient for very long sequences. But as of 2026, every major production LLM (GPT-4o, Claude, Gemini, LLaMA, Mistral) is built on transformer architecture or a close variant.
What's the difference between encoder-only, decoder-only, and encoder-decoder transformers?
Encoder-only models (like BERT) read the full input at once — good for classification and retrieval. Decoder-only models (like GPT) generate text left-to-right — good for generation tasks. Encoder-decoder models (like T5, original translation models) use both — good for sequence-to-sequence tasks like translation and summarisation. Modern chat models are typically decoder-only.
What's a mixture-of-experts (MoE) transformer?
A standard transformer activates all parameters for every token. MoE architecture routes each token through only a subset of "expert" sub-networks — typically 2 of 8-64 experts per layer. GPT-4 reportedly uses MoE, allowing a model with 1.8T total parameters to operate at roughly 280B active parameters per forward pass, dramatically reducing inference cost without sacrificing capability.
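The routing logic can be sketched for a single token. This is an illustrative top-k router with 4 tiny linear "experts" standing in for full FFNs — not the router of any particular production model:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def moe_layer(x, router_W, experts, k=2):
    """Route one token through its top-k experts and mix their outputs,
    weighted by the renormalised router scores."""
    scores = softmax(x @ router_W)             # one score per expert
    top = np.argsort(scores)[-k:]              # indices of the k highest-scoring experts
    gate = scores[top] / scores[top].sum()     # renormalise over the chosen experts
    return sum(g * experts[i](x) for g, i in zip(gate, top))

rng = np.random.default_rng(0)
d = 8
router_W = rng.normal(size=(d, 4))             # router scores 4 experts
expert_Ws = [rng.normal(size=(d, d)) for _ in range(4)]
experts = [lambda x, W=W: x @ W for W in expert_Ws]  # linear maps as stand-in experts

x = rng.normal(size=d)
y = moe_layer(x, router_W, experts, k=2)       # only 2 of the 4 experts execute
print(y.shape)  # (8,)
```

All four experts' parameters exist, but only two run per token — which is the whole trick: total parameter count grows while per-token compute stays roughly fixed.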

Sources

[Vaswani] Vaswani et al. — "Attention Is All You Need", Google Brain (NeurIPS 2017)
[Semianalysis] Semianalysis — GPT-4 architecture deep-dive (2023)
[Elhage] Elhage et al. — "A Mathematical Framework for Transformer Circuits", Anthropic (2021)
[Dosovitskiy] Dosovitskiy et al. — "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (2020)

Written by Luke Madden, founder of Veltrix Collective. Data synthesis and analysis by Vel.