Veltrix
March 24, 2026

What Is RAG (Retrieval-Augmented Generation)? The Complete Guide

RAG explained — how retrieval-augmented generation works, when to use RAG vs fine-tuning, real enterprise use cases, and the tools for building RAG systems in 2026.

Technical AI

What Is RAG?

Retrieval-augmented generation — the technique that gives AI access to real-time facts, company documents, and up-to-date information without retraining the entire model.

45% reduction in hallucination rate when RAG is applied vs a standalone LLM, per Meta AI Research [Lewis].
$0 retraining cost for RAG, vs $100M+ to fully retrain a frontier model when its knowledge goes stale.
2024: the default training cutoff for most models. Without RAG, they know nothing about events after this date.

Standard LLMs have two fundamental limitations: their knowledge is frozen at training cutoff, and they can't access private data. Ask ChatGPT what happened in the news yesterday and it'll either admit it doesn't know or hallucinate a plausible-sounding answer. Ask it to summarise your company's internal documentation and it's completely blind.

RAG solves this by splitting the problem in two: a retrieval system finds relevant information at query time, and a generation model synthesises the retrieved information into a response. The model doesn't need to memorise everything — it just needs to be good at using information you hand it in context.
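This two-stage split can be sketched in a few lines of Python. Everything here is a hypothetical stand-in: `KNOWLEDGE_BASE`, `retrieve`, and `generate` are toy placeholders for a real vector search and a real LLM call, shown only to make the division of labour concrete.

```python
# Minimal sketch of the retrieve-then-generate split.
# KNOWLEDGE_BASE, retrieve(), and generate() are illustrative stand-ins:
# a real system searches a vector database and calls an LLM API.

KNOWLEDGE_BASE = {
    "refund": "Refunds are available within 30 days of purchase.",
    "shipping": "Orders ship within 2 business days.",
}

def retrieve(query: str) -> list[str]:
    """Toy retrieval: return documents whose topic keyword appears in the query."""
    q = query.lower()
    return [doc for topic, doc in KNOWLEDGE_BASE.items() if topic in q]

def generate(query: str, context: list[str]) -> str:
    """Stand-in for the LLM call: build the grounded prompt a model would see."""
    return (
        "Using only the following information, answer the question.\n"
        + "\n".join(context)
        + "\nQuestion: " + query
    )

prompt = generate("What is the refund policy?", retrieve("What is the refund policy?"))
print(prompt)
```

The point of the split is that each half can be improved independently: better retrieval means better answers without touching the model at all.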

RAG adds a retrieval step before generation. Here's the full pipeline from query to response.

1. User submits a query. "What's our refund policy for software subscriptions?" The question is captured and passed to the retrieval pipeline.
2. Query is embedded into a vector. The query is converted into a dense numerical vector using an embedding model (e.g. text-embedding-3-large). This vector represents the semantic meaning of the query in high-dimensional space.
3. Similarity search in a vector database. The query vector is compared against millions of pre-embedded document chunks in a vector database (Pinecone, Weaviate, pgvector). The top-k most semantically similar chunks are retrieved, typically 3-10.
4. Retrieved chunks are injected into the prompt. The relevant chunks are inserted into the LLM's context window alongside the original query. The prompt becomes: "Using only the following information, answer this question..."
5. LLM generates a grounded response. The model reads the retrieved context and synthesises an answer. Because the relevant facts are in the prompt, the model doesn't need to recall them from training, which significantly reduces hallucination risk.
The key insight
LLMs are very good at reading and synthesising information handed to them. They're less good at reliably recalling specific facts from training. RAG plays to the model's strengths by combining accurate retrieval with fluent generation — rather than asking the model to do both at once.
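The embed-and-search steps (2 and 3) can be illustrated with a toy example. Real systems use a learned embedding model and a vector database, but the mechanics are the same: embed, compare, take the top-k. The bag-of-words "embedding" below is a deliberately crude stand-in for a real model.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: a bag-of-words count vector.
    A real system uses a learned model such as text-embedding-3-large."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank pre-embedded chunks by similarity to the query vector."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

chunks = [
    "Refunds are issued within 30 days for software subscriptions.",
    "Our office is closed on public holidays.",
    "Subscription renewals are billed annually.",
]
print(top_k("what is the refund policy for subscriptions?", chunks, k=1))
```

Note the failure mode: "subscription" and "subscriptions" don't match at all under bag-of-words, which is exactly the lexical-gap problem that dense embeddings solve.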

Both RAG and fine-tuning can help models work better with domain-specific information. They solve different problems.

Factor | RAG | Fine-tuning
Best for | Accessing up-to-date or private documents | Changing model behaviour, tone, or format
Knowledge updates | Instant: update the document store | Requires a full retraining cycle
Cost | Low: infrastructure and inference | High: GPU compute for training
Hallucination risk | Lower: facts grounded in retrieved docs | Medium: model can still confabulate
Source attribution | Built-in: you know where facts came from | Not available: knowledge is implicit
Style/behaviour change | Limited: prompt engineering required | Strong: model learns new patterns

Most production AI systems use both: fine-tuning to establish the right behaviour and response style, RAG to provide the knowledge. Neither approach alone is sufficient for enterprise use cases.

Customer support chatbots
Retrieves from product docs, policy pages, and support history. Answers accurately without hallucinating policy details. Intercom, Zendesk, and Freshdesk all offer RAG-powered support.
Internal knowledge bases
Employees query HR policies, technical documentation, or sales playbooks via natural language. Notion AI, Confluence AI, and Microsoft Copilot all use RAG architecture.
Legal document analysis
Retrieves relevant clauses, precedents, or contract terms from private databases. Harvey AI and Lexis+ AI are built on RAG over legal corpora.
Real-time research tools
Perplexity retrieves from the live web before generating answers, and every response cites sources. This substantially reduces hallucination on current-events queries compared with a static LLM.
Medical information systems
Clinical decision support tools retrieve from medical literature, drug databases, and patient records. Accuracy and source attribution are non-negotiable in healthcare.
Financial analysis
BloombergGPT and similar tools retrieve from live market data, earnings transcripts, and filings. Answers are grounded in verifiable documents, not model memory.
Does RAG eliminate hallucinations?
No — it significantly reduces them. The model can still hallucinate if the retrieved context doesn't contain a direct answer and the model tries to fill the gap. It can also misinterpret retrieved content. But grounding generation in retrieved facts cuts hallucination rates substantially — Meta's original RAG paper reported roughly 45% reduction in open-domain QA tasks.
What's a vector database?
A database optimised for storing and searching high-dimensional vectors. When you embed documents into numerical vectors, you need to search for the most similar vectors efficiently at query time. Standard SQL databases aren't built for this. Vector databases (Pinecone, Weaviate, Qdrant, pgvector for PostgreSQL) use approximate nearest-neighbour algorithms to find similar vectors in milliseconds across millions of entries.
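With pgvector, this fits into ordinary PostgreSQL. A minimal illustrative setup (table and column names are made up, and the dimension must match whichever embedding model you use) looks like:

```sql
-- Illustrative pgvector setup; names and dimensions are examples only.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunks (
    id        bigserial PRIMARY KEY,
    content   text,
    embedding vector(1536)  -- must match your embedding model's dimension
);

-- Retrieve the 5 nearest chunks by cosine distance ($1 is the query vector).
SELECT content
FROM chunks
ORDER BY embedding <=> $1
LIMIT 5;
```

pgvector's `<=>` operator is cosine distance; `<->` (Euclidean) and `<#>` (negative inner product) are also available, and an approximate index (e.g. HNSW) keeps the search fast at scale.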
How do I build a RAG system?
The basic stack: (1) chunk your documents, (2) embed them using a model like OpenAI's text-embedding-3-large, (3) store in a vector database, (4) at query time, embed the query, retrieve top-k similar chunks, (5) inject into an LLM prompt. Frameworks like LangChain and LlamaIndex handle much of this plumbing. Cloud solutions like AWS Bedrock Knowledge Bases offer managed RAG pipelines.
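Step (1), chunking, is the part most teams hand-roll first. A minimal sliding-window chunker might look like the sketch below; the 500/100 defaults are arbitrary illustrative values, and production systems often chunk by tokens or document structure instead.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into overlapping character windows.

    Overlap keeps sentences that straddle a chunk boundary retrievable
    from at least one chunk. Sizes here are illustrative, not canonical.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, max(len(text) - overlap, 1), step)]

doc = "a" * 1200
print([len(c) for c in chunk_text(doc)])  # three windows covering the document
```

Each chunk is then embedded and stored once, offline; only the query is embedded at request time.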

Sources

[Lewis] Lewis et al. — "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks", Meta AI (NeurIPS 2020)
[Gao] Gao et al. — "Retrieval-Augmented Generation for Large Language Models: A Survey" (2023)
[Pinecone] Pinecone — "What is a Vector Database?" (2024)

Written by Luke Madden, founder of Veltrix Collective. Data synthesis and analysis by Vel.