What Is RAG (Retrieval-Augmented Generation)?
The four-step pipeline
Every RAG implementation, no matter how fancy, runs the same four steps:
- Chunk. Split your documents into small pieces (200–1,000 tokens). The chunk size is the most consequential dial in the system.
- Embed. Convert each chunk into a vector: a list of numbers that captures meaning. OpenAI text-embedding-3-large, Cohere Embed, and open-source models like BGE produce these.
- Retrieve. When a user asks a question, embed the question, then ask a vector database which chunks are closest. Top-5 to top-20 chunks come back.
- Generate. Stuff the retrieved chunks into the LLM prompt as context, ask the model to answer using only those, and get back a grounded response with citations.
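The four steps above can be sketched end to end in a few lines. This is a minimal illustration, not a production design: the `embed` function is a toy word-count stand-in for a real embedding model, chunks are measured in whitespace-separated words rather than model tokens, and all function names and sample documents are hypothetical. The final prompt would be sent to whichever LLM you use; that call is left out.

```python
import math
import re
from collections import Counter


def chunk(text: str, size: int = 40) -> list[str]:
    """Step 1: split text into chunks of at most `size` words (words stand in for tokens)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Step 2: toy embedding (lowercase word counts). A real system calls an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Step 3: embed the question and return the k most similar chunks."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, chunks: list[str]) -> str:
    """Step 4: place the retrieved chunks in the prompt, numbered so the model can cite them."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the numbered context below and cite the numbers you used.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Hypothetical sample documents.
docs = [
    "The Pro plan costs 49 dollars per month and includes priority support.",
    "Refunds are available within 30 days of purchase.",
    "Our office is closed on public holidays.",
]
index = [(c, embed(c)) for d in docs for c in chunk(d)]

question = "How much does the Pro plan cost?"
top = retrieve(question, index, k=2)
prompt = build_prompt(question, top)
print(prompt)
```

Swapping the toy `embed` for a real model and the in-memory list for a vector database changes the quality of retrieval, not the shape of the pipeline.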
Why RAG matters for AI agents
An AI agent without RAG is limited to what its base model memorised — frozen at the model's training cutoff, with no awareness of your private data. RAG gives the agent the "memory" piece of the four agent building blocks — a corpus that updates the moment you update a document. The same agent can then handle queries about your latest pricing, your internal SOPs, or your customer's open ticket — without retraining.
Three things RAG specifically unlocks:
- Citation. The agent can quote the exact document chunk it relied on, which is mandatory for high-trust deployments (legal, medical, regulated).
- Freshness. Update a document and re-embed it, and the agent answers from the new version on the next query. No retraining lag.
- Privacy. Your documents stay in your vector DB and your prompt context. They are never baked into model weights; whether a hosted LLM provider retains prompt data depends on that provider's terms.
RAG vs fine-tuning vs long-context prompting
Three competing patterns address "how do I get the model to know my stuff":
- Long-context prompting: just stuff your entire corpus into a 1M-token context window. Works for small corpora; cost rises and accuracy degrades for large ones.
- RAG: retrieve only the relevant chunks, stuff those into the prompt. Scales to terabytes of documents at low cost; small risk that retrieval misses the right chunk.
- Fine-tuning: bake the knowledge into model weights. Best for changing tone or output format; bad for facts (model still hallucinates them under pressure) and expensive to update.
2026 reality: most production stacks use RAG for facts, light fine-tuning for tone or output schema, and reserve long-context for the small minority of queries where retrieval misses.
What can go wrong
- Retrieval miss. The chunk that contains the answer never makes it into the top-N. Mitigation: hybrid retrieval (vector + keyword), reranking, query rewriting.
- Stale embeddings. A document changes; the old embedding still gets retrieved. Mitigation: re-embed on document update; track the source version.
- Chunk-boundary loss. The answer spans two chunks; neither alone makes sense. Mitigation: overlapping chunks, semantic chunking, larger chunk sizes.
- Hallucinated citations. The model invents quotes that look like they came from your documents. Mitigation: post-generation citation checking; lower temperatures.
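As one illustration of the hybrid-retrieval mitigation, the sketch below merges a vector ranking and a keyword ranking with reciprocal rank fusion (RRF). RRF is one common fusion method, chosen here for illustration; the text above does not prescribe a specific one. The chunk IDs and the two input rankings are made up, and `k = 60` is the constant conventionally used with RRF, not a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of chunk IDs into one.

    Each chunk scores 1 / (k + rank) per list it appears in, so a chunk
    ranked well by both retrievers beats one ranked well by only one.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


# Hypothetical results from a vector search and a keyword (e.g. BM25) search.
vector_hits = ["c7", "c2", "c9"]
keyword_hits = ["c2", "c4", "c7"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['c2', 'c7', 'c4', 'c9']
```

Chunks found by both retrievers (`c2`, `c7`) rise to the top, which is the behaviour that reduces the chance of a retrieval miss.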
Implementation map for AI agent buyers in 2026
If you do not want to build the pipeline yourself:
- Lindy.ai, Gumloop, Voiceflow, Botpress — all ship with built-in document-knowledge nodes. Upload PDFs, point at a Notion workspace, give the agent a Drive folder, done.
- n8n — exposes a Vector Store node + multiple embedder nodes; you assemble the pipeline yourself but stay in one platform.
- Relevance AI — explicit RAG primitive at platform level; strong choice for developer-led teams.
If you want to build it yourself: pick an embedder (OpenAI text-embedding-3-large for English; Cohere Embed Multilingual for non-English), pick a vector store (Pinecone for managed, pgvector for embedded, Qdrant for self-hosted), and use a framework (LlamaIndex, LangChain, or just raw API calls) to glue them together.
Frequently asked
Do I need a vector database to do RAG?
Not strictly. For a few thousand chunks you can keep embeddings in a flat file and brute-force the cosine similarity at query time. Past ~50,000 chunks, you want a real index — pgvector, Qdrant, Pinecone, or Weaviate.
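A minimal sketch of the flat-file approach, assuming embeddings are stored as a JSON list of records: the vectors below are hand-written three-dimensional stand-ins for real embedding output, and the record layout (`id`, `vector`) is a hypothetical choice.

```python
import heapq
import json
import math
import tempfile
from pathlib import Path


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], records: list[dict], k: int = 5) -> list[dict]:
    """Brute force: score every stored vector against the query, keep the k best."""
    return heapq.nlargest(k, records, key=lambda r: cosine(query_vec, r["vector"]))


# Hand-written vectors standing in for real embedding output.
records = [
    {"id": "pricing", "vector": [0.9, 0.1, 0.0]},
    {"id": "refunds", "vector": [0.1, 0.9, 0.0]},
    {"id": "holidays", "vector": [0.0, 0.2, 0.9]},
]

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "index.json"
    path.write_text(json.dumps(records))    # the whole "index" is one JSON file
    loaded = json.loads(path.read_text())   # load it back at query time

hits = top_k([0.8, 0.2, 0.1], loaded, k=2)
print([h["id"] for h in hits])  # ['pricing', 'refunds']
```

Every query scans every record, so cost grows linearly with the number of chunks; that linear scan is what a real index replaces.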
How big should my chunks be?
Most pipelines settle around 300–800 tokens per chunk with 50–100 token overlap. Smaller chunks improve retrieval precision; larger chunks improve answer coherence. Test on your specific corpus.
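Overlapping chunks can be produced with a simple sliding window. The sketch below is one way to do it, assuming the text has already been tokenized; the example uses whitespace-separated words as a stand-in for a real tokenizer, and the function name and defaults are hypothetical.

```python
def chunk_tokens(tokens: list[str], size: int = 500, overlap: int = 75) -> list[list[str]]:
    """Slide a window of `size` tokens, stepping by `size - overlap` each time."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Whitespace words stand in for model tokens in this example.
tokens = "a b c d e f g h i j".split()
for piece in chunk_tokens(tokens, size=4, overlap=1):
    print(piece)
# ['a', 'b', 'c', 'd']
# ['d', 'e', 'f', 'g']
# ['g', 'h', 'i', 'j']
```

Each chunk repeats the last `overlap` tokens of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.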
Can RAG hallucinate?
Yes, and the failure modes are subtle. The model can produce text that looks like a quotation but does not appear in any source, mis-attribute information across chunks, or invent links to documents it never saw. Always run a citation-check pass and never deploy RAG-grounded output as fact in regulated contexts without human review.
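A citation-check pass can start as simply as verifying that every quoted span in the answer actually occurs in a retrieved chunk. The sketch below is a minimal version under loud assumptions: quotations are marked with straight double quotes, matching is exact after lowercasing and whitespace normalization, and the function names are hypothetical. It would not catch paraphrased or mis-attributed claims.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences do not matter."""
    return " ".join(text.lower().split())


def unsupported_quotes(answer: str, chunks: list[str]) -> list[str]:
    """Return quoted spans from the answer that appear in none of the retrieved chunks."""
    sources = [normalize(c) for c in chunks]
    quotes = re.findall(r'"([^"]+)"', answer)  # assumes straight double quotes
    return [q for q in quotes if not any(normalize(q) in s for s in sources)]


# Hypothetical retrieved chunk and model answer.
chunks = ["Refunds are available within 30 days of purchase."]
answer = 'The policy says "refunds are available within 30 days" and "no questions asked".'

print(unsupported_quotes(answer, chunks))  # ['no questions asked']
```

Any non-empty result is a signal to block the answer, regenerate it, or route it to human review.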