Retrieval-Augmented Generation (RAG)
What is RAG
- A technique that enhances LLM responses by supplying external, up-to-date knowledge at inference time
- Combines two systems: a retrieval system and a generative model
- Mitigates core LLM limitations: knowledge cutoff, hallucination, and lack of domain-specific data
Core Problem RAG Solves
- An LLM's weights are frozen after training, so on its own it cannot access new or private information
- Fine-tuning is expensive, and a fine-tuned model is slow to update when the underlying knowledge changes
- RAG allows dynamic injection of relevant context without retraining
RAG Architecture — Step by Step
1. Ingestion Pipeline (Offline)
- Raw documents are collected (PDFs, web pages, databases, etc.)
- Documents are split into smaller chunks (e.g., 256–512 tokens)
- Each chunk is passed through an embedding model to generate a vector representation
- Vectors are stored in a vector store: a vector database (e.g., Pinecone, Weaviate, ChromaDB) or a similarity-search library such as FAISS
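
The ingestion steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a production pipeline: `toy_embed` is a hypothetical stand-in for a real embedding model (it hashes words into a fixed-size vector), chunks are counted in words rather than tokens, and a plain list plays the role of the vector store.

```python
import hashlib
import math
import re


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Hypothetical stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * dim
    for token in re.findall(r"\w+", text.lower()):
        # A stable hash keeps the mapping identical across processes.
        bucket = int(hashlib.sha256(token.encode("utf-8")).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def chunk_words(text: str, size: int = 50) -> list[str]:
    """Split text into consecutive chunks of at most `size` words (words approximate tokens)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def ingest(documents: dict[str, str], size: int = 50) -> list[dict]:
    """Chunk every document, embed each chunk, and return the records to store."""
    index = []
    for doc_id, text in documents.items():
        for n, chunk in enumerate(chunk_words(text, size)):
            index.append({
                "doc_id": doc_id,
                "chunk_id": n,
                "text": chunk,
                "vector": toy_embed(chunk),
            })
    return index
```

In a real system, `toy_embed` would be replaced by a call to the chosen embedding model and the returned records would be written to the vector store instead of kept in memory.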
2. Retrieval (Online — at Query Time)
- User submits a query
- The query is embedded using the same embedding model
- A similarity search (e.g., cosine similarity, dot product) is run against the vector store
- Top-k most relevant chunks are retrieved
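
As a rough sketch of the retrieval step, the function below runs a brute-force cosine-similarity search and returns the top-k chunk ids. The `store` dictionary (chunk id to vector) is an assumption made for illustration; real vector stores typically use approximate nearest-neighbor indexes rather than scanning every vector.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], store: dict[str, list[float]], k: int = 3):
    """Return the k (chunk_id, score) pairs most similar to the query, best first."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in store.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The query vector must come from the same embedding model that produced the stored vectors, otherwise the scores are not meaningful.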
3. Augmentation
- Retrieved chunks are injected into the LLM prompt as context
- Prompt is structured as: context + original query
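
One possible way to assemble the augmented prompt is shown below. The instruction wording, the numbered `[i]` labels, and the `max_chars` budget are illustrative assumptions, not a required format; a real system would usually budget in tokens rather than characters.

```python
def build_prompt(query: str, chunks: list[str], max_chars: int = 4000) -> str:
    """Place retrieved chunks before the question, stopping once the context budget is spent."""
    context_parts = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        entry = f"[{i}] {chunk.strip()}"
        if used + len(entry) > max_chars:
            break  # Drop lower-ranked chunks rather than overflow the budget.
        context_parts.append(entry)
        used += len(entry)
    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
```

Numbering the chunks makes it possible to ask the model to cite which chunk supports each claim.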
4. Generation
- The LLM generates a response grounded in the retrieved context
- The output draws on retrieved source text rather than on parametric memory alone
Key Components
- Embedding Model — converts text to dense vectors (e.g., OpenAI text-embedding-ada-002, BGE, Cohere Embed)
- Vector Store — stores and indexes vectors for fast similarity search
- Retriever — fetches relevant chunks based on query similarity
- LLM — generates the final answer using retrieved context
- Orchestration Layer — connects all components (e.g., LangChain, LlamaIndex)
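
To show how these components fit together, here is a minimal orchestration sketch. `RagPipeline` and its three callables are hypothetical names introduced for illustration; this is not the LangChain or LlamaIndex API, only the wiring such a layer performs.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class RagPipeline:
    """Wires together an embedding function, a retriever, and an LLM (all injected)."""
    embed: Callable[[str], Sequence[float]]
    retrieve: Callable[[Sequence[float], int], list[str]]
    generate: Callable[[str], str]
    top_k: int = 3

    def answer(self, query: str) -> str:
        query_vec = self.embed(query)                    # embedding model
        chunks = self.retrieve(query_vec, self.top_k)    # retriever over the vector store
        context = "\n\n".join(chunks)
        prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
        return self.generate(prompt)                     # LLM
```

Because each component is passed in as a callable, any of them can be swapped (for example, a different embedding model) without touching the rest of the pipeline.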
Chunking Strategies
- Fixed-size chunking — split by token/character count
- Sentence-aware chunking — split at sentence boundaries
- Semantic chunking — split based on topic shifts
- Hierarchical chunking — store both summary and detail chunks
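
The first two strategies are simple enough to sketch directly. Both functions below count words as an approximation of tokens. The `overlap` parameter is an addition not mentioned in the list above: it is a common way to carry some context across chunk boundaries. The sentence splitter is a naive regular expression and will mis-split abbreviations such as "e.g.".

```python
import re


def fixed_size_chunks(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Fixed-size chunking: windows of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # The last window already reached the end of the text.
    return chunks


def sentence_chunks(text: str, max_words: int = 200) -> list[str]:
    """Sentence-aware chunking: pack whole sentences into chunks of at most `max_words` words.

    A single sentence longer than `max_words` is kept intact as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```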
Retrieval Strategies
- Dense retrieval — vector similarity search (most common)
- Sparse retrieval — keyword-based (BM25)
- Hybrid retrieval — combines dense and sparse for better recall
- Re-ranking — a second-pass model scores retrieved chunks for relevance
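
Hybrid retrieval needs some way to merge the dense and sparse result lists. The list above does not name a fusion method; one widely used option is reciprocal rank fusion (RRF), sketched here as an assumption. It uses only each document's rank in every list, so the dense and sparse scores never have to be put on a common scale. The constant `k = 60` is the conventional default.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of ids; each id scores the sum of 1 / (k + rank) over the lists."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
```

For example, with a dense ranking `["a", "b", "c"]` and a sparse (BM25) ranking `["b", "d", "a"]`, the ids that appear in both lists rise to the top of the fused order. A re-ranker could then be applied to the fused list as a second pass.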
RAG vs Fine-Tuning
- RAG is preferred when knowledge changes frequently
- Fine-tuning is preferred when style, tone, or behavior needs to change
- RAG does not retrain the LLM, so it requires no GPU training cycles
- The two can be combined: fine-tuning for style and behavior, RAG for knowledge
Common Failure Points
- Poor chunking leads to loss of context across chunk boundaries
- A different embedding model is used at query time than at ingestion, so query and chunk vectors are not comparable
- Top-k is set too low, so relevant context is missed
- Vague queries retrieve irrelevant chunks
- The LLM ignores the retrieved context and hallucinates anyway
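
The embedding mismatch failure is easy to guard against in code. A minimal sketch, assuming the index stores the name and dimension of the model that built it: `IndexMetadata` and `check_query_embedding` are hypothetical names introduced here, not part of any particular vector store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexMetadata:
    """Recorded once at ingestion time and stored alongside the index."""
    embedding_model: str
    dimension: int


def check_query_embedding(meta: IndexMetadata, query_model: str, query_vector: list[float]) -> None:
    """Raise ValueError if the query was embedded with a different model or dimension than the index."""
    if query_model != meta.embedding_model:
        raise ValueError(
            f"index built with {meta.embedding_model!r}, query embedded with {query_model!r}"
        )
    if len(query_vector) != meta.dimension:
        raise ValueError(
            f"index dimension is {meta.dimension}, query vector has {len(query_vector)}"
        )
```

Matching dimensions alone do not prove the models match, which is why the model name is checked as well.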
Use Cases
- Enterprise document Q&A
- Customer support over internal knowledge bases
- Legal and compliance document search
- Medical literature querying
- Code documentation assistants
Advanced Patterns
- HyDE (Hypothetical Document Embeddings) — generate a hypothetical answer, embed it, then retrieve
- FLARE — iteratively retrieves during generation when confidence is low
- Agentic RAG — LLM decides when and what to retrieve dynamically
- Multi-hop RAG — chains multiple retrievals to answer complex questions
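
Of these patterns, HyDE is the shortest to express in code. The sketch below assumes three injected callables with hypothetical signatures: `generate` (an LLM call), `embed` (the embedding model used at ingestion), and `search` (a vector-store lookup). The prompt wording is illustrative only.

```python
from typing import Callable, Sequence


def hyde_retrieve(
    query: str,
    generate: Callable[[str], str],
    embed: Callable[[str], Sequence[float]],
    search: Callable[[Sequence[float], int], list[str]],
    k: int = 3,
) -> list[str]:
    """HyDE: embed a generated hypothetical answer instead of the raw query, then retrieve with it."""
    hypothetical = generate(f"Write a short passage that answers the question:\n{query}")
    return search(embed(hypothetical), k)
```

The hypothetical passage may contain factual errors. It is used only as a search key, and the final answer is still generated from the real chunks that the search returns.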