#rag #ai #llm #retrieval #vector-database #nlp

Retrieval-Augmented Generation (RAG)

What is RAG

A technique that enhances LLM responses by supplying external, up-to-date knowledge at inference time
Combines two systems: a retrieval system and a generative model
Addresses core LLM limitations: knowledge cutoff, hallucination, lack of domain-specific data

Core Problem RAG Solves

LLMs are frozen at training time — they cannot access new or private information
Fine-tuning is expensive and slow to update
RAG allows dynamic injection of relevant context without retraining

RAG Architecture — Step by Step

1. Ingestion Pipeline (Offline)

Raw documents are collected (PDFs, web pages, databases, etc.)
Documents are split into smaller chunks (e.g., 256–512 tokens)
Each chunk is passed through an embedding model to generate a vector representation
Vectors are stored in a vector database (e.g., Pinecone, Weaviate, ChromaDB) or a similarity-search index library such as FAISS
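
The ingestion steps above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: `toy_embed` is a hypothetical stand-in for a real embedding model (it hashes words into a small vector), chunks are sized in words rather than tokens, and the "vector store" is just a Python list.

```python
import hashlib
import math

DIM = 64  # toy dimensionality; real embedding models use far larger vectors


def chunk_words(text, size=50):
    """Split text into chunks of at most `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def toy_embed(text):
    """Hypothetical embedding: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for word in text.lower().split():
        # md5 is used only for a deterministic bucket, not for security
        bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def ingest(documents, size=50):
    """Chunk every document, embed each chunk, and collect the records."""
    store = []
    for doc in documents:
        for chunk in chunk_words(doc, size):
            store.append({"text": chunk, "vector": toy_embed(chunk)})
    return store
```

In a real system the list would be replaced by a vector database client and `toy_embed` by a call to the chosen embedding model; the overall shape (chunk, embed, store) stays the same.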

2. Retrieval (Online — at Query Time)

User submits a query
The query is embedded using the same embedding model
A similarity search (e.g., cosine similarity, dot product) is run against the vector store
Top-k most relevant chunks are retrieved
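
To make the similarity search concrete, here is a brute-force sketch of top-k retrieval with cosine similarity. The three-dimensional vectors and chunk texts are invented for illustration; a vector database would normally do this with an approximate index rather than a full scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, store, k=2):
    """Score every stored chunk against the query and keep the k best."""
    scored = [(cosine_similarity(query_vec, rec["vector"]), rec["text"]) for rec in store]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Made-up vectors standing in for real embeddings
store = [
    {"text": "refund policy", "vector": [1.0, 0.0, 0.0]},
    {"text": "shipping times", "vector": [0.0, 1.0, 0.0]},
    {"text": "refunds and returns", "vector": [0.9, 0.1, 0.0]},
]
results = top_k([1.0, 0.0, 0.0], store, k=2)
```

Here `results` holds the two refund-related chunks, with "refund policy" ranked first because its vector points in exactly the query's direction.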

3. Augmentation

Retrieved chunks are injected into the LLM prompt as context
Prompt is structured as: context + original query
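
One possible way to assemble the augmented prompt is shown below. The instruction wording and the numbered `[1]`, `[2]` labels are assumptions chosen for illustration; the notes only specify that context comes before the original query.

```python
def build_prompt(query, chunks):
    """Place retrieved chunks ahead of the user's question."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\n"
        "Answer:"
    )


prompt = build_prompt(
    "How long do refunds take?",
    ["Refunds are processed within 5 business days.", "Returns need a receipt."],
)
```

Numbering the chunks makes it easier to ask the model to cite which chunk supported its answer.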

4. Generation

The LLM generates a response grounded in the retrieved context
Output is based on real, sourced information rather than parametric memory alone

Key Components

Embedding Model — converts text to dense vectors (e.g., OpenAI text-embedding-ada-002, BGE, Cohere Embed)
Vector Store — stores and indexes vectors for fast similarity search
Retriever — fetches relevant chunks based on query similarity
LLM — generates the final answer using retrieved context
Orchestration Layer — connects all components (e.g., LangChain, LlamaIndex)

Chunking Strategies

Fixed-size chunking — split by token/character count
Sentence-aware chunking — split at sentence boundaries
Semantic chunking — split based on topic shifts
Hierarchical chunking — store both summary and detail chunks
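
The first two strategies are simple enough to sketch directly. This is a rough illustration under stated assumptions: sizes are counted in words rather than model tokens, and sentence boundaries are detected with a naive regular expression that will mis-split abbreviations such as "e.g.".

```python
import re


def fixed_size_chunks(text, size=5):
    """Fixed-size chunking: every chunk has at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def sentence_chunks(text, max_words=8):
    """Sentence-aware chunking: pack whole sentences up to a word budget."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current, count = [], [], 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk if adding this sentence would exceed the budget
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


text = "RAG retrieves context. It then generates an answer. Chunking matters a lot."
fixed = fixed_size_chunks(text, size=5)
by_sentence = sentence_chunks(text, max_words=8)
```

The fixed-size version cuts through the second sentence, while the sentence-aware version keeps each sentence intact, which is the trade-off the list above describes.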

Retrieval Strategies

Dense retrieval — vector similarity search (most common)
Sparse retrieval — keyword-based (BM25)
Hybrid retrieval — combines dense and sparse for better recall
Re-ranking — a second-pass model scores retrieved chunks for relevance
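
Hybrid retrieval needs some way to merge the dense and sparse result lists. The notes do not name a method; the sketch below uses reciprocal rank fusion (RRF), one common choice, with the conventional constant `k=60`. The document ids are made up for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each hit contributes 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_hits = ["d2", "d1", "d3"]   # hypothetical vector-search ranking
sparse_hits = ["d1", "d4", "d2"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Documents that appear in both lists (`d1`, `d2`) rise to the top, and because RRF uses only ranks it avoids having to normalize cosine scores against BM25 scores.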

RAG vs Fine-Tuning

RAG is preferred when knowledge changes frequently
Fine-tuning is preferred when style, tone, or behavior needs to change
RAG does not require retraining the model; updating knowledge means re-indexing documents
Both can be combined: fine-tuning for style and behavior, RAG for current knowledge

Common Failure Points

Poor chunking leads to loss of context across chunk boundaries
Embedding model mismatch between ingestion and query time
Top-k retrieval too low — relevant context missed
Retrieved chunks irrelevant due to vague queries
LLM ignoring retrieved context and hallucinating anyway
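
The embedding-mismatch failure is cheap to guard against by recording which model built the index and checking it at query time. The class below is a hypothetical sketch, not the API of any particular vector store; the model names in the usage lines are placeholders.

```python
class GuardedIndex:
    """Toy index that remembers which embedding model produced its vectors."""

    def __init__(self, model_name, dimension):
        self.model_name = model_name
        self.dimension = dimension
        self.records = []

    def add(self, text, vector):
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} dims, got {len(vector)}")
        self.records.append((text, vector))

    def validate_query(self, model_name, vector):
        """Raise if the query was embedded differently from the stored chunks."""
        if model_name != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, query used {model_name!r}"
            )
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} dims, got {len(vector)}")


index = GuardedIndex(model_name="model-a", dimension=3)
index.add("refund policy", [1.0, 0.0, 0.0])
index.validate_query("model-a", [0.5, 0.5, 0.0])  # passes silently
```

A dimension check alone is not enough, since two different models can share a dimension while producing incompatible vector spaces; that is why the model name is stored as well.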

Use Cases

Enterprise document Q&A
Customer support over internal knowledge bases
Legal and compliance document search
Medical literature querying
Code documentation assistants

Advanced Patterns

HyDE (Hypothetical Document Embeddings) — generate a hypothetical answer, embed it, then retrieve
FLARE — iteratively retrieves during generation when confidence is low
Agentic RAG — LLM decides when and what to retrieve dynamically
Multi-hop RAG — chains multiple retrievals to answer complex questions
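
As an illustration of the first pattern, here is a toy HyDE sketch. Everything in it is a stand-in: `fake_generate` returns a canned passage instead of calling an LLM, `embed` is a word set rather than a dense vector, and `search` ranks a two-document corpus by word overlap. The point is only the control flow: generate, embed the generated passage, then retrieve.

```python
corpus = [
    "Rayleigh scattering affects short wavelengths most strongly",
    "Ocean tides are driven by the gravity of the moon",
]


def embed(text):
    """Toy embedding: the set of lowercase words."""
    return set(text.lower().split())


def search(query_vec, k):
    """Rank corpus documents by word overlap with the query representation."""
    ranked = sorted(corpus, key=lambda doc: len(query_vec & embed(doc)), reverse=True)
    return ranked[:k]


def fake_generate(prompt):
    """Hypothetical LLM call; returns a fixed passage for this demo."""
    return "Rayleigh scattering of short wavelengths makes the sky look blue"


def hyde_retrieve(query, generate, embed, search, k=1):
    """HyDE: retrieve with the embedding of a generated answer, not the query."""
    hypothetical = generate(f"Write a short passage that answers: {query}")
    return search(embed(hypothetical), k)


query = "Why is the sky blue?"
direct = search(embed(query), 1)
via_hyde = hyde_retrieve(query, fake_generate, embed, search)
```

In this contrived corpus the raw query shares almost no vocabulary with the relevant document, so direct search picks the tides document, while the hypothetical answer overlaps with the Rayleigh scattering document and retrieves it.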