
RAG Explained — Connect Your Data to Any LLM

Core concepts of Retrieval-Augmented Generation: how it works, embeddings, vector databases, and practical implementation tips.

RAG pipeline retrieving documents and feeding them to an AI

Ask ChatGPT about your company's internal documentation and it draws a blank. That data wasn't in its training set. Fine-tuning the model to learn it is expensive and slow. RAG is the practical alternative.

What RAG Is

RAG stands for Retrieval-Augmented Generation. Despite the academic-sounding name, the concept is straightforward.

  1. User asks a question
  2. Relevant documents get retrieved from a knowledge base
  3. Those documents get stuffed into the LLM prompt as context
  4. The LLM answers using that context

Think of it as an open-book exam. The LLM isn't pulling answers from memory — it's reading reference material that you placed next to it.

Why RAG Instead of Fine-Tuning

Fine-tuning and RAG both aim to give an LLM knowledge it doesn't have, but their approaches are completely different.

Fine-tuning modifies the model's weights. You prepare training data, run GPU-intensive retraining, and the model internalizes the knowledge. It's expensive. When data changes, you retrain. Hallucinations are harder to control because the model "knows" the information (or thinks it does).

RAG leaves the model untouched. You just attach a search system. When data changes, update the documents. You can cite sources, making answers traceable. The model answers based on what it reads, not what it memorized.

Some systems use both. But for the majority of "I want my LLM to know about my data" scenarios, RAG wins on cost-effectiveness. It's almost always the right first step.

The Three Core Components

A RAG pipeline breaks into three parts.

1. Document Preprocessing

Raw documents need to be broken into chunks the LLM can digest. PDFs, Word docs, web pages — first convert to plain text, then split into chunks of appropriate size.

Chunk size is typically 500-1000 tokens. Too large and search accuracy drops. Too small and chunks lose context. Overlapping chunks (where the end of one overlaps the start of the next) help prevent information loss at paragraph boundaries.
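As a sketch, an overlapping chunker can be a simple sliding window. This example splits on words as a stand-in for tokens (real pipelines count model tokens with a tokenizer); `chunk_text` and its parameters are illustrative, not from any particular library:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Sliding-window chunker: each chunk repeats the last `overlap`
    words of the previous one, so information at chunk boundaries
    appears in both neighbors."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break
    return chunks
```

A 1,200-word document with these defaults yields three chunks, where the last 50 words of each chunk reappear at the start of the next.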

Raw document → Text extraction → Chunking → Embedding → Vector DB storage

2. Embeddings and Vector DBs

This is the heart of RAG. Converting text into numerical vectors is called "embedding."

"Cat" and "dog" look nothing alike as strings, but in embedding space they're close together — they're semantically similar. "Automobile" is far from both. Embeddings let you calculate semantic similarity as a number.
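A toy sketch of that idea, using made-up 3-dimensional vectors in place of real embeddings (which have hundreds or thousands of dimensions). Cosine similarity is the standard distance measure here:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Invented toy vectors for illustration only -- not real embedding output.
cat = [0.9, 0.8, 0.1]
dog = [0.85, 0.75, 0.2]
car = [0.1, 0.2, 0.9]

# "cat" and "dog" point in nearly the same direction; "car" doesn't.
assert cosine_similarity(cat, dog) > cosine_similarity(cat, car)
```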

Common embedding models include OpenAI's text-embedding-3-small and open-source options like bge-m3. Every chunk gets converted to an embedding and stored in a vector database.

Vector databases are specialized for efficiently searching these vectors. Key options:

  • Pinecone — Managed service, easy setup
  • Weaviate — Supports hybrid search (keyword + vector)
  • Chroma — Lightweight, great for local development
  • pgvector — PostgreSQL extension, leverages existing DB infrastructure

3. Retrieval and Generation

When a user question comes in:

  1. The question gets embedded using the same embedding model
  2. The vector DB returns the k most similar chunks
  3. Those chunks go into the prompt as "context"
  4. The LLM generates an answer referencing that context

A typical prompt template looks like this:

[System prompt]
Answer the question based on the context below.
If the answer isn't in the context, say "I don't know."

[Context]
{retrieved chunk 1}
{retrieved chunk 2}
{retrieved chunk 3}

[Question]
{user question}
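Assembled in code, that template might look like the following (`build_rag_prompt` is an illustrative helper, not a library function):

```python
def build_rag_prompt(chunks, question):
    """Join retrieved chunks into a context block and wrap them in the
    instruction template shown above."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question based on the context below.\n"
        "If the answer isn't in the context, say \"I don't know.\"\n\n"
        f"[Context]\n{context}\n\n"
        f"[Question]\n{question}"
    )
```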

That's the full RAG flow.

Practical Considerations

The concept is straightforward. Making it work well is another story.

Chunking strategy determines answer quality. The same document chunked differently produces very different search results. Splitting by paragraph or section boundaries beats splitting by raw character count. Including titles and metadata in chunks improves search accuracy.

Consider hybrid search. Vector search (semantic) alone struggles with exact keywords and proper nouns. BM25-style keyword search fills the gap. In practice, combining both and merging results (Reciprocal Rank Fusion, etc.) is common.
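Reciprocal Rank Fusion itself is only a few lines. A minimal sketch (k=60 is the constant commonly used in practice; each document scores the sum of 1/(k + rank) across the ranked lists it appears in):

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge multiple ranked lists of document IDs into one list.
    Documents ranked well by several searches float to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Vector search and keyword search disagree; fusion rewards "d1" and
# "d3", which both searches rank highly.
vector_hits = ["d1", "d2", "d3"]
keyword_hits = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```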

Re-ranking improves precision. Pull a generous set of candidates first (say 20), then re-rank with a dedicated model to keep only the top few for context. Cohere Rerank and bge-reranker are popular choices here.
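The retrieve-then-rerank pattern, sketched with generic callables: `search_fn` stands in for a vector-store query and `rerank_score` for a real cross-encoder such as a reranker model (both are hypothetical placeholders here):

```python
def retrieve_then_rerank(query, search_fn, rerank_score,
                         n_candidates=20, top_k=3):
    """Stage 1: cheap vector search pulls a wide candidate set.
    Stage 2: a scoring model re-orders the candidates against the
    query, and only the best few go into the prompt."""
    candidates = search_fn(query, n_candidates)
    ranked = sorted(candidates,
                    key=lambda chunk: rerank_score(query, chunk),
                    reverse=True)
    return ranked[:top_k]
```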

Evaluation is hard. Systematically measuring "are the answers good?" is tricky. Frameworks like RAGAS exist, but domain experts reviewing outputs remains necessary. Automated metrics only go so far.

Context window size matters. Larger context windows (100K+ tokens in Claude, Gemini) might seem like they eliminate the need for RAG — just dump everything in. For small document sets this actually works. But cost scales linearly with input tokens. If you're serving thousands of queries against thousands of documents, RAG's selective retrieval is orders of magnitude cheaper than cramming the full corpus into every prompt.
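A back-of-envelope comparison makes the point. All the numbers below are invented for illustration:

```python
# Assumed workload: 2,000 docs of 800 tokens each, 5,000 queries.
docs, tokens_per_doc = 2_000, 800
queries = 5_000

# Dump the whole corpus into every prompt:
full_corpus_tokens = docs * tokens_per_doc * queries

# RAG: retrieve only the top-5 chunks (~500 tokens each) per query:
rag_tokens = 5 * 500 * queries

ratio = full_corpus_tokens // rag_tokens  # → 640x fewer input tokens
```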

Embedding model choice affects everything. The retrieval quality ceiling is set by your embedding model. text-embedding-3-large outperforms text-embedding-3-small but costs more per call. For open-source alternatives, the MTEB leaderboard tracks benchmark performance. Pick based on your language mix, budget, and latency requirements.

Getting Started

The fastest path to a working RAG prototype is LangChain + Chroma. In Python, you can have something functional in a handful of files.

# Conceptual flow (pseudocode)
documents = load_documents("./docs")
chunks = split_into_chunks(documents, chunk_size=500)
embeddings = embed(chunks, model="text-embedding-3-small")
vector_store = store_in_chroma(chunks, embeddings)

# At query time
query = "What were quarterly revenue trends?"
relevant_chunks = vector_store.search(query, k=3)
answer = llm.generate(context=relevant_chunks, question=query)

Production-grade RAG requires experimenting with chunk strategies, comparing embedding models, adding re-ranking, and building evaluation pipelines. But getting a working prototype first is what matters most. It's the only way to see where the bottlenecks are and what needs improving.

Common Pitfalls

Stuffing too much context. Throwing 20 chunks into the prompt gives the LLM more to read but doesn't necessarily improve answers. Models tend to focus on the beginning and end of long contexts, sometimes missing relevant information in the middle. 3-5 well-chosen chunks usually beat 15 mediocre ones.

Ignoring document structure. A table split across two chunks is useless. A code block broken mid-function is worse. Chunk boundaries should respect the document's logical structure — markdown headers, code fences, paragraph breaks. Tools like Unstructured and LlamaIndex have document-aware splitters for this.
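A simplified sketch of header-aware splitting for markdown. Real document-aware splitters also handle code fences, tables, and nested sections; this one only respects header lines, keeping each header with its body:

```python
def split_on_headers(markdown_text):
    """Split markdown at lines starting with '#', so each chunk is one
    section with its heading attached. (Naive: a '#' inside a code
    fence would also trigger a split.)"""
    sections, current = [], []
    for line in markdown_text.splitlines():
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return sections
```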

Same embedding model for different languages. If your knowledge base has both English and Korean documents, make sure your embedding model handles multilingual content. bge-m3 is specifically designed for this. Using an English-only model on Korean text produces poor retrieval.

No source attribution. Users trust RAG answers more when they can see where the information came from. Always return the source document and chunk alongside the generated answer. This also makes debugging retrieval quality much easier.

Beyond Basic RAG

Once basic RAG works, several patterns can improve it further:

Multi-step retrieval. Instead of one search, the LLM reformulates the query based on initial results and searches again. Useful for complex questions where the answer spans multiple documents.

Agentic RAG. The LLM decides when to search, what to search for, and whether the results are sufficient — rather than following a fixed pipeline. Frameworks like LlamaIndex and LangGraph support this pattern.

Knowledge graphs + RAG. For structured relationships (who reports to whom, which service calls which API), vector search alone struggles. Adding a knowledge graph layer lets you traverse relationships that embedding similarity can't capture.

RAG has become the default pattern for LLM applications. Internal search systems, customer support chatbots, document-based Q&A — it shows up everywhere. Nail the fundamentals and the applications follow naturally.

#RAG #LLM #vector-database #embeddings #AI
