Most RAG tutorials show a simple pipeline: chunk documents, embed them, do a cosine similarity search, pass the top results to the LLM. It works for demos. It breaks in production. The problem is that a single retrieval method has systematic blind spots — dense embeddings miss exact keyword matches, and keyword search misses semantic paraphrases. I built a hybrid retrieval system that combines both, and it finds relevant documents that neither method catches alone.
The Problem: Every Index Has Blind Spots
Consider a document about "Section 8 housing voucher programs." A user asks: "What government rental assistance programs are available?" Dense embeddings will match the semantic meaning — "government rental assistance" is conceptually similar to "housing voucher programs." But if the user asks "Section 8 requirements," a keyword search finds it instantly while the dense index may rank it lower because "Section 8" is a proper noun with no direct semantic relationship to generic terms.
This is the fundamental tradeoff. Dense retrieval captures meaning. Keyword retrieval captures specifics. Production queries need both. The question is how to combine them without the scores from one method drowning out the other.
Architecture: Dual-Index with Reciprocal Rank Fusion
The system maintains two parallel indices over the same document chunks:
- BM25 (Okapi) — The keyword index. Uses term frequency with saturation (k1=1.5) and document length normalization (b=0.75). IDF is calculated as log((N - df + 0.5) / (df + 0.5) + 1.0), which keeps IDF positive even for very common terms. This catches exact matches, proper nouns, and technical terminology.
- TF-IDF Vectors — The semantic ("dense") index. Scikit-learn's TfidfVectorizer with a vocabulary cap of 5,000 features generates the vectors; cosine similarity with L2 normalization handles the ranking. This catches paraphrases, synonyms, and conceptual similarity (a minimal sketch of both sides follows this list).
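As a rough sketch of the two sides: the TfidfIndex class below follows the scikit-learn setup described above, while bm25_idf only illustrates the IDF formula from the BM25 bullet rather than a full scorer. The names are mine, not the project's.

import math
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def bm25_idf(df, num_docs):
    # +1.0 inside the log keeps IDF positive even for very common terms
    return math.log((num_docs - df + 0.5) / (df + 0.5) + 1.0)

class TfidfIndex:
    """Semantic side: TF-IDF vectors ranked by cosine similarity."""
    def __init__(self, max_features=5000):
        # norm="l2" (the default) makes a dot product equal cosine similarity
        self.vectorizer = TfidfVectorizer(max_features=max_features)

    def fit(self, chunks):
        self.matrix = self.vectorizer.fit_transform(chunks)

    def search(self, query, top_k=5):
        query_vec = self.vectorizer.transform([query])
        sims = (self.matrix @ query_vec.T).toarray().ravel()
        order = np.argsort(-sims)[:top_k]
        return [(int(i), float(sims[i])) for i in order]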
Both indices return ranked results for every query. The challenge is combining them. You can't simply average the scores — BM25 scores and cosine similarities are on completely different scales with different distributions. A BM25 score of 12.7 and a cosine similarity of 0.83 aren't comparable.
Reciprocal Rank Fusion: The Key Insight
Instead of comparing raw scores, Reciprocal Rank Fusion (RRF) compares positions. Each result gets a score based on where it appears in each ranked list:
# Reciprocal Rank Fusion
# k=60 balances early-rank sensitivity
def reciprocal_rank_fusion(ranked_lists, k=60, top_k=5):
    scores = {}
    for ranked_list in ranked_lists:
        # ranks are 1-based, so the top hit in a list scores 1/(k + 1)
        for rank, (chunk_id, _) in enumerate(ranked_list, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0)
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]
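For example, fusing two already-ranked lists (the chunk IDs and scores here are made up for illustration):

bm25_results = [("c12", 12.7), ("c03", 9.1), ("c41", 7.4)]
tfidf_results = [("c03", 0.83), ("c12", 0.79), ("c27", 0.61)]
print(reciprocal_rank_fusion([bm25_results, tfidf_results], top_k=3))
# c03 and c12 appear in both lists, so they outrank the single-list hits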
The k=60 constant controls how much an early rank matters versus a late rank. A document ranked #1 in one index gets a score of 1/61 = 0.0164. Ranked #10, it gets 1/70 = 0.0143. The falloff is gentle — being ranked 10th is only slightly worse than 1st. This means a document that ranks well in both indices reliably outscores a document that ranks #1 in one index but doesn't appear in the other.
One implementation detail that matters: each index retrieves 2x the requested number of results before fusion. If you want the top 5, each index returns 10. This ensures that a document ranked #8 in one index and #3 in the other still gets considered, rather than being cut off.
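Putting the pieces together, the hybrid search step might look like the sketch below. The .search methods are assumed to return (chunk_id, score) lists as in the index sketch earlier; the names are illustrative, not the project's exact API.

def hybrid_search(query, bm25_index, tfidf_index, top_k=5):
    # Over-retrieve 2x from each index so near-misses survive fusion
    candidate_k = top_k * 2
    bm25_results = bm25_index.search(query, candidate_k)    # [(chunk_id, score), ...]
    tfidf_results = tfidf_index.search(query, candidate_k)  # [(chunk_id, score), ...]
    return reciprocal_rank_fusion([bm25_results, tfidf_results], top_k=top_k)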
Chunking: Sentence Boundaries Over Fixed Splits
Chunk quality directly determines retrieval quality. I use 500-character chunks with 50-character overlap and sentence-aware boundaries. The algorithm tries to break at sentence-ending periods first, then paragraph breaks, then line breaks, then spaces. It never breaks below half the chunk size to avoid creating fragments.
# Smart boundary detection (priority order)
def _find_break_point(text, chunk_size=500):
    separators = [". ", "\n\n", "\n", " "]
    min_break = chunk_size // 2  # 250 chars minimum
    for sep in separators:
        break_point = text.rfind(sep, min_break, chunk_size)
        if break_point != -1:
            return break_point + len(sep)
    return chunk_size  # no separator found: hard cut at the chunk limit
The 50-character overlap ensures that a question about content at a chunk boundary still finds both neighboring chunks. Without overlap, a sentence split across two chunks becomes invisible to retrieval.
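A sketch of the surrounding chunking loop, assuming the _find_break_point helper above; the loop structure is an illustration of the 500/50 scheme, not the exact implementation:

def chunk_text(text, chunk_size=500, overlap=50):
    chunks = []
    start = 0
    while start < len(text):
        window = text[start:start + chunk_size]
        if len(window) < chunk_size:
            chunks.append(window)  # last piece: take it whole
            break
        end = _find_break_point(window, chunk_size)
        chunks.append(window[:end])
        # step back by the overlap so boundary sentences land in both chunks
        start += max(end - overlap, 1)
    return chunks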
Lazy Evaluation: Don't Re-Embed on Every Query
A common RAG mistake is rebuilding the index on every query. The pipeline uses a dirty flag pattern: embeddings and indices are recomputed only when documents change, not when questions are asked.
def ask(self, question, top_k=5):
    self._ensure_fitted()  # Only rebuilds if _dirty=True
    results = self.retriever.search(question, top_k)
    context = self._build_context(results)  # Max 4,000 chars
    return self.answer_generator.generate(question, context, results)
The first query after document ingestion pays the cost of fitting the TF-IDF vocabulary and building both indices. Every subsequent query skips straight to retrieval. On a corpus of 100 documents, this reduces query latency from ~800ms (with refitting) to ~15ms (retrieval only).
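A minimal sketch of the dirty-flag side; the attribute and method names are assumptions, not the exact codebase:

def add_documents(self, docs):
    # Ingestion only chunks and stores text; nothing is fitted here
    self.chunks.extend(self.chunker.chunk_all(docs))
    self._dirty = True

def _ensure_fitted(self):
    if not self._dirty:
        return
    # Fit the TF-IDF vocabulary and rebuild both indices exactly once
    self.retriever.fit(self.chunks)
    self._dirty = False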
Prompt Engineering Lab: A/B Testing for RAG Prompts
Different prompt templates produce dramatically different answers from the same retrieved context. "Answer concisely with citations" versus "Provide a detailed analysis with supporting evidence" changes the character of the answer and can swing token consumption by 2-3x.
The system includes 5 built-in prompt templates and an A/B comparison mode. For any question, you can run two templates side-by-side against the same retrieval results and compare the outputs. The cost tracker records token usage per template, so you get hard numbers on the quality-cost tradeoff rather than guessing.
# Built-in prompt templates
qa_concise # Fact lookup: brief answer + citations
qa_detailed # Research: thorough analysis + sources
summarize # Key points extraction
extract_facts # Structured bullet-point facts
compare # Cross-source comparison
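A rough sketch of the A/B mode, assuming a PROMPT_TEMPLATES dict keyed by the names above and a generic LLM client; none of these names are the project's exact API:

def compare_templates(pipeline, question, name_a="qa_concise", name_b="qa_detailed", top_k=5):
    # One retrieval pass shared by both templates, so only the prompt varies
    results = pipeline.retriever.search(question, top_k)
    context = pipeline._build_context(results)
    answers = {}
    for name in (name_a, name_b):
        prompt = PROMPT_TEMPLATES[name].format(question=question, context=context)
        answers[name] = pipeline.llm.complete(prompt)  # hypothetical client; cost tracker records tokens per call
    return answers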
Results
- 5 formats supported: PDF, DOCX, TXT, MD, CSV
- 0 LLM calls for ingestion & retrieval
Limitations and Tradeoffs
- TF-IDF is not a true embedding model. Using TF-IDF cosine similarity as the "dense" index captures more semantic signal than BM25 alone, but less than a transformer-based embedding model like all-MiniLM-L6-v2. The tradeoff is no external model dependency and fast startup — no downloading a 90MB model on first run (a sketch of that swap follows this list).
- RRF loses score magnitude. By converting to ranks, you lose how much better the #1 result is versus #2. If one result is overwhelmingly more relevant, RRF treats it the same as a marginal lead. For most document QA workloads this doesn't matter, but for tasks where confidence calibration is important, you'd want a learned fusion model instead.
- 500-character chunks are opinionated. Legal documents, code files, and scientific papers all have different optimal chunk sizes. The system uses a single chunk size across all documents, which is a simplification. Per-format chunking would improve retrieval quality for mixed corpora.
- No incremental index updates. Adding one document rebuilds both indices from scratch. For corpora under 10,000 chunks this completes in under a second. Beyond that, incremental index maintenance (e.g., streaming BM25 updates) would be worth implementing.
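For reference, the embedding-model upgrade mentioned in the first limitation is a contained change. A minimal sketch using the sentence-transformers package, as a possible replacement for the TF-IDF index rather than anything in the current system:

import numpy as np
from sentence_transformers import SentenceTransformer  # would become a new dependency

class EmbeddingIndex:
    """Hypothetical drop-in replacement for the TF-IDF index."""
    def __init__(self, model_name="all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)  # ~90MB download on first use

    def fit(self, chunks):
        # normalize_embeddings=True makes dot products equal cosine similarities
        self.matrix = self.model.encode(chunks, normalize_embeddings=True)

    def search(self, query, top_k=5):
        query_vec = self.model.encode([query], normalize_embeddings=True)[0]
        sims = self.matrix @ query_vec
        order = np.argsort(-sims)[:top_k]
        return [(int(i), float(sims[i])) for i in order]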
Try It Yourself
The full implementation is open source with a live demo on Streamlit Cloud.
For performance benchmarks across all projects, see the benchmarks page.