Blog

Technical writeups from building production AI systems.

Multi-Agent Architecture · February 8, 2026 · 10 min read

Multi-Agent Orchestration: Building Reliable Cross-Bot Handoffs

Everyone focuses on making individual AI bots smarter. Better prompts, finer-tuned intent detection, richer context windows. That work matters — but when you run multiple specialized bots in production, the hardest engineering problem is not inside any single bot. It is the moment one bot decides a conversation belongs to another bot and tries to hand it off mid-flow.

I built the orchestration layer for EnterpriseHub, a real estate AI platform with three specialized chatbots. This post covers the full orchestration stack: cross-bot handoffs with circular-handoff prevention and rate limiting, pattern learning for dynamic threshold adjustment, A/B testing for handoff optimization, and a 7-rule alerting system with 3-level escalation.

  • 0 circular handoffs in production
  • 4 layers of safety checks per handoff
  • 7 default alert conditions
  • 3-level escalation policy

Read the full post →

RAG · February 2026

Why Single-Index RAG Fails and How Hybrid Retrieval Fixes It

Most RAG tutorials show a simple pipeline: chunk documents, embed them, do a cosine similarity search, pass the top results to the LLM. It works for demos. It breaks in production. The problem is that a single retrieval method has systematic blind spots — dense embeddings miss exact keyword matches, and keyword search misses semantic paraphrases. I built a hybrid retrieval system that combines both, and it finds relevant documents that neither method catches alone.

The Problem: Every Index Has Blind Spots

Consider a document about "Section 8 housing voucher programs." A user asks: "What government rental assistance programs are available?" Dense embeddings will match the semantic meaning — "government rental assistance" is conceptually similar to "housing voucher programs." But if the user asks "Section 8 requirements," a keyword search finds it instantly while the dense index may rank it lower because "Section 8" is a proper noun with no direct semantic relationship to generic terms.

This is the fundamental tradeoff. Dense retrieval captures meaning. Keyword retrieval captures specifics. Production queries need both. The question is how to combine them without the scores from one method drowning out the other.

Architecture: Dual-Index with Reciprocal Rank Fusion

The system maintains two parallel indices over the same document chunks:

  • BM25 (Okapi) — The keyword index. Uses term frequency with saturation (k1=1.5) and document length normalization (b=0.75). IDF is calculated as log((N - df + 0.5) / (df + 0.5) + 1.0) to prevent zero scores on common terms. This catches exact matches, proper nouns, and technical terminology.
  • TF-IDF Dense Vectors — The semantic index. Scikit-learn's TfidfVectorizer with a vocabulary cap of 5,000 features generates dense vectors. Cosine similarity with L2 normalization handles the ranking. This catches paraphrases, synonyms, and conceptual similarity.
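
To make the split concrete, here is a minimal sketch of the two sides, assuming plain scikit-learn: the bm25_idf helper implements the IDF formula above, and the DenseIndex class is an illustrative stand-in for the repo's actual index classes, not their exact API.

import math
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def bm25_idf(df, n_docs):
    # IDF with a +1.0 floor so very common terms never get a zero score
    return math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)

class DenseIndex:
    """TF-IDF 'semantic' side: 5,000-feature vocabulary, cosine ranking."""
    def __init__(self, max_features=5000):
        self.vectorizer = TfidfVectorizer(max_features=max_features)  # L2-normalized by default
        self.matrix = None

    def fit(self, chunk_texts):
        self.matrix = self.vectorizer.fit_transform(chunk_texts)

    def search(self, query, top_k=10):
        query_vec = self.vectorizer.transform([query])
        sims = cosine_similarity(query_vec, self.matrix).ravel()
        top = sims.argsort()[::-1][:top_k]
        return [(int(i), float(sims[i])) for i in top]  # (chunk_id, score) pairs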

Both indices return ranked results for every query. The challenge is combining them. You can't simply average the scores — BM25 scores and cosine similarities are on completely different scales with different distributions. A BM25 score of 12.7 and a cosine similarity of 0.83 aren't comparable.

Reciprocal Rank Fusion: The Key Insight

Instead of comparing raw scores, Reciprocal Rank Fusion (RRF) compares positions. Each result gets a score based on where it appears in each ranked list:

# Reciprocal Rank Fusion
# k=60 balances early-rank sensitivity

def reciprocal_rank_fusion(ranked_lists, k=60, top_k=5):
    scores = {}
    for ranked_list in ranked_lists:
        # Ranks start at 1, so the best hit in a list scores 1 / (k + 1)
        for rank, (chunk_id, _) in enumerate(ranked_list, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)[:top_k]

The k=60 constant controls how much an early rank matters versus a late rank. A document ranked #1 in one index gets a score of 1/61 = 0.0164. Ranked #10, it gets 1/70 = 0.0143. The falloff is gentle — being ranked 10th is only slightly worse than 1st. This means a document that ranks well in both indices reliably outscores a document that ranks #1 in one index but doesn't appear in the other.

One implementation detail that matters: each index retrieves 2x the requested number of results before fusion. If you want the top 5, each index returns 10. This ensures that a document ranked #8 in one index and #3 in the other still gets considered, rather than being cut off.
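
Putting that together, the retrieval path looks roughly like the sketch below; bm25_index and dense_index are assumed to expose the search interface from the earlier sketch, and the method name is illustrative.

def hybrid_search(self, query, top_k=5):
    fetch_k = top_k * 2  # over-fetch so a #8-in-one-index document still reaches fusion
    bm25_results = self.bm25_index.search(query, top_k=fetch_k)
    dense_results = self.dense_index.search(query, top_k=fetch_k)
    return reciprocal_rank_fusion([bm25_results, dense_results], top_k=top_k)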

Chunking: Sentence Boundaries Over Fixed Splits

Chunk quality directly determines retrieval quality. I use 500-character chunks with 50-character overlap and sentence-aware boundaries. The algorithm tries to break at sentence-ending periods first, then paragraph breaks, then line breaks, then spaces. It never breaks below half the chunk size to avoid creating fragments.

# Smart boundary detection (priority order), wrapped as an illustrative helper
def find_break_point(text, chunk_size=500):
    separators = [". ", "\n\n", "\n", " "]  # sentence end, paragraph, line break, space
    min_break = chunk_size // 2             # 250 chars minimum

    for sep in separators:
        break_point = text.rfind(sep, min_break, chunk_size)
        if break_point != -1:
            return break_point + len(sep)
    return chunk_size                       # no separator found: hard split at the limit

The 50-character overlap ensures that a question about content at a chunk boundary still finds both neighboring chunks. Without overlap, a sentence split across two chunks becomes invisible to retrieval.
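
To show where the overlap enters, here is a hedged sketch of the chunking loop, reusing the find_break_point helper above; the real chunker handles edge cases this version skips.

def chunk_text(text, chunk_size=500, overlap=50):
    chunks, start = [], 0
    while start < len(text):
        window = text[start:start + chunk_size]
        if len(window) < chunk_size:
            chunks.append(window)       # final fragment, no boundary search needed
            break
        end = find_break_point(window, chunk_size)
        chunks.append(window[:end])
        start += end - overlap          # step back 50 chars so boundary sentences land in both chunks
    return chunks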

Lazy Evaluation: Don't Re-Embed on Every Query

A common RAG mistake is rebuilding the index on every query. The pipeline uses a dirty flag pattern: embeddings and indices are recomputed only when documents change, not when questions are asked.

def ask(self, question, top_k=5):
    self._ensure_fitted() # Only rebuilds if _dirty=True
    results = self.retriever.search(question, top_k)
    context = self._build_context(results) # Max 4,000 chars
    return self.answer_generator.generate(question, context, results)

The first query after document ingestion pays the cost of fitting the TF-IDF vocabulary and building both indices. Every subsequent query skips straight to retrieval. On a corpus of 100 documents, this reduces query latency from ~800ms (with refitting) to ~15ms (retrieval only).
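
A minimal sketch of the dirty-flag mechanics behind _ensure_fitted; everything other than that method name is an illustrative stand-in.

def add_documents(self, docs):
    self._documents.extend(docs)
    self._dirty = True                  # mark indices stale, but don't rebuild yet

def _ensure_fitted(self):
    if not self._dirty:
        return                          # indices are current: skip straight to retrieval
    chunks = [c for doc in self._documents for c in chunk_text(doc)]
    self.bm25_index.fit(chunks)         # rebuild keyword index
    self.dense_index.fit(chunks)        # refit TF-IDF vocabulary and vectors
    self._dirty = False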

Prompt Engineering Lab: A/B Testing for RAG Prompts

Different prompt templates produce dramatically different answers from the same retrieved context. "Answer concisely with citations" versus "Provide a detailed analysis with supporting evidence" can change both answer quality and token consumption by 2-3x.

The system includes 5 built-in prompt templates and an A/B comparison mode. For any question, you can run two templates side-by-side against the same retrieval results and compare the outputs. The cost tracker records token usage per template, so you get hard numbers on the quality-cost tradeoff rather than guessing.

# Built-in prompt templates
qa_concise    # Fact lookup: brief answer + citations
qa_detailed   # Research: thorough analysis + sources
summarize     # Key points extraction
extract_facts # Structured bullet-point facts
compare       # Cross-source comparison
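
A hedged sketch of what the A/B mode does for a single question; the template registry, LLM call, and cost-tracker fields below are illustrative names rather than the exact API.

def compare_templates(self, question, template_a, template_b, top_k=5):
    # Both templates see exactly the same retrieval results
    results = self.retriever.search(question, top_k)
    context = self._build_context(results)
    outputs = {}
    for name in (template_a, template_b):
        prompt = self.prompt_templates[name].format(question=question, context=context)
        response = self.llm.complete(prompt)                        # simplified generator call
        self.cost_tracker.record(template=name, tokens=response.usage.total_tokens)
        outputs[name] = response.text
    return outputs  # e.g. compare_templates(q, "qa_concise", "qa_detailed")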

Results

  • 94 tests passing
  • 5 document formats: PDF, DOCX, TXT, MD, CSV
  • 0 LLM calls for ingestion and retrieval

Limitations and Tradeoffs

  • TF-IDF is not a true embedding model. Using TF-IDF cosine similarity as the "dense" index captures more semantic signal than BM25 alone, but less than a transformer-based embedding model like all-MiniLM-L6-v2. The tradeoff is zero external dependencies and fast startup — no downloading 90MB models on first run.
  • RRF loses score magnitude. By converting to ranks, you lose how much better the #1 result is versus #2. If one result is overwhelmingly more relevant, RRF treats it the same as a marginal lead. For most document QA workloads this doesn't matter, but for tasks where confidence calibration is important, you'd want a learned fusion model instead.
  • 500-character chunks are opinionated. Legal documents, code files, and scientific papers all have different optimal chunk sizes. The system uses a single chunk size across all documents, which is a simplification. Per-format chunking would improve retrieval quality for mixed corpora.
  • No incremental index updates. Adding one document rebuilds both indices from scratch. For corpora under 10,000 chunks this completes in under a second. Beyond that, incremental index maintenance (e.g., streaming BM25 updates) would be worth implementing.

Try It Yourself

The full implementation is open source with a live demo on Streamlit Cloud; see the repository for the relevant files.

For performance benchmarks across all projects, see the benchmarks page.

LLMOps · February 2026

How I Reduced LLM Token Costs by 89% Without Changing Models

When I started building EnterpriseHub — a real estate AI platform with 3 specialized chatbots — each lead qualification workflow consumed 93,000 tokens. At Claude's pricing, that adds up fast. After three rounds of optimization, I got it down to 7,800 tokens per workflow. Here's exactly what I did.

The Problem: Every Call Sends Everything

The naive approach is simple: stuff the full conversation history, system prompt, user profile, and context into every API call. It works. It's also wasteful. Most of that context is irrelevant to the specific question being answered. Worse, the same prompts get re-sent on every interaction — identical inputs producing identical outputs, billed every time.

Technique 1: Three-Tier Caching (~60% of savings)

The single biggest win was caching. Most LLM calls in a business application are repetitive — the same classification, the same FAQ answer, the same scoring rubric applied to similar inputs.

I built a three-tier cache:

  • L1 (In-Memory Dict) — Per-request scope. If the same prompt hits the system twice within a single request lifecycle, the second call costs zero tokens and returns in <1ms. This catches the surprisingly common case of redundant calls within a single orchestration flow.
  • L2 (Redis with TTL) — Cross-request, shared across all bot instances. Lead qualification questions, market data lookups, and template responses all hit here. TTL tuned per query type: 5 minutes for volatile data, 1 hour for stable templates. Lookup time: ~2ms.
  • L3 (PostgreSQL) — Persistent fallback. When Redis is unavailable (restart, network blip), the system degrades gracefully to database-backed cache rather than hitting the API.

The overall cache hit rate stabilized at 87%. That means 87 out of 100 LLM calls never reach the API at all.

# From claude_orchestrator.py
# In-process cache for memory context (avoids repeated fetches within a request)
self._memory_context_cache: Dict[str, Any] = {}

# Cache check before LLM call
cache_key = f"mem_ctx:{lead_id}"
cached = self._memory_context_cache.get(cache_key)
if cached is not None:
    memory_context = cached # Zero tokens, <1ms
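
The L2 and L3 tiers wrap the same lookup with graceful degradation. Here is a hedged sketch of the full path, assuming a redis-py client and a small Postgres-backed cache table; the _request_cache dict and the _db_cache / _call_llm helpers are illustrative names.

import json

def cached_completion(self, cache_key, prompt, ttl_seconds=300):
    # L1: per-request dict, free and <1ms
    if cache_key in self._request_cache:
        return self._request_cache[cache_key]

    # L2: Redis, shared across bot instances (~2ms)
    try:
        hit = self.redis.get(cache_key)
        if hit is not None:
            value = json.loads(hit)
            self._request_cache[cache_key] = value
            return value
    except Exception:
        pass  # Redis down or flaky: degrade to L3 instead of failing

    # L3: Postgres-backed cache, then the API as a last resort
    value = self._db_cache_get(cache_key)
    if value is None:
        value = self._call_llm(prompt)          # the only path that costs tokens
        self._db_cache_set(cache_key, value, ttl_seconds)
    try:
        self.redis.setex(cache_key, ttl_seconds, json.dumps(value))
    except Exception:
        pass
    self._request_cache[cache_key] = value
    return value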

Technique 2: Context Window Optimization (~25% of savings)

Instead of dumping the full conversation history into every call, I built a sliding window that keeps only what the model actually needs. The ClaudeOrchestrator tracks which conversation turns are relevant to the current task and trims everything else.

The result: 2.3x more efficient context usage. The model gets the same quality of context in less than half the tokens. This matters especially for long conversations — a 20-turn qualification flow was sending all 20 turns on every call, when typically only the last 3-5 are relevant.
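
A minimal sketch of the windowing idea, assuming each turn is a dict with a text field; the production orchestrator scores per-turn relevance rather than simply keeping the tail, and it summarizes what it drops.

def build_context_window(turns, keep_last=5, summary_chars=400):
    recent = turns[-keep_last:]                 # the turns the model usually needs
    dropped = turns[:-keep_last]
    window = []
    if dropped:
        # Compress older turns into a short summary instead of resending them verbatim
        summary = " ".join(t["text"] for t in dropped)[:summary_chars]
        window.append({"role": "system", "text": f"Earlier in this conversation: {summary}"})
    window.extend(recent)
    return window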

Technique 3: Model Routing by Task Complexity (~15% of savings)

Not every query needs the most capable (and expensive) model. The LLMClient accepts a TaskComplexity parameter that routes requests to the appropriate model:

from enum import Enum

class TaskComplexity(Enum):
    ROUTINE = "routine"         # Simple classification, template fill
    STANDARD = "standard"       # Default complexity
    HIGH_STAKES = "high_stakes" # Complex reasoning, revenue-impacting

def _get_routed_model(self, complexity):
    if complexity == TaskComplexity.ROUTINE:
        return self.fast_model      # Cheaper, faster
    if complexity == TaskComplexity.HIGH_STAKES:
        return self.premium_model   # Full capability
    return self.default_model       # STANDARD falls back to the default model (attribute name illustrative)

Simple tasks like "Is this a buyer or seller?" go to the fast model. Complex tasks like "Generate a personalized market analysis for this lead" go to the premium model. The router adds <50ms of overhead.
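
Call sites pass the complexity explicitly. The complete() signature below is illustrative, but the split mirrors the examples above.

# "Is this a buyer or seller?" -> fast model
intent = llm_client.complete(classify_prompt, complexity=TaskComplexity.ROUTINE)

# "Generate a personalized market analysis for this lead" -> premium model
report = llm_client.complete(analysis_prompt, complexity=TaskComplexity.HIGH_STAKES)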

Results

  • 93K → 7.8K tokens per workflow
  • 87% cache hit rate
  • <200ms orchestrator overhead

Limitations and Tradeoffs

  • Cache invalidation is hard. Stale cached responses are worse than no cache. I use time-based TTLs rather than event-based invalidation because it's simpler and the staleness window is acceptable for my use case.
  • Context windowing can lose information. The sliding window occasionally drops a relevant early turn. I mitigate this with a summary of dropped context, but it's an imperfect solution.
  • Model routing requires maintenance. As model pricing and capabilities change, the routing logic needs updating. What was "routine" for one model version may not be for the next.
  • These numbers are specific to my workload. Lead qualification has high repetition (similar questions, similar leads). Applications with more unique queries will see smaller cache hit rates.

Try It Yourself

The full implementation is open source; see the repository for the relevant files.

For the full benchmark data, see the benchmarks page.