The RAG Reality Check
Retrieval-Augmented Generation has become the default architecture for grounding LLMs in proprietary data. The concept is deceptively simple: embed your documents, store them in a vector database, retrieve relevant chunks at query time, and feed them to the model as context. In a demo, this works beautifully. In production, with millions of documents, ambiguous queries, and zero tolerance for hallucination, it falls apart.
After deploying RAG systems across financial services, legal, and healthcare clients, we've catalogued the failure modes that separate a convincing prototype from a production-grade knowledge system. The problems are rarely about the LLM itself. They live in the unglamorous plumbing: how you chunk, how you embed, how you retrieve, and how you evaluate.
Chunking Strategy: The Foundation Nobody Gets Right
Most teams default to fixed-size chunking — split every document into 512-token blocks with some overlap. This is fast to implement and catastrophically bad for structured documents. A financial report's risk disclosure gets sliced mid-sentence and merged with an unrelated table. A legal contract's indemnification clause loses its conditional preamble.
We use a hybrid chunking strategy. First, structural parsing extracts semantic boundaries: headings, sections, list items, table rows. Then, within those boundaries, we apply recursive character splitting, with overlap confined to a single structural unit so that no chunk straddles a boundary. For tables, we serialize each row as a self-contained statement with the column headers preserved. The result is chunks that are semantically coherent: each one makes sense on its own without surrounding context.
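A minimal sketch of this pipeline, assuming Markdown-style headings as the structural boundary signal (a real parser would also handle lists and nested sections; all names here are illustrative):

```python
import re

def structural_sections(doc: str) -> list[str]:
    # Split before heading lines ("# ", "## ", ...) so each section keeps
    # its heading. Assumption: headings mark the structural boundaries.
    parts = re.split(r"(?m)^(?=#{1,6} )", doc)
    return [p.strip() for p in parts if p.strip()]

def recursive_split(text: str, max_chars: int = 400) -> list[str]:
    # Recursively split on progressively finer separators (paragraph,
    # line, sentence, word) until every chunk fits the size budget.
    if len(text) <= max_chars:
        return [text]
    for sep in ("\n\n", "\n", ". ", " "):
        pieces = text.split(sep)
        if len(pieces) > 1:
            chunks, current = [], ""
            for piece in pieces:
                candidate = (current + sep + piece) if current else piece
                if len(candidate) <= max_chars:
                    current = candidate
                else:
                    if current:
                        chunks.append(current)
                    current = piece
            if current:
                chunks.append(current)
            # Recurse in case a single piece still exceeds the budget.
            return [c for chunk in chunks for c in recursive_split(chunk, max_chars)]
    # Last resort for text with no separators at all.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

def serialize_table_row(headers: list[str], row: list[str]) -> str:
    # Each table row becomes a self-contained statement with headers kept.
    return "; ".join(f"{h}: {v}" for h, v in zip(headers, row))
```

A row like `["2023", "$1.2M"]` under headers `["Year", "Revenue"]` serializes to `"Year: 2023; Revenue: $1.2M"`, which remains meaningful when retrieved in isolation.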
The impact is measurable. In a benchmark across 12,000 financial documents, switching from fixed-size to semantic chunking improved retrieval precision@5 from 0.67 to 0.89 with zero changes to the embedding model or retrieval logic.
Embedding Selection and Fine-Tuning
Off-the-shelf embedding models like OpenAI's text-embedding-3-large or Cohere's embed-v3 are good general-purpose encoders. But 'general purpose' means they encode semantic similarity broadly — they don't understand that in your domain, 'material adverse change' and 'MAC clause' are the same concept, while 'interest rate' in a mortgage document and 'interest rate' in a central bank report carry very different retrieval intent.
We fine-tune embedding models on domain-specific query-document pairs harvested from actual user interactions. The training loop is straightforward: take real queries that your system received, pair them with the documents that human reviewers marked as relevant, and fine-tune with contrastive loss. Even 2,000 curated pairs can shift recall@10 by 15-20 points in specialized domains.
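The objective behind that training loop is in-batch-negatives contrastive loss: each query is pushed toward its reviewer-approved document and away from every other document in the batch. A NumPy sketch of the loss itself (in practice the embeddings come from the model being fine-tuned, via a library such as sentence-transformers):

```python
import numpy as np

def in_batch_contrastive_loss(query_emb: np.ndarray, doc_emb: np.ndarray,
                              temperature: float = 0.05) -> float:
    # Row i of query_emb pairs with row i of doc_emb (the curated positive);
    # every other document in the batch serves as a negative.
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    logits = (q @ d.T) / temperature          # scaled cosine similarities
    # Numerically stable log-softmax over each row.
    m = logits.max(axis=1, keepdims=True)
    log_probs = logits - (m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True)))
    # Cross-entropy: the matching document sits on the diagonal.
    return float(-np.mean(np.diag(log_probs)))
```

The loss is near zero when each query is most similar to its own document and grows as mismatched documents score higher, which is exactly the gradient signal that teaches the encoder domain-specific equivalences like "MAC clause" and "material adverse change".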
For deployment, we quantize the fine-tuned model to int8 and serve it behind a batched inference endpoint. Embedding latency stays under 15ms for single queries, and throughput handles 500+ documents per second for bulk ingestion.
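The serving infrastructure is out of scope here, but the quantization step can be sketched. This is symmetric per-tensor int8 quantization, the simplest variant (an assumption: production setups often quantize per-channel for better accuracy):

```python
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    # Map the float range [-max|w|, +max|w|] onto [-127, 127] with a
    # single scale factor for the whole tensor.
    scale = float(np.abs(weights).max()) / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    # Recover approximate float weights; rounding error is at most scale/2.
    return q.astype(np.float32) * scale
```

Storing int8 weights plus one float scale cuts memory roughly 4x versus float32, which is where the latency and throughput headroom comes from.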
Retrieval: Beyond Naive Vector Search
Pure vector similarity search is a blunt instrument. A query like 'What are the termination provisions for the Acme contract?' requires both semantic understanding and entity-level filtering. Vector search alone might return termination clauses from the wrong contract or generic legal definitions that are semantically similar but factually useless.
Our retrieval stack combines three strategies. First, dense retrieval using the fine-tuned embedding model for semantic matching. Second, sparse retrieval using BM25 over the original text for keyword precision — critical for entity names, dates, and numerical references. Third, metadata filtering to scope retrieval to the correct document set, time range, or entity before any similarity computation happens.
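The sparse leg is the standard Okapi BM25 scoring function. A minimal in-memory version, assuming whitespace tokenization (production systems use a search engine or a library such as rank_bm25 rather than this sketch):

```python
import math
from collections import Counter

class BM25:
    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.docs = [d.lower().split() for d in docs]
        self.k1, self.b = k1, b
        self.N = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.N
        # Document frequency: how many documents contain each term.
        self.df = Counter(t for d in self.docs for t in set(d))

    def score(self, query: str, idx: int) -> float:
        doc = self.docs[idx]
        tf = Counter(doc)
        total = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue
            # Smoothed inverse document frequency.
            idf = math.log((self.N - self.df[term] + 0.5) / (self.df[term] + 0.5) + 1)
            # Term-frequency saturation (k1) and length normalization (b).
            denom = tf[term] + self.k1 * (1 - self.b + self.b * len(doc) / self.avgdl)
            total += idf * tf[term] * (self.k1 + 1) / denom
        return total
```

Because BM25 matches exact tokens, entity names like "Acme" and literal dates score precisely even when an embedding model would blur them into nearby concepts.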
These three signals are fused using a learned re-ranker — a cross-encoder model that takes the query and each candidate chunk as a pair and produces a relevance score. The re-ranker is the most expensive component per query, so we apply it only to the top 20 candidates from the initial retrieval stage. Final context is assembled from the top 5 re-ranked chunks, with deduplication and ordering by document position.
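Putting the stages together (metadata scoping, dense and sparse candidate generation, cross-encoder re-ranking, and position-ordered assembly) can be sketched as follows, with toy scoring callables standing in for the real models; every name here is illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Chunk:
    doc_id: str
    position: int   # order within the source document
    text: str

Scorer = Callable[[str, str], float]

def retrieve(query: str, chunks: list[Chunk],
             dense_score: Scorer, sparse_score: Scorer,
             allowed_docs: set[str], cross_encoder_score: Scorer,
             k_candidates: int = 20, k_final: int = 5) -> list[Chunk]:
    # 1. Metadata filter BEFORE any similarity computation.
    scoped = [c for c in chunks if c.doc_id in allowed_docs]
    # 2. Union of dense and sparse candidates, top-k from each signal.
    by_dense = sorted(scoped, key=lambda c: dense_score(query, c.text), reverse=True)
    by_sparse = sorted(scoped, key=lambda c: sparse_score(query, c.text), reverse=True)
    # dict.fromkeys deduplicates while preserving order.
    candidates = list(dict.fromkeys(
        by_dense[:k_candidates] + by_sparse[:k_candidates]))[:k_candidates]
    # 3. The expensive cross-encoder scores only the candidate pool.
    reranked = sorted(candidates,
                      key=lambda c: cross_encoder_score(query, c.text), reverse=True)
    # 4. Assemble final context, ordered by document position.
    return sorted(reranked[:k_final], key=lambda c: (c.doc_id, c.position))
```

The key design point the sketch preserves is that the cross-encoder, the costliest model per query, never sees more than `k_candidates` chunks.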
Evaluation: Closing the Loop
The most dangerous RAG system is one that looks like it's working. Without rigorous evaluation, you're flying blind — accuracy degrades as document corpora grow, and hallucinations become harder to spot as responses become more fluent.
We run a three-tier evaluation framework. Tier 1: automated metrics computed nightly on a golden dataset of 500+ question-answer pairs. We measure retrieval precision, answer faithfulness (does the answer follow from the retrieved context?), and answer completeness. Tier 2: weekly adversarial testing with queries designed to trigger known failure modes — ambiguous entities, temporal references, negation, and multi-hop reasoning. Tier 3: monthly human evaluation where domain experts score a random sample of production responses.
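The Tier 1 retrieval metric is simple to compute from the golden dataset. Faithfulness normally requires an LLM judge, so the token-overlap check below is only a crude lexical proxy (an assumption for illustration, not the production metric):

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str],
                   k: int = 5) -> float:
    # Fraction of the top-k retrieved chunks that a reviewer marked relevant.
    top = retrieved_ids[:k]
    return sum(1 for cid in top if cid in relevant_ids) / max(len(top), 1)

def faithfulness_proxy(answer: str, context: str) -> float:
    # Crude proxy: share of answer tokens that appear in the retrieved
    # context. Production faithfulness checks use an LLM judge instead.
    answer_tokens = set(answer.lower().split())
    context_tokens = set(context.lower().split())
    return len(answer_tokens & context_tokens) / max(len(answer_tokens), 1)
```

Run nightly over the 500+ golden pairs, even metrics this simple surface regressions, such as a re-ingestion that silently broke chunk IDs, long before a human reviewer would notice.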
The golden dataset evolves continuously. Every time a user flags an incorrect answer or a human reviewer identifies a retrieval failure, we add it to the test set. This creates a ratchet effect — the system can only improve, and regressions are caught automatically before they reach users.
