AI Engineering · March 4, 2026 · 12 min read

Building Production RAG Pipelines That Actually Work

Why most RAG implementations fail in production — and the chunking, embedding, and retrieval strategies we use to achieve 95%+ answer accuracy at enterprise scale.

Aryaverse Engineering

The RAG Reality Check

Retrieval-Augmented Generation has become the default architecture for grounding LLMs in proprietary data. The concept is deceptively simple: embed your documents, store them in a vector database, retrieve relevant chunks at query time, and feed them to the model as context. In a demo, this works beautifully. In production, with millions of documents, ambiguous queries, and zero tolerance for hallucination — it falls apart.
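The demo-scale version of that loop fits in a few lines. This is an illustrative sketch only: a toy bag-of-words "embedder" and brute-force cosine similarity stand in for a real embedding model and vector database.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system calls an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    q = embed(query)
    return sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)[:k]

docs = [
    "The indemnification clause survives termination of the agreement.",
    "Quarterly revenue grew 12 percent year over year.",
    "Interest rate risk is disclosed in section 7 of the report.",
]
context = retrieve("what does the indemnification clause say", docs, k=1)
prompt = f"Answer using only this context:\n{context[0]}\n\nQuestion: ..."
```

Every production failure mode discussed below lives in one of these stand-ins: how `docs` gets chunked, what `embed` encodes, and how `retrieve` ranks.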

After deploying RAG systems across financial services, legal, and healthcare clients, we've catalogued the failure modes that separate a convincing prototype from a production-grade knowledge system. The problems are rarely about the LLM itself. They live in the unglamorous plumbing: how you chunk, how you embed, how you retrieve, and how you evaluate.

Pipeline Architecture
[Figure: ingestion pipeline (Documents → Chunker → Embedder → Vector DB) feeding a query pipeline (Query → Retriever → Re-ranker → LLM Answer)]
End-to-end RAG flow — from document ingestion through semantic chunking and embedding to retrieval-augmented generation with re-ranking

Chunking Strategy: The Foundation Nobody Gets Right

Most teams default to fixed-size chunking — split every document into 512-token blocks with some overlap. This is fast to implement and catastrophically bad for structured documents. A financial report's risk disclosure gets sliced mid-sentence and merged with an unrelated table. A legal contract's indemnification clause loses its conditional preamble.

We use a hybrid chunking strategy. First, structural parsing extracts semantic boundaries: headings, sections, list items, table rows. Then, within those boundaries, we apply recursive character splitting with overlap only at structural boundaries. For tables, we serialize each row as a self-contained statement with the column headers preserved. The result is chunks that are semantically coherent — each one makes sense on its own without surrounding context.
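A minimal sketch of the hybrid approach, assuming Markdown-style headings as the structural boundary (a production parser would also handle lists and table rows, as described above):

```python
import re

def structural_chunks(text: str, max_chars: int = 400, overlap: int = 50) -> list[str]:
    """Split on structural boundaries first, then size-limit within each one."""
    # Keep each heading attached to the body that follows it.
    sections = re.split(r"\n(?=#+ )", text)
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append(section)  # already a coherent unit; keep it whole
        else:
            # Character splitting with overlap, confined to this boundary so
            # no chunk ever merges content from two unrelated sections.
            start = 0
            while start < len(section):
                chunks.append(section[start:start + max_chars])
                start += max_chars - overlap
    return chunks
```

The key property is that the sliding window never crosses a section boundary, so a risk disclosure can never bleed into an unrelated table.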

The impact is measurable. In a benchmark across 12,000 financial documents, switching from fixed-size to semantic chunking improved retrieval precision@5 from 0.67 to 0.89 with zero changes to the embedding model or retrieval logic.

Embedding Selection and Fine-Tuning

Off-the-shelf embedding models like OpenAI's text-embedding-3-large or Cohere's embed-v3 are good general-purpose encoders. But "general purpose" means they encode semantic similarity broadly — they don't understand that in your domain, "material adverse change" and "MAC clause" are the same concept, while "interest rate" in a mortgage document and "interest rate" in a central bank report carry very different retrieval intent.

We fine-tune embedding models on domain-specific query-document pairs harvested from actual user interactions. The training loop is straightforward: take real queries that your system received, pair them with the documents that human reviewers marked as relevant, and fine-tune with contrastive loss. Even 2,000 curated pairs can shift recall@10 by 15-20 points in specialized domains.

For deployment, we quantize the fine-tuned model to int8 and serve it behind a batched inference endpoint. Embedding latency stays under 15ms for single queries, and throughput handles 500+ documents per second for bulk ingestion.
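To illustrate the quantization step, here is a sketch of symmetric per-vector int8 quantization (not our serving code, which operates on model weights via standard tooling): each float is mapped onto the integer range [-127, 127] with a single scale factor recorded for dequantization.

```python
def quantize_int8(vec: list[float]) -> tuple[list[int], float]:
    """Symmetric int8 quantization: map [-max_abs, max_abs] onto [-127, 127]."""
    scale = max(abs(x) for x in vec) / 127 or 1.0  # avoid div-by-zero on a zero vector
    return [round(x / scale) for x in vec], scale

def dequantize_int8(quantized: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the int8 codes and their scale."""
    return [x * scale for x in quantized]
```

The trade is 4x less memory per vector for a small, bounded reconstruction error — which is why recall typically survives quantization while latency and throughput improve.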

Evaluation: Closing the Loop

The most dangerous RAG system is one that looks like it's working. Without rigorous evaluation, you're flying blind — accuracy degrades as document corpora grow, and hallucinations become harder to spot as responses become more fluent.

We run a three-tier evaluation framework. Tier 1: automated metrics computed nightly on a golden dataset of 500+ question-answer pairs. We measure retrieval precision, answer faithfulness (does the answer follow from the retrieved context?), and answer completeness. Tier 2: weekly adversarial testing with queries designed to trigger known failure modes — ambiguous entities, temporal references, negation, and multi-hop reasoning. Tier 3: monthly human evaluation where domain experts score a random sample of production responses.
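The Tier 1 retrieval metric is straightforward to compute. A minimal sketch of precision@k averaged over a golden set — the `retrieved`/`relevant` field names are illustrative, not our actual schema:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunk IDs that reviewers marked relevant."""
    top = retrieved[:k]
    return sum(1 for doc_id in top if doc_id in relevant) / len(top)

def nightly_report(golden: list[dict], k: int = 5) -> float:
    """Average precision@k across every question in the golden dataset."""
    return sum(precision_at_k(ex["retrieved"], ex["relevant"], k) for ex in golden) / len(golden)
```

Faithfulness and completeness are harder to score mechanically — they typically require an LLM-as-judge or human rubric — which is exactly why Tiers 2 and 3 exist.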

The golden dataset evolves continuously. Every time a user flags an incorrect answer or a human reviewer identifies a retrieval failure, we add it to the test set. This creates a ratchet effect — the system can only improve, and regressions are caught automatically before they reach users.

Written by Aryaverse Engineering

Enterprise AI and blockchain engineering insights from the team building at the intersection of intelligent systems and decentralized infrastructure.