Designing a Production RAG Pipeline
Lessons from building a retrieval-augmented generation system at scale
Tags: RAG, LLM, Production ML
Introduction
Building a production RAG system taught me that chunking strategy often matters more than the choice of embedding model: how you split documents determines what the retriever can surface at all.
The Problem
Users needed to query 100k+ internal documents with high accuracy and low latency.
Approach
- Chunking: fixed-size (512 tokens) vs semantic chunking
- Embeddings: API models vs open-source alternatives
- Retrieval: BM25 baseline → dense retrieval → hybrid retrieval
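The hybrid step above combines the BM25 and dense rankings. One common way to do this is reciprocal rank fusion (RRF); the sketch below is a minimal, generic version, not the exact fusion used in this system, and the document IDs are hypothetical:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into a single ranking.

    rankings: list of lists, each ordered best-first.
    k: smoothing constant; 60 is the conventional default.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Earlier ranks contribute more; k damps the top-rank dominance.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

# Fuse a BM25 ranking with a dense-retrieval ranking (illustrative IDs).
bm25_hits = ["doc3", "doc1", "doc7"]
dense_hits = ["doc1", "doc5", "doc3"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

A document that appears high in either list rises in the fused ranking, which is why hybrid retrieval tends to beat dense-only on keyword-heavy queries.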
Code Example
def chunk_documents(docs, chunk_size=512, overlap=50):
    """Split each document into fixed-size chunks with overlap, so text
    that straddles a chunk boundary still appears whole in one chunk."""
    chunks = []
    step = chunk_size - overlap  # advance by less than chunk_size to overlap
    for doc in docs:
        for i in range(0, len(doc), step):
            chunks.append(doc[i : i + chunk_size])
    return chunks
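The semantic alternative compared above splits on meaning boundaries rather than at a fixed offset. A minimal sketch, assuming sentence boundaries as the splitting unit and character length as a rough proxy for tokens (the real system's segmenter is not shown here):

```python
import re

def chunk_by_sentences(text, max_len=512):
    """Greedy sentence-boundary chunking: pack whole sentences into
    chunks of at most max_len characters, never splitting a sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        if current and len(current) + 1 + len(sent) > max_len:
            chunks.append(current)  # current chunk is full; start a new one
            current = sent
        else:
            current = f"{current} {sent}".strip() if current else sent
    if current:
        chunks.append(current)
    return chunks

chunks = chunk_by_sentences("One sentence here. Another one. Third.", max_len=25)
```

Because no sentence is ever cut in half, each chunk reads as coherent text, which is the usual reason semantic chunking retrieves more accurately than fixed-size windows.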
Results
- Top-1 Accuracy: 84% (vs 67% baseline)
- p95 Latency: 450ms
- Cost: $0.02 per query
Key Lessons
- Semantic chunking improved accuracy by 12%
- Hybrid retrieval beat dense-only
- Caching reduced API costs by 60%
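The caching win in the last lesson comes from the fact that embedding calls are deterministic for a given input, so repeated or popular queries can be served from a hash-keyed store. A minimal in-memory sketch, with a hypothetical stand-in for the real embedding API:

```python
import hashlib

class EmbeddingCache:
    """Cache embeddings by a hash of the input text, so repeated
    queries skip the (paid) embedding API call entirely."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn  # the underlying (expensive) call
        self.store = {}
        self.hits = 0
        self.misses = 0

    def embed(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key in self.store:
            self.hits += 1
        else:
            self.misses += 1
            self.store[key] = self.embed_fn(text)
        return self.store[key]

# Hypothetical embedding function standing in for a real API client.
cache = EmbeddingCache(lambda t: [float(len(t))])
cache.embed("what is our refund policy?")
cache.embed("what is our refund policy?")  # second call is a cache hit
```

In production this store would typically live in Redis or similar rather than process memory; the hit/miss counters make the cost savings directly measurable.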