Best RAG Architecture & Stack for 10M+ Text Files

Intro

  • Use hybrid retrieval (BM25/BM25S + dense) with a reranker; add lightweight hierarchical summaries only where recall suffers.
  • Keep ingestion incremental: chunk signatures to skip re-embeds, CDC to detect deletes/updates, shard indexes by month/tenant for bounded rebuilds.
  • Pick a managed vector DB if ops-averse (Pinecone/Weaviate/Qdrant Cloud) or self-host Qdrant/Milvus if you need VPC control; pgvector works for smaller, latency-tolerant loads.
  • Keep metadata rich: doc_id, source, timestamp, type, lang, quality, version. This powers filters, routing, and rerank boosts.
  • Always rerank top 50–100 dense+sparse hits with a cross-encoder (e.g., bge-reranker-large, Cohere Rerank-3) before sending to the LLM.

Core Architecture (10M docs, 100M+ vectors)

Retrieval-Augmented Generation (RAG): A technique that enhances large language models by retrieving relevant information from external knowledge bases before generating responses, enabling up-to-date and domain-specific answers.
  • Retrieval path: Query parsing → light classifier/router (domain/intent; see the sketch after this list) → hybrid candidate gen (BM25/BM25S + dense HNSW/IVF) → metadata filters → cross-encoder rerank → context assembly (cap chunks per doc for diversity, dedupe) → LLM.
  • Hierarchical layer (optional): RAPTOR/LightRAG-style tree summaries per collection to route and to summarize long docs; add them only where recall or latency suffers.
  • Graph assist (selective): Maintain a slim taxonomy/graph for routing (collections/topics/entities) and to expand queries with related entities; keep it small to avoid ops bloat.
  • Agents only where needed: Temporal agent (range filter), summarizer agent for long contexts, and guardrail agent for PHI/PII/tenancy; avoid “agent sprawl.”
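
The router can start as a thin rule layer before graduating to a trained classifier. Below is a minimal sketch, assuming a keyword-rule table and an illustrative metadata schema (DOMAIN_RULES, route_query, and the filter keys are hypothetical, not part of any particular stack):

from datetime import datetime, timedelta, timezone

# Hypothetical keyword rules mapping query terms to a target collection/domain.
DOMAIN_RULES = {
    "invoice": "finance_docs",
    "incident": "ops_logs",
    "contract": "legal_docs",
}

def route_query(query: str, tenant: str) -> dict:
    """Pick a collection and build the metadata filters applied before hybrid search."""
    q = query.lower()
    domain = next((coll for kw, coll in DOMAIN_RULES.items() if kw in q), "general_docs")
    filters = {"tenant": tenant, "is_live": True, "lang": "en"}
    if "last month" in q:  # crude stand-in for the temporal agent's range filter
        since = datetime.now(timezone.utc) - timedelta(days=30)
        filters["timestamp_gte"] = since.isoformat()
    return {"collection": domain, "filters": filters}

print(route_query("summarize incident reports from last month", tenant="acme"))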

Ingestion & Updates

  • CDC + signatures: Track inserts/updates from Postgres (logical decoding/CDC) → hash normalized chunks (content + version + locale) → skip re-embed if unchanged.
  • Chunking: Per doc type; target 200–400 tokens for prose, 80–120 for code/log lines. Store original text + token count. Keep doc-level sparse terms for BM25.
  • Deletes & versions: Tombstone old versions; background compaction per shard. Keep doc_version and is_live flags for clean sweeps.
  • Sharding: Partition by month or tenant; reindex only hot shards. Cold shards can be compacted offline.
  • Batching: Use async workers to embed; batch 64–512 chunks per request depending on provider limits. Cache embeddings for identical hashes, as sketched below.
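
A minimal sketch of the signature check described in the bullets above, assuming an in-memory cache keyed by chunk hash (chunk_signature, embed_if_changed, and the field names are illustrative):

import hashlib
import unicodedata

def chunk_signature(text: str, doc_version: str, locale: str) -> str:
    """Hash normalized chunk content + version + locale, per the CDC + signatures bullet."""
    normalized = unicodedata.normalize("NFKC", text).strip().lower()
    payload = f"{normalized}|{doc_version}|{locale}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def embed_if_changed(chunk: dict, embedding_cache: dict, embed_fn):
    """Skip re-embedding when the signature is unchanged; cache vectors by hash."""
    sig = chunk_signature(chunk["text"], chunk["doc_version"], chunk["locale"])
    if sig in embedding_cache:
        return embedding_cache[sig]      # identical content: reuse the stored vector
    vector = embed_fn(chunk["text"])     # only new or changed chunks hit the embedder
    embedding_cache[sig] = vector
    return vector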

Stack Recommendations

Vector Database : A specialized database optimized for storing and querying high-dimensional vector embeddings, enabling semantic similarity search at scale.
  • Vector DB: Qdrant (HNSW, good filters, local or cloud), Weaviate (hybrid native, alpha/flat support), Milvus (IVF/HNSW, rock-solid at scale), or Pinecone if you want managed ops. For Postgres-only shops, pgvector + ivfflat works up to low tens of millions if you accept a bit more latency.
  • Sparse search: PostgreSQL full-text search (tsvector) or a BM25S index, or an Elastic/OpenSearch sidecar. Use the same metadata schema as the vector store.
  • Embeddings: text-embedding-3-large/small, mxbai-embed-large, or bge-large-en. Multilingual? Use bge-m3 or multilingual MiniLM-L12 variants. Keep dimensionality consistent per index.
  • Rerankers: bge-reranker-large (OSS) or Cohere Rerank-3 (see the sketch after this list). Rerank top 50–100; keep under 500 tokens per candidate.
  • Orchestration: LangChain or LlamaIndex for pipelines; Dify/AnythingLLM if you want UI + admin quickly. Add tracing with Arize/PipeRider/OpenTelemetry to see retrieval quality.
  • Guardrails: Post-filter on tenant/source; add safety/PII checks before the LLM call. Cache safe responses for common queries.
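
A hedged sketch of the rerank step, using the sentence-transformers CrossEncoder wrapper around bge-reranker-large (the candidate dict shape is an assumption, not a fixed schema):

from sentence_transformers import CrossEncoder

# bge-reranker-large is a cross-encoder; max_length ~512 keeps candidates short, per the bullet above.
reranker = CrossEncoder("BAAI/bge-reranker-large", max_length=512)

def rerank(query: str, candidates: list[dict], k: int = 20) -> list[dict]:
    """Score the top 50-100 hybrid hits against the query and keep the best k."""
    pairs = [(query, c["text"]) for c in candidates]     # each candidate assumed to carry its chunk text
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:k]]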

Retrieval Recipe (pseudo)

q = preprocess(query)                                    # normalize, strip boilerplate, detect language
domain = router.predict(q)                               # light domain/intent classifier
candidates_dense = vectordb.search(q, k=120, filter=domain_filter(domain))
candidates_sparse = bm25.search(q, k=120, filter=domain_filter(domain))
merged = merge_dedup(candidates_dense, candidates_sparse, boost_recency=True)
top = reranker.rank(q, merged, k=20)                     # cross-encoder over the merged candidates
context = assemble(top, max_per_doc=3, diversity=True)   # cap chunks per doc, dedupe
answer = llm.generate(q, context, citations=True)
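
One plausible implementation of merge_dedup above is reciprocal rank fusion with a mild recency boost; the constants and hit fields (chunk_id, timestamp) are illustrative, not a fixed recipe:

from datetime import datetime, timezone

def merge_dedup(dense_hits, sparse_hits, boost_recency=True, k=60):
    """Reciprocal rank fusion over both lists, deduped by chunk id.
    Hits are assumed to be dicts with chunk_id and a timezone-aware timestamp."""
    scores, by_id = {}, {}
    for hits in (dense_hits, sparse_hits):
        for rank, hit in enumerate(hits):
            cid = hit["chunk_id"]
            by_id[cid] = hit
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank + 1)
    if boost_recency:
        now = datetime.now(timezone.utc)
        for cid, hit in by_id.items():
            age_days = (now - hit["timestamp"]).days
            scores[cid] *= 1.0 + max(0.0, 0.1 - 0.001 * age_days)   # mild, illustrative boost
    return [by_id[cid] for cid in sorted(scores, key=scores.get, reverse=True)]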

Latency & Cost Controls

  • Co-locate vector DB with embeddings/rerankers to avoid cross-zone hops.
  • Enable HNSW (M=32–64, ef_search tuned per latency SLO) or IVF-PQ for cheaper RAM; keep a “quality” index (flat/HNSW) for evals.
  • Cache rerank results for popular queries; add a short-TTL query result cache before the LLM call (see the sketch after this list).
  • CPU is fine for embeddings at low QPS; move to GPU or managed embeddings when ingest spikes.
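
A minimal sketch of the short-TTL result cache mentioned above, keyed per tenant and query (the class name and default TTL are assumptions):

import hashlib
import time

class TTLCache:
    """Tiny in-process cache for reranked results or final answers on popular queries."""
    def __init__(self, ttl_seconds: int = 300):
        self.ttl = ttl_seconds
        self.store = {}                           # key -> (expires_at, value)

    def _key(self, query: str, tenant: str) -> str:
        return hashlib.sha1(f"{tenant}:{query}".encode("utf-8")).hexdigest()

    def get(self, query: str, tenant: str):
        key = self._key(query, tenant)
        entry = self.store.get(key)
        if entry and entry[0] > time.time():
            return entry[1]
        self.store.pop(key, None)                 # drop expired or missing entries
        return None

    def put(self, query: str, tenant: str, value):
        self.store[self._key(query, tenant)] = (time.time() + self.ttl, value)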

When to Add Hierarchical / Graph Layers

  • Add RAPTOR/LightRAG summaries if recall drops for long, heterogeneous docs or when users ask “overview” questions. Store summaries alongside nodes with the same metadata.
  • Add a small knowledge graph if entity relationships matter (people ↔ orgs ↔ events); use it to expand queries and for routing, not as the primary store.
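
A minimal sketch of graph-assisted query expansion, where a small in-memory adjacency map stands in for the knowledge graph (ENTITY_GRAPH and expand_query are illustrative):

# Hypothetical slim taxonomy: entity -> closely related entities, used only for expansion and routing.
ENTITY_GRAPH = {
    "acme corp": ["acme holdings", "acme labs"],
    "data breach": ["incident response", "disclosure"],
}

def expand_query(query: str, max_terms: int = 4) -> str:
    """Append related entities to the query; the vector/sparse indexes remain the primary store."""
    q = query.lower()
    extra = []
    for entity, related in ENTITY_GRAPH.items():
        if entity in q:
            extra.extend(related)
    return query if not extra else f"{query} {' '.join(extra[:max_terms])}"

print(expand_query("timeline of the Acme Corp data breach"))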

Operational Checklist

  • Bench hybrid+rerank on a labeled set; track Recall@20 and MRR (see the sketch after this list). Don’t ship without evals.
  • Schema discipline: same metadata keys across vector, sparse, and summaries; keep doc_id stable across versions.
  • Run nightly compaction and weekly freshness audits (missing embeddings, stale tombstones).
  • Log retrieval traces to debug “no results” and “wrong docs” quickly.
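
A minimal sketch of the Recall@20 / MRR evaluation loop, assuming a labeled set of (query, relevant_doc_ids) pairs and a retrieve callable that returns ranked doc_ids (both illustrative):

def evaluate(labeled_set, retrieve, k: int = 20):
    """labeled_set: list of (query, set_of_relevant_doc_ids); retrieve(query, k) -> ranked doc_ids."""
    recall_sum, rr_sum = 0.0, 0.0
    for query, relevant in labeled_set:
        ranked = retrieve(query, k)
        hits = [i for i, doc_id in enumerate(ranked) if doc_id in relevant]
        recall_sum += len(set(ranked) & relevant) / max(1, len(relevant))
        rr_sum += 1.0 / (hits[0] + 1) if hits else 0.0   # reciprocal rank of the first relevant hit
    n = max(1, len(labeled_set))
    return {"recall@k": recall_sum / n, "mrr": rr_sum / n}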

If you need to start today with minimal ops: Qdrant Cloud + PostgreSQL BM25S, text-embedding-3-small, bge-reranker-large, LangChain/LlamaIndex pipeline, and a tiny router over domain/type/lang. Scale shards and add hierarchical summaries only if recall or latency forces it.

Key Takeaways

  • Hybrid retrieval (BM25 + dense vectors) outperforms either approach alone at scale
  • Reranking with cross-encoders is essential; always rerank the top 50–100 candidates before the LLM
  • Incremental ingestion using CDC and chunk signatures prevents costly re-embeddings
  • Rich metadata (doc_id, source, timestamp, type, lang) enables powerful filtering and routing
  • Managed vector DBs (Qdrant/Weaviate/Pinecone) reduce ops burden for most teams
  • Sharded indexes by month/tenant keep rebuilds bounded and performance predictable
  • Evaluation first: Track Recall@20 and MRR on labeled data before production
