Intro
- Use hybrid retrieval (BM25/BM25S + dense) with a reranker; add lightweight hierarchical summaries only where recall suffers.
- Keep ingestion incremental: chunk signatures to skip re-embeds, CDC to detect deletes/updates, shard indexes by month/tenant for bounded rebuilds.
- Pick a managed vector DB if ops-averse (Pinecone/Weaviate/Qdrant Cloud) or self-host Qdrant/Milvus if you need VPC control; pgvector works for smaller, latency-tolerant loads.
- Keep metadata rich: `doc_id`, `source`, `timestamp`, `type`, `lang`, `quality`, `version`. This powers filters, routing, and rerank boosts.
- Always rerank the top 50–100 dense+sparse hits with a cross-encoder (e.g., `bge-reranker-large`, Cohere Rerank-3) before sending to the LLM.
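To make the "rich metadata" point concrete, here is a minimal sketch of a per-chunk payload using the fields listed above; the types and example values are illustrative assumptions, not a prescribed schema.

```python
# Minimal sketch of a per-chunk metadata payload.
# Field names come from the list above; types and example values are assumptions.
from typing import TypedDict

class ChunkMetadata(TypedDict):
    doc_id: str        # stable across document versions
    source: str        # e.g. "confluence" or an object-store URI
    timestamp: str     # ISO-8601 ingest or publish time
    type: str          # "prose" | "code" | "log" | ...
    lang: str          # language code, used for routing and filters
    quality: float     # heuristic quality score for rerank boosts
    version: int       # document version, paired with tombstoning

example: ChunkMetadata = {
    "doc_id": "kb-001234",
    "source": "confluence",
    "timestamp": "2024-05-01T12:00:00Z",
    "type": "prose",
    "lang": "en",
    "quality": 0.87,
    "version": 3,
}
```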
Core Architecture (10M docs, 100M+ vectors)
Retrieval-Augmented Generation (RAG): A technique that enhances large language models by retrieving relevant information from external knowledge bases before generating responses, enabling up-to-date and domain-specific answers.
- Retrieval path: Query parsing → light classifier/router (domain/intent) → hybrid candidate gen (BM25/BM25S + dense HNSW/IVF) → metadata filters → cross-encoder rerank → context assembly (cap chunks per doc for diversity, dedupe) → LLM.
- Hierarchical layer (optional): RAPTOR/LightRAG-style tree summaries per collection to route and to summarize long docs; use only where recall/latency is struggling.
- Graph assist (selective): Maintain a slim taxonomy/graph for routing (collections/topics/entities) and to expand queries with related entities; keep it small to avoid ops bloat.
- Agents only where needed: Temporal agent (range filter), summarizer agent for long contexts, and guardrail agent for PHI/PII/tenancy; avoid “agent sprawl.”
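The "light classifier/router" step can start as something very small. Below is a hedged sketch using keyword rules as a stand-in for a trained classifier; the collection names, keyword lists, and the temporal-filter heuristic are all illustrative.

```python
# Hedged sketch: keyword rules standing in for a trained domain/intent router.
# Collections, keywords, and the temporal heuristic are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Route:
    collection: str          # which shard/collection to search
    needs_time_filter: bool  # hand off to a temporal agent for date ranges

KEYWORDS = {
    "billing": ["invoice", "refund", "charge"],
    "infra":   ["deploy", "kubernetes", "outage"],
}

def route_query(query: str) -> Route:
    q = query.lower()
    for collection, words in KEYWORDS.items():
        if any(w in q for w in words):
            return Route(collection, needs_time_filter="since" in q or "last" in q)
    return Route("general", needs_time_filter=False)

print(route_query("Why was I charged twice since May?"))
# Route(collection='billing', needs_time_filter=True)
```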
Ingestion & Updates
- CDC + signatures: Track inserts/updates from Postgres (logical decode/CDC) → hash normalized chunks (content + version + locale) → skip re-embed if unchanged.
- Chunking: Per doc type; target 200–400 tokens for prose, 80–120 for code/log lines. Store original text + token count. Keep doc-level sparse terms for BM25.
- Deletes & versions: Tombstone old versions; background compaction per shard. Keep
doc_versionandis_liveflags for clean sweeps. - Sharding: Partition by month or tenant; reindex only hot shards. Cold shards can be compacted offline.
- Batching: Use async workers to embed; batch roughly 64–512 chunks per call depending on provider limits. Cache embeddings for identical hashes.
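A minimal sketch of the signature check described above: hash normalized content plus version and locale, and skip re-embedding when the signature has been seen. The `seen_signatures` set is a stand-in for whatever store (Postgres table, Redis set) actually tracks embedded chunks.

```python
# Hedged sketch of chunk signatures for skipping re-embeds.
import hashlib

def chunk_signature(text: str, version: int, locale: str) -> str:
    normalized = " ".join(text.split()).lower()            # cheap normalization
    payload = f"{normalized}|{version}|{locale}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def needs_embedding(text: str, version: int, locale: str,
                    seen_signatures: set[str]) -> bool:
    sig = chunk_signature(text, version, locale)
    if sig in seen_signatures:
        return False            # unchanged chunk: reuse the cached embedding
    seen_signatures.add(sig)
    return True
```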
Stack Recommendations
Vector Database: A specialized database optimized for storing and querying high-dimensional vector embeddings, enabling semantic similarity search at scale.
- Vector DB: Qdrant (HNSW, good filters, local or cloud), Weaviate (hybrid native, alpha/flat support), Milvus (IVF/HNSW, rock-solid at scale), or Pinecone if you want managed ops. For Postgres-only shops, `pgvector + ivfflat` works up to low tens of millions of vectors if you accept a bit more latency.
- Sparse search: PostgreSQL `tsvector` with BM25S or an Elastic/OpenSearch sidecar. Use the same metadata schema as the vector store.
- Embeddings: `text-embedding-3-large/small`, `mxbai-embed-large`, or `bge-large-en`. Multilingual? Use `bge-m3` or `minilm-l12` variants. Keep dimensionality consistent per index.
- Rerankers: `bge-reranker-large` (OSS) or Cohere Rerank-3. Rerank the top 50–100; keep each candidate under 500 tokens.
- Orchestration: LangChain or LlamaIndex for pipelines; Dify/AnythingLLM if you want a UI + admin quickly. Add tracing with Arize/PipeRider/OpenTelemetry to see retrieval quality.
- Guardrails: Post-filter on tenant/source; add safety/PII checks before the LLM call. Cache safe responses for common queries.
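For the OSS reranker option, a short sketch of the rerank step using sentence-transformers' CrossEncoder wrapper around `BAAI/bge-reranker-large`; candidate texts are assumed to already be trimmed to roughly 500 tokens upstream.

```python
# Hedged sketch of cross-encoder reranking of hybrid candidates.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large", max_length=512)

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[str]:
    # Score each (query, passage) pair, then keep the highest-scoring passages.
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_k]]
```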
Retrieval Recipe (pseudo)
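A minimal sketch of the recipe in Python-flavored pseudocode; every helper named here (`route_query`, `bm25_search`, `dense_search`, `dedupe`, `apply_filters`, `rerank`, `assemble_context`, `llm`) is a hypothetical stand-in for the corresponding stage in the retrieval path above.

```python
# Pseudo recipe: hybrid candidate generation -> filters -> rerank -> assembly -> LLM.
def answer(query: str, filters: dict, k_sparse: int = 50, k_dense: int = 50) -> str:
    route = route_query(query)                                # light domain/intent router
    sparse = bm25_search(query, route.collection, k_sparse)   # BM25/BM25S candidates
    dense = dense_search(query, route.collection, k_dense)    # HNSW/IVF candidates
    candidates = dedupe(sparse + dense)                       # merge hybrid candidate sets
    candidates = apply_filters(candidates, filters)           # tenant/type/lang/date filters
    top = rerank(query, candidates)[:10]                      # cross-encoder over top 50-100
    context = assemble_context(top, max_per_doc=2)            # diversity cap + dedupe
    return llm(query, context)
```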
Latency & Cost Controls
- Co-locate vector DB with embeddings/rerankers to avoid cross-zone hops.
- Enable HNSW (M=32–64, ef_search tuned per latency SLO) or IVF-PQ for cheaper RAM; keep a “quality” index (flat/HNSW) for evals.
- Cache rerank results for popular queries; short-TTL query result cache before the LLM call.
- CPU is fine for embeddings at low QPS; move to GPU or managed embeddings when ingest spikes.
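As one way to apply the HNSW numbers above, here is a sketch against Qdrant's Python client; the collection name, vector size, and exact parameter values are assumptions to tune against your own latency SLO.

```python
# Hedged sketch: HNSW build and search-time parameters in Qdrant.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, HnswConfigDiff, SearchParams

client = QdrantClient(url="http://localhost:6333")

# Build-time graph parameters (M in the 32-64 range discussed above).
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=32, ef_construct=256),
)

# Query-time ef tuned per latency SLO.
hits = client.search(
    collection_name="docs",
    query_vector=[0.0] * 1536,          # placeholder query embedding
    limit=50,
    search_params=SearchParams(hnsw_ef=128),
)
```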
When to Add Hierarchical / Graph Layers
- Add RAPTOR/LightRAG summaries if recall drops for long, heterogeneous docs or when users ask “overview” questions. Store summaries alongside nodes with the same metadata.
- Add a small knowledge graph if entity relationships matter (people ↔ orgs ↔ events); use it to expand queries and for routing, not as the primary store.
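A small sketch of what "use it to expand queries" can mean in practice: an in-memory adjacency map stands in for the slim graph, and related entities are appended as extra query variants for hybrid search. Entities, relations, and the expansion strategy here are purely illustrative.

```python
# Hedged sketch of graph-assisted query expansion over a tiny entity graph.
ENTITY_GRAPH = {
    "acme corp": ["acme labs", "jane doe"],
    "jane doe": ["acme corp", "q3 earnings call"],
}

def expand_query(query: str, max_expansions: int = 3) -> list[str]:
    q = query.lower()
    expansions: list[str] = []
    for entity, neighbors in ENTITY_GRAPH.items():
        if entity in q:
            expansions.extend(neighbors)
    # Original query first, then a few related-entity variants for hybrid search.
    return [query] + [f"{query} {e}" for e in expansions[:max_expansions]]

print(expand_query("What did Jane Doe announce?"))
```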
Operational Checklist
- Bench hybrid+rerank on a labeled set; track Recall@20 and MRR. Don’t ship without evals.
- Schema discipline: same metadata keys across vector, sparse, and summaries; keep `doc_id` stable across versions.
- Run nightly compaction and weekly freshness audits (missing embeddings, stale tombstones).
- Log retrieval traces to debug “no results” and “wrong docs” quickly.
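The two metrics named above are cheap to compute once you have a labeled set; a minimal sketch, where each labeled example pairs the set of relevant `doc_id`s with the ranked `doc_id`s the retriever returned (the sample data is illustrative).

```python
# Hedged sketch of Recall@20 and MRR over a labeled retrieval set.
def recall_at_k(relevant: set[str], retrieved: list[str], k: int = 20) -> float:
    return len(relevant & set(retrieved[:k])) / max(len(relevant), 1)

def mrr(relevant: set[str], retrieved: list[str]) -> float:
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

labeled = [
    ({"doc-1", "doc-7"}, ["doc-3", "doc-1", "doc-9"]),  # (relevant, ranked results)
]
print(sum(recall_at_k(rel, ret) for rel, ret in labeled) / len(labeled))  # mean Recall@20
print(sum(mrr(rel, ret) for rel, ret in labeled) / len(labeled))          # mean MRR
```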
If you need to start today with minimal ops: Qdrant Cloud + PostgreSQL BM25S, text-embedding-3-small, bge-reranker-large, LangChain/LlamaIndex pipeline, and a tiny router over domain/type/lang. Scale shards and add hierarchical summaries only if recall or latency forces it.
Key Takeaways
- Hybrid retrieval (BM25 + dense vectors) outperforms either approach alone at scale
- Reranking with cross-encoders is essential; always rerank top 50-100 candidates before LLM
- Incremental ingestion using CDC and chunk signatures prevents costly re-embeddings
- Rich metadata (doc_id, source, timestamp, type, lang) enables powerful filtering and routing
- Managed vector DBs (Qdrant/Weaviate/Pinecone) reduce ops burden for most teams
- Sharded indexes by month/tenant keep rebuilds bounded and performance predictable
- Evaluation first: Track Recall@20 and MRR on labeled data before production