Intro
- Use hybrid retrieval (BM25/BM25S + dense) with a reranker; add lightweight hierarchical summaries only where recall suffers.
- Keep ingestion incremental: chunk signatures to skip re-embeds, CDC to detect deletes/updates, shard indexes by month/tenant for bounded rebuilds.
- Pick a managed vector DB if ops-averse (Pinecone/Weaviate/Qdrant Cloud) or self-host Qdrant/Milvus if you need VPC control; pgvector works for smaller, latency-tolerant loads.
- Keep metadata rich: `doc_id`, `source`, `timestamp`, `type`, `lang`, `quality`, `version`. This powers filters, routing, and rerank boosts.
- Always rerank the top 50–100 dense+sparse hits with a cross-encoder (e.g., `bge-reranker-large`, Cohere Rerank-3) before sending to the LLM.
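To make the "rich metadata" point concrete, here is a minimal sketch of a per-chunk payload using the fields listed above; the types and example values are illustrative assumptions, not a prescribed schema.

```python
# Minimal sketch of a per-chunk metadata payload.
# Field names come from the list above; types and example values are assumptions.
from typing import TypedDict

class ChunkMetadata(TypedDict):
    doc_id: str        # stable across document versions
    source: str        # e.g. "confluence" or an object-store URI
    timestamp: str     # ISO-8601 ingest or publish time
    type: str          # "prose" | "code" | "log" | ...
    lang: str          # language code, used for routing and filters
    quality: float     # heuristic quality score for rerank boosts
    version: int       # document version, paired with tombstoning

example: ChunkMetadata = {
    "doc_id": "kb-001234",
    "source": "confluence",
    "timestamp": "2024-05-01T12:00:00Z",
    "type": "prose",
    "lang": "en",
    "quality": 0.87,
    "version": 3,
}
```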
Core Architecture (10M docs, 100M+ vectors)
Retrieval-Augmented Generation (RAG): A technique that enhances large language models by retrieving relevant information from external knowledge bases before generating responses, enabling up-to-date and domain-specific answers.
- Retrieval path: Query parsing → light classifier/router (domain/intent) → hybrid candidate gen (BM25/BM25S + dense HNSW/IVF) → metadata filters → cross-encoder rerank → context assembly (cap chunks per doc for diversity, dedupe) → LLM.
- Hierarchical layer (optional): RAPTOR/LightRAG-style tree summaries per collection to route and to summarize long docs; use only where recall/latency is struggling.
- Graph assist (selective): Maintain a slim taxonomy/graph for routing (collections/topics/entities) and to expand queries with related entities; keep it small to avoid ops bloat.
- Agents only where needed: Temporal agent (range filter), summarizer agent for long contexts, and guardrail agent for PHI/PII/tenancy; avoid “agent sprawl.”
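The "light classifier/router" step can start as something very small. Below is a hedged sketch using keyword rules as a stand-in for a trained classifier; the collection names, keyword lists, and the temporal-filter heuristic are all illustrative.

```python
# Hedged sketch: keyword rules standing in for a trained domain/intent router.
# Collections, keywords, and the temporal heuristic are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Route:
    collection: str          # which shard/collection to search
    needs_time_filter: bool  # hand off to a temporal agent for date ranges

KEYWORDS = {
    "billing": ["invoice", "refund", "charge"],
    "infra":   ["deploy", "kubernetes", "outage"],
}

def route_query(query: str) -> Route:
    q = query.lower()
    for collection, words in KEYWORDS.items():
        if any(w in q for w in words):
            return Route(collection, needs_time_filter="since" in q or "last" in q)
    return Route("general", needs_time_filter=False)

print(route_query("Why was I charged twice since May?"))
# Route(collection='billing', needs_time_filter=True)
```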
Ingestion & Updates
- CDC + signatures: Track inserts/updates from Postgres (logical decode/CDC) → hash normalized chunks (content + version + locale) → skip re-embed if unchanged.
- Chunking: Per doc type; target 200–400 tokens for prose, 80–120 for code/log lines. Store original text + token count. Keep doc-level sparse terms for BM25.
- Deletes & versions: Tombstone old versions; background compaction per shard. Keep
doc_versionandis_liveflags for clean sweeps. - Sharding: Partition by month or tenant; reindex only hot shards. Cold shards can be compacted offline.
- Batching: Use async workers to embed; batch roughly 64–512 chunks per call depending on provider limits. Cache embeddings for identical hashes.
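A minimal sketch of the signature check described above: hash normalized content plus version and locale, and skip re-embedding when the signature has been seen. The `seen_signatures` set is a stand-in for whatever store (Postgres table, Redis set) actually tracks embedded chunks.

```python
# Hedged sketch of chunk signatures for skipping re-embeds.
import hashlib

def chunk_signature(text: str, version: int, locale: str) -> str:
    normalized = " ".join(text.split()).lower()            # cheap normalization
    payload = f"{normalized}|{version}|{locale}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def needs_embedding(text: str, version: int, locale: str,
                    seen_signatures: set[str]) -> bool:
    sig = chunk_signature(text, version, locale)
    if sig in seen_signatures:
        return False            # unchanged chunk: reuse the cached embedding
    seen_signatures.add(sig)
    return True
```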
Stack Recommendations
Vector Database: A specialized database optimized for storing and querying high-dimensional vector embeddings, enabling semantic similarity search at scale.
- Vector DB: Qdrant (HNSW, good filters, local or cloud), Weaviate (hybrid native, alpha/flat support), Milvus (IVF/HNSW, rock-solid at scale), or Pinecone if you want managed ops. For Postgres-only shops, `pgvector + ivfflat` works up to low tens of millions of vectors if you accept a bit more latency.
- Sparse search: PostgreSQL `tsvector` with BM25S or an Elastic/OpenSearch sidecar. Use the same metadata schema as the vector store.
- Embeddings: `text-embedding-3-large/small`, `mxbai-embed-large`, or `bge-large-en`. Multilingual? Use `bge-m3` or `minilm-l12` variants. Keep dimensionality consistent per index.
- Rerankers: `bge-reranker-large` (OSS) or Cohere Rerank-3. Rerank the top 50–100; keep each candidate under 500 tokens.
- Orchestration: LangChain or LlamaIndex for pipelines; Dify/AnythingLLM if you want a UI + admin quickly. Add tracing with Arize/PipeRider/OpenTelemetry to see retrieval quality.
- Guardrails: Post-filter on tenant/source; add safety/PII checks before the LLM call. Cache safe responses for common queries.
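For the OSS reranker option, a short sketch of the rerank step using sentence-transformers' CrossEncoder wrapper around `BAAI/bge-reranker-large`; candidate texts are assumed to already be trimmed to roughly 500 tokens upstream.

```python
# Hedged sketch of cross-encoder reranking of hybrid candidates.
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large", max_length=512)

def rerank(query: str, candidates: list[str], top_k: int = 10) -> list[str]:
    # Score each (query, passage) pair, then keep the highest-scoring passages.
    scores = reranker.predict([(query, passage) for passage in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_k]]
```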
Retrieval Recipe (pseudo)
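A minimal sketch of the recipe in Python-flavored pseudocode; every helper named here (`route_query`, `bm25_search`, `dense_search`, `dedupe`, `apply_filters`, `rerank`, `assemble_context`, `llm`) is a hypothetical stand-in for the corresponding stage in the retrieval path above.

```python
# Pseudo recipe: hybrid candidate generation -> filters -> rerank -> assembly -> LLM.
def answer(query: str, filters: dict, k_sparse: int = 50, k_dense: int = 50) -> str:
    route = route_query(query)                                # light domain/intent router
    sparse = bm25_search(query, route.collection, k_sparse)   # BM25/BM25S candidates
    dense = dense_search(query, route.collection, k_dense)    # HNSW/IVF candidates
    candidates = dedupe(sparse + dense)                       # merge hybrid candidate sets
    candidates = apply_filters(candidates, filters)           # tenant/type/lang/date filters
    top = rerank(query, candidates)[:10]                      # cross-encoder over top 50-100
    context = assemble_context(top, max_per_doc=2)            # diversity cap + dedupe
    return llm(query, context)
```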
Latency & Cost Controls
- Co-locate vector DB with embeddings/rerankers to avoid cross-zone hops.
- Enable HNSW (M=32–64, ef_search tuned per latency SLO) or IVF-PQ for cheaper RAM; keep a “quality” index (flat/HNSW) for evals.
- Cache rerank results for popular queries; short-TTL query result cache before the LLM call.
- CPU is fine for embeddings at low QPS; move to GPU or managed embeddings when ingest spikes.
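As one way to apply the HNSW numbers above, here is a sketch against Qdrant's Python client; the collection name, vector size, and exact parameter values are assumptions to tune against your own latency SLO.

```python
# Hedged sketch: HNSW build and search-time parameters in Qdrant.
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, HnswConfigDiff, SearchParams

client = QdrantClient(url="http://localhost:6333")

# Build-time graph parameters (M in the 32-64 range discussed above).
client.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    hnsw_config=HnswConfigDiff(m=32, ef_construct=256),
)

# Query-time ef tuned per latency SLO.
hits = client.search(
    collection_name="docs",
    query_vector=[0.0] * 1536,          # placeholder query embedding
    limit=50,
    search_params=SearchParams(hnsw_ef=128),
)
```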
When to Add Hierarchical / Graph Layers
- Add RAPTOR/LightRAG summaries if recall drops for long, heterogeneous docs or when users ask “overview” questions. Store summaries alongside nodes with the same metadata.
- Add a small knowledge graph if entity relationships matter (people ↔ orgs ↔ events); use it to expand queries and for routing, not as the primary store.
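A small sketch of what "use it to expand queries" can mean in practice: an in-memory adjacency map stands in for the slim graph, and related entities are appended as extra query variants for hybrid search. Entities, relations, and the expansion strategy here are purely illustrative.

```python
# Hedged sketch of graph-assisted query expansion over a tiny entity graph.
ENTITY_GRAPH = {
    "acme corp": ["acme labs", "jane doe"],
    "jane doe": ["acme corp", "q3 earnings call"],
}

def expand_query(query: str, max_expansions: int = 3) -> list[str]:
    q = query.lower()
    expansions: list[str] = []
    for entity, neighbors in ENTITY_GRAPH.items():
        if entity in q:
            expansions.extend(neighbors)
    # Original query first, then a few related-entity variants for hybrid search.
    return [query] + [f"{query} {e}" for e in expansions[:max_expansions]]

print(expand_query("What did Jane Doe announce?"))
```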
Operational Checklist
- Bench hybrid+rerank on a labeled set; track Recall@20 and MRR. Don’t ship without evals.
- Schema discipline: same metadata keys across vector, sparse, and summaries; keep `doc_id` stable across versions.
- Run nightly compaction and weekly freshness audits (missing embeddings, stale tombstones).
- Log retrieval traces to debug “no results” and “wrong docs” quickly.
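The two metrics named above are cheap to compute once you have a labeled set; a minimal sketch, where each labeled example pairs the set of relevant `doc_id`s with the ranked `doc_id`s the retriever returned (the sample data is illustrative).

```python
# Hedged sketch of Recall@20 and MRR over a labeled retrieval set.
def recall_at_k(relevant: set[str], retrieved: list[str], k: int = 20) -> float:
    return len(relevant & set(retrieved[:k])) / max(len(relevant), 1)

def mrr(relevant: set[str], retrieved: list[str]) -> float:
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0

labeled = [
    ({"doc-1", "doc-7"}, ["doc-3", "doc-1", "doc-9"]),  # (relevant, ranked results)
]
print(sum(recall_at_k(rel, ret) for rel, ret in labeled) / len(labeled))  # mean Recall@20
print(sum(mrr(rel, ret) for rel, ret in labeled) / len(labeled))          # mean MRR
```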
If you need to start today with minimal ops: Qdrant Cloud + PostgreSQL BM25S, text-embedding-3-small, bge-reranker-large, LangChain/LlamaIndex pipeline, and a tiny router over domain/type/lang. Scale shards and add hierarchical summaries only if recall or latency forces it.
Key Takeaways
- Hybrid retrieval (BM25 + dense vectors) outperforms either approach alone at scale
- Reranking with cross-encoders is essential; always rerank top 50-100 candidates before LLM
- Incremental ingestion using CDC and chunk signatures prevents costly re-embeddings
- Rich metadata (doc_id, source, timestamp, type, lang) enables powerful filtering and routing
- Managed vector DBs (Qdrant/Weaviate/Pinecone) reduce ops burden for most teams
- Sharded indexes by month/tenant keep rebuilds bounded and performance predictable
- Evaluation first: Track Recall@20 and MRR on labeled data before production