By Gopi Krishna Tummala
Introduction: From Naive RAG to Production Systems
Modern Retrieval-Augmented Generation (RAG) systems have evolved far beyond simple vector similarity search. This article outlines a modern, industry-standard approach to building robust RAG systems using OpenSearch as the core engine. We will explore how to transition from simple vector retrieval to a production-grade multimodal system that handles text, images, and video.
Production RAG requires careful consideration of indexing algorithms, retrieval strategies, precision optimization, and multimodal capabilities. Each decision impacts latency, recall, cost, and user experience. This guide covers the critical patterns and trade-offs you’ll encounter when building systems at scale.
1. Vector Search Fundamentals: HNSW vs. IVF
In OpenSearch, the k-NN plugin provides the backbone for vector search. Choosing the right indexing algorithm is a critical system design decision that impacts latency, memory usage, and recall.
HNSW (Hierarchical Navigable Small World)
How it works: HNSW creates a multi-layered graph where the bottom layer contains all vectors and higher layers contain subsets for fast “skipping.” At query time, the algorithm starts at the top layer and navigates downward, finding approximate nearest neighbors efficiently.
When to use: HNSW is the industry standard for low-latency requirements. It provides high recall and is robust for high-dimensional data (e.g., 1536d from OpenAI embeddings). Most production RAG systems start with HNSW.
Trade-offs:
- High RAM usage: The entire graph structure is typically stored in memory. Memory consumption can be tuned via the m and ef_construction parameters, trading some recall and build time for a smaller footprint.
- Fast queries: Sub-millisecond retrieval for datasets in the millions of vectors.
IVF (Inverted File Index)
How it works: IVF clusters the vector space into “Voronoi cells” (regions). At query time, only the closest clusters are searched, dramatically reducing the search space.
When to use: Large-scale datasets where memory efficiency is prioritized over absolute speed. OpenSearch also supports IVF-PQ (product quantization) variants for extreme scale (billions of vectors) with further compression.
Trade-offs:
- Lower recall: Compared to HNSW, especially if clusters aren’t well-distributed.
- Requires periodic re-training: As data distribution shifts, clusters may need recalculation.
- Memory efficient: Significantly lower memory footprint than HNSW.
When to Choose Which
| Criteria | HNSW | IVF |
|---|---|---|
| Latency requirements | ✅ Best choice | ⚠️ Slower |
| Memory constraints | ⚠️ Higher usage | ✅ More efficient |
| Dataset size | ✅ Good up to ~100M vectors | ✅ Better for billions |
| Recall requirements | ✅ High recall | ⚠️ Lower recall |
| Update frequency | ✅ Handles frequent updates | ⚠️ Requires re-indexing |
Recommendation: Start with HNSW for most production RAG systems. Switch to IVF or IVF-PQ only when memory constraints become critical or you’re dealing with datasets exceeding 100 million vectors.
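As a concrete starting point, here is a minimal sketch of creating an HNSW-backed index with the opensearch-py client. The index name, field names, dimension, and parameter values are illustrative assumptions, not universal recommendations.

```python
from opensearchpy import OpenSearch

# Connect to a local cluster (adjust host/auth for your deployment).
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Minimal HNSW index: m and ef_construction trade memory and build time for recall.
index_body = {
    "settings": {
        "index.knn": True,
        "index.knn.algo_param.ef_search": 100,  # search-time candidate list size
    },
    "mappings": {
        "properties": {
            "chunk_text": {"type": "text"},
            "embedding": {
                "type": "knn_vector",
                "dimension": 1536,  # e.g., a 1536-d text embedding model
                "method": {
                    "name": "hnsw",
                    "space_type": "cosinesimil",
                    "engine": "nmslib",
                    "parameters": {"m": 24, "ef_construction": 128},
                },
            },
        }
    },
}

client.indices.create(index="rag-chunks", body=index_body)
```

Switching to IVF (faiss engine) additionally requires a training step, where cluster centroids are built from a representative sample of vectors before documents can be indexed against the trained model.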
2. Advanced RAG Patterns
To move beyond “Naive RAG,” production systems implement a multi-stage pipeline that addresses common retrieval failures.
Hybrid Search (BM25 + Vector)
Pure vector search often fails on specific keywords (e.g., “iPhone 15 Pro Max”) because the embedding might group it with “smartphones” generally, losing the specificity. Conversely, keyword search can miss semantic similarity (e.g., “mobile device” vs. “smartphone”).
Solution: Hybrid search combines the precision of keyword matching with the semantic understanding of vector search.
Implementation: OpenSearch supports native hybrid search via the hybrid query type (available from OpenSearch 2.9+). The hybrid query runs BM25 and vector sub-queries in parallel and merges their results through a search pipeline, using score normalization and combination (e.g., min-max normalization with a weighted arithmetic mean); newer versions also add rank-based fusion such as Reciprocal Rank Fusion (RRF).
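A minimal sketch of this setup, assuming OpenSearch 2.9+ and the opensearch-py client: a search pipeline performs min-max normalization and weighted combination, and the hybrid query fuses a BM25 match with a k-NN sub-query. Index and field names follow the example mapping later in this article; the query vector is a placeholder.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# 1) Search pipeline that normalizes and combines BM25 + vector scores.
pipeline_body = {
    "phase_results_processors": [
        {
            "normalization-processor": {
                "normalization": {"technique": "min_max"},
                "combination": {
                    "technique": "arithmetic_mean",
                    "parameters": {"weights": [0.3, 0.7]},  # BM25 weight, vector weight
                },
            }
        }
    ]
}
client.transport.perform_request(
    "PUT", "/_search/pipeline/hybrid-pipeline", body=pipeline_body
)

# 2) Hybrid query fusing a keyword match with a k-NN sub-query.
query_vector = [0.0] * 1536  # placeholder: embed the user query here
search_body = {
    "query": {
        "hybrid": {
            "queries": [
                {"match": {"text_content": {"query": "iPhone 15 Pro Max battery"}}},
                {"knn": {"text_vector": {"vector": query_vector, "k": 50}}},
            ]
        }
    }
}
results = client.transport.perform_request(
    "POST",
    "/multimodal-index/_search",
    params={"search_pipeline": "hybrid-pipeline"},
    body=search_body,
)
```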
Reciprocal Rank Fusion (RRF): This industry-standard algorithm normalizes and combines scores from different retrieval methods:
- BM25 scores are transformed into ranks
- Vector similarity scores are transformed into ranks
- RRF computes a combined score:
score = Σ 1 / (k + rank), summed for each document across both result sets, where k is a tuning parameter (typically 60) that controls the influence of lower-ranked results
This ensures that documents appearing in both result sets get boosted, while still preserving highly relevant results from either method alone.
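To make the formula concrete, here is a small, framework-free sketch of RRF over two ranked lists of document IDs; the lists and k value are illustrative.

```python
from collections import defaultdict

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of doc IDs; score = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in result_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]
vector_hits = ["doc1", "doc4", "doc3"]
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# doc1 and doc3 appear in both lists, so they rise to the top.
```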
Query Rewriting & Expansion
Users often ask vague questions or use terminology that doesn’t match the indexed content. Query rewriting transforms user queries into more effective search terms.
Technique: Use a small LLM (like GPT-4o-mini or Claude Haiku) to rewrite the user’s query into a more descriptive search term or generate multiple variations to broaden the search (Multi-Query Retrieval).
Example:
- Original: “How do I fix my phone?”
- Rewritten variations:
- “troubleshooting smartphone issues”
- “common mobile device problems”
- “phone repair guide”
- “smartphone troubleshooting steps”
Each variation can be used as a separate query, with results aggregated and ranked. This significantly improves recall for ambiguous queries.
Multi-Query Retrieval: Instead of a single rewritten query, generate 3-5 query variations, retrieve documents for each, and merge using RRF. This is particularly effective when the original query is ambiguous or domain-specific.
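A sketch of Multi-Query Retrieval under stated assumptions: generate_query_variations stands in for a small LLM call and search for any retrieval backend (both are hypothetical stubs returning toy data), and results are merged with RRF.

```python
from collections import defaultdict

def generate_query_variations(query: str) -> list[str]:
    # Stand-in for a small LLM call (e.g., GPT-4o-mini) that rewrites the query.
    return ["troubleshooting smartphone issues", "phone repair guide"]

def search(query: str, top_k: int = 20) -> list[str]:
    # Stand-in for a hybrid search call returning ranked document IDs.
    return ["doc1", "doc2", "doc3"]

def multi_query_retrieve(user_query: str, k: int = 60) -> list[tuple[str, float]]:
    variations = [user_query] + generate_query_variations(user_query)
    scores = defaultdict(float)
    for variation in variations:
        for rank, doc_id in enumerate(search(variation), start=1):
            scores[doc_id] += 1.0 / (k + rank)  # RRF merge across variations
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(multi_query_retrieve("How do I fix my phone?"))
```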
Parent-Document Retrieval
Standard chunking (e.g., 500 tokens) often loses context. When a small chunk matches a query, the surrounding context that provides the “big picture” is missing.
Solution: Index small “child” chunks for high-granularity search, but when a match is found, retrieve the larger “parent” document or surrounding context to provide to the LLM.
How it works:
- During indexing, create small chunks (children) for detailed retrieval
- Maintain a mapping from child → parent document
- At query time, retrieve top child chunks
- Expand to include parent documents (or neighboring chunks)
- Pass the enriched context to the LLM
Implementation: This pattern is natively supported in frameworks like LlamaIndex (via HierarchicalNodeParser) and Haystack. In OpenSearch, you can implement this by:
- Storing parent_id metadata on child documents
- Using a terms query to fetch parent documents after initial retrieval
- Merging and deduplicating before passing to the LLM
This ensures the model has both the precise match and the broader context needed for accurate generation.
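A minimal sketch of this flow with opensearch-py, assuming child chunks carry a parent_id field and full parent documents live in a separate index (here called rag-parents, an illustrative name).

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

def retrieve_with_parents(query_vector, child_index="rag-chunks", top_k=10):
    # 1) Retrieve small child chunks with a k-NN query.
    child_hits = client.search(
        index=child_index,
        body={
            "size": top_k,
            "query": {"knn": {"embedding": {"vector": query_vector, "k": top_k}}},
        },
    )["hits"]["hits"]

    # 2) Collect parent IDs stored as metadata on each child.
    parent_ids = {
        hit["_source"]["parent_id"]
        for hit in child_hits
        if hit["_source"].get("parent_id")
    }

    # 3) Fetch the parent documents in one terms query, then deduplicate.
    parents = client.search(
        index="rag-parents",  # assumed separate index holding full parent documents
        body={"size": len(parent_ids), "query": {"terms": {"_id": list(parent_ids)}}},
    )["hits"]["hits"]
    return child_hits, parents
```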
3. The Precision Layer: Re-ranking with Cross-Encoders
Initial retrieval is fast but can be “noisy.” Even with hybrid search, you might retrieve documents that are topically related but not actually relevant to the specific query intent.
Two-Stage Retrieval Architecture
Production RAG systems use a two-stage approach to balance speed and precision:
Stage 1 (Bi-Encoder - Fast Retrieval):
- Use HNSW to pull the top 50–200 candidate documents
- This is fast (sub-millisecond) but less accurate
- Recall-focused: cast a wide net
Stage 2 (Cross-Encoder - Precise Re-ranking):
- Pass the query and the candidate documents through a Cross-Encoder model (e.g., BAAI/bge-reranker, mixedbread-ai/mxbai-rerank, or sentence-transformers models fine-tuned for reranking)
- Cross-encoders process the query and document together, allowing for full attention between them
- This yields much higher relevance scores but is slower and more expensive
Why Cross-Encoders are More Accurate:
Unlike Bi-Encoders (where query and document are embedded separately), Cross-Encoders process them together. This allows the model to:
- Attend to specific word overlaps and interactions
- Understand nuanced relationships (e.g., negation, temporal ordering)
- Score relevance based on the actual query-document pair, not just similarity in embedding space
Cost Consideration: Cross-encoders are slower and more expensive (they process each query-document pair separately), so they’re only applied to the reduced candidate set from Stage 1. This gives you the best of both worlds: fast retrieval with precise reranking.
Popular Models:
- BAAI/bge-reranker-v2-m3: Multilingual, supports 30+ languages
- mixedbread-ai/mxbai-rerank: Fast and accurate for English
- Cohere Rerank API: Managed service option (highest accuracy, but per-query cost)
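A minimal reranking sketch using the sentence-transformers CrossEncoder wrapper with one of the models above; the query and candidate texts are placeholders.

```python
from sentence_transformers import CrossEncoder

# Load a reranker from the list above; weights are downloaded on first use.
reranker = CrossEncoder("BAAI/bge-reranker-v2-m3", max_length=512)

query = "How do I reset network settings on my phone?"
candidates = [
    "Step-by-step guide to resetting network settings on Android and iOS.",
    "History of mobile network standards from 2G to 5G.",
    "Troubleshooting Wi-Fi connectivity issues on smartphones.",
]

# Score each (query, document) pair jointly, then keep the top results.
scores = reranker.predict([(query, doc) for doc in candidates])
reranked = sorted(zip(candidates, scores), key=lambda pair: pair[1], reverse=True)
for doc, score in reranked[:2]:
    print(f"{score:.3f}  {doc}")
```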
4. Multimodal Search: Image & Video
Modern RAG isn’t limited to text. By using CLIP (Contrastive Language-Image Pre-training), we can map text, images, and video frames into the same vector space, enabling unified search across modalities.
Image Search
How it works: Store CLIP embeddings of images in a knn_vector field in OpenSearch. A user’s text query is embedded via CLIP’s text encoder and matched against the image vectors.
Example use case: “Show me images of sunset over mountains” — the text query is embedded, matched against image vectors, and relevant images are retrieved even if they weren’t tagged with those keywords.
Embedding dimensions: CLIP dimensions vary by model:
- ViT-B/32: 512 dimensions
- ViT-L/14: 768 dimensions
- Larger variants such as OpenCLIP ViT-H/14: 1024 dimensions
Choose the model based on your accuracy vs. latency trade-off.
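A sketch of text-to-image retrieval using the sentence-transformers CLIP wrapper (clip-ViT-B-32, 512 dimensions) against a knn_vector field. This assumes the image_vector field was created with dimension 512 to match this model; the example mapping later in this article uses 768 for a larger CLIP variant.

```python
from sentence_transformers import SentenceTransformer
from opensearchpy import OpenSearch

# CLIP ViT-B/32 embeds both text and images into the same 512-d space.
clip = SentenceTransformer("clip-ViT-B-32")
client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# Embed the user's text query with CLIP's text encoder...
query_vector = clip.encode("sunset over mountains").tolist()

# ...and match it against pre-indexed image embeddings (assumed 512-d field).
response = client.search(
    index="multimodal-index",
    body={
        "size": 5,
        "query": {"knn": {"image_vector": {"vector": query_vector, "k": 5}}},
        "_source": ["metadata.source_url", "caption_text"],
    },
)
for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"])
```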
Video Retrieval
Video adds complexity: you need to handle temporal information and extract meaningful frames.
Frame Sampling Strategy:
- Extract frames at set intervals (e.g., 1 frame per second)
- For long videos, consider adaptive sampling: higher density during action scenes (detected via scene change detection)
- Store frame-level embeddings and timestamps
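A minimal fixed-interval sampling sketch with OpenCV; the interval and defaults are illustrative.

```python
import cv2  # pip install opencv-python

def sample_frames(video_path: str, every_n_seconds: float = 1.0):
    """Yield (timestamp_seconds, frame) pairs at a fixed sampling interval."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0
    step = max(1, int(round(fps * every_n_seconds)))
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if frame_idx % step == 0:
            yield frame_idx / fps, frame  # timestamp in seconds, BGR image array
        frame_idx += 1
    cap.release()

# Each sampled frame can then be embedded with CLIP and indexed with its timestamp.
```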
Captioning (Visual-to-Text): Use models like BLIP-2, LLaVA, or even PaliGemma to generate textual descriptions for each frame. This enables caption search using standard BM25, providing an alternative retrieval path.
Hybrid Video Search: Combine:
- Frame-level CLIP embeddings (semantic similarity)
- Generated captions (keyword matching via BM25)
- Temporal metadata (timestamps for precise moment retrieval)
Temporal Search: Index timestamps alongside vectors so the RAG system can point users to the exact moment in a video. Store start_time and end_time metadata to enable temporal queries like “show me the scene where the character enters the room.”
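For illustration, a single sampled frame might be indexed as one document combining its CLIP vector, generated caption, and temporal metadata, following the field names of the example mapping below; all values here are placeholders.

```python
from opensearchpy import OpenSearch

client = OpenSearch(hosts=[{"host": "localhost", "port": 9200}])

# One document per sampled frame: CLIP vector + generated caption + timestamps.
frame_doc = {
    "video_frame_vector": [0.0] * 768,  # placeholder CLIP embedding
    "caption_text": "A person walks through a doorway into a dimly lit room.",
    "metadata": {
        "media_type": "video_frame",
        "source_url": "s3://videos/episode-01.mp4",
        "start_time": 734.0,  # seconds into the video
        "end_time": 735.0,
        "parent_id": "episode-01",  # link back to the full video
    },
}
client.index(index="multimodal-index", body=frame_doc)
```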
Advanced Techniques:
- ImageBind: A newer multimodal model that can embed audio, video, depth, and more into the same space
- Video-specific encoders: Use CLIP on sampled frames + temporal pooling (e.g., average or max pooling across frames) to create video-level embeddings
- Scene segmentation: Detect scene boundaries and create embeddings per scene rather than per frame
Example OpenSearch Mapping for Multimodal RAG
PUT /multimodal-index
{
"settings": {
"index.knn": true,
"index.knn.algo_param.ef_search": 100
},
"mappings": {
"properties": {
"text_content": {
"type": "text",
"analyzer": "standard"
},
"text_vector": {
"type": "knn_vector",
"dimension": 1536,
"method": {
"name": "hnsw",
"space_type": "cosinesimil",
"engine": "nmslib",
"parameters": {
"ef_construction": 128,
"m": 24
}
}
},
"image_vector": {
"type": "knn_vector",
"dimension": 768,
"method": {
"name": "hnsw",
"space_type": "cosinesimil",
"engine": "nmslib"
}
},
"video_frame_vector": {
"type": "knn_vector",
"dimension": 768,
"method": {
"name": "hnsw",
"space_type": "cosinesimil"
}
},
"caption_text": {
"type": "text",
"analyzer": "standard"
},
"metadata": {
"properties": {
"timestamp": { "type": "float" },
"start_time": { "type": "float" },
"end_time": { "type": "float" },
"source_url": { "type": "keyword" },
"media_type": { "type": "keyword" },
"parent_id": { "type": "keyword" }
}
}
}
}
}
Key features:
- Separate vector fields for text, images, and video frames (allowing independent optimization)
- caption_text field for hybrid search on video frames
- Temporal metadata for video moment retrieval
- parent_id for parent-document retrieval pattern
5. Context Window & Token Management
Even with long-context models (128k+ tokens), passing too much information leads to the “Lost in the Middle” phenomenon—where LLMs ignore context in the center of the prompt, focusing on the beginning and end.
This was first documented in research by Liu et al. (2023) and remains relevant even for modern models with extended contexts.
Strategies for Token Optimization
1. Prompt Compression:
Use libraries like LLMLingua or Selective Context to remove redundant tokens from retrieved context while preserving key information (a minimal LLMLingua sketch appears after these strategies):
- Identifies and removes repetitive phrases
- Maintains named entities and key facts
- Can reduce context by 50-70% with minimal information loss
2. Summarization: For very long documents, retrieve the document and have a secondary LLM summarize the relevant sections before passing it to the final prompt:
- Extract top chunks via retrieval
- Summarize each chunk independently
- Pass summarized chunks to the main LLM
- Reduces token count while preserving high-level information
3. Selective Context Pruning: Techniques such as relevance-based pruning and context caching (supported in frameworks like LangChain and LlamaIndex) can:
- Cache frequently accessed context
- Prune context based on relevance scores
- Dynamically adjust context length based on query complexity
4. Hierarchical Context: For extremely long documents:
- Store document summaries at multiple levels (section, chapter, document)
- Retrieve at the appropriate granularity based on query
- Only expand to detailed chunks when needed
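Returning to strategy 1, here is a sketch of prompt compression assuming LLMLingua's documented PromptCompressor API (package llmlingua); parameter and return-value names may differ across versions, so treat this as illustrative.

```python
# pip install llmlingua -- API names follow the project's documented examples.
from llmlingua import PromptCompressor

# Loads a small causal LM used to score token importance (slow on first run).
compressor = PromptCompressor()

retrieved_chunks = [
    "To reset network settings on most Android phones, open Settings > System > Reset options ...",
    "Resetting network settings removes saved Wi-Fi networks and paired Bluetooth devices ...",
]

result = compressor.compress_prompt(
    "\n\n".join(retrieved_chunks),
    question="How do I reset network settings on my phone?",
    target_token=500,  # compress the retrieved context to roughly this budget
)
compressed_context = result["compressed_prompt"]  # pass this to the final LLM prompt
```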
Token Budget Strategy
Establish a token budget for your RAG pipeline:
- Retrieval stage: Top K documents (e.g., 10–20)
- Re-ranking stage: Top R documents after reranking (e.g., 5–10)
- Final context: Compress or summarize to fit within model limits
- Reserve tokens for system prompts, user query, and model output
This ensures you’re maximizing the value of retrieved information while staying within model constraints.
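As a simple enforcement mechanism, the sketch below greedily packs reranked chunks into a fixed context budget, assuming tiktoken for token counting; the budget and encoding name are illustrative.

```python
import tiktoken  # pip install tiktoken

def fit_to_budget(chunks, max_context_tokens=6000, encoding_name="cl100k_base"):
    """Greedily keep the highest-ranked chunks that fit within the token budget."""
    enc = tiktoken.get_encoding(encoding_name)
    kept, used = [], 0
    for chunk in chunks:  # assumed already ordered by reranker score
        n_tokens = len(enc.encode(chunk))
        if used + n_tokens > max_context_tokens:
            break
        kept.append(chunk)
        used += n_tokens
    return kept, used

# Example: the remaining window is reserved for system prompt, query, and output.
context_chunks, tokens_used = fit_to_budget(["chunk A ...", "chunk B ..."])
```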
Conclusion: Building Production RAG Systems
Building production-grade RAG systems requires careful orchestration of multiple components:
- Choose the right indexing algorithm (HNSW for speed, IVF for scale)
- Implement hybrid search to combine keyword and semantic retrieval
- Use query rewriting to handle ambiguous user queries
- Apply parent-document retrieval to preserve context
- Re-rank with cross-encoders for precision
- Extend to multimodal with CLIP and captioning
- Manage token budgets with compression and summarization
Each of these patterns addresses a specific failure mode in naive RAG systems. Implementing them together creates a robust, production-ready system that can handle real-world query complexity and scale to enterprise workloads.
Next Steps
Would you like a deep-dive on:
- Implementing hybrid queries with built-in RRF in OpenSearch (code examples)
- Parent-document retrieval patterns (full implementation guide)
- Deploying multimodal embeddings at scale (infrastructure considerations)
- Cost optimization strategies for cross-encoder reranking
This article is part of the GenAI Systems track. For more on production ML infrastructure, explore the MLOps & Production track.