Introduction
Retrieval-Augmented Generation (RAG) is the most practical pattern for building AI applications with your own data. Vector databases are the enabling technology. Together, they address the biggest limitations of Large Language Models (LLMs).
The Problem: LLM Limitations
LLMs Have Issues:
- Knowledge Cutoff — Training data has a date limit
- Hallucinations — Confidently generate false information
- No Access to Your Data — Don't know your documents, products, or policies
- Context Window Limits — Can't process millions of documents at once
RAG Solves These:
- Retrieves relevant information at query time
- Grounds responses in actual documents
- Works with any data you provide
- Scales to massive document collections
How RAG Works
RAG Flow:
1. Index Phase (Offline):
- Split documents into chunks
- Generate embeddings for each chunk
- Store embeddings in vector database
2. Query Phase (Online):
- User asks a question
- Generate embedding for question
- Search vector DB for similar chunks
- Send question + retrieved chunks to LLM
- LLM generates answer using the context
Visual Flow:
User Question
↓
[Generate Embedding]
↓
[Vector Search] → Top K relevant chunks
↓
[Augmented Prompt]
Question + Retrieved Context + Instructions
↓
[LLM Generates Response]
↓
Answer (with citations)
What Are Embeddings?
Embeddings convert text into numerical vectors that capture semantic meaning.
How They Work:
- Text goes into an embedding model
- Out comes a vector (array of numbers)
- Similar meanings = similar vectors
- Typically 384-1536 dimensions
Example:
"king" → [0.2, 0.5, -0.1, 0.8, ...]
"queen" → [0.3, 0.4, -0.1, 0.7, ...] // Similar!
"banana" → [-0.5, 0.1, 0.9, -0.2, ...] // Different!
Popular Embedding Models:
- OpenAI: text-embedding-ada-002, text-embedding-3-small/large
- Azure: Azure OpenAI Embeddings
- Open Source: sentence-transformers, BGE, E5
Vector Databases
Vector databases store and search embeddings efficiently.
Why Not Regular Databases?
- Traditional DBs search by exact match or text patterns
- Vector search finds semantically similar content
- "How do I reset my password?" matches "Password recovery steps"
- Requires specialized indexes (HNSW, IVF)
Popular Vector Databases:
Purpose-Built:
- Pinecone — Managed, easy to use
- Weaviate — Open source, hybrid search
- Qdrant — Open source, Rust-based
- Milvus — Open source, highly scalable
- Chroma — Lightweight, Python-native
Vector Extensions:
- PostgreSQL + pgvector
- Elasticsearch with dense_vector
- Redis with RediSearch
Cloud Services:
- Azure AI Search (vector + keyword)
- Amazon OpenSearch with k-NN
- Google Vertex AI Matching Engine
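To make the developer experience concrete, here is a minimal sketch using Chroma from the list above; the collection name and documents are made up, and Chroma falls back to its built-in default embedding model when you don't supply embeddings yourself:
# Minimal vector store demo (assumes: pip install chromadb)
import chromadb
client = chromadb.Client()                      # in-memory; PersistentClient(path=...) keeps data on disk
collection = client.create_collection("docs")   # hypothetical collection name
collection.add(
    ids=["1", "2"],
    documents=["Password recovery steps", "Quarterly revenue report"],
    metadatas=[{"category": "support"}, {"category": "finance"}],
)
# Semantic search: the query shares no keywords with the stored text but still matches by meaning
results = collection.query(query_texts=["How do I reset my password?"], n_results=1)
print(results["documents"])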
Building a RAG System
Step 1: Document Ingestion
Documents (PDF, HTML, etc.)
↓
[Parse & Extract Text]
↓
[Chunk into segments]
↓
[Generate Embeddings]
↓
[Store in Vector DB]
Step 2: Chunking Strategy
Chunking is critical. Options:
- Fixed size (500-1000 tokens)
- Sentence boundaries
- Paragraph boundaries
- Semantic chunking (by topic)
- Overlap between chunks (100-200 tokens)
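As a rough illustration of fixed-size chunking with overlap, splitting on whitespace as a stand-in for a real tokenizer such as tiktoken:
# Naive fixed-size chunking with overlap; words stand in for tokens here
def chunk_text(text, chunk_size=500, overlap=100):
    words = text.split()                 # a production pipeline would count model tokens instead
    step = chunk_size - overlap          # each chunk starts `step` words after the previous one
    return [" ".join(words[i:i + chunk_size]) for i in range(0, len(words), step)]
chunks = chunk_text(open("handbook.txt").read())   # hypothetical input file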
Step 3: Retrieval
# Pseudo-code: most vector DB clients expose a call shaped roughly like this
query_embedding = embed(user_question)    # embed the question with the same model used at indexing time
results = vector_db.search(
    query_embedding,
    top_k=5,                              # number of chunks to retrieve
    filter={"category": "technical"}      # optional metadata filter to narrow the search
)
Step 4: Generation
Prompt:
"Based on the following context, answer the question.
If the answer isn't in the context, say so.
Context:
{retrieved_chunks}
Question: {user_question}
Answer:"
RAG Optimization Techniques
Retrieval Improvements:
- Hybrid search (vector + keyword; a fusion sketch follows this list)
- Reranking retrieved results
- Query expansion/rewriting
- Metadata filtering
- Hierarchical retrieval
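A common way to merge vector and keyword result lists is reciprocal rank fusion; a small sketch, where k=60 is the constant typically used in the literature:
# Reciprocal rank fusion: merge two ranked lists of document IDs into one ranking
def rrf(vector_hits, keyword_hits, k=60):
    scores = {}
    for hits in (vector_hits, keyword_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
merged = rrf(["a", "b", "c"], ["b", "d", "a"])   # IDs ranked high in both lists come first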
Chunking Improvements:
- Smaller chunks for precision
- Larger chunks for context
- Parent-child chunks (retrieve child, include parent)
- Overlapping chunks
Generation Improvements:
- Include source citations
- Use structured output
- Confidence scoring
- Iterative retrieval (ask follow-up questions)
Advanced RAG Patterns
Multi-Query RAG:
Original Question → [Generate variations]
→ Query 1 → Results 1
→ Query 2 → Results 2
→ Query 3 → Results 3
→ [Merge & Deduplicate] → LLM
Self-Querying: Let the LLM generate metadata filters:
"Find contracts from 2023 about AWS"
↓
{vector_search: "contracts AWS",
filters: {year: 2023, type: "contract"}}
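In code, self-querying usually means prompting the model for a structured plan and parsing it before searching; a hedged sketch with a hypothetical llm callable:
# Self-querying: let the LLM translate the question into a search string plus metadata filters
import json
def self_query(question, llm):
    plan = llm(
        "Return JSON with keys 'vector_search' (string) and 'filters' (object) "
        f"for this request: {question}"
    )
    return json.loads(plan)
# plan = self_query("Find contracts from 2023 about AWS", llm)
# results = vector_db.search(embed(plan["vector_search"]), filter=plan["filters"])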
Agentic RAG: LLM decides what to retrieve:
Question → [Agent decides: search DB? call API? calculate?]
→ [Execute actions]
→ [Synthesize answer]
RAG vs Fine-Tuning
Use RAG When:
- Data changes frequently
- Need citations/traceability
- Data is large or varied
- Quick to implement
- Privacy-sensitive (data stays in your own store rather than in model weights)
Use Fine-Tuning When:
- Specific style or behavior needed
- Domain-specific terminology
- Smaller, stable dataset
- Performance is critical
Often Combined: Fine-tune for style/format, RAG for knowledge.
Cloud RAG Services
Azure:
- Azure AI Search (retrieval)
- Azure OpenAI (embeddings + generation)
- Azure AI Studio (orchestration)
AWS:
- Amazon Bedrock Knowledge Bases
- OpenSearch with k-NN
- Kendra (enterprise search)
Google Cloud:
- Vertex AI Search
- Vertex AI Matching Engine
- Enterprise Search on Gen App Builder
Common Pitfalls
Chunking Issues:
- Chunks too large (irrelevant content)
- Chunks too small (missing context)
- Breaking mid-sentence/thought
Retrieval Issues:
- Wrong embedding model for domain
- Not enough retrieved chunks
- No metadata filtering
- Ignoring keyword search
Generation Issues:
- Prompt doesn't instruct the model to answer from the context
- Not handling "no answer found"
- No citation of sources
- Not validating outputs
Evaluation Metrics
Retrieval Quality:
- Precision@K — Fraction of the top K results that are relevant
- Recall@K — Fraction of all relevant items that appear in the top K
- MRR — Mean Reciprocal Rank (average of 1/rank of the first relevant result)
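These retrieval metrics are easy to compute directly; a small sketch where retrieved is a ranked list of document IDs and relevant is the set of ground-truth IDs for one query:
# Retrieval metrics for a single query
def precision_at_k(retrieved, relevant, k):
    return sum(1 for d in retrieved[:k] if d in relevant) / k
def recall_at_k(retrieved, relevant, k):
    return sum(1 for d in retrieved[:k] if d in relevant) / len(relevant)
def reciprocal_rank(retrieved, relevant):
    return next((1.0 / rank for rank, d in enumerate(retrieved, start=1) if d in relevant), 0.0)
# MRR is the mean of reciprocal_rank across all evaluation queries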
Generation Quality:
- Faithfulness — Answer supported by context
- Answer Relevance — Addresses the question
- Context Precision — Retrieved context was useful
Tools: Ragas, TruLens, LangSmith
Exam Relevance
For AI Certifications (AI-102, AWS ML):
- Know RAG architecture and components
- When to use RAG vs fine-tuning
- Embedding models and vector search
- Azure AI Search / Bedrock Knowledge Bases
For Solutions Architect:
- System design with RAG
- Choosing vector database
- Scaling considerations
- Cost optimization
Key Takeaway
RAG enables AI applications that use your data by combining vector search with LLM generation. The pattern is: embed and store documents → retrieve relevant chunks at query time → augment LLM prompt with context → generate grounded responses. Vector databases make semantic search possible. This is currently the most practical way to build AI applications with enterprise knowledge.
