Building a RAG-Powered Documentation Assistant with Vector Search
Introduction
As developers, we constantly switch between different documentation sources, trying to find relevant information quickly. Traditional keyword-based search often falls short when dealing with technical documentation that uses varied terminology. This is where RAG (Retrieval-Augmented Generation) comes in, combining the power of semantic search with large language models to create intelligent documentation assistants.
In this guide, we'll build a documentation assistant that can understand context, find relevant information across multiple sources, and provide accurate, contextual answers to developer queries.
Understanding RAG Architecture
RAG works by splitting large documents into chunks, converting those chunks into vector embeddings, storing them in a vector database, and then retrieving the most relevant chunks to augment LLM prompts. This grounds the model's answers in up-to-date, domain-specific information instead of relying solely on what it learned during training.
Key Components
- Document Processor: Chunks and preprocesses documentation
- Embedding Model: Converts text to vector representations
- Vector Database: Stores and retrieves similar embeddings
- LLM Integration: Generates responses based on retrieved context
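Before writing any of these components, here is a rough sketch of how they fit together end to end. The classes are implemented step by step in the sections below; the `./docs` path and the sample question are just placeholders.

```python
# Conceptual end-to-end flow; the classes are built step by step below.
processor = DocumentProcessor()
store = VectorStore()
engine = RAGEngine(store)

# Ingestion: docs -> chunks -> embeddings -> vector store
store.add_documents(processor.load_and_chunk_documents("./docs"))

# Query time: question -> similar chunks -> augmented prompt -> answer
result = engine.query("How do I configure authentication?")
print(result['answer'])
```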
Setting Up the Environment
We'll use Python with several key libraries for our implementation:
```bash
pip install openai chromadb langchain tiktoken python-dotenv fastapi uvicorn
```
Create a .env file with your API keys:
```
OPENAI_API_KEY=your_openai_api_key_here
```
Document Processing and Chunking
First, let's create a document processor that can handle various file formats and intelligently chunk content:
```python
import os
from typing import List, Dict

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import DirectoryLoader, TextLoader
import tiktoken


class DocumentProcessor:
    def __init__(self, chunk_size: int = 1000, chunk_overlap: int = 200):
        self.chunk_size = chunk_size
        self.chunk_overlap = chunk_overlap
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap,
            length_function=self._tiktoken_len
        )

    def _tiktoken_len(self, text: str) -> int:
        """Calculate token count using tiktoken"""
        encoding = tiktoken.get_encoding('cl100k_base')
        return len(encoding.encode(text))

    def load_and_chunk_documents(self, docs_path: str) -> List[Dict]:
        """Load documents and split into chunks"""
        loader = DirectoryLoader(
            docs_path,
            glob="**/*.md",
            loader_cls=TextLoader
        )
        documents = loader.load()

        chunks = []
        for doc in documents:
            split_docs = self.text_splitter.split_documents([doc])
            for chunk in split_docs:
                chunks.append({
                    'content': chunk.page_content,
                    'metadata': {
                        'source': chunk.metadata.get('source', ''),
                        'chunk_id': len(chunks)
                    }
                })
        return chunks
```
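Before indexing anything, it's worth sanity-checking what the chunker produces. A quick smoke test might look like this (the ./docs path is a placeholder for your documentation folder):

```python
# Quick check of the chunking output before indexing anything
processor = DocumentProcessor(chunk_size=1000, chunk_overlap=200)
chunks = processor.load_and_chunk_documents("./docs")

print(f"{len(chunks)} chunks produced")
print(chunks[0]['metadata'])        # e.g. {'source': 'docs/intro.md', 'chunk_id': 0}
print(chunks[0]['content'][:200])   # first 200 characters of the first chunk
```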
Vector Database Integration
Next, we'll set up ChromaDB for vector storage and retrieval:
```python
import chromadb
import openai
from typing import List, Dict


class VectorStore:
    def __init__(self, collection_name: str = "docs_collection"):
        self.client = chromadb.PersistentClient(path="./chroma_db")
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}
        )

    def add_documents(self, chunks: List[Dict]):
        """Add document chunks to vector store"""
        documents = [chunk['content'] for chunk in chunks]
        metadatas = [chunk['metadata'] for chunk in chunks]
        ids = [f"doc_{i}" for i in range(len(chunks))]

        # Generate embeddings using OpenAI
        embeddings = self._generate_embeddings(documents)

        self.collection.add(
            documents=documents,
            metadatas=metadatas,
            ids=ids,
            embeddings=embeddings
        )

    def _generate_embeddings(self, texts: List[str]) -> List[List[float]]:
        """Generate embeddings using the OpenAI API (pre-1.0 SDK interface)"""
        response = openai.Embedding.create(
            input=texts,
            model="text-embedding-ada-002"
        )
        return [data['embedding'] for data in response['data']]

    def similarity_search(self, query: str, k: int = 5) -> List[Dict]:
        """Search for the k most similar document chunks"""
        query_embedding = self._generate_embeddings([query])[0]

        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=k
        )

        return [
            {
                'content': doc,
                'metadata': meta,
                'score': score  # cosine distance from Chroma: lower means more similar
            }
            for doc, meta, score in zip(
                results['documents'][0],
                results['metadatas'][0],
                results['distances'][0]
            )
        ]
```
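With the store in place, indexing and retrieval look roughly like this, reusing the chunks from the DocumentProcessor example above (the query text is illustrative):

```python
# Index the chunks, then retrieve the closest matches for a query
store = VectorStore()
store.add_documents(chunks)

for hit in store.similarity_search("How do I rotate API keys?", k=3):
    print(hit['metadata']['source'], round(hit['score'], 3))
```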
RAG Query Engine
Now let's create the core RAG engine that combines retrieval with generation:
```python
import openai
from typing import List, Dict


class RAGEngine:
    def __init__(self, vector_store: VectorStore):
        self.vector_store = vector_store
        self.system_prompt = """
        You are a helpful documentation assistant. Use the provided context to answer questions accurately.
        If the context doesn't contain enough information, say so clearly.
        Always cite the source files when possible.
        """

    def query(self, question: str, max_context_length: int = 3000) -> Dict:
        """Process a query using RAG"""
        # Retrieve relevant documents
        relevant_docs = self.vector_store.similarity_search(question, k=5)

        # Prepare context from retrieved documents
        context = self._prepare_context(relevant_docs, max_context_length)

        # Generate response
        response = self._generate_response(question, context)

        return {
            'answer': response,
            'sources': [doc['metadata']['source'] for doc in relevant_docs[:3]],
            'context_used': len(context)
        }

    def _prepare_context(self, docs: List[Dict], max_length: int) -> str:
        """Concatenate retrieved chunks into a context string, capped at max_length characters"""
        context_parts = []
        current_length = 0

        for doc in docs:
            content = doc['content']
            source = doc['metadata']['source']
            formatted_content = f"Source: {source}\nContent: {content}\n---\n"

            if current_length + len(formatted_content) > max_length:
                break

            context_parts.append(formatted_content)
            current_length += len(formatted_content)

        return "\n".join(context_parts)

    def _generate_response(self, question: str, context: str) -> str:
        """Generate a response with the OpenAI chat API (pre-1.0 SDK interface)"""
        messages = [
            {"role": "system", "content": self.system_prompt},
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ]

        response = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=messages,
            max_tokens=500,
            temperature=0.1
        )

        return response['choices'][0]['message']['content']
```
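Wiring retrieval and generation together, a query against the engine looks roughly like this (the question text is illustrative, and `store` is the VectorStore from the previous example):

```python
engine = RAGEngine(store)
result = engine.query("Which environment variables does the service need?")

print(result['answer'])
print("Sources:", result['sources'])
print("Context characters used:", result['context_used'])
```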
FastAPI Web Interface
Let's create a simple API interface for our documentation assistant:
```python
import os
from typing import List

import openai
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from dotenv import load_dotenv

load_dotenv()
openai.api_key = os.getenv("OPENAI_API_KEY")

app = FastAPI(title="Documentation Assistant API")

# Initialize components
doc_processor = DocumentProcessor()
vector_store = VectorStore()
rag_engine = RAGEngine(vector_store)


class QueryRequest(BaseModel):
    question: str


class QueryResponse(BaseModel):
    answer: str
    sources: List[str]
    context_used: int


@app.post("/query", response_model=QueryResponse)
async def query_documentation(request: QueryRequest):
    try:
        result = rag_engine.query(request.question)
        return QueryResponse(**result)
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))


@app.post("/load-docs")
async def load_documentation(docs_path: str):
    try:
        chunks = doc_processor.load_and_chunk_documents(docs_path)
        vector_store.add_documents(chunks)
        return {"message": f"Loaded {len(chunks)} document chunks"}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
```
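Assuming the code above lives in a single module named main.py (an assumption; adjust to your project layout), you can start the server with uvicorn and exercise both endpoints with curl. The docs path and question below are placeholders:

```bash
uvicorn main:app --reload

# Index a documentation folder (path is a placeholder)
curl -X POST "http://localhost:8000/load-docs?docs_path=./docs"

# Ask a question
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"question": "How do I configure logging?"}'
```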
Best Practices and Optimization
Chunking Strategies
- Use semantic chunking for code documentation
- Maintain context with overlapping chunks
- Consider document structure when splitting (see the sketch after this list)
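For the structure-aware point above, one option is LangChain's MarkdownHeaderTextSplitter, which splits on headings before the character-level splitter runs, so chunks don't straddle unrelated sections. A minimal sketch, with illustrative header labels and sample text:

```python
from langchain.text_splitter import MarkdownHeaderTextSplitter

# Split on markdown headings first so chunks stay within one section
header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "h1"), ("##", "h2")]
)

sections = header_splitter.split_text("# Auth\n## API keys\nRotate keys every 90 days.")
for section in sections:
    print(section.metadata, section.page_content)
```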
Embedding Optimization
- Cache embeddings to reduce API costs (see the sketch after this list)
- Use batch processing for large document sets
- Experiment with different embedding models
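One simple way to act on the caching advice above is to hash each text and keep its embedding on disk, so repeated indexing runs don't re-pay for the same chunks. The helper name and cache location here are illustrative, not part of the implementation above:

```python
import hashlib
import json
import os
from typing import List

CACHE_DIR = "./embedding_cache"  # illustrative location


def cached_embedding(text: str, embed_fn) -> List[float]:
    """Return a cached embedding if one exists, otherwise compute and store it."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = os.path.join(CACHE_DIR, f"{key}.json")

    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)

    embedding = embed_fn([text])[0]  # e.g. VectorStore._generate_embeddings
    with open(path, "w") as f:
        json.dump(embedding, f)
    return embedding
```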
Query Enhancement
- Implement query expansion for better retrieval (see the sketch after this list)
- Use re-ranking models for improved results
- Add conversation memory for follow-up questions
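One lightweight take on the query expansion point above: ask the chat model for paraphrases of the question, retrieve for each variant, and de-duplicate the hits. This is a sketch that assumes the VectorStore class from earlier and the same pre-1.0 OpenAI SDK; the function name is hypothetical:

```python
import openai
from typing import Dict, List


def expanded_search(store: VectorStore, question: str, k: int = 5) -> List[Dict]:
    """Retrieve with the original question plus LLM-generated paraphrases."""
    prompt = f"Rewrite this question two different ways, one per line:\n{question}"
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    variants = [question] + response['choices'][0]['message']['content'].splitlines()

    seen, hits = set(), []
    for variant in variants:
        if not variant.strip():
            continue
        for hit in store.similarity_search(variant.strip(), k=k):
            if hit['content'] not in seen:
                seen.add(hit['content'])
                hits.append(hit)
    return hits
```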
Conclusion
Building a RAG-powered documentation assistant significantly improves the developer experience by providing contextual, accurate answers from your documentation. The combination of semantic search and language models creates a powerful tool that understands intent and provides relevant information quickly.
This implementation provides a solid foundation that you can extend with features like conversation history, multiple document formats, and advanced retrieval strategies. The key is to iterate based on user feedback and continuously improve the chunking and retrieval strategies for your specific use case.