Building Your First RAG System: Combining LLMs with Custom Knowledge
Introduction
Large Language Models like GPT-4 are incredibly powerful, but they have limitations: knowledge cutoffs, hallucinations, and no access to your private data. Retrieval-Augmented Generation (RAG) solves these problems by combining the reasoning power of LLMs with your own knowledge base. Today, we'll build a practical RAG system from scratch.
What is RAG and Why Use It?
RAG works by retrieving relevant information from your documents before generating a response. Instead of relying solely on the model's training data, RAG:
- Provides up-to-date information from your knowledge base
- Reduces hallucinations by grounding responses in factual data
- Allows you to work with private or domain-specific information
- Surfaces the most relevant context for each query through semantic search
Architecture Overview
Our RAG system has three main components:
- Document Processing: Convert documents into searchable embeddings
- Retrieval System: Find relevant chunks based on user queries
- Generation: Combine the retrieved context with the user's question and pass both to the LLM (see the end-to-end sketch after this list)
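Before diving into code, here is a rough sketch of how those three pieces fit together. answerFromDocuments is a hypothetical helper; the classes it uses are built in the sections below:

// High-level flow: a sketch, not production code
async function answerFromDocuments(pdfPath, question) {
  // 1. Document processing: PDF -> text chunks -> embeddings stored in the vector DB
  await new DocumentProcessor().processDocument(pdfPath, 'doc-1');

  // 2 + 3. Retrieval + generation: embed the question, fetch the most similar
  // chunks, and ask the LLM to answer using only that retrieved context
  const ragSystem = new RAGSystem();
  return ragSystem.generateAnswer(question);
}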
Setting Up the Environment
First, let's install the required dependencies:
npm init -y
npm install openai @pinecone-database/pinecone pdf-parse
npm install dotenv express multer
Create a .env file:
OPENAI_API_KEY=your_openai_key
PINECONE_API_KEY=your_pinecone_key
PINECONE_ENVIRONMENT=your_pinecone_env
Document Processing and Embedding
Let's create a document processor that converts PDFs into embeddings:
const OpenAI = require('openai');
const { Pinecone } = require('@pinecone-database/pinecone');
const pdf = require('pdf-parse');
const fs = require('fs');

class DocumentProcessor {
  constructor() {
    this.openai = new OpenAI({
      apiKey: process.env.OPENAI_API_KEY
    });
    this.pinecone = new Pinecone({
      apiKey: process.env.PINECONE_API_KEY
    });
    this.index = this.pinecone.index('rag-knowledge-base');
  }
  // Split text into overlapping chunks so context isn't lost at chunk boundaries
  chunkText(text, maxChunkSize = 1000, overlap = 200) {
    const chunks = [];
    let start = 0;
    while (start < text.length) {
      const end = Math.min(start + maxChunkSize, text.length);
      chunks.push(text.slice(start, end).trim());
      if (end === text.length) break; // stop after the final chunk to avoid an infinite loop
      start = end - overlap;
    }
    return chunks;
  }
  // Generate embeddings for text chunks
  async generateEmbeddings(text) {
    const response = await this.openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: text
    });
    return response.data[0].embedding;
  }
  // Process and store document
  async processDocument(filePath, documentId) {
    try {
      const dataBuffer = fs.readFileSync(filePath);
      const pdfData = await pdf(dataBuffer);
      const chunks = this.chunkText(pdfData.text);
      const vectors = [];
      for (let i = 0; i < chunks.length; i++) {
        const embedding = await this.generateEmbeddings(chunks[i]);
        vectors.push({
          id: `${documentId}-chunk-${i}`,
          values: embedding,
          metadata: {
            text: chunks[i],
            documentId,
            chunkIndex: i
          }
        });
      }
      await this.index.upsert(vectors);
      console.log(`Processed ${chunks.length} chunks for document ${documentId}`);
    } catch (error) {
      console.error('Document processing error:', error);
      throw error;
    }
  }
}
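With the processor in place, indexing a document is a single call. A minimal usage sketch; the file path and document ID below are placeholders for your own values:

require('dotenv').config();

const processor = new DocumentProcessor();

(async () => {
  // Hypothetical PDF path and document ID; substitute your own
  await processor.processDocument('./docs/company-handbook.pdf', 'handbook-v1');
})();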
Building the Retrieval System
Now let's create the retrieval component that finds relevant information:
class RetrievalSystem {
  constructor() {
    this.openai = new OpenAI({
      apiKey: process.env.OPENAI_API_KEY
    });
    this.pinecone = new Pinecone({
      apiKey: process.env.PINECONE_API_KEY
    });
    this.index = this.pinecone.index('rag-knowledge-base');
  }

  async retrieveRelevantChunks(query, topK = 5) {
    try {
      // Generate embedding for the query
      const queryEmbedding = await this.openai.embeddings.create({
        model: 'text-embedding-3-small',
        input: query
      });
      // Search for similar chunks
      const searchResponse = await this.index.query({
        vector: queryEmbedding.data[0].embedding,
        topK,
        includeMetadata: true
      });
      return searchResponse.matches.map(match => ({
        text: match.metadata.text,
        score: match.score,
        documentId: match.metadata.documentId
      }));
    } catch (error) {
      console.error('Retrieval error:', error);
      throw error;
    }
  }
}
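Used on its own, retrieval simply returns the top-K chunks for a question. A quick sketch, assuming documents have already been indexed by DocumentProcessor (the query string is only an example):

require('dotenv').config();

const retrieval = new RetrievalSystem();

(async () => {
  const chunks = await retrieval.retrieveRelevantChunks('What is the refund policy?', 3);
  // Each result carries the chunk text, its similarity score, and the source document ID
  chunks.forEach(chunk =>
    console.log(chunk.score.toFixed(3), chunk.documentId, chunk.text.slice(0, 80))
  );
})();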
Creating the RAG Response Generator
Finally, let's combine retrieval with generation:
class RAGSystem {
  constructor() {
    this.retrieval = new RetrievalSystem();
    this.openai = new OpenAI({
      apiKey: process.env.OPENAI_API_KEY
    });
  }

  async generateAnswer(question) {
    try {
      // Retrieve relevant chunks
      const relevantChunks = await this.retrieval.retrieveRelevantChunks(question);

      // Prepare context from retrieved chunks
      const context = relevantChunks
        .map(chunk => chunk.text)
        .join('\n\n');

      // Create prompt with context
      const prompt = `
Context information:
${context}
Question: ${question}
Based on the context provided above, please answer the question. If the context doesn't contain enough information to answer the question, please say so.
Answer:`;

      // Generate response
      const completion = await this.openai.chat.completions.create({
        model: 'gpt-4',
        messages: [{
          role: 'user',
          content: prompt
        }],
        temperature: 0.1,
        max_tokens: 500
      });

      return {
        answer: completion.choices[0].message.content,
        sources: relevantChunks.map(chunk => chunk.documentId),
        relevanceScores: relevantChunks.map(chunk => chunk.score)
      };
    } catch (error) {
      console.error('RAG generation error:', error);
      throw error;
    }
  }
}
Putting It All Together
Let's create an Express API to use our RAG system:
const express = require('express');
const multer = require('multer');
require('dotenv').config();

// DocumentProcessor, RetrievalSystem, and RAGSystem are the classes defined above
// (or require them here if you split them into separate modules)

const app = express();
const upload = multer({ dest: 'uploads/' });
const ragSystem = new RAGSystem();
const docProcessor = new DocumentProcessor();

app.use(express.json());

// Upload and process documents
app.post('/upload', upload.single('document'), async (req, res) => {
  try {
    const documentId = Date.now().toString();
    await docProcessor.processDocument(req.file.path, documentId);
    res.json({ message: 'Document processed successfully', documentId });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

// Ask questions
app.post('/ask', async (req, res) => {
  try {
    const { question } = req.body;
    const result = await ragSystem.generateAnswer(question);
    res.json(result);
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

app.listen(3000, () => {
  console.log('RAG system running on port 3000');
});
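With the server running on port 3000, you can exercise both endpoints from the command line. The file name and question below are just examples:

# Index a PDF
curl -F "document=@./docs/company-handbook.pdf" http://localhost:3000/upload

# Ask a question against the indexed documents
curl -X POST http://localhost:3000/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the vacation policy?"}'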
Best Practices and Optimization
To improve your RAG system:
- Chunk Size: Experiment with different chunk sizes and overlaps
- Reranking: Use cross-encoders to rerank retrieved chunks
- Hybrid Search: Combine semantic search with keyword search
- Metadata Filtering: Add document type, date, or category filters (see the sketch after this list)
- Caching: Cache embeddings and frequent queries
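As one concrete example, metadata filtering is easy to bolt onto the retrieval step: Pinecone queries accept a filter over the metadata we stored at upsert time. Below is a sketch of an extra method you could add to RetrievalSystem; retrieveFromDocument is a hypothetical name, and the filter uses Pinecone's $eq operator on the documentId field we saved with each chunk:

  // Variant of retrieveRelevantChunks that restricts results to a single document
  async retrieveFromDocument(query, documentId, topK = 5) {
    const queryEmbedding = await this.openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: query
    });
    const searchResponse = await this.index.query({
      vector: queryEmbedding.data[0].embedding,
      topK,
      includeMetadata: true,
      filter: { documentId: { $eq: documentId } } // e.g. 'handbook-v1'
    });
    return searchResponse.matches.map(match => ({
      text: match.metadata.text,
      score: match.score,
      documentId: match.metadata.documentId
    }));
  }

The same pattern works for any metadata you attach during processDocument, such as document type, category, or upload date.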
Conclusion
You now have a working RAG system that can answer questions based on your own documents. This foundation can be extended with features like multi-document conversations, real-time updates, and advanced filtering. RAG opens up powerful possibilities for creating AI applications that are both intelligent and grounded in your specific domain knowledge.