Building Your First RAG System: Combining LLMs with Custom Knowledge
Introduction
Large Language Models like GPT-4 are incredibly powerful, but they have limitations: knowledge cutoffs, hallucinations, and no access to your private data. Retrieval-Augmented Generation (RAG) solves these problems by combining the reasoning power of LLMs with your own knowledge base. Today, we'll build a practical RAG system from scratch.
What is RAG and Why Use It?
RAG works by retrieving relevant information from your documents before generating a response. Instead of relying solely on the model's training data, RAG:
- Provides up-to-date information from your knowledge base
- Reduces hallucinations by grounding responses in factual data
- Allows you to work with private or domain-specific information
- Surfaces the most relevant context for each query through semantic search
Architecture Overview
Our RAG system has three main components:
- Document Processing: Convert documents into searchable embeddings
- Retrieval System: Find relevant chunks based on user queries
- Generation: Combine the retrieved context with the user's question and pass both to the LLM (see the end-to-end sketch after this list)
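Before diving into code, here is a rough sketch of how those three pieces fit together. answerFromDocuments is a hypothetical helper; the classes it uses are built in the sections below:

// High-level flow: a sketch, not production code
async function answerFromDocuments(pdfPath, question) {
  // 1. Document processing: PDF -> text chunks -> embeddings stored in the vector DB
  await new DocumentProcessor().processDocument(pdfPath, 'doc-1');

  // 2 + 3. Retrieval + generation: embed the question, fetch the most similar
  // chunks, and ask the LLM to answer using only that retrieved context
  const ragSystem = new RAGSystem();
  return ragSystem.generateAnswer(question);
}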
Setting Up the Environment
First, let's install the required dependencies:
npm init -y
npm install openai @pinecone-database/pinecone pdf-parse
npm install dotenv express multer
Create a .env file:
OPENAI_API_KEY=your_openai_key
PINECONE_API_KEY=your_pinecone_key
PINECONE_ENVIRONMENT=your_pinecone_env
Document Processing and Embedding
Let's create a document processor that converts PDFs into embeddings:
const OpenAI = require('openai');
const { Pinecone } = require('@pinecone-database/pinecone');
const pdf = require('pdf-parse');
const fs = require('fs');

class DocumentProcessor {
  constructor() {
    this.openai = new OpenAI({
      apiKey: process.env.OPENAI_API_KEY
    });
    this.pinecone = new Pinecone({
      apiKey: process.env.PINECONE_API_KEY
    });
    this.index = this.pinecone.index('rag-knowledge-base');
  }
  // Split text into overlapping chunks so context isn't lost at chunk boundaries
  chunkText(text, maxChunkSize = 1000, overlap = 200) {
    const chunks = [];
    let start = 0;
    while (start < text.length) {
      const end = Math.min(start + maxChunkSize, text.length);
      chunks.push(text.slice(start, end).trim());
      if (end === text.length) break; // stop after the final chunk to avoid an infinite loop
      start = end - overlap;
    }
    return chunks;
  }
  // Generate embeddings for text chunks
  async generateEmbeddings(text) {
    const response = await this.openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: text
    });
    return response.data[0].embedding;
  }
  // Process and store document
  async processDocument(filePath, documentId) {
    try {
      const dataBuffer = fs.readFileSync(filePath);
      const pdfData = await pdf(dataBuffer);
      const chunks = this.chunkText(pdfData.text);
      const vectors = [];
      for (let i = 0; i < chunks.length; i++) {
        const embedding = await this.generateEmbeddings(chunks[i]);
        vectors.push({
          id: `${documentId}-chunk-${i}`,
          values: embedding,
          metadata: {
            text: chunks[i],
            documentId,
            chunkIndex: i
          }
        });
      }
      await this.index.upsert(vectors);
      console.log(`Processed ${chunks.length} chunks for document ${documentId}`);
    } catch (error) {
      console.error('Document processing error:', error);
      throw error;
    }
  }
}
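With the processor in place, indexing a document is a single call. A minimal usage sketch; the file path and document ID below are placeholders for your own values:

require('dotenv').config();

const processor = new DocumentProcessor();

(async () => {
  // Hypothetical PDF path and document ID; substitute your own
  await processor.processDocument('./docs/company-handbook.pdf', 'handbook-v1');
})();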
Building the Retrieval System
Now let's create the retrieval component that finds relevant information:
class RetrievalSystem {
  constructor() {
    this.openai = new OpenAI({
      apiKey: process.env.OPENAI_API_KEY
    });
    this.pinecone = new Pinecone({
      apiKey: process.env.PINECONE_API_KEY
    });
    this.index = this.pinecone.index('rag-knowledge-base');
  }

  async retrieveRelevantChunks(query, topK = 5) {
    try {
      // Generate embedding for the query
      const queryEmbedding = await this.openai.embeddings.create({
        model: 'text-embedding-3-small',
        input: query
      });
      // Search for similar chunks
      const searchResponse = await this.index.query({
        vector: queryEmbedding.data[0].embedding,
        topK,
        includeMetadata: true
      });
      return searchResponse.matches.map(match => ({
        text: match.metadata.text,
        score: match.score,
        documentId: match.metadata.documentId
      }));
    } catch (error) {
      console.error('Retrieval error:', error);
      throw error;
    }
  }
}
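Used on its own, retrieval simply returns the top-K chunks for a question. A quick sketch, assuming documents have already been indexed by DocumentProcessor (the query string is only an example):

require('dotenv').config();

const retrieval = new RetrievalSystem();

(async () => {
  const chunks = await retrieval.retrieveRelevantChunks('What is the refund policy?', 3);
  // Each result carries the chunk text, its similarity score, and the source document ID
  chunks.forEach(chunk =>
    console.log(chunk.score.toFixed(3), chunk.documentId, chunk.text.slice(0, 80))
  );
})();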
Creating the RAG Response Generator
Finally, let's combine retrieval with generation:
class RAGSystem {
  constructor() {
    this.retrieval = new RetrievalSystem();
    this.openai = new OpenAI({
      apiKey: process.env.OPENAI_API_KEY
    });
  }

  async generateAnswer(question) {
    try {
      // Retrieve relevant chunks
      const relevantChunks = await this.retrieval.retrieveRelevantChunks(question);

      // Prepare context from retrieved chunks
      const context = relevantChunks
        .map(chunk => chunk.text)
        .join('\n\n');

      // Create prompt with context
      const prompt = `
Context information:
${context}
Question: ${question}
Based on the context provided above, please answer the question. If the context doesn't contain enough information to answer the question, please say so.
Answer:`;

      // Generate response
      const completion = await this.openai.chat.completions.create({
        model: 'gpt-4',
        messages: [{
          role: 'user',
          content: prompt
        }],
        temperature: 0.1,
        max_tokens: 500
      });

      return {
        answer: completion.choices[0].message.content,
        sources: relevantChunks.map(chunk => chunk.documentId),
        relevanceScores: relevantChunks.map(chunk => chunk.score)
      };
    } catch (error) {
      console.error('RAG generation error:', error);
      throw error;
    }
  }
}
Putting It All Together
Let's create an Express API to use our RAG system:
const express = require('express');
const multer = require('multer');
require('dotenv').config();

// DocumentProcessor, RetrievalSystem, and RAGSystem are the classes defined above
// (or require them here if you split them into separate modules)

const app = express();
const upload = multer({ dest: 'uploads/' });
const ragSystem = new RAGSystem();
const docProcessor = new DocumentProcessor();

app.use(express.json());

// Upload and process documents
app.post('/upload', upload.single('document'), async (req, res) => {
  try {
    const documentId = Date.now().toString();
    await docProcessor.processDocument(req.file.path, documentId);
    res.json({ message: 'Document processed successfully', documentId });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

// Ask questions
app.post('/ask', async (req, res) => {
  try {
    const { question } = req.body;
    const result = await ragSystem.generateAnswer(question);
    res.json(result);
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

app.listen(3000, () => {
  console.log('RAG system running on port 3000');
});
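With the server running on port 3000, you can exercise both endpoints from the command line. The file name and question below are just examples:

# Index a PDF
curl -F "document=@./docs/company-handbook.pdf" http://localhost:3000/upload

# Ask a question against the indexed documents
curl -X POST http://localhost:3000/ask \
  -H "Content-Type: application/json" \
  -d '{"question": "What is the vacation policy?"}'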
Best Practices and Optimization
To improve your RAG system:
- Chunk Size: Experiment with different chunk sizes and overlaps
- Reranking: Use cross-encoders to rerank retrieved chunks
- Hybrid Search: Combine semantic search with keyword search
- Metadata Filtering: Add document type, date, or category filters (see the sketch after this list)
- Caching: Cache embeddings and frequent queries
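As one concrete example, metadata filtering is easy to bolt onto the retrieval step: Pinecone queries accept a filter over the metadata we stored at upsert time. Below is a sketch of an extra method you could add to RetrievalSystem; retrieveFromDocument is a hypothetical name, and the filter uses Pinecone's $eq operator on the documentId field we saved with each chunk:

  // Variant of retrieveRelevantChunks that restricts results to a single document
  async retrieveFromDocument(query, documentId, topK = 5) {
    const queryEmbedding = await this.openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: query
    });
    const searchResponse = await this.index.query({
      vector: queryEmbedding.data[0].embedding,
      topK,
      includeMetadata: true,
      filter: { documentId: { $eq: documentId } } // e.g. 'handbook-v1'
    });
    return searchResponse.matches.map(match => ({
      text: match.metadata.text,
      score: match.score,
      documentId: match.metadata.documentId
    }));
  }

The same pattern works for any metadata you attach during processDocument, such as document type, category, or upload date.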
Conclusion
You now have a working RAG system that can answer questions based on your own documents. This foundation can be extended with features like multi-document conversations, real-time updates, and advanced filtering. RAG opens up powerful possibilities for creating AI applications that are both intelligent and grounded in your specific domain knowledge.