What Is RAG and Why Does It Matter for Business Chatbots?
Retrieval-Augmented Generation (RAG) is the technique that makes AI chatbots actually useful for business. Instead of relying on a general-purpose LLM that hallucinates answers, a RAG chatbot retrieves real information from your documents, knowledge base, or product catalog before generating a response.
The result: accurate, grounded answers that reference your actual data. Dramatically fewer hallucinations. No made-up product specs. Just reliable responses your customers can trust.
In this guide I walk through building a production RAG chatbot using Django, OpenAI embeddings, pgvector for vector storage, and GPT-4o for generation. This is the same architecture I use in my AI chatbot projects for clients across Europe.
RAG Architecture Overview
A RAG chatbot has three core components:
- Document ingestion pipeline: your documents are split into chunks, converted to vector embeddings, and stored in a vector database.
- Retrieval engine: when a user asks a question, the system finds the most relevant document chunks using semantic similarity search.
- Generation engine: the retrieved chunks are passed to an LLM as context, which generates a natural-language answer grounded in your data.
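The three components compose into a single request path. Here is a minimal sketch of that flow, using stand-in `retrieve` and `generate` functions (the real implementations are built in the steps below):

```python
def retrieve(question):
    """Stand-in retriever: returns the chunks most similar to the question."""
    return ["[Source: faq]\nWe ship within 2 business days."]

def generate(question, chunks):
    """Stand-in generator: an LLM call that answers using only the chunks."""
    return f"Based on {len(chunks)} source(s): ..."

def rag_answer(question):
    chunks = retrieve(question)        # 2. retrieval engine
    return generate(question, chunks)  # 3. generation engine
```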
Step 1: Set Up Django with pgvector
pgvector is a PostgreSQL extension that adds vector similarity search. Since Django already uses PostgreSQL in most production setups, pgvector is the most natural choice — no separate vector database needed.
```bash
# Install dependencies
pip install django openai pgvector psycopg2-binary tiktoken
```

Then enable the extension in PostgreSQL:

```sql
-- Run once per database, in psql
CREATE EXTENSION IF NOT EXISTS vector;
```
Create a Django model to store document chunks with their embeddings:
```python
from django.db import models
from pgvector.django import VectorField


class DocumentChunk(models.Model):
    source = models.CharField(max_length=500)
    content = models.TextField()
    # 1536 dimensions matches OpenAI's text-embedding-ada-002
    embedding = VectorField(dimensions=1536)
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        indexes = [
            models.Index(fields=["source"]),
        ]
```
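For large document sets, exact nearest-neighbour scans get slow. pgvector (0.5.0 and later) supports an approximate HNSW index; a sketch in raw SQL, where the table name `myapp_documentchunk` is an assumption — check the name Django actually generated for your app:

```sql
-- Optional: approximate-nearest-neighbour index for cosine distance
CREATE INDEX documentchunk_embedding_hnsw
ON myapp_documentchunk
USING hnsw (embedding vector_cosine_ops);
```

If you prefer to manage this through migrations, the pgvector Django integration also ships index classes you can add to `Meta.indexes`.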
Step 2: Build the Document Ingestion Pipeline
The ingestion pipeline reads your documents, splits them into chunks of 500-800 tokens, generates embeddings using OpenAI, and stores them in PostgreSQL.
```python
import tiktoken
from openai import OpenAI

client = OpenAI()
encoder = tiktoken.encoding_for_model("text-embedding-ada-002")


def chunk_text(text, max_tokens=600, overlap=100):
    words = text.split()
    chunks, current = [], []
    current_tokens = 0
    for word in words:
        word_tokens = len(encoder.encode(word))
        if current_tokens + word_tokens > max_tokens and current:
            chunks.append(" ".join(current))
            # Carry over trailing words as an approximate token overlap
            overlap_words = current[-overlap // 4:]
            current = overlap_words
            current_tokens = sum(len(encoder.encode(w)) for w in current)
        current.append(word)
        current_tokens += word_tokens
    if current:
        chunks.append(" ".join(current))
    return chunks


def generate_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-ada-002",
        input=text,
    )
    return response.data[0].embedding


def ingest_document(source_name, text):
    chunks = chunk_text(text)
    for chunk in chunks:
        embedding = generate_embedding(chunk)
        DocumentChunk.objects.create(
            source=source_name,
            content=chunk,
            embedding=embedding,
        )
    return len(chunks)
```
Step 3: Implement Semantic Search
When a user asks a question, convert it to an embedding and find the closest document chunks using cosine similarity:
```python
from pgvector.django import CosineDistance


def search_documents(query, top_k=5):
    query_embedding = generate_embedding(query)
    results = (
        DocumentChunk.objects
        .annotate(distance=CosineDistance("embedding", query_embedding))
        .order_by("distance")[:top_k]
    )
    return results
```
This query runs entirely in PostgreSQL, making it fast and avoiding the cost of a separate managed vector database service like Pinecone (whose paid plans have historically started around $70/month).
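Under the hood, `CosineDistance` translates to pgvector's `<=>` operator. The equivalent raw SQL looks roughly like this (the table name `myapp_documentchunk` and the placeholder vector literal are assumptions for illustration):

```sql
SELECT source, content,
       embedding <=> '[0.1, 0.2, ...]'::vector AS distance
FROM myapp_documentchunk
ORDER BY distance
LIMIT 5;
```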
Step 4: Generate Answers with Context
Pass the retrieved chunks as context to GPT-4o:
```python
def answer_question(user_message, conversation_history=None):
    # Retrieve relevant context
    chunks = search_documents(user_message)
    context = "\n\n".join(
        f"[Source: {c.source}]\n{c.content}" for c in chunks
    )
    system_prompt = (
        "You are a helpful assistant for our business.\n"
        "Answer questions using ONLY the context provided below.\n"
        "If the context does not contain the answer, say you do not know.\n"
        "Always be accurate and cite your sources.\n\n"
        f"Context:\n{context}"
    )
    messages = [{"role": "system", "content": system_prompt}]
    if conversation_history:
        messages.extend(conversation_history[-6:])  # Last 3 user/assistant turns
    messages.append({"role": "user", "content": user_message})
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        temperature=0.3,  # Lower temperature keeps answers factual
        max_tokens=1000,
    )
    return response.choices[0].message.content
```
Step 5: Create the Django API Endpoint
Expose the chatbot as a REST API endpoint:
```python
import json

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt


@csrf_exempt
def chat_endpoint(request):
    if request.method != "POST":
        return JsonResponse({"error": "POST required"}, status=405)
    data = json.loads(request.body)
    message = data.get("message", "").strip()
    history = data.get("history", [])
    if not message:
        return JsonResponse({"error": "Empty message"}, status=400)
    answer = answer_question(message, history)
    return JsonResponse({"reply": answer})
```
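To wire the view up, add it to your URLconf. The path and module layout below are assumptions — adjust them to your project:

```python
# urls.py (hypothetical project layout)
from django.urls import path

from . import views

urlpatterns = [
    path("api/chat/", views.chat_endpoint, name="chat"),
]
```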
Production Considerations
Embedding Cost
OpenAI text-embedding-ada-002 costs $0.10 per million tokens. Ingesting 1,000 pages of documentation (roughly 650,000 tokens) costs well under $0.10. Query embeddings are negligible — a typical chatbot with 1,000 daily questions costs well under $1/month for embeddings.
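A quick back-of-the-envelope check, where the words-per-page and tokens-per-word figures are rough assumptions rather than measured values:

```python
def embedding_cost_usd(pages, words_per_page=500, tokens_per_word=1.3,
                       price_per_million=0.10):
    """Estimate the one-off embedding cost for a document set."""
    tokens = pages * words_per_page * tokens_per_word
    return tokens / 1_000_000 * price_per_million

cost = embedding_cost_usd(1000)  # 1,000 pages -> about 7 cents
```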
Chunk Size Optimization
Smaller chunks (300-500 tokens) give more precise retrieval but may miss context. Larger chunks (800-1200 tokens) provide more context but reduce precision. For most business use cases, 500-800 tokens with 100-token overlap works well.
Caching
Cache frequently asked questions and their answers using Django's cache framework or Redis. This reduces API calls and cuts response latency from 2-3 seconds to under 100ms for repeated questions.
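The idea in code, as a framework-agnostic sketch — in production you would back this with Django's cache framework or Redis rather than an in-process dict:

```python
import hashlib
import time

_CACHE = {}  # key -> (expires_at, answer); stand-in for Redis/Django cache


def cached_answer(message, answer_fn, ttl=3600):
    """Return a cached answer for a repeated question, else compute and store it."""
    # Normalize whitespace and case so trivially different phrasings share a key
    normalized = " ".join(message.lower().split())
    key = hashlib.sha256(normalized.encode()).hexdigest()
    hit = _CACHE.get(key)
    if hit is not None and hit[0] > time.time():
        return hit[1]
    answer = answer_fn(message)  # e.g. answer_question from Step 4
    _CACHE[key] = (time.time() + ttl, answer)
    return answer
```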
Monitoring
Log every conversation and flag queries where the chatbot says "I don't know." These gaps show you which documents to add to your knowledge base. Over time, this feedback loop makes the chatbot increasingly accurate.
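A minimal gap detector might look like this — the phrase list is an assumption, so match it to whatever refusal wording your system prompt actually enforces:

```python
GAP_PHRASES = ("do not know", "don't know", "cannot find", "no information")


def is_knowledge_gap(answer):
    """Flag answers where the bot admitted it lacked context."""
    lowered = answer.lower()
    return any(phrase in lowered for phrase in GAP_PHRASES)
```

Flagged questions can then be logged or stored for review, giving you a prioritized list of documents to ingest next.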
RAG vs Fine-Tuning: Which Should You Choose?
RAG is better for most business chatbots because:
- No training costs: fine-tuning GPT-4 costs $25+ per million tokens and takes hours. RAG uses the base model.
- Easy updates: add new documents anytime. Fine-tuning requires retraining.
- Transparency: you can see which documents the chatbot used to generate each answer.
- Accuracy: RAG grounds answers in your actual data, reducing hallucinations dramatically.
Fine-tuning makes sense only when you need the model to adopt a specific writing style or handle highly specialized terminology that the base model struggles with.
What This Costs as a Complete Project
A production RAG chatbot with Django, pgvector, and OpenAI typically costs EUR 2,000-4,000 for development and EUR 30-80/month to run. This includes document ingestion, semantic search, conversation memory, web widget, and admin dashboard. See my detailed chatbot pricing guide for more specifics.
Need a RAG chatbot for your business? I build production-ready AI chatbots with Django and OpenAI. Tell me about your knowledge base and I will provide a detailed quote.
Get in touch →