Building an AI Chatbot with RAG - Complete Deep Dive
Hey everyone! Welcome back to another tutorial. Today we're going to learn how to build an AI-powered chatbot that can answer questions about ANY business using its website content!
I am super excited about this topic because this is how modern AI assistants work! Companies like Intercom, Drift, and Zendesk use similar technology. Trust me, understanding RAG will make you stand out in interviews!
What we will cover:
- What is RAG (Retrieval Augmented Generation)?
- System Architecture Overview
- Web Scraping with BeautifulSoup
- Vector Databases and Embeddings
- Building the RAG Pipeline
- FastAPI Backend Server
- Frontend Chat Interface
- How Everything Works Together
- Interview Questions
What is RAG (Retrieval Augmented Generation)?
Let's start with the most important question - What is RAG?
Here's the simple definition:
"RAG is a technique that combines information retrieval with AI text generation to provide accurate, context-aware responses."
Wait, what does that mean? Let me break it down for you!
Traditional ChatGPT:
====================
User: "What are your hotel's check-in hours?"
AI:   "I don't have specific information about your hotel..."

RAG-powered Chatbot:
====================
User: "What are your hotel's check-in hours?"
AI:   *searches hotel website data*
AI:   "Our check-in time is 3:00 PM and check-out is 11:00 AM!"
See the difference? RAG gives the AI access to YOUR specific data!
- Retrieval - Find relevant information from a database
- Augmented - Add that information to the AI's context
- Generation - Generate a response using both AI knowledge + your data
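Put together in (very rough) code, the three steps look like this. This is a conceptual sketch only - the vector_store and llm objects are built step by step later in this post:

# Conceptual RAG sketch - the real implementation comes later
docs = vector_store.similarity_search(query)            # 1. Retrieval
context = "\n".join(d.page_content for d in docs)
prompt = f"Context:\n{context}\n\nQuestion: {query}"    # 2. Augmentation
answer = llm.invoke(prompt)                             # 3. Generation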
Why is RAG important?
- AI can answer questions about YOUR specific business
- No need to fine-tune expensive models
- Data stays up-to-date (just re-scrape!)
- Reduces AI hallucinations
System Architecture Overview
Before diving into code, let's understand the big picture!
AI Chatbot Architecture:
========================
               ┌─────────────────┐
               │   User Browser  │
               │  (Chat Widget)  │
               └────────┬────────┘
                        │
                        ▼
               ┌─────────────────┐
               │  FastAPI Server │
               │   (server.py)   │
               └────────┬────────┘
                        │
                        ▼
               ┌─────────────────┐
               │    Main Agent   │
               │    (main.py)    │
               └────────┬────────┘
                        │
       ┌────────────────┼────────────────┐
       ▼                ▼                ▼
┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│    Qdrant    │ │    OpenAI    │ │    OpenAI    │
│ Vector Store │ │  Embeddings  │ │    GPT-4o    │
└──────────────┘ └──────────────┘ └──────────────┘
The system has 3 main phases:
- Indexing Phase - Scrape website → Create embeddings → Store in Qdrant
- Query Phase - User asks a question → Convert it to a vector → Retrieve relevant chunks
- Response Phase - AI generates an answer using the retrieved context
Phase 1: Web Scraping with BeautifulSoup
First, we need to scrape the business website to get all the content!
Let's look at the indexing.py file:
# indexing.py
from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin, urlparse

def scrape_website(base_url, max_depth=2):
    visited = set()
    documents = []
    headers = {"User-Agent": "aichat-scraper/1.0"}

    def scrape_page(url, depth):
        if depth > max_depth or url in visited:
            return
        visited.add(url)
        try:
            response = requests.get(url, headers=headers, timeout=10)
            soup = BeautifulSoup(response.content, 'html.parser')

            # Remove script and style tags
            for script in soup(["script", "style"]):
                script.decompose()

            # Extract clean text
            text = soup.get_text(separator=' ', strip=True)
            if text:
                documents.append({
                    "page_content": text,
                    "metadata": {"source": url}
                })

            # Find and follow internal links
            if depth < max_depth:
                for link in soup.find_all('a', href=True):
                    href = link['href']
                    full_url = urljoin(url, href)
                    # Only follow same-domain links
                    if urlparse(full_url).netloc == urlparse(base_url).netloc:
                        scrape_page(full_url, depth + 1)
        except Exception as e:
            print(f"Error scraping {url}: {e}")

    scrape_page(base_url, 0)
    return documents
How does web scraping work?
Web Scraping Flow:
==================
1. Start at base URL (e.g., https://myhotel.com/)
   │
2. Download HTML content using requests
   │
3. Parse HTML with BeautifulSoup
   │
4. Remove <script> and <style> tags (noise!)
   │
5. Extract clean text content
   │
6. Find all <a href="..."> links
   │
7. Follow internal links (same domain only)
   │
8. Repeat until max_depth reached!
Key Points:
- max_depth=2 - How many levels deep to crawl (home → subpage → sub-subpage)
- visited set - Prevents visiting same page twice
- User-Agent header - Identifies our scraper to the website
- Same domain check - Don't follow external links!
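To wire the scraper into the rest of the pipeline, you can dump the scraped documents to the website_documents.json file that index_website() reads later. Here's a minimal driver sketch - the BASE_URL environment variable is an assumption based on the setup steps at the end of this post:

# Hypothetical driver at the bottom of indexing.py
import json
import os

if __name__ == "__main__":
    base_url = os.environ.get("BASE_URL", "https://myhotel.com/")
    docs = scrape_website(base_url)
    with open("website_documents.json", "w") as f:
        json.dump(docs, f, indent=2)
    print(f"Scraped {len(docs)} pages from {base_url}")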
Phase 2: Vector Embeddings and Qdrant
Now comes the magic part - converting text into vectors!
What are Embeddings?
Embeddings convert text into numbers (vectors) that capture meaning!
Text to Embedding:
==================
"Our hotel has a swimming pool"
│
▼ OpenAI Embedding Model
│
[0.023, -0.156, 0.892, 0.445, ... 1536 numbers]
"We offer a pool for guests"
│
▼
[0.021, -0.148, 0.887, 0.451, ... 1536 numbers]
These vectors are SIMILAR because meanings are similar!
Why vectors? Because we can do similarity search!
# Using OpenAI Embeddings
from langchain_openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings(model="text-embedding-3-small")
# This model converts text → 1536-dimensional vector
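You can see the similarity for yourself by embedding a few phrases and comparing them. A quick sketch - cosine_similarity here is a hypothetical helper, and numpy is assumed to be installed:

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: closer to 1.0 = more similar
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

v1 = embedding.embed_query("Our hotel has a swimming pool")
v2 = embedding.embed_query("We offer a pool for guests")
v3 = embedding.embed_query("Check-out is at 11:00 AM")

print(cosine_similarity(v1, v2))  # high score - similar meaning
print(cosine_similarity(v1, v3))  # lower score - different topic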
What is Qdrant?
Qdrant is a Vector Database - it stores and searches vectors efficiently!
Vector Database (Qdrant):
=========================
┌──────────────────────────────────────────────────┐
│ Collection: "website_collection" │
├──────────────────────────────────────────────────┤
│ ID │ Vector (1536 dims) │ Metadata │
├───────┼──────────────────────┼───────────────────┤
│ 1 │ [0.02, -0.15, ...] │ {url: "/about"} │
│ 2 │ [0.89, 0.23, ...] │ {url: "/rooms"} │
│ 3 │ [-0.12, 0.67, ...] │ {url: "/contact"}│
└──────────────────────────────────────────────────┘
When user asks: "Do you have rooms available?"
→ Convert question to vector
→ Find similar vectors in database
→ Return matching content!
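If you're curious what ends up inside Qdrant once indexing (shown below) has run, you can inspect the collection directly with the qdrant-client library. A small sketch, assuming Qdrant is running on localhost:6333:

# Peek inside the collection (sketch using qdrant-client)
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
info = client.get_collection("website_collection")
print(info.points_count)  # number of stored chunks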
Here's how we index the website content:
# indexing.py
import json

from langchain_core.documents import Document
from langchain_qdrant import QdrantVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

def index_website():
    # Load scraped documents
    with open("website_documents.json", "r") as f:
        documents_json = json.load(f)

    # Convert to LangChain Document objects
    documents = [
        Document(page_content=doc["page_content"], metadata=doc["metadata"])
        for doc in documents_json
    ]

    # Split into smaller chunks (important!)
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,    # Max 1000 characters per chunk
        chunk_overlap=200   # 200 char overlap between chunks
    )
    texts = splitter.split_documents(documents)

    # Store in Qdrant (embedding is the OpenAIEmbeddings instance defined earlier)
    vector_store = QdrantVectorStore.from_documents(
        texts,
        embedding=embedding,
        collection_name="website_collection",
        url="http://localhost:6333"
    )
Why do we split text into chunks?
Why Chunking is Important:
==========================
Original page: 5000 characters
"Welcome to our hotel... [lots of content] ...Contact us!"

Problem: Too big! Embedding loses detail.

Solution: Split into chunks!
Chunk 1: "Welcome to our hotel. We offer luxury rooms..."
Chunk 2: "Our amenities include swimming pool, gym..."
Chunk 3: "Room types: Standard, Deluxe, Suite..."
Chunk 4: "Contact us at email@hotel.com..."

Now each chunk has FOCUSED meaning!
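You can watch the splitter in action on a toy document. A standalone sketch, assuming the same chunk settings as above:

from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)

# A fake ~6600-character page made of repeated sentences
sample = Document(page_content="Welcome to our hotel. " * 300)
chunks = splitter.split_documents([sample])

print(len(chunks))                  # several focused chunks, not one big blob
print(len(chunks[0].page_content))  # each chunk is at most 1000 characters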
Phase 3: The RAG Query Pipeline
Now let's see how the chatbot answers questions!
Look at the main.py file:
# main.py
def query_website(query: str) -> str:
    # Connect to existing Qdrant collection
    vector_store = QdrantVectorStore.from_existing_collection(
        embedding=embedding,
        collection_name="website_collection",
        url=QDRANT_URL
    )

    # Find the top 10 most similar chunks!
    relevant_docs = vector_store.similarity_search(query, k=10)

    # Combine all relevant content
    combined_content = "\n".join([doc.page_content for doc in relevant_docs])
    return combined_content
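One detail this snippet glosses over: where QDRANT_URL (and business_name, used below) come from. A typical approach is environment variables - the variable names here are assumptions, so match them to your own .env file:

# main.py (top of file) - hypothetical config loading
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env into the process environment
QDRANT_URL = os.getenv("QDRANT_URL", "http://localhost:6333")
business_name = os.getenv("BUSINESS_NAME", "Our Hotel")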
How similarity search works:
User Query: "What rooms do you have?"
│
▼
Convert to vector
[0.45, -0.23, 0.89, ...]
│
▼
Compare with ALL vectors in Qdrant
│
▼
Return top 10 most similar chunks:
1. "Room types: Standard, Deluxe, Suite..." (0.92 similarity)
2. "Our Deluxe rooms feature king beds..." (0.89 similarity)
3. "Suite amenities include jacuzzi..." (0.85 similarity)
...
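Those similarity scores in the diagram aren't returned by the plain similarity_search call, but the vector store also offers similarity_search_with_score if you want to inspect them. A quick sketch reusing vector_store from query_website:

# Peek at the scores behind the ranking (sketch)
results = vector_store.similarity_search_with_score("What rooms do you have?", k=3)
for doc, score in results:
    print(f"{score:.2f}  {doc.page_content[:60]}")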
Now we pass this context to GPT-4o:
# main.py
from langchain_openai import ChatOpenAI

def get_business_info(query: str) -> str:
    # Get relevant content from vector store
    content = query_website(query)

    llm = ChatOpenAI(model="gpt-4o", temperature=0.8)

    prompt = f"""
    You are responding on behalf of {business_name} team.
    Always answer as "we" or "our" (the staff), not as a third-party.
    If information is not available, politely say so.

    Website Content:
    {content}

    User Query:
    {query}

    Your answer (as the business team):
    """
    response = llm.invoke(prompt)
    return response.content
This is the complete RAG flow:
Complete RAG Pipeline:
======================
User: "Do you have a gym?"
              │
              ▼
┌─────────────────────────────┐
│ 1. RETRIEVE                 │
│  - Convert query → vector   │
│  - Search Qdrant            │
│  - Get relevant chunks      │
└─────────────────────────────┘
              │
              ▼
┌─────────────────────────────┐
│ 2. AUGMENT                  │
│  - Create prompt            │
│  - Add retrieved content    │
│  - Add instructions         │
└─────────────────────────────┘
              │
              ▼
┌─────────────────────────────┐
│ 3. GENERATE                 │
│  - Send to GPT-4o           │
│  - Get AI response          │
│  - Return to user           │
└─────────────────────────────┘
              │
              ▼
AI: "Yes! We have a fully equipped fitness center
     on the 3rd floor, open 24/7 for our guests!"
LangChain Tools and Agents
The code also uses LangChain Tools - let's understand them!
# main.py
from langchain.tools import tool

@tool
def business_info_tool(query: str) -> str:
    """
    Tool to get website-related information
    """
    return get_business_info(query)
What is a Tool?
A Tool is a function that AI can call when it needs specific capabilities!
LangChain Agent with Tools:
===========================
┌──────────────────────────────────────────────────┐
│                     AI AGENT                     │
│                                                  │
│  "Hmm, user is asking about hotel services..."   │
│  "I should use the business_info_tool!"          │
│                                                  │
│  ┌────────────────────────────────────────────┐  │
│  │ Available Tools:                           │  │
│  │                                            │  │
│  │ 🔧 business_info_tool - Get website info   │  │
│  │ 🔧 book_service - Book a service           │  │
│  └────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────┘
The agent decides which tool to use based on the question!
# main.py
from langchain_core.messages import HumanMessage

def mainAgent(query: str, messages: list) -> str:
    llm = ChatOpenAI(model="gpt-4o", temperature=0)

    # Bind tools to the LLM
    tool_calling_llm = llm.bind_tools([business_info_tool])

    # Add the new user query to the conversation history
    messages = messages + [HumanMessage(content=query)]

    # LLM decides if it needs to call a tool
    ai_message = tool_calling_llm.invoke(messages)

    # If a tool was called, execute it
    if ai_message.tool_calls:
        for tool_call in ai_message.tool_calls:
            if tool_call.get("name") == "business_info_tool":
                result = business_info_tool.invoke(tool_call.get("args"))
                return result

    return ai_message.content
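If you want to smoke-test the agent from the command line before wiring up the server, a tiny hypothetical snippet does the trick (assumes your OPENAI_API_KEY is set and Qdrant is indexed):

if __name__ == "__main__":
    # Second argument is the (empty) conversation history
    print(mainAgent("Do you have a gym?", []))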
FastAPI Backend Server
The backend uses FastAPI to handle HTTP requests!
# server.py
from fastapi import FastAPI, Request
from fastapi.responses import FileResponse
from main import mainAgent

app = FastAPI()

# Serve the chat interface
@app.get("/app")
async def get_app():
    return FileResponse("index.html")

# Chat endpoint - receives user messages
@app.post("/chat")
async def chat_endpoint(request: Request):
    data = await request.json()
    query = data.get("query", "")

    # Process with AI agent
    response = mainAgent(query, [])
    return {"response": response}
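You can sanity-check the endpoint without the frontend at all. A minimal client sketch, assuming the server is running on port 8110 as configured later:

# Quick test of the /chat endpoint
import requests

resp = requests.post(
    "http://localhost:8110/chat",
    json={"query": "What are your check-in hours?"},
)
print(resp.json()["response"])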
API Flow:
Frontend → Backend Flow:
========================
Browser                              Server
   │                                    │
   │  POST /chat                        │
   │  {"query": "What rooms?"}          │
   │ ──────────────────────────────→    │
   │                                    │ → mainAgent()
   │                                    │ → query_website()
   │                                    │ → GPT-4o response
   │                                    │
   │  {"response": "We offer..."}       │
   │    ←──────────────────────────────│
   │                                    │
   ▼                                    │
Display response
Frontend Chat Interface
The frontend is a simple HTML + JavaScript chat widget!
// index.html - Key parts
// Send message to backend
form.addEventListener("submit", async (e) => {
  e.preventDefault();
  const userMsg = input.value.trim();
  if (!userMsg) return;  // Ignore empty messages
  input.value = "";

  // Show user message in chat
  addMessage("user", userMsg);

  // Call backend API
  const response = await fetch("http://localhost:8110/chat", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: userMsg })
  });
  const result = await response.json();

  // Show AI response (with Markdown support!)
  addMessage("assistant", marked.parse(result.response));
});
The UI uses Tailwind CSS for styling and marked.js for Markdown rendering!
Docker Setup
The project uses Docker Compose to run everything together!
# docker-compose.yml
services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"  # REST API
    volumes:
      - qdrant_data:/qdrant/storage

  web:
    build: .
    ports:
      - "8110:8000"
    depends_on:
      - qdrant
    command: sh -c "python indexing.py && python server.py"

volumes:
  qdrant_data:
Docker Architecture:
====================
┌────────────────────────────────────────────┐
│               Docker Compose               │
│                                            │
│   ┌──────────────┐      ┌──────────────┐   │
│   │    qdrant    │      │     web      │   │
│   │  (Vector DB) │◄─────│   (FastAPI)  │   │
│   │  Port: 6333  │      │  Port: 8110  │   │
│   └──────────────┘      └──────────────┘   │
│                                            │
└──────────┬──────────────────────┬──────────┘
           │                      │
           ▼                      ▼
    localhost:6333       localhost:8110/app
Complete Data Flow - Putting It All Together
Let's see the entire system working together!
Complete System Flow:
=====================
SETUP PHASE (One Time):
───────────────────────
1. Run: docker-compose up -d qdrant
→ Starts Qdrant vector database
2. Run: python indexing.py
→ Scrapes website (BeautifulSoup)
→ Splits text into chunks
→ Creates embeddings (OpenAI)
→ Stores in Qdrant
3. Run: uvicorn server:app --port 8110
→ Starts FastAPI server
RUNTIME PHASE (Every Query):
────────────────────────────
User types: "What amenities do you have?"
                    │
                    ▼
            ┌───────────────┐
            │    Frontend   │
            │  (index.html) │
            └───────┬───────┘
                    │ POST /chat
                    ▼
            ┌───────────────┐
            │    FastAPI    │
            │  (server.py)  │
            └───────┬───────┘
                    │ mainAgent()
                    ▼
            ┌───────────────┐
            │   Main Agent  │
            │   (main.py)   │
            └───────┬───────┘
                    │
        ┌───────────┴───────────┐
        ▼                       ▼
┌───────────────┐       ┌───────────────┐
│     Qdrant    │       │     OpenAI    │
│ Vector Search │       │     GPT-4o    │
└───────┬───────┘       └───────┬───────┘
        │                       │
        │ Relevant chunks       │ AI Response
        └───────────┬───────────┘
                    │
                    ▼
    "We offer a swimming pool,
     fitness center, spa, and
     complimentary breakfast!"
Tech Stack Summary
| Component | Technology | Purpose |
|---|---|---|
| Web Scraping | BeautifulSoup + Requests | Extract website content |
| Text Splitting | LangChain RecursiveCharacterTextSplitter | Break content into chunks |
| Embeddings | OpenAI text-embedding-3-small | Convert text to vectors |
| Vector Database | Qdrant | Store and search vectors |
| LLM | OpenAI GPT-4o | Generate responses |
| Framework | LangChain | Orchestrate RAG pipeline |
| Backend | FastAPI | REST API server |
| Frontend | HTML + Tailwind CSS + JavaScript | Chat interface |
| Container | Docker + Docker Compose | Deployment |
Interview Questions - Quick Fire!
Q: What is RAG?
"RAG stands for Retrieval Augmented Generation. It's a technique that retrieves relevant information from a knowledge base and uses it to augment the AI's context before generating a response. This allows AI to answer questions about specific data without fine-tuning."
Q: Why use vector embeddings?
"Vector embeddings convert text into numerical representations that capture semantic meaning. This allows us to perform similarity searches - finding text that is semantically similar even if the exact words differ. For example, 'room' and 'accommodation' would have similar vectors."
Q: What is Qdrant?
"Qdrant is an open-source vector database optimized for storing and searching high-dimensional vectors. It's designed for AI/ML applications and provides fast similarity search using algorithms like HNSW."
Q: Why split documents into chunks?
"Splitting into chunks improves retrieval accuracy. Large documents have mixed topics, making embeddings less focused. Smaller chunks have specific meanings, so when we search, we get more relevant results. Overlap ensures context isn't lost at boundaries."
Q: What is LangChain?
"LangChain is a framework for building applications with LLMs. It provides tools for connecting LLMs to external data sources, creating agents that can use tools, and building RAG pipelines. It abstracts away complexity of working with embeddings and vector stores."
Q: How does the agent decide which tool to use?
"The LLM analyzes the user query and the tool descriptions. Based on the query intent and the tool's docstring, it decides if calling a tool would help answer the question. This is called function calling or tool calling in modern LLMs."
Key Points to Remember
- RAG = Retrieval + Augmentation + Generation
- Embeddings convert text to vectors that capture meaning
- Vector databases (Qdrant) enable fast similarity search
- Chunking improves retrieval accuracy (chunk_size=1000, overlap=200)
- LangChain orchestrates the entire RAG pipeline
- Tools give AI agents specific capabilities
- FastAPI handles HTTP requests from frontend
- Docker Compose runs all services together
- AI responds as "we" to represent the business
- This architecture is used by modern AI assistants!
Running the Project
Here's how to run this project yourself:
# Step 1: Clone and setup
cd aichat

# Step 2: Configure environment
cp .env.example .env
# Add your OPENAI_API_KEY
# Set BASE_URL to the website you want to index

# Step 3: Start Qdrant
docker-compose up -d qdrant

# Step 4: Index the website
python indexing.py

# Step 5: Start the server
uvicorn server:app --reload --host 0.0.0.0 --port 8110

# Step 6: Open browser
# Go to http://localhost:8110/app
What's Next?
Now that you understand how RAG-powered chatbots work, you can:
- Build chatbots for any business website
- Add more tools (booking, scheduling, etc.)
- Implement streaming responses
- Add conversation memory
- Deploy to production with Docker
Keep coding, keep learning! See you in the next one!