Building an AI Chatbot with RAG - Complete Deep Dive

Hey everyone! Welcome back to another tutorial. Today we're going to learn how to build an AI-powered chatbot that can answer questions about ANY business using its website content!

I am super excited about this topic because this is how modern AI assistants work! Companies like Intercom, Drift, and Zendesk use similar technology. Trust me, understanding RAG will make you stand out in interviews!

What we will cover:

  • What is RAG (Retrieval Augmented Generation)?
  • System Architecture Overview
  • Web Scraping with BeautifulSoup
  • Vector Databases and Embeddings
  • Building the RAG Pipeline
  • FastAPI Backend Server
  • Frontend Chat Interface
  • How Everything Works Together
  • Interview Questions

What is RAG (Retrieval Augmented Generation)?

Let's start with the most important question - What is RAG?

Here's the simple definition:

"RAG is a technique that combines information retrieval with AI text generation to provide accurate, context-aware responses."

Wait, what does that mean? Let me break it down for you!

Traditional ChatGPT:
====================
User: "What are your hotel's check-in hours?"
AI: "I don't have specific information about your hotel..."

RAG-powered Chatbot:
====================
User: "What are your hotel's check-in hours?"
AI: *searches hotel website data*
AI: "Our check-in time is 3:00 PM and check-out is 11:00 AM!"

See the difference? RAG gives the AI access to YOUR specific data!

  • Retrieval - Find relevant information from a database
  • Augmented - Add that information to the AI's context
  • Generation - Generate a response using both AI knowledge + your data
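
Here's the whole idea in a few lines of Python - a minimal sketch where rag_answer, build_prompt, vector_store, and llm are illustrative placeholders (we build the real versions later in this tutorial):

# Minimal RAG sketch - the three steps in order
# (vector_store, build_prompt, and llm are placeholders we build later)
def rag_answer(question: str) -> str:
    docs = vector_store.similarity_search(question, k=5)  # 1. Retrieval
    prompt = build_prompt(question, docs)                 # 2. Augmentation
    return llm.invoke(prompt).content                     # 3. Generation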

Why is RAG important?

  • AI can answer questions about YOUR specific business
  • No need to fine-tune expensive models
  • Data stays up-to-date (just re-scrape!)
  • Reduces AI hallucinations

System Architecture Overview

Before diving into code, let's understand the big picture!

AI Chatbot Architecture:
========================

                    ┌─────────────────┐
                    │   User Browser  │
                    │  (Chat Widget)  │
                    └────────┬────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │  FastAPI Server │
                    │   (server.py)   │
                    └────────┬────────┘
                             │
                             ▼
                    ┌─────────────────┐
                    │   Main Agent    │
                    │   (main.py)     │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
     ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
     │    Qdrant    │ │   OpenAI     │ │   OpenAI     │
     │ Vector Store │ │  Embeddings  │ │   GPT-4o     │
     └──────────────┘ └──────────────┘ └──────────────┘

The system has 3 main phases:

  1. Indexing Phase - Scrape website → Create embeddings → Store in Qdrant
  2. Query Phase - User asks question → Convert to vector → Find relevant chunks in Qdrant
  3. Response Phase - AI generates the answer using the retrieved context

Phase 1: Web Scraping with BeautifulSoup

First, we need to scrape the business website to get all the content!

Let's look at the indexing.py file:

# indexing.py

from bs4 import BeautifulSoup
import requests
from urllib.parse import urljoin, urlparse

def scrape_website(base_url, max_depth=2):
    visited = set()
    documents = []
    headers = {"User-Agent": "aichat-scraper/1.0"}

    def scrape_page(url, depth):
        if depth > max_depth or url in visited:
            return
        visited.add(url)

        try:
            response = requests.get(url, headers=headers, timeout=10)
            response.raise_for_status()  # skip pages that return errors
            soup = BeautifulSoup(response.content, 'html.parser')

            # Remove script and style tags
            for script in soup(["script", "style"]):
                script.decompose()

            # Extract clean text
            text = soup.get_text(separator=' ', strip=True)

            if text:
                documents.append({
                    "page_content": text,
                    "metadata": {"source": url}
                })

            # Find and follow internal links
            if depth < max_depth:
                for link in soup.find_all('a', href=True):
                    href = link['href']
                    full_url = urljoin(url, href).split('#')[0]  # drop #fragments to avoid re-visits
                    # Only follow same-domain links
                    if urlparse(full_url).netloc == urlparse(base_url).netloc:
                        scrape_page(full_url, depth + 1)

        except Exception as e:
            print(f"Error scraping {url}: {e}")

    scrape_page(base_url, 0)
    return documents

How does web scraping work?

Web Scraping Flow:
==================

1. Start at base URL (e.g., https://myhotel.com/)
   │
2. Download HTML content using requests
   │
3. Parse HTML with BeautifulSoup
   │
4. Remove <script> and <style> tags (noise!)
   │
5. Extract clean text content
   │
6. Find all <a href="..."> links
   │
7. Follow internal links (same domain only)
   │
8. Repeat until max_depth reached!

Key Points:

  • max_depth=2 - How many levels deep to crawl (home → subpage → sub-subpage)
  • visited set - Prevents visiting same page twice
  • User-Agent header - Identifies our scraper to the website
  • Same domain check - Don't follow external links!
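
To connect the scraper to the indexing step that comes next, here's a small hedged sketch of a driver that saves the scraped pages to website_documents.json (the file name indexing.py expects; the URL is just our example hotel):

# Sketch: run the scraper and save its output for the indexing step
import json

docs = scrape_website("https://myhotel.com/", max_depth=2)
with open("website_documents.json", "w") as f:
    json.dump(docs, f, indent=2)
print(f"Scraped {len(docs)} pages")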

Phase 2: Vector Embeddings and Qdrant

Now comes the magic part - converting text into vectors!

What are Embeddings?

Embeddings convert text into numbers (vectors) that capture meaning!

Text to Embedding:
==================

"Our hotel has a swimming pool"
        │
        ▼ OpenAI Embedding Model
        │
[0.023, -0.156, 0.892, 0.445, ... 1536 numbers]

"We offer a pool for guests"
        │
        ▼
[0.021, -0.148, 0.887, 0.451, ... 1536 numbers]

These vectors are SIMILAR because meanings are similar!

Why vectors? Because we can do similarity search!

# Using OpenAI Embeddings
from langchain_openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings(model="text-embedding-3-small")

# This model converts text → 1536-dimensional vector
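
You can check both the dimensionality and the similarity yourself. A quick sketch (assumes OPENAI_API_KEY is set in your environment; numpy is used only for the cosine math):

# Sketch: embed two similar sentences and measure cosine similarity
import numpy as np
from langchain_openai import OpenAIEmbeddings

embedding = OpenAIEmbeddings(model="text-embedding-3-small")
v1 = embedding.embed_query("Our hotel has a swimming pool")
v2 = embedding.embed_query("We offer a pool for guests")

cosine = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
print(len(v1))    # 1536 dimensions
print(cosine)     # close to 1.0 because the meanings are similar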

What is Qdrant?

Qdrant is a Vector Database - it stores and searches vectors efficiently!

Vector Database (Qdrant):
=========================

┌──────────────────────────────────────────────────┐
│  Collection: "website_collection"                │
├──────────────────────────────────────────────────┤
│  ID   │  Vector (1536 dims)  │  Metadata         │
├───────┼──────────────────────┼───────────────────┤
│  1    │  [0.02, -0.15, ...]  │  {url: "/about"}  │
│  2    │  [0.89, 0.23, ...]   │  {url: "/rooms"}  │
│  3    │  [-0.12, 0.67, ...]  │  {url: "/contact"}│
└──────────────────────────────────────────────────┘

When user asks: "Do you have rooms available?"
→ Convert question to vector
→ Find similar vectors in database
→ Return matching content!
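
We'll use LangChain for this search later, but here's roughly what it looks like against the raw client - a hedged sketch with the qdrant-client library (the payload layout is how langchain-qdrant stores chunks, which can vary by version; embedding is the OpenAIEmbeddings object from above):

# Sketch: similarity search with the raw Qdrant client
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
query_vector = embedding.embed_query("Do you have rooms available?")

hits = client.search(
    collection_name="website_collection",
    query_vector=query_vector,
    limit=3,
)
for hit in hits:
    print(hit.score, hit.payload)  # payload holds the chunk text + metadata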

Here's how we index the website content:

# indexing.py

import json

from langchain_core.documents import Document
from langchain_qdrant import QdrantVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter

def index_website():
    # Load scraped documents
    with open("website_documents.json", "r") as f:
        documents_json = json.load(f)

    # Convert to LangChain Document objects
    documents = [
        Document(page_content=doc["page_content"], metadata=doc["metadata"])
        for doc in documents_json
    ]

    # Split into smaller chunks (important!)
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,      # Max 1000 characters per chunk
        chunk_overlap=200     # 200 char overlap between chunks
    )
    texts = splitter.split_documents(documents)

    # Store in Qdrant with embeddings
    vector_store = QdrantVectorStore.from_documents(
        texts,
        embedding=embedding,
        collection_name="website_collection",
        url="http://localhost:6333"
    )

Why do we split text into chunks?

Why Chunking is Important:
==========================

Original page: 5000 characters
"Welcome to our hotel... [lots of content] ...Contact us!"

Problem: Too big! Embedding loses detail.

Solution: Split into chunks!

Chunk 1: "Welcome to our hotel. We offer luxury rooms..."
Chunk 2: "Our amenities include swimming pool, gym..."
Chunk 3: "Room types: Standard, Deluxe, Suite..."
Chunk 4: "Contact us at email@hotel.com..."

Now each chunk has FOCUSED meaning!
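
You can watch the splitter do this with a synthetic page - a quick sketch using the same settings as indexing.py (the repeated sentence just stands in for real content):

# Sketch: chunking a long page with the same settings as indexing.py
from langchain_text_splitters import RecursiveCharacterTextSplitter

long_page = "Welcome to our hotel. We offer luxury rooms and more. " * 100  # ~5500 chars
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(long_page)

print(len(chunks))     # several focused chunks
print(len(chunks[0]))  # each chunk is at most 1000 characters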

Phase 3: The RAG Query Pipeline

Now let's see how the chatbot answers questions!

Look at the main.py file:

# main.py

def query_website(query: str) -> str:
    # Connect to existing Qdrant collection
    vector_store = QdrantVectorStore.from_existing_collection(
        embedding=embedding,
        collection_name="website_collection",
        url=QDRANT_URL
    )

    # Find top 10 most similar chunks!
    relevant_docs = vector_store.similarity_search(query, k=10)

    # Combine all relevant content
    combined_content = "\n".join([doc.page_content for doc in relevant_docs])

    return combined_content

How similarity search works:

User Query: "What rooms do you have?"
                │
                ▼
        Convert to vector
        [0.45, -0.23, 0.89, ...]
                │
                ▼
    Compare with ALL vectors in Qdrant
                │
                ▼
    Return top 10 most similar chunks:

    1. "Room types: Standard, Deluxe, Suite..." (0.92 similarity)
    2. "Our Deluxe rooms feature king beds..." (0.89 similarity)
    3. "Suite amenities include jacuzzi..." (0.85 similarity)
    ...
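
A note on those similarity numbers: similarity_search returns only the documents. If you also want the scores, LangChain has a scored variant - a small sketch reusing the vector_store from query_website:

# Sketch: retrieving chunks together with their similarity scores
results = vector_store.similarity_search_with_score("What rooms do you have?", k=10)
for doc, score in results:
    print(f"{score:.2f}  {doc.page_content[:50]}...")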

Now we pass this context to GPT-4o:

# main.py

def get_business_info(query: str) -> str:
    # Get relevant content from vector store
    content = query_website(query)

    llm = ChatOpenAI(model="gpt-4o", temperature=0.8)

    prompt = f"""
You are responding on behalf of {business_name} team.
Always answer as "we" or "our" (the staff), not as a third-party.
If information is not available, politely say so.

Website Content:
{content}

User Query:
{query}

Your answer (as the business team):
"""

    response = llm.invoke(prompt)
    return response.content
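
Once indexing has run, you can call this function directly (business_name is assumed to be configured elsewhere, e.g. via an environment variable):

# Example call, assuming the collection has already been indexed
print(get_business_info("Do you have a gym?"))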

This is the complete RAG flow:

Complete RAG Pipeline:
======================

User: "Do you have a gym?"
        │
        ▼
┌─────────────────────────────┐
│ 1. RETRIEVE                 │
│    - Convert query → vector │
│    - Search Qdrant          │
│    - Get relevant chunks    │
└─────────────────────────────┘
        │
        ▼
┌─────────────────────────────┐
│ 2. AUGMENT                  │
│    - Create prompt          │
│    - Add retrieved content  │
│    - Add instructions       │
└─────────────────────────────┘
        │
        ▼
┌─────────────────────────────┐
│ 3. GENERATE                 │
│    - Send to GPT-4o         │
│    - Get AI response        │
│    - Return to user         │
└─────────────────────────────┘
        │
        ▼
AI: "Yes! We have a fully equipped fitness center
     on the 3rd floor, open 24/7 for our guests!"

LangChain Tools and Agents

The code also uses LangChain Tools - let's understand them!

# main.py

from langchain.tools import tool

@tool
def business_info_tool(query: str) -> str:
    """
    Tool to get website related information
    """
    return get_business_info(query)

What is a Tool?

A Tool is a function that AI can call when it needs specific capabilities!

LangChain Agent with Tools:
===========================

┌──────────────────────────────────────────────────┐
│                     AI AGENT                     │
│                                                  │
│  "Hmm, user is asking about hotel services..."   │
│  "I should use the business_info_tool!"          │
│                                                  │
│  ┌────────────────────────────────────────────┐  │
│  │ Available Tools:                           │  │
│  │                                            │  │
│  │ 🔧 business_info_tool - Get website info    │  │
│  │ 🔧 book_service - Book a service            │  │
│  └────────────────────────────────────────────┘  │
└──────────────────────────────────────────────────┘

The agent decides which tool to use based on the question!

# main.py

def mainAgent(query: str, messages: list) -> str:
    llm = ChatOpenAI(model="gpt-4o", temperature=0)

    # Bind tools to the LLM
    tool_calling_llm = llm.bind_tools([business_info_tool])

    # LLM decides if it needs to call a tool
    ai_message = tool_calling_llm.invoke(messages + [("user", query)])

    # If tool was called, execute it
    if ai_message.tool_calls:
        for tool_call in ai_message.tool_calls:
            if tool_call.get("name") == "business_info_tool":
                result = business_info_tool.invoke(tool_call.get("args"))
                return result

    return ai_message.content
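
One note: mainAgent returns the tool's raw output directly. In a fuller agent loop you would usually feed the tool result back to the LLM so it can phrase the final reply itself. A hedged sketch of that pattern (main_agent_v2 is a hypothetical variant, not part of the project code):

# Sketch: a fuller agent loop that lets the LLM phrase the final answer
from langchain_core.messages import HumanMessage, ToolMessage
from langchain_openai import ChatOpenAI

def main_agent_v2(query: str) -> str:
    llm = ChatOpenAI(model="gpt-4o", temperature=0)
    tool_calling_llm = llm.bind_tools([business_info_tool])

    messages = [HumanMessage(content=query)]
    ai_message = tool_calling_llm.invoke(messages)

    if not ai_message.tool_calls:
        return ai_message.content

    # Record the tool call and its result, then ask the LLM to answer
    messages.append(ai_message)
    for tool_call in ai_message.tool_calls:
        result = business_info_tool.invoke(tool_call["args"])
        messages.append(ToolMessage(content=result, tool_call_id=tool_call["id"]))

    return tool_calling_llm.invoke(messages).content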

FastAPI Backend Server

The backend uses FastAPI to handle HTTP requests!

# server.py

from fastapi import FastAPI, Request
from fastapi.responses import FileResponse
from main import mainAgent

app = FastAPI()

# Serve the chat interface
@app.get("/app")
async def get_app():
    return FileResponse("index.html")

# Chat endpoint - receives user messages
@app.post("/chat")
async def chat_endpoint(request: Request):
    data = await request.json()
    query = data.get("query", "")

    # Process with AI agent
    response = mainAgent(query, [])

    return {"response": response}

API Flow:

Frontend → Backend Flow:
========================

Browser                          Server
   │                               │
   │  POST /chat                   │
   │  {"query": "What rooms?"}     │
   │ ─────────────────────────────→│
   │                               │  → mainAgent()
   │                               │  → query_website()
   │                               │  → GPT-4 response
   │  {"response": "We offer..."}  │
   │ ←─────────────────────────────│
   │                               │
   Display response                │

Frontend Chat Interface

The frontend is a simple HTML + JavaScript chat widget!

// index.html - Key parts

// Send message to backend
form.addEventListener("submit", async (e) => {
    e.preventDefault();
    const userMsg = input.value.trim();

    // Show user message in chat
    addMessage("user", userMsg);

    // Call backend API
    const response = await fetch("http://localhost:8110/chat", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ query: userMsg })
    });

    const result = await response.json();

    // Show AI response (with Markdown support!)
    addMessage("assistant", marked.parse(result.response));
});

The UI uses Tailwind CSS for styling and marked.js for Markdown rendering!

Docker Setup

The project uses Docker Compose to run everything together!

# docker-compose.yml

services:
  qdrant:
    image: qdrant/qdrant:latest
    ports:
      - "6333:6333"   # REST API
    volumes:
      - qdrant_data:/qdrant/storage

  web:
    build: .
    ports:
      - "8110:8000"
    depends_on:
      - qdrant
    command: sh -c "python indexing.py && python server.py"

Docker Architecture:
====================

┌────────────────────────────────────────────┐
│               Docker Compose               │
│                                            │
│  ┌──────────────┐    ┌──────────────┐      │
│  │   qdrant     │    │     web      │      │
│  │  (Vector DB) │◄───│  (FastAPI)   │      │
│  │  Port: 6333  │    │  Port: 8110  │      │
│  └──────────────┘    └──────────────┘      │
│                                            │
└────────────────────────────────────────────┘
         │                    │
         ▼                    ▼
    localhost:6333      localhost:8110/app
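
Before indexing, it helps to confirm Qdrant is actually reachable. A tiny hedged check against its REST API (GET /collections lists the existing collections):

# Sketch: readiness check for Qdrant before running indexing.py
import requests

resp = requests.get("http://localhost:6333/collections", timeout=5)
resp.raise_for_status()
print("Qdrant is up:", resp.json())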

Complete Data Flow - Putting It All Together

Let's see the entire system working together!

Complete System Flow:
=====================

SETUP PHASE (One Time):
───────────────────────
1. Run: docker-compose up -d qdrant
   → Starts Qdrant vector database

2. Run: python indexing.py
   → Scrapes website (BeautifulSoup)
   → Splits text into chunks
   → Creates embeddings (OpenAI)
   → Stores in Qdrant

3. Run: uvicorn server:app --port 8110
   → Starts FastAPI server


RUNTIME PHASE (Every Query):
────────────────────────────
User types: "What amenities do you have?"
                    │
                    ▼
            ┌───────────────┐
            │   Frontend    │
            │  (index.html) │
            └───────┬───────┘
                    │ POST /chat
                    ▼
            ┌───────────────┐
            │   FastAPI     │
            │  (server.py)  │
            └───────┬───────┘
                    │ mainAgent()
                    ▼
            ┌───────────────┐
            │  Main Agent   │
            │  (main.py)    │
            └───────┬───────┘
                    │
        ┌───────────┴───────────┐
        ▼                       ▼
┌───────────────┐       ┌───────────────┐
│    Qdrant     │       │   OpenAI      │
│ Vector Search │       │    GPT-4      │
└───────┬───────┘       └───────┬───────┘
        │                       │
        │ Relevant chunks       │ AI Response
        └───────────┬───────────┘
                    │
                    ▼
            "We offer a swimming pool,
             fitness center, spa, and
             complimentary breakfast!"

Tech Stack Summary

Component        | Technology                                 | Purpose
-----------------|--------------------------------------------|----------------------------
Web Scraping     | BeautifulSoup + Requests                   | Extract website content
Text Splitting   | LangChain RecursiveCharacterTextSplitter   | Break content into chunks
Embeddings       | OpenAI text-embedding-3-small              | Convert text to vectors
Vector Database  | Qdrant                                     | Store and search vectors
LLM              | OpenAI GPT-4o                              | Generate responses
Framework        | LangChain                                  | Orchestrate the RAG pipeline
Backend          | FastAPI                                    | REST API server
Frontend         | HTML + Tailwind CSS + JavaScript           | Chat interface
Container        | Docker + Docker Compose                    | Deployment

Interview Questions - Quick Fire!

Q: What is RAG?

"RAG stands for Retrieval Augmented Generation. It's a technique that retrieves relevant information from a knowledge base and uses it to augment the AI's context before generating a response. This allows AI to answer questions about specific data without fine-tuning."

Q: Why use vector embeddings?

"Vector embeddings convert text into numerical representations that capture semantic meaning. This allows us to perform similarity searches - finding text that is semantically similar even if the exact words differ. For example, 'room' and 'accommodation' would have similar vectors."

Q: What is Qdrant?

"Qdrant is an open-source vector database optimized for storing and searching high-dimensional vectors. It's designed for AI/ML applications and provides fast similarity search using algorithms like HNSW."

Q: Why split documents into chunks?

"Splitting into chunks improves retrieval accuracy. Large documents have mixed topics, making embeddings less focused. Smaller chunks have specific meanings, so when we search, we get more relevant results. Overlap ensures context isn't lost at boundaries."

Q: What is LangChain?

"LangChain is a framework for building applications with LLMs. It provides tools for connecting LLMs to external data sources, creating agents that can use tools, and building RAG pipelines. It abstracts away complexity of working with embeddings and vector stores."

Q: How does the agent decide which tool to use?

"The LLM analyzes the user query and the tool descriptions. Based on the query intent and the tool's docstring, it decides if calling a tool would help answer the question. This is called function calling or tool calling in modern LLMs."

Key Points to Remember

  • RAG = Retrieval + Augmentation + Generation
  • Embeddings convert text to vectors that capture meaning
  • Vector databases (Qdrant) enable fast similarity search
  • Chunking improves retrieval accuracy (chunk_size=1000, overlap=200)
  • LangChain orchestrates the entire RAG pipeline
  • Tools give AI agents specific capabilities
  • FastAPI handles HTTP requests from frontend
  • Docker Compose runs all services together
  • AI responds as "we" to represent the business
  • This architecture is used by modern AI assistants!

Running the Project

Here's how to run this project yourself:

# Step 1: Clone and setup
cd aichat

# Step 2: Configure environment
cp .env.example .env
# Add your OPENAI_API_KEY
# Set BASE_URL to the website you want to index

# Step 3: Start Qdrant
docker-compose up -d qdrant

# Step 4: Index the website
python indexing.py

# Step 5: Start the server
uvicorn server:app --reload --host 0.0.0.0 --port 8110

# Step 6: Open browser
# Go to http://localhost:8110/app

What's Next?

Now that you understand how RAG-powered chatbots work, you can:

  • Build chatbots for any business website
  • Add more tools (booking, scheduling, etc.)
  • Implement streaming responses
  • Add conversation memory
  • Deploy to production with Docker

Keep coding, keep learning! See you in the next one!