GuardLabs · Technical note

Building a Customer Support AI Chatbot from a Knowledge Base

To build a customer support chatbot that answers questions using your company's proprietary knowledge base, use a Retrieval-Augmented Generation (RAG) architecture. This approach retrieves relevant documents from your database first, then feeds them to a Large Language Model (LLM) as context to generate an accurate, grounded response.

Architecture Overview

The system consists of three main components:

Knowledge Base: Your raw support documents, FAQs, or markdown files.
Vector Database: A database (like ChromaDB or Pinecone) that stores vector embeddings of your document chunks to enable semantic search.
LLM (OpenAI GPT): The generative model that synthesizes the retrieved context into a natural, conversational support response.
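
To make the semantic search step concrete, here is a minimal sketch of what a vector database does conceptually: it compares a query vector with stored vectors and returns the closest one. The three-dimensional vectors and the names `toy_index` and `nearest` are invented for illustration. Real embeddings have hundreds or thousands of dimensions, and ChromaDB or Pinecone use optimized indexes rather than a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)


def nearest(query_vec, index):
    """Return (doc_id, score) for the stored vector most similar to the query."""
    scored = ((doc_id, cosine_similarity(query_vec, vec)) for doc_id, vec in index.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical toy embeddings, hand-written for this example only
toy_index = {
    "refunds": [0.9, 0.1, 0.0],
    "hours": [0.1, 0.9, 0.0],
    "pricing": [0.0, 0.2, 0.9],
}

best_id, score = nearest([0.8, 0.2, 0.1], toy_index)
print(best_id, round(score, 3))
```

The query vector points mostly in the same direction as the "refunds" vector, so that entry is returned as the best match.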

Step 1: Environment Setup

Install the required Python libraries. We will use chromadb as our vector database and openai for embeddings and text generation.

pip install openai chromadb

Step 2: Prepare and Index the Knowledge Base

This script reads your support documents, generates vector embeddings using OpenAI's text-embedding-3-small model, and stores them in a local ChromaDB collection.

import chromadb
from openai import OpenAI

# Initialize clients
# OpenAI() reads the API key from the OPENAI_API_KEY environment variable,
# which keeps the secret out of source control.
client = OpenAI()
chroma_client = chromadb.PersistentClient(path="./chroma_db")
collection = chroma_client.get_or_create_collection(name="kb_collection")

# Sample knowledge base data
knowledge_base = [
    {"id": "kb_01", "text": "To request a refund, navigate to Settings > Billing and click 'Request Refund'. Refunds take 5-10 business days to process."},
    {"id": "kb_02", "text": "Our support hours are Monday through Friday, 9:00 AM to 5:00 PM EST. We are closed on weekends and major holidays."},
    {"id": "kb_03", "text": "The Basic plan costs $19/month. The Pro plan costs $49/month and includes API access and priority support."}
]

def get_embedding(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

# Index documents (upsert keeps the script safe to re-run with the same IDs)
for doc in knowledge_base:
    embedding = get_embedding(doc["text"])
    collection.upsert(
        embeddings=[embedding],
        documents=[doc["text"]],
        ids=[doc["id"]]
    )
print("Knowledge base indexed successfully.")
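
The indexing loop above calls the embeddings endpoint once per document. For a larger knowledge base, one request can carry a list of texts instead. The helper below is a sketch under that assumption: `embed_batch` is a hypothetical name, and the client is passed in as a parameter so the function can be tried without network access. It sorts the returned items by their `index` field so the vectors line up with the input order.

```python
def embed_batch(client, texts, model="text-embedding-3-small"):
    """Embed several texts in one request and return vectors in input order.

    `client` is expected to expose `embeddings.create(model=..., input=...)`,
    as the OpenAI client used in this note does.
    """
    response = client.embeddings.create(model=model, input=texts)
    # Each returned item carries the position of its input text
    ordered = sorted(response.data, key=lambda item: item.index)
    return [item.embedding for item in ordered]


# Usage with the objects defined in the indexing script (not executed here):
# texts = [doc["text"] for doc in knowledge_base]
# vectors = embed_batch(client, texts)
# collection.upsert(
#     embeddings=vectors,
#     documents=texts,
#     ids=[doc["id"] for doc in knowledge_base],
# )
```

Very large knowledge bases would still need to be split into several requests, since the API limits how much input one call can carry.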

Step 3: Build the Retrieval and Generation Pipeline

This function processes user queries, retrieves the most relevant document chunk from the vector database, and uses GPT-4o-mini to draft the final response based strictly on that context.

def query_chatbot(user_query):
    # 1. Generate embedding for the user query
    query_vector = get_embedding(user_query)
    
    # 2. Query the vector database for the closest match
    results = collection.query(
        query_embeddings=[query_vector],
        n_results=1
    )
    
    # Extract the retrieved context
    if results['documents'] and len(results['documents'][0]) > 0:
        context = results['documents'][0][0]
    else:
        context = "No relevant information found."

    # 3. Construct the prompt with system instructions and retrieved context
    system_prompt = (
        "You are a precise customer support assistant. Answer the user's question using ONLY the provided context. "
        "If the context does not contain the answer, say 'I am sorry, but I do not have that information in my knowledge base.' "
        "Do not make up or assume any facts.\n\n"
        f"Context:\n{context}"
    )

    # 4. Generate response from the LLM
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_query}
        ],
        temperature=0.0  # Kept at 0.0 to minimize creative hallucinations
    )
    
    return response.choices[0].message.content

# Example Execution
user_question = "How long does a refund take?"
answer = query_chatbot(user_question)
print(f"Q: {user_question}\nA: {answer}")
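
The pipeline above always passes the single closest chunk to the model, even when that chunk is only loosely related to the question. One possible refinement is to request several results (for example `n_results=3`) and keep only those under a distance cutoff. The sketch below assumes the results dictionary has the nested-list shape that `collection.query` returns for one query, with `documents` and `distances` keys. The name `build_context` and the `max_distance` value are hypothetical, and a suitable cutoff depends on the embedding model and distance metric, so it would need tuning on real queries.

```python
def build_context(results, max_distance=1.0):
    """Join retrieved chunks whose distance is within the cutoff.

    Returns None when no chunk qualifies, so the caller can answer with a
    fallback message instead of calling the LLM.
    """
    documents = results.get("documents") or [[]]
    distances = results.get("distances") or [[]]
    kept = [
        doc
        for doc, dist in zip(documents[0], distances[0])
        if dist <= max_distance  # Smaller distance means a closer match
    ]
    return "\n---\n".join(kept) if kept else None


# Hand-written example in the assumed result shape (one query, three hits)
sample_results = {
    "documents": [["Refunds take 5-10 business days.", "Support hours are 9-5 EST.", "Pro costs $49/month."]],
    "distances": [[0.35, 0.92, 1.40]],
}
print(build_context(sample_results))
```

With the sample values, the first two chunks pass the cutoff and the third is dropped.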

Limitations and Production Considerations

Hallucinations: A temperature of 0.0 and a strict system prompt reduce, but do not eliminate, the risk of the model generating incorrect information. Retrieval quality matters as well: if the wrong chunk is retrieved, the model has nothing correct to ground its answer in.
Context Window Limits: If your documents are large, you must implement a text-splitting strategy (e.g., recursive character splitting) to chunk documents into manageable sizes (e.g., 500-character segments) before indexing.
Data Privacy: Sending queries and document chunks to OpenAI transmits data to external servers. If you handle highly sensitive or regulated data, consider self-hosting an open-source model (like Llama 3) and running embeddings locally.
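
The text-splitting step mentioned under Context Window Limits can be sketched as follows. This is a simple fixed-size splitter with overlap, not the recursive character splitting named above, which also tries to break on paragraph and sentence boundaries. The function name `chunk_text` and the 50-character overlap are assumptions chosen for illustration.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so a sentence cut at a
    boundary still appears whole in at least one chunk more often.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # The last chunk already reaches the end of the text
    return chunks


# Example: a 1200-character document
sample_text = "".join(str(i % 10) for i in range(1200))
pieces = chunk_text(sample_text)
print([len(piece) for piece in pieces])
```

Each chunk would then be embedded and stored with its own ID (for example the document ID plus a chunk index) in place of the whole document.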

Published 2026-06-23