GuardLabs · Technical note

Building a RAG Chatbot from PDF Documents: A Practical Implementation Guide

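Retrieval-Augmented Generation (RAG) improves Large Language Model (LLM) outputs by retrieving relevant passages from an external data source and supplying them to the model before it generates a response. This practical guide walks you through building a RAG pipeline that queries PDF documents using Python, LangChain, and ChromaDB. The vector store runs locally, while embeddings and generation are handled by OpenAI's API.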

The RAG Pipeline Architecture

The system operates in five distinct phases:

Document Loading: Extracting raw text from PDF files.
Chunking: Splitting the text into smaller, overlapping segments so that each piece fits within the model's context window and meaning is not lost at chunk boundaries.
Embedding: Converting text chunks into high-dimensional vector representations.
Vector Storage: Storing embeddings in a database for fast similarity searches.
Retrieval & Generation: Fetching relevant chunks based on user queries and passing them to the LLM as context.
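
To make the chunking, embedding, and retrieval phases concrete, here is a toy walkthrough that uses only the Python standard library. It is a sketch, not how LangChain or ChromaDB work internally: word counts stand in for a learned embedding model, a plain list stands in for the vector database, and the function names (`chunk_text`, `embed`, `cosine`, `retrieve`) are illustrative.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into character windows that share `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a learned model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks whose vectors are most similar to the query vector."""
    index = [(chunk, embed(chunk)) for chunk in chunks]  # toy "vector storage"
    query_vec = embed(query)
    ranked = sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


chunks = [
    "Revenue grew 12 percent in the third quarter.",
    "The office moved to a new building.",
    "Quarter revenue was driven by subscriptions.",
]
top = retrieve("What drove revenue in the quarter?", chunks, k=2)
```

In a real pipeline the retrieved chunks would then be placed into the prompt as context, which is the generation phase.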

Environment Setup

Install the required dependencies using pip. This setup uses LangChain, OpenAI's API, and ChromaDB.

pip install langchain langchain-community langchain-openai chromadb pypdf

Ensure you have your OpenAI API key set in your environment variables:

export OPENAI_API_KEY="your-api-key-here"
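
If the variable is missing, the failure usually surfaces later as an authentication error from the API client. A small guard at the top of a script can report the problem earlier. The helper name `require_api_key` below is hypothetical and not part of LangChain or the OpenAI SDK.

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, or fail with a clear message."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it before running the script.")
    return value
```

Calling `require_api_key()` before building the pipeline is one option; the returned value does not need to be passed anywhere, since the check itself is the point.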

Python Implementation

Below is the complete, executable script to load a PDF, index its contents, and query it using a RAG pipeline.

import os
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import Chroma
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate

def build_rag_system(pdf_path: str, query: str):
    # 1. Load the PDF document
    if not os.path.exists(pdf_path):
        raise FileNotFoundError(f"Could not find PDF at {pdf_path}")
    
    loader = PyPDFLoader(pdf_path)
    docs = loader.load()

    # 2. Chunk the text into manageable pieces
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000, 
        chunk_overlap=200
    )
    splits = text_splitter.split_documents(docs)

    # 3. Embed text chunks and store in a local Chroma vector database
    vectorstore = Chroma.from_documents(
        documents=splits, 
        embedding=OpenAIEmbeddings()
    )
    retriever = vectorstore.as_retriever(search_kwargs={"k": 3})

    # 4. Define the prompt template for the LLM
    system_prompt = (
        "You are an assistant for question-answering tasks. "
        "Use the following pieces of retrieved context to answer the question. "
        "If you do not know the answer, state clearly that you do not know.\n\n"
        "Context:\n{context}"
    )
    prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        ("human", "{input}"),
    ])

    # 5. Set up the LLM and the QA chain
    llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    question_answer_chain = create_stuff_documents_chain(llm, prompt)
    rag_chain = create_retrieval_chain(retriever, question_answer_chain)

    # 6. Execute the query
    response = rag_chain.invoke({"input": query})
    return response["answer"]

if __name__ == "__main__":
    # Replace with your actual PDF file path
    pdf_file = "sample_report.pdf" 
    user_query = "What are the key findings in section 3?"
    
    try:
        answer = build_rag_system(pdf_file, user_query)
        print(f"\nAnswer:\n{answer}")
    except Exception as e:
        print(f"Error: {e}")
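
The script returns only `response["answer"]`, but the dictionary produced by `create_retrieval_chain` also carries the retrieved documents under the `"context"` key, which is useful for showing where an answer came from. The sketch below assumes each retrieved document has a `metadata` dict with `"source"` and a zero-based `"page"`, as `PyPDFLoader` typically records; the helper name `format_sources` is hypothetical, and stand-in objects are used so the example runs without an API key.

```python
from types import SimpleNamespace


def format_sources(context_docs) -> list[str]:
    """Summarize retrieved chunks as 'file, page N' strings, without duplicates."""
    seen = []
    for doc in context_docs:
        source = doc.metadata.get("source", "unknown source")
        page = doc.metadata.get("page")
        # Assumption: "page" is zero-based, so add 1 for display.
        label = f"{source}, page {page + 1}" if isinstance(page, int) else source
        if label not in seen:
            seen.append(label)
    return seen


# Stand-in objects with the same attributes as retrieved documents.
docs = [
    SimpleNamespace(page_content="...", metadata={"source": "sample_report.pdf", "page": 2}),
    SimpleNamespace(page_content="...", metadata={"source": "sample_report.pdf", "page": 2}),
    SimpleNamespace(page_content="...", metadata={"source": "sample_report.pdf", "page": 5}),
]
sources = format_sources(docs)
print(sources)
```

In the script above, the same helper could be called as `format_sources(response["context"])` and returned alongside the answer.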

Key Production Considerations

While the script above provides a working baseline, production deployments require addressing specific architectural challenges:

Parsing Complex PDFs: Standard text extractors like `pypdf` struggle with multi-column layouts, tables, and embedded images. For complex documents, consider specialized parsers like Unstructured or PyMuPDF.
Chunking Strategy: A fixed chunk size of 1000 characters is a starting point. If your document contains dense data, smaller chunks with higher overlap may yield better retrieval precision.
Retrieval Quality: Basic vector search can return irrelevant noise. Implementing a re-ranking step (using models like Cohere Rerank) helps ensure the most contextually relevant chunks are prioritized before reaching the LLM.
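
The re-ranking idea can be sketched as a two-stage step: retrieve more candidates than you need, score each one against the query, and keep only the best few. The scorer below is a deliberately simple word-overlap placeholder, not a real re-ranking model; in practice it would be swapped for a cross-encoder or a hosted re-ranking service. The names `lexical_overlap` and `rerank` are illustrative.

```python
import re
from typing import Callable


def lexical_overlap(query: str, chunk: str) -> float:
    """Placeholder scorer: fraction of query words that appear in the chunk."""
    query_words = set(re.findall(r"[a-z0-9]+", query.lower()))
    chunk_words = set(re.findall(r"[a-z0-9]+", chunk.lower()))
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


def rerank(
    query: str,
    candidates: list[str],
    top_n: int = 3,
    score: Callable[[str, str], float] = lexical_overlap,
) -> list[str]:
    """Sort candidate chunks by score (highest first) and keep the top_n."""
    ordered = sorted(candidates, key=lambda chunk: score(query, chunk), reverse=True)
    return ordered[:top_n]


candidates = [
    "Shipping policy for returns.",
    "Section 3 lists the key findings of the audit.",
    "Key contacts and office hours.",
]
best = rerank("key findings in section 3", candidates, top_n=2)
```

With the script above, this would mean raising `k` in `as_retriever` to fetch a wider candidate set and passing only the re-ranked subset to the LLM.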


Published 2026-06-22
