GuardLabs · Technical note

How to Reduce Hallucinations in a RAG Chatbot

Reducing hallucinations in a Retrieval-Augmented Generation (RAG) pipeline takes systematic work at three stages (data ingestion and retrieval, prompt construction, and post-generation validation), plus a conservative decoding setting. Below is a technical guide to implementing these controls.

1. Optimize Retrieval Quality (The "Garbage In, Garbage Out" Problem)

If the retriever returns irrelevant or incomplete context, the LLM tends to fill the gaps from its own training data, which is where many hallucinations start. Implement these retrieval-stage optimizations:

Semantic Chunking: Replace fixed-size character-count chunking with semantic chunking. Split documents at structural boundaries (e.g., headers) or at points where the embedding similarity between adjacent passages drops, so that each chunk keeps a complete thought together.
Hybrid Search: Combine keyword-based search (BM25) with dense vector search (embeddings). This ensures both exact keyword matches (like product IDs) and conceptual meaning are captured.
Re-ranking: Apply a Cross-Encoder re-ranker (such as Cohere Rerank or BGE-Reranker) to the top retrieved documents. A re-ranker scores each query-document pair jointly, which is slower but more precise than comparing precomputed embeddings. Keep only the highest-scoring chunks, or those above a relevance threshold, before passing them to the LLM.
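
The hybrid search and re-ranking steps can be sketched in plain Python. This is a minimal illustration, not a production retriever: it assumes the BM25 and vector indexes each return a ranked list of document IDs, merges them with reciprocal rank fusion (one common fusion method, chosen here for illustration), and then filters with a re-ranker. The `score_fn` callable, the `top_n` and `min_score` parameters, and the sample data are all hypothetical; `score_fn` stands in for a real cross-encoder call.

```python
from collections import defaultdict
from typing import Callable


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    so documents ranked highly by both retrievers rise to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def rerank(
    query: str,
    docs: dict[str, str],
    candidates: list[str],
    score_fn: Callable[[str, str], float],
    top_n: int = 3,
    min_score: float = 0.5,
) -> list[str]:
    """Re-score fused candidates and drop those below a relevance threshold.

    score_fn is a hypothetical stand-in for a cross-encoder that returns
    a relevance score for one (query, document text) pair.
    """
    scored = [(score_fn(query, docs[doc_id]), doc_id) for doc_id in candidates]
    kept = [(s, d) for s, d in scored if s >= min_score]
    kept.sort(key=lambda pair: -pair[0])  # Stable sort keeps fused order on ties.
    return [d for _, d in kept[:top_n]]


# Illustrative data: two retrievers disagree on ordering.
bm25_hits = ["a", "b", "c"]
dense_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

The constant `k` dampens the influence of top ranks; 60 is a commonly used default, but it is worth tuning against your own evaluation set.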

2. Enforce Strict Prompt Constraints

Configure the system prompt to explicitly restrict the LLM's operational boundaries. The prompt must instruct the model to rely solely on the provided context and define a clear fallback mechanism.

import openai  # Reads the API key from the OPENAI_API_KEY environment variable

FALLBACK = "I cannot answer this based on the provided information."

def generate_rag_response(query: str, retrieved_chunks: list[str]) -> str:
    # With no retrieved context there is nothing to ground an answer in,
    # so return the fallback without calling the model.
    if not retrieved_chunks:
        return FALLBACK

    context = "\n---\n".join(retrieved_chunks)

    system_instruction = (
        "You are a factual assistant. Answer the user's question using ONLY the provided context below. "
        f"If the context does not contain the answer, reply with: '{FALLBACK}' "
        "Do not use external knowledge or make assumptions. Cite the source document names if available."
    )

    user_prompt = f"Context:\n{context}\n\nQuestion: {query}"

    response = openai.chat.completions.create(
        model="gpt-4o-mini",  # Choose a model that follows instructions reliably
        temperature=0.0,      # Reduce sampling randomness
        messages=[
            {"role": "system", "content": system_instruction},
            {"role": "user", "content": user_prompt},
        ],
    )
    # message.content is typed as optional, so fall back explicitly if it is empty.
    return response.choices[0].message.content or FALLBACK
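
The system prompt above asks the model to cite source document names, but the function joins bare strings, so the model has no names to cite. One way to close that gap is to label each chunk before building the context. The sketch below assumes each chunk is a dict with `source` and `text` keys; that shape and the `format_context` helper are hypothetical and not part of the original function.

```python
def format_context(chunks: list[dict[str, str]]) -> str:
    """Render chunks as numbered, labelled blocks so the model can cite them.

    Each chunk is assumed to be a dict with "source" and "text" keys
    (an illustrative shape, not something the original code defines).
    """
    blocks = []
    for i, chunk in enumerate(chunks, start=1):
        source = chunk.get("source", "unknown")
        blocks.append(f"[{i}] Source: {source}\n{chunk['text'].strip()}")
    return "\n---\n".join(blocks)
```

The returned string can replace the plain `"\n---\n".join(...)` context, keeping the same separator.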

3. Implement Post-Generation Guardrails

Do not rely solely on the LLM's self-discipline. Implement programmatic validation layers to check the output before serving it to the user:

NLI (Natural Language Inference) Checks: Use a smaller, specialized model (like a DeBERTa-v3 NLI model) to verify that each claim in the generated response is entailed by the retrieved context chunks. If a claim is classified as a "contradiction" or "neutral" against the context, block or regenerate the response.
Self-Correction (LLM-as-judge): Prompt a secondary LLM (or run a separate agentic loop) to evaluate the generated output against the source context using a binary rubric (e.g., "Does the response contain claims not supported by the context?"). This is simpler than Self-RAG proper, which trains the generator itself to emit self-critique tokens.
Guardrail Frameworks: Integrate open-source tools like NeMo Guardrails or Guardrails AI to programmatically enforce output schemas and factual alignment.
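
The NLI check can be sketched as a sentence-level grounding filter. This is a minimal sketch under stated assumptions: the `nli` callable is a hypothetical wrapper around whichever NLI model you deploy, taking a premise and a hypothesis and returning one of "entailment", "neutral", or "contradiction". The sentence splitter is deliberately naive, and `unsupported_sentences` and `guard` are illustrative names.

```python
import re
from typing import Callable

# Hypothetical signature: nli(premise, hypothesis) returns one of
# "entailment", "neutral", or "contradiction".
NliFn = Callable[[str, str], str]


def split_sentences(text: str) -> list[str]:
    """Naive splitter on ., ! and ?; a real pipeline would use a tokenizer."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def unsupported_sentences(response: str, chunks: list[str], nli: NliFn) -> list[str]:
    """Return the response sentences that no retrieved chunk entails."""
    flagged = []
    for sentence in split_sentences(response):
        if not any(nli(chunk, sentence) == "entailment" for chunk in chunks):
            flagged.append(sentence)
    return flagged


def guard(response: str, chunks: list[str], nli: NliFn, fallback: str) -> str:
    """Serve the response only if every sentence is supported by the context."""
    return fallback if unsupported_sentences(response, chunks, nli) else response
```

Checking sentence by sentence keeps each NLI input short and tells you which claim failed, at the cost of one model call per sentence-chunk pair in the worst case.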

4. Set Temperature to Zero

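For factual Q&A, set the LLM's temperature parameter to 0.0 (or as close to 0 as the API allows). This approximates greedy decoding: the model almost always selects the highest-probability token, which reduces random, creative deviations between runs. It does not guarantee identical outputs on every hosted API, and it cannot fix an answer whose most likely continuation is already wrong, so treat it as a complement to the retrieval, prompt, and validation controls above rather than a replacement for them.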
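
To see why a low temperature behaves like greedy decoding, recall that sampling probabilities are proportional to exp(logit / T). The sketch below applies that formula to three made-up logits; the values and the `softmax_with_temperature` helper are illustrative assumptions, not output from any real model.

```python
import math


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; T = 0 is the argmax limit")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # Subtract the max for numerical stability.
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # Made-up scores for three candidate tokens
for t in (1.0, 0.5, 0.1):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

As the temperature falls, nearly all of the probability mass moves to the top-scoring token, so sampling and picking the maximum become practically the same choice.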

Published 2026-06-23
