October 15, 2024 · 10-minute read · by Juan Figueroa

From detecting dogs in images to enabling self-driving cars, AI's possibilities are exhilarating—it can write like a human, make intelligent decisions, and automate complex tasks. However, alongside these breakthroughs, businesses face a significant challenge: hallucination. This occurs when AI produces information that sounds convincing but is actually incorrect or fabricated. For companies, the consequences can be severe, leading to poor decision-making, damaged reputations, or even financial losses.

Understanding AI Hallucination

Hallucination in AI, especially with large language models (LLMs), happens when these models generate information that sounds correct but is actually wrong or nonsensical. Imagine an AI confidently stating that “Paris is the capital of Italy”—it sounds right at first glance, but it’s completely off. This can be problematic for businesses, where relying on incorrect information can harm decision-making and trust.

Hallucination isn’t just a bug—it’s part of the limitations of current AI technology. AI models don’t always have access to real-time data or complete knowledge, and when asked a difficult question, they sometimes “guess.” As highlighted in recent research, hallucination is actually an inevitable outcome:

“LLMs cannot learn all of the computable functions and will therefore always hallucinate when faced with complex or unseen information.”

The issue stems from limitations in training data, model architecture, and the complexity of real-world tasks. For example, if a model isn’t trained on specific details, it fills the gaps with confident but incorrect answers, creating a hallucination.

Consider Air Canada's customer service chatbot, which told a passenger about a bereavement refund policy that didn't actually exist; a tribunal later ordered the airline to honor the fabricated policy. Or take the case of the lawyers who used ChatGPT for legal research and ended up citing fake cases in a federal court filing, which led to court sanctions. In both instances, the AI confidently generated information that sounded correct but was completely fabricated, showing how damaging AI hallucinations can be if left unchecked.

Grounding AI with Retrieval-Augmented Generation (RAG)

Retrieval-augmented generation (RAG) is a technique that enhances how AI models generate text. Instead of solely relying on the data the model has been trained on (its “memory”), RAG allows the model to access external information during the generation process. This external data can come from sources like databases, documents, or knowledge bases, helping the model produce more accurate, fact-based outputs.

Figure: Visual representation of RAG, working on a report with a room full of experts.

Picture this: You’re writing an important report. You’ve got your own knowledge, but you also have access to a room full of experts, each of whom has access to a huge library. Every time you need to double-check a fact, you ask one of these experts to pull up exactly what you need. This keeps your report not only well-written but also rock-solid when it comes to accuracy.

That’s how RAG works. It’s like giving AI access to a team of experts who can instantly fetch the most relevant, up-to-date information. So, instead of relying solely on what it knows, the AI is backed by real, verified facts from external sources.
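To make the idea concrete, here's a minimal sketch of the retrieval-and-grounding step in Python. The in-memory knowledge base, the keyword-overlap retriever, and the policy snippets are all illustrative placeholders; a real pipeline would use a vector store and your LLM provider's client, but the shape of the loop is the same: retrieve first, then generate from what was retrieved.

```python
# Minimal RAG sketch (toy example): retrieve supporting passages from a small
# in-memory "knowledge base", then build a grounded prompt for the LLM.
# The knowledge base, retriever, and policy text are illustrative placeholders.

KNOWLEDGE_BASE = [
    "Our refund policy allows cancellations up to 24 hours after booking.",
    "Bereavement fares must be requested before travel, not retroactively.",
    "Checked baggage fees are waived for Gold-tier loyalty members.",
]

def retrieve(query: str, k: int = 2) -> list[str]:
    """Toy retriever: rank passages by keyword overlap with the query."""
    query_terms = set(query.lower().split())
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda passage: len(query_terms & set(passage.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_grounded_prompt(question: str) -> str:
    """Assemble a prompt that tells the model to answer only from retrieved facts."""
    context = "\n".join(f"- {p}" for p in retrieve(question))
    return (
        "Answer using ONLY the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

# The grounded prompt would then be sent to your LLM of choice:
print(build_grounded_prompt("Can I request a bereavement fare after my trip?"))
```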

Now, here’s why this matters:

  • Accuracy You Can Trust: RAG helps prevent those cringe-worthy AI hallucinations—when AI spits out something that sounds good but is totally wrong. By pulling in real data, RAG keeps the AI grounded, so the answers you get are fact-based and trustworthy.
  • Staying Current: The world changes fast, and AI models don’t always know the latest info. But with RAG, that’s not a problem. It can “look up” new facts in real-time, so it’s always up to date, no matter how fast things are changing.
  • Building Confidence: For businesses, trust is everything. If your AI gives out reliable, fact-checked information, it helps build confidence in the technology. You’re not just getting creative answers—you’re getting answers you can act on without second-guessing.

Some of the most popular LLM providers have guardrails built in. These features include fact-checking, input filtering, and response validation that automatically flag or block outputs that stray from factual data. During model development, additional techniques such as reinforcement learning and early stopping can be integrated to further minimize errors and enhance the quality of generated responses.

For businesses, it’s important to go beyond these default settings and implement custom guardrails tailored to their specific needs. This could include adding rule-based constraints, domain-specific checks, or even third-party tools that monitor outputs in real time. Continuous evaluation is key—regularly updating and fine-tuning these guardrails ensures your AI stays accurate and aligned with business goals.
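As one illustration of what a custom, rule-based guardrail can look like, the sketch below blocks chatbot replies that mention a policy not found in an approved allowlist. The policy names and helper functions are hypothetical; the point is simply that a cheap, deterministic check can sit between the model and the customer.

```python
# Hypothetical rule-based guardrail: block chatbot replies that mention a
# "policy" without naming one from an approved, domain-specific allowlist.

APPROVED_POLICIES = {"24-hour cancellation", "gold-tier baggage waiver"}

def violates_policy_guardrail(response: str) -> bool:
    """Return True if the reply makes a policy claim we cannot verify."""
    text = response.lower()
    if "policy" not in text:
        return False  # no policy claim to check
    return not any(policy in text for policy in APPROVED_POLICIES)

def deliver(response: str) -> str:
    """Route flagged outputs to a safe fallback instead of the customer."""
    if violates_policy_guardrail(response):
        return "I'm not certain about that policy. Let me connect you with an agent."
    return response

print(deliver("Our retroactive bereavement policy lets you claim a refund after travel."))
```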

Measuring Hallucination

How can we actually tell when AI is hallucinating and how often it’s happening? One of the simplest ways is to compare the AI’s responses to a set of known facts. This can be done manually—like fact-checking—or automatically using evaluation metrics designed specifically for this purpose.

When evaluating content generation, it’s crucial to assess how closely the AI-generated responses align with the retrieved information. It’s not just about pulling the right data but ensuring the output is accurate and relevant.

Traditionally, metrics like Mean Reciprocal Rank (MRR), F1 Score, and Precision/Recall were used to evaluate retrieval performance, focusing on how well models retrieved relevant information. However, many of these benchmarks are repurposed retrieval or question-answering datasets, which don’t effectively measure critical aspects like the accuracy of citations, the importance of each piece of text to the overall answer, or how conflicting information is handled. While these traditional metrics are still valuable, they are now being used in conjunction with newer metrics like faithfulness, which provide a more comprehensive evaluation of both retrieval and content generation quality.
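For reference, the traditional retrieval metrics are simple to compute. The sketch below shows precision, recall, and reciprocal rank for a single query using made-up document IDs; MRR is just the mean of the reciprocal rank over many queries.

```python
# Classic retrieval metrics for a single query: precision, recall, reciprocal rank.
# `retrieved` is the ranked list of document IDs the system returned;
# `relevant` is the set of IDs a human marked as correct (illustrative data).

def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document; 0 if none was retrieved."""
    for rank, doc in enumerate(retrieved, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0

retrieved = ["doc_7", "doc_2", "doc_9"]
relevant = {"doc_2", "doc_4"}
print(precision_recall(retrieved, relevant))  # (0.33..., 0.5)
print(reciprocal_rank(retrieved, relevant))   # 0.5 (first relevant doc at rank 2)
```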

Faithfulness & LLM-as-a-Judge: The New Standard for RAG Evaluation

With RAG, a new kind of challenge emerged—ensuring that AI-generated responses are factually consistent with the information retrieved. Traditional metrics couldn’t fully address this, leading to the birth of faithfulness as a key evaluation metric.

Faithfulness evaluates whether the generated response remains consistent with the facts retrieved, ensuring the AI doesn’t fabricate information. Faithfulness or Factual Consistency Score (FCS) measures whether all claims in the generated output can be inferred from the retrieved context. It’s calculated as:

Faithfulness (FCS) = (number of claims in the response supported by the retrieved context) / (total number of claims in the response)

This score ranges from 0 to 1, with 0 indicating a likely hallucination and 1 suggesting the AI's output closely aligns with the provided facts. A high faithfulness score ensures that the AI isn't merely generating plausible-sounding answers but is grounding its responses in verifiable data, thereby minimizing hallucinations.
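In code, the score is just that ratio: extract the claims from the generated answer, decide whether each one is supported by the retrieved context, and divide. The sketch below assumes the per-claim verdicts have already been produced; in practice that verification step is usually delegated to an LLM, which is exactly where the next technique comes in.

```python
# Faithfulness / Factual Consistency Score: supported claims over total claims.
# The verdicts here are illustrative; in practice an LLM or classifier decides
# whether each extracted claim is supported by the retrieved context.

def faithfulness_score(claim_verdicts: list[bool]) -> float:
    """claim_verdicts[i] is True if claim i is supported by the retrieved context."""
    if not claim_verdicts:
        return 0.0
    return sum(claim_verdicts) / len(claim_verdicts)

# Example: 3 of the 4 claims in the generated answer were supported by the context.
print(faithfulness_score([True, True, False, True]))  # 0.75
```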

Now, how do we validate these programmatically? Here’s where “LLM-as-a-Judge” steps in.

Figure: The LLM-as-a-Judge evaluation method.

LLM-as-a-Judge is an evaluation method that leverages LLMs to evaluate and score the outputs of other AI models, acting as an automated judge. The process involves providing the LLM with a test case that includes the input, the model’s output, and the relevant context (such as retrieval data). The LLM Judge scorer then assesses the output, assigning a score based on how well it aligns with the given context. If the output meets a predetermined threshold, it is marked as passed; otherwise, it is flagged as failed.
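A bare-bones version of that loop might look like the sketch below. The judge prompt, the call_llm placeholder, and the 0.8 threshold are illustrative choices rather than a standard; production frameworks add structured claim extraction, retries, and carefully calibrated prompts.

```python
# Minimal LLM-as-a-Judge sketch: ask a judge model to score how well an output
# is supported by the retrieved context, then apply a pass/fail threshold.
# `call_llm` is a hypothetical placeholder for your LLM client of choice.

def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to the judge model and return its raw reply."""
    raise NotImplementedError("wire this up to your LLM provider")

JUDGE_PROMPT = """You are grading an AI answer for faithfulness.
Context:
{context}

Answer:
{answer}

Reply with a single number between 0 and 1: the fraction of claims in the
answer that are directly supported by the context."""

def judge_faithfulness(answer: str, context: str, threshold: float = 0.8) -> dict:
    raw = call_llm(JUDGE_PROMPT.format(context=context, answer=answer))
    score = float(raw.strip())  # assumes the judge replies with a bare number
    return {"score": score, "passed": score >= threshold}

# Example (once call_llm is wired up):
# result = judge_faithfulness(answer=chatbot_reply, context=policy_text)
# if not result["passed"]: escalate_to_human(chatbot_reply)
```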

LLMs as judges have proven to be highly efficient at evaluating AI outputs compared to traditional manual validation. A recent large-scale study of LLM-as-a-Judge across 20 NLP evaluation tasks shows that judgments from models like GPT-4 correlate strongly with human judgments. LLMs can also provide faster, more scalable evaluations while avoiding much of the annotator noise that affects manual assessments.

Let’s discuss how the Air Canada chatbot incident could have been prevented using faithfulness and LLM-as-a-Judge frameworks. Air Canada’s AI-powered chatbot gave a customer incorrect, non-existent policy information, and the airline was ultimately held liable for the misinformation. This situation likely stemmed from the AI generating plausible-sounding but false information, a classic example of AI hallucination.

Had faithfulness been used as a core metric, it would have ensured that the chatbot’s responses were grounded in accurate, verifiable data. In this instance, the chatbot’s output could have been programmatically validated by applying the LLM-as-a-Judge method. An LLM could have acted as an automated judge, evaluating whether the output met the required accuracy threshold by cross-referencing the retrieval data (in this case, the actual company policy). If the output failed to align with the true information, the system would have flagged or rejected it before delivering it to customers, ensuring only factual responses were provided.

Let’s examine some frameworks currently available to mitigate these issues. The following are a few specialized frameworks that leverage LLMs as judges for various metrics, such as faithfulness.

  • RAGAS: RAGAS is an open-source framework designed to help you evaluate your RAG pipelines. RAG models use external data to augment the LLM’s context, and while tools exist to build these pipelines, measuring their performance can be challenging. RAGAS provides a structured approach to evaluating and quantifying your pipeline’s effectiveness, ensuring the retrieval and generation process is aligned with performance expectations.
  • DeepEval: DeepEval is an open-source evaluation framework designed specifically for large language models (LLMs). It makes it easy to test and iterate on LLM applications. The framework lets you “unit test” LLM outputs in a way similar to Pytest, ensuring that responses are evaluated systematically and repeatably. DeepEval supports over 14 LLM-evaluated metrics, many of which are backed by research, making it a robust tool for testing accuracy, relevance, and other critical factors. It also includes synthetic dataset generation using state-of-the-art techniques, so you can evolve and refine test datasets as needed. Metrics are easily customizable, covering diverse use cases, and DeepEval even supports real-time evaluations in production environments, providing immediate feedback on LLM performance (see the minimal usage sketch after this list).
  • ARES: ARES is a cutting-edge framework specifically designed to evaluate RAG pipelines. It combines synthetic data generation with fine-tuned classifiers to efficiently assess key factors like context relevance, answer faithfulness, and answer relevance, reducing the reliance on human annotations. ARES uses advanced techniques such as synthetic query generation and Prediction-Powered Inference (PPI) to deliver accurate evaluations with statistical confidence. This automated approach allows for quick and reliable assessment of RAG models, making it a valuable tool for businesses looking to optimize the accuracy and reliability of their AI systems.
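As an example of how lightweight these frameworks are to adopt, here is a minimal faithfulness check written against DeepEval's documented interface. Treat it as a sketch: class and parameter names may shift between versions, and because the metric is itself LLM-evaluated, an LLM provider key must be configured.

```python
# Sketch of a DeepEval faithfulness check (verify against the current docs;
# names and defaults change between versions).
from deepeval import evaluate
from deepeval.metrics import FaithfulnessMetric
from deepeval.test_case import LLMTestCase

test_case = LLMTestCase(
    input="Can I get a bereavement refund after my trip?",
    actual_output="Yes, bereavement refunds can be claimed retroactively.",
    retrieval_context=[
        "Bereavement fares must be requested before travel, not retroactively."
    ],
)

metric = FaithfulnessMetric(threshold=0.7)
evaluate([test_case], [metric])  # reports a score and pass/fail per test case
```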

The field is also exploring hallucination-detection approaches that don’t rely on “LLM-as-a-Judge.” Instead, dedicated classifier models like HHEM are gaining traction.

The Hughes Hallucination Evaluation Model (HHEM), particularly its latest version, HHEM-2.1, is not a generative large language model like the systems it evaluates. Instead, it is a dedicated classification model designed to detect hallucinations in the outputs of generative AI systems. HHEM’s advantage lies in its efficiency and speed. Unlike the full-scale LLMs used in “LLM-as-a-Judge” approaches, HHEM is optimized for faster, real-time performance. It can evaluate factual consistency and identify hallucinations with lower latency and computational cost, making it well suited to enterprise applications where speed and scalability are crucial.
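If you want to experiment with this kind of classifier, an open HHEM checkpoint is published on Hugging Face. The sketch below follows the usage pattern described in its model card, a predict() helper over (context, answer) pairs loaded via transformers with trust_remote_code; treat the exact interface as an assumption and confirm it against the current model card.

```python
# Sketch of hallucination detection with the open HHEM checkpoint on Hugging Face.
# Usage pattern follows the model card; verify names against the current docs.
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "vectara/hallucination_evaluation_model", trust_remote_code=True
)

# Each pair is (retrieved context, generated answer); the model returns a
# factual-consistency score per pair, where low scores suggest hallucination.
pairs = [
    ("Bereavement fares must be requested before travel.",
     "You can claim a bereavement refund after your trip."),
]
scores = model.predict(pairs)
print(scores)
```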

Why is this important for your business?

AI is rapidly transforming the enterprise landscape, becoming a critical driver of innovation and competitive advantage. The speed at which AI-first companies are scaling is nothing short of remarkable. According to Stripe data, AI-first companies founded after 2020 are growing at an unprecedented pace, reaching over $30 million in revenue in just 20 months, compared to 65 months for traditional SaaS businesses. This speed demonstrates AI's enormous potential, but it also underscores the importance of ensuring that your AI systems are reliable, trustworthy, and grounded in real data.

To avoid costly missteps, businesses must implement techniques like faithfulness metrics, LLM-as-a-Judge, and models like HHEM. These tools enhance performance and help ensure that your AI outputs are accurate, reducing the risk of hallucinations that could damage your brand and erode customer trust. Faithfulness metrics help keep AI-generated content factually consistent with the provided context, while LLM-as-a-Judge and hallucination detection models offer an automated, scalable solution for real-time validation.

The message is clear: embracing these tools allows your business to harness AI's full power while avoiding the pitfalls of unreliable information. Investing in these capabilities isn’t just about improving accuracy—it’s about protecting your brand, building trust, and ensuring your business can scale confidently in an increasingly AI-driven world.

Written by

Juan Figueroa
