March 26, 2025
Even ChatGPT knows it’s not always right. When prompted, “Are large language models (LLMs) always accurate?” ChatGPT says no and confirms, “While they are powerful tools capable of generating fluent and contextually appropriate text based on their training, there are several reasons why they may produce inaccurate or unreliable information.”
One of those reasons: Hallucinations.
LLM hallucinations occur when the response to a user prompt is presented confidently by the model as a true statement, but is in fact inaccurate or completely fabricated nonsense. These coherent but incorrect outputs call the reliability and trustworthiness of LLM tools into question, and are a known issue you will need to mitigate if you are developing an app with an LLM integration.
For reference, if we look at GPT-4 as an example, reported LLM hallucination rates vary anywhere from 1.8% for general usage to 28.6% for specific applications. And there can be real consequences for businesses that rely on LLM agents to serve their customers. Take the case of Air Canada last year: the company was taken to a tribunal and ordered to pay damages and fees because of a hallucination its LLM-driven chatbot made while answering a customer’s question about refund policies.
To help you address the challenges hallucinations cause during the LLM app development process and reduce your chances of becoming the person with the revenue-losing chatbot, this article will explore what LLM hallucinations are, why they happen, how to detect and measure them, and how to keep them in check.
LLM hallucination is when the output from an LLM is well-formed and sounds plausible, but is actually incorrect information that the model “hallucinated” based on the data and patterns it learned. Even though LLMs excel at recognizing language patterns and understanding how words and phrases typically fit together, they don’t “know” things the way humans do.
LLMs are limited by the datasets of language they are trained on, and simply learn to predict the likelihood of the next word or phrase in a sentence. This allows them to craft convincing, human-like responses regardless of whether the information is true or not. There is no mechanism for fact-checking within the models themselves, nor is there the ability for complex, nuanced reasoning. This is where human oversight is required.
Hallucinations are a known phenomenon that LLM providers like OpenAI, Google, Meta, and Anthropic are actively working to solve. If you’re a developer building on top of LLMs, it’s important to be aware of the risks and implement safeguards and systems to continually monitor and mitigate hallucinated responses.
The cause of LLM hallucinations is multifaceted. Various factors contribute to hallucinated outputs, but they can be grouped into three general categories: data-related causes (gaps, biases, or outdated information in the training data), training-related causes (how the model learns during pre-training and alignment), and inference-related causes (how the model decodes and generates text at run time).
While all hallucinated responses are inaccurate and potentially misleading, there are two primary categories of LLM hallucinations: intrinsic and extrinsic. In simple terms, intrinsic hallucinations are LLM responses that directly contradict the given information, while extrinsic hallucinations are LLM responses that can’t be confirmed or denied with the given information, so it’s unclear whether they are true.
Although the intrinsic-versus-extrinsic distinction has been a useful categorization up until this point, in an excellent 2024 paper from the Harbin Institute of Technology and Huawei, researchers make a compelling case for shifting the hallucination terminology to better reflect the evolving, user-centric applications of LLMs. They define factuality and faithfulness as the two primary categories of LLM hallucinations and provide the examples listed below.
Factuality hallucinations are LLM responses that are either contradictory to real-world information or completely fabricated and unverifiable, i.e. they are false facts.
| User Input | LLM Output | Explanation |
|---|---|---|
| What are Thomas Edison’s main contributions to science and technology? | Thomas Edison developed the first practical telephone and invented the light bulb. | The response is factually incorrect. In reality, Edison improved the light bulb, building on earlier designs by others, and Alexander Graham Bell invented the telephone. |
| What are the major environmental impacts of the construction of the Eiffel Tower? | The construction of the Eiffel Tower in 1889 led to the extinction of the Parisian tiger, a species that played a crucial role in the region’s ecosystem. Additionally, it is widely recognized as the event that sparked the global green architecture movement. | The response erroneously states that the construction of the Eiffel Tower led to the extinction of the “Parisian tiger” and sparked the global green architecture movement. In fact, there is no evidence of a “Parisian tiger” ever existing, making this a fabricated claim. Moreover, attributing the origins of green architecture to the Eiffel Tower is an exaggeration, as this movement has diverse roots and cannot be traced back to a single event. |
Faithfulness hallucinations are LLM responses that deviate from user instructions, context, or logical consistency. This results in responses that are unaligned or contradictory to the provided information, i.e. they are not faithful to the user input.
| User Input | LLM Output | Explanation |
|---|---|---|
| Translate the English question into Spanish: “What is the capital of France?” | The capital of France is Paris. | The user explicitly provided an instruction to translate the English question into Spanish. Instead of performing the translation as instructed, the LLM chose to answer the question. |
| The Nile is a major north-flowing river in northeastern Africa. It’s traditionally considered the longest river in the world. Originating in the Great Lakes region of central Africa, it flows through several countries before emptying into the Mediterranean Sea. Summarize the passage above. | The Nile originates from the mountain ranges of central Africa and flows into the Mediterranean. | The summary incorrectly states that the Nile originates from the “mountain ranges of central Africa,” while the passage says that the Nile originates in the Great Lakes region of central Africa. |
| Solve the equation 2x + 3 = 11 step by step. | Step 1: Subtract 3 from both sides to get: 2x = 8. Step 2: Divide both sides by 2 to get: x = 3. So the result of the equation is x = 3. | The first step correctly isolates the term with x to get 2x = 8. However, the next step inaccurately divides 8 by 2 to yield a result of x = 3, which is inconsistent with the earlier reasoning. |
There is no doubt that AI has transformed the playing field of app development, and the possibilities for LLM integration are endless. Dev teams are building and shipping products that rely on LLMs for everything from routine customer service tasks to code generation to marketing content production.
Since these tools touch every part of business, the trustworthiness of your AI product matters. Unfortunately, this can easily be undermined by LLM hallucinations. Addressing the potentially negative impact of hallucinated responses brings a new set of challenges to the app development process.
Unlike a basic interaction where a user asks ChatGPT a question and gets a hallucinated response, in app development, LLMs are often called at scale and process hundreds or thousands of requests. As the developer, you need to track and manage hallucinations across all of those queries, and even a low hallucination rate adds up to a meaningful number of bad responses when your app makes frequent, repeated calls to the LLM.
One of the biggest challenges in LLM app development is figuring out how to effectively track, measure, and reduce hallucinations to maintain the credibility of your app. You’ll need to understand how often hallucinations occur in real-world usage so you can decide on the best way to minimize their impact.
The key metric here is the hallucination rate, or the percentage of LLM outputs that are incorrect or misleading. You’ll first need to set up a tracking system to flag any LLM-generated content that doesn’t align with the source information or expected output. Then, once you have a clear understanding of how often hallucinations occur and under what circumstances, you’ll be able to develop a mitigation strategy.
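As a concrete illustration, here is a minimal sketch of that kind of tracking, assuming a hypothetical `LLMCallRecord` log entry and a `flagged_as_hallucination` label produced by whatever detection method you settle on (human review or the LLM-as-a-judge metrics discussed below):

```python
from dataclasses import dataclass


@dataclass
class LLMCallRecord:
    # Hypothetical log entry: one record per LLM call made by your app.
    prompt: str
    response: str
    flagged_as_hallucination: bool


def hallucination_rate(records: list[LLMCallRecord]) -> float:
    """Percentage of logged LLM outputs flagged as hallucinated."""
    if not records:
        return 0.0
    flagged = sum(1 for r in records if r.flagged_as_hallucination)
    return 100.0 * flagged / len(records)


# Example: 1 flagged response out of 4 logged calls -> 25.0%
records = [
    LLMCallRecord("Q1", "grounded answer", False),
    LLMCallRecord("Q2", "grounded answer", False),
    LLMCallRecord("Q3", "fabricated claim", True),
    LLMCallRecord("Q4", "grounded answer", False),
]
print(f"Hallucination rate: {hallucination_rate(records):.1f}%")
```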
LLM hallucinations are a multifaceted problem that will likely require several solutions. In order to deploy the right strategies for reducing your hallucination rate to an acceptable level for your app’s use case, it’s critical to have visibility into the quality of your app’s LLM calls.
As the developer, you are responsible for ensuring the app works correctly every time, and that hallucinated errors don’t damage the app’s performance. This is particularly true if your app serves customers in highly regulated industries where the accuracy and validity of information is crucial, and the use of AI is already being questioned, such as legal, financial, or medical fields. LLM hallucinations can easily erode user trust and decrease customer confidence in the usability and reliability of your product.
When integrating an LLM into your app, it’s crucial to continually evaluate its performance so you can detect and mitigate hallucinations. One way to do this is to automate the LLM evaluation process, which involves three key steps: logging your app’s LLM inputs and outputs, annotating and scoring those outputs against expected results, and running experiments to compare how different models, prompts, and configurations perform.
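To make this concrete, here is a toy sketch of that loop. The `call_llm()` placeholder, the prompt templates, and the naive substring scoring are all illustrative assumptions; swap in your real model call and whichever metric fits your app:

```python
# A toy evaluation harness: run a small labeled dataset through two candidate
# prompt setups and compare how often each produces the expected answer.
def call_llm(prompt: str) -> str:
    # Placeholder: swap in your real model call (OpenAI, Anthropic, a local model, etc.).
    return "Paris"


dataset = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "Who invented the telephone?", "expected": "Alexander Graham Bell"},
]

setups = {
    "terse": "Answer briefly: {question}",
    "grounded": "Answer only if you are certain; otherwise say 'I don't know'. {question}",
}

for name, template in setups.items():
    correct = 0
    for row in dataset:
        answer = call_llm(template.format(question=row["question"]))
        # Naive scoring for the sketch; substitute your own metric or an LLM judge.
        if row["expected"].lower() in answer.lower():
            correct += 1
    print(f"{name}: {correct}/{len(dataset)} answers matched the expected output")
```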
Once you identify the setup with the best performance, i.e. the one that minimizes hallucinations, you can ship it in your production app with more confidence. Automating the LLM evaluation process for hallucination detection not only helps you monitor and improve your app’s performance, but also helps you maintain the trustworthiness of your app.
Detecting LLM hallucinations and measuring the accuracy of your LLM-driven app is an ongoing, iterative process. Once you’ve automated the logging and annotation of your app’s outputs, you’ll need a reliable way to detect, quantify, and measure LLM hallucinations within those responses.
This requires a method of scoring and comparison, which typically involves running evaluations on your dataset to assess how accurately your LLM’s responses match the expected answers. The LLM evaluation metrics and methods you use to score will depend on the desired outputs of your app, your performance needs, and the level of accuracy you are accountable for. Your LLM accuracy metrics should be specific to your app, and may not be applicable to other use cases.
A common approach to hallucination detection is the LLM-as-a-judge concept, introduced in this 2023 paper. The basic principle is that you use another LLM to review your LLM’s outputs and grade them against whatever criteria you’d like to set. Keep in mind that LLM-as-a-judge metrics are generally non-deterministic: they can produce different scores for the same dataset across repeated runs.
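Here is a minimal sketch of the idea using the OpenAI Python client. The judge prompt, the choice of gpt-4o as the judge model, and the binary 0/1 grading scheme are all illustrative assumptions, not a prescribed setup:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

JUDGE_PROMPT = """You are grading an answer for hallucinations.
Context: {context}
Question: {question}
Answer: {answer}
Reply with a single digit: 1 if the answer is fully supported by the context, 0 otherwise."""


def judge_answer(question: str, answer: str, context: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o",  # illustrative judge model; use whichever model you prefer
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            context=context, question=question, answer=answer)}],
        temperature=0,  # reduces, but does not eliminate, run-to-run variability
    )
    # A production judge would parse more defensively than int() on the raw reply.
    return int(response.choices[0].message.content.strip())


score = judge_answer(
    question="Where does the Nile originate?",
    answer="The Nile originates in the mountain ranges of central Africa.",
    context="The Nile originates in the Great Lakes region of central Africa.",
)
print(score)  # expected: 0, since the answer contradicts the context
```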
There are a wide variety of LLM-as-a-judge approaches you could choose to take. For example, the open-source LLM evaluation tool Opik provides built-in LLM-as-a-judge metrics and prompt templates to help you detect and measure hallucination.
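As a rough sketch of what using such a built-in metric can look like (the class name and `score()` signature below follow Opik’s documentation at the time of writing; check the current docs before relying on them):

```python
from opik.evaluation.metrics import Hallucination

metric = Hallucination()
result = metric.score(
    input="Where does the Nile originate?",
    output="The Nile originates in the mountain ranges of central Africa.",
    context=["The Nile originates in the Great Lakes region of central Africa."],
)
# The result includes a numeric score and the judge's reasoning.
print(result.value, result.reason)
```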
Unfortunately, there is currently no AI industry standard for benchmarking LLM hallucination rates. Many researchers are exploring potential methods, and some companies are developing their own hallucination index systems for measuring LLM performance, but there is still progress to be made.
Ali Arsanjani, Director of AI at Google, has this to say on the matter: “Designing benchmarks and metrics for measuring and assessing hallucination is a much needed endeavor. However, recent work in the commercial domain without extensive peer reviews can lead to misleading results and interpretations.”
One example of a company-led LLM hallucination benchmark is FACTS from Google’s research team, an index of LLM performance based on factuality scores, or the percentage of factually accurate responses generated. The researchers themselves acknowledge, “While this benchmark represents a step forward in evaluating factual accuracy, more work remains to be done.”
The first step to prevent hallucinations is to understand what your app needs to achieve with the LLM integration. Are you aiming for high accuracy in technical data, conversational responses, or creative outputs? Clearly define what a successful output looks like in the context of your app. Consider factors like factual accuracy, coherence, and relevance, and be explicit about your LLM’s responsibilities and limitations. Identifying your priorities will help guide decisions and expectations.
The quality of your training data plays a huge role in minimizing hallucinations. For specialized applications, such as healthcare or finance, consider training your LLM with domain-specific data to ensure it understands the intricacies of your field. High-quality, relevant data reduces the chances the LLM will produce misleading or irrelevant information and enhances its ability to generate accurate outputs aligned with your app’s needs.
The way input prompts are structured can greatly influence the quality of the LLM’s output. By controlling the input options for users and reducing ambiguity as much as possible, you can guide the model to generate more reliable responses. Conduct adversarial testing to ensure you have good guardrails on your prompts and minimize the risk of unexpected or “jailbroken” outputs. The clearer and more specific you can make the input, the more likely you are to avoid hallucinations.
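For instance, here is a small sketch of what constrained inputs and guardrails can look like in practice; the topic list, refusal message, and function names are hypothetical, not a recommended template:

```python
# Illustrative guardrails: restrict what users can ask about and instruct the
# model to defer rather than guess. Topic names and wording are hypothetical.
ALLOWED_TOPICS = {"refunds", "shipping", "returns"}


def build_prompt(topic: str, question: str, policy_text: str) -> str:
    if topic not in ALLOWED_TOPICS:
        raise ValueError(f"Unsupported topic: {topic}")
    return (
        "You are a customer support assistant. Answer ONLY using the policy text below.\n"
        "If the policy does not cover the question, reply exactly: "
        "'I can't answer that; let me connect you with an agent.'\n\n"
        f"Policy ({topic}):\n{policy_text}\n\n"
        f"Customer question: {question}"
    )


print(build_prompt(
    "refunds",
    "Can I get a refund after 60 days?",
    "Refunds are available within 30 days of purchase.",
))
```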
Hallucinations can’t be avoided entirely, but they can be minimized with consistent testing and refinement. Track inputs and outputs over time, and look for patterns or common triggers that lead to hallucinations. Implement an evaluation process and collect continuous feedback from users to help you monitor the impact of LLM hallucinations and develop iterative improvements. This continual fine-tuning of your app performance will help you stay proactive, detect issues early, and adjust as needed to maintain app quality.
Even with automation and testing systems in place, human oversight is still crucial. Human judgment and reasoning can catch subtle errors or misalignments, and ensure that your app functions well on a consistent basis. Empowering humans with the right tools, including feedback mechanisms and evaluation systems, will help you keep hallucinations in check and improve the overall quality of your app.
If you’re exploring different methods for reducing LLM hallucinations, test out a free LLM evaluation tool that was built with AI developers in mind: Opik.
With Opik’s open-source evaluation feature set, which is compatible with any LLM you’d like, you can log your app’s LLM calls, score outputs with built-in metrics, and compare experiments across prompts and models.
Sign up for free today to improve the reliability of your LLM integration.