Hallucination

class opik.evaluation.metrics.Hallucination(model: str | OpikBaseModel | None = None, name: str = 'hallucination_metric', few_shot_examples: List[FewShotExampleHallucination] | None = None)

Bases: BaseMetric

A metric that evaluates whether an LLM’s output contains hallucinations based on given input and context.

This metric uses another LLM to judge if the output is factual or contains hallucinations. It returns a score of 1.0 if hallucination is detected, and 0.0 otherwise.

Parameters:
  • model – The LLM to use for evaluation. Can be a string (model name) or an OpikBaseModel instance.

  • name – The name of the metric.

  • few_shot_examples – A list of few-shot examples to use for hallucination detection. If None, default examples will be used.
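
For example, a specific judge model can be selected by name (a minimal sketch; "gpt-4o" is an illustrative model name, not a documented default of this metric):

>>> from opik.evaluation.metrics import Hallucination
>>> hallucination_metric = Hallucination(model="gpt-4o")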

Example

>>> from opik.evaluation.metrics import Hallucination
>>> hallucination_metric = Hallucination()
>>> result = hallucination_metric.score(
...     input="What is the capital of France?",
...     output="The capital of France is London.",
...     context=["The capital of France is Paris."]
... )
>>> print(result.value)
1.0
>>> print(result.reason)
The answer provided states that the capital of France is London, which contradicts the fact stated in the context that the capital of France is Paris.

score(input: str, output: str, context: List[str], **ignored_kwargs: Any) → ScoreResult

Calculate the hallucination score for the given input, output, and context.

Parameters:
  • input – The original input/question.

  • output – The LLM’s output to evaluate.

  • context – A list of context strings.

  • **ignored_kwargs – Additional keyword arguments that are ignored.

Returns:

A ScoreResult object with a value of 1.0 if hallucination is detected, 0.0 otherwise, along with the reason for the verdict.

Return type:

score_result.ScoreResult
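
A minimal sketch of scoring several items in a loop; the items list below is hypothetical test data, and only the score() call and the ScoreResult fields come from this API:

>>> items = [
...     {"input": "What is the capital of France?",
...      "output": "The capital of France is London.",
...      "context": ["The capital of France is Paris."]},
... ]
>>> for item in items:
...     result = hallucination_metric.score(**item)
...     print(result.value, result.reason)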

async ascore(input: str, output: str, context: List[str], **ignored_kwargs: Any) → ScoreResult

Asynchronously calculate the hallucination score for the given input, output, and context.

Parameters:
  • input – The original input/question.

  • output – The LLM’s output to evaluate.

  • context – A list of context strings.

  • **ignored_kwargs – Additional keyword arguments that are ignored.

Returns:

A ScoreResult object with a value of 1.0 if hallucination is detected, 0.0 otherwise, along with the reason for the verdict.

Return type:

score_result.ScoreResult
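
A minimal sketch of scoring two outputs concurrently with ascore, assuming it is run inside an asyncio event loop; the example inputs are illustrative:

>>> import asyncio
>>> async def main():
...     # Run both evaluations concurrently with asyncio.gather
...     results = await asyncio.gather(
...         hallucination_metric.ascore(
...             input="What is the capital of France?",
...             output="The capital of France is Paris.",
...             context=["The capital of France is Paris."],
...         ),
...         hallucination_metric.ascore(
...             input="What is the capital of France?",
...             output="The capital of France is London.",
...             context=["The capital of France is Paris."],
...         ),
...     )
...     for result in results:
...         print(result.value)
>>> asyncio.run(main())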