Hallucination

class opik.evaluation.metrics.Hallucination(model: str | OpikBaseModel | None = None, name: str = 'hallucination_metric', few_shot_examples: List[FewShotExampleHallucination] | None = None)

Bases: BaseMetric

A metric that evaluates whether an LLM’s output contains hallucinations based on given input and context.

This metric uses another LLM to judge if the output is factual or contains hallucinations. It returns a score of 1.0 if hallucination is detected, and 0.0 otherwise.

Parameters:
  • model – The LLM to use for evaluation. Can be a string (model name) or an OpikBaseModel instance.

  • name – The name of the metric.

  • few_shot_examples – A list of few-shot examples to use for hallucination detection. If None, default examples will be used.
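
For example, a specific judge model can be selected by name (a minimal sketch; "gpt-4o" is an illustrative model name, not a documented default of this metric):

>>> from opik.evaluation.metrics import Hallucination
>>> hallucination_metric = Hallucination(model="gpt-4o")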

Example

>>> from opik.evaluation.metrics import Hallucination
>>> hallucination_metric = Hallucination()
>>> result = hallucination_metric.score(
...     input="What is the capital of France?",
...     output="The capital of France is London.",
...     context=["The capital of France is Paris."]
... )
>>> print(result.value)
1.0
>>> print(result.reason)
The answer provided states that the capital of France is London, which contradicts the fact stated in the context that the capital of France is Paris.

score(input: str, output: str, context: List[str], **ignored_kwargs: Any) → ScoreResult

Calculate the hallucination score for the given input, output, and context.

Parameters:
  • input – The original input/question.

  • output – The LLM’s output to evaluate.

  • context – A list of context strings.

  • **ignored_kwargs – Additional keyword arguments that are ignored.

Returns:

A ScoreResult object with a value of 1.0 if hallucination is detected, 0.0 otherwise, along with the reason for the verdict.

Return type:

score_result.ScoreResult
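
A minimal sketch of scoring several items in a loop; the items list below is hypothetical test data, and only the score() call and the ScoreResult fields come from this API:

>>> items = [
...     {"input": "What is the capital of France?",
...      "output": "The capital of France is London.",
...      "context": ["The capital of France is Paris."]},
... ]
>>> for item in items:
...     result = hallucination_metric.score(**item)
...     print(result.value, result.reason)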

async ascore(input: str, output: str, context: List[str], **ignored_kwargs: Any) → ScoreResult

Asynchronously calculate the hallucination score for the given input, output, and context.

Parameters:
  • input – The original input/question.

  • output – The LLM’s output to evaluate.

  • context – A list of context strings.

  • **ignored_kwargs – Additional keyword arguments that are ignored.

Returns:

A ScoreResult object with a value of 1.0 if hallucination is detected, 0.0 otherwise, along with the reason for the verdict.

Return type:

score_result.ScoreResult
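
A minimal sketch of scoring two outputs concurrently with ascore, assuming it is run inside an asyncio event loop; the example inputs are illustrative:

>>> import asyncio
>>> async def main():
...     # Run both evaluations concurrently with asyncio.gather
...     results = await asyncio.gather(
...         hallucination_metric.ascore(
...             input="What is the capital of France?",
...             output="The capital of France is Paris.",
...             context=["The capital of France is Paris."],
...         ),
...         hallucination_metric.ascore(
...             input="What is the capital of France?",
...             output="The capital of France is London.",
...             context=["The capital of France is Paris."],
...         ),
...     )
...     for result in results:
...         print(result.value)
>>> asyncio.run(main())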