# Heuristic Metrics
Heuristic metrics are rule-based evaluation methods that allow you to check specific aspects of language model outputs. These metrics use predefined criteria or patterns to assess the quality, consistency, or characteristics of generated text.
You can use the following heuristic metrics:
| Metric | Description |
| --- | --- |
| Equals | Checks if the output exactly matches an expected string |
| Contains | Checks if the output contains a specific substring; can be case sensitive or case insensitive |
| RegexMatch | Checks if the output matches a specified regular expression pattern |
| IsJson | Checks if the output is a valid JSON object |
| LevenshteinRatio | Calculates the normalized Levenshtein ratio between the output and an expected string |
| SentenceBLEU | Calculates a single-sentence BLEU score for a candidate vs. one or more references |
| CorpusBLEU | Calculates a corpus-level BLEU score for multiple candidates vs. their references |
## Score an LLM response

You can score an LLM response by first initializing the metric and then calling the `score` method:
```python
from opik.evaluation.metrics import Contains

metric = Contains(name="contains_hello", case_sensitive=True)
score = metric.score(output="Hello world !", reference="Hello")
print(score)
```
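The returned object is more than a raw number: as the BLEU examples further down show, it exposes a numeric `value` and, for some metrics, a textual `reason`. A minimal sketch of inspecting it:

```python
from opik.evaluation.metrics import Contains

metric = Contains(name="contains_hello", case_sensitive=True)
score = metric.score(output="Hello world !", reference="Hello")

# Access the numeric score directly; "Hello" is a substring of the
# output, so the value should be 1.0
print(score.value)
```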
## Metrics
### Equals

The `Equals` metric can be used to check if the output of an LLM exactly matches a specific string. It can be used in the following way:
```python
from opik.evaluation.metrics import Equals

metric = Equals()
score = metric.score(output="Hello world !", reference="Hello, world !")
print(score)
```
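Note that the two strings above differ by a comma, so this example scores 0. For completeness, a sketch of an exact match:

```python
from opik.evaluation.metrics import Equals

metric = Equals()

# The output and reference are identical, so the score should be 1.0
score = metric.score(output="Hello world !", reference="Hello world !")
print(score.value)
```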
### Contains

The `Contains` metric can be used to check if the output of an LLM contains a specific substring. It can be used in the following way:
```python
from opik.evaluation.metrics import Contains

metric = Contains(case_sensitive=False)
score = metric.score(output="Hello world !", reference="Hello")
print(score)
```
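To illustrate what the `case_sensitive` flag changes, the same inputs can be scored with both settings; a minimal sketch:

```python
from opik.evaluation.metrics import Contains

output = "HELLO WORLD !"

# Case-insensitive matching finds "hello" regardless of casing
insensitive = Contains(case_sensitive=False).score(output=output, reference="hello")
# Case-sensitive matching does not, since the output is upper-case
sensitive = Contains(case_sensitive=True).score(output=output, reference="hello")

print(insensitive.value, sensitive.value)  # expected: 1.0 0.0
```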
### RegexMatch

The `RegexMatch` metric can be used to check if the output of an LLM matches a specified regular expression pattern. It can be used in the following way:
```python
from opik.evaluation.metrics import RegexMatch

metric = RegexMatch(regex="^[a-zA-Z0-9]+$")
score = metric.score(output="Hello world !")
print(score)
```
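Note that the pattern above only accepts alphanumeric characters, so `"Hello world !"` (which contains a space and `!`) does not match and the example scores 0. A sketch contrasting a matching and a non-matching output:

```python
from opik.evaluation.metrics import RegexMatch

metric = RegexMatch(regex="^[a-zA-Z0-9]+$")

# Purely alphanumeric, so the pattern matches
print(metric.score(output="Hello123").value)       # expected: 1.0
# The space and "!" fall outside [a-zA-Z0-9], so it does not
print(metric.score(output="Hello world !").value)  # expected: 0.0
```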
### IsJson

The `IsJson` metric can be used to check if the output of an LLM is valid JSON. It can be used in the following way:
```python
from opik.evaluation.metrics import IsJson

metric = IsJson(name="is_json_metric")
score = metric.score(output='{"key": "some_valid_sql"}')
print(score)
```
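A quick sketch contrasting valid and invalid JSON (single quotes are not valid JSON string delimiters):

```python
from opik.evaluation.metrics import IsJson

metric = IsJson(name="is_json_metric")

# A well-formed JSON object
print(metric.score(output='{"key": "value"}').value)  # expected: 1.0
# Single quotes are not valid JSON, so parsing fails
print(metric.score(output="{'key': 'value'}").value)  # expected: 0.0
```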
### LevenshteinRatio

The `LevenshteinRatio` metric measures how close the output of an LLM is to an expected string, using the normalized Levenshtein distance. It can be used in the following way:
```python
from opik.evaluation.metrics import LevenshteinRatio

metric = LevenshteinRatio()
score = metric.score(output="Hello world !", reference="hello")
print(score)
```
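The ratio is a normalized similarity: identical strings score 1.0, and the score drops toward 0.0 as the strings diverge. A sketch (exact intermediate values depend on the underlying edit-distance computation):

```python
from opik.evaluation.metrics import LevenshteinRatio

metric = LevenshteinRatio()

# Identical strings: ratio is 1.0
print(metric.score(output="Hello world", reference="Hello world").value)
# One character differs: ratio is slightly below 1.0
print(metric.score(output="Hello world", reference="Hello worlds").value)
# Mostly different strings: ratio is close to 0.0
print(metric.score(output="Hello world", reference="xyz").value)
```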
### BLEU

The BLEU (Bilingual Evaluation Understudy) metrics estimate how close the LLM outputs are to one or more reference translations. Opik provides two separate classes:

- `SentenceBLEU` – single-sentence BLEU
- `CorpusBLEU` – corpus-level BLEU

Both rely on the underlying NLTK BLEU implementation with optional smoothing methods, weights, and variable n-gram orders.

You will need the `nltk` library:

```bash
pip install nltk
```
Use `SentenceBLEU` to compute single-sentence BLEU between a single candidate and one (or more) references:
```python
from opik.evaluation.metrics import SentenceBLEU

metric = SentenceBLEU(n_grams=4, smoothing_method="method1")

# Single reference
score = metric.score(
    output="Hello world!",
    reference="Hello world"
)
print(score.value, score.reason)

# Multiple references
score = metric.score(
    output="Hello world!",
    reference=["Hello planet", "Hello world"]
)
print(score.value, score.reason)
```
Use `CorpusBLEU` to compute corpus-level BLEU for multiple candidates vs. multiple references. Each candidate and its references align by index in the list:
```python
from opik.evaluation.metrics import CorpusBLEU

metric = CorpusBLEU()

outputs = ["Hello there", "This is a test."]
references = [
    # For the first candidate, two references
    ["Hello world", "Hello there"],
    # For the second candidate, one reference
    "This is a test."
]

score = metric.score(output=outputs, reference=references)
print(score.value, score.reason)
```
You can also customize n-grams, smoothing methods, or weights:
```python
from opik.evaluation.metrics import SentenceBLEU

metric = SentenceBLEU(
    n_grams=4,
    smoothing_method="method2",
    weights=[0.25, 0.25, 0.25, 0.25]
)

score = metric.score(
    output="The cat sat on the mat",
    reference=["The cat is on the mat", "A cat sat here on the mat"]
)
print(score.value, score.reason)
```
Note: If any candidate or reference is empty, `SentenceBLEU` or `CorpusBLEU` will raise a `MetricComputationError`. Handle or validate inputs accordingly.
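If candidates or references come from an upstream pipeline and may be empty, you can guard the call. This is a sketch that assumes `MetricComputationError` is importable from `opik.exceptions`; check the exact import path for your Opik version:

```python
from opik.evaluation.metrics import SentenceBLEU
from opik.exceptions import MetricComputationError  # assumed import path

metric = SentenceBLEU()

try:
    # An empty candidate should raise MetricComputationError
    score = metric.score(output="", reference="Hello world")
    print(score.value)
except MetricComputationError as exc:
    print(f"Could not compute BLEU: {exc}")
```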