Heuristic metrics
Heuristic metrics are rule-based evaluation methods that allow you to check specific aspects of language model outputs. These metrics use predefined criteria or patterns to assess the quality, consistency, or characteristics of generated text.
You can use the following heuristic metrics:
- Equals: Checks if the output exactly matches a specified string
- Contains: Checks if the output contains a specific substring
- RegexMatch: Checks if the output matches a specified regular expression pattern
- IsJson: Checks if the output is valid JSON
- LevenshteinRatio: Measures how close the output is to a reference string using the Levenshtein ratio
- SentenceBLEU: Computes a single-sentence BLEU score against one or more references
- CorpusBLEU: Computes a corpus-level BLEU score for multiple outputs and references
- ROUGE: Measures the overlap between the output and one or more reference summaries
Score an LLM response
You can score an LLM response by first initializing the metric and then calling the score method:
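For example, here is a minimal sketch using the Equals metric. The import path (opik.evaluation.metrics), the output/reference keyword arguments, and the value attribute on the returned score are assumptions based on common Opik usage:

```python
from opik.evaluation.metrics import Equals

# Initialize the metric, then score a single LLM response.
metric = Equals()
score = metric.score(output="Paris", reference="Paris")

print(score.value)  # 1.0 when the output matches the reference, 0.0 otherwise
```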
Metrics
Equals
The Equals metric can be used to check if the output of an LLM exactly matches a specific string. It can be used in the following way:
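A hedged sketch, under the same assumptions as above:

```python
from opik.evaluation.metrics import Equals

metric = Equals()
# Exact string comparison between the output and the reference.
score = metric.score(output="Hello world !", reference="Hello world !")
print(score.value)
```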
Contains
The Contains metric can be used to check if the output of an LLM contains a specific substring. It can be used in the following way:
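A hedged sketch, under the same assumptions:

```python
from opik.evaluation.metrics import Contains

metric = Contains()
# Checks whether the reference substring appears in the output.
score = metric.score(
    output="The capital of France is Paris.",
    reference="Paris",
)
print(score.value)
```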
RegexMatch
The RegexMatch metric can be used to check if the output of an LLM matches a specified regular expression pattern. It can be used in the following way:
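A hedged sketch; the regex constructor argument name is an assumption:

```python
from opik.evaluation.metrics import RegexMatch

# Checks whether the output matches an ISO-style date pattern.
metric = RegexMatch(regex=r"^\d{4}-\d{2}-\d{2}$")
score = metric.score(output="2024-05-01")
print(score.value)
```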
IsJson
The IsJson metric can be used to check if the output of an LLM is valid JSON. It can be used in the following way:
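A hedged sketch, under the same assumptions:

```python
from opik.evaluation.metrics import IsJson

metric = IsJson()
# Scores 1.0 if the output parses as valid JSON, 0.0 otherwise.
score = metric.score(output='{"city": "Paris", "population": 2100000}')
print(score.value)
```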
LevenshteinRatio
The LevenshteinRatio metric can be used to measure how close the output of an LLM is to a reference string, using the Levenshtein ratio (a normalized edit-distance similarity between 0 and 1). It can be used in the following way:
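A hedged sketch, under the same assumptions:

```python
from opik.evaluation.metrics import LevenshteinRatio

metric = LevenshteinRatio()
# Returns a similarity ratio in [0, 1]; higher means closer to the reference.
score = metric.score(output="Hello world!", reference="Hello, world!")
print(score.value)
```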
BLEU
The BLEU (Bilingual Evaluation Understudy) metrics estimate how close the LLM outputs are to one or more reference translations. Opik provides two separate classes:
- SentenceBLEU – Single-sentence BLEU
- CorpusBLEU – Corpus-level BLEU

Both rely on the underlying NLTK BLEU implementation with optional smoothing methods, weights, and variable n-gram orders.
You will need the nltk library:
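For example:

```bash
pip install nltk
```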
Use SentenceBLEU to compute single-sentence BLEU between a single candidate and one (or more) references:
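A minimal sketch, assuming score() accepts output and reference keyword arguments (where reference may be a single string or a list of reference strings):

```python
from opik.evaluation.metrics import SentenceBLEU

metric = SentenceBLEU()
score = metric.score(
    output="The cat sat on the mat.",
    reference="A cat was sitting on the mat.",
)
print(score.value)
```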
Use CorpusBLEU to compute corpus-level BLEU for multiple candidates vs. multiple references. Each candidate and its references align by index in the list:
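A minimal sketch, assuming output is a list of candidate strings and reference is a list (aligned by index) of reference lists:

```python
from opik.evaluation.metrics import CorpusBLEU

metric = CorpusBLEU()
score = metric.score(
    output=[
        "The cat sat on the mat.",
        "The weather is nice today.",
    ],
    reference=[
        ["A cat was sitting on the mat."],
        ["The weather is lovely today.", "It is a nice day."],
    ],
)
print(score.value)
```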
You can also customize n-grams, smoothing methods, or weights:
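A sketch of customization; the constructor parameter names (n_grams, smoothing_method, weights) are assumptions mirroring NLTK's BLEU options:

```python
from opik.evaluation.metrics import SentenceBLEU

# Bigram BLEU with equal weights and NLTK smoothing method 2.
metric = SentenceBLEU(
    n_grams=2,
    smoothing_method="method2",
    weights=[0.5, 0.5],
)
score = metric.score(
    output="The cat sat on the mat.",
    reference="A cat was on the mat.",
)
print(score.value)
```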
Note: If any candidate or reference is empty, SentenceBLEU or CorpusBLEU will raise a MetricComputationError. Handle or validate inputs accordingly.
ROUGE
The ROUGE (Recall-Oriented Understudy for Gisting Evaluation) metric estimates how close the LLM outputs are to one or more reference summaries, and is commonly used for evaluating summarization and text generation tasks. It measures the overlap between an output string and a reference string, with support for multiple ROUGE types. This metric is a wrapper around the Google Research reimplementation of ROUGE, which is based on the rouge-score library. You will need the rouge-score library:
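For example:

```bash
pip install rouge-score
```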
It can be used in the following way:
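A minimal sketch, under the same import and keyword-argument assumptions as the metrics above:

```python
from opik.evaluation.metrics import ROUGE

metric = ROUGE()
score = metric.score(
    output="The quick brown fox jumps over the lazy dog.",
    reference="The fast brown fox leaps over the lazy dog.",
)
print(score.value)
```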
You can customize the ROUGE metric using the following parameters (see the example after this list):
- rouge_type (str): Type of ROUGE score to compute. Must be one of:
  - rouge1: Unigram-based scoring
  - rouge2: Bigram-based scoring
  - rougeL: Longest common subsequence-based scoring
  - rougeLsum: ROUGE-L score based on sentence splitting
  Default: rouge1
- use_stemmer (bool): Whether to use stemming in ROUGE computation. Default: False
- split_summaries (bool): Whether to split summaries into sentences. Default: False
- tokenizer (Any | None): Custom tokenizer for sentence splitting. Default: None
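For example, a sketch of configuring these parameters when constructing the metric (assuming they are constructor arguments):

```python
from opik.evaluation.metrics import ROUGE

# Longest-common-subsequence ROUGE with stemming enabled.
metric = ROUGE(
    rouge_type="rougeL",
    use_stemmer=True,
)
score = metric.score(
    output="Paris is the capital and largest city of France.",
    reference="The capital of France is Paris.",
)
print(score.value)
```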
Notes
- The metric is case-insensitive.
- ROUGE scores are useful for comparing text summarization models or evaluating text similarity.
- Consider using stemming for improved evaluation in certain cases.