Overview
Opik provides a set of built-in evaluation metrics that can be used to evaluate the output of your LLM calls. These metrics are broken down into two main categories:
- Heuristic metrics
- LLM as a Judge metrics
Heuristic metrics are deterministic and are often statistical in nature. LLM as a Judge metrics use an LLM to evaluate the output of another LLM, and are therefore non-deterministic.
Opik provides the following built-in evaluation metrics:
Metric | Type | Description | Documentation |
---|---|---|---|
Equals | Heuristic | Checks if the output exactly matches an expected string | Equals |
Contains | Heuristic | Checks if the output contains a specific substring; the check can be case sensitive or case insensitive | Contains |
RegexMatch | Heuristic | Checks if the output matches a specified regular expression pattern | RegexMatch |
IsJson | Heuristic | Checks if the output is a valid JSON object | IsJson |
Levenshtein | Heuristic | Calculates the Levenshtein distance between the output and an expected string | Levenshtein |
Hallucination | LLM as a Judge | Checks if the output contains any hallucinations | Hallucination |
G-Eval | LLM as a Judge | Task-agnostic LLM as a Judge metric | G-Eval |
Moderation | LLM as a Judge | Checks if the output contains any harmful content | Moderation |
AnswerRelevance | LLM as a Judge | Checks if the output is relevant to the question | AnswerRelevance |
ContextRecall | LLM as a Judge | Checks if the output correctly recalls information from the provided context | ContextRecall |
ContextPrecision | LLM as a Judge | Checks if the output is accurate and relevant given the provided context | ContextPrecision |
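Heuristic metrics can be scored directly in Python. Below is a minimal sketch using the Equals metric; the argument names (`output`, `reference`) follow the usual signature of Opik's heuristic metrics and may differ slightly between versions:

```python
from opik.evaluation.metrics import Equals

# Heuristic metrics are deterministic: the same inputs always produce the same score
metric = Equals()
result = metric.score(output="Paris", reference="Paris")
print(result.value)  # 1.0 for an exact match, 0.0 otherwise
```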
You can also create your own custom metric; a rough sketch is shown below, and you can learn more in the Custom Metric section.
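As a sketch of what a custom metric can look like, the example below subclasses a base metric class and returns a score result. The module layout (`base_metric`, `score_result`) follows the `opik.evaluation.metrics` package, but treat the exact names as an assumption and refer to the Custom Metric section for the authoritative API:

```python
from opik.evaluation.metrics import base_metric, score_result


class LengthCheck(base_metric.BaseMetric):
    """Hypothetical custom metric: passes if the output stays under a length budget."""

    def __init__(self, max_length: int = 200, name: str = "length_check"):
        super().__init__(name=name)
        self.max_length = max_length

    def score(self, output: str, **ignored_kwargs) -> score_result.ScoreResult:
        passed = len(output) <= self.max_length
        return score_result.ScoreResult(
            value=1.0 if passed else 0.0,
            name=self.name,
            reason=f"Output length {len(output)} vs budget {self.max_length}",
        )
```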
Customizing LLM as a Judge metrics
By default, Opik uses GPT-4o from OpenAI as the LLM to evaluate the output of other LLMs. However, you can easily switch to another LLM provider by specifying a different model in the `model` parameter of each LLM as a Judge metric:
```python
from opik.evaluation.metrics import Hallucination

# Use an Anthropic Claude model served through Amazon Bedrock instead of the default
metric = Hallucination(model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0")

metric.score(
    input="What is the capital of France?",
    output="The capital of France is Paris. It is famous for its iconic Eiffel Tower and rich cultural heritage.",
)
```
This functionality is built on top of the LiteLLM framework; you can find a full list of supported LLM providers and how to configure them in the LiteLLM Providers guide.
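Most LiteLLM providers are configured through environment variables. As a sketch for the Bedrock model used above, the standard AWS credential variables would look like the following (the variable names are a LiteLLM convention; check the LiteLLM Providers guide for your specific provider, and note the values here are placeholders):

```python
import os

# Placeholder credentials for the Bedrock example above; LiteLLM reads
# these standard AWS environment variables when calling Bedrock models
os.environ["AWS_ACCESS_KEY_ID"] = "<your-access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<your-secret-access-key>"
os.environ["AWS_REGION_NAME"] = "us-east-1"
```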