Overview

Opik provides a set of built-in evaluation metrics that can be used to evaluate the output of your LLM calls. These metrics are broken down into two main categories:

  1. Heuristic metrics
  2. LLM as a Judge metrics

Heuristic metrics are deterministic and often statistical in nature. LLM as a Judge metrics are non-deterministic: they use an LLM to evaluate the output of another LLM.
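For example, a heuristic metric can be scored locally without any model calls. The snippet below is a minimal sketch, assuming the `Contains` metric is imported from `opik.evaluation.metrics` and exposes a `score(output, reference)` method returning a score result; check the Contains documentation for the exact signature.

```python
from opik.evaluation.metrics import Contains

# Heuristic metric: a deterministic substring check, no LLM involved
metric = Contains(case_sensitive=False)

result = metric.score(
    output="The capital of France is Paris.",
    reference="paris",
)
print(result.value)  # expected to be 1.0 when the substring is found (illustrative)
```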

Opik provides the following built-in evaluation metrics:

| Metric | Type | Description | Documentation |
| --- | --- | --- | --- |
| Equals | Heuristic | Checks if the output exactly matches an expected string | Equals |
| Contains | Heuristic | Checks if the output contains a specific substring; can be case sensitive or case insensitive | Contains |
| RegexMatch | Heuristic | Checks if the output matches a specified regular expression pattern | RegexMatch |
| IsJson | Heuristic | Checks if the output is a valid JSON object | IsJson |
| Levenshtein | Heuristic | Calculates the Levenshtein distance between the output and an expected string | Levenshtein |
| Hallucination | LLM as a Judge | Checks if the output contains any hallucinations | Hallucination |
| Moderation | LLM as a Judge | Checks if the output contains any harmful content | Moderation |
| AnswerRelevance | LLM as a Judge | Checks if the output is relevant to the question | AnswerRelevance |
| ContextRecall | LLM as a Judge | Checks how well the output covers the relevant information in the provided context | ContextRecall |
| ContextPrecision | LLM as a Judge | Checks how relevant and accurate the output is given the provided context | ContextPrecision |
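
The LLM as a Judge metrics follow the same `score(...)` pattern but call out to an LLM, so they require model credentials and may return different scores across runs. The sketch below assumes the `Hallucination` metric accepts `input`, `output`, and `context` arguments; see the Hallucination documentation for the exact parameters.

```python
from opik.evaluation.metrics import Hallucination

# LLM as a Judge metric: an LLM grades the output, so results are non-deterministic
metric = Hallucination()

result = metric.score(
    input="What is the capital of France?",
    output="The capital of France is Lyon.",
    context=["France is a country in Western Europe. Its capital is Paris."],
)
print(result.value, result.reason)  # e.g. a high hallucination score with an explanation
```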

You can also create your own custom metric; learn more about it in the Custom Metric section.
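
As a rough illustration, a custom metric is typically a class with a `score` method that returns a score result. The sketch below assumes the `base_metric.BaseMetric` and `score_result.ScoreResult` helpers from `opik.evaluation.metrics`, and the metric itself (a disclaimer check) is a hypothetical example; refer to the Custom Metric section for the exact interface.

```python
from opik.evaluation.metrics import base_metric, score_result

class ContainsDisclaimer(base_metric.BaseMetric):
    """Example custom heuristic metric: checks that the output includes a disclaimer."""

    def __init__(self, name: str = "contains_disclaimer"):
        self.name = name

    def score(self, output: str, **ignored_kwargs) -> score_result.ScoreResult:
        # Deterministic check on the output string (illustrative logic)
        has_disclaimer = "not financial advice" in output.lower()
        return score_result.ScoreResult(
            value=1.0 if has_disclaimer else 0.0,
            name=self.name,
            reason="Disclaimer found" if has_disclaimer else "Disclaimer missing",
        )
```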