Custom Metric
Opik allows you to define your own custom metrics, which is especially important when the metrics you need are not already available out of the box.
When to Create Custom Metrics
It is especially relevant to define your own metrics when:
- You have domain-specific goals
- Standard metrics don’t capture the nuance you need
- You want to align with business KPIs
- You’re experimenting with new evaluation approaches
If you want to write an LLM-as-a-Judge metric, you can either use the G-Eval metric or create your own from scratch.
Writing your own custom metrics
To define a custom metric, you need to subclass the `BaseMetric` class and implement the `score` method and an optional `ascore` method.
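A minimal sketch might look like the following; the metric name and the hard-coded score are placeholders for your own scoring logic:

```python
from typing import Any

from opik.evaluation.metrics import base_metric, score_result


class MyCustomMetric(base_metric.BaseMetric):
    def __init__(self, name: str = "my_custom_metric"):
        self.name = name

    def score(self, input: str, output: str, **ignored_kwargs: Any) -> score_result.ScoreResult:
        # Add your scoring logic here
        return score_result.ScoreResult(
            value=0.5,
            name=self.name,
            reason="Optional explanation of the score",
        )
```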
The `score` method should return a `ScoreResult` object. The `ascore` method is optional and can be used to compute the score asynchronously if needed.
You can also return a list of `ScoreResult` objects as part of your custom metric. This is useful if you want to return multiple scores for a given input and output pair.
Now you can use the custom metric to score LLM outputs:
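(The question and answer below are placeholder strings; any keyword arguments accepted by your `score` method can be passed in.)

```python
metric = MyCustomMetric()

score = metric.score(
    input="What is the capital of France?",
    output="The capital of France is Paris.",
)
print(score.value, score.reason)
```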
Also, this metric can now be used in the `evaluate` function as explained here: Evaluating LLMs.
Example: Creating a metric with an OpenAI model
You can implement your own custom metric by creating a class that subclasses the `BaseMetric` class and implements the `score` method.
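As an illustration, the sketch below uses the OpenAI Python client as an LLM judge that rates how relevant an answer is to a question; the prompt, the model name, and the expected JSON fields are illustrative choices rather than a fixed convention:

```python
import json
from typing import Any

from openai import OpenAI
from opik.evaluation.metrics import base_metric, score_result


class LLMJudgeMetric(base_metric.BaseMetric):
    def __init__(self, name: str = "llm_judge_metric", model_name: str = "gpt-4o"):
        self.name = name
        self.model_name = model_name
        self.client = OpenAI()
        self.prompt_template = """
        You are an impartial judge. Given a question and an answer, rate how relevant
        the answer is to the question on a scale from 0.0 to 1.0.

        Answer with a JSON object of the form {{"score": <float>, "reason": "<string>"}}.

        Question: {input}
        Answer: {output}
        """

    def score(self, input: str, output: str, **ignored_kwargs: Any) -> score_result.ScoreResult:
        prompt = self.prompt_template.format(input=input, output=output)
        response = self.client.chat.completions.create(
            model=self.model_name,
            messages=[{"role": "user", "content": prompt}],
        )
        # Parse the judge's JSON answer into a score and a reason
        judge_output = json.loads(response.choices[0].message.content)
        return score_result.ScoreResult(
            name=self.name,
            value=float(judge_output["score"]),
            reason=judge_output["reason"],
        )
```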
You can then use this metric to score your LLM outputs:
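(As before, the question and answer are placeholder strings.)

```python
metric = LLMJudgeMetric()

score = metric.score(
    input="What is the capital of France?",
    output="The capital of France is Paris.",
)
print(score.value, score.reason)
```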
In this example, we used the OpenAI Python client to call the LLM. You don’t have to use the OpenAI Python client; you can update the code example above to use any LLM client you have access to.
Example: Adding support for many LLM providers
To support a wide range of LLM providers, we recommend using the `litellm` library to call your LLM. This allows you to support hundreds of models without having to maintain a custom LLM client.
Opik provides a `LiteLLMChatModel` class that wraps the `litellm` library and can be used in your custom metric.
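The sketch below mirrors the OpenAI example but routes the call through this wrapper; the class and method names (`models.LiteLLMChatModel`, `generate_string`) reflect the current SDK and may differ slightly between versions, and the prompt is again a placeholder:

```python
import json
from typing import Any

from opik.evaluation import models
from opik.evaluation.metrics import base_metric, score_result


class LLMJudgeMetric(base_metric.BaseMetric):
    def __init__(self, name: str = "llm_judge_metric", model_name: str = "gpt-4o"):
        self.name = name
        # LiteLLM model identifiers follow the "<provider>/<model>" convention
        self.llm_client = models.LiteLLMChatModel(model_name=model_name)
        self.prompt_template = """
        You are an impartial judge. Given a question and an answer, rate how relevant
        the answer is to the question on a scale from 0.0 to 1.0.

        Answer with a JSON object of the form {{"score": <float>, "reason": "<string>"}}.

        Question: {input}
        Answer: {output}
        """

    def score(self, input: str, output: str, **ignored_kwargs: Any) -> score_result.ScoreResult:
        prompt = self.prompt_template.format(input=input, output=output)
        model_output = self.llm_client.generate_string(input=prompt)
        judge_output = json.loads(model_output)
        return score_result.ScoreResult(
            name=self.name,
            value=float(judge_output["score"]),
            reason=judge_output["reason"],
        )
```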
You can then use this metric to score your LLM outputs:
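(Because LiteLLM handles the provider routing, switching providers is just a matter of changing the model name; the identifiers below are examples.)

```python
# Works with any model identifier supported by LiteLLM,
# e.g. "gpt-4o" or "anthropic/claude-3-5-sonnet-20241022"
metric = LLMJudgeMetric(model_name="gpt-4o")

score = metric.score(
    input="What is the capital of France?",
    output="The capital of France is Paris.",
)
print(score.value, score.reason)
```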
Example: Creating a metric with multiple scores
You can implement a metric that returns multiple scores, which will display as separate columns in the UI when using it in an evaluation.
To do so, set up your `score` method to return a list of `ScoreResult` objects.
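The two heuristic scores below are purely illustrative; each `ScoreResult` in the returned list shows up as its own column in the Opik UI:

```python
from typing import Any, List

from opik.evaluation.metrics import base_metric, score_result


class MultiScoreMetric(base_metric.BaseMetric):
    def __init__(self, name: str = "multi_score_metric"):
        self.name = name

    def score(self, input: str, output: str, **ignored_kwargs: Any) -> List[score_result.ScoreResult]:
        return [
            score_result.ScoreResult(
                name="answer_not_empty",
                value=1.0 if output.strip() else 0.0,
                reason="Whether the model returned a non-empty answer",
            ),
            score_result.ScoreResult(
                name="answer_length",
                value=min(len(output) / 1000, 1.0),
                reason="Answer length, normalized to a 0-1 range",
            ),
        ]
```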
Example: Enforcing structured outputs
In the examples above, we ask the LLM to respond with a JSON object. However, since this is not enforced, the LLM may return a non-structured response. To avoid this, you can use the `litellm` library to enforce a structured output, which makes the custom metric more robust and less prone to failure.
For this, we define the format of the response we expect from the LLM in the `LLMJudgeBinaryResult` class and pass it to the LiteLLM client.
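The sketch below calls `litellm.completion` directly and passes the Pydantic class as `response_format`, which is enforced by providers that support structured outputs; the prompt and model name are placeholders:

```python
from typing import Any

import litellm
import pydantic
from opik.evaluation.metrics import base_metric, score_result


class LLMJudgeBinaryResult(pydantic.BaseModel):
    # The judge must answer with exactly these two fields
    score: int
    reason: str


class LLMJudgeBinaryMetric(base_metric.BaseMetric):
    def __init__(self, name: str = "llm_judge_binary_metric", model_name: str = "gpt-4o"):
        self.name = name
        self.model_name = model_name
        self.prompt_template = """
        You are an impartial judge. Return a score of 1 if the answer addresses
        the question and 0 otherwise, together with a short reason.

        Question: {input}
        Answer: {output}
        """

    def score(self, input: str, output: str, **ignored_kwargs: Any) -> score_result.ScoreResult:
        prompt = self.prompt_template.format(input=input, output=output)
        # Passing the Pydantic class as response_format asks LiteLLM to enforce
        # the structured output schema on providers that support it
        response = litellm.completion(
            model=self.model_name,
            messages=[{"role": "user", "content": prompt}],
            response_format=LLMJudgeBinaryResult,
        )
        judge_output = LLMJudgeBinaryResult.model_validate_json(
            response.choices[0].message.content
        )
        return score_result.ScoreResult(
            name=self.name,
            value=float(judge_output.score),
            reason=judge_output.reason,
        )
```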
Similarly to the previous example, you can then use this metric to score your LLM outputs:
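(Again with placeholder strings; an off-topic answer should be scored 0 by the judge.)

```python
metric = LLMJudgeBinaryMetric()

score = metric.score(
    input="What is the capital of France?",
    output="I like turtles.",
)
print(score.value, score.reason)
```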
Creating a custom metric using G-Eval
G-Eval allows you to specify a set of criteria for your metric; it then uses a Chain of Thought prompting technique to generate evaluation steps and return a score. You can read more about this advanced metric here.
To use G-Eval, you will need to specify a task introduction and evaluation criteria:
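(The task introduction and criteria below are illustrative; adapt them to your use case.)

```python
from opik.evaluation.metrics import GEval

metric = GEval(
    task_introduction=(
        "You are an expert judge tasked with evaluating the faithfulness of an "
        "AI-generated answer to the given context."
    ),
    evaluation_criteria=(
        "The OUTPUT must not introduce new information beyond what is provided "
        "in the CONTEXT."
    ),
)

score = metric.score(
    output="""
    OUTPUT: Paris is the capital of France.
    CONTEXT: France is a country in Western Europe. Its capital is Paris.
    """
)
print(score.value, score.reason)
```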
Custom Conversation Metrics
For evaluating multi-turn conversations and dialogue systems, you’ll need specialized conversation metrics. These metrics evaluate entire conversation threads rather than single input-output pairs.
Learn how to create custom conversation metrics in the Custom Conversation Metrics guide.
What’s next
Creating custom metrics is just the beginning of building a comprehensive evaluation system for your LLM applications. In this guide, you’ve learned how to create custom metrics using different approaches, from simple metrics to sophisticated LLM-as-a-judge implementations, including specialized conversation thread metrics for multi-turn dialogue evaluation.
From here, you might want to:
- Evaluate your LLM application following the Evaluate your LLM application guide
- Evaluate conversation threads using the Evaluate Threads guide
- Explore built-in metrics in the Metrics overview