Usefulness
The usefulness metric allows you to evaluate how useful an LLM response is given an input. It uses an LLM as a judge to assess usefulness and returns a score between 0.0 and 1.0, where higher values indicate greater usefulness. Along with the score, it provides a detailed explanation of why that score was assigned.
How to use the Usefulness metric
You can use the Usefulness metric as follows:
Asynchronous scoring is also supported with the ascore scoring method.
Understanding the scores
The usefulness score ranges from 0.0 to 1.0:
- Scores closer to 1.0 indicate that the response is highly useful, directly addressing the input query with relevant and accurate information
- Scores closer to 0.0 indicate that the response is less useful, possibly being off-topic, incomplete, or not addressing the input query effectively
Each score comes with a detailed explanation (result.reason) that helps you understand why that particular score was assigned.
Usefulness Prompt
Opik uses an LLM as a judge to evaluate usefulness; a prompt template is used to generate the prompt sent to the judge. By default, the gpt-4o model is used to evaluate responses, but you can change this to any model supported by LiteLLM by setting the model parameter. You can learn more about customizing models in the Customize models for LLM as a Judge metrics section.
The template is as follows: