Overview
A high-level overview of how to use Opik’s evaluation features, including some code snippets
Evaluation in Opik helps you assess and measure the quality of your LLM outputs across different dimensions. It provides a framework to systematically test your prompts and models against datasets, using various metrics to measure performance.
Opik also provides a set of pre-built metrics for common evaluation tasks. These metrics are designed to help you quickly and effectively gauge the performance of your LLM outputs and include metrics such as Hallucination, Answer Relevance, Context Precision/Recall and more. You can learn more about the available metrics in the Metrics Overview section.
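As a quick illustration, a pre-built metric can also be scored directly on a single input/output pair. The snippet below is a minimal sketch using the Hallucination metric from the Python SDK; the example values are placeholders and the judge model used by the metric requires an LLM provider API key to be configured.

```python
from opik.evaluation.metrics import Hallucination

# Pre-built LLM-as-a-Judge metric; the default judge model requires an
# LLM provider API key (e.g. OPENAI_API_KEY) to be configured
metric = Hallucination()

# Score a single input/output pair (the values below are placeholders)
score = metric.score(
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    context=["France is a country in Western Europe. Its capital is Paris."],
)
print(score.value, score.reason)
```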
If you are interested in evaluating your LLM application in production, please refer to the Online evaluation guide. Online evaluation rules allow you to define LLM-as-a-Judge metrics that will automatically score all, or a subset, of your production traces.
Running an Evaluation
Each evaluation is defined by a dataset, an evaluation task and a set of evaluation metrics:
- Dataset: A dataset is a collection of samples that represent the inputs and, optionally, expected outputs for your LLM application (a dataset-creation sketch follows this list).
- Evaluation task: This maps the inputs stored in the dataset to the output you would like to score. The evaluation task is typically a prompt template or the LLM application you are building.
- Metrics: The metrics you would like to use when scoring the outputs of your LLM application.
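The snippet below is a minimal sketch of creating and populating a dataset with the Opik Python SDK; the dataset name and field names (input, expected_output) are arbitrary examples.

```python
import opik

# Connect to Opik and create (or fetch) a dataset to evaluate against
client = opik.Opik()
dataset = client.get_or_create_dataset(name="my_eval_dataset")

# Each item stores the inputs and, optionally, the expected outputs;
# the field names used here are arbitrary examples
dataset.insert([
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "What is the capital of Japan?", "expected_output": "Tokyo"},
])
```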
To simplify the evaluation process, Opik provides two main evaluation methods: evaluate_prompt for evaluating prompt templates and a more general evaluate method for more complex evaluation scenarios.
Evaluating Prompts
To evaluate a specific prompt against a dataset:
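The snippet below is a minimal sketch using the Python SDK’s evaluate_prompt helper; the dataset name, model name, and prompt template are placeholders, and parameter details may vary slightly across SDK versions.

```python
import opik
from opik.evaluation import evaluate_prompt
from opik.evaluation.metrics import Hallucination

client = opik.Opik()
dataset = client.get_or_create_dataset(name="my_eval_dataset")

# Render the prompt template for each dataset item ({{input}} is filled in
# from the item), call the model, and score every output
evaluate_prompt(
    dataset=dataset,
    messages=[
        {"role": "user", "content": "Answer the following question: {{input}}"},
    ],
    model="gpt-4o-mini",  # placeholder model name
    scoring_metrics=[Hallucination()],
)
```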
Evaluating RAG applications and Agents
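For RAG applications and agents, the more general evaluate method takes an evaluation task: a function that maps each dataset item to the output you want to score. The sketch below assumes a placeholder my_rag_pipeline function standing in for your own application; the experiment name and metric choices are examples.

```python
import opik
from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination, AnswerRelevance

client = opik.Opik()
dataset = client.get_or_create_dataset(name="my_eval_dataset")

def my_rag_pipeline(question: str) -> dict:
    # Placeholder standing in for your own retrieval + generation logic
    context = ["France is a country in Western Europe. Its capital is Paris."]
    answer = "The capital of France is Paris."
    return {"answer": answer, "context": context}

def evaluation_task(dataset_item: dict) -> dict:
    # Map a dataset item to the fields the scoring metrics expect
    result = my_rag_pipeline(dataset_item["input"])
    return {
        "input": dataset_item["input"],
        "output": result["answer"],
        "context": result["context"],
    }

evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[Hallucination(), AnswerRelevance()],
    experiment_name="rag_evaluation",  # optional experiment name
)
```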
Using the Playground
If you prefer not to write any code, you can also run a prompt against a dataset directly from the Opik UI using the Playground and review the resulting outputs there.
Analyzing Evaluation Results
Once the evaluation is complete, Opik allows you to manually review the results and compare them with previous iterations.

In the experiment pages, you will be able to:
- Review the output provided by the LLM for each sample in the dataset
- Deep dive into each sample by clicking on the item ID
- Review the experiment configuration to know how the experiment was run
- Compare multiple experiments side by side
Learn more
You can learn more about Opik’s evaluation features in: