Annotate Traces
Annotating traces is a crucial aspect of evaluating and improving your LLM-based applications. By systematically recording qualitative or quantitative feedback on specific interactions or entire conversation flows, you can:
- Track performance over time
- Identify areas for improvement
- Compare different model versions or prompts
- Gather data for fine-tuning or retraining
- Provide stakeholders with concrete metrics on system effectiveness
Opik allows you to annotate traces through the SDK or the UI.
Annotating Traces through the UI
To annotate a trace through the UI, navigate to it on the Traces page and click the Annotate
button. This opens a sidebar where you can add annotations to the trace.
You can annotate both traces and spans through the UI; make sure you have selected the correct span in the sidebar.
To ensure consistent feedback, you will need to define feedback definitions on the Feedback Definitions
page, which supports both numerical and categorical annotations.
Annotating traces and spans using the SDK
You can use the SDK to annotate traces and spans, which is useful both as part of the evaluation process and when you receive user feedback scores in your application.
Annotating Traces through the SDK
Feedback scores can be logged for traces using the log_traces_feedback_scores method:
from opik import Opik
client = Opik(project_name="my_project")
trace = client.trace(name="my_trace")
client.log_traces_feedback_scores(
    scores=[
        {"id": trace.id, "name": "overall_quality", "value": 0.85},
        {"id": trace.id, "name": "coherence", "value": 0.75},
    ]
)
The scores argument supports an optional reason field for each score, which can be used to provide a human-readable explanation for the feedback score.
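For example, a score with a reason attached might look like this (a minimal sketch reusing the client and trace from the snippet above):
from opik import Opik
client = Opik(project_name="my_project")
trace = client.trace(name="my_trace")
client.log_traces_feedback_scores(
    scores=[
        {
            "id": trace.id,
            "name": "overall_quality",
            "value": 0.4,
            # Optional human-readable explanation for the score
            "reason": "The answer was factually correct but missed part of the question.",
        },
    ]
)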
Annotating Spans through the SDK
To log feedback scores for individual spans, use the log_spans_feedback_scores method:
from opik import Opik
client = Opik()
trace = client.trace(name="my_trace")
span = trace.span(name="my_span")
client.log_spans_feedback_scores(
    scores=[
        {"id": span.id, "name": "overall_quality", "value": 0.85},
        {"id": span.id, "name": "coherence", "value": 0.75},
    ],
)
The FeedbackScoreDict class supports an optional reason field that can be used to provide a human-readable explanation for the feedback score.
Using Opik's built-in evaluation metrics
Computing feedback scores can be challenging because Large Language Models can return unstructured text and non-deterministic outputs. To help with computing these scores, Opik provides a set of built-in evaluation metrics.
Opik's built-in evaluation metrics are broken down into two main categories:
- Heuristic metrics
- LLM as a judge metrics
Heuristic Metrics
Heuristic metrics use rule-based or statistical methods to evaluate the output of an LLM.
Opik supports a variety of heuristic metrics including:
- EqualsMetric
- RegexMatchMetric
- ContainsMetric
- IsJsonMetric
- PerplexityMetric
- BleuMetric
- RougeMetric
You can find a full list of metrics in the Heuristic Metrics section.
These can be used by calling:
from opik.evaluation.metrics import Contains
metric = Contains()
score = metric.score(
output="The quick brown fox jumps over the lazy dog.",
expected_output="The quick brown fox jumps over the lazy dog."
)
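Each metric returns a score result that you can feed back into Opik as a trace annotation. Below is a minimal sketch combining the Contains example above with the log_traces_feedback_scores method; it assumes the returned score object exposes a value attribute, and the score name "contains_check" is just an illustrative choice:
from opik import Opik
from opik.evaluation.metrics import Contains

client = Opik()
trace = client.trace(name="my_trace")

# Compute a heuristic score, as in the example above
metric = Contains()
score = metric.score(
    output="The quick brown fox jumps over the lazy dog.",
    expected_output="quick brown fox"
)

# Log the computed value as a feedback score on the trace
# (assumes the score result exposes a `value` attribute)
client.log_traces_feedback_scores(
    scores=[
        {"id": trace.id, "name": "contains_check", "value": score.value},
    ]
)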
LLM as a Judge Metrics
For LLM outputs that cannot be evaluated using heuristic metrics, you can use LLM as a judge metrics. These metrics are based on the idea of using an LLM to evaluate the output of another LLM.
Opik supports many different LLM as a Judge metrics out of the box including:
- FactualityMetric
- ModerationMetric
- HallucinationMetric
- AnswerRelevanceMetric
- ContextRecallMetric
- ContextPrecisionMetric
You can find a full list of supported metrics in the Metrics Overview section.
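As a sketch, LLM as a judge metrics are called in the same way as heuristic metrics. The example below assumes a Hallucination metric importable from opik.evaluation.metrics that accepts input, output and context arguments, and that an LLM provider API key is configured, since the judge calls an LLM under the hood; check the Metrics Overview section for the exact class names and parameters:
from opik.evaluation.metrics import Hallucination

# Assumes an LLM provider key (e.g. OPENAI_API_KEY) is configured,
# since the judge metric calls an LLM to evaluate the output.
metric = Hallucination()
score = metric.score(
    input="What is the capital of France?",
    output="The capital of France is Berlin.",
    context=["France is a country in Europe. Its capital is Paris."],
)
# The resulting score object is assumed to expose `value` and `reason`
print(score.value, score.reason)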