Evaluate your agent
Evaluating your LLM application gives you confidence in its performance. In this guide, we will walk through the process of evaluating complex applications like LLM chains or agents.
This guide focuses on evaluating complex LLM applications. If you are looking to evaluate single prompts, refer to the Evaluate A Prompt guide.
The evaluation is done in five steps:
- Add tracing to your LLM application
- Define the evaluation task
- Choose the Dataset that you would like to evaluate your application on
- Choose the metrics that you would like to evaluate your application with
- Create and run the evaluation experiment
Running an offline evaluation
1. (Optional) Add tracking to your LLM application
While not required, we recommend adding tracking to your LLM application. This allows you to have
full visibility into each evaluation run. In the example below we will use a combination of the
track decorator and the track_openai function to trace the LLM application.
```python
from opik import track
from opik.integrations.openai import track_openai
import openai

openai_client = track_openai(openai.OpenAI())

# This method is the LLM application that you want to evaluate
# Typically this is not updated when creating evaluations
@track
def your_llm_application(input: str) -> str:
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": input}],
    )

    return response.choices[0].message.content
```
Here we have added the track decorator so that this trace and all its nested
steps are logged to the platform for further analysis.
2. Define the evaluation task
Once you have added instrumentation to your LLM application, we can define the evaluation task. The evaluation task takes a dataset item as input and must return a dictionary whose keys match the parameters expected by the metrics you are using. In this example, we can define the evaluation task as follows:
```typescript
import { EvaluationTask } from "opik";
import { OpenAI } from "openai";

// Define dataset item type
type DatasetItem = {
  input: string;
  expected: string;
};

const llmTask: EvaluationTask<DatasetItem> = async (datasetItem) => {
  const { input } = datasetItem;

  const openai = new OpenAI();
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: "You are a coding assistant" },
      { role: "user", content: input },
    ],
  });

  return { output: response.choices[0].message.content };
};
```
If the returned dictionary does not match the parameters expected by the metrics, you will get inconsistent evaluation results.
3. Choose the evaluation Dataset
In order to create an evaluation experiment, you will need to have a Dataset that includes all your test cases.
If you have already created a Dataset, you can use the getOrCreateDataset method to fetch it.
```typescript
import { Opik } from "opik";

const client = new Opik();
const dataset = await client.getOrCreateDataset<DatasetItem>("Example dataset");

// Opik deduplicates items that are inserted into a dataset, so we can insert them
// multiple times
await dataset.insert([
  {
    input: "Hello, world!",
    expected: "Hello, world!",
  },
  {
    input: "What is the capital of France?",
    expected: "Paris",
  },
]);
```
4. Choose evaluation metrics
Opik provides a set of built-in evaluation metrics that you can choose from. These are broken down into two main categories:
- Heuristic metrics: These metrics are deterministic in nature, for example equals or contains
- LLM-as-a-judge: These metrics use an LLM to judge the quality of the output; typically these are used for detecting hallucinations or context relevance
In the same evaluation experiment, you can use multiple metrics to evaluate your application:
```typescript
import { ExactMatch } from "opik";

const exact_match_metric = new ExactMatch();
```
Each metric expects the data in a certain format. You will need to ensure that the task you have defined in step 2 returns the data in the correct format.
5. Run the evaluation
Now that we have the task we want to evaluate, the dataset to evaluate on, and the metrics we want to evaluate with, we can run the evaluation:
```typescript
import { EvaluationTask, Opik, ExactMatch, evaluate } from "opik";
import { OpenAI } from "openai";

// Define dataset item type
type DatasetItem = {
  input: string;
  expected: string;
};

// Define the evaluation task
const llmTask: EvaluationTask<DatasetItem> = async (datasetItem) => {
  const { input } = datasetItem;

  const openai = new OpenAI();
  const response = await openai.chat.completions.create({
    model: "gpt-4o",
    messages: [
      { role: "system", content: "You are a coding assistant" },
      { role: "user", content: input },
    ],
  });

  return { output: response.choices[0].message.content };
};

// Get or create the dataset - items are automatically deduplicated
const client = new Opik();
const dataset = await client.getOrCreateDataset<DatasetItem>("Example dataset");
await dataset.insert([
  {
    input: "Hello, world!",
    expected: "Hello, world!",
  },
  {
    input: "What is the capital of France?",
    expected: "Paris",
  },
]);

// Define the metric
const exact_match_metric = new ExactMatch();

// Run the evaluation
const result = await evaluate({
  dataset,
  task: llmTask,
  scoringMetrics: [exact_match_metric],
  experimentName: "Example Evaluation",
});
console.log(`Experiment ID: ${result.experimentId}`);
console.log(`Experiment Name: ${result.experimentName}`);
console.log(`Total test cases: ${result.testResults.length}`);
```
You can use the experiment_config parameter to store information about your
evaluation task. Typically we see teams store information about the prompt
template, the model, and the model parameters used to evaluate the application.
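For example, with the Python SDK you could record this metadata through experiment_config. This is a minimal sketch: the dataset, task, and metric are placeholders for the ones you defined earlier, and the config keys and values shown are arbitrary examples.

```python
# experiment_config stores free-form metadata alongside the experiment
evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[hallucination_metric],
    experiment_name="Example Evaluation",
    experiment_config={
        "prompt_template": "Answer the user's question: {input}",  # example value
        "model": "gpt-4o",
        "temperature": 0.2,
    },
)
```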
6. Analyze the evaluation results
Once the evaluation is complete, you will get a link to the Opik UI where you can analyze the evaluation results. In addition to being able to deep dive into each test case, you will also be able to compare multiple experiments side by side.

Advanced usage
Missing arguments for scoring methods
If you encounter the opik.exceptions.ScoreMethodMissingArguments exception, it means that the dataset
item and task output dictionaries do not contain all the arguments expected by the scoring method.
The way the evaluate function works is by merging the dataset item and task output dictionaries and
then passing the result to the scoring method. For example, if the dataset item contains the keys
user_question and context while the evaluation task returns a dictionary with the key output,
the scoring method will be called as scoring_method.score(user_question='...', context='...', output='...').
This can be an issue if the scoring method expects a different set of arguments.
You can solve this by either updating the dataset item or evaluation task to return the missing
arguments or by using the scoring_key_mapping parameter of the evaluate function. In the example
above, if the scoring method expects input as an argument, you can map the user_question key to
the input key as follows:
```typescript
const evaluation = await evaluate({
  dataset,
  task: evaluation_task,
  scoringMetrics: [hallucination_metric],
  scoringKeyMapping: { input: "user_question" },
});
```
Linking prompts to experiments
The Opik prompt library can be used to version your prompt templates.
When creating an Experiment, you can link the Experiment to a specific prompt version:
```typescript
import { Opik, evaluate, evaluatePrompt } from 'opik';
import { Hallucination } from 'opik';

const client = new Opik();

// Create a prompt
const prompt = await client.createPrompt({
  name: "My prompt",
  prompt: "Translate to French: {{input}}",
});

// Link prompt to evaluation experiment
await evaluatePrompt({
  dataset: myDataset,
  messages: [
    { role: "user", content: "Translate to French: {{input}}" },
  ],
  model: "gpt-4o",
  scoringMetrics: [new Hallucination()],
  prompts: [prompt],
});
```
The experiment will now be linked to the prompt allowing you to view all experiments that use a specific prompt:

Logging traces to a specific project
You can use the projectName parameter of the evaluate function to log evaluation traces to a specific project:
```typescript
const evaluation = await evaluate({
  dataset,
  task: evaluation_task,
  scoringMetrics: [hallucination_metric],
  projectName: "hallucination-detection",
});
```
Evaluating a subset of the dataset
You can use the nbSamples parameter to specify the number of samples to use for the evaluation. This is useful if you only want to evaluate a subset of the dataset.
```typescript
const evaluation = await evaluate({
  dataset,
  task: evaluation_task,
  scoringMetrics: [hallucination_metric],
  nbSamples: 10,
});
```
Sampling the dataset for evaluation
You can use the dataset_sampler parameter to specify the dataset sampler instance used to sample the dataset.
This is useful if you want to sample the dataset differently from the default strategy (accept all items).
For example, you can use the RandomDatasetSampler to sample the dataset randomly:
```python
from opik.evaluation import samplers

evaluation = evaluate(
    experiment_name="My experiment",
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[hallucination_metric],
    dataset_sampler=samplers.RandomDatasetSampler(max_samples=10),
)
```
In the example above, the evaluation will sample 10 random items from the dataset.
You can also implement your own dataset sampler by extending BaseDatasetSampler and overriding its sample method.
```python
import re
from typing import List

from opik.api_objects.dataset import dataset_item
from opik.evaluation import samplers


class MyDatasetSampler(samplers.BaseDatasetSampler):
    def __init__(self, filter_string: str, field_name: str) -> None:
        self.filter_regex = re.compile(filter_string)
        self.field_name = field_name

    def sample(self, dataset: List[dataset_item.DatasetItem]) -> List[dataset_item.DatasetItem]:
        # Sample items from the dataset that match the filter string in the 'field_name' field
        return [item for item in dataset if self.filter_regex.search(item[self.field_name])]


# Example usage
evaluation = evaluate(
    experiment_name="My experiment",
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[hallucination_metric],
    dataset_sampler=MyDatasetSampler(filter_string="\\.*SUCCESS\\.*", field_name="output"),
)
```
Implementing your own dataset sampler is useful if you want to implement a custom sampling strategy. For instance, you can implement a dataset sampler that samples the dataset using some filtering criteria as in the example above.
Analyzing the evaluation results
The evaluate function returns an EvaluationResult object that contains the evaluation results.
You can create aggregated statistics for each metric by calling its aggregate_evaluation_scores method:
```python
evaluation = evaluate(
    experiment_name="My experiment",
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[hallucination_metric],
)

# Retrieve and print the aggregated score statistics (mean, min, max, std) per metric
scores = evaluation.aggregate_evaluation_scores()
for metric_name, statistics in scores.aggregated_scores.items():
    print(f"{metric_name}: {statistics}")
```
Aggregated statistics can help analyze evaluation results and are useful for comparing the performance of different models or different versions of the same model, for example.
Python SDK
Using async evaluation tasks
The evaluate function does not support async evaluation tasks; if you pass
an async task, you will get an error similar to:
```text
Input should be a valid dictionary [type=dict_type, input_value='<coroutine object kyc_qu...ng_task at 0x3336d0a40>', input_type=str]
```
Since it might not always be possible to remove the async logic from your LLM application,
we recommend using asyncio.run within the evaluation task:
```python
import asyncio


async def your_llm_application(input: str) -> str:
    return "Hello, World"


def evaluation_task(x):
    # your_llm_application here is an async function
    result = asyncio.run(your_llm_application(x["input"]))
    return {
        "output": result,
    }
```
This should solve the issue and allow you to run the evaluation.
If you are running in a Jupyter notebook, you will need to add the following line to the top of your notebook:
```python
import nest_asyncio

nest_asyncio.apply()
```
Otherwise, you might get the error RuntimeError: asyncio.run() cannot be called from a running event loop.
The evaluate function uses multi-threading under the hood to speed up the evaluation run. Using both
asyncio and multi-threading can lead to unexpected behavior and hard-to-debug errors.
If you run into any issues, you can disable the multi-threading in the SDK by setting task_threads to 1:
```python
evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[hallucination_metric],
    task_threads=1,
)
```
Disabling threading
To evaluate datasets more efficiently, Opik uses multiple background threads. If this is causing issues, you can disable them by setting task_threads and scoring_threads to 1, which will cause Opik to run all calculations in the main thread.
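For example, a minimal sketch reusing the task and metric from the earlier examples:

```python
# Run both the evaluation task and the metric scoring in the main thread
evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[hallucination_metric],
    task_threads=1,
    scoring_threads=1,
)
```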
Passing additional arguments to evaluation_task
Sometimes your evaluation task needs extra context besides the dataset item (commonly referred to as x). For example, you may want to pass a model name, a system prompt, or a pre-initialized client.
Since evaluate calls the task as task(x) for each dataset item, the recommended pattern is to create a wrapper (or use functools.partial) that closes over any additional arguments.
Using a wrapper function:
```python
# Extra dependencies you want to provide to the task
MODEL = "gpt-4o"
IMAGE_TYPE = "thumbnail"


def evaluation_task(x, model, image_type, client, prompt):
    full_response = client.get_answer(
        x["question"],
        x["image_paths"][image_type],
        prompt.format(),
        model=model,
    )
    response = full_response["response"]
    return {
        "response": response,
        "bbox": full_response.get("bounding_boxes"),
        "image_url": full_response.get("image_url"),
    }


def make_task(model, image_type, client, prompt):
    # Return a unary function that evaluate() can call as task(x)
    def _task(x):
        return evaluation_task(x, model, image_type, client, prompt)

    return _task


task = make_task(MODEL, IMAGE_TYPE, bot, system_prompt)

evaluation = evaluate(
    dataset=dataset,
    task=task,  # evaluate will call task(x) for each item
    scoring_metrics=[levenshteinratio_metric],
    scoring_key_mapping={
        "input": "question",
        "output": "response",
        "reference": "expected_answer",
    },
)
```
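Equivalently, functools.partial from the standard library produces the same kind of unary callable. Here is a sketch using the same hypothetical evaluation_task, bot, and system_prompt as above:

```python
import functools

# Pre-bind the extra arguments; evaluate() will supply only the dataset item `x`
task = functools.partial(
    evaluation_task,
    model=MODEL,
    image_type=IMAGE_TYPE,
    client=bot,
    prompt=system_prompt,
)

evaluation = evaluate(
    dataset=dataset,
    task=task,
    scoring_metrics=[levenshteinratio_metric],
    scoring_key_mapping={
        "input": "question",
        "output": "response",
        "reference": "expected_answer",
    },
)
```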
Using task span evaluation metrics
Opik supports advanced evaluation metrics that can analyze the detailed execution information of your LLM tasks. These metrics receive a task_span parameter containing structured data about the task execution, including input, output, metadata, and nested operations.
Task span metrics are particularly useful for evaluating:
- The internal structure and behavior of your LLM applications
- Performance characteristics like execution patterns
- Quality of intermediate steps in complex workflows
- Cost and usage optimization opportunities
- Agent trajectory
Creating task span metrics
To create a task span evaluation metric, define a metric class that accepts a task_span parameter in its score method. The task_span parameter is a SpanModel object that contains detailed information about the task execution:
```python
from typing import Any, Optional

from opik.evaluation.metrics import BaseMetric, score_result
from opik.message_processing.emulation.models import SpanModel


class ExecutionTimeMetric(BaseMetric):
    def score(self, task_span: SpanModel, **ignored_kwargs: Any) -> score_result.ScoreResult:
        # Calculate execution duration
        if task_span.start_time and task_span.end_time:
            duration = (task_span.end_time - task_span.start_time).total_seconds()

            # Score based on execution speed
            if duration < 1.0:
                score = 1.0
                reason = f"Fast execution: {duration:.2f}s"
            elif duration < 5.0:
                score = 0.8
                reason = f"Acceptable execution time: {duration:.2f}s"
            else:
                score = 0.5
                reason = f"Slow execution: {duration:.2f}s"
        else:
            score = 0.0
            reason = "Cannot determine execution time"

        return score_result.ScoreResult(
            value=score,
            name=self.name,
            reason=reason
        )
```
Using task span metrics in evaluation
Task span metrics work alongside regular evaluation metrics and are automatically detected by the evaluation engine:
```python
from opik import evaluate
from opik.evaluation.metrics import Equals

# Create both regular and task span metrics
equals_metric = Equals()
timing_metric = ExecutionTimeMetric()

# Run evaluation with mixed metric types
evaluation = evaluate(
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[
        equals_metric,  # Regular metric
        timing_metric,  # Task span metric
    ],
    experiment_name="Comprehensive Evaluation",
)
```
When you use task span metrics, Opik automatically enables span collection and
analysis. You don’t need to configure anything special - the system will
detect metrics with task_span parameters and handle them appropriately.
Accessing span hierarchy
Task spans can contain nested spans representing sub-operations. You can analyze the complete execution hierarchy.
Here’s an example of a tracked function that produces nested spans:
```python
from opik import track
from opik.integrations.openai import track_openai
import openai

openai_client = track_openai(openai.OpenAI())


@track
def research_topic(topic: str) -> str:
    """Main research function that creates nested spans."""

    # This will create a nested span for gathering context
    context = gather_context(topic)

    # This will create another nested span for analysis
    analysis = analyze_information(context, topic)

    # Final span for generating summary
    summary = generate_summary(analysis, topic)

    return summary


@track
def gather_context(topic: str) -> str:
    """Gather background context - creates its own span."""
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Provide background context about: {topic}"
        }]
    )
    return response.choices[0].message.content


@track
def analyze_information(context: str, topic: str) -> str:
    """Analyze the gathered information - creates its own span."""
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Analyze this context about {topic}: {context}"
        }]
    )
    return response.choices[0].message.content


@track
def generate_summary(analysis: str, topic: str) -> str:
    """Generate final summary - creates its own span."""
    response = openai_client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Create a summary for {topic} based on: {analysis}"
        }]
    )
    return response.choices[0].message.content
```
When you call research_topic("artificial intelligence"), Opik will create a hierarchy of spans:
```text
SpanModel(id='0199f2c5-4097-7139-8e20-ce93d10ca3b0',
          start_time=datetime.datetime(2025, 10, 17, 15, 23, 57, 462154, tzinfo=TzInfo(UTC)),
          name='research_topic',
          input={'topic': 'artificial intelligence'},
          output={'output': 'In summary, artificial intelligence is a field in computer science that ...'},
          tags=None,
          metadata=None,
          type='general',
          usage=None,
          end_time=datetime.datetime(2025, 10, 17, 15, 24, 5, 196086, tzinfo=TzInfo(UTC)),
          project_name='Default Project',
          spans=[SpanModel(id='0199f2c5-4098-7c21-a23e-c361eb71b9de',
                           name='gather_context',
                           input={'topic': 'artificial intelligence'},
                           output={'output': 'Artificial intelligence (AI) is a branch of computer science that ...'},
                           type='general',
                           usage=None,
                           spans=[SpanModel(id='0199f2c5-4099-7bef-994a-36d67f95b652',
                                            name='chat_completion_create',
                                            input={'messages': [{'content': 'Provide background context about: artificial intelligence',
                                                                 'role': 'user'}]},
                                            output={'choices': [...]},
                                            tags=['openai'],
                                            type='llm',
                                            usage={'completion_tokens': 212, 'prompt_tokens': 14, 'total_tokens': 226, ...},
                                            model='gpt-3.5-turbo-0125',
                                            provider='openai',
                                            spans=[],
                                            ...)],
                           ...),
                 SpanModel(id='0199f2c5-4a97-75b4-8067-293062038a45',
                           name='analyze_information',
                           input={'context': 'Artificial intelligence (AI) is a branch of computer science that ...',
                                  'topic': 'artificial intelligence'},
                           output={'output': 'Artificial intelligence, as described in the context, is a field within ...'},
                           type='general',
                           usage=None,
                           spans=[SpanModel(id='0199f2c5-4a98-72b5-a152-fdbfacbc6785',
                                            name='chat_completion_create',
                                            type='llm',
                                            usage={'completion_tokens': 215, 'prompt_tokens': 226, 'total_tokens': 441, ...},
                                            model='gpt-3.5-turbo-0125',
                                            provider='openai',
                                            spans=[],
                                            ...)],
                           ...),
                 SpanModel(id='0199f2c5-53bb-7110-8832-51d9fa92285d',
                           name='generate_summary',
                           input={'analysis': 'Artificial intelligence, as described in the context, is a field within ...',
                                  'topic': 'artificial intelligence'},
                           output={'output': 'In summary, artificial intelligence is a field in computer science that ...'},
                           type='general',
                           usage=None,
                           spans=[SpanModel(id='0199f2c5-53bc-7609-889b-b8b1e6f8e3ca',
                                            name='chat_completion_create',
                                            type='llm',
                                            usage={'completion_tokens': 133, 'prompt_tokens': 230, 'total_tokens': 363, ...},
                                            model='gpt-3.5-turbo-0125',
                                            provider='openai',
                                            spans=[],
                                            ...)],
                           ...)],
          feedback_scores=[],
          model=None,
          provider=None,
          error_info=None,
          total_cost=None,
          last_updated_at=datetime.datetime(2025, 10, 17, 15, 24, 5, 196101, tzinfo=TzInfo(UTC)))
```
(Long LLM output strings and some repeated fields are elided with `...` above for readability.)
You can then analyze this complete execution hierarchy using task span metrics:
```python
class HierarchyAnalysisMetric(BaseMetric):
    def _analyze_hierarchy_recursively(self, span: SpanModel, hierarchy_stats: dict = None) -> dict:
        """Recursively analyze span hierarchy across the entire span tree."""
        if hierarchy_stats is None:
            hierarchy_stats = {
                'total_spans': 0,
                'llm_spans': 0,
                'tool_spans': 0,
                'other_spans': 0,
                'max_depth': 0,
                'current_depth': 0,
                'llm_span_names': [],
                'tool_span_names': []
            }

        # Count current span
        hierarchy_stats['total_spans'] += 1
        hierarchy_stats['max_depth'] = max(hierarchy_stats['max_depth'], hierarchy_stats['current_depth'])

        # Categorize span types
        if span.type == "llm":
            hierarchy_stats['llm_spans'] += 1
            hierarchy_stats['llm_span_names'].append(span.name)
        elif span.type == "tool":
            hierarchy_stats['tool_spans'] += 1
            hierarchy_stats['tool_span_names'].append(span.name)
        else:
            hierarchy_stats['other_spans'] += 1

        # Recursively analyze nested spans with depth tracking
        for nested_span in span.spans:
            hierarchy_stats['current_depth'] += 1
            self._analyze_hierarchy_recursively(nested_span, hierarchy_stats)
            hierarchy_stats['current_depth'] -= 1

        return hierarchy_stats

    def score(self, task_span: SpanModel, **ignored_kwargs: Any) -> score_result.ScoreResult:
        # Analyze hierarchy across the entire span tree
        # Only for illustrative purposes.
        # Please adjust for your specific use case!
        hierarchy_stats = self._analyze_hierarchy_recursively(task_span)

        total_operations = hierarchy_stats['total_spans']
        llm_operations = hierarchy_stats['llm_spans']
        tool_operations = hierarchy_stats['tool_spans']
        max_depth = hierarchy_stats['max_depth']

        # Analyze the complexity and structure of the operation
        if llm_operations > 5:
            # Many LLM calls might indicate inefficient processing
            if tool_operations == 0:
                score = 0.4
                reason = f"Over-complex operation: {llm_operations} LLM calls with no tool usage (depth: {max_depth})"
            else:
                score = 0.6
                reason = f"Complex operation: {llm_operations} LLM calls, {tool_operations} tool calls (depth: {max_depth})"
        elif llm_operations == 0:
            # No reasoning might indicate a purely mechanical process
            score = 0.3 if tool_operations > 0 else 0.1
            reason = f"No reasoning detected: {tool_operations} tool calls only" if tool_operations > 0 else "No LLM or tool operations detected"
        else:
            # Balanced approach with reasonable LLM usage
            balance_ratio = min(llm_operations, tool_operations) / max(llm_operations, tool_operations) if tool_operations > 0 else 0.8
            depth_bonus = 1.0 if max_depth <= 3 else max(0.8, 1.0 - (max_depth - 3) * 0.05)

            score = min(1.0, 0.7 + balance_ratio * 0.2 + depth_bonus * 0.1)

            if tool_operations > 0:
                reason = f"Well-structured operation: {llm_operations} LLM calls, {tool_operations} tool calls across {total_operations} spans (depth: {max_depth})"
            else:
                reason = f"Reasoning-focused operation: {llm_operations} LLM calls across {total_operations} spans (depth: {max_depth})"

        return score_result.ScoreResult(
            value=score,
            name=self.name,
            reason=reason
        )
```
For the SpanModel hierarchy shown above, the HierarchyAnalysisMetric score will be:
```text
Score: 0.96, Reason: Reasoning-focused operation: 3 LLM calls across 7 spans (depth: 2)
```
Best practices for task span metrics
- Focus on execution patterns: Use task span metrics to evaluate how your application executes, not just the final output
- Combine with regular metrics: Mix task span metrics with traditional output-based metrics for comprehensive evaluation
- Analyze performance: Leverage timing, cost, and usage information for optimization insights
- Handle missing data gracefully: Always check for None values in optional span attributes
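As an illustration of handling missing data gracefully, here is a minimal sketch of a hypothetical TokenUsageMetric (not part of Opik) that guards against None values and assumes span.usage is a plain dictionary, as in the example output above:

```python
from typing import Any

from opik.evaluation.metrics import BaseMetric, score_result
from opik.message_processing.emulation.models import SpanModel


class TokenUsageMetric(BaseMetric):
    """Hypothetical example metric: scores lower as total token usage grows."""

    def score(self, task_span: SpanModel, **ignored_kwargs: Any) -> score_result.ScoreResult:
        total_tokens = 0

        def collect(span: SpanModel) -> None:
            nonlocal total_tokens
            # `usage` is optional on a span, so guard against None before reading it
            if span.usage is not None:
                total_tokens += span.usage.get("total_tokens", 0) or 0
            for nested in span.spans:
                collect(nested)

        collect(task_span)

        if total_tokens == 0:
            return score_result.ScoreResult(
                value=0.0, name=self.name, reason="No token usage recorded on any span"
            )

        # Simple illustrative scoring: fewer tokens -> higher score
        score = 1.0 if total_tokens < 500 else 0.5
        return score_result.ScoreResult(
            value=score,
            name=self.name,
            reason=f"Total tokens used: {total_tokens}",
        )
```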
Task span metrics have access to detailed execution information including inputs, outputs, and metadata. Be mindful of sensitive data and ensure your metrics handle this information appropriately.
Accessing logged experiments
You can access all the experiments logged to the platform from the SDK with the
getExperimentsByName method:
```typescript
import { Opik } from "opik";

const client = new Opik({
  apiKey: "your-api-key",
  apiUrl: "https://www.comet.com/opik/api",
  projectName: "your-project-name",
  workspaceName: "your-workspace-name",
});
const experiments = await client.getExperimentsByName("My experiment");

// Access the first experiment content
const items = await experiments[0].getItems();
console.log(items);
```