evaluate

opik.evaluation.evaluate(dataset: Dataset, task: Callable[[DatasetItem], Dict[str, Any]], scoring_metrics: List[BaseMetric], experiment_name: str, experiment_config: Dict[str, Any] | None = None, verbose: int = 1, task_threads: int = 16) → EvaluationResult

Performs task evaluation on a given dataset.

Parameters:
  • dataset – An Opik dataset instance

  • task – A callable object that takes a DatasetItem as input and returns a dictionary which will later be used for scoring

  • experiment_name – The name of the experiment associated with the evaluation run

  • experiment_config – A dictionary of parameters that describe the experiment

  • scoring_metrics – List of metrics to calculate during evaluation. Each metric has a score(…) method whose arguments are taken from the task output. Check the signature of the score method of every metric you use to find out which keys are mandatory in the task-returned dictionary (see the usage sketch after this list).

  • verbose – An integer value that controls evaluation output logs such as the summary and the tqdm progress bar. 0 - no outputs, 1 - outputs are enabled (default).

  • task_threads – Number of thread workers used to run tasks. If set to 1, no additional threads are created and all tasks are executed sequentially in the current thread. Use more than 1 worker only if your task object can safely be shared across threads.
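
A minimal usage sketch follows. The dataset name, the item fields input and expected_output, and the placeholder model function are assumptions; adapt them to your own dataset and pipeline, and check the score(…) signature of the metrics you use in your installed version:

    from typing import Any, Dict

    from opik import Opik
    from opik.evaluation import evaluate
    from opik.evaluation.metrics import Equals

    client = Opik()
    dataset = client.get_dataset(name="my-dataset")  # assumed dataset name

    def my_model(text: str) -> str:
        # Placeholder for your actual model or pipeline call.
        return text.upper()

    def evaluation_task(item) -> Dict[str, Any]:
        # The returned keys must match the arguments of each metric's
        # score(...) method; "output" and "reference" are the keys assumed
        # here for the Equals metric.
        return {
            "output": my_model(item.input),        # "input" is an assumed item field
            "reference": item.expected_output,     # "expected_output" is an assumed item field
        }

    result = evaluate(
        dataset=dataset,
        task=evaluation_task,
        scoring_metrics=[Equals()],
        experiment_name="uppercase-baseline",
        experiment_config={"model": "uppercase-placeholder"},
        task_threads=4,
    )

Here task_threads=4 is an illustrative choice: the placeholder task holds no shared mutable state, so it can be run from several worker threads at once.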