evaluate

opik.evaluation.evaluate(dataset: Dataset, task: Callable[[Dict[str, Any]], Dict[str, Any]], scoring_metrics: List[BaseMetric] | None = None, experiment_name: str | None = None, project_name: str | None = None, experiment_config: Dict[str, Any] | None = None, verbose: int = 1, nb_samples: int | None = None, task_threads: int = 16, prompt: Prompt | None = None, scoring_key_mapping: Dict[str, str | Callable[[Dict[str, Any]], Any]] | None = None) → EvaluationResult

Performs task evaluation on a given dataset.

Parameters:
  • dataset – An Opik dataset instance

  • task – A callable object that takes a dict with the dataset item content as input and returns a dict that will later be used for scoring.

  • experiment_name – The name of the experiment associated with the evaluation run. If None, a generated name will be used.

  • project_name – The name of the project. If not provided, traces and spans will be logged to the Default Project.

  • experiment_config – A dictionary of parameters that describe the experiment.

  • scoring_metrics – List of metrics to calculate during evaluation. Each metric has a score(…) method whose arguments are taken from the task output; check the signature of the score method of each metric you use to find out which keys are mandatory in the task-returned dictionary (a custom-metric sketch follows the parameter list). If no value is provided, the experiment won't have any scoring metrics.

  • verbose – An integer value that controls evaluation output logs such as the summary and the tqdm progress bar. 0 - no outputs, 1 - outputs are enabled (default).

  • nb_samples – Number of samples to evaluate. If no value is provided, all samples in the dataset will be evaluated.

  • task_threads – Number of thread workers to run tasks. If set to 1, no additional threads are created and all tasks are executed sequentially in the current thread. Use more than 1 worker if your task object is safe to share across threads.

  • prompt – A Prompt object to link with the experiment.

  • scoring_key_mapping – A dictionary that allows you to rename keys present in either the dataset item or the task output so that they match the keys expected by the scoring metrics. For example, if you have a dataset item with the content {"user_question": "What is Opik ?"} and a scoring metric that expects a key "input", you can use scoring_key_mapping={"input": "user_question"} to map the "user_question" key to "input" (see the mapping sketch below).
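
A minimal usage sketch, assuming an existing dataset named "my-dataset" and the built-in Equals metric, whose score(…) method is assumed here to take "output" and "reference" arguments; the task body is a placeholder and the experiment name, sample count, and thread count are illustrative:

    from typing import Any, Dict

    import opik
    from opik.evaluation import evaluate
    from opik.evaluation.metrics import Equals  # example built-in metric

    client = opik.Opik()
    dataset = client.get_dataset(name="my-dataset")  # assumes this dataset already exists

    def my_llm_task(dataset_item: Dict[str, Any]) -> Dict[str, Any]:
        # Replace the placeholder with a real model call. The returned keys must match
        # the arguments of the score(...) methods of the chosen metrics, or be remapped
        # via scoring_key_mapping.
        answer = "placeholder answer"
        return {"output": answer, "reference": dataset_item["expected_output"]}

    result = evaluate(
        dataset=dataset,
        task=my_llm_task,
        scoring_metrics=[Equals()],
        experiment_name="quickstart-experiment",
        nb_samples=10,   # score only the first 10 dataset items
        task_threads=4,  # run up to 4 task invocations concurrently
    )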
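
A sketch of a custom metric, illustrating how the score(…) signature determines the mandatory keys in the task-returned dictionary. The ContainsKeyword class is hypothetical, and it assumes the BaseMetric and ScoreResult classes exposed under opik.evaluation.metrics; verify the import paths against your installed version:

    from typing import Any

    from opik.evaluation.metrics import base_metric, score_result

    class ContainsKeyword(base_metric.BaseMetric):
        # Hypothetical metric: returns 1.0 when the task output contains a keyword.
        def __init__(self, keyword: str, name: str = "contains_keyword"):
            super().__init__(name=name)
            self.keyword = keyword

        def score(self, output: str, **ignored_kwargs: Any) -> score_result.ScoreResult:
            # Because "output" appears in this signature, the task must return an
            # "output" key (or one mapped to it via scoring_key_mapping).
            contains = self.keyword.lower() in output.lower()
            return score_result.ScoreResult(name=self.name, value=1.0 if contains else 0.0)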
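
And a sketch of scoring_key_mapping, assuming dataset items shaped like {"user_question": …, "expected_output": …} and a task variant that returns a "response" key; the callable form is assumed here to receive the dataset item content, as suggested by the Callable[[Dict[str, Any]], Any] hint:

    result = evaluate(
        dataset=dataset,
        task=my_llm_task,
        scoring_metrics=[Equals()],  # assumed to expect "output" and "reference" keys
        scoring_key_mapping={
            # Rename an existing key to the name the metric expects.
            "output": "response",
            # Or compute the value with a callable.
            "reference": lambda item: item["expected_output"],
        },
    )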