evaluate_experiment¶
- opik.evaluation.evaluate_experiment(experiment_name: str, scoring_metrics: List[BaseMetric], scoring_threads: int = 16, verbose: int = 1, scoring_key_mapping: Dict[str, str | Callable[[Dict[str, Any]], Any]] | None = None) → EvaluationResult¶
Update an existing experiment with new evaluation metrics.
- Parameters:
experiment_name – The name of the experiment to update.
scoring_metrics – List of metrics to calculate during evaluation. Each metric has a score(…) method whose arguments are taken from the task output; check the signature of the score method of each metric you use to find out which keys are mandatory in the task-returned dictionary.
scoring_threads – Number of thread workers used to run the scoring metrics.
verbose – An integer value that controls evaluation output logs such as the summary and the tqdm progress bar.
scoring_key_mapping – A dictionary that allows you to rename keys present in either the dataset item or the task output so that they match the keys expected by the scoring metrics. For example, if you have a dataset item with the content {"user_question": "What is Opik ?"} and a scoring metric that expects a key "input", you can use scoring_key_mapping={"input": "user_question"} to map the "user_question" key to "input". The mapping value can also be a callable that receives the full item dictionary and returns the value to use.
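Example (a minimal sketch; the experiment name "my-experiment" and the choice of the Hallucination metric are illustrative assumptions, any BaseMetric subclass can be passed):

    from opik.evaluation import evaluate_experiment
    from opik.evaluation.metrics import Hallucination  # assumed metric choice for illustration

    # Re-score an existing experiment with an additional metric.
    result = evaluate_experiment(
        experiment_name="my-experiment",          # placeholder: name of the experiment to update
        scoring_metrics=[Hallucination()],
        scoring_threads=8,                        # optional, defaults to 16
        scoring_key_mapping={
            # map the dataset key "user_question" to the "input" key expected by the metric
            "input": "user_question",
            # a callable can also derive the value from the whole item dictionary, e.g.:
            # "output": lambda item: item["model_response"],
        },
    )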