evaluate_prompt

opik.evaluation.evaluate_prompt(dataset: Dataset, messages: List[Dict[str, Any]], model: str | OpikBaseModel | None = None, scoring_metrics: List[BaseMetric] | None = None, experiment_name: str | None = None, project_name: str | None = None, experiment_config: Dict[str, Any] | None = None, verbose: int = 1, nb_samples: int | None = None, task_threads: int = 16, prompt: Prompt | None = None) → EvaluationResult

Performs prompt evaluation on a given dataset: the prompt messages are sent to the model for each dataset item and the responses are scored with the provided metrics.

Parameters:
  • dataset – An Opik dataset instance.

  • messages – A list of prompt messages to evaluate.

  • model – The model to use for evaluation: either a model name or an OpikBaseModel instance. Defaults to "gpt-3.5-turbo" when not provided.

  • scoring_metrics – List of metrics to compute during evaluation. The LLM input and output are passed as arguments to each metric's score(…) method.

  • experiment_name – The name of the experiment.

  • project_name – The name of the project to log the evaluation traces to.

  • experiment_config – The dictionary of parameters that describe the experiment configuration.

  • task_threads – Number of thread workers used to run the evaluation tasks.

  • nb_samples – The number of dataset samples to evaluate.

  • verbose – An integer that controls evaluation output logs such as the results summary and the tqdm progress bar.

  • prompt – An Opik Prompt object to link with the experiment.
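Example: a minimal usage sketch, assuming Opik is already configured. The dataset name, dataset contents, metric choice, and experiment name are illustrative, not part of the API.

    import opik
    from opik.evaluation import evaluate_prompt
    from opik.evaluation.metrics import Hallucination

    # Create (or fetch) a dataset and add a few items to evaluate against.
    client = opik.Opik()
    dataset = client.get_or_create_dataset("prompt-eval-demo")
    dataset.insert([
        {"input": "What is the capital of France?"},
        {"input": "Summarize the plot of Hamlet in one sentence."},
    ])

    # {{input}} is substituted with the "input" field of each dataset item.
    result = evaluate_prompt(
        dataset=dataset,
        messages=[
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": "{{input}}"},
        ],
        model="gpt-3.5-turbo",             # omitted -> defaults to gpt-3.5-turbo
        scoring_metrics=[Hallucination()],
        experiment_name="prompt-eval-demo",
        nb_samples=2,                      # evaluate only the first samples
    )

The returned EvaluationResult can be inspected programmatically; with verbose=1 a results summary and a tqdm progress bar are also printed.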