evaluate_prompt¶
- opik.evaluation.evaluate_prompt(dataset: Dataset, messages: List[Dict[str, Any]], model: str | OpikBaseModel | None = None, scoring_metrics: List[BaseMetric] | None = None, experiment_name: str | None = None, project_name: str | None = None, experiment_config: Dict[str, Any] | None = None, verbose: int = 1, nb_samples: int | None = None, task_threads: int = 16, prompt: Prompt | None = None) → EvaluationResult¶
Performs prompt evaluation on a given dataset.
- Parameters:
dataset – An Opik dataset instance.
messages – A list of prompt messages to evaluate.
model – The name of the model to use for evaluation. Defaults to “gpt-3.5-turbo”.
scoring_metrics – List of metrics to calculate during evaluation. The LLM input and output will be passed as arguments to each metric score(…) method.
experiment_name – The name of the experiment.
project_name – The name of the project.
experiment_config – The configuration of the experiment.
task_threads – The number of thread workers used to run scoring metrics.
nb_samples – The number of dataset samples to evaluate.
verbose – An integer that controls the evaluation output logs, such as the results summary and the tqdm progress bar.
prompt – An optional Prompt object to link with the experiment.
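A minimal usage sketch is shown below. It assumes Opik has already been configured (for example via `opik configure`) and that an LLM provider key such as `OPENAI_API_KEY` is available in the environment; the dataset name `"demo-questions"` and the `{{question}}` template field are hypothetical placeholders, not part of this API.

```python
import opik
from opik.evaluation import evaluate_prompt
from opik.evaluation.metrics import Hallucination

# Fetch an existing dataset (the name "demo-questions" is a placeholder).
client = opik.Opik()
dataset = client.get_dataset(name="demo-questions")

# Each dataset item is expected to provide the fields referenced in the
# message templates (here, a hypothetical "question" field).
result = evaluate_prompt(
    dataset=dataset,
    messages=[
        {"role": "user", "content": "Answer concisely: {{question}}"},
    ],
    model="gpt-3.5-turbo",              # optional; this is the default model
    scoring_metrics=[Hallucination()],  # each metric scores the LLM input/output
    experiment_name="prompt-baseline",
    nb_samples=50,                      # limit the run to 50 dataset items
)
```

The call returns an EvaluationResult containing the per-item metric scores for the experiment.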