evaluate_prompt

opik.evaluation.evaluate_prompt(dataset: Dataset, messages: List[Dict[str, Any]], model: str | OpikBaseModel | None = None, scoring_metrics: List[BaseMetric] | None = None, experiment_name: str | None = None, project_name: str | None = None, experiment_config: Dict[str, Any] | None = None, verbose: int = 1, nb_samples: int | None = None, task_threads: int = 16, prompt: Prompt | None = None) → EvaluationResult

Performs prompt evaluation on a given dataset: the prompt messages are sent to the model for each dataset item and the responses are scored with the provided metrics.

Parameters:
  • dataset – An Opik dataset instance.

  • messages – A list of prompt messages to evaluate.

  • model – The model to use for evaluation: either a model name or an OpikBaseModel instance. Defaults to "gpt-3.5-turbo" when not provided.

  • scoring_metrics – List of metrics to compute during evaluation. The LLM input and output are passed as arguments to each metric's score(…) method.

  • experiment_name – The name of the experiment.

  • project_name – The name of the project to log the evaluation traces to.

  • experiment_config – The dictionary of parameters that describe the experiment configuration.

  • task_threads – Number of thread workers used to run the evaluation tasks.

  • nb_samples – The number of dataset samples to evaluate.

  • verbose – An integer that controls evaluation output logs such as the results summary and the tqdm progress bar.

  • prompt – An Opik Prompt object to link with the experiment.
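Example: a minimal usage sketch, assuming Opik is already configured. The dataset name, dataset contents, metric choice, and experiment name are illustrative, not part of the API.

    import opik
    from opik.evaluation import evaluate_prompt
    from opik.evaluation.metrics import Hallucination

    # Create (or fetch) a dataset and add a few items to evaluate against.
    client = opik.Opik()
    dataset = client.get_or_create_dataset("prompt-eval-demo")
    dataset.insert([
        {"input": "What is the capital of France?"},
        {"input": "Summarize the plot of Hamlet in one sentence."},
    ])

    # {{input}} is substituted with the "input" field of each dataset item.
    result = evaluate_prompt(
        dataset=dataset,
        messages=[
            {"role": "system", "content": "You are a concise assistant."},
            {"role": "user", "content": "{{input}}"},
        ],
        model="gpt-3.5-turbo",             # omitted -> defaults to gpt-3.5-turbo
        scoring_metrics=[Hallucination()],
        experiment_name="prompt-eval-demo",
        nb_samples=2,                      # evaluate only the first samples
    )

The returned EvaluationResult can be inspected programmatically; with verbose=1 a results summary and a tqdm progress bar are also printed.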