Manage datasets
Guides you through the process of creating and managing datasets
Datasets can be used to track the test cases you would like to evaluate your LLM on. Each dataset is made up of items, and each item is a dictionary with arbitrary key-value pairs. When getting started, we recommend giving each item an input field and an optional expected_output field. Datasets can be created from:
- Python SDK: You can use the Python SDK to create a dataset and add items to it.
- Traces table: You can add existing logged traces (from a production application for example) to a dataset.
- The Opik UI: You can manually create a dataset and add items to it.
Once a dataset has been created, you can run Experiments on it. Each Experiment will evaluate an LLM application based on the test cases in the dataset using an evaluation metric and report the results back to the dataset.
Creating a dataset using the SDK
You can create a dataset and log items to it using the get_or_create_dataset method:
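As a minimal sketch of this step, the snippet below wraps get_or_create_dataset in a small helper. The helper name open_dataset, the dataset name "My dataset", and the main function are illustrative rather than part of the SDK. The Opik client is imported inside main, which assumes the opik package is installed and already configured.

```python
def open_dataset(client, name="My dataset"):
    """Return the dataset called `name`, creating it first if needed."""
    # `client` is expected to be an opik.Opik instance.
    return client.get_or_create_dataset(name=name)


def main():
    # Imported here so the helper above can be loaded without Opik installed.
    from opik import Opik

    client = Opik()  # assumes the SDK is already configured for your workspace
    return open_dataset(client)
```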
If a dataset with the given name already exists, the existing dataset will be returned.
Insert items
Inserting dictionary items
You can insert items into a dataset using the insert method:
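The following sketch shows what a call to insert might look like. The items themselves are made-up test cases, and the helper name add_items is hypothetical; the dataset argument is assumed to be the object returned by get_or_create_dataset.

```python
# Hypothetical test cases; an item may contain any key-value pairs.
ITEMS = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "What is the capital of Italy?", "expected_output": "Rome"},
    {"input": "Summarize the plot of Hamlet in one sentence."},  # no expected_output
]


def add_items(dataset, items=ITEMS):
    """Insert a list of dictionary items and return how many were passed in."""
    dataset.insert(items)
    return len(items)
```

The third item omits expected_output to show that the field is optional.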
Opik automatically deduplicates items that are inserted into a dataset when using the Python SDK. This means that you can insert the same item multiple times without duplicating it in the dataset. Combined with the get_or_create_dataset method, this lets you use the SDK to manage your datasets in a “fire and forget” manner.
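One way to picture the “fire and forget” pattern is a single function that you call at the start of every run. This is only a sketch: sync_dataset is a hypothetical helper, and the deduplication is done by Opik, not by this code.

```python
def sync_dataset(client, name, items):
    """Make sure the dataset `name` exists and has been sent `items`.

    Intended to be safe to call on every run: get_or_create_dataset reuses an
    existing dataset, and Opik is relied on to skip items it already holds.
    """
    dataset = client.get_or_create_dataset(name=name)
    dataset.insert(items)
    return dataset
```

Because nothing here checks whether the dataset or the items already exist, the same call can sit in a setup script or at the top of an evaluation job.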
Once the items have been inserted, you can view them in the Opik UI.

Inserting items from a JSONL file
You can also insert items from a JSONL file:
The format of the JSONL file should be a JSON object per line. For example:
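To make the file format concrete, the sketch below writes a small JSONL file with the standard library and prints its lines. The rows are made up, and write_jsonl and load_jsonl are hypothetical helpers. The dataset method read_jsonl_from_file is an assumption recalled from the Python SDK, so check the SDK reference before relying on the name.

```python
import json
import os
import tempfile

# Hypothetical rows; each one becomes a single line in the JSONL file.
ROWS = [
    {"input": "What is the capital of France?", "expected_output": "Paris"},
    {"input": "What is the capital of Italy?", "expected_output": "Rome"},
]


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")


def load_jsonl(dataset, path):
    # Assumption: read_jsonl_from_file is the dataset method for JSONL input.
    dataset.read_jsonl_from_file(path)


jsonl_path = os.path.join(tempfile.mkdtemp(), "items.jsonl")
write_jsonl(jsonl_path, ROWS)

with open(jsonl_path, encoding="utf-8") as handle:
    jsonl_lines = handle.read().splitlines()

for line in jsonl_lines:
    print(line)
```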
Inserting items from a Pandas DataFrame
You can also insert items from a Pandas DataFrame:
The keys_mapping parameter maps the column names in the DataFrame to the keys in the dataset items. This can be useful if you want to rename columns before inserting them into the dataset:
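The renaming can be sketched as follows. The column names, KEYS_MAPPING, and the helpers are hypothetical. rename_keys only illustrates the idea on a single row and is not Opik's implementation. The dataset method insert_from_pandas is an assumption recalled from the Python SDK.

```python
# Maps DataFrame column names (left) to dataset item keys (right).
KEYS_MAPPING = {"Question": "input", "Expected output": "expected_output"}


def rename_keys(record, keys_mapping):
    """Rename the keys of one row; in this illustration unmapped keys are kept."""
    return {keys_mapping.get(column, column): value for column, value in record.items()}


def insert_dataframe(dataset, dataframe, keys_mapping=KEYS_MAPPING):
    # Assumption: insert_from_pandas is the dataset method for DataFrame input.
    dataset.insert_from_pandas(dataframe=dataframe, keys_mapping=keys_mapping)


def main(dataset):
    import pandas as pd  # imported lazily; only needed for the real call

    dataframe = pd.DataFrame(
        [{"Question": "What is the capital of France?", "Expected output": "Paris"}]
    )
    insert_dataframe(dataset, dataframe)
```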
Deleting items
You can delete items in a dataset by using the delete method:
You can also remove all the items in a dataset by using the clear method:
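Both operations can be sketched with two small wrappers. The helper names are hypothetical, and the items_ids keyword passed to delete is an assumption recalled from the Python SDK, so verify it against the SDK reference.

```python
def delete_items(dataset, item_ids):
    """Delete specific items; `item_ids` is an iterable of dataset item IDs."""
    # Assumption: the keyword is `items_ids`, as recalled from the Python SDK.
    dataset.delete(items_ids=list(item_ids))


def remove_all_items(dataset):
    """Remove every item from the dataset."""
    dataset.clear()
```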
Downloading a dataset from Opik
You can download a dataset from Opik using the get_dataset method:
Once the dataset has been retrieved, you can access its items using the to_pandas or to_json methods:
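A minimal sketch of the download flow, assuming client is an opik.Opik instance and that a dataset with the given name already exists. The helper names are hypothetical, and the return types noted in the comments reflect my understanding of the SDK rather than anything stated above.

```python
def download_dataset(client, name):
    """Fetch an existing dataset by name."""
    return client.get_dataset(name=name)


def export_items(dataset):
    """Return the dataset items in both available export formats."""
    frame = dataset.to_pandas()  # expected to be a pandas DataFrame
    as_json = dataset.to_json()  # expected to be a JSON string
    return frame, as_json
```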