Overview
Opik provides a set of built-in evaluation metrics that you can mix and match to evaluate LLM behaviour. These metrics are broken down into two main categories:
- Heuristic metrics – deterministic checks that rely on rules, statistics, or classical NLP algorithms.
- LLM as a Judge metrics – checks that delegate scoring to an LLM so you can capture semantic, task-specific, or conversation-level quality signals.
Heuristic metrics are ideal when you need reproducible checks such as exact matching, regex validation, or similarity scores against a reference. LLM as a Judge metrics are useful when you want richer qualitative feedback (hallucination detection, helpfulness, summarisation quality, regulatory risk, etc.).
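To illustrate the difference, here is a minimal sketch that scores a single output with one metric from each family using the Python SDK. It assumes `Equals` and `Hallucination` are importable from `opik.evaluation.metrics` and that `score()` returns a result object exposing `value` and `reason`, as in recent SDK versions:

```python
from opik.evaluation.metrics import Equals, Hallucination

# Heuristic metric: deterministic, reproducible comparison against a reference string.
equals_metric = Equals()
heuristic_score = equals_metric.score(
    output="Paris is the capital of France.",
    reference="Paris is the capital of France.",
)
print(heuristic_score.value)  # 1.0 when the strings match exactly

# LLM as a Judge metric: delegates scoring to an LLM judge for richer feedback.
hallucination_metric = Hallucination()
judge_score = hallucination_metric.score(
    input="What is the capital of France?",
    output="The capital of France is Paris.",
    context=["France is a country in Europe. Its capital is Paris."],
)
print(judge_score.value, judge_score.reason)
```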
Built-in metrics
Heuristic metrics
Conversation heuristic metrics
LLM as a Judge metrics
Conversation LLM as a Judge metrics
Customizing LLM as a Judge metrics
By default, Opik uses GPT-5-nano from OpenAI as the LLM to evaluate the output of other LLMs. However, you can easily switch to another LLM provider by specifying a different model via the metric's model parameter.
For Python, this functionality is built on the LiteLLM framework. You can find a full list of supported LLM providers and how to configure them in the LiteLLM Providers guide.
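For example, here is a hedged sketch of pointing an LLM as a Judge metric at a different provider through LiteLLM. The model ID below is illustrative; pass any identifier LiteLLM accepts and make sure the matching API key is set in your environment:

```python
from opik.evaluation.metrics import Hallucination

# Any model string supported by LiteLLM can be passed via the `model` parameter.
# "anthropic/claude-3-5-sonnet-20241022" is an illustrative provider/model ID;
# substitute a model you have credentials for and export the corresponding
# API key (e.g. ANTHROPIC_API_KEY) before running.
metric = Hallucination(model="anthropic/claude-3-5-sonnet-20241022")

score = metric.score(
    input="What is the capital of France?",
    output="The capital of France is Lyon.",
    context=["France is a country in Europe. Its capital is Paris."],
)
print(score.value, score.reason)
```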
For TypeScript, the SDK integrates with the Vercel AI SDK. You can use model ID strings for simplicity or LanguageModel instances for advanced configuration. See the Models documentation for more details.