Overview

Opik provides a set of built-in evaluation metrics that you can mix and match to evaluate LLM behaviour. These metrics are broken down into two main categories:

  1. Heuristic metrics – deterministic checks that rely on rules, statistics, or classical NLP algorithms.
  2. LLM as a Judge metrics – delegate scoring to an LLM so you can capture semantic, task-specific, or conversation-level quality signals.

Heuristic metrics are ideal when you need reproducible checks such as exact matching, regex validation, or similarity scores against a reference. LLM as a Judge metrics are useful when you want richer qualitative feedback (hallucination detection, helpfulness, summarisation quality, regulatory risk, etc.).
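
For example, exact matching and regex validation take only a couple of lines. The sketch below is illustrative rather than canonical: it assumes the `Equals` and `RegexMatch` classes listed later on this page accept the keyword arguments shown, so double-check the individual metric pages for the exact signatures.

```python
# Minimal sketch of two heuristic checks. The score() keyword arguments
# shown here are assumptions; see each metric's documentation for details.
from opik.evaluation.metrics import Equals, RegexMatch

# Exact-match check against a reference string
equals_metric = Equals()
equals_result = equals_metric.score(output="Paris", reference="Paris")
print(equals_result.value)  # 1.0 when the strings match exactly

# Regex validation, e.g. requiring an ISO-formatted date in the output
regex_metric = RegexMatch(regex=r"\d{4}-\d{2}-\d{2}")
regex_result = regex_metric.score(output="The report is due on 2024-06-30.")
print(regex_result.value)
```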

Built-in metrics

Heuristic metrics

| Metric | Description | Documentation |
| --- | --- | --- |
| BERTScore | Contextual embedding similarity score | BERTScore |
| ChrF | Character n-gram F-score (chrF / chrF++) | ChrF |
| Contains | Checks whether the output contains a specific substring | Contains |
| Corpus BLEU | Computes corpus-level BLEU across multiple outputs | CorpusBLEU |
| Equals | Checks if the output exactly matches an expected string | Equals |
| GLEU | Estimates grammatical fluency for candidate sentences | GLEU |
| IsJson | Validates that the output can be parsed as JSON | IsJson |
| JSDivergence | Jensen–Shannon similarity between token distributions | JSDivergence |
| JSDistance | Raw Jensen–Shannon divergence | JSDistance |
| KLDivergence | Kullback–Leibler divergence with smoothing | KLDivergence |
| Language Adherence | Verifies that the output is written in the expected language | Language Adherence |
| Levenshtein | Calculates the normalized Levenshtein distance between output and reference | Levenshtein |
| Readability | Reports Flesch Reading Ease and Flesch–Kincaid grade level | Readability |
| RegexMatch | Checks if the output matches a specified regular expression pattern | RegexMatch |
| ROUGE | Calculates ROUGE variants (rouge1/2/L/Lsum/W) | ROUGE |
| Sentence BLEU | Computes a BLEU score for a single output against one or more references | SentenceBLEU |
| Sentiment | Scores sentiment using VADER | Sentiment |
| Spearman Ranking | Spearman's rank correlation | Spearman Ranking |
| Tone | Flags tone issues such as shouting or negativity | Tone |
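
Most of the reference-based metrics above follow the same pattern: instantiate the metric, then call `score()` with the model output and a reference string. A minimal sketch is shown below; it assumes the Levenshtein metric is exported as `LevenshteinRatio` and that `ROUGE` accepts plain `output`/`reference` keyword arguments, so confirm the exact class names and parameters in each metric's documentation.

```python
# Illustrative sketch of reference-based heuristic metrics. The class name
# LevenshteinRatio and the keyword arguments are assumptions; check each
# metric's documentation page for the exact API.
from opik.evaluation.metrics import ROUGE, LevenshteinRatio

levenshtein = LevenshteinRatio()
result = levenshtein.score(
    output="The capital of France is Paris.",
    reference="Paris is the capital of France.",
)
print(result.value)

rouge = ROUGE()
result = rouge.score(
    output="The cat sat on the mat.",
    reference="A cat was sitting on the mat.",
)
print(result.value)
```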

Conversation heuristic metrics

| Metric | Description | Documentation |
| --- | --- | --- |
| DegenerationC | Detects repetition and degeneration patterns over a conversation | DegenerationC |
| Knowledge Retention | Checks whether the last assistant reply preserves user facts from earlier turns | Knowledge Retention |

LLM as a Judge metrics

| Metric | Description | Documentation |
| --- | --- | --- |
| Agent Task Completion Judge | Checks whether an agent fulfilled its assigned task | Agent Task Completion |
| Agent Tool Correctness Judge | Evaluates whether an agent used tools correctly | Agent Tool Correctness |
| Answer Relevance | Checks whether the answer stays on-topic with the question | Answer Relevance |
| Compliance Risk Judge | Identifies non-compliant or high-risk statements | Compliance Risk |
| Context Precision | Ensures the answer only uses relevant context | Context Precision |
| Context Recall | Measures how well the answer recalls supporting context | Context Recall |
| Dialogue Helpfulness Judge | Evaluates how helpful an assistant reply is in a dialogue | Dialogue Helpfulness |
| G-Eval | Task-agnostic judge configurable with custom instructions | G-Eval |
| Hallucination | Detects unsupported or hallucinated claims using an LLM judge | Hallucination |
| LLM Juries Judge | Averages scores from multiple judge metrics for ensemble scoring | LLM Juries |
| Moderation | Flags safety or policy violations in assistant responses | Moderation |
| Prompt Uncertainty Judge | Detects ambiguity in prompts that may confuse LLMs | Prompt Diagnostics |
| QA Relevance Judge | Determines whether an answer directly addresses the user question | QA Relevance |
| Structured Output Compliance | Checks JSON or schema adherence for structured responses | Structured Output |
| Summarization Coherence Judge | Rates the structure and coherence of a summary | Summarization Coherence |
| Summarization Consistency Judge | Checks if a summary stays faithful to the source | Summarization Consistency |
| Trajectory Accuracy | Scores how closely agent trajectories follow expected steps | Trajectory Accuracy |
| Usefulness | Rates how useful the answer is to the user | Usefulness |
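
Most of these judges work out of the box, while G-Eval is configured with your own task description and evaluation criteria. The snippet below is a hedged sketch: it assumes `GEval` is importable from `opik.evaluation.metrics` and accepts `task_introduction` and `evaluation_criteria` arguments, which you should confirm against the G-Eval documentation page.

```python
# Hedged sketch of a custom G-Eval judge. The task_introduction and
# evaluation_criteria parameter names are assumptions; verify them in the
# G-Eval documentation before relying on this example.
from opik.evaluation.metrics import GEval

metric = GEval(
    task_introduction="You are an expert judge evaluating customer support replies.",
    evaluation_criteria="The reply must be polite, factually accurate, and answer the user's question.",
)

result = metric.score(
    output="The user asked how to reset a password and the agent listed the steps politely.",
)
print(result.value, result.reason)
```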

Conversation LLM as a Judge metrics

| Metric | Description | Documentation |
| --- | --- | --- |
| Conversational Coherence | Evaluates coherence across sliding windows of a dialogue | Conversational Coherence |
| Session Completeness Quality | Checks whether user goals were satisfied during the session | Session Completeness |
| User Frustration | Estimates the likelihood a user was frustrated | User Frustration |
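
Conversation-level judges score an entire dialogue rather than a single output. The sketch below assumes the Conversational Coherence metric is exported as `ConversationalCoherenceMetric` and that `score()` accepts a `conversation` list of role/content turns; confirm the class name and payload shape in the metric's documentation.

```python
# Hedged sketch of a conversation-level judge. The class name and the
# conversation=[{"role": ..., "content": ...}] payload are assumptions;
# see the Conversational Coherence documentation for specifics.
from opik.evaluation.metrics import ConversationalCoherenceMetric

metric = ConversationalCoherenceMetric()

result = metric.score(
    conversation=[
        {"role": "user", "content": "Can you help me plan a weekend trip to Lisbon?"},
        {"role": "assistant", "content": "Of course! How many days will you be there?"},
        {"role": "user", "content": "Two days, and I love food markets."},
        {"role": "assistant", "content": "Great, start day one at the Time Out Market..."},
    ]
)
print(result.value, result.reason)
```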

Customizing LLM as a Judge metrics

By default, Opik uses GPT-5-nano from OpenAI as the LLM that judges the output of other LLMs. However, you can easily switch to another LLM provider by setting the `model` parameter to a different value.

```python
from opik.evaluation.metrics import Hallucination

metric = Hallucination(model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0")

metric.score(
    input="What is the capital of France?",
    output="The capital of France is Paris. It is famous for its iconic Eiffel Tower and rich cultural heritage.",
)
```

For Python, this functionality is based on the LiteLLM framework. You can find a full list of supported LLM providers and how to configure them in the LiteLLM Providers guide.
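
For example, switching the judge to a Gemini model only requires the LiteLLM provider prefix and the matching API key in your environment. The model string and environment variable below follow LiteLLM's conventions but are illustrative; use whichever provider entry from the LiteLLM Providers guide fits your setup.

```python
import os

from opik.evaluation.metrics import Hallucination

# LiteLLM reads provider credentials from environment variables; for
# Google AI Studio models the expected variable is GEMINI_API_KEY.
os.environ["GEMINI_API_KEY"] = "<your-api-key>"

# "provider/model" strings follow LiteLLM's naming convention. The model
# chosen here is illustrative - pick any entry from the LiteLLM Providers guide.
metric = Hallucination(model="gemini/gemini-1.5-flash")

result = metric.score(
    input="What is the capital of France?",
    output="The capital of France is Lyon.",
)
print(result.value, result.reason)
```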

For TypeScript, the SDK integrates with the Vercel AI SDK. You can use model ID strings for simplicity or LanguageModel instances for advanced configuration. See the Models documentation for more details.