G-Eval

G-Eval is a task-agnostic LLM-as-a-judge metric that allows you to specify a task description and evaluation criteria. The judge model first drafts step-by-step evaluation instructions from your description and criteria, then scores the output; Opik returns the result as a score between 0.0 and 1.0. You can learn more about G-Eval in the original paper.

To use G-Eval, supply two pieces of information:

  1. A task introduction describing what should be evaluated.
  2. Evaluation criteria outlining what “good” looks like.

The judge responds with an integer between 0 and 10. Opik divides that value by 10 so callers receive a score between 0.0 and 1.0. We recommend packaging the full scenario (prompt, context, answer, etc.) inside a single string and passing it via the output argument; any other keyword arguments are ignored by the metric interface.

from opik.evaluation.metrics import GEval

metric = GEval(
    task_introduction="You are an expert judge tasked with evaluating the faithfulness of an AI-generated answer to the given context.",
    evaluation_criteria="In the provided text the OUTPUT must not introduce new information beyond what's provided in the CONTEXT.",
)

payload = """INPUT: What is the capital of France?
CONTEXT: France is a country in Western Europe. Its capital is Paris, which is known for landmarks like the Eiffel Tower.
OUTPUT: Paris is the capital of France.
"""

metric.score(output=payload)
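
The call returns a score object; as the preset examples below show, its value attribute holds the normalised score and its reason attribute holds the judge's explanation. A minimal follow-up reusing the metric and payload above:

score = metric.score(output=payload)
print(score.value)   # normalised score in the [0.0, 1.0] range
print(score.reason)  # the judge's explanation for the score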

How it works

G-Eval first expands your task description into a step-by-step Chain of Thought (CoT). This CoT becomes the rubric the judge will follow when scoring the provided answer. The model then evaluates the answer, returning a score in the 0–10 range, which Opik normalises to 0–1.

By default, the gpt-5-nano model is used, but you can change this to any model supported by LiteLLM via the model parameter. Learn more in the custom model guide.

To make the metric more robust, Opik requests the top 20 log probabilities from the LLM and computes a weighted average of the scores, as recommended by the original paper. The evaluator always returns an integer between 0 and 10; Opik divides that value by 10 before exposing it so callers see numbers in the [0, 1] range. Newer models in the GPT-5 family and other providers may not expose log probabilities, so scores can vary when switching models.
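
To illustrate the idea, here is a sketch of the weighted-average calculation; it is not Opik's internal code, and the log probabilities are made up for illustration:

import math

# Hypothetical log probabilities for the candidate score tokens returned by the
# judge model (illustrative values only, not real output and not Opik's internals).
token_logprobs = {"6": -2.2, "7": -0.3, "8": -1.5}

# Convert log probabilities to weights and take the probability-weighted mean.
weights = {int(token): math.exp(logprob) for token, logprob in token_logprobs.items()}
total = sum(weights.values())
raw_score = sum(value * weight for value, weight in weights.items()) / total  # weighted 0-10 score

normalised = raw_score / 10  # the value Opik exposes, in [0, 1]
print(round(normalised, 3))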

Built-in G-Eval judges

Opik ships opinionated presets for common evaluation needs. Each class inherits from GEval and exposes the same constructor parameters (model, track, temperature, etc.).
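
For example, a minimal sketch of constructing a preset with these shared parameters; the argument values, and the comment on track, are illustrative assumptions rather than recommended defaults:

from opik.evaluation.metrics import ComplianceRiskJudge

metric = ComplianceRiskJudge(
    model="gpt-4o-mini",  # any LiteLLM-supported model identifier
    track=False,          # assumption: disables logging results to the Opik platform
    temperature=0.0,      # lower temperature for more repeatable judgements
)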

Compliance Risk Judge

Flags statements that may be non-factual, non-compliant, or risky (e.g. finance, healthcare, legal). This judge is useful when you need an automated review step before customer-facing responses are sent, or when auditing historical conversations for policy breaches.

Compliance example
from opik.evaluation.metrics import ComplianceRiskJudge

metric = ComplianceRiskJudge(model="gpt-4o-mini")

payload = """INPUT: Customer asked about wire-transfer reversal policies.
OUTPUT: Just reverse it whenever the customer asks.
"""

score = metric.score(output=payload)
print(score.value, score.reason)

Inspect score.reason for granular rationales and route risky cases accordingly. The raw 0–10 judgement is divided by 10 in the returned value.
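
A minimal routing sketch building on the example above; both the 0.5 threshold and the assumption that a higher score means higher risk are illustrative and should be validated on your own data:

# Assumption: a higher normalised score indicates higher compliance risk.
RISK_THRESHOLD = 0.5  # illustrative threshold, tune on your own data

if score.value >= RISK_THRESHOLD:
    print("Escalate for manual compliance review:", score.reason)
else:
    print("Response can be released automatically")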

Prompt Uncertainty Judge

PromptUncertaintyJudge estimates how ambiguous a user prompt is before it reaches your model. Run it on raw user messages to prioritise agent hand-offs or to warn users when the request is ill-posed.

Prompt uncertainty
from opik.evaluation.metrics import PromptUncertaintyJudge

prompt = "Summarise the attached 400-page contract in one sentence and guarantee there are no mistakes."

uncertainty = PromptUncertaintyJudge().score(output=prompt)
print(uncertainty.value)

Use the score to highlight prompts that may confuse downstream models; the judge emits an integer from 0 (best) to 10 (worst) before normalisation.
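
A minimal triage sketch, assuming the single-string output interface described at the top of this page and an illustrative 0.6 threshold:

from opik.evaluation.metrics import PromptUncertaintyJudge

judge = PromptUncertaintyJudge()
prompts = [
    "Summarise the attached contract in one sentence and guarantee there are no mistakes.",
    "List the three parties named in the attached contract.",
]

# Higher values mean a more ambiguous prompt (0 best, 10 worst before normalisation).
UNCERTAINTY_THRESHOLD = 0.6  # illustrative threshold, calibrate on your own prompts
flagged = [p for p in prompts if judge.score(output=p).value >= UNCERTAINTY_THRESHOLD]
print(flagged)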

Summarization Consistency Judge

Checks whether a generated summary is faithful to the source material. This is the right choice when a downstream workflow consumes summaries and you need to enforce factual alignment with the source document.

Summary faithfulness
from opik.evaluation.metrics import SummarizationConsistencyJudge

metric = SummarizationConsistencyJudge(model="gpt-4o")

payload = """CONTEXT: ...long article text...
SUMMARY: The article confirms new safety protocols but misstates the deadline.
"""

score = metric.score(output=payload)
print(score.value, score.reason)

Pair this metric with alerts or automated rollbacks when the score drops below a threshold; the evaluator still returns raw integers in 0–10 before Opik scales them.
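
A minimal gating sketch building on the example above; the 0.7 threshold, the exception-based rollback, and the assumption that higher means more faithful are illustrative:

# Assumption: a higher normalised score indicates a more faithful summary.
FAITHFULNESS_THRESHOLD = 0.7  # illustrative threshold, tune for your workflow

if score.value < FAITHFULNESS_THRESHOLD:
    # Replace with your own alerting or rollback mechanism.
    raise RuntimeError(f"Summary failed faithfulness check: {score.reason}")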

Summarization Coherence Judge

Scores the structure, clarity, and organisation of a summary. Use it when you optimise for human readability or want to catch summaries that are factually right but poorly written.

Summary coherence
from opik.evaluation.metrics import SummarizationCoherenceJudge

metric = SummarizationCoherenceJudge()

score = metric.score(output="""SUMMARY: First... Secondly... Finally...""")
print(score.value, score.reason)

High scores correlate with summaries that maintain logical ordering and concise transitions between ideas. A perfect 10 becomes 1.0 after Opik normalisation.

Dialogue Helpfulness Judge

Examines how helpful an assistant reply is in the context of the preceding dialogue. Helpful for agent tuning or support chat routing where you want to surface conversations that require escalation.

Dialogue helpfulness
from opik.evaluation.metrics import DialogueHelpfulnessJudge

transcript = """USER: How do I reset my password?
ASSISTANT: Visit settings and click reset.
USER: I cannot see that option.
ASSISTANT: Please contact support.
"""

score = DialogueHelpfulnessJudge().score(output=transcript)
print(score.value, score.reason)

Low scores typically indicate the assistant ignored prior context or refused to offer actionable steps. The normalised value originates from an integer between 0 and 10.

QA Relevance Judge

Determines whether an answer directly addresses the user’s question. Ideal for dataset regression tests where each sample has a clear question/answer pair.

QA relevance
from opik.evaluation.metrics import QARelevanceJudge

metric = QARelevanceJudge()

payload = """QUESTION: What causes rainbows?
ANSWER: The capital of France is Paris.
"""

score = metric.score(output=payload)
print(score.value, score.reason)

Combine with hallucination metrics to distinguish totally off-topic answers from confident but wrong responses; the judge still works on a 0–10 scale internally.
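
A minimal sketch pairing the two judges on the same payload, assuming the Hallucination metric accepts the same single-string output interface used elsewhere on this page:

from opik.evaluation.metrics import Hallucination, QARelevanceJudge

payload = """QUESTION: What causes rainbows?
ANSWER: The capital of France is Paris.
"""

relevance = QARelevanceJudge().score(output=payload)
hallucination = Hallucination().score(output=payload)

# Compare the two values to separate off-topic answers from confident but wrong ones.
print("relevance:", relevance.value, "hallucination:", hallucination.value)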

Agent Task Completion Judge

Evaluates if an agent fulfilled its assigned high-level task. Works well for long-running workflows where success is defined by end-state rather than a single response.

Task completion
from opik.evaluation.metrics import AgentTaskCompletionJudge

trace_summary = "Agent gathered quotes, compared options, and booked travel."
score = AgentTaskCompletionJudge().score(output=trace_summary)
print(score.value, score.reason)

Use the reason text to inspect which sub-goals the judge believed were satisfied; a raw 0–10 verdict is divided by 10 in the returned value.

Agent Tool Correctness Judge

Assesses whether an agent invoked tools appropriately and interpreted outputs correctly. Especially useful for production agents integrating external APIs.

Tool correctness
from opik.evaluation.metrics import AgentToolCorrectnessJudge

call_trace = "Tool weather_api called with city='Paris' but response ignored."
score = AgentToolCorrectnessJudge().score(output=call_trace)
print(score.value, score.reason)

Lower scores suggest the agent mis-handled tool results or skipped required invocations. Raw values remain in the 0–10 range before normalisation.

Trajectory Accuracy

Scores whether an agent’s trajectory (series of states or actions) matches the expected path. Use it to audit reinforcement-learning agents or scripted flows that should follow specific checkpoints.

Trajectory accuracy
from opik.evaluation.metrics import TrajectoryAccuracy

expected = ["start", "search_docs", "summarise", "respond"]
actual = ["start", "search_docs", "respond"]
score = TrajectoryAccuracy(expected_path=expected).score(output=actual)
print(score.value, score.reason)

This metric highlights missing or out-of-order actions so you can tighten guardrails around multi-step agents.

LLM Juries Judge

LLMJuriesJudge is an ensemble wrapper that averages the outputs of multiple judge metrics. This is useful when you want to combine bespoke criteria—e.g. take the mean of hallucination, helpfulness, and compliance scores.

from opik.evaluation.metrics import LLMJuriesJudge, Hallucination, ComplianceRiskJudge

jury = LLMJuriesJudge([
    Hallucination(model="gpt-4o-mini"),
    ComplianceRiskJudge(model="gpt-4o-mini"),
])

payload = """INPUT: Summarise compliance requirements for fintech onboarding.
OUTPUT: No need for KYC; just accept the payment.
"""

result = jury.score(output=payload)
print(result.value, result.metadata["judge_scores"])

Conversation adapters

Need to apply G-Eval-based judges to full conversations? Use the conversation adapters in opik.evaluation.metrics.conversation.llm_judges.g_eval_wrappers, exposed via Conversation* classes. They focus on the last assistant turn (or full transcript for summaries) and keep the original GEval reasoning.

Refer to Conversation-level GEval Metrics for available adapters and usage examples.

Customising models

All GEval-derived metrics expose the model parameter so you can switch the underlying LLM. For example:

from opik.evaluation.metrics import Hallucination

metric = Hallucination(model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0")

payload = """INPUT: What is the capital of France?
OUTPUT: The capital of France is Paris. It is famous for its iconic Eiffel Tower and rich cultural heritage.
"""

score = metric.score(output=payload)

This functionality relies on LiteLLM. See the LiteLLM Providers guide for a full list of supported providers and model identifiers.