G-Eval
G-Eval is a task-agnostic LLM-as-a-judge metric that allows you to specify a task description and evaluation criteria. The judge model first drafts step-by-step evaluation instructions from your description, then scores the answer; Opik normalises the result to a value between 0.0 and 1.0. You can learn more about G-Eval in the original paper.
To use G-Eval, supply two pieces of information:
- A task introduction describing what should be evaluated.
- Evaluation criteria outlining what “good” looks like.
The judge responds with an integer between 0 and 10. Opik divides that value by 10 so callers receive a score between 0.0 and 1.0. We recommend packaging the full scenario (prompt, context, answer, etc.) inside a single string and passing it via the output argument; any other keyword arguments are ignored by the metric interface.
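A minimal sketch of that flow, assuming GEval is imported from opik.evaluation.metrics and accepts the task_introduction and evaluation_criteria keyword arguments described above:

```python
from opik.evaluation.metrics import GEval

# Describe the task and what a good answer looks like.
metric = GEval(
    task_introduction=(
        "You are an expert judge evaluating whether an answer is faithful "
        "to the provided context."
    ),
    evaluation_criteria=(
        "The answer must only contain information present in the context. "
        "Penalise unsupported or contradicted claims."
    ),
)

# Package the whole scenario (context + answer) into a single string
# and pass it via the `output` argument.
result = metric.score(
    output=(
        "CONTEXT: The Eiffel Tower is located in Paris, France.\n"
        "ANSWER: The Eiffel Tower is in Paris."
    )
)

print(result.value)   # normalised score in [0.0, 1.0]
print(result.reason)  # the judge's rationale
```

Packaging the whole scenario into output keeps the call signature identical across all the judges described below.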
How it works
G-Eval first expands your task description into a step-by-step Chain of Thought (CoT). This CoT becomes the rubric the judge will follow when scoring the provided answer. The model then evaluates the answer, returning a score in the 0–10 range which Opik normalises to 0–1.
By default, the gpt-5-nano model is used, but you can change this to any model supported by LiteLLM via the model parameter. Learn more in the custom model guide.
To make the metric more robust, Opik requests the top 20 log probabilities from the LLM and computes a probability-weighted average of the candidate scores, as recommended by the original paper, before applying the same 0–10 to 0–1 normalisation. Newer models in the GPT-5 family and other providers may not expose log probabilities, so scores can vary when switching models.
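As a simplified illustration of that weighting step (a sketch of the idea only, not Opik's internal implementation), the probabilities of the candidate score tokens act as weights:

```python
import math

# Hypothetical top log-probabilities for the score token, mapping
# candidate integer scores to their log-probs.
top_logprobs = {"8": -0.35, "9": -1.60, "7": -2.30}

# Convert log-probs to probabilities and take a probability-weighted
# average of the candidate scores.
weights = {int(token): math.exp(logprob) for token, logprob in top_logprobs.items()}
weighted_score = sum(s * w for s, w in weights.items()) / sum(weights.values())

print(weighted_score / 10)  # ~0.81 after normalisation to [0, 1]
```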
Built-in G-Eval judges
Opik ships opinionated presets for common evaluation needs. Each class inherits from GEval and exposes the same constructor parameters (model, track, temperature, etc.).
Compliance Risk Judge
Flags statements that may be non-factual, non-compliant, or risky (e.g. finance, healthcare, legal). This judge is useful when you need an automated review step before customer-facing responses are sent, or when auditing historical conversations for policy breaches.
Inspect score.reason for granular rationales and route risky cases accordingly. The raw 0–10 judgement is divided by 10 in the returned value.
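A sketch of that routing pattern, assuming the preset is exposed as ComplianceRiskJudge alongside GEval (check your Opik version for the exact class name) and that higher scores indicate higher risk:

```python
from opik.evaluation.metrics import ComplianceRiskJudge  # assumed class name

judge = ComplianceRiskJudge()

result = judge.score(
    output="Our fund guarantees a 20% annual return with zero risk."
)

# Route risky responses to a human reviewer before they reach the customer
# (assumes a higher score means higher risk; invert if your judge scores compliance).
if result.value >= 0.5:  # 0.5 corresponds to a raw judgement of 5/10
    print("Escalating for manual review:", result.reason)
else:
    print("Response approved:", result.value)
```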
Prompt Uncertainty Judge
PromptUncertaintyJudge estimates how ambiguous a user prompt is before it reaches your model. Run it on raw user messages to prioritise agent hand-offs or to warn users when the request is ill-posed.
Use the score to highlight prompts that may confuse downstream models; the judge emits an integer from 0 (best) to 10 (worst) before normalisation.
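For example, a minimal gating check on a raw user message (the import path is assumed to match the other presets):

```python
from opik.evaluation.metrics import PromptUncertaintyJudge  # import path assumed

judge = PromptUncertaintyJudge()

# Score a raw user message before it reaches the model.
result = judge.score(output="Can you fix it? It broke again like last time.")

# Higher values indicate a more ambiguous, ill-posed prompt.
if result.value > 0.6:
    print("Ask a clarifying question first:", result.reason)
```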
Summarization Consistency Judge
Checks whether a generated summary is faithful to the source material. This is the right choice when a downstream workflow consumes summaries and you need to enforce factual alignment with the source document.
Pair this metric with alerts or automated rollbacks when the score drops below a threshold; the evaluator still returns raw integers in 0–10 before Opik scales them.
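A sketch of such a gate, assuming the preset is exposed as SummarizationConsistencyJudge; the source document and the summary are packaged into the single output string as recommended above:

```python
from opik.evaluation.metrics import SummarizationConsistencyJudge  # assumed class name

judge = SummarizationConsistencyJudge()

source = "The quarterly report shows revenue grew 4% while costs fell 2%."
summary = "Revenue grew 4% and costs dropped slightly."

# Package the source document and the candidate summary in one string.
result = judge.score(output=f"SOURCE:\n{source}\n\nSUMMARY:\n{summary}")

# Gate the pipeline on a minimum faithfulness score.
THRESHOLD = 0.7
if result.value < THRESHOLD:
    raise ValueError(f"Summary failed consistency check: {result.reason}")
```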
Summarization Coherence Judge
Scores the structure, clarity, and organisation of a summary. Use it when you optimise for human readability or want to catch summaries that are factually right but poorly written.
High scores correlate with summaries that maintain logical ordering and concise transitions between ideas. A perfect 10 becomes 1.0 after Opik normalisation.
Dialogue Helpfulness Judge
Examines how helpful an assistant reply is in the context of the preceding dialogue. Useful for agent tuning or support chat routing where you want to surface conversations that require escalation.
Low scores typically indicate the assistant ignored prior context or refused to offer actionable steps. The normalised value originates from an integer between 0 and 10.
QA Relevance Judge
Determines whether an answer directly addresses the user’s question. Ideal for dataset regression tests where each sample has a clear question/answer pair.
Combine with hallucination metrics to distinguish totally off-topic answers from confident but wrong responses; the judge still works on a 0–10 scale internally.
Agent Task Completion Judge
Evaluates if an agent fulfilled its assigned high-level task. Works well for long-running workflows where success is defined by end-state rather than a single response.
Use the reason text to inspect which sub-goals the judge believed were satisfied; a raw 0–10 verdict is divided by 10 in the returned value.
Agent Tool Correctness Judge
Assesses whether an agent invoked tools appropriately and interpreted outputs correctly. Especially useful for production agents integrating external APIs.
Lower scores suggest the agent mis-handled tool results or skipped required invocations. Raw values remain in the 0–10 range before normalisation.
Trajectory Accuracy
Scores whether an agent’s trajectory (series of states or actions) matches the expected path. Use it to audit reinforcement-learning agents or scripted flows that should follow specific checkpoints.
This metric highlights missing or out-of-order actions so you can tighten guardrails around multi-step agents.
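A sketch of auditing a scripted flow, assuming the metric is exposed as TrajectoryAccuracy and following the same output-packaging convention as the other presets:

```python
from opik.evaluation.metrics import TrajectoryAccuracy  # assumed class name

judge = TrajectoryAccuracy()

expected = ["open_ticket", "lookup_account", "issue_refund", "close_ticket"]
observed = ["open_ticket", "issue_refund", "close_ticket"]

# Package the expected and observed paths into the single output string.
result = judge.score(
    output=(
        f"EXPECTED TRAJECTORY: {' -> '.join(expected)}\n"
        f"OBSERVED TRAJECTORY: {' -> '.join(observed)}"
    )
)

# The reason should point out missing or out-of-order steps (here, lookup_account).
print(result.value, result.reason)
```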
LLM Juries Judge
LLMJuriesJudge is an ensemble wrapper that averages the outputs of multiple judge metrics. This is useful when you want to combine bespoke criteria, for example taking the mean of hallucination, helpfulness, and compliance scores.
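A sketch of such an ensemble, assuming LLMJuriesJudge accepts the member judges as a list and that the preset class names match the sections above:

```python
from opik.evaluation.metrics import (  # preset names assumed, see sections above
    ComplianceRiskJudge,
    DialogueHelpfulnessJudge,
    LLMJuriesJudge,
)

# Assumed constructor: the ensemble wraps a list of judge metrics and
# averages their normalised scores.
jury = LLMJuriesJudge(
    judges=[ComplianceRiskJudge(), DialogueHelpfulnessJudge()],
)

result = jury.score(
    output="USER: Can I deduct my home office?\nASSISTANT: Generally yes, if ..."
)
print(result.value)  # mean of the individual judges' [0, 1] scores
```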
Conversation adapters
Need to apply G-Eval-based judges to full conversations? Use the conversation adapters in opik.evaluation.metrics.conversation.llm_judges.g_eval_wrappers, exposed via Conversation* classes. They focus on the last assistant turn (or full transcript for summaries) and keep the original GEval reasoning.
Refer to Conversation-level GEval Metrics for available adapters and usage examples.
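A heavily hedged sketch of the adapter pattern; the ConversationDialogueHelpfulnessJudge name and the conversation argument shown here are illustrative assumptions, so check the reference above for the classes your Opik version actually exports:

```python
# Hypothetical adapter name; see the conversation metrics reference for
# the classes actually exported from this module.
from opik.evaluation.metrics.conversation.llm_judges.g_eval_wrappers import (
    ConversationDialogueHelpfulnessJudge,
)

judge = ConversationDialogueHelpfulnessJudge()

conversation = [
    {"role": "user", "content": "My deployment fails with exit code 137."},
    {"role": "assistant", "content": "Exit code 137 usually means the container "
                                     "was OOM-killed; try raising the memory limit."},
]

# The adapter focuses on the last assistant turn in the transcript
# (the `conversation` argument shown here is an assumption).
result = judge.score(conversation=conversation)
print(result.value, result.reason)
```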
Customising models
All GEval-derived metrics expose the model parameter so you can switch the underlying LLM. For example:
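(The model identifier below is illustrative; any LiteLLM-supported identifier works.)

```python
from opik.evaluation.metrics import GEval

metric = GEval(
    task_introduction="Evaluate whether the answer addresses the question.",
    evaluation_criteria="The answer must directly and completely answer the question.",
    model="anthropic/claude-3-5-sonnet-20241022",  # any LiteLLM model identifier
)
```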
This functionality relies on LiteLLM. See the LiteLLM Providers guide for a full list of supported providers and model identifiers.