G-Eval

G-Eval is a task-agnostic LLM-as-a-judge metric that allows you to specify a task description and evaluation criteria. The judge model first drafts step-by-step evaluation instructions from your description and criteria, then scores the output; Opik returns the result as a score between 0.0 and 1.0. You can learn more about G-Eval in the original paper.

To use G-Eval, supply two pieces of information:

  1. A task introduction describing what should be evaluated.
  2. Evaluation criteria outlining what “good” looks like.

The judge responds with an integer between 0 and 10. Opik divides that value by 10 so callers receive a score between 0.0 and 1.0. We recommend packaging the full scenario (prompt, context, answer, etc.) inside a single string and passing it via the output argument; any other keyword arguments are ignored by the metric interface.

from opik.evaluation.metrics import GEval

metric = GEval(
    task_introduction="You are an expert judge tasked with evaluating the faithfulness of an AI-generated answer to the given context.",
    evaluation_criteria="In the provided text the OUTPUT must not introduce new information beyond what's provided in the CONTEXT.",
)

payload = """INPUT: What is the capital of France?
CONTEXT: France is a country in Western Europe. Its capital is Paris, which is known for landmarks like the Eiffel Tower.
OUTPUT: Paris is the capital of France.
"""

metric.score(output=payload)
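
The call returns a score object; as the preset examples below show, its value attribute holds the normalised score and its reason attribute holds the judge's explanation. A minimal follow-up reusing the metric and payload above:

score = metric.score(output=payload)
print(score.value)   # normalised score in the [0.0, 1.0] range
print(score.reason)  # the judge's explanation for the score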

How it works

G-Eval first expands your task description into a step-by-step Chain of Thought (CoT). This CoT becomes the rubric the judge will follow when scoring the provided answer. The model then evaluates the answer, returning a score in the 0–10 range, which Opik normalises to 0–1.

By default, the gpt-5-nano model is used, but you can change this to any model supported by LiteLLM via the model parameter. Learn more in the custom model guide.

To make the metric more robust, Opik requests the top 20 log probabilities from the LLM and computes a weighted average of the scores, as recommended by the original paper. The evaluator always returns an integer between 0 and 10; Opik divides that value by 10 before exposing it so callers see numbers in the [0, 1] range. Newer models in the GPT-5 family and other providers may not expose log probabilities, so scores can vary when switching models.
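
To illustrate the idea, here is a sketch of the weighted-average calculation; it is not Opik's internal code, and the log probabilities are made up for illustration:

import math

# Hypothetical log probabilities for the candidate score tokens returned by the
# judge model (illustrative values only, not real output and not Opik's internals).
token_logprobs = {"6": -2.2, "7": -0.3, "8": -1.5}

# Convert log probabilities to weights and take the probability-weighted mean.
weights = {int(token): math.exp(logprob) for token, logprob in token_logprobs.items()}
total = sum(weights.values())
raw_score = sum(value * weight for value, weight in weights.items()) / total  # weighted 0-10 score

normalised = raw_score / 10  # the value Opik exposes, in [0, 1]
print(round(normalised, 3))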

Built-in G-Eval judges

Opik ships opinionated presets for common evaluation needs. Each class inherits from GEval and exposes the same constructor parameters (model, track, temperature, etc.).
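
For example, a minimal sketch of constructing a preset with these shared parameters; the argument values, and the comment on track, are illustrative assumptions rather than recommended defaults:

from opik.evaluation.metrics import ComplianceRiskJudge

metric = ComplianceRiskJudge(
    model="gpt-4o-mini",  # any LiteLLM-supported model identifier
    track=False,          # assumption: disables logging results to the Opik platform
    temperature=0.0,      # lower temperature for more repeatable judgements
)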

Compliance Risk Judge

Flags statements that may be non-factual, non-compliant, or risky (e.g. finance, healthcare, legal). This judge is useful when you need an automated review step before customer-facing responses are sent, or when auditing historical conversations for policy breaches.

Compliance example
from opik.evaluation.metrics import ComplianceRiskJudge

metric = ComplianceRiskJudge(model="gpt-4o-mini")

payload = """INPUT: Customer asked about wire-transfer reversal policies.
OUTPUT: Just reverse it whenever the customer asks.
"""

score = metric.score(output=payload)
print(score.value, score.reason)

Inspect score.reason for granular rationales and route risky cases accordingly. The raw 0–10 judgement is divided by 10 in the returned value.
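
A minimal routing sketch building on the example above; both the 0.5 threshold and the assumption that a higher score means higher risk are illustrative and should be validated on your own data:

# Assumption: a higher normalised score indicates higher compliance risk.
RISK_THRESHOLD = 0.5  # illustrative threshold, tune on your own data

if score.value >= RISK_THRESHOLD:
    print("Escalate for manual compliance review:", score.reason)
else:
    print("Response can be released automatically")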

Prompt Uncertainty Judge

PromptUncertaintyJudge estimates how ambiguous a user prompt is before it reaches your model. Run it on raw user messages to prioritise agent hand-offs or to warn users when the request is ill-posed.

Prompt uncertainty
from opik.evaluation.metrics import PromptUncertaintyJudge

prompt = "Summarise the attached 400-page contract in one sentence and guarantee there are no mistakes."

uncertainty = PromptUncertaintyJudge().score(output=prompt)
print(uncertainty.value)

Use the score to highlight prompts that may confuse downstream models; the judge emits an integer from 0 (best) to 10 (worst) before normalisation.
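
A minimal triage sketch, assuming the single-string output interface described at the top of this page and an illustrative 0.6 threshold:

from opik.evaluation.metrics import PromptUncertaintyJudge

judge = PromptUncertaintyJudge()
prompts = [
    "Summarise the attached contract in one sentence and guarantee there are no mistakes.",
    "List the three parties named in the attached contract.",
]

# Higher values mean a more ambiguous prompt (0 best, 10 worst before normalisation).
UNCERTAINTY_THRESHOLD = 0.6  # illustrative threshold, calibrate on your own prompts
flagged = [p for p in prompts if judge.score(output=p).value >= UNCERTAINTY_THRESHOLD]
print(flagged)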

Summarization Consistency Judge

Checks whether a generated summary is faithful to the source material. This is the right choice when a downstream workflow consumes summaries and you need to enforce factual alignment with the source document.

Summary faithfulness
from opik.evaluation.metrics import SummarizationConsistencyJudge

metric = SummarizationConsistencyJudge(model="gpt-4o")

payload = """CONTEXT: ...long article text...
SUMMARY: The article confirms new safety protocols but misstates the deadline.
"""

score = metric.score(output=payload)
print(score.value, score.reason)

Pair this metric with alerts or automated rollbacks when the score drops below a threshold; the evaluator still returns raw integers in 0–10 before Opik scales them.
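
A minimal gating sketch building on the example above; the 0.7 threshold, the exception-based rollback, and the assumption that higher means more faithful are illustrative:

# Assumption: a higher normalised score indicates a more faithful summary.
FAITHFULNESS_THRESHOLD = 0.7  # illustrative threshold, tune for your workflow

if score.value < FAITHFULNESS_THRESHOLD:
    # Replace with your own alerting or rollback mechanism.
    raise RuntimeError(f"Summary failed faithfulness check: {score.reason}")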

Summarization Coherence Judge

Scores the structure, clarity, and organisation of a summary. Use it when you optimise for human readability or want to catch summaries that are factually right but poorly written.

Summary coherence
from opik.evaluation.metrics import SummarizationCoherenceJudge

metric = SummarizationCoherenceJudge()

score = metric.score(output="""SUMMARY: First... Secondly... Finally...""")
print(score.value, score.reason)

High scores correlate with summaries that maintain logical ordering and concise transitions between ideas. A perfect 10 becomes 1.0 after Opik normalisation.

Dialogue Helpfulness Judge

Examines how helpful an assistant reply is in the context of the preceding dialogue. Helpful for agent tuning or support chat routing where you want to surface conversations that require escalation.

Dialogue helpfulness
from opik.evaluation.metrics import DialogueHelpfulnessJudge

transcript = """USER: How do I reset my password?
ASSISTANT: Visit settings and click reset.
USER: I cannot see that option.
ASSISTANT: Please contact support.
"""

score = DialogueHelpfulnessJudge().score(output=transcript)
print(score.value, score.reason)

Low scores typically indicate the assistant ignored prior context or refused to offer actionable steps. The normalised value originates from an integer between 0 and 10.

QA Relevance Judge

Determines whether an answer directly addresses the user’s question. Ideal for dataset regression tests where each sample has a clear question/answer pair.

QA relevance
from opik.evaluation.metrics import QARelevanceJudge

metric = QARelevanceJudge()

payload = """QUESTION: What causes rainbows?
ANSWER: The capital of France is Paris.
"""

score = metric.score(output=payload)
print(score.value, score.reason)

Combine with hallucination metrics to distinguish totally off-topic answers from confident but wrong responses; the judge still works on a 0–10 scale internally.
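
A minimal sketch pairing the two judges on the same payload, assuming the Hallucination metric accepts the same single-string output interface used elsewhere on this page:

from opik.evaluation.metrics import Hallucination, QARelevanceJudge

payload = """QUESTION: What causes rainbows?
ANSWER: The capital of France is Paris.
"""

relevance = QARelevanceJudge().score(output=payload)
hallucination = Hallucination().score(output=payload)

# Compare the two values to separate off-topic answers from confident but wrong ones.
print("relevance:", relevance.value, "hallucination:", hallucination.value)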

Agent Task Completion Judge

Evaluates if an agent fulfilled its assigned high-level task. Works well for long-running workflows where success is defined by end-state rather than a single response.

Task completion
from opik.evaluation.metrics import AgentTaskCompletionJudge

trace_summary = "Agent gathered quotes, compared options, and booked travel."
score = AgentTaskCompletionJudge().score(output=trace_summary)
print(score.value, score.reason)

Use the reason text to inspect which sub-goals the judge believed were satisfied; a raw 0–10 verdict is divided by 10 in the returned value.

Agent Tool Correctness Judge

Assesses whether an agent invoked tools appropriately and interpreted outputs correctly. Especially useful for production agents integrating external APIs.

Tool correctness
from opik.evaluation.metrics import AgentToolCorrectnessJudge

call_trace = "Tool weather_api called with city='Paris' but response ignored."
score = AgentToolCorrectnessJudge().score(output=call_trace)
print(score.value, score.reason)

Lower scores suggest the agent mis-handled tool results or skipped required invocations. Raw values remain in the 0–10 range before normalisation.

Trajectory Accuracy

Scores whether an agent’s trajectory (series of states or actions) matches the expected path. Use it to audit reinforcement-learning agents or scripted flows that should follow specific checkpoints.

Trajectory accuracy
from opik.evaluation.metrics import TrajectoryAccuracy

expected = ["start", "search_docs", "summarise", "respond"]
actual = ["start", "search_docs", "respond"]
score = TrajectoryAccuracy(expected_path=expected).score(output=actual)
print(score.value, score.reason)

This metric highlights missing or out-of-order actions so you can tighten guardrails around multi-step agents.

LLM Juries Judge

LLMJuriesJudge is an ensemble wrapper that averages the outputs of multiple judge metrics. This is useful when you want to combine bespoke criteria—e.g. take the mean of hallucination, helpfulness, and compliance scores.

from opik.evaluation.metrics import LLMJuriesJudge, Hallucination, ComplianceRiskJudge

jury = LLMJuriesJudge([
    Hallucination(model="gpt-4o-mini"),
    ComplianceRiskJudge(model="gpt-4o-mini"),
])

payload = """INPUT: Summarise compliance requirements for fintech onboarding.
OUTPUT: No need for KYC; just accept the payment.
"""

result = jury.score(output=payload)
print(result.value, result.metadata["judge_scores"])

Conversation adapters

Need to apply G-Eval-based judges to full conversations? Use the conversation adapters in opik.evaluation.metrics.conversation.llm_judges.g_eval_wrappers, exposed via Conversation* classes. They focus on the last assistant turn (or full transcript for summaries) and keep the original GEval reasoning.

Refer to Conversation-level GEval Metrics for available adapters and usage examples.

Customising models

All GEval-derived metrics expose the model parameter so you can switch the underlying LLM. For example:

from opik.evaluation.metrics import Hallucination

metric = Hallucination(model="bedrock/anthropic.claude-3-sonnet-20240229-v1:0")

payload = """INPUT: What is the capital of France?
OUTPUT: The capital of France is Paris. It is famous for its iconic Eiffel Tower and rich cultural heritage.
"""

score = metric.score(output=payload)

This functionality relies on LiteLLM. See the LiteLLM Providers guide for a full list of supported providers and model identifiers.