In LangChain, comparison evaluators are designed to measure and compare outputs from two different chains or LLMs. These tools are useful for A/B testing between models or comparing different versions of the same model. They can also be employed to generate preference scores for AI-assisted reinforcement learning.
At their core, these evaluators derive from the PairwiseStringEvaluator class, facilitating a comparison between two output strings. This could result from two distinct prompts, models, or simply different versions of the same model. Essentially, these evaluators assess pairs of strings, providing a detailed evaluation score and other pertinent information.
To craft a tailored comparison evaluator, developers can inherit from the PairwiseStringEvaluator class and override the _evaluate_string_pairs method. Asynchronous evaluation is also supported by overriding the _aevaluate_string_pairs method.
Key features of a comparison evaluator include:
- evaluate_string_pairs: Override this to design custom evaluators.
- aevaluate_string_pairs: Use this for asynchronous evaluations.
- requires_input: Determines whether an input string is needed.
- requires_reference: Specifies whether a reference label is essential.

Comparison evaluators excel at juxtaposing outputs from two models or prompts, yielding a score that elucidates the preference between the two outputs. They can be adapted to cater to specific comparative analysis requirements.
For detailed evaluation, the PairwiseStringEvalChain class's evaluate_string_pairs method compares two output strings and determines the preferred one based on specific criteria. This method can be used with or without a reference. Using a reference provides a more reliable result; without one, the evaluation relies on the evaluator's own preference, which may be less accurate.
Customization is at the heart of these evaluators. Developers can define their evaluation criteria or use predefined ones from LangChain. Additionally, one can customize the evaluation prompt for task-specific instructions, ensuring the evaluator scores the output as desired.
These evaluators are helpful for comparative analyses, such as A/B testing between two language models or comparing different versions of the same model. They can also help generate preference scores for AI-assisted reinforcement learning.
These evaluators inherit from the PairwiseStringEvaluator class, providing a comparison interface for two strings, typically the outputs from two different prompts or models, or two versions of the same model.
tl;dr: A comparison evaluator evaluates a pair of strings and returns a dictionary containing the evaluation score and other relevant details.
To create a custom comparison evaluator, inherit from the PairwiseStringEvaluator class and override the _evaluate_string_pairs method. If you require asynchronous evaluation, override the _aevaluate_string_pairs method.
Here’s a summary of the essential methods and properties of a comparison evaluator:
- evaluate_string_pairs: Evaluate the output string pairs. Override this method when creating custom evaluators.
- aevaluate_string_pairs: Asynchronously evaluate the output string pairs. Override this method for asynchronous evaluation.
- requires_input: This property indicates whether the evaluator requires an input string.
- requires_reference: This property specifies whether the evaluator requires a reference label.

In summary, comparison evaluators allow comparing two models or prompts by evaluating their outputs. They return a score quantifying the preference between the two outputs, and you can customize them for your specific comparative analysis needs.
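To make the interface concrete, here is a minimal sketch of a custom comparison evaluator. The class name and the length-based scoring rule are invented for this example; only the PairwiseStringEvaluator base class and the _evaluate_string_pairs hook come from the interface described above.

from typing import Any, Optional
from langchain.evaluation import PairwiseStringEvaluator

class ShorterResponseEvaluator(PairwiseStringEvaluator):
    """Toy evaluator that prefers the more concise of two outputs."""

    def _evaluate_string_pairs(
        self,
        *,
        prediction: str,
        prediction_b: str,
        reference: Optional[str] = None,
        input: Optional[str] = None,
        **kwargs: Any,
    ) -> dict:
        # Score 1 if output A is shorter, 0 if output B is shorter, 0.5 on a tie.
        if len(prediction) < len(prediction_b):
            score = 1
        elif len(prediction) > len(prediction_b):
            score = 0
        else:
            score = 0.5
        return {"score": score}

Calling ShorterResponseEvaluator().evaluate_string_pairs(prediction=..., prediction_b=...) then returns a dictionary containing that score, mirroring the built-in evaluators.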
Often, you will want to compare the predictions of an LLM, Chain, or Agent for a given input. The StringComparison evaluators facilitate this, so you can answer questions like: which LLM or prompt produces the preferred output for a given input?

The simplest and often most reliable automated way to choose a preferred prediction for a given input is to use the pairwise_string evaluator.
from langchain.evaluation import load_evaluator
evaluator = load_evaluator("labeled_pairwise_string")
evaluate_string_pairs
The evaluate_string_pairs method of the PairwiseStringEvalChain class is designed to evaluate and compare two output strings (prediction and prediction_b) to determine which one is preferred based on certain criteria.
Under the hood, the method proceeds in three steps:
- It first calls the _prepare_input method, which organizes the input data (prediction, prediction_b, input, and reference) into a dictionary format suitable for evaluation.
- The underlying chain then compares the two outputs (prediction and prediction_b) based on the criteria defined elsewhere in the class or module.
- The raw evaluation result is passed to the _prepare_output method, which organizes it into a more structured and readable format.

The method returns a dictionary with the following keys:
- value: 'A' (for prediction), 'B' (for prediction_b), or None if there is no preference.
- score: 1 for 'A', 0 for 'B', and 0.5 if there is no preference.
- reasoning: the evaluator's explanation for the verdict, as shown in the examples below.

In essence, the evaluate_string_pairs method is a utility to compare two model outputs and determine which is better based on predefined criteria.
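Because the score is numeric, it is easy to aggregate preferences over a batch of comparisons. The snippet below is a sketch only: outputs_a, outputs_b, questions, and references are hypothetical lists you would supply yourself.

# Hypothetical data: paired outputs from two models for the same questions.
pairs = [
    {"prediction": a, "prediction_b": b, "input": q, "reference": ref}
    for a, b, q, ref in zip(outputs_a, outputs_b, questions, references)
]

results = [evaluator.evaluate_string_pairs(**pair) for pair in pairs]

# Fraction of comparisons in which model A was preferred (ties count as 0.5).
win_rate_a = sum(r["score"] for r in results) / len(results)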
You can choose which model performs the evaluation by passing the llm argument to load_evaluator. By default, it uses GPT-4.
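For example, to run the evaluation with a different chat model (a sketch that assumes the Anthropic integration and an API key are available; the examples below continue to use the default evaluator):

from langchain.chat_models import ChatAnthropic

# Any chat model supported by LangChain can serve as the judge.
anthropic_evaluator = load_evaluator("labeled_pairwise_string", llm=ChatAnthropic(temperature=0))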
evaluator.evaluate_string_pairs(
prediction="Sikhism was founded by Guru Nanak Dev Ji in the 15th century.",
prediction_b="Sikhism was established by a philosopher named Ravi in the 16th century.",
input="Who is the founder of Sikhism?",
reference="Sikhism was founded by Guru Nanak Dev Ji in the late 15th century.",
verbose=True
)
{'reasoning': "Assistant A's response is more helpful, relevant, and correct. It accurately identifies Guru Nanak Dev Ji as the founder of Sikhism in the 15th century, which aligns with the reference answer provided. On the other hand, Assistant B's response is incorrect. It incorrectly identifies a philosopher named Ravi as the founder of Sikhism in the 16th century, which is not accurate according to the reference answer and historical facts. Therefore, Assistant A's response demonstrates a greater depth of thought and knowledge about the topic. \n\nFinal Verdict: [[A]]",
'value': 'A',
'score': 1}
When references aren’t available, you can still predict the preferred response.
The results will reflect the evaluation model’s preference, which is less reliable and may result in preferences that are factually incorrect.
from langchain.evaluation import load_evaluator
evaluator = load_evaluator("pairwise_string")
evaluator.evaluate_string_pairs(
prediction="Stars are primarily made of hydrogen.",
prediction_b="Stars are primarily composed of hydrogen, which undergoes nuclear fusion to produce helium, releasing energy in the process.",
input="What is the primary component of a star?",
verbose=True
)
{'reasoning': "Both Assistant A and Assistant B provided correct and relevant answers to the user's question. However, Assistant B's response was more detailed and insightful, explaining not only that stars are primarily composed of hydrogen, but also how this hydrogen undergoes nuclear fusion to produce helium, releasing energy in the process. This additional information demonstrates a greater depth of thought and understanding of the topic. Therefore, Assistant B's response is superior based on the evaluation criteria. \n\nFinal Verdict: [[B]]",
'value': 'B',
'score': 0}
By default, the LLM is instructed to select the ‘preferred’ response based on helpfulness, relevance, correctness, and depth of thought.
You can customize the criteria by passing in a criteria argument. The criteria can be one of LangChain's predefined criteria, referenced by name (conciseness, relevance, correctness, coherence, harmfulness, maliciousness, helpfulness, controversiality, misogyny, criminality, insensitivity, depth, creativity, detail), or a custom dictionary that maps each criterion name to its description, as in the example below.
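For instance, to judge the pair on conciseness alone, you could load the evaluator with a single predefined criterion (the question and answers here are made up for illustration):

evaluator = load_evaluator("pairwise_string", criteria="conciseness")
evaluator.evaluate_string_pairs(
    prediction="Water boils at 100 degrees Celsius at sea level.",
    prediction_b="At standard atmospheric pressure, such as at sea level, water reaches its boiling point at a temperature of one hundred degrees on the Celsius scale.",
    input="At what temperature does water boil?",
)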
Here’s an example of determining the scientific rigour and quality of a given text:
scientific_criteria = {
"accuracy": "Is the information presented accurate based on known scientific knowledge?",
"comprehensiveness": "Does the text cover the topic in a thorough manner, addressing all relevant aspects?",
"referencing": "Are claims and statements backed up with appropriate citations or sources?",
"objectivity": "Is the writing unbiased and free from personal opinions or beliefs?",
"terminology": "Does the text use correct and appropriate scientific terms and language?",
"methodology": "If applicable, is the scientific method or approach described in a clear and rigorous manner?",
"relevance": "Is the information presented relevant to the current state of the field or topic?",
"innovation": "Does the text introduce new concepts, theories, or methodologies?",
}
evaluator = load_evaluator("pairwise_string", criteria=scientific_criteria)
evaluator.evaluate_string_pairs(
prediction="The theory of relativity, proposed by Einstein, suggests that time and space are relative and all the motion must be relative to a frame of reference.",
prediction_b="Einstein's relativity idea posits that if you travel super fast, like near the speed of light, time slows down relative to others who are stationary.",
input="Explain the theory of relativity in a sentence.",
)
{'reasoning': "Both Assistant A and Assistant B provided accurate and relevant responses to the user's question. However, Assistant A's response is more comprehensive as it covers both the aspects of relativity - time and space, and the concept of motion relative to a frame of reference. On the other hand, Assistant B's response focuses only on the time aspect of relativity and does not mention the space aspect or the concept of relative motion. Both responses use appropriate scientific terminology and are objective, without any personal opinions or beliefs. Neither response introduces new concepts, theories, or methodologies, which is appropriate given the user's request for a one-sentence explanation. Neither assistant provided references, but this is not expected in a one-sentence explanation. Therefore, based on the criteria provided, Assistant A's response is superior. \n\nFinal Verdict: [[A]]",
'value': 'A',
'score': 1}
You can use a custom evaluation prompt to add task-specific instructions or to instruct the evaluator on how to score the output.

Note: If you use a prompt that expects the result in a unique format, you may also have to pass in a custom output parser (output_parser=your_parser()) instead of the default PairwiseStringResultOutputParser.
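As a rough illustration of what such a parser might look like (the class name and regex are invented here, and this sketch simply mirrors the contract of the default parser), it only needs to turn the raw LLM text into the reasoning/value/score dictionary:

import re
from langchain.schema import BaseOutputParser

class VerdictOutputParser(BaseOutputParser):
    """Hypothetical parser for prompts that end with a [[A]] or [[B]] verdict."""

    def parse(self, text: str) -> dict:
        match = re.search(r"\[\[(A|B)\]\]", text)
        verdict = match.group(1) if match else None
        score = {"A": 1, "B": 0}.get(verdict, 0.5)
        return {"reasoning": text.strip(), "value": verdict, "score": score}

You would then pass output_parser=VerdictOutputParser() to load_evaluator alongside your custom prompt.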
from langchain.prompts import PromptTemplate
prompt_template = PromptTemplate.from_template(
"""
**Task**: Compare the two responses, A and B, based on the provided criteria.
Provide a step-by-step reasoning for your preference and conclude with either [[A]] or [[B]] on a separate line.
Ensure your evaluation is objective and based solely on the given criteria.
**Criteria**:
{criteria}
**Data**:
- **Input Context**: {input}
- **Reference Answer**: {reference}
- **Response A**: {prediction}
- **Response B**: {prediction_b}
**Begin Reasoning Below**:
"""
)
evaluator = load_evaluator(
"labeled_pairwise_string", prompt=prompt_template
)
# The prompt was assigned to the evaluator
print(evaluator.prompt)
input_variables=['input', 'prediction_b', 'reference', 'prediction'] partial_variables={'criteria': 'For this evaluation, you should primarily consider the following criteria:\nhelpfulness: Is the submission helpful, insightful, and appropriate?\nrelevance: Is the submission referring to a real quote from the text?\ncorrectness: Is the submission correct, accurate, and factual?\ndepth: Does the submission demonstrate depth of thought?'} template='\n**Task**: Compare the two responses, A and B, based on the provided criteria. \nProvide a step-by-step reasoning for your preference and conclude with either [[A]] or [[B]] on a separate line. \nEnsure your evaluation is objective and based solely on the given criteria.\n\n**Criteria**:\n{criteria}\n\n**Data**:\n- **Input Context**: {input}\n- **Reference Answer**: {reference}\n- **Response A**: {prediction}\n- **Response B**: {prediction_b}\n\n**Begin Reasoning Below**:\n\n'
evaluator.evaluate_string_pairs(
prediction="The primary gas in Earth's atmosphere is carbon dioxide.",
prediction_b="Earth's atmosphere is primarily composed of nitrogen.",
input="What is the primary gas in Earth's atmosphere?",
reference="The primary gas in Earth's atmosphere is nitrogen.",
)
{'reasoning': "Helpfulness: Both responses attempt to answer the question, but Response B is more helpful because it provides the correct answer.\n\nRelevance: Both responses are relevant to the input context as they both refer to the primary gas in Earth's atmosphere.\n\nCorrectness: Response A is incorrect because the primary gas in Earth's atmosphere is not carbon dioxide, it's nitrogen. Response B is correct.\n\nDepth: Neither response demonstrates a significant depth of thought, as they both provide straightforward answers to the question. However, Response B is more accurate.\n\nBased on these criteria, Response B is the better response.\n\n[[B]]",
'value': 'B',
'score': 0}
In conclusion, LangChain’s comparison evaluators offer a robust and versatile toolset for assessing and contrasting the outputs of different chains or LLMs.
They are indispensable in A/B testing, model version analysis, and AI-driven reinforcement learning. Built on the foundational PairwiseStringEvaluator class, these evaluators provide detailed insights into pairs of strings, making them invaluable for developers and researchers. The flexibility to craft custom evaluators, define unique evaluation criteria, and modify evaluation prompts ensures that users can tailor evaluations to specific needs.
As LLMs evolve and integrate into various applications, such evaluators will be crucial in ensuring the optimal performance, accuracy, and utility of language models and their outputs.