December 19, 2024
While string evaluators provide a robust way to measure a model’s accuracy, myriad other methods offer nuanced and targeted approaches to evaluation.
For developers and data scientists venturing into building applications with language models, ensuring the reliability of the model’s output becomes paramount. From the simplicity of an exact match to the depth of embedding distances, each evaluation method serves a unique purpose in the grand tapestry of language model validation.
Delving deeper, this guide explores various string evaluation techniques — each with its strengths, intricacies, and use cases.
Whether you’re looking to validate a specific format using regex or measure semantic similarity through embeddings, understanding these evaluation methods is key to creating AI-driven applications that are both accurate and effective.
When building apps with language models, it’s crucial to ensure your models produce reliable and valuable results for various inputs and integrate seamlessly with other software components. This often requires a mix of intelligent application design, thorough testing, and runtime checks.
Probably the simplest way to evaluate an LLM or runnable’s string output against a reference label is simple string equivalence.
The ExactMatchStringEvaluator simply checks if the prediction string exactly matches the reference string.
It is case-sensitive by default.
from langchain.evaluation import ExactMatchStringEvaluator
evaluator = ExactMatchStringEvaluator()
evaluator.evaluate_strings(
prediction="My name is Harpreet, and I love to learn LangChain",
reference="Harpreet loves learning langchain",
)
# {'score': 0}
evaluator.evaluate_strings(
prediction="My name is Harpreet, and I love to learn LangChain",
reference="My name is Harpreet, and I love to learn LangChain",
)
# {'score': 1}
You can relax the “exactness” of the ExactMatchStringEvaluator when comparing strings.
evaluator = ExactMatchStringEvaluator(
ignore_case=True,
ignore_numbers=True,
ignore_punctuation=True,
)
evaluator.evaluate_strings(
prediction="My name is Harpreet, and I love to learn LangChain",
reference="my name is harpreet, and I love to learn langchain!"
)
# will output {'score': 1}
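To make the relaxed comparison concrete, here is a minimal plain-Python sketch of the idea: normalize both strings, then compare them. The function name relaxed_exact_match is hypothetical, and the normalization order is an assumption; this is not LangChain's implementation.

```python
import string


def relaxed_exact_match(
    prediction: str,
    reference: str,
    *,
    ignore_case: bool = False,
    ignore_punctuation: bool = False,
    ignore_numbers: bool = False,
) -> int:
    """Return 1 if the strings are equal after the requested normalization, else 0.

    Hypothetical helper for illustration only; not part of LangChain.
    """

    def normalize(text: str) -> str:
        if ignore_case:
            text = text.lower()
        if ignore_punctuation:
            # Drop every ASCII punctuation character.
            text = text.translate(str.maketrans("", "", string.punctuation))
        if ignore_numbers:
            # Drop every ASCII digit.
            text = text.translate(str.maketrans("", "", string.digits))
        return text

    return int(normalize(prediction) == normalize(reference))


prediction = "My name is Harpreet, and I love to learn LangChain"
reference = "my name is harpreet, and I love to learn langchain!"

strict = relaxed_exact_match(prediction, reference)
relaxed = relaxed_exact_match(
    prediction, reference, ignore_case=True, ignore_punctuation=True
)
print(strict, relaxed)
```

The strict comparison fails on case and the trailing exclamation mark, while the relaxed one treats the two strings as equal.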
String distance is a measure of the difference between two strings.
The smaller the distance, the more similar the two strings are. Different algorithms provide different ways of calculating this distance.
Under the hood, LangChain uses the RapidFuzz library to perform these calculations.
This can be used alongside approximate/fuzzy matching criteria for basic unit testing.
The string_distance evaluator (StringDistanceEvalChain) measures the similarity between two strings using a string distance algorithm such as Levenshtein distance.
By default, it returns a normalized distance between 0 and 1, with 0 indicating an exact match.
The StringDistance enumeration defines the types of string distance metrics supported:
Damerau-Levenshtein: Considers insertions, deletions, substitutions, and the transposition of two adjacent characters.
Levenshtein: Considers insertions, deletions, and substitutions.
Jaro: Measures the similarity between two strings.
Jaro-Winkler: A modification of Jaro’s similarity to give more weight to the prefix.
Hamming: Measures the difference between two strings of equal length.
Indel: Considers only insertions and deletions.
from langchain.evaluation import load_evaluator, StringDistance
evaluator = load_evaluator("string_distance")
evaluator.evaluate_strings(
prediction="My name is Harpreet, and I love to learn LangChain",
reference="Harpreet loves learning langchain",
)
# will output {'score': 0.31919191919191914}
You can change the metric like so:
levenshtein_evaluator = load_evaluator(
"string_distance",
distance='levenshtein'
)
levenshtein_evaluator.evaluate_strings(
prediction="My name is Harpreet, and I love to learn LangChain",
reference="Harpreet loves learning langchain",
)
# {'score': 0.52}
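To show what a Levenshtein score measures, here is a small plain-Python sketch of the classic dynamic-programming algorithm. It assumes the score is the edit distance divided by the length of the longer string; the function names are hypothetical and this is not the RapidFuzz implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # delete char_a
                    current[j - 1] + 1,      # insert char_b
                    previous[j - 1] + cost,  # substitute, or keep on a match
                )
            )
        previous = current
    return previous[-1]


def normalized_levenshtein(a: str, b: str) -> float:
    """Distance scaled into [0, 1]; assumes normalization by the longer string."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


print(levenshtein("kitten", "sitting"))  # 3 edits
print(normalized_levenshtein("kitten", "sitting"))  # 3 edits over 7 characters
```

Under this assumption, identical strings score 0.0 and strings with nothing in common score 1.0.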
You can also instantiate the StringDistanceEvalChain directly and pass the metric through its distance field:
from langchain.evaluation import StringDistance, StringDistanceEvalChain
evaluator = StringDistanceEvalChain(distance=StringDistance.INDEL)
evaluator.evaluate_strings(
prediction="My name is Harpreet, and I love to learn LangChain",
reference="Harpreet loves learning langchain",
)
# returns a dict of the form {'score': <normalized Indel distance>}
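The Indel metric can be sketched in plain Python as well. Because only insertions and deletions are allowed, the distance follows from the length of the longest common subsequence (LCS). The function names are hypothetical, and dividing by the combined length of both strings is an assumption about the normalization.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def indel_distance(a: str, b: str) -> int:
    """Insertions plus deletions needed to turn a into b."""
    # Every character outside the LCS must be deleted from a or inserted from b.
    return len(a) + len(b) - 2 * lcs_length(a, b)


def normalized_indel(a: str, b: str) -> float:
    """Distance scaled into [0, 1]; assumes normalization by the combined length."""
    total = len(a) + len(b)
    return indel_distance(a, b) / total if total else 0.0


print(indel_distance("lewenstein", "levenshtein"))  # 3
```

A substitution counts as one deletion plus one insertion here, so Indel distances are never smaller than Levenshtein distances for the same pair.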
To measure semantic similarity (or dissimilarity) between a prediction and a reference label string, you can apply a vector distance metric to the two embedded representations using the embedding_distance evaluator.
Note: This returns a distance score, meaning that the lower the number, the more similar the prediction is to the reference, according to their embedded representation.
The distance measures you can choose from are:
"cosine": This is computed as (1 - cosine similarity). The cosine similarity measures the cosine of the angle between two vectors. A cosine similarity of 1 means the vectors point in the same direction, while a value of 0 means they are orthogonal (entirely dissimilar). Therefore, a cosine distance of 0 indicates that the embeddings are identical in direction, and a value of 1 indicates they are entirely dissimilar.
"euclidean": It is the straight-line distance between two points in Euclidean space.
"manhattan": It is the sum of the absolute differences of their coordinates. In a 2D space, it represents the distance between two points measured along the axes at right angles.
"chebyshev": It is the maximum absolute difference between elements of the vectors. It’s essentially the infinity norm of the difference between the vectors.
"hamming": It measures the minimum number of substitutions required to change one string into the other or the minimum number of errors that could have transformed one string into the other. In the context of this code, it seems to be applied to vectors by determining the proportion of differing vector elements.
In general, cosine distance is a common choice for text embeddings. However, it’s beneficial to experiment with different metrics based on your specific needs and validate them against a known benchmark or application outcome.
from langchain.evaluation import load_evaluator
evaluator = load_evaluator("embedding_distance")
evaluator.evaluate_strings(
prediction="My name is Harpreet, and I love to learn LangChain",
reference="Harpreet loves learning langchain"
)
# {'score': 0.0404781648420105}
evaluator = load_evaluator(
"embedding_distance",
distance_metric="euclidean"
)
evaluator.evaluate_strings(
prediction="My name is Harpreet, and I love to learn LangChain",
reference="Harpreet loves learning langchain"
)
# {'score': 0.2844376766821911}
The constructor uses OpenAI embeddings by default, but you can configure this however you want. Below, we use local HuggingFace embeddings:
from langchain.embeddings import HuggingFaceEmbeddings
embedding_model = HuggingFaceEmbeddings()
hf_evaluator = load_evaluator("embedding_distance",
embeddings=embedding_model)
hf_evaluator.evaluate_strings(
prediction="My name is Harpreet, and I love to learn LangChain",
reference="Harpreet loves learning langchain"
)
# {'score': 0.2803533789378635}
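The distance measures listed earlier can be sketched in plain Python on two toy vectors. This is only an illustration of the formulas, using hypothetical function names and made-up three-dimensional vectors in place of real embeddings; treating Hamming as the proportion of differing elements follows the description above and is an assumption.

```python
import math


def cosine_distance(a, b):
    """1 minus the cosine of the angle between a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between the two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan_distance(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def chebyshev_distance(a, b):
    """Largest absolute coordinate difference (infinity norm)."""
    return max(abs(x - y) for x, y in zip(a, b))


def hamming_distance(a, b):
    """Proportion of positions where the vectors differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


# Toy vectors standing in for embeddings (illustrative values only).
a = [1.0, 2.0, 3.0]
b = [1.0, 0.0, 7.0]

for name, fn in [
    ("cosine", cosine_distance),
    ("euclidean", euclidean_distance),
    ("manhattan", manhattan_distance),
    ("chebyshev", chebyshev_distance),
    ("hamming", hamming_distance),
]:
    print(name, fn(a, b))
```

Note that the metrics live on different scales, so scores from one metric should not be compared against thresholds tuned for another.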
The RegexMatchStringEvaluator checks if a regex pattern matches the prediction string. This is useful for validating outputs.
from langchain.evaluation import RegexMatchStringEvaluator
evaluator = RegexMatchStringEvaluator()
evaluator.evaluate_strings(
prediction="The date is 2022-01-01",
reference="The date is 2022-01-01"
)
# {'score': 1}
# Check for the presence of a MM-DD-YYYY string.
evaluator.evaluate_strings(
prediction="The delivery will be made on 2024-01-05",
reference=".*\\b\\d{2}-\\d{2}-\\d{4}\\b.*"
)
# {'score': 0}
evaluator.evaluate_strings(
prediction="The delivery will be made on 01-05-2024",
reference=".*\\b\\d{2}-\\d{2}-\\d{4}\\b.*"
)
# {'score': 1}
To match against multiple patterns, use a regex union “|”.
# Check for the presence of a MM-DD-YYYY string or YYYY-MM-DD
evaluator.evaluate_strings(
prediction="The delivery will be made on 01-05-2024",
reference="|".join([".*\\b\\d{4}-\\d{2}-\\d{2}\\b.*", ".*\\b\\d{2}-\\d{2}-\\d{4}\\b.*"])
)
# {'score': 1}
You can specify any regex flags for the RegexMatchStringEvaluator to use when matching.
import re
evaluator = RegexMatchStringEvaluator(
flags=re.IGNORECASE
)
evaluator.evaluate_strings(
prediction="My name is Harpreet, and I love to learn LangChain",
reference="Harpreet loves learning langchain"
)
# {'score': 0}
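The scoring logic of the regex examples can be reproduced with Python's standard re module. The sketch below assumes the pattern is applied with re.match, which anchors at the start of the prediction; regex_match_score is a hypothetical helper, not the evaluator itself.

```python
import re


def regex_match_score(prediction: str, pattern: str, flags: int = 0) -> int:
    """Return 1 if the pattern matches from the start of the prediction, else 0."""
    return int(re.match(pattern, prediction, flags) is not None)


# Same MM-DD-YYYY pattern as in the examples above, written as a raw string.
date_pattern = r".*\b\d{2}-\d{2}-\d{4}\b.*"

iso_score = regex_match_score("The delivery will be made on 2024-01-05", date_pattern)
us_score = regex_match_score("The delivery will be made on 01-05-2024", date_pattern)

# Flags change how the pattern is interpreted, here by ignoring case.
text = "HARPREET loves learning LangChain"
pattern = "harpreet loves learning langchain"
case_sensitive = regex_match_score(text, pattern)
case_insensitive = regex_match_score(text, pattern, flags=re.IGNORECASE)

print(iso_score, us_score, case_sensitive, case_insensitive)
```

The leading `.*` in the date pattern is what lets a start-anchored match find a date in the middle of a sentence.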
As we journey through the multifaceted landscape of language model evaluation, it becomes evident that a one-size-fits-all approach is not enough.
From the precision of exact matches to the interpretive power of embedding distances, each evaluation technique offers a unique lens through which we can scrutinize our models. The role of regex in format validation and the nuanced ways string distance algorithms operate underscore the richness and diversity of tools at our disposal.
For developers and AI enthusiasts, understanding and leveraging these evaluation methods are crucial steps toward building applications that not only function seamlessly but also uphold the standards of reliability and accuracy.
A comprehensive toolkit like this ensures we remain equipped to meet challenges, validate outputs, and drive innovation. As you conclude this guide, I hope you’re better prepared and inspired to harness the power of these evaluative techniques, ensuring that your AI applications are always a cut above the rest.