
Perplexity for LLM Evaluation



Perplexity is, historically speaking, one of the “standard” evaluation metrics for language models. And while recent years have seen a surge in more complex and robust metrics, including LLM-based evaluations, perplexity still has a lot of value as a component in your evaluation suite. If you want to build effective evaluation pipelines—or just understand what researchers mean when they report perplexity scores—you need to have a grasp on what perplexity is and how it can be implemented.

Perplexity seeks to quantify the “uncertainty” a model experiences when predicting the next token in a sequence. High uncertainty occurs when the model is unsure about the next word or token, which can happen when the input is ambiguous or the model hasn’t seen similar examples during training. Quantifying this uncertainty helps us judge when a model might need human oversight or further training, allowing us to handle those cases differently. This is especially critical in high-stakes situations, like medical or legal advice, where an overconfident wrong answer could have serious consequences.

But this is just scratching the surface of perplexity. In this article, I want to go in depth, covering perplexity’s mathematical basis, underlying intuitions, and limitations. I’ll show you how to implement perplexity from scratch in Python, and how to add perplexity to your evaluation suite using Opik, our open-source LLM evaluation framework.

Let’s dive in!

A Little Background on Perplexity

Perplexity was first introduced in 1977 by a team of IBM researchers working on speech recognition. The team, led by Frederick Jelinek, was looking for a metric that could measure the difficulty a statistical model experienced while making a prediction. As an interesting aside, Jelinek is the original author of the famous quote “Every time I fire a linguist, the performance of the speech recognizer goes up.”

 

Table from the original paper by Frederick Jelinek and colleagues that first introduced the term: Perplexity—a measure of the difficulty of speech recognition tasks

 

The key insight of the initial perplexity paper is that by applying concepts from information theory to a model’s internal state, we can begin to quantify more subtle qualities of a model. While the original authors were looking for a metric to approximate the “difficulty” of speech recognition tasks, researchers working on NLP quickly recognized perplexity as relevant to their work as well.

Throughout the 1980s and 1990s, perplexity emerged as the key metric for evaluating the performance of n-gram models. Perplexity was used to measure how well these models captured linguistic patterns by quantifying the average uncertainty of predictions. “Uncertainty”  was calculated using entropy and its close mathematical relative, cross-entropy, both of which we’ll explore in more detail shortly.

Perplexity remains a primary benchmark to this day and is a popular metric for evaluating sequential neural networks (including the GPT family of models). Its many advantages, and its historical role in benchmarking, make it common even in contemporary research. At the same time, its many limitations make it insufficient as a standalone evaluation metric, especially for modern LLMs.

In order to gain a more intuitive understanding of perplexity and its pros and cons, we need to first explore the underlying mathematics. Namely, we need to understand entropy and cross-entropy. If you already feel comfortable with these topics, feel free to skip the following section.

Entropy, Cross-entropy and Information Theory

Perplexity, as an evaluation metric, has its roots in information theory and probabilistic modeling, building on Claude Shannon’s work on entropy in the 1940s. Shannon used language entropy to describe the amount of information in a message, specifically how efficiently the text of a language can be encoded into binary digits and recovered again:

 

“The entropy is a statistical parameter which measures, in a certain sense, how much information is produced on the average for each letter of a text in the language. If the language is translated into binary digits (0 or 1) in the most efficient way, the entropy H is the average of binary digits required per letter of the original language.” – (Claude Shannon in Prediction and Entropy of Printed English, 1951)

 

As described by Shannon in the 40s and 50s, language entropy quantifies the average amount of information contained in a word or sequence of words, reflecting how unpredictable the next word is based on previous context. In other words, language entropy refers to the degree of uncertainty or unpredictability in a language’s word distribution. 

In the experiment below, Shannon counted how many guesses it took a human being to correctly predict each letter (including spaces) in the sentence, given only the preceding letters in the sequence.

 

Graphic from Prediction and Entropy of Printed English by Claude Shannon (1951), showing the number of guesses it took humans to identify the correct next character in a sentence, including spaces. This research led to the development of perplexity as a metric.

 

Entropy is calculated as H(P) below, where p(wᵢ) is the probability of the ith word occurring in a given context, and the summation runs over all possible words in the vocabulary. The negative sign ensures that entropy is a non-negative value, since log p(wᵢ) is negative.

 

H(P) = −Σᵢ p(wᵢ) log p(wᵢ)

 

Higher entropy values indicate lower predictability and greater diversity in word choice, while lower values suggest more predictable language patterns, reflecting the underlying complexity of the language being modeled.

Because the output of a large language model is typically a probability distribution calculated across all possible output tokens, entropy is very straightforward to calculate. A natural next question is how we might use entropy to train a language model, and that is where entropy’s close relative cross-entropy comes in.
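To make this concrete, here’s a minimal sketch (using toy, hand-picked distributions rather than a real model’s output) showing how entropy shrinks as the distribution becomes more peaked:

import torch

# Toy next-token distributions over a 4-token vocabulary (hypothetical values)
uniform = torch.tensor([0.25, 0.25, 0.25, 0.25])  # maximally uncertain
peaked = torch.tensor([0.94, 0.02, 0.02, 0.02])   # nearly certain

def entropy(p: torch.Tensor) -> float:
    """H(P) = -sum_i p(w_i) * log p(w_i), measured in nats."""
    return -(p * p.log()).sum().item()

print(entropy(uniform))  # ~1.39 nats (log 4): every token is equally likely
print(entropy(peaked))   # ~0.29 nats: the model is close to certain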

While entropy measures the average uncertainty in a single probability distribution, cross-entropy quantifies the difference between two probability distributions. In the case of language modeling, these would be the true distribution, P, and the model’s predicted distribution, Q. In this way cross-entropy provides a way to assess how well the model’s predictions approximate the actual distribution of the data.

 

H(P, Q) = −Σᵢ p(xᵢ) log q(xᵢ)

Here, p(xᵢ) represents the true probability of outcome xᵢ and q(xᵢ) denotes the predicted probability. Lower cross-entropy values indicate a better fit, as they imply that the model’s predicted probabilities are more closely aligned with the actual distribution.

In the context of model training, cross-entropy is used as the backbone of many loss functions. It measures the difference between how likely the model is to select a given token, versus the true likelihood that a given token is correct. 
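As a rough sketch of that connection (the logits and token ids below are made up for illustration), the manual cross-entropy computation matches PyTorch’s built-in loss:

import torch
import torch.nn.functional as F

vocab_size = 10
logits = torch.randn(5, vocab_size)      # hypothetical raw model scores for 5 positions
targets = torch.tensor([3, 1, 4, 1, 5])  # "true" next-token ids at those positions

# Manual cross-entropy: -log q(true token), averaged over positions
log_q = F.log_softmax(logits, dim=-1)
manual = -log_q.gather(dim=-1, index=targets.unsqueeze(-1)).squeeze(-1).mean()

# PyTorch's built-in loss computes the same quantity in one call
builtin = F.cross_entropy(logits, targets)

print(manual.item(), builtin.item())  # the two values match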

Once you have a grasp of entropy and cross-entropy, perplexity follows intuitively.

So, what is perplexity, really?

Like entropy and cross-entropy, perplexity also quantifies a model’s uncertainty in predicting the next token in a sequence. So, why not just use entropy or cross-entropy? It turns out perplexity is far more intuitive in explaining model behavior.

 

PPL = exp(H(P, Q)) = exp( −(1/N) Σᵢ₌₁ᴺ log q(wᵢ | w₁, …, wᵢ₋₁) )

 

Mathematically speaking, perplexity is defined as the exponentiated average negative log-likelihood of the predicted words in a sequence. Or, less verbosely, perplexity is cross-entropy with the exponential function applied. This transformation might seem somewhat arbitrary at first, but it makes a big difference, especially in terms of interpretability.

Because cross-entropy is an average of negative log probabilities, taking its exponential “undoes” the log and moves the measure back out of log space. The resulting value represents the tangible count of likely choices the model considers at each step, often called the “effective branching factor.”

Perplexity, then, is essentially a measure of how many options the model finds plausible on average, with lower values indicating fewer options (more confident predictions) and higher values indicating more options (greater uncertainty).
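As a quick sketch (the cross-entropy value is invented for illustration), the conversion from cross-entropy to perplexity is a single exponentiation:

import math

# Suppose a model's cross-entropy (mean negative log-likelihood) is 2.303 nats
cross_entropy = 2.303

perplexity = math.exp(cross_entropy)
print(perplexity)  # ~10: the model behaves as if choosing among ~10 equally likely tokens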

 

Entropy: Measures the average uncertainty in a single probability distribution P.
Cross-entropy: Measures how well a predicted distribution Q approximates the true distribution P.
Perplexity: The exponentiation of cross-entropy; measures how many likely candidate tokens the model is effectively choosing between.

 

In summary, entropy measures the inherent uncertainty in a true probability distribution, reflecting the average unpredictability of outcomes, such as words in a language. Cross-entropy extends this concept by measuring the difference between the true distribution of the data and the predicted distribution from a model, penalizing inaccurate predictions. Perplexity builds on cross-entropy by transforming it into a more interpretable form, using the exponential function to express how many equally likely word choices the model is effectively considering.

Note that the perplexity score of a language model on a sequence of tokens aggregates the per-token scores: it is the exponential of the average per-token negative log-likelihood, which is the geometric mean of the per-token perplexities. This means that if a language model has a perplexity of 10, the model is, on average, selecting between 10 equally likely options for the next word.

The perplexity score of a language model on a sequence of tokens aggregates the per-token perplexity scores.

Using this intuition, a lower perplexity score is better because it indicates that a model is effectively “choosing” between fewer viable options for the next word and is “less surprised.” A higher perplexity score, on the other hand, indicates more “uncertainty.” 

Of course, it is entirely possible for a language model to be “confident” and “incorrect,” so perplexity should not be confused with an accuracy metric. But we’ll dive into more of perplexity’s limitations later on. First, let’s explore some of its advantages.

Advantages of Perplexity

As mentioned, one of the biggest advantages of perplexity is that it is highly intuitive and explainable in a field that is notoriously opaque. 

Having an estimate of a model’s certainty is also especially useful when using an LLM to plan or guide actions. While high certainty suggests the model has strong backing for a given prediction, low certainty can prompt further human oversight or additional checks before execution. 

Perplexity is also computationally straightforward to calculate, making it fast and efficient. This also allows practitioners to evaluate model performance in real-time during training, helping to identify issues and improvements promptly and leading to faster development cycles.

As we’ll see in the next section on perplexity’s limitations, it is not an end-all evaluation metric for LLMs. However, given its explainability and low-overhead, perplexity is a quick and useful first-pass metric that works well when used in conjunction with other LLM evaluation metrics.

Limitations of Perplexity

The most important limitation of perplexity is that it does not convey a model’s “understanding.” Perplexity is strictly a measure of uncertainty, and a model being uncertain doesn’t mean it is right or wrong. A model may be correct but unconfident or wrong but confident. So, a perplexity score isn’t a measure of accuracy, just of confidence.

It is also difficult to use perplexity as a benchmark between models. Perplexity scores are influenced by various model-specific factors, such as tokenization method, dataset, pre-processing steps, vocabulary size, and context length. For example, a character-level model may have a lower perplexity than a word-level model, but that doesn’t necessarily mean the character-level model is better. 

A model can also achieve a low perplexity score by assigning high probabilities to common words, like articles and conjunctions, leading to a misleadingly low score. Overfit models can show low perplexity but lack true understanding. Research also suggests that perplexity doesn’t correlate well with an LLM’s long-term understanding, likely because it fails to capture long-term dependencies. Additionally, perplexity can be skewed by punctuation and repeated text spans, which lower scores but don’t necessarily improve text quality.

While perplexity has limitations, it remains a valuable first-pass metric when combined with other task-specific LLM evaluation metrics, offering both interpretability and efficiency.

Implementing Perplexity From Scratch in Python

Using what we’ve learned so far about perplexity, let’s implement it from scratch in Python so we can apply it directly to our LLM outputs. Note that because perplexity is such a common evaluation metric, there are several pre-built modules to implement it in Python, including Hugging Face’s evaluate.metrics.perplexity and perplexed from Stability AI. But coding the metric from scratch will help build intuition for what perplexity is actually doing under the hood. Later, we’ll test our function out on GPT-2 and learn how to automatically track the perplexity scores of our LLM using a custom metric in Opik.

Throughout our implementation, we’ll be using PyTorch and HuggingFace’s Transformers library.

Our basic perplexity function will take logits and target labels as inputs. 

Logits are the raw scores output by the model for each token in the vocabulary for a given position in the input sequence. For each position in the sequence, the model outputs a vector of logits, where each entry in that vector corresponds to a token in the vocabulary.

Shape of the logits tensor in an LLM’s output

The targets will be the ground truth label tensors.
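If you’d like to inspect these shapes yourself, here’s a minimal sketch with GPT-2 (the example sentence is arbitrary and the printed shapes are just illustrative):

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

print(inputs["input_ids"].shape)  # (batch_size, seq_length), e.g. torch.Size([1, 4])
print(outputs.logits.shape)       # (batch_size, seq_length, vocab_size), e.g. torch.Size([1, 4, 50257])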

To calculate perplexity, we’ll need to:

  1. Convert the logits to log probabilities using the log_softmax function, which normalizes the scores.
  2. Gather the log probabilities of the correct target tokens.
  3. Since log probabilities are negative, multiply them by -1 to get each token’s negative log-likelihood (its “surprisal”).
  4. Take the mean negative log-likelihood over all tokens in the sequence. This value is the cross-entropy.
  5. Take the exponential of the cross-entropy. This value is the perplexity, or effective branching factor, of the sequence.

import torch

def calculate_perplexity(logits, target):
    """
    Calculate perplexity from logits and target labels.

    Args:
    - logits (torch.Tensor): Logits output from the model (batch_size, seq_length, vocab_size).
    - target (torch.Tensor): Ground truth labels (batch_size, seq_length).

    Returns:
    - perplexity (float): The perplexity score.
    """

    # Convert logits to log probabilities
    log_probs = torch.nn.functional.log_softmax(logits, dim=-1)

    # Gather the log probabilities for the correct target tokens
    # log_probs has shape (batch_size, seq_length, vocab_size)
    # target has shape (batch_size, seq_length)
    # The gather method will pick the log probabilities of the true target tokens
    target_log_probs = log_probs.gather(dim=-1, index=target.unsqueeze(-1)).squeeze(-1)

    # Calculate the negative log likelihood
    negative_log_likelihood = -target_log_probs

    # Calculate the mean negative log likelihood over all tokens
    mean_nll = negative_log_likelihood.mean()

    # Calculate perplexity as exp(mean negative log likelihood)
    perplexity = torch.exp(mean_nll)

    return perplexity.item()

# Example usage
# Simulate a batch of logits (batch_size=2, seq_length=4, vocab_size=10)
logits = torch.randn(2, 4, 10)
# Simulate ground truth target tokens
target = torch.tensor([[1, 2, 3, 4], [4, 3, 2, 1]])

# Calculate perplexity
perplexity = calculate_perplexity(logits, target)
print(f'Perplexity: {perplexity}')

The function above calculates perplexity from a mathematical perspective, but it requires some adjustments to handle raw text, as you would encounter in real-world scenarios. 

Now that we’ve covered the math behind perplexity, let’s modify the function to work with the inputs and outputs of a large language model. For this version of our function, we’ll want to:

  1. Shift the logits and target tensors so that each model prediction (logit) matches the actual token in the sequence (target/input_ids). Since each token is predicted based on the previous tokens, the prediction for token 𝑡 should be compared to the actual token at 𝑡 + 1.
  2. Add batching to handle texts longer than the model’s context length by splitting them into smaller chunks for parallel processing.
  3. Use padding tokens to standardize input lengths across sentences of varying lengths.
  4. Apply an attention mask to exclude padding tokens from perplexity calculations.
  5. Average the per-token scores within each sequence for a sequence-level perplexity score.
  6. Average the sequence-level scores for an overall batch-level perplexity score.

Since each token is predicted based on the previous tokens, the prediction for token 𝑡 is compared to the actual token at 𝑡 + 1.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model and tokenizer (e.g., GPT-2)
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Assign the EOS token as the padding token
tokenizer.pad_token = tokenizer.eos_token

def calculate_batch_perplexity(input_texts):
    """
    Calculate perplexity for a batch of input texts using a pretrained language model.

    Args:
    - input_texts (List[str]): A list of input texts to evaluate.

    Returns:
    - dict: A dictionary with per-text perplexity scores ("perplexities") and their mean ("mean_perplexity").
    """
    # Tokenize the batch of texts with padding for uniform length
    inputs = tokenizer(
        input_texts, return_tensors="pt", padding=True, truncation=True
    )

    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]

    # Pass the input batch through the model to get logits
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)
        logits = outputs.logits

    # Shift the logits and input_ids to align targets correctly
    # Logits dimensions are: (batch_size, seq_length, vocab_size) 
    shift_logits = logits[:, :-1, :]  # Ignore the last token's logits
    shift_labels = input_ids[:, 1:]   # Skip the first token in the labels

    # Compute log probabilities
    log_probs = torch.nn.functional.log_softmax(shift_logits, dim=-1)

    # Gather the log probabilities for the correct tokens
    target_log_probs = log_probs.gather(dim=-1, index=shift_labels.unsqueeze(-1)).squeeze(-1)

    # Mask out positions corresponding to padding tokens
    target_log_probs = target_log_probs * attention_mask[:, 1:].to(log_probs.dtype)

    # Compute the mean negative log-likelihood for each sequence
    negative_log_likelihood = -target_log_probs.sum(dim=-1) / attention_mask[:, 1:].sum(dim=-1)

    # Compute perplexity for each sequence
    perplexities = torch.exp(negative_log_likelihood)

    # Take the mean of the per-sequence perplexities for a batch-level score
    mean_perplexity_score = perplexities.mean().item()

    return {"perplexities": perplexities.tolist(), "mean_perplexity": mean_perplexity_score}

# Example usage
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "A journey of a thousand miles begins with a single step."
]
print(f"Perplexity scores: {calculate_batch_perplexity(texts)}")

This function takes a list of texts as input and outputs a dictionary containing a perplexity score for each text in the list, as well as the average perplexity score across all sequences in the batch.

Note that taking the average of perplexity scores across texts of different lengths can lead to a skewed overall perplexity score for a couple of reasons. 

First, perplexity scores tend to be more stable for longer sequences, while shorter sequences may have higher variance, leading to outliers. Second, taking a simple arithmetic mean across scores for texts of varying lengths can give disproportionate weight to tokens in shorter sequences. Nevertheless, using the arithmetic mean is currently the most common approach to calculating overall perplexity, so we use it here for the sake of consistency. 
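If that skew matters for your use case, a common alternative is to weight each sequence by its token count, i.e. exponentiate the corpus-level average negative log-likelihood instead of averaging per-sequence perplexities. Here’s a minimal sketch with hypothetical per-sequence values:

import math

# Hypothetical per-sequence mean negative log-likelihoods and scored-token counts
seq_nlls = [2.1, 3.4, 2.8]
seq_tokens = [120, 8, 45]

# Corpus-level perplexity: weight each sequence by how many tokens it contributes
total_nll = sum(nll * n for nll, n in zip(seq_nlls, seq_tokens))
corpus_ppl = math.exp(total_nll / sum(seq_tokens))

# Simple arithmetic mean of per-sequence perplexities, as used above
mean_ppl = sum(math.exp(nll) for nll in seq_nlls) / len(seq_nlls)

print(corpus_ppl, mean_ppl)  # the short, high-NLL sequence inflates the unweighted mean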

Implementing Perplexity in Opik

In the real world, you’ll likely want to use an evaluation framework to implement LLM metrics. In this section, we’ll implement perplexity in Opik, Comet’s open source LLM evaluation framework.

Here, we use our original perplexity function and modify it slightly to implement it as a custom Opik metric with a `score` method that returns a `ScoreResult` object:


import torch

from opik.evaluation.metrics import base_metric, score_result

class Perplexity(base_metric.BaseMetric):
    """
    Perplexity (PPL) is a common LLM evaluation metric defined as the exponentiated average
    negative log-likelihood of a sequence.

    For more information on perplexity, see:
    https://en.wikipedia.org/wiki/Perplexity

    Args:
        name: The name of the metric, perplexity.
    """

    def __init__(
        self,
        name: str = "Perplexity",
    ):
        super().__init__(name=name)

    def score(
        self, input_ids: torch.Tensor, logits: torch.Tensor, attention_mask: torch.Tensor
    ) -> score_result.ScoreResult:
        """
        Calculate the perplexity score of each token given the previous tokens in the sequence.

        Args:
            input_ids: input ids of the text sequence input to the model (torch.Tensor)
            logits: output logits of the model (torch.Tensor)
            attention_mask: attention mask

        Returns:
            score_result.ScoreResult: A ScoreResult object
        """

        # Shift the logits and input_ids to align targets correctly
        shift_logits = logits[:, :-1, :]  # Ignore the last token's logits
        shift_labels = input_ids[:, 1:]   # Skip the first token in the labels

        # Compute log probabilities
        log_probs = torch.nn.functional.log_softmax(shift_logits, dim=-1)

        # Gather the log probabilities for the correct tokens
        target_log_probs = log_probs.gather(dim=-1, index=shift_labels.unsqueeze(-1)).squeeze(-1)

        # Mask out positions corresponding to padding tokens
        target_log_probs = target_log_probs * attention_mask[:, 1:].to(log_probs.dtype)

        # Compute the mean negative log-likelihood for each sequence
        negative_log_likelihood = -target_log_probs.sum(dim=-1) / attention_mask[:, 1:].sum(dim=-1)

        # Take the exp(negative_log_likelihood)
        perplexities = torch.exp(negative_log_likelihood)

        # Take the mean of the per-sequence perplexity scores
        mean_perplexity_score = torch.mean(perplexities).item()

        return score_result.ScoreResult(value=mean_perplexity_score, name=self.name)

perplexity = Perplexity()

After defining perplexity as a custom metric, we can use it by:

  • Defining the model’s forward pass in your_llm_application.
  • Calling our application in evaluation_task and returning a dictionary with keys that match the parameters expected by our custom Perplexity metric above.
  • Adding tracking by decorating our functions with Opik’s @track decorator to automatically log relevant data to the platform.
  • Passing the evaluation_task function to Opik’s evaluate function, which runs and logs the full evaluation process, including calculating perplexity scores for each call.

You can find the full code in the Colab.


from opik import track
from opik.evaluation import evaluate

@track
def your_llm_application(input: str) -> dict:

    # Tokenize the batch of texts with padding for uniform length
    inputs = tokenizer(
        input, return_tensors="pt", padding=True, truncation=True
    )

    input_ids = inputs["input_ids"]
    attention_mask = inputs["attention_mask"]

    # Pass the input batch through the model to get logits
    with torch.no_grad():
        outputs = model(input_ids, attention_mask=attention_mask)

    return {"input_ids": input_ids,
            "logits": outputs.logits,
            "attention_mask": attention_mask}

@track
def evaluation_task(x):
    llm_outputs = your_llm_application(x['input'])
    return {
        "input_ids": llm_outputs['input_ids'],
        "logits": llm_outputs['logits'],
        "attention_mask": llm_outputs['attention_mask']
    }

evaluation = evaluate(
    experiment_name="My ppl experiment",
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[perplexity],
    experiment_config={
        "model": model_name
    }
)

And here is what the output of your evaluation should look like from within the Opik UI:

Our perplexity metric calculations across individual dataset items and the dataset as a whole, as stored in Opik

Adding Perplexity to Your LLM Evaluation Suite

Perplexity is extremely popular for its intuitiveness and efficiency, but it only provides a partial picture of a language model’s performance. It captures a model’s certainty about its predictions but, notably, it does not convey a model’s “understanding.” 

For a more complete understanding of a model’s behavior, perplexity should be used alongside other evaluation metrics, such as accuracy and fluency, as well as task-specific metrics like relevance, coherence, factuality, and hallucination detection. Because of its computational efficiency, perplexity is particularly useful as a first-pass metric, but has significant limitations that require additional evaluation methods to address. 

More nuanced evaluation methods include using an LLM-as-a-judge, but these methods are often less interpretable. Especially when they rely on the same language model being evaluated, they can introduce biases, circular reasoning, and high variability in results. These limitations make it essential to pair LLM-as-a-judge metrics with other evaluation methods, like perplexity, which has been shown to outperform LLMs-as-a-judge prompted with only basic instructions at estimating text quality.

Image from Murugadoss, B., Poelitz, C., Drosos, I., Le, V., McKenna, N., Negreanu, C.S., Parnin, C., & Sarkar, A. (2024). Evaluating the Evaluator: Measuring LLMs’ Adherence to Task Evaluation Instructions. Retrieved from https://arxiv.org/html/2408.08781v1

Using perplexity as part of a “suite” of metrics is useful beyond just the extra coverage that additional metrics provide, however. Seeing where these metrics diverge can help you identify problematic data and points of failure in your evaluation suite.

For example, a high perplexity and high accuracy score may indicate that while the model is correct in specific answers, it is uncertain overall and needs additional training. Likewise, a model with low perplexity but low coherence may produce text it is confident in, but that doesn’t flow logically, which may not be acceptable for your application and which could point to issues with sentence structure in the training data. Conversely, a model with high perplexity and high coherence suggests the model is uncertain about its predictions even when producing coherent text. As a final example, if both hallucination detection scores and perplexity scores are high, the model is both uncertain and likely producing fabricated content, suggesting potential weaknesses in grounding or fact-based reasoning within the training pipeline. Monitoring these divergences helps identify specific areas for model and data improvement to better align with your model’s intended performance.
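In Opik, monitoring these divergences can be as simple as passing several scoring metrics to the same evaluate call. The sketch below assumes your Opik version ships the built-in Hallucination judge metric and reuses the dataset and task defined earlier; the task would also need to return the string fields the judge metric expects:

from opik.evaluation import evaluate
from opik.evaluation.metrics import Hallucination  # assumed built-in LLM-as-a-judge metric

# Score each dataset item with both the custom Perplexity metric defined above
# and an LLM-judged hallucination metric, so their divergences can be compared per item.
evaluation = evaluate(
    experiment_name="ppl_plus_judge",  # hypothetical experiment name
    dataset=dataset,
    task=evaluation_task,
    scoring_metrics=[perplexity, Hallucination()],
    # Note: each metric reads the keys it needs from the task output, so the task
    # would also have to return e.g. "input" and "output" strings for the judge metric.
)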

In summary, perplexity is a valuable metric for evaluating language models by measuring their confidence in predicting text sequences. While it offers useful insights, perplexity should be used alongside other metrics to get a fuller picture of model performance. This approach helps highlight specific strengths and weaknesses, allowing for more targeted improvements and more reliable assessments of model quality.

Abby Morgan

AI/ML Growth Engineer @ Comet