
Best Practices When Evaluating Fine-Tuned LLMs

Welcome to Lesson 8 of 11 in our free course series, LLM Twin: Building Your Production-Ready AI Replica. You’ll learn how to use LLMs, vector DBs, and LLMOps best practices to design, train, and deploy a production-ready “LLM twin” of yourself. This AI character will write like you, incorporating your style, personality, and voice into an LLM. For a full overview of course objectives and prerequisites, start with Lesson 1.


This lesson will focus on evaluating our fine-tuned LLM Twin model.

Before doing that, let’s walk through a short recap, to understand how we’ve gotten to the LLM evaluation stage:

→ In Lesson 6, we showcased extracting filtered data samples from QDrant. Using knowledge distillation, we had GPT3.5-Turbo structure and generate the fine-tuning dataset, which is versioned with Comet.

→ In Lesson 7, we built the fine-tuning pipeline using the versioned datasets we’ve logged on Comet, composed the workflow, and deployed the pipeline on Qwak [2] to train our model.

→ In Lesson 8, we’ll focus on common evaluation methods for the various tasks LLMs perform. For our content-generation use case specifically, we’ll rely on human-in-the-loop review and use a larger model to assess coherence and quantify other metrics for our LLM generations.

It is important to differentiate between evaluating an LLM in isolation and evaluating an LLM-based system.

During LLM evaluation, we focus only on how our fine-tuned model generates content and how coherent that content is.

Here’s what we’re going to learn in this lesson:

  • Common LLM evaluation methods for different LLM tasks.
  • Composing evaluation prompt templates for specific use cases.
  • Prompt, Chain Monitoring, and CometLLM integration.
  • The LLM-Twin model evaluation workflow.
[Figure: LLM Twin Model Evaluation. Flow chart showing the steps to evaluate a fine-tuned LLM.]

Table of Contents

  1. What is LLM evaluation?
  2. Evaluation Techniques
  3. How we evaluate our LLM-Twin Model
  4. Comet Prompt Monitoring
  5. Conclusion

What is LLM evaluation?

LLM evaluation is a crucial process used to assess the performance and capabilities of the models. It involves a series of tests and analyses to determine how well the model understands, interprets, and generates human-like text.

LLM evaluation is a fairly recent and fast-evolving field, so it is not straightforward, and there is no unified approach to measuring model performance.

Due to the generative nature of LLMs, the evaluation processes for these models involve both quantitative and qualitative assessments.

Ensuring the effectiveness and safety of LLMs in practical applications should be a mandatory goal. As these models become more integrated into diverse sectors, evaluating them to reduce hallucinations, verify accuracy, and support ethical use is crucial.

Several metrics have been proposed in the literature for evaluating the performance of LLMs. It is essential to use the right metrics suitable for the problem we are attempting to solve.

LLM Evaluation vs RAG Evaluation
LLM evaluation focuses on the model’s ability to generate coherent, relevant, and contextually appropriate text based solely on its pre-trained knowledge.
This involves assessing metrics such as fluency, coherence, relevance, and adherence to the given prompts.

RAG evaluation involves assessing how well the model integrates retrieved information into its responses. This requires evaluating not just the quality of the generated text, but also the accuracy and relevance of the retrieved information, and how effectively it enhances the final output.
Metrics for RAG models often include precision and recall of the retrieval process, as well as the overall coherence and relevance of the augmented generation.
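
As a rough illustration of the retrieval side, precision@k and recall@k for a single query can be computed from the ranked document IDs and a hand-labeled set of relevant IDs. This is a minimal sketch: the function name, the document IDs, and the labels are made up for the example and are not part of the LLM Twin codebase.

```python
def precision_recall_at_k(
    retrieved: list[str], relevant: set[str], k: int
) -> tuple[float, float]:
    """Return (precision@k, recall@k) for one query."""
    top_k = retrieved[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical data: ranked retriever output and hand-labeled ground truth.
retrieved_ids = ["doc_3", "doc_7", "doc_1", "doc_9"]
relevant_ids = {"doc_1", "doc_3", "doc_5"}

precision, recall = precision_recall_at_k(retrieved_ids, relevant_ids, k=3)
print(f"precision@3={precision:.2f} recall@3={recall:.2f}")
```

In practice these values would be averaged over a whole set of evaluation queries rather than read off a single one.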

Next, let’s go over a few commonly used model evaluation techniques.

Evaluation Techniques

Let’s split these techniques by their intended use case.

Quantitative evaluation

Quantitative evaluation involves statistical measures to assess the accuracy, fluency, and other aspects of the generated text.
Here are some common metrics:

  • Perplexity:
    Measures how well the model predicts the next token in a sequence. Lower perplexity indicates better performance.
  • BLEU (Bilingual Evaluation Understudy):
    Compares the n-gram overlap between the generated text and a reference text. Commonly used for machine translation, it’s also applicable to text-generation tasks. A higher BLEU score indicates better quality.
  • ROUGE (Recall-Oriented Understudy for Gisting Evaluation):
    Measures the overlap of n-grams, longest common subsequence, and word sequences between the generated text and reference texts.
    It’s widely used for evaluating summarization and translation models.
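
To make the first metric concrete: perplexity is the exponential of the average negative log-likelihood the model assigns to each token. The sketch below assumes you already have per-token natural-log probabilities from your model; the `perplexity` helper and the sample values are illustrative only.

```python
import math


def perplexity(token_log_probs: list[float]) -> float:
    """Exponential of the average negative log-likelihood per token."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)


# Made-up log-probabilities: a model that gives each token p=0.5
# versus one that gives each token p=0.1.
confident = [math.log(0.5)] * 3
uncertain = [math.log(0.1)] * 3

print(f"confident model: {perplexity(confident):.2f}")
print(f"uncertain model: {perplexity(uncertain):.2f}")
```

The more probability the model puts on the tokens that actually occur, the lower the resulting perplexity.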

In case the LLM output is in a structured format, one could evaluate it against classical ML metrics, such as the following:

  • Accuracy:
    The ratio of correctly predicted instances to the total instances.
    Useful for tasks where the output is categorical or where there is a clear right or wrong answer, for example, named entity recognition (NER).
  • Precision, Recall, and F1 Score:
    Precision is the ratio of true positive predictions to all positive predictions made by the model. Recall is the ratio of true positive predictions to all actual positive instances. The F1 score is the harmonic mean of precision and recall, quantifying the balance between the two. Valuable in classification or entity extraction tasks performed by LLMs.
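
For an entity-extraction style task, these three metrics can be computed by comparing the set of extracted items against a labeled set. The following is a small sketch under that assumption; the helper name and the example entities are hypothetical.

```python
def precision_recall_f1(
    predicted: set[str], expected: set[str]
) -> tuple[float, float, float]:
    """Set-based precision, recall, and F1 for extracted items."""
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical extraction result versus hand-labeled entities.
predicted_entities = {"Qdrant", "Comet", "OpenAI", "Paris"}
expected_entities = {"Qdrant", "Comet", "Qwak"}

p, r, f1 = precision_recall_f1(predicted_entities, expected_entities)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Here two of the four predictions are correct (precision) and two of the three expected entities are found (recall); F1 balances the two.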

Qualitative evaluation

Qualitative evaluation involves human-in-the-loop judgment or larger models assessing aspects like relevance, coherence, creativity, and appropriateness of the content.
This type of evaluation provides insights that quantitative metrics might miss.

  • Human Review:
    Having domain experts or general users review the generated content to assess its quality based on various criteria such as coherence, fluency, relevance, and creativity.
  • Human-in-the-loop:
    In Reinforcement Learning from Human Feedback (RLHF), humans rate the quality of model outputs, and this feedback is used to fine-tune the model through reinforcement learning techniques.
  • LLM-based Evaluation:
    Involves using a larger general-knowledge model to evaluate the model’s behavior.

In our particular case, quantitative metrics like BLEU and ROUGE are not applicable because they don’t yield valuable insights. We’re evaluating how well our fine-tuned LLM generates written content, and its task is not summarization- or translation-oriented, so in practice we can only evaluate the quality of the generated content using an LLM-based evaluation technique.

Why don’t BLEU & ROUGE work in our use case?

  1. They only measure n-gram overlap.
    The generated content might vary widely in wording while still reflecting the user’s query.
  2. They lack semantic understanding.
    They do not help evaluate the depth, coherence, or originality of the content.
  3. They can’t capture creativity.
    They can’t quantify stylistic elements or the overall human-like quality.
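
The first point can be seen with a toy experiment. The sketch below computes clipped n-gram precision, which is the building block of BLEU but not the full metric (there is no brevity penalty and no averaging across n-gram orders). The sentences and helper names are invented for illustration.

```python
from collections import Counter


def ngram_counts(tokens: list[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i : i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate: str, reference: str, n: int) -> float:
    """Share of candidate n-grams that also occur in the reference (clipped)."""
    cand = ngram_counts(candidate.lower().split(), n)
    ref = ngram_counts(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


reference = "vector databases store embeddings and find similar items quickly"
# Same idea, different wording.
paraphrase = (
    "a vector db keeps numeric representations so lookups "
    "of related content are fast"
)

for n in (1, 2):
    score = ngram_precision(paraphrase, reference, n)
    print(f"{n}-gram precision of the paraphrase: {score:.2f}")
```

The paraphrase conveys roughly the same message as the reference, yet it shares almost no n-grams with it, so an overlap-based score would rate it poorly.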

How we evaluate our LLM-Twin Model

We aim to verify whether our fine-tuned model can generate contextually accurate posts/articles that reflect the provided query.

Within this LLM evaluation stage, we’ll focus on this section of the LLM Twin system design:

[Figure: Section from LLM Twin’s System Design, showing the evaluation stage. Image by the author.]

Here’s the workflow overview:

1. Define the evaluation prompt template
2. Define the user query
3. Generate content based on the user query
4. Populate the evaluation template
5. Use GPT3.5-Turbo to evaluate
6. Log the evaluation prompt on Comet LLM

[Figure: LLM Twin Model Evaluation]

The Evaluation Prompt Template

from abc import ABC, abstractmethod

from langchain.prompts import PromptTemplate
from pydantic import BaseModel


class BasePromptTemplate(ABC, BaseModel):
    @abstractmethod
    def create_template(self, *args) -> PromptTemplate:
        pass

class LLMEvaluationTemplate(BasePromptTemplate):
    prompt: str = """
        You are an AI assistant and your task is to evaluate the output generated by another LLM.
        You need to follow these steps:
        Step 1: Analyze the user query: {query}
        Step 2: Analyze the response: {output}
        Step 3: Evaluate the generated response based on the following criteria and provide a score from 1 to 5 along with a brief justification for each criterion:

        Evaluation:
        Relevance - [score]
        [1 sentence justification why relevance = score]
        Coherence - [score]
        [1 sentence justification why coherence = score]
        Conciseness - [score]
        [1 sentence justification why conciseness = score]
"""

    def create_template(self) -> PromptTemplate:
        return PromptTemplate(template=self.prompt, input_variables=["query", "output"])

Unpacking this template: given a user query and the response generated by our fine-tuned model, the evaluation model should analyze both (query, response) and score the relationship between them on 3 criteria.
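
To see what "populating" the template means, here is a dependency-free sketch that fills the two placeholders with plain `str.format`. LangChain's `PromptTemplate.format(query=..., output=...)` performs the equivalent substitution; the shortened template text and the `build_evaluation_prompt` helper below are assumptions made for the example, not the course code.

```python
# Abbreviated stand-in for the evaluation template (illustrative wording).
EVALUATION_PROMPT = (
    "You are an AI assistant and your task is to evaluate the output "
    "generated by another LLM.\n"
    "Step 1: Analyze the user query: {query}\n"
    "Step 2: Analyze the response: {output}\n"
    "Step 3: Score Relevance, Coherence, and Conciseness from 1 to 5, "
    "with a one-sentence justification for each.\n"
)


def build_evaluation_prompt(query: str, output: str) -> str:
    """Substitute the user query and the model response into the template."""
    return EVALUATION_PROMPT.format(query=query, output=output)


sample_query = "Could you please draft a LinkedIn post discussing Vector Databases?"
sample_output = "Vector databases index numerical vectors for similarity search."

prompt = build_evaluation_prompt(sample_query, sample_output)
print(prompt)
```

The resulting string is what gets sent to the evaluator model.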

→ Relevance

Relevance measures how well the generated content aligns with the user query.

It covers:

  • Content Match: how closely the generated content addresses the question posed by the query.
  • Topicality: The degree to which the content stays on-topic.

Example Evaluation Criteria:

  • Does the response directly answer the query?
  • Are the key points of the query adequately covered?
  • Is the information provided accurate and pertinent to the topic?

→ Coherence

Coherence measures how logically and smoothly the generated text flows.

It covers:

  • Sentence Structure: how well sentences are constructed to relate to each other.
  • Clarity of Thought: overall readability and understandability of the text.

Example Evaluation Criteria:

  • Are the ideas presented in a logical order?
  • Do transitions between sentences and paragraphs make sense?
  • Is the text easy to follow and understand?

→ Conciseness

Conciseness measures how compact the generated text is, free from unnecessary or redundant words.

It covers:

  • Elimination of Redundancy: avoidance of repetitive information.
  • Directness: the ability to communicate ideas straightforwardly.

Example Evaluation Criteria:

  • Is the text compact and to the point?
  • Are there any redundant or repetitive phrases?

For all these criteria, we’re asking the larger LLM (GPT3.5-Turbo) to score each of them on a 1–5 scale.
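
Because the evaluator replies in free text, the scores need to be extracted before they can be compared or tracked. One possible approach, sketched here with a hypothetical `parse_scores` helper that is not part of the course code, is a tolerant regular expression that accepts both a `Relevance - [4]` layout and a `Relevance [4] - ...` layout.

```python
import re

CRITERIA = ("Relevance", "Coherence", "Conciseness")


def parse_scores(evaluation_text: str) -> dict[str, int]:
    """Extract the 1-5 score reported for each criterion, if present."""
    scores: dict[str, int] = {}
    for criterion in CRITERIA:
        # Criterion name, optional dash, then a digit with optional brackets.
        pattern = rf"{criterion}\s*-?\s*\[?([1-5])\]?"
        match = re.search(pattern, evaluation_text, flags=re.IGNORECASE)
        if match:
            scores[criterion.lower()] = int(match.group(1))
    return scores


sample_evaluation = """\
- Relevance [4] - The output is highly relevant to the user query.
- Coherence [5] - The output presents a logical flow of information.
- Conciseness [4] - The output is fairly concise.
"""

scores = parse_scores(sample_evaluation)
print(scores)
```

A criterion that the evaluator skipped is simply absent from the returned dictionary, so downstream code should treat missing keys as "not scored".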

Code Walkthrough

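Here’s how we define our evaluation logic in evaluate_llm, where we compose, populate, and send the full prompt to GPT3.5-Turbo.

# Older LangChain releases expose this class as langchain.chat_models.ChatOpenAI.
from langchain_openai import ChatOpenAI

# templates, settings, and GeneralChain come from the project's own modules.


def evaluate_llm(query: str, output: str) -> str: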
    evaluation_template = templates.LLMEvaluationTemplate()
    prompt_template = evaluation_template.create_template()

    model = ChatOpenAI(model=settings.OPENAI_MODEL_ID, api_key=settings.OPENAI_API_KEY)
    chain = GeneralChain.get_chain(
        llm=model, output_key="llm_eval", template=prompt_template
    )

    response = chain.invoke({"query": query, "output": output})

    return response["llm_eval"]

The full eval workflow looks like this:

class LLMTwin:
    def __init__(self) -> None:
        self.qwak_client = RealTimeClient(
            model_id=settings.QWAK_DEPLOYMENT_MODEL_ID,
            model_api=settings.QWAK_DEPLOYMENT_MODEL_API,
        )
        self.template = InferenceTemplate()
        self.prompt_monitoring_manager = PromptMonitoringManager()

    def generate(
        self,
        query: str,
        enable_rag: bool = False,
        enable_evaluation: bool = False,
        enable_monitoring: bool = True,
    ) -> dict:
        prompt_template = self.template.create_template(enable_rag=enable_rag)
        prompt_template_variables = {
            "question": query,
        }

        if enable_rag is True:
            retriever = VectorRetriever(query=query)
            hits = retriever.retrieve_top_k(
                k=settings.TOP_K, to_expand_to_n_queries=settings.EXPAND_N_QUERY
            )
            context = retriever.rerank(hits=hits, keep_top_k=settings.KEEP_TOP_K)
            prompt_template_variables["context"] = context

            prompt = prompt_template.format(question=query, context=context)
        else:
            prompt = prompt_template.format(question=query)

        input_ = pd.DataFrame([{"instruction": prompt}]).to_json()

        response: list[dict] = self.qwak_client.predict(input_)
        answer = response[0]["content"][0]

        if enable_evaluation is True:
            evaluation_result = evaluate_llm(query=query, output=answer)
        else:
            evaluation_result = None

        if enable_monitoring is True:
            if evaluation_result is not None:
                metadata = {"llm_evaluation_result": evaluation_result}
            else:
                metadata = None

            self.prompt_monitoring_manager.log(
                prompt=prompt,
                prompt_template=prompt_template.template,
                prompt_template_variables=prompt_template_variables,
                output=answer,
                metadata=metadata,
            )
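            self.prompt_monitoring_manager.log_chain(
                query=query, response=answer, eval_output=evaluation_result
            )

        # Return the answer with the optional evaluation (key names are illustrative).
        return {"answer": answer, "llm_evaluation_result": evaluation_result}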

To check the full implementation, visit our LLM Twin Repository [1].

Note that the model we’re evaluating here is the one we deployed in the previous lesson as part of the training pipeline. The model we’ll deploy to production within the inference pipeline will be a separate instance, chosen after we’ve selected the best candidate based on our evaluation results.

In the next lesson, Lesson 9, we’ll discuss the inference pipeline and the production deployment in detail.

Key points from this implementation:

  • We pass the query to the Mistral7b model under evaluation, deployed on Qwak [2].
  • We get the response and pass the (query, response) pair to the evaluation step.
  • The evaluation template is populated and sent to GPT3.5-Turbo.
  • The resulting prompt is logged to Comet LLM [3].

To learn more about how we deployed the training pipeline on Qwak, where we fine-tuned Mistral7b-Instruct on a custom dataset, check Lesson 7.

Here’s an example:

query: 
Could you please draft a LinkedIn post discussing Vector Databases? 
I'm particularly interested in how do they work.

response:
Vector databases are an interesting topic, as they allow us to represent data in a way that allows for more efficient similarity searches. 
At their core, vector databases are similar to indexed search engines, but instead of indexing text, they index numerical vectors. 
These vectors are derived from the data itself, typically through techniques like hashing or embedding. 
From there, the vectors are stored in a dictionary-like structure, allowing for constant-time lookups and approximate nearest neighbor searches. 
By using vectors instead of text-based searches, vector databases can be incredibly fast and scalable, especially when dealing with large datasets. 
Understanding how they work can help you make more informed decisions when it comes to data storage and search.</s>"

Next, you can see the logs from our Evaluation Chain.

> Entering new LLMChain chain...
Prompt after formatting:
You are an AI assistant and your task is to evaluate the output generated by another LLM.
    You need to follow these steps:
    Step 1: Analyze the user query: Could you please draft a LinkedIn post discussing Vector Databases? I'm particularly interested in how do they work.
    Step 2: Analyze the response: {'content': ["<s> You are an AI language model assistant. Your task is to generate a cohesive and concise response to the user question.\n    Question: Could you please draft a LinkedIn post discussing Vector Databases? I'm particularly interested in how do they work.\n\nAnswer: Vector databases are an interesting topic, as they allow us to represent data in a way that allows for more efficient similarity searches. At their core, vector databases are similar to indexed search engines, but instead of indexing text, they index numerical vectors. These vectors are derived from the data itself, typically through techniques like hashing or embedding. From there, the vectors are stored in a dictionary-like structure, allowing for constant-time lookups and approximate nearest neighbor searches. By using vectors instead of text-based searches, vector databases can be incredibly fast and scalable, especially when dealing with large datasets. Understanding how they work can help you make more informed decisions when it comes to data storage and search.</s>"]}
    Step 3: Evaluate the generated response based on the following blueprint, of [rank_score] - [description]:
    - Relevance [rank_score] - [description] : where you give a score from 1 to 5 on how relevant the output is to the user query.
    - Coherence [rank_score] - [description] : where you give a score from 1 to 5 on how coherent the output is.
    - Conciseness [rank_score] - [description]: where you give a score from 1 to 5 on how concise the output is.
    
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"

> Finished chain.
Step 1: Analyze the user query:
The user is requesting a LinkedIn post draft that discusses Vector Databases, with a focus on their functionality.

Step 2: Analyze the response:
The response generated by the other LLM provides an answer that explains vector databases, how they represent data, their similarity to search engines, and touches on the process of indexing and searching within these databases.

Step 3: Evaluate the generated response:
- Relevance [4] - The output is highly relevant as it directly addresses the user's interest in vector databases and how they work.
- Coherence [5] - The output is coherent as it presents a logical flow of information regarding vector databases.
- Conciseness [4] - The output is fairly concise, delivering a good amount of information in a compact format suitable for a LinkedIn post.

Comet Prompt Monitoring

Apart from the rich feature set for experiment tracking, Comet LLM [3] also offers quite useful features to monitor your LLM-based applications.

Why Monitor Prompts?
Prompt monitoring is crucial in LLM-based applications for several reasons. It helps ensure the quality and relevance of responses and maintains accuracy and coherence in user interactions. At the same time, it lets the ML engineers maintaining the project identify bias or hallucinations and fix them early on.

Why is it a best practice?

  • By logging and inspecting many prompts and their results, one can aggregate the observations into a generalized metric.
  • Useful for RLHF analysis
  • Useful to inspect a full chain, alongside the metadata, processing time, and chain stages being executed.
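
As a sketch of the first point, per-criterion scores collected from many logged evaluations can be averaged into a single summary. The `average_scores` helper and the sample records below are hypothetical; they assume each logged evaluation has already been reduced to a dictionary of 1 to 5 scores.

```python
def average_scores(evaluations: list[dict[str, int]]) -> dict[str, float]:
    """Average each criterion over the evaluations that reported it."""
    criteria = sorted({name for record in evaluations for name in record})
    averages: dict[str, float] = {}
    for name in criteria:
        values = [record[name] for record in evaluations if name in record]
        averages[name] = sum(values) / len(values)
    return averages


# Made-up scores, as they might be recovered from logged prompt metadata.
logged_evaluations = [
    {"relevance": 4, "coherence": 5, "conciseness": 4},
    {"relevance": 3, "coherence": 4, "conciseness": 5},
    {"relevance": 5, "coherence": 4},  # conciseness was not reported here
]

summary = average_scores(logged_evaluations)
for criterion, value in summary.items():
    print(f"{criterion}: {value:.2f}")
```

Tracking such averages across model candidates or dataset versions gives a coarse but comparable signal of generation quality.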

Other advantages include filtering out inappropriate content and providing real-time feedback, accessible from a centralized dashboard on how the model behaves.

Apart from monitoring the actual prompt, we’ll also log the chain’s workflow. This lets us debug step by step and identify whether any chain stage corrupted the end response.

Below, you’ll find an example of a chain + prompt monitoring dashboard from Comet LLM [3]:

[Figure: Comet LLM prompt + chain logging]

To log prompts to Comet LLM, we used this straightforward implementation:


  # Method of PromptMonitoringManager; needs `import comet_llm` at module level.
  @classmethod
  def log(
      cls,
      prompt: str,
      output: str,
      prompt_template: str | None = None,
      prompt_template_variables: dict | None = None,
      metadata: dict | None = None,
  ) -> None:
      comet_llm.init()

      metadata = metadata or {}
      metadata = {
          "model": settings.MODEL_TYPE,
          **metadata,
      }

      comet_llm.log_prompt(
          workspace=settings.COMET_WORKSPACE,
          project=f"{settings.COMET_PROJECT}-monitoring",
          api_key=settings.COMET_API_KEY,
          prompt=prompt,
          prompt_template=prompt_template,
          prompt_template_variables=prompt_template_variables,
          output=output,
          metadata=metadata,
      )

To log chains, we have to log each chain step in order. In the example below, we start the chain with {"user_query": query} and record each subsequent stage in its own comet_llm.Span; a span’s inputs should match what the previous stage handed over.

We would have a chain INPUT -> TWIN_RESPONSE -> GPT3.5-EVAL -> END .

For more details on structuring and logging chains on Comet LLM [3], check
🔗 Comet Chain Logging [4]


# Method of PromptMonitoringManager, like log above.
@classmethod
def log_chain(cls, query: str, response: str, eval_output: str | None) -> None:
    comet_llm.init(project=f"{settings.COMET_PROJECT}-monitoring")
    comet_llm.start_chain(
        inputs={"user_query": query},
        project=f"{settings.COMET_PROJECT}-monitoring",
        api_key=settings.COMET_API_KEY,
        workspace=settings.COMET_WORKSPACE,
    )
    with comet_llm.Span(
        category="twin_response",
        inputs={"user_query": query},
    ) as span:
        span.set_outputs(outputs=response)

    # The evaluation stage takes the twin's response as input and produces the evaluation.
    with comet_llm.Span(
        category="gpt3.5-eval",
        inputs={"twin_response": response},
    ) as span:
        span.set_outputs(outputs=eval_output)
    comet_llm.end_chain(outputs={"response": response, "eval_output": eval_output})

Conclusion

Here we’re wrapping up Lesson 8 of the LLM Twin free course.

We’ve described common evaluation metrics, quantitative and qualitative, and have walked through a common evaluation approach that uses a larger model (GPT3.5-Turbo) to assess and score our model’s responses on relevance, coherence, and conciseness.

Having completed Lesson 8, you now have a good understanding of what LLM evaluation is, the common metrics used, how to compose and populate an evaluation prompt template, and how to monitor the resulting evaluation insights using Comet LLM [3], where we showed how to log single prompts and entire chains.

In Lesson 9, we’ll cover the process of building the inference RAG pipeline. We’ll connect the various components of the LLM-Twin system, such as the QDrant Vector DB and Qwak Inference Pipeline, and prepare the system as a complete deployment. See you there!

🔗 Check out the code on GitHub [1] and support us with a ⭐️

References

[1] LLM Twin Github Repository, 2024, Decoding ML GitHub Organization

[2] Qwak, 2024, The Qwak.ai Platform landing Page

[3] Comet LLM, The Comet LLM Platform

[4] Comet Chain Logging, The Comet LLM Documentation

Alexandru Razvant, Decoding ML
