December 19, 2024
Welcome to Lesson 8 of 11 in our free course series, LLM Twin: Building Your Production-Ready AI Replica. You’ll learn how to use LLMs, vector DBs, and LLMOps best practices to design, train, and deploy a production-ready “LLM twin” of yourself. This AI character will write like you, incorporating your style, personality, and voice into an LLM. For a full overview of course objectives and prerequisites, start with Lesson 1.
This lesson will focus on evaluating our fine-tuned LLM Twin model.
Before doing that, let’s walk through a short recap, to understand how we’ve gotten to the LLM evaluation stage:
→ In Lesson 6, we showcased extracting filtered data samples from QDrant. Using knowledge distillation, we had GPT3.5-Turbo structure and generate the fine-tuning dataset, which is versioned with Comet.
→ In Lesson 7, we built the fine-tuning pipeline using the versioned datasets we’ve logged on Comet, composed the workflow, and deployed the pipeline on Qwak [2] to train our model.
→ In Lesson 8, we’ll focus on common evaluation methods for the various tasks LLMs perform. For our content-generation use case specifically, we’ll rely on human-in-the-loop review and use a larger model to assess coherence and quantify other metrics for our LLM generations.
It is important to differentiate between evaluating LLMs in isolation and evaluating LLM-based systems.
During LLM evaluation, we focus only on how our fine-tuned model generates content and how cohesive the generation is.
Here’s what we’re going to learn in this lesson:
LLM evaluation is a crucial process used to assess the performance and capabilities of the models. It involves a series of tests and analyses to determine how well the model understands, interprets, and generates human-like text.
Because this is a fairly recent and fast-evolving area of AI, LLM evaluation is not straightforward, and there is no unified approach to measuring model performance.
Due to the generative nature of LLMs, the evaluation processes for these models involve both quantitative and qualitative assessments.
Ensuring the effectiveness and safety of LLMs in practical applications should be a mandatory goal. Evaluating LLMs to reduce hallucinations and to ensure accuracy and ethical use is crucial as they become more integrated into diverse sectors.
Several metrics have been proposed in the literature for evaluating the performance of LLMs. It is essential to use the right metrics suitable for the problem we are attempting to solve.
LLM Evaluation vs RAG Evaluation
LLM evaluation focuses on the model’s ability to generate coherent, relevant, and contextually appropriate text based solely on its pre-trained knowledge.
This involves assessing metrics such as fluency, coherence, relevance, and adherence to the given prompts.
RAG evaluation involves assessing how well the model integrates retrieved information into its responses. This requires evaluating not just the quality of the generated text, but also the accuracy and relevance of the retrieved information, and how effectively it enhances the final output.
Metrics for RAG models often include precision and recall of the retrieval process, as well as the overall coherence and relevance of the augmented generation.
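To make the retrieval side concrete, here is a minimal sketch of precision@k and recall@k for a single query. The function name, the document IDs, and the relevance labels are hypothetical and only serve as an illustration; they are not taken from the LLM Twin codebase.

```python
def precision_recall_at_k(
    retrieved: list[str], relevant: set[str], k: int
) -> tuple[float, float]:
    """Return (precision@k, recall@k) for one query."""
    top_k = retrieved[:k]
    # Count how many of the top-k retrieved documents are actually relevant.
    hits = sum(1 for doc_id in top_k if doc_id in relevant)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ranked retrieval result and ground-truth relevant documents.
retrieved_ids = ["doc_3", "doc_7", "doc_1", "doc_9"]
relevant_ids = {"doc_1", "doc_3", "doc_5"}

precision, recall = precision_recall_at_k(retrieved_ids, relevant_ids, k=3)
print(f"precision@3={precision:.2f}, recall@3={recall:.2f}")
```

Precision tells us how much of what we retrieved was useful, while recall tells us how much of the useful material we managed to retrieve. Both require labeled relevant documents, which is one reason RAG evaluation is more involved than evaluating the generation alone.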
Next, let’s go over a few commonly used model evaluation techniques.
Let’s split these techniques by their intended use case.
Quantitative evaluation involves statistical measures to assess the accuracy, fluency, and other aspects of the generated text.
Here are some common metrics:
In case the LLM output is in a structured format, one could evaluate it against classical ML metrics, such as the following:
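As an illustration of this idea, here is a minimal sketch that scores an LLM emitting a binary class label using accuracy, precision, recall, and F1. The choice of these four metrics, the function name, and the toy labels are our assumptions for the example; in practice a library such as scikit-learn would typically be used instead of hand-rolled code.

```python
def binary_classification_metrics(
    y_true: list[str], y_pred: list[str], positive: str
) -> dict[str, float]:
    """Compute accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical ground-truth labels and labels parsed from the LLM's structured output.
labels = ["spam", "ham", "spam", "ham", "spam"]
llm_outputs = ["spam", "spam", "spam", "ham", "ham"]

metrics = binary_classification_metrics(labels, llm_outputs, positive="spam")
print(metrics)
```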
Qualitative evaluation involves human-in-the-loop judgment or larger models assessing aspects like relevance, coherence, creativity, and appropriateness of the content.
This type of evaluation provides insights that quantitative metrics might miss.
In our particular case, quantitative methods like BLEU & ROUGE are not applicable as they can’t yield valuable insights. Since we’re evaluating how our fine-tuned LLM can generate written content, and its task is not summarisation or translation-oriented, we can effectively only evaluate the quality of the generated content using an LLM-based evaluation technique.
Why don’t BLEU & ROUGE work in our use case?
We aim to verify whether our fine-tuned model can generate contextually accurate posts/articles that reflect the provided query.
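To illustrate why overlap-based metrics struggle with open-ended generation, here is a simplified sketch. It computes clipped unigram overlap, which is the basic idea behind BLEU-1 precision and ROUGE-1 recall; it is not the full BLEU or ROUGE computation. The function name and both example sentences are made up for this illustration.

```python
from collections import Counter


def unigram_overlap(candidate: str, reference: str) -> tuple[float, float]:
    """Return (precision, recall) of clipped unigram overlap.

    Precision is the idea behind BLEU-1 (no brevity penalty),
    recall is the idea behind ROUGE-1.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per shared token.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


reference = "vector databases store embeddings and retrieve the nearest neighbors of a query"
paraphrase = "a vector db indexes numerical representations so similar items can be found quickly"

p_para, r_para = unigram_overlap(paraphrase, reference)
p_same, r_same = unigram_overlap(reference, reference)
print(f"paraphrase: precision={p_para:.2f}, recall={r_para:.2f}")
print(f"verbatim:   precision={p_same:.2f}, recall={r_same:.2f}")
```

The paraphrase conveys a similar idea but shares almost no words with the reference, so it scores poorly, while a verbatim copy scores perfectly. For content generation there is usually no single correct reference text, which is why we lean on an LLM-based evaluation instead.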
Within this LLM evaluation stage, we’ll focus on this section of the LLM Twin system design:
1. Define the evaluation prompt template
2. Define the user query
3. Generate content based on the user query
4. Populate the evaluation template
5. Use GPT3.5-Turbo to evaluate
6. Log evaluation prompt on Comet LLM.
from abc import ABC, abstractmethod

from langchain.prompts import PromptTemplate
from pydantic import BaseModel


class BasePromptTemplate(ABC, BaseModel):
    @abstractmethod
    def create_template(self, *args) -> PromptTemplate:
        pass


class LLMEvaluationTemplate(BasePromptTemplate):
    prompt: str = """
You are an AI assistant and your task is to evaluate the output generated by another LLM.
You need to follow these steps:
Step 1: Analyze the user query: {query}
Step 2: Analyze the response: {output}
Step 3: Evaluate the generated response based on the following criteria and provide a score from 1 to 5 along with a brief justification for each criterion:
Evaluation:
Relevance - [score]
[1 sentence justification why relevance = score]
Coherence - [score]
[1 sentence justification why coherence = score]
Conciseness - [score]
[1 sentence justification why conciseness = score]
"""

    def create_template(self) -> PromptTemplate:
        return PromptTemplate(template=self.prompt, input_variables=["query", "output"])
Unpacking this template, we’re specifying that given a user query and the generated response from our fine-tuned model, the evaluation model should analyze both (query, response) and score the relationship between the query and the response on 3 criteria.
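To show what populating the template amounts to, here is a small sketch that fills the {query} and {output} placeholders. It uses plain Python string formatting as a stand-in for LangChain’s PromptTemplate, and the shortened template text and the populate_eval_prompt helper are hypothetical names used only for this example.

```python
# Shortened stand-in for the evaluation prompt; placeholder names match the lesson.
EVAL_TEMPLATE = (
    "Step 1: Analyze the user query: {query}\n"
    "Step 2: Analyze the response: {output}\n"
    "Step 3: Score Relevance, Coherence, and Conciseness from 1 to 5."
)


def populate_eval_prompt(query: str, output: str) -> str:
    """Fill both placeholders and return the prompt sent to the evaluator."""
    return EVAL_TEMPLATE.format(query=query, output=output)


prompt = populate_eval_prompt(
    query="Could you please draft a LinkedIn post discussing Vector Databases?",
    output="Vector databases index numerical vectors instead of text.",
)
print(prompt)
```

The evaluator only ever sees this populated string, so whatever we pass as the output is exactly what gets scored.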
Relevance measures how well the generated content aligns with the user query.
It calculates:
Example Evaluation Criteria:
Coherence measures how logically and smoothly the generated text flows.
It calculates:
Example Evaluation Criteria:
Conciseness measures how compact the generated text is, free from unnecessary or redundant words.
It calculates:
Example Evaluation Criteria:
For all these criteria, we’re asking the larger LLM (GPT3.5-Turbo) to rank each of them on a 1–5 scale.
Here’s how we define our eval method logic, where we compose, populate, and send the full prompt to GPT3.5-Turbo.
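# `templates`, `settings`, and `GeneralChain` come from the course repository [1];
# `ChatOpenAI` is LangChain's OpenAI chat model wrapper.
# Note: this name shadows Python's built-in eval(); the workflow shown later
# appears to call the same logic as `evaluate_llm`.
def eval(query: str, output: str) -> str:
    # Build the evaluation prompt template with {query} and {output} placeholders.
    evaluation_template = templates.LLMEvaluationTemplate()
    prompt_template = evaluation_template.create_template()

    # The evaluator model and API key are read from the project settings.
    model = ChatOpenAI(model=settings.OPENAI_MODEL_ID, api_key=settings.OPENAI_API_KEY)
    chain = GeneralChain.get_chain(
        llm=model, output_key="llm_eval", template=prompt_template
    )

    # Populate the template and return the evaluator's raw text response.
    response = chain.invoke({"query": query, "output": output})
    return response["llm_eval"]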
The full eval workflow looks like this:
class LLMTwin:
    def __init__(self) -> None:
        # Client for the fine-tuned model deployed on Qwak.
        self.qwak_client = RealTimeClient(
            model_id=settings.QWAK_DEPLOYMENT_MODEL_ID,
            model_api=settings.QWAK_DEPLOYMENT_MODEL_API,
        )
        self.template = InferenceTemplate()
        self.prompt_monitoring_manager = PromptMonitoringManager()

    def generate(
        self,
        query: str,
        enable_rag: bool = False,
        enable_evaluation: bool = False,
        enable_monitoring: bool = True,
    ) -> dict:
        prompt_template = self.template.create_template(enable_rag=enable_rag)
        prompt_template_variables = {
            "question": query,
        }

        # Optionally retrieve and rerank context before building the prompt.
        if enable_rag is True:
            retriever = VectorRetriever(query=query)
            hits = retriever.retrieve_top_k(
                k=settings.TOP_K, to_expand_to_n_queries=settings.EXPAND_N_QUERY
            )
            context = retriever.rerank(hits=hits, keep_top_k=settings.KEEP_TOP_K)
            prompt_template_variables["context"] = context
            prompt = prompt_template.format(question=query, context=context)
        else:
            prompt = prompt_template.format(question=query)

        # Call the deployed fine-tuned model.
        input_ = pd.DataFrame([{"instruction": prompt}]).to_json()
        response: list[dict] = self.qwak_client.predict(input_)
        answer = response[0]["content"][0]

        # Optionally score the answer with the larger evaluator model.
        if enable_evaluation is True:
            evaluation_result = evaluate_llm(query=query, output=answer)
        else:
            evaluation_result = None

        # Optionally log the prompt and the full chain to Comet LLM.
        if enable_monitoring is True:
            if evaluation_result is not None:
                metadata = {"llm_evaluation_result": evaluation_result}
            else:
                metadata = None
            self.prompt_monitoring_manager.log(
                prompt=prompt,
                prompt_template=prompt_template.template,
                prompt_template_variables=prompt_template_variables,
                output=answer,
                metadata=metadata,
            )
            self.prompt_monitoring_manager.log_chain(
                query=query, response=answer, eval_output=evaluation_result
            )

        # The snippet ends without a return despite the `-> dict` annotation;
        # the shape of this return value is an assumption.
        return {"answer": answer, "llm_evaluation_result": evaluation_result}
To check the full implementation, visit our LLM Twin Repository [1].
Note that here we’re using the model deployed in the previous lesson as part of the training pipeline; this is the model we’re evaluating. The model we’ll deploy to production within the inference pipeline will be a separate instance, chosen as the best candidate based on our evaluation results.
In the next lesson, Lesson 9, we’ll discuss the inference pipeline and the production deployment in detail.
Points from this implementation:
To find out more about how we deployed the training pipeline on Qwak, where we fine-tuned Mistral7b-Instruct on a custom dataset, check Lesson 7.
Here’s an example:
query:
Could you please draft a LinkedIn post discussing Vector Databases?
I'm particularly interested in how do they work.
response:
Vector databases are an interesting topic, as they allow us to represent data in a way that allows for more efficient similarity searches.
At their core, vector databases are similar to indexed search engines, but instead of indexing text, they index numerical vectors.
These vectors are derived from the data itself, typically through techniques like hashing or embedding.
From there, the vectors are stored in a dictionary-like structure, allowing for constant-time lookups and approximate nearest neighbor searches.
By using vectors instead of text-based searches, vector databases can be incredibly fast and scalable, especially when dealing with large datasets.
Understanding how they work can help you make more informed decisions when it comes to data storage and search.</s>"
Next, you can see the logs from our Evaluation Chain.
> Entering new LLMChain chain...
Prompt after formatting:
You are an AI assistant and your task is to evaluate the output generated by another LLM.
You need to follow these steps:
Step 1: Analyze the user query: Could you please draft a LinkedIn post discussing Vector Databases? I'm particularly interested in how do they work.
Step 2: Analyze the response: {'content': ["<s> You are an AI language model assistant. Your task is to generate a cohesive and concise response to the user question.\n Question: Could you please draft a LinkedIn post discussing Vector Databases? I'm particularly interested in how do they work.\n\nAnswer: Vector databases are an interesting topic, as they allow us to represent data in a way that allows for more efficient similarity searches. At their core, vector databases are similar to indexed search engines, but instead of indexing text, they index numerical vectors. These vectors are derived from the data itself, typically through techniques like hashing or embedding. From there, the vectors are stored in a dictionary-like structure, allowing for constant-time lookups and approximate nearest neighbor searches. By using vectors instead of text-based searches, vector databases can be incredibly fast and scalable, especially when dealing with large datasets. Understanding how they work can help you make more informed decisions when it comes to data storage and search.</s>"]}
Step 3: Evaluate the generated response based on the following blueprint, of [rank_score] - [description]:
- Relevance [rank_score] - [description] : where you give a score from 1 to 5 on how relevant the output is to the user query.
- Coherence [rank_score] - [description] : where you give a score from 1 to 5 on how coherent the output is.
- Conciseness [rank_score] - [description]: where you give a score from 1 to 5 on how concise the output is.
INFO:httpx:HTTP Request: POST https://api.openai.com/v1/chat/completions "HTTP/1.1 200 OK"
> Finished chain.
Step 1: Analyze the user query:
The user is requesting a LinkedIn post draft that discusses Vector Databases, with a focus on their functionality.
Step 2: Analyze the response:
The response generated by the other LLM provides an answer that explains vector databases, how they represent data, their similarity to search engines, and touches on the process of indexing and searching within these databases.
Step 3: Evaluate the generated response:
- Relevance [4] - The output is highly relevant as it directly addresses the user's interest in vector databases and how they work.
- Coherence [5] - The output is coherent as it presents a logical flow of information regarding vector databases.
- Conciseness [4] - The output is fairly concise, delivering a good amount of information in a compact format suitable for a LinkedIn post.
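Since the evaluator returns free text, it can be handy to pull the numeric scores out for aggregation or dashboards. Below is a minimal sketch that parses the three scores from a response formatted like the log above. The parse_scores helper and its regular expression are our own illustration, not part of the course code, and they assume the evaluator sticks to the "Criterion [score]" or "Criterion - score" layout.

```python
import re

# Matches e.g. "Relevance [4]", "Relevance - 4" or "Relevance - [4]".
SCORE_PATTERN = re.compile(r"(Relevance|Coherence|Conciseness)\s*-?\s*\[?([1-5])\]?")


def parse_scores(evaluation: str) -> dict[str, int]:
    """Extract the 1-5 score for each criterion from the evaluator's text."""
    return {
        name.lower(): int(score)
        for name, score in SCORE_PATTERN.findall(evaluation)
    }


evaluation_text = """
- Relevance [4] - The output is highly relevant as it directly addresses the user's interest in vector databases and how they work.
- Coherence [5] - The output is coherent as it presents a logical flow of information regarding vector databases.
- Conciseness [4] - The output is fairly concise, delivering a good amount of information in a compact format suitable for a LinkedIn post.
"""

scores = parse_scores(evaluation_text)
print(scores)
```

If the evaluator drifts from the expected layout, a criterion will simply be missing from the returned dictionary, so it is worth checking that all three keys are present before using the scores.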
Apart from the rich feature set for experiment tracking, Comet LLM [3] also offers quite useful features to monitor your LLM-based applications.
Why Monitoring Prompts?
Prompt monitoring is crucial in LLM-based applications for several reasons. It helps ensure the quality and relevance of responses, maintaining accuracy and coherence in user interactions. At the same time, it allows the ML engineers maintaining the project to identify bias or hallucinations and mitigate them early on.
Why is it a best practice?
Other advantages include filtering out inappropriate content and providing real-time feedback, accessible from a centralized dashboard on how the model behaves.
Apart from monitoring the actual prompt, we’ll also log the chain logic workflow that will allow us to enhance the debugging process step-by-step, to identify if any chain-stage might have corrupted the end response.
Below, you’ll find an example of a chain + prompt monitoring dashboard from Comet LLM [3]:
To log prompts to Comet LLM, we used this straightforward implementation:
import comet_llm


# Method of PromptMonitoringManager, shown on its own.
@classmethod
def log(
    cls,
    prompt: str,
    output: str,
    prompt_template: str | None = None,
    prompt_template_variables: dict | None = None,
    metadata: dict | None = None,
) -> None:
    comet_llm.init()

    # Always record the model type; caller-supplied metadata is merged on top.
    metadata = metadata or {}
    metadata = {
        "model": settings.MODEL_TYPE,
        **metadata,
    }

    comet_llm.log_prompt(
        workspace=settings.COMET_WORKSPACE,
        project=f"{settings.COMET_PROJECT}-monitoring",
        api_key=settings.COMET_API_KEY,
        prompt=prompt,
        prompt_template=prompt_template,
        prompt_template_variables=prompt_template_variables,
        output=output,
        metadata=metadata,
    )
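One detail worth spelling out is how the metadata dictionary is assembled before logging: a default "model" entry is set first, and anything the caller passes is unpacked on top of it. The sketch below isolates that merge; build_metadata and the example values are hypothetical and exist only to show the behavior.

```python
from typing import Optional


def build_metadata(model_type: str, metadata: Optional[dict] = None) -> dict:
    """Merge caller metadata over a default 'model' entry."""
    metadata = metadata or {}
    # Later keys win in a dict literal, so caller values override the default.
    return {"model": model_type, **metadata}


without_eval = build_metadata("example-model")
with_eval = build_metadata("example-model", {"llm_evaluation_result": "Relevance - 4"})
overridden = build_metadata("example-model", {"model": "other-model"})

print(without_eval)
print(with_eval)
print(overridden)
```

This means that logging still works when evaluation is disabled and no metadata is passed, and that a caller supplying its own "model" key would replace the default one.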
To log chains, we have to log each chain step in order. In the example below, we’ve started the chain using the {"user_query" : query} and have linked the next chain stage using the comet_llm.Span where the inputs must be the same as the previous stage. We would have a chain INPUT -> TWIN_RESPONSE -> GPT3.5-EVAL -> END.
For more details on structuring and logging chains on Comet LLM [3], check 🔗 Comet Chain Logging [4]
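# Method of PromptMonitoringManager, shown on its own.
@classmethod
def log_chain(cls, query: str, response: str, eval_output: str):
    comet_llm.init(project=f"{settings.COMET_PROJECT}-monitoring")

    # Open the chain with the user query as its input.
    comet_llm.start_chain(
        inputs={"user_query": query},
        project=f"{settings.COMET_PROJECT}-monitoring",
        api_key=settings.COMET_API_KEY,
        workspace=settings.COMET_WORKSPACE,
    )

    # Stage 1: the fine-tuned twin model answers the query.
    with comet_llm.Span(
        category="twin_response",
        inputs={"user_query": query},
    ) as span:
        span.set_outputs(outputs=response)

    # Stage 2: the evaluation step. As written, this span records the
    # evaluation result as its input and the twin response as its output.
    with comet_llm.Span(
        category="gpt3.5-eval",
        inputs={"eval_result": eval_output},
    ) as span:
        span.set_outputs(outputs=response)

    # Close the chain with the final outputs.
    comet_llm.end_chain(outputs={"response": response, "eval_output": eval_output})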
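To make the chain structure INPUT -> TWIN_RESPONSE -> GPT3.5-EVAL -> END easier to picture, here is a library-free sketch that builds the same shape as a plain data record and checks that every stage consumes what the previous one produced. It does not use the Comet LLM API; the Stage class, build_chain_record, and is_linked are hypothetical helpers, and treating the twin response as the input of the evaluation stage is our reading of the chain described in this lesson.

```python
from dataclasses import asdict, dataclass


@dataclass
class Stage:
    category: str
    inputs: dict
    outputs: dict


def build_chain_record(query: str, response: str, eval_output: str) -> dict:
    """Describe the chain as ordered stages between the initial input and the end."""
    stages = [
        Stage("twin_response", {"user_query": query}, {"response": response}),
        Stage("gpt3.5-eval", {"response": response}, {"eval_output": eval_output}),
    ]
    return {
        "inputs": {"user_query": query},
        "stages": [asdict(stage) for stage in stages],
        "outputs": {"response": response, "eval_output": eval_output},
    }


def is_linked(record: dict) -> bool:
    """Check that each stage's inputs equal the previous stage's outputs."""
    previous = record["inputs"]
    for stage in record["stages"]:
        if stage["inputs"] != previous:
            return False
        previous = stage["outputs"]
    return True


record = build_chain_record(
    query="Draft a LinkedIn post about vector databases.",
    response="Vector databases index numerical vectors.",
    eval_output="Relevance - 4",
)
print([stage["category"] for stage in record["stages"]], is_linked(record))
```

Keeping the stages linked like this is what makes step-by-step debugging possible: if the final output looks wrong, we can walk the stages in order and see which one first received or produced something unexpected.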
Here we’re wrapping up Lesson 8 of the LLM Twin free course.
We’ve described common evaluation metrics, both quantitative and qualitative, and have demonstrated a common evaluation approach that uses a larger model (GPT3.5-Turbo) to assess and score our model’s responses on relevance, coherence, and conciseness.
Completing Lesson 8, you’ve gained a good understanding of what LLM evaluation represents, the common metrics used, how to compose an evaluation prompt template, how to populate it, and how to monitor the resulting evaluation insights using the Comet LLM [3] feature, where we have shown how to log single prompts and entire chains.
In Lesson 9, we’ll cover the process of building the inference RAG pipeline. We’ll connect the various components of the LLM-Twin system, such as the QDrant Vector DB and Qwak Inference Pipeline, and prepare the system as a complete deployment. See you there!
🔗 Check out the code on GitHub [1] and support us with a ⭐️
[1] LLM Twin Github Repository, 2024, Decoding ML GitHub Organization
[2] Qwak, 2024, The Qwak.ai Platform landing Page
[3] Comet LLM, The Comet LLM Platform
[4] Comet Chain Logging, The Comet LLM Documentation