December 19, 2024
Welcome to Lesson 9 of 11 in our free course series, LLM Twin: Building Your Production-Ready AI Replica. You'll learn how to use LLMs, vector DBs, and LLMOps best practices to design, train, and deploy a production-ready "LLM twin" of yourself. This AI character will write like you, incorporating your style, personality, and voice into an LLM. For a full overview of course objectives and prerequisites, start with Lesson 1.
In Lesson 9, we will focus on implementing and deploying the inference pipeline of the LLM Twin system.
First, we will design and implement a scalable LLM & RAG inference pipeline based on microservices, separating the ML and business logic into two layers.
Secondly, we will use Comet to integrate a prompt monitoring service to capture all input prompts and LLM answers for further debugging and analysis.
Ultimately, we will deploy the inference pipeline to Qwak and make the LLM Twin service available worldwide.
This lesson is part of a more extensive series in which we learn to build an end-to-end LLM system using LLMOps best practices.
In Lesson 4, we populated a Qdrant vector DB with cleaned, chunked, and embedded digital data (posts, articles, and code snippets).
In Lesson 5, we implemented the advanced RAG retrieval module to query relevant digital data. Here, we will learn to integrate it into the final inference pipeline.
In Lesson 7, we used Qwak to build a training pipeline to fine-tune an open-source LLM on our custom digital data. The LLM weights are available in a model registry.
In Lesson 8, we evaluated the fine-tuned LLM to ensure the production candidate behaves accordingly.
So… What must you know from all of this?
Don’t worry. If you don’t want to replicate the whole system, you can read this article independently from the previous lesson.
Thus, here is what you need to know. We have a Qdrant vector DB populated with cleaned, chunked, and embedded digital data (Lesson 4) and a fine-tuned open-source LLM available in a model registry (Lessons 7 and 8).
→ In this lesson, we will focus on gluing everything together into a scalable inference pipeline and deploying it to the cloud.
🔗 Check out the code on GitHub [1] and support us with a ⭐️
Our inference pipeline contains the following core elements: a fine-tuned LLM, a RAG module, and a prompt monitoring service.
Let’s see how to hook these into a scalable and modular system.
As we follow the feature/training/inference (FTI) pipeline architecture, the communication between the 3 core components is clear.
Our LLM inference pipeline needs 2 things: a fine-tuned LLM, pulled from the model registry, and features for RAG, pulled from the Qdrant vector DB (our logical feature store).
This perfectly aligns with the FTI architecture.
→ If you are unfamiliar with the FTI pipeline architecture, we recommend you review Lesson 1’s section on the 3-pipeline architecture.
Usually, the inference steps can be split into 2 big layers: the LLM service, where the actual inference is done, and the business service, which contains the domain-specific logic.
We can design our inference pipeline in 2 ways.
In a monolithic scenario, we implement everything into a single service.
Pros: it is easier to implement and maintain.
Cons: it is harder to scale each component horizontally based on its own requirements, harder to split the work between multiple teams, and you cannot use a different tech stack for each service.
In a microservice scenario, the LLM and business services are implemented as two different components that communicate with each other through the network, using protocols such as REST or gRPC.
Pros: each component can scale horizontally individually, and each component can use the best tech stack for its job.
Cons: it is harder to deploy and maintain, as the components communicate over the network.
Let’s focus on the “each component can scale individually” part, as this is the most significant benefit of the pattern. Usually, LLM and business services require different types of computing. For example, an LLM service depends heavily on GPUs, while the business layer can do the job only with a CPU.
Because LLM inference takes longer than the business logic, you will often need more LLM service replicas to meet the demand. But remember that GPU VMs are really expensive.
By decoupling the 2 components, you will run only what is required on the GPU machine and not block the GPU VM with other computing that can quickly be done on a much cheaper machine.
Thus, by decoupling the components, you can scale horizontally as required, with minimal costs, providing a cost-effective solution to your system’s needs.
Let’s understand how we applied the microservice pattern to our concrete LLM twin inference pipeline.
As explained in the sections above, we have the following components: a business microservice, an LLM microservice, and a prompt monitoring microservice.
The business microservice is implemented as a Python module that: contains the advanced RAG logic (it queries the vector DB and composes the prompt), calls the LLM microservice deployed on Qwak through a REST API, and sends the prompt and the LLM's answer to the prompt monitoring microservice.
As you can see, the business microservice is light. It glues all the domain steps together and delegates the computation to other services.
The end goal of the business layer is to act as an interface for the end client. In our case, as we will ship the business layer as a Python module, the client will be a Streamlit application.
However, you can quickly wrap the Python module with FastAPI and expose it as a REST API to make it accessible from the cloud.
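For example, here is a minimal sketch of such a FastAPI wrapper. The endpoint name, request model, and the llm_twin import path are illustrative assumptions, not part of the course code:

import uvicorn
from fastapi import FastAPI
from pydantic import BaseModel

from llm_twin import LLMTwin  # assumed import path for the business module


app = FastAPI()
llm_twin = LLMTwin()


class GenerateRequest(BaseModel):
    query: str
    enable_rag: bool = True


@app.post("/generate")
def generate(request: GenerateRequest) -> dict:
    # Delegate everything to the business module, which calls the RAG retriever
    # and the LLM microservice under the hood.
    return llm_twin.generate(query=request.query, enable_rag=request.enable_rag)


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)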
The LLM microservice is deployed on Qwak. This component is focused solely on hosting and calling the LLM. It runs on powerful GPU-enabled machines.
How does the LLM microservice work? It loads the fine-tuned LLM twin model from the model registry, exposes a REST API that takes prompts as input, tokenizes the prompt, passes it to the LLM, decodes the generated tokens back into text, and returns the answer.
That's it!
The prompt monitoring microservice is based on Comet’s LLM dashboard. Here, we log all the prompts and generated answers into a centralized dashboard that allows us to evaluate, debug, and analyze the accuracy of the LLM.
Remember that a prompt can get quite complex. When building complex LLM apps, the prompt usually results from a chain containing other prompts, templates, variables, and metadata.
Thus, a prompt monitoring service, such as the one provided by Comet, differs from a standard logging service. It allows you to quickly dissect the prompt and understand how it was created. Also, by attaching metadata to it, such as the latency of the generated answer and the cost to generate the answer, you can quickly analyze and optimize your prompts.
Before diving into the code, let’s quickly clarify what is the difference between the training and inference pipelines.
Along with the apparent reason that the training pipeline takes care of training while the inference pipeline takes care of inference (Duh!), there are some critical differences you have to understand.
Do you remember our logical feature store based on the Qdrant vector DB and Comet artifacts? If not, consider checking out Lesson 6 for a refresher.
The core idea is that during training, the data is accessed from an offline data storage in batch mode, optimized for throughput and data lineage.
Our LLM twin architecture uses Comet artifacts to access, version, and track all our data.
The data is accessed in batches and fed to the training loop.
During inference, you need an online database optimized for low latency. As we directly query the Qdrant vector DB for RAG, that fits like a glove.
During inference, you don’t care about data versioning and lineage. You just want to access your features quickly for a good user experience.
The data comes directly from the user and is sent to the inference logic.
The training pipeline’s final output is the trained weights stored in Comet’s model registry.
The inference pipeline’s final output is the predictions served directly to the user.
The training pipeline requires more powerful machines with as many GPUs as possible.
Why? During training, you batch your data and have to hold in memory all the gradients required for the optimization steps. Because of the optimization algorithm, the training is more compute-hungry than the inference.
Thus, more computing and VRAM result in bigger batches, which means less training time and more experiments.
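As a rough back-of-the-envelope illustration (rule-of-thumb numbers, not measured values), here is why a 7B-parameter model such as Mistral-7B needs far more VRAM for training than for inference:

params = 7e9

# Inference in bf16/fp16: ~2 bytes per parameter (plus activations / KV cache).
inference_gb = params * 2 / 1e9  # ~14 GB

# Mixed-precision training with Adam: bf16 weights + bf16 gradients
# + fp32 master weights + two fp32 optimizer states ≈ 16 bytes per parameter
# (plus activations, which grow with the batch size).
training_gb = params * (2 + 2 + 4 + 4 + 4) / 1e9  # ~112 GB

print(f"inference ≈ {inference_gb:.0f} GB, training ≈ {training_gb:.0f} GB")

This is also why techniques such as QLoRA (which we used in Lesson 7) are needed to squeeze fine-tuning onto a single GPU.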
The inference pipeline can do the job with less computation. During inference, you often pass a single sample or smaller batches to the model.
If you run a batch pipeline, you will still pass batches to the model but don’t perform any optimization steps.
If you run a real-time pipeline, as we do in the LLM twin architecture, you pass a single sample to the model or do some dynamic batching to optimize your inference step.
Do the two pipelines share anything, then? Yes! This is where the training-serving skew comes in.
During training and inference, you must carefully apply the same preprocessing and postprocessing steps.
If the preprocessing and postprocessing functions or hyperparameters don’t match, you will end up with the training-serving skew problem.
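A simple way to avoid the skew is to keep every preprocessing step in one shared function that both pipelines import. The module and function names below are illustrative, not the course's actual code:

# shared/prompts.py (hypothetical shared module)
def format_prompt(question: str, context: str | None = None) -> str:
    # Single source of truth for prompt formatting, imported by BOTH the
    # training pipeline (to build instruction samples) and the inference
    # pipeline (to build the prompt sent to the LLM microservice).
    if context is not None:
        return f"### Context:\n{context}\n\n### Question:\n{question}\n\n### Answer:\n"

    return f"### Question:\n{question}\n\n### Answer:\n"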
Enough with the theory. Let’s dig into the RAG business microservice ↓
First, let’s understand how we defined the settings to configure the inference pipeline components.
We used pydantic_settings and inherited its BaseSettings class.
This approach lets us quickly define a set of default settings variables and load sensitive values such as the API KEY from a .env file.
from pydantic_settings import BaseSettings, SettingsConfigDict


class AppSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    ...  # Settings.

    # CometML config
    COMET_API_KEY: str
    COMET_WORKSPACE: str
    COMET_PROJECT: str = "llm-twin-course"

    ...  # More settings.


settings = AppSettings()
All the variables called settings.* (e.g., settings.COMET_API_KEY) come from this class.
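For reference, the corresponding .env file could look like this (placeholder values):

COMET_API_KEY=<your-comet-api-key>
COMET_WORKSPACE=<your-comet-workspace>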
We will define the RAG business module under the LLMTwin class. The LLM twin logic is directly correlated with our business logic.
We don't have to introduce the word "business" in the naming convention of the classes. What we presented so far serves a clear separation of concerns between the LLM and business layers.
Initially, within the LLMTwin class, we define all the clients we need for our business logic ↓
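The constructor is not reproduced here, but based on how its attributes are used in generate() below, it boils down to something like this minimal sketch. The InferenceTemplate class name, the RealTimeClient import from Qwak's inference SDK, and the QWAK_DEPLOYMENT_MODEL_ID setting are assumptions:

from qwak_inference import RealTimeClient  # assumed import path for Qwak's inference client


class LLMTwin:
    def __init__(self) -> None:
        # Builds the RAG / non-RAG prompt templates (hypothetical class name).
        self.template = InferenceTemplate()
        # Client used to call the LLM microservice deployed on Qwak (assumed setting name).
        self.qwak_client = RealTimeClient(model_id=settings.QWAK_DEPLOYMENT_MODEL_ID)
        # Thin wrapper around Comet's prompt monitoring API (see the monitoring section below).
        self.prompt_monitoring_manager = PromptMonitoringManager()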
Now let's dig into the generate() method, where we: build the prompt from the template and the user's query, optionally retrieve and rerank relevant context from the vector DB, call the LLM microservice deployed on Qwak, and log the prompt, answer, and metadata to the prompt monitoring service.
Now, let's look at the complete code of the generate() method. It follows the same flow as described above, but with all the nitty-gritty details.
import pandas as pd


class LLMTwin:
    def __init__(self) -> None:
        ...

    def generate(
        self,
        query: str,
        enable_rag: bool = True,
        enable_monitoring: bool = True,
    ) -> dict:
        prompt_template = self.template.create_template(enable_rag=enable_rag)
        prompt_template_variables = {
            "question": query,
        }

        if enable_rag is True:
            # Retrieve and rerank the most relevant chunks from the Qdrant vector DB.
            retriever = VectorRetriever(query=query)
            hits = retriever.retrieve_top_k(
                k=settings.TOP_K,
                to_expand_to_n_queries=settings.EXPAND_N_QUERY,
            )
            context = retriever.rerank(
                hits=hits,
                keep_top_k=settings.KEEP_TOP_K,
            )
            prompt_template_variables["context"] = context

            prompt = prompt_template.format(question=query, context=context)
        else:
            prompt = prompt_template.format(question=query)

        # Call the LLM microservice deployed on Qwak.
        input_ = pd.DataFrame([{"instruction": prompt}]).to_json()
        response: list[dict] = self.qwak_client.predict(input_)
        answer = response[0]["content"][0]

        if enable_monitoring is True:
            # `metadata` (e.g., model type, latency) is assembled in the full implementation.
            self.prompt_monitoring_manager.log(
                prompt=prompt,
                prompt_template=prompt_template.template,
                prompt_template_variables=prompt_template_variables,
                output=answer,
                metadata=metadata,
            )

        return {"answer": answer}
Let’s look at how our LLM microservice is implemented using Qwak.
As the LLM microservice is deployed on Qwak, we must first inherit from the QwakModel class and implement some specific functions.
Note: The build() function contains all the training logic, such as loading the dataset, training the LLM, and pushing it to a Comet experiment. To see the full implementation, consider checking out Lesson 7, where we detailed the training pipeline.
Let’s zoom into the implementation and the life cycle of the Qwak model.
The schema() method is used to define what the input and output of the predict() method look like. This automatically validates the structure and type of the predict() method's inputs and outputs. For example, the LLM microservice will throw an error if the variable instruction is a JSON object instead of a string.
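For example, under that schema, the first payload below would pass validation, while the second would be rejected because instruction is not a string (illustrative values):

import pandas as pd

# "instruction" is a plain string: valid.
valid_input = pd.DataFrame([{"instruction": "Write a LinkedIn post about RAG."}])

# "instruction" is a nested object instead of a string: the LLM microservice
# would reject this request during schema validation.
invalid_input = pd.DataFrame([{"instruction": {"text": "Write a post about RAG."}}])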
The other Qwak-specific methods are called in the following order: __init__() → initialize_model() → predict(), which is called on every request.
>>> Note that these methods are called only during serving time (and not during training).
Qwak exposes your model as a RESTful API, where the predict() method is called on each request.
Inside the prediction method, we perform the following steps: map the input DataFrame to a list of strings, tokenize the instructions, pass the token IDs to the LLM, slice the generated IDs to keep only the newly generated tokens, decode them back into text, and return the answer as a DataFrame.
Here is the complete code for the implementation of the Qwak LLM microservice:
import logging

import pandas as pd
import qwak
# Qwak SDK imports (module paths may differ slightly across SDK versions).
from qwak.model.adapters import DefaultOutputAdapter
from qwak.model.base import QwakModel
from qwak.model.schema import ModelSchema
from qwak.model.schema_entities import InferenceOutput, RequestInput


class CopywriterMistralModel(QwakModel):
    def __init__(
        self,
        use_experiment_tracker: bool = True,
        register_model_to_model_registry: bool = True,
        model_type: str = "mistralai/Mistral-7B-Instruct-v0.1",
        fine_tuned_llm_twin_model_type: str = settings.FINE_TUNED_LLM_TWIN_MODEL_TYPE,
        dataset_artifact_name: str = settings.DATASET_ARTIFACT_NAME,
        config_file: str = settings.CONFIG_FILE,
        model_save_dir: str = settings.MODEL_SAVE_DIR,
    ) -> None:
        self.use_experiment_tracker = use_experiment_tracker
        self.register_model_to_model_registry = register_model_to_model_registry
        self.model_save_dir = model_save_dir
        self.model_type = model_type
        self.fine_tuned_llm_twin_model_type = fine_tuned_llm_twin_model_type
        self.dataset_artifact_name = dataset_artifact_name
        self.training_args_config_file = config_file

    def build(self) -> None:
        # Training logic (see Lesson 7).
        ...

    def initialize_model(self) -> None:
        # self.nf4_config, self.qlora_config, and self.device are set in helper
        # methods omitted from this excerpt.
        self.model, self.tokenizer, _ = build_qlora_model(
            pretrained_model_name_or_path=self.model_type,
            peft_pretrained_model_name_or_path=self.fine_tuned_llm_twin_model_type,
            bnb_config=self.nf4_config,
            lora_config=self.qlora_config,
            cache_dir=settings.CACHE_DIR,
        )
        self.model = self.model.to(self.device)

        logging.info(f"Successfully loaded model from {self.model_save_dir}")

    def schema(self) -> ModelSchema:
        return ModelSchema(
            inputs=[RequestInput(name="instruction", type=str)],
            outputs=[InferenceOutput(name="content", type=str)],
        )

    @qwak.api(output_adapter=DefaultOutputAdapter())
    def predict(self, df) -> pd.DataFrame:
        input_text = list(df["instruction"].values)
        input_ids = self.tokenizer(
            input_text, return_tensors="pt", add_special_tokens=True
        )
        input_ids = input_ids.to(self.device)

        generated_ids = self.model.generate(
            **input_ids,
            max_new_tokens=500,
            do_sample=True,
            pad_token_id=self.tokenizer.eos_token_id,
        )

        # Keep only the newly generated tokens (drop the prompt tokens).
        answer_start_idx = input_ids["input_ids"].shape[1]
        generated_answer_ids = generated_ids[:, answer_start_idx:]
        decoded_output = self.tokenizer.batch_decode(generated_answer_ids)[0]

        return pd.DataFrame([{"content": decoded_output}])
Where the settings used in the code above have the following values:
class AppSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    ...  # Other settings.

    DATASET_ARTIFACT_NAME: str = "posts-instruct-dataset"
    FINE_TUNED_LLM_TWIN_MODEL_TYPE: str = "decodingml/llm-twin:1.0.0"
    CONFIG_FILE: str = "./finetuning/config.yaml"
    MODEL_SAVE_DIR: str = "./training_pipeline_output"
    CACHE_DIR: Path = Path("./.cache")
The most important one is the FINE_TUNED_LLM_TWIN_MODEL_TYPE setting, which reflects what model and version to load from the model registry.
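To make the "workspace/model:version" convention concrete, here is a tiny, hypothetical helper that splits such an identifier into its parts (the actual loading happens inside build_qlora_model(), which pulls the fine-tuned weights referenced by this identifier from the model registry):

def parse_model_registry_id(model_id: str) -> tuple[str, str, str]:
    # Hypothetical helper: split "decodingml/llm-twin:1.0.0" into its parts.
    workspace, model_and_version = model_id.split("/")
    model_name, version = model_and_version.split(":")

    return workspace, model_name, version


# Prints: ('decodingml', 'llm-twin', '1.0.0')
print(parse_model_registry_id("decodingml/llm-twin:1.0.0"))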
Access the code 🔗 here ←
The final step is to look at Comet’s prompt monitoring service. ↓
Comet makes prompt monitoring straightforward. There is just one API call where you connect to your project and workspace and send the following to a single function: the prompt and the LLM output, the prompt template and its variables that created the final prompt, and custom metadata specific to your use case (such as the model type, latency, and cost).
Let's look at the logs in Comet ML's LLMOps dashboard.
Here is how you can quickly access them ↓
Note: Comet provides a free version which is enough to run these examples.
This is how Comet’s prompt monitoring dashboard looks. Here, you can scroll through all the prompts that were ever sent to the LLM. ↓
You can click on any prompt and see everything we logged programmatically using the PromptMonitoringManager class.
Besides what we logged, adding various tags and the inference duration can be valuable.
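As a rough sketch, assuming Comet's comet_llm Python SDK and its log_prompt() function, the PromptMonitoringManager.log() call used above could boil down to something like this:

import comet_llm


class PromptMonitoringManager:
    @classmethod
    def log(
        cls,
        prompt: str,
        output: str,
        prompt_template: str | None = None,
        prompt_template_variables: dict | None = None,
        metadata: dict | None = None,
    ) -> None:
        # One call sends everything to Comet's prompt monitoring dashboard.
        comet_llm.log_prompt(
            workspace=settings.COMET_WORKSPACE,
            project=settings.COMET_PROJECT,
            api_key=settings.COMET_API_KEY,
            prompt=prompt,
            prompt_template=prompt_template,
            prompt_template_variables=prompt_template_variables,
            output=output,
            metadata=metadata,  # e.g., model type, latency, cost
        )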
Qwak makes the deployment of the LLM microservice straightforward.
During Lesson 7, we fine-tuned the LLM and built the Qwak model. As a quick refresher, we ran the following CLI command to build the Qwak model, where we used the build_config.yaml file with the build configuration:
poetry run qwak models build -f build_config.yaml .
After the build is finished, we can make various deployments based on the build. For example, we can deploy the LLM microservice using the following Qwak command:
qwak models deploy realtime \
--model-id "llm_twin" \
--instance "gpu.a10.2xl" \
--timeout 50000 \
--replicas 2 \
--server-workers 2
We deployed two replicas of the LLM twin. Each replica has access to a machine with one A10 GPU, and each replica runs two workers.
🔗 More on Qwak instance types ←
Two replicas with two workers each result in four copies of the LLM microservice running in parallel to serve our users.
You can scale the deployment to more replicas if you need to serve more clients. Qwak provides autoscaling mechanisms triggered by listening to the consumption of GPU, CPU or RAM.
To conclude, you build the Qwak model once, and based on it, you can make multiple deployments with various strategies.
You can quickly close the deployment by running the following:
qwak models undeploy --model-id "llm_twin"
We strongly recommend closing down the deployment when you are done, as GPU VMs are expensive.
To run the LLM system with a predefined prompt example, you have to run the following Python file:
poetry run python main.py
Within the main.py file, we call the LLMTwin class, which calls the other services as explained during this lesson.
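For reference, a minimal main.py along these lines is all that is needed; the import path and the query are illustrative:

from llm_twin import LLMTwin  # assumed import path for the business module

if __name__ == "__main__":
    inference_endpoint = LLMTwin()

    # Illustrative query; any prompt in your own voice works here.
    query = "Could you draft a short LinkedIn post explaining how RAG works with vector DBs?"

    response = inference_endpoint.generate(
        query=query, enable_rag=True, enable_monitoring=True
    )
    print(response["answer"])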
Note: The → complete installation & usage instructions ← are available in the README of the GitHub repository.
🔗 Check out the code on GitHub [1] and support us with a ⭐️
Congratulations! You are close to the end of the LLM twin series.
In Lesson 9 of the LLM twin course, you learned to build a scalable inference pipeline for serving LLMs and RAG systems.
First, you learned how to architect an inference pipeline by understanding the difference between monolithic and microservice architectures. We also highlighted the difference in designing the training and inference pipelines.
Secondly, we walked you through implementing the RAG business module and LLM twin microservice. Also, we showed you how to log all the prompts, answers, and metadata for Comet’s prompt monitoring service.
Ultimately, we showed you how to deploy and run the LLM twin inference pipeline on the Qwak AI platform.
In Lesson 10, we will show you how to evaluate the whole system by building an advanced RAG evaluation pipeline that analyzes the accuracy of the LLM's answers relative to the query and context.
See you there! 🤗
🔗 Check out the code on GitHub [1] and support us with a ⭐️
[1] Your LLM Twin Course — GitHub Repository (2024), Decoding ML GitHub Organization
[2] Add your models to Model Registry (2024), Comet Guides
If not otherwise stated, all images are created by the author.