
Architect Scalable and Cost-Effective LLM & RAG Inference Pipelines

Welcome to Lesson 9 of 11 in our free course series, LLM Twin: Building Your Production-Ready AI Replica. You’ll learn how to use LLMs, vector DBs, and LLMOps best practices to design, train, and deploy a production-ready “LLM twin” of yourself. This AI character will write like you, incorporating your style, personality, and voice into an LLM. For a full overview of course objectives and prerequisites, start with Lesson 1.


In Lesson 9, we will focus on implementing and deploying the inference pipeline of the LLM Twin system.

First, we will design and implement a scalable LLM & RAG inference pipeline based on microservices, separating the ML and business logic into two layers.

Secondly, we will use Comet to integrate a prompt monitoring service to capture all input prompts and LLM answers for further debugging and analysis.

Ultimately, we will deploy the inference pipeline to Qwak and make the LLM Twin service available worldwide.

→ Context from previous lessons. What you must know.

This lesson is part of a more extensive series in which we learn to build an end-to-end LLM system using LLMOps best practices.

In Lesson 4, we populated a Qdrant vector DB with cleaned, chunked, and embedded digital data (posts, articles, and code snippets).

In Lesson 5, we implemented the advanced RAG retrieval module to query relevant digital data. Here, we will learn to integrate it into the final inference pipeline.

In Lesson 7, we used Qwak to build a training pipeline to fine-tune an open-source LLM on our custom digital data. The LLM weights are available in a model registry.

In Lesson 8, we evaluated the fine-tuned LLM to ensure the production candidate behaves accordingly.

So… What must you know from all of this?

Don’t worry. If you don’t want to replicate the whole system, you can read this article independently of the previous lessons.

Thus, all you need to know going in is that we have:

  • Qdrant vector DB populated with digital data (posts, articles, and code snippets)
  • a vector DB retrieval module to do advanced RAG
  • a fine-tuned open-source LLM available in a model registry from Comet

→ In this lesson, we will focus on gluing everything together into a scalable inference pipeline and deploying it to the cloud.

Architect scalable and cost-effective LLM & RAG inference pipelines

1. The architecture of the inference pipeline

Our inference pipeline contains the following core elements:

  • a fine-tuned LLM
  • a RAG module
  • a monitoring service

Let’s see how to hook these into a scalable and modular system.

The interface of the inference pipeline

As we follow the feature/training/inference (FTI) pipeline architecture, the communication between the 3 core components is clear.

Our LLM inference pipeline needs 2 things:

  • a fine-tuned LLM: pulled from the model registry
  • features for RAG: pulled from a vector DB (which we modeled as a logical feature store)

This perfectly aligns with the FTI architecture.
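To make this interface concrete, here is a minimal sketch of the idea, with illustrative names rather than the course’s exact code:

class InferencePipeline:
    def __init__(self, llm, retriever) -> None:
        self.llm = llm              # fine-tuned LLM pulled from the model registry
        self.retriever = retriever  # retrieval module backed by the vector DB (our feature store)

    def answer(self, query: str) -> str:
        context = self.retriever.retrieve(query)           # features for RAG
        prompt = f"Question: {query}\nContext: {context}"   # assemble the prompt
        return self.llm.generate(prompt)                    # the actual inference call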

→ If you are unfamiliar with the FTI pipeline architecture, we recommend you review Lesson 1’s section on the 3-pipeline architecture.

Monolithic vs. microservice inference pipelines

Usually, the inference steps can be split into 2 big layers:

  • the LLM service: where the actual inference is being done
  • the business service: domain-specific logic

We can design our inference pipeline in 2 ways.

Option 1: Monolithic LLM & business service

In a monolithic scenario, we implement everything into a single service.

Pros:

  • easy to implement
  • easy to maintain

Cons:

  • harder to scale horizontally based on the specific requirements of each component
  • harder to split the work between multiple teams
  • inability to use different tech stacks for the two services
Monolithic vs. microservice inference pipelines

Option 2: Different LLM & business microservices

The LLM and business services are implemented as two different components that communicate with each other through the network, using protocols such as REST or gRPC.

Pros:

  • each component can scale horizontally individually
  • each component can use the best tech stack at hand

Cons:

  • harder to deploy
  • harder to maintain

Let’s focus on the “each component can scale individually” part, as this is the most significant benefit of the pattern. Usually, LLM and business services require different types of computing. For example, an LLM service depends heavily on GPUs, while the business layer can do its job with just a CPU.

As the LLM inference takes longer, you will often need more LLM service replicas to meet the demand. But remember that GPU VMs are really expensive.

By decoupling the 2 components, you will run only what is required on the GPU machine and not block the GPU VM with other computing that can quickly be done on a much cheaper machine.

Thus, by decoupling the components, you can scale horizontally as required, with minimal costs, providing a cost-effective solution to your system’s needs.

Microservice architecture of the LLM twin inference pipeline

Let’s understand how we applied the microservice pattern to our concrete LLM twin inference pipeline.

As explained in the sections above, we have the following components:

  1. A business microservice
  2. An LLM microservice
  3. A prompt monitoring microservice

The business microservice is implemented as a Python module that:

  • contains the advanced RAG logic, which calls the vector DB and the GPT-4 API for the advanced RAG operations;
  • calls the LLM microservice through a REST API, using the prompt built from the user’s query and the retrieved context;
  • sends the prompt and the answer generated by the LLM to the prompt monitoring microservice.

As you can see, the business microservice is lightweight. It glues all the domain steps together and delegates the computation to other services.

The end goal of the business layer is to act as an interface for the end client. In our case, as we will ship the business layer as a Python module, the client will be a Streamlit application.

However, you can quickly wrap the Python module with FastAPI and expose it as a REST API to make it accessible from the cloud.
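As a rough sketch (not part of the course repository), such a FastAPI wrapper could look as follows, assuming the LLMTwin class described below is importable:

from fastapi import FastAPI
from pydantic import BaseModel

from llm_twin import LLMTwin  # assumed import path for the business module

app = FastAPI()
llm_twin = LLMTwin()


class GenerateRequest(BaseModel):
    query: str
    enable_rag: bool = True


@app.post("/generate")
def generate(request: GenerateRequest) -> dict:
    # Delegate to the business module, which handles RAG, the LLM call
    # and prompt monitoring.
    return llm_twin.generate(query=request.query, enable_rag=request.enable_rag)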

Microservice architecture of the LLM twin inference pipeline

The LLM microservice is deployed on Qwak. This component focuses solely on hosting and calling the LLM. It runs on powerful GPU-enabled machines.

How does the LLM microservice work?

  • It loads the fine-tuned LLM twin model from Comet’s model registry [2].
  • It exposes a REST API that takes in prompts and outputs the generated answer.
  • When the REST API endpoint is called, it tokenizes the prompt, passes it to the LLM, decodes the generated tokens to a string and returns the answer.

That’s it!

The prompt monitoring microservice is based on Comet’s LLM dashboard. Here, we log all the prompts and generated answers into a centralized dashboard that allows us to evaluate, debug, and analyze the accuracy of the LLM.

Remember that a prompt can get quite complex. When building complex LLM apps, the prompt usually results from a chain containing other prompts, templates, variables, and metadata.

Thus, a prompt monitoring service, such as the one provided by Comet, differs from a standard logging service. It allows you to quickly dissect the prompt and understand how it was created. Also, by attaching metadata to it, such as the latency of the generated answer and the cost to generate the answer, you can quickly analyze and optimize your prompts.

2. The training vs. the inference pipeline

Before diving into the code, let’s quickly clarify the difference between the training and inference pipelines.

Along with the apparent reason that the training pipeline takes care of training while the inference pipeline takes care of inference (Duh!), there are some critical differences you have to understand.

The input of the pipeline & How the data is accessed

Do you remember our logical feature store based on the Qdrant vector DB and Comet artifacts? If not, consider checking out Lesson 6 for a refresher.

The core idea is that during training, the data is accessed from an offline data store in batch mode, optimized for throughput and data lineage.

Our LLM twin architecture uses Comet artifacts to access, version, and track all our data.

The data is accessed in batches and fed to the training loop.

During inference, you need an online database optimized for low latency. As we directly query the Qdrant vector DB for RAG, that fits like a glove.

During inference, you don’t care about data versioning and lineage. You just want to access your features quickly for a good user experience.

The data comes directly from the user and is sent to the inference logic.
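To make the contrast concrete, here is a hedged sketch of the two access patterns; the exact client calls in the course repository may differ:

# Training: offline, batch access through Comet artifacts (versioned, lineage-aware).
from comet_ml import Experiment

experiment = Experiment()
artifact = experiment.get_artifact("posts-instruct-dataset")  # versioned dataset artifact
artifact.download("./data")  # downloaded once, then fed to the training loop in batches

# Inference: online, low-latency access by querying Qdrant directly.
from qdrant_client import QdrantClient

embedded_query = [0.0] * 384  # placeholder for the embedded user query
qdrant = QdrantClient(host="localhost", port=6333)
hits = qdrant.search(
    collection_name="vector_posts",  # illustrative collection name
    query_vector=embedded_query,
    limit=3,
)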

The training vs. the inference pipeline

The output of the pipeline

The training pipeline’s final output is the trained weights stored in Comet’s model registry.

The inference pipeline’s final output is the predictions served directly to the user.

The infrastructure

The training pipeline requires more powerful machines with as many GPUs as possible.

Why? During training, you batch your data and must hold in memory all the gradients required for the optimization steps. Because of the optimization algorithm, training is more compute-hungry than inference.

Thus, more computing and VRAM result in bigger batches, which means less training time and more experiments.

The inference pipeline can do the job with less computation. During inference, you often pass a single sample or smaller batches to the model.

If you run a batch pipeline, you will still pass batches to the model but don’t perform any optimization steps.

If you run a real-time pipeline, as we do in the LLM twin architecture, you pass a single sample to the model or do some dynamic batching to optimize your inference step.

Are there any overlaps?

Yes! This is where the training-serving skew comes in.

During training and inference, you must carefully apply the same preprocessing and postprocessing steps.

If the preprocessing and postprocessing functions or hyperparameters don’t match, you will end up with the training-serving skew problem.
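One common way to reduce this risk (a sketch, not something prescribed by the course code) is to keep the shared steps in a single module imported by both pipelines:

# shared_processing.py: single source of truth for the prompt format,
# imported by both the training and the inference pipelines (illustrative).
def build_prompt(question: str, context: str | None = None) -> str:
    if context:
        return f"### Context\n{context}\n\n### Question\n{question}\n\n### Answer\n"
    return f"### Question\n{question}\n\n### Answer\n"

# training_pipeline.py:  from shared_processing import build_prompt
#   -> used to format the fine-tuning samples
# inference_pipeline.py: from shared_processing import build_prompt
#   -> used to format the prompt sent to the deployed LLM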

Enough with the theory. Let’s dig into the RAG business microservice ↓

3. Settings Pydantic class

First, let’s understand how we defined the settings to configure the inference pipeline components.

We used pydantic_settings and inherited its BaseSettings class.

This approach lets us quickly define a set of default settings variables and load sensitive values such as the API key from a .env file.

from pydantic_settings import BaseSettings, SettingsConfigDict

class AppSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    ... # Settings.

    # CometML config
    COMET_API_KEY: str
    COMET_WORKSPACE: str
    COMET_PROJECT: str = "llm-twin-course"

    ... # More settings.

settings = AppSettings()

All the variables called settings.* (e.g., settings.COMET_API_KEY) come from this class.
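For example, with a .env file holding the sensitive values, everything resolves when the class is instantiated (a small usage sketch; the file contents shown are placeholders):

# Assuming a .env file next to the code:
#   COMET_API_KEY=your-comet-api-key
#   COMET_WORKSPACE=your-workspace
#
# pydantic_settings resolves everything at instantiation time:
print(settings.COMET_PROJECT)    # "llm-twin-course" (default defined in the class)
print(settings.COMET_WORKSPACE)  # loaded from .env
# Instantiating AppSettings() without the required values raises a validation error.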

4. The RAG business module

We will define the RAG business module under the LLMTwin class. The LLM twin logic is directly correlated with our business logic.

We don’t have to introduce the word “business” in the naming convention of the classes. What we presented so far was used for a clear separation of concerns between the LLM and business layers.

Initially, within the LLMTwin class, we define all the clients we need for our business logic ↓

Inference pipeline business module: __init__() method → GitHub
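Roughly, the __init__() sets up the clients used later in generate(). Here is a hedged sketch; the exact class names live in the repository linked above, and we assume Qwak’s qwak_inference.RealTimeClient plus a hypothetical QWAK_DEPLOYMENT_MODEL_ID setting:

from qwak_inference import RealTimeClient  # Qwak's real-time endpoint client (assumed import)


class LLMTwin:
    def __init__(self) -> None:
        # Prompt template factory (with or without a RAG context section).
        self.template = InferenceTemplate()  # hypothetical name; see the repo for the real one
        # Client for the LLM microservice deployed on Qwak.
        self.qwak_client = RealTimeClient(model_id=settings.QWAK_DEPLOYMENT_MODEL_ID)  # hypothetical setting
        # Wrapper over Comet's prompt monitoring API.
        self.prompt_monitoring_manager = PromptMonitoringManager()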

Now let’s dig into the generate() method, where we:

  • call the RAG module;
  • create the prompt using the prompt template, query and context;
  • call the LLM microservice;
  • log the prompt, prompt template, and answer to Comet’s prompt monitoring service.
Inference pipeline business module: generate() method → GitHub

Now, let’s look at the complete code of the generate() method. It’s the same thing as what we presented above, but with all the nitty-gritty details.

import pandas as pd

from rag.retriever import VectorRetriever  # the retrieval module from Lesson 5 (import path may differ)
from settings import settings              # the AppSettings instance shown above (import path may differ)


class LLMTwin:
    def __init__(self) -> None:
        ...

    def generate(
        self,
        query: str,
        enable_rag: bool = True,
        enable_monitoring: bool = True,
    ) -> dict:
        prompt_template = self.template.create_template(enable_rag=enable_rag)
        prompt_template_variables = {
            "question": query,
        }

        if enable_rag is True:
            retriever = VectorRetriever(query=query)
            hits = retriever.retrieve_top_k(
                k=settings.TOP_K, 
                to_expand_to_n_queries=settings.EXPAND_N_QUERY
            )
            context = retriever.rerank(
                hits=hits, 
                keep_top_k=settings.KEEP_TOP_K
            )
            prompt_template_variables["context"] = context

            prompt = prompt_template.format(question=query, context=context)
        else:
            prompt = prompt_template.format(question=query)

        input_ = pd.DataFrame([{"instruction": prompt}]).to_json()

        response: list[dict] = self.qwak_client.predict(input_)
        answer = response[0]["content"][0]

        if enable_monitoring is True:
            # Custom metadata to attach to the logged prompt
            # (e.g., the model version; latency and cost can also be added here).
            metadata = {"model": settings.FINE_TUNED_LLM_TWIN_MODEL_TYPE}
            self.prompt_monitoring_manager.log(
                prompt=prompt,
                prompt_template=prompt_template.template,
                prompt_template_variables=prompt_template_variables,
                output=answer,
                metadata=metadata,
            )

        return {"answer": answer}

Let’s look at how our LLM microservice is implemented using Qwak.

5. The LLM microservice

As the LLM microservice is deployed on Qwak, we must first inherit from the QwakModel class and implement some specific functions.

  • initialize_model(): where we load the fine-tuned model from the model registry at serving time
  • schema(): where we define the input and output schema
  • predict(): where we implement the actual inference logic

Note: The build() function contains all the training logic, such as loading the dataset, training the LLM, and pushing it to a Comet experiment. To see the full implementation, consider checking out Lesson 7, where we detailed the training pipeline.

LLM microservice → GitHub  ←

Let’s zoom into the implementation and the life cycle of the Qwak model.

The schema() method defines what the input and output of the predict() method look like. Qwak uses it to automatically validate their structure and types. For example, the LLM microservice will throw an error if the instruction variable is a JSON object instead of a string.
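Concretely, each request must carry a string under the instruction field, matching what the business module serializes before calling the endpoint:

import pandas as pd

prompt = "Write a paragraph about RAG systems."  # illustrative prompt
payload = pd.DataFrame([{"instruction": prompt}]).to_json()
# -> '{"instruction":{"0":"Write a paragraph about RAG systems."}}'
# Sending a nested JSON object instead of a string for "instruction"
# would fail the schema() validation described above.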

The other Qwak-specific methods are called in the following order:

  1. __init__() → when deploying the model
  2. initialize_model() → when deploying the model
  3. predict() → on every request to the LLM microservice

>>> Note that these methods are called only during serving time (and not during training).

Qwak exposes your model as a RESTful API, where the predict() method is called on each request.

Inside the prediction method, we perform the following steps:

  • map the input text to token IDs using the LLM-specific tokenizer
  • move the token IDs to the provided device (GPU or CPU)
  • pass the token IDs to the LLM and generate the answer
  • extract only the generated tokens from the generated_ids variable by slicing it using the shape of the input_ids
  • decode the generated_ids back to text
  • return the generated text

Here is the complete code for the implementation of the Qwak LLM microservice:

import logging

import pandas as pd
import qwak
# The Qwak SDK classes used below (QwakModel, ModelSchema, RequestInput,
# InferenceOutput, DefaultOutputAdapter) and the project-specific
# build_qlora_model() and settings are imported from the Qwak SDK and the
# course repository (imports omitted here for brevity).


class CopywriterMistralModel(QwakModel):
    def __init__(
        self,
        use_experiment_tracker: bool = True,
        register_model_to_model_registry: bool = True,
        model_type: str = "mistralai/Mistral-7B-Instruct-v0.1",
        fine_tuned_llm_twin_model_type: str = settings.FINE_TUNED_LLM_TWIN_MODEL_TYPE,
        dataset_artifact_name: str = settings.DATASET_ARTIFACT_NAME,
        config_file: str = settings.CONFIG_FILE,
        model_save_dir: str = settings.MODEL_SAVE_DIR,
    ) -> None:
        self.use_experiment_tracker = use_experiment_tracker
        self.register_model_to_model_registry = register_model_to_model_registry
        self.model_save_dir = model_save_dir
        self.model_type = model_type
        self.fine_tuned_llm_twin_model_type = fine_tuned_llm_twin_model_type
        self.dataset_artifact_name = dataset_artifact_name
        self.training_args_config_file = config_file

    def build(self) -> None:
        # Training logic (see Lesson 7).
        ...

    def initialize_model(self) -> None:
        # Load the fine-tuned QLoRA model from the model registry at serving time.
        # (self.nf4_config, self.qlora_config and self.device are defined elsewhere
        # in the full implementation.)
        self.model, self.tokenizer, _ = build_qlora_model(
            pretrained_model_name_or_path=self.model_type,
            peft_pretrained_model_name_or_path=self.fine_tuned_llm_twin_model_type,
            bnb_config=self.nf4_config,
            lora_config=self.qlora_config,
            cache_dir=settings.CACHE_DIR,
        )
        self.model = self.model.to(self.device)

        logging.info(f"Successfully loaded model from {self.model_save_dir}")

    def schema(self) -> ModelSchema:
        # Describe the expected input and output of predict() so Qwak can validate requests.
        return ModelSchema(
            inputs=[RequestInput(name="instruction", type=str)],
            outputs=[InferenceOutput(name="content", type=str)],
        )

    @qwak.api(output_adapter=DefaultOutputAdapter())
    def predict(self, df) -> pd.DataFrame:
        # Tokenize the input text and move the token IDs to the model's device.
        input_text = list(df["instruction"].values)
        input_ids = self.tokenizer(
            input_text, return_tensors="pt", add_special_tokens=True
        )
        input_ids = input_ids.to(self.device)

        # Generate the answer.
        generated_ids = self.model.generate(
            **input_ids,
            max_new_tokens=500,
            do_sample=True,
            pad_token_id=self.tokenizer.eos_token_id,
        )

        # Keep only the newly generated tokens and decode them back to text.
        answer_start_idx = input_ids["input_ids"].shape[1]
        generated_answer_ids = generated_ids[:, answer_start_idx:]
        decoded_output = self.tokenizer.batch_decode(generated_answer_ids)[0]

        return pd.DataFrame([{"content": decoded_output}])

Where the settings used in the code above have the following values:

class AppSettings(BaseSettings):
    model_config = SettingsConfigDict(env_file=".env", env_file_encoding="utf-8")

    ... # Other settings.
    
    DATASET_ARTIFACT_NAME: str = "posts-instruct-dataset"
    FINE_TUNED_LLM_TWIN_MODEL_TYPE: str = "decodingml/llm-twin:1.0.0"
    CONFIG_FILE: str = "./finetuning/config.yaml"
    
    MODEL_SAVE_DIR: str = "./training_pipeline_output"
    CACHE_DIR: Path = Path("./.cache")

The most important one is the FINE_TUNED_LLM_TWIN_MODEL_TYPE setting, which reflects what model and version to load from the model registry.

Access the code 🔗 here ←

The final step is to look at Comet’s prompt monitoring service. ↓

6. Prompt monitoring

Comet makes prompt monitoring straightforward. There is just one API call where you connect to your project and workspace and send the following to a single function:

  • the prompt and LLM output
  • the prompt template and variables that created the final output
  • your custom metadata specific to your use case — here, you add information about the model, prompt token count, token generation costs, latency, etc.
Prompt monitoring service → GitHub  ←
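Under the hood, this boils down to a single call to Comet’s comet_llm library. Here is a minimal sketch of what the PromptMonitoringManager wraps; the metadata values are illustrative:

import comet_llm

# prompt, answer, prompt_template and prompt_template_variables are the values
# computed in LLMTwin.generate().
comet_llm.log_prompt(
    workspace=settings.COMET_WORKSPACE,
    project=f"{settings.COMET_PROJECT}-monitoring",
    api_key=settings.COMET_API_KEY,
    prompt=prompt,
    output=answer,
    prompt_template=prompt_template.template,
    prompt_template_variables=prompt_template_variables,
    metadata={"model": "llm-twin", "latency_seconds": 1.7},  # illustrative custom metadata
)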

Let’s look at the logs in Comet ML’s LLMOps dashboard.

Here is how you can quickly access them ↓

  1. log in to Comet (or create an account)
  2. go to your workspace
  3. access the project with the “LLM” symbol attached to it. In our case, this is the “llm-twin-course-monitoring” project.

Note: Comet provides a free version which is enough to run these examples.

Screenshot from Comet’s dashboard

This is how Comet’s prompt monitoring dashboard looks. Here, you can scroll through all the prompts that were ever sent to the LLM. ↓

You can click on any prompt and see everything we logged programmatically using the PromptMonitoringManager class.

Screenshot from Comet’s dashboard

Besides what we logged, it can also be valuable to attach various tags and the inference duration.

7. Deploying and running the inference pipeline

Qwak makes the deployment of the LLM microservice straightforward.

During Lesson 7, we fine-tuned the LLM and built the Qwak model. As a quick refresher, we ran the following CLI command to build the Qwak model, where we used the build_config.yaml file with the build configuration:

poetry run qwak models build -f build_config.yaml .

After the build is finished, we can make various deployments based on the build. For example, we can deploy the LLM microservice using the following Qwak command:

qwak models deploy realtime \
  --model-id "llm_twin" \
  --instance "gpu.a10.2xl" \
  --timeout 50000 \
  --replicas 2 \
  --server-workers 2

We deployed two replicas of the LLM twin. Each replica has access to a machine with a single A10 GPU, and each replica runs two workers.

🔗 More on Qwak instance types ←

Two replicas with two workers each result in four parallel serving processes that can handle our users’ requests.

You can scale the deployment to more replicas if you need to serve more clients. Qwak provides autoscaling mechanisms triggered by monitoring GPU, CPU, or RAM consumption.

To conclude, you build the Qwak model once, and based on it, you can make multiple deployments with various strategies.

You can quickly close the deployment by running the following:

qwak models undeploy --model-id "llm_twin"

We strongly recommend closing down the deployment when you are done, as GPU VMs are expensive.

To run the LLM system with a predefined prompt example, you have to run the following Python file:

poetry run python main.py

Within the main.py file, we call the LLMTwin class, which calls the other services as explained during this lesson.
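A minimal version of such an entry point could look like this; the repository’s main.py may differ in its details:

from llm_twin import LLMTwin  # assumed import path for the business module

if __name__ == "__main__":
    llm_twin = LLMTwin()
    result = llm_twin.generate(
        query="Could you draft a LinkedIn post discussing RAG systems?",
        enable_rag=True,
    )
    print(result["answer"])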

Note: The → complete installation & usage instructions ← are available in the README of the GitHub repository.

🔗 Check out the code on GitHub [1] and support us with a ⭐️

Conclusion

Congratulations! You are close to the end of the LLM twin series.

In Lesson 9 of the LLM twin course, you learned to build a scalable inference pipeline for serving LLMs and RAG systems.

First, you learned how to architect an inference pipeline by understanding the difference between monolithic and microservice architectures. We also highlighted the difference in designing the training and inference pipelines.

Secondly, we walked you through implementing the RAG business module and LLM twin microservice. Also, we showed you how to log all the prompts, answers, and metadata for Comet’s prompt monitoring service.

Ultimately, we showed you how to deploy and run the LLM twin inference pipeline on the Qwak AI platform.

In Lesson 10, we will show you how to evaluate the whole system by building an advanced RAG evaluation pipeline that analyzes the accuracy of the LLM’s answers relative to the query and context.

See you there! 🤗

🔗 Check out the code on GitHub [1] and support us with a ⭐️

References

Literature

[1] Your LLM Twin Course — GitHub Repository (2024), Decoding ML GitHub Organization

[2] Add your models to Model Registry (2024), Comet Guides

Images

If not otherwise stated, all images are created by the author.

Paul Iusztin, Decoding ML
