November 21, 2024
As most would agree, the advent of LLMs has transformed the technology industry, and there has been a huge surge of interest in learning about them.
Within a short time, the technologies around LLMs have evolved considerably, and different learning streams now cater to learners' varying needs. This article captures my understanding and the learning path I followed to build LLM-based solutions.
This article is most useful for practicing data scientists who want to get into the LLM field and expand their skill set.
So, let’s get started…
Though OpenAI's ChatGPT is the leader among LLMs and has revolutionized the industry with its offerings, the open-source LLM ecosystem is evolving rapidly and is approaching proprietary LLMs in performance.
Open-source LLMs, especially for data scientists, offer a broader scope to learn and apply new things. The best resource for monitoring the current state of open-source LLMs is the Hugging Face Open LLM Leaderboard, where open-source LLMs are ranked based on various evaluations.
Open LLM Leaderboard – a Hugging Face Space by HuggingFaceH4
In this article, the learning path I describe is primarily about building large language model solutions on top of open-source LLMs.
Though LLM applications are vast, we can broadly categorize them into the following streams: prompt engineering, LangChain-based integration, LLM fine-tuning, and Retrieval-Augmented Generation (RAG).
The first stream, prompt engineering, is the most basic and most widely applicable one. Here, we primarily work with proprietary large language models such as ChatGPT, and the focus is on learning to compose prompts so the LLM gives you the most appropriate answer.
Data scientists may not have much scope here, as it is primarily about learning how to best use the available chat-based LLM offerings.
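As a toy illustration of prompt composition, compare a vague request with one that spells out the role, constraints, and output format. The wording below is entirely my own and is only meant to show the pattern:

```python
# Two ways to ask for the same thing; the structured prompt tells the model
# who it is, what to do, and how to format the answer (wording is illustrative).
vague_prompt = "Tell me about our sales."

structured_prompt = """You are a data analyst.
Summarize the quarterly sales figures below in exactly three bullet points,
highlighting the largest change versus the previous quarter.

Sales data:
{sales_table}
"""
```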
LangChain is a framework for connecting large language model solutions with other technologies. Say you have a use case that requires feeding input from your database to a large language model; LangChain handles that integration. It is very comprehensive, and its applications are evolving rapidly. A minimal sketch of a LangChain chain is shown below.
https://python.langchain.com/docs/get_started/introduction.html
Again, data scientists have limited scope here; the primary users are engineers who build enterprise-scale solutions on top of LLM outputs.
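To make the integration idea concrete, here is a minimal sketch of a LangChain chain that feeds a database record into an LLM. The model choice, prompt wording, and sample record are my own illustrative assumptions; the sketch assumes the langchain-openai and langchain-core packages are installed and an API key is configured:

```python
# A minimal LangChain chain: prompt template -> chat model -> string output.
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

prompt = ChatPromptTemplate.from_template("Summarize this customer record:\n{record}")
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)  # model name is an example
chain = prompt | llm | StrOutputParser()

# In a real use case, `record` would come from your own database query.
print(chain.invoke({"record": "Name: Jane Doe, Plan: Pro, Last login: 2024-11-01"}))
```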
LLM fine-tuning is one of the most exciting areas: we curate a dataset specific to our needs and fine-tune the LLMs released by model providers. This offers a broad scope of learning for data scientists, and a good grasp of the following concepts is necessary to excel here.
→ Hugging Face text-generation pipeline: Hugging Face has become synonymous with large language models, and it has built excellent libraries to aid the fine-tuning of pre-trained models.
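A minimal usage sketch of the text-generation pipeline, assuming the transformers library is installed (the model name is just a small example; any causal LM on the Hub works):

```python
# Load a text-generation pipeline and generate a short continuation.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
print(generator("Large language models are", max_new_tokens=30)[0]["generated_text"])
```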
Also, please note that large language models are causal language models: they generate a response by predicting the next token given the tokens so far. So, it is essential to have a good grasp of training a causal language model from scratch, and the article below is beneficial:
Training a causal language model from scratch – Hugging Face NLP Course
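To see what "causal" means in code, here is a tiny sketch: passing the input ids as labels makes transformers compute the standard next-token-prediction loss (the model choice is illustrative):

```python
# Causal language modeling: the training objective is to predict each next token,
# so the labels are simply the input ids (the shift is handled internally).
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("Open-source LLMs are evolving fast", return_tensors="pt")
outputs = model(**inputs, labels=inputs["input_ids"])  # next-token prediction loss
print(outputs.loss.item())
```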
→ PEFT, LoRA, QLoRA concepts: Fine-tuning LLMs is not as straightforward as the 'transfer learning' we do with other models. Since we must deal with billions of parameters, we need a more sophisticated process based on PEFT (Parameter-Efficient Fine-Tuning) techniques.
The following videos and articles were very useful to learn about the mentioned concepts:
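In code, attaching LoRA adapters with the PEFT library looks roughly like the sketch below; the base model and hyperparameters are illustrative, not recommendations:

```python
# Wrap a base causal LM with LoRA adapters so that only a small number of
# low-rank matrices are trained instead of all the model's parameters.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")  # example base model

lora_config = LoraConfig(
    r=16,                                  # rank of the low-rank update matrices
    lora_alpha=32,                         # scaling factor
    target_modules=["q_proj", "v_proj"],   # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a small fraction of weights are trainable
```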
→ Quantization: Quantization makes it possible to fine-tune massive LLMs on a single GPU with little loss in quality. Again, Hugging Face has published excellent articles covering the nuances of quantization in detail, listed below:
A Gentle Introduction to 8-bit Matrix Multiplication for transformers at scale using transformers
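As a rough sketch, loading a model in 4-bit with bitsandbytes (the NF4 setup popularized by QLoRA) looks like this; the model name is just an example, and a CUDA GPU plus the bitsandbytes package are assumed:

```python
# Load a causal LM with 4-bit quantized weights to fit it on a single GPU.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NormalFloat4, as used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for computation
)

model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b",                     # example model
    quantization_config=bnb_config,
    device_map="auto",
)
```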
→ Instruction Dataset: To fine-tune large language models properly, we require high-quality datasets in an instructional format.
If you look at open LLMs such as Llama and Falcon, you will notice they are released in two versions: a base version and an instruct version.
The base version is trained on a massive, internet-scale open text corpus, usually over multiple months on a GPU farm. These base models are the foundation for adapting LLMs to specific needs.
Instruct versions are built from base versions using high-quality instruction datasets, where quality matters more than quantity. We can build high-performing instruct models with an instruction dataset of even ~20K records.
Alpaca is a widely used reference for this format; its dataset contains around ~52K instruction records spanning many domains (see the example record below).
tatsu-lab/alpaca · Datasets at Hugging Face
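For reference, an Alpaca-style record has three fields; the values in this sketch are made up purely for illustration:

```python
# One Alpaca-style instruction record, shown as a Python dict.
example = {
    "instruction": "Summarize the following support ticket in one sentence.",
    "input": "Customer reports that exported CSV files are missing the header row since the last update.",
    "output": "A customer says CSV exports have lost their header row after the latest update.",
}
```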
Also, please note that for fine-tuning, base versions are recommended.
This is a more straightforward application of large language models, but it still has a good scope of learning for data scientists.
Here, we leverage foundation models to build a Retrieval-Augmented Generation (RAG) solution, where the LLM answers a query by summarizing the relevant information retrieved from your own content.
Vector databases such as Chroma and Pinecone are widely used here to identify the content in your database that is most relevant to the query, and the retrieved content is then summarized by the LLM as the response. A bare-bones sketch of this flow appears at the end of this section.
Build Industry-Specific LLMs Using Retrieval Augmented Generation
RAG-based applications have a broad scope in practice, as they are relatively simple and straightforward to build.
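Here is the bare-bones sketch of the RAG flow mentioned above, using Chroma as the vector store. The collection name, the documents, and the summarize_with_llm helper are all hypothetical placeholders:

```python
# Index a few documents, retrieve the one most relevant to the query,
# and hand the retrieved context to an LLM for the final answer.
import chromadb

client = chromadb.Client()
collection = client.create_collection("my_docs")

# Index your own content (Chroma embeds documents with its default embedding function).
collection.add(
    ids=["doc1", "doc2"],
    documents=[
        "Our refund policy allows returns within 30 days of purchase.",
        "Premium support is available 24/7 for enterprise customers.",
    ],
)

query = "How long do customers have to return a product?"
results = collection.query(query_texts=[query], n_results=1)
context = results["documents"][0][0]

# The retrieved context is passed to the LLM along with the query;
# summarize_with_llm is a placeholder for whichever open-source LLM you use.
prompt = f"Answer the question using only this context:\n{context}\n\nQuestion: {query}"
# answer = summarize_with_llm(prompt)
```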
With that, we have covered the various resources for mastering large language model concepts in detail. Again, this path is most applicable to practicing data scientists, as foundational data-science knowledge is required to understand the ideas above. This is a summary of the learning path I followed, and I hope it will be helpful for my fellow practitioners.
Please follow my handle to learn about other insightful concepts associated with LLMs and DS.
Thanks! Happy Learning!