
How to Build a Text Classification Model Using HuggingFace Transformers and Comet

Text classification is a core task in machine learning and natural language processing, used in business and everyday life to determine things like the sentiment of a piece of text. For example, a company can use it to determine whether customer reviews are positive or negative. There are several ways to build a model that classifies text: you can train one from scratch or use a pre-trained model.

This article will show you how to build your own text classification model using Hugging Face Transformers (which provides state-of-the-art pre-trained models) and how to use Comet to keep track of your model’s experiments. Without further ado, let’s get started!

What are Transformers?

Hugging Face Transformers delivers state-of-the-art (cutting-edge) pre-trained models that let you perform tasks on many kinds of data, such as text, audio, or images. These models rely on a machine learning approach known as transfer learning: a model has already been trained on a large corpus, so you don’t have to worry about developing one from scratch. This saves resources such as processing power, data sourcing, and so on. All you need to do is fine-tune the model so that it works effectively for your application.

For example, we can use Transformers to build a text classification model, which saves time and money because we don’t have to train the model from scratch or obtain a large amount of data.
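To get a feel for how little code a pre-trained model needs, here is a small, self-contained sketch (not part of this tutorial’s pipeline) that uses the library’s pipeline API with its default sentiment-analysis checkpoint:

from transformers import pipeline

# Downloads a default pre-trained sentiment model and runs it on one review.
classifier = pipeline("sentiment-analysis")
print(classifier("This movie was an absolute delight to watch!"))
# e.g. [{'label': 'POSITIVE', 'score': 0.99...}]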

Transformers is supported by three of the most prominent deep learning libraries — JAX, PyTorch, and TensorFlow — with seamless integration.

What is Comet?

Comet is a machine-learning platform that allows you to track the artifacts of your machine-learning experiments such as model metrics (e.g., accuracy score, confusion matrix), hyperparameters, model metadata, and so on.

We will use Comet to keep track of the metrics, hyperparameters, and other artifacts of the transformer model we will build for text classification.
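To give a sense of what that tracking looks like, here is a minimal, illustrative sketch of logging to Comet by hand (the project name and values are placeholders); later in the article, the Transformers integration will handle most of this for us automatically.

from comet_ml import Experiment

# Create an experiment; Comet reads your API key from your environment or config.
experiment = Experiment(project_name="my-demo-project")

# Log hyperparameters and metrics as the run progresses.
experiment.log_parameter("learning_rate", 5e-5)
experiment.log_metric("accuracy", 0.87, step=1)

experiment.end()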

Now enough talking, let’s get to business!

To get the most out of this tutorial, click here to download the Colab notebook so you can easily follow along.

Dataset

The dataset we will be using is the IMDb dataset, which contains movie reviews with labels indicating whether each review is positive or negative. It is a public dataset hosted on the Hugging Face Hub and can be accessed through the datasets library, which you can install using pip.

Libraries, tools, and environment

These are the libraries we will be using.

  • Environment: Google Colab for experimentation.
  • Transformers: for building our state-of-the-art text classification model.
  • Datasets: for loading the IMDb data we will fine-tune our pre-trained model on.
  • Comet: for tracking the experiments of our model.
  • Scikit-Learn: for evaluating the model performance.
  • PyTorch: to work with transformers.

As we’ve said earlier, Transformers is backed by the three most prominent deep-learning libraries: JAX, PyTorch, and TensorFlow. For this project, we will be using PyTorch.

You can run the code below in Colab to install the above libraries.

%pip install comet_ml torch datasets transformers scikit-learn

Load the dataset

Once we’ve installed the above libraries, the next step is to start building our text classification model. First, we will import the necessary libraries, then initialize our Comet experiment and name our project “Hugging Face Text Classification.”

Note that you will be prompted for your API key. If you don’t have one, you can sign up here to get one.
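If you would rather not be prompted interactively, Comet can also pick up the key from the COMET_API_KEY environment variable; for example, in Colab (the key below is a placeholder):

import os

# Placeholder: replace with your actual Comet API key.
os.environ["COMET_API_KEY"] = "YOUR_API_KEY"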

Then we load the IMDb dataset and print it out.

import comet_ml
from datasets import load_dataset

from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import AutoTokenizer, Trainer, TrainingArguments, AutoModelForSequenceClassification, DataCollatorWithPadding

comet_ml.init(project_name = "Hugging Face Text Classification")

df = load_dataset("imdb")
print(df)

We can see that the output is a dictionary of datasets (a DatasetDict): it contains train and test splits (a 50:50 ratio), as well as an unsupervised split intended for unsupervised learning applications. Each split has only two features: text, which contains the review, and label, which holds the value 0 or 1, indicating whether the review is negative or positive.
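For reference, the printed DatasetDict looks roughly like this (row counts as published for the IMDb dataset on the Hugging Face Hub):

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 25000
    })
    unsupervised: Dataset({
        features: ['text', 'label'],
        num_rows: 50000
    })
})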

The dataset has many observations; however, we will only use a subset of it. We will select a subset of the train and test splits: the model will be trained on 200 rows and evaluated on 100 rows. We will narrow the sample down later on in our code.


Defining the Transformer Model

As previously mentioned, Transformers includes hundreds of state-of-the-art pre-trained models, and we will use one of them to build the text classification application. The model we will be using is distilbert-base-uncased.

distilbert-base-uncased is a distilled version of BERT that is smaller and faster while retaining most of BERT’s language-understanding ability. We will fine-tune it on the IMDb reviews so it can identify whether a customer review is positive or negative.

We will assign the model name to a variable and also set a random_seed value for reproducibility.

PRE_TRAINED_MODEL_NAME = "distilbert-base-uncased"
random_seed = 42

Tokenize Dataset

Now that we’ve done that, the next step is to tokenize the data (i.e., convert it into a form the model can interpret). Note that after tokenizing, we also select the smaller subsets of data we mentioned earlier (200 training rows and 100 evaluation rows). We can do that by typing the following:

tokenizer = AutoTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

def tokenize_function(data):
    return tokenizer(data["text"], padding="max_length", truncation=True)


tokenize_df = df.map(tokenize_function, batched=True)
train_df = tokenize_df["train"].shuffle(seed=random_seed).select(range(200))
test_df = tokenize_df["test"].shuffle(seed=random_seed).select(range(100))

data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

Let’s break that down line by line.

What we did in the above code was create a tokenize function that processes the text so that it is ready for the model.

tokenizer = AutoTokenizer.from_pretrained(PRE_TRAINED_MODEL_NAME)

With the first line of code (above), we instantiated the AutoTokenizer class and passed in the name of the pre-trained model, which we stored earlier in the PRE_TRAINED_MODEL_NAME variable (“distilbert-base-uncased”).

Then we created a function that tokenizes the text. The padding and truncation parameters matter because, when text is collected into batches, the sequences are not all the same length, while the model expects fixed-size tensors. Padding adds padding tokens so that shorter sequences become as long as the longest one, and truncation, when set to True, cuts sequences down to the maximum length the model can handle.
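To see what this means in practice, here is a small illustrative sketch (the sentences are made up) that reuses the tokenizer we just created: two reviews of different lengths come back padded to the same length, and the attention mask records which positions are real tokens.

sample = tokenizer(
    ["Great film!", "The plot dragged on and the acting was unconvincing."],
    padding=True,
    truncation=True,
)
print(sample["input_ids"])       # two lists of equal length; the shorter one is padded with 0s
print(sample["attention_mask"])  # 1 for real tokens, 0 for padding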

To learn more about this process click here.

Then, the next part maps the function over the dataset; setting batched to True is needed so that the texts are processed in batches, which is faster.

Then, as previously stated, we select the subset of data we need: 200 rows for the training set and 100 for the evaluation set.

The last part of the code is required for the model to collect data in batches. We’ll see how this works at the end of the article.
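As a quick, illustrative sketch of what the collator does (again with made-up sentences), it takes individually tokenized examples of different lengths and pads them to the longest example in the batch at collation time:

features = [
    tokenizer("Loved it."),
    tokenizer("A long, meandering film that never finds its footing."),
]
batch = data_collator(features)
print(batch["input_ids"].shape)  # e.g. torch.Size([2, 14]); padded to the longest sequence in the batch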

Instantiating the Transformer Model

Once the data is ready, we can now build our transformer model. To do that we type the following:

model = AutoModelForSequenceClassification.from_pretrained(
    PRE_TRAINED_MODEL_NAME, num_labels=2
)

Creating The Trainer Model

Now that we’ve instantiated our model, the next step is to create the Trainer object that will be used to fine-tune it. The Trainer accepts many parameters, but we will make use of the seven most important ones. These are:

  • model: The model for predictions.
  • args: The arguments to tweak or fine-tune.
  • data_collator: The function that will be used to form a batch of the training and test set.
  • train_dataset: The dataset to be used for training.
  • eval_dataset: The dataset to be used for evaluation.
  • tokenizer: The tokenizer that is used to preprocess the data.
  • compute_metrics: The function that will be used to compute metrics for evaluating the model.

Of these seven parameters, two don’t exist yet: the training arguments and the function that computes metrics on the evaluation set. First, we will create the training arguments. To do so, we use the TrainingArguments object we imported earlier and pass it the arguments we want to use. We can accomplish this by entering the code below.

training_arguments = TrainingArguments(
    seed=random_seed,             # for reproducibility
    optim="adamw_torch",          # use PyTorch's AdamW optimizer
    learning_rate=5e-5,
    num_train_epochs=1,
    output_dir="./results",       # where checkpoints and outputs are written
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    evaluation_strategy="steps",  # evaluate every eval_steps steps
    eval_steps=25,
    save_strategy="steps",        # save a checkpoint every save_steps steps
    save_total_limit=10,          # keep at most 10 checkpoints
    save_steps=25
)

The Transformers library provides an integration with Comet that automatically reports all metrics, logs, and other assets to the Comet platform.

TrainingArguments has a lot of parameters you can tune (about 94 of them in total!). You can click here to learn more about how to fine-tune your model more precisely.
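For example, a few commonly tweaked arguments you could add are shown in the sketch below (the values are illustrative, not the ones used in this tutorial):

more_arguments = TrainingArguments(
    output_dir="./results",
    per_device_train_batch_size=16,  # batch size per GPU/CPU during training
    per_device_eval_batch_size=16,   # batch size during evaluation
    weight_decay=0.01,               # L2-style regularization on the weights
    warmup_steps=50,                 # linear learning-rate warmup at the start of training
    logging_steps=10,                # how often to log the training loss
)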

After that, we can create the function that will be utilized to compute the metrics.

def compute_metrics(pred):
    
    # get the global Comet experiment
    experiment = comet_ml.get_global_experiment()
    
    # get y_true and y_pred for the eval dataset
    labels = pred.label_ids
    preds = pred.predictions.argmax(-1)
    
    # compute precision, recall, and F1 score
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average='macro')
    
    # compute accuracy score
    acc = accuracy_score(labels, preds)
    
    # log the confusion matrix to Comet
    if experiment:
        epoch = int(experiment.curr_epoch) if experiment.curr_epoch is not None else 0
        experiment.set_epoch(epoch)
        experiment.log_confusion_matrix(
            y_true=labels,
            y_predicted=preds,
            labels=["negative", "positive"]
        )

    return {"accuracy": acc, 
            "f1": f1, 
            "precision": precision,
            "recall": recall
            }

Now that we are done with that, we can create our Trainer and train the model by typing the following code.

%env COMET_MODE=ONLINE
%env COMET_LOG_ASSETS=TRUE
trainer = Trainer(
    model=model,
    args=training_arguments,
    train_dataset=train_df,
    eval_dataset=test_df,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
    data_collator=data_collator,
)
trainer.train()
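Once training finishes, a quick sanity check you might run (this snippet is a sketch and not part of the original notebook) is to score a brand-new review with the fine-tuned model:

import torch

review = "I was hooked from the first scene, easily the best film I've seen this year."
inputs = tokenizer(review, truncation=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    logits = model(**inputs).logits

prediction = logits.argmax(dim=-1).item()
print("positive" if prediction == 1 else "negative")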

Results

Once training completes, the metrics, hyperparameters, and the confusion matrix logged by compute_metrics are all available in the Comet dashboard under the “Hugging Face Text Classification” project, where you can inspect and compare each experiment run.

Conclusion

In this article, you learned how to create a text classification model using Transformers and how to keep track of the model’s experiments in Comet. Given that the training set was only 200 observations, the model performed reasonably well. You can try improving its accuracy by increasing the training set size. Thank you for reading. You can find the link to the Colab notebook below.

Ibrahim Ogunbiyi

