August 30, 2024
A guest post from Fabrício Ceolin, DevOps Engineer at Comet. Inspired by the growing demand…
When you learn about a specific topic, there’s always more to it. In the world of data, there’s always more math behind it, more processes to unravel, and more to learn. In this article, I will go through the different techniques you can use for tokenization in NLP.
Let’s start off with some definitions…
Natural Language Processing (NLP) is a computer program's ability to detect and understand human language, through speech and text, just as we humans can.
Tokenization is the process of breaking down or splitting paragraphs and sentences into smaller units so that they can be more easily processed by NLP models. These smaller units of raw text are called tokens.
A delimiter is a sequence of one or more characters that marks the beginning or end of a unit of data.
If tokenization is done to split up words, it is called word tokenization. If tokenization is done to split up sentences, it is called sentence tokenization.
Tokenization is the first step in your NLP pipeline, and it has a domino effect on everything that follows. Tokens can help us understand the frequency of a particular word in the data and can be used directly as a vector representing the data. In this way, the string goes from unstructured text to a numerical data structure, which helps your NLP pipeline run smoothly.
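As a minimal sketch of that idea (using a made-up sentence), tokens produced by a naive whitespace split can be counted and used as a simple frequency representation:

```python
from collections import Counter

# A made-up sentence, split on whitespace into word tokens.
text = "the cat sat on the mat because the mat was warm"
tokens = text.split()

# Token counts can serve as a very simple numerical representation of the text.
frequencies = Counter(tokens)
print(frequencies.most_common(3))  # [('the', 3), ('mat', 2), ('cat', 1)]
```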
There are different types of tokenization techniques, so let’s dive in!
Whitespace tokenization is known as one of the simplest tokenization techniques, as it uses the whitespace within the string as the delimiter between words. Wherever there is white space, the data will be split at that point.
Although this is one of the simplest and fastest tokenization techniques, the catch is that it only works well for languages that use white space to separate meaningful words and sentences, such as English.
This technique can be easily executed using Python’s built-in functions.
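For example, a quick sketch using nothing but the built-in str.split(), which splits on any run of whitespace by default:

```python
# Whitespace tokenization with Python's built-in str.split(),
# which splits on any run of whitespace by default.
sentence = "Tokenization is the first step in an NLP pipeline."
tokens = sentence.split()
print(tokens)
# ['Tokenization', 'is', 'the', 'first', 'step', 'in', 'an', 'NLP', 'pipeline.']
```

Notice that the punctuation stays attached to the last word ("pipeline."), which is one reason to reach for the rule-based tokenizers below.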
The regular expression tokenizer is a rule-based technique and should be used when other techniques don't serve your specific purpose. For example, there may be punctuation in the text that is leaving the data unclean; a regular expression can then be used to split the string into exactly the substrings you want.
This can be easily executed with NLTK, a Python toolkit built for working with NLP. The nltk.tokenize.regexp module splits the string into substrings. For example:
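A minimal sketch using NLTK's RegexpTokenizer (assuming NLTK is installed via pip install nltk); the pattern here keeps only runs of word characters and drops punctuation:

```python
from nltk.tokenize import RegexpTokenizer

# Keep only runs of word characters, dropping punctuation entirely.
tokenizer = RegexpTokenizer(r"\w+")
print(tokenizer.tokenize("Let's tokenize this sentence, shall we?"))
# ['Let', 's', 'tokenize', 'this', 'sentence', 'shall', 'we']
```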
Penn Treebank tokenization reads in raw text and outputs tokens based on the conventions of the Penn Treebank, one of the largest published treebanks. A treebank provides the semantic and syntactic annotation of a language.
Similar to the technique above, it uses regular expressions to tokenize text, following the conventions used in the Penn Treebank, and it is also part of the NLTK Python toolkit. You can see the slight difference between the example above and the one below:
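A sketch using NLTK's TreebankWordTokenizer; note how it splits contractions and separates punctuation, unlike the plain whitespace split earlier:

```python
from nltk.tokenize import TreebankWordTokenizer

# Penn Treebank conventions: contractions are split and punctuation
# becomes its own token.
tokenizer = TreebankWordTokenizer()
print(tokenizer.tokenize("They don't tokenize text the same way, do they?"))
# ['They', 'do', "n't", 'tokenize', 'text', 'the', 'same', 'way', ',', 'do', 'they', '?']
```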
spaCy is an open-source Python library that provides great flexibility. It is a modern take on tokenization in NLP: fast, easily customized, and able to handle large volumes of text without requiring you to write segmentation rules yourself. For example:
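A minimal sketch, assuming spaCy is installed (pip install spacy); spacy.blank("en") loads a bare English pipeline whose tokenizer works without downloading a trained model:

```python
import spacy

# A blank English pipeline: just the rule-based tokenizer, no trained
# components and no model download required.
nlp = spacy.blank("en")
doc = nlp("Dr. Smith isn't visiting the U.K. until Sept. 2024, is she?")
print([token.text for token in doc])
```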
The Moses tokenizer, similar to spaCy's, is also a rule-based tokenizer. It is able to separate punctuation from words while preserving special tokens such as dates, all whilst normalizing characters as part of its segmentation logic. For example:
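One common way to run the Moses tokenizer from Python is the sacremoses package (pip install sacremoses); a minimal sketch:

```python
from sacremoses import MosesTokenizer

# Tokenize an English sentence with Moses rules; punctuation handling
# and normalization follow its built-in English rules.
mt = MosesTokenizer(lang="en")
print(mt.tokenize("The meeting on 30/08/2024 wasn't cancelled, was it?"))
```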
The subword tokenization technique is based on the idea that frequently occurring words, such as "there", "helping", etc., should live in the vocabulary as they are, while rarer words are split into frequent subwords. For example, the word "reiterate" can be split into the subwords "re" and "iterate". These subwords occur more frequently, and the meaning is still kept intact. Each subword is assigned a unique ID so that the model can learn better over time during the training phase.
There are different types of subword tokenization:
Byte Pair Encoding (BPE) was first described in the article "A New Algorithm for Data Compression", published in 1994. BPE is a data compression algorithm in which the most common pair of consecutive bytes of data is replaced with a byte that does not occur in that data.
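To make the core idea concrete, here is a toy, pure-Python sketch of the BPE merge step: count the most frequent pair of adjacent symbols (characters here) across a tiny made-up corpus and merge it into a new symbol. Production libraries, such as Hugging Face's tokenizers, implement the same idea at scale:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus and return the most common one."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Toy corpus: each word is a tuple of characters with a made-up frequency.
words = {tuple("lower"): 2, tuple("lowest"): 3, tuple("newer"): 4}
for _ in range(3):  # perform three merge steps
    pair = most_frequent_pair(words)
    words = merge_pair(words, pair)
    print("merged", pair, "->", list(words.keys()))
```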
WordPiece is the tokenization algorithm that Google developed to pre-train BERT. WordPiece first pre-tokenizes the text into words by splitting on punctuation and whitespace, and then tokenizes each word into subword units, called wordpieces.
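A minimal sketch using the Hugging Face transformers library (pip install transformers; the bert-base-uncased vocabulary is downloaded on first use), showing the "##" prefix WordPiece uses for continuation pieces:

```python
from transformers import AutoTokenizer

# Load BERT's WordPiece tokenizer (the vocabulary is downloaded on first use).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer.tokenize("Tokenization reiterates subword splitting."))
# Continuation pieces carry the "##" prefix, e.g. 'token', '##ization', ...
```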
The Unigram language model considers each token independent of the tokens before it. The model starts its base vocabulary with a large number of symbols and progressively trims them down to produce a smaller vocabulary. It focuses on the fraction of the time a specific word appears compared to the other words in the training text.
During the training phase, the Unigram language model computes, using a log-likelihood loss, how much the overall loss would increase if a specific symbol were removed from the vocabulary. The model then removes the symbols with the lowest loss increase and continues doing this until it reaches its desired vocabulary size.
It is important to know that the model will always keep the base characters so that any word can be tokenized.
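As a toy illustration of the likelihood side of that loss (the probabilities below are made up), the score of a candidate segmentation is the sum of the negative log-probabilities of its pieces; training compares how such corpus-level scores change when a symbol is dropped from the vocabulary:

```python
import math

# Made-up unigram probabilities for a handful of candidate pieces.
probs = {"re": 0.05, "iterate": 0.01, "reiterate": 0.001}

def neg_log_likelihood(pieces):
    """Score of a segmentation: sum of -log p(piece) over its pieces."""
    return sum(-math.log(probs[piece]) for piece in pieces)

print(neg_log_likelihood(["re", "iterate"]))  # one candidate segmentation
print(neg_log_likelihood(["reiterate"]))      # a competing segmentation
```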
Sounds very similar to WordPiece, right? However, SentencePiece is not actually a tokenizer in itself; it is a method for selecting tokens from a predefined list, optimizing the tokenization process over a supplied corpus by implementing the subword regularization algorithm.
SentencePiece processes sentences simply as sequences of Unicode characters and then trains tokenization and detokenization models from those sentences.
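A minimal sketch using the sentencepiece package (pip install sentencepiece); "corpus.txt" and the vocabulary size are placeholders for your own data, and model_type="unigram" trains the Unigram model described above:

```python
import sentencepiece as spm

# Train a small Unigram model; "corpus.txt" (one sentence per line) and the
# vocabulary size are placeholders for your own data.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="spm_unigram",
    vocab_size=1000,
    model_type="unigram",
)

sp = spm.SentencePieceProcessor(model_file="spm_unigram.model")
print(sp.encode("This is a test.", out_type=str))  # subword pieces
print(sp.encode("This is a test.", out_type=int))  # their integer IDs
```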
It’s interesting how something that comes to us naturally can be processed in many different ways for computers.
However, tokenization comes with its own challenges. The biggest challenge is the input language itself. Take English, for example: most words are separated by spaces, and punctuation gives us a better understanding of the context. But not every language is like this. Mandarin, for example, does not place clear boundaries such as spaces between its words.
Language is difficult to learn, especially for computers…