August 30, 2024
A guest post from Fabrício Ceolin, DevOps Engineer at Comet. Inspired by the growing demand…
Although it has many forms, text wrangling is basically the pre-processing work done to get raw text data ready for training. Simply put, it’s the process of cleaning your data so your program can read it, and then formatting it accordingly.
Many of you may already be wrangling text without realizing it. In this tutorial, I will teach you how to clean up your text in Python. I will show you how to perform the most common forms of text wrangling: sentence splitting, tokenization, stemming, lemmatization, and stop word removal.
Obviously, you’ll need a little bit of Python know-how in order to run the code I’ll show below. I’ll be using a Google Colab notebook to host all my code. I’ll share the link at the end so you can see how your code compares. To create a new notebook, click here.
If you’ve worked with natural language code before in Python, you’re probably familiar with the Python package nltk, or the Natural Language Toolkit. It’s an amazing library with many functions for building Python programs to work with human language data. Let’s begin by typing the following code:
import nltk
nltk.download('punkt')
In this cell, we are importing the library and asking our notebook to download punkt. This is a tokenizer that divides a text into a list of sentences, which brings us to our first example of text wrangling: sentence splitting.
If you’ve ever been given a large paragraph of text, you know that the best way to analyze it is by splitting the text into sentences. In real-life conversations, we also process information at the sentence level. However, splitting paragraphs of text into sentences in raw code can be difficult. Luckily, with nltk, we can do this quite easily. Type the following code:
sampleString = "Let's make this our sample paragraph. It will split at the end of a sentence marker, like a period. It even knows that the period in Mr. Jones is not the end. Try it out!"
from nltk.tokenize import sent_tokenize
tokenized_sent = sent_tokenize(sampleString)
print(tokenized_sent)
This code might be self-explanatory, but it’s okay if this is your first time. Here is what we typed line by line:
1. First, we define a variable called sampleString that contains a couple of sentences. You can change the text in this variable to whatever you wish.
2. Next, we import sent_tokenize, which is the sentence tokenization function from the nltk library.
3. Then, we run the sent_tokenize function on our sampleString. This runs the tokenization function over our string and saves the results to a new variable called tokenized_sent.
4. Finally, we print tokenized_sent to the log. You should receive an output that looks like this:

["Let's make this our sample paragraph.", 'It will split at the end of a sentence marker, like a period.', 'It even knows that the period in Mr. Jones is not the end.', 'Try it out!']
As you can see, we were able to split the paragraph into its individual sentences. What’s even more fascinating is that the code knows the difference between a period that ends a sentence and the period in the name Mr. Jones.
By now, you’re probably wondering what tokenization is. Well, a token is the smallest text unit a machine can process, so every chunk of text needs to be tokenized before you can run natural language programs on it. Sometimes, it makes sense for the smallest unit to be either a word or a letter. In the previous section, we tokenized the paragraph into sentences.
For a language like English, it can be easy to tokenize text, especially with nltk to guide us. Here’s how we can tokenize text using just a few lines of code:
msg = "Hey everyone! The party starts in 10mins. Be there ASAP!"
print(msg.split())
Like before, we define a variable called msg (short for message). Then, we run a function called split over this chunk of text and print the results to the console. You should receive an output like this:
['Hey', 'everyone!', 'The', 'party', 'starts', 'in', '10mins.', 'Be', 'there', 'ASAP!']
The split() function is one of the simplest tokenizers. It looks for whitespace as the delimiter (the limit or boundary) and takes the words around it. However, we can take this to the next level with more functions. Type the following:
from nltk.tokenize import word_tokenize, regexp_tokenize
word_tokenize(msg)
1. First, we import word_tokenize and regexp_tokenize from the nltk.tokenize list of functions.
2. Then, we run the word_tokenize() function on msg. This is very similar to the split() function, with one key difference: instead of looking only for whitespace as the delimiter, it also splits off punctuation, treating exclamation points and periods as their own tokens.

This is what your output should look like:
['Hey',
'everyone',
'!',
'The',
'party',
'starts',
'in',
'10mins',
'.',
'Be',
'there',
'ASAP',
'!']
Finally, let’s take a look at the regexp_tokenize function. This is an even more advanced tokenizer that can be customized to fit your needs. Let’s take a look at an example:
regexp_tokenize(msg, pattern=r"\w+")
You might notice that we have an extra parameter in this function called pattern. This is where developers can choose how they want to tokenize the text. \w+ means that we want all words and digits to be kept as tokens, while symbols like punctuation are ignored. This is why our output looks like this:
['Hey',
'everyone',
'The',
'party',
'starts',
'in',
'10mins',
'Be',
'there',
'ASAP']
Now, let’s try a different pattern:
regexp_tokenize(msg, pattern=r"\d+")
Just like before, we have the same function, but with a different pattern: \d+. This tells the tokenizer to keep only the digits, which is why our output contains only the number 10.
['10']
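The pattern parameter accepts any regular expression, so you can tailor the tokenizer however you like. As a quick illustration (the pattern below is my own example, not part of the original exercise), here’s one way you might keep only the capitalized words:

# Hypothetical pattern: keep only tokens that start with a capital letter
regexp_tokenize(msg, pattern=r"[A-Z]\w*")

With msg unchanged from above, this should return ['Hey', 'The', 'Be', 'ASAP'].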
These are the two most common tokenizers you’ll need to clean up your text. Next, let’s move over to stemming, another crucial step in text wrangling.
Stemming is exactly what it sounds like: cutting down a token to its root stem. For instance, take the word “running”. It can be broken down to its root: “run”. However, “run” itself has many variations: runs, ran, etc. With stemming, we can group all the variations of a word under a single root. Let’s look at the code to do this:
from nltk.stem import PorterStemmer
porter = PorterStemmer()
porter.stem("running")
1. First, we import PorterStemmer from the toolkit. There are many algorithms for stemming words, and PorterStemmer uses just one of them. However, I’ve found it to be the most precise since it uses a lot of rules.
2. Then, we create a variable called porter and set it equal to PorterStemmer().
3. Finally, running the stemmer on the word “running” gives us:

'run'
Now, you could skip to the next section, but I’d like to take a moment and show you two more stemmers that use different algorithms. The first is Lancaster stemming. It’s very easy to implement, and its results are close to those of Porter stemming. Here’s a look:
from nltk.stem import LancasterStemmer
lancaster = LancasterStemmer()
lancaster.stem("eating")
You should recognize this code by now. It’s the same as the previous example, only this time we import LancasterStemmer. Running the stemmer on the word “eating” gives us an output of:

'eat'
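To see how a stemmer groups variations of a word, you can loop it over a few related forms. This is just a sketch reusing the porter object from earlier (lancaster works the same way); note that rule-based stemmers generally leave irregular forms like “ran” untouched:

# Sketch: stem a few variations of "run" with the Porter stemmer
for word in ["running", "runs", "ran"]:
    print(word, porter.stem(word))

You should see “running” and “runs” reduced to “run”, while “ran” stays as it is.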
Now, the last stemmer I want to show you is the SnowballStemmer. What makes this stemmer unique is that it’s been trained on many languages and works well for English, German, French, Russian, and many others. Here’s how you implement it. It’s a little different from the previous two stemmers:
from nltk.stem.snowball import SnowballStemmer
snowball = SnowballStemmer("english")
snowball.stem("having")
1. This time, instead of importing from nltk.stem, we import SnowballStemmer from nltk.stem.snowball, since it is another major subsection.
2. Then, we define snowball as our stemmer. However, when we do so, we also specify which language the stemmer should use.
3. Finally, running the stemmer on the word “having” gives us:

'have'
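If you’re not sure whether your language is covered, the class exposes the list of supported languages, so you can check before creating a stemmer:

# SnowballStemmer.languages lists every language the stemmer supports
print(SnowballStemmer.languages)

You could then build a stemmer for another language, for example SnowballStemmer("german").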
Stemming is great for its simplicity in NLP-related tasks. However, if we want to get more complex, stemming won’t be the best technique to use. Instead, this is where lemmatization shines.
Lemmatization is much more advanced than stemming because rather than just following rules, this process also takes into account context and part of speech to determine the lemma, or the root form of the word. Here’s a perfect example to show the difference between lemmatization and stemming:
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
lem = WordNetLemmatizer()
print(lem.lemmatize("ate"))
print(porter.stem("ate"))
1. First, we download wordnet from the toolkit. WordNet is a massive semantic dictionary that’s used to look up the specific lemmas of words.
2. Then, we import WordNetLemmatizer from nltk.stem.
3. We define lem to be the lemmatization function.
4. We lemmatize the word “ate” and print the result to the console.
5. We also use PorterStemmer to stem the same word and print the result to the console.

WordNet is constantly updating, but at the time of writing, this is what my console displayed:
eat
ate
So we can see that, through lemmatization, we can detect the tense of a word and present its simplest, present-tense form, all with a few lines of code. Lemmatization is one of the many powerful techniques in text wrangling.
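One caveat worth knowing: WordNetLemmatizer treats every word as a noun unless you tell it otherwise. If your console shows “ate” for the lemmatizer as well, pass the part of speech explicitly to get the verb lemma:

# Tell the lemmatizer that "ate" is a verb so it maps to its base form
print(lem.lemmatize("ate", pos="v"))

This should print “eat”.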
Finally, we come to the last section of this tutorial: stop word removal. Stop words are commonly used words that are usually ignored because they occur so frequently. Most of these words are articles and prepositions, such as “the”, “a”, “in”, etc.
These words can either end up taking too much space or eating up too much time. Luckily, nltk has a list of stop words in 16 different languages. We can use this list to parse paragraphs of text and remove the stop words from them. Here’s how to do it:
nltk.download('stopwords')
from nltk.corpus import stopwords
list = stopwords.words('english')
paragraph = "This is a long paragraph of text. Sometimes important words like Apple and Machine Learning show up. Other words that are not important get removed."
postPara = [word for word in paragraph.split() if word not in list]
print(postPara)
This is perhaps the most complex code in this tutorial, so I’ll run through it piece-by-piece:
1. First, we download stopwords from the toolkit.
2. Then, we import stopwords from nltk.corpus. A corpus is a large dataset of texts.
3. We create a new list called list and set it to contain all the English stop words.
4. We store our sample text in the variable paragraph.
5. We create a new variable called postPara, which is an array of all the words in paragraph split up, excluding any words that appear in list.
6. Finally, we print postPara to our console:

['This', 'long', 'paragraph', 'text.', 'Sometimes', 'important', 'words', 'like', 'Apple', 'Machine', 'Learning', 'show', 'up.', 'Other', 'words', 'important', 'get', 'removed.']
As you can see, our text is split up into different words, but the stop words are removed, showing you only the words deemed important. Most articles and prepositions are gone!
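You may also notice that “This” survived (its capital T doesn’t match the lowercase stop word “this”) and that tokens like “text.” still carry punctuation. A common refinement, sketched below using the tools we imported earlier (the variable names tokens and cleaned are my own), is to lowercase the text and tokenize it with a word-only pattern before filtering:

# Sketch: lowercase and strip punctuation before removing stop words
tokens = regexp_tokenize(paragraph.lower(), pattern=r"\w+")
cleaned = [word for word in tokens if word not in list]
print(cleaned)

Keep in mind this also lowercases names like “Apple”, so whether it’s the right trade-off depends on your task.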
Text wrangling can be essential in making sure you have the best data to work with. With NLTK, it’s easier than ever to run complex algorithms on your text using only a few lines of code. You can split up your text however you want, weed out the unnecessary parts, and even reduce it to the form that makes the most sense for your computations.
We’ve barely scratched the surface in terms of what can be done with NLTK. I’d suggest taking a look at the official NLTK website. Feel free to leave me a message in the comments if you’ve got a question or need some help! For reference, here is the link to the complete Colab notebook.