August 30, 2024
A guest post from Fabrício Ceolin, DevOps Engineer at Comet. Inspired by the growing demand…
Natural language refers to the medium we humans use to communicate with each other, and processing simply means the conversion of data into a readable form. In short, natural language processing is a way to provide computers with the ability to understand and communicate in human language.
NLP is a branch of AI that takes text data as input and produces models that can understand and generate insights from new text data. One of the most important steps in creating these models is converting raw text into a clean version that contains only useful information. In this blog, we will look at some techniques to properly clean text data for natural language processing.
It is important to apply the steps in the order described below; otherwise, you could end up losing a lot of useful data.
It is very common for text data to contain words in a particular capitalization style such as camel case, title case, or sentence case, or simply mis-capitalized words (e.g., pYthOn). Both create problems during analysis, so it is important to normalize the text to lowercase.
text = 'Python PROGRAMMING LanGUage.'
text.lower()
------------------
python programming language.
Text data collected from the web often contains extra spaces between words as well as before and after a sentence. It is important to remove these before applying any other text processing or cleaning technique.
import regex as re

doc = ' python  programming     language '
re.sub(r"\s+", " ", doc).strip()
------------------------
python programming language
Unwanted data refers to parts of the text that add no value to analysis and model building, for example hashtags, HTML tags, mentions, emails, URLs, phone numbers, or certain special combinations of characters. We can remove these completely from our text data or replace them with a representative token.
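For instance, mentions and hashtags in social media text can be stripped with a simple pattern of their own. The snippet below is a minimal sketch (the example sentence is made up); the empty replacement string can be swapped for a placeholder token if you prefer to keep a representative word:

import regex as re

doc = 'Loved the new update! Thanks @devteam #python #nlp'
re.sub(r'[@#]\w+', '', doc)
-----------------------
Loved the new update! Thanks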
HTML Tags
HTML tags start with a <, followed by the tag name, and end with a >.
import regex as re

doc = '<p> Food is very good and <b>cheap</b>.</p>'
re.sub('<.*?>', '', doc)
-------------------
Food is very good and cheap.
Emails
Gmail is one of the most popular and commonly used email service providers. An email address usually starts with a personalized name, often followed by digits or special symbols, then an @, and ends with the provider's domain, like dazzleninja_44@gmail.com.
import regex as re

doc = 'you can contact me on my work email dazzleninja_44@gmail.com for any queries.'
re.sub(r'([a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+)', "", doc)
---------------------
you can contact me on my work email  for any queries.
"""
[a-z0-9+._-]+   # username
@               # separator
[a-z0-9+._-]+   # mail server / domain
\.              # dot before the top-level domain
[a-z0-9+_-]+    # top-level domain
"""
URLs
A generic URL contains a protocol, subdomain, domain name, top-level domain, and directory path.
import regex as re

doc = 'follow my medium profile at https://medium.com/@abhayparashar31 and subscribe to my email list at https://abhayparashar31.medium.com/subscribe'
re.sub(r'(http|https|ftp|ssh)://([\w_-]+(?:(?:\.[\w_-]+)+))([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?', '', doc)
--------------------
follow my medium profile at  and subscribe to my email list at
"""
(http|https|ftp|ssh)                   # protocol
://
([\w_-]+(?:(?:\.[\w_-]+)+))            # domain name (with subdomains)
([\w.,@?^=%&:/~+#-]*[\w@?^=%&/~+#-])?  # directory path and query parameters
"""
Accented Characters
Accent marks are symbols placed over letters, especially vowels, to indicate the pronunciation of a word. These characters cause problems in analysis by unnecessarily inflating the vocabulary size.
For example, résumé and resume are two different words to our model, even though they carry the same meaning. Accented characters usually appear when you collect data from the web or from a multilingual source.
import unicodedata

doc = 'résumé length is good. resume font is bad.'
unicodedata.normalize('NFKD', doc).encode('ascii', 'ignore').decode('utf-8', 'ignore')
-----------------------
resume length is good. resume font is bad.
An abbreviation is a shortened form of a word or phrase, for example TTYL: talk to you later. Abbreviations are especially common in social media datasets. It is important to replace them with their full forms, otherwise our model will not be able to learn proper patterns from the data. You can find a JSON file with the most common abbreviations and their full forms here on my Github profile.
x = "it'd've better if less food oil is added."import json abbreviations = json.load(open('PATH')) for key in abbreviations: if key in x: x = x.replace(key,abbreviations[key]) print(x)
Special symbols are characters that are neither letters nor digits, such as punctuation marks, currency symbols, and accent marks. They don't add any value while modeling, so it is important to remove them from the text.
import regex as re

doc = 'Congrats!, David You have won 1000$.'
re.sub(r'[^\w ]+', "", doc)
-----------------------
Congrats David You have won 1000
Stopwords are common words that add little value to a sentence. For the purpose of analyzing text and building NLP models, they contribute very little signal, so it is a best practice to remove them before moving on to vectorization. Some of the most common English stopwords are: the, is, for, when, to, at, etc.
There are many ways to remove stopwords; one of the simplest is to use the NLTK library.
import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

doc = 'this is one of the best action movie i have ever watched.'
english_stopwords = set(stopwords.words('english'))
cleaned_doc = ' '.join([word for word in doc.split() if word not in english_stopwords])
print(cleaned_doc)
------------------------
one best action movie ever watched.
Stemming is the process of reducing a word to its root by stripping its suffixes. Stemming will reduce 'Learning,' 'Learns,' and 'Learned' to the root 'Learn.' The NLTK library offers many stemmers, but the Porter Stemmer and its improved successor, the Snowball Stemmer, are the most widely used.
# nltk.download('punkt')
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer

doc = 'learning learn learned learns'
ps = PorterStemmer()
text = " ".join([ps.stem(word) for word in word_tokenize(doc)])
print(text)
------------------
learn learn learn learn
Lemmatization is similar to stemming, but instead of blindly chopping off endings it takes the morphological analysis of a word into account and maps it to a proper dictionary form. This allows it to distinguish, for example, between present- and past-tense forms of a word.
# nltk.download('wordnet')
# nltk.download('omw-1.4')
from nltk.stem import WordNetLemmatizer

doc = 'history always repeat itself.'
lemmatizer = WordNetLemmatizer()
text = " ".join([lemmatizer.lemmatize(word) for word in word_tokenize(doc)])
print('Lemmatization: ', text)
print('Stemming: ', " ".join([ps.stem(word) for word in word_tokenize(doc)]))
----------------
Lemmatization:  history always repeat itself.
Stemming:  histori alway repeat itself.
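By default, the WordNet lemmatizer treats every word as a noun; passing a part-of-speech tag is what makes the tense handling visible. A small illustration (assuming the wordnet data above is already downloaded):

print(lemmatizer.lemmatize('repeated', pos='v'))
print(lemmatizer.lemmatize('repeating', pos='v'))
print(lemmatizer.lemmatize('repeats', pos='v'))
----------------
repeat
repeat
repeat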
As a quick recap of the article, the initial step of text cleaning is normalization, which converts text to lowercase. The next steps use regular expressions to remove unwanted data from the text, either deleting it or replacing it with a representative token. The text cleaning process ends by removing stopwords and reducing words to their base forms with stemming or lemmatization.
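For reference, here is a rough sketch of how these cleaning steps could be chained into a single helper. The function name, example sentence, and exact regex patterns are illustrative rather than a canonical pipeline, and it assumes the NLTK stopwords data has already been downloaded:

import unicodedata
import regex as re
from nltk.corpus import stopwords

english_stopwords = set(stopwords.words('english'))

def clean_text(doc):
    doc = doc.lower()                                                      # normalize case
    doc = re.sub(r'<.*?>', '', doc)                                        # HTML tags
    doc = re.sub(r'(http|https|ftp|ssh)://\S+', '', doc)                   # URLs
    doc = re.sub(r'[a-z0-9+._-]+@[a-z0-9+._-]+\.[a-z0-9+_-]+', '', doc)    # emails
    doc = unicodedata.normalize('NFKD', doc).encode('ascii', 'ignore').decode('utf-8', 'ignore')  # accents
    doc = re.sub(r'[^\w ]+', '', doc)                                      # special symbols
    doc = re.sub(r'\s+', ' ', doc).strip()                                 # extra spaces
    return ' '.join(w for w in doc.split() if w not in english_stopwords)  # stopwords

print(clean_text('<p>My résumé is at https://example.com, email me at test_01@gmail.com!</p>'))

Stemming or lemmatization can then be applied to the cleaned output as a final step.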