Using Natural Language Processing (NLP) for Data Preprocessing — Data Science for Beginners

Maria Asghar
4 min read · May 18, 2024


Data quality is critical in data science. Most of the time, the data we are given is not in good enough condition to support any kind of analysis or prediction, and preprocessing it ensures the accuracy and efficiency of a model. Data preprocessing is the first step in any analysis project. In this article we will discuss the major steps, their implementation, and why each one matters.

Here are the major steps required for text preprocessing using Natural Language Processing (NLP).

  1. Removing URLs
  2. Removing Punctuation
  3. Removing Digits
  4. Tokenize data
  5. Removing Stopwords
  6. Lemmatize tokens

Let's start by loading the required libraries and adding some sample data.

# Import required libraries
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string
# Sample data for preprocessing
textual_data = "Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that \
focuses on the interaction between computers and humans through natural language. It encompasses a \
variety of tasks, including text analysis, language translation, sentiment analysis, and \
information extraction. NLP algorithms process and analyze large volumes of textual data, \
enabling computers to understand, interpret, and generate human language. With advancements in \
machine learning and deep learning techniques, NLP has seen significant progress in recent years, \
leading to applications in diverse domains such as healthcare, finance, and customer service. \
We can remove URLs and digits ( 1 - 9 ) for complete preprocessing.\
For more information on NLP we can check https://www.nltk.org/ ."
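
The nltk functions used below depend on resources that must be downloaded once per environment; without them, word_tokenize, stopwords, and WordNetLemmatizer raise a LookupError. (On newer NLTK versions, the tokenizer resource may be named 'punkt_tab' instead of 'punkt'.)

# One-time downloads of the NLTK resources used in this article
nltk.download('punkt')      # models used by word_tokenize
nltk.download('stopwords')  # stopword lists
nltk.download('wordnet')    # lexical database behind WordNetLemmatizer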

Removing URLs

Let's start by removing URLs with the re (regular expression) library. In textual analysis you rarely need URLs: they carry no meaning from a textual-semantics perspective, so we can eliminate them from our text.

data_without_url = re.sub(r'http\S+', '', textual_data)

print(data_without_url)
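One caveat: the pattern http\S+ only matches links that start with http or https. If your text also contains bare www. links, a slightly broader pattern (a sketch, not the article's original code) catches both:

# Broader pattern: removes both http(s):// and bare www. links
data_without_url = re.sub(r'(https?://\S+|www\.\S+)', '', textual_data)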

Removing Punctuation

Moving on to the next step, let's remove punctuation. Punctuation marks such as periods, commas, and exclamation marks serve grammatical purposes but carry little semantic meaning, so we can remove them as well.

data_without_punctuation = data_without_url.translate(str.maketrans('', '', string.punctuation))

print(data_without_punctuation)
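For context, str.maketrans('', '', string.punctuation) builds a translation table that maps every ASCII punctuation character to None, and translate applies it in a single pass. As a sketch of an alternative, a regular expression does the same job for this sample text (note that \w keeps underscores, so the two approaches are not identical in general):

# Regex alternative: drop every character that is not a word character or whitespace
data_without_punctuation = re.sub(r'[^\w\s]', '', data_without_url)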

Removing Digits

Digits, like punctuation, are non-linguistic symbols that may not add much value to text analysis tasks. Removing them ensures that numerical values don’t interfere with textual analysis and allows the focus to remain on patterns and content.

# Remove digits using a regular expression
cleaned_text = re.sub(r'\d+', '', data_without_punctuation)

print(cleaned_text)
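The three cleaning steps so far are plain string transformations, so they compose naturally. Here is a minimal sketch of a single helper that performs all three (the function name clean_text is my own, not from any library):

def clean_text(text):
    """Remove URLs, punctuation, and digits from a string."""
    text = re.sub(r'http\S+', '', text)                               # remove URLs
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    return re.sub(r'\d+', '', text)                                   # remove digits

cleaned_text = clean_text(textual_data)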

Tokenize data

The next step is tokenization. Tokenization breaks the text down into smaller units, such as words or phrases, making it easier to analyze and process. We will tokenize our text using nltk's word_tokenize.

tokens = nltk.word_tokenize(cleaned_text)

print(tokens)
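Tokenization is not limited to words. nltk also provides sent_tokenize for splitting text into sentences, which is useful when later steps work at the sentence level. It is shown here on the original text, since the cleaned text no longer contains the punctuation that sentence splitting relies on:

sentences = nltk.sent_tokenize(textual_data)
print(len(sentences))  # number of sentences detected
print(sentences[0])    # first sentence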

Removing Stopwords

Stopwords are common words like “the,” “is,” and “and” that occur frequently in language but often carry little semantic meaning. Removing them helps reduce noise in the text. We will remove them using the nltk library.

# Get the list of stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print(filtered_tokens)
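Because the stopword list is an ordinary Python set, it can be extended with domain-specific filler words. A small sketch (the extra words here are illustrative choices, not a standard list):

# Extend the built-in list with corpus-specific words (illustrative)
custom_stop_words = stop_words | {'nlp', 'via', 'etc'}
filtered_tokens = [word for word in tokens if word.lower() not in custom_stop_words]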

Lemmatize tokens

Lemmatization reduces words to their base or root form, ensuring consistency in text representation. For example, “running,” “ran,” and “runs” would all be lemmatized to “run.” This process helps standardize the vocabulary and improves the performance of downstream tasks like text classification or information retrieval.

We will use nltk's WordNetLemmatizer() for this purpose (note the part-of-speech caveat after the code).

lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word.lower()) for word in filtered_tokens]

print(lemmatized_tokens)
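
One caveat worth knowing: WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is supplied, so verb forms such as “running” and “ran” pass through unchanged in the code above. Passing pos='v' handles verbs:

print(lemmatizer.lemmatize('running'))           # 'running' (treated as a noun)
print(lemmatizer.lemmatize('running', pos='v'))  # 'run'
print(lemmatizer.lemmatize('ran', pos='v'))      # 'run'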

In data science, Natural Language Processing (NLP) plays an important role in preprocessing textual data to improve its quality and make it ready for analysis. Using techniques such as tokenization, stopword removal, and lemmatization, raw text can be transformed into a structured format suitable for machine learning and data analysis tasks. NLP makes it possible to extract meaningful insights from unstructured text, letting data scientists perform sentiment analysis, text classification, information extraction, and other advanced analytics. By leveraging NLP for data preprocessing, data scientists can handle the unique challenges posed by natural language data and unlock the full potential of text-based datasets in their analyses and predictive models.
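
To tie everything together, here is a minimal end-to-end sketch of the six steps from this article wrapped in one function (the name preprocess is my own):

def preprocess(text):
    """Apply the full pipeline: remove URLs, punctuation, and digits,
    then tokenize, drop stopwords, and lemmatize."""
    text = re.sub(r'http\S+', '', text)                               # 1. remove URLs
    text = text.translate(str.maketrans('', '', string.punctuation))  # 2. remove punctuation
    text = re.sub(r'\d+', '', text)                                   # 3. remove digits
    tokens = word_tokenize(text)                                      # 4. tokenize
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t.lower() not in stop_words]       # 5. remove stopwords
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t.lower()) for t in tokens]          # 6. lemmatize

print(preprocess(textual_data))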

References

  1. NLTK documentation https://www.nltk.org/
  2. Python Documentation https://www.python.org/
