Using Natural Language Processing (NLP) for Data Preprocessing — Data Science for Beginners

Maria Asghar
4 min read · May 18, 2024


Data quality is critical in data science. Most of the time, the data we are given is not in good enough condition to support any kind of analysis or prediction, and preprocessing it ensures the accuracy and efficiency of a model. Data preprocessing is the first step in any analysis project. In this article we will discuss the major steps, their implementation, and why each one matters.

Here are the major steps required for text preprocessing using Natural Language Processing (NLP).

  1. Removing URLs
  2. Removing Punctuation
  3. Removing Digits
  4. Tokenize data
  5. Removing Stopwords
  6. Lemmatize tokens

Let's start by loading the required libraries and adding some sample data.

# Import required libraries
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
import string
# Sample data for preprocessing
textual_data = "Natural Language Processing (NLP) is a subfield of artificial intelligence (AI) that \
focuses on the interaction between computers and humans through natural language. It encompasses a \
variety of tasks, including text analysis, language translation, sentiment analysis, and \
information extraction. NLP algorithms process and analyze large volumes of textual data, \
enabling computers to understand, interpret, and generate human language. With advancements in \
machine learning and deep learning techniques, NLP has seen significant progress in recent years, \
leading to applications in diverse domains such as healthcare, finance, and customer service. \
We can remove URLs and digits ( 1 - 9 ) for complete preprocessing.\
For more information on NLP we can check https://www.nltk.org/ ."
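
The nltk functions used below depend on resources that must be downloaded once per environment; without them, word_tokenize, stopwords, and WordNetLemmatizer raise a LookupError. (On newer NLTK versions, the tokenizer resource may be named 'punkt_tab' instead of 'punkt'.)

# One-time downloads of the NLTK resources used in this article
nltk.download('punkt')      # models used by word_tokenize
nltk.download('stopwords')  # stopword lists
nltk.download('wordnet')    # lexical database behind WordNetLemmatizer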

Removing URLs

Let's start by removing URLs with the re (regular expression) library. In textual analysis you rarely need URLs: they carry no meaning from a textual-semantics perspective, so we can eliminate them from our text.

data_without_url = re.sub(r'http\S+', '', textual_data)

print(data_without_url)
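One caveat: the pattern http\S+ only matches links that start with http or https. If your text also contains bare www. links, a slightly broader pattern (a sketch, not the article's original code) catches both:

# Broader pattern: removes both http(s):// and bare www. links
data_without_url = re.sub(r'(https?://\S+|www\.\S+)', '', textual_data)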

Removing Punctuation

Moving on to the next step, let's remove punctuation. Punctuation marks such as periods, commas, and exclamation marks serve grammatical purposes but carry little semantic meaning, so we can remove them as well.

data_without_punctuation = data_without_url.translate(str.maketrans('', '', string.punctuation))

print(data_without_punctuation)
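For context, str.maketrans('', '', string.punctuation) builds a translation table that maps every ASCII punctuation character to None, and translate applies it in a single pass. As a sketch of an alternative, a regular expression does the same job for this sample text (note that \w keeps underscores, so the two approaches are not identical in general):

# Regex alternative: drop every character that is not a word character or whitespace
data_without_punctuation = re.sub(r'[^\w\s]', '', data_without_url)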

Removing Digits

Digits, like punctuation, are non-linguistic symbols that may not add much value to text analysis tasks. Removing them ensures that numerical values don’t interfere with textual analysis and allows the focus to remain on patterns and content.

# Remove digits using a regular expression
cleaned_text = re.sub(r'\d+', '', data_without_punctuation)

print(cleaned_text)
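The three cleaning steps so far are plain string transformations, so they compose naturally. Here is a minimal sketch of a single helper that performs all three (the function name clean_text is my own, not from any library):

def clean_text(text):
    """Remove URLs, punctuation, and digits from a string."""
    text = re.sub(r'http\S+', '', text)                               # remove URLs
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove punctuation
    return re.sub(r'\d+', '', text)                                   # remove digits

cleaned_text = clean_text(textual_data)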

Tokenize data

The next step is tokenization. Tokenization breaks the text down into smaller units, such as words or phrases, making it easier to analyze and process. We will tokenize our text using nltk's word_tokenize.

tokens = nltk.word_tokenize(cleaned_text)

print(tokens)
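Tokenization is not limited to words. nltk also provides sent_tokenize for splitting text into sentences, which is useful when later steps work at the sentence level. It is shown here on the original text, since the cleaned text no longer contains the punctuation that sentence splitting relies on:

sentences = nltk.sent_tokenize(textual_data)
print(len(sentences))  # number of sentences detected
print(sentences[0])    # first sentence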

Removing Stopwords

Stopwords are common words like “the,” “is,” and “and” that occur frequently in language but often carry little semantic meaning. Removing them helps reduce noise in the text. We will remove them using the nltk library.

# Get the list of stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print(filtered_tokens)
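Because the stopword list is an ordinary Python set, it can be extended with domain-specific filler words. A small sketch (the extra words here are illustrative choices, not a standard list):

# Extend the built-in list with corpus-specific words (illustrative)
custom_stop_words = stop_words | {'nlp', 'via', 'etc'}
filtered_tokens = [word for word in tokens if word.lower() not in custom_stop_words]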

Lemmatize tokens

Lemmatization reduces words to their base or root form, ensuring consistency in text representation. For example, “running,” “ran,” and “runs” would all be lemmatized to “run.” This process helps standardize the vocabulary and improves the performance of downstream tasks like text classification or information retrieval.

We will use nltk's WordNetLemmatizer() for this purpose (note the part-of-speech caveat after the code).

lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(word.lower()) for word in filtered_tokens]

print(lemmatized_tokens)
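
One caveat worth knowing: WordNetLemmatizer treats every word as a noun unless a part-of-speech tag is supplied, so verb forms such as “running” and “ran” pass through unchanged in the code above. Passing pos='v' handles verbs:

print(lemmatizer.lemmatize('running'))           # 'running' (treated as a noun)
print(lemmatizer.lemmatize('running', pos='v'))  # 'run'
print(lemmatizer.lemmatize('ran', pos='v'))      # 'run'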

In data science, Natural Language Processing (NLP) plays an important role in preprocessing textual data to improve its quality and make it ready for analysis. Using techniques such as tokenization, stopword removal, and lemmatization, raw text can be transformed into a structured format suitable for machine learning and data analysis tasks. NLP makes it possible to extract meaningful insights from unstructured text, letting data scientists perform sentiment analysis, text classification, information extraction, and other advanced analytics. By leveraging NLP for data preprocessing, data scientists can handle the unique challenges posed by natural language data and unlock the full potential of text-based datasets in their analyses and predictive models.
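
To tie everything together, here is a minimal end-to-end sketch of the six steps from this article wrapped in one function (the name preprocess is my own):

def preprocess(text):
    """Apply the full pipeline: remove URLs, punctuation, and digits,
    then tokenize, drop stopwords, and lemmatize."""
    text = re.sub(r'http\S+', '', text)                               # 1. remove URLs
    text = text.translate(str.maketrans('', '', string.punctuation))  # 2. remove punctuation
    text = re.sub(r'\d+', '', text)                                   # 3. remove digits
    tokens = word_tokenize(text)                                      # 4. tokenize
    stop_words = set(stopwords.words('english'))
    tokens = [t for t in tokens if t.lower() not in stop_words]       # 5. remove stopwords
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(t.lower()) for t in tokens]          # 6. lemmatize

print(preprocess(textual_data))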

References

  1. NLTK documentation https://www.nltk.org/
  2. Python Documentation https://www.python.org/
