Natural Language Processing (NLP) with Python’s Natural Language Toolkit (NLTK) Package

Gayathri siva
7 min read · Jun 15, 2021


Natural Language Processing is an automated way to understand and analyze natural human language and extract information from such data by applying machine learning algorithms. Popular NLP libraries include NLTK, spaCy, Stanford CoreNLP, TextBlob, Gensim, and others.

Why NLP?

— Analyzing tons of data

— Identifying various languages and dialects

— Applying quantitative analysis

— Handling ambiguities

What is NLTK?

The Natural Language Toolkit (NLTK) is a platform for building Python programs that work with human language data, particularly for statistical Natural Language Processing (NLP).

Installing NLTK in Windows/Linux:

Using Pip — pip3 install nltk

Using Conda — conda install nltk

To check if NLTK is installed properly, just type import nltk in your IDE or Python console. It should run without any error; if it does not, revisit the installation steps above.

In your IDE or Python console, after importing, type nltk.download() on the next line and run it. An installation window will pop up. Select ‘all’ and click ‘Download’ to download and install the additional bundles. This will download all the dictionaries and other language and grammar data files necessary for full NLTK functionality. NLTK fully supports the English language, but others like Spanish or French are not supported as extensively.

NLTK Dataset: The NLTK module has many datasets available that you need to download before use. More technically, each of these is called a corpus. Some examples are stopwords, gutenberg, framenet_v15, large_grammars, and so on.
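If you prefer not to use the pop-up window, you can also download individual resources by name. Below is a minimal sketch that fetches just the resources used by the examples in this article (the names are NLTK’s standard resource identifiers):

import nltk

# Download only what this article's examples need, instead of "all".
for resource in [
    "punkt",                       # tokenizer models for word_tokenize / sent_tokenize
    "stopwords",                   # the stop word lists
    "wordnet",                     # dictionary backing WordNetLemmatizer
    "averaged_perceptron_tagger",  # model behind pos_tag
    "maxent_ne_chunker",           # model behind ne_chunk
    "words",                       # word list used by the NE chunker
]:
    nltk.download(resource)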

NLP Terminologies:

  • Tokenization
  • Stemming
  • Lemmatization
  • Stop Words
  • Parts of Speech
  • Named Entity Recognition

Tokenization

Tokenization is the process by which a large quantity of text is divided into smaller parts called tokens. It works by separating words using spaces and punctuation.

Uses:

— Break a complex sentence into words.

— Understand the importance of each of the words with respect to the sentence.

— Produce a structural description of an input sentence.

Two types of Tokenization

  • Tokenizing by word or Word Tokenization
  • Tokenizing by sentence or Sentence Tokenization
from nltk import word_tokenize, sent_tokenize

sent = "I will walk 500 miles. And I would walk 500 more, just to be the man who walks a thousand miles to fall down at your door."
print(word_tokenize(sent))
print(sent_tokenize(sent))

Output:
['I', 'will', 'walk', '500', 'miles', '.', 'And', 'I', 'would', 'walk', '500', 'more', ',', 'just', 'to', 'be', 'the', 'man', 'who', 'walks', 'a', 'thousand', 'miles', 'to', 'fall', 'down', 'at', 'your', 'door', '.']
['I will walk 500 miles.', 'And I would walk 500 more, just to be the man who walks a thousand miles to fall down at your door.']

Bigrams, Trigrams, Ngrams of Tokenization

Bigrams are tokens of two consecutive written words, trigrams are tokens of three consecutive written words, and n-grams are tokens of any number of consecutive written words.

from nltk.util import ngrams

n = 2  # n=2 gives bigrams; n=3 gives trigrams; n=4, 5, 6, ... give larger n-grams
sentence = 'Whoever is happy will make others happy too'
grams = ngrams(sentence.split(), n)

for item in grams:
    print(item)
Output: (n = 2 Bigrams)
('Whoever', 'is')
('is', 'happy')
('happy', 'will')
('will', 'make')
('make', 'others')
('others', 'happy')
('happy', 'too')
Output: (n = 3 Trigrams)
('Whoever', 'is', 'happy')
('is', 'happy', 'will')
('happy', 'will', 'make')
('will', 'make', 'others')
('make', 'others', 'happy')
('others', 'happy', 'too')
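NLTK also ships bigrams() and trigrams() as convenience wrappers around ngrams(), which read a little more clearly when n is fixed. A quick sketch using the same sentence (both functions are importable from the top-level nltk package):

from nltk import bigrams, trigrams

sentence = 'Whoever is happy will make others happy too'
# Each wrapper returns a generator of tuples, just like ngrams().
print(list(bigrams(sentence.split())))
print(list(trigrams(sentence.split())))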

Stemming

Stemming normalizes words into their base form or root form.

Common types of stemming algorithms are Porter, Lancaster, and Snowball.
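These algorithms can disagree on the same input; Lancaster in particular tends to be the most aggressive. A small comparison sketch (all three classes live in nltk.stem; the example words are arbitrary):

from nltk.stem import PorterStemmer, LancasterStemmer, SnowballStemmer

porter = PorterStemmer()
lancaster = LancasterStemmer()
snowball = SnowballStemmer("english")

for word in ["waiting", "maximum", "generously"]:
    # Print each stemmer's result side by side for comparison.
    print(word, porter.stem(word), lancaster.stem(word), snowball.stem(word))

Sticking with Porter, basic usage looks like this: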

from nltk.stem import PorterStemmer

e_words = ["wait", "waiting", "waited", "waits"]
ps = PorterStemmer()
for w in e_words:
    rootWord = ps.stem(w)
    print(rootWord)  # all four forms reduce to the same root: "wait"

Stemming is considered an important preprocessing step because it removes redundancy in the data and variations of the same word. As a result, the data is filtered, which helps in better machine training.

Now we pass a complete sentence and check its behavior in the output.

sent2 = "You have to build a very good site and I love visiting your site."
token = word_tokenize(sent2)
stemmed = ""
for word in token:
stemmed += stemmer.stem(word) + " "
print(stemmed)
output:
you have build a veri good site and I love visit your site

Lemmatization

Lemmatization is the process of converting the words of a sentence to their dictionary form.

It usually refers to the morphological analysis of words, which aims to remove inflectional endings.

The NLTK lemmatization method is based on WordNet’s built-in morphy function. Text preprocessing includes both stemming and lemmatization. Many people find the two terms confusing; some treat them as the same, but there is a difference between stemming and lemmatization.

Stemming vs Lemmatization

Stemming:

import nltk
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print(porter_stemmer.stem(w))
Output:
studi
studi
cri
cri
Lemmatization:

import nltk
from nltk.stem import WordNetLemmatizer

wordnet_lemmatizer = WordNetLemmatizer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print(wordnet_lemmatizer.lemmatize(w))
Output:
study
studying
cry
cry

If you look at the stemming output for studies and studying, it is the same (studi), but the NLTK lemmatizer provides a different lemma for each token: study for studies and studying for studying. So when we need to build a feature set to train a model, lemmatization is often preferred.
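The reason studying comes back unchanged above is that lemmatize() treats every word as a noun unless told otherwise; passing the part of speech changes the result. A small sketch (pos="v" marks the word as a verb):

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize("studying"))           # 'studying' (default pos is noun)
print(lemmatizer.lemmatize("studying", pos="v"))  # 'study' (analyzed as a verb)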

Use case:

A lemmatizer minimizes text ambiguity.

It reduces the word density in the given text and helps in preparing accurate features for training the model.

The NLTK lemmatizer will also save memory as well as computational cost.

Stop Words

A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that does not add much meaning to a sentence. Stop words can safely be ignored without sacrificing the meaning of the sentence.

from nltk.corpus import stopwords

stop_words = stopwords.words('english')
sent = "I will walk 500 miles and I would walk 500 more, just to be the man who walks a thousand miles to fall down at your door."
token = word_tokenize(sent)
cleaned_token = []
for word in token:
    if word not in stop_words:
        cleaned_token.append(word)
print("This is the unclean version:", token)
print("This is the cleaned version:", cleaned_token)
output:
This is the unclean version: ['I', 'will', 'walk', '500', 'miles', 'and', 'I', 'would', 'walk', '500', 'more', ',', 'just', 'to', 'be', 'the', 'man', 'who', 'walks', 'a', 'thousand', 'miles', 'to', 'fall', 'down', 'at', 'your', 'door', '.']
This is the cleaned version: ['I', 'walk', '500', 'miles', 'I', 'would', 'walk', '500', ',', 'man', 'walks', 'thousand', 'miles', 'fall', 'door', '.']
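Notice that 'I' survives the cleaning: the stopword list is all lowercase (it contains 'i'), while the token is uppercase. Lowercasing each token before the membership test fixes this; a small sketch reusing the token and stop_words variables from above:

# Compare case-insensitively so tokens like 'I' are filtered out too.
cleaned_token = [word for word in token if word.lower() not in stop_words]
print(cleaned_token)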

Parts of Speech (POS)

POS tagging marks each word in the corpus with a corresponding part-of-speech tag, based on the word’s context and definition.

POS tagging is used in text analysis tools and in corpus searches.

from nltk import pos_tag

sent = "I will walk 500 miles and I would walk 500 more, just to be the man who walks a thousand miles to fall down at your door!"
token = word_tokenize(sent)
tagged = pos_tag(token)
print(tagged)
output:
[('I', 'PRP'), ('will', 'MD'), ('walk', 'VB'), ('500', 'CD'), ('miles', 'NNS'), ('and', 'CC'), ('I', 'PRP'), ('would', 'MD'), ('walk', 'VB'), ('500', 'CD'), ('more', 'JJR'), (',', ','), ('just', 'RB'), ('to', 'TO'), ('be', 'VB'), ('the', 'DT'), ('man', 'NN'), ('who', 'WP'), ('walks', 'VBZ'), ('a', 'DT'), ('thousand', 'NN'), ('miles', 'NNS'), ('to', 'TO'), ('fall', 'VB'), ('down', 'RP'), ('at', 'IN'), ('your', 'PRP$'), ('door', 'NN')]

The pos_tag() method takes in a list of tokenized words and tags each of them with a corresponding part-of-speech identifier, returning a list of tuples. For example, VB refers to ‘verb’, NNS refers to ‘plural noun’, and DT refers to ‘determiner’.

Some examples of NLTK POS tags (from the Penn Treebank tag set) are below:

  • CC: coordinating conjunction
  • CD: cardinal number
  • DT: determiner
  • IN: preposition or subordinating conjunction
  • JJ: adjective
  • MD: modal
  • NN: noun, singular
  • NNS: noun, plural
  • NNP: proper noun, singular
  • PRP: personal pronoun
  • RB: adverb
  • VB: verb, base form
  • VBZ: verb, third-person singular present
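You can also look up any tag’s definition from within NLTK itself; this needs the 'tagsets' resource, as in the quick sketch below:

import nltk

nltk.download('tagsets')        # documentation for the Penn Treebank tag set
nltk.help.upenn_tagset('VB')    # prints the definition and examples for VB
nltk.help.upenn_tagset('NN.*')  # a regex pattern matches a family of tags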

Named Entity Recognition (NER)

Named entity recognition takes a string of text as input and identifies important named entities in the text, such as people, places, organizations, dates, or any other category.

Before NER tagging, we first process the sentence with word tokenization and POS tagging; the POS-tagged content is then taken as the input to NER tagging.

from nltk import word_tokenize, pos_tag, ne_chunk

sentence = "Avengers: Endgame is a 2019 American superhero film based on the Marvel Comics superhero team the Avengers, produced by Marvel Studios and distributed by Walt Disney Studios Motion Pictures. The movie features an ensemble cast including Robert Downey Jr., Chris Evans, Mark Ruffalo, Chris Hemsworth, and others. (Source: wikipedia)."
tokens = word_tokenize(sentence)
pos_tags = pos_tag(tokens)
named_entities = ne_chunk(pos_tags)
print(named_entities)

Output:
(S
Avengers/NNS
:/:
Endgame/NN
is/VBZ
a/DT
2019/JJ
(GPE American/JJ)
superhero/NN
film/NN
based/VBN
on/IN
the/DT
(ORGANIZATION Marvel/NNP Comics/NNP)
superhero/NN
team/NN
the/DT
(ORGANIZATION Avengers/NNPS)
,/,
produced/VBN
by/IN
(PERSON Marvel/NNP Studios/NNP)
and/CC
distributed/VBN
by/IN
(PERSON Walt/NNP Disney/NNP Studios/NNP)
Motion/NNP
Pictures/NNP
./.
The/DT
movie/NN
features/VBZ
an/DT
ensemble/JJ
cast/NN
including/VBG
(PERSON Robert/NNP Downey/NNP Jr./NNP)
,/,
(PERSON Chris/NNP Evans/NNP)
,/,
(PERSON Mark/NNP Ruffalo/NNP)
,/,
(PERSON Chris/NNP Hemsworth/NNP)
,/,
and/CC
others/NNS
./.
(/(
(PERSON Source/NN)
:/:
wikipedia/NN
)/)
./.)

Here’s the list of named entity types from the NLTK book:

  • ORGANIZATION: Georgia-Pacific Corp., WHO
  • PERSON: Eddy Bonte, President Obama
  • LOCATION: Murray River, Mount Everest
  • DATE: June, 2008-06-29
  • TIME: two fifty a m, 1:30 p.m.
  • MONEY: 175 million Canadian Dollars, GBP 10.40
  • PERCENT: twenty pct, 18.75 %
  • FACILITY: Washington Monument, Stonehenge
  • GPE: South East Asia, Midlothian
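The ne_chunk() output is an nltk.Tree whose children are plain (token, tag) tuples, except where an entity was found, in which case the child is a subtree labelled with the entity type. A small sketch of walking the tree to collect (type, name) pairs (extract_entities is a hypothetical helper name, not an NLTK function):

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk.tree import Tree

def extract_entities(text):
    """Collect (entity_type, entity_text) pairs from ne_chunk's output."""
    entities = []
    for node in ne_chunk(pos_tag(word_tokenize(text))):
        if isinstance(node, Tree):  # subtrees are the chunked named entities
            name = " ".join(token for token, tag in node.leaves())
            entities.append((node.label(), name))
    return entities

print(extract_entities("Robert Downey Jr. starred in a film distributed by Walt Disney Studios."))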
