
What Is The NLTK Tool And How Is It Used In Natural Language Processing (NLP)?

Natural Language Processing (NLP) is the practice of manipulating or understanding text or speech with software or machines.

More specifically, it allows machines and humans to communicate in a shared language, so that a device can respond much as a person would in everyday conversation.

In general, natural language processing refers to a machine’s ability to understand human language and respond in place of a human.

Today, there are various tools on the market for this purpose, each offering different capabilities, but NLTK is one of the most powerful options in this field.

What is NLTK?

The NLTK library (Natural Language Toolkit) is a collection of libraries and programs for statistical language processing. As mentioned, it is one of the most powerful options in this field, and it includes packages that help machines understand human language and respond to it.

Natural Language Processing (NLP) is a field that focuses on making natural human language usable by computer programs. Meanwhile, NLTK, or Natural Language Toolkit, is a Python package you can use in connection with natural language processing.

Much of the data you can analyze is unstructured and contains human-readable text. Before you can analyze that data programmatically, you must first preprocess it. This article will explore how to do this and the text preprocessing tasks you can do with NLTK. We’ll also explore how you can do some fundamental text analysis and data visualization.

If you are familiar with the basics of using Python and want to enter the world of natural language processing, this article is written for you.

At the end of this tutorial, you will be able to:

  • Find a text to analyze
  • Prepare and process your text for analysis
  • Analyze your text

Getting started with NLTK Python

The first thing you need to do is make sure you have Python installed. For this tutorial, we will use Python 3.9. If you haven’t installed Python yet, go to the official Python website, download the Python 3 installer, and run it. The process is straightforward.

Once you install Python, the next step is to install NLTK with pip. For this tutorial, install version 3.5 as follows:

$ python -m pip install nltk==3.5

We also need to install NumPy and Matplotlib to perform visualization in the context of named entity detection:

$ python -m pip install numpy matplotlib
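NLTK’s tokenizers also rely on data packages that are downloaded separately from the library itself. If a tokenizer raises a LookupError later in this tutorial, a one-time download like this minimal sketch should fix it (“punkt” is the data package used by NLTK’s word and sentence tokenizers):

>>> import nltk
>>> nltk.download("punkt")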

Tokenization

Tokenization lets you work with smaller chunks of text that are still coherent and meaningful, even outside the context of the full text. By tokenizing, you can conveniently split text by word or by sentence. This is the first step in turning unstructured data into structured data, which is easier to analyze.

When you analyze text, you will tokenize by word and by sentence. Here is what both kinds of tokenization give you:

Tokenizing by word: Words are like the atoms of natural language. They are the smallest units of meaning that still make sense on their own. Tokenizing your text by word allows you to identify words that come up frequently. For example, if you were analyzing a group of job postings, you might notice that the word “Python” comes up often. That could suggest high demand for Python knowledge, but you would need to dig deeper to know more.

Tokenizing by sentence: When you tokenize by sentence, you can analyze how words relate to one another and gain more context. For example, is there a lot of negative language around the word “Python” because hiring managers don’t like Python? Do the surrounding terms come from software development, or from an entirely different domain, suggesting a different kind of “Python” than you expected?
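As a quick preview of the job-postings idea, here is a minimal, self-contained sketch that counts word frequencies with nltk.FreqDist (the sample postings string is invented for the example):

>>> from nltk.tokenize import word_tokenize
>>> from nltk import FreqDist
>>> postings = "Python developer wanted. Python and SQL required."
>>> tokens = word_tokenize(postings.casefold())  # lowercase so "Python" and "python" count together
>>> FreqDist(tokens).most_common(3)
[('python', 2), ('.', 2), ('developer', 1)]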

Here is how to import the relevant NLTK modules so that you can tokenize by word and by sentence:

>>> from nltk.tokenize import sent_tokenize, word_tokenize

Now that you’ve imported what you need, you can create a string to tokenize. Here’s a quote from Dune that you can use:

>>> example_string = """
... Muad'Dib learned rapidly because his first training was in how to learn.
... And the first lesson of all was the basic trust that he could learn.
... It's shocking to find how many people do not believe they can learn,
... and how many more believe learning to be difficult."""

You can use sent_tokenize to split example_string into sentences:

>>> sent_tokenize(example_string)
["Muad'Dib learned rapidly because his first training was in how to learn.",
'And the first lesson of all was the basic trust that he could learn.',
"It's shocking to find how many people do not believe they can learn, and how many more believe learning to be difficult."]

Tokenizing example_string by sentence gives you a list of three strings, each of which is a sentence:

"Muad'Dib learned rapidly because his first training was in how to learn."

'And the first lesson of all was the basic trust that he could learn.'

"It's shocking to find how many people do not believe they can learn, and how many more believe learning to be difficult."

Now we’re ready to tokenize example_string:

>>> word_tokenize(example_string)

["Muad'Dib", 'learned', 'rapidly', 'because', 'his', 'first', 'training',
'was', 'in', 'how', 'to', 'learn', '.', 'And', 'the', 'first', 'lesson',
'of', 'all', 'was', 'the', 'basic', 'trust', 'that', 'he', 'could',
'learn', '.', 'It', "'s", 'shocking', 'to', 'find', 'how', 'many',
'people', 'do', 'not', 'believe', 'they', 'can', 'learn', ',', 'and',
'how', 'many', 'more', 'believe', 'learning', 'to', 'be', 'difficult', '.']

You’ve got a list of strings that NLTK considers to be words, such as:

"Muad'Dib"

'training'

'how'

But the following strings were also considered to be words:

"'s"

','

'.'

See how "It's" was split at the apostrophe to give you 'It' and "'s", while "Muad'Dib" was left whole? This happened because NLTK knows that 'It' and "'s" (a contraction of “is”) are two distinct words, so it counted them separately. But "Muad'Dib" isn’t an accepted contraction like "It's", so it wasn’t read as two separate words and was left intact.
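You can see the same behavior with other common contractions. A quick sketch (the example sentence is made up):

>>> word_tokenize("Don't stop; it can't hurt.")
['Do', "n't", 'stop', ';', 'it', 'ca', "n't", 'hurt', '.']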

Filtering stop words

Stop words are words that you want to ignore, so you filter them out of your text when you’re processing it. Very common words like “in”, “is”, and “an” are often used as stop words because they don’t add much meaning to a text in and of themselves.

Now we need to import the required NLTK modules to filter stop words:

>>> import nltk

>>> nltk.download("stopwords")

>>> from nltk.corpus import stopwords

>>> from nltk.tokenize import word_tokenize

Here’s a quote from Worf that you can filter:

>>> worf_quote = "Sir, I protest. I am not a merry man!"

Now tokenize worf_quote by word and store the resulting list in words_in_quote:

>>> words_in_quote = word_tokenize(worf_quote)

>>> words_in_quote

['Sir', ',', 'I', 'protest', '.', 'I', 'am', 'not', 'a', 'merry', 'man', '!']

You have a list of the words in worf_quote, so the next step is to create a set of stop words to filter words_in_quote against. For this example, you should use the stop words for “english”:

>>> stop_words = set(stopwords.words("english"))

Next, create an empty list to hold the words that pass the filter:

>>> filtered_list = []

You created an empty list, filtered_list, to hold all words in words_in_quote that are not stop words. You can now use stop_words to filter words_in_quote:

>>> for word in words_in_quote:

...    if word.casefold() not in stop_words:

...         filtered_list.append(word)

We used a for loop over words_in_quote to iterate through it and appended every word that wasn’t a stop word to filtered_list. We also used casefold() on each word so we could ignore whether the letters were uppercase or lowercase. This is worth doing because stopwords.words('english') includes only lowercase versions of the stop words.
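Here is a quick sketch of why casefold() matters here, using the stop_words set defined above:

>>> "Not" in stop_words
False
>>> "Not".casefold() in stop_words
True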

Alternatively, you can use a list comprehension to make a list of all the words in your text that aren’t stop words:

>>> filtered_list = [

...     word for word in words_in_quote if word.casefold() not in stop_words

... ]

When you use a list comprehension, you don’t create an empty list and then add items to the end of it. Instead, you define the list and its contents at the same time, which is often considered more Pythonic. Take a look at the words that ended up in filtered_list:

>>> filtered_list

['Sir', ',', 'protest', '.', 'merry', 'man', '!']

You filtered out a few words like “am” and “a”, but you also filtered out “not”, which does affect the overall meaning of the sentence.

Words like “I” and “not” may seem too important to filter out, and depending on the kind of analysis you want to do, they can be. Here’s why:

“I” is a pronoun, which makes it a context word rather than a content word:

Content words give you information about the topics discussed in the text or how the writer feels about those topics.

Context words give you information about writing style. You can observe patterns in how writers use context words in order to quantify their writing style. Once you’ve quantified a writing style, you can analyze a text written by an unknown author to see how closely it follows that style, which can help you identify the author.
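As a toy illustration of that idea, here is a minimal sketch that builds a crude “style fingerprint” from the relative frequencies of a few context words (the sample text and the chosen marker words are invented for the example):

>>> from nltk.tokenize import word_tokenize
>>> def style_fingerprint(text, markers=("the", "of", "and")):
...     # Relative frequency of a few context words, as a crude style signal
...     tokens = [t.casefold() for t in word_tokenize(text)]
...     return {m: tokens.count(m) / len(tokens) for m in markers}
...
>>> style_fingerprint("The crew of the ship sailed, and the sea was calm.")
{'the': 0.23076923076923078, 'of': 0.07692307692307693, 'and': 0.07692307692307693}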

“not” is technically an adverb, but it’s still included in NLTK’s list of English stop words. If you want to keep “not” or make other changes, you can copy the stop word list and edit your copy, as sketched below.

So “I” and “not” can be essential parts of a sentence, depending on what you want to learn from that sentence.
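If you decide that “not” should survive filtering, here is a minimal sketch of customizing the stop word set, building on the stop_words set and words_in_quote list defined earlier:

>>> custom_stop_words = stop_words - {"not"}  # keep 'not' when filtering
>>> [word for word in words_in_quote if word.casefold() not in custom_stop_words]
['Sir', ',', 'protest', '.', 'not', 'merry', 'man', '!']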

Stemming

Stemming is a text-processing task in which you reduce words to their root, which is the core part of a word. For example, the words “helping” and “helper” share the root “help”. Stemming allows you to zero in on the basic meaning of a word rather than all the details of how it’s being used. NLTK has more than one stemmer, but here you’ll use the Porter stemmer.

Here’s how to import the relevant NLTK components to start stemming:

>>> from nltk.stem import PorterStemmer

>>> from nltk.tokenize import word_tokenize

Now that the module import process is complete, you can create a stemmer with PorterStemmer as follows:

>>> stemmer = PorterStemmer()

The next step is to create a string to stem. Here’s one you can use:

>>> string_for_stemming = """

... The crew of the USS Discovery discovered many discoveries.

... Discovering is what explorers do."""

Before you can stem the words in that string, you need to separate all of them, so tokenize it by word:

>>> words = word_tokenize(string_for_stemming)

Now that you have a list of all the tokenized words from the string, take a look at what’s in words:

>>> words

['The', 'crew', 'of', 'the', 'USS', 'Discovery', 'discovered', 'many',
'discoveries', '.', 'Discovering', 'is', 'what', 'explorers', 'do', '.']

Create a list of the stemmed versions of those words by calling stemmer.stem() in a list comprehension:

>>> stemmed_words = [stemmer.stem(word) for word in words]

Take a look at what’s in stemmed_words:

>>> stemmed_words

[‘the,’

‘crew,’

‘of,’

‘the,’

‘Russ,’

‘discovery,’

‘Discov,’

‘mani,’

‘discovery,’

‘.’,

‘Discov,’

‘is,’

‘what,’

‘explore,’

‘do,’

‘.’]

Here are all the words in the text that contain some form of “discov”:

The original words were:

  • ‘Discovery’
  • ‘discovered’
  • ‘discoveries’
  • ‘Discovering’

The stemmed words were:

  • ‘discoveri’
  • ‘discov’
  • ‘discoveri’
  • ‘discov’

These results look inconsistent: why does ‘Discovery’ give you ‘discoveri’ while ‘Discovering’ gives you ‘discov’?

Understemming and overstemming are two ways stemming can go wrong:

  • Understemming happens when two related words should be reduced to the same stem but aren’t. This is a false negative.
  • Overstemming happens when two unrelated words are reduced to the same stem even though they shouldn’t be. This is a false positive.

The Porter stemming algorithm dates from 1979, so it’s a little on the older side. The Snowball stemmer, also called Porter2, is an improvement on the original and is also available through NLTK, so you can use it in your own projects, as shown in the sketch below. It’s also worth noting that the goal of the Porter stemmer isn’t to produce complete words but to find variant forms of the same word.

Fortunately, you have other ways to reduce words to their core meaning, such as lemmatization, which reduces words to their dictionary forms and can give you more accurate results.
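To make the comparison concrete, here is a minimal sketch contrasting the Porter stemmer, the Snowball stemmer, and NLTK’s WordNet lemmatizer (the lemmatizer requires the separate “wordnet” data package, and the example words are just illustrations):

>>> import nltk
>>> nltk.download("wordnet")
>>> from nltk.stem import PorterStemmer, SnowballStemmer, WordNetLemmatizer
>>> PorterStemmer().stem("generously")
'gener'
>>> SnowballStemmer("english").stem("generously")  # Porter2 keeps more of the word
'generous'
>>> WordNetLemmatizer().lemmatize("scarves")  # lemmatization returns a real dictionary form
'scarf'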