An Introduction to N-grams: What Are They and Why Do We Need Them?

In this post I am going to talk about N-grams, a concept found in Natural Language Processing ( aka NLP). First of all, let’s see what the term ‘N-gram’ means. Turns out that is the simplest bit, an N-gram is simply a sequence of N words. For instance, let us take a look at the following examples.

San Francisco (is a 2-gram)
The Three Musketeers (is a 3-gram)
She stood up slowly (is a 4-gram)

Now which of these three N-grams have you seen quite frequently? Probably, “San Francisco” and “The Three Musketeers”. On the other hand, you might not have seen “She stood up slowly” that frequently. Basically, “She stood up slowly” is an example of an N-gram that does not occur as often in sentences as Examples 1 and 2.

Now if we assign a probability to the occurrence of an N-gram or the probability of a word occurring next in a sequence of words, it can be very useful. Why?

First of all, it can help in deciding which N-grams can be chunked together to form single entities (like “San Francisco” chunked together as one word, “high school” being chunked as one word).

It can also help make next word predictions. Say you have the partial sentence “Please hand over your”. Then it is more likely that the next word is going to be “test” or “assignment” or “paper” than the next word being “school”.

It can also help to make spelling error corrections. For instance, the sentence “drink cofee” could be corrected to “drink coffee” if you knew that the word “coffee” had a high probability of occurrence after the word “drink” and also the overlap of letters between “cofee” and “coffee” is high.
As you can see, assigning these probabilities has a huge potential in the NLP domain.

Now that we understand this concept, we can build with it: that’s the N-gram model. Basically, an N-gram model predicts the occurrence of a word based on the occurrence of its N – 1 previous words. So here we are answering the question – how far back in the history of a sequence of words should we go to predict the next word? For instance, a bigram model (N = 2) predicts the occurrence of a word given only its previous word (as N – 1 = 1 in this case). Similarly, a trigram model (N = 3) predicts the occurrence of a word based on its previous two words (as N – 1 = 2 in this case).

Let us see a way to assign a probability to a word occurring next in a sequence of words. First of all, we need a very large sample of English sentences (called a corpus).

For the purpose of our example, we’ll consider a very small sample of sentences, but in reality, a corpus will be extremely large. Say our corpus contains the following sentences:

He said thank you.
He said bye as he walked through the door.
He went to San Diego.
San Diego has nice weather.
It is raining in San Francisco.

Let’s assume a bigram model. So we are going to find the probability of a word based only on its previous word. In general, we can say that this probability is (the number of times the previous word ‘wp’ occurs before the word ‘wn’) / (the total number of times the previous word ‘wp’ occurs in the corpus) =

 (Count (wp wn))/(Count (wp))

Let’s work this out with an example.
To find the probability of the word “you” following the word “thank”, we can write this as P (you | thank) which is a conditional probability.
This becomes equal to:

=(No. of times “Thank You” occurs) / (No. of times “Thank” occurs) 
= 1/1 
= 1

We can say with certainty that whenever “Thank” occurs, it will be followed by “You” (This is because we have trained on a set of only five sentences and “Thank” occurred only once in the context of “Thank You”). Let’s see an example of a case when the preceding word occurs in different contexts.

Let’s calculate the probability of the word “Diego” coming after “San”. We want to find the P (Diego | San). This means that we are trying to find the probability that the next word will be “Diego” given the word “San”. We can do this by:

=(No of times “San Diego” occurs) / (No. of times “San” occurs) 
= 2/3 
= 0.67

This is because in our corpus, one of the three preceding “San”s was followed by “Francisco”. So, the P (Francisco | San) = 1 / 3.
In our corpus, only “Diego” and “Francisco” occur after “San” with the probabilities 2 / 3 and 1 / 3 respectively. So if we want to create a next word prediction software based on our corpus, and a user types in “San”, we will give two options: “Diego” ranked most likely and “Francisco” ranked less likely.

Generally, the bigram model works well and it may not be necessary to use trigram models or higher N-gram models.

This was a basic introduction to N-grams. For further reading, you can check out the reference:
https://web.stanford.edu/~jurafsky/slp3/4.pdf

14 thoughts on “An Introduction to N-grams: What Are They and Why Do We Need Them?”

Vinod on March 27, 2018 at 6:56 am said:

Useful! Thanks

Reply ↓
Vedika Parvez on April 30, 2018 at 2:05 pm said:

Hi, Prachi!
Loved your article. It was very comprehensive and answered most of my queries related to N-grams.

However, I still have one doubt uncleared. In the case of lengthy sentences, is it recommended to implement 4-gram models or higher, instead of lower ones; or is the value of N dependent on the application of the text?

Reply ↓
Hady Koteish on December 2, 2018 at 11:17 am said:

very nice article!

Reply ↓
hanaa zakarneh on December 20, 2018 at 8:43 pm said:

thank you, this was useful for me

Reply ↓
Deepak Nair on February 28, 2019 at 4:59 pm said:

I

It is a great explanation.
Thanks

Reply ↓
Qureeb Hameed on April 2, 2019 at 9:16 am said:

Thank you, Prachi. This is a very good starting point.
Thanks!

Reply ↓
Surendran K R on June 4, 2019 at 1:06 pm said:

I can understand the N-gram model easily though I am student of literature and language. It is explained clearly and briefly. Keep up the good work.

Reply ↓
nima on June 11, 2019 at 5:55 am said:

thanks a lot…it was useful concept to beginners like me.

Reply ↓
harshini on June 18, 2019 at 5:03 am said:

well explained

Reply ↓
Ashutosh Sharma on July 12, 2019 at 7:00 pm said:

Thanks for writing this.

Reply ↓
Aritra on July 21, 2019 at 11:42 am said:

Hi Prachi,
This article has been really helpful!
I was struggling to understand the concepts of n-grams, but this article has helped me a lot!

Reply ↓
Nathan Sowatskey on September 8, 2019 at 6:54 am said:

Thank you for this. The reference above is for Naive Bayes and Sentiment Classification. The reference you probably wanted is: https://web.stanford.edu/~jurafsky/slp3/3.pdf

The use of the bigram/digram as an illustration helps set the basic context. It does mean that the denominator is a single word though, which leaves out the more complex case of the denominator being multiple occurrences of (n-1)grams.

Reply ↓
Nafees on November 1, 2019 at 9:45 am said:

Wao, very good explanation. Thanks a lot

Reply ↓
Chancy on December 3, 2019 at 10:00 am said:

Thanks Prachi! This is very easy to understand for beginners!

Reply ↓

XRDS

Crossroads – The ACM Magazine for Students

An Introduction to N-grams: What Are They and Why Do We Need Them?

14 thoughts on “An Introduction to N-grams: What Are They and Why Do We Need Them?”

Leave a Reply Cancel reply