Great Ideas in Current Computer Science Research
Computer Science (CS) research is an exciting, fast-moving field. Classical parts of CS are being reshaped to fit a more modern conception of computing. One domain experiencing a renaissance is Natural Language Processing (NLP). Classical NLP tasks are being extended with time-series information, allowing us to capture evolutionary dynamics rather than just static snapshots of language. For example, the word “bitch” was historically synonymous with a female dog, and more recently became (pejoratively) synonymous with the word “feminist.”
Fig. 1: The Trend of “Feminist” Over Time and Its Close Relatives
Traditional thesauruses contain no information about when such a synonymy arose, nor about the surrounding events that gave rise to it. This additional information about the historicity of linguistic change is so innovative that it blurs the boundary between disparate disciplines: NLP and Computational Linguistics. The added dimension also allows us to challenge the foundations of traditional NLP research.
Language is the foundation of civilization. The story of the Tower of Babel in the Bible describes language as the uniting force among humanity, the key to its technological advancement and ability to become like G-d. Speaking a single language, Babel’s inhabitants were able to work together to develop a city and build a tower high enough to reach heaven. Seeing this, G-d mixes up their language, taking away the source of the inhabitants’ power by breaking down their mutual understanding. This story illustrates the power and cultural significance of universal language.
In this post, I am going to talk about automated spelling correction. Let’s say you are writing a document on your computer and, instead of typing “morning”, you accidentally type “mornig”. If you have automated spelling correction enabled, you will probably see “mornig” transformed into “morning” on its own. How does this work? How does your computer know that when you typed “mornig”, you actually meant “morning”? We are going to find out in this post.
Spelling mistakes could turn out to be real words!
Before we go through how spelling correction works, let’s think about the complexity of this problem. In the previous example, “mornig” was not a real word, so we knew it had to be a spelling mistake. But what if you misspelled “college” as “collage”, or “three” as “tree”? In these cases, the word you typed incorrectly happens to be an actual word itself! Correcting these types of errors is called real-word spelling correction. On the other hand, if the error is not a real word (like “mornig” instead of “morning”), correcting it is called non-word spelling correction. You can see that real-word spelling correction is more difficult than non-word spelling correction, because every word you type could be an error, even if it is spelled correctly. For example, the sentence “The tree threes were tail” makes no sense: every word except “the” and “were” is an error, even though they are all actual words. The intended sentence is “The three trees were tall”. In this post, I am going to cover a basic approach to non-word spelling correction.
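A minimal sketch of such a basic non-word corrector: generate every string within one edit (delete, transpose, replace, insert) of the typo, keep the ones that are known words, and pick the most frequent. The tiny `WORDS` frequency table below is made up for illustration; a real corrector would count words from a large corpus.

```python
from collections import Counter

# Toy word-frequency table; illustrative counts only.
WORDS = Counter({"morning": 120, "money": 80, "mooring": 5, "three": 60,
                 "tree": 40, "tall": 30, "tail": 25, "the": 500})

LETTERS = "abcdefghijklmnopqrstuvwxyz"

def edits1(word):
    """All strings one edit (delete, transpose, replace, insert) away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    """Return the most frequent known word within one edit, else the word itself."""
    if word in WORDS:
        return word
    candidates = [w for w in edits1(word) if w in WORDS] or [word]
    return max(candidates, key=WORDS.get)

print(correct("mornig"))  # → morning
```

Note that `correct("tree")` returns “tree” unchanged: since “tree” is a known word, this non-word corrector never questions it, which is exactly why real-word errors need a different approach.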
In this post I am going to talk about N-grams, a concept from Natural Language Processing (NLP). First of all, let’s see what the term “N-gram” means. It turns out that is the simplest part: an N-gram is simply a sequence of N words. For instance, take a look at the following examples.
- San Francisco (is a 2-gram)
- The Three Musketeers (is a 3-gram)
- She stood up slowly (is a 4-gram)
Now, which of these three N-grams have you seen most frequently? Probably “San Francisco” and “The Three Musketeers”. On the other hand, you might not have seen “She stood up slowly” as often. Basically, “She stood up slowly” is an example of an N-gram that occurs far less often in text than the first two.
Now, if we assign a probability to the occurrence of an N-gram, or to the probability of a word occurring next in a sequence of words, it can be very useful. Why?
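To make this concrete, here is a small sketch of how bigram (2-gram) probabilities are estimated by counting: P(word | previous) = count(previous word) / count(previous). The three-sentence “corpus” is made up purely for illustration.

```python
from collections import Counter

# Toy corpus; real N-gram models are trained on millions of sentences.
corpus = ["she stood up slowly",
          "she stood up quickly",
          "she sat down slowly"]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    words = sentence.split()
    unigrams.update(words)              # count single words
    bigrams.update(zip(words, words[1:]))  # count adjacent word pairs

def p_next(word, prev):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p_next("stood", "she"))  # "she stood" in 2 of 3 sentences → 2/3
```

Given “she”, the model now says “stood” is twice as likely as “sat” to come next, which is exactly the kind of prediction that powers autocomplete and speech recognition.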
In this post, I am going to talk about the relations in WordNet (https://wordnet.princeton.edu) and how you can use them in a Python project. WordNet is a database of English words connected by different kinds of relations.
Take a look at the next four sentences.
- “She went home and had pasta.”
- “Then she cleaned the kitchen and sat on the sofa.”
- “A little while later, she got up from the couch.”
- “She walked to her bed and in a few minutes she was snoring loudly.”
In Natural Language Processing, we try to use computer programs to find the meaning of sentences. With the help of WordNet, a computer program will be able to identify the following from the four sentences above:
- “pasta” is a type of dish.
- “kitchen” is a part of “home”.
- “sofa” is the same thing as “couch”.
- “snoring” implies “sleeping”.
Let’s get started with using WordNet in Python. It is included as part of the NLTK (http://www.nltk.org/) corpus collection. To use it, we need to import it first.
>>> from nltk.corpus import wordnet as wn
What is the meaning of the word “understanding”? This was a question posed during a particularly enlightening lecture given by Dr. Anupam Basu, a professor in the Department of Computer Science and Engineering at IIT Kharagpur, India.
Understanding something probably means being able to answer questions about it, or perhaps to form an image or a flow chart of it in your head. If you can make another human being comprehend a concept with the least amount of effort, that means you truly understand what you are talking about. But what about a computer? How does it understand?