Evolution of NLP Techniques based on the Google Books Corpus

Great Ideas in current Computer Science Research

Computer Science (CS) research is an emergent and exciting area. Classical parts of CS are being reshaped to fit a more modern concept of computing. One domain that is experiencing a renaissance is Natural Language Processing (NLP). Classical NLP tasks are being expanded to include time-series information, allowing us to capture evolutionary dynamics and not just static information. For example, the word “bitch” was historically synonymous with a female dog, and more recently became (pejoratively) synonymous with the word “feminist.”

Fig. 1: The Trend of “Feminist” Over Time and Its Close Relatives

Traditional thesauruses do not record when this synonymy arose, nor the surrounding events that gave rise to it. This additional information about the historicity of the linguistic change is so innovative that it blurs the boundary between disparate disciplines: NLP and Computational Linguistics. This added dimension also allows us to challenge the foundations of traditional NLP research.

Language is the foundation of civilization. The story of the Tower of Babel in the Bible describes language as the uniting force among humanity, the key to its technological advancement and its ability to become like G-d. Speaking a single language, Babel’s inhabitants were able to work together to develop a city and build a tower high enough to reach heaven. Seeing this, G-d mixed up their language, taking away the source of the inhabitants’ power by breaking down their mutual understanding. This story illustrates the power and cultural significance of a universal language.

My research investigates the evolution of words through the Google Books Corpus. I wanted to find out whether the polarity of words (whether words are positive or negative) changes over time, and if so, why it changes. Do discrete events cause such a change? Can white become black at different points in history? Does the female become male, or does gender vanish altogether? To answer these questions, I turned to two gold-standard sentiment techniques to determine how sentiment changes over time: Pointwise Mutual Information (PMI) and opinionFinder. I tried to apply these techniques to this corpus because it is unique in its volume and contents. The Google Books Corpus is one of the first corpora that allow one to examine historical trends in language: it is an enormous record of words and grammatical contexts, as well as their statistical usage in books.
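For readers unfamiliar with PMI, the following is a minimal sketch of the standard definition, PMI(x, y) = log2(p(x, y) / (p(x) p(y))), computed from raw co-occurrence counts. The `polarity` helper shows one common way to turn PMI into a sentiment score (PMI with a positive seed word minus PMI with a negative seed word); that scheme, the function names, and all the counts are assumptions made for illustration and are not taken from the research described here.

```python
import math


def pmi(pair_count, x_count, y_count, total):
    """PMI in bits: log2( p(x, y) / (p(x) * p(y)) ), from raw counts."""
    if pair_count == 0:
        return float("-inf")  # the pair was never observed together
    # Algebraically equal to log2((pair/total) / ((x/total) * (y/total))).
    return math.log2(pair_count * total / (x_count * y_count))


def polarity(word_pos, word_neg, word_count, pos_count, neg_count, total):
    """Hypothetical polarity score: PMI with a positive seed word minus
    PMI with a negative seed word (an assumed scheme, for illustration)."""
    return (pmi(word_pos, word_count, pos_count, total)
            - pmi(word_neg, word_count, neg_count, total))


# Made-up counts, for illustration only.
total = 10_000
independent = pmi(pair_count=1, x_count=100, y_count=100, total=total)
associated = pmi(pair_count=10, x_count=100, y_count=100, total=total)
score = polarity(word_pos=8, word_neg=2, word_count=50,
                 pos_count=100, neg_count=100, total=total)
print(independent, associated, score)
```

A PMI of zero means the two words co-occur exactly as often as chance would predict; a positive value means they co-occur more often than chance. In this toy setup, a positive `score` means the word keeps closer company with the positive seed than with the negative one.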

Fig. 2: Millions of Digitized Books

What makes the Google Books Corpus unique is its volume and the amount of temporal content it comprises. The corpus spans several languages: English with 361 billion words, French with 45 billion words, Spanish with 45 billion words, German with 37 billion words, Russian with 35 billion words, Chinese with 13 billion words, and Hebrew with 2 billion words. The books were culled from over 40 libraries around the world, and the words and phrases were extracted via Optical Character Recognition (OCR). Full sentences are not available due to copyright issues.

Fig. 3: The Dataset: Syntactic Ngrams

However, I soon found that neither PMI nor opinionFinder could be applied to the corpus. This shows the complexity of the data contained in the corpus, since one of the techniques is otherwise suitable for most NLP tasks. The first technique did not work because it was not reproducible. Its author told me to rely on the results of my own experiment rather than place my trust in experts, let me know that his technique could not be used on the corpus, and did not provide his own implementation of the algorithm. The corpus thus exposed a limitation of this technique that was previously unknown to most people in the NLP community. The second technique, which is based on Machine Learning, most likely did not work because of the underlying type of data (the corpus consists of books, whereas opinionFinder was trained on news), according to the designer of the technique. I thus realized that I needed to come up with a new technique in order to answer my question.

At 4am, I stared at my computer and invented a technique to extract word similarity using a metric known as Jaccard similarity. I looked at two expletives (curse words), “angel” and “saint” (in the Talmudic sense of “sagi nahor,” which is the use of opposite words to refer to unclean language instead of a direct expression of the vulgar), and asked how the computer knows that they are related. At first, I was thinking about polarity and how the polarity detector works better on negative words. Then I thought about the most negative words possible, which are expletives. I wanted to get inside the brain of the computer and figure out how it knows that a given word, “angel,” is an expletive and that it is related to another word, “saint.” I determined that the relation was based on their contexts (at that time adjectives), and I later expanded the contexts to include other grammatical structures. The research derived rules for the birth of similarity between words, the information carried and lost in various parts of speech, thesaurus properties, and how language evolves over time in terms of speech/writing and word families. This also demonstrates how the brain has evolved over time, since the brain contains its own semantic map, in which words are placed in close proximity based on meaning. If thesauruses change over time, then the brain evolves accordingly.
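The idea of comparing words by their contexts can be sketched with the Jaccard similarity of two sets: the size of their intersection divided by the size of their union. The example below is only an illustration under stated assumptions. The adjective sets are invented, the word “table” is added as an unrelated control, and nothing here reproduces the actual technique or data from the research.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections of contexts."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here: two empty sets score 0
    return len(a & b) / len(union)


# Hypothetical adjective contexts for each word (invented for illustration).
contexts = {
    "angel": {"little", "sweet", "fallen", "holy"},
    "saint": {"holy", "patron", "sweet", "little"},
    "table": {"wooden", "round", "little"},
}

angel_saint = jaccard(contexts["angel"], contexts["saint"])  # 3 shared / 5 total
angel_table = jaccard(contexts["angel"], contexts["table"])  # 1 shared / 6 total
print(angel_saint, angel_table)
```

With these toy sets, “angel” and “saint” share three of five distinct adjectives, while “angel” and “table” share one of six, so the first pair scores as far more similar. Computing the same score separately for each year or decade of a corpus would give a similarity curve over time.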

This work put the foundations of the NLP field into question. Since PMI and opinionFinder did not work on the corpus, it suggested that the background of the field itself was not firm. I discovered that the distributional hypothesis, which states that a word is known by the company that it keeps, was actually lifted from the Talmud’s Gzeirah Shavah, which states the inverse: the contexts are the same if they are encoded by the same word.

Fig. 4: Gzeirah Shavah: The Daughters of Tzelofchad and the Identity of the Sabbath Desecrator

Furthermore, I also discovered that the Google Books Corpus is not random. It can serve as a testbed for an older linguistic concept underlying the Swadesh list: a “Swadesh hypothesis,” if you will. Morris Swadesh was one of the pioneers of glottochronology (the dating of language divergence) and lexicostatistics (the quantitative assessment of the genealogical relatedness of languages), and the creator of the Swadesh lists.

Fig. 5: Swadesh List

This “hypothesis” served as the foundation for his work, specifically the lists. It states that words that form the core component of language, learned early and spoken often, are less susceptible to change than other words. In fact, if one were to expand to a multilingual corpus, one could prove that there is a core part of language that is invariant to time and culture: the very foundation for the civilization described in the biblical tale of the Tower of Babel. The new (the Google Books Corpus) provides an innovation upon the old (the Distributional Hypothesis and the “Swadesh hypothesis”).

This entry was posted in Algorithms, Great Ideas in CS, HCI by Talia Kohen.

About Talia Kohen

Shortly after graduating from Cornell ECE in 2006, Kohen went to work for Raytheon in a position focused on tracking and discrimination. She is currently completing a master’s degree in computer science on the Evolution of Words in The Google Books Corpus. She has implemented three algorithms, two known and one that she designed and later found in a textbook, and with them has been able to analyze one billion lines of text in two-and-a-half hours. During her master’s work, Kohen independently derived, by mere visualization, a result in mathematics known as “Euler’s F-Vector”.

Kohen has received numerous awards and honors including: Anita Borg Birthday Celebration Director in Israel; a Google OutStander; Google Anita Borg Scholarship for Women in Europe and the Middle East Finalist; Google Campus Ambassador; Microsoft Israel Women of Excellence Program; Microsoft Excellence Summer Camp; ACM XRDS feature issue editor for the IoT Edition; IEEE International Radar Conference Poster Session Co-Chair and Steering Committee; Raytheon Individual Performer Achievement Award – Ionospheric study; Raytheon Spot Award; Raytheon Women’s Network MDC Site Representative; and Delegate to the Grace Hopper Conference for Women in Computing. Kohen is the CEO of FemTech, a community for women in STEM in Israel. Upon completion of her master’s degree, Kohen plans to earn a Ph.D. in artificial intelligence. She hopes to be the CEO of her own tech company some day.
