CIS Faculty Leading Field of Natural Language Processing

By Louis DiPietro

The dynamics of human interaction encompass more than just the words we use. Context and our own individual knowledge are essential to understanding what is being expressed, either verbally or in writing.

To consider the complexity and ambiguity of our human languages is to glimpse the fundamental challenge for technologists working in the field of natural language processing, or NLP for short: How do we teach computers to decipher and understand what we are saying, and perhaps even communicate back to us?

The challenge is immense, considering just how dynamic our language is. As Computing and Information Science Professor and NLP scholar Lillian Lee put it in a 2012 talk, “Language is ripe with ambiguity, and ambiguity is one of the big problems that makes NLP hard for computers and people.”

Welcome to the field of natural language processing, where Cornell professors and scholars are using computational and statistical methods to teach computers to interpret and process the information we share with them – our spoken words, our documents – and communicate to us what we might fail to perceive. The ramifications of solving this complex, human-computer conundrum are potentially world-changing: it could help answer the most perplexing questions of our time and lead to discoveries in nearly every field of study.

Part of the strategy in statistical NLP is to feed computers as much written text as we can offer. Enter the internet and all of its users, who collectively generate 2.5 billion gigabytes of information per day, according to IBM’s estimates. At Cornell, scholars are using NLP tools for any number of different ends, from mining online text for signs of bias, betrayal and deception to improving how we search for and absorb online information. Some examples include:

• Cornell Computer Science PhD Vlad Niculae, along with Information Science Assistant Professor Cristian Danescu-Niculescu-Mizil and researchers from the University of Maryland and the University of Colorado Boulder, used 145,000 messages shared among players of the online strategy game Diplomacy to identify linguistic cues that signal a coming betrayal. The researchers found that players who ultimately betrayed their teammates chatted less about future actions in the game, wrote more sentences in their chat messages, and were far more positive than usual.

• Cornell Information Science Assistant Professor and Sloan Research Fellow David Mimno, along with UCLA researchers, is currently using NLP tools to comb through 50,000 Danish folk tales in an effort to catch linguistic patterns and compare them with similar stories.

• Computer Science PhD Liye Fu, Danescu-Niculescu-Mizil and Lee dove into written transcripts of post-match press conferences with tennis pros. They found that journalists were more likely to ask male tennis pros about the game, while female pros were asked fewer game-related questions.

• Cornell researchers, including Claire Cardie, developed a language-processing algorithm to determine whether online hotel reviews were fake. The program was able to weed out fishy reviews about 90 percent of the time, providing yet more evidence that computers are much better than humans at finding linguistic patterns.

Mimno’s work in topic modeling – using algorithms to comb volumes of documents for topics and language patterns – is a notable sub-category of Cornell’s already-esteemed NLP research. Along with his look into Danish folk tales, Mimno is currently exploring transcripts of circuit court decisions from the U.S. Court of Appeals.

“There’s this perception that the 9th Circuit court in California is more liberal. Does this come through in how words are associated?” he said. “Is there a measurable distance between these courts? Because we can’t talk about bias until we define that there is a difference.”

Both Mimno and Danescu-Niculescu-Mizil are quick to describe Cornell as one of the leaders in the field, citing major influences like Cornell’s Cardie, Lee, Cornell Tech Associate Professor Yoav Artzi, Computing and Information Science’s Jon Kleinberg and the late Cornell Computer Scientist Gerard Salton, who is widely considered a pioneer in information retrieval.

“We’re using technology to learn about culture from documents – sometimes it’s social media posts, news articles or even diaries from 1784,” Mimno said. “Natural language processing isn’t replacing anything. It’s a new capability.”