Programmer's Myopia -- Natural Language Grammars
I previously wrote about Programmer’s Myopia, an affliction a developer acquires when he becomes fascinated and absorbed by a particular programming concept. In the writing profession, authors are warned not to become too attached to their “baby,” the passage that is probably best edited out.
I recently talked with another ex-Microsoft employee, Erich Finkelstein, who is also working with a Natural Language Processing company. I went through some of the issues that I had in developing a natural language parser. I let him know that I had a low opinion of mainstream NLP practices, the ones taught in CS courses or used in commercial software. Developers seemed to follow a herd mentality. A lot of NLP software relies on statistical techniques, which are heuristic: basically, hacks that provide no guarantees for their results. These statistical techniques often fail on short sentences, where little content is available to work with. The remaining symbolic parsers tend to incorporate multiple stages and rely on phrase structure grammars.
In developing my parser, I wanted to accomplish several goals.
- Simple and Expressive. As output, the parser would emit a data structure expressive enough to represent all English phenomena (including some ambiguities as well as erroneous words) yet simple enough to manipulate and transform easily.
- Automatic Correction. It should be able to deal with bad input text, such as omitted and extra words. In addition, the parser should automatically correct grammatical errors, misspellings, and confused words. The latter two cases would be achieved by a parallel parse of a word and its best alternatives.
- Incomplete Sentence Parsing. The parser should correctly parse partial sentences that a user is currently entering.
- Fast. The parser needed to be able to process sentences quickly and interactively in a desktop application environment, introducing no perceptible delay to the user.
- Accurate. The parser, of course, should generate an accurate parse.
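To make the parallel-parse goal concrete, here is a toy Python sketch of the idea: each token carries plausible alternatives (confusables or spelling corrections), every candidate sentence is expanded from that lattice, and the grammar's scoring picks the winner. All the names and the scoring function are illustrative assumptions, not my actual implementation.

```python
# Toy sketch of parallel parsing over word alternatives; every name
# and the scoring heuristic here are hypothetical, for illustration only.
from dataclasses import dataclass, field

@dataclass
class Token:
    surface: str                                       # word as typed
    alternatives: list = field(default_factory=list)   # plausible substitutes

def expand(tokens):
    """Yield every candidate word sequence from the lattice of alternatives."""
    if not tokens:
        yield []
        return
    head, rest = tokens[0], tokens[1:]
    for tail in expand(rest):
        yield [head.surface] + tail
        for alt in head.alternatives:
            yield [alt] + tail

def best_parse(tokens, score):
    """Pick the candidate sequence the grammar scores highest."""
    return max(expand(tokens), key=score)

# Stand-in scorer: reward the grammatical bigram "there is" over "their is".
def score(words):
    return sum(1 for a, b in zip(words, words[1:]) if (a, b) == ("there", "is"))

sentence = [Token("their", ["there"]), Token("is"), Token("a"), Token("cat")]
```

A real parser would score candidates by how well they link into the dependency graph rather than by a bigram table, but the control flow is the same: the alternatives compete inside a single parse instead of in a separate spell-check pass.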
Undaunted, I went on to create my perfect parser, with the help of some natural language data that I licensed from various universities and the Linguistic Data Consortium. Along the way, I came to the conclusion that the Natural Language Processing techniques using phrase structure grammars, as typically taught in courses and popular textbooks, are very poor.
I firmly believe that “hard problems when solved often have simple solutions.” I think that software developers at universities and large firms, like M——t, who have built their own parsers, felt that natural language was a hard problem, so they built baroque data structures and algorithms and kept piling on complexity. They completely missed the simple solution. They figured that, since everyone is using phrase structure grammars and abundant tools and corpora exist for them, they must be pursuing the correct path of development.
PSGs owe their fame to the outspoken MIT professor Noam Chomsky, the “father” of linguistics, who discovered Context Free Grammars. The problem is that we are dealing with a mathematical formalism that is an imprecise and inaccurate representation of natural language. I am reminded of my computer science professor, Stuart Shieber, who proved that natural language is not context-free. We abandoned “Freud” and his below-the-belt preoccupations, and we should probably sidestep ol’ Noam as well.
This reminds me of a joke in IBM's speech recognition research labs: "Every time a linguist leaves the room, the speech recognition rate goes up."
This is why I think that dependency grammars are superior to phrase-structure grammars. Dependency grammars represent words as nodes in a graph, and the relationship between two words is represented by a single labeled link between them. For example, a noun node and a verb node could be connected with a “subject” or “direct object” link that describes the relationship between the two words.
- Better model of English. Phrase-structure grammars use artificial structures that linguists have built up, such as noun phrases and verb phrases, which are flawed. There are many types of ad hoc phrases in the English language, and each type of phrase can often play the role of another type. I think that dependency grammars model the English language far more accurately and in the same natural way that humans do.
- Well-factored. Dependency grammars are well-factored and result in a tiny fraction of the rules of the equivalent phrase structure grammar. PSGs suffer from rule proliferation and don’t handle “gaps” from questions and clauses very well; they are too rigid. Dependency grammars have better support for idioms and collocations (word phrases that act as one word), and they handle sentence interruptions and noise words better. Any word of any part of speech can acquire a nontraditional role in the sentence, which is frequently the case in everyday language.
- Graphs versus trees. In a few instances in the past, I have discovered that by moving to a more advanced data structure (trees instead of lists, graphs instead of trees), I can obtain a more expressive representation and a simpler overall algorithm. The properties of graphs are also well understood.
- Simple data structures. A dependency parser only needs to emit a graph as output: a node for each word and an edge for each relation the word has to other nodes. Simpler data structures are easier to program, test, manipulate, and optimize.
- Delayerization. Dependency parsers have less need for layering into tagging, parsing, and semantic phases. Each dependency relation is not just a syntactic relation but a semantic one as well, allowing me to combine two traditionally separate parsing phases into one. This is important, because human parsing integrates all phases together. I also get tagging for free, since it adds minimal cost to parsing. My dependency parser is fast, and it incorporates parallel parsing with minimal overhead, which lets me recognize when the user has substituted an incorrect word into the sentence and, as part of the parsing process, automatically correct a confusable word or a misspelling (one that happens to be another real word).
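The dependency graph described above can be sketched in a few lines of Python. This is a minimal illustration, not my parser's actual data structure; the relation labels ("subject", "object", "determiner") and the example sentence are assumptions chosen for clarity.

```python
# Minimal sketch of a dependency-graph output: one node per word,
# one labeled edge per relation. Labels here are illustrative.
class Node:
    def __init__(self, word):
        self.word = word
        self.links = []          # outgoing (label, dependent) edges

    def link(self, label, dependent):
        self.links.append((label, dependent))
        return dependent

# "The cat chased the mouse": every relation is a single labeled edge
# between two word nodes; there are no intermediate phrase nodes at all.
chased = Node("chased")
cat = chased.link("subject", Node("cat"))
mouse = chased.link("object", Node("mouse"))
cat.link("determiner", Node("the"))
mouse.link("determiner", Node("the"))

def edges(node):
    """Flatten the graph into (head, label, dependent) triples."""
    for label, dep in node.links:
        yield (node.word, label, dep.word)
        yield from edges(dep)
```

Note what is absent: no NP or VP nodes, no nested phrase brackets. Transforming or querying the parse is just walking a graph of labeled edges, which is what makes the output simple to manipulate.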
The advantage of PSGs is that everyone knows them, and more tools and corpora have been built with them in mind. Unfortunately, their problems have led researchers to adopt statistical methods to escape the complexity.
Did I fall into a programmer’s myopia trap with my dependency grammars, or did the rest of the computer science community with their insistence on phrase structure grammars?