Siri Technology

10/24/2011 2:44:23 PM

Considerable press puffery has followed Apple’s announcement of Siri, dramatizing the significance of its natural language technology. Siri is touted as the product of a $200 million DARPA-funded research project, in the same league as the Internet and GPS.

Siri is well ahead of existing popular voice control technology in its acceptance of free-form input, access to web services, and conversational feedback; however, it is probably not too much of an advance. Microsoft and Google should catch up quickly. Microsoft has previewed a future vision of its acquired Tellme voice control technology, and Google is likely to advance its Voice Actions to match Apple’s developments.

Voice-based assistant technology is ideal for a mobile device because the phone is designed to host conversations, geolocation is available, internet data is available, and all the user’s information is already stored on the device. Because contextual information is abundant and the domain knowledge is constrained, speech recognition can be very effective.

My reaction to Siri was “finally”: finally, the state of the art in technology has been realized in a commercial product. There’s a lot of technology that I encountered in academia decades ago, such as natural language analysis, that has not yet made its way to consumers. In college, I wrote a Prolog-based natural language parser, built on definite clause grammars, that converted parsed text into a semantic representation on which we performed queries. As part of my entrepreneurial work, I rewrote the Link Grammar parser from CMU and obtained licenses to various natural language data sources like COMLEX, NOMLEX and WordNet. It felt odd that I might be the first to sell a consumer natural-language product aside from grammar checkers and translation software.

So, finally, but… I am a bit skeptical about the technological advances, since I am simply one developer and, though I have leveraged the work of others, I did not need $200M. First, Siri uses Dragon’s speech recognition technology, which any developer can license as part of the Nuance Mobile Developer Program. Second, the DARPA halo is just for dramatics; this kind of natural language analysis has been around for a while. I credit Apple for pushing the quality of the communication beyond mere keyword and structure recognition and for putting existing art into its products. I suspect that most of the stated 300 SRI researchers consulted but did not actually work on CALO (let alone full-time). From the looks of the project, most of the technology is in the backend (much of which may not even be relevant to Siri), with very little in natural language analysis.

In How Siri Works, the author presents his own skepticism of Siri technology. Jeff is more impressed with the application and integration of Siri into the OS than with the technology itself. Siri performs a limited set of operations centered around built-in iPhone applications, and it integrates with a number of web services such as Yelp, Wolfram|Alpha, OpenTable and Wikipedia. Despite the limitations, it is still an impressive achievement given the naturalness of the implementation.

Another CALO engineer confirms my thoughts of Siri as a compelling but not terribly advanced technology:

I worked at SRI on the CALO project, and built prototypes of the system that was spun off into SIRI. The system uses a simple semantic task model to map language to actions. There is no deep parsing - the model does simple keyword matching and slot filling, and it turns out that with some clever engineering, this is enough to make a very compelling system. It is great to see it launch as a built-in feature on the iPhone.

The NLP approach is based on work at Dejima, an NLP startup: “Iterative Statistical Language Model Generation for Use with an Agent-Oriented Natural Language Interface.”

A lot of the work is grounded in Adam Cheyer's (CTO of SIRI) work on the Open Agent Architecture: A more recent publication from Adam and Didier Guzzoni on the Active architecture, which is probably the closest you'll come to a public explanation of how SIRI works: Active, a Platform for Building Intelligent Software
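The keyword matching and slot filling described in the quote above can be illustrated with a minimal sketch. This is not Siri's or CALO's actual code; the task names, trigger keywords, and slot patterns below are all hypothetical, chosen only to show how far a shallow approach can go without any deep parsing.

```python
import re

# Hypothetical task models: each task lists trigger keywords and
# regular expressions that pull slot values out of the utterance.
TASKS = {
    "set_alarm": {
        "keywords": {"alarm", "wake"},
        "slots": {
            "time": re.compile(r"\b(\d{1,2}(?::\d{2})?\s*(?:am|pm))\b", re.I),
        },
    },
    "find_restaurant": {
        "keywords": {"restaurant", "eat", "dinner", "lunch"},
        "slots": {
            "cuisine": re.compile(r"\b(italian|thai|mexican|chinese)\b", re.I),
            "place": re.compile(r"\b(?:in|near)\s+([A-Za-z ]+?)\s*$", re.I),
        },
    },
}


def interpret(utterance):
    """Pick the task with the most keyword hits, then fill its slots."""
    words = set(re.findall(r"[a-z']+", utterance.lower()))
    best_task, best_score = None, 0
    for name, model in TASKS.items():
        score = len(words & model["keywords"])
        if score > best_score:
            best_task, best_score = name, score
    if best_task is None:
        return None  # no task matched; a real system would fall back
    slots = {}
    for slot, pattern in TASKS[best_task]["slots"].items():
        match = pattern.search(utterance)
        if match:
            slots[slot] = match.group(1).strip().lower()
    return {"task": best_task, "slots": slots}


print(interpret("Wake me up at 7:30 am"))
print(interpret("Find me a Thai restaurant near Palo Alto"))
```

Note that word order and grammar play no role here: the utterance is treated as a bag of words for task selection, and each slot is filled independently.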

His comments on the natural language parsing left me disappointed, but it’s possible that Apple upgraded the natural language processing capabilities of Siri with its own homegrown version after acquiring Siri. However, after reading the Dejima paper, I found that a traditional parser may have too rigid a grammar for short, conversational, and often grammatically incorrect speech input.

NLIs often use text as their main input modality; speech is however, a natural and, in many cases, preferred modality for NLIs. Various speech recognition techniques can be used to provide a speech front end to an NLI. Grammar-based recognizers are rigid and unforgiving, and thus can overshadow the robustness and usability of a good NLI. Word-spotting recognizers are reliable only when the input consists of short utterances, and the number of words to be spotted at each  given time is small. Dictation engines are processor and memory intensive, and often speaker dependent. The dictation vocabulary is often considerably larger than required for domain-specific tasks. General statistical language models (SLMs), although robust enough to be used as a front end for a structured domain, requires a very large training corpus. This is time consuming and expensive since a large number of users needs to be sampled and all speech has to be transcribed.

The system proposed in the paper is a statistical language model designed explicitly for an agent-oriented, speech-based natural language interface. It does have its failings, such as the weaknesses detailed in Siri in Practice, where Siri has trouble with parenthetical or quoted expressions that might have been handled more properly by a grammar-based recognizer, assuming that Apple has not changed Siri’s parser.

Another telling failure is Siri’s response to “What’s the best iPhone wallpaper?” Siri answers with its canned response to “What’s the best phone?” as if it did not process the word “wallpaper” and simply hooked on the keywords “best” and “iPhone.”


The response vaguely resembles that of fairly unsophisticated chatterbots. Siri could be performing a keyword match on either the syntactic or the semantic level. It would be easy to test this hypothesis by asking Siri variations of the question. I doubt the seriousness of these mistakes, though, because Apple might be using a different, catch-all keyword-matching system for unanticipated queries.
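To make the hypothesis concrete, here is a small sketch of how the two kinds of keyword match would behave differently on variations of the question. Everything in it is an assumption for illustration: the canned reply is a placeholder rather than Siri's actual wording, and the synonym table is a hypothetical stand-in for whatever semantic normalization a real system might use.

```python
import re

# Hypothetical canned responses, keyed by the keywords that trigger them.
CANNED = [
    ({"best", "phone"}, "CANNED: best-phone reply"),
]

# Hypothetical synonym table standing in for a semantic-level match.
SYNONYMS = {"iphone": "phone", "smartphone": "phone", "greatest": "best"}


def tokens(utterance):
    """Lowercase the utterance and split it into alphabetic words."""
    return re.findall(r"[a-z]+", utterance.lower())


def respond(utterance, semantic=False):
    """Return a canned reply if all of its trigger keywords are present.

    With semantic=False the literal words must appear (syntactic match);
    with semantic=True words are first mapped through SYNONYMS.
    """
    words = tokens(utterance)
    if semantic:
        words = [SYNONYMS.get(w, w) for w in words]
    bag = set(words)
    for keywords, reply in CANNED:
        if keywords <= bag:  # every trigger keyword is in the utterance
            return reply
    return None


for question in [
    "What's the best phone?",
    "What's the best iPhone wallpaper?",
    "What's the greatest smartphone wallpaper?",
]:
    print(question, respond(question), respond(question, semantic=True))
```

In this sketch, a purely syntactic matcher ignores the reworded questions, while the semantic matcher fires on all three and never notices “wallpaper.” Asking Siri similar variations would show which behavior it is closer to.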






SoftPerson develops innovative new desktop software applications by incorporating artificial intelligence and natural language technologies to bring human-like intelligence to everyday applications.
