speech-input user interfaces
Kragen Javier Sitaker
kragen at pobox.com
Mon Sep 4 22:52:20 EDT 2006
From John Canny's "The Future of Human-Computer Interaction" in the
July/August ACM Queue, some optimistic notes about current speech
recognition. The URL is
http://www.acmqueue.com/modules.php?name=Content&pa=printer_friendly&pid=402&page=1
First of all, when PCs were mostly in offices, VUIs didn't make much
sense. Nothing wrong with the technology, but speech is a poor match
for most office work. ...
Let's remember the lessons from the Xerox Star. The Star was all about
having a real-use context (office work) and identifying an appropriate
set of user tasks. Phones are primarily about communicating using a
variety of media (sound, images, text) and to an increasing extent
about sharing and archiving those media. To support and augment those
communication services, we need some knowledge of what's "in" those
media, which is exactly a machine perception task. Furthermore, if
phones are to provide other services (besides communication) to users,
they also need to interpret the user's intent through whatever
interfaces the phone possesses. I already remarked on users' toils
with phone menus and buttons, while at the same time the phone is a
beautifully evolved speech platform. Speech interfaces do indeed look
like a great choice. They continue to improve in performance, but the
state of the art is much better than people realize.
Until last year, like most HCI researchers, I was skeptical about the
value of speech interfaces in HCI. But then I saw a Samsung phone
(P207) shipping with large-vocabulary speech recognition and getting
very good user reviews in all kinds of publications (including the
hard-to-impress business market).
I also taught a class on medical technologies and had a chance to meet
with many caregivers. There is already a large speech industry in
medicine, and it is widely seen as one of the key technologies moving
forward (it has probably already eclipsed "office ASR" and is a
significant part of the speech recognition industry overall).
I had committed the cardinal sin of generalizing experience from a
technology in one context (VUIs in the office) to its application in a
different context. ...
My only direct experience with speech interfaces was with the
burgeoning automated call-center industry, which had been quite
bad. But after learning more about the state of the art (Randy Allen
Harris's Voice Interaction Design or Blade Kotelly's The Art and
Business of Speech Recognition are excellent guides), I realized that
there are many superb examples of voice interface design. It's a lot
like Web sites and GUIs in the 1980s. The practice of human-centered
user interface design was not widely known back then, but as the HCI
discipline grew both in academia and industry, best practices
spread. Products that didn't follow a good user-centered process were
quickly displaced by competitors that did. There is an excellent set
of user-centered design practices for speech interfaces that are very
similar to the practices for core HCI. As yet, they aren't widely
adopted, but the differences between systems that follow them and
those that don't are so striking that this cannot last forever.
It has also become clear that the recognition accuracy of the ASR part
of the interface is not the limiting factor - it's the quality of the
overall VUI design and the match of the application to its context. In
other words, there's no reason to wait for future technical magic
before using speech interfaces. You can write excellent ones now,
assuming speech interaction fits your application context. (See the
recent examples that appeared in the article "'Conversational' Isn't
Always What You Think It Is" from Speech Technology Magazine,
July/August 2003; http://www.speechtechmag.com.)
After these epiphanies, I moved a significant amount of activity in my
group to speech and dialog-based interfaces (i.e., started four new
projects). While there are very good practices in speech interface
design today and many useful services that can be built with them,
there are still significant challenges and room for improvement. Those
limits have to do with the shared understanding between a human and a
machine sharing a speech interface. This is why speech interfaces are
also a rich research area. Much of the shared information is the
context we have already been talking about, and all of the
aforementioned projects are coupled with our work on context-awareness
(for more information, see my home page,
http://www.cs.berkeley.edu/~jfc).
This is all very promising for reducing the cost of devices, since
high-resolution LCD screens are a substantial part of the cost of
current cellphones and computers, and keyboards are a substantial part
of the size (and, I suspect, good keyboards are a substantial part of
the cost). Canny doesn't say much about the other direction of the
UI, the output.