speech-input user interfaces

Kragen Javier Sitaker kragen at pobox.com
Mon Sep 4 22:52:20 EDT 2006


From John Canny's "The Future of Human-Computer Interaction" in the
July/August ACM Queue, some optimistic notes about current speech
recognition.  The URL is
http://www.acmqueue.com/modules.php?name=Content&pa=printer_friendly&pid=402&page=1

    First of all, when PCs were mostly in offices, VUIs didn't make much
    sense. Nothing wrong with the technology, but speech is a poor match
    for most office work. ...

    Let's remember the lessons from the Xerox Star. The Star was all about
    having a real-use context (office work) and identifying an appropriate
    set of user tasks. Phones are primarily about communicating using a
    variety of media (sound, images, text) and to an increasing extent
    about sharing and archiving those media. To support and augment those
    communication services, we need some knowledge of what's "in" those
    media, which is exactly a machine perception task. Furthermore, if
    phones are to provide other services (besides communication) to users,
    they also need to interpret the user's intent through whatever
    interfaces the phone possesses. I already remarked on users' toils
    with phone menus and buttons, while at the same time the phone is a
    beautifully evolved speech platform. Speech interfaces do indeed look
    like a great choice. They continue to improve in performance, but the
    state of the art is much better than people realize.

    Until last year, like most HCI researchers, I was skeptical about the
    value of speech interfaces in HCI. But then I saw a Samsung phone
    (P207) shipping with large-vocabulary speech recognition and getting
    very good user reviews in all kinds of publications (including the
    hard-to-impress business market).

    I also taught a class on medical technologies and had a chance to meet
    with many caregivers. There is already a large speech industry in
    medicine, and it is widely seen as one of the key technologies moving
    forward (it has probably already eclipsed "office ASR" and is a
    significant part of the speech recognition industry overall).

    I had committed the cardinal sin of generalizing experience from a
    technology in one context (VUIs in the office) to its application in a
    different context. ...

    My only direct experience with speech interfaces was with the
    burgeoning automated call-center industry, which had been quite
    bad. But after learning more about the state of the art (Randy Allen
    Harris's Voice Interaction Design or Blade Kotelly's The Art and
    Business of Speech Recognition are excellent guides), I realized that
    there are many superb examples of voice interface design. It's a lot
    like Web sites and GUIs in the 1980s. The practice of human-centered
    user interface design was not widely known back then, but as the HCI
    discipline grew both in academia and industry, best practices
    spread. Products that didn't follow a good user-centered process were
    quickly displaced by competitors that did. There is an excellent set
    of user-centered design practices for speech interfaces that are very
    similar to the practices for core HCI. As yet, they aren't widely
    adopted, but the differences between systems that follow them and
    those that don't are so striking that this cannot last forever.

    It has also become clear that the recognition accuracy of the ASR part
    of the interface is not the limiting factor - it's the quality of the
    overall VUI design and the match of the application to its context. In
    other words, there's no reason to wait for future technical magic
    before using speech interfaces. You can write excellent ones now,
    assuming speech interaction fits your application context. (See the
    recent examples that appeared in the article "'Conversational' Isn't
    Always What You Think It Is" from Speech Technology Magazine,
    July/August 2003; http://www.speechtechmag.com.)

    After these epiphanies, I moved a significant amount of activity in my
    group to speech and dialog-based interfaces (i.e., started four new
    projects). While there are very good practices in speech interface
    design today and many useful services that can be built with them,
    there are still significant challenges and room for improvement. Those
    limits have to do with the shared understanding between a human and a
    machine sharing a speech interface. This is why speech interfaces are
    also a rich research area. Much of the shared information is the
    context we have already been talking about, and all of the
    aforementioned projects are coupled with our work on context-awareness
    (for more information, see my home page,
    http://www.cs.berkeley.edu/~jfc).

This is all very promising for reducing the cost of devices, since
high-resolution LCD screens are a substantial part of the cost of
current cellphones and computers, and keyboards are a substantial part
of their size (and, I suspect, good keyboards are a substantial part of
their cost).  Canny doesn't say much about the other direction of the
UI, though: the output.


