1994 VR Conference Proceedings


Teaching, Technology and the Human Spirit: Exploring Access to the Virtual Campus

"Simon Says": Using Speech to Perform Tasks in Virtual Environments

By: Teresa Middleton, Program Manager
Instructional Technology
SRI International
333 Ravenswood Avenue
Menlo Park, CA 94025
Phone: 415-859-3382
Fax: 415-859-2861
Email: middleton@sri.com

Duane Boman, Research Engineer
SRI International
333 Ravenswood Avenue
Menlo Park, CA 94025
Phone: 415-859-4269
Fax: 415-859-5984
Email: boman@sri.com


In this paper we discuss a specific technology, speech recognition, as an access device for virtual environments. Before we take up the technical elements of this application, I would like to say a few words about access to virtual reality technology in terms of "universal design." Virtual reality is a technology still in its infancy, and I believe this is a good time to be talking about making virtual environments available to as many people as want to be involved in this exciting technology -- to make sure we are thinking in terms of universal access.

Universal design is one of the newest, and probably the easiest to embrace, buzzwords in the world of technologies and disabilities. Universal design means that when a product (or service) is at the design stage, consideration should be given to making it, insofar as is possible, usable by everyone regardless of ability. Universal design makes good business sense, since if by some aspect of design a manufacturer can extend its set of potential buyers, it surely is worth including that aspect. Universal design goes toward answering the question of how we can provide universal access-without necessarily retrofitting existing products or developing specific devices for persons with disabilities.

Universal design, of course, is rarely completely achievable. Products that try to be all things to all people end up being of little use to anyone. The point is to give consideration to the concept during product development, so that the end product addresses the needs of the largest possible set of people and fewer people are excluded from consideration. Since this makes good marketing sense, there seems little reason not to think about it.

Sometimes manufacturers have stumbled on what might be considered good universal design by making a product for a specific set of people that just happens to be useful to a much larger set. Telephones with huge letters and numbers were designed for people with poor eyesight and targeted toward the elderly -- but those large buttons turn out to be very popular with all kinds of people, simply because they are a lot easier to hit than the little ones. Captioned videos have become very useful tools not only for the hearing impaired, for whom they were designed, but for people learning a language.

Virtual reality relies heavily on "tricking" a person's brain into accepting a simulated world as a real one. It supports immersion in the virtual environment through the user's own senses: sight, hearing, touch, movement. The technology supports learning and understanding in new ways. A well-known example of an existing application that illustrates this is the (Japanese) kitchen design world where consumers can design their own kitchens, making choices through design maneuvers they make in a virtual world. They can move, pick up, and examine objects, and move themselves around in the virtual environment in order to see things from all angles, as they make their choices.

But access to objects, movement, and direction in virtual environments is at present limited to use of the hands and to physically turning and shifting. Today's common access devices are the dataglove, trackball, and joystick, all of which make use of the hand; less common are devices (exemplified by the Biomuse) that use other muscle movements -- in the forearm, around the eyes, and so on. Tracking devices record physical location and movement. All these devices assume use of the hands or muscles, some vision, and some movement.

If we take the universal design model, we should now consider users who have restricted or no use of hands, poor muscle control, or poor vision, and consider access modes that will make virtual environments available to them. Missing so far from the set of access devices in virtual reality is speech. Because of the complexities involved in speech, practical applications for speech as an access device (outside of virtual reality) only started appearing in the past few years, although research on the topic has been going on for more than a quarter of a century -- since the birth of computing. Recognizing words or phrases spoken by individual users, and using these to form understandable commands to a computer, poses significant problems. For decades substantial funds have been invested by government and corporate sponsors to help solve these problems. Now we are beginning to see the results of this investment in a variety of fields: in the military, in surgical and other healthcare settings, in the auto industry, and so on. Much of the application development is in support of what are called "hands busy" needs: when a task needs to be done but the hands are busy performing another part of the activity, some other way has to be found to do the task.

In the interests of universal design, we would like to see the results of this research extended to virtual environment research. Why do I say "in the interests of universal design"? While speech recognition will widen accessibility to virtual worlds -- allowing users who cannot use currently available devices to move and manipulate objects in a virtual environment -- it will also enhance the experience for everyone else. Moving around a virtual environment by pointing a finger is not intuitive behavior -- how many people have you seen fly out of the limits of the virtual world and lose themselves? It happens all the time. Speech seems much more in tune with the way we normally work. Being able to say "go back" and return to where you got lost is much easier than figuring out whether to point up or down, cock a thumb, and so on.

The universal design concept also comes into play when we consider how many other speech-activated, or speech recognition, applications are already available (or are in the product development stage) for people with disabilities in the real (non virtual) environment. People use speech to turn appliances on and off, dial telephone numbers, use word processing packages and program computers. So moving speech access into virtual environments is a natural step.

Described below is a prototype, not a completed application. It was constructed to help visualize what the actual application would be like, how it would best be used, and so on.

Technical Discussion: Integration of Speech Recognition into a Virtual Environment

A broad range of uses for speech recognition in virtual environments can easily be envisioned, but the actual utility of this communication medium has many constraints. In particular, the limitations imposed by virtual environment technologies will affect the use of speech recognition. Our initial work to integrate speech recognition into a virtual environment was aimed at learning about the best uses and limitations of this communication channel. We chose a virtual environment application we were familiar with, an architectural walkthrough, as we already knew the uses and limitations of other communication media in this setting.

Speech Recognition System

For speech recognition, we used the DECIPHER (TM) system that has been developed by the Speech Research and Technology Program at SRI. DECIPHER has a vocabulary of over 2,000 words. It is speaker independent and does not need to be retrained or recalibrated for each system user. It also recognizes continuous speech, so the system user is not constrained to single word commands. DECIPHER is also highly robust, maintaining high accuracy with a variety of microphones and background noises. The software runs on UNIX workstations from Sun and Silicon Graphics.

We also used the Application Development Toolkit (ADT), recently developed by the Speech Program at SRI for application-specific customization of speech recognition capabilities. This toolkit allows application developers to easily incorporate speech recognition in their applications. The application developer prepares a "grammar specification file" that includes the types of phrases expected in the application. The ADT then creates a reduced speech recognition package from its vocabulary of 160,000 words and phrases. An Application Programming Interface then provides function calls to the application program from the speech recognition package.
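The grammar specification file format itself is not reproduced here, but the underlying idea of expanding a small set of phrase templates into a reduced command vocabulary can be sketched as follows. This is a minimal illustration in Python; the template syntax and names are our own, not DECIPHER's.

```python
# Illustrative sketch only: expand phrase templates (our own ad hoc
# syntax, not DECIPHER's grammar file format) into the full set of
# phrases a reduced recognition package would need to cover.
from itertools import product

def expand(template):
    """Expand a template like 'fly [forward|backward]' into phrases."""
    parts = []
    for token in template.split():
        if token.startswith("[") and token.endswith("]"):
            parts.append(token[1:-1].split("|"))   # alternatives
        else:
            parts.append([token])                  # literal word
    return [" ".join(p) for p in product(*parts)]

templates = ["fly [forward|backward] [quickly|slowly]",
             "make [transparent|solid]"]
phrases = [p for t in templates for p in expand(t)]
# 'fly' yields 4 phrases, 'make' yields 2, for 6 in all.
```

A real grammar compiler would also attach semantics to each phrase; here we only enumerate the surface forms.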

Virtual Environment System

An RB2 (TM) system from VPL Research, Inc. was at the core of this virtual environment application. The RB2 includes a Macintosh IIfx workstation that maintains the virtual environment and coordinates communication among interface devices and other computers. Two SGI VGX workstations provided the visual images for the two eyes. Two additional computers were used in this application: a Zenith 386/20 that included a Convolvotron (TM) for 3-D sound, and a Sun workstation that performed speech recognition. The five computers communicated over Ethernet. The virtual environment system also included a Virtex CyberGlove (TM) for gesture-based commands, a VPL Eyephone (TM) with stereo speakers, Polhemus magnetic trackers on the helmet and glove, and a wireless microphone.

The virtual environment, developed in RB2 Swivel (TM), was a model of the main lobby of our building at SRI and the hallways near our laboratory. Body Electric (TM) was used for programming the virtual environment dynamics and running the simulations. This system proved very useful for prototyping applications.

Speech Commands

Our initial virtual environment speech recognition package included 45 words that were used to form 22 commands. The commands fell into five major categories (Table 1). With these commands, the virtual environment participant could select and manipulate objects, change object attributes, change the state of the virtual world, navigate through the virtual world, and ask a few simple questions.

We also needed to select a word to indicate to the computer when it was being spoken to so that it would not respond every time the virtual environment user spoke. Essentially, we needed to give the computer a name. This name needed to be both unique and easily understood by the computer. After considerable deliberation, the name "Simon" was chosen. To give a command to the computer, the user begins the sentence with "Simon...".
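The wake-word convention can be sketched as follows. This is a minimal illustration in Python; the function name and the text normalization details are assumptions, not part of the DECIPHER interface.

```python
# Illustrative sketch: gate recognized utterances on the computer's
# name so that ordinary conversation in the lab is ignored.
def parse_utterance(text, wake_word="simon"):
    """Return the command portion if the utterance addresses the
    computer by name, else None."""
    words = text.lower().replace(",", " ").split()
    if not words or words[0] != wake_word:
        return None               # not addressed to the computer
    return " ".join(words[1:]) or None   # name alone is not a command
```

For example, `parse_utterance("Simon, fly forward slowly")` yields the command `"fly forward slowly"`, while speech that does not begin with the name is discarded.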

Table 1: Virtual Environment Speech Recognition Commands

Command Category: Object Selection and Manipulation
    Select [this] [that]
    Grab [this] [that]
    Bring it here
    Move that there
    Tree [up] [down] [left] [right]

Command Category: Object Attributes
    Make [transparent] [solid]

Command Category: Virtual Environment States
    Calibrate [eyes] [hand]

Command Category: Navigation
    Fly [forward] [backward] [quickly] [slowly]
    Move [up] [down] [left] [right] [forward] [backward]
    Take me there

Command Category: Questions
    What is [this] [that]?
    Where am I?

Initial Experiences With Speech in a Virtual Environment

Certain categories of commands were found to be very useful, whereas others were not. One of the first things we learned was that feedback was needed from the computer to indicate to the user whether or not the computer understood a command. Computer-generated speech was used for these responses. For example, if the user said "Simon, fly forward slowly", the computer would say "Flying forward slowly". If the computer did not understand a command it would say "Sorry, I did not understand". It also became apparent that providing redundant command modes was preferable to having only one method for inputting a command.
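The feedback behavior described above can be sketched as follows. This is a hypothetical illustration in Python; the verb table and the mapping from command verbs to progressive forms are our own assumptions, not the system's actual logic.

```python
# Illustrative sketch: echo an understood command back to the user in
# progressive form, or apologize when the command is not recognized.
# The verb table below is an assumed example, not the full command set.
KNOWN = {"fly": "flying", "move": "moving", "grab": "grabbing"}

def respond(command, known_commands=KNOWN):
    """Compose the spoken feedback for a parsed command string."""
    if command is None or command.split()[0] not in known_commands:
        return "Sorry, I did not understand"
    verb, *rest = command.split()
    return " ".join([known_commands[verb]] + rest).capitalize()
```

So `respond("fly forward slowly")` produces the confirmation "Flying forward slowly", and anything outside the verb table produces the apology.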

Among the most useful commands were those for selecting and manipulating objects. When the virtual environment user held their gloved hand flat, a ray would emanate from their index finger. This ray could be used to point at distant objects. Commands such as "Simon, select that" and "Simon, bring it here" could then be used for manipulating objects. The user could use the "grab" command to hold onto selected objects while moving around the virtual environment. The user could also grab objects using a hand gesture. The hand gesture was the preferred mode of grabbing an object for short periods, whereas the spoken command was preferred when the user wanted to keep the object for a while.
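One simple way to implement ray-based selection of this kind is to pick the object whose center lies nearest to the pointing ray. The following is a simplified sketch in Python; the actual system's selection geometry may well differ.

```python
# Illustrative sketch: choose the object whose center is closest to a
# ray cast from the pointing finger. Objects are (name, center) pairs.
import math

def select_along_ray(origin, direction, objects):
    """Return the name of the object nearest the ray, or None."""
    norm = math.sqrt(sum(d * d for d in direction))
    d = [c / norm for c in direction]          # unit ray direction
    best, best_dist = None, float("inf")
    for name, center in objects:
        v = [c - o for c, o in zip(center, origin)]
        t = sum(vi * di for vi, di in zip(v, d))   # projection onto ray
        if t < 0:
            continue                            # behind the user: ignore
        closest = [o + t * di for o, di in zip(origin, d)]
        dist = math.dist(closest, center)       # distance off the ray
        if dist < best_dist:
            best, best_dist = name, dist
    return best
```

A production system would intersect the ray with object geometry rather than centers, but the center-distance test captures the interaction style.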

The commands to change the state of the virtual environment were also very useful. Calibrating the glove and the user's interocular distance could be accomplished with keyboard commands. However, once inside the virtual environment, these functions were best accomplished through spoken commands.

The least useful commands were those for navigating through the virtual environment. When given the command "Simon, fly forward", the user would begin moving through the virtual environment in the direction in which the gloved hand was directed. Flying could also be accomplished with a pointing gesture. With the spoken command, the user needed to say "Simon, stop" to halt, whereas, when using the hand gesture, a simple gesture change would quickly halt flying. Delays between the time that "stop" was said and the actual cessation of flying caused difficulties. It was also difficult to control flying speed with spoken commands.

In general, speech is very useful for discrete commands but not for nondiscrete events such as flying. For example, the command "Simon, take me there" was used to move the user to within a few feet of a selected object. This type of command provided better navigation control than spoken commands for flying. Similarly, the spoken commands for selecting and changing objects produced discrete events which were very likely to produce the specifically desired change in the virtual environment.
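A discrete "take me there" move can be sketched as follows. This is an illustrative Python fragment; the three-foot standoff is an assumed value for "within a few feet."

```python
# Illustrative sketch: a discrete navigation event, in contrast to
# continuous flying. The user jumps to a point a fixed standoff short
# of the selected object, along the user-to-target line.
import math

def take_me_there(user_pos, target_pos, standoff=3.0):
    """Return the user's new (x, y, z) position near the target."""
    gap = math.dist(user_pos, target_pos)
    if gap <= standoff:
        return user_pos                       # already close enough
    f = (gap - standoff) / gap                # fraction of the gap to cover
    return tuple(u + f * (t - u) for u, t in zip(user_pos, target_pos))
```

Because the whole move is a single event, there is no stop command to time and no speed to regulate, which is exactly why it controlled better than spoken flying.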

Another command that was found to be very useful was "Simon, where am I?" In response to this command, the computer would slowly move the person above the virtual environment, mark the position in the model where the user had been standing with a flashing red 'X' and tell the user to "Look down". People frequently get lost within virtual environments; providing this 'God's eye view' was very useful for overcoming this problem. The user could then return to their previous location by saying "Simon, done".
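The "where am I?" behavior can be sketched as a small state holder. This is illustrative Python; the lift height and method names are assumptions, not the system's actual interface.

```python
# Illustrative sketch: remember the user's position, raise the
# viewpoint for a God's eye view, and restore the position on "done".
class GodsEyeView:
    def __init__(self, height=30.0):
        self.height = height   # assumed lift, in the model's units
        self.saved = None

    def where_am_i(self, pos):
        """Lift the viewpoint; the flashing 'X' marks self.saved."""
        self.saved = pos
        x, y, z = pos
        return (x, y + self.height, z)

    def done(self):
        """Return the user to the marked position."""
        pos, self.saved = self.saved, None
        return pos
```

The essential point is that the overview is a reversible, discrete state change: the saved position makes "Simon, done" a one-word path back.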


What we have described is a proof-of-concept experiment. We have shown that, with a relatively small set of commands, speech is a powerful device for accessing, "holding", and moving objects in the virtual environment, and for verifying user location. It is less effective, at present, for moving around the environment. In the interests of universal design, we should consider ways to design virtual environments that can better take advantage of speech commands, minimizing the ambiguities inherent in language. Also, providing methods to easily attach spoken names to objects would simplify the task of creating vocabularies for each application.

This research was not conducted with a specific model of a user in mind, but was intended to increase the options available to all users. We believe it has particular relevance for persons with disabilities, who are, perhaps, unable to use a dataglove or joystick, have limited muscle control, or for any other reasons find existing access devices insufficient for their cyberspace visits.


Reprinted with author(s) permission. Author(s) retain copyright.