Return to 1995 VR Table of Contents
Klaus Böhm1, Jörg Zedler2, Volker Kühn1
1ZGDV - Computer Graphics Center
2Fraunhofer Institute for Computer Graphics
Wilhelminenstraße 7, D-64283 Darmstadt,
Federal Republic of Germany
Phone: +49 61 51 - 155 243, Fax: +49 61 51 - 155 480
The great potential of Virtual Reality is that the whole communication bandwidth, including gestures, facial expressions, gaze and speech, can be used to pass information from the user to the computer. However, research on interaction in virtual environments is just beginning. There is a need for sophisticated interaction techniques in combination with new interaction metaphors in order to make the full potential of VR technology available to different categories of users. It should also be taken into consideration that only the wide availability of toolkits will enable the development of 3D and virtual reality applications in a time- and cost-effective manner.
We developed a virtual environment toolkit called GIVEN which enables the construction of applications using multiple input channels. The concepts we realized have been validated by combining gesture and speech input, and several interaction techniques using these two input channels have been developed. We identified the prompt/feedback problem that arises when working with independent input streams. Our solution to this problem is presented.
1 Introduction and Motivation
Virtual Reality (VR), based on the integration of Computer Graphics, Multimedia and new I/O technologies, is a powerful paradigm for human-computer interaction. With VR it is possible for the visitor of synthetic worlds to use the whole natural interaction bandwidth, such as speech, gesture, gaze, facial/body expression and emphasis, as (s)he would in the real world. In this way the user's experience, acquired in everyday contact with people and things, can be applied to working with the computer. Using multiple input channels, and therefore having a broader interaction bandwidth, allows for a more precise and effective interaction with the computer. In addition, the visitor can choose the particular channel which is best suited for an interaction task.
In this paper we concentrate on the input channels gesture and speech. However, the concepts for the handling of multiple input streams are not limited to these input modalities.
Usually the decision of which input channel(s) to use depends essentially on the interaction task itself: if it is a so-called linguistic task, the user knows exactly the desired value and the most efficient way to fulfill that task is using natural speech. If it is a so-called spatial task, (s)he has an approximate idea of what (s)he wants, but does not know the discrete value. Because of the technical characteristics of the input channels, this interaction task is more gesture-based. For some other tasks it might be the most efficient and least error-prone way to use more than one channel in parallel.
When implementing a user interface for handicapped users, several additional aspects have to be taken into account. First, the user might not be able to use the technically most efficient input channel and therefore has to switch to another one which is convenient for him/her. Second, the available input channels might not have the precision and accuracy that they would have for non-handicapped users. Therefore special precautions for increasing the recognition accuracy for 'noisy' input have to be taken.
The implementation of a Virtual Reality application requires a lot of application-independent functionality such as navigation, rendering, input device drivers, collision detection and more. But it still has to be extendable for application-specific interaction or navigation techniques. The required flexibility leads to an open-architecture approach to interaction modelling which is embedded in a user interface toolkit. Using this toolkit, it is possible to develop 3D and VR applications in a time- and cost-effective way while still keeping the extendability.
In this paper, the platform of our development, the 3D user interface toolkit GIVEN (Gesture-based Interactions in Virtual Environments), will be described briefly. The concepts of the extendable input channel handling as well as the interaction modelling will be presented. Proof of concept will be shown by presenting several examples of multimodal interactions based on the input channels gesture and speech.
2 The 3D-User-Interface-Toolkit GIVEN
In this chapter we present the GIVEN system which we have been developing since 1992 [BöHüVä92]. It has already been used for the construction of the VR user interface of several applications (e.g. [WiSoBö94]). In order to make the results of our research available to application developers, new interaction techniques as well as their supporting mechanisms will be integrated into the GIVEN toolkit. This is also the reason why we have focused our research not only on the question of usefulness and feasibility of multimodal interaction, but also on the development of mechanisms for the construction of new interaction techniques in an easy and comfortable way.
2.1 Motivation for the Toolkit development
The connection of various input devices, the use of special functions of those devices (e.g. gesture input with a dataglove), the combined use of several input devices at once, the detection and handling of object collisions, and the simulation of simple behaviors for the purpose of facilitating natural interaction in virtual environments are all examples of problems that must be solved in the development of every interactive 3D application. Only through the use of toolkits which provide solutions to the problems named above can virtual environments and interactive 3D applications be developed in a time- and cost-effective manner.
2.2 Concepts and Architecture
The conceptual model of the 3D user interface toolkit GIVEN is described in the following diagram:
Ill. 1: The GIVEN architecture
The toolkit kernel consists of modules providing functionality for collision detection, rendering, input device handling and communication issues, which can be accessed by the application via the Interface Library. The kernel modules are realized as independent managers, each having its own copy of relevant data stored in its own format. These four managers can each be implemented as a separate process or grouped together into one process. Because the tasks are split up between the separate managers in this way, it is possible for one module to handle collision detection based upon an octree representation of the scene [SaWe88] and run on a fast 'number-cruncher', while the rendering module stores the same object data as a linked list for fast rendering and runs on a graphics workstation. The device drivers for the input devices are also implemented as independent processes, thus allowing applications to use input devices running on other machines. In this way, computationally intensive preprocessing of input, such as gesture recognition or the filtering of input values [Fel94], can also be distributed to other machines.
The toolkit components are as follows:
Output Manager: Responsible for the control of various output devices, especially realtime rendering of the scene and audio output.
Input Device Manager: Implements the connection to the input units and generates standardized system events out of the various types of input data.
Space Manager: Responsible for detection of collisions between individual graphical objects. The precision of the collision detection can be changed depending on the needs of the specific application. Note that in cooperative environments with several instances of the toolkit there exists only one Space Manager.
Dispatch Manager: Implements the communication between the toolkit instances, the managers, as well as the communication between the kernel and applications processes.
Interface Library: This module realizes the interface to the application. It includes functions for creating, manipulating, and deleting graphical objects. The Interface Library also contains an event handling mechanism, which collects, stores and distributes system events and informs the application via callbacks.
Interaction Library: Defines mechanisms for modelling and control of complex interactions. The Interaction Library is built from Interface Library functions.
Behavior Library: Collection of previously defined objects or object groups and their interaction behaviors. Examples could be switches or scaling units.
2.3 Current state of implementation
The current implementation of GIVEN is a reduced and simplified version of the concept described above. At the moment the system is only designed for single-user mode. The toolkit kernel as well as the application are implemented in one process. Because of this, the development of a Dispatch Manager has been postponed. Nevertheless, multi-user mode will be supported by the end of 1995 and all requirements for shared virtual environments will be provided.
2.4 Preprocessing of the input streams
Handling of "human oriented" input channels usually requires sophisticated preprocessing. In our case we focused on gesture and speech based input. Therefore we need advanced mechanisms in order to recognize the spoken words and gestures performed by the users. Since speech recognition is an area in which a lot of research has been conducted and usable results have been achieved based on several methods including Hidden Markov Models and neural networks, we did not try to develop our own recognition system, but rather used an available program. We currently use a single word recognition system which was made available to us by Silicon Graphics. The system has a vocabulary of about 200 words and the ability to define grammars in order to improve the overall recognition accuracy.
Hand gesture recognition, however, is a different situation. In the past years there have been no gesture recognition systems available on the market with robust, high quality recognition. Therefore we conducted research in this field and developed recognition methods based on using different types of neural nets for recognizing static and dynamic gestures [VäBö93][BöBrSo94]. For the recognition of static postures we achieved high quality recognition results in real time using backpropagation neural nets. Thus we were able to integrate the gesture recognition module into the input pipeline of the toolkit. In order to avoid overloading the graphics workstation, which is performing the rendering, we distributed the preprocessing tasks over several different machines.
2.5 Handling the input channels
As mentioned above, the toolkit kernel needs to be extendable because new input or output devices may be developed and have to be integrated or application specific extensions have to be made.
To provide this functionality we developed a homogeneous interface for all the data coming from the input devices as well as from the collision detector. By mapping this information to atomic data elements, so-called GIVEN events (GtEvents), it can be handled by the toolkit and interaction programmer in a common way. This is similar to the approach of the event-driven X Window System. All events are collected and distributed by using an event queue which is part of the Interface Library. In this way the occurrence of an event (e.g. a recognized spoken command or a collision between two objects) is decoupled from the handling of that event.
We can subdivide all GIVEN events into three subclasses:
input device events: All values coming from input devices are handled via input device events (such as 'GtCyberGloveEvent' or 'GtSpeechEvent'). To support a new input device only an additional event which maps the input device values to the event structure has to be defined.
output device events: In the same way, output device events hold all information which is necessary to render a scene and trigger the output.
internally generated events: Collision detection is part of the GIVEN kernel. The Space Manager checks for collisions and creates events like 'GtCollisionEvent' (two objects collide) or 'GtEnterAreaEvents' (an object enters a specific area).
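The decoupling of event occurrence from event handling described above can be illustrated with a minimal Python sketch. This is not the GIVEN implementation: only the names GtEvent, GtSpeechEvent and GtCollisionEvent come from the text, while the fields, the EventQueue class and its methods are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List, Type


@dataclass
class GtEvent:
    """Base class for all events; the timestamp field is an assumption."""
    timestamp: float


@dataclass
class GtSpeechEvent(GtEvent):
    word: str  # hypothetical field: the recognized word


@dataclass
class GtCollisionEvent(GtEvent):
    object_a: str  # hypothetical fields: names of the colliding objects
    object_b: str


class EventQueue:
    """Collects events and later distributes them to registered callbacks."""

    def __init__(self) -> None:
        self._queue: Deque[GtEvent] = deque()
        self._callbacks: Dict[Type[GtEvent], List[Callable[[GtEvent], None]]] = {}

    def register(self, event_type: Type[GtEvent],
                 callback: Callable[[GtEvent], None]) -> None:
        self._callbacks.setdefault(event_type, []).append(callback)

    def post(self, event: GtEvent) -> None:
        # Producers (device drivers, collision detection) only enqueue.
        self._queue.append(event)

    def dispatch(self) -> int:
        # Deliver queued events in arrival order; returns how many
        # events were taken from the queue.
        delivered = 0
        while self._queue:
            event = self._queue.popleft()
            for callback in self._callbacks.get(type(event), []):
                callback(event)
            delivered += 1
        return delivered


log = []
queue = EventQueue()
queue.register(GtSpeechEvent, lambda e: log.append(("speech", e.word)))
queue.register(GtCollisionEvent,
               lambda e: log.append(("collision", e.object_a, e.object_b)))
queue.post(GtSpeechEvent(timestamp=0.1, word="delete"))
queue.post(GtCollisionEvent(timestamp=0.2, object_a="hand", object_b="handle"))
delivered = queue.dispatch()
```

In this sketch, supporting a new input device would only require a further GtEvent subclass and a matching callback; the queue itself stays unchanged.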
The concept of events enables the use of an object-oriented interaction modelling mechanism, which was developed for the 2D user interface toolkit THESEUS++ [Hübn90].
In this approach interactions are hierarchical, dynamic data structures, linked together in trees, which consist of so-called basic interactions (the leaves of the trees) and complex interactions (the nodes). The management of the interactions (creation, activation, deactivation, ...) is done by the Interaction Library, which is based upon the Interface Library.
The basic interactions have a direct link to the GIVEN events. The complex interactions are simply logical operations like AND, OR, SEQUENCE or REPEAT, which define a special behavior regarding their subtrees. The subtrees can be either basic interactions (events) or again complex interactions.
In this way, a clear and easily extendable interface between the GIVEN kernel and the Interaction Library is available.
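A tree of basic and complex interactions can be sketched as follows. The operator names AND, OR and SEQUENCE are taken from the text; the class layout, the feed method and the event name strings are hypothetical, and REPEAT as well as activation and deactivation of subtrees are left out to keep the sketch short.

```python
class Basic:
    """Leaf: completes when an event with the expected name arrives."""

    def __init__(self, event_name: str) -> None:
        self.event_name = event_name
        self.done = False

    def feed(self, event_name: str) -> bool:
        if event_name == self.event_name:
            self.done = True
        return self.done


class And:
    """Node: completes once every child has completed, in any order."""

    def __init__(self, *children) -> None:
        self.children = list(children)

    def feed(self, event_name: str) -> bool:
        # Feed every child so that each one sees the event.
        results = [child.feed(event_name) for child in self.children]
        return all(results)


class Or:
    """Node: completes as soon as any child has completed."""

    def __init__(self, *children) -> None:
        self.children = list(children)

    def feed(self, event_name: str) -> bool:
        results = [child.feed(event_name) for child in self.children]
        return any(results)


class Sequence:
    """Node: children must complete one after another, left to right."""

    def __init__(self, *children) -> None:
        self.children = list(children)
        self.index = 0  # position of the child currently waiting for input

    def feed(self, event_name: str) -> bool:
        if (self.index < len(self.children)
                and self.children[self.index].feed(event_name)):
            self.index += 1
        return self.index == len(self.children)


# A subtree may be a leaf or another complex interaction.
delete = And(Basic("speech:delete"), Basic("gesture:confirm"))
grab = Sequence(Basic("enter_area"), Basic("gesture:grab"))
```

Because every node exposes the same feed method, a subtree can be replaced by a leaf or by another complex interaction without changing its parent.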
3 Multimodal Interaction
The term multimodal interaction means the usage of two or more independent input channels. Our work in this paper has focused on the combined use of speech and gesture. Several scientists have already conducted research in the area of multimodal interaction. Well known are Richard A. Bolt (MIT MediaLab, [Bolt80], [Bolt87]) with his "Put-That-There" paper and Alexander G. Hauptmann (Carnegie Mellon University, [Haup89]) who found that a combination of gestures and speech is preferable to pure speech or pure gesture based input.
3.1 Advantages of Multimodal Interaction
For multimodal interactions the following advantages can be identified:
Unburdening [Bolt87]: If only one input channel is used, then it must bear the load of all interactions. A telephone call (speech only) is one example where emphasising sentences via gestures is not possible. In this case, the user must explain verbally those things which would otherwise have been conveyed as gestures.
Redundancy [Bolt87]: By using two or more input channels in parallel, the inherent redundancy leads to a higher degree of recognition accuracy. Thus it is recommended for use in critical operations. Deleting an object, for example, may only occur when the delete command is given verbally at the same time as a confirming gesture is made. It is of special importance that by using two input channels, the error rate can be reduced to a level that is acceptable in an interactive environment.
Increased Information Efficiency: Because input data can be delivered by a chosen channel based on the characteristics of the data and the channel, multimodal interaction can improve the efficiency as well as the quality of the input.
3.2 Theoretical Aspects
From our point of view, when using multiple input channels some theoretical aspects have to be taken into consideration.
3.2.1 Input channel characteristics
The two input channels, speech and gesture, have different characteristics. Typically, speech recognition produces discrete events which occur when a term has been recognized. This is especially well suited to discrete commands and precise input, for example a command such as 'delete' or the input of numerical values. In contrast, our method of gesture recognition delivers a continuous input stream including the current hand posture, position and orientation. This continuous input is more suited to natural interaction techniques similar to those performed by a real hand on real objects.
These two different characteristics have to be taken into account during interaction modelling.
3.2.2 Combining Approaches
The two independent input channels can be combined differently. We propose the following two strategies:
* Parallel Approach:
In human-human communication, especially in noisy environments, the listener uses the body language of the speaker, as well as the spoken words, to better understand the content of the communication. Similar to this example, our approach uses both input channels to trigger an interaction task. The interaction is activated only if the speech recognizer as well as the gesture recognizer have created the respective events. The redundancy included in this approach leads to a higher degree of security, which is especially important for critical tasks, such as object deletion.
* Orthogonal Approach:
Using two input channels enables the user to perform and control different tasks at the same time. For example the user navigates through the scene by gesturing and opens/closes doors via spoken commands. This orthogonal approach can also be applied to interaction techniques which can be modified during execution by additional user input. The interaction itself is then triggered by one channel while the parameters are given by the second channel, e.g. during gesture-controlled navigation, spoken commands might alter the speed.
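The parallel approach can be sketched as a small trigger that fires only when both recognizers have delivered matching events. The text only says that both events must occur; the time window used here to decide that they belong together, and all class, method and parameter names, are assumptions made for illustration.

```python
from typing import Optional


class ParallelTrigger:
    """Fires only when a speech event and a gesture event agree in time."""

    def __init__(self, command: str, gesture: str, window_s: float = 1.0) -> None:
        self.command = command
        self.gesture = gesture
        self.window_s = window_s  # assumed maximum gap between the two inputs
        self._speech_time: Optional[float] = None
        self._gesture_time: Optional[float] = None

    def on_speech(self, word: str, t: float) -> bool:
        if word == self.command:
            self._speech_time = t
        return self._check()

    def on_gesture(self, gesture: str, t: float) -> bool:
        if gesture == self.gesture:
            self._gesture_time = t
        return self._check()

    def _check(self) -> bool:
        if self._speech_time is None or self._gesture_time is None:
            return False
        if abs(self._speech_time - self._gesture_time) > self.window_s:
            return False
        # Consume both inputs so that one pair triggers the task only once.
        self._speech_time = None
        self._gesture_time = None
        return True


trigger = ParallelTrigger(command="delete", gesture="confirm", window_s=1.0)
first = trigger.on_speech("delete", t=10.0)      # speech alone: no trigger
second = trigger.on_gesture("confirm", t=10.4)   # both within the window
```

In the orthogonal approach, by contrast, each channel would feed its own handler independently, so no such pairing step is needed.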
3.2.3 Usage of a single input channel
The availability of multiple input channels has the advantage of combining them in the ways mentioned above. Another advantage is that the user can choose the input channel which is most convenient to him or her. This is especially important if some input channels are not available due to a physical handicap of the user. Applications that support both single and multimodal interaction will appeal to a wider range of users. However, it should be noted that interaction techniques that use just one type of input are generally not as intuitive and efficient as multimodal interactions.
3.3 Prompt and Feedback
Typically, interactive systems provide a prompt to show what kind of action is next required from the user. As confirmation of the interaction the user receives feedback. For example, when a new button appears, the user knows that the button pressing interaction is possible, and when the button's shape changes, the user knows that it has been pressed. The use of two different input channels leads to specific requirements for prompt and feedback in order to show which input channel is currently active and which user input is possible at the moment. Speech input, in our case, has a very restricted vocabulary, about which the user must somehow be informed. The development of prompts and feedback for both input channels, either one for each or one for both, is an important issue in interaction modelling.
3.3.1 The Interface Agent 'James'
To solve the aforementioned problems, we use the interface agent concept. The idea of using an interface agent to simplify the human-machine communication, and especially to serve as a bridge between the inaccuracy of human behavior and the machine-enforced command precision, is discussed in [Laur91] and [Tho93].
The interface agent used in our project has the appearance and behavior of a butler. He appears when called for and remains as long as his services are desired. The butler can either stand in one place (fixed position relative to other objects) or he can follow the user, always remaining on the left edge of the display (fixed position relative to user's point of view). The user can toggle between these two modes with the commands "Keep_Position" and "Follow_Me". The prompt and feedback mechanism for the speech input is implemented by a tray carried by the butler. The shape of the objects on the tray tells the user what the current interaction context is, i.e. which commands will be accepted and what actions they will perform. For example, if there is a color cube on the tray, then the user knows that the color selection function is activated and that a color selecting command can be spoken (see Plates 2 + 3).
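The tray-based prompt can be read as a mapping from the current interaction context to a displayed object and an accepted vocabulary. The following sketch takes the commands "Keep_Position" and "Follow_Me", the color cube and the color names RED and YELLOW from the paper; the "idle" context, the initial mode, the rule that the two butler commands are accepted in every context, and all identifiers are assumptions.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class Context:
    tray_object: str            # what the butler shows on the tray
    vocabulary: FrozenSet[str]  # spoken commands accepted in this context


# Assumption: the butler's own commands are valid in every context.
ALWAYS_ACCEPTED = frozenset({"Keep_Position", "Follow_Me"})

CONTEXTS: Dict[str, Context] = {
    "idle": Context("empty tray", frozenset()),  # hypothetical context
    "color": Context("color cube", frozenset({"RED", "YELLOW"})),
}


class Butler:
    def __init__(self) -> None:
        self.context = "idle"
        self.follows_user = False  # assumed initial mode: fixed position

    def set_context(self, name: str) -> str:
        # Look up first so that an unknown context leaves the state unchanged.
        tray_object = CONTEXTS[name].tray_object
        self.context = name
        return tray_object

    def accepts(self, word: str) -> bool:
        return word in ALWAYS_ACCEPTED or word in CONTEXTS[self.context].vocabulary

    def hear(self, word: str) -> bool:
        """Returns True if the word is valid in the current context."""
        if not self.accepts(word):
            return False
        if word == "Follow_Me":
            self.follows_user = True
        elif word == "Keep_Position":
            self.follows_user = False
        return True


butler = Butler()
rejected = butler.hear("RED")         # no color cube on the tray yet
shown = butler.set_context("color")   # tray now shows the color cube
accepted = butler.hear("RED")
```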
4 Examples of Multimodal Interactions
Several interaction techniques which we have developed are presented below. The scaling interaction is used as an example to show the use of the embedded interaction modelling mechanism.
4.1 Interaction modelling
In the interaction model used within GIVEN the whole interaction bandwidth with all facilities is represented by interaction trees. A dialog is modelled by a dynamic activation and deactivation of subtrees, which is done based upon the occurrence of GIVEN events, that means based upon user input and
The following interaction tree, which is used for scaling, is an example of interaction modelling. An object in the scene can be scaled using either the dataglove and 3D handles or via speech input. The complex interaction Scale-IO (A) can manage both the gesture-based Scale-by-handle-IO (C) and the speech-based Scale-by-speech-IO.
Ill. 2: scale interaction modelling
If the scaling is spatial, the user typically uses gestures. For this the basic interaction (B) waits for an enter area event of the objects hand and handle. If the hand enters the area of the handle, the scale interaction (C) (which is a complex interaction of the REPEAT class) is activated. If the gesture recognition system NeuroGlove now recognizes the GRAB gesture, an appropriate event is generated (D) and the scale interaction starts. The scaling of the object is done repeatedly every time a new 3D position comes from the dataglove tracking system (E). The scaling ends if NeuroGlove recognizes any gesture other than GRAB (F). An example of the Scale-by-handle-IO can be found in Plate 1.
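The flow from (B) to (F) can be sketched as a small state machine. The paper does not say how the scale factor is derived from the 3D position; the rule used here (ratio of the current hand distance from the object center to the distance at the first position after the grab) is purely an assumption, as are all identifiers and state names.

```python
import math
from typing import Optional, Tuple

Vec3 = Tuple[float, float, float]


class ScaleByHandle:
    def __init__(self, center: Vec3) -> None:
        self.center = center
        self.state = "WAIT_ENTER"  # (B): waiting for the hand to enter the handle area
        self.grab_distance: Optional[float] = None
        self.scale = 1.0

    def on_enter_area(self) -> None:
        if self.state == "WAIT_ENTER":
            self.state = "WAIT_GRAB"  # (C) is activated and waits for (D)

    def on_gesture(self, gesture: str) -> None:
        if self.state == "WAIT_GRAB" and gesture == "GRAB":
            self.state = "SCALING"    # (D): the scale interaction starts
        elif self.state == "SCALING" and gesture != "GRAB":
            self.state = "DONE"       # (F): any other gesture ends the scaling

    def on_position(self, hand: Vec3) -> None:
        if self.state != "SCALING":
            return                    # positions are ignored outside (E)
        distance = math.dist(self.center, hand)
        if self.grab_distance is None:
            if distance > 0:
                self.grab_distance = distance  # reference distance at grab time
        else:
            self.scale = distance / self.grab_distance  # (E): repeated update


scaler = ScaleByHandle(center=(0.0, 0.0, 0.0))
scaler.on_position((1.0, 0.0, 0.0))  # ignored: hand has not grabbed yet
scaler.on_enter_area()
scaler.on_gesture("GRAB")
scaler.on_position((2.0, 0.0, 0.0))  # reference distance
scaler.on_position((3.0, 0.0, 0.0))  # scale becomes 3.0 / 2.0
```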
If the user wants to scale an object by a precise factor, the Scale-by-speech-IO is more efficient. For example, by saying "Scale Number Zero Point Seven Five Okay" an object will be scaled by a factor of 0.75 (Plate 2). RGB color definition is another example of the power of multimodal interaction. Colors which can be identified by name can be selected via speech (e.g. RED or YELLOW); other colors can be specified by gesture-based positioning within the RGB cube (Plate 3).
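How a sequence of recognized single words could be turned into a scale factor is sketched below. Only the example utterance "Scale Number Zero Point Seven Five Okay" comes from the paper; the grammar assumed here (the words Scale and Number, then digit words with at most one Point, then Okay) and the function name are hypothetical.

```python
from typing import List, Optional

DIGITS = {
    "Zero": "0", "One": "1", "Two": "2", "Three": "3", "Four": "4",
    "Five": "5", "Six": "6", "Seven": "7", "Eight": "8", "Nine": "9",
}


def parse_scale_command(words: List[str]) -> Optional[float]:
    """Returns the spoken scale factor, or None if the words do not match."""
    if len(words) < 4 or words[:2] != ["Scale", "Number"] or words[-1] != "Okay":
        return None
    text = ""
    seen_point = False
    for word in words[2:-1]:
        if word == "Point" and not seen_point:
            seen_point = True
            text += "."
        elif word in DIGITS:
            text += DIGITS[word]
        else:
            return None  # unknown word or a second "Point"
    if not any(ch.isdigit() for ch in text):
        return None      # a lone "Point" is not a number
    return float(text)


factor = parse_scale_command("Scale Number Zero Point Seven Five Okay".split())
```

Requiring the closing word Okay before a value is returned mirrors the confirmation at the end of the example utterance.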
5 Summary and Conclusion
In this paper we focused on the handling of multiple independent input channels for the development of interaction techniques used in virtual reality applications.
The object-oriented interaction model which we used in our research and integrated into our VR toolkit was described briefly and its functionality was explained based on an example. The theoretical issues we introduced as well as the examples we showed are based on, but not limited to, the interaction modalities gestures and spoken commands. The specific requirements for prompt and feedback when using multiple independent input channels were addressed and a solution with an interface agent was proposed.
The practical results of the use of multimodal interactions and the interface agent are very satisfying. The quality as well as the quantity of the interactions was improved compared to the pure gesture-based interaction which we used in the past. The homogeneous input interface and the interaction model enable an easy and flexible way of constructing interaction techniques based on multiple input channels.
6 Future Work
Concerning the user interface toolkit GIVEN, we are currently developing a cooperative version which allows the construction of shared virtual environments. This places further demands upon the construction of new interaction techniques.
The results we achieved with multimodal interactions will also be applied to our augmented reality test environment where we are working on telepointing in stereoscopic video images.
For the future we propose to use further input channels for the development of new interaction techniques. Examples are gaze, facial expression and vocal emphasis. However, in our opinion eye tracking will be the most challenging input stream.
[BöBrSo94] Böhm, Klaus; Broll, Wolfgang; Sokolewicz, Michael A.: "Dynamic Gesture Recognition Using Neural Networks; A Fundament for Advanced Interaction Construction". In: Fisher S. Merrit J., Bolan M. (Eds.): Stereoscopic Displays and Virtual Reality Systems; SPIE Conference Proceedings. Vol. 2177, San Jose, USA, Feb.1994.
[BöHüVä92] Böhm, Klaus; Hübner, Wolfgang; Väänänen, Kaisa: GIVEN: Gesture Driven Interactions in Virtual Environments - A Toolkit Approach to 3D Interactions; Interfaces to Real and Virtual Worlds Conf.; Montpellier, France; March 1992.
[Bolt80] Bolt, Richard A.: "Put-That-There": Voice and Gesture at the Graphics Interface; ACM Computer Graphics, Vol. 14, No. 2, 1980, pp. 262-270.
[Bolt87] Bolt, Richard A.: New Directions in Multi-Modal Interface Design; Tutorial at ACM SIGCHI + GI 1987.
[Haup89] Hauptmann, Alexander G.: Speech and Gesture for Graphic Image Manipulation; ACM SIGCHI 1989 Conference Proceedings, pp. 241-245.
[Hübn90] Hübner, Wolfgang: Entwurf Graphischer Benutzerschnittstellen; PhD, Springer Verlag, Germany, 1990.
[Laur91] Laurel, Brenda: Interface Agents: Metaphors with Character; in: The Art of Human-Computer-Interface Design; Ed. Brenda Laurel, 3rd printing, Jan. 1991, Addison Wesley, ISBN 0-201-51797-3, pp. 355-365
[SaWe88] Samet H., Webber R. E.: "Hierarchical Data Structures and Algorithms for Computer Graphics". IEEE Computer Graphics & Applications, May 1988.
[Tho93] Thorisson, Kristinn R.: "Dialogue Control in Social Interface Agents", InterChi Adjunct Proceedings 1993, Conference on Human Factors in Computing Systems, Amsterdam, April 1993, pp. 139-140.
[VäBö93] Väänänen, Kaisa; Böhm, Klaus: Gesture Driven Interaction as a Human Factor in Virtual Environments - An Approach with Neural Networks; Proc. of "Virtual Reality Systems" conf., British Computer Society, May 1992.
[WiSoBö94] Wirth, Hanno; Sokolewicz, Michael A.; Böhm, Klaus; John, Werner: "MuSE - Using VR in System Development and Validation". In: Earnshaw R., Jones H., Vince J. (Eds.): Virtual Reality Applications, Proceedings, British Computer Society, Leeds, UK, June 1994.