1995 VR Conference Proceedings


Audio Windows for Synchronous and Asynchronous Conferencing

Michael Cohen
Human Interface Lab
University of Aizu
965-80 Japan
voice: [+81](242)37-2537
fax: [+81](242)37-2549
email: mcohen@u-aizu.ac.jp


Spatial sound is the presentation of audio channels with positional attributes. DSP-synthesized spatial sound, driven by even a simple positional database, can denote directional cues useful to a sight-impaired user. "Augmented reality" describes hybrid presentations that overlay computer-generated imagery on top of real scenes.

Augmented audio reality extends this notion to include sonic effects, overlaying artificially spatialized sounds on a natural environment. MAW (acronymic for multidimensional audio windows) is a NextStep-based audio windowing system deployed as a binaural directional mixing console, capable of presenting such augmented audio reality spatial sound cues. By associating spatialized sound with natural directions, sight-impaired users can leverage intuitive mental spatial models to identify sound sources and segregate audio streams. Applications of audio windows to asynchronous communication (like voicemail) or synchronous applications (like distributed realtime groupware) generalize traditional telephone answering machines and teleconferencing. Rotating and non-omnidirectional sources and sinks allow selective attention, and motivate deployment with extensions like a chair tracker or a hemispherical speaker array, which allow soundscape stabilization.

Keywords: audio windows, audio telecommunication, binaural directional mixing console, CSCW (computer-supported collaborative work), conferencing, groupware, GUI (graphical user interface), multimedia email, sound field control, sound localization, spatial sound.


It is important to exploit sound as a vital communication channel for computer-human interfaces. Audio windowing is conceived of as a frontend, or user interface, to an audio system with a spatial sound backend. This paper surveys the ideas underlying audio windowing and describes a system investigating asynchronous applications of these ideas. Features of a GUI (graphical user interface) can be extended to support an audio windowing system, driving a spatial sound backend. Besides the reinterpretation of WIMP (window/icon/menu/pointing device) conventions to support audio window operations for synchronous sessions like teleconferences, extra features can be added to support asynchronous operations like voicemail. After tracing some underlying technology of audio imaging in computer-human interfaces, this paper describes an audio windowing prototype, "MAW" (acronymic for multidimensional audio windows), an exocentric graphical mouse-based interface based on an extended model of free-field planar spatial sound.

Audio Windows

"Audio windows" is an auditory-object manager, a user interface ("frontend") to a spatial sound system. By creating psychoacoustic effects (usually with DSP, digital signal processing), many scientists are developing ways of generating and controlling this multidimensional sound imagery [Wenzel et al., 1988a] [Wenzel et al., 1988b] [Begault and Wenzel, 1992] [Kendall et al., 1990] [Loomis et al., 1990] [Wenzel et al., 1990] [Wenzel et al., 1991] [Middlebrooks and Green, 1991] [Wenzel, 1992].

The goal of a sound spatializer is to create the impression that the sound is coming from different sources and different places, just as one would hear "in person." Spatial hearing can be stimulated by assigning each source a virtual position with respect to a sink and simulating the auditory positional cues. (Since the word "speaker" is overloaded, meaning both "loudspeaker" and "talker," we use "source" to mean either: a logical sound emitter. Similarly, "sink" is used to describe a virtual listener, a logical sound receiver.) A display based on this technology exploits the human ability to quickly and subconsciously localize sound sources.
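As a rough illustration of one such positional cue, the sketch below estimates the interaural time difference (ITD) for a source at a given azimuth, using Woodworth's spherical-head approximation. This is a textbook formula, not the HRTF-based processing of any particular spatializer; the head radius, the speed of sound, and the function name `itd_seconds` are assumptions chosen for illustration.

```python
import math

HEAD_RADIUS_M = 0.0875   # assumed average head radius, in meters
SPEED_OF_SOUND = 343.0   # assumed speed of sound in air, in m/s

def itd_seconds(azimuth_deg):
    """Estimate the interaural time difference for a distant source.

    Azimuth is measured clockwise from straight ahead
    (0 = front, 90 = right, -90 = left, 180 = behind).
    A positive result means the sound reaches the right ear first.
    """
    # Wrap the azimuth into [-180, 180) and convert to radians.
    az = math.radians((azimuth_deg + 180.0) % 360.0 - 180.0)
    lateral = abs(az)
    # In this model the rear hemisphere mirrors the front one.
    if lateral > math.pi / 2:
        lateral = math.pi - lateral
    # Woodworth's approximation: (r / c) * (theta + sin(theta)).
    itd = HEAD_RADIUS_M / SPEED_OF_SOUND * (lateral + math.sin(lateral))
    return math.copysign(itd, az)

if __name__ == "__main__":
    for azimuth in (0, 30, 90, 150, -90):
        print(f"{azimuth:+4d} deg -> {itd_seconds(azimuth) * 1e6:+7.1f} microseconds")
```

Because the rear hemisphere mirrors the front in this model, sources at 30 and 150 degrees yield the same ITD, which is one reason time-difference cues alone leave front-back position ambiguous.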

The generalized control model of a window here is by analogy to graphical windows, as in a desktop metaphor: an organizational vehicle in the interface that does not directly involve room acoustics. Researchers have been studying applications and implementation techniques of audio windows for use in providing multimedia communications [Ludwig et al., 1990] [Cohen and Ludwig, 1991a] [Cohen and Ludwig, 1991b] [Cohen and Koizumi, 1991a] [Cohen and Koizumi, 1991b] [Cohen and Koizumi, 1992a] [Koizumi and Cohen, 1993]. MAW is a fully interactive audio windowing and planar spatial sound system, suitable as a teleconferencing system, as well as for asynchronous applications like voicemail. Rather than relying on disembodied sources, MAW includes a GUI that employs visual representations of sound elements manipulated through a workstation. It integrates graphical and audio windows by manipulating spatial sound channels via graphical control, in effect a binaural directional mixing console.

MAW is implemented in the NextStep environment. The program consists of about 20,000 lines of Objective-C and PostScript code, plus associated graphics and graphical configurations. The graphical representation of MAW's virtual room is an aerial projection, a planar bird's-eye view. This perspective flattening was implemented partly because of its suitability for visual display on a workstation monitor. As a snapshot of a typical session, Figure 1 includes a view of such an overhead representation, along with the border, buttons, and scrollers that make it a NextStep window. Each participant in a conference assumes a characteristic position, which might correspond to the layout of the cubicles or offices or some other scheme. The (clipped) scale map of the group's (admittedly cramped) cubicled facilities indicates everyone's approximate orientation while seated in front of his or her workstation. This map is a MAW document (whose name titles the window), specifying the spatialization of sound sources with respect to individually designated (respective) sinks: a plan view of the virtual acoustical space, including positions of sinks and sources.

This kind of planar slice was also chosen to maximize positional accuracy for MAW users; localization cues for virtual sound sources are more robust from person to person for the azimuthal dimension than for the elevational.

Voice- and Multimedia Mail

Originally developed as an interactive teleconferencing frontend [Cohen and Koizumi, 1992b] [Koizumi et al., 1990], MAW was retrofitted with a batch mode, making it also suitable for automatic invocation. Its architecture, shown in Figure 2, is appropriate for both synchronous and asynchronous applications. MAW's frontend is the previously described GUI to the spatialization functions (as shown in Figure 1). The spatialization backend is provided by any heterogeneous combination of convolution engines. MAW uses configuration files, dynamic maps of virtual spatial sound spaces, to calculate gain control and HRTF selection for this heterogeneous backend, assigning logical channels to physical devices via a preferences (control) panel. The outputs of all spatialization filters are combined into a stereo pair presented to the user. Multimedia electronic mail allows the interleaving of text with inclusions of arbitrary files, including sound, animation, and graphics. Alongside sound files, MAW files can spatialize voicemail by tagging channels with positional configurations.

Hypermedia

Alternatively, files may be webbed together into arbitrary non-linear organizations of attachments.

In Figure 3, a graphical utility with file indirection has been used to organize some utterances into a simple tree structure. Multimedia hyperdocuments are closely related to cyberspace [Zyda et al., 1994], as illustrated by Figure 3. The virtual coordinate space could correspond to anything:

Physical space: The artificial reality might be an analog of our world, with representations of actual buildings and cities (for mnemonic purposes, as in Figure 1) or new types of conference centers, designed by psychologists instead of architects. MAW can adjust gain as a function of distance, so a happy side-effect is that the closeness (perceived proximity) of the audio signal corresponds to the urgency of the action. Alternatively, resetting attenuation creates a distance-independent signal.

Issue space: Besides huddling and caucusing, users might express an opinion by "where they stand" on an issue, "voting with their feet." Rival counselors could whisper in opposite ears of a decision maker, like temptation and conscience.

Conceptual space: Sound sources could be arranged like records or CDs. They might be strategically placed for balance. Question and answer, point and counterpoint, and more complicated structures could find appropriate spatial configurations.

Social space: Maybe virtual proximity would reflect emotional intimacy. Or people might use some analog of height to reflect social hierarchy (like bowing). Proteges might gather at a mentor's ankles.

Late Binding

Even though email-embedded utterances are recorded a priori, they are spatialized in realtime, during individual audition. This late binding of the position parameters allows relative effects at two levels:

Global: Multicasting

On a distributed scale, the sound is spatialized as a source relative to the position of a designated sink. One advantage of MAW's exocentric visual presentation is its egalitarianism: all the users can share the same map. A message might be multicast or forwarded to others, who will hear the utterance spatialized with respect to their own position; the respective listener positions in the map differ, established via the personal preferences profile (specifically, the sink attribute for each channel). This feature can be used mnemonically, assigning everyone a unique virtual position corresponding to his or her location in an office.

For example, in Figure 1, if voicemail is sent to the group, C will hear K's voice behind, while F will hear K's voice to the right front. To conserve allocated channels, a token (akin to a speaker's gavel) can be used to mark the location of the respective contributors. When users wish to record a to-be-spatialized comment, they include a map as a prelude to the recorded utterance, which flags their own position, via a collocated token, as that of the source with respect to all the prospective sinks. In Figure 1, the token is a workstation (in the top right of the map), whose transparent screen frames the designated source.
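The late binding described above can be sketched as a per-listener computation over one shared map: the same source position yields a different azimuth and gain for each designated sink. The coordinates below are invented for illustration (they are not read off Figure 1) and are chosen only to reproduce the qualitative outcome of K being heard behind C and to the right front of F. The map convention, the inverse-distance gain law, and all function names are likewise assumptions, not MAW's actual implementation.

```python
import math

def relative_azimuth(sink_xy, sink_heading_deg, source_xy):
    """Bearing of the source as heard from the sink, in degrees.

    Assumed map convention: +y is "north"; headings and bearings are
    measured clockwise from north. The result lies in [-180, 180):
    0 = straight ahead, 90 = right, -90 = left, -180 = directly behind.
    """
    dx = source_xy[0] - sink_xy[0]
    dy = source_xy[1] - sink_xy[1]
    bearing = math.degrees(math.atan2(dx, dy))  # clockwise from +y
    return (bearing - sink_heading_deg + 180.0) % 360.0 - 180.0

def distance_gain(sink_xy, source_xy, attenuate=True, reference=1.0):
    """Assumed inverse-distance gain, capped at 1.0 near the sink.

    With attenuate=False the signal is distance-independent.
    """
    if not attenuate:
        return 1.0
    d = math.dist(sink_xy, source_xy)
    return 1.0 if d <= reference else reference / d

# Hypothetical shared map: one source (K) and two sinks (C and F).
SOURCE_K = (0.0, 0.0)
SINKS = {
    "C": {"xy": (0.0, 2.0), "heading": 0.0},
    "F": {"xy": (-2.0, -2.0), "heading": 0.0},
}

def spatialize_for(sink_name):
    """Return (azimuth, gain) of source K for the named sink."""
    sink = SINKS[sink_name]
    azimuth = relative_azimuth(sink["xy"], sink["heading"], SOURCE_K)
    gain = distance_gain(sink["xy"], SOURCE_K)
    return azimuth, gain

if __name__ == "__main__":
    for name in SINKS:
        azimuth, gain = spatialize_for(name)
        print(f"{name}: azimuth {azimuth:+.0f} deg, gain {gain:.2f}")
```

Because the position parameters are resolved only at audition time, the recorded utterance itself never changes; only the sink entry consulted in the map differs from listener to listener.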

Local: Head- or Chair-Tracker

At a local level, a head- or chair-tracker may be used to monitor the orientation of a user's swivelling seat, whose position updates are sent to MAW to adjust the spatialization. Continuously adjusting the orientation of the presentation stabilizes the soundscape so sources remain fixed in perceptual space as the user twists. This feature is especially important for resolving front-back ambiguity. The chair tracker blurs the distinction between egocentric and exocentric systems by integrating an egocentric auditory display, exocentric visual display, and ego- and exocentric control.

As illustrated by Figure 5, the virtual position of the sink, reflected by the (exocentric) orientation of its associated graphical icon, pivots in response to (egocentric) sensor data around the datum/baseline established by WIMP (exocentric) iconic manipulation. The WIMP-based operations of MAW can set absolute positions; the chair tracker's reporting of absolute positions has been disabled to allow graphical adjustment. With only (arbitrary) WIMP-based rotational initialization, the system behaves as a simple tracker, consistent with proprioceptive sensations.
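One way to picture this combination of controls is a sink heading composed of a graphically set baseline plus a relative rotation accumulated from the tracker. The sketch below is a minimal model under that assumption; the class `SinkOrientation`, its method names, and the delta-based tracker interface are hypothetical and are not taken from MAW.

```python
class SinkOrientation:
    """Sink heading = WIMP-set baseline + accumulated tracker rotation.

    Angles are in degrees, measured clockwise on the map.
    """

    def __init__(self, baseline_deg=0.0):
        self.baseline_deg = baseline_deg % 360.0  # exocentric: set by dragging the icon
        self.offset_deg = 0.0                     # egocentric: accumulated chair rotation

    def set_baseline(self, heading_deg):
        """Graphical (WIMP) adjustment sets the absolute datum."""
        self.baseline_deg = heading_deg % 360.0

    def tracker_delta(self, delta_deg):
        """The tracker contributes only relative rotation, never absolute position."""
        self.offset_deg = (self.offset_deg + delta_deg) % 360.0

    @property
    def heading_deg(self):
        return (self.baseline_deg + self.offset_deg) % 360.0

def apparent_azimuth(source_bearing_deg, sink_heading_deg):
    """Azimuth at which the source should be rendered, in [-180, 180)."""
    return (source_bearing_deg - sink_heading_deg + 180.0) % 360.0 - 180.0

if __name__ == "__main__":
    sink = SinkOrientation(baseline_deg=0.0)
    source_bearing = 90.0  # the source stays fixed on the map, due "east" of the sink

    print(apparent_azimuth(source_bearing, sink.heading_deg))  # source to the right
    sink.tracker_delta(90.0)   # the user swivels a quarter turn clockwise
    print(apparent_azimuth(source_bearing, sink.heading_deg))  # source now straight ahead
    sink.set_baseline(180.0)   # the icon is dragged to a new baseline
    print(apparent_azimuth(source_bearing, sink.heading_deg))  # source now directly behind
```

Keeping the two contributions separate means a graphical adjustment never fights the sensor: the tracker only pivots the sink around whatever baseline the icon currently defines.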


MAW's audio window reinterpretation of standard idioms for WIMP systems, including draggably rotating icons and directionalized and non-atomic spatial sound objects, complements features that are well suited for both synchronous and asynchronous operations, including compatibility with hypermail (allowing spatial sound to be put into electronic mail). By embedding MAW documents, which might include dynamic effects, alongside voicemail, we tag each utterance as a spatial channel. MAW is designed to exploit innate localization abilities, our perception of spatial attributes, and our intuitive notions of how to select and manipulate objects distributed in 2-space. Everything in the system is manifest, and objects and actions can be understood in terms of their effect on the displays. Precise auditory localization is difficult, but informal experiments indicate that MAW's visual and acoustic displays complement each other, a glance at the map disambiguating auditory cues. Asynchronous features anticipate the emerging (and eventual) ubiquity of multimedia email and hypermedia. As sound technology matures, and more and more audio and multimedia messages and sessions are sent and logged, the testimony of sound may come to rival that of the written word. Audio windows are a way of organizing and controlling sound. The issues involved in this research are both timeless and timely: timeless because the research analyzes the way people communicate with and through sound and space; timely because the requisite technology has only recently ripened sufficiently for scientists to start exploring the expressive potential of the medium.



Reprinted with author(s) permission. Author(s) retain copyright.