1993 VR Conference Proceedings

The Place of 3D-Audio Imaging in Disability-Access Applications

Dr. David A. Boonzaier
Rehabilitation Technology Group
Biomedical Engineering Department
University of Cape Town Groote Schuur Hospital
Medical School, 7925 Observatory, South Africa
Phone: +27 (21) 4046117
Fax: +27 (21) 4483291


The purpose of this paper is to review the background research which has made 3D/Virtual Audio Imaging possible and to discuss a novel and as yet untested potential application in the area of navigation for the blind. Recent years have seen many advances in computing technology with the associated requirement that people be able to manage and interpret increasingly complex systems of information. As a result, an increasing amount of applied research has been devoted to reconfigurable interfaces like the virtual display. As with most research in information displays, virtual displays have generally emphasized visual information. Many investigators, however, have pointed out the importance of the auditory system as an alternative or supplementary information output channel in Virtual Reality (VR) Systems. Of course, for a blind user, the audio component of such VR systems becomes the primary input.

Applications of a three-dimensional auditory display involve any context in which the user's spatial awareness is important, particularly when visual cues are limited or absent.

Examples include:

Common to all of these is the notion that the operator is blind to the objects, whether it is the blindness or the objects that are virtual or real.

The audio image synthesis technique is based on the Head-Related Transfer Function (HRTF): the listener-specific, direction-dependent acoustic effects imposed on an incoming signal by the outer ears. HRTFs in the form of Finite Impulse Responses (FIRs) are measured with small probe microphones placed near the two eardrums of a listener seated in an anechoic chamber, for 144 different speaker locations at intervals of 15 degrees azimuth and 18 degrees elevation. In order to synthesize localized sounds, a map of listener-specific "location filters" is constructed from the 144 pairs of FIR filters represented in the time domain. The map of FIR filters is downloaded from a computer to the dual-port memory of a real-time digital signal-processor, the "Convolvotron" (ref.). The device convolves an analogue signal with filter coefficients determined by the coordinates of the target location and the position of the listener's head, thus "placing" the signal in the perceptual 3D space of the user. The current four-Convolvotron configuration allows up to 16 independent and simultaneous sources and is capable of more than 1200 million multiply-accumulates per second. The resulting data stream is converted to left and right analogue signals and presented over headphones.
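The filtering step described above can be illustrated with a minimal sketch in Python. It is not the Convolvotron's implementation: the function names are hypothetical, the filter taps in the usage example are made-up numbers rather than measured HRTFs, and the split of the 144 locations into 24 azimuths by 6 elevations is inferred from the quoted 15-degree spacing rather than stated in the text.

```python
from typing import List, Sequence, Tuple

# Grid spacing quoted in the text; the 24 x 6 split is an inference.
AZIMUTH_STEP_DEG = 15
N_LOCATIONS = 144                          # measured speaker locations
N_AZIMUTHS = 360 // AZIMUTH_STEP_DEG       # 24 azimuths around the listener
N_ELEVATIONS = N_LOCATIONS // N_AZIMUTHS   # 6 elevations, 18 degrees apart


def fir_convolve(signal: Sequence[float], taps: Sequence[float]) -> List[float]:
    """Direct-form convolution of a mono signal with one FIR filter."""
    if not signal or not taps:
        return []
    out = [0.0] * (len(signal) + len(taps) - 1)
    for i, sample in enumerate(signal):
        for j, tap in enumerate(taps):
            out[i + j] += sample * tap  # one multiply-accumulate
    return out


def spatialize(
    signal: Sequence[float],
    left_taps: Sequence[float],
    right_taps: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Filter one source with the left/right FIR pair for its location."""
    return fir_convolve(signal, left_taps), fir_convolve(signal, right_taps)


if __name__ == "__main__":
    # A unit impulse simply reproduces each (made-up) filter at the output.
    left, right = spatialize([1.0, 0.0, 0.0], [0.5, 0.25], [0.0, 1.0])
    print(left, right)
```

Each pass through the inner loop is one multiply-accumulate, which is the operation counted in the throughput figure quoted above. A real-time system would process the signal block by block instead of all at once.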

Motion trajectories and static locations at greater resolution than the empirical measurements are simulated by interpolation with linear weighting functions. Informal tests suggest that this approach is feasible: simple linear interpolations between locations as far apart as 60 degrees azimuth are perceptually indistinguishable from stimuli synthesized from measured coefficients. When the display is integrated with a head-tracking system, the operator's head position is monitored in real-time so that the 16 virtual sources are stabilized in fixed locations or in motion trajectories relative to the listener. Head tracking uses the Polhemus sensing system, which reports three translations and three orthogonal angles. Such real-time head-coupling should enhance the simulation, since previous studies indicate that head movements are important for localization.

As with any system required to compute data "on the fly," the term "real-time" is a relative one. The digital signal-processor is designed to have a maximal latency or directional update interval of 10-30 msec, depending upon such factors as the number of simultaneous sources and the number of filter coefficients per source. Additional latencies are introduced by the headtracker (approximately 50 msec) and the IBM AT host (approximately 10-50 msec, depending upon the complexity of the source geometry). Recent work on the perception of auditory motion suggests that these latencies are acceptable for moderate velocities.
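Two points in this passage can be made concrete with a short sketch: linear weighting between two measured filters, and the overall latency budget. The code is illustrative only. The function names and the dictionary layout are hypothetical, interpolation is shown along azimuth only, and the latency total assumes that the three stages act in series so that their delays simply add, which the text does not state.

```python
from typing import Dict, List, Sequence, Tuple


def interpolate_taps(
    taps_a: Sequence[float], taps_b: Sequence[float], weight_b: float
) -> List[float]:
    """Linearly weight two FIR filters of equal length, tap by tap."""
    if len(taps_a) != len(taps_b):
        raise ValueError("filters must have the same number of taps")
    if not 0.0 <= weight_b <= 1.0:
        raise ValueError("weight_b must lie in [0, 1]")
    return [(1.0 - weight_b) * a + weight_b * b for a, b in zip(taps_a, taps_b)]


def taps_for_azimuth(
    measured: Dict[int, Sequence[float]], azimuth_deg: float, step_deg: int = 15
) -> List[float]:
    """Interpolate between the two measured azimuths that bracket the target.

    `measured` maps a measured azimuth in degrees (a multiple of `step_deg`)
    to the filter taps for one ear; this layout is an assumption.
    """
    azimuth = azimuth_deg % 360.0
    base = int(azimuth // step_deg) * step_deg
    lower = base % 360
    upper = (base + step_deg) % 360      # wraps from 345 back to 0
    weight_upper = (azimuth - base) / step_deg
    return interpolate_taps(measured[lower], measured[upper], weight_upper)


def latency_budget_msec(stages: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Sum (best, worst) latencies of stages assumed to act in series."""
    return sum(lo for lo, _ in stages), sum(hi for _, hi in stages)


if __name__ == "__main__":
    measured = {0: [1.0, 0.0], 15: [0.0, 1.0]}     # made-up two-tap filters
    print(taps_for_azimuth(measured, 7.5))         # halfway between the two
    # DSP 10-30 msec, headtracker about 50 msec, host 10-50 msec (from the text)
    print(latency_budget_msec([(10, 30), (50, 50), (10, 50)]))
```

Under the series assumption, the figures quoted above give an end-to-end delay of roughly 70 to 130 msec. Passing a larger `step_deg` mimics the informal test in which filters measured as far apart as 60 degrees were interpolated.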

The working assumption of the synthesis technique is that if, using headphones, one can produce ear-canal waveforms identical to those produced by a free-field source, the free-field experience will be duplicated. Preliminary data suggest that using non-listener-specific transforms to achieve synthesis of localized cues is feasible. Localization performance is only slightly degraded compared to a subject's inherent ability, even for the less robust elevation cues, as long as the transforms are derived from what one might call a "good" localizer (corresponding to an idealised HRTF) (ref).

As with any new technology, novel concepts are introduced which have no accepted names. In these situations, researchers tend to use words with related semantic constructs in other areas. In the virtual audio field, much of the terminology is borrowed from the familiar visual world: auditory imaging, looming, flow fields, icons, surface texture, panning and focusing, to name but a few.

The Earsight Project: Possibilities for a New Blind-Aid

Some recent technological breakthroughs in mimicking the higher animal retina's ability to preprocess visual signals, together with the maturing of virtual binaural (true 3D) sound imaging systems, provide a unique window of opportunity for a completely new device for helping the blind to "navigate" and to avoid obstacles.

In the past there have been severe stumbling blocks. Blind aids were either narrow-field ranging systems, e.g., the ultrasonic cane, whose unidimensional audio output requires a large amount of purposeful and cognitively demanding manual (serial) scanning to construct a mental picture, or 2D vibro-tactile surface/skin representations. The latter suffer from limited channel capacity (information-transfer rate) and from the formidable problem of distinguishing features of interest from background noise, given the poor 2D spatial resolution of the skin. The inherent limitation in transferring a large amount of information in parallel to the blind user remains unresolved, in spite of many years of research into 2D surface vibro-tactile displays, micro-electrode arrays implanted in the occipital (visual) cortex, etc.

New Emerging Technologies

Analogue Real-Time Image Processing And Automatic Feature Extraction

Publications in the last two years by the Caltech LSI group have revealed a completely new approach to the problem of handling large amounts of real-world data in real-time. The imprecision and/or uncertainty of these kinds of data lend themselves to large-scale manipulations such as preprocessing: feature extraction, spatial and temporal filtering, etc. Doing these in anything like real-time precludes a digital approach. The technology, pioneered by Mead, uses multi-layered LSI networked elements in an analogue computing solution to the problem. The connectivity of the particular analogue elements represents the hard-wired (programming) aspects which uniquely define the modes of operation of the pattern recognition process. One recent project, the subject of a doctoral thesis by Misha Mahowald, uses just such an approach to mimic the retina's architecture and function (ref.). The output of this "artificial retina" is not a large amount of 2D parallel digital data (as one would expect from a simple CCD video camera) but rather a qualitative set of 3D reference points where "something of interest" exists. Features of interest may include hard edges (e.g., a narrow vertical bar might represent a pole, and a horizontal step change in brightness a horizon) or, more interestingly, things moving relative to the background. These may be either truly moving objects, e.g., another person walking, or inanimate objects that appear to move because they are nearer the moving "eye" (the so-called visual flow field, otherwise known as looming, indicated by an image getting nearer and bigger).

This work constitutes a major breakthrough in solving the real-time data reduction/feature extraction problem which has bedeviled the artificial vision (sometimes called robotic vision) field for many years. For the first time we may have the basis for recognizing classes of objects and highlighting (tagging) them for audio-representation, as described below.

Virtual Audio Imaging

A more mature technology, pioneered by the NASA Ames group described earlier in this review (Foster, Fisher, Wenzel) and since commercialised as the Convolvotron (see refs), is part of the explosion of auditory, visual and tactile human interface devices which are entering the highly lucrative market of virtual reality systems. This new concept addresses the fact that a human user attempting to interact with information technology devices is severely limited by the traditional interface, i.e., screen and keyboard. The multiplicity of human sensory and motor-output capabilities demands a much richer experiential environment than current interfaces provide. The object of virtual reality research in general is to build a better mapping of the information space onto the user's perceptual and semantic world. For those who have lost one or more sensory or motor modalities, the added richness of the VR environment can make or break the utility of the overall system, and in the worst case make it inaccessible to such users.

The Challenge

Totally blind people usually have all their other senses intact (including balance and rotation, which are often overlooked). The object of this project is to use the redundant information-handling capacity of the blind to represent their absent visual perception by transforming it into the auditory domain, i.e., mapping real 3D visual image information (as may be provided by the silicon retina) into the audio perceptual space of the user.

To clarify this idea, which for those of normal visual ability may be difficult to conceptualize, consider the notion of a blind person who enjoys going out into his garden when it is raining, to "listen to the garden" - not the rain. As you can imagine, the blind listener will "hear" each significant object - the garden shed, the pathway, the lawn differently - each in its precise place and with a particular "texture" or sound coloration peculiar to the nature of the surface, e.g. hard (the path) or soft (the lawn).

The concept of "visualizing" the shape and layout of a room by tapping a cane and/or listening to the natural echoes from incidental sounds and speech is well-known (ref.). Many blind people develop an uncanny ability to use this information - which, of course, is redundant in sighted people (and is probably overlooked).


The first phase of this project will be to encourage congenitally blind children, as young as is practicable, to play in a highly structured interior space, with walls, chairs, tables, sofas, doors, etc. Whilst experiencing the real sensations of 3D space (distance, position, speed, size and touch-texture) they will be given, via earphones, a set of synthesized "auditory icons," called earcons, of the same space, representing the real-world objects. These are produced by the Convolvotron in their precise 3D location to help identify the situation, size, character and surface texture of the original objects. The biggest challenge in this phase of the project will be to choose suitable earcons (consisting of outlines) and fill-in sound-texturing to provide enough useful information to define relative position and velocity and to help identify objects using these auditory associations. These spatial and temporal associations will have to be learnt, since they are "artificial" in the sense that these objects have no inherent sounds of their own.

Once this audio psycho-perceptual problem is solved we will be in a better position to use the information to design an automatic feature-extraction process from a 3D-vision system which will provide the inputs to the earcon/sonic texture generator.
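One step of such a pipeline, turning a tagged 3D feature point into the direction and distance at which its earcon would be rendered, might be sketched as follows. This is an assumption-laden illustration, not a design from the project. The coordinate convention (x forward, y to the left, z up, origin at the listener's head), the `AudioPosition` type and the function name are all hypothetical, and only head yaw is compensated, whereas a six-degree-of-freedom tracker would also report pitch and roll.

```python
import math
from typing import NamedTuple


class AudioPosition(NamedTuple):
    azimuth_deg: float    # positive to the listener's left (assumed convention)
    elevation_deg: float  # positive above ear level
    distance: float       # same units as the input coordinates


def to_audio_position(
    x: float, y: float, z: float, head_yaw_deg: float = 0.0
) -> AudioPosition:
    """Convert a feature point to head-relative azimuth, elevation, distance.

    The point is given in a frame centred on the listener's head, with
    x forward, y left and z up when the head yaw is zero.
    """
    distance = math.sqrt(x * x + y * y + z * z)
    if distance == 0.0:
        raise ValueError("feature coincides with the listener's head")
    bearing = math.degrees(math.atan2(y, x))
    elevation = math.degrees(math.atan2(z, math.hypot(x, y)))
    # Subtract head yaw and wrap into [-180, 180) so that the source
    # stays fixed in the room when the head turns.
    azimuth = (bearing - head_yaw_deg + 180.0) % 360.0 - 180.0
    return AudioPosition(azimuth, elevation, distance)


if __name__ == "__main__":
    # A feature directly to the listener's left, with the head facing forward
    # and then with the head turned to face it.
    print(to_audio_position(0.0, 1.0, 0.0))
    print(to_audio_position(0.0, 1.0, 0.0, head_yaw_deg=90.0))
```

The resulting azimuth and elevation would select (or interpolate between) the location filters, while the distance could drive loudness or another earcon attribute. Which attribute to use is precisely the psycho-perceptual question raised above.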

The possibilities provided by the Convolvotron and related technologies in the audio-representation field are encouraging and we do not foresee technological limitations. The primary deficiency at present is the lack of basic psycho-perceptual knowledge as to how early childhood experience integrates the audio world with the visual world, and how this differs for born-blind individuals. The exciting possibilities offered by virtual 3D audio for enabling this basic research may well provide, for the first time, some fascinating insights into many early developmental questions.

One remaining consideration is the need for an overall orientation-feedback mechanism to be incorporated into the aid. It is possible that, with the decrease in cost and increase in resolution of portable satellite global positioning systems, this technology may be usefully incorporated. This would allow blind users to have an absolute reference for direction and position even when walking freely in open spaces, the sort of reference which is normally provided to sighted people by distant objects such as a tall building, a mountain or the sun.

Conclusions and Future Considerations

This is a very large project but can be attacked in chunks. Key researchers in the fields of:

have agreed to cooperate. The coordination of the project will be the responsibility of the author who is undertaking his sabbatical at Stanford University from June 1994. Researchers interested in collaborating in this project should communicate with the author.


I should like to thank the following persons who helped define the scope of this project, and who have responded with enthusiasm for future collaboration:


Reprinted with author(s) permission. Author(s) retain copyright.