Return to 2001 Table of Contents
Donald B. Egolf, Ph.D.
Department of Communication
1117 Cathedral of Learning
University of Pittsburgh
Pittsburgh, PA 15260
Voice Mail: 1-412-624-6763
Approximately 20 years ago, research in speech synthesis began to pay off. Devices were created that provided a functional level of intelligibility, and soon thereafter devices were produced that generated intelligibility scores virtually equal to those of live voice. Voice choices also became available, most notably female and male voices. These research efforts enhanced the lives of many individuals who could not use speech as a primary means of communication.
A watershed has now been passed in the research effort to synthesize a person's facial expressions in synchrony with that person's speech. The purpose of this paper is to review the new technology, to suggest ways in which the technology might augment an individual's communication, and to discuss the controversies that swirl around it.
Facial expression technology emerged from a sustained NASA-sponsored research effort at the Jet Propulsion Laboratory and the California Institute of Technology in Pasadena, CA. G-Tec (TM), Graphco Technologies, Inc., has recently acquired exclusive distribution rights to the technology and will co-develop the technology with JPL. The technology has been called "Digital Personnel (TM)."
Digital Personnel is a computer-based facial expression synthesizer. It synthesizes animated, lifelike facial expressions of an individual in synchrony with that individual's speech. The system is speech driven; that is, as an individual speaks, the appropriate facial expressions are generated simultaneously. To initialize the system, an individual is asked to recite a phonetically balanced passage, one in which all of the phonemes of English are represented in a variety of phonemic contexts. As the individual recites this passage, audio and video recordings are made. The facial expressions accompanying each phoneme are tagged and stored; this storage constitutes the basic database. The database can be "tweaked" by having the target individual give samples of winking, blinking, nodding, and so on. These responses can be added to give the synthesized face more animation potential. From this database it is thereafter possible to show the individual's face speaking naturally. All that is needed is for the individual to provide the speech, whether live, recorded, or, for a speech-disabled individual, synthesized. In all cases the synthesized facial expressions are those of the actual individual.
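The data flow described above — tag an expression to each phoneme during capture, then drive the face from a phoneme stream at playback — can be illustrated with a minimal sketch. The Digital Personnel internals are proprietary, so every name here (ExpressionFrame, build_bank, render_sequence) is hypothetical and chosen only to mirror the steps in the paragraph above:

```python
# Hypothetical sketch of a phoneme-to-expression database.
# All names and data structures are illustrative assumptions,
# not the actual Digital Personnel implementation.
from dataclasses import dataclass

@dataclass
class ExpressionFrame:
    """One stored facial pose tagged to a phoneme or gesture."""
    label: str      # phoneme symbol (e.g. "AA") or gesture (e.g. "wink")
    frame_id: int   # index of the captured video frame

def build_bank(tagged_recording):
    """Build the database from the tagged recording of the
    phonetically balanced passage (plus any extra gestures)."""
    bank = {}
    for label, frame_id in tagged_recording:
        # keep the first captured frame for each label
        bank.setdefault(label, ExpressionFrame(label, frame_id))
    return bank

def render_sequence(bank, phonemes, neutral_frame=0):
    """Drive the face from a phoneme stream (live, recorded, or
    synthesized speech); fall back to a neutral pose when a
    phoneme was never captured."""
    return [bank[p].frame_id if p in bank else neutral_frame
            for p in phonemes]

# A tiny bank, then the frames selected for the word "hi" (HH AY):
bank = build_bank([("HH", 12), ("AY", 47), ("wink", 301)])
print(render_sequence(bank, ["HH", "AY"]))  # [12, 47]
```

The key design point is that the bank is built once, offline, and the playback step needs only a phoneme stream — which is why the same database can serve live speech, recordings, or a speech synthesizer interchangeably.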
Proposed commercial uses of the technology include web-based customer support, e-commerce sales, video telephony, news dissemination, advertising, entertainment, and distance learning. In these applications, the addition of an animated person is believed to make the interaction more human. A technical-support agent, for instance, can look through documents while still appearing attentive to the help seeker, and a distance-learning instructor or TV news anchor can appear to be reporting from the college office or the studio when in fact he or she may be phoning in from anywhere.
There are, of course, products similar to "Digital Personnel," such as Ananova, the cartoonlike visual newscaster on www.ananova.com, but none has the realistic quality of "Digital Personnel."
Would someone who is communicatively impaired want to augment his or her communication with synthesized facial expressions? One such person might be someone with a degenerative disease like AIDS or ALS, for example. This person might want to "bank" his or her facial expressions in the early stages of the disease, shortly after a definitive diagnosis has been made.
"Banking" means having oneself videotaped while reading a phonetically balanced passage; this would provide the stored data necessary for synthesizing facial expressions. As the disease progressed and the sufferer became weaker, he or she could still communicate with natural-looking synthesized facial expressions that would appear in synchrony with his or her own voice or a synthesized voice. The communication could occur over a telecommunication link or in the presence of one's communication partner.
Individuals with non-degenerative neuromuscular conditions such as cerebral palsy might also want to adopt the new technology. Many of these individuals already use synthesized speech. Although their neuromuscular involvement may preclude them from reciting a phonetically balanced passage, brief facial expressions could be video recorded and thereafter attached to a synthesized phoneme bank, allowing these individuals to present a speech-synchronized animated face during conversation. Again, this might be used in telecommunications or in the presence of a communication partner.
A third area of application involves teachers, therapists, and trainers. These individuals may want to record a bank of phonemes using highly animated facial expressions. They could then use this bank to synthesize their facial expressions as they telecommunicated, prepared visual training products, or spoke "live" in the classroom with their enhanced images projected on monitors.
Synthesized or digitized speech mechanisms, packaged in any number of augmentative communication devices, are used as compensatory mechanisms by those for whom natural speech cannot serve as a primary means of communication. They have been accepted by their users and, in varying degrees, by the users' listeners. The acceptance of facial expression synthesizers in augmentative devices will probably not come so easily; in fact, many people recoil at the idea. Reasons for this response include the following.
First, a person's face is intricately related to that person's identity. When you think of someone, what comes to mind? In most cases it is likely to be that person's face. Presenting a face synthesized from a sample taken at an earlier age, as might be done for someone with a degenerative disease, or generating fluid facial movements to accompany synthesized speech from a series of videotaped samples of a person with severe dysarthria, might be seen as a kind of deceit: device users are presenting themselves as they are not.
Second, research has shown that the face is the part of the body most intricately related to self-concept. Having one set of expressions without a device and another when using a device with synthesized facial expression capabilities may be confusing not only to others but also to the user, leading the user to ask, "Who am I?"
Third, certain rules of discourse would be broken if the facial expression synthesizer were used in the face-to-face conversational setting. The device user's listener would be put in the position of choosing between watching the speech-synchronized facial expressions generated by the user's device while listening to its synthesized speech, and maintaining eye contact with the user while listening to the device. The former may appear insulting and mortifying to the device user; it might be viewed as a form of rejection.
At the same time, augmentative communication users may prefer to have synthesized facial expression capabilities in their devices. It would give the users a wider communication bandwidth in that the visual or nonverbal would be there to complement the speech. It simply may make the users more competent communicators. The device user might use the synthesized facial capability in all situations, including face-to-face, or might choose to use it only in telecommunications, be it on videophones or across the internet. In the long run, of course, the users will decide. It will be the responsibility of researchers to provide the best possible facial expression synthesizers so that users can make informed decisions.
In mid-2000 news of a new scientific development was announced: a computer system that synthesizes speech-synchronous facial expressions of a given individual. In this paper the efficacy of using such a system in the augmentative communication area was addressed and sample applications were given. Utilization of the system would not be without controversy, however, and the pros and cons of utilization were discussed. In the end, of course, device users will decide whether or not the newly developed system will be beneficial.