Voice I/O and Persons with Disabilities
Most of you are familiar with CSUN's large, annual, international conference, "Technology and Persons with Disabilities," which features many lectures and exhibits. This meeting, "Voice I/O and Persons with Disabilities," represented a different meeting style for us. It took place in Palm Springs, California, October 1-3, 1991.
Unlike our well-known conference, this was a small-group workshop. It called for full participation on the part of all participants. It was directed at a clear-cut end result: a set of priorities in the Voice I/O field.
It also represents CSUN's involvement in smaller meetings devoted to specialized areas in the field of technology and disability. Look for announcements about others to follow in the near future. The purpose of this workshop was to give some focus to Voice I/O and talk about what is important to advance the field. What should we do? How should we go about it?
By the end of our time together in Palm Springs, we had arrived at a set of eight priorities. These are not set in concrete. Take a look at them and formulate your own sense of what is important for you as an individual, or your organization, or the field at large. If you agree that these are important, you may want to share them: share them with consumers, your colleagues and funding sources. The March 18-21, 1992 CSUN Conference picked up where we left off in Palm Springs. The eight priorities were revisited by Workshop Leader, Dr. Sarah Blackstone and participants at the 1992 CSUN conference. A mini-conference, with about 12-15 papers on Voice I/O, was a part of the larger conference.
We will continue to invite others to contribute their sense of what is important. If we keep the process going long enough, we will have met our priorities, fulfilled our dreams and made some differences in the lives of individuals with disabilities.
California State University, Northridge
There are a number of instances in which computers are beginning to emulate human behavior. Software exists which can play chess at the grand master level. Mechanical design, photographic technology and software work together within the field of robotics to allow machines to perform tasks which substitute for human actions: playing games, or picking up and moving objects. Some software is designed to draw inferences much as humans do. Much of this research will prove to be extremely useful in the years ahead. But nowhere in the attempt to replicate various aspects of human behavior is there more challenge than in speech technology - technology which attempts to program computers to produce and understand human speech. We are still trying to understand how our communicative behavior can be assumed by machines, which aspects of communication can and should be replicated, and what the requisite functionality is for speech I/O devices. It is the further understanding and development of this technology which will allow disabled and non-disabled individuals alike to participate more effectively in the 21st Century.
It is an understatement to suggest that speech is important. Man may be generically known as homo sapiens, "wise man," but he more appropriately should be called homo loquens, "talking man." He has three characteristics which distinguish his ability to communicate from that of all other animals: (a) duality of patterning: his ability to combine meaningless sounds to form meaningful utterances; (b) productivity: the infiniteness of his utterances; and (c) displacement: his ability to discuss things which do not exist in time or space. It is displacement, of course, which makes man the only animal which can lie. But it is this same characteristic which makes him the only animal that can create poetry, which can let his linguistic peregrinations take him on exquisite journeys into literature and lyrical music.
The ability to speak is a selective evolutionary trait. The chimpanzee, one of our closest relatives, lacks speech centers such as Broca's area, and the small symmetrical regions in both hemispheres which approximate Wernicke's area are unconnected to each other. So this ability to use language, while unique to man, also makes him vulnerable and truly disabled when no substitute for speech is available. For we have grown dependent upon speech. It is speech which rules our lives.
It is clear that we are predisposed to learn language. In the acquisition of a first language, children learn the same aspects of speech in the same order. For English-speaking youngsters, for example, -s plural is learned before -s possessive, which is learned before -s third person singular. The babbling phase (typically starting at the age of 6 months) in normal speaking children has been extensively studied. But it is now known that even deaf infants of deaf parents who sign instead of speak also go through a "babbling" phase with their hands. We are wired for language.
We observe daily that human beings have no problem learning how to speak at a very early age. However, no one yet understands what causes a first language to be learned. And while we may know a bit more (or at least have more theories) about second language acquisition, it is infinitely easier to teach a person to speak a language than it is to teach a computer. We have not yet succeeded in algorithmically producing computer-generated speech which is indistinguishable from natural speech. Nor do we understand how to program machines to productively "understand" the wide variety of human utterances. It is now approximately 280 years since the first mechanical attempt to reproduce human speech and 52 years since the first electrical attempt. Each year brings us a bit closer to realizing our objectives. And in spite of the current imperfections of the technology, speech I/O can now be used for people with disabilities in ways which were not possible a few years ago. According to recent statistics from the National Institute on Disability and Rehabilitation Research, 20% of all non-institutionalized individuals age 15 and over (nearly 40 million people) have some physical function limitation. Such limitations include the inability to hear normal conversation, the inability to have speech understood, or the inability to read newsprint even with corrective lenses. Under this definition, 12.8 million people have some visual limitation; 7.7 million people a hearing problem; and 2.5 million a speech problem. As much as 0.6% of the school age population (ages 6-22) is non-speaking. The number of children that the last statistic represents is quite large. And in addition to these figures, some 14.3 million people have some other functional limitation.
The statistics on individuals with learning disabilities are in many ways just as alarming. Forty percent of the 14,000 youngsters who appear in New York City's Family Court have learning disabilities. Most are aged 7 to 17 and have failed in school. One study of adult offenders found that 60% had some learning disability. There may be as many as 23 million Americans who cannot read maps, road signs, labels or instructions, or who cannot count. Just as ominously on the educational front, according to a recent study by the Department of Education, approximately 50% of high school seniors read at or below the 8th grade level. There are unquestionable links between learning disabilities and poverty and between poverty and crime.
In each of these cases, speech I/O technology has something to offer. High quality speech synthesizers can now replace dysarthric speech, can read standard ASCII text from a screen with the assistance of widely-available screen-reading software, and can voice courseware written for individuals with dyslexia and other learning disabilities. But this technology is not a solution; it is rather a tool whose functionality must be further developed and, more importantly, a tool which must be made accessible to the millions of individuals who need it. The decade of the 1990's has already begun and will be over before most of us realize it. Now is the time to address the issue of priorities and to set tangible and feasible goals for our entry into the 21st Century. Voice will be an important and vital technology and one which will clearly be of assistance to disabled individuals.
The cost of speech I/O devices needs to be brought within reach of the average disabled user. This is beginning to happen due to the reduction in cost of hardware (both memory and processing power) as well as the growing realization on the part of manufacturers that devices which cost as much as mid-sized cars will not be readily accepted within this market. We can already see this happening: devices utilizing optical character recognition (OCR) technology coupled with speech synthesis for users who have visual disabilities have come down from $50,000 not many years ago to $10,000, and more recently can be purchased for about $3,000. Similarly, the highest quality synthesizers, formerly available for over $4,000, may soon cost just over $1,000. Large vocabulary speech recognizers are currently priced at around $9,000 (30K words) to around $3,500 for a smaller vocabulary (7K words), but this price will undoubtedly be lowered in the next few years. At the same time, it is unrealistic to expect costs of such devices as very high quality speech synthesizers or large vocabulary speech recognizers to be significantly reduced in the next two to three years. The problem is that as algorithms grow necessarily more complex (due to the complex nature of human speech), we will have greater reliance on faster microprocessors and faster and denser memory. Nevertheless, in the past few years, we have also seen a proliferation of extremely powerful personal computers (such as those based on the 80286 and 80386-class machines) which are within financial reach. In other words, the price of components seems to be decreasing more quickly than increasingly complex algorithms demand faster components. Because of this, we will see speech I/O utilized in workstations and telecommunications within the next few years, and the resulting commercial applications will clearly benefit individuals with disabilities.
Voice technology needs to be made more widely available. For example, of the 933,000 people with disabilities (one-half million of those with severe disabilities) in Massachusetts, fewer than 5,000 (approximately one out of 100) receive assistive technology services through agencies such as human service organizations. Until very recently, high quality speech synthesizers were simply out of financial reach for the average non-sighted or non-speaking individual. Fortunately, this is changing. Digital Equipment Corporation, for example, has licensed its speech synthesis technology to selected non-profit institutions which manufacture augmentative communication devices and which, in the past, had to rely on inferior-quality synthesis because of its low cost. As a result, high-quality speech synthesis is now available in a variety of AAC devices. High quality, large vocabulary speech recognition continues, however, to be out of the reach of most physically-challenged people.
The size of speech I/O devices needs to be reduced in order to make them more portable and less conspicuous. There is already a trend underway to accomplish this. For example, one of the highest quality speech synthesizers has been reduced from 16 lbs. to approximately 3 lbs. The bad news is that this reduction has taken 8 years; the good news is that if this trend continues, in a few years, these devices will be very small and will weigh almost nothing.
There are a number of priorities in the development of speech technology itself. Speech synthesis is constantly being improved in intelligibility and a great deal of interesting research has been done in this area. In the highest quality synthesizers, intelligibility is now very close to that of humans. However, because so many people will be using speech synthesis as a prosthesis for vocal dysfunction, technologists must achieve a greater degree of naturalness (i.e. those qualities which make the synthesizer sound "human"). But although intelligibility and naturalness are clearly interrelated, the latter has been elusive to speech scientists. The quality of parametric synthesized speech is currently much lower than that of natural speech and it is in this area that a great deal of improvement is possible. One of the first tasks is to understand the differences between natural and synthetic speech in both the acoustic and the perceptual domain - very often, these are different. Also, the quest for naturalness in formant speech synthesis is really the quest for "acceptable" naturalness - that is, a level of naturalness that does not claim to replicate the linguistic behavior of a native speaker but rather one which is free of those aspects of synthesized speech which are most noticeable to the novice, annoying to the user or distracting to the listener. This is especially important when the synthesized voice becomes the voice of the user.
To achieve an acceptable level of naturalness, more analysis of real speech and finer tuning of segmental phonemes is required and better algorithms for more precise contextual variation of these phonemes are needed. Prosodic parameters such as tone, accent, meter and the like need to be better analyzed and understood in order for the speech scientist to develop procedures which will more accurately predict how humans behave linguistically. Specifically, improvement of and better granularity for non-statement intonation (e.g., the different question types) are needed, pause timing needs to be more accurately reproduced, and new phonetic models for phrasing, stress placement, tune and various other discourse phenomena need to be applied. Also, acoustic parameters connected to perceived emotion need to be made more accurate and more accessible to the user. However, in spite of these shortcomings, much headway has been made in the past few years, and high quality speech synthesizers now have the potential for much greater naturalness than the synthesizers of several years ago.
Since a large number of people use speech synthesis as a voice prosthesis, one of the concerns among the user community has been the minimal range of emotion. Allowing users a choice of a small set of emotion-specific intonation contours has become more important. Some interesting work has been done recently at a number of institutions, including the Massachusetts Institute of Technology (Cahn, 1989) and the University of Dundee, Scotland (Murray et al., 1988; Newell, 1990), to factor emotion into synthesized speech.
Linguists well understand that prosodics may carry as much as or more information about emotional content than the actual words themselves. In fact, it is believed that semantic differences conveyed by intonation contours are one of the earliest contrasts perceived by children. We know from our own understanding of natural language that a simple word can take on dozens of different meanings with nothing more than a change in intonation. We can communicate a wide variety of feelings with a single word, e.g., surprise, anger, fear, hesitation, happiness, affection, and the like. More syntactically complex utterances such as "I'll spank you if I ever catch you doing that again!" may not be felicitous vis-a-vis the speech act, if uttered with the same intonation as, say, "What a beautiful little baby you are." Yet in speech synthesis systems today, default prosodic contours are emotionally neutral no matter what the semantic content. Ways need to be found to allow users more flexibility in manipulating prosodics, whether by commands which are pre-set within the synthesizer and selectable, or by more direct input from the user, e.g., with a joystick or some equivalent device. While there are valid arguments for discouraging direct user manipulation of prosodic parameters in commercial markets, users who are physically challenged have an overwhelming motivation to use such functionality effectively. Perhaps when more is understood about the acoustics and perception of the prosodics of emotion, improvements might be made to the electro-larynx, which is characterized by an even more monotone intonation than the very low quality speech synthesizers.
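The idea of pre-set, selectable emotion contours described above can be sketched in a few lines. This is a minimal illustration only: the parameter names, baseline values and offsets are all hypothetical, not drawn from any actual synthesizer of the period or since.

```python
# Illustrative sketch: a small set of user-selectable emotion presets,
# each expressed as offsets to baseline prosodic parameters. All names
# and numbers here are invented for illustration.

BASELINE = {"pitch_hz": 120, "rate_wpm": 180, "volume_db": 0}

EMOTION_PRESETS = {
    "neutral":  {"pitch_hz": 0,   "rate_wpm": 0,   "volume_db": 0},
    "anger":    {"pitch_hz": +15, "rate_wpm": +30, "volume_db": +6},
    "surprise": {"pitch_hz": +30, "rate_wpm": +10, "volume_db": +3},
    "sadness":  {"pitch_hz": -20, "rate_wpm": -40, "volume_db": -3},
}

def apply_emotion(emotion: str) -> dict:
    """Return prosodic settings for an utterance, given an emotion label.

    Unknown labels fall back to the neutral preset.
    """
    offsets = EMOTION_PRESETS.get(emotion, EMOTION_PRESETS["neutral"])
    return {k: BASELINE[k] + offsets[k] for k in BASELINE}

print(apply_emotion("anger"))
# {'pitch_hz': 135, 'rate_wpm': 210, 'volume_db': 6}
```

A joystick or equivalent device, as suggested in the text, could drive the same parameters continuously rather than through discrete presets.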
Improvements in the female voice will also be necessary for similar reasons. Most speech synthesizers offer no female voice; those which do often create a "female" voice by simply scaling the male voice. In all cases, the female voice still lags behind the male voice in naturalness. Also, females who have a vocal impairment are justifiably dissatisfied with having to speak using a male voice. Klatt (1987) cites the case of a 16-year-old Arizona girl, the victim of an auto accident, who refused to use a synthesizer because it offered only a male voice. This disaffection with using a voice which does not fit the sex, age or personality of the user is understandable and thus the improvement of synthesis of female voices is vital. Very recent research (Klatt and Klatt, 1990) suggests interesting paths for improvement of the female voice in formant synthesizers. High quality diphone synthesizers, of course, can offer a female voice as a separate entity and may enjoy a temporary advantage in this regard, although ultimately they may not be as extensible for the disabled user and are often, but not always, less natural.
Less work and analysis has gone into the acoustics of children's speech than that of adults, although some very detailed work has been done in first language acquisition as well as the characteristic sociolinguistic phenomena associated with children's speech. In an admittedly oversimplified way, the acoustics of children's speech can be differentiated from that of adult speech in a number of ways. For one thing, prior to roughly age 8, it is rather difficult to determine the sex of a child simply from listening to his speech. However, it may still be the case that voices of male and female children need to be dealt with in fundamentally different ways, much like the speech of adult males and females. Currently, the synthesized voices of children are simply scaled from the adult voice parameters. However, the acoustics are much more complicated. Children's voices operate at higher pitch frequencies. The vocal tract is shorter than the 17 cm. average for an adult; the vocal cavities and head size are smaller, and so on. Children's speech also has intonational and discourse patterns quite distinct from those of adults. But while we can often perceive the relative age of a child from hearing only his speech, synthesizing these perceptual qualities adequately, even at the segmental phonemic level, is still relatively difficult.
New voices also need to be created to more closely emulate qualities of the voices of older children and adolescents since the only choice currently is between adult and pre-adolescent child. The flexibility of a high-quality formant synthesizer (and one of its principal advantages over diphone, demisyllable or syllable synthesis) is that it allows the user the ability to easily modify acoustic characteristics for voices of individuals of various ages, both male and female. This is crucial since even a rough matching of the general characteristics of a voice (in age and sex) to a particular user is extremely useful and will promote the use of the technology among individuals of different ages.
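The "simple scaling" criticized above can be made concrete. Formant frequencies vary roughly inversely with vocal tract length, so shortening the modeled tract raises all formants proportionally; this captures part of a child's voice but, as the text notes, misses pitch, intonation and discourse differences. The formant values below are rough textbook approximations, and the function is a sketch, not a calibrated synthesizer routine.

```python
# Illustrative sketch of naive formant scaling by vocal tract length.
# Formant frequencies scale roughly inversely with tract length, so a
# shorter (child) tract raises all formants. Values are approximate
# textbook figures, invented here for illustration only.

ADULT_MALE_TRACT_CM = 17.0  # the "17 cm average" cited in the text

def scale_formants(formants_hz, tract_length_cm):
    """Scale adult-male formant frequencies to a different tract length."""
    factor = ADULT_MALE_TRACT_CM / tract_length_cm
    return [round(f * factor) for f in formants_hz]

# First three formants of an adult-male /a/-like vowel (approximate).
adult_a = [730, 1090, 2440]
child_a = scale_formants(adult_a, tract_length_cm=12.0)  # young child
print(child_a)
# [1034, 1544, 3457]
```

The point of the surrounding discussion is precisely that such uniform scaling is too crude: real children's voices differ in pitch register, formant spacing and prosody in ways a single factor cannot reproduce.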
Even with highly natural synthesized speech, an interesting logistical problem remains. How does a person output speech in a reasonably efficient manner so as to be able to enter into a "normal" conversation? How does a person respond to a question without having to type each and every word of the answer into a computer? This and similar problems in the logistics of dyadic communication are now being addressed. Better algorithms are being developed to predict speech output for people with vocal dysfunction, thereby encouraging more discourse and facilitating communication. An early example of this was the Bliss system (Carlson et al., 1982; Hunnicutt, 1984) which contained about 500 symbols which could be translated into common words and longer utterances. This system was helpful in encouraging individuals with severe speech and language impairments to use speech synthesis. Work has also been done using a predictive system which displays the most frequent word from a word fragment (Hunnicutt, 1985). Carlson (personal communication) has pointed out that the software is even designed to adapt itself to the user within preselected semantic domains.
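The core of such fragment-based prediction can be sketched simply: rank the words that begin with the typed fragment by their frequency and offer the top candidates. This is a toy illustration in the spirit of the systems cited, not their actual implementation; the tiny frequency table is invented, and a real system would use a large corpus and, as noted, adapt to the individual user.

```python
# Minimal sketch of fragment-based word prediction: given a typed
# fragment, suggest the most frequent completions. The tiny "corpus"
# below is invented purely for illustration.

from collections import Counter

corpus = "the quick fox quick question the quick question quiet".split()
freq = Counter(corpus)  # word -> occurrence count

def predict(fragment: str, n: int = 3):
    """Return up to n completions of `fragment`, most frequent first."""
    candidates = [w for w in freq if w.startswith(fragment)]
    return sorted(candidates, key=lambda w: -freq[w])[:n]

print(predict("qu"))
# ['quick', 'question', 'quiet']
```

Adapting to the user, as Carlson describes, amounts to updating the frequency table from the user's own output, optionally keeping separate tables for preselected semantic domains.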
Other interesting research in this area is currently being conducted at the University of Dundee in Scotland (Murray, 1988; Alm and Newell, 1989; Newell, 1990). Drawing on sub-branches of linguistics such as discourse analysis and pragmatics, software has been developed to assist normal conversational interaction. CHAT (Conversation Helped by Automatic Talk) is an interactive system which helps predict, for the user, the next most likely speech act. This is done by factoring in an analysis of the speech act (e.g., the participants, the mood, etc.).
In 1908, Henry Ford said, referring to his new Model T, "You can have any color you want, as long as it's black." Fortunately, we now have a bit more choice as to color and functionality of automobiles. But unlike automotive technology, which has had a tendency to concentrate more on the aesthetic than the functional (i.e. the internal combustion engine has remained essentially the same since the inception of the gas-powered automobile), we need to do more to increase the functionality of various assistive technologies because of the diversity of the client base and the immediate need of this population. Individuals who have some speech impairment need to be able to break into conversations more flexibly, summon attention, participate in a conversation in a reasonably rhythmic manner, or speed up communication when required. They also need to be able to maintain eye-contact while speaking, communicate with non-readers such as very young siblings or children and talk to others over the telephone or in another room. Those of us who do not have such disabilities tend to take these conversational primitives for granted. Yet these are issues which have not been adequately addressed by the existing functionality of speech I/O devices. For example, Scherer (1979) shows that in conversational turn-taking, an increase in amplitude is an efficient method of defending or reclaiming one's turn (i.e., a person tends to talk louder when he wishes to interrupt or when he does not want to be interrupted himself). Extroverts tend to avoid long pauses and have a higher mean F0 when measured across the entire discourse. We all tend to understand the frustration of the non-speaking person who uses only a headpointer. But try to imagine the extroverted individual who is constrained to output speech in a relatively non-controlled, mechanical way with no method of increasing amplitude or pitch, no way of interrupting a colleague, no way of defending his turn when someone interrupts him.
In other words, lack of a speech device may inhibit communication, but inadequate functionality in the speech device tends to frustrate basic personality traits.
We specifically need better response time in initiating and terminating speech. Timing in dyadic communication is critical, and the abilities to break into the conversation at the right time, offer affirmation and agreement sounds, or break off speaking when interrupted are all necessities. The developer and the user need to be able to manipulate prosodics more easily for expressiveness. The learning-disabled individual needs the ability to syllabify text in both the phonetic and the orthographic domain, and the ability to segment utterances phoneme by phoneme. Also, since some sighted individuals are able to visually scan text at rates of up to several thousand words per minute, synthesizers need faster top speaking rates (i.e. rates well above the 550 wpm mark).
In addition, any initiative to broaden functionality should also focus on adults who need AAC. According to Blackstone (1990), it is the adult population (rather than persons under 22 years of age) that has been less well served by professionals. In a recent survey done by Augmentative Communication News, only 8% of professionals claim to work with people over 65 years of age. This trend will be exacerbated in the near future since older individuals (those over 65) will represent a very large portion of the population as the baby boomers near retirement.
Technology, of course, is useful only if people who need it are aware of it and those who might use it have access to it. It is unfortunately the case that many individuals who could benefit from speech I/O are unaware of its existence or its advantages. While there is a small group of well-known professionals who are the innovators in the use of speech I/O for people with disabilities, many more professionals at the clinician and physician level are unaware of the technology or, if aware of the existence of the technology, are unaware of the state of the art. Therefore, there is a need for training, education and informative publicity for new technology as well as for applications which utilize that technology. For example, the new Information Age exhibit at the Smithsonian National Museum of American History in Washington, D.C. now contains an exhibit of speech synthesis used as assistive technology and this is viewed by thousands of people every day. But we need many more public exhibits such as this to educate the general public.
There needs to be a closer dialogue between clinicians and developers of technology as well as between developers of application software products and the major hardware vendors. Individuals with expertise in developing hardware and software are typically not the same as those with expertise in evaluation of appropriate functionality for people who have vocal, visual or learning problems. And individuals within those groups often do not overlap with a third group comprising the clinicians who have daily contact with physically-challenged men and women.
Larger corporations with the necessary resources need to become more responsive to the needs of the disabled population of the United States. Similarly, individuals concerned with assistive technology need to become creative about utilizing the resources of large corporations. Some large corporations, however, are beginning to take notice. In 1989, for example, Digital Equipment Corporation determined that a group should be established to address the needs of people with disabilities and formed the Assistive Technology Group. The ATG had a simple vision: (a) to make the company a leader in the assistive technology field by providing state-of-the-art technology to assist people with disabilities through collaborative efforts with industry and leading non-profit institutions; and (b) to be a catalyst and role model to energize other companies to work collaboratively with non-profit institutions. Its mission was equally simple: to utilize Digital's technology and resources to advance the state-of-the-art of assistive devices and associated technology and to facilitate the widespread availability of assistive devices.
As one of its first projects, the Assistive Technology Group assisted the Communication Enhancement Clinic at Children's Hospital Boston, through a substantial grant from Digital, in developing a small, light-weight battery-powered speech synthesizer for use by individuals with vocal impairments. In this case, Digital provided the expertise on hardware, software and speech technology and Children's Hospital provided additional software support and expertise on functionality and applications. More partnerships such as this need to be developed between large corporations and non-profit institutions which deal with AAC. In this way, professionals such as clinicians, physicians, and engineers can combine their resources, knowledge and talents with specialists from the computer industry in hardware, software, linguistics, acoustic phonetics and signal processing to initiate projects which specifically address assistive technology needs of the disabled community.
In recent years, more work has gone into the development of both speech synthesis and speech recognition for languages other than English. Increasingly, both phoneme formant synthesizers and especially diphone synthesizers have successfully synthesized a number of other languages. People with disabilities who speak many of the more common European languages now have the ability to use speech output devices. A great deal more work has to be done, but reasonable quality synthesis can now be produced in the standard language of almost every major country of Western Europe. English, German, French, Spanish, Italian, Dutch, Portuguese, Swedish, Danish and Norwegian have all been successfully synthesized. Similarly, discrete word speech input algorithms which handle a (smaller) number of Western European languages are now available. This will result in more widespread use of speech I/O technology in countries outside the United States as well as inside the U.S. by individuals who may not have an adequate command of American English.
There may be no dramatic breakthroughs in increasing the naturalness of formant synthesis within the next few years. Speech technology, like any science, is cumulative. The paradigm shifts of scientific revolutions are still relatively uncommon and the chance is small that some dramatic breakthrough will occur in the synthesis or recognition of speech. As the science historian Thomas Kuhn observed a number of years ago, these revolutions occur not when we find new answers to old questions, but rather when we begin asking new questions. In the meantime, there is no magic elixir, no formula for natural synthetic speech or for high accuracy large vocabulary, connected word speech recognition. Naturalness in synthetic speech and high accuracy in speaker-independent, continuous speech recognition will likely result only from a more detailed analysis and understanding of linguistic phenomena (both production and perception) and the subsequent painstaking attempts to algorithmically replicate those phenomena one at a time.
As synthesized speech becomes more natural, it should also become more widely used. The individual with severe speech and language problems will be able to communicate more intelligibly and effectively as high quality synthesizers become less expensive and more portable. Communication needs of individuals with speech impairments will generate integrated speech-output systems with less cumbersome and less costly synthesizers running on laptops and the smaller, faster computers of the future. The high quality speech synthesizers will eventually be available in chip form and will be integrated into these small systems. Improved speaker technology will also help downsize these systems and make them more portable. More importantly, the characteristics of the speech will be customizable by the user. The next few years should also see more natural sounding voices for both sexes, further ease of manipulation of the voice characteristics, and a closer coupling with predictive dialogue algorithms to increase the speed and efficiency of dyadic communication.
Speech synthesizers of the future will offer a range of emotional parameters which will provide users with the ability to convey various emotions by allowing the prosodics to match the semantics of the utterance. A user will be able to produce a sentence such as "This is exciting technology!" and convey fervor rather than boredom.
Eventually, our further understanding of parameters connected to age and sex will provide for greater naturalness. This will allow the speech synthesizers of the future to offer a continuum of voice types which can be used to select a wide variety of voices of both sexes and different ages. Voices will be customizable by the user via flexible applications without resorting to trial-and-error experimentation with acoustic parameters. Some screen-reading applications are already beginning to do this.
Eventually, high quality speech synthesizers will be manufactured in the form of microchips and, with improvements in speaker technology (currently one of the biggest limitations in size reduction), these chips will be integrated into cosmetic artifacts such as jewelry: brooches, wristwatches or tie clasps. A small amount of digitized speech can already reside on a device, complete with batteries and a small speaker, approximately the size of a business card. Miniaturized audio speakers will still produce high quality speech. Various assistive devices, using microchip technology, will replace the monotone electro-larynx, offering more natural sounding and aesthetically pleasing speech; somewhat later, they may even be fine-tuned to the original idiolectal characteristics of the speaker.
Speech recognition and resynthesis algorithms will be combined in applications which will convert dysarthric output into natural sounding and highly intelligible speech. Even severe transmission problems may no longer be an obstacle to effective communication with machines. Speech-impaired individuals are now beginning to experiment with speech recognizers and, in spite of difficulties caused by the inconsistent nature of some speech, the results are somewhat encouraging even if occasionally overstated. Shane (personal communication) has suggested that in certain cases of dysarthria, errors are consistent and recognition scores increase when the same passage is used several times. Furthermore, even if a number of speech parameters are dysfunctional, the spectral information which is received may still be sufficient (Boysen 1991). Similarly, the speech of many deaf individuals has enough consistency to make the use of this technology attractive. In one experiment (Carlson 1988) in which subjects had both hearing impairment and cerebral palsy, subjects scored fairly well on a vocabulary of 50 words, some achieving a greater recognition rate by machine than by human. Among the conclusions were that, with the addition of subset vocabularies as well as contextual information, the technology shows great promise, especially for those subjects with cerebral palsy. It is also likely that linguistic problems related to certain types of aphasia will be better understood in the future, perhaps leading to more intelligent recognition algorithms. For example, Broca's aphasics demonstrate a marked drop in F0 (fundamental frequency) over connected two-word utterances no matter what the silence interval is between the words, and utterance-initial words are predictably longer in duration than utterance-final words (Cooper and Cooper 1980).
Experiments such as these show promise that as the neurolinguistics of these speech and language problems become better understood, certain impairments will be more amenable to the use of speech recognition. The output of such speech could then be translated, and transformed into highly intelligible synthetic speech. This might be done with a text-to-speech device or perhaps even through a resynthesis of the original signal (again assuming a detailed understanding of the spectral characteristics of the speech). This would dramatically increase communication and even make conversational interaction a reality. Individuals with severe speech and language problems who speak different languages will be able to communicate with one another using a combination of automatic speech recognition, speech synthesis and machine translation technologies. Japanese companies are already experimenting with this combination of technologies (with normal speakers) for use in international telephone transactions.
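The recognition-translation-synthesis chain described above can be sketched as a simple three-stage pipeline. The stage functions here are stubs standing in for real recognition, machine translation, and synthesis components (which did not exist in integrated form at the time of writing); the toy dictionary and function names are assumptions for illustration only:

```python
# Sketch of the speech-to-speech pipeline described in the text:
# recognize the input speech, translate the resulting text, then
# resynthesize it as intelligible speech in the target language.
# Each stage is a stub; real systems would replace these functions.

def recognize(audio: bytes) -> str:
    """Stage 1: speech recognition. Stub returns a fixed transcript."""
    return "hello world"

def translate(text: str, source: str, target: str) -> str:
    """Stage 2: machine translation. Stub with a toy word-for-word table."""
    toy_dictionary = {("en", "fr"): {"hello": "bonjour", "world": "monde"}}
    table = toy_dictionary.get((source, target), {})
    # Unknown words pass through unchanged.
    return " ".join(table.get(word, word) for word in text.split())

def synthesize(text: str) -> bytes:
    """Stage 3: text-to-speech. Stub encodes the text as bytes."""
    return text.encode("utf-8")

def speech_to_speech(audio: bytes, source: str, target: str) -> bytes:
    """Chain the three stages into one speech-to-speech conversion."""
    return synthesize(translate(recognize(audio), source, target))

output = speech_to_speech(b"...", "en", "fr")  # -> b"bonjour monde"
```

The interesting engineering is entirely inside the stubs; the architecture itself is just function composition, which is why the commercial experiments mentioned above can mix and match components from different vendors.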
There will be many more stand-alone applications making use of a variety of on-line biofeedback techniques for individuals with dysarthrias or hearing loss, which may lead to some functional improvement. These will be workstation-based, will combine a knowledge of neurolinguistics and acoustic phonetics, and will allow a person to view his speech output and compare it to normal speech through simplified spectral displays. Such applications already exist, e.g., Say and See (Rosenblum 1991). Many of the currently existing techniques in articulation therapy, and even specific applications which focus on a particular disorder, e.g., Melodic Intonation Therapy (Albert et al. 1973), Lingraphica (Steele 1989), will be automated, offered on inexpensive media, and usable on workstations.
Speech technology will show up in the home as well and television will have expanded programming for people with disabilities. For example, a new application called "Descriptive Video" is currently being piloted on Public Television. This allows unsighted individuals to enjoy a movie through the use of elaborate descriptions which are recorded on a separate audio channel and spliced into the movie. Currently, the manufacture of descriptive videos is extremely time-consuming, but the use of speech technology has the potential to make production easier and such videos more widespread.
The increased awareness on the part of the general public about persons with physical disabilities will be reflected in the school curriculum. For example, American Sign Language (ASL), which is the third most widely used language in the U.S. (after English and Spanish), may well become an elective within the "foreign" language curriculum. In fact, this is being discussed at the present time, since some school systems are beginning to understand that ASL is perhaps more utilitarian than many other languages now taught in the schools and certainly would augment communication between hearing and non-hearing individuals.
A great deal of assistive voice technology will be generated from commercial applications. Major progress is being made because voice is a technology which supplements input devices such as the keyboard and mouse and which helps compensate for the sensory overload on our vision and the limitations of manual dexterity. The link here is that the non-disabled individual needs speech technology for essentially the same reason as the disabled individual: the need to work more expediently and play more enjoyably. Non-disabled individuals simply have a different point d'appui. While the priorities of activities within speech I/O technology necessarily change when discussing its use as assistive technology rather than within the context of telecommunications or office automation, many of the goals will remain the same and a great deal of the groundbreaking work done in assistive technology will soon become useful for the commercial markets as well. Conversely, the commercial applications such as those for telecommunications and office automation will also spawn seminal research and development which will benefit assistive devices.
As an example of this, the computer industry now recognizes that the next generation of workstation, which is right around the corner, will be a multi-media workstation: one which will combine graphics, image, and voice, with voice serving as both input and output. The technology is here to do much of what the applications will soon call for, e.g., voice-activated window systems, verbal feedback on input commands, the integration of the telephone into the workstation, and the like. It is the user interface and the applications which still need to be determined. But it is a small intellectual leap from this generic workstation to one which could be used by physically-challenged individuals. And these workstations, with earphones to preserve the privacy of the user and of those nearby, will be found in universities, libraries, clinics and, of course, the home. They will be everywhere.
On-line services via the telephone will be ubiquitous. One such application, again from the commercial space, will be the "talking newspaper." Similar to the Talking Yellow Pages which is already in existence, the talking newspaper will use speech synthesis and DTMF tones or speech recognition to allow access to any information contained in the daily paper: sports, weather, news, etc. Newspapers will also be on-line at every public library and therefore, accessible to the unsighted individual at the other end of a telephone line.
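At bottom, a talking-newspaper service of the kind described is a menu dispatcher: the caller's DTMF keypress selects a section, and the section's text is handed to a speech synthesizer. The menu layout, key assignments, and section texts below are invented for illustration:

```python
# Illustrative DTMF menu for a hypothetical "talking newspaper":
# a keypress selects a section, whose text would then be passed to
# a speech synthesizer. Keys and section contents are assumptions.

SECTIONS = {
    "1": ("sports", "The home team won last night..."),
    "2": ("weather", "Sunny today with a high of 75..."),
    "3": ("news", "The city council met on Tuesday..."),
}

def handle_keypress(key: str) -> str:
    """Return the text to synthesize for a given DTMF keypress."""
    if key in SECTIONS:
        name, text = SECTIONS[key]
        return f"{name.capitalize()}. {text}"
    # Unrecognized keys replay the menu prompt.
    return "Press 1 for sports, 2 for weather, or 3 for news."

story = handle_keypress("2")   # weather section text
prompt = handle_keypress("9")  # unrecognized key -> menu prompt
```

With speech recognition in place of DTMF, only the key-detection front end changes; the dispatch logic stays the same, which is one reason such services can offer both input methods.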
More books and periodicals will be available from bookstores and libraries in machine-readable form and thus can be linked to a speech synthesizer in the home for verbal output. Manuals for everything from food processors to microwave ovens will also be machine-readable, since all such devices will come with computerized displays and on-line help. In addition, reading machines will be found in more public places, including libraries and schools. More importantly, they will become a commodity, much like VCRs.
Home automation, now being called the "smart house", is expected to become a $2-billion business by the year 2000. This will involve automation of lighting, entertainment systems, security, energy and communications, and voice will undoubtedly play a significant role. Commercial companies are already beginning to offer voice activated and voice-controlled appliances for the home. A voice controlled VCR is now being marketed in the U.S. In the future, the individual with a motor impairment will find daily routines less physically challenging - not because of voice-controlled TVs and VCRs but because of voice controlled light fixtures, ovens, microwaves, doors, and so on. Automotive companies are beginning to experiment with voice output for warnings ("Your door is ajar.") and some cellular telephone companies now offer speech recognition for voice dialing. Voice recognition will eventually be used to control selected aspects of the car itself (heat, defrost, radio, etc.), and much of the research done for voice in airplane cockpits will prove to be useful for these more mundane applications. The user interface and microphony issues will be elegantly resolved and talking to your microwave oven should be no more abnormal than talking to your children, with the exception that the microwave oven will always listen to instructions.
The transition into the next century will likely not see robots running around the kitchen fetching and cooking; rather, it will be marked by the intelligent interconnection of existing speaker-specific electronic devices which will be voice activated and voice controlled. Some dysarthric speech will no longer be an obstacle to communication and motor-control problems will not be as constraining. Both businesses and new houses will be constructed for economy of movement for non-disabled individuals, and this will benefit disabled individuals as well.
As long as technical individuals are the sole users of high technology such as speech I/O, then we, as champions of this technology, will have failed. We must help to make speech I/O a commodity. We live in an age of electronics. The image of a child with a headpointer and communication board watching a movie on a sophisticated electronic device such as a VCR is a strange juxtaposition. Electronic hardware is already available to help most non-vocal individuals, but we must do all we can to make its existence known and to encourage its fullest utilization.
We can dream. But our dreams should be inspired by reality and state-of-the-art technology. They need not be relegated to science fiction and the bizarre, and they need not contradict basic laws of physics and pragmatics. We are now at the turn of the 21st Century. Those who lived 100 years ago would have been unable to describe life as it is at the present time. They would have been unable to easily contemplate open-heart surgery and organ transplants, supersonic flight and the space shuttle. We too are fairly incapable of imagining life at the turn of the 22nd Century - our grandchildren will see it first hand, of course - not because we are unintelligent or unable to articulate what does not yet exist, but because science has an inexorable way of expanding technology at a non-linear pace. The innovator always dreams, but he should recognize that such fantasies are often preludes to reality and that illusions are necessarily intertwined with actuality. He knows that technology will progress at an increasingly faster pace - as it always has. And he must constantly ask himself the inescapable question: Will I play a part?
[Tony Vitale is a Senior Consulting Engineer in Linguistics and Speech Technology at Digital Equipment Corporation and holds a Ph.D. in General Linguistics from Cornell University.]
Over three very hot days in Palm Springs, California, in early October 1991, a group of individuals dedicated to the advancement of Voice I/O were led by Dr. Sarah Blackstone through a series of group-process strategies that resulted in the identification of eight priorities in this field. There is no magic about eight, or six, or ten priorities. A different group at a different time in a different place might have come up with a different set of priorities.
However, this group arrived at these eight (in ranked order):
Harry Murphy (Conference Director) is the Director of CSUN's Office of Disabled Student Services. He is also the Founder and Director of CSUN's Annual, International Conference, "Technology and Persons with Disabilities." Dr. Murphy travels frequently to Australia, New Zealand and Europe to consult with technology conferences which are modeled after the CSUN conference.
Sarah Blackstone (Workshop Team Leader) is the President and CEO of Sunset Enterprises, Monterey, California, and author of Augmentative Communication News. Dr. Blackstone was awarded the First Distinguished Service Award in 1990 by the International Society of Alternative Augmentative Communication in Stockholm, Sweden. She has served as group leader for Think Tanks co-sponsored by IBM and the American Speech and Hearing Association.
Gary Poock (Workshop Facilitator) is Editor-in-Chief of the Journal of the American Voice Input/Output Society (AVIOS) as well as a founding member of the society and a director of the organization. Dr. Poock is also a professor in the areas of human factors and human-computer interface design in Monterey, California. Marshall Raskind (Workshop Facilitator) is the Coordinator of CSUN's Computer Access Laboratory and Learning Disabled Program. Dr. Raskind is Editor of the national Learning Disabilities Electronic Bulletin Board on SpecialNet. Along with Neil Scott, he is the co-developer of "SoundProof," a new screen reading program for learning disabled persons.
Diane Bristow (Workshop Facilitator) is a trainer at CSUN under a grant from Rehabilitation Services Administration (RSA). Ms. Bristow regularly conducts assistive device training sessions for rehabilitation counselors, facilities personnel and employers in California, Arizona, Nevada, Hawaii, Guam, Saipan and American Samoa.
Jodi Johnson (Recorder) is the Administrative Assistant for CSUN's Office of Disabled Student Services. Ms. Johnson has coordinated the speaker schedule and technical equipment needs for the "Technology and Persons with Disabilities" conference. She has also used desktop publishing programs to produce much of the printed material for the conference, including the Conference Program.
Jennifer Zvi (Recorder) is a Learning Disability Specialist at CSUN. Dr. Zvi assesses those students who suspect they have a learning disability, and if the student meets state eligibility requirements, provides educational support services to meet their needs. She also sits on the national Board of Directors of the Orton Dyslexia Society.
Nina Treiman (Recorder) is a counselor in CSUN's Office of Disabled Student Services. Ms. Treiman is also assistant director of CSUN's annual, international conference, "Technology and Persons with Disabilities."
Tom Lawrence (Recorder) is a clerical support person at CSUN's Office of Disabled Student Services with responsibilities for tracking both disabled and veteran students. Mr. Lawrence is a disabled veteran.
Tony Vitale is Principal Engineer in Linguistics and Speech Technology for the Assistive Technology Center at Digital Equipment Corporation, Northboro, Massachusetts. Dr. Vitale is a former Professor of Linguistics and has been working on various aspects of speech and language since 1971. He was twice appointed Senior Fulbright Professor in Linguistics and has lectured in a number of countries in the Near East, Africa and Europe. Dr. Vitale is on the Board of Directors of the American Voice Input/Output Society (AVIOS) and is current Vice-President of that organization. He was a member of the development team for the DECtalk speech synthesizer and has worked in speech I/O technology at Digital since 1983.
Richard Amori is a Professor and Chairman of the Computer Science Department at East Stroudsburg University located in the Pocono Mountains of Eastern Pennsylvania. Dr. Amori is a former Naval Officer and was a Research Scientist at Bell Labs, General Electric and Honeywell. Two recent projects are CASL (Computer Aided Sign Language) a prototype translator from voice-to-American Sign Language, stored on a video disk, and Vocomotion, a prototype voice controlled wheelchair.
Carl Brown is an independent management consultant, specializing in the areas of technology for people who are disabled. Mr. Brown is the president of two companies, CPB Group and Abilities Development Associates (ADA). Prior to becoming a consultant he was employed at International Business Machines (IBM) for more than twenty-nine years.
Robin Burkholz is a licensed speech-language pathologist who is in private practice and operates her own company called CompuSpeech in Los Angeles. Ms. Burkholz specializes in providing home-based services using technology to children and adults with special needs. Robin is past president of the national organization, Computer Users in Speech and Hearing.
Jeff Burnett, a Professor in Washington State University's School of Architecture, configures voice and head movement-operated CAD workstations for quadriplegics. Dr. Burnett's current funding from the U.S. Department of Education has his programming staff completely re-writing support software for the VOTAN VPC speech card, so that it runs under Windows 3.0, dedicated for voice only users at Seattle's Resource Center for the Handicapped.
Kevin Caves is with the Rancho Rehab Engineering Program/Center for Applied Rehab Technology (CART)/Rancho Los Amigos Medical Center, Los Angeles. Mr. Caves is currently performing research in the area of access to technology and integration of the control of assistive technologies. He also works as part of the technology team at CART which recommends assistive technology to people with disabilities.
Wayne Chenoweth is a Training Specialist/Instructor at the California Community Colleges High Tech Center Training Unit, DeAnza Community College, Cupertino, California. Mr. Chenoweth conducts trainings for High Tech Center staff from throughout California as well as many other states in the use of adaptive computer technology including a one-day workshop on speech input. He is also editor of the High Tech Center News newsletter.
Frank DeRuyter is Director of the Communication Disorders Department and the Center for Applied Rehabilitation Technology (CART) at Rancho Los Amigos Medical Center, Los Angeles. Dr. DeRuyter serves on numerous state and national association committees and has presented as well as published in the areas of augmentative communication, service delivery, assistive technology, and quality improvement programming.
John Drescher is innovations manager for Special Education Technology - British Columbia (SET-BC), Vancouver. The SET-BC program is a provincial government initiative, established to assist school districts in educating school-aged students with physical disabilities or visual impairments through the use of technology. Presently, Mr. Drescher is initiating several pilot projects involving voice recognition systems.
Carol Esser is an Advisory Planner with IBM Special Needs Systems, Boca Raton, Florida. Ms. Esser is responsible for the planning of computer access products for mobility impaired persons. She was a member of the Special Needs development team for Phone Communicator and VoiceType.
John Eulenberg is an Associate Professor in five different departments at Michigan State University. Dr. Eulenberg is the publisher of Communication Outlook. His research includes work in voice output in Indian languages, Chinese, Black English and Hebrew.
Douglas Forer directs the Technology Research Laboratory at Educational Testing Service in Princeton, New Jersey. Mr. Forer's initiatives are aimed at influencing policies and implementing procedures which will assure access to ETS computer-based tests. To this end, voice technologies will play a significant role.
Jim Fruchterman is President of Arkenstone, Sunnyvale, California, a non-profit firm which is the leading producer of reading systems for people with visual and reading disabilities. Mr. Fruchterman was a technical co-founder of Calera Recognition Systems, maker of the award-winning optical character recognition products for personal computers, TrueScan and WordScan, where he held the positions of Vice President of Finance and of Marketing.
Deborah Gilden is Associate Director of the Rehabilitation Engineering Center at the Smith-Kettlewell Eye Research Institute in San Francisco. Dr. Gilden is the developer of a number of talking game modules for blind children which use synthetic speech to teach forms, geography and time. Adaptive Communication Systems manufactures and distributes her "Flexi-Formboard."
Cheryl Goodenough-Trepagnier's disciplinary background is in linguistics, and her research over the last fifteen years has been directed at disorders of communication and development of technology to compensate for them. Dr. Goodenough-Trepagnier, along with Michael Rosen, has received NIDRR funding for a research project at Tufts University School of Medicine in Boston to develop and test a technique for utilizing commercial speech recognition systems in computer interface designs for people with dysarthria.
Bonnie Haapa is Manager of Linguistics at Emerson & Stern Associates, Inc., San Diego, California, a software research and development firm specializing in speech and language engineering. Ms. Haapa oversees linguistic and human factors aspects of software design and conducts linguistic experiments in language learning and Voice I/O for individuals with speech impairments, head injuries, hearing impairments, and learning disabilities, as well as for speakers of English as a second language.
Jeff Higginbotham is with the University of Buffalo, Communication and Assistive Devices Lab.
David Horowitz is Assistant Clinical Professor of Rehabilitation Medicine at Tufts University School of Medicine and Research Fellow of Mechanical Engineering at MIT. Mr. Horowitz is currently the Principal Investigator on a 3-year Rehabilitation Services Administration (RSA) Special Projects and Demonstration grant focused on the use of Large Vocabulary Speech Recognition Systems in the vocational rehabilitation of severely disabled individuals.
Paul Jones is Associate Professor of Counseling & Educational Psychology at the University of Nevada, Las Vegas and consulting neuropsychologist for Nevada Services to the Blind. Dr. Jones' research interests include a number of publications on assessment with visually impaired persons.
Cliff Kushler is the Technical Director of Research and Development at Prentke Romich Company, Wooster, Ohio, the largest manufacturer of electronic augmentative communication devices. Dr. Kushler received his doctorate from the University of Tokyo, where he developed two speech training systems for the hearing impaired, designed specifically for the Japanese language. He has worked on the design and development of a number of AAC devices, including integration and modification of various speech synthesizers used in the device.
Dick Moyer is the President of Compeer, Inc., of San Jose, California, a manufacturer of a family of powerful computer-based augmentative speech systems for use in school, college and the workplace. Mr. Moyer is actively involved in the research and design efforts to develop innovative new products that fill current and expected future needs in the areas of augmentative communication and voice/communications interface technology.
Mark Rosen is president of Nexus Applied Research, Inc., an electronics R&D firm. He is an electronics engineer with experience in rehabilitation engineering and telecommunications. Mr. Rosen is currently developing a very flexible interface to DEC, Sun, and other high powered workstations for people with disabilities. The work is being sponsored by the U.S. Department of Education and features voice input and output.
Neil Scott is an engineer at CSUN who has specialized in the design and application of assistive technology for the past fifteen years. Mr. Scott is currently developing a Universal Access System which uses speech I/O, among other strategies, to enable any disabled individual to use any computer.
Robert Segalman is a research analyst and consultant on disability with the California Department of Justice, Sacramento. Dr. Segalman also chairs the California Relay Service Advisory Committee of the Public Utilities Commission. This committee oversees the telephone relay services available to Californians with speech and hearing disabilities.
Milo Street is President of Street Electronics Corporation, Carpinteria, California, which manufactures the Echo line of speech products. Mr. Street has been designing speech input and output hardware and text-to-speech systems for microcomputers for over ten years. He has also been heavily involved in speech compression and recognition research.
Rich Walsh is the Executive Director of the Resource Center for the Handicapped, a private, non-profit vocational technical training institute dedicated to increasing the level of independence of individuals with severe physical and sensory disabilities. Rich founded the Center in 1981 to equip disabled individuals with the marketable skills necessary to mainstream them in the workplace.
Bruce Wilson, of Boeing in Seattle, Washington, is responsible for identifying computer-science-based technologies, assessing their impact on BCS strategic business plans, and consulting with BCS technical programs and executive management on those assessments. He manages the BCS Long-Range Technology Forecast and Long-Range Technology Requirements teams, and explores the need for human-adaptable interactive interface capabilities, including virtual reality systems; heads-up displays for manufacturing and inspection; voice input and output for computing in engineering and operations; natural language interfaces; expert-system front ends; and RF/cellular portable systems access.
Hale Zukas, a headstick user, is a program analyst at the World Institute on Disability of Oakland, California and a member of the Equipment Program Advisory Committee of California's Deaf and Disabled Telecommunications Program. On balance, a plain old conversation board still serves him better in face-to-face communication than any electronic communication aid he has seen.
Pramod Pandya is a consultant in Engineering, Simulation and Computer Graphics software within the Academic Computing Support Group of the Chancellor's Office of the California State University system, Long Beach, California. Dr. Pandya is presently involved with the setting up of the Speciality Centres at the following CSU campuses: San Luis Obispo - Sonoma - CSULA.
Lloyd Yam, of Burlingame, California, is the inventor of the SoundBuster, an add-on board for the PC that serves as a synthesizer and sampler for speech and music. Dr. Yam also pioneered the very first mass educational program on British television (using the videotex system known as Prestel in England). His recent interest is in pay-per-call services, and he has just completed a book entitled "From Rings to Riches" that explains how to make money in 900-number services and telemarketing.