1998 Conference Proceedings



David Arnold
Olive Tree Software, Inc.
9999 Manchester Road, #337
St. Louis, MO 63122-1927
Phone: (314) 961-7735
Fax: (314) 961-7710
WWW: http://www.olivetreesoftware.com
Email: davida@olivetreesoftware.com
Email: info@olivetreesoftware.com

Case studies reflecting success and failure. Comparisons of past and current products from Dragon Systems, Inc., IBM, and Kurzweil. New (6-97) continuous speech technology.


Speech recognition is a useful adaptive technology for a wide range of users including those with motor impairment, learning disabilities, and speech impairment. Criteria for success and failure are presented.

Speech recognition technology has changed dramatically roughly every two years since about 1990. Some past successes and failures remain instructive today. Cognitive limitations can prevent success.

Changes in technology may have ramifications for people with disabilities as the products become easier to use for people without disabilities.

Different vendors' interfaces required different skills. An understanding of the differences between current products can therefore determine an appropriate fit for a particular user.


Previous presentations at "Closing the Gap" and "CSUN" have addressed the use of speech recognition by people with motor impairment, dysarthric speech, motivational problems, and phonological difficulties.

Speech recognition has been the path to independent access for people with various motor impairments acquired as the result of accident, injury, neurological disease, brain injury, or other causes, including repetitive stress injuries. Speech recognition can allow complete control of a computer, all its software, and everything that software can access. This can include employment in a standard setting, access to mainframe computers through a desktop PC, use of the Internet, banking by computer, and the control of environmental control units: appliances, lights, VCR, stereo, television, and other entertainment devices.

People with dysarthric speech present a human listener with a catch-22. Automated speech recognition can be trained to respond to any consistent sound source, but dysarthric speech severe enough to prevent intelligibility also prevents a human listener from determining whether that speech is consistent. Previous presentations have shown speech recognition giving a man with heavily dysarthric speech an advantage: human listeners who understood 20 to 40 percent of his speech were bested by an old version of DragonDictate for DOS, which scored an 80 percent recognition rate. That rate is clearly too low for the rest of us to accept, because our alternatives are more accurate. But for him it was a clear improvement.

Another presentation dealt with students for whom English was essentially a second language. These students demonstrated chronic phonological difficulties with English and, as a result, persistent spelling problems. The presenter reported that the students were able to pick the appropriate word from the choice list of the speech recognition system even when multiple forms of the same-sounding word were presented (their, there). To his surprise, acceptance of incorrect words was in fact nonexistent. In spite of profound spelling problems, speech recognition provided the specific assistance his sample students required for competent performance.

Another presentation reflected that an impaired user with a typing speed of 15 words per minute, who could attain 25 words per minute with speech recognition, nonetheless preferred not to use speech recognition. Other studies have shown that speech recognition is used successfully most often when other means of generating text are not in regular use; more familiar alternatives tend to extinguish use of speech recognition systems.

A recent trainee with a speech recognition system was obviously articulate and motivated. Success with speech recognition has not yet been achieved due to difficulties maintaining the divided attention required for the task. The cognitive requirements of speech recognition certainly need to be explicitly stated, as they will be in this summary.

A prospect for speech recognition was a young student. Difficulty identifying correct spelling in the presence of a correctly spelled but incorrect word proved too much for this candidate. The ability to hold the original thought in mind while being visually stimulated with an incorrect word is required for the use of speech recognition.

A young woman with multiple sclerosis found that the volume requirements and the requirement for consistent volume prevented her from using speech recognition in the afternoon when she became tired and weak. To be successful she was required to use speech recognition during the morning. In the afternoon she performed other tasks.

An adult male quadriplegic on a ventilator discovered that his voice changed enough between his upright posture in a chair and his reclining posture in his bed that he trained both a day voice for standard work and a night voice to handle ECU's (environmental control units) like television, VCR, and stereo controls as well as room lights.


The restraining factor in speech recognition has been the power of the computer. Computer power has determined what could be attempted, and computers are now powerful enough for continuous speech recognition. In the future true artificial intelligence will be a real possibility.

But in the past only the crudest pattern matching was possible, and to create speech recognition with large vocabularies the most successful approach was what was known as a speaker-dependent model. This simply meant that the speaker provided the majority of the information the computer used to discern what words that same speaker was actually saying.

The speaker-dependent model is particularly effective for people who have nonstandard speech. Many speech therapists have approached me with glee in their eyes: thinking that speech recognition required standard pronunciation, they hoped to compel their clients to use standard pronunciation. With early systems, however, your pronunciation could differ wildly from standard. The sounds you made were simply substituted for the sample standard sounds for that word. Recognition was truly customized.
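The speaker-dependent idea can be sketched as nearest-template matching: during training the system stores the user's own samples for each word, and recognition simply finds the closest stored sample, however nonstandard the user's sounds may be. A minimal illustration (the three-number feature vectors and the Euclidean distance are hypothetical stand-ins for real acoustic features):

```python
import math

# Sketch of speaker-dependent recognition: each word is modeled
# by a feature vector recorded from this particular speaker.
# Recognition picks the trained word whose sample is nearest
# to the incoming utterance.

def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def train(samples):
    """samples: {word: feature_vector} recorded from the user.
    Whatever sounds the user actually makes become the model."""
    return dict(samples)

def recognize(model, utterance):
    """Return the trained word whose stored sample best matches."""
    return min(model, key=lambda w: distance(model[w], utterance))

# The user's pronunciation need not match any standard sample:
model = train({"stop": [0.9, 0.1, 0.4], "go": [0.1, 0.8, 0.2]})
print(recognize(model, [0.85, 0.15, 0.35]))  # -> stop
```

Because the only reference is the speaker's own training data, a "wrong" pronunciation is simply the pronunciation, which is why these systems suited nonstandard speech so well.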

Speech modeling systems have changed. With advances that require less training, standard pronunciation becomes more important. This kind of system is called speaker independent. A speaker-independent system knows a great deal about how words sound before it hears any particular speaker. These systems still adapt to individual differences, but the more systematic and advanced the system, the more it expects a good match to the standard sample.

At the time of the name change to IBM VoiceType, IBM combined two products: the personal dictation system and a continuous speech system that worked with a much smaller vocabulary. The resulting product was based on the speaker-independent model. During training, if the way you pronounced a word did not match the sample stored by the system, you would be asked to repeat that word over and over. If you could not alter the way you said the word to the satisfaction of the system, you were forced to skip it manually.

With Dragon Systems, version 2 moved from a speaker-dependent model toward a speaker-independent model. The command to turn off the microphone temporarily, "go to sleep," formerly allowed the letter "t" to be pronounced as a "d"; that alternate pronunciation was as acceptable as any other. Currently the product virtually requires that you say some sort of "t" in the phrase. With the latest update, phonetic models can be produced by spelling phrases phonetically.

To my knowledge, no one has witnessed a degradation in performance for people with speech impairments as a result of the new technologies involved in speech recognition. This may be the inevitable effect of the more powerful processors in today's computers running today's speech recognition. However, some people who are able to use DragonDictate discrete speech recognition cannot use NaturallySpeaking continuous speech recognition because of the breath control required during training.


The major players in large-vocabulary PC-based speech recognition have been Dragon Systems, IBM, and Kurzweil.

Dragon Systems' first product was available on DOS IBM PC compatible computers. IBM VoiceType and its predecessor products actually started on Windows 3.1 and OS/2. Kurzweil Voice has only appeared as a Windows product.

Dragon Systems was the first to produce a Windows 3.1 speech recognition system, followed by compatibility for Windows 95 and Windows NT.

IBM VoiceType was never available on DOS, and worked considerably better on a heavily loaded OS/2 machine than it did on a Windows machine. IBM VoiceType relied on a custom sound card, as did DragonDictate. With the IBM VoiceType product, however, the custom sound card did so much of the processing that the impact of the computer's CPU was somewhat less.

Dragon Systems' DragonDictate allowed complete control of the computer by voice. After the introduction of the Windows product, complete control of a Windows PC was available through DragonDictate for Windows.

IBM VoiceType always required some mouse motion, button clicks, and keyboard interaction for standard use.

Kurzweil Voice also required some mouse motion and keyboard interaction in its normal use.

Dragon Systems and Kurzweil required correction as you dictated. There was a buffer of limited size (about 12 words) within which you could correct; errors further back in your dictation were out of reach for the purpose of correcting the voice file. The IBM product, on the other hand, did not permit correction of errors as they occurred. It required you to complete your dictation, switch modes, read your text, find the errors, and click on them to correct them.
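The limited correction buffer behaves like a fixed-length queue: each newly dictated word pushes the oldest one out of correctable range. A minimal sketch, assuming the roughly 12-word window described above (the word list is invented):

```python
from collections import deque

# Sketch of a DragonDictate-style correction buffer: only the
# most recent ~12 dictated words remain correctable; anything
# older has scrolled permanently out of reach.

BUFFER_SIZE = 12

def dictate(words, size=BUFFER_SIZE):
    """Return the list of words still correctable after dictation."""
    window = deque(maxlen=size)  # oldest entries fall off automatically
    for w in words:
        window.append(w)
    return list(window)

spoken = ["word%02d" % n for n in range(20)]
print(dictate(spoken))  # only word08 through word19 remain correctable
```

The design choice matters for users: with a window like this, errors must be caught almost as soon as they are made, which is exactly the divided-attention demand discussed later.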

Some people preferred the idea of dictating first and correcting later. But in practice it appealed mostly to people who did not correct their own dictation. The IBM system recorded a voice file to which another person could listen to proof a document.

This operational difference marked a major distinction between IBM VoiceType on the one hand, and DragonDictate and Kurzweil Voice on the other.

In practice, this has meant that people who needed complete control hands-free went with DragonDictate, and people who wanted other people to correct their dictation went with IBM VoiceType. Kurzweil Voice served the market for those for whom these considerations were not defining.


Though the DragonDictate interface was easy for me personally, I ran into people with considerable education who had great difficulty dividing their attention among their dictation, the choice list, and the word history that appeared when making corrections. It was an eye-opener to appreciate that such cognitive difficulties might be hidden in people who would certainly not consider themselves learning disabled or cognitively impaired.


With IBM VoiceType there were two requirements. The first was to be able to recognize misrecognitions buried in text dictated minutes or hours earlier. The second arose because the choice list under IBM VoiceType was static: if the word you wanted was not on the list, you had better be able to spell it correctly from start to finish. The DragonDictate product would progressively spell as you provided characters; by the time you had spelled three characters, the word you were looking for was usually on the list.

Even with DragonDictate, a user must be able to spell the first several characters of a word correctly. Spelling errors early in the word distract the progressive spelling engine and result in no help whatsoever. Spelling errors later in the word are rarely relevant, since many people who spell poorly after the first three or four letters are still very capable of selecting the correct word from a list of correctly spelled words.
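Progressive spelling amounts to filtering a vocabulary by prefix as each letter arrives, which is also why an early misspelled letter leaves the user with no help at all. A minimal sketch (the small vocabulary is an invented sample):

```python
# Sketch of DragonDictate-style progressive spelling: each new
# letter narrows the choice list to vocabulary words beginning
# with the prefix spelled so far.

VOCAB = ["their", "there", "therefore", "thermal", "thesis", "trick"]

def choice_list(prefix, vocab=VOCAB, limit=10):
    """Return up to `limit` words matching the spelled prefix."""
    return [w for w in vocab if w.startswith(prefix)][:limit]

print(choice_list("the"))   # -> ['their', 'there', 'therefore', 'thermal', 'thesis']
print(choice_list("ther"))  # -> ['there', 'therefore', 'thermal']
print(choice_list("x"))     # a wrong first letter -> [] (no help at all)
```

Three or four correct letters usually narrow the list enough to pick the target by eye, which is exactly the skill profile the text describes: early letters must be right, later ones barely matter.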


Continuous speech recognition is speech recognition without a pause between words. If you have ever sat in a foreign language class and wondered on the first day where the word breaks were, you understand the problem. Letting the computer know exactly where the breaks between words fall reduces the computational requirements. Today, however, with Pentium processors running at 300 MHz and faster, computational power allows for continuous speech recognition.

IBM has now released ViaVoice. Dragon NaturallySpeaking currently performs better, but the IBM product is sure to improve on its current state.

Dragon Systems' NaturallySpeaking has a much simpler interface than DragonDictate. Divided attention is no longer a concern: everything happens right before your eyes. You select text by saying the text you want selected; you replace text by speaking the text you want to replace it with. You can correct as you go, correct at the end of your dictation, or mix the strategies as suits you. You can have your dictation spoken back to you, or ignore that capability; the system will still function well for you.

The most recent advances include the ability to run DragonDictate, which allows complete hands-free control of your computer, alongside Dragon NaturallySpeaking, which allows continuous speech dictation of text. This combination gives full control over the computer. Presumably, sometime in the future, when even more powerful microprocessors run our computers, these two products will be seamlessly integrated into a single product.

I would feel remiss if I did not say that we should expect speech interfaces on all of our computers in the very near future. Certainly in the two years to 2000, processor power should double yet again. As this power increases, the ability of computers to handle spoken language well increases. And as the market for this technology grows, its price decreases dramatically.


1) Microphone position. Consistent and correct microphone position is one of the most important factors in successful speech recognition with a computer.

2) Consistent volume. Consistent volume, not consistent intonation, is also a very important factor in successful speech recognition with a computer.

3) Natural speech. Speaking naturally means not speaking the way you think you should, but the way you actually do. Most people become significantly self-conscious when speaking to a computer. They say words in unaccustomed ways which, it turns out, they cannot repeat later.

4) Speak clearly. Speaking clearly simply means actually saying each word you intend to say. Many people slur over so many words that they say quite a few fewer than they think they do. Over-enunciation will dramatically hurt speech recognition, but so will leaving out great chunks of words.

5) Consistent pace. Each word should take a time appropriate to its size: a long word should take longer to say, a short word a shorter time. Many people speaking to a computer begin to speak in a very artificial way. One thing people tend to do is give every word exactly the same amount of time. This artificial imposition degrades the ability of speech recognition to operate properly.


Many people, even the very well-educated and powerful, consider computers to be an authority figure. As a result, when prompted they will speak. During training this can produce humorous but unproductive behaviors. Computers do not need to breathe, people do. Many people try to maintain a pace that only a computer would want to maintain. Frequently it is useful to remind people that they are the boss -- the computer is the tool.

When a misrecognition occurs, many people feel criticized by the computer. The computer, of course, has no such intent or meaning. In response, however, people very frequently react in one or more unproductive ways.

(In my personal experience, and from stories related by other people who have watched demonstrations of continuous speech, stage fright or stage tension can radically decrease the accuracy of speech recognition. My solution has been to stop the active demonstration, talk with the audience, and demonstrate speech recognition again after I have had a chance to get to know the audience. This is a problem that does not occur in the privacy of my own office.)

There are sometimes things that you can do to improve the accuracy of your speech recognition which do involve altering the way you speak.

It was only after I began working with speech recognition technology that I discovered I was in the habit of not finishing words that ended with various consonants. I would say "stop" without the "p." I said "truck" by simply stopping the word after the vowel. In fact, I said "eight" as if it had no "t" at all. On discovering this I pledged to reform myself immediately and permanently. As everyone knows, Rome was not built in a day; I could not maintain such a radical change. That taught me to speak naturally first, and to pick single words to improve over the space of weeks, not hours.

Minor changes, not major ones, yield the best improvements in speech recognition performance. With continuous speech recognition, dictating complete thoughts improves recognition. For a time, IBM VoiceType boasted that it considered three words at a time in choosing the correct word; DragonDictate at that time considered only two. Dragon NaturallySpeaking, sometimes very visibly, considers all the words in its buffer before deciding what it will commit to the document.

Speech recognition is certainly a moving target. What was true about speech recognition two years ago, in some ways, hardly applies. The price has dropped dramatically: top-quality speech recognition is now available for $200. Training is still advisable for top performance. In practice, the majority of the cost is in hardware, then training, then software. Seven years ago a computer and speech recognition software cost $15,000. Now you can get a very competent computer for $3,500, training for a thousand dollars, and speech recognition software for under a thousand dollars.

Demonstration of ten lines of training. Demonstration of vocabulary builder. Demonstration of correction and editing, mouse control, and computer control.


Reprinted with author(s) permission. Author(s) retain copyright.