2003 Conference Proceedings



John Rye, Dolphin Oceanic Ltd.
Tel 44 (0)1905 754 577
Email: john.rye@dolphinoceanic.com

Copyright 2002 Dolphin Oceanic Ltd.

1 Abstract

This paper reviews some important aspects of text-to-speech (TTS) synthesis for use by people with a visual disability. It will be of particular interest to those concerned with improving the experience, at work and at home, of people who use speech as their primary access tool. It also describes the main features and benefits of Orpheus version 2, the new speech synthesizer from Dolphin, paying special attention to the features devoted to the needs of visually impaired speech users. We will show the benefits of using Orpheus both where fast speech is needed and where natural sounding speech is required. Demonstrations of some aspects of the synthesis will be given using a screen reader or other talking application.

2 Dolphin and Speech Synthesis

Dolphin Oceanic (the development company of the Dolphin group) is one of the few companies in the disability and access fields that has the core competence required to engage in, and sustain, development of its own speech synthesis engine. Orpheus is not licensed from anyone; it is Dolphin's own system. Dolphin is thus able to include features in the synthesis specifically for the needs of many visually impaired speech users, whilst at the same time improving other aspects of the speech for its use in a wider speech market.

3 Key Features of Orpheus version 2

Version 2 of Orpheus is the new software TTS engine from Dolphin Oceanic. Orpheus's primary features are:

Some of these are described and demonstrated.

4 Natural Sounding Speech

Many recently developed speech synthesizers use what is known as concatenative synthesis. They stitch together small segments of real speech taken from recordings. The results are highly natural sounding since they are natural segments. The real technology of interest in these systems is that which controls the selection of the segments to synthesize. Orpheus now includes voices made like this. Examples will be played.
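The segment selection step mentioned above can be illustrated with a minimal sketch. It is not Orpheus's algorithm, which this paper does not describe. It assumes a generic unit-selection scheme in which each candidate segment carries a "target cost" (how far it is from what is wanted) and each pair of neighbours a "join cost" (how badly they fit together), and it picks the cheapest sequence by dynamic programming. The `Unit` class, the use of pitch as the only feature, and both cost functions are hypothetical simplifications.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Unit:
    """A recorded segment in a hypothetical voice database."""
    phone: str    # phone label the segment realises
    pitch: float  # average pitch in Hz (the only feature in this sketch)


def target_cost(unit, wanted_pitch):
    # How far the candidate is from the pitch we asked for (assumed metric).
    return abs(unit.pitch - wanted_pitch)


def join_cost(left, right):
    # How large the pitch jump is at the join (assumed metric).
    return abs(left.pitch - right.pitch)


def select_units(targets, database):
    """Pick one unit per target so that total target + join cost is minimal.

    targets:  list of (phone, wanted_pitch)
    database: dict mapping phone -> list of candidate Units
    Returns (chosen_units, total_cost).
    """
    layers = []  # layers[i][j] = (best cost so far, back pointer, unit)
    for i, (phone, wanted) in enumerate(targets):
        layer = []
        for unit in database[phone]:
            tc = target_cost(unit, wanted)
            if i == 0:
                layer.append((tc, None, unit))
                continue
            # Cheapest way to reach this unit from any unit in the previous layer.
            prev = layers[-1]
            k = min(range(len(prev)),
                    key=lambda idx: prev[idx][0] + join_cost(prev[idx][2], unit))
            cost = prev[k][0] + join_cost(prev[k][2], unit) + tc
            layer.append((cost, k, unit))
        layers.append(layer)

    # Trace back from the cheapest final candidate.
    j = min(range(len(layers[-1])), key=lambda idx: layers[-1][idx][0])
    total = layers[-1][j][0]
    path = []
    for layer in reversed(layers):
        _, back, unit = layer[j]
        path.append(unit)
        j = back
    path.reverse()
    return path, total


database = {
    "a": [Unit("a", 100.0), Unit("a", 120.0)],
    "b": [Unit("b", 105.0), Unit("b", 200.0)],
}
chosen, cost = select_units([("a", 110.0), ("b", 110.0)], database)
```

A real system would compare many more features than pitch (spectral shape, duration, phonetic context), but the trade-off between matching the target and joining smoothly has the same shape.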

Many people prefer these types of voice from their PC or laptop, especially when reading or announcing information. They are much more like human voices but cannot be manipulated as much as the synthetic sounding, or formant synthesis, voices. The reason is that concatenative voices are made up of tiny fragments of recorded speech stitched together by the synthesizer, so there is not much you can easily do to them other than change their speed and pitch. Formant synthesizers, on the other hand, model the human speech production mechanism and therefore have many parameters to tweak. The resulting voices can be altered in many interesting ways.
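To make the idea of "parameters to tweak" concrete, here is a minimal sketch of a cascade formant synthesizer producing a single steady vowel. It is an illustration only and says nothing about how Orpheus is built. It assumes the textbook two-pole digital resonator, an impulse train as a crude voice source, and rough illustrative formant values for an "ah"-like vowel; all function names are hypothetical.

```python
import math


def resonator(signal, freq_hz, bandwidth_hz, sample_rate):
    """Pass a signal through one two-pole resonator (one formant)."""
    t = 1.0 / sample_rate
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = 2.0 * math.exp(-math.pi * bandwidth_hz * t) * math.cos(2.0 * math.pi * freq_hz * t)
    a = 1.0 - b - c  # normalises the gain at 0 Hz to 1
    out = []
    y1 = y2 = 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y2, y1 = y1, y
    return out


def impulse_train(pitch_hz, duration_s, sample_rate):
    """Crude stand-in for the vocal fold source: one pulse per pitch period."""
    period = round(sample_rate / pitch_hz)
    n = int(duration_s * sample_rate)
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]


def synthesize_vowel(formants, pitch_hz=110.0, duration_s=0.2, sample_rate=16000):
    """Cascade the source through one resonator per (frequency, bandwidth) pair."""
    signal = impulse_train(pitch_hz, duration_s, sample_rate)
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz, sample_rate)
    return signal


# Illustrative (frequency, bandwidth) pairs in Hz for an "ah"-like vowel.
AH = [(700.0, 130.0), (1220.0, 70.0), (2600.0, 160.0)]
samples = synthesize_vowel(AH)
```

Every number here is a knob: pitch, duration, and each formant's frequency and bandwidth can be changed independently, which is the kind of flexibility a voice built from recorded fragments does not offer.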

Concatenative synthesizers can require a large amount of memory to hold the voice database of speech segments from which the synthesizer selects on the fly while speaking. For some synthesizers these databases can be enormous, as some of you may have found. It would seem that the larger the database, the more natural the speech. Orpheus, however, uses a relatively modest amount of memory for its voice databases, typically 4 to 10 MB per voice.

5 Fast Speech

One aspect of synthetic speech of special importance to many visually impaired (VI) users is fast speech. We may all want natural sounding speech, but perhaps not all of the time. If listening to speech is your primary means of getting information from your PC or laptop screen, then on many occasions, particularly in the work environment, the normal speed of delivery is simply slow, tedious and boring.

In a previous paper we discussed the effects of talking rate on the intelligibility of the more synthetic sounding formant synthesis, and reviewed some of the results obtained. We pointed out that, because a formant synthesizer models the speaking mechanism, its parameters can be tweaked to improve intelligibility at unnaturally fast talking rates. This cannot be done with concatenated segments from a real speaker. Since then we have developed these techniques further, to increase the intelligibility of our formant synthesis at the fastest talking rates. These techniques will be discussed and some effects demonstrated using Orpheus version 1 and version 2.
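The paper does not say which parameters are adjusted, so the following is only a guess at the general flavour of such a tweak, not a description of Orpheus. It assumes a synthesizer that controls the duration of each sound directly, and it speeds speech up by dividing durations by a rate factor while never letting a sound fall below a minimum length, so that short consonants are not squeezed out of existence. The segment classes, the floor values and the function name are all hypothetical.

```python
# Hypothetical minimum durations in milliseconds, per class of sound.
MIN_DURATION_MS = {"vowel": 30.0, "consonant": 25.0}


def compress_durations(segments, rate):
    """Speed up a sequence of sounds without shrinking any below its floor.

    segments: list of (label, kind, duration_ms), kind is "vowel" or "consonant"
    rate:     speed-up factor; 2.0 means nominally twice as fast
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    result = []
    for label, kind, duration_ms in segments:
        scaled = duration_ms / rate
        # Clamp to the floor for this class of sound (an assumed rule).
        result.append((label, kind, max(scaled, MIN_DURATION_MS[kind])))
    return result


word = [("k", "consonant", 60.0), ("a", "vowel", 120.0), ("t", "consonant", 50.0)]
fast = compress_durations(word, rate=3.0)
```

With a synthesizer built from recorded segments there is no equivalent per-sound duration parameter to clamp; the recording can only be played faster as a whole.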

This is a somewhat personal view, but once the wider public is as used to listening to talking machines as many visually impaired computer users already are, who is to say that some of them won't want fast, intelligible speech too? After all, it may cut your phone bills! The general populace may then, at last, have caught up with frequent users of speech.

6 Responsiveness

Allied to fast speech is the need for a prompt response to each key press. Most people want the speech from their screen reader and synthesizer to keep up with their typing, for the same reason that they want fast speech: they want to work as fast as their sighted colleagues. All Dolphin systems are designed to reduce latency through the various stages of the screen reader as well as the synthesizer. This is one reason why we cannot afford a huge voice database: it would take up too much memory, be slower to select from, and generally slow down the response of the user's system.

7 Skim Reading

Orpheus knows about the grammar of sentences. It therefore also knows which are the more important words in a sentence. So when scanning a document you might want to use the skim reading function to skip over the unimportant 'glue' words of the document. You can do this by setting up a voice in your screen reader specifically for skimming and choosing it as your document read or document review voice.
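As a rough illustration of skimming, the sketch below drops common 'glue' words from a sentence before it would be spoken. Orpheus is described as using its knowledge of sentence grammar to decide which words matter; this sketch does not do that and simply assumes a small fixed list of function words. The word list and the function name are hypothetical.

```python
# A small, hypothetical list of 'glue' words; a real system would use grammar.
GLUE_WORDS = frozenset({
    "a", "an", "the", "of", "to", "in", "on", "at", "by", "for", "with",
    "and", "or", "but", "is", "are", "was", "were", "be", "it", "that", "this",
})


def skim(sentence):
    """Return the sentence with glue words removed, keeping word order."""
    kept = []
    for word in sentence.split():
        # Compare without surrounding punctuation or capitalisation.
        bare = word.strip(".,;:!?'\"()").lower()
        if bare not in GLUE_WORDS:
            kept.append(word)
    return " ".join(kept)


skimmed = skim("The synthesizer keeps up with the typing of a fast user.")
```

For the sentence above this leaves "synthesizer keeps up typing fast user.", which is enough to judge whether the passage deserves a full reading.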

8 Multilingual Aspects

Dolphin has always been able to produce synthesis in many languages from its TTS engines, and this continues with version 2 of Orpheus. In particular, version 2 is able to speak both Mandarin Chinese and Cantonese Chinese, both of which are under development.

9 Summary

Orpheus version 2 incorporates the natural sounding type of voice we have all come to expect these days. It also incorporates the familiar synthetic, or formant synthesis, voice, which is very good at the fast talking rates required by many VI users. Other features are also demonstrated.


Reprinted with author(s) permission. Author(s) retain copyright.