1999 Conference Proceedings



SPEECH SYNTHESIS AT HIGHER SPEAKING RATES

John Rye
Dolphin Oceanic Ltd. (UK)

Abstract

Synthetic speech is widely used with adaptive software to give visually impaired computer users the same level of computer access as their sighted counterparts. Some users prefer, and others need, to listen to synthesized text spoken rapidly. This paper discusses some issues relevant to the design and operation of the speech synthesizers used in these systems, with particular emphasis on the ability to produce rapid but intelligible speech.

Introduction

Synthetic speech has been used for a number of years as the final output stage in adaptive systems for visually impaired PC users. Visually impaired users make widely varying use of their PCs. At one end of the spectrum, an occasional user may read a book or a phone bill, for instance, with their home PC and special attachments such as an optical character recogniser. At the other end, the professional PC user, a programmer for instance, may use a talking PC for much of their working day. Many people, of course, vary the amount of time they spend at a PC or terminal according to need or fancy.

Intelligibility and naturalness are accepted as important and desirable aspects of synthesizer performance. Intelligibility is the obvious requirement that any synthetic speech should meet, but a high degree of naturalness has also been widely accepted as easing listening strain.

The example of the visually impaired professional PC user is important; this type of user tends to make demands on the speech [...] sounds become altered and blur into one another. At these speeds, linguistically unimportant sounds sometimes disappear altogether. Shorter alternative pronunciations may also be used. As an example, in British English, the word temperature is sometimes pronounced temp'eture, even by BBC weather announcers at normal speed. Similarly, US English speakers might consider alternative pronunciations of the word particular.

The transitions between the spoken phonetic segments are not themselves shortened, at least in part because of the limited speed at which the articulators can move.

It is highly desirable for a synthesizer to model the non-linear shortening of phonetic segments at normal speaking rates in order to maintain naturalness as speed increases. However, there is sometimes a requirement, for example in the case of the visually impaired professional PC user, to synthesize at unnaturally fast rates. If we carried on shortening the sounds but not the transitions as the rate went up, then many sounds would be blurred into their neighbors or disappear altogether. This leads to the hypothesis that increasing the speed of interpolation between phonetic segments as the synthesizer talking rate increases may improve intelligibility in unnaturally fast speech.
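The timing argument above can be illustrated with a small sketch. It assumes a deliberately simple model that is not taken from any particular synthesizer: each sound consists of a steady-state portion plus one transition, the total duration is divided by the speaking-rate multiplier, and the transition keeps its natural length unless a hypothetical `transition_scale` factor shortens it. The function name, parameters and millisecond values are all illustrative assumptions.

```python
def segment_timing(steady_ms, transition_ms, rate, transition_scale=1.0):
    """Split one sound into (steady_ms, transition_ms) at a given speaking rate.

    Assumed toy model: the total duration shrinks as 1/rate, while the
    transition keeps its natural length multiplied by transition_scale.
    Whatever time is left over is the steady-state portion.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    total = (steady_ms + transition_ms) / rate
    # The transition cannot be longer than the whole (shortened) sound.
    transition = min(transition_ms * transition_scale, total)
    steady = total - transition
    return steady, transition


if __name__ == "__main__":
    # Illustrative values only: 60 ms steady state, 40 ms transition.
    for rate in (1.0, 2.0, 3.0):
        print(rate, segment_timing(60.0, 40.0, rate))
    # Same sound at three times normal rate, with the transition halved.
    print(3.0, segment_timing(60.0, 40.0, 3.0, transition_scale=0.5))
```

With these assumed values, the steady-state portion falls from 60 ms to 10 ms at twice the normal rate and vanishes entirely at three times the normal rate, which corresponds to the sound being blurred into its neighbors. Halving the transition at the fastest rate restores a short steady-state portion, which is the effect the hypothesis relies on.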

High Voiced Pitch or Low Pitch?

The pitch of the synthesized voice is primarily a matter of taste for the listener. However, if we are to consider only the intelligibility at a high speaking rate, then the selection of overall voice pitch may have a bearing.

At very low pitch, the voice pulses are relatively far apart in time; consequently, they do not sample the synthesizer's phonetic segments very often. The perception of short voiced sounds, in which the vocal tract parameters (formant frequencies, for instance) are varying most rapidly, may then be impaired. Conversely, too high a pitch may produce voice harmonics that are too far apart to sample the synthesized vocal tract resonances effectively, again resulting in a loss of intelligibility.
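The trade-off can be made concrete with some simple arithmetic, sketched below. A voiced source at fundamental frequency F0 produces one pulse every 1/F0 seconds, and its harmonics lie at integer multiples of F0, so they are F0 apart in frequency. The function names and the example pitch and duration values are assumptions chosen purely for illustration.

```python
def pulses_in_segment(f0_hz, segment_ms):
    """Approximate number of voice pulses falling within a voiced segment."""
    return f0_hz * segment_ms / 1000.0


def harmonics_up_to(f0_hz, max_hz):
    """Harmonic frequencies (integer multiples of F0) up to max_hz."""
    count = int(max_hz // f0_hz)
    return [f0_hz * k for k in range(1, count + 1)]


if __name__ == "__main__":
    # Illustrative values: a 30 ms voiced sound at a low and a high pitch.
    for f0 in (80, 300):
        print(f0, "Hz:", pulses_in_segment(f0, 30), "pulses,",
              "harmonics below 1 kHz:", harmonics_up_to(f0, 1000))
```

In this example, an 80 Hz voice places only about 2.4 pulses in a 30 ms sound, whereas a 300 Hz voice places 9 pulses in the same sound but has only three harmonics below 1 kHz with which to outline the vocal tract resonances in that range.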

Implications for Synthesizer Design

To speed up the interpolation from one phonetic segment to the next, it is necessary to control how the parameters of the synthesis vocal tract model, such as formant frequencies, vary in time. Consequently, formant or LPC based synthesizers are better equipped to achieve this kind of alteration. Systems that synthesize from gross segments, for instance by waveform concatenation of diphones, are capable of a high degree of naturalness and intelligibility at normal rates but cannot directly achieve this kind of manipulation.

It does not matter whether the synthesizer uses phonemes or diphones, so long as the interpolation between phonemes or the transitional section of the diphones can be quickened.
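As a minimal sketch of what quickening the interpolation means for a single parameter, the function below moves one formant frequency between two targets over an adjustable transition time. Linear interpolation centred on the segment boundary is assumed here only for simplicity; it is not claimed to be the rule used in any actual synthesizer, and the function name and all values are hypothetical.

```python
def formant_at(t_ms, f_from_hz, f_to_hz, boundary_ms, transition_ms):
    """Formant frequency at time t_ms for a linear transition.

    The transition is centred on boundary_ms and lasts transition_ms.
    Outside the transition the formant sits on its target value.
    """
    if transition_ms <= 0:
        # Degenerate case: an instantaneous jump at the boundary.
        return f_from_hz if t_ms < boundary_ms else f_to_hz
    start = boundary_ms - transition_ms / 2.0
    frac = (t_ms - start) / transition_ms
    frac = min(max(frac, 0.0), 1.0)  # clamp to the transition interval
    return f_from_hz + frac * (f_to_hz - f_from_hz)


if __name__ == "__main__":
    # Illustrative values: a formant moving from 500 Hz to 1500 Hz
    # across a boundary at 100 ms, sampled 15 ms before the boundary.
    print(formant_at(85, 500, 1500, 100, 40))  # slower transition
    print(formant_at(85, 500, 1500, 100, 20))  # quicker transition
```

With the 40 ms transition the formant has already left its first target 15 ms before the boundary, while with the 20 ms transition it is still on target at that point. A shorter transition therefore leaves more of each shortened sound at its target values.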

Tests

Much of the justification for altering phonetic element interpolations comes from informal tests, observations and a little intuition. However, to test the effects of steepening the segment interpolation, a software speech synthesizer has been modified to allow control of the overall phonetic segment interpolation rate. Test phrases will be synthesized at various speaking rates and interpolation rates and played to a panel of listeners. The results from the panel will be analyzed to see whether increasing the rate of interpolation improves intelligibility at fast speaking rates. The details and results will be reported.

It will also be demonstrated that at normal speaking rates, increasing the interpolation rate introduces distortion into the speech, reducing its naturalness and intelligibility.
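The structure of such a listening test can be sketched as a grid of conditions, one for every combination of speaking rate, interpolation rate and test phrase. The sketch below is only an assumption about how the conditions might be enumerated and scored. The function names, the rate values, the phrase and the crude position-by-position word score are hypothetical and do not describe the actual test procedure.

```python
from itertools import product


def build_conditions(speaking_rates, interpolation_rates, phrases):
    """Enumerate every (speaking rate, interpolation rate, phrase) condition."""
    return [
        {"phrase": p, "speaking_rate": s, "interpolation_rate": i}
        for s, i, p in product(speaking_rates, interpolation_rates, phrases)
    ]


def word_accuracy(reference, transcript):
    """Fraction of reference words a listener reported in the same position.

    Assumed scoring rule: a simple position-by-position comparison,
    ignoring case. Real intelligibility scoring may be more elaborate.
    """
    ref = reference.lower().split()
    heard = transcript.lower().split()
    if not ref:
        raise ValueError("reference phrase must contain at least one word")
    correct = sum(1 for r, h in zip(ref, heard) if r == h)
    return correct / len(ref)


if __name__ == "__main__":
    conditions = build_conditions([1.0, 2.0, 3.0], [1.0, 1.5, 2.0],
                                  ["the cat sat down"])
    print(len(conditions), "conditions")
    print(word_accuracy("the cat sat down", "the hat sat down"))
```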

Discussion

It would seem that the following properties are useful in a speech synthesizer intended for use as an adaptive aid for visually impaired computer users.

There would, however, appear to be a conflict between the desirability of natural-sounding synthesis on the one hand and the need for rapid, intelligible speech on the other. Certain synthesis techniques can help resolve this conflict by gradually shifting from a model of the natural phonetic segment interpolation process to a more specialized one as speed increases.
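One way to picture such a gradual shift is a control curve that leaves transitions at their natural length up to some speaking rate and then shortens them progressively beyond it. The sketch below assumes a piecewise-linear curve. The function name, the two rate thresholds and the minimum scale are hypothetical values chosen for illustration, not figures from any tested system.

```python
def transition_scale(rate, natural_limit=1.5, fastest_rate=3.0, min_scale=0.4):
    """Scale factor applied to natural transition durations at a given rate.

    Assumed behaviour: up to natural_limit the natural model is used
    unchanged (scale 1.0); between natural_limit and fastest_rate the
    scale falls linearly; at or beyond fastest_rate it stays at min_scale.
    """
    if fastest_rate <= natural_limit:
        raise ValueError("fastest_rate must exceed natural_limit")
    if rate <= natural_limit:
        return 1.0
    if rate >= fastest_rate:
        return min_scale
    frac = (rate - natural_limit) / (fastest_rate - natural_limit)
    return 1.0 + frac * (min_scale - 1.0)


if __name__ == "__main__":
    for rate in (1.0, 1.5, 2.25, 3.0, 4.0):
        print(rate, transition_scale(rate))
```

Keeping the scale at 1.0 for normal rates reflects the observation that quickened interpolation distorts speech at those rates, while the falling part of the curve applies the more specialized behaviour only where natural timing can no longer be maintained.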




Reprinted with author(s) permission. Author(s) retain copyright.