2003 Conference Proceedings



Chieko Asakawa(1,2)
Hironobu Takagi(1)
Shuichi Ino(2)
Tohru Ifukube(3)


(1) IBM Japan Ltd., Tokyo Research Laboratory, 1623-14, Shimotsuruma, Yamato-shi, Kanagawa-ken, Japan
(2) Research Institute for Electronic Science, Hokkaido University, N12-W6,Sapporo, Hokkaido, Japan
(3) Research Center for Advanced Science and Technology, The University of Tokyo, 4-6-1, Komaba, Meguro, Tokyo, Japan


In this presentation, we describe the highest and the most suitable listening rates for blind users, based on our experiments with blind subjects, aiming at producing a kind of indicator of listening rates for use by developers.


Blind users mostly use voice output when they access computers, since that approach requires no additional devices, only assistive technology software such as screen readers. Braille output is also available with assistive technology software and is very useful for reading the screen information as text, but Braille devices are extremely expensive: for the price of one Braille output device, a user could buy several laptop computers.

Even though voice output is the most widely used, there is little objective data about how quickly and accurately blind users can understand a document by listening to it, or about the most suitable listening rates. Assistive technology software developers have no such data to refer to.

The highest rate provided by the most frequently used text-to-speech engine in Japan (IBM ProTALKER) is 320 words per minute (wpm). In the U.S., one engine (SpeechWorks ETI-Eloquence) goes up to 700 wpm. The differences are related to each engine's specifications, but from the usability point of view there are pros and cons. Advanced users in Japan often complain that pages should be read much faster than they are.

Users in the U.S. do not complain that the reading rate is too slow, but they can lose track of the page when they set it to the highest rate, since it is impossible for anyone to recognize words at the maximum rate of 700 wpm. In addition, the voice quality deteriorates at extremely high rates.

As a first step in solving such problems, we investigated the highest and the most suitable listening rates for blind users with objective and subjective test methods, aiming at producing an indicator for listening rates. This could serve as a reference for developers and improve the usability of voice interfaces.

Seven blind test subjects (two advanced, two intermediate, and three novice users) were asked to perform the experiments. A recorded human voice was used as test data so that the results remain applicable as the quality of TTS engines continues to improve in the future. While there are important differences among TTS engines, we wanted to focus these experiments exclusively on listening rates.

The results showed that the advanced blind subjects could listen to the spoken material at speeds 1.6 times faster than the highest rate of the Japanese TTS engine. This is approximately 490 words per minute. In addition, the average highest rate chosen in the subjective evaluation was nearly equal to the objective average rate for both the advanced and novice users. This indicates that blind users could listen to documents much faster than was assumed, and that TTS engines in Japan should support faster rates to improve the nonvisual user interface. It also indicates that rates faster than 540 wpm are not usable, since none of our test subjects could recognize the test data at any rate above 540 wpm.

In this paper, the two experiments used to determine the subjective and objective optimal rates are described in detail, followed by the results, discussion, and conclusion.


The objective of the experiment was to investigate the highest listening rates that blind users can recognize and the most suitable listening rates that they can listen to comfortably.

Here, we define the most suitable rate as the rate at which they feel no stress when listening to the screen information and can understand it without error. The highest rate is defined as the fastest rate at which they can still recognize half of the words with effort; in other words, this is the rate at which they understand about half of the information. The experiment is composed of subjective and objective evaluations.

Seven blind subjects performed the experiments. For the test data, a recorded human voice was used. One set of test data consisted of 15 wave files of the same sentence, recorded at different rates using CoolEdit [1], as shown in Table 1. Each sentence was relatively short, such as "It is delicious to have a cup of Russian tea with plenty of jam," "I was looking for someone who will protect me," and so on. Since the subjects were Japanese, Japanese sentences were used throughout the experiment. We converted the measurements in morae per minute so that the measured Japanese rates correspond to the English rates in words per minute.

Table 1. Reading Rate

Experiment 1: Subjective Evaluation

Ten sets of data were used. The fastest wave file in a set was presented first, then the next slower one, and so on, in the order shown in Table 1. When a subject first subjectively recognized the sentence, that rate was recorded as the subject's highest rate. After that, the rate continued to be reduced, and when the subject felt that a rate was the most suitable, that rate was also recorded. Each wave file was presented once.
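The presentation procedure above can be sketched as a simple loop. This is only an illustration of the protocol, not the software actually used in the experiment; the `recognized` and `comfortable` callbacks stand in for the subject's spoken yes/no responses at each rate.

```python
def subjective_trial(rates_fastest_first, recognized, comfortable):
    """Walk the rates from fastest to slowest (each presented once).

    recognized(rate) -> bool: did the subject recognize the sentence?
    comfortable(rate) -> bool: is this the most suitable rate?
    Returns (highest_rate, most_suitable_rate); None if never reached.
    """
    highest = suitable = None
    for rate in rates_fastest_first:
        if highest is None:
            # Still searching for the first rate the subject recognizes.
            if recognized(rate):
                highest = rate
        elif comfortable(rate):
            # Rates keep slowing until the subject reports comfort.
            suitable = rate
            break
    return highest, suitable
```

For example, a subject who first recognizes the sentence at 500 wpm and finds 400 wpm comfortable would yield highest = 500 and most suitable = 400.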

Figure 1. Results of Experiment 1

Results of Experiment 1

Figure 1 shows the results. The average highest rate for the two advanced users was about 490 wpm and 500 wpm. Even for novice users, the average highest rate was about 370 wpm.

Experiment 2: Objective Evaluation

Ten sets of data were also used for this experiment, presented in the same way as in Experiment 1. This time the subjects were asked to recall what they had heard, and their voices were recorded on MiniDisc. After the experiment was over, we transcribed the recorded voices, and the recall rate was defined as the number of correctly recalled words divided by the total number of words in the test data. For a sentence like "This is a test," consisting of 4 words, if a subject recalled it as "This … a test," then the recall rate was 75%.
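The recall-rate arithmetic can be written down directly. The position-by-position word matching used here is a simplification for illustration; the paper does not specify the exact matching rule applied during transcription scoring.

```python
def recall_rate(spoken, recalled):
    """Fraction of words in the test sentence that the subject recalled.

    Words are compared position by position against the original
    sentence (a simplifying assumption about the scoring rule).
    """
    spoken_words = spoken.split()
    correct = sum(1 for s, r in zip(spoken_words, recalled.split()) if s == r)
    return correct / len(spoken_words)
```

With the paper's own example, recalling "This is a test" as "This ... a test" scores 3 of 4 words, i.e. a recall rate of 75%.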

After the average recall rate at each reading rate was determined for each subject, we defined the highest and the most suitable rates for the objective evaluation. The rate at which the recall rate first reached 100% was taken as the most suitable rate, and the 50% accuracy level was regarded as the highest rate.
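Extracting the two objective rates from per-rate recall averages amounts to scanning the rates from fastest to slowest and applying the two thresholds. The >= thresholding is an assumption for the case where average recall never lands exactly on 50%.

```python
def pick_rates(recall_by_rate):
    """recall_by_rate: dict mapping reading rate (wpm) -> average recall in [0, 1].

    Returns (most_suitable, highest): the fastest rate with 100% recall,
    and the fastest rate with at least 50% recall. None if never reached.
    """
    rates = sorted(recall_by_rate, reverse=True)  # fastest first
    suitable = next((r for r in rates if recall_by_rate[r] >= 1.0), None)
    highest = next((r for r in rates if recall_by_rate[r] >= 0.5), None)
    return suitable, highest
```

For instance, a subject averaging 20% recall at 700 wpm, 50% at 600, 80% at 500, and 100% at 400 and below would have a highest rate of 600 wpm and a most suitable rate of 400 wpm.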

Figure 2.  Results of Experiment 2

Results of Experiment 2

Figure 2 shows the results. These results show that the highest rate is almost the same between the subjective and objective evaluations. Our expectations were not accurate on this point. This indicates that blind users can recognize the spoken information at much faster rates than we had expected.

The most suitable listening rate is faster than the standard (default) rate of screen-reading applications [2][3] for both advanced and novice users.


We conclude that advanced users are able to recognize information at about 500 wpm, which indicates that TTS engines in general should support such faster rates with high-quality voice output. For Japanese, rates above 540 wpm do not appear to be needed, since human listeners cannot recognize words at such speeds. The highest rates were often affected by the difficulty of the presented data, and the perceived difficulty varied for each of our test subjects.

Currently, users can only change the rate using menus or keyboard commands. The rate cannot be changed at the level of sentences, phrases, words, or characters. This lack of control might cause blind users to misunderstand information, since they can't easily slow down the reading rate even when they have difficulty understanding. This indicates that reading rates should be easily and interactively changeable by users, with immediate response.
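The kind of interactive control suggested above could look like the following sketch: a rate controller that responds to key presses with immediate effect while reading proceeds. The key names, step size, and bounds are hypothetical (the 540 wpm ceiling reflects this paper's finding, not any real screen reader's API).

```python
class RateController:
    """Adjust the reading rate on the fly, clamped to a usable range."""

    def __init__(self, rate_wpm=320, step=20, lo=100, hi=540):
        self.rate = rate_wpm      # current reading rate in wpm
        self.step = step          # wpm change per key press
        self.lo, self.hi = lo, hi # usable range; 540 wpm is the ceiling

    def handle_key(self, key):
        """Apply a 'faster'/'slower' key press immediately; return new rate."""
        if key == "faster":
            self.rate = min(self.hi, self.rate + self.step)
        elif key == "slower":
            self.rate = max(self.lo, self.rate - self.step)
        return self.rate
```

A screen reader's main loop would query `self.rate` before synthesizing each sentence, phrase, word, or character, so that a key press takes effect at the next unit rather than only through a settings menu.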

We would like to perform further experiments to investigate how recall rates change when different types of sentences are presented, such as daily conversations, news articles, technical topics, and so on. Our final goal is to propose a new nonvisual user interface based on blind users' cognitive abilities. For this purpose, we also plan to investigate how the sense of touch [4] could be used for presenting visual information nonvisually.

  1. CoolEdit, Syntrillium, http://syntrillium.com/
  2. Homepage Reader, IBM, http://www-3.ibm.com/able/hpr.html
  3. JAWS, FreedomScientific, http://www.freedomscientific.com/
  4. Asakawa, C., Takagi, H., Ino, S., Ifukube, T., "Auditory and Tactile Interfaces for Representing the Visual Effects on the Web", in Proceedings of ACM Conference on Assistive Technologies (ASSETS 2002), pp. 65-72, Jul 2002, ACM Press


Reprinted with author(s) permission. Author(s) retain copyright.