2001 Conference Proceedings



Charles Silverman, M.Ed.
Deborah I. Fels, P.Eng., Ph.D.


Interactive television is emerging as an innovative technology on the verge of revolutionizing the broadcast industry. It can be defined simply as the merging of television with Web-based interactivity. This does not, however, mean watching television while working on a computer. Rather, it gives viewers access to Web-based functions while watching television (e.g., browsing, live participation in game shows). For now, interactivity is often restricted to navigation and limited text entry, but there is room for very creative and innovative applications and ideas.

Television is a ubiquitous and affordable environment. There are on average two to three television sets per American household (http://www.nrc-recycle.org/Programs/electronics/mgmt.htm). Efforts to make television more accessible to people with disabilities are therefore bound to have an impact on a large number of users.

Television has traditionally been a very audio- and video-rich environment. Over the last 30 years, standards evolved to permit limited access by people who are deaf or hard of hearing. The FCC has now mandated captioning for practically all television programming under Section 713 of the Communications Act.

While there are specifications for minimum size, color, position, etc. that provide a standardized mechanism for displaying caption text, a large proportion of the information conveyed by audio is not conveyed by present text-based closed captioning conventions. The bandwidth explosion that is evident in Web connectivity is also appearing in television, initially in limited form through the various interactive television services now reaching the consumer market. In addition, the conversion from analog to digital television will take place over the next ten years. These new and future services provide an opportunity to revisit how audio information is presented to viewers who are deaf or hard of hearing, and to begin exploring accessibility opportunities that, until now, have remained in the computing domain.

Ryerson Polytechnic University and its partners are engaged in a project that is exploring new captioning conventions to convey the rich audio information that is frequently missed. This exploration is being made possible by the new multimedia and interactive TV set-top technologies.

Information Conveyed by Audio

Consider for a moment all the paralinguistic (non-text) information that a person receives when listening to speech and sounds. Tone of voice, inflection, rate of speech, the timing of responses, volume, the texture or fluency of speech, the unique identity of a voice, and accent all convey semantics and contribute to our understanding of the “words” being presented. Music, sound effects and silence also contribute a great deal to the underlying mood and semantics of a scene. When one or more of these elements is missing, there can be a reduction or loss in our understanding, which may in turn reduce our appreciation of a broad and varied genre of television, video and movie content.

Presently, Closed Captioning communicates the verbatim or paraphrased transcript of the spoken words. Music is labeled and a small descriptor identifies the tune or the type of music. Emotions may be labeled when very overt with a single descriptor such as "angrily" or "crying," or punctuation such as “!”. Background sounds may be described with one or two words when highly significant.

The audio channel can present a large amount of information in parallel: a single tone or a slight inflection can alter the interpretation of an entire conversation. In contrast, text is processed in a serial fashion. Normal conversation occurs at approximately 280 words per minute (reference). The preferred reading rate for captions is 141 to 150 words per minute, with many viewers not experiencing difficulty until captions reach 170 words per minute (Robson, 1997; Jensema, 1998). Robson (1997) reported that younger, literate children are comfortable reading 60 words per minute.
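The gap between these rates can be made concrete with a rough calculation of our own (not from the cited studies), using the figures above:

```python
# Rough illustration of the speech/reading rate gap described in the text.
# The rates come from the paper; the calculation itself is only a sketch.

SPEECH_WPM = 280          # approximate conversational speaking rate
CAPTION_WPM = {
    "preferred": 145,     # midpoint of the 141-150 wpm preference range
    "difficulty": 170,    # rate at which many viewers begin to struggle
    "young_readers": 60,  # comfortable rate for younger, literate children
}

def fraction_retained(reading_wpm: float, speech_wpm: float = SPEECH_WPM) -> float:
    """Fraction of verbatim dialogue that fits at a given caption reading rate."""
    return min(1.0, reading_wpm / speech_wpm)

for label, wpm in CAPTION_WPM.items():
    pct = fraction_retained(wpm) * 100
    print(f"{label}: about {pct:.0f}% of verbatim dialogue fits")
```

Even at the 170 words-per-minute difficulty threshold, only about three fifths of normal conversation fits on screen, leaving essentially no headroom for paralinguistic description.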

Because speaking speeds and reading speeds are so mismatched, text can carry little more than a transcript of the spoken words. There is little or no room to describe the paralinguistic audio information in text (and it is traditionally not described). Information about emotions, music and sound effects must be inserted into conversational pauses and is often not synchronized with the audio track. This robs the viewer of the full experience of the video. Conventional captioning is therefore analogous to presenting musical notation, stripped of its signatures and commentaries, as a visual equivalent of listening to a full orchestra.

Caption researchers have noted that various kinds of paralinguistic descriptions are often reduced to minimal descriptors or omitted entirely, even though caption viewers want this information, even at the risk of redundancy (Harkins et al., 1995). The difficulty is that there are practical limits to the amount of text-based description that can fit on the screen, be read by the user, and be handled by conventional analog television. Other methods, combined with text or used on their own, may overcome these limitations.

What Web TV Enables

WebTV and other set-top technologies significantly widen the TV pipeline by combining the World Wide Web with television broadcasting conventions. By inserting special signals throughout the broadcast, these systems deliver accompanying data from the Web. The broadcast video can be manipulated so that it appears as a frame within a specially designed Web page, complete with text, graphics, buttons and hypertext. Alternatively, a full-screen television video may be overlaid with text and graphics. The interactivity incorporated with current set-top technologies allows for a variety of user preferences as well as navigation and text entry. Barker (1998) suggests that many existing access strategies can be used with WebTV (e.g., trainable remote controls, print enlargers, etc.). We believe that not only can existing access strategies be used with WebTV and other set-top technologies, but that new strategies that enhance existing ones can be created. All of these features make this environment ideal for exploring enhanced multimedia captioning.

Uncharted Territory

Few conventions exist that would guide the translation of the paralinguistic qualities of audio into a visual format. One of the challenges of this project is to find and verify the conventions that do exist, determine how broadly they are held and determine how easily they can be extended. If no conventions exist for visually interpreting a type of audio information we must determine how we can build upon and extrapolate existing conventions so that they can become widely accepted.

A rich popular medium that has conveyed paralinguistic information visually is print-based cartoons (comics). Comic book conventions are sufficiently ingrained in North American culture to be recognizable by the majority of the population (McCloud, 1993). McCloud (1993) suggests that sound is represented through a variety of visual devices, such as lines, icons, text, and word balloons. Comic art in particular, according to McCloud, strives to engage all the viewer’s senses through a visual medium. ReadSpeak (http://www.readspeak.com/) is one example of a captioning system that places text captions by the mouth of the speaker (similar to a word balloon in conventional cartooning) in an attempt to convey some of the paralinguistic information. Others who attempt to convey paralinguistic information include sign interpreters, mimes and interactive artists, though few conventions from these domains have been formally reported. The conventions introduced by these artists and interpreters can serve to guide us in translating the inherent audio information into a recognizable visual format.

One notable attempt to visually represent sound (music) was Walt Disney’s classic movie Fantasia, in which visual interpretations of classical music are presented as animations (AnimationArtist, 2000). This, however, was a highly personal, interpretive art piece that did not use standard conventions.

Work with Content Developers

To begin our explorations we required a reference point upon which to base our discussions. A collaboration was established with MarbleMedia, a video production company that has produced video projects for Bravo TV and other specialty channels. Their invaluable contribution has been to provide content and comments on how well our enhanced captioning of the paralinguistic information matched their intentions. Using their video short "Il Menu," a whimsical spoof of Italian operas, we have demonstrated the viability of interactive set-top technology for delivering additional graphical information.

In Il Menu, a chef and waiter must serve voracious Divas, who are not satisfied by any of the pastas they are served. Eventually, the Divas respond enthusiastically to canned spaghetti. Much subtlety and play abound in this piece. The annoyance and ultimate exasperation of the chef and waiter and the delightfully temperamental indulgence into food and song of the Divas are expressed in a rich canopy of sounds that play on alternating tensions and levity. Without moment by moment conveyance of information about the audio, the viewer who must rely on conventional captions misses most of the experience and meaning of the film.

With Il Menu as a springboard, we have begun to map out a system for categorizing the different kinds of paralinguistic elements, along with potential visual representations. The initial prototypes have explored the use of speech bubbles, color, and text to convey different aspects of dialog. These early prototypes have served as a proof of concept, showing that interactive TV technology, and eventually digital television (DTV) when it arrives, can support the kind of enhanced captioning that we are proposing.

For the initial prototypes, basic character emotions were conveyed through the shape, colour and size of word bubbles and icons, as well as the choice and style of fonts for specific speakers and utterances. Background sounds were highlighted by iconic representation. For example, anger was represented by a spiked bubble.
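This kind of mapping can be sketched as a simple lookup from emotion to visual style. The sketch below is purely illustrative: the "spiked" bubble for anger comes from the prototypes described above, while the other emotions, shapes, colours and names are our own assumptions, not the paper's actual design.

```python
# Hypothetical sketch of an emotion-to-visual-style mapping for enhanced
# captions. Only the spiked/anger pairing is from the prototypes; the rest
# of the entries are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class BubbleStyle:
    shape: str    # outline of the word bubble (e.g., spiked, wavy)
    colour: str   # fill colour of the bubble
    font: str     # font style for the utterance text

EMOTION_STYLES = {
    "anger":   BubbleStyle(shape="spiked", colour="red",    font="bold"),
    "whisper": BubbleStyle(shape="dashed", colour="grey",   font="italic"),
    "song":    BubbleStyle(shape="wavy",   colour="violet", font="script"),
}

def style_for(emotion: str) -> BubbleStyle:
    """Return the visual style for an utterance, defaulting to a plain bubble."""
    return EMOTION_STYLES.get(emotion, BubbleStyle("oval", "white", "regular"))

print(style_for("anger").shape)    # spiked
print(style_for("neutral").shape)  # oval (default)
```

A table-driven design like this would let viewers or producers swap in alternative style sets (for example, a higher-contrast palette) without changing the captioning logic itself.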

The producer and director of the video viewed the prototype and indicated that the caption enhancements captured the essence of their creative intentions and, indeed, clarified their work. They further suggested that these additions would be helpful for focusing the attention of mainstream, hearing audiences. We are planning to work with them on additional content pieces to add open enhanced captioning for mainstream viewing. We also plan to include 3-D text and animation as a way of highlighting the paralinguistic elements of speech.

While the visual modality is an obvious method of representing sound, we are also exploring the potential of the tactile modality to convey paralinguistic content. Questions to be answered include:

· Can we involve the senses more intimately with touch?

· What is the best tactile representation of dynamically changing sounds (e.g., audience applause)?

· How can we enable the continuous engagement of deaf viewers with music through feeling, conveying the moment-by-moment experience of the music rather than merely summarizing it?

We have begun building a haptics display to convey some of the audio content in Il Menu. However, this component is at an embryonic stage.

A critical and inherent part of future work will involve usability testing. We are designing a series of studies to examine the experience of hearing and deaf viewers regarding the depth of engagement and understanding comparing standard captioning with a variety of “enhanced” captioning techniques.


Interactive TV technologies are now appearing on the consumer scene, with the inevitable conflicts over standards and elimination of various players. These technologies are just a hint of what is to come. Within the next ten years, far more powerful versions of these systems will be folded into digital TV, which is scheduled to replace the current broadcast system as we know it. Our goal with this project is to examine ways to use these technologies to enable deaf and hard of hearing viewers to achieve the kind of engagement and parity that is routinely enjoyed by hearing viewers.


AnimationArtist (2000). Walt Disney’s Fantasia 2000. http://www.animationartist.com

Electronics Recycling Initiative. http://www.nrc-recycle.org/Programs/electronics/mgmt.htm

Jensema, C. (1998). Viewer reaction to different captioned television speeds. American Annals of the Deaf, 143(4).

McCloud, S. (1993). Understanding Comics. HarperPerennial.

ReadSpeak Inc. (2000). http://www.readspeak.com/

Robson, G. (1997). Captioning FAQ. http://www.robson.org


Reprinted with author(s) permission. Author(s) retain copyright.