2001 Conference Proceedings
BEYOND CAPTIONING: THE NEXT FRONTIER
Charles Silverman, M.Ed.
Deborah I Fels, P.Eng, Ph.D.
Interactive television is emerging as an exciting new and
innovative technology that is on the verge of revolutionizing the
broadcast industry. Interactive television can be simply defined
as the merging of television with Web-based interactivity. This
does not, however, imply watching television while working on the
computer. Rather, it allows users access to Web-based functions
while watching television (e.g., browsing, live participation in
gameshows, etc.). For now, interactivity is often restricted to
navigation, and limited text entry. However, there is room for
very creative and innovative applications and ideas.
Television is a ubiquitous and affordable environment.
There are, on average, 2 to 3 television sets per American
Efforts to make television more accessible to people with
disabilities are bound to have an impact on a large number of
Television has traditionally been a very audio and video rich
environment. Over the last 30 years, standards evolved to permit
limited access by people who are deaf or hard of hearing. The FCC
has now mandated captioning for practically all television
programming under Section 713 of the Communications Act.
While there are specifications for minimum size, color,
position, etc. that provide a standardized mechanism for
displaying that text, a large proportion of the information
conveyed by audio is not conveyed by present text based closed
captioning conventions. The bandwidth explosion that is evident
for Web connectivity is also reaching television, initially in
limited form through the various interactive television services
now appearing on the consumer landscape. In addition, the
conversion from analog to digital television will take place
over the next ten years. These new
and future services provide an opportunity to re-visit how audio
information is presented to viewers who are deaf or hard of
hearing. We can now begin to explore the accessibility
opportunities that, up until now, have remained in the computing
Ryerson Polytechnic University and its partners are engaged in a
project that is exploring new captioning conventions to convey
the rich audio information that is frequently missed. This
exploration is being made possible by the new multimedia and
interactive TV set-top technologies.
Information Conveyed by Audio
Consider for a moment all the paralinguistic (non-text)
information that a person receives when listening to speech and
sounds. For example, tone of voice, inflection, rate of speech,
the timing of responses, volume, the texture or fluency of
speech, the unique identity of a voice, and accent, all convey
semantics and contribute to our understanding of the
“words” being presented. Music, sound effects and
silence also contribute a great deal to the underlying mood and
semantics of a scene. When one or more of these elements is
missing, there can be a reduction or loss in our understanding
which may in turn result in a reduction of our appreciation of a
broad and varied genre of television, video and movie content.
Presently, Closed Captioning communicates the verbatim or
paraphrased transcript of the spoken words. Music is labeled and
a small descriptor identifies the tune or the type of music.
Emotions may be labeled when very overt with a single descriptor
such as "angrily" or "crying," or punctuation such as
“!”. Background sounds may be described with one or
two words when highly significant.
The audio channel can present a large amount of information in
parallel: a single tone or a slight inflection can alter the
interpretation of an entire conversation. In contrast, text is
processed in a serial fashion. Normal conversation occurs at
approximately 280 words per minute (reference). The preference
for reading captions is 141 to 150 words per minute with many
viewers not experiencing difficulty until captions reach 170
words per minute (Robson, 1997; Jensema, 1998). Robson (1997)
reported that younger, literate children are comfortable reading
60 words per minute.
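To make the gap between these rates concrete, the following minimal sketch computes what fraction of spoken words would fit in captions capped at each reading rate quoted above. The function name `caption_coverage` and the dictionary labels are our own illustrative choices, and the calculation assumes, for simplicity, verbatim captions with no time set aside for non-speech descriptors.

```python
# Rates quoted in the text, in words per minute (wpm).
SPEECH_WPM = 280  # approximate rate of normal conversation
CAPTION_WPM = {
    "preferred (low)": 141,
    "preferred (high)": 150,
    "difficulty threshold": 170,
    "younger literate children": 60,
}


def caption_coverage(caption_wpm: float, speech_wpm: float = SPEECH_WPM) -> float:
    """Fraction of spoken words that fit when captions are capped at caption_wpm."""
    if caption_wpm < 0 or speech_wpm <= 0:
        raise ValueError("rates must be non-negative, and speech_wpm positive")
    return min(1.0, caption_wpm / speech_wpm)


if __name__ == "__main__":
    for label, wpm in CAPTION_WPM.items():
        share = caption_coverage(wpm)
        print(f"{label:>26}: {wpm:3d} wpm -> {share:.0%} of spoken words")
```

Under these assumptions, even the 170 words-per-minute threshold covers only about 60 percent of speech delivered at 280 words per minute, which leaves no spare reading time for paralinguistic description.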
Text can therefore carry only a transcript of the spoken words
because speaking speeds and reading speeds are so mismatched. There
is little or no room to describe with text the paralinguistic
audio information (and it is traditionally not described).
Information about emotions, music, and sound effects must be inserted
into the conversational pauses and is often not synchronized with
the audio track. This robs the viewer of the full experience of
the video. Conventional captioning is therefore analogous to
having the music notation presented without the signatures and
commentaries as a visual equivalent of listening to a full
Caption researchers have noted that various kinds of
paralinguistic descriptions are often reduced to minimal
descriptors or omitted altogether, and that caption viewers want
this information even at the risk of redundancy (Harkins et
al., 1995). The difficulty is that there are practical limits to
the amount of text-based description that can fit on the screen,
be read by the user, and be handled by conventional
analog television. Using other methods, combined with text or on
their own, may overcome these limitations.
What Web TV Enables
WebTV and other set-top technologies significantly widen the TV
pipeline by combining the World Wide Web with television
broadcasting conventions. By inserting special signals throughout
the broadcast, these systems deliver accompanying data from the
Web. The broadcast video can be manipulated
with WebTV and other systems so that it appears as a frame within
a specially designed Web page, complete with text, graphics,
buttons, and hypertext. Alternatively, full-screen
television video may be overlaid with text and graphics. The
interactivity incorporated into current set-top technologies
allows for a variety of user preferences as well as
navigation and text entry. Barker (1998) suggests that there are
many existing access strategies that can be used with WebTV
(e.g., trainable remote controls, print enlargers, etc.). We
believe that not only can existing access strategies be used with
WebTV and other set-top technologies but that new strategies that
enhance existing ones can be created. All of these features make
this environment ideal for exploring enhanced multimedia
Few conventions exist that would guide the translation of the
paralinguistic qualities of audio into a visual format. One of
the challenges of this project is to find and verify the
conventions that do exist, determine how broadly they are held
and determine how easily they can be extended. If no conventions
exist for visually interpreting a type of audio information we
must determine how we can build upon and extrapolate existing
conventions so that they can become widely accepted.
A rich popular medium that has conveyed paralinguistic
information visually is print-based cartoons (comics). Comic book
conventions are sufficiently ingrained into the North American
culture to be recognizable by the majority of the North American
population (McCloud, 1993). McCloud (1993) suggests that sound
is represented through a variety of visual devices, such as
lines, icons, text, and word balloons. Comic art, in particular,
according to McCloud, strives to engage all the viewer’s
senses through a visual medium. ReadSpeak
(http://www.readspeak.com/) is one example of a captioning
system that places the text captions by the mouth of the speaker
(similar to a word balloon in conventional cartooning) in an
attempt to convey some of the paralinguistic information. Others
who are attempting to convey paralinguistic information include
sign interpreters, mimes, and interactive artists. However, there
are few reported conventions followed in these domains. The
conventions introduced by these artists/interpreters can serve to
guide us in translating the inherent audio information into a
recognizable visual format.
One notable attempt to visually represent sound (music) was in Walt
Disney’s classic movie, Fantasia, where visual
interpretations of classical music are presented as animations
(AnimationArtist, 2000). This, however, was a highly personal,
interpretive art piece that did not use standard conventions.
Work with Content Developers
To begin our explorations we required a reference point upon
which to base our discussions. A collaboration was established
with MarbleMedia, a video production company that has produced
video projects for Bravo TV and other specialty channels. Their
invaluable contribution has been to provide content and comments
on how well our enhanced captioning of the paralinguistic
information matched with their intentions. Using their video
short "Il Menu," a whimsical spoof of Italian operas, we
have demonstrated the viability of interactive set-top technology
for delivering additional graphical information.
In Il Menu, a chef and waiter must serve voracious Divas, who
are not satisfied by any of the pastas they are served.
Eventually, the Divas respond enthusiastically to canned
spaghetti. Much subtlety and play abound in this piece. The
annoyance and ultimate exasperation of the chef and waiter and
the delightfully temperamental indulgence into food and song of
the Divas are expressed in a rich canopy of sounds that play on
alternating tensions and levity. Without moment by moment
conveyance of information about the audio, the viewer who must
rely on conventional captions misses most of the experience and
meaning of the film.
With Il Menu as a springboard, we have begun to map out what
will lead to a system of categorizing the different kinds of
paralinguistic elements, along with potential visual
representations. The initial prototypes have explored the use of
speech bubbles, color, and text to convey different aspects of
dialog. These early prototypes have been a proof of concept,
showing that interactive TV technology, and potentially digital
television (DTV) when it arrives, can support the kind of enhanced
captioning that we are proposing.
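One possible way to organize such a categorization is as a lookup from a paralinguistic label to a visual style, sketched below. This is not the prototype's actual data model: the class names `BubbleStyle` and `EnhancedCaption`, the field names, and every style value are hypothetical placeholders. The only pairing suggested by the text is a spiked form for anger.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BubbleStyle:
    """Visual treatment of one word bubble (all fields are illustrative)."""
    shape: str
    colour: str
    scale: float  # size relative to a neutral bubble
    font: str


# Hypothetical style table. Only "anger -> spiked" echoes the text;
# the remaining values are placeholders, not findings of the project.
EMOTION_STYLES = {
    "neutral": BubbleStyle("rounded", "white", 1.0, "sans-serif"),
    "anger": BubbleStyle("spiked", "red", 1.3, "bold sans-serif"),
}


@dataclass(frozen=True)
class EnhancedCaption:
    """One caption cue carrying paralinguistic annotations."""
    speaker: str
    text: str
    emotion: str = "neutral"
    sound_icon: Optional[str] = None  # icon name for a background sound, if any

    def style(self) -> BubbleStyle:
        # Unknown emotions fall back to the neutral style.
        return EMOTION_STYLES.get(self.emotion, EMOTION_STYLES["neutral"])


if __name__ == "__main__":
    cue = EnhancedCaption(speaker="Chef", text="(dialogue)", emotion="anger")
    print(cue.speaker, cue.style())
```

Keeping the style table separate from the caption cues means a convention can be revised after user testing without re-authoring the captions themselves.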
For the initial prototypes, basic character emotions were
conveyed through the shape, colour and size of word bubbles and
icons, as well as choice and style of fonts for specific speakers
and utterances. Background sounds were highlighted by iconic
representation. For example, anger was represented by a spiked
The producer and director of the video viewed the prototype and
suggested that the caption enhancements captured the essence of
their creative intentions. They also commented that the
caption enhancements helped clarify their work. They
further indicated that these additions would be helpful for
focusing mainstream, hearing audiences. We are currently planning
to work with them on additional content pieces to add open
enhanced captioning for mainstream viewing. We are also currently
planning to include 3-D text and animation as a way of
highlighting the paralinguistic elements of speech.
While the visual modality is an obvious method of
representing sound, we are also exploring the potential of using the
tactile modality to convey paralinguistic content. Questions to be
· Can we involve the senses more intimately with touch?
· What is the best tactile representation of dynamically
changing sounds (e.g., audience applause)?
·         How can we enable the continuous engagement of deaf
viewers with music through feeling, conveying the moment-by-moment
experience of the music rather than merely summarizing it?
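As a purely illustrative starting point for the second question, the sketch below reduces an audio signal to a loudness envelope and scales it to an actuator intensity, so that a swelling sound such as applause would be felt as a swelling vibration. It does not describe the project's haptic display. The function names, the windowed RMS approach, and the 0 to 255 intensity range are all assumptions made for the example.

```python
import math


def rms_envelope(samples, window):
    """RMS loudness of each consecutive window of samples (values in -1..1)."""
    if window <= 0:
        raise ValueError("window must be positive")
    levels = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        levels.append(math.sqrt(sum(s * s for s in chunk) / len(chunk)))
    return levels


def to_vibration_levels(envelope, max_level=255):
    """Scale RMS values (0..1) to integer actuator intensities (0..max_level)."""
    return [min(max_level, round(level * max_level)) for level in envelope]


if __name__ == "__main__":
    # Synthetic stand-in for a swelling sound: a tone whose amplitude ramps up.
    n = 800
    samples = [(i / n) * math.sin(2 * math.pi * i / 20) for i in range(n)]
    print(to_vibration_levels(rms_envelope(samples, window=100)))
```

A loudness envelope captures only one dimension of the sound. Pitch, rhythm, and timbre would each need their own tactile mapping, which is where the open questions above apply.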
We have begun building a haptics display to convey some of the
audio content in Il Menu. However, this component is at an
A critical and inherent part of future work will involve
usability testing. We are designing a series of studies to
examine the experience of hearing and deaf viewers regarding the
depth of engagement and understanding comparing standard
captioning with a variety of “enhanced” captioning
Interactive TV technologies are now appearing on the consumer
scene, with the inevitable processes of conflicting standards and
elimination of various players. The technologies are just a hint
of what is to come. Within the next ten years, more powerful
versions of these systems will be folded into digital TV, which
is scheduled to replace the current broadcast system as we know
it. Once digital television becomes mainstream, so too will far
more powerful versions of the current interactive technologies.
Our goal with this project is to examine the ways to use these
technologies to enable deaf and hard of hearing viewers to
achieve the kind of engagement and parity that is routinely
enjoyed by hearing viewers.
AnimationArtist (2000). Walt Disney’s Fantasia
Jensema, C. (1998). Viewer reaction to different captioned
television speeds. American Annals of the Deaf. 143(4).
McCloud, S. (1993). Understanding Comics. HarperPerennial.
ReadSpeak Inc. (2000). http://www.readspeak.com/
Robson, G. (1997). Captioning FAQ. http://www.robson.org
Reprinted with author(s) permission. Author(s) retain copyright.