2004 Conference Proceedings



Jared Smith
WebAIM (Web Accessibility in Mind)
Center for Persons with Disabilities
Utah State University
United States
Email: jared@cpd2.usu.edu 


Captions are text versions of the spoken word. Captions allow Web audio and video to be both perceivable to those who do not have access to audio and understandable to a wider audience. Though captioning is primarily intended for those who cannot receive the benefit of audio, it has been found to greatly help those who can hear the audio, those who may not be fluent in the language in which the audio is presented, or those who may have learning/cognitive impairments.

The Web Content Accessibility Guidelines 1.0 of the World Wide Web Consortium say, "Provide equivalent alternatives to auditory and visual content." Section 508 of the Rehabilitation Act states that "equivalent alternatives for any multimedia presentation shall be synchronized with the presentation." According to these guidelines, captions must be both equivalent to the audio content and synchronized with it.

On the Web, synchronized, equivalent captions should be provided any time audio content is presented. This obviously pertains to the use of audio and video that is played through multimedia players such as QuickTime, RealPlayer, or Windows Media Player, but can also pertain to such technologies as Flash, Shockwave, or Java when audio content is a part of the multimedia presentation.

The Process

There are two primary forms of Web multimedia: on-demand and live. With on-demand or archived audio/video resources, captioning is almost always easily accomplished. The process of captioning on-demand multimedia typically involves the following steps:

1. Acquire a text transcript of any relevant spoken content. Only content that is important to the end user should be provided in captions. In other words, background music or filler words (e.g., "um" and "uh") that do not convey content do not need to be included. Transcripts can be acquired from notes, scripts, and other resources, or generated by transcription or voice recognition systems. Getting a textual version of the audible content is often the most difficult and time-intensive step in captioning multimedia.

2. Convert the text into a form understood by the media technology and generate synchronization information. In most cases, the media player delivering the Web audio/video content can also deliver the captions. The text for the captions, however, must be in a form that these technologies understand. This form varies based on the media player, but two primary standards have evolved, as described below. In order for the captions to be synchronized with the audio, information must be generated to tell the media player when to display a certain line or section of the caption text. This involves separating the text into individual one- or two-line sections, each of which comprises an individual caption display. Then some form of time descriptor is associated with each caption display. For instance, "4:00 - See Jane run" might tell the media player to display the words "See Jane run" at a point 4 seconds into the movie. This step can be done manually with a text editor or automatically with a captioning tool.

3. Present the media and the captions to the user. Because caption information is usually separate from the audio/video content, the media player must be instructed to display the captions and in what manner to display them. Again, how this is done varies among media players.
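The synchronization work in step 2 can be pictured with a format-neutral sketch. The exact syntax differs for each caption format, and the times and text here are purely illustrative:

```
0:04    See Jane run.
0:08    See Jane jump
        over the fence.
0:12    [dog barks]
```

Each time descriptor marks when its one- or two-line caption display should appear; a captioning tool generates these pairings automatically, while a text editor lets you write them by hand.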

For live multimedia, the process becomes much harder because the text equivalent of the audio content must be synchronized with the live content. Because the spoken word is often much too fast to transcribe, common solutions involve using a stenographer or voice recognition technologies to generate the text equivalent "on the fly." This text must then be delivered to the end user with the multimedia.


On the Web, there are three primary technologies for presenting audio and video multimedia: Microsoft's Windows Media Player, RealNetworks' RealPlayer, and Apple's QuickTime. A variety of technologies and standards are used for adding captions to these media players.

SMIL (Synchronized Multimedia Integration Language) is a standards-based language used by QuickTime and RealPlayer to control the layout and presentation of visual and audible items. SMIL is used to control the display, positioning, and timing of captions. The captions themselves are stored in a Text Track file if you're using QuickTime or a RealText file if you're using RealPlayer. Techniques for creating these caption files vary.
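As a rough sketch, a minimal SMIL presentation for RealPlayer might pair a video with a RealText caption stream as follows. The file names, dimensions, and region ids here are hypothetical:

```
<smil>
  <head>
    <layout>
      <root-layout width="320" height="300"/>
      <region id="videoregion" top="0" left="0" width="320" height="240"/>
      <region id="textregion" top="240" left="0" width="320" height="60"/>
    </layout>
  </head>
  <body>
    <par>
      <video src="movie.rm" region="videoregion"/>
      <textstream src="captions.rt" region="textregion" system-captions="on"/>
    </par>
  </body>
</smil>
```

The captions.rt file referenced here would be a RealText file containing the caption text with timing markers. Saved with a .smil extension, the presentation opens in RealPlayer, which lays out the video and captions in the regions defined above; the system-captions test attribute displays the text stream only for viewers who have turned captions on.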

SAMI (Synchronized Accessible Media Interchange) is Microsoft's technique for adding captions. A SAMI file contains the text to be displayed within the captions, along with information that synchronizes individual caption displays to the multimedia presentation.
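A minimal SAMI file follows the general shape below; the title, class name, styling, and caption text are hypothetical, and the Start values are in milliseconds:

```
<SAMI>
<HEAD>
  <TITLE>See Jane Run</TITLE>
  <STYLE TYPE="text/css"><!--
    P { font-family: Arial; font-size: 12pt; color: white; background-color: black; }
    .ENUSCC { Name: "English Captions"; lang: en-US; }
  --></STYLE>
</HEAD>
<BODY>
  <SYNC Start=4000><P Class=ENUSCC>See Jane run.</P></SYNC>
  <SYNC Start=8000><P Class=ENUSCC>See Jane jump.</P></SYNC>
</BODY>
</SAMI>
```

A SAMI file is typically saved alongside the media with a matching name (e.g., movie.smi next to movie.wmv); Windows Media Player then presents the synchronized captions when the viewer has captioning turned on.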

Of the three major media players, none yet easily allows real-time captioning. However, tools are in development by WebAIM and others to allow more easy and cost-effective captioning of live audio and video presentations on the Web. Macromedia Flash is also being utilized more than ever to present multimedia content on the Web, and strategies are being developed to allow captioning of Flash content.


Fortunately, there are many online resources available to help developers gain an understanding of the technologies and standards used for Web captioning. WebAIM has developed extensive tutorials at http://www.webaim.org/howto/captions/. Developers of the media players also have extensive, though somewhat complicated, resources on how to develop captions for their respective players. See "Relevant Resources" below.


Creating captioned Web multimedia does not have to be a difficult or time-intensive endeavor. Many tools are available to help developers create caption files, control the layout and timing of captions, and get accessible Web multimedia online.

Relevant Resources


Web Accessibility In Mind (WebAIM) is administered in K-12 settings through a grant provided by the Office of Special Education Programs (OSEP) of the Office of Special Education and Rehabilitative Services (OSERS), and in post-secondary education settings through a grant provided by the Fund for the Improvement of Postsecondary Education (FIPSE) Learning Anywhere Anytime Partnerships (LAAP).


Reprinted with author(s) permission. Author(s) retain copyright.