2006 Conference General Sessions




Kohtaroh Miyamoto
IBM Japan
1612-14 Shimotsuruma
Yamato 242-8502 Japan
Day Phone: 046-215-2949
Email: kmiya@jp.ibm.com

Captions are vital for deaf and hard of hearing people to understand the content of digital video. For example, in Japan there is a public announcement that clearly sets a target for caption availability for almost all broadcasting content by 2007. There is also an industry standard that requires captions to be available for all digital content with audio. But reality is still far behind: for private broadcasting content, captions are available for only 20.3% to 46.8% of programs [1].

The most primitive approach to editing caption text is to type it in manually from the corresponding audio. Unfortunately, typing is very time consuming, especially for double-byte character languages such as Japanese, Chinese, and Korean, since the following steps are typically required to type in each phrase.

Step 1. Type phonetic characters for the phrase.
Step 2. Enter a key (e.g. the space key) that invokes the list of candidate replacements.
Step 3. Select the appropriate combination of characters from the list of candidates.
Step 4. Enter a key (e.g. the enter key) that finalizes the characters.

Therefore, it is too costly to type in character strings directly as a manual effort while listening to the audio.
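As an illustration, the four steps above can be sketched as a lookup against a candidate table. The dictionary below is a hypothetical stand-in for a real input-method dictionary; a real system consults a large lexicon plus a language model, but even then the user must inspect and choose among homophones for every phrase.

```python
# Hypothetical candidate table: four common homophones of the phonetic
# string "kisha" (きしゃ). Purely illustrative, not a real IME dictionary.
CANDIDATES = {
    "きしゃ": ["記者", "汽車", "貴社", "帰社"],
}

def convert_phrase(phonetic: str, choice: int) -> str:
    """Step 1: the phonetic string has already been typed.
    Step 2: invoke the candidate list (the 'space key' step).
    Step 3: the user selects one candidate by index.
    Step 4: the selection is finalized and returned."""
    candidates = CANDIDATES.get(phonetic, [phonetic])
    return candidates[choice]

print(convert_phrase("きしゃ", 0))  # the user still had to scan 4 candidates
```

The point of the sketch is that steps 2-4 repeat for every phrase, which is why direct manual transcription of double-byte languages is so slow.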

Recently, some techniques coordinated with speech recognition technology have been introduced for caption editing. Internal experimental results show approximately a 35% reduction in time consumption from the use of caption editing systems. But error editing based on speech recognition output is still at a level where correcting the recognition errors is very labor intensive. Therefore, there is still a very strong demand to further improve productivity and lower the total cost. Based on this requirement, a unique system called the IBM Caption Editing System (CES) has been designed. CES improves editing cost for spontaneous speech where no source of information is available [2], in the many cases where specific presentation software is available (patent pending [3]), and also when some text information or a transcript is available. This paper focuses on how CES can contribute to lowering the editing cost when a transcript or presentation package is available.

CES System
The CES system consists of the CES Recorder and the CES Editor. The CES Recorder, which encapsulates the voice recognition engine, transcribes the audio and outputs caption candidate text. The CES Editor takes the caption candidate text output by the CES Recorder as input and allows the user to edit and correct all the errors to produce a complete caption. CES is designed to work well across a wide range of voice recognition rates, since the rate varies greatly depending on the user scenario.

Obviously, it is easy to obtain correct text if at least a portion of a transcript is available. But it is important to note that a caption is a combined set of a timestamp and caption text. To provide adequate captions for content, the accuracy of both the timestamp and the caption text must be assured. When the speaker in the content starts to speak a word or a sentence, the associated caption text should appear at precisely the same time for the content to be understandable. This allows deaf and hard of hearing people to understand which speaker has spoken the caption text and also to pick up external cues such as the speaker's expression or any other graphical event.
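A caption track in this sense can be modeled as an ordered list of (timestamp, text) pairs. The following is a minimal sketch; the field names are illustrative, not CES's internal format:

```python
from dataclasses import dataclass

@dataclass
class CaptionEntry:
    start_ms: int   # when the spoken word or sentence begins
    text: str       # the caption text shown at that instant

# Both fields must be accurate: correct text at the wrong time still
# prevents viewers from linking the caption to the speaker on screen.
track = [
    CaptionEntry(0, "Good morning, everyone."),
    CaptionEntry(2300, "Today I will talk about caption editing."),
]

# Entries must appear in strictly increasing time order.
assert all(a.start_ms < b.start_ms for a, b in zip(track, track[1:]))
```

Matching against a transcript fixes the text field; the harder problem, addressed below, is assigning each entry an accurate start time.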

Specifically, the scenario from the user's perspective is the following.
1) The speaker prepares some sort of transcript or presentation package before the speech.
2) The speaker makes the speech using the CES Recorder and may use features such as the slideshow. The CES Recorder transcribes the speech to text with the encapsulated speech recognition engine, outputs the caption candidate, and also captures the slideshow page, image, and text.
3) The editor may be the same person as the speaker or a different person. The editor uses the CES Editor to correct the caption candidate errors by making use of the transcript and the text derived from the presentation package.
4) The user only needs to specify the range of text to match, and the matching is performed automatically. The benefit of the CES transcript matching feature is that there is less need to correct the caption text and the timestamps.
5) CES creates full content with image, audio, presentation, and captions.

Technically, CES has the capability to automatically match the caption candidate text with the transcript. The previous method uses a DP (dynamic programming) matching algorithm to obtain a character-by-character mapping. CES goes beyond character-by-character DP matching and compares not only characters but phonemes as well (patent pending [4]). Internal test results show that the timestamp matching error decreased for two different voice recognition results compared to the previous matching method. Here, IBM ViaVoice V10 Professional was used as the voice recognition engine, and Microsoft PowerPoint XP was used as the presentation software. The comparison was performed between the "previous matching method", which refers to DP matching at the character level, and the "CES matching method", which refers to DP matching by character and also by phoneme.
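The character-level DP matching described above can be sketched as a standard edit-distance alignment with traceback. This sketch corresponds to the baseline "previous matching method"; the patent-pending CES extension is not reproduced here, but it follows the same skeleton with a substitution cost that also considers phoneme similarity rather than exact character equality.

```python
def dp_align(recognized: str, transcript: str):
    """Character-by-character DP matching: returns pairs (i, j) aligning
    recognized[i] to transcript[j], so timestamps attached to recognized
    characters can be transferred onto the transcript text."""
    n, m = len(recognized), len(transcript)
    # cost[i][j] = edit distance between prefixes of length i and j
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i
    for j in range(m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if recognized[i - 1] == transcript[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j] + 1,        # deletion
                             cost[i][j - 1] + 1,        # insertion
                             cost[i - 1][j - 1] + sub)  # match / substitute
    # Trace back from the corner to recover the aligned character pairs.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        sub = 0 if recognized[i - 1] == transcript[j - 1] else 1
        if cost[i][j] == cost[i - 1][j - 1] + sub:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif cost[i][j] == cost[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

# A recognition error ("speach") still aligns to the correct transcript span,
# so the timestamp of each recognized character carries over.
print(dp_align("speach", "speech"))
```

A phoneme-aware variant would replace the 0/1 substitution cost with a graded cost so that characters with the same pronunciation align even when the recognizer picked the wrong homophone, which matters greatly for Japanese kana-kanji text.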
When the voice recognition result was 81.4%, the timestamp error decreased from 8.4% to 2.5%. When the voice recognition result was 60.9%, the timestamp error decreased from 73.6% to 32.8%.

The timestamp error rate relates to the editing hours (cost) in the following way.
When the voice recognition result was 81.4%, total editing hours decreased by 15%.
When the voice recognition result was 60.9%, total editing hours decreased by 35%.
To even the conditions, the editor was the same individual, and for the same voice recognition result a single piece of content was split into two. All other conditions were equal.

Speech recognition alone will not solve the problem of the low caption availability rate in the world. CES covers many user scenarios so that all the speech recognition errors can be corrected. This paper has shown how CES can be effective in lowering the cost of caption editing when a transcript or presentation package exists, even in cases where the voice recognition rate is relatively low.

[1] Caption Availability in Japan (Fiscal Year 2003), http://www.soumu.go.jp/snews/2004/pdf/04080631.pdf, 2005/09/20

[2] Kohtaroh Miyamoto, Effective Master Client Closed Caption Editing System for Wide Range Workforce, HCI International 2005, Volume 7 Universal Access in HCI, 2005

[3] Kohtaroh Miyamoto, Noriko Negishi, Kenichi Arakawa, Harmonization of Voice Recognition, Caption Editing, and Presentation, JP9-2004-0223

[4] Kohtaroh Miyamoto, Midori Shobji, Precisely Time Stamped Closed Caption by Up-Scaling Matching and Optimum Presentation Method, JP9-2004-0021


Reprinted with author(s) permission. Author(s) retain copyright