COST REDUCTION OF CAPTION EDITING USING SUPPLEMENTAL TEXT INFORMATION
Yamato 242-8502 Japan
Day Phone: 046-215-2949
Captions are vital for deaf and hard of hearing people to understand the content of digital video. The most primitive approach to caption editing is to type in the text manually from the corresponding audio. Unfortunately, typing is very time consuming, especially for double-byte character languages such as Japanese, Chinese, and Korean, since the following steps are typically required to type in each phrase.
1. Type phonetic characters for the phrase.
2. Enter a key (e.g. the space key) that invokes the list of candidate replacements.
3. Select the appropriate combination of characters from the list of candidates.
4. Enter a key (e.g. the enter key) that finalizes the characters.
Therefore, it is too costly to type in character strings directly as a manual effort while listening to the audio.
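For illustration, the four-step input cycle above can be sketched in a few lines of Python. The candidate table and the way key actions are counted are hypothetical simplifications for demonstration, not taken from any real input method:

```python
# Illustrative sketch of the multi-step phonetic-to-kanji input cycle.
# The candidate table below is hypothetical and only for demonstration.
CANDIDATES = {
    "きかい": ["機械", "機会", "奇怪"],  # one phonetic string, several kanji candidates
}

def type_phrase(phonetic: str, choice_index: int) -> tuple[str, int]:
    """Return the finalized text and a rough count of key actions needed."""
    keystrokes = len(phonetic)          # Step 1: type the phonetic characters
    keystrokes += 1                     # Step 2: space key opens the candidate list
    candidates = CANDIDATES[phonetic]
    keystrokes += choice_index          # Step 3: move through the list to the right entry
    keystrokes += 1                     # Step 4: enter key finalizes the characters
    return candidates[choice_index], keystrokes

text, cost = type_phrase("きかい", 1)
print(text, cost)  # even a single short phrase costs several key actions
```

Even under these simplified assumptions, every phrase costs the phonetic keystrokes plus at least two extra key actions, which is why manual transcription of long audio is so expensive.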
Some techniques coordinated with speech recognition technology have been introduced for caption editing. Internal experiments show that such caption editing systems reduce editing time by approximately 35%. However, editing technology based on speech recognition output is still at a level where correcting the recognition errors is very labor intensive. Therefore, there is still a very strong demand to further improve productivity and lower the total cost. Based on this requirement, a unique system called the IBM Caption Editing System (CES) has been designed. CES improves editing cost for spontaneous speech where no source text information is available [2], for the many cases where specific presentation software is available (patent pending [3]), and also when some text information or a transcript is available. This paper focuses on how CES can contribute to lowering the editing cost when a transcript or presentation package is available.
The CES system consists of the CES Recorder and the CES Editor. The CES Recorder, which encapsulates the voice recognition engine, transcribes the audio and outputs caption candidate text. The CES Editor takes the caption candidate text output by the CES Recorder as input and allows the user to edit and correct all errors to produce a complete caption. CES is designed to work well over a wide range of voice recognition rates, since the rate varies greatly depending on the user scenario.
It is very easy to obtain correct text if at least a portion of the transcript is available. However, it is important to note that a caption is a combined set of a timestamp and caption text. In order to provide adequate captions for the content, the accuracy of both the timestamp and the caption text needs to be assured. When a speaker in the content starts to speak a word or a sentence, it is necessary for comprehension that the associated caption text appears at precisely the same time. This allows deaf and hard of hearing people to understand which speaker has spoken the caption text and also to recognize external impressions such as the speaker's expression or other visual cues.
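The timestamp-plus-text pairing can be shown with a minimal sketch. The data structure and the SRT-style timecode rendering below are generic illustrations, since the paper does not specify CES's internal caption format:

```python
from dataclasses import dataclass

@dataclass
class Caption:
    """A caption is a combined set of a timestamp and caption text."""
    start_ms: int  # moment the speaker begins the word or sentence
    end_ms: int    # moment the caption should disappear
    text: str

def to_timecode(ms: int) -> str:
    """Render milliseconds as an SRT-style HH:MM:SS,mmm timecode."""
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

caption = Caption(start_ms=1_500, end_ms=4_200, text="Welcome to the session.")
print(f"{to_timecode(caption.start_ms)} --> {to_timecode(caption.end_ms)}")
print(caption.text)
```

The point of the pairing is that an error in either field degrades the caption: correct text shown at the wrong moment is as confusing to the viewer as wrong text shown at the right moment.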
The description of the scenario from the user's perspective is as follows.
1) The speaker prepares some sort of transcript or presentation package before the speech.
2) The speaker makes the speech using the CES Recorder and may use features such as the slideshow. The CES Recorder transcribes the speech to text with the encapsulated speech recognition engine, outputs the caption candidate, and also captures the slideshow page, image, and text.
3) The editor may be the same person as the speaker or a different person. The editor uses the CES Editor to correct the caption candidate errors by making use of the transcript and the text derived from the presentation package.
4) The user only needs to specify the range of text to match, and the matching is performed automatically. The benefit of the CES transcript matching feature is that there is less need to correct the caption text and the timestamps.
5) CES creates full content with image, audio, presentation, and captions.
Technically, CES has the capability to automatically match the caption candidate text with the transcript. The previous method uses a DP matching algorithm to obtain a character-by-character matching mapping. CES goes beyond character-by-character DP matching and compares not only characters but phonemes as well (patent pending [4]). Internal test results show that the timestamp matching error decreased for two different voice recognition results compared to the previous matching method. Here, IBM ViaVoice V10 Professional was used as the voice recognition engine, and Microsoft PowerPoint XP was used as the presentation software. The comparison was performed between the "previous matching method", which refers to DP matching at the character level, and the "CES matching method", which refers to DP matching by character and also by phoneme.
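A character-level DP matching pass of the kind the previous method uses can be sketched as follows. The phoneme-level comparison that distinguishes CES is engine-specific and is not reproduced here, and the function name and example strings are illustrative:

```python
def dp_align(recognized: str, transcript: str):
    """Character-level DP (edit-distance) alignment between a recognized
    string and a transcript.  Returns pairs (i, j) of matched character
    positions; such a mapping lets recognizer timestamps be carried over
    onto the corresponding transcript characters."""
    n, m = len(recognized), len(transcript)
    # dp[i][j] = edit distance between recognized[:i] and transcript[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if recognized[i - 1] == transcript[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # match / substitution
    # Trace back to collect exactly matched character positions.
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if recognized[i - 1] == transcript[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif dp[i][j] == dp[i - 1][j] + 1:
            i -= 1
        else:
            j -= 1
    return list(reversed(pairs))

pairs = dp_align("helo wrld", "hello world")
print(pairs)  # monotone character mapping despite the recognition errors
```

Matching by phoneme in addition to characters, as the CES method does, would relax the exact-equality test above so that differently spelled but identically pronounced strings can still be aligned, which matters for languages where one pronunciation maps to many character renderings.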
When the voice recognition accuracy was 81.4%, the timestamp error decreased from 8.4% to 2.5%. When the voice recognition accuracy was 60.9%, the timestamp error decreased from 73.6% to 32.8%.
The timestamp error rate relates to the editing hours (cost) as follows. When the voice recognition accuracy was 81.4%, the total editing hours decreased by 15%. When the voice recognition accuracy was 60.9%, the total editing hours decreased by 35%.
To equalize the conditions, the same editor performed all the editing, and for each voice recognition result a single content item was split into two parts. All other conditions were equal.
Speech recognition alone will not solve the problem of the low caption availability rate worldwide. CES covers many user scenarios so that all the speech recognition errors can be corrected efficiently. This paper has shown how CES can be effective in lowering the cost of caption editing in cases where a transcript or presentation package exists, even when the voice recognition rate is relatively low.
REFERENCES
[1] Caption Availability in
[2] Kohtaroh Miyamoto, Effective Master Client Closed Caption Editing System for Wide Range Workforce, HCI International 2005, Volume 7: Universal Access in HCI, 2005.
[3] Kohtaroh Miyamoto, Noriko Negishi, Kenichi Arakawa, Harmonization of Voice Recognition, Caption Editing, and Presentation, JP9-2004-0223.
[4] Kohtaroh Miyamoto, Midori Shobji, Precisely Time Stamped Closed Caption by Up-Scaling Matching and Optimum Presentation Method, JP9-2004-0021.