COST REDUCTION OF CAPTION
EDITING USING SUPPLEMENTAL TEXT INFORMATION
Presenter(s)
Kohtaroh Miyamoto
IBM Japan
1612-14 Shimotsuruma
Yamato 242-8502 Japan
Day Phone: 046-215-2949
Email: kmiya@jp.ibm.com
Introduction
Captions are vital for deaf and hard of hearing people to understand the
content of digital video. For example in
The most primitive approach to editing caption text is to type it in manually
while listening to the corresponding audio. Unfortunately, typing is very time
consuming, especially for double-byte character languages such as Japanese,
Chinese, and Korean, since the following steps are typically required to type
in each phrase.
Step 1. Type phonetic characters for the phrase.
Step 2. Enter a key (e.g. space key) that invokes the list of candidate
replacements.
Step 3. Select the appropriate combination of characters from the list of
candidates.
Step 4. Enter a key (e.g. enter key) that finalizes the characters.
Therefore, it is too costly to type in character strings directly as a manual
effort while listening to the audio.
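The four steps above can be turned into a rough keystroke count per phrase. The following is a minimal sketch only: the class and function names are hypothetical, and it assumes that the first candidate is highlighted when the list opens and that each move to the next candidate costs one key press. The example numbers are illustrative, not measurements.

```python
from dataclasses import dataclass


@dataclass
class PhraseInput:
    phonetic_chars: int      # Step 1: phonetic characters typed for the phrase
    candidate_position: int  # Step 3: 1-based position of the correct candidate


def keystrokes(phrase: PhraseInput) -> int:
    """Count key presses for one phrase under the four-step procedure."""
    step1 = phrase.phonetic_chars           # type the phonetic characters
    step2 = 1                               # key that opens the candidate list
    step3 = phrase.candidate_position - 1   # assumed: one press per candidate move
    step4 = 1                               # key that finalizes the characters
    return step1 + step2 + step3 + step4


# Hypothetical phrase: 6 phonetic characters, correct candidate listed third.
example = PhraseInput(phonetic_chars=6, candidate_position=3)
print(keystrokes(example))  # 6 + 1 + 2 + 1 = 10
```

Even in this simple model, every phrase carries at least two key presses of overhead on top of the phonetic characters themselves, before any time spent reading the candidate list.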
Recently, some techniques that work together with speech recognition technology
have been introduced for caption editing. Internal experiments show that
caption editing systems reduce editing time by approximately 35%. However,
error editing technology based on speech recognition results is still at a
level where correcting the recognition errors is very labor intensive.
Therefore, there is still a very strong demand to further improve productivity
and lower the total cost. Based on this requirement, a unique system called the
IBM Caption Editing System (CES) has been designed. CES lowers the editing cost
for spontaneous speech where no source of information is available [2], for the
many cases where specific presentation software is available (patent pending
[3]), and for cases where some text information or a transcript is available.
This paper focuses on how CES can contribute to lowering the editing cost when
a transcript or presentation package is available.
CES System
The CES System consists of the CES Recorder and the CES Editor. The CES
Recorder, which encapsulates the voice recognition engine, transcribes the
audio and outputs the caption candidate text. The CES Editor takes the caption
candidate text output by the CES Recorder as input and allows the user to edit
and correct all the errors to produce a complete caption. CES is designed to
work well across a wide range of voice recognition rates, since the rate varies
to a great extent depending on the user scenario.
Obviously, it is very easy to obtain correct text if at least a portion of the
transcript is available. But it is important to note that a caption is a
combination of a timestamp and caption text. In order to provide adequate
captions for content, the accuracy of both the timestamp and the caption text
needs to be assured. When the speaker in the content starts to speak a word or
a sentence, the associated caption text needs to appear at precisely the same
time for the content to be understood. This allows deaf and hard of hearing
people to understand which speaker has spoken the caption text and also to
recognize external impressions such as the speaker’s expression or any other
graphical event.
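The point that a caption pairs a timestamp with text, and that both must be accurate, can be illustrated with a small data structure. This is a sketch under stated assumptions: the `Caption` type, the `is_adequate` check, and the 0.5 second tolerance are all hypothetical and are not taken from CES.

```python
from dataclasses import dataclass


@dataclass
class Caption:
    start_sec: float  # time at which the speaker begins the phrase
    text: str         # caption text displayed from that time


def is_adequate(caption: Caption, true_start_sec: float, true_text: str,
                tolerance_sec: float = 0.5) -> bool:
    """A caption is adequate only if BOTH the timestamp and the text are right.

    tolerance_sec is an assumed threshold chosen for illustration.
    """
    time_ok = abs(caption.start_sec - true_start_sec) <= tolerance_sec
    text_ok = caption.text == true_text
    return time_ok and text_ok


# Correct text taken from a transcript, but shown 4 seconds late.
late = Caption(start_sec=16.0, text="Captions are vital.")
print(is_adequate(late, true_start_sec=12.0, true_text="Captions are vital."))
```

The example caption has perfectly correct text and is still rejected, which is why having a transcript does not by itself remove the editing work.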
Specifically, the scenario from the user’s perspective is as follows.
1) The speaker prepares some sort of transcript or presentation package before
the speech.
2) The speaker makes the speech using the CES Recorder and may use features
such as the slideshow. The CES Recorder transcribes the speech to text with its
encapsulated speech recognition engine, outputs the caption candidate, and also
captures the slideshow page, image, and text.
3) The editor may be the same person as the speaker or a different person. The
editor uses the CES Editor to correct the caption candidate errors by making
use of the transcript and the text derived from the presentation package.
4) The user only needs to specify the range of text to match, and the matching
is performed automatically. The benefit of using the CES transcript matching
feature is that there is less need to correct the caption text and the
timestamps.
5) CES creates the full content with image, audio, presentation, and caption.
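Step 4 of the scenario can be sketched as follows: for a selected range, the correct words come from the transcript and the timestamps come from the recognizer output. This sketch uses Python's standard `difflib.SequenceMatcher` as a stand-in aligner; it is not the CES matching method, and the function name `match_range` and the sample data are hypothetical.

```python
from difflib import SequenceMatcher


def match_range(recognized, transcript_words):
    """Carry recognizer timestamps over to transcript words for one range.

    recognized: list of (start_sec, word) pairs from the recognizer.
    transcript_words: list of correct words for the same range.
    Returns a list of (start_sec or None, word); None marks a word whose
    timestamp could not be taken over and still needs manual editing.
    """
    rec_words = [word for _, word in recognized]
    matcher = SequenceMatcher(a=rec_words, b=transcript_words, autojunk=False)
    captions = [(None, word) for word in transcript_words]
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        one_to_one = (i2 - i1) == (j2 - j1)
        # Identical words, or misrecognized runs of equal length, keep the
        # recognizer's timestamp and take the transcript's spelling.
        if tag == "equal" or (tag == "replace" and one_to_one):
            for k in range(i2 - i1):
                captions[j1 + k] = (recognized[i1 + k][0], transcript_words[j1 + k])
    return captions


recognized = [(0.0, "captions"), (0.6, "are"), (0.9, "vital"),
              (1.4, "four"), (1.7, "death"), (2.1, "people")]
transcript = ["captions", "are", "vital", "for", "deaf", "people"]
for start, word in match_range(recognized, transcript):
    print(start, word)
```

In this sample the two misrecognized words are replaced by the transcript words while keeping their recognized start times, so neither the text nor the timestamps need manual correction.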
Experimentation
Technically, CES has the capability to automatically match the caption
candidate text with the transcript. The previous method uses a DP matching
algorithm to obtain a character-by-character mapping. CES goes beyond
character-by-character DP matching and compares not only characters but
phonemes as well (patent pending [4]). Internal test results show that the
timestamp matching error decreased for two different voice recognition results
compared to the previous matching method. Here, IBM ViaVoice V10 Professional
was used as the voice recognition engine, and Microsoft PowerPoint XP was used
as the presentation software. The comparison was performed between the
“previous matching method”, which refers to DP matching at the character level,
and the “CES matching method”, which refers to DP matching by character and
also by phoneme.
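The difference between the two matching methods can be illustrated with a toy DP (edit distance) alignment. This is only a sketch of the general idea and not the patented CES method: the cost values, the function names, and the phoneme strings are assumptions, and English words stand in for the matching units for readability.

```python
def dp_align(recognized, transcript, sub_cost):
    """Edit-distance alignment with backtrace.

    Units are (surface, phonemes) tuples. Returns (total_cost, pairs), where
    each pair is (recognized_index, transcript_index); None on either side
    marks a deleted recognized unit or an inserted transcript unit.
    """
    n, m = len(recognized), len(transcript)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = float(i)
    for j in range(1, m + 1):
        cost[0][j] = float(j)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + sub_cost(recognized[i - 1], transcript[j - 1]),
                cost[i - 1][j] + 1.0,   # drop a recognized unit
                cost[i][j - 1] + 1.0)   # insert a transcript unit
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and cost[i][j] ==
                cost[i - 1][j - 1] + sub_cost(recognized[i - 1], transcript[j - 1])):
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1.0:
            pairs.append((i - 1, None)); i -= 1
        else:
            pairs.append((None, j - 1)); j -= 1
    pairs.reverse()
    return cost[n][m], pairs


def char_cost(a, b):
    """Surface-only comparison (the character-level method)."""
    return 0.0 if a[0] == b[0] else 1.0


def char_phoneme_cost(a, b):
    """Assumed cost: a same-sounding unit is a cheaper substitution."""
    if a[0] == b[0]:
        return 0.0
    return 0.5 if a[1] == b[1] else 1.0


# Hypothetical data: "four" was recognized for "for", plus a spurious "the".
recognized = [("four", "F AO R"), ("the", "DH AH"), ("deaf", "D EH F")]
transcript = [("for", "F AO R"), ("deaf", "D EH F")]
print(dp_align(recognized, transcript, char_cost))
print(dp_align(recognized, transcript, char_phoneme_cost))
```

With surface comparison only, pairing "for" with "four" or with "the" costs the same, so the outcome depends on tie-breaking, and this implementation happens to pair "for" with "the" and would take over the wrong timestamp. With the phoneme-aware cost the tie disappears and "for" is paired with "four".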
When the voice recognition rate was 81.4%, the timestamp error decreased from
8.4% to 2.5%. When the voice recognition rate was 60.9%, the timestamp error
decreased from 73.6% to 32.8%.
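The timestamp error figures above can also be read as absolute and relative reductions. The short calculation below uses only the numbers reported in this section; the function name is hypothetical.

```python
def relative_reduction(before_pct: float, after_pct: float) -> float:
    """Reduction expressed as a percentage of the original error rate."""
    return (before_pct - after_pct) / before_pct * 100.0


# (voice recognition rate, timestamp error before, timestamp error after)
results = [(81.4, 8.4, 2.5), (60.9, 73.6, 32.8)]
for rate, before, after in results:
    absolute = before - after
    relative = relative_reduction(before, after)
    print(f"rate {rate}%: -{absolute:.1f} points, {relative:.1f}% relative reduction")
```

The absolute reduction is much larger at the lower recognition rate (40.8 points against 5.9 points), while the relative reduction is larger at the higher recognition rate (about 70.2% against about 55.4%).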
The timestamp error rate relates to the editing hours (cost) in the following
way.
When the voice recognition rate was 81.4%, total editing hours decreased by 15%.
When the voice recognition rate was 60.9%, total editing hours decreased by 35%.
To keep the conditions even, the editor was the same individual, and for each
voice recognition result a single content was split into two. All other
conditions were equal.
Conclusion
Speech recognition alone will not solve the problem of the low caption
availability rate in the world. CES covers many user scenarios in order to
reduce the cost of correcting speech recognition errors. This paper has shown
how CES can be effective in lowering the cost of caption editing in the case
where transcription text or a presentation package exists, even when the voice
recognition rate is relatively low.
References
[1] Caption Availability in
[2] Kohtaroh Miyamoto, Effective Master Client Closed Caption Editing System
for Wide Range Workforce, HCI International 2005, Volume 7 Universal Access in
HCI, 2005
[3] Kohtaroh Miyamoto, Noriko Negishi, Kenichi Arakawa, Harmonization of Voice
Recognition, Caption Editing, and Presentation, JP9-2004-0223
[4] Kohtaroh Miyamoto, Midori Shobji, Precisely Time Stamped Closed Caption by
Up-Scaling Matching and Optimum Presentation Method, JP9-2004-0021