CSUN Assistive Technology Conference

How 7 Top Speech Recognition APIs Perform for Captioning

What is the current state of speech technology? When will automatic speech recognition (ASR) be sufficient for closed captioning, or even for live captioning? Is synthesized speech a real option for audio description? Do we still need humans? This session will cover our 2021 study of the 7 leading ASR engines to understand where speech AI stands for the task of captioning & transcription as we enter 2022.

First, it’s important to understand how applying speech recognition to captioning differs from applying it to other technologies. In specific applications, such as Apple Siri and Amazon Echo or Alexa, speech-to-text (really, speech-to-meaning) technology has become increasingly useful, to the point where many people are able to send text messages, search the web, control their music players, and more by voice. It is important to draw a careful distinction between “automated assistant” applications like Siri and the captioning/transcription task. Captioning/transcription is much more challenging. This task is primarily characterized by long-form content where the speaker is completely unknown and where it is essential to transcribe each and every word that is spoken.

Our research tested seven of the most popular ASR technologies. We tested 490 files from 160 projects across 10 industries. In total, this came to approximately 100 hours of content, or around 800K words.

Word Error Rate vs. Formatted Error Rate

Word Error Rate (WER) is widely used across the speech recognition community to judge quality. When measuring WER, ASR researchers make several accommodations in order to keep results comparable between systems. For example, a WER measurement will treat the following two sequences as equivalent, despite the very clear formatting differences:

“Mr. Johnson delivered the package, which contained one hundred pounds of feathers, at five o’clock PM.”
“mister johnson delivered the package which contained 100 lbs of feathers at 5:00pm.”

Formatted Error Rate (FER) is the percentage of word errors when formatting elements such as punctuation, grammar, speaker identification, non-speech elements, capitalization, and other notations are taken into account.

WER Results

The WER test found that Speechmatics V2 API (SMX+) had the lowest overall error rate, with an error score of 13.1% according to the McNemar test. SMX+ also had the lowest deletion rate at 3.82%. A low deletion rate is integral to the use of ASR for creating captions and transcripts, as deleting words can change the meaning and convey incorrect information. Google, for instance, had the highest deletion rate at 7.75%. Removing words could cause the user to miss out on large pieces of content. However, all of the above concerns WER measurements, and WER is not the entire story.

FER Results

Punctuation and capitalization are crucial to relaying the correct message. Incorrect punctuation can also make a file difficult to comprehend, for example when following along or working out who is speaking. For these reasons, it is important to measure accuracy rates that include punctuation as a factor. Note the significant difference in meaning in each of these sentence pairs:

“Let’s eat Grandma!” vs. “Let’s eat, Grandma!”
“I like cooking my family and pets.” vs. “I like cooking, my family, and pets.”

In terms of FER, none of the providers produces output close to sufficient for compliance. Accuracy rates dropped quite a bit across the board when considering FER as well as WER. The research we conducted confirmed that even the best automatic captions leave much room for improved accuracy, which can only be obtained through human customization. There will need to be some very fundamental advances in machine learning and natural language processing in order to replicate the intelligence that professional editors can bring to bear on this very demanding task.
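
To make the WER and FER distinction concrete, here is a minimal Python sketch. It is not the scoring pipeline used in the study. It assumes WER is computed as word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words, after lowercasing and stripping punctuation. The `formatted=True` mode is only a rough stand-in for FER: it keeps case and punctuation attached to each token, and it does not model speaker identification or non-speech elements. The function names `normalize`, `edit_distance`, and `error_rate` are illustrative, and the normalizer does not map number or abbreviation variants such as "one hundred pounds" and "100 lbs" onto each other.

```python
import re


def normalize(text):
    """Lowercase, replace punctuation (except apostrophes) with spaces, split into words."""
    text = text.lower()
    text = re.sub(r"[^\w\s']", " ", text)
    return text.split()


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # substitution, or a match when cost is 0
            ))
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, formatted=False):
    """Edit distance divided by reference length.

    formatted=False approximates WER (case and punctuation ignored).
    formatted=True keeps case and punctuation on each token, as a rough
    stand-in for a formatting-aware error rate.
    """
    tokenize = str.split if formatted else normalize
    ref, hyp = tokenize(reference), tokenize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hyp) / len(ref)


reference = "Let's eat, Grandma!"
hypothesis = "let's eat grandma"
print(error_rate(reference, hypothesis))                  # normalized comparison
print(error_rate(reference, hypothesis, formatted=True))  # case and punctuation count
```

In this sketch, the two versions of the "Grandma" sentence are identical under the normalized comparison, while every token differs once case and punctuation are counted. This mirrors why a transcript can score well on WER and still read poorly as a caption.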
  • Higher Education
  • Information & Communications Technology
  • Media & Publishing
  • Research & Development
Audience Level
Session Summary (Abstract)
Do we still need humans? This session presents the latest research comparing 7 leading speech recognition APIs to show the current state of ASR for captioning & transcription.
Session Type
General Track  
  • Artificial Intelligence (AI) & Machine Learning (ML)
  • Captions & Transcription
  • Digital Accessibility
  • Video & Live Streaming


  • Lily Bond
    3Play Media
