Language agnostic missing subtitle detection
EURASIP Journal on Audio, Speech, and Music Processing (IF 1.7), Pub Date: 2022-06-11, DOI: 10.1186/s13636-022-00244-9
Honey Gupta , Mayank Sharma

Subtitles are a crucial component of Digital Entertainment Content (DEC, such as movies and TV shows) localization. With an ever-growing catalog (≈ 2M titles) and expanding localization (30+ languages), automated subtitle quality checks become paramount. Because subtitles are created manually, they can contain errors such as missing transcriptions, subtitle blocks that are out of sync with the audio, and incorrect translations. Such erroneous subtitles result in an unpleasant viewing experience and hurt viewership. Moreover, manual correction is laborious and costly, and it requires expertise in both the audio and subtitle languages. A typical subtitle correction process consists of (1) a linear watch of the movie, (2) identification of the time stamps associated with erroneous subtitle blocks, and (3) the correction procedure. Of the three, the time a human expert spends watching the entire movie is the most time-consuming step. This paper addresses the problem of missing transcriptions, where the subtitle blocks corresponding to some speech segments in the DEC do not exist. We present a solution that augments the human correction process by automatically identifying, in a language-agnostic manner, the timings associated with non-transcribed dialogues. The correction step can then be performed either by a human-in-the-loop mechanism or automatically using neural transcription (speech-to-text in the same language) and translation (text-to-text across languages) engines. Our method uses a language-agnostic neural voice activity detector (VAD) and an audio classifier (AC) trained explicitly on DEC corpora for better generalization. The method consists of three steps: first, we use the VAD to identify the timings associated with dialogues (predicted speech blocks). Second, we refine those timings with the AC module by removing the leading and trailing non-speech segments that the VAD misidentified as speech. Finally, we compare the predicted dialogue timings to the dialogue timings present in the subtitle file (subtitle speech blocks) and flag the missing transcriptions. We empirically demonstrate that, on a human-annotated dataset of missing subtitle speech blocks, the proposed method (a) reduces incorrect predicted missing-subtitle timings by 10%, (b) improves the predicted missing-subtitle timings by 2.5%, (c) reduces the false positive rate (FPR) of overextending the predicted timings by 77%, and (d) improves the predicted speech block-level precision by 119% over the VAD baseline.
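As a rough illustration of the final comparison step, the sketch below flags predicted speech blocks that no subtitle block sufficiently covers. The (start, end) block representation, the helper names (overlap, flag_missing_subtitles), and the 50% coverage threshold are assumptions made for illustration; the paper's actual matching criterion may differ.

from typing import List, Tuple

Block = Tuple[float, float]  # (start_time, end_time) in seconds

def overlap(a: Block, b: Block) -> float:
    """Length of the temporal intersection of two blocks, in seconds."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))

def flag_missing_subtitles(
    predicted_speech: List[Block],
    subtitle_blocks: List[Block],
    min_overlap_ratio: float = 0.5,  # illustrative threshold, not from the paper
) -> List[Block]:
    """Return predicted speech blocks not covered by any subtitle block.

    A predicted block counts as transcribed if its best-matching subtitle
    block covers at least min_overlap_ratio of its duration; otherwise its
    timings are flagged as a candidate missing transcription.
    """
    missing = []
    for speech in predicted_speech:
        duration = speech[1] - speech[0]
        covered = max((overlap(speech, sub) for sub in subtitle_blocks), default=0.0)
        if duration > 0 and covered / duration < min_overlap_ratio:
            missing.append(speech)
    return missing

# Example: two dialogues detected by VAD+AC, only the first is subtitled.
predicted = [(10.0, 13.5), (42.0, 45.0)]
subtitles = [(9.8, 13.6)]
print(flag_missing_subtitles(predicted, subtitles))  # -> [(42.0, 45.0)]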

Updated: 2022-06-12