EgoCom: A Multi-Person Multi-Modal Egocentric Communications Dataset.
IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 20.8). Pub Date: 2023-05-05. DOI: 10.1109/tpami.2020.3025105
Curtis Northcutt, Shengxin Zha, Steven Lovegrove, Richard Newcombe

Multi-modal datasets in artificial intelligence (AI) often capture a third-person perspective, but our embodied human intelligence evolved with sensory input from the egocentric, first-person perspective. Towards embodied AI, we introduce the Egocentric Communications (EgoCom) dataset to advance the state-of-the-art in conversational AI, natural language, audio speech analysis, computer vision, and machine learning. EgoCom is a first-of-its-kind natural conversations dataset containing multi-modal human communication data captured simultaneously from the participants' egocentric perspectives. EgoCom includes 38.5 hours of synchronized embodied stereo audio, egocentric video with 240,000 ground-truth, time-stamped word-level transcriptions and speaker labels from 34 diverse speakers. We study baseline performance on two novel applications that benefit from embodied data: (1) predicting turn-taking in conversations and (2) multi-speaker transcription. For (1), we investigate Bayesian baselines to predict turn-taking within 5 percent of human performance. For (2), we use simultaneous egocentric capture to combine Google speech-to-text outputs, improving global transcription by 79 percent relative to a single perspective. Both applications exploit EgoCom's synchronous multi-perspective data to augment performance of embodied AI tasks.
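The second application combines speech-to-text outputs produced independently from each participant's synchronized egocentric audio. The abstract does not specify the combination rule, so the sketch below is only one plausible strategy, assuming the ASR service returns time-stamped segments with confidence scores: for each time-aligned segment, keep the hypothesis from whichever wearer's microphone heard it most confidently. The `Segment` fields and `merge_perspectives` helper are illustrative assumptions, not the paper's exact method.

```python
# Hypothetical sketch: merging time-aligned speech-to-text hypotheses from
# multiple synchronized egocentric recordings. Field names and the
# confidence-based selection rule are assumptions for illustration only.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    start: float        # segment start time in seconds, on a clock shared across wearers
    end: float          # segment end time in seconds
    text: str           # ASR hypothesis for this segment
    confidence: float   # ASR confidence in [0, 1]
    wearer: str         # which participant's headset captured this audio


def merge_perspectives(per_wearer: List[List[Segment]]) -> List[Segment]:
    """For each time slot, keep the hypothesis from the wearer whose
    microphone transcribed the segment with the highest confidence."""
    # Bucket segments by rounded start time so the same utterance captured
    # on several headsets competes directly within one bucket.
    buckets: Dict[float, Segment] = {}
    for segments in per_wearer:
        for seg in segments:
            key = round(seg.start, 1)
            if key not in buckets or seg.confidence > buckets[key].confidence:
                buckets[key] = seg
    # Return the winning segments in chronological order.
    return [buckets[k] for k in sorted(buckets)]
```

Usage would be to run a speech-to-text service separately on each wearer's audio track, collect one list of Segments per participant into `per_wearer`, and call `merge_perspectives` to produce a single global transcript; the intuition matching the abstract is that the active speaker's own headset microphone usually yields the most reliable hypothesis for their words.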

Updated: 2020-09-18