Listen, Look, and Find the One
ACM Transactions on Multimedia Computing, Communications, and Applications (IF 5.1), Pub Date: 2020-05-25, DOI: 10.1145/3380549
Xiao Wang₁, Wu Liu₂, Jun Chen₃, Xiaobo Wang₂, Chenggang Yan₄, Tao Mei₂

Person search with one portrait, which attempts to search the targets in arbitrary scenes using one portrait image at a time, is an essential yet unexplored problem in the multimedia field. Existing approaches, which predominantly depend on the visual information of persons, cannot solve problems when there are variations in the person’s appearance caused by complex environments and changes in pose, makeup, and clothing. In contrast to existing methods, in this article, we propose an associative multimodality index for person search with face, body, and voice information. In the offline stage, an associative network is proposed to learn the relationships among face, body, and voice information. It can adaptively estimate the weights of each embedding to construct an appropriate representation. The multimodality index can be built by using these representations, which exploit the face and voice as long-term keys and the body appearance as a short-term connection. In the online stage, through the multimodality association in the index, we can retrieve all targets depending only on the facial features of the query portrait. Furthermore, to evaluate our multimodality search framework and facilitate related research, we construct the Cast Search in Movies with Voice (CSM-V) dataset, a large-scale benchmark that contains 127K annotated voices corresponding to tracklets from 192 movies. According to extensive experiments on the CSM-V dataset, the proposed multimodality person search framework outperforms the state-of-the-art methods.
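The online stage described above can be illustrated with a small sketch. This is not the authors' implementation: the `Tracklet` fields, the `search` function, the cosine threshold, and the rule that body similarity only links tracklets within the same scene are all assumptions made for illustration, and the learned associative network that weights each embedding is not modeled here. The sketch only shows the idea of seeding with the query face, then reaching further tracklets through voice (a long-term key) and body appearance (a short-term connection).

```python
from dataclasses import dataclass
from math import sqrt
from typing import Optional, Sequence


@dataclass
class Tracklet:
    """Hypothetical index entry; the field names are assumptions."""
    tid: str
    scene: int                               # body cues are trusted only within one scene
    body: Sequence[float]
    face: Optional[Sequence[float]] = None   # may be missing (e.g. back view)
    voice: Optional[Sequence[float]] = None  # may be missing (silent tracklet)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def search(query_face, tracklets, threshold=0.8):
    """Return ids of tracklets reachable from the query portrait.

    Assumed linking rules (illustrative only):
      - face:  query portrait vs. tracklet face, anywhere in the video
      - voice: found tracklet vs. candidate, anywhere in the video
      - body:  found tracklet vs. candidate, same scene only
    """
    # Step 1: seed with tracklets whose face matches the portrait.
    found = {t.tid for t in tracklets
             if t.face is not None and cosine(query_face, t.face) >= threshold}

    # Step 2: follow voice and body links until nothing new is reached.
    changed = True
    while changed:
        changed = False
        for t in tracklets:
            if t.tid in found:
                continue
            for s in tracklets:
                if s.tid not in found:
                    continue
                voice_link = (s.voice is not None and t.voice is not None
                              and cosine(s.voice, t.voice) >= threshold)
                body_link = (s.scene == t.scene
                             and cosine(s.body, t.body) >= threshold)
                if voice_link or body_link:
                    found.add(t.tid)
                    changed = True
                    break
    return found


if __name__ == "__main__":
    index = [
        Tracklet("t1", scene=1, body=[1, 0], face=[1, 0], voice=[0, 1]),
        Tracklet("t2", scene=1, body=[0.9, 0.1]),               # same clothes, same scene
        Tracklet("t3", scene=2, body=[0, 1], voice=[0.1, 1]),   # new clothes, same voice
        Tracklet("t4", scene=2, body=[1, 0]),                   # similar clothes, other scene
        Tracklet("t5", scene=3, body=[0, 1], face=[0, 1]),      # a different person
    ]
    print(sorted(search([1, 0], index)))  # → ['t1', 't2', 't3']
```

In this toy index, `t2` is reached through body appearance within the same scene and `t3` through voice across scenes, while `t4` is not reached because body similarity alone is not trusted across scenes under the assumed rules.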


Updated: 2020-05-25
