Abstract
Person search with one portrait, which aims to find a target person in arbitrary scenes given a single portrait image, is an essential yet underexplored problem in the multimedia field. Existing approaches, which depend predominantly on visual information, fail when the person's appearance varies due to complex environments and changes in pose, makeup, and clothing. In contrast to existing methods, in this article we propose an associative multimodality index for person search that combines face, body, and voice information. In the offline stage, an associative network learns the relationships among face, body, and voice information; it adaptively estimates a weight for each embedding to construct an appropriate representation. A multimodality index is then built from these representations, exploiting the face and voice as long-term keys and the body appearance as a short-term connection. In the online stage, through the multimodality associations in the index, all targets can be retrieved using only the facial features of the query portrait. Furthermore, to evaluate our multimodality search framework and facilitate related research, we construct the Cast Search in Movies with Voice (CSM-V) dataset, a large-scale benchmark containing 127K annotated voices corresponding to tracklets from 192 movies. Extensive experiments on the CSM-V dataset show that the proposed multimodality person search framework outperforms state-of-the-art methods.
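The two stages the abstract describes can be illustrated with a minimal sketch: offline, per-modality embeddings (face, body, voice) are fused into one identity representation using adaptive weights; online, the index is queried with only the portrait's face embedding. The softmax over `quality_scores` below is a hypothetical stand-in for the paper's learned associative network, and all function names and shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fuse_embeddings(face, body, voice, quality_scores):
    """Fuse per-modality embeddings into one L2-normalized identity
    representation. quality_scores holds raw confidences for
    (face, body, voice); a softmax converts them to fusion weights
    (a stand-in for the learned associative network)."""
    w = np.exp(quality_scores - np.max(quality_scores))
    w = w / w.sum()
    stacked = np.stack([face, body, voice])     # shape (3, d)
    fused = (w[:, None] * stacked).sum(axis=0)  # shape (d,)
    return fused / np.linalg.norm(fused)

def search(index, query_face, top_k=1):
    """Rank indexed identities by cosine similarity between their
    fused representation and the query portrait's face embedding."""
    names = list(index)
    mat = np.stack([index[n] for n in names])   # shape (N, d)
    q = query_face / np.linalg.norm(query_face)
    sims = mat @ q
    order = np.argsort(-sims)[:top_k]
    return [(names[i], float(sims[i])) for i in order]
```

Because each fused representation retains a strong face component, a face-only query still ranks the correct identity first, which mirrors the abstract's claim that retrieval depends only on the facial features of the query portrait.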
Listen, Look, and Find the One: Robust Person Search with Multimodality Index