research-article

Listen, Look, and Find the One: Robust Person Search with Multimodality Index

Published: 22 May 2020

Abstract

Person search with one portrait, which aims to find a target person in arbitrary scenes given a single portrait image, is an essential yet unexplored problem in the multimedia field. Existing approaches, which depend predominantly on the visual information of persons, fail when a person's appearance varies because of complex environments or changes in pose, makeup, and clothing. In contrast to existing methods, in this article we propose an associative multimodality index for person search that uses face, body, and voice information. In the offline stage, an associative network learns the relationships among face, body, and voice information and adaptively estimates the weight of each embedding to construct an appropriate representation. The multimodality index is then built from these representations, using the face and voice as long-term keys and the body appearance as a short-term connection. In the online stage, the multimodality association in the index makes it possible to retrieve all targets using only the facial features of the query portrait. Furthermore, to evaluate our multimodality search framework and facilitate related research, we construct the Cast Search in Movies with Voice (CSM-V) dataset, a large-scale benchmark that contains 127K annotated voices corresponding to tracklets from 192 movies. In extensive experiments on the CSM-V dataset, the proposed multimodality person search framework outperforms state-of-the-art methods.
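
To make the idea of adaptively weighting the face, body, and voice embeddings more concrete, the following is a minimal sketch and not the associative network described in the article. It assumes that each modality already has an embedding vector and a scalar quality score; the function names (`l2_normalize`, `softmax`, `fuse_modalities`) and the quality scores are hypothetical. The article learns the weights with a network, whereas this sketch simply applies a softmax to given scores.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_modalities(embeddings, quality_scores):
    """Weighted concatenation of per-modality embeddings.

    embeddings:     dict mapping modality name -> embedding vector
    quality_scores: dict mapping modality name -> float
                    (hypothetical stand-in for learned weights)
    Returns the unit-length fused vector and the weight of each modality.
    """
    names = sorted(embeddings)  # fixed order so fused vectors are comparable
    weights = softmax([quality_scores[name] for name in names])
    fused = []
    for name, weight in zip(names, weights):
        # Normalize each modality first so no modality dominates by scale.
        fused.extend(weight * x for x in l2_normalize(embeddings[name]))
    return l2_normalize(fused), dict(zip(names, weights))


# Toy example with made-up 2-D embeddings for one tracklet.
fused, weights = fuse_modalities(
    {"face": [1.0, 0.0], "body": [0.0, 2.0], "voice": [3.0, 4.0]},
    {"face": 2.0, "body": 0.5, "voice": 1.0},
)
print(weights)
print(fused)
```

In this sketch, a modality with a higher score contributes more to the fused representation, which mirrors in a simplified way the abstract's statement that the weight of each embedding is estimated adaptively.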

          • Published in

            ACM Transactions on Multimedia Computing, Communications, and Applications, Volume 16, Issue 2
            May 2020
            390 pages
            ISSN: 1551-6857
            EISSN: 1551-6865
            DOI: 10.1145/3401894

            Copyright © 2020 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 22 May 2020
            • Online AM: 7 May 2020
            • Accepted: 1 January 2020
            • Revised: 1 December 2019
            • Received: 1 October 2019

            Qualifiers

            • research-article
            • Research
            • Refereed
