Abstract
Video face clustering is a challenging task with wide applications. Unlike ordinary image clustering, faces in videos usually exist as a series of tracks, which provide prior knowledge. Specifically, faces from the same track are considered to be the same person, while faces from different tracks appearing in the same frame are considered to be different people. Based on this prior knowledge, we propose the self-supervised deep subspace clustering network (SDSCN). SDSCN adopts an autoencoder to nonlinearly map the faces into a latent space and adds a fully connected layer between the encoder and decoder to exploit the self-expressiveness property. The prior knowledge is automatically incorporated into the loss function to guide training. We further propose an efficient training strategy for the network and the clustering. Experiments on two public datasets (BBT0101 and Notting-Hill) demonstrate the advantages of our method: compared with state-of-the-art methods, it improves clustering accuracy by about 3–17% on BBT0101 and by about 6–23% on Notting-Hill.
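The abstract does not spell out the SDSCN loss, so the following is only a minimal NumPy sketch, assuming a deep-subspace-clustering-style objective (autoencoder reconstruction, latent self-expression Z ≈ CZ with a zero diagonal, and a Frobenius regularizer on C) plus two hypothetical prior terms derived from face tracks. Same-track faces (must-links) are encouraged to have similar coefficient rows, and faces whose tracks share a frame (cannot-links) are discouraged from expressing each other. All names (`track_constraints`, `must_link_penalty`, `sdscn_style_loss`, `affinity`, the `lam_*` weights) are illustrative assumptions, not the authors' code.

```python
import numpy as np


def track_constraints(track_ids, frame_ids):
    """Derive must-link / cannot-link matrices from face-track metadata.

    Hypothetical inputs: track_ids[i] and frame_ids[i] are the track and frame
    of face i. Same track -> must-link; tracks sharing a frame -> cannot-link.
    """
    track_ids = np.asarray(track_ids)
    frame_ids = np.asarray(frame_ids)

    # Must-link: faces from the same track, excluding self-pairs.
    must = (track_ids[:, None] == track_ids[None, :]).astype(float)
    np.fill_diagonal(must, 0.0)

    # Track-by-frame incidence matrix.
    _, t_idx = np.unique(track_ids, return_inverse=True)
    _, f_idx = np.unique(frame_ids, return_inverse=True)
    inc = np.zeros((t_idx.max() + 1, f_idx.max() + 1), dtype=int)
    inc[t_idx, f_idx] = 1

    # Two distinct tracks co-occur if they share at least one frame.
    co = (inc @ inc.T) > 0
    np.fill_diagonal(co, False)

    # Expand track-level co-occurrence to face-level cannot-links.
    cannot = co[t_idx][:, t_idx].astype(float)
    return must, cannot


def must_link_penalty(C, must):
    """tr(C^T L C) with L = D - M, equal to 0.5 * sum_ij M_ij ||c_i - c_j||^2."""
    lap = np.diag(must.sum(axis=1)) - must
    return float(np.trace(C.T @ lap @ C))


def sdscn_style_loss(X, X_hat, Z, C, must, cannot,
                     lam_reg=1.0, lam_self=1.0, lam_must=1.0, lam_cannot=1.0):
    """Hypothetical DSC-style objective with track priors added.

    X, X_hat: inputs and autoencoder reconstructions, shape (n, d).
    Z: latent codes, one row per face, shape (n, k).
    C: self-expression coefficients of the fully connected layer, shape (n, n).
    """
    C = C - np.diag(np.diag(C))  # zero diagonal: a face may not express itself
    recon = 0.5 * np.sum((X - X_hat) ** 2)
    reg = np.sum(C ** 2)  # Frobenius-norm regularizer (one common choice)
    self_expr = 0.5 * np.sum((Z - C @ Z) ** 2)  # latent self-expressiveness
    must_term = must_link_penalty(C, must)
    cannot_term = np.sum(cannot * np.abs(C))  # suppress cross-person coefficients
    return float(recon + lam_reg * reg + lam_self * self_expr
                 + lam_must * must_term + lam_cannot * cannot_term)


def affinity(C):
    """Symmetric affinity matrix built from the learned coefficients."""
    return np.abs(C) + np.abs(C).T


if __name__ == "__main__":
    # Tracks 0 and 1 share frames 0 and 1; track 2 appears alone in frame 5.
    must, cannot = track_constraints([0, 0, 1, 1, 2], [0, 1, 0, 1, 5])
    print(int(must.sum()), int(cannot.sum()))  # 4 8
```

In deep subspace clustering pipelines, a matrix like `affinity(C)` is typically passed to spectral clustering to obtain the final labels; the abstract does not say exactly which clustering step SDSCN uses, so treat that as an assumption as well.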
Acknowledgements
The work was supported by Zhejiang Provincial Natural Science Foundation of China under Grant No. LY18F020034, and Natural Science Foundation of China under Grant No. 61801428.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Yunhao Qiu and Pengyi Hao have contributed equally.
Cite this article
Qiu, Y., Hao, P. Self-supervised deep subspace clustering network for faces in videos. Vis Comput 37, 2253–2261 (2021). https://doi.org/10.1007/s00371-020-01984-5
DOI: https://doi.org/10.1007/s00371-020-01984-5