Abstract
We present CrossInfoMobileNet, a hand pose estimation convolutional neural network based on CrossInfoNet and specifically tuned to mobile phone processors through the optimization, modification, and replacement of computationally critical CrossInfoNet components. By introducing the state-of-the-art MobileNetV3 network as a feature extractor and refiner, and by replacing the ReLU activation with the better-performing H-Swish activation function, we obtain a network that requires 2.37 times fewer multiply-add operations and 2.22 times fewer parameters than CrossInfoNet, while maintaining the same error on the state-of-the-art datasets. This reduction in multiply-add operations translates into, on average, 1.56 times faster real-world performance on both desktop and mobile devices, making the network more suitable for embedded applications. The full source code of CrossInfoMobileNet, including the sample dataset and its evaluation, is available online through Code Ocean.
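The H-Swish activation mentioned in the abstract has a standard closed form, x · ReLU6(x + 3) / 6. The following minimal Python sketch (illustrative only, not taken from the paper's code base) shows why it is cheap on mobile hardware: unlike Swish (x · sigmoid(x)), it needs no exponential, only clamping and multiplication.

```python
def relu6(x: float) -> float:
    # ReLU capped at 6, as used in mobile-oriented networks.
    return min(max(x, 0.0), 6.0)

def h_swish(x: float) -> float:
    # Hard-Swish: x * ReLU6(x + 3) / 6, a piecewise approximation
    # of Swish that avoids computing an exponential.
    return x * relu6(x + 3.0) / 6.0

print(h_swish(-4.0))  # -> 0.0   (saturated negative region)
print(h_swish(4.0))   # -> 4.0   (identity region for x >= 3)
```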
Notes
Note that ResNet networks, introduced in 2015, are considered too computationally demanding to run at above-real-time rates on mobile devices.
There are two variants of MobileNetV3: MobileNetV3-Small and MobileNetV3-Large [19]. The two differ in the number of Bottleneck modules used (11 vs. 17). After trying both variants inside the final version of the Hand Pose Estimation (HPE) network, we found that, counterintuitively, the Large variant performed worse than the Small one; therefore, a modified MobileNetV3-Small architecture was used for general feature extraction.
Note that the bottleneck is not inverted (unlike in MobileNet).
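The distinction made in this note can be summarized by the channel ordering of the two block types: a classic (non-inverted) bottleneck compresses channels and then restores them, while MobileNetV2-style inverted residuals expand channels and then project them back. The sketch below is an illustrative aid (hypothetical channel counts and reduction/expansion factors, not the paper's code):

```python
def bottleneck_channels(c_in: int, reduction: int = 4) -> list:
    # Classic bottleneck (wide -> narrow -> wide):
    # compress channels, process, restore.
    return [c_in, c_in // reduction, c_in]

def inverted_bottleneck_channels(c_in: int, expansion: int = 6) -> list:
    # Inverted residual (narrow -> wide -> narrow):
    # expand channels, apply a depthwise convolution, project back.
    return [c_in, c_in * expansion, c_in]

print(bottleneck_channels(256))          # -> [256, 64, 256]
print(inverted_bottleneck_channels(24))  # -> [24, 144, 24]
```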
The error is computed in a leave-one-out fashion by averaging the errors of 9 models. Each model is trained on 8 subjects and tested on the ninth (the MSRA dataset consists of 9 subjects performing 17 gestures each). The average of all 9 results is reported in Table 1.
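The leave-one-out protocol above can be sketched in a few lines; the per-subject errors here are placeholder values (the real numbers come from the 9 trained models), only the averaging step mirrors the described procedure:

```python
# One held-out model per MSRA subject; errors in mm are hypothetical
# placeholders standing in for the 9 measured test errors.
per_subject_error_mm = [9.1, 10.3, 8.7, 9.9, 10.1, 9.4, 8.8, 10.6, 9.5]

assert len(per_subject_error_mm) == 9  # 9 subjects -> 9 folds

# The single reported figure is the mean over all 9 folds.
mean_error_mm = sum(per_subject_error_mm) / len(per_subject_error_mm)
print(f"reported error: {mean_error_mm:.2f} mm")  # -> 9.60 mm
```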
References
Chen, K.Y., Patel, S.N., Keller, S.: Finexus: Tracking precise motions of multiple fingertips using magnetic sensing. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI'16, pp. 1504–1514. Association for Computing Machinery, New York, NY, USA (2016). https://doi.org/10.1145/2858036.2858125
Chen, X., Wang, G., Guo, H., Zhang, C.: Pose guided structured region ensemble network for cascaded hand pose estimation. Neurocomputing 395, 138–149 (2020)
Chen, X., Wang, G., Zhang, C., Kim, T.K., Ji, X.: Shpr-net: Deep semantic hand pose regression from point clouds. IEEE Access 6, 43425–43439 (2018)
Chen, Y., Tu, Z., Ge, L., Zhang, D., Chen, R., Yuan, J.: So-handnet: Self-organizing network for 3d hand pose estimation with semi-supervised learning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 6961–6970 (2019)
Delamarre, Q., Faugeras, O.: 3d articulated models and multiview tracking with physical forces. Comput. Vis. Image Underst. 81(3), 328–357 (2001)
Deng, X., Yang, S., Zhang, Y., Tan, P., Chang, L., Wang, H.: Hand3d: Hand pose estimation using 3d neural network. arXiv preprint arXiv:1704.02224 (2017)
Ding, M., Fan, G.: Articulated gaussian kernel correlation for human pose estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 57–64 (2015)
Du, K., Lin, X., Sun, Y., Ma, X.: Crossinfonet: Multi-task information sharing based hand pose estimation. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 9888–9897 (2019). https://doi.org/10.1109/CVPR.2019.01013
Eger, S., Youssef, P., Gurevych, I.: Is it time to swish? comparing deep learning activation functions across nlp tasks. arXiv preprint arXiv:1901.02671 (2019)
Erol, A., Bebis, G., Nicolescu, M., Boyle, R.D., Twombly, X.: Vision-based hand pose estimation: a review. Comput. Vis. Image Understanding 108(1), 52–73 (2007)
Ge, L., Cai, Y., Weng, J., Yuan, J.: Hand pointnet: 3d hand pose estimation using point sets. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8417–8426 (2018)
Ge, L., Ren, Z., Yuan, J.: Point-to-point regression pointnet for 3d hand pose estimation. In: Proceedings of the European conference on computer vision (ECCV), pp. 475–491 (2018)
Glauser, O., Wu, S., Panozzo, D., Hilliges, O., Sorkine-Hornung, O.: Interactive hand pose estimation using a stretch-sensing soft glove. ACM Trans. Graph. 38(4) (2019). https://doi.org/10.1145/3306346.3322957
Google Inc.: Google Soli. https://atap.google.com/soli/
Guo, H., Wang, G., Chen, X., Zhang, C., Qiao, F., Yang, H.: Region ensemble network: Improving convolutional network for hand pose estimation. In: 2017 IEEE International conference on image processing (ICIP), pp. 4512–4516. IEEE (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR arXiv:1512.03385 (2015)
He, L., Wang, G., Liao, Q., Xue, J.H.: Depth-images-based pose estimation using regression forests and graphical models. Neurocomputing 164, 210–219 (2015)
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
Howard, A., Sandler, M., Chu, G., Chen, L., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., Le, Q.V., Adam, H.: Searching for mobilenetv3. CoRR arXiv:1905.02244 (2019)
Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR arXiv:1704.04861 (2017)
Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016)
Khamis, S., Taylor, J., Shotton, J., Keskin, C., Izadi, S., Fitzgibbon, A.: Learning an efficient model of hand shape variation from depth images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2540–2548 (2015)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp. 1097–1105 (2012)
Li, Y., Zhang, K., Cao, J., Timofte, R., Van Gool, L.: Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707 (2021)
Liu, G., Zhang, R., Wang, Y., Man, R.: Road scene recognition of forklift agv equipment based on deep learning. Processes 9(11), 1955 (2021)
Liu, S., Wang, G., Xie, P., Zhang, C.: Light and fast hand pose estimation from spatial-decomposed latent heatmap. IEEE Access 8, 53072–53081 (2020). https://doi.org/10.1109/ACCESS.2020.2979507
Looking Glass Factory, Inc.: The holographic display revolutionizing 3d work. https://lookingglassfactory.com/
Malireddi, S.R., Mueller, F., Oberweger, M., Bojja, A.K., Lepetit, V., Theobalt, C., Tagliasacchi, A.: Handseg: A dataset for hand segmentation from depth images. CoRR (2017). arXiv:1711.05944
Misra, D.: Mish: A self regularized non-monotonic neural activation function. (2019) arXiv:1908.08681
Moon, G., Chang, J.Y., Lee, K.M.: V2v-posenet: voxel-to-voxel prediction network for accurate 3d hand and human pose estimation from a single depth map. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 5079–5088 (2018)
Nie, X., Feng, J., Yan, S.: Mutual learning to adapt for joint human parsing and pose estimation. In: Proceedings of the European conference on computer vision (ECCV), pp. 502–517 (2018)
Oberweger, M., Lepetit, V.: Deepprior++: Improving fast and accurate 3d hand pose estimation. In: Proceedings of the IEEE international conference on computer vision workshops, pp. 585–594 (2017)
Park, E., Yoo, S.: Profit: a novel training method for sub-4-bit mobilenet models. In: European conference on computer vision, pp. 430–446. Springer (2020)
Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660 (2017)
Rad, M., Oberweger, M., Lepetit, V.: Feature mapping for learning fast and accurate 3d pose inference from synthetic images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4663–4672 (2018)
Ramachandran, P., Zoph, B., Le, Q.V.: Swish: a self-gated activation function. arXiv preprint arXiv:1710.05941 (2017)
Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. CoRR abs/1505.04597 (2015). arXiv:1505.04597
Ruder, S.: An overview of multi-task learning in deep neural networks. (2017). arXiv:1706.05098
Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L.: Inverted residuals and linear bottlenecks: mobile networks for classification, detection and segmentation. (2018). arXiv:1801.04381
Sridhar, S., Mueller, F., Oulasvirta, A., Theobalt, C.: Fast and robust hand tracking using detection-guided optimization. In: Proceedings of the ieee conference on computer vision and pattern recognition, pp. 3213–3221 (2015)
Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5693–5703 (2019)
Sun, X., Wei, Y., Liang, S., Tang, X., Sun, J.: Cascaded hand pose regression. In: 2015 IEEE Conference on computer vision and pattern recognition (CVPR), pp. 824–832 (2015). https://doi.org/10.1109/CVPR.2015.7298683
Sun, Y.: The role of activation function in image classification. In: 2021 International Conference on communications, information system and computer engineering (CISCE), pp. 275–278. IEEE (2021)
Supancic, J.S., Rogez, G., Yang, Y., Shotton, J., Ramanan, D.: Depth-based hand pose estimation: data, methods, and challenges. In: Proceedings of the IEEE international conference on computer vision, pp. 1868–1876 (2015)
Tang, D., Chang, H.J., Tejani, A., Kim, T.: Latent regression forest: Structured estimation of 3d articulated hand posture. In: 2014 IEEE Conference on computer vision and pattern recognition, pp. 3786–3793 (2014)
Tateno, S., Zhu, Y., Meng, F.: Hand gesture recognition system for in-car device control based on infrared array sensor. In: 2019 58th Annual conference of the society of instrument and control engineers of Japan (SICE), pp. 701–706 (2019)
Tompson, J., Stein, M., Lecun, Y., Perlin, K.: Real-time continuous pose recovery of human hands using convolutional networks. ACM Trans. Graph. 33(5), 169:1-169:10 (2014). https://doi.org/10.1145/2629500
Wan, C., Probst, T., Van Gool, L., Yao, A.: Dense 3d regression for hand pose estimation. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp. 5147–5156 (2018)
Wang, G., Chen, X., Guo, H., Zhang, C.: Region ensemble network: towards good practices for deep 3d hand pose estimation. J. Vis. Commun. Image Represent. 55, 404–414 (2018)
Xia, F., Wang, P., Chen, X., Yuille, A.L.: Joint multi-person pose estimation and semantic part segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6769–6778 (2017)
Xiong, F., Zhang, B., Xiao, Y., Cao, Z., Yu, T., Zhou, J.T., Yuan, J.: A2j: Anchor-to-joint regression network for 3d articulated pose estimation from a single depth image. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (2019)
Ye, M., Yang, R.: Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2345–2352 (2014)
Yuan, S., Garcia-Hernando, G., Stenger, B., Moon, G., Yong Chang, J., Mu Lee, K., Molchanov, P., Kautz, J., Honari, S., Ge, L., et al.: Depth-based 3d hand pose estimation: From current achievements to future goals. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2636–2645 (2018)
Zhang, Y., Chen, X.: Lightweight semantic segmentation algorithm based on mobilenetv3 network. In: 2020 International conference on intelligent computing, automation and systems (ICICAS), pp. 429–433. IEEE (2020)
Zhou, W., Jiang, X., Chen, X., Miao, S., Chen, C., Mei, S., Liu, Y.H.: Hmtnet: 3d hand pose estimation from single depth image based on hand morphological topology. IEEE Sens. J. 20(11), 6004–6011 (2020)
Zhou, X., Wan, Q., Zhang, W., Xue, X., Wei, Y.: Model-based deep hand pose estimation. arXiv preprint arXiv:1606.06854 (2016)
Zhou, Y., Lu, J., Du, K., Lin, X., Sun, Y., Ma, X.: Hbe: Hand branch ensemble network for real-time 3d hand pose estimation. In: Proceedings of the European conference on computer vision (ECCV), pp. 501–516 (2018)
Funding
This work was partially supported by SGS Grants No. SP2020/26, SP2021/88, and SP2022/81 of VŠB—Technical University of Ostrava, Czech Republic, and by the Ministry of Education, Youth and Sports of the Czech Republic through e-INFRA CZ (ID:90140).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Šimoník, M., Krumnikl, M. Optimized hand pose estimation CrossInfoNet-based architecture for embedded devices. Machine Vision and Applications 33, 78 (2022). https://doi.org/10.1007/s00138-022-01332-8
Received:
Revised:
Accepted:
Published:
DOI: https://doi.org/10.1007/s00138-022-01332-8