Abstract
We present CrossInfoMobileNet, a hand pose estimation convolutional neural network based on CrossInfoNet and specifically tuned to mobile phone processors through the optimization, modification, and replacement of computationally critical CrossInfoNet components. By introducing the state-of-the-art MobileNetV3 network as a feature extractor and refiner, and by replacing the ReLU activation with the better-performing H-Swish activation function, we obtain a network that requires 2.37 times fewer multiply-add operations and 2.22 times fewer parameters than CrossInfoNet, while maintaining the same error on the state-of-the-art datasets. This reduction in multiply-add operations translates into, on average, 1.56 times faster real-world performance on both desktop and mobile devices, making the network more suitable for embedded applications. The full source code of CrossInfoMobileNet, including the sample dataset and its evaluation, is available online through Code Ocean.
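The H-Swish activation mentioned in the abstract has a standard closed form, x · ReLU6(x + 3) / 6. The following minimal Python sketch (illustrative only, not taken from the paper's code base) shows why it is cheap on mobile hardware: unlike Swish (x · sigmoid(x)), it needs no exponential, only clamping and multiplication.

```python
def relu6(x: float) -> float:
    # ReLU capped at 6, as used in mobile-oriented networks.
    return min(max(x, 0.0), 6.0)

def h_swish(x: float) -> float:
    # Hard-Swish: x * ReLU6(x + 3) / 6, a piecewise approximation
    # of Swish that avoids computing an exponential.
    return x * relu6(x + 3.0) / 6.0

print(h_swish(-4.0))  # -> 0.0   (saturated negative region)
print(h_swish(4.0))   # -> 4.0   (identity region for x >= 3)
```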
Notes
Note that ResNet networks, introduced in 2015, are considered too computationally demanding to run at above-real-time rates on mobile devices.
There are two variants of MobileNetV3: MobileNetV3-Small and MobileNetV3-Large [19]. The two differ in the number of Bottleneck modules used (11 vs. 17). After trying both variants inside the final version of the Hand Pose Estimation (HPE) network, we found that, counterintuitively, the Large variant performed worse than the Small one; therefore, a modified MobileNetV3-Small architecture was used for general feature extraction.
Note that the bottleneck is not inverted (unlike in MobileNet).
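The distinction made in this note can be summarized by the channel ordering of the two block types: a classic (non-inverted) bottleneck compresses channels and then restores them, while MobileNetV2-style inverted residuals expand channels and then project them back. The sketch below is an illustrative aid (hypothetical channel counts and reduction/expansion factors, not the paper's code):

```python
def bottleneck_channels(c_in: int, reduction: int = 4) -> list:
    # Classic bottleneck (wide -> narrow -> wide):
    # compress channels, process, restore.
    return [c_in, c_in // reduction, c_in]

def inverted_bottleneck_channels(c_in: int, expansion: int = 6) -> list:
    # Inverted residual (narrow -> wide -> narrow):
    # expand channels, apply a depthwise convolution, project back.
    return [c_in, c_in * expansion, c_in]

print(bottleneck_channels(256))          # -> [256, 64, 256]
print(inverted_bottleneck_channels(24))  # -> [24, 144, 24]
```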
The error is computed in a leave-one-out fashion by averaging the errors of 9 models. Each model is trained on 8 subjects and tested on the ninth (the MSRA dataset consists of 9 subjects performing 17 gestures each). The average of all 9 results is reported in Table 1.
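The leave-one-out protocol above can be sketched in a few lines; the per-subject errors here are placeholder values (the real numbers come from the 9 trained models), only the averaging step mirrors the described procedure:

```python
# One held-out model per MSRA subject; errors in mm are hypothetical
# placeholders standing in for the 9 measured test errors.
per_subject_error_mm = [9.1, 10.3, 8.7, 9.9, 10.1, 9.4, 8.8, 10.6, 9.5]

assert len(per_subject_error_mm) == 9  # 9 subjects -> 9 folds

# The single reported figure is the mean over all 9 folds.
mean_error_mm = sum(per_subject_error_mm) / len(per_subject_error_mm)
print(f"reported error: {mean_error_mm:.2f} mm")  # -> 9.60 mm
```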
References
Chen, K.Y., Patel, S.N., Keller, S.: Finexus: Tracking precise motions of multiple fingertips using magnetic sensing. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI'16, pp. 1504–1514. Association for Computing Machinery, New York, NY, USA (2016). https://doi.org/10.1145/2858036.2858125
Chen, X., Wang, G., Guo, H., Zhang, C.: Pose guided structured region ensemble network for cascaded hand pose estimation. Neurocomputing 395, 138–149 (2020)
Chen, X., Wang, G., Zhang, C., Kim, T.K., Ji, X.: Shpr-net: Deep semantic hand pose regression from point clouds. IEEE Access 6, 43425–43439 (2018)
Chen, Y., Tu, Z., Ge, L., Zhang, D., Chen, R., Yuan, J.: So-handnet: Self-organizing network for 3d hand pose estimation with semi-supervised learning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 6961–6970 (2019)
Delamarre, Q., Faugeras, O.: 3d articulated models and multiview tracking with physical forces. Comput. Vis. Image Underst. 81(3), 328–357 (2001)
Deng, X., Yang, S., Zhang, Y., Tan, P., Chang, L., Wang, H.: Hand3d: Hand pose estimation using 3d neural network. arXiv preprint arXiv:1704.02224 (2017)
Ding, M., Fan, G.: Articulated gaussian kernel correlation for human pose estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 57–64 (2015)
Du, K., Lin, X., Sun, Y., Ma, X.: Crossinfonet: Multi-task information sharing based hand pose estimation. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 9888–9897 (2019). https://doi.org/10.1109/CVPR.2019.01013
Eger, S., Youssef, P., Gurevych, I.: Is it time to swish? comparing deep learning activation functions across nlp tasks. arXiv preprint arXiv:1901.02671 (2019)
Erol, A., Bebis, G., Nicolescu, M., Boyle, R.D., Twombly, X.: Vision-based hand pose estimation: a review. Comput. Vis. Image Understanding 108(1), 52–73 (2007)
Ge, L., Cai, Y., Weng, J., Yuan, J.: Hand pointnet: 3d hand pose estimation using point sets. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8417–8426 (2018)
Ge, L., Ren, Z., Yuan, J.: Point-to-point regression pointnet for 3d hand pose estimation. In: Proceedings of the European conference on computer vision (ECCV), pp. 475–491 (2018)
Glauser, O., Wu, S., Panozzo, D., Hilliges, O., Sorkine-Hornung, O.: Interactive hand pose estimation using a stretch-sensing soft glove. ACM Trans. Graph. 38(4) (2019). https://doi.org/10.1145/3306346.3322957
Google Inc.: Google Soli. https://atap.google.com/soli/
Guo, H., Wang, G., Chen, X., Zhang, C., Qiao, F., Yang, H.: Region ensemble network: Improving convolutional network for hand pose estimation. In: 2017 IEEE International conference on image processing (ICIP), pp. 4512–4516. IEEE (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR arXiv:1512.03385 (2015)
He, L., Wang, G., Liao, Q., Xue, J.H.: Depth-images-based pose estimation using regression forests and graphical models. Neurocomputing 164, 210–219 (2015)
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
Howard, A., Sandler, M., Chu, G., Chen, L., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., Le, Q.V., Adam, H.: Searching for mobilenetv3. CoRR arXiv:1905.02244 (2019)
Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR arXiv:1704.04861 (2017)
Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016)
Khamis, S., Taylor, J., Shotton, J., Keskin, C., Izadi, S., Fitzgibbon, A.: Learning an efficient model of hand shape variation from depth images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2540–2548 (2015)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp. 1097–1105 (2012)
Li, Y., Zhang, K., Cao, J., Timofte, R., Van Gool, L.: Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707 (2021)
Liu, G., Zhang, R., Wang, Y., Man, R.: Road scene recognition of forklift agv equipment based on deep learning. Processes 9(11), 1955 (2021)
Liu, S., Wang, G., Xie, P., Zhang, C.: Light and fast hand pose estimation from spatial-decomposed latent heatmap. IEEE Access 8, 53072–53081 (2020). https://doi.org/10.1109/ACCESS.2020.2979507
Looking Glass Factory, Inc.: The holographic display revolutionizing 3d work. https://lookingglassfactory.com/
Malireddi, S.R., Mueller, F., Oberweger, M., Bojja, A.K., Lepetit, V., Theobalt, C., Tagliasacchi, A.: Handseg: A dataset for hand segmentation from depth images. CoRR (2017). arXiv:1711.05944
Misra, D.: Mish: A self regularized non-monotonic neural activation function. (2019) arXiv:1908.08681
Moon, G., Chang, J.Y., Lee, K.M.: V2v-posenet: voxel-to-voxel prediction network for accurate 3d hand and human pose estimation from a single depth map. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 5079–5088 (2018)
Nie, X., Feng, J., Yan, S.: Mutual learning to adapt for joint human parsing and pose estimation. In: Proceedings of the European conference on computer vision (ECCV), pp. 502–517 (2018)
Oberweger, M., Lepetit, V.: Deepprior++: Improving fast and accurate 3d hand pose estimation. In: Proceedings of the IEEE international conference on computer vision workshops, pp. 585–594 (2017)
Park, E., Yoo, S.: Profit: a novel training method for sub-4-bit mobilenet models. In: European conference on computer vision, pp. 430–446. Springer (2020)
Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660 (2017)
Rad, M., Oberweger, M., Lepetit, V.: Feature mapping for learning fast and accurate 3d pose inference from synthetic images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4663–4672 (2018)
Ramachandran, P., Zoph, B., Le, Q.V.: Swish: a self-gated activation function. arXiv preprint arXiv:1710.05941 (2017)
Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. CoRR abs/1505.04597 (2015). arXiv:1505.04597
Ruder, S.: An overview of multi-task learning in deep neural networks. (2017). arXiv:1706.05098
Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L.: Inverted residuals and linear bottlenecks: mobile networks for classification, detection and segmentation. (2018). arXiv:1801.04381
Sridhar, S., Mueller, F., Oulasvirta, A., Theobalt, C.: Fast and robust hand tracking using detection-guided optimization. In: Proceedings of the ieee conference on computer vision and pattern recognition, pp. 3213–3221 (2015)
Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5693–5703 (2019)
Sun, X., Wei, Y., Liang, S., Tang, X., Sun, J.: Cascaded hand pose regression. In: 2015 IEEE Conference on computer vision and pattern recognition (CVPR), pp. 824–832 (2015). https://doi.org/10.1109/CVPR.2015.7298683
Sun, Y.: The role of activation function in image classification. In: 2021 International Conference on communications, information system and computer engineering (CISCE), pp. 275–278. IEEE (2021)
Supancic, J.S., Rogez, G., Yang, Y., Shotton, J., Ramanan, D.: Depth-based hand pose estimation: data, methods, and challenges. In: Proceedings of the IEEE international conference on computer vision, pp. 1868–1876 (2015)
Tang, D., Chang, H.J., Tejani, A., Kim, T.: Latent regression forest: Structured estimation of 3d articulated hand posture. In: 2014 IEEE Conference on computer vision and pattern recognition, pp. 3786–3793 (2014)
Tateno, S., Zhu, Y., Meng, F.: Hand gesture recognition system for in-car device control based on infrared array sensor. In: 2019 58th Annual conference of the society of instrument and control engineers of Japan (SICE), pp. 701–706 (2019)
Tompson, J., Stein, M., Lecun, Y., Perlin, K.: Real-time continuous pose recovery of human hands using convolutional networks. ACM Trans. Graph. 33(5), 169:1-169:10 (2014). https://doi.org/10.1145/2629500
Wan, C., Probst, T., Van Gool, L., Yao, A.: Dense 3d regression for hand pose estimation. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp. 5147–5156 (2018)
Wang, G., Chen, X., Guo, H., Zhang, C.: Region ensemble network: towards good practices for deep 3d hand pose estimation. J. Vis. Commun. Image Represent. 55, 404–414 (2018)
Xia, F., Wang, P., Chen, X., Yuille, A.L.: Joint multi-person pose estimation and semantic part segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6769–6778 (2017)
Xiong, F., Zhang, B., Xiao, Y., Cao, Z., Yu, T., Zhou, J.T., Yuan, J.: A2j: Anchor-to-joint regression network for 3d articulated pose estimation from a single depth image. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (2019)
Ye, M., Yang, R.: Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2345–2352 (2014)
Yuan, S., Garcia-Hernando, G., Stenger, B., Moon, G., Yong Chang, J., Mu Lee, K., Molchanov, P., Kautz, J., Honari, S., Ge, L., et al.: Depth-based 3d hand pose estimation: From current achievements to future goals. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2636–2645 (2018)
Zhang, Y., Chen, X.: Lightweight semantic segmentation algorithm based on mobilenetv3 network. In: 2020 International conference on intelligent computing, automation and systems (ICICAS), pp. 429–433. IEEE (2020)
Zhou, W., Jiang, X., Chen, X., Miao, S., Chen, C., Mei, S., Liu, Y.H.: Hmtnet: 3d hand pose estimation from single depth image based on hand morphological topology. IEEE Sens. J. 20(11), 6004–6011 (2020)
Zhou, X., Wan, Q., Zhang, W., Xue, X., Wei, Y.: Model-based deep hand pose estimation. arXiv preprint arXiv:1606.06854 (2016)
Zhou, Y., Lu, J., Du, K., Lin, X., Sun, Y., Ma, X.: Hbe: Hand branch ensemble network for real-time 3d hand pose estimation. In: Proceedings of the European conference on computer vision (ECCV), pp. 501–516 (2018)
Funding
This work was partially supported by SGS Grants No. SP2020/26, SP2021/88, and SP2022/81 of VŠB—Technical University of Ostrava, Czech Republic, and by the Ministry of Education, Youth and Sports of the Czech Republic through e-INFRA CZ (ID:90140).
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Šimoník, M., Krumnikl, M. Optimized hand pose estimation CrossInfoNet-based architecture for embedded devices. Machine Vision and Applications 33, 78 (2022). https://doi.org/10.1007/s00138-022-01332-8
Received:
Revised:
Accepted:
Published:
DOI: https://doi.org/10.1007/s00138-022-01332-8