
Optimized hand pose estimation CrossInfoNet-based architecture for embedded devices

  • Original Paper
  • Published in: Machine Vision and Applications

Abstract

We present CrossInfoMobileNet, a hand pose estimation convolutional neural network based on CrossInfoNet, tuned specifically for mobile phone processors through the optimization, modification, and replacement of computationally critical CrossInfoNet components. By introducing the state-of-the-art MobileNetV3 network as the feature extractor and refiner, and by replacing the ReLU activation with the better-performing H-Swish activation function, we obtain a network that requires 2.37 times fewer multiply-add operations and 2.22 times fewer parameters than CrossInfoNet, while maintaining the same error on state-of-the-art datasets. This reduction of multiply-add operations translates into an average 1.56 times faster real-world performance on both desktop and mobile devices, making the network more suitable for embedded applications. The full source code of CrossInfoMobileNet, including the sample dataset and its evaluation, is available online through Code Ocean.
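The H-Swish activation mentioned above replaces the exponential-based Swish (x · sigmoid(x)) with a piecewise-linear form that is cheap to evaluate on mobile hardware. A minimal sketch of the function (illustrative only, not the authors' implementation):

```python
def relu6(x: float) -> float:
    # ReLU6 clamps the activation to the range [0, 6]
    return min(max(x, 0.0), 6.0)

def h_swish(x: float) -> float:
    # Hard Swish: x * ReLU6(x + 3) / 6, a piecewise-linear
    # approximation of Swish that avoids computing an exponential
    return x * relu6(x + 3.0) / 6.0
```

For large positive inputs the function approaches the identity (e.g. `h_swish(3.0)` is exactly `3.0`), while inputs below -3 are mapped to zero, mimicking the shape of Swish without transcendental operations.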


Notes

  1. Note that ResNet networks were introduced in 2015 and are considered too computationally demanding to run at above-real-time rates on mobile devices.

  2. There are two variants of MobileNetV3: MobileNetV3-Small and MobileNetV3-Large [19]. The two differ in the number of Bottleneck modules used (11 vs. 17). After trying both variants in the final version of the Hand Pose Estimation (HPE) network, it was found that, counterintuitively, the Large variant performed worse than the Small one; therefore, a modified MobileNetV3-Small architecture was used for general feature extraction.

  3. Note that the bottleneck is not inverted (unlike in MobileNet).

  4. The error is computed in leave-one-out fashion by averaging the errors of nine models: each model is trained on eight subjects and tested on the ninth (the MSRA dataset consists of 9 subjects performing 17 gestures each). The average of all nine results is reported in Table 1.
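The subject-wise leave-one-out protocol above reduces to a simple average over per-fold errors. A minimal sketch, where the per-subject error values are hypothetical placeholders (the real values come from training on the MSRA dataset):

```python
def leave_one_out_error(per_subject_errors: list[float]) -> float:
    """Average the per-fold test errors of subject-wise cross-validation.

    per_subject_errors[i] is the mean joint error (e.g. in mm) of the
    model trained on the remaining subjects and tested on subject i.
    """
    return sum(per_subject_errors) / len(per_subject_errors)

# Hypothetical per-fold errors for the 9 MSRA subjects (illustrative only)
fold_errors = [7.9, 8.1, 7.6, 8.4, 7.8, 8.0, 7.7, 8.2, 7.9]
reported_error = leave_one_out_error(fold_errors)
```

Each fold holds out one complete subject, so the averaged figure estimates how the network generalizes to an unseen hand rather than to unseen frames of a known hand.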

References

  1. Chen, K.Y., Patel, S.N., Keller, S.: Finexus: Tracking precise motions of multiple fingertips using magnetic sensing. In: Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI’16, p. 1504-1514. Association for Computing Machinery, New York, NY, USA (2016). https://doi.org/10.1145/2858036.2858125

  2. Chen, X., Wang, G., Guo, H., Zhang, C.: Pose guided structured region ensemble network for cascaded hand pose estimation. Neurocomputing 395, 138–149 (2020)


  3. Chen, X., Wang, G., Zhang, C., Kim, T.K., Ji, X.: Shpr-net: Deep semantic hand pose regression from point clouds. IEEE Access 6, 43425–43439 (2018)


  4. Chen, Y., Tu, Z., Ge, L., Zhang, D., Chen, R., Yuan, J.: So-handnet: Self-organizing network for 3d hand pose estimation with semi-supervised learning. In: Proceedings of the IEEE/CVF international conference on computer vision, pp. 6961–6970 (2019)

  5. Delamarre, Q., Faugeras, O.: 3d articulated models and multiview tracking with physical forces. Comput. Vis. Image Underst. 81(3), 328–357 (2001)


  6. Deng, X., Yang, S., Zhang, Y., Tan, P., Chang, L., Wang, H.: Hand3d: Hand pose estimation using 3d neural network. arXiv preprint arXiv:1704.02224 (2017)

  7. Ding, M., Fan, G.: Articulated gaussian kernel correlation for human pose estimation. In: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 57–64 (2015)

  8. Du, K., Lin, X., Sun, Y., Ma, X.: Crossinfonet: Multi-task information sharing based hand pose estimation. In: 2019 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 9888–9897 (2019). https://doi.org/10.1109/CVPR.2019.01013

  9. Eger, S., Youssef, P., Gurevych, I.: Is it time to swish? comparing deep learning activation functions across nlp tasks. arXiv preprint arXiv:1901.02671 (2019)

  10. Erol, A., Bebis, G., Nicolescu, M., Boyle, R.D., Twombly, X.: Vision-based hand pose estimation: a review. Comput. Vis. Image Understanding 108(1), 52–73 (2007)


  11. Ge, L., Cai, Y., Weng, J., Yuan, J.: Hand pointnet: 3d hand pose estimation using point sets. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 8417–8426 (2018)

  12. Ge, L., Ren, Z., Yuan, J.: Point-to-point regression pointnet for 3d hand pose estimation. In: Proceedings of the European conference on computer vision (ECCV), pp. 475–491 (2018)

  13. Glauser, O., Wu, S., Panozzo, D., Hilliges, O., Sorkine-Hornung, O.: Interactive hand pose estimation using a stretch-sensing soft glove. ACM Trans. Graph. (2019). https://doi.org/10.1145/3306346.3322957


  14. Google, Inc.: Google Soli. https://atap.google.com/soli/

  15. Guo, H., Wang, G., Chen, X., Zhang, C., Qiao, F., Yang, H.: Region ensemble network: Improving convolutional network for hand pose estimation. In: 2017 IEEE International conference on image processing (ICIP), pp. 4512–4516. IEEE (2017)

  16. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. CoRR arXiv:1512.03385

  17. He, L., Wang, G., Liao, Q., Xue, J.H.: Depth-images-based pose estimation using regression forests and graphical models. Neurocomputing 164, 210–219 (2015)


  18. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

  19. Howard, A., Sandler, M., Chu, G., Chen, L., Chen, B., Tan, M., Wang, W., Zhu, Y., Pang, R., Vasudevan, V., Le, Q.V., Adam, H.: Searching for mobilenetv3. CoRR arXiv:1905.02244

  20. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: Mobilenets: Efficient convolutional neural networks for mobile vision applications. CoRR arXiv:1704.04861

  21. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360 (2016)

  22. Khamis, S., Taylor, J., Shotton, J., Keskin, C., Izadi, S., Fitzgibbon, A.: Learning an efficient model of hand shape variation from depth images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2540–2548 (2015)

  23. Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: Advances in neural information processing systems, pp. 1097–1105 (2012)

  24. Li, Y., Zhang, K., Cao, J., Timofte, R., Van Gool, L.: Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707 (2021)

  25. Liu, G., Zhang, R., Wang, Y., Man, R.: Road scene recognition of forklift agv equipment based on deep learning. Processes 9(11), 1955 (2021)


  26. Liu, S., Wang, G., Xie, P., Zhang, C.: Light and fast hand pose estimation from spatial-decomposed latent heatmap. IEEE Access 8, 53072–53081 (2020). https://doi.org/10.1109/ACCESS.2020.2979507


  27. Looking Glass Factory, Inc.: The holographic display revolutionizing 3D work. https://lookingglassfactory.com/

  28. Malireddi, S.R., Mueller, F., Oberweger, M., Bojja, A.K., Lepetit, V., Theobalt, C., Tagliasacchi, A.: Handseg: A dataset for hand segmentation from depth images. CoRR (2017). arXiv:1711.05944

  29. Misra, D.: Mish: A self regularized non-monotonic neural activation function. (2019) arXiv:1908.08681

  30. Moon, G., Chang, J.Y., Lee, K.M.: V2v-posenet: voxel-to-voxel prediction network for accurate 3d hand and human pose estimation from a single depth map. In: Proceedings of the IEEE conference on computer vision and pattern Recognition, pp. 5079–5088 (2018)

  31. Nie, X., Feng, J., Yan, S.: Mutual learning to adapt for joint human parsing and pose estimation. In: Proceedings of the European conference on computer vision (ECCV), pp. 502–517 (2018)

  32. Oberweger, M., Lepetit, V.: Deepprior++: Improving fast and accurate 3d hand pose estimation. In: Proceedings of the IEEE international conference on computer vision workshops, pp. 585–594 (2017)

  33. Park, E., Yoo, S.: Profit: a novel training method for sub-4-bit mobilenet models. In: European conference on computer vision, pp. 430–446. Springer (2020)

  34. Qi, C.R., Su, H., Mo, K., Guibas, L.J.: Pointnet: Deep learning on point sets for 3d classification and segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 652–660 (2017)

  35. Rad, M., Oberweger, M., Lepetit, V.: Feature mapping for learning fast and accurate 3d pose inference from synthetic images. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4663–4672 (2018)

  36. Ramachandran, P., Zoph, B., Le, Q.V.: Swish: a self-gated activation function. arXiv preprint arXiv:1710.05941 (2017)

  37. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. CoRR abs/1505.04597 (2015). arXiv:1505.04597

  38. Ruder, S.: An overview of multi-task learning in deep neural networks. (2017). arXiv:1706.05098

  39. Sandler, M., Howard, A.G., Zhu, M., Zhmoginov, A., Chen, L.: Inverted residuals and linear bottlenecks: mobile networks for classification, detection and segmentation. (2018). arXiv:1801.04381

  40. Sridhar, S., Mueller, F., Oulasvirta, A., Theobalt, C.: Fast and robust hand tracking using detection-guided optimization. In: Proceedings of the ieee conference on computer vision and pattern recognition, pp. 3213–3221 (2015)

  41. Sun, K., Xiao, B., Liu, D., Wang, J.: Deep high-resolution representation learning for human pose estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 5693–5703 (2019)

  42. Sun, X., Wei, Y., Liang, S., Tang, X., Sun, J.: Cascaded hand pose regression. In: 2015 IEEE Conference on computer vision and pattern recognition (CVPR), pp. 824–832 (2015). https://doi.org/10.1109/CVPR.2015.7298683

  43. Sun, Y.: The role of activation function in image classification. In: 2021 International Conference on communications, information system and computer engineering (CISCE), pp. 275–278. IEEE (2021)

  44. Supancic, J.S., Rogez, G., Yang, Y., Shotton, J., Ramanan, D.: Depth-based hand pose estimation: data, methods, and challenges. In: Proceedings of the IEEE international conference on computer vision, pp. 1868–1876 (2015)

  45. Tang, D., Chang, H.J., Tejani, A., Kim, T.: Latent regression forest: Structured estimation of 3d articulated hand posture. In: 2014 IEEE Conference on computer vision and pattern recognition, pp. 3786–3793 (2014)

  46. Tateno, S., Zhu, Y., Meng, F.: Hand gesture recognition system for in-car device control based on infrared array sensor. In: 2019 58th Annual conference of the society of instrument and control engineers of Japan (SICE), pp. 701–706 (2019)

  47. Tompson, J., Stein, M., Lecun, Y., Perlin, K.: Real-time continuous pose recovery of human hands using convolutional networks. ACM Trans. Graph. 33(5), 169:1-169:10 (2014). https://doi.org/10.1145/2629500


  48. Wan, C., Probst, T., Van Gool, L., Yao, A.: Dense 3d regression for hand pose estimation. In: Proceedings of the IEEE Conference on computer vision and pattern recognition, pp. 5147–5156 (2018)

  49. Wang, G., Chen, X., Guo, H., Zhang, C.: Region ensemble network: towards good practices for deep 3d hand pose estimation. J. Vis. Commun. Image Represent. 55, 404–414 (2018)


  50. Xia, F., Wang, P., Chen, X., Yuille, A.L.: Joint multi-person pose estimation and semantic part segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6769–6778 (2017)

  51. Xiong, F., Zhang, B., Xiao, Y., Cao, Z., Yu, T., Zhou, J.T., Yuan, J.: A2j: Anchor-to-joint regression network for 3d articulated pose estimation from a single depth image. In: Proceedings of the IEEE/CVF international conference on computer vision (ICCV) (2019)

  52. Ye, M., Yang, R.: Real-time simultaneous pose and shape estimation for articulated objects using a single depth camera. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2345–2352 (2014)

  53. Yuan, S., Garcia-Hernando, G., Stenger, B., Moon, G., Yong Chang, J., Mu Lee, K., Molchanov, P., Kautz, J., Honari, S., Ge, L., et al.: Depth-based 3d hand pose estimation: From current achievements to future goals. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2636–2645 (2018)

  54. Zhang, Y., Chen, X.: Lightweight semantic segmentation algorithm based on mobilenetv3 network. In: 2020 International conference on intelligent computing, automation and systems (ICICAS), pp. 429–433. IEEE (2020)

  55. Zhou, W., Jiang, X., Chen, X., Miao, S., Chen, C., Mei, S., Liu, Y.H.: Hmtnet: 3d hand pose estimation from single depth image based on hand morphological topology. IEEE Sens. J. 20(11), 6004–6011 (2020)


  56. Zhou, X., Wan, Q., Zhang, W., Xue, X., Wei, Y.: Model-based deep hand pose estimation. arXiv preprint arXiv:1606.06854 (2016)

  57. Zhou, Y., Lu, J., Du, K., Lin, X., Sun, Y., Ma, X.: Hbe: Hand branch ensemble network for real-time 3d hand pose estimation. In: Proceedings of the European conference on computer vision (ECCV), pp. 501–516 (2018)


Funding

This work is partially supported by Grant of SGS No. SP2020/26, SP2021/88, and SP2022/81 VŠB—Technical University of Ostrava, Czech Republic and also supported by the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID:90140).

Author information


Corresponding author

Correspondence to Michal Krumnikl.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Šimoník, M., Krumnikl, M. Optimized hand pose estimation CrossInfoNet-based architecture for embedded devices. Machine Vision and Applications 33, 78 (2022). https://doi.org/10.1007/s00138-022-01332-8
