
A lightweight convolutional neural network for pose estimation of a planar model

  • Original Paper
  • Machine Vision and Applications

Abstract

The 3D pose estimation problem consists of calculating the position and orientation of a three-dimensional object, relative to a given reference frame, from its projection onto a two-dimensional image. In recent years, convolutional neural networks (CNNs) have achieved impressive results on several traditional computer vision problems, including 3D pose estimation. In general, the CNNs employed contain convolutional and fully connected layers with many neurons and trainable parameters; that is, they are heavyweight architectures. Such models are difficult to train, consume large amounts of memory, and, as the number of trainable parameters grows, tend to overfit. In this work, we present a lightweight CNN called Pose Network with Spatial Pyramid Pooling (PNSPP), capable of estimating the six-degree-of-freedom pose of a planar model from a single RGB image. Inspired by PoseNet, our CNN employs almost the same architecture but contains 4X fewer parameters (for a chosen image size) thanks to its optimized regression layers. In all tests, PNSPP outperformed PoseNet in pose prediction: the overall relative improvements ranged from 24 to 40% for the estimated position errors and from 9 to 33% for the orientation errors. Other performance metrics, such as RMSE and ADD, also favored PNSPP. Finally, we propose a method that estimates the scale factor \(\beta \) used in the pose error functions to balance the contributions of the position and orientation terms. Unlike other approaches that perform potentially expensive grid or random searches, our method uses simple heuristics to adjust this value as the neural network training progresses. At the end of each experiment, the estimated \(\beta \) values deviated by roughly ±10% from the optimal values, which seems reasonable given the computational cost of more exhaustive searches.
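As a rough illustration of the idea named in the abstract, the sketch below shows how a spatial-pyramid-pooling (SPP) head can compress a backbone's final feature map into a short, fixed-length descriptor before the pose regressors, instead of flattening the whole map into the wide fully connected layer that dominates PoseNet's parameter count. It is a minimal TensorFlow/Keras sketch; the toy backbone, the pyramid levels (1, 2, 4), and the head sizes are illustrative assumptions, not the authors' PNSPP configuration.

```python
import math

import tensorflow as tf
from tensorflow.keras import layers


def spp(feature_map, levels=(1, 2, 4)):
    """Pool the feature map into an n x n grid per pyramid level, then concatenate."""
    h, w = feature_map.shape[1], feature_map.shape[2]
    pooled = []
    for n in levels:
        ph, pw = math.ceil(h / n), math.ceil(w / n)  # bin size yielding an n x n grid
        p = layers.MaxPooling2D(pool_size=(ph, pw), strides=(ph, pw),
                                padding="same")(feature_map)
        pooled.append(layers.Flatten()(p))
    # Output length depends only on the levels and channel count, not on h and w.
    return layers.Concatenate()(pooled)


inputs = tf.keras.Input(shape=(224, 224, 3))
x = inputs
for filters in (32, 64, 128, 256):  # toy backbone standing in for the GoogLeNet-style trunk
    x = layers.Conv2D(filters, 3, strides=2, padding="same", activation="relu")(x)
# x is now a 14 x 14 x 256 feature map.
descriptor = spp(x)  # (1 + 4 + 16) * 256 = 5376 values
position = layers.Dense(3, name="position")(descriptor)        # t = (x, y, z)
orientation = layers.Dense(4, name="orientation")(descriptor)  # quaternion, normalized at use
model = tf.keras.Model(inputs, [position, orientation])
```

For comparison, flattening the same 14 x 14 x 256 map into a PoseNet-style fully connected head would feed 50,176 values into the regressors instead of 5,376, which illustrates where parameter savings of the reported magnitude can come from.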


Notes

  1. Publicly available code at https://github.com/kentsommer/keras-posenet/blob/master/posenet.py.

Abbreviations

2D: Two-dimensional
3D: Three-dimensional
CNN: Convolutional neural network
RGB: Red-Green-Blue
DOF: Degree of freedom
PnP: Perspective-n-point
MP: Max pooling
AP: Average pooling
CONV: Convolutional
FC: Fully connected
GPU: Graphics processing unit
DO: Dropout
GAP: Global average pooling
CNN-GAP: Concatenation of a global average pooling layer to a convolutional neural network
CNN-FC: Concatenation of a fully connected layer to a convolutional neural network
ResNet: Residual neural network
SPP: Spatial pyramid pooling
SIFT: Scale-invariant feature transform
SURF: Speeded-up robust features
ORB: Oriented FAST and rotated BRIEF
LINE-2D: Linearizing 2D
BOLD: Bunch of lines descriptor
SSD: Single shot detector
LSTM: Long short-term memory
INC: Inception
LRN: Local response normalization
BN: Batch normalization
DSC: Depth-wise separable convolutions
Concat: Concatenation
SE: Special Euclidean group
ADD: Average distance metric
RMSE: Root-mean-squared error
IQR: Interquartile range
Q: Quartile
\(\beta \): Scale factor that balances the contribution of the rotation term against the position term
P: Camera matrix for 3D–2D mapping
\(\lambda \): Factor that indicates the invariance of the projection to uniform scale changes
K: Calibration matrix that relates the camera coordinates with the image coordinates
[R|t]: Rigid transformation that relates the world coordinates with the camera coordinates
R: Rotation matrix
t: Position vector
\((u_0, v_0)\): Intersection of the optical axis and the image plane (principal point)
\(\alpha _u\): Scale factor of the image axis u
\(\alpha _v\): Scale factor of the image axis v
\(\{X, \mu \}\): Set of matching 3D–2D points
T: Tensor
\(T_{in}\): Input tensor
F: Set of filters
\(F_j\): j-th filter
\(\delta _t\): Position metric distance
\(\delta _\theta \): Rotation metric distance
q: Unit quaternion
\(\mathscr {L}_1\): Loss function one
\(\mathscr {L}_2\): Loss function two
\(\xi _{t}\): Position error
\(\xi _{q}\): Orientation error
r: Error ratio
\(vl_i\): i-th validation loss
\(vl_b\): Best validation loss
\({\mathcal {T}}\): Training dataset
s: Stability threshold
\(m_i\): i-th stability margin
f: Proportionality factor
\(\phi \): Activation function
\(n_e\): Number of epochs
\(\mathscr {M}_1\): Planar object one
\(\mathscr {M}_2\): Planar object two
\(\mathscr {D}_1\): Image dataset from \(\mathscr {M}_1\)
\(\mathscr {D}_2\): Image dataset from \(\mathscr {M}_2\)
lr: Learning rate
\(H_0\): Null hypothesis
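For orientation, the symbols above assemble into the standard pinhole projection model (Hartley and Zisserman [21]), and \(\beta \) plays the same balancing role as in the PoseNet loss of Kendall et al. [37]. The block below is a hedged reconstruction from those sources, not a transcription of the paper's \(\mathscr {L}_1\) and \(\mathscr {L}_2\), whose exact definitions appear in the full article; the ADD metric follows Hinterstoisser et al. [30].

```latex
% Pinhole projection: a world point X maps to an image point \mu up to scale \lambda.
\lambda \mu = P X = K \, [R \mid t] \, X,
\qquad
K = \begin{pmatrix} \alpha_u & 0 & u_0 \\ 0 & \alpha_v & v_0 \\ 0 & 0 & 1 \end{pmatrix}

% PoseNet-style loss balanced by \beta (assumed form, after Kendall et al. [37]):
\mathscr{L} = \underbrace{\lVert \hat{t} - t \rVert_2}_{\text{position term}}
            + \beta \, \underbrace{\Bigl\lVert \hat{q} - \tfrac{q}{\lVert q \rVert} \Bigr\rVert_2}_{\text{orientation term}}

% ADD: mean distance between model points under the true and estimated poses [30].
\mathrm{ADD} = \frac{1}{|X|} \sum_{x \in X} \bigl\lVert (R x + t) - (\hat{R} x + \hat{t}) \bigr\rVert_2
```

The abstract describes the \(\beta \)-estimation method only as simple heuristics applied while training progresses, so the callback below is one plausible, purely illustrative reading: after each epoch it compares the position and orientation contributions to the validation loss and nudges \(\beta \) toward the value that balances them. The log keys and the definition of the error ratio r are assumptions, not the authors' rule.

```python
import tensorflow as tf


class BetaBalancer(tf.keras.callbacks.Callback):
    """Hypothetical sketch: rebalance beta from validation errors after each epoch."""

    def __init__(self, beta: tf.Variable, f: float = 0.1):
        super().__init__()
        self.beta = beta  # tf.Variable read inside the pose loss
        self.f = f        # proportionality factor (assumed role of the paper's f)

    def on_epoch_end(self, epoch, logs=None):
        xi_t = logs["val_position_loss"]     # stands in for the position error xi_t
        xi_q = logs["val_orientation_loss"]  # stands in for the orientation error xi_q
        r = xi_t / (self.beta.numpy() * xi_q)  # error ratio r (assumed definition)
        # Move beta a fraction f of the way toward the value that balances both terms.
        self.beta.assign(self.beta * (1.0 + self.f * (r - 1.0)))
```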

References

  1. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G.S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mané, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viégas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., Zheng, X.: TensorFlow: Large-scale machine learning on heterogeneous systems (2015). http://tensorflow.org/. Software available from tensorflow.org

  2. Alvarez, J., Petersson, L.: DecomposeMe: simplifying ConvNets for end-to-end learning. arXiv preprint arXiv:1606.05426 (2016)

  3. Ansar, A., Daniilidis, K.: Linear pose estimation from points or lines. IEEE Trans. Pattern Anal. Mach. Intell. 25(5), 578–589 (2003)

  4. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110(3), 346–359 (2008)

  5. Billings, G., Johnson-Roberson, M.: SilhoNet: an RGB method for 6D object pose estimation. IEEE Robot. Autom. Lett. 4(4), 3727–3734 (2019)

  6. Blalock, D., Gonzalez Ortiz, J.J., Frankle, J., Guttag, J.: What is the state of neural network pruning? Proc. Mach. Learn. Syst. 2, 129–146 (2020)

  7. Blanton, H., Greenwell, C., Workman, S., Jacobs, N.: Extending absolute pose regression to multiple scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 38–39 (2020)

  8. Chollet, F.: Xception: deep learning with depthwise separable convolutions. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258 (2017)

  9. Chollet, F., et al.: Keras. https://keras.io (2015)

  10. Collins, T., Bartoli, A.: Infinitesimal plane-based pose estimation. Int. J. Comput. Vision 109(3), 252–286 (2014)

  11. Di Gregorio, R.: A novel point of view to define the distance between two rigid-body poses. In: Advances in robot kinematics: Analysis and design, pp. 361–369. Springer (2008)

  12. Diebel, J.: Representing attitude: Euler angles, unit quaternions, and rotation vectors. Matrix 58(15–16), 1–35 (2006)

  13. Do, T.T., Cai, M., Pham, T., Reid, I.: Deep-6DPose: recovering 6D object pose from a single RGB image. arXiv preprint arXiv:1802.10367 (2018)

  14. Fiala, M.: ARTag, a fiducial marker system using digital techniques. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2005), vol. 2, pp. 590–596. IEEE (2005)

  15. Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM 24(6), 381–395 (1981)

  16. Gedik, O.S., Alatan, A.A.: RGBD data based pose estimation: why sensor fusion? In: 2015 18th International Conference on Information Fusion (Fusion), pp. 2129–2136. IEEE (2015)

  17. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the thirteenth international conference on artificial intelligence and statistics, pp. 249–256 (2010)

  18. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. Adv. Neural Inf. Process. Syst. 28, 1005 (2015)

  19. Harada, K., Tanaka, S., Tamaki, T., Raytchev, B., Kaneda, K., Amano, T.: Comparison of 3 dof pose representations for pose estimations, vol. 123, pp. 408–413 (2010)

  20. Harris, C.G., Stephens, M., et al.: A combined corner and edge detector. In: Alvey Vision Conference, vol. 15. Citeseer (1988)

  21. Hartley, R., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, New York (2003)

  22. Hati, S., Sengupta, S.: Robust camera parameter estimation using genetic algorithm. Pattern Recogn. Lett. 22(3–4), 289–298 (2001)

  23. He, C., Kazanzides, P., Sen, H.T., Kim, S., Liu, Y.: An inertial and optical sensor fusion approach for six degree-of-freedom pose estimation. Sensors 15(7), 16448–16465 (2015)

  24. He, K., Zhang, X., Ren, S., Sun, J.: Spatial pyramid pooling in deep convolutional networks for visual recognition. In: European conference on computer vision, pp. 346–361. Springer (2014)

  25. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778 (2016)

  26. He, Z., Feng, W., Zhao, X., Lv, Y.: 6D pose estimation of objects: recent technologies and challenges. Appl. Sci. 11(1), 228 (2021)

  27. Hesch, J.A., Roumeliotis, S.I.: A direct least-squares (DLS) method for PnP. In: 2011 International Conference on Computer Vision, pp. 383–390. IEEE (2011)

  28. Hinterstoisser, S., Cagniart, C., Ilic, S., Sturm, P., Navab, N., Fua, P., Lepetit, V.: Gradient response maps for real-time detection of textureless objects. IEEE Trans. Pattern Anal. Mach. Intell. 34(5), 876–888 (2011)

  29. Hinterstoisser, S., Lepetit, V., Ilic, S., Fua, P., Navab, N.: Dominant orientation templates for real-time detection of texture-less objects. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 2257–2264. IEEE (2010)

  30. Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski, G., Konolige, K., Navab, N.: Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes. In: Asian Conference on Computer Vision, pp. 548–562. Springer (2012)

  31. Holzer, S., Hinterstoisser, S., Ilic, S., Navab, N.: Distance transform templates for object detection and pose estimation. In: IEEE Conference on Computer Vision and Pattern Recognition, 2009. CVPR 2009, pp. 1177–1184. IEEE (2009)

  32. Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)

  33. Iandola, F.N., Han, S., Moskewicz, M.W., Ashraf, K., Dally, W.J., Keutzer, K.: SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and \(<0.5\) MB model size. arXiv preprint arXiv:1602.07360 (2016)

  34. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: International conference on machine learning, pp. 448–456. PMLR (2015)

  35. Jin, L., Wang, X., He, M., Wang, J.: DRNet: a depth-based regression network for 6D object pose estimation. Sensors 21(5), 1692 (2021)

  36. Kehl, W., Manhardt, F., Tombari, F., Ilic, S., Navab, N.: SSD-6D: making RGB-based 3D detection and 6D pose estimation great again. In: Proceedings of the International Conference on Computer Vision (ICCV 2017), Venice, Italy (2017)

  37. Kendall, A., Grimes, M., Cipolla, R.: PoseNet: a convolutional network for real-time 6-DOF camera relocalization. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2938–2946 (2015)

  38. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  39. Kleeberger, K., Huber, M.F.: Single shot 6D object pose estimation. In: 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 6239–6245. IEEE (2020)

  40. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 25, 1097–1105 (2012)

  41. Lepetit, V., Moreno-Noguer, F., Fua, P.: EPnP: an accurate O(n) solution to the PnP problem. Int. J. Comput. Vision 81(2), 155 (2009)

  42. Li, J., Aghajan, H., Casar, J.R., Philips, W.: Camera pose estimation by vision-inertial sensor fusion: an application to augmented reality books. Electron. Imaging 2016(4), 1–6 (2016)

  43. Lin, G., Milan, A., Shen, C., Reid, I.D.: RefineNet: multi-path refinement networks for high-resolution semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)

  44. Lin, M., Chen, Q., Yan, S.: Network in network. arXiv preprint arXiv:1312.4400 (2013)

  45. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440 (2015)

  46. Lowe, D.G.: Object recognition from local scale-invariant features. In: Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, pp. 1150–1157. IEEE (1999)

  47. Marchand, E., Uchiyama, H., Spindler, F.: Pose estimation for augmented reality: a hands-on survey. IEEE Trans. Visual Comput. Gr. 22(12), 2633–2651 (2016)

  48. Mikolajczyk, K., Schmid, C.: A performance evaluation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 27(10), 1615–1630 (2005)

  49. Nakajima, Y., Saito, H.: Robust camera pose estimation by viewpoint classification using deep learning. Comput. Visual Media 3(2), 189–198 (2017)

  50. Naseer, T., Burgard, W.: Deep regression for monocular camera-based 6-DOF global localization in outdoor environments. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1525–1530. IEEE (2017)

  51. Payet, N., Todorovic, S.: From contours to 3D object detection and pose estimation. In: 2011 IEEE International Conference on Computer Vision (ICCV), pp. 983–990. IEEE (2011)

  52. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788 (2016)

  53. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 39(6), 1137–1149 (2017)

  54. Romero-Ramirez, F.J., Muñoz-Salinas, R., Medina-Carnicer, R.: Speeded up detection of squared fiducial markers. Image Vis. Comput. 76, 38–47 (2018)

  55. Rublee, E., Rabaud, V., Konolige, K., Bradski, G.: ORB: an efficient alternative to SIFT or SURF. In: 2011 International Conference on Computer Vision, pp. 2564–2571. IEEE (2011)

  56. Sahin, C., Garcia-Hernando, G., Sock, J., Kim, T.K.: A review on object pose recovery: from 3D bounding box detectors to full 6D pose estimators. Image Vis. Comput. 96, 103898 (2020)

  57. Unity Technologies: Scripting. Available at http://unity3d.com/unity/workflow/scripting (2013)

  58. Seifi, S., Tuytelaars, T.: How to improve CNN-based 6-DOF camera pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision Workshops (2019)

  59. Shin, Y.D., Park, J.H., Baeg, M.H.: 6-DOF pose estimation using 2D-3D sensor fusion. In: 2012 IEEE International Conference on Automation Science and Engineering (CASE), pp. 714–717. IEEE (2012)

  60. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

  61. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)

  62. Su, H., Qi, C.R., Li, Y., Guibas, L.J.: Render for CNN: viewpoint estimation in images using CNNs trained with rendered 3D model views. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2686–2694 (2015)

  63. Su, J.Y., Cheng, S.C., Chang, C.C., Chen, J.M.: Model-based 3D pose estimation of a single RGB image using a deep viewpoint classification neural network. Appl. Sci. 9(12), 2478 (2019)

  64. Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)

  65. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)

  66. Tombari, F., Franchi, A., Di Stefano, L.: BOLD features to detect texture-less objects. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1265–1272 (2013)

  67. Toyama, F., Shoji, K., Miyamichi, J.: Model-based pose estimation using genetic algorithm. In: Fourteenth International Conference on Pattern Recognition, 1998. Proceedings. vol. 1, pp. 198–201. IEEE (1998)

  68. Trabelsi, A., Chaabane, M., Blanchard, N., Beveridge, R.: A pose proposal and refinement network for better 6D object pose estimation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 2382–2391 (2021)

  69. Tulsiani, S., Malik, J.: Viewpoints and keypoints. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1510–1519 (2015)

  70. Xiang, Y., Schmidt, T., Narayanan, V., Fox, D.: PoseCNN: a convolutional neural network for 6D object pose estimation in cluttered scenes. arXiv preprint arXiv:1711.00199 (2017)

  71. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500 (2017)

  72. Xu, Q., Zhang, M., Gu, Z., Pan, G.: Overfitting remedy by sparsifying regularization on fully-connected layers of CNNs. Neurocomputing 328, 69–74 (2019)

  73. Xu, Z., Chen, K., Jia, K.: W-PoseNet: dense correspondence regularized pixel pair pose regression. arXiv preprint arXiv:1912.11888 (2019)

  74. Yu, Y.K., Wong, K.H., Chang, M.M.Y.: Pose estimation for augmented reality applications using genetic algorithm. IEEE Trans. Syst. Man Cybern. Part B (Cybernetics) 35(6), 1295–1301 (2005)

  75. Zhang, X., Zhou, X., Lin, M., Sun, J.: ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6848–6856 (2018)

  76. Zhang, Z.: A flexible new technique for camera calibration. IEEE Trans. Pattern Anal. Mach. Intell. 22(11), 1330–1334 (2000)

  77. Zheng, Y., Kuang, Y., Sugimoto, S., Astrom, K., Okutomi, M.: Revisiting the PnP problem: a fast, general and optimal solution. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2344–2351 (2013)

  78. Zhou, B., Lapedriza, A., Xiao, J., Torralba, A., Oliva, A.: Learning deep features for scene recognition using places database. In: Advances in Neural Information Processing Systems, pp. 487–495 (2014)

  79. Zhou, Y., Barnes, C., Lu, J., Yang, J., Li, H.: On the continuity of rotation representations in neural networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

Acknowledgements

The authors thank Consejo Nacional de Ciencia y Tecnología (CONACYT) for the support provided.

Funding

Vladimir Ocegueda-Hernández received a scholarship from Consejo Nacional de Ciencia y Tecnología to pursue postgraduate studies. Gerardo Mendizabal-Ruiz and Israel Román-Godínez receive monthly support from Consejo Nacional de Ciencia y Tecnología for performing research activities.

Author information

Contributions

All authors contributed to the study conception and design. Material preparation, data collection and analysis were performed by Vladimir Ocegueda-Hernández. The first draft of the manuscript was written by Vladimir Ocegueda-Hernández and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Gerardo Mendizabal-Ruiz.

Ethics declarations

Data Availability

All the data and code employed for this work are available at https://github.com/percepcioncomputacional/PlanarPoseEstimation.

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ocegueda-Hernández, V., Román-Godínez, I. & Mendizabal-Ruiz, G. A lightweight convolutional neural network for pose estimation of a planar model. Machine Vision and Applications 33, 42 (2022). https://doi.org/10.1007/s00138-022-01292-z
