Abstract
Is it possible for a robot to navigate an unknown area without the ability to understand verbal instructions? This work proposes the use of pictorial cues (hand-drawn sketches) to assist navigation in scenarios where verbal instructions are less practical: instructions that refer to novel objects, or complex instructions describing fine details. Furthermore, some patterns (textures, scripts) are difficult to describe verbally. Given a single sketch, our novel “draw in 2D and match in 3D” algorithm spots the desired content under large view variations. We show that off-the-shelf deep features have limited viewpoint invariance for sketch matching. Additionally, this work exposes the challenges of using scene text as a pictorial cue, and we propose a novel strategy to overcome these challenges across multiple languages. Our “just draw it” method overcomes the language-understanding barrier: we show that sketch-based text spotting works, without alteration, for arbitrary font shapes that standard text detectors find hard to spot, and that even against a custom-made detector for arbitrarily shaped fonts it demonstrates complementary performance. We provide extensive evaluation on public datasets, along with a fine-grained dataset, “Crossroads”, which includes tough scenarios for generating navigational instructions. Finally, we demonstrate the performance of our view-invariant sketch detectors in robotic navigation scenarios using the MINOS simulator, which contains reconstructed indoor environments.
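The abstract does not detail the “draw in 2D and match in 3D” pipeline, but the underlying idea of spotting a drawn template inside a larger scene can be illustrated with a toy 2D example. The snippet below is a minimal sketch under our own assumptions: plain normalized cross-correlation over raw pixel arrays and fully synthetic data. It is not the paper's method, which matches deep features under large view variations; it only shows the template-spotting step in its simplest form.

```python
import numpy as np

def spot_sketch(scene, sketch):
    """Slide the sketch over the scene and return the location with the
    best normalized cross-correlation score (a crude pixel-level
    stand-in for learned-feature matching)."""
    H, W = scene.shape
    h, w = sketch.shape
    # Zero-mean, unit-variance template so the score is a correlation.
    t = (sketch - sketch.mean()) / (sketch.std() + 1e-8)
    best_score, best_yx = -np.inf, (0, 0)
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            patch = scene[y:y + h, x:x + w]
            p = (patch - patch.mean()) / (patch.std() + 1e-8)
            score = (p * t).mean()
            if score > best_score:
                best_score, best_yx = score, (y, x)
    return best_yx, best_score

# Toy data: a 5x5 "L"-shaped stroke hidden in a noisy 40x40 scene.
rng = np.random.default_rng(0)
sketch = np.zeros((5, 5))
sketch[:, 0] = 1.0   # vertical stroke
sketch[4, :] = 1.0   # horizontal stroke
scene = rng.normal(0.0, 0.1, (40, 40))
scene[12:17, 20:25] += sketch  # embed the sketch at (12, 20)

loc, score = spot_sketch(scene, sketch)
print(loc)  # peak location of the embedded sketch
```

Pixel-level correlation like this breaks down exactly where the paper's contribution lies: under viewpoint change, perspective distortion, and appearance gaps between a hand drawing and the real scene, which is why learned, view-invariant features are needed.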
Notes
\({\dagger }\) will be made public.
References
Ammirato, P., Poirson, P., Park, E., Košecká, J., & Berg, A. C. (2017). A dataset for developing and benchmarking active vision. In ICRA. IEEE
Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., & van den Hengel, A. (2018). Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR.
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). VQA: Visual question answering. In ICCV.
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Bansal, A., Russell, B., & Gupta, A. (2016). Marr revisited: 2d–3d alignment via surface normal prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5965–5974).
Boniardi, F., Valada, A., Burgard, W., & Tipaldi, G. D. (2016). Autonomous indoor robot navigation using a sketch interface for drawing maps and routes. In ICRA.
Busta, M., Neumann, L., & Matas, J. (2017). Deep textspotter: An end-to-end trainable scene text localization and recognition framework. In ICCV.
Chen, D. L., & Mooney, R. J. (2011). Learning to interpret natural language navigation instructions from observations. In AAAI.
Chen, X., Fang, H., Lin, T. Y., Vedantam, R., Gupta, S., Dollár, P., & Zitnick, C. L. (2015). Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
Chen, X., Shrivastava, A., & Gupta, A. (2013). Neil: Extracting visual knowledge from web data. In ICCV. IEEE
Cherubini, A., Spindler, F., & Chaumette, F. (2014). Autonomous visual navigation and laser-based moving obstacle avoidance. IEEE Transactions on Intelligent Transportation Systems, 15(5), 2101–2110.
Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., & Vedaldi, A. (2014). Describing textures in the wild. In CVPR.
Coronado, E., Villalobos, J., Bruno, B., & Mastrogiovanni, F. (2017). Gesture-based robot control: Design challenges and evaluation with humans. In ICRA. IEEE
Costante, G., Forster, C., Delmerico, J., Valigi, P., & Scaramuzza, D. (2016). Perception-aware path planning. arXiv preprint arXiv:1605.04151.
Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J. M., Parikh, D., & Batra, D. (2017). Visual dialog. In CVPR.
Doumanoglou, A., Kouskouridas, R., Malassiotis, S., & Kim, T. K. (2016). Recovering 6d object pose and predicting next-best-view in the crowd. In CVPR.
Flint, A., Murray, D., & Reid, I. (2011). Manhattan scene understanding using monocular, stereo, and 3d features. In ICCV. IEEE
Furlan, A., Miller, S. D., Sorrenti, D. G., Li, F. F., & Savarese, S. (2013). Free your camera: 3d indoor scene understanding from arbitrary camera motion. In BMVC.
Gupta, A., & Davis, L. S. (2008). Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV. Springer.
Gupta, S., Davidson, J., Levine, S., Sukthankar, R., & Malik, J. (2017). Cognitive mapping and planning for visual navigation. In CVPR.
Hartley, R., & Zisserman, A. (2003). Multiple view geometry in computer vision. Cambridge: Cambridge University Press.
Hedau, V., Hoiem, D., & Forsyth, D. (2009). Recovering the spatial layout of cluttered rooms. In ICCV. IEEE
Heitz, G., & Koller, D. (2008). Learning spatial context: Using stuff to find things. In ECCV.
Hemachandra, S., Duvallet, F., Howard, T. M., Roy, N., Stentz, A., & Walter, M. R. (2015). Learning models for following natural language directions in unknown environments. In ICRA. IEEE
Hussain, W., Civera, J., Montano, L., & Hebert, M. (2016). Dealing with small data and training blind spots in the manhattan world. In WACV. IEEE
Khosla, A., An, B., Lim, J. J., & Torralba, A. (2014). Looking beyond the visible scene. In CVPR.
Kong, C., Lin, D., Bansal, M., Urtasun, R., & Fidler, S. (2014). What are you talking about? text-to-image coreference. In CVPR.
Lam, O., Dayoub, F., Schulz, R., & Corke, P. (2015). Automated topometric graph generation from floor plan analysis. In ACRA.
Lee, D. C., Hebert, M., & Kanade, T. (2009). Geometric reasoning for single image structure recovery. In 2009 IEEE conference on computer vision and pattern recognition (pp. 2136–2143). IEEE.
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., et al. (2014). Microsoft coco: Common objects in context. In ECCV. Springer.
Liu, C., Schwing, A. G., Kundu, K., Urtasun, R., & Fidler, S. (2015). Rent3d: Floor-plan priors for monocular layout estimation. In CVPR.
Liu, C., Wu, J., & Furukawa, Y. (2018). Floornet: A unified framework for floorplan reconstruction from 3d scans. In ECCV. Springer.
Liu, Y., Jin, L., Zhang, S., Luo, C., & Zhang, S. (2019). Curved scene text detection via transverse and longitudinal sequence connection. Pattern Recognition, 90, 337–345.
MacMahon, M., Stankiewicz, B., & Kuipers, B. (2006). Walk the talk: Connecting language, knowledge, and action in route instructions. In AAAI.
Matuszek, C., Fox, D., & Koscher, K. (2010). Following directions using statistical machine translation. In 2010 5th ACM/IEEE international conference on human–robot interaction (HRI). IEEE.
Mishra, A., Alahari, K., & Jawahar, C. (2012). Top-down and bottom-up cues for scene text recognition. In CVPR.
Nabbe, B., Hoiem, D., Efros, A. A., & Hebert, M. (2006). Opportunistic use of vision to push back the path-planning horizon. In IROS. IEEE
Quattoni, A., & Torralba, A. (2009). Recognizing indoor scenes. In CVPR.
Quy Phan, T., Shivakumara, P., Tian, S., & Lim Tan, C. (2013). Recognizing text with perspective distortion in natural scenes. In ICCV.
Redmon, J., & Farhadi, A. (2017). Yolo9000: Better, faster, stronger. In CVPR.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). Imagenet large scale visual recognition challenge. IJCV, 115(3), 211–252.
Salas, M., Hussain, W., Concha, A., Montano, L., Civera, J., & Montiel, J. (2015). Layout aware visual tracking and mapping. In IROS. IEEE
Sangkloy, P., Burnell, N., Ham, C., & Hays, J. (2016). The sketchy database: Learning to retrieve badly drawn bunnies. SIGGRAPH
Savva, M., Chang, A. X., Dosovitskiy, A., Funkhouser, T., & Koltun, V. (2017). Minos: Multimodal indoor simulator for navigation in complex environments. arXiv preprint arXiv:1712.03931.
Sharif Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). Cnn features off-the-shelf: an astounding baseline for recognition. In CVPR workshops.
Shrivastava, A., Malisiewicz, T., Gupta, A., & Efros, A. A. (2011). Data-driven visual similarity for cross-domain image matching. In ACM transactions on graphics (Vol. 30, p. 154). ACM.
Skubic, M., Blisard, S., Carle, A., & Matsakis, P. (2002). Hand-drawn maps for robot navigation. In AAAI.
Tellex, S., Kollar, T., Dickerson, S., Walter, M. R., Banerjee, A. G., Teller, S. J., & Roy, N. (2011). Understanding natural language commands for robotic navigation and mobile manipulation. In AAAI.
Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The caltech-ucsd birds-200-2011 dataset.
Wang, S., Fidler, S., & Urtasun, R. (2015). Lost shopping! monocular localization in large indoor spaces. In ICCV.
Winograd, T. (1971). Procedures as a representation for data in a computer program for understanding natural language. Massachusetts Institute of Tech Cambridge Project MAC, Technical report.
Xu, K., Chen, K., Fu, H., Sun, W. L., & Hu, S. M. (2013). Sketch2scene: Sketch-based co-retrieval and co-placement of 3d models. ACM Transactions on Graphics (TOG), 32(4), 1–15.
Yamauchi, B. (1997). A frontier-based approach for autonomous exploration. In 1997 IEEE international symposium on computational intelligence in robotics and automation, 1997. CIRA’97, Proceedings. IEEE.
Yuliang, L., Lianwen, J., Shuaitao, Z., & Sheng, Z. (2017). Detecting curve text in the wild: New dataset and new solution. arXiv preprint arXiv:1712.02170.
Zhu, Y., Mottaghi, R., Kolve, E., Lim, J. J., Gupta, A., Fei-Fei, L., et al. (2017). Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA. IEEE
Zitnick, C. L., & Parikh, D. (2013). Bringing semantics into focus using visual abstraction. In CVPR.
Acknowledgements
This work was supported by the Higher Education Commission (HEC), Government of Pakistan, through research Grant Number 6025/Federal/NRPU/RD/HEC/2016. We appreciate the continuous guidance we received from the anonymous reviewers. We are thankful to Khadija Azhar, Shanza Nasir, Tuba Tanveer and other ROMI lab members for helping out with the sketches, and to Saran Khaliq for providing assistance with the deep object detectors.
Cite this article
Ahmad, H., Usama, S.M., Hussain, W. et al. A sketch is worth a thousand navigational instructions. Auton Robot 45, 313–333 (2021). https://doi.org/10.1007/s10514-020-09965-2