A sketch is worth a thousand navigational instructions

Abstract

Is it possible for a robot to navigate an unknown area without the ability to understand verbal instructions? This work proposes the use of pictorial cues (hand-drawn sketches) to assist navigation in scenarios where verbal instructions are less practical. These scenarios include instructions that refer to novel objects or complex instructions that describe fine details. Furthermore, some patterns (textures, languages) are difficult to describe verbally. Given a single sketch, our novel “draw in 2D and match in 3D” algorithm spots the desired content under large view variations. We show that off-the-shelf deep features have limited viewpoint invariance for sketch matching. Additionally, this work exposes the challenges of using scene text as a pictorial cue, and we propose a novel strategy to overcome these challenges across multiple languages. Our “just draw it” method overcomes the language-understanding barrier. We show that sketch-based text spotting works, without alteration, for arbitrary font shapes that standard text detectors find hard to spot. Even compared against a custom-made text detector for arbitrarily shaped fonts, sketch-based text spotting demonstrates complementary performance. We provide extensive evaluation on public datasets and contribute a fine-grained dataset, “Crossroads”, which includes tough scenarios for generating navigational instructions. Finally, we demonstrate the performance of our view-invariant sketch detectors in robotic navigation scenarios using the MINOS simulator, which contains reconstructed indoor environments.
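
The abstract's observation that off-the-shelf deep features have limited viewpoint invariance refers to a generic retrieval-style baseline. Below is a minimal, hypothetical sketch of such a baseline, not the paper's “draw in 2D and match in 3D” algorithm: a pretrained CNN (ResNet-18, an assumed choice) embeds the hand-drawn sketch and each candidate camera view, and views are ranked by cosine similarity. Function names are illustrative; PyTorch with torchvision >= 0.13 is assumed.

```python
# Hypothetical off-the-shelf deep-feature baseline for sketch-to-view matching.
# Backbone choice and function names are assumptions, not the paper's method.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Pretrained ImageNet backbone with the classifier head replaced by identity,
# so the forward pass returns a 512-d globally pooled feature vector.
backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(image: Image.Image) -> torch.Tensor:
    """Embed an RGB image (sketch or camera view) as a unit-norm feature vector."""
    x = preprocess(image.convert("RGB")).unsqueeze(0)
    feat = backbone(x).squeeze(0)
    return feat / feat.norm()

@torch.no_grad()
def rank_views(sketch: Image.Image, views: list) -> list:
    """Return indices of candidate views, most similar to the sketch first."""
    s = embed(sketch)
    sims = [float(torch.dot(s, embed(v))) for v in views]
    return sorted(range(len(views)), key=lambda i: sims[i], reverse=True)
```

In a navigation setting the candidate views would be camera frames gathered as the robot explores (e.g., in the MINOS simulator); the limitation noted in the abstract is that similarity scores from such generic features degrade under large viewpoint change, which motivates the view-invariant matching the paper proposes.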

Notes

  1. https://github.com/Sardar-Usama/A-sketch-is-worth-a-thousand-instructions.

  2. † will be made public.

Acknowledgements

This work was supported by the Higher Education Commission (HEC), Government of Pakistan, through its research Grant Number 6025/Federal/NRPU/RD/HEC/2016. We greatly appreciate the continuous guidance we received from the anonymous reviewers. We are thankful to Khadija Azhar, Shanza Nasir, Tuba Tanveer and other ROMI lab members for helping out with sketches. We are also thankful to Saran Khaliq for providing assistance with deep object detectors.

Author information

Corresponding author

Correspondence to Wajahat Hussain.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Ahmad, H., Usama, S.M., Hussain, W. et al. A sketch is worth a thousand navigational instructions. Auton Robot 45, 313–333 (2021). https://doi.org/10.1007/s10514-020-09965-2

Keywords

Navigation