Abstract
Is it possible for a robot to navigate an unknown area without the ability to understand verbal instructions? This work proposes the use of pictorial cues (hand-drawn sketches) to assist navigation in scenarios where verbal instructions are less practical: instructions that refer to novel objects, or complex instructions describing fine details. Furthermore, some patterns (textures, scripts) are difficult to describe verbally. Given a single sketch, our novel “draw in 2D and match in 3D” algorithm spots the desired content under large view variations. We show that off-the-shelf deep features have limited viewpoint invariance for sketch matching. Additionally, this work exposes the challenges of using scene text as a pictorial cue, and we propose a novel strategy to overcome these challenges across multiple languages. Our “just draw it” method overcomes the language-understanding barrier: we show that sketch-based text spotting works, without alteration, for arbitrary font shapes that standard text detectors find hard to spot, and that even against a custom-made detector for arbitrarily shaped fonts it demonstrates complementary performance. We provide extensive evaluation on public datasets, along with a fine-grained dataset, “Crossroads”, which includes tough scenarios for generating navigational instructions. Finally, we demonstrate the performance of our view-invariant sketch detectors in robotic navigation scenarios using the MINOS simulator, which contains reconstructed indoor environments.
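The abstract does not detail the “draw in 2D and match in 3D” pipeline, but the underlying idea of spotting a drawn template inside a larger scene can be illustrated with a toy 2D example. The snippet below is a minimal sketch under our own assumptions: plain normalized cross-correlation over raw pixel arrays and fully synthetic data. It is not the paper's method, which matches deep features under large view variations; it only shows the template-spotting step in its simplest form.

```python
import numpy as np

def spot_sketch(scene, sketch):
    """Slide the sketch over the scene and return the location with the
    best normalized cross-correlation score (a crude pixel-level
    stand-in for learned-feature matching)."""
    H, W = scene.shape
    h, w = sketch.shape
    # Zero-mean, unit-variance template so the score is a correlation.
    t = (sketch - sketch.mean()) / (sketch.std() + 1e-8)
    best_score, best_yx = -np.inf, (0, 0)
    for y in range(H - h + 1):
        for x in range(W - w + 1):
            patch = scene[y:y + h, x:x + w]
            p = (patch - patch.mean()) / (patch.std() + 1e-8)
            score = (p * t).mean()
            if score > best_score:
                best_score, best_yx = score, (y, x)
    return best_yx, best_score

# Toy data: a 5x5 "L"-shaped stroke hidden in a noisy 40x40 scene.
rng = np.random.default_rng(0)
sketch = np.zeros((5, 5))
sketch[:, 0] = 1.0   # vertical stroke
sketch[4, :] = 1.0   # horizontal stroke
scene = rng.normal(0.0, 0.1, (40, 40))
scene[12:17, 20:25] += sketch  # embed the sketch at (12, 20)

loc, score = spot_sketch(scene, sketch)
print(loc)  # peak location of the embedded sketch
```

Pixel-level correlation like this breaks down exactly where the paper's contribution lies: under viewpoint change, perspective distortion, and appearance gaps between a hand drawing and the real scene, which is why learned, view-invariant features are needed.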
Notes
\({\dagger }\) will be made public.
References
Ammirato, P., Poirson, P., Park, E., Košecká, J., & Berg, A. C. (2017). A dataset for developing and benchmarking active vision. In ICRA. IEEE
Anderson, P., Wu, Q., Teney, D., Bruce, J., Johnson, M., Sünderhauf, N., Reid, I., Gould, S., & van den Hengel, A. (2018). Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In CVPR.
Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., & Parikh, D. (2015). VQA: Visual question answering. In ICCV.
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
Bansal, A., Russell, B., & Gupta, A. (2016). Marr revisited: 2d–3d alignment via surface normal prediction. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5965–5974).
Boniardi, F., Valada, A., Burgard, W., & Tipaldi, G. D. (2016). Autonomous indoor robot navigation using a sketch interface for drawing maps and routes. In ICRA.
Busta, M., Neumann, L., & Matas, J. (2017). Deep textspotter: An end-to-end trainable scene text localization and recognition framework. In ICCV.
Chen, D. L., & Mooney, R. J. (2011). Learning to interpret natural language navigation instructions from observations. In AAAI.
Chen, X., Fang, H., Lin, T. Y., Vedantam, R., Gupta, S., Dollár, P., & Zitnick, C. L. (2015). Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
Chen, X., Shrivastava, A., & Gupta, A. (2013). Neil: Extracting visual knowledge from web data. In ICCV. IEEE
Cherubini, A., Spindler, F., & Chaumette, F. (2014). Autonomous visual navigation and laser-based moving obstacle avoidance. IEEE Transactions on Intelligent Transportation Systems, 15(5), 2101–2110.
Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., & Vedaldi, A. (2014). Describing textures in the wild. In CVPR.
Coronado, E., Villalobos, J., Bruno, B., & Mastrogiovanni, F. (2017). Gesture-based robot control: Design challenges and evaluation with humans. In ICRA. IEEE
Costante, G., Forster, C., Delmerico, J., Valigi, P., & Scaramuzza, D. (2016). Perception-aware path planning. arXiv preprint arXiv:1605.04151.
Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J. M., Parikh, D., & Batra, D. (2017). Visual dialog. In CVPR.
Doumanoglou, A., Kouskouridas, R., Malassiotis, S., & Kim, T. K. (2016). Recovering 6d object pose and predicting next-best-view in the crowd. In CVPR.
Flint, A., Murray, D., & Reid, I. (2011). Manhattan scene understanding using monocular, stereo, and 3d features. In ICCV. IEEE
Furlan, A., Miller, S. D., Sorrenti, D. G., Li, F. F., & Savarese, S. (2013). Free your camera: 3d indoor scene understanding from arbitrary camera motion. In BMVC.
Gupta, A., & Davis, L. S. (2008). Beyond nouns: Exploiting prepositions and comparative adjectives for learning visual classifiers. In ECCV. Springer.
Gupta, S., Davidson, J., Levine, S., Sukthankar, R., & Malik, J. (2017). Cognitive mapping and planning for visual navigation. In CVPR.
Hartley, R., & Zisserman, A. (2003). Multiple view geometry in computer vision. Cambridge: Cambridge University Press.
Hedau, V., Hoiem, D., & Forsyth, D. (2009). Recovering the spatial layout of cluttered rooms. In ICCV. IEEE
Heitz, G., & Koller, D. (2008). Learning spatial context: Using stuff to find things. In ECCV.
Hemachandra, S., Duvallet, F., Howard, T. M., Roy, N., Stentz, A., & Walter, M. R. (2015). Learning models for following natural language directions in unknown environments. In ICRA. IEEE
Hussain, W., Civera, J., Montano, L., & Hebert, M. (2016). Dealing with small data and training blind spots in the manhattan world. In WACV. IEEE
Khosla, A., An, B., Lim, J. J., & Torralba, A. (2014). Looking beyond the visible scene. In CVPR.
Kong, C., Lin, D., Bansal, M., Urtasun, R., & Fidler, S. (2014). What are you talking about? text-to-image coreference. In CVPR.
Lam, O., Dayoub, F., Schulz, R., & Corke, P. (2015). Automated topometric graph generation from floor plan analysis. In ACRA.
Lee, D. C., Hebert, M., & Kanade, T. (2009). Geometric reasoning for single image structure recovery. In 2009 IEEE conference on computer vision and pattern recognition (pp. 2136–2143). IEEE.
Lin, T. Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., et al. (2014). Microsoft coco: Common objects in context. In ECCV. Springer.
Liu, C., Schwing, A. G., Kundu, K., Urtasun, R., & Fidler, S. (2015). Rent3d: Floor-plan priors for monocular layout estimation. In CVPR.
Liu, C., Wu, J., & Furukawa, Y. (2018). Floornet: A unified framework for floorplan reconstruction from 3d scans. In ECCV. Springer.
Liu, Y., Jin, L., Zhang, S., Luo, C., & Zhang, S. (2019). Curved scene text detection via transverse and longitudinal sequence connection. Pattern Recognition, 90, 337–345.
MacMahon, M., Stankiewicz, B., & Kuipers, B. (2006). Walk the talk: Connecting language, knowledge, and action in route instructions. In AAAI.
Matuszek, C., Fox, D., & Koscher, K. (2010). Following directions using statistical machine translation. In 2010 5th ACM/IEEE international conference on human–robot interaction (HRI). IEEE.
Mishra, A., Alahari, K., & Jawahar, C. (2012). Top-down and bottom-up cues for scene text recognition. In CVPR.
Nabbe, B., Hoiem, D., Efros, A. A., & Hebert, M. (2006). Opportunistic use of vision to push back the path-planning horizon. In IROS. IEEE
Quattoni, A., & Torralba, A. (2009). Recognizing indoor scenes. In CVPR.
Quy Phan, T., Shivakumara, P., Tian, S., & Lim Tan, C. (2013). Recognizing text with perspective distortion in natural scenes. In ICCV.
Redmon, J., & Farhadi, A. (2017). Yolo9000: Better, faster, stronger. In CVPR.
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., et al. (2015). Imagenet large scale visual recognition challenge. IJCV, 115(3), 211–252.
Salas, M., Hussain, W., Concha, A., Montano, L., Civera, J., & Montiel, J. (2015). Layout aware visual tracking and mapping. In IROS. IEEE
Sangkloy, P., Burnell, N., Ham, C., & Hays, J. (2016). The sketchy database: Learning to retrieve badly drawn bunnies. SIGGRAPH
Savva, M., Chang, A. X., Dosovitskiy, A., Funkhouser, T., & Koltun, V. (2017). Minos: Multimodal indoor simulator for navigation in complex environments. arXiv preprint arXiv:1712.03931.
Sharif Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). Cnn features off-the-shelf: an astounding baseline for recognition. In CVPR workshops.
Shrivastava, A., Malisiewicz, T., Gupta, A., & Efros, A. A. (2011). Data-driven visual similarity for cross-domain image matching. In ACM transactions on graphics (Vol. 30, p. 154). ACM.
Skubic, M., Blisard, S., Carle, A., & Matsakis, P. (2002). Hand-drawn maps for robot navigation. In AAAI.
Tellex, S., Kollar, T., Dickerson, S., Walter, M. R., Banerjee, A. G., Teller, S. J., & Roy, N. (2011). Understanding natural language commands for robotic navigation and mobile manipulation. In AAAI.
Wah, C., Branson, S., Welinder, P., Perona, P., & Belongie, S. (2011). The caltech-ucsd birds-200-2011 dataset.
Wang, S., Fidler, S., & Urtasun, R. (2015). Lost shopping! monocular localization in large indoor spaces. In ICCV.
Winograd, T. (1971). Procedures as a representation for data in a computer program for understanding natural language. Massachusetts Institute of Tech Cambridge Project MAC, Technical report.
Xu, K., Chen, K., Fu, H., Sun, W. L., & Hu, S. M. (2013). Sketch2scene: Sketch-based co-retrieval and co-placement of 3d models. ACM Transactions on Graphics (TOG), 32(4), 1–15.
Yamauchi, B. (1997). A frontier-based approach for autonomous exploration. In 1997 IEEE international symposium on computational intelligence in robotics and automation, 1997. CIRA’97, Proceedings. IEEE.
Yuliang, L., Lianwen, J., Shuaitao, Z., & Sheng, Z. (2017). Detecting curve text in the wild: New dataset and new solution. arXiv preprint arXiv:1712.02170.
Zhu, Y., Mottaghi, R., Kolve, E., Lim, J. J., Gupta, A., Fei-Fei, L., et al. (2017). Target-driven visual navigation in indoor scenes using deep reinforcement learning. In ICRA. IEEE
Zitnick, C. L., & Parikh, D. (2013). Bringing semantics into focus using visual abstraction. In CVPR.
Acknowledgements
This work was supported by the Higher Education Commission (HEC), Government of Pakistan, through research Grant Number 6025/Federal/NRPU/RD/HEC/2016. We appreciate the continuous guidance we received from the anonymous reviewers. We are thankful to Khadija Azhar, Shanza Nasir, Tuba Tanveer and other ROMI lab members for helping out with the sketches, and to Saran Khaliq for providing assistance with the deep object detectors.
Cite this article
Ahmad, H., Usama, S.M., Hussain, W. et al. A sketch is worth a thousand navigational instructions. Auton Robot 45, 313–333 (2021). https://doi.org/10.1007/s10514-020-09965-2