Abstract
We propose a novel problem comprising two tasks: (i) given a scene, recommend objects to insert, and (ii) given an object category, retrieve suitable background scenes. In both tasks, a bounding box for the inserted object is predicted, supporting downstream applications such as semi-automated advertising and video composition. The major challenge lies in the fact that the target object is neither present nor localized in the input, and furthermore, available datasets only provide scenes containing existing objects. To tackle this problem, we build an unsupervised algorithm based on object-level contexts, which explicitly models the joint probability distribution of object categories and bounding boxes using a Gaussian mixture model. Experiments on our own annotated test set demonstrate that our system outperforms existing baselines on all sub-tasks, and does so within a unified framework. Future extensions and applications are suggested.
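To make the abstract's idea concrete, the following is a minimal, self-contained sketch of scoring candidate (category, bounding box) pairs with per-category Gaussians over normalized box parameters. It is illustrative only: the function names, the diagonal-Gaussian simplification (the paper uses a full Gaussian mixture model), and the toy data are all assumptions, not the authors' implementation.

```python
import math

def fit_gaussian(samples):
    """Fit an independent (diagonal) Gaussian to 4-D box samples (cx, cy, w, h)."""
    n = len(samples)
    mean = [sum(s[d] for s in samples) / n for d in range(4)]
    # Variance floor avoids degenerate zero-variance dimensions.
    var = [max(sum((s[d] - mean[d]) ** 2 for s in samples) / n, 1e-6)
           for d in range(4)]
    return mean, var

def log_density(box, mean, var):
    """Log-probability of a box under a diagonal Gaussian."""
    return sum(
        -0.5 * math.log(2 * math.pi * var[d])
        - (box[d] - mean[d]) ** 2 / (2 * var[d])
        for d in range(4)
    )

# Toy "dataset": normalized (cx, cy, w, h) boxes observed per category.
observed = {
    "cup":    [(0.50, 0.60, 0.10, 0.15), (0.55, 0.62, 0.12, 0.14)],
    "poster": [(0.30, 0.20, 0.25, 0.40), (0.35, 0.25, 0.22, 0.38)],
}
models = {cat: fit_gaussian(boxes) for cat, boxes in observed.items()}

def recommend(candidate_box):
    """Rank categories for inserting an object at candidate_box."""
    scores = {cat: log_density(candidate_box, *mv) for cat, mv in models.items()}
    return sorted(scores, key=scores.get, reverse=True)

print(recommend((0.52, 0.60, 0.11, 0.14)))  # "cup" ranks first
```

In the paper's setting, a mixture of such Gaussians per category (rather than a single component) would capture multi-modal placements, and the same joint model supports both directions: ranking categories for a scene, and ranking scenes for a category.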
Acknowledgements
We would like to thank all reviewers for their thoughtful comments, and we would like to thank Prof. Ralph Martin for his valuable suggestions on paper revision. This work was supported by the National Key Technology R&D Program (Project Number 2016YFB1001402), the National Natural Science Foundation of China (Project Numbers 61521002, 61772298), Research Grant of Beijing Higher Institution Engineering Research Center, and Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology.
Author information
Song-Hai Zhang received his Ph.D. degree from Tsinghua University, China, in 2007. He is currently an associate professor of computer science at Tsinghua University. His research interests include image and video processing as well as geometric computing.
Zheng-Ping Zhou is an undergraduate student in the Department of Computer Science and Technology, Tsinghua University. She expects to receive her bachelor's degree in computer science in 2019. Her research interests include image processing and computer graphics.
Bin Liu is a Ph.D. student in the Department of Computer Science and Technology, Tsinghua University. He received his bachelor's degree in computer science from the same university in 2013. His research interests include image and video editing.
Xin Dong is a master's student in the Department of Computer Science and Technology, Tsinghua University. She received her bachelor's degree in computer science from the same university in 2016. Her research interests include image understanding.
Peter Hall is an associate professor in the Department of Computer Science at the University of Bath. He is also the director of the Media Technology Research Centre, Bath. He founded the Vision, Video, and Graphics network of excellence in the United Kingdom, and has served on the executive committee of the British Machine Vision Conference since 2003. He has published extensively in computer vision, especially where it interfaces with computer graphics. More recently, he has developed an interest in robotics.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhang, SH., Zhou, ZP., Liu, B. et al. What and where: A context-based recommendation system for object insertion. Comp. Visual Media 6, 79–93 (2020). https://doi.org/10.1007/s41095-020-0158-8