Enhanced bag of visual words representations for content based image retrieval: a comparative study

Artificial Intelligence Review

Abstract

The exponential growth of digital image data poses numerous open problems for computer vision researchers. In this regard, designing an efficient and accurate mechanism that finds and retrieves desired images from large repositories is of great importance. To this end, various types of content based image retrieval (CBIR) systems have been developed. A typical CBIR system searches a large database for images that are similar to a given query image, using visual features automatically extracted from image pixels. In the CBIR domain, the bag of visual words (BoVW) model is one of the most widely used feature representation schemes, and a number of image retrieval frameworks are based on it. Most of them have demonstrated promising results for medium and large scale image retrieval. However, the image retrieval literature lacks a comparative evaluation of these extended BoVW formulations. This paper therefore categorizes and evaluates the existing BoVW based formulations for content based image retrieval. The commonly used datasets and the evaluation metrics for assessing the retrieval effectiveness of these models are discussed. Moreover, a quantitative evaluation of state of the art image retrieval systems based on the BoVW model is provided. Finally, promising directions for future research are proposed on the basis of the existing models and real-world demands.
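
As a rough illustration of the baseline BoVW retrieval pipeline that the abstract refers to, the following minimal sketch builds a visual vocabulary, encodes each image as a histogram of visual words, and ranks database images against a query. It is not the authors' implementation. Everything specific in it is an assumption made for illustration: the vocabulary is learned with plain k-means, descriptors are hard-assigned to their nearest word, histograms are compared by cosine similarity, and synthetic 2-D points stand in for real local descriptors. The function names (`kmeans`, `bovw_histogram`, `rank`) are hypothetical.

```python
import math
import random


def nearest(codebook, d):
    """Index of the codeword closest to descriptor d (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(codebook[k], d)),
    )


def kmeans(descriptors, k, iters=20, seed=0):
    """Plain Lloyd-style k-means; returns k codewords (the visual vocabulary)."""
    rng = random.Random(seed)
    codebook = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest(codebook, d)].append(d)
        for j, members in enumerate(clusters):
            if members:  # keep the old codeword if a cluster is empty
                codebook[j] = [sum(col) / len(members) for col in zip(*members)]
    return codebook


def bovw_histogram(descriptors, codebook):
    """Hard-assign each descriptor to a visual word, then L2-normalise the counts."""
    hist = [0.0] * len(codebook)
    for d in descriptors:
        hist[nearest(codebook, d)] += 1.0
    norm = math.sqrt(sum(h * h for h in hist)) or 1.0
    return [h / norm for h in hist]


def rank(query_hist, database_hists):
    """Database indices sorted by descending cosine similarity to the query.

    Assumes all histograms are already L2-normalised, so the dot product
    equals the cosine similarity.
    """
    scores = [sum(q * h for q, h in zip(query_hist, hist)) for hist in database_hists]
    return sorted(range(len(database_hists)), key=lambda i: -scores[i])


if __name__ == "__main__":
    rng = random.Random(1)

    def fake_image(centres, n=30):
        # Synthetic stand-in for the local descriptors of one image.
        return [[c + rng.gauss(0, 0.3) for c in rng.choice(centres)] for _ in range(n)]

    type_a, type_b = [(0, 0), (5, 5)], [(0, 5), (5, 0)]
    database = [fake_image(type_a), fake_image(type_b), fake_image(type_a)]
    codebook = kmeans([d for img in database for d in img], k=4)
    hists = [bovw_histogram(img, codebook) for img in database]
    print(rank(bovw_histogram(fake_image(type_a), codebook), hists))
```

The extended formulations that the paper compares can be read as changes to one or more of these three steps: how the vocabulary is built, how descriptors are encoded, and how the resulting representations are matched.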

Author information

Corresponding author

Correspondence to K. S. Arun.

About this article

Cite this article

Arun, K.S., Govindan, V.K. & Kumar, S.D.M. Enhanced bag of visual words representations for content based image retrieval: a comparative study. Artif Intell Rev 53, 1615–1653 (2020). https://doi.org/10.1007/s10462-019-09715-6

  • DOI: https://doi.org/10.1007/s10462-019-09715-6
