Content-Based Document Image Retrieval Based on Document Modeling

Journal of Intelligent Information Systems

Abstract

Language models have recently gained importance in the field of information retrieval. In this paper, we propose a generic language model to improve a content-based document retrieval system. In this approach, character images are extracted, clustered, and analyzed to form high-level semantic terms using a statistical document model that captures the long-term relationships between characters. Documents are then indexed according to these terms, and a query document is used to retrieve the relevant documents. The query document can be a single keyword, or it can be synthesized from a text string. The aim is to generate a semantic representation from low-level image pixels through pattern matching and document modeling. The conventional approach to generating semantic terms in document retrieval includes every possible symbol sequence in the feature representation. By comparison, our approach considerably reduces the dimensionality of the feature space while producing retrieval results comparable to those of conventional and state-of-the-art approaches.
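
The indexing and retrieval steps summarized above can be illustrated with a minimal sketch. It is not the paper's method: it assumes character images have already been extracted and clustered, so each document is simply a list of integer cluster IDs. The document-frequency cutoff `min_df` is a crude stand-in for the statistical document model, which the abstract does not specify, and the TF-IDF weighting, cosine ranking, and all function names are hypothetical choices made for illustration.

```python
import math
from collections import Counter


def ngrams(seq, max_n):
    """Yield every contiguous symbol sequence of length 1..max_n."""
    for n in range(1, max_n + 1):
        for i in range(len(seq) - n + 1):
            yield tuple(seq[i:i + n])


def build_vocabulary(docs, max_n=2, min_df=2):
    """Map each kept term to its document frequency.

    min_df=1 keeps every symbol sequence (the conventional approach);
    a larger min_df prunes rare sequences (assumed stand-in for the model).
    """
    df = Counter()
    for doc in docs:
        df.update(set(ngrams(doc, max_n)))
    return {term: count for term, count in df.items() if count >= min_df}


def tfidf(seq, vocab_df, n_docs, max_n=2):
    """Sparse TF-IDF vector over the kept terms, with smoothed IDF."""
    tf = Counter(t for t in ngrams(seq, max_n) if t in vocab_df)
    return {
        t: c * (math.log((1 + n_docs) / (1 + vocab_df[t])) + 1.0)
        for t, c in tf.items()
    }


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs, max_n=2, min_df=2):
    """Return document indices ordered from most to least similar."""
    vocab = build_vocabulary(docs, max_n, min_df)
    index = [tfidf(d, vocab, len(docs), max_n) for d in docs]
    q = tfidf(query, vocab, len(docs), max_n)
    scored = [(cosine(q, vec), i) for i, vec in enumerate(index)]
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Toy corpus: each integer is a hypothetical character-cluster ID.
DOCS = [
    [1, 2, 3, 4, 1, 2],
    [3, 4, 5, 6, 3, 4],
    [5, 6, 7, 8, 1, 2],
]

if __name__ == "__main__":
    print(len(build_vocabulary(DOCS, min_df=1)))  # all symbol sequences
    print(len(build_vocabulary(DOCS, min_df=2)))  # pruned term set
    print(rank([3, 4], DOCS))                     # best match first
```

On this toy corpus, keeping every unigram and bigram yields 18 terms, while the pruned vocabulary keeps 9, and the query `[3, 4]` ranks the second document first. The numbers only illustrate how pruning the term set shrinks the feature space; they say nothing about the reductions reported in the paper.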

Notes

  1. In this dataset, each document, containing approximately 1,000 characters, was used as a query image to retrieve relevant content from a collection totaling 121,000 characters.

  2. The Chinese Buddhist Canon (http://etext.fgs.org.tw/) is one of the largest Chinese Buddhist sutra collections in the world.

Author information

Corresponding author

Correspondence to Chwan-Yi Shiah.

About this article

Cite this article

Shiah, CY. Content-Based Document Image Retrieval Based on Document Modeling. J Intell Inf Syst 55, 287–306 (2020). https://doi.org/10.1007/s10844-020-00600-1

  • DOI: https://doi.org/10.1007/s10844-020-00600-1