Abstract
Language models have recently gained importance in the field of information retrieval. In this paper, we propose a generic language model to improve a content-based document retrieval system. In this approach, character images are extracted, clustered, and analyzed to form high-level semantic terms using a statistical document model that captures the long-term relationships between characters. Documents are then indexed according to these terms, and a query document is used to retrieve the relevant documents. The query document can be a single keyword, or it can be synthesized from a text string. The aim is to generate a semantic representation from low-level image pixels through pattern matching and document modeling. The conventional approach to generating semantic terms in document retrieval includes every possible symbol sequence in the feature representation. By comparison, our approach considerably reduces the dimensionality of the feature space while producing retrieval results comparable to those of the conventional and state-of-the-art approaches.
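The indexing and retrieval steps summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's method: it assumes character images have already been clustered into integer labels, it substitutes a simple document-frequency threshold for the statistical document model, and it ranks with TF-IDF and cosine similarity. All function names and parameters (`ngrams`, `build_vocabulary`, `n_max`, `min_docs`) are hypothetical.

```python
import math
from collections import Counter


def ngrams(seq, n_max=2):
    """Yield every contiguous run of 1..n_max cluster labels as a tuple."""
    for n in range(1, n_max + 1):
        for i in range(len(seq) - n + 1):
            yield tuple(seq[i:i + n])


def build_vocabulary(docs, n_max=2, min_docs=2):
    """Map each kept term to its document frequency.

    Assumption: a term is kept only if it occurs in at least min_docs
    documents. This threshold is a stand-in for the paper's document model;
    it prunes the "every possible symbol sequence" feature space.
    """
    df = Counter()
    for doc in docs:
        df.update(set(ngrams(doc, n_max)))
    return {term: count for term, count in df.items() if count >= min_docs}


def tfidf(doc, vocab, n_docs, n_max=2):
    """Sparse TF-IDF vector (dict) restricted to the kept vocabulary."""
    tf = Counter(t for t in ngrams(doc, n_max) if t in vocab)
    return {t: c * (math.log(n_docs / vocab[t]) + 1.0) for t, c in tf.items()}


def cosine(a, b):
    """Cosine similarity of two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs, n_max=2, min_docs=2):
    """Return document indices ordered from most to least similar."""
    vocab = build_vocabulary(docs, n_max, min_docs)
    index = [tfidf(d, vocab, len(docs), n_max) for d in docs]
    q = tfidf(query, vocab, len(docs), n_max)
    scored = [(cosine(q, vec), i) for i, vec in enumerate(index)]
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Toy corpus: each document is a sequence of hypothetical cluster labels.
docs = [
    [3, 1, 4, 1, 5],
    [3, 1, 4, 2, 6],
    [9, 9, 8, 7, 9],
]
query = [3, 1, 4, 1, 5]
ranking = rank(query, docs)
```

In this toy corpus the thresholded vocabulary keeps only the terms shared by the first two documents, so the third document has an empty vector and ranks last. A real system would replace the threshold with a proper term-selection model and the integer labels with the output of image clustering.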
Notes
In this dataset, each document containing approximately 1,000 characters was used as the query image to retrieve relevant content from a total of 121,000 characters.
The Chinese Buddhist Canon (http://etext.fgs.org.tw/) is one of the largest Chinese Buddhist sutra collections in the world.
Cite this article
Shiah, CY. Content-Based Document Image Retrieval Based on Document Modeling. J Intell Inf Syst 55, 287–306 (2020). https://doi.org/10.1007/s10844-020-00600-1
DOI: https://doi.org/10.1007/s10844-020-00600-1