Content-Based Document Image Retrieval Based on Document Modeling

Shiah, Chwan-Yi

doi:10.1007/s10844-020-00600-1

Published: 06 June 2020

Journal of Intelligent Information Systems, Volume 55, pages 287–306 (2020)

Chwan-Yi Shiah¹

Abstract

Recently, language models have gained importance in the field of information retrieval. In this paper, we propose a generic language model to improve a content-based document retrieval system. In this approach, character images are extracted, clustered, and analyzed to form high-level semantic terms using a statistical document model that captures the long-term relationships between characters. Documents are then indexed according to these terms, and a query document is used to retrieve the relevant documents. The query document can be a single keyword, or it can be synthesized from a text string. The aim is to generate a semantic representation from low-level image pixels through pattern matching and document modeling. The conventional approach to generating semantic terms in document retrieval includes every possible symbol sequence in the feature representation. In comparison, our approach considerably reduces the dimensionality of the feature space while producing retrieval results comparable to those of the conventional and state-of-the-art approaches.
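The indexing and retrieval steps described in the abstract can be illustrated with a minimal sketch. It is not the paper's statistical document model: it assumes that character images have already been clustered into integer cluster IDs, uses plain bigrams of those IDs as terms (the "every possible symbol sequence" baseline), and applies a simple document-frequency cutoff as a crude stand-in for term selection. All function names and the `min_df` parameter are hypothetical.

```python
from collections import Counter
import math


def ngram_terms(codes, n=2):
    """Return all contiguous n-grams of a sequence of cluster IDs."""
    return [tuple(codes[i:i + n]) for i in range(len(codes) - n + 1)]


def build_index(docs, n=2, min_df=1):
    """Build TF-IDF vectors over n-gram terms.

    Terms seen in fewer than min_df documents are dropped. This cutoff is
    an illustrative assumption, not the term selection used in the paper.
    """
    term_lists = [ngram_terms(d, n) for d in docs]
    df = Counter(t for terms in term_lists for t in set(terms))
    vocab = {t for t, count in df.items() if count >= min_df}
    idf = {t: math.log((1 + len(docs)) / (1 + df[t])) + 1.0 for t in vocab}
    vectors = []
    for terms in term_lists:
        tf = Counter(t for t in terms if t in vocab)
        vectors.append({t: f * idf[t] for t, f in tf.items()})
    return vocab, idf, vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query_codes, vocab, idf, vectors, n=2):
    """Rank indexed documents by similarity to a query code sequence."""
    tf = Counter(t for t in ngram_terms(query_codes, n) if t in vocab)
    query_vec = {t: f * idf[t] for t, f in tf.items()}
    scored = [(cosine(query_vec, v), i) for i, v in enumerate(vectors)]
    # Highest score first; ties broken by document position.
    return [i for _, i in sorted(scored, key=lambda s: (-s[0], s[1]))]


if __name__ == "__main__":
    # Each document is a toy sequence of character-cluster IDs.
    docs = [[1, 2, 3, 4], [5, 6, 7, 8], [1, 2, 9, 9]]
    vocab, idf, vectors = build_index(docs)
    print(search([1, 2, 3], vocab, idf, vectors))
```

Raising `min_df` shrinks the vocabulary, which loosely mirrors the abstract's point that selecting terms reduces the dimensionality of the feature space compared with keeping every symbol sequence.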

Notes

In this dataset, each document containing approximately 1,000 characters was used as the query image to retrieve relevant content from a total of 121,000 characters.
The Chinese Buddhist Canon (http://etext.fgs.org.tw/) is one of the largest Chinese Buddhist sutra collections in the world.

Author information

Authors and Affiliations

Department of Applied Informatics, Fo Guang University, No.160, Linwei Rd., Jiaosi, Yilan County, 26247, Taiwan (R.O.C.)
Chwan-Yi Shiah

Corresponding author

Correspondence to Chwan-Yi Shiah.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Shiah, CY. Content-Based Document Image Retrieval Based on Document Modeling. J Intell Inf Syst 55, 287–306 (2020). https://doi.org/10.1007/s10844-020-00600-1

Received: 28 August 2019
Revised: 28 November 2019
Accepted: 18 March 2020
Published: 06 June 2020
Issue Date: October 2020
DOI: https://doi.org/10.1007/s10844-020-00600-1
