Content-Based Document Image Retrieval Based on Document Modeling
Journal of Intelligent Information Systems (IF 2.3), Pub Date: 2020-06-06, DOI: 10.1007/s10844-020-00600-1
Chwan-Yi Shiah

Language models have recently gained importance in the field of information retrieval. In this paper, we propose a generic language model to improve a content-based document retrieval system. In this approach, character images are extracted, clustered, and analyzed to form high-level semantic terms using a statistical document model. This model captures the long-term relationships between characters. Documents are then indexed according to these terms, and a query document is used to retrieve the relevant documents. The query document can be a single keyword, or it can be synthesized from a text string. The aim is to generate a semantic representation from low-level image pixels through pattern matching and document modeling. The conventional approach to generating semantic terms in document retrieval includes every possible symbol sequence in the feature representation. By comparison, our approach can considerably reduce the dimensionality of the feature space while producing retrieval results comparable to those of the conventional and state-of-the-art approaches.
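
The indexing idea in the abstract can be illustrated with a minimal sketch. It is not the paper's statistical document model: it assumes each document has already been reduced to a sequence of character-cluster IDs (the integers below are hypothetical), and it uses a simple document-frequency threshold as a stand-in for the term selection step. All function names and the `min_doc_freq` parameter are assumptions made for illustration. The sketch only shows how keeping a subset of symbol sequences, rather than every possible one, shrinks the feature space while still supporting retrieval.

```python
from collections import Counter
from math import sqrt


def ngrams(symbols, n_max):
    """Return all contiguous symbol sequences of length 1..n_max."""
    return [tuple(symbols[i:i + n])
            for n in range(1, n_max + 1)
            for i in range(len(symbols) - n + 1)]


def build_vocabulary(docs, n_max=2, min_doc_freq=2):
    """Keep only sequences that occur in at least min_doc_freq documents.

    Assumption: this threshold stands in for the paper's statistical
    document model, which is not described in the abstract.
    With min_doc_freq=1 this degenerates to the conventional approach
    of keeping every observed symbol sequence.
    """
    doc_freq = Counter()
    for symbols in docs.values():
        doc_freq.update(set(ngrams(symbols, n_max)))
    return {term for term, df in doc_freq.items() if df >= min_doc_freq}


def vectorize(symbols, vocab, n_max=2):
    """Count the vocabulary terms that appear in one symbol sequence."""
    return Counter(t for t in ngrams(symbols, n_max) if t in vocab)


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, vocab, n_max=2):
    """Rank documents by similarity to the query; drop zero-score documents."""
    q = vectorize(query, vocab, n_max)
    scored = [(cosine(q, vectorize(s, vocab, n_max)), name)
              for name, s in docs.items()]
    ranked = sorted(scored, key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in ranked if score > 0]


# Hypothetical toy data: each integer is the cluster ID of one character image.
docs = {
    "doc_a": [3, 7, 7, 1, 4],
    "doc_b": [3, 7, 2, 9],
    "doc_c": [5, 6, 8],
}
query = [3, 7]  # a keyword rendered as the same kind of cluster-ID sequence

full_vocab = build_vocabulary(docs, n_max=2, min_doc_freq=1)
reduced_vocab = build_vocabulary(docs, n_max=2, min_doc_freq=2)
results = retrieve(query, docs, reduced_vocab)
```

On this toy data the reduced vocabulary is much smaller than the full set of observed sequences, yet the two documents containing the query sequence are still retrieved. Any real system would replace the threshold with a proper model and the integer IDs with clusters learned from character images.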

Updated: 2020-06-06