Searchable Turkish OCRed historical newspaper collection 1928–1942, Journal of Information Science



Journal of Information Science (IF 1.8), Pub Date: 2021-03-21, DOI: 10.1177/01655515211000642
Houssem Menhour ₁, Hasan Basri Şahin ₂, Ramazan Nejdet Sarıkaya ₁, Medine Aktaş ₁, Rümeysa Sağlam ₁, Ekin Ekinci ₃, Süleyman Eken ₂


The newspaper emerged as a distinct cultural form in early 17th-century Europe and is bound up with the early modern period of history. Historical newspapers are of utmost importance to nations and their people, and researchers from different disciplines rely on them to improve our understanding of the past. To meet this need, the Istanbul University Head Office of Library and Documentation provides access to a large database of scanned historical newspapers. To go a step further and make the documents more accessible, we need to run optical character recognition (OCR) and named entity recognition (NER) on the whole database and index the results to enable full-text search. We design and implement a system covering the whole pipeline, from scraping the dataset from the original website to providing a graphical user interface for running search queries, and the system completes this pipeline successfully. The proposed system supports searching for people, culture and security-related keywords and visualising them.
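
The indexing and full-text search stage of such a pipeline can be illustrated with a toy inverted index. This is only a minimal sketch under stated assumptions: the abstract does not name the search engine, OCR tool, or data layout the authors used, so the page identifiers, the sample texts, and the function names `tokenize`, `build_index` and `search` are all hypothetical, and a production system would more likely rely on a dedicated search engine.

```python
import re
from collections import defaultdict

# Hypothetical OCR output: page id -> recognised text.
# These samples are invented for illustration; they are not data from the paper.
PAGES = {
    "1928-12-01_p1": "Ankara'da yeni harfler kabul edildi",
    "1935-06-14_p3": "İstanbul limanında yeni gemi",
    "1940-02-09_p2": "Ankara ve İstanbul arasında tren seferleri",
}


def tokenize(text):
    """Lowercase with Turkish dotted/dotless i handling, then split into words."""
    # Python's default str.lower() maps "İ" to "i" plus a combining dot and
    # "I" to "i", neither of which matches Turkish casing, so map them first.
    text = text.replace("İ", "i").replace("I", "ı").lower()
    return re.findall(r"\w+", text)


def build_index(pages):
    """Build an inverted index: token -> set of page ids containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for token in tokenize(text):
            index[token].add(page_id)
    return index


def search(index, query):
    """Return sorted page ids that contain every query term (AND semantics)."""
    terms = tokenize(query)
    if not terms:
        return []
    # .get avoids creating empty entries in the defaultdict for unknown terms.
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


index = build_index(PAGES)
print(search(index, "Ankara"))
print(search(index, "İstanbul Ankara"))
```

The explicit handling of "İ" and "I" matters for Turkish text because default case folding would otherwise prevent a query such as "İstanbul" from matching its lowercase form in the index. NER results could be indexed in the same way, with entity labels as additional keys.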


Updated: 2021-03-22
