Automatic processing of Historical Arabic Documents: A comprehensive Survey



Pattern Recognition (IF 7.5), Pub Date: 2020-04-01, DOI: 10.1016/j.patcog.2019.107144
Mohamed Ibn Khedher, Houda Jmila, Mounim A. El-Yacoubi

Abstract: Large numbers of Historical Arabic Documents (HAD) are held in national libraries and archives around the world. Analyzing this material manually is difficult and costly, so automatic processing is needed to exploit these documents more rapidly. Processing historical documents is a recent research subject that has seen remarkable growth in recent years. Processing Historical Arabic Documents is particularly challenging for two reasons: Arabic script is more complicated than other scripts, and the documents are ancient. This paper focuses on this difficult problem and provides a comprehensive survey of existing research work. First, we describe in detail the challenges that make the automatic processing of Historical Arabic Documents a difficult task. Second, we classify this task into four applications of automatic processing of HAD: (i) analyzing the document to extract the main text; (ii) identifying the writer of the document; (iii) recognizing some words or parts of the document in a reference dataset; and (iv) retrieving and extracting specific data from the document. For each application, existing approaches are surveyed and qualitatively described. Finally, we focus on available datasets and describe how they can be used in each application.


Updated: 2020-04-01
