Extracting text from scanned Arabic books: a large-scale benchmark dataset and a fine-tuned Faster-R-CNN model, International Journal on Document Analysis and Recognition


International Journal on Document Analysis and Recognition (IF 1.8), Pub Date: 2021-06-30, DOI: 10.1007/s10032-021-00382-4
Randa Elanwar, Wenda Qin, Margrit Betke, Derry Wijaya

Datasets of documents in Arabic are urgently needed to promote computer vision and natural language processing research that addresses the specifics of the language. Unfortunately, publicly available Arabic datasets are limited in size and restricted to certain document domains. This paper presents the release of BE-Arabic-9K, a dataset of more than 9000 high-quality scanned images from over 700 Arabic books. Among these, 1500 images have been manually segmented into regions and labeled by their functionality. BE-Arabic-9K includes book pages with a wide variety of complex layouts and page contents, making it suitable for various document layout analysis and text recognition research tasks. The paper also presents a baseline model for page layout segmentation and text extraction, based on a fine-tuned Faster R-CNN structure (FFRA). In cross-validation on the 1500 annotated images of BE-Arabic-9K, this baseline model yields an average accuracy of 99.4% and an F1 score of 99.1% for text versus non-text block classification. These results are remarkably better than those of the state-of-the-art Arabic book page segmentation system ECDP. FFRA also outperforms three other prior systems when tested on a competition benchmark dataset, making it an outstanding baseline model to challenge.
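To make the two reported metrics concrete, the following is a minimal sketch of how accuracy and F1 score can be computed for text versus non-text block classification. It is not the authors' evaluation code: the label strings, the `Counts` class, and the helper names are hypothetical and chosen only for illustration, and the sketch assumes each block already has one ground-truth label and one predicted label.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Counts:
    """Confusion counts, treating "text" as the positive class (an assumption)."""
    tp: int  # text blocks predicted as text
    fp: int  # non-text blocks predicted as text
    fn: int  # text blocks predicted as non-text
    tn: int  # non-text blocks predicted as non-text


def count_outcomes(truth: Sequence[str], predicted: Sequence[str],
                   positive: str = "text") -> Counts:
    """Tally the four outcomes over paired ground-truth and predicted labels."""
    if len(truth) != len(predicted):
        raise ValueError("truth and predicted must have the same length")
    tp = fp = fn = tn = 0
    for t, p in zip(truth, predicted):
        if p == positive:
            if t == positive:
                tp += 1
            else:
                fp += 1
        else:
            if t == positive:
                fn += 1
            else:
                tn += 1
    return Counts(tp, fp, fn, tn)


def accuracy(c: Counts) -> float:
    """Fraction of all blocks that were labeled correctly."""
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


def f1_score(c: Counts) -> float:
    """Harmonic mean of precision and recall, written as 2TP / (2TP + FP + FN)."""
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


if __name__ == "__main__":
    # Toy labels for five blocks; these are made-up values, not data from the paper.
    truth = ["text", "text", "text", "non-text", "non-text"]
    predicted = ["text", "text", "non-text", "non-text", "text"]
    counts = count_outcomes(truth, predicted)
    print(f"accuracy={accuracy(counts):.3f}, F1={f1_score(counts):.3f}")
```

In a cross-validation setting such as the one the abstract describes, these metrics would typically be computed per fold and then averaged; how the paper matches predicted regions to ground-truth blocks before counting is not stated in the abstract.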


Updated: 2021-07-01
