HANA: A handwritten name database for offline handwritten text recognition
Explorations in Economic History (IF 1.857), Pub Date: 2022-08-18, DOI: 10.1016/j.eeh.2022.101473
Christian M. Dahl, Torben Johansen, Emil N. Sørensen, Simon Wittrock

Methods for linking individuals across historical data sets, typically in combination with AI-based transcription models, are developing rapidly. Perhaps the single most important identifier for linking is personal names. However, personal names are prone to enumeration and transcription errors, and although modern linking methods are designed to handle such challenges, these sources of error are critical and should be minimized. For this purpose, improved transcription methods and large-scale databases are crucial components. This paper describes and provides documentation for HANA, a newly constructed large-scale database which consists of more than 3.3 million names. The database contains more than 105 thousand unique names with a total of more than 1.1 million images of personal names, which proves useful for transfer learning to other settings. We provide three examples hereof, obtaining significantly improved transcription accuracy on both Danish and US census data. In addition, we present benchmark results for deep learning models automatically transcribing the personal names from the scanned documents. By making more challenging large-scale databases publicly available, we hope to foster more sophisticated, accurate, and robust models for handwritten text recognition.




Updated: 2022-08-18