Automatically evaluating the quality of textual descriptions in cultural heritage records
International Journal on Digital Libraries, Pub Date: 2021-04-23, DOI: 10.1007/s00799-021-00302-1
Matteo Lorenzini, Marco Rospocher, Sara Tonelli

Metadata are fundamental for the indexing, browsing and retrieval of cultural heritage resources in repositories, digital libraries and catalogues. To be exploited effectively, metadata must meet quality standards, which are typically defined in the collection usage guidelines. Because manually checking metadata quality may not be affordable, especially in large collections, this paper addresses the problem of assessing metadata quality automatically, focusing on textual descriptions of cultural heritage items. We describe a novel machine learning approach that frames the problem as a binary text classification task aimed at evaluating the accuracy of textual descriptions. We report our assessment of different classifiers on a new dataset that we developed, containing more than 100K descriptions. The dataset was extracted from different collections and domains of the Italian digital library “Cultura Italia” and was annotated with accuracy information, where accuracy means compliance with the cataloguing guidelines. The results empirically confirm that the proposed approach can effectively support curators (F1 \(\sim\) 0.85) in assessing the quality of the textual descriptions of the records in their collections, and they provide some insights into how the size and domain of the training data affect classification performance.
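
The abstract summarises classifier quality with a single F1 value for a binary task. As a reminder of what that number measures, the sketch below computes F1 from gold and predicted labels. It is only an illustration, not the authors' evaluation code: the function name `f1_score`, the label convention (1 for a description that complies with the guidelines, 0 otherwise) and the toy labels are all assumptions made for this example.

```python
from typing import Sequence


def f1_score(gold: Sequence[int], predicted: Sequence[int], positive: int = 1) -> float:
    """Return the F1 score of the `positive` class for a binary labelling.

    Hypothetical convention: 1 = description complies with the cataloguing
    guidelines, 0 = it does not. The source does not specify an encoding.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    # Count true positives, false positives and false negatives.
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)

    if tp == 0:
        # Precision, recall or both are zero or undefined; report 0.0.
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels invented for illustration (not data from the paper).
gold_labels = [1, 1, 1, 0, 0, 1]
predicted_labels = [1, 1, 0, 0, 1, 1]
score = f1_score(gold_labels, predicted_labels)
print(f"F1 = {score:.2f}")
```

With these toy labels there are three true positives, one false positive and one false negative, so precision and recall are both 0.75 and F1 is 0.75.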




Updated: 2021-04-23