Automatic classification of older electronic texts into the Universal Decimal Classification–UDC
Journal of Documentation (IF 2.034), Pub Date: 2020-12-08, DOI: 10.1108/jd-06-2020-0092
Matjaž Kragelj, Mirjana Kljajić Borštnar

Purpose

The purpose of this study is to develop a model for automated classification of old digitised texts to the Universal Decimal Classification (UDC), using machine-learning methods.

Design/methodology/approach

The study follows a design science research approach: the problem of assigning UDC to old, digitised texts is addressed by developing a machine-learning classification model. A corpus of 70,000 scholarly texts, fully bibliographically processed by librarians, was used to train and test the model. The model was then used to classify old texts in a corpus of 200,000 items. Human experts evaluated the performance of the model.
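The train-then-classify workflow described above can be illustrated with a minimal sketch. The abstract does not name the learning algorithm, so the multinomial Naive Bayes classifier below is a stand-in chosen for brevity, not the study's model. The class name `NaiveBayesUDC`, the four training snippets and their labels are all hypothetical; the labels use the UDC main classes 51 (mathematics) and 2 (religion).

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; real corpora need more)."""
    return text.lower().split()


class NaiveBayesUDC:
    """Toy multinomial Naive Bayes. NOT the model used in the study."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)      # documents per UDC label
        self.word_counts = defaultdict(Counter)  # token counts per UDC label
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior of the class.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                # Laplace (add-one) smoothed log likelihood of each token.
                count = self.word_counts[label][tok]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical labelled texts standing in for the librarian-processed corpus.
train_texts = [
    "theory of numbers and algebra equations",
    "geometry algebra proofs theorem",
    "church sermon prayer faith",
    "bible faith theology sermon",
]
train_labels = ["51", "51", "2", "2"]

model = NaiveBayesUDC().fit(train_texts, train_labels)

# Hypothetical unlabelled "older" texts to be classified.
print(model.predict("a theorem about algebra"))
print(model.predict("prayer and faith"))
```

In a setting like the one described, part of the labelled corpus would be held out for testing before the fitted model is applied to the unlabelled older texts.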

Findings

Results suggest that machine-learning models can correctly assign the UDC at some level for almost any scholarly text. Furthermore, the model can be recommended for the UDC assignment of older texts. Ten librarians corroborated this on 150 randomly selected texts.
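One way to read "at some level" is through the hierarchical notation of UDC, in which each additional digit narrows the subject. The sketch below measures how many leading digits a predicted number shares with the librarian-assigned one. This is an assumed, simplified agreement measure, not necessarily the one used in the study; the helper names `udc_digits` and `match_level` are hypothetical, and the code handles only simple main numbers, ignoring auxiliaries and combined notations.

```python
def udc_digits(code):
    """Keep only the digits of a simple UDC main number, e.g. '821.163.6' -> '8211636'."""
    return "".join(ch for ch in code if ch.isdigit())


def match_level(predicted, actual):
    """Count the leading digits shared by two UDC main numbers."""
    level = 0
    for p, a in zip(udc_digits(predicted), udc_digits(actual)):
        if p != a:
            break
        level += 1
    return level


print(match_level("821.163.6", "821.111"))  # shares the prefix 8211 -> 4
print(match_level("51", "53"))              # shares only the main class 5 -> 1
print(match_level("004", "94"))             # no shared prefix -> 0
```

Under this reading, a prediction with a match level of 1 is correct at the main-class level even if the finer subdivisions differ.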

Research limitations/implications

The main limitations of this study were unavailability of labelled older texts and the limited availability of librarians.

Practical implications

The classification model can provide a recommendation to the librarians during their classification work; furthermore, it can be implemented as an add-on to full-text search in the library databases.

Social implications

The proposed methodology supports librarians by recommending UDC classifiers, thus saving time in their daily work. By automatically classifying older texts, digital libraries can provide a better user experience by enabling structured searches. Both outcomes help make knowledge more widely available and usable.

Originality/value

These findings contribute to the field of automated classification of bibliographical information using full texts, especially where the texts are old and unstructured and use archaic language and vocabulary.



Updated: 2020-12-08