An analysis and comparison of keyword recommendation methods for scientific data
International Journal on Digital Libraries (IF 1.6), Pub Date: 2020-02-07, DOI: 10.1007/s00799-020-00279-3
Youichi Ishida, Toshiyuki Shimizu, Masatoshi Yoshikawa

To classify and search various kinds of scientific data, it is useful to annotate those data with keywords from a controlled vocabulary. Data providers, such as researchers, annotate their own data with keywords from the provided vocabulary. However, for the selection of suitable keywords, extensive knowledge of both the research domain and the controlled vocabulary is required. Therefore, the annotation of scientific data with keywords from a controlled vocabulary is a time-consuming task for data providers. In this paper, we discuss methods for recommending relevant keywords from a controlled vocabulary for the annotation of scientific data through their metadata. Many previous studies have proposed approaches based on keywords in similar existing metadata; we call this the indirect method. However, when the quality of the existing metadata set is insufficient, the indirect method tends to be ineffective. Because the controlled vocabularies for scientific data usually provide definition sentences for each keyword, it is also possible to recommend keywords based on the target metadata and the keyword definitions; we call this the direct method. The direct method does not utilize the existing metadata set and therefore is independent of its quality. Also, for the evaluation of keyword recommendation methods, we propose evaluation metrics based on a hierarchical vocabulary structure, which is a distinctive feature of most controlled vocabularies. Using our proposed evaluation metrics, we can evaluate keyword recommendation methods with an emphasis on keywords that are more difficult for data providers to select. In experiments using real earth science datasets, we compare the direct and indirect methods to verify their effectiveness, and observe how the indirect method depends on the quality of the existing metadata set. The results show the importance of metadata quality in recommending keywords.
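
The difference between the two approaches can be illustrated with a minimal sketch. The abstract does not state which text model or scoring function the authors use, so everything below is an assumption for illustration only: bag-of-words cosine similarity, a k-nearest-record vote for the indirect method, and the hypothetical names `direct_recommend`, `indirect_recommend`, `definitions`, and `existing`. The toy keywords and texts are invented and are not taken from the paper's datasets.

```python
import math
from collections import Counter


def tokenize(text):
    # Lowercase and split on whitespace; keep alphabetic tokens only.
    return [w for w in text.lower().split() if w.isalpha()]


def cosine(a, b):
    # Cosine similarity between two term-count vectors (Counters).
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def direct_recommend(target_text, definitions, top_n=3):
    # Direct method (assumed form): compare the target metadata with each
    # keyword's definition sentence; no existing metadata is consulted.
    target = Counter(tokenize(target_text))
    scores = {kw: cosine(target, Counter(tokenize(d)))
              for kw, d in definitions.items()}
    return sorted(scores, key=lambda kw: (-scores[kw], kw))[:top_n]


def indirect_recommend(target_text, existing, top_n=3, k=2):
    # Indirect method (assumed form): find the k most similar existing
    # metadata records and let their keywords vote, weighted by similarity.
    target = Counter(tokenize(target_text))
    scored = [(cosine(target, Counter(tokenize(text))), kws)
              for text, kws in existing]
    nearest = sorted(scored, key=lambda pair: -pair[0])[:k]
    votes = Counter()
    for sim, kws in nearest:
        for kw in kws:
            votes[kw] += sim
    ranked = [kw for kw in votes if votes[kw] > 0]
    return sorted(ranked, key=lambda kw: (-votes[kw], kw))[:top_n]


# Invented toy vocabulary and metadata, for illustration only.
definitions = {
    "SEA SURFACE TEMPERATURE": "temperature of the ocean water at the surface",
    "PRECIPITATION AMOUNT": "amount of rain or snow that falls to the ground",
    "SOIL MOISTURE": "water content held in the soil",
}
existing = [
    ("satellite measurements of ocean surface temperature",
     ["SEA SURFACE TEMPERATURE"]),
    ("rain gauge records of daily rainfall", ["PRECIPITATION AMOUNT"]),
]
target = "daily ocean surface temperature from buoys"

print(direct_recommend(target, definitions, top_n=1))
print(indirect_recommend(target, existing))
```

In this sketch the indirect ranking can only be as good as the keywords attached to the existing records: if those annotations are sparse or wrong, the votes are too, whereas the direct ranking depends only on the keyword definitions. The hierarchy-based evaluation metrics mentioned in the abstract are not sketched here, because the abstract does not define them.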



Updated: 2020-02-07