Pairwise Multi-Class Document Classification for Semantic Relations between Wikipedia Articles,arXiv - CS - Digital Libraries

arXiv - CS - Digital Libraries. Pub Date: 2020-03-22, DOI: arxiv-2003.09881
Malte Ostendorff, Terry Ruas, Moritz Schubotz, Georg Rehm, Bela Gipp

Many digital libraries recommend literature to their users based on the similarity between a query document and the documents in their repository. However, they often fail to distinguish which relationship makes two documents alike. In this paper, we model the problem of finding the relationship between two documents as a pairwise document classification task. To find the semantic relation between documents, we apply a series of techniques, such as GloVe, Paragraph-Vectors, BERT, and XLNet, under different configurations (e.g., sequence length, vector concatenation scheme), including a Siamese architecture for the Transformer-based systems. We perform our experiments on a newly proposed dataset of 32,168 Wikipedia article pairs and Wikidata properties that define the semantic document relations. Our results show vanilla BERT to be the best-performing system, with an F1-score of 0.93; we manually examine this system to better understand its applicability to other domains. Our findings suggest that classifying semantic relations between documents is a solvable task and motivate the development of recommender systems based on the evaluated techniques. The discussions in this paper serve as a first step toward exploring documents through SPARQL-like queries, so that one could find documents that are similar in one aspect but dissimilar in another.
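The abstract mentions a "vector concatenation scheme" as one of the varied configurations but does not list the schemes. As a minimal sketch of the idea, the Python below combines two document vectors into a single feature vector that a pairwise classifier could consume. The three schemes, the function names, and the toy vectors are assumptions chosen for illustration; they are common choices in pairwise models and are not necessarily the ones evaluated in the paper.

```python
from typing import Callable, Dict, List

Vector = List[float]


def concat(u: Vector, v: Vector) -> Vector:
    """Plain concatenation [u; v]."""
    return u + v


def concat_abs_diff(u: Vector, v: Vector) -> Vector:
    """Concatenation plus element-wise absolute difference [u; v; |u - v|]."""
    return u + v + [abs(a - b) for a, b in zip(u, v)]


def concat_abs_diff_product(u: Vector, v: Vector) -> Vector:
    """Concatenation plus absolute difference and product [u; v; |u - v|; u * v]."""
    return concat_abs_diff(u, v) + [a * b for a, b in zip(u, v)]


# Hypothetical scheme names, used only in this sketch.
SCHEMES: Dict[str, Callable[[Vector, Vector], Vector]] = {
    "concat": concat,
    "concat_abs_diff": concat_abs_diff,
    "concat_abs_diff_product": concat_abs_diff_product,
}


def pair_features(u: Vector, v: Vector, scheme: str = "concat") -> Vector:
    """Build the feature vector for one document pair under the given scheme."""
    if len(u) != len(v):
        raise ValueError("document vectors must have the same dimension")
    return SCHEMES[scheme](u, v)


# Toy 3-dimensional "document vectors" standing in for real embeddings.
doc_a = [0.5, -1.0, 2.0]
doc_b = [1.5, 1.0, 2.0]
features = pair_features(doc_a, doc_b, "concat_abs_diff_product")
print(len(features), features)
```

In a full pipeline, the resulting vector would be passed to a multi-class classifier whose labels are the semantic relations (in the paper, Wikidata properties); that step is omitted here.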

Updated: 2020-03-24
