Classification of Scientific Texts Based on the Compression of Annotations to Publications
Automatic Documentation and Mathematical Linguistics (IF 0.5), Pub Date: 2020-02-26, DOI: 10.3103/s0005105519060062
I. V. Selivanova, D. V. Kosyakov, A. E. Guskov

Abstract

This paper examines whether the semantic proximity of scientific texts can be established by automatically classifying them based on the compression of their annotations. The idea is that compression algorithms such as PPM (prediction by partial matching) compress terminologically similar texts much better than terminologically distant ones. If a kernel of publications (an analogue of a training set) is formed for each topic, then the kernel that yields the best compression ratio indicates the topic to which the classified text belongs. Thirty thematic categories were defined; for each, annotations of approximately 500 publications were retrieved from the Scopus database, from which 100 annotations for the kernel and 20 annotations for testing were selected in different ways. Building the kernel from highly cited publications gave an error rate of up to 12%, against 32% for random sampling. Classification quality also depends on the initial number of categories: the fewer categories take part in the classification and the greater the terminological differences between them, the higher the quality.
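
The kernel-based idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: Python's standard library has no PPM compressor, so `zlib` (DEFLATE) is used here as a stand-in, and the function names, the toy kernels, and the scoring rule (extra compressed bytes per input byte) are assumptions made for illustration only.

```python
import zlib
from typing import Mapping


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed text (stand-in for PPM)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def compression_cost(kernel: str, text: str) -> float:
    """Extra compressed bytes per input byte when text is appended to kernel.

    A lower value means the kernel already contains much of the text's
    vocabulary, i.e. the two are terminologically close.
    """
    extra = compressed_size(kernel + " " + text) - compressed_size(kernel)
    return extra / len(text.encode("utf-8"))


def classify(text: str, kernels: Mapping[str, str]) -> str:
    """Return the topic whose kernel compresses the text best."""
    return min(kernels, key=lambda topic: compression_cost(kernels[topic], text))


# Hypothetical toy kernels; the paper uses 100 annotations per topic.
KERNELS = {
    "physics": (
        "quantum entanglement of photons in optical lattice; "
        "superconducting qubits and spin coherence; "
        "photon scattering in quantum wells"
    ),
    "biology": (
        "gene expression in stem cells; "
        "protein folding and enzyme kinetics; "
        "dna methylation in tumor cells"
    ),
}

SAMPLE = "spin coherence of superconducting qubits in an optical lattice"

if __name__ == "__main__":
    for topic, kernel in KERNELS.items():
        print(topic, round(compression_cost(kernel, SAMPLE), 3))
    print("predicted topic:", classify(SAMPLE, KERNELS))
```

Because the classified text is the same for every kernel, ranking by this per-byte cost is equivalent to ranking by the raw number of extra compressed bytes. DEFLATE only finds matches within a 32 KB window, so kernels as large as those described in the abstract would likely need a different compressor for this sketch to remain meaningful.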

Updated: 2020-02-26
