Hierarchization of Topical Texts Based on the Estimate of Proximity to the Semantic Pattern without Paraphrasing, Pattern Recognition and Image Analysis

Pattern Recognition and Image Analysis (IF 0.7), Pub Date: 2020-09-15, DOI: 10.1134/s1054661820030207
D. V. Mikhaylov, G. M. Emelyanov

Abstract

The paper addresses the problem of numerically estimating the mutual semantic dependence of topical texts with respect to the most rational (i.e., standard) variants of describing the knowledge fragments they represent. The proximity of a text to the standard is evaluated without searching for paraphrases. This problem is relevant when determining the significance of information sources for the tasks a user performs. One example is the search for the optimal order of working with primary sources when forming a student's individual educational trajectory. In the proposed solution, the proximity of a text to the standard is assessed by dividing the words of each of its phrases into classes according to the value of the TF-IDF measure relative to the texts of a corpus previously formed by an expert. The analyzed texts are the abstracts of scientific articles together with their titles. The paper considers the principles of ranking and subsequently hierarchizing the texts of an original collection based on the variants of assessment relative to the title and to the phrase closest to the standard. The semantic images of the texts closest to the standard are determined by the words with the highest TF-IDF values, which, when adjacent in the linear sequence of a phrase, are most likely related in meaning and form key combinations together with the words whose TF-IDF values are close to the average. Analyzing the occurrence of words with the highest TF-IDF values across different texts of the collection makes it possible to assess the relationship between their standards, which serves as the basis for assessing the complementarity of texts in meaning.
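
To make the idea of dividing the words of a phrase into classes by TF-IDF value more concrete, here is a minimal Python sketch. It is not the authors' procedure: the abstract does not say how the classes are formed, so equal-width bins over the range of values are assumed here, term frequency is assumed to be counted in the analyzed text, and inverse document frequency over the expert corpus. The names `tf_idf` and `classify_phrase` and the toy corpus are hypothetical.

```python
import math
from collections import Counter


def tf_idf(word, doc_tokens, corpus):
    """Classic TF-IDF of `word` in one text relative to a corpus of token lists."""
    tf = Counter(doc_tokens)[word] / len(doc_tokens)
    df = sum(1 for text in corpus if word in text)
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf


def classify_phrase(phrase_tokens, doc_tokens, corpus, n_classes=3):
    """Assign each distinct word of a phrase to a class by its TF-IDF value.

    Class 0 holds the highest values, class n_classes - 1 the lowest.
    Equal-width binning is an assumption made for this sketch only.
    """
    scores = {w: tf_idf(w, doc_tokens, corpus) for w in set(phrase_tokens)}
    low, high = min(scores.values()), max(scores.values())
    width = (high - low) / n_classes or 1.0  # avoid zero width when all scores match
    classes = {
        w: min(int((high - s) / width), n_classes - 1) for w, s in scores.items()
    }
    return scores, classes


# Toy expert corpus: each text is already tokenized (title plus abstract).
corpus = [
    ["tf", "idf", "ranks", "texts"],
    ["texts", "describe", "knowledge"],
    ["semantic", "pattern", "knowledge", "texts"],
]
text = corpus[2]
scores, classes = classify_phrase(text, text, corpus)
for word in text:
    print(word, round(scores[word], 4), classes[word])
```

In this toy corpus, words that occur in only one text fall into the top class, a word shared by two texts falls into the middle class, and a word present in every text gets a TF-IDF of zero and falls into the lowest class. A real implementation would also need tokenization, lemmatization, and a principled way of choosing class boundaries.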


Updated: 2020-09-15
