Pairwise document similarity measure based on present term set, Journal of Big Data


Journal of Big Data (IF 8.6), Pub Date: 2018-12-26, DOI: 10.1186/s40537-018-0163-2
Marzieh Oghbaie, Morteza Mohammadi Zanjireh

Measuring pairwise document similarity is an essential operation in many text mining tasks. Most similarity measures judge the similarity between two documents from the term weights and the information content that the two documents share. However, these measures are insufficient when several documents have an identical degree of similarity to a particular document. This paper introduces a novel text document similarity measure based on the term weights and the number of terms that appear in at least one of the two documents. The effectiveness of the measure is evaluated on two real-world document collections for a variety of text mining tasks, such as text document classification, clustering, and near-duplicate detection. Its performance is compared with that of several popular measures. The experimental results show that the proposed similarity measure yields more accurate results.
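
The abstract does not state the proposed formula, so the following is not the authors' measure. It is a minimal Python sketch of the tie problem the abstract describes, under stated assumptions: raw term counts stand in for term weights, cosine similarity stands in for a "popular measure", and the helper names (term_weights, cosine_similarity, present_term_set) and the sample texts are hypothetical.

```python
import math
from collections import Counter


def term_weights(text):
    """Raw term counts as weights (an assumption; the paper may weight terms differently)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two term-weight mappings."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    sq_a = sum(w * w for w in a.values())
    sq_b = sum(w * w for w in b.values())
    if sq_a == 0 or sq_b == 0:
        return 0.0
    return dot / math.sqrt(sq_a * sq_b)


def present_term_set(a, b):
    """Terms that appear in at least one of the two documents."""
    return set(a) | set(b)


# Hypothetical sample documents, chosen so the two candidates tie under cosine.
query = term_weights("data mining")
doc_1 = term_weights("data text")
doc_2 = term_weights("data data text web graph cloud")

for name, doc in (("doc_1", doc_1), ("doc_2", doc_2)):
    print(name, cosine_similarity(query, doc), len(present_term_set(query, doc)))
```

In this sketch both candidates receive the same cosine similarity to the query, while the number of terms present in at least one document of each pair differs (3 versus 6). That count is the extra signal the abstract says the proposed measure takes into account; how the paper combines it with the term weights is not specified here.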


Updated: 2018-12-26
