


A novel statistical mechanics-based metric for characterization of text documents
Computers & Electrical Engineering (IF 4.3), Pub Date: 2020-10-01, DOI: 10.1016/j.compeleceng.2020.106812
Mainak Biswas, Shubhro Chakrabartty, Jasjit S. Suri, Hanjung Song

Abstract: Machine learning (ML) uses intelligence-based statistical models to resolve characterization problems. However, the accuracy of ML models decreases as the volume of data increases, since they cannot capture the diversity and important regularities that accompany such a huge volume. In this regard, the statistical mechanics (SM) paradigm for ML models must be investigated. SM, like Big Data, deals with a large volume of instances. A correlation principle is proposed between data, represented as words in a text corpus, and a large volume of particles within a system. Based on this principle, a new similarity measure, the degeneracy-based statistical similarity measure (DSSM), is proposed. The DSSM is directly proportional to the ratio of the total energy of two documents with maximally similar words to the total number of distinct words in the document. The DSSM showed higher performance than existing measures such as Euclidean, Cosine, SMTP, and MBSM.
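The abstract names Euclidean and Cosine among the baseline measures but does not state the DSSM formula, so the following is only a minimal sketch of those two baselines on raw word counts. The tokenization (lowercasing and splitting on whitespace), the function names, and the sample sentences are assumptions made for illustration; nothing here is taken from the paper.

```python
import math
from collections import Counter


def term_counts(text):
    """Count words after lowercasing and whitespace splitting (an assumed, simplistic tokenizer)."""
    return Counter(text.lower().split())


def euclidean_distance(a, b):
    """Euclidean distance between two word-count vectors over their joint vocabulary."""
    vocab = set(a) | set(b)
    # Counter returns 0 for words missing from a document.
    return math.sqrt(sum((a[w] - b[w]) ** 2 for w in vocab))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors; 0.0 if either is empty."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical sample documents, not drawn from the paper's corpus.
doc1 = term_counts("the cat sat on the mat")
doc2 = term_counts("the cat lay on the rug")

print(euclidean_distance(doc1, doc2))
print(cosine_similarity(doc1, doc2))
```

Both functions work on sparse `Counter` objects, so the vocabulary never has to be materialized as a dense vector. Euclidean distance is sensitive to document length, whereas cosine similarity depends only on the direction of the count vectors.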


Updated: 2020-10-01
