Automatic categorization of web text documents using fuzzy inference rule
Sādhanā (IF 1.4), Pub Date: 2020-06-27, DOI: 10.1007/s12046-020-01401-6
Ankita Dhar, Himadri Mukherjee, Niladri Sekhar Dash, Kaushik Roy

The digital world is flooded with documents belonging to a wide range of categories. Most of these documents are uncategorized, which hinders efficient retrieval. News texts, one of the largest and most common sources of text information, often do not belong to a single category and instead contain content from multiple domains. This calls for a text categorization system that can segregate such a text into its respective domains for efficient information retrieval. The main challenge is handling the overlap of vocabulary among different domains at categorization time, which we tackle with an approach based on fuzzy logic. In the present work, we present a fuzzy rule inference system that works with newly proposed statistical features to segregate documents that belong to more than one category or to an undefined category. The generated model was defuzzified using five different techniques to determine the category of a document, and the Centroid method yielded the highest accuracy, 98.63%. Experiments were also carried out on standard English datasets (Reuters-21578 R8 and 20 Newsgroups). We obtain better results than those of previously reported works, which points to the language independence of our system.
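
The abstract names the Centroid method as the best of the five defuzzification techniques but does not give the rule base, the membership functions, or the statistical features. The following Python sketch is therefore only a generic illustration of Mamdani-style inference followed by centroid defuzzification, not the authors' system. The output sets (`low`, `medium`, `high`), their triangular shapes, the rule firing strengths, and all function names are assumptions made for this example.

```python
def triangular(x, a, b, c):
    """Triangular membership function with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def aggregate(rule_strengths, output_sets, xs):
    """Clip each output set at its rule's firing strength, then take the pointwise max."""
    return [
        max(
            min(strength, triangular(x, *output_sets[label]))
            for label, strength in rule_strengths.items()
        )
        for x in xs
    ]


def centroid_defuzzify(xs, mu):
    """Discrete centroid: sum(x * mu(x)) / sum(mu(x))."""
    total = sum(mu)
    if total == 0:
        raise ValueError("aggregated membership is zero everywhere")
    return sum(x * m for x, m in zip(xs, mu)) / total


# Hypothetical output universe: degree to which a document belongs to a category.
XS = [i / 100 for i in range(101)]

# Hypothetical output fuzzy sets, each given as (a, b, c) for triangular().
OUTPUT_SETS = {
    "low": (-0.5, 0.0, 0.5),
    "medium": (0.0, 0.5, 1.0),
    "high": (0.5, 1.0, 1.5),
}

if __name__ == "__main__":
    # Hypothetical firing strengths, as if produced by evaluating the rules
    # on a document's features.
    strengths = {"low": 0.2, "high": 0.7}
    mu = aggregate(strengths, OUTPUT_SETS, XS)
    print(f"crisp membership score: {centroid_defuzzify(XS, mu):.3f}")
```

In a multi-category setting, one such crisp score per category could be compared against a threshold to decide whether a document belongs to one category, several, or none. That decision rule is likewise an assumption and is not stated in the abstract.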




Updated: 2020-06-27