CESS-A System to Categorize Bangla Web Text Documents
ACM Transactions on Asian and Low-Resource Language Information Processing (IF 2), Pub Date: 2020-06-18, DOI: 10.1145/3398070
Ankita Dhar, Himadri Mukherjee, Niladri Sekhar Dash, Kaushik Roy

Technology has evolved remarkably, leading to an exponential increase in the availability of digital text documents from disparate domains over the Internet. This makes information retrieval a time- and resource-consuming task. A system that can categorize such documents by domain can therefore help users obtain the required information with relative ease and also reduce the workload of search engines. This article presents a text categorization system (CESS) that categorizes text documents using a newly proposed hybrid feature that combines term frequency-inverse document frequency-inverse class frequency with a modified chi-square method. Experiments were performed on real-world Bangla documents from eight domains, comprising 2,429,857 tokens, and the highest accuracy of 99.91% was obtained with multilayer perceptron-based classification. The system was also tested on the Reuters-21578 and 20 Newsgroups datasets, obtaining accuracies of 97.29% and 94.67%, respectively, to show its language-independent nature.
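
The term frequency-inverse document frequency-inverse class frequency weighting named in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the formula, so the smoothed logarithms below, the function name `tf_idf_icf`, and the toy documents are all assumptions chosen for illustration. The modified chi-square component and the multilayer perceptron are not shown.

```python
import math
from collections import Counter


def tf_idf_icf(docs, labels):
    """Return one {term: weight} dict per document.

    docs   -- list of token lists
    labels -- list of class labels, one per document

    Assumed weighting (not taken from the paper):
        tf  = count of term in document / document length
        idf = log(1 + N / df), N = number of documents
        icf = log(1 + C / cf), C = number of classes
    """
    n_docs = len(docs)
    n_classes = len(set(labels))

    doc_freq = Counter()   # term -> number of documents containing it
    class_sets = {}        # term -> set of classes in which it occurs
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            class_sets.setdefault(term, set()).add(label)

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        row = {}
        for term, count in counts.items():
            tf = count / total
            idf = math.log(1 + n_docs / doc_freq[term])
            icf = math.log(1 + n_classes / len(class_sets[term]))
            row[term] = tf * idf * icf
        weights.append(row)
    return weights


# Toy corpus (hypothetical): three tokenized documents in two classes.
docs = [
    ["cricket", "match", "team"],
    ["team", "election", "vote"],
    ["election", "vote", "party"],
]
labels = ["sports", "politics", "politics"]
weights = tf_idf_icf(docs, labels)
```

In this toy corpus, "cricket" appears in one document and one class, while "team" appears in two documents and both classes, so "cricket" receives the larger weight in the first document. The inverse class frequency factor is what down-weights terms that are spread across many domains.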

Updated: 2020-06-18