Unsupervised document classification integrating web scraping, one-class SVM and LDA topic modelling
Journal of Applied Statistics (IF 1.5), Pub Date: 2021-04-27, DOI: 10.1080/02664763.2021.1919063
Anton Thielmann 1 , Christoph Weisser 1, 2 , Astrid Krenz 1, 3 , Benjamin Säfken 1, 2

Unsupervised document classification for imbalanced data sets poses a major challenge. To obtain accurate classification results, training data sets are often created manually, which requires expert knowledge, time and money. Depending on the imbalance of the data set, this approach either requires human labelling of all of the data or fails to adequately recognize underrepresented categories. We propose an integration of web scraping, one-class Support Vector Machines (SVM) and Latent Dirichlet Allocation (LDA) topic modelling as a multi-step classification rule that circumvents manual labelling. The method achieves unsupervised one-class document classification using out-of-domain training data and correctly classifies more than 80% of the target data. It is validated on multiple data sets and even outperforms common machine learning classifiers.




Updated: 2021-04-27