Learning from noisy out-of-domain corpus using dataless classification
Natural Language Engineering (IF 2.3) Pub Date: 2020-06-17, DOI: 10.1017/s1351324920000340
Yiping Jin, Dittaya Wanvarie, Phu T. V. Le

In real-world applications, text classification models often suffer from a lack of accurately labelled documents. The available labelled documents may also be out of domain, leaving the trained model unable to perform well in the target domain. In this work, we mitigate the data problem of text classification with a two-stage approach. First, we mine representative keywords from a noisy out-of-domain data set using statistical methods. We then apply a dataless classification method to learn from the automatically selected keywords and unlabelled in-domain data. The proposed approach outperformed various supervised learning and dataless classification baselines by a large margin. We evaluated different keyword selection methods both intrinsically and extrinsically by measuring their impact on dataless classification accuracy. Last but not least, we conducted an in-depth analysis of the classifier's behaviour and explained why the proposed dataless classification method outperformed its supervised learning counterparts.
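The abstract only outlines the two stages, so the sketch below is an illustrative reconstruction rather than the authors' implementation: chi-square scoring stands in for the paper's statistical keyword mining, and a keyword-seeded Naive Bayes classifier with a self-training loop stands in for the dataless classification step. All function and variable names (mine_keywords, dataless_classify, ood_docs, in_domain_docs, etc.) are hypothetical.

```python
# Minimal sketch of a two-stage "keywords from noisy out-of-domain data,
# then dataless classification on unlabelled in-domain data" pipeline.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import chi2
from sklearn.naive_bayes import MultinomialNB


def mine_keywords(ood_docs, ood_labels, top_k=20):
    """Stage 1: rank terms by chi-square association with each class in the
    noisy out-of-domain data, keeping only terms that occur more often
    inside the class than outside it."""
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(ood_docs)
    vocab = np.array(vec.get_feature_names_out())
    labels = np.array(ood_labels)
    keywords = {}
    for label in sorted(set(ood_labels)):
        y = (labels == label).astype(int)
        scores, _ = chi2(X, y)
        in_rate = X[np.flatnonzero(y)].mean(axis=0).A1    # term rate inside the class
        out_rate = X[np.flatnonzero(1 - y)].mean(axis=0).A1  # term rate outside it
        scores = np.where(in_rate > out_rate, np.nan_to_num(scores), 0.0)
        keywords[label] = vocab[np.argsort(scores)[::-1][:top_k]].tolist()
    return keywords


def dataless_classify(keywords, in_domain_docs, n_iter=5):
    """Stage 2: pseudo-label unlabelled in-domain documents by counting
    keyword hits, then refine a Naive Bayes model by self-training."""
    classes = sorted(keywords)
    vec = CountVectorizer(stop_words="english")
    X = vec.fit_transform(in_domain_docs)
    col = {w: i for i, w in enumerate(vec.get_feature_names_out())}
    # Keyword hits per class for every document.
    hits = np.column_stack([
        X[:, [col[w] for w in keywords[c] if w in col]].sum(axis=1).A1
        if any(w in col for w in keywords[c]) else np.zeros(X.shape[0])
        for c in classes
    ])
    seeded = np.flatnonzero(hits.max(axis=1) > 0)     # docs with at least one hit
    y = np.array(classes, dtype=object)[hits.argmax(axis=1)]
    clf = MultinomialNB()
    clf.fit(X[seeded], y[seeded])                     # bootstrap from keyword matches
    for _ in range(n_iter):                           # EM-flavoured self-training
        y = clf.predict(X)
        clf.fit(X, y)
    return clf, vec


if __name__ == "__main__":
    # Toy data only, to show the call pattern.
    ood_docs = ["the team won the football match", "stocks fell on weak earnings",
                "the striker scored a late goal", "the market rallied after the report"]
    ood_labels = ["sports", "finance", "sports", "finance"]
    in_domain = ["the new goalkeeper saved a late goal", "bank stocks slid after weak earnings"]
    kw = mine_keywords(ood_docs, ood_labels, top_k=10)
    clf, vec = dataless_classify(kw, in_domain, n_iter=2)
    print(kw)
    print(clf.predict(vec.transform(in_domain)))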

Updated: 2020-06-17