Question classification based on co-training style semi-supervised learning
Pattern Recognition Letters (IF 5.1), Pub Date: 2010-10-01, DOI: 10.1016/j.patrec.2010.06.010
Zhengtao Yu, Lei Su, Lina Li, Quan Zhao, Cunli Mao, Jianyi Guo

In statistical question classification, semi-supervised learning, which can exploit abundant unlabeled samples, has received substantial attention in recent years. This paper proposes a question classification approach based on co-training style semi-supervised learning. The method extracts high-frequency keywords as classification features and uses word semantic similarity to adjust the feature weights. The classifiers are first trained on labeled data; the learned models are then refined with unlabeled data, which are assigned a label when the classifiers agree on it. Experiments on a Chinese question answering system in the tourism domain compared different feature selection methods, supervised and semi-supervised algorithms, feature dimensions, and unlabeled rates. The results show that the proposed method effectively improves classification accuracy. Specifically, with 40% of the training set unlabeled, the average accuracy reaches 88.9% on coarse types and 78.2% on fine types, an improvement of around 2-4 percentage points.
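
The agreement-based labeling loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy keyword-vote classifier, the split of each question into two views (the question word versus the remaining keywords), the English tokens, and the LOCATION/PRICE labels are all assumptions made for illustration, and the semantic-similarity weighting of features is omitted.

```python
from collections import Counter, defaultdict


class KeywordVoteClassifier:
    """Toy stand-in classifier (assumption): each keyword votes for the
    labels it co-occurred with in the training data."""

    def fit(self, docs, labels):
        self.votes = defaultdict(Counter)
        for words, label in zip(docs, labels):
            for w in words:
                self.votes[w][label] += 1
        return self

    def predict(self, words):
        tally = Counter()
        for w in words:
            tally.update(self.votes.get(w, {}))
        # Abstain (None) when no keyword has been seen before.
        return tally.most_common(1)[0][0] if tally else None


def train(labeled, view):
    """Train one classifier on a single view of the labeled questions."""
    return KeywordVoteClassifier().fit(
        [view(q) for q, _ in labeled], [y for _, y in labeled]
    )


def co_train(labeled, unlabeled, view_a, view_b, max_rounds=10):
    """Co-training style loop: an unlabeled question is moved into the
    labeled set only when both classifiers predict the same label."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        clf_a, clf_b = train(labeled, view_a), train(labeled, view_b)
        agreed, remaining = [], []
        for q in pool:
            pa, pb = clf_a.predict(view_a(q)), clf_b.predict(view_b(q))
            if pa is not None and pa == pb:
                agreed.append((q, pa))
            else:
                remaining.append(q)
        if not agreed:  # nothing new was labeled, so stop early
            break
        labeled, pool = labeled + agreed, remaining
    # Retrain once more so the returned models reflect the final labeled set.
    return train(labeled, view_a), train(labeled, view_b), labeled, pool


# Hypothetical views: the question word vs. the remaining keywords.
view_a = lambda q: q[:1]
view_b = lambda q: q[1:]

labeled = [
    (["where", "temple", "located"], "LOCATION"),
    (["how_much", "ticket", "cost"], "PRICE"),
]
unlabeled = [
    ["where", "temple", "lake"],
    ["how_much", "ticket", "boat"],
    ["where", "lake", "path"],
    ["where", "boat", "dock"],
]

clf_a, clf_b, final_labeled, leftover = co_train(labeled, unlabeled, view_a, view_b)
print(len(final_labeled), "labeled;", len(leftover), "left unlabeled")
```

In this toy run, the third question is only labeled in the second round, after the first round has taught the keyword view that "lake" indicates LOCATION; the last question stays unlabeled because the two views keep disagreeing on it.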

Updated: 2010-10-01