A deep learning analysis on question classification task using Word2vec representations
Neural Computing and Applications (IF 4.5), Pub Date: 2020-01-21, DOI: 10.1007/s00521-020-04725-w
Seyhmus Yilmaz, Sinan Toklu

Abstract

Question classification is an essential preliminary task for automatic question answering systems. Linguistic features play a significant role in building an accurate question classifier. Recently, deep learning systems have achieved remarkable success in various text-mining problems such as sentiment analysis, document classification, spam filtering, document summarization, and web mining. In this study, we investigate several deep learning architectures for question classification in Turkish, a highly inflectional, agglutinative language in which word forms are produced by adding suffixes (morphemes) to root words. As a non-Indo-European language, Turkish has some unique features that make it challenging for natural language processing; for instance, it has no grammatical gender or noun classes. User questions in Turkish are used to train and test the deep learning architectures, and the architectures are compared in terms of test accuracy and 10-fold cross-validation accuracy. We use two major deep learning models, long short-term memory (LSTM) networks and convolutional neural networks (CNN), and we also implement combined CNN-LSTM and CNN-SVM structures as well as several variants of these architectures obtained by changing the vector sizes and embedding types. In addition, we build word embeddings with the Word2vec method, using both CBOW and skip-gram models with different vector sizes, on a large corpus composed of user questions. We further investigate the effect of different pre-trained Word2vec embeddings on these deep learning architectures. Experimental results show that the choice of Word2vec model has a significant impact on the accuracy of the different deep learning models. Additionally, since no labeled Turkish question dataset was available, another contribution of this study is a new Turkish question dataset translated from the UIUC English question dataset. Using these techniques, we reach an accuracy of 94% on the question dataset.
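To make the pipeline summarized above concrete, the following is a minimal sketch (not the authors' code) of training Word2vec embeddings with the CBOW or skip-gram model and feeding them into a CNN-LSTM question classifier. It assumes gensim >= 4 and TensorFlow/Keras; the toy corpus, the 300-dimensional vectors, the layer sizes, and the six-class output are illustrative assumptions, not the paper's exact settings.

```python
# Minimal sketch of the described pipeline: Word2vec embeddings -> CNN-LSTM classifier.
# Assumes gensim >= 4 and tf.keras; all sizes below are illustrative, not the paper's values.
import numpy as np
from gensim.models import Word2Vec
from tensorflow.keras import Sequential, initializers
from tensorflow.keras.layers import Embedding, Conv1D, MaxPooling1D, LSTM, Dense

# Placeholder corpus of tokenized Turkish user questions.
questions = [["hangi", "ülkenin", "başkenti", "ankara"],
             ["telefonu", "kim", "icat", "etti"]]

# sg=0 -> CBOW, sg=1 -> skip-gram; vector_size is one of the sizes the paper varies.
w2v = Word2Vec(questions, vector_size=300, sg=1, window=5, min_count=1)

# Build an embedding matrix indexed by a simple vocabulary (index 0 reserved for padding).
vocab = {w: i + 1 for i, w in enumerate(w2v.wv.index_to_key)}
emb = np.zeros((len(vocab) + 1, 300))
for word, idx in vocab.items():
    emb[idx] = w2v.wv[word]

num_classes = 6  # e.g. the six coarse UIUC question types
model = Sequential([
    Embedding(len(vocab) + 1, 300,
              embeddings_initializer=initializers.Constant(emb)),  # pre-trained Word2vec weights
    Conv1D(128, 3, activation="relu"),   # convolutional feature extraction
    MaxPooling1D(2),
    LSTM(100),                           # sequence modeling on pooled features
    Dense(num_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(padded_question_indices, labels, ...) would then train the classifier.
```

For the CNN-SVM variant mentioned in the abstract, the final softmax layer would presumably be replaced by an SVM trained on the CNN's pooled features; the plain CNN and LSTM models drop the other branch of the combined architecture.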




Updated: 2020-03-30