Combining Embeddings of Input Data for Text Classification
Neural Processing Letters (IF 3.1) Pub Date: 2020-08-11, DOI: 10.1007/s11063-020-10312-w
Zuzanna Parcheta , Germán Sanchis-Trilles , Francisco Casacuberta , Robin Rendahl

Automatic text classification is an essential part of text analysis. Classification accuracy can be improved at different stages of the pipeline, such as preprocessing or network implementation. In this paper, we focus on how combining different methods of text encoding affects classification accuracy. To do this, we implemented a multi-input neural network that encodes input text using several text encoding techniques: BERT, a neural embedding layer, GloVe, skip-thoughts and ParagraphVector. The text can be represented at different levels of tokenised input, such as the sentence, word, byte pair encoding and character levels. Experiments were conducted on seven datasets covering four languages: English, German, Swedish and Czech, some of which feature agglutination and rich grammatical case systems. Two of the seven datasets originate from real commercial scenarios: (1) classifying ingredients into their corresponding classes, using a corpus provided by Northfork; and (2) classifying texts according to the English proficiency level of their writers, using a corpus provided by ProvenWord. The developed architecture achieves improvements with different combinations of text encoding techniques, depending on the characteristics of each dataset. Once the best combination of embeddings at different levels was determined, different multi-input neural network architectures were compared. The results obtained with the best embedding combination and best network architecture outperform state-of-the-art baselines on the datasets used in the experiments.
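To illustrate the core idea of a multi-input network combining embeddings from different tokenisation levels, the following is a minimal NumPy sketch. It is not the authors' implementation: the embedding tables are randomly initialised stand-ins for pretrained encoders such as GloVe or BERT, and the vocabulary sizes, dimensions and class count are hypothetical. Each input branch (word level and character level) is mean-pooled, the pooled vectors are concatenated, and a single softmax layer classifies the combined representation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical vocabulary sizes and embedding dimensions.
WORD_VOCAB, CHAR_VOCAB = 1000, 128
WORD_DIM, CHAR_DIM, NUM_CLASSES = 16, 8, 3

# Randomly initialised tables stand in for pretrained embeddings
# (e.g. GloVe at the word level, a learned table at the character level).
word_emb = rng.normal(size=(WORD_VOCAB, WORD_DIM))
char_emb = rng.normal(size=(CHAR_VOCAB, CHAR_DIM))

# Classifier weights over the concatenated representation.
W = rng.normal(size=(WORD_DIM + CHAR_DIM, NUM_CLASSES))
b = np.zeros(NUM_CLASSES)

def encode(ids, table):
    """Mean-pool the embeddings of a token-id sequence into one vector."""
    return table[ids].mean(axis=0)

def classify(word_ids, char_ids):
    """Combine the two input branches by concatenation, then apply softmax."""
    combined = np.concatenate([encode(word_ids, word_emb),
                               encode(char_ids, char_emb)])
    logits = combined @ W + b
    probs = np.exp(logits - logits.max())  # numerically stable softmax
    return probs / probs.sum()

# Example: a sentence represented as word ids plus its raw character codes.
probs = classify(np.array([4, 17, 256]),
                 np.array([ord(c) for c in "hello"]))
```

In the paper's architecture, concatenation is only one possible fusion strategy, and the weights would be trained jointly; this sketch only shows how heterogeneous input representations can feed a single classifier.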



