Heavy-tailed Representations, Text Polarity Classification & Data Augmentation
arXiv - CS - Computation and Language. Pub Date: 2020-03-25, DOI: arxiv-2003.11593
Hamid Jalalzai, Pierre Colombo, Chloé Clavel, Eric Gaussier, Giovanna Varni, Emmanuel Vignon, Anne Sabourin

The dominant approaches to text representation in natural language rely on learning embeddings on massive corpora; these embeddings have convenient properties such as compositionality and distance preservation. In this paper, we develop a novel method to learn a heavy-tailed embedding with desirable regularity properties regarding the distributional tails, which makes it possible to analyze the points far away from the distribution bulk using the framework of multivariate extreme value theory. In particular, we obtain a classifier dedicated to the tails of the proposed embedding whose performance exceeds that of the baseline. This classifier exhibits a scale invariance property, which we leverage by introducing a novel text generation method for label-preserving dataset augmentation. Numerical experiments on synthetic and real text data demonstrate the relevance of the proposed framework and confirm that this method generates meaningful sentences with a controllable attribute, e.g. positive or negative sentiment.
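The scale invariance mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes, for illustration only, a linear tail classifier that sees an embedding solely through its direction x / ||x||, and the weights `w`, the embedding `x`, and the helper names are all hypothetical. Under that assumption, multiplying an extreme embedding by any factor greater than 1 pushes it further into the tail without changing its predicted label, which is the property a label-preserving augmentation scheme would rely on.

```python
import math


def norm(x):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(v * v for v in x))


def angle(x):
    """Direction of x, i.e. x / ||x||; this discards the scale of x."""
    n = norm(x)
    return [v / n for v in x]


def tail_classifier(x, w):
    """Return a label in {-1, +1} that depends on x only through angle(x).

    A linear score on the direction is an illustrative choice, not the
    classifier used in the paper.
    """
    score = sum(wi * ai for wi, ai in zip(w, angle(x)))
    return 1 if score >= 0 else -1


def augment(x, scales):
    """Rescaled copies of x; each copy keeps the direction of x."""
    return [[s * v for v in x] for s in scales]


w = [0.8, -0.5, 0.3]   # hypothetical classifier weights
x = [12.0, 3.0, 7.0]   # hypothetical embedding with a large norm

label = tail_classifier(x, w)
copies = augment(x, [1.5, 2.0, 4.0])
labels = [tail_classifier(c, w) for c in copies]
```

In this sketch every entry of `labels` equals `label`, because rescaling changes the norm of the embedding but not its direction. Turning the rescaled embeddings back into sentences would require a decoder, which is outside the scope of this example.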

Updated: 2020-03-27