Towards syntax-aware token embeddings
Natural Language Engineering (IF 2.5) Pub Date: 2020-07-08, DOI: 10.1017/s1351324920000297
Diana Nicoleta Popa, Julien Perez, James Henderson, Eric Gaussier

Distributional semantic word representations form the basis of most modern NLP systems. Their usefulness has been demonstrated across a variety of tasks, particularly as inputs to deep learning models. Beyond that, much work has investigated fine-tuning generic word embeddings to leverage linguistic knowledge from large lexical resources. Some work has investigated context-dependent word token embeddings motivated by word sense disambiguation, using sequential context and large lexical resources. More recently, acknowledging the need for in-context representations of words, some work has leveraged information derived from language modelling over large amounts of data to induce contextualised representations. In this paper, we investigate Syntax-Aware word Token Embeddings (SATokE) as a way to explicitly encode information derived from the linguistic analysis of a sentence into the vectors that are input to a deep learning model. We propose an efficient unsupervised learning algorithm based on tensor factorisation for computing these token embeddings given an arbitrary graph of linguistic structure. Applying this method to syntactic dependency structures, we investigate the usefulness of such token representations as part of deep learning models of text understanding. We encode a sentence either by learning embeddings for its tokens and the relations between them from scratch, or by leveraging pre-trained relation embeddings to infer the token representations. Given sufficient data, the former is slightly more accurate than the latter, yet both provide more informative token embeddings than standard word representations, even when the word representations have been learned on the same type of context from larger corpora (namely, pre-trained dependency-based word embeddings). We evaluate our proposal on a large set of supervised tasks using two major families of deep learning models for sentence understanding. We empirically demonstrate the superiority of these token representations over popular distributional word representations on various sentence and sentence-pair classification tasks.
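To make the tensor-factorisation idea concrete, the following is a minimal sketch of how token and relation embeddings could be learned jointly from a single sentence's dependency arcs. It is not the authors' SATokE implementation: the RESCAL-style bilinear scoring function, the logistic loss with negative sampling, and all hyperparameters below are illustrative assumptions.

```python
# Minimal sketch: syntax-aware token embeddings via tensor factorisation
# over a dependency graph. Hypothetical, illustrative code only -- the
# bilinear (RESCAL-style) score, logistic loss with negative sampling,
# and all hyperparameters are assumptions, not the published method.
import numpy as np

rng = np.random.default_rng(0)

# Toy dependency parse of "the cat sat": arcs are (head, relation, modifier).
tokens = ["the", "cat", "sat"]
relations = ["det", "nsubj"]
arcs = [(1, 0, 0),   # cat --det--> the
        (2, 1, 1)]   # sat --nsubj--> cat

dim = 8
E = rng.normal(0.0, 0.1, (len(tokens), dim))          # one vector per *token*
R = rng.normal(0.0, 0.1, (len(relations), dim, dim))  # one matrix per relation

def score(h, r, m):
    """Bilinear plausibility of the arc (head h, relation r, modifier m)."""
    return E[h] @ R[r] @ E[m]

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.5
for step in range(500):
    for (h, r, m) in arcs:
        # One corrupted arc per observed arc: resample the modifier.
        m_neg = int(rng.integers(len(tokens)))
        while m_neg == m:
            m_neg = int(rng.integers(len(tokens)))
        for tgt, label in ((m, 1.0), (m_neg, 0.0)):
            g = sigmoid(score(h, r, tgt)) - label     # d(log-loss)/d(score)
            grad_h = g * (R[r] @ E[tgt])
            grad_t = g * (R[r].T @ E[h])
            grad_R = g * np.outer(E[h], E[tgt])
            E[h] -= lr * grad_h
            E[tgt] -= lr * grad_t
            R[r] -= lr * grad_R

# Rows of E are now token embeddings that encode this sentence's syntax;
# they can be fed to a downstream sentence-classification model.
print(sigmoid(score(2, 1, 1)))   # observed arc: should approach 1
print(sigmoid(score(2, 1, 0)))   # unobserved arc: should stay low
```

In this sketch each sentence receives its own token embeddings, so the same word type can get different vectors in different syntactic contexts. The paper's alternative setting would correspond to freezing pre-trained relation embeddings R and optimising only the token embeddings E.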

Updated: 2020-07-08