Learning Variable-Length Representation of Words
Pattern Recognition (IF 8) Pub Date: 2020-07-01, DOI: 10.1016/j.patcog.2020.107306
Debasis Ganguly

Abstract A standard word embedding algorithm, such as 'word2vec', embeds each word as a dense vector of a preset dimensionality, the components of which are learned by maximizing the likelihood of predicting the context around the word. However, as an inherent linguistic phenomenon, words evidently vary in how difficult they are to identify from their contexts. This suggests that variable granularity in word vector representation may be useful for obtaining sparser, more compressed word representations that require less storage space. To that end, in this paper we propose a word vector training algorithm that uses a variable number of components to represent each word. Given a collection of documents, our algorithm, similar to the skip-gram approach of word2vec, learns to predict the context of a word given its current instance. However, in contrast to skip-gram, which uses a static number of dimensions for each word vector, we propose to dynamically increase the dimensionality as a stochastic function of the prediction error. Our experiments on standard test collections demonstrate that our word representation method achieves effectiveness comparable to (and sometimes better than) skip-gram word2vec while using a significantly smaller number of parameters (a compression ratio of around 65%).
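The abstract does not spell out the growth rule, so the following is a minimal sketch of how a variable-length skip-gram of this kind could look. The class name VarLenSkipGram, the hyperparameters (MAX_DIM, INIT_DIM, LR, GROW_P), and the specific rule "grow one component with probability proportional to the accumulated prediction error" are all illustrative assumptions; only the overall scheme (skip-gram with negative sampling plus error-driven stochastic dimension growth) comes from the abstract.

    import numpy as np

    rng = np.random.default_rng(0)

    MAX_DIM = 100   # hard cap on vector length (assumed)
    INIT_DIM = 10   # every word starts with a short vector (assumed)
    LR = 0.025      # SGD learning rate
    GROW_P = 0.05   # scales prediction error into a growth probability (assumed)

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    class VarLenSkipGram:
        def __init__(self, vocab_size):
            # Preallocate full-width buffers; dims[w] records how many
            # leading components of word w's row are actually in use.
            self.W_in = (rng.random((vocab_size, MAX_DIM)) - 0.5) / MAX_DIM
            self.W_out = np.zeros((vocab_size, MAX_DIM))
            self.dims = np.full(vocab_size, INIT_DIM)

        def train_pair(self, center, context, negatives):
            # One skip-gram negative-sampling step, restricted to the
            # first d live components of the center word's vector.
            d = self.dims[center]
            v = self.W_in[center, :d]         # view: updates land in W_in
            total_err = 0.0
            for w, label in [(context, 1.0)] + [(n, 0.0) for n in negatives]:
                u = self.W_out[w, :d].copy()  # copy so v's update sees the old u
                err = label - sigmoid(v @ u)  # prediction error for this pair
                self.W_out[w, :d] += LR * err * v
                v += LR * err * u
                total_err += abs(err)
            # Dynamic dimensionality: with probability proportional to the
            # accumulated error, give this word one extra component (this
            # exact stochastic rule is an assumption, not the paper's).
            if d < MAX_DIM and rng.random() < GROW_P * total_err:
                self.dims[center] = d + 1

    # Hypothetical usage on a single (center, context, negatives) triple:
    model = VarLenSkipGram(vocab_size=50_000)
    model.train_pair(center=12, context=345, negatives=[7, 891, 2048])

Under this kind of scheme the parameter count is model.dims.sum() rather than vocab_size * MAX_DIM, so frequently well-predicted words stay short while hard-to-predict words grow longer vectors; this is one plausible route to the roughly 65% compression the abstract reports.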

Updated: 2020-07-01