Balancing the composition of word embeddings across heterogenous data sets
arXiv - CS - Computation and Language. Pub Date: 2020-01-14, DOI: arXiv-2001.04693
Stephanie Brandl, David Lassner, Maximilian Alber

Word embeddings capture semantic relationships based on contextual information and are the basis for a wide variety of natural language processing applications. Notably, these relationships are learned solely from the data, and consequently the data composition impacts the semantics of the embeddings, which can arguably lead to biased word vectors. Given qualitatively different data subsets, we aim to align the influence of individual subsets on the resulting word vectors while retaining their quality. To this end, we propose a criterion to measure the shift towards a single data subset and develop approaches to meet both objectives. We find that a weighted average of the two subset embeddings balances the influence of those subsets, though at the cost of reduced word-similarity performance. We further propose a promising optimization approach to balance both the influences and the quality of the word embeddings.
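The weighted-average idea from the abstract can be illustrated with a minimal sketch: given two embedding matrices trained separately on two data subsets (with aligned vocabularies and dimensions), a convex combination controls how much each subset contributes to the final vectors. The function name, matrix shapes, and the weight `alpha` below are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def blend_embeddings(emb_a: np.ndarray, emb_b: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Convex combination of two embedding matrices with aligned vocabularies.

    alpha = 1.0 returns emb_a unchanged; alpha = 0.0 returns emb_b.
    Hypothetical helper for illustration only.
    """
    assert emb_a.shape == emb_b.shape, "vocabularies/dimensions must be aligned"
    assert 0.0 <= alpha <= 1.0
    return alpha * emb_a + (1.0 - alpha) * emb_b

# Toy example: 3-word vocabulary, 4-dimensional vectors.
emb_subset1 = np.ones((3, 4))    # embeddings trained on subset 1
emb_subset2 = np.zeros((3, 4))   # embeddings trained on subset 2
blended = blend_embeddings(emb_subset1, emb_subset2, alpha=0.25)
print(blended[0])  # every entry is 0.25: subset 1 contributes a quarter
```

In practice the two matrices would come from embedding models (e.g. word2vec runs) trained on each subset and aligned to a shared vocabulary; the trade-off the paper reports is that such averaging balances subset influence while word-similarity scores decrease.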

Updated: 2020-01-15