Semantic Data Set Construction from Human Clustering and Spatial Arrangement, Computational Linguistics



Semantic Data Set Construction from Human Clustering and Spatial Arrangement
Computational Linguistics (IF 9.3), Pub Date: 2021-03-05, DOI: 10.1162/coli_a_00396
Olga Majewska₁, Diana McCarthy₁, Jasper J. F. van den Bosch₂, Nikolaus Kriegeskorte₃, Ivan Vulić₁, Anna Korhonen₁


Research into representation learning models of lexical semantics usually utilizes some form of intrinsic evaluation to ensure that the learned representations reflect human semantic judgments. Lexical semantic similarity estimation is a widely used evaluation method, but efforts have typically focused on pairwise judgments of words in isolation, or are limited to specific contexts and lexical stimuli. There are limitations with these approaches that either do not provide any context for judgments, and thereby ignore ambiguity, or provide very specific sentential contexts that cannot then be used to generate a larger lexical resource. Furthermore, similarity between more than two items is not considered. We provide a full description and analysis of our recently proposed methodology for large-scale data set construction that produces a semantic classification of a large sample of verbs in the first phase, as well as multiway similarity judgments made within the resultant semantic classes in the second phase. The methodology uses a spatial multi-arrangement approach proposed in the field of cognitive neuroscience for capturing multi-way similarity judgments of visual stimuli. We have adapted this method to handle polysemous linguistic stimuli and much larger samples than previous work. We specifically target verbs, but the method can equally be applied to other parts of speech. We perform cluster analysis on the data from the first phase and demonstrate how this might be useful in the construction of a comprehensive verb resource. We also analyze the semantic information captured by the second phase and discuss the potential of the spatially induced similarity judgments to better reflect human notions of word similarity. We demonstrate how the resultant data set can be used for fine-grained analyses and evaluation of representation learning models on the intrinsic tasks of semantic clustering and semantic similarity.
In particular, we find that stronger static word embedding methods still outperform lexical representations emerging from more recent pre-training methods, both on word-level similarity and clustering. Moreover, thanks to the data set’s vast coverage, we are able to compare the benefits of specializing vector representations for a particular type of external knowledge by evaluating FrameNet- and VerbNet-retrofitted models on specific semantic domains such as “Heat” or “Motion.”
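
As a rough illustration of the idea behind spatially induced similarity judgments, the sketch below turns one annotator's 2-D arrangement of words into pairwise dissimilarities and correlates them with distances from a set of word vectors. It is a simplified sketch under stated assumptions, not the authors' pipeline: the coordinates and embeddings are invented toy values, on-screen Euclidean distance from a single arrangement is used directly as the dissimilarity, and Spearman correlation is assumed as the agreement measure. All function names are hypothetical.

```python
import math
from itertools import combinations


def pairwise_distances(points):
    """Euclidean distance for every unordered pair of labelled 2-D points."""
    return {
        (a, b): math.dist(points[a], points[b])
        for a, b in combinations(sorted(points), 2)
    }


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=values.__getitem__)
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Hypothetical arrangement of four verbs by one annotator (invented coordinates).
arrangement = {
    "boil": (0.10, 0.20),
    "simmer": (0.15, 0.25),
    "freeze": (0.80, 0.30),
    "run": (0.50, 0.90),
}

# Hypothetical toy embeddings (invented values, not from any real model).
embeddings = {
    "boil": (0.9, 0.1, 0.0),
    "simmer": (0.8, 0.2, 0.1),
    "freeze": (0.1, 0.9, 0.0),
    "run": (0.0, 0.1, 0.9),
}

human = pairwise_distances(arrangement)
pairs = list(human)
model = [cosine_distance(embeddings[a], embeddings[b]) for a, b in pairs]
rho = spearman([human[p] for p in pairs], model)
print(f"Spearman correlation over {len(pairs)} pairs: {rho:.3f}")
```

Because every word is placed relative to all the others at once, a single arrangement of n words yields n(n-1)/2 pairwise judgments, which is what allows similarity among more than two items to be considered together.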


Updated: 2021-03-07
