The Challenges of Large-Scale, Web-Based Language Datasets: Word Length and Predictability Revisited
Cognitive Science (IF 2.617), Pub Date: 2021-06-25, DOI: 10.1111/cogs.12983
Stephan C Meylan 1, 2 , Thomas L Griffiths 3

Language research has come to rely heavily on large-scale, web-based datasets. These datasets can present significant methodological challenges, requiring researchers to make a number of decisions about how they are collected, represented, and analyzed. These decisions often concern long-standing challenges in corpus-based language research, including determining what counts as a word, deciding which words should be analyzed, and matching sets of words across languages. We illustrate these challenges by revisiting “Word lengths are optimized for efficient communication” (Piantadosi, Tily, & Gibson, 2011), which found that word lengths in 11 languages are more strongly correlated with their average predictability (or average information content) than their frequency. Using what we argue to be best practices for large-scale corpus analyses, we find significantly attenuated support for this result and demonstrate that a stronger relationship obtains between word frequency and length for a majority of the languages in the sample. We consider the implications of the results for language research more broadly and provide several recommendations to researchers regarding best practices.
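
To make the two predictors in the abstract concrete, here is a minimal sketch of how a word's frequency-based information and its average information content in context might be estimated. It assumes a whitespace-tokenized corpus, a one-word (bigram) context, and unsmoothed counts; the function names and the toy corpus are hypothetical and are not taken from either paper, whose corpora and context sizes are not given in this abstract.

```python
import math
from collections import Counter, defaultdict


def unigram_surprisal(tokens):
    """Return -log2 of each word type's relative frequency."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: -math.log2(n / total) for word, n in counts.items()}


def average_information_content(tokens):
    """Return the mean of -log2 P(word | previous word) over a word's occurrences.

    Assumption: a one-word context with unsmoothed bigram estimates.
    The first token has no preceding context and is skipped.
    """
    context_counts = Counter(tokens[:-1])            # times each word serves as context
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    totals = defaultdict(float)
    occurrences = Counter()
    for (prev, word), n in bigram_counts.items():
        p = n / context_counts[prev]
        totals[word] += n * -math.log2(p)
        occurrences[word] += n
    return {word: totals[word] / occurrences[word] for word in totals}


# Toy corpus, invented purely for illustration.
tokens = "the cat sat on the mat and the dog sat on the cat".split()
freq_info = unigram_surprisal(tokens)
avg_info = average_information_content(tokens)
```

With both measures in hand, each can be rank-correlated with word length across the vocabulary, and the two correlations compared. The choices the abstract highlights (what counts as a word, which words are analyzed, how word sets are matched across languages) all enter before this step, through how `tokens` and the vocabulary are built.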

Updated: 2021-06-25