The empirical structure of word frequency distributions
arXiv - CS - Computation and Language Pub Date : 2020-01-09 , DOI: arxiv-2001.05292
Michael Ramscar

The frequencies at which individual words occur across languages follow power law distributions, a pattern of findings known as Zipf's law. A vast literature argues over whether this serves to optimize the efficiency of human communication; however, this claim is necessarily post hoc, and it has been suggested that Zipf's law may in fact describe mixtures of other distributions. From this perspective, recent findings that Sinosphere first (family) names are geometrically distributed are notable, because this is actually consistent with information-theoretic predictions regarding optimal coding. First names form natural communicative distributions in most languages, and I show that when analyzed in relation to the communities in which they are used, first name distributions across a diverse set of languages are both geometric and, historically, remarkably similar, with power law distributions emerging only when empirical distributions are aggregated. I then show that this pattern of findings replicates in communicative distributions of English nouns and verbs. These results indicate that if lexical distributions support efficient communication, they do so because their functional structures directly satisfy the constraints described by information theory, and not because of Zipf's law. Understanding the function of these information structures is likely to be key to explaining humankind's remarkable communicative capacities.
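
The contrast the abstract draws between geometric and power law decay, and the suggestion that aggregation can hide the former, can be illustrated with a toy calculation. This is only a sketch and not the paper's analysis: the equal-weight mixture, the four parameter values, and the helper names (`geometric_pmf`, `mixture_pmf`, `log_ratio`) are all assumptions chosen for illustration. The diagnostic used is log(P(k) / P(2k)), which is constant in k for an exact power law and grows linearly in k for a single geometric distribution.

```python
import math

def geometric_pmf(k, p):
    """P(K = k) for k = 1, 2, ... under a geometric distribution with parameter p."""
    return p * (1 - p) ** (k - 1)

def mixture_pmf(k, params):
    """Equal-weight mixture of geometric distributions (an illustrative assumption)."""
    return sum(geometric_pmf(k, p) for p in params) / len(params)

def log_ratio(pmf, k):
    """log(pmf(k) / pmf(2k)): constant for a power law, linear in k for a geometric."""
    return math.log(pmf(k) / pmf(2 * k))

# Hypothetical parameters, spread over several orders of magnitude.
PARAMS = [0.5, 0.1, 0.02, 0.004]

def single(k):
    return geometric_pmf(k, 0.1)

def mixed(k):
    return mixture_pmf(k, PARAMS)

# How much the diagnostic grows when k goes from 10 to 40.
single_growth = log_ratio(single, 40) / log_ratio(single, 10)
mixed_growth = log_ratio(mixed, 40) / log_ratio(mixed, 10)

print(f"single geometric: diagnostic grows by a factor of {single_growth:.2f}")
print(f"mixture:          diagnostic grows by a factor of {mixed_growth:.2f}")
```

For the single geometric distribution the diagnostic scales with k, whereas for this particular mixture it stays nearly flat over the same range, which is the behaviour expected of a power law. How closely a mixture approximates a power law depends on the chosen parameters and weights, so this should be read as an intuition aid only.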

Updated: 2020-01-16