The predictive capabilities of mathematical models for the type-token relationship in English language corpora
Computer Speech & Language (IF 4.3), Pub Date: 2021-04-20, DOI: 10.1016/j.csl.2021.101227
Martin Tunnicliffe, Gordon Hunter

We investigate the predictive capability of mathematical models of the type-token relationship applied to the vocabulary growth profiles of selected English language documents. We compare the existing Good-Toulmin and Heaps formulae with an alternative approach based on Bernoulli trial word selection from a fixed finite vocabulary using the Zipf and Zipf-Mandelbrot probability distributions. We make two major observations: firstly, while the Zipf-Mandelbrot model makes better predictions of vocabulary growth than the Zipf model, the optimized parameters of the latter correlate better than those of the former with statistics gleaned independently from the data. Secondly, the mean of the Zipf-Mandelbrot, Good-Toulmin and Heaps models provides a more consistent and unbiased prediction of vocabulary than any individual model alone.
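
As a rough illustration of the kind of model the abstract describes, the sketch below computes the expected number of distinct word types after N tokens when each token is drawn independently from a fixed finite vocabulary with Zipf-Mandelbrot probabilities, alongside the Heaps power law for comparison. It is not taken from the paper: the function names, the parameter names (`alpha`, `shift`, `k`, `beta`) and the example values are assumptions made here, and the authors' exact formulation and fitting procedure may differ. The Good-Toulmin estimator is not shown.

```python
def zipf_mandelbrot_probs(vocab_size, alpha, shift=0.0):
    """Word probabilities p_r proportional to 1 / (r + shift)**alpha.

    shift = 0 gives the plain Zipf distribution. All names and
    parameter choices here are illustrative assumptions.
    """
    weights = [1.0 / (rank + shift) ** alpha for rank in range(1, vocab_size + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def expected_types(n_tokens, probs):
    """Expected number of distinct types after n_tokens independent draws.

    A word with probability p is absent from all n draws with
    probability (1 - p)**n, so it contributes 1 - (1 - p)**n.
    """
    return sum(1.0 - (1.0 - p) ** n_tokens for p in probs)


def heaps_types(n_tokens, k, beta):
    """Heaps power law: V(N) = k * N**beta."""
    return k * n_tokens ** beta


if __name__ == "__main__":
    # Hypothetical parameter values, chosen only to show the call pattern.
    probs = zipf_mandelbrot_probs(vocab_size=10_000, alpha=1.1, shift=2.7)
    for n in (100, 1_000, 10_000):
        print(n, round(expected_types(n, probs), 1), round(heaps_types(n, 5.0, 0.6), 1))
```

Because the vocabulary is finite, the sampled-vocabulary curve saturates at `vocab_size` as N grows, whereas the Heaps curve grows without bound; this difference in long-range behaviour is one reason the models can disagree when extrapolating.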




Updated: 2021-05-09