Study of Optimal Text Size Phenomenon in Zipf–Mandelbrot’s Distribution on the Bases of Full and Distorted Texts. Author’s Frequency Characteristics and derivation of Hapax Legomena
Journal of Quantitative Linguistics (IF 0.761), Pub Date: 2019-03-20, DOI: 10.1080/09296174.2018.1559460
Olga G. Gorina 1 , Natalya S. Tsarakova 2 , Sergey K. Tsarakov 3


This paper explores word-frequency patterns as a function of text length, authorship, and random distortion of texts. Through a series of experiments, we determined an optimal text size, a phenomenon predicted by George Zipf, at which the discrepancy between calculated and observed frequencies is minimal. A graphical representation offers a plausible explanation for the existence of this phenomenon. Working on the assumption that distorted texts might disobey Zipf’s Law, we explored frequency correlations in complete texts compared with distorted ones. Results reveal the crucial role of text length in maintaining a Zipfian distribution: randomly chosen sets of words and fragmentary texts of optimal size still obey Zipf’s Law. Findings show that authorship manifests itself through the author constant, defined as the relative frequency of the most frequent words, which remains constant throughout the works of any given author, including randomly chosen text chunks and fragments of sentences of various sizes.
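
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' procedure: the tokenizer, the function names, the parameter `k`, and the use of plain Zipf's law f(r) = f(1)/r in place of the full Zipf–Mandelbrot form are all assumptions made here for illustration only.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase runs of letters and apostrophes.
    return re.findall(r"[a-z']+", text.lower())


def rank_frequency(tokens):
    # Absolute frequencies sorted from most to least frequent word.
    return [count for _, count in Counter(tokens).most_common()]


def top_k_relative_frequency(tokens, k=1):
    # Share of the text taken up by the k most frequent words.
    # Hypothetical stand-in for the "author constant" described above.
    counts = rank_frequency(tokens)
    return sum(counts[:k]) / len(tokens)


def hapax_count(tokens):
    # Number of words that occur exactly once (hapax legomena).
    return sum(1 for c in Counter(tokens).values() if c == 1)


def zipf_discrepancy(tokens):
    # Mean absolute gap per token between observed frequencies and
    # the plain Zipf prediction f(r) = f(1) / r (an assumed simplification).
    counts = rank_frequency(tokens)
    predicted = [counts[0] / r for r in range(1, len(counts) + 1)]
    return sum(abs(o - p) for o, p in zip(counts, predicted)) / len(tokens)


sample = tokenize("the cat and the dog and the bird")
print(top_k_relative_frequency(sample, k=1))
print(hapax_count(sample))
print(zipf_discrepancy(sample))
```

Under these assumptions, evaluating `zipf_discrepancy` on chunks of increasing length from the same text is one way to look for a size at which the gap is smallest, and comparing `top_k_relative_frequency` across chunks gives a rough check of how stable that share is.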


