Lexical Richness and Text Length: An Entropy-based Perspective
Journal of Quantitative Linguistics (IF 0.761), Pub Date: 2020-06-10, DOI: 10.1080/09296174.2020.1766346
Yaqian Shi, Lei Lei

ABSTRACT

Text length is a major concern in the measurement of lexical richness, and how text length affects lexical richness remains an open question. The present study explores the relation between text length and lexical richness from an entropy-based perspective. Results show a non-linear growth pattern of lexical richness as text length increases. Specifically, lexical richness increases rapidly while texts are short, then soon reaches a boundary point beyond which it stabilizes despite the continued expansion of text length. The boundary point under the Shannon estimation is around 1000 tokens; under the Zhang estimation it is lower and more varied, falling at 500, 800, or 1000 tokens. Such stability may be explained by the stabilization of word probabilities in the text.
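To make the entropy-based measure concrete, the following is a minimal sketch of how entropy could be tracked over growing prefixes of a text. It uses the plain plug-in (maximum-likelihood) Shannon estimate in bits; the function names, the `step` parameter, the log base, and the whitespace tokenization in the usage note are assumptions for illustration, not the study's procedure. The Zhang estimation is not reproduced here.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Plug-in Shannon entropy of a token list, in bits (log base is an assumption)."""
    n = len(tokens)
    if n == 0:
        return 0.0
    counts = Counter(tokens)
    # Relative frequency of each word type serves as its probability estimate.
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def entropy_by_length(tokens, step=100):
    """Hypothetical helper: entropy of each prefix of length step, 2*step, ..."""
    return [
        (k, shannon_entropy(tokens[:k]))
        for k in range(step, len(tokens) + 1, step)
    ]
```

Calling `entropy_by_length(text.lower().split(), step=100)` on a long text yields (length, entropy) pairs; under the pattern the abstract describes, the values would be expected to rise quickly at first and then level off.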




Updated: 2020-06-10