Two halves of a meaningful text are statistically different, Journal of Statistical Mechanics: Theory and Experiment


Journal of Statistical Mechanics: Theory and Experiment (IF 2.2), Pub Date: 2021-03-25, DOI: 10.1088/1742-5468/abe947
Weibing Deng₁, Rongrong Xie₁, Shengfeng Deng₁, Armen E Allahverdyan₂


Which statistical features distinguish a meaningful text (possibly written in an unknown system) from a meaningless set of symbols? Here we answer this question by comparing features of the first half of a text to its second half. This comparison can uncover hidden effects, because the halves have the same values of many parameters (style, genre, etc.). We found that the first half has more distinct words and more rare words than the second half. Also, words in the first half are distributed less homogeneously over the text. These differences hold for a significant majority of several hundred relatively short texts we studied. The differences disappear after a random permutation of words that destroys the linear structure of the text. They reveal a temporal asymmetry in meaningful texts, which is confirmed by showing that texts compress much better in their natural order (i.e. along the narrative) than in word-inverted form. We conjecture that these results connect the semantic organization of a text (defined by the flow of its narrative) to its statistical features.
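The comparisons described in the abstract can be illustrated with a short Python sketch. It is not the authors' code. The tokenizer, the definition of a "rare" word (one that occurs exactly once within its half), the use of zlib as the compressor, and all function names are assumptions made for illustration only; the paper may define and measure these quantities differently.

```python
import random
import zlib
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()[]") for w in text.lower().split())
    return [w for w in words if w]


def half_stats(words):
    """Return [(distinct, rare), (distinct, rare)] for the two halves.

    Assumption: 'rare' means the word occurs exactly once in its half.
    """
    mid = len(words) // 2
    stats = []
    for half in (words[:mid], words[mid:]):
        counts = Counter(half)
        distinct = len(counts)
        rare = sum(1 for c in counts.values() if c == 1)
        stats.append((distinct, rare))
    return stats


def shuffled(words, seed=0):
    """Random permutation of the words: a control that destroys word order."""
    out = list(words)
    random.Random(seed).shuffle(out)
    return out


def compressed_size(words):
    """Bytes needed by zlib (an assumed compressor) for the joined words."""
    return len(zlib.compress(" ".join(words).encode("utf-8"), 9))


if __name__ == "__main__":
    sample = "The cat sat. The dog ran, the cat hid!"  # toy input only
    words = tokenize(sample)
    print("original halves:", half_stats(words))
    print("shuffled halves:", half_stats(shuffled(words)))
    print("forward size:", compressed_size(words))
    print("word-inverted size:", compressed_size(words[::-1]))
```

On a real text, one would compare the two tuples returned by `half_stats` before and after shuffling, and the forward and word-inverted compressed sizes. A toy string like the one above is far too short to show the reported effects.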


Updated: 2021-03-25
