Complex systems approach to natural language
Physics Reports (IF 30.0), Pub Date: 2023-12-22, DOI: 10.1016/j.physrep.2023.12.002
Tomasz Stanisz, Stanisław Drożdż, Jarosław Kwapień

The science of complexity aims to answer the question of what rules nature chooses when assembling the basic constituents of matter and energy into structures and dynamical patterns that cascade through the entire hierarchy of scales in the Universe. A related phenomenon, natural language, can successfully mirror such structures, as reflected by its ability to encode and transmit information about them and among them. It is thus legitimate to expect that natural language carries the essence of complexity. Indeed, in human speech and writing it is particularly true that more is different. Natural language thus deserves a central place in the related quantitative study within the science of complexity.

With this in mind, the present review summarizes the main methodological concepts used in this domain and documents their applicability and utility in identifying universal as well as system-specific features of natural language in its written representation in several major Western languages. In particular, three main complexity-related current research trends in quantitative linguistics are exhaustively covered. The first part addresses the issue of word frequencies in texts and, in particular, demonstrates that taking punctuation into consideration largely restores the scaling whose violation in Zipf's law for the most frequent words is commonly modelled by the so-called Mandelbrot correction. The second part introduces methods inspired by time series analysis, used in studying various kinds of long-range correlations in written texts. The related time series are generated by partitioning the text into sentences or into phrases between consecutive punctuation marks. It turns out that these series develop features often found in signals generated by complex systems: the presence of long-range correlations along with fractal or even multifractal structures. Moreover, it appears that the distances between consecutive punctuation marks, quite universally across languages, comply with the discrete variant of the Weibull distribution, which often appears in survival analysis. In the third part, the application of the network formalism to natural language is reviewed, particularly in the context of word-adjacency networks whose structure reflects word co-occurrence in texts. Various parameters characterizing the topology of such networks can be used for classification of texts, for example, from a stylometric perspective. The network approach can also be applied in semantic analysis to represent a hierarchy of words and associations between them based on their meaning. The structure of such networks turns out to be significantly different from that observed in random networks, revealing genuine properties of language. Finally, punctuation appears to have a significant impact not only on the language's information-carrying ability but also on its key statistical properties, hence it seems recommended to consider punctuation marks on a par with words.
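
As an illustration of the three analyses summarized above, the following Python sketch shows one way to tokenize a text with punctuation marks kept as tokens, rank the tokens by frequency, measure the distances between consecutive punctuation marks, and build a weighted word-adjacency network. It is a minimal sketch and not the authors' implementation: the function names, the punctuation set, the regular expression, and the choice to measure distance as the number of words between marks are all assumptions made here for illustration.

```python
import re
from collections import Counter

# Assumption: this small set of marks stands in for "punctuation".
PUNCTUATION = set(".,;:!?")


def tokenize(text):
    """Split text into lowercase words and single punctuation marks."""
    # [^\W\d_]+ matches runs of Unicode letters; the second branch
    # matches one punctuation mark from the assumed set.
    return re.findall(r"[^\W\d_]+|[.,;:!?]", text.lower())


def rank_frequency(tokens):
    """Return (rank, token, count) triples, most frequent first."""
    counts = Counter(tokens)
    return [
        (rank, token, count)
        for rank, (token, count) in enumerate(counts.most_common(), start=1)
    ]


def punctuation_distances(tokens):
    """Return the number of words between consecutive punctuation marks.

    Assumption: directly adjacent marks (zero words in between) are
    skipped, and words after the last mark are not counted.
    """
    distances = []
    run = 0
    for token in tokens:
        if token in PUNCTUATION:
            if run > 0:
                distances.append(run)
            run = 0
        else:
            run += 1
    return distances


def adjacency_edges(tokens):
    """Return weighted undirected edges between neighbouring tokens.

    Punctuation marks are treated as nodes on a par with words.
    Self-loops (a token followed by itself) are ignored.
    """
    edges = Counter()
    for left, right in zip(tokens, tokens[1:]):
        if left != right:
            edges[frozenset((left, right))] += 1
    return edges


if __name__ == "__main__":
    sample = "The cat sat, and the dog sat. The cat ran!"
    tokens = tokenize(sample)
    print(rank_frequency(tokens)[:3])
    print(punctuation_distances(tokens))
    print(adjacency_edges(tokens).most_common(3))
```

On a real corpus, the rank-frequency triples could be plotted on doubly logarithmic axes to inspect the scaling, the distance list could be compared with a fitted discrete Weibull distribution, and the edge counter could be loaded into a graph library to compute topological parameters.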




Updated: 2023-12-24