A study on Chinese register characteristics based on regression analysis and text clustering
Corpus Linguistics and Linguistic Theory (IF 2.143), Pub Date: 2019-05-27, DOI: 10.1515/cllt-2016-0062
Renkui Hou, Chu-Ren Huang, Hongchao Liu

Abstract: This paper reports an innovative Chinese register study based on regression analysis of sentence length distributions and on text clustering. Although the end of a sentence is not conventionally marked in Chinese, we resolve this issue by assuming that segments between periods, question marks, and exclamation marks are sentences, which can be further divided into simple sentences and compound sentences. We also assume that segments between punctuation marks that express pauses in utterances form sentences (i.e., clauses). Using regression analysis, we find that the frequency distribution of sentence and clause lengths in Chinese can be fitted by the formula $F = aL^{b}c^{L}$, where $F$ is the frequency, $L$ is the sentence/clause length, and $a$, $b$, and $c$ are the fitted parameters. Texts from different registers give rise to different fitted parameter values, so these values can serve to differentiate the registers. Finally, we use these parameters to represent and cluster texts from different registers. The successful clustering results further confirm that the fitted parameters are reliable linguistic characteristics of different registers. In terms of linguistic theory, our study shows that it is just as effective to model sentence length in Chinese using sociological words (i.e., characters) as it is using linguistic words.
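The segmentation and fitting steps described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the clause delimiter set, the function names, and the choice to fit by ordinary least squares on the log-transformed formula (log F = log a + b log L + L log c) are all assumptions made here, since the abstract only says that regression analysis was used. Lengths are counted in characters.

```python
import re
from collections import Counter

import numpy as np

# Sentence delimiters named in the abstract: period, question mark, exclamation mark.
SENTENCE_END = "。？！"
# Assumption: the "pause" punctuation for clauses; the abstract does not list the marks.
CLAUSE_END = SENTENCE_END + "，；："


def segment_lengths(text, delimiters):
    """Split text at any delimiter and return segment lengths in characters."""
    pattern = "[" + re.escape(delimiters) + "]"
    segments = (s.strip() for s in re.split(pattern, text))
    return [len(s) for s in segments if s]


def length_frequencies(lengths):
    """Return (L, F): the observed lengths and how often each one occurs."""
    counts = Counter(lengths)
    observed = sorted(counts)
    L = np.array(observed, dtype=float)
    F = np.array([counts[n] for n in observed], dtype=float)
    return L, F


def fit_parameters(L, F):
    """Fit F = a * L**b * c**L by linear least squares in log space.

    Needs at least three distinct lengths with non-zero frequency.
    """
    design = np.column_stack([np.ones_like(L), np.log(L), L])
    coef, *_ = np.linalg.lstsq(design, np.log(F), rcond=None)
    log_a, b, log_c = coef
    return float(np.exp(log_a)), float(b), float(np.exp(log_c))
```

For a single text, `fit_parameters(*length_frequencies(segment_lengths(text, SENTENCE_END)))` would give one (a, b, c) triple. Collecting such triples across texts yields the kind of low-dimensional representation that could then be passed to a standard clustering method, although the abstract does not say which clustering algorithm the authors used.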


Updated: 2019-05-27
