Comparing Lexical Bundles across Corpora of Different Sizes: The Zipfian Problem
Journal of Quantitative Linguistics (IF 0.7), Pub Date: 2019-02-05, DOI: 10.1080/09296174.2019.1566975
Yves Bestgen

ABSTRACT

Formulaic sequences in language use are often studied by automatically identifying frequently recurring word sequences, commonly referred to as ‘lexical bundles’, in corpora that contrast different registers, academic disciplines, etc. As corpora often differ in size, a critically important assumption in this field is that using a normalized frequency threshold, such as 20 occurrences per million words, allows for an accurate comparison of corpora of different sizes. Yet several researchers have argued that normalization may be unreliable when applied to frequency thresholds. This study investigates the issue by comparing the number of lexical bundles identified in corpora that differ only in size. Using two complementary random sampling procedures, subcorpora of 100,000 to two million words were extracted from five corpora, and lexical bundles were identified in them using two normalized frequency thresholds and two dispersion thresholds. The results show that many more lexical bundles are identified in smaller subcorpora than in larger ones. This size effect can be related to the Zipfian nature of the distribution of words and word sequences in corpora. The conclusion discusses several solutions for avoiding the unfairness of comparing lexical bundles identified in corpora of different sizes.
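
To make the thresholding procedure concrete, the following is a minimal Python sketch, not taken from the paper, of how lexical bundles might be identified with a normalized frequency threshold and a dispersion threshold. The function name, the 4-word bundle length, and the dispersion criterion of five texts are illustrative assumptions rather than the paper's exact settings.

```python
from collections import Counter, defaultdict

def extract_lexical_bundles(texts, n=4, freq_per_million=20, min_texts=5):
    """Identify n-grams ('lexical bundles') that meet a normalized
    frequency threshold and a dispersion threshold.

    texts: list of token lists, one per text in the corpus (hypothetical input format).
    freq_per_million: normalized frequency threshold, in occurrences per
        million words, converted below to a raw count for this corpus size.
    min_texts: dispersion threshold -- the bundle must occur in at least
        this many different texts (an illustrative value, not the paper's).
    """
    ngram_freq = Counter()           # total occurrences of each n-gram
    ngram_texts = defaultdict(set)   # texts in which each n-gram occurs
    corpus_size = 0

    for text_id, tokens in enumerate(texts):
        corpus_size += len(tokens)
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            ngram_freq[gram] += 1
            ngram_texts[gram].add(text_id)

    # Convert the normalized threshold to a raw count for this corpus:
    # 20 per million is a raw count of 2 in a 100,000-word corpus
    # but 40 in a 2-million-word corpus.
    raw_threshold = freq_per_million * corpus_size / 1_000_000

    return [gram for gram, f in ngram_freq.items()
            if f >= raw_threshold and len(ngram_texts[gram]) >= min_texts]
```

The conversion from a normalized to a raw threshold is where the size effect can enter: 20 per million corresponds to a raw count of only 2 in a 100,000-word subcorpus but 40 in a 2-million-word one, and given the Zipfian distribution of word sequences, far more bundle types clear the lower raw count, which is consistent with the size effect the abstract reports.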


