Using the parameters of the Zipf–Mandelbrot law to measure diachronic lexical, syntactical and stylistic changes – a large-scale corpus analysis

Alexander Koplenig

doi:10.1515/cllt-2014-0049

Published by De Gruyter Mouton April 7, 2018

From the journal Corpus Linguistics and Linguistic Theory

https://doi.org/10.1515/cllt-2014-0049

Abstract

Using the Google Ngram Corpora for six different languages (including two varieties of English), a large-scale time series analysis is conducted. It is demonstrated that diachronic changes of the parameters of the Zipf–Mandelbrot law (and the parameter of the Zipf law, all estimated by maximum likelihood) can be used to quantify and visualize important aspects of linguistic change (as represented in the Google Ngram Corpora). The analysis also reveals that there are important cross-linguistic differences. It is argued that the Zipf–Mandelbrot parameters can be used as a first indicator of diachronic linguistic change, but more thorough analyses should make use of the full spectrum of different lexical, syntactical and stylometric measures to fully understand the factors that actually drive those changes.

Keywords: Zipf’s law; Zipf–Mandelbrot law; power law; lexical richness; vocabulary size; type token ratio; syntactic complexity; noun–pronoun ratio; diachronic corpus linguistics; time series analysis; Google Ngram Corpora

Acknowledgments

I thank Julia Kaiser for helping me to check the validity of the corpus cleaning procedure, especially of the French, Italian and Spanish data. I thank my colleague Sascha Wolfer for helping me write an R script to call R from inside Stata and for many helpful discussions regarding the topics presented in this paper. I thank Stefan Engelberg, Carolin Müller-Spitzer, Peter Meyer, Sarah Signer and Sascha Wolfer (again) for (proof-)reading the draft version of this paper and for many helpful discussions. I also would like to thank one anonymous reviewer, whose comments definitely improved this paper. All remaining errors are mine.

Appendix

A.1 Table of the correlations between all investigated variables and the corpus size (first differences in each case).

Table 4:

Correlations of the corpus size with all investigated variables (year-to-year changes) for all investigated languages.

Correlation between corpus size and	British English	American English	French	German	Italian	Spanish
Zipf alpha	0.020	0.072	0.148	0.056	0.129	0.102
Zipf–Mandelbrot alpha	0.142	0.048	0.096	0.063	0.174	0.113
Zipf–Mandelbrot beta	0.173	0.034	0.056	0.060	0.212	0.084
vocabulary size	−0.002	−0.007	0.059	0.028	−0.133	−0.067
mean sentence length	0.025	0.040	−0.005	0.000	−0.005	0.054
noun–pronoun ratio	−0.105	−0.096	−0.099	0.034	−0.045	−0.132
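
The correlations in Table 4 are computed on first differences, that is, on year-to-year changes rather than on the raw levels of each series. A minimal Python sketch of that idea follows; the function names and the yearly values are hypothetical and serve only as an illustration, not as a reproduction of the paper's Stata workflow or data.

```python
from math import sqrt

def first_differences(series):
    """Year-to-year changes: x[t] - x[t-1]."""
    return [b - a for a, b in zip(series, series[1:])]

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical yearly values, for illustration only (not taken from the paper).
corpus_size = [100, 120, 150, 160, 200, 260]
zipf_alpha = [1.02, 1.01, 1.03, 1.02, 1.04, 1.03]

# Correlate the changes, not the levels.
r = pearson(first_differences(corpus_size), first_differences(zipf_alpha))
print(round(r, 3))
```

Differencing before correlating is a common way to avoid spurious correlations between trending time series (cf. Granger & Newbold 1974).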

A.2 ML estimation of the parameters of the Zipf law and the Zipf–Mandelbrot law

Since the Zipf law is just a special case of the Zipf–Mandelbrot law (ZM) with β=0, the following description focusses on the maximum likelihood fit of the ZM law, while the Stata code presented below includes both options.

In what follows, the observations (that is, the word types) are assumed to be conditionally independent, so the log-likelihood satisfies the linear-form restriction. In Stata, one then only has to specify the log-likelihood function for a single observation; Stata evaluates this function for every observation and sums the results. Following Baixeries et al. (2013), the likelihood function for a single word type with rank r and corresponding frequency f_r can be defined as:

[13] $l_r = p_r^{f_r}$

Using the definition presented in eq. [10] and taking logs on both sides this can be rewritten as:

[14] $\log(l_r) = -\alpha \cdot f_r \cdot \log(r+\beta) - f_r \cdot \log \sum_{r=1}^{N} (r+\beta)^{-\alpha}$

A Stata module to fit the one parameter of the Zipf distribution or the two parameters of the Zipf–Mandelbrot distribution by maximum likelihood is available online (Koplenig 2014).
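
The per-observation log-likelihood in eq. [14] can be illustrated with a short Python sketch. This is not the ZIPFFIT Stata module: the function names, the example frequencies and the coarse grid search (used here in place of a proper numerical optimizer) are all assumptions made for illustration.

```python
from math import log

def zm_log_likelihood(freqs, alpha, beta=0.0):
    """Sum of eq. [14] over all ranks.

    freqs[i] is the frequency of the word type with rank i + 1.
    Setting beta = 0 gives the Zipf law as a special case.
    """
    n_types = len(freqs)
    # Log of the normalising sum over all ranks 1..N.
    log_norm = log(sum((r + beta) ** -alpha for r in range(1, n_types + 1)))
    return sum(
        -alpha * f * log(r + beta) - f * log_norm
        for r, f in enumerate(freqs, start=1)
    )

def grid_search(freqs, alphas, betas):
    """Return the (alpha, beta) pair on the grid with the highest log-likelihood."""
    return max(
        ((a, b) for a in alphas for b in betas),
        key=lambda ab: zm_log_likelihood(freqs, ab[0], ab[1]),
    )

# Hypothetical rank-frequency list, sorted in descending order of frequency.
freqs = [1000, 480, 330, 260, 190, 170, 140, 120, 110, 100]
alphas = [0.5 + 0.01 * i for i in range(151)]  # candidate values 0.50 .. 2.00
betas = [0.0, 0.5, 1.0, 2.0, 5.0]
alpha_hat, beta_hat = grid_search(freqs, alphas, betas)
print(alpha_hat, beta_hat)
```

Restricting `betas` to `[0.0]` fits the one-parameter Zipf law. For real data a gradient-based optimizer would be preferable to a grid, since the grid only finds the best candidate among the values supplied.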

A.3 Additional results

The parameter of the Zipf distribution and the two parameters of the Zipf–Mandelbrot distribution as a function of time.

Figure 8:

The parameter of the Zipf distribution (α_Zipf) and the two parameters of the Zipf–Mandelbrot (α_ZM and β_ZM) modification as a function of time.

Correlation analysis of the parameter of the Zipf distribution and the two parameters of the Zipf–Mandelbrot distribution with the three indicators.

Figure 9:

Coefficients of determination (left side) and partial coefficients of determination (right side) between year-to-year changes of α_ZIPF (cranberry), α_ZM (emerald), β_ZM (mint) and year-to-year changes of the vocabulary size (plot A), the noun–pronoun ratio (plot B) and the mean sentence length (plot C) for all six investigated languages.
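
As a rough illustration of the difference between a coefficient of determination and a partial coefficient of determination, the following Python sketch computes both for a single control variable. Which variables are partialled out in Figure 9 is not stated in this section, so the single-control setup, the function names and the data are assumptions.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(
        sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)
    )

def r2(x, y):
    """Coefficient of determination for a simple linear relationship."""
    return pearson(x, y) ** 2

def partial_r2(x, y, z):
    """Squared partial correlation of x and y, controlling for one variable z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    partial = (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return partial ** 2

# Hypothetical year-to-year changes, for illustration only.
d_alpha = [0.01, -0.02, 0.03, 0.00, -0.01, 0.02]
d_vocab = [120, -80, 200, 10, -40, 90]
d_sentence_length = [0.1, 0.0, -0.2, 0.1, 0.3, -0.1]

print(r2(d_alpha, d_vocab))
print(partial_r2(d_alpha, d_vocab, d_sentence_length))
```

With more than one control variable, the partial coefficient would instead be obtained by comparing the residual variance of regressions with and without the predictor of interest.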

Fitting a power law distribution

Figure 10:

ML estimation of the parameter of a power law as a function of time. This analysis used the method presented in Clauset et al. (2007) and the corresponding plfit R script developed by Dubroca (2011). Cranberry lines – time series of the α exponent. Emerald lines – time series of the minimum x value. The dotted pink lines mark the years 1918 and 1945. The ρ-values on the bottom left side of each plot report the correlation values of Δα_f with Δx_min. All time series smoothed with a symmetric 5-year moving window.
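
The smoothing mentioned in the caption can be sketched as a centred moving average. The Python function below is a minimal illustration; its name is hypothetical, and the handling of the first and last two years (a shrinking window) is an assumption, since the caption does not say how the boundaries were treated.

```python
def moving_average(series, window=5):
    """Centred moving average; the window shrinks at the series boundaries."""
    if window % 2 == 0:
        raise ValueError("window must be odd to be symmetric")
    half = window // 2
    smoothed = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        chunk = series[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Hypothetical yearly estimates of the exponent, for illustration only.
alpha_by_year = [2.10, 2.14, 2.08, 2.20, 2.16, 2.12, 2.18]
print(moving_average(alpha_by_year))
```

A symmetric window uses the same number of years before and after each point, so the smoothed series is not shifted in time relative to the raw one.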

Figure 11:

Coefficients of determination (left side) and partial coefficients of determination (right side) between year-to-year changes of the power law exponent (estimated using the method presented in Clauset et al. (2007)) and year-to-year changes of the vocabulary size (orange), the noun–pronoun ratio (blue) and the mean sentence length (gray) for all six investigated languages.

Figure 12:

Coefficients of determination (left side) and partial coefficients of determination (right side) between year-to-year changes of α_ZM and year-to-year changes of the vocabulary size (orange), the noun–pronoun ratio (blue) and the mean sentence length (gray) for all six investigated languages. Word types with a frequency of less than two were excluded from this analysis.

Figure 13:

The noun–pronoun ratio for three different English GNg Corpora

Figure 14:

Time series of the noun–pronoun ratio for English Fiction (blue), British English (red) and American English (green). All time series smoothed with a symmetric 5-year moving window.

References

Baixeries, Jaume, Brita Elvevåg & Ramon Ferrer-i-Cancho. 2013. The evolution of the exponent of Zipf’s law in language ontogeny. Satoru Hayasaka (ed.). PLoS ONE 8(3). e53227. doi:10.1371/journal.pone.0053227 (accessed 10 March 2014).

Baroni, Marco. 2009. Distributions in text. In Anke Lüdeling & Merja Kytö (eds.), Corpus linguistics: An international handbook, 803–821. (Handbücher zur Sprach- und Kommunikationswissenschaft = Handbooks of Linguistics and Communication Science, Bd. 29.2) Berlin; New York: Walter de Gruyter.

Baum, Christopher F. & Nicholas Cox. 2005. MVCORR: Stata module to generate moving-window correlation or autocorrelation in time series or panel. http://ideas.repec.org/c/boc/bocode/s438801.html (accessed 1 September 2014).

Becketti, Sean. 2013. Introduction to time series using Stata, 1st edn. College Station, TX: Stata Press.

Bentley, R. Alexander, Alberto Acerbi, Paul Ormerod & Vasileios Lampos. 2014. Books average previous decade of economic misery. Matjaž Perc (ed.). PLoS ONE 9(1). e83147. doi:10.1371/journal.pone.0083147 (accessed 10 March 2014).

Bentz, Christian, Douwe Kiela, Felix Hill & Paula Buttery. 2014a. Zipf’s law and the grammar of languages: A quantitative study of old and modern English parallel texts. Corpus Linguistics and Linguistic Theory 10(2). doi:10.1515/cllt-2014-0009

Bentz, Christian, Annemarie Verkerk, Douwe Kiela, Felix Hill & Paula Buttery. 2014b. Adaptive languages: Modeling the co-evolution of population structure and lexical diversity (submitted). http://www.christianbentz.de/Papers/Bentz%20et%20al.%20(submitted)%20Adaptive%20Languages.pdf (accessed 8 September 2014).

Biber, Douglas. 1991. Variation across speech and writing. Cambridge [England]; New York: Cambridge University Press.

Biber, Douglas & Edward Finegan. 1989. Drift and the evolution of English style: A history of three genres. Language 65(3). 487. doi:10.2307/415220 (accessed 1 July 2014).

Biber, Douglas & Bethany Gray. 2013. Being specific about historical change: The influence of sub-register. Journal of English Linguistics. doi:10.1177/0075424212472509 http://eng.sagepub.com/cgi/doi/10.1177/0075424212472509 (accessed 14 April 2014).

Biber, Douglas, Stig Johansson, Geoffrey N. Leech, Susan Conrad & Edward Finegan. 1999. Longman grammar of spoken and written English. Harlow, England; [New York]: Longman.

Chatfield, Christopher. 2004. The analysis of time series: An introduction, 6th edn. (Texts in Statistical Science). Boca Raton, FL: Chapman & Hall/CRC.

Clauset, Aaron, Cosma Rohilla Shalizi & M. E. J. Newman. 2009. Power-law distributions in empirical data. SIAM Review 51(4). 661–703. doi:10.1137/070710111 (accessed 10 September 2014).

Clauset, A., M. Young & K. S. Gleditsch. 2007. On the frequency of severe terrorist events. Journal of Conflict Resolution 51(1). 58–87. doi:10.1177/0022002706296157 (accessed 10 September 2014).

Corral, Alvaro, Gemma Boleda & Ramon Ferrer-i-Cancho. 2014. Zipf’s law for word frequencies: word forms versus lemmas in long texts. http://arxiv.org/abs/1407.8322v1 (accessed 1 October 2014).

Dubroca, Laurent. 2011. PLFIT. http://tuvalu.santafe.edu/~aaronc/powerlaws/plfit.r (accessed 12 September 2014).

Ehret, Katharina & Benedikt Szmrecsanyi. 2015 in press. An information-theoretic approach to assess linguistic complexity. In Raffaela Baechler & Guido Seiler (eds.), Complexity and isolation. Berlin: De Gruyter. http://www.benszm.net/omnibuslit/EhretSzmrecsanyi_web.pdf (accessed 19 January 2015). doi:10.1515/9783110348965-004

Frank, Stefan L. & Robin L. Thompson. 2012. Early effects of word surprisal on pupil size during reading. In Naomi Miyake, David Peebles & Richard P. Cooper (eds.), Proceedings of the 34th Annual Conference of the Cognitive Science Society, 1554–1559. Austin, TX: Cognitive Science Society.

Goldstein, Michel L., Steven A. Morris & Gary G. Yen. 2004. Problems with fitting to the power-law distribution. The European Physical Journal B 41(2). 255–258. doi:10.1140/epjb/e2004-00316-5 (accessed 10 April 2015).

Granger, C. W. J. & P. Newbold. 1974. Spurious regressions in econometrics. Journal of Econometrics 2(2). 111–120. doi:10.1016/0304-4076(74)90034-7 (accessed 23 June 2014).

Hamilton, Lawrence C. 2013. Statistics with Stata: Updated for version 12, 8th edn. Boston, MA: Brooks/Cole, Cengage Learning.

Hill, R. Carter. 2008. Principles of econometrics, 3rd edn. (accompanying website). http://www.principlesofeconometrics.com/poe3/poe3do_files/figure12-2.do (accessed 23 June 2014).

Hilpert, M. & S. Th. Gries. 2009. Assessing frequency changes in multistage diachronic corpora: Applications for historical corpus linguistics and the study of language acquisition. Literary and Linguistic Computing 24(4). 385–401. doi:10.1093/llc/fqn012 (accessed 13 January 2015).

Juola, Patrick. 2008. Assessing linguistic complexity. In Matti Miestamo, Kaius Sinnemäki & Fred Karlsson (eds.), Language complexity: Typology, contact, change (Studies in Language Companion Series v. 94). Amsterdam; Philadelphia: John Benjamins Pub. Co. doi:10.1075/slcs.94.07juo

Juola, Patrick. 2013. Using the Google N-gram corpus to measure cultural complexity. Literary and Linguistic Computing 28(4). 668–675. doi:10.1093/llc/fqt017 (accessed 8 April 2014).

Kilgarriff, Adam. 1997. Putting frequencies in the dictionary. International Journal of Lexicography 10(2). 135–155. doi:10.1093/ijl/10.2.135

Kilgarriff, Adam. 2001. Comparing Corpora. International Journal of Corpus Linguistics 6(1). 97–133. doi:10.1075/ijcl.6.1.05kil (accessed 19 May 2014).

Koplenig, Alexander. 2015. The impact of lacking metadata for the measurement of cultural and linguistic change using the Google Ngram datasets – Reconstructing the composition of the German corpus in times of WWII. doi:10.1093/llc/fqv037

Koplenig, Alexander. 2014. ZIPFFIT: Stata module to fit the Zipf distribution or the Zipf-Mandelbrot distribution by maximum likelihood. http://ideas.repec.org/c/boc/bocode/s457872.html (accessed 11 August 2014).

Kupietz, Marc, Cyril Belica, Holger Keibel & Andreas Witt. 2010. The German reference corpus DeReKo: A primordial sample for linguistic research. In Nicoletta Calzolari, Daniel Tapias, Mike Rosner, Stelios Piperidis, Jan Odijk, Joseph Mariani, Bente Maegaard & Khalid Choukri (eds.), Proceedings of the Seventh Conference on International Language Resources and Evaluation. International Conference on Language Resources and Evaluation (LREC-10), 1848–1854. Valletta, Malta: European Language Resources Association (ELRA).

Labov, William. 1994. Principles of linguistic change. (Language in Society 20) Oxford, UK; Cambridge [Mass]: Blackwell.

Lin, Yuri, Jean-Baptiste Michel, Erez Lieberman Aiden, Jon Orwant, Will Brockmann & Slav Petrov. 2012. Syntactic Annotations for the Google Books Ngram Corpus. Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, 169–174. Jeju, Republic of Korea.

MacWhinney, Brian. 2014. The Childes Project: Tools for Analyzing Talk, Volume II: the Database. (Tools for Analyzing Talk). London: Routledge Chapman & Hall. http://www.amazon.com/The-Childes-Project-Analyzing-Database/dp/1138003492/ref=tmm_pap_title_0?ie=UTF8&qid=1403337096&sr=1-12 (accessed 21 June 2014).

Mair, Christian, Marianne Hundt, Geoffrey N. Leech & Nicholas Smith. 2002. Short term diachronic shifts in part-of-speech frequencies: A comparison of the tagged LOB and F-LOB corpora. International Journal of Corpus Linguistics 7(2). 245–264. doi:10.1075/ijcl.7.2.05mai (accessed 21 July 2014).

Mandelbrot, Benoît. 1953. An informational theory of the statistical structure of language. In Willis Jackson (ed.), Communication theory, 468–502. London: Butterworths Scientific Publications.

Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, et al. 2010a. Quantitative analysis of culture using millions of digitized books. Science 331(14). 176–182. [online pre-print: 1–12] doi:10.1126/science.1199644.

Michel, Jean-Baptiste, Yuan Kui Shen, Aviva Presser Aiden, Adrian Veres, Matthew K. Gray, The Google Books Team, Joseph P. Pickett, et al. 2010b. Quantitative analysis of culture using millions of digitized books (Supporting Online Material). Science 331(14). doi:10.1126/science.1199644. http://www.sciencemag.org/content/early/2010/12/15/science.1199644/suppl/DC1 (accessed 5 March 2014).

Montemurro, Marcelo A. & Damián H. Zanette. 2011. Universal entropy of word ordering across linguistic families. Michael Breakspear (ed.). PLoS ONE 6(5). e19875. doi:10.1371/journal.pone.0019875 (accessed 19 January 2015).

Murray, Michael P. 1994. A drunk and her dog: An illustration of cointegration and error correction. The American Statistician 48(1). 37–39. doi:10.1080/00031305.1994.10476017

Newman, M. E. J. 2005. Power laws, Pareto distributions and Zipf’s law. Contemporary Physics 46(5). 323–351. doi:10.1080/00107510500052444 (accessed 10 September 2014).

Phillips, Peter C. B. & Pierre Perron. 1988. Testing for a unit root in time series regression. Biometrika 75(2). 335–346. doi:10.1093/biomet/75.2.335 (accessed 12 May 2014).

Piantadosi, Steven T. 2014. Zipf’s word frequency law in natural language: A critical review and future directions. Psychonomic Bulletin & Review. doi:10.3758/s13423-014-0585-6 http://link.springer.com/10.3758/s13423-014-0585-6 (accessed 2 May 2014).

Piantadosi, S. T., H. Tily & E. Gibson. 2011. Word lengths are optimized for efficient communication. Proceedings of the National Academy of Sciences 108(9). 3526–3529. doi:10.1073/pnas.1012551108 (accessed 21 January 2015).

Ramisch, Carlos. 2014. Multiword expressions acquisition: A generic and open framework. New York: Springer. doi:10.1007/978-3-319-09207-2

Säily, Tanja, Terttu Nevalainen & Harri Siirtola. 2011. Variation in noun and pronoun frequencies in a sociohistorical corpus of English. Literary and Linguistic Computing 26(2). 167–188. doi:10.1093/llc/fqr004 (accessed 1 July 2014).

StataCorp. 2011. Stata multivariate statistics reference manual. Release 12. College Station, TX: StataCorp LP.

Szmrecsanyi, Benedikt. 2004. On operationalizing syntactic complexity. In Gérard Purnelle, Cédrick Fairon & Anne Dister (eds.), Le poids des mots. Proceedings of the 7th International Conference on Textual Data Statistical Analysis 2. 1032–1039. Louvain-la-Neuve: Presses universitaires de Louvain.

Szmrecsanyi, Benedikt. 2014. About text frequencies in historical linguistics: disentangling environmental and grammatical change. Corpus Linguistics and Linguistic Theory. http://www.benszm.net/omnibuslit/Szmrecsanyi_CH_web.pdf (accessed 8 September 2014). doi:10.1515/cllt-2015-0068

Tweedie, Fiona J. & R. Harald Baayen. 1998. How variable may a constant be? Measures of lexical richness in perspective. Computers and the Humanities 32(5). 323–352. doi:10.1023/A:1001749303137

Wasow, Thomas. 1997. Remarks on grammatical weight. Language Variation and Change 9(01). 81. doi:10.1017/S0954394500001800 (accessed 29 June 2014).

Westin, Ingrid. 2002. Language change in English newspaper editorials. Amsterdam; New York, NY: Rodopi. doi:10.1163/9789004334007

Yang, Charles. 2013. Ontogeny and phylogeny of language. PNAS 110(16). 6324–6327. http://www.pnas.org/content/early/2013/03/27/1216803110 (accessed 21 June 2014). doi:10.1073/pnas.1216803110

Young, Derek S. 2010. Tolerance: An R package for estimating tolerance intervals. Journal of Statistical Software 36(5). 1–39. doi:10.18637/jss.v036.i05

Zipf, George Kingsley. 1935. The psycho-biology of language; an introduction to dynamic philology. Boston: Houghton Mifflin Company.

Zipf, George Kingsley. 2012. Human behavior and the principle of least effort: an introduction to human ecology. Mansfield Centre, CT: Martino Pub.

Published Online: 2018-4-7

Published in Print: 2018-4-25
