Beyond lexical frequencies: using R for text analysis in the digital humanities

Original Paper, Language Resources and Evaluation

Abstract

This paper presents a combination of R packages—user-contributed toolkits written in a common core programming language—to facilitate the humanistic investigation of digitised, text-based corpora. Our survey of text analysis packages includes those of our own creation (cleanNLP and fasttextM) as well as packages built by other research groups (stringi, readtext, hyphenatr, quanteda, and hunspell). By operating on generic object types, these packages unite research innovations in corpus linguistics, natural language processing, machine learning, statistics, and digital humanities. We begin by elaborating on the theoretical benefits of R as a glue language for bringing together several areas of expertise, comparing it to linguistic concordancers and other tool-based approaches to text analysis in the digital humanities. We then showcase the practical benefits of an ecosystem by illustrating how R packages have been integrated into a digital humanities project. Throughout, the focus is on moving beyond the bag-of-words, lexical frequency model by incorporating linguistically-driven analyses in research.
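
To make this package-based workflow concrete, the sketch below (a minimal illustration, not the replication code linked in note 5) reads a directory of plain-text files with readtext and annotates it with cleanNLP; the "texts/" directory is an assumed example, and the cnlp_* calls follow a recent version of the cleanNLP interface.

    library(readtext)
    library(cleanNLP)

    docs <- readtext("texts/*.txt")   # one row per document, with doc_id and text columns
    cnlp_init_udpipe()                # initialise a backend that needs no external Java or Python
    anno <- cnlp_annotate(docs$text)  # tokenise, lemmatise, tag parts of speech, and parse
    head(anno$token)                  # token table: lemmas, POS tags, dependency relations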

Notes

  1. The environment can be saved so that the research may still be reproduced even after packages are discontinued and R versions have changed; for details on doing this see Ushey et al. (2016) and the short sketch following these notes.

  2. A press release from Oracle in 2012, http://www.oracle.com/us/corporate/press/1515738, estimates that there were at least 2 million users of R. By metrics such as Google searches, downloads, and blog posts, this number has continued to grow over the past 5 years (Hornik et al. 2017).

  3. Stefan Gries devised an R script, implementing the function exact.matches(), that effectively turns R into a concordancer (Gries 2009); a comparable keyword-in-context lookup is sketched after these notes.

  4. See Ballier (forthcoming).

  5. A full set of code and data for replication can be found at https://github.com/statsmaths/beyond-lexical-frequencies.

  6. In this case, Wikipedia includes an infobox at the top of the page listing the country’s capital city. Our analysis ignores this box, using only the raw text to illustrate how information can be extracted from completely unstructured text.

  7. Interestingly, this was not always in Virginia. When the original string was, for example, ‘Dallas, Texas’, all three locations pointed to the Dallas in Texas regardless of what was appended to the end.

  8. In the past year the udpipe package has done an admirable job of extending lemmatisation and dependency parsing to a larger set of target languages (Wijffels 2018); see the short example after these notes.

  9. Mutatis mutandis, this also applies to corpus linguistics, where Stefan Gries has advocated more complex modelling of L1 to investigate L2 production, promoting what he calls MuPDAR (Multifactorial Prediction and Deviation Analysis Using R; Gries and Deshors 2014).
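
For note 1, a minimal sketch of freezing a project's package environment with packrat (Ushey et al. 2016); the calls follow the package's documented interface, and the lockfile contents will of course depend on the project.

    install.packages("packrat")
    packrat::init()        # create a project-private package library
    packrat::snapshot()    # record exact package versions in packrat.lock
    # later, on another machine or after upgrades:
    packrat::restore()     # reinstall the recorded versions from the lockfile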
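
For note 3, Gries's exact.matches() itself is not reproduced here; the snippet below is an illustrative analogue of the keyword-in-context output a concordancer returns, using quanteda's kwic() on an invented two-document corpus.

    library(quanteda)
    toks <- tokens(c(d1 = "The capital of Texas is Austin.",
                     d2 = "Dallas, Texas is not the capital."))
    kwic(toks, pattern = "capital", window = 3)   # keyword in context, three tokens either side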
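
For note 8, a hedged example of lemmatisation and dependency parsing in a language other than English with udpipe (Wijffels 2018); the model name and output columns follow the package documentation, and the sentence is invented.

    library(udpipe)
    m   <- udpipe_download_model(language = "french")   # fetch a pre-trained Universal Dependencies model
    ud  <- udpipe_load_model(m$file_model)
    ann <- udpipe_annotate(ud, x = "Paris est la capitale de la France.")
    head(as.data.frame(ann)[, c("token", "lemma", "upos", "dep_rel")])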

References

  • Allaire, J., Cheng, J., Xie, Y., McPherson, J., Chang, W., Allen, J., Wickham, H., Atkins, A., Hyndman, R., & Arslan, R. (2017). rmarkdown: Dynamic documents for R. R package version 1.6. https://cran.r-project.org/package=rmarkdown.

  • Anthony, L. (2004). Antconc: A learner and classroom friendly, multi-platform corpus analysis toolkit. In Proceedings of IWLeL (pp. 7–13).

  • Anthony, L. (2013). A critical look at software tools in corpus linguistics. Linguistic Research, 30(2), 141–161.

  • Arnold, T., & Benoit, K. (2017). tif: Text interchange format. R package version 0.2. https://github.com/ropensci/tif/.

  • Arnold, T., Lissón, P., & Ballier, N. (2017). fasttextM: Work with bilingual word embeddings. R package version 0.0.1. https://github.com/statsmaths/fasttextM/.

  • Arnold, T. (2017). A tidy data model for natural language processing using cleannlp. The R Journal, 9(2), 1–20.

  • Arnold, T., & Tilton, L. (2015). Humanities data in R. New York: Springer.

  • Baayen, R. H. (2008). Analyzing linguistic data: A practical introduction to statistics using R. Cambridge: Cambridge University Press.

  • Baglama, J., Reichel, L., & Lewis, B. W. (2017). irlba: Fast truncated singular value decomposition and principal components analysis for large dense and sparse matrices. R package version 2.2.1. https://cran.r-project.org/package=irlba.

  • Ballier, N., & Lissón, P. (2017). R-based strategies for DH in English Linguistics: A case study. In Bockwinkel, P., Declerck, T., Kübler, S., Zinsmeister, H. (eds), Proceedings of the Workshop on Teaching NLP for Digital Humanities, CEUR Workshop Proceedings, Berlin, Germany (Vol. 1918, pp. 1–10). http://ceur-ws.org/Vol-1918/ballier.pdf.

  • Ballier, N. (2016). R, pour un écosystème du traitement des données? L’exemple de la linguistique. In P. Caron (Ed.), Données, Métadonnées des corpus et catalogage des objets en sciences humaines et sociales. Rennes: Presses universitaires de Rennes.

  • Bastian, M., Heymann, S., Jacomy, M., et al. (2009). Gephi: An open source software for exploring and manipulating networks. International Conference on Web and Social Media, 8, 361–362.

  • Becker, R. A., & Chambers, J. M. (1984). S: An interactive environment for data analysis and graphics. Boca Raton: CRC Press.

  • Bécue-Bertaut, M., & Lebart, L. (2018). Analyse textuelle avec R. Rennes: Presses universitaires de Rennes.

  • Benoit, K., & Matsuo, A. (2017). spacyr: R Wrapper to the spaCy NLP Library. R package version 0.9.0. https://cran.r-project.org/package=spacyr.

  • Benoit, K., & Obeng, A. (2017). readtext: Import and handling for plain and formatted text files. R package version 0.50. https://cran.r-project.org/package=readtext.

  • Benoit, K., Watanabe, K., Nulty, P., Obeng, A., Wang, H., Lauderdale, B., & Lowe, W. (2017). Quanteda: Quantitative analysis of textual data. R package version 0.99.9. https://cran.r-project.org/package=quanteda.

  • Berry, D. M. (2011). The computational turn: Thinking about the digital humanities. Culture Machine, 12, 1–22.

  • Bird, S. (2006). NLTK: The natural language toolkit. In Proceedings of the COLING/ACL on interactive presentation sessions, Association for Computational Linguistics (pp. 69–72).

  • Blevins, C., & Mullen, L. (2015). Jane, John ... Leslie? A historical method for algorithmic gender prediction. Digital Humanities Quarterly 9(3).

  • Bradley, J., & Rockwell, G. (1992). Towards new research tools in computer-assisted text analysis. In Canadian Learned Societies Conference.

  • Brezina, V., McEnery, T., & Wattam, S. (2015). Collocations in context: A new perspective on collocation networks. International Journal of Corpus Linguistics, 20(2), 139–173.

  • Camargo, B. V., & Justo, A. M. (2013). Iramuteq: um software gratuito para análise de dados textuais. Temas em Psicologia, 21(2), 513–518.

  • Chang, W., Cheng, J., Allaire, J., Xie, Y., & McPherson, J. (2017). shiny: Web application framework for R. R package version 1.0.4. https://cran.r-project.org/package=shiny.

  • Deschamps, R. (2017). Correspondence analysis for historical research with R. The Programming Historian. https://programminghistorian.org/en/lessons/correspondence-analysis-in-R.

  • Dewar, T. (2016). R basics with tabular data. The Programming Historian. https://programminghistorian.org/en/lessons/r-basics-with-tabular-data.

  • Donaldson, J. (2016). tsne: T-distributed stochastic neighbor embedding for R (t-SNE). R package version 0.1-3. https://cran.r-project.org/package=tsne.

  • Eder, M., Rybicki, J., & Kestemont, M. (2016). Stylometry with R: A package for computational text analysis. R Journal, 8(1), 107–121.

  • Feinerer, I., Hornik, K., & Meyer, D. (2008). Text mining infrastructure in R. Journal of Statistical Software, 25(5), 1–54.

  • Fleury, S., & Zimina, M. (2014). Trameur: A framework for annotated text corpora exploration. In COLING (Demos) (pp. 57–61).

  • Gagolewski, M. (2017). R package stringi: Character string processing facilities. https://cran.r-project.org/package=stringi.

  • Gerdes, K. (2014). Corpus collection and analysis for the linguistic layman: The Gromoteur. http://gromoteur.ilpga.fr/.

  • Goldstone, A., & Underwood, T. (2014). The quiet transformations of literary studies: What thirteen thousand scholars could tell us. New Literary History, 45(3), 359–384.

  • Gries, S. (2009). Quantitative corpus linguistics with R: A practical introduction. London: Routledge.

  • Gries, S. (2013). Statistics for linguistics with R: A practical introduction. Berlin: Walter de Gruyter.

  • Gries, S. T., & Deshors, S. C. (2014). Using regressions to explore deviations between corpus data and a standard/target: Two suggestions. Corpora, 9(1), 109–136.

  • Gries, S. T., & Wulff, S. (2012). Regression analysis in translation studies. Quantitative methods in corpus-based translation studies: A practical guide to descriptive translation research (pp. 35–52). Amsterdam: Benjamins.

  • Grün, B., & Hornik, K. (2011). topicmodels: An R package for fitting topic models. Journal of Statistical Software, 40(13), 1–30. https://doi.org/10.18637/jss.v040.i13.

  • Heiden, S. (2010). The txm platform: Building open-source textual analysis software compatible with the tei encoding scheme. In 24th Pacific Asia conference on language, information and computation, Institute for Digital Enhancement of Cognitive Development, Waseda University (pp. 389–398).

  • Honnibal, M., & Johnson, M. (2015). An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 conference on empirical methods in natural language processing, Association for Computational Linguistics, Lisbon, Portugal (pp. 1373–1378).

  • Hornik, K. (2016). openNLP: Apache OpenNLP tools interface. R package version 0.2-6. https://cran.r-project.org/package=openNLP.

  • Hornik, K. (2017a). NLP: Natural language processing infrastructure. R package version 0.1-11. https://cran.r-project.org/package=NLP.

  • Hornik, K. (2017b). R FAQ. https://cran.r-project.org/doc/FAQ/R-FAQ.html.

  • Hornik, K., Ligges, U., & Zeileis, A. (2017). Changes on CRAN. The R Journal, 9(1), 505–507.

  • Ihaka, R., & Gentleman, R. (1996). R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3), 299–314.

  • Jockers, M. L. (2013). Macroanalysis: Digital methods and literary history. Champaign: University of Illinois Press.

  • Jockers, M. L. (2014). Text analysis with R for students of literature. New York: Springer.

  • Johnson, K. (2008). Quantitative methods in linguistics. London: Wiley.

  • Kahle, D., & Wickham, H. (2013). ggmap: Spatial visualization with ggplot2. The R Journal, 5(1), 144–161.

  • Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., et al. (2014). The sketch engine: Ten years on. Lexicography, 1(1), 7–36.

  • Klaussner, C., Nerbonne, J., & Çöltekin, Ç. (2015). Finding characteristic features in stylometric analysis. Digital Scholarship in the Humanities, 30(suppl 1), i114–i129.

  • Komen, E. R. (2011). Cesax: Coreference editor for syntactically annotated xml corpora. Reference manual Nijmegen. Nijmegen: Radboud University Nijmegen.

  • Lamalle, C., Martinez, W., Fleury, S., Salem, A., Fracchiolla, B., Kuncova, A., & Maisondieu, A. (2003). Lexico3–outils de statistique textuelle. manuel d’utilisation. SYLED–CLA2T, Université de la Sorbonne nouvelle–Paris 3:48.

  • Lancashire, I., Bradley, J., McCarty, W., Stairs, M., & Wooldridge, T. (1996). Using tact with electronic texts. New York: MLA.

  • Levine, L. W. (1988). Documenting America (Vol. 2, pp. 1935–1943). Berkeley: University of California Press.

  • Levshina, N. (2015). How to do linguistics with R: Data exploration and statistical analysis. Amsterdam: John Benjamins Publishing Company.

  • Lienou, M., Maitre, H., & Datcu, M. (2010). Semantic annotation of satellite images using latent dirichlet allocation. IEEE Geoscience and Remote Sensing Letters, 7(1), 28–32.

  • Manning, C. D., Surdeanu, M., Bauer, J., Finkel, J. R., Bethard, S., & McClosky, D. (2014). The stanford corenlp natural language processing toolkit. In ACL (system demonstrations) (pp. 55–60).

  • McEnery, T., & Hardie, A. (2011). Corpus linguistics: Method, theory and practice. Cambridge: Cambridge University Press.

  • Michalke, M. (2017). koRpus: An R package for text analysis. R package version 0.10-2. https://cran.r-project.org/package=koRpus.

  • Mimno, D. (2013). mallet: A wrapper around the Java machine learning tool MALLET. R package version 1.0. https://cran.r-project.org/package=mallet.

  • Morton, T., Kottmann, J., Baldridge, J., & Bierner, G. (2005). Opennlp: A java-based nlp toolkit. In EACL.

  • O’Donnell, M. (2008). The uam corpustool: Software for corpus annotation and exploration. In Proceedings of the XXvI congreso de AESLA, Almeria, Spain (pp. 3–5).

  • Ooms, J. (2017). hunspell: High-performance Stemmer, Tokenizer, and spell checker for R. R package version 2.6. https://cran.r-project.org/package=hunspell.

  • O’Sullivan, J., Jakacki, D., & Galvin, M. (2015). Programming in the digital humanities. Digital Scholarship in the Humanities, 30(suppl 1), i142–i147.

  • Peng, R. D. (2011). Reproducible research in computational science. Science, 334(6060), 1226–1227.

  • Rayson, P. (2009). Wmatrix: A web-based corpus processing environment. http://ucrel.lancs.ac.uk/wmatrix/.

  • Rinker, T. W. (2013). qdap: Quantitative discourse analysis package. Buffalo, NY: University at Buffalo/SUNY. 2.2.8.

  • RStudio Team. (2017). RStudio: Integrated development environment for R. Boston, MA: RStudio Inc.

  • Rudis, B., Levien, R., Engelhard, R., Halls, C., Novodvorsky, P., Németh, L., & Buitenhuis, N. (2016). hyphenatr: Tools to Hyphenate Strings Using the ’Hunspell’ Hyphenation Library. R package version 0.3.0. https://cran.r-project.org/package=hyphenatr.

  • Salkie, R. (1995). Intersect: A parallel corpus project at brighton university. Computers and Texts, 9, 4–5.

  • Schreibman, S., Siemens, R., & Unsworth, J. (2015). A new companion to digital humanities. London: Wiley.

  • Scott, M. (1996). WordSmith tools, Stroud: Lexical analysis software. https://lexically.net/wordsmith/.

  • Siddiqui, N. (2017). Data wrangling and management in R. The Programming Historian. https://programminghistorian.org/en/lessons/data_wrangling_and_management_in_R.

  • Sievert, C., & Shirley, K. (2015). LDAtools: Tools to fit a topic model using Latent Dirichlet Allocation (LDA). R package version 0.1. https://cran.r-project.org/package=LDAtools.

  • Simon, N., Friedman, J., Hastie, T., & Tibshirani, R. (2011). Regularization paths for cox’s proportional hazards model via coordinate descent. Journal of Statistical Software, 39(5), 1–13.

  • Sinclair, S., Rockwell, G., et al. (2016). Voyant tools. http://voyant-tools.org/. Accessed 4 Sept 2018.

  • Gries, S. T., & Hilpert, M. (2008). The identification of stages in diachronic data: Variability-based neighbour clustering. Corpora, 3(1), 59–81.

  • Underwood, T. (2017). A genealogy of distant reading. Digital Humanities Quarterly. http://digitalhumanities.org/dhq/vol/11/2/000317/000317.html.

  • Ushey, K., McPherson, J., Cheng, J., Atkins, A., & Allaire, J. (2016). packrat: A dependency management system for projects and their R package dependencies. R package version 0.4.8-1. https://cran.r-project.org/package=packrat.

  • Wang, X., & Grimson, E. (2008). Spatial latent Dirichlet Allocation. In: Advances in neural information processing systems 20 (pp. 1577–1584). Curran Associates, Inc. http://papers.nips.cc/paper/3278-spatial-latent-dirichlet-allocation.pdf.

  • Welbers, K., Van Atteveldt, W., & Benoit, K. (2017). Text analysis in R. Communication Methods and Measures, 11(4), 245–265.

  • Wiedemann, G., & Niekler, A. (2017). Hands-on: A five day text mining course for humanists and social scientists in R. In Proceedings of the 1st workshop teaching NLP for digital humanities.

  • Wiedemann, G. (2016). Text mining for qualitative data analysis in the social sciences. New York: Springer.

  • Wijffels, J. (2018). udpipe: Tokenization, parts of speech tagging, lemmatization and dependency parsing with the ’UDPipe’ ’NLP’ Toolkit. R package version 0.6.1. https://cran.r-project.org/package=udpipe.

  • Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1–3), 37–52.

  • Xie, Y. (2014). knitr: A comprehensive tool for reproducible research in R. In: Stodden, V., Leisch, F., & Peng, R. D. (eds), Implementing reproducible computational research. Chapman and Hall/CRC. ISBN: 978-1466561595.

Author information

Corresponding author

Correspondence to Taylor Arnold.

About this article

Cite this article

Arnold, T., Ballier, N., Lissón, P. et al. Beyond lexical frequencies: using R for text analysis in the digital humanities. Lang Resources & Evaluation 53, 707–733 (2019). https://doi.org/10.1007/s10579-019-09456-6
