The creation and application of a large-scale corpus-based academic multi-word unit list

doi:10.1016/j.esp.2021.01.001

English for Specific Purposes

Volume 62, April 2021, Pages 142-157


Highlights

• This paper outlines the steps in the construction of a corpus-based academic English resource list.
• A concgram-based methodology is used, in conjunction with manual processing of items.
• It provides a large set of academic multiword units for educators and learners to use.
• The features of this list are compared with those of existing English for Academic Purposes resources.
• Links are made between academic English and general English.

Abstract

This paper outlines the construction of a corpus-based list that provides a large-scale selection of multi-word units occurring in academic English. Using up-to-date, reliable methods, the goal was to produce a large-scale resource which could either be studied directly or used as a reference by practitioners creating further resources. The paper details the procedures used to generate this academic multi-word unit list, explains why specific decisions were made to identify useful items, and discusses the resulting resource. The list is compared with existing lists, and its characteristics are compared with those of high-frequency general English word lists. Finally, applications of this free resource for English practitioners and students are suggested.

Introduction

A corpus, or a collection of texts compiled for the purpose of linguistic analysis, is a powerful tool for linguists. Corpus linguistics has enabled educators to create language learning materials based on real-world language use. A corpus can be analyzed to answer linguistic questions about the patterns, frequency, and scope of language use. For example, a corpus makes it possible to determine how frequently particular words appear in the presence of other words, enabling researchers to identify commonly paired words, frequent phrases, and typical multi-word units (MWUs). Such data can be used to create educational materials that are more objective than those based on an author's subjective classroom experience and intuition. This approach can lead to materials which focus on language items that have a high probability of being encountered in the future or of occurring within a specific context. This study focuses on words which frequently occur near each other in an academic context.

Researchers have defined the co-occurrence of words, whether as collocations or as MWUs, in different ways, and a variety of terms have therefore been used as research progressed. Some examples include word associations (Murphy, 1983), fixed expressions (Kennedy, 1990), prefabricated language chunks (Nattinger & DeCarrico, 1992), formulaic language (Wray, 2002), formulae (Simpson-Vlach & Ellis, 2010), and phrasal expressions (Martinez & Schmitt, 2012). The most basic term for a word that frequently appears next to another word is collocation, which some researchers define narrowly as two words co-occurring with high frequency (Hoey, 1991). Researchers such as Gitsaki (1996) define collocations in terms of syntactic structures. For example, Gitsaki (1996) compared how collocations are formed in English and in Greek to determine where the languages differ and thus where learners need to focus, such as how English adjective + preposition formulations like afraid of [snakes] are verb + determiner + noun formulations in Greek. Other researchers have combined syntactic structures and frequency as criteria for identifying collocations to focus on (Lesniewska & Witalisz, 2007). Some researchers opt to use the term MWU in a way that encompasses collocations, and the definitions of what constitutes an MWU are similarly varied. Grant and Nation (2006) break down MWUs into literals, figuratives, and core idioms. Biber et al. (1999) differentiate collocations from MWUs by referring to 2-word phrases as collocations and anything beyond that as idioms or lexical bundles.

The current study examines MWUs in academic English. More specifically, it focuses on journals in which peer-reviewed academic research papers are published. These MWUs are collocations operationalized as a pivot word (a high-frequency academic word which is used to search for other words which co-occur with it frequently) and a collocate (a word that commonly co-occurs with another word). Words are considered to be collocates when they occur within 4 words to the left or right of the pivot word. These word combinations, or collocations, are counted as lemmas, a method of consolidating words into one group containing all the inflected forms of a given word when taken as one part of speech. Thus a search for the noun lemma study co-occurring with the verb lemma find would cover instances of the nouns study or studies that co-occur with the verbs find, finds, found, or finding. Furthermore, this study allows for other words coming before, after, or in between these lemmas, provided their basic relationship is not disrupted; thus, this study found, studies find, finding a study, etc. are counted as instances of the same MWU. This process is called concgramming (Cheng et al., 2006), a word association (collocation) corpus search method which considers constituency and positional variation. For instance, a concgram search for collocates with the pivot word jury would yield adjacent collocates such as jury's verdict, as well as jury's shocking verdict (constituency variation) and verdict of a jury (positional variation). This method is explained in more detail in Sections 3.2 (Concgramming: the method of identifying MWUs), 3.3 (Counting linear sequences versus unsequenced appearances within a set range), 3.4 (Partial duplicates), and 3.5 (Extending the core MWUs). After these concgrams are identified, the MWU most exemplary of a given lemmatized pivot word and collocate is determined based on frequency. For example, if jury co-occurs frequently with verdict, then an analysis would determine which MWU is the most frequent, such as the jury's verdict, a jury's verdict, or the verdict of the jury, etc.
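The search described above can be illustrated with a minimal Python sketch. It is not the custom software used in the study: the tiny lemma table, the function names, and the sample sentences are all hypothetical, and part of speech is ignored for brevity. It only shows the general idea of collecting every surface span in which a collocate lemma falls within 4 tokens of a pivot lemma, on either side, and then counting which span is most frequent.

```python
from collections import Counter

# Hypothetical toy lemma table (an assumption for illustration only).
# A real pipeline would use a proper lemmatizer with part-of-speech tags.
LEMMAS = {
    "studies": "study",
    "finds": "find",
    "found": "find",
    "finding": "find",
}


def lemma(token):
    """Map a token to its lemma, falling back to the token itself."""
    return LEMMAS.get(token, token)


def concgram_spans(tokens, pivot, collocate, window=4):
    """Yield the surface span for each co-occurrence of the two lemmas.

    The collocate may appear up to `window` tokens to the left or right
    of the pivot (positional variation), with other words in between
    (constituency variation).
    """
    lemmas = [lemma(t) for t in tokens]
    for i, current in enumerate(lemmas):
        if current != pivot:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i and lemmas[j] == collocate:
                start, end = sorted((i, j))
                yield " ".join(tokens[start:end + 1])


def count_concgram_spans(sentences, pivot, collocate, window=4):
    """Count surface spans for one pivot/collocate lemma pair."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()  # naive whitespace tokenizer
        counts.update(concgram_spans(tokens, pivot, collocate, window))
    return counts


# Made-up sample sentences, not corpus data.
SENTENCES = [
    "this study found a clear effect",
    "earlier studies find no difference",
    "the study clearly found support",
    "we are finding a study useful",
    "another study found mixed results",
    "the study of many other unrelated topics found",  # outside the window
]

if __name__ == "__main__":
    counts = count_concgram_spans(SENTENCES, "study", "find")
    for span, freq in counts.most_common():
        print(freq, span)
```

In this sketch all matching spans are grouped under one lemma pair, and the most frequent span would be taken as the representative MWU. The last sample sentence contributes nothing because the two lemmas are more than 4 tokens apart.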

The result of this study is a ranked list based on corpus data of the most frequent MWUs derived from lemmatized concgrams (i.e., collocations) which were manually confirmed by ESL practitioners to have general value for students aiming to improve their academic English fluency. The corpus utilized was the Corpus of Contemporary American English (COCA) (Davies, 2008). This corpus was chosen in line with previous research, which favors the most recent corpus data because language use changes over time; technology, for example, changes the vocabulary in common use. The British National Corpus (BNC) (2007) was a popular resource among corpus linguists, but since its development stopped in 1993 and its texts were mostly sourced from the 1980s, some of its language can be considered dated and it lacks more modern words and phrases. Its total size is also an issue, at only 100 million tokens. In comparison, the COCA began to be compiled in 1990 and is still being added to today, with its academic section alone comprising 112 million tokens. This section is well balanced, with equal amounts sourced from the following types of academic journals: education, history, geography, social science, law, political science, humanities, philosophy, religion, science, technology, medicine, and a miscellaneous section. These features, along with being freely downloadable, have made it a popular resource for corpus studies.

The list introduced in this paper is intended to assist academic English learning by providing base materials that can be used in the production of educational resources. Given that MWUs can improve a student's fluency and familiarity with the discursive space, a comprehensive evidence-informed list of the top academic English collocates and phrases should be of great utility. The creation of the list also allows an analysis of the relationship between academic English, as revealed in this study, and what we already know about general English. In the course of this paper, there will be a discussion of some lists currently available for use in comparison to the list produced by this study. The paper also carefully explains some unique steps taken, and describes new, custom software used to identify MWUs and how this affects the final product. In the concluding section, recommendations are given about features which should ideally be present in concordance software used for educational purposes. Finally, recommendations are given for the educational applications of the list and the types of academic English learning activities that could be created using this resource.

Section snippets

Collocations and language learning

There has been an increased awareness of the importance of collocational and MWU fluency in second language education over the last 2 decades, and researchers agree that knowledge of collocations and formulaic phrases is valuable (Nation, 2013; Siyanova-Chanturia & Pellicer-Sanchez, 2019; Webb & Kagimoto, 2011). Hoey (2005) and Hill et al. (2000) state that much of language itself consists of prefabricated chunks, or MWUs, and thus learning these MWUs is a central part of learning a language.

Current corpus-based MWU resources

The identification of high-frequency collocations or MWUs can be a very complex and time-consuming process. Currently, there are not many resources available to improve academic English fluency through the learning of MWUs found in academic discourse, and what is available is often small-scale or limited. For example, Liu (2012) identified only 228 MWUs, while the Simpson-Vlach & Ellis (2010) study identified just 207 core items. Other studies have limited their scope of the type of MWUs they

Aims

The aims of this project are to produce and analyse a list of MWUs of academic English that could be used as a resource by both learners and teachers. This resource will be produced using concgramming, with outputs further processed by experienced ESL practitioners using new methodologies, including a novel way of dealing with partial duplicates (such as the present study versus participation in the present study) and also extending the MWUs identified beyond their basic cores to include other

Methods

The procedure involved several steps: search, identification, elimination, manual processing, and finally a comparative analysis of the resultant list. Many of these steps required the participation of the research team, so step-by-step instructions for the procedures were created for team members (see Appendix A).

Item analysis

The procedure described above with the 500 most frequent words from Gardner and Davies' (2014) academic vocabulary list initially produced a total of 10,190 collocations. After the MWUs of these collocations were identified and then analyzed by the EFL practitioners, 5,057 of the units (49.6%) were judged as useful for academic learners seeking to improve general academic English fluency. The vast majority of the items that were rejected included items which occurred in specific academic areas,

Project findings

Previous studies that attempted to identify high-frequency academic MWUs, though important and useful, were limited in both scope and methodology, and thus produced rather small lists that reflected their methodological shortcomings. This study experimented with new methodologies to create a large-scale academic MWU resource that addressed this gap.

Many researchers depend on corpus data for identification of high-frequency MWUs, but relying upon such data alone is known to be insufficient in

Conclusion

This paper has documented a project which attempted to extend previous work on the development of academic MWU lists for learners. It has explained how such a list was produced and highlighted the crucial roles of both concgramming and manual checking by experts in forming such a list. This paper has made observations about the similarities between academic and general English, and provided suggestions for future concordancing activities. Finally, recommendations have been made about how to use

Acknowledgements

We would like to thank the anonymous reviewers for their feedback and comments on the initial version of the article. Their input was invaluable for helping us improve this paper.

Dr James Rogers is an Associate Professor at Meijo University in Japan and has 15 years of experience teaching English. His PhD research resulted in the creation of a large-scale general English formulaic phrase list. He is also the creator of English learning smartphone apps that have been downloaded over 200,000 times; information about them can be found at https://jamesmartinrogers.wixsite.com. His research interests include CALL, corpus linguistics, formulaic language, and vocabulary

References (75)

K. Ackermann et al.
Developing the academic collocation list (ACL) - a corpus-driven and expert-judged approach
Journal of English for Academic Purposes
(2013)

L. Al Hassan et al.
The effectiveness of focused instruction of formulaic sequences in augmenting L2 learners’ academic writing skills: A quantitative research study
Journal of English for Academic Purposes
(2015)

Y. Bestgen et al.
Quantifying the development of phraseological competence in L2 English writing: An automated approach
Journal of Second Language Writing
(2014)

T. Byrt et al.
Bias, prevalence and kappa
Journal of Clinical Epidemiology
(1993)

A. Gilmore et al.
The language of civil engineering research articles: A corpus-based approach
English for Specific Purposes
(2018)

L. Grabowski
Keywords and lexical bundles within English pharmaceutical discourse: A corpus-driven description
English for Specific Purposes
(2015)

W. Hsu
The most frequent opaque formulaic sequences in English-medium college textbooks
System
(2014)

D. Liu
The most frequently-used multiword constructions in academic written English: A multi-corpus study
English for Specific Purposes
(2012)

A. Masrai et al.
Measuring the contribution of academic and general vocabulary knowledge to learners’ academic achievement
Journal of English for Academic Purposes
(2018)

A. Wray et al.
The functions of formulaic language: An integrated model
Language & Communication
(2000)

H. Wright
Lexical bundles in stand-alone literature reviews: Sections, frequencies, and functions
English for Specific Purposes
(2019)

M. Almela et al.
Words as “lexical units” in learning/teaching vocabulary
International Journal of English Studies
(2007)

L. Anthony
AntWordPairs
(2013)

L. Anthony
AntConc
(2020)

D. Biber et al.
Longman grammar of spoken and written English
(1999)

F. Boers et al.
Formulaic sequences and perceived oral proficiency: Putting a lexical approach to the test
Language Teaching Research
(2006)

P. Bogaards
Lexical units and the learning of foreign language vocabulary
Studies in Second Language Acquisition
(2001)

British National Corpus
(2007)

P. Byrd et al.
On the other hand: Lexical bundles in academic writing and in the teaching of EAP
University of Sydney Papers in TESOL
(2010)

W. Cheng et al.
From n-gram to skipgram to concgram
International Journal of Corpus Linguistics
(2006)

Y. Chon et al.
Collocations in L2 writing and rater’s perceived writing proficiency
Korean Journal of Applied Linguistics
(2009)

Y. Chon et al.
A corpus-driven analysis of spoken and written academic collocations
Multimedia-assisted Language Learning
(2013)

K. Church et al.
Word association norms, mutual information, and lexicography
Computational Linguistics
(1990)

T. Cobb
VocabProfiler VP Compleat BNC-COCA-25 [Computer software]
(2018)

J. Cohen
A coefficient of agreement for nominal scales
Educational and Psychological Measurement
(1960)

K. Conklin et al.
Formulaic sequences: Are they processed more quickly than nonformulaic language by native and nonnative speakers?
Applied Linguistics
(2008)

A. Coxhead
A new academic word list
TESOL Quarterly
(2000)

M. Davies
The corpus of contemporary American English: 425 million words, 1990-present
(2008)

P. Durrant
Investigating the viability of a collocation list for students of English for academic purposes
Journal of English for Specific Purposes
(2009)

P. Durrant
Formulaic language in English for academic purposes

P. Durrant et al.
To what extent do native and non-native writers make use of collocations?
International Review of Applied Linguistics
(2009)

N. Ellis
Memory for language

A. Eriksson
Pedagogical perspectives on bundles: Teaching bundles to doctoral students in biochemistry

D. Gardner et al.
A new academic vocabulary list
Applied Linguistics
(2014)

C. Gitsaki
The development of ESL collocational knowledge
(1996)

C. Gledhill
Collocations in science writing
(2000)


Dr Amanda Müller is a Senior Lecturer at Flinders University. She teaches in the areas of English for Specific Purposes and Linguistics within the College of Nursing and Health Sciences. Her undergraduate degree was in psychology and her PhD research was based on corpus linguistics. Her research interests include language testing, vocabulary, CALL, and educational computer games.

Dr Frank E. Daulton is a full Professor at Ryukoku University in Kyoto. He earned his PhD in Applied Linguistics in 2004 under the tutelage of Paul Nation. He is the author of Japan's Built-in Lexicon of English-based Loanwords (Multilingual Matters).

Paul Dickinson currently teaches at Meijo University in Japan. He has taught English for almost 20 years and has published and presented on topics ranging from formulaic language use on Twitter to implementing Universal Design for Learning in language learning. His research interests include CALL, extensive reading, formulaic language, and instructional design.

Cosmin Florescu teaches at the International University of Health and Welfare in Japan. Formally trained as an applied linguist, he has worked towards integrating his nearly 10 years of experience as an interpreter/translator into his teaching practice by emphasizing the learner's perspective. He is interested in researching learner motivation, corpus linguistics, technology in education, and semantic universals.

Gordon Reid is an Associate Professor at the University of Yamanashi, a national university in Kofu, Japan. He has been teaching English since 1988 and has been a professor for the last 12 years. Gordon's research interests include Vygotsky's sociocultural theory as it relates to the cultural tools of the Japanese education system. He has also been involved in various research projects including the inclusion of collocations in language learning and the use of smartphone software to enhance learners' vocabulary-learning efficiency.

Tim Stoeckel is an Associate Professor at the University of Niigata Prefecture, in Niigata, Japan. His interests include vocabulary learning and assessment, vocabulary growth, and the relationship between lexical knowledge and reading comprehension.
