The creation and application of a large-scale corpus-based academic multi-word unit list
Introduction
A corpus, a collection of texts compiled for the purpose of linguistic analysis, is a powerful tool for linguists. Corpus linguistics has enabled educators to create language learning materials based on real-world language use. A corpus can be analyzed to answer questions about the patterns, frequency, and scope of language use. For example, a corpus makes it possible to determine how frequently particular words appear in the presence of other words, enabling researchers to identify commonly paired words, frequent phrases, and typical multi-word units (MWUs). Such data can be used to create educational materials that are more objective than those based on authors' subjective classroom experience and intuition. This approach can lead to materials which focus on language items that have a high probability of being encountered in the future or of occurring within a specific context. This study focuses on words which frequently occur near each other in an academic context.
Researchers have defined the co-occurrence of words, whether as collocations or as MWUs, in different ways, and a variety of terms have therefore been used as research has progressed. Some examples include word associations (Murphy, 1983), fixed expressions (Kennedy, 1990), prefabricated language chunks (Nattinger & DeCarrico, 1992), formulaic language (Wray, 2002), formulae (Simpson-Vlach & Ellis, 2010), and phrasal expressions (Martinez & Schmitt, 2012). The most basic term for a word that frequently appears next to another word is collocation, which some researchers define narrowly as 2 words co-occurring with high frequency (Hoey, 1991). Researchers such as Gitsaki (1996) define collocations in terms of syntactic structures. For example, Gitsaki (1996) compared how collocations are formed in English and in Greek to determine where the languages differ and thus where learners need to focus, such as how English adjective + preposition formulations like afraid of [snakes] are expressed as verb + determiner + noun formulations in Greek. Other researchers have combined syntactic structures and frequency as criteria for identifying collocations to focus on (Lesniewska & Witalisz, 2007). Some researchers opt to use the term MWU in a way that encompasses collocations, and the definitions of what constitutes an MWU are similarly varied. Grant and Nation (2006) break down MWUs into literals, figuratives, and core idioms. Biber et al. (1999) differentiate collocations from MWUs by referring to 2-word phrases as collocations and anything beyond that as idioms or lexical bundles.
The current study examines MWUs in academic English. More specifically, it focuses on journals in which peer-reviewed academic research papers are published. These MWUs are collocations operationalized as a pivot word (a high-frequency academic word which is used to search for other words that frequently co-occur with it) and a collocate (a word that commonly co-occurs with another word). Words are considered to be collocates when they occur within 4 words to the left or right of the pivot word. These word combinations, or collocations, are counted as lemmas, a method of consolidating into one group all the inflected forms of a given word when taken as one part of speech. Thus a search for the lemma noun study when it occurs with the lemma verb find would include searches for instances of the nouns study or studies that co-occur with the verbs find, finds, found, or finding. Furthermore, this study accounts for other words coming before, after, or in between these lemmas, so their basic relationship is not disrupted; thus, this study found, studies find, finding a study, etc. are counted as instances of the same MWU. This process is called concgramming (Cheng et al., 2006), a word association (collocation) corpus search method which considers constituency and positional variation. For instance, a concgram search for collocates with the pivot word jury would yield adjacent collocates such as jury's verdict, as well as jury's shocking verdict (constituency variation) and verdict of a jury (positional variation). This method is explained in more detail below in Sections 3.2 (Concgramming: the method of identifying MWUs), 3.3 (Counting linear sequences versus unsequenced appearances within a set range), 3.4 (Partial duplicates), and 3.5 (Extending the core MWUs). After these concgrams are identified, the MWU most exemplary of a given lemmatized pivot word and collocate is determined based on frequency. For example, if jury co-occurs frequently with verdict, then an analysis would determine which MWU is the most frequent, such as the jury's verdict, a jury's verdict, or the verdict of the jury, etc.
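The concgram search and the frequency-based selection described above can be illustrated with a minimal Python sketch. It is not the custom software used in this study: the lemma table, the function names (lemma, concgram_hits, most_frequent_mwu), and the example sentences are all hypothetical, tokenization is by whitespace only, and part of speech is ignored. The only details taken from the text are the 4-word span and the jury/verdict example.

```python
from collections import Counter

# Hypothetical toy lemma table (an assumption for illustration only);
# a real study would rely on a lemmatized, part-of-speech-tagged corpus.
LEMMAS = {"jury's": "jury", "juries": "jury", "verdicts": "verdict"}


def lemma(token):
    """Map a surface token to its lemma, falling back to the lowercased token."""
    token = token.lower()
    return LEMMAS.get(token, token)


def concgram_hits(tokens, pivot, collocate, span=4):
    """Return the surface strings that run from pivot to collocate (in either
    order) whenever the collocate lemma occurs within `span` words to the
    left or right of the pivot lemma."""
    lemmas = [lemma(t) for t in tokens]
    hits = []
    for i, lem in enumerate(lemmas):
        if lem != pivot:
            continue
        lo = max(0, i - span)
        hi = min(len(tokens), i + span + 1)
        for j in range(lo, hi):
            if j != i and lemmas[j] == collocate:
                start, end = sorted((i, j))
                hits.append(" ".join(tokens[start:end + 1]).lower())
    return hits


def most_frequent_mwu(sentences, pivot, collocate, span=4):
    """Count every surface form found for a pivot/collocate pair and return
    the most frequent one as (form, count), or None if there are no hits."""
    counts = Counter()
    for sentence in sentences:
        counts.update(concgram_hits(sentence.split(), pivot, collocate, span))
    return counts.most_common(1)[0] if counts else None


# Invented example sentences, not corpus data.
SENTENCES = [
    "the jury's verdict was read aloud",
    "the jury's shocking verdict surprised everyone",   # constituency variation
    "they awaited the verdict of the jury",             # positional variation
    "the jury's verdict stood",
]

print(most_frequent_mwu(SENTENCES, "jury", "verdict"))
```

On these four invented sentences the sketch finds four co-occurrences of the two lemmas and selects jury's verdict (2 occurrences) as the most frequent form. It keeps only the words running from the pivot to the collocate, so surrounding words such as determiners would require a separate extension step.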
The result of this study is a ranked list based on corpus data of the most frequent MWUs derived from lemmatized concgrams (i.e., collocations) which were manually confirmed by ESL practitioners to have general value for students aiming to improve their academic English fluency. The corpus utilized was the Corpus of Contemporary American English (COCA) (Davies, 2008). The choice of this corpus follows a trend in previous research of using the most modern corpus data available, because language usage changes over time; technology, for example, changes the vocabulary in common use. The British National Corpus (BNC) (2007) was a popular resource among corpus linguists, but since it stopped being developed in 1993 and mostly used texts sourced from the 1980s, some of its language can be considered dated and it may be missing more modern words or phrases. Its total size is also an issue, at only 100 million tokens. In comparison, the COCA began to be compiled in 1990 and is still being added to today, with its academic section comprising 112 million tokens. This section is well balanced, with equal amounts sourced from the following types of academic journals: education, history, geography, social science, law, political science, humanities, philosophy, religion, science, technology, medicine, and a miscellaneous section. These features, along with being freely downloadable, have made it a popular resource for corpus studies.
The list introduced in this paper is intended to assist academic English learning by providing base materials that can be used in the production of educational resources. Given that MWUs can improve a student's fluency and familiarity with the discursive space, a comprehensive evidence-informed list of the top academic English collocates and phrases should be of great utility. The creation of the list also allows an analysis of the relationship between academic English, as revealed in this study, and what we already know about general English. In the course of this paper, there will be a discussion of some lists currently available for use in comparison to the list produced by this study. The paper also carefully explains some unique steps taken, and describes new, custom software used to identify MWUs and how this affects the final product. In the concluding section, recommendations are given about features which should ideally be present in concordance software used for educational purposes. Finally, recommendations are given for the educational applications of the list and the types of academic English learning activities that could be created using this resource.
Collocations and language learning
There has been an increased awareness of the importance of collocational and MWU fluency in second language education over the last 2 decades, and researchers agree that knowledge of collocations and formulaic phrases is valuable (Nation, 2013; Siyanova-Chanturia & Pellicer-Sanchez, 2019; Webb & Kagimoto, 2011). Hoey (2005) and Hill et al. (2000) state that much of language itself consists of prefabricated chunks, or MWUs, and thus learning these MWUs is a central part of learning a language.
Current corpus-based MWU resources
The identification of high-frequency collocations or MWUs can be a complex and time-consuming process. Currently, few resources are available for improving academic English fluency through the learning of MWUs found in academic discourse, and what is available is often small-scale or limited. For example, Liu (2012) identified only 228 MWUs, while Simpson-Vlach and Ellis (2010) identified just 207 core items. Other studies have limited their scope of the type of MWUs they
Aims
The aims of this project are to produce and analyze a list of MWUs of academic English that could be used as a resource by both learners and teachers. This resource will be produced using concgramming, with outputs further processed by experienced ESL practitioners using new methodologies, including a novel way of dealing with partial duplicates (such as the present study versus participation in the present study) and also extending the MWUs identified beyond their basic cores to include other
Methods
The procedures involved several steps: search, identification, elimination, manual processing, and finally a comparative analysis of the resultant list. Many of these steps required participation from this paper's research team, and thus step-by-step instructions were created for the procedures they had to complete (see Appendix A).
Item analysis
Applying the procedure described above to the 500 most frequent words from Gardner and Davies' (2014) academic vocabulary list initially produced a total of 10,190 collocations. After the MWUs of these collocations were identified and then analyzed by the EFL practitioners, 5,057 of the units (49.6%) were judged as useful for academic learners seeking to improve general academic English fluency. The vast majority of the items that were rejected included items which occurred in specific academic areas,
Project findings
Previous studies that attempted to identify high-frequency academic MWUs, though important and useful, were limited in both scope and methodology, and thus produced rather small lists that reflected their methodological shortcomings. This study experimented with new methodologies to create a large-scale academic MWU resource that addressed this gap.
Many researchers depend on corpus data for identification of high-frequency MWUs, but relying upon such data alone is known to be insufficient in
Conclusion
This paper has documented a project which attempted to extend previous work on the development of academic MWU lists for learners. It has explained how such a list was produced and highlighted the crucial roles of both concgramming and manual checking by experts in forming such a list. This paper has made observations about the similarities between academic and general English, and provided suggestions for future concordancing activities. Finally, recommendations have been made about how to use
Acknowledgements
We would like to thank the anonymous reviewers for their feedback and comments on the initial version of the article. Their input was invaluable for helping us improve this paper.
References (75)
Developing the academic collocation list (ACL) - a corpus-driven and expert-judged approach
Journal of English for Academic Purposes (2013)
The effectiveness of focused instruction of formulaic sequences in augmenting L2 learners’ academic writing skills: A quantitative research study
Journal of English for Academic Purposes (2015)
Quantifying the development of phraseological competence in L2 English writing: An automated approach
Journal of Second Language Writing (2014)
Bias, prevalence and kappa
Journal of Clinical Epidemiology (1993)
The language of civil engineering research articles: A corpus-based approach
English for Specific Purposes (2018)
Keywords and lexical bundles within English pharmaceutical discourse: A corpus-driven description
English for Specific Purposes (2015)
The most frequent opaque formulaic sequences in English-medium college textbooks
System (2014)
The most frequently-used multiword constructions in academic written English: A multi-corpus study
English for Specific Purposes (2012)
Measuring the contribution of academic and general vocabulary knowledge to learners’ academic achievement
Journal of English for Academic Purposes (2018)
The functions of formulaic language: An integrated model
Language & Communication (2000)
Lexical bundles in stand-alone literature reviews: Sections, frequencies, and functions
English for Specific Purposes
Words as “lexical units” in learning/teaching vocabulary
International Journal of English Studies
AntWordPairs
AntConc
Longman grammar of spoken and written English
Formulaic sequences and perceived oral proficiency: Putting a lexical approach to the test
Language Teaching Research
Lexical units and the learning of foreign language vocabulary
Studies in Second Language Acquisition
On the other hand: Lexical bundles in academic writing and in the teaching of EAP
University of Sydney Papers in TESOL
From n-gram to skipgram to concgram
International Journal of Corpus Linguistics
Collocations in L2 writing and rater’s perceived writing proficiency
Korean Journal of Applied Linguistics
A corpus-driven analysis of spoken and written academic collocations
Multimedia-assisted Language Learning
Word association norms, mutual information, and lexicography
Computational Linguistics
VocabProfiler VP Compleat BNC-COCA-25 [Computer software]
A coefficient of agreement for nominal scales
Educational and Psychological Measurement
Formulaic sequences: Are they processed more quickly than nonformulaic language by native and nonnative speakers?
Applied Linguistics
A new academic word list
TESOL Quarterly
The corpus of contemporary American English: 425 million words, 1990-present
Investigating the viability of a collocation list for students of English for academic purposes
Journal of English for Specific Purposes
Formulaic language in English for academic purposes
To what extent do native and non-native writers make use of collocations?
International Review of Applied Linguistics
Memory for language
Pedagogical perspectives on bundles: Teaching bundles to doctoral students in biochemistry
A new academic vocabulary list
Applied Linguistics
The development of ESL collocational knowledge
Collocations in science writing
Dr James Rogers is an Associate Professor at Meijo University in Japan and has 15 years of experience teaching English. His PhD research resulted in the creation of a large-scale general English formulaic phrase list. He is also the creator of English learning smartphone apps that have been downloaded over 200,000 times; information about them can be found at https://jamesmartinrogers.wixsite.com. His research interests include CALL, corpus linguistics, formulaic language, and vocabulary acquisition.
Dr Amanda Müller is a Senior Lecturer at Flinders University. She teaches in the areas of English for Specific Purposes and Linguistics within the College of Nursing and Health Sciences. Her undergraduate degree was in psychology and her PhD research was based on corpus linguistics. Her research interests include language testing, vocabulary, CALL, and educational computer games.
Dr Frank E. Daulton is a full Professor at Ryukoku University in Kyoto. He earned his PhD in Applied Linguistics in 2004 under the tutelage of Paul Nation. He is the author of Japan's Built-in Lexicon of English-based Loanwords (Multilingual Matters).
Paul Dickinson currently teaches at Meijo University in Japan. He has taught English for almost 20 years and has published and presented on topics ranging from formulaic language use on Twitter to implementing Universal Design for Learning in language learning. His research interests include CALL, extensive reading, formulaic language, and instructional design.
Cosmin Florescu teaches at the International University of Health and Welfare in Japan. Formally trained as an applied linguist, he has worked towards integrating his nearly 10 years of experience as an interpreter/translator into his teaching practice by emphasizing the learner's perspective. He is interested in researching learner motivation, corpus linguistics, technology in education, and semantic universals.
Gordon Reid is an Associate Professor at the University of Yamanashi, a national university in Kofu, Japan. He has been teaching English since 1988 and has been a professor for the last 12 years. Gordon's research interests include Vygotsky's sociocultural theory as it relates to the cultural tools of the Japanese education system. He has also been involved in various research projects including the inclusion of collocations in language learning and the use of smartphone software to enhance learners' vocabulary-learning efficiency.
Tim Stoeckel is an Associate Professor at the University of Niigata Prefecture, in Niigata, Japan. His interests include vocabulary learning and assessment, vocabulary growth, and the relationship between lexical knowledge and reading comprehension.