Abstract
One of the main research questions concerning multi-word expressions (MWEs) is which of them are transparent word combinations created ad hoc and which are multi-word lexical units (MWUs). In this paper, we use selected corpus-linguistic and machine-learning methods to determine which lexicalization criteria guide Polish and English lexicographers in deciding which MWEs (bigrams such as adjective+noun and noun+noun combinations) should be treated as lexical units and recorded in dictionaries as MWUs. We analyzed two samples, MWEs extracted from Polish and English monolingual dictionaries and MWEs created by the annotators, and tested two custom-designed criteria, intuition and paraphrase, also using statistical methods (the collocational-strength measures PMI and Jaccard). We found that Polish lexicographers tend not to include compositional MWEs as lexical entries in their dictionaries and that the criteria of paraphrase and intuition are important for them: MWEs that are not clearly and unambiguously paraphrasable and compositional are recorded in dictionaries. English lexicographers, by contrast, tend to record compositional and partly compositional MWEs as well.
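For readers unfamiliar with the two measures of collocational strength named above, the following minimal sketch shows how PMI and the Jaccard coefficient are commonly computed for a bigram from raw corpus counts. It is an illustration under stated assumptions, not the procedure used in the study: the function names, the base-2 logarithm, and the example counts are all hypothetical.

```python
import math


def pmi(f_xy: int, f_x: int, f_y: int, n: int) -> float:
    """Pointwise mutual information (base 2) of a bigram.

    f_xy: co-occurrence count of the two words as a bigram
    f_x, f_y: corpus frequencies of the first and second word
    n: corpus size (assumed here to be the total number of tokens)
    """
    if min(f_xy, f_x, f_y, n) <= 0:
        raise ValueError("all counts must be positive")
    # PMI = log2( P(x, y) / (P(x) * P(y)) ), rewritten in terms of counts.
    return math.log2((f_xy * n) / (f_x * f_y))


def jaccard(f_xy: int, f_x: int, f_y: int) -> float:
    """Jaccard coefficient: co-occurrences over occurrences of either word."""
    union = f_x + f_y - f_xy
    if union <= 0:
        raise ValueError("counts are inconsistent: empty union")
    return f_xy / union


# Hypothetical counts for one adjective+noun bigram (not data from the paper).
counts = {"f_xy": 50, "f_x": 200, "f_y": 400, "n": 1_000_000}
print("PMI:", pmi(**counts))
print("Jaccard:", jaccard(counts["f_xy"], counts["f_x"], counts["f_y"]))
```

Higher values of either measure indicate that the two words co-occur more often than their individual frequencies would suggest. PMI is known to favour rare word pairs, which is one reason it is often reported together with a second measure such as Jaccard.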
Research funding: This research was funded by the Polish National Science Centre under grant agreement No. UMO-2019/33/B/HS2/02814.
Resources
Dunaj, Bogusław (ed.). 1996. Słownik współczesnego języka polskiego. Warszawa: Wydawnictwo Wilga.
Pearsall, Judy (ed.). 2001. The new Oxford dictionary of English. Oxford: Oxford University Press. https://www.lexico.com/
SJPDor = Doroszewski, Witold (ed.). 1958–1969. Słownik języka polskiego, t. 1–11. Warszawa: PWN. http://doroszewski.pwn.pl/
WSJP = Żmigrodzki, Piotr (ed.). 2006–. Wielki słownik języka polskiego PAN. Kraków: IJP PAN. https://wsjp.pl/
Merriam-Webster = Merriam-Webster.com Dictionary. Merriam-Webster. https://www.merriam-webster.com/dictionary/
© 2023 Walter de Gruyter GmbH, Berlin/Boston