Lexicalisation of Polish and English word combinations: an empirical study

Marek Maziarz; Łukasz Grabowski; Tadeusz Piotrowski; Ewa Rudnicka; Maciej Piasecki

doi:10.1515/psicl-2023-2002

Published by De Gruyter Mouton February 27, 2023

From the journal Poznan Studies in Contemporary Linguistics

https://doi.org/10.1515/psicl-2023-2002

Abstract

One of the main research questions concerning multi-word expressions (MWEs) is which of them are transparent word combinations created ad hoc and which are multi-word lexical units (MWUs). In this paper, we use selected corpus-linguistic and machine-learning methods to determine which lexicalization criteria guide Polish and English lexicographers in deciding which MWEs (bigrams such as adjective+noun and noun+noun combinations) should be recorded in dictionaries as MWUs. We analyzed two samples, MWEs extracted from Polish and English monolingual dictionaries and MWEs created by the annotators, and tested two custom-designed criteria, intuition and paraphrase, also using statistical methods (measures of collocational strength: PMI and Jaccard). We found that Polish lexicographers tend not to include compositional MWEs as lexical entries in their dictionaries and that the criteria of paraphrase and intuition are important for them: if MWEs are not clearly and unambiguously paraphrasable and compositional, then they are recorded in dictionaries. In contrast to Polish lexicographers, English lexicographers also tend to record compositional and partly compositional MWEs.
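The abstract names PMI and Jaccard as measures of collocational strength but this preview does not show how they were computed. The sketch below illustrates the textbook definitions for adjacent word pairs, under stated assumptions: probabilities are estimated from raw counts in a tokenised text, PMI uses a base-2 logarithm, and Jaccard is taken as f(xy) / (f(x) + f(y) - f(xy)). The function name `association_scores` and the toy sentence are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def association_scores(tokens):
    """Return {(w1, w2): (pmi, jaccard)} for every adjacent word pair.

    Assumed definitions (not confirmed by the source):
      PMI(x, y)     = log2( P(x, y) / (P(x) * P(y)) )
      Jaccard(x, y) = f(xy) / (f(x) + f(y) - f(xy))
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)  # number of adjacent pairs

    scores = {}
    for (w1, w2), f12 in bigrams.items():
        p12 = f12 / n_bi
        p1 = unigrams[w1] / n_uni
        p2 = unigrams[w2] / n_uni
        pmi = math.log2(p12 / (p1 * p2))
        jaccard = f12 / (unigrams[w1] + unigrams[w2] - f12)
        scores[(w1, w2)] = (pmi, jaccard)
    return scores


# Toy input, purely illustrative.
tokens = "red tape is red tape and black tape".split()
scores = association_scores(tokens)
for pair in [("red", "tape"), ("black", "tape")]:
    pmi, jaccard = scores[pair]
    print(pair, round(pmi, 3), round(jaccard, 3))
```

In this toy text "red tape" scores higher on Jaccard than "black tape" because "red" never occurs without "tape". On a real corpus, both measures are sensitive to low frequencies, so a minimum-frequency threshold would normally be applied before ranking candidates.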

Keywords: (multi-word) lexical units; compositionality; Latent Class Analysis; lexicalisation; lexicography; multi-word expressions

Corresponding author: Ewa Rudnicka, Wrocław University of Science and Technology, Wrocław, Poland, E-mail: ewa.rudnicka@pwr.edu.pl

Funding source: Polish National Science Centre

Award Identifier / Grant number: UMO-2019/33/B/HS2/02814

Research funding: This research has been funded by the Polish National Science Centre under the grant agreement No UMO-2019/33/B/HS2/02814.

Resources

Dunaj, Bogusław (ed.). 1996. Słownik współczesnego języka polskiego. Warszawa: Wydawnictwo Wilga.

Pearsall, Judy (ed.). 2001. The new Oxford dictionary of English. Oxford: Oxford University Press. https://www.lexico.com/

SJPDor = Doroszewski, Witold (ed.). 1958–1969. Słownik języka polskiego, vols. 1–11. Warszawa: PWN. http://doroszewski.pwn.pl/

WSJP = Żmigrodzki, Piotr (ed.). 2006–. Wielki słownik języka polskiego PAN. Kraków: IJP PAN. https://wsjp.pl/

Merriam-Webster = Merriam-Webster.com Dictionary. Merriam-Webster. https://www.merriam-webster.com/dictionary/

Published Online: 2023-02-27

Published in Print: 2023-06-27
