Published by De Gruyter Mouton, September 26, 2017

Using token-based semantic vector spaces for corpus-linguistic analyses: From practical applications to tests of theoretical claims

  • Martin Hilpert and David Correia Saavedra

Abstract

This paper presents token-based semantic vector spaces as a tool that can be applied in corpus-linguistic analyses such as word sense comparisons, comparisons of synonymous lexical items, and matching of concordance lines with a given text. We demonstrate how token-based semantic vector spaces are created, and we illustrate the kinds of results that can be obtained with this approach. Our main argument is that token-based semantic vector spaces are useful not only for practical corpus-linguistic applications but also for the investigation of theory-driven questions. We illustrate this point with a discussion of the asymmetric priming hypothesis (Jäger and Rosenbach 2008). The asymmetric priming hypothesis, which states that grammaticalizing constructions will be primed by their lexical sources but not vice versa, makes a number of empirically testable predictions. We operationalize and test these predictions, concluding that token-based semantic vector spaces yield conclusions that are relevant for linguistic theory-building.

Funding statement: This work was supported by Schweizerischer Nationalfonds zur Förderung der Wissenschaftlichen Forschung (Grant/Award Number: ‘100015_149176/1’).

Appendix A: Results of the case studies on got, may, and since

Table 9:

Cross- and within-category sequences with got (χ² = 44.351, df = 1, p < 0.001).

                  target: lexical   target: modal
prime: lexical    151 (137.01)      11 (24.99)
prime: modal      19 (32.98)        20 (6.01)

Table 10:

Cross- and within-category sequences with may (χ² = 76.688, df = 1, p < 0.001).

                   target: deontic   target: epistemic
prime: deontic     34 (10.25)        32 (55.74)
prime: epistemic   21 (44.75)        267 (243.25)

Table 11:

Cross- and within-category sequences with since (χ² = 77.803, df = 1, p = 2.745e-11).

                  target: causal   target: temporal
prime: causal     39 (14.94)       7 (31.06)
prime: temporal   12 (36.06)       99 (74.94)

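The figures in Tables 9 to 11 can be checked with a short script. The sketch below is not the authors' code: it assumes that the values in parentheses are expected frequencies under independence and that the test statistic is a Pearson chi-squared with Yates' continuity correction (the source does not say which variant was used). Under those assumptions the computed values come out close to the reported ones. The function name `yates_chi_squared` and the dictionary `tables` are hypothetical names chosen for illustration.

```python
import math


def yates_chi_squared(table):
    """Chi-squared test with Yates' continuity correction for a 2x2 table.

    `table` is [[a, b], [c, d]] of observed counts (rows: prime, columns: target).
    Returns (expected, chi2, p). The use of Yates' correction is an assumption.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Expected count under independence: row total * column total / N.
    expected = [[r * c / n for c in col_totals] for r in row_totals]

    # Yates' correction shrinks each |observed - expected| by 0.5 before squaring.
    chi2 = sum(
        max(abs(o - e) - 0.5, 0.0) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )

    # For df = 1 the chi-squared survival function equals erfc(sqrt(chi2 / 2)).
    p = math.erfc(math.sqrt(chi2 / 2))
    return expected, chi2, p


# Observed counts copied from Tables 9 to 11.
tables = {
    "got": [[151, 11], [19, 20]],
    "may": [[34, 32], [21, 267]],
    "since": [[39, 7], [12, 99]],
}

for item, observed in tables.items():
    expected, chi2, p = yates_chi_squared(observed)
    rounded = [[round(e, 2) for e in row] for row in expected]
    print(item, rounded, round(chi2, 3), p)
```

In each table the within-category cells (lexical to lexical, modal to modal, and so on) hold more observations than expected, and the cross-category cells hold fewer.
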
Table 12:

Mean magnitudes of semantic leaps between prime and target.

        lex>lex            lex>gram           gram>lex           gram>gram
got     0.109 (sd 0.098)   0.150 (sd 0.154)   0.212 (sd 0.127)   0.077 (sd 0.056)
may     0.094 (sd 0.052)   0.112 (sd 0.082)   0.106 (sd 0.046)   0.126 (sd 0.089)
since   0.146 (sd 0.088)   0.176 (sd 0.086)   0.102 (sd 0.036)   0.109 (sd 0.069)

References

Bybee, Joan L., Revere Perkins & William Pagliuca. 1994. The evolution of grammar: Tense, aspect and modality in the languages of the world. Chicago: University of Chicago Press.

Davies, Mark. 2004. BYU-BNC. Based on the British National Corpus from Oxford University Press. Available online at http://corpus.byu.edu/bnc/.

Firth, John R. 1957. Papers in Linguistics 1934–1951. London: Oxford University Press.

Glynn, Dylan & Justyna Robinson. 2014. Corpus methods in cognitive semantics. Studies in synonymy and polysemy. Amsterdam: John Benjamins. https://doi.org/10.1075/hcp.43

Goldberg, Adele E. 2006. Constructions at work: The nature of generalization in language. Oxford: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199268511.001.0001

Heylen, Kris, Dirk Speelman & Dirk Geeraerts. 2012. Looking at word meaning. An interactive visualization of semantic vector spaces for Dutch synsets. In Proceedings of the EACL-2012 joint workshop of LINGVIS & UNCLH: Visualization of Language Patterns and Uncovering Language History from Multilingual Resources, 16–24.

Heylen, Kris, Thomas Wielfaert, Dirk Speelman & Dirk Geeraerts. 2015. Monitoring polysemy. Word space models as a tool for large-scale lexical semantic analysis. Lingua 157. 153–172. https://doi.org/10.1016/j.lingua.2014.12.001

Hilpert, Martin. 2008. Germanic future constructions. A usage-based approach to language change. Amsterdam: John Benjamins. https://doi.org/10.1075/cal.7

Hilpert, Martin & David Correia Saavedra. 2016. The unidirectionality of semantic changes in grammaticalization: An experimental approach to the asymmetric priming hypothesis. English Language and Linguistics. https://doi.org/10.1017/S1360674316000496 (accessed 12 September 2017).

Hopper, Paul J. & Elizabeth C. Traugott. 2003. Grammaticalization, 2nd edn. Cambridge: Cambridge University Press. https://doi.org/10.1017/CBO9781139165525

Izenman, Alan J. 2008. Modern multivariate statistical techniques. Regression, classification, and manifold learning. New York: Springer. https://doi.org/10.1007/978-0-387-78189-1

Jäger, Gerhard & Anette Rosenbach. 2008. Priming and unidirectional language change. Theoretical Linguistics 34(2). 85–113. https://doi.org/10.1515/THLI.2008.008

Jenset, Gard B. 2013. Mapping meaning with distributional methods. A diachronic corpus-based study of existential there. Journal of Historical Linguistics 3(2). 272–306. https://doi.org/10.1075/jhl.3.2.04jen

Kiela, Douwe & Stephen Clark. 2014. A systematic study of semantic vector space model parameters. Proceedings of EACL 2014, Second Workshop on Continuous Vector Space Models and their Compositionality (CVSC), Gothenburg, Sweden, 21–30.

Lebani, Gianluca & Alessandro Lenci. 2016. “Beware the Jabberwock, dear reader!” Testing the distributional reality of construction semantics. Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex-V), 8–18.

Leech, Geoffrey. 1992. 100 million words of English: The British National Corpus. Language Research 28(1). 1–13. https://doi.org/10.1017/S0266078400006854

Levshina, Natalia. 2015. How to do Linguistics with R: Data exploration and statistical analysis. Amsterdam: John Benjamins. https://doi.org/10.1075/z.195

Norde, Muriel. 2009. Degrammaticalization. Oxford: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199207923.001.0001

Perek, Florent. 2016. Using distributional semantics to study syntactic productivity in diachrony: A case study. Linguistics 54(1). 149–188. https://doi.org/10.1515/ling-2015-0043

Ruette, Tom, Dirk Speelman & Dirk Geeraerts. 2013. Lexical variation in aggregate perspective. In Augusto Soares Da Silva (ed.), Pluricentricity: Linguistic variation and sociocognitive dimensions, 95–116. Berlin: De Gruyter. https://doi.org/10.1515/9783110303643.103

Sagi, Eyal, Stefan Kaufmann & Brady Clark. 2011. Tracing semantic change with latent semantic analysis. In Justyna Robinson & Kathryn Allan (eds.), Current methods in historical semantics, 161–183. Berlin: De Gruyter. https://doi.org/10.1515/9783110252903.161

Schütze, Hinrich. 1998. Automatic word sense discrimination. Computational Linguistics 24(1). 97–124.

Traugott, Elizabeth Closs & Graeme Trousdale (eds.). 2010. Gradience, gradualness and grammaticalization. Amsterdam: John Benjamins. https://doi.org/10.1075/tsl.90

Traugott, Elizabeth Closs & Graeme Trousdale. 2013. Constructionalization and constructional changes. Oxford: Oxford University Press. https://doi.org/10.1093/acprof:oso/9780199679898.001.0001

Turney, Peter D. & Patrick Pantel. 2010. From frequency to meaning: Vector space models of semantics. Journal of Artificial Intelligence Research 37. 141–188. https://doi.org/10.1613/jair.2934

Wheeler, Eric S. 2005. Multidimensional scaling for linguistics. In Reinhard Koehler, Gabriel Altmann & Raimond G. Piotrowski (eds.), Quantitative linguistics. An international handbook, 548–553. Berlin: De Gruyter.

Published Online: 2017-09-26
Published in Print: 2020-10-25

© 2020 Walter de Gruyter GmbH, Berlin/Boston

Downloaded on 18.4.2024 from https://www.degruyter.com/document/doi/10.1515/cllt-2017-0009/html