Abstract
Corpus-based research has formed the backbone of linguistic research in recent decades. Large text corpora are used to address many kinds of linguistic problems, including those of quantitative linguistics, cognitive linguistics, and psycholinguistics. This paper reports the construction of two equally sized corpora of contemporary Vietnamese: subtlex-viet, a corpus of Vietnamese film subtitles, and genlex-viet, a general corpus drawn from a variety of online newspapers and stories. We document the general steps of constructing the corpora and extracting linguistic information from them, and provide a road map for others who would like to create similar corpora. The resulting corpora are available in three versions: plain text, tokenized, and POS-tagged. The second half of the paper describes the construction of a lexical database derived from the corpora. The database includes measures such as frequency of occurrence, dispersion, Mutual Information, and Inverse Document Frequency, as well as vector space measures based on Latent Semantic Analysis and Hyperspace Analogue to Language. We conclude by comparing the lexical predictors and validating them against psycholinguistic data from visual lexical decision experiments.
Data availability
Due to copyright issues, only certain portions of the texts in our corpora are freely available to the public, in the form of concordance lines upon request. The lexical databases, which are not subject to copyright restrictions, have been made available at http://era.library.ualberta.ca/files/j098zc38m for use by the research community.
Notes
We return to possible disadvantages of using translated subtitles below.
This idea is not new in linguistics; Firth (1957, p. 11) expressed it as “You shall know a word by the company it keeps.”
The original LSA divided its corpus into 30,000 episodes and counted the number of times each word appeared in each episode. Instead of assigning 30,000 individual values to each word, factor analysis reduces the number of values to about 300.
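The reduction described in this note can be sketched on a toy scale. The note calls the step factor analysis; Landauer and Dumais (1997) carry it out with singular value decomposition, which is what the sketch below uses. The function names, the 4 × 5 count matrix, and the choice of k = 2 are illustrative assumptions, and the weighting that LSA applies to the raw counts before the decomposition is omitted.

```python
import numpy as np

def lsa_word_vectors(counts: np.ndarray, k: int) -> np.ndarray:
    """Reduce a word-by-episode count matrix to k values per word via truncated SVD."""
    # counts = U @ diag(s) @ Vt, with singular values in descending order
    u, s, _vt = np.linalg.svd(counts, full_matrices=False)
    # Keep only the k strongest dimensions; each row is one word's reduced vector
    return u[:, :k] * s[:k]

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two word vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical toy data: 4 words (rows) counted in 5 episodes (columns)
counts = np.array([
    [2, 1, 0, 0, 0],  # word 0
    [1, 2, 0, 0, 0],  # word 1 occurs in the same episodes as word 0
    [0, 0, 1, 2, 1],  # word 2
    [0, 0, 2, 1, 1],  # word 3 occurs in the same episodes as word 2
], dtype=float)

vectors = lsa_word_vectors(counts, k=2)   # 5 values per word reduced to 2
sim_same = cosine(vectors[0], vectors[1])  # words sharing episodes
sim_diff = cosine(vectors[0], vectors[2])  # words with disjoint episodes
```

In this toy matrix, words that occur in the same episodes end up with nearly identical reduced vectors, while words with disjoint episodes end up nearly orthogonal. The original system applies the same idea with roughly 30,000 columns reduced to about 300.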
A monitor corpus is a growing, non-finite collection of texts, used primarily in lexicography. It tracks language change by growing at a constant rate while leaving untouched the relative weight of its components (i.e., its balance) as defined by the design parameters. The same composition schema should be followed year by year, the basis being a reference corpus of texts spoken or written in a single year. An example of an English monitor corpus is the COCA corpus (Davies 2010), which can be accessed at http://corpus.byu.edu/coca/.
References
Adelman, J. S., Brown, G. D. A., & Quesada, J. F. (2006). Contextual diversity, not word frequency, determines word-naming and lexical decision times. Psychological Science, 17(9), 814–823.
Baayen, R. H. (2001). Word frequency distributions. Dordrecht: Kluwer Academic Publishers.
Baayen, R. H., Feldman, L., & Schreuder, R. (2006). Morphological influences on the recognition of monosyllabic monomorphemic words. Journal of Memory and Language, 53, 496–512.
Baayen, R. H., Milin, P., Filipović Đurđević, D., Hendrix, P., & Marelli, M. (2011). An amorphous model for morphological processing in visual comprehension based on naive discriminative learning. Psychological Review, 118(3), 438–481.
Baayen, R. H., Piepenbrock, R., & Gulikers, L. (1995). The CELEX lexical database (CD-ROM). Philadelphia: Linguistic Data Consortium, University of Pennsylvania.
Balota, D. A., Cortese, M. J., Sergent-Marshall, S. D., Spieler, D. H., & Yap, M. J. (2004). Visual word recognition of single-syllable words. Journal of Experimental Psychology: General, 133, 283–316.
Berry-Rogghe, G. L. M. (1973). The computation of collocations and their relevance in lexical studies. In A. J. Aitken, R. W. Bailey, & N. Hamilton-Smith (Eds.), The computer and literary studies (pp. 103–112). Edinburgh: Edinburgh University Press.
Brysbaert, M., Mandera, P., & Keuleers, E. (2017). The word frequency effect in word processing: A review update. To appear in Current Directions in Psychological Science.
Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990.
Burgess, C., & Livesay, K. (1998). The effect of corpus size in predicting reaction time in a basic word recognition task: Moving on from Kučera and Francis. Behavior Research Methods, 30, 272–277.
Burgess, C., & Lund, K. (1998). The dynamics of meaning in memory. In E. Dietrich & A. Markman (Eds.), Cognitive dynamics: Conceptual change in humans and machines. Mahwah: Lawrence Erlbaum Associates.
Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character frequencies based on film subtitles. PloS One, 5(6), e10729.
Cantos Gómez, P. (2013). Statistical methods in language and linguistic research. Sheffield: Equinox Publishing Limited.
Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16, 22–29.
R Core Team (2013). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. ISBN 3-900051-07-0.
Crossley, S. A., Salsbury, T., McCarthy, P. M., & McNamara, D. S. (2008). LSA as a measure of coherence in second language natural discourse. In V. Sloutsky & B. K. M. Love (Eds.), Proceedings of the 30th Annual Meeting of the Cognitive Science Society. Washington, DC: Cognitive Science Society.
Cuetos, F., Glez-Nosti, M., Barbon, A., & Brysbaert, M. (2011). SUBTLEX-ESP: Spanish word frequencies based on film subtitles. Psicologica, 32, 133–143.
Davies, M. (2010). Corpus of contemporary American English (COCA). http://www.americancorpus.org/. Accessed 16 Feb 2014.
de Groot, A., & Hagoort, P. (2017). Research methods in psycholinguistics and the neurobiology of language: A practical guide. GMLZ—Guides to research methods in language and linguistics. New York: Wiley.
Delic, E. (2004). Présentation du Corpus de référence du Français parlé. Recherches sur le Français parlé, 18, 11–42.
Dimitropoulou, M., Duñabeitia, J. A., Avilés, A., Corral, J., & Carreiras, M. (2010). Subtitle-based word frequencies as the best estimate of reading behaviour: The case of Greek. Frontiers in Psychology, 1, 1–12.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19, 61–74.
Firth, J. R. (1957). Papers in linguistics, 1934–1951. London: Oxford University Press.
Gagné, C. L., & Shoben, E. J. (1997). Influence of thematic relations on the comprehension of modifier-noun combinations. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23, 71–87.
Gagné, C. L., Spalding, T. L., & Nisbet, K. A. (2016). Processing English compounds: Investigating semantic transparency. SKASE Journal of Theoretical Linguistics, 13(2), 2–22.
Gimenes, M., & New, B. (2016). Worldlex: Twitter and blog word frequencies for 66 languages. Behavior Research Methods, 48(3), 963–972.
Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437.
Gries, S. T. (2009). Dispersions and adjusted frequencies in corpora: Further explorations. Language and Computers, 71(1), 197–212.
Günther, F., Dudschig, C., & Kaup, B. (2016). Latent semantic analysis cosines as a cognitive similarity measure: Evidence from priming studies. The Quarterly Journal of Experimental Psychology, 69(4), 626–653.
Hasher, L., & Zacks, R. T. (1984). Automatic processing of fundamental information. The case of frequency of occurrence. American Psychologist, 39, 1372–1388.
Hoàng, P. (ed) (2000). Từ điển tiếng Việt [Vietnamese Dictionary]. Khoa học Xã hội, Hà Nội. Viện Ngôn ngữ học.
Juilland, A., Brodin, D., & Davidovitch, C. (1970). Frequency dictionary of French words. Hague: Romance languages and their structures.
Keuleers, E., & Brysbaert, M. (2010). Wuggy: A multilingual pseudoword generator. Behavior Research Methods, 42(3), 627–633.
Keuleers, E., Brysbaert, M., & New, B. (2010). SUBTLEX-NL: A new measure for Dutch word frequency based on film subtitles. Behavior Research Methods, 42(3), 643–650.
Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence: Brown University Press.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104(2), 211–240.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to latent semantic analysis. Discourse Processes, 25, 259–284.
Lê, H. P., Nguyen, T. M. H., Roussanaly, A., & Ho, V. (2008). A hybrid approach to word segmentation of Vietnamese texts. In C. Martin-Vide, F. Otto, & H. Fernau (Eds.), Language and automata theory and applications (Lecture Notes in Computer Science, Vol. 5196, pp. 240–249). Berlin, Heidelberg: Springer.
Le, D.-T., & Quasthoff, U. (2016). Construction and analysis of a large Vietnamese text corpus. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Paris: European Language Resources Association (ELRA).
Lê, H. P., Roussanaly, A., Nguyen, T. M. H., & Rossignol, M. (2010). An empirical study of maximum entropy approach for part-of-speech tagging of Vietnamese texts. In Traitement Automatique des Langues Naturelles (TALN 2010) (p. 12), Montréal, Canada. ATALA (Association pour le Traitement Automatique des Langues).
Libben, G., Gibson, M., Yoon, Y. B., & Sandra, D. (2003). Compound fracture: The role of semantic transparency and morphological headedness. Brain and Language, 84, 50–64.
Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical cooccurrence. Behavior Research Methods Instruments and Computers, 28(2), 203–208.
Lund, K., Burgess, C., & Atchley, R. A. (1995). Semantic and associative priming in high-dimensional semantic space. In Proceedings of the 17th annual conference of the Cognitive Science Society (pp. 660–665), Hillsdale: Erlbaum.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: Part I. An account of the basic findings. Psychological Review, 88, 375–407.
McDonald, S. A., & Shillcock, R. C. (2001). Rethinking the word frequency effect: The neglected role of distributional information in lexical processing. Language and Speech, 44(3), 295–323.
New, B., Brysbaert, M., Veronis, J., & Pallier, C. (2007). The use of film subtitles to estimate word frequencies. Applied Psycholinguistics, 28(04), 661–677.
New, B., Pallier, C., Brysbaert, M., & Ferrand, L. (2004). Lexique 2: A new French lexical database. Behavior Research Methods, Instruments, and Computers, 36, 516–524.
Nguyễn, Đ. D., & Lê, Q. T. (1980). Dictionnaire de fréquence du Vietnamien. Paris: Université de Paris VII.
Oakes, M. (1998). Statistics for corpus linguistics. Edinburgh: Edinburgh University Press.
Ooi, V. (1998). Computer corpus lexicography. Edinburgh: Edinburgh University Press.
Petersen, S. E., Fox, P. T., Posner, M. I., Mintun, M., & Raichle, M. E. (1988). Positron emission tomographic studies of the cortical anatomy of single-word processing. Nature, 331(6157), 585–589.
Petersen, S. E., Fox, P. T., Posner, M. I., Mintun, M., & Raichle, M. E. (1989). Positron emission tomographic studies of the processing of single words. Journal of Cognitive Neuroscience, 1(2), 153–170.
Pham, H., & Baayen, H. R. (2013). Semantic relations and compound transparency: A regression study in CARIN theory. Psihologija, 46(4), 455–478.
Pham, H., & Baayen, H. R. (2015). Vietnamese compounds show an anti-frequency effect in visual lexical decision. Language, Cognition & Neuroscience, 30(9), 1077–1095.
Pham, H., Bolger, P., & Baayen, R. H. (2012). Vietnamese word and syllabeme (syllable-morpheme) frequencies: A corpus and lexical decision study. In SEALS 22. Agay, France.
Pham, G., Kohnert, K., & Carney, E. (2008). Corpora of Vietnamese texts: Lexical effects of intended audience and publication place. Behavior Research Methods, 40(1), 154–163.
Pinker, S. (1999). Words and rules: The ingredients of language. New York: Basic Books.
Rayson, P., & Garside, R. (2000). Comparing corpora using frequency profiling. In Proceedings of the Workshop on Comparing Corpora, held in conjunction with ACL 2000, October 2000, Hong Kong (pp. 1–6).
Read, T., & Cressie, N. (1988). Goodness-of-fit statistics for discrete multivariate data. New York: Springer.
Scott, M., & Tribble, C. (2006). Textual patterns: Key words and corpus analysis in language education. Amsterdam: John Benjamins.
Shaoul, C., & Westbury, C. (2006). Word frequency effects in high-dimensional co-occurrence models: A new approach. Behavior Research Methods, 38, 190–195.
Southeast Asian Languages Library, S. (2009). Vietnamese text corpus. http://sealang.net/vietnamese/corpus.htm. Accessed 16 Feb 2014.
Trung tâm từ điển học, V. (1998). Vietnamese corpus. http://vietlex.com/kho-ngu-lieu. Accessed 16 Feb 2014.
van Heuven, W. J. B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. The Quarterly Journal of Experimental Psychology, 67(6), 1176–1190.
Wild, F. (2011). LSA: Latent semantic analysis. R package version 0.63-3.
Yap, M. J., & Balota, D. A. (2009). Visual word recognition of multisyllabic words. Journal of Memory and Language, 60(4), 502–529.
Zipf, G. K. (1935). The psycho-biology of language. Boston: Houghton Mifflin.
Acknowledgements
This research is funded by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 602.10-2016.05.
Appendix
POS tags used in the corpora
ID | POS-tags | POS in English | POS in Vietnamese |
---|---|---|---|
1 | Np | Proper noun | danh từ riêng |
2 | Nc | Classifier noun | danh từ chỉ loại |
3 | Nu | Unit noun | danh từ đơn vị |
4 | N | Common noun | danh từ chung |
5 | V | Verb | động từ |
6 | A | Adjective | tính từ |
7 | P | Pronoun | đại từ |
8 | R | Adverb | phó từ |
9 | L | Determiner | định từ |
10 | M | Numeral | số từ |
11 | E | Preposition | giới từ |
12 | C | Subordinating conjunction | liên từ phụ |
13 | CC | Coordinating conjunction | liên từ kết hợp |
14 | I | Interjection | từ cảm thán |
15 | T | Auxiliary word, modal words | trợ từ |
16 | Y | Abbreviation | từ viết tắt |
17 | Z | Bound morphemes | yếu tố cấu tạo từ (bất, vô, ...) |
18 | X | Undetermined | không (hoặc chưa) xác định |
POS-tagged XML sample

    <doc>
    …
    <s>
      <w pos="P">Đây</w>
      <w pos="V">là</w>
      <w pos="N">câu_chuyện</w>
      <w pos="E">về</w>
      <w pos="M">một</w>
      <w pos="N">chàng_trai</w>
      <w pos="V">gặp</w>
      <w pos="M">một</w>
      <w pos="N">cô_gái</w>
      <w pos=".">.</w>
    </s>
    <s>
      <w pos="N">Ngày</w>
      <w pos="N">thứ</w>
      <w pos="M">nhất</w>
    </s>
    <s>
      <w pos="N">Chàng_trai</w>
      <w pos=",">,</w>
      <w pos="Np">Tom_Hansen</w>
      <w pos=",">,</w>
      <w pos="V">sinh_ra</w>
      <w pos="E">ở</w>
      <w pos="Np">Margate</w>
      <w pos=",">,</w>
      <w pos="Np">New_Jersey</w>
      <w pos=",">,</w>
    </s>
    <s>
      <w pos="A">lớn</w>
      <w pos="R">lên</w>
      <w pos="CC">và</w>
      <w pos="V">tin</w>
      <w pos="C">rằng</w>
      <w pos="M">…</w>
    </s>
    <s>
      <w pos="P">mình</w>
      <w pos="R">sẽ</w>
      <w pos="V">không_bao_giờ</w>
      <w pos="V">có</w>
      <w pos="R">được</w>
      <w pos="N">hạnh_phúc</w>
      <w pos="A">thực_sự</w>
      <w pos="…">…</w>
    </s>
    <s>
      <w pos="E">cho_đến</w>
      <w pos="N">ngày</w>
      <w pos="V">gặp</w>
      <w pos="R">được</w>
      <w pos="“">“</w>
      <w pos="N">người</w>
      <w pos="P">ấy</w>
      <w pos="“">“</w>
      <w pos=".">.</w>
    </s>
    …
    </doc>
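
A file in this format can be read back into (token, tag) pairs with a few lines of code. The following is a minimal sketch, assuming the POS-tagged version is well-formed XML with `doc`, `s`, and `w` elements as in the sample above; the function name and the embedded fragment are illustrative, not part of the released corpora.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modelled on the first sentence of the sample
SAMPLE = """<doc>
<s>
<w pos="P">Đây</w>
<w pos="V">là</w>
<w pos="N">câu_chuyện</w>
<w pos="E">về</w>
<w pos="M">một</w>
<w pos="N">chàng_trai</w>
</s>
</doc>"""

def read_tagged_sentences(xml_text):
    """Return one list of (token, pos) pairs per <s> element."""
    root = ET.fromstring(xml_text)
    return [
        [(w.text, w.get("pos")) for w in s.iter("w")]
        for s in root.iter("s")
    ]

sentences = read_tagged_sentences(SAMPLE)

# Multi-syllable words are joined with "_" by the tokenizer; restore the spaces
nouns = [tok.replace("_", " ") for tok, pos in sentences[0] if pos == "N"]
```

Tokens whose tag is itself a quotation mark would need escaping before a strict XML parser accepts them, so real files may require a preprocessing step that this sketch does not cover.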
Dispersion measures
Word | FREQ | RANGE | MAXMIN | SD | VARCOEFF | CHISQUARE | D_EQ | D_UNEQ |
---|---|---|---|---|---|---|---|---|
cảnh gần | 8.00 | 8.00 | 1.00 | 0.01 | 147.34 | 220,704.86 | 0.65 | 0.53 |
cánh phấn | 2.00 | 2.00 | 1.00 | 0.00 | 294.69 | 15,572.46 | 0.29 | 0.23 |
cảnh sát | 25,695.00 | 12,434.00 | 46.00 | 0.76 | 5.16 | 869,451.64 | 0.99 | 0.99 |
cao lương | 100.00 | 49.00 | 23.00 | 0.08 | 135.55 | 575,687.79 | 0.67 | 0.75 |
cặp lồng | 30.00 | 23.00 | 3.00 | 0.02 | 94.21 | 191,176.18 | 0.77 | 0.64 |
cạp nia | 9.00 | 7.00 | 3.00 | 0.01 | 179.34 | 137,331.11 | 0.57 | 0.51 |

Word | D2 | S_EQ | S_UNEQ | D3 | DC | IDF | ENGVALL | U_EQ | U_UNEQ | UM_CARR |
---|---|---|---|---|---|---|---|---|---|---|
cảnh gần | 0.17 | 0.00 | 0.00 | −5426.44 | 0.00 | 14.41 | 0.00 | 5.17 | 4.28 | 1.38 |
cánh phấn | 0.06 | 0.00 | 0.00 | −21,709.50 | 0.00 | 16.41 | 0.00 | 0.59 | 0.46 | 0.11 |
cảnh sát | 0.76 | 0.06 | 0.05 | −5.66 | 0.06 | 3.80 | 1839.48 | 25,376.75 | 25,394.28 | 19,446.87 |
cao lương | 0.26 | 0.00 | 0.00 | −4592.74 | 0.00 | 11.79 | 0.03 | 67.47 | 74.61 | 25.61 |
cặp lồng | 0.25 | 0.00 | 0.00 | −2218.07 | 0.00 | 12.88 | 0.00 | 23.22 | 19.08 | 7.61 |
cạp nia | 0.15 | 0.00 | 0.00 | −8039.77 | 0.00 | 14.60 | 0.00 | 5.13 | 4.63 | 1.37 |

Word | AF_EQ | AF_UNEQ | Ur_KROM | F_ARF | AWT | F_AWT | ALD | F_ALD | DP | DPnorm |
---|---|---|---|---|---|---|---|---|---|---|
cảnh gần | 0.00 | 0.00 | 8.00 | 5.63 | 7,427,879.64 | 5.72 | 7.13 | 6.27 | 1.00 | 1.00 |
cánh phấn | 0.00 | 0.00 | 2.00 | 1.20 | 34,989,851.57 | 1.21 | 7.79 | 1.38 | 1.00 | 1.00 |
cảnh sát | 1606.13 | 1393.76 | 17,028.34 | 9384.85 | 20,117.13 | 2111.81 | 4.15 | 5980.62 | 0.93 | 0.93 |
cao lương | 0.02 | 0.08 | 58.00 | 34.54 | 2,075,863.59 | 20.46 | 6.49 | 27.38 | 1.00 | 1.00 |
cặp lồng | 0.00 | 0.02 | 26.33 | 15.24 | 3,737,474.52 | 11.37 | 6.77 | 14.35 | 1.00 | 1.00 |
cạp nia | 0.00 | 0.00 | 7.83 | 4.32 | 11,980,483.74 | 3.55 | 7.30 | 4.29 | 1.00 | 1.00 |

Abbreviation | Measure |
---|---|
FREQ | Observed frequency of word w |
RANGE | Number of parts with word w |
MAXMIN | Max. freq. of w per part − min. freq. of w per part |
SD | Standard deviation of frequencies |
VARCOEFF | Variation coefficient of frequencies |
CHISQUARE | Chi square value of the frequency distribution |
D_EQ | Juilland et al.’s D (assuming equal parts) |
D_UNEQ | Juilland et al.’s D (not assuming equal parts) |
D2 | Carroll’s D2 |
S_EQ | Rosengren’s S (assuming equal parts) |
S_UNEQ | Rosengren’s S (not assuming equal parts) |
D3 | Lyne’s D3 |
DC | Distributional Consistency |
IDF | Inverse Document Frequency |
ENGVALL | Engvall’s measure |
U_EQ | Juilland et al.’s usage coefficient U (assuming equal parts) |
U_UNEQ | Juilland et al.’s usage coefficient U (not assuming equal parts) |
UM_CARR | Carroll’s Um |
AF_EQ | Rosengren’s Adjusted Frequency AF (assuming equal parts) |
AF_UNEQ | Rosengren’s Adjusted Frequency AF (not assuming equal parts) |
Ur_KROM | Kromer’s UR |
F_ARF | Savický and Hlaváčová’s fARF |
AWT | Savický and Hlaváčová’s AWT |
F_AWT | Savický and Hlaváčová’s fAWT |
ALD | Savický and Hlaváčová’s ALD |
F_ALD | Savický and Hlaváčová’s fALD |
SELF_DISP | Washtell’s self-dispersion |
DP | Gries’s Deviation of Proportions |
DPnorm | Gries’s Deviation of Proportions (normalized) |
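
Several of the measures listed above can be computed directly from a word's per-part frequencies. The sketch below is a hedged illustration rather than the procedure used to build the database: it follows the definitions surveyed in Gries (2008), takes the normalization of DP to be division by one minus the smallest expected proportion, and assumes a base-2 logarithm for IDF. The function name and the toy figures are hypothetical.

```python
import math

def dispersion(counts, sizes):
    """Compute a few dispersion measures for one word.

    counts[i] is the word's frequency in corpus part i and
    sizes[i] is the total number of tokens in part i.
    Assumes at least two parts and at least one occurrence.
    """
    n = len(counts)
    freq = sum(counts)                        # FREQ
    rng = sum(1 for c in counts if c > 0)     # RANGE
    mean = freq / n
    # Population standard deviation of the per-part frequencies
    sd = math.sqrt(sum((c - mean) ** 2 for c in counts) / n)
    varcoeff = sd / mean                      # VARCOEFF
    d_eq = 1 - varcoeff / math.sqrt(n - 1)    # Juilland et al.'s D, equal parts
    total = sum(sizes)
    expected = [s / total for s in sizes]     # share of the corpus in each part
    # Gries's DP: half the summed gap between observed and expected shares
    dp = 0.5 * sum(abs(c / freq - e) for c, e in zip(counts, expected))
    return {
        "FREQ": freq,
        "RANGE": rng,
        "SD": sd,
        "VARCOEFF": varcoeff,
        "D_EQ": d_eq,
        "U_EQ": d_eq * freq,                  # usage coefficient U = D * f
        "IDF": math.log2(n / rng),            # assumed base-2 logarithm
        "DP": dp,
        "DPnorm": dp / (1 - min(expected)),   # assumed normalization
    }

# Hypothetical word spread over five parts of slightly different sizes
spread = dispersion([1, 2, 3, 4, 5], [9, 10, 10, 10, 11])
# Hypothetical word concentrated in one of four equal parts
clumped = dispersion([4, 0, 0, 0], [10, 10, 10, 10])
```

The two toy words show the direction of each scale: D approaches 1 and DP approaches 0 for evenly spread words, while a word confined to a single part has D near 0 and DPnorm of 1.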
Cite this article
Pham, H., Tucker, B.V. & Baayen, R.H. Constructing two Vietnamese corpora and building a lexical database. Lang Resources & Evaluation 53, 465–498 (2019). https://doi.org/10.1007/s10579-019-09451-x