Constructing two vietnamese corpora and building a lexical database

Pham, Hien; Tucker, Benjamin V.; Baayen, R. Harald

doi:10.1007/s10579-019-09451-x

Project Notes
Published: 21 March 2019

Volume 53, pages 465–498, (2019)

Language Resources and Evaluation

Abstract

Corpus-based research has formed the backbone of linguistic research in recent decades. Large text corpora are used to address many kinds of linguistic problems, including those of quantitative linguistics, cognitive linguistics, and psycholinguistics. This paper reports the construction of two equally sized corpora of contemporary Vietnamese: a corpus of Vietnamese film subtitles (subtlex-viet) and a general corpus drawn from a variety of online newspapers and stories (genlex-viet). We document the general steps of constructing the corpora and extracting linguistic information from them, and provide a road map for others who would like to create similar corpora. The resultant corpora are available in three versions: plain text, tokenized, and POS tagged. In the second half of the paper, the construction of a lexical database derived from the corpora is described. The database includes measures such as frequency of occurrence, dispersion, Mutual Information, and Inverse Document Frequency, as well as vector space measures based on Latent Semantic Analysis and Hyperspace Analogue to Language. We conclude by reporting a comparison of the lexical predictors and a validation using psycholinguistic data from visual lexical decision experiments.

Data availability

Due to copyright restrictions, only certain portions of the texts in our corpora are freely available to the public, in the form of concordance lines provided upon request. The lexical databases, which are not subject to copyright restrictions, have been made available at http://era.library.ualberta.ca/files/j098zc38m for use by the research community.

Notes

We return to possible disadvantages of using translated subtitles below.
http://www.linguistics.ucsb.edu/faculty/stgries/research/dispersion/_dispersion1.r.
This idea is not new in linguistics. Firth (1957, p. 11) expressed it as “You shall know a word by the company it keeps.”
The original LSA divided its corpus into 30,000 episodes and counted the number of times each word appeared in each episode. Instead of assigning 30,000 individual values to each word, factor analysis reduces the number of values to about 300.
A monitor corpus is a growing, non-finite collection of texts, used primarily in lexicography. It tracks language change by growing at a constant rate while leaving the relative weight of its components (i.e., its balance), as defined by the sampling parameters, unchanged. The same composition schema should be followed year by year, the basis being a reference corpus with texts spoken or written in one single year. An example of an English monitor corpus is the COCA corpus (Davies 2010), which can be accessed at http://corpus.byu.edu/coca/.
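The distributional idea in the notes above can be illustrated with a minimal sketch. It covers only the first stage described there, counting how often each word occurs in each episode, and then compares words by the cosine of their raw count vectors. The reduction to about 300 dimensions is deliberately omitted, so this is not LSA itself. The three toy episodes and the function names `count_vectors` and `cosine` are hypothetical and are not taken from the corpora.

```python
import math
from collections import Counter


def count_vectors(episodes):
    """Map each word to a sparse vector {episode index: count}."""
    vectors = {}
    for i, episode in enumerate(episodes):
        for word in episode.split():
            vectors.setdefault(word, Counter())[i] += 1
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy episodes; multi-syllable words are joined by underscores.
episodes = [
    "cảnh_sát bắt tội_phạm",
    "cảnh_sát điều_tra tội_phạm",
    "cô_gái gặp chàng_trai",
]

vectors = count_vectors(episodes)
same_company = cosine(vectors["cảnh_sát"], vectors["tội_phạm"])
no_company = cosine(vectors["cảnh_sát"], vectors["cô_gái"])
```

In this toy data the first pair of words occurs in exactly the same episodes, so their cosine is close to 1, while the second pair never shares an episode and has a cosine of 0.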

References

Adelman, J. S., Brown, G. D. A., & Quesada, J. F. (2006). Contextual diversity, not word frequency, determines word-naming and lexical decision times. Psychological Science, 17(9), 814–823.
Baayen, R. H. (2001). Word frequency distributions. Dordrecht: Kluwer Academic Publishers.
Baayen, R. H., Feldman, L., & Schreuder, R. (2006). Morphological influences on the recognition of monosyllabic monomorphemic words. Journal of Memory and Language, 53, 496–512.
Baayen, R. H., Milin, P., Filipović Đurđević, D., Hendrix, P., & Marelli, M. (2011). An amorphous model for morphological processing in visual comprehension based on naive discriminative learning. Psychological Review, 118(3), 438–481.
Baayen, R. H., Piepenbrock, R., & Gulikers, L. (1995). The CELEX lexical database (CD-ROM). Philadelphia: Linguistic Data Consortium, University of Pennsylvania.
Balota, D. A., Cortese, M. J., Sergent-Marshall, S. D., Spieler, D. H., & Yap, M. J. (2004). Visual word recognition of single-syllable words. Journal of Experimental Psychology: General, 133, 283–316.
Berry-Rogghe, G. L. M. (1973). The computation of collocations and their relevance in lexical studies. In A. J. Aitken, R. W. Bailey, & N. Hamilton-Smith (Eds.), The computer and literary studies (pp. 103–112). Edinburgh: Edinburgh University Press.
Brysbaert, M., Mandera, P., & Keuleers, E. (2017). The word frequency effect in word processing: A review update. To be published in Current Directions in Psychological Science.
Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990.
Burgess, C., & Livesay, K. (1998). The effect of corpus size in predicting reaction time in a basic word recognition task: Moving on from Kučera and Francis. Behavior Research Methods, 30, 272–277.
Burgess, C., & Lund, K. (1998). The dynamics of meaning in memory. In E. Dietrich & A. Markman (Eds.), Cognitive dynamics: Conceptual change in humans and machines. Mahwah: Lawrence Erlbaum Associates.
Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character frequencies based on film subtitles. PloS One, 5(6), e10729.
Cantos Gómez, P. (2013). Statistical methods in language and linguistic research. Sheffield: Equinox Publishing Limited.
Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16, 22–29.
R Core Team (2013). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. ISBN 3-900051-07-0.
Crossley, S. A., Salsbury, T., McCarthy, P. M., & McNamara, D. S. (2008). LSA as a measure of coherence in second language natural discourse. In V. Sloutsky & B. K. M. Love (Eds.), Proceedings of the 30th Annual Meeting of the Cognitive Science Society. Washington, DC: Cognitive Science Society.
Cuetos, F., Glez-Nosti, M., Barbon, A., & Brysbaert, M. (2011). SUBTLEX-ESP: Spanish word frequencies based on film subtitles. Psicologica, 32, 133–143.
Davies, M. (2010). Corpus of contemporary American English (COCA). http://www.americancorpus.org/. Accessed 16 Feb 2014.
de Groot, A., & Hagoort, P. (2017). Research methods in psycholinguistics and the neurobiology of language: A practical guide. GMLZ—Guides to research methods in language and linguistics. New York: Wiley.
Delic, E. (2004). Présentation du Corpus de référence du Français parlé. Recherches sur le Français parlé, 18, 11–42.
Dimitropoulou, M., Duñabeitia, J. A., Avilés, A., Corral, J., & Carreiras, M. (2010). Subtitle-based word frequencies as the best estimate of reading behaviour: The case of Greek. Frontiers in Psychology, 1, 1–12.
Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19, 61–74.
Firth, J. R. (1957). Papers in linguistics, 1934–1951. London: Oxford University Press.
Gagné, C. L., & Shoben, E. J. (1997). Influence of thematic relations on the comprehension of modifier-noun combinations. Journal of Experimental Psychology. Learning, Memory, and Cognition, 23, 71–87.
Gagné, C. L., Spalding, T. L., & Nisbet, K. A. (2016). Processing English compounds: Investigating semantic transparency. SKASE Journal of Theoretical Linguistics, 13(2), 2–22.
Gimenes, M., & New, B. (2016). Worldlex: Twitter and blog word frequencies for 66 languages. Behavior Research Methods, 48(3), 963–972.
Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437.
Gries, S. T. (2009). Dispersions and adjusted frequencies in corpora: Further explorations. Language and Computers, 71(1), 197–212.
Günther, F., Dudschig, C., & Kaup, B. (2016). Latent semantic analysis cosines as a cognitive similarity measure: Evidence from priming studies. The Quarterly Journal of Experimental Psychology, 69(4), 626–653.
Hasher, L., & Zacks, R. T. (1984). Automatic processing of fundamental information. The case of frequency of occurrence. American Psychologist, 39, 1372–1388.
Hoàng, P. (ed) (2000). Từ điển tiếng Việt [Vietnamese Dictionary]. Khoa học Xã hội, Hà Nội. Viện Ngôn ngữ học.
Juilland, A., Brodin, D., & Davidovitch, C. (1970). Frequency dictionary of French words. Hague: Romance languages and their structures.
Keuleers, E., & Brysbaert, M. (2010). Wuggy: A multilingual pseudoword generator. Behavior Research Methods, 42(3), 627–633.
Keuleers, E., Brysbaert, M., & New, B. (2010). SUBTLEX-NL: A new measure for Dutch word frequency based on film subtitles. Behavior Research Methods, 42(3), 643–650.
Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence: Brown University Press.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104(2), 211–240.
Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to latent semantic analysis. Discourse Processes, 25, 259–284.
Lê, H. P., Nguyen, T. M. H., Roussanaly, A., & Ho, V. (2008). A hybrid approach to word segmentation of Vietnamese texts. In C. Martin-Vide, F. Otto, & H. Fernau (Eds.), Language and automata theory and applications (Lecture Notes in Computer Science, Vol. 5196, pp. 240–249). Berlin, Heidelberg: Springer.
Le, D.-T., & Quasthoff, U. (2016). Construction and analysis of a large Vietnamese text corpus. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Paris: European Language Resources Association (ELRA).
Lê, H. P., Roussanaly, A., Nguyen, T. M. H., & Rossignol, M. (2010). An empirical study of maximum entropy approach for part-of-speech tagging of Vietnamese texts. In Traitement Automatique des Langues Naturelles—TALN 2010 (p. 12), Montréal Canada. ATALA (Association pour le Traitement Automatique des Langues).
Libben, G., Gibson, M., Yoon, Y. B., & Sandra, D. (2003). Compound fracture: The role of semantic transparency and morphological headedness. Brain and Language, 84, 50–64.
Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical cooccurrence. Behavior Research Methods Instruments and Computers, 28(2), 203–208.
Lund, K., Burgess, C., & Atchley, R. A. (1995). Semantic and associative priming in high-dimensional semantic space. In Proceedings of the 17th annual conference of the Cognitive Science Society (pp. 660–665), Hillsdale: Erlbaum.
McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: Part I. An account of the basic findings. Psychological Review, 88, 375–407.
McDonald, S. A., & Shillcock, R. C. (2001). Rethinking the word frequency effect: The neglected role of distributional information in lexical processing. Language and Speech, 44(3), 295–323.
New, B., Brysbaert, M., Veronis, J., & Pallier, C. (2007). The use of film subtitles to estimate word frequencies. Applied Psycholinguistics, 28(04), 661–677.
New, B., Pallier, C., Brysbaert, M., & Ferrand, L. (2004). Lexique 2: A new French lexical database. Behavior Research Methods, Instruments, and Computers, 36, 516–524.
Nguyễn, Đ. D., & Lê, Q. T. (1980). Dictionnaire de fréquence du Vietnamien. Paris: Université de Paris VII.
Oakes, M. (1998). Statistics for corpus linguistics. Edinburgh: Edinburgh University Press.
Ooi, V. (1998). Computer corpus lexicography. Edinburgh: Edinburgh University Press.
Petersen, S. E., Fox, P. T., Posner, M. I., Mintun, M., & Raichle, M. E. (1988). Positron emission tomographic studies of the cortical anatomy of single-word processing. Nature, 331(6157), 585–589.
Petersen, S. E., Fox, P. T., Posner, M. I., Mintun, M., & Raichle, M. E. (1989). Positron emission tomographic studies of the processing of single words. Journal of Cognitive Neuroscience, 1(2), 153–170.
Pham, H., & Baayen, R. H. (2013). Semantic relations and compound transparency: A regression study in CARIN theory. Psihologija, 46(4), 455–478.
Pham, H., & Baayen, R. H. (2015). Vietnamese compounds show an anti-frequency effect in visual lexical decision. Language, Cognition & Neuroscience, 30(9), 1077–1095.
Pham, H., Bolger, P., & Baayen, R. H. (2012). Vietnamese word and syllabeme (syllable-morpheme) frequencies: A corpus and lexical decision study. In SEALS 22. Agay, France.
Pham, G., Kohnert, K., & Carney, E. (2008). Corpora of Vietnamese texts: Lexical effects of intended audience and publication place. Behavior Research Methods, 40(1), 154–163.
Pinker, S. (1999). Words and rules: The ingredients of language. New York: Basic Books.
Rayson, P., & Garside, R. (2000). Comparing corpora using frequency profiling. In Proceedings of the workshop on Comparing Corpora, held in conjunction with ACL 2000. October 2000, Hong Kong (pp. 1–6).
Read, T., & Cressie, N. (1988). Goodness-of-fit statistics for discrete multivariate data. New York: Springer.
Scott, M., & Tribble, C. (2006). Textual patterns: Key words and corpus analysis in language education. Amsterdam: John Benjamins.
Shaoul, C., & Westbury, C. (2006). Word frequency effects in high-dimensional co-occurrence models: A new approach. Behavior Research Methods, 38, 190–195.
Southeast Asian Languages Library, S. (2009). Vietnamese text corpus. http://sealang.net/vietnamese/corpus.htm. Accessed 16 Feb 2014.
Trung tâm từ điển học, V. (1998). Vietnamese corpus. http://vietlex.com/kho-ngu-lieu. Accessed 16 Feb 2014.
van Heuven, W. J. B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. The Quarterly Journal of Experimental Psychology, 67(6), 1176–1190.
Wild, F. (2011). LSA: Latent semantic analysis. R package version 0.63-3.
Yap, M. J., & Balota, D. A. (2009). Visual word recognition of multisyllabic words. Journal of Memory and Language, 60(4), 502–529.
Zipf, G. K. (1935). The psycho-biology of language. Boston: Houghton Mifflin.

Acknowledgements

This research is funded by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 602.10-2016.05.

Author information

Authors and Affiliations

Institute of Linguistics, Vietnam Academy of Social Sciences, Ba Đình, Hà Nội, Vietnam
Hien Pham
University of Alberta, Edmonton, Canada
Benjamin V. Tucker & R. Harald Baayen
University of Tübingen, Tübingen, Germany
R. Harald Baayen

Corresponding author

Correspondence to Hien Pham.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

POS tags used in the corpora

ID	POS-tags	POS in English	POS in Vietnamese
1	Np	Proper noun	danh từ riêng
2	Nc	Classifier noun	danh từ chỉ loại
3	Nu	Unit noun	danh từ đơn vị
4	N	Common noun	danh từ chung
5	V	Verb	động từ
6	A	Adjective	tính từ
7	P	Pronoun	đại từ
8	R	Adverb	phó từ
9	L	Determiner	định từ
10	M	Numeral	số từ
11	E	Preposition	giới từ
12	C	Subordinating conjunction	liên từ phụ
13	CC	Coordinating conjunction	liên từ kết hợp
14	I	Interjection	từ cảm thán
15	T	Auxiliary word, modal word	trợ từ
16	Y	Abbreviation	từ viết tắt
17	Z	Bound morpheme	yếu tố cấu tạo từ (bất, vô, ...)
18	X	Undetermined	không (hoặc chưa) xác định

POS-tagged XML sample

<doc>
	…
	<s>
		<w pos="P">Đây</w>
		<w pos="V">là</w>
		<w pos="N">câu_chuyện</w>
		<w pos="E">về</w>
		<w pos="M">một</w>
		<w pos="N">chàng_trai</w>
		<w pos="V">gặp</w>
		<w pos="M">một</w>
		<w pos="N">cô_gái</w>
		<w pos=".">.</w>
	</s>
	<s>
		<w pos="N">Ngày</w>
		<w pos="N">thứ</w>
		<w pos="M">nhất</w>
	</s>
	<s>
		<w pos="N">Chàng_trai</w>
		<w pos=",">,</w>
		<w pos="Np">Tom_Hansen</w>
		<w pos=",">,</w>
		<w pos="V">sinh_ra</w>
		<w pos="E">ở</w>
		<w pos="Np">Margate</w>
		<w pos=",">,</w>
		<w pos="Np">New_Jersey</w>
		<w pos=",">,</w>
	</s>
	<s>
		<w pos="A">lớn</w>
		<w pos="R">lên</w>
		<w pos="CC">và</w>
		<w pos="V">tin</w>
		<w pos="C">rằng</w>
		<w pos="M">…</w>
	</s>
	<s>
		<w pos="P">mình</w>
		<w pos="R">sẽ</w>
		<w pos="V">không_bao_giờ</w>
		<w pos="V">có</w>
		<w pos="R">được</w>
		<w pos="N">hạnh_phúc</w>
		<w pos="A">thực_sự</w>
		<w pos="…">…</w>
	</s>
	<s>
		<w pos="E">cho_đến</w>
		<w pos="N">ngày</w>
		<w pos="V">gặp</w>
		<w pos="R">được</w>
		<w pos="&quot;">"</w>
		<w pos="N">người</w>
		<w pos="P">ấy</w>
		<w pos="&quot;">"</w>
		<w pos=".">.</w>
	</s>
	…
</doc>
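
A file in this format can be read with a standard XML parser. The following is a minimal sketch, assuming the element names `doc`, `s`, and `w` and the `pos` attribute shown in the sample; the embedded string repeats only the first sentence of the sample. The function name `read_tagged` is hypothetical, and two choices are assumptions made here rather than the authors' procedure: underscores are replaced by spaces, and tokens whose tag does not start with a letter are treated as punctuation.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# First sentence of the appendix sample, wrapped in a complete <doc> element.
SAMPLE = """<doc>
  <s>
    <w pos="P">Đây</w>
    <w pos="V">là</w>
    <w pos="N">câu_chuyện</w>
    <w pos="E">về</w>
    <w pos="M">một</w>
    <w pos="N">chàng_trai</w>
    <w pos="V">gặp</w>
    <w pos="M">một</w>
    <w pos="N">cô_gái</w>
    <w pos=".">.</w>
  </s>
</doc>"""


def read_tagged(xml_text):
    """Return one list of (word, pos) pairs per <s> element."""
    root = ET.fromstring(xml_text)
    return [
        [(w.text, w.get("pos")) for w in s.iter("w")]
        for s in root.iter("s")
    ]


sentences = read_tagged(SAMPLE)

# Frequency of each POS tag, punctuation tags included.
pos_counts = Counter(pos for sent in sentences for _, pos in sent)

# Word frequencies, skipping tokens whose tag is not alphabetic (assumption).
word_counts = Counter(
    word.replace("_", " ")
    for sent in sentences
    for word, pos in sent
    if (pos or "")[:1].isalpha()
)
```

Counting over `(word, pos)` pairs instead of words alone would keep homographs with different tags apart.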

Dispersion measures

Word	FREQ	RANGE	MAXMIN	SD	VARCOEFF	CHISQUARE	D_EQ	D_UNEQ
cảnh gần	8.00	8.00	1.00	0.01	147.34	220,704.86	0.65	0.53
cánh phấn	2.00	2.00	1.00	0.00	294.69	15,572.46	0.29	0.23
cảnh sát	25,695.00	12,434.00	46.00	0.76	5.16	869,451.64	0.99	0.99
cao lương	100.00	49.00	23.00	0.08	135.55	575,687.79	0.67	0.75
cặp lồng	30.00	23.00	3.00	0.02	94.21	191,176.18	0.77	0.64
cạp nia	9.00	7.00	3.00	0.01	179.34	137,331.11	0.57	0.51

Word	D2	S_EQ	S_UNEQ	D3	DC	IDF	ENGVALL	U_EQ	U_UNEQ	UM_CARR
cảnh gần	0.17	0.00	0.00	−5426.44	0.00	14.41	0.00	5.17	4.28	1.38
cánh phấn	0.06	0.00	0.00	−21,709.50	0.00	16.41	0.00	0.59	0.46	0.11
cảnh sát	0.76	0.06	0.05	−5.66	0.06	3.80	1839.48	25,376.75	25,394.28	19,446.87
cao lương	0.26	0.00	0.00	−4592.74	0.00	11.79	0.03	67.47	74.61	25.61
cặp lồng	0.25	0.00	0.00	−2218.07	0.00	12.88	0.00	23.22	19.08	7.61
cạp nia	0.15	0.00	0.00	−8039.77	0.00	14.60	0.00	5.13	4.63	1.37

Word	AF_EQ	AF_UNEQ	Ur_KROM	F_ARF	AWT	F_AWT	ALD	F_ALD	DP	DP_norm
cảnh gần	0.00	0.00	8.00	5.63	7,427,879.64	5.72	7.13	6.27	1.00	1.00
cánh phấn	0.00	0.00	2.00	1.20	34,989,851.57	1.21	7.79	1.38	1.00	1.00
cảnh sát	1606.13	1393.76	17,028.34	9384.85	20,117.13	2111.81	4.15	5980.62	0.93	0.93
cao lương	0.02	0.08	58.00	34.54	2,075,863.59	20.46	6.49	27.38	1.00	1.00
cặp lồng	0.00	0.02	26.33	15.24	3,737,474.52	11.37	6.77	14.35	1.00	1.00
cạp nia	0.00	0.00	7.83	4.32	11,980,483.74	3.55	7.30	4.29	1.00	1.00

Abbreviation	Measure
FREQ	Observed frequency of word w
RANGE	Number of parts with word w
MAXMIN	Max. freq. of w/part − min. freq. of w/part
SD	Standard deviation of frequencies
VARCOEFF	Variation coefficient of frequencies
CHISQUARE	Chi square value of the frequency distribution
D_EQ	Juilland et al.’s D (assuming equal parts)
D_UNEQ	Juilland et al.’s D (not assuming equal parts)
D2	Carroll’s D2
S_EQ	Rosengren’s S (assuming equal parts)
S_UNEQ	Rosengren’s S (not assuming equal parts)
D3	Lyne’s D3
DC	Distributional Consistency
IDF	Inverse Document Frequency
ENGVALL	Engvall’s measure
U_EQ	Juilland et al.’s usage coefficient U (assuming equal parts)
U_UNEQ	Juilland et al.’s usage coefficient U (not assuming equal parts)
UM_CARR	Carroll’s Um
AF_EQ	Rosengren’s Adjusted Frequency AF (assuming equal parts)
AF_UNEQ	Rosengren’s Adjusted Frequency AF (not assuming equal parts)
Ur_KROM	Kromer’s U_R
F_ARF	Savický and Hlaváčová’s fARF
AWT	Savický and Hlaváčová’s AWT
F_AWT	Savický and Hlaváčová’s fAWT
ALD	Savický and Hlaváčová’s ALD
F_ALD	Savický and Hlaváčová’s fALD
SELF_DISP	Washtell’s self-dispersion
DP	Gries’s Deviation of Proportions
DP_norm	Gries’s Deviation of Proportions (normalized)
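
To make a few of these measures concrete, the sketch below computes RANGE, VARCOEFF, D_EQ, DP, DP_norm, and IDF for one word from its frequencies in each corpus part. It is an illustration only, not the script used to build the database. The formulas are the commonly cited definitions: D = 1 − V/√(n − 1), DP as half the summed absolute difference between observed and expected proportions, and DP_norm = DP/(1 − smallest part proportion). The use of the population standard deviation and of a base-2 logarithm for IDF are assumptions made here, and the function name `dispersion` and the toy frequency lists are hypothetical.

```python
import math


def dispersion(freqs, sizes=None):
    """Dispersion measures for one word.

    freqs: frequency of the word in each corpus part.
    sizes: size of each part in tokens; equal parts are assumed if omitted.
    """
    n = len(freqs)
    total = sum(freqs)
    if n < 2 or total == 0:
        raise ValueError("need at least two parts and one occurrence")
    if sizes is None:
        sizes = [1] * n
    corpus_size = sum(sizes)
    expected = [size / corpus_size for size in sizes]

    rng = sum(1 for v in freqs if v > 0)
    mean = total / n
    # Population standard deviation (an assumption of this sketch).
    sd = math.sqrt(sum((v - mean) ** 2 for v in freqs) / n)
    varcoeff = sd / mean
    d_eq = 1 - varcoeff / math.sqrt(n - 1)
    dp = 0.5 * sum(abs(v / total - e) for v, e in zip(freqs, expected))
    dp_norm = dp / (1 - min(expected))
    idf = math.log2(n / rng)  # base 2 is an assumption

    return {
        "FREQ": total,
        "RANGE": rng,
        "SD": sd,
        "VARCOEFF": varcoeff,
        "D_EQ": d_eq,
        "DP": dp,
        "DP_norm": dp_norm,
        "IDF": idf,
    }


# Two hypothetical words in a corpus of four equally sized parts.
even = dispersion([2, 2, 2, 2])     # spread evenly over all parts
clumped = dispersion([4, 0, 0, 0])  # concentrated in a single part
```

For the evenly spread word, D_EQ is 1 and DP is 0. For the word concentrated in one part, D_EQ is 0 and DP_norm is 1, which shows that the two families of measures run in opposite directions.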

Cite this article

Pham, H., Tucker, B.V. & Baayen, R.H. Constructing two vietnamese corpora and building a lexical database. Lang Resources & Evaluation 53, 465–498 (2019). https://doi.org/10.1007/s10579-019-09451-x

Issue Date: 15 September 2019
DOI: https://doi.org/10.1007/s10579-019-09451-x