
Constructing two Vietnamese corpora and building a lexical database

Language Resources and Evaluation

Abstract

Corpus-based research has formed the backbone of linguistic research in recent decades. Large text corpora are used to address a wide range of problems in quantitative linguistics, cognitive linguistics, and psycholinguistics, among other fields. This paper reports the construction of two equally sized corpora of contemporary Vietnamese: subtlex-viet, built from Vietnamese film subtitles, and genlex-viet, a general corpus drawn from a variety of online newspapers and stories. We document the general steps of corpus construction and of extracting linguistic information from the corpora, and provide a road map for others who would like to create similar corpora. The resulting corpora are available in three versions: plain text, tokenized, and POS-tagged. The second half of the paper describes the construction of a lexical database derived from the corpora. The database includes measures such as frequency of occurrence, dispersion, Mutual Information, and Inverse Document Frequency, as well as vector space measures based on Latent Semantic Analysis and Hyperspace Analogue to Language. We conclude by comparing the lexical predictors and validating them against psycholinguistic data from visual lexical decision experiments.

Data availability

Because of copyright restrictions, only certain portions of the texts in our corpora are freely available to the public, in the form of concordance lines, on request. The lexical databases, which are not subject to copyright restrictions, have been made available at http://era.library.ualberta.ca/files/j098zc38m for use by the research community.

Notes

  1. We return to possible disadvantages of using translated subtitles below.

  2. http://www.linguistics.ucsb.edu/faculty/stgries/research/dispersion/_dispersion1.r.

  3. This idea is not new in linguistics. Firth (1957, p. 11) referred to this as “You shall know a word by the company it keeps.”

  4. The original LSA divided its corpus into 30,000 episodes and counted how many times each word appeared in each episode. Instead of assigning 30,000 individual values to each word, a dimension-reduction step akin to factor analysis (singular value decomposition in Landauer and Dumais 1997) reduces the number of values to about 300.

  5. A monitor corpus is a growing, non-finite collection of texts, used primarily in lexicography. It tracks language change by growing at a constant rate while leaving the relative weight of its components (i.e., its balance), as defined by the design parameters, untouched. The same composition schema should be followed year after year, the basis being a reference corpus of texts spoken or written in a single year. An example of an English monitor corpus is COCA (Davies 2010), which can be accessed at http://corpus.byu.edu/coca/.
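
Notes 3 and 4 can be illustrated with a toy sketch. This is not the authors' pipeline: it assumes NumPy, uses a truncated singular value decomposition as the reduction step, skips the term weighting that full LSA implementations normally apply, and works on five made-up "episodes" with k = 2 retained dimensions rather than 30,000 episodes and about 300 dimensions. The data and the names `episodes`, `lsa_vectors`, and `cosine` are hypothetical.

```python
import numpy as np

# Toy "episodes" (hypothetical data). Tokens are already segmented,
# with underscores joining the syllables of one word.
episodes = [
    "cảnh_sát bắt tội_phạm",
    "cảnh_sát điều_tra tội_phạm",
    "cảnh_sát truy_đuổi tội_phạm",
    "chàng_trai gặp cô_gái",
    "chàng_trai yêu cô_gái",
]


def lsa_vectors(episodes, k):
    """Build a word-by-episode count matrix and reduce it to k dimensions."""
    vocab = sorted({w for ep in episodes for w in ep.split()})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(episodes)))
    for j, ep in enumerate(episodes):
        for w in ep.split():
            counts[index[w], j] += 1
    # Truncated SVD: keep only the k largest singular values.
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    return vocab, u[:, :k] * s[:k]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))


vocab, vectors = lsa_vectors(episodes, k=2)
vec = {w: vectors[i] for i, w in enumerate(vocab)}

# Words that share episodes end up close; words that never co-occur do not.
related = cosine(vec["cảnh_sát"], vec["tội_phạm"])
unrelated = cosine(vec["cảnh_sát"], vec["cô_gái"])
print(related, unrelated)
```

In this toy setting the first cosine comes out close to 1 and the second close to 0, which is the sense in which a word is known "by the company it keeps".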

References

  • Adelman, J. S., Brown, G. D. A., & Quesada, J. F. (2006). Contextual diversity, not word frequency, determines word-naming and lexical decision times. Psychological Science, 17(9), 814–823.

  • Baayen, R. H. (2001). Word frequency distributions. Dordrecht: Kluwer Academic Publishers.

  • Baayen, R. H., Feldman, L., & Schreuder, R. (2006). Morphological influences on the recognition of monosyllabic monomorphemic words. Journal of Memory and Language, 53, 496–512.

  • Baayen, R. H., Milin, P., Filipović Đurđević, D., Hendrix, P., & Marelli, M. (2011). An amorphous model for morphological processing in visual comprehension based on naive discriminative learning. Psychological Review, 118(3), 438–481.

  • Baayen, R. H., Piepenbrock, R., & Gulikers, L. (1995). The CELEX lexical database (CD-ROM). Philadelphia: Linguistic Data Consortium, University of Pennsylvania.

  • Balota, D. A., Cortese, M. J., Sergent-Marshall, S. D., Spieler, D. H., & Yap, M. J. (2004). Visual word recognition of single-syllable words. Journal of Experimental Psychology: General, 133, 283–316.

  • Berry-Rogghe, G. L. M. (1973). The computation of collocations and their relevance in lexical studies. In A. J. Aitken, R. W. Bailey, & N. Hamilton-Smith (Eds.), The computer and literary studies (pp. 103–112). Edinburgh: Edinburgh University Press.

  • Brysbaert, M., Mandera, P., & Keuleers, E. (2017). The word frequency effect in word processing: A review update. To be published in Current Directions in Psychological Science.

  • Brysbaert, M., & New, B. (2009). Moving beyond Kučera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41(4), 977–990.

  • Burgess, C., & Livesay, K. (1998). The effect of corpus size in predicting reaction time in a basic word recognition task: Moving on from Kučera and Francis. Behavior Research Methods, 30, 272–277.

  • Burgess, C., & Lund, K. (1998). The dynamics of meaning in memory. In E. Dietrich & A. Markman (Eds.), Cognitive dynamics: Conceptual change in humans and machines. Mahwah: Lawrence Erlbaum Associates.

  • Cai, Q., & Brysbaert, M. (2010). SUBTLEX-CH: Chinese word and character frequencies based on film subtitles. PLoS ONE, 5(6), e10729.

  • Cantos Gómez, P. (2013). Statistical methods in language and linguistic research. Sheffield: Equinox Publishing Limited.

  • Church, K. W., & Hanks, P. (1990). Word association norms, mutual information, and lexicography. Computational Linguistics, 16, 22–29.

  • R Core Team. (2013). R: A language and environment for statistical computing. Vienna: R Foundation for Statistical Computing. ISBN 3-900051-07-0.

  • Crossley, S. A., Salsbury, T., McCarthy, P. M., & McNamara, D. S. (2008). LSA as a measure of coherence in second language natural discourse. In V. Sloutsky & B. K. M. Love (Eds.), Proceedings of the 30th Annual Meeting of the Cognitive Science Society. Washington, DC: Cognitive Science Society.

  • Cuetos, F., Glez-Nosti, M., Barbon, A., & Brysbaert, M. (2011). SUBTLEX-ESP: Spanish word frequencies based on film subtitles. Psicologica, 32, 133–143.

  • Davies, M. (2010). Corpus of contemporary American English (COCA). http://www.americancorpus.org/. Accessed 16 Feb 2014.

  • de Groot, A., & Hagoort, P. (2017). Research methods in psycholinguistics and the neurobiology of language: A practical guide. GMLZ: Guides to research methods in language and linguistics. New York: Wiley.

  • Delic, E. (2004). Présentation du Corpus de référence du Français parlé. Recherches sur le Français parlé, 18, 11–42.

  • Dimitropoulou, M., Duñabeitia, J. A., Avilés, A., Corral, J., & Carreiras, M. (2010). Subtitle-based word frequencies as the best estimate of reading behaviour: The case of Greek. Frontiers in Psychology, 1, 1–12.

  • Dunning, T. (1993). Accurate methods for the statistics of surprise and coincidence. Computational Linguistics, 19, 61–74.

  • Firth, J. R. (1957). Papers in linguistics, 1934–1951. London: Oxford University Press.

  • Gagné, C. L., & Shoben, E. J. (1997). Influence of thematic relations on the comprehension of modifier-noun combinations. Journal of Experimental Psychology: Learning, Memory, and Cognition, 23, 71–87.

  • Gagné, C. L., Spalding, T. L., & Nisbet, K. A. (2016). Processing English compounds: Investigating semantic transparency. SKASE Journal of Theoretical Linguistics, 13(2), 2–22.

  • Gimenes, M., & New, B. (2016). Worldlex: Twitter and blog word frequencies for 66 languages. Behavior Research Methods, 48(3), 963–972.

  • Gries, S. T. (2008). Dispersions and adjusted frequencies in corpora. International Journal of Corpus Linguistics, 13(4), 403–437.

  • Gries, S. T. (2009). Dispersions and adjusted frequencies in corpora: Further explorations. Language and Computers, 71(1), 197–212.

  • Günther, F., Dudschig, C., & Kaup, B. (2016). Latent semantic analysis cosines as a cognitive similarity measure: Evidence from priming studies. The Quarterly Journal of Experimental Psychology, 69(4), 626–653.

  • Hasher, L., & Zacks, R. T. (1984). Automatic processing of fundamental information: The case of frequency of occurrence. American Psychologist, 39, 1372–1388.

  • Hoàng, P. (Ed.). (2000). Từ điển tiếng Việt [Vietnamese Dictionary]. Hà Nội: Khoa học Xã hội, Viện Ngôn ngữ học.

  • Juilland, A., Brodin, D., & Davidovitch, C. (1970). Frequency dictionary of French words. The Hague: Romance languages and their structures.

  • Keuleers, E., & Brysbaert, M. (2010). Wuggy: A multilingual pseudoword generator. Behavior Research Methods, 42(3), 627–633.

  • Keuleers, E., Brysbaert, M., & New, B. (2010). SUBTLEX-NL: A new measure for Dutch word frequency based on film subtitles. Behavior Research Methods, 42(3), 643–650.

  • Kučera, H., & Francis, W. N. (1967). Computational analysis of present-day American English. Providence: Brown University Press.

  • Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104(2), 211–240.

  • Landauer, T. K., Foltz, P. W., & Laham, D. (1998). Introduction to latent semantic analysis. Discourse Processes, 25, 259–284.

  • Lê, H. P., Nguyen, T. M. H., Roussanaly, A., & Ho, V. (2008). A hybrid approach to word segmentation of Vietnamese texts. In C. Martin-Vide, F. Otto, & H. Fernau (Eds.), Language and automata theory and applications. Lecture Notes in Computer Science (Vol. 5196, pp. 240–249). Berlin, Heidelberg: Springer.

  • Le, D.-T., & Quasthoff, U. (2016). Construction and analysis of a large Vietnamese text corpus. In N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, & S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016). Paris: European Language Resources Association (ELRA).

  • Lê, H. P., Roussanaly, A., Nguyen, T. M. H., & Rossignol, M. (2010). An empirical study of maximum entropy approach for part-of-speech tagging of Vietnamese texts. In Traitement Automatique des Langues Naturelles (TALN 2010) (p. 12), Montréal, Canada. ATALA (Association pour le Traitement Automatique des Langues).

  • Libben, G., Gibson, M., Yoon, Y. B., & Sandra, D. (2003). Compound fracture: The role of semantic transparency and morphological headedness. Brain and Language, 84, 50–64.

  • Lund, K., & Burgess, C. (1996). Producing high-dimensional semantic spaces from lexical cooccurrence. Behavior Research Methods, Instruments, and Computers, 28(2), 203–208.

  • Lund, K., Burgess, C., & Atchley, R. A. (1995). Semantic and associative priming in high-dimensional semantic space. In Proceedings of the 17th annual conference of the Cognitive Science Society (pp. 660–665). Hillsdale: Erlbaum.

  • McClelland, J. L., & Rumelhart, D. E. (1981). An interactive activation model of context effects in letter perception: Part I. An account of the basic findings. Psychological Review, 88, 375–407.

  • McDonald, S. A., & Shillcock, R. C. (2001). Rethinking the word frequency effect: The neglected role of distributional information in lexical processing. Language and Speech, 44(3), 295–323.

  • New, B., Brysbaert, M., Veronis, J., & Pallier, C. (2007). The use of film subtitles to estimate word frequencies. Applied Psycholinguistics, 28(4), 661–677.

  • New, B., Pallier, C., Brysbaert, M., & Ferrand, L. (2004). Lexique 2: A new French lexical database. Behavior Research Methods, Instruments, and Computers, 36, 516–524.

  • Nguyễn, Đ. D., & Lê, Q. T. (1980). Dictionnaire de fréquence du Vietnamien. Paris: Université de Paris VII.

  • Oakes, M. (1998). Statistics for corpus linguistics. Edinburgh: Edinburgh University Press.

  • Ooi, V. (1998). Computer corpus lexicography. Edinburgh: Edinburgh University Press.

  • Petersen, S. E., Fox, P. T., Posner, M. I., Mintun, M., & Raichle, M. E. (1988). Positron emission tomographic studies of the cortical anatomy of single-word processing. Nature, 331(6157), 585–589.

  • Petersen, S. E., Fox, P. T., Posner, M. I., Mintun, M., & Raichle, M. E. (1989). Positron emission tomographic studies of the processing of single words. Journal of Cognitive Neuroscience, 1(2), 153–170.

  • Pham, H., & Baayen, R. H. (2013). Semantic relations and compound transparency: A regression study in CARIN theory. Psihologija, 46(4), 455–478.

  • Pham, H., & Baayen, R. H. (2015). Vietnamese compounds show an anti-frequency effect in visual lexical decision. Language, Cognition & Neuroscience, 30(9), 1077–1095.

  • Pham, H., Bolger, P., & Baayen, R. H. (2012). Vietnamese word and syllabeme (syllable-morpheme) frequencies: A corpus and lexical decision study. In SEALS 22. Agay, France.

  • Pham, G., Kohnert, K., & Carney, E. (2008). Corpora of Vietnamese texts: Lexical effects of intended audience and publication place. Behavior Research Methods, 40(1), 154–163.

  • Pinker, S. (1999). Words and rules: The ingredients of language. New York: Basic Books.

  • Rayson, P., & Garside, R. (2000). Comparing corpora using frequency profiling. In Proceedings of the workshop on Comparing Corpora, held in conjunction with ACL 2000. October 2000, Hong Kong (pp. 1–6).

  • Read, T., & Cressie, N. (1988). Goodness-of-fit statistics for discrete multivariate data. New York: Springer.

  • Scott, M., & Tribble, C. (2006). Textual patterns: Key words and corpus analysis in language education. Amsterdam: John Benjamins.

  • Shaoul, C., & Westbury, C. (2006). Word frequency effects in high-dimensional co-occurrence models: A new approach. Behavior Research Methods, 38, 190–195.

  • Southeast Asian Languages Library, S. (2009). Vietnamese text corpus. http://sealang.net/vietnamese/corpus.htm. Accessed 16 Feb 2014.

  • Trung tâm từ điển học, V. (1998). Vietnamese corpus. http://vietlex.com/kho-ngu-lieu. Accessed 16 Feb 2014.

  • van Heuven, W. J. B., Mandera, P., Keuleers, E., & Brysbaert, M. (2014). SUBTLEX-UK: A new and improved word frequency database for British English. The Quarterly Journal of Experimental Psychology, 67(6), 1176–1190.

  • Wild, F. (2011). LSA: Latent semantic analysis. R package version 0.63-3.

  • Yap, M. J., & Balota, D. A. (2009). Visual word recognition of multisyllabic words. Journal of Memory and Language, 60(4), 502–529.

  • Zipf, G. K. (1935). The psycho-biology of language. Boston: Houghton Mifflin.

Acknowledgements

This research is funded by the Vietnam National Foundation for Science and Technology Development (NAFOSTED) under grant number 602.10-2016.05.

Author information

Corresponding author

Correspondence to Hien Pham.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

POS tags used in the corpora

ID  POS-tags  POS in English               POS in Vietnamese
1   Np        Proper noun                  danh từ riêng
2   Nc        Classifier noun              danh từ chỉ loại
3   Nu        Unit noun                    danh từ đơn vị
4   N         Common noun                  danh từ chung
5   V         Verb                         động từ
6   A         Adjective                    tính từ
7   P         Pronoun                      đại từ
8   R         Adverb                       phó từ
9   L         Determiner                   định từ
10  M         Numeral                      số từ
11  E         Preposition                  giới từ
12  C         Subordinating conjunction    liên từ phụ
13  CC        Coordinating conjunction     liên từ kết hợp
14  I         Interjection                 từ cảm thán
15  T         Auxiliary word, modal words  trợ từ
16  Y         Abbreviation                 từ viết tắt
17  Z         Bound morphemes              yếu tố cấu tạo từ (bất, vô...)
18  X         Undetermined                 không (hoặc chưa) xác định

POS-tagged XML sample

 

<doc>
  <s>
    <w pos="P">Đây</w>
    <w pos="V">là</w>
    <w pos="N">câu_chuyện</w>
    <w pos="E">về</w>
    <w pos="M">một</w>
    <w pos="N">chàng_trai</w>
    <w pos="V">gặp</w>
    <w pos="M">một</w>
    <w pos="N">cô_gái</w>
    <w pos=".">.</w>
  </s>
  <s>
    <w pos="N">Ngày</w>
    <w pos="N">thứ</w>
    <w pos="M">nhất</w>
  </s>
  <s>
    <w pos="N">Chàng_trai</w>
    <w pos=",">,</w>
    <w pos="Np">Tom_Hansen</w>
    <w pos=",">,</w>
    <w pos="V">sinh_ra</w>
    <w pos="E">ở</w>
    <w pos="Np">Margate</w>
    <w pos=",">,</w>
    <w pos="Np">New_Jersey</w>
    <w pos=",">,</w>
  </s>
  <s>
    <w pos="A">lớn</w>
    <w pos="R">lên</w>
    <w pos="CC">và</w>
    <w pos="V">tin</w>
    <w pos="C">rằng</w>
    <w pos="M">…</w>
  </s>
  <s>
    <w pos="P">mình</w>
    <w pos="R">sẽ</w>
    <w pos="V">không_bao_giờ</w>
    <w pos="V">có</w>
    <w pos="R">được</w>
    <w pos="N">hạnh_phúc</w>
    <w pos="A">thực_sự</w>
    <w pos="…">…</w>
  </s>
  <s>
    <w pos="E">cho_đến</w>
    <w pos="N">ngày</w>
    <w pos="V">gặp</w>
    <w pos="R">được</w>
    <w pos="“">“</w>
    <w pos="N">người</w>
    <w pos="P">ấy</w>
    <w pos="“">“</w>
    <w pos=".">.</w>
  </s>
</doc>
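
A file in this tagged format can be read with a standard XML parser. The sketch below is a minimal illustration, not the authors' tooling: it assumes well-formed XML with straight quotes around attribute values, `<w>` elements nested directly under `<s>`, and `<s>` directly under `<doc>`. The names `SAMPLE` and `read_sentences` are hypothetical, and the embedded tokens are taken from the sample.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Two sentences in the tagged format of the sample (straight quotes assumed).
SAMPLE = """<doc>
  <s>
    <w pos="P">Đây</w>
    <w pos="V">là</w>
    <w pos="N">câu_chuyện</w>
    <w pos="E">về</w>
    <w pos="M">một</w>
    <w pos="N">chàng_trai</w>
    <w pos="V">gặp</w>
    <w pos="M">một</w>
    <w pos="N">cô_gái</w>
    <w pos=".">.</w>
  </s>
  <s>
    <w pos="N">Ngày</w>
    <w pos="N">thứ</w>
    <w pos="M">nhất</w>
  </s>
</doc>"""


def read_sentences(xml_text):
    """Return a list of sentences, each a list of (word, pos) pairs."""
    root = ET.fromstring(xml_text)
    return [
        [(w.text, w.get("pos")) for w in s.findall("w")]
        for s in root.findall("s")
    ]


sentences = read_sentences(SAMPLE)

# Frequency counts over word forms and over POS tags.
word_freq = Counter(word for sent in sentences for word, _ in sent)
pos_freq = Counter(pos for sent in sentences for _, pos in sent)

print(word_freq["một"], pos_freq["N"])  # prints "2 5"
```

Counting over `(word, pos)` pairs instead of word forms alone would keep homographs with different tags apart.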

      

Dispersion measures

Word       FREQ       RANGE      MAXMIN  SD    VARCOEFF  CHISQUARE   D_EQ  D_UNEQ
cảnh gần   8.00       8.00       1.00    0.01  147.34    220,704.86  0.65  0.53
cánh phấn  2.00       2.00       1.00    0.00  294.69    15,572.46   0.29  0.23
cảnh sát   25,695.00  12,434.00  46.00   0.76  5.16      869,451.64  0.99  0.99
cao lương  100.00     49.00      23.00   0.08  135.55    575,687.79  0.67  0.75
cặp lồng   30.00      23.00      3.00    0.02  94.21     191,176.18  0.77  0.64
cạp nia    9.00       7.00       3.00    0.01  179.34    137,331.11  0.57  0.51

Word       D2    S_EQ  S_UNEQ  D3          DC    IDF    ENGVALL  U_EQ       U_UNEQ     UM_CARR
cảnh gần   0.17  0.00  0.00    −5426.44    0.00  14.41  0.00     5.17       4.28       1.38
cánh phấn  0.06  0.00  0.00    −21,709.50  0.00  16.41  0.00     0.59       0.46       0.11
cảnh sát   0.76  0.06  0.05    −5.66       0.06  3.80   1839.48  25,376.75  25,394.28  19,446.87
cao lương  0.26  0.00  0.00    −4592.74    0.00  11.79  0.03     67.47      74.61      25.61
cặp lồng   0.25  0.00  0.00    −2218.07    0.00  12.88  0.00     23.22      19.08      7.61
cạp nia    0.15  0.00  0.00    −8039.77    0.00  14.60  0.00     5.13       4.63       1.37

Word       AF_EQ    AF_UNEQ  Ur_KROM    F_ARF    AWT            F_AWT    ALD   F_ALD    DP    DPnorm
cảnh gần   0.00     0.00     8.00       5.63     7,427,879.64   5.72     7.13  6.27     1.00  1.00
cánh phấn  0.00     0.00     2.00       1.20     34,989,851.57  1.21     7.79  1.38     1.00  1.00
cảnh sát   1606.13  1393.76  17,028.34  9384.85  20,117.13      2111.81  4.15  5980.62  0.93  0.93
cao lương  0.02     0.08     58.00      34.54    2,075,863.59   20.46    6.49  27.38    1.00  1.00
cặp lồng   0.00     0.02     26.33      15.24    3,737,474.52   11.37    6.77  14.35    1.00  1.00
cạp nia    0.00     0.00     7.83       4.32     11,980,483.74  3.55     7.30  4.29     1.00  1.00

Abbreviation  Measure
FREQ          Observed frequency of word w
RANGE         Number of parts with word w
MAXMIN        Max. freq. of w per part minus min. freq. of w per part
SD            Standard deviation of frequencies
VARCOEFF      Variation coefficient of frequencies
CHISQUARE     Chi square value of the frequency distribution
D_EQ          Juilland et al.’s D (assuming equal parts)
D_UNEQ        Juilland et al.’s D (not assuming equal parts)
D2            Carroll’s D2
S_EQ          Rosengren’s S (assuming equal parts)
S_UNEQ        Rosengren’s S (not assuming equal parts)
D3            Lyne’s D3
DC            Distributional Consistency
IDF           Inverse Document Frequency
ENGVALL       Engvall’s measure
U_EQ          Juilland et al.’s usage coefficient U (assuming equal parts)
U_UNEQ        Juilland et al.’s usage coefficient U (not assuming equal parts)
UM_CARR       Carroll’s Um
AF_EQ         Rosengren’s Adjusted Frequency AF (assuming equal parts)
AF_UNEQ       Rosengren’s Adjusted Frequency AF (not assuming equal parts)
Ur_KROM       Kromer’s UR
F_ARF         Savický and Hlaváčová’s fARF
AWT           Savický and Hlaváčová’s AWT
F_AWT         Savický and Hlaváčová’s fAWT
ALD           Savický and Hlaváčová’s ALD
F_ALD         Savický and Hlaváčová’s fALD
SELF_DISP     Washtell’s self-dispersion
DP            Gries’s Deviation of Proportions
DP_norm       Gries’s Deviation of Proportions (normalized)
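
To make a few of these abbreviations concrete, the sketch below computes some of the simpler measures for one word from its per-part frequencies. It is a hedged illustration and not the script used for the database. It assumes the commonly cited definitions: population standard deviation, D = 1 − V/√(n − 1) for equal-sized parts, IDF as the base-2 logarithm of parts over range, DP as half the summed absolute difference between observed and expected proportions, and DP_norm as DP divided by one minus the smallest part proportion. The function name `dispersion` and its arguments are hypothetical.

```python
import math


def dispersion(freqs, part_sizes):
    """Compute a few dispersion measures for one word.

    freqs[i]      -- occurrences of the word in corpus part i
    part_sizes[i] -- total number of tokens in corpus part i
    """
    n = len(freqs)
    f = sum(freqs)
    if n < 2 or f == 0 or n != len(part_sizes):
        raise ValueError("need at least two parts and a non-zero frequency")

    mean = f / n
    # Population standard deviation (an assumption; see the lead-in).
    sd = math.sqrt(sum((v - mean) ** 2 for v in freqs) / n)
    varcoeff = sd / mean
    parts_with_word = sum(1 for v in freqs if v > 0)

    total = sum(part_sizes)
    expected = [s / total for s in part_sizes]  # expected proportions
    observed = [v / f for v in freqs]           # observed proportions
    dp = 0.5 * sum(abs(o - e) for o, e in zip(observed, expected))

    return {
        "FREQ": f,
        "RANGE": parts_with_word,
        "MAXMIN": max(freqs) - min(freqs),
        "SD": sd,
        "VARCOEFF": varcoeff,
        "D_EQ": 1 - varcoeff / math.sqrt(n - 1),
        "IDF": math.log2(n / parts_with_word),
        "DP": dp,
        "DP_norm": dp / (1 - min(expected)),
    }


# An evenly spread word versus a word confined to one of five equal parts.
even = dispersion([1, 2, 3, 4, 5], [10, 10, 10, 10, 10])
clumped = dispersion([0, 0, 0, 0, 4], [10, 10, 10, 10, 10])
print(round(even["D_EQ"], 2), round(clumped["DP"], 2))
```

Under these definitions a word found in every part has an IDF of 0, while a word confined to a single part has a high DP and a D near 0, which matches the direction of the sample values in the table.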


Cite this article

Pham, H., Tucker, B.V. & Baayen, R.H. Constructing two Vietnamese corpora and building a lexical database. Lang Resources & Evaluation 53, 465–498 (2019). https://doi.org/10.1007/s10579-019-09451-x


  • DOI: https://doi.org/10.1007/s10579-019-09451-x
