Constructing two Vietnamese corpora and building a lexical database
Language Resources and Evaluation (IF 1.7), Pub Date: 2019-03-21, DOI: 10.1007/s10579-019-09451-x
Hien Pham, Benjamin V. Tucker, R. Harald Baayen

Corpus-based research has formed the backbone of linguistic research in recent decades. Large text corpora are used to address a wide range of problems in quantitative linguistics, cognitive linguistics, and psycholinguistics. This paper reports the construction of two equally sized corpora of contemporary Vietnamese: subtlex-viet, a corpus of Vietnamese film subtitles, and genlex-viet, a general corpus drawn from a variety of online newspapers and stories. We document the general steps for constructing the corpora and extracting linguistic information from them, and provide a road map for others who would like to create similar corpora. The resulting corpora are available in three versions: plain text, tokenized, and POS tagged. The second half of the paper describes the construction of a lexical database derived from the corpora. The database includes measures such as frequency of occurrence, dispersion, Mutual Information, and Inverse Document Frequency, as well as vector space measures based on Latent Semantic Analysis and Hyperspace Analogue to Language. We conclude by comparing the lexical predictors and validating them against psycholinguistic data from visual lexical decision experiments.
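
As a rough illustration of three of the measures named above (frequency of occurrence, Inverse Document Frequency, and Mutual Information), the following minimal sketch computes them on a toy tokenized corpus. It is not the authors' pipeline: the three sample documents, the function names, the natural-log form of IDF, and the choice of pointwise mutual information over adjacent tokens are all assumptions made for the example.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is a list of already tokenized words.
docs = [
    ["tôi", "đi", "học"],
    ["tôi", "đi", "làm"],
    ["anh", "đi", "học"],
]

def lexical_measures(docs):
    """Return token frequency, document frequency, and IDF = ln(N / df)."""
    n_docs = len(docs)
    freq = Counter(tok for doc in docs for tok in doc)
    # Count each token at most once per document.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    idf = {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}
    return freq, doc_freq, idf

def bigram_pmi(docs):
    """Pointwise mutual information (base 2) for adjacent token pairs."""
    unigrams = Counter(tok for doc in docs for tok in doc)
    bigrams = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        pmi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return pmi

freq, doc_freq, idf = lexical_measures(docs)
pmi = bigram_pmi(docs)
```

A token that occurs in every document, such as "đi" in this toy corpus, receives an IDF of zero, while a token confined to a single document receives the highest IDF. Dispersion and the vector space measures are left out of the sketch.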

Updated: 2019-03-21