Multi-word discourse markers and their corpus-driven identification
International Journal of Corpus Linguistics (IF 0.919), Pub Date: 2017-12-01, DOI: 10.1075/ijcl.16127.dob
Kaja Dobrovoljc

With expanding evidence on the formulaic nature of human communication, there is a growing need to extend discourse marker research to functionally analogous multi-word expressions (MWEs). In contrast to the common qualitative approaches to discourse marker identification in corpora, this paper presents a corpus-driven, semi-automatic approach to the identification of multi-word discourse markers (MWDMs) in the reference corpus of spoken Slovene. Using eight statistical measures, we identified 173 structurally fixed discourse-marking MWEs, distinguished by a high number of tokens, a large proportion of grammatical words and semantic heterogeneity. This list is considerably longer than one that manual inspection of smaller corpus samples would have produced. Although frequency-based methods produced satisfactory results, the best precision in MWDM identification was achieved using the t-score association measure, while the overall poor performance of mutual information suggests its inadequacy for the extraction of MWDMs and other MWEs with similar lexical and distributional features.
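To illustrate the contrast the abstract draws between t-score and mutual information, the following is a minimal sketch using the common textbook bigram formulations of both measures. It is not the paper's implementation: the abstract does not give the formulas or how they were extended beyond two words, and the function name `bigram_scores` and the toy English token list are assumptions made purely for illustration.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Return {bigram: (frequency, t_score, mutual_information)}.

    Assumed textbook definitions (not taken from the paper):
      expected = f(w1) * f(w2) / N
      t-score  = (observed - expected) / sqrt(observed)
      MI       = log2(observed / expected)
    """
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), observed in bigrams.items():
        # Co-occurrence count expected if the two words were independent.
        expected = unigrams[w1] * unigrams[w2] / n
        t_score = (observed - expected) / math.sqrt(observed)
        mi = math.log2(observed / expected)
        scores[(w1, w2)] = (observed, t_score, mi)
    return scores


# Hypothetical toy data: a frequent pair of common words ("you know")
# next to a pair of words that each occur only once ("rare gem").
tokens = "you know i mean you know it is you know a rare gem".split()
scores = bigram_scores(tokens)

for pair in [("you", "know"), ("rare", "gem")]:
    freq, t_score, mi = scores[pair]
    print(pair, freq, round(t_score, 3), round(mi, 3))
```

On this toy input the t-score ranks the frequent pair "you know" above the one-off pair "rare gem", while mutual information ranks them the other way round. This is in line with the abstract's observation that mutual information performs poorly for expressions built from frequent grammatical words, although a toy example of this size is only suggestive.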

Updated: 2017-12-01