Identifying dictionary-relevant formulaic sequences in written and spoken corpora
International Journal of Lexicography (IF 0.8), Pub Date: 2020-04-13, DOI: 10.1093/ijl/ecaa008
Kaja Dobrovoljc

Abstract: In view of the pervasiveness of formulaic language in human communication and the growing awareness of its relevance to modern lexicography, this study presents a corpus-driven identification, analysis and comparison of dictionary-relevant formulaic sequences in reference corpora of written and spoken Slovenian. The sequences were identified using a semi-automatic approach, whereby the most frequently recurring word combinations in each corpus were ranked according to their statistical salience and manually inspected for formulaic expressions with lexicographic relevance. Despite its semantic heterogeneity, the resulting list illustrates the distinct characteristics of formulaic multi-word expressions, such as high frequency of usage, prevalent inclusion of grammatical words and common non-propositional meaning, especially in speech, where research revealed numerous understudied formulaic expressions related to interaction management and mitigation. The final evaluation of measures used in the identification process demonstrates their relative suitability for corpus-driven identification of dictionary-relevant formulaic expressions, with their precision varying in relation to corpus size and length of sequences under investigation.
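
The automatic half of the semi-automatic approach (collect recurring word combinations, then rank them by a statistical salience score) can be illustrated with a minimal sketch. The abstract does not name the measures the study used, so pointwise mutual information (PMI) is an assumption chosen here for illustration only. The function name `rank_bigrams`, the `min_freq` threshold, and the restriction to two-word sequences are likewise hypothetical.

```python
from collections import Counter
from math import log2


def rank_bigrams(tokens, min_freq=2):
    """Rank recurring two-word sequences by PMI.

    PMI is an illustrative stand-in for a salience measure; it is
    not taken from the study itself.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scored = []
    for (w1, w2), freq in bigrams.items():
        # Keep only combinations that recur often enough to be candidates.
        if freq < min_freq:
            continue
        p_xy = freq / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scored.append(((w1, w2), freq, log2(p_xy / (p_x * p_y))))

    # Highest salience first; frequency and spelling break ties.
    scored.sort(key=lambda item: (-item[2], -item[1], item[0]))
    return scored


if __name__ == "__main__":
    sample = "you know it is you know that is you know".split()
    for bigram, freq, score in rank_bigrams(sample):
        print(" ".join(bigram), freq, round(score, 3))
```

The ranked list is only a set of candidates. In the approach described above, a lexicographer would still inspect it manually to decide which sequences are formulaic and dictionary-relevant.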

Updated: 2020-04-13