Automatic Identification and Production of Related Words for Historical Linguistics
Computational Linguistics (IF 3.7), Pub Date: 2020-01-01, DOI: 10.1162/coli_a_00361
Alina Maria Ciobanu, Liviu P. Dinu

Language change across space and time is one of the main concerns in historical linguistics. In this article, we develop tools to assist researchers and domain experts in the study of language evolution. Firstly, we introduce a method to automatically determine if two words are cognates. We propose an algorithm for extracting cognates from electronic dictionaries that contain etymological information. Having built a dataset of related words, we further develop machine learning methods based on orthographic alignment for identifying cognates. We use aligned subsequences as features for classification algorithms in order to infer rules for linguistic changes undergone by words when entering new languages and to discriminate between cognates and non-cognates. Secondly, we extend the method to a finer-grained level, to identify the type of relationship between words. Discriminating between cognates and borrowings provides a deeper insight into the history of a language and allows a better characterization of language relatedness. We show that orthographic features have discriminative power and we analyze the underlying linguistic factors that prove relevant in the classification task. To our knowledge, this is the first attempt of this kind. Thirdly, we develop a machine learning method for automatically producing related words. We focus on reconstructing proto-words, but we also address two related sub-problems: producing modern word forms and producing cognates. The task of reconstructing proto-words consists in recreating the words of an ancient language from its modern daughter languages. Given modern word forms in multiple Romance languages, we infer the form of their common Latin ancestors. Our approach relies on the regularities that occurred when words entered the modern languages. We leverage information from several modern languages, building an ensemble system for reconstructing proto-words.
We apply our method to multiple datasets, showing that our approach improves on previous results and also has the advantage of requiring less input data, which is essential in historical linguistics, where resources are generally scarce.
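
The idea of using aligned subsequences as classification features can be illustrated with a small sketch. The abstract does not specify the alignment algorithm, the scoring scheme, or the feature format, so everything below is an assumption made for illustration: a Needleman-Wunsch-style global alignment with unit scores, and character n-grams taken from the aligned pair wherever the two words differ. The function names `align` and `mismatch_ngrams` are hypothetical, and the Romanian/Italian word pair is only an example input.

```python
from typing import List, Tuple

GAP = "-"  # assumed gap symbol; it must not occur in the input words


def align(a: str, b: str, match: int = 1, mismatch: int = -1,
          gap: int = -1) -> Tuple[str, str]:
    """Globally align two words (Needleman-Wunsch with assumed unit scores)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append(GAP)
            i -= 1
        else:
            out_a.append(GAP)
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def mismatch_ngrams(x: str, y: str, n: int = 2) -> List[str]:
    """Return 'source>target' n-gram pairs at positions where the aligned words differ."""
    if len(x) != len(y):
        raise ValueError("inputs must be aligned to the same length")
    feats = []
    for k in range(len(x) - n + 1):
        wx, wy = x[k:k + n], y[k:k + n]
        if wx != wy:
            feats.append(f"{wx}>{wy}")
    return feats


if __name__ == "__main__":
    # Romanian "noapte" and Italian "notte" both descend from Latin "noctem".
    ro, it = align("noapte", "notte")
    print(ro, it, mismatch_ngrams(ro, it))
```

Features of this kind could be passed as sparse indicator features to any standard classifier; which classifiers and n-gram sizes the authors actually use is not stated in the abstract.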

Updated: 2020-01-01