Making genealogical language classifications available for phylogenetic analysis: Newick trees, unified identifiers, and branch length
Language Dynamics and Change, Pub Date: 2018-06-22, DOI: 10.1163/22105832-00801001
Dan Dediu

One of the best-known types of non-independence between languages is caused by genealogical relationships due to descent from a common ancestor. These can be represented by (more or less resolved and controversial) language family trees. In theory, one can argue that language families should be built through the strict application of the comparative method of historical linguistics, but in practice this is not always the case, and there are several proposed classifications of languages into language families, each with its own advantages and disadvantages. A major stumbling block shared by most of them is that they are relatively difficult to use with computational methods, and in particular with phylogenetics. This is due to their lack of standardization, coupled with the general non-availability of branch length information, which encapsulates the amount of evolution taking place on the family tree. In this paper I introduce a method (and its implementation in R) that converts the language classifications provided by four widely-used databases (Ethnologue, WALS, AUTOTYP and Glottolog) into the de facto Newick standard generally used in phylogenetics, aligns the four most used conventions for unique identifiers of linguistic entities (ISO 639-3, WALS, AUTOTYP and Glottocode), and adds branch length information from a variety of sources (the tree’s own topology, an externally given numeric constant, or a distance matrix). The R scripts, input data and resulting Newick trees are available under liberal open-source licenses in a GitHub repository (https://github.com/ddediu/lgfam-newick), to encourage and promote the use of phylogenetic methods to investigate linguistic diversity and its temporal dynamics.
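The core conversion described above, turning a nested classification into a Newick string with branch lengths, can be illustrated with a minimal sketch. This is not the author's R implementation. It is a Python illustration that assumes the classification is already available as a nested dictionary and that every branch receives the same externally given constant length (one of the three branch length sources the abstract mentions). The function `to_newick`, the `Tree` type and the labels in `toy_family` are all hypothetical.

```python
from typing import Dict, Optional

# Hypothetical input format: each key is a node label and each value maps
# child labels to their own subtrees. An empty dict marks a leaf (a language).
Tree = Dict[str, "Tree"]


def to_newick(label: str, children: Tree, branch_length: Optional[float] = None) -> str:
    """Serialize one node and all its descendants as a Newick subtree.

    If branch_length is given, the same constant is attached to every branch;
    otherwise only the topology is written.
    """
    suffix = "" if branch_length is None else f":{branch_length:g}"
    if not children:
        # Leaf: just the label, optionally followed by its branch length.
        return f"{label}{suffix}"
    inner = ",".join(
        to_newick(child, grandchildren, branch_length)
        for child, grandchildren in children.items()
    )
    # Internal node: children in parentheses, then the node's own label.
    return f"({inner}){label}{suffix}"


# Toy family with made-up labels, not taken from any of the four databases.
toy_family: Tree = {
    "West": {"lang_a": {}, "lang_b": {}},
    "lang_c": {},
}

# A Newick tree is terminated by a semicolon.
newick = to_newick("ToyFamily", toy_family, branch_length=1.0) + ";"
print(newick)
```

In practice the leaf labels would carry the aligned identifiers (ISO 639-3, WALS, AUTOTYP, Glottocode), and labels containing spaces, commas or parentheses would need quoting or escaping, which this sketch does not handle.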




Updated: 2018-06-22