Welcome to the machine: Terms, topics and trends
Journal of Animal Breeding and Genetics ( IF 2.6 ) Pub Date : 2020-10-10 , DOI: 10.1111/jbg.12511
Elisa Oertel 1 , Henner Simianer 1, 2

“Welcome to the machine” was the programmatic title of the second song on Pink Floyd's much-acclaimed 1975 album “Wish you were here”. Today, 45 years later, machine learning has undoubtedly become one of the hot topics in data-based research and is of growing importance in animal science, too. One essential objective of machine learning is to extract previously unknown features and patterns from massive data in an efficient way. There is a lot we can learn from complex data in animal breeding and genetics, be it the subtle substructure of livestock populations, typical behaviours of animals in the early stages of a disease, or complex genomic constellations affecting certain traits of interest. Under the umbrella term of data mining, machine learning outcomes can be connected with metadata to identify trends and dependencies in big data. In the following, we demonstrate that such techniques may also help to trace the evolution of topics in a research field, here animal breeding and genetics.

For this, we used topic modelling, a branch of statistical modelling that has been developed over the last two decades and is frequently used in text mining. It allows for finding latent (i.e. hidden) structures in large text collections, called corpora. The aim is to identify a predefined number of topics based on the entire set of words in the corpus. Analogously to the STRUCTURE algorithm in population genetics, which searches for latent genetic structures, topic modelling clusters latent semantic structures. In fact, both algorithms derive from the same principle and have been developed in parallel ever since. Clusters consist of similar entities (here words), which together represent an overall topic.

Without going too much into technical detail, the approach works as follows: first, irrelevant words (so-called stopwords such as “the”, “with”, “and”, …) are removed. After the corpus has been cleaned, some normalization and word weighting steps are performed. Then documents are randomly assigned to one of the predefined number of topics, and this assignment is subsequently optimized until the topics are as different as possible. For each topic, different scores can be calculated, depending on the weighting function used for similarity. In natural language processing and computational linguistics, the FREX score (standing for FRequency and EXclusivity) is currently considered state-of-the-art. The resulting FREX terms are assigned to clusters and therefore represent topics. Note that these terms are assigned probabilistically to topics, so that the same word is allowed to appear in multiple topics, with varying strength of linkage.
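The cleaning step can be illustrated with a minimal sketch. It is not the PubMeta or stm pipeline: the stopword list, the function names `tokenize` and `document_term_counts`, and the two example abstracts are all hypothetical and chosen only to show stopword removal and term counting.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption); real pipelines use much longer ones.
STOPWORDS = {"the", "with", "and", "of", "in", "a", "for", "were", "was", "to"}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def document_term_counts(documents):
    """Return one Counter of term frequencies per document."""
    return [Counter(tokenize(doc)) for doc in documents]

# Made-up abstracts, for illustration only.
abstracts = [
    "Heritability of milk yield was estimated in Holstein cattle.",
    "Genomic prediction with dense marker data in the pig.",
]
counts = document_term_counts(abstracts)
```

The per-document term counts are the kind of input on which a topic model is then fitted; normalization and weighting would follow this step.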

To give an example: we would not be surprised to see the word Holstein in a text about dairy cattle. However, Holstein may also appear in a text on German geography, since it is the name of a region in Germany north of Hamburg, spanning from the North Sea to the Baltic Sea. Together with the German-Dutch Frisia region to the west, it is the historical place of origin of the Holstein Friesian breed. The term rumen, on the other hand, may also be common in texts on dairy cattle, but is much less likely to appear in texts about German geography. Hence, rumen is a good separator of texts on dairy cattle and geography, while Holstein is not.
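The Holstein and rumen example can be put into numbers with a small sketch. It assumes the commonly described form of FREX, the weighted harmonic mean of a term's frequency rank and its exclusivity rank within a topic; implementations such as stm work on log probabilities and may differ in detail. The word probabilities in `beta` are made up, and `ecdf` and `frex_scores` are hypothetical helper names.

```python
def ecdf(values):
    """Map each value to the fraction of values less than or equal to it."""
    n = len(values)
    return [sum(1 for other in values if other <= v) / n for v in values]

def frex_scores(beta, topic, weight=0.5):
    """Harmonic mean of exclusivity rank and frequency rank for every term in one topic."""
    vocab = list(beta[topic])
    freq = [beta[topic][w] for w in vocab]
    # Exclusivity: share of a word's total probability mass that falls in this topic.
    excl = [beta[topic][w] / sum(beta[k][w] for k in beta) for w in vocab]
    f_rank, e_rank = ecdf(freq), ecdf(excl)
    return {
        w: 1.0 / (weight / e + (1.0 - weight) / f)
        for w, e, f in zip(vocab, e_rank, f_rank)
    }

# Made-up word probabilities per topic (each row sums to 1).
beta = {
    "dairy": {"holstein": 0.30, "rumen": 0.25, "milk": 0.40, "hamburg": 0.03, "coast": 0.02},
    "geography": {"holstein": 0.30, "rumen": 0.01, "milk": 0.04, "hamburg": 0.35, "coast": 0.30},
}
scores = frex_scores(beta, "dairy")
```

In this toy setting, Holstein is more frequent than rumen in the dairy topic, but it is shared equally with the geography topic, so rumen receives the higher FREX score.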

The core analysis, which aimed to identify topics and trends in the Journal of Animal Breeding and Genetics over the last 15 years, was conducted with PubMeta (http://pubmeta.uni-goettingen.de), an online implementation of the stm() function of the stm R package (Roberts, Stewart and Tingley 2019, Journal of Statistical Software 91). We first aimed at identifying the three most prevalent topics over the entire period. The corpus consisted of all abstracts of the journal published between 2005 and 2020, which PubMeta retrieves from PubMed. Unfortunately, English abstracts are not comprehensively available in PubMed before 2005, but the corpus was still substantial, with 852 abstracts from 2664 authors in 74 countries and a total of 183,600 words.

After a short overview provided by PubMeta, we downloaded the full model and applied the post-processing steps in tailor-made scripts that led to the reported results. We extracted the 50 most important FREX terms per topic, including their respective scores, and the probabilities of every abstract belonging to each of the three topics. As a result, we know the assignment probabilities of each of the 852 abstracts to each of the three topics, but what are the topics? This can be inferred from the FREX terms characterizing each of the topics, which are displayed as word clouds in Figure 1.
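The two extracted quantities can be pictured with a minimal sketch. It does not reproduce the authors' tailor-made scripts: the scores, the probability matrix `theta` and the helper `top_terms` are hypothetical and serve only to show the shape of the data.

```python
def top_terms(scores, n=50):
    """Return the n highest-scoring terms, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:n]

# Hypothetical FREX scores for one topic.
topic_scores = {"heritability": 0.97, "correlation": 0.91, "trait": 0.62, "marker": 0.15}

# Hypothetical assignment probabilities: one row per abstract, one column per topic.
theta = [
    [0.80, 0.15, 0.05],
    [0.10, 0.20, 0.70],
]

best_terms = top_terms(topic_scores, n=3)
# Index of the most probable topic for each abstract (0-based).
dominant_topic = [row.index(max(row)) for row in theta]
```

Each row of `theta` sums to one, so an abstract is never forced into a single topic; the dominant topic is merely the largest entry in its row.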

FIGURE 1. Word clouds of FREX terms per calculated topic for topics one to three and linear regression with 95% confidence bands of the assignment probability of each abstract to this topic on the year of publication

While the collection of words was generated by the algorithm in a completely uninformed way, it requires expert knowledge to identify a common theme for each of the three topics. Topic 1 is clearly dominated by the main genetic parameters (heritability and correlation) and also comprises numerous traits for which parameters are estimated. We would, therefore, suggest genetic parameters as the common theme for this topic. Arguably, a fitting title for topic 2 would be molecular genetics, while prediction and diversity nicely describes topic 3. Moreover, we can regress the assignment probabilities of the abstracts on the year of publication, which reveals how the importance of each topic has changed over time. This is represented by linear regressions with 95 per cent confidence bands overlaying the word clouds in Figure 1.
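The regression of assignment probability on publication year can be sketched as follows. This is an illustrative ordinary least squares fit, not the code behind Figure 1: the data points are made up, the function names are hypothetical, and the confidence band uses a normal quantile as an approximation to the t quantile, which is only reasonable for large samples such as the 852 abstracts.

```python
from statistics import NormalDist, mean

def fit_line(x, y):
    """Ordinary least squares fit of y = intercept + slope * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def confidence_band(x, y, x_new, level=0.95):
    """Approximate confidence interval for the fitted mean at x_new."""
    n = len(x)
    intercept, slope = fit_line(x, y)
    x_bar = mean(x)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    residual_var = sum(
        (yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y)
    ) / (n - 2)
    se = (residual_var * (1 / n + (x_new - x_bar) ** 2 / sxx)) ** 0.5
    z = NormalDist().inv_cdf(0.5 + level / 2)  # normal approximation to the t quantile
    fit = intercept + slope * x_new
    return fit - z * se, fit + z * se

# Made-up assignment probabilities to one topic, for illustration only.
years = [2005, 2008, 2011, 2014, 2017, 2020]
prob_topic3 = [0.10, 0.18, 0.35, 0.42, 0.55, 0.66]

intercept, slope = fit_line(years, prob_topic3)
lower, upper = confidence_band(years, prob_topic3, 2020)
```

A positive slope corresponds to a topic gaining importance over time, a negative slope to a declining one; the band widens towards the ends of the observed period.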

When interpreting these results, it is striking that the three topics reflect quite well the self-description of the journal on its homepage, which states that it “publishes original articles … on genomic selection, and any other topic related to breeding programmes, selection, quantitative genetic, genomics, diversity and evolution of domestic animals”. Manuscripts on topic 1, dealing with the estimation of genetic parameters for any sort of trait across diverse species, are well represented over the entire time range considered, with a slight increase. A strong decline over time is observed for topic 2, representing molecular genetic analyses across all livestock species, which was a prominent topic in 2005 but now ranks third. The opposite trend holds for topic 3, which, in our view, illustrates the rise of genomic prediction and diversity studies based on high-throughput marker data and is today the most popular topic in the journal. A detailed analysis shows that for the first two topics, especially in the early years, many abstracts are assigned to the respective topic with either extremely high or extremely low probability. Since 2010, almost all published abstracts are to some extent assigned to the third topic as well, reflecting the large impact that high-throughput genotyping has had in all fields of research.

Obviously, extracting topics from the entire corpus over 15 years limits the power to detect topics that have been relevant only in a limited time frame. Therefore, we repeated the analysis for certain time periods, for example for abstracts published in the years 2017-2020 only. One of the clusters emerging there is clearly related to the efficiency of the production system, containing terms like feed efficiency, longevity, residual feed intake and emissions. This specific combination of FREX terms did not show up that prominently before, reflecting that breeding for resource efficiency and reduced environmental impact has become an emerging topic only in recent years.

Of course, these and similar results just tell us in a descriptive way which major topics were published in the journal, and how these patterns developed over time. The approach does not provide any inference as to why these trends evolved. Do they reflect actual trends in the scientific field, or are they due to changes in editorial policies or preferences? Another reason might be competition between the journals in the sector. A new journal dedicated to a certain field may attract a specific sort of article, which is then no longer submitted to the Journal of Animal Breeding and Genetics, leading to a shift in topics. Again, expert knowledge is required to arrive at a fair appraisal of the extracted patterns. Interested readers can repeat the analyses with PubMeta and vary, for example, the time periods and the number of topics.

Admittedly, one might have arrived at similar conclusions by tediously inspecting all the abstracts and counting the occurrence of keywords over time. It is still instructive, though, that an algorithm which is completely ignorant of the meaning of words and of the scientific context was able to extract the described patterns and trends, which make a lot of sense and provide valuable insights. We expect to see many more applications of machine learning techniques to various data types in animal science (genotypes and sequences, sensor data, videos, etc.) in the near future. We are only just starting to understand the potential of these techniques, which may lie in areas that so far have not been accessible to profound and automated analyses.




Updated: 2020-10-11