A machine learning framework to determine geolocations from metagenomic profiling
Biology Direct (IF 5.7), Pub Date: 2020-11-23, DOI: 10.1186/s13062-020-00278-z
Lihong Huang 1 , Canqiang Xu 2 , Wenxian Yang 2 , Rongshan Yu 1, 2

Studies on metagenomic data of environmental microbial samples have found that microbial communities appear to be geolocation-specific, and that the microbiome abundance profile can serve as a differentiating feature for identifying a sample's geolocation. In this paper, we present a machine learning framework to determine geolocations from metagenomic profiling of microbial samples. Our method was applied to the multi-source microbiome data from the MetaSUB (The Metagenomics and Metadesign of Subways and Urban Biomes) International Consortium for the CAMDA 2019 Metagenomic Forensics Challenge (the Challenge). The goal of the Challenge is to predict the geographical origins of mystery samples by constructing microbiome fingerprints.
First, we extracted features from metagenomic abundance profiles. We then randomly split the training data into training and validation sets, trained the prediction models on the training set, and evaluated prediction performance on the validation set. Using logistic regression with L2 normalization, the model reaches a prediction accuracy of 86%, averaged over 100 random splits of the data into training and validation sets.
The testing data consist of samples from cities that do not occur in the training data. To predict these previously unsampled "mystery" cities, we first defined biological coordinates for the sampled cities based on the similarity of the microbial samples collected from them. We then performed an affine transform on the map so that the distance between cities measures their biological difference rather than their geographical distance. After that, we used Kriging interpolation to derive the probabilities that a given testing sample comes from each unsampled city, based on its predicted probabilities for the sampled cities. Results show that this method can successfully assign high probabilities to the true cities of origin of the testing samples.
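The interpolation step can be illustrated with a minimal sketch of ordinary Kriging. It is not the authors' implementation: the linear variogram, the function names `solve_linear` and `ordinary_kriging`, and the toy coordinates and probabilities are all assumptions made for illustration. Given a sample's predicted probabilities for the sampled cities, placed at hypothetical "biological coordinates", it estimates a value at the coordinates of an unsampled city.

```python
import math


def solve_linear(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for col in range(n):
        # The kriging matrix has zeros on its diagonal, so pivoting is required.
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= factor * M[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - tail) / M[r][r]
    return x


def ordinary_kriging(coords, values, target):
    """Ordinary kriging estimate at `target`.

    Assumes a linear variogram gamma(h) = h with no nugget; a real analysis
    would fit the variogram to the data instead.
    """
    n = len(coords)
    # Variogram matrix bordered by the unbiasedness constraint sum(weights) = 1.
    A = [[math.dist(p, q) for q in coords] + [1.0] for p in coords]
    A.append([1.0] * n + [0.0])
    b = [math.dist(p, target) for p in coords] + [1.0]
    weights = solve_linear(A, b)[:n]  # drop the Lagrange multiplier
    estimate = sum(w * z for w, z in zip(weights, values))
    return estimate, weights


if __name__ == "__main__":
    # Hypothetical biological coordinates of four sampled cities and one
    # test sample's predicted probability for each of them (made-up numbers).
    sampled_coords = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
    predicted_probs = [0.60, 0.20, 0.15, 0.05]
    unsampled_city = (0.5, 0.5)
    estimate, weights = ordinary_kriging(sampled_coords, predicted_probs, unsampled_city)
    print("interpolated value:", estimate)
    print("kriging weights:", weights)
```

Kriging weights can be negative, so an interpolated value may fall slightly outside [0, 1]; clipping and renormalizing across the candidate cities is one possible way to read the results as probabilities, though the abstract does not say how this is handled.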
Our framework shows good performance in predicting the geographic origin of metagenomic samples for cities where training data are available. Furthermore, we demonstrate the potential of the proposed method to predict metagenomic samples’ geolocations for samples from locations that are not in the training dataset.
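The validation protocol described in the abstract (a classifier scored on repeated random training/validation splits, with the accuracies averaged) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the authors' pipeline: "L2 normalization" is read here as an L2 penalty on the weights, the model is a small softmax regression fitted by gradient descent, and the three-taxon toy profiles produced by `make_toy_profiles` are synthetic. All function names and hyperparameters are hypothetical.

```python
import math
import random


def predict_proba(weights, x):
    """Softmax class probabilities for one feature vector (bias is the last weight)."""
    xb = x + [1.0]
    scores = [sum(w * v for w, v in zip(row, xb)) for row in weights]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_logreg(X, y, n_classes, l2=0.01, lr=0.5, epochs=200):
    """Multinomial logistic regression with an L2 penalty, batch gradient descent."""
    n_features = len(X[0]) + 1  # +1 for the bias term
    weights = [[0.0] * n_features for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * n_features for _ in range(n_classes)]
        for x, label in zip(X, y):
            probs = predict_proba(weights, x)
            xb = x + [1.0]
            for c in range(n_classes):
                err = probs[c] - (1.0 if c == label else 0.0)
                for j in range(n_features):
                    grad[c][j] += err * xb[j]
        for c in range(n_classes):
            for j in range(n_features):
                # The bias (last column) is left unpenalized.
                penalty = l2 * weights[c][j] if j < n_features - 1 else 0.0
                weights[c][j] -= lr * (grad[c][j] / len(X) + penalty)
    return weights


def accuracy(weights, X, y):
    """Fraction of samples whose most probable class matches the label."""
    hits = 0
    for x, label in zip(X, y):
        probs = predict_proba(weights, x)
        if probs.index(max(probs)) == label:
            hits += 1
    return hits / len(y)


def mean_split_accuracy(X, y, n_classes, n_splits=100, val_fraction=0.2, seed=0):
    """Average validation accuracy over repeated random train/validation splits."""
    rng = random.Random(seed)
    idx = list(range(len(X)))
    scores = []
    for _ in range(n_splits):
        rng.shuffle(idx)
        cut = int(len(idx) * val_fraction)
        val, train = idx[:cut], idx[cut:]
        w = train_logreg([X[i] for i in train], [y[i] for i in train], n_classes)
        scores.append(accuracy(w, [X[i] for i in val], [y[i] for i in val]))
    return sum(scores) / len(scores)


def make_toy_profiles(n_per_city=30, seed=1):
    """Synthetic abundance profiles: each 'city' is dominated by a different taxon."""
    rng = random.Random(seed)
    centers = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.1, 0.8]]
    X, y = [], []
    for city, center in enumerate(centers):
        for _ in range(n_per_city):
            X.append([max(0.0, c + rng.gauss(0, 0.05)) for c in center])
            y.append(city)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_profiles()
    # Fewer splits than the 100 reported in the abstract, to keep the demo quick.
    print("mean validation accuracy:", mean_split_accuracy(X, y, n_classes=3, n_splits=10))
```

Averaging over many splits reduces the dependence of the reported accuracy on any single partition of the data. The toy profiles are far easier to separate than real metagenomic data, so the accuracy printed here says nothing about the 86% figure reported above.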

Updated: 2020-11-23