Unraveling city-specific signature and identifying sample origin locations for the data from CAMDA MetaSUB challenge
Biology Direct ( IF 5.7 ) Pub Date : 2021-01-04 , DOI: 10.1186/s13062-020-00284-1
Runzhi Zhang 1 , Alejandro R Walker 2 , Susmita Datta 1

The composition of microbial communities can be location-specific, and differences in taxon abundance across locations can help unravel city-specific signatures and accurately predict where a sample originated. In this study, whole genome shotgun (WGS) metagenomics data from samples across 16 cities around the world, and from samples from another 8 cities, were provided as the main and mystery datasets, respectively, as part of the CAMDA 2019 MetaSUB “Forensic Challenge”. Feature selection, normalization, three machine learning methods, PCoA (Principal Coordinates Analysis) and ANCOM (Analysis of Composition of Microbiomes) were applied to both the main and mystery datasets. Feature selection, combined with the machine learning methods, revealed that a combination of the common features was effective for predicting the origin of the samples. Averaged over the three machine learning methods, the error rates were 11.93% for the main dataset and 30.37% for the mystery dataset. When samples from the main dataset were used to predict the labels of samples from the mystery dataset, 89.98% of the test samples were correctly labeled as “mystery” samples. PCoA showed that nearly 60% of the total variability of the data was explained by the first two PCoA axes. Although many cities overlapped, some cities were separated in the PCoA. The ANCOM results, combined with the importance scores from the Random Forest (RF), indicated that the common “family” and “order” features of the main dataset and the common “order” features of the mystery dataset were the most informative for prediction. The classification results suggested that the composition of the microbiomes was distinctive across cities and could be used to identify sample origins. This was also supported by the ANCOM results and the RF importance scores. In addition, prediction accuracy could be improved with more samples and greater sequencing depth.
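To make the PCoA step in the abstract more concrete, the following is a minimal sketch of how the share of variability captured by the first two PCoA axes can be computed. It is not the authors' pipeline: the Bray-Curtis dissimilarity, the toy count matrix, and the helper names `bray_curtis` and `pcoa` are all assumptions chosen for illustration, since the abstract does not state which distance or software was used.

```python
import numpy as np

def bray_curtis(X):
    """Pairwise Bray-Curtis dissimilarity between rows of a non-negative matrix."""
    n = X.shape[0]
    D = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            num = np.abs(X[i] - X[j]).sum()
            den = (X[i] + X[j]).sum()
            D[i, j] = D[j, i] = num / den if den > 0 else 0.0
    return D

def pcoa(D, k=2):
    """Classical PCoA: double-center the squared distances, then eigendecompose."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # Gower-centered matrix
    vals, vecs = np.linalg.eigh(B)            # B is symmetric
    order = np.argsort(vals)[::-1]            # sort eigenvalues, largest first
    vals, vecs = vals[order], vecs[:, order]
    # Bray-Curtis is not Euclidean, so some eigenvalues may be negative.
    # Assumption: report explained variability relative to positive eigenvalues only.
    pos = vals > 1e-12
    explained = vals[pos] / vals[pos].sum()
    coords = vecs[:, :k] * np.sqrt(np.clip(vals[:k], 0, None))
    return coords, explained[:k]

# Hypothetical toy data: 5 samples from each of two "cities", 4 taxa.
rng = np.random.default_rng(0)
city_a = rng.poisson(lam=[50, 30, 5, 5], size=(5, 4))
city_b = rng.poisson(lam=[5, 5, 40, 60], size=(5, 4))
counts = np.vstack([city_a, city_b]).astype(float)

# Normalize counts to relative abundances before computing dissimilarities.
rel = counts / counts.sum(axis=1, keepdims=True)

D = bray_curtis(rel)
coords, explained = pcoa(D, k=2)
print("coordinates shape:", coords.shape)
print("fraction explained by first two axes:", explained.sum())
```

Plotting `coords` colored by city would give the kind of ordination in which some cities separate and others overlap; the printed fraction is the analogue of the "nearly 60%" figure reported above, though on real data it depends on the chosen distance and normalization.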

Updated: 2021-01-04