Artificial intelligence powered statistical genetics in biobanks.
Journal of Human Genetics (IF 3.5), Pub Date: 2020-08-11, DOI: 10.1038/s10038-020-0822-y
Akira Narita 1, Masao Ueki 2, Gen Tamiya 1,2

Large-scale, sometimes nationwide, prospective genomic cohorts that biobank rich biological specimens such as blood, urine, and tissues have been established in several countries and have released vast amounts of data. These genetic and epidemiological resources are expected to allow investigators to disentangle the genetic and environmental components underlying common complex diseases. There are, however, two major challenges for statistical genetics in pursuing this goal: small sample size relative to high dimensionality, and multilayered, heterogeneous endophenotypes. Rather counterintuitively, biobank data generally have a small sample size relative to their dimensionality, which comprises genomic variation, lifestyle questionnaire items, and sometimes their interactions. This is a widely acknowledged difficulty in data analysis, known as the “p ≫ n problem” in statistics or the “curse of dimensionality” in machine learning. On the other hand, there are very many measurements of individual health status, that is, endophenotypes, such as health check-up data, images, and psychological test scores, in addition to metabolomics and proteomics data. These endophenotypes are rich but not easily tractable, because they increase dimensionality further and are substantially correlated, which sometimes obscures the causal relationships among them. We have tried to overcome these problems inherent to biobank data using statistical machine-learning and deep-learning technologies.




Updated: 2020-08-11