Practical guide for managing large-scale human genome data in research
Journal of Human Genetics (IF 2.6), Pub Date: 2020-10-23, DOI: 10.1038/s10038-020-00862-1
Tomoya Tanjo 1 , Yosuke Kawai 2 , Katsushi Tokunaga 2 , Osamu Ogasawara 3 , Masao Nagasaki 4, 5

Studies in human genetics deal with a plethora of human genome sequencing data, both generated from specimens and available in the public domain. With the development of various bioinformatics applications, it is essential to maintain research productivity while managing human genome data and analyzing downstream data. This review aims to guide researchers who struggle to process and analyze these large-scale genomic data and to extract relevant information for improved downstream analyses. Here, we discuss worldwide human genome projects whose data could be integrated with any dataset for improved analysis. Human whole-genome sequencing data are costly in terms of both data storage and processing; therefore, we focus on the development of data formats and software that manipulate whole-genome sequencing data. Once the sequencing is complete and the data format and processing tools are selected, a computational platform is required. For the platform, we describe a multi-cloud strategy that balances cost, performance, and customizability. High-quality published research relies on reproducibility to ensure quality results, reusability for application to other datasets, and scalability to accommodate future growth in datasets. To address these needs, we describe several key technologies developed in computer science, including workflow engines. We also discuss the ethical guidelines that are indispensable for human genomic data analysis and that differ from those for model organisms. Finally, we summarize an ideal future perspective for data processing and analysis.




Updated: 2020-10-28