Advanced algorithms evolve in the age of genomic big data.

Genetic variation underlies many complex phenotypes and diseases in human populations. The rapid development of high-throughput sequencing technologies opens up opportunities to identify genetic variation on an ever more comprehensive scale and in unprecedentedly large cohorts. This creates a continual demand for advanced informatics methods for genomic data analysis.

Some genome graphs resemble subway maps. Credit: Marina Corral Spence/Springer Nature

One challenge is the handling of huge sequencing datasets (Nat. Rev. Genet. 20, 693–701, 2019). The reference genome has been pivotal in identifying and analyzing genetic variants since the completion of the Human Genome Project. However, it has become increasingly clear that a reference-genome-centric strategy may be neither sufficient to characterize the full spectrum of human genomic variation nor efficient for computational and statistical analysis. A variety of graph-based models have emerged recently that achieve higher accuracy and speed (Nat. Genet. 49, 1654–1660, 2017; Nat. Biotechnol. 36, 875–879, 2018; Nat. Genet. 51, 354–362, 2019). Other architectures take advantage of the genealogical relationships among samples in human populations (Nat. Genet. 51, 1330–1338, 2019). We expect that the coming years will see more powerful models, as well as their integration with the vast arsenal of genomic analysis tools and corresponding updates to those tools.

Another direction of method development is driven by the imperfection of sequencing technologies. Despite continuous improvement, trade-offs among factors such as coverage, read length and error rate still limit the quantity and quality of the data available to genome informatics tools. Multi-platform strategies have shown promising performance in tasks such as de novo genome assembly and structural variation identification (Nat. Commun. 10, 1784, 2019). We look forward to informatics approaches keeping pace with the advancement of sequencing technologies (Nat. Biotechnol. 37, 1155–1162, 2019).

Beyond these two areas, the accurate and comprehensive interpretation of human genomic data in terms of biological and medical significance is arguably the holy grail of human genomics. Statistical and machine-learning methods have been thriving in statistical genetics and functional genomics. With the deluge of large-scale genomic data, variant types that were previously understudied (owing to their low frequency or complex structure) are becoming the next frontier of genetic investigation. We believe informatics tools are indispensable for exploring the sea of genomic big data.