Abstract
DNA methylation data have become a valuable source of information for biomarker development, because, unlike static genetic risk estimates, DNA methylation varies dynamically in relation to diverse exogenous and endogenous factors, including environmental risk factors and complex disease pathology. Reliable methods for genome-wide measurement at scale have led to the proliferation of epigenome-wide association studies and subsequently to the development of DNA methylation-based predictors across a wide range of health-related applications, from the identification of risk factors or exposures, such as age and smoking, to early detection of disease or progression in cancer, cardiovascular and neurological disease. This Review evaluates the progress of existing DNA methylation-based predictors, including the contribution of machine learning techniques, and assesses the uptake of key statistical best practices needed to ensure their reliable performance, such as data-driven feature selection, elimination of data leakage in performance estimates and use of generalizable, adequately powered training samples.
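The data-leakage point above can be made concrete with a small simulation. The sketch below is an illustration only and is not taken from the Review: the data are pure random noise standing in for CpG methylation values, and the helper names (`top_features`, `fit_predict`, `cv_accuracy`), the sample and feature counts, and the simple equal-weight centroid classifier are all assumptions chosen for brevity. It compares a cross-validated accuracy estimate when features are selected on all samples (leaky) with one where selection is repeated inside each training fold (honest).

```python
import random


def avg(values):
    """Plain arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def top_features(X, y, rows, k):
    """Rank features by absolute class-mean difference, using only `rows`."""
    diffs = []
    for j in range(len(X[0])):
        m1 = avg([X[i][j] for i in rows if y[i] == 1])
        m0 = avg([X[i][j] for i in rows if y[i] == 0])
        diffs.append((abs(m1 - m0), j))
    return [j for _, j in sorted(diffs, reverse=True)[:k]]


def fit_predict(X, y, train, test, feats):
    """Equal-weight centroid classifier fitted on `train`, applied to `test`."""
    params = []
    for j in feats:
        m1 = avg([X[i][j] for i in train if y[i] == 1])
        m0 = avg([X[i][j] for i in train if y[i] == 0])
        params.append((j, 1.0 if m1 >= m0 else -1.0, (m1 + m0) / 2))
    preds = []
    for i in test:
        score = sum(sign * (X[i][j] - mid) for j, sign, mid in params)
        preds.append(1 if score > 0 else 0)
    return preds


def cv_accuracy(X, y, k_feats, n_folds, leaky, rng):
    """Cross-validated accuracy; `leaky=True` selects features on all samples."""
    idx = list(range(len(X)))
    rng.shuffle(idx)
    folds = [idx[f::n_folds] for f in range(n_folds)]
    # Leaky selection: the ranking has already seen every held-out sample.
    all_feats = top_features(X, y, idx, k_feats)
    correct = 0
    for test in folds:
        held_out = set(test)
        train = [i for i in idx if i not in held_out]
        # Honest selection: redo the ranking from the training fold alone.
        feats = all_feats if leaky else top_features(X, y, train, k_feats)
        preds = fit_predict(X, y, train, test, feats)
        correct += sum(p == y[i] for p, i in zip(preds, test))
    return correct / len(X)


# Hypothetical toy data: 60 samples, 1000 noise features, balanced labels.
data_rng = random.Random(0)
n_samples, n_features = 60, 1000
y = [i % 2 for i in range(n_samples)]
X = [[data_rng.gauss(0, 1) for _ in range(n_features)] for _ in range(n_samples)]

# The same fold seed is used for both runs so only the selection step differs.
leaky = cv_accuracy(X, y, 20, 5, True, random.Random(1))
honest = cv_accuracy(X, y, 20, 5, False, random.Random(1))
print(f"leaky estimate:  {leaky:.2f}")
print(f"honest estimate: {honest:.2f}")
```

Because the features carry no signal, any accuracy well above 0.5 reflects leakage rather than predictive ability. In this setup the leaky estimate would be expected to sit far above chance, while the honest estimate should stay close to it.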
Pinu, F. R., Goldansaz, S. A. & Jaine, J. Translational metabolomics: current challenges and future opportunities. Metabolites 9, 108 (2019).
Ignjatovic, V. et al. Mass spectrometry-based plasma proteomics: considerations from sample collection to achieving translational data. J. Proteome Res. 18, 4085–4097 (2019).
Shah, S. et al. Improving phenotypic prediction by combining genetic and epigenetic associations. Am. J. Hum. Genet. 97, 75–85 (2015). This study demonstrates the additive explanatory power of combining polygenic and DNAm-based complex trait prediction, with greater benefit observed when adding DNAm information for traits with greater environmental components.
Shah, S. et al. Genetic and environmental exposures constrain epigenetic drift over the human life course. Genome Res. 24, 1725–1733 (2014).
Trejo Banos, D. et al. Bayesian reassessment of the epigenetic architecture of complex traits. Nat. Commun. 11, 2865 (2020).
Zhang, F. et al. OSCA: a tool for omic-data-based complex trait analysis. Genome Biol. 20, 107 (2019).
Ebrahim, A. et al. Multi-omic data integration enables discovery of hidden biological regularities. Nat. Commun. 7, 13091 (2016).
Argelaguet, R. et al. Multi-Omics Factor Analysis — a framework for unsupervised integration of multi-omics data sets. Mol. Syst. Biol. 14, e8124 (2018).
Woo, H. G. et al. Integrative analysis of genomic and epigenomic regulation of the transcriptome in liver cancer. Nat. Commun. 8, 839 (2017).
Zhu, B. et al. Integrating clinical and multiple omics data for prognostic assessment across human cancers. Sci. Rep. 7, 16954 (2017).
Hasin, Y., Seldin, M. & Lusis, A. Multi-omics approaches to disease. Genome Biol. 18, 83 (2017).
Gadd, D. A. et al. Epigenetic scores for the circulating proteome as tools for disease prediction. eLife 11, e71802 (2022). This study highlights the potential of DNAm to index endogenous biomarkers and thus enhance prediction of phenotypes or diseases associated with these biomarkers.
Collins, G. S., Reitsma, J. B., Altman, D. G. & Moons, K. G. M. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD Statement. BMC Med. 13, 1 (2015). This paper details consensus recommendations of best practices for reporting prediction modelling results as developed by an international expert pannel.
Moons, K. G. M., Royston, P., Vergouwe, Y., Grobbee, D. E. & Altman, D. G. Prognosis and prognostic research: what, why, and how? BMJ 338, 1317–1320 (2009).
Weber, L. M. et al. Essential guidelines for computational method benchmarking. Genome Biol. 20, 125 (2019).
Shmueli, G. To explain or to predict? Stat. Sci. 25, 289–310 (2010). This paper provides an accessible explanation of the distinctions between explanatory and predictive statistics in terms of aims and methodologies, as well as perspective on why such differences have been persistently confused across fields.
Murray, R. P., Connett, J. E., Lauger, G. G. & Voelker, H. T. Error in smoking measures: effects of intervention on relations of cotinine and carbon monoxide to self-reported smoking. Am. J. Public Health 83, 1251 (1993).
Rehm, J. & Spuhler, T. Measurement error in alcohol consumption: the Swiss Health Survey. Eur. J. Clin. Nutr. 47 (Suppl. 2), S25–S30 (1993).
Subar, A. F. et al. Using intake biomarkers to evaluate the extent of dietary misreporting in a large sample of adults: the OPEN study. Am. J. Epidemiol. 158, 1–13 (2003).
Adab, P., Pallan, M. & Whincup, P. H. Is BMI the best measure of obesity? BMJ 360, k1274 (2018).
Greenland, S., Pearl, J. & Robins, J. M. Causal diagrams for epidemiologic research. Epidemiology 10, 37–48 (1999).
Peters, J., Janzing, D. & Schölkopf, B. Elements of Causal Inference: Foundations and Learning Algorithms (MIT Press, 2017).
Pearl, J. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. The Morgan Kaufmann Series in Representation and Reasoning (Morgan Kaufmann, 1988).
Piccininni, M., Konigorski, S., Rohmann, J. L. & Kurth, T. Directed acyclic graphs and causal thinking in clinical risk prediction modeling. BMC Med. Res. Methodol. 20, 179 (2020).
Austin, P. C. & Steyerberg, E. W. The Integrated Calibration Index (ICI) and related metrics for quantifying the calibration of logistic regression models. Stat. Med. 38, 4051–4065 (2019).
Korologou-Linden, R., Leyden, G. M., Relton, C. L., Richmond, R. C. & Richardson, T. G. Multi-omics analyses of cognitive traits and psychiatric disorders highlights brain-dependent mechanisms. Hum. Mol. Genet. https://doi.org/10.1093/hmg/ddab016 (2021).
Tsai, P. C. et al. Smoking induces coordinated DNA methylation and gene expression changes in adipose tissue with consequences for metabolic health. Clin. Epigenetics 10, 126 (2018).
Smith, A. K. et al. DNA extracted from saliva for methylation studies of psychiatric traits: evidence tissue specificity and relatedness to brain. Am. J. Med. Genet. B Neuropsychiatr. Genet. 168, 36–44 (2015).
Braun, P. R. et al. Genome-wide DNA methylation comparison between live human brain and peripheral tissues within individuals. Transl. Psychiatry 9, 47 (2019).
Nagy, C. et al. Single-nucleus transcriptomics of the prefrontal cortex in major depressive disorder implicates oligodendrocyte precursor cells and excitatory neurons. Nat. Neurosci. 23, 771–781 (2020).
Acknowledgements
The authors thank G. Hemani for helpful discussions on genetic prediction and K. Tilling for comments on a draft manuscript. The authors’ work is supported by the Medical Research Council Integrative Epidemiology Unit at the University of Bristol (MC_UU_00011/1 & 5) and via the Cancer Research UK programme grant (C18281/A29019). The authors’ work is also supported by the NIHR Biomedical Research Centre at University Hospitals Bristol and Weston NHS Foundation Trust and the University of Bristol. The views expressed are those of the authors and not necessarily those of the NIHR or the Department of Health and Social Care.
Author information
Contributions
P.D.Y., M.S., R.L. and O.W. researched the literature. P.D.Y., M.S. and C.L.R. contributed substantially to discussions of the content. P.D.Y., M.S. and R.L. wrote the article. P.D.Y., M.S., G.D.S. and C.L.R. reviewed and/or edited the manuscript before submission.
Ethics declarations
Competing interests
The authors declare no competing interests.
Peer review
Peer review information
Nature Reviews Genetics thanks Christopher Bell, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Glossary
- Genome-wide association studies (GWAS). Studies that examine the statistical correlation or ‘association’ between a set of genetic polymorphisms large enough to capture most of the variation in the human genome and a given phenotype of interest.
- Polygenic risk scores (PRSs). Weighted sums of risks for a phenotype conferred by genetic polymorphisms within an individual, where the weights used are coefficients from the relevant genome-wide association studies (GWAS). GWAS loci are typically selected for inclusion in the score by applying a P value threshold, commonly that of genome-wide significance (P < 5 × 10⁻⁸).
- Broad-sense heritability. The proportion of phenotype or trait variance attributable to genetic factors.
- DNA methylation (DNAm). An epigenetic modification whereby a methyl group (CH₃) is covalently attached to a DNA base in a mitotically stable bond. In mammals, DNAm occurs mainly at cytosine residues in CpG sites.
- CpG sites. Specific sequences of DNA bases where cytosines are followed by guanines. The ‘p’ indicates the phosphate bond separating the two residues in sequence in the 5′ to 3′ direction.
- Epigenome-wide association studies (EWAS). Studies that examine the association between a large number of epigenetic variables and a phenotype or exposure of interest. As most have been performed using DNA methylation levels, we treat EWAS and methylome-wide association studies as synonyms.
- DNAm-based predictors. Any statistical models (for example, a linear model) of observed data employed to predict values of an outcome (for example, exposure, phenotype or disease) in which many or all of the input variables are levels of DNA methylation (DNAm) measured at CpG sites.
- Machine learning. Algorithms and statistical models that improve their performance from experience or by optimization through training on earlier data collection.
- Epigenetic clocks. Estimators of biological age or other ageing phenotypes that use levels of DNA methylation or other epigenetic measurements as inputs.
- Penalized regression. Linear regression modelling methods that apply some numerical penalty on the total size of all input variable coefficient values. Examples include lasso, ridge and elastic net regression.
- Linear model. A statistical description of the relationship between one or many input variables X and an observed level of an output Y, where each X–Y association is summarized by the slope or coefficient of the line plotted between them.
- Biological age. The hypothesis that the phenotypical age of a DNA source (for example, cell, tissue or organ) may be greater (that is, accelerated) or less (that is, decelerated) than chronological age at any given point in time.
- Mendelian randomization. An analytical method that uses genetic variants as instrumental variables to evaluate putative causal relationships between modifiable risk factors and disease outcomes.
- Cell-free DNA (cfDNA). Non-nucleated DNA found circulating in blood plasma. Sources can include lysed cells from any number of tissues, including tumour cells, which are commonly of greatest interest.
- Winner’s curse. The phenomenon that strength of association is commonly overestimated in initial discovery samples and often experiences a regression to the mean in subsequent validation.
- Linkage disequilibrium (LD). Greater than chance co-occurrence or association of alleles at various loci due to nonrandom assortment.
- Feature engineering. The process of transforming or combining possible inputs (for example, by taking their principal components or rescaling their values) to make novel super-features that better explain or predict an outcome.
- Out-of-sample prediction error. The discrepancy between estimates of an outcome \(\hat{Y}\) generated by a predictive modelling function f and values of Y observed in a sample of data that was not available to f during model training.
- In-sample prediction error. The discrepancy between estimated values of an outcome \(\hat{Y}\) generated by a modelling function f and values of Y observed in a sample of data that was available to f during model training.
- Resampling. Splitting, partitioning or sampling available data to generate subsamples in which model predictions can be tested and used to estimate distributions of out-of-sample errors.
- Accuracy. The percentage of predictions, across all levels (classes) of a classifier, that agree with the observed values of those levels.
- Confusion matrix. A frequency table of agreement and disagreement between observed and predicted values of an outcome variable. It is used to compute many classification metrics, including, among others, accuracy, sensitivity and specificity.
- Cohen’s kappa. A confusion matrix metric ranging from –1 (total disagreement between observed and predicted classes) to 1 (total agreement), where class imbalances are corrected by normalizing to the expected error rate.
- Matthews correlation coefficient. A numerical summary of agreement in a confusion matrix, ranging from –1 (total disagreement) to 1 (total agreement), that seeks to correct for class imbalances using a method similar to that of a χ² statistic.
- Calibration. The extent to which predicted outcome risk matches observed outcome proportions.
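The confusion matrix metrics defined in the glossary (accuracy, Cohen’s kappa and the Matthews correlation coefficient) can be illustrated with a minimal sketch for the binary case. The function names, the 0/1 coding of classes and the toy data below are assumptions made for illustration only and do not come from the article. Returning 0.0 when a denominator is zero is one common convention, not the only one.

```python
from math import sqrt


def confusion_counts(observed, predicted):
    """Return (tp, fp, fn, tn); assumes labels are exactly 1 (case) or 0 (control)."""
    tp = fp = fn = tn = 0
    for obs, pred in zip(observed, predicted):
        if pred == 1 and obs == 1:
            tp += 1
        elif pred == 1 and obs == 0:
            fp += 1
        elif pred == 0 and obs == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Proportion of predictions that agree with the observed class."""
    return (tp + tn) / (tp + fp + fn + tn)


def cohens_kappa(tp, fp, fn, tn):
    """Observed agreement corrected for the agreement expected by chance."""
    n = tp + fp + fn + tn
    p_observed = (tp + tn) / n
    # Chance agreement from the marginal totals of the confusion matrix.
    p_expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    if p_expected == 1:
        return 0.0  # convention chosen here for the degenerate case
    return (p_observed - p_expected) / (1 - p_expected)


def matthews_corrcoef(tp, fp, fn, tn):
    """Correlation between observed and predicted binary classes."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # convention chosen here for the degenerate case
    return (tp * tn - fp * fn) / denominator


# Toy imbalanced sample (hypothetical): 10 cases and 90 controls.
observed = [1] * 10 + [0] * 90
predicted = [1] * 2 + [0] * 8 + [1] * 3 + [0] * 87

tp, fp, fn, tn = confusion_counts(observed, predicted)
print("accuracy:", accuracy(tp, fp, fn, tn))
print("Cohen's kappa:", cohens_kappa(tp, fp, fn, tn))
print("MCC:", matthews_corrcoef(tp, fp, fn, tn))
```

In this toy sample, accuracy is high because most controls are classified correctly, whereas kappa and the Matthews correlation coefficient are much lower because only 2 of the 10 cases are detected. This is the class imbalance problem that the latter two metrics are intended to address.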
Cite this article
Yousefi, P.D., Suderman, M., Langdon, R. et al. DNA methylation-based predictors of health: applications and statistical considerations. Nat Rev Genet 23, 369–383 (2022). https://doi.org/10.1038/s41576-022-00465-w
DOI: https://doi.org/10.1038/s41576-022-00465-w