Deep neural networks identify sequence context features predictive of transcription factor binding

Zheng, An; Lamkin, Michael; Zhao, Hanqing; Wu, Cynthia; Su, Hao; Gymrek, Melissa

doi:10.1038/s42256-020-00282-y

Article
Published: 18 January 2021

Nature Machine Intelligence volume 3, pages 172–180 (2021)

A preprint version of the article is available at bioRxiv.

Abstract

Transcription factors bind DNA by recognizing specific sequence motifs, which are typically 6–12 bp long. A motif can occur many thousands of times in the human genome, but only a subset of those sites are actually bound. Here we present a machine-learning framework leveraging existing convolutional neural network architectures and model interpretation techniques to identify and interpret sequence context features most important for predicting whether a particular motif instance will be bound. We apply our framework to predict binding at motifs for 38 transcription factors in a lymphoblastoid cell line, score the importance of context sequences at base-pair resolution and characterize context features most predictive of binding. We find that the choice of training data heavily influences classification accuracy and the relative importance of features such as open chromatin. Overall, our framework enables novel insights into features predictive of transcription factor binding and is likely to inform future deep learning applications to interpret non-coding genetic variants.

**Fig. 2: Interpreting context-specific determinants of TF binding.**

**Fig. 3: Identifying key context sequence features for TF binding in GM12878.**

**Fig. 4: Cell-type-specific enrichment of 5-mers influential for STAT3 binding.**

Data availability

Variant annotation scores for each TF analysed can be found at https://github.com/Pandaman-Ryan/AgentBind. Peak files for ENCODE ChIP-seq datasets can be found at http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeAwgTfbsUniform. Peak files for STAT3 in CD4+ T cells were obtained from the Gene Expression Omnibus (GEO accession GSM2545819).

Code availability

Code used for training models and performing analyses is available in our GitHub repository https://github.com/Pandaman-Ryan/AgentBind (ref. 45).

References

Lambert, S. A. et al. The human transcription factors. Cell 172, 650–665 (2018).
Zaret, K. S. & Mango, S. E. Pioneer transcription factors, chromatin dynamics, and cell fate control. Curr. Opin. Genet. Dev. 37, 76–81 (2016).
Segal, E. et al. A genomic code for nucleosome positioning. Nature 442, 772–778 (2006).
Morgunova, E. & Taipale, J. Structural perspective of cooperative transcription factor binding. Curr. Opin. Struct. Biol. 47, 1–8 (2017).
Wang, J. et al. Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 22, 1798–1812 (2012).
Zhou, T. et al. Quantitative modeling of transcription factor binding specificities using DNA shape. Proc. Natl Acad. Sci. USA 112, 4654–4659 (2015).
Guo, Y., Tian, K., Zeng, H., Guo, X. & Gifford, D. K. A novel k-mer set memory (KSM) motif representation improves regulatory variant prediction. Genome Res. 28, 891–900 (2018).
Westholm, J. O., Xu, F., Ronne, H. & Komorowski, J. Genome-scale study of the importance of binding site context for transcription factor binding and gene regulation. BMC Bioinform. 9, 484 (2008).
Alipanahi, B., Delong, A., Weirauch, M. T. & Frey, B. J. Predicting the sequence specificities of DNA- and RNA-binding proteins by deep learning. Nat. Biotechnol. 33, 831–838 (2015).
Kelley, D. R. et al. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res. 28, 739–750 (2018).
Kelley, D. R., Snoek, J. & Rinn, J. L. Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Res. 26, 990–999 (2016).
Lee, D. et al. A method to predict the impact of regulatory variants from DNA sequence. Nat. Genet. 47, 955–961 (2015).
Quang, D. & Xie, X. DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Res. 44, e107 (2016).
Quang, D. & Xie, X. FactorNet: a deep learning framework for predicting cell type specific transcription factor binding from nucleotide-resolution sequential data. Methods 166, 40–47 (2019).
Zeng, H., Hashimoto, T., Kang, D. D. & Gifford, D. K. GERV: a statistical method for generative evaluation of regulatory variants for transcription factor binding. Bioinformatics 32, 490–496 (2016).
Zhou, J. & Troyanskaya, O. G. Predicting effects of noncoding variants with deep learning–based sequence model. Nat. Methods 12, 931–934 (2015).
Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In Proc. IEEE International Conference on Computer Vision 618–626 (2017).
ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57–74 (2012).
Khan, A. et al. JASPAR 2018: update of the open-access database of transcription factor binding profiles and its web framework. Nucleic Acids Res. 46, D260–D266 (2018).
Kundaje, A. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317–330 (2015).
Ma, H., Bandos, A. I., Rockette, H. E. & Gur, D. On use of partial area under the ROC curve for evaluation of diagnostic performance. Stat. Med. 32, 3449–3458 (2013).
Amariuta, T. et al. IMPACT: genomic annotation of cell-state-specific regulatory elements inferred from the epigenome of bound transcription factors. Am. J. Hum. Genet. 104, 879–895 (2019).
Shrikumar, A., Greenside, P. & Kundaje, A. Learning important features through propagating activation differences. In Proc. 34th International Conference on Machine Learning 70, 3145–3153 (JMLR, 2017).
Lanchantin, J., Singh, R., Wang, B. & Qi, Y. Deep motif dashboard: visualizing and understanding genomic sequences using deep neural networks. In Pacific Symposium on Biocomputing 2017 254–265 (World Scientific, 2017).
Gotea, V. et al. Homotypic clusters of transcription factor binding sites are a key component of human promoters and enhancers. Genome Res. 20, 565–577 (2010).
Roder, K., Wolf, S. S., Larkin, K. J. & Schweizer, M. Interaction between the two ubiquitously expressed transcription factors NF-Y and Sp1. Gene 234, 61–69 (1999).
Dolfini, D., Zambelli, F., Pedrazzoli, M., Mantovani, R. & Pavesi, G. A high definition look at the NF-Y regulome reveals genome-wide associations with selected transcription factors. Nucleic Acids Res. 44, 4684–4702 (2016).
Van Dam, H. & Castellazzi, M. Distinct roles of Jun:Fos and Jun:ATF dimers in oncogenesis. Oncogene 20, 2453–2464 (2001).
Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Mol. Cell 38, 576–589 (2010).
Li, P. et al. BATF–JUN is critical for IRF4-mediated transcription in T cells. Nature 490, 543–546 (2012).
Benner, C. et al. Decoding a signature-based model of transcription cofactor recruitment dictated by cardinal cis-regulatory elements in proximal promoter regions. PLoS Genet. 9, e1003906 (2013).
Mevel, R., Draper, J. E., Lie, A. L. M., Kouskoff, V. & Lacaud, G. RUNX transcription factors: orchestrators of development. Development 146, dev148296 (2019).
Kroger, A. IRFs as competing pioneers in T-cell differentiation. Cell. Mol. Immunol. 14, 649–651 (2017).
Tewhey, R. et al. Direct identification of hundreds of expression-modulating variants using a multiplexed reporter assay. Cell 165, 1519–1529 (2016).
Liu, G., Zeng, H. & Gifford, D. K. Visualizing complex feature interactions and feature sharing in genomic deep neural networks. BMC Bioinform. 20, 401 (2019).
Zeng, H., Edwards, M. D., Guo, Y. & Gifford, D. K. Accurate eQTL prioritization with an ensemble-based framework. Hum. Mutat. 38, 1259–1265 (2017).
Grant, C. E., Bailey, T. L. & Noble, W. S. FIMO: scanning for occurrences of a given motif. Bioinformatics 27, 1017–1018 (2011).
Abadi, M. et al. TensorFlow: a system for large-scale machine learning. In Proc. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16) 265–283 (USENIX, 2016).
Avsec, Ž. et al. The Kipoi repository accelerates community exchange and reuse of predictive models for genomics. Nat. Biotechnol. 37, 592–600 (2019).
Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).
Kulakovskiy, I. V. et al. HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-Seq analysis. Nucleic Acids Res. 46, D252–D259 (2018).
Karczewski, K. J. et al. The mutational constraint spectrum quantified from variation in 141,456 humans. Nature 581, 434–443 (2020).
Lek, M. et al. Analysis of protein-coding genetic variation in 60,706 humans. Nature 536, 285–291 (2016).
Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–311 (2001).
Zheng, A. Pandaman-Ryan/AgentBind v.0.1 (accessed 19 November 2020); https://doi.org/10.5281/zenodo.4281456

Acknowledgements

This study was supported in part by NIH/NHGRI 1R21HG010070-01 (M.G.), the Microsoft Genomics for Research programme and an Amazon Web Services research award. We thank NVIDIA for donating a Tesla K40 GPU to support this project. We additionally thank C. Benner, C. Glass and A. Goren for helpful comments.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
An Zheng, Hao Su & Melissa Gymrek
Department of Bioengineering, University of California San Diego, La Jolla, CA, USA
Michael Lamkin
Department of Biological Sciences, University of California San Diego, La Jolla, CA, USA
Hanqing Zhao
Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
Cynthia Wu
Department of Medicine, University of California San Diego, La Jolla, CA, USA
Melissa Gymrek

Contributions

A.Z. designed and performed analyses and helped write the manuscript. M.L., H.Z. and C.W. helped perform analyses. H.S. helped design the study. M.G. conceived the study, supervised analyses and helped write the manuscript.

Corresponding author

Correspondence to Melissa Gymrek.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Machine Intelligence thanks David Gifford, Peter Koo and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Model performance related to GC content and open chromatin.

(a,b) GC content differences correlated with model performance. The x-axis shows the absolute value of the difference in mean GC content for positive vs. negative sequences with the motif for each TF. The y-axis shows auROCs. Each dot represents one TF. Results in a-b are for baseline models with motifs blocked (a) or unblocked (b). (c) Comparison of training data size and change in model performance. The x-axis (log₁₀ scale) shows the number of training samples. Orange points show the difference in auROC (y-axis) for baseline vs. GC-controlled models. Green points show GC-controlled vs. DNase-I-controlled models. Each dot represents one TF. (d) Model performance for each TF. The y-axis gives the auROC obtained for different models for each TF. Grey = baseline; orange = GC-controlled; green = DNase-I-controlled. TFs are ranked by the change in auROC between the DNase-I and GC-controlled models. (e) Comparison of cross-TF model performance. Heatmaps show the auROC using a GC-controlled model trained on one TF (rows) and tested on another TF (columns). Red squares denote the model with highest auROC for each TF. (f) Distribution of the difference in auROC between top models and TF-specific models. For TFs where the TF-specific model was best, we computed the difference between the TF-specific model and the next best model (red). For all other TFs, we compared performance of the best model to the TF-specific model (blue). (g-h) are the same as in e-f but based on DNase-I-controlled models.
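The two quantities plotted in panels (a,b) can be illustrated with a minimal sketch. It assumes sequences are plain strings and that model outputs are real-valued scores with 0/1 labels; the function names `gc_fraction`, `mean_gc_difference` and `auroc` are hypothetical and are not taken from the authors' code. The auROC here uses the standard rank-sum identity rather than any particular library routine.

```python
def gc_fraction(seq):
    """Fraction of G/C bases among the unambiguous A/C/G/T bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def mean_gc_difference(positives, negatives):
    """Absolute difference in mean GC content between positive and negative sets."""
    mean_pos = sum(gc_fraction(s) for s in positives) / len(positives)
    mean_neg = sum(gc_fraction(s) for s in negatives) / len(negatives)
    return abs(mean_pos - mean_neg)


def auroc(labels, scores):
    """auROC via the rank-sum (Mann-Whitney U) identity; tied scores share a rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current group of tied scores.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

A large `mean_gc_difference` for a TF suggests that a classifier could separate bound from unbound sites on GC content alone, which is presumably what the GC-controlled models are meant to rule out.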

Extended Data Fig. 2 Aggregate Grad-CAM score profiles for each TF.

For each TF, we computed the average absolute value of the Grad-CAM score per position using either models with the core motif unblocked (left) or blocked (right). Values shown are Z-normalized across rows. (a) shows aggregate scores for sequences labelled as positive (bound) and is reproduced from Fig. 2d. (b) shows aggregate scores for sequences labelled as negative (unbound).
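As an illustration of the aggregation described above, the sketch below averages absolute per-position scores over sequences and then Z-normalizes the resulting profile. It assumes the Grad-CAM scores have already been computed and are supplied as a list of equal-length rows (one per sequence), and it reads "Z-normalized across rows" as standardizing each TF's profile to mean 0 and standard deviation 1 across positions; both the reading and the function names are assumptions.

```python
import math


def aggregate_abs_scores(score_matrix):
    """Mean absolute score per position; rows are sequences, columns positions."""
    n_seqs = len(score_matrix)
    length = len(score_matrix[0])
    return [sum(abs(row[p]) for row in score_matrix) / n_seqs for p in range(length)]


def z_normalize(profile):
    """Standardize one profile to mean 0 and (population) standard deviation 1."""
    mean = sum(profile) / len(profile)
    variance = sum((v - mean) ** 2 for v in profile) / len(profile)
    sd = math.sqrt(variance)
    if sd == 0:
        # A flat profile carries no positional signal; return zeros.
        return [0.0] * len(profile)
    return [(v - mean) / sd for v in profile]


def tf_heatmap_row(score_matrix):
    """One heatmap row for a TF: aggregate over sequences, then Z-normalize."""
    return z_normalize(aggregate_abs_scores(score_matrix))
```

Normalizing each row separately makes profiles comparable across TFs whose raw score magnitudes differ, at the cost of hiding those absolute differences.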

Extended Data Fig. 3 Comparing key context sequence features identified in pre-trained vs. fine-tuned models.

The heatmap shows the enrichment of each 5-mer in regions with the highest Grad-CAM scores for each TF using baseline models before (a) and after (b) fine-tuning. Rows and columns are ordered the same as in Fig. 3. Colours denote odds ratios and the sizes of the boxes denote statistical significance as in Fig. 3. Panel (b) is reproduced from Fig. 3a for comparison.

Extended Data Fig. 4 Context sequence features specific to proximal vs. distal sites.

(a) Enrichment of 5-mers in high-scoring Grad-CAM regions for proximal (left) and distal (right) binding sites. Proximal and distal TF binding sites are defined as described in Methods. Rows and columns are ordered the same as in Fig. 3. (b,c) are the same as in (a) but show data for GC-controlled (b) and DNase-I-controlled (c) models. For (a–c), colours denote odds ratios and the sizes of the boxes denote statistical significance as in Fig. 3. (d) Comparison of top-scoring 5-mers in proximal vs. distal SP1 sites. Bars show the odds ratio of enrichment of each sequence in top 5-mers for all (grey), proximal (red) and distal (blue) SP1 sites. The top 20 5-mers ranked by the best odds ratio across all three SP1 models (all, proximal, and distal sites) are shown. Error bars show 95% confidence intervals on odds ratios. (e,f) are the same as in (d) but show data for GC-controlled (e) and DNase-I-controlled (f) models.
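An odds ratio with a 95% confidence interval, as shown in panels (d–f), can be computed from a 2×2 table of counts. The sketch below is one common way to do this, using the log-odds (Woolf) approximation with a 0.5 continuity correction when a cell is zero. The caption does not state which interval method the authors used, so the method, the table layout and the name `odds_ratio_ci` are all assumptions.

```python
import math


def odds_ratio_ci(a, b, c, d, z=1.96):
    """Odds ratio and approximate 95% CI from a 2x2 table.

    Assumed layout (hypothetical):
      a = top-scoring regions containing the 5-mer
      b = top-scoring regions lacking it
      c = background regions containing the 5-mer
      d = background regions lacking it
    """
    if min(a, b, c, d) == 0:
        # Haldane-Anscombe correction keeps the ratio and its log finite.
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper
```

For example, `odds_ratio_ci(20, 80, 10, 90)` gives an odds ratio of 2.25; because the interval is symmetric on the log scale, the product of its bounds equals the squared odds ratio.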

Extended Data Fig. 5 Singleton rate of context SNPs vs. core motif regions.

(a) Singleton rate of context SNPs. The plot shows the percentage of SNPs in each category that are singletons. Black = all context sites, orange = context sites with top 5% Grad-CAM scores, red = context sites with top 0.5% Grad-CAM scores. Error bars show ±1 s.e. (b) is the same as (a), but additionally shows singleton rates for SNPs in core motif regions (blue). The number of SNPs in each category for each TF is annotated above each plot.
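A singleton rate and its standard error can be sketched as follows. The example assumes each SNP is represented by its allele count in the reference panel, that a singleton is a SNP with allele count 1, and that the error bar is the binomial standard error of a proportion; the caption does not specify the estimator, and the name `singleton_rate` is hypothetical.

```python
import math


def singleton_rate(allele_counts):
    """Percentage of SNPs that are singletons, with a binomial standard error.

    allele_counts: one allele count per SNP (assumed input format).
    Returns (rate_percent, standard_error_percent).
    """
    n = len(allele_counts)
    if n == 0:
        raise ValueError("at least one SNP is required")
    p = sum(1 for ac in allele_counts if ac == 1) / n
    se = math.sqrt(p * (1 - p) / n)
    return 100 * p, 100 * se
```

A higher singleton rate in a category is usually read as a sign of stronger purifying selection, since deleterious variants tend to be kept at low frequency.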

Supplementary information

Supplementary Information

Supplementary discussion.

Reporting Summary

Supplementary Tables

Supplementary Tables 1–11.

About this article

Cite this article

Zheng, A., Lamkin, M., Zhao, H. et al. Deep neural networks identify sequence context features predictive of transcription factor binding. Nat Mach Intell 3, 172–180 (2021). https://doi.org/10.1038/s42256-020-00282-y

Received: 17 December 2019
Accepted: 10 December 2020
Published: 18 January 2021
Issue Date: February 2021
DOI: https://doi.org/10.1038/s42256-020-00282-y
