Deciphering eukaryotic gene-regulatory logic with 100 million random promoters

An Author Correction to this article was published on 13 August 2020

This article has been updated

Abstract

How transcription factors (TFs) interpret cis-regulatory DNA sequence to control gene expression remains unclear, largely because past studies using native and engineered sequences had insufficient scale. Here, we measure the expression output of >100 million synthetic yeast promoter sequences that are fully random. These sequences yield diverse, reproducible expression levels that can be explained by their chance inclusion of functional TF binding sites. We use machine learning to build interpretable models of transcriptional regulation that predict ~94% of the expression driven from independent test promoters and ~89% of the expression driven from native yeast promoter fragments. These models allow us to characterize each TF’s specificity, activity and interactions with chromatin. TF activity depends on binding-site strand, position, DNA helical face and chromatin context. Notably, expression level is influenced by weak regulatory interactions, which confound designed-sequence studies. Our analyses show that massive-throughput assays of fully random DNA can provide the big data necessary to develop complex, predictive models of gene regulation.

Fig. 1: GPRA.
Fig. 2: Expression models learned from a GPRA of 10⁸ random promoters are highly predictive.
Fig. 3: Billboard models learn biochemical activities of TFs.
Fig. 4: Position, orientation and helical face preferences among yeast TFs.
Fig. 5: Inadvertent perturbation of abundant secondary TFBSs confounds TFBS tiling experiments.
Fig. 6: Abundant weak regulatory interactions explain most of expression level.

Data availability

Data are available at NCBI’s GEO: GSE104903 and GSE104878.

Code availability

Open source code for our transcriptional models is available at https://github.com/Carldeboer/CisRegModels

References

  1. Beer, M. A. & Tavazoie, S. Predicting gene expression from sequence. Cell 117, 185–198 (2004).

  2. Yuan, Y., Guo, L., Shen, L. & Liu, J. S. Predicting gene expression from sequence: a reexamination. PLoS Comput. Biol. 3, e243 (2007).

  3. Kinney, J. B., Murugan, A., Callan, C. G. Jr. & Cox, E. C. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc. Natl Acad. Sci. USA 107, 9158–9163 (2010).

  4. van Arensbergen, J. et al. Genome-wide mapping of autonomous promoter activity in human cells. Nat. Biotechnol. 35, 145–153 (2017).

  5. Muerdter, F. et al. Resolving systematic errors in widely used enhancer activity assays in human cells. Nat. Methods 15, 141–149 (2018).

  6. Wang, X. et al. High-resolution genome-wide functional dissection of transcriptional regulatory regions and nucleotides in human. Nat. Commun. 9, 5380 (2018).

  7. Sharon, E. et al. Inferring gene regulatory logic from high-throughput measurements of thousands of systematically designed promoters. Nat. Biotechnol. 30, 521–530 (2012).

  8. Gertz, J., Siggia, E. D. & Cohen, B. A. Analysis of combinatorial cis-regulation in synthetic and genomic promoters. Nature 457, 215–218 (2009).

  9. Hughes, T. R. & de Boer, C. G. Mapping yeast transcriptional networks. Genetics 195, 9–36 (2013).

  10. Jolma, A. et al. DNA-binding specificities of human transcription factors. Cell 152, 327–339 (2013).

  11. Nutiu, R. et al. Direct measurement of DNA affinity landscapes on a high-throughput sequencing instrument. Nat. Biotechnol. 29, 659–664 (2011).

  12. Oliphant, A. R., Brandl, C. J. & Struhl, K. Defining the sequence specificity of DNA-binding proteins by selecting binding sites from random-sequence oligonucleotides: analysis of yeast GCN4 protein. Mol. Cell. Biol. 9, 2944–2949 (1989).

  13. Horwitz, M. S. & Loeb, L. A. Promoters selected from random DNA sequences. Proc. Natl Acad. Sci. USA 83, 7405–7409 (1986).

  14. Cuperus, J. T. et al. Deep learning of the regulatory grammar of yeast 5’ untranslated regions from 500,000 random sequences. Genome Res. 27, 2015–2024 (2017).

  15. Yona, A. H., Alm, E. J. & Gore, J. Random sequences rapidly evolve into de novo promoters. Nat. Commun. 9, 1530 (2018).

  16. Wunderlich, Z. & Mirny, L. A. Different gene regulation strategies revealed by analysis of binding motifs. Trends Genet. 25, 434–440 (2009).

  17. Arnosti, D. N. & Kulkarni, M. M. Transcriptional enhancers: intelligent enhanceosomes or flexible billboards? J. Cell. Biochem. 94, 890–898 (2005).

  18. de Boer, C. G. & Hughes, T. R. YeTFaSCo: a database of evaluated yeast transcription factor sequence specificities. Nucleic Acids Res. 40, D169–D179 (2012).

  19. Kosuri, S. et al. Composability of regulatory sequences controlling transcription and translation in Escherichia coli. Proc. Natl Acad. Sci. USA 110, 14024–14029 (2013).

  20. Shalem, O. et al. Systematic dissection of the sequence determinants of gene 3’ end mediated expression control. PLoS Genet. 11, e1005147 (2015).

  21. Granek, J. A. & Clarke, N. D. Explicit equilibrium modeling of transcription-factor binding and gene regulation. Genome Biol. 6, R87 (2005).

  22. Kulkarni, M. M. & Arnosti, D. N. Information display by transcriptional enhancers. Development 130, 6569–6575 (2003).

  23. Bussemaker, H. J., Li, H. & Siggia, E. D. Regulatory element detection using correlation with expression. Nat. Genet. 27, 167–171 (2001).

  24. Conlon, E. M., Liu, X. S., Lieb, J. D. & Liu, J. S. Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl Acad. Sci. USA 100, 3339–3344 (2003).

  25. Liu, X., Lee, C. K., Granek, J. A., Clarke, N. D. & Lieb, J. D. Whole-genome comparison of Leu3 binding in vitro and in vivo reveals the importance of nucleosome occupancy in target site selection. Genome Res. 16, 1517–1528 (2006).

  26. Zhang, Z. et al. A packing mechanism for nucleosome organization reconstituted across a eukaryotic genome. Science 332, 977–980 (2011).

  27. Hesselberth, J. R. et al. Global mapping of protein–DNA interactions in vivo by digital genomic footprinting. Nat. Methods 6, 283–289 (2009).

  28. Bernstein, B. E., Liu, C. L., Humphrey, E. L., Perlstein, E. O. & Schreiber, S. L. Global nucleosome occupancy in yeast. Genome Biol. 5, R62 (2004).

  29. Hartley, P. D. & Madhani, H. D. Mechanisms that specify promoter nucleosome location and identity. Cell 137, 445–458 (2009).

  30. Ganapathi, M. et al. Extensive role of the general regulatory factors, Abf1 and Rap1, in determining genome-wide chromatin structure in budding yeast. Nucleic Acids Res. 39, 2032–2044 (2011).

  31. Levo, M. et al. Systematic investigation of transcription factor activity in the context of chromatin using massively parallel binding and expression assays. Mol. Cell 65, 604–617 e606 (2017).

  32. Axelrod, J. D., Reagan, M. S. & Majors, J. GAL4 disrupts a repressing nucleosome during activation of GAL1 transcription in vivo. Genes Dev. 7, 857–869 (1993).

  33. Morse, R. H. Nucleosome disruption by transcription factor binding in yeast. Science 262, 1563–1566 (1993).

  34. Roberts, G. G. & Hudson, A. P. Transcriptome profiling of Saccharomyces cerevisiae during a transition from fermentative to glycerol-based respiratory growth reveals extensive metabolic and structural remodeling. Mol. Genet. Genomics 276, 170–186 (2006).

  35. Forsburg, S. L. & Guarente, L. Identification and characterization of HAP4: a third component of the CCAAT-bound HAP2/HAP3 heteromer. Genes Dev. 3, 1166–1178 (1989).

  36. Hedges, D., Proft, M. & Entian, K. D. CAT8, a new zinc cluster-encoding gene necessary for derepression of gluconeogenic enzymes in the yeast Saccharomyces cerevisiae. Mol. Cell. Biol. 15, 1915–1922 (1995).

  37. Haurie, V. et al. The transcriptional activator Cat8p provides a major contribution to the reprogramming of carbon metabolism during the diauxic shift in Saccharomyces cerevisiae. J. Biol. Chem. 276, 76–85 (2001).

  38. Sato, T. et al. The E-box DNA binding protein Sgc1p suppresses the gcr2 mutation, which is involved in transcriptional activation of glycolytic genes in Saccharomyces cerevisiae. FEBS Lett. 463, 307–311 (1999).

  39. Grauslund, M. & Ronnow, B. Carbon source-dependent transcriptional regulation of the mitochondrial glycerol-3-phosphate dehydrogenase gene, GUT2, from Saccharomyces cerevisiae. Can. J. Microbiol. 46, 1096–1100 (2000).

  40. Madhani, H. D. & Fink, G. R. Combinatorial control required for the specificity of yeast MAPK signaling. Science 275, 1314–1317 (1997).

  41. Gavrias, V., Andrianopoulos, A., Gimeno, C. J. & Timberlake, W. E. Saccharomyces cerevisiae TEC1 is required for pseudohyphal growth. Mol. Microbiol. 19, 1255–1263 (1996).

  42. Cullen, P. J. & Sprague, G. F. Jr. Glucose depletion causes haploid invasive growth in yeast. Proc. Natl Acad. Sci. USA 97, 13619–13624 (2000).

  43. Neely, K. E., Hassan, A. H., Brown, C. E., Howe, L. & Workman, J. L. Transcription activator interactions with multiple SWI/SNF subunits. Mol. Cell. Biol. 22, 1615–1625 (2002).

  44. Kim, T. S., Kim, H. Y., Yoon, J. H. & Kang, H. S. Recruitment of the Swi/Snf complex by Ste12-Tec1 promotes Flo8-Mss11-mediated activation of STA1 expression. Mol. Cell. Biol. 24, 9542–9556 (2004).

  45. Harbison, C. T. et al. Transcriptional regulatory code of a eukaryotic genome. Nature 431, 99–104 (2004).

  46. Hibbs, M. A. et al. Exploring the functional landscape of gene expression: directed search of large microarray compendia. Bioinformatics 23, 2692–2699 (2007).

  47. Bednar, J. et al. Determination of DNA persistence length by cryo-electron microscopy. Separation of the static and dynamic contributions to the apparent persistence length of DNA. J. Mol. Biol. 254, 579–594 (1995).

  48. Tanay, A. Extensive low-affinity transcriptional interactions in the yeast genome. Genome Res. 16, 962–972 (2006).

  49. Boyle, E. A., Li, Y. I. & Pritchard, J. K. An expanded view of complex traits: from polygenic to omnigenic. Cell 169, 1177–1186 (2017).

  50. Liu, X., Li, Y. I. & Pritchard, J. K. Trans effects on gene expression can drive omnigenic inheritance. Cell 177, 1022–1034 e1026 (2019).

  51. Gibson, D. G. et al. Enzymatic assembly of DNA molecules up to several hundred kilobases. Nat. Methods 6, 343–345 (2009).

  52. Erb, I. & van Nimwegen, E. Transcription factor binding site positioning in yeast: proximal promoter motifs characterize TATA-less promoters. PLoS One 6, e24279 (2011).

  53. Tong, A. H. & Boone, C. Synthetic genetic array analysis in Saccharomyces cerevisiae. Methods Mol. Biol. 313, 171–192 (2006).

  54. de Boer, C. High-efficiency S. cerevisiae lithium acetate transformation. protocols.io https://doi.org/10.17504/protocols.io.j4tcqwn (2017).

  55. Deng, C., Daley, T. & Smith, A. D. Applications of species accumulation curves in large-scale biological data analysis. Quant. Biol. 3, 135–144 (2015).

  56. Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).

  57. Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).

  58. Segal, E. & Widom, J. From DNA sequence to transcriptional behaviour: a quantitative approach. Nat. Rev. Genet. 10, 443–456 (2009).

  59. Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous systems. arXiv 1603.04467 (2016).

  60. Kent, N. A., Eibert, S. M. & Mellor, J. Cbf1p is required for chromatin remodeling at promoter-proximal CACGTG motifs in yeast. J. Biol. Chem. 279, 27116–27123 (2004).

  61. Cherry, J. M. et al. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res. 40, D700–D705 (2012).

  62. Chua, G. et al. Identifying transcription factor functions and targets by phenotypic activation. Proc. Natl Acad. Sci. USA 103, 12045–12050 (2006).

Acknowledgements

We thank R. Nelken, J. Weinstein, A. Dixit, B. Cleary, K. Shekhar and U. Eser for analysis advice; C. Muus, B. Cleary, A. Dixit, Y. Oren, T. Jones, L. Mariani, K. Shekhar, J. B. Kinney, D. M. McCandlish and J. Vierstra for feedback on the manuscript; T. Delorey, J. Pfiffner and C. Bashor for experimental advice; L. Gaffney and A. Hupalowska for help with figures; P. Rogers for cell sorting; and E. Segal for the dual reporter yeast vector. C.G.D. was supported by a Fellowship from the Canadian Institutes for Health Research and by the NIH (grant no. K99-HG009920-01). E.D.V. was supported by the MIT Presidential Fellowship. Work was supported by the Klarman Cell Observatory, the NHGRI Center of Excellence in Genome Science, the HHMI (A.R.) and the Israel Science Foundation ICORE on Chromatin and RNA in Gene Regulation (N.F.).

Author information

Contributions

C.G.D. and A.R. drafted the manuscript, with all authors contributing. C.G.D. analyzed the data. C.G.D., E.D.V., E.L.A. and R.S. performed the experiments. A.R. and N.F. supervised the research.

Corresponding authors

Correspondence to Carl G. de Boer or Aviv Regev.

Ethics declarations

Competing interests

A.R. is an SAB member of Thermo Fisher Scientific, Neogene Therapeutics, Asimov, and Syros Pharmaceuticals, an equity holder of Immunitas, and a founder of and equity holder in Celsius Therapeutics. All other authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Random DNA yields diverse expression levels.

(a) Diverse expression in all tested promoter scaffolds. For each promoter scaffold (right), shown are the distributions of expression levels (log2(YFP/RFP), x axis) measured by flow cytometry for the entire library (gray filled curves) and for a few selected clones, each from a different single promoter from the library (colored line curves). (b) Expression level distribution of random DNA. Shown is the expression distribution of >2,000,000 promoters comprised of 3,811 random scaffolds, each in combination with ~660 random 80 bp oligos included in the middle. The periodic peaks occur at the expression bins and result from the large number of promoters that were only observed in a single bin (and so have a discrete expression level). We considered any promoters that, upon resorting, would end up in any of the non-zero bins as ‘expressed’ (non-zero bins indicated in red; Methods). Note that the expression units are not equivalent to those used elsewhere; the dynamic range of this library is similar to that of the pTpA library, but contains fewer high-expressing promoters.

Supplementary Figure 2 Model performance on held-out data.

Shown are the model-predicted (x axis) and measured (y axis) expression levels of held-out training data (never seen by the model, but generated in the same experiment as the training data) for (a) Abf1TATA+glucose, (b) pTpA+glucose, (c) pTpA+galactose, and (d) pTpA+glycerol. Since the pTpA+glucose billboard model (b) can explain 68.3% of this held-out data and 92.6% of the high-quality data, we estimate that the training data has ~24% noise, although this varies by experiment. (n = 250,000 promoters for each; Pearson’s r² is shown at bottom right. Blue lines: Generalized Additive Model lines of best fit.)
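As a worked illustration of the noise estimate in this caption, the sketch below computes a squared Pearson correlation in plain Python and reproduces the subtraction 92.6% − 68.3% ≈ 24%. The function names are hypothetical and are not taken from the authors' code; reading the gap in variance explained as the noise share is our interpretation of the caption.

```python
def pearson_r2(predicted, measured):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(predicted)
    if n != len(measured) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 values")
    mean_p = sum(predicted) / n
    mean_m = sum(measured) / n
    cov = sum((p - mean_p) * (m - mean_m) for p, m in zip(predicted, measured))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_m = sum((m - mean_m) ** 2 for m in measured)
    if var_p == 0 or var_m == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov * cov / (var_p * var_m)


def estimated_noise_fraction(r2_high_quality, r2_held_out):
    """Gap in variance explained, interpreted as the share due to measurement noise."""
    return r2_high_quality - r2_held_out


# Values quoted in the caption for the pTpA+glucose billboard model.
noise = estimated_noise_fraction(0.926, 0.683)
print(f"estimated noise fraction: {noise:.3f}")
```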

Supplementary Figure 3 Saturation analysis of GPRA data.

Shown are the numbers of distinct promoters detected (y axis) when subsampling the pTpA+glucose sequencing data (black points) to different read coverage (x axis), after combining reads from all expression bins. Red curve: promoters projected to be detected with additional sequencing.
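The subsampling behind the black points can be sketched as follows: draw a fixed number of reads without replacement and count the distinct promoters among them. This is a minimal illustration on toy data with hypothetical names; it does not reproduce the projection shown by the red curve.

```python
import random


def rarefaction_curve(reads, depths, seed=0):
    """Count distinct promoter sequences seen at each subsampled read depth.

    reads  : list of promoter identifiers, one entry per sequencing read
    depths : read counts to subsample to (each must be <= len(reads))
    """
    rng = random.Random(seed)
    curve = []
    for depth in depths:
        subsample = rng.sample(reads, depth)  # sampling without replacement
        curve.append((depth, len(set(subsample))))
    return curve


# Toy data: 5 hypothetical promoters with uneven read counts (100 reads total).
toy_reads = ["p1"] * 50 + ["p2"] * 30 + ["p3"] * 15 + ["p4"] * 4 + ["p5"] * 1
for depth, distinct in rarefaction_curve(toy_reads, [10, 25, 50, 100]):
    print(depth, distinct)
```

At full depth every promoter is recovered; at lower depths the rare promoters are often missed, which is what makes the curve flatten only as coverage grows.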

Supplementary Figure 4 Predicting expression from TF binding.

(a) Relationship between predicted binding of individual TFs and expression level. Measured expression level (pTpA+Glu data; y axis) vs. predicted binding (x axis) for Abf1 (left), Rsc30 (middle), and Ume6 (right). Ume6 (whose motif is similar to that of Rsc30) is positively correlated with expression overall (r = 0.14), but has a strong negative trend at high predicted occupancies. (n = 250,000 promoters for each; Pearson’s r shown at bottom right. Top: Motifs. Blue lines: GAM lines of best fit. Gray shaded areas: 95% confidence intervals.) (b) Model correctly predicts expression of designed sequences. Shown are the model-predicted (x axis) and measured (y axis) expression levels for 1,000 in silico generated random sequences designed based on the pTpA+glucose billboard model’s predictions to span a range of expression levels, and then measured experimentally. (n = 1,000 promoters; Pearson’s r² shown at bottom right.)

Supplementary Figure 5 Expression of 80 bp native promoter fragments.

Distributions of measured expression levels (x axis) for 80 bp fragments of native yeast promoters, corresponding to different regions of the promoter (colors). Although each region included sequences from all promoters, the regions upstream of -200 are covered by nucleosomes in many promoters, which may explain their tendency to express at lower levels.

Supplementary Figure 6 Predicted nucleosome occupancy.

(a,b) Model predicts nucleosome occupancy well. (a) Model-predicted (x axis) vs. measured (MNase-Seq, y axis) nucleosome occupancy. Four MNase biological replicates are shown (Methods). (Blue lines: Generalized Additive Model lines of best fit.) (b) As in (a), with replicates averaged, and only promoters present in both replicates shown. (c) Model-predicted accessibility across the YGR012W locus. A genome browser track showing yeast genes (top), measured nucleosome occupancy (Zhang, second from the top) and DNaseI hypersensitivity (Hesselberth, third from the top), and model-predicted accessibilities (Ω). Promoters are predicted to be accessible and models provide similar predicted accessibilities. The four models were learned independently from different experimental data.

Supplementary Figure 7 The billboard models identify biochemical activities of TFs.

Shown are the number of TFs classified as activators, repressors, neither, or both in the yeast Gene Ontology (GO, Methods) (x axes) and whether they are predicted as (a) repressor (blue) or activator (red); or (b) closing (blue) or opening (red) chromatin, by each model (labels on top of subpanels). Hypergeometric P-values for overlaps between predicted activator/repressor (or chromatin opener/closer), compared with activator/repressor GO annotations are as shown (‘neither’ and ‘both’ categories are ignored). (n = 1 set of model predictions for each subpanel; 4 models total).
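For readers unfamiliar with the overlap test, here is a minimal sketch of a hypergeometric P-value using only the Python standard library. The caption says only that hypergeometric P-values were used; the one-sided upper tail, the function name, and the toy counts below are assumptions made for illustration.

```python
from math import comb


def hypergeom_upper_tail(overlap, population, annotated, predicted):
    """P(X >= overlap) when `predicted` TFs are drawn without replacement from
    `population` TFs, of which `annotated` carry the GO label of interest."""
    max_overlap = min(annotated, predicted)
    total = comb(population, predicted)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, predicted - k)
        for k in range(overlap, max_overlap + 1)
    )
    return tail / total


# Toy counts (hypothetical): 20 TFs with a GO call, 10 of them GO activators,
# 8 predicted activators by a model, 7 of which are also GO activators.
p_value = hypergeom_upper_tail(overlap=7, population=20, annotated=10, predicted=8)
print(f"P = {p_value:.4g}")
```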

Supplementary Figure 8 Learning motifs de novo and improving known motifs.

(a) Model initialized with random PWMs learns that General Regulatory Factor motifs open chromatin. Shown are 12 motifs with the highest potentiation parameter values for a model trained on pTpA+glucose data with 1,000 randomly initialized motifs, allowing for multiple variants of the same motif to be learned. Motifs correspond to the three canonical general regulatory factors, and Cbf1, previously suggested to have a similar function. (b-d) Model-refined motifs perform better in predicting TF binding and knockout effects in independent experiments. (b) Similar refinement in independent models. Comparison of the original TFBS motif (top) and model-refined motifs from each of the four models for two example motifs. (c,d) Shown are the absolute values of the Pearson correlation coefficient (|r|) when using either the original motifs (x axis) or the pTpA+Gal model-refined motifs (y axis) to predict whether (c) the gene’s expression will change in the corresponding TF mutant (compared to wild type) based on predicted binding to the promoter, or (d) a ChIP probe will be bound by the TF in a ChIP assay based on predicted binding to the ChIP probe. (Here, data were not subsampled.) Overall, model-refined motifs perform better (points above diagonal), but some perform worse. Reduced performance can be due to condition-specific regulators that are minimally active in our tested growth conditions (for example, Gcn4), redundancy between motifs (for example, Hsf1 has mono-, di-, and trimeric motifs), and overfitting of the original motif to the test data (for example, ChIP-derived motifs tested on ChIP data). (n = 824 and 303 experiments, respectively for (c) and (d)).

Supplementary Figure 9 Positional model summary.

The positional model is identical to that in Fig. 2c, except that the predicted accessibility (Ω) is used to rescale the tensor of DNA binding sites (4) (rather than the DNA binding vector, as before), preserving binding site orientation and location in the resulting chromatin binding tensor. Positional chromatin binding is multiplied pairwise with the learned positional TF activities (5), and then summed (6) to obtain the activity contribution of each TF, which is used to calculate the overall expression level of the sequence (7). Actual parameter values are shown (pTpA+glucose model), with binding sites smoothed over 8 bp for visualization purposes.
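Steps (4) to (7) of this caption can be sketched in plain Python as below. This is an illustrative reading, not the authors' TensorFlow implementation: the nested-list layout, the scalar accessibility, the additive combination in step (7), and the `bias` term are all assumptions, and the numbers are toy values.

```python
def positional_expression(binding, omega, activity, bias=0.0):
    """Sketch of steps (4)-(7) of the positional model.

    binding  : binding[tf][strand][pos], predicted DNA binding per site
    omega    : predicted accessibility of the sequence (scalar; an assumption)
    activity : activity[tf][strand][pos], learned positional TF activities
    bias     : hypothetical constant offset, not described in the caption
    """
    contributions = []
    for tf_binding, tf_activity in zip(binding, activity):
        total = 0.0
        for strand_binding, strand_activity in zip(tf_binding, tf_activity):
            for b, a in zip(strand_binding, strand_activity):
                chromatin_binding = omega * b   # (4) rescale binding by accessibility
                total += chromatin_binding * a  # (5) pairwise product with activity
        contributions.append(total)             # (6) summed contribution of this TF
    expression = bias + sum(contributions)      # (7) combine contributions (assumed additive)
    return expression, contributions


# Toy example: 2 TFs x 2 strands x 3 positions.
toy_binding = [
    [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0]],  # TF 1: one site on the plus strand
    [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0]],  # TF 2: one site on the minus strand
]
toy_activity = [
    [[0.5, 0.5, 0.5], [0.1, 0.1, 0.1]],          # TF 1: stronger on the plus strand
    [[-1.0, -1.0, -1.0], [-0.25, -0.25, -0.25]],  # TF 2: repressive on both strands
]
toy_expression, toy_contributions = positional_expression(toy_binding, 0.5, toy_activity)
print(toy_expression, toy_contributions)
```

Because strand and position are kept as separate indices until step (6), the same motif can contribute differently depending on where and in which orientation it occurs.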

Supplementary Figure 10 Downsampling analysis determines the extent of data needed to train models of varying complexity.

Model performance as reflected by prediction accuracy on the held-out validation data (Pearson’s r², y axis, top) and by the relative performance on validation-vs-training data (y axis, bottom, ratio of validation-vs-training Pearson r²; when <1.00, model is overfit) for models trained using sub-samples of the available training data (x axis) and learning different parameters (colors; parameters are cumulative from top to bottom of legend). Dotted lines: number of training examples needed to eliminate overfitting ((validation r² / training r²) > 0.999). Solid points: number of examples at which the maximal validation performance is achieved.
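The two summary markers in this figure (dotted lines and solid points) can be computed from a table of downsampling results as in the sketch below. The function names and the toy numbers are hypothetical; only the 0.999 threshold and the ratio definition come from the caption.

```python
def overfit_ratio(validation_r2, training_r2):
    """Validation-vs-training r^2; values below 1 indicate overfitting."""
    if training_r2 <= 0:
        raise ValueError("training r^2 must be positive")
    return validation_r2 / training_r2


def summarize_downsampling(results, threshold=0.999):
    """results: (n_examples, training_r2, validation_r2) tuples, sorted by n_examples.

    Returns (smallest n whose ratio exceeds `threshold`, or None if never reached;
             n at which validation r^2 is highest).
    """
    no_overfit_n = next(
        (n for n, train, val in results if overfit_ratio(val, train) > threshold),
        None,
    )
    best_n = max(results, key=lambda row: row[2])[0]
    return no_overfit_n, best_n


# Toy results (hypothetical values, not from the paper).
toy_results = [
    (10_000, 0.80, 0.60),
    (100_000, 0.85, 0.80),
    (1_000_000, 0.90, 0.8995),
    (10_000_000, 0.91, 0.9099),
]
print(summarize_downsampling(toy_results))
```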

Supplementary Figure 11 Positional models improve predictions.

(a,b) Position-specific pTpA+Glu model-predicted expression levels (x axes) vs. measured expression levels (y axes) for (a) high-quality test data in the pTpA promoter scaffold, grown in glucose, and (b) native yeast promoter sequences, divided into 80 bp fragments and tested in the pTpA promoter scaffold, grown in glucose. (c) Most expression variation is attributed to accessibility. Position-specific pTpA+Glu model-predicted accessibility (Ω; x axis) vs. measured expression levels (y axis) for high-quality test data in the pTpA promoter scaffold, grown in glucose. Performance is better when incorporating positional activities (a,b), and, in particular, accessibility alone cannot distinguish the highest expression levels (measured expression >12). The dominant effect of accessibility on expression likely reflects accessibility being a prerequisite for expression. (n = 9,982, 70,924, and 9,982 promoters for (a-c), respectively; Pearson’s r² shown in bottom right; Red lines: Generalized Additive Model lines of best fit).

Supplementary Figure 12 Positional preferences of TFs are prevalent and context-dependent.

Position and strand preferences. Learned activity parameter values (y axis) for motifs in each position (x axis) and strand orientation (upper and lower panels) for each model (colors), for (a) Abf1, (b) Mcm1, (c) Ume6, (d) Mot3, (e) Azf1, and (f) Thi2.

Supplementary Figure 13 Features of positional preferences.

(a) Capturing helically biased positional preferences. For each location within the promoter (x axis), this shows the learned activity bias parameters (red curve; as in Fig. 4b) for the poly-A motif, long-range trend captured by a loess fit (green), and short-range residual activity bias after subtracting loess fit (blue) with reference 10.5 bp sine waves (black) for the minus strand (top) and plus strand (bottom) for the four different models (columns). (b) Modeling positional preferences increases predictive accuracy within the same scaffold but can drastically decrease it between scaffolds. For each training data set (four sub-panels) for both model types (colors), the Pearson r² (y axis) capturing performance on each test dataset (x axis). (n = 1 set of model predictions per bar; 4 independent training sets each with 2 model types, evaluated on each of 3 test datasets total).
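One way to quantify how strongly a detrended activity profile follows the 10.5 bp helical period is to project it onto sine and cosine waves of that period. Note that this is a swapped-in technique for illustration: the caption describes a loess detrend followed by visual comparison with reference sine waves, not this projection. The estimate is exact only when the window spans a whole number of periods and is approximate otherwise (for example, over an 80 bp region). All names are hypothetical.

```python
import math

HELICAL_PERIOD_BP = 10.5  # reference period quoted in the caption


def periodic_amplitude(residuals, period=HELICAL_PERIOD_BP):
    """Amplitude of the sinusoidal component of `residuals` at `period` bp,
    estimated by projecting onto sine and cosine waves of that period."""
    n = len(residuals)
    if n == 0:
        raise ValueError("residuals must not be empty")
    omega = 2.0 * math.pi / period
    a = 2.0 / n * sum(r * math.cos(omega * i) for i, r in enumerate(residuals))
    b = 2.0 / n * sum(r * math.sin(omega * i) for i, r in enumerate(residuals))
    return math.hypot(a, b)


# Synthetic residual: a pure 10.5 bp wave of amplitude 0.3 over 105 positions
# (exactly 10 helical turns), standing in for a detrended activity bias.
toy_residual = [0.3 * math.sin(2.0 * math.pi * i / 10.5) for i in range(105)]
print(round(periodic_amplitude(toy_residual), 3))
```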

Supplementary Figure 14 Model correctly predicts expression in tiling experiments.

Measured (black) and predicted (red) expression levels for sequences containing each motif (rows), tiled at every possible position (x axes) and orientation (top and bottom halves) in each of three random background sequences (columns).
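The tiling design described here can be sketched as a generator that places a motif at every offset of a background sequence in both orientations. Whether the motif overwrites background bases (as below, keeping the length constant) or is inserted is not stated in the caption, so the overwrite is an assumption; the motif and background used are toy values.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def tile_motif(background, motif):
    """Yield (position, strand, sequence) with the motif overwriting the
    background at every possible offset, in both orientations."""
    for strand, site in (("+", motif), ("-", reverse_complement(motif))):
        for pos in range(len(background) - len(site) + 1):
            tiled = background[:pos] + site + background[pos + len(site):]
            yield pos, strand, tiled


# Toy example: a 6 bp motif tiled across a 10 bp background.
toy_tiles = list(tile_motif("A" * 10, "TGACTC"))
for pos, strand, seq in toy_tiles:
    print(pos, strand, seq)
```

Overwriting a stretch of background at each offset is also how a tiled motif can inadvertently destroy secondary binding sites already present in the background, the confound named in the title of Fig. 5.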

Supplementary Figure 15 Weak interactions have a predominant effect on expression level.

(a,b) Distinct regulatory networks for two sequences with nearly identical expression levels. Model-predicted regulatory networks for two 80 bp native promoter fragments using the positional pTpA+glucose model. Purple nodes: Individual TFs. Edge thickness and spacing: interaction strengths for each TF (green: positive/activating; red: negative/repressing). Both 80 bp promoter sequences have intermediate measured expression levels of ~10.5 in pTpA+glucose, but different regulatory networks: (a) the -160:-80 region of YDR456W is dominated by a single strong interaction, while (b) the -320:-240 region of YIR035C is regulated by many weak interactions. (c) Abundant weak regulatory interactions also explain most of expression in the most relevant promoter region. Cumulative distributions (y axis) of the number of regulatory interactions (black) and fraction of expression explained (that is, cumulative total/sum of all effect sizes; red) for each interaction strength (x axis). Here, we consider only native promoter fragments from the -160:-80 region (relative to the TSS), which is the region our models are trained on and corresponds to the most active promoter region (Supplementary Fig. 5). The magnitude (and not the sign) of the interaction strength is considered.
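The two cumulative curves in panel (c) follow directly from the definition in the caption (cumulative total divided by the sum of all effect sizes, using magnitudes only). A minimal sketch with hypothetical names and toy interaction strengths:

```python
def cumulative_by_strength(strengths):
    """Rows of (|strength|, cumulative fraction of interactions,
    cumulative fraction of total effect), ordered from weakest to strongest."""
    magnitudes = sorted(abs(s) for s in strengths)  # sign is ignored
    total = sum(magnitudes)
    if not magnitudes or total == 0:
        raise ValueError("need at least one non-zero interaction strength")
    n = len(magnitudes)
    rows, running = [], 0.0
    for i, m in enumerate(magnitudes, start=1):
        running += m
        rows.append((m, i / n, running / total))
    return rows


# Toy strengths (hypothetical): one strong interaction and six weak ones.
toy_strengths = [2.0, -0.1, 0.1, -0.2, 0.1, 0.3, -0.2]
for magnitude, frac_interactions, frac_effect in cumulative_by_strength(toy_strengths):
    print(f"{magnitude:.1f}  {frac_interactions:.2f}  {frac_effect:.2f}")
```

In this toy set the six weak interactions make up 6/7 of all interactions but only one third of the summed effect; the figure reports the opposite balance for native promoter fragments, where weak interactions account for most of the expression explained.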

Supplementary information

Supplementary Information

Supplementary Figs. 1–15.

Reporting Summary

Supplementary Table 1

Motifs used in this study. Motif IDs are from the YeTFaSCo database. Motifs excluded from the motif frequency analysis (Fig. 1a) are indicated.

Supplementary Table 2

Promoter scaffolds included in the scaffold library. Sequences include 80 Ns in place of the random 80-mers and begin 13 bp upstream of the theoretical TSS.

Supplementary Table 3

TSS annotations of native promoters. Coordinates of TSSs for the v64 S. cerevisiae genome sequence.

Supplementary Table 4

All model parameters learned. For each of the eight models discussed in this paper (billboard and positional, for each of the four datasets), all model parameters, including TF activities, potentiations, motifs and (where applicable) position-specific activities.

Cite this article

de Boer, C.G., Vaishnav, E.D., Sadeh, R. et al. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. Nat Biotechnol 38, 56–65 (2020). https://doi.org/10.1038/s41587-019-0315-8

  • DOI: https://doi.org/10.1038/s41587-019-0315-8
