Deciphering eukaryotic gene-regulatory logic with 100 million random promoters

de Boer, Carl G.; Vaishnav, Eeshit Dhaval; Sadeh, Ronen; Abeyta, Esteban Luis; Friedman, Nir; Regev, Aviv

doi:10.1038/s41587-019-0315-8

Article
Published: 02 December 2019

Nature Biotechnology volume 38, pages 56–65 (2020)

An Author Correction to this article was published on 13 August 2020

This article has been updated

Abstract

How transcription factors (TFs) interpret cis-regulatory DNA sequence to control gene expression remains unclear, largely because past studies using native and engineered sequences had insufficient scale. Here, we measure the expression output of >100 million synthetic yeast promoter sequences that are fully random. These sequences yield diverse, reproducible expression levels that can be explained by their chance inclusion of functional TF binding sites. We use machine learning to build interpretable models of transcriptional regulation that predict ~94% of the expression driven from independent test promoters and ~89% of the expression driven from native yeast promoter fragments. These models allow us to characterize each TF’s specificity, activity and interactions with chromatin. TF activity depends on binding-site strand, position, DNA helical face and chromatin context. Notably, expression level is influenced by weak regulatory interactions, which confound designed-sequence studies. Our analyses show that massive-throughput assays of fully random DNA can provide the big data necessary to develop complex, predictive models of gene regulation.

**Fig. 2: Expression models learned from a GPRA of 10⁸ random promoters are highly predictive.**

**Fig. 3: Billboard models learn biochemical activities of TFs.**

**Fig. 4: Position, orientation and helical face preferences among yeast TFs.**

**Fig. 5: Inadvertent perturbation of abundant secondary TFBSs confounds TFBS tiling experiments.**

**Fig. 6: Abundant weak regulatory interactions explain most of expression level.**

Data availability

Data are available at NCBI’s GEO: GSE104903 and GSE104878.

Code availability

Open source code for our transcriptional models is available at https://github.com/Carldeboer/CisRegModels

Change history

13 August 2020
A Correction to this paper has been published: https://doi.org/10.1038/s41587-020-0665-2

References

Beer, M. A. & Tavazoie, S. Predicting gene expression from sequence. Cell 117, 185–198 (2004).
Yuan, Y., Guo, L., Shen, L. & Liu, J. S. Predicting gene expression from sequence: a reexamination. PLoS Comput. Biol. 3, e243 (2007).
Kinney, J. B., Murugan, A., Callan, C. G. Jr. & Cox, E. C. Using deep sequencing to characterize the biophysical mechanism of a transcriptional regulatory sequence. Proc. Natl Acad. Sci. USA 107, 9158–9163 (2010).
van Arensbergen, J. et al. Genome-wide mapping of autonomous promoter activity in human cells. Nat. Biotechnol. 35, 145–153 (2017).
Muerdter, F. et al. Resolving systematic errors in widely used enhancer activity assays in human cells. Nat. Methods 15, 141–149 (2018).
Wang, X. et al. High-resolution genome-wide functional dissection of transcriptional regulatory regions and nucleotides in human. Nat. Commun. 9, 5380 (2018).
Sharon, E. et al. Inferring gene regulatory logic from high-throughput measurements of thousands of systematically designed promoters. Nat. Biotechnol. 30, 521–530 (2012).
Gertz, J., Siggia, E. D. & Cohen, B. A. Analysis of combinatorial cis-regulation in synthetic and genomic promoters. Nature 457, 215–218 (2009).
Hughes, T. R. & de Boer, C. G. Mapping yeast transcriptional networks. Genetics 195, 9–36 (2013).
Jolma, A. et al. DNA-binding specificities of human transcription factors. Cell 152, 327–339 (2013).
Nutiu, R. et al. Direct measurement of DNA affinity landscapes on a high-throughput sequencing instrument. Nat. Biotechnol. 29, 659–664 (2011).
Oliphant, A. R., Brandl, C. J. & Struhl, K. Defining the sequence specificity of DNA-binding proteins by selecting binding sites from random-sequence oligonucleotides: analysis of yeast GCN4 protein. Mol. Cell. Biol. 9, 2944–2949 (1989).
Horwitz, M. S. & Loeb, L. A. Promoters selected from random DNA sequences. Proc. Natl Acad. Sci. USA 83, 7405–7409 (1986).
Cuperus, J. T. et al. Deep learning of the regulatory grammar of yeast 5’ untranslated regions from 500,000 random sequences. Genome Res. 27, 2015–2024 (2017).
Yona, A. H., Alm, E. J. & Gore, J. Random sequences rapidly evolve into de novo promoters. Nat. Commun. 9, 1530 (2018).
Wunderlich, Z. & Mirny, L. A. Different gene regulation strategies revealed by analysis of binding motifs. Trends Genet. 25, 434–440 (2009).
Arnosti, D. N. & Kulkarni, M. M. Transcriptional enhancers: intelligent enhanceosomes or flexible billboards? J. Cell. Biochem. 94, 890–898 (2005).
de Boer, C. G. & Hughes, T. R. YeTFaSCo: a database of evaluated yeast transcription factor sequence specificities. Nucleic Acids Res. 40, D169–D179 (2012).
Kosuri, S. et al. Composability of regulatory sequences controlling transcription and translation in Escherichia coli. Proc. Natl Acad. Sci. USA 110, 14024–14029 (2013).
Shalem, O. et al. Systematic dissection of the sequence determinants of gene 3’ end mediated expression control. PLoS Genet. 11, e1005147 (2015).
Granek, J. A. & Clarke, N. D. Explicit equilibrium modeling of transcription-factor binding and gene regulation. Genome Biol. 6, R87 (2005).
Kulkarni, M. M. & Arnosti, D. N. Information display by transcriptional enhancers. Development 130, 6569–6575 (2003).
Bussemaker, H. J., Li, H. & Siggia, E. D. Regulatory element detection using correlation with expression. Nat. Genet. 27, 167–171 (2001).
Conlon, E. M., Liu, X. S., Lieb, J. D. & Liu, J. S. Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl Acad. Sci. USA 100, 3339–3344 (2003).
Liu, X., Lee, C. K., Granek, J. A., Clarke, N. D. & Lieb, J. D. Whole-genome comparison of Leu3 binding in vitro and in vivo reveals the importance of nucleosome occupancy in target site selection. Genome Res. 16, 1517–1528 (2006).
Zhang, Z. et al. A packing mechanism for nucleosome organization reconstituted across a eukaryotic genome. Science 332, 977–980 (2011).
Hesselberth, J. R. et al. Global mapping of protein–DNA interactions in vivo by digital genomic footprinting. Nat. Methods 6, 283–289 (2009).
Bernstein, B. E., Liu, C. L., Humphrey, E. L., Perlstein, E. O. & Schreiber, S. L. Global nucleosome occupancy in yeast. Genome Biol. 5, R62 (2004).
Hartley, P. D. & Madhani, H. D. Mechanisms that specify promoter nucleosome location and identity. Cell 137, 445–458 (2009).
Ganapathi, M. et al. Extensive role of the general regulatory factors, Abf1 and Rap1, in determining genome-wide chromatin structure in budding yeast. Nucleic Acids Res. 39, 2032–2044 (2011).
Levo, M. et al. Systematic investigation of transcription factor activity in the context of chromatin using massively parallel binding and expression assays. Mol. Cell 65, 604–617 e606 (2017).
Axelrod, J. D., Reagan, M. S. & Majors, J. GAL4 disrupts a repressing nucleosome during activation of GAL1 transcription in vivo. Genes Dev. 7, 857–869 (1993).
Morse, R. H. Nucleosome disruption by transcription factor binding in yeast. Science 262, 1563–1566 (1993).
Roberts, G. G. & Hudson, A. P. Transcriptome profiling of Saccharomyces cerevisiae during a transition from fermentative to glycerol-based respiratory growth reveals extensive metabolic and structural remodeling. Mol. Genet. Genomics 276, 170–186 (2006).
Forsburg, S. L. & Guarente, L. Identification and characterization of HAP4: a third component of the CCAAT-bound HAP2/HAP3 heteromer. Genes Dev. 3, 1166–1178 (1989).
Hedges, D., Proft, M. & Entian, K. D. CAT8, a new zinc cluster-encoding gene necessary for derepression of gluconeogenic enzymes in the yeast Saccharomyces cerevisiae. Mol. Cell. Biol. 15, 1915–1922 (1995).
Haurie, V. et al. The transcriptional activator Cat8p provides a major contribution to the reprogramming of carbon metabolism during the diauxic shift in Saccharomyces cerevisiae. J. Biol. Chem. 276, 76–85 (2001).
Sato, T. et al. The E-box DNA binding protein Sgc1p suppresses the gcr2 mutation, which is involved in transcriptional activation of glycolytic genes in Saccharomyces cerevisiae. FEBS Lett. 463, 307–311 (1999).
Grauslund, M. & Ronnow, B. Carbon source-dependent transcriptional regulation of the mitochondrial glycerol-3-phosphate dehydrogenase gene, GUT2, from Saccharomyces cerevisiae. Can. J. Microbiol. 46, 1096–1100 (2000).
Madhani, H. D. & Fink, G. R. Combinatorial control required for the specificity of yeast MAPK signaling. Science 275, 1314–1317 (1997).
Gavrias, V., Andrianopoulos, A., Gimeno, C. J. & Timberlake, W. E. Saccharomyces cerevisiae TEC1 is required for pseudohyphal growth. Mol. Microbiol. 19, 1255–1263 (1996).
Cullen, P. J. & Sprague, G. F. Jr. Glucose depletion causes haploid invasive growth in yeast. Proc. Natl Acad. Sci. USA 97, 13619–13624 (2000).
Neely, K. E., Hassan, A. H., Brown, C. E., Howe, L. & Workman, J. L. Transcription activator interactions with multiple SWI/SNF subunits. Mol. Cell. Biol. 22, 1615–1625 (2002).
Kim, T. S., Kim, H. Y., Yoon, J. H. & Kang, H. S. Recruitment of the Swi/Snf complex by Ste12-Tec1 promotes Flo8-Mss11-mediated activation of STA1 expression. Mol. Cell. Biol. 24, 9542–9556 (2004).
Harbison, C. T. et al. Transcriptional regulatory code of a eukaryotic genome. Nature 431, 99–104 (2004).
Hibbs, M. A. et al. Exploring the functional landscape of gene expression: directed search of large microarray compendia. Bioinformatics 23, 2692–2699 (2007).
Bednar, J. et al. Determination of DNA persistence length by cryo-electron microscopy. Separation of the static and dynamic contributions to the apparent persistence length of DNA. J. Mol. Biol. 254, 579–594 (1995).
Tanay, A. Extensive low-affinity transcriptional interactions in the yeast genome. Genome Res. 16, 962–972 (2006).
Boyle, E. A., Li, Y. I. & Pritchard, J. K. An expanded view of complex traits: from polygenic to omnigenic. Cell 169, 1177–1186 (2017).
Liu, X., Li, Y. I. & Pritchard, J. K. Trans effects on gene expression can drive omnigenic inheritance. Cell 177, 1022–1034 e1026 (2019).
Gibson, D. G. et al. Enzymatic assembly of DNA molecules up to several hundred kilobases. Nat. Methods 6, 343–345 (2009).
Erb, I. & van Nimwegen, E. Transcription factor binding site positioning in yeast: proximal promoter motifs characterize TATA-less promoters. PLoS One 6, e24279 (2011).
Tong, A. H. & Boone, C. Synthetic genetic array analysis in Saccharomyces cerevisiae. Methods Mol. Biol. 313, 171–192 (2006).
de Boer, C. High-efficiency S. cerevisiae lithium acetate transformation. protocols.io https://doi.org/10.17504/protocols.io.j4tcqwn (2017).
Deng, C., Daley, T. & Smith, A. D. Applications of species accumulation curves in large-scale biological data analysis. Quant. Biol. 3, 135–144 (2015).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Li, W. & Godzik, A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics 22, 1658–1659 (2006).
Segal, E. & Widom, J. From DNA sequence to transcriptional behaviour: a quantitative approach. Nat. Rev. Genet. 10, 443–456 (2009).
Abadi, M. et al. TensorFlow: large-scale machine learning on heterogeneous systems. arXiv 1603.04467 (2016).
Kent, N. A., Eibert, S. M. & Mellor, J. Cbf1p is required for chromatin remodeling at promoter-proximal CACGTG motifs in yeast. J. Biol. Chem. 279, 27116–27123 (2004).
Cherry, J. M. et al. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res. 40, D700–D705 (2012).
Chua, G. et al. Identifying transcription factor functions and targets by phenotypic activation. Proc. Natl Acad. Sci. USA 103, 12045–12050 (2006).

Acknowledgements

We thank R. Nelken, J. Weinstein, A. Dixit, B. Cleary, K. Shekhar and U. Eser for analysis advice; C. Muus, B. Cleary, A. Dixit, Y. Oren, T. Jones, L. Mariani, K. Shekhar, J. B. Kinney, D. M. McCandlish and J. Vierstra for feedback on the manuscript; T. Delorey, J. Pfiffner and C. Bashor for experimental advice; L. Gaffney and A. Hupalowska for help with figures; P. Rogers for cell sorting; and E. Segal for the dual reporter yeast vector. C.G.D. was supported by a Fellowship from the Canadian Institutes for Health Research and by the NIH (grant no. K99-HG009920-01). E.D.V. was supported by the MIT Presidential Fellowship. Work was supported by the Klarman Cell Observatory, the NHGRI Center of Excellence in Genome Science, the HHMI (A.R.) and the Israel Science Foundation ICORE on Chromatin and RNA in Gene Regulation (N.F.).

Author information

Authors and Affiliations

Klarman Cell Observatory, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Carl G. de Boer, Eeshit Dhaval Vaishnav, Nir Friedman & Aviv Regev
Howard Hughes Medical Institute and Koch Institute of Integrative Cancer Research, Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, USA
Eeshit Dhaval Vaishnav & Aviv Regev
School of Computer Science and Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel
Ronen Sadeh & Nir Friedman
Initiative for Maximizing Student Development Program, University of New Mexico, Albuquerque, NM, USA
Esteban Luis Abeyta

Authors

Carl G. de Boer
Eeshit Dhaval Vaishnav
Ronen Sadeh
Esteban Luis Abeyta
Nir Friedman
Aviv Regev

Contributions

C.G.D. and A.R. drafted the manuscript, with all authors contributing. C.G.D. analyzed the data. C.G.D., E.D.V., E.L.A. and R.S. performed the experiments. A.R. and N.F. supervised the research.

Corresponding authors

Correspondence to Carl G. de Boer or Aviv Regev.

Ethics declarations

Competing interests

A.R. is an SAB member of Thermo Fisher Scientific, Neogene Therapeutics, Asimov, and Syros Pharmaceuticals, an equity holder of Immunitas, and a founder of and equity holder in Celsius Therapeutics. All other authors declare no competing interests.

Integrated supplementary information

Supplementary Figure 1 Random DNA yields diverse expression levels.

(a) Diverse expression in all tested promoter scaffolds. For each promoter scaffold (right), shown are the distributions of expression levels (log₂(YFP/RFP), x axis) measured by flow cytometry for the entire library (gray filled curves) and for a few selected clones, each from a different single promoter from the library (colored line curves). (b) Expression level distribution of random DNA. Shown is the expression distribution of >2,000,000 promoters composed of 3,811 random scaffolds, each in combination with ~660 random 80 bp oligos included in the middle. The periodic peaks occur at the expression bins and result from the large number of promoters that were only observed in a single bin (and so have a discrete expression level). We considered any promoters that, upon re-sorting, would end up in any of the non-zero bins as ‘expressed’ (non-zero bins indicated in red; Methods). Note that the expression units are not equivalent to those used elsewhere; the dynamic range of this library is similar to that of the pTpA library, but this library contains fewer high-expressing promoters.

Supplementary Figure 2 Model performance on held-out data.

Shown are the model-predicted (x axis) and measured (y axis) expression levels of held-out training data (never seen by the model, but generated in the same experiment as the training data) for (a) Abf1TATA+glucose, (b) pTpA+glucose, (c) pTpA+galactose, and (d) pTpA+glycerol. Since the pTpA+glucose billboard model (b) can explain 68.3% of these held-out data and 92.6% of the high-quality data, we estimate that the training data have ~24% noise, although this varies by experiment. (n = 250,000 promoters for each; Pearson’s r² is shown at bottom right; blue lines: Generalized Additive Model lines of best fit.)
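The ~24% noise figure in this legend can be reproduced with simple arithmetic. The sketch below assumes that noise is taken as the gap between the variance explained on high-quality test data and on held-out data; the function names are illustrative and do not come from the authors' code.

```python
def pearson_r2(xs, ys):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)


def estimated_noise_fraction(r2_high_quality, r2_held_out):
    """Assumed reading of the legend: the share of variance the model explains
    on clean data but not on noisy held-out data is attributed to noise."""
    return r2_high_quality - r2_held_out


# Values quoted in the legend for the pTpA+glucose billboard model.
noise = estimated_noise_fraction(0.926, 0.683)  # about 0.243, i.e. ~24%
```

With real data, `pearson_r2(predicted, measured)` would supply the two inputs; here the published percentages are plugged in directly.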

Supplementary Figure 3 Saturation analysis of GPRA data.

Shown are the numbers of distinct promoters detected (y axis) when subsampling the pTpA+glucose sequencing data (black points) to different read coverage (x axis), after combining reads from all expression bins. Red curve: promoters projected to be detected with additional sequencing.
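The subsampling part of this saturation analysis can be sketched as follows. This is a toy illustration, assuming each read has already been reduced to a promoter identifier; the function name and the toy reads are hypothetical, and the projection to deeper sequencing (red curve) is not implemented here.

```python
import random


def distinct_at_depths(reads, depths, seed=0):
    """Count distinct promoters seen when only the first `depth` reads of a
    random shuffle are kept, for each requested depth."""
    rng = random.Random(seed)
    shuffled = list(reads)
    rng.shuffle(shuffled)
    return {depth: len(set(shuffled[:depth])) for depth in depths}


# Toy pool of reads: promoter "A" is abundant, "C" is rare.
toy_reads = ["A"] * 5 + ["B"] * 3 + ["C"]
curve = distinct_at_depths(toy_reads, depths=[1, 3, 6, 9])
```

Because each depth reuses a prefix of the same shuffle, the resulting counts never decrease with depth, which is the shape expected of a saturation curve.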

Supplementary Figure 4 Predicting expression from TF binding.

(a) Relationship between predicted binding of individual TFs and expression level. Measured expression level (pTpA+Glu data; y axis) vs. predicted binding (x axis) for Abf1 (left), Rsc30 (middle), and Ume6 (right). Ume6 (whose motif is similar to that of Rsc30) is positively correlated with expression overall (r=0.14), but has a strong negative trend at high predicted occupancies. (n = 250,000 promoters for each; Pearson’s r shown at bottom right; Top: Motifs. Blue lines: GAM lines of best fit. Gray shaded areas: 95% confidence intervals.) (b) Model correctly predicts expression of designed sequences. Shown are the model-predicted (x axis) and measured (y axis) expression levels for 1,000 in silico generated random sequences designed based on the pTpA+glucose billboard model’s predictions to span a range of expression levels, and then measured experimentally. (n = 1,000 promoters; Pearson’s r² shown at bottom right.)

Supplementary Figure 5 Expression of 80 bp native promoter fragments.

Distributions of measured expression levels (x axis) for 80 bp fragments of native yeast promoters, corresponding to different regions of the promoter (colors). Although each region included sequences from all promoters, the regions upstream of -200 are covered by nucleosomes in many promoters, which may explain their tendency to express at lower levels.

Supplementary Figure 6 Predicted nucleosome occupancy.

(a,b) Model predicts nucleosome occupancy well. (a) Model predicted (x axis) vs. measured (MNase-Seq, y axis) nucleosome occupancy. Four MNase biological replicates are shown (Methods). (Blue lines: Generalized Additive Model lines of best fit.) (b) As in (a), with replicates averaged, and only promoters present in both replicates shown. (c) Model-predicted accessibility across the YGR012W locus. A genome browser track showing yeast genes (top), measured nucleosome occupancy (Zhang, second from the top) and DNaseI hypersensitivity (Hesselberth, third from the top), and model predicted accessibilities (Ω). Promoters are predicted to be accessible and models provide similar predicted accessibilities. The four models were learned independently from different experimental data.

Supplementary Figure 7 The billboard models identify biochemical activities of TFs.

Shown are the numbers of TFs classified as activators, repressors, neither, or both in the yeast Gene Ontology (GO, Methods) (x axes) and whether they are predicted as (a) repressor (blue) or activator (red); or (b) closing (blue) or opening (red) chromatin, by each model (labels on top of subpanels). Hypergeometric P-values for overlaps between predicted activator/repressor (or chromatin opener/closer) and activator/repressor GO annotations are shown (‘neither’ and ‘both’ categories are ignored). (n = 1 set of model predictions for each subpanel; 4 models total.)
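A hypergeometric overlap P-value of the kind mentioned here can be computed with the standard library alone. The sketch below assumes a one-sided (enrichment) test, which the legend does not specify, and uses made-up counts rather than the study's data.

```python
from math import comb


def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric: `draws` items taken without
    replacement from `population` items, `successes` of which are marked."""
    total = comb(population, draws)
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


# Hypothetical example: 20 annotated TFs, 10 of them GO activators;
# the model calls 8 activators, 7 of which are GO activators.
p_value = hypergeom_upper_tail(k=7, population=20, successes=10, draws=8)
```

`math.comb` returns 0 when asked to choose more items than are available, so impossible overlap sizes contribute nothing to the sum.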

Supplementary Figure 8 Learning motifs de novo and improving known motifs.

(a) Model initialized with random PWMs learns that General Regulatory Factor motifs open chromatin. Shown are 12 motifs with the highest potentiation parameter values for a model trained on pTpA+glucose data with 1,000 randomly initialized motifs, allowing for multiple variants of the same motif to be learned. Motifs correspond to the three canonical general regulatory factors, and Cbf1, previously suggested to have a similar function. (b-d) Model-refined motifs perform better in predicting TF binding and knockout effects in independent experiments. (b) Similar refinement in independent models. Comparison of the original TFBS motif (top) and model-refined motifs from each of the four models for two example motifs. (c,d) Shown are the absolute values of the Pearson correlation coefficient (|r|) when using either the original motifs (x axis) or the pTpA+Gal model-refined motifs (y axis) to predict whether (c) the gene’s expression will change in the corresponding TF mutant (compared to wild type) based on predicted binding to the promoter, or (d) a ChIP probe will be bound by the TF in a ChIP assay based on predicted binding to the ChIP probe. (Here, data were not subsampled). Overall, model-refined motifs perform better (points above diagonal), but some perform worse. Reduced performance can be due to condition specific regulators that are minimally active in our tested growth conditions (for example, Gcn4), redundancy between motifs (for example, Hsf1 has mono-, di-, and trimeric motifs), and overfitting of the original motif to the test data (for example, ChIP-derived motifs tested on ChIP data). (n = 824 and 303 experiments, respectively for (c) and (d)).

Supplementary Figure 9 Positional model summary.

The positional model is identical to that in Fig. 2c, except that the predicted accessibility (Ω) is used to rescale the tensor of DNA binding sites (4) (rather than the DNA binding vector, as before), preserving binding site orientation and location in the resulting chromatin binding tensor. Positional chromatin binding is multiplied pairwise with the learned positional TF activities (5), and then summed (6) to obtain the activity contribution of each TF, which is used to calculate the overall expression level of the sequence (7). Actual parameter values are shown (pTpA+glucose model), with binding sites smoothed over 8 bp for visualization purposes.
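Steps (4) to (7) of this legend can be illustrated with a small pure-Python sketch. It assumes that the accessibility Ω is already computed, that binding and activity tensors are indexed as [TF][strand][position], and that the final expression is a bias term plus the sum of per-TF contributions; the bias and all numbers are illustrative assumptions, not learned parameters.

```python
def positional_expression(binding, activity, omega, bias=0.0):
    """Return (per-TF contributions, predicted expression).

    binding[tf][strand][pos]  : predicted DNA binding at each site
    activity[tf][strand][pos] : learned positional activity of the TF
    omega                     : predicted accessibility of the promoter
    """
    contributions = []
    for tf_binding, tf_activity in zip(binding, activity):
        total = 0.0
        for strand_binding, strand_activity in zip(tf_binding, tf_activity):
            for b, a in zip(strand_binding, strand_activity):
                chromatin_binding = omega * b   # step 4: rescale by accessibility
                total += chromatin_binding * a  # steps 5-6: multiply, then sum
        contributions.append(total)
    # Step 7 (assumed form): bias plus the sum of TF contributions.
    return contributions, bias + sum(contributions)


# Two hypothetical TFs, two strands, two positions.
toy_binding = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.5], [0.0, 0.0]]]
toy_activity = [[[2.0, 1.0], [1.0, 1.0]], [[0.0, -4.0], [0.0, 0.0]]]
per_tf, expression = positional_expression(toy_binding, toy_activity, omega=0.5, bias=10.0)
```

In this toy case the first TF acts as an activator and the second as a repressor of equal magnitude, so the prediction equals the bias.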

Supplementary Figure 10 Downsampling analysis determines the extent of data needed to train models of varying complexity.

Model performance as reflected by prediction accuracy on the held-out validation data (Pearson’s r², y axis, top) and by the relative performance on validation-vs-training data (y axis, bottom, ratio of validation-vs-training Pearson r²; when <1.00, model is overfit) for models trained using sub-samples of the available training data (x axis) and learning different parameters (colors; parameters are cumulative from top to bottom of legend). Dotted lines: number of training examples needed to eliminate overfitting ((validation r² / training r²) > 0.999). Solid points: number of examples at which the maximal validation performance is achieved.
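The two summary points in this legend (dotted lines and solid points) can be derived from a table of downsampling results, as in the sketch below. The result tuples are invented for illustration; only the 0.999 ratio threshold comes from the legend.

```python
def first_non_overfit_size(results, threshold=0.999):
    """Smallest training-set size whose validation/training r² ratio
    exceeds the threshold, or None if no size qualifies."""
    for n_examples, train_r2, val_r2 in sorted(results):
        if val_r2 / train_r2 > threshold:
            return n_examples
    return None


def best_validation_size(results):
    """Training-set size at which validation r² is maximal."""
    return max(results, key=lambda row: row[2])[0]


# Hypothetical (n_examples, training r², validation r²) rows.
toy_results = [
    (1_000, 0.90, 0.50),
    (10_000, 0.80, 0.70),
    (100_000, 0.7505, 0.75),
    (1_000_000, 0.76, 0.76),
]
no_overfit_at = first_non_overfit_size(toy_results)
best_at = best_validation_size(toy_results)
```

The two sizes need not coincide: a model can stop overfitting before it has reached its best achievable validation performance.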

Supplementary Figure 11 Positional models improve predictions.

(a,b) Position-specific pTpA+Glu model-predicted expression levels (x axes) vs. measured expression levels (y axes) for (a) high-quality test data in the pTpA promoter scaffold, grown in glucose, and (b) native yeast promoter sequences, divided into 80 bp fragments and tested in the pTpA promoter scaffold, grown in glucose. (c) Most expression variation is attributed to accessibility. Position-specific pTpA+Glu model-predicted accessibility (Ω; x axis) vs. measured expression levels (y axis) for high-quality test data in the pTpA promoter scaffold, grown in glucose. Performance is better when incorporating positional activities (a,b), and, in particular, accessibility alone cannot distinguish the highest expression levels (measured expression >12). The dominant effect of accessibility on expression likely reflects accessibility being a prerequisite for expression. (n = 9,982, 70,924, and 9,982 promoters for (a-c), respectively; Pearson’s r² shown in bottom right; Red lines: Generalized Additive Model lines of best fit).

Supplementary Figure 12 Positional preferences of TFs are prevalent and context-dependent.

Position and strand preferences. Learned activity parameter values (y axis) for motifs in each position (x axis) and strand orientation (upper and lower panels) for each model (colors), for (a) Abf1, (b) Mcm1, (c) Ume6, (d) Mot3, (e) Azf1, and (f) Thi2.

Supplementary Figure 13 Features of positional preferences.

(a) Capturing helically biased positional preferences. For each location within the promoter (x axis), shown are the learned activity bias parameters (red curve; as in Fig. 4b) for the poly-A motif, the long-range trend captured by a loess fit (green), and the short-range residual activity bias after subtracting the loess fit (blue), with reference 10.5 bp sine waves (black), for the minus strand (top) and plus strand (bottom) for the four different models (columns). (b) Modeling positional preferences increases predictive accuracy within the same scaffold but can drastically decrease it between scaffolds. For each training data set (four sub-panels) for both model types (colors), the Pearson r² (y axis) capturing performance on each test dataset (x axis). (n = 1 set of model predictions per bar; 4 independent training sets each with 2 model types, evaluated on each of 3 test datasets total.)
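The detrend-then-compare idea in panel (a) can be sketched as follows. Note the substitution: a centered moving average stands in for the loess fit used in the figure, and the strength of the 10.5 bp component is estimated by projecting the residual onto sine and cosine waves of that period. The window size and the synthetic signal are assumptions made for illustration.

```python
import math


def moving_average(values, window):
    """Centered moving average; windows are truncated at the ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed


def periodic_amplitude(values, period=10.5, window=21):
    """Approximate amplitude of a sinusoid of the given period in the
    residual left after subtracting the long-range trend."""
    trend = moving_average(values, window)
    residual = [v - t for v, t in zip(values, trend)]
    s = sum(r * math.sin(2 * math.pi * i / period) for i, r in enumerate(residual))
    c = sum(r * math.cos(2 * math.pi * i / period) for i, r in enumerate(residual))
    return 2.0 * math.hypot(s, c) / len(residual)


# Synthetic activity-bias profile: slow linear drift plus a helical (10.5 bp) wave.
profile = [0.01 * i + math.sin(2 * math.pi * i / 10.5) for i in range(84)]
helical = periodic_amplitude(profile, period=10.5)
off_period = periodic_amplitude(profile, period=7.0)
```

On this synthetic profile the estimate at 10.5 bp is much larger than at an unrelated period, which is the signature the black reference waves are meant to highlight.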

Supplementary Figure 14 Model correctly predicts expression in tiling experiments.

Measured (black) and predicted (red) expression levels for sequences containing each motif (rows), tiled at every possible position (x axes) and orientation (top and bottom halves) in each of three random background sequences (columns).

Supplementary Figure 15 Weak interactions have a predominant effect on expression level.

(a,b) Distinct regulatory networks for two sequences with nearly identical expression levels. Model-predicted regulatory networks for two 80 bp native promoter fragments using the positional pTpA+glucose model. Purple nodes: Individual TFs. Edge thickness and spacing: interaction strengths for each TF (green: positive/activating; red: negative/repressing). Both 80 bp promoter sequences have intermediate measured expression levels of ~10.5 in pTpA+glucose, but different regulatory networks: (a) the -160:-80 region of YDR456W is dominated by a single strong interaction, while (b) the -320:-240 region of YIR035C is regulated by many weak interactions. (c) Abundant weak regulatory interactions also explain most of expression in the most relevant promoter region. Cumulative distributions (y axis) of the number of regulatory interactions (black) and fraction of expression explained (that is, cumulative total/sum of all effect sizes; red) for each interaction strength (x axis). Here, we consider only native promoter fragments from the -160:-80 region (relative to the TSS), which is the region our models are trained on and corresponds to the most active promoter region (Supplementary Fig. 5). The magnitude (and not the sign) of the interaction strength is considered.
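The two cumulative curves in panel (c) can be computed from a list of interaction strengths as sketched below. The strengths are invented, and the function name is illustrative; as in the legend, only magnitudes are used.

```python
def cumulative_explained(strengths):
    """For each interaction (sorted by increasing magnitude) return
    (magnitude, cumulative fraction of interactions,
     cumulative fraction of total effect size)."""
    magnitudes = sorted(abs(s) for s in strengths)
    total = sum(magnitudes)
    count = len(magnitudes)
    rows = []
    running = 0.0
    for i, magnitude in enumerate(magnitudes, start=1):
        running += magnitude
        rows.append((magnitude, i / count, running / total))
    return rows


# Hypothetical fragment: nine weak interactions and one strong repressing one.
toy_strengths = [0.1] * 9 + [-0.6]
rows = cumulative_explained(toy_strengths)
weak_share = rows[8]  # state after the nine weakest interactions
```

In this toy case the nine weak interactions make up 90% of all interactions and account for 60% of the summed effect size, illustrating how many small effects can outweigh one large one.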

Supplementary information

Supplementary Information

Supplementary Figs. 1–15.

Reporting Summary

Supplementary Table 1

Motifs used in this study. Motif IDs are from the YeTFaSCo database. Motifs excluded from the motif frequency analysis (Fig. 1a) are indicated.

Supplementary Table 2

Promoter scaffolds included in the scaffold library. Sequences include 80 Ns in place of the random 80-mers and begin 13 bp upstream of the theoretical TSS.
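As a usage illustration for the scaffold format described here, the sketch below replaces the run of 80 Ns with a random 80-mer. The scaffold string is a short made-up stand-in, not one of the sequences in the table, and uniform base sampling is an assumption.

```python
import random


def fill_scaffold(scaffold, rng, n_len=80):
    """Replace the single run of `n_len` Ns in a scaffold with a random
    sequence of the same length."""
    placeholder = "N" * n_len
    if scaffold.count("N") != n_len or placeholder not in scaffold:
        raise ValueError("scaffold must contain exactly one run of %d Ns" % n_len)
    insert = "".join(rng.choice("ACGT") for _ in range(n_len))
    return scaffold.replace(placeholder, insert)


# Hypothetical flanking sequence around the 80 N placeholder.
toy_scaffold = "ACGT" + "N" * 80 + "TTGA"
promoter = fill_scaffold(toy_scaffold, random.Random(0))
```

Passing in a seeded `random.Random` keeps the example reproducible without touching the global random state.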

Supplementary Table 3

TSS annotations of native promoters. Coordinates of TSSs for the v64 S. cerevisiae genome sequence.

Supplementary Table 4

All model parameters learned. For each of the eight models discussed in this paper (billboard and positional, for each of the four datasets), all model parameters, including TF activities, potentiations, motifs and (where applicable) position-specific activities.

About this article

Cite this article

de Boer, C.G., Vaishnav, E.D., Sadeh, R. et al. Deciphering eukaryotic gene-regulatory logic with 100 million random promoters. Nat Biotechnol 38, 56–65 (2020). https://doi.org/10.1038/s41587-019-0315-8

Received: 24 January 2019
Accepted: 16 October 2019
Published: 02 December 2019
Issue Date: 01 January 2020
DOI: https://doi.org/10.1038/s41587-019-0315-8
