Free energy-based model of CTCF-mediated chromatin looping in the human genome

doi:10.1016/j.ymeth.2020.05.025

Methods

Volumes 181–182, 1 October 2020, Pages 35-51

https://doi.org/10.1016/j.ymeth.2020.05.025 Get rights and content

Highlights

•
Dynamic programming algorithm systematically searches chromatin free energy landscape.
•
Free energy minimization generates folded optimal and suboptimal 2D meta-structures.
•
3D structures are generated from 2D meta-structure including multipair interactions.
•
The method identifies the contacts that are significant in the 2D contact map.
•
CTCF contacts and free energy correlate strongly with regions containing euchromatin.

Abstract

In recent years, high-throughput techniques have revealed considerable structural organization of the human genome with diverse regions of the chromatin interacting with each other in the form of loops. Some of these loops are quite complex and may encompass regions comprised of many interacting chain segments around a central locus. Popular techniques for extracting this information are chromatin interaction analysis by paired-end tag sequencing (ChIA-PET) and high-throughput chromosome conformation capture (Hi-C). Here, we introduce a physics-based method to predict the three-dimensional structure of chromatin from population-averaged ChIA-PET data. The approach uses experimentally-validated data from human B-lymphoblastoid cells to generate 2D meta-structures of chromatin using a dynamic programming algorithm that explores the chromatin free energy landscape. By generating both optimal and suboptimal meta-structures we can calculate both the free energy and additionally the relative thermodynamic probability. A 3D structure prediction program with applied restraints then can be used to generate the tertiary structures. The main advantage of this approach for population-averaged experimental data is that it provides a way to distinguish between the principal and the spurious contacts. This study also finds that euchromatin appear to have rather precisely regulated 2D meta-structures compared to heterochromatin. The program source-code is available at https://github.com/plewczynski/looper.

Introduction

Recent high-throughput sequencing methods on chromatin fibers have revealed considerable hierarchical organization and packaging of the chromatin in the form of loop-like structures in humans [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], other vertebrates such as mice [14], or zebrafish [15], insects like Drosophila melanogaster [16], and even plants [17]. The 3D topology of chromatin not only helps to establish the volume of the nucleus, it is thought to have a functional role; e.g. in regulating gene expression by modulating the spatial distance between promoters and enhancers [2], [3], [4], [18], [19], [20], [21]. Alterations in the arrangement of some of the loops have been linked to diseases like cancer [22] and autoimmune diseases [23].

A key feature in humans and other vertebrates is that the DNA loops arise as a consequence of protein-mediated binding of distinct DNA regions. A pivotal role in this process is played by CTCF: an 11 zinc-fingers protein that binds to the CCCTC motif [2], [18], [21], [24]. Proteins, which are often found concomitant with the CTCF protein around the DNA loops, include cohesin, Yin Yang 1, the RNA polymerase II complex (RNAPII), or CCCTC-binding factor like protein (CTCFL: also called BORIS) [2], [18], [21], [24], [25]. The latter is expressed in germ cells and some cancers and mediates the formation of a subset of cell-specific loops [26]. YY1-mediated loops bring promoters and enhancers into close spatial proximity [27]. RNAPII also appears to be strongly associated with the presence and formation of chromatin loops; though often shorter and more tissue specific [2]. How the loops are formed is an open question. In some works, it was shown that CTCF forms a dimer that can bring together two distal regions of DNA [18]. Another possible mechanism is the loop extrusion model, which proposes that the pair of CTCFs serves as a topological obstacle for the cohesin ring traversing along the DNA fiber [28], [29]. RNAPII may also play an important role in the active translocation of the cohesin.

There are multiple methods that have been developed to obtain genome-wide maps of chromatin organization [30], [31]. The most common approach is high-throughput chromosome conformation capture (Hi-C), which identifies all types of interactions between diverse parts of the chromatin chain. Based on Hi-C data, the genome can be divided into substructures known as topologically associated domains – TADs [3], [6], [13], [32]. A TAD is essentially contiguous regions of the chromatin in which loci tend to interact far more frequently with each other than with loci outside the region. Another method is chromatin interaction analysis by paired-end tag sequencing (ChIA-PET), which is designed to measure only the interactions mediated by specific protein factors like CTCF or RNAPII [33], [34]. This is achieved by introducing the chromatin immunoprecipitation (ChIP) step that targets specific proteins associated with the protein-DNA complexes [33]. Furthermore, ChIA-PET permits identifying specific chromatin loops at a resolution of DNA base pairs, in principle. The strength of these loops is expressed in terms of the number of PET counts; i.e., the number of unique pairs of reads confirming that two regions are joined together. Usually only loops above a specific PET count threshold are reported; however, even loops with PET count equal to 1 (called ‘singletons’), which may arise due to transient interactions with the chromatin chain, are sometimes used for example to propose chromatin 3D structure [1]. Two interacting loci that form a given loop are called anchors [2], [31]. The length (or “span”) of the ChIA-PET-proposed loops can vary significantly; between 8 kilobase-pairs (kbp) up to 1 Mbp. Similarly to Hi-C, with ChIA-PET, the genome can be partitioned into segments called chromatin contact domains (CCDs). A CCD is a genome fragment that is continously covered by interconneted loops and, for the GM12878 cell line, 2,267 CCDs were identified. The ChIA-PET experiments on the GM12878 cell line produced 31 million reads [2]. The data is usually binned into segments of 1- to 5-kbp. In general, a resolution of 5 kbp produces a reasonable 2D contact map (or heat map). Beside the two abovementioned methods, several other techniques have been introduced recently to analyze 3D genome organization like ChIA-Drop [35], GAM [36], or PLAC-Seq [37]. Some of these strive to report the chromatin structure from individual cells rather than a population-averaged picture observed with older techniques.

Chromatin loops can be described in terms of motifs. For example, Fig. 1A depicts a pair of CTCFs interacting with cohesin, a ring-shaped multidomain structural protein, resulting in a simple loop. As the CTCF binding motifs in both anchors are in a parallel orientation, we call this loop “convergent”. This type of loop is the most frequently observed substructure in the human genome. Based on the PET count, it also can be considered the strongest of interactions. “Divergent” loops occur when the CTCF binding motifs adopt an antiparallel orientation, thus influencing the relative orientation of both CTCFs bound to this DNA fragment (Fig. 1B). When the direction of the motifs is parallel, two types of structures of equal tendency and intermediate strength are suggested: tandem right (Fig. 1C) and tandem left (Fig. 1D). The CTCF ChIA-PET data for the GM12878 cell line indicate that around 65% of the loops are convergent, roughly 33% are tandem and remaining divergent loops amount to only around 2%. Similar values were obtained for Hi-C data for the same cell line [2], [21]. CTCF structures also appear to group into islands (also called multipair interactions, as shown in Fig. 1E) when more than two chromatin chains interact at a single spatial locus. In this case the central region is the largely insoluble and inactive part (heterochromatin) whereas the regions jutting out are more accessible to transcription factors and therefore active (euchromatin).

In recent years, there has been a considerable interest in finding ways to physically model the distribution of chromatin structure using various approaches such as population-based analysis [38], polymer based models using molecular dynamics (MD) simulations [2], [11], [12], [39], [40], or Monte Carlo (MC) simulation techniques [1], [8], and some thermodynamics based methods [16], [41], [42]. These models provide 3D structures of the chromatin based on the population-averaged experimental data. However, MD or MC simulation techniques do not guarantee an exhaustive search of such a large landscape as is observed for chromatin, nor is it assured that a conformation found in this way is the most stable structure [38]. More recently, a maximum entropy method has been introduced that explores the phase space of the polymer chain under the constraints of pair-wise interactions and uses these contacts to fit a proposed 3D potential using maximum entropy [43], [44].

Here, we introduce a different approach where we use thermodynamics on 2D meta-structures to produce ranked, optimal and suboptimal structures. In our model, we assume that the number of observed PET counts associated with a particular contact point is a measure of the likelihood of that interaction. Given so, it should be a function of the relative stability of the DNA loops and can be represented as a binding free energy. For finite, uniquely definable interactions, a thermodynamic potential can be developed to estimate the likelihood of a given 2D meta-structure based upon the free energy of each motif. These attractive interactions were combined with the repulsive effects caused by entropy loss due to folding and the formation of contacts; a concept originally introduced as the cross linking entropy ([45], [46], [47], [48] and references therein). This approach provides a list of optimal and suboptimal 2D meta-structures from the collection of possible loops. To systematically rank various combinations of loops, we employed a dynamic programming algorithm (DPA) strategy to optimize the arrangement of these pair contacts. Using heuristic adaptations to the DPA, the method is designed to handle complex motifs such as pseudoknots involving both parallel and antiparallel chain interactions (see Section 2.3). Our model works largely from the opposite direction of what is proposed by Di Pierro et al. [43], which is a top-down approach. Combining the statistical potential with the entropy loss, this model provides a quantitative, bottom-up integration procedure that is consistent with the Di Pierro et al. maximum entropy model.

Unlike many protein and RNA structures where there is a very well-defined native 3D structure; chromatin conformations tend to be far more plastic. These highly flexible loops effectively confine the enclosed chromatin chain to a subset of possible interactions; increasing the probability that some of the observed contacts are spurious interactions mixed with the genuine contacts. During the cross-linking process, commonly used in experimental techniques, the chromatin is fixed in a specific 3D conformation. Consequently, CTCF-cohesin complexes in transient locations on the chromatin chain will end up fused at those locations and pulled down with the actual, biologically relevant contacts. If spurious contacts are present in the population averaged data, then they will also appear in single-cell data and the priority of the contacts cannot be clearly established. This presents some challenges in interpreting 2D contact maps. Our method provides a way to “filter” and rank the population-averaged and single-cell data into a set of human readable optimal and suboptimal structures and generates 3D structures and trajectories from that information. This modeling approach tends to be consistent with the observed behavior of DNA. The 3D structures are not calculated from rigid harmonic potentials, rather, the structures are systematically ranked using the DPA and then the 3D structure is constructed using restraints associated with these predicted contacts. The thermodynamic approach also tends to guarantee that the loss of pairing probability with genomic distance n is consistent with the observed behavior: p(n) ∝ 1/n^α, where α = 1 would represent the fractal globule case with respect to the volume of the nucleus [6].

In this approach, we also focus on representing chromatin motifs as 2D meta-structures. This representation approach is helpful both in visualizing chromatin structures, and in comparing differences between two related genomes. This helps focus any 3D visualizations on the relevant changes in a given genome. Furthermore, computation with our approach generates 1D structural representations, which are machine readable, and can be automated to generate 3D conformations.

In summary we propose a novel method to help filter out spurious contacts obtain with methods designed to analyze genome-wide chromatin organization and to obtain the optimal structures of chromatin. As proof of principle, we focus on population-averaged ChIA-PET data from the GM12878 cell line [1], [2]. We applied the free energy prediction approach to obtain optimal and suboptimal meta-structures that reveal an ensemble of chromatin structural motifs. To validate our approach, we predicted both the optimal structure and all the other major suboptimal meta-structures of chromatin regions for the publicly available list of chromatin contact domains (CCDs), as well as for the members of A and B compartments. We observe that, for segments of chromatin that are active, there are more suboptimal structures that tend to have a more diverse distribution of free energies than regions that are inactive. Thus, with this approach we can correctly recognize regions of chromatin with high mobility identified in independent experimental methods.

Section snippets

Methods

We have divided the discussion of methods into three major parts. Here in Section 2.1, we explain what structural motifs are and how a hierarchy of structural motifs can be built up to form a 2D meta-structure, as it pertains to chromatin. Next, we show how to represent these 2D meta-structures in a machine-readable 1D format. In Section 2.2, we explain the thermodynamic calculation scheme in evaluating pair-wise and multipair interactions. Finally, in Section 2.3, we explain the detailed

Results and discussion

There are two main objectives in this work. The first is to use polymer physics to obtain the dominant meta-structures for a given regions of chromatin from a set of population-averaged contacts from a particular human cell line (in this case, GM12878). This is achieved by calculating the free energy, which helps to prioritize the types of structural motifs that dominate a particular part of chromatin. The second is to confirm the validity of our approach by trying to characterize regions of

Conclusion

We have introduced a computational algorithm for predicting the structure and dynamics of chromatin loops by analyzing their thermodynamic probability derived from ChIA-PET CTCF data. This permits learning the most probable structure of the chromatin in terms of 2D meta-structural motifs, which then can be used to derive 3D structures. This algorithm may help serve as an aid in determining the relative activity of the chromatin based upon the stability of the structure. From our results, a

Acknowledgements

The study was supported by the National Science Centre, Poland grants (2019/35/O/ST6/02484, 2018/02/X/NZ2/02125, 2014/15/B/ST6/05082). The research was also funded under the grant "Three-dimensional Human Genome structure at the population scale: computational algorithm and experimental validation for lymmphoblastoid cell lines of selected families from 1000 Genomes Project" that is within the TEAM programme of the Foundation for Polish Science; co-financed by the European Union under the

Software availability

The chromatin evaluation (chreval) package consists of various object oriented programs implemented using the Python 3.6 distribution. This includes the computational program to calculate individual contact maps and generate free energies and predicted meta-structures. Examples and a short tutorial are provided.

The software is available from the project web page.

[https://4dnucleome.cent.uw.edu.pl/looper/], and github.

[https://github.com/plewczynski/looper] as a stand-alone software package. The

References (87)

Z. Tang et al.
Cell
(2015)
J. Dekker et al.
Cell
(2016)
A. Lewis et al.
Current biology : CB
(2004)
J.R. Davie et al.
Adv. Enzyme Regul.
(2008)
P. Dong et al.
Molecular plant
(2017)
X. Ji et al.
Cell Stem Cell
(2016)
J.A. Beagan et al.
Cell Stem Cell
(2016)
S.S. Rao et al.
Cell
(2014)
K.S. Sandhu et al.
Cell reports
(2012)
G. Fudenberg et al.
Cell reports
(2016)

D. Capurso et al.

Methods

(2020)

B. Zhang et al.

Biophys. J .

(2017)

W. Dawson et al.

J. Theor. Biol.

(2001)

W. Dawson et al.

J. Theor. Biol.

(2001)

K. Maeshima et al.

Curr. Opin. Cell Biol.

(2010)

S. Todolli et al.

Biophys. J .

(2017)

W.K. Dawson et al.

Methods

(2016)

H. Agarwal et al.

Biophys. J .

(2017)

W. Humphrey et al.

J. Mol. Graph.

(1996)

I. Tuszynska et al.

Methods

(2014)

D.G. Lupianez et al.

Cell

(2015)

J.N. Onuchic et al.

Curr. Opin. Struct. Biol.

(2004)

P. Szalaj et al.

Genome Res.

(2016)

S.V. Ulianov et al.

Genome Res.

(2016)

S. He et al.

J. Cell. Biochem.

(2008)

E. Lieberman-Aiden et al.

Science

(2009)

S. Wang et al.

Nuc Acids Res

(2015)

J.R. Davie

Mol. Biol. Rep.

(1997)

M. Barbieri et al.

Nucleus

(2013)

M. Barbieri et al.

Front. Genet.

(2013)

R. Mourad, O. Cuvier, Nuclear Architecture and Dynamics, Elsevier BV2018, pp....

B. Bonev, N. Mendelson Cohen, Q. Szabo, L. Fritsch, G.L. Papadopoulos, Y. Lubling, X. Xu, X. Lv, J.P. Hugnot, A. Tanay,...

L.J.T. Kaaij et al.

Cell reports

(2018)

Q. Szabo, D. Jost, J.M. Chang, D.I. Cattoni, G.L. Papadopoulos, B. Bonev, T. Sexton, J. Gurgo, C. Jacquier, M....

J.D. Buenrostro et al.

Nature

(2015)

M. Sadowski et al.

Genome Biol.

(2019)

A.G. West et al.

Genes Dev.

(2002)

M. Lazniewski et al.

Plewczynski, Seminars in cell

& developmental biology

(2018)

P. Bergmaier et al.

Nucleic Acids Res.

(2018)

A.S. Weintraub et al.

Cell

(2017)

E. Alipour et al.

Nucleic Acids Res.

(2012)

W.K. Dawson, M. Lazniewski, D. Plewczynski, in: S. Ranganathan, M. Gribskov, K. Nakai, C. Schönbach (Eds.) Encyclopedia...

J.R. Dixon et al.

Nature

(2012)

Cited by (2)

PredTAD: A machine learning framework that models 3D chromatin organization alterations leading to oncogene dysregulation in breast cancer cell lines
2021, Computational and Structural Biotechnology Journal
Citation Excerpt :
In order to fit this vast genomic material in a small nuclear space, the chromatin undergoes multiple levels of organization. The genome is beautifully organized into fractal globular-like structures with extensive but specific looping and folding [1–4]. Furthermore, the chromatin is organized into sub-mega base regions with increased self-interaction frequency and decreased interactions with neighboring domains.
Topologically associating domains, or TADs, play important roles in genome organization and gene regulation; however, they are often altered in diseases. High-throughput chromatin conformation capturing assays, such as Hi-C, can capture domains of increased interactions, and TADs and boundaries can be identified using well-established analytical tools. However, generating Hi-C data is expensive. In our study, we addressed the relationship between multi-omics data and higher-order chromatin structures using a newly developed machine-learning model called PredTAD. Our tool uses already-available and cost-effective datatypes such as transcription factor and histone modification ChIPseq data. Specifically, PredTAD utilizes both epigenetic and genetic features as well as neighboring information to classify the entire human genome as boundary or non-boundary regions. Our tool can predict boundary changes between normal and breast cancer genomes. Among the most important features for predicting boundary alterations were CTCF, subunits of cohesin (RAD21 and SMC3), and chromosome number, suggesting their roles in conserved and dynamic boundaries formation. Upon further analysis, we observed that genes near altered TAD boundaries were found to be involved in several important breast cancer signaling pathways such as Ras, Jak-STAT, and estrogen signaling pathways. We also discovered a TAD boundary alteration that contributes to RET oncogene overexpression. PredTAD can also successfully predict TAD boundary changes in other conditions and diseases. In conclusion, our newly developed machine learning tool allowed for a more complete understanding of the dynamic 3D chromatin structures involved in signaling pathway activation, altered gene expression, and disease state in breast cancer cells.
Regulation of loop extrusion on the interphase genome
2023, Critical Reviews in Biochemistry and Molecular Biology

View full text

Free energy-based model of CTCF-mediated chromatin looping in the human genome

Highlights

Abstract

Introduction

Section snippets

Methods

Results and discussion

Conclusion

Acknowledgements

Software availability

Cell

Cell

Current biology : CB

Adv. Enzyme Regul.

Molecular plant

Cell Stem Cell

Cell Stem Cell

Cell

Cell reports

Cell reports

Methods

Biophys. J .

J. Theor. Biol.

J. Theor. Biol.

Curr. Opin. Cell Biol.

Biophys. J .

Methods

Biophys. J .

J. Mol. Graph.

Methods

Cell

Curr. Opin. Struct. Biol.

Genome Res.

Genome Res.

J. Cell. Biochem.

Science

Nuc Acids Res

Mol. Biol. Rep.

Nucleus

Front. Genet.

Cell reports

Nature

Genome Biol.

Genes Dev.

Plewczynski, Seminars in cell

& developmental biology

Nucleic Acids Res.

Cell

Nucleic Acids Res.

Nature