An Empirical Bayes approach for the identification of long-range chromosomal interaction from Hi-C data

Qi Zhang; Zheng Xu; Yutong Lai

doi:10.1515/sagmb-2020-0026

Published by De Gruyter January 25, 2021

From the journal Statistical Applications in Genetics and Molecular Biology

Abstract

Hi-C experiments have become very popular in recent years for studying 3D genome structure. Identifying long-range chromosomal interactions, i.e., peak detection, is crucial for Hi-C data analysis, but it remains challenging because of the inherent high dimensionality, sparsity, and over-dispersion of the Hi-C count data matrix. We propose EBHiC, an empirical Bayes approach for peak detection from Hi-C data. The proposed framework provides flexible over-dispersion modeling by explicitly including the “true” interaction intensities as latent variables. To implement the proposed peak identification method (via the empirical Bayes test), we estimate the overall distribution of the observed counts semiparametrically using a Smoothed Expectation Maximization algorithm, and the empirical null based on the zero assumption. We conducted extensive simulations to validate and evaluate the performance of the proposed approach and applied it to real datasets. Our results suggest that EBHiC identifies better peaks in terms of accuracy, biological interpretability, and consistency across biological replicates. The source code is available on GitHub (https://github.com/QiZhangStat/EBHiC).
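
To make the idea of treating "true" intensities as latent variables concrete, the following is a minimal Python sketch and not the EBHiC implementation. It assumes that counts are Poisson given a latent intensity on a fixed grid, fits the mixing weights by EM with a simple neighbor-averaging smoothing step, and uses a fixed intensity cutoff (the hypothetical `null_max`) to define the null, in place of the paper's empirical null based on the zero assumption. All function names, the grid, and the toy counts are illustrative assumptions.

```python
import math


def poisson_pmf(x, lam):
    # Poisson probability mass, computed in log space for stability.
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))


def smoothed_em(counts, grid, n_iter=200, smooth=0.1):
    """Estimate mixing weights over a grid of latent intensities.

    Assumption: each count is Poisson given a latent intensity drawn
    from the grid. Each iteration is one EM update of the weights
    followed by a local-averaging smoothing step.
    """
    k = len(grid)
    w = [1.0 / k] * k
    freq = {}
    for x in counts:
        freq[x] = freq.get(x, 0) + 1
    like = {x: [poisson_pmf(x, lam) for lam in grid] for x in freq}
    n = len(counts)
    for _ in range(n_iter):
        new_w = [0.0] * k
        for x, c in freq.items():
            denom = sum(wj * lj for wj, lj in zip(w, like[x]))
            for j in range(k):
                new_w[j] += c * w[j] * like[x][j] / denom
        new_w = [v / n for v in new_w]
        # Smoothing step: blend each weight with its grid neighbors.
        sm = []
        for j in range(k):
            left = new_w[j - 1] if j > 0 else new_w[j]
            right = new_w[j + 1] if j < k - 1 else new_w[j]
            sm.append((1 - 2 * smooth) * new_w[j] + smooth * (left + right))
        total = sum(sm)
        w = [v / total for v in sm]
    return w


def local_fdr(x, grid, w, null_max):
    """Posterior probability that the latent intensity is <= null_max.

    Assumption: the null is a fixed cutoff on the latent intensity,
    a simplification chosen only for this illustration.
    """
    like = [poisson_pmf(x, lam) for lam in grid]
    total = sum(wj * lj for wj, lj in zip(w, like))
    null = sum(wj * lj for wj, lj, lam in zip(w, like, grid) if lam <= null_max)
    return null / total


# Toy data (made up): mostly background counts plus a few large ones.
counts = [1, 2, 3, 2, 1, 2, 4, 3, 2, 2] * 9 + [20, 22, 25, 19, 21, 23, 24, 20, 22, 21]
grid = [0.5 * (i + 1) for i in range(60)]  # latent intensities 0.5, 1.0, ..., 30.0
weights = smoothed_em(counts, grid)
for x in (2, 22):
    print(x, local_fdr(x, grid, weights, null_max=5.0))
```

In this toy setting, a bin pair would be called a peak when its estimated null probability falls below a chosen threshold. The mixture over latent intensities is what allows the marginal count distribution to be over-dispersed relative to a single Poisson.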

Keywords: empirical Bayes; epigenetics; Hi-C; peak identification

Corresponding author: Qi Zhang, Department of Mathematics and Statistics, University of New Hampshire, Durham, NH 03824, USA, E-mail: qi.zhang2@unh.edu

Funding source: Division of Biological Infrastructure

Award Identifier / Grant number: 1564621

Funding source: Office of Integrative Activities

Award Identifier / Grant number: 1736192

Acknowledgement

The initial work was performed when QZ and ZX were faculty members, and YL was a graduate student, in the Department of Statistics at the University of Nebraska-Lincoln. QZ and YL’s research has been supported by NSF ABI (Award# DBI-1564621), NSF EPSCoR (RII) Track II (Award# OIA-1736192), and an NU Collaborative System Science Seed Grant to QZ. We thank Dr. Ye Zheng for inspiring discussions, and the Holland Computing Center (HCC) at UNL for computational resources and technical support.

Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
Research funding: This research has been supported by NSF ABI (Award# DBI-1564621), NSF EPSCoR (RII) Track II (Award# OIA-1736192).
Conflict of interest statement: The authors declare no conflicts of interest regarding this article.

Received: 2020-04-23

Accepted: 2021-01-06

Published Online: 2021-01-25

Statistical Applications in Genetics and Molecular Biology, Volume 20, Issue 1