Abstract
In recent years, Hi-C experiments have become a popular tool for studying 3D genome structure. Identification of long-range chromosomal interactions, i.e., peak detection, is crucial for Hi-C data analysis. However, it remains a challenging task because of the inherent high dimensionality, sparsity, and over-dispersion of the Hi-C count data matrix. We propose EBHiC, an empirical Bayes approach for peak detection from Hi-C data. The proposed framework provides flexible over-dispersion modeling by explicitly including the “true” interaction intensities as latent variables. To implement the proposed peak identification method (via the empirical Bayes test), we estimate the overall distribution of the observed counts semiparametrically using a Smoothed Expectation Maximization algorithm, and the empirical null based on the zero assumption. We conducted extensive simulations to validate and evaluate the performance of the proposed approach and applied it to real datasets. Our results suggest that EBHiC identifies peaks that are better in terms of accuracy, biological interpretability, and consistency across biological replicates. The source code is available on GitHub (https://github.com/QiZhangStat/EBHiC).
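To make the ingredients named above concrete, the following is a minimal sketch of a latent-intensity Poisson mixture fitted by an EM iteration with an added smoothing step, followed by a local false discovery rate for each count. It is an illustration under stated assumptions and not the EBHiC implementation: the fixed intensity grid, the neighbour-averaging smoother, all function names, and the hand-supplied null rate and null proportion (`null_rate`, `pi0`) are hypothetical choices made for this example, whereas EBHiC estimates its empirical null from the data.

```python
import math
import random


def poisson_pmf(x, lam):
    """Poisson probability mass P(X = x) for rate lam > 0."""
    return math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))


def sample_poisson(rng, lam):
    """Draw one Poisson variate (Knuth's method; fine for small rates)."""
    limit, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def smoothed_em(counts, grid, n_iter=200, smooth=0.25):
    """Estimate mixing weights over a fixed intensity grid (assumed, not EBHiC's)."""
    k, n = len(grid), len(counts)
    weights = [1.0 / k] * k
    # Tabulate distinct counts so each iteration touches each value once.
    freq = {}
    for x in counts:
        freq[x] = freq.get(x, 0) + 1
    lik = {x: [poisson_pmf(x, lam) for lam in grid] for x in freq}
    for _ in range(n_iter):
        new = [0.0] * k
        for x, m in freq.items():
            # E-step: posterior probability of each latent intensity given x.
            joint = [w * p for w, p in zip(weights, lik[x])]
            total = sum(joint)
            for j in range(k):
                new[j] += m * joint[j] / total
        # M-step: average the posterior memberships.
        new = [v / n for v in new]
        # S-step: average each weight with its grid neighbours, then renormalize.
        sm = []
        for j in range(k):
            left = new[j - 1] if j > 0 else new[j]
            right = new[j + 1] if j < k - 1 else new[j]
            sm.append((1 - 2 * smooth) * new[j] + smooth * (left + right))
        s = sum(sm)
        weights = [v / s for v in sm]
    return weights


def marginal_pmf(x, grid, weights):
    """Fitted marginal probability of observing count x."""
    return sum(w * poisson_pmf(x, lam) for w, lam in zip(weights, grid))


def local_fdr(x, grid, weights, null_rate, pi0):
    """Local fdr pi0 * f0(x) / f(x), capped at 1; null_rate and pi0 are assumed inputs."""
    return min(1.0, pi0 * poisson_pmf(x, null_rate) / marginal_pmf(x, grid, weights))


# Simulated counts: 90% background (rate 2) and 10% true interactions (rate 15).
rng = random.Random(0)
counts = [sample_poisson(rng, 2.0 if rng.random() < 0.9 else 15.0) for _ in range(2000)]
grid = [0.5 * j for j in range(1, 61)]
weights = smoothed_em(counts, grid)
for x in (2, 8, 20):
    print(x, local_fdr(x, grid, weights, null_rate=2.0, pi0=0.9))
```

In this toy setting, counts typical of the background receive a local fdr near one, while counts far in the upper tail receive a local fdr near zero and would be called peaks at any reasonable threshold.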
Funding source: Division of Biological Infrastructure
Award Identifier / Grant number: 1564621
Funding source: Office of Integrative Activities
Award Identifier / Grant number: 1736192
Acknowledgement
The initial work was performed when QZ and ZX were faculty members, and YL was a graduate student, in the Department of Statistics at the University of Nebraska-Lincoln. QZ and YL’s research has been supported by NSF ABI (Award# DBI-1564621), NSF EPSCoR (RII) Track II (Award# OIA-1736192), and an NU Collaborative System Science Seed Grant to QZ. We thank Dr. Ye Zheng for inspiring discussions, and the Holland Computing Center (HCC) at UNL for computational resources and technical support.
Author contributions: All the authors have accepted responsibility for the entire content of this submitted manuscript and approved submission.
Research funding: This research has been supported by NSF ABI (Award# DBI-1564621), NSF EPSCoR (RII) Track II (Award# OIA-1736192).
Conflict of interest statement: The authors declare no conflicts of interest regarding this article.
© 2021 Walter de Gruyter GmbH, Berlin/Boston