Published by De Gruyter, July 13, 2021

AdaReg: data adaptive robust estimation in linear regression with application in GTEx gene expressions

  • Meng Wang, Lihua Jiang and Michael P. Snyder

Abstract

The Genotype-Tissue Expression (GTEx) project provides a valuable resource of large-scale gene expression data across multiple tissue types. In the presence of technical noise and unknown or unmeasured factors, robustly estimating the major tissue effect is challenging. Moreover, different genes exhibit heterogeneous expression across tissue types. We therefore need a robust method that adapts to this heterogeneity to improve estimation of the tissue effect. We followed the approach to robust estimation based on the γ-density-power-weight in the works of Fujisawa, H. and Eguchi, S. (2008). Robust parameter estimation with a small bias against heavy contamination. J. Multivariate Anal. 99: 2053–2081 and Windham, M.P. (1995). Robustifying model fitting. J. Roy. Stat. Soc. B: 599–609, where γ is the exponent of the density weight and controls the balance between bias and variance. To our knowledge, our work is the first to propose a procedure for tuning the parameter γ to balance the bias-variance trade-off under mixture models. We constructed a robust likelihood criterion based on weighted densities in a mixture model in which a Gaussian population distribution is mixed with an unknown outlier distribution, and developed a data-adaptive γ-selection procedure embedded in the robust estimation. We provided a heuristic analysis of the selection criterion and found, in simulation studies under a series of settings, that the average trend of our practical selection criterion across values of γ captures the minimizing γ about as well as the trend of the inestimable mean squared error (MSE). Our data-adaptive robustifying procedure for the linear regression problem (AdaReg) showed a significant advantage over the fixed-γ procedure and other robust methods, both in simulation studies and in a real-data application estimating the tissue effect of heart samples from the GTEx project. Finally, the paper discusses some limitations of the method and directions for future work.
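To make the role of γ concrete, the following is a minimal sketch of a fixed-γ density-power-weighted estimate of a Gaussian mean and variance, written in the spirit of the γ-weighted estimating equations for the Gaussian case. It is not the AdaReg procedure and does not perform the data-adaptive γ selection described above. The function name `gamma_weighted_gaussian`, the median/MAD starting values, the stopping rule, and the synthetic data are all assumptions made for illustration.

```python
import math
import random
import statistics


def gamma_weighted_gaussian(x, gamma, n_iter=500, tol=1e-10):
    """Fixed-gamma density-power-weighted estimate of a Gaussian mean and variance.

    Hypothetical helper for illustration; it does not select gamma from the data.
    """
    # Robust starting values: median and (scaled) median absolute deviation.
    mu = statistics.median(x)
    mad = statistics.median([abs(v - mu) for v in x])
    var = (1.4826 * mad) ** 2
    if var == 0.0:
        var = statistics.pvariance(x)  # fallback when the MAD is zero
    for _ in range(n_iter):
        # Weight each point by the current Gaussian density raised to the power gamma.
        # The normalising constant cancels because the weights are renormalised below.
        w = [math.exp(-gamma * (v - mu) ** 2 / (2.0 * var)) for v in x]
        total = sum(w)
        mu_new = sum(wi * v for wi, v in zip(w, x)) / total
        # The factor (1 + gamma) offsets the shrinkage the weights induce in the variance.
        var_new = (1.0 + gamma) * sum(
            wi * (v - mu_new) ** 2 for wi, v in zip(w, x)
        ) / total
        converged = abs(mu_new - mu) + abs(var_new - var) < tol
        mu, var = mu_new, var_new
        if converged:
            break
    return mu, var


# Synthetic example (made-up data): 180 draws from N(0, 1) plus 20 outliers near 10.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(180)] + [
    rng.gauss(10.0, 1.0) for _ in range(20)
]

for g in (0.0, 0.2, 0.5):
    mu_hat, var_hat = gamma_weighted_gaussian(data, g)
    print(f"gamma={g:.1f}  mean={mu_hat:.3f}  variance={var_hat:.3f}")
```

With γ = 0 all weights are equal, so the update reduces to the ordinary sample mean and variance. Larger γ downweights points far from the current fit, which reduces the bias caused by outliers at the cost of some efficiency; this is the trade-off that a γ-selection procedure has to balance.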


Corresponding authors: Meng Wang and Michael P. Snyder, Department of Genetics, Stanford University, Stanford, 94305, USA, E-mail: (M. Wang), (M. P. Snyder)

Acknowledgements

The gene expression TPM data used for the analyses described in this manuscript were obtained from the GTEx Portal, version 7. We acknowledge the discussions with Hua Tang at Stanford and thank Catalina Vallejos, Associate Editor, and two reviewers for their insightful comments, which greatly improved the paper.

  1. Author contribution: MW developed the method. LJ and MPS helped analyse the data. MPS supervised the project. All authors wrote the paper, contributed to the discussion, and revised the paper.

  2. Research funding: The work was supported by the GTEx grant (5U01HL13104203) and the CEGS grant (2RM1HG00773506). The Genotype-Tissue Expression (GTEx) Project was supported by the Common Fund of the Office of the Director of the National Institutes of Health, and by NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS.

  3. Conflict of interest statement: M.P.S. is a cofounder and is on the scientific advisory board of Personalis, Filtircine, SensOmics, Qbio, January, Mirvie, Oralome, and Proteus. He is also on the scientific advisory board (SAB) of Genapsys and Jupiter. The other authors declare no competing interests.

References

Arias-Castro, E. and Wang, M. (2017). Distribution-free tests for sparse heterogeneous mixtures. Test 26: 71–94. https://doi.org/10.1007/s11749-016-0499-x.

Basu, A., Harris, I.R., Hjort, N.L., and Jones, M. (1998). Robust and efficient estimation by minimising a density power divergence. Biometrika 85: 549–559. https://doi.org/10.1093/biomet/85.3.549.

Bates, D., Chambers, J., Dalgaard, P., Gentleman, R., Hornik, K., Ihaka, R., Kalibera, T., Lawrence, M., Leisch, F., Ligges, U., et al. (2013). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B 57: 289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x.

Chen, T.-L., Hsieh, D.-N., Hung, H., Tu, I.-P., Wu, P.-S., Wu, Y.-M., Chang, W.-H., and Huang, S.-Y. (2014). gamma-sup: a clustering algorithm for cryo-electron microscopy images of asymmetric particles. Ann. Appl. Stat. 8: 259–285. https://doi.org/10.1214/13-aoas680.

Consortium, G.O. (2014). Gene ontology consortium: going forward. Nucleic Acids Res. 43: D1049–D1056. https://doi.org/10.1093/nar/gku1179.

Consortium, G. (2015). The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348: 648–660. https://doi.org/10.1126/science.1262110.

Donoho, D. and Jin, J. (2004). Higher criticism for detecting sparse heterogeneous mixtures. Ann. Stat. 32: 962–994. https://doi.org/10.1214/009053604000000265.

Efron, B. (2005). Local false discovery rates. Stanford University.

Fujisawa, H. (2013). Normalized estimating equation for robust parameter estimation. Electron. J. Stat. 7: 1587–1606. https://doi.org/10.1214/13-ejs817.

Fujisawa, H. and Eguchi, S. (2008). Robust parameter estimation with a small bias against heavy contamination. J. Multivariate Anal. 99: 2053–2081. https://doi.org/10.1016/j.jmva.2008.02.004.

Grünwald, P. (2011). Safe learning: bridging the gap between Bayes, MDL and statistical learning theory via empirical convexity. In: Proceedings of the 24th annual conference on learning theory. JMLR workshop and conference proceedings, pp. 397–420.

Grünwald, P. and Van Ommen, T. (2017). Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Anal. 12: 1069–1103. https://doi.org/10.1214/17-ba1085.

Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., and Stahel, W.A. (2011). Robust statistics: the approach based on influence functions, vol. 114. John Wiley & Sons.

Huber, P.J. (1964). Robust estimation of a location parameter. Ann. Math. Stat. 35: 73–101. https://doi.org/10.1214/aoms/1177703732.

Huber, P.J. (2011). Robust statistics. Springer. https://doi.org/10.1007/978-3-642-04898-2_594.

Ingster, Y.I. (1996). On some problems of hypothesis testing leading to infinitely divisible distributions. Math. Methods Stat. 6: 47–69.

Jiang, L., Wang, M., Lin, S., Jian, R., Li, X., Chan, J., Dong, G., Fang, H., Robinson, A.E., Aguet, F., et al. (2020). A quantitative proteome map of the human body. Cell 183: 269–283. https://doi.org/10.1016/j.cell.2020.08.036.

Jones, M., Hjort, N.L., Harris, I.R., and Basu, A. (2001). A comparison of related density-based minimum divergence estimators. Biometrika 88: 865–873. https://doi.org/10.1093/biomet/88.3.865.

Kanamori, T. and Fujisawa, H. (2015). Robust estimation under heavy contamination using unnormalized models. Biometrika: asv014. https://doi.org/10.1093/biomet/asv014.

Katayama, S., Fujisawa, H., and Drton, M. (2018). Robust and sparse Gaussian graphical modelling under cell-wise contamination. Stat 7: e181. https://doi.org/10.1002/sta4.181.

Mair, P. and Wilcox, R. (2020). Robust statistical methods in R using the WRS2 package. Behav. Res. Methods 52: 464–488. https://doi.org/10.3758/s13428-019-01246-w.

Maronna, R.A., Martin, R.D., Yohai, V.J., and Salibián-Barrera, M. (2018). Robust statistics: theory and methods (with R). Wiley. https://doi.org/10.1002/9781119214656.

Miyamura, M. and Kano, Y. (2006). Robust Gaussian graphical modeling. J. Multivariate Anal. 97: 1525–1550. https://doi.org/10.1016/j.jmva.2006.02.006.

Petralia, F., Rao, V., and Dunson, D.B. (2012). Repulsive mixtures. arXiv preprint arXiv:1204.5243.

Rousseeuw, P. and Yohai, V. (1984). Robust regression by means of s-estimators. In: Robust and nonlinear time series analysis. Springer, pp. 256–272. https://doi.org/10.1007/978-1-4615-7821-5_15.

Rousseeuw, P.J. (1984). Least median of squares regression. J. Am. Stat. Assoc. 79: 871–880. https://doi.org/10.1080/01621459.1984.10477105.

Rousseeuw, P.J. (1985). Multivariate estimation with high breakdown point. Math. Stat. Appl. 8: 37. https://doi.org/10.1007/978-94-009-5438-0_20.

Rousseeuw, P.J. and Leroy, A.M. (1987). Robust regression and outlier detection, vol. 1. Wiley Online Library. https://doi.org/10.1002/0471725382.

Singh, S., Hein, M.Y., and Stewart, A.F. (2016). msVolcano: a flexible web application for visualizing quantitative proteomics data. Proteomics 16: 2491–2494. https://doi.org/10.1002/pmic.201600167.

Van der Vaart, A.W. (2000). Asymptotic statistics, vol. 3. Cambridge University Press.

Venables, W.N. and Ripley, B.D. (2013). Modern applied statistics with S-PLUS. Springer Science & Business Media.

Wang, M., Jiang, L., and Snyder, M.P. (2021). AdaTiSS: a novel data-adaptive robust method for quantifying tissue specificity scores. Bioinformatics 2021: btab460. https://doi.org/10.1101/869404.

Windham, M.P. (1995). Robustifying model fitting. J. Roy. Stat. Soc. B: 599–609. https://doi.org/10.1111/j.2517-6161.1995.tb02050.x.

Xie, F. and Xu, Y. (2020). Bayesian repulsive Gaussian mixture model. J. Am. Stat. Assoc. 115: 187–203. https://doi.org/10.1080/01621459.2018.1537918.

Yu, G., Wang, L.-G., Han, Y., and He, Q.-Y. (2012). clusterProfiler: an R package for comparing biological themes among gene clusters. OMICS A J. Integr. Biol. 16: 284–287. https://doi.org/10.1089/omi.2011.0118.


Supplementary Material

The online version of this article offers supplementary material (https://doi.org/10.1515/sagmb-2020-0042).


Received: 2020-07-06
Revised: 2021-05-19
Accepted: 2021-06-08
Published Online: 2021-07-13

© 2021 Walter de Gruyter GmbH, Berlin/Boston

Downloaded on 25.4.2024 from https://www.degruyter.com/document/doi/10.1515/sagmb-2020-0042/html