Abstract
Rank regression is a robust modeling tool, but it is challenging to implement for distributed massive data owing to memory constraints. In practice, massive data may be distributed heterogeneously across machines; how to incorporate this heterogeneity is also an interesting issue. This paper proposes a distributed rank regression (\(\mathrm {DR}^{2}\)) that can be implemented on the master machine by solving a weighted least-squares problem and that adapts to heterogeneous data. Theoretically, we prove that the resulting estimator is statistically as efficient as the global rank regression estimator. Furthermore, based on the adaptive LASSO and a newly defined distributed BIC-type tuning parameter selector, we propose a distributed regularized rank regression (\(\mathrm {DR}^{3}\)) that achieves consistent variable selection and can also be easily implemented with the LARS algorithm on the master machine. Simulation results and a real data analysis validate our method.
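The paper's actual DR² construction (the weighted least-squares surrogate solved on the master machine, and its heterogeneity weights) is given in the full text. As a rough, simplified illustration of the underlying divide-and-conquer idea only — not the authors' algorithm — the sketch below fits a rank (Wilcoxon-score) regression on each local chunk by minimizing Jaeckel's dispersion and then averages the local estimates on the master; all function names are our own.

```python
import numpy as np
from scipy.optimize import minimize

def jaeckel_dispersion(beta, X, y):
    """Jaeckel's rank dispersion with Wilcoxon scores:
    D(beta) = sum_i a(R_i) e_i, where e = y - X beta, R_i is the
    rank of e_i, and a(k) = sqrt(12) * (k/(n+1) - 1/2)."""
    e = y - X @ beta
    n = e.size
    ranks = np.argsort(np.argsort(e)) + 1          # ranks 1..n
    scores = np.sqrt(12.0) * (ranks / (n + 1) - 0.5)
    return np.sum(scores * e)

def local_rank_fit(X, y):
    """Rank regression on one chunk: slopes minimize the (convex,
    piecewise-linear) dispersion; the intercept, to which the
    dispersion is invariant, is the median of the residuals."""
    beta0 = np.linalg.lstsq(X, y, rcond=None)[0]   # least-squares warm start
    res = minimize(jaeckel_dispersion, beta0, args=(X, y),
                   method="Nelder-Mead")
    beta = res.x
    intercept = np.median(y - X @ beta)
    return intercept, beta

def one_shot_average(chunks):
    """Naive divide-and-conquer: average the local estimates."""
    fits = [local_rank_fit(X, y) for X, y in chunks]
    a = np.mean([f[0] for f in fits])
    b = np.mean([f[1] for f in fits], axis=0)
    return a, b

# Toy example: 5 machines, heavy-tailed t(3) noise, true model
# y = 1 + 2*x1 - 1*x2 + eps.
rng = np.random.default_rng(0)
chunks = []
for _ in range(5):
    X = rng.normal(size=(200, 2))
    y = 1.0 + X @ np.array([2.0, -1.0]) + rng.standard_t(df=3, size=200)
    chunks.append((X, y))
a, b = one_shot_average(chunks)
```

The heavy-tailed noise is where the rank criterion pays off over least squares; the paper's DR² refines this naive averaging into a weighted least-squares step on the master that recovers the efficiency of the global rank estimator.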
K. Wang: The authors are listed in alphabetical order. The authors would like to thank Dr. Shaomin Li for his valuable suggestions, and the editor, an associate editor, and two anonymous reviewers for constructive comments that led to a major improvement of this article. The research was supported by NNSF projects of China (11901356, 11901149) and the wealth management project (2019ZBKY047) of Shandong Technology and Business University.
Supplementary Information
Electronic supplementary material accompanies the online version of this article.
Luan, J., Wang, H., Wang, K. et al. Robust distributed estimation and variable selection for massive datasets via rank regression. Ann Inst Stat Math 74, 435–450 (2022). https://doi.org/10.1007/s10463-021-00803-5