Robust distributed estimation and variable selection for massive datasets via rank regression

Annals of the Institute of Statistical Mathematics

Abstract

Rank regression is a robust modeling tool, but it is challenging to implement for massive distributed data owing to memory constraints. In practice, massive data may also be distributed heterogeneously from machine to machine, and how to account for this heterogeneity is an interesting issue in its own right. This paper proposes a distributed rank regression (\(\mathrm {DR}^{2}\)), which can be implemented on the master machine by solving a weighted least-squares problem and is adaptive when the data are heterogeneous. Theoretically, we prove that the resulting estimator is statistically as efficient as the global rank regression estimator. Furthermore, based on the adaptive LASSO and a newly defined distributed BIC-type tuning parameter selector, we propose a distributed regularized rank regression (\(\mathrm {DR}^{3}\)), which achieves consistent variable selection and can also be easily implemented using the LARS algorithm on the master machine. Simulation results and a real data analysis are included to validate our method.
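The general idea of fitting rank regression locally and combining the results on a master machine through a weighted least-squares step can be illustrated with a minimal sketch. This is not the paper's \(\mathrm {DR}^{2}\) estimator, whose exact form is not given in the abstract. The sketch assumes a Wilcoxon-type pairwise-difference loss, approximates its minimizer by iteratively reweighted least squares, and uses each machine's centered Gram matrix as the combination weight. The function names (`local_rank_fit`, `combine_on_master`, `simulate`) and all simulation settings are hypothetical.

```python
import numpy as np

def local_rank_fit(X, y, n_iter=50, eps=1e-4):
    """Approximate Wilcoxon-type rank regression on one machine.

    Minimizes sum_{i<j} |(y_i - y_j) - (x_i - x_j)' beta| using iteratively
    reweighted least squares (an approximation chosen here for brevity).
    The intercept cancels in the pairwise differences, so none is fitted.
    """
    i, j = np.triu_indices(len(y), k=1)        # all pairs with i < j
    dX, dy = X[i] - X[j], y[i] - y[j]
    beta = np.linalg.lstsq(dX, dy, rcond=None)[0]   # least-squares start
    for _ in range(n_iter):
        # Weights 1/|residual| turn the L1 loss into a weighted L2 loss.
        w = 1.0 / np.maximum(np.abs(dy - dX @ beta), eps)
        A = dX.T @ (dX * w[:, None])
        b = dX.T @ (w * dy)
        beta = np.linalg.solve(A, b)
    return beta

def combine_on_master(betas, grams):
    """Weighted least squares on the master machine.

    Returns argmin_b sum_k (b - beta_k)' G_k (b - beta_k), where G_k is an
    assumed weight matrix (here the centered Gram matrix of machine k).
    """
    G = sum(grams)
    rhs = sum(Gk @ bk for Gk, bk in zip(grams, betas))
    return np.linalg.solve(G, rhs)

def simulate(seed=0, n_machines=5, n_local=200):
    """Hypothetical simulation with heterogeneous covariate scales."""
    rng = np.random.default_rng(seed)
    beta_true = np.array([1.0, -2.0, 0.5])
    betas, grams = [], []
    for k in range(n_machines):
        # Each machine sees covariates on a different scale.
        X = rng.normal(size=(n_local, 3)) * (1.0 + 0.2 * k)
        noise = rng.standard_t(df=2, size=n_local)   # heavy-tailed errors
        y = X @ beta_true + noise
        betas.append(local_rank_fit(X, y))
        Xc = X - X.mean(axis=0)
        grams.append(Xc.T @ Xc)
    return beta_true, combine_on_master(betas, grams)

if __name__ == "__main__":
    beta_true, beta_hat = simulate()
    print("true:    ", beta_true)
    print("combined:", np.round(beta_hat, 3))
```

Only the local estimates and the small \(p \times p\) weight matrices travel to the master machine in this sketch, so no machine needs to hold the full dataset. Machines with more informative covariates receive more weight in the combination, which is one simple way to reflect heterogeneity across machines.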

Author information

Correspondence to Kangning Wang.

Additional information

K. Wang: The authors are listed in alphabetical order. The authors would like to thank Dr. Shaomin Li for his valuable suggestions, and the editor, an associate editor and two anonymous reviewers for their constructive comments, which led to a major improvement of this article. The research was supported by the NNSF project of China (11901356, 11901149) and the wealth management project (2019ZBKY047) of Shandong Technology and Business University.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (PDF 264KB)

About this article

Cite this article

Luan, J., Wang, H., Wang, K. et al. Robust distributed estimation and variable selection for massive datasets via rank regression. Ann Inst Stat Math 74, 435–450 (2022). https://doi.org/10.1007/s10463-021-00803-5

  • DOI: https://doi.org/10.1007/s10463-021-00803-5
