Statistics and Its Interface

Volume 14 (2021)

Number 3

Residual-based tree for clustered binary data

Pages: 295 – 308

DOI: https://dx.doi.org/10.4310/20-SII638

Authors

Rong Xia (Department of Biostatistics, University of Michigan, Ann Arbor, Mich., U.S.A.)

Christopher R. Friese (Department of Systems, Populations and Leadership, University of Michigan, Ann Arbor, Mich., U.S.A.)

Mousumi Banerjee (Department of Biostatistics, University of Michigan, Ann Arbor, Mich., U.S.A.)

Abstract

Tree-based methods are widely used for classification in health sciences research, where data are often clustered. In this paper, we propose a variant of the standard classification and regression tree (CART) paradigm to handle clustered binary outcomes. Using residuals from a null generalized linear mixed model as the response, we build a regression tree to partition the covariate space into rectangles. This circumvents modeling the correlation structure explicitly while still accounting for the cluster-correlated design, thereby allowing us to adopt the standard CART machinery in tree growing, pruning, and cross-validation. Class predictions for each terminal node in the final tree are estimated based on the success probabilities within the specific node. Our method also extends easily to ensembles of trees and random forests. Using extensive simulations, we compare our residual-based trees to the standard classification tree. Finally, the methods are illustrated using data from a study of kidney cancer and a study of surgical mortality after colectomy.
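The residual-based idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and makes several assumptions. A shrunken cluster mean stands in for the fitted probability from a null generalized linear mixed model, and the shrinkage constant `shrink` is hypothetical. Raw residuals (outcome minus fitted probability) are used, although the paper may use a different residual type. A single split on one covariate stands in for a full tree with pruning and cross-validation. The 0.5 cutoff for class labels and the toy data are also assumptions made for illustration.

```python
from collections import defaultdict


def null_model_probs(clusters, ys, shrink=2.0):
    """Shrunken cluster-level success probabilities.

    Assumption: a crude stand-in for fitted values from a null GLMM
    with a random cluster intercept; `shrink` is a hypothetical constant.
    """
    overall = sum(ys) / len(ys)
    total, count = defaultdict(float), defaultdict(int)
    for c, y in zip(clusters, ys):
        total[c] += y
        count[c] += 1
    return [(total[c] + shrink * overall) / (count[c] + shrink) for c in clusters]


def sse(values):
    """Sum of squared deviations from the mean (regression-tree impurity)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(xs, residuals):
    """Cut point on one covariate that minimises left SSE + right SSE."""
    levels = sorted(set(xs))
    best_cut, best_cost = None, float("inf")
    for lo, hi in zip(levels, levels[1:]):
        cut = (lo + hi) / 2
        left = [r for x, r in zip(xs, residuals) if x <= cut]
        right = [r for x, r in zip(xs, residuals) if x > cut]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_cut, best_cost = cut, cost
    return best_cut


def node_summary(ys):
    """Success probability in a node and a class label (assumed 0.5 cutoff)."""
    p = sum(ys) / len(ys)
    return p, int(p >= 0.5)


# Toy data (made up): three clusters with different baseline success rates.
x_values = [1, 2, 3, 4, 6, 7, 8, 9]
outcomes = {
    "a": [0, 0, 0, 0, 1, 1, 1, 1],
    "b": [0, 0, 0, 1, 1, 1, 1, 1],
    "c": [0, 0, 0, 0, 0, 1, 1, 1],
}
clusters, xs, ys = [], [], []
for cid, cluster_ys in outcomes.items():
    for x, y in zip(x_values, cluster_ys):
        clusters.append(cid)
        xs.append(x)
        ys.append(y)

# Step 1: residuals from the (stand-in) null model.
probs = null_model_probs(clusters, ys)
residuals = [y - p for y, p in zip(ys, probs)]

# Step 2: grow a one-split regression tree on the residuals.
cut = best_split(xs, residuals)

# Step 3: summarise each terminal node by its observed success probability.
left_prob, left_class = node_summary([y for x, y in zip(xs, ys) if x <= cut])
right_prob, right_class = node_summary([y for x, y in zip(xs, ys) if x > cut])
```

The split is chosen on the residuals, so the cluster-level baseline is removed before partitioning, while the node summaries go back to the original binary outcomes. In practice the stand-in pieces would be replaced by a fitted mixed model and a full regression-tree routine.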

Keywords

clustered data, classification, tree-based methods, residuals, kidney cancer, colectomy surgical mortality

Received 2 December 2019

Accepted 16 September 2020

Published 9 February 2021