Discussion
International Statistical Review (IF 2), Pub Date: 2014-07-24, DOI: 10.1111/insr.12060
Chi Song, Heping Zhang

We wish to congratulate the author on a nice overview of tree-based methods, which clearly highlights the recursive partitioning technique (Friedman, 1977; Breiman et al., 1984; Zhang and Singer, 2010) behind them. As the author summarized, there are two major types of tree methods: classification trees and regression trees, as precisely reflected in the title of the classical book by Breiman et al. (1984). In our own experience, for regression problems, other nonparametric methods, including adaptive splines (Friedman, 1991) that are based on a similar partitioning technique, appear more desirable than regression trees, with the exception of survival analysis (Zhang, 1997, 2004; Zhang and Singer, 2010).

With the advent of high-throughput genomic technologies, classification trees have become one of the most common and convenient bioinformatic tools. In what follows, we would like to share some of the recent developments in this area. Genome-wide association studies (GWASs) collect data for hundreds of thousands or millions of single nucleotide polymorphisms (SNPs) to study diseases of complex inheritance patterns, which can be recorded qualitatively (e.g., breast cancer) or on a quantitative scale (e.g., blood pressure). GWASs typically employ the case-control design, and the logistic regression model is generally applied to assess the association between each of the SNPs and the disease response, although more advanced techniques, especially nonparametric regression, have been proposed to incorporate multiple SNPs and interactions. A clear advantage of classification trees is that they make no model assumptions, can select important variables (or features), and can detect interactions among the variables. Zhang et al. (2000) was among the early applications of tree-based methods to genetic association analysis. Since then, interest in tree-based genetic analyses has grown substantially. For example, Chen et al. (2007) developed a forest-based method on haplotypes instead of SNPs to detect gene-gene interactions, and importantly, they detected both a known variant and an unreported haplotype that were associated with age-related macular degeneration. Wang et al. (2009) further demonstrated the utility of this forest-based approach. Yao et al. (2009) applied GUIDE to the Framingham Heart Study (FHS) and detected combinations of SNPs that affect the disease risk. Garcia-Magarinos et al. (2009) demonstrated that tree-based methods were effective in detecting interactions with pre-selected variables that were marginally associated with the disease outcome, but were susceptible to the local maximum problem when many noise variables were present. Chen et al. (2011) combined the classification tree with a Bayesian search strategy, which improved the power to detect high-order gene-gene interactions at the cost of high computational demand.

Tree-based methods are also extensively used in gene expression analysis to classify tissue types. Here the setting is very different from the GWAS applications. In GWAS applications, we deal with a very large number of discrete risk factors (e.g., the number of copies of a particular allele). In expression analysis, the number of variables is large but not as large, usually on the order of tens of thousands, and the variables tend to be continuous. For example, Zhang et al. (2001) demonstrated that classification trees can discriminate distinct colon cancers more accurately than other methods. Huang et al. (2003) found that aggregated gene expression patterns can predict breast cancer outcomes with about 90% accuracy using tree models. Zhang et al. (2003) introduced deterministic forests for gene expression data in cancer diagnosis, which have similar power to random forests but are easier to interpret scientifically. Pang et al. (2006) developed a random forest method incorporating pathway information and demonstrated that it has low prediction error in gene expression analysis. Furthermore, Diaz-Uriarte and De Andres (2006) demonstrated that random forests can be useful in variable selection, using a smaller set of genes while maintaining comparable prediction accuracy. On a related note, Wang and Zhang (2009) attempted to address a basic question: how many trees are really needed in a random forest? They provided empirical evidence that a random forest can be reduced in size enough to allow scientific interpretation.

As more and more data are generated from new technologies such as Next-Generation Sequencing, tree-based methods will be very useful for analyzing such large and complex data after necessary extensions. Closely related to genomic data analysis is personalized medicine. Zhang et al. (2010) presented a proof of concept that tree-based methods have some unique advantages over parametric methods in identifying patient characteristics that may affect treatment response. In summary, tree-based methods have thrived in the past several decades; they will become more useful, and the methodological developments more challenging than ever, as the available information grows in both size and complexity.
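To make the recursive partitioning idea concrete, the following is a minimal sketch of a classification tree grown on SNP-like genotype data, assuming Gini impurity as the splitting criterion. It is not the procedure used in any of the studies cited above. The genotype coding (0, 1 or 2 copies of an allele), the toy disease rule, and the function names `gini`, `best_split` and `grow` are all illustrative assumptions, and the data are invented rather than taken from a real study.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, tol=1e-12):
    """Return (feature, threshold, impurity decrease) of the best binary split.

    A row goes to the left child when rows[i][feature] <= threshold.
    Returns (None, None, 0.0) when no split reduces impurity by more than tol.
    """
    n = len(labels)
    parent = gini(labels)
    best = (None, None, 0.0)
    for j in range(len(rows[0])):
        # Every observed value except the largest is a candidate threshold.
        for t in sorted(set(r[j] for r in rows))[:-1]:
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            children = (len(left) * gini(left) + len(right) * gini(right)) / n
            decrease = parent - children
            if decrease > best[2] + tol:
                best = (j, t, decrease)
    return best


def grow(rows, labels, depth=0, max_depth=2):
    """Recursively partition the data and return the tree as nested dicts."""
    leaf = {"predict": Counter(labels).most_common(1)[0][0], "n": len(labels)}
    if depth == max_depth:
        return leaf
    j, t, _ = best_split(rows, labels)
    if j is None:
        return leaf
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": grow([r for r, _ in left], [y for _, y in left], depth + 1, max_depth),
        "right": grow([r for r, _ in right], [y for _, y in right], depth + 1, max_depth),
    }


# Invented toy data: every genotype combination of three SNPs, coded 0/1/2.
# Hypothetical rule: a subject is a case only if SNP 0 AND SNP 1 both carry
# at least one copy of the allele; SNP 2 is pure noise.
rows = [(a, b, c) for a in range(3) for b in range(3) for c in range(3)]
labels = [int(a >= 1 and b >= 1) for a, b, c in rows]

tree = grow(rows, labels)
print(tree)
```

On this toy data the tree splits on SNP 0 at the root and on SNP 1 only within the carrier branch, and it never uses the noise SNP. This nested structure is how a tree represents an interaction without any model for it being specified in advance. Note that both SNPs here also have marginal effects. A purely epistatic rule with no marginal signal would be harder for such a greedy search.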

Updated: 2014-07-24