Artificial Intelligence in Medicine (IF 7.5), Pub Date: 2020-08-20, DOI: 10.1016/j.artmed.2020.101950
Oscar Reyes, Eduardo Pérez, Raúl M. Luque, Justo Castaño, Sebastián Ventura
Deregulated splicing machinery components have been shown to be associated with the development of several types of cancer. Identifying such alterations can therefore help in developing tumor-specific molecular targets for early prognosis and therapy. Identifying the relevant splicing components, however, is not straightforward, mainly because of the heterogeneity of tumors, the variability across samples, and the "fat-short" shape of genomic datasets (many features measured on few samples). This work proposes a supervised machine learning-based methodology that determines the subsets of relevant splicing components that best discriminate between samples. The methodology comprises three main phases. First, features are ranked by applying feature weighting algorithms that compute the importance of each splicing component. Second, an effective heuristic search determines the best subset of features, that is, the subset from which an accurate classifier can be induced. Third, confidence in the induced classifier is assessed by explaining both its individual predictions and its global behavior. Finally, an extensive experimental study on a large collection of transcript-based datasets illustrates the utility and benefit of the proposed methodology for analyzing dysregulation in splicing machinery.
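The three phases can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, since the abstract does not name the algorithms used: the data are synthetic, the feature weight is a standardized difference of class means, the heuristic search simply evaluates growing prefixes of the ranking, the classifier is a nearest-centroid rule, and the explanation is a per-feature decomposition of that rule's distance comparison. All function names are hypothetical.

```python
import random
import statistics


def make_fat_short(n_samples=30, n_features=200, n_informative=3, seed=0):
    """Synthetic 'fat-short' data (assumption): many features, few samples."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_samples):
        label = i % 2
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in range(n_informative):  # only the first few features carry signal
            row[j] += 3.0 * label
        X.append(row)
        y.append(label)
    return X, y


def rank_features(X, y):
    """Phase 1: weight every feature, then rank from most to least important."""
    weights = []
    for j in range(len(X[0])):
        a = [row[j] for row, lab in zip(X, y) if lab == 0]
        b = [row[j] for row, lab in zip(X, y) if lab == 1]
        spread = statistics.pstdev(a) + statistics.pstdev(b) + 1e-12
        weights.append(abs(statistics.mean(a) - statistics.mean(b)) / spread)
    return sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)


def centroids(X, y, feats):
    """Fit a nearest-centroid classifier restricted to the features in feats."""
    model = {}
    for lab in set(y):
        rows = [row for row, l in zip(X, y) if l == lab]
        model[lab] = [statistics.mean(r[j] for r in rows) for j in feats]
    return model


def predict(model, row, feats):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(lab):
        return sum((row[j] - c) ** 2 for j, c in zip(feats, model[lab]))
    return min(model, key=dist)


def loo_accuracy(X, y, feats):
    """Leave-one-out accuracy of the classifier on a candidate feature subset."""
    hits = 0
    for i in range(len(X)):
        model = centroids(X[:i] + X[i + 1:], y[:i] + y[i + 1:], feats)
        hits += predict(model, X[i], feats) == y[i]
    return hits / len(X)


def search_subset(X, y, ranking, max_k=10):
    """Phase 2: try the top-k ranked features for growing k, keep the best."""
    best_feats, best_acc = [], -1.0
    for k in range(1, max_k + 1):
        acc = loo_accuracy(X, y, ranking[:k])
        if acc > best_acc:  # strict '>' keeps the smallest subset on ties
            best_feats, best_acc = ranking[:k], acc
    return best_feats, best_acc


def explain(model, row, feats):
    """Phase 3: per-feature contribution; positive values push toward class 1."""
    return {j: (row[j] - c0) ** 2 - (row[j] - c1) ** 2
            for j, c0, c1 in zip(feats, model[0], model[1])}


X, y = make_fat_short()
ranking = rank_features(X, y)
selected, accuracy = search_subset(X, y, ranking)
model = centroids(X, y, selected)
contributions = explain(model, X[1], selected)
print("selected features:", selected)
print("leave-one-out accuracy:", accuracy)
print("contributions for sample 1:", contributions)
```

For brevity the ranking is computed on all samples before the leave-one-out loop, which leaks information into the accuracy estimate. A real study would repeat the ranking inside each fold and would likely use stronger weighting, search, and explanation methods than these stand-ins.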
A supervised machine learning-based methodology for analyzing dysregulation in splicing machinery: An application in cancer diagnosis