Domain expertise–agnostic feature selection for the analysis of breast cancer data*
Artificial Intelligence in Medicine (IF 7.5), Pub Date: 2020-07-16, DOI: 10.1016/j.artmed.2020.101928
Susanna Pozzoli, Amira Soliman, Leila Bahri, Rui Mamede Branca, Sarunas Girdzijauskas, Marco Brambilla

Progress in proteomics has enabled biologists to accurately measure the amount of protein in a tumor. This work is based on a breast cancer data set that resulted from the proteomics analysis of a cohort of tumors carried out at Karolinska Institutet. While evidence suggests that an anomaly in the protein content is related to the cancerous nature of tumors, the proteins that could be markers of cancer types and subtypes, and the underlying interactions, are not completely known. This work sheds light on the potential of unsupervised learning for analyzing such data, namely for detecting distinctive proteins that identify the cancer subtypes in the absence of domain expertise. In the analyzed data set, the number of samples (tumors) is significantly lower than the number of features (proteins); consequently, the input data can be regarded as high-dimensional. The use of high-dimensional data is already widespread, and a great deal of effort has been put into analyzing it by means of feature selection, but feature selection is still largely based on prior specialist knowledge, which in this case is incomplete. There is a growing need for unsupervised feature selection, which raises two issues: how to generate promising subsets of features among all the possible combinations, and how to evaluate the quality of these subsets in the absence of specialist knowledge. We propose a new wrapper method that generates subsets of features via spectral clustering and evaluates them via modularity. We conduct experiments to test the effectiveness of the new method in the analysis of the breast cancer data in a domain expertise–agnostic context. Furthermore, we show that we can successfully augment our method by incorporating an external source of data on known protein complexes. Our approach reveals a large number of subsets of features that are better at clustering the samples than the state-of-the-art classification in terms of modularity, and it shows potential to be useful for future proteomics research.
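
The evaluation idea mentioned in the abstract, scoring a feature subset by the modularity of the sample clustering it induces, can be illustrated with a minimal sketch. This is not the authors' implementation: the Gaussian similarity, the `gamma` parameter, the function names, and the toy data are all assumptions made for illustration, and the spectral clustering step that generates candidate subsets is omitted. The sketch only shows how the standard weighted modularity of a fixed sample partition changes with the chosen features.

```python
import math

def similarity_graph(samples, feature_idx, gamma=5.0):
    """Weighted sample-by-sample graph using only the chosen features.

    Assumption: a Gaussian kernel on squared Euclidean distance; the
    abstract does not say which similarity measure the paper uses.
    """
    n = len(samples)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d2 = sum((samples[i][f] - samples[j][f]) ** 2 for f in feature_idx)
            adj[i][j] = adj[j][i] = math.exp(-gamma * d2)
    return adj

def modularity(adj, labels):
    """Newman modularity Q of a partition of a weighted undirected graph.

    Q = (1 / 2m) * sum_ij [A_ij - k_i * k_j / (2m)] * delta(c_i, c_j)
    """
    degree = [sum(row) for row in adj]
    two_m = sum(degree)
    if two_m == 0:
        return 0.0
    q = 0.0
    for i, row in enumerate(adj):
        for j, a_ij in enumerate(row):
            if labels[i] == labels[j]:
                q += a_ij - degree[i] * degree[j] / two_m
    return q / two_m

# Hypothetical toy data: 6 samples (rows) and 2 features (columns).
# Feature 0 separates the two groups; feature 1 does not.
samples = [
    [0.00, 0.3], [0.10, 0.9], [0.05, 0.1],
    [1.00, 0.8], [0.90, 0.2], [1.05, 0.5],
]
labels = [0, 0, 0, 1, 1, 1]

q_informative = modularity(similarity_graph(samples, [0]), labels)
q_noise = modularity(similarity_graph(samples, [1]), labels)
print(f"feature 0: Q = {q_informative:.3f}")
print(f"feature 1: Q = {q_noise:.3f}")
```

In this toy setting the subset containing the separating feature scores a higher modularity than the subset containing the uninformative one, which is the kind of comparison a wrapper method could use to rank candidate subsets without labels from domain experts.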




Updated: 2020-07-16