Category encoding method to select feature genes for the classification of bulk and single-cell RNA-seq data
Statistics in Medicine (IF 2), Pub Date: 2021-05-24, DOI: 10.1002/sim.9015
Yan Zhou 1, Li Zhang 1, Jinfeng Xu 2, Jun Zhang 1, Xiaodong Yan 3

Bulk and single-cell RNA-seq (scRNA-seq) data are increasingly used as alternatives to traditional technologies in biological and medical research, for example to detect differentially expressed (DE) genes. Several statistical methods have been developed for classifying bulk and single-cell RNA-seq data, and the feature genes they rely on are vitally important for that classification. Most genes are not DE and are therefore irrelevant for class distinction. To improve classification performance and save computation time, these irrelevant genes need to be removed, which also aids the detection of the important feature genes. Widely used schemes in the literature, such as the BSS/WSS (BW) method, assume normally distributed data and may not be suitable for bulk and single-cell RNA-seq data. In this article, a category encoding (CAEN) method is proposed to select feature genes for bulk and single-cell RNA-seq data classification. The method encodes the categories using the ranks of the sequenced samples for each gene in each class. A correlation coefficient between gene and class is then computed from the sample ranks and the newly encoded category ranks. The genes with the highest correlation coefficients are taken as feature genes, which are the most effective for classifying bulk and single-cell RNA-seq datasets. A sure screening result is also established for the rank consistency properties of the proposed CAEN method. Simulation studies show that a classifier using the proposed CAEN method performs better than, or at least as well as, existing methods in most settings. Analyses of existing real datasets likewise show superior performance of the proposed method over current competitors. The method has been implemented in an R package named “CAEN” to facilitate wide use.
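
The rank-based selection idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not reproduce the "CAEN" R package: the abstract does not spell out the encoding, so the sketch assumes that, for each gene, the classes are re-coded by the order of their mean sample rank, and that the score is the absolute Pearson correlation between the sample ranks and those class codes. The function names (`average_ranks`, `caen_like_scores`, `select_top_genes`) and the simulated Poisson counts are hypothetical and chosen only for illustration.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1 upward; tied values share their average rank."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    # Replace the ranks of tied values by their mean.
    _, inverse = np.unique(values, return_inverse=True)
    sums = np.bincount(inverse, weights=ranks)
    sizes = np.bincount(inverse)
    return (sums / sizes)[inverse]


def caen_like_scores(counts, labels):
    """Score each gene (column of counts) against the class labels.

    ASSUMPTION: classes are encoded per gene by the rank of their mean
    sample rank; the paper's exact encoding may differ.
    """
    counts = np.asarray(counts, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    class_index = np.searchsorted(classes, labels)
    scores = np.zeros(counts.shape[1])
    for g in range(counts.shape[1]):
        sample_rank = average_ranks(counts[:, g])
        class_mean = np.array(
            [sample_rank[class_index == c].mean() for c in range(len(classes))]
        )
        encoded = average_ranks(class_mean)[class_index]
        if sample_rank.std() == 0 or encoded.std() == 0:
            continue  # constant gene or indistinguishable classes: score stays 0
        scores[g] = abs(np.corrcoef(sample_rank, encoded)[0, 1])
    return scores


def select_top_genes(counts, labels, k):
    """Return the indices of the k highest-scoring genes."""
    scores = caen_like_scores(counts, labels)
    return np.argsort(scores)[::-1][:k]


# Simulated example: 60 samples, 3 classes, 50 genes, of which genes 0-2 are DE.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1, 2], 20)
counts = rng.poisson(5, size=(60, 50))
counts[:, 0] = rng.poisson(np.array([5, 50, 200])[labels])
counts[:, 1] = rng.poisson(np.array([200, 5, 50])[labels])
counts[:, 2] = rng.poisson(np.array([50, 200, 5])[labels])

selected = select_top_genes(counts, labels, k=3)
print(sorted(selected.tolist()))
```

Because the class codes are recomputed for every gene, a gene whose expression is highest in the first class and lowest in the second scores as well as one that increases monotonically with the original label order. Working on ranks rather than raw counts also avoids the normality assumption that the abstract attributes to the BW method.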

Updated: 2021-07-16