SUSCC: Secondary Construction of Feature Space based on UMAP for Rapid and Accurate Clustering Large-scale Single Cell RNA-seq Data

  • Original research article
Interdisciplinary Sciences: Computational Life Sciences

Abstract

Clustering is a common way to identify cell types in single cell analysis, but the growing size of scRNA-seq datasets makes single cell clustering increasingly challenging. There is therefore an urgent need for a faster and more accurate clustering method for large-scale scRNA-seq data. In this paper, we propose a new method for single cell clustering. First, a count matrix is constructed through normalization and gene filtration. Second, the raw gene expression data are projected onto a feature space built by a secondary construction of feature space based on UMAP (Uniform Manifold Approximation and Projection). Third, the low-dimensional matrix in the feature space is randomly divided, according to a set proportion, into two sub-matrices: one for clustering and one for classification. Finally, one subset is clustered with the k-means algorithm, and the other subset is then classified with the k-nearest neighbor algorithm based on the clustering results. Experimental results show that our method can cluster scRNA-seq datasets effectively.
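The third and fourth steps (random split, k-means on one subset, k-nearest neighbor classification of the other) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the UMAP-based embedding has already been computed and is passed in as a list of low-dimensional points, and the function names, the `cluster_fraction` parameter, and the default values are hypothetical choices made for illustration only.

```python
import math
import random
from collections import Counter


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break
        labels = new_labels
    return centroids, labels


def knn_predict(train_points, train_labels, query, n_neighbors=5):
    """Majority vote among the n_neighbors closest labelled points."""
    nearest = sorted(
        range(len(train_points)),
        key=lambda i: math.dist(query, train_points[i]),
    )[:n_neighbors]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


def split_cluster_classify(embedding, k, cluster_fraction=0.5,
                           n_neighbors=5, seed=0):
    """Cluster a random subset with k-means, label the rest with kNN.

    `embedding` is assumed to be the low-dimensional matrix (one point
    per cell); `cluster_fraction` is a hypothetical name for the split
    proportion mentioned in the abstract.
    """
    rng = random.Random(seed)
    order = list(range(len(embedding)))
    rng.shuffle(order)
    n_cluster = int(round(cluster_fraction * len(order)))
    n_cluster = min(len(order), max(k, n_cluster))
    cluster_idx, classify_idx = order[:n_cluster], order[n_cluster:]

    # Step 1: k-means on the clustering subset.
    cluster_points = [embedding[i] for i in cluster_idx]
    _, cluster_labels = kmeans(cluster_points, k, seed=seed)

    # Step 2: kNN on the remaining cells, using cluster ids as class labels.
    labels = [None] * len(embedding)
    for i, lab in zip(cluster_idx, cluster_labels):
        labels[i] = lab
    for i in classify_idx:
        labels[i] = knn_predict(cluster_points, cluster_labels,
                                embedding[i], n_neighbors)
    return labels
```

Calling `split_cluster_classify(embedding, k=2)` returns one cluster label per cell in the original order. Only the clustering subset goes through the iterative k-means loop, while the remaining cells each need a single nearest-neighbor lookup, which is the intuition behind splitting the data for large datasets.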

Acknowledgements

This work was supported by grants from the Xinjiang Autonomous Region University Research Program (No. XJEDU2019Y002) and the National Natural Science Foundation of China (Nos. U19A2064 and 61873001).

Author information

Corresponding authors

Correspondence to Jian-ping Zhao or Chun-Hou Zheng.

About this article

Cite this article

Wang, HY., Zhao, Jp. & Zheng, CH. SUSCC: Secondary Construction of Feature Space based on UMAP for Rapid and Accurate Clustering Large-scale Single Cell RNA-seq Data. Interdiscip Sci Comput Life Sci 13, 83–90 (2021). https://doi.org/10.1007/s12539-020-00411-6

  • DOI: https://doi.org/10.1007/s12539-020-00411-6
