  • Perspective

Separating measurement and expression models clarifies confusion in single-cell RNA sequencing analysis

Abstract

The high proportion of zeros in typical single-cell RNA sequencing datasets has led to widespread but inconsistent use of terminology such as dropout and missing data. Here, we argue that much of this terminology is unhelpful and confusing, and outline simple ideas to help to reduce confusion. These include: (1) observed single-cell RNA sequencing counts reflect both true gene expression levels and measurement error, and carefully distinguishing between these contributions helps to clarify thinking; and (2) method development should start with a Poisson measurement model, rather than more complex models, because it is simple and generally consistent with existing data. We outline how several existing methods can be viewed within this framework and highlight how these methods differ in their assumptions about expression variation. We also illustrate how our perspective helps to address questions of biological interest, such as whether messenger RNA expression levels are multimodal among cells.
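The two ideas in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' analysis code: it assumes a gamma distribution for the true expression levels (an illustrative choice of expression model), uses a Poisson measurement model for the observed counts, and checks that the resulting fraction of zeros matches the negative binomial marginal that this combination implies. The function names, the size factor, and the parameter values are all hypothetical.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) variate with Knuth's method (adequate for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_counts(n_cells, size_factor, mean, dispersion, rng):
    """Simulate observed counts for one gene across cells.

    Expression model (assumed here): lambda ~ Gamma(shape=1/dispersion,
    scale=mean*dispersion), so E[lambda] = mean.
    Measurement model: x | lambda ~ Poisson(size_factor * lambda).
    """
    shape, scale = 1.0 / dispersion, mean * dispersion
    counts = []
    for _ in range(n_cells):
        true_expression = rng.gammavariate(shape, scale)
        counts.append(sample_poisson(size_factor * true_expression, rng))
    return counts


def nb_zero_probability(size_factor, mean, dispersion):
    """P(x = 0) under the negative binomial marginal implied by Poisson-gamma."""
    return (1.0 + size_factor * mean * dispersion) ** (-1.0 / dispersion)


if __name__ == "__main__":
    rng = random.Random(0)
    counts = simulate_counts(20000, 1.0, 0.5, 2.0, rng)
    observed = sum(c == 0 for c in counts) / len(counts)
    print("observed fraction of zeros:", observed)
    print("negative binomial prediction:", nb_zero_probability(1.0, 0.5, 2.0))
    print("Poisson with no expression variation:", math.exp(-0.5))
```

In this sketch the excess of zeros relative to a plain Poisson distribution with the same mean comes entirely from variation in true expression, not from a separate zero-generating step in the measurement process.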

Fig. 1: Comparing single-gene expression models on scRNA-seq data.

Data availability

Sorted immune cell and PBMC data were downloaded from https://10xgenomics.com/data. iPSC data were downloaded from the Gene Expression Omnibus (accession number GSE118723). Brain data were downloaded from the Genotype-Tissue Expression portal (https://www.gtexportal.org/home/datasets). Kidney and retina data were downloaded from the Human Cell Atlas Data Portal (https://data.humancellatlas.org/). Control data were downloaded from https://figshare.com/projects/Zero_inflation_in_negative_control_data/61292. All of the results generated in this study are available at https://zenodo.org/record/4543923 and all analysis notebooks have been published at https://aksarkar.github.io/singlecell-modes/.

Code availability

All of the code used to perform the analysis is available at https://zenodo.org/record/4543921 and https://zenodo.org/record/4543923.

Acknowledgements

We thank members of the M.S. and Y. Gilad laboratories for helpful comments. This work was supported by NIH grant HG002585 and a Gut Cell Atlas grant from The Leona M. and Harry B. Helmsley Charitable Trust (both to M.S.).

Author information

Contributions

A.S. and M.S. developed the theory. A.S. performed the analysis. A.S. and M.S. wrote the paper.

Corresponding authors

Correspondence to Abhishek Sarkar or Matthew Stephens.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Nature Genetics thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Notes 1–5, Figs. 1–4 and Methods

Reporting Summary

About this article

Cite this article

Sarkar, A., Stephens, M. Separating measurement and expression models clarifies confusion in single-cell RNA sequencing analysis. Nat Genet 53, 770–777 (2021). https://doi.org/10.1038/s41588-021-00873-4

  • DOI: https://doi.org/10.1038/s41588-021-00873-4
