PhISCS: a combinatorial approach for subperfect tumor phylogeny reconstruction via integrative use of single-cell and bulk sequencing data

  1. S. Cenk Sahinalp (affiliation 3; footnote 10)
  1. School of Computing Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada;
  2. Department of Computer Science, Indiana University, Bloomington, Indiana 47408, USA;
  3. Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, Maryland 20892, USA;
  4. Department of Computer Systems and Communication, University of Milano-Bicocca, 20136 Milan, Italy;
  5. Institute for Computational Biomedicine, Weill Cornell Medicine, New York, New York 10065, USA;
  6. Tri-I Computational Biology and Medicine Graduate Program, Cornell University, New York, New York 10065, USA;
  7. Department of Urologic Sciences, University of British Columbia, Vancouver, BC V5Z 1M9, Canada;
  8. Vancouver Prostate Centre, Vancouver, BC V6H 3Z6, Canada;
  9. Department of Physiology and Biophysics, Englander Institute for Precision Medicine, The Meyer Cancer Center, Weill Cornell Medicine, New York, New York 10065, USA
  • Corresponding author: cenk.sahinalp{at}nih.gov
  • Abstract

    Available computational methods for tumor phylogeny inference from single-cell sequencing (SCS) data typically aim to identify the most likely perfect phylogeny tree satisfying the infinite sites assumption (ISA). However, the limitations of SCS technologies, including frequent allele dropout and variable sequence coverage, may prohibit a perfect phylogeny. In addition, ISA violations are commonly observed in tumor phylogenies due to loss of heterozygosity, deletions, and convergent evolution. To address these limitations, we introduce the optimal subperfect phylogeny problem, which integrates SCS data with matching bulk sequencing data by minimizing a linear combination of three quantities: potential false negatives among mutation calls (due to allele dropout or variance in sequence coverage), potential false positives among mutation calls (due to read errors), and the number of mutations that violate the ISA (either genuinely or because of incorrect copy number estimation). We then describe a combinatorial formulation of this problem that ensures that several lineage constraints imposed by variant allele frequencies (VAFs, derived from bulk sequencing data) are satisfied. We express our formulation both as an integer linear program (ILP) and, as a first in tumor phylogeny reconstruction, as a Boolean constraint satisfaction problem (CSP), and we solve them by leveraging state-of-the-art ILP/CSP solvers. The resulting method, which we name PhISCS, is the first to integrate SCS and bulk sequencing data while accounting for ISA-violating mutations. In contrast to alternative methods, which are typically based on probabilistic approaches, PhISCS provides a guarantee of optimality in reported solutions. Using simulated and real data sets, we demonstrate that PhISCS is more general and accurate than all available approaches.
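
To make the optimization idea in the abstract concrete, the following is a minimal toy sketch, not the PhISCS implementation. It assumes a binary cell-by-mutation matrix with no missing entries, uses the standard three-gametes rule to test whether a matrix admits a perfect phylogeny, and finds by brute force the conflict-free matrix with the lowest weighted number of flips. The function names and the weights `w_fn` and `w_fp` are hypothetical choices for illustration; elimination of ISA-violating mutations and the VAF-based constraints from bulk data are omitted.

```python
from itertools import combinations, product


def has_conflict(matrix):
    """Return True if two columns jointly show the patterns (1,1), (1,0), (0,1).

    Rows are cells, columns are mutations. Under the ISA with a
    mutation-free root, such a column pair rules out a perfect phylogeny.
    """
    n_cols = len(matrix[0]) if matrix else 0
    for p, q in combinations(range(n_cols), 2):
        patterns = {(row[p], row[q]) for row in matrix}
        if {(1, 1), (1, 0), (0, 1)} <= patterns:
            return True
    return False


def best_correction(observed, w_fn=1.0, w_fp=3.0):
    """Brute-force the cheapest conflict-free matrix (toy sizes only).

    A 0 -> 1 flip is treated as correcting a false negative (cost w_fn),
    a 1 -> 0 flip as correcting a false positive (cost w_fp).
    Both weights are illustrative assumptions.
    """
    n_rows, n_cols = len(observed), len(observed[0])
    best = None
    for bits in product((0, 1), repeat=n_rows * n_cols):
        candidate = [list(bits[i * n_cols:(i + 1) * n_cols]) for i in range(n_rows)]
        if has_conflict(candidate):
            continue
        cost = 0.0
        for i in range(n_rows):
            for j in range(n_cols):
                if observed[i][j] == 0 and candidate[i][j] == 1:
                    cost += w_fn
                elif observed[i][j] == 1 and candidate[i][j] == 0:
                    cost += w_fp
        if best is None or cost < best[0]:
            best = (cost, candidate)
    return best


if __name__ == "__main__":
    noisy = [[1, 0], [0, 1], [1, 1]]  # the two columns conflict
    cost, corrected = best_correction(noisy)
    print(cost, corrected)
```

The exhaustive search is exponential in the matrix size and is only meant to show what is being minimized; an ILP or CSP solver, as the abstract describes, replaces this enumeration in practice.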

    Footnotes

    • 10 Co-senior authors

    • 11 Joint first authors

    • [Supplemental material is available for this article.]

    • Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.234435.118.

    • Freely available online through the Genome Research Open Access option.

    • Received August 15, 2018.
    • Accepted September 11, 2019.

    This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.
