Introduction

Haplotypes are combinations of alleles at different polymorphic sites occurring on the same DNA molecule and have been an important tool in genetics research [1]. Haplotypes are useful for imputation of alleles at ungenotyped loci, identification of genomic regions shared identical by descent (IBD), genotype error detection and correction, identification of compound heterozygosity, and analysis of parent-of-origin effects, among many other applications. Haplotypes can also be used instead of alleles at individual variable sites in association testing. At present, the most popular sequencing platforms and associated software packages report genotypes for individual polymorphisms, from which haplotypes must be inferred algorithmically. Phasing, or the process of analyzing known genotypes to infer haplotypes, is an important concern in genetics and an area of active research [2].

In principle, pedigrees should facilitate highly accurate estimation of haplotypes. In contrast to unrelated (or distantly related) individuals, pedigrees provide considerable additional information for phasing. The transmission of alleles from one generation to the next is directly observed in pedigrees and inheritance patterns can be used to establish phase empirically. Although pedigree-based studies have not been popular for association-based gene mapping of complex traits, this is changing with the emerging interest in rare variants, which are arguably more easily and more powerfully studied in family data. Indeed, as the trend toward sequencing greater numbers of subjects in a study population leads—if simply by chance—to the inclusion of ever more closely related individuals [3], knowledge of the pedigree relationships becomes increasingly essential. Furthermore, phasing of pedigree data need not require the allocation and processing of samples to form a population reference panel and can be performed with a smaller number of study participants.

Pedigree data can introduce unique computational challenges, however. Currently, there are limited options for phasing pedigrees, especially in the absence of reference panels (which is typical for most species other than humans). Consequently, there is a need for computational methods and easy-to-use software optimized for phasing related individuals in pedigrees using whole-genome sequence (WGS) data. Here we describe a novel and fast algorithm, PULSAR (Phasing Using Lineage Specific Alleles/Rare variants), that phases WGS data in families. We demonstrate the utility of the algorithm and software implementation using simulated data as well as real WGS data, and compare its performance with alternative approaches to phasing, including long-range phasing (LRP).

Background

The key step in pedigree-based phasing is identification of large haplotype segments that are IBD within the pedigree. The fundamental premise of PULSAR is the supposition that alleles specific to a single founding chromosome within a pedigree, here referred to as lineage-specific alleles (LSAs), are highly informative for inferring phase and are sufficiently abundant in the genomes of humans (and in many other species) to do so with high confidence. In other words, our method focuses on variants for which a single founder is heterozygous and all other founders are homozygous wild-type. In noninbred individuals each chromosome carries many rare variants, many of which will be LSAs within a pedigree and can be interrogated with whole-genome sequencing. The abundance of LSAs allows one to observe large haplotype segments shared IBD within pedigrees empirically, simplifying the phasing effort compared with inferring the inheritance of common alleles. Non-LSAs provide additional information for elucidating the inheritance of large haplotype segments, but it is not necessary to use non-LSAs to identify these large segments with confidence. As non-LSAs tend to be more common variants they can be mapped to the haplotype segments after the haplotype inheritance has been determined using LSAs.

Our phasing approach is similar in one respect to the technique of LRP [4], in that both PULSAR and LRP use IBS information to identify genuine IBD sharing. The methods differ, however, in the strategy used to identify IBS sharing, and therefore tend to perform optimally in different contexts. LRP identifies IBD sharing by searching for opposing homozygotes (i.e., loci at which two individuals are homozygous for different alleles) and excluding regions that are not IBS. This approach works well with common variants, and is highly accurate when at least one individual sharing a segment IBD is homozygous for a given variant. PULSAR identifies putative IBD sharing by searching for shared LSAs (which are necessarily IBS), and uses IBS information from opposing homozygous genotypes only to extend the haplotype boundaries initially established by the observed inheritance of LSAs. This strategy is most efficient for rare variants, but more common variants may be informative in specific pedigrees.

Phasing strategy

The algorithm used by PULSAR comprises four main stages: (1) identify LSAs, i.e., alleles that are likely to be lineage specific; (2) use these LSAs to identify haplotypes, set their initial boundaries, and trace their pattern of inheritance; (3) extend the estimated boundaries of these haplotypes using the fact that individuals must share at least one allele at loci at which they share a haplotype IBD (this is the idea behind LRP); and (4) comprehensively assign alleles to the established haplotypes. Our method assumes the physical location of variants is known, and optionally makes use of external information regarding allele frequencies. The approach is robust to some degree of genotyping error, provided the genotypes have been called from WGS data of reasonably high accuracy, such as from high- or moderate-coverage rather than low-coverage WGS data. Below we describe each of these steps in more detail.

Identify putative LSAs

When an allele is present in a single chromosome among the founders of a given pedigree, then any direct descendent of that founder who also carries the allele can be assumed to share IBD the chromosomal segment harboring that locus. This inference enables empirical estimation of the inheritance of haplotypes within the pedigree. Exceptions to this rule, such as de novo variation, are expected but assumed infrequent enough not to vitiate the overall strategy.

In many pedigrees not all founders will be sequenced, and thus the initial challenge is identifying those alleles that are specific to a single founding chromosome. (Note that at this point we seek only to identify putative LSAs.) In PULSAR, potential LSAs are identified by analyzing the pattern of individuals carrying a given allele, i.e., within a pedigree we search for alleles for which all individuals carrying the allele also have at least one pedigree founder in common. If such is not the case, then clearly not all copies of the allele within the pedigree can be IBD and the allele cannot be an LSA. (Distinct lineages within a pedigree, however, could carry alleles IBD in consequence of having common ancestors external to the observed pedigree structure). Putative LSAs are then checked for Mendelian consistency within their inheritance patterns, i.e., between each founder carrying the allele and the carriers in subsequent generations (at least for those individuals that have sequence data). We also check that the allele is not homozygous in any individual. The algorithm does not presently accommodate the presence of inbreeding loops (wherein LSAs could be homozygous in inbred individuals). These ideas are illustrated in Fig. 1. Figure 1a illustrates a hypothetical pedigree in which individuals carry an allele that cannot be lineage specific because the carriers do not share a common founder. Figure 1b illustrates a hypothetical pedigree in which individuals share an allele that is potentially an LSA.

Fig. 1: Inheritance patterns of lineage-specific alleles.
figure 1

a A pattern of individuals who share an allele (red) but do not share a common founder (within the limits of the available pedigree information). The shared allele cannot therefore be lineage specific. b A pedigree in which the individuals sharing an allele (orange) have two founders in common. The shared allele is potentially lineage specific. [Legend: ■ male; female; slash indicates an unsequenced individual].

Establish haplotype boundaries and haplotype inheritance

Using the set of putative LSAs identified in the previous step, we establish boundaries within which haplotypes are shared IBD in a set of related individuals. At this point the putative LSAs may include false positives, but for now we assume we have only true LSAs and relax this assumption later. Assuming the absence of meiotic recombination between the loci, de novo variation, and genotyping error, the set of individuals carrying a true LSA must share alleles IBD at nearby loci. Conversely, a change in the set of individuals sharing an LSA from one locus to the next indicates one or more recombination events within the meioses that gave rise to the haplotype lineage. By tracking those individuals who share neighboring LSAs along the chromosome, the boundaries of a given haplotype are determined approximately. The procedure is straightforward, but complications arise because at any given region in the diploid genome all individuals carry two sets of haplotypes, one derived maternally and the other paternally. It is therefore necessary to track two separate haplotypes simultaneously. This approach is implemented in PULSAR using a rule-based algorithm to identify changes in the pattern of LSA sharing along a chromosome.

It will often be the case that not all individuals in a given pedigree are sequenced, and consequently one may expect false positives among the putative LSAs identified in the previous step. In other words, some of the putative LSAs are in fact not IBD but merely identical by state (IBS). To reduce the risk of mischaracterizing IBS loci as IBD and inferring incorrect haplotypes, we require a predetermined number of neighboring putative LSAs to be shared before demarcating a new haplotype. A change in the observed inheritance pattern of a small number of neighboring putative LSAs is required to infer, with high confidence, the presence of a true recombination event. This heuristic is reasonable since it is highly unlikely that the same set of individuals will share putative LSAs over a given region of a chromosome if the genotypes are simply a product of chance (such as being IBS, or perhaps due to genotyping error) and not truly lineage specific. For PULSAR, we have required five observations of neighboring putative LSAs showing the same changed pattern (among individuals sharing the putative LSA) before accepting a recombination event and declaring a haplotype change.

Once the haplotypes carried by a proband have been identified based on the pattern of LSA inheritance, we then establish chromosome-wide haplotypes from these smaller haplotype segments. This task comprises two steps. First, if during the course of identifying the haplotypes, a change is observed in a subset of carriers along a particular haplotype segment, we then assume that the resulting two haplotype segments are on the same chromosome in the individual in which the change was observed. Second, for nonfounders having a sufficient number of informative relatives with sequencing data, we determine if the haplotype is shared with the maternal or paternal side of the proband’s relatives. Since the proband inherits one chromosome from each parent, haplotypes that are shared with either maternal or paternal relatives are declared to reside on the maternally or paternally inherited chromosome, respectively.

Extend haplotype boundaries

Although the human genome contains a large number of rare variants, many of which will be lineage specific in a given pedigree, the density of LSAs limits the precision with which haplotype boundaries can be determined. We extend the haplotype boundaries observed in a proband with a simple heuristic: individuals sharing a haplotype must also share an allele IBD (and therefore IBS as well) at each locus in the haplotype. This heuristic is central to the strategy behind LRP [4], and our approach is similar in that we also search for opposing homozygous genotypes. The primary difference, however, is that we do not use this rationale to discover shared haplotypes, but only to extend predetermined haplotypes, already determined from the observed LSA patterns, for relatively small (typically <1% of the genome) unresolved segments of the genome. (We examine the effect of the density of LSA coverage in the results). Haplotype extension proceeds in both directions into the unassigned gap between neighboring haplotypes identified in the previous step, and haplotypes are extended only as far as is possible without overlap. In the case of overlapping haplotypes, the region of overlap is not assigned to either haplotype.

Map alleles to haplotypes

After determining the haplotype boundaries and pattern of inheritance throughout the pedigree, the alleles for all variants are mapped onto the haplotypes. This task could be integrated into the prior steps, but in our implementation we assign alleles to haplotypes (including LSAs) as a separate and final step of the procedure. Initially, homozygous genotypes are straightforwardly mapped onto haplotypes, with each of the two haplotypes carrying the same allele. Next, considering each variant independently, we phase heterozygous genotypes in individuals for which at least one allele has been mapped onto one of the two haplotypes carried at that genomic location. This mapping process is iterated, at each step incorporating the mapped alleles from heterozygous genotypes from the previous iteration, until no additional genotypes can be phased. When all individuals within a pedigree are sequenced, and all of the haplotypes carried by each individual are known, then—barring genotyping errors—this procedure can resolve the phase of all combinations of genotypes (except in the case where every individual is heterozygous, a situation that becomes less likely in larger pedigrees).

When multiple individuals carry a given haplotype the information from these individuals is considered jointly when assigning alleles to the haplotype. In the presence of genotyping error, however, it is possible for the genotypes of multiple individuals to provide a conflicting or ambiguous indication as to which allele is carried on a haplotype. (For example, two individuals may share a haplotype yet have opposing homozygous genotypes.) In such cases we apply a majority rule: the allele supported by the majority of carriers is assigned to the haplotype. If resolution of ambiguity requires assignment of new alleles (i.e., alleles not originally observed in an individual), then the revision of genotypes within the considered haplotypes effectively amounts to a correction of genotyping error.

Note that in certain cases there can be no majority support for resolving allelic ambiguity. In trios, for example, no haplotype is shared by more than two people, and no majority can be formed. The same is true for haplotypes shared between two or fewer individuals within larger pedigrees, such as between married-in founders and only one offspring when the lineage (and its LSAs) are not passed to subsequent generations.

Methods

Pedigrees used for simulation studies

We used a variety of simulated sibships and extended pedigrees to investigate the phasing and imputation accuracy of PULSAR. Sibships comprised two parents and 1–7 children. As a variance reduction technique, smaller pedigrees were generated as a subset of larger pedigrees, e.g., a pedigree with one child is formed as a subset of the pedigree with two children, and so forth. Seven large, multigenerational pedigree structures were chosen from the San Antonio Mexican American Family Studies (SAMAFS) [5, 6] to serve as simulation templates. These pedigrees have 11, 14, 20, 28, 55, 78, 94 individuals, comprising 3, 4, 4, 4, 6, 6, 5 generations, and 4, 3, 3, 16, 29, 31, 61 sequenced individuals, respectively. Diagrams of these pedigrees, generated using the application Cranefoot [7], are included in the Supplementary information. To investigate the computational scalability of the software, two additional pedigree structures were simulated and studied: pedigrees having two parents and 1–1000 children, and pedigrees with 2–50 generations in which each generation comprised one offspring and one married-in founder.

Simulation of whole-genome sequence data

For the individuals in the test pedigrees we simulated genotypes and chromosomal haplotypes having many of the characteristics of real data. Whole-genome sequence data for 84 male X chromosomes (excluding the pseudoautosomal regions) from British (GBR) and Finnish (FIN) populations from the 1000 Genomes Project [8] were used as the set of potential founder chromosomes. Use of real male X chromosomes has the advantage that chromosomal haplotypes are known, while preserving the complexities of real whole-genome sequencing data such as the minor allele frequency distribution of variants, linkage disequilibrium, and the existence of small segments shared IBD between distantly related individuals [9], even if the X chromosome differs slightly from the autosomes in these characteristics. After filtering out pseudoautosomal regions and loci with more than two alleles, there remained 318,912 polymorphic variants in the seed dataset. To accommodate limitations in the software package AlphaPhase1.1 [10], the chromosomes were shortened in simulation to 200,000 variants by considering a smaller section (from 152.23 to 94.37 Mb) of the original chromosome. Although not designed for phasing WGS data, AlphaPhase1.1 provides an important benchmark because it implements the ‘long-range phasing’ approach to phasing in pedigrees [4].

Within each pedigree, founder chromosomes were sampled randomly without replacement from the 84 X chromosomes, assuming the pedigree founders represent a random sample from the population. Inheritance of haplotypes within pedigrees was simulated by gene-dropping, with a recombination probability between variants corresponding to one recombination per 100 Mb, or roughly the observed recombination rate in humans. Chromosomes not used as founder chromosomes were used to calculate minor allele frequencies (MAF), or were used as a reference panel of additional samples for SHAPEIT2 and Beagle 4.0. Genotyping errors were introduced by choosing genotypes at random and replacing them with one of the two alternative genotypes (for diallelic loci) with equal probability.

Whole-genome sequence data

Whole-genome sequence data have been generated for many participants in the SAMAFS. Single-nucleotide variants and diallelic indels with at least five observations of the minor allele were homogenized and merged from vendor-provided (Complete Genomics, Illumina) sequencing genotype calls from 2330 directly sequenced genomes, resulting in 27,160,796 genetic variants. Some additional, minimal, quality control had been performed on the vendor-provided genotypes beyond simple filtering based on allele count and merging. Illumina sequencing was performed at an average read depth of 30×, with 98% of the not-N Refseq [11] coverage being ≥20×. (These data are available through dbGaP Study Accession: phs000462.v1.p1.) Chromosome 21 (356,545 variants) was selected for analysis. Individuals and variants with genotyping rates less than 99% were excluded, resulting in 354,466 variants. To maximize the comparability of the simulated data with the real data we focused on the same seven pedigrees of SAMAFS for both real and simulated datasets. These seven pedigrees comprise three hundred individuals, of whom 147 are sequenced.

Benchmarking

The performance of PULSAR was compared with AlphaPhase1.1 [10], Beagle 4.0 [9], and SHAPEIT2 [12]. AlphaPhase1.1 is an implementation of the LRP [4] algorithm which phases haplotypes under the assumption that individuals sharing a haplotype IBD must also share at least one allele IBS at each locus in the shared region. Functionally, the method searches for opposing homozygous alleles, the presence of which excludes two individuals from sharing a segment IBD at that genomic location (assuming no de novo variants or genotyping error). LRP is highly accurate when at least one individual sharing the segment IBD is homozygous for a given variant. Beagle 4.0 is a phasing algorithm for population sequencing that is based on a hidden Markov model and can accommodate pedigree information. SHAPEIT2 is a phasing method for singleton individuals and is a variant of the approximate coalescent model. Default settings were used for all three software packages. Pedigree information was supplied for Beagle 4.0 and AlphaPhase1.1, but not for SHAPEIT2.

Three metrics were used to compare the performance of these methods: the switch error rate (SER) [13] to assess phasing accuracy, the proportion of heterozygous genotypes phased, and the GNU time command to measure execution times. The SER is a measure of the discrepancy between reconstructed and original haplotypes that is due strictly to misphasing of neighboring heterozygous regions; an SER of zero indicates no phasing error, and an SER of one indicates that no neighboring heterozygous genotypes were correctly phased.

Benchmarking results

Coverage by LSAs

The PULSAR algorithm as described above is based on the idea that variants introduced into a pedigree via a single founding chromosome, i.e., LSAs, can be used to trace the inheritance of haplotypes within the pedigree. For this approach to be practical, it is crucial that LSAs exist in sufficient density in realistic pedigree structures. We have estimated the degree of LSA coverage using real-sequencing data with known phase, specifically the male X-chromosome data from the British (GBR) and Finnish (FIN) cohorts from the 1000 Genomes Project [8]. If we take the pedigree founders to be a random sample of the population, then various key aspects of LSA coverage can easily be determined by permutation for different pedigree structures.

Supplementary Fig. 1 shows the distribution of the number of LSAs per Mb along the X chromosome as a function of the number of pedigree founders for the GBR and FIN cohorts. For the GBR X chromosomes, the median density ranges from 135.4 LSA/Mb in 2-founder pedigrees (median inter-LSA distance of 874 bp, with a maximum of 5.14 Mb which includes a gap in coverage caused by the centromere) to 14.8 LSA/Mb with 15 founders (median distance of 19.0 Kb, with a maximum of 5.39 Mb which includes a gap in coverage caused by the centromere). To estimate the coverage with larger pedigrees, we fit a power function (R2 > 0.99) to the relationship between the number of founders and the median number of LSA/Mb from our simulations. By extrapolating this power function, we estimate that a pedigree with 170 founders would have ~1.0 LSA/Mb. In Finns, a population with decreased genetic diversity due to a series of founder effects [14], there are comparatively fewer LSAs (median 133.1 LSA/Mb in 2-founder pedigrees and 13.4 LSA/Mb with 15 founders). The trend to fewer LSAs in larger pedigrees is reasonable given that the definition of LSA is based on the observation or inference of an allele in a single pedigree founder—thus, as a pedigree contains more founders, fewer polymorphic sites will qualify as LSAs.

Allele frequency distribution of LSAs

The probability of a single founding event in a pedigree depends on the population prevalence of an allele, with rare alleles more likely to be specific to a single founding chromosome. The enrichment for rare alleles among all LSAs will also be greater in pedigrees having more founders, and this is indeed what we observe. For the GBR pedigrees with 2 founders, ~13% of LSAs have MAF less than 5% (median 21.7%), whereas with 15 founders ~91% of LSAs have MAF < 5% (median 2.2%). These observations suggest that filtering on MAF, with allele frequencies estimated from sample data or an independent reference panel, could be an effective means of decreasing the false positive rate when LSAs cannot be identified unambiguously (perhaps due to the presence of unsequenced individuals). In any case, these imputations indicate that for various human pedigree types there will generally be sufficient numbers of LSAs to make them useful as a starting point for phasing WGS data with our algorithm.

Pedigrees with complete sequencing data and no genotyping error

Using the 1000 Genomes Project data, we evaluated the performance of PULSAR, AlphaPhase1.1, Beagle 4.0, and SHAPEIT2.0 in pedigrees of varying size and structure under the ideal scenario in which all individuals are sequenced without error. We chose to examine simulated nuclear families having 1–7 children and seven larger and more complex pedigree structures comprising 11–94 individuals and 3–6 generations. The larger pedigrees are based on actual pedigrees from the SAMAFS [5, 6]. (Diagrams of a seven-child nuclear family, and the seven SAMAFS pedigrees, are provided in Supplementary Fig. 29.) Beagle 4.0 and SHAPEIT2.0 were provided with additional reference samples created using male X chromosomes that were not used for simulating the pedigree genotypes. For comparison, Beagle 4.0 was run with and without the reference samples.

Table 1 presents a comparison of the SER [13] and the proportion of heterozygous markers phased for these simulations. Across most simulations PULSAR produced the lowest SER by a considerable margin. Note that PULSAR phased fewer heterozygous markers than either SHAPEIT2.0 or Beagle 4.0 (each of which phases all genotypes), but a higher percentage than AlphaPhase1.1. In the case of nuclear families some markers were heterozygous in all individuals, a situation that is unresolvable without additional data. Excluding such cases, the observed rate of heterozygous genotypes phased by PULSAR was >99.9% of the achievable upper limit. The greatly reduced accuracy of Beagle 4.0 in the absence of reference samples shows the clear need for reference panels in population-based phasing algorithms.

Table 1 Phasing accuracy and heterozygous genotypes phased (simulated data, all individuals sequenced without error).

Effect of genotyping error

Table 2 summarizes the SER and the proportion of heterozygous markers phased for those simulations in which the genotyping accuracy is 99.0% (i.e., 1% error and assuming, for simplicity, an equal error rate for all variants). PULSAR, SHAPEIT2.0, and Beagle 4.0 yielded similar SERs. PULSAR yielded a lower SER, and phased a higher proportion of heterozygous markers, than did AlphaPhase1.1.

Table 2 Phasing accuracy and heterozygous genotypes phased (simulated data, all individuals sequenced, and genotyping error 1%).

To further examine the influence of genotyping error on the accuracy of imputation, nuclear pedigree datasets were simulated with genotyping accuracies of 99–100%. For each level of accuracy, 20 datasets were simulated for each pedigree type and analyzed with the results shown in Fig. 2. Significantly, SER increases linearly with the genotyping error rate. Genotyping error has somewhat less impact on SER in pedigrees with more children; in larger pedigrees there is an increased potential for sharing genomic sections IBD, enabling PULSAR to identify haplotypes accurately and more reliably despite errors in genotyping.

Fig. 2: Effect of genotyping accuracy and IBD sharing (number of sibs) on the switch error rate (the proportion of adjacent heterozygous genotypes correctly phased).
figure 2

SER of zero indicates perfect phasing, SER of one indicates no adjacent heterozygous genotypes were correctly phased. Results are based on 20 simulations in nuclear families having 1–7 children.

Using the haplotypes reported by PULSAR we reconstructed genotypes and compared these with the true (error-free) genotypes. As shown in Fig. 3, error in the reconstructed genotypes increases approximately linearly with the simulated error rate. Note also that the error in genotype reconstruction for the sib trio is comparable with the error in the simulated genotypes; this is expected because (in the trio) no haplotype can be observed more than twice, i.e., the pedigree lacks sufficient individuals to establish a majority haplotype. In fact, the reconstructed genotype error rate can be slightly higher than the true genotype error rate simply because the algorithm must impute haplotype boundaries, which can introduce a small number of errors. PULSAR is able to correct genotypes in pedigrees with two or more children, performing better in pedigrees having more individuals.

Fig. 3: Effect of IBD sharing on the accuracy of genotype reconstruction according to the haplotypes generated by PULSAR.
figure 3

Results are based on 20 simulations in nuclear families having 1–7 children.

Effect of ungenotyped individuals

Ungenotyped individuals are a common complication in real datasets. For phasing, nonsequenced individuals can affect the false positive rate among putative LSAs identified by PULSAR, which will ultimately decrease the accuracy of reconstructing haplotypes, inferring haplotype boundaries, and estimating haplotype sharing. Moreover, PULSAR will not phase heterozygous genotypes for individuals in genomic regions that are not shared IBD with another individual (unless this happens in consequence of falsely inferring IBD sharing).

Table 3 presents a comparison of the SER and the proportion of heterozygous markers phased for simulations in which genotypes are known without error but with some individuals in each pedigree having no available sequence data. For the seven pedigrees based on the SAMAFS data (11–94 individuals), we recorded which pedigree members had actual sequencing data and used this information to model the presence of unsequenced individuals in our WGS simulations. For the hypothetical nuclear families, we performed two simulations in which one or both parents were sequenced, with the results shown in Table 3. (For comparison, refer to Table 1 for the results for nuclear families with complete sequence data.)

Table 3 Phasing accuracy and heterozygous genotypes phased (simulated data with ungenotyped individuals).

From Table 3 it is evident that the number of ungenotyped individuals and, more significantly, the location of the ungenotyped individuals within the pedigree, have pronounced effects on the ability to phase haplotypes. In the case of nuclear families, children missing sequence data have only a small negative effect on the accuracy of PULSAR, as this situation does not lead to an increase in the number of falsely inferred LSAs (although the number of observations of each haplotype is reduced on average). The presence in a nuclear family of children who are missing sequence data is effectively equivalent to a reduction in the size of the nuclear family. As a consequence, a greater number of variants will likely be heterozygous in all the sequenced individuals, although this effect is of practical importance only if the number of sequenced individuals is small. Since PULSAR cannot resolve cases in which all individuals are heterozygous, the main impact of missing children in nuclear families is a decreased percentage of loci that are successfully phased. The effect of missing parents is more significant because this situation creates ambiguity in identifying LSAs. For example, in the simulated nine-member nuclear family, the false positive rate for putative LSAs was 15.9% with one parent missing and 0% with both parents sequenced (data not shown); from Tables 1 and  3 the SER correspondingly increased from 1.68 × 10−5 to 4.04 × 10−2 and the percentage of heterozygous markers phased dropped slightly from 99.99% to 99.45%.

Mitigating the effect of ungenotyped individuals

One approach to mitigating the effect of ungenotyped individuals on the performance of PULSAR is to filter putative LSAs based on their minor allele frequency. Table 3 includes a comparison of the effect of missing individuals in nuclear families when putative LSAs are filtered on the threshold MAF < 5%. In the nine-member nuclear family with one sequenced parent, the rate of falsely inferred LSAs was 15.9% prior to filtering and only 1.50% with filtering (data not shown); from Tables 1 and 3 the SER correspondingly decreased from 4.04 × 10−2 to 1.46 × 10−3. Filtering of LSAs is not without some compromise, however; in this example, filtering reduced the total number of putative LSAs from 61,160 to 7,850 (data not shown), and decreased the percentage of heterozygous markers phased from 99.45% to 98.58%. A simple metric for quantifying the tradeoff between phasing accuracy (as measured by SER), and the proportion of heterozygous markers phased (correctly or not), is the number of heterozygous markers phased correctly, given by the product of phasing accuracy (1-SER) and number of heterozygous markers phased. By this measure, we found that filtering of LSAs clearly improved overall performance, at least in the case of nuclear families (data not shown).

Filtering LSAs according to MAF is most advantageous with smaller pedigrees having fewer founders; in larger pedigrees with many founders, the putative LSAs identified by PULSAR tend to have lower allele frequencies anyway, and filtering on MAF < 5% has little effect. In the largest pedigree considered in this study, for example, in which 61 of 94 (64.9%) individuals have sequence data, prefiltering with MAF < 5% had only a minor effect on the results. The false positive rate for putative LSAs decreased from 0.0099 to 0.0091, the total number of putative LSAs decreased from 44889 to 40060 (data not shown); from Table 3 the SER increased from 2.54 × 10−3 to 2.98 × 10−3, and the percentage of heterozygous variants phased declined slightly from 97.38% to 97.26%. We have not investigated the effect of different thresholds for MAF, but with larger pedigrees more stringent filtering (i.e., smaller MAF) could well be advantageous.

Performance with missing data and genotyping error

In typical situations, pedigree-based studies must cope both with missing data (e.g., individuals unavailable for sequencing) and genotyping errors. We investigated this situation using the simulated nuclear families and the pedigrees based on SAMAFS data. For the nuclear families the genotypes for one parent were blanked to simulate missing data. Genotypes were simulated with an accuracy of 99%; as most sequencing platforms can outperform this benchmark easily with appropriate read depth, an accuracy of 99% is somewhat conservative. Results for these simulations are presented in Table 4, from which it is seen that PULSAR yields an SER competitive with SHAPEIT2.0 and Beagle 4.0, and all three methods outperformed AlphaPhase1.1.

Table 4 Phasing accuracy and heterozygous genotypes phased (simulated data with missing individuals and genotyping error 1%).

Application to real WGS data

Last, we investigated the phasing performance of PULSAR with actual whole-genome sequence data. We used whole-genome sequence genotypes for chromosome 21 from the SAMAFS data, and the same multigenerational pedigrees that were used as template pedigree structures in the other simulations. This test dataset embodied the usual problems and errors inherent with real-world data such as ungenotyped individuals, missing (uncalled) genotypes within sequenced individuals, and genotyping errors. SAMAFS pedigree relationships had been examined previously and corrected as necessary, but a small degree of kinship between some presumably unrelated founders is possible. (Of course, the disadvantage of real data for characterizing the performance of statistical methods is that the true state—mainly with respect to phasing, and to a lesser extent the genotypes—is unknown.) To better characterize the performance of PULSAR with these data, we introduced missing data by blanking genotypes at random with a probability of 1/1000, and then applied PULSAR to the remaining data. The phased haplotypes imputed by PULSAR were used to reconstruct the blanked genotypes, and the concordance of the reconstructed and originally observed genotypes was computed.

The results for this experiment are summarized in Table 5. PULSAR accurately phased nearly all of the observed genotypes (>97% except in one sparsely genotyped pedigree), and the genotypes imputed from the phased haplotypes were in excellent agreement with the originally observed genotypes (>98% concordance). Although fewer than half of the blanked genotypes were phased or imputed, the concordance of the imputed and blanked genotypes ranged from 95.45% to 98.58% across pedigrees, indicating high-phasing accuracy and reliable genotype imputation.

Table 5 Performance of PULSAR using SAMAFS WGS Data.

The original SAMAFS sequence data also contained some genuinely missing genotypes, and we found a notable difference between the proportions of artificially missing (blanked) genotypes and truly missing genotypes that PULSAR was able to impute (data not shown). This difference is interpreted as a consequence of the different distributions for the two kinds of missing data. The blanked genotypes were (by construction) distributed randomly, whereas the truly missing genotypes were more likely to be found at the same marker in multiple individuals. This difference can be explained in various ways, but a reasonable hypothesis is simply that the presence of indels or copy number variants acts to disrupt the diploid state of these markers in a subset of related individuals.

Execution times

All software packages were benchmarked on a server with 2.40 GHz Intel Xeon Core i7 CPUs running CentOS Linux 7. Although computation times can become critical with increasing volumes of data, we found all software packages tested to be sufficiently fast for practical use on whole-genome sequence data in the pedigrees we studied. For simulated data on the largest pedigree (94 individuals), AlphaPhase was fastest (93 s), followed by eagle 4.0 (826 s), PULSAR (938 s), and SHAPEIT2.0 (8833 s). SHAPEIT2.0 and Beagle 4.0 were each provided with reference samples as would be expected in typical usage.

Scalability

Computational scalability is an important consideration with any statistical genetic procedure applicable to pedigree data. An algorithm that performs well with nuclear families and small sibships may rapidly become impractical for use with large sibships or deep, extended pedigrees. To estimate the scalability of the PULSAR algorithm we simulated and analyzed sibships comprising two parents and 1–100 children. Such extreme sibships are unrealistic for human populations, although they may exist in other species. This design was chosen simply to enable us to investigate the performance of PULSAR as IBD sharing within the pedigree increases. The results (data not shown) indicate that overall computation time can be well-modeled by a polynomial of quadratic order in the number of individuals, with coefficient 0.48 for the quadratic term. Based on this polynomial fit, we can estimate that a pedigree with 2 parents and 1000 children will require ~41,000× the execution time for a trio comprising two parents and one offspring.

We also studied artificial pedigrees having 2–50 generations, in which each generation comprised one offspring and one married-in founder. This structure was chosen because the IBD sharing within such a pedigree does not, on average, increase between generations. In this case the simulation results disclosed a strongly linear relationship (R2 > 0.99) between the number of individuals and execution time. Based on this result, analysis of a hypothetical pedigree having 500 generations and 999 individuals is expected to require 450× the execution time for a trio comprising two parents and one offspring.

In general, our simulation results indicate that the overall execution time of the PULSAR algorithm scales approximately linearly with pedigree size and approximately quadratically with factors that increase IBD sharing (i.e., number of meioses). These tendencies also affect the relationship between execution time and presence of unsequenced individuals. Unsequenced individuals cannot be used by PULSAR and act generally to reduce execution time, but the precise effect depends to some extent on pedigree size, structure, and the location of the unsequenced individuals within the pedigree.

Discussion

Accurate phasing of genotype data is an essential step in many genetic studies, yet there are currently few tools designed specifically to phase genotypes in pedigrees of arbitrary size and structure. Alternative methods for phasing are typically restricted to use with unrelated individuals or nuclear families, and many require supplemental data in the form of reference panels in order to produce accurate results. While large reference panels are now available for major human ethnic groups, this is not the case for many smaller human populations or for virtually all other species of biomedical relevance.

We have presented a novel algorithm for phasing whole-genome sequence data in a broad range of pedigree sizes and structures, and have described the associated software, PULSAR. Our algorithm yields low SER, and phases a high percentage of heterozygous variants, without the use of additional data in the form of reference panels. The high accuracy of the haplotypes produced by PULSAR, often spanning entire chromosomes, is encouraging. Based on the results of our benchmarking tests, PULSAR is a practical and effective approach to phasing that works well for a range of pedigree sizes and structures provided that a reasonable proportion of the pedigree members is sequenced. The algorithm used in PULSAR can also form the basis of a tool for genotype error checking and correction.

While the method clearly has merit, PULSAR shares some of the limitations typical of other phasing approaches. Unsequenced individuals and/or genotype errors are unavoidable in real datasets and will weaken the performance not only of PULSAR but also of alternative approaches for phasing. However, in our simulations based on real data, in which we examined ‘difficult’ pedigrees with sparse sequencing coverage (e.g., only 3 out of 20 individuals sequenced), PULSAR performed well, erring conservatively on the side of phasing fewer heterozygous markers rather than phasing them incorrectly.

The critical factor in the performance of PULSAR is the proportion of alleles an individual shares IBD with at least one other sequenced individual. Since it is not always straightforward to know what patterns of missing individuals will be detrimental to the performance of PULSAR, we also developed a tool that uses Monte Carlo gene-dropping to estimate the expected proportion of the genome for which each individual will share at least one haplotype IBD with another sequenced individual. This tool can be used for a priori estimation of the applicability of PULSAR to a given pedigree-based dataset. If the sequenced individuals in the pedigree are not predicted to share a high percentage of alleles IBD across the genome, one may wish to consider alternative phasing approaches, such as those designed for unrelated individuals.

Our simulations, using realistic genotypes and chromosomal haplotypes, show that PULSAR is quite robust to genotyping errors. Indeed, PULSAR can be used to correct some kinds of genotyping errors. PULSAR currently ‘fixes’ incorrect genotypes using a simple majority rule within the individuals sharing a haplotype, but the ability to fix genotype errors could potentially be improved by implementing a weighting scheme based on the confidence of genotype calls, or on the number of reads supporting a given allele call.

We have not investigated the impact of errors in the pedigree structures, including the existence of unknown relationships between pedigree founders, on the performance of PULSAR. As a rule, we advocate that pedigree relationships be confirmed or estimated analytically before undertaking any efforts to establish phase, correct genotyping errors, or impute missing genotypes [15, 16].

PULSAR was designed specifically to work on genotype calls generated from high- or medium-coverage WGS data, based on the rationale that the vast number of rare sequence variants in the human genome will favor a high density of LSAs in pedigrees. Low-coverage WGS data leads to less precise genotype calls, particularly for rare variants observed only once or a few times in a given dataset, and PULSAR’s performance with such low-coverage data would be impacted correspondingly. At the present time, and due mainly to cost factors, many studies only involve exome-sequencing data or dense SNP genotyping rather than WGS data, but we have not investigated the performance of PULSAR with data generated on these other genotyping platforms. SNP genotyping panels are biased towards common SNPs, and for this reason we expect that PULSAR would be of limited utility for such data in extended pedigrees, where very few common SNPs would be introduced only once into a given pedigree and thus serve as LSAs. In smaller pedigrees, however, having relatively few founders, we expect the performance of PULSAR to be much less impacted.

One can envisage a number of potential improvements and extensions to PULSAR, but these are quite beyond the scope of the present discussion. However, we did explore the utility of prescreening variants based on allele frequency when identifying putative LSAs. When pedigree founders are not available for sequencing, as is often the case, then such prefiltering based on minor allele frequency was found to be useful for reducing the number of false positives among putative LSAs and thereby improving the performance of the algorithm. Of course, the allele frequency estimates must be reliable to be effective as a filtering criterion. With a sufficient number of founders, population allele frequencies can be estimated from the analysis sample; alternatively they can be estimated from independent reference panels. In either case, one must remain cognizant of the effects of admixture, variations in sequencing technology platform, and quality of sequencing, on the resulting estimates.

The algorithm used in PULSAR for inferring phase in pedigrees is a rule-based procedure, but alternative methods based on a maximum-likelihood criterion could be investigated. For example, one might infer phase using the Elston–Stewart algorithm [17] as implemented in LINKAGE [18] or FASTLINK [19]. As a practical consideration, however, the Elston–Stewart algorithm is limited in the number of variants that can be jointly analyzed, which would make it necessary to subdivide the genome into a number of overlapping segments, phase these segments separately, and then assemble the resulting haplotype data. Alternatively, the Lander–Green algorithm [20], as implemented in the software package MERLIN [21], can analyze many variants simultaneously, but is limited in the number of individuals (in a single pedigree) that can be handled. For large pedigrees, one would need to break the pedigree into several overlapping subpedigrees for phasing, then recombine the phased genotypes. Pedigree breaking and combining are not trivial tasks, since joining marker segments of pedigree fragments can lead to Mendelian inconsistencies and other problems. To overcome some of the limitations of the Elston–Stewart and Lander–Green algorithms, a number of Monte Carlo Markov Chain (MCMC) methods have been developed, including LOKI [22] and SimWalk2 [23]. These methods are computationally intensive, however, and not easily applied to whole-genome sequence data. All of these methods are also impacted to varying degrees by genotyping errors and missing genotype data.

There are a number of methods for phasing unrelated individuals [2], some of which, such as Beagle 4.0 [9], are based on hidden Markov models. Currently, the most prominent phasing methods for singleton individuals use variants of the approximate coalescent models, such as in SHAPEIT2.0 [12] and MaCH [24]. The accuracy and computational speed of these packages has greatly improved recently, aided by the availability of ever-larger reference panels for some of the major ethnic groups. One difficulty with applying these methods to pedigree data is that suitable reference panels are not presently available for many smaller populations which are often well-characterized and particularly well-suited for pedigree studies. Without large and accurate reference samples, however, statistical-phasing methods requiring reference panels perform much worse, as seen in the phasing results for Beagle 4.0 in the absence of reference samples. Unfortunately, large and accurate reference panels do not exist for most species.

Yet another simplification that is often made is that of treating family members as unrelated individuals during phasing. This strategy will often generate haplotypes that are inconsistent with Mendelian rules of inheritance. Some packages, such as Beagle 4.0, take family relationships into consideration and can correct for Mendelian inconsistencies. The phasing algorithm duoHMM [25], as implemented in SHAPEIT2 [12], attempts to use the restrictions on inheritance observed in duos to correct the haplotypes produced by statistical phasing, but when applied to the test data described in this study SHAPEIT2 frequently failed to run on most of the simulated pedigree structures.

In view of the various strengths and limitations of the available phasing methods for pedigree data, we see the opportunity for development of a meta-analytic algorithm for combining the phasing results from multiple methods and issuing results that are statistically favored in some well-defined sense. Such an algorithm would weigh the conclusions of different algorithms according not only to the method and assumptions embodied by specific algorithms, but also to features of the sequence data, e.g., sequencing technology and platform, read depth, and so on. Results from a combination of methods may ultimately prove to be more accurate and complete while remaining computationally practical. Alternatively, the phasing results from one algorithm could be refined by additional methods. For example, the phasing output from PULSAR could be used as input for MCMC methods, providing an initial, plausible, phased genotyping state to be refined by an MCMC-based method. Alternatively, the output from statistical phasing of singletons could serve as input for pedigree-based phasing methods, or conversely. Such tandem analyses could harness the high-phasing accuracy achievable by direct observation of inheritance patterns within pedigrees, with statistical-phasing methods that infer phase in parts of the genome where inheritance may not be directly observable given the particular sequencing dataset.