Introduction

Copy-number variation is a form of structural genetic variation that involves a gain or loss of DNA segments. Copy-number variants (CNVs) are larger than 50 bp and can include part of a gene, a whole gene, or a longer genomic region [1]. CNVs are associated with a number of genetic disorders, including autism spectrum disorders, neurodevelopmental disorders, and autoimmune diseases [2,3]. With advances in next-generation sequencing (NGS) technology and the increasing availability of bioinformatics tools to analyze NGS data, clinical labs are now able to process and detect CNVs in batches of exomes, genomes, and gene panels. For patients to receive an accurate diagnosis and appropriate care, it is essential to correctly determine the pathogenicity of variants.

In late 2019, the American College of Medical Genetics and Genomics (ACMG) released updated guidelines for clinical classification of CNVs [4]. Each CNV is classified into one of the following categories: benign, likely benign, variant of uncertain significance, likely pathogenic, or pathogenic. The new guidelines take into account a wide range of CNV properties and allow for comprehensive analysis and accurate classification of variants. However, implementing the guidelines on a large scale is challenging, as each CNV requires considerable clinician time to arrive at a final pathogenicity score. Although the new guidelines are intended for manual evaluation, computational analysis can expedite the process and determine the impact of CNVs more efficiently. Available CNV annotation tools use criteria that differ from the new ACMG guidelines [5,6,7]; hence, a new computational approach is needed.

Here, we present ClassifyCNV, a command-line tool that allows for rapid high-throughput classification of CNVs in accordance with the latest ACMG guidelines.

Methods

Databases

The databases used to implement the 2019 ACMG criteria for clinical classification of CNVs [4] are listed in Supplementary Table S1. For each database, we indicate which human genome build it is available for (hg19 or hg38). If a database was only available for one genome build, we used CrossMap v0.4.2 [8] and the UCSC chain files, available from the UCSC Genome Browser [9], to lift over genomic coordinates to the other build.

All of these databases were converted to BED format and are available in the ClassifyCNV repository. We recommend updating the local copies of the ClinGen databases regularly by running the update_clingen.sh script, which is also available in the ClassifyCNV repository.

Implementation

ClassifyCNV is implemented in Python 3, runs on Linux, UNIX, and Mac OS X, and requires BEDTools v2.27.1 or higher [10]. Both the GRCh37 (hg19) and the GRCh38 (hg38) genome builds are supported.

ClassifyCNV accepts a BED file as input and requires the user to provide the genomic coordinates and type (deletion or duplication) of each CNV. ClassifyCNV does not evaluate the quality of the CNV calls, as this is expected to be done during the CNV calling and filtering steps. The tool then uses the criteria described in the ACMG scoring rubrics for copy-number loss and gain [4] to evaluate the clinical significance of the CNVs. The criteria implemented in ClassifyCNV are listed in Supplementary Table S2 for copy-number losses and in Supplementary Table S3 for copy-number gains. Points are awarded for each evaluated section of the rubric. The clinical classification is calculated from the total number of points assigned to a CNV. The flowchart of the algorithm is shown in Fig. 1 for copy-number losses and in Fig. 2 for copy-number gains.
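The final step, mapping a point total to a clinical class, can be sketched as follows. This is a minimal illustration and not the ClassifyCNV source: the function name `classify` and the example point values are hypothetical, and the cut-offs (0.99 and 0.90, mirrored for negative totals) are taken from the ACMG scoring rubric [4] as we understand it. Decimal arithmetic is used so that sums such as 0.3 + 0.3 + 0.3 land exactly on a threshold.

```python
from decimal import Decimal

# Cut-offs as we read them from the ACMG rubric; treat them as an
# assumption and check them against the guideline before relying on them.
PATHOGENIC = Decimal("0.99")
LIKELY_PATHOGENIC = Decimal("0.90")


def classify(points):
    """Sum per-section points and map the total to a clinical class.

    `points` is an iterable of numbers, one per evaluated rubric section.
    Returns a (total, label) tuple.
    """
    # Convert via str() so that floats such as 0.3 become exact decimals.
    total = sum((Decimal(str(p)) for p in points), Decimal("0"))
    if total >= PATHOGENIC:
        label = "Pathogenic"
    elif total >= LIKELY_PATHOGENIC:
        label = "Likely pathogenic"
    elif total > -LIKELY_PATHOGENIC:
        label = "Uncertain significance"
    elif total > -PATHOGENIC:
        label = "Likely benign"
    else:
        label = "Benign"
    return total, label


# Illustrative point values only; they are not taken from a real CNV.
for section_points in ([0, 1.0], [0.3, 0.3, 0.3], [0, 0.45], [-0.9], [0, -1]):
    total, label = classify(section_points)
    print(f"{total}\t{label}")
```

With plain floats, 0.3 + 0.3 + 0.3 evaluates to slightly less than 0.9 and would fall into the uncertain category, which is why the sketch avoids binary floating point for the comparison.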

Figure 1. The algorithm to determine the pathogenicity score of a copy-number loss.

Figure 2. The algorithm to determine the pathogenicity score of a copy-number gain.

To assess the genomic content of each variant, ClassifyCNV checks for a full or partial (≥ 1 bp) overlap with protein-coding and noncoding genes, as well as enhancers and promoters. It also tracks the number of protein-coding genes that are fully or partially overlapped by each CNV. To assess whether any established dosage-sensitive genes or regions are included and what effect the deletion or duplication might have on their expression, each CNV is evaluated against a set of curated haploinsufficient and triplosensitive genes and genomic regions obtained from ClinGen [11]. A ClinGen dosage sensitivity score of ‘3’ is required for a gene or genomic region to be considered haploinsufficient or triplosensitive. For partially overlapped dosage-sensitive genes, ClassifyCNV evaluates which regions within the gene are involved, as per the ACMG guidelines. If a deletion does not encompass genes or regions that are known to be haploinsufficient, ClassifyCNV checks whether haploinsufficiency is predicted for any genes within the deletion. To satisfy this condition, a gene is required to have a DECIPHER HI index ≤ 10% [12], a gnomAD pLI score ≥ 0.9, and an upper bound of the observed/expected confidence interval < 0.35 [13]. Finally, to assess whether the CNV is likely to be benign, ClassifyCNV obtains the population frequencies of similar variants from DGV [14] and gnomAD [15]. For each analyzed CNV that does not contain known dosage-sensitive genes or genomic regions, the population frequencies of known overlapping CNVs are extracted. A known variant must overlap at least 80% of the query CNV length to be considered. If multiple known variants overlap the CNV, their average population frequency is calculated. A CNV is considered common if its population frequency is > 1%.
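The two threshold-based checks in this paragraph can be sketched as follows. This is an illustration under stated assumptions, not the ClassifyCNV code: the function names are hypothetical, coordinates are assumed to be BED-style (0-based, half-open), known variants are assumed to be pre-filtered to the same chromosome and CNV type, frequencies are expressed as fractions, and the DECIPHER HI index is expressed in percent.

```python
def overlap_fraction(query_start, query_end, known_start, known_end):
    """Fraction of the query CNV length covered by a known CNV."""
    overlap = min(query_end, known_end) - max(query_start, known_start)
    return max(0, overlap) / (query_end - query_start)


def is_common(query, known_variants, min_overlap=0.8, max_frequency=0.01):
    """Decide whether a query CNV looks like common population variation.

    query: (start, end) of the CNV being classified.
    known_variants: iterable of (start, end, frequency) tuples, assumed to be
        on the same chromosome and of the same type as the query.
    """
    frequencies = [
        freq
        for start, end, freq in known_variants
        if overlap_fraction(query[0], query[1], start, end) >= min_overlap
    ]
    if not frequencies:
        return False  # no sufficiently overlapping known variant
    # Average the frequencies of all qualifying variants, then apply the 1% cut-off.
    return sum(frequencies) / len(frequencies) > max_frequency


def is_predicted_haploinsufficient(hi_index_percent, pli, oe_upper_bound):
    """Apply the three predictor thresholds; all three must hold."""
    return hi_index_percent <= 10 and pli >= 0.9 and oe_upper_bound < 0.35


# Made-up coordinates and frequencies, for illustration only.
known = [(900, 1900, 0.02), (1100, 2100, 0.03), (1900, 3000, 0.5)]
print(is_common((1000, 2000), known))
print(is_predicted_haploinsufficient(5.0, 0.99, 0.2))
```

In the made-up example, the third known variant covers only 10% of the query and is ignored, so the average is taken over the first two variants.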

ClassifyCNV continues the evaluation through the end of the rubric for all CNVs, including the ones where a benign or pathogenic classification is determined before all of the conditions in the rubric have been evaluated.

ClassifyCNV outputs a tab-delimited file that can be used by another pipeline in downstream analysis or evaluated by a clinician. For each variant ClassifyCNV reports the clinical classification, the total number of points, a breakdown of how the final pathogenicity score was determined, a list of established and predicted dosage-sensitive genes encompassed by the CNV, and a list of all protein-coding genes within the CNV. As some of the sections of the ACMG scoring rubrics require manual evaluation by a clinician, the information provided can be used to continue the evaluation if necessary.
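A downstream pipeline could, for example, pull out the variants that still need manual review. The sketch below reads a tab-delimited file with Python's standard csv module; the column names (`VariantID`, `Classification`, `Total score`) and the label text are hypothetical placeholders, so the header of the actual output file should be checked before reuse. The rows are made up for illustration.

```python
import csv
import io

# Hypothetical column name and label text; adjust to the real output header.
CLASS_COLUMN = "Classification"
UNCERTAIN_LABEL = "Uncertain significance"


def rows_needing_review(handle, class_column=CLASS_COLUMN, label=UNCERTAIN_LABEL):
    """Return the rows of a tab-delimited results file whose class equals `label`."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row[class_column] == label]


# Made-up rows standing in for a results file (not real output).
example = io.StringIO(
    "VariantID\tClassification\tTotal score\n"
    "cnv_1\tPathogenic\t1.0\n"
    "cnv_2\tUncertain significance\t0.45\n"
)
flagged = rows_needing_review(example)
print([row["VariantID"] for row in flagged])
```

For a file on disk, the same function can be called on a handle opened with `open(path, newline="")`.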

Results

To test the speed of ClassifyCNV, we obtained a set of 17,683 duplications and 20,805 deletions from the nstd102 study in ClinVar [16]. We used the hg19 coordinates and ran ClassifyCNV with the -precise flag, thus treating the CNV coordinates as exact. For CNVs whose precise coordinates were unknown, we used the inner coordinates. The run completed in less than 60 s on a 64-bit Linux virtual machine using two cores.

We used the same set of CNVs to evaluate the ClassifyCNV performance on clinical data. The ClinVar variants were obtained from studies published prior to 2019 and, therefore, classified before the current ACMG guidelines were released. The comparison of ClinVar and ClassifyCNV classifications is shown in Table 1.

Table 1 ClassifyCNV performance on ClinVar data.

The pathogenic/likely pathogenic variants and variants of uncertain significance had a high degree of concordance between the original ClinVar classification and the ClassifyCNV result (57% and 97.8%, respectively). The majority of benign variants were classified as variants of uncertain significance (16,687; 87.7%). Of these, 14,356 variants did not receive any points during the classification, indicating that they do contain genes or regulatory elements, since the rubric assigns negative points to CNVs that lack such elements. However, the information about the genetic content within these variants was unavailable or did not strongly support reclassification of the variants from uncertain significance to benign or pathogenic. Despite the low sensitivity (11.8%) when evaluating benign variants, ClassifyCNV showed a high degree of specificity (99.6%), as the tool is conservative when moving variants between categories. Since the classification parameters used by ClassifyCNV are different from the parameters used prior to the release of the 2019 ACMG guidelines, we do not expect full concordance even when evaluating variants manually.

To assess the concordance of the ClassifyCNV calls with the results of manual evaluation, we obtained the complete list of 114 variants previously classified by the ACMG/ClinGen committee using the new guidelines [4] (Table 2, Supplementary Table S4). In the ACMG/ClinGen dataset, the manual classification results were provided by two evaluators who assessed the variants independently. We re-grouped the calls into four categories: pathogenic/likely pathogenic, uncertain significance, benign/likely benign, and conflicting. The last category contained the variants on which the two evaluators disagreed. CNV breakpoints were presumed to be accurate and the -precise flag was used.
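One way to express this re-grouping is sketched below. It rests on an assumption the text does not settle: agreement between the two evaluators is judged after merging the five classes into three groups, so that, for instance, "pathogenic" and "likely pathogenic" count as agreement. The names `GROUPS` and `regroup` are hypothetical.

```python
# Map each of the five ACMG classes to its merged group.
GROUPS = {
    "pathogenic": "pathogenic/likely pathogenic",
    "likely pathogenic": "pathogenic/likely pathogenic",
    "uncertain significance": "uncertain significance",
    "likely benign": "benign/likely benign",
    "benign": "benign/likely benign",
}


def regroup(call_a, call_b):
    """Combine two evaluators' calls into one of four categories.

    Assumption: calls are compared after merging, so disagreement only
    within a merged group is not treated as conflicting.
    """
    group_a = GROUPS[call_a.strip().lower()]
    group_b = GROUPS[call_b.strip().lower()]
    return group_a if group_a == group_b else "conflicting"


print(regroup("Pathogenic", "Likely pathogenic"))
print(regroup("Benign", "Uncertain significance"))
```

If disagreement were instead judged on the original five classes, the comparison would be made before the lookup in `GROUPS`.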

Table 2 Comparison of ClassifyCNV calls to the results of manual annotation by ACMG/ClinGen.

For 81% of CNVs, the ClassifyCNV result matched the ACMG/ClinGen category (for 76% of CNVs the match was exact, and for 5% ClassifyCNV determined the CNV to be likely benign or likely pathogenic, while the manual evaluation result was benign or pathogenic, respectively). In only one case did ClassifyCNV place a variant of uncertain significance into the likely pathogenic category. In that case, the pathogenicity points were assigned because of the large number of protein-coding genes encompassed by the CNV, many of which belonged to the same gene family and thus were not counted individually during the manual evaluation. For both the benign/likely benign and the pathogenic/likely pathogenic categories, ClassifyCNV showed a high degree of specificity (100% and 98.4%, respectively). No benign/likely benign variants were classified as pathogenic/likely pathogenic, or vice versa. For variants automatically classified as uncertain, a manual evaluation of the published literature and patients’ family histories by a clinician was required to arrive at the final classification.

Lastly, we compared the performance of ClassifyCNV to that of AnnotSV [5], a comprehensive annotation tool that implements an earlier version of the ACMG criteria. To compare the two tools, we used the ACMG/ClinGen manually curated set of 114 variants. We removed the variants for which the ACMG/ClinGen classification was conflicting, since sensitivity, specificity, and accuracy cannot be calculated for such variants. We analyzed the remaining 84 CNVs using AnnotSV version 2.4 with default settings and ClassifyCNV with the -precise flag enabled to treat the CNV coordinates as exact. The comparison of the two tools is shown in Table 3.

Table 3 A comparison of ClassifyCNV and AnnotSV.

Compared to ClassifyCNV, AnnotSV is less conservative when making pathogenic/likely pathogenic calls. Out of 84 variants, AnnotSV determined 72 to be pathogenic/likely pathogenic, compared to 15 calls by ClassifyCNV and 23 calls by ACMG/ClinGen manual evaluation. AnnotSV showed higher sensitivity for pathogenic/likely pathogenic variants (100% vs 60.9% by ClassifyCNV) and benign/likely benign variants (37.5% vs 25% by ClassifyCNV). However, both the specificity and the accuracy of AnnotSV were lower. For benign/likely benign variants ClassifyCNV had 100% specificity and 92.9% accuracy, while AnnotSV’s values were 92.1% and 86.9%, respectively. For pathogenic/likely pathogenic variants ClassifyCNV had 98.4% specificity while AnnotSV’s specificity was 19.7%. The accuracy of ClassifyCNV and AnnotSV was 88.1% and 41.7%, respectively.
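The pathogenic/likely pathogenic figures above follow from the standard confusion-matrix definitions. In the sketch below, the counts are back-calculated from the numbers reported in this paragraph (84 variants, 23 manual pathogenic/likely pathogenic calls, 15 and 72 tool calls, and the stated sensitivities); they are inferred for illustration and are not copied from Table 3. The function name `metrics` is hypothetical.

```python
def metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and accuracy from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Inferred counts for the pathogenic/likely pathogenic category:
# 23 of the 84 variants are positives by manual evaluation, 61 are negatives.
classifycnv = metrics(tp=14, fp=1, tn=60, fn=9)   # 15 calls, 14 of them correct
annotsv = metrics(tp=23, fp=49, tn=12, fn=0)      # 72 calls, 23 of them correct

for name, result in (("ClassifyCNV", classifycnv), ("AnnotSV", annotsv)):
    summary = ", ".join(f"{key} {100 * value:.1f}%" for key, value in result.items())
    print(f"{name}: {summary}")
```

The example makes the trade-off explicit: AnnotSV recovers every manually confirmed pathogenic/likely pathogenic variant, but 49 of its 72 calls in this category are not supported by the manual evaluation.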

In summary, while ClassifyCNV places variants in the uncertain category more often than AnnotSV does, its high specificity and accuracy make it a more suitable tool for evaluating CNVs under the latest ACMG/ClinGen guidelines. A follow-up evaluation by a clinician is expected to refine the classification of variants of uncertain significance.

Discussion

ClassifyCNV is the first tool that automates the implementation of the updated ACMG guidelines to classify CNVs. It produces a rapid and reliable evaluation of variants and is suitable for high-throughput analysis. The tool can be easily integrated into existing pipelines and can expedite the evaluation of CNVs, helping to reduce the time to diagnosis.

ClassifyCNV errs on the side of caution when moving a variant between categories, as advised by the new ACMG guidelines. Therefore, if convincing data are not available, a CNV is likely to remain a variant of uncertain significance. Although a follow-up evaluation by a clinician may be necessary for these variants, ClassifyCNV significantly facilitates the process by completing the evaluation of gene content, dosage-sensitivity, and population frequencies and outputting a list of genes of interest.