Main

Identifying noncoding loss-of-function DNA variants is a major bottleneck of whole genome interpretation, as predicting function outside coding regions is difficult1. Variants altering splicing represent an important class of noncoding loss-of-function variants because they can lead to drastically altered RNA isoforms, for instance, by inducing frameshifts or ablations of functionally important protein domains. If the variant strongly alters splicing isoform choice, the remaining abundance of functional RNA isoforms can be so reduced that the function of the gene is lost. Due to the relevance of splicing for variant interpretation, notably in rare disease diagnostics and in oncology, algorithms have been developed to predict whether variants affect splicing2,3,4,5,6,7,8,9. However, only recently, aberrant splicing events, that is, rare large alterations of splice isoform usage, have been called in human tissues10,11,12. While a method to a posteriori prioritize candidate causal rare variants for observed aberrant splicing events has been proposed12, the forward problem, that is, predicting among rare variants which ones will result in aberrant splicing, has not been addressed.

Here, we set out to establish models predicting whether a rare variant associates with aberrant splicing in any given human tissue. First, we assumed only DNA to be available and later on further considered complementary RNA-sequencing (RNA-seq) data of clinically accessible tissues (CATs) (Fig. 1).

Fig. 1: Study design and main findings.
figure 1

We set out to predict whether rare variants associate with aberrant splicing across 49 human tissues. a, We established a comprehensive benchmark for aberrant splicing by processing GTEx samples with a recently published aberrant splicing caller10 based on which we could assess and develop predictors that could take as input DNA sequence and, optionally, RNA-seq data of CATs. b, Benchmarking revealed modest performance of currently used algorithms based on DNA only, a substantial performance improvement when integrating these models with SpliceMap, a quantitative map of tissue-specific splicing we developed in this study, and further improvements when also including direct measures of aberrant splicing in accessible tissues.

Results

A benchmark dataset for aberrant splicing predictions

We created a benchmark using the aberrant splicing caller FRASER (Find RAre Splicing Events in RNA-seq)10 on 16,213 RNA-seq samples of the Genotype-Tissue Expression (GTEx) dataset, spanning 49 tissues and 946 individuals. Compared with other splicing outlier detection methods11,12, FRASER consistently showed the highest agreement with sequence-based predictors and was therefore subsequently used for our evaluations (Extended Data Fig. 1). For every individual, we considered every protein coding gene carrying at least one rare variant (minor allele frequency (MAF) less than 0.1% based on the Genome Aggregation Database (gnomAD)13 and found in no more than two individuals across GTEx) and set out to predict in which tissue, if any, is this gene aberrantly spliced. We defined a gene to be aberrantly spliced in a sample if it was called as a transcriptome-wide significant splicing outlier and with a sufficient amplitude (differential percent spliced-in (Ψ) larger than 0.3; Methods, and see Extended Data Fig. 1 for results with alternative cutoffs). Previous studies had reported that as many as 75% of aberrant splicing events in GTEx RNA-seq samples are not replicated across tissues10,12 and thus may reflect technical artifacts or aberrant splicing that is not genetically driven. We quantified the enrichment of replicated splicing outliers across tissues of the same individual with respect to the distance to the closest rare variant and found them to be enriched up to a distance of 250 base pairs (bp) (Extended Data Fig. 2). Therefore, we also required a rare variant to be less than 250 bp away from the boundaries of any intron associated with the aberrantly spliced splice site (Methods and Extended Data Fig. 3). This filter yielded similar results as filtering for replicated aberrant events with the extra advantage of being applicable to independent cohorts that have a single sample per individual (Extended Data Fig. 4).

State-of-the art sequence-based models poorly predict tissue-specific aberrant splicing

We then assessed the performance of two complementary state-of-the-art sequence-based deep learning models: modular modeling of splicing (MMSplice)3, which predicts quantitative usage changes of predefined splice sites within a 100-bp window of a variant, and SpliceAI2, which is independent of gene annotations and predicts creation or loss of splice sites within a 50-bp window of a variant (Extended Data Fig. 5). Using larger prediction window sizes for SpliceAI did not improve the results (Supplementary Fig. 1). For individuals with multiple rare variants on a gene, we retained the highest score of each model. Out-of-the-box application of MMSplice and SpliceAI showed a modest performance, with an overall precision of 8% for MMSplice and of 12% for SpliceAI at 20% recall, and an area under the precision–recall curve (auPRC) of 4% ± 1 percentage point across tissues for MMSplice and 5% ± 2 percentage points for SpliceAI.

Tissue-specific splicing annotations improve aberrant splicing predictions

We observed that many false predictions originated from inaccurate genome annotations. On the one hand, standard genome annotations are not tissue-specific, leading to false positive predictions. This includes predictions for genes that are not expressed in the tissue of interest, as for the gene TRPC6 in the brain (Fig. 2a), and, among expressed genes, predictions for exons that are not canonically used in the tissue, as for exon 2 of C2orf74 in the tibial nerve (Fig. 2b). On the other hand, many splice sites are missing from standard genome annotations14,15. These nonannotated splice sites are often spliced at a low level, yet can be strongly enhanced by variants (see Fig. 2c for an example) and are suspected to be a major cause of aberrant splicing16,17. To address all these issues, we created a tissue-specific splice site map, which we named SpliceMap, using GTEx RNA-seq data. SpliceMap excludes untranscribed splice sites and introns for each tissue and includes nonannotated splice sites and introns reproducibly observed among samples of the same tissue (Methods). The standard genome annotation GENCODE18 (release 38 of hg38) contains 244,189 donor sites and 235,654 acceptor sites, of which 93% were detected at least in one GTEx tissue (Fig. 2d). SpliceMap contains 168,004 ± 9,288 donor sites and 164,702 ± 8,950 acceptor sites per tissue (Extended Data Fig. 6). From this total, 7,060 ± 3,706 donor sites and 8,222 ± 3,740 acceptor sites were unannotated, with testis containing the maximum number of nonannotated donor and acceptor sites (29,673 and 29,911 respectively), in line with the unique transcriptional and splicing patterns of testis19,20. SpliceMap is robust to variations in sample size and to different split-read counting tools21,22 (Supplementary Fig. 2). Moreover, we found that currently available long-read RNA-seq data in GTEx23 were not yet sensitive enough24 to reliably identify nonannotated splice sites (Supplementary Fig. 2). Applying MMSplice on the tissue-specific splice sites defined by SpliceMap increased the precision of MMSplice to 13% at 20% recall (Fig. 2e), with a significantly higher auPRC consistently across tissues (Fig. 2f). Similarly, applying SpliceMap on SpliceAI increased precision to 22% at 20% recall.

Fig. 2: Tissue-specific splice site map improves prediction performance.
figure 2

ac, Sashimi plots showing RNA-seq read coverage (y axis) and the numbers of split reads spanning an intron indicated on the exon-connecting line (using pysashimi50) for instances illustrating the benefits of the SpliceMap annotation. For each instance, two individuals are displayed. The individual with the rare genetic variant (located at the dashed black line) is shown in the lower track (darker color). SpliceMap catalogs expressed genes and splice sites in each tissue and can thus help in identifying cases for which there is no variant effect in tissues not expressing the whole gene (a) or the exon (b) in proximity of the variant. Moreover, SpliceMap includes weak splice sites, which are spliced at a low level, but can be activated and create novel exons in the presence of a variant (c). d, Venn diagram comparing annotated splice sites in standard genome annotation (GENCODE release 38) and SpliceMap aggregating all GTEx tissues. e, Precision–recall curves comparing the overall prediction performance across all GTEx tissues (n = 49) of MMSplice applied to GENCODE splice sites, MMSplice applied to tissue-specific splice sites according to SpliceMap, SpliceAI and SpliceAI using tissue-specific SpliceMaps. f, Distribution of the auPRC across all GTEx tissues of the models in e. Center line, median; box limits, first and third quartiles; whiskers span all data within 1.5 interquartile ranges of the lower and upper quartiles. P values were computed using the paired one-sided Wilcoxon test. Alt, alternative; Ind, individual; Ref, reference.

Quantified reference isoform proportions improve aberrant splicing predictions

Variants affecting splicing typically associate with abundance ratio fold-changes of competing splicing isoforms, which result in nonlinear effects on isoform proportions according to the so-called scaling law of splicing25,26. For instance, starting from a 1:1 ratio between one splicing isoform and its alternative in a major allele background, a tenfold decrease leads to a 1:10 ratio, which amounts to around 40 percentage points decrease (from 50% to approximately 10%). However, the same ratio fold-change starting from a 1:10 ratio amounts to less than 1 percentage point decrease (Extended Data Fig. 7). Hence, the scaling law of splicing implies that the variation of isoform abundance between tissues in major allele background alone can explain some of the tissue-specific effects of variants on isoform proportion25, as exemplified with exon 7 of the gene TRPC6 (Fig. 3a). We estimated major allele background levels of alternative donor and acceptor splice site usage proportions for all introns and all tissues of SpliceMap (Extended Data Fig. 7). Integrating these reference levels further improved the MMSplice predictions by 1.6-fold consistently across tissues, and to a lesser extent the SpliceAI predictions (Fig. 3b,c and Methods). We suspect that MMSplice showed stronger relative improvement compared with SpliceAI because it models percent spliced-in of predefined splice sites and can integrate in a principled fashion reference levels using the scaling law. In contrast, SpliceAI models creation or loss of splice sites. We integrated reference levels with SpliceAI by applying filters (Methods). However, predicted activations of annotated splice sites and predicted deactivations of unannotated splice sites are already masked in SpliceAI, thereby qualitatively capturing the effect of using reference level filters for a large number of splice sites.

Fig. 3: Quantitative splicing levels further improve prediction performance.
figure 3

a, Sashimi plot of TRPC6 around exon 7 in lung and brain for two individuals, one carrying no rare variant in this region (control, upper tracks), and one carrying an exonic rare deletion (dashed line and lower tracks) associated with reduced splicing of exon 7. The donor sites of exon 6 and exon 7 compete against each other for splicing with the acceptor site of exon 8. For the control individual, the donor site of exon 7 is used 70% of the time in the lung, and only 11% of the time in the brain. The variant associates with a stronger difference (33 percentage points) in the lung than in the brain (1 percentage point). b, Precision–recall curve comparing the overall prediction performance on all GTEx tissues of SpliceAI, SpliceAI using SpliceMap, SpliceAI using SpliceMap along with quantitative reference levels of splicing, MMSplice using GENCODE annotation, MMSplice using SpliceMap annotation, MMSplice using SpliceMap annotation along with quantitative reference levels of splicing and the integrative model AbSplice-DNA. Different cutoffs are shown (SpliceAI, high: 0.8, medium: 0.5, low: 0.2; MMSplice (score absolute value), high: 2, medium: 1.5, low: 1; AbSplice-DNA, high: 0.2, medium: 0.05, low: 0.01). c, Distribution of the auPRC of the models in b across tissues (n = 49). Center line, median; box limits, first and third quartiles; whiskers span all data within 1.5 interquartile ranges of the lower and upper quartiles. P values were computed using the paired one-sided Wilcoxon test. d, Model performance across different VEP51 variant categories. Categories are ordered from left to right by decreasing severity. Each annotated variant is labeled by its most severe category. The ‘Exon’ category consists of the VEP categories stop gained, stop lost, missense and synonymous. e, Model performance across nonexclusive outlier outcome categories (Methods). For panels d and e, the ‘All’ category contains all unique variants (independent of the VEP annotation and outlier outcome categories) and n is the number of variants associated with outliers.

AbSplice-DNA predicts the probability that a variant causes aberrant splicing in a given tissue

Next, to leverage the complementarity of MMSplice and SpliceAI predictions7, we trained a generalized additive model using the scores from both deep learning models as well as annotation features from tissue-specific SpliceMaps (Methods). This model, which we call AbSplice-DNA, achieved an additional 1.5-fold improvement (Fig. 3b,c). The AbSplice-DNA scores are probability estimates which we found to be well calibrated on GTEx (Extended Data Fig. 8). AbSplice predicts for each variant how likely aberrant splicing of some sort takes place in a given tissue and reports the splice site with the strongest effect (see Supplementary Table 1 for an example). To ease downstream applications we suggest three cutoffs (high: 0.2, medium: 0.05, low: 0.01), which have approximately the same recalls as the high, medium and low cutoffs of SpliceAI (Fig. 3b).

We also tested integration of other predictors into AbSplice-DNA by including scores from Combined Annotation Dependent Depletion-Splice (CADD-Splice)7, Multi-tissue Splicing (MTSplice)9 and Super Quick Information-content Random-forest Learning of Splice variants (SQUIRLS)8 (Methods). However, those models only led to minor improvements (Extended Data Fig. 9). We decided to incorporate only MMSplice and SpliceAI into the final model so as not to have a model confounded by conservation information (used by CADD-Splice and SQUIRLS), and to keep the possibility to easily integrate new tissues which would not be the case with MTSplice. Nevertheless, the code of AbSplice can easily be modified to incorporate new features. We also tried random forest and logistic regression as alternative machine learning models, which gave similar performances to the generalized additive model (Methods and Extended Data Fig. 9).

We evaluated the model performances in more detail by stratifying the results on two different scenarios. First, we stratified by variant categories. As expected, the precision was the best on variants affecting the donor and acceptor dinucleotides on all models, followed by variants in the splice region (within 1–3 bases of the exon or 3–8 bases of the intron), then in the exonic, and lastly in the intronic regions (Methods and Fig. 3d). AbSplice-DNA outperformed all models throughout all variant categories, including intronic variants, whose effects are notoriously more difficult to predict. Second, we analyzed the model performance for five nonexclusive aberrant splicing outcomes: exon elongation, exon truncation, exon skipping, any alternative donor or acceptor choice outlier, and any splicing efficiency outlier. AbSplice-DNA performed better for exon skipping than for exon elongation and truncation, as well as better for alternative donor or acceptor choice than for splicing efficiency outliers. Moreover, AbSplice-DNA outperformed all other models throughout all investigated outlier outcome categories (Fig. 3e).

AbSplice-DNA performance is confirmed on independent data

Having established our model on GTEx, we next assessed how well the performance replicated in independent cohorts. We first evaluated a dataset consisting of RNA-seq samples from skin fibroblasts of 303 individuals with a suspected rare mitochondriopathy27. We found that there was a large overlap (86%) of splice sites in SpliceMaps generated from GTEx fibroblasts and from this cohort (Fig. 4a and Supplementary Fig. 3). Moreover, we observed consistent reference levels of splicing between the two datasets (Fig. 4b, Pearson correlation 0.87). We applied AbSplice-DNA trained on GTEx using the SpliceMap from GTEx fibroblasts on the subset of this data for which whole genome sequencing (WGS) was available (n = 20) and used aberrant splicing calls performed on the RNA-seq samples to assess the predictions. The relative improvements between the baseline models and AbSplice-DNA replicated. AbSplice-DNA achieved 13.2 ± 1.5% auPRC, 2.5-fold higher than SpliceAI or MMSplice alone (Fig. 4c). From a rare variant prioritization standpoint, AbSplice-DNA typically gave about twofold fewer candidate predictions at the same level of recall than SpliceAI, itself comparing favorably over MMSplice (Supplementary Fig. 4). Hence, AbSplice-DNA can help rare disease diagnostics by providing substantially shorter lists of predicted candidate variants to investigate compared with state-of-the–art sequence-based models.

Fig. 4: Application of AbSplice-DNA on independent data.
figure 4

a, Venn diagram comparing the splice sites in the SpliceMap generated from fibroblasts from a mitochondrial disease dataset (n = 303) and GTEx (n = 492). b, Correlation of the reference Ψ values from the union of introns of the SpliceMaps from a. For the union of introns: n = 736,503, Pearson correlation = 0.87, R2 = 0.74, where reference Ψ of nonintersecting introns was set to zero. For the intersection of introns: n = 522,876, Pearson correlation = 1.0, R2 = 0.99. c, AuPRC for classification of aberrant splicing events from rare variants in the mitochondrial disease dataset for SpliceAI, MMSplice and AbSplice-DNA trained on GTEx using the GTEx fibroblasts SpliceMap from a. Error bars represent s.e.m. (Jackknife over samples, n = 20). d, Enrichment of high-score predictions in ALS genes (n = 165). Cutoffs are for SpliceAI (high: 0.8), MMSplice (high: 2) and AbSplice-DNA (high: 0.2). The sample size n in the x-axis labels corresponds to the total number of predictions above the cutoff. P values were computed using one-sided Fisher tests considering all protein coding genes as the universe. e, Proportion of rare variants that pass the high cutoffs described in d for MMSplice with GENCODE annotation, SpliceAI and AbSplice-DNA trained on GTEx and using GTEx brain SpliceMaps as well as a SpliceMap from ALS motor neurons, validated using proteomics (Z-score < −2; Methods) in the ALS dataset. The sample size n in the y-axis labels corresponds to the total number of predictions above the cutoff. Error bars represent 95% CIs from the binomial test. f, Genome-wide depletion of high-impact variants among rare SNVs (gnomAD MAF < 0.1%) within a gene (n = 19,534) as a function of LOEUF score deciles. High-impact variants are defined by a SpliceAI score > 0.8, MMSplice score > 2 (absolute score) and an AbSplice-DNA score > 0.2 in at least one tissue. Asterisks mark significance levels of two-sided Fisher tests of AbSplice-DNA compared with SpliceAI (*<0.05, **<10−4, ***<10−8). NS, not significant.

We next considered a cohort of WGS samples paired with RNA-seq and proteomics data of induced pluripotent stem cell (iPSC)-derived spinal motor neurons from 245 amyotrophic lateral sclerosis (ALS)-affected and 45 healthy individuals from the Answer ALS project (Methods). As iPSC-derived spinal motor neurons were not profiled in GTEx, we considered two approaches. On the one hand, we used the Answer ALS healthy controls to generate a SpliceMap for iPSC-derived spinal motor neurons. On the other hand, we used the SpliceMap of GTEx brain tissues as a proxy which showed the highest overlap from all GTEx tissues (Supplementary Fig. 5). We found that the GTEx SpliceMap from brain tissues agreed reasonably well with the one derived from this cohort both qualitatively (76% shared splice sites) and quantitatively (Pearson correlation 0.86; Supplementary Fig. 5). Here, too, AbSplice-DNA outperformed SpliceAI and MMSplice. Interestingly, AbSplice-DNA achieved similar performances using the SpliceMap from GTEx brain tissues or using the SpliceMap from iPSC-derived spinal motor neurons, suggesting that AbSplice-DNA can be applied robustly in absence of control samples using SpliceMaps from proxy tissues (Supplementary Fig. 6). Moreover, AbSplice-DNA predictions were enriched for genes associated with ALS28,29,30,31,32 (threefold enrichment; Fig. 4d), which was less so for MMSplice predictions and not the case for SpliceAI predictions. We further validated AbSplice-DNA using proteomics data available for this cohort. At our recommended cutoff, AbSplice-DNA predicted 58 genes to be aberrantly spliced, of which 31% (18 of 58; 95% confidence interval (95% CI), 20–45%) of the corresponding proteins showed significantly low abundance (Z-score < −2; Methods), consistent with RNA degradation via nonsense-mediated decay or protein isoforms resulting from aberrant splicing events that are more poorly translated or less stable. Similarly, independent confirmation by proteomics led to validation rates of MMSplice (26 of 183; 95% CI, 9–20%) and SpliceAI (17 of 80; 95% CI, 13–32%) consistent with the validation rates we originally observed at those cutoffs using the GTEx RNA-seq benchmark (Fig. 3b). Altogether, the proteomics analyses confirm the relative improvements of the different models and are overall consistent with our precision estimates.

Furthermore, we applied AbSplice-DNA to 203,306,868 rare variants (MAF < 0.1%) from the gnomAD dataset using SpliceMaps from all GTEx tissues. In highly constrained genes, defined as the 10% of genes most strongly depleted for loss-of-function variants in gnomAD13, rare variants were more strongly depleted for high AbSplice-DNA scores in at least one tissue (3.4-fold depletion), than for high SpliceAI scores (2.9-fold depletion, P < 10−21; Fig. 4f) or high MMSplice scores (2.2-fold depletion). A stronger depletion than with SpliceAI or MMSplice also held when relaxing the AbSplice-DNA cutoff to match the total number of predictions of SpliceAI (Supplementary Fig. 7).

Collectively, these results on independent data demonstrate the robustness and the applicability of AbSplice-DNA and suggest its utility for rare disease diagnostics and rare variant interpretation.

AbSplice-RNA incorporates RNA-seq from CATs

Sequencing transcriptomes of CATs such as skin or body fluids is of increasing interest in rare disease research as it allows direct detection of aberrant splicing for those splice sites used both in the CAT and in tissues of suspected disease relevance16,33,34,35. The GTEx dataset consists of post-mortem-collected RNA-seq samples across a vast variety of tissues and thereby offers the unique opportunity to evaluate to what extent aberrant splicing in an accessible tissue reflects aberrant splicing of another tissue of interest. One positive example in GTEx is aberrant splicing of DDX27 in the heart which can also be observed in skin fibroblasts (Fig. 5a). Consistent with a previous study35 based on the Ensembl gene annotation36, we found that among the CATs, fibroblasts have the highest overlap of transcribed splice sites according to SpliceMap with nonaccessible tissues, followed by lymphocytes and whole blood (Fig. 5b). To predict aberrant splicing in nonaccessible tissues, we considered ranking genes of an individual first for showing significant and large aberrant splicing in a CAT (false discovery rate (FDR) < 0.1 and |ΔΨ| > 0.3) and then by significance level. This simple method yielded a markedly increased precision compared with the DNA-based models, up to nearly 40% recall (Fig. 5c and Extended Data Fig. 10a). However, RNA-based predictions remain limited to those splice sites expressed and spliced in the CAT. Therefore, we next trained models integrating AbSplice-DNA features together with RNA-seq-based features from CATs, including differential splicing amplitude estimates to leverage the splicing scaling law and the SpliceMaps (Methods). These models, which we call AbSplice-RNA, outperformed all other models (Fig. 5c and Extended Data Fig. 10a). We found that using fibroblasts only led to the same performance as using all CATs, reaching around 60% precision at 20% recall and amounting to a twofold improvement over AbSplice-DNA (Fig. 5c and Extended Data Fig. 10b). Those improvements were consistent across target tissues (Fig. 5d). As expected, AbSplice-RNA outperformed AbSplice-DNA for genes expressed in CATs and remained on par with it otherwise (Extended Data Fig. 10c). Altogether, these results establish a formal way to integrate direct measurements of aberrant splicing along with sequence-based models to predict aberrant splicing in a tissue of interest.

Fig. 5: Integrating RNA-seq data of CATs to predict aberrant splicing in nonaccessible tissues.
figure 5

a, Sashimi plot of DDX27 around exon 10 for two individuals in heart and fibroblasts. One individual carries no rare variant in this region (control, upper tracks), and one carries an exonic rare variant (dashed line, lower tracks) associated with increased splicing of exon 10. This exon shows a similar usage in fibroblasts and in the heart (reference donor site percent spliced-in, Ψ3 = 8%, according to SpliceMap in both tissues, in line with the measured values for the displayed control individual: Ψ3 = 6% in heart and Ψ3 = 5% in fibroblasts). The effect associated with the variant in fibroblasts approximates well the one in heart (difference of donor site usage, ΔΨ3 = 50% in heart and 37% in fibroblasts). In this case, aberrant splicing can be directly detected from the accessible tissue. b, Proportion of splice sites used in GTEx clinically nonaccessible target tissues (rows) also used in GTEx CATs (columns). c, Precision–recall curve comparing the overall prediction performance on all GTEx tissues of SpliceAI, MMSplice using GENCODE annotation, AbSplice-DNA, gene-level FRASER P values in fibroblasts and AbSplice-RNA, which integrates AbSplice-DNA features with features from RNA-seq from fibroblasts. d, Distribution of the auPRC of the models in c across tissues (n = 49). Center line, median; box limits, first and third quartiles; whiskers span all data within 1.5 interquartile ranges of the lower and upper quartiles; P values were computed using the paired one-sided Wilcoxon test.

Discussion

We established a comprehensive benchmark for predicting variants leading to aberrant splicing in human tissues, revealing limited performance of state-of-the-art sequence-based models. We created a tissue-specific splicing annotation (SpliceMap) based on GTEx which maps acceptor and donor splice sites and quantifies their usage in 49 human tissues. We showed that integrating SpliceMap with DNA-based prediction models leads to a threefold increase of precision at the same recall. Additionally, we found that RNA-seq from CATs complements DNA-based splicing predictions when incorporated into an integrative model.

The prediction of splicing-perturbing variants has a long history of over 20 years’ work2,3,4,5,6,7,8,9,26,37,38,39,40,41,42,43,44. This includes tissue-specific models for mouse43,44 and more recently human9,41. Those models showed successes in various splicing prediction tasks, such as quantitative change of percent spliced-in, splice site usage or splicing efficiency. This study mainly focuses on the prediction of extreme splicing effects (outliers), which has not yet been assessed. This modeling task could be investigated only now, after the development of aberrant splicing callers10,11,12 which enabled the establishment of a ground truth for splicing outlier prediction. We foresee that the paradigm of predicting extreme effects in splicing from DNA could be an inspiration for future research and further be extended to aberrant expression or protein abundance prediction. Furthermore, the large multi-tissue cohorts provided by GTEx allowed us to assess and develop tissue-specific predictors. Using aberrant splicing predictions for tissues that are mechanistically related to the disease of interest may prove to be helpful to identify the effector gene, just as tissue-specific predictions are important for transcriptome-wide association studies45.

Some splicing variant effect prediction models leverage conservation as further evidence of the functional relevance of a variant7,8. Even though conservation is a strong indicator of function, we decided not to include conservation in our final model, as variants causing aberrant splicing do not necessarily have to reside on conserved regions. Moreover, conservation depends on the functional importance of the gene. A nucleotide strongly affecting splicing of a nonconserved gene may be less conserved than a nucleotide with a milder effect on splicing located in a highly conserved gene. Also, a nucleotide can be conserved due to its other potential roles besides splicing. For example, exonic regions near splice sites might be conserved due to their role in protein function. Altogether, even though conservation could still marginally yet significantly improve our model (Extended Data Fig. 9), we opted to provide to the community a model predicting aberrant splicing per se by integrating models solely trained on DNA sequence and splicing metrics measured from RNA-seq or massively parallel reporter assays (MPRAs) (SpliceAI and MMSplice). AbSplice users can still benefit from conservation evidence in post-processing steps to further prioritize variants.

We constructed SpliceMaps and detected aberrant splicing from short-read RNA-seq. We found that current long-read RNA-seq data available for GTEx23 did not provide sufficient coverage to detect unannotated splice sites (Supplementary Fig. 2). Since split short reads reveal splice sites, we foresee that the major added value of long-read over short-read sequencing is not about calling splice sites but identifying the complete RNA isoforms. This could be used in the future to develop models predicting the exact splicing outcome (for example, exact elongated or truncated exon boundaries, exon combinations and so on) caused by the variant, which is beyond the scope of current models trained primarily on short-read data.

We showed how RNA-seq of CATs effectively complements DNA-based predictions. An alternative to this approach is to reprogram or transdifferentiate cells into the suspected mechanistically involved cell type and perform RNA-seq on them46. This approach has, however, important caveats. First, it is not ensured that the suspected mechanistically involved cell type is the correct one, as symptoms may manifest more strongly in downstream affected tissues. Second, this approach is cost, time and labor intensive. Third, cell reprogramming can induce and select mutations which may lead to false identifications. Therefore, predictive models that can leverage RNA-seq of CATs will probably remain relevant in practice47. Furthermore, RNA-seq reveals the consequence of the splicing defect on the resulting transcript isoform (for example, frameshift or exon truncation), which is crucial for diagnostics.

By increasing the precision at 20% recall from about 10% to 60%, the cumulative improvements of our models are substantial. Still, a majority of the aberrant splicing events are not recalled and there remains a majority of false positives. An unknown and potentially large fraction of events that are not recalled might be aberrant splicing calling artifacts, as suggested by the high number of singleton calls. In this study, we implemented strategies aiming at improving the proportion of genuine genetically driven aberrant splicing events in the ground truth while not introducing biases favoring particular models (Extended Data Figs. 24). However, every classification task is founded on a reliable ground truth. As splicing is a complex process and not all aberrant events can be reliably called by state-of-the art aberrant splicing callers, the ground truth in the prediction task remains a proxy. Progress in aberrant splicing calling or better understanding of the technical reasons could reduce the number of incorrectly called aberrant splicing events and improve the recall. Moreover, some of the apparent false positive predictions may be actually correct. This is the case when the aberrant splicing isoform contains a premature termination codon and, often, though not systematically48, gets rapidly degraded by nonsense-mediated decay. Rapidly degraded isoforms barely have any reads in RNA-seq data and hence are typically not detected by aberrant splicing callers. In diagnostic applications, those variants remain relevant. Moreover, dedicated experiments can be done to test whether aberrant splicing is taking place, for instance, using the translation inhibitor cycloheximide.

As WGS becomes more readily available in research and healthcare, there is a growing need for accurate annotation of noncoding variants with strong deleterious effects for establishing genetic diagnostics of rare disorders, identifying effector genes of common diseases and more precisely stratifying patients with cancer based on their tumor genetic alterations. Variants causing aberrant splicing are not only a major class of such noncoding loss-of-function variants, but their mechanisms of action also now become targetable for an increasingly rich therapeutic arsenal49. Hence, because of its high precision and its focus on extreme events, we foresee AbSplice to be instrumental for genome-based diagnostics and therapy design.

Methods

Ethics statement

No primary data were generated for this study. Person-related data were obtained through authorized access from primary data controllers. The study adheres to the ethical and research agreements between the Technical University of Munich and the primary data controllers. All participant informed consents were collected by and remain with the primary data controllers.

Statistics and reproducibility

No statistical method was used to predetermine sample size. We did not use any study design that required randomization or blinding. In the GTEx data we excluded tissues with fewer than 50 samples. In the ALS and mitochondrial disease datasets we did not exclude any samples.

Datasets

GTEx

We downloaded the RNA-seq read alignment files (BAM files) and the variant calling files (VCF files) from WGS from GTEx v8p (hg38) from the database of Genotypes and Phenotypes (dbGaP) (study accession: phs000424.v8.p2). We used data from 946 individuals with paired WGS and RNA-seq measurements (n = 16,213) in at least one tissue. For the long-read RNA-seq data, we downloaded the transcript annotation (GTF) generated by FLAIR52 based on 88 Nanopore samples from the GTEx portal.

Mitochondrial disease dataset

The dataset consists of 303 patients with mitochondriopathy described by Yépez et al.27, all of which have RNA-seq from skin-derived fibroblasts. For 20 individuals, WGS is also available.

ALS dataset

The dataset consists of WGS, RNA-seq and proteomics data from 245 individuals diagnosed with ALS and 45 control samples. RNA-seq data were obtained from iPSC-derived spinal motor neurons. We downloaded the data from the Answer ALS portal (dataportal.answerals.org). Genes known to be involved in ALS disease development were manually curated from literature28,29,30,31,32.

Data preprocessing

Rare variants

Variants had to be supported by at least ten reads and had to pass the conservative genotype-quality filter of GQ ≥ 99. These criteria were used for single nucleotide variants (SNVs) and indels in the same way. We considered a variant to be rare if it had an MAF in the general population ≤0.001 based on gnomAD (v.3.1.2) and was found in at most two individuals within each cohort.

Splicing outlier detection

Splicing outliers were called using FRASER10 (v.1.6.0) as implemented in the Detection of RNA-seq Outliers Pipeline53 (v.1.1.2). FRASER was used to detect introns (including de novo introns) and to count split reads for each intron. Based on the split-read counts, three intron-centric metrics were calculated: alternative acceptor usage with the ψ5 metric, alternative donor usage with the ψ3 metric, and splicing efficiencies as defined with the θ5 and θ3 metrics54:

$$\Psi _5(D,A) = \frac{{n(D,A)}}{{\mathop {\sum}\nolimits_{A^\prime } {n(D,A^\prime )} }} = \frac{k}{{N_5}}$$
$$\Psi _3(D,A) = \frac{{n(D,A)}}{{\mathop {\sum}\nolimits_{D^\prime } {n(D^\prime ,A)} }} = \frac{k}{{N_3}}$$
$$\theta _5 = \frac{{\mathop {\sum}\nolimits_{A^\prime } {n(D,A^\prime )} }}{{n(D) + \mathop {\sum}\nolimits_{A^\prime } {n(D,A^\prime )} }}$$
$$\theta _3 = \frac{{\mathop {\sum}\nolimits_{D^\prime } {n(D^\prime ,A)} }}{{n(A) + \mathop {\sum}\nolimits_{D^\prime } {n(D^\prime ,A)} }}$$

where k is the number of split reads supporting the intron from donor D to acceptor A. The sum in the denominator of ψ5(D,A) goes over all possible acceptors A′ for donor D, and the sum in the denominator of ψ3(D,A) goes over all possible donors D′ for acceptor A. In the splicing efficiencies, the denominator contains n(D) or n(A) which are the numbers of nonsplit reads spanning the exon–intron boundary of donor D or acceptor A, respectively. The advantage of these intron-centric metrics over the exon-centric metric percent spliced-in (ψ) is that they do not require exons to be mapped, which is an ill-defined task when starting from short-read RNA-seq data.

FRASER models these metrics while controlling for latent confounders and reports both splice-site-level and gene-level FDRs. We called aberrant spliced genes using the gene-level FDR < 0.1 as in Mertes et al.10 Furthermore, we requested the gene to contain at least one significant splice site (FDR < 0.05, FRASER default) supported by 20 reads and with an absolute deviation of ψ5,3 from the FRASER-modeled expected value larger than 0.3 (denoted |Δψ5,3| > 0.3). The same filters were applied to the splicing efficiency metrics.

To discard aberrant splicing calls that probably have no genetic basis10, we additionally applied and compared different filtering methods (Extended Data Fig. 4). In the GTEx dataset, where multiple RNA-seq samples from the same individual are available, we investigated including splicing outliers from at least two tissues from the same individual (Filter 2; Extended Data Fig. 4b). Here, a gene-level outlier was considered to be replicated if the same splice-site-level outlier was detected in multiple tissues. As this strategy cannot be applied to other single-tissue datasets, we alternatively filtered for splicing outliers containing a rare variant in the vicinity of ±250 bp of every splice site based on RNA-seq from the sample (Filter 3; Extended Data Fig. 4c). Importantly, this filter was applied to all splice sites identified by FRASER, which includes both annotated splice sites as well as cryptic splice sites (Extended Data Fig. 3). For consistency, all reported results are based on Filter 3.

Aberrant splicing prediction benchmark

Aberrant splicing prediction task

The task is to predict whether a protein coding gene with one or more rare variants within the gene body is aberrantly spliced in a given tissue of an individual.

Performance evaluation metric

Due to the large class imbalance in the splicing outlier prediction benchmarking dataset, we chose to evaluate models using precision–recall curves. As evaluation metric we used the auPRC, computed using the average precision (AP) score55 (which represents the mean of precisions for each threshold weighted by the recall difference):

$${\mathrm{AP}} = \mathop {\sum}\limits_n {\left( {R_n - R_{n - 1}} \right)P_n}$$

where Pn and Rn are the precision and recall at the nth threshold.

Tissue-specific SpliceMap

For each tissue separately, we created a SpliceMap that lists all active introns along with aggregate statistics about acceptor and donor site usage useful for aberrant splicing prediction purposes.

Active introns

We started from all introns reported by FRASER. We filtered out untranscribed splice sites and background noise by filtering out introns not supported by any split-read in more than 95% of the samples. For this and other operations involving genomic ranges we used PyRanges56 (v.0.0.115).

Aggregate statistics

Aggregate statistics were calculated on donor and acceptor sites independently. For donor site usage, the SpliceMap aggregate statistics are (1) the total number of split reads across samples (s) supporting the intron (Σsk), (2) the total number of split reads across samples sharing the same acceptor site (ΣsN3), (3) the median number of split reads per sample sharing the same acceptor site and (4) the reference isoform proportion (\(\psi _3^{\mathrm{ref}}\)), defined as \(\psi _3^{\mathrm{ref}} = \frac{{\mathop {\sum}\nolimits_s k }}{{\mathop {\sum}\nolimits_s {N_3} }}\). Aggregate statistics were computed analogously for acceptor site usage.

Exclusion of rare variant data in aggregate statistics

To prevent information leakage, the aggregate statistics were computed so that they do not contain information about splicing events associated with rare variants (specifically, we excluded from the computations of the aggregate statistics data from samples with a rare variant within ±250 bp of any donor or acceptor site).

SpliceMap generation using alternative counting tools

SpliceMaps were also created from split-read (introns) counts using Regtools22 (v.0.5.2) and STAR21 (v.2.5.3) for the ‘Skin – Not Sun Exposed (Suprapubic)’ tissue. We ran Regtools using BAM files. Regtools performs annotation-free counting; thus, it also calls unannotated introns and splice sites. We downloaded STAR split-read counts from the GTEx portal. The GTEx pipeline filters unannotated splice sites, although the STAR two-pass approach could call unannotated splice sites and introns. During SpliceMap generation, active introns and aggregate statistics were computed as described above.

Aberrant splicing prediction models

SpliceAI

SpliceAI2 (v.1.3.1) is a deep learning model that predicts splice site alteration for acceptor and donor sites from sequence. SpliceAI is annotation free and can therefore score all variants including cryptic splice sites created by deep intronic variants. SpliceAI provides precomputed scores for all SNVs and indels up to the length of 4 nucleotides. These variant scores were computed with 50 bp as the maximum distance between the variant and gained/lost splice sites. We downloaded precomputed variant scores from Illumina BaseSpace and stored them in a RocksDB57 (v.6.10.2) key-value database for fast lookup. We ran SpliceAI to obtain variant scores for long indels not available in the database. Also, we used masked scores of SpliceAI as recommended by the authors for variant interpretation. This masking sets Delta scores to zero if SpliceAI predicts activation for annotated splice sites and deactivation for unannotated splice sites.

SpliceAI + SpliceMap

We used tissue-specific splice site annotations from SpliceMap together with SpliceAI predictions. For each tissue, we retained those variant scores that contained an annotated splice site within a 100-bp window.

SpliceAI + SpliceMap + ψ ref

As SpliceAI was trained to predict creation or loss of splice sites and not ψ, there is no principled way to apply the splicing scaling law to include reference levels. Therefore, we used reference levels only to filter predictions. Analogously to the masking of scores representing annotated acceptor/donor gain and unannotated acceptor/donor loss performed by the authors of SpliceAI, we used tissue-specific ψref values for filtering. Specifically, variant scores associated with acceptor/donor gain and a splice site with ψref ≥ 0.95 as well as with acceptor/donor loss and a splice site with ψref ≤ 0.05 were filtered out.

MMSplice

MMsplice3 (v.2.3.0) is a deep learning model that predicts the impact of a variant (in a 100-bp window of annotated splice sites) on alternative usage of a nearby donor or acceptor site. MMSplice predicts the effect of a variant in log-odds ratios (denoted Δlogitψ5 or Δlogitψ3). MMSplice requires a splice site annotation. We used the GENCODE (release 38 of hg38) annotation.

MMSplice + SpliceMap

We ran MMSplice on tissue-specific splice site annotations from SpliceMap.

MMSplice + SpliceMap + ψ ref

MMSplice is a quantitative model predicting percent spliced-in for which the splicing scaling law can be leveraged to integrate reference levels. For conversion of the variant effect into natural scale, reference levels of donor site and acceptor site usages are required. For the sake of shorter notations, we write in the following ψ instead of ψ5 and ψ3. We used MMSplice to predict Δlogit(ψ) values. Δlogit(ψ) values were then combined with the corresponding reference ψ value (ψref) in SpliceMap: first in logit scale to adjust the predicted variant effect by MMSplice to the correct reference level; then in natural scale by using the sigmoid function (Extended Data Fig. 7a):

$$\Delta {\mathrm{logit}}(\varPsi) = {\mathrm{logit}}(\varPsi_{\mathrm{alt}}) - {\mathrm{logit}}(\varPsi_{\mathrm{ref}})$$
$$\hat \varPsi _{\mathrm{alt}} = \sigma \left( {\Delta {\mathrm{logit}}\left(\varPsi \right) + {\mathrm{logit}}\left( {\varPsi_{\mathrm{ref}}} \right)} \right)$$
$$\Delta {\hat{\varPsi}}= \hat \varPsi _{\mathrm{alt}} - {\varPsi}_{\mathrm{ref}}$$
$$\sigma ^{ - 1} = {\mathrm{logit}}$$

Variants further away than 100 bp from any SpliceMap splice site were scored 0 (no effect).

MTSplice

MTSplice9 (v.2.3.0) is a tissue-specific version of MMSplice. The model scores each exon–variant pair for 56 tissues. With respect to each annotated exon boundary, the model takes as input a sequence of 100 bp in the exon and 300 bp in the intron. MTSplice predicts the tissue-specific effect of a variant in log-odds ratios (denoted Δlogit(ψ)). MTSplice requires a splice site annotation. We used the GENCODE (release 38 of hg38) annotation.

CADD-Splice

CADD-Splice7 is an ensemble model that combines CADD scores (contains conservation scores) together with splicing predictions from SpliceAI and MMSplice. We ran CADD-Splice v.1.6. CADD-Splice provides raw and PHRED-scaled scores. We used the PHRED score.

SQUIRLS

SQUIRLS8 is based on engineered splicing features for donor and acceptor sites that are extracted from a genome annotation. SQUIRLS predicts the probability of a variant to alter the splicing pattern. We downloaded the SQUIRLS database v.2203 and ran SQUIRLS v.2.0.0.

AbSplice-DNA

AbSplice-DNA is a generalized additive model, namely the ExplainableBoostingClassifier from the python package interpretml58. Similar performance was achieved using a random forest or logistic regression model from scikit-learn55. The features of AbSplice-DNA were the prediction score from MMSplice + SpliceMap, MMSplice + SpliceMap + ψref, the SpliceAI Delta score and a binary feature from SpliceMap indicating if the splice site is expressed in the target tissue (using a cutoff of 10 reads for the median number of split reads sharing the splice site). The model includes interaction terms, thereby de facto capturing the effect of combining SpliceMap with SpliceAI scores. The model was trained on a variant level using outliers within 250-bp distance of rare variants as ground truth (Extended Data Fig. 4c before aggregation to gene level). The model was trained with fivefold-stratified cross-validation, grouped by individuals to avoid information leakage, and such that the proportions of the negative (variant is associated with no outlier on the gene) and positive (variant is associated with an outlier on the gene) classes were preserved in each fold.

Predictors using RNA-seq from CATs

We used different features from RNA-seq of three CATs from GTEx (Whole blood, Cells transformed fibroblasts, and Cells Epstein-Barr virus (EBV)-transformed lymphocytes) to predict aberrant splicing in nonaccessible target tissues.

As one predictive feature we used the −log10 nominal gene-level P values obtained using FRASER. In the benchmark, we ranked all splicing outlier genes (FDR < 0.1 and |Δψ| > 0.3) lower than the remaining genes, and further ranked genes within each of these two groups by increasing P value.

Additionally, we used SpliceMaps from the accessible and the nonaccessible tissues together with ψ measurements from RNA-seq and applied the splicing scaling law to infer Δψ values in the nonaccessible target tissue:

$$\Delta {\mathrm{logit}}\left( \varPsi \right) = {\mathrm{logit}}\left( {\varPsi^{\mathrm{CAT}}} \right) - {\mathrm{logit}}\left( {\varPsi_{\mathrm{ref}}^{\mathrm{CAT}}} \right)$$
$$\varPsi ^{{\mathrm{target}}} = \sigma \left( {\Delta {\mathrm{logit}}\left( \varPsi \right) + {\mathrm{logit}}\left( {\varPsi_{\mathrm{ref}}^{{\mathrm{target}}}} \right)} \right)$$
$$\Delta \varPsi ^{{\mathrm{target}}} = \varPsi ^{{\mathrm{target}}} - \varPsi _{\mathrm{ref}}^{{\mathrm{target}}}$$

where ΨCAT is the splicing level in the CAT and \(\varPsi _{\mathrm{ref}}^{\mathrm{CAT}}\) is the reference level of splicing obtained from SpliceMap, and the difference of these two values provides the tissue unspecific variant effect, Δlogit(Ψ). Then, adding Δlogit(Ψ) with the reference level of splicing of the target tissue \({\mathrm{logit}}\left( {\varPsi_{\mathrm{ref}}^{{\mathrm{target}}}} \right)\) in logit scale and converting back to natural scale provides Ψtarget in the target tissue. Subtracting the reference level of splicing of the target tissue, \(\varPsi _{\mathrm{ref}}^{{\mathrm{target}}}\), provides the predicted splicing change in the target tissue, ΔΨtarget, using RNA-seq measurements in CAT.

All precision–recall curves involving CATs have been computed on a subset of the data, excluding CATs from the target tissues and only containing individuals that have RNA-seq measurements from multiple tissues (including the CAT).

AbSplice-RNA

We trained integrative models using the two predictors from RNA-seq data from CATs described above in addition to DNA-based features used in AbSplice-DNA.

We trained AbSplice-RNA models using a single CAT and all CATs together. For the model using all CATs together we trained AbSplice-RNA in a CAT-agnostic manner such that the model predicts outliers regardless of the CAT source. This might be helpful in a diagnostic setting as it might be that the available CAT differs from the CATs that AbSplice-RNA was trained on.

Gene-level aggregation

For genes with multiple variants, we retained the largest score per model.

Model performance per variant and outlier category

Variant categories were annotated with the Ensembl Variant Effect Predictor (VEP)51. For each variant, the most severe VEP annotation was considered. For the ‘Exon’ category, the following VEP categories were grouped together: synonymous_variant, missense_variant, stop_lost, stop_gained. For the nonexclusive splicing outlier categories, we defined ‘exon elongation’, ‘exon truncation’, ‘exon skipping’ using FRASER’s branch: https://github.com/c-mertes/FRASER/tree/junction_annotation ref. 59. We defined the category ‘Any alternative donor or acceptor choice’ as any ψ5 or ψ3 outlier, and the category ‘Any splicing efficiency outlier’ as any θ outlier.

Enrichment in known ALS genes

The enrichment of 165 manually curated genes involved in ALS28,29,30,31 was computed as the proportion of high-splicing-impact variants within those genes, divided by all the high-score predictions of the respective models. Depletion was computed as 1/enrichment.

Proteomics in ALS

We downloaded the protein intensities matrix from the ALS cohort consisting of 4,442 proteins and 204 samples from the Answer ALS portal. We considered the 178 affected individuals. Proteins with missing values in more than 30% of the samples were filtered out, with 3,329 remaining. We then ran PROTRIDER60, a denoising autoencoder-based method to detect outliers on proteomics data. The encoding dimension was optimized by injecting outliers. No covariates were provided. Z-scores were extracted from the results table.

Depletion in loss-of-function intolerant genes

For all possible rare SNVs (gnomAD MAF < 0.1%) in 19,534 protein coding genes, we computed AbSplice-DNA scores and obtained the SpliceAI precomputed scores from Illumina BaseSpace. The loss-of-function observed/expected upper bound fraction (LOEUF) scores were downloaded from https://gnomad.broadinstitute.org/downloads. For each LOEUF decile we computed the proportion of high-splicing-impact variants to the total sum of high-impact variants and divided it by the proportion of rare variants in each decile.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.