Main

Since the first large genome-wide association studies1 were carried out in 2007, there has been a steady increase in sample size—now reaching hundreds of thousands of individuals—which is enabled by a parallel stream of methods with ever-increasing computational efficiency. Initial methods used simple linear or logistic regression using programs such as SNPTEST1 and PLINK2, but these have largely been replaced by the use of linear mixed models (LMMs) and the closely related whole-genome regression models. These approaches have been shown to account for population structure and relatedness, and offer advantages in power by conditioning on associated markers from across the whole genome3,4,5,6,7.

The initial methods were focused on quantitative traits3 for studies with a few thousand samples and assumed a Gaussian distribution on SNP effect sizes. These approaches were extended to datasets including tens of thousands of individuals by computational strategies that avoided repeated matrix inversions when testing each SNP7,8. Building on work from the plant and animal breeding literature9,10, even more efficient whole-genome regression approaches were developed that allowed for more flexible (non-Gaussian) prior distributions of SNP effect sizes11,12. BOLT-LMM and LEMMA are implementations of this approach13,14,15. The fastGWA LMM approach reduces the computational time by using a sparse representation of the genetic correlations present in the sample16. For simple linear regression of quantitative traits, the BGENIE method (https://jmarchini.org/BGENIE/) introduced the idea of the simultaneous analysis of multiple quantitative traits, which required only a single pass through the genetic data and provided substantial speed-ups over PLINK17.

BOLT-LMM and fastGWA have also been applied to binary (case–control) traits when the case–control ratio is reasonably balanced and relatively common variants are tested for association. However, these approaches break down when applied to unbalanced case–control studies tested with rarer variants, such as those found in exome sequencing studies. The SAIGE method implements a logistic mixed-model approach and a saddle-point approximation (SPA) to the null distribution of the test statistic, which is effective at controlling Type 1 errors18.

The BOLT-LMM, fastGWA and SAIGE methods all proceed in two main steps that are applied one trait at a time. In Step 1, a model is fit to a set of SNPs from across the whole genome, such as all of the SNPs on a genotyping array. The resulting model fit is then used to create either a prediction of individual trait values based on the genetic data (in BOLT-LMM and SAIGE) or an estimate of the trait variance–covariance matrix (in fastGWA). In Step 2, a larger set of imputed or sequenced variants on the same set of samples are tested for association, conditional on the predictions or variance–covariance matrix in Step 1. This is usually carried out using the so-called leave-one-chromosome-out (LOCO) scheme, where each imputed SNP on a chromosome is tested conditional on the Step 1 predictions ignoring that chromosome. This approach avoids proximal contamination, which can reduce the power of association tests8,19.

In this paper we propose a new machine-learning method within this two-step paradigm, called REGENIE (https://rgcgithub.github.io/regenie/), that is substantially faster than existing approaches. Extended Data Fig. 1 provides an overview of the REGENIE method. In Step 1, array SNPs are partitioned into consecutive blocks of B SNPs and a small set of J ridge regression predictions are generated from each block (this is referred to as Level 0). Within each block, the ridge regression predictors each use a slightly different set of shrinkage parameters. The idea of using a range of shrinkage values is to capture the unknown number and size of truly associated genetic markers in each window. This approach is equivalent to placing a Gaussian prior on the effect sizes of the SNPs in the block and finding the maximum a posteriori estimate of the effect sizes and the resulting prediction. One can think of these predictions as local polygenic scores that account for local linkage disequilibrium (LD) within blocks. Combining the predictions from across the genome results in a large reduction in the size of the genetic dataset. In this paper we use B = 1,000 and J = 5, and this reduces a set of M = 500,000 SNPs to M = 2,500 predictors. The method then uses a second ridge regression (referred to as Level 1) to combine the M predictors into a single predictor, which is then decomposed into 23 chromosome predictions for a LOCO approach. Linear or logistic regression is used at Level 1, depending on the phenotype. The resulting LOCO predictions are then used as a covariate in Step 2 when each imputed SNP is tested. This approach completely decouples Step 1 and Step 2 so that the Step 1 predictions can be re-used when running Step 2 on distinct sets of markers (for example, imputed and exome markers) or even when distinct statistical tests are needed at Step 2. All of the predictions at Level 0 and Level 1 are obtained within a cross-validation (CV) scheme (either K-fold CV or leave-one-out CV (LOOCV)) to prevent over-fitting.

This approach exhibits a number of desirable properties. First, many of the calculations in Steps 1 and 2 can be carried out for multiple traits in parallel. This leads to substantial gains in speed as the files containing the variants in Steps 1 and 2 are read only once, rather than repeatedly for each trait. In practice, we find that for Step 1, REGENIE can be over 150× faster than BOLT-LMM and 300× faster than SAIGE when analyzing 50 UK Biobank quantitative and binary traits with up to 407,746 samples (Tables 1 and 2). In Step 2, each variant is tested for association and the overall computational burden will depend on the number of variants tested. For example, analyzing imputed variants across the genome will result in a higher computational burden for Step 2 relative to Step 1, but this will be less so when Step 2 involves testing coding variants from exome sequencing. The computational differences between methods in Step 2 are less extreme and mostly depend on the type of trait, test statistic and implementations of file format reading and parallelization schemes. However, REGENIE analyzes multiple traits in parallel and this can result in substantial computational savings, especially for quantitative traits. On an imputed dataset with 30 million tested variants and 50 traits, we find that over both Steps 1 and 2, REGENIE is 19.5× and 4.4× faster than BOLT-LMM and SAIGE, respectively. In the Supplementary Note and Supplementary Table 1 we provide an analysis of the computational complexity of REGENIE.

Table 1 Computational performance of REGENIE, fastGWA and BOLT-LMM when analyzing 50 quantitative traits with UK Biobank data
Table 2 Computational performance of REGENIE–Firth, REGENIE–SPA and SAIGE when analyzing 50 binary traits with UK Biobank data

Second, in Step 1 of REGENIE, only B SNPs need to be stored in memory at once, which leads to a low memory footprint, which can reduce the costs on cloud-based platforms. Third, the method is applicable to both quantitative and binary traits and we have implemented a new, fast Firth logistic regression test as well as a SPA test for binary traits. Finally, our algorithm is ideally suited to implementation on distributed computing frameworks, such as Apache Spark, where both the dataset and application of the method and computation can be parallelized across a large number of machines. The main implementation of REGENIE is a standalone C++ program (https://rgcgithub.github.io/regenie/) but these methods have also been implemented for quantitative traits in the Apache Spark-based Glow project (http://projectglow.io; see Supplementary Note). All of the main experiments and results in the paper were obtained using the C++ program.

Results

Quantitative traits

Figure 1 shows the results of applying REGENIE, BOLT-LMM and fastGWA to three quantitative phenotypes measured on white British participants of the UK Biobank (low-density lipoprotein cholesterol, n = 389,189; body mass index, n = 407,609; and bilirubin, n = 388,303), where Step 2 testing was performed on 9.8 million imputed SNPs (see Supplementary Note). The Manhattan plots for all three phenotypes show good agreement between the methods (see also Extended Data Fig. 2) with both REGENIE and BOLT-LMM showing increased power gains relative to fastGWA at known peaks of association.

Fig. 1: Comparison of methods on three quantitative traits from the UK Biobank.
figure 1

ac, Results from REGENIE, fastGWA and BOLT-LMM for low-density lipoprotein cholesterol (a; n = 389,189), body mass index (b; n = 407,609) and bilirubin (c; n = 388,303) of samples from white British individuals. Tests were performed on 9.8 million imputed SNPs with a minor allele frequency greater than 1%. The bottom dashed horizontal line represents the genome-wide significance (P = 5 × 10−8) and the top dashed horizontal line represents the breakpoint for the different scaling of the y axis. The dashed vertical lines separate the 22 chromosomes.

To demonstrate the advantages of analyzing multiple traits in parallel using REGENIE, we compared it to BOLT-LMM and fastGWA on a set of 50 quantitative traits from the UK Biobank, each with a distinct missing data pattern (Supplementary Table 2). Whereas REGENIE can analyze all traits at once within a single run of the software, the BOLT-LMM and fastGWA software must be run once for each of the 50 traits. Across all 50 traits, we found that the P values for REGENIE and BOLT-LMM were in very close agreement on the majority of traits tested, with some evidence that REGENIE is slightly less powerful for a few traits (Supplementary Fig. 1), whereas the fastGWA P values were noticeably deflated compared with REGENIE and BOLT-LMM. The compute time and memory usage of the three methods is given in Table 1. The table shows that in this 50-trait scenario, REGENIE is 151× faster than BOLT-LMM in elapsed time for Step 1 and 11.5× faster for Step 2, and this translates into an overall speed-up in terms of elapsed time of approximately 20× when projecting to 30 million tested variants obtained using imputation information score > 0.8. Similar to BOLT-LMM, Step 2 of REGENIE has been optimized for input genotype data in BGEN v1.2 format, which highly helped reduce the runtime. In addition, REGENIE has a maximum memory usage of 12.9 GB, which is mostly due to REGENIE reading onlya small portion of the genotype data at a time, whereas BOLT-LMM required 50 GB. To keep memory usage low when analyzing the 50 traits, within-block predictions are stored on disk and read separately for each trait working across blocks. The added input/output operations incur a small cost on the overall runtime but substantially decrease the amount of memory needed by REGENIE (Supplementary Table 3). When running analyses on cloud-based services such as Amazon Web Services, these time and memory reductions both contribute to large reductions in cost as cheaper Amazon Web Services instance types can be used and for less time. In the same 50-traits scenario, we find that REGENIE is about 3× faster than fastGWA but fastGWA is very memory efficient and uses a maximum of only 2 GB.

Binary traits

In addition to analyzing quantitative traits, REGENIE was also designed for the analysis of binary traits, including those with unbalanced case–control ratios. REGENIE includes implementations of both Firth and SPA corrections to handle this scenario (see Methods). Figure 2 (see also Extended Data Fig. 3) shows the results of applying REGENIE, BOLT-LMM and SAIGE to four binary phenotypes measured on white British participants of the UK Biobank (coronary artery disease, N = 352,063; glaucoma, N = 406,927; colorectal cancer, N = 407,746; and thyroid cancer, N = 407,746) where Step 2 testing was performed on 11.6 million imputed SNPs (Supplementary Note). All four approaches demonstrated very good agreement for the most balanced trait (coronary artery disease; case–control ratio = 1:11), but as the fraction of cases decreased, BOLT-LMM tended to give inflated test statistics. However, both REGENIE with Firth and SPA corrections as well as SAIGE are robust to this inflation and show similar agreement for the associations detected.

Fig. 2: Comparison of methods on four binary traits from the UK Biobank.
figure 2

ad, Results from REGENIE using Firth and SPA correction, BOLT-LMM and SAIGE on samples from white British individuals for coronary artery disease (a; case–control ratio = 1:11, N = 352,063), glaucoma (b; case–control ratio = 1:52, N = 406,927), colorectal cancer (c; case–control ratio = 1:97, N = 407,746) and thyroid cancer (d; case–control ratio = 1:660, N = 407,746). Tests were performed on 11 million imputed SNPs. The bottom dashed horizontal line represents the genome-wide significance (P = 5 × 10−8) and the top dashed horizontal line represents the breakpoint for the different scaling of the y axis. The dashed vertical lines separate the 22 chromosomes.

The SPA approach calculates a standard score test statistic and approximates the null distribution, whereas the Firth correction uses a penalized likelihood approach to estimate the SNP-effect-size parameters in an asymptotic likelihood-ratio test. Although both provide good control of Type 1 error for rare binary traits, we found that the SPA approach implemented in SAIGE can result in very inflated effect-size estimates (Supplementary Fig. 2). However, the Firth correction used in REGENIE provides reasonable effect-size estimates and standard errors when the minor allele count is low (Supplementary Table 4). The fast Firth correction that we developed agrees well with the exact Firth correction (Supplementary Figs. 3 and 4) but is approximately 60 times faster (Supplementary Table 5).

To assess the computational resources needed to analyze a larger number of traits, we again ran REGENIE using Firth/SPA correction and SAIGE on a set of 50 binary traits from the UK Biobank with a range of different case–control ratios and distinct missing data patterns (Supplementary Table 6). The compute time and memory usage details are given in Table 2.

For Step 1, we found that REGENIE (using the LOOCV scheme) was about 350 times faster (CPU time of 777 versus 275,070 h) and required only 40% of the memory used by SAIGE (19.5 versus 49 GB). In Step 2, REGENIE–Firth and REGENIE–SPA were 2× and 3× faster than SAIGE in CPU time, respectively, but 21× and 34× faster than SAIGE in elapsed time, respectively, which suggests that REGENIE makes better use of parallelization in this step. Overall, in this 50-trait setting, REGENIE–Firth was 4.4× faster than SAIGE in terms of CPU time and 23× faster in elapsed time when projected to 30 million tested variants obtained using INFO > 0.8%. REGENIE reduces the CO2 footprint by more than 85% compared with SAIGE (Supplementary Table 7). Supplementary Figs. 510 compare the accuracy of REGENIE and SAIGE across all 50 traits and show good agreement.

A large portion of the compute time in SAIGE is used to implement the LOCO, but it has been suggested that for binary traits, the effect of proximal contamination is not as substantial for less prevalent traits18. We ran SAIGE without LOCO on the same 50 binary traits and observed that the impact of using LOCO was indeed more apparent with low case–control imbalance, where it can be highly beneficial (Extended Data Fig. 4 and Supplementary Figs. 1113). These results would caution against the perfunctory use of SAIGE without LOCO for the analysis of all traits in a study such as the UK Biobank.

CV scheme

We implemented both a K-fold CV and a LOOCV scheme in Step 1 for both quantitative and binary traits (see Methods). Both approaches provided almost identical accuracy (Supplementary Figs. 14 and 15). For the dataset of 50 quantitative traits, LOOCV required 192 h of CPU time and 23.6 GB of memory, whereas the K-fold CV required 111 h of CPU time and 12.9 GB of memory (Table 1). For the dataset of 50 binary traits, the LOOCV approach required approximately 50% of the CPU time used by the K-fold CV approach (CPU time of 777 versus 1,590 h) and 65% more memory, but the elapsed time of the two methods was similar (108 versus 117 h). The LOOCV approach requires fewer relatively expensive logistic regression calls compared with the K-fold CV, but the extra calls needed are easily parallelized across multiple cores.

Missing phenotype data

When analyzing multiple traits together with different missing data patterns, we use mean imputation of missing phenotype values in Step 1 but keep only samples with non-missing phenotypes in Step 2. This approach gave almost identical results to an exact approach that uses only samples with non-missing phenotypes in both Step 1 and Step 2 (Supplementary Figs. 1618).

Simulation studies

Through simulations, we investigated the Type 1 error and power of the tests in REGENIE, BOLT-LMM, fastGWA, SAIGE and simple linear/logistic regression with the top principal component as a covariate (PCA). We also assessed the LOCO scheme used in REGENIE and the accuracy of the effect-size estimates from REGENIE–Firth and SAIGE. We used real genetic array data from the UK Biobank to simulate realistic genetic LD patterns and population structure. We sampled 100,000 individuals from either the white British or the full European ancestry set of the UK Biobank and selected one of the following: (1) only unrelated individuals, (2) individuals at random or (3) half of the samples from the related individuals and the remaining half from the unrelated individuals. To consider scenarios of more extreme relatedness, we sampled only first-degree relatives (N = 22,990), first- and second-degree relatives (N = 30,775) or first- to third-degree relatives (N = 70,684) in the set of white British participants. More details can be found in the ‘Data simulation’ section of Methods.

We used genomic inflation (λGC) and empirical Type 1-error rate (defined as the proportion of null tests with a P value less than a nominal level α) to assess the calibration of the tests across 100 simulation replicates. For the quantitative traits, the PCA method was well calibrated when the sample consisted only of unrelated individuals but became inflated with increasing levels of relatedness (Extended Data Fig. 5 and Supplementary Tables 8,9). However, REGENIE, BOLT-LMM with a mixture of Gaussian’s model (BOLT-LMM-MoG), the BOLT-LMM infinitesimal model (BOLT-LMM-Inf) and fastGWA retained good Type 1-error control in all of the settings considered. REGENIE had slightly deflated type I error rates when half of the samples were related. This was also observed in more extreme relatedness scenarios, where the use of the Step 1 predictions in REGENIE led to good calibration of the test unlike the PCA method, which was inflated (Supplementary Table 10).

For binary traits with low case–control imbalance, REGENIE–Firth, REGENIE–SPA, BOLT-LMM-MoG, BOLT-LMM-Inf and fastGWA had good control of the Type 1 error with the more common variants tested (Extended Data Fig. 6, Supplementary Fig. 19d and Supplementary Tables 11,12), and SAIGE had slightly deflated Type 1 error rates. BOLT-LMM-MoG, BOLT-LMM-Inf and fastGWA were inflated for more unbalanced traits and this was worse for rarer variants (Supplementary Fig. 19). However, REGENIE–Firth, REGENIE–SPA and SAIGE were robust against this inflation and for extremely unbalanced traits; REGENIE–SPA was conservative with rarer variants. In more extreme relatedness scenarios, the PCA method became inflated with higher heritability levels but REGENIE–Firth and REGENIE–SPA retained good control of the Type 1 error, although they became conservative for more heritable traits (Supplementary Table 13).

To quantify power, we used the mean χ2 test statistic at causal SNPs. For quantitative traits, REGENIE and BOLT-LMM-Inf had similar power performance, which was higher than for fastGWA across all settings (Supplementary Fig. 20 and Supplementary Table 14). BOLT-LMM-MoG had the highest power performance with fewer causal SNPs, and the power difference with REGENIE decreased as the number of causal SNPs increased. BOLT-LMM-MoG uses a more flexible mixture of Gaussian’s prior, which may model traits with highly non-infinitesimal genetic architectures better. For binary traits, we compared REGENIE–Firth, REGENIE–SPA and SAIGE, which were all well calibrated. With low case–control imbalance, REGENIE–Firth and REGENIE–SPA had slightly higher power than SAIGE and for more unbalanced traits, REGENIE–Firth and SAIGE had similar power performance and REGENIE–SPA had slightly lower performance at rarer variants (Supplementary Fig. 21 and Supplementary Table 15).

We further investigated the accuracy of the effect sizes from REGENIE–Firth and SAIGE. They were similar for moderately unbalanced traits but as the case–control imbalance increased, the estimates from SAIGE became highly inflated (Extended Data Fig. 7). Finally, when comparing the approximate LOCO scheme in REGENIE to an exact LOCO scheme, we found that they gave similar results, both for the genetic predictions in Step 1 (squared sample correlation coefficient (R2) = 0.99932 for quantitative traits and R2 = 0.98710 for binary traits) and the P values in Step 2 (Supplementary Fig. 22).

We assessed the single-trait performance of REGENIE, BOLT-LMM and SAIGE and found that REGENIE takes approximately 3× less CPU time than BOLT-LMM for quantitative traits and >8× less CPU time than SAIGE for binary traits (Supplementary Table 16). With five traits, the computational efficiency of REGENIE improved—it took approximately 10× less CPU time than BOLT-LMM and >22× less CPU time than SAIGE. More generally, we observed that Step 1 of REGENIE scales sub-linearly with the number of traits and the scaling gets closer to linear when the number of traits become large (Extended Data Fig. 8).

Inter-chromosomal LD in the UK Biobank

While developing REGENIE, we and others20 identified an anomaly in the UK Biobank array genotypes that leads to reduced performance of some of the LMMs being tested. We observed a sizeable number of SNP pairs that exhibited inter-chromosome LD, which breaks the assumptions of the LOCO scheme and can result in loss of power when any one of the SNPs in a pair is associated with a trait (see Supplementary Note).

Discussion

In this study we present a machine-learning method that implements simultaneous whole-genome-wide regression of multiple quantitative or binary traits. The method uses a strategy that splits computation into blocks of consecutive SNPs and does not require loading of a genome-wide set of SNPs into memory. This approach also facilitates the analysis of multiple traits in parallel. Overall, this results in substantial computational savings in terms of both CPU time and memory usage compared with existing methods such as BOLT-LMM, fastGWA and SAIGE. As the number of large-scale cohorts with deep phenotyping grows, this approach will probably become even more relevant. The parallel nature of the approach is ideally suited to distributed environments such as Apache Spark. We have developed a first version of REGENIE for quantitative traits within the Glow project as well as the full version of the method for quantitative and binary traits in a standalone C++ program with source code that is openly available.

Analysis of large cohorts for which phenotypes are derived from electronic health records often results in many binary traits with substantial case–control imbalance. REGENIE is applicable to binary traits and we have proposed an approximate Firth regression approach, which we show is almost identical to an exact Firth regression implementation, and much faster. This approach has the added benefit that it avoids the parameter estimate inflation that occurs when SAIGE is used to analyze ultra-rare variants.

Like many existing mixed-model-based approaches, REGENIE is well able to handle relatedness in the sample, although it can become conservative in more extreme cases and hence, we recommend that it not be used for smaller cohorts with high levels of relatedness—such as founder populations, where exact mixed-model methods can be used. As previous methods have proposed to address this issue13,18,21, we plan to explore extending REGENIE to compute and incorporate a calibration factor in its association testing step.

The approach used in REGENIE is inspired by, but not the same as, the machine-learning approach of stacked regressions22. REGENIE uses ridge regression to combine a set of correlated predictors, whereas Breiman’s stacking approach used non-negative least squares to combine a set of highly correlated predictions in an ensemble learning approach. We have not yet investigated whether non-negative least squares might have advantages here. Furthermore, our simulations with traits that have a sparser genetic architecture also highlight the potential improvement of the REGENIE method by using more flexible priors on the effect size of predictors, as is done in BOLT-LMM with a mixture of Gaussian’s prior.

There are many other potential avenues for development of this approach. It will be easy to expand the functionality to include tests such as SNP × covariate interactions16, variance tests23 and a whole range of gene-based tests24,25,26. Multivariate probit regression for binary traits27, multivariate linear regression for quantitative traits28 and multi-trait burden tests29 will all be straightforward to implement.

We also plan to investigate whether REGENIE can be extended to handle time-to-event data30 and multinomial regression in a mixed-model framework31,32. We suspect it may also be possible to leverage the REGENIE output to estimate SNP heritability, polygenic scores and multi-trait missing data imputation using mixed models on a scale that is not possible using the existing approaches33.

One novel application would be to use REGENIE to analyze cohorts that have undergone both RNA-sequencing and either whole-genome SNP genotyping or sequencing. In this setting, the expression levels of up to 20,000 genes would represent the multiple traits of interest, and running a whole-genome regression analysis would allow for joint inference of cis and trans expression quantitative trait loci in a single analysis. This would be equivalent to an LMM analysis of an RNA-sequencing study, which has been performed in previous studies34,35.

Cohorts will continue to grow in terms of sample size, the number of phenotypes and the number of variants available for testing, either via imputation from whole-genome-sequenced reference panels or via direct whole-genome sequencing of the study samples. It seems clear to us that Step 1 of the whole-genome regression paradigm is now highly computationally tractable using the REGENIE approach. However, further advances will be needed to reduce the compute time in Step 2, as whole-genome sequencing produces ever-increasing numbers of rare variants. Efficient utilization of the sparsity of such variants will help to improve memory efficiency and substantially reduce the cost of computation.

Methods

Whole-genome linear regression

In a sample of N individuals, y denotes the N-element phenotype vector, G represents the N × M genotype matrix, where Gij {0, 1, 2} is the allele count for individual i at the jth marker and X represents the N × C matrix of covariates (including an intercept), which is assumed to be full rank. We consider a whole-genome regression model

$$\mathbf{y} = \it{X}\mathbf{\upalpha} + \it{G}_{\mathrm{S}}\mathbf{\upbeta} + {\mathbf{\upepsilon }}$$
(1)

where α are the fixed covariate effects, GS is a standardized version of G, the genotypes have been transformed to have a mean of zero and variance of one, β ~ MVN(0, \(\sigma _g^2\)IM) and \(\mathbf{\upepsilon}\) ~ MVN(0, \(\sigma _e^2\)IN), where MVN denotes the multivariate normal distribution. This is the standard infinitesimal model, which can also be re-written as

$$\mathbf{y} = \it{X}\mathbf{\upalpha} + {{\mathbf{g}}} + \mathbf{\upepsilon}$$
(2)

with g ~ MVN(0, \(\sigma _a^2\)K), where \(K = G_{\mathrm{S}}{G_{\mathrm{S}}^{T}}/M\) is usually referred to as a genetic-relatedness matrix or empirical kinship matrix and \(\sigma _a^2 = M\sigma _g^2\) is the additive polygenic variance.

Covariate effects are removed from both the trait and the genotypes in equation (1) by first computing an orthonormal basis for the covariates, projecting the genotypes and the trait onto that basis and then subtracting out the resulting vectors to obtain the residuals. This is equivalent to using a projection matrix \(\it{P}_{X} = \it{I}_N - \it{X}(\it{X}^{T}\it{X})^{ - 1}\it{X}^{T}\) with

$${\tilde{{\mathbf{y}}}} = \it{P}_{X}{{\mathbf{y}}}$$
(3)
$${\tilde{\it{G}}} = \it{P}_{X}\it{G}_{\mathrm{S}}$$
(4)

Both the genotype and phenotype residuals are then scaled to have a variance of one.

Stacked block ridge regression

Fitting equation (1) is computationally intensive given that G typically has many hundreds of thousands of columns. Instead, for Step 1, we transform the model to

$${\tilde{{\mathbf{y}}}} = \it{W}\mathbf{\upeta} + \mathbf{\upepsilon}$$
(5)

where W is a matrix derived from G with substantially fewer columns. Specifically, we divide G into blocks of B consecutive and non-overlapping SNPs, and from each block we derive a small set of predictors using ridge regression across a range of J shrinkage parameters (see Supplementary Note). The idea behind using a range of shrinkage values is to capture the unknown number and size of truly associated genetic markers within each window. This approach is equivalent to placing a Gaussian prior on the effect sizes of the SNPs in the block and finding the maximum a posteriori estimate of the effect sizes and the resulting prediction. Another approach would be to integrate out the effect sizes over the Gaussian prior to obtain the best linear unbiased prediction36 but we have not investigated that approach in this paper.

The ridge predictors are re-scaled to have unit variance and are stored in place of the genetic markers in matrix W, providing a large reduction in data size. If M = 500,000, B = 1,000 and J = 5 are used, then the reduced dataset will have JM / B = 2,500 predictors. We refer to this part of the method as the Level 0 ridge regression.

To keep the memory usage low when analyzing multiple traits, the within-block predictions are stored on disk and read separately for each trait when fitting models at Level 1 (see below). The added input/output operations incur a small cost on the overall runtime and substantially decrease the amount of memory needed.

The ridge regression takes account of the LD within each block but not between blocks. One option that we have considered, but not implemented yet, is to condition the ridge regression on the estimates from the previous block, which may better account for LD across block boundaries.

The predictors in W will all be positively correlated with the phenotype. Thus, it is important to account for that correlation when building a whole-genome-wide regression model. The predictors will also be correlated with each other, especially within each block, but also between blocks that are close together due to LD. We use a second level of ridge regression on W for a range of shrinkage parameters and choose a single best value using the K-fold CV scheme22. This assesses the predictive performance of the model using held-out sets of data and aims to control any over-fitting induced by using the first level of ridge regression to derive the predictors (see Supplementary Note). We refer to this part of the method as the Level 1 ridge regression.

The result of this model fit is a single N × 1 predicted phenotype \({\hat{{\mathbf{y}}}}^ \ast\), and this can be partitioned into 22 LOCO predictions (denoted \({\hat{{\mathbf{y}}}}_{{\mathrm{LOCO}}}^ \ast\)), which are used when testing SNPs for association in Step 2 to avoid proximal contamination (see Supplementary Note).

Association testing

When testing for association of the phenotype with a variant g in Step 2, we consider a simple linear model

$${\hat{{\mathbf{y}}}}_{{\mathrm{resid}},{\mathrm{LOCO}}}^ \ast = {\tilde{{\mathbf{g}}}}\upbeta + \tilde{\mathbf{\upepsilon}}$$
(6)

where \({\hat{{\mathbf{y}}}}_{{\mathrm{resid}},{\mathrm{LOCO}}}^ \ast = {\tilde{{\mathbf{y}}}} - {\hat{{\mathbf{y}}}}_{{\mathrm{LOCO}}}^ \ast\) refers to the phenotype residuals where the polygenic effects estimated from the null model with LOCO have been removed, \({\tilde{{\mathbf{g}}}} = \it{P}_X{{{\mathbf{g}}}}\) are residuals obtained from removing the covariate effects from the tested variant and \(\tilde {\mathbf{\upepsilon}} =\it{P}_X{\mathbf{\upepsilon}}\) with \(\mathbf{\upepsilon}\) ~ MVN(0, σe2IN).

A score test statistic for H0: β = 0 is

$$T_{{\mathrm{linear}}} = \frac{{{\tilde{{\mathbf{g}}}}^{T}{\hat{{\mathbf{y}}}}_{{\mathrm{resid}},{\mathrm{LOCO}}}^ \ast }}{{[\hat \upsigma _e^2 \cdot {\tilde{{\mathbf{g}}}}^{T}{\tilde{{\mathbf{g}}}}]^{1/2}}}$$
(7)

where we use \(\hat \upsigma _e^2 = ||{\hat{{\mathbf{y}}}}_{{\mathrm{resid}},{\mathrm{LOCO}}}^ \ast ||_2^2/(N - C)\). In equation (7), when estimating the variance of the numerator, we assume that the polygenic effects are given, which leads to the denominator involving only O(N) computation. While other methods make use of a calibration factor in the denominator to account for the variance of the polygenic effects13,18,21, we found in applications that the results obtained using this simple form match up closely to those using a calibration factor. Finally, we use a normal approximation, \(T_{{\mathrm{linear}}}^2\sim \chi _1^2\), to estimate the P value. As with Step 1 above, the REGENIE software reads the genetic data file in blocks of B SNPs and these are processed together, taking advantage of parallel linear algebra routines in the Eigen library.

Multiple traits

Both Step 1 and Step 2 above are easily extended so that multiple phenotypes can be processed in parallel. The genetic data files in both steps can be read once, in blocks of B SNPs, which means the method uses a small amount of memory. In addition, the linear algebra operations for the covariate residualization, ridge regression and association testing can be shared across traits. This is similar to the approach implemented in the BGENIE software for single SNP linear regression analysis17. The fine details of the multiple phenotype approach are given in the Supplementary Note.

Binary traits

For binary traits, we use exactly the same Level 0 ridge regression approach, which effectively treats the trait as if it were quantitative. However, at Level 1, instead of a linear regression in equation (5) we use logistic regression

$${\mathrm{logit}}\left( {p_{i}} \right) = {{\mathbf{X}}}_i^{\boldsymbol{T}}\mathbf{\upalpha} + {{\mathbf{W}}}_i^{\boldsymbol{T}}\mathbf{\upeta}$$
(8)

where pi = E(yi) = P(yi = 1) with yi indicating the case status of the ith individual, Xi is the covariate vector for the ith individual, α are the fixed covariate effects, Wi are the within-block (BR) predictions for the ith individual and η = (η1,…, ηBR)T, with ηi ~ N(0, 1/ω). This model corresponds to logistic regression with ridge penalty applied to the effects of within-block predictions in W.We approximate the model in equation (8) by first fitting a null model for each trait that only has

$${\mathrm{logit}}\left( {p_i} \right) = {{{\mathbf{X}}}}_i^T\mathbf{\upalpha}$$
(9)

covariate effects and then using the resulting estimated effects as an offset in the model in equation (8),

$${\mathrm{logit}}\left( {p_{i}} \right) = {\mathbf{X}}_i^T\hat{\mathbf{\upalpha}} + {\mathbf{W}}_i^T{\mathbf{\upeta}}$$
(10)

where \(\hat{\mathbf{\upalpha}}\) represents the effects estimated in equation (9). As the covariate effects are not expected to change substantially (unless correlation between covariates and block predictions are very large), this approximation is expected to work well in most analyses.

As with quantitative traits, we used K-fold CV to choose the Level 1 ridge regression parameter. However, for extremely unbalanced traits, it may happen that one of the folds contains no cases. To avoid this situation, we also implemented an efficient version of LOOCV. Although at first sight it may seem that LOOCV is more computationally intensive than K-fold CV given that the model has to be fitted N times (rather than K times) on data with N − 1 samples, the leave-one-out estimates can actually be obtained (approximately for binary traits) from rank 1 updates to the results from fitting the model once to the full data (see Supplementary Note). In practice we have found that LOOCV gives similar association results to K-fold CV (Supplementary Figs. 14 and 15) and can be computationally faster in some cases (see Tables 1 and 2). A LOCO scheme is applied to the polygenic effect estimates and the resulting predictions \({\hat{{\mathbf{w}}}}_{{\mathrm{LOCO}}} = \it{W}_{{\mathrm{LOCO}}}\hat{\mathbf{\upeta}} _{{\mathrm{LOCO}}}\) are then stored.

In Step 2, we use a logistic regression model score test to test for association between each marker and binary trait. Covariate effect sizes are estimated along with genetic marker effect sizes but we include the LOCO predictions from Step 1 as a fixed offset (see Supplementary Note).

When rare variants are tested for association with a highly unbalanced trait (that is, a trait that has low sample prevalence), the use of asymptotic test statistic distributions does not work well and results in elevated Type 1 error rates. REGENIE implements several methods to handle this situation. First, it includes the SPA test37, which is also included in SAIGE18. This approach better approximates the null distribution of the test statistic but we have found that it can sometimes fail to produce good estimates of SNP effect sizes and standard errors, which are highly desirable for meta-analysis applications (Supplementary Table 4 and Supplementary Fig. 2).

Second, we use Firth logistic regression, which uses a penalized likelihood to remove much of the bias from the maximum-likelihood estimates in the logistic regression model. This approach results in well-calibrated Type 1 errors and usable SNP effect sizes and standard errors. Given that the use of Firth regression can be relatively computationally intensive, we have developed an approximate Firth regression approach that is much faster (Supplementary Table 5), which involves estimating the covariate effects in a null Firth regression model and then including covariate effects along with the LOCO genetic predictor as offset terms in a Firth logistic regression test (see Supplementary Note). In practice, we have found this approximation to give very similar results to when the exact Firth test is used (Supplementary Fig. 3). This approach has been used to analyze COVID-19 outcomes across four studies and four ancestries38, and proved vital to provide accurate effect-size estimates for the meta-analysis.

Handling missing data

As a key goal of our approach is to analyze multiple traits all at once, one issue that remains to be addressed is the presence of ‘missingness’ in the data, which could differ among the traits. We consider different approaches based on the nature of the trait as well as whether the null model is being fitted or whether association testing is being performed.

For quantitative traits, when fitting the null model missing data is addressed by replacing the missing values by the sample averages for each trait and in the association testing step, individuals with missing phenotype observations are removed from the analysis for each trait. The latter is done by ensuring that when taking sums over individuals, those with a missing phenotype have a zero contribution to the sum. This is similar to the approach implemented in the BGENIE software (https://jmarchini.org/BGENIE/) for single SNP linear regression analysis17. We assume that covariates are fairly well-balanced in the sample and project them out of the phenotypes using all of the samples (that is, ignoring the missingness within each trait). In the case where phenotypes have the same or very similar patterns of missingness, or if only a single phenotype is being analyzed, it may be more logical to discard the missing observations rather than impute them with the sample averages per trait. Hence, we implement an alternative approach where, in both the null-fitting and the association testing steps, all samples with missingness at any of the P phenotypes are dropped. An approach we have not yet implemented, but may produce better results for quantitative traits, would involve using a multivariate normal model to jointly model correlation between the set of traits and impute missing data, either before or conditional on the output of Step 1.

For binary traits, we use the mean-imputed phenotypes to fit the Level 0 linear ridge regression models within blocks but discard missing observations when fitting Level 1 logistic ridge regressions. As the logistic ridge regressions are fitted separately for each trait, this makes it straightforward to account for the missingness patterns separately for each trait. Similarly, in the testing step, we discard missing observations when fitting logistic regression for each trait as well as when using Firth or SPA corrections.

UK Biobank dataset

The UK Biobank17 (http://www.ukbiobank.ac.uk) is a large prospective study of about 500,000 individuals who are 40–69 years old and for whom extensive phenotype information is being recorded. Genotyping was performed using the Affymetrix UK BiLEVE Axiom array on an initial set of 50,000 participants and the Affymetrix UK Biobank Axiom array was used for the remaining participants. Up to 11,914,699 variants imputed by the Haplotype Reference Consortium panel that either have a minor allele frequency above 0.5% or a minor allele count above five and are annotated as functional in 462,428 samples of European ancestry were used in the data analyses. We selected up to 407,746 individuals of white British ancestry for whom genotype and imputed data were available and applied quality-control filters on the genotype data using PLINK2 (ref. 39; version v2.00aLM, https://www.cog-genomics.org/plink2) that included: a minor allele frequency of ≥1%, a Hardy–Weinberg equilibrium test not exceeding P = 1 × 10−15, a genotyping rate above 99%, not present in low-complexity regions, not involved in inter-chromosomal LD and LD pruning using a R2 threshold of 0.9 with a window size of 1,000 markers and a step size of 100 markers. This resulted in up to 471,762 genotyped SNPs that were kept in the analyses.

Data simulation

We performed simulations to assess the performance of the tests in REGENIE under various population-structure configurations for both quantitative and binary traits. To mimic realistic scenarios, we used genotype array data from the UK Biobank European samples (679,209 array SNPs with a minor allele count > 5). We considered scenarios with 100,000 samples obtained from the set of white British participants or from the full European set so as to incorporate various amounts of population structure. In addition, we varied the proportion of related individuals selected from 0 to 50% of the sample, where we defined a pair of individuals as related if their estimated kinship coefficient, provided by UK Biobank using KING40, was above 0.044. This is to assess how REGENIE would perform in samples with higher amounts of relatedness. We also considered randomly selected samples from the white British or European set, irrespective of the relatedness information, where about 30% of samples in these sets are related up to the third degree. Finally, to consider scenarios of more extreme relatedness, we considered scenarios with samples consisting of only first-degree white British relatives (N = 22,990), first- and second-degree white British relatives (N = 30,775), and first- to third-degree white British relatives (N = 70,684).

We generated quantitative traits as

$$Y_i = \mathop {\sum }\limits_{j = 1}^M G_{ij}\upbeta _j + A_\gamma + {\it{\epsilon }}_i$$

where the M causal SNPs were randomly selected only from odd chromosomes with a minor allele count above 100 and not involved in inter-chromosomal LD, and Gij represents the standardized genotype for individual i at the jth causal SNP, Ai represents the score of the individual for the top principal component from a genotype relatedness matrix using SNPs on odd chromosomes and \({\it{\epsilon }}_i\) represents the environmental effects. The effect sizes for the causal SNPs were sampled from a normal distribution with a mean of zero, where the variance was determined based on the desired proportion of trait variance explained by the causal SNPs \(h_g^2\). The effect from population structure γ was set so that the proportion of the trait variance explained by the top principal component was 5%. The environmental effects were sampled from a normal distribution with a mean of zero and the variance was set to correspond to a trait variance of one.

For binary traits, we used the model described above to obtain a quantitative phenotype and then applied a threshold based on a target sample prevalence value K to dichotomize the phenotype and obtain a binary trait. We also considered simulations to assess the effect-size estimates using a logistic model where

$${\mathrm{logit}}\left( {p_i} \right) = \upbeta _0 + \mathop {\sum }\limits_{j = 1}^M G_{ij}\upbeta _j$$

and Yi|pi ~ Bernoulli(pi), independently, where β0 was chosen to achieve the desired prevalence level, and the effect sizes of the causal SNPs were sampled from a normal distribution with a mean of zero and the variance parameter was chosen so that they explain 20% of the variance on the logistic scale.

We simulated up to 100 phenotypic replicates for each simulation setting and selected 10,000, 25,000 or 50,000 SNPs to be causal. For binary traits, we varied the sample prevalence K between 0.1, 0.01 or 0.001, corresponding to a case–control ratio of 1:9, 1:99 or 1:999, respectively, and fixed the number of causal SNPs to 10,000. SNPs on even chromosomes (Mnull = 324,838 variants) were used to assess the Type 1-error performance and the power was estimated using the set of causal SNPs for each trait. REGENIE was compared with BOLT-LMM with a mixture of Gaussian’s model (BOLT-LMM-MoG), BOLT-LMM with infinitesimal model (BOLT-LMM-inf), fastGWA, SAIGE (only for binary traits and run with LOCO scheme) and PCA (using only the top principal component as a covariate in Step 2 of REGENIE without the LOCO predictions from Step 1). The top principal component was included as a covariate for all methods. For REGENIE–Firth, REGENIE–SPA and SAIGE, the P-value fallback threshold for Firth/SPA correction was set to 0.05.

Statistical analyses

We used REGENIE to perform genome-wide association analyses on up to approximately 11 million imputed variants for 50 quantitative traits and 54 binary traits of up to 407,746 white British participants in the UK Biobank. Quantitative phenotypes were converted to z-scores using rank-inverse-based normal transformation. In the statistical models used, the covariates included age, age2, sex, age × sex and the top-10 principal components provided by the UK Biobank to appropriately correct for population stratification. To assess the performance of REGENIE in genome-wide association studies, we compared the results from REGENIE with those of existing approaches for large-scale analysis, which included BOLT-LMM (version 2.3; https://data.broadinstitute.org/alkesgroup/BOLT-LMM/) and fastGWA (GCTA version 1.93.0beta; https://cnsgenomics.com/software/gcta/#Overview) for quantitative phenotypes and SAIGE (version 0.36.5.1; https://github.com/weizhouUMICH/SAIGE) with the LOCO option for binary traits. For all methods, Step 1 was run on a set of array SNPs stored in bed/bim/fam format and Step 2 was run on imputed data stored in BGEN format. All association analyses used a \(\chi _{\mathrm{df} = 1}^2\) statistic to test a variant for association with a trait (that is H0SNP = 0). All programs were called within R41, where we used the function system.time to track the CPU and wall-clock timings.

Reporting Summary

Further information on research design is available in the Nature Research Reporting Summary linked to this article.