Introduction

Genetic information is encoded by DNA, transcribed into RNA and translated into protein. When originally proposed1, this foundational tenet assumed faithful transmission of information such that mRNA accurately reflects what is encoded at the DNA level. However, it is now clear that RNA molecules can undergo several processing events that diversify the genomic information, resulting in different transcripts that, in some cases, encode different protein isoforms. Examples of such processes are alternative splicing2, alternative polyadenylation3 and base modifications.

Most RNA base modifications are not easily detectable via synthesis-based RNA sequencing4, making it exceedingly difficult to distinguish between modified and unmodified RNA molecules5. One exception is RNA base deamination (also known as RNA editing), a widespread set of modifications that lead to a change in the RNA sequence itself. RNA editing can be detected simply by comparing the sequence of the transcript with that of its cognate gene.
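The comparison described above can be illustrated with a minimal sketch. It assumes the genomic and transcript-derived (cDNA) sequences are already aligned, gap-free and on the same strand; the function name and the toy sequences are hypothetical. Inosine is read as G and uracil as T in cDNA, so A-to-I and C-to-U editing appear as A-to-G and C-to-T mismatches, respectively.

```python
# Hypothetical helper: flag mismatches between a gene and its transcript
# that are consistent with base deamination. Illustrative only.
EDIT_CLASSES = {("A", "G"): "A-to-I", ("C", "T"): "C-to-U"}


def candidate_editing_sites(genomic, cdna):
    """Return (0-based position, class) for deamination-like mismatches.

    Assumes both sequences are pre-aligned, equal in length and gap-free.
    """
    if len(genomic) != len(cdna):
        raise ValueError("sequences must be pre-aligned and of equal length")
    sites = []
    for pos, (g, c) in enumerate(zip(genomic.upper(), cdna.upper())):
        label = EDIT_CLASSES.get((g, c))
        if label is not None:
            sites.append((pos, label))
    return sites


gene = "ATGCAAGTCA"  # toy genomic sequence (assumption)
read = "ATGTAAGTCG"  # toy cDNA sequence (assumption)
print(candidate_editing_sites(gene, read))
```

A mismatch of this kind is only a candidate: genomic polymorphisms and sequencing errors would also have to be excluded before calling an editing site.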

In mammals, RNA editing refers specifically to the deamination of adenosine to inosine (A-to-I) or cytosine to uracil (C-to-U); for the purposes of this Perspective we will exclude the phenomenon of uracil insertion or deletion that was described as RNA editing in mitochondria of Trypanosoma brucei6. A-to-I editing is catalysed by the adenosine deaminase acting on RNA (ADAR) protein family7,8,9. C-to-U editing is performed by numerous cytosine deaminases, the best known of which belong to a family of mammalian enzymes known as the ‘activation-induced cytidine deaminase/apolipoprotein B mRNA-editing enzyme catalytic polypeptide-like’ (AID/APOBEC) protein family10 (Box 1).

The first member of the AID/APOBEC family to be characterized was the bona fide RNA editing enzyme APOBEC1 (Fig. 1a). Since then, additional RNA editing deaminases belonging to this family have been described, including APOBEC3A (A3A) and A3G. These RNA editors have the peculiar ability to also deaminate DNA, leading to single-nucleotide variant mutations that often occur processively in genomic DNA or reverse-transcribed viral cDNA (Fig. 1b). By contrast, other family members seem to have lost their ability to deaminate RNA: some, instead, catalyse mutation of viral DNA (or cDNA); others have very specific genomic DNA substrates — for example, AID edits the expressed immunoglobulin gene (Fig. 1c). Finally, APOBEC2 cannot edit RNA or DNA but has the ability to bind DNA with affinities much higher than those reported for any other family member11.

Fig. 1: Physiological and aberrant functions of the AID/APOBEC deaminases.

a | APOBEC1 in humans and mice acts in the nucleus of enterocytes, together with its cofactor RNA-binding motif protein 47 (RBM47), to edit apolipoprotein B (APOB) mRNA155. Editing leads to a C-to-U base change that converts Gln (CAA) to a stop codon (UAA). Edited and unedited APOB mRNAs are then translated in the cytoplasm, generating two distinct isoforms: the short (APOB-48) and long (APOB-100) isoforms, respectively. APOB-100 is the major component of plasma low-density lipoproteins whereas APOB-48 is essential for secretion of chylomicrons. In mice, APOBEC1, together with RBM47, catalyses RNA editing of a large set of additional transcript targets (mRNA set 1). A change of cofactor from RBM47 to APOBEC1 complementation factor (A1CF) leads to RNA editing of a different set of transcript targets (mRNA set 2), suggesting that target specificity resides with the cofactor. Finally, in mice and humans, APOBEC1 is also able to induce DNA editing within the genome of the cells, leading to undesired mutations (dashed red arrow)109,141,142. b | In humans, APOBEC3 family members play an essential role during retroviral infections (for example, in leukocytes). Specifically, once a retrovirus infects a cell, it releases its viral genome as single-stranded RNA (ssRNA), which is reverse-transcribed (RT) to cDNA. APOBEC3 proteins can deaminate this single-stranded DNA (ssDNA) leading to C-to-T base changes and mutations within the viral genome. This edited viral genome can be degraded (if heavily edited) or integrated into the genome as a provirus. APOBEC3A (A3A) and A3G are also able to perform RNA editing on RNA viruses (such as SARS-CoV-2) as well as host mRNAs. Aberrant activity of A3A and A3B can also induce DNA mutations within the genome of cells (dashed red arrows). c | AID plays an essential role in B cell antibody diversification, where it catalyses deamination either within transcribed (black arrow) antibody variable region V(D)J gene segments of the immunoglobulin (Ig) gene leading to somatic hypermutation (SHM) (mutations represented as red bars) or within repetitive ‘switch’ regions upstream of the constant region gene segments, leading to class switch recombination (CSR) (switch regions Sμ or Sɣ1 shown). Resulting mRNA encodes an IgG1 protein that contains a hypermutated variable region and a ɣ1 heavy chain. Unregulated AID activity can also result in mutations and translocations elsewhere in the genome (dashed red arrow). AID/APOBEC, activation-induced cytidine deaminase/apolipoprotein B mRNA-editing enzyme catalytic polypeptide-like; dsDNA, double-stranded DNA.

The interplay between RNA editing and DNA mutation and the types of molecular restrictions that determine substrate range and selectivity is the focus of this Perspective. We first summarize the main determinants of AID/APOBEC substrate selectivity in members that are able to deaminate only DNA (we call these ‘specialists’) and those that can deaminate both RNA and DNA (we term these ‘generalists’) (Table 1); note that family members where activity has not yet been tested on both substrates remain unassigned in this scheme. We then provide examples of how these different functionalities have allowed specific members of the AID/APOBEC family to drive evolution in different contexts. Finally, we discuss how AID/APOBEC enzymes have been co-opted into synthetic biology — specifically into the genome and transcriptome engineering technologies broadly known as programmable base editing, which have enormous therapeutic potential12. A broad understanding of the molecular features that drive AID/APOBEC selectivity will be key to the development of such precision therapeutics.

Table 1 AID/APOBEC members are classified as generalists or specialists based on their substrate flexibility and functional restriction

Determinants of substrate selectivity

AID/APOBEC enzymes are built from up to three major functional elements: all contain the catalytic domain (comprising the enzymatic pocket that, in part, overlaps with the substrate binding surface), whereas some also contain a cofactor interaction region (that can also multimerize) and sequence elements that define the subcellular localization of each protein (Fig. 2a). Sequence and/or structural variations in any of these features can change nucleic acid preference, for example through minor alterations in the substrate binding groove, or through restricted subcellular localization, such as exclusion from the nucleus through interaction with cofactors or through intermolecular oligomerization (reviewed elsewhere13,14).

Fig. 2: The emergence of the AID/APOBEC family and the conserved core cytidine deaminase domain.

a | Left: simplified phylogenetic tree of the AID/APOBEC family, which is believed to have emerged by co-opting prokaryotic tRNA editing enzymes (Tad/ADAT2) to deaminate DNA156. In the vertebrate-specific branch, AID and APOBEC2 are the most ancient members (present in cartilaginous and bony fish)157. APOBEC1 emerged later in the tetrapod–lungfish divergence; and APOBEC3 appeared even later, in placental mammals. Both are believed to have evolved from AID gene duplications82,158. Paralogue expansion within placental mammals led to emergence of several APOBEC3 subfamily members, with the seven members of the human subfamily being among the most diverse82,159. More recently, orthologues of APOBEC4 have been found in invertebrates, suggesting it predates rest of family members and forms a separate invertebrate branch158. Right: domain delineation of members of the vertebrate-specific AID/APOBEC family. Each member of the family contains the core zinc-dependent cytidine deaminase domain (core CDA). Specific members contain accessory motifs within core CDA that determine subcellular localization, including nuclear localization signal (NLS), nuclear export signal (NES) and cytoplasmic retention signal (CRS). Some members contain additional accessory regions that provide specific molecular properties: for example, APOBEC2 contains an amino-terminal intrinsically disordered region (IDR) whereas the carboxy terminus of APOBEC1 is hydrophobic. b | Core CDA composed of a five-stranded β-sheet (β1–β5) surrounded by six α-helices (α1–α6). Several loops found within the deaminase fold (L-1 to L-10) with loops 1, 3, 5 and 7 forming the substrate binding groove. Catalytic pocket coordinates a zinc ion (Zn; green sphere) with the His-Glu (H and E) and Cys-Cys (C) motifs found on α2 and L-5/α3, respectively. AID/APOBEC, activation-induced cytidine deaminase/apolipoprotein B mRNA-editing enzyme catalytic polypeptide-like.

Key enzymatic features

Here, we describe common structural features of the generalists (APOBEC1, A3A and A3G) and how they relate to their ability to bind to and deaminate RNA and DNA. We compare and contrast these features with those of mammalian AID and APOBEC2, two specialists that have lost this substrate flexibility and have assumed unique functions. We specifically focus on the substrate binding groove of these proteins, which is largely defined by four loops surrounding the active site (loops 1, 3 and 7, with minor contributions from loop 5) (reviewed elsewhere13) (Fig. 2b). These loops have been demonstrated to be responsible for the interaction of AID/APOBECs with their substrates15,16,17,18,19,20,21, but they also help delineate the contours of the catalytic pocket. Therefore, together they define most enzymatic functionality — from substrate binding to dinucleotide preference to catalysis.

Loop 7 residues and nucleic acid interaction within generalists

Recent co-crystal structures of A3A bound to a six-nucleotide single-stranded DNA (ssDNA) substrate (PDB:5SWW) (Table 2) revealed that the deoxycytidine to be deaminated (C0) is located at the bottom of a substrate binding groove shaped by loops 1, 3 and 7, where it forms a π-stacking interaction with Y130 in loop 7 (Fig. 3a). A3A protein variants in which Y130 is replaced by an alanine (Y130A) lacked deaminase activity in vitro, indicating that this residue is required for catalysis20. Y130 of A3A corresponds to residues Y315 of A3G and F120 of APOBEC1 (Fig. 3b,c). The co-crystal structure of A3G bound to a nine-nucleotide ssDNA substrate containing a 5′-TCCCA-3′ target sequence (PDB:6BUX) (Table 2) confirmed that Y315 forms a π-stacking interaction with C0 (ref.22) (Fig. 3b). Moreover, Y315A variants of A3G have significantly reduced binding to both ssDNA and RNA substrates in in vitro binding assays23. Co-crystal structures of APOBEC1 bound to nucleic acid substrates are not yet available, but evidence suggests it too interacts with C0 through a π-stacking interaction: F120A variants have little or no deaminase activity towards RNA or DNA substrates in in vitro assays21; and alignment of A3A and A3G co-crystal structures with the structure of APOBEC1 (PDB:6X91) (Table 2) shows an almost perfect overlap of APOBEC1 F120 with A3A Y130 and A3G Y315 (Fig. 3b). Molecular dynamics simulations suggest that this catalytic pocket ‘breathes’, often occluding or restricting substrate entrance20,24, which raises the possibility that it enforces local sequence preference and might even be selectively druggable.

Table 2 Available crystal structures of AID/APOBECs

Fig. 3: Structural insights from generalists and specialists.

a | Co-crystal structure of APOBEC3A (A3A) bound to six-nucleotide single-stranded DNA (ssDNA; turquoise) (PDB:5SWW) (Table 2). Target deoxycytidine (C0) located at bottom of the substrate binding groove formed by loops 1, 3 and 7 and forms a π-stacking interaction with Y130 (PDB:5SWW) (Table 2). b | Overlapping crystal structures of APOBEC1 (purple) (PDB:6X91) (Table 2), A3A bound to ssDNA (pink) (PDB:5KEG) (Table 2) and A3G bound to ssDNA (blue) (PDB:6BUX) (Table 2). Residues F120 (APOBEC1), Y130 (A3A) and Y315 (A3G) form critical aromatic π-stacking interactions with target C (C0, from co-crystal structure of A3A bound to ssDNA; turquoise) (PDB:5SWW) (Table 2). c | Alignment of amino acids present in loop 7 for different APOBECs and target C preference motif for each. d | A3A binds U-shaped substrates, such as ssDNA (orange) (nucleobases represented as blue sticks) (PDB:5SWW) (Table 2). e | AID in co-crystal structure with dCMP ligand (PDB:5W0U) (Table 2) shown as a molecular surface. AID loops 1, 3 and 7 form positively charged (blue) bifurcated substrate binding surface, comprising ‘substrate channel’ (which hosts dCMP) and a second groove, termed the ‘assistant patch’. The two grooves are separated near the point of convergence by negatively charged residues in loop 7 (red) known as the ‘separation wedge’. f | Co-crystal structure AID (light brown) with a dCMP ligand (orange) (PDB:5W0U) (Table 2) overlaid with crystal structure of APOBEC2 (blue) (PDB:2NYT) (Table 2). Loop 1 of APOBEC2 obstructs substrate (orange) at the active site. Residue E60 in APOBEC2 forms a fourth point of coordination with zinc ion (Zn; green sphere). AID/APOBEC, activation-induced cytidine deaminase/apolipoprotein B mRNA-editing enzyme catalytic polypeptide-like. Part e is adapted from ref.13, CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/).

The residues in loop 7 are also essential for the interaction with the nucleotide preceding C0 (the –1 position), thereby defining the local dinucleotide sequence preference (Fig. 3c). This was demonstrated by cell-based assays in which loop 7 of A3G was replaced with the corresponding region from A3A, which changed its dinucleotide preference from 5′-CC to 5′-TC (the preference of A3A)18. Single amino acid exchanges demonstrated the critical role of D317 of A3G as a main determinant of dinucleotide substrate preference. Interestingly, D317W led to stronger preferences towards 5′-TC than did D317Y, suggesting that amino acids with larger aromatic side chains may be even more favourable to 5′-TC deamination18. The currently available co-crystal structures of A3A and A3G bound to nucleic acids (Table 2) provide an explanation of these findings. D131 and Y132 of A3A have a primary role in defining the A3A preference for a 5′-TC sequence motif: although these residues have the potential to interact extensively with either T–1 or C–1, the size of the –1 pocket precludes access of the larger purine16,20. Taken together, these findings highlight the importance of the residues in loop 7 in determining the local dinucleotide preference of a generalist, most likely by affecting the size of the substrate binding pocket.
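The notion of a local dinucleotide preference can be made concrete with a small sketch that tallies the nucleotide at the –1 position of each deaminated cytosine. The function name, the toy sequence and the list of edited positions are hypothetical; in practice the positions would come from sequencing data, and the counts would need to be compared with the background frequency of each dinucleotide.

```python
from collections import Counter


def minus_one_context(sequence, edited_positions):
    """Count the dinucleotide (base at -1 plus the target C) for each edited C.

    Positions are 0-based indices into `sequence` (5' to 3').
    """
    seq = sequence.upper()
    counts = Counter()
    for pos in edited_positions:
        if seq[pos] != "C":
            raise ValueError(f"position {pos} is not a C")
        if pos == 0:
            continue  # no 5' neighbour, so no -1 context to record
        counts[seq[pos - 1] + "C"] += 1
    return counts


toy_seq = "ATCGTCACCA"      # illustrative sequence (assumption)
toy_edits = [2, 5, 8]       # illustrative edited positions (assumption)
print(minus_one_context(toy_seq, toy_edits))
```

In this toy input, two of the three edited cytosines sit in a 5′-TC context and one in a 5′-CC context.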

Although these experiments indicate that the catalytic pocket confers some degree of substrate sequence specificity, small-molecule inhibitors that can discriminate between AID/APOBEC family members have yet to be identified — implying that the catalytic pockets of these enzymes share strong common features (recently reviewed elsewhere25) and that the main determinant of substrate selectivity may, in fact, be the substrate binding groove.

The substrate binding groove of generalists is U-shaped

In vitro assays indicate that generalist AID/APOBECs prefer structured substrates; disrupting stem–loops in ssDNA and RNA substrates directly alters the frequency with which they are deaminated by A3A and A3G (refs26,27,28,29). Moreover, the co-crystal structures of A3A and A3G bound to nucleic acid demonstrate that the U-shaped substrate binding groove formed by loops 1, 3, 5 and 7 (with the catalytic pocket located at the bottom of the U shape where the π-stacking interaction occurs)16,20 (Fig. 3d) optimally accommodates a stem–loop structure. Although crystal structures of APOBEC1 bound to ssDNA or single-stranded RNA (ssRNA) are not currently available, its best-studied target, apolipoprotein B (APOB) mRNA, is predicted to form a stem–loop secondary structure30,31,32. Thus, generalists bind substrates with similar conformations to tRNA, the substrate of the distantly related tRNA adenosine deaminases (such as TadA33; Fig. 2a), suggesting that the shape of the binding groove in generalists has co-evolved with the structure of their nucleic acid substrate. Moreover, the shape of this groove could be predictive of generalists. We note here that a similarly shaped groove (termed ‘patch 1’) is evident in co-crystal structures of A3H with ssDNA and RNA34. For the purposes of this Perspective, A3H is considered ‘unassigned’ because its ability to catalyse RNA editing has not yet been tested. However, given the shape of its substrate binding groove and its demonstrated ability to bind RNA (see below), we would predict that it too might function as a bona fide RNA editor.

Groove residues help generalists discriminate between RNA and DNA substrates

Residues in the loops forming the substrate binding groove of AID/APOBECs have key roles in substrate discrimination (that is, binding of RNA versus DNA). For example, a W121A substitution in loop 7 of APOBEC1 almost completely abolishes deamination of RNA while retaining activity on DNA, indicating an essential role of this amino acid in substrate differentiation21. Notably, alignment with other APOBECs reveals that W121 in APOBEC1 corresponds to Y113 in A3H (Fig. 3c), a residue that directly interacts with a ribose 2′-hydroxyl of bound RNA15,19 (PDB:6B0B, PDB:5W3V) (Table 2). The same residue also corresponds to D131 in A3A and D316 in A3G. As discussed above, these residues have been shown to be important for deamination activity on ssDNA and for local dinucleotide sequence preference16,18,20,35, but no evidence yet exists for their function on RNA. However, two A3A protein variants were recently described that exclusively deaminate RNA. Both variants have a Y132G substitution combined with additional substitutions in loop 1 or helix 6, implicating these amino acids in substrate discrimination36. Nonetheless, much remains to be learned about how generalist APOBECs discriminate between RNA and DNA substrates; additional structures of the proteins bound to DNA or RNA, in combination with genetic studies targeting specific amino acids, will be necessary to pinpoint the residues that define substrate selectivity.

Structural differences in grooves of specialists reflect functional differences

The structure of AID (with a dCMP bound within the catalytic pocket) (PDB:5W0U) (Table 2) revealed a bifurcated, rather than U-shaped, substrate binding surface. Residues of loops 1, 3 and 7 are essential in shaping the substrate channel in which the dCMP is bound37 and which is connected to a second groove, termed the ‘assistant patch’37 (Fig. 3e). Positively charged basic residues in these channels form a binding surface, which is separated near their point of convergence by negatively charged residues in loop 7 (the ‘separation wedge’) (Fig. 3e). Groove residues are highly conserved in AID proteins from different species, but not among other APOBECs, highlighting that this structure is specific to AID. Interestingly, similar separation wedge structures have been observed for proteins that recognize branched nucleic acids, such as T4 RNase H38 or Cas9 (ref.39), suggesting that AID recognizes structured substrates. Although AID targeting mechanisms are still not fully clarified, the conformation of the substrate binding region agrees with recent experiments that reveal a possible role for G-quadruplex structures in guiding and targeting AID, at least in the context of immunoglobulin class switch recombination (CSR)37,40. These data also highlight the importance of the substrate binding groove structure in allowing different AID/APOBECs to discriminate substrates based on their secondary structure. It must also be noted that AID, similar to other specialists, can bind RNA41, especially within RNA–DNA hybrids42, but cannot deaminate it43, suggesting again that binding is required but not sufficient for catalysis.

The structure of APOBEC2 was the first among the AID/APOBEC family to be published44,45 (PDB:2NYT) (Table 2), but little is known about its molecular substrate and so co-crystal structures are currently unavailable. As such, it is not possible to assess the conformation of the substrate binding groove, but the APOBEC2 structure does provide some insight into its lack of deaminase activity. E60 in APOBEC2 forms a point of coordination with the zinc ion that is absent from catalytically active AID, A3A, A3G or APOBEC1, and this may affect catalytic activity by disrupting coordination of an essential water molecule or by modulating substrate affinity46 (Fig. 3f). Deamination could also be prevented by obstruction of the nucleic acid binding pocket by loop 1 (Fig. 3f). However, given the flexibility of this loop seen in its solution structures44, intermolecular interactions affecting its conformation may allow transient access to the deaminase active site and transient interactions with nucleic acid. Recent work from our laboratory, which has been made available as a preprint, strongly suggests that APOBEC2 has retained the ability to interact with ssDNA containing GC-rich motifs; moreover, this interaction seems to affect gene expression11. It is tempting to speculate that, similar to AID, APOBEC2 may interact with G-quadruplex structures found within these GC-rich promoter sequences. Alternatively, APOBEC2 may interact with transient ssDNA structures resulting from RNA polymerase promoter melting, in a manner similar to other APOBECs47. We currently speculate that transcriptional repression through chromatin interaction may be an evolutionarily conserved function of APOBEC2 (ref.48), especially in the context of cellular reprogramming.

Taken together, the available AID/APOBEC structures illustrate the flexibility of their core structure and how it maintains the active site requirements of the family while enabling substrate restriction and functional specialization in some members or broader substrate preference and functional plasticity in others.

Subcellular localization

Regardless of the innate capacity of an AID/APOBEC protein to bind and deaminate DNA, RNA or both substrates, its ability to do so in cells will depend on its subcellular localization and its access to the specific substrate. Whereas mRNA, viral RNA and viral DNA can all be deaminated in either the nucleus or the cytoplasm, the host genome can only be deaminated by nuclear-localized family members. For example, despite having DNA binding and deamination capabilities, the generalist A3G cannot mutate genomic DNA because it is confined to the cytoplasm.

The subcellular localization of each member of the AID/APOBEC family may depend on active or passive cellular mechanisms. Transit of AID and APOBEC1 between the nucleus and the cytoplasm relies on both an amino-terminal bipartite basic nuclear localization signal (NLS) sequence and a strong carboxy-terminal leucine-rich nuclear export signal (NES) sequence49,50 (Fig. 2a). APOBEC1 also contains a C-terminal hydrophobic domain, which is involved in intramolecular interactions that can play a part in further defining subcellular localization21. An extensive study of AID and APOBEC2 protein chimaeras showed that nuclear import of AID involves residues in addition to the N-terminal NLS, whereas APOBEC2 lacks NLS or NES motifs and, instead, passively diffuses between the cytoplasmic and nuclear compartments51. Unlike the rest of the AID/APOBEC family, APOBEC2 contains an N-terminal glutamate-rich acidic intrinsically disordered region (IDR), which could further restrict its subcellular localization through intermolecular interactions with shuttling proteins or cofactors (Fig. 2a).

Single-domain human APOBEC3 paralogues, A3A, A3C and A3H, are small enough (~25 kDa) to passively enter and exit the nucleus, and are generally found throughout the cell during interphase52 (Fig. 2a; Table 1). For example, A3H lacks an NLS but enters the nucleus through passive diffusion and is retained within the nucleolar subcompartment53. By contrast, the larger (>50 kDa) double-domain APOBEC3 paralogues cannot passively enter the nucleus; A3B is constitutively nuclear owing to its N-terminal NLS52,53,54, whereas A3D, A3F and A3G lack an NLS and are mostly found within the cytoplasm (Fig. 2a). Interestingly, A3G seems to contain a novel cytoplasmic retention signal (CRS)55. All human APOBEC3 paralogues are excluded from chromatin during mitosis when the nuclear envelope breaks down, which presumably inhibits genome mutagenesis52 (ref.14 offers an in-depth review on trafficking kinetics of the AID/APOBEC family of proteins).

Cofactors

AID/APOBEC enzymes interact with numerous protein cofactors that enable them to carry out their functions in the cell. Here, we focus on cofactors that affect substrate targeting or modulate catalytic activity.

To date, APOBEC1 is the only AID/APOBEC protein for which specific cofactors have been demonstrated to modulate its catalytic activity. In mice, APOBEC1 is expressed in the small intestine and the liver, where it edits a specific cytosine within the APOB pre-mRNA. The C-to-U RNA editing event recodes a CAA codon to a stop codon, resulting in a truncated form of the APOB protein, called APOB-48 (refs56,57) (Box 1; Fig. 1a). Two cofactors of mouse APOBEC1 (mAPOBEC1) — APOBEC1 complementation factor (A1CF)58,59 and RNA-binding motif protein 47 (RBM47)60 — have so far been identified, but given that doubly mutant mice lacking both of these cofactors still retain some C-to-U editing activity, other cofactors are likely to exist61,62. A1CF and RBM47 bind RNA, interact directly with APOBEC1 protein58,59,60 and have an essential role in defining which RNAs are targeted for editing as well as determining the level of editing per target61,62. Elegant genetic dissection in a mouse system suggests that cofactors ‘recruit’ different (sometimes partially overlapping) sets of transcripts to the editing complex (Fig. 1a) and that cofactor dominance is associated with editing frequency61,62. Together with the fact that APOBEC1 exerts its biological function by deaminating target cytosines within cohorts of transcripts that define common pathways63, these experiments support the idea that distinct tissues drive APOBEC1 to specific sets of transcripts through the provision of different sets of cofactors64.
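The recoding event described above, in which a single C-to-U change turns a CAA (Gln) codon into a UAA stop codon, can be sketched as follows. The short mRNA is a toy sequence chosen for illustration and is not the APOB transcript; the function names are hypothetical.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}  # standard genetic code


def deaminate_c_to_u(rna, pos):
    """Return a copy of `rna` with the C at 0-based `pos` converted to U."""
    if rna[pos] != "C":
        raise ValueError(f"position {pos} is not a C")
    return rna[:pos] + "U" + rna[pos + 1:]


def first_stop_codon(rna):
    """Return the index of the first in-frame stop codon, or None."""
    for codon_index in range(len(rna) // 3):
        codon = rna[3 * codon_index: 3 * codon_index + 3]
        if codon in STOP_CODONS:
            return codon_index
    return None


unedited = "AUGGAUCAAUUUGGG"             # codons: AUG GAU CAA UUU GGG (toy)
edited = deaminate_c_to_u(unedited, 6)   # CAA becomes UAA
print(first_stop_codon(unedited), first_stop_codon(edited))
```

The unedited toy transcript has no in-frame stop codon, whereas the edited one terminates at the third codon, mirroring how editing yields a truncated protein.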

Several potential cofactors have been identified for AID65,66,67,68,69,70, but none has been proven to be the key determinant in targeting AID to the immunoglobulin locus, its physiological target. Finally, a secondary Zn2+ ion has been shown to allosterically modulate catalysis of A3A and A3G (ref.71). Although not a cofactor in the traditional sense, this functionality points to possible surfaces that could be occupied by more traditional cofactors to regulate enzymatic function.

AID/APOBECs drive adaptive evolution

RNA editing and DNA mutations have very different features; editing is transient and tunable, whereas mutations are irreversible and heritable. Despite these differences, both mechanisms create genetic variability that has an essential role in adaptive evolution72,73,74. In this section, we discuss how AID/APOBEC proteins can drive adaptive evolution in viral and cancer genomes owing to their ability to deaminate both RNA and DNA.

APOBEC3 proteins in viral genome evolution

Early experiments predicted that T cells express a factor that blocks the replication of viral infectivity factor (Vif)-deficient human immunodeficiency virus type 1 (HIV-1)75,76. A3G was later identified as one of the factors responsible for this HIV-1 restriction through active deamination of nascent retroviral cDNA77,78,79, with subsequent studies highlighting the involvement of A3D, A3F and A3H (refs80,81). Although many of these experiments were performed in APOBEC3-overexpressing cells infected with pseudotyped HIV and may not fully reflect in vivo conditions, the general consensus is that several APOBEC3 proteins individually and synergistically restrict viral infectivity of HIV and many other viruses during natural infections, a view that is supported by the substantial expansion of the APOBEC3 family in organisms that support large infection loads, such as bats82,83. In a process known as hypermutation, APOBEC3 proteins can deaminate a substantial proportion of the total cytosines in the HIV cDNA in a single round of viral replication, with reports of up to 10% in in vitro or cell culture experiments and up to 98% in HIV sequences isolated from peripheral blood mononuclear cells. The resulting uracils are recognized and excised by the host uracil DNA N-glycosylase (UNG) protein, which initiates the base excision repair pathway and, ultimately, leads to heavily damaged genomes containing multiple abasic sites. These genomes can be further cleaved and degraded, thereby decreasing viral infectivity77,83. However, genomes with less extensive damage (and fewer abasic sites) can simply be repaired, often resulting in mutations that can support viral evolution84 and the acquisition of drug resistance85, altered transmission and immune escape85,86 (Fig. 4a).

Fig. 4: The consequences of deamination for adaptive evolution.

a | APOBEC levels are upregulated in response to viral infection, and both single-stranded DNA (ssDNA) and single-stranded RNA (ssRNA) viral genomes can undergo APOBEC-mediated deamination resulting in mutations. When mutational load in the viral genome is so high that the genome cannot perform its function, the genome is degraded and viral particles are not produced; this process is known as viral restriction. However, mutations resulting from lower levels of deamination can become fixed in the viral genome after replication, increasing the probability of producing viral variants with altered characteristics (compared with the original strain). b | If AID/APOBEC-catalysed C-to-U deamination events are not repaired by the base excision repair pathway during replication, resulting C-to-T transitions can induce massive DNA damage and genome instability. Cell death can be triggered if accumulation of transitions leads to an excessive mutational load, resulting in tumour restriction. However, lower levels of mutation can fuel genome variability and tumour cellular heterogeneity that, upon selection, can result in adaptive evolution and cancer progression. AID/APOBEC, activation-induced cytidine deaminase/apolipoprotein B mRNA-editing enzyme catalytic polypeptide-like.

Analysis of HIV genomes that have undergone hypermutation or are associated with immune escape reveals an enrichment of APOBEC3-defined mutational signatures, which, in conjunction with biochemically derived triplet preferences, strongly support a physiologic role for specific APOBEC3 family enzymes in both viral restriction and viral evolution (reviewed elsewhere87). Although the majority of knowledge surrounding APOBECs and viral restriction comes from the study of retroviruses, DNA viruses such as hepatitis B virus (HBV) and human papilloma virus (HPV) are also restricted by APOBEC3 enzymes88,89,90. Additionally, some APOBEC3 proteins can also deaminate viral genomes composed solely of RNA, such as the positive-sense RNA genome of the betacoronavirus SARS-CoV-2. Soon after the beginning of the COVID-19 pandemic, RNA sequencing data from bronchoalveolar lavage fluid of patients with COVID-19 was used to monitor the mutational signatures shaping the viral genome before fitness selection91,92. The most common mutations detected in these sequencing data were A-to-G and T-to-C changes (possibly the outcome of ADAR1 activity on the positive-sense and negative-sense strands, respectively, during viral replication) followed by C-to-T and G-to-A changes, likely mediated by APOBEC3 proteins, the only AID/APOBEC family members known to bind and deaminate viral RNA91,92,93. The involvement of APOBEC3 proteins is further supported by the frequent occurrence of edited Cs within the motif 5′-U/ACU/A-3′ (refs91,94) (although a recent preprint indicates this could also be explained by APOBEC1-mediated deamination95) and in terminal loop rather than stem sequences96, and the upregulation of APOBEC3 proteins in samples from patients with COVID97,98,99,100. 
Analysis of SARS-CoV-2 genomic sequences largely acquired through the process of viral genome surveillance of variants of interest over the course of the pandemic has revealed that, after fitness selection, about 40% of all mutations involve C-to-T changes (reviewed elsewhere100,101), which are at least partially confined to a group of mutational hotspots102, a pattern consistent with APOBEC3 activity. Numerous other ssRNA viruses (including human T cell leukaemia virus type 1 (HTLV-1) and rubella) have been shown to be targeted by APOBEC3 proteins (reviewed elsewhere81). Overall, deep sequencing data strongly support a functional role of APOBEC3 family members in the restriction of ssRNA viruses in natural settings.
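
As a rough illustration of how the sequence-context enrichment described above might be tallied, the following Python sketch counts C-to-U changes in a list of variant calls and reports the fraction that fall within a 5′-[U/A]C[U/A]-3′ context. The function name, the input format (0-based positions against an RNA-alphabet reference) and the toy sequence are assumptions made for illustration only; they are not taken from the cited studies.

```python
def c_to_u_context_fraction(reference, variants, flank="UA"):
    """Tally C-to-U variants and how many sit in a [U/A]C[U/A] context.

    reference: RNA or DNA string; T is treated as U (illustrative assumption).
    variants:  iterable of (pos, ref, alt) with 0-based positions (assumption).
    Returns (n_c_to_u, n_in_motif, fraction_in_motif).
    """
    reference = reference.upper().replace("T", "U")
    n_c_to_u = 0
    n_in_motif = 0
    for pos, ref, alt in variants:
        ref = ref.upper().replace("T", "U")
        alt = alt.upper().replace("T", "U")
        if ref != "C" or alt != "U":
            continue  # only C-to-U (C-to-T) changes are considered
        if reference[pos] != "C":
            raise ValueError(f"reference base at position {pos} is not C")
        n_c_to_u += 1
        # Both neighbours must exist to evaluate the trinucleotide context.
        if 0 < pos < len(reference) - 1:
            if reference[pos - 1] in flank and reference[pos + 1] in flank:
                n_in_motif += 1
    fraction = n_in_motif / n_c_to_u if n_c_to_u else 0.0
    return n_c_to_u, n_in_motif, fraction


# Toy data (hypothetical): three C-to-U calls and one unrelated A-to-G call.
toy_reference = "GACUAGCGUCAACCG"
toy_variants = [(2, "C", "U"), (6, "C", "U"), (9, "C", "U"), (4, "A", "G")]
print(c_to_u_context_fraction(toy_reference, toy_variants))
```

A real analysis would also need to compare this fraction with the background frequency of the motif in the genome, which this sketch does not attempt.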

Taken together, these studies clearly show the effects of APOBEC3 mutagenesis on viral genomes and its relevance to virus evolution84,103,104. As generalists with a preference for viruses with ssRNA and DNA genomes105, A3A and A3G contribute to restriction of a range of viruses but can also drive evolution of retroviruses (such as HIV-1 (ref.85)), DNA viruses (such as herpesviruses74) and also ssRNA viruses that lack ssDNA intermediates (including SARS-CoV-2 and rubella among others96).

AID/APOBECs and cancer evolution

The first solid piece of genetic evidence linking any AID/APOBEC family member to cancer was the finding that APOBEC1 overexpression in the liver of transgenic animals induces hepatocellular carcinoma106, although whether this was the result of RNA editing or DNA mutation remained unclear. Ectopic expression of AID was later shown to catalyse off-target DNA mutations and chromosomal translocations107,108, albeit at rates substantially lower than those reported for its true target, the immunoglobulin genes. Subsequently, some APOBEC3 family members (chiefly those with access to the nucleus) were reported to be a cause of DNA damage and mutagenesis109,110. Indeed, based on mutational signatures found in cancer genomes, AID/APOBEC-derived mutations are present in more than 50% of human cancer types, and account for 5–90% of all substitution mutations111,112. In addition, AID/APOBEC mutations can occur in clusters over kilobase-sized regions113,114. These hypermutated clusters are termed kataegis mutations113,115 and have been reported in more than 60% of cancers116. They are especially prominent in cancer types where APOBEC3 mutagenesis is active117. Expression of some AID/APOBEC enzymes in tumours (such as AID in chronic myeloid leukaemia118 or A3B in tamoxifen-resistant breast cancer119) has been correlated with increased tumour evasion and drug resistance, suggesting that they drive tumour evolution. Independently of kataegis mutations, APOBEC3-catalysed mutagenesis can also lead to chromosomal instability120,121 and, thus, to either cell-autonomous lethality122,123 or to cancer evolution through increased tumour heterogeneity124. Given that these outcomes mirror those of APOBEC3-mediated viral restriction, we hypothesize that expression of APOBEC3 proteins is induced by the inflammatory cancer microenvironment in an attempt to kill malignant cells via localized hypermutation. 
However, when APOBEC3-mediated mutation is not successful in achieving tumour restriction125,126, the tumour cells that have evaded cell death (that is, those with non-lethal levels of mutation) can drive cancer evolution, thus leaving behind a mutational signature in the genome at sites that are likely directly related to the original drive to restrict26 (Fig. 4b).
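
To make the idea of clustered mutations more concrete, the sketch below groups mutation coordinates from a single chromosome into runs in which neighbouring mutations lie close together, and reports runs that contain enough mutations to be called a cluster. The function name and both thresholds (`max_gap` and `min_mutations`) are illustrative assumptions, not the definition of kataegis used in the cited studies.

```python
def find_mutation_clusters(positions, max_gap=1000, min_mutations=6):
    """Return clusters as (start, end, n_mutations) tuples.

    positions:      mutation coordinates on one chromosome (assumption).
    max_gap:        largest allowed distance between neighbouring mutations
                    within a cluster (illustrative threshold).
    min_mutations:  smallest run that is reported (illustrative threshold).
    """
    clusters = []
    run = []
    for pos in sorted(positions):
        # A gap larger than max_gap closes the current run.
        if run and pos - run[-1] > max_gap:
            if len(run) >= min_mutations:
                clusters.append((run[0], run[-1], len(run)))
            run = []
        run.append(pos)
    # Flush the final run.
    if len(run) >= min_mutations:
        clusters.append((run[0], run[-1], len(run)))
    return clusters


# Toy coordinates (hypothetical): one dense run followed by scattered events.
toy_positions = [100, 400, 650, 900, 1500, 2100, 2300, 50000, 50200, 120000]
print(find_mutation_clusters(toy_positions))
```

In practice, such a scan would typically be restricted to mutations of the relevant type and sequence context before clustering.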

The RNA editing capacity of some AID/APOBEC deaminases has also been directly linked to the generation of heterogeneity essential to tumour evolution127,128,129,130,131 (for a comprehensive recent review on the AID/APOBEC but also ADAR contribution to tumour evolution, see ref.128). For example, loss of editing (through ablation of Apobec1) in the small intestine of a mouse model of intestinal cancer (the APCmin mouse) leads to substantial tumour reduction132. Additionally, deletion of Apobec1 from the germline of a mouse model of testicular cancer (in which around 8% of male mice succumb to testicular teratocarcinomas by 4 weeks of age) ablates susceptibility133. Finally, it has recently been demonstrated that the location of A3A-catalysed DNA mutations in cancer genomes can be predicted in clinical samples by monitoring the frequency of A3A RNA editing at the same loci28. This finding supports the notion that editing precedes mutation and that RNA editors induced under inflammatory conditions can also inflict DNA damage, such as kataegis mutation. More generally, these data imply that the RNA editing state of a cell determines the fate of that cell, even in the absence of a heritable genomic mutation. Indeed, both A3A expression and RNA editing were detected in cancers such as acute myeloid leukaemia and myeloproliferative neoplasm28, yet APOBEC-associated genomic signatures are only a minor component of the mutational signatures present in these tumours111, further implying that A3A activity on RNA could precede DNA mutagenesis in cancer.

AID/APOBECs as base-editing tools

In this section we will discuss how AID/APOBEC enzymes have been used in genome and transcriptome engineering technologies, broadly known as programmable base editing (Fig. 5), to revert T-to-C or A-to-G transitions in DNA or mRNA, and how a fuller understanding of their substrate specificities can inform the design and optimization of these tools. As this Perspective is focused on AID/APOBECs, we will not discuss mRNA base-editing technologies that are based on adenosine deaminase enzymes (reviewed extensively elsewhere134,135,136,137,138,139).

Fig. 5: APOBEC-derived DNA and RNA base-editing tools.

a | DNA-directed cytosine base editor (CBE) comprises catalytically dead Cas9 (dCas9), guide RNA (gRNA), single-stranded DNA (ssDNA) deaminase and uracil DNA glycosylase inhibitor (UGI). b | Quantification of on-target editing (on DNA) and off-target editing (on RNA), allowing quick visual understanding of features of each genome base editor. Several CBE variants are illustrated, which differ with respect to the deaminase used (rat APOBEC1 (rA1) or human APOBEC3A (A3A)) and specific mutations within the deaminase. Note that amino acids identified in the human APOBEC1 structure21 as likely to be functionally important can inform base-editing work with rA1. c–e | RNA-directed CBE tools based on APOBEC proteins. APOBEC variants are directed to a specific nucleotide in a transcript of interest via an antisense gRNA, the design of which varies according to the editing system. In RNA base editing by CURE (cytidine-specific C-to-U RNA Editor), targeting is mediated by a gRNA that recruits a chimeric protein comprising either dCas13 or dCasRx and a Y132D variant of A3A to the RNA target, and the gRNA creates a 14-nucleotide loop containing the C to be edited (part c). In RNA base editing with a SNAP-tagging system, a mouse APOBEC1 (mAPOBEC1)-SNAP chimaera is recruited to the target RNA via covalent linkage to a benzylguanine (BG)-modified gRNA; unlike in the other systems, the C to be deaminated is positioned four to six nucleotides downstream of the region bound by the gRNA (part d). In RNA base editing with an MS2-tagging system, a human APOBEC1 deamination domain (hA1DD)–MS2 chimaera is recruited to a specific location on the target RNA by binding of MS2 coat proteins to the MS2 stem–loop on the gRNA; in this system, the C to be edited is specified by a C:A mismatch between the target RNA and the gRNA (part e). AID/APOBEC, activation-induced cytidine deaminase/apolipoprotein B mRNA-editing enzyme catalytic polypeptide-like; WT, wild type.

DNA-directed base-editing tools

The first members of the AID/APOBEC family to be used as the basis of a cytosine base editor (CBE) were AID, rat APOBEC1 (rA1) and A3G. A seminal paper from the Liu laboratory used catalytically dead CRISPR-associated endonuclease (dCas) fused to these AID/APOBEC family members, together with appropriate Cas9 guide RNAs (gRNAs), to target deaminase activity to specific loci and induce single base changes in the absence of a DNA break140. Given the substantial activity of rA1 as a DNA mutator109,141,142, its fusion with dCas9 was the most efficient at generating specific C-to-T (or G-to-A) substitutions within DNA, constituting the first CBE140. Several variations of this system were soon developed to increase base-editing efficiency (by fusion with a uracil DNA glycosylase inhibitor (UGI)), to reduce indel generation (for example, by using Cas9-D10A, a nickase mutant of Cas9) and to reduce off-target editing (by using A3A or AID instead of APOBEC1) (reviewed elsewhere138).

Given that rA1 and A3A are generalists, it was unavoidable that DNA editing systems based on these deaminases would also lead to several thousand unwanted RNA editing events143,144. However, this off-target activity was almost entirely eliminated by introducing specific amino acid changes into rA1 and A3A. Two different two-amino acid changes to rA1 (R33A/K34A and W90Y/R126E) each resulted in reduced off-target activity on RNA while retaining efficient base editing on DNA143,144 (Fig. 5b). Similarly, off-target RNA editing by A3A was reduced by introducing either an R128A or a Y130F amino acid change144 (Fig. 5b). R128A and Y130F of A3A and R126E of rA1 occur in loop 7, emphasizing the importance of residues in this loop for deamination of RNA. Moreover, R33A/K34A changes were shown to affect the capability of APOBEC1 to bind RNA49. These mutations illustrate how a better understanding of the features that determine whether an AID/APOBEC protein acts as a generalist or a specialist might enable specificity issues to be avoided by facilitating more informed CBE design and optimization at the outset.

RNA-directed base-editing tools

The development of AID/APOBECs as RNA-directed CBEs has proven to be more difficult than that of DNA-directed CBEs, leading one group to evolve ADAR proteins to induce C-to-U editing instead145. One possible explanation for these difficulties is that RNA deamination by APOBEC1, A3A and A3G requires the target RNA to adopt specific secondary structures26,29,30,31,146. This theory is supported by studies using the recently developed CURE (cytidine-specific C-to-U RNA Editor) system, which uses gRNAs to target a Y132D mutant version of A3A fused either to dPspCas13b or dCasRx to specific locations in a target transcript. Interestingly, A3A was only able to elicit RNA editing at the desired location when these gRNAs induced the target transcripts to form a loop147 (Fig. 5c). Importantly, no off-target DNA editing was detected using CURE, although a few hundred off-target RNA edits were found147. Two other recently reported RNA-directed CBE approaches used mAPOBEC1 or human APOBEC1 in combination with either SNAP-tagged or MS2-tagged gRNAs to target specific mRNAs148,149 (Fig. 5d,e). Neither of these two methods was checked for off-target DNA editing, but the mAPOBEC1-SNAP system demonstrated that integration of an inducible editing enzyme reduces global off-target RNA editing, as had previously been shown for ADAR RNA base-editing technologies149,150. This method was not benchmarked against CURE, making a direct comparison difficult, but it is important to note that the reported RNA off-target activity of CURE (measured as a simple sum of sites and noting that CURE enzymes are overexpressed) is much lower than that of mAPOBEC1-SNAP (refs147,149). Despite these recent developments, APOBEC1-based RNA-directed CBE systems still suffer from moderate levels of global off-target RNA editing and, owing to the inherent dinucleotide preference of A3A, CURE can only edit Cs present in a 5′-UC-3′ motif (Fig. 3c).
A better understanding of how APOBEC1, A3A and A3G interact specifically with RNA will help improve the current systems and facilitate the development of new ones.

Expanding the potential of base editing

An important limitation of the RNA-directed CBE systems described here is that editing is restricted to locations that match the sequence context preferences of the enzymes used. In particular, no currently known APOBECs naturally edit Cs within a 5′-GC-3′ context (Fig. 3c). Therefore, it will be necessary to develop additional context-specific base editors to complete the spectrum of Cs that can be edited. Considering the importance of residues in loop 7 (but also in loops 1 and 3) in defining the substrate and sequence context preference of the AID/APOBEC enzymes, it seems reasonable to hypothesize that altering residues within these loops may be a way to change the local motif preferences and alleviate target motif limitations. Finally, recruitment of endogenous AID/APOBECs for base-editing purposes (as has been done for ADAR151,152) remains an unexplored field. Further developments in this area are important because endogenous AID/APOBEC enzymes are generally overexpressed in contexts (such as cancer) in which therapeutic editing could be beneficial.
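
The practical consequence of these sequence-context constraints can be sketched in a few lines of Python: the example below tallies every C in a transcript by its 5′ neighbour and lists the positions that satisfy a 5′-UC-3′ requirement such as that described above for A3A-based editing. The function names, the 0-based coordinates and the toy transcript are hypothetical and serve only to illustrate the bookkeeping.

```python
from collections import Counter


def cytosine_contexts(rna):
    """Count Cs by dinucleotide context (5' neighbour + C), e.g. 'UC', 'GC'."""
    rna = rna.upper().replace("T", "U")
    counts = Counter()
    for i in range(1, len(rna)):  # the first base has no 5' neighbour
        if rna[i] == "C":
            counts[rna[i - 1] + "C"] += 1
    return counts


def editable_positions(rna, allowed_5prime="U"):
    """Return 0-based positions of Cs whose 5' neighbour is in allowed_5prime."""
    rna = rna.upper().replace("T", "U")
    return [
        i
        for i in range(1, len(rna))
        if rna[i] == "C" and rna[i - 1] in allowed_5prime
    ]


# Toy transcript (hypothetical sequence).
toy_transcript = "AUCGCUCCAGCAC"
print(cytosine_contexts(toy_transcript))
print(editable_positions(toy_transcript))
```

Passing a different `allowed_5prime` string (for example `"UA"`) shows how many additional Cs would become addressable if an editor with a broader context preference were available.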

Conclusion and future perspectives

Here, we have argued that, under certain conditions, several AID/APOBEC deaminases can act on both RNA and DNA substrates whereas other family members are substrate-restricted. Through the analysis of recently published co-crystal structures we have attempted to describe the features that allow these enzymes to ‘toggle’ between substrates (as APOBEC1 and some APOBEC3 proteins do) and how such activity can be restricted (as in the case of AID and, perhaps, APOBEC2). With the advent of programmable base editors, it will be important to analyse all known AID/APOBEC deaminases (not only all mammalian family members but also distant relatives that seem to exist in marine organisms153) for their properties, in order to develop CBEs that can selectively target RNA or DNA and to expand the local sequence preference of such tools.

Such analyses can also help answer biological questions arising from the close mechanistic relationship between RNA editing and DNA mutation. For example, it is well understood that DNA mutators of the APOBEC3 family are upregulated in cancer tissue. A holistic (but yet to be fully tested) view of the field would argue that these enzymes are actually upregulated in the context of a programmed RNA editing response to inflammation, and that DNA mutation is an off-target outcome of this response28,143. If, as implied, RNA is the preferred substrate for these enzymes, it will be important to understand the physiologic role of RNA editing in the context of an early host response to tumour inflammation. Finally, if kataegis mutations (detected in the majority of human cancers) are simply the by-product of the host’s attempt to limit tumour growth, then RNA editing could be used diagnostically as an early biomarker for ongoing tumour diversification and relapse28.