Introduction

The amount of nuclear DNA, the C-value, is a characteristic of a species (Swift 1950). The C-value refers to the total amount of DNA in an unreplicated haploid or gametic nucleus of an organism (Greilhuber et al. 2005) and is often reported in picograms (pg) (1 pg ≅ 980 Mbp). In eukaryotes, the C-value varies tremendously, by as much as 66,000-fold, from 0.0023 pg in Encephalitozoon intestinalis, a parasitic microsporidian, to 151.9 pg in Paris japonica, a monocotyledonous plant in the Liliales (Corradi et al. 2010; Pellicer et al. 2010). The major distinction between eukaryotic genomes and prokaryotic genomes involves genomic constraints. Genomic constraints in prokaryotes and simple eukaryotes are high, such that the genomes of these simple organisms are gene dense, whereas the genomic constraints of multicellular eukaryotes are low, such that their genomes are packed with repeated sequences, making them gene sparse (Koonin and Wolf 2010). Thus, the C-value is generally proportional to the organism’s developmental complexity. However, this general tendency is often violated: there is frequently no apparent correlation between organismal complexity and genome size, and the C-value can differ greatly among closely related species (Cavalier-Smith 1978; Gall 1981). This discrepancy is referred to as the “C-value paradox” (Thomas 1971). For instance, the enormous genome of the whisk fern (Tmesipteris obliqua; 1C = 147.3 Gbp) (Hidalgo et al. 2017a) is approximately 46 times larger than the human genome (1C = 3.2 Gbp) (Pennisi 2001). An example of genome size differences between closely related species is found in the genus Eleocharis in the Poales of angiosperms. Eleocharis, a sedge genus, contains approximately 250 species, among which the genome of E. acicularis (2n = 20, 1C = 0.25 pg) is more than 20 times smaller than that of E. palustris (2n = 16, 1C = 5.5 pg) (Zedek et al. 2010).
With accurate gene number estimates from whole-genome sequencing of various eukaryotic organisms, Hahn and Wray (2002) coined the term “G-value” to designate the number of genes in a haploid genome and the term “I-value” for the amount of information encoded in a genome, which includes the number of genes and the complexity added by gene expression and gene interactions. They also coined the term “G-value paradox” to describe the evident disconnect between the number of protein-coding genes and organismal complexity.
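The unit conversion and the fold differences quoted in the opening paragraph can be checked with a few lines of arithmetic. The sketch below is only illustrative: the function names are hypothetical, and the conversion factor is the approximate value given in the text (1 pg ≅ 980 Mbp).

```python
# Approximate conversion factor quoted in the text (1 pg ≅ 980 Mbp).
MBP_PER_PG = 980.0


def pg_to_mbp(pg: float) -> float:
    """Convert a C-value in picograms to megabase pairs."""
    return pg * MBP_PER_PG


def fold_difference(larger: float, smaller: float) -> float:
    """Return how many times larger one genome is than another (same units)."""
    if smaller <= 0:
        raise ValueError("genome size must be positive")
    return larger / smaller


if __name__ == "__main__":
    # C-values taken from the text; both arguments share the same unit.
    print("Paris japonica in Mbp:", pg_to_mbp(151.9))
    print("P. japonica vs E. intestinalis:", fold_difference(151.9, 0.0023))
    print("T. obliqua vs human:", fold_difference(147.3, 3.2))
    print("E. palustris vs E. acicularis:", fold_difference(5.5, 0.25))
```

Because the fold difference is a ratio, it does not depend on the unit as long as both values are expressed in the same one (pg or Gbp).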

Next-generation sequencing (NGS) and efficient bioinformatics tools have generated entire genomic sequences with highly accurate genetic information for many species (Park and Kim 2016; Straiton et al. 2019). One of the striking findings from NGS projects is that eukaryotic genomes are heavily loaded with so-called “junk” sequences, which partly resolved the C-value paradox. However, unlocking the biological functions of the junk sequences remains a challenge for understanding the evolutionary significance of genome evolution (Adelman and Egan 2017; Bernardi 2019). In recent decades, genome size information has been acquired for more than 15,000 eukaryotic species, including plants (Plant C-value Database, www.cvalues.science.kew.org) (Pellicer and Leitch 2020), animals (Animal Genome Size Database, www.genomesize.com), and fungi (Fungal Genome Database, www.zbi.ee/fungal-genomesize). Although our understanding of genome architecture has dramatically increased because of both whole-genome sequence databases (https://www.ncbi.nlm.nih.gov/genome/browse#!/eukaryotes/; https://genome.jgi.doe.gov/portal/) and a wealth of genome size information, the fundamentals of genome evolution are not fully understood. Some species have streamlined genomes, whereas their close relatives have enormous genomes with large amounts of noncoding sequences. With knowledge of genome architecture in various types of organisms, we can posit genomic theories to explain the C-value paradox (Elliott and Gregory 2015). However, are contemporary genomes evolutionarily inevitable outcomes? If so, do their sizes represent the best adaptive features for the extant species over evolutionary history? The current review provides recent updates on C-value genomics and evolutionary perspectives on eukaryotic genome size biology with an emphasis on plant genomes.

Eukaryotic chromosome architecture

The prominent Japanese geneticist Hitoshi Kihara coined the striking aphorism “The history of the earth is recorded in the layers of its crust. The history of all organisms is inscribed in the chromosomes” in the early 20th century (Crow 1994). His foresight, formed without knowledge of the molecular details of chromosomes, holds true even today in the genomic era. Eukaryotic chromosomes are now finely dissected at various molecular levels to enhance our understanding of the evolutionary history of organisms. Chromosomes are dynamic architectural structures that ensure the faithful transmission of genetic information to daughter nuclei and regulate gene expression for cellular function (Bickmore 2001). To maintain genetic integrity generation after generation of cell division, chromosomes must have three basic elements: centromeres, telomeres, and replication origins.

Chromosomes consist of DNA and proteins that are collectively called chromatin. Genes are not evenly distributed along the chromosomes; they are present in the loosely condensed euchromatic regions between highly condensed heterochromatin blocks (Schmidt and Heslop-Harrison 1998; King 2002). Along with euchromatin and heterochromatin, chromosomes have other chromosomal landmarks, including centromeres, telomeres, and nucleolar organizing regions (NORs) (Fig. 1). Each chromosome is distinct in shape, which is determined by the location of the centromere and the distribution of euchromatin and heterochromatin. Moreover, heterochromatin is composed of a mixture of repeated DNA elements, such as minisatellites, simple sequence repeats (SSRs), and transposable elements (TEs) (Heslop-Harrison 2000). While highly repeated satellite DNA sequences and Ty3/gypsy long terminal repeat (LTR) retrotransposons are packed in the centromeric regions, class 2 DNA transposons, Ty1/copia LTR retrotransposons, and SSRs are dispersed and often present in clusters (Schmidt and Heslop-Harrison 1998; Heslop-Harrison and Schmidt 2001). NORs are chromosomal sites that appear as secondary constrictions in cytological preparations and are the sites where 18S, 5.8S, and 25S rRNA genes reside in tandem arrays of thousands of copies (Heslop-Harrison 2000). Another type of ribosomal RNA repeat is the 5S rRNA gene repeat, which is located either separately from or close to the NORs (Nguyen et al. 2016). The 5S rRNA genes are also repeated in tandem arrays of hundreds or thousands of copies (Cloix et al. 2002). Eukaryotic chromosomes are capped at both ends with many thousands of simple TTAGGG telomeric repeats whose main function is protecting chromosome integrity during cell division (McKnight and Shippen 2004). Other types of intercalary tandem or dispersed repeats are also scattered throughout the chromosomes.

Fig. 1

Illustration of a mitotic metaphase chromosome. The chromosomes have reached their maximally condensed state and consist of two genetically identical sister chromatids. The centromere is the site of kinetochore formation, where microtubules attach to pull the sister chromatids to opposite poles. The centromeric region is highly saturated with Ty3/gypsy retroelements and other satellite DNA. Both ends of each chromosome are capped with thousands of copies of TTAGGG simple repeats whose function is protecting chromosome integrity during cell division. The NOR is the site where hundreds to thousands of copies of 45S rDNA reside in a tandem array. Hundreds to thousands of 5S rDNA repeats can also be located elsewhere (not shown in the illustration). Heterochromatin contains various mixtures of repeat DNAs. Both class 1 and class 2 TEs are distributed along the chromosome. Euchromatic DNA is distributed between heterochromatin blocks along the chromosomes, like islands in heterochromatic oceans.

If chromosomal DNA is stretched, the human genome (1C ≅ 3 Gb) is approximately 1.5 m, and the largest eukaryotic genome (that of P. japonica; 1C ≅ 148.8 Gb) is as long as 100 m; however, the eukaryotic nucleus is approximately 10 µm in diameter (Huber and Gerace 2007). Thus, packaging long DNA molecules into the small nucleus is highly challenging for eukaryotes; this is the primary function of chromatin. The chromosome structure is uneven in chromatin packaging such that the gene-rich euchromatic regions are relatively loosely packaged, but the heterochromatic gene-sparse regions are tightly packaged. The chromatin structure must also be able to be unpackaged during replication and gene transcription and then packaged again during cell division to be passed to daughter cells; thus, the dynamic regulation of chromatin structure is vital for successful survival throughout evolution of the species. The process of packaging and unpackaging chromosomal DNA is finely regulated by epigenetic mechanisms, which is beyond the scope of this review.
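The packaging problem described above can be made concrete with a rough estimate of DNA contour length. The sketch below assumes the canonical B-form rise of about 0.34 nm per base pair, a value that is not stated in the text, and hypothetical function names. The result depends on whether the haploid (1C) or diploid (2C) content is counted, so it is meant to reproduce the order of magnitude rather than the exact lengths quoted above.

```python
# Assumed axial rise per base pair for B-form DNA (not given in the text).
NM_PER_BP = 0.34


def contour_length_m(base_pairs: float) -> float:
    """Stretched-out length of a DNA molecule in metres."""
    return base_pairs * NM_PER_BP * 1e-9


def compaction_ratio(base_pairs: float, nucleus_diameter_um: float = 10.0) -> float:
    """Ratio of DNA contour length to the nuclear diameter (default 10 µm)."""
    return contour_length_m(base_pairs) / (nucleus_diameter_um * 1e-6)


if __name__ == "__main__":
    # 1C genome sizes from the text, in base pairs.
    for name, bp in [("human", 3e9), ("Paris japonica", 148.8e9)]:
        print(name, "1C length (m):", contour_length_m(bp))
        print(name, "2C length (m):", contour_length_m(2 * bp))
        print(name, "1C length / nuclear diameter:", compaction_ratio(bp))
```

Under these assumptions, the DNA is on the order of 10^5 to 10^6 times longer than the nucleus is wide, which is the scale of compaction that chromatin has to achieve.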

C-value and chromosome numbers

The haploid chromosome number is designated as n, which is a genetic characteristic of eukaryotes, and ranges from n = 1 in the jack jumper ant (Myrmecia pilosula) (Crosland and Crozier 1986) to n = 720 in the monilophyte fern Ophioglossum reticulatum (Khandelwal 2008). Chromosomes of polyploids will be addressed more thoroughly in relation to the C-value in the next section. Reports on the relationship between genome size and chromosome number are inconsistent. There was no clear relation between n number and C-value in the analysis of 343 taxa of Balkan flora by Siljak-Yakovlev and Pustahija (2010). Pellicer et al. (2014) analyzed genome size and chromosome evolution in the Melanthiaceae family of monocots; the haploid chromosome number ranged from 5 to 27, but the C-value was highly variable, differing by as much as 230-fold among the species. For instance, the genera Paris and Trillium have n = 5, whereas the n numbers are higher in the genera Helonias (n = 17), Stenanthium (n = 10), and Xerophyllum (n = 15). The 1C values of the species in the genera Paris and Trillium ranged from 31.21 to 56.59 pg and from 27.51 to 54.56 pg (excluding the tetraploid), respectively, whereas the 1C values of the species in the last three genera were approximately 3 pg, indicating that chromosome number and genome size are negatively associated among the species in the family Melanthiaceae. Nishikawa et al. (1984) also reported a strong negative correlation between chromosome number and genome size among species in the genus Carex in the Cyperaceae family and argued that species with many small chromosomes are derived from species with a small number of large holocentric chromosomes by chromosome fragmentation followed by DNA loss. Combining phylogenetic analysis with the chromosome numbers and C-values of Carex species, Chung et al. (2012) reported that the correlation was nearly zero, weakly positive, or weakly negative at deeper phylogenetic scales.
Thus, the authors postulated that the highly labile chromosome numbers in the genus Carex might indirectly reflect reduced selection pressure on chromosome number. However, the genus Eleocharis in the same Cyperaceae family showed a strong positive correlation between chromosome number and genome size in another study (Zedek et al. 2010). The chromosomes of both Carex and Eleocharis are holocentric and break easily during cell division. The authors posited that the genome size variations in the Eleocharis species were derived from the occurrence of polyploidy and aneuploidy/symploidy together with the amplification of LTR retrotransposons. The correlation between chromosome number and genome size was weakly negative (r = −0.0187) in their study. However, a weak but significantly positive correlation between the C-value and chromosome number was reported in a study involving more than 500 eukaryotic species (r = 0.1456, p = 0.0076) (Elliott and Gregory 2015). We retrieved the chromosome numbers and 1C values of diploids from the angiosperm genome size database (Bennett and Leitch 2011) and then analyzed the correlation between chromosome number and genome size. Of the 868 diploid species analyzed, the haploid chromosome number (n) ranged from 3 in Hypochaeris oligocephala to 43 in Ceiba pentandra, and the C-value ranged from 0.2 pg (Arabidopsis thaliana and 16 other species) to 77.4 pg in Fritillaria koidzumiana. In our analysis, chromosome number and C-value showed only a very weak negative relationship (r = −0.019).
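The correlation coefficients quoted in this section are Pearson's r values. As a minimal sketch of how such a coefficient is computed, the following example uses a small illustrative subset built from the Melanthiaceae figures mentioned above (the lower bounds for Paris and Trillium and "approximately 3 pg" for the other three genera). The function name is hypothetical, and the five data points are not the dataset that any of the cited studies analyzed.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Illustrative values only: haploid chromosome number (n) and 1C (pg)
    # for Paris, Trillium, Helonias, Stenanthium, Xerophyllum.
    chromosome_n = [5, 5, 17, 10, 15]
    c_value_pg = [31.21, 27.51, 3.0, 3.0, 3.0]
    print("r =", pearson_r(chromosome_n, c_value_pg))
```

With so few points, the sign of r is informative but its magnitude is not; the phylogenetically corrected analyses cited above use independent contrasts rather than raw species values.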

C-value and polyploidization

Whole-genome duplication (WGD; polyploidy) is an important driver of genome evolution. Polyploidization can cause doubling of the genome in autopolyploidization and the addition of two parental genomes in allopolyploidization. WGD is followed by diploidization processes, including gene loss, genome fractionation, genome downsizing, and chromosome rearrangement (Wendel 2000). Polyploidy is rare in animals: only a few polyploid species exist (insects and reptiles) (Otto 2007), although ancestral vertebrates underwent two rounds of WGD (Dehal and Boore 2005). By contrast, polyploids are frequent in the plant kingdom; approximately 70% of extant plants are polyploids (Soltis et al. 2003). Virtually all plants have experienced one or more rounds of WGD, affecting both genome size and gene content (Soltis and Soltis 2016; Clark and Donoghue 2018). While many rounds of WGD have been recorded in angiosperms, WGD in gymnosperms is rare, with only a few available reports, such as WGD in cycads and ginkgo (Roodt et al. 2017). Amborella trichopoda is a basal angiosperm, and its genome has undergone at least two rounds of WGD (AGP 2013). In addition, the small genome of Arabidopsis thaliana has undergone at least two additional rounds of WGD since the divergence of eudicots (Bowers et al. 2003; del Pozo and Ramirez-Parra 2015).

Repeated rounds of WGD shaped the contemporary genomes of angiosperms, leading to inevitable genome size increases throughout evolution (Wendel 2015; Soltis and Soltis 2016). Is genome size, then, proportional to the number of WGDs? Is the genome size of an allopolyploid the sum of those of both parental species? Genome size increases are not directly proportional to polyploidization events; the genome sizes of polyploids are usually smaller than expected, which is termed “genome downsizing” (Leitch and Bennett 2004; Doyle and Coate 2019). Here is a theoretical inference. The number of episodes of WGD was estimated to be as high as 288 in Brassica napus (2n = 4x = 38) and 144 in Gossypium hirsutum (2n = 4x = 26) (Wendel 2015). However, the genome size of ancestral angiosperms was estimated to be very small (1C ≤ 1.4 pg) (Soltis et al. 2003). If so, the genome sizes of the current B. napus and G. hirsutum species should be as large as > 400 pg and > 200 pg, respectively, instead of the actual C-values of 1.5 pg in B. napus and 2.4 pg in G. hirsutum (Bennett and Leitch 2011), implying that there must be some counterbalancing system for genome size. Reduction in repetitive DNA sequences was posited as a main mechanism for genome downsizing (Doyle and Coate 2019). Renny-Byfield et al. (2011) demonstrated the elimination of repetitive DNA sequences in the genome of Nicotiana tabacum (2n = 4x = 48), which is an allotetraploid derived from interspecific hybridization between N. sylvestris and N. tomentosiformis. Large amounts of Ty3-gypsy long terminal repeat (LTR) retroelements and 35S rDNA were eliminated from the N. tabacum genome compared with those of its parental species. Illegitimate or unequal recombination between LTR sequences accounted for the purging of the Ty3-gypsy elements in the synthetic allotetraploid N. tabacum.
Reduction in rDNA sequences was also reported in several other allopolyploid species, including those of Brassica, Festuca, Glycine, and the Triticeae (Wendel 2000). By contrast, genome size increased in species of the sunflower genus Helianthus after allopolyploidization (Ungerer et al. 2006). The genome sizes of the hybrid taxa H. deserticola, H. anomalus, and H. paradoxus were larger than the expected sum of their diploid parents H. annuus and H. petiolaris. The genome size differences of the sunflower hybrids were attributed to the proliferation of Ty3-gypsy LTR retrotransposons triggered by genome shock via interspecific hybridization (Ungerer et al. 2006; Staton et al. 2012). In contrast to the data for allopolyploids, genome downsizing data are limited for autopolyploids (Parisod et al. 2010). Raina et al. (1994) reported a 17% loss of total DNA in synthetic autotetraploids of Phlox drummondii immediately after tetraploidization and a further reduction (up to 25%) in the third generation. However, synthetic autotetraploids of A. thaliana revealed no DNA loss from the expected amount (Ozkan et al. 2006). There are 34 autotetraploid species in the 2221 angiosperm genome size database (Bennett and Leitch 2011). We manually checked the genome sizes of these autotetraploids: six species had exactly double the values of their diploids, 20 showed a reduction in genome size compared with the doubled diploid value, and eight showed genome size increases (data not shown). Transposable elements were again posited to be responsible for the genome changes of the autotetraploids through genome shock; thus, modulation of genome size should be considered part of the response to genome duplication (Doyle and Coate 2019).
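The theoretical inference about genome downsizing can be reproduced with simple arithmetic. The sketch below rests on one loudly stated assumption: the expected sizes quoted in the text (> 400 pg and > 200 pg) are recovered only if the figures 288 and 144 are treated as cumulative genome multiplication factors applied to the ancestral 1.4 pg. The function names are hypothetical.

```python
def expected_size_pg(ancestral_pg: float, multiplication_factor: float) -> float:
    """Genome size expected if no DNA were lost after repeated duplication.

    Assumption: the factor is the cumulative multiplication of the
    ancestral genome, not a count of successive doublings.
    """
    return ancestral_pg * multiplication_factor


def downsizing_fraction(observed_pg: float, expected_pg: float) -> float:
    """Fraction of the expected DNA that is missing from the observed genome."""
    if expected_pg <= 0:
        raise ValueError("expected size must be positive")
    return 1.0 - observed_pg / expected_pg


if __name__ == "__main__":
    ANCESTRAL_PG = 1.4  # upper estimate for ancestral angiosperms (from the text)
    species = {
        # name: (multiplication factor, observed 1C in pg), values from the text
        "Brassica napus": (288, 1.5),
        "Gossypium hirsutum": (144, 2.4),
    }
    for name, (factor, observed) in species.items():
        expected = expected_size_pg(ANCESTRAL_PG, factor)
        print(name, "expected pg:", expected,
              "fraction lost:", downsizing_fraction(observed, expected))
```

Under this reading, both species retain only about 1% of the DNA that uncompensated duplication would have produced, which is the gap the counterbalancing mechanisms must explain.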

C-value and introns

Eukaryotic genes are interrupted by introns that have to be removed from the primary transcript to form mature messenger RNA. Why introns, which do not encode proteins, persist remains a debated issue. One speculation is that introns might have driven eukaryotic evolution by enhancing coding capacity through alternative splicing (Kim et al. 2007; Nilsen and Graveley 2010). However, transcribing and splicing introns incurs energetic and time costs. For instance, the human dystrophin gene spans 2300 kb but encodes only a 14 kb mRNA; the rest of the gene (99.4%) consists of 78 introns. Transcription of this gene takes up to 16 h, and splicing such a large number of introns also requires large amounts of cellular energy (Tennyson et al. 1995). Because introns are evolutionarily less constrained than coding sequences, they usually evolve faster than coding sequences (Chamary and Hurst 2004).
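The cost of introns in the dystrophin example follows directly from the three figures given above. The sketch below only restates that arithmetic; the function names are hypothetical, and the implied average transcription rate is a derived estimate, not a value reported in the cited study.

```python
def intronic_fraction(gene_kb: float, mrna_kb: float) -> float:
    """Fraction of a gene that is removed as introns from the primary transcript."""
    return 1.0 - mrna_kb / gene_kb


def mean_transcription_rate_kb_per_min(gene_kb: float, hours: float) -> float:
    """Average elongation rate implied by gene length and transcription time."""
    return gene_kb / (hours * 60.0)


if __name__ == "__main__":
    GENE_KB, MRNA_KB, HOURS = 2300, 14, 16  # dystrophin figures from the text
    print("intronic fraction (%):", 100 * intronic_fraction(GENE_KB, MRNA_KB))
    print("implied rate (kb/min):",
          mean_transcription_rate_kb_per_min(GENE_KB, HOURS))
```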

Positive correlations between intronic features (average size, total number and nucleotide content) and genome size have been reported in several studies over large evolutionary scales (Vinogradov 1999; Lynch and Conery 2003; Suetsuga et al. 2013; Elliott and Gregory 2015). Lynch (2007) reported that the amount of DNA in introns was nearly equal to the amount of DNA in exons in small genomes (< 200 Mbp), whereas intronic DNA occupied approximately 95% of the total length of protein-coding genes in mammals with large genomes (> 2,500 Mbp). In an analysis of 12 animal species with variable genome sizes, ranging from C. elegans (1C = 100.3 Mbp) to Bombyx mori (1C = 431.7 Mbp) to humans (1C = 3101.8 Mbp), Suetsuga et al. (2013) reported a strong positive correlation between intron size and genome size, with a correlation coefficient (CC) of r = 0.942; this value was higher than that (0.558) between genome size and TE content, whereas the CC between average exon size and genome size was negative (−0.487). Elliott and Gregory (2015) analyzed associations of genome size with gene content, chromosome number, TE content, and intron features in 502 species, comprising 148 animals, 81 land plants, 202 fungi, and 70 protists. In their study, intron content and genome size were positively correlated across eukaryotes in terms of average intron size (r = 0.6065, p < 0.0001, n = 245 contrasts), the number of introns per genome (r = 0.4535, p < 10⁻⁷, n = 121 contrasts) and the total amount of intronic DNA present in the genome (r = 0.6079, p < 10⁻¹¹, n = 115 contrasts). Francis and Wörheide (2017) presented interesting results on the introns of 68 species across 12 animal phyla, in which both intronic and intergenic fractions displayed a linear correlation with total genome size, and the ratio of introns to intergenic regions approached 1:1 (r² = 0.8286, p = 5.6 × 10⁻²⁷).
These studies support the idea that intronic features scale linearly with genome size: larger genomes have more and longer introns, and smaller genomes have fewer and shorter ones. However, several studies have reached the opposite conclusion. Proponents of the opposing view argue that the concerted evolution between genome size and introns is weak or absent throughout the Eukarya because recurrent intron loss and gain occurred at the lineage-specific level (Wang et al. 2014), a large portion of eukaryotic genomes lacks an organism-level function (Wang et al. 2014), or intron densities are variable across a wide range of eukaryotic lineages (Farlow et al. 2011). Recently, Lozada-Chávez et al. (2020) carried out correlative association studies of various genome-wide features of introns (size, density, genome content, repeats), genome size, and multicellular complexity in 461 eukaryotes. In their study, the intronic features were only weakly correlated with each other and with genome size at a broad phylogenetic scale. The strength of the associations was variable at the lineage-specific level, and the variations in intron length and abundance within the genome were largely independent throughout the Eukarya. Their findings are reasonable given the presumption that various kinds of repeat sequences, including TEs, are major drivers of genome size and that TEs can rapidly increase in number in specific lineages (Oliver et al. 2013; Pellicer et al. 2014; Blommaert et al. 2019).

C-value and repeatomes

Eukaryotic genomes teem with a menagerie of repeated sequences, which are collectively called repeatomes (Maumus and Quesneville 2014). Along with polyploidization, the expansion of repeatomes is the main contributor to the large genome size variation among eukaryotes. There are two main types of repeats: tandem repeats and dispersed repeats. The tandem repeats include centromeric repeats, telomeric repeats, and ribosomal RNA genes (rDNA), while dispersed repeats include simple sequence repeats (SSRs), minisatellites, and various kinds of transposable elements (TEs) (Heslop-Harrison and Schmidt 2001). Except for rDNAs, these repeated sequences are usually selectively neutral in terms of their accumulation in the genome, without major effects on the phenotype of the host; they were once considered “junk” DNA or “selfish” DNA (Doolittle and Sapienza 1980; Orgel and Crick 1980). However, armed with a plethora of genomic information, researchers have revisited the biological roles of this selfish DNA; the unwelcome moniker has been changed to “genomic treasure” because this DNA has played major roles in shaping current genomes and biodiversity (Volff 2006; Maumus and Quesneville 2014, 2016).

Of the various repeat sequences, TEs are major players in genome size variation because they constitute a variable and often large proportion of eukaryotic genomes, especially plant genomes (Kumar and Bennetzen 1999, 2000). TEs are classified into two classes on the basis of their transposition mechanisms: class 1 and class 2 (Finnegan 1989). Class 1 TEs are retrotransposons that retrotranspose semiconservatively via mRNA intermediates in a “copy-and-paste” manner, whereas class 2 TEs are DNA transposons that transpose conservatively in a “cut-and-paste” manner. The content of TEs is generally proportional to genome size: TEs make up a small proportion of small genomes and a large proportion of large genomes (Civán et al. 2011). For instance, TEs constitute approximately 3% of the minute genome of the carnivorous bladderwort Utricularia gibba (1C = 0.079 pg) (Ibarra-Laclette et al. 2013), whereas TEs constitute > 85% of the maize genome (1C = 2.55 pg) (Schnable et al. 2009). However, TE content showed no congruence with phylogenetic context in the fully sequenced genomes of 24 crop species, and the success of different types of TEs differs among species (Vitte et al. 2014). For instance, Ty3-gypsy LTR retrotransposons are predominant in the genomes of maize and grapevine, whereas non-LTR retrotransposons are the prevalent TEs in the genomes of sorghum, barley, potato and tomato.

The conservative cut-and-paste transposition mode does not usually allow high copy numbers of class 2 DNA TEs; thus, they are present in moderate numbers, and their impact on genome size is small compared with that of class 1 retrotransposons (Lee and Kim 2014). Class 1 retrotransposons can proliferate to very large copy numbers and cause genome bloating because the original copies are left behind in the copy-and-paste retrotransposition process (Bennetzen and Kellogg 1997). In 50 fully sequenced plant genomes, the increase in genome size is correlated with the abundance of repeatomes (r² = 0.584) and, more specifically, of LTR retrotransposons (r² = 0.68) (Michael 2014). Thus, it is generally accepted that there is a positive linear relationship between genome size and TE content in eukaryotes (plants), in which class 1 LTR retrotransposons are the major contributors to C-value differences (Kim 2017). For instance, the 17,000 Mbp wheat genome comprises 63.7% class 1 TEs and 14.9% class 2 TEs, and the 2,300 Mbp maize genome comprises 75.6% class 1 TEs and 8.6% class 2 TEs. Similarly, TEs constitute 68% of the large genome of Secale cereale (1C = 8.093 pg), in which class 1 and class 2 TEs constitute 64.3% and 5%, respectively (Oliver et al. 2013). The small genome of Arabidopsis thaliana (125 Mbp) comprises 7.5% class 1 TEs and 11% class 2 TEs. TE expansion has caused genome size variations in animals as well. For instance, rotifers of the Brachionus plicatilis species complex exhibit severalfold differences in genome size due to genome doubling and transposon expansion (Blommaert et al. 2019). Kapusta et al. (2017) also reported that many mammal and bird lineages have experienced different rates of TE accumulation, resulting in substantial variation in genome size between species.
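To compare the TE load of these genomes in absolute terms, the percentages above can be converted into megabase pairs. The following sketch uses only the genome sizes and class 1 and class 2 percentages quoted in this paragraph for wheat, maize, and Arabidopsis thaliana; the function and variable names are hypothetical.

```python
def te_content_mbp(genome_mbp: float, class1_pct: float, class2_pct: float):
    """Return (class 1 Mbp, class 2 Mbp, total TE Mbp) for one genome."""
    class1 = genome_mbp * class1_pct / 100.0
    class2 = genome_mbp * class2_pct / 100.0
    return class1, class2, class1 + class2


# Genome size (Mbp), class 1 TE %, class 2 TE %; values quoted in the text.
GENOMES = {
    "wheat": (17000, 63.7, 14.9),
    "maize": (2300, 75.6, 8.6),
    "Arabidopsis thaliana": (125, 7.5, 11.0),
}

if __name__ == "__main__":
    for name, (size, c1, c2) in GENOMES.items():
        class1, class2, total = te_content_mbp(size, c1, c2)
        # Ratio of class 1 to class 2 DNA shows which class dominates.
        print(f"{name}: class 1 = {class1:.1f} Mbp, class 2 = {class2:.1f} Mbp, "
              f"total TE = {total:.1f} Mbp, class1/class2 = {class1 / class2:.2f}")
```

Expressed this way, the class 1 DNA of wheat alone is several times larger than the entire maize genome, whereas in the small Arabidopsis genome class 2 elements slightly outweigh class 1 elements.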

If copy-and-paste retrotransposition allows class 1 retrotransposons to accumulate in the genome, is genome growth a one-way process (Bennetzen and Kellogg 1997)? The answer is ‘no’ because maintaining a large genome may be a burden to cell physiology, which will be discussed further in a later section. Analysis of the C-values of more than 6000 plant species (6287 angiosperms, 204 gymnosperms) revealed that plant genomes are skewed toward small sizes (Civán et al. 2011). The C-values of 95% of the angiosperms are less than 22 Gbp, with a mean of 5.809 Gbp and a median of 2.401 Gbp, whereas those of 95% of gymnosperms are in the range of 7–33 Gbp, with a mean of 18.157 Gbp and a median of 17.506 Gbp. Hidalgo et al. (2017b) showed a violin plot of genome size distributions in flowering plants, ferns and vertebrates, including mammals, in which the genome sizes of all these groups were streamlined except for a few extraordinarily large genomes. Eukaryotic cells are equipped with mechanisms to counterbalance increasing genome size, such as illegitimate recombination (Devos et al. 2002; Hawkins et al. 2009) and nonhomologous end joining (NHEJ) after double-strand breaks (DSBs) (Chen et al. 2013; Fawcett et al. 2012; Lynch 2007). Unequal crossing over between repeat sequences leads to sequence deletion. Illegitimate intrastrand homologous recombination between direct repeat LTR sequences results in the deletion of the sequences between the LTRs, leaving solo LTRs. Devos et al. (2002) demonstrated that illegitimate intrastrand recombination was fivefold more frequent than unequal crossing over, which led to the small genome of A. thaliana. Gossypium (cotton) species carry George3, a gypsy-like LTR retrotransposon, with variably high copy numbers among species (Hawkins et al. 2006).
The copy number of George3 increased specifically in the lineages of the A- and K-genome diploids, whose genomes are approximately 3 times larger than those of the D-genome diploids. The D-genome diploids, in turn, carry many more solo-LTR George3 copies than the A- and K-genome diploids do, implying that intrastrand recombination purged George3 copies in the D-genome species (Hawkins et al. 2009). NHEJ after DSBs can also lead to the purging of LTR retrotransposons. For instance, the Oryza brachyantha genome is approximately 60% smaller than that of its close relative O. sativa, and the amplification and deletion of recent LTR retrotransposons account for the difference. Comparison of protein-coding genes between the two species revealed that only 70% of the O. brachyantha genes were collinear with those of O. sativa. In this respect, low LTR retrotransposon activity and massive internal deletions of LTR retrotransposons by NHEJ after DSBs were proposed to have caused the genome reduction in O. brachyantha (Chen et al. 2013). Removal of repeatomes, in combination with epigenetic regulation of TE activities, may act as a safeguard against uncontrolled genome expansion (Slotkin and Martienssen 2007).

Why did evolution lead some species to have excessively large genomes? If closely related species with small and large genomes have similar DNA deletion systems, then the old repeats must have been purged from both large and small genomes equally, and the species with large genomes must have undergone recent amplification of a few LTR retrotransposons (e.g., George3 in Gossypium species) (Hawkins et al. 2009). Genome size and phylogenetic analyses have revealed that the lack of an efficient DNA removal system resulted in the extreme expansion of the large genome of Fritillaria (Liliaceae) (Kelly et al. 2015). Studies on species with extremely large genomes, such as lungfish (Metcalfe et al. 2012), black salamander (Sun et al. 2012), and loblolly pine (Wegrzyn et al. 2014), have also revealed the presence of highly heterogeneous repeated DNA sequences.

C- and G-value paradox

The C-value paradox can be partly resolved by the bloating of noncoding DNA and the polyploidy of some eukaryotic genomes, as mentioned above. It is hard to deny the general perception that gene number is roughly correlated with organismal complexity; however, a strictly linear correlation is equally hard to accept because gene number is lower in some developmentally complex organisms (e.g., mammals) than in simpler organisms (e.g., ciliates among protists, many plant species, and zebrafish among vertebrates). Table 1 shows the C- and G-values of 35 fully sequenced species from prokaryotes to eukaryotes. Both C- and G-values generally increase with organismal complexity: prokaryotes < single-celled eukaryotes < multicellular eukaryotes. For instance, both the C- and G-values of the single-celled yeasts (S. cerevisiae and S. pombe) and the microsporidian fungus (E. cuniculi) are smaller than those of multicellular fungi (N. crassa and U. maydis). The genome sizes and gene numbers of protists are smaller than those of plants and animals. The C- and G-values of the moss P. patens are larger than those of the plant A. thaliana. In terms of G-values, the simple ciliate Tetrahymena has more genes than developmentally complex organisms do (e.g., Amborella, fruit fly, medaka fish, zebrafish, and silkworm). The human G-value is dwarfed by those of zebrafish, loblolly pine, wheat, soybean, and even ciliates. Thus, in parallel with the term ‘C-value’ paradox, the ‘G-value’ paradox was coined to account for the disconnection between the number of genes and organismal complexity (Hahn and Wray 2002). Indeed, eukaryotic genome size variation is approximately 66,000-fold, whereas the transcriptome difference is approximately 17-fold (Cavalier-Smith 2005).

Table 1 C- and G-values of sorted organisms from prokaryotes to eukaryotes that have been fully sequenced

Organismal complexity is a somewhat elusive concept. Does complexity mean the number of proteins produced, or the number of cell types or organs? The proteome refers to the entire set of proteins that are expressed by a genome, cell, tissue, or organism under certain conditions (Altelaar et al. 2013). The one-gene one-protein concept is obsolete in modern genetics. The number of genes underestimates the proteome and developmental complexity because alternative splicing can produce several mature mRNAs from a single gene. Approximately 95% of human multiexonic genes are alternatively spliced, and the specific mRNA produced by the alternative splicing of a gene is developmental-stage or cell-type specific (Pan et al. 2008; David and Manley 2008). Many proteins have several cellular functions, and these ‘Swiss army knife’-style proteins can also account for the smaller-than-expected G-values of multicellular species (Hahn and Wray 2002). The expression of eukaryotic genes is finely regulated by sophisticated machinery (Krebs et al. 2018), and the development of multicellular organisms is regulated by a specific set of homeotic genes (Popodi et al. 1996). Additionally, noncoding RNAs regulate gene expression at both the transcriptional and posttranscriptional levels (Hirota et al. 2008; Palazzo and Lee 2015). Multigene families have also expanded differently in evolutionarily close species. For instance, olfactory receptor genes are present as 339 copies in humans (Malnic et al. 2004) but as 1,296 copies in mice (Zhang and Firestein 2002). Thus, gene number may be related to organismal complexity in general, but we have to accept many exceptions to this general rule. To account for the paradox in the correlation between C- and G-values and organismal complexity, the I-value was posited as a measure of the total information contained in a genome (Hahn and Wray 2002).

C-value and cell economy

A dramatic increase in noncoding or repeated sequences would be a burden to the host, not only in terms of cellular physiology but also in terms of packaging DNA within a limited nuclear space; thus, eukaryotic genomes have been streamlined as much as possible (Cavalier-Smith 2005; Hildago et al. 2017b). Maintaining and replicating bulk noncoding DNA, whose function is mostly unknown, carries a metabolic expense that might reduce the fitness of the host. Nuclear volume doubles with genome doubling, but the surface area of the nuclear envelope increases only 1.6-fold (Melaragno et al. 1993), which can cause an imbalance in the cellular factors mediating the interactions between chromosomes and nuclear components (Comai 2009). In a study of 100 plant species, there was a strong nucleotypic effect on the cell cycle regardless of ploidy level, with the C-value positively related to cell cycle time (Francis et al. 2008). Knight et al. (2005) proposed that species with relatively small genomes have higher growth rates than species with large genomes because small genomes facilitate fast cell divisions, and they posited the ‘large genome constraint’ theory to explain the physiological and metabolic costs associated with maintaining large genomes with excessive amounts of repeat DNA. The ‘large genome constraint’ theory explains the disadvantages of large genomes in terms of evolution, ecology, and physiology: lineages with large genomes have diversified more slowly, are underrepresented in extreme environments, and present reduced maximum photosynthetic rates; consequently, species with large genomes were trimmed from evolutionary trees and restricted in ecological distribution.
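The 1.6-fold figure is close to what simple geometry predicts. As a sketch only, assume an idealized spherical nucleus of radius r that keeps its shape as it grows (an assumption made here for illustration, not a claim of the cited studies). Surface area then scales as the two-thirds power of volume:

```latex
V = \tfrac{4}{3}\pi r^{3}, \qquad S = 4\pi r^{2}
\quad\Longrightarrow\quad S \propto V^{2/3}

V \to 2V \;\Longrightarrow\; \frac{S'}{S} = 2^{2/3} \approx 1.59,
\qquad
\frac{S'/V'}{S/V} = 2^{-1/3} \approx 0.79
```

Under this assumption, each genome doubling leaves the nucleus with roughly 21% less envelope area per unit volume, which is one way to picture the imbalance between chromosomes and nuclear components.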

Reducing the genome size is a reasonable inference from the perspective of “large genome constraint”, and the distribution of genome size is indeed skewed toward small sizes in all domains of eukaryotes (Oliver et al. 2007; Pellicer and Leitch 2020). In a graphical distribution of 6287 plant species, C-values are less than 2 pg except in a few species in the tail region of large genomes (Civán et al. 2011). A strong correlation between cell size and genome size was observed in early studies in the 1950s and 1960s (Mirsky and Ris 1951; Vialli 1957; Baetke et al. 1967). Then, have genomes become larger or smaller? Cavalier-Smith (1978) proposed that nuclear volume and genome size must be adjusted according to cell volume to allow reasonable growth rates, because DNA has two major functions in addition to encoding proteins, namely, controlling cell volume through the number of replication origins and determining nuclear volume through the overall bulk of DNA. Nucleotides are charged solutes, and a large genome decreases the osmotic potential of plant cells, which draws more water into the cell and results in larger cells that require more cellular and metabolic resources (Knight and Beaulieu 2008). Because nuclear DNA is encapsulated within the nuclear architecture, which is dynamically dissolved and reformed during the cell cycle, the amount of nuclear DNA is positively correlated with the volume of the nuclear architecture. Cryptomonad and chlorarachniophyte algae have two kinds of nuclei, a normal large nucleus and a minute secondary nucleus (the nucleomorph) (Archibold and Lane 2009). The normal large nucleus shows a typical positive correlation between genome size and cell volume, but the small nucleomorph does not display an obvious correlation between them (Cavalier-Smith 2005). While the main nucleus allowed the expansion of repeat DNA, the minute nucleomorphs strongly decreased their genome size, even by reducing gene sizes; thus, Cavalier-Smith (2005) argued that this nuclear dimorphism strongly supports the skeletal DNA/karyoplasmic ratio interpretation of genome size evolution, in which economy, speed, and size are the evolutionary forces driving nuclear genome miniaturization and expansion. Furthermore, he refuted the earlier idea of a correlation between cell cycle length and nuclear DNA content, which had been inferred from small cells and rapid growth rates (Commoner 1964; Bennett 1972), because the relation between genome size and cell cycle length was much weaker than the relation between cell and nuclear volume (Cavalier-Smith 1978, 2005). However, this is disputable because many contrasting reports have been put forward with respect to genome size and the growth rates of plants (Suda et al. 2015; Roddy et al. 2020). It therefore remains to be resolved why phylogenetically closely related species display many-fold differences in genome size.

Cell economy has slowed genome expansion so that most eukaryotes possess small genomes. However, some genomes have nonetheless expanded to extraordinarily large sizes; this has occurred for P. japonica (1C = 148.8 Gbp) among angiosperms (Pellicer et al. 2010), Tmesipteris obliqua (1C = 147.3 Gbp) among whisk ferns (Hildago et al. 2017a), and Protopterus aethiopicus (lungfish, 130.0 Gbp) (Metcalf et al. 2012) and Necturus lewisi (salamander, 118.0 Gbp) (Sun et al. 2012) among vertebrates. If so, what is the biological upper limit of genome size? Hildago et al. (2017b) suggested that ~ 150 Gbp might be the biological upper limit for genome size. In support of this limit, they suggested several basic constraining factors, including biochemical and energetic costs, the maintenance of genome integrity, geometric constraints from a decreasing surface area-to-volume ratio of the cell as the genome size increases, timing constraints from longer mitosis and meiosis, and evolutionary constraints.
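Because this review reports genome sizes in both picograms and base pairs, it can help to move between the two units. The Python sketch below uses the approximate conversion given in the Introduction (1 pg ≅ 980 Mbp); the function names are illustrative, and the factor is the rounded value used here rather than an exact constant.

```python
# Approximate conversion used in this review: 1 pg of DNA ~ 980 Mbp.
MBP_PER_PG = 980.0


def pg_to_gbp(picograms: float) -> float:
    """Convert a C-value in picograms to gigabase pairs."""
    return picograms * MBP_PER_PG / 1000.0


def gbp_to_pg(gigabase_pairs: float) -> float:
    """Convert a genome size in gigabase pairs to picograms."""
    return gigabase_pairs * 1000.0 / MBP_PER_PG


# Paris japonica, 151.9 pg in the Introduction.
paris_gbp = pg_to_gbp(151.9)

# The ~150 Gbp upper limit suggested by Hildago et al. (2017b), expressed in pg.
limit_pg = gbp_to_pg(150.0)

# Fold difference between the whisk fern (147.3 Gbp) and human (3.2 Gbp) genomes.
fern_vs_human = 147.3 / 3.2

print(f"Paris japonica: about {paris_gbp:.1f} Gbp")
print(f"Suggested upper limit: about {limit_pg:.1f} pg")
print(f"Whisk fern vs human: about {fern_vs_human:.0f}-fold")
```

With this rounded factor, the largest reported genomes sit within a few gigabase pairs of the suggested ~150 Gbp ceiling.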

C-value and the phenome

Natural selection acts on phenotypes rather than genotypes. The phenome is a collective term describing the set of all phenotypes expressed by a cell, tissue, organ, organism, or even species (Furbank and Tester 2011; Bush et al. 2016). Does genome size affect phenomes? A good example of the effect of genome size on phenomes is the cell size of autopolyploids, which is discussed in detail in the literature (Tsukaya 2013; Orr-Weaver 2015). The effect of genome size on cell size has been documented in salamanders, which exhibit large genome variation from 1C = ~ 15 pg (e.g., the genus Desmognathus) to ~ 120 pg (e.g., the genus Necturus) (Gregory 2005), and strong positive correlations were observed between C-values and both blood cell and nucleus sizes among salamander species (Mueller et al. 2008). Such cases also occur in fishes, birds, and mammals (Gregory 2001, 2005).

As discussed in the section on the G-value paradox, gene content does not vary greatly among taxa with different levels of biological complexity. The C-value paradox can be explained by the fact that larger genomes are packed with more selectively neutral repeat DNA than small genomes are. Then, is there any relationship between the C-value and the phenome? Plants grow annually, biennially, or perennially, and this growth habit might be another good example of how the C-value can affect the phenome. Table 2 shows the C-values of 2000 species of annual, biennial, and perennial plants collected by Bennett and Leitch (2011). The C-values of perennial plants were distributed mainly from 0.2 to 77.4 pg, but the distribution of biennials narrowed to 0.2–3.5 pg. The C-values of annuals ranged from 0.2 to 20.2 pg, which means that plants with more than 20.2 pg are obligate perennials. By using regression analysis with 110 plant species, Francis et al. (2008) confirmed the general assumption that larger genomes take more time to replicate; there is a strong positive relationship between cell cycle time and the C-value of diploids and polyploids, in both monocots and dicots. The upper C-value limit of the annuals (20.2 pg) may imply that large genomes impose a selective disadvantage on plants that must complete development within a single growing season.

Table 2 Genome sizes (pg) of annual, biennial, and perennial angiosperm species

Previous observations revealed a positive correlation between genome size and seed mass, as well as various metrics of plant growth and leaf morphology (Bennett 1971, 1972, 1987). However, Knight and Beaulieu (2008) reported somewhat different results, in which genome size was a strong predictor of phenotypic traits at the cellular level, but its predictive power decreased for higher-level phenotypes. There was a strong positive correlation between genome size and both guard cell length and epidermal cell area, and a negative correlation with stomatal density. However, the relationship was weak for higher-level traits (e.g., seed mass, leaf mass per unit area, and wood density). Plant height was an interesting case: increasing genome size was associated with decreasing plant height among angiosperms, but the trend was reversed in gymnosperms, in which species with larger genomes were taller. Similarly, a contrasting effect between angiosperms and gymnosperms was found for the relationship between genome size and pollen size (Knight et al. 2010). De Baedemaeker et al. (2018) reported that tetraploid apple trees tolerated drought better than did diploids; the authors speculated that the higher water content in leafy shoots, higher amount of parenchyma cells, and larger vessel area and size resulted in significantly higher hydraulic cavitation of the tetraploid plants. This finding might be important because global climate change is widely accepted among the public as well as within the scientific community, and dry areas are rapidly expanding in many regions as the climate warms. Then, are species with large genomes better able to cope with environmental change? We cannot yet provide a solid answer to this question. Species with small genomes may have traits conferring a growth advantage, such as longer-distance dispersal of small pollen or seeds and shorter generation times, owing to higher rates of cell division and greater efficiency in cell metabolism (Suda et al. 2015; Roddy et al. 2020). Many reports describe species with smaller genomes as more invasive and more successful in new habitats (Bennett et al. 1998; Pandit et al. 2014; Pysek et al. 2018). However, species can experience cellular shock in new environments, which can unlock the epigenetic suppression of TEs and allow class 1 retroelements to propagate. TEs and epigenetic components are important environment-sensitive molecular elements, and coupling these two elements allows fine-tuning of the production of phenotypes and genetic variation, including genome size (Rey et al. 2016). Li et al. (2018) reported differential expansion and contraction of the number of TEs among worldwide collections of A. thaliana, which might have played a role in their adaptive evolution. The genomes of salamanders are the most variable among vertebrates, ranging from 13.89 pg to 120.60 pg with a mean of 35.35 pg per 1C (Lertzman-Lepofsky et al. 2019). Salamander larvae develop in either permanent or ephemeral aquatic habitats, or the species undergoes direct development. While small-genome species are distributed across a gradient of ephemeral habitats, species with larger genomes are almost exclusively associated with permanent aquatic habitats. Moreover, smaller-genome species showed a higher rate of evolutionary transition between permanent and ephemeral larval habitats. Thus, the authors proposed that genome size imposes an evolutionary constraint on the ecological habitat of salamanders, such that species with large genomes are restricted to permanent aquatic habitats because of their slower development.

Concluding remarks

The genome is defined as the whole set of genetic information of a species. Life on Earth is highly diverse, and every life form has its own genome. Like the immense biological diversity of life, eukaryotic genomes are highly variable among species. Genome size (C-value) and gene content (G-value) are generally proportional to organismal complexity, except for a few outliers. The disconnection between gene number and biological complexity may derive from highly complex regulation of gene expression, multifunctional proteins, alternative splicing, multigene families, and developmental regulation by homeotic gene sets. Polyploidy and TE expansion are the two major drivers of genome size expansion, but cell economy has restricted genome size growth through counterbalancing systems such as illegitimate recombination and nonhomologous end joining (NHEJ) after double-strand breaks (DSBs). Thus, genome size evolution follows a simple proportional model in which the distribution is skewed toward smaller genomes without invoking strong selection against large genomes. Nevertheless, a few species with extremely large genomes exist, in which heterogeneous groups of repeat sequences have accumulated to very high copy numbers because these species lack efficient systems to remove repeated sequences from their genomes. Evolution is a stochastic process, and genome size is no exception to the many probabilistic events that occur during selection. Because random genetic drift is a prominent evolutionary force in populations of limited size, substantial deviations are to be expected in specific phylogenetic lineages whose genome size is prone to contraction or expansion; thus, genome size may be a quasi-adaptive trait rather than an optimally adapted one. Genome size affects phenomes at various levels, and genome size varies among species from different niches. In this respect, genome size is an important subject because many species are being driven to new habitats by climate change.