The maximality of circular codes in genes statistically verified
Introduction
A circular code X is a set of words such that any motif from X, called an X motif, makes it possible to retrieve, maintain and synchronize the original (construction) frame. The circular code identified in genes of bacteria, archaea, eukaryotes, plasmids and viruses (Michel, 2017, 2015; Arquès and Michel, 1996) contains the 20 following trinucleotides in reading frame (frame 0)
X = {AAC, AAT, ACC, ATC, ATT, CAG, CTC, CTG, GAA, GAC, GAG, GAT, GCC, GGC, GGT, GTA, GTC, GTT, TAC, TTC}, (1)
the 20 following trinucleotides in frame 1 (reading frame shifted by 1 nucleotide in the 5′-3′ direction, i.e. to the right)
X1 = {AAG, ACA, ACG, ACT, AGC, AGG, ATA, ATG, CCA, CCG, GCG, GTG, TAG, TCA, TCC, TCG, TCT, TGC, TTA, TTG}, (2)
and the 20 following trinucleotides in frame 2 (reading frame shifted by 2 nucleotides in the 5′-3′ direction)
X2 = {AGA, AGT, CAA, CAC, CAT, CCT, CGA, CGC, CGG, CGT, CTA, CTT, GCA, GCT, GGA, TAA, TAT, TGA, TGG, TGT}. (3)
The trinucleotide set X (defined in (1)) coding the reading frame in genes is a maximal (20 trinucleotides) self-complementary trinucleotide circular code (Arquès and Michel, 1996; reviewed in Michel, 2008; Fimmel and Strüngmann, 2018).
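The properties stated above can be checked mechanically. The following is a minimal sketch, not the author's implementation: it assumes the standard listing of the 20 trinucleotides of X, derives the frame 1 and frame 2 sets by circular permutation, and tests circularity with the graph criterion (a trinucleotide code is circular if and only if its associated graph is acyclic, following the graph-theory approach of Fimmel et al., 2016). The function names are hypothetical.

```python
# The 20 trinucleotides of the circular code X (frame 0), as commonly listed.
X = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(word):
    """Complement each nucleotide and reverse the word."""
    return "".join(COMPLEMENT[b] for b in reversed(word))


def shift(word, k):
    """Circular permutation of a word by k letters to the left."""
    return word[k:] + word[:k]


def is_self_complementary(code):
    """True if the code equals the set of its reverse complements."""
    return {reverse_complement(w) for w in code} == set(code)


def is_circular(code):
    """Graph criterion: edges b1 -> b2b3 and b1b2 -> b3 must form no cycle."""
    graph = {}
    for w in code:
        graph.setdefault(w[0], set()).add(w[1:])
        graph.setdefault(w[:2], set()).add(w[2])
    state = {}  # node -> 1 (on the current DFS path) or 2 (fully explored)

    def has_cycle(node):
        state[node] = 1
        for nxt in graph.get(node, ()):
            if state.get(nxt) == 1:
                return True
            if state.get(nxt) is None and has_cycle(nxt):
                return True
        state[node] = 2
        return False

    return not any(state.get(n) is None and has_cycle(n) for n in list(graph))


# Frame 1 and frame 2 sets obtained by permuting each trinucleotide of X.
X1 = {shift(w, 1) for w in X}
X2 = {shift(w, 2) for w in X}

print(is_circular(X), is_self_complementary(X))
print(len(X | X1 | X2))
```

Because the 60 non-periodic trinucleotides fall into 20 classes of 3 circular permutations and a circular code can hold at most one word per class, 20 is the largest possible size, which is what "maximal" refers to here.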
From a mathematical point of view, the identification of circular codes in genes led to about 200 theorems obtained in the different research fields of circular codes: the flower automaton (Arquès and Michel, 1996; and subsequent works), the probability approach (Koch and Lehmann, 1997; Lacan and Michel, 2001), the necklace 5LDCN (Letter Diletter Continued Necklace) (Pirillo, 2003), the necklace LDCCN (Letter Diletter Continued Closed Necklace) with (Michel and Pirillo, 2010; and subsequent works), group theory (Fimmel et al., 2014; and subsequent works) and graph theory (Fimmel et al., 2016; and subsequent works).
From a biological point of view, the circular code motifs (or briefly X motifs), i.e. motifs of the circular code X, make it possible to retrieve, maintain and synchronize the reading frame in genes. The concept, the statistical analyses and the biological studies of circular code motifs were introduced in Michel (2012). It has been shown recently that the X motifs are enriched in the reading frame of extant genes (El Soufi and Michel, 2016; Michel et al., 2017; Dila et al., 2019a), as well as in tRNA sequences (Michel, 2012, 2013; El Soufi and Michel, 2015) and in functional regions of rRNA involved in mRNA translation (Michel, 2012; El Soufi and Michel, 2014, 2015; Dila et al., 2019b). Furthermore, a circular code periodicity 0 modulo 3 was identified in the 16S rRNA, covering the region that corresponds to the primordial proto-ribosome decoding center and containing numerous sites that interact with the tRNA and mRNA during translation (Michel and Thompson, 2020). Based on the mathematical properties of the circular code and the enrichment of X motifs in the main actors involved in translation, it has been suggested that the circular code was an ancestor of the standard genetic code, used both to code amino acids and to identify and maintain the reading frame (Dila et al., 2019b).
In order to verify the maximality of circular codes in genes, we reformulate in this work the presentation of the classical method (Arquès and Michel, 1996; Michel, 2015) and the recent method (Michel, 2017). Precisely, we demonstrate here that the maximality of the 3 circular codes X (with 20 trinucleotides in reading frame, defined in (1)), X1 (with 20 trinucleotides in frame 1, defined in (2)) and X2 (with 20 trinucleotides in frame 2, defined in (3)), which were assigned by inspection in Arquès and Michel (1996), is statistically verified.
Gene kingdoms
Gene kingdoms of bacteria and eukaryotes are obtained from the GenBank database (http://www.ncbi.nlm.nih.gov/genome/browse/, January 2020). Computer tests exclude genes when: (i) their nucleotides do not belong to the alphabet {A, C, G, T}, where A stands for adenine, C stands for cytosine, G stands for guanine and T stands for thymine; (ii) they do not begin with a start trinucleotide; (iii) they do not end with a stop trinucleotide; and (iv) their lengths are not modulo
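The four exclusion tests can be sketched as a single predicate. This is only an illustration under stated assumptions: the start set {ATG}, the stop set {TAA, TAG, TGA} and the length condition "0 modulo 3" are not given in the text above and are chosen here as plausible defaults; `keep_gene` is a hypothetical helper name.

```python
ALPHABET = set("ACGT")

# Assumptions for illustration only: the source does not list these sets here.
START_CODONS = {"ATG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def keep_gene(seq, starts=START_CODONS, stops=STOP_CODONS):
    """Return True if a gene sequence passes the four exclusion tests."""
    seq = seq.upper()
    return (
        set(seq) <= ALPHABET        # (i) only A, C, G, T
        and seq[:3] in starts       # (ii) begins with a start trinucleotide
        and seq[-3:] in stops       # (iii) ends with a stop trinucleotide
        and len(seq) % 3 == 0       # (iv) assumed: length is 0 modulo 3
    )


genes = ["ATGAACTAA", "ATGNACTAA", "ATGAACTA", "CTGAACTAA"]
print([g for g in genes if keep_gene(g)])
```

Passing the start and stop sets as parameters keeps the filter usable if a study admits alternative start trinucleotides.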
Bacterial genes
With the classical method (trinucleotide occurrence frequencies at the gene population level), the mean numbers of trinucleotides in frames 0 (reading frame), 1 and 2 of genes in the 613 bacterial genomes are 20.28, 19.83 and 19.90, respectively (Table 2). These statistical results are the first demonstration of the maximality of circular codes, with 20 preferential trinucleotides on average for each frame. The standard deviation is about 2 trinucleotides for each frame with a
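As a rough illustration of a population-level count, the sketch below pools trinucleotide occurrences over all genes of a population in each of the 3 frames, then assigns every trinucleotide to the frame in which its occurrence frequency is highest. The assignment rule, the tie-breaking (lowest frame wins) and the function names are assumptions made for this example, not a transcription of the published procedure.

```python
from collections import Counter


def frame_counts(genes):
    """Count trinucleotide occurrences in frames 0, 1 and 2 over all genes."""
    counts = [Counter(), Counter(), Counter()]
    for seq in genes:
        for frame in range(3):
            for i in range(frame, len(seq) - 2, 3):
                counts[frame][seq[i:i + 3]] += 1
    return counts


def preferential_frames(genes):
    """Assign each trinucleotide to the frame where its frequency is highest."""
    counts = frame_counts(genes)
    totals = [sum(c.values()) for c in counts]
    assigned = {0: set(), 1: set(), 2: set()}
    for word in sorted(set().union(*counts)):
        freqs = [counts[f][word] / totals[f] if totals[f] else 0.0
                 for f in range(3)]
        best = max(range(3), key=lambda f: freqs[f])  # ties go to lowest frame
        assigned[best].add(word)
    return assigned


# Toy population: a single sequence made of one repeated trinucleotide.
result = preferential_frames(["AACAACAACAAC"])
print({frame: sorted(words) for frame, words in result.items()})
```

The number of trinucleotides assigned to each frame, `len(result[f])`, is the quantity whose mean over genomes is reported above.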
The codon usage parameter unable to identify the circular codes
Readers very often ask about the concept and the method by which the circular code in genes was identified, and in particular how its maximality property (20 trinucleotides) was determined. In particular, the reader is astonished not to find the circular codes with the classical parameter analysing the codon usage (CU). This CU parameter is based on the frequencies of trinucleotides in reading frame. I would like to offer a few responses, which will also help to explain the
Some historical and theoretical considerations on comma-free and circular codes
In 1957, comma-free codes were proposed for amino acid coding, i.e. as a model of the genetic code before its experimental discovery (Crick et al., 1957). The idea was to find a code of 20 trinucleotides (codons) in the reading frame of genes, coding for 20 amino acids, such that no trinucleotide of the code occurs in either of the two shifted frames, i.e. such that the trinucleotides of the code appear only in the reading frame (the comma-free property). The four nucleotides as
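The comma-free property described above can be tested directly: for every ordered pair of code words placed side by side, neither of the two trinucleotides read in the shifted frames may belong to the code. The following is a minimal sketch with a hypothetical function name and toy codes chosen for illustration.

```python
from itertools import product


def is_comma_free(code):
    """True if no code word appears in a shifted frame of any two code words."""
    code = set(code)
    for u, v in product(code, repeat=2):
        pair = u + v
        # pair[1:4] is read in frame 1, pair[2:5] in frame 2.
        if pair[1:4] in code or pair[2:5] in code:
            return False
    return True


print(is_comma_free({"ACG", "ACT"}))
print(is_comma_free({"AAC", "CAG", "ACC"}))
```

In the second toy code, the concatenation AAC CAG contains ACC in frame 1, so the code is not comma-free.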
Conclusion
We presented several scientific arguments that the codon usage parameter has limitations for the identification of circular codes. We also demonstrated the maximality of circular codes, with 20 preferential trinucleotides on average for each frame, in both bacterial and eukaryotic genes, using both the classical method based on the trinucleotide occurrence frequencies at the gene population level and the recent method founded on the trinucleotide occurrence frequencies at the gene level.
Acknowledgments
This work is dedicated to Marie Denise Besch for her constant support.
References (30)
- et al. A complementary circular code in the protein coding genes. J. Theor. Biol. (1996)
- et al. A code in the protein coding genes. Biosystems (1997)
- et al. Evolutionary conservation and functional implications of circular code motifs in eukaryotic genomes. Biosystems (2019)
- et al. Circular code motifs in the ribosome decoding center. Comput. Biol. Chem. (2014)
- et al. Circular code motifs near the ribosome decoding center. Comput. Biol. Chem. (2015)
- et al. Circular code motifs in genomes of eukaryotes. J. Theor. Biol. (2016)
- et al. Mathematical fundamentals for the noise immunity of the genetic code. Biosystems (2018)
- et al. An analytical model of gene evolution with 6 mutation parameters: an application to archaeal circular codes. Comput. Biol. Chem. (2006)
- et al. Identification of circular codes in bacterial genomes and their use in a factorization method for retrieving the reading frames of genes. Comput. Biol. Chem. (2006)
- et al. About a symmetry of the genetic code. J. Theor. Biol. (1997)
- Analysis of a circular code model. J. Theor. Biol.
- A study of the purine/pyrimidine codon occurrence with a reduced centered variable and an evaluation compared to the frequency statistic. Math. Biosci.
- A 2006 review of circular codes in genes. Comput. Math. Appl.
- Circular code motifs in transfer and 16S ribosomal RNAs: a possible translation code in genes. Comput. Biol. Chem.
- Circular code motifs in transfer RNAs. Comput. Biol. Chem.