Elsevier

Biosystems

Volume 197, November 2020, 104201
The maximality of circular codes in genes statistically verified

https://doi.org/10.1016/j.biosystems.2020.104201

Abstract

Circular codes in genes are maximal, with 20 preferential trinucleotides in each frame. This combinatorial property is statistically verified in the genes of both bacteria and eukaryotes by two approaches, which compute the trinucleotide occurrence frequencies in the 3 frames at the gene population level (classical method) and at the gene level (recent method). Several remarks explain why the codon usage parameter is unable to identify the circular codes. Some historical and theoretical considerations on comma-free and circular codes are presented. An evolutionary process by trinucleotide permutation is proposed to describe the transformation of a circular code (and its motifs) into another circular code.

Introduction

A circular code X is a set of words such that any motif from X, called an X motif, allows the original (construction) frame to be retrieved, maintained and synchronized. The circular code X identified in genes of bacteria, archaea, eukaryotes, plasmids and viruses (Michel, 2017, 2015; Arquès and Michel, 1996) contains the 20 following trinucleotides in reading frame (frame 0)

X = {AAC, AAT, ACC, ATC, ATT, CAG, CTC, CTG, GAA, GAC, GAG, GAT, GCC, GGC, GGT, GTA, GTC, GTT, TAC, TTC},   (1)

the 20 following trinucleotides in frame 1 (reading frame shifted by 1 nucleotide in the 5'-3' direction, i.e. to the right)

X1 = {AAG, ACA, ACG, ACT, AGC, AGG, ATA, ATG, CCA, CCG, GCG, GTG, TAG, TCA, TCC, TCG, TCT, TGC, TTA, TTG}   (2)

and the 20 following trinucleotides in frame 2 (reading frame shifted by 2 nucleotides in the 5'-3' direction)

X2 = {AGA, AGT, CAA, CAC, CAT, CCT, CGA, CGC, CGG, CGT, CTA, CTT, GCA, GCT, GGA, TAA, TAT, TGA, TGG, TGT}.   (3)

The trinucleotide set X (defined in (1)) coding the reading frame in genes is a maximal (20 trinucleotides) C3 self-complementary trinucleotide circular code (Arquès and Michel, 1996; reviewed in Michel, 2008; Fimmel and Strüngmann, 2018).
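
The properties stated for X can be checked mechanically. The following Python sketch is an illustration, not the procedure used in the cited works. The function names (`permute`, `reverse_complement`, `is_self_complementary`, `is_circular`) are hypothetical. The circularity test assumes the graph criterion associated with the graph-theory approach (Fimmel et al., 2016), as we understand it: a trinucleotide code is circular if and only if the directed graph with edges b1 -> b2b3 and b1b2 -> b3, for each trinucleotide b1b2b3 of the code, has no cycle.

```python
# Sets (1), (2) and (3) as given in the text.
X = set("AAC AAT ACC ATC ATT CAG CTC CTG GAA GAC "
        "GAG GAT GCC GGC GGT GTA GTC GTT TAC TTC".split())
X1 = set("AAG ACA ACG ACT AGC AGG ATA ATG CCA CCG "
         "GCG GTG TAG TCA TCC TCG TCT TGC TTA TTG".split())
X2 = set("AGA AGT CAA CAC CAT CCT CGA CGC CGG CGT "
         "CTA CTT GCA GCT GGA TAA TAT TGA TGG TGT".split())

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Complement each nucleotide and reverse the word (e.g. AAC -> GTT)."""
    return word.translate(COMPLEMENT)[::-1]


def permute(code):
    """Circular permutation b1b2b3 -> b2b3b1 of every trinucleotide."""
    return {w[1:] + w[0] for w in code}


def is_self_complementary(code):
    """True if the code equals the set of its reverse complements."""
    return {reverse_complement(w) for w in code} == set(code)


def is_circular(code):
    """Assumed graph criterion: circular iff the associated graph is acyclic."""
    graph = {}
    for w in code:
        graph.setdefault(w[0], set()).add(w[1:])   # edge b1 -> b2b3
        graph.setdefault(w[:2], set()).add(w[2])   # edge b1b2 -> b3
    in_progress, done = set(), set()

    def has_cycle(node):
        # Depth-first search; meeting a node still in progress closes a cycle.
        in_progress.add(node)
        for nxt in graph.get(node, ()):
            if nxt in in_progress:
                return True
            if nxt not in done and has_cycle(nxt):
                return True
        in_progress.discard(node)
        done.add(node)
        return False

    return not any(n not in done and has_cycle(n) for n in list(graph))
```

Under these assumptions, `permute(X) == X1` and `permute(X1) == X2` express the C3 relation between the three sets, `is_self_complementary(X)` expresses self-complementarity, and `len(X) == 20` corresponds to the maximal size mentioned in the text.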

From a mathematical point of view, the identification of circular codes in genes led to about 200 theorems obtained in the different research fields of circular codes: the flower automaton (Arquès and Michel, 1996; and subsequent works), the probability approach (Koch and Lehmann, 1997; Lacan and Michel, 2001), the necklace 5LDCN (Letter Diletter Continued Necklace) (Pirillo, 2003), the necklace nLDCCN (Letter Diletter Continued Closed Necklace) with n ∈ {2,3,4,5} (Michel and Pirillo, 2010; and subsequent works), the group theory (Fimmel et al., 2014; and subsequent works) and the graph theory (Fimmel et al., 2016; and subsequent works).

From a biological point of view, the X circular code motifs (or briefly X motifs), i.e. motifs of the X circular code, allow the reading frame in genes to be retrieved, maintained and synchronized. The concept, the statistical analyses and the biological studies of X circular code motifs were introduced in Michel (2012). It has been shown recently that the X motifs are enriched in the reading frame of extant genes (El Soufi and Michel, 2016; Michel et al., 2017; Dila et al., 2019a), as well as in tRNA sequences (Michel, 2012, 2013; El Soufi and Michel, 2015) and in functional regions of rRNA involved in mRNA translation (Michel, 2012; El Soufi and Michel, 2014, 2015; Dila et al., 2019b). Furthermore, a circular code periodicity 0 modulo 3 was identified in the 16S rRNA, covering the region that corresponds to the primordial proto-ribosome decoding center and containing numerous sites that interact with the tRNA and mRNA during translation (Michel and Thompson, 2020). Based on the mathematical properties of the X circular code and the enrichment of X motifs in the main actors involved in translation, it has been suggested that the X circular code was an ancestor code of the standard genetic code that was used to code amino acids and simultaneously to identify and maintain the reading frame (Dila et al., 2019b).

In order to verify the maximality of circular codes in genes, we will reformulate in this work the presentation of the classical method (Arquès and Michel, 1996; Michel, 2015) and the recent method (Michel, 2017). Precisely, we will demonstrate here that the maximality of the 3 circular codes X (with 20 trinucleotides in reading frame, defined in (1)), X1 (with 20 trinucleotides in frame 1, defined in (2)) and X2 (with 20 trinucleotides in frame 2, defined in (3)), that have been assigned by inspection in Arquès and Michel (1996), is statistically verified.

Section snippets

Gene kingdoms

Gene kingdoms K of bacteria B and eukaryotes E are obtained from the GenBank database (http://www.ncbi.nlm.nih.gov/genome/browse/, January 2020). Computer tests exclude genes when: (i) their nucleotides do not belong to the alphabet B={A,C,G,T} where A stands for adenine, C stands for cytosine, G stands for guanine and T stands for thymine; (ii) they do not begin with a start trinucleotide ATG; (iii) they do not end with a stop trinucleotide {TAA,TAG,TGA}; and (iv) their lengths are not modulo
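
As an illustration of exclusion criteria (i) to (iii), a minimal Python sketch is given below. The function name `passes_filters` is hypothetical, and criterion (iv) on gene length is deliberately left out because its statement is truncated in this text.

```python
ALPHABET = set("ACGT")
STOP_TRINUCLEOTIDES = {"TAA", "TAG", "TGA"}


def passes_filters(gene):
    """Return True if a gene sequence survives exclusion criteria (i)-(iii)."""
    # (i) every nucleotide must belong to the alphabet {A, C, G, T}
    if not gene or set(gene) - ALPHABET:
        return False
    # (ii) the gene must begin with the start trinucleotide ATG
    if not gene.startswith("ATG"):
        return False
    # (iii) the gene must end with a stop trinucleotide
    if gene[-3:] not in STOP_TRINUCLEOTIDES:
        return False
    # (iv) length criterion: not implemented (truncated in the source text)
    return True
```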

Bacterial genes

With the classical method (trinucleotide occurrence frequencies at the gene population level), the mean numbers of trinucleotides in frames 0 (reading frame), 1 and 2 of genes in the 613 bacterial genomes are 20.28, 19.83 and 19.90, respectively (Table 2). These important statistical results are the first demonstration of the maximality of circular codes with 20 preferential trinucleotides on average for each frame. The standard deviation is about 2 trinucleotides for each frame with a
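
The idea of counting preferential trinucleotides per frame at the gene population level can be sketched as follows. This is an illustration under stated assumptions, not the statistic of the cited works: we assume that a trinucleotide is "preferential" in the frame where its occurrence frequency over the whole gene population is highest, that frame f is read from position f in non-overlapping steps of 3, and that ties go to the lowest frame. The function names `frame_frequencies` and `preferential_sets` are hypothetical.

```python
from collections import Counter


def frame_frequencies(genes):
    """Trinucleotide occurrence frequencies in frames 0, 1 and 2,
    pooled over all genes (population level)."""
    counts = [Counter(), Counter(), Counter()]
    for gene in genes:
        for frame in range(3):
            # Read non-overlapping trinucleotides starting at offset `frame`.
            for i in range(frame, len(gene) - 2, 3):
                counts[frame][gene[i:i + 3]] += 1
    freqs = []
    for c in counts:
        total = sum(c.values())
        freqs.append({w: n / total for w, n in c.items()} if total else {})
    return freqs


def preferential_sets(genes):
    """Assign each observed trinucleotide to the frame where its frequency
    is highest (assumed rule; ties go to the lowest frame number)."""
    freqs = frame_frequencies(genes)
    sets = [set(), set(), set()]
    for w in sorted(set().union(*freqs)):
        best = max(range(3), key=lambda f: freqs[f].get(w, 0.0))
        sets[best].add(w)
    return sets
```

With a list of gene sequences for one genome, `[len(s) for s in preferential_sets(genes)]` gives the number of preferential trinucleotides per frame; averaging these counts over genomes yields figures of the kind reported in the text.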

The codon usage parameter unable to identify the circular codes

Readers very often ask about the concept and the method by which the X circular code in genes was identified, and in particular how its maximality property (20 trinucleotides) was determined. Notably, readers are astonished not to find the circular codes with the classical parameter analysing the codon usage (CU). This CU parameter is based on the frequencies of trinucleotides in reading frame. I would like to offer a few responses, which will also help to explain the

Some historical and theoretical considerations on comma-free and circular codes

In 1957, comma-free codes were proposed for amino acid coding, i.e. as a model of the genetic code before its experimental discovery (Crick et al., 1957). The idea was to find a code of 20 trinucleotides (codons) in the reading frame of genes coding for 20 amino acids such that no trinucleotide of the code exists in either of the two shifted frames, i.e. such that the trinucleotides of the code appear only in the reading frame: this is the comma-free property. The four nucleotides {A,C,G,T} as
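
The comma-free property described above can be tested directly on a set of trinucleotides. The Python sketch below is a minimal illustration with a hypothetical function name, `is_comma_free`. It relies on the observation that, for trinucleotide codes, any trinucleotide read in a shifted frame overlaps exactly two consecutive codewords, so checking all ordered pairs is sufficient.

```python
def is_comma_free(code):
    """True if no trinucleotide of the code appears in frame 1 or frame 2
    of any concatenation of two trinucleotides of the code."""
    code = set(code)
    for w1 in code:
        for w2 in code:
            pair = w1 + w2
            # pair[1:4] is read in frame 1, pair[2:5] in frame 2.
            if pair[1:4] in code or pair[2:5] in code:
                return False
    return True
```

For instance, the single-word code {ACG} passes this test, whereas the set X given in (1) does not: the concatenation GAA + CAG contains AAC, a trinucleotide of X, in frame 1.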

Conclusion

We presented several scientific arguments that the codon usage parameter has limitations for the identification of circular codes. We also demonstrated the maximality of circular codes with 20 preferential trinucleotides on average for each frame, both in the bacterial and eukaryotic genes, and both with the classical method based on the trinucleotide occurrence frequencies at the gene population level and the recent method founded on the trinucleotide occurrence frequencies at the gene level.

Acknowledgments

This work is dedicated to Marie Denise Besch for her constant support.
