Self-complementary circular codes in coding theory

Fimmel, Elena; Michel, Christian J.; Starman, Martin; Strüngmann, Lutz

doi:10.1007/s12064-018-0259-4

Original Article
Published: 12 March 2018

Volume 137, pages 51–65, (2018)

Theory in Biosciences

Elena Fimmel¹, Christian J. Michel², Martin Starman¹ & Lutz Strüngmann¹

Abstract

Self-complementary circular codes are involved in pairing genetic processes. A maximal $C^3$ self-complementary circular code X of trinucleotides was identified in genes of bacteria, archaea, eukaryotes, plasmids and viruses (Michel in Life 7(20):1–16 2017, J Theor Biol 380:156–177, 2015; Arquès and Michel in J Theor Biol 182:45–58 1996). In this paper, self-complementary circular codes are investigated using the graph theory approach recently formulated in Fimmel et al. (Philos Trans R Soc A 374:20150058, 2016). A directed graph $\mathcal {G}(X)$ associated with any code X mirrors the properties of the code. In the present paper, we demonstrate a necessary condition for the self-complementarity of an arbitrary code X in terms of graph theory. The same condition has been proven to be sufficient for codes which are circular and of large size $\mid X \mid \ge 18$ trinucleotides, in particular for maximal circular codes ($\mid X \mid = 20$ trinucleotides). For codes of small size $\mid X \mid \le 16$ trinucleotides, some very rare counterexamples have been constructed. Furthermore, the length and the structure of the longest paths in the graphs associated with the self-complementary circular codes are investigated. It has been proven that the longest paths in such graphs determine the reading frame for the self-complementary circular codes. By applying this result, the reading frame in any arbitrary sequence of trinucleotides is retrieved after at most 15 nucleotides, i.e., 5 consecutive trinucleotides, from the circular code X identified in genes. Thus, an X motif of a length of at least 15 nucleotides in an arbitrary sequence of trinucleotides (not necessarily all of them belonging to X) uniquely defines the correct reading frame, an important criterion for analyzing the X motifs in genes in the future.
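The two properties named in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions not spelled out in this text: the 20 trinucleotides listed for X are those commonly given in the circular-code literature, and the graph follows the convention of Fimmel et al. (2016), where a trinucleotide $b_1b_2b_3$ contributes the arrows $b_1 \rightarrow b_2b_3$ and $b_1b_2 \rightarrow b_3$ and a code is circular exactly when its graph is acyclic. The function names are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Assumed listing of the code X (taken from the literature, not from this text).
X = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}


def reverse_complement(word):
    """Complement every nucleotide and reverse the word."""
    return "".join(COMPLEMENT[b] for b in reversed(word))


def is_self_complementary(code):
    """A code is self-complementary if it contains the reversed complement of each word."""
    return all(reverse_complement(w) in code for w in code)


def build_graph(code):
    """Arrows b1 -> b2b3 and b1b2 -> b3 for every trinucleotide b1b2b3 (assumed convention)."""
    graph = {}
    for w in code:
        graph.setdefault(w[0], set()).add(w[1:])
        graph.setdefault(w[:2], set()).add(w[2])
    return graph


def is_acyclic(graph):
    """Three-colour depth-first search; reaching a grey vertex again means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}

    def visit(v):
        colour[v] = GREY
        for nxt in graph.get(v, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY or (state == WHITE and not visit(nxt)):
                return False
        colour[v] = BLACK
        return True

    return all(colour.get(v, WHITE) != WHITE or visit(v) for v in list(graph))


print(is_self_complementary(X), is_acyclic(build_graph(X)))
```

Under these assumptions, the pair `{"ATA", "TAT"}` is self-complementary but its graph contains a cycle, so it is rejected by `is_acyclic`.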

Notes

Recall that the union $G_1\cup G_2$ of two graphs $G_1=(V_1,E_1)$ and $G_2=(V_2,E_2)$ is defined as $G=(V_1\cup V_2, E_1\cup E_2)$ (Clark and Holton 1991).
Due to the self-complementarity of X, $\mid X \mid $ must be even; in contrast to circular codes, however, there are no self-complementary comma-free codes of sizes 18 or 20.

References

Arquès DG, Michel CJ (1996) A complementary circular code in the protein coding genes. J Theor Biol 182:45–58
Clark J, Holton DA (1991) A first look at graph theory. World Scientific, New Jersey
Crick FH, Brenner S, Klug A, Pieczenik G (1976) A speculation on the origin of protein synthesis. Orig Life 7:389–397
Crick FH, Griffith JS, Orgel LE (1957) Codes without commas. Proc Natl Acad Sci USA 43:416–421
Eigen M, Schuster P (1978) The hypercycle. A principle of natural self-organization. Part C: The realistic hypercycle. Naturwissenschaften 65:341–369
El Soufi K, Michel CJ (2014) Circular code motifs in the ribosome decoding center. Comput Biol Chem 52:9–17
El Soufi K, Michel CJ (2015) Circular code motifs near the ribosome decoding center. Comput Biol Chem 59:158–176
El Soufi K, Michel CJ (2016) Circular code motifs in genomes of eukaryotes. J Theor Biol 408:198–212
Fimmel E, Michel CJ, Strüngmann L (2016) $n$-Nucleotide circular codes in graph theory. Philos Trans R Soc A 374:20150058
Fimmel E, Michel CJ, Strüngmann L (2017) Strong comma-free codes in genetic information. Bull Math Biol 79:1796–1819
Golomb SW, Delbruck M, Welch LR (1958a) Construction and properties of comma-free codes. Biol Medd K Dan Vidensk Selsk 23:1–34
Golomb SW, Gordon B, Welch LR (1958b) Comma-free codes. Can J Math 10:202–209
Ikehara K (2002) Origins of gene, genetic code, protein and life: comprehensive view of life systems from a GNC-SNS primitive genetic code hypothesis. J Biosci 27:165–186
Michel CJ (2012) Circular code motifs in transfer and 16S ribosomal RNAs: a possible translation code in genes. Comput Biol Chem 37:24–37
Michel CJ (2013) Circular code motifs in transfer RNAs. Comput Biol Chem 45:17–29
Michel CJ (2015) The maximal $C^3$ self-complementary trinucleotide circular code $X$ in genes of bacteria, eukaryotes, plasmids and viruses. J Theor Biol 380:156–177
Michel CJ (2017) The maximal $C^3$ self-complementary trinucleotide circular code $X$ in genes of bacteria, archaea, eukaryotes, plasmids and viruses. Life 7(20):1–16
Michel CJ, Nguefack Ngoune V, Poch O, Ripp R, Thompson JD (2017) Enrichment of circular code motifs in the genes of the yeast Saccharomyces cerevisiae. Life 7(52):1–20
Michel CJ, Pirillo G, Pirillo MA (2008) Varieties of comma free codes. Comput Math Appl 55:989–996
Shepherd JCW (1981) Method to determine the reading frame of a protein from the purine/pyrimidine genome sequence and its possible evolutionary justification. Proc Natl Acad Sci USA 78:1596–1600
Trifonov EN (1987) Translation framing code and frame-monitoring mechanism as suggested by the analysis of mRNA and 16S rRNA nucleotide sequences. J Mol Biol 194:643–652

Author information

Authors and Affiliations

Faculty for Computer Sciences, Institute of Mathematical Biology, Mannheim University of Applied Sciences, 68163, Mannheim, Germany
Elena Fimmel, Martin Starman & Lutz Strüngmann
Theoretical Bioinformatics, ICube, CNRS, University of Strasbourg, 300 Boulevard Sébastien Brant, 67400, Illkirch, France
Christian J. Michel

Corresponding author

Correspondence to Christian J. Michel.

Appendix

Proof of Theorem 4.7

Claim (1): Let $l_{max}(X)=4$ and assume that $l_1 \rightarrow d_1 \rightarrow l_2 \rightarrow d_2 \rightarrow l_3$ is a longest path in $\mathcal {G}(X)$. Since the path is maximal, there is no trinucleotide of the form $dl_1$ and no trinucleotide of the form $l_3d$ in X. It follows that $c(l_3)=l_1$ and $d_1, d_2 \in \{l_2,c(l_2) \}^2$. Note that all the nucleotides $l_1,l_2,l_3$ must be different by circularity. Thus, we have 4 possibilities for each of $d_1$ and $d_2$, namely $l_2l_2$, $l_2c(l_2)$, $c(l_2)l_2$ and $c(l_2)c(l_2)$. As $l_2l_2l_2 \not \in X$ by circularity, we have the following options for the 2 trinucleotides $d_1l_2 \in X$ and $l_2d_2 \in X$:

$$\begin{aligned}&d_1l_2: \quad \quad l_2c(l_2)l_2 \quad c(l_2)l_2l_2 \quad c(l_2)c(l_2)l_2; \\& l_2d_2: \quad \quad l_2l_2c(l_2) \quad l_2c(l_2)l_2 \quad l_2c(l_2)c(l_2). \end{aligned}$$

If $d_1l_2$ or $l_2d_2$ is equal to $l_2c(l_2)l_2$, then self-complementarity yields $c(l_2)l_2c(l_2) \in X$ and the word $c(l_2)l_2c(l_2)l_2c(l_2)l_2$ contradicts circularity. The combinations $c(l_2)l_2l_2$, $l_2l_2c(l_2)$ and $c(l_2)c(l_2)l_2$, $l_2c(l_2)c(l_2)$ are excluded since their trinucleotides are circular permutations of each other, so only 2 combinations remain: $c(l_2)l_2l_2$, $l_2c(l_2)c(l_2)$ and $c(l_2)c(l_2)l_2$, $l_2l_2c(l_2)$. Here too, self-complementarity yields a contradiction to circularity since, for example, the complementary trinucleotide of $c(l_2)c(l_2)l_2$ is in the same equivalence class as $l_2l_2c(l_2)$.
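As a concrete instance of the first exclusion, one may take $l_2=A$, a choice made here only for illustration. The last line assumes the graph convention of Fimmel et al. (2016), in which a trinucleotide $b_1b_2b_3$ contributes the arrows $b_1 \rightarrow b_2b_3$ and $b_1b_2 \rightarrow b_3$:

$$\begin{aligned}&ATA \in X \ \Rightarrow \ \overleftarrow{{c(ATA)}} = TAT \in X; \\&TATATA = (TAT)(ATA), \quad \text{and, shifted by one on the circle,} \quad ATATAT = (ATA)(TAT); \\&A \rightarrow TA \rightarrow T \rightarrow AT \rightarrow A \quad \text{is a cycle in } \mathcal {G}(X). \end{aligned}$$

The word written on a circle therefore admits two decompositions into trinucleotides of X, which is what circularity forbids.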

Claim (2): Let $l_{max}(X)=6$ and assume that $d_1 \rightarrow l_1 \rightarrow d_2 \rightarrow l_2 \rightarrow d_3 \rightarrow l_3 \rightarrow d_4 $ is a longest path in $\mathcal {G}(X)$. By self-complementarity, there is the reversed complemented path

$$\begin{aligned} \overleftarrow{{c(d_4)}} \rightarrow c(l_3) \rightarrow \overleftarrow{{c(d_{3})}} \rightarrow c(l_2) \rightarrow \overleftarrow{{c(d_2)}} \rightarrow c(l_1) \rightarrow \overleftarrow{{c(d_1)}}. \end{aligned}$$

Now, the middle nucleotides $l_2$ and $c(l_2)$ of the 2 paths are either the pair A and T, or C and G. Therefore, it suffices to show that there are paths $A \rightarrow d \rightarrow T$ or $T \rightarrow d \rightarrow A$ and $C \rightarrow d \rightarrow G$ or $G \rightarrow d \rightarrow C$ in $\mathcal {G}(X)$; since then, we will obtain a path of length 8 combining the 2 paths, e.g.,

$$\begin{aligned} d_1 \rightarrow l_1 \rightarrow d_2 \rightarrow l_2 \rightarrow d \rightarrow c(l_2) \rightarrow \overleftarrow{{c(d_2)}} \rightarrow c(l_1) \rightarrow \overleftarrow{{c(d_1)}} \end{aligned}$$

contradicting $l_{max}(X)=6$. However, by maximality, the code X must contain exactly one trinucleotide of the class $\{ ATT, TTA, TAT\}$ and its complementary trinucleotide as well as exactly one trinucleotide from the class $\{ GCC, CCG, CGC\}$ and its complementary trinucleotide. It is easy to verify that in each case we obtain either a path of the form $A \rightarrow d \rightarrow T$ or $T \rightarrow d \rightarrow A$ and $C \rightarrow d \rightarrow G$ or $G \rightarrow d \rightarrow C$, e.g., if $ATT \in X$ then also $AAT \in X$ and we get the path $A \rightarrow AT \rightarrow T$ in $\mathcal {G}(X)$.
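The reversed complemented path used in Claim (2) can be sketched in Python. The listing of X and the arrow convention ($b_1 \rightarrow b_2b_3$ and $b_1b_2 \rightarrow b_3$ for each trinucleotide $b_1b_2b_3$) are assumptions taken from the literature rather than from this text, the function names are hypothetical, and the example path is chosen only for illustration.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Assumed listing of the code X (taken from the literature, not from this text).
X = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}


def reverse_complement(word):
    """Complement every nucleotide and reverse the word."""
    return "".join(COMPLEMENT[b] for b in reversed(word))


def build_edges(code):
    """Set of arrows (b1, b2b3) and (b1b2, b3) for every trinucleotide b1b2b3."""
    edges = set()
    for w in code:
        edges.add((w[0], w[1:]))
        edges.add((w[:2], w[2]))
    return edges


def is_path(path, edges):
    """True if every consecutive pair of vertices is an arrow of the graph."""
    return all((u, v) in edges for u, v in zip(path, path[1:]))


def reversed_complemented_path(path):
    """Traverse the path backwards, replacing each label by its reversed complement."""
    return [reverse_complement(v) for v in reversed(path)]


edges = build_edges(X)
path = ["GA", "G", "GT", "A", "AT", "T", "AC", "C", "AG"]
mirror = reversed_complemented_path(path)
print(is_path(path, edges), mirror, is_path(mirror, edges))
```

For a self-complementary code, `mirror` is again a path: each arrow contributed by a trinucleotide is matched by an arrow contributed by its complementary trinucleotide.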

Claim (3): Let $l_{max}(X)=8$ and assume that $l_1 \rightarrow d_1 \rightarrow l_2 \rightarrow d_2 \rightarrow l_3 \rightarrow d_3 \rightarrow l_4 \rightarrow d_4 \rightarrow l_5$ is a longest path in $\mathcal {G}(X)$. Since there are only 4 nucleotides, 2 out of the 5 nucleotides $l_1,l_2,l_3,l_4,l_5$ must be equal, which yields a cycle in $\mathcal {G}(X)$ contradicting the circularity of X. $\square $
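The structure of a longest path can be inspected with a short Python sketch. It assumes the listing of X commonly given in the literature and the arrow convention $b_1 \rightarrow b_2b_3$, $b_1b_2 \rightarrow b_3$ for each trinucleotide $b_1b_2b_3$; neither is spelled out in this text, and the function names are hypothetical. The recursion is only valid for an acyclic graph, i.e., for a circular code.

```python
from functools import lru_cache

# Assumed listing of the code X (taken from the literature, not from this text).
X = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}


def build_graph(code):
    """Arrows b1 -> b2b3 and b1b2 -> b3 for every trinucleotide b1b2b3."""
    graph = {}
    for w in code:
        graph.setdefault(w[0], set()).add(w[1:])
        graph.setdefault(w[:2], set()).add(w[2])
    return graph


def longest_path(graph):
    """Return one longest path as a tuple of vertices (graph must be acyclic)."""

    @lru_cache(maxsize=None)
    def best_from(v):
        # Longest path starting at v: v followed by the longest tail among its successors.
        tails = (best_from(n) for n in graph.get(v, ()))
        return (v,) + max(tails, key=len, default=())

    vertices = set(graph) | {n for targets in graph.values() for n in targets}
    return max((best_from(v) for v in vertices), key=len)


path = longest_path(build_graph(X))
arrow_length = len(path) - 1             # number of arrows on the path
word_length = sum(len(v) for v in path)  # number of nucleotides spelled by the path
print(path, arrow_length, word_length)
```

For this listing of X, the path found has arrow-length 8, starts and ends with a dinucleotide, and passes through the 4 nucleotides exactly once each, which is the only shape compatible with the pigeonhole argument of Claim (3).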

Proof of Theorem 5.11

Let $X\subseteq \mathcal {B}^3$ be a maximal self-complementary circular code and $\mathcal {G}(X)$ its associated graph. Since X is circular, $\mathcal {G}(X)$ is acyclic, so it has a path $p=p_{max}(X)$ of maximal length l(p).

Claim (1): Assume that $p=d_1 \rightarrow b_1 \rightarrow \cdots \rightarrow b_k$, then any concatenation $d_ib_i \in X$. Choose any trinucleotide $c=s_1s_2s_3 \in X$. Then $(d_1b_1)\cdots (d_kb_k) (s_1s_2s_3)\in X^{k+1}$ and hence $(d_1b_1)\cdots (d_kb_k) s_1$ is a possible X-frame (for itself) with $t_b=\epsilon $ and $t_e=s_1$. Moreover, each concatenation $b_id_{i+1}$ is also a trinucleotide in X, so $d_1(b_1d_2)\cdots (b_{k-1}d_k)b_ks_1$ is a second possible X-frame with $t_b=d_1$ and $t_e=b_ks_1$. Thus, $n_X \ge l_w(p)+2$ since the sequence $d_1b_1 \cdots d_kb_ks_1$ has length $l_w(p)+1$.

Now assume that $b_1 \cdots b_k$ is a sequence of nucleotides and assume that $k \ge l_w(p)+2$ but $b_1 \cdots b_k$ has 2 different possible X-frames. We have to show a contradiction to conclude that $n_X=l_w(p)+2$. Assume that $t_b u_1 \cdots u_l t_e$ and $t_b' u_1'\cdots u_m't_e'$ with $u_i, u_i' \in X$ and $t_b,t_e, t_b',t_e' \in \left( \{ \epsilon \} \cup \mathcal {B}\cup \mathcal {B}^2 \right) $ are the 2 different possible X-frames. Obviously, $\mid t_bt_e \mid \le 4$. If $\mid t_bt_e \mid =4$ then by the difference of the 2 possible X-frames, we conclude that at least one of $t_b'$ or $t_e'$ has to have length $\ge 3$, a contradiction to the definition of possible X-frame, or $\mid t_b't_e' \mid \le 3$. Hence, w.l.o.g. we assume that $\mid t_bt_e \mid \le 3$. Consequently, $\mid u_1 \cdots u_l \mid \ge k-3 \ge l_w(p)+2-3=l_w(p)-1$ and hence $\mid u_1 \cdots u_l \mid \ge l_w(p)$, since both $\mid u_1 \cdots u_l \mid $ and $l_w(p)$ are multiples of 3. We now have to distinguish cases:

(a) If $\mid t_bt_e \mid \le 1$ then we even get $\mid u_1 \cdots u_l \mid \ge k-1 \ge l_w(p)+2-1=l_w(p)+1$. Thus, the path associated with the 2 possible X-frames has word-length at least $l_w(p)+1$, a contradiction to the maximality of $l_w(p)$. In this case, the sequence $u_1 \cdots u_l$ could contain the sequence $u_1' \cdots u_m'$ as a subsequence.
(b) If $\mid t_bt_e \mid \ge 2$ then the second possible X-frame is at least shifted by one with respect to the first possible X-frame, i.e., it must extend the sequence $u_1 \cdots u_l$ to the left or to the right. In this case, the sequence $u_1 \cdots u_l$ cannot contain the sequence $u_1' \cdots u_m'$ as a subsequence. The path associated with the 2 possible X-frames has word-length at least $\mid u_1 \cdots u_l \mid +1 \ge l_w(p)+1$, again a contradiction to the maximality of $l_w(p)$.

Thus, $n_X=l_w(p)+2$.

The case $p= b_1 \rightarrow d_1 \rightarrow \cdots \rightarrow d_k$ is symmetric and can be similarly dealt with.

Claim (2): Assume that $p=d_1 \rightarrow b_1 \rightarrow \cdots \rightarrow d_k$, then any concatenation $d_ib_i \in X$. As in Claim (1), $(d_1b_1)\cdots (d_{k-1}b_{k-1})d_k$ is a possible X-frame (for itself) with $t_b=\epsilon $ and $t_e=d_k$. Moreover, each concatenation $b_id_{i+1}$ is a trinucleotide in X, so $d_1(b_1d_2)\cdots (b_{k-2}d_{k-1})(b_{k-1}d_k)$ is a second possible X-frame with $t_b=d_1$ and $t_e=\epsilon $. Thus, $n_X \ge l_w(p)+1$ since the sequence $d_1b_1 \cdots d_{k-1}b_{k-1}d_k$ has length $l_w(p)$.

Now assume that $b_1 \cdots b_k$ is a sequence of nucleotides and assume that $k \ge l_w(p)+1$ but $b_1 \cdots b_k$ has 2 different possible X-frames: $t_b u_1 \cdots u_l t_e$ and $t_b' u_1'\cdots u_m't_e'$ with $u_i, u_i' \in X$ and $t_b,t_e, t_b',t_e' \in \left( \{ \epsilon \} \cup \mathcal {B}\cup \mathcal {B}^2 \right) $. As in Claim (1), we assume w.l.o.g. that $\mid t_bt_e \mid \le 3$. We distinguish cases:

(a) If $\mid t_bt_e \mid =0$ then $\mid u_1 \cdots u_l \mid \ge l_w(p)+1$ and $u'_1 \cdots u'_m$ is a subsequence of $u_1 \cdots u_l$. Thus, the path associated with the 2 possible X-frames has word-length $l_w(p)+1$ with the associated word $ u_1 \cdots u_l$, a contradiction to the maximality of $l_w(p)$.
(b) If $\mid t_bt_e \mid =1$ then $\mid u_1 \cdots u_l \mid \ge l_w(p)$. If the second possible X-frame is shifted by one with respect to the first one, then the path associated with the 2 possible X-frames has word-length $l_w(p)+1$, again a contradiction to the maximality of $l_w(p)$. If the second possible X-frame is shifted by two, then the path associated with the 2 possible X-frames has word-length $l_w(p)$. However, in this case, the path starts with a dinucleotide and ends with a nucleotide, a contradiction to the structure of maximal paths which have to start and end with a dinucleotide.
(c) If $\mid t_bt_e \mid =2$ then $\mid u_1 \cdots u_l \mid \ge l_w(p)-1$. Again, we have to distinguish cases:
  (i) $\mid t_b \mid =2$ and $\mid t_e \mid =0$. Then the path associated with the 2 possible X-frames has word-length $l_w(p)$ and starts with a nucleotide but ends with a dinucleotide, a contradiction to the structure of maximal paths, or has word-length $l_w(p)+1$, a contradiction to the maximality of $l_w(p)$.
  (ii) $\mid t_b \mid =0$ and $\mid t_e \mid =2$, as (i).
  (iii) $\mid t_b \mid =1$ and $\mid t_e \mid =1$. As above, if the second possible X-frame is shifted by one, then the path associated with the 2 possible X-frames has word-length $l_w(p)$ again starting with a nucleotide ($u_1$) and ending with a dinucleotide, a contradiction to the structure of maximal paths. If the second possible X-frame is shifted by two, then again the path associated with the 2 possible X-frames has word-length $l_w(p)$ starting with a nucleotide ($u'_1$) and ending with a dinucleotide, the same contradiction.
(d) If $\mid t_bt_e \mid =3$ then $\mid u_1 \cdots u_l \mid \ge l_w(p)-2$. We distinguish two symmetric cases:
  (i) $\mid t_b \mid =2$ and $\mid t_e \mid =1$. If the second possible X-frame is shifted by one, then the path associated with the 2 possible X-frames has word-length $l_w(p)+1$, a contradiction to the maximality of $l_w(p)$, or has word-length $l_w(p)$ but starting with a nucleotide and ending with a dinucleotide, a contradiction to the structure of maximal paths. If the second possible X-frame is shifted by two, then either the path associated with the 2 possible X-frames has word-length $l_w(p)+1$, a contradiction to the maximality of $l_w(p)$, or has word-length $l_w(p)-1$ starting with a nucleotide and ending with a nucleotide. But this case cannot exist unless the arrow-length of this path is at least the arrow-length of p, a contradiction to the maximality of p.
  (ii) $\mid t_b \mid =1$ and $\mid t_e \mid =2$, as (i).

Claim (3): Assume that $p=b_1 \rightarrow d_1 \rightarrow \cdots \rightarrow b_k$, then any concatenation $b_id_i \in X$. Choose any 2 trinucleotides $c=s_1s_2s_3, c'=s'_1s'_2s'_3 \in X$. Then $(s'_1s'_2s'_3)(b_1d_1)\cdots (d_kb_k) (s_1s_2s_3)\in X^{k+2}$ and hence $s'_3(b_1d_1)\cdots (b_{k-1}d_{k-1})b_k s_1$ is a possible X-frame (for itself) with $t_b=s'_3$ and $t_e=b_ks_1$. Moreover, each concatenation $d_ib_{i+1}$ is a trinucleotide in X, so $s'_3b_1(d_1b_2)\cdots (d_{k-1}b_k)s_1$ is a second possible X-frame with $t_b=s'_3b_1$ and $t_e=s_1$. Thus, $n_X \ge l_w(p)+3$ since the sequence $s'_3b_1d_1 \cdots b_{k-1}d_{k-1}b_ks_1$ has length $l_w(p)+2$.

Now assume that $b_1 \cdots b_k$ is a sequence of nucleotides with $k \ge l_w(p)+3$ but $b_1 \cdots b_k$ has 2 different possible X-frames: $t_b u_1 \cdots u_l t_e$ and $t_b' u_1'\cdots u_m't_e'$ with $u_i, u_i' \in X$ and $t_b,t_e, t_b',t_e' \in \left( \{ \epsilon \} \cup \mathcal {B}\cup \mathcal {B}^2 \right) $. As in Claim (1), we conclude that w.l.o.g. $\mid t_bt_e \mid \le 3$ and hence $\mid u_1 \cdots u_l \mid \ge k-3 \ge l_w(p)+3-3=l_w(p)$. Similar arguments as above show that the path associated with the 2 possible X-frames has word-length greater than $l_w(p)$, in contradiction to the maximality of p and $l_w(p)$. $\square $
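The notion of possible X-frame used in this proof can be made concrete with a small Python sketch. It reads the decomposition $t_b u_1 \cdots u_l t_e$ as follows: a shift $s \in \{0,1,2\}$ is a possible frame if, after skipping $s$ nucleotides, all complete trinucleotides belong to X and fewer than 3 nucleotides are left over. This reading, the requirement of at least one trinucleotide, the listing of X and the function name are assumptions made for illustration; the test sequence is the word spelled by a path of $\mathcal {G}(X)$ that starts and ends with a dinucleotide, as in Claim (2).

```python
# Assumed listing of the code X (taken from the literature, not from this text).
X = {
    "AAC", "AAT", "ACC", "ATC", "ATT", "CAG", "CTC", "CTG", "GAA", "GAC",
    "GAG", "GAT", "GCC", "GGC", "GGT", "GTA", "GTC", "GTT", "TAC", "TTC",
}


def possible_frames(seq, code=X):
    """Shifts s in {0, 1, 2} such that seq = t_b u_1...u_l t_e with |t_b| = s,
    every u_i in code, l >= 1 and |t_e| <= 2."""
    frames = []
    for shift in range(3):
        # Complete trinucleotides read from position `shift`; the remainder is t_e.
        codons = [seq[i:i + 3] for i in range(shift, len(seq) - 2, 3)]
        if codons and all(c in code for c in codons):
            frames.append(shift)
    return frames


# Word of the path GA -> G -> GT -> A -> AT -> T -> AC -> C -> AG (14 nucleotides).
ambiguous = "GAGGTAATTACCAG"
print(possible_frames(ambiguous))        # two possible X-frames
print(possible_frames(ambiguous + "A"))  # one more nucleotide resolves the frame
```

With this listing of X, the 14-nucleotide word still admits two possible X-frames, while every extension by a single nucleotide on either side leaves exactly one, in line with the bound of 15 nucleotides stated in the abstract.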

About this article

Cite this article

Fimmel, E., Michel, C.J., Starman, M. et al. Self-complementary circular codes in coding theory. Theory Biosci. 137, 51–65 (2018). https://doi.org/10.1007/s12064-018-0259-4

Received: 11 July 2017
Accepted: 10 February 2018
Published: 12 March 2018
Issue Date: April 2018
DOI: https://doi.org/10.1007/s12064-018-0259-4
