Universal insertion grammars of size two

doi:10.1016/j.tcs.2020.09.002

Theoretical Computer Science

Volume 843, 2 December 2020, Pages 153-163

https://doi.org/10.1016/j.tcs.2020.09.002

Abstract

In this paper, we show that pure insertion grammars of size 2 (i.e., inserting two symbols in a left and right context, each consisting of two symbols) can characterize all recursively enumerable languages. This is achieved by applying either an inverse morphism and a weak coding, or a left (right) quotient with a regular $LOC(2)$ language, or an intersection with an $LOC(2)$ language and a weak coding. The obtained results improve the descriptional complexity of insertion grammars and complete the picture of known results on insertion-deletion systems, which are motivated by the area of DNA computing.

Introduction

An insertion grammar is a pure grammar (i.e., no distinction is made between non-terminal and terminal symbols) having only rules of the form $uv \to uxv$. Such grammars originated in [6]; they are inspired by Marcus contextual grammars [17], [26] used in linguistics. Another motivation for their study comes from the area of biology. As pointed out in [27], the process of mismatched annealing of DNA strands can be seen as an insertion or a deletion of a string in a specified context. A similar process happens in the case of RNA editing [2], where the uracil base U is inserted or deleted in some left context, as well as in the case of CRISPR-Cas9 technology, which uses insertion and deletion to edit the genome [28], [1]. These observations led to the intense study of insertion-deletion systems (considering insertion and deletion operations together) in the framework of DNA computing [13], [29], [11], [18], [19], [30], [31].
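To make the rule format $uv \to uxv$ concrete, the following minimal Python sketch applies a single insertion rule to a word and returns every one-step result. The names `InsertionRule` and `apply_rule` are illustrative assumptions, not notation from the paper, and symbols are assumed to be single characters.

```python
from typing import NamedTuple


class InsertionRule(NamedTuple):
    """Rule u v -> u x v: insert x between left context u and right context v."""
    left: str      # u (hypothetical field name)
    inserted: str  # x
    right: str     # v


def apply_rule(word: str, rule: InsertionRule) -> set[str]:
    """Return every word obtained from `word` by one application of `rule`."""
    results = set()
    context = rule.left + rule.right
    start = word.find(context)
    while start != -1:
        cut = start + len(rule.left)  # position between u and v
        results.add(word[:cut] + rule.inserted + word[cut:])
        start = word.find(context, start + 1)  # overlapping occurrences count too
    return results


rule = InsertionRule(left="ab", inserted="cc", right="ba")
print(sorted(apply_rule("abbaabba", rule)))
# ['abbaabccba', 'abccbaabba']
```

With empty contexts the same function inserts at every position of the word, which corresponds to the context-free case.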

There are several related models using a similar principle of insertion or deletion of a string in a specified context. Examples include guided-insertion systems [3], used to model RNA editing; leftist grammars [20], used to model accessibility problems in protection systems; restarting automata [10], used to model analysis by reduction; and the insertion operation from [8], introduced as a generalization of concatenation (and which corresponds to a context-free insertion grammar).

Since an insertion grammar is a pure grammar, an additional squeezing mechanism must be used in order to obtain a final language, as otherwise the described language class would have poor closure properties. Usually some restricted transducers are used for this purpose. Some standard examples are (1) the intersection with the free monoid over a terminal alphabet (used traditionally for Chomsky grammars), (2) projection composed with an inverse morphism, (3) left/right quotient by a regular language, (4) projection composed with intersection with a regular language. In the area of insertion grammars, variant (2) is mostly used, because (1) yields at most context-sensitive languages [27].
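Squeezing mechanism (2) can be illustrated on a finite sample of words. The Python sketch below computes $g(h^{-1}(w))$ for each sample word $w$, where $h$ is a morphism and $g$ is a weak coding (each symbol is mapped to a symbol or to the empty string). The function names, the example maps `h` and `g`, and the sample set are assumptions made for illustration; the sketch also assumes that $h$ is non-erasing so that every preimage set is finite, and that symbols are single characters.

```python
def inverse_morphism(h: dict[str, str], word: str) -> set[str]:
    """Return all u with h(u) == word, for a non-erasing morphism h."""
    if any(image == "" for image in h.values()):
        raise ValueError("this sketch assumes a non-erasing morphism")
    if word == "":
        return {""}
    preimages = set()
    for symbol, image in h.items():
        if word.startswith(image):
            # Peel off one image and recurse on the remaining suffix.
            for rest in inverse_morphism(h, word[len(image):]):
                preimages.add(symbol + rest)
    return preimages


def weak_coding(g: dict[str, str], word: str) -> str:
    """Apply a weak coding: each symbol maps to one symbol or to the empty string."""
    return "".join(g[symbol] for symbol in word)


def squeeze(h: dict[str, str], g: dict[str, str], sample: set[str]) -> set[str]:
    """Compute g(h^{-1}(w)) for every w in a finite sample of a language."""
    return {weak_coding(g, u) for w in sample for u in inverse_morphism(h, w)}


h = {"a": "xy", "b": "z", "#": "xyz"}  # hypothetical morphism
g = {"a": "a", "b": "b", "#": ""}      # hypothetical weak coding erasing '#'
print(sorted(squeeze(h, g, {"xyz", "zz", "xx"})))
# ['', 'ab', 'bb']
```

The word "xx" has no preimage under this `h`, so it contributes nothing; this filtering effect is what lets the squeezing step discard unwanted words of the pure grammar.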

The size of an insertion grammar is naturally defined by the triple $(n, m, m')$, where $n$ (resp. $m$, $m'$) is the maximal length of the inserted string (resp. left context, right context). If all parameters coincide, we just say that the grammar is of size $n$. In [27], it was shown for the first time that insertion grammars of size $(4,7,6)$ together with the squeezing mechanism (2) as described above generate all recursively enumerable languages. This result was improved in [21], where insertion grammars of size $(3,5,4)$ are shown to be sufficient for the same task. Continuing with this race, papers [23], [12] show that insertion grammars of size 3 generate all recursively enumerable languages with squeezing mechanisms (2), (3) and (4). In [24], [22] the squeezing mechanism (4) was thoroughly investigated, showing that the result above also holds when some restricted classes of regular languages are used.
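As a small illustration of the size triple, the Python sketch below computes $(n, m, m')$ from a list of rules given as $(u, x, v)$ triples. The function name `grammar_size` and the example rules are hypothetical, and symbols are assumed to be single characters.

```python
def grammar_size(rules: list[tuple[str, str, str]]) -> tuple[int, int, int]:
    """Return (n, m, m') for rules (u, x, v) meaning u v -> u x v.

    n  = maximal length of an inserted string x,
    m  = maximal length of a left context u,
    m' = maximal length of a right context v.
    """
    n = max((len(x) for _, x, _ in rules), default=0)
    m = max((len(u) for u, _, _ in rules), default=0)
    m_prime = max((len(v) for _, _, v in rules), default=0)
    return (n, m, m_prime)


example_rules = [("ab", "cd", "ef"), ("a", "c", ""), ("", "dd", "ba")]
print(grammar_size(example_rules))
# (2, 2, 2)
```

Since all three components coincide here, this example rule set would be called a grammar of size 2 in the terminology above.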

A further decrease in the size of rules was achieved only by using additional control mechanisms, such as graph control for size 2 [14], [15] and matrix control [25], [16] for size $(1,2,2)$ [4].

In this paper, we show that it is possible to generate all recursively enumerable languages using insertion grammars of size 2 with squeezing mechanisms (2), (3) and (4). This settles the question of the computational power of this model when all three parameters are the same, as insertion grammars of size $(n, 1, 1)$ are shown to be context-free [27]. This also proves a remarkable jump when combined with squeezing mechanisms as described above, because the context-free languages are closed under these operations. Hence, while insertion grammars of size 1 with squeezing mechanisms (2), (3) and (4) stay within the context-free languages, size 2 grammars jump up to all recursively enumerable languages. The proof of the main result is greatly simplified by the newly introduced notion of an independent rule set, which formalizes the conditions under which two insertion rules in a derivation are independent and can be applied in any order.

Section snippets

Definitions

We assume the reader is familiar with the standard concepts of formal language theory. We recall some of them in order to fix the notation. Given an alphabet (a finite set) $V$, let $V^{*}$ denote the set of all strings over $V$, i.e., the free monoid generated by $V$; the operator symbol ⋅ (for concatenation) is mostly omitted. For a string $x \in V^{*}$, we denote the length of $x$ by $|x|$, and the empty string is written as λ. If not explicitly stated otherwise, the notion of a morphism refers to a

Mark and migration technique

For the proof of the main result, we use a variant of the mark-and-migration technique introduced in [27], which is commonly used to obtain computational completeness results in the area of insertion systems. This technique is based on a simulation of a type-0 grammar. Since it is not possible to delete symbols in the derivation string, the main idea is to simulate the deletion of a non-terminal symbol X by adding a special marker $ to its left (in [27] two markers # and $ were used). Such a

Main results

Theorem 2

For each recursively enumerable language $L$, there exist a morphism $h$, a weak coding $g$ and a language $L_{1} \in INS(2, 2, 2)$ such that $L = g(h^{-1}(L_{1}))$.

Proof

Let $G = (N, T, P, S)$ be a type-0 grammar in SGNF. We construct the insertion grammar $G_{1} = (V, \{ø ø S \$ S\}, P_{1})$, where $V = N \cup T \cup \{K_{AB}, K_{CD}\} \cup \{X, \bar{X} \mid X \in \{\$, \nabla, B_{1}, D_{1}\}\} \cup \{ø\}$.

The set of rules $P_{1}$ is constructed as follows. Consider the following sets: $\mathcal{L} = T \cup \{A, C, ø\}$, $\mathcal{N} = N \cup \{K_{AB}, K_{CD}, B_{1}, D_{1}, \nabla\}$, $\bar{\mathcal{N}} = \{\bar{B}_{1}, \bar{D}_{1}, \bar{\nabla}\}$, $\mathcal{D} = \{\$ X \mid X \in \mathcal{N}\} \cup \{\bar{\$} \bar{Y} \mid \bar{Y} \in \bar{\mathcal{N}}\}$, $\mathcal{F} = \mathcal{L}^{2} \cup \mathcal{D}$, $\mathcal{S} = V \setminus \{\$, \bar{\$}\} = \mathcal{N} \cup \bar{\mathcal{N}} \cup T \cup \{ø\}$.

• For any rule $k : X \to bY \in P$ we add the

Conclusions

In this paper, we have shown the computational completeness of insertion grammars of size 2 enriched with different squeezing mechanisms. Since insertion grammars of size $(n, 1, 1)$ are known to be context-free [27], only the cases $(1, 2, 2)$ and $(n, m, p)$ with $1 \leq n \leq 2$, $0 \leq p \leq 1$, $m \geq 0$, as well as their symmetric variants, remain to be investigated for computational completeness.

The proof of the result was greatly simplified using the concept of an independent rule set. We have checked that the rules given in Theorem 2 are

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References (31)

F. Biegler et al., Regulated RNA rewriting: modelling RNA editing with guided insertion, Theor. Comput. Sci. (2007)
D. Haussler, Insertion languages, Inf. Sci. (1983)
L. Kari et al., On the weight of universal insertion grammars, Theor. Comput. Sci. (2008)
L. Kari et al., Contextual insertions/deletions and computability, Inf. Comput. (1996)
M. Margenstern et al., Context-free insertion-deletion systems, Theor. Comput. Sci. (2005)
I. Petre et al., Matrix insertion-deletion systems, Theor. Comput. Sci. (2012)
M. Adli, The CRISPR tool kit for genome editing and beyond, Nat. Commun. (2018)
H. Fernau et al., Universal matrix insertion grammars with small size
R. Freund et al., Graph-controlled insertion-deletion systems
B. Galiukschov, Semicontextual grammars
V. Geffert, Normal forms for phrase-structure grammars, RAIRO Theor. Inform. Appl. (1991)
S. Ivanov et al., Universality of graph-controlled leftist insertion-deletion systems with two states
P. Jančar et al., Restarting automata
L. Kari et al., At the crossroads of DNA computing and formal languages: characterizing recursively enumerable languages using insertion-deletion systems

