An Algorithm for Gene Fragment Reconstruction

Fang, Ningyuan; Wang, Kaifa; Tong, Dali

doi:10.1007/s12539-021-00419-6

Original research article
Published: 20 February 2021

Volume 13, pages 118–127, (2021)

Interdisciplinary Sciences: Computational Life Sciences


Abstract

Gene sequencing technology plays an important role in life science, disease medicine and health medicine, and has been particularly important in the fight against the 2019 novel coronavirus. Drawing a DNA restriction map is an important technique in genetic biology. The simplified partial digestion method (SPDP), a biological method, has been widely used to cut DNA molecules into fragments and obtain the biological information of each fragment. In this work, we propose an algorithm based on 0–1 planning for locating the restriction sites on a DNA molecule; it solves the DNA fragment reconstruction problem using only the fragment lengths. Two specific examples are presented in detail. Furthermore, based on 1000 groups of randomly generated original DNA sequences, we define the coincidence rate and the unique coincidence rate between the reconstructed and the original DNA sequence, and then analyze separately the effect of the number of fragments and of the maximum fragment length on both rates. The results support the effectiveness of the algorithm. In addition, starting from the solutions obtained, we simulate and discuss the influence of measurement error. It turns out that an error in the position of one restriction site does not affect the other restriction sites, whereas errors in most restriction sites may cause the sequence reconstruction to fail. Matlab 7.1 programs are used to solve for the feasible locations of the restriction sites, derive the DNA fragment sequence, and carry out the statistical and error analyses. This paper focuses on the basic computer implementation of rearrangement and sequencing rather than on biochemical technology.
Applying the mathematical idea of 0–1 planning to the construction of DNA sequence maps simplifies the calculation and accelerates the 'jigsaw' assembly of DNA fragments.


1 Introduction

It is commonly accepted that drawing a DNA restriction map is an important method for genetic and biological analysis. Because DNA has a high molecular weight and a very large number of base pairs, biochemical techniques are used to cut it into small fragments in experiments [1, 2]. Specifically, the PDP method and the simplified PDP method (SPDP), which are based on different enzyme cutting sites, have been proposed in a series of classic studies. The basic biological information of each segment is analyzed to obtain the relevant information about the whole DNA molecule [3, 4]. This sequencing method is the "shotgun method" invented by Craig Venter, founder of Celera Genomics in the United States [5].

Speed, simplicity of implementation, and low cost are regarded as the advantages of the shotgun method. However, the computational workload is large: rearranging the DNA fragments is far from easy. For large genomes, such as the human and Drosophila genomes, the improved whole-genome "shotgun method" has been extensively applied to complete the sequencing, which demonstrates its feasibility and effectiveness [6, 7].

Researchers have developed many algorithms and tools for making precise predictions from features of the target objects [8,9,10]. Recently, these algorithms have been applied extensively in genetic analysis. Genfrag, a set of tools developed by Engle and Burks, generates benchmark data sets for testing DNA sequence assembly algorithms and explores the range of data and the corresponding performance of assembly tools on "shot-gun" sequencing projects [11]. Angly et al. introduced an open-source bioinformatic tool called Grinder, which simulates amplicon and shotgun datasets from reference sequences [12]. To detect respiratory viruses in clinical specimens, Huang et al. ran four different bioinformatics algorithms to assess the performance of a metagenomic shot-gun sequencing method [13]. Based on Sanger methodology, Shityakov et al. applied a novel algorithm that correctly predicted the performance of DNA sequencing techniques and confirmed the statistical significance of the results [14]. Although many algorithms have recently been proposed to obtain the DNA fragment sequence, they are considerably complex and require much additional information besides the lengths of the DNA fragments, which may limit their scope of application.

In this paper, based on the biological information of each segment obtained by SPDP and the mathematical idea of 0–1 planning, we propose a general basic algorithm that finds the feasible solutions among all assignments of the locations of the DNA restriction sites and then restores the possible DNA sequences. We also evaluate the efficiency of the algorithm on 1000 randomly generated original DNA sequences, and we discuss the influence of measurement error in the fragment lengths on the algorithm. Because the proposed algorithm requires only the fragment lengths, it is relatively easy to apply in practice.

2 Example Design

2.1 Example 1

The first set of data is 2, 3, 7, 8, 8, 9, 13, 14.

The second set of data is 2, 1, 4, 3, 6.

2.2 Example 2

The first set of data is 1, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, 13, 14.

The second set of data is 1, 1, 2, 1, 2, 2, 1, 2, 3.

3 Problem Analysis and Tentative Ideas

We need to find the correct order of the DNA fragments represented by the second set of data, such that cutting the DNA molecule at each restriction site individually yields data consistent with the first set. From the data of example 1, this DNA molecule has four restriction enzyme cutting sites: the first group of data (2, 14, 8, 8, 7, 9, 3, 13) is obtained when the molecule is cut at a single site at a time, while the second group (1, 2, 3, 4, 6) is obtained when it is cut at all restriction sites. In the first set of data the minimum fragment length is 2, and only the 2 in the second set can correspond to it, so the corresponding cutting site must be the site closest to an endpoint; the number 3 in the first set, by contrast, can correspond either to 1 plus 2 or to 3 in the second set. Clearly, the larger a number in the first set is, the more combinations of numbers in the second set can correspond to it.

On further analysis, if every number in the first group of data can be obtained by measuring from one end to each enzyme cutting point, then the rearrangement of the second set of data is correct. The numbers in the first set come in pairs that carry the same information, such as 2 and 14 or 3 and 13; each pair corresponds to one restriction site. We divide the first set of data into two groups, namely (2, 3, 7, 8) and (14, 13, 9, 8). The group (2, 3, 7, 8) gives the shorter distance from each restriction site to an end of the DNA molecule. Therefore, we only need to analyze half of the first data set to describe each restriction site. If the fragment lengths obtained when the DNA molecule is cut at all restriction sites simultaneously are the same as the second data set (1, 2, 3, 4, 6), then the sequence is meaningful. Based on the above analysis, we use 0–1 planning to determine the cut positions and obtain the result. The specific implementation of the algorithm is as follows [15].

4 0–1 Planning Method

4.1 The Establishment of 0–1 Equation Algorithm

Suppose: the data of fragments’ length measured when the DNA molecule is cut at each restriction site separately is the first set of data:

$$A = \left[ {a_{1} {,}aa_{1} {,}a_{2} {,}aa_{2} {,}a_{3} {,}aa_{3} {,} \cdots {,}a_{n} ,aa_{n} } \right],$$

where $a_{i} ,aa_{i}$ are two data from the same cutting experiment and $a_{i} \le aa_{i}$ while $n$ is the number of restriction sites on DNA;

The data of fragments’ length measured when the DNA molecule is cut at each restriction site simultaneously is the second set of data $B = \left[ {b_{1} ,b_{2} ,b_{3} , \cdots b_{n + 1} } \right].$

The total length of the sequenced DNA molecule is $M = a_{i} + aa_{i} = b_{1} + b_{2} + b_{3} + \cdots + b_{n + 1}$.
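As a quick check of this identity with the data of example 1 (pairing each single-cut measurement with its complement, an arrangement inferred here from the listed values):

$$2 + 14 = 3 + 13 = 7 + 9 = 8 + 8 = 16 = 2 + 1 + 4 + 3 + 6.$$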

After processing, the first set of data becomes $A = \left\{ \begin{gathered} a_{1} \, a_{2} \, a_{3} \, \cdots \, a_{n} \hfill \\ aa_{1} \, aa_{2} \, aa_{3} \, \cdots \, aa_{n} \hfill \\ \end{gathered} \right\}$, where $a_{i} \le \frac{M}{2}$ and $a_{i} + aa_{i} = M,i = 1,2,...,n$. Since $aa_{i} = M - a_{i}$, either member of each pair carries the full information of the first group of data, so $A$ can be expressed as: $A = \left[ {a_{1} {, }a_{2} {, }a_{3} {,} \cdots {,}a_{n} } \right].$
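The reduction of the first data set to its shorter halves might be sketched in Python as follows. This is only an illustration: the paper's own programs are written in Matlab 7.1 and are not reproduced here, and the function name `reduce_first_set` is a hypothetical choice.

```python
from collections import Counter


def reduce_first_set(first_set, total_length):
    """Pair each single-cut measurement with its complement and keep the shorter half."""
    remaining = Counter(first_set)
    halves = []
    for value in sorted(first_set):
        if remaining[value] == 0:
            continue  # already consumed as the complement of a smaller value
        complement = total_length - value
        remaining[value] -= 1
        if remaining[complement] == 0:
            raise ValueError(f"no partner of length {complement} for fragment {value}")
        remaining[complement] -= 1
        halves.append(value)
    return halves


first_set = [2, 3, 7, 8, 8, 9, 13, 14]   # example 1, first set of data
second_set = [2, 1, 4, 3, 6]             # example 1, second set of data
M = sum(second_set)
A = reduce_first_set(first_set, M)
print(M, A)  # 16 [2, 3, 7, 8]
```

The sketch raises an error when a measurement has no complement summing to $M$, which is one simple way to notice inconsistent input before any search is attempted.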

Each of $a_{1} {, }a_{2} {, }a_{3} {,} \cdots {,}a_{n}$ is the distance from the corresponding restriction site to the nearest endpoint. As shown in Fig. 1, obviously, each restriction site $I_{i}$ is either on the half segment of DNA near the P endpoint or the half segment of DNA near the Q endpoint.

We suppose that the value of $x_{1} ,x_{2} ,x_{3} , \cdots x_{n}$ should only be 0 or 1, and generate the sequences below:

$$C = \left[ {a_{1} \cdot x_{1} {, }a_{2} \cdot x_{2} {, }a_{3} \cdot x_{3} {,} \cdots {,}a_{n} \cdot x_{n} } \right],$$

$$CC = \left[ {a_{1} \cdot (1 - x_{1} ), \, a_{2} \cdot (1 - x_{2} ), \, a_{3} \cdot (1 - x_{3} ), \cdots ,a_{n} \cdot (1 - x_{n} )} \right].$$

If $x_{i} = 1$, it means that the restriction site $I_{i}$ is on the half segment of DNA near the P endpoint. Otherwise, $I_{i}$ is on the half segment of DNA near the Q endpoint.

Sort the numbers in $C$ and $CC$ in ascending order to obtain the new sequences:

$$\begin{gathered} C = \left[ {c_{{1}} {, }c_{2} {, }c_{{3}} {,} \cdots {,}c_{n} } \right], \hfill \\ CC = \left[ {cc_{1} , \, cc_{2} , \, cc_{3} , \cdots ,cc_{n} } \right]. \hfill \\ \end{gathered}$$

where $C$ lists $a_{1} \cdot x_{1} {, }a_{2} \cdot x_{2} {, }a_{3} \cdot x_{3} {,} \cdots {,}a_{n} \cdot x_{n}$ in ascending order, and $CC$ lists $a_{1} \cdot (1 - x_{1} ), \, a_{2} \cdot (1 - x_{2} ), \, a_{3} \cdot (1 - x_{3} ), \cdots ,a_{n} \cdot (1 - x_{n} )$ in ascending order.

The length of fragments between the adjacent restriction sites (including the two ends of DNA) on the DNA can be expressed as follows.

1. The length of each segment on the half segment of DNA near the P endpoint is:

$$\left\{ \begin{aligned} i & = 1 \quad l_{i} = c_{1} \\ 1 &< i \le n \quad l_{i} = c_{i} - c_{i - 1} \, \\ \end{aligned} \right.$$

2. The length of each segment on the half segment of DNA near the Q endpoint is:

$$\left\{ \begin{aligned} i & = 1 \quad ll_{i} = cc_{1} \\ 1 & < i \le n \quad ll_{i} = cc_{i} - cc_{i - 1} \, \\ \end{aligned} \right.$$

3. The length of the middle segment is:

$$M - (c_{n} + cc_{n} )$$

A value of 0 indicates no fragment here.

Use the length of the fragments above to build a sequence:

$$S = \left[ {l_{1} , \, l_{2} , \, l_{3} , \cdots ,l_{n} ,ll_{1} , \, ll_{2} , \, ll_{3} , \cdots ,ll_{n} ,M - (c_{n} + cc_{n} )} \right].$$

Then the elements in the sequence are sorted from small to large.

$$S = \left[ {s_{1} , \, s_{2} , \, s_{3} , \cdots ,s_{2n + 1} } \right]$$

Because the point $I_{i}$ is either on the half of DNA near the P endpoint or on the half of DNA near the Q endpoint (that is, $x_{i}$ is equal to 0 or 1), there are $n$ zeros in $2n$ numbers in $C$ and $CC$. The $n$ zeros in the sequence $S$ are removed to get the sequence $SS$ representing the length of the segments between the restriction sites (including the two ends of DNA) on DNA.

$$SS = \left[ {s_{n + 1} , \, s_{n + 2} , \, s_{n + 3} , \cdots ,s_{2n + 1} } \right].$$
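The construction of $C$, $CC$, $S$ and $SS$ can be sketched in Python as below. The function name `candidate_fragments` and its argument names are assumptions made for illustration; the differences on the Q side are taken between consecutive entries of $CC$, which is what the worked numbers in the example require.

```python
def candidate_fragments(a, x, total_length):
    """Return SS, the sorted fragment lengths implied by a 0-1 assignment x.

    a            -- shorter distances a_1..a_n from each site to its nearest end
    x            -- 0/1 list; x[i] = 1 places site i on the P half, 0 on the Q half
    total_length -- M, the total length of the DNA molecule
    """
    n = len(a)
    c = sorted(ai * xi for ai, xi in zip(a, x))              # C, ascending
    cc = sorted(ai * (1 - xi) for ai, xi in zip(a, x))       # CC, ascending
    l = [c[0]] + [c[i] - c[i - 1] for i in range(1, n)]      # segments near P
    ll = [cc[0]] + [cc[i] - cc[i - 1] for i in range(1, n)]  # segments near Q
    middle = total_length - (c[-1] + cc[-1])                 # middle segment
    s = sorted(l + ll + [middle])                            # S, ascending
    return s[n:]                                             # drop the n smallest entries


print(candidate_fragments([2, 3, 7, 8], [1, 0, 0, 0], 16))  # [1, 2, 3, 4, 6]
```

If an assignment places two sites at the same position, more than $n$ zeros appear in $S$ and a zero survives in the returned list, so such an assignment cannot match a data set $B$ of positive lengths.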

Sort the elements in the sequence $B$ from small to large, and get:

$$B = \left[ {bb_{1} ,bb_{2} ,bb_{3} , \cdots bb_{n + 1} } \right]$$

where $bb_{1} ,bb_{2} ,bb_{3} , \cdots bb_{n + 1}$ is $b_{1} ,b_{2} ,b_{3} , \cdots b_{n + 1}$ arranged in ascending order.

Therefore, requiring that $SS$ and $B$ be exactly the same sequence (that is, $SS$ = $B$), we can establish the following equations:

$$\left\{ \begin{gathered} s_{n + 1} = bb_{1} \hfill \\ s_{n + 2} = bb_{2} \hfill \\ \, s_{n + 3} = bb_{3} \hfill \\ \, \cdots \cdots \hfill \\ s_{2n + 1} = bb_{n + 1} \hfill \\ \end{gathered} \right.$$

By programming in Matlab 7.1, we can get the position of each restriction site on the original DNA molecule by solving for $x_{1} ,x_{2} ,x_{3} , \cdots x_{n}$, and then calculate the sequence of the original DNA molecule. The flow chart of the proposed algorithm is shown in Fig. 2.
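The source does not show how the equations are solved, so the following is only a minimal sketch that assumes plain enumeration of all $2^{n}$ vectors $x$; it is not the authors' Matlab program. The names `ss_from_x` and `feasible_assignments` are hypothetical.

```python
from itertools import product


def ss_from_x(a, x, total_length):
    # Compact form of the C / CC / S / SS construction described in the text.
    n = len(a)
    c = sorted(ai * xi for ai, xi in zip(a, x))
    cc = sorted(ai * (1 - xi) for ai, xi in zip(a, x))
    gaps = [c[0]] + [c[i] - c[i - 1] for i in range(1, n)]
    gaps += [cc[0]] + [cc[i] - cc[i - 1] for i in range(1, n)]
    gaps.append(total_length - (c[-1] + cc[-1]))
    return sorted(gaps)[n:]


def feasible_assignments(a, b):
    """Enumerate all 2**n vectors x and keep those whose SS equals sorted B."""
    total_length = sum(b)
    target = sorted(b)
    return [list(x) for x in product((0, 1), repeat=len(a))
            if ss_from_x(a, x, total_length) == target]


solutions = feasible_assignments([2, 3, 7, 8], [2, 1, 4, 3, 6])
for x in solutions:
    print(x)
```

For the data of example 1 this enumeration keeps four vectors, printed in lexicographic order. The cost grows as $2^{n}$, so the sketch is only practical for a modest number of restriction sites.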

To clarify the algorithm, part of the solution process of example 1 is shown as in Fig. 3.

For example, we input $x = \left[ { \, \begin{array}{*{20}c} 1 & 0 & 0 & 0 \\ \end{array} } \right]$ here, and then the process of calculation is as follows.

$$M = 1 + 2 + 3 + 4 + 6 = 16$$

$$C = \left[ { \, \begin{array}{*{20}c} 2 & 0 & 0 & 0 \\ \end{array} } \right]\mathop{\longrightarrow}\limits^{{\text{sort ascending}}}C = \left[ { \, \begin{array}{*{20}c} 0 & 0 & 0 & 2 \\ \end{array} } \right]$$

$$CC = \left[ { \, \begin{array}{*{20}c} 0 & 3 & 7 & 8 \\ \end{array} } \right]\mathop{\longrightarrow}\limits^{{\text{sort ascending}}}CC = \left[ { \, \begin{array}{*{20}c} 0 & 3 & 7 & 8 \\ \end{array} } \right]$$

$$\begin{gathered} \left\{ \begin{gathered} i = 1 \quad l_{i} = c_{1} \hfill \\ 1 < i \le n \quad l_{i} = c_{i} - c_{i - 1} \, \hfill \\ \end{gathered} \right. \hfill \\ l_{1} = 0,l_{2} = 0,l_{3} = 0,l_{4} = 2. \hfill \\ \end{gathered}$$

$$\begin{gathered} \left\{ \begin{gathered} i = 1 \quad ll_{i} = cc_{1} \hfill \\ 1 < i \le n \quad ll_{i} = cc_{i} - cc_{i - 1} \quad \hfill \\ \end{gathered} \right. \hfill \\ ll_{1} = 0,ll_{2} = 3,ll_{3} = 4,ll_{4} = 1. \hfill \\ M - (c_{n} + cc_{n} ) = 16 - 2 - 8 = 6 \hfill \\ \end{gathered}$$

$$\begin{gathered} S = \left[ {l_{1} , \, l_{2} , \, l_{3} , \cdots ,l_{n} ,ll_{1} , \, ll_{2} , \, ll_{3} , \cdots ,ll_{n} ,M - (c_{n} + cc_{n} )} \right]\mathop{\longrightarrow}\limits^{{\text{sort ascending}}} \hfill \\ S = \left[ {s_{1} , \, s_{2} , \, s_{3} , \cdots ,s_{2n + 1} } \right] = \left[ {0,0,0,0,1,2,3,4,6} \right] \hfill \\ \end{gathered}$$

$$SS = \left[ {s_{n + 1} , \, s_{n + 2} , \, s_{n + 3} , \cdots ,s_{2n + 1} } \right] = \left[ {1,2,3,4,6} \right]$$

Because $SS$ is the same as $B$, $x = [1, 0, 0, 0]$ is a feasible assignment of the locations of the DNA restriction sites, and the corresponding sequence of DNA fragments, $[2{,}6,1,4,3]$, can be restored. The reconstructed sequence is exactly the same as Fig. 3a.
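The text does not spell out how the ordered sequence is restored from a feasible $x$. One way to do it, sketched here under the assumption that positions are measured from the P end, is to convert each site to an absolute position and take consecutive differences; `ordered_fragments` is a hypothetical name.

```python
def ordered_fragments(a, x, total_length):
    """Fragment lengths read from the P end to the Q end for one assignment x."""
    # A site on the P half sits a_i from P; a site on the Q half sits M - a_i from P.
    positions = sorted(ai if xi == 1 else total_length - ai for ai, xi in zip(a, x))
    cuts = [0] + positions + [total_length]
    return [right - left for left, right in zip(cuts, cuts[1:])]


print(ordered_fragments([2, 3, 7, 8], [1, 0, 0, 0], 16))  # [2, 6, 1, 4, 3]
```

Here the sites fall at positions 2, 8, 9 and 13 from P, which gives the fragment order 2, 6, 1, 4, 3.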

5 Implementation

First, we build a general 0–1 algorithm for the general situation, and then apply this algorithm to solve example 1 and example 2.

5.1 Example 1

The first set of data is 2, 3, 7, 8, 8, 9, 13, 14, then: $A = [2{,}3{,}7{,}8]$.

The second set of data is 2, 1, 4, 3, 6, then: $B = [2,1,4,3,6]$.

Total DNA length: $M = 2 + 1 + 4 + 3 + 6 = 16$.

Using Matlab 7.1 (see Appendix procedure 1), we obtain $x = \left[ \begin{gathered} \, 1 \, 0 \, 0 \, 0 \hfill \\ \, 1 \, 0 \, 0 \, 1 \hfill \\ \, 0 \, 1 \, 1 \, 0 \hfill \\ \, 0 \, 1 \, 1 \, 1 \hfill \\ \end{gathered} \right].$

The possible sequences of fragments represented by the second set of data are as follows:

$$\begin{array}{*{20}c} 2 & 6 & 1 & 4 & 3 \\ 2 & 6 & 1 & 4 & 3 \\ 3 & 4 & 1 & 6 & 2 \\ 3 & 4 & 1 & 6 & 2 \\ \end{array}$$

If the two ends P and Q are not distinguished, there is only one solution:

$$\begin{array}{*{20}l} 2 \hfill & 6 \hfill & 1 \hfill & 4 \hfill & 3 \hfill \\ \end{array}$$

(or $\begin{array}{*{20}l} 3 \hfill & 4 \hfill & 1 \hfill & 6 \hfill & 2 \hfill \\ \end{array}$).

If P and Q are distinguished, there are two solutions:

$$\begin{array}{*{20}l} 2 \hfill & 6 \hfill & 1 \hfill & 4 \hfill & 3 \hfill \\ 3 \hfill & 4 \hfill & 1 \hfill & 6 \hfill & 2 \hfill \\ \end{array}$$
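The two counting conventions can be illustrated with a short Python sketch. It assumes that "no difference between P and Q" means a sequence and its mirror image are treated as one solution; both function names are hypothetical.

```python
def distinct_up_to_reversal(sequences):
    """Keep one representative per sequence when P and Q are interchangeable."""
    seen = set()
    kept = []
    for seq in sequences:
        key = min(tuple(seq), tuple(reversed(seq)))  # canonical orientation
        if key not in seen:
            seen.add(key)
            kept.append(list(seq))
    return kept


def distinct_ordered(sequences):
    """Drop exact duplicates only, for the case where P and Q are distinguished."""
    kept = []
    for seq in sequences:
        if list(seq) not in kept:
            kept.append(list(seq))
    return kept


rows = [[2, 6, 1, 4, 3], [2, 6, 1, 4, 3], [3, 4, 1, 6, 2], [3, 4, 1, 6, 2]]
print(distinct_up_to_reversal(rows))  # [[2, 6, 1, 4, 3]]
print(distinct_ordered(rows))         # [[2, 6, 1, 4, 3], [3, 4, 1, 6, 2]]
```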

5.2 Example 2

The first set of data is 1, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, 13, 14, then:

$$A = [1{,}2{,}3{,}3{,}4{,}5{,}6{,}7]$$

The second set of data, sorted in ascending order, is 1, 1, 1, 1, 2, 2, 2, 2, 3, then: $B = [1, 1, 1, 1, 2, 2, 2,2, 3]$.

Total DNA length: $M = 1 + 1 + 1 + 1 + 2 + 2 + 2 + 2 + 3 = 15$.

Using Matlab 7.1 (see Appendix procedure 2), we obtain:

$$x = \left[ {\begin{array}{*{20}c} 0 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 & 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 0 & 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 & 1 & 0 & 1 & 1 \\ 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 & 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 & 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 & 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 & 1 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 & 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 1 & 0 & 1 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 & 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 & 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 & 1 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 & 1 & 0 & 1 & 1 \\ 1 & 1 & 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 0 & 0 & 1 & 0 & 1 \\ {\begin{array}{*{20}c} 1 \\ 1 \\ \end{array} } & {\begin{array}{*{20}c} 1 \\ 1 \\ \end{array} } & {\begin{array}{*{20}c} 0 \\ 0 \\ \end{array} } & {\begin{array}{*{20}c} 1 \\ 1 \\ \end{array} } & {\begin{array}{*{20}c} 1 \\ 0 \\ \end{array} } & {\begin{array}{*{20}c} 0 \\ 1 \\ \end{array} } & {\begin{array}{*{20}c} 1 \\ 0 \\ \end{array} } & {\begin{array}{*{20}c} 0 \\ 1 \\ \end{array} } \\ \end{array} } \right]$$

The possible sequences of fragments represented by the second set of data are as follows:

$$\begin{array}{*{20}l} 3 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 1 \hfill \\ 3 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill \\ 3 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 1 \hfill \\ 3 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill \\ 1 \hfill & 2 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 2 \hfill & 3 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 2 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 2 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 2 \hfill & 3 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 2 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 1 \hfill & 2 \hfill \\ 2 \hfill & 1 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 1 \hfill \\ 2 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 1 \hfill \\ 2 \hfill & 1 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 1 \hfill \\ 2 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 3 \hfill & 1 \hfill 
& 2 \hfill & 1 \hfill \\ 2 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 3 \hfill & 2 \hfill & 1 \hfill \\ 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 2 \hfill & 1 \hfill \\ 2 \hfill & 1 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 1 \hfill \\ 2 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 1 \hfill \\ 2 \hfill & 1 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 1 \hfill \\ 2 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill & 1 \hfill \\ 2 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 3 \hfill & 2 \hfill & 1 \hfill \\ 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 2 \hfill & 1 \hfill \\ 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 3 \hfill \\ 1 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 3 \hfill \\ 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 3 \hfill \\ \end{array}$$

If P and Q are not distinguished, there are 8 groups of solutions:

$$\begin{array}{*{20}l} 3 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 1 \hfill \\ {({\mkern 1mu} {\text{or}}{\mkern 1mu} 1} \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & {3)} \hfill \\ 3 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill \\ {({\mkern 1mu} {\text{or}}{\mkern 1mu} 1} \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & {3)} \hfill \\ 1 \hfill & 2 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill \\ {({\mkern 1mu} {\text{or}}{\mkern 1mu} 2} \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 2 \hfill & {1)} \hfill \\ 1 \hfill & 2 \hfill & 3 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 2 \hfill \\ {({\mkern 1mu} {\text{or}}{\mkern 1mu} 2} \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 3 \hfill & 2 \hfill & {1)} \hfill \\ 1 \hfill & 2 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill \\ {({\mkern 1mu} {\text{or}}{\mkern 1mu} 2} \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill & {1)} \hfill \\ 1 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 1 \hfill & 2 \hfill \\ {({\mkern 1mu} {\text{or}}{\mkern 1mu} 2} \hfill & 1 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & {1)} \hfill \\ 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 1 \hfill & 2 \hfill \\ {({\mkern 1mu} {\text{or}}{\mkern 1mu} 2} \hfill & 1 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & {1)} \hfill \\ 1 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill \\ {({\mkern 1mu} {\text{or}}{\mkern 1mu} 2} \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & {1)} \hfill \\ \end{array}$$

If P and Q are distinguished, there are 16 groups of solutions:

$$\begin{array}{*{20}l} 3 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 1 \hfill \\ 3 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill \\ 1 \hfill & 2 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 2 \hfill & 3 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 2 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 3 \hfill \\ 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 3 \hfill \\ 2 \hfill & 1 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 1 \hfill \\ 2 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 1 \hfill \\ 2 \hfill & 1 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 1 \hfill \\ 2 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill & 1 \hfill \\ 2 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 3 \hfill & 2 \hfill & 1 \hfill \\ 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 2 \hfill & 1 \hfill \\ \end{array}$$

To further illustrate the practical significance of our solution, one solution is taken from the results of example 2 as an instance.

$$\begin{array}{*{20}l} 1 \hfill & 2 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill \\ \end{array}$$

The solution above reconstructs a DNA sequence as in Fig. 4.

6 Error Analysis

Among the possible sources of error, we regard the measurement of fragment lengths as the main one. Assuming that there is no error in the total length of the DNA, the sum of the two values obtained when the DNA molecule is cut at any single restriction site and the sum of all values obtained when it is cut at all restriction sites simultaneously are both equal to the total length of the DNA. Following the problem analysis, we briefly discuss the impact of the error on the results in two cases:

1. When the same error occurs in the measurement of fragments in the first set of data and corresponding fragments in the second set of data, the data change is equivalent to the data change caused by the change of the position of the corresponding restriction site in DNA. At this time, the result of the reconstruction of the restriction map will show the change of those restriction sites corresponding to the error data in DNA molecule. The determination of other restriction sites will not be affected. For example:

Suppose the real data of example 1 is:

The first set of data: 2, 14, 8, 8, 9, 7, 13, 3.

The second set of data: 2, 1, 4, 3, 6.

Assume that the data obtained due to the measurement error are:

The first set of data: 2, 14, 7, 9, 9, 7, 13, 3.

The second set of data: 2, 2, 4, 3, 5.

The $x$ of the error data is solved by Matlab 7.1 program (see Appendix program 3):

$$x = \, \left[ {\begin{array}{*{20}c} 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ \end{array} } \right]$$

The possible sequences of fragments represented by the second set of data are as follows:

$$\begin{array}{*{20}l} 2 \hfill & 5 \hfill & 2 \hfill & 4 \hfill & 3 \hfill \\ 2 \hfill & 5 \hfill & 2 \hfill & 4 \hfill & 3 \hfill \\ 3 \hfill & 4 \hfill & 2 \hfill & 5 \hfill & 2 \hfill \\ 3 \hfill & 4 \hfill & 2 \hfill & 5 \hfill & 2 \hfill \\ \end{array}$$

If P and Q are not distinguished, the final result can be expressed as follows:

$$\begin{array}{*{20}l} 2 \hfill & 5 \hfill & 2 \hfill & 4 \hfill & 3 \hfill \\ \end{array}$$

The results from real data are as follows:

$$\begin{array}{*{20}l} 2 \hfill & 6 \hfill & 1 \hfill & 4 \hfill & 3 \hfill \\ \end{array}$$

It can be seen that the data error only changes the position of the restriction site that produced the error, and has no effect on the reconstructed positions of the other restriction sites.

2. When the error appears only in the first set or only in the second set of data, or when the two sets contain unrelated errors, several outcomes are possible: most of the restriction sites change, so that the calculated result must be regarded as invalid, or the reconstruction cannot be carried out at all. For example, suppose the real data of example 1 are:

The first set of data: 2, 14, 8, 8, 9, 7, 13, 3.

The second set of data: 2, 1, 4, 3, 6.

Assume that the data obtained due to the measurement error are:

The first set of data: 2, 14, 7, 9, 9, 7, 13, 3.

The second set of data: 2, 1, 4, 3, 6.

Solving for $x$ with the erroneous data in Matlab 7.1 (see Appendix program 4) gives $x = \emptyset$.

The reconstruction cannot proceed due to the error.
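Both error cases can be replayed with a compact Python sketch. It uses an enumeration over absolute site positions, which is assumed here to be equivalent to the $SS = B$ test of Section 4; it is not the Appendix program, and `reconstruct` is a hypothetical name.

```python
from itertools import product


def reconstruct(a, b):
    """Return the distinct P-to-Q fragment orders consistent with A and B."""
    total_length = sum(b)
    target = sorted(b)
    found = []
    for x in product((0, 1), repeat=len(a)):
        # x_i = 1 puts site i at distance a_i from P, x_i = 0 at distance M - a_i.
        positions = sorted(ai if xi else total_length - ai for ai, xi in zip(a, x))
        cuts = [0] + positions + [total_length]
        frags = [right - left for left, right in zip(cuts, cuts[1:])]
        if sorted(frags) == target and frags not in found:
            found.append(frags)
    return found


a_err = [2, 3, 7, 7]  # shorter halves of the erroneous first set 2, 14, 7, 9, 9, 7, 13, 3
case_1 = reconstruct(a_err, [2, 2, 4, 3, 5])  # matching error in both sets
case_2 = reconstruct(a_err, [2, 1, 4, 3, 6])  # error in the first set only
print(case_1)  # [[3, 4, 2, 5, 2], [2, 5, 2, 4, 3]]
print(case_2)  # []
```

In case 1 only the fragments adjacent to the shifted site change (6, 1 becomes 5, 2), while in case 2 no assignment reproduces the second data set, so the result is empty.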

7 Evaluation of the Model

The randi function of Matlab 7.1 is used to randomly generate 1000 groups of original DNA sequences. We then try to reconstruct each original DNA sequence from the fragment lengths produced by SPDP, using the 0–1 algorithm.

Two measures of agreement between the reconstructed DNA sequence and the original DNA sequence are defined: the coincidence rate (the original sequence is among the reconstructed solutions, whether there are multiple solutions or a unique one) and the unique coincidence rate (the reconstructed solution is unique and identical to the original DNA sequence).
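One possible reading of these two definitions is sketched below in Python. It rests on two assumptions that the source does not state: a sequence and its mirror image are counted as the same solution, and each rate is the fraction of trials that satisfy the corresponding condition. The function names and the two-trial input are illustrative only.

```python
def coincidence(original, solutions):
    """Classify one trial as (coincides, uniquely_coincides).

    Assumption: a solution and its mirror image count as the same sequence,
    i.e. the two ends P and Q are not distinguished.
    """
    def canon(seq):
        return min(tuple(seq), tuple(reversed(seq)))

    distinct = {canon(s) for s in solutions}
    coincides = canon(original) in distinct
    unique = coincides and len(distinct) == 1
    return coincides, unique


def rates(trials):
    """trials: list of (original, solutions) pairs -> (coincidence rate, unique rate)."""
    flags = [coincidence(original, solutions) for original, solutions in trials]
    n = len(flags)
    return sum(c for c, _ in flags) / n, sum(u for _, u in flags) / n


trials = [
    # example 1: the two candidates are mirror images, so the solution is unique
    ([2, 6, 1, 4, 3], [[2, 6, 1, 4, 3], [3, 4, 1, 6, 2]]),
    # two of the candidates listed for example 2: the original is found but not uniquely
    ([1, 2, 2, 3, 1, 2, 1, 1, 2],
     [[1, 2, 2, 3, 1, 2, 1, 1, 2], [3, 1, 2, 2, 2, 2, 1, 1, 1]]),
]
print(rates(trials))  # (1.0, 0.5)
```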

First, 1000 sets of DNA sequences are randomly generated, where the second set of data consists of random numbers (DNA fragment lengths) between 1 and 30. The effect of the number of DNA fragments on the coincidence rate and the unique coincidence rate is studied.

It can be seen from Fig. 5 that the curve of the coincidence rate is above 90%, and the curve of the unique coincidence rate is above 80%. In particular, when the number of fragments is greater than 6, the coincidence rate reaches 100%. As the number of fragments increases, however, the unique coincidence rate decreases (that is, multiple solutions appear more often).

Second, 1000 sets of DNA sequences are randomly generated, where the number of DNA fragments in the second set of data is 5 and the length of each fragment is a random number between 1 and M (in this experiment M denotes the maximum length of a DNA fragment, not the total length of the DNA molecule). We study the effect of the magnitude of M on the coincidence rate and the unique coincidence rate.

It can be seen from Fig. 6 that the coincidence rate between the DNA sequence calculated by this algorithm and the original DNA sequence is above 98%, and the unique coincidence rate is above 80%. As the maximum length of DNA fragments becomes larger, the unique coincidence rate increases.

As shown in Figs. 5 and 6, the high coincidence rate and unique coincidence rate are observed, validating the effectiveness of the proposed algorithm.

8 Conclusions and Remarks

Our data and analysis support the following advantages of the algorithm: (1) it makes full use of the search method and of 0–1 planning, and optimizes the arrangement of the DNA fragments to find the most satisfactory solution; (2) in terms of operation, it reduces the difficulty of manual combination and pure mathematical reasoning, and provides a relatively fast and accurate method for high-throughput, large-scale DNA sequencing; (3) it simplifies the variables of the data, approaches the length of each segment step by step, arranges the lengths in ascending order, and finally uses the sorted results to set up equations with the related data.

However, the algorithm addresses a problem with a biological background, for which the objective answer is unique: the original DNA has exactly one arrangement. Under the given conditions, when multiple groups of solutions exist, we are not able to determine which group is exactly the sequence of the original DNA. For example, example 2 contains fragments of the same length. The algorithm starts from the lengths of the DNA fragments only, so it inevitably ignores the possibility that DNA fragments of the same length represent different sequences, and in this respect the solution is one-sided.
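The ambiguity can be illustrated with a short brute-force check. The fragment lengths below are a hypothetical example chosen for illustration (they are not the paper's example 2), and the check simply tries every ordering of the fragments instead of using 0–1 planning. It lists every ordering that produces the same single-cut data as the original, so any ordering other than the original and its mirror image is a genuinely different solution that length data alone cannot rule out.

```python
from itertools import permutations


def single_cut_lengths(fragments):
    """Lengths of the two pieces obtained by cutting at each site separately."""
    total = sum(fragments)
    lengths, pos = [], 0
    for f in fragments[:-1]:
        pos += f
        lengths += [pos, total - pos]
    return sorted(lengths)


# Hypothetical original arrangement of fragment lengths (total length 9).
original = (1, 2, 2, 1, 3)
A = single_cut_lengths(original)

# Every ordering of the same fragments that reproduces the same data set A.
consistent = sorted({p for p in permutations(original)
                     if single_cut_lengths(p) == A})

for arrangement in consistent:
    print(arrangement)
# (1, 2, 1, 2, 3)
# (1, 2, 2, 1, 3)
# (3, 1, 2, 2, 1)
# (3, 2, 1, 2, 1)
```

Here (1, 2, 1, 2, 3) is neither the original nor its reversal, yet it matches both sets of length data. The ambiguity arises because two sites lie at mirror-image positions (3 and 6 on a molecule of length 9).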

Our algorithm can be used not only to analyze genetic samples and DNA sequencing data and to integrate the biological information of each segment, but can also be extended to related life science fields such as synthetic biology. In addition, if biological knowledge is integrated and the various kinds of variation are considered, for instance insertion, deletion and replacement of base pairs under experimental conditions, the algorithm will have broader application prospects and solve more practical problems in biological genetic analysis.

Abbreviations

$n$ :: The number of restriction sites on a certain DNA molecule
$A$ :: The first set of data, consisting of the lengths of all fragments obtained when the DNA molecule is cut into two fragments at each restriction site separately. The data size of $A$ is $2n$
$B$ :: The second set of data, consisting of the lengths of all fragments obtained when the DNA molecule is cut at all restriction sites simultaneously. The data size of $B$ is $n + 1$
$M$ :: The total length of the DNA molecule, equal to the sum of the elements in $B$
$I_{i}$ :: The $i$-th restriction site in the DNA molecule
$x = [x_{1} ,x_{2} ,x_{3} , \cdots ,x_{n} ]$ :: A 0–1 vector describing the locations of the DNA restriction sites. Each $x_{i}$ can only be 0 or 1, where 1 means that $I_{i}$ is closer to one end of the DNA molecule and 0 means that $I_{i}$ is closer to the other end
${\text{P}}$ and ${\text{Q}}$ :: The two ends of the DNA molecule
$a_{i}$ :: The shorter of the two distances between $I_{i}$ and the ends of the DNA molecule; $aa_{i}$ is the longer distance. By definition, $a_{i} \le aa_{i}$ and $a_{i} + aa_{i} = M$
$C$ :: The set of distances from each $I_{i}$ to one end of the DNA molecule (for example, the ${\text{P}}$ end) in ascending order; $CC$ is the set of distances from each $I_{i}$ to the other end (correspondingly, the ${\text{Q}}$ end) in ascending order. The distance is taken to be 0 if $I_{i}$ is farther from that end
$l_{i}$ :: The length of each fragment on the half of the DNA molecule near one end (for example, the ${\text{P}}$ end); $ll_{i}$ is the length of each fragment on the half of the DNA molecule near the other end (for example, the ${\text{Q}}$ end)
$S$ :: The collection of the lengths of all fragments obtained in the analysis when the DNA molecule is cut by the enzyme at all restriction sites simultaneously. $S$ is derived from $A$ and $x_{1} ,x_{2} ,x_{3} , \cdots ,x_{n}$. The data size of $S$ is $2n + 1$, of which $n$ elements are zeros
$SS$ :: The nonzero elements of the collection $S$. The data size of $SS$ is $n + 1$
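The relationship between these symbols can be sketched in a few lines. This is one possible reading of the definitions above, not the authors' implementation: the function name is hypothetical, $x_{i} = 1$ is assumed to mean "closer to the P end", and the way $S$ is assembled (fragments on the P half, then the middle fragment, then fragments on the Q half) is an assumption chosen so that $S$ has $2n + 1$ elements with $n$ zeros.

```python
def fragments_from_assignment(a, x, total):
    """Derive C, CC, S and SS from the shorter distances a_i and a 0-1 vector x.

    a[i]  : shorter distance from site i to an end of the molecule
    x[i]  : 1 if site i is assumed closer to end P, 0 if closer to end Q
    total : total length M of the molecule (requires at least one site)
    """
    # C / CC: ascending distances to P / Q, with 0 for sites nearer the other end.
    C = sorted(ai if xi == 1 else 0 for ai, xi in zip(a, x))
    CC = sorted(ai if xi == 0 else 0 for ai, xi in zip(a, x))

    # l / ll: fragment lengths on each half, as consecutive differences.
    l = [c - prev for prev, c in zip([0] + C[:-1], C)]
    ll = [c - prev for prev, c in zip([0] + CC[:-1], CC)]

    # The fragment spanning the middle, between the innermost P-side and Q-side sites.
    middle = total - C[-1] - CC[-1]

    # S lists lengths from P to Q; the Q-side list is reversed to read toward Q.
    S = l + [middle] + ll[::-1]
    SS = [s for s in S if s != 0]
    return C, CC, S, SS


# Hypothetical molecule of length 9 with sites at 1, 3, 5 and 6 from end P.
C, CC, S, SS = fragments_from_assignment(a=[1, 3, 4, 3], x=[1, 1, 0, 0], total=9)
print(C, CC)   # [0, 0, 1, 3] [0, 0, 3, 4]
print(S)       # [0, 0, 1, 2, 2, 1, 3, 0, 0]
print(SS)      # [1, 2, 2, 1, 3]
```

A candidate vector $x$ is feasible when the sorted $SS$ equals the sorted $B$, so a search over the $2^{n}$ possible 0–1 vectors can compare each resulting $SS$ with $B$.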

Acknowledgements

This work is supported by the National Natural Science Fund of P. R. China (Grant No. 11771448). We thank Prof. Junjie Ren (School of Sciences, Southwest Petroleum University, Chengdu 610500, Sichuan, China) for his help in revising the English grammar. We thank Dr. Qiang Pan and Dr. Fan Jiang (Army Medical University) for their help in manuscript design.

Author information

Authors and Affiliations

School of Sciences, Southwest Petroleum University, Chengdu, 610500, Sichuan, People’s Republic of China
Ningyuan Fang
School of Mathematics and Statistics, Southwest University, Chongqing, 400715, People’s Republic of China
Kaifa Wang
Department of Urology, Daping Hospital, Army Medical University, Chongqing, 400042, People’s Republic of China
Dali Tong

Contributions

DT and KW designed the study. NF collected and processed the data. DT, KW and NF analyzed the results. DT, KW and NF wrote and revised the manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Kaifa Wang or Dali Tong.

Ethics declarations

Conflicts of interest

The authors declare that there is no conflict of interest (financial and non-financial).

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file 1 (DOC 55 KB)

About this article

Cite this article

Fang, N., Wang, K. & Tong, D. An Algorithm for Gene Fragment Reconstruction. Interdiscip Sci Comput Life Sci 13, 118–127 (2021). https://doi.org/10.1007/s12539-021-00419-6

Received: 05 July 2020
Revised: 21 January 2021
Accepted: 01 February 2021
Published: 20 February 2021
Issue Date: March 2021
DOI: https://doi.org/10.1007/s12539-021-00419-6
