1 Introduction

It is commonly accepted that drawing a DNA restriction map is an extremely significant method for genetic and biological analysis. Because DNA has a high molecular weight and a very large number of base pairs, biochemical techniques are used to cut it into small fragments in scientific experiments [1, 2]. Specifically, the PDP method and the simplified PDP method (SPDP), which are based on different enzyme cutting sites, have been proposed in a series of classic studies. The basic biological information of each segment is analyzed to obtain the relevant information of the whole DNA molecule [3, 4]. This sequencing method is the "shotgun method" invented by Craig Venter, founder of Celera Genomics in the United States [5].

Speed, simplicity of implementation, and low cost are regarded as the advantages of the shotgun method. However, the workload is large, and rearranging the DNA fragments is computationally difficult. In sequencing large genomes, such as the human and Drosophila genomes, the improved whole-genome "shotgun method" has been applied extensively, which demonstrates its feasibility and effectiveness [6, 7].

Researchers have developed many algorithms and tools that predict precise results from features of the target objects [8,9,10], and these algorithms have recently been applied extensively in genetic analysis. Engle and Burks developed Genfrag, a set of tools that generates benchmark data sets for testing DNA sequence assembly algorithms and explores the range of data and the corresponding performance of assembly tools on shotgun sequencing projects [11]. Angly et al. introduced Grinder, an open-source bioinformatic tool that simulates amplicon and shotgun datasets from reference sequences [12]. Huang et al. ran four different bioinformatics algorithms to assess the performance of a metagenomic shotgun sequencing method for detecting respiratory viruses in clinical specimens [13]. Shityakov et al. applied a novel algorithm based on Sanger methodology, which correctly predicted and stressed the performance of DNA sequencing techniques and confirmed the statistical significance of the results [14]. Although many algorithms have recently been proposed to obtain the DNA fragment sequence, they are considerably complex and require much additional information beyond the lengths of DNA fragments, which may limit their scope of application.

In this paper, based on the biological information of each segment obtained by SPDP and the mathematical idea of 0–1 planning, we propose a general basic algorithm that finds the feasible solutions among all permutations of the locations of DNA restriction sites and then restores the possible DNA sequences. We also evaluate the efficiency of the algorithm on 1000 randomly generated sets of original DNA sequences, and we discuss the influence of measurement error in fragment lengths on the algorithm. Because the proposed algorithm needs only fragment-length data, it is relatively easy to apply in practice.

2 Example Design

2.1 Example 1

The first set of data is 2, 3, 7, 8, 8, 9, 13, 14.

The second set of data is 2, 1, 4, 3, 6.

2.2 Example 2

The first set of data is 1, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, 13, 14.

The second set of data is 1, 1, 2, 1, 2, 2, 1, 2, 3.

3 Problem Analysis and Tentative Ideas

We need to find the correct order of the DNA fragments represented by the second set of data, so that when the DNA molecule is cut at each restriction site separately, the data obtained are consistent with the first set of data. In example 1 there are four restriction enzyme cutting sites in the DNA molecule: the first group of data (2, 14, 8, 8, 7, 9, 3, 13) is obtained when the DNA molecule is cut at a single site at a time, while the second group of data (1, 2, 3, 4, 6) is obtained after the DNA molecule is cut at all the restriction enzyme cutting sites. In the first set of data the minimum fragment length is 2, and only the 2 in the second set of data can correspond to it, so the corresponding cutting site must be the site closest to an endpoint; the number 3 in the first set of data can correspond either to 1 plus 2 or to 3 in the second set of data. Clearly, the larger a number in the first set of data is, the more combinations in the second set of data can correspond to it.

On further analysis, if each number in the first group of data can be obtained by measuring from one end to the corresponding cutting point, then the rearrangement of the second set of data is correct. The numbers in the first set of data come in pairs that carry the same information, such as 2 and 14 or 3 and 13; each pair represents one restriction site. We divide the first set of data into two groups, namely (2, 3, 7, 8) and (14, 13, 9, 8). The group (2, 3, 7, 8) alone gives the shorter distance from each restriction site to the two ends of the DNA molecule. Therefore, we only need to analyze half of the first set of data to describe each restriction site. If the fragment lengths obtained when the DNA molecule is cut at all restriction sites simultaneously are the same as the second set of data (1, 2, 3, 4, 6), then the sequence is meaningful. Based on the above analysis, we use 0–1 planning to calibrate the positions of the cutting points and finally obtain the result. The specific implementation of the algorithm is as follows [15].

4 0–1 Planning Method

4.1 The Establishment of 0–1 Equation Algorithm

Suppose that the fragment lengths measured when the DNA molecule is cut at each restriction site separately form the first set of data:

$$A = \left[ {a_{1} {,}aa_{1} {,}a_{2} {,}aa_{2} {,}a_{3} {,}aa_{3} {,} \cdots {,}a_{n} ,aa_{n} } \right],$$

where \(a_{i} ,aa_{i}\) are two data from the same cutting experiment and \(a_{i} \le aa_{i}\) while \(n\) is the number of restriction sites on DNA;

The data of fragments’ length measured when the DNA molecule is cut at each restriction site simultaneously is the second set of data \(B = \left[ {b_{1} ,b_{2} ,b_{3} , \cdots b_{n + 1} } \right].\)

The total length of the sequenced DNA molecule is \(M = a_{i} + aa_{i} = b_{1} + b_{2} + b_{3} + \cdots + b_{n + 1}\).

After processing, the first set of data becomes \(A = \left\{ \begin{gathered} a_{1} \, a_{2} \, a_{3} \, \cdots \, a_{n} \hfill \\ aa_{1} \, aa_{2} \, aa_{3} \, \cdots \, aa_{n} \hfill \\ \end{gathered} \right\}\), where \(a_{i} \le \frac{M}{2}\) and \(a_{i} + aa_{i} = M,i = 1,2,...,n\). Therefore, either one of \(a_{i}\) and \(aa_{i}\) carries the full information of the pair, and \(A\) can be expressed as: \(A = \left[ {a_{1} {, }a_{2} {, }a_{3} {,} \cdots {,}a_{n} } \right].\)
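The pairing step described above can be sketched in Python. The paper's own programs are written in Matlab 7.1 and are not reproduced here, so the function name `reduce_single_cut_data` is hypothetical. The sketch assumes integer lengths and error-free data, so that the smallest and largest values of the first set sum to \(M\).

```python
from collections import Counter

def reduce_single_cut_data(first_set):
    """Pair each single-cut fragment with its complement and keep the shorter one.

    Returns (M, A), where A lists a_i <= M/2 in ascending order.
    Assumes error-free data: every length v has a partner M - v.
    """
    if len(first_set) % 2 != 0:
        raise ValueError("expected two fragment lengths per restriction site")
    values = sorted(first_set)
    total = values[0] + values[-1]        # the shortest fragment pairs with the longest
    remaining = Counter(values)
    shorter = []
    for v in values:
        if remaining[v] == 0:
            continue                      # already consumed as another value's partner
        remaining[v] -= 1
        partner = total - v
        if remaining[partner] == 0:
            raise ValueError(f"no partner of length {partner} for fragment {v}")
        remaining[partner] -= 1
        shorter.append(v)                 # values are ascending, so v <= partner
    return total, shorter

M, A = reduce_single_cut_data([2, 14, 8, 8, 7, 9, 3, 13])
print(M, A)  # 16 [2, 3, 7, 8]
```

With measured (noisy) data the exact-partner test would need a tolerance, which this sketch does not attempt.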

Each of \(a_{1} {, }a_{2} {, }a_{3} {,} \cdots {,}a_{n}\) is the distance from the corresponding restriction site to the nearest endpoint. As shown in Fig. 1, obviously, each restriction site \(I_{i}\) is either on the half segment of DNA near the P endpoint or the half segment of DNA near the Q endpoint.

Fig. 1 Each restriction site \(I_{i}\) on the DNA molecule and the corresponding \(a_{i}\)

We suppose that the value of \(x_{1} ,x_{2} ,x_{3} , \cdots x_{n}\) should only be 0 or 1, and generate the sequences below:

$$C = \left[ {a_{1} \cdot x_{1} , \, a_{2} \cdot x_{2} , \, a_{3} \cdot x_{3} , \cdots ,a_{n} \cdot x_{n} } \right],$$
$$CC = \left[ {a_{1} \cdot (1 - x_{1} ), \, a_{2} \cdot (1 - x_{2} ), \, a_{3} \cdot (1 - x_{3} ), \cdots ,a_{n} \cdot (1 - x_{n} )} \right].$$

If \(x_{i} = 1\), it means that the restriction site \(I_{i}\) is on the half segment of DNA near the P endpoint. Otherwise, \(I_{i}\) is on the half segment of DNA near the Q endpoint.

Sort the numbers in \(C\) and \(CC\) in ascending order to get the new sequences

$$\begin{gathered} C = \left[ {c_{{1}} {, }c_{2} {, }c_{{3}} {,} \cdots {,}c_{n} } \right], \hfill \\ CC = \left[ {cc_{1} , \, cc_{2} , \, cc_{3} , \cdots ,cc_{n} } \right]. \hfill \\ \end{gathered}$$

where \(C\) is \(a_{1} \cdot x_{1} , \, a_{2} \cdot x_{2} , \, a_{3} \cdot x_{3} , \cdots ,a_{n} \cdot x_{n}\) in ascending order, and \(CC\) is \(a_{1} \cdot (1 - x_{1} ), \, a_{2} \cdot (1 - x_{2} ), \, a_{3} \cdot (1 - x_{3} ), \cdots ,a_{n} \cdot (1 - x_{n} )\) in ascending order.

The length of fragments between the adjacent restriction sites (including the two ends of DNA) on the DNA can be expressed as follows.

1. The length of each segment on the half segment of DNA near the P endpoint is:

$$\left\{ \begin{aligned} i & = 1 \quad l_{i} = c_{1} \\ 1 &< i \le n \quad l_{i} = c_{i} - c_{i - 1} \, \\ \end{aligned} \right.$$

2. The length of each segment on the half segment DNA near the Q endpoint is:

$$\left\{ \begin{aligned} i & = 1 \quad ll_{i} = cc_{1} \\ 1 & < i \le n \quad ll_{i} = cc_{i} - cc_{i - 1} \, \\ \end{aligned} \right.$$

3. The length of the middle segment is:

$$M - (c_{n} + cc_{n} )$$

A value of 0 indicates no fragment here.

Use the length of the fragments above to build a sequence:

$$S = \left[ {l_{1} , \, l_{2} , \, l_{3} , \cdots ,l_{n} ,ll_{1} , \, ll_{2} , \, ll_{3} , \cdots ,ll_{n} ,M - (c_{n} + cc_{n} )} \right].$$

Then the elements in the sequence are sorted from small to large.

$$S = \left[ {s_{1} , \, s_{2} , \, s_{3} , \cdots ,s_{2n + 1} } \right]$$

Because the point \(I_{i}\) is either on the half of DNA near the P endpoint or on the half of DNA near the Q endpoint (that is, \(x_{i}\) is equal to 0 or 1), there are \(n\) zeros in \(2n\) numbers in \(C\) and \(CC\). The \(n\) zeros in the sequence \(S\) are removed to get the sequence \(SS\) representing the length of the segments between the restriction sites (including the two ends of DNA) on DNA.

$$SS = \left[ {s_{n + 1} , \, s_{n + 2} , \, s_{n + 3} , \cdots ,s_{2n + 1} } \right].$$

Sort the elements in the sequence \(B\) from small to large, and get:

$$B = \left[ {bb_{1} ,bb_{2} ,bb_{3} , \cdots bb_{n + 1} } \right]$$

Here \(bb_{1} ,bb_{2} ,bb_{3} , \cdots bb_{n + 1}\) is \(b_{1} ,b_{2} ,b_{3} , \cdots b_{n + 1}\) in ascending order.

Therefore, requiring that \(SS\) and \(B\) be exactly the same sequence (that is, \(SS\) = \(B\)), we can establish the following equations:

$$\left\{ \begin{gathered} s_{n + 1} = bb_{1} \hfill \\ s_{n + 2} = bb_{2} \hfill \\ \, s_{n + 3} = bb_{3} \hfill \\ \, \cdots \cdots \hfill \\ s_{2n + 1} = bb_{n + 1} \hfill \\ \end{gathered} \right.$$

By programming in Matlab 7.1, we can obtain the position of each restriction site on the original DNA molecule by solving for \(x_{1} ,x_{2} ,x_{3} , \cdots x_{n}\), and then calculate the sequence of the original DNA molecule. The flow chart of the proposed algorithm is shown in Fig. 2.

Fig. 2 Flow chart of the proposed algorithm
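The construction of \(C\), \(CC\), \(S\) and \(SS\) can be sketched as an exhaustive search over all \(2^{n}\) assignments of \(x\). This is a Python illustration under stated assumptions, not the authors' Matlab program: the function names are hypothetical, the lengths are taken to be exact integers, and every \(a_{i}\) is positive.

```python
from itertools import product

def candidate_fragments(A, x, M):
    """Build SS for one 0-1 assignment x, following the C / CC construction."""
    n = len(A)
    C = sorted(a * xi for a, xi in zip(A, x))            # sites placed on the P half
    CC = sorted(a * (1 - xi) for a, xi in zip(A, x))     # sites placed on the Q half
    l = [C[0]] + [C[i] - C[i - 1] for i in range(1, n)]
    ll = [CC[0]] + [CC[i] - CC[i - 1] for i in range(1, n)]
    middle = M - (C[-1] + CC[-1])
    S = sorted(l + ll + [middle])
    return S[n:]                                         # drop the n smallest entries (the padding zeros)

def feasible_assignments(A, B):
    """Return every x in {0,1}^n whose SS equals the sorted second data set."""
    M = sum(B)
    target = sorted(B)
    return [x for x in product((0, 1), repeat=len(A))
            if candidate_fragments(A, x, M) == target]

solutions = feasible_assignments([2, 3, 7, 8], [2, 1, 4, 3, 6])
print(solutions)
# [(0, 1, 1, 0), (0, 1, 1, 1), (1, 0, 0, 0), (1, 0, 0, 1)]
```

If two sites with equal \(a_{i}\) are placed on the same half, an extra zero survives in the returned list, so the comparison with the strictly positive \(B\) fails and the assignment is rejected. The search cost grows as \(2^{n}\), which is acceptable for the small examples here.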

To clarify the algorithm, part of the solution process of example 1 is shown in Fig. 3.

Fig. 3 a An instance of the original DNA sequence (that is, the true sequence that we try to reconstruct). b The first set of data \(A = [2{,}3{,}7,8]\). c The second set of data \(B = [1,2{,}3{,}4,6]\)

For example, we input \(x = \left[ { \, \begin{array}{*{20}c} 1 & 0 & 0 & 0 \\ \end{array} } \right]\) here, and then the process of calculation is as follows.

$$M = 1 + 2 + 3 + 4 + 6 = 16$$
$$C = \left[ { \, \begin{array}{*{20}c} 2 & 0 & 0 & 0 \\ \end{array} } \right]\mathop{\longrightarrow}\limits^{{\text{sort ascending}}}C = \left[ { \, \begin{array}{*{20}c} 0 & 0 & 0 & 2 \\ \end{array} } \right]$$
$$CC = \left[ { \, \begin{array}{*{20}c} 0 & 3 & 7 & 8 \\ \end{array} } \right]\mathop{\longrightarrow}\limits^{{\text{sort ascending}}}CC = \left[ { \, \begin{array}{*{20}c} 0 & 3 & 7 & 8 \\ \end{array} } \right]$$
$$\begin{gathered} \left\{ \begin{gathered} i = 1 \quad l_{i} = c_{1} \hfill \\ 1 < i \le n \quad l_{i} = c_{i} - c_{i - 1} \, \hfill \\ \end{gathered} \right. \hfill \\ l_{1} = 0,l_{2} = 0,l_{3} = 0,l_{4} = 2. \hfill \\ \end{gathered}$$
$$\begin{gathered} \left\{ \begin{gathered} i = 1 \quad ll_{i} = cc_{1} \hfill \\ 1 < i \le n \quad ll_{i} = cc_{i} - cc_{i - 1} \quad \hfill \\ \end{gathered} \right. \hfill \\ ll_{1} = 0,ll_{2} = 3,ll_{3} = 4,ll_{4} = 1. \hfill \\ M - (c_{n} + cc_{n} ) = 16 - 2 - 8 = 6 \hfill \\ \end{gathered}$$
$$\begin{gathered} S = \left[ {l_{1} , \, l_{2} , \, l_{3} , \cdots ,l_{n} ,ll_{1} , \, ll_{2} , \, ll_{3} , \cdots ,ll_{n} ,M - (c_{n} + cc_{n} )} \right]\mathop{\longrightarrow}\limits^{{\text{sort ascending}}} \hfill \\ S = \left[ {s_{1} , \, s_{2} , \, s_{3} , \cdots ,s_{2n + 1} } \right] = \left[ {0,0,0,0,1,2,3,4,6} \right] \hfill \\ \end{gathered}$$
$$SS = \left[ {s_{n + 1} , \, s_{n + 2} , \, s_{n + 3} , \cdots ,s_{2n + 1} } \right] = \left[ {1,2,3,4,6} \right]$$

Because \(SS\) is the same as \(B\), this \(x_{1} ,x_{2} ,x_{3} , \cdots x_{n}\) is a possible assignment of the locations of the DNA restriction sites, and a feasible order of the DNA fragments, \([2{,}6,1,4,3]\), can be restored: the P half contributes the fragment 2, the middle segment is 6, and the Q half, read towards Q, contributes 1, 4 and 3. The reconstructed sequence is exactly the same as that in Fig. 3a.
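The restoration step can be sketched as follows, again in Python with a hypothetical function name. It re-measures every site from the P endpoint (a site on the Q half at distance \(a_{i}\) from Q lies at \(M - a_{i}\) from P) and takes differences of consecutive cut positions.

```python
def restore_order(A, x, M):
    """List fragment lengths from P to Q for a feasible assignment x."""
    p_half = sorted(a for a, xi in zip(A, x) if xi == 1)        # distances measured from P
    q_half = sorted(M - a for a, xi in zip(A, x) if xi == 0)    # same sites, re-measured from P
    cuts = [0] + p_half + q_half + [M]                          # ascending because a_i <= M/2
    return [right - left for left, right in zip(cuts, cuts[1:])]

print(restore_order([2, 3, 7, 8], (1, 0, 0, 0), 16))  # [2, 6, 1, 4, 3]
```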

5 Implementation

First, we build a general 0–1 algorithm for the general situation, and then apply this algorithm to solve example 1 and example 2.

5.1 Example 1

The first set of data is 2, 3, 7, 8, 8, 9, 13, 14, then: \(A = [2{,}3{,}7{,}8]\).

The second set of data is 2, 1, 4, 3, 6, then: \(B = [2,1,4,3,6]\).

Total DNA length: \(M = 2 + 1 + 4 + 3 + 6 = 16\).

By Matlab 7.1 (see Appendix procedure 1), we obtain \(x = \left[ {\begin{array}{*{20}c} 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 1 & 1 \\ \end{array} } \right].\)

The possible sequences of fragments represented by the second set of data are as follows:

$$\begin{array}{*{20}c} 2 & 6 & 1 & 4 & 3 \\ 2 & 6 & 1 & 4 & 3 \\ 3 & 4 & 1 & 6 & 2 \\ 3 & 4 & 1 & 6 & 2 \\ \end{array}$$

If the two ends P and Q are not distinguished, there is only one solution:

$$\begin{array}{*{20}l} 2 \hfill & 6 \hfill & 1 \hfill & 4 \hfill & 3 \hfill \\ \end{array}$$

(or \(\begin{array}{*{20}l} 3 \hfill & 4 \hfill & 1 \hfill & 6 \hfill & 2 \hfill \\ \end{array}\)).

If P and Q are distinguished, there are two solutions:

$$\begin{array}{*{20}l} 2 \hfill & 6 \hfill & 1 \hfill & 4 \hfill & 3 \hfill \\ 3 \hfill & 4 \hfill & 1 \hfill & 6 \hfill & 2 \hfill \\ \end{array}$$

5.2 Example 2

The first set of data is 1, 2, 3, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 12, 13, 14, then:

$$A = [1{,}2{,}3{,}3{,}4{,}5{,}6{,}7]$$

The second set of data is 1, 1, 1, 1, 2, 2, 2, 2, 3, then: \(B = [1, 1, 1, 1, 2, 2, 2,2, 3]\).

Total DNA length: \(M = 1 + 1 + 1 + 1 + 2 + 2 + 2 + 2 + 3 = 15\).

By Matlab 7.1 (see Appendix procedure 2), we obtain:

$$x = \left[ {\begin{array}{*{20}c} 0 & 0 & 1 & 0 & 1 & 0 & 1 & 0 \\ 0 & 0 & 1 & 0 & 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 1 & 1 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 & 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 & 0 & 1 & 0 \\ 1 & 0 & 1 & 0 & 1 & 0 & 0 & 1 \\ 1 & 0 & 1 & 0 & 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 0 & 1 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 & 1 & 0 & 1 & 1 \\ 1 & 0 & 0 & 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 1 & 0 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 & 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 & 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 & 1 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 & 1 & 0 & 1 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 0 & 0 \\ 0 & 1 & 1 & 0 & 0 & 0 & 1 & 0 \\ 0 & 1 & 1 & 0 & 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 & 0 & 1 & 1 & 0 \\ 0 & 1 & 1 & 0 & 1 & 1 & 0 & 1 \\ 0 & 1 & 1 & 0 & 1 & 0 & 1 & 1 \\ 0 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 & 1 & 0 & 0 & 1 \\ 0 & 1 & 0 & 1 & 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 & 1 & 1 & 0 & 1 \\ 0 & 1 & 0 & 1 & 1 & 0 & 1 & 1 \\ 1 & 1 & 1 & 0 & 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 0 & 0 & 1 & 0 & 1 \\ 1 & 1 & 0 & 1 & 1 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 & 0 & 1 & 0 & 1 \\ \end{array} } \right]$$

The possible sequences of fragments represented by the second set of data are as follows:

$$\begin{array}{*{20}c} 3 & 1 & 2 & 2 & 2 & 2 & 1 & 1 & 1 \\ 3 & 2 & 2 & 2 & 2 & 1 & 1 & 1 & 1 \\ 3 & 1 & 2 & 2 & 2 & 2 & 1 & 1 & 1 \\ 3 & 2 & 2 & 2 & 2 & 1 & 1 & 1 & 1 \\ 1 & 2 & 2 & 3 & 1 & 2 & 1 & 1 & 2 \\ 1 & 2 & 3 & 2 & 2 & 1 & 1 & 1 & 2 \\ 1 & 2 & 1 & 3 & 2 & 1 & 2 & 1 & 2 \\ 1 & 2 & 2 & 1 & 2 & 3 & 1 & 1 & 2 \\ 1 & 2 & 1 & 1 & 2 & 2 & 3 & 1 & 2 \\ 1 & 2 & 1 & 2 & 1 & 3 & 2 & 1 & 2 \\ 1 & 2 & 2 & 3 & 1 & 2 & 1 & 1 & 2 \\ 1 & 2 & 3 & 2 & 2 & 1 & 1 & 1 & 2 \\ 1 & 2 & 1 & 3 & 2 & 1 & 2 & 1 & 2 \\ 1 & 2 & 2 & 1 & 2 & 3 & 1 & 1 & 2 \\ 1 & 2 & 1 & 1 & 2 & 2 & 3 & 1 & 2 \\ 1 & 2 & 1 & 2 & 1 & 3 & 2 & 1 & 2 \\ 2 & 1 & 2 & 3 & 1 & 2 & 1 & 2 & 1 \\ 2 & 1 & 3 & 2 & 2 & 1 & 1 & 2 & 1 \\ 2 & 1 & 1 & 3 & 2 & 1 & 2 & 2 & 1 \\ 2 & 1 & 2 & 1 & 2 & 3 & 1 & 2 & 1 \\ 2 & 1 & 1 & 1 & 2 & 2 & 3 & 2 & 1 \\ 2 & 1 & 1 & 2 & 1 & 3 & 2 & 2 & 1 \\ 2 & 1 & 2 & 3 & 1 & 2 & 1 & 2 & 1 \\ 2 & 1 & 3 & 2 & 2 & 1 & 1 & 2 & 1 \\ 2 & 1 & 1 & 3 & 2 & 1 & 2 & 2 & 1 \\ 2 & 1 & 2 & 1 & 2 & 3 & 1 & 2 & 1 \\ 2 & 1 & 1 & 1 & 2 & 2 & 3 & 2 & 1 \\ 2 & 1 & 1 & 2 & 1 & 3 & 2 & 2 & 1 \\ 1 & 1 & 1 & 1 & 2 & 2 & 2 & 2 & 3 \\ 1 & 1 & 1 & 2 & 2 & 2 & 2 & 1 & 3 \\ 1 & 1 & 1 & 1 & 2 & 2 & 2 & 2 & 3 \\ 1 & 1 & 1 & 2 & 2 & 2 & 2 & 1 & 3 \\ \end{array}$$

If P and Q are not distinguished, there are 8 groups of solutions:

$$\begin{array}{*{20}l} 3 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 1 \hfill \\ {({\mkern 1mu} {\text{or}}{\mkern 1mu} 1} \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & {3)} \hfill \\ 3 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill \\ {({\mkern 1mu} {\text{or}}{\mkern 1mu} 1} \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & {3)} \hfill \\ 1 \hfill & 2 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill \\ {({\mkern 1mu} {\text{or}}{\mkern 1mu} 2} \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 2 \hfill & {1)} \hfill \\ 1 \hfill & 2 \hfill & 3 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 2 \hfill \\ {({\mkern 1mu} {\text{or}}{\mkern 1mu} 2} \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 3 \hfill & 2 \hfill & {1)} \hfill \\ 1 \hfill & 2 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill \\ {({\mkern 1mu} {\text{or}}{\mkern 1mu} 2} \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill & {1)} \hfill \\ 1 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 1 \hfill & 2 \hfill \\ {({\mkern 1mu} {\text{or}}{\mkern 1mu} 2} \hfill & 1 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & {1)} \hfill \\ 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 1 \hfill & 2 \hfill \\ {({\mkern 1mu} {\text{or}}{\mkern 1mu} 2} \hfill & 1 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & {1)} \hfill \\ 1 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill \\ {({\mkern 1mu} {\text{or}}{\mkern 1mu} 2} \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & {1)} \hfill \\ \end{array}$$

If P and Q are distinguished, there are 16 groups of solutions:

$$\begin{array}{*{20}l} 3 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 1 \hfill \\ 3 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill \\ 1 \hfill & 2 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 2 \hfill & 3 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 2 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill \\ 1 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 3 \hfill \\ 1 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 2 \hfill & 3 \hfill \\ 2 \hfill & 1 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 1 \hfill \\ 2 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 1 \hfill \\ 2 \hfill & 1 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 1 \hfill \\ 2 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill & 1 \hfill \\ 2 \hfill & 1 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 2 \hfill & 3 \hfill & 2 \hfill & 1 \hfill \\ 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 3 \hfill & 2 \hfill & 2 \hfill & 1 \hfill \\ \end{array}$$
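The three counts reported for the two examples (feasible assignments, distinct fragment orders, and orders up to swapping P and Q) can be checked with a short Python sketch. It is an illustration with hypothetical function names, not the Appendix procedure. Instead of building \(C\) and \(CC\) it measures every cut from P, which yields the same segments.

```python
from itertools import product

def ordered_fragments(A, x, M):
    """Fragment lengths from P to Q; site i lies a_i from P if x_i = 1, else a_i from Q."""
    cuts = [0] + sorted(a if xi else M - a for a, xi in zip(A, x)) + [M]
    return tuple(right - left for left, right in zip(cuts, cuts[1:]))

def count_solutions(A, B):
    """Return (#feasible x, #distinct orders, #orders up to reversal)."""
    M, target = sum(B), sorted(B)
    orders = [ordered_fragments(A, x, M) for x in product((0, 1), repeat=len(A))]
    feasible = [o for o in orders if sorted(o) == target]
    distinct = set(feasible)                              # P and Q distinguished
    unoriented = {min(o, o[::-1]) for o in distinct}      # P and Q interchangeable
    return len(feasible), len(distinct), len(unoriented)

print(count_solutions([2, 3, 7, 8], [2, 1, 4, 3, 6]))                          # (4, 2, 1)
print(count_solutions([1, 2, 3, 3, 4, 5, 6, 7], [1, 1, 2, 1, 2, 2, 1, 2, 3]))  # (32, 16, 8)
```

In example 2 the feasible assignments outnumber the distinct orders by a factor of two because the two sites with \(a_{i} = 3\) can exchange halves without changing the cut positions.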

To further illustrate the practical significance of our solution, one solution is extracted from the result of example 2 as an instance.

$$\begin{array}{*{20}l} 1 \hfill & 2 \hfill & 2 \hfill & 3 \hfill & 1 \hfill & 2 \hfill & 1 \hfill & 1 \hfill & 2 \hfill \\ \end{array}$$

The solution above reconstructs a DNA sequence as in Fig. 4.

Fig. 4 A possible sequence of example 2

6 Error Analysis

Considering all kinds of factors, we think that the measurement of the length of the fragments is the main cause of the error. Assuming that there is no error in the total length of DNA, the sum of the two values obtained when the DNA molecule is cut at each restriction site separately and the sum of all values obtained when the DNA molecule is cut at all restriction sites at the same time are both equal to the total length of DNA. Following the problem analysis, we briefly discuss the impact of the error on the results in two cases:

1. When the same error occurs in the measurement of fragments in the first set of data and of the corresponding fragments in the second set of data, the change in the data is equivalent to the change caused by moving the corresponding restriction site along the DNA. In this case, the reconstructed restriction map shows a shift only for the restriction sites that correspond to the erroneous data. The determination of the other restriction sites is not affected. For example:

Suppose the real data of example 1 is:

The first set of data: 2, 14, 8, 8, 9, 7, 13, 3.

The second set of data: 2, 1, 4, 3, 6.

Assume that the data obtained due to the measurement error are:

The first set of data: 2, 14, 7, 9, 9, 7, 13, 3.

The second set of data: 2, 2, 4, 3, 5.

Solving for \(x\) with the erroneous data using the Matlab 7.1 program (see Appendix program 3) gives:

$$x = \, \left[ {\begin{array}{*{20}c} 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ \end{array} } \right]$$

The possible sequences of fragments represented by the second set of data are as follows:

$$\begin{array}{*{20}l} 2 \hfill & 5 \hfill & 2 \hfill & 4 \hfill & 3 \hfill \\ 2 \hfill & 5 \hfill & 2 \hfill & 4 \hfill & 3 \hfill \\ 3 \hfill & 4 \hfill & 2 \hfill & 5 \hfill & 2 \hfill \\ 3 \hfill & 4 \hfill & 2 \hfill & 5 \hfill & 2 \hfill \\ \end{array}$$

If P and Q are not distinguished, the final result can be expressed as follows:

$$\begin{array}{*{20}l} 2 \hfill & 5 \hfill & 2 \hfill & 4 \hfill & 3 \hfill \\ \end{array}$$

The results from real data are as follows:

$$\begin{array}{*{20}l} 2 \hfill & 6 \hfill & 1 \hfill & 4 \hfill & 3 \hfill \\ \end{array}$$

It can be seen that the data error only changes the positions of the restriction sites that produce the error, and has no effect on the reconstruction of the positions of the other restriction sites.

2. When the error appears only in the first set or only in the second set of data, or when the two sets contain unrelated errors at the same time, several outcomes are possible: most of the restriction sites may change, in which case the calculation result can be regarded as invalid, or the reconstruction cannot be carried out at all. For example, suppose the real data of example 1 are:

The first set of data: 2, 14, 8, 8, 9, 7, 13, 3.

The second set of data: 2, 1, 4, 3, 6.

Assume that the data obtained due to the measurement error are:

The first set of data: 2, 14, 7, 9, 9, 7, 13, 3.

The second set of data: 2, 1, 4, 3, 6.

Solving for \(x\) with the erroneous data using the Matlab 7.1 program (see Appendix program 4) gives \(x = \emptyset\).

The reconstruction cannot proceed due to the error.
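Both error cases can be reproduced with a small exhaustive search. The Python sketch below is an illustration only (the Appendix programs are not shown here, and `distinct_orders` is a hypothetical name). It takes the reduced first set \(A\) directly, so the erroneous first set 2, 14, 7, 9, 9, 7, 13, 3 enters as \(A = [2, 3, 7, 7]\).

```python
from itertools import product

def distinct_orders(A, B):
    """All P-to-Q fragment orders consistent with A and B (exhaustive 0-1 search)."""
    M, target = sum(B), sorted(B)
    found = set()
    for x in product((0, 1), repeat=len(A)):
        # site i lies a_i from P when x_i = 1, otherwise a_i from Q
        cuts = [0] + sorted(a if xi else M - a for a, xi in zip(A, x)) + [M]
        order = tuple(right - left for left, right in zip(cuts, cuts[1:]))
        if sorted(order) == target:
            found.add(order)
    return sorted(found)

true_orders = distinct_orders([2, 3, 7, 8], [2, 1, 4, 3, 6])
case1 = distinct_orders([2, 3, 7, 7], [2, 2, 4, 3, 5])   # matching error in both data sets
case2 = distinct_orders([2, 3, 7, 7], [2, 1, 4, 3, 6])   # error in the first data set only
print(true_orders)  # [(2, 6, 1, 4, 3), (3, 4, 1, 6, 2)]
print(case1)        # [(2, 5, 2, 4, 3), (3, 4, 2, 5, 2)]
print(case2)        # []
```

In case 1 only the two fragments adjacent to the shifted site change (6 and 1 become 5 and 2), while in case 2 no assignment satisfies the equations.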

7 Evaluation of the Model

The randi function of Matlab 7.1 is used to randomly generate 1000 groups of original DNA sequences. We then try to reconstruct each original DNA sequence from the fragment-length data obtained by SPDP, using the 0–1 algorithm.

We define two measures of agreement between the reconstructed DNA sequence and the original DNA sequence: the coincidence rate (counting both multiple and unique solutions) and the unique coincidence rate (that is, the reconstructed solution is unique and identical to the original DNA sequence).

First, 1000 sets of DNA sequences are randomly generated, and the second set of data consists of a set of random numbers (DNA fragments’ length) between 1 and 30. The effect of the number of DNA fragments on the coincidence rate and the unique coincidence rate was studied.

It can be seen from Fig. 5 that the curve of coincidence rate is above 90%, and the curve of unique coincidence rate is above 80%. Especially when the number of fragments is greater than 6, the coincidence rate reaches 100%. With the increase of the number of fragments, however, the unique coincidence rate will decrease (that is, multiple solutions will appear more).

Fig. 5 Statistical analysis of the influence of different numbers of DNA fragments on the effectiveness of the algorithm

Second, 1000 sets of DNA sequences are randomly generated, while the number of DNA fragments in the second set of data is 5, and the length of each fragment is a random number between 1 and M, where M here denotes the maximum length of a DNA fragment (not the total DNA length used in Section 4). We study the effect of the magnitude of M on the coincidence rate and the unique coincidence rate.

It can be seen from Fig. 6 that the coincidence rate between the DNA sequence calculated by this algorithm and the original DNA sequence is above 98%, and the unique coincidence rate is above 80%. As the maximum length of DNA fragments becomes larger, the unique coincidence rate increases.

Fig. 6 Statistical analysis of the influence of the maximum length of DNA fragments on the effectiveness of the algorithm

As shown in Figs. 5 and 6, high coincidence rates and unique coincidence rates are observed, which supports the effectiveness of the proposed algorithm.
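A simulation in the spirit of this evaluation can be sketched in Python. Everything specific here is an assumption rather than the authors' setup: Python's `random.randint` stands in for Matlab's randi, a reconstruction is counted as unique when exactly one fragment order remains after treating an order and its reversal as the same, and the function names are hypothetical. The sketch therefore illustrates how a unique coincidence rate can be estimated and is not expected to reproduce the values in Figs. 5 and 6.

```python
import random
from itertools import product

def orders_up_to_reversal(A, B):
    """Fragment orders consistent with A and B, identifying an order with its reversal."""
    M, target = sum(B), sorted(B)
    found = set()
    for x in product((0, 1), repeat=len(A)):
        cuts = [0] + sorted(a if xi else M - a for a, xi in zip(A, x)) + [M]
        order = tuple(right - left for left, right in zip(cuts, cuts[1:]))
        if sorted(order) == target:
            found.add(min(order, order[::-1]))
    return found

def unique_rate(num_fragments, max_length, trials=1000, seed=0):
    """Fraction of random instances whose only reconstruction is the original order."""
    rng = random.Random(seed)
    unique = 0
    for _ in range(trials):
        truth = tuple(rng.randint(1, max_length) for _ in range(num_fragments))
        M = sum(truth)
        positions = [sum(truth[:k]) for k in range(1, num_fragments)]  # cut sites from P
        A = [min(p, M - p) for p in positions]                         # reduced first data set
        if orders_up_to_reversal(A, list(truth)) == {min(truth, truth[::-1])}:
            unique += 1
    return unique / trials

print(unique_rate(num_fragments=5, max_length=30))
```

Varying `num_fragments` with `max_length` fixed, or the reverse, mirrors the two experiments described above. The running time grows as \(2^{n}\) per instance, where \(n\) is the number of restriction sites.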

8 Conclusions and Remarks

Our data and analysis support the advantages of the algorithm: (1) the algorithm makes full use of the search method and 0–1 planning, and optimizes the arrangement of different DNA fragments to find the most satisfactory solution; (2) in terms of operation, it reduces the difficulty of manual combination and pure mathematical reasoning, and provides a relatively fast and accurate method for high-throughput and large-scale DNA sequencing; (3) we simplify the variables of the data, gradually approach the length of each segment, arrange the lengths in ascending order, and finally use the different sorting results to set up equations with the related data.

However, the algorithm addresses a biological problem, in which the objective facts are unique. Because of the limitations of the given conditions, when there are multiple groups of solutions we cannot determine which group is exactly the sequence of the original DNA. For example, there are fragments with the same length in example 2. The algorithm starts from the lengths of the DNA fragments, so it inevitably ignores the possibility that DNA fragments of the same length represent different sequences, and the resulting solution is therefore one-sided.

Our algorithm can be used not only to analyze genetic samples, support DNA sequencing and integrate the biological information of each segment, but can also be extended to other related life science fields such as synthetic biology. In addition, if biological knowledge is integrated and all kinds of variation factors are considered, for instance the insertion, deletion and replacement of base pairs under experimental conditions, the algorithm will have broader application prospects and solve more practical problems of biological genetic analysis.