Abstract

The origin and evolution of SARS-CoV-2 have been important issues in tackling COVID-19. Research on these topics would enhance our knowledge of this virus and help us develop vaccines or predict its mutation paths. There are many theoretical and clinical studies in this area. In this article, we devise a structural metric which directly measures the structural differences between any two nucleotide sequences. To explore how the evolution works, we associate the nucleotide sequences of SARS-CoV-2 and its related families with degrees of randomness. Since the distances between randomly generated nucleotide sequences are concentrated around a mean with low variance, these sequences qualify as a fundamental reference. This reference can then be applied to measure the randomness of other Coronaviridae sequences. Our findings show that the relative randomness ratios are very consistent and concentrated, which indicates that their randomness is stable and predictable. The findings also reveal the evolutionary behaviours of the Coronaviridae and its subfamilies.

1. Introduction

COVID-19 has had a huge impact on all walks of life. To develop stable and trustworthy vaccines [1, 2], one needs to track and analyse the properties of SARS-CoV-2, which, together with MERS-CoV [3] and SARS-CoV, belongs to betacoronavirus. One also needs to compare its properties with those of the related families: alphacoronavirus, deltacoronavirus, and gammacoronavirus [4]. In the Coronaviridae, betacoronavirus is the most deadly subfamily. In this category, SARS-CoV, MERS-CoV, and SARS-CoV-2 emerged in 2003, 2012, and 2019, respectively. Many genomic, clinical, statistical, and analytical tools are available to evaluate and analyse their properties. Among all the theoretical and clinical research, genetic analysis provides a straightforward way to delve into the structures of Coronaviridae [5, 6]. Some researchers focus on geographic, demographic, and genomic analysis to extract patterns of the viruses [7, 8]. Though the origin and evolution of these viruses have been studied previously (for example, MERS-CoV [9] and SARS [10, 11]), there is still a long way to go in mapping out the interaction of these viruses. Currently, there are many theories and pieces of evidence about the mechanisms regulating the evolution and mutation of SARS-CoV-2 [12–14]. Nonetheless, a decisive account of such mechanisms still depends on further research and findings. In this article, we analyse their properties from the point of view of randomness, i.e., the degree of randomness of their nucleotide sequences. We devise a structural metric which is applied to measure the distances between the various Coronaviridae nucleotide sequences and randomly generated nucleotide sequences. These distances indicate how far the Coronaviridae sequences are from random nucleotide sequences.

We utilise the coronavirus genome data from the NCBI datasets [15]. Then, we measure the distances for each individual subfamily of the Coronaviridae. Our results show that this structural metric is well suited to revealing the properties of randomness. In particular, the relative distances between the random sequences are fairly stable and concentrated, a feature that makes the concept of randomness feasible. From these settings, we then calculate the relative randomness ratios (RRR) and extract our findings from them. The method is described in Section 3, the results are listed in Section 4, and the conclusions are given in Section 5.

2. Theoretical Settings

In order to measure the distances between structures clearly, we devise a structural metric in this section, which is applied in the later sections.

For any vector , we use or to denote its th element and to denote its length. We also use to denote its Euclidean norm.

2.1. Common Finite Interval (CFI)

Let denote the set of all the ascending finite sequences. Let , be arbitrary. Define the greatest lower bound . Define the least upper bound . Let denote the subsequence of whose elements lie between and . Let denote the set of all the elements of . Let . Let finite be arbitrary. Let denote the vector by sorting all the elements in . Define a difference operator over finite vectors by , where .

Definition 1. For any , any , define by .

Definition 2. (common subsequence).
If , , we define by .

This serves as the common structure between two structures.

Definition 3. (ascending finite sequences).
Let denote the set of all the ascending real vectors whose first element is and last element is . Let be the union set of all , i.e., .

Definition 4. (structural metric).
Define a distance function over by .

Claim 5. is a metric on .

Proof. This follows from Definition 4 by considering all the possible cases for the relations between the intervals involved.

Claim 6. If is a set of metrics over a set , then is also a metric on .

Proof. It follows immediately from the definition of a metric.

Example 1. Suppose nucleotide sequence , are given above.

Let denote the position of nitrogenous base in the sequence . Let denote the position of common sequence of and . Then, the results are presented in Table 1. Let . Now we define where the last equality comes directly from Definition 4. Since Therefore, .

The weights are all predetermined for each nitrogenous base. These values can also be adjusted according to professional judgement. For example, the weights could be decided by the relative frequencies of the bases. Example 1 lays the foundation for our later calculations.
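The weighting idea in Example 1 can be illustrated with a short Python sketch (the paper itself reports using R). It extracts the ascending position vector of each nitrogenous base and combines four per-base distances with weights. The per-base distance used here, `symmetric_difference_size`, is only a hypothetical placeholder and is not the structural metric of Definition 4; all function names are assumptions introduced for illustration. It relies on the standard fact that a weighted sum of metrics with positive weights is again a metric.

```python
from collections import Counter

BASES = "ACGT"

def base_positions(seq):
    """For each base, return the ascending list of 1-based positions where it occurs."""
    positions = {b: [] for b in BASES}
    for i, ch in enumerate(seq, start=1):
        if ch in positions:
            positions[ch].append(i)
    return positions

def frequency_weights(seq_a, seq_b):
    """Weights proportional to the pooled relative base frequencies (they sum to 1).
    Assumes at least one A/C/G/T character is present."""
    counts = Counter(ch for ch in seq_a + seq_b if ch in BASES)
    total = sum(counts.values())
    return {b: counts[b] / total for b in BASES}

def symmetric_difference_size(p, q):
    """Placeholder per-base distance (NOT Definition 4): number of positions
    that hold the base in exactly one of the two sequences."""
    return len(set(p) ^ set(q))

def weighted_distance(seq_a, seq_b, per_base_distance, weights=None):
    """Weighted sum of the four per-base distances; equal weights by default."""
    pa, pb = base_positions(seq_a), base_positions(seq_b)
    if weights is None:
        weights = {b: 0.25 for b in BASES}
    return sum(weights[b] * per_base_distance(pa[b], pb[b]) for b in BASES)

if __name__ == "__main__":
    a, b = "ACGT", "AGCT"
    print(base_positions(a))
    print(weighted_distance(a, b, symmetric_difference_size))
    print(weighted_distance(a, b, symmetric_difference_size, frequency_weights(a, b)))
```

Passing the per-base distance as a parameter keeps the weighting logic separate from the choice of metric, so an implementation of Definition 4 could be substituted without changing the rest.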

3. Methods

There are several steps for calculating the relative randomness ratios (RRR).
(i) Generate a set of 1000 random nucleotide sequences whose lengths are all fixed at 30000. The generated random (nucleotide) sequences are presented in Table 2.
(ii) Each sequence is regarded as a node. We then calculate the distance matrix for these nodes. This metric is a weighted metric consisting of 4 metrics which measure the structural distance with respect to each nitrogenous base. A concrete computation is shown in Example 1.
(iii) Some patterned nucleotide sequences are created and their distances from the random sequences are calculated. These sequences are nonessential; they are generated only for comparative purposes. The created (rule-based) nucleotide sequences and their distances are presented in Table 3.
(iv) The structural distances between SARS-CoV-2 nucleotide sequences and random ones are calculated. The results are presented in Table 4.
(v) The structural distances between MERS-CoV nucleotide sequences and random ones are calculated. The results are presented in Table 5.
(vi) The structural distances between SARS nucleotide sequences and random ones are calculated. The results are presented in Table 6.
(vii) The structural distances between alphacoronavirus nucleotide sequences and random ones are calculated. The results are presented in Table 7.
(viii) The structural distances between deltacoronavirus nucleotide sequences and random ones are calculated. The results are presented in Table 8.
(ix) The structural distances between gammacoronavirus nucleotide sequences and random ones are calculated. The results are presented in Table 9.
(x) The RRR for each subfamily is calculated; the way to calculate it is explained in Section 4.2.
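Steps (i) and (ii) can be sketched in Python as follows (the paper reports using R). The sketch assumes that each base is drawn uniformly and independently, which the text does not state, and it uses a plain Hamming distance purely as a stand-in for the structural metric of Definition 4. The function names and the small sizes in the usage part are illustrative assumptions; the paper's setting is 1000 sequences of length 30000.

```python
import random

def random_sequences(n, length, seed=0):
    """Generate n random nucleotide sequences of a fixed length.
    Assumption: bases are drawn uniformly and independently."""
    rng = random.Random(seed)
    return ["".join(rng.choices("ACGT", k=length)) for _ in range(n)]

def hamming(a, b):
    """Stand-in distance (NOT the paper's structural metric):
    number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def distance_matrix(seqs, dist):
    """Symmetric matrix of pairwise distances with a zero diagonal."""
    n = len(seqs)
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            m[i][j] = m[j][i] = dist(seqs[i], seqs[j])
    return m

if __name__ == "__main__":
    seqs = random_sequences(n=5, length=40)   # small sizes for a quick run
    for row in distance_matrix(seqs, hamming):
        print(row)
```

Only the upper triangle is computed and then mirrored, which halves the number of distance evaluations; with 1000 sequences this still means 499500 pairwise comparisons.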

4. Results

We use R (version 4.0.2), in particular the package “Biostrings”, to implement the theoretical setting. Following the procedures described in Section 3, we present the results in this section. We set the length of each random nucleotide sequence to 30000, which is approximately the genome length of the SARS-CoV virus family. We also use R to draw 1000 sample sequences for our experiment (a number limited by the capacity of our computers).

4.1. Experiment: Randomness of Nucleotide Sequences

Through Definition 4 and Example 1, we have the distance matrix as follows:

After removing the diagonal, we calculate some descriptive values for the elements: the minimum, maximum, mean, and standard deviation of the whole distance matrix. The minimum is 127.1 and the maximum is 134.7. The mean is 130.88 and the standard deviation is 0.83. Since the standard deviation is very small, the structural distance between any pair of random nucleotide sequences is highly concentrated around the mean, which is a good referential property for our further analysis. Now, let us demonstrate the distances between some patterned sequences and the random sequences.
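The descriptive values above can be computed from any square distance matrix with a few lines of Python; this is a minimal sketch rather than the authors' R code. The function name is hypothetical, and the sample standard deviation (divisor n - 1, as in R's `sd`) is an assumption, since the text does not say which form was used.

```python
import statistics

def off_diagonal_summary(matrix):
    """Minimum, maximum, mean, and sample standard deviation of all
    off-diagonal entries of a square distance matrix."""
    n = len(matrix)
    values = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "sd": statistics.stdev(values),  # assumption: divisor n - 1
    }

if __name__ == "__main__":
    toy = [[0, 1, 2],
           [1, 0, 3],
           [2, 3, 0]]   # toy matrix, not data from the paper
    print(off_diagonal_summary(toy))
```

Because the matrix is symmetric, each distance is counted twice; this leaves the minimum, maximum, and mean unchanged and affects the sample standard deviation only slightly.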

Example 2. Suppose are bundled and repeated 7500 times with ; moreover, are bundled and repeated 3750 times with ; finally, (a pattern for the Fibonacci sequence with mod operation, or mod 4, where 1, 2, 3, and 4 are identified with “A”, “C”, “G”, and “T”, respectively) are bundled and repeated 5000 times with as shown in the following: (i)(ii)(iii)

The distances between each patterned sequence and the random sequences are listed in Table 3.
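The construction of such patterned sequences can be sketched in Python. The Fibonacci block below is derived from the residues of the Fibonacci sequence modulo 4, which repeat with period 6; mapping the residue 0 to “T” (that is, treating it as 4) is an assumption about the identification described in Example 2. The block "ACGT" in the usage part is a hypothetical pattern chosen only to show the 4 x 7500 case, and the function names are illustrative.

```python
def fibonacci_mod4_block():
    """One period (length 6) of the Fibonacci sequence modulo 4, written as bases.
    Assumption: residues 1, 2, 3 map to A, C, G and residue 0 (i.e. 4) maps to T."""
    mapping = {1: "A", 2: "C", 3: "G", 0: "T"}
    residues = []
    a, b = 1, 1
    for _ in range(6):
        residues.append(a % 4)
        a, b = b, a + b
    return "".join(mapping[r] for r in residues)

def repeat_block(block, times):
    """Build a patterned sequence by repeating a block a fixed number of times."""
    return block * times

if __name__ == "__main__":
    fib = repeat_block(fibonacci_mod4_block(), 5000)   # 6 * 5000 = 30000 bases
    simple = repeat_block("ACGT", 7500)                # hypothetical 4-base block
    print(fibonacci_mod4_block(), len(fib), len(simple))
```

The repetition counts 7500, 3750, and 5000 correspond to blocks of 4, 8, and 6 bases, so that every patterned sequence has the same length, 30000, as the random sequences.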

The structural distances between the patterned sequences and the random ones differ clearly from the distances among the random sequences themselves.

4.2. Distance for Nucleotide Sequences

We import the SARS-CoV-2 genomic codes and save them in S4DSC2 [15]. Since S4DSC2 is too large (4617 sequences) to be handled by our computer, we sample only 20 of them. The results are presented in Table 4, where the column “Sequence” is the order of the sampled sequence in the data set; “Min” and “Max” are the minimal and maximal distances between the given sequence and the random sequences, respectively; “Mean” is the average distance between the given sequence and the random sequences; “Sd” is the standard deviation of this set of distances; “Mean rand” is the average distance of the distance matrix of random sequences; and “RRR” is the relative randomness ratio, which is “Mean” divided by “Mean rand.” The columns have the same meanings in the later tables, so we do not repeat the explanation. For MERS-CoV, the size of the downloaded data is 530. We sample 20 of them randomly. The results are presented in Table 5. For SARS-CoV, the size of the downloaded data is 10647. We sample 20 of them randomly. The results are presented in Table 6. For alphacoronavirus, the size of the downloaded and filtered data is 1002. We sample 20 of them randomly. The results are presented in Table 7. For deltacoronavirus, the size of the downloaded and filtered data is 149. We sample 20 of them randomly. The results are presented in Table 8. For gammacoronavirus, the size of the downloaded and filtered data is 427. We sample 20 of them randomly. The results are presented in Table 9.
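As a rough Python illustration of how one table row could be assembled, the sketch below takes the distances from one viral sequence to the random sequences and divides their mean by “Mean rand” to obtain the RRR. The function name is hypothetical, the three distances in the usage part are made-up numbers rather than results from the paper, and the sample standard deviation is an assumption; only the value 130.88 for “Mean rand” is taken from Section 4.1.

```python
import statistics

def rrr_row(distances_to_random, mean_rand):
    """Summarise the distances from one sequence to the random sequences.
    RRR is defined in the text as "Mean" divided by "Mean rand"."""
    mean = statistics.fmean(distances_to_random)
    return {
        "Min": min(distances_to_random),
        "Max": max(distances_to_random),
        "Mean": mean,
        "Sd": statistics.stdev(distances_to_random),  # assumption: divisor n - 1
        "Mean rand": mean_rand,
        "RRR": mean / mean_rand,
    }

if __name__ == "__main__":
    illustrative = [135.0, 136.0, 137.0]   # made-up distances, not from the paper
    row = rrr_row(illustrative, mean_rand=130.88)
    print(round(row["RRR"], 4))
```

An RRR of 1 means that the sequence is, on average, as far from the random sequences as they are from each other; values above or below 1 indicate that it is farther or nearer than that reference.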

5. Conclusion

From the results presented in the tables, we reach the following statements:
(i) The structural distances between random (nucleotide) sequences are highly concentrated, with a low standard deviation. This feature justifies their referential role under the structural metric.
(ii) The patterned nucleotide sequences have lower means and lower standard deviations in their distances from the random sequences.
(iii) The relative randomness ratios (RRR) for Coronaviridae, which lie between 1.01 and 1.08, are much closer to the complete randomness ratio (that is, 1) than the ones for the patterned nucleotide sequences, which lie around 0.84 in our examples.
(iv) Overall, the randomness of betacoronavirus is higher than that of alphacoronavirus or deltacoronavirus, which in turn are higher than the structural distances between SARS-CoV-2 and random sequences. This could explain why betacoronavirus shows more mutations than the other subfamilies.
(v) Within betacoronavirus, the RRR of SARS-CoV-2 is almost fixed at 1.04. This indicates that the mutations of SARS-CoV-2 are stabilized at this moment.

These findings provide some insight into the degree of structural randomness of SARS-CoV-2 and its related families. Linking this knowledge to other research results would help us map out the dynamical structures and evolution of these viruses.

Data Availability

The data are available from the author on reasonable request (https://www.ncbi.nlm.nih.gov/sars-cov-2/; https://www.ncbi.nlm.nih.gov/datasets/coronavirus/genomes/).

Conflicts of Interest

The author declares that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work is supported by the Humanities and Social Science Fund of Ministry of Education of China (Grant No. 20XJA-GAT001).