Abstract
Information capacity of nucleotide sequences measures the unexpectedness of a continuation of a given string of nucleotides, and is thus closely related to a variety of biological issues. A continuation is defined so as to maximize the entropy of the ensemble of such continuations. The capacity is defined as the mutual entropy of the real frequency dictionary of a sequence with respect to the one bearing the most expected continuations; it does not depend on the length of strings contained in a dictionary. Various genomes exhibit a multi-minima pattern in the dependence of information capacity on the string length, which reflects an order within a sequence. Strings whose expected frequency deviates significantly from the real one are the words of increased information value. Such words are distributed non-randomly along a sequence, which makes it possible to retrieve the correlation between a structure and a function encoded within a sequence.
Notes
The theory and methodology described below are applicable to a sequence over an arbitrary (finite) alphabet ℵ, for example, to amino acid sequences.
The equality of these two sums is the reason for closing a sequence into a ring.
Strictly speaking, information capacity is defined for a frequency dictionary, not for a sequence; we shall not distinguish between the two unless confusion may arise.
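The definitions in the abstract and the notes above can be illustrated with a minimal Python sketch. It is not taken from this page: the function names `frequency_dictionary` and `information_capacity` are hypothetical, and the formula for the most expected continuation (the frequency of a string of length q estimated as the product of the frequencies of its two overlapping substrings of length q − 1, divided by the frequency of their common core of length q − 2) is an assumption based on the maximum entropy construction the abstract refers to. The sequence is read as a ring, as the second note suggests.

```python
from collections import Counter
from math import log


def frequency_dictionary(seq, q):
    """Frequencies of all strings of length q, reading seq as a ring."""
    n = len(seq)
    if q < 1 or n < q:
        raise ValueError("need 1 <= q <= len(seq)")
    ring = seq + seq[:q - 1]  # wrap around so that every position starts a string
    counts = Counter(ring[i:i + q] for i in range(n))
    return {w: c / n for w, c in counts.items()}


def information_capacity(seq, q):
    """Relative entropy of real q-string frequencies against expected ones.

    ASSUMPTION (not stated on this page): the expected frequency of a
    string w is f(w[:-1]) * f(w[1:]) / f(w[1:-1]).
    """
    if q < 2:
        raise ValueError("q must be at least 2")
    f_q = frequency_dictionary(seq, q)
    f_prev = frequency_dictionary(seq, q - 1)
    # For q == 2 the common core is the empty string, with frequency 1.
    f_core = frequency_dictionary(seq, q - 2) if q > 2 else {"": 1.0}
    total = 0.0
    for w, f in f_q.items():
        expected = f_prev[w[:-1]] * f_prev[w[1:]] / f_core[w[1:-1]]
        total += f * log(f / expected)
    return total


if __name__ == "__main__":
    toy = "ACGT" * 5  # strictly periodic toy sequence, not real data
    for q in (2, 3):
        print(q, information_capacity(toy, q))
```

For the strictly periodic toy sequence, dinucleotides are far from what single-nucleotide frequencies suggest, so the value at q = 2 is positive, while every trinucleotide is fully predicted by the dinucleotides and the value at q = 3 is zero. Under the same assumption, strings with a large ratio of real to expected frequency would be the candidates for words of increased information value.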
Acknowledgments
We are thankful to Prof. Alexander N. Gorban from Leicester University, for valuable discussions and inspiring ideas, and to Dr. Tatyana G. Popova from the Institute of Computational Modelling of RAS for stimulating interest in this work.
Additional information
The results presented here were obtained in part with support from the Krasnoyarsk Science Foundation.
Cite this article
Sadovsky, M.G., Putintseva, J.A. & Shchepanovsky, A.S. Genes, information and sense: complexity and knowledge retrieval. Theory Biosci. 127, 69–78 (2008). https://doi.org/10.1007/s12064-008-0032-1
DOI: https://doi.org/10.1007/s12064-008-0032-1