A brief review of molecular information theory
An approach to constructing molecular communications
The fundamental step in communications, including communications at the molecular level, is the accurate reception of a signal. As is well known in communications engineering fields, the mathematical foundation for obtaining good reception was developed by Claude Shannon in 1948 and 1949 [55], [56], [57]. How can we apply these ideas to the construction of molecular communications? One approach is to first find out how biomolecules interact with each other and how they set their states. With
Sequence logos show binding site information
We can use information theory to measure how much pattern there is in a set of binding sites [54]. As an example, consider the Fis protein. A starving bacterial cell contains fewer than 100 molecules of Fis, but when the cell encounters nutrients the number rises to over 50,000 molecules [4], and the Fis molecules then control many genes in the cell [19]. Fig. 1 shows several experimentally proven Fis sites from the front of the Fis gene itself. When there is not much Fis in the cell, the Fis gene
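As a rough illustration of measuring the pattern in a set of aligned binding sites, the following minimal sketch computes the information at each position as 2 bits minus the Shannon uncertainty of the observed bases, then sums over positions. It assumes equiprobable background bases and omits any small-sample correction. The function name and the toy alignment are hypothetical and are not real Fis sites.

```python
import math
from collections import Counter

def information_per_position(sites):
    """Return the information (bits) at each position of equal-length DNA sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    info = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        total = sum(counts.values())
        # Shannon uncertainty of the bases observed at this position, in bits.
        h_after = -sum((c / total) * math.log2(c / total) for c in counts.values())
        # Assumption: uncertainty before binding is log2(4) = 2 bits.
        info.append(2.0 - h_after)
    return info

# Hypothetical toy alignment for illustration only.
toy_sites = ["ACGT", "ACGA", "ACTT", "ACCG"]
per_position = information_per_position(toy_sites)
r_sequence = sum(per_position)  # total information in the site, in bits
print(per_position, r_sequence)
```

The per-position values are what a sequence logo draws as stack heights; with only a handful of sites the uncorrected estimate is biased upward, so a real analysis would need a sampling correction.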
Evolution of information
The significance of $R_{sequence}$ was found by comparing it to another measure of information. In many cases (but not Fis) the number of binding sites on the genome is known. So the problem facing the DNA binding protein is to locate a number of sites, $\gamma$, from the entire genome of size $G$. In information theory terms, the uncertainty before being bound to one of the sites is $\log_2 G$, while after being bound it reduces to $\log_2 \gamma$. So, as with the computation of the information in the binding sites, the
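The before-and-after uncertainties described here can be turned into a small calculation. This is a sketch under the assumption that each uncertainty is the base-2 logarithm of the number of equally likely positions, so the information needed to locate the sites is their difference. The function name and the example numbers are hypothetical.

```python
import math

def bits_to_locate_sites(genome_size, num_sites):
    """Decrease in uncertainty (bits) when a protein finds one of its sites."""
    if genome_size <= 0 or num_sites <= 0:
        raise ValueError("genome_size and num_sites must be positive")
    if num_sites > genome_size:
        raise ValueError("num_sites cannot exceed genome_size")
    before = math.log2(genome_size)  # uncertainty over all genome positions
    after = math.log2(num_sites)     # remaining uncertainty among the sites
    return before - after

# Hypothetical example: 2**22 possible positions and 2**6 sites.
bits = bits_to_locate_sites(2**22, 2**6)
print(bits)
```

Because only the ratio of sites to genome size matters, doubling the genome adds one bit and doubling the number of sites removes one.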
Sequence walkers show individual information of binding sites
The significance of $R_{sequence}$ is that it reflects how the genetic control system evolves to meet the demands of the environment as represented by $R_{frequency}$. If we inspect how $R_{sequence}$ is computed from Eqs. (1), (3), (4), we can see that it relies on the probability-weighted sum in Eq. (1). That is, $H$ is an average of the function $-\log_2 P_i$. But what does this average represent? It turns out that it can be expressed as an average of the information of individual sequences by adding
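The idea that the total is an average over individual sequences can be checked numerically. The sketch below assumes per-base weights of the form 2 + log2 f(b, l), where f(b, l) is the frequency of base b at position l, with no sampling correction. It scores each site by summing its weights and confirms that the mean over the sites equals the summed per-position information. All names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

def weight_matrix(sites):
    """Per-position weights 2 + log2(frequency) for each base seen in the sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for pos in range(length):
        counts = Counter(site[pos] for site in sites)
        total = len(sites)
        matrix.append({base: 2.0 + math.log2(c / total) for base, c in counts.items()})
    return matrix

def individual_information(site, matrix):
    """Sum the weights along one sequence (bits).

    A base never observed at a position raises KeyError in this sketch.
    """
    return sum(matrix[pos][base] for pos, base in enumerate(site))

# Hypothetical toy alignment for illustration only.
toy_sites = ["ACGT", "ACGA", "ACTT", "ACCG"]
matrix = weight_matrix(toy_sites)
values = [individual_information(site, matrix) for site in toy_sites]
mean_information = sum(values) / len(values)
print(values, mean_information)
```

In this toy case the individual values differ from site to site while their mean matches the information of the whole set, which is what makes it possible to rank individual sites on the same bit scale.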
Information and energy
Having determined practical measures of information in biological systems, the question arises, how is this information related to the binding energy? Surprisingly, the answer comes from two apparently different directions [38].
The first approach is from the Second Law of Thermodynamics expressed as the Clausius inequality [10], [66], [3]: $dS \geq dq/T$. Here $S$ is the total entropy of a system and it has units of joules per kelvin since $dq$ is heat and $T$ is the absolute temperature. The Boltzmann–Gibbs
Coding theory explains molecular efficiency
In Shannon’s model of communications, a series of $D$ independent voltage pulses sent over a wire is represented by a point in a $D$ dimensional space [56]. Although the pulses are initially distinct values, thermal noise distorts them by the time they reach the receiver. Essentially each pulse undergoes a drunkard’s walk, which means there will be a Gaussian variation around each signal pulse. The noise on the pulses is also independent and a combination of independent Gaussian distributions forms
Information theory in biology
We have used molecular information theory to investigate many biological systems across the ‘central dogma’ and beyond:
- DNA replication initiation by bacteriophage P1 RepA and other proteins [30], [11], [44], [29]
- transcription factors [54], [61], [19], [20], [60], [12], [28]
- RNA polymerases including those in T7 and related phages [54], [53], [13], [14], [59]
- splice junctions [62] and mutations in splice junctions causing human disease [34], [32], [1], [64], including the cancer-causing xeroderma
Acknowledgements
I thank Don Court, Amar Klar, Ryan Shultzaberger, Rose Chiango and Carrie Paterson for reading and commenting on the manuscript. This research was supported by the Intramural Research Program of the NIH, National Cancer Institute, Center for Cancer Research.
References (68)
- et al. Replication control of plasmid P1 and its host chromosome: the common ground. Prog. Nucleic Acid Res. Mol. Biol. (1997)
- et al. Theoretical aspects of specific and non-specific equilibrium binding of proteins to DNA as studied by the nitrocellulose filter binding assay: co-operative and non-co-operative binding to a one-dimensional lattice. J. Mol. Biol. (1982)
- et al. Xeroderma Pigmentosum-variant patients from America, Europe, and Asia. J. Investig. Dermatol. (2008)
- et al. Xeroderma Pigmentosum Group C splice mutation associated with mutism and hypoglycinemia—a new syndrome? J. Investig. Dermatol. (1998)
- et al. Information analysis of sequences that bind the replication initiator RepA. J. Mol. Biol. (1993)
- et al. EcoRI methylase. Physical and catalytic properties of the homogeneous enzyme. J. Biol. Chem. (1977)
- Theory of molecular machines. I. Channel capacity of molecular machines. J. Theoret. Biol. (1991)
- Theory of molecular machines. II. Energy dissipation from molecular machines. J. Theoret. Biol. (1991)
- Reading of DNA sequence logos: prediction of major groove binding by information theory. Methods Enzymol. (1996)
- Information content of individual genetic sequences. J. Theoret. Biol. (1997)
- Information content of binding sites on nucleotide sequences. J. Mol. Biol.
- Anatomy of Escherichia coli ribosome binding sites. J. Mol. Biol.
- Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites. J. Mol. Biol.
- Organization of the ABCR gene: analysis of promoter and splice junction sequences. Gene
- Automated kinetic assay of β-galactosidase activity. BioTechniques
- The Second Law
- Dramatic changes in Fis levels upon nutrient upshift in Escherichia coli. J. Bacteriol.
- Quantitative analysis of ribosome binding sites in E. coli. Nucleic Acids Res.
- Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Res.
- CorreLogo: an online server for 3D sequence logos of RNA and DNA alignments. Nucleic Acids Res.
- Science and Information Theory
- Thermodynamics and an Introduction to Thermostatistics
- Physical Chemistry
- Discovery of Fur binding site clusters in Escherichia coli by information theory models. Nucleic Acids Res.
- Information theory based T7-like promoter models: classification of bacteriophages and differential evolution of promoters and their polymerases. Nucleic Acids Res.
- Comparative analysis of tandem T7-like promoter containing regions in enterobacterial genomes reveals a novel group of genetic islands. Nucleic Acids Res.
- The human XPG gene: gene architecture, alternative splicing and single nucleotide polymorphisms. Nucleic Acids Res.
- A reexamination of information theory-based methods for DNA-binding site identification. BMC Bioinformatics
- Small membrane proteins found by comparative genomics and ribosome binding site models. Mol. Microbiol.
- Information analysis of Fis binding sites. Nucleic Acids Res.
- Molecular flip-flops formed by overlapping Fis sites. Nucleic Acids Res.
- The evolution of Carnot’s principle
- Logos for amino acid preferences in different backbone packing density regions of protein structural classes. Acta Crystallogr. Sect. D
- Two essential splice lariat branchpoint sequences in one intron in a xeroderma pigmentosum DNA repair gene: mutations result in reduced XPC mRNA levels that correlate with cancer risk. Hum. Mol. Genet.
Thomas D. Schneider is a Research Biologist in the Gene Regulation and Chromosome Biology Laboratory, National Cancer Institute, a part of the National Institutes of Health. Dr. Schneider received a B.S. in biology at MIT in 1978 and received his Ph.D. in 1984 from the University of Colorado, Department of Molecular, Cellular and Developmental Biology. His thesis was on applying Shannon’s information theory to DNA and RNA binding sites (Schneider1986). He is continuing this work at NIH as a tenured research biologist. A permanent web link is: http://alum.mit.edu/www/toms.