Elsevier

Nano Communication Networks

Volume 1, Issue 3, September 2010, Pages 173-180
A brief review of molecular information theory

https://doi.org/10.1016/j.nancom.2010.09.002

Abstract

The idea that we could build molecular communications systems can be advanced by investigating how actual molecules from living organisms function. Information theory provides tools for such an investigation. This review describes how we can compute the average information in the DNA binding sites of any genetic control protein and how this can be extended to analyze its individual sites. A formula equivalent to Claude Shannon’s channel capacity can be applied to molecular systems and used to compute the efficiency of protein binding. This efficiency is often 70% and a brief explanation for that is given. The results imply that biological systems have evolved to function at channel capacity, which means that we should be able to build molecular communications that are just as robust as our macroscopic ones.

Section snippets

An approach to constructing molecular communications

The fundamental step in communications, including communications at the molecular level, is the accurate reception of a signal. As is well known in communications engineering fields, the mathematical foundation for obtaining good reception was developed by Claude Shannon in 1948 and 1949 [55], [56], [57]. How can we apply these ideas to the construction of molecular communications? One approach is to first find out how biomolecules interact with each other and how they set their states. With

Sequence logos show binding site information

We can use information theory to measure how much pattern is in a set of binding sites [54]. As an example, consider the Fis protein. In a starving bacterial cell there are fewer than 100 molecules of Fis, but when the cell encounters nutrients, the number increases to over 50,000 molecules [4] and the Fis molecules then control many genes in the cell [19]. Fig. 1 shows several experimentally proven Fis sites from the front of the Fis gene itself. When there is not much Fis in the cell, the Fis gene

Evolution of information

The significance of $R_{sequence}$ was found by comparing it to another measure of information. In many cases (but not Fis) the number of binding sites on the genome is known. So the problem facing the DNA binding protein is to locate a number of sites, γ, from the entire genome of size G. In information theory terms, the uncertainty before being bound to one of the sites is $\log_2 G$, while after being bound it reduces to $\log_2 \gamma$. So, as with the computation of the information in the binding sites, the
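The uncertainty reduction described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the information needed to locate the sites is simply the difference between the two uncertainties, $\log_2 G - \log_2 \gamma$. The function name `r_frequency` and the genome and site counts are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math


def r_frequency(genome_size: int, num_sites: int) -> float:
    """Bits needed to narrow genome_size positions down to num_sites sites.

    Computed as the uncertainty before binding, log2(G), minus the
    uncertainty after binding, log2(gamma).
    """
    if num_sites <= 0 or genome_size < num_sites:
        raise ValueError("need 0 < num_sites <= genome_size")
    return math.log2(genome_size) - math.log2(num_sites)


# Hypothetical numbers: 2**22 possible positions, 2**6 binding sites.
bits = r_frequency(2**22, 2**6)
print(f"{bits:.1f} bits per site")
```

Because only the ratio of the two numbers matters, doubling both the genome size and the number of sites leaves the result unchanged.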

Sequence walkers show individual information of binding sites

The significance of $R_{sequence}$ is that it reflects how the genetic control system evolves to meet the demands of the environment as represented by $R_{frequency}$. If we inspect how $R_{sequence}$ is computed from Eqs. (1), (3), (4), we can see that it relies on the probability-weighted sum in Eq. (1). That is, $R_{sequence}$ is an average of the function $\log_2 f_{b,l}$. But what does this average represent? It turns out that it can be expressed as an average of the information of individual sequences by adding
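The relationship between the average and the individual information can be checked numerically. The sketch below is not the author's code: it assumes equiprobable bases (a maximum of 2 bits per position), omits the small-sample correction, and assigns each base in a site the weight $2 + \log_2 f_{b,l}$. The four aligned sequences are a hypothetical toy data set, and all function names are illustrative.

```python
import math
from collections import Counter

BASES = "ACGT"


def frequency_matrix(sites):
    """Return one dict per position giving the base frequencies f(b, l)."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for l in range(width):
        counts = Counter(s[l] for s in sites)
        matrix.append({b: counts[b] / len(sites) for b in BASES})
    return matrix


def r_sequence(sites):
    """Average information: sum over positions of 2 - H(l), in bits."""
    total = 0.0
    for freqs in frequency_matrix(sites):
        # Terms with f = 0 contribute nothing to the uncertainty H(l).
        h = -sum(f * math.log2(f) for f in freqs.values() if f > 0)
        total += 2.0 - h
    return total


def individual_information(site, matrix):
    """Information of one site: sum over positions of 2 + log2 f(b, l)."""
    return sum(2.0 + math.log2(matrix[l][b]) for l, b in enumerate(site))


sites = ["ACGT", "ACGA", "ACTT", "AGGT"]  # hypothetical toy alignment
matrix = frequency_matrix(sites)
ri = [individual_information(s, matrix) for s in sites]
print(r_sequence(sites), sum(ri) / len(ri))
```

The two printed values agree, because averaging $\log_2 f_{b,l}$ over the sites weights each base by its own frequency, which is exactly the probability-weighted sum. Sites that match the common pattern score above the average, and sites with rare bases score below it.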

Information and energy

Having determined practical measures of information in biological systems, the question arises, how is this information related to the binding energy? Surprisingly, the answer comes from two apparently different directions [38].

The first approach is from the Second Law of Thermodynamics expressed as the Clausius inequality [10], [66], [3]: $dS \geq dQ/T$. Here S is the total entropy of a system and it has units of joules per kelvin since Q is heat and T is the absolute temperature. The Boltzmann–Gibbs
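To make the link between information and energy concrete, the sketch below assumes the standard result that gaining one bit requires dissipating at least $k_B T \ln 2$ joules, and it treats efficiency as the ratio of the information gained to the maximum number of bits the dissipated energy could have bought. The function names, the temperature, and the energy value are hypothetical and serve only as an illustration.

```python
import math

BOLTZMANN_J_PER_K = 1.380649e-23  # Boltzmann constant, exact SI value


def min_energy_per_bit(temperature_k: float) -> float:
    """Minimum energy in joules dissipated per bit gained: k_B * T * ln 2."""
    return BOLTZMANN_J_PER_K * temperature_k * math.log(2)


def efficiency(information_bits: float, energy_dissipated_j: float,
               temperature_k: float) -> float:
    """Bits actually gained divided by the most bits the energy allows."""
    max_bits = energy_dissipated_j / min_energy_per_bit(temperature_k)
    return information_bits / max_bits


# Hypothetical example: a binding event at 310 K that dissipates enough
# energy for 10 bits but selects a pattern worth only 7 bits.
temperature = 310.0
energy = 10 * min_energy_per_bit(temperature)
print(f"{min_energy_per_bit(temperature):.3e} J per bit")
print(f"efficiency = {efficiency(7.0, energy, temperature):.2f}")
```

An efficiency of 1 would mean every joule dissipated was converted into pattern recognition at the thermodynamic limit; values above 1 are ruled out by the Second Law.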

Coding theory explains molecular efficiency

In Shannon’s model of communications, a series of D independent voltage pulses sent over a wire is represented by a point in a D dimensional space [56]. Although the pulses are initially distinct values, thermal noise distorts them by the time they reach the receiver. Essentially each pulse undergoes a drunkard’s walk, which means there will be a Gaussian variation around each signal pulse. The noise on the pulses is also independent and a combination of independent Gaussian distributions forms
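The geometric picture can be explored with a small simulation. The sketch below assumes independent Gaussian noise of equal variance on each of the D pulses and measures how far the received point lands from the transmitted one. The function name, sample counts, and seed are hypothetical. As D grows, the distances cluster ever more tightly around $\sigma\sqrt{D}$, so the noise fills a thin spherical shell around each signal point.

```python
import math
import random
import statistics


def noise_radii(dimensions: int, samples: int, sigma: float = 1.0,
                seed: int = 0):
    """Distances from a signal point after adding Gaussian noise per pulse."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    radii = []
    for _ in range(samples):
        squared = sum(rng.gauss(0.0, sigma) ** 2 for _ in range(dimensions))
        radii.append(math.sqrt(squared))
    return radii


for d in (4, 400):
    radii = noise_radii(d, samples=500)
    mean = statistics.mean(radii)
    relative_spread = statistics.stdev(radii) / mean
    print(f"D={d}: mean radius / sqrt(D) = {mean / math.sqrt(d):.2f}, "
          f"relative spread = {relative_spread:.3f}")
```

The shrinking relative spread is what allows the spheres around different messages to be packed together without overlapping.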

Information theory in biology

We have used molecular information theory to investigate many biological systems across the ‘central dogma’ and beyond:

  • DNA replication initiation by bacteriophage P1 RepA and other proteins [30], [11], [44], [29]

  • transcription factors [54], [61], [19], [20], [60], [12], [28]

  • RNA polymerases including those in T7 and related phages [54], [53], [13], [14], [59]

  • splice junctions [62] and mutations in splice junctions causing human disease [34], [32], [1], [64], including the cancer-causing xeroderma

Acknowledgements

I thank Don Court, Amar Klar, Ryan Shultzaberger, Rose Chiango and Carrie Paterson for reading and commenting on the manuscript. This research was supported by the Intramural Research Program of the NIH, National Cancer Institute, Center for Cancer Research.

References (68)

  • T.D. Schneider et al. Information content of binding sites on nucleotide sequences. J. Mol. Biol. (1986)
  • R.K. Shultzaberger et al. Anatomy of Escherichia coli ribosome binding sites. J. Mol. Biol. (2001)
  • R.M. Stephens et al. Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites. J. Mol. Biol. (1992)
  • R. Allikmets et al. Organization of the ABCR gene: analysis of promoter and splice junction sequences. Gene (1998)
  • D.N. Arvidson et al. Automated kinetic assay of β-galactosidase activity. BioTechniques (1991)
  • P.W. Atkins. The Second Law (1984)
  • C.A. Ball et al. Dramatic changes in Fis levels upon nutrient upshift in Escherichia coli. J. Bacteriol. (1992)
  • D. Barrick et al. Quantitative analysis of ribosome binding sites in E. coli. Nucleic Acids Res. (1994)
  • P.V. Benos et al. Additivity in protein-DNA interactions: how good an approximation is it? Nucleic Acids Res. (2002)
  • E. Bindewald et al. CorreLogo: an online server for 3D sequence logos of RNA and DNA alignments. Nucleic Acids Res. (2006)
  • L. Brillouin. Science and Information Theory (1962)
  • H.B. Callen. Thermodynamics and an Introduction to Thermostatistics (1985)
  • G.W. Castellan. Physical Chemistry (1971)
  • Z. Chen et al. Discovery of Fur binding site clusters in Escherichia coli by information theory models. Nucleic Acids Res. (2007)
  • Z. Chen et al. Information theory based T7-like promoter models: classification of bacteriophages and differential evolution of promoters and their polymerases. Nucleic Acids Res. (2005)
  • Z. Chen et al. Comparative analysis of tandem T7-like promoter containing regions in enterobacterial genomes reveals a novel group of genetic islands. Nucleic Acids Res. (2006)
  • S. Emmert et al. The human XPG gene: gene architecture, alternative splicing and single nucleotide polymorphisms. Nucleic Acids Res. (2001)
  • I. Erill et al. A reexamination of information theory-based methods for DNA-binding site identification. BMC Bioinformatics (2009)
  • M.R. Hemm et al. Small membrane proteins found by comparative genomics and ribosome binding site models. Mol. Microbiol. (2008)
  • P.N. Hengen et al. Information analysis of Fis binding sites. Nucleic Acids Res. (1997)
  • P.N. Hengen et al. Molecular flip-flops formed by overlapping Fis sites. Nucleic Acids Res. (2003)
  • E.T. Jaynes. The evolution of Carnot’s principle.
  • N. Kannan et al. Logos for amino acid preferences in different backbone packing density regions of protein structural classes. Acta Crystallogr. Sect. D (2000)
  • S.G. Khan et al. Two essential splice lariat branchpoint sequences in one intron in a xeroderma pigmentosum DNA repair gene: mutations result in reduced XPC mRNA levels that correlate with cancer risk. Hum. Mol. Genet. (2004)

Thomas D. Schneider is a Research Biologist in the Gene Regulation and Chromosome Biology Laboratory, National Cancer Institute, a part of the National Institutes of Health. Dr. Schneider received a B.S. in biology at MIT in 1978 and received his Ph.D. in 1984 from the University of Colorado, Department of Molecular, Cellular and Developmental Biology. His thesis was on applying Shannon’s information theory to DNA and RNA binding sites (Schneider1986). He is continuing this work at NIH as a tenured research biologist. A permanent web link is: http://alum.mit.edu/www/toms.