Module 12: Bio-Inspired Problems and Algorithms


Genomics: A (Really) Short Primer

Two key chemicals in living things:

  1. Proteins:
    • Different functions:
      • Enzymes (digestion)
      • Structural proteins (skin, hair)
      • Transporters of oxygen (hemoglobin)
      • Transporters of information (hormones)
    • Fact: human body has about 100,000 proteins.
    • A protein is a chain of amino acids
    • Only 20 different amino acids.
    • A protein can have thousands of amino acids.
      => amino acids repeat along the chain.
    • Proteomics: the study of proteins.


    (Courtesy: Dept. of Energy)

  2. DNA:
    • Molecular structure inside cell nucleus.
    • Sometimes in different pieces (chromosomes).
    • 3D structure is a double-helix.
    • DNA carries "information" in the form of bases.
    • Four bases:
      A = Adenine
      T = Thymine
      C = Cytosine
      G = Guanine
    • Bases are complementary:
      A pairs with T
      C pairs with G


      (Courtesy: Dept. of Energy)

    • DNA is composed of two complementary strands.

    • The human genome (DNA) is about 3 billion bases long.
      (For comparison: E. coli has about 4.6 million bases.)
    • Genes: substrings of DNA.
    • Only 2% of DNA consists of genes.
    • About 20,000 protein-coding genes in human DNA (earlier estimates, around 40,000, have since been revised down).
    • Note: our understanding of these matters is still evolving.
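
The base-pairing rule above (A with T, C with G) is easy to express in code. The following is a minimal sketch, not part of the original notes: the function names are made up for illustration, and the reversed variant reflects the fact that the two strands of the double helix run in opposite directions.

```python
# Base pairing from the notes: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the complement read in the opposite direction."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("CGGAT"))          # GCCTA
    print(reverse_complement("CGGAT"))  # ATCCG
```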

Another view: two key languages:

  1. "Protein language"
    • Protein = word
    • Amino-acid = letter
    • Alphabet = { A1, A2, ..., A20 }
      (20 amino acids).

  2. "DNA language"
    • DNA = sentence
    • Base = letter
    • Group-of-three-bases = word (called codon)
    • multiple codons (protein encoder) = phrase
    • Alphabet = { A, T, C, G }.
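
To make the "DNA language" view concrete, here is a small sketch that reads a DNA "sentence" as a sequence of three-letter "words" (codons). The helper name `codons` and the sample string are assumptions chosen for illustration; trailing bases that do not fill a complete codon are simply dropped.

```python
from collections import Counter


def codons(dna: str) -> list[str]:
    """Split a DNA string into consecutive 3-base words (codons).

    Bases left over at the end (fewer than three) are ignored.
    """
    return [dna[i:i + 3] for i in range(0, len(dna) - 2, 3)]


if __name__ == "__main__":
    sentence = "ATGGCCGCCTAA"          # illustrative string only
    words = codons(sentence)
    print(words)                        # ['ATG', 'GCC', 'GCC', 'TAA']
    print(Counter(words).most_common(1))  # [('GCC', 2)]
```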

In-Class Exercise 12.1: A single bit can encode two unique items (use "0" for one item, "1" for the other item). Now, answer the following:

  1. How many unique items can k bits encode?
  2. How many unique items can a single letter in the alphabet { A, T, C, G } encode?
  3. How many unique items can be encoded with a k-letter word over the alphabet { A, T, C, G }?
  4. What size word (i.e., what should k be) is enough to encode 20 unique items?

What does DNA do?


DNA Alignment - A Problem in Computational Biology

Gene comparisons:

The alignment problem:

In-Class Exercise 12.2: Use the above table to compute the alignment scores (by hand) for the following alignment of CGGAT and CGT:

      C G G A T
      C G T
  
Can you find a better alignment?

Key ideas in solution:

Implementation:

In-Class Exercise 12.3: Download this template and implement the above algorithm. Gap and match costs are given.

Analysis:

An improvement (in space):

Variations of the alignment problem:


Other problems in computational biology

The protein folding problem:

Other problems:

  • Gene finding
         => Given a genome, identify genes within the genome.

  • Phylogenetics and the tree of life.
         => Goal: build best possible tree.

  • Pattern classification and microarrays
         => Given a DNA sample, classify the sample


    (Courtesy: wikicommons)

  • Biological networks


Cellular Automata and Von Neumann's Quandary

Cellular Automata:

  • What is a cellular automaton?
    • An infinite cellular space - usually a 2D grid:

    • Set of states, e.g., S = { empty, A, B, C, D }.

    • Initially each cell is in one of the states:

    • System evolves in time-steps.

    • At each step, "rules" are applied to generate the next state for each cell, e.g.,

  • State-transition rules:
    • A neighborhood for each cell is defined, usually one of
      • 4-neighborhood (N, S, E, W).
      • 8-neighborhood (N, S, E, W, NE, NW, SE, SW).
    • The next state of a cell depends on its current state and the current state of its neighbors.
    • All cells change state at the same time.
    • The rules are sometimes called the "physics" of the system.
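
The state-transition scheme above can be sketched as a single synchronous update function. This is only an illustration under stated assumptions: the grid is finite (cells beyond the border are treated as "empty"), and the names `step`, `N4`, `N8` and the sample rule `spread_a` are hypothetical.

```python
from typing import Callable

# Neighborhood offsets (row, column) for the two common choices.
N4 = [(-1, 0), (1, 0), (0, 1), (0, -1)]            # N, S, E, W
N8 = N4 + [(-1, 1), (-1, -1), (1, 1), (1, -1)]     # plus NE, NW, SE, SW

Grid = list[list[str]]
Rule = Callable[[str, list[str]], str]   # (current state, neighbor states) -> next state


def step(grid: Grid, rule: Rule, offsets=N8, empty: str = "empty") -> Grid:
    """Apply the rule to every cell at once and return the new grid."""
    rows, cols = len(grid), len(grid[0])
    new_grid = [[empty] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            neighbors = []
            for dr, dc in offsets:
                rr, cc = r + dr, c + dc
                inside = 0 <= rr < rows and 0 <= cc < cols
                neighbors.append(grid[rr][cc] if inside else empty)
            # Always read from the old grid so all cells change together.
            new_grid[r][c] = rule(grid[r][c], neighbors)
    return new_grid


def spread_a(state: str, neighbors: list[str]) -> str:
    """Sample rule: state A spreads to any cell that has an A neighbor."""
    return "A" if state == "A" or "A" in neighbors else state


if __name__ == "__main__":
    grid = [["empty"] * 3 for _ in range(3)]
    grid[1][1] = "A"
    for row in step(grid, spread_a, offsets=N4):
        print(row)
```

Writing into a separate `new_grid` is what makes the update simultaneous: no cell ever sees a neighbor's next state.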

In-Class Exercise 12.4: Search the web for applets that simulate the Game of Life and examine what happens to the following patterns. Run each pattern for a few time steps (generations).

  • The Blinker.

  • The Block.

  • The Glider.

  • The R-pentomino.

The Game of Life:

  • A cellular automaton with only two states: on and off.

  • Devised in 1970 by John Horton Conway.

  • Rules:
    • Uses 8-neighborhood.
    • Rule 1 (birth): if a cell has exactly 3 neighbors "on", its next state is "on".
    • Rule 2 (status-quo): if a cell has exactly 2 neighbors "on", its next state is its current state.
    • Rule 3 (death): In all other cases, the next state is "off".

  • A generalization: k-Game-of-Life
    • Birth rule: if a cell has exactly k neighbors "on", its next state is "on".
    • Status quo rule: if a cell has exactly k-1 neighbors "on", its next state is its current state.
    • Death rule: in all other cases, the next state is "off".

  • k = 3 in the Game-of-Life.

  • Interesting observation:
    • If k > 3: too much growth (chaos).
    • If k < 3: too little growth (empty space).
    • k = 3 is "optimal for life".
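
The rules above, including the k-generalization, fit in a few lines if the "on" cells are stored as a set of coordinates (so the grid is effectively unbounded). This is a minimal sketch assuming k >= 2; the function name `life_step` is made up for illustration.

```python
from collections import Counter

Cell = tuple[int, int]


def life_step(live: set[Cell], k: int = 3) -> set[Cell]:
    """One generation of the k-Game-of-Life (k = 3 is Conway's game).

    Assumes k >= 2, so a cell with no "on" neighbors is always "off".
    """
    # For every cell adjacent to an "on" cell, count its "on" neighbors
    # in the 8-neighborhood.
    counts = Counter(
        (r + dr, c + dc)
        for (r, c) in live
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    return {
        cell
        for cell, n in counts.items()
        # Birth rule: exactly k. Status quo rule: exactly k-1 keeps an
        # "on" cell on. Everything else is "off" (death rule).
        if n == k or (n == k - 1 and cell in live)
    }


if __name__ == "__main__":
    blinker = {(1, 0), (1, 1), (1, 2)}
    print(sorted(life_step(blinker)))             # [(0, 1), (1, 1), (2, 1)]
    print(life_step(life_step(blinker)) == blinker)  # True
```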

Von Neumann's quandary:

  • Problem: to prove (mathematically) that self-reproduction is possible.

  • First attempt: the kinematic model (robots):
    • Can a robot reproduce, i.e., assemble a copy of itself (that can later reproduce) from a blueprint?

  • The blueprint problem: can a blueprint contain itself?

  • Second attempt: cellular automaton.

  • Trivial vs. non-trivial reproduction in cellular worlds:
    • Trivial reproduction: the Blinker in the Game of Life.
    • Von Neumann's criteria for non-trivial reproduction: a cellular automaton that:
      • Can embed a Universal Turing machine (i.e., can compute anything)
      • Can embed a Universal Constructor (i.e., can build from a blueprint).
      • Can reproduce itself entirely, including the blueprint.

  • Von Neumann (with others) showed that it is possible by constructing a cellular automaton exhibiting non-trivial self-reproduction:
    • 29 states per cell and 200,000 cells.
    • The "creature" contained a "tape" with instructions (blueprint), and a "constructor arm".
    • Another part of the cellular creature contained a universal Turing machine.

  • Solution of the blueprint problem: the blueprint was "photocopied" into the offspring.

  • Reproduction occurs in two phases:
    1. Build the offspring by reading (interpreting) the blueprint.
    2. Copy over the blueprint into the offspring (without interpreting).
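
The two-phase idea can be illustrated with a toy model (it is not Von Neumann's automaton; the class `Creature` and the functions below are hypothetical). The point is that the blueprint never has to describe itself: it is interpreted once to build the body, and then copied verbatim as plain data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Creature:
    body: tuple[str, ...]       # parts built by the "constructor arm"
    blueprint: tuple[str, ...]  # the "tape": instructions stored as inert data


def construct(blueprint: tuple[str, ...]) -> tuple[str, ...]:
    """Phase 1: interpret each instruction to build one body part."""
    return tuple(f"part<{instruction}>" for instruction in blueprint)


def reproduce(parent: Creature) -> Creature:
    body = construct(parent.blueprint)  # phase 1: read and interpret
    tape = tuple(parent.blueprint)      # phase 2: copy without interpreting
    return Creature(body, tape)


if __name__ == "__main__":
    tape = ("arm", "tape-reader", "copier")
    parent = Creature(construct(tape), tape)
    child = reproduce(parent)
    print(child == parent)             # True
    print(reproduce(child) == parent)  # True: the offspring can reproduce too
```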

Other developments in cellular automata:

  • The Game-of-Life can support computation and (it is believed) self-reproduction.

  • Simpler non-trivial self-reproducing cellular automata have been found
    => all use "blueprint copying".

  • Cellular automata have become a field of study with attempts to build "metabolic" creatures (that grow, age, evolve etc).

Intriguing questions and comparison between cellular automata and our "wet" world:

  1. Question: Does the physics/chemistry support "self-reproducing life"?
    • Cellular world: Yes (e.g., Von Neumann's example).
    • Wet world: Yes.

  2. Question: Does the physics also support trivial reproduction?
    • Cellular world: Yes (the Game-of-Life's Blinker).
    • Wet world: Yes (crystalline growth).

  3. Question: Does non-trivial reproduction use the 2-step blueprint model?
    • Cellular world: Yes (all examples).
    • Wet world: Yes (DNA replication and interpretation).

  4. Question: Is evolution supported?
    • Cellular world: Yes (in some models).
    • Wet world: Yes.

  5. Question: Is spontaneous occurrence of "life" possible?
    • Cellular world: Not known.
    • Wet world: Current belief: Yes.

  6. Question: Is uncontrolled chaotic growth ("grey goo") possible?
    • Cellular world: Yes (the Game-of-Life's R-pentomino).
    • Wet world: ?

  7. Question: Do the elements of self-reproduction also support computation?
    • Cellular world: Yes (many examples).
    • Wet world: Yes (next section).


Computing with DNA

Yes, with actual DNA.

Key ideas:

  • We will use chemical reactions with DNA to solve the Restricted Hamiltonian Path problem
    => exploit the massive parallelism with millions of DNA molecules.

  • The Restricted Hamiltonian Path problem (directed graph version):
    • Given a directed graph and two vertices s and d, find a path from s to d that visits each vertex exactly once.
    • Result: the Restricted Hamiltonian Path problem is NP-complete.
    • Note: for an n-vertex graph, the path contains n vertices (and hence n - 1 edges).

  • Overview of process:
    • Represent vertices using strings of DNA bases.
    • Represent edges using combinations of vertex-strings.
    • Use gel-electrophoresis to extract DNA with correct length.
    • Use filtering process in many steps to separate out paths with all vertices.
      => solution (no pun intended) to Hamiltonian path problem.
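
For reference, here is what the problem looks like as a conventional brute-force search. This is a sketch for small graphs only, assuming s and d are distinct; the function name and the sample graph are made up for illustration. It tries every ordering of the intermediate vertices, which is exactly the exponential work the DNA approach tries to carry out in parallel.

```python
from itertools import permutations


def restricted_hamiltonian_path(vertices, edges, s, d):
    """Return a path from s to d visiting every vertex exactly once, or None.

    Brute force: tries all (n-2)! orderings of the intermediate vertices.
    Assumes s != d and that edges are directed (u, v) pairs.
    """
    edge_set = set(edges)
    middle = [v for v in vertices if v not in (s, d)]
    for ordering in permutations(middle):
        path = [s, *ordering, d]
        # Every consecutive pair must be a directed edge of the graph.
        if all((a, b) in edge_set for a, b in zip(path, path[1:])):
            return path
    return None


if __name__ == "__main__":
    vertices = [0, 1, 2, 3]
    edges = [(0, 1), (1, 2), (2, 3), (0, 2), (1, 3)]
    print(restricted_hamiltonian_path(vertices, edges, 0, 3))  # [0, 1, 2, 3]
```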

In-Class Exercise 12.5: Why is the Restricted Hamiltonian Path problem NP-complete (given that the regular Hamiltonian Path problem is NP-complete)?

Details:

  • Step 1 (on paper): associate unique DNA substrings with vertices, e.g.

    • In the above example, an 8-base string represents a vertex.
    • In practice, a larger number is required to help separate out different-length paths.

  • Step 2 (on paper): associate unique DNA substrings with edges, based on substrings for vertices:

    • For edge (v1, v2), join the latter half of v1's string with the first half of v2's string.

  • Step 3 (on paper): identify the complementary strings for the vertices, e.g.,

  • Step 4 (on paper): create unique "start" and "end" DNA strings for vertices s and d.

    • The "start" string represents an artificial edge between "start" and s.
    • The "end" string represents an artificial edge between d and "end".

  • Step 5 (lab): synthesize all of the above DNA material (substrings) separately (one beaker corresponding to each different string).

  • Step 6 (lab): mix all the edges and complementary vertices:

    • The complementary vertices will, lego-like, bind edges in sequence.

    • This will produce all possible paths in the graph, including invalid ones (without "start" and "end").

  • Step 7 (lab): extract all paths beginning with "start" and ending with "end".
    => use PCR techniques

  • Step 8 (lab): separate out the DNA with the correct length (exactly n substrings)
    => use gel-electrophoresis.

    • Note: this will result in paths with exactly n vertices.
    • However, it will also contain paths with repetitions.

  • Step 9 (lab):
    • Filter out all paths that don't contain the string for v1
      => use v'1 (complement) to bind.
    • This leaves all paths containing v1.
    • Next, filter out all paths that don't contain v2.
    • ... (repeat for all vertices)
    • What remains: the DNA representation of all paths of length n that contain all vertices
      => Hamiltonian paths!
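
The procedure above can be mimicked in software to see why the filters leave only Hamiltonian paths. This is a rough simulation under stated assumptions, not the lab protocol: the 8-base vertex strings are invented, the helper names are hypothetical, and the "test tube" of all paths is replaced by an explicit enumeration of walks with at most n vertices (longer ones would be discarded by the length filter anyway).

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Complementary string for a vertex (used to bind edges together)."""
    return "".join(COMPLEMENT[b] for b in strand)


def edge_string(code, u, v):
    """Edge (u, v): latter half of u's string + first half of v's string."""
    half = len(code[u]) // 2
    return code[u][half:] + code[v][:half]


def walks(edges, max_vertices):
    """Stand-in for mixing: every walk with 2..max_vertices vertices."""
    frontier = [[u, v] for u, v in edges]
    found = list(frontier)
    for _ in range(max_vertices - 2):
        frontier = [w + [v] for w in frontier for (u, v) in edges if u == w[-1]]
        found.extend(frontier)
    return found


def dna_filter(vertices, edges, s, d):
    n = len(vertices)
    pool = walks(edges, n)
    # Keep walks that begin at s and end at d (the PCR step).
    pool = [w for w in pool if w[0] == s and w[-1] == d]
    # Keep walks with exactly n vertices (the gel-electrophoresis step).
    pool = [w for w in pool if len(w) == n]
    # One filtering pass per vertex: discard walks that do not contain it.
    for v in vertices:
        pool = [w for w in pool if v in w]
    return pool


if __name__ == "__main__":
    code = {"v1": "ACTTGCAG", "v2": "TCGGACTG"}   # invented vertex strings
    print(edge_string(code, "v1", "v2"))          # GCAGTCGG
    print(complement(code["v2"]))                 # AGCCTGAC
    edges = [(0, 1), (1, 2), (2, 3), (0, 2), (2, 1), (1, 3)]
    print(sorted(dna_filter([0, 1, 2, 3], edges, 0, 3)))
    # [[0, 1, 2, 3], [0, 2, 1, 3]]
```

Note how the complement of v2 (AGCCTGAC) pairs with the tail of edge (v1, v2) followed by the head of any edge leaving v2, which is the "lego-like" binding in the mixing step.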

Summary:

  • The purpose was to show that DNA and chemical processes can "compute".

  • Potential efficiency: chemical reactions occur in parallel.

  • It is not yet a practical method:
    • Problems need to be carefully coded.
    • Encoding takes time.
    • Macro-scale experimentation results in errors
      => (fraction of a teaspoonful required for 7-vertex graph).

  • Related work:
    • Using DNA for building "wetware" (gates, flip-flops).
    • Using proteins for computation.
    • Self-assembling nanoparticles.


Genetic Algorithms and Combinatorial Optimization Problems

Key ideas:

  • Use evolution as a metaphor for "winnowing" solutions.

  • Outline (for TSP):

    • Each candidate TSP tour is a "genome" (animal).
    • Start with a large number of potential solutions (initial population).
    • At each step generate a new population:
      • Use mutation to "explore"
      • Use mating to preserve "good characteristics".
    • Weak (high-cost) solutions "die".
    • Strong (low-cost) solutions "survive".
    • Eventually, optimal solution should dominate population.

Details: (TSP example)

  • Input: the n TSP points.

  • Associate tour with genome.

  • A genome's fitness is derived from the tour's length: the shorter the tour, the higher the fitness
    (e.g., use the inverse of the tour length).

  • Step 1: create an initial population of m random tours (e.g., m = 1000).
    (They don't have to be unique).

  • Step 2: Compute the "fitness" value of each genome (tour).
    Example with four 5-city tours:

      ID   Genome (tour)   Fitness (inverse tour length)   Fraction of total (PDF)
      1    0-1-2-3-4       27.5                            0.32
      2    4-0-1-3-2       12.95                           0.15
      3    0-2-1-3-4       9.3                             0.11
      4    2-4-3-0-1       36.0                            0.42
           Total           85.75                           1.00

  • Step 3: compute the population PDF (Probability Distribution Function) => fraction based on fitness.
    • Compute the total fitness (sum of the fitness values).
    • Compute what fraction of the total each fitness value amounts to.
    • The fractions are the PDF.

  • Step 4: generate a new population by drawing from the PDF
    => on average, about 32% of the new population will be copies of genome 1 and about 11% will be copies of genome 3.

  • Step 5: Apply crossover rules (mating):
    • The crossover-fraction is an algorithm parameter, e.g., crossover-fraction = 0.3
      => 30% of genome-pairs will engage in crossover.
    • Select 30% of the pairs at random (assuming crossover-fraction = 0.3).
    • Apply a crossover rule to each such pair: exchange parts of genomes between the pair.

  • Step 6: Apply mutation
    • mutation-fraction is an algorithm parameter.
    • mutation-fraction = fraction of genomes to mutate
      (e.g., 0.05)
    • Select 5% of the genomes (randomly) to mutate.
    • Apply mutation to each (make a slight adjustment in the tour).

  • Repeat steps 2-6 until fitness values converge
    => population is dominated by high-fitness genomes (tours).
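
The fitness, PDF, and selection steps (steps 2 to 4 above) can be sketched as follows. This is a minimal illustration, not a full genetic algorithm: the function names are hypothetical, and fitness is assumed to be the plain inverse of the tour length (any positive scaling gives the same PDF).

```python
import math
import random


def tour_length(tour, points):
    """Length of the closed tour that visits points in the given order."""
    return sum(
        math.dist(points[a], points[b])
        for a, b in zip(tour, tour[1:] + tour[:1])
    )


def fitness(tour, points):
    """Assumed fitness: inverse tour length (shorter tour, higher fitness)."""
    return 1.0 / tour_length(tour, points)


def population_pdf(fitness_values):
    """Step 3: each genome's share of the total fitness."""
    total = sum(fitness_values)
    return [f / total for f in fitness_values]


def next_generation(population, fitness_values, rng):
    """Step 4: draw a same-size population, with replacement, from the PDF."""
    pdf = population_pdf(fitness_values)
    return rng.choices(population, weights=pdf, k=len(population))


if __name__ == "__main__":
    rng = random.Random(0)
    points = [(rng.random(), rng.random()) for _ in range(5)]
    population = [rng.sample(range(5), 5) for _ in range(4)]
    values = [fitness(t, points) for t in population]
    print([round(p, 2) for p in population_pdf(values)])
    print(next_generation(population, values, rng))
```

Sampling with replacement is what lets strong genomes appear several times in the new population while weak ones tend to disappear.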

Crossover and mutation in TSP:

  • How do we "mate" two TSP tours?

    • Take the first part (about half) of Tour 1.
    • Take the remaining points in the order these points are found in Tour 2.

  • How do we mutate a TSP tour?
    => swap 2 points
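
The mating and mutation rules described above translate directly into code. This is a sketch with made-up function names; the cut point is taken as exactly half the tour (rounded down), which is one reading of "about half".

```python
import random


def crossover(tour1, tour2):
    """Mate two tours: first half of tour1, then the rest in tour2's order."""
    cut = len(tour1) // 2
    head = tour1[:cut]
    taken = set(head)
    # Remaining points appear in the order they are found in tour2,
    # so the child is always a valid tour (each point exactly once).
    return head + [p for p in tour2 if p not in taken]


def mutate(tour, rng):
    """Mutate a tour by swapping two distinct, randomly chosen positions."""
    i, j = rng.sample(range(len(tour)), 2)
    child = list(tour)
    child[i], child[j] = child[j], child[i]
    return child


if __name__ == "__main__":
    print(crossover([0, 1, 2, 3, 4], [4, 0, 1, 3, 2]))  # [0, 1, 4, 3, 2]
    print(mutate([0, 1, 2, 3, 4], random.Random(0)))
```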

Summary:

  • Advantages of genetic algorithms:
    • Genetic algorithms are easy to implement.
    • Like simulated annealing, the problem-specific part can be separated out from the generic algorithm.
    • If re-arrangements in fact do impact the solution, genetic algorithms have a reasonable chance of finding a good solution.
    • By their nature, genetic algorithms try many initial solutions (simultaneously)
      => simulated annealing needs to be re-run with different starting solutions.

  • Disadvantages:
    • Genetic algorithms are slow.
    • It's hard to define meaningful crossovers and mutations for some problems.
    • They require some experimentation to get working.
      => it's easier to automate this part in simulated annealing.
    • Generally, simulated annealing (with appropriate modification) is thought to be a better option.

  • Warning:
    • Their biological origins do not give genetic algorithms any special advantage
      => beware of their mystical appeal!