Sequence Alignment


What is Alignment?

Let's distinguish between two uses of "alignment":

Types of two-sequence alignments:

Difference between keyword-search and alignment:

Problems from a biologist's point of view:


Global Alignment - The Needleman-Wunsch Algorithm

The alignment problem in Computer Science terms:

Key ideas in solution:

Implementation:

In-Class Exercise: (Solution) Download this template and implement the above algorithm. Gap and match costs are given.

Analysis:

An improvement (in space):


Local Alignment - The Smith-Waterman Algorithm

Consider a global alignment example:

Two key modifications to global alignment algorithm:

  1. Starting new alignments:
    • As with global, start at top left corner.
    • However, if an alignment gets too expensive, discard it and start a new one.

  2. Look for best local alignment:
    • As with global, fill all matrix entries.
    • However, do not count on bottom-right as starting point
      => search entire matrix for best "bottom-right corner of local alignment"

To achieve the first part:

  • Decide when a local alignment should be abandoned (i.e., when a new one should start).
    => Typically, when score goes below zero.

  • Modify recurrence to avoid below-zero scores:
    Di,j = max (0, Ci-1,j-1, Ci-1,j, Ci,j-1)

  • If 0 is the maximum score at (i, j)
    => a new alignment is started.

  • Record (i, j) values where new alignments start.

To achieve the second part:

  • Search for Di,j with highest score.

  • Work backwards from this point.


Scoring - the PAM Approach

What is PAM?

  • PAM = Point Accepted Mutations.

  • A theory for assigning scores to substitutions (mutations) in proteins.

Consider a group of related proteins and a particular position:

  • Suppose the proteins are listed in evolutionary order and aligned.

  • We'll use a, b, c], ... to denote amino acids.

  • A mutation from a to b can occur in one step or many.

  • Let Mab = Pr[an a -> b mutation occurs] in some position.

  • Let Maa = Pr[no mutation occurs].

  • Let pa = Pr[residue a occurs].

Scoring intuition:

  • Consider scoring the substitution a to b.

  • Want a high score if Mab is high
    => because of high a -> b mutations.

  • Want a low score if pb is high
    => because high occurence of b explains occurence of b (as opposed to mutation).

  • Solution: let score be proportional to Mab / pb

PAM scoring:

  • Consider mutations in two positions: a -> b and c -> d.

  • Probability of both occuring is proportional to product: (Mab / pb) * (Mcd / pd)

  • But algorithm requires scores to be additive (for decomposability).

  • Take logarithms:
    score = log (Mab / pb) + log (Mcd / pd)

Evolutionary distance in PAM scoring:

  • In comparing a -> b in two proteins:
    • One possibility: the two proteins are phylogenetically adjacent
      => direct a -> b mutation.
    • Another possibility: several intermediate mutations occured:
      e.g., a -> e -> f -> b. (evolutionary distance = 3)
      e.g., c -> g -> d. (evolutionary distance = 2)

  • Goal: design scoring for a particular distance k.

  • What do we want reflected in the score?
    • Let Mabk = Pr[a -> b in k steps].
    • Compute Mabk for each a, b.
    • Score for a -> b is: log (Mabk / pb)

  • Thus, in PAM-250 (a popular one), k = 250.

Note:

  • We have only sketched some of the theory: a more rigorous explanation requires background in Markov chain theory.

  • Usually, it is desirable to adjust the "rate of overall mutation" in the theory.
    • Let M'aa = Pr[mutation from a]
    • Let Maa = 1 - M'aa
    • Adjust the ratio M'aa / Maa
    • Typical value: 1/100
      => one mutation in every 100 amino acids.


How BLAST Works

Key ideas in BLAST (e.g, protein) searches:

  • BLAST computes "similarity", not "alignment".

  • Given two protein sequences:
    • Find all substrings of length k (typical: k = 4) that occur in both strings.
    • Build score based on matches.
    • Extend substrings to see if match score can be increased.
    • Compute total score when no more extensions are possible.

  • Compute BLAST score against all proteins in database.

  • Rank order search results by score.