Sequence Alignment

What is Alignment?

Let's distinguish between two uses of "alignment":

Informal:

A comparison of two potentially related sequences.
Usually by hand.

Example:


        C T C T A G C A T T A G
        G T G C A C C C A

Prone to error for large sequences.
Not systematic or quantitative.

Formal:

Precise operators for alignment: matching, gaps
Example:


        C T C T A G C A T T A G
            G T - G C A C C C A     (insert a gap for better match)

Quantitative scoring system for matches and gaps.
Systematic search among possible alignments.
Use alignment algorithms to find optimal alignment.
Different alignment algorithms (global, local, affine-penalty).

Types of two-sequence alignments:

Global alignment:
- Input: treat the two sequences as potentially equivalent.
- Goal: identify conserved regions and differences.
- Algorithm: Needleman-Wunsch dynamic programming
- Applications:
  - Comparing two genes with same function (in human vs. mouse).
  - Comparing two proteins with similar function.
Local alignment:
- Input: The two sequences may or may not be related.
- Goal: see whether a substring in one sequence aligns well with a substring in the other.
- Algorithm: Smith-Waterman dynamic programming
- Applications:
  - Searching for local similarities in large sequences (e.g., newly sequenced genomes).
  - Looking for conserved domains or motifs in two proteins.
Semi-global alignment:
- Input: two sequences, one short and one long.
- Goal: is the short one a part of the long one?
- Algorithm: modification of Smith-Waterman.
- Applications:
  - Given a DNA fragment (with possible error), look for it in the genome.
  - Look for a well-known domain in a newly-sequenced protein.
Suffix-prefix alignment:
- Input: two sequences (usually DNA)
- Goal: is the prefix of one the suffix of the other?
- Algorithm: modification of Smith-Waterman.
- Applications:
  - DNA fragment assembly.
Heuristic alignment:
- Input: two sequences.
- Goal: See if two sequences are "similar" or candidates for alignment.
- Algorithms: BLAST, FASTA (and others)
- Applications:
  - Search in large databases.

Difference between keyword-search and alignment:

Keyword search:
- Keyword search is exact matching.
- Can be done quickly (straightforward scan).
- Used in Entrez (for example)
Alignment:
- Non-exact, scored matching.
- Takes much more time.
- Used in tools like BLAST2, CLUSTALW.

Problems from a biologist's point of view:

"I have just sequenced a DNA fragment"
=> you have a piece of unknown DNA.
- Run a BLAST search (which uses the BLAST algorithm).
  => result: other DNA fragments (or genomes) that contain similar stretches.
- Once you have candidates, run a more careful alignment among them.
  => local alignment or semi-global alignment.
"I've located a gene using a gene-finding algorithm"
=> you have the gene sequence fairly well determined.
- Run BLAST to locate similar genes.
- Run a global alignment to see differences.
- Alternatively, identify potential proteins via a protein database.
"I'm confirming a sequencing experiment first done by someone else"
=> do a global alignment.
"I've sequenced a new protein"
=> BLAST search (on proteins) and then global alignment on results.
"I've found a new domain or motif"
=> BLAST to locate proteins of interest, then semi-global alignment.

Global Alignment - The Needleman-Wunsch Algorithm

The alignment problem in Computer Science terms:

Given two strings and the scoring rules, find the optimal alignment.

Define scoring rules for evaluating a particular alignment:

Perfectly matched letters get a high score.
Matches between related letters get a modest score.
Matches with gaps get a low score.

Problem components:

An alphabet
e.g., { A, C, G, T } (DNA
e.g., { M, A, L, ..., } (protein)
Two input strings, e.g.,
Match and gap scores, e.g.,
Goal: find the alignment with the best (largest) value.

Potential asymmetry:

Some letter-alignments may be more favored.
Some gap costs are different.

Example:

Key ideas in solution:

First, the alignment problem can be expressed as a graph problem:
- The diagonal edges correspond to character matches (whether identical or not).
- Each down edge corresponds to a gap in string 2.
- Each across edge corresponds to a gap in string 1.
- Goal: find the longest (max value) path from top-left to bottom-right.
  => longest-path problem in Directed Acyclic Graph.
- Can be solved in O(mn) time (using topological sort)
  (m = length of string1, n = length of string2).
- Suppose we label the vertices (i, j) where i is the row, and j is the column.
- The best path to an intermediate node is via one of its neighbors: NORTH, WEST or NORTHWEST.

Dynamic programming approach:

Building an actual graph (adjacency matrix etc) is unnecessary.
Let D_i,j
= value of best path to node (i,j) from (0, 0).
= score of optimal match of first i chars in string 1 to first j chars in string 2.

Define (for clarity)

C_i-1,j-1	=	D_i-1,j-1 + exact-match-score
C_i-1,j	=	D_i-1,j + string1-gap-score
C_i,j-1	=	D_i,j-1 + string2-gap-score

Thus,

D_i,j	=	max (C_i-1,j-1, C_i-1,j, C_i,j-1)
	=	max (D_i-1,j-1 + exact-match-score, D_i-1,j + string1-gap-score, D_i,j-1 + string2-gap-score)

Initial conditions:
- D_0,0 = 0
- D_i,0 = D_i-1,0 + gap-score for i-th char in string 1.
- D_0,j = D_0,j-1 + gap-score for j-th char in string 2.

Implementation:

Pseudocode:

Algorithm: align (s₁, s₂)
Input: string s₁ of length m, string s₂ of length n

1.    Initialize matrix D properly;

      // Build the matrix D row by row.
2.    for i=1 to m
3.      for j=1 to n

          // Initialize max to the first of the three terms (NORTH).
4.        max = D[i-1][j] + gapScore2 (s₂[j-1])

          // See if the second term is larger (WEST).
5.        if max < D[i][j-1] + gapScore1 (s₁[i-1]) 
6.          max = D[i][j-1] + gapScore1 (s₂[i-1]) 
7.        endif

          // See if the third term is the largest (NORTHWEST).
8.        if max < D[i-1][j-1] + matchScore (s₁[i-1], s₂[j-1])
9.          max = D[i-1][j-1] + matchScore (s₁[i-1], s₂[j-1])
10.       endif

11.        D[i][j] = max

12.     endfor
13.   endfor

      // Return the optimal value in bottom-right corner.
14.   return D[m][n]

Output: value of optimal alignment

Explanation:
- The 2D array D[i][j] stores the D_i,j values.
- The method matchScore returns the value in matching two characters.
- The method gapScore1 returns the value of matching a particular character in string 1 with a gap.
  (gapScore2 is similarly defined).
- Note: D[i][0] and D[0][j] represent initial conditions.
- Note: there is one more row than the length of string 1.
  (likewise for string 2).
Computing the actual alignment:
- A separate 2D array tracks which term contributed to the maximum for each (i, j).
- After computing all of D, traceback and obtain path (alignment).

In-Class Exercise: (Solution) Download this template and implement the above algorithm. Gap and match costs are given.

Analysis:

The double for-loop tells the story: O(mn) to compute D_m,n.
Note: computing the actual alignment requires tracing back: O(m + n) time (length of a path).
Space required: O(mn).

An improvement (in space):

Genes are often longer than 10,000 bases (letters) long
=> space required is prohibitive.
First, note that we only need successive rows in computing D
=> space requirement can be reduced to O(n).
However, O(mn) space is still required to construct the actual alignment.
A different approach:
- Consider the "middle" character in string 1
  => the character at position m / 2.
- In the optimal alignment, this character will either align with some j-th character in string 1, or a gap.
  => we can try all possible j values.
- To try these alignments, compute the m/2-th row (in O(mn) time).
- Output the optimal alignment found.
- Recurse on either side of the optimal alignment.
It is possible to show: O(mn) time and O(n) space.

Local Alignment - The Smith-Waterman Algorithm

Consider a global alignment example:

An alignment is forced across the whole table.
=> each position is scored.

However, the score in one particular section may be higher
=> local alignments may have higher scores.

Goal in local alignment: find best possible local alignment.

Two key modifications to global alignment algorithm:

Starting new alignments:
- As with global, start at top left corner.
- However, if an alignment gets too expensive, discard it and start a new one.
Look for best local alignment:
- As with global, fill all matrix entries.
- However, do not count on bottom-right as starting point
  => search entire matrix for best "bottom-right corner of local alignment"

To achieve the first part:

Decide when a local alignment should be abandoned (i.e., when a new one should start).
=> Typically, when score goes below zero.
Modify recurrence to avoid below-zero scores:

D_i,j = max (0, C_i-1,j-1, C_i-1,j, C_i,j-1)
If 0 is the maximum score at (i, j)
=> a new alignment is started.
Record (i, j) values where new alignments start.

To achieve the second part:

Search for D_i,j with highest score.
Work backwards from this point.

Scoring - the PAM Approach

What is PAM?

PAM = Point Accepted Mutations.

A theory for assigning scores to substitutions (mutations) in proteins.

Consider a group of related proteins and a particular position:

Suppose the proteins are listed in evolutionary order and aligned.
We'll use a, b, c], ... to denote amino acids.
A mutation from a to b can occur in one step or many.
Let M_ab = Pr[an a -> b mutation occurs] in some position.
Let M_aa = Pr[no mutation occurs].
Let p_a = Pr[residue a occurs].

Scoring intuition:

Consider scoring the substitution a to b.
Want a high score if M_ab is high
=> because of high a -> b mutations.
Want a low score if p_b is high
=> because high occurence of b explains occurence of b (as opposed to mutation).
Solution: let score be proportional to M_ab / p_b

PAM scoring:

Consider mutations in two positions: a -> b and c -> d.
Probability of both occuring is proportional to product: (M_ab / p_b) * (M_cd / p_d)
But algorithm requires scores to be additive (for decomposability).
Take logarithms:
score = log (M_ab / p_b) + log (M_cd / p_d)

Evolutionary distance in PAM scoring:

In comparing a -> b in two proteins:
- One possibility: the two proteins are phylogenetically adjacent
  => direct a -> b mutation.
- Another possibility: several intermediate mutations occured:
  e.g., a -> e -> f -> b. (evolutionary distance = 3)
  e.g., c -> g -> d. (evolutionary distance = 2)
Goal: design scoring for a particular distance k.
What do we want reflected in the score?
- Let M_ab^k = Pr[a -> b in k steps].
- Compute M_ab^k for each a, b.
- Score for a -> b is: log (M_ab^k / p_b)
Thus, in PAM-250 (a popular one), k = 250.

Note:

We have only sketched some of the theory: a more rigorous explanation requires background in Markov chain theory.
Usually, it is desirable to adjust the "rate of overall mutation" in the theory.
- Let M'_aa = Pr[mutation from a]
- Let M_aa = 1 - M'_aa
- Adjust the ratio M'_aa / M_aa
- Typical value: 1/100
  => one mutation in every 100 amino acids.

How BLAST Works

Key ideas in BLAST (e.g, protein) searches:

BLAST computes "similarity", not "alignment".

Given two protein sequences:

Find all substrings of length k (typical: k = 4) that occur in both strings.
Build score based on matches.
Extend substrings to see if match score can be increased.
Compute total score when no more extensions are possible.

Compute BLAST score against all proteins in database.

Rank order search results by score.