Types of twosequence alignments:
 Global alignment:
 Input: treat the two sequences as potentially equivalent.
 Goal: identify conserved regions and differences.
 Algorithm: NeedlemanWunsch dynamic programming
 Applications:
 Comparing two genes with same function (in human vs. mouse).
 Comparing two proteins with similar function.
 Local alignment:
 Input: The two sequences may or may not be related.
 Goal: see whether a substring in one sequence aligns well
with a substring in the other.
 Algorithm: SmithWaterman dynamic programming
 Applications:
 Searching for local similarities in large sequences (e.g.,
newly sequenced genomes).
 Looking for conserved domains or motifs in two proteins.
 Semiglobal alignment:
 Input: two sequences, one short and one long.
 Goal: is the short one a part of the long one?
 Algorithm: modification of SmithWaterman.
 Applications:
 Given a DNA fragment (with possible error), look for it in
the genome.
 Look for a wellknown domain in a newlysequenced protein.
 Suffixprefix alignment:
 Input: two sequences (usually DNA)
 Goal: is the prefix of one the suffix of the other?
 Algorithm: modification of SmithWaterman.
 Applications:
 Heuristic alignment:
 Input: two sequences.
 Goal: See if two sequences are "similar" or candidates for alignment.
 Algorithms: BLAST, FASTA (and others)
 Applications:
 Search in large databases.
Difference between keywordsearch and alignment:
 Keyword search:
 Keyword search is exact matching.
 Can be done quickly (straightforward scan).
 Used in Entrez (for example)
 Alignment:
 Nonexact, scored matching.
 Takes much more time.
 Used in tools like BLAST2, CLUSTALW.
Problems from a biologist's point of view:
 "I have just sequenced a DNA fragment"
=> you have a piece of unknown DNA.
 Run a BLAST search (which uses the BLAST algorithm).
=> result: other DNA fragments (or genomes) that contain
similar stretches.
 Once you have candidates, run a more careful alignment among them.
=> local alignment or semiglobal alignment.
 "I've located a gene using a genefinding algorithm"
=> you have the gene sequence fairly well determined.
 Run BLAST to locate similar genes.
 Run a global alignment to see differences.
 Alternatively, identify potential proteins via a protein database.
 "I'm confirming a sequencing experiment first done by
someone else"
=> do a global alignment.
 "I've sequenced a new protein"
=> BLAST search (on proteins) and then global alignment on results.
 "I've found a new domain or motif"
=> BLAST to locate proteins of interest, then semiglobal alignment.
Global Alignment  The NeedlemanWunsch Algorithm
The alignment problem in Computer Science terms:
 Given two strings and the scoring rules, find the optimal alignment.
 Define scoring rules for evaluating a particular alignment:
 Perfectly matched letters get a high score.
 Matches between related letters get a modest score.
 Matches with gaps get a low score.
 Problem components:
 An alphabet
e.g., { A, C, G, T } (DNA
e.g., { M, A, L, ..., } (protein)
 Two input strings, e.g.,
 Match and gap scores, e.g.,
 Goal: find the alignment with the best (largest) value.
 Potential asymmetry:
 Some letteralignments may be more favored.
 Some gap costs are different.
 Example:
 Example:
Key ideas in solution:
 First, the alignment problem can be expressed as a
graph problem:
 The diagonal edges correspond to character matches
(whether identical or not).
 Each down edge corresponds to a gap in string 2.
 Each across edge corresponds to a gap in string 1.
 Goal: find the longest (max value) path from topleft to bottomright.
=> longestpath problem in Directed Acyclic Graph.
 Can be solved in O(mn) time (using topological sort)
(m = length of string1, n = length of string2).
 Suppose we label the vertices (i, j) where
i is the row, and j is the column.
 The best path to an intermediate node is via one of its
neighbors: NORTH, WEST or NORTHWEST.
 Dynamic programming approach:
 Building an actual graph (adjacency matrix etc) is unnecessary.
 Let D_{i,j}
= value of best path to node (i,j) from (0, 0).
= score of optimal match of first i chars in string 1
to first j chars in string 2.
 Define (for clarity)
C_{i1,j1} 
= 
D_{i1,j1} + exactmatchscore 
C_{i1,j} 
= 
D_{i1,j} + string1gapscore 
C_{i,j1} 
= 
D_{i,j1} + string2gapscore 
 Thus,
D_{i,j} 
= 
max (C_{i1,j1}, C_{i1,j}, C_{i,j1}) 

= 
max (D_{i1,j1} + exactmatchscore,
D_{i1,j} + string1gapscore,
D_{i,j1} + string2gapscore) 
 Initial conditions:
 D_{0,0} = 0
 D_{i,0} = D_{i1,0} + gapscore for
ith char in string 1.
 D_{0,j} = D_{0,j1} + gapscore for
jth char in string 2.
Implementation:
 Pseudocode:
Algorithm: align (s_{1}, s_{2})
Input: string s_{1} of length m, string s_{2} of length n
1. Initialize matrix D properly;
// Build the matrix D row by row.
2. for i=1 to m
3. for j=1 to n
// Initialize max to the first of the three terms (NORTH).
4. max = D[i1][j] + gapScore2 (s_{2}[j1])
// See if the second term is larger (WEST).
5. if max < D[i][j1] + gapScore1 (s_{1}[i1])
6. max = D[i][j1] + gapScore1 (s_{2}[i1])
7. endif
// See if the third term is the largest (NORTHWEST).
8. if max < D[i1][j1] + matchScore (s_{1}[i1], s_{2}[j1])
9. max = D[i1][j1] + matchScore (s_{1}[i1], s_{2}[j1])
10. endif
11. D[i][j] = max
12. endfor
13. endfor
// Return the optimal value in bottomright corner.
14. return D[m][n]
Output: value of optimal alignment
 Explanation:
 The 2D array D[i][j] stores the D_{i,j} values.
 The method matchScore returns the value in matching
two characters.
 The method gapScore1 returns the value of matching a
particular character in string 1 with a gap.
(gapScore2 is similarly defined).
 Note: D[i][0] and D[0][j] represent
initial conditions.
 Note: there is one more row than the length of string 1.
(likewise for string 2).
 Computing the actual alignment:
 A separate 2D array tracks which term contributed to the
maximum for each (i, j).
 After computing all of D, traceback and obtain path (alignment).
InClass Exercise:
(Solution)
Download this template and
implement the above algorithm. Gap and match costs are given.
Analysis:
 The double forloop tells the story: O(mn) to compute D_{m,n}.
 Note: computing the actual alignment requires tracing back:
O(m + n) time (length of a path).
 Space required: O(mn).
An improvement (in space):
 Genes are often longer than 10,000 bases (letters) long
=> space required is prohibitive.
 First, note that we only need successive rows in computing D
=> space requirement can be reduced to O(n).
 However, O(mn) space is still required to construct
the actual alignment.
 A different approach:
 Consider the "middle" character in string 1
=> the character at position m / 2.
 In the optimal alignment, this character will either align
with some jth character in string 1, or a gap.
=> we can try all possible j values.
 To try these alignments, compute the m/2th row (in
O(mn) time).
 Output the optimal alignment found.
 Recurse on either side of the optimal alignment.
 It is possible to show: O(mn) time and O(n) space.
Local Alignment  The SmithWaterman Algorithm
Consider a global alignment example:
 An alignment is forced across the whole table.
=> each position is scored.
 However, the score in one particular section may be higher
=> local alignments may have higher scores.
 Goal in local alignment: find best possible local alignment.
Two key modifications to global alignment algorithm:
 Starting new alignments:
 As with global, start at top left corner.
 However, if an alignment gets too expensive, discard it and
start a new one.
 Look for best local alignment:
 As with global, fill all matrix entries.
 However, do not count on bottomright as starting point
=> search entire matrix for best "bottomright corner of
local alignment"
To achieve the first part:
 Decide when a local alignment should be abandoned (i.e., when a
new one should start).
=> Typically, when score goes below zero.
 Modify recurrence to avoid belowzero scores:
D_{i,j} 
= 
max (0, C_{i1,j1}, C_{i1,j}, C_{i,j1}) 
 If 0 is the maximum score at (i, j)
=> a new alignment is started.
 Record (i, j) values where new alignments start.
To achieve the second part:
 Search for D_{i,j} with highest score.
 Work backwards from this point.
Scoring  the PAM Approach
What is PAM?
 PAM = Point Accepted Mutations.
 A theory for assigning scores to substitutions (mutations) in proteins.
Consider a group of related proteins and a particular position:
 Suppose the proteins are listed in evolutionary order and aligned.
 We'll use a, b, c], ... to denote amino acids.
 A mutation from a to b can occur in one step or many.
 Let M_{ab} = Pr[an a > b mutation occurs] in
some position.
 Let M_{aa} = Pr[no mutation occurs].
 Let p_{a} = Pr[residue a occurs].
Scoring intuition:
 Consider scoring the substitution a to b.
 Want a high score if M_{ab} is high
=> because of high a > b mutations.
 Want a low score if p_{b} is high
=> because high occurence of b explains occurence of
b (as opposed to mutation).
 Solution: let score be proportional to M_{ab} / p_{b}
PAM scoring:
 Consider mutations in two positions: a > b and c > d.
 Probability of both occuring is proportional to product:
(M_{ab} / p_{b}) * (M_{cd} / p_{d})
 But algorithm requires scores to be additive (for decomposability).
 Take logarithms:
score =
log (M_{ab} / p_{b}) + log (M_{cd} / p_{d})
Evolutionary distance in PAM scoring:
 In comparing a > b in two proteins:
 One possibility: the two proteins are phylogenetically
adjacent
=> direct a > b mutation.
 Another possibility: several intermediate mutations occured:
e.g., a > e > f > b. (evolutionary distance = 3)
e.g., c > g > d. (evolutionary distance = 2)
 Goal: design scoring for a particular distance k.
 What do we want reflected in the score?
 Let M_{ab}^{k} = Pr[a > b in k steps].
 Compute M_{ab}^{k} for each a,
b.
 Score for a > b is:
log (M_{ab}^{k} / p_{b})
 Thus, in PAM250 (a popular one), k = 250.
Note:
 We have only sketched some of the theory: a more rigorous
explanation requires background in Markov chain theory.
 Usually, it is desirable to adjust the "rate of overall
mutation" in the theory.
 Let M'_{aa} = Pr[mutation from a]
 Let M_{aa} = 1  M'_{aa}
 Adjust the ratio M'_{aa} / M_{aa}
 Typical value: 1/100
=> one mutation in every 100 amino acids.
How BLAST Works
Key ideas in BLAST (e.g, protein) searches:
 BLAST computes "similarity", not "alignment".
 Given two protein sequences:
 Find all substrings of length k (typical: k = 4) that occur in both strings.
 Build score based on matches.
 Extend substrings to see if match score can be increased.
 Compute total score when no more extensions are possible.
 Compute BLAST score against all proteins in database.
 Rank order search results by score.