The Needleman-Wunsch Algorithm

We are going to develop a recurrence relation for the score of the best alignment between two sequences $a_1, a_2, \ldots, a_n$ and $b_1, b_2, \ldots, b_m$ using the principle of mathematical induction.

This is a common, useful, and reliable way to develop algorithms. We will write $A(k,l)$ to denote the best alignment score between the prefix sequences $a_1, a_2, \ldots, a_k$ and $b_1, b_2, \ldots, b_l$. So, assume that we know $A(k,l)$ for every pair $(k,l)$ that precedes the pair $(i,j)$. For our purposes, we will say that $(k,l)$ precedes $(i,j)$ if either ($k < i$ and $l \le j$) or ($l < j$ and $k \le i$). In other words, if you think of $(i,j)$ and $(k,l)$ as points in the Cartesian plane, then $(k,l)$ precedes $(i,j)$ if it lies below and to the left, where at most one of "below" and "to the left" may be replaced by equality.
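The "precedes" relation can be written down as a one-line predicate. The following is only a sketch: the class and method names are made up for illustration and are not taken from NW.java.

```java
class PrecedesSketch {
    // True when (k,l) precedes (i,j): (k,l) is below and to the left of (i,j),
    // with at most one of the two coordinates allowed to be equal.
    static boolean precedes(int k, int l, int i, int j) {
        return (k < i && l <= j) || (l < j && k <= i);
    }

    public static void main(String[] args) {
        System.out.println(precedes(1, 2, 2, 2)); // left of (2,2), same height
        System.out.println(precedes(2, 2, 2, 2)); // a pair does not precede itself
        System.out.println(precedes(3, 1, 2, 2)); // below, but to the right
    }
}
```

Note that the condition amounts to saying $k \le i$ and $l \le j$ with $(k,l) \ne (i,j)$.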

We proceed to derive a recursive equation for $A(i,j)$.

How does the best alignment for $a_1, a_2, \ldots, a_i$ and $b_1, b_2, \ldots, b_j$ relate to its predecessors? There are three possible scenarios:

Alignment for $a_1, a_2, \ldots, a_{i-1}$ and $b_1, b_2, \ldots, b_{j-1}$ followed by $a_i$ matched or mismatched with $b_j$
Alignment for $a_1, a_2, \ldots, a_i$ and $b_1, b_2, \ldots, b_{j-1}$ followed by a gap aligned with $b_j$
Alignment for $a_1, a_2, \ldots, a_{i-1}$ and $b_1, b_2, \ldots, b_j$ followed by a gap aligned with $a_i$

Pictorially, these three cases look like:

```
PREVIOUS  x
ALIGNMENT y
```
```
PREVIOUS  -
ALIGNMENT y
```
```
PREVIOUS  x
ALIGNMENT -
```

where we've used x and y to denote $a_i$ and $b_j$.

Now let's do some analysis for each of these cases:

```
PREVIOUS  x
ALIGNMENT y
```
In this case, the best previous alignment score is (using our inductive hypothesis) $A(i-1,j-1)$. We add to that either the MATCH score (if x = y) or the MISMATCH score (otherwise). Let's denote that additional amount by $s(x,y)$ or, removing our abbreviation, $s(a_i,b_j)$. The calculated new alignment score is then $A(i-1,j-1) + s(a_i,b_j)$.
```
PREVIOUS  -
ALIGNMENT y
```
In this case, the best previous alignment score is (again using the induction hypothesis) $A(i,j-1)$. We need to add on the gap penalty, which we denote by $g$. The calculated new alignment score is therefore $A(i,j-1) + g$.
```
PREVIOUS  x
ALIGNMENT -
```
By a similar argument, the calculated new alignment score in this case is $A(i-1,j) + g$.

Since we want the highest possible score, we must choose the case that leads to the largest value for the calculated new alignment. $A(i,j)$ is thus the maximum of

$A(i-1,j-1) + s(a_i,b_j)$
$A(i,j-1) + g$
$A(i-1,j) + g$

We have derived the recurrence

$A(i,j) = \max\{A(i-1,j-1)+s(a_i,b_j),\ A(i,j-1)+g,\ A(i-1,j)+g\}$
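As a quick check of the recurrence, here is one cell worked by hand. The scoring values are the first scheme from Exercise 6 (4 for a match, -1 for a mismatch, $g = -2$); the three neighbouring scores are made-up numbers chosen only for illustration. Suppose $a_i = b_j$, $A(i-1,j-1) = 3$, $A(i,j-1) = 5$ and $A(i-1,j) = 1$. Then

$A(i,j) = \max\{3 + 4,\ 5 - 2,\ 1 - 2\} = \max\{7,\ 3,\ -1\} = 7$

so in this hypothetical cell the best score comes from the first case, where $a_i$ is paired with $b_j$.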

This immediately suggests the program fragment:

int A(int i, int j) {
  if ...          // ... denotes our yet to be determined base cases
  then return ... // to be determined
  else return max(A(i-1,j-1)+s(a[i],b[j]), A(i,j-1)+g, A(i-1,j)+g);
}
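For readers who want something they can compile, here is a minimal Java sketch of that fragment. It rests on assumptions that the text above has not yet settled: the base cases used here are the common choice $A(i,0) = i \cdot g$ and $A(0,j) = j \cdot g$ (a prefix aligned against nothing is all gaps), the scores are the first scheme from Exercise 6, and the class and method names are invented for illustration. It is not NW.java.

```java
class NwRecursiveSketch {
    // Scores from the first scheme in Exercise 6.
    static final int MATCH = 4, MISMATCH = -1, GAP = -2;

    // s(x,y): the match or mismatch score for one aligned pair.
    static int s(char x, char y) {
        return x == y ? MATCH : MISMATCH;
    }

    // A(i,j) for the prefixes a_1..a_i and b_1..b_j.
    // Java strings are 0-indexed, so a_i is a.charAt(i - 1).
    static int A(String a, String b, int i, int j) {
        if (i == 0) return j * GAP; // assumed base case: only gaps
        if (j == 0) return i * GAP; // assumed base case: only gaps
        int diag = A(a, b, i - 1, j - 1) + s(a.charAt(i - 1), b.charAt(j - 1));
        int left = A(a, b, i, j - 1) + GAP;
        int up = A(a, b, i - 1, j) + GAP;
        return Math.max(diag, Math.max(left, up));
    }

    public static void main(String[] args) {
        String a = "AC", b = "A";
        System.out.println(A(a, b, a.length(), b.length()));
    }
}
```

Plain recursion like this recomputes the same $A(k,l)$ many times, so its running time grows exponentially with the sequence lengths. That is the reason for storing the values in a matrix instead.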

In view of what we've just seen, look carefully at my program NW.java. It is a simple implementation of Needleman-Wunsch. Next lab, you will be modifying and extending it. This lab, you'll run it to check your hand-derived arrays. For the next exercises, at least one member of the team should work on producing the matrices by hand, and at least one should adapt and run the program to check the handiwork of their team members.

This program outputs what we'll call a "dynamic programming matrix" that can be used to produce a corresponding alignment (we'll do that next lab). The next exercises expect you to generate small dynamic programming matrices by hand and check them by program.
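To make the idea of a dynamic programming matrix concrete, here is a small sketch that fills one in row by row, so that every predecessor of $(i,j)$ is already known when $A(i,j)$ is computed. It is an illustration only, not NW.java: the class name, the method signature, and the base cases ($A(i,0) = i \cdot g$, $A(0,j) = j \cdot g$) are assumptions.

```java
import java.util.Arrays;

class NwMatrixSketch {
    // Returns the (n+1) x (m+1) matrix whose entry [i][j] is A(i,j).
    static int[][] fill(String a, String b, int match, int mismatch, int gap) {
        int n = a.length(), m = b.length();
        int[][] A = new int[n + 1][m + 1]; // A[0][0] starts at 0
        // Assumed base cases: a prefix aligned against nothing is all gaps.
        for (int i = 1; i <= n; i++) A[i][0] = i * gap;
        for (int j = 1; j <= m; j++) A[0][j] = j * gap;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int s = a.charAt(i - 1) == b.charAt(j - 1) ? match : mismatch;
                int diag = A[i - 1][j - 1] + s; // a_i paired with b_j
                int left = A[i][j - 1] + gap;   // gap aligned with b_j
                int up = A[i - 1][j] + gap;     // gap aligned with a_i
                A[i][j] = Math.max(diag, Math.max(left, up));
            }
        }
        return A;
    }

    public static void main(String[] args) {
        // Tiny example with 4 for a match, -1 for a mismatch, -2 for an indel.
        for (int[] row : fill("AC", "AG", 4, -1, -2)) {
            System.out.println(Arrays.toString(row));
        }
    }
}
```

For the tiny input AC versus AG, the rows come out as [0, -2, -4], [-2, 4, 2] and [-4, 2, 3], which is small enough to verify by hand against the recurrence.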

Exercise 6

Produce dynamic programming matrices for ACCTGCTAC and TCCAGCTTC using

4 for a match, -1 for a mismatch, -2 for an indel. Check your calculations by running NW.java.
5 for a match, 0 for a mismatch, -4 for an indel. Check your calculations by modifying NW.java (three small changes are all you need) and running it.

Deliverables: Show me your matrices and the outputs from the programs.

The distance measure is significantly different. Instead of maximizing, it takes the minimum of the northwest-, north-, and west-derived scores. You will need to change more than just three lines of NW.java to answer the next (and final (whew!!)) exercise.
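The switch from maximizing a score to minimizing a distance can be sketched as follows. This is a hedged illustration rather than a modification of NW.java: the class and method names are invented, and the base cases ($A(i,0) = i$, $A(0,j) = j$, one indel per unmatched character) are an assumption consistent with a cost of +1 per indel.

```java
class NwDistanceSketch {
    // Returns the matrix of distances: 0 for a match,
    // +1 for a mismatch, +1 for an indel.
    static int[][] fill(String a, String b) {
        int n = a.length(), m = b.length();
        int[][] A = new int[n + 1][m + 1];
        // Assumed base cases: each unmatched character costs one indel.
        for (int i = 1; i <= n; i++) A[i][0] = i;
        for (int j = 1; j <= m; j++) A[0][j] = j;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int northwest = A[i - 1][j - 1] + cost;
                int west = A[i][j - 1] + 1;
                int north = A[i - 1][j] + 1;
                // Minimize instead of maximize: smaller distance is better.
                A[i][j] = Math.min(northwest, Math.min(west, north));
            }
        }
        return A;
    }

    public static void main(String[] args) {
        String a = "ACCT", b = "ACT";
        System.out.println(fill(a, b)[a.length()][b.length()]);
    }
}
```

Compared with the scoring version, three things change: the per-cell costs, the base cases, and the use of a minimum in place of a maximum.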

Exercise 7

Produce a dynamic programming matrix for ACCTGCTAC and TCCAGCTTC using the distance measure (0 for a match, +1 for either a mismatch or an indel). Check your answer by modifying NW.java and running it.

Deliverable: Show your matrix.

rhyspj@gwu.edu