Applications to Biology and Ecology

Module 3 > Biology/Ecology

Objectives

We will look at two very different applications:

Part-A: An example from computational biology: how to look for a protein within a longer sequence.
Part-B: An ecological example: Mapping whale migration

3.50 Video:

Part A: Computational Biology and protein "strings"

What we'll do in this section:

Understand that proteins (an important type of bio molecule) can be represented using strings.
⇒ the strings are called protein sequences.
Perform some types of operations on these protein sequences that are useful in biology.

Let's start by understanding what a protein is.

3.51 Exercise:

Look up proteins - what is a protein and what does it do?

Look up insulin. What is its purpose? What diseases are associated with insulin? What Nobel prizes were associated with insulin?

What we need to know about proteins:

Although a protein has a 3D shape, it can be considered a long chain of "units".
⇒ The 3D shape results from this chain folding on itself.
Each unit in the chain is determined by a particular kind of small (or sub-) molecule called an amino-acid.
There are 20 types of amino acids.

These are the 20:

# Amino Acid Symbol

1 Alanine A

2 Arginine R

3 Asparagine N

4 Aspartic acid D

5 Cysteine C

6 Glutamic acid E

7 Glutamine Q

8 Glycine G

9 Histidine H

10 Isoleucine I

11 Leucine L

12 Lysine K

13 Methionine M

14 Phenylalanine F

15 Proline P

16 Serine S

17 Threonine T

18 Tryptophan W

19 Tyrosine Y

20 Valine V

Think of the 20 amino acids as represented by their (20) letters:
```
      A R N D C E Q G H I L K M F P S T W Y V
    
```
Then, a protein is a "word" made from letters in this 20-letter alphabet.
For example: here is part of the insulin protein:
```
	F V N Q H L C G S H L V E A L Y L V C G E R G F F Y T P K T 
    
```
- It's length is 30. And you can see that each of the 30 letters comes from the protein "alphabet".
- It happens to be called the B-chain in human insulin.

To start with, let's understand a couple of things about insulin:

Insulin is made from a precursor molecule call preproinsulin.
Preproinsulin is made directly from the gene for insulin.

For example, here is the 110-letter sequence for preproinsulin:

MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN

Parts of this are called the A-chain, B-chain, and C-chain respectively.

For example, this is the 30-letter B-chain above:

MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN

And the 21-letter A-chain:

MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN

The sequence in between is called the C-chain:

MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN

The real insulin molecule is actually a protein complex that is formed from the A and B chains.

Let's now work on some string problems related to insulin.

To start with, let's look for the three chains in preproinsulin:

Suppose we are given:
1. The sequence for preproinsulin.
2. The A-chain and B-chain.
The goal: find where in the preproinsulin they occur, and extract the C-chain.

Here is the program:

preproSeq = "MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN"
aChain = "GIVEQCCTSICSLYQLENYCN"
bChain = "FVNQHLCGSHLVEALYLVCGERGFFYTPKT"

# Find where the A and B chains occur in preproinsulin.
aIndex = preproSeq.index(aChain)
bIndex = preproSeq.index(bChain)
print('A chain starts at position', aIndex)    # 89
print('B chain starts at position', bIndex)    # 24

# Extract the C-chain
cIndex = bIndex + len(bChain)
cChain = preproSeq[cIndex:aIndex]
print('C-chain:', cChain)

# RREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKR

Note: we've used a version of the function index() that takes a string as parameter and returns the start index where it occurs.

3.52 Exercise: The following are the preproinsulin sequences for gorilla and owl monkey.

MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN

MALWMHLLPLLALLALWGPEPAPAFVNQHLCGPHLVEALYLVCGERGFFYAPKTRREAEDLQVGQVELGGGSITGSLPPLEGPMQKRGVVDQCCTSICSLYQLQNYCN

Use string equality to see whether they are the same as human preproinsulin. That is, write a short program in my_insulin_test.py to do this testing. Download sequences.txt, which has the above two sequences for copy-paste, as well as the sequences needed for the exercises below. Report your results in your module pdf.

Next, let's consider the problem of approximate string matching:

Suppose we don't know the identity of the A and B chains of the owl monkey.
We do know the human ones.
Thus, we'll solve this problem:
- We're given the preproinsulin sequence of the owl monkey:
```
MALWMHLLPLLALLALWGPEPAPAFVNQHLCGPHLVEALYLVCGERGFFYAPKTRREAEDLQVGQVELGGGSITGSLPPLEGPMQKRGVVDQCCTSICSLYQLQNYCN
  
```
- We know the human B-chain:
```
FVNQHLCGSHLVEALYLVCGERGFFYTPKT
  
```
- We want to know: which substring of the owl-monkey matches the human B-chain most closely.
To do this, we will develop the idea of distance between two strings.

Consider the following computation of string distance:

def distance(a, b):
    # Distance between strings a and b
    m = min(len(a), len(b))
    M = max(len(a), len(b))
    mismatches = 0
    for i in range(0, m):
        if a[i] != b[i]:
            mismatches += 1

    d = mismatches + (M - m)
    return d

3.53 Exercise: Write the above in my_stringdistance.py and apply the distance measure to compare "bold" with each of the following strings: "bald", "boy", "old". Report your results in your module pdf, along with a rationale for the distance function.

About string distance:

In general, there are many definitions of string distance.
The simple definition above failed for a simple "shifted" string.
In fact, in computational biology, there are some fairly complicated definitions that make use of statistics.

We'll use our simple distance measure to find the best B-chain match within the owl monkey chain.

For example, suppose we compare the human B-chain at the 5-th position of the owl-monkey preproinsulin:

MALWMHLLPLLALLALWGPEPAPAFVNQHLCGPHLVEALYLVCGERGFFYAPKTRREAEDLQVGQVELGGGSITGSLPPLEGPMQKRGVVDQCCTSICSLYQLQNYCN
     FVNQHLCGSHLVEALYLVCGERGFFYTPKT

To use the distance method we wrote earlier, we'd have to extract the sub-string from the preproinsulin sequence.

Here's a program that makes the comparison at the 5-th position.

i = 5
owl_substr = owl_monkey_prepro[i:i+len(humanB)]
print(owl_substr)
d = distance(owl_substr, humanB)
print('distance=', d, 'at position', i)

3.54 Exercise: In my_stringdistance2.py, expand on the above code to search all possible positions in the owl-monkey preproinsulin and find the best possible match (with the least distance) for the human B-chain. Print the position where this occurs, and the distance. Report these numbers in your module pdf.

3.55 Exercise: In my_stringdistance3.py, try the same thing with a few non-primates. That is, try to look for the closest match to the human B-chain in in each of the preproinsulin sequences below. Before running your code, which of the three do you expect to produce a result closest to human? Farthest?

Hummingbird:
IQSLPLLALLALSGPGTSHAAVNQHLCGSHLVEALYLVCGERGFFYSPKARRDAEHPLVNGPLHGEVGDLPFQQEEFEKVKRGIVEQCCHNTCSLYQLENYCN

Zebrafish:
MAVWLQAGALLVLLVVSSVSTNPGTPQHLCGSHLVDALYLVCGPTGFFYNPKRDVEPLLGFLPPKSAQETEVADFAFKDHAELIRKRGIVEQCCHKPCSIFELQNYCN

Fruitfly:
SSSSSSKAMMFRSVIPVLLFLIPLLLSAQAANSLRACGPALMDMLRVACPNGFNSMFAKRGTLGLFDYEDHLADLDSSESHHMNSLSSIRRDFRGVVDSCCRKSCSFSTLRAYCDS

3.56 Audio:

Note:

Even when a mismatch looks very difficult, it is possible to identify reasonable matches.
⇒ This is the subject of advanced techniques in computational biology.
More interestingly, such simple sequence comparisons can reveal much about evolution (the subject of phylogenetics) and even human history (through the study of ancient DNA).
We've only scratched the surface of a large field called bioinformatics, with surprisingly many interesting computational problems.
Although our examples have been about protein sequences (with a 20-letter alphabet), the same "string search" ideas more or less apply to DNA sequences (which have a 4-letter alphabet).

Part B: A whale of a story

We will examine a dataset derived from tracking bluefin whales off of the California coast.

Before we start:

Review 2D arrays from Module 0
Review how a set works from Module 1

To start with, let's look at the data, load the data and print its size (number of rows, number of columns):

from datatool import datatool
import numpy

dt = datatool()

dt.load_csv('bluefin_whale_data.csv')

# Extract into 2D array:
D = dt.get_data_as_array()

# Print number of rows, columns:
print('# rows=', D.shape[0], '# columns=', D.shape[1])

# The lat-long from the first row:
print('first lat=', D[0,4], 'first long=', D[0,3])

3.57 Exercise: Download bluefin_whale_data.csv and datatool.py and then in my_bluefin_whales.py, type up the above and run. Examine the CSV file.

Note:

We've extracted the data into a 2D array. Here we used the shape parameter of the array to print the size. Recall:
- shape[0]: number of rows
- shape[1]: number of columns
The array will be useful when we use loops to look through the data.
Notice that the latitude is in the 5th column, while the longitude is in the 4th column.
Although there's a lot of other data, the one that will be of interest is an identifier that distinguishes one whale from another.
- In this case, the 11-th column identifies the whale.

Next, let's count and identify the different whales.

To do this, we will use a set because a set only allows unique elements:

from datatool import datatool
import numpy
import math

dt = datatool()

dt.load_csv('bluefin_whale_data.csv')

D = dt.get_data_as_array()
print('# rows=',D.shape[0], '# columns=', D.shape[1])

# Start with an empty set
whales = set()

# Now add the whale in each row
for i in range(D.shape[0]):
    whales.add(D[i,10])

# Write a line of code to print the size of the set:

3.58 Exercise: Write up the above in my_bluefin_whales2.py and add the line needed to print the size of the set. You should see 13 whales in total.

Next, let's identify the minimum and maximum latitudes to know the range of area where the data was collected:

from datatool import datatool
import numpy
import math

dt = datatool()

dt.load_csv('bluefin_whale_data.csv')

D = dt.get_data_as_array()

# Write code to identify the min and max 
# latitudes and longitudes. The latitude is in 5th column,
# while the longitude is in the 4th column.
min_lat = D[0,4]
max_lat = D[0,4]
min_long = D[0,3]
max_long = D[0,3]

# WRITE YOUR LOOP HERE:


print('min-lat=', min_lat, 'max-lat=', max_lat)
print('min-long=', min_long, 'max-long=', max_long)
# Should print
# min-lat= 31.0522 max-lat= 40.4511
# min-long= -124.9586 max-long= -116.509

# Round the min's down using math.floor() and 
# round the max's up using math.ceil()

# WRITE YOUR CODE HERE:

# Print rounded min's and max's:
print('min-lat=', min_lat, 'max-lat=', max_lat)
print('min-long=', min_long, 'max-long=', max_long)
# Should print
# min-lat= 31 max-lat= 41
# min-long= -125 max-long= -116

3.59 Exercise: Download my_bluefin_whales3.py and fill in the needed code.

Next, let's plot the data:

from datatool import datatool
import numpy

dt = datatool()

dt.load_csv('bluefin_whale_data.csv')

dt.tilemap_attach_col_lat_long('location-lat', 'location-long', 'individual-local-identifier')

# We'll draw a square over the map. 
dt.tilemap_add_line(33, -122.5, 33.5, -122.5)
dt.tilemap_add_line(33, -122.5, 33, -122.0)
dt.tilemap_add_line(33, -122.0, 33.5, -122.0)
dt.tilemap_add_line(33.5, -122.5, 33.5, -122.0)
 
center = dict(lat = 35, lon =-119)
dt.tilemap_czt(center, zoom=5, title='Bluefin whale tracks')

dt.display()

3.60 Exercise: Download and try out my_bluefin_whales4.py. You should see something like

Note:

Each whale has its own color, and one can see the tracks each has made.
We drew a small box to demonstrate drawing a box.
Notice that the box size is a half-latitude in height and half-longitude in width.

Thus far we haven't done much that's computational.

Let's now solve a realistic problem:

We've calculated and rounded (up or down) the min and max of the latitudes and longitudes of the whale locations.
We've also drawn a half-box: a box half-latitude in height and half-longitude in width.
The goal in the problem: find the half-box with the most whale tracks.
Such a box might suggest a good spot for a research vessel looking to retrieve trackers and study the whales.
We will do this by trying all possible half-boxes in the region delimited by the min and max we've already found for latitude and longitude.

The main idea is expressed in this nested loop:

# Outer loop over latitudes
lat = min_lat
while lat < max_lat:
    # Inner loop over longitudes
    lon = min_long
    while lon < max_long:
        # Count occurrences within lat to lat+0.5, lon to lon+0.5
        count = count_occurences(lat, lon)

        # Update highest count so far ... (code not shown) ...

        lon += 0.5             # Inner-loop
    lat += 0.5                 # Outer-loop

3.61 Exercise: Download my_bluefin_whales5.py which has instructions on where to write code. All you need to do is fill in some functions. The desired output is also described.

3.62 Exercise: Download my_bluefin_whales6.py, which will visualize the "best half-box" on a map. Copy over the missing code from the exercise above. You should see a box that's slightly west of Long Beach, CA, and northwest of Catalina Island, which seems to match a description of the local whale-watching tours.

3.63 Video:

Optional further exploration

If you'd like to explore further:

See this TED talk for an example of how biology can be viewed from a digital information point of view.
Read through these two articles on whales from our neighbors at the Smithsonian and at the International Monetary Fund:
- From the Smithsonian
- From the IMF: why whales are critical to solving climate change.
Try this podcast on the mysteries that animal-tracking hopes to solve.

Back to Module 2

#	Amino Acid	Symbol
1	Alanine	A
2	Arginine	R
3	Asparagine	N
4	Aspartic acid	D
5	Cysteine	C
6	Glutamic acid	E
7	Glutamine	Q
8	Glycine	G
9	Histidine	H
10	Isoleucine	I
11	Leucine	L
12	Lysine	K
13	Methionine	M
14	Phenylalanine	F
15	Proline	P
16	Serine	S
17	Threonine	T
18	Tryptophan	W
19	Tyrosine	Y
20	Valine	V