Applications to Biology and Ecology

Module 3 > Biology/Ecology


Objectives

 

We will look at two very different applications:

 

3.50 Video:

 


Part A: Computational Biology and protein "strings"

 

What we'll do in this section:

 

Let's start by understanding what a protein is.
 

3.51 Exercise:

 

What we need to know about proteins:

 

To start with, let's understand a couple of things about insulin:

 

Let's now work on some string problems related to insulin.

To start with, let's look for the three chains in preproinsulin:

 

3.52 Exercise: The following are the preproinsulin sequences for gorilla and owl monkey.

MALWMRLLPLLALLALWGPDPAAAFVNQHLCGSHLVEALYLVCGERGFFYTPKTRREAEDLQVGQVELGGGPGAGSLQPLALEGSLQKRGIVEQCCTSICSLYQLENYCN

MALWMHLLPLLALLALWGPEPAPAFVNQHLCGPHLVEALYLVCGERGFFYAPKTRREAEDLQVGQVELGGGSITGSLPPLEGPMQKRGVVDQCCTSICSLYQLQNYCN
  
Use string equality to see whether they are the same as human preproinsulin. That is, write a short program in my_insulin_test.py to do this testing. Download sequences.txt, which has the above two sequences for copy-paste, as well as the sequences needed for the exercises below. Report your results in your module pdf.
 

Next, let's consider the problem of approximate string matching:

 

Consider the following computation of string distance:

def distance(a, b):
    # Distance between strings a and b
    m = min(len(a), len(b))
    M = max(len(a), len(b))
    mismatches = 0
    for i in range(0, m):
        if a[i] != b[i]:
            mismatches += 1

    d = mismatches + (M - m)
    return d
  
 

3.53 Exercise: Write the above in my_stringdistance.py and apply the distance measure to compare "bold" with each of the following strings: "bald", "boy", "old". Report your results in your module pdf, along with a rationale for the distance function.
 

About string distance:

 

We'll use our simple distance measure to find the best B-chain match within the owl monkey chain.

 

3.54 Exercise: In my_stringdistance2.py, expand on the above code to search all possible positions in the owl-monkey preproinsulin and find the best possible match (with the least distance) for the human B-chain. Print the position where this occurs, and the distance. Report these numbers in your module pdf.
 

3.55 Exercise: In my_stringdistance3.py, try the same thing with a few non-primates. That is, try to look for the closest match to the human B-chain in in each of the preproinsulin sequences below. Before running your code, which of the three do you expect to produce a result closest to human? Farthest?

Hummingbird:
IQSLPLLALLALSGPGTSHAAVNQHLCGSHLVEALYLVCGERGFFYSPKARRDAEHPLVNGPLHGEVGDLPFQQEEFEKVKRGIVEQCCHNTCSLYQLENYCN

Zebrafish:
MAVWLQAGALLVLLVVSSVSTNPGTPQHLCGSHLVDALYLVCGPTGFFYNPKRDVEPLLGFLPPKSAQETEVADFAFKDHAELIRKRGIVEQCCHKPCSIFELQNYCN

Fruitfly:
SSSSSSKAMMFRSVIPVLLFLIPLLLSAQAANSLRACGPALMDMLRVACPNGFNSMFAKRGTLGLFDYEDHLADLDSSESHHMNSLSSIRRDFRGVVDSCCRKSCSFSTLRAYCDS
  
 

3.56 Audio:
 

Note:

 


Part B: A whale of a story

 

We will examine a dataset derived from tracking bluefin whales off of the California coast.

Before we start:

 

To start with, let's look at the data, load the data and print its size (number of rows, number of columns):

from datatool import datatool
import numpy

dt = datatool()

dt.load_csv('bluefin_whale_data.csv')

# Extract into 2D array:
D = dt.get_data_as_array()

# Print number of rows, columns:
print('# rows=', D.shape[0], '# columns=', D.shape[1])

# The lat-long from the first row:
print('first lat=', D[0,4], 'first long=', D[0,3])
  
 

3.57 Exercise: Download bluefin_whale_data.csv and datatool.py and then in my_bluefin_whales.py, type up the above and run. Examine the CSV file.
 

Note:

  • We've extracted the data into a 2D array. Here we used the shape parameter of the array to print the size. Recall:
    • shape[0]: number of rows
    • shape[1]: number of columns

  • The array will be useful when we use loops to look through the data.

  • Notice that the latitude is in the 5th column, while the longitude is in the 4th column.

  • Although there's a lot of other data, the one that will be of interest is an identifier that distinguishes one whale from another.
    • In this case, the 11-th column identifies the whale.
 

Next, let's count and identify the different whales.

To do this, we will use a set because a set only allows unique elements:

from datatool import datatool
import numpy
import math

dt = datatool()

dt.load_csv('bluefin_whale_data.csv')

D = dt.get_data_as_array()
print('# rows=',D.shape[0], '# columns=', D.shape[1])

# Start with an empty set
whales = set()

# Now add the whale in each row
for i in range(D.shape[0]):
    whales.add(D[i,10])

# Write a line of code to print the size of the set:
  
 

3.58 Exercise: Write up the above in my_bluefin_whales2.py and add the line needed to print the size of the set. You should see 13 whales in total.
 

Next, let's identify the minimum and maximum latitudes to know the range of area where the data was collected:

from datatool import datatool
import numpy
import math

dt = datatool()

dt.load_csv('bluefin_whale_data.csv')

D = dt.get_data_as_array()

# Write code to identify the min and max 
# latitudes and longitudes. The latitude is in 5th column,
# while the longitude is in the 4th column.
min_lat = D[0,4]
max_lat = D[0,4]
min_long = D[0,3]
max_long = D[0,3]

# WRITE YOUR LOOP HERE:


print('min-lat=', min_lat, 'max-lat=', max_lat)
print('min-long=', min_long, 'max-long=', max_long)
# Should print
# min-lat= 31.0522 max-lat= 40.4511
# min-long= -124.9586 max-long= -116.509

# Round the min's down using math.floor() and 
# round the max's up using math.ceil()

# WRITE YOUR CODE HERE:

# Print rounded min's and max's:
print('min-lat=', min_lat, 'max-lat=', max_lat)
print('min-long=', min_long, 'max-long=', max_long)
# Should print
# min-lat= 31 max-lat= 41
# min-long= -125 max-long= -116
  
 

3.59 Exercise: Download my_bluefin_whales3.py and fill in the needed code.
 

Next, let's plot the data:

from datatool import datatool
import numpy

dt = datatool()

dt.load_csv('bluefin_whale_data.csv')

dt.tilemap_attach_col_lat_long('location-lat', 'location-long', 'individual-local-identifier')

# We'll draw a square over the map. 
dt.tilemap_add_line(33, -122.5, 33.5, -122.5)
dt.tilemap_add_line(33, -122.5, 33, -122.0)
dt.tilemap_add_line(33, -122.0, 33.5, -122.0)
dt.tilemap_add_line(33.5, -122.5, 33.5, -122.0)
 
center = dict(lat = 35, lon =-119)
dt.tilemap_czt(center, zoom=5, title='Bluefin whale tracks')

dt.display()
  
 

3.60 Exercise: Download and try out my_bluefin_whales4.py. You should see something like

 

Note:

  • Each whale has its own color, and one can see the tracks each has made.

  • We drew a small box to demonstrate drawing a box.

  • Notice that the box size is a half-latitude in height and half-longitude in width.
 

Thus far we haven't done much that's computational.

Let's now solve a realistic problem:

  • We've calculated and rounded (up or down) the min and max of the latitudes and longitudes of the whale locations.

  • We've also drawn a half-box: a box half-latitude in height and half-longitude in width.

  • The goal in the problem: find the half-box with the most whale tracks.

    Such a box might suggest a good spot for a research vessel looking to retrieve trackers and study the whales.

  • We will do this by trying all possible half-boxes in the region delimited by the min and max we've already found for latitude and longitude.

  • The main idea is expressed in this nested loop:
    # Outer loop over latitudes
    lat = min_lat
    while lat < max_lat:
        # Inner loop over longitudes
        lon = min_long
        while lon < max_long:
            # Count occurrences within lat to lat+0.5, lon to lon+0.5
            count = count_occurences(lat, lon)
    
            # Update highest count so far ... (code not shown) ...
    
            lon += 0.5             # Inner-loop
        lat += 0.5                 # Outer-loop
      
 

3.61 Exercise: Download my_bluefin_whales5.py which has instructions on where to write code. All you need to do is fill in some functions. The desired output is also described.
 

3.62 Exercise: Download my_bluefin_whales6.py, which will visualize the "best half-box" on a map. Copy over the missing code from the exercise above. You should see a box that's slightly west of Long Beach, CA, and northwest of Catalina Island, which seems to match a description of the local whale-watching tours.
 

3.63 Video:

 


Optional further exploration

 

If you'd like to explore further:

  • See this TED talk for an example of how biology can be viewed from a digital information point of view.

  • Read through these two articles on whales from our neighbors at the Smithsonian and at the International Monetary Fund:

  • Try this podcast on the mysteries that animal-tracking hopes to solve.


Back to Module 2



© 2020, Rahul Simha