Week 12: NumPy

Reading: Python for Data Analysis Chapter 4, through “Boolean Indexing”

Notes

The NumPy Array

Python did not originally have dedicated features for numerical computing: these were added later in a library called NumPy (or numpy).

Check to see if you have numpy installed by importing it:

import numpy as np

The convention import numpy as np exists entirely to save us some typing. It is widespread: almost every reference on numpy will use it, so we will use it here.


The basic data type in numpy is the array. It’s similar to a list, but better for doing math.

Recall that the + operator between lists concatenates lists:

x = [2, 3, 4]
y = [1, 2, 3]
x + y
[2, 3, 4, 1, 2, 3]

We create numpy arrays by using the np.array function called on a list:

x_arr = np.array(x)
type(x)
list
type(x_arr)
numpy.ndarray

Note: ndarray means “n-dimensional array” - we will just call it an array.


Arrays are very useful for mathematical operations. Note how the + operator adds the two arrays, element by element:

y_arr = np.array(y)
x_arr
array([2, 3, 4])
y_arr
array([1, 2, 3])
x_arr + y_arr
array([3, 5, 7])
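
For comparison, getting the same element-by-element sum from plain lists takes an explicit loop. The sketch below is only an illustration: the names list_sum and array_sum are ours, not part of numpy.

```python
import numpy as np

x = [2, 3, 4]
y = [1, 2, 3]

# With lists, we have to pair up the elements ourselves
list_sum = []
for j in range(len(x)):
    list_sum.append(x[j] + y[j])

# With arrays, the + operator does the pairing for us
array_sum = np.array(x) + np.array(y)

print(list_sum)   # [3, 5, 7]
print(array_sum)  # [3 5 7]
```

Both approaches give the same values; the array version is shorter and avoids writing the loop by hand.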

Arrays also support other mathematical operations:

x_arr - y_arr
array([1, 1, 1])
x_arr * y_arr
array([ 2,  6, 12])
x_arr / y_arr
array([2.        , 1.5       , 1.33333333])

Arrays can be used for math with regular numbers (scalars):

x_arr + 2
array([4, 5, 6])

The number is broadcast across the entire array.
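
Broadcasting a scalar is not limited to addition. As a small sketch (the variable names times_ten, squared, and bigger_than_two are made up for this example), the same idea applies to multiplication, powers, and comparisons:

```python
import numpy as np

x_arr = np.array([2, 3, 4])

times_ten = x_arr * 10       # each element multiplied by 10
squared = x_arr ** 2         # each element squared
bigger_than_two = x_arr > 2  # each element compared with 2

print(times_ten)        # [20 30 40]
print(squared)          # [ 4  9 16]
print(bigger_than_two)  # [False  True  True]
```

The comparison returns an array of bools, one per element, which is the starting point for the “Boolean Indexing” section of the reading.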

If you’ve had a linear algebra class, you might want to use numpy arrays for matrix operations.

  • The @ operator performs matrix multiplication
  • The .T attribute of an array is its transpose
np.array([[1, 2],[3, 4]]) @ np.array([[1, 2]]).T
array([[ 5],
       [11]])
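
It is easy to confuse @ with *. As a sketch using two small example matrices of our own choosing, * multiplies matching entries while @ performs matrix multiplication:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[1, 0], [0, 1]])  # the 2x2 identity matrix

elementwise = a * b      # multiplies matching entries
matrix_product = a @ b   # matrix multiplication

print(elementwise)
# [[1 0]
#  [0 4]]
print(matrix_product)
# [[1 2]
#  [3 4]]
```

Multiplying by the identity matrix with @ gives back the original matrix, while * keeps only the diagonal entries.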

Numpy arrays can also be indexed similarly to lists:

x_arr[1]
3
x_arr[1] += 1
x_arr[1]
4
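
Negative indices and slices also work as they do with lists, and arrays additionally accept an array of bools as an index. A brief sketch (variable names are only for illustration):

```python
import numpy as np

x_arr = np.array([2, 3, 4])

last = x_arr[-1]         # negative index counts from the end
first_two = x_arr[:2]    # slice, just like a list
big = x_arr[x_arr > 2]   # boolean indexing: keep elements where the test is True

print(last)       # 4
print(first_two)  # [2 3]
print(big)        # [3 4]
```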

Numpy arrays have some substantial differences from lists. For instance, you cannot use .append with a numpy array.
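
If you do need a longer array, numpy provides np.append as a function of the library rather than a method of the array. A minimal sketch (the name longer is ours):

```python
import numpy as np

x_arr = np.array([2, 3, 4])

# np.append returns a new array and leaves the original unchanged
longer = np.append(x_arr, 5)

print(x_arr)   # [2 3 4]
print(longer)  # [2 3 4 5]

# The array itself has no .append method
print(hasattr(x_arr, "append"))  # False
```

Because np.append builds a new array every time, growing an array one element at a time in a loop is usually slower than appending to a list and converting once at the end.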

Randomness and Simulation

Many early and contemporary uses of computing involve simulation: calculating a value approximately by simulating events. Events can be simulated through randomness.

We can generate a random number with numpy:

np.random.random()
0.19961670868357984

np.random.random() gives us a random number that is at least 0 and less than 1.

Let’s use this to create a “coin”:

coin.py
import numpy as np

def flip():
    rand_number = np.random.random()
    if rand_number > 0.5:
        return "Heads"
    else:
        return "Tails"

print(flip())
Tails

Let’s test our “coin” (include all of this in one file):

trials = 10000
results = 0
for j in range(trials):
    if flip() == "Heads":
        results += 1
print(results/trials)
0.4991

About 50% - it works!
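
The same test can be written without a loop, because np.random.random accepts a size and returns an array of random numbers. This is only a sketch of an alternative, not a replacement for flip(); the names flips, heads, and fraction_heads are ours.

```python
import numpy as np

trials = 10000
flips = np.random.random(trials)       # an array of 10000 random numbers
heads = flips > 0.5                    # an array of True/False values
fraction_heads = heads.sum() / trials  # True counts as 1 when summed

print(fraction_heads)  # close to 0.5; the exact value changes every run
```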

We can also randomly choose an element from a list:

np.random.choice(["bricks", "lumber", "cement"])
'bricks'

Let’s use randomness to simulate a simple problem: estimating the value of \(\pi\).

Assume:

  • We know that the equation of a circle centered at the origin is \(x^2 + y^2 = r^2\)
  • We know the area of a square is \(s^2\)
  • We know the area of a circle is \(\pi \cdot r^2\)
  • We don’t know the value of \(\pi\)

First, let’s use our random number generator to get values in between -1 and 1.

By default, np.random.random() gives values between 0 and 1. To get values between -1 and 1:

  • Double the default random values
    • This places them between 0 and 2 instead of 0 and 1
  • Subtract one from the doubled value
    • This places them between -1 and 1
( np.random.random() * 2 ) - 1
0.7576961342726714
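
One way to convince ourselves that the scaling works is to generate many values at once and look at the smallest and largest. A quick sketch (the name values is ours):

```python
import numpy as np

# 1000 random numbers, doubled and shifted down by one
values = np.random.random(1000) * 2 - 1

print(values.min())  # never below -1
print(values.max())  # always below 1
```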

Now let’s write a function that generates one point and returns it:

def generate_point():
    x = np.random.random() * 2 - 1
    y = np.random.random() * 2 - 1
    return x, y
generate_point()
(-0.1908546794579864, -0.5365162596453161)

We can use another function to check if a point is inside the circle:

def check_point(x, y):
    if x**2 + y**2 <= 1:
        return True
    else:
        return False
check_point(0, 0)
True
check_point(1, 1)
False

Now let’s generate a lot of points, check all of them, and count the ones inside the circle:

num_points = 1000000
in_circle = 0
for j in range(num_points):
    x, y = generate_point()
    result = check_point(x, y)
    if result:
        in_circle += 1
print(in_circle)
785222

Finally, let’s check the ratio of points in the circle to total points. The circle has radius \(r = 1\) and sits inside a “square” of side \(2 \cdot r = 2\), so the ratio of their areas is: \[\frac{\pi \cdot r^2}{(2\cdot r)^2} = \frac{\pi \cdot r^2}{4 \cdot r^2} = \frac{\pi}{4}\] Multiplying the ratio by four therefore gives an estimate of \(\pi\).

ratio = in_circle/num_points
print(ratio * 4)
3.140888

The answer is reasonably close to \(\pi\). The result is random, so it changes from run to run, but using more points tends to make it more accurate.
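
The whole estimate can also be written with arrays instead of a loop. This is a sketch of one possible alternative, assuming the same approach as above (points in the square from -1 to 1, circle of radius 1); the names inside and pi_estimate are ours.

```python
import numpy as np

num_points = 1000000

# Generate all the x and y values at once, each between -1 and 1
x = np.random.random(num_points) * 2 - 1
y = np.random.random(num_points) * 2 - 1

# An array of bools: True where the point is inside the circle
inside = x**2 + y**2 <= 1

# True counts as 1 when summed, so inside.sum() is the number of points in the circle
pi_estimate = 4 * inside.sum() / num_points

print(pi_estimate)  # close to 3.14; the exact value changes every run
```

On most machines this runs noticeably faster than the loop, because numpy does the arithmetic on the whole array at once.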

Plotting

We will use Python to plot charts of our results, using a library called seaborn.

Plotting is a good way to communicate numerical information visually. We’ll show you a few things about plotting, but this course won’t require you to make plots or test you on plotting.


To follow along, you can install seaborn by typing pip install seaborn at your terminal.

Plotting with Python is much more powerful and expressive than plotting with a spreadsheet program such as Microsoft Excel. We will present an ‘extra’ lesson on plotting at the end of the course.

Let’s plot the results. First, we need to import the plotting libraries:

import seaborn as sns
import matplotlib.pyplot as plt

Next, we need to remember the x and y values we generated, so we modify our loop to do so:

num_points = 1000000
x_circ, y_circ = [], [] # lists for values in circle
x_no_circ, y_no_circ = [], [] # lists for values out of circle
for j in range(num_points):
    x, y = generate_point()
    result = check_point(x, y)
    if result:
        x_circ.append(x)
        y_circ.append(y)
    else:
        x_no_circ.append(x)
        y_no_circ.append(y)

We can check the result: this time, we’ll look at the length of one of the lists of “in circle” values:

ratio = len(x_circ)/num_points
print(ratio * 4)
3.140604

Now we’ll plot the result:

sns.set_theme(rc={'figure.figsize':(6,6)})
sns.scatterplot(x=x_circ, y=y_circ)
sns.scatterplot(x=x_no_circ, y=y_no_circ)
plt.show()

Practice

Practice Problem 12.1

Write a function to_numpy that takes as argument a list and returns:

  • A numpy array of the same values, if every element of the list is numeric
  • The bool False if any element of the list is non-numeric.

Practice Problem 12.2

Write a function compare_lists that takes two lists of ints and compares each pair of elements, returning a list of bools corresponding to whether the first list’s element is greater:

  • compare_lists([2, 3, 4], [3, 1, 2]) returns [False, True, True]
    • 2 is not greater than 3, 3 is greater than 1, 4 is greater than 2

Practice Problem 12.3

Write a function list_ceil that takes two arguments:

  • The first argument is a list of ints
  • The second argument is a single int

The function should return a new list, identical to the original list, except that any int greater than the second argument is replaced by the second argument.

  • list_ceil([3, 5, 2, 1], 4) returns [3, 4, 2, 1]
  • list_ceil([-1, 0, 6, 1], 2) returns [-1, 0, 2, 1]

Practice Problem 12.4

Write a function array_mean that takes as argument a numpy array, calculates its mean, subtracts the mean from every element, and returns the resulting array.

Practice Problem 12.5

Estimate the value of \(\pi\) with a circle of radius 2 (our example had radius 1).

Homework

  • Homework problems should always be your individual work. Please review the collaboration policy and ask the course staff if you have questions.

  • Double check your file names and return values. These need to be exact matches for you to get credit.

  • This homework includes 20 pts bonus - you can earn extra credit on this assignment.

Homework Problem 12.1 (30 pts)

Write a function divider that takes as argument two lists of numbers:

  • If the second list does not contain any zeroes, divide each element of the first list by its corresponding element in the second, and return the result as a numpy array
  • If the second list contains any zeroes, return False

Examples:

  • divider([2, 3, 4], [1, 3, 2]) returns numpy array [2., 1., 2.]
    • \(\frac{2}{1} = 2\)
    • \(\frac{3}{3} = 1\)
    • \(\frac{4}{2} = 2\)
  • divider([4, 5, 1], [3, 0, 2]) returns False
    • The second list contains a zero

Submit as divider.py.

Homework Problem 12.2 (30 pts)

Write a function absolute_difference that takes as argument two lists of numbers.

Return a numpy array of the absolute difference between each pair of numbers. The absolute difference is the larger of the two minus the smaller of the two.

Examples:

  • absolute_difference([2, 3, 4], [1, 3, 2]) returns numpy array [1, 0, 2]
    • \(2 - 1 = 1\)
    • \(3 - 3 = 0\)
    • \(4 - 2 = 2\)
  • absolute_difference([4, 5, 1], [3, 0, 2]) returns numpy array [1, 5, 1]
    • \(4 - 3 = 1\)
    • \(5 - 0 = 5\)
    • \(2 - 1 = 1\)

Submit as absolute_difference.py.

Homework Problem 12.3 (40 pts)

A list of integers might have some integers in sequential order:

  • [2, 3, 4, 1, 5, 6, 8] has 2, 3, and 4 in sequential order, and 5 and 6 in sequential order
  • [4, 7, 8, 9, 2] has 7, 8, and 9 in sequential order

An alternate representation of such a list would be to capture the sequences with the first and last element of each sequence by using a list within a list:

  • [2, 3, 4, 1, 5, 6, 8] could be represented as [[2, 4], 1, [5, 6], 8]
  • [4, 7, 8, 9, 2] could be represented as [4, [7, 9], 2]
  • [1, 3, 4, 5, 6, 7, 8, 9, 10] could be represented as [1, [3, 10]]

Write a function compressor that takes a list of integers and returns a list in this alternate representation.

Submit as compressor.py.

Homework Problem 12.4 (20 pts)

Write a function that reverses the compressor function from the previous problem: take the alternate representation and return the original list representation.

Submit as compressor_reverser.py.