Module 1: Assorted Advanced Topics

Objectives

The goal of this module is to touch upon a few advanced topics that will be directly useful in the next module.

Note: These topics are introduced for your understanding and for use in the next module's exercises; they are not tested on the final exam.

1.0 Advanced topic: tuples

In this section and the ones that follow, we'll touch lightly upon a few topics that are "not on the final exam" but will be useful in general.

Some of these concepts will be directly used in the next module.

Suppose we want to write a function that computes both the square and cube of a number:

One option is to write two separate functions:

def square(x):
    return x*x

def cube(x):
    return x*x*x

x = 5
print(x, square(x), cube(x))

We can alternatively write one function that computes and returns two things:

def do_both(x):
    square = x*x
    cube = square*x 
    return (square, cube)

x = 5
(y, z) = do_both(x)
print(x, y, z)

Notice that the return statement returns a pair of values:
```
    return (square, cube)
  
```
And that the pair is enclosed in parentheses.
And notice that, since two values are being returned, we need a pair to capture the return values:
```
(y, z) = do_both(x)
  
```

We can go beyond a pair to any number of such "grouped" variables:

def do_more(x):
    square = x*x
    cube = square*x 
    fourth = cube*x
    fifth = fourth*x
    return (square, cube, fourth, fifth)

x = 5
(a, b, c, d) = do_more(x)
print(x, a, b, c, d)

Such a grouping of values is called a tuple.

Tuples are similar to lists in many ways, but different in one crucial aspect.

First, let's examine how to write the same do_both() function above but using lists:

def do_both_list(x):
    square = x*x
    cube = square*x 
    return [square, cube]

x = 5
[y, z] = do_both_list(x)
print(x, y, z)

This works just fine.

Another way in which a tuple is like a list is in using square-brackets and position indices to access individual elements, as in:

# List version:
L = do_both_list(x)
print(L[0], L[1])      # L[0] has the square, L[1] has the cube

# Tuple version:
t = do_both(x)
print(t[0], t[1])      # t[0] has the square, t[1] has the cube

However, here's the difference:

# List version:
L = do_both_list(x)
L[0] = 0              # This is allowed

# Tuple version:
t = do_both(x)
t[0] = 0              # This is NOT allowed

Thus, you can replace a list element but you cannot replace a tuple element.

This is in fact a bit subtle, as this example shows:
```
x = 3
y = 4
t = (x, y)   # The tuple's value is now fixed.
print(t)     # (3, 4)
x = 2
print(t)     # (3, 4)
  
```
Once the tuple is instantiated (that's the technical term for "made"), its value cannot be changed: the tuple captured the value 3, so assigning 2 to x afterwards has no effect on t.
You can of course assign a different tuple value to a tuple variable as in:
```
t = (1, 2)
print(t)
t = (3, 4)
print(t)
  
```
Here, we're simply replacing one fixed-value tuple with another.
Tuples are therefore said to be an immutable type (strings are immutable too).
Why use tuples at all? One reason is to let programmers signal clearly that a group of values shouldn't be changed.
This is convenient for mathematical tuples, like the coordinates of a point on a graph, which are likewise fixed once written down.
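
To see what "not allowed" looks like in practice, here is a small sketch (the names point and moved are made up for illustration). Python raises a TypeError when code tries to replace a tuple element, and one common workaround is to build a new tuple instead:

```python
point = (3, 4)

try:
    point[0] = 0                 # Attempt to replace a tuple element
except TypeError as err:
    message = str(err)           # Python's explanation of the refusal
    print('Not allowed:', message)

# The original tuple is untouched; we make a new tuple instead.
moved = (0, point[1])
print(point)     # (3, 4)
print(moved)     # (0, 4)
```

The try/except here is only a way to show the error without stopping the program; in ordinary code you would simply avoid assigning to a tuple element.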

Tuples can themselves be collected into lists and other data structures.

This is very useful in working with points (the mathematical point you draw with coordinates) and other mathematical structures that need more than one number to describe.

For example, here's a program that, given a list of points, finds the leftmost point (the one with the least x value).

def leftmost(L):
    leftmost_guess = L[0]
    for q in L:
        if q[0] < leftmost_guess[0]:
            leftmost_guess = q
    return leftmost_guess

list_of_points = [(3,4), (1,2), (3,2), (4,3), (5,6)]
(x,y) = leftmost(list_of_points)
print('leftmost:', (x,y) )
# leftmost: (1, 2)

1.1 Exercise: Type up the above in my_tuple_example.py. Then trace through the iteration in your module pdf.

1.2 Exercise: Consider the following:

import math

def distance(p, q):
    return math.sqrt( (p[0]-q[0])**2 + (p[1]-q[1])**2 )

def find_closest_point(p, L):
    # Write your code here to find the closest point in L to p
    # and to return both the point and the distance to it from p.

list_of_points = [(3,4), (1,2), (3,2), (4,3), (5,6)]
query_point = (5,4)
(c, d) = find_closest_point(query_point, list_of_points)
print('closest:',c,' at distance', d)
# Should print: 
# closest: (4, 3)  at distance 1.4142135623730951

In my_tuple_example2.py, fill in the missing code to find the closest point in a list of points to a given query point. Return both the closest point and the distance to the query point as a tuple.

1.1 Advanced topic: sets

The general mathematical term set means a "collection of like things but without duplicates".

Python has special syntax and operations to support this mathematical notion:

Here are two sets being defined:

A = {2, 4, 5, 6, 8}       # Curly brackets
B = {'hello', 'hi', 'hey', 'howdy'}

The first set contains five numbers, whereas the second contains four strings.

Consider this variation:

A = {2, 4, 5, 6, 8}
B = {'hello', 'hi', 'hey', 'howdy'}

C = {8, 5, 4, 6, 2, 4, 5, 5}
print(C)
if A == C:
    print('they are equal')
else:
    print('they are not equal')

Given what we've said about sets, what will be printed?

1.4 Exercise: Type up the above in my_set_example.py to find out.

Note:

Even though a set may not have duplicates, we are actually allowed to try to create duplicates:
```
C = {8, 5, 4, 6, 2, 4, 5, 5}
  
```
Python simply removes the duplicates.
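Python compares sets by their contents, not by the order in which the elements were written, so two sets with the same elements are equal. Thus, printing the set
```
C = {8, 5, 4, 6, 2, 4, 5, 5}
```
typically results in
```
{2, 4, 5, 6, 8}
```
(A set has no guaranteed order, so do not rely on the order in which its elements are printed.)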

What can we do with sets?

The most common operation is to see whether some value is in some set we've defined using the keyword in:

def check_vowel(x):
    vowels = {'a','e','i','o','u'}
    if x in vowels:
        print(x, 'is a vowel')
    else:
        print(x, 'is not a vowel')

check_vowel('a')
check_vowel('b')

Other, more mathematical operations feature different ways of combining sets. For example:
```
A = {2, 4, 5, 6, 8}
B = {1, 3, 5, 6}

D = A | B    # union
print(D)
  
```
Here, D contains every element across both sets.
Other such operators include:
- intersection (elements that are in both sets)
- difference (elements in one set that are not in the other)
Since our goal is merely to give you a taste of these advanced topics, we won't say much more about these operations.
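
For the curious, here is a brief sketch of intersection and difference using the same two sets as in the union example. The operators & and - are standard Python; the variable names both, only_in_A and only_in_B are just illustrative. We print with sorted() because a set has no guaranteed order:

```python
A = {2, 4, 5, 6, 8}
B = {1, 3, 5, 6}

both = A & B          # intersection: elements in A and also in B
only_in_A = A - B     # difference: elements in A that are not in B
only_in_B = B - A     # difference the other way around

print(sorted(both))        # [5, 6]
print(sorted(only_in_A))   # [2, 4, 8]
print(sorted(only_in_B))   # [1, 3]
```

Notice that A - B and B - A give different results, unlike union and intersection, where the order of the two sets does not matter.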

Instead, let's look at an application:

In text processing, the term stopword is often used to describe commonly used words that serve little or no purpose in text analysis:
- When you analyze digitized texts for meaning, topic and dialogue (for example), words like "the", "of", "and" get in the way of analysis.
- They are often discarded from text analysis that seeks to analyze more significant words.
We'll ask the question: what proportion of text typically consists of stopwords?
(The answer is surprising, as we'll see).
To answer the question, we'll scan text word-by-word and see if each word is in the set of stopwords.
Although there are hundreds of stopwords, we'll keep things simple by targeting a few.

Let's write this up:

import wordtool as wt 

# The 25 most common stopwords
stopwords = {'the','be','to','of','and','a','in','that','have','I',
             'it','for','not','on','with','he','she','you','do','as',
             'at','his','her','they','by'}

wt.open_file_byword('alice.txt')

s = wt.next_word()
num_stopwords = 0
total_words = 0

while s is not None:
    if s in stopwords:
        num_stopwords += 1   # Count stopwords
    total_words += 1         # And all words
    s = wt.next_word()

percent_stopwords = 100 * (num_stopwords/total_words)
# Compare this:
string_to_print = '% of stopwords: {0:.2f}'.format(percent_stopwords)
print(string_to_print)
# to this:
# print(percent_stopwords)

1.5 Exercise: Type up the above in my_stopwords.py. Then, compare the percentage of stopwords in these two famous texts: Alice in Wonderland and Darwin's On the Origin of Species. Try two other texts from Project Gutenberg, and report the results in your module pdf.

As an aside, we'll point out something about string formatting, which is also an advanced topic:

It's plain ugly (and often useless) to print out all the digits in a floating number like:
```
% of stopwords: 31.969357958208498
  
```
Instead, we'd like
```
% of stopwords: 31.97
  
```
Specifying to Python how you'd like a number formatted, unfortunately, is a bit of a chore using somewhat cryptic commands embedded within strings:
```
string_to_print = '% of stopwords: {0:.2f}'.format(percent_stopwords)
print(string_to_print)
  
```
We'll explain this one with the understanding that a proper look into this topic is beyond the scope of this course:
- First, one builds the target string using ordinary quotes, as in:
```
string_to_print = '% of stopwords: '
  
```
- Then, attach the format function with the desired variable whose value we wish to print nicely:
```
string_to_print = '% of stopwords: '.format(percent_stopwords)
  
```
- Then, figure out where in the string you want the resulting number and use curly brackets:
```
string_to_print = '% of stopwords: {0}'.format(percent_stopwords)
  
```
  Here, 0 means the first such number. If we had a second number, we'd use 1, and so on.
- Finally, specify both how many digits after the decimal point, and that this number is a float:
```
string_to_print = '% of stopwords: {0:.2f}'.format(percent_stopwords)
  
```
Yes, a bit cryptic and difficult to understand. But eventually, when you get used to it, it's powerful when you are doing a lot of text output.
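
As a small sketch of the "second number" idea, here is one string with three placeholders. The counts 3 and 8 are made-up values for illustration, not results from any text:

```python
num_stopwords = 3      # Made-up counts, for illustration only
total_words = 8
percent = 100 * (num_stopwords / total_words)

# {0}, {1} and {2} pick the first, second and third values given to format()
report = '{0} of {1} words are stopwords ({2:.2f}%)'.format(
    num_stopwords, total_words, percent)
print(report)
# 3 of 8 words are stopwords (37.50%)
```

The first two placeholders have no formatting commands, so the integers are printed as they are; only the third one is restricted to two digits after the decimal point.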

1.2 Advanced topic: dictionaries

Consider this problem:

We have a data file that looks like this:
```
apple
banana
apple
pear
banana
banana
apple
kiwi
orange
orange
orange
kiwi
orange
  
```
This might represent, for example, a record of sales at a fruit stand.
We'd like to count how many of each fruit were sold.

One way would be to define a counter for each kind:

num_apples = 0
num_bananas = 0
num_pears = 0
num_kiwis = 0
num_oranges = 0
with open('fruits.txt','r') as data_file:
    line = data_file.readline()
    while line != '':
        fruit = line.strip()
        if fruit == 'apple':
            num_apples += 1
        elif fruit == 'banana':
            num_bananas += 1
        elif fruit == 'pear':
            num_pears += 1
        elif fruit == 'kiwi':
            num_kiwis += 1
        elif fruit == 'orange':
            num_oranges += 1
        else:
            print('unknown fruit:', fruit)
        line = data_file.readline()

print('# apples:', num_apples)
print('# bananas:', num_bananas)
print('# pears:', num_pears)
print('# kiwis:', num_kiwis)
print('# oranges:', num_oranges)

1.6 Exercise: Type up the above in my_fruits.py and use the data file fruits.txt to confirm. Next, in my_fruits2.py, change the program to accommodate the additional fruits in fruits2.txt.

Aside from being tedious, this approach has other issues:

One would like to be able to write a general program that does not need to know which fruits are in a file.
What if there were a thousand different kinds of items (not fruits, say, but department-store items)?
A single mistake in a variable name can cause the counts to be wrong.

Fortunately, the use of dictionaries will make it easy:

# Make an empty dictionary
counters = dict()

with open('fruits.txt','r') as data_file:
    line = data_file.readline()
    while line != '':
        fruit = line.strip()
        if fruit in counters.keys():
            # If we've seen the fruit before, increment.
            counters[fruit] += 1
        else:
            # If this is the first time, set the counter to 1
            counters[fruit] = 1
        line = data_file.readline()

print(counters)

1.7 Exercise: Type up the above in my_fruits3.py and first apply it to fruits.txt and then to fruits2.txt. (Submit your program with the latter file as the input file.) In your module pdf, describe what you had to change in the code to make it work for the second file.

Now let's explain:

Dictionary is a technical term that is only somewhat related to an actual English dictionary.
Think of an English dictionary as something where you look up a word and receive its meaning.
The operations here are to look up a word and to receive an associated value (the word's meaning, in this case).
In Python, a dictionary is a structure that lets you associate one kind of data with another.
The technical equivalent of a word is called a key and the equivalent of the meaning is called the value.
So, a dictionary is a collection of key-value pairs.
Here's an example:
```
d = {'apple': 3, 'banana': 3, 'pear': 1, 'kiwi': 2, 'orange': 4}
  
```
In this case, we're associating
- The value 3 with the key 'apple'
- The value 3 with the key 'banana'
- The value 1 with the key 'pear'
- The value 2 with the key 'kiwi'
- The value 4 with the key 'orange'

Conveniently, Python allows array-style indexing using the key:

d = {'apple': 3, 'banana': 3, 'pear': 1, 'kiwi': 2, 'orange': 4}

print(d['apple'])  # Prints 3

d['banana'] = 0    
# Which changes the value associated with 'banana' from 3 to 0

The above is an example of a dictionary that's already built (after we've processed the data).
To process data on-the-fly, we need an additional operation that an English dictionary does not really have: we need to be able to add something that's not already there.

To add a new key, we simply use it as an index:

d = {'apple': 3, 'banana': 3, 'pear': 1, 'kiwi': 2, 'orange': 4}

d['plum'] = 0
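
Here is a short sketch that puts these operations together: adding a key, updating its value, checking whether a key is present, and visiting every key. The names has_fig and total are illustrative only:

```python
d = {'apple': 3, 'banana': 3, 'pear': 1, 'kiwi': 2, 'orange': 4}

d['plum'] = 0          # Add a new key with value 0
d['plum'] += 1         # Update the value for an existing key

has_fig = 'fig' in d   # False: 'fig' was never added
print(has_fig)

total = 0
for fruit in d:        # A for-loop over a dictionary visits its keys
    total += d[fruit]
print(total)           # 14
```

Note that looking up a key that isn't there, as in d['fig'], is an error (a KeyError), which is why checking with in first is useful.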

With this understanding we can now revisit the code in the fruit example:

We've seen how to read a file line-by-line before:

with open('fruits.txt','r') as data_file:
    line = data_file.readline()
    while line != '':
        fruit = line.strip()           # Remove whitespace on either side

        # This is where we'd do something with the data

        line = data_file.readline()    # Get the next line

The rest is merely the dictionary part:

with open('fruits.txt','r') as data_file:
    line = data_file.readline()
    while line != '':
        fruit = line.strip()           # Remove whitespace on either side

        if fruit in counters.keys():
            # If we've seen the fruit before, increment.
            counters[fruit] += 1
        else:
            # If this is the first time, set the counter to 1
            counters[fruit] = 1

        line = data_file.readline()    # Get the next line
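
If you'd like to experiment with the counting logic without a data file, here is a sketch that applies the same if/else to a list of strings. The list sales is an assumption standing in for the lines of the data file shown earlier:

```python
# Stand-in for the lines of the data file (an assumption for this sketch)
sales = ['apple', 'banana', 'apple', 'pear', 'banana', 'banana', 'apple',
         'kiwi', 'orange', 'orange', 'orange', 'kiwi', 'orange']

counters = dict()

for fruit in sales:
    if fruit in counters.keys():
        counters[fruit] += 1     # Seen before: increment
    else:
        counters[fruit] = 1      # First time: start the counter at 1

for fruit in counters:
    print('{0}: {1}'.format(fruit, counters[fruit]))
```

Nothing in the loop mentions a particular fruit, which is why the same code works unchanged for any list of items.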

1.8 Exercise: In my_stopwords2.py, write code that uses a dictionary to compare the relative occurrence of stopwords in alice.txt and darwin.txt. In particular, we'd like to know: what percentage of the stopword occurrence can be attributed to 'the', what percentage to 'and', and so on. For alice.txt the output should look like this (not necessarily in the same order) using the string formatting from the previous section. In your module pdf, compare the outputs for the two texts: do the stopwords occur with similar relative frequencies across the two texts?

1.3 Advanced topic: global variables

Suppose we have the following application:

We prompt the user to enter the parameters (the coordinates of the center, and the radius) of a circle that needs to be drawn.
We do this repeatedly until the user is done (empty input) and then draw all the circles.

This is the kind of code we'd like to write:

# ... For brevity, we're not showing the import, setting up drawtool etc ...

# Read circle data from the user:
line = input('Enter circle parameters: ')
while line != '':
    process_line(line)
    line = input('Enter circle parameters: ')

# A function that'll access the list of circles
draw_circles()

dt.display()

What we'll do is have a list of circles:

# The list of circles
circles = []

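def process_line(line):
    # Code for extracting the center coordinates and radius
    # This will add to the list circles
    pass    # Placeholder so the outline is valid Python; filled in below

def draw_circles():
    # Code for drawing the circles: will access the list circles
    pass    # Placeholder so the outline is valid Python; filled in below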

line = input('Enter circle parameters: ')
while line != '':
    process_line(line)
    line = input('Enter circle parameters: ')

draw_circles()

dt.display()

Here, the variable circles is called a global variable because it can be accessed inside the functions defined in the file.
So, what would be an example of a non-global variable?

Let's fill out the code and see:

circles = []                      # The global variable

def process_line(line):
    parts = line.split()          # parts is a local variable
    x = float(parts[0])
    y = float(parts[1])
    r = float(parts[2])
    circle = (x,y,r)              # We've used a tuple here!
    circles.append (circle)       # Accessing global circles

def draw_circles():
    dt.set_color('b')
    for c in circles:             # Accessing global circles
        dt.draw_circle(c[0], c[1], c[2])

line = input('Enter circle parameters: ')
while line != '':
    process_line(line)
    line = input('Enter circle parameters: ')

draw_circles()

dt.display()

Let's create a simpler example to illustrate:

Consider this:

x = 5

def some_func():
    y = x + 1
    print(y)

def some_other_func():
    z = x + 2
    print(z)
    # print(y)         # This would fail: we can't access y here

some_func()          # Prints 6
some_other_func()    # Prints 7

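Let's point out:
The variable x is defined outside the functions, so both some_func() and some_other_func() can read it.
The variables y and z are defined inside functions; each is a local variable that exists only inside its own function, which is why some_other_func() cannot access y.
Thus: to share variables across functions, define the variables outside the functions.
In which case they become global variables.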

That would be the end of it, but there's one more complication:

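You cannot assign a new value to a global variable inside a function without explicitly stating an intention to do so. (Calling circles.append() earlier needed no such statement because it changes the contents of the existing list rather than assigning a new value to the variable circles.)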

So, suppose we wanted to modify x in the example:

x = 5

def some_func():
    global x        # Explicit statement of intent to modify
    x = x + 1
    y = x + 1
    print(y)

def some_other_func():
    z = x + 2
    print(z)

some_func()         # Prints 7
some_other_func()   # Prints 8
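
What happens if the global statement is left out? The sketch below (with made-up names counter and history) shows that an assignment inside a function then quietly creates a new local variable, leaving the global untouched, whereas appending to a global list works without any declaration:

```python
counter = 0
history = []

def bump_without_global():
    counter = 100          # Creates a NEW local variable named counter
    history.append('a')    # Changes the list's contents: no global needed

def bump_with_global():
    global counter         # Explicit statement of intent to modify
    counter = counter + 1

bump_without_global()
after_first = counter      # The global counter is still 0
print(after_first)         # 0

bump_with_global()
after_second = counter
print(after_second)        # 1
print(history)             # ['a']
```

The silent creation of a local variable is a common source of confusion, which is one reason to keep the number of globals that get modified small.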

Now let's go back to the circle-drawing example to see an instance where modifying a global is useful:

Suppose we want to track the smallest circle (smallest radius) and draw that circle in red.
Thus, as we get user input line-by-line, we'll need to update the smallest.

Here's part of the code:

# For brevity, we're not showing the import, setting up drawtool etc

# ... (stuff left out) ... 

# Global variables available to all functions
circles = []                # The list of circles
smallest_r = float('inf')   # Start at infinity so that any radius is smaller
smallest = None

def process_line(line):
    global smallest_r, smallest   # Identify globals being modified
    parts = line.split()
    x = float(parts[0])
    y = float(parts[1])
    r = float(parts[2])
    circle = (x,y,r) 
    if r < smallest_r:
        smallest_r = r            # Modifying global smallest_r
        smallest = circle         # Modifying global smallest
    circles.append (circle)       # Accessing global circles

def draw_circles():
    # Fill in the code here to draw the smallest in red
    # and the other circles in blue.

# The rest of the code is the same as before ... (read line by line etc)

1.10 Exercise: In my_circles.py, copy over the above code along with the remaining code from the earlier example, and then fill in code in the draw_circles() function to complete the program. When the user types, for example,

Enter circle parameters: 3 4 3
Enter circle parameters: 4 3 1
Enter circle parameters: 5 7 2
Enter circle parameters:

the output should be:

1.4 Advanced topic: multiple files

Large Python applications consist of tens of thousands of lines of code.

Most such applications actually consist of hundreds of Python files bundled together, where code in one file can call functions in another.

We've already seen examples with the use of drawtool and wordtool.

Why are applications broken into multiple files?

One large file is unwieldy to understand and work with in an editor (imagine the scrolling difficulty with a thousand lines).
It's much easier to separate out functionality and test each piece. Robustly tested files can be left alone while refining those that need work.
Separated functionality can be reused in other projects.
If one piece needs upgrading, you can upgrade just that part if it's in a separate file.

Let's build a simple example:

Suppose we put together a file called my_math_functions.py which has

import math

pi = 3.141

def square(x):
    return x*x

def cube(x):
    return x*x*x

def distance(x1, y1, x2, y2):
    # Write your code here

Then in another file called my_math_application.py you could write:

import my_math_functions as m

x = 2
y = m.square(5)
print(y)

x2 = 3
y2 = m.cube(x2)
print(y2)

d = m.distance(x,y, x2, y2)
print(d)

Notice the import statement at the top of the file:
```
import my_math_functions as m
  
```
This has the import keyword, the name of the other file (without the .py extension), and a shortcut introduced by the keyword as.
If we did not have the shortcut, we'd have to write code like:
```
import my_math_functions 

x = 2
y = my_math_functions.square(5)
print(y)
  
```
Which is perfectly legit but a bit tedious.
To access something from another file, we use the dot (period):
```
y = m.square(5)
  
```
One can access a variable in the other file as well.
```
print(m.pi)
  
```
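
The same import-and-dot pattern works for files that come with Python. Here is a small sketch using the standard math module, chosen only because it runs without creating any extra files; the names root and pi_text are illustrative:

```python
import math as m       # Same pattern: import, file name, shortcut

root = m.sqrt(25)      # Access a function in the other file with the dot
print(root)            # 5.0

# Access a variable in the other file the same way
pi_text = '{0:.3f}'.format(m.pi)
print(pi_text)         # 3.142
```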

1.11 Exercise: Complete the code for the distance() function and write the above two programs (using the file names as above) so that the output is:

25
27
2.23606797749979

1.5 Advanced topic: string formatting and numbers

Consider this program:

x = 0
for i in range(5):
    print('i=', i, 'x=', x)
    x += 0.1

1.12 Exercise: Type up and run in my_format.py.

You observed output like:

i= 0 x= 0
i= 1 x= 0.1
i= 2 x= 0.2
i= 3 x= 0.30000000000000004
i= 4 x= 0.4

Why the ugly output?

There are really three questions here, the second of which is a bit subtle.
The first one is: since most of the other values are being printed nicely, why doesn't Python print 0.3 instead of 0.30000000000000004?
The second, more subtle one, is: our for-loop clearly increments by 0.1, so why is x not exactly 0.3 but 0.30000000000000004?
The third implied question is: how do we actually limit the number of digits printed after the decimal point?

Let's answer these questions:

Why doesn't Python recognize unnecessary digits?
- This is because Python does not want to second guess the programmer's intent.
- Maybe you do want the digits.
How did 0.3 become 0.30000000000000004?
- A computer's memory has two kinds of limitations, and one "feature".
- One: it cannot store numbers beyond a certain accuracy, because each number gets a fixed amount of memory.
- Two: when calculations are performed, this storage limitation forces some arithmetic to be slightly wrong (many digits after the decimal point).
- The "feature" is that computers store binary numbers, not our kind of decimal numbers. A decimal fraction like 0.1 has no exact binary form, so what is stored is a very close approximation.
OK, how do we actually print nicely?
- This is a bit of an advanced topic, so we'll only provide a simple example.
- The solution is to use a special set of formatting commands, like the ones you saw a long time ago (remember \n?)

Here's how it works for the above program:

x = 0
for i in range(5):
    format_string = 'i = {0:2d}  x = {1:.2f}'
    string_to_print = format_string.format(i, x)
    print(string_to_print)
    x += 0.1

Let's explain:

A format string is the eventual desired string that needs to be printed but with some placeholders:
```
    format_string = 'i = {0:2d}  x = {1:.2f}'
  
```
In this case, there are two placeholders, {0:2d} and {1:.2f}:
```
    format_string = 'i = {0:2d}  x = {1:.2f}'
  
```
Placeholders are enclosed in curly brackets.
Each placeholder begins with a number that will eventually determine which variable gets its value into the placeholder.

So, here we have positions 0 and 1:

    format_string = 'i = {0:2d}  x = {1:.2f}'

We feed the values via the format() function:
```
    format_string = 'i = {0:2d}  x = {1:.2f}'
    string_to_print = format_string.format(i, x)
  
```
This is what lets format() put the value of i in place of the "0" and the value of x in place of the "1".
All that's left to explain are the cryptic commands that follow the "0" and "1" respectively:
```
    format_string = 'i = {0:2d}  x = {1:.2f}'
  
```
The first one says, in effect, "use a minimum width of 2 characters and expect an integer". The second says "expect a float and restrict the number of post-decimal digits to 2".

String formatting has more to it, but that's all we'll say about it here.
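
Returning briefly to the second question above (how 0.3 became 0.30000000000000004), the sketch below shows one practical consequence: testing floats with == can surprise you. It uses math.isclose() from the standard math module, which is one common way to compare floats, and the formatting from this section to print the value nicely:

```python
import math

x = 0.1 + 0.2
print(x)                              # 0.30000000000000004

exactly_equal = (x == 0.3)
print(exactly_equal)                  # False: the stored values differ slightly

close_enough = math.isclose(x, 0.3)
print(close_enough)                   # True: equal up to a tiny tolerance

shown = '{0:.2f}'.format(x)
print(shown)                          # 0.30
```

Formatting changes only what is printed; the value stored in x is still the slightly-off approximation.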

1.6 Randomness

As we've seen throughout the course, it is useful to be able to have Python generate random numbers:

We use random numbers to answer statistical questions.
Random numbers were also useful in creating patterns, as in art.

However, there's an important thing to know about random generation.

Consider this program:

import random

#random.seed(123)

for i in range(5):
    x = random.uniform(1.0, 10.0)
    print(x)

1.13 Exercise: Type up the above in my_random_example.py. Run it multiple times and observe the output. Next, un-comment the line

#random.seed(123)

and run it again multiple times. Then, change the number 123 in the seed to any number between 1 and 1000, and once more run the program many times.

How random generation works:

Python uses a mathematical function with a so-called "seed" to produce random numbers.
The numbers aren't really random, because a particular seed will produce the same numbers, which is why they're called pseudorandom numbers.
However, if we're doing things like computing averages, a pseudorandom sequence, provided it's long enough, is a reasonably good approximation.
Is there ever a reason to fix a seed?
- Turns out, yes. It's very useful for debugging.
- Using a seed causes your program to be repeatable (with the same pseudorandom sequence).
- So, typically, one uses a seed while getting a program to work.
- After it's working, you can comment out the seed.
If a seed is not provided, Python seeds the generator itself from a source that changes between runs (randomness supplied by the operating system, or else the current time), so that repeated runs produce different results.
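
The repeatability that a seed provides can be checked within a single program. In this sketch, draw_five() is a made-up helper that sets the seed and then generates five numbers the same way as the program above:

```python
import random

def draw_five(seed_value):
    # Hypothetical helper: fix the seed, then generate five numbers
    random.seed(seed_value)
    numbers = []
    for i in range(5):
        numbers.append(random.uniform(1.0, 10.0))
    return numbers

first_run = draw_five(123)
second_run = draw_five(123)
print(first_run == second_run)   # True: same seed, same sequence

other_seed = draw_five(456)
print(other_seed)                # A different seed gives its own sequence
```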

1.7 What else is there in Python?

Our goal in this almost-final module was to:

Introduce the all-important topic of arrays.
Lightly sketch a few advanced topics to introduce ideas and show some examples, without expecting mastery of all the details.

In the next module, we will dive into examples from across the disciplines. Some of these advanced topics will be useful there.

So, one might ask: what's left in Python to learn?

Quite a bit, it turns out:

Like many modern programming languages, Python is large enough that one needs a few courses to experience all of it.
Some concepts are advanced enough to need weeks to cover (example: objects).
Others need a background in data structures to understand how they work (example: dictionaries).
Yet others involve library functions and external packages.

Do you need to learn more? Is what we've learned enough to achieve a good deal?

We'll have more to say about this in a later module.

On to Module 2