Problem Set 4

Dear Interlocutor,

In this assignment, you will build a very basic spam filter. Each problem will be a separate step.

Your spam filter will read in the spam files and ham files and “learn” from these files how to identify what content is spam and what is not.

Important: Unlike previous assignments, you will put all of your code for this assignment in a single file: spam_filter.py. Put this in the main problem_set4 folder.

The comments in our starter code form what is called “pseudo code.” They describe the structure of the code you should write.

Whenever the comments are indented, that’s a sign that you should probably be using indented control flow such as a loop or conditional statement.

What is a spam filter?
  • A spam filter is a classifier: code that takes some input and outputs a classification.
    • In this case, the input is a message
    • The output is “spam” or “not spam.”
    • Messages that are not spam are called “ham;” this is is an old convention that has nothing to do with actually eating pork.
  • The spam filter learns what is spam and what is ham from data that has been tagged.
  • Early spam filters required the user to tag emails as spam before they worked at all
  • Modern spam filters work well because people have been tagging spam for decades
  • With only a small number of training messages, your filter will not perform perfectly - but it will work!

There are two sets of spam content provided for this assignment:

  • The first represents emails generated by ChatGPT - “if nobody wrote it, why should anybody read it?”
  • The second represents spam SMS messages related to political donation fraud and sports betting

I hope this message finds you well

Problem 0

Problem 0 (25 pts)

The first step is to write a function that takes a single argument, a file name, and returns the contents of that file as a string.

Call this function read_in_file. You can use this starter code:

def read_in_file(filename):

    return # string of file contents

spam_file_string = read_in_file('spam/42d42a9b6cd3f147f48861fd8462bf31e80cbe0fd1c71afa3e60eec9cd46d715.txt')[490:575]
# should print 'Given these unforeseen circumstances, I am seeking your permission to be excused from'
print(spam_file_string)

ham_file_string = read_in_file('ham/647655d291a86d13a450dca2686e7df6e92aa848322fa1a58eb22d8392872cf0.txt')[110:145]
# should print 'Local lumber yards are enthusiastic'
print(ham_file_string)

I understand that this subject is important

Problem 1

Problem 1 (25 pts)

For the next step, write a function read_all_files that:

  • Takes a string representing a directory as input
    • This could be 'ham', or 'alt/spam', etc.
  • Uses the string and the os library (see the note below the problem) to list the contents of the directory
  • Creates a new list
  • Reads in each file and adds the content of each file to the list
  • Returns the list of file contents

You can use this starter code (it is not complete, you will need to make some changes):

def read_all_files(directory_string):
    file_list = # convert string to list of file names
    # create a new list for file contents
    for file in file_list: #loops through all files
        # read in each file...
        # add each file's contents to the list of contents
    return #return a list of the contents of all files


# the below code won't run without the code in the previous part

# ham_directory_string and spam_directory_string both find
#  *directories* where the ham and spam are located
#  see the note below this problem!
ham_content = read_all_files(ham_directory_string)

spam_content = read_all_files(spam_directory_string)

# these should be lists of all of the files contents
# (which will vary between the main and alternative data sets)

You might be wondering how we get a list of file names. The answer: use our computer’s operating system.

  • You already use the operating system when you open up a folder and look at the files
  • Python can use the operating system, too. Here is how:

First, import the os library. Then you can use the function os.listdir() to list the contents of the current directory:

import os
print(os.listdir())

The os.listdir() function returns a list of files and folders (as strings) in the current directory, something like:

['alt', 'ham', 'spam', 'spam_filter.py']

You can also call the function on other directories, for instance, the ham directory:

print(os.listdir('ham'))

yields something like

['0931dcf9882a86e5aea70358ea0cbb62c0d154f3ea598411c8e846b11a7e545a.txt',
 '17bdf358ced227386ebe606d4f00dd50bea9fd97a6115b23573a7767e8ee61d8.txt',
 '1b19e1ff7076d6c598f0db70070a10290ad5c64858c376550bbcc95f456d273d.txt',
 '1b9bcb9cc387f8783e9359389b556937a0b715ae700d1176485ba91b9f089b5b.txt',
 '2080ba83c6624d9667fabcdc54db1514aa02721c0987d59259d0bf3d305efde1.txt',
 '23bd9be0880f82d8757eea59972684599516b117099e0397810557f526775949.txt',
 '26ac2b627a3aee68c99c65bc755f66e62be177acc311939016b2eb9415b77f49.txt',
 '28d7ca46fb2462ebe2637ef5a9c8ec1292b854e23b246282203103c8abbfe8a4.txt',
# it's a long list, this only shows the beginning
# it might not be exactly the same for you

So: this lets us get the file names.

Be careful: Since the files in the ham folder are not in the main directory of the program, you will need to specify the path when opening them. Instead of 0931dcf9882a86e5aea70358ea0cbb62c0d154f3ea598411c8e846b11a7e545a.txt, you will need to open ham/0931dcf9882a86e5aea70358ea0cbb62c0d154f3ea598411c8e846b11a7e545a.txt.

This means for that list of files, you might need to concatenate ham/ or spam/ to every filename string.

I take it very seriously

Problem 2

Problem 2 (30 pts)

Now that you have lists of the spam and ham contents, you will write a function word_freq that inputs a list of contents and outputs a dictionary:

  • The dictionary has a key for each word present in the content
  • The associated value for each key represents how many times the word is present in the entire data set

If the list of contents were:

['The night is young', 'The water is warm']

The dictionary would be:

{'The': 2, 'night': 1, 'is': 2, 'young': 1, 'water': 1, 'warm': 1}

You can use this starter code:

def word_freq(content_list):
    freq_dict = {} # initialize an empty dictionary
    for content in content_list: # loop through the content of all messages
        # split the content into individual words
        # for each individual word:
            # if not in the dictionary, add it with count one
            # if already in the dictionary, increase the count by one
    return freq_dict

I know that you are busy and I appreciate your time

Problem 3

Problem 3 (10 pts)

You will write a simple function total_words to count the total word instances the dictionaries you have just built. This function takes a dictionary as input and outputs an integer.

For instance, the total word instances for:

{'The': 2, 'night': 1, 'is': 2, 'young': 1, 'water': 1, 'warm': 1}

is eight: \(2 + 1 + 2 + 1 + 1 + 1 = 8\)

You can use this starter code:

def total_words(input_dict):
    total_count = 0 # initialize a counter
    for value in input_dict.values(): # loop through each word value
        # increment the counter by the value
    return total_count

# this is a test to check functionality
d = {'The': 2, 'night': 1, 'is': 2, 'young': 1, 'water': 1, 'warm': 1}
print(total_words(d)) # should print 8

Thank you for your attention to this matter

Problem 4

Problem 4 (100 pts)

Now you will build the classifier. Write a function called classify_message.

The function will take three arguments:

  • message is a string, the message to be classified
  • spam_dict is the dictionary of spam words
  • ham_dict is the dictionary of ham words

Here is how the function will work:

  • It builds two ‘scores’ for the message
    • One score for spam
    • One score for ham
  • First, check if the word is in either dictionary, if so:
    • For spam:
      • The score starts as 1.0
      • For each word in the message, find the number of counts in the spam dictionary
      • Add one to it (even if it is zero)
      • Divide that number by the total word instances in the spam dictionary
      • Multiply that result by the current score, yielding a new score
      • One you have done this for every word in the message, you have determined the final spam score
    • Repeat this scoring process for ham
  • Once you have both scores, compare them
  • If the spam score is higher, the message is classified as spam
  • If the ham score is higher, the message is classified as ham (not spam)
  • For spam, return True, for ham, return False

You can use this starter code:

def classify_message(message, spam_dict, ham_dict):
    spam_score = 1.0
    ham_score = 1.0
    spam_total_words = #compute the total number of word instances for spam
    ham_total_words = #compute the total number of word instances for ham
    message_words = #split the message into individual words
    # loop through all the words in the message
        # check if the word is in either dictionary, if it is:
            # process the spam score    
                # find the number of counts of that word in the spam dictionary
                # add one to it (even if it zero)
                # divide that number by the total number of word instances for spam
                # multiply the result by the current spam score; this is the new score
            # do the same, but for ham rather than spam
    # compare the two scores
    # return True if spam is higher, False if ham is higher

## these are tests to check that the function is working
# the below code won't run without the code in the previous part
msg1 = 'Dear Sir,\n\nI hope this message finds you well. My name is..'
m1 = classify_message(msg1, spam_word_dict, ham_word_dict)
msg2 = 'Hey,\n\nWant to catch up next week?'
m2 = classify_message(msg2, spam_word_dict, ham_word_dict)
print("Message:\n", msg1, "\n-- Result:", m1) # should be True
print("Message:\n", msg2, "\n-- Result:", m2) # should be False

Check your spam filter against other messages of your choice. It will not be perfect, as it was trained on a relatively small number of messages.

Remember: all of your functions for this assignment go in spam_filter.py.

Performance Notes

Nothing in this note is necessary to get credit for this problem set.

  • For very long messages, your filter can fail because the probabilities it calculates become too small for Python’s floating point numbers to capture, and both round to 0.
  • You could address this by using only one portion of each message
  • You could address this by feeding each message portion-by-portion into the filter
  • If you’re comfortable with the math, you could address this with logarithms:
    • Instead of multiplying probabilities: \(p_1 \cdot p_2\), you can sum their logs: \(\log(p_1) + \log(p_2)\)
    • Since you are comparing relative magnitude, you can simply compare the log probabilities, rather than converting back to probability via exponentiation
  • You can use the “alternate” spam and ham files (in the alt) folder if you want to completely avoid issues with floats becoming zero.
  • These messages are all very short.
  • The spam filter you just built is a very rudimentary form of machine learning

    • The classifier “learns” from the data
    • You could provide a different data set, and the program would learn different rules
    • You could use this to perform different kinds of text classification, such as determining if text was written by one of two different authors
  • This filter is a type of Naive Bayes classifier

  • There are many ways to improve this kind of filter, including:

    • Removing punctuation
    • Removing capitalization
    • “Stemming,” a process that recognizes that words like ‘vote’ and ‘voting’ have the same root
    • Using pairs of words or word sequences rather than single word counts
  • The math behind how this works is one of the (very basic) building blocks for much more sophisticated language models like ChatGPT.

I look forward to your response.