Problem Set 4
This is an individual-effort assignment: the code you submit must be written by you; it may not be written by another person nor may it be generated with an algorithm.
Keep all files for this assignment in the
problem_set4
folder and submit them as a zip file namedproblem_set4.zip
.This assignment is scored out of 110 (an extra credit opportunity). There are more than 110 possible points, but the maximum score you can earn on this assignment is 110. Do not get stuck on any one problem, but do complete as many problems as you can.
Pay close attention to the format of output, names of functions, names of files, and names of variables. These errors are common reasons for losing points on problems that otherwise would have been full credit.
This is an individual-effort assignment: the code you submit must be written by you; it may not be written by another person nor may it be generated with an algorithm.
Dear Interlocutor,
In this assignment, you will build a very basic spam filter. Each problem will be a separate step.
- First, download spam_filter.zip.
- There is a folder called
problem_set4
in it: use this folder as yourproblem_set4
folder. - The folder contains three sub-folders:
spam
andham
, which contain the files themselves.alt
, which consists of an alternate set of spam and ham.
- There is a folder called
Your spam filter will read in the spam files and ham files and “learn” from these files how to identify what content is spam and what is not.
Important: Unlike previous assignments, you will put all of your code for this assignment in a single file: spam_filter.py.
Put this in the main problem_set4
folder.
The comments in our starter code form what is called “pseudo code.” They describe the structure of the code you should write.
Whenever the comments are indented, that’s a sign that you should probably be using indented control flow such as a loop or conditional statement.
- A spam filter is a classifier: code that takes some input and outputs a classification.
- In this case, the input is a message
- The output is “spam” or “not spam.”
- Messages that are not spam are called “ham;” this is is an old convention that has nothing to do with actually eating pork.
- The spam filter learns what is spam and what is ham from data that has been tagged.
- Early spam filters required the user to tag emails as spam before they worked at all
- Modern spam filters work well because people have been tagging spam for decades
- With only a small number of training messages, your filter will not perform perfectly - but it will work!
There are two sets of spam content provided for this assignment:
- The first represents emails generated by ChatGPT - “if nobody wrote it, why should anybody read it?”
- The second represents spam SMS messages related to political donation fraud and sports betting
I hope this message finds you well
Problem 0
I understand that this subject is important
Problem 1
You might be wondering how we get a list of file names. The answer: use our computer’s operating system.
- You already use the operating system when you open up a folder and look at the files
- Python can use the operating system, too. Here is how:
First, import the os
library. Then you can use the function os.listdir()
to list the contents of the current directory:
import os
print(os.listdir())
The os.listdir()
function returns a list of files and folders (as strings) in the current directory, something like:
['alt', 'ham', 'spam', 'spam_filter.py']
You can also call the function on other directories, for instance, the ham
directory:
print(os.listdir('ham'))
yields something like
['0931dcf9882a86e5aea70358ea0cbb62c0d154f3ea598411c8e846b11a7e545a.txt',
'17bdf358ced227386ebe606d4f00dd50bea9fd97a6115b23573a7767e8ee61d8.txt',
'1b19e1ff7076d6c598f0db70070a10290ad5c64858c376550bbcc95f456d273d.txt',
'1b9bcb9cc387f8783e9359389b556937a0b715ae700d1176485ba91b9f089b5b.txt',
'2080ba83c6624d9667fabcdc54db1514aa02721c0987d59259d0bf3d305efde1.txt',
'23bd9be0880f82d8757eea59972684599516b117099e0397810557f526775949.txt',
'26ac2b627a3aee68c99c65bc755f66e62be177acc311939016b2eb9415b77f49.txt',
'28d7ca46fb2462ebe2637ef5a9c8ec1292b854e23b246282203103c8abbfe8a4.txt',
# it's a long list, this only shows the beginning
# it might not be exactly the same for you
So: this lets us get the file names.
Be careful: Since the files in the ham
folder are not in the main directory of the program, you will need to specify the path when opening them. Instead of 0931dcf9882a86e5aea70358ea0cbb62c0d154f3ea598411c8e846b11a7e545a.txt
, you will need to open ham/0931dcf9882a86e5aea70358ea0cbb62c0d154f3ea598411c8e846b11a7e545a.txt
.
This means for that list of files, you might need to concatenate ham/
or spam/
to every filename string.
I take it very seriously
Problem 2
I know that you are busy and I appreciate your time
Problem 3
Thank you for your attention to this matter
Problem 4
Nothing in this note is necessary to get credit for this problem set.
- For very long messages, your filter can fail because the probabilities it calculates become too small for Python’s floating point numbers to capture, and both round to 0.
- You could address this by using only one portion of each message
- You could address this by feeding each message portion-by-portion into the filter
- If you’re comfortable with the math, you could address this with logarithms:
- Instead of multiplying probabilities: \(p_1 \cdot p_2\), you can sum their logs: \(\log(p_1) + \log(p_2)\)
- Since you are comparing relative magnitude, you can simply compare the log probabilities, rather than converting back to probability via exponentiation
- You can use the “alternate” spam and ham files (in the
alt
) folder if you want to completely avoid issues with floats becoming zero. - These messages are all very short.
The spam filter you just built is a very rudimentary form of machine learning
- The classifier “learns” from the data
- You could provide a different data set, and the program would learn different rules
- You could use this to perform different kinds of text classification, such as determining if text was written by one of two different authors
This filter is a type of Naive Bayes classifier
There are many ways to improve this kind of filter, including:
- Removing punctuation
- Removing capitalization
- “Stemming,” a process that recognizes that words like ‘vote’ and ‘voting’ have the same root
- Using pairs of words or word sequences rather than single word counts
The math behind how this works is one of the (very basic) building blocks for much more sophisticated language models like ChatGPT.
I look forward to your response.