Introduction to Machine Learning, Spring 2022
GWU Computer Science
In this project, you will analyze a dataset of roughly 12,000 user sessions from an online shopping site, collected over the course of a year, in order to build a binary classifier that predicts whether or not the user ended up buying something during their visit.
This type of analysis (predicting customer purchase intent, identifying returning customers, determining who will respond to which marketing campaigns) is big money and big business. You may see something similar appear as a take-home project during an interview for a data scientist position. Companies want to model and predict the behavior of users who visit their e-commerce sites in order to target advertising more effectively, for example.
Let's get started!
This dataset comes from the UCI Machine Learning Repository, where you will find a description of the dataset, including its features. Please download the local copy we provide for the class, as we've made some modifications to the data to facilitate this assignment.
In order to compare results across the class, we're going to use rows 10000 - 12331 as our holdout dataset. In a new Jupyter notebook for this project, write code to split the dataset into train and holdout sets, using those rows as the holdout (one possible approach is sketched below). In a markdown cell above this splitting code, explain why this split, as specified, is or is not a good idea from the perspective of generalization.
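Here's a minimal sketch of one way to do this split with pandas. The filename is an assumption (use whatever name the class copy you downloaded actually has), and the slice assumes the row numbers above refer to 0-indexed positions:

    import pandas as pd

    # Load the class copy of the dataset (filename is an assumption).
    df = pd.read_csv("online_shoppers_intention.csv")

    # Rows 10000 onward (by position) become the holdout set;
    # everything before them is the training set.
    train = df.iloc[:10000].copy()
    holdout = df.iloc[10000:].copy()

    print(train.shape, holdout.shape)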
Please copy each line of the grading rubric (including its number) into a markdown cell in your Jupyter notebook next to the cell that completes it, so we don't miss anything during grading :-)
GRADING RUBRIC for Part 1:
1. Load the CSV correctly into a DataFrame and show its contents in a cell | 5 points |
2. Holdout dataset split as specified | 5 points |
3. Correct explanation of what this holdout split means for generalization | 5 points |
Complete the following items on the train and holdout datasets. Have each item below in a separate markdown and
code cell in your notebook.
GRADING RUBRIC for Part 2:
4. Use value_counts() in pandas to print out the distributions of the categorical and ordinal numbered features (treat SpecialDay as categorical here). Enable the option that makes missing values visible in the counts. How many features had missing values, and what percentage of each feature's values were missing? Discuss in markdown in your notebook. (A sketch of items 4, 6, and 7 appears after this rubric.) | 5 points |
5. Use the describe() method in pandas to print out summary statistics. Discuss
which features you will have to consider more carefully, based on these results. | 5 points |
6. Handle any missing data in your training data, but do not simply delete the rows. In your notebook, discuss why you chose to handle the missing data that way. | 5 points |
7. The holdout dataset also contains missing data. Discuss and implement how you filled in those values, without deleting those rows. | 5 points |
8. Discuss (and implement if applicable) whether or not you need to scale/normalize your features, and which ones, if any. | 5 points |
9. There are several categorical features. Discuss and implement whether you will encode them as ordinal numbers or one-hot encode them, and explain why you chose that approach for each such feature. | 5 points |
10. You don't need to implement this, but in the dataset, were there any ordinal features that the authors should have recorded as categorical, in your opinion? Why or why not? Discuss in a markdown cell. | 5 points |
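As a starting point for items 4, 6, and 7, here's a hedged sketch that continues from the split above. The column names are taken from the UCI description and may differ in the class copy, and median imputation is just one defensible choice, not the required one:

    # Item 4: dropna=False makes value_counts() report NaN as its own row.
    for col in ["Month", "VisitorType", "Weekend", "SpecialDay"]:
        print(train[col].value_counts(dropna=False), "\n")

    # Percentage of missing values per feature.
    print(train.isna().mean() * 100)

    # Item 6 (one option): impute numeric features with the training median.
    num_cols = train.select_dtypes(include="number").columns
    medians = train[num_cols].median()
    train[num_cols] = train[num_cols].fillna(medians)

    # Item 7: reuse the *training* medians on the holdout set, so no
    # information leaks from the holdout into the imputation.
    holdout[num_cols] = holdout[num_cols].fillna(medians)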
Complete the following items on the train and holdout datasets. Have each item below in a separate markdown and
code cell in your notebook.
GRADING RUBRIC for Part 3:
11. Use a heatmap to show the correlation between all feature pairs (a sketch appears after this rubric). Discuss which features, if any, you would recommend dropping from your model, and why you would want to drop them (what is the expected benefit?). | 5 points |
12. Given what you know about the limitations of RandomForests, engineer one additional feature, and discuss why you think it might help the model. | 5 points |
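For item 11, a minimal heatmap sketch, assuming seaborn and matplotlib are available and continuing from the earlier sketches:

    import matplotlib.pyplot as plt
    import seaborn as sns

    # Correlations are only defined over numeric columns.
    corr = train.select_dtypes(include="number").corr()

    plt.figure(figsize=(10, 8))
    sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm")
    plt.title("Feature correlations (training data)")
    plt.show()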
We're going to perform a grid search with cross validation to train our model(s), so we don't need to manually
split the training data into train and validation sets. Complete the items below:
GRADING RUBRIC for Part 4:
13. Separate your training data into features and labels (X and y). | 5 points |
14. The labels for this dataset are highly imbalanced. Discuss and implement how you will handle this situation for this analysis. | 5 points |
15. Instantiate a RandomForest model of your choosing. | 5 points |
16. Define a grid to tune at least three different hyperparameters with at least two different values each. Discuss why you think these parameter values might be useful for this dataset. | 5 points |
17. Set up a GridSearchCV with 5-fold cross validation. Discuss which scoring metric you chose and why. | 5 points |
18. Train your model using GridSearchCV, and report the best performing hyperparameters. (A sketch of items 13-18 appears after this rubric.) | 5 points |
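One possible shape for items 13-18, assuming Revenue is the label column (per the UCI description) and that your Part 2 encoding has already made every feature numeric. The grid values here are illustrative, not recommendations:

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV

    # Item 13: features and labels.
    X = train.drop(columns=["Revenue"])
    y = train["Revenue"]

    # Items 14-15: class_weight="balanced" is one way to handle the
    # imbalance; it reweights classes instead of resampling rows.
    rf = RandomForestClassifier(class_weight="balanced", random_state=42)

    # Item 16: three hyperparameters, two values each.
    param_grid = {
        "n_estimators": [100, 300],
        "max_depth": [None, 10],
        "min_samples_leaf": [1, 5],
    }

    # Items 17-18: F1 is one reasonable scoring choice for imbalanced labels.
    search = GridSearchCV(rf, param_grid, cv=5, scoring="f1")
    search.fit(X, y)
    print(search.best_params_)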
Using your best performing model from Part 4, complete the items below:
GRADING RUBRIC for Part 5:
19. Calculate accuracy, precision, and recall on the holdout dataset. Discuss which metric you think is most meaningful for this dataset, and why. | 5 points |
20. Discuss how the model performance on holdout compares to the model performance during training. Do you think your model will generalize well? Why or why not? | 5 points |
21. Generate a confusion matrix and discuss your results. | 5 points |
22. Print out the feature importances of your model. (A sketch of items 19, 21, and 22 appears after this rubric.) | 5 points |
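A sketch for items 19, 21, and 22, continuing from the grid search above (again assuming Revenue is the label column):

    import pandas as pd
    from sklearn.metrics import (accuracy_score, confusion_matrix,
                                 precision_score, recall_score)

    X_hold = holdout.drop(columns=["Revenue"])
    y_hold = holdout["Revenue"]

    # Item 19: holdout metrics from the best model found by the search.
    best = search.best_estimator_
    pred = best.predict(X_hold)
    print("accuracy :", accuracy_score(y_hold, pred))
    print("precision:", precision_score(y_hold, pred))
    print("recall   :", recall_score(y_hold, pred))

    # Item 21: rows are true labels, columns are predicted labels.
    print(confusion_matrix(y_hold, pred))

    # Item 22: importances, sorted so the most influential features come first.
    importances = pd.Series(best.feature_importances_, index=X_hold.columns)
    print(importances.sort_values(ascending=False))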
Repeat your model training and tuning steps with the following models:
GRADING RUBRIC for Part 6:
23. Train and tune another decision-tree based model on your training dataset. Using the best performing hyperparameters, test this model on your holdout. How did it perform compared to your earlier model? Do you think your results will generalize? | 5 points |
24. Next, repeat training and tuning on the same data with a LogisticRegression model. Do you need to do any additional feature cleaning or scaling here? Why or why not? (A sketch appears after this rubric.) | 5 points |
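For item 24, one common pattern is to put the scaler and the model in a single pipeline, since logistic regression (unlike tree-based models) is sensitive to feature scale. A minimal sketch, reusing X, y, X_hold, and y_hold from the earlier sketches:

    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Scaling inside the pipeline means the holdout set is standardized
    # with statistics learned from the training data only.
    logreg = make_pipeline(
        StandardScaler(),
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    )
    logreg.fit(X, y)
    print("holdout accuracy:", logreg.score(X_hold, y_hold))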
Great work on this project! If you're interested in learning more about how researchers used this dataset, check out their paper; they even include their results using a RandomForest and other models!
Extra credit: Discuss the paper in one paragraph. | 5 points |