GWU

CS 4364/6364

Introduction to Machine Learning, Spring 2022

GWU Computer Science


Project 1: Classification of Tabular Data with Random Forests

In this project, you will analyze a dataset of roughly 12,000 instances of user history for online shopping sessions over the course of a year, in order to build a binary classifier that predicts whether or not the user ended up buying something during that time.

This type of analysis -- predicting customer purchase intent, identifying returning customers, determining who will respond to which kind of marketing campaign -- is big money and big business. You may see something similar to this appear on a take-home project as part of an interview for a data scientist position. Companies want to be able to model and predict the behavior of users who visit their e-commerce sites in order to target advertising more effectively, for example.

Let's get started!






Part 1: Dataset download and extraction

This dataset comes from the UCI Machine Learning Repository, where you will find a description of the dataset, including its features. Please download this local copy we have for the class, as we've made some modifications to the data to facilitate this assignment.

In order to compare results, we're going to use rows 10000 through 12331 as our holdout dataset. In a new Jupyter Notebook you'll be using for this project, write code to split the dataset into train and holdout, using those rows as holdout. In a markdown cell above this splitting, explain why this split, as specified, is or is not a good idea from the perspective of generalization.
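One way to sketch this split, assuming the data has been loaded into a pandas DataFrame (the file name `online_shoppers.csv` is a placeholder; substitute the course's local copy):

```python
import pandas as pd

def split_train_holdout(df: pd.DataFrame, holdout_start: int = 10000):
    """Split by row position: rows before holdout_start become the
    training set, and everything from holdout_start onward becomes
    the holdout set."""
    train = df.iloc[:holdout_start].copy()
    holdout = df.iloc[holdout_start:].copy()
    return train, holdout

# Usage (file name is a placeholder):
# df = pd.read_csv("online_shoppers.csv")
# train, holdout = split_train_holdout(df)
# print(train.shape, holdout.shape)
```

Note that `iloc` slices by position, not by index label, so this works even if the CSV's index column is nonstandard.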

Please copy each line of the grading rubric (including its number) into a markdown cell in your Jupyter Notebook, matched to the cell that completes it, so we don't miss anything during grading :-)

GRADING RUBRIC for Part 1:

1. Load the CSV correctly into a DataFrame and show its contents in a cell. (5 points)
2. Holdout dataset split as specified. (5 points)
3. Correct explanation of generalization from such a holdout split. (5 points)

Part 2: Data cleaning

Complete the following items on the train and holdout datasets. Have each item below in a separate markdown and code cell in your notebook.
GRADING RUBRIC for Part 2:

4. Use value_counts() in pandas to print out the distributions of the categorical and ordinal numeric features (treat SpecialDay as categorical here). Turn on the setting to reveal missing data (dropna=False) -- how many features had missing values, and what percent of them were missing? Discuss in markdown in your notebook. (5 points)
5. Use the describe() method in pandas to print out summary statistics. Discuss which features you will have to consider more carefully, based on these results. (5 points)
6. Handle any missing data in your training data, but do not simply delete the rows. In your notebook, discuss why you chose to handle the missing data that way. (5 points)
7. The holdout dataset also contains missing data. Discuss and implement how you recovered those items, without deleting those rows. (5 points)
8. Discuss (and implement, if applicable) whether or not you need to scale/normalize your features, and which ones, if any. (5 points)
9. There are several categorical features. Discuss and implement whether you will encode them as ordinal numbers or one-hot encode them, and why you chose that encoding for each such feature. (5 points)
10. You don't need to implement this, but in the dataset, were there any ordinal features that the authors should have recorded as categorical, in your opinion? Why or why not? Discuss in a markdown cell. (5 points)
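A minimal sketch of the missing-data portion of items 4, 6, and 7. The function names are illustrative, and the column lists are placeholders you would fill in after inspecting the data; the key idea is that holdout gaps are filled with statistics computed on train only, so no holdout information leaks into the model:

```python
import pandas as pd

def missing_report(df: pd.DataFrame) -> pd.Series:
    """Percent of missing values per feature, largest first.
    (For per-feature distributions, df[col].value_counts(dropna=False)
    shows NaN as its own category.)"""
    return (df.isna().mean() * 100).sort_values(ascending=False)

def impute_with_train_stats(train, holdout, num_cols, cat_cols):
    """Fill numeric NaNs with the TRAIN median and categorical NaNs
    with the TRAIN mode. Holdout is filled with train statistics,
    never its own, to avoid leakage."""
    train, holdout = train.copy(), holdout.copy()
    for c in num_cols:
        fill = train[c].median()
        train[c] = train[c].fillna(fill)
        holdout[c] = holdout[c].fillna(fill)
    for c in cat_cols:
        fill = train[c].mode().iloc[0]
        train[c] = train[c].fillna(fill)
        holdout[c] = holdout[c].fillna(fill)
    return train, holdout
```

Median/mode imputation is just one defensible choice here; the rubric asks you to justify whichever strategy you pick.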

Part 3: Feature simplification and engineering

Complete the following items on the train and holdout datasets. Have each item below in a separate markdown and code cell in your notebook.
GRADING RUBRIC for Part 3:

11. Use a heatmap to show the correlation between all feature pairs. Discuss which features, if any, you would recommend dropping from your model. Also discuss why you would want to drop them (what is the expected benefit?). (5 points)
12. Given what you know about the limitations of Random Forests, engineer one additional feature, and discuss why you think it might help the model. (5 points)
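A sketch of one way to back the heatmap discussion with numbers: list the feature pairs whose correlation exceeds a threshold, which are the natural candidates for dropping one of the pair. The threshold value is illustrative:

```python
import pandas as pd

def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.9):
    """Return (feature_a, feature_b, |corr|) tuples for pairs whose
    absolute Pearson correlation exceeds threshold."""
    corr = df.corr(numeric_only=True).abs()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            if corr.iloc[i, j] > threshold:
                pairs.append((cols[i], cols[j], corr.iloc[i, j]))
    return pairs

# The heatmap itself (assumes seaborn is installed):
# import seaborn as sns
# sns.heatmap(train.corr(numeric_only=True), cmap="coolwarm", center=0)
```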


Part 4: Model training

We're going to perform a grid search with cross validation to train our model(s), so we don't need to manually split the training data into train and validation sets. Complete the items below:
GRADING RUBRIC for Part 4:

13. Separate your training data into features and labels (X and y). (5 points)
The labels for this dataset are highly imbalanced. Discuss and implement how you will handle this situation for this analysis. (5 points)
14. Instantiate a RandomForest model of your choosing. (5 points)
Define a grid to tune at least three different hyperparameters with at least two different values each. Discuss why you think these parameter values might be useful for this dataset. (5 points)
15. Set up a GridSearchCV with 5-fold cross-validation. Discuss which accuracy metric you chose and why. (5 points)
16. Train your model using GridSearchCV, and report the best-performing hyperparameters. (5 points)
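A sketch of items 13-16 put together. The grid values, the `class_weight="balanced"` handling of label imbalance, and the F1 scoring choice are all illustrative examples, not the required answers -- the rubric asks you to justify your own choices:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

def tune_random_forest(X, y):
    """5-fold grid search over three hyperparameters, two values each.
    class_weight='balanced' is one way to handle imbalanced labels;
    F1 on the positive class is one reasonable scoring choice when
    plain accuracy would be misleading."""
    param_grid = {
        "n_estimators": [100, 300],
        "max_depth": [None, 10],
        "min_samples_leaf": [1, 5],
    }
    rf = RandomForestClassifier(class_weight="balanced", random_state=0)
    search = GridSearchCV(rf, param_grid, cv=5, scoring="f1", n_jobs=-1)
    search.fit(X, y)
    return search

# Usage:
# search = tune_random_forest(X_train, y_train)
# print(search.best_params_, search.best_score_)
```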


Part 5: Model analysis

Using your best performing model from Part 4, complete the items below:
GRADING RUBRIC for Part 5:

17. Calculate accuracy, precision, and recall on the holdout dataset. Discuss which metric you think is most meaningful for this dataset, and why. (5 points)
18. Discuss how the model performance on holdout compares to the model performance during training. Do you think your model will generalize well? Why or why not? (5 points)
19. Generate a confusion matrix and discuss your results. (5 points)
20. Print out the feature importances of your model. (5 points)
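A sketch covering items 17, 19, and 20 in one helper. The function name is illustrative; it works with any fitted classifier that has a `predict` method:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, confusion_matrix)

def holdout_report(model, X_holdout, y_holdout):
    """Accuracy, precision, recall, and the confusion matrix on the
    holdout set. With imbalanced labels, precision/recall on the
    positive (buyer) class are usually more informative than raw
    accuracy -- the discussion of which matters most is yours."""
    pred = model.predict(X_holdout)
    return {
        "accuracy": accuracy_score(y_holdout, pred),
        "precision": precision_score(y_holdout, pred),
        "recall": recall_score(y_holdout, pred),
        "confusion_matrix": confusion_matrix(y_holdout, pred),
    }

# Feature importances of a fitted RandomForest (assumes X is a DataFrame):
# import pandas as pd
# pd.Series(model.feature_importances_,
#           index=X.columns).sort_values(ascending=False)
```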


Part 6: Model Comparison

Repeat your model training and tuning steps with two additional models:
GRADING RUBRIC for Part 6:

21. Train and tune another decision-tree-based model on your training dataset. Using the best-performing hyperparameters, test this model on your holdout set. How did it perform, compared to your earlier model? Do you think your results will generalize? (5 points)
22. Next, repeat training and tuning on the same data with a LogisticRegression model. Do you need to do any additional feature cleaning or scaling here? Why or why not? (5 points)
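One detail worth noting for item 22: unlike tree ensembles, LogisticRegression's solver is sensitive to feature scale, so standardization usually belongs inside a pipeline (fit on the training folds only, which avoids leakage during cross-validation). A minimal sketch, with `class_weight="balanced"` again as one illustrative way to handle the label imbalance:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def make_logreg_pipeline():
    """Scaling + logistic regression in one estimator, so GridSearchCV
    refits the scaler on each training fold automatically."""
    return make_pipeline(
        StandardScaler(),
        LogisticRegression(class_weight="balanced", max_iter=1000),
    )

# Usage: the pipeline drops into GridSearchCV exactly like the forest did.
# pipe = make_logreg_pipeline()
# pipe.fit(X_train, y_train)
```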

All done! Where to go next?

Great work on this project! If you're interested in learning more about how researchers used this dataset, check out their paper -- they even include their results using a RandomForest and other models!

Extra credit:
Discuss the paper in one paragraph. (5 points)