Everyone has used or seen a spreadsheet (Excel) before, and perhaps even written a bit of code in the spreadsheet to update totals.
However, some tasks that are awkward to do in a spreadsheet are straightforward in a programming language like Python. We'll examine one such application.
A more interesting application is item-set mining, which launched the field of data mining years ago (a field now more fashionably called data analytics or business analytics). It is fundamentally a business problem. We'll write some code to solve a simple version of the problem using real data.
3.30 Video:
First, let's begin with: what's a spreadsheet?
The key ideas:
0,10.00,15.00,12.00
1,-3.00,3.00,0.00
2,0.00,-5.00,5.00
3,2.00,0.00,-2.00
4,-8.00,0.00,8.00
5,0.00,2.00,-2.00
6,-4.00,4.00,0.00
7,0.00,3.00,-3.00
8,-2.00,0.00,2.00
9,-9.00,9.00,0.00
10,0.00,7.00,-7.00
data[i,j] = ... # j-th element in row i
print(data[0,1])
print(data[0,2])
print(data[0,3])
These three statements will print the initial balances of the three companies.
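As a minimal sketch of the row-and-column indexing described above, the following parses the same rows from an in-memory string using Python's standard csv module. This is an assumption made for illustration: the course files may load the data differently (for example into an array indexed as data[i,j]), whereas with nested lists the same element is written data[i][j]. The helper names load_rows and final_balances are hypothetical, and the column sum assumes that row 0 holds the opening balances and every later row is a transfer, which is an interpretation of the sample data rather than something the exercise states.

```python
import csv
import io

# The sample rows from above, held in a string so the sketch is self-contained.
CSV_TEXT = """\
0,10.00,15.00,12.00
1,-3.00,3.00,0.00
2,0.00,-5.00,5.00
3,2.00,0.00,-2.00
4,-8.00,0.00,8.00
5,0.00,2.00,-2.00
6,-4.00,4.00,0.00
7,0.00,3.00,-3.00
8,-2.00,0.00,2.00
9,-9.00,9.00,0.00
10,0.00,7.00,-7.00
"""


def load_rows(text):
    """Parse CSV text into rows of the form [index, amount, amount, amount]."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        rows.append([int(record[0])] + [float(x) for x in record[1:]])
    return rows


def final_balances(data):
    """Sum each company's column (columns 1 onward) over all rows.

    Assumes row 0 is the opening balance and later rows are transfers.
    """
    num_cols = len(data[0])
    return [sum(row[j] for row in data) for j in range(1, num_cols)]


data = load_rows(CSV_TEXT)
# Nested-list equivalent of data[0,1], data[0,2], data[0,3]:
print(data[0][1], data[0][2], data[0][3])
print(final_balances(data))
```

Each transfer row in the sample sums to zero, so under this interpretation the total of the final balances equals the total of the opening balances.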
3.31 Exercise: Download the following: trans_analysis.py, test_trans_analysis.py, trans_data.csv, and trans_data2.csv. Fill in the needed code in trans_analysis.py and run test_trans_analysis.py, which has the test code.
3.32 Video:
Consider the following kind of data:
1: {'cheese', 'egg', 'garlic', 'chicken', 'oil', 'banana'}
2: {'cheese', 'oil', 'milk', 'egg'}
3: {'oil', 'egg', 'bread'}
4: {'cheese', 'rice', 'egg'}
5: {'cheese', 'egg', 'beans', 'grapes', 'oil', 'chicken', 'lettuce'}
That is, each customer purchases a set of items.
The problem of item-set mining asks the question: what can be learned about purchase patterns from the data that can then be translated into marketing opportunities?
Two common definitions:
We will focus on computing support, since computing confidence is similar, and we'll do so for pairs of products.
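To make the two terms concrete, here is a small sketch using the five baskets shown earlier. It assumes the commonly used definitions: the support of a set of items is the number of baskets that contain all of them (some texts use the fraction of baskets instead), and the confidence of "A implies B" is the support of A and B together divided by the support of A. The function names support and confidence are illustrative, not taken from the exercise files.

```python
# The five sample baskets from above, keyed by transaction number.
baskets = {
    1: {'cheese', 'egg', 'garlic', 'chicken', 'oil', 'banana'},
    2: {'cheese', 'oil', 'milk', 'egg'},
    3: {'oil', 'egg', 'bread'},
    4: {'cheese', 'rice', 'egg'},
    5: {'cheese', 'egg', 'beans', 'grapes', 'oil', 'chicken', 'lettuce'},
}


def support(items, baskets):
    """Count the baskets that contain every item in `items`."""
    items = set(items)
    return sum(1 for basket in baskets.values() if items <= basket)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction that also
    contain `consequent`. Returns 0.0 if no basket has the antecedent."""
    base = support(antecedent, baskets)
    if base == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), baskets) / base


print(support({'cheese', 'egg'}, baskets))       # 4
print(confidence({'oil'}, {'cheese'}, baskets))  # 0.75
```

Four baskets contain both cheese and egg, so that pair has support 4. Oil appears in four baskets, three of which also contain cheese, giving a confidence of 0.75 for "oil implies cheese".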
Key ideas:
[1, 'Belem', 'oil']
[1, 'Belem', 'cheese']
[1, 'Belem', 'banana']
[1, 'Belem', 'egg']
[1, 'Belem', 'chicken']
[1, 'Belem', 'garlic']
[2, 'Belem', 'oil']
[2, 'Belem', 'cheese']
[2, 'Belem', 'milk']
[2, 'Belem', 'egg']
[3, 'Belem', 'oil']
[3, 'Belem', 'bread']
[3, 'Belem', 'egg']
[4, 'Belem', 'cheese']
[4, 'Belem', 'egg']
[4, 'Belem', 'rice']
[5, 'Belem', 'cheese']
[5, 'Recife', 'lettuce']
[5, 'Recife', 'egg']
[5, 'Recife', 'oil']
[5, 'Recife', 'beans']
[5, 'Recife', 'grapes']
[5, 'Recife', 'chicken']
def get_cities(data):
    # Start with a set:
    cities_set = set()
    for row in data:
        # Add to the set, which automatically removes duplicates
        cities_set.add(row[1])
    # Convert to list:
    cities = list(cities_set)
    return cities
[1, ['oil', 'cheese', 'banana', 'egg', 'chicken', 'garlic']]
[2, ['oil', 'cheese', 'milk', 'egg']]
[3, ['oil', 'bread', 'egg']]
[4, ['cheese', 'egg', 'rice']]
[5, ['cheese', 'lettuce', 'egg', 'oil', 'beans', 'grapes', 'chicken']]
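One possible way to get from the raw rows (transaction number, city, item) to the grouped form just shown is sketched below. The function name group_by_transaction is hypothetical and the approach is only one option; the comments in grocery_analytics.py may ask for a different structure. The sketch relies on dictionaries keeping insertion order (Python 3.7 and later), and uses only the rows for transactions 3 and 4 to stay short.

```python
def group_by_transaction(data):
    """Collect the items of each transaction into one list.

    `data` is a list of rows [transaction_id, city, item].
    Returns a list of [transaction_id, [item, item, ...]] in the order
    in which transaction ids first appear.
    """
    grouped = {}
    for trans_id, _city, item in data:
        # Create the list on first sight of this id, then append the item.
        grouped.setdefault(trans_id, []).append(item)
    return [[trans_id, items] for trans_id, items in grouped.items()]


# An excerpt of the rows above (transactions 3 and 4 only).
rows = [
    [3, 'Belem', 'oil'],
    [3, 'Belem', 'bread'],
    [3, 'Belem', 'egg'],
    [4, 'Belem', 'cheese'],
    [4, 'Belem', 'egg'],
    [4, 'Belem', 'rice'],
]

for entry in group_by_transaction(rows):
    print(entry)
```

Note that the city column is ignored here; grouping per city would need the city as part of the dictionary key.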
['cheese', 'egg', 4]
['egg', 'oil', 4]
['cheese', 'oil', 3]
['cheese', 'chicken', 2]
['chicken', 'egg', 2]
['chicken', 'oil', 2]
['banana', 'cheese', 1]
['banana', 'chicken', 1]
['banana', 'egg', 1]
['banana', 'garlic', 1]
['banana', 'oil', 1]
['bread', 'egg', 1]
['bread', 'oil', 1]
['beans', 'cheese', 1]
['beans', 'chicken', 1]
['beans', 'egg', 1]
['beans', 'grapes', 1]
['beans', 'lettuce', 1]
['beans', 'oil', 1]
['cheese', 'grapes', 1]
['cheese', 'garlic', 1]
['cheese', 'lettuce', 1]
['cheese', 'milk', 1]
['cheese', 'rice', 1]
['chicken', 'grapes', 1]
['chicken', 'garlic', 1]
['chicken', 'lettuce', 1]
['egg', 'grapes', 1]
['egg', 'garlic', 1]
['egg', 'lettuce', 1]
['egg', 'milk', 1]
['egg', 'rice', 1]
['grapes', 'lettuce', 1]
['grapes', 'oil', 1]
['garlic', 'oil', 1]
['lettuce', 'oil', 1]
['milk', 'oil', 1]
['cheese', 'egg', 4]
['egg', 'oil', 4]
['cheese', 'oil', 3]
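The pair counts and the top three pairs can be sketched as follows. This is one possible approach, not necessarily the one the exercise comments describe: it uses itertools.combinations to enumerate the pairs in each transaction and collections.Counter to tally them. The names count_pairs and top_pairs are made up for this sketch, and ties are broken alphabetically here, so the order of the count-1 pairs may differ from the listing above.

```python
from collections import Counter
from itertools import combinations

# The five sample transactions in grouped form: [id, [items]].
transactions = [
    [1, ['oil', 'cheese', 'banana', 'egg', 'chicken', 'garlic']],
    [2, ['oil', 'cheese', 'milk', 'egg']],
    [3, ['oil', 'bread', 'egg']],
    [4, ['cheese', 'egg', 'rice']],
    [5, ['cheese', 'lettuce', 'egg', 'oil', 'beans', 'grapes', 'chicken']],
]


def count_pairs(transactions):
    """Return [item_a, item_b, support] for every pair bought together,
    highest support first (ties in alphabetical order)."""
    counts = Counter()
    for _trans_id, items in transactions:
        # Sort so each pair is always stored as (smaller, larger);
        # set() guards against an item listed twice in one transaction.
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [[a, b, n] for (a, b), n in ranked]


def top_pairs(transactions, k):
    """Return the k pairs with the highest support."""
    return count_pairs(transactions)[:k]


for entry in top_pairs(transactions, 3):
    print(entry)
```

On the five sample transactions this yields 37 distinct pairs, with cheese and egg, egg and oil, and cheese and oil at the top. Enumerating all pairs grows quadratically with basket size, which is one reason the full dataset can take a while to process.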
3.33 Exercise: Start by downloading shopping_data_small.csv, the dataset in the example above. Then, download grocery_analytics.py and test_grocery_analytics.py. The latter file has all the tests, which you can run one by one for each dataset, starting with the first one (5 transactions). The former file has comments that explain what code you need to write. When you run the test, the output should look like this: results_small.txt.
3.34 Exercise: Once you have the small set working, you can un-comment the second and third tests in turn and use the corresponding datasets: shopping_data_medium.csv and shopping_data_full.csv. The output should look like results_medium.txt and results_full.txt, respectively. (The full set might take a while to run.)
3.35 Video:
About item-set mining:
If you'd like to explore further: