\( \newcommand{\eqb}[1]{\begin{eqnarray*}#1\end{eqnarray*}} \)

Two business applications: (I) beyond Excel, and (II) business analytics

Module 3 > Business

Objectives

Everyone has used or seen a spreadsheet (Excel) before, and perhaps even written a bit of code in the spreadsheet to update totals.

However, some desirable actions on a spreadsheet are difficult, but straightforward in a programming language like Python. We'll examine one such application.

The more interesting application is called item-set mining, which launched the field of data mining years ago, now more fashionably called data or business analytics. It is fundamentally a business problem. We'll write some code to solve a simple version of the problem using real data.

3.30 Video:

Part A: Tackling a problem that's difficult in Excel

First, let's begin with: what's a spreadsheet?

A spreadsheet is a table, like this:
But beyond just a table (which you can type in a text document), spreadsheet software like Excel comes with:
- Features to work with the data (perform aggregates, sort).
- Features to present the data (colors, graphs)
Here's the Excel version of the above spreadsheet.
What does one do with spreadsheets?
- Spreadsheets became popular because businesses need to track revenue and expenses, and enable forecasting.
- A budget with various items is easily presented in a spreadsheet.
One kind of use is to store transaction data:
- A monetary interaction with a business is often called a transaction.
- Typically, transactions are ordered by time of occurence (earliest first).
The example we've just seen is an example of a list of transactions that record inter-company payments:
Think of this as data in the bank that is used to conduct the transfers for companies that transact frequently.
As one applies each transaction, every company's balance changes.
The bank's (computational) goal: identify when a balance becomes negative.
To see this in action, let's compute the changing bank balances alongside:
The challenge is to identify "transaction 6" as the first transaction that causes an overdraft (causes one balance to go under).
This is not simple to implement in Excel but, as we'll see, reasonably easy in Python.

The key ideas:

First, how is a spreadsheet stored in a way that a Python program can read?
- The most common format is the the so-called CSV (Comma-Separated Values) format:
- For example, the above transaction data in CSV format (without the column headers, which are strings) looks like this:
```
0,10.00,15.00,12.00
1,-3.00,3.00,0.00
2,0.00,-5.00,5.00
3,2.00,0.00,-2.00
4,-8.00,0.00,8.00
5,0.00,2.00,-2.00
6,-4.00,4.00,0.00
7,0.00,3.00,-3.00
8,-2.00,0.00,2.00
9,-9.00,9.00,0.00
10,0.00,7.00,-7.00
  
```
- As you can see, each row is on one line, with the contents of each row separated by commas (and spaces removed).

We'll use a 2D array, say, data to store the CSV file:

data[i,j] = ...       # j-th element in row i

Thus, in the example above
```
print(data[0,1])
print(data[0,2])
print(data[0,3])
  
```
will print the initial balances of the three companies.
So, the goal is to scan row by row, updating balances until one balance is negative (if that happens).
The moment that happens, exit print the transaction number and offending company's (negative) balance.

3.31 Exercise: Download the following: trans_analysis.py, test_trans_analysis.py, trans_data.csv, and trans_data2.csv. Fill in the needed code in trans_analysis.py. and run test_trans_analysis.py which has the test code.

3.32 Video:

Part B: Item set mining, the foundational problem in business analytics

Consider the following kind of data:

Suppose the data is from a grocery chain spread across several cities.
In this case, the figure shows a sample from two Brazilian cities, Belem and Recife (from which we'll use actual data).
Each row is a shopping item, grouped by customer.
Thus, customer #2 purchased: oil, cheese, milk, egg.

A different way of organizing the same data:

1: {'cheese', 'egg', 'garlic', 'chicken', 'oil', 'banana'}
2: {'cheese', 'oil', 'milk', 'egg'}
3: {'oil', 'egg', 'bread'}
4: {'cheese', 'rice', 'egg'}
5: {'cheese', 'egg', 'beans', 'grapes', 'oil', 'chicken', 'lettuce'}

That is, each customer purchases a set of items.

This is sometimes called item set data.
As you can imagine, this kind of data is available for nearly any kind of business.

The problem of item-set mining asks the question: what can be learned about purchase patterns from the data that can then be translated into marketing opportunities?

For example, if oil and eggs are found to occur together very often, a number of product-placement strategies are possible.
For example, a store can place eggs near oil so that no customer forgets to purchase both.
Alternatively, the two can be separated so that a walk in the store from "eggs" to "oil" leads past many impulse-purchase displays.
Historically, this was one of the earliest "big data" problems in business, and is also called market basket analysis
Note: this subfield is awash with constantly evolving terminology, for example, business analytics, data analytics, data mining, business intelligence.

Two common definitions:

Consider "oil" and "eggs".
Support: what fraction of transactions contain both oil and eggs?
- In the above data: 80% (4 out of 5) have both oil and eggs.
Confidence: of all the transactions that contain "oil", how many also contain "eggs"?
- In the above data: 100% that contain oil (there are four of them) also contain eggs.
Note that confidence is not symmetric: the number of egg-containing transactions that also contain oil is 80%.
The same definitions apply to larger item sets.
For example, given the itemset {oil, eggs} and the target item {chicken}, one can ask how often all three occur together (the support) and what fraction of oil-and-egg containing transactions also contain chicken (confidence)?
Typically, one looks for both high support and high confidence.

We will focus on computing support since computing confidence is similar. And we'll do so for pairs of products.

Key ideas:

The data will be in CSV format. In fact, to simplify, we'll use three versions of the data:
- Small: shopping_data_small.csv, which has data for 5 transactions (and is pictured above).
- Medium: shopping_data_medium.csv, which includes data for 500 transactions.
- Full: shopping_data_full.csv, which is the full dataset with 1,538 transactions from a chain spread across several cities in Brazil.
- The data was obtained from the following paper: Gonçalves, E. C. (2014). A Human-Centered Approach for Mining Hybrid-Dimensional Association Rules. Proceedings of the 17th International Conference on Information Fusion, (FUSION 2014), Salamanca, Spain.
We've seen the structure of the data file above: a transaction number, the location, and product.

We've written some code to extract the data from a file and produce the following:

[1, 'Belem', 'oil']
[1, 'Belem', 'cheese']
[1, 'Belem', 'banana']
[1, 'Belem', 'egg']
[1, 'Belem', 'chicken']
[1, 'Belem', 'garlic']
[2, 'Belem', 'oil']
[2, 'Belem', 'cheese']
[2, 'Belem', 'milk']
[2, 'Belem', 'egg']
[3, 'Belem', 'oil']
[3, 'Belem', 'bread']
[3, 'Belem', 'egg']
[4, 'Belem', 'cheese']
[4, 'Belem', 'egg']
[4, 'Belem', 'rice']
[5, 'Belem', 'cheese']
[5, 'Recife', 'lettuce']
[5, 'Recife', 'egg']
[5, 'Recife', 'oil']
[5, 'Recife', 'beans']
[5, 'Recife', 'grapes']
[5, 'Recife', 'chicken']

From this, let's write some code to merely extract the list of cities.
First, begin by reviewing how a set works in Python, from Module 1.

Now we'll exploit the set() feature in Python to ensure that duplicates are first removed before a list is made:

def get_cities(data):
    # Start with a set:
    cities_set = set()
    for row in data:
        # Add to the set, which automatically removes duplicates
        cities_set.add(row[1])
    # Convert to list:
    cities = list(cities_set)
    return cities

Similarly, one can obtain a complete set of products (without duplicates).

We will provide code to create a list of transactions where every transaction row contains the full list of products, as in:

[1, ['oil', 'cheese', 'banana', 'egg', 'chicken', 'garlic']]
[2, ['oil', 'cheese', 'milk', 'egg']]
[3, ['oil', 'bread', 'egg']]
[4, ['cheese', 'egg', 'rice']]
[5, ['cheese', 'lettuce', 'egg', 'oil', 'beans', 'grapes', 'chicken']]

This is the data we'll use to compute support: for every pair of products in the product list, we'll run through all the transactions and count the number of transactions where the pair exists.

So, the desired output is:

['cheese', 'egg', 4]
['egg', 'oil', 4]
['cheese', 'oil', 3]
['cheese', 'chicken', 2]
['chicken', 'egg', 2]
['chicken', 'oil', 2]
['banana', 'cheese', 1]
['banana', 'chicken', 1]
['banana', 'egg', 1]
['banana', 'garlic', 1]
['banana', 'oil', 1]
['bread', 'egg', 1]
['bread', 'oil', 1]
['beans', 'cheese', 1]
['beans', 'chicken', 1]
['beans', 'egg', 1]
['beans', 'grapes', 1]
['beans', 'lettuce', 1]
['beans', 'oil', 1]
['cheese', 'grapes', 1]
['cheese', 'garlic', 1]
['cheese', 'lettuce', 1]
['cheese', 'milk', 1]
['cheese', 'rice', 1]
['chicken', 'grapes', 1]
['chicken', 'garlic', 1]
['chicken', 'lettuce', 1]
['egg', 'grapes', 1]
['egg', 'garlic', 1]
['egg', 'lettuce', 1]
['egg', 'milk', 1]
['egg', 'rice', 1]
['grapes', 'lettuce', 1]
['grapes', 'oil', 1]
['garlic', 'oil', 1]
['lettuce', 'oil', 1]
['milk', 'oil', 1]

We'll typically be interested in the case where the occurrence (number of transactions is more than 1).

So, if we require a minimum of 3, the output will be

['cheese', 'egg', 4]
['egg', 'oil', 4]
['cheese', 'oil', 3]

3.33 Exercise: Start by downloading shopping_data_small.csv, the dataset in the example above. Then, download grocery_analytics.py and test_grocery_analytics.py. The latter file has all the tests, which you can run one by one for each dataset, starting with the first one (5 transactions). The former file has comments that explain what code you need to write. When you run the test, the output should look like this: results_small.txt.

3.34 Exercise: Once you got the small set working, you can un-comment the second 'and third tests in turn and use the corresponding datasets: shopping_data_medium.csv and shopping_data_full.csv. The output should look like results_medium.txt and results_full.txt, respectively. (The full set might take a while to run.)

3.35 Video:

About item-set mining:

We have scratched the surface of the itemset-mining problem.
You noticed that the full dataset took considerably longer to execute than the small one.
For groups of products, the execution time is even more significant.
There are clever algorithmic techniques to reduce the execution time and the so-called combinatorial explosion with large data sets and many combinations of products.

Optional further exploration

If you'd like to explore further:

A TED talk on Big data
A review paper on big data analytics
A high-level "for managers" roadmap from the Harvard Business Review.
Browse through a technical paper on algorithms for business analytics (just to get an idea).
If you're concerned about your shopping data being analyzed, you should worry about your pre-shopping movement (in a mall) being analyzed: see this TED talk, for example.

Back to Module 3