Matches, Recommendations and the Netflix Challenge

Overview

A preview:

An abstract marketplace: matching users to items.
The stable marriage problem.
Graph matching.
Text matching (mining).
Data mining.
Collaborative filtering.

Our presentation will pull together material from various sources - see the references below. But most of it will come from [Bree1998], [Grif2000], [Herl1999], [Roth2008], and [Su2009].

Matching users to items: the marketplace

Consider an abstract market:

There are a number of users.
And a number of items or "goods".
Each user (buyer) is looking for some items.
Sellers both want to match their items with buyers and also want to promote new items.

The standard approach: a supply-demand marketplace with prices

Assume items are priced by sellers.
Users have knowledge of all items and prices.
Users optimize some "utility" function that takes into account items and prices.
Users purchase items.

Market failures:

Unfortunately, standard markets can fail spectacularly in many cases.
Example: college admissions
- Here, users = students and items = college-seats.
- Prices are known ahead of time.
- But items are limited, with one copy per buyer
  → a challenge for buyers
- Also, all items must be sold
  → a challenge for sellers
- Yet, both buyers and sellers compete.
In limited resource regime:
- Some other arrangement than a free-market is used.
- But that has its own problems.
Problem #1: unraveling
- A "competition" of early offers.
  → race to the bottom by sellers.
- Early offers with buyer-commitment
  → not the best choice for buyers.
Problem #2: congestion
- When a seller learns that a buyer declined, it must rapidly contact "second ranked" buyers.
- And "second ranked" buyers have little time to decide.
Problem #3: deception
- Both buyers and sellers are encouraged to be deceptive
  → Doesn't pay to reveal your preferences.
- Example: if college X knows that a student's top choice is college Y, it doesn't help the student.
  → Student is pressured to conceal this fact.
Other types of market failures:
- Buyers are sometimes unable to find "best-match" items.
- Buyers sometimes don't know of the existence of items that they would like.
- Sellers unable to discern what a buyer is actually looking for.

Let's examine an abstract market along two dimensions:

The first dimension: resources
- In some markets, the resources, and therefore items, are limited.
- Examples: college seats, airline seats, auctions.
The second dimension: clarity of buyer intent
- Buyers aren't able to exactly state what they want.
- Buyers aren't aware of items they may need or want.
Now, let's examine the four cases.
- High-resource and high-clarity.
- Low-resource and low-clarity.
- Low-resource and high-clarity.
- High-resource and low-clarity.
Case 1: High-resource and high-clarity.
- This is the standard market.
- Buyers shop around for best prices or deals.
- Sellers compete on price, on branding, on deals etc.
- Example: books.
Case 2: Low-resource and low-clarity.
- Buyers aren't sure of what exactly they're looking for.
- Sellers have limited items.
- Most often, an intermediary is helpful
  → Creates a market for intermediaries.
- Example: real estate.
Case 3: Low-resource and high-clarity.
- College admissions example.
Case 4: High-resource and low-clarity.
- Buyers aren't looking for a particular item, but some vaguely stated "category".
- Sellers want to stimulate buyer interest and try to match buyers with items.
- Example: idle browsing for movies, songs.
Cases 1 and 2 are uninteresting from our point of view.
Our focus: cases 3 and 4:
- Case 3: Algorithms for matching.
- Case 4: Algorithms for helping sellers recommend items to buyers.

The stable marriage problem

First, let's describe the problem:

We will describe the simplest version of the problem:
- n men: M₁, M₂, ..., M_n.
- n women: W₁, W₂, ..., W_n.
- Each man lists all women in some order of preference.
- Each woman lists all men in some order of preference.
- Broad goal: pair each man with a woman.
Thus, there is perfect clarity in buyer intentions, but limited resources
→ Each man is paired with one woman in the pairing.
We'll use the notation (M_i,W_j) to denote a couple produced by a pairing or marriage.
A marriage is stable if there is no M_i and W_j such that:
- M_i and W_j are not married.
- M_i is higher in W_j's list than her current spouse.
- W_j is higher in M_i's list than his current spouse.
This is important because otherwise, M_i and W_j would not respect the marriage and would elope.
Example: the marriage below is not stable:
Thus, our goal is: compute a stable marriage if one exists.

The Proposal Algorithm:

Due to David Gale and Lloyd Shapely in 1962 [Gale1962].
Also called the Deferred-Acceptance Algorithm.
Key ideas: "man proposes, woman disposes"
- Initially, all men and women are "free".
- At each step, there is an engagement.
- At each step, a free man proposes to any woman who has not rejected him before.
- A woman is required to accept unless she's engaged to someone higher on her list.
- If a woman is engaged to a lower-preference man, she "dumps" that man in favor of the proposer.
- After no more engagements are broken and all are engaged, the marriage is finalized.
Note: the women always "trade up".
Let's work through an example:
- M₁ proposes to W₂
  → She accepts
- M₂ proposes to W₂
  → She breaks off with M₁ and accepts M₂
- M₁ is back on the free list.
- M₃ proposes to W₁
  → She accepts
- M₁ proposes to W₃
  → She accepts
- Done.
Exercise: Who gets the better deal - men or women?

In pseudocode:


Algorithm: proposal (M, W)
Input: n men M₁,...,M_n and n women W₁,...,W_n with their rankings
1.    Add all men to freeList
2.    while freeList is not empty
3.        M = remove a man from freeList
4.        W = highest-ranked woman he has not yet proposed to
5.        if W is unengaged
6.            M and W are engaged
7.        else if W prefers M to her fiance M'
8.            M and W are engaged
9.            M' is placed on the free list
10.       endif
11.   endwhile

Does the algorithm work?

Does it end?
- A is only rejected once by a woman.
- This can happen only a finite number of times.
Is the marriage stable?
- Suppose man M and woman W each prefer the other to their current spouses.
- Suppose M is married to A and W is married to B.
- M proposed to W before A.
- Case 1: A rejected M's proposal.
        → She was already engaged to someone better than M.
        → But then, she couldn't be married to someone lower than M.
        → Contradiction!
- Case 2: A dumped M.
  → She must have found someone better.
  → Contradiction!
Running time: O(n²)
- At each step, an engagement occurs.
- Each woman can only be engaged to n men.
  → Only O(n²) engagements possible.
A marriage is male-optimal if:
- The marriage is stable.
- Among all stable marriages, each male gets his top choice (among the choices in the stable marriages).
Similar definition for male-pessimal, female-optimal, and female-pessimal.
Result: the proposal algorithm results in a marriage that is male-optimal and female-pessimal.

History of the stable marriage problem:

Has its origins in hospital-resident matching:
1900-1945: No regulation
→ No information, chaotic assignments
1945-51: tight deadline
→ Bad for residents and second-choices (market unraveling)
1952: Mullin-Stalnaker algorithm
→ Turned out to be unstable.
This was re-designed into the Boston-Pool algorithm that produced stable assignments.
→ Became the algorithm of choice for the National Resident Matching Program (NRMP).
1984: Boston-Pool shown to be hospital-optimal.
1998: Roth-Peranson algorithm used in NRMP
- Handles couples.
- Incomplete rank lists.
- Limited seats for certain specialties.
2004: Anti-trust lawsuit prompted enaction of a law to asserting its pro-competitiveness.

Other applications and developments:

All kinds of medical specialities.
New York and Boston schools.
Variations: stable-roomate problem, college admissions, blacklists.
Other studies: how to handle cheating.
If, with different numbers of men and women, a stable marriage is not possible, it is NP-hard to find the largest possible stable marriage.
Real marriages: see scientificmatch.com.

Graph Matching

A natural extension to the marriage problem is the matching problem:

A graph G=(V,E) is bipartite if the vertices V can be partitioned into two sets V = X U Y such that there is no edge between an X vertex and a Y vertex.
Example of a weighted bipartite graph:
A matching is a subset of edges such that no two edges share an endpoint.
Example of a matching:
Relationship to marriage problem:
- Think of X as representing males, Y as females.
- An edge weight is the "value" of a pairing.
- Goal: find the maximum weight matching.
A further generalization: when G is non-bipartite, e.g.

Types of matching problems:

Unweighted bipartite matching
- O(VE) augmenting path algorithm.
- Improved to O(E√V) in 1973 by Hopcroft and Karp.
Weighted bipartite matching
- O(V²E) so-called Hungarian algorithm.
- Devised by Harold Kuhn in 1955.
- Kuhn called it the Hungarian method because it was based on a few ideas by Hungarian mathematicians D.Konig and E.Egervary.
- Can be improved to O(V²log(V) + VE) using a Fibonacci heap.
Unweighted general (non-bipartite) graph matching
- Celebrated "blossoms" algrorithm by Jack Edmonds in 1965.
- O(V²E) complexity.
- Recall: Edmonds introduced the idea of polynomial vs. exponential algorithms.
Weighted general (non-bipartite) graph matching
- Blossoms algorithm can be modified to work for weighted version.
- Improvements by Gabow and collaborators to O(EVlog(V)) in 1986 and 1989.
- An LP formulation can be solved with a polynomial number of constraints in O(V⁴) time.

Text matching (mining)

There are many versions of this problem, but let's start with the most elementary version:

We have a (large) collection of documents.
Each document is a "bag of words".
A user query is also a "bag of words".
The objective: return the most relevant documents to the user.
Note: this is a version of the marketplace problem
- The user intention is "less clear".
- But sellers (documents) have no resource limitations.
Here, words are the most basic unit:
- However, one often uses word stems
  → e.g., "hatch" instead of "hatches" or "hatching".
- Sometimes synonyms are used, but this has been found to have limited use.
There are variations of the basic problem that we won't consider:
- Iterated query refinement, and question-answer frameworks.
- Topic detection and classification.
- Cross-language languages retrieval.
Each year the the TREC conference puts out challenge problems in various categories.

Three different approaches to text retrieval:

Boolean queries:
- User specifies a query using Boolean operators such as:
  ("green" AND "eggs" AND "ham") OR ("Horton")
- System finds all documents that exactly match the query.
- Note: documents are not ranked
  → Either a document matches the query or not.
- User studies show that only a few "power users" find this approach useful.
- Relevant-ranked queries are the most preferred form of search.
Probabilistic-relevance approach:
- Identify relevant documents.
- For each relevant document, compute Pr[document is relevant].
- Rank in order of probabilistic-relevance.
Vector-space model:
- Treat each document as a vector in document-space.
- A query is also a vector in the same space.
- Find close matches and list in order of "closeness".

Let's look into the details of the latter two:

First, if you need it: A quick overview of elementary linear-algebra.
Because of the math notation, we will do this section in latex: Click here for the document.

Data mining

What is data mining?

Example: Grocery store
- Suppose a grocer examines past transaction data to conclude
  "When a customer buys wine (item i) they also buy specialty cheese (item j) with high probability"
- Then, by placing cheese near the wine shelf, a grocer increases the chance of an impulse purchase.
- Such associations can be extracted from transaction data.
Example: banking
- Suppose a bank discovered that
  "77% of loan defaults involved a car loan to someone aged 18-21 for a red sportscar"
- Then, such an inference from the data can drive policy.
Online shopping:
- Amazon and others present "recommended items".
- These are based on other customers' purchases.
This extraction of patterns from standard database data is called data mining.
History:
- Late 1980's: rule extraction or knowledge discovery in databases.
- An early landmark paper in 1993 [Agra1993] defined the association-rule problem which spurred the development of the field.
The association-rule problem:
- The input: a database table of transaction data such as:
```
         Transaction #1:  eggs, milk, cookies
         Transaction #2:  beer, chips, pretzels, popcorn
         ...
         Transaction #i:  beer, chips, diapers
         ...
         Transaction #6,391,202: wine, brie, diapers
         
```
- Goal: identify rules of the sort:
  "60% of the time diapers are purchased, beer and chips are also purchased".
- The challenge: extract such rules efficiently from very large databases.
For more details: see these notes on data mining.

Amazon's recommendation engine:

For every pair of items, Amazon identifies a "co-purchase" score.

Algorithm:


1.   Initialize purchase-pair-set S = φ
2.   for each item i
3.       for each customer c who purchased i
4.           for each item j purchased by c
5.               S = S U {(i,j)}
6.           endfor
7.       endfor
8.   endfor

9.   for each item-pair in φ
10.         compute similarity-score for (x,y)
            // The score is the cosine distance between two items.
11.      endfor
12.  endfor

What is the similarity-score?
- One simple way: the number of customers that bought both.
  → To compute this, add a counter after line 5 above.
- Another way: compute the "cosine" distance treating items as column vectors in the matrix A where
  A_ij = 1 if user i bought item j.
Note: Amazon's algorithm does not find similar users to make recommendations.

Collaborative filtering

What is collaborative filtering (CF)?

From a user's point of view:
→ "I'd like to know what users like me rate highly"
From a seller's point of view:
→ I'd like to predict what a user would like based on ratings or purchases from similar users.
Origin of the term:
- Email system add-on called Tapestry devised by Goldberg et al [Gold1992] in 1992.
- Users rate email content or record actions "e.g., replied".
- Users filter email based on others' ratings or actions
  → e.g., "Include all emails that Alice replied to".
The current use is more technical:
- Users rate some items.
- Goal: predict a user's rating of some item they have not rated.
- This approach was developed by the pioneering GroupLens project on movie ratings in 1994 [Resn1994].
Define ratings matrix A_ij as follows
A_ij = user i's rating of item j.

Example with 5 users, 8 items:

Consider the following ratings matrix:


                  1   2   3   4   5   6   7   8
         Alice  |     1       5   5       3     |
         Bob    | 4       2           1       2 |
         Chen   |     1                         |
         Dave   |     2       4           3     |
         Ella   |     5       1   1           5 |

Alice and Bob have no overlap.
Alice and Dave have similar ratings.
→ Can predict Dave's rating of item 5 as "high".
Alice and Ella rate many items in common, but differently.
Chen offers very little data.
→ Hard to find a similar user.

The Netflix Challenge:
- Started in October 2006.
- Ratings from 480,189 users on 17,770 movies.
- Total of 100,480,507 ratings.
- Test test of about 1 million missing ratings.
- Goal: predict missing ratings.
- Challenge: beat Netflix's own algorithm by 10%.
- Won by a combination of teams in 2009.

Approaches to CF:

Data-driven approach:
- Use the data to compute similarities between users or items.
- Generate a prediction based on the data from similar users/items.
- Also called the memory-based approach.
Model-driven approach:
- Use the data to build a statistical model.
  → And estimate that model's parameters.
- Use model for prediction.
Hybrid approaches: these combine the above two.
Henceforth, we'll assume the following version of the CF problem:
- We are given the user-item ratings matrix A.
- The goal: predict missing entry A_uv.
  → User u's rating of item v.

The data-driven approach:

This approach was pioneered by the GroupLens project [Resn1994].
Since then it has been refined and modified by several others.
Two kinds:
- User-similarity: find similar users to user u who have rated v and use their ratings.
- Item-similarity: find similar items to item v, and use the ratings of these items.

Let's first focus on the user-similarity approach.


Algorithm: userSimilarity (A, u, v)
Input: ratings matrix A, and desired missing entry A[u,v]
     // Compute distances to other users.
1.   for each user i
2.      s[i] = similarity (i, u)
3.   endfor

     // Find a set of "close" or similar users.
4.   similar-user-set S = φ
5.   for each user i that has rated item j
6.       if similarEnough (i, u)
7.           add i to S
8.       endif
9.   endfor

     // Use the ratings of these users to predict user u's rating v.
10.  r = combineRatings (S, v)
11.  return r

The details of similarity() similarEnough() and combineRatings() differentiate various algorithms in this category.
→ We will only look at a few ideas - see the references for more details.

First, observe that rows of A are user-vectors:

For example, consider this matrix:


                  1   2   3   4   5   6   7   8
         Alice  |     1       5   5       3     |
         Bob    | 4       2           1       2 |
         Chen   |     1                         |
         Dave   |     2       4           3     |
         Ella   |     5       1   1           5 |

The vector for user Bob is: (4, 0, 2, 0, 0, 1, 0, 2) (assuming 0 is not a rating).

Exercise: What is the cosine of the angle between the "Alice" and "Dave" vectors? What is the cosine of the angle between the "Alice" and "Chen" vectors?

Similarly, columns are item-vectors:

For example:


                  1   2   3   4   5   6   7   8
         Alice  |     1       5   5       3     |
         Bob    | 4       2           1       2 |
         Chen   |     1                         |
         Dave   |     2       4           3     |
         Ella   |     5       1   1           5 |

Computing similarity: several ways
- Inverse Mean-square difference between the vectors for two rows.
- Cosine of angle between vectors.
- Pearson-correlation between the vectors.
Many ways of identifying the "close-enough" ones:
- Keep the K closest ones (where K is an algorithm parameter).
- Maintain a threshold δ and keep those users within this distance.
Combining ratings from close users on item v:
- Average their ratings.
- Take a weighted average, weighted by similarity
- Offset weighted average by the mean.
  → Because some users tend to have high averages or low averages.
- Note: κ is a normalizing factor.
- More complex statistical techniques such as multiple-regression have been tried [Hill1995].
The item-similarity approach: [Sarw2001]
- Recall: our goal is to estimate A_uv.
- In this approach: find a set of similar items I.
- Use the ratings of users for the items in I.
How to find similar items?
- Use the same types of vector-similarity measures as in comparing users.
  → Only this time, we use the columns of A.
- Thus, items are similar if many users tend to rate them similarly.
How to combine ratings?
- Again, using the same ideas as in user-similarity.
- Find those items in I that have ratings for user u.
- Combine those ratings as above.

Model-based approaches:

Model-based approaches try to uncover some underlying "structure" if such a structure exists or can be approximated.
For example: we "know" intuitively that users can be divided into "sci-fi movies", "romance movies" and the like.
If we can identify the user as one of these types, our prediction success might go up.
We'll next look at a few model-based approaches.

Simple clustering:

Create user categories.
→ These could be done by hand (e.g., "sci-fi", "romance" etc).
For each user, identify their category.
To compute A_uv:
- Identify u's category c.
- Find users in c who have rated v.
- Combine their ratings using some kind of averaging.

How do we classify users into categories?

Create "fake" user vectors, one for each category.
→ These serve as prototypes for the categories (ideal users).
Each such vector will have high ratings for the movies in that category, low for the others.
→ This means we need to "know" these categories.
Then, one can use vector-similarity with these category prototypes.

Example:


                  1   2   3   4   5   6   7   8
         Alice  |     1       5   5       3     |
         Bob    | 4       2           1       2 |
         Chen   |     1                         |
         Dave   |     2       4           3     |
         Ella   |     5       1   1           5 |

         Sci-fi | 0   5   0   5   0   0   0   0 |      // Movies 3 and 5 are sci-fi movies. Rest of the ratings are 0.

How to do this without pre-defined categories:

Initially, we'll create K empty clusters.
Initially, assign each user to a random cluster.
Treat each user-vector as a point in n space.
Compute the average "point" (the cluster mean) in each cluster.
Now, for each user, identify closest mean and re-classify the user accordingly.
Repeat.

This is called K-means clustering.


Algorithm: K-means (A, u, v)
Input: ratings matrix A, and desired missing entry A[u,v]
1.   for each user i
        // Initially assign each user to a random cluster.
2.      cluster[i] = random (1, K)
3.   endfor

4.   while not converged
        // Find the mean of each cluster and assign users to the cluster whose mean they are closest to.
5.      M_k = computeMean (C_k)
6.      for each user i
7.         find closest cluster mean k
8.         cluster[i] = k
9.      endfor
10.  endwhile

11.  k = cluster for user u
12.  S = users in C_k that have rated v
13.  r = combineRatings (S, v)
14.  return r

A probabilistic approach: the Naive-Bayes method

Let user u's vector of ratings be (u₁,...,u_n).
We want to predict u_v, one of the missing ones.
Suppose we treat this as the random variable U_v.
Then, in a 5-point rating system, such a random variable can take values 1,2, ... ,or 5.
We ask the question: what is
Pr[U_v=r | u₁,...,u_n].
That is, given the known ratings (u₁,...,u_n). what is the probability that the unknown one is r?
(Where r is either 1,2, ..., or 5.)
If we had these probabilities, how do we actually pick the rating?
→ The simplest way: pick the rating r whose probability Pr[U_v=r | u₁,...,u_n]. is highest.
One way to do this:
- Find other users who have rated both (u₁,...,u_n). and u_v.
- Estimate Pr[U_v=r | u₁,...,u_n]. from this data by simple averaging.
- For this to be possible, the matrix A needs to be dense
  → The data has to be rich enough to find many such "high-overlap" users.
- Unfortunately, this is not the case in practice (at least for movie data).
- Some applications explicitly ask the user to rate 10-15 items so that all users have rated the same items.
So, let's examine a way around this problem: the Naive Bayes approach.
Using Bayes' rule
Pr[U_v=r | u₁,...,u_n]
= Pr[u₁,...,u_n | U_v=r] Pr[U_v=r] / Pr[u₁,...,u_n]
- Note that Pr[u₁,...,u_n] is the simply the user-independent probability of observing the known ratings.
- This is a "constant" for the system.
  → We will absorb it in a normalizing constant.
- Thus, Pr[U_v=r | u₁,...,u_n] ∝ Pr[u₁,...,u_n | U_v=r] Pr[U_v=r]
Next, we will assume independence between items and write
Pr[u₁,...,u_n | U_v=r] = Π_i Pr[u_i | U_v=r]
Each of these can be estimated because there's hope that we can find users who've rated both item v and item i.
Similarly, Pr[U_v=r] can be estimated from the column vector for item v.
Thus, we can estimate
Pr[U_v=r | u₁,...,u_n] = κ Pr[u₁,...,u_n | U_v=r] Pr[U_v=r] = κ Pr[U_v=r] Π_i Pr[u_i | U_v=r]
where κ is a normalizing constant.
In fact, we don't need to worry about the normalizing constant at all:
- We are interested in picking that rating r for which Pr[U_v=r | u₁,...,u_n] is largest.
- This is the same as picking the rating for which
  Pr[U_v=r] Π_i Pr[u_i | U_v=r]
  is highest (without the normalizing constant).

Let's outline this approach with some pseudocode:


Algorithm: naiveBayes (A, u, v)
Input: ratings matrix A, and desired missing entry A[u,v]
      // Estimate Pr[U_v=r].
1.    for each rating value r
2.        num_v_r = # users that rated v as r
3.        num_v = # users that rated v
4.        RatingEstimate[r] = num_v_r / num_v
5.    endfor

      // Estimate Pr[u_i | U_v=r] for each possible i and r.
6.    for each r
7.        for each item j ≠ v
8.            num_u_j_r = # users that rate j as u_j and v as r
9.            num_v_r = # users that rated v as r
10.           conditionalRating[j] = num_u_j_r / num_v_r
11.       endfor
12.   endfor

      // Estimate Pr[U_v=r |  u₁,...,u_n] using the product and identify the largest.
13.   r_max = 0
14.   for each r
15.       product[r] = ratingEstimate[r]
16.       for each item j ≠ v
17.           product[r] = product[r] * conditionalRating[j]
18.       endfor
19.       if product[r] > r_max
20.           r_max = product[r]
21.       endif
22.   endfor

23.   return r_max

The Bayes approach has been used in book recommendation engines [Moon1999].

Machine learning:

The field of machine-learning has grown tremendously in the past 10-20 years.
→ Several sophisticated mc-learning algorithms have been developed.
There are many types of mc-learning problems.
→ The one most applicable to CF is classification.
The classification or supervised learning problem:
- There are m so-called classes
  C₁, ... ,C_m
- We are given a d-dimensional input or query vector
  x = (x₁, ... ,x_d)
- The objective: classify x into the "best" class.
- To train a classifer, we are also give training data, a number of previously classified vectors:
        x₁, y₁    - training vector 1 and its known class y₁
        x₂, y₂    - training vector 2 and its known class y₂
        ...
        x_k, y_k    - training vector k and its known class y_k
A machine learning algorithm:
- Uses the training data to "train" a d-dimensional function F such that
  y = F(x) = F(x₁, ..., x_d).
- Thus, F computes the class for an input vector
  x = (x₁, ... ,x_d)
How can machine-learning be applied to CF? Here are two simple ways:

First, for simplicity, let's re-order the matrix A's rows and columns so that the missing entry is at the bottom right.


                  1   2   3   4   5   6   7   8
         Alice  |     1       5   5       3   3 |
         Bob    | 4       2           1       2 |
         Chen   |     1                         |
         Ella   |     5       1   1           5 |
         Dave   |     2       4   2       3   ? |    // Desired missing entry is for last user and last item.

Then, one way to create a mc-learning problem is user-based:

The classes are the possible ratings 1, 2, ..., 5.

Treat the rows up to column d = n-1 as training vectors with the known classes in the last column.


                    TRAINING VECTORS             CLASS
         Alice  |     1       5   5       3  |     3 |
         Bob    | 4       2           1      |     2 |
         Ella   |     5       1   1          |     5 |

		    QUERY VECTOR
         Dave   |     2       4   2       3  |     ? |

The same thing could be item-based, using columns:

Treat the columns as vectors.


	              TRAINING             QUERY
                      2     4   5     7      8
         Alice  |     1     5   5     3  |   3 |
         Bob    |                        |   2 |
         Chen   |     1                  |     |
         Ella   |     5     1   1        |   5 |

         Dave   |     2     4   2     3  |   ? |    <-  class (y) values

Two problems with this approach:
- A zero rating is treated the same as a low rating.
- The matrix is often too large (vectors too wide).
More sophisticated approaches:
- Use matrix compression via factoring
  → Methods like SVD.
- Specialized factoring or preprocessing.

References and further reading