Module objectives
By the end of this module, you should be able to
- Describe the SVD.
- Explain the relationship between SVD and eigenvalues.
- Explain how the SVD is used in some applications.
13.0
A review of some ideas
We will review a few key ideas that play a role in the SVD:
- Rowspace vs. column space.
- How a matrix transforms a basis vector of the row space.
- The space spanned by the eigenvectors.
- How a matrix transforms a basis vector of the eigen space.
- The matrix form of eigenvector-eigenvalue pairs:
\({\bf A S}\eql {\bf S \Lambda}\).
- Symmetric matrices have an orthonormal eigenvector basis.
Idea #1: rowspace vs. columnspace
Idea #2: Applying the matrix to one of the rowspace basis vectors:
- What happens when we apply
\({\bf A}\) to a basis vector of the row space?
- For example:
$$
{\bf Ar}_1 \eql
\mat{2 & 1 & 3 & 0 & 3\\
1 & 0 & 1 & 1 & 2\\
3 & 2 & 5 & -1 & 4}
\mat{1\\ 0\\ 1\\ 1\\ 2}
\eql
\mat{11\\ 7\\ 15}
$$
- Since \((11,7,15)\) is in the column space, it can
be expressed as a linear combination of the basis
vectors of the column space:
$$
\vecthree{11}{7}{15} \eql
7 \vecthree{2}{1}{3} - 3 \vecthree{1}{0}{2}
$$
- Thus far, nothing surprising.
- Let's look at another example:
$$
{\bf A}
\eql
\mat{2 & 1 & 3\\
1 & 0 & 1\\
3 & 1 & 2}
$$
- In this case, the RREF is
$$
\mbox{RREF}({\bf A})
\eql
\mat{1 & 0 & 0\\
0 & 1 & 0\\
0 & 0 & 1}
$$
which means:
- The row and column spaces both have dimension 3.
\(\rhd\)
3 vectors are sufficient for the basis.
- Because both have dimension 3, each is all of \({\bf R}^3\)
\(\rhd\)
The same basis will work for both.
- Thus, for example, the standard basis
$$
{\bf e}_1 \; = \; \mat{1\\0\\0}
\;\;\;\;\;\;
{\bf e}_2 \; = \; \mat{0\\1\\0}
\;\;\;\;\;\;
{\bf e}_3 \; = \; \mat{0\\0\\1}
$$
works for both the row space and column space.
- Let's now apply the matrix to one of the row space basis
vectors, e.g., the last one:
$$
\mat{2 & 1 & 3\\
1 & 0 & 1\\
3 & 1 & 2}
\vecthree{0}{0}{1}
\eql
\vecthree{3}{1}{2}
\eql
3 {\bf e}_1
+
1 {\bf e}_2
+
2 {\bf e}_3
$$
- Notice that the result \((3,1,2)\) is, unsurprisingly, a linear
combination of the standard basis vectors.
Idea #3: eigenvectors as a basis
- Consider the previous \(3\times 3\) example:
$$
{\bf A}
\eql
\mat{2 & 1 & 3\\
1 & 0 & 1\\
3 & 1 & 2}
$$
- It so happens that
$$
{\bf x}_1 \; = \; \mat{0.707\\0\\-0.707}
\;\;\;\;\;\;
{\bf x}_2 \; = \; \mat{0.18\\-0.967\\0.18}
\;\;\;\;\;\;
{\bf x}_3 \; = \; \mat{0.684\\0.255\\0.684}
$$
are eigenvectors with eigenvalues \(-1, -0.372, 5.372\), respectively.
- For example
$$
{\bf Ax}_1
\eql
\mat{2 & 1 & 3\\
1 & 0 & 1\\
3 & 1 & 2}
\mat{0.707\\0\\-0.707}
\eql
-1
\mat{0.707\\0\\-0.707}
$$
- Let's take a closer look at
$$
{\bf Ax}_1 \eql \lambda_1 {\bf x}_1
$$
- We know that \({\bf x}_1\) is in the rowspace of \({\bf A}\).
- Since \({\bf Ax}_1 = \lambda_1 {\bf x}_1\) with \(\lambda_1 \neq 0\),
we can write \({\bf x}_1 = \frac{1}{\lambda_1}{\bf Ax}_1\), so \({\bf x}_1\)
is also in the colspace of \({\bf A}\).
- Because \({\bf A}\) is square, vectors in the rowspace and colspace
have the same size (number of elements).
- And since here the rowspace and colspace are both all of \({\bf R}^3\),
the same basis can be used for both the rowspace and colspace.
- Now consider this key observation when \({\bf A}\) is symmetric:
- First, Theorem 12.1 tells us that the eigenvectors
form a basis.
- A basis for what? For both the column and
row spaces.
Idea #4: Applying the matrix to an eigenbasis vector:
- Recall from above that \({\bf x}_1, {\bf x}_2, {\bf x}_3\)
form a basis for the row and column space.
- Thus, we get the unusual result that, when applying
\({\bf A}\) to a basis vector of the row space,
we get a scalar multiple of a basis vector of the column
space:
$$
{\bf Ax}_i
\eql
\lambda_i {\bf x}_i
$$
Here \({\bf x}_i\) is a vector in both the row and column space bases.
- In other words, \({\bf A}\) maps the row basis
to the column basis when we choose the basis to be the eigenbasis.
- Now, such a thing is not possible for an arbitrary
matrix \({\bf A}_{m\times n}\).
- When \(m \neq n\), rowspace vectors have \(n\) components while
colspace vectors have \(m\), so no vector could possibly be in both bases.
- However, what might be possible is that there is
some row space basis \({\bf v}_1,{\bf v}_2,\ldots,{\bf v}_r\)
and some column space basis \({\bf u}_1,{\bf u}_2,\ldots,{\bf u}_r\)
such that
$$
{\bf A}{\bf v}_i \eql \mbox{ some multiple of } {\bf u}_i
$$
This is indeed the case, as we'll see in the next section.
- For now, let's look at an example:
- Consider the matrix
$$
{\bf A}
\eql
\mat{2 & 1 & 3 & 0 & 3\\
1 & 0 & 1 & 1 & 2\\
3 & 2 & 5 & -1 & 4}
$$
- It turns out that the RREF is (we saw this earlier):
$$
\mbox{RREF} ({\bf A})
\eql
\mat{{\bf 1} & 0 & 1 & 1 & 2\\
0 & {\bf 1} & 1 & -2 & -1\\
0 & 0 & 0 & 0 & 0}
$$
- So, we need two vectors to form a basis for the rowspace.
- And two vectors to form a basis for the colspace.
- It turns out that
$$
{\bf v}_1 \eql
\mat{0.413\\ 0.238\\ 0.651\\ -0.063\\ 0.587}
\;\;\;\;\;\;\;
{\bf v}_2 \eql
\mat{-0.068\\ 0.344\\ 0.276\\ -0.756\\ -0.479}
$$
is a basis for the rowspace.
- And it turns out that
$$
{\bf u}_1 \eql
\mat{0.528\\ 0.240\\ 0.815}
\;\;\;\;\;\;\;
{\bf u}_2 \eql
\mat{-0.235\\ -0.881\\ 0.411}
$$
is a basis for the colspace.
- Now consider
$$
{\bf Av}_1 \eql
\mat{2 & 1 & 3 & 0 & 3\\
1 & 0 & 1 & 1 & 2\\
3 & 2 & 5 & -1 & 4}
\mat{0.413\\ 0.238\\ 0.651\\ -0.063\\ 0.587}
\eql
9.060
\mat{0.528\\ 0.240\\ 0.815}
$$
- In other words, it so happens that
$$
{\bf Av}_1 \eql 9.060 \; {\bf u}_1
$$
- That is, when \({\bf A}\) is applied to a
rowspace basis vector (if the basis is carefully chosen)
the result is a stretched version of a corresponding
colspace basis vector.
- This is a sort-of generalization of what happens with
eigenvectors, and is the foundation of the SVD.
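If you want to check this numerically, here is a small, dependency-free Java sketch (the matrix, \({\bf v}_1\) and \({\bf u}_1\) are the rounded values from the example above; the class name VerifyRowToColBasis is just a suggestion, not one of the provided course files):

public class VerifyRowToColBasis {
    public static void main(String[] args) {
        double[][] A = {
            {2, 1, 3, 0, 3},
            {1, 0, 1, 1, 2},
            {3, 2, 5, -1, 4}
        };
        double[] v1 = {0.413, 0.238, 0.651, -0.063, 0.587};   // rowspace basis vector
        double[] u1 = {0.528, 0.240, 0.815};                  // colspace basis vector

        // Compute Av1.
        double[] Av1 = new double[3];
        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 5; j++)
                Av1[i] += A[i][j] * v1[j];

        // Each component of Av1 should be (approximately) the same multiple of u1.
        for (int i = 0; i < 3; i++)
            System.out.println("Av1[" + i + "] = " + Av1[i]
                               + "   ratio to u1: " + (Av1[i] / u1[i]));
        // All three ratios come out to about 9.06 (up to rounding).
    }
}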
Idea #5: Suppose a matrix \({\bf A}\) has eigenvectors
\({\bf x}_1,{\bf x}_2,\ldots,{\bf x}_n\)
with corresponding eigenvalues
\(\lambda_1,\lambda_2,\ldots,\lambda_n\). Recall that we placed
the \({\bf x}_i\)'s as columns in a matrix \({\bf S}\) and wrote
all the \({\bf A x}_i = \lambda_i {\bf x}_i\) equations in matrix
form as:
$$
{\bf A S}
\eql
{\bf S \Lambda}
$$
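As a quick sanity check of this matrix form, here is a small Java sketch using the \(3\times 3\) example and the (rounded) eigenvectors and eigenvalues listed above; the class name is ours:

public class VerifyEigenMatrixForm {
    public static void main(String[] args) {
        double[][] A = {
            {2, 1, 3},
            {1, 0, 1},
            {3, 1, 2}
        };
        // Eigenvectors as the columns of S; eigenvalues in the same order.
        double[][] S = {
            { 0.707,  0.180, 0.684},
            { 0.0,   -0.967, 0.255},
            {-0.707,  0.180, 0.684}
        };
        double[] lambda = {-1, -0.372, 5.372};

        // Compare AS with S*Lambda, entry by entry.
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) {
                double AS = 0;
                for (int k = 0; k < 3; k++) AS += A[i][k] * S[k][j];
                double SL = S[i][j] * lambda[j];   // (S Lambda)_{ij} = S_{ij} * lambda_j
                System.out.printf("(%d,%d): AS = %7.3f   S-Lambda = %7.3f%n", i, j, AS, SL);
            }
        }
        // The two columns of output agree to within rounding error.
    }
}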
Idea #6: A symmetric matrix \({\bf B}_{n\times n}\)
satisfies the conditions for the spectral theorem (Theorem 12.2):
- Theorem 12.2: A real symmetric matrix \({\bf B}\)
can be orthogonally diagonalized. That is, it can be
written as \({\bf B}={\bf S \Lambda} {\bf S}^T\) where
\({\bf \Lambda}\) is the diagonal matrix of eigenvalues
and the columns of \({\bf S}\) are orthonormal eigenvectors.
- We will merely write this as:
$$
{\bf B} \eql {\bf V} {\bf D} {\bf V}^T
$$
This is a mere change in symbols.
- The SVD commonly uses this notation.
13.1
Prelude to the SVD
Consider, for any matrix \({\bf A}_{m\times n}\), the
product \({\bf B}_{n\times n} = {\bf A}^T {\bf A}\). Note that \({\bf B}\)
is symmetric, so the ideas above apply to it.
Now, let's compare the two cases (symmetric, general)
side by side.
Symmetric matrix \({\bf A}_{n\times n}\) with rank \(r\):
- Since the rank is \(r\), there are \(r\) nonzero
real eigenvalues \(\lambda_1,\lambda_2,\ldots,\lambda_r\).
- There are \(r\) orthonormal
eigenvectors \({\bf x}_1,{\bf x}_2,\ldots,{\bf x}_r\)
that form a basis for both the rowspace and colspace.
- For the \(r\) nonzero eigenvalues,
\({\bf Ax}_i = \lambda_i {\bf x}_i\).
- An additional \(n-r\) orthonormal vectors can be found
(from the nullspace) so
that \({\bf x}_1,{\bf x}_2,\ldots,{\bf x}_r, {\bf x}_{r+1},\ldots,{\bf x}_n\)
form an orthonormal basis for \({\bf R}^n\).
General matrix \({\bf A}_{m\times n}\) with rank \(r\):
- There are \(r\) orthonormal rowspace basis vectors
\({\bf v}_1,{\bf v}_2,\ldots,{\bf v}_r\).
- And \(r\) orthonormal colspace basis vectors
\({\bf u}_1,{\bf u}_2,\ldots,{\bf u}_r\).
- And \(r\) nonzero numbers \(\sigma_1,\sigma_2,\ldots,\sigma_r\)
such that \({\bf Av}_i = \sigma_i {\bf u}_i\).
- There are additional orthonormal
vectors (from the nullspaces of \({\bf A}\) and \({\bf A}^T\))
\({\bf v}_{r+1},\ldots,{\bf v}_n\)
and \({\bf u}_{r+1},\ldots,{\bf u}_m\)
such that
\({\bf v}_{1},\ldots,{\bf v}_n\) form an orthonormal
basis for \({\bf R}^n\), and
\({\bf u}_{1},\ldots,{\bf u}_m\) form an orthonormal
basis for \({\bf R}^m\).
13.2
The singular value decomposition (SVD)
There are three forms of the SVD for a matrix \({\bf A}_{m\times n}\)
with rank \(r\):
- The long form, with nullspace bases included.
- The short form (up to rank r)
- The outerproduct form with r terms.
We will look at each in turn, although the latter two are what's
practical in applications.
The long form of the SVD:
- Consider a matrix \({\bf A}_{m\times n}\)
with rank \(r\).
- Recall that
$$
{\bf Av}_i \eql \sigma_i {\bf u}_i
$$
where the \({\bf v}_i\)'s are orthonormal rowspace basis vectors
and \({\bf u}_i\)'s are orthonormal colspace basis vectors.
- And where
$$
\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r \geq
\sigma_{r+1} = \sigma_{r+2} = \ldots = 0.
$$
- Recall: the \(\lambda_i\)'s here are the eigenvalues of
\({\bf B} = {\bf A}^T {\bf A}\). Since the rank is \(r\),
\(\lambda_1, \lambda_2, \ldots, \lambda_r\) are positive
while \(\lambda_{r+1}, \ldots, \lambda_{n}\) are zero.
- \(\sigma_i = \sqrt{\lambda_i}\) (by definition)
- Thus,
\(\sigma_1, \sigma_2, \ldots, \sigma_r\) are positive,
while \(\sigma_{r+1}, \ldots, \sigma_{n}\) are zero.
- Put the \({\bf v}_i\)'s as columns in a matrix \({\bf
V}^\prime_{n\times r}\):
$$
{\bf V}^\prime
\eql
\mat{ & & & \\
\vdots & \vdots & \vdots & \vdots\\
{\bf v}_1 & {\bf v}_2 & \cdots & {\bf v}_r\\
\vdots & \vdots & \vdots & \vdots\\
& & &
}
$$
- Similarly, put the \({\bf u}_i\)'s as columns in a matrix
\({\bf U}^\prime_{m\times r}\):
$$
{\bf U}^\prime
\eql
\mat{ & & & \\
\vdots & \vdots & \vdots & \vdots\\
{\bf u}_1 & {\bf u}_2 & \cdots & {\bf u}_r\\
\vdots & \vdots & \vdots & \vdots\\
& & &
}
$$
In-Class Exercise 4:
How many rows do
\({\bf V}^\prime\) and \({\bf U}^\prime\) have? Why?
- Next, define the diagonal matrix \({\bf \Sigma}_{r\times r}^\prime\)
$$
{\bf \Sigma}^\prime
\eql
\mat{\sigma_1 & 0 & & 0 \\
0 & \sigma_2 & & 0 \\
\vdots & & \ddots & \\
0 & 0 & & \sigma_r
}
$$
- As we did with the spectral decomposition, we can write the
relationship between the \({\bf v}_i\)'s and \({\bf u}_i\)'s
as
$$
{\bf AV}^\prime \eql {\bf U}^\prime{\bf \Sigma}^\prime
$$
In-Class Exercise 5:
Confirm that the matrix multiplications are size-compatible and
that the dimensions on both sides match.
- Now recall that \({\bf A}\) has \(n\)
columns while \({\bf V}^\prime\) has \(r\) columns
- The dim of \({\bf A}\)'s colspace is r.
- The dim of \({\bf A}\)'s nullspace is therefore \(n-r\).
- We can use Gram-Schmidt to find an orthonormal
basis for the nullspace of \({\bf A}\)
- This will need \(n-r\) vectors (the dim of the space).
- Let's call these vectors
\({\bf v}_{r+1},{\bf v}_{r+2},\ldots,{\bf v}_n\).
- Assuming we've found such vectors, we'll expand
the matrix \({\bf V}^\prime\) to add these as columns
and call the resulting matrix \({\bf V}\):
$$
{\bf V}
\eql
\mat{ & & & & & & \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
{\bf v}_1 & {\bf v}_2 & \cdots & {\bf v}_r &
{\bf v}_{r+1} & \cdots & {\bf v}_n \\
\vdots & \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\
& & & & & & \\
}
$$
- Note: \({\bf V}\) is \(n\times n\).
- In a similar way, it is possible to expand
\({\bf U}^\prime\) into a matrix \({\bf U}_{m\times m}\)
(using an orthonormal basis for the nullspace of \({\bf A}^T\)).
- Lastly, we'll expand \({\bf \Sigma}^\prime\) into
an \(m\times n\) matrix by adding as many zeroes as necessary.
- We can now expand the earlier relationship
$$
{\bf AV}^\prime \eql {\bf U}^\prime{\bf \Sigma}^\prime
$$
and write it in terms of the expanded matrices:
$$
{\bf AV} \eql {\bf U} {\bf \Sigma}
$$
- Next, because \({\bf V}\) is orthogonal,
we can multiply on the right by \({\bf V}^T\) to give
$$
{\bf A} \eql {\bf U} {\bf \Sigma} {\bf V}^T
$$
In-Class Exercise 6:
Why is this true?
- In other words, any matrix \({\bf A}\) can be decomposed as
$$
{\bf A} \eql
\mbox{ an ortho matrix } \times \mbox{ diagonal matrix }
\times
\mbox{ an ortho matrix }
$$
- This is the singular value decomposition.
- Let's write out the matrix dimensions for clarity:
$$
{\bf A}_{m\times n} \eql {\bf U}_{m\times m} {\bf
\Sigma}_{m\times n} {\bf V}^T_{n\times n}
$$
- We will call this the long form of the SVD.
The short form:
- Because every entry of \({\bf \Sigma}\) other than the first \(r\) diagonal
values is zero, the decomposition can
be reduced to
$$
{\bf A}_{m\times n} \eql {\bf U}_{m\times r} {\bf
\Sigma}_{r\times r} {\bf V}^T_{r\times n}
$$
- The reduced matrices are the ones we started with earlier:
$$
{\bf A}_{m\times n} \eql {\bf U}^\prime_{m\times r} {\bf
\Sigma}^\prime_{r\times r} {\bf V}^{\prime T}_{r\times n}
$$
Let's look at an example to clarify:
- Consider the matrix
$$
{\bf A}
\eql
\mat{2 & 1 & 3 & 0 & 3\\
1 & 0 & 1 & 1 & 2\\
3 & 2 & 5 & -1 & 4}
$$
- In long form, the SVD turns out to be:
$$
\mat{2 & 1 & 3 & 0 & 3\\
1 & 0 & 1 & 1 & 2\\
3 & 2 & 5 & -1 & 4}
\eql
\mat{0.528 & -0.235 & -0.816\\
0.240 & -0.881 & 0.408\\
0.815 & 0.411 & 0.408}
\mat{9.060 & 0 & 0 & 0 & 0\\
0 & 1.710 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0}
\mat{0.413 & 0.238 & 0.651 & -0.063 & 0.587\\
-0.068 & 0.344 & 0.276 & -0.756 & -0.479\\
1 & -2 & 1 & 0 & 0\\
1 & -1 & 0 & 1 & 0\\
-1 & -2 & 0 & 0 & 1}
$$
where
$$
{\bf U} \eql
\mat{0.528 & -0.235 & -0.816\\
0.240 & -0.881 & 0.408\\
0.815 & 0.411 & 0.408}
\;\;\;\;\;
{\bf \Sigma} \eql
\mat{9.060 & 0 & 0 & 0 & 0\\
0 & 1.710 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0}
\;\;\;\;\;
{\bf V} \eql
\mat{0.413 & -0.068 & 1 & 1 & -1\\
0.238 & 0.344 & -2 & -1 & -2\\
0.651 & 0.276 & 1 & 0 & 0\\
-0.063 & -0.756 & 0 & 1 & 0 \\
0.587 & -0.479 & 0 & 0 & 1}
$$
- Now look at the product \({\bf U\Sigma V}^T\).
- Suppose we first multiply \({\bf U\Sigma}\) in
\(({\bf U\Sigma}){\bf V}^T\).
- In the above example, because the third row of \({\bf
\Sigma}\) is zero,
$$
\mat{0.528 & -0.235 & -0.816\\
0.240 & -0.881 & 0.408\\
0.815 & 0.411 & 0.408}
\mat{9.060 & 0 & 0 & 0 & 0\\
0 & 1.710 & 0 & 0 & 0\\
0 & 0 & 0 & 0 & 0}
\eql
\mat{0.528 & -0.235 \\
0.240 & -0.881 \\
0.815 & 0.411 }
\mat{9.060 & 0 & 0 & 0 & 0\\
0 & 1.710 & 0 & 0 & 0}
\eql
\mat{4.784 & -0.402 & 0 & 0 & 0\\
2.174 & -1.507 & 0 & 0 & 0\\
7.384 & 0.703 & 0 & 0 & 0}
$$
- Furthermore, the resulting matrix has the last three
columns with zeroes.
- This means that we can ignore the last three rows of
\({\bf V}^T\), since those elements will get multiplied by
these zeroes:
$$
\mat{4.784 & -0.402 & 0 & 0 & 0\\
2.174 & -1.507 & 0 & 0 & 0\\
7.384 & 0.703 & 0 & 0 & 0}
\mat{0.413 & 0.238 & 0.651 & -0.063 & 0.587\\
-0.068 & 0.344 & 0.276 & -0.756 & -0.479\\
1 & -2 & 1 & 0 & 0\\
1 & -1 & 0 & 1 & 0\\
-1 & -2 & 0 & 0 & 1}
\eql
\mat{4.784 & -0.402\\
2.174 & -1.507\\
7.384 & 0.703}
\mat{0.413 & 0.238 & 0.651 & -0.063 & 0.587\\
-0.068 & 0.344 & 0.276 & -0.756 & -0.479}
$$
- Thus, the only relevant information is in the first \(r\)
columns of \({\bf U}\) and the first \(r\) columns of \({\bf V}\)
(that is, the first \(r\) rows of \({\bf V}^T\)).
- So, if we trimmed all three matrices accordingly, we would
have the short form:
$$
{\bf A} \eql {\bf U}_{m\times r} {\bf \Sigma}_{r\times r} {\bf
V}^T_{r\times n}
$$
Which in our example becomes:
$$
\mat{2 & 1 & 3 & 0 & 3\\
1 & 0 & 1 & 1 & 2\\
3 & 2 & 5 & -1 & 4}
\eql
\mat{0.528 & -0.235 \\
0.240 & -0.881 \\
0.815 & 0.411 }
\mat{9.060 & 0 \\
0 & 1.710}
\mat{0.413 & 0.238 & 0.651 & -0.063 & 0.587\\
-0.068 & 0.344 & 0.276 & -0.756 & -0.479}
$$
- Note: in the short form \({\bf V}_{n\times r}\) has \(r\)
columns.
In-Class Exercise 7:
Download and examine the code in
SVDExample.java.
Then compile and execute to confirm the above.
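If you'd rather see a quick dependency-free check, here is a minimal Java sketch (not the provided SVDExample.java; the entries are the rounded short-form factors printed above) that multiplies \({\bf U}^\prime {\bf \Sigma}^\prime {\bf V}^{\prime T}\):

public class ShortFormCheck {
    public static void main(String[] args) {
        double[][] U = {               // 3 x 2
            {0.528, -0.235},
            {0.240, -0.881},
            {0.815,  0.411}
        };
        double[][] Sigma = {           // 2 x 2
            {9.060, 0},
            {0, 1.710}
        };
        double[][] Vt = {              // 2 x 5
            { 0.413, 0.238, 0.651, -0.063,  0.587},
            {-0.068, 0.344, 0.276, -0.756, -0.479}
        };

        double[][] approxA = multiply(multiply(U, Sigma), Vt);   // 3 x 5
        for (double[] row : approxA) {
            for (double x : row) System.out.printf("%7.3f ", x);
            System.out.println();
        }
        // The printout is, to within rounding, the original matrix A.
    }

    // Plain matrix multiplication Z = X Y.
    static double[][] multiply(double[][] X, double[][] Y) {
        double[][] Z = new double[X.length][Y[0].length];
        for (int i = 0; i < X.length; i++)
            for (int j = 0; j < Y[0].length; j++)
                for (int k = 0; k < Y.length; k++)
                    Z[i][j] += X[i][k] * Y[k][j];
        return Z;
    }
}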
Next, the outer-product form:
- Recall:
- Suppose a matrix \({\bf A}\) has columns \({\bf a}_1, {\bf a}_2,\ldots,{\bf a}_k\).
- Suppose \({\bf B}\) has rows \({\bf b}_1, {\bf
b}_2,\ldots,{\bf b}_k\).
- Then, the outerproduct form of \({\bf AB}\) is
$$
{\bf AB} \eql \sum_{i=1}^k {\bf a}_i {\bf b}_i^T
$$
- We will now write the SVD in this form.
- Start by looking at
$$
{\bf U\Sigma}
\eql
\mat{ & & & \\
\vdots & \vdots & \vdots & \vdots\\
{\bf u}_1 & {\bf u}_2 & \cdots & {\bf u}_r\\
\vdots & \vdots & \vdots & \vdots\\
& & &
}
\mat{\sigma_1 & 0 & & 0 \\
0 & \sigma_2 & & 0 \\
\vdots & & \ddots & \\
0 & 0 & & \sigma_r
}
\eql
\mat{ & & & \\
\vdots & \vdots & \vdots & \vdots\\
\sigma_1{\bf u}_1 & \sigma_2{\bf u}_2 & \cdots & \sigma_r{\bf u}_r\\
\vdots & \vdots & \vdots & \vdots\\
& & &
}
$$
- Now, right-multiply by \({\bf V}^T\) for the SVD:
- The rows of \({\bf V}^T\) are the columns of \({\bf V}\).
- These are the vectors
\({\bf v}_1,{\bf v}_2,\ldots,{\bf v}_r\).
- Thus, in outer product form:
$$
{\bf U\Sigma V}^T = \sum_{k=1}^r \sigma_k {\bf u}_k {\bf v}_k^T
$$
- In other words, we can write any matrix \({\bf A}\) as
$$
{\bf A} = \sum_{k=1}^r \sigma_k {\bf u}_k {\bf v}_k^T
$$
where
the \({\bf u}_i\)'s are a basis for the column space
and
the \({\bf v}_i\)'s are a basis for the row space.
- Let's write our running example in outer product form:
$$\eqb{
\sum_{k=1}^r \sigma_k {\bf u}_k {\bf v}_k^T
& \eql &
\sigma_1 {\bf u}_1 {\bf v}_1^T \plss
\sigma_2 {\bf u}_2 {\bf v}_2^T \\
& \eql &
9.060 \mat{0.528\\ 0.240\\ 0.815}
\mat{0.413 & 0.238 & 0.651 & -0.063 & 0.587}
\plss
1.710 \mat{-0.235\\ -0.881\\ 0.411}
\mat{-0.068 & 0.344 & 0.276 & -0.756 & -0.479} \\
& \eql &
9.060 \mat{
0.218 & 0.126 & 0.343 & -0.033 & 0.310\\
0.099 & 0.057 & 0.156 & -0.015 & 0.141\\
0.336 & 0.194 & 0.530 & -0.052 & 0.479}
\plss
1.710 \mat{
0.016 & -0.081 & -0.065 & 0.177 & 0.113\\
0.060 & -0.303 & -0.243 & 0.666 & 0.422\\
-0.028 & 0.141 & 0.114 & -0.311 & -0.197} \\
& \eql &
\mat{
1.973 & 1.138 & 3.111 & -0.303 & 2.808\\
0.898 & 0.518 & 1.416 & -0.138 & 1.278\\
3.048 & 1.758 & 4.806 & -0.468 & 4.337}
\plss
\mat{
0.027 & -0.138 & -0.111 & 0.303 & 0.192\\
0.102 & -0.518 & -0.416 & 1.138 & 0.722\\
-0.048 & 0.242 & 0.194 & -0.532 & -0.337}\\
& \eql &
\mat{2 & 1 & 3 & 0 & 3\\
1 & 0 & 1 & 1 & 2\\
3 & 2 & 5 & -1 & 4}\\
& \eql &
{\bf A}!
}$$
- Notice that storing the \({\bf u}_k\)'s, \({\bf v}_k\)'s,
and \(\sigma_k\)'s might need less space than the whole matrix \({\bf A}\).
- Finally, consider what happens if we discard some of the
terms in the sum
$$
\sum_{k=1}^r \sigma_k {\bf u}_k {\bf v}_k^T
$$
- In our example it turns out that \(r=2\) and so the entire
sum is
$$
\sigma_1 {\bf u}_1 {\bf v}_1^T \plss
\sigma_2 {\bf u}_2 {\bf v}_2^T \\
$$
- Suppose we discard the second part of the sum, and compute
$$\eqb{
{\bf A}^\prime
& \eql &
\sigma_1 {\bf u}_1 {\bf v}_1^T \\
& \eql &
\mat{
1.973 & 1.138 & 3.111 & -0.303 & 2.808\\
0.898 & 0.518 & 1.416 & -0.138 & 1.278\\
3.048 & 1.758 & 4.806 & -0.468 & 4.337}\\
}$$
- Compare this with the original matrix
$$
{\bf A} \eql
\mat{2 & 1 & 3 & 0 & 3\\
1 & 0 & 1 & 1 & 2\\
3 & 2 & 5 & -1 & 4}
$$
- We see that \({\bf A}^\prime\) is a reasonable approximation
$$
{\bf A}^\prime \;\; \approx \;\; {\bf A}
$$
- This is the main idea in using the SVD for compression (next).
13.3
Using the SVD outerproduct form for compression
Recall that any matrix can be expressed in terms of the columns
of \({\bf U}\) and \({\bf V}\), with \(\sigma_k\):
$$
{\bf A} = \sum_{k=1}^r \sigma_k {\bf u}_k {\bf v}_k^T
$$
We'll now explore the idea of using fewer terms in the sum:
$$
{\bf A}_p = \sum_{k=1}^p \sigma_k {\bf u}_k {\bf v}_k^T
$$
where \(p \leq r\).
Why do this?
- First, note that using fewer terms implies an approximation:
\(\rhd\)
The question is: is \({\bf A}_p \approx {\bf A}\)?
- There is a storage advantage to using fewer terms:
- To store \({\bf A}\) we need \(m \times n\) space.
- To store a pair of vectors \({\bf u}_k\) and \({\bf v}_k\),
we need \(m+n\) space.
In-Class Exercise 8:
Why is the storage of the two vectors \(m+n\)?
For \(p\) such terms, what is the total storage,
and when is this less than \(m \times n\)? When is
it greater? Provide some example values of \(m,n\).
- Next, we need to know how accurate the approximation is.
- Or, how small can \(p\) be to get a reasonable approximation?
In-Class Exercise 9:
Download and examine the code in
SVDExample2.java
to confirm that it computes the outerproduct as written above.
Try \(p=1\) and \(p=r\) to compare the resulting matrix
with the original.
How poor an approximation do we get with \(p=1\)?
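The same computation can be written directly from the outer-product form. Here is a hedged Java sketch (the \(\sigma_k\)'s, \({\bf u}_k\)'s and \({\bf v}_k\)'s are the rounded values of our running example, and RankPApprox is just a suggested name) that builds \({\bf A}_p\) for \(p=1\) and \(p=2\):

public class RankPApprox {
    // sigma_k, u_k, v_k copied (rounded) from the running 3 x 5 example.
    static double[] sigma = {9.060, 1.710};
    static double[][] u = {
        { 0.528,  0.240, 0.815},
        {-0.235, -0.881, 0.411}
    };
    static double[][] v = {
        { 0.413, 0.238, 0.651, -0.063,  0.587},
        {-0.068, 0.344, 0.276, -0.756, -0.479}
    };

    public static void main(String[] args) {
        for (int p = 1; p <= 2; p++) {
            System.out.println("Rank-" + p + " approximation A_p:");
            print(approx(p));
        }
    }

    // A_p = sum over k = 1..p of sigma_k * u_k * v_k^T
    static double[][] approx(int p) {
        double[][] Ap = new double[3][5];
        for (int k = 0; k < p; k++)
            for (int i = 0; i < 3; i++)
                for (int j = 0; j < 5; j++)
                    Ap[i][j] += sigma[k] * u[k][i] * v[k][j];
        return Ap;
    }

    static void print(double[][] M) {
        for (double[] row : M) {
            for (double x : row) System.out.printf("%7.3f ", x);
            System.out.println();
        }
    }
}

The \(p=1\) output matches the \({\bf A}^\prime\) shown earlier, and \(p=2\) recovers \({\bf A}\) up to rounding.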
We'll explore this notion further with images in Section 13.5.
Interestingly, there is a nice theoretical result to support this
approach: the truncated sum \({\bf A}_p\) is the best possible rank-\(p\)
approximation to \({\bf A}\) in the least-squares (Frobenius norm) sense; this is the Eckart-Young theorem.
Some terminology:
- The numbers \(\sigma_i\) along the diagonal of \({\bf \Sigma}\)
are called the singular values.
- The columns of \({\bf U}\) are called the
left singular vectors
- The columns of \({\bf V}\) are called the
right singular vectors
13.4
Latent semantic analysis (LSA): applying the SVD to text data
Let's begin by understanding two problems with text data:
- The clustering problem: given a collection of documents,
place the documents into topic-related clusters.
- We'll use the term document to mean any blob of text.
- A document could be small (a few words) or very large.
- For this discussion, we'll treat a document as an
unordered collection of words.
- The search problem: given a collection of documents and
a new query document, find the documents in the collection
that best match the query document.
- The query "document" could just be a group of search terms.
Key ideas in LSA:
- Documents will be converted into numeric vectors.
- Each document will be a column in the data matrix.
- Then we will compute the SVD.
- To sharpen the results, we'll ignore (zero out) some
of the singular values \(\sigma_k\).
- We'll then use the column vectors of the basis \({\bf U}\)
to get the coordinates of the data in this basis.
- Lastly, we'll use the transformed coordinates for both
clustering and search.
Document vectors:
- Let's use an example.
- A document is a collection of words.
- Consider the following four documents:
- Document 1:
"The multiplication of a matrix with a vector gives a vector"
- Document 2:
"Matrix multiplication is not commutative, but vector dot product is."
- Document 3:
"A balance sheet is like a cost matrix but shows revenue, cost and profit."
- Document 4:
"A cost matrix shows items and costs"
(Yes, they are all very short documents, to make our presentation manageable)
- The first step is to identify a core list of words:
- We'll ignore so-called stop words like "is, a, the" etc.
- Remove all words that occur in only one document.
- We're left with the following 5 words across all four docs:
cost, matrix, multiplication, shows and vector.
- We'll now assign numbers to these words:
1 cost
2 matrix
3 multiplication
4 shows
5 vector
- Next, we'll build the word-document matrix \({\bf D}\):
$$
{\bf D}_{ij} =
\left\{
\begin{array}{ll}
1 & \mbox{ if word i is in document j}\\
0 & \mbox{ otherwise}
\end{array}
\right.
$$
- Thus, in our example, \({\bf D}\) is a \(5\times 4\) matrix (one row per word, one column per document).
- Note: a more refined version could use word-occurence counts
(how many times a word occurs) instead of just 0 or 1.
- The average of the columns is the vector
$$
{\bf \mu} \eql
\mat{0.5 \\ 0.75 \\ 0.5 \\ 0.5\\ 0.5}
$$
- Then, we center the data by subtracting off the mean column
vector from each column:
$$
{\bf X} \eql
\mat{ -0.5 & -0.5 & 0.5 & 0.5\\
0.25 & -0.75 & 0.25 & 0.25\\
0.5 & 0.5 & -0.5 & -0.5\\
-0.5 & -0.5 & 0.5 & 0.5\\
0.5 & 0.5 & -0.5 & -0.5}
$$
- This will be the data used in the SVD.
- Recall the SVD:
$$
{\bf X} \eql {\bf U} {\bf \Sigma} {\bf V}^T
$$
- Next, let \(r\) be the rank so that:
- The first \(r\) singular values are nonzero:
\(\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r > 0\)
- The others are zero: \(\sigma_{r+1} = \sigma_{r+2} = \ldots = 0\).
- Here, the first \(r\) columns of \({\bf U}\) are a basis
for the data.
- We will use this basis for a change of coordinates.
- And
$$
{\bf U} \eql
\mat{-0.481 & 0.136\\
-0.272 & -0.962\\
0.481 & -0.136\\
-0.481 & 0.136\\
0.481 & -0.136}
\;\;\;\;\;\;
{\bf \Sigma} \eql
\mat{2.070 & 0\\
0 & 0.683}
\;\;\;\;\;\;
{\bf V}^T \eql
\mat{0.432 & 0.564 & -0.498 & -0.498\\
-0.751 & 0.658 & 0.047 & 0.047}
$$
(This is the short form of the SVD).
- Notice that the rank happened to be \(r=2\).
- In other cases, with a higher rank, we might want to zero-out
low singular values and use a small value of \(r\).
- Now, recall that the columns of \({\bf U}\) are the
\({\bf u}_i\) vectors.
\(\rhd\)
They are a basis for the column space of \({\bf X}\).
- Thus, we could re-orient the data to get coordinates
in this new basis.
- How do we change coordinates using the basis \({\bf U}\)?
- Let \({\bf x}_1, {\bf x}_2, \ldots, {\bf x}_n\) be the
columns of the centered data \({\bf X}\).
$$
{\bf X} \eql
\mat{ & & & \\
\vdots & \vdots & \vdots & \vdots\\
{\bf x}_1 & {\bf x}_2 & \cdots & {\bf x}_n\\
\vdots & \vdots & \vdots & \vdots\\
& & &
}
$$
- To change coordinates, we're going to apply
\({\bf U}^T\) to each column:
$$
{\bf y}_i \eql {\bf U}^T {\bf x}_i
$$
- For convenience, we'll now put the \({\bf y}_i\)'s as columns
in a matrix \({\bf Y}\) so that
$$
{\bf Y} \eql {\bf U}^T {\bf X}
$$
- This will allow us to substitute the SVD \({\bf X}={\bf
U\Sigma V}^T\)
$$\eqb{
{\bf Y} & \eql & {\bf U}^T {\bf X}\\
& \eql & {\bf U}^T {\bf U\Sigma V}^T\\
& \eql & {\bf \Sigma V}^T
}$$
as an alternative way to compute the change in coordinates.
- What do we do with the new coordinates, and why would they be useful?
Using the new coordinates for clustering:
- There are many options for clustering, as we've seen in Module 12.
- One approach is to compute the covariance (with changed coordinates)
and identify highly-related documents.
- Note: in computing the covariance, it's more common
to use a rank-based (ordinal) correlation such as the Spearman
correlation coefficient.
Using the new coordinates for search:
- In the search problem:
- We are given a new query document.
- We've already computed the changed coordinates of every
document in the collection.
- We first compute the document vector \({\bf q}\) of
the new query document.
- Then we subtract off the mean:
$$
{\bf q}^\prime \eql {\bf q} - {\bf \mu}
$$
- Then, we get its new coordinates by
$$
{\bf y}^\prime \eql {\bf U}^T {\bf q}^\prime
$$
- Now, we compare against each transformed document
\({\bf y}_i\) to see which one is closest.
- One way to determine the closest: find the vector \({\bf y}_i\)
whose angle with \({\bf y}^\prime\) is smallest, i.e., whose cosine similarity with \({\bf y}^\prime\) is largest, as in the sketch below.
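Here is a minimal sketch of this search step in Java. The matrix \({\bf U}\), the mean \({\bf \mu}\) and the transformed coordinates \({\bf Y}={\bf \Sigma V}^T\) are the rounded values from the example above; the query vector and the class name are made up for illustration:

public class LsaSearchSketch {
    public static void main(String[] args) {
        double[][] U = {                 // 5 x 2, columns u_1, u_2
            {-0.481,  0.136},
            {-0.272, -0.962},
            { 0.481, -0.136},
            {-0.481,  0.136},
            { 0.481, -0.136}
        };
        double[] mu = {0.5, 0.75, 0.5, 0.5, 0.5};

        // Transformed document coordinates Y = Sigma V^T (2 x 4), one column per document.
        double[][] Y = {
            { 0.894, 1.168, -1.031, -1.031},
            {-0.513, 0.449,  0.032,  0.032}
        };

        // A made-up query containing the words "cost" and "shows" (words 1 and 4).
        double[] q = {1, 0, 0, 1, 0};

        // Center the query and change coordinates: y' = U^T (q - mu).
        double[] yq = new double[2];
        for (int k = 0; k < 2; k++)
            for (int i = 0; i < 5; i++)
                yq[k] += U[i][k] * (q[i] - mu[i]);

        // Compare against each document using cosine similarity.
        for (int j = 0; j < 4; j++) {
            double[] yj = {Y[0][j], Y[1][j]};
            System.out.println("document " + (j + 1) + ": cosine = " + cosine(yq, yj));
        }
        // Documents 3 and 4 (the ones about costs) come out with the highest similarity.
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }
}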
In-Class Exercise 13:
Download and unzip LSA.zip,
then compile and execute LSA.java to
see the above example.
- Read the code to verify the above computations in clustering.
- Change the data set as directed. What are the resulting
sizes of the matrices \({\bf U},{\bf \Sigma}\) and
\({\bf V}^T\)?
- What is notable about the document-document
covariance computed with the changed
coordinates (for each of the two sets of data)?
Why is this approach of using the SVD called latent semantic analysis?
- One interpretation of the new basis vectors \({\bf u}_i\),
that is, the columns of \({\bf U}\) is that each is
a kind of "pseudo" document.
- For example, consider \({\bf u}_1\) in the example above,
along with the corresponding words
-0.481 cost
-0.272 matrix
0.481 multiplication
-0.481 shows
0.481 vector
- What interpretation can we assign to these numbers?
- Think of \(-0.481\) as the contribution of the word
"cost" to the pseudo-document \({\bf u}_1\).
- Then, two data vectors that are close are likely to
have a similar first coordinate.
\(\rhd\)
Which means the word "cost" has similar value in each of the
two documents.
- Another way to think about it:
- The coordinate transformation highlights shared words.
- Thus, two documents with many shared words will have a small
angle in the new coordinates.
13.5
Image clustering and search using SVD
The ideas of clustering and search above can be applied to
any dataset that can be vectorized.
So, the real question is: how do we turn images into vectors?
Making a vector out of an image:
- First, we'll consider making a vector out of a raw image.
- Let's examine greyscale images.
- A greyscale image is a 2D array of numbers.
- Typically, each number is a value between 0 and 255,
representing intensity.
- Suppose the image is stored in an
array pixels[i][j] of size M rows by N columns.
- A simple way to make this a vector:
- Take the first column of the 2D array, the
numbers
pixels[0][0]
pixels[1][0]
...
pixels[M-1][0]
- Then append the second column:
pixels[0][0]
pixels[1][0]
...
pixels[M-1][0]
pixels[0][1]
pixels[1][1]
...
pixels[M-1][1]
- Then, the third, and so on.
- The resulting vector is the vector representation of a single
image.
- We convert each image into a vector.
- As before, for search or clustering, each image is converted
into a vector and these are transformed into the coordinates
based on the SVD.
- Search and clustering proceed exactly as described for
text once the coordinates are changed.
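Here is a minimal Java sketch of the column-by-column flattening just described (the class and method names are ours, not part of the provided code):

public class ImageToVector {
    // Flatten an M-row by N-column greyscale image into a vector of length M*N,
    // one column at a time, exactly as described above.
    static double[] toVector(int[][] pixels) {
        int M = pixels.length, N = pixels[0].length;
        double[] v = new double[M * N];
        int k = 0;
        for (int j = 0; j < N; j++)       // for each column ...
            for (int i = 0; i < M; i++)   // ... append its M entries
                v[k++] = pixels[i][j];
        return v;
    }

    public static void main(String[] args) {
        int[][] tiny = { {10, 20}, {30, 40}, {50, 60} };   // M = 3 rows, N = 2 columns
        for (double x : toVector(tiny)) System.out.print(x + " ");
        // Prints the first column (10, 30, 50) followed by the second (20, 40, 60).
    }
}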
In-Class Exercise 14:
Download and unzip imageSearch.zip,
then compile and execute ImageSearch.java to
see the above example.
- Read the code to verify the above computations.
- Print the size of \({\bf U}\).
- Examine the files in eigenImages. What are these
images? What do they correspond to in the SVD?
A caveat:
- We need the image vectors (the analogue of document vectors)
to be comparison-compatible.
- In the case of documents, for example, the second row (thus,
the second component of each document vector) corresponds to the same word.
- But is this true of an image? When applying to face
recognition, we'll need the vector components to be comparable.
- That is, the pixel in position 15 of each vector should be a
pixel that's meaningfully similar across the images.
- What that means is that, in practice,
the images need to be aligned,
and often made coarse-grained before converting into vectors.
13.6
SVD Pseudoinverse and Least Squares
The SVD has turned out to be one of the most useful tools
in linear algebra.
We will list some other applications below.
Application: finding the effective rank of a matrix.
- Sometimes we wish to know the rank of a matrix.
- The number of singular values that are significantly larger than
zero gives the effective (numerical) rank.
- If a large matrix is low rank, one can look for
a more compact representation.
Recall this about inverses:
- So far, the only inverse we've seen is for a square matrix,
all of whose columns (and rows) are independent.
- The only option we considered when the above isn't true
is the least-squares approach to solving \({\bf Ax}={\bf b}\).
But this requires the columns of \({\bf A}\) to be independent,
and it only addresses that particular problem.
Next, let's turn to finding an approximate inverse:
In-Class Exercise 15:
Suppose \({\bf \Sigma}\) is a square diagonal matrix with
\(\sigma_i\) as the i-th entry on the diagonal. What is
the inverse of \({\bf \Sigma}\)?
In-Class Exercise 16:
If \({\bf A}\) is a full-rank square matrix with
SVD \({\bf A}={\bf U\Sigma V}^T\),
then show that \({\bf A}^{-1}={\bf V\Sigma^{-1} U}^T\)
with \({\bf \Sigma}^{-1}\) computed as in the above exercise.
Application: pseudoinverse
- If \({\bf A}\) is not square, we can use the SVD to
define an approximate inverse.
- Clearly, because \({\bf U}\) and \({\bf V}\) (in the
long-form) are
orthogonal and square, they already have inverses.
- The problem is \({\bf \Sigma}\) is not necessarily square
and may not have full rank even if it is.
- But, let's use the idea from the exercise above.
- Define the matrix \({\bf \Sigma}^+_{n\times m}\) as follows:
- For each nonzero diagonal entry \(\sigma_k\) in \({\bf \Sigma}\),
let \(\frac{1}{\sigma_k}\) be the corresponding entry in \({\bf \Sigma}^+\); every other entry of \({\bf \Sigma}^+\) is zero.
- Then, the product \({\bf V\Sigma^{+} U}^T\) is somewhat
like an inverse.
- Definition: The pseudoinverse of a matrix \({\bf A}\)
is defined as \({\bf A}^+ = {\bf V\Sigma^{+} U}^T\).
- When \({\bf A}\) is square and invertible, \({\bf A}^+ =
{\bf A}^{-1}\).
- What do we know about \({\bf A}^+\) in general?
- Is the pseudoinverse useful in solving \({\bf Ax}={\bf b}\)
when \({\bf A}\) has no inverse?
- Define the "approximate" solution
$$
{\bf x}^+ \eql {\bf A}^+ {\bf b}
$$
- \({\bf A}^+\) has some nice properties:
- It is equal to the true inverse when the true inverse exists.
- \({\bf x}^+\) minimizes \(|{\bf b} - {\bf Ax}|^2\) over
all vectors \({\bf x}\).
- Moreover, if \({\bf z}\) is any other solution that
minimizes \(|{\bf b} - {\bf Ax}|^2\), then
\(|{\bf x}^+| \leq |{\bf z}|\).
- The last property implies that the solution from the
pseudoinverse is the "best" in some sense.
- Recall: the RREF is very sensitive to numerical errors.
- Thus, it's best to use \({\bf x}^+\).
- It turns out that computing the SVD can be made numerically robust.
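Here is a small Java sketch of this recipe applied to our running \(3\times 5\) example (the short-form factors are the rounded values from Section 13.2, and the right-hand side \({\bf b}\) is made up; PseudoinverseSketch is just a suggested name). Using only the nonzero \(\sigma_k\)'s is exactly the \({\bf \Sigma}^+\) construction above:

public class PseudoinverseSketch {
    public static void main(String[] args) {
        // Short-form SVD factors of the 3 x 5 example (rounded).
        double[] sigma = {9.060, 1.710};
        double[][] U = {                   // columns u_1, u_2  (3 x 2)
            {0.528, -0.235},
            {0.240, -0.881},
            {0.815,  0.411}
        };
        double[][] V = {                   // columns v_1, v_2  (5 x 2)
            { 0.413, -0.068},
            { 0.238,  0.344},
            { 0.651,  0.276},
            {-0.063, -0.756},
            { 0.587, -0.479}
        };
        double[] b = {1, 2, 3};            // a made-up right-hand side

        // x+ = V Sigma^+ U^T b, computed one factor at a time.
        double[] c = new double[2];        // c = U^T b
        for (int k = 0; k < 2; k++)
            for (int i = 0; i < 3; i++)
                c[k] += U[i][k] * b[i];

        for (int k = 0; k < 2; k++)        // Sigma^+ c: divide by the nonzero sigmas
            c[k] /= sigma[k];

        double[] x = new double[5];        // x+ = V (Sigma^+ U^T b)
        for (int i = 0; i < 5; i++)
            for (int k = 0; k < 2; k++)
                x[i] += V[i][k] * c[k];

        for (double xi : x) System.out.printf("%8.4f%n", xi);
        // x+ minimizes |b - Ax|, and has the smallest length among all minimizers.
    }
}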
Why use SVD instead of standard least-squares?
- First, let's recall the least-squares approach from Module 7:
$$
\hat{\bf x}
\eql
({\bf A}^T {\bf A})^{-1}
{\bf A}^T {\bf b}
$$
- It turns out that this approach is susceptible to numerical errors.
- How would one use the SVD instead? One approach is to compute the
pseudo-inverse and use it, as shown above.
- A numerically better approach uses this idea:
- For any approximation \(\hat{\bf x}\), consider
the difference between \({\bf A}\hat{\bf x}\) and
the target \({\bf b}\), and substitute the SVD:
$$\eqb{
{\bf A}\hat{\bf x} - {\bf b}
& \eql &
{\bf U\Sigma V}^T \hat{\bf x} - {\bf b} \\
& \eql &
{\bf U} ({\bf \Sigma V}^T \hat{\bf x}) - {\bf U}({\bf U}^T {\bf b})\\
& \eql &
{\bf U} ({\bf \Sigma y} - {\bf c})
}$$
where \({\bf y} = {\bf V}^T \hat{\bf x}\) and \({\bf c} = {\bf U}^T {\bf b}\)
(and we used \({\bf U}{\bf U}^T = {\bf I}\), since \({\bf U}\) is square and orthogonal).
- So, minimizing \(|{\bf A}\hat{\bf x} - {\bf b}|\) is
equivalent to minimizing \(|{\bf \Sigma y} - {\bf c}|\),
because multiplying by the orthogonal matrix \({\bf U}\) does not change lengths.
- This is a different least-squares problem, but one with a
diagonal matrix (which makes it numerically robust to work with).
Note: The presentation in 13.1, 13.2, and 13.6
is inspired by this excellent introductory
article:
D. Kalman.
A Singularly Valuable Decomposition: The SVD of a Matrix.
The College Mathematics Journal, Volume 27, Issue 1, 1996.
13.7
The relationship between SVD and PCA: SVD = PCA + more
Both involve finding the eigenvectors of the data matrix multiplied by its
transpose.
Are they merely slight variations of each other?
No! PCA is merely one part (one third) of the SVD.
Let's explain:
- Consider the following data matrix,
$$
{\bf X} \eql
\mat{
5.1 & 4.7 & 7.0 & 6.4 & 6.3 & 5.8\\
3.5 & 3.2 & 3.2 & 3.2 & 3.3 & 2.7\\
1.4 & 1.3 & 4.7 & 4.5 & 6.0 & 5.1\\
0.2 & 0.2 & 1.4 & 1.5 & 2.5 & 1.9}
$$
This is from an actual dataset with samples from three categories.
- Here, each column is a sample of data.
- We have \(m=4\) rows and \(n=6\) columns.
- The number of rows is the dimension of the data.
- Thus the first data point is the first column:
$$
\mat{5.1 \\ 3.5\\ 1.4\\ 0.2}
$$
These are four measurements made of a flowering plant (sepal length and
width, and petal length and width).
- Let's apply PCA and then reflect on its meaning.
In-Class Exercise 17:
Download
PCAExample.java
and
SVDStats.class.
Compile and execute, then examine the code in
PCAExample.java
to confirm that we are applying PCA to the above data.
- Let's review what happens with PCA.
- Compute the mean data vector (the average of the columns):
$$
{\bf\vec{\mu}} = \mat{5.883\\ 3.183\\ 3.833\\ 1.283}
$$
- We first center the data by subtracting \({\bf\vec{\mu}}\)
from each column of the original data to get the centered data:
$$
{\bf X} \eql
\mat{
-0.783 & -1.183 & 1.117 & 0.517 & 0.417 & -0.083\\
0.317 & 0.017 & 0.017 & 0.017 & 0.117 & -0.483\\
-2.433 & -2.533 & 0.867 & 0.667 & 2.167 & 1.267\\
-1.083 & -1.083 & 0.117 & 0.217 & 1.217 & 0.617}
$$
- Next, we compute the covariance matrix (without the \(\frac{1}{n}\))
$$
{\bf C}_{m\times m} \eql {\bf X} {\bf X}^T
$$
Important: this is not \({\bf X}^T {\bf X}\) (which will be
\(n\times n\)).
- Next, because \({\bf C}\) is symmetric, we can diagonalize:
$$
{\bf C} \eql {\bf S \Lambda S}^T
$$
where the eigenvectors are the columns of \({\bf S}\).
- In our example, the eigenvectors turn out to be:
$$
{\bf S} \eql
\mat{
0.314 & -0.897 & -0.103 & -0.294\\
-0.047 & -0.220 & 0.928 & 0.296\\
0.864 & 0.155 & -0.071 & 0.475\\
0.392 & 0.351 & 0.350 & -0.775}
$$
- Lastly, we transform the coordinates of the data using
the eigenvectors as the new basis:
$$\eqb{
{\bf Y} & \eql & {\bf S}^{-1} {\bf X} \\
& \eql & {\bf S}^{T} {\bf X} \\
}$$
- The transformed coordinates give us new decorrelated dimensions (a small numerical check appears at the end of this section).
- To see why, let's compare the two covariance matrices.
- The covariance matrix of the original (centered) data is:
$$
{\bf X}{\bf X}^T \eql
\mat{
3.708 & -0.152 & 7.013 & 2.828\\
-0.152 & 0.348 & -1.147 & -0.512\\
7.013 & -1.147 & 19.833 & 9.043\\
2.828 & -0.512 & 9.043 & 4.268}
$$
It's hard to see whether this is telling us anything, other than
that the variance within the third dimension is higher than the others.
- But look at
$$
{\bf Y}{\bf Y}^T \eql
\mat{
26.546 & -0.000 & 0.000 & -0.000\\
-0.000 & 1.353 & 0.000 & -0.000\\
0.000 & 0.000 & 0.259 & 0.000\\
-0.000 & -0.000 & 0.000 & 0.000}
$$
- The re-oriented data places all the variation in the first modified
dimension.
- The new dimensions are completely de-correlated.
- In the new view, the variation drops sharply
from the first dimension to the second, and again from the second to the third.
- The interpretation is: only one dimension is really important.
- This makes sense when we plot the first two dimensions
of both the data and the PCA-transformed data:
- Here, we happen to know the three categories and are able to show
them in three colors.
- If we didn't know (which is typically the case), it would
be harder to discern the groups from the original data.
- Next, let's examine the SVD and observe which part of the
SVD is PCA, when the SVD is in long form:
$$
{\bf X} \eql {\bf U} {\bf \Sigma} {\bf V}^T
$$
Here:
- The columns of \({\bf U}\) are the eigenvectors of
\({\bf XX}^T\).
- The columns of \({\bf V}\) (rows of \({\bf V}^T\))
are the eigenvectors of \({\bf X}^T {\bf X}\).
- In our example, the eigenvectors in \({\bf U}\) are:
$$
{\bf U} \eql
\mat{
0.314 & -0.897 & -0.103 & -0.294\\
-0.047 & -0.220 & 0.928 & 0.296\\
0.864 & 0.155 & -0.071 & 0.475\\
0.392 & 0.351 & 0.350 & -0.775}
$$
Which is exactly the \({\bf S}\) matrix in PCA.
- Finally, let's examine the \({\bf \Sigma}\) matrix:
$$
{\bf \Sigma} \eql
\mat{
5.152 & 0.000 & 0.000 & 0.000\\
0.000 & 1.163 & 0.000 & 0.000\\
0.000 & 0.000 & 0.509 & 0.000\\
0.000 & 0.000 & 0.000 & 0.015}
$$
This tells us the importance of each dimension.
- Note: in the short-form of the SVD, \({\bf U}\) only has
\(r\) columns, and will therefore have fewer columns than
the \({\bf S}\) matrix from PCA.
What do we learn from all of this?
- Think of the SVD as the complete decomposition:
$$
{\bf X} \eql {\bf U} {\bf \Sigma} {\bf V}^T
$$
The SVD
- contains PCA in the \({\bf U}\) part;
- applies to data matrices and other kinds of matrices;
- has many more applications beyond data matrices.
- PCA is so-called because one can go deeper into the
\({\bf U}\) part for data matrices only
and prove that the result decorrelates data between the rows.
- An assumption that's buried inside PCA is that, when you go
across a row, the variation from column to column is due to noise.
- If that were not true, then \({\bf X}{\bf X}^T\) would not measure correlation.
- In PCA we are really interested in correlation between
rows (among the dimensions of the data).
- This assumption is not needed in SVD for its other applications.
- In fact, in the SVD applications we've seen, we cared about
correlation amongst the columns (and not the rows).
- One more important point:
- Most often the SVD is in short-form, which means that
\({\bf U}\) in reduced form (only \(r\) columns) is not the
same as \({\bf S}\) in PCA.
- In this case, the new coordinates via SVD, \({\bf Y}={\bf U}^T{\bf
X}\), will be smaller (fewer components) than the
coordinates from PCA, \({\bf Y}={\bf S}^T{\bf X}\). We saw this in
the latent-semantic analysis section.
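Before moving on, here is a small dependency-free check of the PCA computation in this section (the centered data \({\bf X}\) and the eigenvector matrix \({\bf S}\) are the rounded values shown above; PcaCheck is just a suggested name). It computes \({\bf Y}={\bf S}^T{\bf X}\) and then \({\bf Y}{\bf Y}^T\), which should come out (nearly) diagonal:

public class PcaCheck {
    public static void main(String[] args) {
        double[][] X = {     // centered data (4 x 6), rounded
            {-0.783, -1.183, 1.117, 0.517, 0.417, -0.083},
            { 0.317,  0.017, 0.017, 0.017, 0.117, -0.483},
            {-2.433, -2.533, 0.867, 0.667, 2.167,  1.267},
            {-1.083, -1.083, 0.117, 0.217, 1.217,  0.617}
        };
        double[][] S = {     // eigenvectors of X X^T as columns (4 x 4), rounded
            { 0.314, -0.897, -0.103, -0.294},
            {-0.047, -0.220,  0.928,  0.296},
            { 0.864,  0.155, -0.071,  0.475},
            { 0.392,  0.351,  0.350, -0.775}
        };

        // New coordinates: Y = S^T X  (4 x 6).
        double[][] Y = new double[4][6];
        for (int i = 0; i < 4; i++)
            for (int j = 0; j < 6; j++)
                for (int k = 0; k < 4; k++)
                    Y[i][j] += S[k][i] * X[k][j];

        // Covariance (without the 1/n) of the rotated data: Y Y^T.
        for (int i = 0; i < 4; i++) {
            for (int j = 0; j < 4; j++) {
                double c = 0;
                for (int k = 0; k < 6; k++) c += Y[i][k] * Y[j][k];
                System.out.printf("%8.3f", c);
            }
            System.out.println();
        }
        // The off-diagonal entries come out near zero; the diagonal is roughly
        // 26.5, 1.35, 0.26, 0 -- matching the Y Y^T matrix shown earlier.
    }
}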
Let's now turn to a different application of the SVD, the last
of our canonical examples (from Module 1).
13.8
Using SVD to predict missing data (the Netflix problem)
Consider a matrix where the rows are users and columns are movies:
- The rows reflect users, the columns are movies.
- Some entries are missing, indicated with a ?
- We would like to predict these entries as part of developing a
recommendation engine ("if you like X, we think you'll like Y"),
by using the predicted missing entries.
- In real world data, the ratings matrix is typically
very sparse (most entries are missing).
- Many years ago, Netflix announced a million-dollar prize
for anyone who could reasonably predict movie ratings. Most top
performers used some aspects of the SVD.
Let's start with the most direct approach:
- Key ideas:
- Temporarily fill in some values into the missing entries.
- Compute the SVD.
- Find the best low-rank approximation for the actual entries.
- Pseudocode:
Algorithm: predictMissingRatings (X)
Input: a ratings matrix X with missing entries
1. X' = temporarilyReplaceMissingEntries (X)
2. computeSVD (X')
3. Σ[m][m] = 0
3. for k=m-1 to 1
// Try the approximation with Σ[k][k] = ... = Σ[m][m] = 0
4. Σ[k][k] = 0
5. Y = U Σ VT
// Compute diff only on non-missing entries of X
6. if diff(X,Y) is least so far
7. bestY = Y
8. endif
9. endfor
10. return bestY
Note:
- What are the temporary values we could use to replace the
missing entries?
- One option is to use 0.
- Another is to use the average movie rating (in that column).
- Intuition:
- Suppose there is some "true" matrix of all ratings (the one
we are trying to guess).
- Here, the hope is that a low-rank approximation will "capture"
the hidden matrix.
- Remember, by leaving out some singular values (setting
diagonal entries in \({\bf \Sigma}\) to be zero), we can
construct the low-rank approximations.
- The loop merely tries out each such low rank approximation
against the known entries to find the best one.
In-Class Exercise 18:
Download
MovieRatingsSVD.java
and
SVDStats.class.
Compile and execute, then examine the code and output.
Which of the two "temporary value" approaches (zero, or average-movie
rating) do you think is better?
What to do with large matrices:
- Unfortunately, large matrices where most entries are missing
(this was the Netflix data) cause two problems:
- It is difficult to compute the SVD.
- It's empirically known that naive approaches to temporarily
filling in values do not produce good results.
- One of the top performers in the competition used a simpler,
more direct approach, which we'll describe here.
- Let's return to the SVD for a moment:
$$
{\bf X} \eql {\bf U \Sigma V}^T
$$
- Suppose we multiply the latter two matrices and write this as:
$$
{\bf X} \eql {\bf U W}
$$
where \({\bf W} = {\bf \Sigma V}^T\).
- Suppose \({\bf r}_i\)
is the i-th row of \({\bf U}\),
and \({\bf w}_j\)
is the j-th column of \({\bf W}\).
- Then, observe that the i-j-th entry in \({\bf X}\) can be
written as:
$$
{\bf X}_{ij} \eql {\bf r}_i \cdot {\bf w}_j
$$
That is, it is the dot product of row \(i\) in \({\bf U}\)
and column \(j\) in \({\bf W}\).
- This alternative approach asks the following question: why
not treat \({\bf r}_i\) and \({\bf w}_j\) as vector variables
and optimize them directly, without calculating the SVD?
- Suppose we "guessed" values for
\({\bf r}_i\) and \({\bf w}_j\), we could then compute
$$
{\bf Y}_{ij} \eql {\bf r}_i \cdot {\bf w}_j
$$
for all \(i,j\) and use the resulting matrix as the predicted set of
ratings \({\bf Y}\).
- The general, high-level idea is this:
Algorithm: predictMissingRatings (X)
Input: a ratings matrix X with missing entries
1. for all i,j
2. define vector variables ri and wj
3. for many iterations
4. Try different ri and wj
5. Compute Y
6. Compute error(X, Y)
7. Track the best Y
8. endfor
9. return best Y
- Of course, rather than try all kinds of
\({\bf r}_i\) and \({\bf w}_j\), we could be smarter about
iterating.
- The smarter way to iterate in an optimization
is to use gradients:
- Consider what this would look like for a simple
function \(f(x)\) with one variable.
- The goal is to find the value of \(x\) that minimizes \(f(x)\).
- If we start at some "guess" value for \(x\), call it \(x_0\)
- Now iterate as follows:
$$
x_{k+1} \; \leftarrow \; x_{k} - \alpha f^\prime(x_{k})
$$
- What this means:
- At each step, evaluate the gradient with the current
value \(x_{k}\).
- Use it to adjust \(x_{k}\) to change in the opposite
direction (hence the negative sign) of the gradient.
- Opposite, because we want to minimize.
- Why does this work?
- Think of \(f(x)\) as bowl-shaped (the minimum is at the
lowest part of the bowl).
- If the gradient is positive, we are on the upward slope
and need to reduce \(x\) to get to the nadir.
- If the gradient is negative, we need to increase \(x\).
- To learn more about gradient iteration,
see Module
10 of Continuous Algorithms (Section 10.5).
- Now, back to our problem:
consider the i-j-th known value and the current
value of \({\bf r}_i \cdot {\bf w}_j\) and the error
$$
E_{ij} \eql ({\bf X}_{ij} - {\bf r}_i \cdot {\bf w}_j)^2
$$
- We will compute gradients of this and use the gradients
to adjust \({\bf r}_i\) and \({\bf w}_j\).
- In this case, it turns out (from calculus):
$$\eqb{
\frac{\partial E_{ij}}{\partial {\bf r}_i}
& \eql & - 2 {\bf w}_j ({\bf X}_{ij} - {\bf r}_i \cdot {\bf w}_j)\\
\frac{\partial E_{ij}}{\partial {\bf w}_j}
& \eql & - 2 {\bf r}_i ({\bf X}_{ij} - {\bf r}_i \cdot {\bf w}_j)\\
}$$
- So, at each step we update the variables as:
$$\eqb{
{\bf r}_i
& \eql & {\bf r}_i + 2 {\bf w}_j ({\bf X}_{ij} - {\bf r}_i \cdot {\bf w}_j)\\
{\bf w}_j
& \eql & {\bf w}_j + 2 {\bf r}_i ({\bf X}_{ij} - {\bf r}_i \cdot {\bf w}_j)\\
}$$
- Thus, in pseudocode:
Algorithm: predictMissingRatings (X)
Input: a ratings matrix X with missing entries
1.   for all i,j
2.       define vector variables ri and wj
3.   for many iterations (until converged)
4.       for all non-missing entries Xij
5.           ri = ri + 2wj(Xij - ri·wj)
6.           wj = wj + 2ri(Xij - ri·wj)
7.       endfor
8.   endfor
9.   for all missing i,j
10.      Xij = ri·wj
11.  return X
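Here is a small self-contained Java sketch of this iteration. It is not the provided MovieRatingsIterativeSVD.java: the toy ratings matrix, the latent dimension d, the step size alpha, and the use of NaN to mark missing entries are all choices made just for illustration:

import java.util.Random;

public class MatrixCompletionSketch {
    public static void main(String[] args) {
        // Toy ratings matrix; NaN marks a missing entry.
        double[][] X = {
            {5, 4, Double.NaN, 1},
            {4, Double.NaN, 1, 1},
            {1, 1, 5, Double.NaN},
            {Double.NaN, 1, 4, 5}
        };
        int m = X.length, n = X[0].length, d = 2;
        double alpha = 0.01;                       // step size for the gradient updates
        Random rand = new Random(0);

        // r[i] is the i-th row of U, w[j] is the j-th column of W, both of length d.
        double[][] r = new double[m][d], w = new double[n][d];
        for (int i = 0; i < m; i++) for (int k = 0; k < d; k++) r[i][k] = rand.nextDouble();
        for (int j = 0; j < n; j++) for (int k = 0; k < d; k++) w[j][k] = rand.nextDouble();

        // Gradient iterations over the known entries only.
        for (int iter = 0; iter < 5000; iter++) {
            for (int i = 0; i < m; i++) {
                for (int j = 0; j < n; j++) {
                    if (Double.isNaN(X[i][j])) continue;
                    double err = X[i][j] - dot(r[i], w[j]);
                    for (int k = 0; k < d; k++) {
                        double ri = r[i][k], wj = w[j][k];
                        r[i][k] = ri + alpha * 2 * wj * err;
                        w[j][k] = wj + alpha * 2 * ri * err;
                    }
                }
            }
        }

        // Predicted ratings: Y_ij = r_i . w_j, including the missing spots.
        for (int i = 0; i < m; i++) {
            for (int j = 0; j < n; j++) System.out.printf("%6.2f", dot(r[i], w[j]));
            System.out.println();
        }
    }

    static double dot(double[] a, double[] b) {
        double s = 0;
        for (int k = 0; k < a.length; k++) s += a[k] * b[k];
        return s;
    }
}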
In-Class Exercise 19:
Download
MovieRatingsIterativeSVD.java,
compile and execute, then examine the code and output.
How does the output compare to the regular SVD?