Matrix Factorization

First, set any unknown or missing entries of the matrix to zero.

Matrix factorization proceeds as follows (a minimal vectorized sketch of these steps appears after the list):

  1. Initialize two random matrices a and b with dimensions m by j and j by n respectively, so that their product has the same dimensions as the original matrix z (which is m by n).
  2. Multiply a by b to obtain an estimate of z.
  3. Subtract the estimate from z at the known values of z (or apply some other loss function) to evaluate how far the estimate is from the real matrix.
  4. Use the gradient descent formulas to adjust each of the values in a and b in the right direction.
  5. Repeat steps 2 to 4 until the error reaches an acceptably small value.
  6. By multiplying a by b, we now have an estimate of z that not only closely matches the known values of z, but also provides estimates for the unknown values.
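
Before moving to the per-entry version used in the rest of this post, here is a minimal vectorized sketch of those six steps. It assumes a tiny example matrix z with zeros marking the unknown entries; the matrix values, learning rate, and iteration count are illustrative only, not tuned.

# A minimal vectorized sketch of the six steps above (illustrative values throughout).
import numpy

z = numpy.array([[5.0, 3.0, 0.0],
                 [4.0, 0.0, 1.0]])                # 0 marks an unknown value
m, n = z.shape
j = 2                                             # number of latent features
a = numpy.random.rand(m, j)                       # step 1: random m x j matrix
b = numpy.random.rand(j, n)                       # step 1: random j x n matrix
mask = z > 0                                      # only known entries count toward the error

lr = 0.01                                         # learning rate
for _ in range(5000):                             # step 5: repeat steps 2-4
    estimate = a.dot(b)                           # step 2: multiply a by b
    error = numpy.where(mask, z - estimate, 0.0)  # step 3: error on the known values only
    a += lr * 2 * error.dot(b.T)                  # step 4: gradient step for a
    b += lr * 2 * a.T.dot(error)                  # step 4: gradient step for b

print(a.dot(b))                                   # step 6: estimates for known and unknown cells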

In the code that follows,

  • R is a matrix holding the true values (with unknown values marked as 0).
  • P and Q are the two matrices that, when multiplied, form an estimate for R.
  • K is the number of latent (intrinsic) features, i.e. the number of columns of P and rows of Q. That is:
    • P has dimensions M by K,
    • Q has dimensions K by N,
    • and R has dimensions M by N.

Since Q is passed into the function with the same orientation as P (K columns of features), taking the transpose of Q makes it multipliable by P.
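
As a quick shape check (the sizes here are arbitrary examples), a matrix with K feature columns multiplied by the transpose of another matrix with K feature columns produces an estimate with R's shape:

import numpy
P = numpy.random.rand(5, 2)         # M x K
Q = numpy.random.rand(4, 2)         # N x K, same orientation as P
print(numpy.dot(P, Q.T).shape)      # (5, 4), i.e. M x N, matching R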

if R[i][j] > 0:
    eij = R[i][j] - numpy.dot(P[i,:],Q[:,j])

The if check ensures that the value is not unknown (remember that missing values were set to 0). If the value is known, the error eij is computed as the difference between the true value and the prediction, which is the dot product of the ith row of P and the jth column of Q.

Update Step

for k in range(K):
    P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
    Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])

For each latent feature k (this is how the individual cell values of P and Q are updated), the corresponding entries of P and Q are adjusted according to a (regularized) gradient descent formula.
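
These update terms come from differentiating a regularized squared error for a single known entry. Writing L_ij for that per-entry loss (alpha and beta match the learning rate and regularization weight in the code), in LaTeX notation:

e_{ij} = r_{ij} - \sum_{k=1}^{K} p_{ik} q_{kj}, \qquad
L_{ij} = e_{ij}^2 + \frac{\beta}{2} \sum_{k=1}^{K} \left( p_{ik}^2 + q_{kj}^2 \right)

\frac{\partial L_{ij}}{\partial p_{ik}} = -2 e_{ij} q_{kj} + \beta p_{ik}, \qquad
\frac{\partial L_{ij}}{\partial q_{kj}} = -2 e_{ij} p_{ik} + \beta q_{kj}

p_{ik} \leftarrow p_{ik} + \alpha \left( 2 e_{ij} q_{kj} - \beta p_{ik} \right), \qquad
q_{kj} \leftarrow q_{kj} + \alpha \left( 2 e_{ij} p_{ik} - \beta q_{kj} \right)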

MSE

e = 0
for i in range(len(R)):
    for j in range(len(R[i])):
        if R[i][j] > 0:
            e = e + pow(R[i][j] - numpy.dot(P[i,:],Q[:,j]), 2)
            for k in range(K):
                e = e + (beta/2) * (pow(P[i][k],2) + pow(Q[k][j],2))

This accumulates the squared error for every known value, plus the regularization term, into the variable e. Strictly speaking it is a regularized sum of squared errors; dividing by the number of known entries would give the mean squared error.
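
For reference, the same quantity can be computed without the inner loops. This is only an illustrative sketch (the helper name regularized_sse is not from the original code); it assumes, as in the loop above, that Q has already been transposed to K by N and that zeros in R mark unknown entries:

import numpy

def regularized_sse(R, P, Q, beta):
    # Vectorized equivalent of the error loop above (hypothetical helper).
    mask = R > 0
    err = R - numpy.dot(P, Q)                     # prediction error for every cell
    squared = (err[mask] ** 2).sum()              # squared error over the known cells only
    reg = (beta / 2) * ((mask.sum(axis=1) * (P ** 2).sum(axis=1)).sum()
                        + (mask.sum(axis=0) * (Q ** 2).sum(axis=0)).sum())
    return squared + reg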

Complete Code

import numpy
import progressbar as pb   # pip install progressbar2
def matrix_factorization(R, P, Q, K, steps=10000, alpha=0.0002, beta=0.02):
    # R holds the known values (0 = unknown); P has one row of K features per row of R,
    # Q has one row of K features per column of R. alpha is the learning rate and
    # beta is the regularization weight.
    Q = Q.T  # transpose Q so that numpy.dot(P, Q) has the same shape as R
    for step in pb.progressbar(range(steps)):
        # gradient descent pass over every known entry of R
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    eij = R[i][j] - numpy.dot(P[i,:],Q[:,j])
                    for k in range(K):
                        P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
                        Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])
        # regularized squared error over the known entries
        e = 0
        for i in range(len(R)):
            for j in range(len(R[i])):
                if R[i][j] > 0:
                    e = e + pow(R[i][j] - numpy.dot(P[i,:],Q[:,j]), 2)
                    for k in range(K):
                        e = e + (beta/2) * (pow(P[i][k],2) + pow(Q[k][j],2))
        if e < 0.001:  # stop early once the error is small enough
            break
    return P, Q.T

Test Case

R = [
 [4,3,0,1],
 [4,0,0,1],
 [1,1,0,5],
 [1,0,0,4],
 [0,1,5,4],
]

R = numpy.array(R)
N = len(R)      # number of rows in R
M = len(R[0])   # number of columns in R
K = 2           # number of latent features

P = numpy.random.rand(N,K)   # one row of latent features per row of R
Q = numpy.random.rand(M,K)   # one row of latent features per column of R

nP, nQ = matrix_factorization(R, P, Q, K)
nR = numpy.dot(nP, nQ.T)     # reconstructed matrix with estimates for the zero entries
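
The reconstruction nR should now closely match the non-zero entries of R while filling in predictions for the zeros; for example, you can inspect it with:

print(numpy.round(nR, 2))   # known cells should be close to R; zeros are now predicted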
