Matrix Factorization
December 19, 2022•620 words
Matrix Factorization
set unknown/missing number to zero
Matrix factorization follows the following:
- Initialize two random matrices a and b with dimensions m by j and j by n such that when multiplied, their dimension matches the original matrix z (that has dimensions m by n).
- Multiply a by b to achieve an estimate for z.
- Subtract z from y for the known values of z, or some other loss function, to evaluate how far off the estimate is from the real matrix.
- Use gradient descent formulas to adjust each of the values in a and b in the right direction.
- Repeat steps 2 to 4 repeatedly until the error has reached a reasonable value.
- By multiplying a by b, we now have an estimate for z that not only closely matches the known values of z, but also provides an estimate for the unknown values.
In this case,
R
is a matrix holding the true values (with unknown values marked as 0).P
andQ
are the two matrices that, when multiplied, form an estimate for R.K
represents intrinsic number of features- the columns of P and the rows of Q — that is,
- P is of dimensions M by K
- Q of dimensions K by N
- where R is of dimensions M by N.
- the columns of P and the rows of Q — that is,
Since Q will be the same dimension as P when entered, by taking the transpose of Q it is multipliable by P.
if R[i][j] > 0:
eij = R[i][j] - numpy.dot(P[i,:],Q[:,j])
The first line ensures that the value is not unknown (remember that we set missing values to 0). If it is not unknown, a variable eij is achieved by taking the difference between the true value and the prediction (achieved by multiplying the ith row of P
by the jth row of Q
).
Perform update
for k in range(K):
P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])
For each value of K (this helps update individual cell values in P and Q), the corresponding values of P and Q are updated according to a (simplified) gradient descent formula.
MSE
e = 0
for i in range(len(R)):
for j in range(len(R[i])):
if R[i][j] > 0:
e = e + pow(R[i][j] - numpy.dot(P[i,:],Q[:,j]), 2)
for k in range(K):
e = e + (beta/2) * (pow(P[i][k],2) + pow(Q[k][j],2))
This calculates the mean squared error for each value and saves it to an error variable e
.
Complete Code
import numpy
import numpy as np
import pandas as pd
import progressbar as pb
def matrix_factorization(R, P, Q, K, steps=10000, alpha=0.0002, beta=0.02):
Q = Q.T
for step in pb.progressbar(range(steps)):
for i in range(len(R)):
for j in range(len(R[i])):
if R[i][j] > 0:
eij = R[i][j] - numpy.dot(P[i,:],Q[:,j])
for k in range(K):
P[i][k] = P[i][k] + alpha * (2 * eij * Q[k][j] - beta * P[i][k])
Q[k][j] = Q[k][j] + alpha * (2 * eij * P[i][k] - beta * Q[k][j])
eR = numpy.dot(P,Q)
e = 0
for i in range(len(R)):
for j in range(len(R[i])):
if R[i][j] > 0:
e = e + pow(R[i][j] - numpy.dot(P[i,:],Q[:,j]), 2)
for k in range(K):
e = e + (beta/2) * (pow(P[i][k],2) + pow(Q[k][j],2))
if e < 0.001:
break
return P, Q.T
Test Case
R = [
[4,3,0,1],
[4,0,0,1],
[1,1,0,5],
[1,0,0,4],
[0,1,5,4],
]
R = numpy.array(R)
N = len(R)
M = len(R[0])
K = 2
P = numpy.random.rand(N,K)
Q = numpy.random.rand(M,K)
nP, nQ = matrix_factorization(R, P, Q, K)
nR = numpy.dot(nP, nQ.T)