Probabilistic Models for Matrix Completion Problems


Arindam Banerjee
banerjee@cs.umn.edu
Dept of Computer Science & Engineering
University of Minnesota, Twin Cities

March 11, 2011

Recommendation Systems


Figure: a movie ratings matrix (users × movies), with side information attached to both users and movies.
User attributes (example): Age: 28; Gender: Male; Job: Salesman; Interest: Travel.
Movie attributes (example): Title: Gone with the Wind; Release year: 1939; Cast: Vivien Leigh, Clark Gable; Genre: War, Romance; Awards: 8 Oscars; Keywords: Love, Civil War.

Advertisements on the Web



Figure: a Click-Through-Rate (CTR) matrix (webpages × products), with entries such as 1%, 2%, 0.01%, 0.1%, ... and many entries missing.
Product attributes (example): Category: Sports shoes; Brand: Nike; Ratings: 4.2/5.
Webpage attributes (example): Category: Baby; URL: babyearth.com; Content: webpage text; Hyperlinks.



Forest Ecology



Figure: a plant trait matrix (plants × traits), with trait columns Leaf(N), Leaf(P), SLA, Leaf Size, ..., Wood density, and many missing entries.
Plant Trait Matrix (TRY db) (Jens Kattge, Peter Reich, et al.)

The Main Idea


Overview


- Graphical Models
  - Bayesian Networks
  - Inference
- Probabilistic Co-clustering
  - Structure: Simultaneous Row-Column Clustering
  - Bayesian models, Inference
- Probabilistic Matrix Factorization
  - Structure: Low Rank Factorization
  - Bayesian models, Inference



Graphical Models: What and Why




- Statistical Machine Learning
  - Build diagnostic/predictive models from data
  - Uncertainty quantification based on (minimal) assumptions
- The i.i.d. assumption
  - Data are independently and identically distributed
  - Example: words in a document drawn i.i.d. from the dictionary
- Graphical models
  - Assume (graphical) dependencies between (random) variables
  - Closer to reality; domain knowledge can be captured
  - Learning/inference is much more difficult



Flavors of Graphical Models


- Basic nomenclature
  - Node = random variable, may be observed or hidden
  - Edge = statistical dependency
  - Two popular flavors: ‘Directed’ and ‘Undirected’
- Directed graphs
  - A directed graph between random variables, causal dependencies
  - Examples: Bayesian networks, Hidden Markov Models
  - Joint distribution is a product of P(child | parents)
- Undirected graphs
  - An undirected graph between random variables
  - Examples: Markov/Conditional random fields
  - Joint distribution in terms of potential functions

Figure: an example graph over random variables X1, X2, X3, X4, X5.


Bayesian Networks








Joint distribution in terms of P(X | Parents(X))

Figure: an example Bayesian network over X1, ..., X5.
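Spelled out for a network over X1, ..., Xn (this is just the line above written as an equation):

    P(X1, ..., Xn) = ∏_i P(Xi | Parents(Xi))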


Example I: Burglary Network


Example II: Rain Network


Example III: Car Problem Diagnosis


Latent Variable Models













- Bayesian network with hidden variables
- Semantically more accurate, fewer parameters
- Example: compute the probability of heart disease



Inference









- Some variables in the Bayes net are observed
  - The evidence/data, e.g., John has not called, Mary has called
- Inference
  - How to compute the value/probability of the other variables
  - Example: what is the probability of Burglary, i.e., P(b | ¬j, m)? (see the sketch below)
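As a concrete illustration, here is exact inference by enumeration for the classic burglary network (Burglary, Earthquake → Alarm → JohnCalls, MaryCalls). The conditional probability values below are the usual textbook numbers, not taken from these slides, and the code is only a minimal sketch:

    import itertools

    P_B = {True: 0.001, False: 0.999}                      # P(Burglary)
    P_E = {True: 0.002, False: 0.998}                      # P(Earthquake)
    P_A = {(True, True): 0.95, (True, False): 0.94,        # P(Alarm=True | b, e)
           (False, True): 0.29, (False, False): 0.001}
    P_J = {True: 0.90, False: 0.05}                        # P(JohnCalls=True | a)
    P_M = {True: 0.70, False: 0.01}                        # P(MaryCalls=True | a)

    def joint(b, e, a, j, m):
        # P(b, e, a, j, m) as a product of P(child | parents)
        p = P_B[b] * P_E[e]
        p *= P_A[(b, e)] if a else 1.0 - P_A[(b, e)]
        p *= P_J[a] if j else 1.0 - P_J[a]
        p *= P_M[a] if m else 1.0 - P_M[a]
        return p

    def posterior_burglary(j, m):
        # P(Burglary | JohnCalls=j, MaryCalls=m), summing out e and a
        post = {b: sum(joint(b, e, a, j, m)
                       for e, a in itertools.product((True, False), repeat=2))
                for b in (True, False)}
        return post[True] / (post[True] + post[False])

    print(posterior_burglary(j=False, m=True))             # P(b | ¬j, m)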


Inference Algorithms


- Graphs without loops
  - Efficient exact inference algorithms are possible
  - Sum-product algorithm, and its special cases
    - Belief propagation in Bayes nets
    - Forward-Backward algorithm in Hidden Markov Models (HMMs); see the sketch after this list
- Graphs with loops
  - Junction tree algorithms
    - Convert into a graph without loops
    - May lead to an exponentially large graph, an inefficient algorithm
  - Sum-product algorithm, disregarding loops
    - Active research topic; correct convergence ‘not guaranteed’
    - Works well in practice, e.g., turbo codes
  - Approximate inference
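A minimal forward-backward sketch for a small discrete HMM (the sum-product recursions on a chain). The two-state model and the observation sequence below are made up purely for illustration:

    import numpy as np

    pi = np.array([0.6, 0.4])                 # initial state distribution
    A  = np.array([[0.7, 0.3], [0.4, 0.6]])   # A[i, j] = P(z_{t+1}=j | z_t=i)
    B  = np.array([[0.9, 0.1], [0.2, 0.8]])   # B[i, k] = P(x_t=k | z_t=i)
    obs = [0, 1, 1, 0]                        # example observation sequence

    T, K = len(obs), len(pi)
    alpha = np.zeros((T, K))
    beta  = np.zeros((T, K))

    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):                     # forward pass
        alpha[t] = (alpha[t-1] @ A) * B[:, obs[t]]

    beta[T-1] = 1.0
    for t in range(T-2, -1, -1):              # backward pass
        beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])

    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True) # P(z_t | x_{1:T}) for each t
    print(gamma)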



Approximate Inference


- Variational Inference
  - Deterministic approximation
  - Approximate the complex true distribution/domain
  - Replace it with a family of simple distributions/domains
  - Use the best approximation in the family
  - Examples: mean-field (see the note after this list), Expectation Propagation
- Stochastic Inference
  - Simple sampling approaches
  - Markov Chain Monte Carlo (MCMC) methods
    - A powerful family of methods
  - Gibbs sampling
    - A useful special case of MCMC methods
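A one-line statement of the mean-field idea (standard material, stated in generic notation rather than the slides' own): restrict the approximating distribution to a fully factorized form and update one factor at a time,

    q(z_1, ..., z_n) = ∏_i q_i(z_i)
    log q_i(z_i) = E_{q_{-i}}[ log p(x, z) ] + const   (coordinate-ascent update)

where the expectation is taken over all factors except q_i.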


Overview


- Graphical Models
  - Bayesian Networks
  - Inference
- Probabilistic Co-clustering
  - Structure: Simultaneous Row-Column Clustering
  - Bayesian models, Inference
- Probabilistic Matrix Factorization
  - Structure: Low Rank Factorization
  - Bayesian models, Inference



Example: Gene Expression Analysis

Figure: a gene expression matrix, original and co-clustered.


Co-clustering and Matrix Approximation


Probabilistic Co-clustering

Row clusters:

Column clusters:






Generative Process

- Assume a mixed membership for each row and column
- Assume a Gaussian for each co-cluster

1. Pick row/column clusters
2. Generate each entry of the matrix


Bayesian Co-clustering (BCC)

- A Dirichlet distribution over all possible mixed memberships
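A sketch of the corresponding generative process, in the spirit of the Bayesian co-clustering model (Shan and Banerjee, 2008); the symbols α1, α2, π1, π2, z1, z2 below are illustrative notation rather than copied from the slides:

    For each row u:      π1_u ~ Dirichlet(α1)        (mixed membership over row clusters)
    For each column v:   π2_v ~ Dirichlet(α2)        (mixed membership over column clusters)
    For each entry (u, v):
        z1 ~ Discrete(π1_u),   z2 ~ Discrete(π2_v)
        X_uv ~ N(µ_{z1,z2}, σ²_{z1,z2})              (the Gaussian of co-cluster (z1, z2))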


Background: Plate Diagrams

Figure: a plate around node b with count 3, child of a, is shorthand for the unrolled network in which b1, b2, b3 all depend on a.

Compact representation of large Bayesian networks


Bayesian Co-clustering (BCC)


Recall: The Inference Problem

What is P(b | ¬j, m)?


Bayesian Co-clustering (BCC)


Learning: Inference and Estimation


- Learning
  - Estimate model parameters
  - Infer ‘mixed memberships’ of individual rows and columns
  - Expectation Maximization (EM)
- Issues
  - The posterior probability cannot be obtained in closed form
  - Parameter estimation cannot be done directly
- Approach: variational inference






Variational Inference

- Introduce a variational distribution to approximate the (intractable) true posterior
- Use Jensen's inequality to get a tractable lower bound
- Maximize the lower bound w.r.t. the variational distribution
  - Equivalently, minimize the KL divergence between the variational distribution and the true posterior
- Maximize the lower bound w.r.t. the model parameters
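In symbols (a standard statement of the bound, in generic notation; q is the variational distribution over the latent variables z, and Θ the model parameters):

    log p(X | Θ) = log Σ_z q(z) · p(X, z | Θ) / q(z)
                 ≥ Σ_z q(z) log[ p(X, z | Θ) / q(z) ]          (Jensen's inequality)

    log p(X | Θ) - (lower bound) = KL( q(z) || p(z | X, Θ) )

so maximizing the bound over q is the same as minimizing the KL divergence to the true posterior, and maximizing it over Θ gives the M-step of variational EM.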


Variational EM for BCC

- The lower bound of the log-likelihood is maximized alternately over the variational distribution (E-step) and the model parameters (M-step)


Residual Bayesian Co-clustering (RBC)

- (z1, z2) determines the distribution
- Users/movies may have bias
  - (m1, m2): row/column means
  - (bm1, bm2): row/column bias


Results: Datasets


- Movielens: movie recommendation data
  - 100,000 ratings (1-5) for 1682 movies by 943 users (6.3% of entries observed)
  - 1 million ratings for 3900 movies by 6040 users (4.2% of entries observed)
- Foodmart: transaction data
  - 164,558 sales records for 7803 customers and 1559 products (1.35% of entries observed)
- Jester: joke rating data
  - 100,000 ratings (-10.00 to +10.00) for 100 jokes from 1000 users (100% of entries observed)


BCC, RBC vs. Co-clustering Algorithms

Figure: results on Jester.

- BCC and RBC have the best performance
- RBC and RBC-FF perform better than BCC


RBC vs. Other Co-clustering Algorithms

Figures: results on Foodmart and Movielens.


RBC vs. SVD, NNMF, and CORR

Figure: results on Jester.

- RBC and RBC-FF are competitive with the other algorithms


RBC vs. SVD, NNMF, and CORR

Figures: results on Movielens and Foodmart.


SVD vs. Parallel RBC

Parallel RBC scales well to large matrices


Co-embedding: Users


Co-embedding: Movies


Overview


- Graphical Models
  - Bayesian Networks
  - Inference
- Probabilistic Co-clustering
  - Structure: Simultaneous Row-Column Clustering
  - Bayesian models, Inference
- Probabilistic Matrix Factorization
  - Structure: Low Rank Factorization
  - Bayesian models, Inference


Matrix Factorization


- Singular value decomposition: X = UΣV^T
- Problems (see the sketch below)
  - Large matrices, with millions of rows/columns
    - SVD can be rather slow
  - Sparse matrices, most entries are missing
    - Traditional approaches cannot handle missing entries
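A toy illustration of the first point: the classical best rank-k approximation via SVD, which assumes a fully observed matrix (the random matrix and the choice of k below are arbitrary):

    import numpy as np

    X = np.random.rand(200, 150)                 # a small, fully observed matrix
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    k = 10
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # best rank-k approximation (Eckart-Young)
    print(np.linalg.norm(X - X_k))               # Frobenius reconstruction error

With missing entries there is no complete matrix to hand to the SVD in the first place, which motivates the factorization models below.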














Matrix Factorization: “Funk SVD”

- Model X ϵ R^{n×m} as UV^T, where U ϵ R^{n×k} and V ϵ R^{m×k}
- Alternately optimize U and V



- Prediction: X̂_ij = u_i^T v_j
- error = (X_ij - X̂_ij)² = (X_ij - u_i^T v_j)²
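A minimal stochastic-gradient sketch of this update rule: for each observed entry (i, j), nudge u_i and v_j to reduce (X_ij - u_i^T v_j)². The synthetic data, rank k, learning rate, and regularization strength below are made up for illustration:

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, k = 50, 40, 5
    # a list of observed entries (i, j, value); here just random noise
    observed = [(rng.integers(n), rng.integers(m), rng.normal()) for _ in range(500)]

    U = 0.1 * rng.standard_normal((n, k))
    V = 0.1 * rng.standard_normal((m, k))
    lr, reg = 0.02, 0.05

    for epoch in range(20):
        for i, j, x in observed:
            err = x - U[i] @ V[j]              # X_ij - u_i^T v_j
            ui = U[i].copy()
            U[i] += lr * (err * V[j] - reg * U[i])
            V[j] += lr * (err * ui - reg * V[j])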

Probabilistic Matrix Factorization (PMF)



- Likelihood: X_ij ~ N(u_i^T v_j, σ²)
- Priors: u_i ~ N(0, σ_u² I), v_j ~ N(0, σ_v² I)
- Inference using gradient descent
- R. Salakhutdinov and A. Mnih, NIPS 2007
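Taking logs of this model, MAP estimation of U and V reduces to regularized squared-error minimization over the observed entries (a standard observation about PMF, written here in the slides' notation):

    - log p(U, V | X) = (1/2σ²) Σ_{observed (i,j)} (X_ij - u_i^T v_j)²
                        + (1/2σ_u²) Σ_i ||u_i||² + (1/2σ_v²) Σ_j ||v_j||² + const

which is the “Funk SVD” objective with weight-decay regularization, so the SGD sketch above also applies here.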


Bayesian Probabilistic Matrix Factorization



- Likelihood: R_ij ~ N(u_i^T v_j, σ²)
- Priors: u_i ~ N(µ_u, Λ_u), v_j ~ N(µ_v, Λ_v)
- Gaussian-Wishart hyperpriors: µ_u ~ N(µ_0, Λ_u), Λ_u ~ W(ν_0, W_0); µ_v ~ N(µ_0, Λ_v), Λ_v ~ W(ν_0, W_0)
- Inference using MCMC
- R. Salakhutdinov and A. Mnih, ICML 2008
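With these conjugate Gaussian-Wishart hyperpriors (reading Λ as a precision matrix, as in the BPMF paper), the conditional posterior of each u_i given everything else is Gaussian, which is what makes Gibbs sampling convenient; a sketch of the standard update:

    u_i | R, V, µ_u, Λ_u ~ N(µ_i*, (Λ_i*)⁻¹), with
    Λ_i* = Λ_u + (1/σ²) Σ_{j : R_ij observed} v_j v_j^T
    µ_i* = (Λ_i*)⁻¹ [ Λ_u µ_u + (1/σ²) Σ_{j : R_ij observed} R_ij v_j ]

and symmetrically for each v_j; the hyperparameters (µ_u, Λ_u) are then resampled from their Gaussian-Wishart conditional.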


Parametric PMF (PPMF)


Are the priors used in PMF and BPMF suitable?



- PMF (diagonal covariance): u_i ~ N(0, σ_u² I), v_j ~ N(0, σ_v² I)
- BPMF (full covariance, with “hyperprior”): u_i ~ N(µ_u, Λ_u), v_j ~ N(µ_v, Λ_v)
- Parametric PMF (PPMF) (full covariance, but no “hyperprior”): u_i ~ N(µ_u, Λ_u), v_j ~ N(µ_v, Λ_v)


PPMF with Mixture Models (MPMF)


What if the row (column) items belong to several groups?


- Parametric PMF (PPMF): a single Gaussian generates all u_i (or v_j)
- Mixture PMF (MPMF): a mixture of Gaussians, e.g., N_1(µ_1^u, Λ_1^u), N_2(µ_2^u, Λ_2^u), N_3(µ_3^u, Λ_3^u), represents a set of groups
  - Each u_i (or v_j) is generated from one of the Gaussians
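In symbols, a sketch of the MPMF prior on the row factors (the mixture weights π_k and the latent group assignment are illustrative notation, not taken from the slides):

    u_i ~ Σ_k π_k N(µ_k^u, Λ_k^u)

i.e., first draw a group k for row i with probability π_k, then draw u_i from that group's Gaussian; the column factors v_j are handled the same way.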


PMF with Side Information: LDA-MPMF


Can we use side information to improve accuracy?


Figure: users × movies ratings with side information; mixture components N_1(µ_1^u, Λ_1^u), N_2(µ_2^u, Λ_2^u), N_3(µ_3^u, Λ_3^u) generate u_i, and distributions p_1(θ_1^u), p_2(θ_2^u), p_3(θ_3^u) generate the side information.

- LDA-MPMF: u_i and the side information share a membership vector


PMF with Side Information: CTM-PPMF

Figure: users × movies ratings with side information; u_i ~ N(µ_u, Λ_u), and the side information is generated through p_1(θ_1^u), p_2(θ_2^u), p_3(θ_3^u).

- LDA-MPMF: u_i and the side information share a membership vector
- CTM-PPMF: u_i is converted to a membership vector to generate the side information


Residual PPMF


How to consider the row (column) biases?


E.g. famous movies, critical users, etc.


- PPMF (does not consider row/column biases): X_ij ~ N(u_i^T v_j, σ²), with u_i ~ N(µ_u, Λ_u), v_j ~ N(µ_v, Λ_v)
- Residual PPMF: each row has a row bias f_i, each column has a column bias g_j
  - X_ij ~ N(u_i^T v_j + f_i + g_j, σ²)
- Similar residual models for Mixture PMF, LDA-MPMF, and CTM-PPMF

PPMF vs. PMF, BPMF


PPMF mostly achieves higher accuracy

PPMF vs. Co-clustering


PPMF achieves higher accuracy compared to co-clustering

Residual Models


Modeling row/column bias helps

Effect of Side Information


- Side information helps to a certain extent
- Helpfulness ordering: Genre, Cast, Plot

Figure legend: no side information, non-residual models, residual models.

Other Applications



- Topic Modeling and Text Classification
  - Mixed-membership topic models
  - Combination with logistic regression
- Cluster Ensembles
  - Combine multiple clusterings of a dataset
  - Mixed-membership naïve-Bayes models
- Bayesian Kernel Methods
  - Nonlinear covariance using Gaussian Process priors
  - Nonlinear correlated multivariate predictions


Example: Multi-label Classification


- One data point has multiple labels
- Labels are correlated


Example (Aviation Safety reports): "Arrived at [destination] and flaps were selected for approach. We received a SLATS FAIL caution message. We advised approach and they gave us vectors while we completed the QRH (quick reference handbook) procedure. The slats were failed at zero. The FA's (flight attendants) were notified, an emergency declared and we landed uneventfully. Maintenance was performed…"

Labels: Landing problem, Equipment failure, Special procedure, Maintenance (linked by relations such as "cause", "starts", "requires").

Experimental Results


- Classification performance: BMR outperforms the other algorithms
- Topics in class Anomaly.Smoke or Fire

The Main Idea


Conclusions


- Matrix Completion Problems
- The Main Idea
  - Consider a suitable modeling structure
  - Perform averaging over all such models
- Probabilistic Co-clustering
  - Mixed-membership co-clustering for dyadic data
- Probabilistic Factorization
  - Bayesian model over all possible factorizations
- Future Directions
  - Nonlinear, High-dimensional, Dynamic Models
  - Applications: Climate & Environmental Sciences, Healthcare, Finance


References


- A. Banerjee, "On Bayesian Bounds," International Conference on Machine Learning (ICML), 2006.
- A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, D. Modha, "A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation," Journal of Machine Learning Research (JMLR), 2007.
- A. Banerjee and H. Shan, "Latent Dirichlet Conditional Naive Bayes Models," IEEE International Conference on Data Mining (ICDM), 2007.
- H. Shan and A. Banerjee, "Bayesian Co-clustering," IEEE International Conference on Data Mining (ICDM), 2008.
- H. Wang, H. Shan, A. Banerjee, "Bayesian Cluster Ensembles," SIAM International Conference on Data Mining (SDM), 2009.
- H. Shan, A. Banerjee, and N. Oza, "Discriminative Mixed-membership Models," IEEE International Conference on Data Mining (ICDM), 2009.
- H. Shan and A. Banerjee, "Residual Bayesian Co-clustering for Matrix Approximation," SIAM International Conference on Data Mining (SDM), 2010.



References


- H. Shan and A. Banerjee, "Mixed-Membership Naive Bayes Models," Data Mining and Knowledge Discovery (DMKD), 2010.
- H. Shan and A. Banerjee, "Generalized Probabilistic Matrix Factorizations for Collaborative Filtering," IEEE International Conference on Data Mining (ICDM), 2010.
- A. Agovic, H. Shan, and A. Banerjee, "Analyzing Aviation Safety Reports: From Topic Modeling to Scalable Multi-label Classification," Conference on Intelligent Data Understanding (CIDU), 2010.
- A. Agovic and A. Banerjee, "Gaussian Process Topic Models," Conference on Uncertainty in Artificial Intelligence (UAI), 2010.



Acknowledgements

Hanhuai Shan
Amrudin Agovic
