Probabilistic Models for Matrix Completion Problems


Arindam Banerjee (banerjee@cs.umn.edu)

Dept of Computer Science & Engineering

University of Minnesota, Twin Cities

March 11, 2011

Recommendation Systems


[Figure: Movie ratings matrix, with Users as rows and Movies as columns. Each user has attributes (e.g., Age: 28, Gender: Male, Job: Salesman, Interest: Travel) and each movie has attributes (e.g., Title: Gone with the Wind, Release year: 1939, Cast: Vivien Leigh, Clark Gable, Genre: War/Romance, Awards: 8 Oscars, Keywords: Love, Civil War).]


[Figure: Click-Through-Rate matrix, with Webpages as rows and Products as columns; entries are click-through rates (1%, 2%, 0.01%, ...). Each product has attributes (e.g., Category: Sports shoes, Brand: Nike, Ratings: 4.2/5) and each webpage has attributes (e.g., Category: Baby, URL: babyearth.com, Content: webpage text).]

Forest Ecology

[Figure: Plant Trait Matrix from the TRY database (Jens Kattge, Peter Reich, et al.), with Plants as rows and Traits as columns (Leaf(N), Leaf(P), SLA, Leaf Size, ..., Wood density); most entries are missing.]

The Main Idea


Overview

Graphical Models

Bayesian Networks

Inference

Probabilistic Co-clustering

Structure: Simultaneous Row-Column Clustering

Bayesian models, Inference

Probabilistic Matrix Factorization

Structure: Low Rank Factorization

Bayesian models, Inference


Graphical Models: What and Why

Statistical Machine Learning

Build diagnostic/predictive models from data

Uncertainty quantification based on (minimal) assumptions

The I.I.D. assumption

Data is independently and identically distributed

Example: Words in a doc drawn i.i.d. from the dictionary

Graphical models

Assume (graphical) dependencies between (random) variables

Closer to reality, domain knowledge can be captured

Learning/inference is much more difficult


Flavors of Graphical Models

Basic nomenclature

Node = random variable, maybe observed/hidden

Edge = statistical dependency

Two popular flavors: ‘Directed’ and ‘Undirected’

Directed Graphs

A directed graph between random variables, causal dependencies

Example: Bayesian networks, Hidden Markov Models

Joint distribution is a product of P(child | parents)

Undirected Graphs

An undirected graph between random variables

Example: Markov/Conditional random fields

Joint distribution in terms of potential functions

[Figure: an example graphical model over random variables X1, ..., X5.]


Bayesian Networks

Joint distribution in terms of P(X | Parents(X))

[Figure: an example Bayesian network over X1, ..., X5.]


Example I: Burglary Network


Example II: Rain Network


Example III: Car Problem Diagnosis


Latent Variable Models

Bayesian network with hidden variables

Semantically more accurate, fewer parameters

Example: Compute probability of heart disease


Inference

Some variables in the Bayes net are observed

The evidence/data, e.g., John has not called, Mary has called

Inference

How to compute value/probability of other variables

Example: What is the probability of Burglary, i.e., P(b | ¬j, m)?
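As a concrete illustration of such a query, inference by enumeration sums the factored joint over the hidden variables. The sketch below uses the textbook burglary-network CPT values (Russell and Norvig's numbers), which may differ from the slide's figure:

```python
# Inference by enumeration in the classic burglary network.
# CPT values are the textbook ones; treat them as illustrative.
from itertools import product

P_B = {True: 0.001, False: 0.999}                 # Burglary
P_E = {True: 0.002, False: 0.998}                 # Earthquake
P_A = {(True, True): 0.95, (True, False): 0.94,   # Alarm | B, E
       (False, True): 0.29, (False, False): 0.001}
P_J = {True: 0.90, False: 0.05}                   # JohnCalls | Alarm
P_M = {True: 0.70, False: 0.01}                   # MaryCalls | Alarm

def joint(b, e, a, j, m):
    """P(b, e, a, j, m) as a product of P(child | parents)."""
    pa = P_A[(b, e)]
    return (P_B[b] * P_E[e]
            * (pa if a else 1 - pa)
            * (P_J[a] if j else 1 - P_J[a])
            * (P_M[a] if m else 1 - P_M[a]))

def query_burglary(j, m):
    """P(Burglary | JohnCalls=j, MaryCalls=m) by summing out e and a."""
    num = sum(joint(True, e, a, j, m)
              for e, a in product([True, False], repeat=2))
    den = sum(joint(b, e, a, j, m)
              for b, e, a in product([True, False], repeat=3))
    return num / den

print(query_burglary(j=False, m=True))  # P(b | ¬j, m)
```

With these CPTs the well-known query P(b | j, m) comes out near 0.284; the same function answers P(b | ¬j, m) by flipping the evidence.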


Inference Algorithms

Graphs without loops

Efficient exact inference algorithms are possible

Sum-product algorithm, and its special cases

Belief propagation in Bayes nets

Forward-Backward algorithm in Hidden Markov Models (HMMs)

Graphs with loops

Junction tree algorithms

Convert into a graph without loops

May lead to exponentially large graph, inefficient algorithm

Sum-product algorithm, disregarding loops

Active research topic; correct convergence not guaranteed

Works well in practice, e.g., turbo codes

Approximate inference


Approximate Inference

Variational Inference

Deterministic approximation

Approximate complex true distribution/domain

Replace with family of simple distributions/domains

Use the best approximation in the family

Example: Mean-field, Expectation Propagation

Stochastic Inference

Simple sampling approaches

Markov Chain Monte Carlo methods (MCMC)

Powerful family of methods

Gibbs sampling

Useful special case of MCMC methods
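As a minimal illustration of Gibbs sampling (not an example from the talk), consider a bivariate Gaussian with correlation rho: both full conditionals are univariate Gaussians, so the sampler simply alternates exact conditional draws:

```python
# Gibbs sampling for a bivariate Gaussian with unit marginals and
# correlation rho: x1 | x2 ~ N(rho*x2, 1-rho^2), and symmetrically.
# Illustrative sketch; all numbers are made up.
import math
import random

def gibbs_bivariate_gaussian(rho=0.8, n_samples=20000, seed=0):
    rng = random.Random(seed)
    x1, x2 = 0.0, 0.0
    cond_sd = math.sqrt(1 - rho ** 2)       # sd of each full conditional
    samples = []
    for _ in range(n_samples):
        x1 = rng.gauss(rho * x2, cond_sd)   # draw x1 | x2
        x2 = rng.gauss(rho * x1, cond_sd)   # draw x2 | x1
        samples.append((x1, x2))
    return samples

samples = gibbs_bivariate_gaussian()
mean1 = sum(s[0] for s in samples) / len(samples)   # should be near 0
corr = sum(s[0] * s[1] for s in samples) / len(samples)  # near rho
print(mean1, corr)
```

The empirical mean and correlation of the chain approach the target distribution's values as the chain runs longer.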


Overview

Graphical Models

Bayesian Networks

Inference

Probabilistic Co-clustering

Structure: Simultaneous Row-Column Clustering

Bayesian models, Inference

Probabilistic Matrix Factorization

Structure: Low Rank Factorization

Bayesian models, Inference


Example: Gene Expression Analysis

[Figure: gene expression matrix, original vs. co-clustered.]


Co-clustering and Matrix Approximation


Probabilistic Co-clustering

Row clusters:

Column clusters:


Generative Process

Assume a mixed membership for each row and column

Assume a Gaussian for each co-cluster

1.
Pick row/column clusters

2.
Generate each entry of the matrix
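The two-step process above can be sketched in code; the cluster counts, mixed memberships, and Gaussian parameters below are all made up for illustration:

```python
# Sketch of the co-clustering generative process: each row and column
# carries a mixed membership over clusters, each (row-cluster, column-cluster)
# pair has its own Gaussian, and every matrix entry is generated by first
# picking a cluster pair. All numbers here are illustrative.
import random

def generate_matrix(n_rows=4, n_cols=5, k1=2, k2=3, seed=0):
    rng = random.Random(seed)
    # Mixed memberships: a distribution over row clusters per row, etc.
    row_pi = [[1 / k1] * k1 for _ in range(n_rows)]
    col_pi = [[1 / k2] * k2 for _ in range(n_cols)]
    # One Gaussian (mean mu, shared sd) per co-cluster.
    mu = [[10 * a + b for b in range(k2)] for a in range(k1)]
    sd = 0.5
    X = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            z1 = rng.choices(range(k1), weights=row_pi[i])[0]  # 1. pick row cluster
            z2 = rng.choices(range(k2), weights=col_pi[j])[0]  #    and column cluster
            row.append(rng.gauss(mu[z1][z2], sd))              # 2. generate the entry
        X.append(row)
    return X

X = generate_matrix()
print(len(X), len(X[0]))
```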


Bayesian Co-clustering (BCC)

A Dirichlet distribution over all possible mixed memberships


Background: Plate Diagrams

[Figure: plate notation: node b inside a plate labeled 3, with parent a, is shorthand for the unrolled network a → b1, b2, b3.]

Compact representation of large Bayesian networks


Bayesian Co-clustering (BCC)


Recall: The Inference Problem

What is P(b | ¬j, m)?


Bayesian Co-clustering (BCC)


Learning: Inference and Estimation

Learning

Estimate model parameters

Infer ‘mixed memberships’ of individual rows and columns

Expectation Maximization (EM)

Issues

Posterior probability cannot be obtained in closed form

Parameter estimation cannot be done directly

Approach: Variational inference



Variational Inference

Introduce a variational distribution q to approximate the true posterior

Use Jensen’s inequality to get a tractable lower bound

Maximize the lower bound w.r.t. q

Alternatively, minimize the KL divergence between q and the true posterior
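In generic notation (with latent variables z, data x, and variational distribution q; the slide's own symbols were in the original figures), the bound follows from Jensen's inequality:

```latex
\log p(x)
  = \log \int q(z)\,\frac{p(x,z)}{q(z)}\,dz
  \;\ge\; \mathbb{E}_{q}\!\left[\log p(x,z) - \log q(z)\right]
  = \log p(x) - \mathrm{KL}\!\left(q(z)\,\|\,p(z\mid x)\right)
```

so maximizing the lower bound over q is exactly minimizing KL(q(z) || p(z | x)).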


Variational EM for BCC

= lower bound of the log-likelihood


Residual Bayesian Co-clustering (RBC)

(z1, z2) determines the distribution

Users/movies may have bias

(m1, m2): row/column means

(bm1, bm2): row/column biases


Results: Datasets

Movielens: Movie recommendation data

100,000 ratings (1-5) for 1682 movies by 943 users (6.3% observed)

1 million ratings for 3900 movies by 6040 users (4.2% observed)

Foodmart: Transaction data

164,558 sales records for 7803 customers and 1559 products (1.35% observed)

Jester: Joke rating data

100,000 ratings (-10.00 to +10.00) for 100 jokes from 1000 users (100% observed)


BCC, RBC vs. Co-clustering Algorithms

Jester

BCC and RBC have the best performance

RBC and RBC-FF perform better than BCC


RBC vs. Other Co-clustering Algorithms

Foodmart

Movielens


RBC vs. SVD, NNMF, and CORR

Jester

RBC and RBC-FF are competitive with the other algorithms


RBC vs. SVD, NNMF, and CORR

Movielens

Foodmart


SVD vs. Parallel RBC

Parallel RBC scales well to large matrices


Co-embedding: Users


Co-embedding: Movies


Overview

Graphical Models

Bayesian Networks

Inference

Probabilistic Co-clustering

Structure: Simultaneous Row-Column Clustering

Bayesian models, Inference

Probabilistic Matrix Factorization

Structure: Low Rank Factorization

Bayesian models, Inference

Matrix Factorization

Singular value decomposition

Problems

Large matrices, with millions of row/columns

SVD can be rather slow

Sparse matrices, most entries are missing

Traditional approaches cannot handle missing entries


Matrix Factorization: “Funk SVD”

Model X ∈ R^(n×m) as UV^T, where U ∈ R^(n×k) and V ∈ R^(m×k)

Alternately optimize U and V


X̂_ij = u_i^T v_j

error = (X_ij − X̂_ij)^2 = (X_ij − u_i^T v_j)^2
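A minimal "Funk SVD"-style sketch of this objective: stochastic gradient descent on the squared error over observed entries only, with a small L2 penalty. The data, rank, and hyperparameters below are made up:

```python
# Factor X ≈ U V^T by SGD on the squared error over *observed* entries,
# which is how missing entries are handled. Illustrative hyperparameters.
import random

def funk_svd(observed, n, m, k=2, lr=0.02, reg=0.02, epochs=200, seed=0):
    """observed: dict mapping (i, j) -> rating."""
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(n)]
    V = [[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(m)]
    entries = list(observed.items())
    for _ in range(epochs):
        rng.shuffle(entries)
        for (i, j), x in entries:
            pred = sum(U[i][f] * V[j][f] for f in range(k))
            err = x - pred                      # X_ij - u_i^T v_j
            for f in range(k):                  # gradient step on u_i and v_j
                ui, vj = U[i][f], V[j][f]
                U[i][f] += lr * (err * vj - reg * ui)
                V[j][f] += lr * (err * ui - reg * vj)
    return U, V

# Tiny example: a 3x3 matrix with some entries missing.
obs = {(0, 0): 5, (0, 1): 4, (1, 0): 4, (1, 1): 3, (2, 1): 2, (2, 2): 1}
U, V = funk_svd(obs, n=3, m=3)
pred_00 = sum(U[0][f] * V[0][f] for f in range(2))
print(pred_00)  # close to the observed rating 5
```

Once U and V are fit, u_i^T v_j also predicts the unobserved entries, which is the matrix completion step.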

Probabilistic Matrix Factorization (PMF)


X_ij ~ N(u_i^T v_j, σ^2)

u_i ~ N(0, σ_u^2 I),  v_j ~ N(0, σ_v^2 I)

(R. Salakhutdinov and A. Mnih, NIPS 2007)
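Taking the negative log of this model shows that MAP estimation in PMF is exactly L2-regularized squared-error factorization, with λ_u = σ²/σ_u² and λ_v = σ²/σ_v². The snippet below just checks that identity numerically on random values (illustrative, not from the paper):

```python
# MAP in PMF = regularized squared error. Up to constants,
#   -log posterior = (1/2σ²) Σ (X_ij - u_i^T v_j)²
#                    + (1/2σ_u²) Σ ||u_i||² + (1/2σ_v²) Σ ||v_j||²
# We evaluate both forms of this identity on random values.
import math
import random

rng = random.Random(0)
n, m, k = 3, 4, 2
sigma2, sigma_u2, sigma_v2 = 0.5, 1.0, 2.0
U = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(n)]
V = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(m)]
X = [[rng.gauss(0, 1) for _ in range(m)] for _ in range(n)]

def neg_log_posterior():
    """Gaussian likelihood plus Gaussian priors, constants dropped."""
    total = 0.0
    for i in range(n):
        for j in range(m):
            pred = sum(U[i][f] * V[j][f] for f in range(k))
            total += (X[i][j] - pred) ** 2 / (2 * sigma2)
    total += sum(u ** 2 for row in U for u in row) / (2 * sigma_u2)
    total += sum(v ** 2 for row in V for v in row) / (2 * sigma_v2)
    return total

def regularized_loss():
    """[Σ (X_ij - u_i^T v_j)² + λ_u Σ||u_i||² + λ_v Σ||v_j||²] / (2σ²)."""
    lam_u, lam_v = sigma2 / sigma_u2, sigma2 / sigma_v2
    sq = sum((X[i][j] - sum(U[i][f] * V[j][f] for f in range(k))) ** 2
             for i in range(n) for j in range(m))
    reg = (lam_u * sum(u ** 2 for row in U for u in row)
           + lam_v * sum(v ** 2 for row in V for v in row))
    return (sq + reg) / (2 * sigma2)

print(math.isclose(neg_log_posterior(), regularized_loss()))  # True
```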


Bayesian Probabilistic Matrix Factorization


X_ij ~ N(u_i^T v_j, σ^2)

u_i ~ N(µ_u, Λ_u),  v_j ~ N(µ_v, Λ_v)

µ_u ~ N(µ_0, Λ_u), Λ_u ~ W(ν_0, W_0)   (Gaussian-Wishart hyperprior)
µ_v ~ N(µ_0, Λ_v), Λ_v ~ W(ν_0, W_0)

Inference using MCMC

(R. Salakhutdinov and A. Mnih, ICML 2008)
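For intuition about the MCMC step, the full conditional for a single u_i is available in closed form: reading Λ_u as a precision matrix, u_i given V, the observed entries in row i, and (µ_u, Λ_u) is Gaussian. A sketch with our own variable names (not the paper's code):

```python
# One Gibbs step for a single user vector u_i in a BPMF-style model.
# Lambda_u is treated as a precision matrix; the full conditional of u_i
# is N(mu_star, Lambda_star^{-1}). Illustrative sketch.
import numpy as np

def gibbs_step_u_i(V, x_i, obs_j, mu_u, Lambda_u, sigma2, rng):
    """Draw u_i from its Gaussian full conditional."""
    Vo = V[obs_j]                                  # factors of rated items
    xo = x_i[obs_j]                                # the observed ratings
    Lambda_star = Lambda_u + Vo.T @ Vo / sigma2    # posterior precision
    cov = np.linalg.inv(Lambda_star)
    mu_star = cov @ (Lambda_u @ mu_u + Vo.T @ xo / sigma2)
    return rng.multivariate_normal(mu_star, cov)

rng = np.random.default_rng(0)
k, m = 2, 5
V = rng.normal(size=(m, k))
x_i = rng.normal(size=m)
u_i = gibbs_step_u_i(V, x_i, obs_j=np.array([0, 2, 3]),
                     mu_u=np.zeros(k), Lambda_u=np.eye(k),
                     sigma2=0.25, rng=rng)
print(u_i.shape)
```

A full BPMF sampler cycles such draws over all u_i and v_j, then resamples (µ, Λ) from their Gaussian-Wishart conditionals.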


Parametric PMF (PPMF)

Are the priors used in PMF and BPMF suitable?


PMF: u_i ~ N(0, σ_u^2 I), v_j ~ N(0, σ_v^2 I)   (diagonal covariance)

BPMF: u_i ~ N(µ_u, Λ_u), v_j ~ N(µ_v, Λ_v)   (full covariance, with “hyperprior”)

Parametric PMF (PPMF): u_i ~ N(µ_u, Λ_u), v_j ~ N(µ_v, Λ_v)   (full covariance, but no “hyperprior”)


PPMF with Mixture Models (MPMF)

What if the row (column) items belong to several groups?


Parametric PMF (PPMF): a single Gaussian generates all u_i (or v_j)

Mixture PMF (MPMF): a mixture of Gaussians, N_1(µ_1u, Λ_1u), N_2(µ_2u, Λ_2u), N_3(µ_3u, Λ_3u), represents a set of groups

Each u_i (or v_j) is generated from one of the Gaussians


PMF with Side Information: LDA
-
MPMF

Can we use side information to improve accuracy?


[Figure: users × movies matrix with side information; each mixture component N_k(µ_ku, Λ_ku), k = 1, 2, 3, is paired with a side-information distribution p_k(θ_ku).]

LDA-MPMF: u_i and the side information share a membership vector



PMF with Side Information: CTM-PPMF


[Figure: users × movies matrix; u_i ~ N(µ_u, Λ_u), and the side information is generated from a membership vector derived from u_i.]

LDA-MPMF: u_i and the side information share a membership vector

CTM-PPMF: u_i is converted to the membership vector that generates the side information


Residual PPMF

How do we account for row (column) biases?

E.g., famous movies, critical users


PPMF:

X_ij ~ N(u_i^T v_j, σ^2),  u_i ~ N(µ_u, Λ_u),  v_j ~ N(µ_v, Λ_v)

Does not consider row/column biases

Residual PPMF:

Each row has a row bias f_i, and each column has a column bias g_j

X_ij ~ N(u_i^T v_j + f_i + g_j, σ^2)

Similar residual models for Mixture PMF, LDA-MPMF, and CTM-PPMF.
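The residual prediction is just the low-rank term plus the two biases; a tiny numeric sketch with made-up factors and biases:

```python
# Residual model prediction: mean of N(u_i^T v_j + f_i + g_j, sigma^2).
# All values below are made up for illustration.
def predict(U, V, f, g, i, j):
    """Low-rank term u_i^T v_j plus row bias f_i and column bias g_j."""
    return sum(ui * vj for ui, vj in zip(U[i], V[j])) + f[i] + g[j]

U = [[1.0, 0.5], [0.2, 0.1]]   # row (user) factors
V = [[0.4, 0.6], [1.0, 0.0]]   # column (movie) factors
f = [0.3, -0.1]                # row biases, e.g. a generous user
g = [0.2, 0.5]                 # column biases, e.g. a famous movie

print(predict(U, V, f, g, 0, 0))  # 1.0*0.4 + 0.5*0.6 + 0.3 + 0.2 = 1.2
```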

PPMF vs. PMF, BPMF


PPMF mostly achieves higher accuracy

PPMF vs. Co-clustering

PPMF achieves higher accuracy compared to co-clustering

Residual Models


Modeling row/column biases helps

Effect of Side Information


Side information helps to a certain extent

[Figure: comparison of models with no side information, non-residual models, and residual models.]

Other Applications

Topic Modeling and Text Classification

Mixed-membership topic models

Combination with logistic regression

Cluster Ensembles

Combine multiple clusterings of a dataset

Mixed-membership naïve-Bayes models

Bayesian Kernel methods

Nonlinear covariance using Gaussian Process priors

Nonlinear correlated multivariate predictions


Example: Multi-label classification

One data point has multiple labels

Labels are correlated


Aviation Safety report (example): “Arrived at [destination] and flaps were selected for approach. SLATS FAIL caution message. We advised approach and they gave us vectors while we completed the QRH (quick reference handbook) procedure. The slats were failed at zero. The FA‘s (flight attendants) were notified, an emergency declared and we landed uneventfully. Maintenance were performed…”

Labels: Landing problem, Equipment failure, Special procedure, Maintenance

[Figure: correlations between labels, with edges such as “cause”, “starts”, “requires”.]

Experimental Results

Classification performance: BMR outperforms the other algorithms

Topics in the class “Anomaly.Smoke or Fire”

The Main Idea


Conclusions

Matrix Completion Problems

The Main Idea

Consider a suitable modeling structure

Perform averaging over all such models

Probabilistic Co-clustering

Mixed membership co-clustering
Probabilistic Factorization

Bayesian model over all possible factorizations

Future Directions

Nonlinear, High-dimensional, Dynamic Models

Applications: Climate & Environmental Sciences, Healthcare, Finance

References

A. Banerjee, “On Bayesian Bounds,” International Conference on Machine Learning (ICML), 2006.

A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, and D. Modha, “A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation,” Journal of Machine Learning Research (JMLR), 2007.

A. Banerjee and H. Shan, “Latent Dirichlet Conditional Naive Bayes Models,” IEEE International Conference on Data Mining (ICDM), 2007.

H. Shan and A. Banerjee, “Bayesian Co-clustering,” IEEE International Conference on Data Mining (ICDM), 2008.

H. Wang, H. Shan, and A. Banerjee, “Bayesian Cluster Ensembles,” SIAM International Conference on Data Mining (SDM), 2009.

H. Shan, A. Banerjee, and N. Oza, “Discriminative Mixed-membership Models,” IEEE International Conference on Data Mining (ICDM), 2009.

H. Shan and A. Banerjee, “Residual Bayesian Co-clustering for Matrix Approximation,” SIAM International Conference on Data Mining (SDM), 2010.


References

H. Shan and A. Banerjee, “Mixed-Membership Naive Bayes Models,” Data Mining and Knowledge Discovery (DMKD), 2010.

H. Shan and A. Banerjee, “Generalized Probabilistic Matrix Factorizations for Collaborative Filtering,” IEEE International Conference on Data Mining (ICDM), 2010.

A. Agovic, H. Shan, and A. Banerjee, “Analyzing Aviation Safety Reports: From Topic Modeling to Scalable Multi-label Classification,” Conference on Intelligent Data Understanding (CIDU), 2010.

A. Agovic and A. Banerjee, “Gaussian Process Topic Models,” Conference on Uncertainty in Artificial Intelligence (UAI), 2010.


Acknowledgements

Hanhuai Shan

Amrudin Agovic
