Probabilistic Models for Matrix Completion Problems
Arindam Banerjee
banerjee@cs.umn.edu
Dept of Computer Science & Engineering
University of Minnesota, Twin Cities
March 11, 2011
Recommendation Systems
Probabilistic Matrix Completion
Users
Age: 28; Gender: Male; Job: Salesman; Interest: Travel; …
Movies
Title: Gone with the Wind; Release year: 1940; Cast: Vivien Leigh, Clark Gable; Genre: War, Romance; Awards: 8 Oscars; Keywords: Love, Civil war; …
Movie ratings matrix
Advertisements on the Web
Click-Through Rate matrix (webpages and products)
1% 2% 0.01% …
0.1% 2% 3% …
2% 2% 0.5% …
0.2% 0.3% 1.5% 2% …
2.5% 1% …
1.5% 1% 0.04% …
Products
Category: Sports shoes; Brand: Nike; Ratings: 4.2/5; …
Webpages
Category: Baby; URL: babyearth.com; Content: Webpage text; Hyperlinks: …
Forest Ecology
Plant Trait Matrix (TRY db) (Jens Kattge, Peter Reich, et al.)
Rows: Plants. Columns: Traits.
Leaf(N) Leaf(P) SLA Leaf Size … Wood density
2 3 5 …
4 1 2 …
3 3 …
1 1 3 2 …
4 2 1 …
1 1 3 …
The Main Idea
Overview
• Graphical Models
  – Bayesian Networks
  – Inference
• Probabilistic Co-clustering
  – Structure: Simultaneous Row-Column Clustering
  – Bayesian models, Inference
• Probabilistic Matrix Factorization
  – Structure: Low Rank Factorization
  – Bayesian models, Inference
Graphical Models: What and Why
• Statistical Machine Learning
  – Build diagnostic/predictive models from data
  – Uncertainty quantification based on (minimal) assumptions
• The I.I.D. assumption
  – Data is independently and identically distributed
  – Example: Words in a doc drawn i.i.d. from the dictionary
• Graphical models
  – Assume (graphical) dependencies between (random) variables
  – Closer to reality, domain knowledge can be captured
  – Learning/inference is much more difficult
Flavors of Graphical Models
• Basic nomenclature
  – Node = random variable, may be observed or hidden
  – Edge = statistical dependency
• Two popular flavors: ‘Directed’ and ‘Undirected’
• Directed Graphs
  – A directed graph between random variables, causal dependencies
  – Example: Bayesian networks, Hidden Markov Models
  – Joint distribution is a product of P(child | parents)
• Undirected Graphs
  – An undirected graph between random variables
  – Example: Markov/Conditional random fields
  – Joint distribution in terms of potential functions
[Figure: example graph over nodes X1, X2, X3, X4, X5]
Bayesian Networks
• Joint distribution in terms of P(X | Parents(X))
[Figure: example Bayesian network over nodes X1, X2, X3, X4, X5]
Example I: Burglary Network
Example II: Rain Network
Example III: Car Problem Diagnosis
Latent Variable Models
• Bayesian network with hidden variables
  – Semantically more accurate, fewer parameters
• Example: Compute probability of heart disease
Inference
• Some variables in the Bayes net are observed
  – The evidence/data, e.g., John has not called, Mary has called
• Inference
  – How to compute value/probability of other variables
  – Example: What is the probability of Burglary, i.e., P(b | ¬j, m)
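The burglary query can be answered by brute-force enumeration: multiply the P(child | parents) factors and sum out the hidden variables. A minimal sketch follows. The conditional probability values are the common textbook numbers for this network and are an assumption here, since the slide's figure is not reproduced; the function names are illustrative.

```python
from itertools import product

# CPT values below are assumed (common textbook numbers), not taken from the slides.
P_B = 0.001                      # P(Burglary)
P_E = 0.002                      # P(Earthquake)
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(Alarm | B, E)
P_J = {True: 0.90, False: 0.05}  # P(JohnCalls | Alarm)
P_M = {True: 0.70, False: 0.01}  # P(MaryCalls | Alarm)


def bern(p, value):
    """Probability that a Bernoulli(p) variable takes `value`."""
    return p if value else 1.0 - p


def joint(b, e, a, j, m):
    """Joint probability as a product of P(child | parents)."""
    return (bern(P_B, b) * bern(P_E, e) * bern(P_A[(b, e)], a)
            * bern(P_J[a], j) * bern(P_M[a], m))


def posterior_burglary(j, m):
    """P(b | j, m): sum the joint over the hidden variables E and A, then normalize."""
    score = {}
    for b in (True, False):
        score[b] = sum(joint(b, e, a, j, m)
                       for e, a in product((True, False), repeat=2))
    return score[True] / (score[True] + score[False])


p = posterior_burglary(j=False, m=True)
print(f"P(b | not j, m) = {p:.4f}")
```

Enumeration costs time exponential in the number of hidden variables, which is why the structured algorithms on the following slides matter.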
Inference Algorithms
• Graphs without loops
  – Efficient exact inference algorithms are possible
  – Sum-product algorithm, and its special cases
    • Belief propagation in Bayes nets
    • Forward-Backward algorithm in Hidden Markov Models (HMMs)
• Graphs with loops
  – Junction tree algorithms
    • Convert into a graph without loops
    • May lead to exponentially large graph, inefficient algorithm
  – Sum-product algorithm, disregarding loops
    • Active research topic, correct convergence ‘not guaranteed’
    • Works well in practice, e.g., turbo codes
  – Approximate inference
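As an illustration of exact inference on a loop-free graph, here is a sketch of the forward recursion for an HMM (the forward half of Forward-Backward), checked against a sum over all hidden paths. The 2-state model and its numbers are hypothetical.

```python
from itertools import product


def forward(init, trans, emit, obs):
    """Likelihood P(obs) of an HMM via the forward recursion (sum-product on a chain)."""
    n = len(init)
    alpha = [init[s] * emit[s][obs[0]] for s in range(n)]
    for o in obs[1:]:
        # Pass the message one step along the chain, then absorb the new observation.
        alpha = [emit[s][o] * sum(alpha[r] * trans[r][s] for r in range(n))
                 for s in range(n)]
    return sum(alpha)


def brute_force(init, trans, emit, obs):
    """Same likelihood by summing over every hidden state sequence (exponential cost)."""
    n, total = len(init), 0.0
    for path in product(range(n), repeat=len(obs)):
        p = init[path[0]] * emit[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= trans[path[t - 1]][path[t]] * emit[path[t]][obs[t]]
        total += p
    return total


# Hypothetical 2-state, 2-symbol HMM.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]   # trans[r][s] = P(next = s | current = r)
emit = [[0.9, 0.1], [0.2, 0.8]]    # emit[s][o] = P(observation = o | state = s)
obs = [0, 1, 1, 0]
print(forward(init, trans, emit, obs), brute_force(init, trans, emit, obs))
```

The recursion costs O(T n^2) for T observations and n states, while the brute-force sum costs O(n^T).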
Approximate Inference
• Variational Inference
  – Deterministic approximation
  – Approximate complex true distribution/domain
  – Replace with family of simple distributions/domains
    • Use the best approximation in the family
  – Example: Mean-field, Expectation Propagation
• Stochastic Inference
  – Simple sampling approaches
  – Markov Chain Monte Carlo methods (MCMC)
    • Powerful family of methods
  – Gibbs sampling
    • Useful special case of MCMC methods
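A minimal Gibbs sampling sketch, on a toy target rather than any model from this talk: a standard bivariate Gaussian with correlation rho, where each conditional is N(rho * other, 1 - rho^2). The target, function names, and settings are assumptions chosen for illustration.

```python
import random


def gibbs_bivariate_normal(rho, n_samples, burn_in=500, seed=0):
    """Sample (x, y) from a standard bivariate normal by alternating conditional draws."""
    rng = random.Random(seed)
    sd = (1.0 - rho * rho) ** 0.5
    x = y = 0.0
    samples = []
    for t in range(burn_in + n_samples):
        x = rng.gauss(rho * y, sd)   # draw x | y
        y = rng.gauss(rho * x, sd)   # draw y | x
        if t >= burn_in:
            samples.append((x, y))
    return samples


def correlation(samples):
    """Empirical Pearson correlation of the sampled pairs."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    cov = sum((x - mx) * (y - my) for x, y in samples) / n
    vx = sum((x - mx) ** 2 for x, _ in samples) / n
    vy = sum((y - my) ** 2 for _, y in samples) / n
    return cov / (vx * vy) ** 0.5


samples = gibbs_bivariate_normal(rho=0.8, n_samples=20000)
print(round(correlation(samples), 2))
```

Each step only needs a conditional distribution, which is what makes Gibbs sampling convenient for models whose joint posterior has no closed form.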
Overview
• Graphical Models
  – Bayesian Networks
  – Inference
• Probabilistic Co-clustering
  – Structure: Simultaneous Row-Column Clustering
  – Bayesian models, Inference
• Probabilistic Matrix Factorization
  – Structure: Low Rank Factorization
  – Bayesian models, Inference
Example: Gene Expression Analysis
[Figure: gene expression matrix, panels "Original" and "Co-clustered"]
Co-clustering and Matrix Approximation
Probabilistic Co-clustering
Row clusters: …
Column clusters: …
Generative Process
• Assume a mixed membership for each row and column
• Assume a Gaussian for each co-cluster
1. Pick row/column clusters
2. Generate each entry of the matrix
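The two-step process above can be sketched as a sampler. This is an illustrative sketch only: drawing the mixed memberships from a symmetric Dirichlet, the shared standard deviation `sd`, and all names and numbers are assumptions.

```python
import random


def dirichlet(rng, alpha, k):
    """Draw a length-k probability vector from a symmetric Dirichlet(alpha)."""
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_matrix(n_rows, n_cols, means, sd=0.1, alpha=1.0, seed=0):
    """Generate a matrix entry by entry from mixed-membership co-clusters."""
    rng = random.Random(seed)
    k1, k2 = len(means), len(means[0])
    row_mix = [dirichlet(rng, alpha, k1) for _ in range(n_rows)]  # one per row
    col_mix = [dirichlet(rng, alpha, k2) for _ in range(n_cols)]  # one per column
    X = []
    for i in range(n_rows):
        row = []
        for j in range(n_cols):
            # Step 1: pick a row cluster and a column cluster for this entry.
            z1 = rng.choices(range(k1), weights=row_mix[i])[0]
            z2 = rng.choices(range(k2), weights=col_mix[j])[0]
            # Step 2: generate the entry from the Gaussian of co-cluster (z1, z2).
            row.append(rng.gauss(means[z1][z2], sd))
        X.append(row)
    return X, row_mix, col_mix


# Hypothetical 2 x 2 grid of co-cluster means.
means = [[1.0, 3.0], [5.0, 2.0]]
X, row_mix, col_mix = sample_matrix(n_rows=4, n_cols=3, means=means)
for r in X:
    print([round(x, 2) for x in r])
```

Because the cluster pair is redrawn for every entry, a row can take part in several row clusters across its entries, which is what "mixed membership" refers to.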
Bayesian Co-clustering (BCC)
• A Dirichlet distribution over all possible mixed memberships
Background: Plate Diagrams
[Figure: a Bayesian network a → b1, b2, b3 and its plate form, a → b with a plate of size 3]
Compact representation of large Bayesian networks
Bayesian Co-clustering (BCC)
Recall: The Inference Problem
What is P(b | ¬j, m)?
Bayesian Co-clustering (BCC)
Learning: Inference and Estimation
• Learning
  – Estimate model parameters
  – Infer ‘mixed memberships’ of individual rows and columns
• Expectation Maximization (EM)
• Issues
  – Posterior probability cannot be obtained in closed form
  – Parameter estimation cannot be done directly
• Approach: Variational inference
Variational Inference
• Introduce a variational distribution to approximate the true posterior
• Use Jensen’s inequality to get a tractable lower bound
• Maximize the lower bound w.r.t. the variational parameters
  – Alternatively, minimize the KL divergence between the variational distribution and the true posterior
• Maximize the lower bound w.r.t. the model parameters
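The bound can be written out in generic form. The symbols below are assumptions, since the slide's own notation did not survive extraction: $X$ is the observed data, $z$ the latent variables, $\Theta$ the model parameters, and $q$ the variational distribution.

```latex
\begin{align*}
\log p(X \mid \Theta)
  &= \log \int q(z)\, \frac{p(X, z \mid \Theta)}{q(z)}\, dz \\
  &\ge \mathbb{E}_{q}\!\left[\log p(X, z \mid \Theta)\right]
     - \mathbb{E}_{q}\!\left[\log q(z)\right]
   \;=\; \mathcal{L}(q, \Theta)
   && \text{(Jensen's inequality)} \\
\log p(X \mid \Theta) - \mathcal{L}(q, \Theta)
  &= \mathrm{KL}\!\left(q(z) \,\|\, p(z \mid X, \Theta)\right) \;\ge\; 0 .
\end{align*}
```

The second identity shows why the two views coincide: for fixed $\Theta$, raising $\mathcal{L}$ over $q$ is the same as shrinking the KL divergence to the true posterior.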
Variational EM for BCC
Objective: a lower bound of the log-likelihood
Residual Bayesian Co-clustering (RBC)
• $(z_1, z_2)$ determines the distribution
• Users/movies may have bias
• $(m_1, m_2)$: row/column means
• $(bm_1, bm_2)$: row/column bias
Results: Datasets
• Movielens: Movie recommendation data
  – 100,000 ratings (1-5) for 1682 movies by 943 users (6.3%)
  – 1 million ratings for 3900 movies by 6040 users (4.2%)
• Foodmart: Transaction data
  – 164,558 sales records for 7803 customers and 1559 products (1.35%)
• Jester: Joke rating data
  – 100,000 ratings (-10.00, +10.00) for 100 jokes from 1000 users (100%)
(Percentages: fraction of matrix entries that are observed.)
BCC, RBC vs. Co-clustering Algorithms
[Figure panel: Jester]
• BCC and RBC have the best performance
• RBC and RBC-FF perform better than BCC
RBC vs. Other Co-clustering Algorithms
[Figure panels: Foodmart, Movielens]
RBC vs. SVD, NNMF, and CORR
[Figure panel: Jester]
• RBC and RBC-FF are competitive with other algorithms
RBC vs. SVD, NNMF, and CORR
[Figure panels: Movielens, Foodmart]
SVD vs. Parallel RBC
Parallel RBC scales well to large matrices
Co-embedding: Users
Co-embedding: Movies
Overview
• Graphical Models
  – Bayesian Networks
  – Inference
• Probabilistic Co-clustering
  – Structure: Simultaneous Row-Column Clustering
  – Bayesian models, Inference
• Probabilistic Matrix Factorization
  – Structure: Low Rank Factorization
  – Bayesian models, Inference
Matrix Factorization
• Singular value decomposition
• Problems
  – Large matrices, with millions of rows/columns
    • SVD can be rather slow
  – Sparse matrices, most entries are missing
    • Traditional approaches cannot handle missing entries
Matrix Factorization: “Funk SVD”
• Model $X \in \mathbb{R}^{n \times m}$ as $X \approx UV^T$, where
  – $U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{m \times k}$
  – Alternately optimize U and V
• Prediction: $\hat{X}_{ij} = u_i^T v_j$
• Error: $(X_{ij} - \hat{X}_{ij})^2 = (X_{ij} - u_i^T v_j)^2$
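A sketch of this style of factorization, fitting U and V by stochastic gradient descent on the squared error over observed entries only. The learning rate, the small L2 penalty `reg`, the initialization, and the toy data are assumptions for illustration, not settings from the talk.

```python
import random


def funk_svd(ratings, n, m, k=2, lr=0.02, reg=0.02, epochs=500, seed=0):
    """Fit U (n x k) and V (m x k) by SGD on the observed (i, j, x) triples only."""
    rng = random.Random(seed)
    U = [[rng.gauss(0.0, 0.1) for _ in range(k)] for _ in range(n)]
    V = [[rng.gauss(0.0, 0.1) for _ in range(k)] for _ in range(m)]
    order = list(ratings)
    for _ in range(epochs):
        rng.shuffle(order)
        for i, j, x in order:
            err = x - sum(a * b for a, b in zip(U[i], V[j]))  # X_ij - u_i^T v_j
            for f in range(k):
                u, v = U[i][f], V[j][f]
                U[i][f] += lr * (err * v - reg * u)
                V[j][f] += lr * (err * u - reg * v)
    return U, V


def rmse(ratings, U, V):
    """Root mean squared error of u_i^T v_j on the given triples."""
    sq = [(x - sum(a * b for a, b in zip(U[i], V[j]))) ** 2 for i, j, x in ratings]
    return (sum(sq) / len(sq)) ** 0.5


# Hypothetical 4 x 4 rank-1 rating matrix with four entries treated as missing.
row_scale = [1.0, 2.0, 1.5, 2.0]
col_scale = [1.0, 2.0, 1.5, 2.5]
missing = {(0, 3), (1, 1), (2, 0), (3, 2)}
observed = [(i, j, row_scale[i] * col_scale[j])
            for i in range(4) for j in range(4) if (i, j) not in missing]

U0, V0 = funk_svd(observed, 4, 4, epochs=0)   # untrained factors, for comparison
U, V = funk_svd(observed, 4, 4)
print(rmse(observed, U0, V0), rmse(observed, U, V))
```

Missing entries never enter the loss, so the cost per epoch scales with the number of observed ratings rather than with n × m.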
Probabilistic Matrix Factorization (PMF)
$u_i \sim \mathcal{N}(0, \sigma_u^2 I)$
$v_j \sim \mathcal{N}(0, \sigma_v^2 I)$
$X_{ij} \sim \mathcal{N}(u_i^T v_j, \sigma^2)$
Inference using gradient descent
R. Salakhutdinov and A. Mnih, NIPS 2007
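With Gaussian priors and a Gaussian likelihood, the negative log-posterior of PMF is a regularized squared error, which is what gradient descent minimizes. In the sketch below, $\Omega$ (the set of observed entries) is an assumed symbol that does not appear on the slide.

```latex
\begin{align*}
-\log p(U, V \mid X)
  &= \frac{1}{2\sigma^2} \sum_{(i,j) \in \Omega} \left(X_{ij} - u_i^T v_j\right)^2
   + \frac{1}{2\sigma_u^2} \sum_{i} \|u_i\|^2
   + \frac{1}{2\sigma_v^2} \sum_{j} \|v_j\|^2
   + \text{const}, \\
\nabla_{u_i} \left[-\log p(U, V \mid X)\right]
  &= -\frac{1}{\sigma^2} \sum_{j : (i,j) \in \Omega} \left(X_{ij} - u_i^T v_j\right) v_j
   + \frac{1}{\sigma_u^2}\, u_i .
\end{align*}
```

The ratios $\sigma^2 / \sigma_u^2$ and $\sigma^2 / \sigma_v^2$ act as L2 regularization weights on the rows of $U$ and $V$.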
Bayesian Probabilistic Matrix Factorization
$\mu_u \sim \mathcal{N}(\mu_0, \Lambda_u)$, $\Lambda_u \sim \mathcal{W}(\nu_0, W_0)$ (Gaussian and Wishart hyperpriors)
$\mu_v \sim \mathcal{N}(\mu_0, \Lambda_v)$, $\Lambda_v \sim \mathcal{W}(\nu_0, W_0)$
$u_i \sim \mathcal{N}(\mu_u, \Lambda_u)$
$v_j \sim \mathcal{N}(\mu_v, \Lambda_v)$
$X_{ij} \sim \mathcal{N}(u_i^T v_j, \sigma^2)$
Inference using MCMC
R. Salakhutdinov and A. Mnih, ICML 2008
Parametric PMF (PPMF)
• Are the priors used in PMF and BPMF suitable?
PMF: Diagonal covariance
  $u_i \sim \mathcal{N}(0, \sigma_u^2 I)$, $v_j \sim \mathcal{N}(0, \sigma_v^2 I)$
BPMF: Full covariance, with “hyperprior”
  $u_i \sim \mathcal{N}(\mu_u, \Lambda_u)$, $v_j \sim \mathcal{N}(\mu_v, \Lambda_v)$
Parametric PMF (PPMF): Full covariance, but no “hyperprior”
  $u_i \sim \mathcal{N}(\mu_u, \Lambda_u)$, $v_j \sim \mathcal{N}(\mu_v, \Lambda_v)$
PPMF with Mixture Models (MPMF)
• What if the row (column) items belong to several groups?
Parametric PMF (PPMF): A single Gaussian to generate all $u_i$ (or $v_j$)
Mixture PMF (MPMF): A mixture of Gaussians represents a set of groups. Each $u_i$ (or $v_j$) is generated from one of the Gaussians
[Figure: mixture components $\mathcal{N}_1(\mu_{1u}, \Lambda_{1u})$, $\mathcal{N}_2(\mu_{2u}, \Lambda_{2u})$, $\mathcal{N}_3(\mu_{3u}, \Lambda_{3u})$ generating $u_i$]
PMF with Side Information: LDA-MPMF
• Can we use side information to improve accuracy?
[Figure: users × movies matrix with side information; components $\mathcal{N}_c(\mu_{cu}, \Lambda_{cu})$ for $u_i$ and $p_c(\theta_{cu})$ for side information, $c = 1, 2, 3$]
LDA-MPMF: $u_i$ and side information share a membership vector
PMF with Side Information: CTM-PPMF
[Figure: users × movies matrix with side information; $u_i \sim \mathcal{N}(\mu_u, \Lambda_u)$ and side information components $p_c(\theta_{cu})$, $c = 1, 2, 3$]
LDA-MPMF: $u_i$ and side information share a membership vector
CTM-PPMF: $u_i$ is converted to the membership vector to generate side information
Residual PPMF
• How to consider the row (column) biases?
  – E.g. famous movies, critical users, etc.
PPMF: Does not consider row and column biases
  $u_i \sim \mathcal{N}(\mu_u, \Lambda_u)$, $v_j \sim \mathcal{N}(\mu_v, \Lambda_v)$
  $X_{ij} \sim \mathcal{N}(u_i^T v_j, \sigma^2)$
Residual PPMF: Each row has a row bias $f_i$. Each column has a column bias $g_j$.
  $X_{ij} \sim \mathcal{N}(u_i^T v_j + f_i + g_j, \sigma^2)$
Similar residual models for Mixture PMF, LDA-MPMF, and CTM-PPMF.
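A small sketch of how the residual terms enter a prediction: the mean of an entry is the factor inner product plus the row and column biases. The numeric values and function names are hypothetical.

```python
import random


def residual_mean(u_i, v_j, f_i, g_j):
    """Mean of X_ij under Residual PPMF: u_i^T v_j + f_i + g_j."""
    return sum(a * b for a, b in zip(u_i, v_j)) + f_i + g_j


def sample_entry(u_i, v_j, f_i, g_j, sigma, rng):
    """Draw X_ij ~ N(u_i^T v_j + f_i + g_j, sigma^2)."""
    return rng.gauss(residual_mean(u_i, v_j, f_i, g_j), sigma)


# Hypothetical values: a critical user (negative row bias f_i)
# rating a famous movie (positive column bias g_j).
rng = random.Random(0)
u, v = [1.0, 2.0], [0.5, 0.25]
mean = residual_mean(u, v, f_i=-0.3, g_j=0.5)
draw = sample_entry(u, v, f_i=-0.3, g_j=0.5, sigma=0.1, rng=rng)
print(mean, draw)
```

Setting both biases to zero recovers the plain PPMF mean $u_i^T v_j$.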
PPMF vs. PMF, BPMF
PPMF mostly achieves higher accuracy
PPMF vs. Co-clustering
PPMF achieves higher accuracy compared to co-clustering
Residual Models
Modeling row and column biases helps
Effect of Side Information
• Side information helps to a certain extent
• Helpfulness ordering: Genre, Cast, Plot
[Figure legend: No side information; Non-residual models; Residual models]
Other Applications
• Topic Modeling and Text Classification
  – Mixed-membership topic models
  – Combination with logistic regression
• Cluster Ensembles
  – Combine multiple clusterings of a dataset
  – Mixed-membership naive Bayes models
• Bayesian Kernel methods
  – Nonlinear covariance using Gaussian Process priors
  – Nonlinear correlated multivariate predictions
Example: Multi-label classification
• One data point has multiple labels
• Labels are correlated
Aviation Safety reports:
"Arrived at [destination] and flaps were selected for approach. We received a SLATS FAIL caution message. We advised approach and they gave us vectors while we completed the QRH (quick reference handbook) procedure. The slats were failed at zero. The FA's (flight attendants) were notified, an emergency declared and we landed uneventfully. Maintenance were performed…"
Labels: Landing problem, Equipment failure, Special procedure, Maintenance, …
[Figure: relations between labels marked "cause", "starts", "requires"]
Experimental Results
Classification performance: BMR outperforms other algorithms
Topics in class Anomaly.Smoke or Fire
The Main Idea
Conclusions
• Matrix Completion Problems
• The Main Idea
  – Consider a suitable modeling structure
  – Perform averaging over all such models
• Probabilistic Co-clustering
  – Mixed membership co-clustering for dyadic data
• Probabilistic Factorization
  – Bayesian model over all possible factorizations
• Future Directions
  – Nonlinear, High-dimensional, Dynamic Models
  – Applications: Climate & Environmental Sciences, Healthcare, Finance
References
• A. Banerjee, “On Bayesian Bounds,” International Conference on Machine Learning (ICML), 2006.
• A. Banerjee, I. Dhillon, J. Ghosh, S. Merugu, D. Modha, “A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation,” Journal of Machine Learning Research (JMLR), 2007.
• A. Banerjee and H. Shan, “Latent Dirichlet Conditional Naive Bayes Models,” IEEE International Conference on Data Mining (ICDM), 2007.
• H. Shan and A. Banerjee, “Bayesian Co-clustering,” IEEE International Conference on Data Mining (ICDM), 2008.
• H. Wang, H. Shan, A. Banerjee, “Bayesian Cluster Ensembles,” SIAM International Conference on Data Mining (SDM), 2009.
• H. Shan, A. Banerjee, and N. Oza, “Discriminative Mixed-membership Models,” IEEE International Conference on Data Mining (ICDM), 2009.
• H. Shan and A. Banerjee, “Residual Bayesian Co-clustering for Matrix Approximation,” SIAM International Conference on Data Mining (SDM), 2010.
• H. Shan and A. Banerjee, “Mixed-Membership Naive Bayes Models,” Data Mining and Knowledge Discovery (DMKD), 2010.
• H. Shan and A. Banerjee, “Generalized Probabilistic Matrix Factorizations for Collaborative Filtering,” IEEE International Conference on Data Mining (ICDM), 2010.
• A. Agovic, H. Shan, and A. Banerjee, “Analyzing aviation safety reports: From topic modeling to scalable multi-label classification,” Conference on Intelligent Data Understanding (CIDU), 2010.
• A. Agovic and A. Banerjee, “Gaussian Process Topic Models,” Conference on Uncertainty in Artificial Intelligence (UAI), 2010.
Acknowledgements
Hanhuai Shan
Amrudin Agovic