Deng Cai (蔡登)
College of Computer Science, Zhejiang University
dengcai@gmail.com
Matrix Factorization
© Deng Cai, College of Computer Science, Zhejiang University
What Is Matrix Factorization?
Given X ∈ ℛ^{m×n}, find U ∈ ℛ^{m×k} and V ∈ ℛ^{k×n} such that:
1. Exact factorization: X = UΣVᵀ with a diagonal Σ ∈ ℛ^{k×k} (e.g., the SVD), equivalently X = Σᵢ σᵢ uᵢ vᵢᵀ.
2. Approximate factorization: X ≈ UV, minimizing ‖X − UV‖²_F.
Why Matrix Factorization?
Image Recovery
Recommendation
[Rating matrix: rows = users (Alice, Bob, Tracy, Steven, John); columns = movies (The Matrix, Star Wars, Roman Holiday, Titanic, Shrek, Madagascar); some cells hold observed ratings (1–5), the rest are unknown, shown as "?".]
©
Deng
Cai
, College of Computer Science, Zhejiang University
Search: Information Retrieval
Machine Learning
Search: Information Retrieval
Language Model Paradigm in IR
Probabilistic relevance model with random variables d (document), q (query), R (relevance). By Bayes' rule:
P(R = 1 | d, q) ∝ P(q | d, R = 1) · P(R = 1 | d)
• P(R = 1 | d): prior probability of relevance for document d (e.g. quality, popularity)
• P(q | d, R = 1): probability of generating a query q to ask for relevant d
• P(R = 1 | d, q): probability that document d is relevant for query q
J. Ponte and W.B. Croft, A Language Model Approach to Information Retrieval, ACM SIGIR, 1998.
Language Model Paradigm
First contribution: prior probability of relevance P(R = 1 | d)
• simplest case: uniform (drops out for ranking)
• popularity: document usage statistics (e.g. library circulation records, download or access statistics, hyperlink structure)
Second contribution: query likelihood P(q | d, R = 1)
• query terms q₁, q₂, … are treated as a sample drawn from an (unknown) relevant document
Language Model Paradigm
Query generation model: what might a query look like that asks for a specific document?
• Maron & Kuhns: an indexer manually assigns probabilities for a pre-specified set of tags/terms
• Ponte & Croft: a statistical estimation problem
Think of a relevant document. Formulate a query by picking some of the keywords as query terms.
Example document: "Environmentalists are blasting a Bush administration proposal to lift a ban on logging in remote areas of national forests, saying the move ignores popular support for protecting forests."
Query Likelihood
P(q | d, R = 1) ≡ P(q₁, ⋯, q_m | d)
Independence assumption:
P(q | d) = Π_{w ∈ q} P(w | d)
How do we estimate P(w | d)?
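As a concrete illustration of the unigram query-likelihood model above, here is a minimal sketch (the corpus and counts are invented toy data, not from the slides) that ranks documents by log P(q | d) under the independence assumption:

```python
import math

# Toy corpus: term counts per document (assumed example data).
docs = {
    "d1": {"matrix": 4, "factorization": 2, "movie": 0, "the": 6},
    "d2": {"matrix": 1, "factorization": 0, "movie": 5, "the": 7},
}

def query_log_likelihood(query, counts):
    """log P(q|d) = sum over query terms of log P(w|d), with MLE P(w|d)."""
    total = sum(counts.values())
    score = 0.0
    for w in query:
        p = counts.get(w, 0) / total
        if p == 0:
            return float("-inf")  # an unseen term makes the MLE score zero
        score += math.log(p)
    return score

q = ["matrix", "factorization"]
scores = {d: query_log_likelihood(q, c) for d, c in docs.items()}
best = max(scores, key=scores.get)
```

Note how d2, which never contains "factorization", gets probability zero — the sparsity problem that motivates the matrix-factorization view later in the deck.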
Document–Term Matrix
Example document: "Texas Instruments said it has developed the first 32-bit computer chip designed specifically for artificial intelligence applications [...]"
D = document collection
W = lexicon/vocabulary
The document–term matrix X records a term weighting for each (document, term) pair — e.g. the count of "intelligence" in the document above.
A 100-Millionth of a Typical Document–Term Matrix
[Figure: a tiny fragment of X — mostly zeros, with scattered small counts such as 1 and 2.]
Typical:
• Number of documents: 1,000,000
• Vocabulary: 100,000
• Sparseness: < 0.1%
• Fraction depicted: 1e-8
Three Problems
Image Recovery
Search
Recommendation
Incomplete Matrix
[Figure: the same sparse document–term matrix X; its many zero entries can be viewed as missing/unobserved values.]
Query Likelihood (recap)
P(q | d, R = 1) ≡ P(q₁, ⋯, q_m | d); with the independence assumption, P(q | d) = Π_{w ∈ q} P(w | d). How do we estimate P(w | d)?
Matrix Factorization
X ∈ ℛ^{m×n}, U ∈ ℛ^{m×k}, V ∈ ℛ^{k×n}: X ≈ UV
Low Rank Assumption: k hidden factors
X = [x₁, x₂, ⋯, x_n] ∈ ℛ^{m×n}
©
Deng
Cai
, College of Computer Science, Zhejiang University
Matrix Factorization
[Figure: X ≈ UV drawn as blocks — an m×n matrix approximated by the product of an m×k matrix and a k×n matrix.]
Relation to Dimensionality Reduction
X = [x₁, x₂, ⋯, x_n] ≈ UV, V = [v₁, v₂, ⋯, v_n], with x_i ∈ ℛ^m and v_i ∈ ℛ^k.
If there is a matrix A ∈ ℛ^{k×m} which satisfies v_i = A x_i for every i, the factorization performs a linear dimensionality reduction.
Dimensionality Reduction
Linear transformation A ∈ ℛ^{k×m}:
Original data x ∈ ℛ^m → v = A x ∈ ℛ^k (reduced data)
Algorithms
• Singular Value Decomposition
• Nonnegative Matrix Factorization
• Sparse Coding
• Graph regularization
Matrix Factorization
X ∈ ℛ^{m×n}, U ∈ ℛ^{m×k}, V ∈ ℛ^{k×n}: X ≈ UV
min_{U,V} ‖X − UV‖²_F, i.e., find the best approximation X̂ = UV with rank(X̂) = k.
Singular Value Decomposition (SVD)
For an arbitrary matrix X ∈ ℛ^{m×n} there exists a factorization as follows:
X = UΣVᵀ
where U ∈ ℛ^{m×m}, V ∈ ℛ^{n×n}, UᵀU = UUᵀ = I, VᵀV = VVᵀ = I, and Σ ∈ ℛ^{m×n} is a diagonal matrix.
If rank(X) = d, the compact form is
U ∈ ℛ^{m×d}, V ∈ ℛ^{n×d}, UᵀU = I, VᵀV = I,
Σ = diag(σ₁, σ₂, ⋯, σ_d), σ₁ ≥ σ₂ ≥ ⋯ ≥ σ_d > 0
SVD: Low-rank Approximation
SVD can be used to compute optimal low-rank approximations.
Approximation problem:
X_k* = arg min_{rank(X̂)=k} ‖X − X̂‖²_F
Solution via SVD: set the small singular values to zero,
Σ* = diag(σ₁, ⋯, σ_k, 0, ⋯, 0), so X_k* = U Σ* Vᵀ
C. Eckart, G. Young, The approximation of a matrix by another of lower rank. Psychometrika, 1, 211-218, 1936.
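The Eckart–Young construction above can be checked directly with NumPy — a minimal sketch on random data (shapes and the rank k are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((8, 6))

# Compact SVD: X = U @ diag(s) @ Vt, singular values sorted descending
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Best rank-k approximation: keep only the k largest singular values
k = 2
Xk = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Eckart-Young: the Frobenius error equals the norm of the dropped spectrum
err = np.linalg.norm(X - Xk, "fro")
```

The error `err` should match `sqrt(σ_{k+1}² + ⋯ + σ_d²)` exactly, which is what makes the truncated SVD the optimal rank-k approximation.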
Low rank approximation by SVD
Latent Semantic Analysis (Indexing)
The Latent Semantic Analysis via SVD can be summarized as follows:
X ≈ U_k Σ_k V_kᵀ
• rows of U_k: LSA term vectors
• columns of Σ_k V_kᵀ: LSA document vectors
Document similarity:
(XᵀX)_{i,j} ≈ ((V_k Σ_k)(V_k Σ_k)ᵀ)_{i,j} = ⟨d_i, d_j⟩ in the latent space
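A minimal LSA sketch (the tiny term–document matrix is invented toy data): project documents into the latent space and compare them there, where documents with overlapping vocabulary become close:

```python
import numpy as np

# Toy document-term matrix: rows = terms, columns = documents.
X = np.array([
    [2, 0, 1, 0],   # "matrix"
    [1, 0, 2, 0],   # "factorization"
    [0, 3, 0, 2],   # "movie"
    [0, 1, 0, 3],   # "rating"
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
# LSA document vectors: columns of Sigma_k @ Vt_k (one k-dim vector per doc)
D = np.diag(s[:k]) @ Vt[:k, :]

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Documents 0 and 2 share vocabulary; documents 0 and 1 do not.
sim_02 = cos(D[:, 0], D[:, 2])
sim_01 = cos(D[:, 0], D[:, 1])
```

In the latent space `sim_02` comes out near 1 and `sim_01` near 0, matching the intuition that the two "matrix factorization" documents are about the same topic.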
Latent Semantic Analysis
Latent semantic space: illustrating example
Matrix Factorization: SVD
X ∈ ℛ^{m×n}, U ∈ ℛ^{m×k}, V ∈ ℛ^{k×n}: X ≈ UV
Low Rank Assumption: k hidden factors
min_{U,V} ‖X − UV‖²_F
A 100-Millionth of a Typical Document–Term Matrix (recap)
[Figure: the same tiny sparse fragment of X.]
Query Likelihood (recap)
P(q | d, R = 1) ≡ P(q₁, ⋯, q_m | d); with the independence assumption, P(q | d) = Π_{w ∈ q} P(w | d). How do we estimate P(w | d)?
Naive Approach
Maximum Likelihood Estimation:
P(w | d) = n(d, w) / Σ_{w′} n(d, w′)
where n(d, w) is the number of occurrences of term w in document d.
Stacking the estimates gives a terms × documents matrix:
[ P(w₁|d₁) ⋯ P(w₁|d_n)
     ⋮      ⋱      ⋮
  P(w_m|d₁) ⋯ P(w_m|d_n) ]
Probabilistic Latent Semantic Analysis
Introduce latent concepts z between terms and documents (e.g., the concept TRADE generating "economic", "imports", "trade"):
P(w | d) = Σ_z P(w | z) P(z | d)
In matrix form, the terms × documents matrix factorizes:
[ P(w_i | d_j) ] = [ P(w_i | z_k) ] · [ P(z_k | d_j) ], i.e., X ≈ UV.
T. Hofmann, Probabilistic Latent Semantic Analysis, UAI 1999.
Nonnegative Matrix Factorization
X ∈ ℛ^{m×n}, U ∈ ℛ^{m×k}, V ∈ ℛ^{k×n}: X ≈ UV
• Low rank assumption (k hidden factors)
• Nonnegative assumption: U ≥ 0, V ≥ 0
Non-negative Matrix Factorization
X ≅ UV, s.t. U ≥ 0, V ≥ 0
Two cost functions:
• Euclidean distance: ‖X − UV‖² = Σ_{ij} (X_{ij} − (UV)_{ij})²
• Divergence: D(X ‖ UV) = Σ_{ij} (X_{ij} log(X_{ij}/(UV)_{ij}) − X_{ij} + (UV)_{ij})
Optimization Problems
1. Minimize ‖X − UV‖² with respect to U and V, subject to the constraints U, V ≥ 0.
2. Minimize D(X ‖ UV) with respect to U and V, subject to the constraints U, V ≥ 0.
NMF Optimization (Euclidean Distance)
min_{U,V} ‖X − UV‖², s.t. U ≥ 0, V ≥ 0
O = ‖X − UV‖² = tr((X − UV)(X − UV)ᵀ)
  = tr(XXᵀ) − 2 tr(X Vᵀ Uᵀ) + tr(U V Vᵀ Uᵀ)
Introduce Lagrange multipliers Γ (same size as U) and Φ (same size as V):
ℒ = O + tr(Γ Uᵀ) + tr(Φ Vᵀ)
∂ℒ/∂U = −2 X Vᵀ + 2 U V Vᵀ + Γ = 0
∂ℒ/∂V = −2 Uᵀ X + 2 Uᵀ U V + Φ = 0
Using the KKT conditions Γ_{ij} U_{ij} = 0 and Φ_{ij} V_{ij} = 0 gives the multiplicative updates
U_{ij} ← U_{ij} (X Vᵀ)_{ij} / (U V Vᵀ)_{ij}
V_{ij} ← V_{ij} (Uᵀ X)_{ij} / (Uᵀ U V)_{ij}
Multiplicative Update Rules
The Euclidean distance ‖X − UV‖² is nonincreasing under the update rules
U_{ij} ← U_{ij} (X Vᵀ)_{ij} / (U V Vᵀ)_{ij},  V_{ij} ← V_{ij} (Uᵀ X)_{ij} / (Uᵀ U V)_{ij}
The Euclidean distance is invariant under these updates if and only if U and V are at a stationary point of the distance.
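A minimal NumPy sketch of these multiplicative updates on random nonnegative data (a small epsilon in the denominators guards against division by zero; shapes, rank, and iteration count are arbitrary for the example):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.abs(rng.standard_normal((20, 30)))   # nonnegative data
k = 4

U = np.abs(rng.standard_normal((20, k)))
V = np.abs(rng.standard_normal((k, 30)))

eps = 1e-12
errors = []
for _ in range(200):
    # Lee-Seung multiplicative updates for min ||X - UV||_F^2, U, V >= 0
    U *= (X @ V.T) / (U @ V @ V.T + eps)
    V *= (U.T @ X) / (U.T @ U @ V + eps)
    errors.append(np.linalg.norm(X - U @ V, "fro"))
```

Because the updates multiply nonnegative factors by nonnegative ratios, `U` and `V` stay nonnegative throughout, and the recorded errors decrease as the theorem above guarantees.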
NMF (Divergence Formulation) vs. pLSA
D(X ‖ UV) = Σ_{ij} (X_{ij} log(X_{ij}/(UV)_{ij}) − X_{ij} + (UV)_{ij})
min_{U,V} D(X ‖ UV), s.t. U ≥ 0, V ≥ 0
pLSA: latent concepts z between terms and documents (e.g., TRADE → economic, imports, trade):
P(w | d) = Σ_z P(w | z) P(z | d)
Likelihood of pLSA
Likelihood of the collection D (documents d, words w drawn independently):
L(D) = Π_d Π_w P(d, w)^{n(d,w)}
Log-likelihood:
ℓ(D) = Σ_{d,w} n(d, w) log P(d, w),  P(d, w) = P(d) Σ_z P(w | z) P(z | d)
NMF (Divergence Formulation) vs. pLSA
D(X ‖ UV) = Σ_{ij} (X_{ij} log(X_{ij}/(UV)_{ij}) − X_{ij} + (UV)_{ij})
min_{U,V} D(X ‖ UV), s.t. U ≥ 0, V ≥ 0
Dropping the terms that do not depend on U and V, minimizing D(X ‖ UV) is equivalent to
max_{U,V} Σ_{ij} (X_{ij} log (UV)_{ij} − (UV)_{ij})
With X_{ij} = n(d_j, w_i), U_{ik} = P(w_i | z_k), and V_{kj} = P(z_k | d_j) (suitably normalized), this is exactly the pLSA objective
max Σ_{d,w} n(d, w) log Σ_z P(w | z) P(z | d)
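The divergence formulation has its own multiplicative updates (also due to Lee and Seung) — a minimal sketch on toy count data, with the divergence tracked to show it decreases (epsilon terms and shapes are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(1)
# Poisson counts mimic a term-count matrix; tiny offset keeps logs finite.
X = rng.poisson(3.0, size=(15, 25)).astype(float) + 1e-9
k = 3
U = np.abs(rng.standard_normal((15, k))) + 0.1
V = np.abs(rng.standard_normal((k, 25))) + 0.1

def kl_div(X, Y):
    """D(X || Y) = sum_ij (X_ij log(X_ij/Y_ij) - X_ij + Y_ij)."""
    return np.sum(X * np.log(X / Y) - X + Y)

eps = 1e-12
d = []
ones = np.ones_like(X)
for _ in range(100):
    # Lee-Seung updates for the divergence cost
    R = X / (U @ V + eps)
    U *= (R @ V.T) / (ones @ V.T + eps)
    R = X / (U @ V + eps)
    V *= (U.T @ R) / (U.T @ ones + eps)
    d.append(kl_div(X, U @ V + eps))
```

After convergence, normalizing the columns of `U` and `V` recovers the pLSA-style conditional distributions P(w|z) and P(z|d) up to scaling, which is the equivalence sketched above.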
Matrix Factorization
X ∈ ℛ^{m×n}, U ∈ ℛ^{m×k}, V ∈ ℛ^{k×n}: X ≈ UV
• Low rank assumption (k hidden factors) → SVD
• Nonnegative assumption → NMF
Sparse Coding
• Represent input vectors approximately as a weighted linear combination of a small number of "basis vectors."
minimize_{U,V} ‖X − UV‖²_F + λ f(V)
subject to Σ_i U_{ij}² ≤ c, ∀ j = 1, …, k
where f(·) is a sparsity-inducing penalty (e.g., the ℓ₁ norm).
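With the basis U fixed and f = ℓ₁, the inner problem is a LASSO that can be solved by iterative soft-thresholding (ISTA) — a minimal sketch on random data (dictionary, sizes, λ, and step size are arbitrary choices for the example, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
m, k, n = 10, 20, 5
B = rng.standard_normal((m, k))
B /= np.linalg.norm(B, axis=0)            # unit-norm basis columns (constraint)
X = rng.standard_normal((m, n))

lam = 0.1
# Gradient of ||X - BS||_F^2 is 2 B^T(BS - X); its Lipschitz constant is
# 2 * ||B^T B||_2, so this step size is safe for ISTA.
step = 0.5 / np.linalg.norm(B.T @ B, 2)
S = np.zeros((k, n))

def soft_threshold(Z, t):
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

for _ in range(500):
    grad = 2 * B.T @ (B @ S - X)
    S = soft_threshold(S - step * grad, step * lam)

sparsity = np.mean(S == 0)   # fraction of exactly-zero codes
```

The soft-thresholding step sets small coefficients to exactly zero, so each input column ends up represented by only a few of the 20 basis vectors.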
Matrix Factorization with Local Consistency
Nearby points (neighbors) share similar properties.
Traditional machine learning algorithms: the nearest-neighbor classifier.
Local Consistency Assumption
• A lot of unlabeled data
• Local consistency: p-nearest neighbors, ε-neighbors, …
Local Consistency Assumption
Put edges between neighbors (nearby data points).
Two nodes in the graph connected by an edge share similar properties.
Local Consistency Assumption
Similar properties: labels or representations f ∈ ℛ^n.
W ∈ ℛ^{n×n}: weight matrix of the graph, D_{ii} = Σ_j W_{ij}.
min (1/2) Σ_{i,j=1,⋯,n} (f_i − f_j)² W_{ij} = min fᵀ(D − W)f
With the constraint fᵀDf = 1 this becomes
min_{fᵀDf=1} fᵀ(D − W)f
M. Belkin and P. Niyogi. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering, NIPS 2001.
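The identity between the pairwise smoothness sum and the Laplacian quadratic form can be verified directly — a minimal sketch on a tiny hand-made graph (the weight matrix and the vector f are invented for the example):

```python
import numpy as np

# Toy graph on 4 points: symmetric weight matrix W, edges between neighbors.
W = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

D = np.diag(W.sum(axis=1))
L = D - W                        # graph Laplacian

f = np.array([1.0, 1.0, 0.5, 0.0])

# Smoothness identity: 1/2 * sum_ij (f_i - f_j)^2 W_ij = f^T L f
lhs = 0.5 * sum(
    (f[i] - f[j]) ** 2 * W[i, j]
    for i in range(4) for j in range(4)
)
rhs = f @ L @ f
```

Minimizing `f @ L @ f` therefore favors functions that change little across edges, which is exactly the local-consistency penalty used below.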
Local Consistency and Manifold Learning
Manifold learning: we only need local consistency.
min Σ_{i,j} ‖f_i − f_j‖² W_{ij}
Locally Consistent NMF
If x_i and x_j are neighbors, their representations should be close: v_i ≈ v_j.
Neighbor: prior knowledge, label information, p-nearest neighbors, …
D. Cai, X. He, J. Han, and T. Huang, Graph Regularized Non-negative Matrix Factorization for Data Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8), 1548-1560, 2011.
Locally Consistent NMF
min Σ_{i,j} ‖v_i − v_j‖² W_{ij} = min Tr(V L Vᵀ), L = D − W
together with the reconstruction term min ‖X − UV‖².
D. Cai, X. He, J. Han, and T. Huang, Graph Regularized Non-negative Matrix Factorization for Data Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8), 1548-1560, 2011.
Objective Function
NMF: min ‖X − UV‖²
GNMF (Graph regularized NMF): min ‖X − UV‖² + λ Tr(V L Vᵀ)
Multiplicative updates:
U_{ij} ← U_{ij} (X Vᵀ)_{ij} / (U V Vᵀ)_{ij}
V_{ij} ← V_{ij} (Uᵀ X + λ V W)_{ij} / (Uᵀ U V + λ V D)_{ij}
D. Cai, X. He, J. Han, and T. Huang, Graph Regularized Non-negative Matrix Factorization for Data Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8), 1548-1560, 2011.
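A minimal sketch of the GNMF updates on random data (the chain graph over samples, the sizes, and λ are arbitrary choices for the example; the update rules follow the form above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.abs(rng.standard_normal((12, 8)))    # 12 features, 8 samples
k, lam = 3, 1.0

# Toy neighborhood graph: a chain over the 8 samples.
W = np.zeros((8, 8))
for i in range(7):
    W[i, i + 1] = W[i + 1, i] = 1.0
D = np.diag(W.sum(axis=1))
L = D - W

U = np.abs(rng.standard_normal((12, k)))
V = np.abs(rng.standard_normal((k, 8)))     # column j = representation of sample j

def objective(U, V):
    return (np.linalg.norm(X - U @ V, "fro") ** 2
            + lam * np.trace(V @ L @ V.T))

eps = 1e-12
vals = []
for _ in range(300):
    # GNMF multiplicative updates (Euclidean + graph regularization)
    U *= (X @ V.T) / (U @ V @ V.T + eps)
    V *= (U.T @ X + lam * V @ W) / (U.T @ U @ V + lam * V @ D + eps)
    vals.append(objective(U, V))
```

Setting `lam = 0` recovers plain NMF, so the extra `V @ W` / `V @ D` terms are exactly the price of pulling neighboring samples' representations together.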
Matrix Factorization: Summary
X ∈ ℛ^{m×n}, U ∈ ℛ^{m×k}, V ∈ ℛ^{k×n}: X ≈ UV
• Low rank assumption (k hidden factors) → SVD
• Nonnegative assumption → NMF
• Sparseness assumption → Sparse Coding
• Manifold assumption → Graph-regularized NMF (SC)