Matrix Factorization


Deng Cai (蔡登)

College of Computer Science

Zhejiang University

dengcai@gmail.com


What Is Matrix Factorization?





Given $X \in \mathbb{R}^{m \times n}$, find $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$ such that $X \approx U V^\top$.

1. Entry-wise: $x_{ij} \approx \sum_{l=1}^{k} u_{il} v_{jl}$

2. Matrix form: $\min_{U,V} \big\| X - U V^\top \big\|_F^2$
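As a minimal sketch of this objective in NumPy (toy sizes and random data, not from the slides), the reconstruction error of a candidate factorization is:

```python
import numpy as np

# Toy sizes: m rows, n columns, k hidden factors
m, n, k = 6, 5, 2
rng = np.random.default_rng(0)
X = rng.random((m, n))          # the matrix to factorize
U = rng.random((m, k))          # factor 1: m x k
V = rng.random((n, k))          # factor 2: n x k

# Frobenius-norm reconstruction error ||X - U V^T||_F^2
error = np.linalg.norm(X - U @ V.T, "fro") ** 2
print(error)
```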

Why Matrix Factorization?



Image Recovery

(image examples)

Recommendation


A user-movie rating matrix: Alice, Bob, Tracy, Steven, and John rate the movies The Matrix, Star Wars, Roman Holiday, Titanic, Shrek, and Madagascar on a 1-5 scale. Many entries are unobserved (marked "?"); the task is to predict these missing ratings.
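Not part of the slide, but a minimal sketch of how such missing ratings can be predicted: factor the observed entries with SGD (the rating values below are invented stand-ins for the slide's table):

```python
import numpy as np

rng = np.random.default_rng(0)
# Illustrative 5 users x 6 movies ratings; np.nan marks unknown entries
R = np.array([
    [5, 4, np.nan, 1, np.nan, 2],
    [np.nan, 5, 2, np.nan, 4, np.nan],
    [4, np.nan, np.nan, 5, np.nan, 2],
    [np.nan, 5, 2, np.nan, 4, 5],
    [2, np.nan, np.nan, 5, np.nan, 2],
])
mask = ~np.isnan(R)
k, lr, reg = 2, 0.02, 0.1
U = 0.1 * rng.random((R.shape[0], k))   # user factors
V = 0.1 * rng.random((R.shape[1], k))   # movie factors

users, movies = np.where(mask)
for epoch in range(200):
    for i, j in zip(users, movies):
        err = R[i, j] - U[i] @ V[j]      # error on one observed rating
        U[i] += lr * (err * V[j] - reg * U[i])
        V[j] += lr * (err * U[i] - reg * V[j])

print(np.round(U @ V.T, 1))   # predicted ratings, including the missing cells
```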

Search: Information Retrieval

Example query: "Machine Learning"

Language Model Paradigm in IR

Probabilistic relevance model

Random variables: document $d$, query $q$, relevance $R_d \in \{0, 1\}$

Bayes' rule:

$P(R_d = 1 \mid q) \propto P(q \mid R_d = 1)\, P(R_d = 1)$

- $P(R_d = 1 \mid q)$: probability that document d is relevant for query q
- $P(q \mid R_d = 1)$: probability of generating a query q to ask for relevant d
- $P(R_d = 1)$: prior probability of relevance for document d (e.g. quality, popularity)

J. Ponte and W.B. Croft, A Language Model Approach to Information Retrieval, ACM SIGIR, 1998.

Language Model Paradigm


First contribution: prior probability of relevance

- simplest case: uniform (drops out for ranking)
- popularity: document usage statistics (e.g. library circulation records, download or access statistics, hyperlink structure)

Second contribution: query likelihood

- query terms $q$ are treated as a sample drawn from an (unknown) relevant document

Language Model Paradigm

Query generation model: what might a query look like that asks for a specific document?

- Maron & Kuhns: an indexer manually assigns probabilities for a pre-specified set of tags/terms
- Ponte & Croft: a statistical estimation problem

Think of a relevant document. Formulate a query by picking some of the keywords as query terms.

Example document: "Environmentalists are blasting a Bush administration proposal to lift a ban on logging in remote areas of national forests, saying the move ignores popular support for protecting forests."

Query Likelihood

$P(q \mid R_d = 1) \approx P(q \mid d)$

Query terms: $q = (q_1, \ldots, q_m)$

Independence assumption:

$P(q \mid d) = \prod_{w \in q} P(w \mid d)$

How do we estimate $P(w \mid d)$?
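A minimal sketch of ranking by query likelihood under this assumption (the per-document term probabilities are invented for illustration):

```python
import math

# Invented per-document term distributions P(w | d)
P = {
    "d1": {"matrix": 0.10, "factorization": 0.08, "movie": 0.01},
    "d2": {"matrix": 0.02, "factorization": 0.01, "movie": 0.12},
}

def query_log_likelihood(query_terms, p_w_given_d):
    """log P(q|d) = sum_w log P(w|d); unseen terms get probability ~0."""
    return sum(math.log(p_w_given_d.get(w, 1e-12)) for w in query_terms)

query = ["matrix", "factorization"]
ranking = sorted(P, key=lambda d: query_log_likelihood(query, P[d]), reverse=True)
print(ranking)  # d1 ranks above d2 for this query
```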

Document-Term Matrix

Example document: "Texas Instruments said it has developed the first 32-bit computer chip designed specifically for artificial intelligence applications [...]", which contains the term "intelligence".

- D = document collection
- W = lexicon/vocabulary
- Each entry of $X$ is a term weighting, e.g. the count of term $w$ in document $d$

$X$: the document-term matrix.

A 100-Millionth of a Typical Document-Term Matrix

$\boldsymbol{X} = \begin{pmatrix} 0 & \cdots & 1 \\ 0 & \cdots & 2 \end{pmatrix}$

Typical:

- Number of documents: ~1,000,000
- Vocabulary: ~100,000
- Sparseness: < 0.1 %
- Fraction depicted: 1e-8

Three Problems

Image Recovery

Search

Recommendation


Incomplete Matrix

$\boldsymbol{X} = \begin{pmatrix} 0 & \cdots & 1 \\ 0 & \cdots & 2 \end{pmatrix}$ (most entries are not observed)

Query Likelihood (recap)

$P(q \mid d) = \prod_{w \in q} P(w \mid d)$; how do we estimate $P(w \mid d)$?

Matrix Factorization





Given $X \in \mathbb{R}^{m \times n}$, find $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$ such that $X \approx U V^\top$.

Low-rank assumption: $k$ hidden factors.

Writing $V^\top = [v_1, v_2, \ldots, v_n]$, each column satisfies $x_i \approx U v_i$.


Relation to Dimensionality Reduction


$X = [x_1, x_2, \ldots, x_n] \approx U V^\top$, with $V^\top = [v_1, v_2, \ldots, v_n]$, so $x_i \approx U v_i$.

If there is a matrix $P \in \mathbb{R}^{k \times m}$ which satisfies

$v_i = P x_i, \quad \text{i.e.} \quad V^\top = P X,$

then the factorization acts as a dimensionality reduction: $P$ maps each $m$-dimensional $x_i$ to a $k$-dimensional $v_i$.

Dimensionality Reduction

Linear transformation: $P \in \mathbb{R}^{k \times m}$

Original data: $x_i \in \mathbb{R}^{m}$

$v_i = P x_i$

Reduced data: $v_i \in \mathbb{R}^{k}$
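A sketch of one concrete choice of $P$, the top-$k$ left singular vectors (this anticipates the SVD algorithm below; sizes and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((10, 30))                 # 30 data points in R^10
k = 3

# One concrete choice of P: the top-k left singular vectors (as in PCA/LSA)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
P = U[:, :k].T                           # P in R^{k x m}

V_reduced = P @ X                        # reduced data: each column is v_i = P x_i
print(V_reduced.shape)                   # (3, 30)
```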

Algorithms

Singular Value Decomposition

Nonnegative Matrix Factorization

Sparse Coding

Graph-regularization

Matrix Factorization





Given $X \in \mathbb{R}^{m \times n}$, find $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$ such that $X \approx U V^\top$:

$\min_{U,V} \big\| X - U V^\top \big\|_F^2$

Singular Value Decomposition (SVD)

For an arbitrary matrix $X \in \mathbb{R}^{m \times n}$ there exists a factorization as follows:

$X = U \Sigma V^\top$

where $U \in \mathbb{R}^{m \times m}$, $V \in \mathbb{R}^{n \times n}$, $U^\top U = U U^\top = I$, $V^\top V = V V^\top = I$, and $\Sigma \in \mathbb{R}^{m \times n}$ is a diagonal matrix.

If $\mathrm{rank}(X) = d$, the thin SVD is $X = U \Sigma V^\top$ with $U \in \mathbb{R}^{m \times d}$, $V \in \mathbb{R}^{n \times d}$, $U^\top U = I$, $V^\top V = I$, and

$\Sigma = \mathrm{diag}(\sigma_1, \sigma_2, \ldots, \sigma_d), \quad \sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_d > 0$
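A quick NumPy check of these properties on a random matrix (thin SVD):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 4))

# Thin SVD: U is 6x4, s holds the singular values, Vt is 4x4
U, s, Vt = np.linalg.svd(X, full_matrices=False)

print(np.allclose(X, U @ np.diag(s) @ Vt))   # True: X = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(4)))       # True: orthonormal columns
print(s)                                     # sigma_1 >= ... >= sigma_d > 0
```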

SVD: Low-rank Approximation

SVD can be used to compute optimal low-rank approximations.

Approximation problem:

$\hat{X} = \arg\min_{\tilde{X}:\ \mathrm{rank}(\tilde{X}) = k} \big\| X - \tilde{X} \big\|_F^2$

Solution via SVD: set the small singular values to zero,

$\hat{X} = U\, \mathrm{diag}(\sigma_1, \ldots, \sigma_k, 0, \ldots, 0)\, V^\top$

C. Eckart, G. Young, The approximation of a matrix by another of lower rank. Psychometrika, 1, 211-218, 1936.
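A sketch of the Eckart-Young construction in NumPy (toy matrix, rank $k = 2$):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((8, 5))
k = 2

U, s, Vt = np.linalg.svd(X, full_matrices=False)
s_trunc = s.copy()
s_trunc[k:] = 0.0                      # zero out the small singular values
X_hat = U @ np.diag(s_trunc) @ Vt      # best rank-k approximation in Frobenius norm

print(np.linalg.matrix_rank(X_hat))            # k
err = np.linalg.norm(X - X_hat, "fro") ** 2
print(np.isclose(err, np.sum(s[k:] ** 2)))     # True: error = sum of squared dropped sigmas
```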

Low rank approximation by SVD

(illustration)

Latent Semantic Analysis (Indexing)

Latent Semantic Analysis via SVD can be summarized as follows:

$X \approx U_k \Sigma_k V_k^\top$ (rows: terms; columns: documents)

- rows of $U_k$: LSA term vectors
- columns of $\Sigma_k V_k^\top$: LSA document vectors
- document similarity: $\mathrm{sim}(d_i, d_j) = \langle \Sigma_k v_i, \Sigma_k v_j \rangle$

Latent Semantic Analysis

Latent semantic space
: illustrating example

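A minimal LSA sketch on an invented term-document count matrix, computing document similarities in the latent space:

```python
import numpy as np

# Toy term-document count matrix (5 terms x 4 documents), values invented
X = np.array([
    [2, 0, 1, 0],
    [1, 1, 0, 0],
    [0, 2, 0, 1],
    [0, 0, 3, 1],
    [1, 0, 1, 2],
], dtype=float)

k = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_vectors = np.diag(s[:k]) @ Vt[:k]      # LSA document vectors Sigma_k V_k^T, shape (k, 4)

# Inner-product similarity of every document pair in the latent space
sim = doc_vectors.T @ doc_vectors
print(np.round(sim, 2))
```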

Matrix Factorization: SVD

Given $X \in \mathbb{R}^{m \times n}$, find $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$ such that $X \approx U V^\top$.

Low-rank assumption: $k$ hidden factors.

$\min_{U,V} \big\| X - U V^\top \big\|_F^2$

A 100-Millionth of a Typical Document-Term Matrix (recap)

$\boldsymbol{X} = \begin{pmatrix} 0 & \cdots & 1 \\ 0 & \cdots & 2 \end{pmatrix}$

Query Likelihood (recap)

$P(q \mid d) = \prod_{w \in q} P(w \mid d)$; how do we estimate $P(w \mid d)$?

Naive Approach

Maximum Likelihood Estimation:

$P(w \mid d) = \frac{n(d, w)}{\sum_{w'} n(d, w')}$

where $n(d, w)$ is the number of occurrences of term $w$ in document $d$. The resulting terms-by-documents matrix is

$X = \begin{pmatrix} P(w_1 \mid d_1) & \cdots & P(w_1 \mid d_N) \\ \vdots & & \vdots \\ P(w_M \mid d_1) & \cdots & P(w_M \mid d_N) \end{pmatrix}$
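This MLE is just a column-wise normalization of the count matrix; a sketch with invented counts:

```python
import numpy as np

# Toy counts n(d, w): rows are terms, columns are documents (values invented)
counts = np.array([
    [2, 0, 1],
    [1, 3, 0],
    [0, 1, 4],
], dtype=float)

# P(w|d): normalize each document's column to sum to 1
P_w_given_d = counts / counts.sum(axis=0, keepdims=True)
print(P_w_given_d.sum(axis=0))   # [1. 1. 1.]
```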

Probabilistic Latent Semantic Analysis

Documents → latent concepts → terms (e.g. documents load on a concept TRADE, which generates terms such as "economic", "imports", "trade").

$P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d)$

In matrix form, the terms-by-documents probability matrix factorizes:

$\begin{pmatrix} P(w_1 \mid d_1) & \cdots & P(w_1 \mid d_N) \\ \vdots & & \vdots \\ P(w_M \mid d_1) & \cdots & P(w_M \mid d_N) \end{pmatrix} = \begin{pmatrix} P(w_1 \mid z_1) & \cdots & P(w_1 \mid z_K) \\ \vdots & & \vdots \\ P(w_M \mid z_1) & \cdots & P(w_M \mid z_K) \end{pmatrix} \begin{pmatrix} P(z_1 \mid d_1) & \cdots & P(z_1 \mid d_N) \\ \vdots & & \vdots \\ P(z_K \mid d_1) & \cdots & P(z_K \mid d_N) \end{pmatrix}$

T. Hofmann, Probabilistic Latent Semantic Analysis, UAI 1999.
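The slides do not show the fitting procedure; as a sketch, the standard EM updates for pLSA look like this in NumPy (toy counts invented):

```python
import numpy as np

rng = np.random.default_rng(0)
n_dw = rng.integers(0, 5, size=(8, 20)).astype(float)  # toy counts n(d, w)
D, W, K = n_dw.shape[0], n_dw.shape[1], 3

p_w_z = rng.random((W, K)); p_w_z /= p_w_z.sum(axis=0)  # P(w|z)
p_z_d = rng.random((K, D)); p_z_d /= p_z_d.sum(axis=0)  # P(z|d)

for _ in range(100):
    # E-step: P(z|d,w) proportional to P(w|z) P(z|d), shape (D, W, K)
    joint = p_w_z[None, :, :] * p_z_d.T[:, None, :]
    p_z_dw = joint / joint.sum(axis=2, keepdims=True)
    # M-step: reweight by the observed counts
    weighted = n_dw[:, :, None] * p_z_dw            # n(d,w) P(z|d,w)
    p_w_z = weighted.sum(axis=0)                    # sum over d -> (W, K)
    p_w_z /= p_w_z.sum(axis=0, keepdims=True)
    p_z_d = weighted.sum(axis=1).T                  # sum over w -> (K, D)
    p_z_d /= p_z_d.sum(axis=0, keepdims=True)

# Reconstructed P(w|d) = sum_z P(w|z) P(z|d): each column sums to 1
print((p_w_z @ p_z_d).sum(axis=0))
```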


Nonnegative Matrix Factorization

Given $X \in \mathbb{R}^{m \times n}$, find $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$ such that $X \approx U V^\top$.

- Low-rank assumption ($k$ hidden factors)
- Nonnegative assumption: $U \ge 0$, $V \ge 0$

Non-negative Matrix Factorization

$X \approx U V^\top, \quad U \ge 0, \ V \ge 0$

Two cost functions:

- Euclidean distance: $\big\| X - U V^\top \big\|^2 = \sum_{ij} \big( x_{ij} - (U V^\top)_{ij} \big)^2$

- Divergence: $D\big( X \,\big\|\, U V^\top \big) = \sum_{ij} \Big( x_{ij} \log \frac{x_{ij}}{(U V^\top)_{ij}} - x_{ij} + (U V^\top)_{ij} \Big)$
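Both cost functions take a few lines of NumPy (toy nonnegative data and factors):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 5)) + 0.1     # strictly positive toy data
U = rng.random((6, 3))
V = rng.random((5, 3))
Y = U @ V.T                      # current reconstruction

euclidean = np.sum((X - Y) ** 2)
divergence = np.sum(X * np.log(X / Y) - X + Y)
print(euclidean, divergence)
```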

Optimization Problems

- Minimize $\big\| X - U V^\top \big\|^2$ with respect to $U$ and $V$, subject to the constraints $U, V \ge 0$.

- Minimize $D\big( X \,\big\|\, U V^\top \big)$ with respect to $U$ and $V$, subject to the constraints $U, V \ge 0$.

NMF Optimization (Euclidean Distance)

$\min_{U,V} \big\| X - U V^\top \big\|^2, \quad \text{s.t. } U \ge 0, \ V \ge 0$

Expand the objective:

$O = \big\| X - U V^\top \big\|^2 = \mathrm{tr}\big( (X - U V^\top)(X - U V^\top)^\top \big) = \mathrm{tr}(X X^\top) - 2\, \mathrm{tr}(X V U^\top) + \mathrm{tr}(U V^\top V U^\top)$

Introduce Lagrange multipliers $\Gamma$ (same size as $U$) and $\Phi$ (same size as $V$) for the nonnegativity constraints:

$L = O + \mathrm{tr}(\Gamma U^\top) + \mathrm{tr}(\Phi V^\top)$

Setting the derivatives to zero:

$\frac{\partial L}{\partial U} = -2 X V + 2 U V^\top V + \Gamma = 0$

$\frac{\partial L}{\partial V} = -2 X^\top U + 2 V U^\top U + \Phi = 0$

Together with the KKT conditions $\Gamma_{ik} u_{ik} = 0$ and $\Phi_{jk} v_{jk} = 0$, these lead to the multiplicative update rules on the next slide.

Multiplicative Update Rules

The Euclidean distance $\big\| X - U V^\top \big\|^2$ is nonincreasing under the update rules

$u_{ik} \leftarrow u_{ik} \frac{(X V)_{ik}}{(U V^\top V)_{ik}}, \qquad v_{jk} \leftarrow v_{jk} \frac{(X^\top U)_{jk}}{(V U^\top U)_{jk}}$

The Euclidean distance is invariant under these updates if and only if $U$ and $V$ are at a stationary point of the distance.
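A direct NumPy sketch of these updates (a small epsilon guards the denominators; data and sizes are toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 15))          # nonnegative toy data
k, eps = 4, 1e-9
U = rng.random((20, k))
V = rng.random((15, k))

for _ in range(300):
    U *= (X @ V) / (U @ V.T @ V + eps)      # u <- u (XV) / (UV^T V)
    V *= (X.T @ U) / (V @ U.T @ U + eps)    # v <- v (X^T U) / (VU^T U)

print(np.linalg.norm(X - U @ V.T, "fro") ** 2)   # decreases over iterations
```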

NMF (Divergence Formulation) vs. pLSA

NMF with the divergence cost:

$D\big( X \,\big\|\, U V^\top \big) = \sum_{ij} \Big( x_{ij} \log \frac{x_{ij}}{(U V^\top)_{ij}} - x_{ij} + (U V^\top)_{ij} \Big)$

$\min_{U,V} D\big( X \,\big\|\, U V^\top \big), \quad U \ge 0, \ V \ge 0$

pLSA (documents → latent concepts → terms):

$P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d)$

Likelihood of pLSA

Likelihood ($N$ documents, words drawn independently):

$L_D = \prod_{d} \prod_{w} P(w \mid d)^{n(d, w)}$

Log-likelihood:

$\ell_D = \sum_{d, w} n(d, w) \log \sum_{z} P(w \mid z)\, P(z \mid d)$

using $P(w \mid d) = \sum_{z} P(w \mid z)\, P(z \mid d)$.

NMF (Divergence Formulation) vs. pLSA

$\min_{U,V} D\big( X \,\big\|\, U V^\top \big), \quad U \ge 0, \ V \ge 0, \qquad D\big( X \,\big\|\, Y \big) = \sum_{ij} \Big( x_{ij} \log \frac{x_{ij}}{y_{ij}} - x_{ij} + y_{ij} \Big)$

Compare with maximizing the pLSA log-likelihood,

$\max \sum_{d, w} n(d, w) \log \sum_{z} P(w \mid z)\, P(z \mid d)$

Set $x_{wd} = n(d, w)$, $u_{wz} = P(w \mid z)$, and $v_{dz} = P(z \mid d)$. In the divergence, $\sum_{ij} x_{ij} \log x_{ij}$ and $\sum_{ij} x_{ij}$ are constants, and under the probabilistic normalization each column of $U V^\top$ sums to one, so $\sum_{ij} (U V^\top)_{ij}$ is constant as well. Only $-\sum_{ij} x_{ij} \log (U V^\top)_{ij}$ depends on $U$ and $V$. Hence minimizing $D\big( X \,\big\|\, U V^\top \big)$ is equivalent to maximizing the pLSA log-likelihood: NMF with the divergence cost and pLSA solve essentially the same problem.

Matrix Factorization

Given $X \in \mathbb{R}^{m \times n}$, find $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$ such that $X \approx U V^\top$.

- Low-rank assumption ($k$ hidden factors) → SVD
- Nonnegative assumption → NMF

Sparse Coding

Represent input vectors approximately as a weighted linear combination of a small number of "basis vectors":

$\min_{U, V} \big\| X - U V^\top \big\|_F^2 + \lambda f(V)$

$\text{subject to } \sum_i u_{i,j}^2 \le c, \quad j = 1, \ldots, k$

where $f(\cdot)$ is a sparsity-inducing penalty on the coefficients (e.g. the $\ell_1$ norm).
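One practical route is scikit-learn's dictionary learning, which solves an $\ell_1$-penalized variant of this objective; a sketch with arbitrary parameters and toy data:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
X = rng.random((50, 20))                 # 50 samples, 20 features (toy data)

# Sparse coding: X ~ codes @ dictionary, with sparsity encouraged in the codes
dl = DictionaryLearning(n_components=8, alpha=1.0, max_iter=200, random_state=0)
codes = dl.fit_transform(X)              # sparse coefficients, shape (50, 8)

print(np.mean(codes == 0))               # fraction of exactly-zero coefficients
print(np.linalg.norm(X - codes @ dl.components_, "fro"))
```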

Matrix Factorization with Local Consistency

Nearby points (neighbors) share similar properties.

Traditional machine learning algorithms built on this idea: the k-nearest neighbor classifier.

Local Consistency Assumption

- A lot of unlabeled data
- Local consistency: p-nearest neighbors, ε-neighbors

(illustration with points A, B, C)

Local Consistency Assumption

Put edges between neighbors (nearby data points). Two nodes in the graph connected by an edge share similar properties.

Local Consistency Assumption

Similar properties:

- Labels
- Representations $f$

$W \in \mathbb{R}^{n \times n}$: weight matrix of the graph

$\min_{f} \frac{1}{2} \sum_{i,j} \big( f_i - f_j \big)^2 W_{ij} = \min_{f} f^\top L f$

where $L = D - W$ is the graph Laplacian and $D$ is the diagonal degree matrix with $D_{ii} = \sum_{j} W_{ij}$.

M. Belkin and P. Niyogi, Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering, NIPS 2001.
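A quick NumPy check that the pairwise form and the Laplacian form of the smoothness term agree (toy 4-node chain graph):

```python
import numpy as np

# Toy symmetric weight matrix of a 4-node chain graph
W = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)
D = np.diag(W.sum(axis=1))     # degree matrix
L = D - W                      # graph Laplacian

f = np.array([0.0, 0.5, 0.6, 2.0])
pairwise = 0.5 * sum(
    (f[i] - f[j]) ** 2 * W[i, j] for i in range(4) for j in range(4)
)
print(np.isclose(pairwise, f @ L @ f))   # True
```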

Local Consistency and Manifold Learning

Manifold learning: we only need local consistency.

$\min \sum_{i,j} \big( f_i - f_j \big)^2 W_{ij}$

Locally Consistent NMF

If $x_i$ and $x_j$ are neighbors, their representations $v_i$ and $v_j$ should be close.

Neighbor: prior knowledge, label information, p-nearest neighbors …

D. Cai, X. He, J. Han, and T. Huang, Graph Regularized Non-negative Matrix Factorization for Data Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8), 1548-1560, 2011.

Locally Consistent NMF

$\min \sum_{i,j} \big( f_i - f_j \big)^2 W_{ij} \quad \Longrightarrow \quad \min \sum_{i,j} \big\| v_i - v_j \big\|^2 W_{ij} = \min \mathrm{Tr}\big( V^\top L V \big)$

D. Cai, X. He, J. Han, and T. Huang, Graph Regularized Non-negative Matrix Factorization for Data Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8), 1548-1560, 2011.

Objective Function

NMF: $\min \big\| X - U V^\top \big\|^2$

GNMF (graph regularized NMF): $\min \big\| X - U V^\top \big\|^2 + \lambda\, \mathrm{Tr}\big( V^\top L V \big)$

Multiplicative update rules:

$u_{ik} \leftarrow u_{ik} \frac{(X V)_{ik}}{(U V^\top V)_{ik}}, \qquad v_{jk} \leftarrow v_{jk} \frac{(X^\top U + \lambda W V)_{jk}}{(V U^\top U + \lambda D V)_{jk}}$

D. Cai, X. He, J. Han, and T. Huang, Graph Regularized Non-negative Matrix Factorization for Data Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(8), 1548-1560, 2011.
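Extending the earlier NMF sketch with the graph term gives a minimal GNMF following the update rules above (the p-nearest-neighbor graph is replaced by a simple chain graph over the samples for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((20, 15))               # nonnegative toy data: 20 features x 15 samples
n, k, lam, eps = 15, 4, 10.0, 1e-9

# Simple chain graph over the 15 samples (stand-in for a p-NN graph)
W = np.zeros((n, n))
for i in range(n - 1):
    W[i, i + 1] = W[i + 1, i] = 1.0
D = np.diag(W.sum(axis=1))

U = rng.random((20, k))
V = rng.random((n, k))                 # one row v_j per sample

for _ in range(300):
    U *= (X @ V) / (U @ V.T @ V + eps)
    V *= (X.T @ U + lam * W @ V) / (V @ U.T @ U + lam * D @ V + eps)

L = D - W
print(np.linalg.norm(X - U @ V.T, "fro") ** 2 + lam * np.trace(V.T @ L @ V))
```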

Matrix Factorization: Summary

Given $X \in \mathbb{R}^{m \times n}$, find $U \in \mathbb{R}^{m \times k}$ and $V \in \mathbb{R}^{n \times k}$ such that $X \approx U V^\top$.

- Low-rank assumption ($k$ hidden factors) → SVD
- Nonnegative assumption → NMF
- Sparseness assumption → Sparse Coding
- Manifold assumption → Graph-regularized NMF (SC)