Machine Learning II
Peter Gehler
TU Darmstadt
November 05, 2010
Good news
- No need to re-schedule 26th November!
- Up-to-date plan always on the website
Some more comments on Decision Theory
...inspired by the exercises yesterday
0/1 Error
- In the exercises we have seen that for Δ being the 0/1 loss
\[
\begin{aligned}
y^* &= \operatorname*{argmin}_{y \in \mathcal{Y}} \mathbb{E}_{p(\cdot\,|\,x)}\big[\Delta(y, \cdot)\big] && (1) \\
    &= \operatorname*{argmax}_{y \in \mathcal{Y}} p(y \,|\, x) && (2) \\
    &= \operatorname*{argmin}_{y \in \mathcal{Y}} E(y \,|\, x) && (3)
\end{aligned}
\]
  for
\[
p(y \,|\, x) = \frac{1}{Z} \exp\big(-E(y \,|\, x)\big) \qquad (4)
\]
- Energy minimization or MAP prediction
- Not only used for classification (many image processing tasks)
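The decision rule is easy to make concrete. Below is a minimal NumPy sketch (an illustration, not from the slides) showing that the 0/1-loss prediction, the MAP prediction, and the minimum-energy prediction coincide; the posterior values are made up for the example.

```python
import numpy as np

# Hypothetical posterior p(y|x) over a finite label set Y = {0, 1, 2, 3}
posterior = np.array([0.1, 0.5, 0.3, 0.1])

# Energy E(y|x) = -log p(y|x) (up to the constant log Z), Eq. (4)
energy = -np.log(posterior)

# MAP / minimum-energy prediction: argmax_y p(y|x) == argmin_y E(y|x), Eqs. (2)-(3)
y_map = int(np.argmax(posterior))
assert y_map == int(np.argmin(energy))
print("prediction under 0/1 loss:", y_map)   # -> 1
```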
Hamming Loss
- Count the number of mislabeled variables, e.g. pixels
\[
\Delta_H(y^*, y) = \frac{1}{|V|} \sum_{i \in V} \mathbb{1}\big(y_i \neq y^*_i\big) \qquad (5)
\]
- Yields
\[
\begin{aligned}
y^* &= \operatorname*{argmin}_{y \in \mathcal{Y}} \mathbb{E}_{p(\cdot\,|\,x)}\big[\Delta_H(y, \cdot)\big] && (6) \\
    &= \Big( \operatorname*{argmax}_{y_i \in \mathcal{Y}_i} p(y_i \,|\, x) \Big)_{i \in V} && (7)
\end{aligned}
\]
- Maximum posterior marginal (MPM) prediction
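A minimal sketch of MPM prediction (my own illustration): given per-pixel marginals p(y_i|x), the Hamming-loss optimal labeling takes the most probable label at every pixel independently. The marginals here are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-pixel marginals p(y_i|x): shape (height, width, num_labels)
marginals = rng.dirichlet(np.ones(3), size=(4, 5))

# MPM prediction under Hamming loss, Eq. (7): per-pixel argmax of the marginals
y_mpm = marginals.argmax(axis=-1)     # one label per pixel, shape (4, 5)
print(y_mpm)
```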
Squared Error
- Assume a vector space on Y_i, e.g. pixel intensities
\[
\Delta_Q(y^*, y) = \frac{1}{|V|} \sum_{i \in V} \big\| y^*_i - y_i \big\|^2 \qquad (9)
\]
- Yields
\[
\begin{aligned}
y^* &= \operatorname*{argmin}_{y \in \mathcal{Y}} \mathbb{E}_{p(\cdot\,|\,x)}\big[\Delta_Q(y, \cdot)\big] && (10) \\
    &= \Big( \sum_{y_i \in \mathcal{Y}_i} p(y_i \,|\, x)\, y_i \Big)_{i \in V} && (11)
\end{aligned}
\]
- Minimum mean squared error (MMSE) prediction
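Correspondingly, a small sketch of MMSE prediction (again an illustration with made-up marginals): under squared loss the optimal per-pixel prediction is the posterior mean.

```python
import numpy as np

rng = np.random.default_rng(0)
labels = np.array([0.0, 0.5, 1.0])                       # hypothetical intensity levels Y_i
marginals = rng.dirichlet(np.ones(len(labels)), size=(4, 5))

# MMSE prediction, Eq. (11): per-pixel posterior mean sum_{y_i} p(y_i|x) * y_i
y_mmse = marginals @ labels                              # shape (4, 5)
print(y_mmse)
```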
Clustering
Clustering: Finding Structure in Data
- What are the correct clusters?
Clustering: Finding Structure in Data
- Ground truth often not available
- Similarity measure
  - clustering relies on a measure of similarity
  - e.g. position in space (Euclidean vs. log-polar coordinates), weighting of different dimensions, ...
Basic Clustering Algorithms
- Flat clustering algorithms
  - K-Means
  - Mixture models
- Hierarchical clustering methods
  - Top-down (splitting)
  - Bottom-up (merging)
- Other clustering methods
  - Spectral clustering
Hierarchical Bottom-Up Clustering
Cluster linkage: define distances between clusters
- Single linkage
\[
d(C_k, C_l) = \min_{x_i \in C_k} \min_{y_j \in C_l} d(x_i, y_j) \qquad (12)
\]
- Complete linkage
\[
d(C_k, C_l) = \max_{x_i \in C_k} \max_{y_j \in C_l} d(x_i, y_j) \qquad (13)
\]
Hierarchical Bottom-Up Clustering
Cluster linkage: define distances between clusters
- Average linkage
\[
d(C_k, C_l) = \frac{1}{|C_k|\,|C_l|} \sum_{x_i \in C_k} \sum_{y_j \in C_l} d(x_i, y_j) \qquad (14)
\]
- Centroid linkage
\[
d(C_k, C_l) = d\Big( \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i ,\; \frac{1}{|C_l|} \sum_{y_j \in C_l} y_j \Big) \qquad (15)
\]
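As a concrete reference, here is a short NumPy sketch (not from the slides) of the four linkage distances in Eqs. (12)-(15), assuming clusters are given as arrays of points and the base distance d is Euclidean.

```python
import numpy as np

def pairwise(Ck, Cl):
    """All pairwise Euclidean distances between the points of two clusters."""
    return np.linalg.norm(Ck[:, None, :] - Cl[None, :, :], axis=-1)

def single_linkage(Ck, Cl):   return pairwise(Ck, Cl).min()    # Eq. (12)
def complete_linkage(Ck, Cl): return pairwise(Ck, Cl).max()    # Eq. (13)
def average_linkage(Ck, Cl):  return pairwise(Ck, Cl).mean()   # Eq. (14)
def centroid_linkage(Ck, Cl):                                  # Eq. (15)
    return np.linalg.norm(Ck.mean(axis=0) - Cl.mean(axis=0))

Ck = np.array([[0.0, 0.0], [1.0, 0.0]])
Cl = np.array([[4.0, 0.0], [6.0, 0.0]])
print(single_linkage(Ck, Cl), complete_linkage(Ck, Cl),
      average_linkage(Ck, Cl), centroid_linkage(Ck, Cl))       # 3.0 6.0 4.5 4.5
```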
Spectral Clustering
Excellent tutorial paper:
U. von Luxburg, A Tutorial on Spectral Clustering, Statistics and Computing, 2007
A. Azran, A Tutorial on Spectral Clustering
Main idea
- 2 spirals
- data exhibits complex cluster shape
- K-Means performs poorly (because it wants dense spherical clusters)
- find an embedding space
- given by the eigenvectors of the affinity matrix
- in the embedded space, the clusters are trivial to separate
Overview: Spectral Clustering
Steps:
1. Construct graph: describe it with an affinity matrix / graph Laplacian
2. Extract eigenvalues: choose the eigenvectors with the smallest eigenvalues
3. Cluster in the embedded space: spanned by the first eigenvectors
Local Similarity!
- Look at the similarity values
- typically, the similarity values reliably encode "local structure"
- they can reliably indicate which samples are "close" or "similar"
- often the global structure induced by the similarity function does not capture the true global structure of the data
Local Similarity!
- Example:
  - misleading global distances
- Idea:
  - only rely on the local information provided by the similarity
  - construct a graph based on this local information
  - machine learning should discover the global structure by itself
Construct Graph
- a similarity score between two objects is "high" when the objects are very similar
- Example: Gaussian kernel
\[
s(x_i, x_j) = \exp\Big( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \Big) \qquad (16)
\]
- Conversely: a distance score is small when the objects are close
- Example: Euclidean distance
\[
d(x_i, x_j) = \|x_i - x_j\| \qquad (17)
\]
- Distances and similarities are "inverse" to each other
- In the following we talk about similarities only (although it also works with distances)
Basic Graph Vocabulary
- a graph consists of vertices (nodes) and edges
- Edges:
  - directed or undirected, weighted or unweighted
- Adjacency matrix: structure of the graph
  - w_ij = 0: vertices i and j are not connected
  - w_ij > 0: weight of the connection
- the degree of a vertex is the sum of all adjacent edge weights
\[
d_i = \sum_j w_{ij} \qquad (18)
\]
- all vertices that can be reached by a path form a connected component
Undirected k-nearest neighbour graph
- Undirected graph: just delete the arrows
- kNN graph: connects i and j if w_ij > 0 or w_ji > 0
- mutual kNN graph: connects i and j if w_ij > 0 and w_ji > 0
Mutual and symmetric graph
- by construction, the mutual kNN graph is a subset of the symmetric kNN graph
ε-neighbourhood graph
- given data samples and their pairwise distances d_ij
- connect all samples i, j with distance d_ij < ε
- undirected, or transform distances to similarities as weights
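A small sketch (my own illustration) of this graph-construction step: a Gaussian-kernel affinity matrix as in Eq. (16), plus symmetric/mutual kNN and ε-neighbourhood graphs built from it; the parameters sigma, k and eps are placeholders.

```python
import numpy as np

def gaussian_affinity(X, sigma=1.0):
    """Eq. (16): s(x_i, x_j) = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def knn_graph(S, k=5, mutual=False):
    """kNN graph from an affinity matrix S; 'or' rule by default, 'and' rule if mutual."""
    n = S.shape[0]
    W = np.zeros_like(S)
    for i in range(n):
        neighbours = np.argsort(-S[i])[1:k + 1]   # k most similar points, excluding i itself
        W[i, neighbours] = S[i, neighbours]
    return np.minimum(W, W.T) if mutual else np.maximum(W, W.T)

def eps_graph(D, eps):
    """ε-neighbourhood graph from a distance matrix D (unweighted, no self-loops)."""
    W = (D < eps).astype(float)
    np.fill_diagonal(W, 0.0)
    return W
```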
Overview: Spectral Clustering
Steps:
1. Construct graph: describe it with an affinity matrix / graph Laplacian
2. Extract eigenvalues: choose the eigenvectors with the smallest eigenvalues
3. Cluster in the embedded space: spanned by the first eigenvectors
- Definition of the cut between 2 sets of vertices A and B
\[
\mathrm{cut}(A, B) = \sum_{i \in A} \sum_{j \in B} w_{ij} \qquad (19)
\]
- Intuitive idea: find sets A and B with minimal cut:
  - minimal bipartition cut: argmin_{A,B} cut(A, B)
  - similar within A and within B, dissimilar between A and B
- Problem (left example is a cut with a low value, right is the desired output)
Balanced Graph Cut
- W = [w_ij]: adjacency matrix of the graph
- number of vertices in the set A: |A|
- degree of a vertex
\[
d_i = \sum_j w_{ij} \qquad (20)
\]
- volume of a set A
\[
\mathrm{vol}(A) = \sum_{i \in A} d_i \qquad (21)
\]
- degree matrix
\[
D = \mathrm{diag}(d_1, \dots, d_n) \qquad (22)
\]
Balanced Graph Cut
- Normalized cut (balanced cut) (Shi & Malik 2000)
\[
\mathrm{Ncut}(A, B) = \mathrm{cut}(A, B) \Big( \frac{1}{\mathrm{vol}(A)} + \frac{1}{\mathrm{vol}(B)} \Big) \qquad (23)
\]
- Notes
  - Mincut can be solved efficiently
  - Ncut is NP-hard
  - spectral clustering solves a relaxation of Ncut (here i ∈ A or i ∈ B, in spectral clustering a "soft assignment")
  - the quality of the solution of the relaxation is unclear
Unnormalized Graph Laplacian
- The graph Laplacian is defined as (there is no agreement on the term)
\[
L = D - W \qquad (24)
\]
- Key property
\[
\begin{aligned}
f^\top L f &= f^\top D f - f^\top W f && (25) \\
  &= \sum_i d_i f_i^2 - \sum_{ij} f_i f_j w_{ij} && (26) \\
  &= \frac{1}{2} \Big( \sum_i d_i f_i^2 - 2 \sum_{ij} f_i f_j w_{ij} + \sum_j d_j f_j^2 \Big) && (27) \\
  &= \frac{1}{2} \sum_{ij} w_{ij} (f_i - f_j)^2 && (28)
\end{aligned}
\]
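A quick numerical check of the identity in Eq. (28) (an illustration, not from the slides), using a random symmetric weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((6, 6))
W = (W + W.T) / 2.0                 # symmetric weights
np.fill_diagonal(W, 0.0)            # no self-loops

D = np.diag(W.sum(axis=1))          # degree matrix, Eq. (18)
L = D - W                           # unnormalized graph Laplacian, Eq. (24)

f = rng.standard_normal(6)
lhs = f @ L @ f
rhs = 0.5 * (W * (f[:, None] - f[None, :]) ** 2).sum()   # Eq. (28)
print(np.isclose(lhs, rhs))         # True
```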
Unnormalized Graph Laplacian
- Spectral properties
  - L is symmetric (by assumption) and positive semi-definite
  - the smallest eigenvalue of L is 0, the corresponding eigenvector is 1
  - thus the eigenvalues are 0 = λ_1 ≤ λ_2 ≤ ... ≤ λ_n
- Relation between the spectrum of eigenvalues and clusters
  - the multiplicity of the eigenvalue 0 equals the number of connected components of the graph
Why multiplicity of the Eigenvalue?
- Definition: eigenvalue λ and eigenvector f if
\[
\lambda f = L f \qquad (29)
\]
- equivalent to
\[
\lambda f^\top f = f^\top L f \qquad (30)
\]
- and for the eigenvalue λ = 0
\[
\lambda f^\top f = 0 = f^\top L f \qquad (31)
\]
Why is this the case?
Case k = 1: only one connected component
- assume the graph is fully connected
- assume f is an eigenvector to λ = 0
- we know
\[
0 = f^\top L f = \frac{1}{2} \sum_{ij} w_{ij} (f_i - f_j)^2 \qquad (32)
\]
- since w_ij ≥ 0, all terms (f_i - f_j) have to be zero
- i, j connected ⇒ f_i = f_j
- f is constant for all i
- the one-vector f = (1, ..., 1)^⊤ is an eigenvector to the eigenvalue λ = 0
Why is this the case?
Now k > 1 connected components
- assume the graph has k connected components
- assume f is an eigenvector to λ = 0
- we can re-order the vertices such that W is block-diagonal
\[
L = \begin{pmatrix}
L_1 & 0 & \cdots & 0 \\
0 & L_2 & \cdots & 0 \\
\vdots & & \ddots & \vdots \\
0 & 0 & \cdots & L_k
\end{pmatrix}
\]
- note: each block is a proper graph Laplacian itself, and each has a single connected component (the k = 1 case)
- L has as many eigenvalues 0 as there are connected components k
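A tiny numerical illustration of this statement (made-up graph): an adjacency matrix with two connected components yields exactly two zero eigenvalues of L.

```python
import numpy as np

# Two connected components: a triangle {0, 1, 2} and an edge {3, 4}
W = np.zeros((5, 5))
W[0, 1] = W[1, 2] = W[0, 2] = 1.0
W[3, 4] = 1.0
W = W + W.T

L = np.diag(W.sum(axis=1)) - W
eigenvalues = np.linalg.eigvalsh(L)
print(np.sum(np.isclose(eigenvalues, 0.0)))   # -> 2, one zero eigenvalue per component
```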
Normalized graph Laplacians
- Row-sum (random walk) normalization
\[
L_{\mathrm{rw}} = D^{-1} L = I - D^{-1} W \qquad (33)
\]
- Symmetric normalization
\[
L_{\mathrm{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2} \qquad (34)
\]
- Spectral properties similar to the ones of L
Overview: Spectral Clustering
Steps:
1. Construct graph: describe it with an affinity matrix / graph Laplacian
2. Extract eigenvalues: choose the eigenvectors with the smallest eigenvalues
3. Cluster in the embedded space: spanned by the first eigenvectors
- Input: similarity matrix W, number of desired centroids k
- build the similarity graph
- compute the first k eigenvectors v_1, ..., v_k of the matrix
  - L: for unnormalized spectral clustering
  - L_rw: for normalized spectral clustering
- Build the matrix V ∈ R^{n×k} with the eigenvectors as columns
- Interpret the rows of V as new datapoints Z_i ∈ R^k
\[
\begin{array}{c|cccc}
       & v_1    & v_2    & \cdots & v_k    \\ \hline
Z_1    & v_{11} & v_{12} & \cdots & v_{1k} \\
\vdots & \vdots & \vdots &        & \vdots \\
Z_n    & v_{n1} & v_{n2} & \cdots & v_{nk}
\end{array}
\]
- Cluster the points Z_i with K-Means (or your favourite algorithm) in R^k
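Putting the three steps together, here is a compact sketch of the procedure above (assumptions: a precomputed symmetric affinity matrix, no isolated vertices, and scikit-learn's KMeans as the final clustering step; any K-Means implementation would do).

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, k, normalized=True):
    """W: symmetric affinity matrix (n x n), k: number of clusters."""
    d = W.sum(axis=1)
    L = np.diag(d) - W                              # unnormalized Laplacian, Eq. (24)
    if normalized:
        L = np.diag(1.0 / d) @ L                    # L_rw = D^{-1} L, Eq. (33)
    # eigenvectors belonging to the k smallest eigenvalues
    eigvals, eigvecs = np.linalg.eig(L)             # L_rw is not symmetric in general
    order = np.argsort(eigvals.real)[:k]
    Z = eigvecs[:, order].real                      # rows Z_i: embedded data points in R^k
    return KMeans(n_clusters=k, n_init=10).fit_predict(Z)
```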
Toy Examples
- Solutions of spectral clustering
Toy Examples
- K-Means (left)
- spectral clustering (right)
Toy Examples
- 2 versus 3 clusters
Topic Models for Information Retrieval
Example settings: company medical database, news site, web search
Ad Hoc Retrieval
- Goal: identify relevant documents for a given text query
- Input: a short, ambiguous and incomplete query (think of your usual Google search)
- Needs to be done extremely fast
- By far the most popular form of information access
Under the hood
What happens after we press the "search" button?
Bag-of-Words
- We need a way to represent textual information (what is x?)
- We usually speak of documents and terms
- Think of a document as a bag-of-words (BoW)

Document:
Texas Instruments said it has developed the first 32-bit computer chip designed specifically for artificial intelligence applications [...]

Representation (term counts n_ij):
... artifact: 0, artificial: 1, ..., intelligence: 1, interest: 0, ...
Bag-Of-Words
- Bag-of-words is a histogram representation of the data, x ∈ N^L
- Need a dictionary (of length L)
- Histograms are often used to transform set data into an aggregated fixed-length representation, e.g. local image descriptors
Some Pre-Processing
- Pre-processing before converting to the bag-of-words histogram representation
- Stopword removal: remove uninformative words (several lists available)
  - a, i, her, it, is, the, to, ...
- Porter stemming: groups of rules to transform words to a common stem
  - remove 'ed', 'ing', 'ly'; e.g. visiting, visited → visit
  - libraries, library → librari
TF-IDF
- Re-weighting the entries according to their importance
- n(d, w): number of occurrences of word w in document d
- Term frequency (TF) (of one document)
\[
\mathrm{tf}(d, w) = \frac{n(d, w)}{\sum_k n(d, k)} \qquad (35)
\]
- Inverse document frequency (IDF) (of a corpus)
\[
\mathrm{idf}(w) = \log \frac{n}{|\{ d : n(d, w) \neq 0 \}|} \qquad (36)
\]
- tf-idf
\[
\text{tf-idf}(d, w) = \mathrm{tf}(d, w) \cdot \mathrm{idf}(w) \qquad (37)
\]
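A short NumPy sketch of Eqs. (35)-(37) on a made-up document-term count matrix (an illustration, not from the slides); it assumes every vocabulary word occurs in at least one document, so the document frequency is never zero.

```python
import numpy as np

# Hypothetical counts n(d, w): 3 documents, 5 vocabulary words
counts = np.array([[2, 0, 1, 0, 0],
                   [0, 3, 0, 1, 0],
                   [1, 1, 0, 0, 4]], dtype=float)

tf = counts / counts.sum(axis=1, keepdims=True)       # Eq. (35): term frequency per document
df = (counts != 0).sum(axis=0)                        # |{d : n(d, w) != 0}|
idf = np.log(counts.shape[0] / df)                    # Eq. (36)
tfidf = tf * idf                                      # Eq. (37)
print(tfidf)
```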
Document-Term Matrix
- Document-term matrices are huge!
- Typical values
  - Corpus: 1,000,000 documents
  - Vocabulary: 100,000 words
  - Sparseness: < 0.1%
- Think of the internet: documents = websites, words = every possible word
Beyond keyword-based Search
- Why not just find the keywords that are typed in a query?
- Vocabulary mismatch
  - different people use different vocabulary to describe the same concept
  - matching queries and documents based on keywords is insufficient
- A webpage can be important although it does not contain the keywords
Challenges
- Compactness: few search terms, rarely any redundancy
  - Average number of search terms per query = 2.5 (Spink et al.: From E-Sex to E-Commerce: Web Search Changes, IEEE Computer, 2002)
- Variability: synonyms and semantically related terms, expressions, writing styles, etc.
- Ambiguity: terms with multiple senses (polysemes), e.g. Java, jaguar, bank, head
- Quality & authority: correlation between quality and relevance
Motivation: Latent Structure Analysis
- Given a matrix that encodes the data, e.g. co-occurrence counts
- Potential problems
  - too large
  - too complicated
  - missing entries
  - noisy entries
  - lack of structure
\[
A = \begin{pmatrix}
a_{11} & \cdots & a_{1j} & \cdots & a_{1m} \\
\vdots &        & \vdots &        & \vdots \\
a_{i1} & \cdots & a_{ij} & \cdots & a_{im} \\
\vdots &        & \vdots &        & \vdots \\
a_{n1} & \cdots & a_{nj} & \cdots & a_{nm}
\end{pmatrix}
\]
- Is there a simpler way to explain the entries?
- There may be a latent structure underlying the data
- Possible structure: semantic topics (websites about news, sports, mountainbiking news, Swiss mountainbiking news)
- How to reveal this structure?
Matrix Decomposition
- Common approach: approximately factorize the matrix
\[
A \approx \underbrace{\tilde{A}}_{\text{approximation}} = \underbrace{L}_{\text{left factor}} \; \underbrace{R}_{\text{right factor}} \qquad (38)
\]
- Factors are typically constrained to be thin
- reduction: n · m → n · q + m · q
Matrix Decomposition
- Example
  - A is a matrix of n documents
  - each document is represented as an m-dimensional vector
  - R is a matrix of "common structure" across these documents
  - each row in L is a low-dimensional representation (q-dimensional) of an m-dimensional document
- Factors are typically constrained to be thin
- reduction: n · m → n · q + m · q
Latent Semantic Analysis
- Perform a low-rank approximation of the document-term matrix (typical rank 100-300)
- General idea
  - Map documents (and terms) to a low-dimensional representation
  - Design the mapping such that the low-dimensional space reflects semantic associations (latent semantic space)
  - Compute document similarity based on the inner product in the latent space
- Goals
  - Similar terms map to similar locations in the low-dimensional space
  - noise reduction by dimension reduction
Singular Value Decomposition
- For an arbitrary matrix A there exists a factorization (singular value decomposition, SVD) as follows:
\[
A = U \Sigma V^\top \in \mathbb{R}^{n \times m}, \qquad (39)
\]
- where
\[
\begin{aligned}
& U \in \mathbb{R}^{n \times k}, \quad \Sigma \in \mathbb{R}^{k \times k}, \quad V \in \mathbb{R}^{m \times k} && (40) \\
& U^\top U = I, \quad V^\top V = I && (41) \\
& \Sigma = \mathrm{diag}(\sigma_1, \dots, \sigma_k), \quad \sigma_i \geq \sigma_{i+1} && (42) \\
& k = \mathrm{rank}(A) && (43)
\end{aligned}
\]
Low-Rank Approximation
- The SVD can be used to compute optimal low-rank approximations
- Approximation problem:
\[
X^* = \operatorname*{argmin}_{\hat{X} :\, \mathrm{rank}(\hat{X}) = q} \| X - \hat{X} \|_F \qquad (44)
\]
- Solution via SVD
\[
X^* = U\, \mathrm{diag}(\sigma_1, \dots, \sigma_q, \underbrace{0, \dots, 0}_{\text{set small } \sigma_i \text{ to zero}})\, V^\top \qquad (45)
\]
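A minimal sketch of Eqs. (44)-(45) with NumPy (illustration only, random toy data): truncate the SVD at rank q and compare the Frobenius error with the discarded singular values.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 6))                        # toy document-term matrix

U, s, Vt = np.linalg.svd(X, full_matrices=False)

q = 2                                         # target rank
X_q = U[:, :q] @ np.diag(s[:q]) @ Vt[:q, :]   # Eq. (45): keep only the q largest singular values

# Best rank-q approximation in Frobenius norm, Eq. (44); the error equals
# the norm of the discarded singular values
print(np.linalg.norm(X - X_q, "fro"), np.sqrt((s[q:] ** 2).sum()))
```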
Latent Semantic Analysis: Overview
- Given X: n documents, vocabulary size m
- Determine X = UΣV^⊤ via SVD
- Use u_i (q-dimensional) instead of x_i (m-dimensional) because q ≪ m
- Distance between two documents: ⟨u_i, u_j⟩
- Fold in queries
\[
\hat{q} = \Sigma_k^{-1} V_k^\top q \qquad (46)
\]
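Continuing the sketch from the low-rank approximation above (still an illustration with toy data, assuming the rows-are-documents convention used on these slides): the rows of U give the q-dimensional document representations u_i, and a new query is folded in with Eq. (46) before scoring documents by the inner product in the latent space.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((8, 6))                      # n = 8 documents, vocabulary size m = 6
q_dim = 2

U, s, Vt = np.linalg.svd(X, full_matrices=False)
doc_emb = U[:, :q_dim]                      # u_i: latent document representations

# Fold in a query (an m-dimensional term vector), Eq. (46): q_hat = Sigma_k^{-1} V_k^T q
query = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0])
q_hat = np.diag(1.0 / s[:q_dim]) @ Vt[:q_dim, :] @ query

scores = doc_emb @ q_hat                    # <u_i, q_hat> for every document
print(np.argsort(-scores))                  # documents ranked by latent-space similarity
```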
We can always do better
- So far: LSA (Latent Semantic Analysis)
  S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman (1990). "Indexing by Latent Semantic Analysis". Journal of the American Society for Information Science 41 (6): 391-407.
- Now a probabilistic version thereof: pLSA (probabilistic LSA)
  T. Hofmann, Probabilistic Latent Semantic Indexing, Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), 1999.
- The full model: LDA (Latent Dirichlet Allocation) (not Linear Discriminant Analysis)
  D. Blei, A. Y. Ng, M. Jordan (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research 3.