Machine Learning II

Peter Gehler
TU Darmstadt
November 05, 2010

Outline
- Last Lecture
- Spectral Clustering
- Topic Models
Good news
- No need to re-schedule November 26th!
- The up-to-date plan is always on the website.
Some more comments on Decision Theory
...inspired by the exercises yesterday
0/1 Error
- In the exercises we have seen that, for $\Delta$ being the 0/1 loss,
  $$y^* = \operatorname{argmin}_{y \in \mathcal{Y}} \mathbb{E}_{p(\cdot|x)}[\Delta(y, \cdot)] \qquad (1)$$
  $$= \operatorname{argmax}_{y \in \mathcal{Y}} p(y|x) \qquad (2)$$
  $$= \operatorname{argmin}_{y \in \mathcal{Y}} E(y|x) \qquad (3)$$
  for
  $$p(y|x) = \frac{1}{Z} \exp(-E(y|x)) \qquad (4)$$
- Energy minimization or MAP prediction
- Not only used for classification (many image processing tasks)
Hamming Loss
- Count the number of mislabeled variables, e.g. pixels
  $$\Delta_H(y^*, y) = \frac{1}{|V|} \sum_{i \in V} \mathbb{1}(y_i \neq y^*_i) \qquad (5)$$
- Yields
  $$y^* = \operatorname{argmin}_{y \in \mathcal{Y}} \mathbb{E}_{p(\cdot|x)}[\Delta_H(y, \cdot)] \qquad (6)$$
  $$= \left( \operatorname{argmax}_{y_i \in \mathcal{Y}_i} p(y_i|x) \right)_{i \in V} \qquad (7)$$
- maximum posterior marginal (MPM) prediction
Squared Error
- Assume a vector space on $\mathcal{Y}_i$, e.g. pixel intensities
  $$\Delta_Q(y^*, y) = \frac{1}{|V|} \sum_{i \in V} \|y^*_i - y_i\|^2 \qquad (9)$$
- Yields
  $$y^* = \operatorname{argmin}_{y \in \mathcal{Y}} \mathbb{E}_{p(\cdot|x)}[\Delta_Q(y, \cdot)] \qquad (10)$$
  $$= \left( \sum_{y_i \in \mathcal{Y}_i} p(y_i|x)\, y_i \right)_{i \in V} \qquad (11)$$
- minimum mean squared error (MMSE) prediction
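To make the different prediction rules concrete, here is a minimal numpy sketch with made-up per-pixel marginals: MPM takes the argmax of each marginal, MMSE its mean. (The joint MAP prediction of eq. (2) would need the full posterior p(y|x) rather than marginals, which is why it is not shown here.)

```python
import numpy as np

# Toy posterior marginals: 4 "pixels", each with a distribution over 3 labels.
# Rows are pixels i in V, columns are labels; each row sums to 1.
p = np.array([
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.1, 0.8],
    [0.4, 0.4, 0.2],
])
labels = np.array([0.0, 1.0, 2.0])

# MPM prediction (Hamming loss): maximize each marginal independently, eq. (7).
y_mpm = np.argmax(p, axis=1)

# MMSE prediction (squared error): per-pixel posterior mean, eq. (11).
y_mmse = p @ labels

print("MPM :", y_mpm)    # [0 1 2 0]
print("MMSE:", y_mmse)   # [0.5 1.1 1.7 0.8]
```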
Clustering
Clustering: Finding Structure in Data
- What are the correct clusters?
Clustering: Finding Structure in Data
- Ground truth is often not available
- Similarity measure
  - clustering relies on a measure of similarity
  - e.g. position in space (Euclidean vs. log-polar coordinates), weighting of different dimensions, ...
Basic Clustering Algorithms
- Flat clustering algorithms
  - K-Means
  - Mixture models
- Hierarchical clustering methods
  - Top-down (splitting)
  - Bottom-up (merging)
- Other clustering methods
  - Spectral clustering
Hierarchical Bottom-Up Clustering
Cluster linkage: define distances between clusters
- Single linkage
  $$d(C_k, C_l) = \min_{x_i \in C_k} \min_{y_j \in C_l} d(x_i, y_j) \qquad (12)$$
- Complete linkage
  $$d(C_k, C_l) = \max_{x_i \in C_k} \max_{y_j \in C_l} d(x_i, y_j) \qquad (13)$$
Hierarchical Bottom-Up Clustering
Cluster linkage: define distances between clusters
- Average linkage
  $$d(C_k, C_l) = \frac{1}{|C_k|\,|C_l|} \sum_{x_i \in C_k} \sum_{y_j \in C_l} d(x_i, y_j) \qquad (14)$$
- Centroid linkage
  $$d(C_k, C_l) = d\left( \frac{1}{|C_k|} \sum_{x_i \in C_k} x_i,\; \frac{1}{|C_l|} \sum_{y_j \in C_l} y_j \right) \qquad (15)$$
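The four linkage functions (12)-(15) can be sketched in a few lines of numpy; this assumes the Euclidean distance as d and is only an illustration, not an implementation used in the lecture:

```python
import numpy as np

def pairwise_dists(A, B):
    """Euclidean distances between all points of cluster A and cluster B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def single_linkage(A, B):
    return pairwise_dists(A, B).min()      # eq. (12)

def complete_linkage(A, B):
    return pairwise_dists(A, B).max()      # eq. (13)

def average_linkage(A, B):
    return pairwise_dists(A, B).mean()     # eq. (14)

def centroid_linkage(A, B):
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))  # eq. (15)

# Two small example clusters in the plane
C_k = np.array([[0.0, 0.0], [1.0, 0.0]])
C_l = np.array([[4.0, 0.0], [5.0, 1.0]])
for f in (single_linkage, complete_linkage, average_linkage, centroid_linkage):
    print(f.__name__, round(f(C_k, C_l), 3))
```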
Spectral Clustering
Excellent tutorial paper:
U. von Luxburg, A Tutorial on Spectral Clustering, Statistics and Computing, 2007
A. Azran, A Tutorial on Spectral Clustering
Main idea
- 2 spirals
- the data exhibits a complex cluster shape
- K-Means performs poorly (because it wants dense spherical clusters)
- find an embedding space
  - given by the eigenvectors of the affinity matrix
- in the embedded space, the clusters are trivial to separate
Overview: Spectral Clustering
Steps:
1. Construct a graph: describe it with the affinity matrix / graph Laplacian
2. Extract eigenvalues: choose the eigenvectors with the smallest eigenvalues
3. Cluster in the embedded space: spanned by the first eigenvectors
Local Similarity!
- Look at the similarity values
- typically, the similarity values reliably encode "local structure"
- they can reliably indicate which samples are "close" or "similar"
- often the global structure induced by the similarity function does not capture the true global structure of the data
Local Similarity!
- Example:
  - misleading global distances
- Idea:
  - only rely on the local information provided by the similarity
  - construct a graph based on this local information
  - machine learning should discover the global structure by itself
Construct Graph
- a similarity score between two objects is "high" when the objects are very similar
- Example: Gaussian kernel
  $$s(x_i, x_j) = \exp\left( -\frac{\|x_i - x_j\|^2}{2\sigma^2} \right) \qquad (16)$$
- Conversely: a distance score is small when the objects are close
- Example: Euclidean distance
  $$d(x_i, x_j) = \|x_i - x_j\| \qquad (17)$$
- Distances and similarities are "inverse" to each other
- In the following we talk about similarities only (although it also works with distances)
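A minimal numpy sketch of the Gaussian kernel affinity matrix of eq. (16); the function name and the bandwidth value are illustrative choices, not part of the lecture:

```python
import numpy as np

def gaussian_affinity(X, sigma=1.0):
    """Dense affinity matrix W with w_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

X = np.random.randn(5, 2)           # 5 toy points in 2D
W = gaussian_affinity(X, sigma=0.5)
print(W.shape, W.diagonal())        # (5, 5), diagonal entries are all 1
```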
Basic Graph Vocabulary
- a graph consists of vertices (nodes) and edges
- Edges:
  - directed or undirected, weighted or unweighted
- Adjacency matrix: structure of the graph
  - w_ij = 0: vertices i and j are not connected
  - w_ij > 0: weight of the connection
- the degree of a vertex is the sum of all adjacent edge weights
  $$d_i = \sum_j w_{ij} \qquad (18)$$
- all vertices that can be reached by a path form a connected component
Undirected k-nearest neighbour graph
- Undirected graph: just delete the arrows
- kNN graph: connects i and j if w_ij > 0 or w_ji > 0
- mutual kNN graph: connects i and j if w_ij > 0 and w_ji > 0
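As an illustration, a hypothetical numpy sketch that builds the directed kNN adjacency from a similarity matrix (e.g. the gaussian_affinity sketch above) and then derives the symmetric ("or") and mutual ("and") undirected versions:

```python
import numpy as np

def knn_graphs(W, k):
    """Directed kNN adjacency from a similarity matrix W, plus the
    symmetric ('or') and mutual ('and') undirected versions."""
    n = W.shape[0]
    directed = np.zeros_like(W)
    for i in range(n):
        # indices of the k most similar other points (excluding the point itself)
        nbrs = np.argsort(W[i])[::-1]
        nbrs = nbrs[nbrs != i][:k]
        directed[i, nbrs] = W[i, nbrs]
    symmetric = np.maximum(directed, directed.T)  # edge kept if either direction present
    mutual = np.minimum(directed, directed.T)     # edge kept only if both directions present
    return directed, symmetric, mutual
```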
Mutual and symmetric graph
- by construction, the mutual kNN graph is a subgraph of the symmetric kNN graph
ε-neighbourhood graph
- given data samples and their pairwise distances d_ij
- connect all samples i, j with distance d_ij < ε
- undirected, or transform the distances to similarities as weights
Overview: Spectral Clustering
Steps:
1. Construct a graph: describe it with the affinity matrix / graph Laplacian
2. Extract eigenvalues: choose the eigenvectors with the smallest eigenvalues
3. Cluster in the embedded space: spanned by the first eigenvectors
- Definition of the cut between two sets of vertices A and B
  $$\operatorname{cut}(A, B) = \sum_{i \in A} \sum_{j \in B} w_{ij} \qquad (19)$$
- Intuitive idea: find sets A and B with minimal cut:
  - minimal bipartition cut: $\operatorname{argmin}_{A,B} \operatorname{cut}(A, B)$
  - similar within A and within B, dissimilar between A and B
- Problem (the left example is a cut with a low value, the right is the desired output)
Balanced Graph Cut
- W = [w_ij]: adjacency matrix of the graph
- number of vertices in the set A: |A|
- degree of a vertex
  $$d_i = \sum_j w_{ij} \qquad (20)$$
- volume of a set A
  $$\operatorname{vol}(A) = \sum_{i \in A} d_i \qquad (21)$$
- degree matrix
  $$D = \operatorname{diag}(d_1, \dots, d_n) \qquad (22)$$
Balanced Graph Cut
- Normalized cut (balanced cut) (Shi & Malik, 2000)
  $$\operatorname{Ncut}(A, B) = \operatorname{cut}(A, B) \left( \frac{1}{\operatorname{vol}(A)} + \frac{1}{\operatorname{vol}(B)} \right) \qquad (23)$$
- Notes
  - Mincut can be solved efficiently
  - Ncut is NP-hard
  - spectral clustering solves a relaxation of Ncut (here i ∈ A or i ∈ B; in spectral clustering a "soft assignment")
  - the quality of the solution of the relaxation is unclear
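A small numpy sketch of eqs. (20), (21) and (23), with a made-up weight matrix that has two obvious clusters; it only illustrates that a good partition gets a smaller Ncut value than a bad one:

```python
import numpy as np

def ncut_value(W, A):
    """Normalized cut (23) of the vertex subset A (boolean mask) vs. its complement."""
    A = np.asarray(A, dtype=bool)
    B = ~A
    cut = W[np.ix_(A, B)].sum()
    d = W.sum(axis=1)                      # vertex degrees, eq. (20)
    vol_A, vol_B = d[A].sum(), d[B].sum()  # set volumes, eq. (21)
    return cut * (1.0 / vol_A + 1.0 / vol_B)

# Strong weights inside the blocks {0,1,2} and {3,4}, weak weights across.
W = np.array([
    [0,   1,   1,   0.1, 0  ],
    [1,   0,   1,   0,   0  ],
    [1,   1,   0,   0,   0.1],
    [0.1, 0,   0,   0,   1  ],
    [0,   0,   0.1, 1,   0  ],
], dtype=float)
print(ncut_value(W, [True, True, True, False, False]))  # small value: good partition
print(ncut_value(W, [True, False, False, True, True]))  # larger value: bad partition
```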
Unnormalized Graph Laplacian
- Graph Laplacian, defined as (there is no agreement on the term)
  $$L = D - W \qquad (24)$$
- Key property
  $$f^\top L f = f^\top D f - f^\top W f \qquad (25)$$
  $$= \sum_i d_i f_i^2 - \sum_{ij} f_i f_j w_{ij} \qquad (26)$$
  $$= \frac{1}{2} \left( \sum_i d_i f_i^2 - 2 \sum_{ij} f_i f_j w_{ij} + \sum_j d_j f_j^2 \right) \qquad (27)$$
  $$= \frac{1}{2} \sum_{ij} w_{ij} (f_i - f_j)^2 \qquad (28)$$
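The key property (25)-(28) is easy to check numerically; a minimal sketch with a random symmetric weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.random((6, 6))
W = (W + W.T) / 2.0              # symmetric weights
np.fill_diagonal(W, 0.0)         # no self-loops

D = np.diag(W.sum(axis=1))       # degree matrix, eq. (22)
L = D - W                        # unnormalized graph Laplacian, eq. (24)

f = rng.standard_normal(6)
lhs = f @ L @ f
rhs = 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)  # right-hand side of (28)
print(np.isclose(lhs, rhs))      # True
```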
Unnormalized Graph Laplacian
- Spectral properties
  - L is symmetric (by assumption) and positive semi-definite
  - the smallest eigenvalue of L is 0; the corresponding eigenvector is the constant one-vector 1
  - thus the eigenvalues are 0 = λ_1 ≤ λ_2 ≤ ... ≤ λ_n
- Relation between the spectrum of eigenvalues and the clusters
  - the multiplicity of the eigenvalue 0 equals the number of connected components of the graph
Why multiplicity of the eigenvalue?
- Definition: eigenvalue λ and eigenvector f if
  $$\lambda f = L f \qquad (29)$$
- equivalent to
  $$\lambda f^\top f = f^\top L f \qquad (30)$$
- and for eigenvalue λ = 0
  $$\lambda f^\top f = 0 = f^\top L f \qquad (31)$$
Why is this the case?
Case k = 1: only one connected component
- assume the graph is fully connected
- assume f is an eigenvector to λ = 0
- we know
  $$0 = f^\top L f = \frac{1}{2} \sum_{ij} w_{ij} (f_i - f_j)^2 \qquad (32)$$
- since w_ij ≥ 0, all terms f_i - f_j have to be zero
- i, j connected ⟹ f_i = f_j
- f is constant for all i
- the one-vector f = (1, ..., 1)^⊤ is the eigenvector to eigenvalue λ = 0
Why is this the case?
Now k > 1 connected components
- assume the graph has k connected components
- assume f is an eigenvector to λ = 0
- we can re-order the vertices such that W is block-diagonal:
  $$L = \begin{pmatrix} L_1 & 0 & \cdots & 0 \\ 0 & L_2 & \cdots & 0 \\ 0 & 0 & \ddots & \vdots \\ 0 & 0 & \cdots & L_k \end{pmatrix}$$
- note: each block is a proper graph Laplacian itself, each with a single connected component (the k = 1 case)
- L has as many eigenvalues 0 as there are connected components k
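A quick numerical illustration of this statement, using a toy graph with two connected components:

```python
import numpy as np

# Graph with two connected components: {0, 1, 2} and {3, 4}.
W = np.zeros((5, 5))
W[0, 1] = W[1, 0] = 1.0
W[1, 2] = W[2, 1] = 1.0
W[0, 2] = W[2, 0] = 1.0
W[3, 4] = W[4, 3] = 1.0

L = np.diag(W.sum(axis=1)) - W
eigvals = np.linalg.eigvalsh(L)            # sorted ascending; L is symmetric
print(np.round(eigvals, 6))                # two (numerically) zero eigenvalues
print(np.sum(np.isclose(eigvals, 0.0)))    # 2 = number of connected components
```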
Normalized graph Laplacians
- Row-sum (random walk) normalization
  $$L_{rw} = D^{-1} L = I - D^{-1} W \qquad (33)$$
- Symmetric normalization
  $$L_{sym} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2} \qquad (34)$$
- Spectral properties are similar to the ones of L
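For completeness, a small sketch of the two normalizations (33) and (34); it assumes every vertex has a nonzero degree:

```python
import numpy as np

def normalized_laplacians(W):
    """Random-walk (33) and symmetric (34) normalized Laplacians of a weight matrix W."""
    d = W.sum(axis=1)                     # assumes d_i > 0 for all vertices
    D_inv = np.diag(1.0 / d)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.diag(d) - W
    L_rw = D_inv @ L                      # I - D^{-1} W
    L_sym = D_inv_sqrt @ L @ D_inv_sqrt   # I - D^{-1/2} W D^{-1/2}
    return L_rw, L_sym
```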
Overview: Spectral Clustering
Steps:
1. Construct a graph: describe it with the affinity matrix / graph Laplacian
2. Extract eigenvalues: choose the eigenvectors with the smallest eigenvalues
3. Cluster in the embedded space: spanned by the first eigenvectors
- Input: similarity matrix W, number of desired centroids k
- build the similarity graph
- compute the first k eigenvectors v_1, ..., v_k of the matrix
  - L: for un-normalized spectral clustering
  - L_rw: for normalized spectral clustering
- build the matrix V ∈ R^{n×k} with the eigenvectors as columns
- interpret the rows of V as new datapoints Z_i ∈ R^k

          v_1    v_2    ...   v_k
    Z_1   v_11   v_12   ...   v_1k
    ...   ...    ...    ...   ...
    Z_n   v_n1   v_n2   ...   v_nk

- cluster the points Z_i with K-Means (or your favorite algorithm) in R^k
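A compact numpy sketch of the procedure just outlined, under the assumptions that every vertex has nonzero degree and that a plain Lloyd-style K-Means suffices in the embedded space; it is an illustration, not the reference implementation:

```python
import numpy as np

def spectral_clustering(W, k, normalized=True, n_iter=100, seed=0):
    """Embed the vertices via the first k eigenvectors of the graph Laplacian,
    then run K-Means on the rows Z_i of the embedding."""
    d = W.sum(axis=1)                            # assumes d_i > 0
    L = np.diag(d) - W
    if normalized:
        L = np.diag(1.0 / d) @ L                 # L_rw = D^{-1} L, eq. (33)
    # eigenvectors belonging to the k smallest eigenvalues
    # (L_rw is not symmetric, so use the general eigendecomposition)
    eigvals, eigvecs = np.linalg.eig(L)
    order = np.argsort(eigvals.real)[:k]
    Z = eigvecs[:, order].real                   # new datapoints Z_i in R^k

    # plain K-Means (Lloyd's algorithm) in the embedded space
    rng = np.random.default_rng(seed)
    centers = Z[rng.choice(len(Z), size=k, replace=False)]
    labels = np.zeros(len(Z), dtype=int)
    for _ in range(n_iter):
        labels = np.argmin(((Z[:, None, :] - centers[None]) ** 2).sum(axis=-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return labels
```

In practice one would typically rely on a library implementation (e.g. sklearn.cluster.SpectralClustering) rather than a sketch like this.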
Toy Examples
- Solutions of spectral clustering
Toy Examples
- K-Means (left)
- spectral clustering (right)
Toy Examples
- 2 versus 3 clusters
Topic Models for Information Retrieval
Example retrieval settings: company medical database, news site, web search
Ad Hoc Retrieval
- Goal: identify relevant documents for a given text query
- Input: a short, ambiguous and incomplete query (think of your usual Google search)
- Needs to be done extremely fast
- By far the most popular form of information access
Under the hood
What happens after we press the "search" button?
Bag-of-Words
- We need a way to represent textual information (what is x?)
- We usually speak of documents and terms
- Think of a document as a bag-of-words (BoW)

Document:
Texas Instruments said it has developed the first 32-bit computer chip designed specifically for artificial intelligence applications [...]

Representation (word counts n_ij):
    ...
    artifact      0
    artificial    1
    ...
    intelligence  1
    interest      0
    ...
Bag-of-Words
- Bag-of-words is a histogram representation of the data: x ∈ N^L
- Need a dictionary (of length L)
- Histograms are often used to transform set data into an aggregated fixed-length representation, e.g. local image descriptors
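A minimal sketch of the bag-of-words histogram for the example document above; it uses a tiny made-up dictionary and skips stemming and stopword removal:

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Histogram of word counts over a fixed vocabulary (unknown words are ignored)."""
    counts = Counter(document.lower().split())
    return [counts[word] for word in vocabulary]

vocabulary = ["artificial", "intelligence", "chip", "interest"]
doc = ("Texas Instruments said it has developed the first 32-bit computer chip "
       "designed specifically for artificial intelligence applications")
print(bag_of_words(doc, vocabulary))   # [1, 1, 1, 0]
```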
Some Pre-Processing
- Pre-processing before converting to the bag-of-words histogram representation
- Stopword removal: remove uninformative words (several lists are available)
  - a, i, her, it, is, the, to, ...
- Porter stemming: groups of rules to transform words to a common stem
  - remove 'ed', 'ing', 'ly', e.g. visiting, visited → visit
  - libraries, library → librari
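A hedged sketch of the two pre-processing steps; the stopword list here is made up, and the stemmer assumes the nltk package is available:

```python
from nltk.stem import PorterStemmer

STOPWORDS = {"a", "i", "her", "it", "is", "the", "to", "has", "for", "said"}
stemmer = PorterStemmer()

def preprocess(document):
    """Lowercase, drop stopwords, and stem the remaining tokens."""
    tokens = document.lower().split()
    return [stemmer.stem(t) for t in tokens if t not in STOPWORDS]

print(preprocess("Libraries are visiting the library"))
# ['librari', 'are', 'visit', 'librari']  (up to the exact stemmer rules)
```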
TF-IDF
- Re-weighting the entries according to importance
- n(d, w): number of occurrences of word w in document d
- Term frequency (TF) (of one document)
  $$\operatorname{tf}(d, w) = \frac{n(d, w)}{\sum_k n(d, k)} \qquad (35)$$
- Inverse document frequency (IDF) (of a corpus)
  $$\operatorname{idf}(w) = \log \frac{n}{|\{d \mid n(d, w) \neq 0\}|} \qquad (36)$$
- tf-idf
  $$\operatorname{tf\text{-}idf}(d, w) = \operatorname{tf}(d, w) \cdot \operatorname{idf}(w) \qquad (37)$$
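A small sketch of eqs. (35)-(37) on a made-up corpus of already tokenized documents:

```python
import math
from collections import Counter

def tf_idf(corpus):
    """tf-idf weights (35)-(37) for a list of tokenized documents."""
    n = len(corpus)
    counts = [Counter(doc) for doc in corpus]
    df = Counter(w for c in counts for w in c)        # document frequency of each word
    idf = {w: math.log(n / df[w]) for w in df}        # eq. (36)
    weights = []
    for c in counts:
        total = sum(c.values())
        weights.append({w: (c[w] / total) * idf[w] for w in c})  # eqs. (35), (37)
    return weights

corpus = [["chip", "chip", "intelligence"], ["intelligence", "news"], ["news", "sports"]]
for w in tf_idf(corpus):
    print(w)
```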
Document-Term Matrix
- Document-term matrices are huge!
- Typical values
  - Corpus: 1,000,000 documents
  - Vocabulary: 100,000 words
  - Sparseness: < 0.1%
- Think of the internet: documents = websites, words = every possible word
Beyond keyword-based Search
- Why not just find the keywords that are typed in a query?
- Vocabulary mismatch
  - different people use different vocabulary to describe the same concept
  - matching queries and documents based on keywords is insufficient
- A webpage can be important although it does not contain the keywords
Challenges
- Compactness: few search terms, rarely any redundancy
  - Average number of search terms per query = 2.5 (Spink et al.: From E-Sex to E-Commerce: Web Search Changes, IEEE Computer, 2002)
- Variability: synonyms and semantically related terms, expressions, writing styles, etc.
- Ambiguity: terms with multiple senses (polysemes), e.g. Java, jaguar, bank, head
- Quality & Authority: correlation between quality and relevance
Motivation: Latent Structure Analysis
- Given a matrix that encodes the data, e.g. co-occurrence counts
  $$A = \begin{pmatrix} a_{11} & \cdots & a_{1j} & \cdots & a_{1m} \\ \vdots & & \vdots & & \vdots \\ a_{i1} & \cdots & a_{ij} & \cdots & a_{im} \\ \vdots & & \vdots & & \vdots \\ a_{n1} & \cdots & a_{nj} & \cdots & a_{nm} \end{pmatrix}$$
- Potential problems
  - too large
  - too complicated
  - missing entries
  - noisy entries
  - lack of structure
- Is there a simpler way to explain the entries?
- There may be a latent structure underlying the data
- Possible structure: semantic topics (websites about news, sports, mountainbiking news, Swiss mountainbiking news)
- How to reveal this structure?
Matrix Decomposition
- Common approach: approximately factorize the matrix
  $$A \approx \underbrace{\tilde{A}}_{\text{approximation}} = \underbrace{L}_{\text{left factor}} \; \underbrace{R}_{\text{right factor}} \qquad (38)$$
- The factors are typically constrained to be thin
- reduction: n·m → n·q + m·q
Matrix Decomposition
- Example
  - A is a matrix of n documents
  - each document is represented as an m-dimensional vector
  - R is a matrix of "common structure" across these documents
  - each row in L is a low-dimensional (q-dimensional) representation of an m-dimensional document
- The factors are typically constrained to be thin
- reduction: n·m → n·q + m·q
Latent Semantic Analysis
- Perform a low-rank approximation of the document-term matrix (typical rank 100-300)
- General idea
  - Map documents (and terms) to a low-dimensional representation
  - Design the mapping such that the low-dimensional space reflects semantic associations (latent semantic space)
  - Compute document similarity based on the inner product in the latent space
- Goals
  - Similar terms map to similar locations in the low-dimensional space
  - noise reduction by dimensionality reduction
Singular Value Decomposition
- For an arbitrary matrix A there exists a factorization (singular value decomposition, SVD) as follows:
  $$A = U \Sigma V^\top \in \mathbb{R}^{n \times m}, \qquad (39)$$
- where
  $$U \in \mathbb{R}^{n \times k}, \quad \Sigma \in \mathbb{R}^{k \times k}, \quad V \in \mathbb{R}^{m \times k} \qquad (40)$$
  $$U^\top U = I, \quad V^\top V = I \qquad (41)$$
  $$\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_k), \quad \sigma_i \geq \sigma_{i+1} \qquad (42)$$
  $$k = \operatorname{rank}(A) \qquad (43)$$
Low-Rank Approximation
- The SVD can be used to compute optimal low-rank approximations
- Approximation problem:
  $$X^* = \operatorname{argmin}_{\hat{X}:\, \operatorname{rank}(\hat{X}) = q} \| X - \hat{X} \|_F \qquad (44)$$
- Solution via the SVD
  $$X^* = U \operatorname{diag}(\sigma_1, \dots, \sigma_q, \underbrace{0, \dots, 0}_{\text{set small } \sigma_i \text{ to zero}}) V^\top \qquad (45)$$
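A minimal numpy sketch of eq. (45): compute the SVD, zero out the small singular values, and reassemble the matrix.

```python
import numpy as np

def low_rank_approx(X, q):
    """Best rank-q approximation of X in the Frobenius norm, via the SVD, eq. (45)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    s[q:] = 0.0                              # set the small singular values to zero
    return (U * s) @ Vt

rng = np.random.default_rng(0)
X = rng.random((8, 5))
X2 = low_rank_approx(X, q=2)
print(np.linalg.matrix_rank(X2))             # 2
print(np.linalg.norm(X - X2, ord="fro"))     # approximation error
```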
Latent Semantic Analysis: Overview
- Given X: n documents, vocabulary size m
- Determine X = U Σ V^⊤ via the SVD
- Use u_i (q-dimensional) instead of x_i (m-dimensional) because q ≪ m
- Document similarity: the inner product ⟨u_i, u_j⟩
- Fold in queries
  $$\hat{q} = \Sigma_k^{-1} V_k^\top q \qquad (46)$$
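A hypothetical end-to-end sketch of LSA on a tiny made-up document-term matrix, including the query fold-in of eq. (46); the numbers and dimensions are purely illustrative:

```python
import numpy as np

# Hypothetical document-term matrix X: 4 documents (rows) x 6 terms (columns).
X = np.array([
    [2, 1, 0, 0, 0, 1],
    [1, 2, 1, 0, 0, 0],
    [0, 0, 0, 2, 1, 1],
    [0, 0, 1, 1, 2, 0],
], dtype=float)

q_dim = 2
U, s, Vt = np.linalg.svd(X, full_matrices=False)
U_k, S_k, V_k = U[:, :q_dim], np.diag(s[:q_dim]), Vt[:q_dim].T

doc_embeddings = U_k                        # one q-dimensional row u_i per document

# Fold in a new query (a bag-of-words vector over the same 6 terms), cf. eq. (46)
query = np.array([1, 1, 0, 0, 0, 0], dtype=float)
q_hat = np.linalg.inv(S_k) @ V_k.T @ query

# Rank documents by the inner product in the latent space
scores = doc_embeddings @ q_hat
print(np.argsort(scores)[::-1])             # documents sorted by similarity to the query
```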
We can always do better
- So far: LSA (Latent Semantic Analysis)
  S. Deerwester, S. Dumais, G. Furnas, T. Landauer, R. Harshman (1990). "Indexing by Latent Semantic Analysis". Journal of the American Society for Information Science 41 (6): 391-407.
- Now a probabilistic version thereof: pLSA (probabilistic LSA)
  T. Hofmann, Probabilistic Latent Semantic Indexing, Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), 1999
- The full model: LDA (Latent Dirichlet Allocation) (not Linear Discriminant Analysis)
  D. Blei, A. Y. Ng, M. Jordan (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research 3