CS325 Artificial Intelligence Ch. 20 – Unsupervised Machine Learning

CS325 Artificial Intelligence
Ch.20 – Unsupervised Machine Learning
Cengiz Günay
Spring 2013
Unsupervised Learning
Missing teacher:
No labels, y
Just input data, x
What can you learn with it?
1. Simplifying data (e.g., dimensionality reduction)
2. Organizing data (e.g., clustering)
Works by finding structure in the data, exploiting redundancies
Entry survey: Unsupervised Learning (0.5 points of final grade)
What is it good for in real life?
Where would you use it?
The Google PageRank Algorithm
Why is it called Google PageRank®? (Named after Larry Page.)
Assigns an “importance” score to each page based on incoming links
Before PageRank:
Manually made online directories (e.g., Yahoo!)
Bag-of-words maximum likelihood
PageRank improves on the bag-of-words model:
Iterative algorithm that models a surfer randomly clicking away (see the sketch below)
On each page, the probability of reaching a target page is divided among the outgoing links.
World with pages A, B, C, and D. Initially PR(x) = 0.25 for every page x.
If B, C, and D all link to A, then
PR(A) = PR(B) + PR(C) + PR(D) = 0.75
If B had a link to pages C and A, while page D had links to all
three pages, then PR(A) = PR(B)/2 + PR(C) + PR(D)/3
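A minimal Python sketch of this simplified update (no damping factor; the outgoing links of pages A and C are not given on the slide and are filled in here just for illustration):

```python
# Simplified PageRank: repeatedly redistribute each page's rank equally
# among its outgoing links (assumes every page has at least one outgoing link).
def pagerank(links, iterations=20):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}        # e.g., 0.25 each for 4 pages
    for _ in range(iterations):
        new_pr = {p: 0.0 for p in pages}
        for page, outgoing in links.items():
            share = pr[page] / len(outgoing)         # rank split among outgoing links
            for target in outgoing:
                new_pr[target] += share
        pr = new_pr
    return pr

# The slide's second example: B links to A and C; D links to A, B, and C.
# A -> B and C -> A are assumptions, not part of the slide.
links = {"A": ["B"], "B": ["A", "C"], "C": ["A"], "D": ["A", "B", "C"]}
print(pagerank(links))
```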
Other Unsupervised Learning Examples
Dimensionality reduction:
1. Principal/independent component analysis (PCA/ICA)
2. Factor analysis
3. Google PageRank
Clustering:
1. Blind source separation
2. k-Means clustering
3. Competitive learning
4. Expectation maximization (EM)
5. Self-organizing maps (SOM)
k-Means Clustering
Algorithm (sketched in code below):
1. Randomly place k cluster centers
2. Assign each point to its closest center
3. Move each center to the center of gravity (mean) of its assigned points
4. Go back to step 2 until nothing changes
Problems:
Choosing the appropriate k
Local minima
High dimensionality
Not mathematically grounded (no underlying probabilistic model)
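A minimal NumPy sketch of the k-means loop above (assumes the data sit in an array X with one point per row; variable names are just for illustration):

```python
import numpy as np

def k_means(X, k, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]           # step 1: random centers
    for _ in range(max_iters):
        # step 2: assign each point to its closest center
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: move each center to the mean of its assigned points
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):                        # step 4: stop if no change
            break
        centers = new_centers
    return centers, labels
```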
Improving k-Means with Gaussians
Gaussian or normal distribution function:
$$N(\mu, \sigma) = P(x \mid \mu, \sigma) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(x - \mu)^2 / (2\sigma^2)}$$
Mean and variance parameters can be approximated from the data:
$$\mu = \frac{1}{M} \sum_i x_i, \qquad \sigma^2 = \frac{1}{M} \sum_i (x_i - \mu)^2$$
Watch Dr. Thrun use Maximum Likelihood to derive these!
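A quick NumPy sketch of these maximum-likelihood estimates (the sample `data` is a made-up toy example):

```python
import numpy as np

data = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=1000)  # toy 1-D sample
mu = data.mean()                          # mu = (1/M) * sum_i x_i
sigma2 = ((data - mu) ** 2).mean()        # sigma^2 = (1/M) * sum_i (x_i - mu)^2

# Density of the fitted Gaussian at a point x:
def gaussian_pdf(x, mu, sigma2):
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

print(mu, sigma2, gaussian_pdf(2.0, mu, sigma2))
```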
Multi-variate Gaussians
What would a 2D Gaussian look like?
$$N(\mu, \Sigma) = (2\pi)^{-d/2}\, |\Sigma|^{-1/2} \exp\!\left(-\tfrac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right), \quad \text{where}$$
$$\Sigma = \frac{1}{M} \sum_i (x_i - \mu)(x_i - \mu)^T, \quad \text{and } d \text{ is the number of dimensions.}$$
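A small sketch of fitting and evaluating this density in NumPy (X is an assumed M×d data array):

```python
import numpy as np

def fit_gaussian(X):
    mu = X.mean(axis=0)
    centered = X - mu
    Sigma = centered.T @ centered / len(X)      # (1/M) * sum_i (x_i - mu)(x_i - mu)^T
    return mu, Sigma

def mvn_pdf(x, mu, Sigma):
    d = len(mu)
    diff = x - mu
    norm = (2 * np.pi) ** (-d / 2) * np.linalg.det(Sigma) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

X = np.random.default_rng(1).normal(size=(500, 2))   # toy 2-D data
mu, Sigma = fit_gaussian(X)
print(mvn_pdf(np.zeros(2), mu, Sigma))
```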
Fitting Multi-variate Gaussians
Using Gaussians for Clusters
Assume points belong to clusters with multi-variate Gaussian
distributions.
We could use Maximum Likelihood, but we don't know the
Gaussian parameters (mean and variance).
It's a chicken-and-egg problem!
Solution: pretend we have centers. Choose them randomly as in
k-means, and then run Expectation Maximization.
Expectation Maximization
Expectation Maximization (EM): a two-step iterative algorithm
1. Expectation step: for all i, j, calculate the probability that x_j belongs to cluster i:
$$p_{ij} = P(C = i \mid x_j) \propto P(x_j \mid C = i)\, P(C = i)$$
2. Maximization step: recalculate the parameters:
$$\mu_i = \frac{\sum_j p_{ij}\, x_j}{n_i}, \qquad \Sigma_i = \frac{\sum_j p_{ij}\, (x_j - \mu_i)(x_j - \mu_i)^T}{n_i}, \qquad \text{where } n_i = \sum_j p_{ij}$$
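A minimal sketch of EM for a mixture of Gaussians, following the two steps above. Assumptions: X is an (M, d) data array, k is the number of clusters, and the starting choices (identity covariances, uniform priors) are just one illustrative option; scipy is used for the Gaussian density.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gaussians(X, k, iterations=50, seed=0):
    M, d = X.shape
    rng = np.random.default_rng(seed)
    mus = X[rng.choice(M, size=k, replace=False)]           # random centers, as in k-means
    Sigmas = np.array([np.eye(d) for _ in range(k)])
    priors = np.full(k, 1.0 / k)                             # P(C = i)
    for _ in range(iterations):
        # E-step: p_ij proportional to P(x_j | C = i) P(C = i), normalized over clusters
        p = np.array([priors[i] * multivariate_normal.pdf(X, mus[i], Sigmas[i])
                      for i in range(k)])                    # shape (k, M)
        p /= p.sum(axis=0, keepdims=True)
        # M-step: re-estimate the parameters
        n = p.sum(axis=1)                                    # n_i = sum_j p_ij
        mus = (p @ X) / n[:, None]
        for i in range(k):
            diff = X - mus[i]
            Sigmas[i] = (p[i][:, None] * diff).T @ diff / n[i]
        priors = n / M
    return mus, Sigmas, priors
```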
We Find the Gaussian Clusters
Unsupervised learning of Gaussians is also used in
Radial Basis Function neural networks
Can Also Use Gaussians for Density Estimation
Summary for Expectation Maximization
Expectation Maximization:
All points belong to all centers (soft assignment)
Better solutions
Less susceptible to local minima
What else can Expectation Maximization do?
Not limited to learning Gaussians
Find hidden variables in a Bayes net when we cannot count them
directly, as in the spam example
Find hidden (latent) variables in other algorithms, like Hidden
Markov Models
Learn the structure of problems with unknowns (e.g., Bayes
nets)
Dimension Reduction
How many dimensions do we need to represent these data?
Linear Dimensionality Reduction
How to do this (sketched in code below):
1. Find the Gaussian parameters of the data
2. Find the eigenvectors and eigenvalues of the covariance matrix
3. Choose the eigenvectors with the largest eigenvalues
4. Project the data onto the selected eigenvector space
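A compact NumPy sketch of these four steps (this is standard PCA; X is an assumed M×d data array and n_components is the number of dimensions to keep):

```python
import numpy as np

def pca_project(X, n_components):
    mu = X.mean(axis=0)                                  # 1. Gaussian parameters (mean)
    Sigma = np.cov(X - mu, rowvar=False)                 #    ... and covariance
    eigvals, eigvecs = np.linalg.eigh(Sigma)             # 2. eigenvalues and eigenvectors
    order = np.argsort(eigvals)[::-1][:n_components]     # 3. keep the largest eigenvalues
    W = eigvecs[:, order]
    return (X - mu) @ W                                  # 4. project onto selected eigenvectors
```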
Linear Dimensionality Reduction Example
Reducing from Large Dimensional Spaces: Eigenfaces
Face example:
50 × 50 = 2,500 pixels (dimensions)
Reduce to 12 “eigenface” dimensions
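As a sketch, the same PCA machinery applied to flattened face images (scikit-learn is used here for brevity; the `faces` array is a random stand-in, not a real face dataset):

```python
# Project 50x50 face images onto 12 "eigenface" components (illustrative only).
import numpy as np
from sklearn.decomposition import PCA

faces = np.random.rand(200, 50, 50)                 # stand-in for a real face dataset
X = faces.reshape(len(faces), -1)                   # 200 x 2500: one flattened image per row
pca = PCA(n_components=12)
codes = pca.fit_transform(X)                        # 200 x 12: each face as 12 numbers
eigenfaces = pca.components_.reshape(12, 50, 50)    # the components, viewed as images
reconstructed = pca.inverse_transform(codes).reshape(-1, 50, 50)
```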
Reducing from Large Dimensional Spaces: Bodies
Body example:
Three dimensions are enough to distinguish: height, size, gender
Trick is to use piecewise linear projections
See locally linear embedding and Isomap for more info
Clustering by Affinity
Would EM or k-means work well here?
No.
Spectral Clustering
Rank-deficient (affinity) matrix
Can use Principal Component Analysis to find orthogonal
components
Example with clustering?
Competitive Learning:Neural Gas
Neural gas:
Growing Neural Gas
Gesture recognition
Source Separation
Cocktail party problem
Blind source separation
Independent component analysis (ICA)
Difference between PCA and ICA?
PCA finds orthogonal components.
ICA finds statistically independent components.
Thus, ICA is better suited for signal separation.
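A sketch of blind source separation with ICA (FastICA from scikit-learn); the two "microphone" recordings are synthetic mixtures made up for illustration:

```python
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
sources = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t))]    # two independent source signals
mixing = np.array([[1.0, 0.5],
                   [0.5, 2.0]])
observed = sources @ mixing.T                              # what the "microphones" record

ica = FastICA(n_components=2, random_state=0)
recovered = ica.fit_transform(observed)                    # estimated sources (up to scale/order)
```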
Summary
Learning without a teacher is still useful
Makes sense of hidden structure within data
Many uses: clustering, source separation, dimensionality reduction,
density estimation, ...
Both iterative algorithms and mathematical solutions
Makes sense of natural data: faces, bodies
Competitive learning can be used to find best-adapted
solutions: e.g., find the best on-screen keyboard for typing?
Use unsupervised learning first to simplify the data, then
combine with supervised learning!
Exit survey: Unsupervised Learning
What changed in your understanding?
Any new suggestions on where you would use it?