Data Clustering

Nick Bild


Grand Valley State University

April 24, 2007

CS678

Background


Unsupervised learning


Patterns/structures in unlabeled data

Image source: http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/images/clustering.gif

Goal


Determine intrinsic data grouping



Data subsets share common trait

Applications


Bioinformatics


Group homologous sequences



Image Classification


Medical diagnosis



Marketing


Customer behavior

Distance Measure


Quantify similarity of 2 data points


Influences shape of clusters


Common measures:


Euclidean


Manhattan


Mahalanobis

Euclidean Distance


“As the crow flies”


For 2 points $P = (x_1, y_1)$ and $Q = (x_2, y_2)$,


distance = $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$

Image source: http://en.wikipedia.org/wiki/Euclidean_distance, http://en.wikipedia.org/wiki/Manhattan_distance
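The straight-line distance above can be sketched in a few lines of Python. This is a minimal illustration, assuming points are given as equal-length tuples of numbers; the function name `euclidean_distance` is made up for this example.

```python
import math

def euclidean_distance(p, q):
    """Straight-line ("as the crow flies") distance between two points."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    # Square each coordinate difference, sum the squares, take the root.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance((0, 0), (3, 4)))  # 5.0
```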

Euclidean Distance


Distance between two objects is not affected by adding new objects (possible outliers)


Features measured on different scales can make distances misleading

Image source: http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/images/image005.gif
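A small sketch of the scale problem, using made-up (height, weight) data: the same three people are described once with height in metres and once in centimetres, and the nearest neighbour under Euclidean distance changes. The helper names and the data are assumptions for illustration only.

```python
import math

def euclidean_distance(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest(target, candidates):
    """Return the name of the candidate closest to target."""
    return min(candidates, key=lambda name: euclidean_distance(target, candidates[name]))

# Hypothetical people described as (height, weight in kg).
a_m, others_m = (1.60, 60), {"B": (1.90, 61), "C": (1.62, 64)}   # height in metres
a_cm, others_cm = (160, 60), {"B": (190, 61), "C": (162, 64)}    # height in centimetres

print(nearest(a_m, others_m))    # B  (weight dominates the distance)
print(nearest(a_cm, others_cm))  # C  (height now dominates)
```

A common remedy is to standardize each feature (for example to zero mean and unit variance) before computing distances.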

Manhattan Distance


“City-block” distance


For points $P = (x_1, y_1)$ and $Q = (x_2, y_2)$,


distance = $|x_1 - x_2| + |y_1 - y_2|$

Image source: http://en.wikipedia.org/wiki/Manhattan_distance
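The city-block distance can be sketched the same way. This is a minimal illustration, assuming points are equal-length tuples of numbers; the name `manhattan_distance` is chosen for this example.

```python
def manhattan_distance(p, q):
    """City-block distance: sum of absolute coordinate differences."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of coordinates")
    return sum(abs(a - b) for a, b in zip(p, q))

print(manhattan_distance((0, 0), (3, 4)))  # 7
```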

Manhattan Distance


Dampens effect of outliers


Differences are not squared
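One way to see the dampening effect is to ask how much of the total distance comes from a single outlying coordinate. The sketch below uses made-up points and a hypothetical helper, `largest_term_share`, that compares the unsquared (Manhattan) and squared (Euclidean) terms.

```python
def largest_term_share(p, q, power):
    """Fraction of the summed |difference|**power due to the largest single gap."""
    terms = [abs(a - b) ** power for a, b in zip(p, q)]
    return max(terms) / sum(terms)

p = (0, 0, 0, 0)
q = (1, 1, 1, 10)  # the last coordinate is a hypothetical outlying measurement

print(round(largest_term_share(p, q, 1), 2))  # 0.77  (Manhattan terms: 10 / 13)
print(round(largest_term_share(p, q, 2), 2))  # 0.97  (Euclidean squared terms: 100 / 103)
```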

Mahalanobis Distance


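Distance between 2 points $x$ and $y$: $d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$, where $S$ is the covariance matrix of the data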

Image source: http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_mahalanobis.htm

Mahalanobis Distance


Not dependent on the scales of measurement used
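A minimal two-dimensional sketch of the Mahalanobis distance, assuming the standard definition with the sample covariance matrix of the data. The data and the function names (`covariance_2d`, `mahalanobis_2d`) are made up; the example checks that changing the height unit from metres to centimetres leaves the distance unchanged.

```python
import math

def covariance_2d(points):
    """Sample covariance entries (sxx, sxy, syy) of a list of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_2d(p, q, cov):
    """sqrt((p - q)^T S^-1 (p - q)) for a 2x2 covariance matrix S."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = p[0] - q[0], p[1] - q[1]
    # Inverse of [[sxx, sxy], [sxy, syy]] is [[syy, -sxy], [-sxy, sxx]] / det.
    quad = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(quad)

# Hypothetical (height, weight in kg) data, heights in metres then centimetres.
data_m = [(1.60, 60), (1.90, 61), (1.62, 64), (1.75, 70)]
data_cm = [(x * 100, y) for x, y in data_m]

d_m = mahalanobis_2d(data_m[0], data_m[1], covariance_2d(data_m))
d_cm = mahalanobis_2d(data_cm[0], data_cm[1], covariance_2d(data_cm))
print(math.isclose(d_m, d_cm))  # True
```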

Algorithm Classifications


Exclusive Clustering


K-means


Overlapping Clustering


Fuzzy C-means


Hierarchical Clustering


Hierarchical

K-Means


1. Define K centroids, each representing a cluster


2. Assign each point to the nearest centroid


3. After all assignments, move each centroid to the mean of its assigned points


4. Repeat 2-3 until centroids stop moving
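The four steps above can be sketched in plain Python. This is an illustrative version under stated assumptions, not a reference implementation: initial centroids are K randomly sampled data points, distance is Euclidean, and the names `kmeans`, `seed`, and `max_iter` are chosen for this example.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # step 1: pick K starting centroids
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                         # step 2: assign to nearest centroid
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        new_centroids = [                        # step 3: move centroid to cluster mean
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:           # step 4: stop when nothing moves
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # [3, 3]
```

An empty cluster keeps its previous centroid here; other implementations re-seed it instead.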

K-Means

Image source: http://people.scs.fsu.edu/~burkardt/f_src/kmeans/test12_clusters_equal.png

K-Means


Pros


Simplicity


Speed


Cons


Random initial assignments, so results can differ from run to run


What is K? It must be chosen before clustering

Fuzzy C-means


1. Define K centroids, each representing a cluster


2. Assign each point random coefficients (degrees of membership) for being in each cluster


3. Calculate each centroid as the mean of all points, weighted by their coefficients


4. For each point, recompute its coefficients from its distance to each centroid


5. Repeat 3-4 until coefficients stop changing
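A plain-Python sketch of these steps follows. It assumes the commonly used update rules with a fuzzifier `m = 2` (a parameter the slides do not mention), Euclidean distance, and a stopping tolerance `tol`; all names are chosen for illustration.

```python
import math
import random

def fuzzy_c_means(points, c, m=2.0, tol=1e-6, max_iter=300, seed=0):
    rng = random.Random(seed)
    # Step 2: random membership coefficients, normalised to sum to 1 per point.
    u = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    dims = len(points[0])
    for _ in range(max_iter):
        # Step 3: each centroid is the mean of all points weighted by coefficient**m.
        centroids = []
        for j in range(c):
            weights = [row[j] ** m for row in u]
            total = sum(weights)
            centroids.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dims)
            ))
        # Step 4: recompute coefficients from distances to the centroids.
        new_u = []
        for p in points:
            dists = [math.dist(p, cj) for cj in centroids]
            if 0.0 in dists:  # point sits exactly on a centroid
                row = [1.0 if d == 0.0 else 0.0 for d in dists]
                s = sum(row)
                row = [v / s for v in row]
            else:
                row = [1.0 / sum((dj / dk) ** (2.0 / (m - 1.0)) for dk in dists)
                       for dj in dists]
            new_u.append(row)
        # Step 5: stop once no coefficient changes by more than tol.
        change = max(abs(a - b) for ra, rb in zip(u, new_u) for a, b in zip(ra, rb))
        u = new_u
        if change < tol:
            break
    return centroids, u

# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1.5, 2), (1, 0.5), (8, 8), (9, 9.5), (8.5, 9)]
centroids, memberships = fuzzy_c_means(points, c=2)
for p, row in zip(points, memberships):
    print(p, [round(v, 3) for v in row])
```

Each printed row holds one point's coefficients, which sum to 1; a hard assignment can be read off by taking the largest coefficient.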


Fuzzy C-means


Pros


Points “on the edge” have less influence on cluster location


Cons


Increased computational complexity


Random initial assignments, so results can differ from run to run

Hierarchical


1. Assign each point to its own cluster; N points = N clusters


2. Find the closest pair of clusters and merge them into a single cluster


3. Compute distances between the new cluster and each of the old clusters


4. Repeat 2-3 until all points are clustered into a single cluster of size N
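The agglomerative procedure above can be sketched as follows. This is a naive illustration that recomputes all cluster distances on every pass (which covers step 3) and assumes single-linkage as the default cluster distance; the names `agglomerate` and `single_link` are made up for this example.

```python
import math

def single_link(a, b):
    """Distance between clusters = distance of their closest pair of members."""
    return min(math.dist(p, q) for p in a for q in b)

def agglomerate(points, linkage=single_link):
    clusters = [[p] for p in points]             # step 1: every point is a cluster
    merges = []                                  # (cluster_a, cluster_b, distance)
    while len(clusters) > 1:                     # step 4: stop at a single cluster
        # Step 2: find the closest pair of clusters.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        merges.append((clusters[i], clusters[j], linkage(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)] + [merged]
    return merges

# Hypothetical one-dimensional data.
points = [(0,), (1,), (5,), (6,), (20,)]
merges = agglomerate(points)
print([dist for _, _, dist in merges])  # [1.0, 1.0, 4.0, 14.0]
```

The recorded merge distances are what a dendrogram plots on its height axis.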

Hierarchical


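Single-Linkage


similarity between clusters = greatest similarity from any member of one cluster to any member of the other cluster (closest pair)


Complete-Linkage


Least similarity between any member of one cluster and any member of the other (farthest pair)


Average-Linkage


Average similarity over all pairs of members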
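Expressed with distances instead of similarities, the three linkage rules become the minimum, maximum, and mean of the pairwise distances between two clusters. A minimal sketch, assuming Euclidean distance and made-up clusters:

```python
import math
from itertools import product

def pairwise(a, b):
    """All distances between a member of cluster a and a member of cluster b."""
    return [math.dist(p, q) for p, q in product(a, b)]

def single_linkage(a, b):
    return min(pairwise(a, b))        # closest pair = greatest similarity

def complete_linkage(a, b):
    return max(pairwise(a, b))        # farthest pair = least similarity

def average_linkage(a, b):
    d = pairwise(a, b)
    return sum(d) / len(d)            # mean over all pairs

left = [(0, 0), (1, 0)]
right = [(3, 0), (5, 0)]
print(single_linkage(left, right))    # 2.0
print(complete_linkage(left, right))  # 5.0
print(average_linkage(left, right))   # 3.5
```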

Image adapted from: http://www.autonlab.org/tutorials/kmeans11.pdf

Hierarchical


Pros


Produces a hierarchy, not an amorphous collection of groups


Cons


At least $O(n^2)$ time, so slow on large data sets


Difficult to interpret, complex

Choosing K


Elbow criterion


Keep only clusters that add information: stop increasing K once another cluster adds little

Image source: http://upload.wikimedia.org/wikipedia/commons/c/cd/DataClustering_ElbowCriterion.JPG
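One simple way to locate the elbow numerically is to look for the K where the improvement from adding a cluster drops most sharply. The sketch below is a heuristic, not a standard definition: it assumes a within-cluster sum of squares (WCSS) has already been measured for consecutive values of K, and the numbers are invented for illustration.

```python
def elbow_k(wcss_by_k):
    """Pick the K with the largest drop in improvement between neighbouring Ks.

    Expects at least three consecutive K values; returns None otherwise.
    """
    ks = sorted(wcss_by_k)
    best_k, best_bend = None, float("-inf")
    for prev_k, k, next_k in zip(ks, ks[1:], ks[2:]):
        gain_before = wcss_by_k[prev_k] - wcss_by_k[k]   # benefit of reaching k
        gain_after = wcss_by_k[k] - wcss_by_k[next_k]    # benefit of going past k
        bend = gain_before - gain_after
        if bend > best_bend:
            best_k, best_bend = k, bend
    return best_k

# Hypothetical WCSS values for K = 1..6.
wcss = {1: 1000.0, 2: 600.0, 3: 150.0, 4: 120.0, 5: 100.0, 6: 90.0}
print(elbow_k(wcss))  # 3
```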

Some Free Tools


Cluster / TreeView


Hierarchical, K-means


http://rana.lbl.gov/EisenSoftware.htm


FuzzyK / Maple Tree


Fuzzy c-means


http://rana.lbl.gov/EisenSoftware.htm

Cluster / TreeView

Image source: http://genomebiology.com/content/figures/gb-2004-5-9-r66-2.jpg

References


Alpaydin, E., Introduction to Machine Learning. MIT Press, 2004.


Clustering. Retrieved on April 22, 2007 from
http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/


Cluster Analysis. Retrieved on April 21, 2007 from
http://www.statsoft.com/textbook/stcluan.html


Data Clustering. Retrieved on April 22, 2007 from
http://en.wikipedia.org/wiki/Data_clustering


Euclidean Distances. Retrieved on April 20, 2007 from
http://en.wikipedia.org/wiki/Euclidean_distances


K-Means and Hierarchical Clustering. Retrieved on April 22, 2007 from
http://www.autonlab.org/tutorials/kmeans11.pdf


Manhattan Distance. Retrieved on April 22, 2007 from
http://en.wikipedia.org/wiki/Manhattan_distance


Mahalanobis Distance. Retrieved on April 23, 2007 from
http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_mahalanobis.htm

Questions?

Image source: http://farm1.static.flickr.com/42/95263888_b1d6591adc.jpg?v=0