DataClustering - Grand Valley State University

Data Clustering

Nick Bild


Grand Valley State University

April 24, 2007

CS678

Background


Unsupervised learning


Patterns/structures in unlabeled data

Image source: http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/images/clustering.gif

Goal


Determine intrinsic data grouping



Data subsets share common trait

Applications


Bioinformatics


Group homologous sequences



Image Classification


Medical diagnosis



Marketing


Customer behavior

Distance Measure


Quantify similarity of 2 data points


Influences shape of clusters


Common measures:


Euclidean


Manhattan


Mahalanobis

Euclidean Distance


“As the crow flies”


For 2 points $P = (x_1, y_1)$ and $Q = (x_2, y_2)$,


distance = $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$

Image source: http://en.wikipedia.org/wiki/Euclidean_distance,

http://en.wikipedia.org/wiki/Manhattan_distance

Euclidean Distance


Distance between two objects is not affected by adding new objects (possible outliers)


Features measured on different scales are misleading: the largest-scale feature dominates the distance

Image source: http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/images/image005.gif
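To make the scale problem concrete, here is a minimal Python sketch. The `people` records, the helper names, and the choice of z-score standardization as the remedy are illustrative assumptions, not something taken from the slides.

```python
import math

def euclidean(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def standardize(points):
    """Rescale every feature to zero mean and unit standard deviation."""
    columns = list(zip(*points))
    means = [sum(col) / len(col) for col in columns]
    # Population standard deviation; fall back to 1.0 for a constant feature.
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / len(col)) or 1.0
            for col, m in zip(columns, means)]
    return [tuple((v - m) / s for v, m, s in zip(p, means, stds))
            for p in points]

# Hypothetical records: (height in metres, income in dollars).
people = [(1.60, 30000.0), (1.95, 30500.0), (1.61, 30600.0)]

raw_01 = euclidean(people[0], people[1])
raw_02 = euclidean(people[0], people[2])
scaled = standardize(people)
scaled_01 = euclidean(scaled[0], scaled[1])
scaled_02 = euclidean(scaled[0], scaled[2])

print(raw_01 < raw_02)        # income dominates the raw distances
print(scaled_02 < scaled_01)  # after rescaling, height counts as well
```

On the raw data the height difference is invisible next to the income difference; once both features are standardized, the nearest neighbour of the first record changes.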

Manhattan Distance


“City-block” distance


For points $P = (x_1, y_1)$ and $Q = (x_2, y_2)$,


distance = $|x_1 - x_2| + |y_1 - y_2|$

Image source: http://en.wikipedia.org/wiki/Manhattan_distance

Manhattan Distance


Dampens effect of outliers


Differences are not squared
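A small Python sketch of why leaving differences unsquared dampens outliers. The three points are made-up values chosen so the arithmetic is easy to follow.

```python
import math

def manhattan(p, q):
    """City-block distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    """Straight-line distance: square root of summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

origin = (0.0, 0.0, 0.0, 0.0)
typical = (1.0, 1.0, 1.0, 1.0)   # small difference in every coordinate
outlier = (0.0, 0.0, 0.0, 4.0)   # one large difference in a single coordinate

# Manhattan treats both points as equally far from the origin.
print(manhattan(origin, typical), manhattan(origin, outlier))

# Euclidean squares the single large difference, so the outlier looks farther.
print(euclidean(origin, typical), euclidean(origin, outlier))
```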

Mahalanobis Distance


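Distance between 2 points $x$ and $y$, given the covariance matrix $S$ of the data: $d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$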

Image source: http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_mahalanobis.htm

Mahalanobis Distance


Not dependent on the scales of measurement used
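The scale independence can be checked with a short Python sketch restricted to 2-D points, so the covariance matrix can be inverted by hand. The function names and the sample data are hypothetical; the covariance is estimated from the sample itself, which is an assumption about how $S$ would be obtained.

```python
import math

def covariance_2d(points):
    """Sample covariance matrix [[sxx, sxy], [sxy, syy]] of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]

def mahalanobis_2d(p, q, cov):
    """sqrt((p - q)^T S^-1 (p - q)) for a 2x2 covariance matrix S."""
    sxx, sxy = cov[0]
    syy = cov[1][1]
    det = sxx * syy - sxy * sxy
    if det == 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = p[0] - q[0], p[1] - q[1]
    # The inverse of [[sxx, sxy], [sxy, syy]] is [[syy, -sxy], [-sxy, sxx]] / det.
    squared = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(squared)

# Hypothetical measurements; the second copy expresses x in centimetres.
data_m = [(1.0, 2.0), (2.0, 3.5), (3.0, 3.0), (4.0, 6.0), (5.0, 5.5)]
data_cm = [(x * 100.0, y) for x, y in data_m]

d_m = mahalanobis_2d(data_m[0], data_m[3], covariance_2d(data_m))
d_cm = mahalanobis_2d(data_cm[0], data_cm[3], covariance_2d(data_cm))
print(math.isclose(d_m, d_cm))  # same distance in either unit
```

With the identity matrix as covariance the function reduces to the Euclidean distance.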

Algorithm Classifications


Exclusive Clustering


K-means


Overlapping Clustering


Fuzzy C-means


Hierarchical Clustering


Hierarchical

K-Means


1. Define K centroids, each representing a cluster


2. Assign each point to its nearest centroid


3. After all assignments, move each centroid to the mean of its assigned points


4. Repeat 2-3 until centroids stop moving
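The four steps above can be sketched in plain Python as follows. Picking the initial centroids by sampling K data points, the `seed` and `max_iter` parameters, and the toy data are assumptions made for illustration; the slides do not specify an initialization.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Basic K-means on a list of tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: define K centroids (assumption: K randomly chosen data points).
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Step 3: move each centroid to the mean of its assigned points.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster in place
        # Step 4: stop once the centroids no longer move.
        if updated == centroids:
            break
        centroids = updated
    return centroids, labels

# Two well-separated toy groups.
toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(toy, k=2)
print(centroids)
print(labels)
```

Changing `seed` changes the starting centroids; on less cleanly separated data that can change the final clusters.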

K-Means

Image source: http://people.scs.fsu.edu/~burkardt/f_src/kmeans/test12_clusters_equal.png

K-Means


Pros


Simplicity


Speed


Cons


Initial centroids are chosen randomly, so results can differ on every run


What is K? It must be chosen in advance

Fuzzy C-means


1. Define K centroids, each representing a cluster


2. Assign each point random coefficients (degrees of membership) for being in each cluster


3. Calculate each centroid as the mean of all points, weighted by their coefficients


4. For each point, recompute its coefficients from its distances to the centroids


5. Repeat 3-4 until coefficients stop changing
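A minimal Python sketch of the loop above, assuming the commonly used update rules with a fuzzifier `m` (weights are coefficients raised to the power `m`, and new coefficients are proportional to distance raised to `-2 / (m - 1)`). The slides do not state these formulas, so `m`, `tol`, `max_iter`, and `seed` are illustrative parameters.

```python
import math
import random

def fuzzy_c_means(points, k, m=2.0, tol=1e-6, max_iter=300, seed=0):
    """Returns (centroids, u) where u[i][c] is point i's coefficient for cluster c."""
    rng = random.Random(seed)
    # Step 2: random coefficients per point, normalised so each row sums to 1.
    u = []
    for _ in points:
        row = [rng.random() + 1e-9 for _ in range(k)]
        total = sum(row)
        u.append([w / total for w in row])
    dims = len(points[0])
    centroids = []
    for _ in range(max_iter):
        # Step 3: each centroid is the mean of all points weighted by u ** m.
        centroids = []
        for c in range(k):
            weights = [row[c] ** m for row in u]
            total = sum(weights)
            centroids.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / total
                for d in range(dims)))
        # Step 4: recompute every point's coefficients from its distances.
        new_u = []
        for p in points:
            dists = [math.dist(p, c) for c in centroids]
            if 0.0 in dists:
                # The point sits exactly on a centroid: full membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                new_u.append([h / sum(hits) for h in hits])
            else:
                inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
                total = sum(inv)
                new_u.append([v / total for v in inv])
        # Step 5: stop when the coefficients have (almost) stopped changing.
        change = max(abs(a - b) for old, new in zip(u, new_u)
                     for a, b in zip(old, new))
        u = new_u
        if change < tol:
            break
    return centroids, u

toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, u = fuzzy_c_means(toy, k=2)
for point, row in zip(toy, u):
    print(point, [round(w, 3) for w in row])
```

Unlike K-means, every point keeps a coefficient for every cluster, so a point between two groups would show two comparable values rather than a hard label.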


Fuzzy C-means


Pros


Points “on the edge” have less influence on cluster location


Cons


Increased computational complexity


Initial coefficients are assigned randomly, so results can differ on every run

Hierarchical


1. Assign each point to its own cluster; N points = N clusters


2. Find the closest pair of clusters and merge them into a single cluster


3. Compute distances between the new cluster and each of the old clusters


4. Repeat 2-3 until all points are merged into a single cluster of size N
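The merge loop above can be sketched in Python as below. The distance between two clusters is taken here as the smallest point-to-point distance, which is one possible choice and an assumption of this sketch; the function name and the 1-D sample data are hypothetical. For clarity the sketch recomputes all cluster distances on every pass instead of updating only those involving the new cluster.

```python
import math

def agglomerate(points):
    """Merge clusters bottom-up; returns a list of (cluster_a, cluster_b, distance)."""
    # Step 1: every point starts as its own cluster.
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        # Step 3: the merged cluster replaces the pair; its distances to the
        # remaining clusters are recomputed on the next pass.
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    # Step 4 ends when a single cluster of size N remains.
    return merges

merges = agglomerate([(0.0,), (1.0,), (5.0,), (6.5,)])
for left, right, distance in merges:
    print(left, "+", right, "at distance", distance)
```

The recorded merge distances are what a dendrogram plots on its vertical axis.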

Hierarchical


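Single-Linkage


similarity between clusters = greatest similarity from any member of one cluster to any member of the other cluster


Complete-Linkage


similarity between clusters = least similarity from any member of one cluster to any member of the other cluster


Average-Linkage


similarity between clusters = average similarity over all pairs of members, one from each cluster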

Image adapted from: http://www.autonlab.org/tutorials/kmeans11.pdf
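Expressed with distances instead of similarities, greatest similarity becomes smallest distance and least similarity becomes largest distance. A short Python sketch with two made-up clusters (the function names are illustrative):

```python
import math
from itertools import product

def pairwise(a, b):
    """All distances between a member of cluster a and a member of cluster b."""
    return [math.dist(p, q) for p, q in product(a, b)]

def single_linkage(a, b):
    """Closest pair of members (greatest similarity)."""
    return min(pairwise(a, b))

def complete_linkage(a, b):
    """Farthest pair of members (least similarity)."""
    return max(pairwise(a, b))

def average_linkage(a, b):
    """Mean over all member pairs."""
    distances = pairwise(a, b)
    return sum(distances) / len(distances)

left = [(0.0, 0.0), (1.0, 0.0)]
right = [(4.0, 0.0), (6.0, 0.0)]

print(single_linkage(left, right))
print(complete_linkage(left, right))
print(average_linkage(left, right))
```

Single linkage tends to chain elongated clusters together, while complete linkage favours compact ones, so the choice changes the shape of the resulting hierarchy.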

Hierarchical


Pros


Hierarchy, not amorphous group collection


Cons


At least $O(n^2)$ time and memory, since all pairwise distances are needed


Difficult to interpret, complex

Choosing K


Elbow criterion


Add clusters only while each new one adds substantial information


Choose K at the “elbow”, where the gain levels off

Image source: http://upload.wikimedia.org/wikipedia/commons/c/cd/DataClustering_ElbowCriterion.JPG
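One way to look for the elbow is to run K-means for several values of K and watch the within-cluster sum of squares. The Python sketch below makes several assumptions that are not in the slides: within-cluster sum of squares as the measure of information, a deterministic start from evenly spaced points of the sorted data, made-up data with three groups, and a simple "largest bend" heuristic for picking the elbow.

```python
import math

def kmeans_wcss(points, k, max_iter=100):
    """Run a basic K-means and return the within-cluster sum of squares."""
    # Assumption: deterministic start from k evenly spaced sorted points.
    ordered = sorted(points)
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)]
                 for i in range(k)]
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[nearest].append(p)
        updated = [tuple(sum(v) / len(g) for v in zip(*g)) if g else centroids[c]
                   for c, g in enumerate(groups)]
        if updated == centroids:
            break
        centroids = updated
    # Squared distance from every point to its nearest final centroid.
    return sum(min(math.dist(p, c) ** 2 for c in centroids) for p in points)

# Made-up data: three groups of four points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),
        (10.0, 0.0), (10.0, 1.0), (11.0, 0.0), (11.0, 1.0),
        (5.0, 9.0), (5.0, 10.0), (6.0, 9.0), (6.0, 10.0)]

curve = [kmeans_wcss(data, k) for k in range(1, 6)]   # K = 1..5

# Heuristic: the elbow is where the improvement drops off most sharply.
bend = max(range(1, len(curve) - 1),
           key=lambda i: (curve[i - 1] - curve[i]) - (curve[i] - curve[i + 1]))
elbow_k = bend + 1   # curve[0] corresponds to K = 1

for k, value in enumerate(curve, start=1):
    print(k, round(value, 2))
print("elbow at K =", elbow_k)
```

In practice the curve is usually inspected by eye, since real data rarely shows a bend as sharp as this toy example.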

Some Free Tools


Cluster / TreeView


Hierarchical, K-means


http://rana.lbl.gov/EisenSoftware.htm


FuzzyK / Maple Tree


Fuzzy c-means


http://rana.lbl.gov/EisenSoftware.htm

Cluster / TreeView

Image source: http://genomebiology.com/content/figures/gb-2004-5-9-r66-2.jpg

References


Alpaydin, E., Introduction to Machine Learning. MIT Press, 2004.


Clustering. Retrieved on April 22, 2007 from
http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/


Cluster Analysis. Retrieved on April 21, 2007 from
http://www.statsoft.com/textbook/stcluan.html.


Data Clustering. Retrieved on April 22, 2007 from
http://en.wikipedia.org/wiki/Data_clustering


Euclidean Distances. Retrieved on April 20, 2007 from
http://en.wikipedia.org/wiki/Euclidean_distances


K-Means and Hierarchical Clustering. Retrieved on April 22, 2007 from
http://www.autonlab.org/tutorials/kmeans11.pdf


Manhattan Distance. Retrieved on April 22, 2007 from
http://en.wikipedia.org/wiki/Manhattan_distance


Mahalanobis Distance. Retrieved on April 23, 2007 from
http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_mahalanobis.htm.

Questions?

Image source: http://farm1.static.flickr.com/42/95263888_b1d6591adc.jpg?v=0