# Data Clustering - Grand Valley State University


Data Clustering

Nick Bild

Grand Valley State University

April 24, 2007

CS678

Background

Unsupervised learning

Patterns/structures in unlabeled data

Goal

Determine intrinsic data grouping

Data subsets share common trait

Applications

Bioinformatics

Group homologous
sequences

Image Classification

Medical diagnosis

Marketing

Customer behavior

Distance Measure

Quantify similarity of 2 data points

Influences shape of clusters

Common measures:

Euclidean

Manhattan

Mahalanobis

Euclidean Distance

“As the crow flies” distance

For 2 points P = (x1, y1) and Q = (x2, y2),

distance = √((x2 − x1)² + (y2 − y1)²)

Image source: http://en.wikipedia.org/wiki/Euclidean_distance,

http://en.wikipedia.org/wiki/Manhattan_distance

Euclidean Distance

Distance between objects not affected by
new objects (possible outliers)
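A minimal sketch of the straight-line formula above, in pure Python (the sample points are illustrative only):

```python
import math

# Straight-line ("as the crow flies") distance between two points,
# given as coordinate tuples of equal length.
def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

euclidean((1, 2), (4, 6))  # → 5.0  (a 3-4-5 right triangle)
```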

Manhattan Distance

“City-block” distance

For points P = (x1, y1) and Q = (x2, y2),

distance = |x1 − x2| + |y1 − y2|

Image source: http://en.wikipedia.org/wiki/Manhattan_distance

Manhattan Distance

Dampens effect of outliers

Differences are not squared
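The city-block formula is a one-liner; since coordinate differences are summed rather than squared, a single large difference counts less than it would under Euclidean distance (sample points are illustrative):

```python
# City-block distance: sum of absolute coordinate differences.
def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

manhattan((1, 2), (4, 6))  # → 7  (3 blocks east + 4 blocks north)
```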

Mahalanobis Distance

Distance between 2 points x and y, where S is the covariance matrix of the data:

distance = √((x − y)ᵀ S⁻¹ (x − y))

Image source: http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_mahalanobis.htm

Mahalanobis Distance

Not dependent on scales of measurement
used
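A pure-Python sketch for the 2-D case (the sample data set is a made-up illustration). Because the covariance matrix is estimated from the data and inverted, rescaling the measurements rescales S as well, and the distance comes out the same:

```python
# Mahalanobis distance sketch for 2-D points:
#   d(x, y) = sqrt((x - y)^T S^-1 (x - y)),  S = sample covariance matrix.

def covariance_2d(data):
    n = len(data)
    mx = sum(p[0] for p in data) / n
    my = sum(p[1] for p in data) / n
    sxx = sum((p[0] - mx) ** 2 for p in data) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in data) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in data) / (n - 1)
    return sxx, sxy, syy

def mahalanobis_2d(x, y, data):
    sxx, sxy, syy = covariance_2d(data)
    det = sxx * syy - sxy * sxy              # invert the 2x2 covariance matrix
    ixx, ixy, iyy = syy / det, -sxy / det, sxx / det
    dx, dy = x[0] - y[0], x[1] - y[1]
    return (dx * (ixx * dx + ixy * dy) + dy * (ixy * dx + iyy * dy)) ** 0.5

# Hypothetical sample; scaling it by 10 leaves the distance unchanged.
sample = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0), (0, 2)]
```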

Algorithm Classifications

Exclusive Clustering

K-means

Overlapping Clustering

Fuzzy C-means

Hierarchical Clustering

Hierarchical

K-Means

1. Define K centroids, each represents a
cluster

2. Each point assigned to nearest centroid

3. After all assignments, adjust centroid positions

4. Repeat 2-3 until centroids stop moving
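The four steps above can be sketched in pure Python. Here the data set and the initial centroids are illustrative assumptions (the algorithm normally seeds centroids randomly, which is why results can differ between runs):

```python
def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def kmeans(points, centroids, max_iter=100):
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda k: euclidean(p, centroids[k]))
            clusters[j].append(p)
        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = [
            tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[k]
            for k, cl in enumerate(clusters)
        ]
        # Step 4: stop when the centroids stop moving
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Two obvious groups, around (0, 0) and (10, 10)
data = [(0, 0), (1, 1), (0, 1), (10, 10), (11, 11), (10, 11)]
centroids, clusters = kmeans(data, centroids=[(0, 0), (10, 10)])
```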

K-Means

Image source: http://people.scs.fsu.edu/~burkardt/f_src/kmeans/test12_clusters_equal.png

K-Means

Pros

Simplicity

Speed

Cons

Random initial assignments, so results differ from run to run

What is K?

Fuzzy C-means

1. Define K centroids, each represents a cluster

2. Each point assigned random coefficients for
being in the clusters

3. Calculate each centroid

4. For each point, compute its coefficients

5. Repeat 3-4 until coefficients stop changing
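A pure-Python sketch of the steps above, with fuzziness m = 2. The data set is illustrative, and the initial coefficients are seeded deterministically here for reproducibility, where the slides call for random ones:

```python
def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def fuzzy_cmeans(points, K, m=2.0, iters=100, tol=1e-6):
    n = len(points)
    # Step 2: membership coefficients u[k][p] (deterministic seed here)
    u = [[1.0 if p % K == k else 0.0 for p in range(n)] for k in range(K)]
    for _ in range(iters):
        # Step 3: each centroid is the u^m-weighted mean of all points
        centroids = []
        for k in range(K):
            w = [u[k][p] ** m for p in range(n)]
            tot = sum(w)
            centroids.append(tuple(
                sum(w[p] * points[p][d] for p in range(n)) / tot
                for d in range(len(points[0]))))
        # Step 4: recompute each point's coefficients from its distances
        new_u = [[0.0] * n for _ in range(K)]
        for p in range(n):
            d = [max(dist(points[p], centroids[k]), 1e-12) for k in range(K)]
            for k in range(K):
                new_u[k][p] = 1.0 / sum((d[k] / d[j]) ** (2.0 / (m - 1))
                                        for j in range(K))
        # Step 5: stop when the coefficients stop changing
        delta = max(abs(new_u[k][p] - u[k][p]) for k in range(K) for p in range(n))
        u = new_u
        if delta < tol:
            break
    return centroids, u

data = [(0, 0), (10, 10), (1, 1), (11, 11), (0, 1), (10, 11)]
centroids, u = fuzzy_cmeans(data, K=2)
```

Each point belongs to every cluster with some coefficient; the coefficients for a point sum to 1.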

Fuzzy C-means

Pros

Points “on the edge” have less influence on
cluster location

Cons

Increased computational complexity

Random initial assignments, so results differ from run to run

Hierarchical

1. Assign each point to a cluster; N points = N
clusters

2. Find the closest pair of clusters and merge
into a single cluster

3. Compute distances between the new cluster
and each of the old clusters

4. Repeat 2-3 until all points are clustered into a single cluster of size N
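The steps above can be sketched in pure Python using the single-link cluster distance (closest pair of members); the three sample points are illustrative. The recorded merges are the information a dendrogram is drawn from:

```python
def dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_link(points):
    # Step 1: one cluster per point (N points = N clusters)
    clusters = [[p] for p in points]
    merges = []
    # Step 4: repeat until one cluster of size N remains
    while len(clusters) > 1:
        # Step 2: find the closest pair of clusters
        # (single-link: distance between the closest members)
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Step 3: merge the pair into a single cluster
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

merges = single_link([(0, 0), (1, 0), (10, 0)])
```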

Hierarchical

Single-link: similarity between clusters = greatest similarity from any member of one cluster to any member of the other

Complete-link: least similarity from any member of one cluster to any member of the other

Average-link: average similarity over all pairs of members, one from each cluster

Hierarchical

Pros

Hierarchy, not amorphous group collection

Cons

O(n²)

Difficult to interpret, complex

Choosing K

Elbow criterion

Add clusters only while each new cluster still gives a large drop in within-cluster variance; choose K at the "elbow" where the drop levels off
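A hypothetical elbow sketch: run K-means for several values of K and compare the within-cluster sum of squares (WCSS). The data and the per-K seed centroids below are illustrative assumptions (real runs seed randomly); the "elbow" shows up as WCSS dropping sharply up to the true K and only slightly after it:

```python
def kmeans_wcss(points, centroids, iters=100):
    for _ in range(iters):
        # assign each point to its nearest centroid (squared distance)
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)),
                    key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            clusters[j].append(p)
        new = [tuple(sum(c) / len(c) for c in zip(*cl)) if cl else centroids[k]
               for k, cl in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    # within-cluster sum of squared distances to the final centroids
    return sum(sum((a - b) ** 2 for a, b in zip(p, centroids[k]))
               for k, cl in enumerate(clusters) for p in cl)

data = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
seeds = {1: [(5, 5)], 2: [(0, 0), (10, 10)], 3: [(0, 0), (10, 10), (1, 0)]}
wcss = {k: kmeans_wcss(data, seeds[k]) for k in seeds}
# WCSS falls steeply from K=1 to K=2, then barely from K=2 to K=3:
# the elbow suggests K = 2 for this data.
```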

Some Free Tools

Cluster / TreeView

Hierarchical, K-means

http://rana.lbl.gov/EisenSoftware.htm

FuzzyK / Maple Tree

Fuzzy c-means

http://rana.lbl.gov/EisenSoftware.htm

Cluster / TreeView

Image source: http://genomebiology.com/content/figures/gb-2004-5-9-r66-2.jpg

References

Alpaydin, E., Introduction to Machine Learning. MIT Press, 2004.

Clustering. Retrieved on April 22, 2007 from

Cluster Analysis. Retrieved on April 21, 2007 from
http://www.statsoft.com/textbook/stcluan.html.

Data Clustering. Retrieved on April 22, 2007 from
http://en.wikipedia.org/wiki/Data_clustering

Euclidean Distances. Retrieved on April 20, 2007 from
http://en.wikipedia.org/wiki/Euclidean_distances

K-Means and Hierarchical Clustering. Retrieved on April 22, 2007 from
http://www.autonlab.org/tutorials/kmeans11.pdf

Manhattan Distance. Retrieved on April 22, 2007 from
http://en.wikipedia.org/wiki/Manhattan_distance

Mahalanobis Distance. Retrieved on April 23, 2007 from
http://www.aiaccess.net/English/Glossaries/GlosMod/e_gm_mahalanobis.htm.

Questions?