Clusteringin Data Mining

naivenorthAI and Robotics

Nov 8, 2013 (3 years and 11 months ago)

95 views

Clustering

in Data Mining

CS 157B, spring 2007


Phuong Tran

What is Data Clustering


Classification of objects into different
groups.


Objects in each subset share some
common trait.


Useful technique for Data Analysis and
Data Mining.



Clustering Examples


Biology: clustering is used to group
homologous (similar) DNA sequences into
gene families.


Market Research:partition the general
population into market segments and to
better understand the relationships
between different groups of consumers.


WWW Search: division of web
pages/documents into genres.

Types of Clustering


Hierarchichal
: Successively
determine new clusters from
previously determined clusters
(parent/child clusters).



Partitional
: Establish all clusters at
once, at the same level.

Creating Clusters:

Break Up vs Build Up


Break Up
: start from the bottom of the tree, divide
the general population into smaller and smaller
clusters.




Build Up

(Agglomerative
Clustering): start from
the top of the tree,
merge individual
elements into larger and
larger clusters.

Distance Measure


Clustering is about finding “similarity”.


To find how similar two objects are, one
needs distance measure.


Similar objects (same cluster) should be
close to one another (short distance).



Distance Measure


Many ways to define distance measure.


Some elements may be close according to
one distance measure and further away
according to another.


Select a good distance measure is an
important step in clustering.

Some Distance Functions


Euclidean

distance (2
-
norm): the most
commonly used, also called “crow
distance”.


Manhattan

distance (1
-
norm): also called
“taxicab distance”.


In general:
Minkowski

Metric (p
-
norm):

K
-
Means Clustering


Separate the objects (data points) into K clusters.


Cluster center (centroid) = the average of all the
data points in the cluster.


Assigns each data point to the cluster whose
centroid is nearest (using distance function.)

K
-
Means Algorithm

1.
Place K points into the space of the objects
being clustered. They represent the initial group
centroids.


2.
Assign each object to the group that has the
closest centroid.


3.
Recalculate the positions of the K centroids.


4.
Repeat Steps 2 & 3 until the group centroids no
longer move.


K
-
Means
Algorithm:

Example
Output

References


Data Clustering
-

Wikipedia


http://en.wikipedia.org/wiki/Data_clustering


A Tutorial on Clustering Algorithms


http://www.elet.polimi.it/upload/matteucc/Clustering/tutorial_html/


The K
-
Means Data Clustering Problem


http://people.scs.fsu.edu/~burkardt/f_src/kmeans/kmeans.html