k-means

geographertongues, Artificial Intelligence and Robotics

30 Nov 2013

CS281B Winter02

Yan Wang and Lihua Lin


K-means clustering



What are clustering algorithms?

What is clustering?




Clustering is a method by which a large set of data is
grouped into smaller clusters of similar data.


Example: balls of the same color are clustered into a group
(figure not reproduced).




Thus, clustering means grouping data, or dividing a large
data set into smaller data sets whose members are similar.



What is a clustering algorithm?



A clustering algorithm attempts to find natural groups of
components (or data) based on some similarity.



The clustering algorithm also finds the centroid of each
group of data.

The centroid of a cluster is a point whose parameter values are
the mean of the parameter values of all the points in the
cluster.
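
The centroid definition above can be illustrated with a minimal sketch. The function name `centroid` and the sample points are assumptions made for illustration; they do not come from the slides.

```python
from typing import List, Sequence


def centroid(points: Sequence[Sequence[float]]) -> List[float]:
    """Return the coordinate-wise mean of a non-empty collection of points."""
    if not points:
        raise ValueError("a centroid needs at least one point")
    dim = len(points[0])
    # Average each parameter (coordinate) over all points in the cluster.
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


cluster = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
print(centroid(cluster))  # [3.0, 4.0]
```

Note that the centroid need not coincide with any point of the cluster; it is only the average position.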



What is the common metric for clustering techniques?



Generally, the distance between two points is taken as a
common metric to assess the similarity among the components
of a population. The most commonly used distance measure is
the Euclidean metric, which defines the distance between two
points p = (p_1, p_2, ...) and q = (q_1, q_2, ...) as:

$$d = \sqrt{\sum_{i=1}^{k} (p_i - q_i)^2}$$
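
As a small illustration of the Euclidean metric, the following sketch computes the distance between two points of equal dimension. The function name `euclidean` is an assumption made for this example.

```python
import math
from typing import Sequence


def euclidean(p: Sequence[float], q: Sequence[float]) -> float:
    """Euclidean distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same dimension")
    # Square root of the summed squared coordinate differences.
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))


print(euclidean((0.0, 0.0), (3.0, 4.0)))  # 5.0
```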





Uses of clustering algorithms



Engineering sciences: pattern recognition, artificial intelligence,
cybernetics etc. Typical examples to which clustering has been
applied include handwritten characters, samples of speech,
fingerprints, and pictures.



Life sciences (biology, botany, zoology, entomology, cytology,
microbiology): the objects of analysis are life forms such as
plants, animals, and insects.



Information, policy and decision sciences: applications of
cluster analysis to documents include votes on political
issues, market surveys, product surveys, surveys of sales
programs, and R&D.




Types of clustering algorithms



The various clustering concepts available can be grouped into
two broad categories:

Hierarchical methods

Minimum spanning tree method (Fig.)

Nonhierarchical methods

K-means algorithm






K-Means Clustering Algorithm


Definition:






This nonhierarchical method initially takes a number of
components of the population equal to the final required
number of clusters. In this step, these initial components
are chosen such that the points are mutually farthest apart.
Next, it examines each remaining component in the population
and assigns it to one of the clusters, namely the one whose
centroid is at the minimum distance. The centroid's position
is recalculated every time a component is added to the
cluster, and this continues until all the components are
grouped into the final required number of clusters.
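
The definition above can be sketched as a single pass over the data. This is a minimal sketch under stated assumptions, not the lecturers' implementation: the slides do not say how the "mutually farthest apart" seeds are found, so the greedy rule in `farthest_seeds` (start from the first point, then repeatedly take the point farthest from the seeds chosen so far) is an assumption, as are all function names.

```python
import math
from typing import List, Sequence, Tuple

Point = Sequence[float]


def dist(p: Point, q: Point) -> float:
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def farthest_seeds(points: Sequence[Point], k: int) -> List[int]:
    """Assumed seeding rule: greedily pick k mutually distant points."""
    seeds = [0]  # assumption: start from the first point
    while len(seeds) < k:
        candidates = (i for i in range(len(points)) if i not in seeds)
        # Take the point whose nearest seed is as far away as possible.
        best = max(candidates,
                   key=lambda i: min(dist(points[i], points[s]) for s in seeds))
        seeds.append(best)
    return seeds


def sequential_kmeans(points: Sequence[Point],
                      k: int) -> Tuple[List[int], List[List[float]]]:
    """One pass: assign each point to the nearest centroid, then update it."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    seeds = farthest_seeds(points, k)
    centroids = [list(points[s]) for s in seeds]
    counts = [1] * k
    labels = [-1] * len(points)
    for c, s in enumerate(seeds):
        labels[s] = c
    for i, p in enumerate(points):
        if labels[i] != -1:
            continue  # seed points are already placed
        c = min(range(k), key=lambda j: dist(p, centroids[j]))
        counts[c] += 1
        # Running mean: recompute the centroid as soon as a point is added.
        centroids[c] = [m + (x - m) / counts[c] for m, x in zip(centroids[c], p)]
        labels[i] = c
    return labels, centroids


data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = sequential_kmeans(data, 2)
print(labels)     # [0, 0, 1, 1]
print(centroids)  # [[0.0, 0.5], [10.0, 10.5]]
```

Because each point is visited only once, the result depends on the order of the data; iterating the assignment step until nothing changes removes most of that dependence.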



The parameters and options for the k-means algorithm




Initialization: different initialization methods.

Distance measure: different distance measures can be used
(Manhattan distance and Euclidean distance).

Termination: k-means should terminate when no more pixels
are changing classes.

Quality: the quality of the results provided by k-means
classification.

Parallelism: there are several ways to parallelize the k-means
algorithm.

What to do with dead classes: a class is "dead" if no pixels
belong to it.

Variants: one-pass, on-the-fly calculation of means.

Number of classes: the number of classes is usually given as an
input variable.
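
Several of the options listed above can be seen together in a short sketch of iterative k-means. Everything here is an illustrative assumption rather than the slides' own code: the function names, the random initialization, and the policy of re-seeding a dead class at a randomly chosen point. With Manhattan distance the mean is no longer the point that minimizes the summed distance (the coordinate-wise median is), so that option is shown only to illustrate a pluggable distance measure.

```python
import random
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]


def euclidean(p: Point, q: Point) -> float:
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


def manhattan(p: Point, q: Point) -> float:
    return sum(abs(a - b) for a, b in zip(p, q))


def kmeans(points: Sequence[Point], k: int,
           distance: Callable[[Point, Point], float] = euclidean,
           max_iter: int = 100, seed: int = 0) -> Tuple[List[int], List[List[float]]]:
    """Iterate assignment and mean update until no point changes class."""
    rng = random.Random(seed)
    pts = list(points)
    # Initialization (assumed here: k distinct points drawn at random).
    centroids = [list(p) for p in rng.sample(pts, k)]
    labels = [-1] * len(pts)
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: distance(p, centroids[j]))
                      for p in pts]
        if new_labels == labels:
            break  # termination: no point changed class
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(pts, labels) if lab == j]
            if not members:
                # Dead class: re-seed at a random point (one possible policy).
                centroids[j] = list(rng.choice(pts))
                continue
            centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


data = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centroids = kmeans(data, k=2, distance=manhattan)
print(labels, centroids)
```

The number of classes `k` is an input here, matching the last item in the list; which class receives label 0 depends on the random initialization.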


Comments on the K-means Method

Strengths of k-means:

Relatively efficient: O(tkn), where n is the number of objects,
k is the number of clusters, and t is the number of iterations.
Normally, k, t << n.

Often terminates at a local optimum.

Weaknesses of k-means:

Applicable only when the mean is defined; what about
categorical data?

Need to specify k, the number of clusters, in advance.

Unable to handle noisy data and outliers.

Not suitable for discovering clusters with non-convex shapes.
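
The sensitivity to outliers follows directly from the use of the mean. A minimal one-dimensional sketch, with made-up values chosen only for illustration, shows how a single distant point drags the centroid away from the bulk of the cluster:

```python
from statistics import mean

cluster = [1.0, 2.0, 3.0, 4.0]        # hypothetical one-dimensional cluster
with_outlier = cluster + [100.0]      # the same cluster plus one outlier

print(mean(cluster))       # 2.5
print(mean(with_outlier))  # 22.0
```

With the outlier included, the centroid lies far from every ordinary member of the cluster, which in turn distorts the distance-based assignment of other points.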


Direct k-means clustering algorithm


Demo (I)

2 initial clusters


Demo (I)

2-means clustering


Demo (II)


Init Method: Random


Demo (II)


Init Method: Linear


Demo (II)


Init Method: Cube


Demo (II)


Init Method: Statistics


Demo (II)


Init Method: Possibility