Clustering Algorithms


Clustering Algorithms

Sunida Ratanothayanon





What is Clustering?

Clustering


Clustering is a classification technique that divides data into groups in a meaningful and useful way.

It is an unsupervised classification method.


Outline


K-Means Algorithm

Hierarchical Clustering Algorithm

K-Means Algorithm

A partitional clustering algorithm.

Produces k clusters (the number k is specified by the user).

Each cluster has a cluster center, called the centroid.

The algorithm iteratively groups the data into k clusters based on a distance function.

K-Means Algorithm

The centroid is obtained as the mean of all data points in the cluster.

Stop when the centers no longer change.
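To make the procedure concrete, here is a minimal K-Means sketch in Python (NumPy only). The function name and arguments are illustrative, and it assumes no cluster ever becomes empty. With the five example points and the initial centers used below, it converges to the same centers found by hand.

```python
import numpy as np

def k_means(data, initial_centers, max_iter=100):
    """Minimal K-Means sketch; stops when the centers no longer change."""
    centers = np.asarray(initial_centers, dtype=float)
    for _ in range(max_iter):
        # Euclidean distance from every point to every center.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        # Assign each point to its closest center.
        labels = dists.argmin(axis=1)
        # Recompute each center as the mean of the points assigned to it
        # (assumes no cluster ends up empty).
        new_centers = np.array([data[labels == k].mean(axis=0)
                                for k in range(len(centers))])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```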

A numerical example

K-Means example

Data Point   x1   x2
1            22   21
2            19   20
3            18   22
4             1    3
5             4    2

We have five data points, each with two attributes.

We want to group the data into two clusters (k = 2).

K-Means example

We can plot the five data points as follows.

[Plot of the 5 data points over x1 and x2, with the points grouped into clusters C1 and C2.]
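For reference, a small matplotlib sketch that reproduces this scatter plot (the cluster grouping itself is computed in the following steps):

```python
import matplotlib.pyplot as plt

points = [(22, 21), (19, 20), (18, 22), (1, 3), (4, 2)]
xs, ys = zip(*points)
plt.scatter(xs, ys)
plt.xlim(0, 25)
plt.ylim(0, 25)
plt.xlabel("x1")
plt.ylabel("x2")
plt.title("Plot of 5 data points over x1 and x2")
plt.show()
```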
K-Means example (1st iteration)

Step 1: Choosing the initial centers and defining k


C1 = (18, 22), C2 = (4, 2)


Step 2: Computing cluster centers

We have already defined C1 and C2.

Step 3: Finding the Euclidean distance of each data point from each center and assigning each data point to a cluster



$d(\mathbf{x}, \mathbf{y}) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$


K-Means example (1st iteration)



Step 3 (cont.): Distance table for all data points

Data Point   C1 (18, 22)   C2 (4, 2)
(22, 21)     4.13          26.9
(19, 20)     2.23          23.43
(18, 22)     0             24.41
(1, 3)       25.49         3.1
(4, 2)       24.41         0

Then we assign each data point to a cluster by comparing its distances to the centers; each data point is assigned to its closest cluster.
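A short sketch of this step in Python (NumPy only): it computes the distance of every point to C1 and C2 and assigns each point to the nearest center.

```python
import numpy as np

points = np.array([[22, 21], [19, 20], [18, 22], [1, 3], [4, 2]], dtype=float)
centers = np.array([[18, 22], [4, 2]], dtype=float)  # C1, C2

# Euclidean distance of every point to every center.
dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
labels = dists.argmin(axis=1)  # 0 -> C1, 1 -> C2

print(np.round(dists, 2))  # distance table
print(labels)              # [0 0 0 1 1]
```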

K-Means example (2nd iteration)

Step 2: Computing cluster centers

We now compute new cluster centers.

The members of cluster 1 are (22, 21), (19, 20), and (18, 22). We take the average of these data points:


C1 = ( (22 + 19 + 18) / 3, (21 + 20 + 22) / 3 ) = ( 59/3, 63/3 ) = (19.7, 21)

The members of cluster 2 are (1, 3) and (4, 2):

C2 = ( (1 + 4) / 2, (3 + 2) / 2 ) = ( 5/2, 5/2 ) = (2.5, 2.5)
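In code, this recomputation is just a mean over each cluster's members (a small sketch, NumPy only):

```python
import numpy as np

cluster1 = np.array([[22, 21], [19, 20], [18, 22]], dtype=float)
cluster2 = np.array([[1, 3], [4, 2]], dtype=float)

c1 = cluster1.mean(axis=0)  # array([19.67, 21.]) ~ (19.7, 21)
c2 = cluster2.mean(axis=0)  # array([2.5, 2.5])
```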

K-Means example (2nd iteration)

Step 3: Finding the Euclidean distance of each data point from each center and assigning each data point to a cluster

Distance table for all data points with the new centers:

Data Point   C1' (19.7, 21)   C2' (2.5, 2.5)
(22, 21)     2.3              26.88
(19, 20)     1.22             24.05
(18, 22)     1.97             24.91
(1, 3)       25.96            1.58
(4, 2)       24.65            1.58

Assign each data point to a cluster by comparing its distances to the centers; each data point is assigned to its closest cluster.

Repeat steps 2 and 3 for the next iteration, because the centers have changed.


K-Means example (3rd iteration)

Step 2: Computing cluster centers

We again compute the cluster centers.

The members of cluster 1 are still (22, 21), (19, 20), and (18, 22). We take the average of these data points:

C1 = ( (22 + 19 + 18) / 3, (21 + 20 + 22) / 3 ) = (19.7, 21)

The members of cluster 2 are still (1, 3) and (4, 2):

C2 = ( (1 + 4) / 2, (3 + 2) / 2 ) = (2.5, 2.5)


K-Means example (3rd iteration)



Step 3: Finding the Euclidean distance of each data point from each center and assigning each data point to a cluster

Distance table for all data points with the new centers:

Data Point   C1'' (19.7, 21)   C2'' (2.5, 2.5)
(22, 21)     2.3               26.88
(19, 20)     1.22              24.05
(18, 22)     1.97              24.91
(1, 3)       25.96             1.58
(4, 2)       24.65             1.58

Assign each data point to its closest cluster; the assignments do not change.

Stop the algorithm because the centers remain the same.
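The same result can be reproduced with an off-the-shelf implementation. Below is a sketch using scikit-learn's KMeans, seeded with the same initial centers; scikit-learn is not part of the original example and is assumed to be installed.

```python
import numpy as np
from sklearn.cluster import KMeans

data = np.array([[22, 21], [19, 20], [18, 22], [1, 3], [4, 2]], dtype=float)
init_centers = np.array([[18, 22], [4, 2]], dtype=float)

# n_init=1 because we pass explicit initial centers.
km = KMeans(n_clusters=2, init=init_centers, n_init=1).fit(data)
print(km.cluster_centers_)  # approximately [[19.67, 21.0], [2.5, 2.5]]
print(km.labels_)           # e.g. [0 0 0 1 1]
```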

Hierarchical Clustering Algorithm


Produces a nested sequence of clusters, like a tree.

Allows subclusters.

Individual data points at the bottom of the tree are called "singleton clusters".


[Example dendrogram over the data points A, B, C, D, E]
Hierarchical Clustering Algorithm


Agglomerative method

The tree is built up from the bottom level; at each level the nearest pair of clusters is merged to move one level up.

Continue until all the data points are merged into a single cluster.
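A minimal Python sketch of this bottom-up procedure, using the same averaging rule as the worked example that follows (the distance from a merged cluster to any other cluster is the average of the two old distances). The function name and the "&" naming are illustrative. With the five example points below, the merges occur in the order C&E, A&B, then D, then all points, matching the worked example.

```python
import numpy as np

def agglomerative(points, names):
    """Merge the closest pair of clusters until only one cluster remains."""
    names = list(names)
    # Pairwise Euclidean distances between the singleton clusters.
    dist = {frozenset((a, b)): float(np.linalg.norm(points[i] - points[j]))
            for i, a in enumerate(names) for j, b in enumerate(names) if i < j}
    while len(names) > 1:
        pair, d = min(dist.items(), key=lambda kv: kv[1])
        a, b = sorted(pair)
        merged = a + "&" + b
        print(f"merge {a} and {b} at distance {d:.2f}")
        for c in names:
            if c in (a, b):
                continue
            # Distance from the merged cluster to c: average of the old distances.
            dist[frozenset((merged, c))] = (dist[frozenset((a, c))] +
                                            dist[frozenset((b, c))]) / 2
        names = [c for c in names if c not in (a, b)] + [merged]
        dist = {k: v for k, v in dist.items() if a not in k and b not in k}
    return names[0]
```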


A numerical example

Hierarchical Clustering example

We have five data points, each with three attributes.

Data Point   x1   x2   x3
A             9    3    7
B            10    2    9
C             1    9    4
D             6    5    5
E             1   10    3

Hierarchical Clustering example (1st iteration)



Step 1: Calculating the Euclidean distance between every pair of data points

We obtain the following distance table:

Data Point     A (9, 3, 7)   B (10, 2, 9)   C (1, 9, 4)   D (6, 5, 5)   E (1, 10, 3)
A (9, 3, 7)    0             2.45           10.44         4.12          11.36
B (10, 2, 9)   -             0              12.45         6.4           13.45
C (1, 9, 4)    -             -              0             6.48          1.41
D (6, 5, 5)    -             -              -             0             7.35
E (1, 10, 3)   -             -              -             -             0
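A short sketch of Step 1 in Python (NumPy only), computing the full 5 x 5 distance matrix at once:

```python
import numpy as np

points = np.array([[9, 3, 7],    # A
                   [10, 2, 9],   # B
                   [1, 9, 4],    # C
                   [6, 5, 5],    # D
                   [1, 10, 3]],  # E
                  dtype=float)

dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
print(np.round(dist, 2))  # symmetric 5x5 matrix; e.g. d(A, B) = 2.45, d(C, E) = 1.41
```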

Hierarchical Clustering example (1st iteration)

Step 2: Forming a tree

Consider the most similar pair of data points in the previous distance table.



C and E are the most similar (distance 1.41).

We obtain the first cluster: C&E.

Repeat steps 1 and 2 until all data points are merged into a single cluster.

Hierarchical Clustering example (2nd iteration)


Step 1: Calculating the Euclidean distances between clusters

We redraw the distance table, with the two merged entities combined as C&E.

Data Point          A (9, 3, 7)   B (10, 2, 9)   D (6, 5, 5)   C&E
A (9, 3, 7)         0             2.45           4.12          10.9
B (10, 2, 9)        -             0              6.4           12.95
D (6, 5, 5)         -             -              0             6.90
C&E (1, 9.5, 3.5)   -             -              -             0


The distance from C&E to A can be obtained from the previous table, using the distances from C to A and from E to A:

d(C&E, A) = avg( d(C, A), d(E, A) ) = avg(10.44, 11.36) = 10.9
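The same update in a few lines of Python (the pairwise distances are read from the previous table):

```python
# Distances read from the previous 5x5 distance table.
d = {("C", "A"): 10.44, ("E", "A"): 11.36,
     ("C", "B"): 12.45, ("E", "B"): 13.45,
     ("C", "D"): 6.48,  ("E", "D"): 7.35}

# Distance from the merged cluster C&E to each remaining point.
for p in ("A", "B", "D"):
    print(p, (d[("C", p)] + d[("E", p)]) / 2)  # 10.9, 12.95, ~6.9
```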

Hierarchical Clustering example (2nd iteration)

Step 2: Forming a tree

Consider the most similar pair in the previous distance table.

A and B are the most similar (distance 2.45).

We obtain the second cluster: A&B.

Repeat steps 1 and 2 until all data points are merged into a single cluster.


From the previous table, we can obtain the distances needed for the new distance table as follows.

Hierarchical Clustering example (3rd iteration)



Step 1: Calculating the Euclidean distances between clusters

We redraw the distance table with the merged entities C&E and A&B.

Data Point    A&B   D (6, 5, 5)   C&E
A&B           0     5.26          11.93
D (6, 5, 5)   -     0             6.9
C&E           -     -             0

d(A&B, D) = avg( d(A, D), d(B, D) ) = avg(4.12, 6.40) = 5.26
d(C&E, A&B) = avg( d(C&E, A), d(C&E, B) ) = avg(10.9, 12.95) = 11.93
d(C&E, D) = 6.90 (unchanged)

Hierarchical Clustering example (3rd iteration)

Step 2: Forming a tree

Consider the most similar pair in the previous distance table.

A&B and D are the most similar (distance 5.26).

We obtain the new cluster: A&B&D.

Repeat steps 1 and 2 until all data points are merged into a single cluster.


From the previous table, we can obtain the distance from cluster A&B&D to C&E as follows.

Hierarchical Clustering example (4th iteration)



Step 1: Calculating the Euclidean distances between clusters

We redraw the distance table with the merged entities C&E and A&B&D.

Data Point   A&B&D   C&E
A&B&D        0       9.4
C&E          -       0

d(A&B&D, C&E) = avg( d(A&B, C&E), d(D, C&E) ) = avg(11.93, 6.9) = 9.4
Hierarchical Clustering example (4th iteration)

Step 2: Forming a tree

Only one pair of clusters remains, so no further recalculation is needed and we can form the final tree.

We merge all data points into a single cluster, A&B&D&C&E.

Stop the algorithm.


[Final dendrogram: C and E merge first, then A and B; D joins A&B; finally C&E joins A&B&D.]
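For comparison, the whole hierarchy can be reproduced with SciPy (a sketch, assuming scipy and matplotlib are installed). SciPy's average linkage averages over all point pairs, so the height of the final merge differs slightly from the step-by-step averaging used above, but the merge order (C&E, A&B, then D, then everything) is the same.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

points = np.array([[9, 3, 7], [10, 2, 9], [1, 9, 4], [6, 5, 5], [1, 10, 3]],
                  dtype=float)

Z = linkage(points, method="average")       # average-linkage hierarchy
dendrogram(Z, labels=["A", "B", "C", "D", "E"])
plt.show()
```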
Conclusion


Two major clustering algorithms were presented.

K-Means algorithm

An algorithm that iteratively groups data into k clusters based on a distance function.

The number of clusters k is specified by the user.

Hierarchical Clustering algorithm

Produces a nested sequence of clusters, like a tree.

The tree is built up from the bottom level, and the process continues until all data points are merged into a single cluster.






Thank you