# Clustering Algorithms

AI and Robotics, Nov 8, 2013

Sunida Ratanothayanon

What is Clustering?

Clustering

Clustering is a classification technique that divides data into groups in a meaningful and useful way.

It is an unsupervised classification technique.

Outline

K-Means Algorithm

Hierarchical Clustering Algorithm

K-Means Algorithm

A partitional clustering algorithm.

Produces k clusters (the number k is specified by the user).

Each cluster has a cluster center called the centroid.

The algorithm iteratively groups data into k clusters based on a distance function.

K-Means Algorithm

The centroid is obtained as the mean of all data points in the cluster.

Stop when the centers no longer change.

A numerical example

K-Means example

We have five data points with 2 attributes, and we want to group the data into 2 clusters (k = 2).

| Data Point | x1 | x2 |
|------------|----|----|
| 1 | 22 | 21 |
| 2 | 19 | 20 |
| 3 | 18 | 22 |
| 4 | 1  | 3  |
| 5 | 4  | 2  |

K-Means example

We can plot the five data points as follows.

[Figure: plot of the 5 data points over x1 and x2, with cluster C1 and cluster C2 marked]
K-Means example (1st iteration)

Step 1: Choosing the centers and defining k

| Data Point | x1 | x2 |
|------------|----|----|
| 1 | 22 | 21 |
| 2 | 19 | 20 |
| 3 | 18 | 22 |
| 4 | 1  | 3  |
| 5 | 4  | 2  |

C1 = (18, 22), C2 = (4, 2)

Step 2: Computing cluster centers

We have already defined C1 and C2.

Step 3: Finding the Euclidean distance of each data point from each center and assigning each data point to a cluster

$$d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$$

K-Means example (1st iteration)

Step 3 (cont.): Distance table for all data points

| Data Point | C1 (18,22) | C2 (4,2) |
|------------|------------|----------|
| (22,21) | 4.13 | 26.17 |
| (19,20) | 2.23 | 23.43 |
| (18,22) | 0 | 24.41 |
| (1,3) | 25.49 | 3.16 |
| (4,2) | 24.41 | 0 |

Then we assign each data point to a cluster by comparing its distances to the centers. Each data point is assigned to its closest cluster.
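The table above can be reproduced with a short script. This is a sketch, not part of the original slides; the variable names are illustrative.

```python
# Compute the Euclidean distance from each data point to the two chosen
# centers, C1 = (18, 22) and C2 = (4, 2), and report the nearest center.
import math

points = [(22, 21), (19, 20), (18, 22), (1, 3), (4, 2)]
centers = {"C1": (18, 22), "C2": (4, 2)}

for p in points:
    dists = {name: math.dist(p, c) for name, c in centers.items()}
    nearest = min(dists, key=dists.get)
    print(p, {n: round(d, 2) for n, d in dists.items()}, "->", nearest)
```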

K-Means example (2nd iteration)

Step 2: Computing cluster centers

We will compute new cluster centers.

The members of cluster 1 are (22,21), (19,20), and (18,22). We take the average of these data points:

$$C_1 = \left(\frac{22+19+18}{3}, \frac{21+20+22}{3}\right) = \left(\frac{59}{3}, \frac{63}{3}\right) = (19.7, 21)$$

The members of cluster 2 are (1,3) and (4,2):

$$C_2 = \left(\frac{1+4}{2}, \frac{3+2}{2}\right) = \left(\frac{5}{2}, \frac{5}{2}\right) = (2.5, 2.5)$$

K-Means example (2nd iteration)

Step 3: Finding the Euclidean distance of each data point from each center and assigning each data point to a cluster

Distance table for all data points with the new centers:

| Data Point | C1' (19.7,21) | C2' (2.5,2.5) |
|------------|---------------|---------------|
| (22,21) | 2.3 | 26.88 |
| (19,20) | 1.22 | 24.05 |
| (18,22) | 1.97 | 24.91 |
| (1,3) | 25.96 | 1.58 |
| (4,2) | 24.65 | 1.58 |

Assign each data point to the cluster whose center is closest.

Repeat steps 2 and 3 for the next iteration because the centers have changed.


K-Means example (3rd iteration)

Step 2: Computing cluster centers

We compute new cluster centers. The members of cluster 1 are still (22,21), (19,20), and (18,22), so

$$C_1 = \left(\frac{22+19+18}{3}, \frac{21+20+22}{3}\right) = (19.7, 21)$$

The members of cluster 2 are still (1,3) and (4,2), so

$$C_2 = \left(\frac{1+4}{2}, \frac{3+2}{2}\right) = (2.5, 2.5)$$

K-Means example (3rd iteration)

Step 3: Finding the Euclidean distance of each data point from each center and assigning each data point to a cluster

Distance table for all data points with the new centers:

| Data Point | C1'' (19.7,21) | C2'' (2.5,2.5) |
|------------|----------------|----------------|
| (22,21) | 2.3 | 26.88 |
| (19,20) | 1.22 | 24.05 |
| (18,22) | 1.97 | 24.91 |
| (1,3) | 25.96 | 1.58 |
| (4,2) | 24.65 | 1.58 |

Assign each data point to the cluster whose center is closest.

Stop the algorithm because the centers remain the same.
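The three iterations above follow the generic k-means loop: assign points to the nearest center, recompute the centers, and stop when nothing changes. A minimal sketch (an assumed implementation, not the slides' own code) that reproduces the example:

```python
# Minimal k-means sketch, run on the five-point example with k = 2 and the
# same initial centers C1 = (18, 22), C2 = (4, 2). Assumes no cluster ever
# becomes empty (true for this example).
import math

def kmeans(points, centers):
    """Iterate assignment and center updates until the centers stop changing."""
    while True:
        # Step 3: assign each point to its nearest center (Euclidean distance).
        clusters = [[] for _ in centers]
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            clusters[dists.index(min(dists))].append(p)
        # Step 2: recompute each center as the mean of its cluster's points.
        new_centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            for cluster in clusters
        ]
        if new_centers == centers:  # stop when there is no change of center
            return centers, clusters
        centers = new_centers

points = [(22, 21), (19, 20), (18, 22), (1, 3), (4, 2)]
final_centers, final_clusters = kmeans(points, [(18.0, 22.0), (4.0, 2.0)])
print(final_centers)   # converges to roughly (19.7, 21) and (2.5, 2.5)
print(final_clusters)
```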

Hierarchical Clustering Algorithm

Produces a nested sequence of clusters, like a tree.

Allows clusters to have subclusters.

Individual data points at the bottom of the tree are called "singleton clusters".

[Figure: example dendrogram over points A, B, C, D, E]
Hierarchical Clustering Algorithm

Agglomerative method

The tree is built from the bottom level up; at each level, the nearest pair of clusters is merged to move one level up.

Continue until all the data points are merged into a single cluster.

A numerical example

Hierarchical Clustering example

We have five data points with 3 attributes.

| Data Point | x1 | x2 | x3 |
|------------|----|----|----|
| A | 9  | 3  | 7 |
| B | 10 | 2  | 9 |
| C | 1  | 9  | 4 |
| D | 6  | 5  | 5 |
| E | 1  | 10 | 3 |

Hierarchical Clustering example (1st iteration)

Step 1: Calculating the Euclidean distance between each pair of points

We then obtain the following distance table.

| Data Point | A (9,3,7) | B (10,2,9) | C (1,9,4) | D (6,5,5) | E (1,10,3) |
|------------|-----------|------------|-----------|-----------|------------|
| A (9,3,7)  | 0 | 2.45 | 10.44 | 4.12 | 11.36 |
| B (10,2,9) | - | 0 | 12.45 | 6.4 | 13.45 |
| C (1,9,4)  | - | - | 0 | 6.48 | 1.41 |
| D (6,5,5)  | - | - | - | 0 | 7.35 |
| E (1,10,3) | - | - | - | - | 0 |
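The distance table can be generated with a few lines of code. This is a sketch, not from the slides; the names are illustrative.

```python
# Pairwise Euclidean distances between the five labeled 3-attribute points,
# printed as an upper-triangular table like the one in the slides.
import math

pts = {"A": (9, 3, 7), "B": (10, 2, 9), "C": (1, 9, 4),
       "D": (6, 5, 5), "E": (1, 10, 3)}

labels = sorted(pts)
for i, a in enumerate(labels):
    row = ["  -  " if b < a else f"{math.dist(pts[a], pts[b]):5.2f}"
           for b in labels]
    print(a, " ".join(row))
```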

Hierarchical Clustering example (1st iteration)

Step 2: Forming a tree

Consider the most similar pair of data points in the previous distance table.

C and E are the most similar.

We obtain the first cluster as follows.

[Figure: C and E merged into the first cluster]

Repeat steps 1 and 2 until all data points are merged into a single cluster.

Hierarchical Clustering example (2nd iteration)

Step 1: Calculating the Euclidean distance between each pair of clusters

We redraw the distance table, including the merged entity C&E.

| Data Point | A (9,3,7) | B (10,2,9) | D (6,5,5) | C&E |
|------------|-----------|------------|-----------|-----|
| A (9,3,7)  | 0 | 2.45 | 4.12 | 10.9 |
| B (10,2,9) | - | 0 | 6.4 | 12.95 |
| D (6,5,5)  | - | - | 0 | 6.90 |
| C&E (1, 9.5, 3.5) | - | - | - | 0 |

The distance from C&E to A is obtained from the previous table, using the distances from C to A and from E to A:

$$d_{(C,E),A} = \operatorname{avg}(d_{C,A}, d_{E,A}) = \operatorname{avg}(10.44, 11.36) = 10.9$$
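This averaging rule can be expressed directly in code (a sketch; the helper name is illustrative, not from the slides):

```python
# The slides' rule for cluster-to-cluster distance after a merge: average
# the two merged clusters' distances to the other cluster.
def merged_distance(d_left, d_right):
    return (d_left + d_right) / 2

print(round(merged_distance(10.44, 11.36), 2))  # d(C&E, A)
```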

Hierarchical Clustering example (2nd iteration)

Step 2: Forming a tree

Consider the most similar pair of clusters in the previous distance table.

A and B are the most similar.

We obtain the second cluster as follows.

[Figure: dendrogram with clusters C&E and A&B]

Repeat steps 1 and 2 until all data points are merged into a single cluster.

From the previous table, we can obtain the following distances for the new distance table.

Hierarchical Clustering example (3rd iteration)

Step 1: Calculating the Euclidean distance between each pair of clusters

We redraw the distance table, including the merged entities C&E and A&B.

| Data Point | A&B | D (6,5,5) | C&E |
|------------|-----|-----------|-----|
| A&B        | 0 | 5.26 | 11.93 |
| D (6,5,5)  | - | 0 | 6.9 |
| C&E        | - | - | 0 |

$$d_{(A,B),D} = \operatorname{avg}(d_{A,D}, d_{B,D}) = \operatorname{avg}(4.12, 6.40) = 5.26$$

$$d_{(C,E),(A,B)} = \operatorname{avg}(d_{(C,E),A}, d_{(C,E),B}) = \operatorname{avg}(10.9, 12.95) = 11.93$$

$$d_{(C,E),D} = 6.90$$

Hierarchical Clustering example (3rd iteration)

Step 2: Forming a tree

Consider the most similar pair of clusters in the previous distance table.

A&B and D are the most similar.

We obtain the new cluster as follows.

[Figure: dendrogram with clusters C&E and A&B&D]

Repeat steps 1 and 2 until all data points are merged into a single cluster.

From the previous table, we can obtain the distance from cluster A&B&D to C&E as follows.

Hierarchical Clustering example (4th iteration)

Step 1: Calculating the Euclidean distance between each pair of clusters

We redraw the distance table, including the merged entities C&E and A&B&D.

| Data Point | A&B&D | C&E |
|------------|-------|-----|
| A&B&D | 0 | 9.4 |
| C&E   | - | 0 |

$$d_{(A,B,D),(C,E)} = \operatorname{avg}(d_{(A,B),(C,E)}, d_{D,(C,E)}) = \operatorname{avg}(11.93, 6.9) = 9.4$$
Hierarchical Clustering example (4th iteration)

Step 2: Forming a tree

We can form the final tree because no more recalculation has to be made.

We merge all data points into a single cluster A&B&D&C&E.

Stop the algorithm.

[Figure: final dendrogram over A, B, C, D, E]
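The four iterations above can be sketched as one loop. This is an assumed implementation (not the slides' code) of the same scheme: merge the closest pair, then set the merged cluster's distance to every other cluster to the average of the two old distances.

```python
# Agglomerative clustering sketch following the slides' update rule: after
# merging two clusters, the distance to each remaining cluster is the
# average of the two old distances.
import math

def agglomerate(points):
    """Return the clusters in merge order for the labeled points."""
    labels = list(points)
    # Step 1: pairwise Euclidean distances between singleton clusters.
    dist = {frozenset((a, b)): math.dist(points[a], points[b])
            for a in points for b in points if a < b}
    merges = []
    while len(labels) > 1:
        # Step 2: merge the most similar (closest) pair of clusters.
        a, b = min(((x, y) for i, x in enumerate(labels)
                    for y in labels[i + 1:]),
                   key=lambda pair: dist[frozenset(pair)])
        merged = a + "&" + b
        # Distance from the merged cluster to each remaining cluster is the
        # average of the two old distances.
        for c in labels:
            if c not in (a, b):
                dist[frozenset((merged, c))] = (
                    dist[frozenset((a, c))] + dist[frozenset((b, c))]) / 2
        labels = [c for c in labels if c not in (a, b)] + [merged]
        merges.append(merged)
    return merges

points = {"A": (9, 3, 7), "B": (10, 2, 9), "C": (1, 9, 4),
          "D": (6, 5, 5), "E": (1, 10, 3)}
print(agglomerate(points))  # first merge is C&E, then A&B, as in the slides
```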
Conclusion

Two major clustering algorithms:

K-Means algorithm

An algorithm that iteratively groups data into k clusters based on a distance function. The number of clusters k is specified by the user.

Hierarchical Clustering algorithm

Produces a nested sequence of clusters, like a tree. The tree is built up from the bottom level, and merging continues until all the data points are merged into a single cluster.


Thank you