Optimizing DivKmeans for Multicore Architectures: a status report


March 12, 2007

CICC quarterly meeting

1





Jiahu Deng and Beth Plale

Department of Computer Science

Indiana University



Acknowledgements


David Wild


Rajarshi Guha


Digital Chemistry


Work funded in part by CICC and Microsoft


Problem Statements


1. Clustering is an important method for organizing thousands of data items into meaningful groups. It is widely applied in chemistry, chemical informatics, biology, drug discovery, etc. However, for large datasets, clustering is a slow process even when it is parallelized and executed on powerful computer clusters.




2. Multi-core architectures provide large degrees of parallelism. Taking advantage of this requires examination of traditional parallelism approaches. We apply that examination to the DivKmeans clustering method.


Multi-core Architectures

Multi-core processors combine two or more independent processor cores into a single package.

[Diagram of an Intel Core 2 dual-core processor, with CPU-local Level 1 caches and a shared, on-die Level 2 cache.]


Clustering Algorithm

1. Hierarchical clustering

A series of partitioning steps takes place, generating a hierarchy of clusters. It includes two families: agglomerative methods, which work from the leaves upward, and divisive methods, which decompose from the root downward.

http://www.digitalchemistry.co.uk/prod_clustering.html


Clustering Algorithm

2. Non-hierarchical clustering


Clusters form around centroids, the number of which can be
specified by the user. All clusters rank equally and there is no
particular relationship between them.

http://www.digitalchemistry.co.uk/prod_clustering.html


Divisive KMeans (DivKmeans) Clustering Algorithm

Kmeans Method:

K: the number of clusters, which can be specified.

The items are initially randomly assigned to a cluster. The kmeans clustering proceeds by repeated application of a two-step process:

1. The mean vector for all items in each cluster is computed.

2. Items are reassigned to the cluster whose center is closest to the item.

Features:

The K-means algorithm is stochastic and the results are subject to a random component. The K-means algorithm works very well for well-defined clusters with a clear cluster center.





Divisive KMeans (DivKmeans) Clustering Algorithm

Divisive KMeans:

A hierarchical kmeans method. In the following discussion we consider k = 2, i.e. each clustering process accepts one cluster as input and generates two partitioned clusters as outputs.

[Diagram: the original cluster is split by the Kmeans Method into cluster1 and cluster2; each resulting cluster is fed to the Kmeans Method again, building the hierarchy top-down.]










Parallelization of DivKmeans Algorithm for Multicore

Proceeding without Digital Chemistry DivKmeans

Once agreement was reached (Nov 2006), we could not get a version of the source code isolated that communicated through public interfaces instead of private ones.

Naive parallelization of DivKmeans

Chose to work with Cluster 3.0, the Open Source Clustering Software from the Laboratory of DNA Information Analysis, Human Genome Center, Institute of Medical Science, University of Tokyo.

The C clustering library is released under the "Python License".

Parallelized this Kmeans code with decomposition.

Gather performance results on naive parallelization

Suggest multicore-sensitive parallelizations

Early performance results of these parallelizations


Naive Parallelization of Cluster 3.0 Kmeans

Treat each kmeans clustering process as a black box, which takes one cluster as input and generates two clusters as outputs.

When a new cluster is generated having more than one element in it, assign it to a free processor for further clustering.

A master node maintains the status of each node.



Naive Parallelization of Cluster 3.0 Kmeans

[Diagram: a Master Node dispatches work to Working Nodes 1, 2, 3, ... The original cluster is assigned to Node 2; the resulting cluster1 is reassigned to Node 1 and cluster2 is assigned to Node 3 (and later reassigned to Node 2).]


Quality of Cluster 3.0 Kmeans Naive
Parallelization


Pros:

Don't need to worry about the details of the DivKmeans method. Can use the Kmeans functions of other libraries directly.

Cons:

Speedup and scalability?

How about parallelization overhead?




Profiling Naive Parallelization


Platform:


A Linux cluster, each node has two 2GHz AMD Opteron(TM)
CPUs, each CPU has dual cores


Linux RHEL WS release 4


Algorithm: Cluster 3.0, parallelized and made divisive


Dataset: Pubchem datasets of 24,000 and 96,000 elements


Additional Libraries:


LAM 7.1.2/MPI



Speedup: naive parallelization of Cluster 3.0

[Chart: Speedup of DivKmeans (Item Size: 24,000). x-axis: Number of Nodes (0-35); y-axis: Speedup (0-4).]

Speedup is defined by S_p = T_1 / T_p, where:

* p is the number of processors
* T_1 is the execution time of the sequential algorithm
* T_p is the execution time of the parallel algorithm with p processors

Conclusion: maximum benefit reached at 17 nodes; significant decrease in speedup after only 5 nodes.


CPU Utilization

Conclusion: Node 1 maxes out at 100% utilization, a likely limiter to overall performance.

[Chart: CPU Utilization of DivKmeans (Item Size: 96,000). x-axis: Running Time (seconds, 1-1417); y-axis: CPU Utilization (%, 0-120); one series per node, Node0-Node7.]

Memory Utilization

Conclusion: nothing outstanding.

[Chart: Memory Utilization of DivKmeans (Item Size: 96,000). x-axis: Running Time (seconds, 1-1473); y-axis: Memory Utilization (%, 0-50); one series per node, Node0-Node7.]

Process Behaviors

Observed with XMPI, a graphical user interface for running, debugging, and visualizing MPI programs.


Conclusions on Naive Parallelization from Profiling

Poor scalability beyond 5 nodes.

Performance likely inhibited by 100% utilization of Node 1.

Proposed Solution

Multi-core solution: use multiple threads on each node, with each thread running on one core.

How this solution explicitly addresses the two problems identified above.


Proposed Solution

Instead of treating each kmeans clustering process as a black box, each clustering process is decomposed into several threads.

[Diagram: the original cluster goes through some pre-processing, threads 1-4 work in parallel, their results are merged, and after other processing the outputs cluster1 and cluster2 are produced.]


Step 1: identify parts to decompose (parallelize)

[Diagram: calling sequence of the kmeans clustering process. DivKmeans calls Kmeans() inside a while loop; within Kmeans(), a do loop alternates between Finding Centroids and Calculating Distance.]

Profiling shows:

-> About 93% of total execution time is spent in kmeans() functions.

-> Inside the kmeans() function, almost all time is spent in "Finding Centroids" and "Calculating Distance".

-> Hence, parallelize these two.


Simplified codes of Finding Centroids

// sum up all elements
for (k = 0; k < nrows; k++) {
    i = clusterid[k];
    for (j = 0; j < ncolumns; j++)
        cdata[i][j] += data[k][j];
}

// calculate mean values
for (i = 0; i < nclusters; i++) {
    for (j = 0; j < ncolumns; j++)
        cdata[i][j] /= total_number[i][j];
}


Parallelized Codes of "Finding Centroids"

Before parallelization:

// sum up all elements
for (k = 0; k < nrows; k++) {
    i = clusterid[k];
    for (j = 0; j < ncolumns; j++)
        cdata[i][j] += data[k][j];
}

// calculate mean values
...

After parallelization:

// sum up elements assigned to current thread
for (k = nrows * index / n_thread;
     k < nrows * (index + 1) / n_thread; k++) {
    i = clusterid[k];
    for (j = 0; j < ncolumns; j++) {
        if (mask[k][j] != 0) {
            t_data[i][j] += data[k][j];
            t_mask[i][j]++;
        }
    }
}

// merge data
...

// calculate mean values
...







Mapping of Algorithms onto Multi-core Architectures

Each thread uses one core.

[Diagram: the same decomposition as before, with threads 1-4 replaced by Cores 1-4: the original cluster goes through pre-processing, the cores work in parallel, results are merged, and cluster1 and cluster2 are produced.]


Mapping of Algorithms onto Multi-core Architectures

How to further benefit from multi-core architectures?

Data locality

Cache-aware algorithms

Architecture-aware algorithms


Mapping of Algorithms onto Multi-core Architectures

Example 1: AMD Opteron

No cache sharing between two cores in this architecture.

[Diagram of AMD Opteron]


Mapping of Algorithms onto Multi-core Architectures

Example 2: Intel Core 2

Improve cache re-use: if two threads share common data, assign them to cores on the same die.

[Diagram of an Intel Core 2 dual-core processor]


Mapping of Algorithms onto Multi-core Architectures

Example 3: Dell PowerEdge 6950, NUMA (Non-Uniform Memory Access)

Improve data locality: keep data in local memory so that each thread uses local memory instead of remote memory as much as possible.


Early Results on Multi-core Platform

Experiment Environments

Platform:

3 nodes in a Linux cluster; each node has two 2GHz AMD Opteron(TM) CPUs, each CPU has dual cores

Linux RHEL WS release 4

Library:

LAM 7.1.2/MPI

Pthreads for Linux RHEL WS release 4

Degree of Parallelization

Only the code of "Finding Centroids" is parallelized for this early study.

4 threads are used for "Finding Centroids" on each node, and each thread runs on one core.


Results of Parallelizing "Finding Centroids"

[Chart: Performance of DivKmeans. x-axis: Data Size (Number of Items, 0-100,000); y-axis: Total Execution Time (seconds, 0-3000); two series: Before Parallelization, After Parallelization.]

Conclusion: modest improvement. DivKmeans runs about 12% faster after parallelization.


Parallelizing "Finding Centroids" with Different Numbers of Threads per Node

[Chart: Performance of DivKmeans (Item Size: 12,000). x-axis: Number of Threads Used per Node (0-80); y-axis: Total Execution Time (seconds, 320-380). Total number of cores per node: 4.]

Conclusion: there is hardly any benefit from using more threads than the number of cores.


Optimizations for Next Step

Reduce the overhead of managing threads (e.g. use a thread pool instead of creating new threads for each call to "Finding Centroids").

Parallelize the "Calculating Distance" part, which consumes twice the time of "Finding Centroids".

More cores (4, 8, 32...) on a single computer are on the way. We should see further performance gains with more cores if the scalability of the program is good.

The platform we used (AMD Opteron(TM)) doesn't support cache sharing between two cores on the same die. However, L2, and even L1, cache sharing among cores is becoming available.



The Multi-core Project in the Distributed Data Everywhere (DDE) Lab and the Extreme Lab

Multi-core processors represent a major evolution in today's computing technology.

We are exploring the programming styles and challenges on multi-core platforms, and potential applications in both academic and commercial areas, including chemical informatics, XML parsing, data streaming, Web Services, etc.



References

1. Open Source Clustering Software, Laboratory of DNA Information Analysis, Human Genome Center, Institute of Medical Science, University of Tokyo.
   http://bonsai.ims.u-tokyo.ac.jp/~mdehoon/software/cluster/

2. http://www.nsc.liu.se/rd/enacts/Smith/img1.htm

3. http://www.mhpcc.edu/training/workshop/parallel_intro/

4. http://www.digitalchemistry.co.uk/prod_clustering.html

5. Performance Benchmarking on the Dell PowerEdge(TM) 6950, David Morse, Dell Inc.