Co-clustering using CUDA



Co-Clustering Explained


Problem:

Large binary matrix of samples (rows) and features (columns)
Which samples should be grouped together? Why?
What are their shared features?
Co-clustering provides the "why" explicitly: correlated sample/feature pairs
Row cluster: s1 and s3 are in a group
Column cluster: the distinguishing features are 2, 3, and 5

Co-Clustering - Details


Using Information Theoretic Co-clustering, as parallelized for the Hadoop architecture in:
"Disco: Distributed co-clustering with Map-Reduce: A case study towards petabyte-scale end-to-end mining", Papadimitriou et al., ICDM 2008
Partition the entire matrix into row groups and column groups
Minimize the length of the encoding of the resulting partitioned matrix
Competing code-length factors: the number of row and column groups vs. the homogeneity of the clusters
Iterate over the rows, rearranging and sub-partitioning to find a better encoding using a heuristic
Repeat for the columns, then the rows again, until a local optimum is found (the full alternating loop is sketched below)
Complexity: O(n * fp * (row_groups + col_groups)^2 * iters)



Credit: Chakrabarti et al., KDD 2004
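
To make the control flow concrete, here is a minimal host-side sketch of the alternating minimization; computeCost, reassignRows, and reassignCols are hypothetical placeholders for the statistics and regrouping steps detailed on the following slides.

```cuda
// Minimal host-side sketch of the alternating minimization.
// computeCost, reassignRows, and reassignCols are hypothetical placeholders
// for the statistics and regrouping steps detailed on later slides.
double computeCost(const int* rowGroup, const int* colGroup, int k, int l);
void   reassignRows(int* rowGroup, const int* colGroup, int k, int l);
void   reassignCols(const int* rowGroup, int* colGroup, int k, int l);

double coclusterInner(int* rowGroup, int* colGroup, int k, int l) {
    double cost = computeCost(rowGroup, colGroup, k, l);
    for (;;) {
        reassignRows(rowGroup, colGroup, k, l);  // move each row to its best group
        reassignCols(rowGroup, colGroup, k, l);  // then each column to its best group
        double next = computeCost(rowGroup, colGroup, k, l);
        if (next >= cost) break;                 // local optimum of the code length
        cost = next;
    }
    return cost;
}
```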

Implementation - Basics

Initial matrix generation: CPU
Initial random row/column group assignment: CPU
Memory structures are very simple: arrays of ints (see the sketch below)
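
As an illustration of that layout, a minimal sketch of the device-side allocations, assuming an n x m row-major 0/1 matrix and per-row/per-column group-index arrays (all names are illustrative, not from the actual code):

```cuda
#include <cuda_runtime.h>

// Illustrative device-side layout: flat int arrays only, as the slide says.
void uploadProblem(const int* hMatrix, int n, int m,
                   int** dMatrix, int** dRowGroup, int** dColGroup) {
    cudaMalloc(dMatrix,   n * m * sizeof(int));  // 0/1 entries, row-major
    cudaMalloc(dRowGroup, n * sizeof(int));      // rowGroup[i] = group of row i
    cudaMalloc(dColGroup, m * sizeof(int));      // colGroup[j] = group of column j
    cudaMemcpy(*dMatrix, hMatrix, n * m * sizeof(int), cudaMemcpyHostToDevice);
}
```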




Implementation - Stats Step 1

Statistics calculations:
Calculates statistics for each row of each column group
The statistic is the number of 1s in a column group
Straightforward parallelization (each thread works on one row at a time), global memory; see the kernel sketch below

[Figure: example matrix partitioned by column groups (2 3 1 3 2) and row groups (3 5 1 1 4); Stat(Row 3, ColumnGroup 3) = 1]
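
A minimal sketch of the straightforward kernel described above: one thread per row, counting 1s per column group directly in global memory. The names and the n x numColGroups layout of rowStats are assumptions, and rowStats is assumed zeroed before launch.

```cuda
// One thread per row; each thread walks its row and counts the 1s falling in
// each column group, writing to global memory. No atomics are needed because
// each row's statistics are owned by exactly one thread.
__global__ void rowStatsKernel(const int* matrix, const int* colGroup,
                               int* rowStats, int n, int m, int numColGroups) {
    for (int row = blockIdx.x * blockDim.x + threadIdx.x;
         row < n;
         row += gridDim.x * blockDim.x) {
        for (int col = 0; col < m; ++col)
            if (matrix[row * m + col])
                rowStats[row * numColGroups + colGroup[col]] += 1;
    }
}
```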

Room For Improvement

Calculate row statistics using the textbook shared-memory histogram algorithm (sketched below):
Block the columns
Assign one thread block to each block of columns
Compute shared-memory histograms within the block
Merge back to global memory when finished
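
A sketch of that proposal, assuming a 2D launch with one block per (row, column-tile) pair and at most MAX_COL_GROUPS column groups; the details are illustrative, not the project's actual code.

```cuda
#define MAX_COL_GROUPS 256  // assumed upper bound on column groups

// One block per (row, column tile): blockIdx.x picks the row, blockIdx.y the
// tile of columns. The block histograms its tile into shared memory with
// atomics, then merges the partial histogram into the global row statistics.
__global__ void rowStatsTiled(const int* matrix, const int* colGroup,
                              int* rowStats, int n, int m, int numColGroups) {
    __shared__ int hist[MAX_COL_GROUPS];
    for (int g = threadIdx.x; g < numColGroups; g += blockDim.x)
        hist[g] = 0;
    __syncthreads();

    int row = blockIdx.x;
    int col = blockIdx.y * blockDim.x + threadIdx.x;
    if (col < m && matrix[row * m + col])
        atomicAdd(&hist[colGroup[col]], 1);      // shared-memory histogram
    __syncthreads();

    for (int g = threadIdx.x; g < numColGroups; g += blockDim.x)
        if (hist[g])
            atomicAdd(&rowStats[row * numColGroups + g], hist[g]);
}
```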

Implementation - Stats Step 2

Calculates the cost for each row group of each column group
Essentially a reduce over the per-row data
Block the rows, assign each block to a thread block
Use shared memory and atomics to build a histogram of all rows in a given row group
Merge the shared histogram with the global histogram for that row group
Iterate over all row groups (kernel sketch below)


[Figure: same example matrix; Stat(RowGroup 1, ColumnGroup 3) = 2]
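
A sketch of this reduction, reusing the hypothetical per-row statistics from the earlier kernel; the host is assumed to loop over row groups, launching once per targetRowGroup.

```cuda
#define MAX_COL_GROUPS 256  // same assumed bound as before

// Reduce the per-row statistics into per-row-group statistics for one row
// group. Each block scans a tile of rows; rows belonging to targetRowGroup
// add their counts into a shared histogram, which is then merged into the
// global histogram for that group.
__global__ void groupStatsKernel(const int* rowStats, const int* rowGroup,
                                 int* groupStats, int n,
                                 int numColGroups, int targetRowGroup) {
    __shared__ int hist[MAX_COL_GROUPS];
    for (int g = threadIdx.x; g < numColGroups; g += blockDim.x)
        hist[g] = 0;
    __syncthreads();

    int row = blockIdx.x * blockDim.x + threadIdx.x;  // one row per thread
    if (row < n && rowGroup[row] == targetRowGroup)
        for (int g = 0; g < numColGroups; ++g)
            atomicAdd(&hist[g], rowStats[row * numColGroups + g]);
    __syncthreads();

    for (int g = threadIdx.x; g < numColGroups; g += blockDim.x)
        if (hist[g])
            atomicAdd(&groupStats[targetRowGroup * numColGroups + g], hist[g]);
}
```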

Implementation - Row/Col Group Optimization

For each row, find the optimal group it could belong to
Parallelized straightforwardly: one row per thread, looping with a stride to cover all rows
Each row calculation goes through all row groups and determines the global cost of moving to that row group
Move all rows to their optimal groups
Recompute statistics
Repeat for column groups
Continue alternating row/column groupings until convergence (see the kernel sketch below)
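
A sketch of this step as a grid-stride kernel; moveCost is a clearly-fake placeholder for the real code-length delta, which depends on statistics not shown here.

```cuda
// Hypothetical stand-in for the real code-length delta of moving `row` into
// row group `g`; the real computation depends on group statistics not shown.
__device__ double moveCost(const int* rowStats, int row, int g,
                           int numColGroups) {
    double c = 0.0;
    for (int cg = 0; cg < numColGroups; ++cg)
        c += (double)rowStats[row * numColGroups + cg] * (g + 1);  // dummy only
    return c;
}

// Grid-stride loop: each thread owns a row at a time, tries every row group,
// and records the cheapest. Reassignment happens in a later pass so that all
// rows move against the same snapshot of the statistics.
__global__ void bestRowGroupKernel(const int* rowStats, int* bestGroup,
                                   int n, int numRowGroups, int numColGroups) {
    for (int row = blockIdx.x * blockDim.x + threadIdx.x;
         row < n;
         row += gridDim.x * blockDim.x) {
        int    best     = 0;
        double bestCost = moveCost(rowStats, row, 0, numColGroups);
        for (int g = 1; g < numRowGroups; ++g) {
            double c = moveCost(rowStats, row, g, numColGroups);
            if (c < bestCost) { bestCost = c; best = g; }
        }
        bestGroup[row] = best;
    }
}
```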

Room For Improvement

The parallelization could be more sophisticated
Could block the rows and compute the cost of a row joining each row group in parallel
Using shared-memory atomics to identify the minimum cost (sketched below)
In practice, this algorithm heavily favors a small number of row and column groups
The gain from this parallelization would therefore be small
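
One possible realization of that idea, packing a scaled cost and the group index into an int so that a shared-memory atomicMin can pick the winner; this assumes small, non-negative costs, at most 1024 row groups, blockDim.x >= numRowGroups, and a precomputed per-(row, group) cost array, and is purely illustrative.

```cuda
// One block per row, one thread per candidate row group. atomicMin works on
// ints, so a scaled cost is packed into the high bits with the group index
// in the low 10 bits; the thread with the cheapest cost wins.
__global__ void pickBestGroup(const float* costs, int* bestGroup,
                              int numRowGroups) {
    __shared__ int packedBest;
    if (threadIdx.x == 0) packedBest = 0x7FFFFFFF;
    __syncthreads();

    int row = blockIdx.x;
    int g   = threadIdx.x;
    if (g < numRowGroups) {
        int scaled = (int)(costs[row * numRowGroups + g] * 1024.0f);
        atomicMin(&packedBest, (scaled << 10) | g);  // cheapest cost wins
    }
    __syncthreads();

    if (threadIdx.x == 0)
        bestGroup[row] = packedBest & 1023;  // recover the winning group
}
```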

Implementation - Outer Loop

After a local minimum is found, change the initial number of row and column groups and retry
Change the number of row groups or the number of column groups, up or down
Continue changing the number of row or column groups in that direction until the cost fails to decrease
Try both directions in both dimensions before stopping
The outer loop is performed on the CPU (sketched below)
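
A host-side sketch of this search, reusing the hypothetical coclusterInner from the earlier sketch; k and l are the row- and column-group counts.

```cuda
// Host-side sketch of the outer search. coclusterInner is the hypothetical
// inner loop sketched earlier; k and l are the row/column group counts.
double coclusterInner(int* rowGroup, int* colGroup, int k, int l);

void outerSearch(int* rowGroup, int* colGroup, int& k, int& l) {
    double best = coclusterInner(rowGroup, colGroup, k, l);
    // Four (dimension, direction) moves: rows +/-, columns +/-.
    const int dk[4] = { +1, -1,  0,  0 };
    const int dl[4] = {  0,  0, +1, -1 };
    for (int d = 0; d < 4; ++d) {
        for (;;) {  // keep moving in this direction while the cost decreases
            int nk = k + dk[d], nl = l + dl[d];
            if (nk < 1 || nl < 1) break;
            double c = coclusterInner(rowGroup, colGroup, nk, nl);
            if (c >= best) break;  // cost failed to decrease; try the next move
            best = c; k = nk; l = nl;
        }
    }
}
```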


Room for Improvement

The outer loop could parallelize the inner-loop actions over different GPUs
Each GPU could explore a different dimension and direction in parallel

Implementation - CPU + Validation

The CPU implementation performs all the steps described earlier, but sequentially
Validation:
Used the CPU implementation of the statistics calculations to validate the GPU statistics calculations
The CPU and GPU log implementations differ, so cost calculations were validated by allowing a 5% tolerance between results (see the sketch below)
Did not have time to validate the overall algorithm or to visualize its outputs to see if the co-clusters produced were reasonable
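
The slides don't show the comparison itself; a plausible sketch using a 5% relative tolerance might look like this.

```cuda
#include <algorithm>
#include <cmath>

// Plausible sketch of the 5% tolerance check between CPU and GPU costs;
// the exact formulation used in the project is not shown in the slides.
bool withinTolerance(double cpuCost, double gpuCost, double tol = 0.05) {
    double denom = std::max(std::fabs(cpuCost), 1e-12);  // avoid divide-by-zero
    return std::fabs(cpuCost - gpuCost) / denom <= tol;
}
```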


Timing Measurements

Time was measured with clock_t / CLOCKS_PER_SEC in the CPU implementation
Measured with CUDA events in the GPU implementation (both sketched below)
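
A minimal sketch showing both timing methods side by side; someKernel is a hypothetical stand-in for the real kernels.

```cuda
#include <cstdio>
#include <ctime>
#include <cuda_runtime.h>

__global__ void someKernel() {}  // hypothetical stand-in for the real kernels

int main() {
    // CPU path: clock_t / CLOCKS_PER_SEC
    clock_t t0 = clock();
    // ... sequential implementation would run here ...
    double cpuSecs = (double)(clock() - t0) / CLOCKS_PER_SEC;

    // GPU path: CUDA events bracketing the kernel launch
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    someKernel<<<100, 256>>>();
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);   // wait for the kernel to finish
    float gpuMs = 0.0f;
    cudaEventElapsedTime(&gpuMs, start, stop);

    printf("CPU: %.3f s, GPU: %.3f ms\n", cpuSecs, gpuMs);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```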


Development Lessons Learned

CUDA and structured data is a bad idea
Even structs of arrays are impossible to deal with
Host-side pointer math on device pointers does not work
The CUDA API has REALLY unfriendly error messages
Take care to do very, very little through that API
__device__ variables declared globally must be passed to kernels
Runtime errors result otherwise
You can malloc and free memory in device code as of CUDA 3.2 (the allocation comes from a heap in global memory, not shared memory; sketched below)
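
A tiny sketch of device-side malloc/free (requires compiling for sm_20 or newer).

```cuda
// Device-side malloc/free, available since CUDA 3.2 (compile for sm_20+).
// The allocation comes from a heap in global memory, not shared memory.
__global__ void scratchKernel(int n) {
    int* scratch = (int*)malloc(n * sizeof(int));  // per-thread heap allocation
    if (!scratch) return;                          // the device heap can run out
    for (int i = 0; i < n; ++i)
        scratch[i] = i;
    free(scratch);
}
```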


Development Lessons Learned, Cont.

Visual Studio CUDA integration leaves a lot to be desired
Even with all optimizations removed, you still can't set breakpoints everywhere
Many variables show as freed
No in-IDE, real-time, in-editor compile errors
But Visual Studio does give nice auto-complete and go-to-definition navigation
No CUDA linker => separate files must be directly #include'd


Experiment - Environment

Float.cs.drexel.edu
CPU: 4 quad-core Intel Xeon L5360 processors @ 2.13 GHz
GPU: 2 Nvidia GeForce GTX 580 GPUs @ 1544 MHz

Experiment - Description

Sequential (CPU) and parallel (GPU) implementations were tested on square matrices of order 100, 1000, and 10000
Larger matrices caused memory problems
GPU tested with varying block and thread counts
Num blocks: 10, 100, 5000
Num threads: 10, 100, 1024 (max)
Resulting co-clusters usually stayed in the 50-200 row/column group range, regardless of matrix order
Row and column groupings are important in the calculation of matrix statistics; rows and columns are blocked by these groupings


Experiment Results

[Chart: Speedup with 10 blocks; speedup (0 to 80) vs. matrix order (100, 1000, 10000), one series per thread count (10, 100, 1024)]

Experiment Results

For a small number of blocks, 100-thread performance peaks at num_blocks * num_threads = matrix_order (e.g., 10 blocks * 100 threads on the order-1000 matrix)
I would expect this to be the optimal configuration, when num_blocks ~= num_row_groups ~= num_col_groups
Slowdown occurs when the matrix order exceeds the total number of threads and more work must be done serially



Experiment - Results

[Chart: Speedup with 100 blocks; speedup (0 to 80) vs. matrix order (100, 1000, 10000), one series per thread count (10, 100, 1024)]

Experiment Results

[Chart: Speedup with 5000 blocks; speedup (0 to 80) vs. matrix order (100, 1000, 10000), one series per thread count (10, 100, 1024)]

Experiment Results

Interestingly, the maximum speedup was the same for all block counts
Roughly speaking, as long as num_blocks * num_threads >= matrix order, the max speedup of ~70 is achieved
10 threads never got there, perhaps due to block-scheduling overhead; possibly the cost of copying to shared memory for block processing was not recouped in the 10-thread case
Maxing out the thread count is counterproductive on smaller matrices
Hypothesis: when the block count is excessive (as it is for small matrices), scheduling large blocks of threads that return immediately is costly

Experiment Results

[Chart: Efficiency with 10 blocks; efficiency (0 to 0.08) vs. matrix order (100, 1000, 10000), one series per thread count (10, 100, 1024)]

Experiment Results

[Chart: Efficiency with 100 blocks; efficiency (0 to 0.05) vs. matrix order (100, 1000, 10000), one series per thread count (10, 100, 1024)]

Experiment Results

[Chart: Efficiency with 5000 blocks; efficiency (0 to 0.0012) vs. matrix order (100, 1000, 10000), one series per thread count (10, 100, 1024)]

Experiment Results

Efficiency is consistently highest for smaller numbers of blocks and smaller numbers of threads within those blocks
Hypothesis: the overhead of starting blocks and threads must be high enough to produce diminishing returns when blocks and threads are added