Evaluating FERMI Features for Data Mining Applications

Master's Thesis Presentation

Sinduja Muralidharan
Advised by: Dr. Gagan Agrawal

Outline

- Motivation and Background
- The FERMI series and the TESLA series GPUs
- Reduction based Data Mining Algorithms
- Parallelization Methods for GPUs
- Experimental Evaluation
- Conclusion

Background

- GPUs have recently emerged as a major player in high performance computing.
- GPUs provide an excellent price to performance ratio.
- CUDA is well suited, and widely used, for programming a variety of high performance applications.
- GPU hardware and software have evolved rapidly: new GPU products and successive versions of CUDA have added new functionality and better performance.

The FERMI GPU

- The Fermi series of cards:
  - includes the C2050 and the C2070 cards
  - is also referred to as the 20-series family of NVIDIA Tesla GPUs
- Support for double precision atomic operations.
- A much larger shared memory/L1 cache, which can be configured as:
  - 48 KB shared memory and 16 KB L1 cache, or
  - 16 KB shared memory and 48 KB L1 cache
- Presence of an L2 cache.

TESLA vs. FERMI

Thesis Objective

- Optimizing and evaluating the new features of the Fermi series GPUs:
  - increased shared memory
  - support for atomic operations on floating point data
- Using three parallelization approaches on reduction based mining algorithms:
  - full replication in shared memory
  - improving locking with inbuilt atomic operations
  - creation of several hybrid versions for optimal performance



Generalized Reductions

- op is a function that is both commutative and associative, and Reduc is a data structure referred to as the reduction object (see the sketch below).
- Which specific elements of the reduction object are updated depends on the results of previous processing.
- The data instances (or records, or transactions) are divided among the processing threads.
- The element of the reduction object updated in iteration i of the loop is determined by the results of previous processing.
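A minimal sequential sketch of this reduction structure, using the slide's names op and Reduc; the concrete process() and op() here (bucketing by sign, summation) are illustrative stand-ins, not the thesis's code:

    // op must be commutative and associative; sum is used for illustration.
    static float op(float a, float b) { return a + b; }

    // Hypothetical per-instance processing: maps a data instance to an
    // element of the reduction object and a contribution value.
    static void process(float x, int *idx, float *val) {
        *idx = (x < 0.0f) ? 0 : 1;
        *val = x;
    }

    // Sequential form of the generalized reduction: which element of Reduc
    // is updated in iteration i depends on processing the i-th instance,
    // so it is only known at runtime.
    void generalized_reduction(const float *data, int n, float *Reduc) {
        for (int i = 0; i < n; ++i) {
            int idx; float val;
            process(data[i], &idx, &val);
            Reduc[idx] = op(Reduc[idx], val);
        }
    }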


Parallelizing Generalized Reductions

- It is not possible to statically partition the reduction object, because different processors update different, disjoint portions at runtime:
  - this can lead to race conditions (illustrated in the fragment after this list).
- The execution time of the process function can take up a major chunk of the total execution time of an iteration of the loop, so runtime preprocessing and static scheduling techniques cannot be applied.
- The reduction object may sometimes be too large to keep replicas in memory without significant overheads.
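A hedged CUDA fragment illustrating the race condition referred to above; the kernel and the stand-in for process() are illustrative only:

    // Two threads whose instances map to the same reduction element both
    // execute the read-modify-write below, and one update can be lost.
    __global__ void naive_reduction(const float *data, int n, float *Reduc)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        if (tid >= n) return;

        int idx = (data[tid] < 0.0f) ? 0 : 1;   // stand-in for process()

        // Unsafe: the load, add and store are not atomic, so concurrent
        // updates to Reduc[idx] from different threads can interleave.
        Reduc[idx] = Reduc[idx] + data[tid];
    }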

Earlier Parallelization Techniques

- Earlier attempts to parallelize the Map-Reduce class of applications faced:
  - the lack of support for atomic operations on floating point numbers
  - the large number of threads required for effective parallelization
- The larger shared memory on Fermi allows total replication of the reduction object for some thread configurations:
  - this largely avoids race conditions and thread contention.

Full Replication

- In any shared memory system, the best way to avoid race conditions is to (see the sketch below):
  - have each thread keep its own copy of the reduction object in device memory and process its data separately,
  - perform a global combination at the end of each iteration, either with a single thread or using a tree structure,
  - copy the final object back to host memory.
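A sketch of full replication, assuming one private copy per thread in device (global) memory and a simple single-thread combination kernel; the object size, names, and sum-based combination are illustrative:

    #define OBJ_SIZE 16   // elements per reduction object copy (illustrative)

    __global__ void reduce_replicated(const float *data, int n, float *copies)
    {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float *my_copy = copies + tid * OBJ_SIZE;   // this thread's private copy

        for (int i = tid; i < n; i += gridDim.x * blockDim.x) {
            int idx = (data[i] < 0.0f) ? 0 : 1;     // stand-in for process()
            my_copy[idx] += data[i];                // no races: private copy
        }
    }

    // Global combination, folding all per-thread copies into copies[0..OBJ_SIZE).
    // Launched with <<<1, 1>>>; a tree-structured combination is also possible.
    __global__ void combine(float *copies, int num_copies)
    {
        for (int c = 1; c < num_copies; ++c)
            for (int e = 0; e < OBJ_SIZE; ++e)
                copies[e] += copies[c * OBJ_SIZE + e];
    }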

Full Replication in Shared Memory

- The factors that affect the performance of the full replication mode of reduction:
  - the size of the reduction object (which depends on the number of threads per multiprocessor),
  - the amount of computation in comparison to the amount of data copied between devices, and
  - whether or not the global data can be copied into shared memory.
- On Tesla it was not possible to fit all copies of the reduction object within the 16 KB of shared memory available:
  - the higher latency device memory had to be used.

Full Replication in Shared Memory (continued)

- The larger amount of shared memory on Fermi can hold all copies of the reduction object entirely within shared memory for smaller thread configurations (sketched below):
  - no race conditions and no contention among threads, because each thread updates its own copy of the object,
  - global memory accesses are replaced by low latency shared memory accesses.
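A sketch of the same idea with the per-thread copies held in shared memory, as described above for Fermi; the block size, object size, and kernel name are illustrative assumptions (64 threads x 16 floats x 4 bytes = 4 KB, well within 48 KB):

    #define OBJ_SIZE          16   // elements per copy (illustrative)
    #define THREADS_PER_BLOCK 64   // a "smaller configuration" in the slide's sense

    // Launch with blockDim.x == THREADS_PER_BLOCK.
    __global__ void reduce_shared_replicated(const float *data, int n, float *out)
    {
        __shared__ float copies[THREADS_PER_BLOCK * OBJ_SIZE];
        float *my_copy = copies + threadIdx.x * OBJ_SIZE;

        for (int e = 0; e < OBJ_SIZE; ++e) my_copy[e] = 0.0f;
        __syncthreads();

        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        for (int i = tid; i < n; i += gridDim.x * blockDim.x) {
            int idx = (data[i] < 0.0f) ? 0 : 1;   // stand-in for process()
            my_copy[idx] += data[i];              // low-latency, race-free update
        }
        __syncthreads();

        // Thread 0 combines the block's copies; per-block results are then
        // combined on the host or in a follow-up kernel (not shown).
        if (threadIdx.x == 0) {
            for (int e = 0; e < OBJ_SIZE; ++e) {
                float s = 0.0f;
                for (int t = 0; t < THREADS_PER_BLOCK; ++t)
                    s += copies[t * OBJ_SIZE + e];
                out[blockIdx.x * OBJ_SIZE + e] = s;
            }
        }
    }

    // Host side: cudaFuncSetCacheConfig(reduce_shared_replicated,
    //                                   cudaFuncCachePreferShared);
    // selects the 48 KB shared memory / 16 KB L1 configuration on Fermi.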

Locking Scheme

- The shared memories of different multiprocessors have no synchronization mechanism, so a separate copy of the reduction object is placed in the shared memory of each multiprocessor.
- While performing updates on the reduction object, all threads of a thread block use locking to avoid race conditions (sketched below).
- Finally, a global combination is performed on the updates accumulated on the different multiprocessors.
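A sketch of the locking scheme as described above, assuming one shared-memory copy per thread block and using Fermi's single-precision atomic support for the per-element updates; all names and sizes are illustrative:

    #define OBJ_SIZE 16   // elements per reduction object copy (illustrative)

    __global__ void reduce_locked(const float *data, int n, float *block_copies)
    {
        __shared__ float copy[OBJ_SIZE];   // the block's single shared copy

        for (int e = threadIdx.x; e < OBJ_SIZE; e += blockDim.x) copy[e] = 0.0f;
        __syncthreads();

        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        for (int i = tid; i < n; i += gridDim.x * blockDim.x) {
            int idx = (data[i] < 0.0f) ? 0 : 1;  // stand-in for process()
            atomicAdd(&copy[idx], data[i]);      // serialized only on conflicts
        }
        __syncthreads();

        // Stage this block's accumulated copy for the final global combination.
        for (int e = threadIdx.x; e < OBJ_SIZE; e += blockDim.x)
            block_copies[blockIdx.x * OBJ_SIZE + e] = copy[e];
    }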


Locking: TESLA vs. FERMI

- Fine grained locking (compared in the sketch below):
  - TESLA: wrapper based implementation of atomic floating point operations
  - FERMI: inbuilt atomic floating point operations

The Hybrid Scheme

- Full replication:
  - a private copy of the reduction object is needed for each thread in a block,
  - larger reduction objects must be stored in the high latency global device memory,
  - the cost of combination can be very high.
- Locking:
  - a single copy of the reduction object is stored in the shared memory, which eliminates the need for a global combination,
  - but contention among the threads in a block is very high.
- Configuring an application with a larger number of threads per multiprocessor typically leads to better performance, since latencies can be masked by context switching between warps.
- The hybrid scheme seeks a balance between these overheads (a sketch follows this list).
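A sketch of one possible hybrid kernel, assuming the threads of a block are partitioned into groups that each share one shared-memory copy of the reduction object (consistent with the number of groups M discussed on the next slide); the group size, object size, and names are illustrative:

    #define OBJ_SIZE        16
    #define THREADS_PER_GRP 8    // threads sharing one copy (illustrative)
    #define MAX_GROUPS      64   // upper bound on blockDim.x / THREADS_PER_GRP

    // Launch with blockDim.x a multiple of THREADS_PER_GRP, at most
    // MAX_GROUPS * THREADS_PER_GRP threads per block.
    __global__ void reduce_hybrid(const float *data, int n, float *block_copies)
    {
        __shared__ float copies[MAX_GROUPS * OBJ_SIZE];
        int group = threadIdx.x / THREADS_PER_GRP;    // M = blockDim.x / THREADS_PER_GRP
        float *grp_copy = copies + group * OBJ_SIZE;

        for (int e = threadIdx.x;
             e < (blockDim.x / THREADS_PER_GRP) * OBJ_SIZE; e += blockDim.x)
            copies[e] = 0.0f;
        __syncthreads();

        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        for (int i = tid; i < n; i += gridDim.x * blockDim.x) {
            int idx = (data[i] < 0.0f) ? 0 : 1;   // stand-in for process()
            atomicAdd(&grp_copy[idx], data[i]);   // contention only within a group
        }
        __syncthreads();

        // Combine the M group copies into this block's result.
        if (threadIdx.x == 0) {
            int M = blockDim.x / THREADS_PER_GRP;
            for (int e = 0; e < OBJ_SIZE; ++e) {
                float s = 0.0f;
                for (int g = 0; g < M; ++g) s += copies[g * OBJ_SIZE + e];
                block_copies[blockIdx.x * OBJ_SIZE + e] = s;
            }
        }
    }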


The Hybrid Scheme (continued)

- When choosing the number of groups, M:
  - the M copies of the reduction object should still fit into the shared memory (a back-of-the-envelope version of this constraint follows the list),
  - if the reduction object is big, the overhead of combination is higher than the overhead of contention,
  - when the object is smaller, the contention overhead dominates over the combination overhead.
- Since it is desirable to keep the contention overhead small, a larger number of groups is preferable.
- Several hybrid versions were created and evaluated on Fermi to study the optimal balance between the contention and combination overheads.
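The shared-memory constraint on M, spelled out with the 48 KB figure from the Fermi slide and a copy size chosen purely for illustration (the 2 KB is hypothetical, not a size from the thesis):

    M x (size of one reduction object copy) <= shared memory per multiprocessor
    e.g.  M x 2 KB <= 48 KB  =>  at most 24 groups per multiprocessor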

Experimental Evaluation

- Environment:
  - TESLA: NVIDIA Tesla C1060 GPU with 240 cores, a clock frequency of 1.296 GHz, and 4 GB of device memory.
  - FERMI: NVIDIA Tesla C2050 GPU with 448 processor cores, a clock frequency of 1.15 GHz, and 3 GB of device memory.

Observations

- For larger reduction objects, the hybrid approach generally outperforms the replication and the locking approaches:
  - contention overhead dominates.
- For smaller reduction objects, full replication in shared memory yields the best performance:
  - combination overhead dominates.
- Inbuilt support for atomic floating point operations outperforms the previously used wrapper based implementation.


K-Means Results (k = 10)

[Two charts of execution time (seconds) vs. number of threads (64, 128, 192, 256, 512): one for the wrapper based implementation of atomic floating point operations and one for the inbuilt support for atomic floating point operations. Schemes compared: Atomic, Replicate, Hybrid, Full replication on SM.]
K-Means Results (k = 100)

[Two charts of execution time (seconds) vs. number of threads (64, 128, 192, 256, 512): one for the wrapper based implementation of atomic floating point operations and one for the inbuilt support for atomic floating point operations. Schemes compared: Atomic, Replicate, Hybrid.]
K-Means Results: Hybrid Versions

[Two charts of execution time (seconds) vs. threads per block (64, 128, 256, 512): hybrid versions for k = 10 and for k = 100, with group sizes ranging from 4 to 64 threads per group.]
PCA Results

[Two charts of execution time (seconds) vs. threads per block (64, 128, 192, 256, 512): comparison of parallelization schemes with the wrapper based implementation for 16 columns, and with inbuilt atomic floating point for 32 columns. Schemes compared: wrapper atomic, wrapper hybrid, inbuilt atomic, inbuilt hybrid.]
PCA Results: Hybrid Versions

[Two charts of execution time (seconds) vs. threads per block (64, 128, 192, 256, 512): hybrid versions for 16 columns and for 32 columns, with group sizes ranging from 4 to 32 threads per group.]
kNN Results

[Two charts of execution time (seconds) vs. number of threads per block (64, 128, 192, 256, 512): KNN comparison of schemes for k = 10 and for k = 20. Schemes compared: atomic, replicate, hybrid, full replication on SM.]
kNN Results: Hybrid Versions

[Two charts of execution time (seconds) vs. number of threads per block (64, 128, 192, 256, 512): hybrid versions of KNN for k = 10 and for k = 20, with group sizes of 4, 8, 16, 32, and 64 threads per group.]
Conclusions

- The new features of the Fermi series GPU cards:
  - inbuilt support for double precision atomic operations
  - an increase in the amount of available shared memory
- These were evaluated against three reduction based data mining algorithms.
- Performance depends on the balance between the overheads of thread contention and global combination:
  - for smaller clusters, contention is the dominant factor,
  - for larger clusters, combination overhead dominates.


Thank You!

Questions?