Accelerating Machine Learning Applications on Graphics Processors


Accelerating Machine Learning Applications on Graphics Processors

Narayanan Sundaram and Bryan Catanzaro

Presented by Narayanan Sundaram

Big Picture

[Diagram: the layered stack for a content-based image retrieval (CBIR) application, organized by pattern language, SW infrastructure, and platform, and annotated with roles. Application layer: consumer search / face search, built by the application developer for the searcher on a feature extraction & classifier application pattern. Framework layer: a Map Reduce programming framework, built by a framework developer around the Map Reduce programming pattern. Infrastructure layer: CUDA as the computation & communication framework, realizing barrier/reduction computation & communication patterns on the NVIDIA G80 platform from the hardware architect.]

GPUs as proxy for manycore

- GPUs are interesting architectures to program
- Transitioning from highly specialized pipelines to general purpose
- The only way to get performance from GPUs is through parallelism (no caching, branch prediction, prefetching, etc.)
- Can launch millions of threads in one call


GPUs are not for everyone

- Memory coalescing is really important
- Irregular memory accesses, even to local stores, are discouraged: up to 30% performance hit on some apps from local-memory bank conflicts (see the padding sketch below)
- Cannot forget that it is a SIMD machine
- Memory consistency is non-existent and inter-SM synchronization is absent
- Hardware-scheduled threads
- 20 µs overhead per kernel call (20,000 instructions @ 1 GHz)

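To make the bank-conflict point concrete, here is a minimal sketch of the usual padded-tile idiom in CUDA shared memory; the transpose kernel, the TILE size, and all names are our illustrative assumptions, not code from the talk:

    #define TILE 16

    __global__ void transpose_tile(const float *in, float *out, int n)
    {
        // +1 padding column: without it, reading tile[threadIdx.x][y] down a
        // column would hit the same bank for every thread in a warp.
        __shared__ float tile[TILE][TILE + 1];

        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        if (x < n && y < n)
            tile[threadIdx.y][threadIdx.x] = in[y * n + x];  // coalesced load
        __syncthreads();

        int tx = blockIdx.y * TILE + threadIdx.x;  // transposed coordinates
        int ty = blockIdx.x * TILE + threadIdx.y;
        if (tx < n && ty < n)
            out[ty * n + tx] = tile[threadIdx.x][threadIdx.y];  // coalesced store
    }

With the padding, column reads fall into distinct banks; this is the kind of local-store conflict the 30% figure above refers to.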

NVIDIA G80 Architecture

[Figure: block diagram of the G80 (16 streaming multiprocessors, each with 8 stream processors and a 16 KB local store); diagram not recoverable.]


NVIDIA GeForce 8800 GTX Specifications

Number of Streaming Multiprocessors    16
Multiprocessor Width                   8
Local Store Size                       16 KB
Total Number of Stream Processors      128
Peak SP Floating Point Rate            346 Gflops
Clock                                  1.35 GHz
Device Memory                          768 MB
Peak Memory Bandwidth                  86.4 GB/s
Connection to Host CPU                 PCI Express
CPU -> GPU bandwidth                   2.2 GB/s*
GPU -> CPU bandwidth                   1.7 GB/s*

* measured values

GPU programming - CUDA

- Each block can have up to 512 threads that synchronize
- Millions of blocks can be issued
- No synchronization between blocks
- No control over scheduling

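As a small, hypothetical sketch of this model, the following kernel launches one thread per element; the block and thread counts are illustrative:

    __global__ void scale(float *data, float alpha, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n)
            data[i] *= alpha;  // independent per-element work; no inter-block sync needed
    }

    // Host side: one thread per element; scheduling is left entirely to hardware.
    void launch_scale(float *d_data, float alpha, int n)
    {
        int threads = 256;                          // per block; the G80 limit is 512
        int blocks = (n + threads - 1) / threads;   // may run into the millions
        scale<<<blocks, threads>>>(d_data, alpha, n);
    }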

Support Vector Machines

- A hugely popular machine learning technique for classification
- Tries to find a hyperplane separating the different classes with "maximum margin"
- Non-linear surfaces can be generated through non-linear kernel functions
- Uses quadratic programming for training (the specific set of constraints admits a wide variety of solution techniques)


SVM Training

- Quadratic program (dual formulation):

    \max_{\alpha} \; \sum_{i=1}^{l} \alpha_i - \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} \alpha_i \alpha_j y_i y_j K(x_i, x_j)

    subject to  0 \le \alpha_i \le C  and  \sum_{i=1}^{l} y_i \alpha_i = 0

- Some kernel functions:

    Linear:      K(x_i, x_j) = x_i \cdot x_j
    Polynomial:  K(x_i, x_j) = (a \, x_i \cdot x_j + r)^d
    Gaussian:    K(x_i, x_j) = \exp\{-\gamma \lVert x_i - x_j \rVert^2\}
    Sigmoid:     K(x_i, x_j) = \tanh(a \, x_i \cdot x_j + r)

- Variables:

    \alpha: weight for each training point (determines the classifier)

- Data:

    l: number of training points
    C: trades off error on the training set against generalization performance
    y: label (+/-1) for each training point
    x: training points


Choice of parallel algorithm (among chunking algorithms)

[Figure: task distribution for SVM training as the working set size varies from 2 to 2048; stacked bars (0-100%) show the fraction of time spent in the KKT update, working set selection, QP solve, and other work.]

[Figure: solve time (s, 0-100) versus working set size (2 to 2048); the smallest working set, size 2, corresponds to Sequential Minimal Optimization (SMO).]

Fitting SMO on a GPU

- The algorithm fits within the GPU's shared memory constraints, since only two vectors need to be shared among all the threads
- Performance depends strongly on the choice of the working set
- Several heuristics have been proposed; two are popular (1st and 2nd order; see the reduction sketch below)
- The 2nd order heuristic is almost twice as costly per iteration, but saves on the number of iterations

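Working-set selection boils down to finding extreme KKT violators, which is a reduction. Below is a minimal sketch of a single-block argmax reduction of the kind such a heuristic needs; the function and array names are ours, and a real multi-block implementation would add a second pass over per-block results. Assumes blockDim.x is a power of two, at most 256:

    __global__ void argmax_block(const float *f, int n, int *out_idx)
    {
        __shared__ float val[256];
        __shared__ int   idx[256];

        int tid = threadIdx.x;
        float best  = -1e30f;
        int   besti = -1;

        // Map phase: each thread scans a strided slice of f.
        for (int i = tid; i < n; i += blockDim.x)
            if (f[i] > best) { best = f[i]; besti = i; }

        val[tid] = best;
        idx[tid] = besti;
        __syncthreads();

        // Tree reduction in shared memory.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s && val[tid + s] > val[tid]) {
                val[tid] = val[tid + s];
                idx[tid] = idx[tid + s];
            }
            __syncthreads();
        }
        if (tid == 0)
            *out_idx = idx[0];
    }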

Adaptive heuristic

- Both heuristics can be expressed as a series of "Map Reduce" stages
- A Map Reduce code generator was used to generate the code
- Sample periodically and adapt, choosing whichever heuristic is converging fastest at any given time
- Tightly coupled map-reduces are essential for machine learning algorithms (see the fused sketch below)
- Cannot afford the overhead of a general library call when it is invoked millions of times


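To make "tightly coupled" concrete, here is a sketch of a fused map-reduce kernel: the map step runs in the same kernel as the block-level reduction, so no general library call is paid per stage. The elementwise product is a stand-in for the real per-point update and all names are ours; assumes blockDim.x is a power of two (e.g. 256):

    __global__ void map_reduce_sum(const float *x, const float *y,
                                   int n, float *block_sums)
    {
        __shared__ float partial[256];
        int tid = threadIdx.x;
        int i   = blockIdx.x * blockDim.x + tid;

        // Map: compute the per-element value directly in registers.
        partial[tid] = (i < n) ? x[i] * y[i] : 0.0f;
        __syncthreads();

        // Reduce: tree sum within the block; no extra kernel launch.
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (tid < s)
                partial[tid] += partial[tid + s];
            __syncthreads();
        }
        if (tid == 0)
            block_sums[blockIdx.x] = partial[0];  // a small second pass sums these
    }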

Results

[Figure: iterations and solve time for the 2nd order and adaptive heuristics on the Adult, Faces, Forest, Mnist, Usps, and Web datasets, normalized to the 1st order heuristic (scale 0 to 2).]

[Figure: overall training speedup compared to LIBSVM on the same datasets (scale 0 to 40x).]

SVM Classification

- The SVM classification task involves finding which side of the hyperplane a point lies on
- Specifically, for a test point z the predicted label is

    \hat{y} = \operatorname{sign}\left( b + \sum_{i=1}^{l} y_i \alpha_i K(x_i, z) \right)

  where b is the classifier offset and only the support vectors (points with \alpha_i > 0) contribute to the sum
- Insight: instead of doing this serially for all test points, note that the kernel evaluations for all test points against all support vectors can be computed at once

Restructuring the Classification problem

[Diagram: computing the output one test point at a time (support vectors times a single test vector, repeated) versus computing all outputs at once as a single matrix-matrix product between the test data matrix and the support vector matrix.]
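A sketch of this restructuring: the slides measure an SGEMM stage, and cuBLAS is one way to invoke it; whether cuBLAS was used, and these exact layouts and names, are our assumptions for illustration:

    #include <cublas_v2.h>

    // Row-major layouts: Test is nTest x d, SV is nSV x d,
    // Dot (output) is nTest x nSV with Dot[i][j] = <Test_i, SV_j>.
    void kernel_dot_products(cublasHandle_t handle,
                             const float *d_Test, const float *d_SV,
                             float *d_Dot, int nTest, int nSV, int d)
    {
        const float one = 1.0f, zero = 0.0f;
        // cuBLAS is column-major; this transpose pattern yields the
        // row-major product Dot = Test * SV^T in a single call.
        cublasSgemm(handle, CUBLAS_OP_T, CUBLAS_OP_N,
                    nSV, nTest, d,
                    &one, d_SV, d, d_Test, d,
                    &zero, d_Dot, nSV);
        // For a Gaussian kernel, combine Dot with precomputed squared norms:
        // K(x, z) = exp(-gamma * (||x||^2 - 2 <x, z> + ||z||^2)).
    }

One large SGEMM keeps the GPU busy with regular, coalesced traffic instead of many small dot products.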

Results

Dataset   LibSVM time (s)   CPU optimized code time (s)   GPU time (s)
Adult     61.307            7.476                          0.575
Web       106.835           15.733                         1.063
MNIST     269.880           9.522                          1.951
USPS      0.777             0.229                          0.00958
Face      88.835            5.191                          0.705

Results

[Figure: additional classification results; chart not recoverable.]

Is this compute or memory bound?

- GPUs are better for memory-bound jobs (observed 7 GB/s, vs 1 GB/s for other streaming-like apps)

[Figure: time breakup for classification on CPU and GPU as the number of dimensions grows from 50 to 800; stacked bars (0-100%) split each run into SGEMM and the rest.]

Importance of memory coalescing

- To avoid non-coalesced memory accesses, both Data and Data^T were carried into GPU memory (see the sketch below)
- Letting even 0.05% of memory accesses be non-coalesced led to a 21% drop in performance in one case
- Well-written code should scale with GPU size (parallelism should be limited by problem size, not machine size)




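A small sketch of why carrying the transpose helps; the kernel, names, and layout are illustrative assumptions:

    // dataT is the transposed copy (nDims x nPoints, row-major): the values of
    // one feature across all points are contiguous in memory.
    __global__ void gather_feature(const float *dataT, int nPoints,
                                   int dim, float *out)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < nPoints)
            out[i] = dataT[dim * nPoints + i];  // consecutive threads, consecutive addresses
    }

    // The same gather from the row-major copy would read
    //     out[i] = data[i * nDims + dim];
    // with a stride of nDims floats between neighboring threads; on G80 such
    // non-coalesced accesses are serviced as separate memory transactions.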

Is SIMD becoming ubiquitous?

- SIMD is already important for performance on uniprocessor systems
- Task vs data parallelism
- Intel's new GPU has wide SIMD
- CUDA lesson: runtime SIMD binding is easier for programmers
- Non-SIMD code leads to a performance penalty, not incorrect programs; this prevents premature optimization and keeps code flexible


Conclusion

- GPUs and manycore CPUs are on a collision course
- Data parallelism on GPUs vs task parallelism on CPUs
- Rethink serial control and data structures
- Sequential optimizations may harm parallelism
- Machine learning can use a lot of parallel hardware if the software is engineered properly



