
Software Pipelined Execution of Stream Programs on GPUs

Abhishek Udupa et al.

Indian Institute of Science, Bangalore

2009

Abstract


- Describes the challenges in mapping StreamIt to GPUs
- Proposes an efficient technique to software pipeline the execution of stream programs on GPUs
- Formulates the problem of scheduling and actor assignment to processors as an efficient ILP
- Also describes an efficient buffer layout technique for GPUs that facilitates exploiting the high memory bandwidth available on GPUs
- Yields speedups between 1.87X and 36.83X over a single-threaded CPU

Motivations


- GPUs have emerged as massively parallel machines
- GPUs can perform general-purpose computation
- The latest GPUs, consisting of hundreds of SPUs, are capable of supporting thousands of concurrent threads, with zero-cost hardware-controlled context switching
- GeForce 8800 GTS: 128 SPUs forming 16 multiprocessors, connected by a 256-bit-wide bus to 512 MB of memory

Motivations Contd.


- Each multiprocessor in NVIDIA GPUs is conceptually a wide SIMD processor
- CPUs have hierarchies of caches to tolerate memory latency
- GPUs address the problem by providing a high-bandwidth processor-memory link and by supporting a high degree of simultaneous multithreading (SMT) within each processing element
- The combination of SIMD and SMT enables GPUs to achieve a peak throughput of 400 GFLOPs

Motivations Contd.


- The high performance of GPUs comes at the cost of reduced flexibility and greater programming effort from the user
- For instance, in NVIDIA GPUs, threads executing on different multiprocessors cannot:
  - synchronize in an efficient manner
  - communicate in a reliable and consistent manner through the device memory within a kernel invocation
- The GPU cannot access the memory of the host system
- While the memory bus is capable of delivering very high bandwidth, accesses to device memory by threads executing on a multiprocessor need to be combined (coalesced), as the sketch below illustrates
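Below is a minimal CUDA sketch, not from the paper, contrasting an access pattern that coalesces with one that does not; the kernel names and the stride parameter are illustrative.

```cuda
// Illustrative only: on G80-class hardware, device-memory accesses by the
// threads of a half-warp are combined into one transaction only when they
// fall into a contiguous, aligned segment.
__global__ void coalescedCopy(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];            // thread t reads word t: combined access
}

__global__ void stridedCopy(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i * stride];   // thread t reads word t*stride: accesses
                                   // are serialized into separate transactions
}                                  // (assumes `in` holds at least n*stride floats)
```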

Motivations Contd.


- ATI and NVIDIA have proposed the CTM and CUDA frameworks, respectively, for developing general-purpose applications targeting GPUs
- Both frameworks still require the programmer to express the program as data-parallel computations that can be executed efficiently on the GPU
- Programming with these frameworks ties the application to the platform
- Any change in the platform would require significant rework in porting the applications

Prior Works


- I. Buck, T. Foley, D. Horn, J. Sugerman, K. Fatahalian, M. Houston, and P. Hanrahan, "Brook for GPUs: Stream Computing on Graphics Hardware," ACM Transactions on Graphics, vol. 23, no. 3, pp. 777-786, 2004
- D. Tarditi, S. Puri, and J. Oglesby, "Accelerator: Using Data Parallelism to Program GPUs for General-Purpose Uses," in ASPLOS-XII: Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, 2006, pp. 325-335
- These techniques still require considerable effort by the programmer to transform the streaming application to execute efficiently on the GPU
- M. Kudlur and S. Mahlke [PLDI 2008] map StreamIt onto the Cell BE platform

[Figure: steady-state execution of filters x, y, z — (a) on an unthreaded CPU, (b) with 4-wide SIMD, (c) on the Cell BE (Kudlur and Mahlke), where a Split/Join distributes the filter instances across SPE0 and SPE1]

SIMD reduces the trip-count of the loop

With GPU Model

[Figure: iterations X1...X2n, Y1...Y2n, Z1...Z2n spread across GPU threads]

Thousands of iterations are carried out in parallel, greatly reducing the trip-count of the loop, as the sketch below illustrates
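A minimal CUDA sketch of this model, with hypothetical filter bodies standing in for X, Y and Z: each steady-state iteration becomes one GPU thread, so the CPU loop over iterations disappears.

```cuda
// Hypothetical stand-ins for the work functions of filters X, Y, Z.
__global__ void steadyStateIterations(const float* in, float* out, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // iteration number
    if (i < iters) {
        float x = in[i] + 1.0f;   // filter X
        float y = x * 2.0f;       // filter Y
        out[i] = y - 3.0f;        // filter Z
    }
}

// Launch one thread per iteration, e.g. for 2n iterations:
//   steadyStateIterations<<<(2 * n + 255) / 256, 256>>>(d_in, d_out, 2 * n);
```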

Motivations Contd.


- Parallelism must be exploited at various levels:
  - data parallelism across threads executing on a multiprocessor
  - SMT parallelism among threads on the same multiprocessor, which needs to be managed to provide optimum performance
  - parallelism offered by having multiple multiprocessors on a GPU, which should be used to exploit task- and pipeline-level parallelism
- Further, accesses to device memory need to be coalesced as far as possible to ensure optimal usage of the available bandwidth
- These challenges are solved in this paper

Contributions


This paper makes the following contributions:

- Describes a software pipelining framework for efficiently mapping StreamIt programs onto GPUs
- Presents a buffer mapping scheme for StreamIt programs to make efficient use of the memory bandwidth of the GPU
- Describes a profile-based approach to decide the optimal number of threads assigned to a multiprocessor (the execution configuration) for StreamIt filters
- Implements the scheme in the StreamIt compiler and demonstrates a 1.87X to 36.83X speedup over a single-threaded CPU on a set of streaming applications

StreamIt Overview

- StreamIt is a novel language for streaming parallel computation
- Exposes parallelism and communication
- Architecture independent
- Modular and composable
  - Simple structures are composed to create complex graphs
- Malleable
  - Program behavior can be changed with small modifications

[Figure: StreamIt constructs — filter, pipeline, splitjoin (splitter ... joiner), feedback loop (joiner ... splitter); each component may be any StreamIt language construct]

Organization of the NVIDIA GeForce 8800 Series of GPUs

[Figure: architecture of the GeForce 8800 GPU; architecture of an individual SM]

ILP Formulation: Definition

- Formulates the scheduling of the stream graph across the multiprocessors (SMs) of the GPU as an ILP
- Each instance of each filter is the fundamental schedulable entity
- V denotes the set of all nodes (filters) in the stream graph
- N is the number of nodes in the stream graph
- E denotes the set of all (directed) edges in the stream graph
- k_v represents the number of firings of a filter v ∈ V in a steady state

ILP Formulation: Definition Contd.

- T represents the Initiation Interval (II) of the software-pipelined loop
- d(v) is the delay, or execution time, of filter v ∈ V
- I_uv is the number of elements consumed by filter v on each firing of v, for an edge (u, v) ∈ E
- O_uv is the number of elements produced by filter u on each firing of u, for an edge (u, v) ∈ E
- m_uv denotes the number of elements initially present on the edge (u, v) ∈ E

(The worked example below shows how the k_v values relate to these rates.)
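The k_v values are not free parameters: they come from the standard SDF balance equations, which these rate definitions determine. A small worked instance, with rates chosen for illustration:

```latex
% On every edge, a steady state produces exactly as many tokens as it
% consumes.  With O_{uv} = 2 and I_{uv} = 3, the smallest positive
% solution is k_u = 3 and k_v = 2 (6 tokens produced and consumed).
\[
  k_u \, O_{uv} \;=\; k_v \, I_{uv}, \qquad \forall (u,v) \in E
\]
```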

ILP Formulation: Resource Constraints

- w_{k,v,p} are 0-1 integer variables, for all k ∈ [0, k_v), all v ∈ V, all p ∈ [0, P_max)
- w_{k,v,p} = 1 implies that the k-th instance of filter v has been assigned to SM p
- The following constraint ensures that each instance of each filter is assigned to exactly one SM:

\[ \sum_{p=0}^{P_{max}-1} w_{k,v,p} = 1, \qquad \forall v \in V,\; \forall k \in [0, k_v) \tag{1} \]

ILP Formulation: Resource Constraints Contd.

- This constraint models the fact that all the filters assigned to an SM can be scheduled within the specified II (T):

\[ \sum_{v \in V} \sum_{k=0}^{k_v - 1} d(v)\, w_{k,v,p} \;\le\; T, \qquad \forall p \in [0, P_{max}) \tag{2} \]
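A quick numeric check of constraint (2), with hypothetical values: suppose T = 100 and SM p is assigned three filter instances with delays 40, 30 and 25. The constraint holds because the instances fit within one II:

```latex
\[
  \sum_{v \in V} \sum_{k=0}^{k_v - 1} d(v)\, w_{k,v,p}
    \;=\; 40 + 30 + 25 \;=\; 95 \;\le\; T = 100
\]
```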

ILP Formulation Contd.

- For dependencies among filters, we first need to ensure that the execution of any instance of any filter cannot wrap around into the next II
- Consider the linear form of a software-pipelined schedule, given by

\[ \sigma(j, k, v) = T \cdot j + A_{k,v} \]

where σ(j, k, v) represents the time at which the execution of the k-th instance of filter v in the j-th steady-state iteration of the stream graph is scheduled, and A_{k,v} are integer variables with A_{k,v} ≥ 0, for all k ∈ [0, k_v), all v ∈ V

ILP Formulation Contd.

- We can write

\[ A_{k,v} = T \left\lfloor \frac{A_{k,v}}{T} \right\rfloor + \left( A_{k,v} \bmod T \right) \]

- Defining

\[ f_{k,v} = \left\lfloor \frac{A_{k,v}}{T} \right\rfloor \quad \text{and} \quad o_{k,v} = A_{k,v} \bmod T \]

- the linear form of the schedule is given by

\[ \sigma(j, k, v) = T \left( j + f_{k,v} \right) + o_{k,v} \tag{3} \]

ILP Formulation Contd.

- Constrain the starting times o_{k,v} of the filters as follows:

\[ o_{k,v} + d(v) \le T, \qquad \forall v \in V,\; \forall k \in [0, k_v) \tag{4} \]

- This constraint ensures that every firing of a filter is scheduled so that it completes within the same kernel invocation, i.e., the same II

ILP Formulation: Dependency Constraints

- To ensure that the firing rule for each instance of each filter is satisfied by a schedule, the admissibility of a schedule is given by

\[ \sigma_v(i) \;\ge\; \sigma_u\!\left( \left\lceil \frac{(i+1)\, I_{uv} - m_{uv}}{O_{uv}} \right\rceil - 1 \right) + d(u), \qquad \forall (u,v) \in E,\; i \ge 0 \]

- This assumes that the firings of the instances of filter v are serialized

ILP Formulation: Dependency Constraints Contd.

- However, this assumption does not hold for this model, where instances of each filter could execute out of order, or in parallel across different SMs, as long as the firing rule is satisfied
- So we need to ensure that the schedule is admissible with respect to all the predecessors of the i-th firing of the filter v

ILP Formulation: Dependency Constraints Contd.

[Figure: example stream graph — filter A pushes 2 tokens per firing and filter B pops 3; the unrolled steady state has instances A0, A1, A2 feeding B0, B1]

ILP Formulation: Dependency Constraints Contd.

- Intuitively, the i-th firing of a filter v must wait for all the I_uv tokens produced after its (i-1)-th firing:

\[ \sigma_v(i) \;\ge\; \sigma_u\!\left( \left\lceil \frac{i\, I_{uv} + l - m_{uv}}{O_{uv}} \right\rceil - 1 \right) + d(u), \qquad \forall (u,v) \in E,\; \forall l \in [1, I_{uv}],\; i \ge 0 \]
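Applying this to the A→B example from the previous slide (assuming no initial tokens, m_AB = 0, with I_AB = 3 and O_AB = 2): the l-th token consumed by the i-th firing of B comes from firing ⌈(3i + l)/2⌉ − 1 of A.

```latex
\begin{align*}
  i = 0:&\quad l = 1, 2 \Rightarrow A_0, \quad l = 3 \Rightarrow A_1,
         &&\text{so } B_0 \text{ waits on } A_0 \text{ and } A_1 \\
  i = 1:&\quad l = 1 \Rightarrow A_1, \quad l = 2, 3 \Rightarrow A_2,
         &&\text{so } B_1 \text{ waits on } A_1 \text{ and } A_2
\end{align*}
```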

ILP Formulation: Dependency Constraints Contd.

- So far we have assumed that the result of executing a filter u is available d(u) time units after u has started executing
- However, the limitations of a GPU imply that this may not always be the case
- If the results are required from a producer instance of a filter u which is scheduled on an SM different from the SM where the consumer instance of a filter v is scheduled, then the data produced can only be reliably accessed in the next steady-state iteration

ILP Formulation: Dependency Constraints Contd.

- Model this using the w_{k,v,p} variables, which indicate which SM a particular instance of a given filter is assigned to
- Define 0-1 integer variables g_{l,k,u,v}:

\[ g_{l,k,u,v} \ge w_{k_l,u,p} - w_{k,v,p}, \qquad g_{l,k,u,v} \ge w_{k,v,p} - w_{k_l,u,p}, \qquad \forall k \in [0, k_v),\; \forall (u,v) \in E,\; \forall l \in [1, I_{uv}],\; \forall p \in [0, P_{max}) \tag{5} \]

where k_l is the instance of u that produces the l-th token consumed by the k-th firing of v; the two inequalities force g_{l,k,u,v} = 1 whenever the producer and consumer instances are assigned to different SMs

ILP Formulation: Dependency Constraints Contd.

- Finally, by simplifying:

\[ T \left( j + f_{k,v} \right) + o_{k,v} \;\ge\; T \left( j_{lag} + f_{k_l,u} + g_{l,k,u,v} \right) + o_{k_l,u} + d(u), \qquad \forall (u,v) \in E,\; \forall k \in [0, k_v),\; \forall l \in [1, I_{uv}] \tag{6} \]

where j_lag is the steady-state iteration in which the producing instance k_l of u fires; the g_{l,k,u,v} term delays cross-SM consumers by one additional II

Equations (1)-(6) together form the ILP

Code Generation for GPUs

[Figure: overview of the compilation process targeting a StreamIt program onto the GPU]
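For intuition, here is a hypothetical sketch of what generated CUDA code for one simple filter could look like; the actual code emitted by the modified StreamIt compiler is more involved (buffer indexing, multiple firings per thread), and the names below are invented.

```cuda
// Hypothetical generated kernel for a filter with pop rate 2 and push
// rate 1 (a two-tap averager).  One thread executes one firing.
__global__ void filter_avg_work(const float* inBuf, float* outBuf, int firings) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;  // firing index
    if (k < firings) {
        float a = inBuf[2 * k];       // first popped token
        float b = inBuf[2 * k + 1];   // second popped token
        outBuf[k] = 0.5f * (a + b);   // pushed token
    }
}
```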

CUDA Memory Model

- All threads of a thread block are assigned to exactly one SM; up to 8 thread blocks can be assigned to one SM
- A group of thread blocks forms a grid
- Finally, a kernel call dispatched to the GPU through the CUDA runtime consists of exactly one grid (see the sketch below)
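A minimal CUDA sketch of the dispatch hierarchy just described (the sizes are arbitrary):

```cuda
__global__ void noop() {}  // placeholder kernel

int main() {
    dim3 block(256);               // all threads of a block run on one SM
    dim3 grid(16);                 // the 16 blocks together form one grid
    noop<<<grid, block>>>();       // one kernel call dispatches exactly one grid
    cudaDeviceSynchronize();
    return 0;
}
```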

Profiling and Execution Configuration Selection

- It is important to determine the optimal execution configuration, specified by the number of threads per block and the number of registers allocated per thread
- This is achieved by the profiling and execution configuration selection phase
- The StreamIt compiler is modified to generate the CUDA sources along with a driver routine for each filter in the StreamIt program

The purpose of profiling

- First, to accurately estimate the execution time of each filter on the GPU, for use in the ILP
- Second, to identify the optimal number of threads, by running multiple profile runs of the same filter while varying the number of threads and the number of registers (see the sketch below)
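A hedged sketch of the timing side of profiling, using CUDA events; filterKernel and the configuration values are hypothetical stand-ins, not the compiler's actual driver routine.

```cuda
__global__ void filterKernel(float* buf) { /* stand-in for a filter's work */ }

// Time one execution configuration; the caller sweeps threadsPerBlock
// (and register budgets via compiler flags) and keeps the fastest.
float timeConfig(int threadsPerBlock, float* d_buf) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    filterKernel<<<128, threadsPerBlock>>>(d_buf);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```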

Algorithm for Profiling Filters

[Figure: algorithm for profiling filters]

Optimizing Buffer Layout

[Figure: a filter with pop rate 4 executing with 4 threads, on a device with memory organized as 8 banks]

Optimizing Buffer Layout Contd.

Each thread of each cluster of 128 threads pops or pushes the first token it consumes or produces from contiguous locations in memory, as the sketch below illustrates
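A CUDA sketch of the layout this implies, under the assumption (mine, from the description) that token i of thread t is stored at inBuf[i * numThreads + t]; when all threads read their i-th token in lockstep, each half-warp touches contiguous addresses and the accesses coalesce.

```cuda
// Pop side of a filter under the transposed buffer layout.
__global__ void popCoalesced(const float* inBuf, float* outBuf,
                             int popRate, int numThreads) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;  // thread = filter instance
    if (t < numThreads) {
        float acc = 0.0f;
        for (int i = 0; i < popRate; ++i)
            acc += inBuf[i * numThreads + t];  // contiguous across threads
        outBuf[t] = acc;                       // push side coalesces the same way
    }
}
```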


Experimental Evaluation

Only stateless filters are considered

Experimental Setup

- Each benchmark was compiled and executed on a machine with dual quad-core Intel Xeon processors running at 2.83 GHz, with 16 GB of FB-DIMM main memory
- The machine runs Linux, kernel version 2.6.18, with NVIDIA driver version 173.14.09
- A GeForce 8800 GTS 512 GPU with 512 MB of device memory was used

\[ \text{speedup} = \frac{t_{host}}{t_{gpu}} \]

[Figure: speedups of the benchmarks. SWPNC: SWP implementation with No Coalescing; Serial: serial execution of filters using a SAS schedule]

Phased Behaviour

[Figure: phased behaviour of the benchmarks. SWP: optimized SWP schedule with no coarsening; SWP 4: coarsened 4 times; SWP 8: coarsened 8 times; SWP 16: coarsened 16 times]

About the ILP Solver

- CPLEX version 9.0, running on Linux, was used to solve the ILP
- The machine used was a quad-processor Intel Xeon 3.06 GHz with 4 GB of RAM
- CPLEX ran as a single-threaded application and hence did not make use of the available SMP
- The solver was allotted 20 seconds to attempt a solution with a given II
- If it failed to find a solution in 20 seconds, the II was relaxed by 0.5% and the process repeated until a feasible solution was found (see the sketch below)
- Handling stateful filters on GPUs is possible future work
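A host-side sketch of that II search loop; solveWithCplex is a hypothetical stand-in for the real CPLEX invocation, which is not shown in the talk.

```cuda
// Hypothetical stand-in: returns true if CPLEX finds a feasible schedule
// for this II within the time budget (a real version would call CPLEX).
static bool solveWithCplex(double ii, int timeLimitSeconds) {
    (void)timeLimitSeconds;
    return ii >= 1000.0;  // placeholder feasibility threshold for the sketch
}

// II search from the slide: start at the lower bound and relax the II by
// 0.5% after every 20-second timeout until a feasible schedule is found.
static double findFeasibleII(double iiLowerBound) {
    double ii = iiLowerBound;
    while (!solveWithCplex(ii, /*timeLimitSeconds=*/20))
        ii *= 1.005;
    return ii;
}
```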

Questions?
