StreamX10: A Stream Programming Framework on X10

Haitao Wei
2012-06-14
School of Computer Science, Huazhong University of Science and Technology

Outline

1. Introduction and Background
2. COStream Programming Language
3. Stream Compilation on X10
4. Experiments
5. Conclusion and Future Work

Background and Motivation

- Stream programming
  - A high-level programming model that has been productively applied
  - Usually depends on specific architectures, which makes it difficult to port between platforms
- X10
  - A productive parallel programming environment
  - Isolates the details of different architectures
  - Provides a flexible parallel programming abstraction layer for stream programming
- StreamX10
  - Tries to make stream programs portable by building on X10

COStream Language

- stream
  - FIFO queue connecting operators
- operator
  - Basic functional unit (an actor node in the stream graph)
  - Multiple inputs and multiple outputs
  - Window: provides operations like pop, peek, and push
  - init and work functions
- composite
  - Connected operators (a subgraph of actors)
  - A stream program is composed of composites

COStream and Stream Graph

    Composite Main{
      graph
        stream<int i> S = Source(){
          state :{int x;}
          init :{x=0;}
          work :{
            S[0].i = x;
            x++;
          }
          window S:tumbling,count(1);
        }
        stream<int j> P = MyOp(S){
          param
            pn:N
        }
        () as SinkOp = Sink(P){
          state :{int r;}
          work :{
            r = P[0].j;
            println(r);
          }
          window P:tumbling,count(1);
        }
    }

    Composite MyOp(output Out ; input In){
      param
        attribute:pn
      graph
        stream<int j> Out = Averager(In){
          work :{
            int sum=0,i;
            for(i=0;i<pn;i++)
              sum += In[i].j;
            Out[0].j = (sum/pn);
          }
          window In:sliding,count(10),count(1);
                 Out:tumbling,count(1);
        }
    }

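[Figure: stream graph of the program above. Source (push=1) feeds stream S into Averager (peek=10, pop=1, push=1), which feeds stream P into Sink (pop=1). Callouts mark a stream, an operator, and a composite.]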
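The window rates in the stream graph (peek=10, pop=1, push=1 for Averager) can be illustrated outside COStream. The following is a minimal Python sketch, not StreamX10 output: it assumes that each firing reads the first `peek` tokens without removing them, emits one integer average, and then removes `pop` tokens. The function name `sliding_average` is hypothetical.

```python
from collections import deque

def sliding_average(tokens, peek=10, pop=1):
    """Simulate an averaging operator over a FIFO stream.

    Assumes pop >= 1. Each firing peeks at the first `peek` tokens,
    pushes their integer average, then pops `pop` tokens.
    """
    fifo = deque()
    out = []
    for t in tokens:
        fifo.append(t)                      # upstream operator pushes a token
        while len(fifo) >= peek:            # enough tokens to fire
            window = [fifo[k] for k in range(peek)]   # peek: read only
            out.append(sum(window) // peek)           # push one result
            for _ in range(pop):
                fifo.popleft()                        # pop: consume tokens
    return out

print(sliding_average(range(12)))  # prints [4, 5, 6]
```

With the tokens 0..11, the operator first fires once ten tokens are available, and then once more for every additional token, because each firing consumes only one token.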

Compilation flow of StreamX10

Phase                 Function
Front-end             Translates the COStream syntax into an abstract syntax tree.
Instantiation         Instantiates the composites hierarchically into static flattened operators.
Static Stream Graph   Constructs the static stream graph from the flattened operators.
Scheduling            Calculates the initialization and steady-state execution orderings of operators.
Partitioning          Performs partitioning based on X10 parallelism models for load balance.
Code Generation       Generates X10 code for COStream programs.
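To make the Scheduling phase concrete, here is a minimal Python sketch of how initialization and steady-state firing counts could be derived for a linear pipeline with fixed push/pop/peek rates. It is an illustration under stated assumptions, not the StreamX10 scheduler: the graph is restricted to a chain, and the function names `steady_state` and `init_firings` are hypothetical.

```python
from fractions import Fraction
from functools import reduce
from math import gcd

def steady_state(chain):
    """chain: list of (name, pop, peek, push) tuples for a linear pipeline.

    Solves the balance equations r[i] * push[i] == r[i+1] * pop[i+1]
    and returns the smallest positive integer firing counts.
    """
    rates = [Fraction(1)]
    for (_, _, _, push), (_, pop, _, _) in zip(chain, chain[1:]):
        rates.append(rates[-1] * push / pop)
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in rates), 1)
    counts = [int(r * scale) for r in rates]
    g = reduce(gcd, counts)
    return {op[0]: c // g for op, c in zip(chain, counts)}

def init_firings(chain):
    """Firings needed before steady state so that every operator with
    peek > pop finds (peek - pop) extra tokens on its input."""
    init = [0] * len(chain)
    for i in range(len(chain) - 2, -1, -1):      # walk from sink to source
        _, pop, peek, _ = chain[i + 1]
        needed = init[i + 1] * pop + (peek - pop)
        push = chain[i][3]
        init[i] = -(-needed // push)             # ceiling division
    return {op[0]: n for op, n in zip(chain, init)}

# Rates taken from the Source / Averager / Sink example.
chain = [("Source", 0, 0, 1), ("Averager", 1, 10, 1), ("Sink", 1, 1, 0)]
print(steady_state(chain))   # {'Source': 1, 'Averager': 1, 'Sink': 1}
print(init_firings(chain))   # {'Source': 9, 'Averager': 0, 'Sink': 0}
```

For the example program, every operator fires once per steady-state iteration, and Source fires nine extra times during initialization so that Averager can peek at ten tokens while popping only one.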

The Execution Framework

[Figure: three places (Place 0, Place 1, Place 2), each running activities on a thread pool. Legend: local buffer object, global buffer object, intra-place data flow, inter-place data flow.]

- The nodes of the stream graph are partitioned among the places
- Each node is mapped to an activity
- The nodes execute in a pipelined fashion to exploit parallelism
- Local and global FIFO buffers are used to pass data between nodes
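The pipelined execution described above can be sketched in Python. This is only an analogy under stated assumptions: threads stand in for X10 activities, `queue.Queue` stands in for the FIFO buffers, places are not modeled, and the stage names (`source`, `doubler`, `sink`, `run_pipeline`) are hypothetical.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the stream

def source(out_q, n):
    for x in range(n):
        out_q.put(x)              # push a token downstream
    out_q.put(DONE)

def doubler(in_q, out_q):
    # Middle stage: runs concurrently with source and sink.
    while (item := in_q.get()) is not DONE:
        out_q.put(item * 2)
    out_q.put(DONE)

def sink(in_q, results):
    while (item := in_q.get()) is not DONE:
        results.append(item)

def run_pipeline(n):
    s = queue.Queue(maxsize=4)    # bounded FIFO between source and doubler
    p = queue.Queue(maxsize=4)    # bounded FIFO between doubler and sink
    results = []
    workers = [
        threading.Thread(target=source, args=(s, n)),
        threading.Thread(target=doubler, args=(s, p)),
        threading.Thread(target=sink, args=(p, results)),
    ]
    for w in workers:             # one worker per node
        w.start()
    for w in workers:             # wait for all workers to finish
        w.join()
    return results

print(run_pipeline(5))  # prints [0, 2, 4, 6, 8]
```

Because each stage blocks only on its own input and output queue, the three stages overlap in time whenever tokens are available, which is the pipeline parallelism the slide refers to.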

Inter-place Work Partitioning

- Objective: minimize communication and balance load (using Metis)

[Figure: a weighted stream graph split into three partitions, each with computation work = 10.]

- Speedup: 30/10 = 3
- Communication: 2
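The speedup and communication figures follow from simple arithmetic on a partition: speedup is the total work divided by the heaviest partition's work, and communication is the total weight of edges that cross partitions. The Python sketch below evaluates a given partition. It does not call Metis, and the six-node graph is hypothetical, constructed only so that the totals match the numbers above; the slide's actual graph topology is not recoverable from the text.

```python
def evaluate_partition(work, edges, part):
    """Return (load per partition, cut weight, ideal speedup).

    work:  {node: computation work}
    edges: {(u, v): communication weight}
    part:  {node: partition id}
    """
    load = {}
    for node, w in work.items():
        load[part[node]] = load.get(part[node], 0) + w
    # Only edges whose endpoints sit in different partitions cost communication.
    cut = sum(w for (u, v), w in edges.items() if part[u] != part[v])
    speedup = sum(work.values()) / max(load.values())
    return load, cut, speedup

# Hypothetical pipeline: six nodes of work 5 each, total work 30.
work = {"a": 5, "b": 5, "c": 5, "d": 5, "e": 5, "f": 5}
edges = {("a", "b"): 2, ("b", "c"): 1, ("c", "d"): 2,
         ("d", "e"): 1, ("e", "f"): 2}
part = {"a": 0, "b": 0, "c": 1, "d": 1, "e": 2, "f": 2}

load, cut, speedup = evaluate_partition(work, edges, part)
print(load, cut, speedup)  # {0: 10, 1: 10, 2: 10} 2 3.0
```

Cutting the two light edges keeps every partition at work 10, so the ideal speedup is 30/10 = 3 at a communication cost of 2.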


Global FIFO implementation

[Figure: a producer on Place 0 pushes into a local array and a consumer on Place 1 peeks/pops from a local array; data is copied between each local array and a DistArray.]

- Each producer/consumer has its own local buffer
- The producer uses the push operation to store data in its local buffer
- The consumer uses the peek/pop operations to fetch data from its local buffer
- When the producer's local buffer is full or the consumer's local buffer is empty, the data is copied automatically
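The buffering scheme above can be sketched in a few lines of Python. This is a single-process illustration under stated assumptions, not the StreamX10 runtime: a `deque` stands in for the DistArray, Python containers stand in for the local arrays, places and blocking are not modeled, and the class names `GlobalFifo`, `Producer`, and `Consumer` are hypothetical.

```python
from collections import deque

class GlobalFifo:
    """Shared buffer standing in for the DistArray."""
    def __init__(self):
        self.shared = deque()

class Producer:
    def __init__(self, fifo, capacity):
        self.fifo, self.capacity, self.local = fifo, capacity, []

    def push(self, item):
        self.local.append(item)
        if len(self.local) == self.capacity:    # local buffer is full
            self.flush()

    def flush(self):
        self.fifo.shared.extend(self.local)     # bulk copy to the shared buffer
        self.local.clear()

class Consumer:
    def __init__(self, fifo, capacity):
        self.fifo, self.capacity, self.local = fifo, capacity, deque()

    def _refill(self, need):
        # Copy batches of up to `capacity` tokens until `need` tokens are local.
        # A real runtime would wait here; this sketch just stops when the
        # shared buffer is empty, and the caller then gets an IndexError.
        while len(self.local) < need and self.fifo.shared:
            for _ in range(min(self.capacity, len(self.fifo.shared))):
                self.local.append(self.fifo.shared.popleft())

    def peek(self, k=0):
        self._refill(k + 1)
        return self.local[k]                    # read without removing

    def pop(self):
        self._refill(1)
        return self.local.popleft()             # read and remove

fifo = GlobalFifo()
producer, consumer = Producer(fifo, 4), Consumer(fifo, 4)
for x in range(6):
    producer.push(x)          # tokens 0..3 are copied once the buffer fills
print(consumer.peek(1))       # prints 1
producer.flush()              # copy the remaining tokens 4 and 5
print([consumer.pop() for _ in range(6)])  # prints [0, 1, 2, 3, 4, 5]
```

Copying in batches means the shared buffer is touched once per `capacity` tokens rather than once per token, which is the point of keeping a local buffer on each side.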

X10 code in the Back-end

    // main.x10 control code
    public static def main() {
        ...
        // Spawn an activity for each node at its place, according to the partition
        finish for (p in Place.places())
            async at (p) {
                switch (p.id) {
                case 0:
                    val a_0 = new Source_0(rc);
                    a_0.run();
                    break;
                case 1:
                    val a_2 = new MovingAver_2(rc);
                    a_2.run();
                    break;
                case 2:
                    val a_1 = new Sink_1(rc);
                    a_1.run();
                    break;
                default:
                    break;
                }
            }
    }

    // Source.x10 code
    ...
    // Define the work function
    def work() {
        ...
        push_Source_0_Sink_1(0).x = x;
        x += 1.0;
        pushTokens();
        popTokens();
    }

    // Call the work function in the initial and steady-state schedules
    public def run() {
        initWork(); // init
        // initSchedule
        for (var j:Int = 0; j < Source_0_init; j++)
            work();
        // steadySchedule
        for (var i:Int = 0; i < RepeatCount; i++)
            for (var j:Int = 0; j < Source_0_steady; j++)
                work();
        flush();
    }
    ...

Experimental Platform and Benchmarks

- Platform
  - Intel Xeon processor (8 cores), 2.4 GHz, with 4 GB memory
  - Red Hat EL5 with Linux 2.6.18
  - X10 compiler and runtime version 2.2.0
- Benchmarks
  - 11 benchmarks rewritten from StreamIt

Throughput Comparison

- Throughput of 4 different configurations (NPLACES*NTHREADS=8)
- Normalized to 1 place with 8 threads

[Figure: bar chart of throughput normalized to 1 place with 8 threads (y-axis 0 to 10). Series: NPLACES=1, NTHREADS=8; NPLACES=2, NTHREADS=4; NPLACES=4, NTHREADS=2; NPLACES=8, NTHREADS=1.]

- For most benchmarks, CPU utilization increases from 24% to 89% when the number of places goes from 1 to 4, except for benchmarks with a low computation/communication ratio
- The benefit is small, or performance gets worse, when the number of places increases from 4 to 8


Observation and Analysis

- Throughput goes up as the number of places increases, because multiple places increase CPU utilization
- Multiple places expose parallelism but also bring more communication overhead
- Benchmarks with heavier computation workloads, such as DES and Serpent_full, can still benefit from an increasing number of places

Conclusion

- We proposed and implemented StreamX10, a stream programming language and compilation system on X10
- A raw partitioning optimization is proposed to exploit parallelism based on the X10 execution model
- A preliminary experiment was conducted to study the performance

Future Work

- How to choose the best configuration (number of places and number of threads) automatically for each benchmark
- How to decrease the thread-switching overhead by mapping multiple nodes to a single activity

Acknowledgment

- X10 Innovation Award funding support
- QiMing Teng, Haibo Lin, and David P. Grove at IBM for their help on this research