Comparing MultiCore, ManyCore, and DataFlow Supercomputers: Acceleration vs Power Consumption vs Speed per Cubic Foot


Sasa Stojanovic, Veljko Milutinovic, Dragan Bojic, Miroslav Bojovic, Oscar Mencer, Michael Flynn

Frame No 1: To Select or to Hybridize

When a supercomputer team faces a new programming challenge, the first thing to do is to decide which supercomputer architecture to select for the highest performance, the lowest power consumption, and the smallest equipment volume. An alternative is to use a hybrid machine which includes all three architectural types (MultiCore, ManyCore, and DataFlow) and a sophisticated software dispatcher (partially implemented in the compiler and partially in the operating system) that decides which portion of the high-level language code goes to which architecture.

Frame No 2: A Symbolic Comparison of MultiCore, ManyCore, and DataFlow

A well-known anecdotal way to compare MultiCore and ManyCore supercomputers is to liken the two approaches to horses and chickens that pull a load wagon. Along the same anecdotal path, one can compare the DataFlow approach with ants that carry the load in their backpacks. Further, an analogy can be established between power consumption and feeding the animals, between cubic footage and "housing" for the animals, and between speed on Big Data and running fast up a vertical wall.


Feeding horses is much more expensive than feeding chickens, and feeding chickens is much more expensive than feeding ants (read "feeding" as "paying monthly electricity bills"). Electricity bills may double the initial investment for MultiCore in only a few years; for ManyCore it takes many more years, while DataFlow machines need as much as a few decades for the same.

Stables for horses can be extremely large (some supercomputer centers are even building new buildings for their next-generation machines). Chicken houses are much smaller than stables, but much bigger than ant holes. The cost of placing gates across the field has two components: (a) deciding where to put them (higher programmer effort) and (b) physically putting them at the decided locations (longer compile time).

Only ants can move fast up a vertical wall; chickens and horses cannot. In other words, if an extremely large data set has to be crunched, the DataFlow approach is the fastest, as indicated in the non-anecdotal part of this paper.

Frame No 3: Definition of Multi/ManyCore and Fine/CoarseGrainDataFlow

Multi/ManyCore architectures are composed of general-purpose processing cores, similar to a factory in which every worker performs every task. While MultiCore is composed of a small number of highly sophisticated and high-speed processing cores, ManyCore is composed of a larger number of simpler and slower processing cores. At the opposite side are DataFlow architectures, which process data in a manner similar to a factory in which workers are arranged in an assembly line and each worker performs only one operation on each product. FineGrain DataFlow architectures are composed of special-purpose processing elements interconnected to form the hardware through which data will be processed. CoarseGrain DataFlow architectures are composed of general-purpose cores, and the dataflow concept is applied at the software level in order to utilize as much as possible of the parallelism available in the program.

1. Introduction

One possible classification of supercomputer systems is given in Figure 1. The major branches of the classification imply the following architectures: MultiCore, ManyCore, FineGrainDataFlow, and CoarseGrainDataFlow supercomputers.

[Figure 1 diagram: supercomputers are first split by flow concept into ControlFlow and DataFlow; ControlFlow is split by processor granularity into MultiCore and ManyCore, while DataFlow is split by dataflow granularity into FineGrainDataFlow and CoarseGrainDataFlow.]

Figure 1: A classification of supercomputer architectures.

Programming of the first two architectural types follows the classical programming paradigm, and the achieved speedups are well described in the open literature [Cuda1, Parallel1, CPUvsGPU1, CPUvsGPU2, CPUvsGPU3].

Programming of a DataFlow system follows a different programming paradigm, and the following two issues are crucial: (a) one has to know which applications enjoy a speedup when programmed for a DataFlow system, and (b) one has to know how to program the DataFlow-oriented applications in order to generate the maximal speedup on a DataFlow system (for these applications, there is a large difference between what an experienced programmer achieves and what an inexperienced one can achieve).

1.1. What-to and What-not-to

The what-to/what-not-to question defines which parts of a program can benefit from a DataFlow system and which cannot. That is best explained using Figure 2.

The bottom line of Figure 2 is:

If the number of operations in a single loop iteration is above some critical value, while the available bandwidth to/from memory is large enough to transfer the data needed for one iteration,

then more data items mean more advantage for a dataflow machine, compared to MultiCore and ManyCore.

In other words, more data does not necessarily mean better performance of a DataFlow system (compared to MultiCore and ManyCore). The performance of the DataFlow system is better only for applications in which the number of operations per iteration is above some critical value, while the needed amount of data stays below the available bandwidth to/from memory.

Here, we have made the assumption that the required bandwidth to/from memory is below the available bandwidth. Unless stated otherwise, this assumption applies throughout the text, until it is discussed in Section 2.4.

On the contrary, if we have an application with a relatively small number of operations per iteration, it is possibly (but not always) a "what-not-to" application for a DataFlow system, and we are better off executing it on a MultiCore or a ManyCore machine; otherwise, we will (or may) experience a slowdown.
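As a compact summary of the what-to/what-not-to rule above, the following C sketch states the decision as a predicate. It is an illustration only, not part of any existing dispatcher; the critical value, the per-iteration byte count, and the bandwidth figures are hypothetical inputs that would have to be calibrated for a concrete machine.

#include <stdbool.h>

/* Illustrative sketch of the what-to/what-not-to rule described above.
 * n_ops_per_iter - number of operations in one loop iteration
 * n_ops_critical - machine-dependent critical value (about 10 in Figure 3)
 * bytes_per_iter - data that must be streamed to/from memory per iteration
 * iters_per_sec  - iteration rate the dataflow engine would sustain
 * bw_available   - sustainable memory bandwidth, in bytes per second */
bool prefer_dataflow(double n_ops_per_iter, double n_ops_critical,
                     double bytes_per_iter, double iters_per_sec,
                     double bw_available)
{
    double bw_required = bytes_per_iter * iters_per_sec;
    return (n_ops_per_iter > n_ops_critical) && (bw_required <= bw_available);
}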

In some cases, it is possible to apply some transformations (e.g., loop fusion [Trans1]) to the input code in order to make it appropriate for a DataFlow system. A second option is to implement a larger number of dataflows, but the use of this approach is limited because it increases the required bandwidth to/from memory. At the point when the whole bandwidth to/from memory is fully used, a further increase in the number of dataflows will only lead to lower utilization of the available hardware and will not bring any additional speedup.

The slope of the three solid lines in Figure 2 denotes the essence of the speedup achieved by the three architectures (taking into account the previously defined assumption that the amount of data needed for one iteration is small enough compared to the available bandwidth to/from memory): the more instructions in the loop body, the higher the slope of the solid lines for MultiCore and ManyCore, while the slope for DataFlow remains constant.

[Figure 2 drawings: three timing diagrams, (a) MultiCore, (b) ManyCore, (c) DataFlow, each with data items on the horizontal axis and time on the vertical axis; the annotated quantities are N_coresCPU, T_clkCPU, T_CPU, N_coresGPU, T_clkGPU, T_GPU, N_DF, T_clkDF, and T_DF.]

Figure 2: Graphical representation of the execution of a loop on the three architecture types. It is supposed that there is only one loop, whose iterations are composed of sequential code, and that there are no data dependencies between two different loop iterations. On the horizontal axis are the input data (discrete scale), while time is on the vertical axis. One vertical line represents the execution of one loop iteration. (a) The MultiCore execution model: N_coresCPU - number of available cores in the MultiCore, T_clkCPU - clock period of the MultiCore, T_CPU - execution time of one loop iteration on the MultiCore; (b) the ManyCore execution model: N_coresGPU - number of available cores in the ManyCore, T_clkGPU - clock period of the ManyCore, T_GPU - execution time of one iteration on the ManyCore; (c) the DataFlow execution model: N_DF - number of implemented dataflows, T_clkDF - clock period of the DataFlow system, T_DF - latency of the DataFlow pipeline.

The example of Figure 3 sheds concrete light on the issue (the chosen clock speeds were selected to reflect the reality of mid-2012).

DataFlow: one new result in each cycle. E.g., Clock = 400 MHz, Period = 2.5 ns, one result every 2.5 ns. Comment: the above holds no matter how many operations there are in each loop iteration. Consequently: more operations do not mean proportionally more time; however, more operations do mean a higher latency until the first result.

MultiCore or ManyCore: one new result after each iteration. E.g., Clock = 4 GHz, Period = 250 ps, one result every 250 ps times the number of operations. Comment: if the number of operations is greater than 10, then the dataflow machine is better, although it uses a slower clock.
Figure 3: An idealized example showing the critical value of the number of operations at which the execution times on the different architectures become the same. One simplification is made which does not impact the final conclusion: the number of pipelines in the DataFlow machine and the number of cores in the MultiCore/ManyCore are supposed to be the same. If that is not the case, the critical value has to be multiplied by a corrective factor defined by the ratio between the number of cores and the number of pipelines.
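The break-even point quoted in Figure 3 is simple arithmetic: the dataflow machine delivers one result per dataflow clock period, while the control flow machine delivers one result per core clock period multiplied by the number of operations. A minimal C sketch using the clock values from the figure (and the figure's assumption of equal numbers of pipelines and cores):

#include <stdio.h>

int main(void)
{
    /* Clock values quoted in Figure 3 (mid-2012). */
    const double t_clk_df  = 2.5e-9;   /* 400 MHz dataflow clock, 2.5 ns period */
    const double t_clk_cpu = 0.25e-9;  /* 4 GHz MultiCore clock, 250 ps period  */

    /* Control flow: one result every t_clk_cpu * n_ops.
       DataFlow:     one result every t_clk_df, regardless of n_ops.
       Break-even:   t_clk_cpu * n_ops == t_clk_df                     */
    double n_ops_critical = t_clk_df / t_clk_cpu;

    printf("Critical number of operations per iteration: %.1f\n",
           n_ops_critical);            /* prints 10.0, as stated in Figure 3 */
    return 0;
}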

In reality, the MultiCore/ManyCore systems will feature additional slowdowns, due to memory hierarchy accesses, pipeline-related hazards, and the additional time needed for the execution of control flow instructions. For the example of Figure 3, this means that the critical number of operations (yielding the same performance) is below 10. In reality, the number of operations per iteration can easily be much higher than 10. Of course, many applications or parts thereof are characterized by a number of operations per iteration below the critical value of 10, which implies that hybrid structures including all three architecture types do make sense.


The DataFlow system has no cache, but it does have a memory hierarchy. However, memory hierarchy access on the DataFlow machine is carefully planned by the programmer at program writing time. On the other hand, MultiCore and ManyCore systems do have a memory hierarchy whose access time cannot be precisely calculated at program writing time; it (the access time) can only be determined at program run time. Of course, this difference implies that a hybrid structure including all three architecture types does need a dispatcher to decide what executes where, and has to support three different programming models.

1.2. How To

The programming models for MultiCore/ManyCore and DataFlow are different. The outcome of this difference is that in the first case one writes an application program in C, or another appropriate language, while in the second case one writes a reduced application program in C, or another appropriate language, plus a set of programs written in some standard language extended to support hardware description (e.g., the Maxeler Java library, see Figure 4) or in some HDL (Hardware Description Language) to define the internal configuration of the DataFlow system [Maxeler1]. This is indicated in Figure 4 (the choice of languages reflects the typical reality).

[Figure 4 diagram: (a) Application Code (.c) is compiled and linked with libraries into an executable program; (b) Host Code (.c), Kernels (.java), and Manager (.java) are processed by the Kernel Compiler, Manager Compiler, and Hardware Builder into a Configuration (.max), which is then compiled and linked with libraries and MaxelerOS into an executable program spanning software and hardware.]

Figure 4: Files involved in the programming process for (a) MultiCore/ManyCore and (b) DataFlow. The names of the blocks denote the block functions. The Host Code of a DataFlow machine is a subset of the Application Code for MultiCore/ManyCore, which is indicated by the size differences of the corresponding blocks (Courtesy of Maxeler, London UK®).

In this context, a kernel describes one pipelined dataflow that takes data from an input stream, processes it throughout the pipeline, and produces an output stream with results. The manager orchestrates the work of the kernels and the streams of data between kernels, memory, and the host CPU. The manager and kernels are compiled and a hardware configuration is built. The configuration is then linked with the host program, together with the run-time library and MaxelerOS, in order to provide the configuration of the hardware and the streaming of data to/from the configured hardware.

1.3. Electricity Usage Bills

The major supercomputer exploitation-economy question is: in what time do the electricity usage bills reach the initial investment? Table 1 includes data from the websites of three reputable manufacturers of MultiCore, ManyCore, and FPGA chips (FPGA is most often used to implement DataFlow), together with two real case studies. For this type of comparison, what is essential is the correlation between power consumption (P) and operating frequency (f), which is captured by the following equation:

P = C * U^2 * f

where:
U - operating voltage, and
C - a constant related to the technology type and system size.
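As a worked instance of this relation, the C sketch below compares the dynamic power of two chips that differ only in frequency, using the declared frequencies of the MultiCore and DataFlow chips from Table 1 and assuming, purely for illustration, equal C and U; under that assumption the power ratio reduces to the frequency ratio.

#include <stdio.h>

/* Dynamic power model used in the text: P = C * U^2 * f */
static double dynamic_power(double c, double u, double f)
{
    return c * u * u * f;
}

int main(void)
{
    /* Hypothetical common values, only to isolate the effect of frequency. */
    const double C = 1.0, U = 1.0;

    double p_cpu = dynamic_power(C, U, 2930e6); /* Intel Core i7-870, Table 1 */
    double p_df  = dynamic_power(C, U, 150e6);  /* Maxeler MAX3, Table 1      */

    printf("P_CPU / P_DF = %.1f\n", p_cpu / p_df);  /* prints about 19.5 */
    return 0;
}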

As the frequency of modern DataFlow systems, based on FPGAs, is almost an order of magnitude lower than the frequency of same-sized MultiCore and ManyCore systems, it is expected that the electricity bill of a DataFlow system will be one order of magnitude lower than that of the other two architectures. Lower monthly electricity bills are highly desirable, but for an end user it is much more important what he/she gets for the money. The importance of this can be clearly seen if the solving of one problem is observed on each of the three architectures. Comparing same-sized systems, because of the lower operating frequency, the DataFlow system will consume approximately one order of magnitude less energy in the same time. If a DataFlow system outperforms a MultiCore system by one order of magnitude, then the difference in GFLOPS/$ will be two orders of magnitude. As the performances of the DataFlow and ManyCore systems are comparable (see Table 1), the difference in GFLOPS/$ will stay approximately one order of magnitude, in favor of the DataFlow system.

An explanation for the above can also be found inside the chip. Power dissipation is generated per unit of area. In that context, one can distinguish the energy spent on calculations (energy spent on the area where calculations are done) from the energy spent on the orchestration of those calculations (energy spent on the rest of the area). Taking into account the above, and the fact that the ratio of the area directly useful for calculations to the rest of the area is the largest for DataFlow architectures, we can expect dataflow systems to be the most energy-efficient ones.

One counter-argument is that a system with an X times slower clock will require X times more area for the same amount of calculations per second, leading to the conclusion that an FPGA will be more efficient only if its area directly useful for calculations is more than X times larger than the corresponding area in a chip with a faster clock. Table 1 shows that this is the case for modern systems.



Table 1: Declared and measured performance data for three representative systems of the presented architectures

                                                   Intel Core i7-870   NVidia C2070   Maxeler MAX3
0.  Type                                           MultiCore           ManyCore       FineGrain DataFlow

Declared data [Intel1, NVidia1, Xilinx1, Xilinx2]
1.  Working frequency (MHz)                        2930                1150           150
2.  Declared performance (GFLOPs)                  46.8                1030           450
3.  Declared normalized speedup                    1                   22             9.6
4.  Declared power consumption (W)                 95                  238            1-25
5.  Declared normalized power consumption          36.5                4.2            1

Measured on Bond Option [CPU_GPU_DF_12_1]
6.  Execution time (s)                             476                 58             50.3 (*)
7.  Normalized measured speedup                    1                   8.2            9.5
8.  Measured power consumption reduced by
    idle power, deltaP (W)                         103                 160            7
9.  Normalized energy for processing               139.8               26.5           1
10. Measured power consumption of the system (W)   183                 240            87
11. Normalized energy for the system               19.9                3.2            1

Measured on 3D European Option [CPU_GPU_DF_12_2]
12. Execution time (s)                             145                 11.5           9.6 (*)
13. Normalized measured speedup                    1                   12.7           15.1
14. Measured power consumption reduced by
    idle power, deltaP (W)                         69                  117            5
15. Normalized energy for processing               208.4               27.8           1
16. Measured power consumption of the system (W)   149                 271            85
17. Normalized energy for the system               26.5                3.8            1

(*) The shown results are for a reduced precision that is still large enough to give the same result as SP or DP floating point.
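The normalized energy rows of Table 1 follow directly from the measured execution times and deltaP values. The C sketch below recomputes row 9 (normalized energy for processing, Bond Option case) from rows 6 and 8; the small differences with respect to the published row are due to rounding of the tabulated values.

#include <stdio.h>

int main(void)
{
    /* Bond Option measurements from Table 1, rows 6 and 8. */
    const double t_cpu = 476.0, t_gpu = 58.0, t_df = 50.3;  /* seconds       */
    const double p_cpu = 103.0, p_gpu = 160.0, p_df = 7.0;  /* deltaP, watts */

    double e_cpu = t_cpu * p_cpu;   /* energy spent on processing, joules */
    double e_gpu = t_gpu * p_gpu;
    double e_df  = t_df  * p_df;

    /* Normalizing to the DataFlow system reproduces row 9: ~139 : ~26 : 1 */
    printf("Normalized energy: CPU %.1f  GPU %.1f  DF %.1f\n",
           e_cpu / e_df, e_gpu / e_df, e_df / e_df);
    return 0;
}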

Let us consider three systems, like those mentioned in Table 1, but scaled so that they have the same performance. Let us suppose that the electricity bills reach the initial investment of the MultiCore in N years (N is a small integer). The data from Table 1 (rows 11 and 17) lead to the conclusion that one can expect the same electricity bill to be generated in about 6*N to 7*N years with the ManyCore, and in about 20*N to 26*N years with the DataFlow.
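A minimal sketch of that payback estimate, assuming only that the electricity bill of each scaled system is proportional to its normalized system energy (Table 1, rows 11 and 17):

#include <stdio.h>

int main(void)
{
    /* Normalized energy for the whole system, Table 1 rows 11 and 17. */
    const double bond[3]     = { 19.9, 3.2, 1.0 };  /* CPU, GPU, DF */
    const double european[3] = { 26.5, 3.8, 1.0 };

    /* If the MultiCore bill reaches its initial investment in N years,
       the same bill is reached in (CPU/GPU)*N years on the ManyCore
       and in (CPU/DF)*N years on the DataFlow system.                 */
    printf("ManyCore: %.1f*N to %.1f*N years\n",
           bond[0] / bond[1], european[0] / european[1]);  /* ~6.2*N to ~7.0*N   */
    printf("DataFlow: %.1f*N to %.1f*N years\n",
           bond[0] / bond[2], european[0] / european[2]);  /* ~19.9*N to ~26.5*N */
    return 0;
}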

Obviously, the smaller the N, the larger the importance of the power consumption problem, and thus the larger the advantage of the DataFlow technology.

In the future, when the idle power becomes smaller (due to technology advances), the difference between the measured system power at run time and the idle power (deltaP, as given in rows 8 and 14) will become more important. If one views deltaP as the main part of the usefully spent energy, and almost the entire idle power as an overhead, then normalization of this part of the power consumption leads to the conclusion that in the future it will become possible to achieve an even larger relative difference in normalized power consumption between the three observed architectures (rows 9 and 15).



1.4. Space Rental Bills

The second major supercomputer exploitation-economy question is: how large is the investment in the space needed to host a supercomputer? Figure 5 includes data from the website of Maxeler, showing the floor plans of a microprocessor and an FPGA chip (DataFlow systems are typically implemented using FPGA chips).

For this type of comparison, what is essential is the sum of the VLSI areas needed for computation and for flow control. The total area (A) is given by:

A = A_control + A_data,

where:
A_control >> A_data for control flow machines, and
A_control << A_data for dataflow machines.

[Figure 5 diagram: (a) microprocessor floor plan dominated by control structures: cores containing branch prediction, instruction fetch and L1 cache, instruction decode and microcode, out-of-order scheduling, retirement, execution units, L1 data cache, L2 cache and interrupt servicing, paging, and memory ordering, surrounded by shared L3 cache, memory controller, queue and uncore, and miscellaneous I/O and QPI; (b) FPGA floor plan almost entirely occupied by DataFlow hardware, with a small MaxelerOS region.]

Figure 5: Control area versus data processing area for microprocessors and FPGAs: (a) microprocessor (Intel), and (b) FPGA (Xilinx).

The meaning of Figure 5 is best understood if one adopts the assumption that the bigger the product of the on-chip data processing area and the working frequency, the smaller the overall equipment volume for systems with the same performance. Such an assumption is justified by the fact that same-sized chips will differ in the number of operations they are capable of performing for a given task in a given time slot, depending on the number of processing elements (the size of the on-chip data processing area) and the speed of the processing elements (the working frequency). Data from Table 1 show that the part of the on-chip area in DataFlow architectures intended for calculations is large enough to compensate for the slower clock and, beyond that compensation, to achieve a significant speedup over control flow architectures.

_____________________________________________________________________________________

2. Elaboration of the Conditions when DataFlow is Superior

Today's real-world applications often require a huge number of calculations to be done in a short time. Most often, the same independent calculation is repeated on each datum in a huge dataset. The best way to speed up these applications is to exploit the huge amount of data parallelism by parallelizing the independent calculations. With only a few cores, MultiCore systems do not have enough resources to achieve any significant utilization of the available data parallelism. Instead of adding new CPUs, the addition of one or more accelerators is a much better choice (see Table 1), and is widely used today.

Two possible approaches are ManyCore-based accelerators and DataFlow-based reconfigurable accelerators. Making a choice between these accelerators is the first crucial step in achieving a significant speedup for an application. Choosing the right solution requires a good understanding of these architectures and of how they achieve speedup. For this reason, we offer analytical models of these accelerators that can help compare them (we advocate hybrid solutions combining the following three architectural types: MultiCore, ManyCore, and DataFlow).

2.1 Problem Statement

The set of applications characterized by a lot of data parallelism spans a wide range of domains. On one side there are many scientific applications in different areas, while on the other side there are numerous industrial applications with the same requirements. The requirements imposed by these applications come down to two main ones: response time and model precision. These two requirements are opposed. The first requirement can be seen in the examples of risk management that occur regularly in banks: once the precision is satisfied, there is a need to get the results of the calculations as soon as possible, in order to maximize the efficiency of the system and to maximize profit. Some other applications require better precision or a bigger data set, both placing a requirement for the processing of a bigger dataset in a time that is not so strongly limited. One example relates to oil drilling companies that need to process an extremely huge set of data representing measurements of wave reflections, all in order to find where to drill for oil.

Speeding up the mentioned applications is a complex task that includes numerous subtasks. One important subtask, the main topic of this paper, is understanding and comparing the different available accelerator types. Accelerators can be compared using several different criteria, such as power consumption, ease of programming, ease and cost of maintenance, price of the system, etc., but one criterion that is highly important, and often eliminatory, is performance, i.e., execution time.
This paper concentrates on a comparison of the performance of the mentioned accelerators through an analytical analysis of the relationship between execution time and problem size. As most architectures demonstrate different behavior depending on the concrete problem, the analysis has to take this fact into account and has to compare the behavior of the accelerators in all boundary cases. The resulting analysis should show the potentials and drawbacks of these architectures, and can help envision the future of accelerators as an important resource for the various simulations that industry and science depend on.

2.2 Existing Solutions to Be Compared

For the purpose of comparison, we introduce three models of execution in the next section. Exact modeling is almost impossible because of the many factors that are pseudo-random. Instead, we simplify the models here, so that one can group several parameters into a single one for which one can obtain the maximum and minimum values.

Taking these values, one can find the maximum and minimum execution times expressed as a function of the problem size. If the minimum execution time of one model is greater than the maximum execution time of the other one, the other architecture is better for the given problem size. Such a comparison cannot give the exact relationship between these architectures, but it can show that one architecture has an advantage over the other, and under what conditions.

It is possible that in some circumstances this method does not give a clear picture, i.e., the calculated ranges of execution times may overlap. That does not mean that the architectures are the same. In that case, one can try to find out how the parameters influence the execution time, and can show the circumstances under which one architecture has an advantage over the other architectures.

2.3 Axioms, Conditions, and Assumptions

In Figure 2, three drawings are given that represent the execution time of the iterations of some loop. We suppose that the loop body contains only sequential code; the reason for this will be explained in the next section. On the horizontal axis is the input data per iteration, and on the vertical axis is time.

This analysis does not target all kinds of programs in the world. Instead, it takes into consideration only those programs that have a lot of data parallelism concentrated in one or several nested loops intended for processing all elements of some n-dimensional data structure (n is a natural number).

Further, this one loop, or these several nested loops, are referred to as a Data Parallel Region, or just DPR. Possibly, there can be more than one DPR in a program. Accelerators are usually used to speed up one DPR at a time and, because of that, in the further analysis we will consider only one DPR, not the whole program. In the case of more than one DPR, the same reasoning applies to each DPR.

Looking at the size of the machine code, a DPR is relatively small compared to the rest of the program. On the other hand, looking at the time spent on its execution, the DPR represents the most time-consuming part of the program.

Assuming a MultiCore, the typical execution times of DPRs that are of interest for this analysis span from several minutes (or, more likely, from several hours) to several days (or more). DPRs with shorter execution times are not of interest, because they are executed fast enough on a MultiCore.

Further, we will introduce one simplification without loss of generality. We will suppose that all DPRs include only one loop. This is possible because a set of nested loops can always be replaced with equivalent code that includes only a single loop [Coalescing1], as illustrated below.
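For illustration, the hypothetical C example below coalesces a doubly nested loop over an n x m array into a single loop, in the spirit of [Coalescing1]; the function names and the workload are invented for this sketch.

/* Original DPR: two nested loops over an n x m array. */
void scale_nested(double *a, int n, int m, double k)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < m; j++)
            a[i * m + j] *= k;
}

/* Coalesced equivalent: a single loop over n*m iterations; the original
   indices can be recovered as i = t / m and j = t % m when needed.      */
void scale_coalesced(double *a, int n, int m, double k)
{
    for (int t = 0; t < n * m; t++)
        a[t] *= k;
}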

2.4 Analytical Analysis

Let us suppose that a single iteration of the only loop in the DPR is composed of sequential code only, and that the bandwidth to/from memory is large enough to fulfill all requests without introducing additional latency. Later we will analyze what happens if branches are included and the memory bandwidth is limited more realistically.

Let us denote the number of operations in a single iteration of the loop as N_OPS, and the number of iterations as N. Further, let us suppose a parameter CPI_CPU that represents the ratio between the average time spent per operation on one core and the clock period of the MultiCore. The parameters CPI_GPU and CPI_DF are the same ratios for the ManyCore and the DataFlow, respectively. Using these parameters and those shown in Figure 2, we can calculate the execution times of a single iteration (T_CPU, T_GPU, and T_DF) and the execution times of the whole loop (t_CPU, t_GPU, and t_DF), for these three systems, as follows:

(a) T_CPU = N_OPS * CPI_CPU * T_clkCPU
    t_CPU = N * T_CPU / N_coresCPU
          = N * N_OPS * CPI_CPU * T_clkCPU / N_coresCPU,

(b) T_GPU = N_OPS * CPI_GPU * T_clkGPU
    t_GPU = N * T_GPU / N_coresGPU
          = N * N_OPS * CPI_GPU * T_clkGPU / N_coresGPU,

(c) T_DF = N_OPS * CPI_DF * T_clkDF
    t_DF = T_DF + (N - N_DF) * T_clkDF / N_DF
         = N_OPS * CPI_DF * T_clkDF + (N - N_DF) * T_clkDF / N_DF
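The three formulae translate directly into code. The C sketch below evaluates them for arbitrary parameter values; the numbers in main() are placeholders chosen only to exercise the functions, not measurements.

#include <stdio.h>

/* Loop time on a MultiCore/ManyCore: N iterations spread over n_cores cores,
   each iteration costing n_ops * cpi * t_clk seconds (formulae (a) and (b)). */
double t_controlflow(double n, double n_ops, double cpi,
                     double t_clk, double n_cores)
{
    double t_iter = n_ops * cpi * t_clk;      /* T_CPU or T_GPU */
    return n * t_iter / n_cores;              /* t_CPU or t_GPU */
}

/* Loop time on a DataFlow system: pipeline fill latency plus one clock
   period per group of N_DF iterations thereafter (formula (c)).        */
double t_dataflow(double n, double n_ops, double cpi,
                  double t_clk, double n_df)
{
    double t_fill = n_ops * cpi * t_clk;      /* T_DF */
    return t_fill + (n - n_df) * t_clk / n_df;
}

int main(void)
{
    /* Placeholder values, for illustration only. */
    double N = 1e9, N_OPS = 100.0;
    printf("t_CPU = %.3f s, t_DF = %.3f s\n",
           t_controlflow(N, N_OPS, 1.0, 0.25e-9, 4.0),
           t_dataflow(N, N_OPS, 1.0, 2.5e-9, 4.0));
    return 0;
}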


Observing the previous three formulae, one can conclude that there are two most important parameters describing the problem at the data input: N_OPS and N. In the first two formulae these two parameters are multiplied, but in the third one these two parameters, each multiplied by some constant, are added. As a consequence, in the first two cases, increasing the number of operations in a loop iteration will increase the execution time needed for each group of N_cores iterations, and this linearly increases the execution time of the entire loop (N_OPS and N are multiplied). In the third case, because of the deeply pipelined dataflows, in which the executions of loop iterations are overlapped with a latency of only T_clkDF for each new group of N_DF iterations, the longer execution time of only the first group of N_DF iterations will influence the total execution time (the N_OPS and N components of the total execution time are added).

The only limitation to high performance is the capacity of the hardware used for the dataflow implementations. Today's technology brings enough capacity to make DataFlow systems perform well, often faster than control flow systems [CPU_GPU_DF_12_1, CPU_GPU_DF_12_2, Maxeler1].

To make it clear, N_DF is not a constant like N_coresCPU and N_coresGPU. It is a parameter that depends on the size of one pipelined dataflow hardware implementation and on the size of the chip; the ratio of these two sizes represents the maximum value of this parameter.

This means that for smaller loop bodies we will be able to decrease the execution time on a DataFlow system by implementing additional dataflows (implementing the same dataflow several times). This is true only as long as the assumption about the required bandwidth holds. If the required bandwidth reaches the available bandwidth, implementing additional dataflows will not bring any additional speedup, because the additional dataflows will not have access to the memory (the whole available bandwidth is already used). When this happens for some application, the execution time no longer depends on the chip capacity for calculation. Instead, the execution time becomes dependent only on the available bandwidth to/from memory.
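The bandwidth ceiling described above can be attached to the dataflow formula as a simple lower bound: the loop can never finish faster than the time needed to stream its data, so the effective time is the larger of the compute-bound and bandwidth-bound estimates. A sketch under that assumption, where bytes_per_iter and bw_available are hypothetical inputs:

/* Compute-bound estimate from formula (c), plus a bandwidth-bound
   estimate; whichever is slower dominates the execution time.      */
double t_dataflow_bounded(double n, double n_ops, double cpi, double t_clk,
                          double n_df, double bytes_per_iter,
                          double bw_available)
{
    double t_compute = n_ops * cpi * t_clk + (n - n_df) * t_clk / n_df;
    double t_memory  = n * bytes_per_iter / bw_available;
    return (t_compute > t_memory) ? t_compute : t_memory;
}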

Does this mean that dataflow is no longer the best choice? If we suppose the same bandwidth to/from memory for all three architectures (for the three compared chips that is not entirely true, but it is leveraged by the behavior explained in the next paragraph), and if some architecture uses the whole available bandwidth while the others do not, then the architecture that uses the whole bandwidth is the best performing one.

The simplest explanation of this fact is that, for some problem, the same amount of data has to be processed on each of the three architectures, and the fastest architecture will be the one that has the highest throughput of data through the processing hardware (chip).

The main conclusion of the above is that if, for some application, the dataflow is limited by the available bandwidth to/from memory, then no other architecture can perform faster, because, even if not limited by the available processing power of its hardware, it will also be limited by the available bandwidth to/from memory. This also explains the difference between the declared and achieved speedups shown in Table 1: the declared number of operations per second suggests that the ManyCore would achieve a better speedup for applications that have a larger computation/data-transfer ratio, all that at a higher operational price.

One highly important property of a DataFlow system in this context is that it needs the same amount of data in each cycle. Each operation in the loop body is executed once per cycle, but with data from different iterations; in effect, the whole loop body is executed each cycle. This is not the case only during several cycles at the beginning, until the pipeline is filled, and several cycles at the end, when the pipeline has no more input data. On control flow systems, because of the pseudo-random nature of execution on MultiCore/ManyCore, data are fetched in bursts (there are periods when there are no requests, while in other periods the available bandwidth is not enough and additional latency is introduced). As a consequence, it is expected that the DataFlow system can easily utilize the whole available bandwidth to/from memory, while the other two architectures can do the same only in the extreme case when the available bandwidth is not large enough to supply the required data.

The other influence of memory on execution time is the latency of memory accesses. In control flow architectures, when the processor tries to access data that are not in the cache, it has to wait for the data to be fetched from memory. If there is no reliable mechanism to prefetch the needed data so that they are ready when needed, additional latency will be introduced, which will be seen as an increase of the CPI. In dataflow architectures, the flow of data is planned at compile time and overlapped with calculations. In the case when the available bandwidth is large enough, memory access latency will not have any significant influence on the total execution time.

A more realistic case is when the loop body contains some control flow instructions. In control flow architectures, additional time is needed to fetch and execute control flow instructions, which, because of control hazards, can significantly influence the execution time of one loop iteration and hence also the execution time of the entire loop. Conditionally executed code influences the execution time only as many times as it is actually executed. In the formulae shown above, this is captured by an increase of the CPI, due to the execution of control flow instructions and the hazards they introduce.

In the DataFlow case, every possible choice is calculated, and at the end one of the calculated values is chosen, while the other calculated values are thrown away. This means that the execution time of control structures will be equal to the execution time of the longest possible choice, plus the time needed to make the choice between the calculated values. As explained in the previous paragraphs, this will influence the execution time of only the first group of N_DF iterations. More importantly, the effect of control structures implemented in dataflows is that additional area on the chip is needed, but is not efficiently utilized. In some extreme cases, this can significantly reduce the effectively available capacity of the hardware. For most of the real applications implemented in DataFlow systems, this does not cause any significant drawback [CPU_GPU_DF_12_1, CPU_GPU_DF_12_2, Maxeler1].
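In software terms, the "calculate every choice, then select" behavior described above corresponds to replacing a branch with a select. The minimal C illustration below is not Maxeler code; it only contrasts the two control structures.

/* Branching version: only one side is evaluated per iteration. */
double branch_version(double x, double threshold)
{
    if (x > threshold)
        return x * x;        /* longer path  */
    return x + 1.0;          /* shorter path */
}

/* Dataflow-style version: both sides are evaluated every time and one
   result is selected at the end, so the latency equals the longest path
   plus the select, and both paths occupy chip area.                     */
double select_version(double x, double threshold)
{
    double taken     = x * x;
    double not_taken = x + 1.0;
    return (x > threshold) ? taken : not_taken;
}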

3. Conclusion

The newly opened problems mentioned above lead to the conclusion that the optimal approach implies a hybrid solution that includes all three architecture types: MultiCore, ManyCore, and DataFlow. So far, hybrid approaches have typically included only two solutions (e.g., [MontBlanc]). Our conclusion goes one step further and adds a third component: DataFlow.

For a typical supercomputer workload mix, this means a considerable performance improvement, since the percentage of supercomputer code which is best dataflow-able is relatively high (e.g., [Maxeler3]) and can bring speedups of 20-40 [Maxeler1] or even more; in some cases even 100 or well above (e.g., [Maxeler2]).


The introduction of the third component into the hybrid implies appropriate changes in the programming model (e.g., [Ompss]) and the incorporation of dispatcher software which is able to recognize what is best moved to the DataFlow component of the hybrid, and how to do it. It can be implemented for either compile time or run time. More about programming models for hybrid computers can be found in [PMOD_HET_1, PMOD_HET_2, PMOD_HET_3].

4. References

[Cuda1] Garland, M.; Le Grand, S.; Nickolls, J.; Anderson, J.; Hardwick, J.; Morton, S.; Phillips, E.; Yao Zhang; Volkov, V., "Parallel Computing Experiences with CUDA," IEEE Micro, vol. 28, no. 4, pp. 13-27, July-Aug. 2008, doi: 10.1109/MM.2008.57, URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4626815&isnumber=4626808

[Parallel1] Podobas, A.; Brorsson, M.; Faxén, K., "A Comparison of some recent Task-based Parallel Programming Models," 3rd Workshop on Programmability Issues for Multi-Core Computers, 24 Jan 2010, Pisa, Italy

[Coalescing1] Polychronopoulos, C., "Loop Coalescing: a Compiler Transformation for Parallel Machines," University of Illinois at Urbana-Champaign, Center for Supercomputing Research and Development, 1987

[Intel1] Gepner, P.; Gamayunov, V.; Fraser, D.L., "The 2nd Generation Intel Core Processor. Architectural Features Supporting HPC," 10th International Symposium on Parallel and Distributed Computing (ISPDC), July 2011, Cluj Napoca, DOI: 10.1109/ISPDC.2011.13

[NVidia1] NVidia, "Tesla C2050 and C2070 GPU computing processor," web: http://www.nvidia.com/docs/IO/43395/NV_DS_Tesla_C2050_C2070_jul10_lores.pdf

[Xilinx1] Dave Strenski, "High Performance Computing Using FPGAs," web: http://www.xilinx.com/support/documentation/white_papers/wp375_HPC_Using_FPGAs.pdf

[Xilinx2] Xilinx Corp., "Virtex-6 FPGA Packaging and Pinout Specifications," web: http://www.xilinx.com/support/documentation/user_guides/ug365.pdf

[Maxeler1] Oskar Mencer, "Vertical Acceleration: From Algorithms to Logic Gates," web:

[Maxeler2] J.P. Morgan, "Innovation in Investment Banking Technology, Field Programmable Gate Arrays," web: http://techcareers.jpmorgan.com/downloads/Field%20Programme%20Gate%20Array%20factsheet.pdf

[Maxeler3] O. Lindtjorn, R. G. Clapp, O. Pell, O. Mencer and M. J. Flynn, "Surviving the End of Scaling of Traditional Microprocessors in HPC," IEEE HOT CHIPS 22, Stanford, USA, August 2010, web: http://www.hotchips.org/wp-content/uploads/hc_archives/archive22/HC22.23.120-1-Lindtjorn-End-Scaling.pdf

[MontBlanc] Mateo Valero, "Mont-Blanc, European Approach towards Energy Efficient HPC," ISC'12, Hamburg, Germany, June 2012.

[Ompss] Alejandro Duran, Eduard Ayguadé, Rosa M. Badia, Jesús Labarta, Luis Martinell, Xavier Martorell, Judit Planas, "OmpSs: A Proposal for Programming Heterogeneous Multi-core Architectures," Parallel Processing Letters, Vol. 21, Issue 2 (2011), pp. 173-193, DOI: 10.1142/S0129626411000151

[CPUvsGPU1] Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey, "Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU," Proceedings of the 37th Annual International Symposium on Computer Architecture (ISCA '10), ACM, New York, NY, USA, pp. 451-460, DOI: 10.1145/1815961.1816021, http://doi.acm.org/10.1145/1815961.1816021

[CPUvsGPU2] Stefano Cagnoni, Alessandro Bacchini and Luca Mussi, "OpenCL Implementation of Particle Swarm Optimization: A Comparison between Multi-core CPU and GPU Performances," Lecture Notes in Computer Science, 2012, Volume 7248/2012, pp. 406-415, DOI: 10.1007/978-3-642-29178-4_41

[CPUvsGPU3] Peter J. Lu, Hidekazu Oki, Catherine A. Frey, Gregory E. Chamitoff, Leroy Chiao, Edward M. Fincke, C. Michael Foale, Sandra H. Magnus, William S. McArthur, Jr., Daniel M. Tani, Peggy A. Whitson, Jeffrey N. Williams, William V. Meyer, Ronald J. Sicker, Brion J. Au, Mark Christiansen, Andrew B. Schofield, and David A. Weitz, "Orders-of-magnitude performance increases in GPU-accelerated correlation of images from the International Space Station," J. Real-Time Image Process. 5, 3 (September 2010), pp. 179-193, DOI: 10.1007/s11554-009-0133-1, http://dx.doi.org/10.1007/s11554-009-0133-1

[Trans1] Sharad Singhai, Kathryn McKinley, "Loop Fusion for Data Locality and Parallelism," Proceedings of MASPLAS'96, The Mid-Atlantic Student Workshop on Programming Languages and Systems, SUNY at New Paltz, April 27, 1996

[CPU_GPU_DF_12_1] Qiwei Jin, Diwei Dong, Anson H.T. Tse, Gary C.T. Chow, David B. Thomas, Wayne Luk, and Stephen Weston, "Multi-level Customisation Framework for Curve Based Monte Carlo Financial Simulations," Lecture Notes in Computer Science, 2012, Volume 7199/2012, pp. 187-201, DOI: 10.1007/978-3-642-28365-9_16

[CPU_GPU_DF_12_2] Anson H. T. Tse, Gary C. T. Chow, Qiwei Jin, David B. Thomas and Wayne Luk, "Optimising Performance of Quadrature Methods with Reduced Precision," Lecture Notes in Computer Science, 2012, Volume 7199/2012, pp. 251-263, DOI: 10.1007/978-3-642-28365-9_21

[PMOD_HET_1] Michael D. Linderman, Jamison D. Collins, Hong Wang, and Teresa H. Meng, "Merge: a programming model for heterogeneous multi-core systems," SIGARCH Comput. Archit. News 36, 1 (March 2008), pp. 287-296, DOI: 10.1145/1353534.1346318, http://doi.acm.org/10.1145/1353534.1346318

[PMOD_HET_2] Bellens, P.; Perez, J.M.; Badia, R.M.; Labarta, J., "CellSs: a Programming Model for the Cell BE Architecture," SC 2006 Conference, Proceedings of the ACM/IEEE, p. 5, Nov. 2006, doi: 10.1109/SC.2006.17, URL: http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4090179&isnumber=4090163

[PMOD_HET_3] Bratin Saha, Xiaocheng Zhou, Hu Chen, Ying Gao, Shoumeng Yan, Mohan Rajagopalan, Jesse Fang, Peinan Zhang, Ronny Ronen, and Avi Mendelson, "Programming model for a heterogeneous x86 platform," Proceedings of the 2009 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '09), ACM, New York, NY, USA, pp. 431-440, DOI: 10.1145/1542476.1542525, http://doi.acm.org/10.1145/1542476.1542525