Solving the Convolution Problem in Performance Modeling


San Diego Supercomputer Center

Performance Modeling and Characterization Lab

PMaC

Solving the Convolution Problem in
Performance Modeling

Allan Snavely, Laura Carrington, Mustafa Tikir

PMaC Lab

Roy Campbell, ARL

Tzu-Yi Chen, Pomona College




Some Questions


Do supercomputers double in speed every 18
months?


How can one meaningfully rank
supercomputers?


How can one reasonably procure
supercomputers?


How can one design supercomputers to run real
applications faster?


How well can simple benchmarks represent
the performance of real applications?


The Convolution Hypothesis


The performance of HPC applications on supercomputers can be explained by some combination of low-level benchmarks together with knowledge about the applications.


Note that a hypothesis is something that can be tested and could be true or false!


A Framework for Performance Modeling

Machine Profile: the rates at which a machine can perform different operations (collecting: rate of op1, op2, op3).

Application Signature: the operations needed to be carried out by the application (collecting: number of op1, op2, and op3).

Convolution: mapping of a machine's performance (rates) onto the application's needed operations:

Execution time = op1/rate(op1) '+' op2/rate(op2) '+' op3/rate(op3)

where the '+' operator could be + or MAX depending on operation overlap
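The execution-time formula above can be sketched in a few lines of Python. This is a minimal illustration, not PMaC's implementation; the operation counts and rates are invented placeholders.

```python
# Sketch of the convolution above. Each term is (operation count) / (rate);
# the combiner between terms is either plain addition (no overlap between
# the operations) or max (the operations overlap completely).

def convolve(counts, rates, combine):
    """Predict execution time from op counts and machine rates."""
    terms = [c / r for c, r in zip(counts, rates)]
    time = terms[0]
    for t in terms[1:]:
        time = combine(time, t)
    return time

counts = [2.0e9, 3.0e8, 1.0e8]   # ops needed: op1, op2, op3 (invented)
rates  = [5.0e9, 1.0e9, 2.0e9]   # ops/second the machine sustains (invented)

serial  = convolve(counts, rates, lambda a, b: a + b)  # no overlap: 0.75 s
overlap = convolve(counts, rates, max)                 # full overlap: 0.40 s
```

Real convolvers mix the two operators per operation pair, depending on which operations the hardware can overlap.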


Example: Convolving MEMbench Data with MetaSim Tracer Data


A memory bandwidth benchmark measures memory rates (MB/s) for different levels of cache, and tracing reveals different access patterns.

[Plot: memory bandwidth (MB/s) vs. message size (8-byte words) for IBM p655, SGI Altix, and IBM Opteron, showing stride-one access through the L1, L1/L2, L2/L3, and L3 cache/main memory regions.]


Formally, let P(m,n) = A(m,p) • M(p,n)


where the i's (rows) are applications and the j's (columns) are machines; an entry of P is the (real or predicted) runtime of application i on machine j;


the rows of A are applications and its columns are operation counts: read a row to get an application signature;


the rows of M are benchmark-measured bandwidths and its columns are machines: read a column to get a machine profile.

P is a matrix of runtimes, with entries

p_ij = Σ_{k=1}^{p} a_ik m_kj

[Matrix diagram: a 3×5 application-signature matrix A times a 5×3 machine-profile matrix M yields the 3×3 runtime matrix P.]
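The matrix form above is just a matrix product. Here is a toy NumPy sketch; all numbers are invented, and M is written as per-operation times (inverse rates) so the product comes out in seconds.

```python
import numpy as np

# Toy instance of P(m, n) = A(m, p) . M(p, n) (all numbers invented).
# Row i of A: operation counts for application i (its application signature).
# Column j of M: per-operation times on machine j (a machine profile as
# inverse rates), so each entry p_ij comes out as a runtime in seconds.
A = np.array([[2.0e9, 3.0e8],       # app 0: counts of op1, op2
              [5.0e8, 9.0e8]])      # app 1
M = np.array([[2.0e-10, 4.0e-10],   # op1: seconds/op on machine 0, machine 1
              [1.0e-9,  5.0e-10]])  # op2

P = A @ M                            # p_ij = sum_k a_ik * m_kj
# P[0, 0] = 2e9 * 2e-10 + 3e8 * 1e-9 = 0.7 seconds
```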


Investigating the Convolution Method


We have a multi-pronged investigation to find the limits of accuracy of this approach:


How accurately can P be measured directly?
"Real" runtimes can vary 10% or more.
Symbiotic job scheduling (Jon Weinberg).


How well can P be computed empirically as A • M?
We use a Linear Optimization approach (Roy) as well as a Least Squares Fit (Yi Chen).


How well can P be computed ab initio from trace data and judicious convolving? (Laura Carrington)


Can the ab initio approach guide the empirical (and vice-versa)?


How well can P be computed empirically as A • M?


De-convolution gives A = P/M.


The big picture: we are trying to discover whether any linear combination of simple benchmarks can represent an HPC application from a performance standpoint.


If YES, difficult full-application benchmarking can be replaced by easy simple benchmarks, and low-level performance characteristics of machines can be related to expected application performance.


Solving for A using Least Squares


Consider solving the matrix equality P = M A for A.


We can solve for each column of A individually (i.e. P_i = M A_i), given the assumption that the op counts of an application do not depend on other applications.


We compute op counts that minimize the 2-norm of the residual of P_i - M A_i (lsqnonneg in MATLAB).


Solving for A using LP


Ab Initio: MetaSim Tracer - Memory & FP trace with "processing" to get hit rates on the PREDICTED MACHINE

[Diagram: the address streams of CPU #1 through CPU #N running the application on a parallel machine feed a cache simulator configured with a user-specified memory structure (Power 4, Power 3, Alpha, Itanium), the PREDICTED MACHINE, producing expected cache hit rates for the application on that memory structure.]

The entire address stream is processed through the cache simulator. The final product is a table of average hit rates for each basic block of the entire application.

This requires less processing than a cycle-accurate simulator: it saves time and is still accurate enough for predictions.

From a sample application signature:

BB#202: 2.0E9, load, 99%, 100%, stride-one
BB#202: 1.9E3, FP
BB#303: 2.2E10, load, 52%, 63%, random
BB#303: 1.1E2, FP


Convolver rules trained by a human expert


Pick 40 loops from nature.


Measure performance on 5 machines.


Tune rules to predict performance accurately.


Predict 10,000 loops from 5 apps, 2 inputs, 3 CPU counts, 9 machines (270 predictions).


Rules training

[Plot: memory bandwidth (MB/s, 0 to 16000) vs. data set size (8-byte words, 1E+03 to 1E+07) for stride-one and random-stride access, with the L1, L2, and L3 cache regions marked; rule thresholds L2hr >= 100.00 and L3hr >= 99.95; average, L3 high, and L3 low curves shown.]

Scientific applications used in this study



AVUS was developed by the Air Force Research Laboratory (AFRL) to determine the fluid flow and turbulence of projectiles and air vehicles. Its standard test case calculates 400 time-steps of fluid flow and turbulence for a wing, flap, and end plates using 7 million cells. Its large test case calculates 150 time-steps of fluid flow and turbulence for an unmanned aerial vehicle using 24 million cells.


The Naval Research Laboratory (NRL), Los Alamos National Laboratory (LANL), and the University of Miami developed HYCOM as an upgrade to MICOM (both well-known ocean modeling codes) by enhancing the vertical layer definitions within the model to better capture the underlying science. HYCOM's standard test case models all of the world's oceans as one global body of water at a resolution of one-fourth of a degree when measured at the Equator.


OVERFLOW-2 was developed by NASA Langley and NASA Ames to solve CFD equations on a set of overlapping, adaptive grids, such that the grid resolution near an obstacle is higher than that of other portions of the scene. This approach allows computation of both laminar and turbulent fluid flows over geometrically complex, non-stationary boundaries. The standard test case of OVERFLOW-2 models fluid flowing over five spheres of equal radius and calculates 600 time-steps using 30 million grid points.


More scientific apps used in this study


Sandia National Laboratories (SNL) developed CTH to model complex multidimensional, multiple-material scenarios involving large deformations or strong shock physics. RFCTH is a non-export-controlled version of CTH. The standard test case of RFCTH models a ten-material rod impacting an eight-material plate at an oblique angle, using adaptive mesh refinement with five levels of enhancement.



The WRF model is being developed as a collaborative effort among the NCAR Mesoscale and Microscale Meteorology Division (MMM), NCEP's Environmental Modeling Center (EMC), FSL's Forecast Research Division (FRD), the DoD Air Force Weather Agency (AFWA), the Center for the Analysis and Prediction of Storms (CAPS) at the University of Oklahoma, and the Federal Aviation Administration (FAA), along with the participation of a number of university scientists. Primary funding for MMM participation in WRF is provided by the NSF/USWRP, AFWA, FAA, and the DoD High Performance Modernization Office. With this model, we seek to improve the forecast accuracy of significant weather features across scales ranging from cloud to synoptic, with priority emphasis on horizontal grids of 1-10 kilometers.





Most recent results LS & LP methods


Most Recent Results: ab initio


How can one reasonably procure
supercomputers?


Assistance to DoD HPCMO, SDSC
Petascale, DOE NERSC procurements


Form performance models of strategic
applications, verify against existing HPC
assets, use to predict performance of
proposed systems


Of course performance is just one criterion (price, power, size, cooling, reliability, diversity, etc.).


Different machines are better at different things, and the space is complicated

[Radar charts of normalized machine capabilities (L1 bw (n), L1 bw (r), L2 bw (n), L2 bw (r), L3 bw (n), L3 bw (r), MM bw (n), MM bw (r), NW bw, NW lat, flops) for BENCHM_5_Opt1_2.2, BENCHM_Altix_1.5, and NAVO_P655_FED.]

What one needs is the performance sensitivities of applications: how much faster is my app for:

Case 1: Network latency / 2
Case 2: Network bandwidth * 2
Case 3: FLOPS * 2
Case 4: L1 BW * 2
Case 5: L1, L2 BW * 2
Case 6: L1, L2, L3 BW * 2
Case 7a: L1, L2, L3, MM BW * 2
Case 7b: L1, L2, L3, MM, on-node BW * 2
Case 8a: Just MM BW * 2
Case 8b: Just MM, on-node BW * 2
Case 9: I/O
Case 10: All but Network * 2
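Cases like these can be played through the simple execution-time model from earlier: perturb one machine rate, re-convolve, and report the speedup. This is an illustrative sketch only; the resource names, counts, and rates below are invented placeholders, not measured sensitivities.

```python
# Sensitivity sketch: double one rate at a time and see how much the
# (serial, non-overlapped) modeled execution time improves.

def exec_time(counts, rates):
    """Sum of per-resource times: counts[i] ops at rates[i] ops/second."""
    return sum(c / r for c, r in zip(counts, rates))

names  = ["NW lat", "NW bw", "flops", "MM bw"]
counts = [1.0e6, 2.0e9, 4.0e10, 8.0e9]   # ops the app needs (invented)
rates  = [2.0e6, 1.0e9, 1.0e10, 5.0e9]   # baseline machine rates (invented)

base = exec_time(counts, rates)
for i, name in enumerate(names):
    faster = list(rates)
    faster[i] *= 2                        # e.g. Case 3: FLOPS * 2
    print(f"{name}: speedup {base / exec_time(counts, faster):.3f}")
```

With these numbers the app is dominated by flops, so doubling FLOPS helps most and halving network latency barely registers, which is exactly the kind of ranking the cases above are meant to expose.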


Performance Sensitivities

[Bar charts of predicted speedups (0 to 8) under cases 1 through 10 for GYRO_0016 at 2X/4X/8X and AVUS_0064 at 2X/4X/8X.]

Pieces of the Performance Prediction Framework

Each model consists of:


Machine Profile: characterizations of the rates at which a machine can (or is projected to) carry out fundamental operations, abstracted from any particular application.


Application Signature: detailed summaries of the fundamental operations to be carried out by the application, independent of any particular machine.


Combine the Machine Profile and Application Signature using:


Convolution Methods: algebraic mappings of the Application Signatures onto the Machine Profiles to arrive at a performance prediction.


Pieces of the Performance Prediction Framework

Performance prediction of Application B on Machine A (parallel processor prediction):

Single-Processor Model
Machine Profile (Machine A): characterization of the memory performance capabilities of Machine A.
Application Signature (Application B): characterization of the memory operations needed to be performed by Application B.
Convolution Method: mapping the memory usage needs of Application B to the capabilities of Machine A.

Exe. time = memory ops / mem. rate '+' FP ops / FP rate

Communication Model
Machine Profile (Machine A): characterization of the network performance capabilities of Machine A.
Application Signature (Application B): characterization of the network operations needed to be performed by Application B.
Convolution Method: mapping the network usage needs of Application B to the capabilities of Machine A.

Exe. time = comm. op1 / op1 rate '+' comm. op2 / op2 rate


Machine Profile


Single-processor model: collecting rates for memory operations and FP operations.


Tables of a machine's performance/rates for different operations, collected via benchmarks (MAPS data). Sample:

Machine Operation | Performance Rate
Memory load from L1 cache, stride-one | 5000 MB/s, 99%, 100%
Memory load from Main Memory, random | 300 MB/s, 66%, 78%
Floating-point op | 2000 MFLOPs (currently set to theoretical peak)


Application Signature*

Single-processor model: collecting the type and number of memory and FP operations, then simulating them in a cache simulator.


Trace of operations performed on the processor by an application (memory and FP ops). Sample, where the format is

Basic-block #: # memory refs, type, hit rates, access stride

BB#202: 2.0E9, load, 99%, 100%, stride-one
BB#202: 1.9E3, FP
BB#303: 2.2E10, load, 52%, 63%, random
BB#303: 1.1E2, FP


The trace of the application is collected and processed by the MetaSim Tracer, giving cache hit rates on the PREDICTED MACHINE for each basic block of the application.


This additional information requires "processing" by the MetaSim Tracer, not just straight memory tracing, hence the combination of the application signature and convolution components.
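The trace lines shown above have a simple textual shape, so a small parser conveys the format at a glance. This is an illustrative sketch based only on the sample lines in this slide, not the real MetaSim trace-format definition.

```python
import re

# Minimal parser for basic-block trace lines of the form shown above:
#   "BB#<id>: <count>, <type>[, <L1 hit>%, <L2 hit>%, <access pattern>]"
# The hit-rate/stride fields are optional (FP lines omit them).
LINE = re.compile(
    r"BB#(?P<bb>\d+):\s*(?P<count>[\d.E+]+),\s*(?P<kind>\w+)"
    r"(?:,\s*(?P<l1>\d+)%,\s*(?P<l2>\d+)%,\s*(?P<stride>[\w-]+))?"
)

def parse(line):
    """Return a dict of the fields in one trace line."""
    m = LINE.match(line)
    d = m.groupdict()
    d["count"] = float(d["count"])  # e.g. "2.0E9" -> 2.0e9 references
    return d

rec = parse("BB#202: 2.0E9, load, 99%, 100%, stride-one")
fp  = parse("BB#303: 1.1E2, FP")
```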


How can one meaningfully rank supercomputers?


Thresholded Inversions is a metric for evaluating rankings.


Basically, when a machine higher on the list runs an application slower than a machine lower on the list, that is an inversion.


We showed the Top500 list is rife with such inversions: 78% suboptimal compared to...


A best list obtainable by brute force for any set of applications.


We used the framework to approach the quality of the best list by combining the simple HPC Challenge benchmarks Random Access, STREAM, and HPL as guided by application traces.