San Diego Supercomputer Center
Performance Modeling and Characterization Lab
PMaC
Solving the Convolution Problem in
Performance Modeling
Allan Snavely, Laura Carrington, Mustafa Tikir
PMaC Lab
Roy Campbell, ARL
Tzu-Yi Chen, Pomona College
Some Questions
Do supercomputers double in speed every 18 months?
How can one meaningfully rank supercomputers?
How can one reasonably procure supercomputers?
How can one design supercomputers to run real applications faster?
How well can simple benchmarks represent the performance of real applications?
The Convolution Hypothesis
The performance of HPC applications on supercomputers can be explained by some combination of low-level benchmarks combined with knowledge about the applications
– Note that a hypothesis is something that can be tested and could be true or false!
A Framework for Performance Modeling
Machine Profile: Rate at which a machine can perform different operations (collecting: rate op1, op2, op3)
Application Signature: Operations needed to be carried out by the application (collecting: number of op1, op2, and op3)
Convolution: Mapping of a machine's performance (rates) to an application's needed operations
Execution time = (operation1 / rate op1) '+' (operation2 / rate op2) '+' (operation3 / rate op3)
where the '+' operator could be + or MAX depending on operation overlap
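The execution-time formula above can be sketched in a few lines of Python. This is only an illustration of the '+' versus MAX choice; the function name `predict_time`, the `overlap` flag, and all numbers are hypothetical and not taken from the PMaC tools.

```python
def predict_time(op_counts, op_rates, overlap=False):
    """Combine per-operation times with '+' (no overlap) or MAX (full overlap)."""
    # Time spent on each operation type = number of operations / rate.
    times = [count / rate for count, rate in zip(op_counts, op_rates)]
    return max(times) if overlap else sum(times)


# Hypothetical application signature (operation counts) and
# machine profile (operations per second) for op1, op2, op3.
counts = [2.0e9, 1.0e9, 4.0e8]
rates = [1.0e9, 5.0e8, 4.0e8]

serial_time = predict_time(counts, rates)                    # '+' as addition
overlapped_time = predict_time(counts, rates, overlap=True)  # '+' as MAX
print(serial_time, overlapped_time)
```

The two results bracket the prediction: addition assumes no overlap between operation types, MAX assumes perfect overlap.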
Example: Convolving MEMbench data with MetaSim Tracer Data
Memory bandwidth benchmark measures memory rates (MB/s) for different levels of cache, and tracing reveals different access patterns
[Figure: memory bandwidth (MB/s, 0 to 1.4E+04) versus message size (8-byte words, 1.E+03 to 1.E+09) for IBM p655, SGI Altix, and IBM Opteron. Annotated regions: stride-one access in L1 cache; stride-one access in L1/L2 cache; stride-one access in L2/L3 cache; stride-one access in L3 cache/main memory.]
Formally, let P(m,n) = A(m,p) • M(p,n)
where the i's (rows) of P are applications and the j's (columns) are machines; an entry of P is the (real or predicted) runtime of application i on machine j
the rows of A are applications, columns are operation counts; read a row to get an application signature
the rows of M are benchmark-measured bandwidths, columns are machines; read a column to get a machine profile
P is a matrix of runtimes:
\[ p_{ij} = \sum_{k=1}^{p} a_{ik} m_{kj} \]
\[
\begin{pmatrix}
p_{11} & p_{12} & p_{13} \\
p_{21} & p_{22} & p_{23} \\
p_{31} & p_{32} & p_{33}
\end{pmatrix}
=
\begin{pmatrix}
a_{11} & a_{12} & a_{13} & a_{14} & a_{15} \\
a_{21} & a_{22} & a_{23} & a_{24} & a_{25} \\
a_{31} & a_{32} & a_{33} & a_{34} & a_{35}
\end{pmatrix}
\begin{pmatrix}
m_{11} & m_{12} & m_{13} \\
m_{21} & m_{22} & m_{23} \\
m_{31} & m_{32} & m_{33} \\
m_{41} & m_{42} & m_{43} \\
m_{51} & m_{52} & m_{53}
\end{pmatrix}
\]
\[ p_{32} = a_{31} m_{12} + a_{32} m_{22} + a_{33} m_{32} + a_{34} m_{42} + a_{35} m_{52} \]
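As a small worked instance of P = A • M, the sketch below multiplies a 3 x 5 matrix of operation counts by a 5 x 3 machine matrix. All numbers are made up. One assumption is labeled in the code: for the product to come out in seconds, the entries of M are stored as seconds per operation, i.e. the reciprocal of a measured rate. Plain lists are used so the example needs no libraries.

```python
def convolve(A, M):
    """Return P = A x M for list-of-lists matrices: p_ij = sum_k a_ik * m_kj."""
    inner = len(M)
    cols = len(M[0])
    return [[sum(row[k] * M[k][j] for k in range(inner)) for j in range(cols)]
            for row in A]


# Hypothetical application signatures: 3 applications x 5 operation types.
A = [
    [4e9, 0.0, 0.0, 0.0, 2e9],
    [1e9, 2e9, 0.0, 1e9, 0.0],
    [0.0, 1e9, 3e9, 0.0, 1e9],
]

# Hypothetical measured rates (operations per second):
# 5 operation types (rows) x 3 machines (columns).
rates = [
    [2e9, 4e9, 1e9],
    [1e9, 2e9, 1e9],
    [1e9, 1e9, 5e8],
    [1e9, 5e8, 5e8],
    [1e9, 2e9, 2e9],
]

# Assumption: M holds seconds per operation (reciprocal rates).
M = [[1.0 / r for r in row] for row in rates]

P = convolve(A, M)  # P[i][j] = predicted runtime of application i on machine j
for row in P:
    print(row)
```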
Investigating the Convolution Method
We have a multi-pronged investigation to find the limits of accuracy of this approach
– How accurately can P be measured directly?
• "Real" runtimes can vary 10% or more
• Symbiotic job scheduling (Jon Weinberg)
– How well can P be computed empirically as A • M?
• We use a Linear Optimization approach (Roy) as well as a Least Squares Fit (Yi Chen)
– How well can P be computed ab initio from trace data and judicious convolving? (Laura Carrington)
• Can the ab initio approach guide the empirical (and vice versa)?
How well can P be computed empirically as A • M?
De-convolution gives A = P/M
The big picture:
– We are trying to discover if any linear combination of simple benchmarks can represent an HPC application from a performance standpoint
– If YES, difficult full-app benchmarking can be replaced by easy simple benchmarks, and low-level performance characteristics of machines can be related to expected application performance
Solving for A using Least Squares
Consider solving the matrix equality P = M A for A (written here in transposed form relative to P = A • M, so each column of A holds one application's op counts)
– We can solve for each column of A individually (i.e. P_i = M A_i), given the assumption that the op counts of an application do not depend on other applications
– We compute op counts that minimize the 2-norm of the residual P_i – M A_i
– Non-negative least squares (lsqnonneg in Matlab)
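A minimal sketch of the per-application solve, using SciPy's `scipy.optimize.nnls` as a stand-in for the Matlab routine. The data are synthetic. The scaling is an assumption chosen to keep the problem well conditioned: op counts are in units of 1e9 operations, and M is in seconds per 1e9 operations.

```python
import numpy as np
from scipy.optimize import nnls

# Hypothetical machine matrix: 5 machines (rows) x 3 operation types (columns),
# in seconds per 1e9 operations (assumption: reciprocal of benchmark rates).
M = np.array([
    [0.5,  1.0, 2.0],
    [0.25, 2.0, 1.0],
    [1.0,  0.5, 4.0],
    [0.5,  0.5, 0.5],
    [2.0,  1.0, 1.0],
])

# Synthetic "true" op counts for one application, in units of 1e9 operations.
a_true = np.array([2.0, 0.0, 0.5])

# Runtimes of that application on the 5 machines (one column P_i of P).
p_i = M @ a_true

# Non-negative least squares: minimize ||p_i - M a||_2 subject to a >= 0.
a_est, residual_norm = nnls(M, p_i)
print(a_est, residual_norm)
```

With measured runtimes in place of the synthetic `p_i`, the residual norm indicates how well a linear combination of the benchmarks explains that application.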
Solving for A using LP
Ab Initio: MetaSim Tracer

Memory & FP trace with "processing" to get hit rates on PREDICTED MACHINE
[Diagram: on the parallel machine, the running application on CPU #1, CPU #2, ..., CPU #N produces an address stream (collecting trace). The address stream is fed to a cache simulator configured with the user-specified memory structure of the PREDICTED MACHINE (Power 4, Power 3, Alpha, Itanium) (processing trace). The output is the expected cache hit rates for the application on the user-specified memory structure.]
Entire address stream is processed through cache simulator.
Final product is a table of average hit rates for each basic block of the entire application.
Less processing than a cycle-accurate simulator: saves time and is still accurate enough for predictions.
From sample application signature:
BB#202: 2.0E9, load, 99%, 100%, stride-one
BB#202: 1.9E3, FP
BB#303: 2.2E10, load, 52%, 63%, random
BB#303: 1.1E2, FP
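To make the "address stream through a cache simulator" step concrete, here is a toy sketch that produces a per-basic-block hit-rate table. It is not the MetaSim Tracer. It assumes a single-level, direct-mapped cache, and the function name `simulate_hit_rates` and its parameters are hypothetical.

```python
from collections import defaultdict


def simulate_hit_rates(trace, num_lines=4, line_bytes=64):
    """Return {basic_block_id: hit rate} for a toy direct-mapped cache.

    trace is an iterable of (basic_block_id, byte_address) pairs.
    """
    cache = [None] * num_lines        # which memory line each cache slot holds
    hits = defaultdict(int)
    refs = defaultdict(int)
    for block_id, address in trace:
        line = address // line_bytes  # memory line containing this address
        slot = line % num_lines       # direct-mapped placement
        refs[block_id] += 1
        if cache[slot] == line:
            hits[block_id] += 1
        else:
            cache[slot] = line        # miss: fill the slot
    return {block_id: hits[block_id] / refs[block_id] for block_id in refs}


# Hypothetical traces: block 202 does stride-one 8-byte loads,
# block 303 jumps 4096 bytes per access and never reuses a cache line.
stride_one = [(202, 8 * i) for i in range(64)]
large_stride = [(303, 4096 * i) for i in range(64)]

hit_rates = simulate_hit_rates(stride_one + large_stride)
print(hit_rates)
```

In this toy setup the stride-one block hits on seven of every eight references, while the large-stride block misses on every reference.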
Convolver rules trained by a human expert
Pick 40 loops from nature
Measure performance on 5 machines
Tune rules to predict performance accurately
Predict 10,000 loops from 5 apps, 2 inputs, 3 cpu counts, 9 machines (270 predictions)
Rules training
[Figure: memory bandwidth (MB/s, 0 to 16000) versus data set size (8-byte words, 1.E+03 to 1.E+07) for stride-one and random-stride access, with the L1, L2, and L3 cache regions marked. Annotations: L2hr >= 100.00, L3hr >= 99.95; Average, L3 High, L3 Low.]
Scientific applications used in this study
AVUS was developed by the Air Force Research Laboratory (AFRL) to determine the fluid flow and turbulence of projectiles and air vehicles. Its standard test case calculates 400 time-steps of fluid flow and turbulence for a wing, flap, and end plates using 7 million cells. Its large test case calculates 150 time-steps of fluid flow and turbulence for an unmanned aerial vehicle using 24 million cells.
The Naval Research Laboratory (NRL), Los Alamos National Laboratory (LANL), and the University of Miami developed HYCOM as an upgrade to MICOM (both well-known ocean modeling codes) by enhancing the vertical layer definitions within the model to better capture the underlying science. HYCOM's standard test case models all of the world's oceans as one global body of water at a resolution of one-fourth of a degree when measured at the Equator.
OVERFLOW-2 was developed by NASA Langley and NASA Ames to solve CFD equations on a set of overlapping, adaptive grids, such that the grid resolution near an obstacle is higher than that of other portions of the scene. This approach allows computation of both laminar and turbulent fluid flows over geometrically complex, non-stationary boundaries. The standard test case of OVERFLOW-2 models fluid flowing over five spheres of equal radius and calculates 600 time-steps using 30 million grid points.
More scientific apps used in this study
Sandia National Laboratories (SNL) developed CTH to model complex multidimensional, multiple-material scenarios involving large deformations or strong shock physics. RFCTH is a non-export-controlled version of CTH. The standard test case of RFCTH models a ten-material rod impacting an eight-material plate at an oblique angle, using adaptive mesh refinement with five levels of enhancement.
The WRF model is being developed as a collaborative effort among the NCAR Mesoscale and Microscale Meteorology Division (MMM), NCEP's Environmental Modeling Center (EMC), FSL's Forecast Research Division (FRD), the DoD Air Force Weather Agency (AFWA), the Center for the Analysis and Prediction of Storms (CAPS) at the University of Oklahoma, and the Federal Aviation Administration (FAA), along with the participation of a number of university scientists. Primary funding for MMM participation in WRF is provided by the NSF/USWRP, AFWA, FAA and the DoD High Performance Modernization Office. With this model, we seek to improve the forecast accuracy of significant weather features across scales ranging from cloud to synoptic, with priority emphasis on horizontal grids of 1-10 kilometers.
Most recent results LS & LP methods
Most Recent Results: ab initio
How can one reasonably procure supercomputers?
Assistance to DoD HPCMO, SDSC Petascale, DOE NERSC procurements
Form performance models of strategic applications, verify against existing HPC assets, use to predict performance of proposed systems
Of course performance is just one criterion (price, power, size, cooling, reliability, diversity, etc.)
Different machines are better at different things and the space is complicated
[Figure: three charts of machine characteristics on a 0 to 1 scale, one each for BENCHM_5_Opt1_2.2, BENCHM_Altix_1.5, and NAVO_P655_FED. Axes: L1 bw (n), L1 bw (r), L2 bw (n), L2 bw (r), L3 bw (n), L3 bw (r), MM bw (n), MM bw (r), NW bw, NW lat, flops.]
What one needs is performance sensitivities of applications: how much faster is my app for:
Case 1: Network latency / 2
Case 2: Network bandwidth * 2
Case 3: FLOPS * 2
Case 4: L1 BW * 2
Case 5: L1, L2 BW * 2
Case 6: L1, L2, L3 BW * 2
Case 7a: L1, L2, L3, MM BW * 2
Case 7b: L1, L2, L3, MM, on-node BW * 2
Case 8a: Just MM BW * 2
Case 8b: Just MM, on-node BW * 2
Case 9: I/O
Case 10: All but Network * 2
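A sensitivity case of this kind can be evaluated from a performance model by scaling selected machine rates and re-predicting the runtime. The sketch below assumes the simplest additive model (no overlap between resources). The resource names, counts, and rates are hypothetical, not data from the study.

```python
def runtime(counts, rates):
    """Additive model: sum over resources of (work / rate)."""
    return sum(counts[name] / rates[name] for name in counts)


def speedup(counts, rates, scale):
    """Predicted speedup when each rate in `scale` is multiplied by its factor."""
    new_rates = {name: rate * scale.get(name, 1.0) for name, rate in rates.items()}
    return runtime(counts, rates) / runtime(counts, new_rates)


# Hypothetical application signature (units of work per resource)
# and machine profile (units of work per second).
counts = {"L1": 8.0, "MM": 4.0, "flops": 2.0, "net_bw": 2.0}
rates = {"L1": 4.0, "MM": 1.0, "flops": 2.0, "net_bw": 1.0}

case3 = speedup(counts, rates, {"flops": 2.0})   # FLOPS * 2
case8a = speedup(counts, rates, {"MM": 2.0})     # just MM BW * 2
case2 = speedup(counts, rates, {"net_bw": 2.0})  # network bandwidth * 2
print(case3, case8a, case2)
```

In this made-up example the application is most sensitive to main-memory bandwidth, since that resource accounts for the largest share of the modeled runtime.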
Performance Sensitivities
[Figure: two charts of performance sensitivity across case1, case2, case3, case4, case5, case6, case7a, case7b, case8a, case8b, case9, and case10. First chart (scale 0 to 8): series GYRO_0016_2X, GYRO_0016_4X, GYRO_0016_8X. Second chart (scale 0 to 6): series AVUS_0064_2X, AVUS_0064_4X, AVUS_0064_8X.]
Pieces of Performance Prediction Framework
Each model consists of:
Machine Profile: characterizations of the rates at which a machine can (or is projected to) carry out fundamental operations, abstract from the particular application.
Application Signature: detailed summaries of the fundamental operations to be carried out by the application, independent of any particular machine.
Combine Machine Profile and Application Signature using:
Convolution Methods: algebraic mappings of the Application Signatures on to the Machine Profiles to arrive at a performance prediction.
Pieces of Performance Prediction Framework
Performance prediction of Application B on Machine A
Parallel Processor Prediction
Single-Processor Model
Machine Profile (Machine A): Characterization of memory performance capabilities of Machine A
Application Signature (Application B): Characterization of memory operations needed to be performed by Application B
Convolution Method: Mapping memory usage needs of Application B to the capabilities of Machine A
Exe. time = (Memory op / Mem. rate) • (FP op / FP rate)
Communication Model
Machine Profile (Machine A): Characterization of network performance capabilities of Machine A
Application Signature (Application B): Characterization of network operations needed to be performed by Application B
Convolution Method: Mapping network usage needs of Application B to the capabilities of Machine A
Exe. time = (comm. op1 / op1 rate) • (comm. op2 / op2 rate)
Machine Profile – Single processor model
collecting rates for Memory operations and FP operations
Tables of a machine's performance/rates for different operations collected via benchmarks.
Sample:
Machine Operation                         Performance Rate
Memory load from L1 cache, stride-one     5000 MB/s, 99%, 100%
Memory load from Main Memory, random      300 MB/s, 66%, 78%
Floating-point op                         2000 MFLOPs
[Callouts on the table: "MAPS data"; "Currently set to theoretical peak"]
Application Signature* – Single processor model
collecting type and number of Memory and FP operations, then simulating in cache simulator
Trace of operations on the processor performed by an application (memory and FP ops on processor).
Sample:
BB#202: 2.0E9, load, 99%, 100%, stride-one
BB#202: 1.9E3, FP
BB#303: 2.2E10, load, 52%, 63%, random
BB#303: 1.1E2, FP
Where the format is as follows:
Basic-block #: # memory ref., type, hit rates, access stride
• Trace of application is collected and processed by the MetaSim Tracer.
• The hit rates are the cache hit rates for the PREDICTED MACHINE for each basic block of the application.
* This additional information requires "processing" by the MetaSim Tracer, not just straight memory tracing, hence the combination of the application signature and convolution components
How can one meaningfully rank supercomputers?
Thresholded Inversions is a metric for evaluating rankings
Basically, when a machine higher on the list runs an application slower than a machine lower on the list, that is an inversion
We showed the Top500 list is rife with such inversions, 78% suboptimal compared to…
A best list obtainable by brute force for any set of applications
We used the framework to approach the quality of the best list by combining these simple HPC Challenge Benchmarks (Random Access, STREAM, and HPL) as guided by application traces
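Counting inversions, and finding a best list by brute force, can be sketched as follows. The treatment of the threshold is an assumption: here an inversion counts only when the higher-ranked machine is slower by more than a fractional margin. Machine names and runtimes are invented.

```python
from itertools import permutations


def count_inversions(ranking, runtimes, threshold=0.0):
    """Count (application, machine pair) inversions for a ranking, best first.

    runtimes maps application -> {machine: seconds}. An inversion occurs when
    the higher-ranked machine is slower than the lower-ranked one by more
    than the fractional threshold (assumed interpretation).
    """
    inversions = 0
    for times in runtimes.values():
        for hi in range(len(ranking)):
            for lo in range(hi + 1, len(ranking)):
                if times[ranking[hi]] > (1.0 + threshold) * times[ranking[lo]]:
                    inversions += 1
    return inversions


# Hypothetical measured runtimes (seconds) of two applications on three machines.
runtimes = {
    "app1": {"X": 10.0, "Y": 12.0, "Z": 30.0},
    "app2": {"X": 20.0, "Y": 15.0, "Z": 16.0},
}

given_list = ["X", "Y", "Z"]
strict = count_inversions(given_list, runtimes)                   # threshold 0
tolerant = count_inversions(given_list, runtimes, threshold=0.3)  # 30% margin

# Brute force over all orderings; feasible only for a handful of machines.
best_list = min(permutations(given_list),
                key=lambda order: count_inversions(order, runtimes))
best_count = count_inversions(best_list, runtimes)
print(strict, tolerant, best_list, best_count)
```

Comparing a list's inversion count with `best_count` gives a measure of how far that list is from the best achievable ordering for the chosen applications.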