PERI Tiger Teams

FY07 Report

Performance Engineering Research Institute

October 30, 2007

Contact: Bronis R. de Supinski

bronis@llnl.gov



Tiger Team Process and Milestones


Process from Section 4.3 of Proposal:


Select one or two applications per year


Consult with Office of Science Program Managers


Teams consist of three to four PERI researchers


Milestones


Q1: Identify applications and teams for current year


Q1: Report on prior year’s teams


Q3: Report progress; reassign as per DOE needs

FY07 Selection Process


Delayed to allow completion of application survey


Received guidance to focus on 3 JOULE metric codes


S3D, GTC and Chimera


Initial discussions w/JOULE Metric coordinator K. Roche


SciDAC PI meeting in Atlanta in January of 2007


Strong interest from both S3D and GTC


Chimera expressed concerns over their staffing and needs


Narrowed to focus on S3D and GTC in early March

FY07 Tiger Team Formation


Solicited interest in participating on teams


Assignments made by PERI management based on:


Perceived code team needs


Prior engagement activities


Balance of expertise


Participants from six of nine PERI institutions


Also strong participation in both teams by Univ. of Oregon


Coordination


Team-specific mailing lists


Regular telecons

S3D Tiger Team


Team Lead: Bronis de Supinski (LLNL)


PERI Team Members


John Mellor-Crummey, Mike Fagan (Rice)


Nick Wright, Allan Snavely (SDSC)


David Bailey (LBNL)


Rich Vuduc (LLNL)


Affiliate Team Members


Sameer Shende, Alan Morris, Allen Malony, Kevin Huck (Oregon)


Jeff Larkin (Cray/ORNL)


Application Team Participants


Jackie Chen, David Lignell (SNL)


Facilitators


Kenny Roche, Pat Worley (ORNL)

S3D: Direct numerical simulation (DNS) of turbulent combustion


State-of-the-art code developed at CRF/Sandia

2007 INCITE award: 6M hours on XT3/4 at NCCS


Tier 1 pioneering application for 250TF system


Why DNS?


Study micro-physics of turbulent reacting flows


Full access to time resolved fields


Physical insight into chemistry-turbulence interactions


Develop & validate reduced model descriptions used

in macro
-
scale simulations of engineering
-
level systems

DNS

Physical

Models

Engineering

CFD codes

(RANS, LES)

Text and figures courtesy of S3D PI, Jacqueline H. Chen, SNL

Text and figures courtesy of S3D PI, Jacqueline H. Chen, SNL

S3D: DNS Solver


Solves compressible reacting Navier-Stokes equations

High fidelity numerical methods

8th order finite-difference


4th order explicit RK integrator


Hierarchy of molecular transport models


Detailed chemistry


Multiphysics (sprays, radiation & soot)


From SciDAC-TSTC (Terascale Simulation of Combustion)

S3D Parallelization


Fortran90 + MPI


3D domain decomposition


each MPI process manages part of the domain


All processes have same number of grid points & same computational load

Inter-processor communication only between nearest neighbors in 3D mesh

large messages; non-blocking sends & receives

All-to-all communication only required for monitoring & synchronization ahead of I/O


Communication / Computation ~ kN^2 / kN^3 = 1/N

Figure: S3D logical topology

Text courtesy of S3D PI, Jacqueline H. Chen, SNL
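The pattern above can be sketched with a few MPI calls. This is a minimal illustration under assumed names and sizes (field, face_in, face_out, n = 32), not the S3D source: each process in a 3D Cartesian decomposition posts non-blocking receives and sends for its six nearest-neighbor faces, then waits on all of them, which leaves room to overlap interior computation with communication.

program halo_sketch
  use mpi
  implicit none
  integer, parameter :: n = 32                     ! local grid points per direction (assumed)
  integer :: comm3d, rank, nprocs, ierr, dim, lo, hi, nreq
  integer :: dims(3), reqs(12)
  logical :: periodic(3)
  real(8) :: field(n,n,n), face_in(n,n,2,3), face_out(n,n,2,3)

  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
  dims = 0;  periodic = .false.
  call MPI_Dims_create(nprocs, 3, dims, ierr)
  call MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periodic, .true., comm3d, ierr)
  call MPI_Comm_rank(comm3d, rank, ierr)
  field = real(rank, 8)

  nreq = 0
  do dim = 1, 3
     ! nearest neighbors in this direction (MPI_PROC_NULL at the domain edges)
     call MPI_Cart_shift(comm3d, dim-1, 1, lo, hi, ierr)
     ! post non-blocking receives for the two incoming faces
     call MPI_Irecv(face_in(:,:,1,dim), n*n, MPI_REAL8, lo, dim, comm3d, reqs(nreq+1), ierr)
     call MPI_Irecv(face_in(:,:,2,dim), n*n, MPI_REAL8, hi, dim, comm3d, reqs(nreq+2), ierr)
     ! pack and send the two outgoing faces (large, contiguous messages)
     face_out(:,:,1,dim) = face(field, dim, 1)
     face_out(:,:,2,dim) = face(field, dim, n)
     call MPI_Isend(face_out(:,:,1,dim), n*n, MPI_REAL8, lo, dim, comm3d, reqs(nreq+3), ierr)
     call MPI_Isend(face_out(:,:,2,dim), n*n, MPI_REAL8, hi, dim, comm3d, reqs(nreq+4), ierr)
     nreq = nreq + 4
  end do
  ! interior computation could overlap with communication here
  call MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE, ierr)
  call MPI_Finalize(ierr)

contains

  function face(f, dim, idx)            ! copy one face of the local block
    real(8), intent(in) :: f(:,:,:)
    integer, intent(in) :: dim, idx
    real(8) :: face(n,n)
    select case (dim)
    case (1); face = f(idx,:,:)
    case (2); face = f(:,idx,:)
    case (3); face = f(:,:,idx)
    end select
  end function face

end program halo_sketch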

A Performance Mystery in S3D on PWR4 (SDSC)


The following line of code (and many similar others) has a ~70% L1 hit rate:

diffFlux(:,:,:,n,m) = - Ds_mixavg(:,:,:,n) * ( grad_Ys(:,:,:,n,m) + Ys(:,:,:,n) * grad_mixMW(:,:,:,m) )





Total L2 data cache accesses:          9784.594 M
% accesses from L2 per cycle:          5.112 %
L2 traffic:                            1194408.401 MBytes
L2 bandwidth per processor:            9183.869 MBytes/sec
Total load and store operations:       33073.374 M
Number of loads per load miss:         30.527
Number of stores per store miss:       1.014
Number of load/stores per D1 miss:     3.380
L1 cache hit rate:                     70.415 %



A performance model provides an expectation of 90%...

Discrepancy Understood, Performance Optimized


diffFlux is defined as a pointer: “diffFlux => grad_Ys”


Compiler unrolls the loop suboptimally


Loops over the 2nd index instead of the 1st

i.e., it accesses memory in "nx-size" strides


Alias analysis not sufficient to allow “obvious” optimization


Simple fix on IBM systems

Use "-qalias=noaryovrlp" compiler flag on IBM


Runtime on 8 PWR4+ 1.5 GHz CPUs, 200 timesteps


2949 s (before), 2728 s (after)


7.5% improvement, and L1 hit rates are now what they should be


Same loops show expected ~93 % L1 hit rate on XT3/4
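A minimal sketch of the aliasing situation just described (illustrative declarations and sizes, not the S3D source): because the left-hand-side pointer may alias an array on the right-hand side, the compiler cannot prove the array assignment is overlap-free and may fall back to a conservative, non-unit-stride loop ordering unless a flag such as -qalias=noaryovrlp asserts otherwise.

program alias_sketch
  implicit none
  integer, parameter :: nx = 16, ny = 16, nz = 16, n_spec = 4
  real(8), target  :: grad_Ys(nx,ny,nz,n_spec,3)
  real(8), pointer :: diffFlux(:,:,:,:,:)
  real(8) :: Ds_mixavg(nx,ny,nz,n_spec), Ys(nx,ny,nz,n_spec), grad_mixMW(nx,ny,nz,3)
  integer :: n, m

  call random_number(grad_Ys);  call random_number(Ds_mixavg)
  call random_number(Ys);       call random_number(grad_mixMW)

  diffFlux => grad_Ys      ! the pointer association that defeats alias analysis
  n = 1;  m = 1

  ! Same shape of statement as the S3D line quoted earlier; the compiler must
  ! assume diffFlux and grad_Ys may overlap, which can lead it to traverse a
  ! non-unit-stride index innermost.
  diffFlux(:,:,:,n,m) = - Ds_mixavg(:,:,:,n) * ( grad_Ys(:,:,:,n,m) &
                        + Ys(:,:,:,n) * grad_mixMW(:,:,:,m) )

  print *, 'diffFlux(1,1,1,n,m) =', diffFlux(1,1,1,n,m)
end program alias_sketch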


Substantial time in exp in the getrates routine


Power4+ profiles


Code examination revealed calls were not vectorizable


Perl script transformed the exp calls from 0% to 50% vectorized


Substantial Power4+ performance improvement


30% for getrates routine


Approximately 10% overall


Smaller performance improvement on XT4


Approximately 10% for getrates routine


Approximately 1.5% overall


Subject of continuing tuning effort (D. Bailey)

Vectorizing exp for S3D (SDSC)

# of CPU cycles       PWR4+ / 1.5 GHz     Opteron / 2.6 GHz
exp                   160                 49
Vectorised exp        8                   31
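A generic sketch of the kind of transformation described above (not the getrates source; the rate arrays a, beta, and ea are hypothetical): instead of calling exp on one scalar at a time inside the reaction loop, the arguments are gathered into a temporary array and exp is evaluated over the whole array, a form that compilers and vendor vector math libraries can vectorize.

program vexp_sketch
  implicit none
  integer, parameter :: nreac = 64
  real(8) :: a(nreac), beta(nreac), ea(nreac), k_scalar(nreac), k_vector(nreac), arg(nreac)
  real(8) :: t
  integer :: i

  call random_number(a);  call random_number(beta);  call random_number(ea)
  t = 1500.0d0

  ! Before: one scalar exp call per reaction, buried in the loop body
  do i = 1, nreac
     k_scalar(i) = a(i) * t**beta(i) * exp(-ea(i) / t)
  end do

  ! After: build the argument vector once, then evaluate exp over the array
  arg = -ea / t
  k_vector = a * t**beta * exp(arg)

  print *, 'max difference:', maxval(abs(k_scalar - k_vector))
end program vexp_sketch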

S3D Performance at the Loop Level (Rice)

Wasted opportunity = (maximum FLOP rate * cycles - actual FLOPs) / total waste

The highlighted loop accounts for 11.4% of total program waste.

Overall performance (15% of peak): 2.05 x 10^11 FLOPs / 6.73 x 10^11 cycles = 0.305 FLOPs/cycle

S3D: What Opportunities Exist?

Figure: annotated diffusive flux computation, a 5D loop nest (2D explicit loops over 3D F90 vector syntax); annotations mark the data streams in/out of memory for the initialize, update, and reuse accesses, with limited reuse flagged as the performance problem.

Apply LoopTool to S3D Diffusive Flux Loop

!dir$ uj 3
      do m=1,3                        ! DIRECTION
!dir$ uj 2
        do n=1,n_spec-1               ! SPECIES

!dir$ unswitch 2
          if (baro_switch) then
            ! driving force includes gradient in mole fraction and baro-diffusion:
!dir$ fuse 1 1 1
            diffFlux(:,:,:,n,m) = - Ds_mixavg(:,:,:,n) * ( grad_Ys(:,:,:,n,m)   &
                + Ys(:,:,:,n) * ( grad_mixMW(:,:,:,m)                           &
                + (1 - molwt(n)*avmolwt) * grad_P(:,:,:,m)/Press))
          else
            ! driving force is just the gradient in mole fraction:
!dir$ fuse 1 1 1
            diffFlux(:,:,:,n,m) = - Ds_mixavg(:,:,:,n) * ( grad_Ys(:,:,:,n,m)   &
                + Ys(:,:,:,n) * grad_mixMW(:,:,:,m) )
          endif

          ! Add thermal diffusion:
!dir$ unswitch 2
          if (thermDiff_switch) then
!dir$ fuse 1 1 1
            diffFlux(:,:,:,n,m) = diffFlux(:,:,:,n,m) - Ds_mixavg(:,:,:,n) *    &
                Rs_therm_diff(:,:,:,n) * molwt(n) * avmolwt * grad_T(:,:,:,m) / Temp
          endif

          ! compute contribution to nth species diffusive flux
          ! this will ensure that the sum of the diffusive fluxes is zero.
!dir$ fuse 1 1 1
          diffFlux(:,:,:,n_spec,m) = diffFlux(:,:,:,n_spec,m) - diffFlux(:,:,:,n,m)

        enddo   ! SPECIES
      enddo     ! DIRECTION

(The directives in the listing control loop unswitching, controlled fusion, and unroll-and-jam.)

Figure: structure of the original loop nest: loops over m=1,3 and n=1,n_spec-1 enclosing the baro_switch (BS) and thermDiff_switch (TD) conditionals.

LoopTool Optimization of S3D Diffusive Flux Loop

Transformation Log:


Scalarization (4 stmts)


Loop unswitching (2 conditions)


Fusion (loops within 4 outer nests)


Unroll-and-jam (2 loops)


Peeling excess iterations (4 nests)

2.94x faster than original

(6.7% total savings)

Figure: structure of the transformed loop nest: the BS and TD conditionals are unswitched out of the loops, producing four specialized nests over n=1,nspec-2,2 (the species loop unrolled by 2); the loop grows from 35 lines to 445 lines.
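The unroll-and-jam step from the log above, shown on a generic loop nest (an illustration, not S3D code): the outer loop is unrolled by two and the copies of the inner loop are jammed together, so each sweep over x serves two columns of a and halves the passes over x; any excess iterations would be peeled, as in the log.

program unroll_and_jam_sketch
  implicit none
  integer, parameter :: n = 1024, m = 64          ! m assumed even for brevity
  real(8) :: a(n,m), x(n), y(m), y2(m)
  integer :: i, j

  call random_number(a);  call random_number(x)

  ! Original loop nest: x is streamed through cache once per j iteration
  y = 0.0d0
  do j = 1, m
     do i = 1, n
        y(j) = y(j) + a(i,j) * x(i)
     end do
  end do

  ! After unroll-and-jam of the j loop by 2: each load of x(i) serves two columns
  y2 = 0.0d0
  do j = 1, m, 2
     do i = 1, n
        y2(j)   = y2(j)   + a(i,j)   * x(i)
        y2(j+1) = y2(j+1) + a(i,j+1) * x(i)
     end do
  end do

  print *, 'max difference:', maxval(abs(y - y2))
end program unroll_and_jam_sketch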

S3D: An Unexpected Bottleneck

5.4% of time: an implicit loop that copies a non-contiguous 4D slice of 5D data to contiguous storage

Adjust routine interfaces to avoid the copy: 100% faster
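A minimal sketch of the interface issue described above (hypothetical routine and array names, not the S3D source): passing a non-contiguous section of a 5D array to a routine with an explicit-shape dummy argument forces copy-in/copy-out through contiguous temporary storage, while an assumed-shape dummy receives a descriptor and operates on the original data.

program copy_sketch
  implicit none
  integer, parameter :: n = 8, n_spec = 8
  real(8) :: q(n,n,n,n_spec,3)

  call random_number(q)

  ! q(:,:,:,2,:) is a non-contiguous 4D slice of the 5D array
  call work_explicit(q(:,:,:,2,:))   ! copy-in/copy-out of the whole slice
  call work_assumed (q(:,:,:,2,:))   ! no copy: an array descriptor is passed

contains

  subroutine work_explicit(f)
    real(8), intent(inout) :: f(n,n,n,3)   ! explicit-shape dummy: needs contiguous data
    f = f + 1.0d0
  end subroutine work_explicit

  subroutine work_assumed(f)
    real(8), intent(inout) :: f(:,:,:,:)   ! assumed-shape dummy: strides carried in descriptor
    f = f + 1.0d0
  end subroutine work_assumed

end program copy_sketch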

S3D Node Performance Tuning Summary



More opportunities remain


Register reuse and tiling of stencil computations


Inlining + fusion + array contraction of temporary variables


Further improvements require more changes


Lots of potential smaller improvements


Enabling technologies contributions


HPCToolkit enabled identifying and assessing bottlenecks


LoopTool helped automate tedious code transformations

Achieved ~12.7% overall improvement


Node performance increased from 15% of peak to 17.4%


Estimated savings for a 2M CPU hour run: ~254K CPU hours

S3D Scaling Performance (App Team)

Figure: S3D performance on the XT at NCCS: cost per grid point per time step (in microseconds) versus number of cores (1 to 100,000) on the XT3 and XT4.
S3D Scaling Study (Oregon)


Harness test case


Platform: Jaguar Combined Cray XT3/XT4 at ORNL


Several runs to identify scaling trends


Focus on 6400p


Evaluate impact of combined XT3/XT4 nodes


Performance evaluation of MPI_Wait


Study mapping of MPI ranks to nodes

Total Runtime Breakdown by Events (Time)

Figure: labeled events include MPI_Wait and WRITE_SAVEFILE*.

*Recent analysis indicates not a scaling issue

TAU: ParaProf Profile

MPI_Wait times
exhibit two
equivalence
classes

Same equivalence
classes also seen in
memory bandwidth
intensive
computation
routines

S3D Scaling Study Conclusion


Determined that XT3 nodes slowed
certain S3D routines


Consistent across all XT3 nodes


Memory bandwidth limited routines


Suggested load balancing
optimization


Reduce grid size in one dimension


for XT3 nodes


Not yet implemented due to

concerns over long term relevance


Provided estimate of benefit for
combined XT3/XT4 runs


Many scaling and single node results
appear in S3D IOP paper

Projected Heterogeneous Scaling (LLNL)

Figure: projected cost per grid point per time step (in microseconds) versus the proportion of XT4 nodes.
S3D Modeling Results & Future Directions


PMaC predictions for S3D on XT3 and XT4


Currently within 15% for an 8 CPU run


Extending to larger CPU counts


Working on improving accuracy


What is the expected performance of S3D on ORNL’s
250 TFLOP machine?


Will our optimizations benefit a quad-core system?


Different cache structure


L2 1MB→512KB


L3 0 → 2MB shared


What architecture will S3D perform best on?

GTC Tiger Team


Team Lead: Shirley Moore (UTK)


PERI Team Members


Haihang You (UTK)


John Mellor-Crummey, Gabriel Marin, Guohua Jin (Rice)


Hongzhang Shan (LBNL)


Affiliate Team Members


Kevin Huck (UOregon)


Ed D’Azevedo (ORNL)


Lenny Oliker (LBNL)


Application Team Participants


Stephane Ethier, Weixing Wang, Wei-li Lee (PPPL)


Scott Klasky (ORNL)


Facilitators


Kenny Roche, Pat Worley (ORNL)


Bronis de Supinski (LLNL)

GTC: Gyrokinetic Toroidal Code from PPPL


Particle in Cell (PIC) code with gyrokinetic simulation


GTC-S: "shaped" code


More realistically represents experimentally relevant geometry


GTC-P is a new "petascale" version


Partitions the poloidal plane into radial shells


Fortran 90 and MPI, PETSc used for Poisson solves


Currently no OpenMP in GTC-S or GTC-P


OpenMP may be considered for multicore


Code team science goals


Impact of turbulent transport in burning plasma fusion devices


Integrated simulations for ITER plasmas for a range of
temporal and spatial scales

The Gyrokinetic Toroidal Code


3D particle-in-cell code to study microturbulence in magnetically confined fusion plasmas


Solves the gyro-averaged Vlasov equation


Gyrokinetic Poisson equation solved in real space


Low noise δf method


Global code (full torus as opposed to only a flux tube)


Massively parallel: typical runs use 1024+ processors


Electrostatic (for now…)


Nonlinear and fully self-consistent


Written in Fortran 90/95


Originally optimized for superscalar processors

Particle-in-Cell (PIC) Method


Particles sample distribution function.


The particles interact via a grid, on which the
potential is calculated from deposited charges.

The PIC Steps



"SCATTER", or deposit, charges on the grid (nearest neighbors)

Solve Poisson equation

"GATHER" forces on each particle from potential

Move particles (PUSH)


Repeat…
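A schematic one-dimensional version of this cycle (an illustration only, not GTC: linear weighting rather than the 4-point gyro-averaged deposition, and a placeholder in place of the real Poisson solve), just to make the scatter / solve / gather / push structure concrete.

program pic_sketch
  implicit none
  integer, parameter :: ng = 64, np = 4096, nsteps = 10
  real(8), parameter :: lx = 1.0d0, dx = lx/ng, dt = 1.0d-2, qm = -1.0d0
  real(8) :: x(np), v(np), rho(0:ng), efield(0:ng), w
  integer :: ip, ig, it

  call random_number(x);  x = x * lx
  v = 0.0d0

  do it = 1, nsteps
     ! SCATTER: deposit each particle's charge on its two nearest grid points
     rho = 0.0d0
     do ip = 1, np
        ig = int(x(ip)/dx);  w = x(ip)/dx - ig
        rho(ig)   = rho(ig)   + (1.0d0 - w)
        rho(ig+1) = rho(ig+1) + w
     end do

     ! SOLVE: placeholder for the field solve (a real code solves the Poisson equation here)
     efield = -(rho - np/real(ng,8)) * dx

     ! GATHER + PUSH: interpolate the field back to each particle, then advance it
     do ip = 1, np
        ig = int(x(ip)/dx);  w = x(ip)/dx - ig
        v(ip) = v(ip) + dt * qm * ((1.0d0 - w)*efield(ig) + w*efield(ig+1))
        x(ip) = modulo(x(ip) + dt*v(ip), lx)
     end do
  end do

  print *, 'mean |v| after', nsteps, 'steps:', sum(abs(v))/np
end program pic_sketch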

Charge Deposition for Charged Rings: 4-Point Average Method

Figure: charge deposition step (SCATTER operation), classic PIC versus the 4-point average gyrokinetic method (W.W. Lee) used in GTC.

Point-charge particles replaced by charged rings due to gyro-averaging

Application Team’s Flagship Code: The
Gyrokinetic Toroidal Code (GTC)


Fully global 3D particle-in-cell code (PIC) in toroidal geometry


Developed by Prof. Zhihong Lin (now at
UC Irvine)


Used for non-linear gyrokinetic simulations of plasma microturbulence


Fully self-consistent


Uses magnetic field line following coordinates (ψ, θ, ζ) [Boozer, 1981]


Guiding center Hamiltonian [White and
Chance, 1984]


Non-spectral Poisson solver [Lin and Lee, 1995]


Low numerical noise algorithm (δf method)


Full torus (global) simulation



Scales to a very large number of
processors


Excellent theoretical tool!

GTC Mesh and Geometry

Saves a factor of about 100 in CPU time

Field-line following coordinates (Ψ, α, ζ), with α = θ - ζ/q

Figure: the poloidal plane (cross-section) is covered by an unstructured mesh and partitioned among processors 0-3.

New GTC Codes Use a New Parallel Model:

Domain Decomposition + Particle Splitting


1D Domain decomposition:


Several MPI processes can now
share a section of the torus


Particle splitting method


The particles in a toroidal section are
equally divided between several MPI
processes


Particles randomly distributed between
processors within a toroidal domain


Pure MPI version

But OpenMP still there... for multicore?

New Version (GTC-S) Inputs: Experimental Equilibrium and Profiles


Original GTC has flat temperature and density to set
the scale for the gyroradius and the grid, and an
analytical gradient for the turbulence drive


GTC-S uses experimental profiles and plasma boundary extracted from the experimental database by using the widely-used TRANSP tool (http://w3.pppl.gov/transp/)


The magnetic equilibrium is calculated from the
profiles and boundary by using ESC or JSOLVER


Spline coefficients are calculated for the equilibrium
and profiles to allow interpolations at the particle
positions

New Grid Follows Change in Gyro-radius with Temperature Profile

Local gyro-radius follows the temperature profile

Evenly spaced radial grid in a new radial coordinate, chosen so that the grid spacing follows the local gyro-radius set by T(r)

Figure: the original GTC circular grid assumes flat temperature; the new GTC-S grid follows T(r).
Poloidal Component of B Field Taken into Account for Gyro-orbit

For large-aspect ratio circular concentric cross-section, the difference between a poloidal plane and a gyro-plane is neglected.

A more accurate treatment is used here for general geometry.

Projection of gyro-plane on poloidal plane results in elliptic orbit.

4-point average method uses ellipse

GTC Performance Issues


Three basic operation types govern PIC performance


Grid work (i.e., Poisson solve)


Particle processing (e.g., position and velocity updates)


Interpolation between the above (i.e., charge deposition and
field calculation in particle pushing)


Main GTC performance bottleneck is the charge
deposition, or scatter operation


True of most PIC codes


More complex in GTC due to fast gyrating particles


Motion described by charged rings tracked by their guiding center

More GTC Performance Issues


Some scaling issues with GTC-P relative to expectations


Time doubles when it should stay flat


Load imbalance in particle push routine apparently due to
variation in TLB misses


179% speedup going from single to dual core mode


Main computational kernels not memory bandwidth bound


Warning: as number of cores increases, other routines that are
showing slowdown on dual core may start to dominate

Status of GTC Tiger Team Effort


PERI Application Survey completed


Several conference calls w/application team participants


GTC and GTC-S code versions released to Tiger Team and Performance Database WG members on request

Awaiting release of GTC-P code to investigate:

Poor scaling

Load imbalance issues

Profiling of GTC-S carried out on Jaguar using TAU

Data accessible in password-protected PerfDMF database

Optimization of charge deposition by UTK

Detailed modeling, analysis, and optimization of GTC-S by Rice


Brief summary follows; details in a submitted paper

TAU Profile Showing Weak Scaling of GTC-S on Jaguar

Hand optimization of Charge Deposition (UTK)


Hand-tuning techniques (two of which are sketched below)


Common subexpression elimination


Code movement


Loop unrolling


Cache blocking


Improved performance of chargei by ~10%

Changes incorporated into GTC-S code


Written up as success story for Fred Johnson
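A generic before/after illustration of two of the techniques listed above, loop-invariant code motion and common subexpression elimination (illustrative arithmetic, not the chargei source): the invariant product is hoisted out of the loop and the repeated subexpression is computed once per iteration.

program hand_tuning_sketch
  implicit none
  integer, parameter :: n = 10000
  real(8) :: x(n), y1(n), y2(n), a, b, ab, r
  integer :: i

  call random_number(x);  call random_number(a);  call random_number(b)

  ! Before: a*b is recomputed every iteration and (x(i) + a*b) appears twice
  do i = 1, n
     y1(i) = (x(i) + a*b)**2 + sin(x(i) + a*b)
  end do

  ! After: hoist the loop-invariant product and name the common subexpression
  ab = a * b
  do i = 1, n
     r     = x(i) + ab
     y2(i) = r*r + sin(r)
  end do

  print *, 'max difference:', maxval(abs(y1 - y2))
end program hand_tuning_sketch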


Modeling, Analysis, and Optimization of GTC at Rice


Detailed modeling of computation and memory hierarchy performance of GTC-S using Rice modeling toolkit










Identified opportunities for data and loop transformations


Transformations improved program node performance 33% on
Itanium2 and 13% on Opteron 275


Changes sent to Stephane Ethier; awaiting response

GTC-S Memory Hierarchy Performance - I

Figure: total L3 miss count; L3 cache misses due to fragmentation of data in cache lines account for 14.4% of the total.

GTC-S suffers from poor spatial locality due to data layout


Model L3 cache miss counts for individual arrays at the loop level

particle_array is an alias to array zion used in gcmotion

Fragmentation of arrays zion (AKA particle_array) and zion0 accounts for:

95% of all L3 fragmentation misses

48% of all misses to the zion arrays

13.7% of total L3 cache misses

Solution: transpose particle arrays zion and zion0

transform arrays of structures into structures of arrays (sketched below)

(values predicted for 64 radial grid points and 15 particles/cell)
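A generic sketch of the layout change described above (illustrative names p_aos and p_soa, not the actual zion declaration): with the attribute index fastest, each particle's attributes share cache lines (array-of-structures); after the transpose the particle index is fastest, so a loop over one attribute becomes a unit-stride stream (structure-of-arrays).

program transpose_sketch
  implicit none
  integer, parameter :: nattr = 6, np = 100000
  real(8), allocatable :: p_aos(:,:), p_soa(:,:)
  real(8) :: s_aos, s_soa
  integer :: ip

  allocate(p_aos(nattr, np))       ! attribute index fastest: one record per particle
  allocate(p_soa(np, nattr))       ! particle index fastest: one stream per attribute
  call random_number(p_aos)
  p_soa = transpose(p_aos)         ! the transpose applied to the particle arrays

  ! Touching one attribute of every particle strides by nattr in the first layout ...
  s_aos = 0.0d0
  do ip = 1, np
     s_aos = s_aos + p_aos(3, ip)
  end do

  ! ... but is a unit-stride stream that uses full cache lines in the second
  s_soa = 0.0d0
  do ip = 1, np
     s_soa = s_soa + p_soa(ip, 3)
  end do

  print *, 'sums agree:', abs(s_aos - s_soa) < 1.0d-9
end program transpose_sketch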

GTC-S Memory Hierarchy Performance - II

Understanding spatial and temporal data reuse patterns in GTC-S



Figure below: program scopes carrying > 2% of L3 cache misses

Carried misses are non-compulsory misses (capacity + conflict misses)

Carrying scope is the innermost dynamic scope in which the data is reused

Two loops in main carry 40% of all L3 carried misses; those misses cannot be removed.

21.4% of misses are carried by the iterative loop of the Poisson solver.

A recurrence in the solver prevents transformations.


Focus on routines chargei and pushi

Fuse the two main loops in chargei (sketched below)

Apply tiling and fusion over several loop nests in pushi


(values predicted for 64 radial grid points and 15 particles/cell)
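A generic sketch of the fusion applied to chargei (an illustration, not the GTC source): two separate sweeps over the same particle arrays are fused into one loop, so each particle's data is brought into cache once instead of twice.

program fusion_sketch
  implicit none
  integer, parameter :: np = 50000
  real(8) :: x(np), v(np), w1(np), w2(np)
  integer :: ip

  call random_number(x);  call random_number(v)

  ! Before: two loops, two full sweeps over x and v
  do ip = 1, np
     w1(ip) = 0.5d0 * x(ip) + v(ip)
  end do
  do ip = 1, np
     w2(ip) = x(ip) - 0.25d0 * v(ip)
  end do

  ! After fusion: one sweep computes both results while the data is in cache
  do ip = 1, np
     w1(ip) = 0.5d0 * x(ip) + v(ip)
     w2(ip) = x(ip) - 0.25d0 * v(ip)
  end do

  print *, w1(1), w2(np)
end program fusion_sketch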

GTC-S Memory Hierarchy Performance - III

Pinpointing and reducing TLB misses

do kz=1,mzbig
   wz=real(kz)/real(mzbig)
   zdum=zetamin+deltaz*(real(k-1)+wz)
   do i=idiag1,idiag2
      ii=igrid(i)
      do j=1,mtdiag
         ...
         phiflux(kz+(k-1)*mzbig,j,i)= ...
      enddo
   enddo
enddo

(values predicted for 64 radial grid points and 15 particles/cell)


Interchange loop kz to the innermost position

Outer loop kz iterates over inner dimension of phiflux
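A generic sketch of the interchange (not the GTC loop itself): moving the loop that indexes the array's first, fastest-varying dimension to the innermost position turns large-stride stores into unit-stride ones, which is what cuts the TLB misses.

program interchange_sketch
  implicit none
  integer, parameter :: n1 = 256, n2 = 256
  real(8) :: phi(n1, n2)
  integer :: i, j

  ! Before: the inner loop walks the second dimension, so consecutive stores
  ! are n1 elements apart (large strides touch many distinct pages)
  do i = 1, n1
     do j = 1, n2
        phi(i, j) = real(i + j, 8)
     end do
  end do

  ! After interchange: the inner loop walks the first dimension (unit stride)
  do j = 1, n2
     do i = 1, n1
        phi(i, j) = real(i + j, 8)
     end do
  end do

  print *, phi(1,1), phi(n1,n2)
end program interchange_sketch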


Additional transformations


Apply unroll & jam to increase ILP in routine
spcpft


Transform arrays used in the Poisson solver to
improve spatial locality

GTC-S Performance Improvements on Itanium2


Percentages represent incremental improvements for each
transformation


Results for 10 and 100 particles/cell

Transformation        L2 misses (%)    L3 misses (%)    TLB misses (%)   Execution time (%)
+zion transpose       -27 / -37.4      -30.9 / -39.1    -10.6 / -81.6    -6.9 / -11.7
+chargei fusion       -3.9 / -6.2      -4.3 / -6.8      -1.6 / -3.1      -11.4 / -18.2
+spcpft U&J           0 / 0            +0.1 / +0.4      -0.1 / -0.4      -11.3 / -1.9
+poisson transf.      -6.6 / -1        -6.4 / -1.3      -1.4 / +2        -7.4 / -1.4
+smooth LI            -3 / -0.4        -2.4 / -0        -63.9 / -3.6     -0.7 / 0
+pushi tile/fuse      -8.9 / -13.3     -10.9 / -16      -3.4 / -9.4      +0.6 / +0.8
Total                 -49.4 / -58.3    -54.8 / -63      -81.1 / -96.2    -37.3 / -32.4

(Each entry shows the value for 10 / 100 particles per cell.)

Itanium2 has 16KB dedicated instruction cache. Improvements in
data locality negated by increase in instruction cache misses. Bigger
impact expected with larger instruction cache, e.g. Montecito.

Side effect: big reduction in
unnecessary data prefetches
inserted by Intel compiler

GTC-S Performance Improvements on Opteron


Issues


Hardware prefetcher crucial for performance on Opteron


Prefetcher tracks up to 20 parallel data streams


Zion transpose increases # of parallel streams in key loops


Reduces effectiveness of hardware prefetcher


Data reuse improvements are negated by higher number of non-prefetched memory accesses


Approach


Reorganize five arrays in pushi as one array


Reorganize fourteen arrays in gcmotion as four arrays


Result: Improves execution time on Opteron by 13%


Reduces cache and TLB misses by > 50%

Exploring Run-time Data Reordering at Rice


Issue


Performance degrades during GTC execution as particles
become disordered w.r.t. underlying tokamak grid


Preliminary study


Particle reordering improves temporal locality during charge
deposition and particle pushing







Currently developing on-line feedback and control mechanism for particle reordering
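A sketch of the idea (an illustration, not the Rice implementation): particles are binned by the grid cell that contains them using a counting sort, so subsequent deposition and push loops visit grid data in cell order.

program reorder_sketch
  implicit none
  integer, parameter :: np = 100000, ncell = 256
  real(8) :: x(np), x_sorted(np)
  integer :: cell(np), cnt(ncell), pos(ncell), ip, c

  call random_number(x)
  cell = int(x * ncell) + 1          ! cell containing each particle (1..ncell)

  ! counting sort of particles by cell index
  cnt = 0
  do ip = 1, np
     cnt(cell(ip)) = cnt(cell(ip)) + 1
  end do
  pos(1) = 0
  do c = 2, ncell
     pos(c) = pos(c-1) + cnt(c-1)    ! starting offset of each cell's particles
  end do
  do ip = 1, np
     c = cell(ip)
     pos(c) = pos(c) + 1
     x_sorted(pos(c)) = x(ip)
  end do

  print *, 'first and last reordered coordinates:', x_sorted(1), x_sorted(np)
end program reorder_sketch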

What Worked


Close interactions with multiple members of app teams


Tiger Team specific mailing list for S3D


Generated team-wide comments, tapping more expertise


Not used very much for GTC


Large distributed teams


Somewhat surprising


Avoid duplication of measurement effort


University of Oregon participation as affiliate was exemplary


Rice participation was also exemplary


Publications and publicity


S3D science focused IOP paper


SciDAC Review paper & SciDAC conf. presentation (Mellor-Crummey)


GTC success for Fred Johnson

What Didn’t


Timing of application selection


Not finalized until halfway through fiscal year


Delayed by survey


OK for first year; future implications?


Long Jaguar down time soon after teams formed


Initial understanding of code distribution


Provided through JOULE process, NOT direct from application teams


An appropriate distribution mechanism but unsettling to application team


Frequent, on-going concern of application teams


Will always start with application team in future, regardless of reason for
selection or appropriateness of distribution


Mechanism for providing improvements back to application team


Slow and cumbersome; No CVS access


May not be solvable due to application team need for internal control


Addressed by repeated direct interactions

FY08 Tiger Team Issues and Proposed Solutions


Which applications will be focus of FY08 Tiger Teams?


Guidance from HQ requested


Recommend one XT4-focused team, one BG/P-focused team


Is JOULE precedent to continue?


Expect timing to be similar to FY07 (January/March), though maybe a little sooner


Plan to continue work with S3D and GTC during FY08 selection process


Solves late Q2 decision, one of FY07’s biggest issues


Suggest elimination of Q3 reassessment milestone in light of timing


What happens to teams from previous year?


Application tuning does not respect fiscal year boundaries


Good relationships established


Don’t want to lose them


Are liaison activities sufficient to maintain them?


Plan to slowly devolve FY07 teams into very active liaison activities


Use different participants for FY08 teams in order to balance staffing requirements


How do we ensure that the results are publicized?


Initial S3D paper is good; potential for more


GTC success story is good; plan similar one for S3D


Continued interactions will support solving this question