
Parallel Programming Trends in Extremely Scalable Architectures

Carlo Cavazzoni, HPC department, CINECA


CINECA

CINECA is a non-profit consortium, made up of 50 Italian universities*, the National Institute of Oceanography and Experimental Geophysics (OGS), the CNR (National Research Council), and the Ministry of Education, University and Research (MIUR).

CINECA is the largest Italian computing centre and one of the most important worldwide.

The HPC department manages the HPC infrastructure, provides support to Italian and European researchers, and promotes technology transfer initiatives for industry.


Why parallel programming?

Solve larger problems

Run memory-demanding codes

Solve problems with greater speed


Modern Parallel Architectures

Two basic architectural schemes:

Distributed Memory

Shared Memory

Now most computers have a mixed architecture:

+ accelerators -> hybrid architectures


Distributed Memory

[Diagram: several nodes, each with its own CPU and local memory, connected by a network]


Shared Memory

[Diagram: several CPUs sharing a single memory]


Real Shared

[Diagram: multiple CPUs connected through a system bus to shared memory banks]


Virtual Shared

[Diagram: nodes, each with CPUs and a HUB, interconnected by a network to form a virtually shared memory]


Mixed Architectures

[Diagram: nodes, each with multiple CPUs sharing local memory, connected by a network]


Most Common Networks

Cube, hypercube, n-cube

Torus in 1, 2, ..., N dimensions

Switched: Fat Tree


HPC Trends


Top500

[Plot: number of cores of the #1 system in the Top500 list, June 1993 to June 2011 (y-axis 0 to 600,000 cores)]
Paradigm Change in HPC

What about applications?

The next HPC system installed at CINECA will have 200,000 cores.


Roadmap to Exascale

(architectural trends)


Dennard Scaling law (MOSFET)

L' = L / 2
V' = V / 2
F' = F * 2
D' = 1 / L'^2 = 4D
P' = P

This no longer holds: the power crisis!

L' = L / 2
V' = ~V
F' = ~F * 2
D' = 1 / L'^2 = 4D
P' = 4P

Core frequency and performance no longer grow following Moore's law.

CPU + Accelerator: the way to keep the evolution of architectures in line with Moore's law.

The programming crisis!
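To see where the factor of 4 comes from, here is a back-of-the-envelope derivation (my illustration, not a formula from the slides), assuming dynamic power P proportional to device density D, per-device capacitance C (which scales with L), supply voltage V and frequency F:

\[ P \;\propto\; D \cdot C \cdot V^{2} \cdot F \]

Dennard scaling: \( P' \propto (4D)\cdot\frac{C}{2}\cdot\left(\frac{V}{2}\right)^{2}\cdot(2F) = P \)

Without voltage scaling: \( P' \propto (4D)\cdot\frac{C}{2}\cdot V^{2}\cdot(2F) = 4P \)

With V no longer shrinking, each process generation would quadruple the power of a chip run at full frequency, which is why frequency growth stopped.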


Where are Watts burnt?

Today (at 40 nm), moving the three 64-bit operands needed to compute a 64-bit floating-point FMA (D = A + B * C) takes 4.7x the energy of the FMA operation itself.

Extrapolating down to 10 nm integration, the energy required to move the data becomes 100x!
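As a rough illustration of what those ratios mean for the energy budget of a single FMA (my arithmetic, based only on the 4.7x and 100x figures above):

At 40 nm: \( E_{\mathrm{total}} = E_{\mathrm{FMA}} + E_{\mathrm{move}} \approx (1 + 4.7)\,E_{\mathrm{FMA}} \approx 5.7\,E_{\mathrm{FMA}} \)

At 10 nm: \( E_{\mathrm{total}} \approx (1 + 100)\,E_{\mathrm{FMA}} \approx 101\,E_{\mathrm{FMA}} \)

In other words, at 10 nm essentially all of the energy goes into moving operands rather than computing on them, which is what drives the push for data locality.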


MPP System

When?     2012
PFlop/s   > 2
Power     > 1 MWatt
Cores     > 150,000
Threads   > 500,000
Arch      Option for BG/Q


Accelerator

A set (one or more) of very simple execution units that can perform a few operations (with respect to a standard CPU) with very high efficiency. When combined with a full-featured CPU (CISC or RISC), it can accelerate the "nominal" speed of a system. (Carlo Cavazzoni)

[Diagram: physical integration (CPU plus discrete accelerator) vs. architectural integration (CPU & accelerator on the same chip); trade-off between single-thread performance and throughput]


nVIDIA GPU

The Fermi implementation will pack 512 processor cores.


ATI FireStream, AMD GPU


2012: new Graphics Core Next ("GCN") architecture, with a new instruction set and a new SIMD design.



Intel MIC (Knights Ferry)


What about parallel applications?

In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law).

The maximum speedup tends to 1 / (1 - P), where P is the parallel fraction.

With 1,000,000 cores, exploiting them requires P = 0.999999, i.e. a serial fraction of only 0.000001.
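The following minimal C sketch (illustrative, not from the slides) evaluates the full Amdahl formula S(N) = 1 / ((1 - P) + P/N) for the parallel fraction quoted above, to show how close one million cores actually get to the 1 / (1 - P) limit:

#include <stdio.h>

/* Amdahl's law: speedup on n cores for a code whose parallel fraction is p */
static double amdahl_speedup(double p, double n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    const double p = 0.999999;                     /* parallel fraction from the slide */
    const double cores[] = { 1e3, 1e4, 1e5, 1e6 };

    for (int i = 0; i < 4; ++i)
        printf("N = %8.0f   speedup = %10.1f\n", cores[i], amdahl_speedup(p, cores[i]));

    printf("asymptotic limit 1/(1-P) = %10.1f\n", 1.0 / (1.0 - p));
    return 0;
}

Even with a serial fraction of only 10^-6, one million cores deliver a speedup of about 500,000, i.e. only half of the asymptotic limit.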


Programming Models

- Message Passing (MPI)
- Shared Memory (OpenMP)
- Partitioned Global Address Space Programming (PGAS) languages: UPC, Coarray Fortran, Titanium
- Next-generation programming languages and models: Chapel, X10, Fortress
- Languages and paradigms for hardware accelerators: CUDA, OpenCL
- Hybrid: MPI + OpenMP + CUDA/OpenCL


Trends

[Diagram: evolution of programming models, from a scalar application through vector, distributed-memory, and shared-memory programming to hybrid codes]

Hybrid codes:
- MPP systems, message passing: MPI
- Multi-core nodes: OpenMP
- Accelerators (GPGPU, FPGA): CUDA, OpenCL


Message Passing

domain decomposition

[Diagram: nodes, each with its own CPU and memory, connected by an internal high-performance network]


Ghost Cells - Data Exchange

[Diagram: each processor holds a sub-domain of grid points (i,j) with neighbours (i+1,j), (i-1,j), (i,j+1), (i,j-1); the cells just outside the sub-domain boundaries are stored as ghost cells]

Ghost cells are exchanged between processors at every update.
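A minimal sketch of such a ghost-cell (halo) exchange for a 1D decomposition, using MPI_Sendrecv (illustrative, not from the slides; the local size NLOC and the non-periodic boundary handling are assumptions):

#include <mpi.h>
#include <stdio.h>

#define NLOC 8   /* interior grid points per process (illustrative) */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* local array with one ghost cell at each end: u[0] and u[NLOC+1] */
    double u[NLOC + 2];
    for (int i = 1; i <= NLOC; ++i) u[i] = rank;   /* fill the interior */

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* send my first interior cell left, receive the right neighbour's into my right ghost */
    MPI_Sendrecv(&u[1],        1, MPI_DOUBLE, left,  0,
                 &u[NLOC + 1], 1, MPI_DOUBLE, right, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* send my last interior cell right, receive the left neighbour's into my left ghost */
    MPI_Sendrecv(&u[NLOC],     1, MPI_DOUBLE, right, 1,
                 &u[0],        1, MPI_DOUBLE, left,  1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d: left ghost = %g, right ghost = %g\n", rank, u[0], u[NLOC + 1]);
    MPI_Finalize();
    return 0;
}

With MPI_PROC_NULL at the domain edges, the same two calls also handle the boundary processes without special cases.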


Message Passing: MPI

Main Characteristics:
- Library
- Coarse grain
- Inter-node parallelization (few real alternatives)
- Domain partition
- Distributed memory
- Almost all HPC parallel apps

Open Issues:
- Latency
- OS jitter
- Scalability


Shared memory

[Diagram: a single node with four CPUs sharing the same memory; threads 0-3 access shared variables x and y]


Shared Memory: OpenMP

Main Characteristics:
- Compiler directives
- Medium grain
- Intra-node parallelization (pthreads)
- Loop or iteration partition (see the sketch after this list)
- Shared memory
- Many HPC apps

Open Issues:
- Thread creation overhead
- Memory/core affinity
- Interface with MPI
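Before the FFT example on the next slide, here is a minimal C illustration of the loop-partition idea (a sketch of mine, not slide material): a single directive splits the loop iterations among the threads of a node.

#include <omp.h>
#include <stdio.h>

#define N 1000000

static double a[N], b[N], c[N];

int main(void)
{
    for (int i = 0; i < N; ++i) {       /* set up the input arrays */
        a[i] = i;
        b[i] = 2.0 * i;
    }

    /* iterations 0..N-1 are partitioned among the threads of the node */
    #pragma omp parallel for
    for (int i = 0; i < N; ++i)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %g, computed with up to %d threads\n",
           c[N - 1], omp_get_max_threads());
    return 0;
}

Compiled with an OpenMP-enabled compiler (e.g. gcc -fopenmp), the same source runs serially if the directive is ignored.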


OpenMP

!$omp parallel do
do i = 1, nsl
   call 1DFFT along z ( f [ offset( threadid ) ] )
end do
!$omp end parallel do

call fw_scatter ( . . . )

!$omp parallel
do i = 1, nzl
   !$omp parallel do
   do j = 1, Nx
      call 1DFFT along y ( f [ offset( threadid ) ] )
   end do
   !$omp parallel do
   do j = 1, Ny
      call 1DFFT along x ( f [ offset( threadid ) ] )
   end do
end do
!$omp end parallel


Accelerator/GPGPU

Sum of 1D arrays

[Diagram: element-wise sum of two 1D arrays]


CUDA sample

void CPUCode( int* input1, int* input2, int* output, int length ) {
   // plain CPU version: one iteration after another
   for ( int i = 0; i < length; ++i ) {
      output[ i ] = input1[ i ] + input2[ i ];
   }
}

__global__ void GPUCode( int* input1, int* input2, int* output, int length ) {
   // global thread index: one GPU thread per array element
   int idx = blockDim.x * blockIdx.x + threadIdx.x;
   if ( idx < length ) {
      output[ idx ] = input1[ idx ] + input2[ idx ];
   }
}

Each thread executes one loop iteration.
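The slide shows only the kernels; a minimal host-side launch might look like the sketch below (my addition: the array size, the 256-thread block size and the grid-size rounding are illustrative choices, not from the slides).

#include <cuda_runtime.h>
#include <stdlib.h>

/* GPUCode is the kernel defined in the sample above */

int main(void)
{
    const int length = 1 << 20;
    const size_t bytes = length * sizeof(int);

    /* host arrays */
    int *h_in1 = (int*)malloc(bytes), *h_in2 = (int*)malloc(bytes), *h_out = (int*)malloc(bytes);
    for (int i = 0; i < length; ++i) { h_in1[i] = i; h_in2[i] = 2 * i; }

    /* device arrays */
    int *d_in1, *d_in2, *d_out;
    cudaMalloc((void**)&d_in1, bytes);
    cudaMalloc((void**)&d_in2, bytes);
    cudaMalloc((void**)&d_out, bytes);

    /* copy the inputs to the GPU */
    cudaMemcpy(d_in1, h_in1, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_in2, h_in2, bytes, cudaMemcpyHostToDevice);

    /* one thread per element: enough blocks of 256 threads to cover the array */
    const int threads = 256;
    const int blocks  = (length + threads - 1) / threads;
    GPUCode<<<blocks, threads>>>(d_in1, d_in2, d_out, length);

    /* copy the result back to the host */
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in1); cudaFree(d_in2); cudaFree(d_out);
    free(h_in1); free(h_in2); free(h_out);
    return 0;
}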


CUDA, OpenCL

Main Characteristics:
- Ad-hoc compiler
- Fine grain
- Offload parallelization (GPU)
- Single-iteration parallelization
- Ad-hoc memory
- Few HPC apps

Open Issues:
- Memory copy
- Standard
- Tools
- Integration with other languages


Hybrid (MPI + OpenMP + CUDA + ... + Python)

Take the positives of all models

Exploit the memory hierarchy

Many HPC applications are adopting this model

Mainly due to developer inertia: it is hard to rewrite millions of lines of source code


Hybrid parallel programming

MPI: domain partition
OpenMP: outer-loop partition
CUDA: inner-loop iterations assigned to GPU threads
Python: ensemble simulations

Quantum ESPRESSO (http://www.qe-forge.org/)
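A minimal hybrid skeleton in C (illustrative only; this is not Quantum ESPRESSO code): MPI partitions the problem across ranks/nodes, and OpenMP partitions each rank's local loop across the cores of its node.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define NGLOBAL 1000000   /* illustrative global problem size */

int main(int argc, char **argv)
{
    int provided;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* MPI: domain partition - each rank owns a contiguous slice of the global range */
    int nloc  = NGLOBAL / size;   /* assume NGLOBAL divisible by size, for brevity */
    int first = rank * nloc;

    double local_sum = 0.0;

    /* OpenMP: the local iterations are split among the threads of the node */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < nloc; ++i)
        local_sum += (double)(first + i);   /* stand-in for real per-cell work */

    double global_sum = 0.0;
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks = %d, threads per rank = %d, sum = %g\n",
               size, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}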


Storage I/O

- The I/O subsystem is not keeping pace with the CPU
- Checkpointing will not be possible
- Reduce I/O
- On-the-fly analysis and statistics
- Disk only for archiving
- Scratch on non-volatile memory ("close to RAM")


PRACE

The PRACE Research Infrastructure (www.prace-ri.eu) is the top level of the European HPC ecosystem.

The vision of PRACE is to enable and support European global leadership in public and private research and development.

CINECA (representing Italy) is a hosting member of PRACE and can host a Tier-0 system.

[Diagram: European HPC ecosystem pyramid - Tier 0: European (PRACE); Tier 1: National (CINECA today); Tier 2: Local]

FERMI @ CINECA

PRACE Tier-0 System

Architecture:       10 BG/Q frames
Model:              IBM BG/Q
Processor type:     IBM PowerA2, 1.6 GHz
Computing cores:    163,840
Computing nodes:    10,240
RAM:                1 GByte / core
Internal network:   5D torus
Disk space:         2 PByte of scratch space
Peak performance:   2 PFlop/s

ISCRA & PRACE calls for projects now open!


Conclusion


- Exploit millions of ALUs
- Hybrid hardware
- Hybrid codes
- Memory hierarchy
- Flops/Watt (more than Flops/sec)
- I/O subsystem
- Non-volatile memory
- Fault tolerance!

