
Alistair Rendell and Josh Milthorpe

Research School of Computer Science

Australian National University


The idea


Split your program into bits that can be executed
simultaneously


Motivation


Speed, speed, speed


at a cost-effective price


If we didn't want it to go faster we would not be
bothered with the hassles of parallel programming!


Reduce the time to solution to acceptable levels


No point waiting 1 week for tomorrow's weather
forecast


Simulations that take months to run are not useful in a
design environment


Fluid flow problems


Weather forecasting/climate modeling


Aerodynamic modeling of cars, planes, rockets, etc.


Structural Mechanics


Strength analysis of buildings, bridges, cars, etc.


Car crash simulation


Speech and character recognition, image processing


Visualization, virtual reality


Semiconductor design, simulation of new chips


Structural biology, molecular level design of drugs


Human genome mapping


Financial market analysis and simulation


Data mining, machine learning


Games programming!


Atmosphere divided into 3D regions or cells


Complex mathematical equations describe conditions in
each cell, e.g. pressure, temperature, velocity


Conditions change according to neighbouring cells (see the sketch after this list)


Updates repeated frequently as time passes


The longer the forecast range, the more distant the cells that affect a given cell
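
To make the cell-update idea concrete, below is a minimal sketch of one timestep over a small 3D grid. The grid dimensions, the six-neighbour averaging rule and the alpha coefficient are illustrative assumptions only, not the real forecast equations.

  /* Toy sketch of one "weather" timestep: each interior cell relaxes
   * towards the average of its six face neighbours. All constants here
   * are illustrative assumptions, not real meteorology. */
  #include <stdio.h>
  #include <stdlib.h>

  #define NX 64
  #define NY 64
  #define NZ 10

  /* index into a flattened 3D grid */
  static size_t idx(int x, int y, int z) {
      return ((size_t)z * NY + y) * NX + x;
  }

  static void timestep(const double *cur, double *next, double alpha) {
      for (int z = 1; z < NZ - 1; z++)
          for (int y = 1; y < NY - 1; y++)
              for (int x = 1; x < NX - 1; x++) {
                  /* average of the six face neighbours */
                  double nbr = (cur[idx(x-1,y,z)] + cur[idx(x+1,y,z)] +
                                cur[idx(x,y-1,z)] + cur[idx(x,y+1,z)] +
                                cur[idx(x,y,z-1)] + cur[idx(x,y,z+1)]) / 6.0;
                  /* relax the cell towards its neighbours' average */
                  next[idx(x,y,z)] = cur[idx(x,y,z)]
                                   + alpha * (nbr - cur[idx(x,y,z)]);
              }
  }

  int main(void) {
      size_t n = (size_t)NX * NY * NZ;
      double *cur  = calloc(n, sizeof *cur);
      double *next = calloc(n, sizeof *next);
      cur[idx(NX/2, NY/2, NZ/2)] = 1.0;   /* a single disturbance */
      timestep(cur, next, 0.5);           /* one update sweep     */
      printf("centre after one step: %f\n", next[idx(NX/2, NY/2, NZ/2)]);
      free(cur);
      free(next);
      return 0;
  }

Each cell reads only its immediate neighbours, which is why such grids split naturally across processors: each processor owns a sub-grid and only the faces need to be exchanged.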


Assume


Cells are 1×1×1 mile to a height of 10 miles: about 5×10^8 cells


200 flops to update each cell per timestep, i.e. about 10^11 flops per step


10-minute timesteps for a total of 10 days


≈100 days on a 100 Mflop/s machine


≈10 minutes on a Tflop/s machine


NCI: National Computational Infrastructure


http://nci.org.au and http://nf.nci.org.au


History


Establishment of APAC in 1998 with a $19.5M grant from the federal
government, renewed in 2004 with a grant of about $29M


Changed to NCI in 2007 with funding through NCRIS and Super
Science Programs


The 2010 machine is a Sun X6275 Constellation Cluster: 1492 nodes
(2 × 2.93 GHz Nehalem), i.e. 11,936 cores, with QDR InfiniBand interconnect


Installing a new Fujitsu PRIMERGY system with Sandy Bridge nodes:
57,000 cores, 160 TB RAM, 10 PB of disk


Bunyip:
tsg.anu.edu.au/Projects/Bunyip


192 processor PC Cluster


winner of the 2000 Gordon Bell prize for best price/performance


High Performance
Computing Group


Jabberwocky cluster


Sunnyvale cluster


Single Chip Cloud Computer

Year   Hardware                           Languages
1950   Early designs                      Fortran I (Backus, '57)
1960   Integrated circuits                Fortran 66
1970   Large-scale integration            C ('72)
1980   RISC and PC                        C++ ('83), Python 1.0 ('89)
1990   Shared and distributed parallel    MPI, OpenMP, Java ('95)
2000   Faster, better, hotter             Python 2.0 ('00)
2010   Throughput oriented                CUDA, OpenCL

Parallelism became an issue for programmers from the late 80s

People began compiling lists of big parallel systems

All have multiple processors (many have GPUs)

No | Computer | Site | Cores | Rmax | Rpeak | Power
(Rmax and Rpeak in Gflop/s; Power in kW)

1 | Cray XK7, Opteron 6274 16C 2.200 GHz, Cray Gemini interconnect, NVIDIA K20x | 2012, DoE, USA | 560,640 | 17,590,000 | 27,112,550 | 8,209
2 | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom | 2011, DoE, USA | 1,572,864 | 16,324,751 | 20,132,659 | 7,890
3 | K computer, SPARC64 VIIIfx 2.0 GHz, Tofu interconnect | 2011, RIKEN, Japan | 705,024 | 10,510,000 | 11,280,384 | 12,660
4 | BlueGene/Q, Power BQC 16C 1.60 GHz, Custom | 2012, DoE, USA | 786,432 | 8,162,376 | 10,066,330 | 3,945
5 | BlueGene/Q, Power BQC 16C 1.600 GHz, Custom interconnect | 2012, MaxPlanck, Germany | 393,216 | 4,141,180 | 5,033,165 | 1,970
24 | Fujitsu PRIMERGY CX250 S1, Xeon E5-2670 8C 2.600 GHz, Infiniband FDR | 2012, NCI, Australia | 53,504 | 978,600 | 1,112,883 | 11

Moore's Law

'Transistor density will double approximately every two years.'

Dennard Scaling

'As MOSFET features shrink, switching time and power consumption will fall proportionately.'

Agarwal, Hrishikesh, Keckler and Burger, Clock Rate Versus IPC, ISCA 2000
(feature size, die area, fraction of chip reachable in one clock cycle):

250 nm, 400 mm², 100%
180 nm, 450 mm², 100%
130 nm, 566 mm², 82%
100 nm, 622 mm², 40%
70 nm, 713 mm², 19%
50 nm, 817 mm², 6.5%
35 nm, 937 mm², 1.9%

Until the chips became too big…

…so multiple cores appeared on chip

…until we hit a bigger problem…

In 2004 Sun released the dual-core SPARC IV, heralding the start of multicore

…the end of Dennard scaling…

Dennard, Gaensslen, Yu, Rideout, Bassous and LeBlanc, IEEE JSSC, 1974

Dennard scaling

'As MOSFET features shrink, switching time and power consumption will fall proportionately.'

Moore’s Law

‘Transistor density will double approximately every two years.’

…ushering in…

1960-2010                      2010-?
Few transistors                No shortage of transistors
No shortage of power           Limited power
Maximize transistor utility    Minimize energy
Generalize                     Customize

…and a fundamentally new set of building
blocks for our petascale systems


Level | Characteristic | Challenge/Opportunity
As a whole | Sheer number of nodes (the Fujitsu K machine has 548,352 cores) | Programming language/environment; fault tolerance
Within a domain | Heterogeneity (the Titan system uses CPUs and GPUs) | What to use when; co-location of data with the unit processing it
On the chip | Energy minimization (processors already have frequency and voltage scaling) | Minimize data size and movement, including use of just enough precision; specialized cores

In RSCS we are working in all these areas


Multiple instruction units:


Typical processors issue ~4 instructions per cycle


Instruction Pipelining:


Complicated operations are broken into simple
operations that can be overlapped


Graphics Engines:


Use multiple rendering pipes and processing elements
to render millions of polygons a second


Interleaved Memory:


Multiple paths to memory that can be used at the same
time


Input/Output:


Disks are striped with different blocks of data written
to different disks at the same time
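
As a quick illustration of the striping idea in the last item, this toy sketch maps logical file blocks to disks round-robin; the disk count and block numbering are illustrative assumptions, not any particular file system's actual layout.

  /* Toy round-robin striping: logical block b of a file lands on
   * disk (b % NDISKS) at position (b / NDISKS), so consecutive
   * blocks can be written to different disks at the same time.
   * NDISKS is an illustrative assumption. */
  #include <stdio.h>

  #define NDISKS 4

  int main(void) {
      for (int block = 0; block < 8; block++) {
          int disk = block % NDISKS;      /* which disk holds it  */
          int pos  = block / NDISKS;      /* block index on disk  */
          printf("logical block %d -> disk %d, block %d\n",
                 block, disk, pos);
      }
      return 0;
  }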


Split program up and run parts simultaneously on
different processors


On N computers the time to solution should (ideally!) be 1/N (see the sketch below)


Parallel Programming
: the art of writing the parallel code!


Parallel Computer
: the hardware on which we run our parallel
code!

COMP4300 will discuss both
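
As a minimal sketch of the "split your program into bits that run simultaneously" idea, the OpenMP loop below (OpenMP being one of the paradigms covered later) divides its iterations across the available cores and combines the partial sums. The harmonic-series body is just dummy work chosen for illustration.

  /* Minimal sketch of splitting a loop across cores with OpenMP.
   * Each thread sums a share of the iterations; the reduction
   * combines the partial sums. Compile with e.g. cc -fopenmp. */
  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      const long n = 100000000L;
      double sum = 0.0;

      double t0 = omp_get_wtime();
      #pragma omp parallel for reduction(+:sum)
      for (long i = 0; i < n; i++)
          sum += 1.0 / (double)(i + 1);   /* dummy work: harmonic series */
      double t1 = omp_get_wtime();

      printf("sum = %f, %.3f s on up to %d threads\n",
             sum, t1 - t0, omp_get_max_threads());
      return 0;
  }

Run it with OMP_NUM_THREADS set to 1, 2, 4, ... and compare the times; the gap between the measured speedup and the ideal 1/N is one of the recurring themes of the course.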


Beyond raw compute power other motivations include


Enabling more accurate simulations in the same time (finer
grids)


Providing access to huge aggregate memories


Providing more and/or better input/output capacity


Course is run every other year


Drop out this year and it won't be repeated
until 2015


It's a 4000/6000-level course; it's supposed to:


Be more challenging than a 3000-level course!


Be less well structured


Have greater expectations of you


Have more student participation


Be fun!



Parallel Architecture:


Basic issues concerning design and likely
performance of parallel systems


Specific Systems:


Will make extensive use of NCI facilities


Programming Paradigms:


Distributed and shared memory, things in between,
data intensive computing


Parallel Algorithms:


Numeric and non-numeric


The Future



The pieces


2 lectures per week (~30 core lecture hours)


6 Labs (not marked, solutions provided)


2 assignments (40%)


1 mid-semester exam (~2 hours, 20%)


1 final exam (3 hours, 40%)


Final mark is the sum of the assignment, mid-semester and final exam marks


Two slots


Tue 14:00-16:00 Chem T2


Thu 15:00-16:00 Chem T2


Exact schedule on web site


Partial notes will be posted on the web site; bring a copy to lectures


Attendance at lectures and labs is strongly
recommended


attendance at labs will be recorded

http://cs.anu.edu.au/student/comp4300


We will use Wattle only for lecture recordings


Start in week 3 (March 4th)


See web page for detailed schedule


2 sessions available


Tue 12:00-14:00 N113


Thu 10:00-12:00 N112


Register via streams now


Not assessed, but the lab material will be examined

Course Convener

Alistair Rendell

N226 CSIT Building

Alistair.Rendell@anu.edu.au

Phone 6125 4386


Lecturer

Josh Milthorpe

N216 CSIT Building

Josh.Milthorpe@anu.edu.au

Phone 6125 4478


Course web page

cs.anu.edu.au/student/comp4300


Bulletin board (forum, available from streams)

cs.anu.edu.au/streams


At lectures and in labs


Email

comp4300@cs.anu.edu.au


In person


Office hours (to be set; see web page)


Email for appointment if you want specific time


Principles of Parallel Programming, Calvin Lin and
Lawrence Snyder, Pearson International Edition, ISBN 978-0-321-54942-6


Introduction to Parallel Computing, 2nd Ed., Grama,
Gupta, Karypis, Kumar, Addison-Wesley, ISBN 0201648652
(electronic version accessible online from ANU library; search for title)


Parallel Programming: Techniques and Applications Using
Networked Workstations and Parallel Computers, Barry
Wilkinson and Michael Allen, Prentice Hall, 2nd edition,
ISBN 0131405632


…and others on the web page