High Performance Computing Platforms


Overview and Challenges

Sebastien Varrette, PhD
Computer Science and Communications (CSC) Research Unit, University of Luxembourg, Luxembourg
v0.7.0-b40
Summary

1. Introduction
2. Performance Evaluation of Parallel/Distributed Programs & Systems
   - Parallel Program Performance Evaluation
   - HPC Platforms Performance Evaluation
3. Computing Systems Architectures
   - Architecture of the Current HPC Facilities
   - Overview of the Main HPC Components
   - Why are all computers now parallel?
   - A concrete example: UL HPC platforms
4. Parallel Programming Challenges
   - Review of computer architecture
   - Understanding the different levels of parallelism
   - Fault Tolerance in Distributed Computing Platforms
   - Dynamic Capabilities of the Applications
5. Conclusion
Organization of the lecture
Theoretical classes: general overview
Laboratory/exercise classes: practice on concrete computing platforms
↪ GPUs, UL clusters, your laptop
Lecturers:
↪ Prof. Pascal Bouvry, PhD
↪ Sebastien Varrette, PhD
↪ Frederic Pinel
Preamble
Prerequisites
HPC: High Performance Computing

Main HPC Performance Metrics
Computing capacity/speed: often measured in Flops (or Flop/s)
↪ floating-point operations per second (often in double precision, DP)
↪ GFlops = 10^9 Flops, TFlops = 10^12 Flops, PFlops = 10^15 Flops
Storage capacity: measured in multiples of bytes (1 byte = 8 bits)
↪ GB = 10^9 bytes, TB = 10^12 bytes, PB = 10^15 bytes
↪ GiB = 1024^3 bytes, TiB = 1024^4 bytes, PiB = 1024^5 bytes
Transfer rate on a medium: measured in Mb/s or MB/s
Other metrics: sequential vs. random R/W speed, IOPS
Why High Performance Computing?
"The country that out-computes will be the one that out-competes."
— Council on Competitiveness
Accelerate research by accelerating computations
↪ 14.4 GFlops (dual-core i7 @ 1.8 GHz) vs. 14.26 TFlops (151 computing nodes, 1556 cores)
Increase storage capacity
↪ 2 TB (1 disk) vs. 480 TB (4 x 60 disks)
Communicate faster: 1 GbE (1 Gb/s) vs. InfiniBand QDR (40 Gb/s)
Introduction

Evolution of Computing Systems
[Timeline 1946-2010: ENIAC (50 Flops, 180,000 tubes, 30 t, 170 m2) -> transistors replace tubes (1st to 2nd generation; 1959: IBM 7090, 33 KFlops) -> integrated circuits, thousands of transistors in one circuit (1971: Intel 4004, 0.06 MIPS, 1 MFlops) -> micro-processors with millions of transistors in one circuit (1989: Intel 80486, 4 MFlops) -> ARPANET/Internet and Beowulf clusters (5th generation) -> multi-core processors (2005: Pentium D, 2 GFlops) -> 2010: hardware diversity, Cloud]
Sequential/serial computing
[Figure: a single CPU executes the instruction stream of the whole problem one instruction at a time, over time steps t1, t2, ..., tn]
Parallel computing
[Figure: the problem is split into sub-problems 1..4; each sub-problem has its own instruction stream executed on a separate CPU, over time steps t1, t2, ..., tk]
Parallel Computing Advantages
Speedup: gain over single-processor execution
Economy
↪ a parallel system is cheaper than a (single) faster processor
↪ resources may already be available
Scalability
↪ with a good architecture, it is easy to add more processors
↪ larger applications can be run as memory scales
Security: fault tolerance, data redundancy
Motivational example

    x = initX(A, B);
    y = initY(A, B);
    z = initZ(A, B);
                                        Functional Parallelism
    for (i = 0; i < N_ENTRIES; i++)
        x[i] = compX(y[i], z[i]);
                                        Data Parallelism
    for (i = 1; i < N_ENTRIES; i++)
        x[i] = solveX(x[i-1]);
                                        Pipelining
    finalize1(&x, &y, &z);
    finalize2(&x, &y, &z);
    finalize3(&x, &y, &z);

No good?
=> What performance gains can we expect from parallelization?
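A minimal OpenMP sketch (not from the slides) of how the two loops above could be treated; compX and solveX are hypothetical stand-ins for the routines in the listing, and the file compiles on its own with e.g. cc -fopenmp -c sketch.c:

    /* sketch.c -- illustrative only */
    #include <math.h>
    #include <omp.h>

    static double compX(double y, double z) { return y + z; }          /* placeholder body */
    static double solveX(double prev)       { return sqrt(prev + 1.0); } /* placeholder body */

    void compute(double *x, const double *y, const double *z, int n)
    {
        /* Data parallelism: iterations are independent, so they can be
           distributed over the available cores. */
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            x[i] = compX(y[i], z[i]);

        /* Loop-carried dependency: x[i] needs x[i-1], so this loop stays
           sequential here (a candidate for pipelining, not a parallel for). */
        for (int i = 1; i < n; i++)
            x[i] = solveX(x[i - 1]);
    }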
Performance Evaluation of Parallel/Distributed Programs & Systems
Speedup
Generic metric quantifying a performance enhancement
↪ ratio of old time to new time for the same operation

    Speedup = (sequential execution time) / (parallel execution time) = T_seq / T_parallel

Ideal speedup with p processors? linear, i.e. p   (T_parallel = T_seq / p)
Expected speedup? sublinear, i.e. < p   (T_parallel > T_seq / p)
Can we get a superlinear speedup S > p? Yes!
↪ efficiency in memory access (e.g. the per-processor working set fits in cache)
↪ some specific problems (ex: optimization with a "lucky" processor)
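In practice T_seq and T_parallel are simply measured; a small self-contained sketch (assuming OpenMP is available; an arbitrary reduction serves as the workload) that prints the resulting speedup and efficiency:

    #include <stdio.h>
    #include <omp.h>

    #define N 100000000L

    static double work_seq(void) {                 /* same computation, one thread */
        double s = 0.0;
        for (long i = 0; i < N; i++) s += 1.0 / (double)(i + 1);
        return s;
    }

    static double work_par(void) {                 /* same computation, all threads */
        double s = 0.0;
        #pragma omp parallel for reduction(+:s)
        for (long i = 0; i < N; i++) s += 1.0 / (double)(i + 1);
        return s;
    }

    int main(void)
    {
        double t0 = omp_get_wtime(); work_seq(); double t_seq = omp_get_wtime() - t0;
        double t1 = omp_get_wtime(); work_par(); double t_par = omp_get_wtime() - t1;
        int p = omp_get_max_threads();
        printf("p = %d  S_p = %.2f  e_p = %.2f\n", p, t_seq / t_par, (t_seq / t_par) / p);
        return 0;
    }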
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Performances Evaluation of Parallel/Distributed Programs & Systems
Speedup
Generic metric to quantify performance enhancement
,!ratio of old time to new time for the same operation
Speedup =
Sequential execution time
Parallel execution time
=
T
seq
T
parallel
Ideal speedup with p processors?
linear p

T
parallel
=
T
seq
p

Expected speedup?
sublinear < p

T
parallel
>
T
seq
p

Can we get superlinear speedup S > p?
Yes!
,!eciency in memory access
,!some specic problems (ex:optim.with lucky processor)
11/121
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Performances Evaluation of Parallel/Distributed Programs & Systems
Speedup
Generic metric to quantify performance enhancement
,!ratio of old time to new time for the same operation
Speedup =
Sequential execution time
Parallel execution time
=
T
seq
T
parallel
Ideal speedup with p processors?
linear p

T
parallel
=
T
seq
p

Expected speedup?
sublinear < p

T
parallel
>
T
seq
p

Can we get superlinear speedup S > p?
Yes!
,!eciency in memory access
,!some specic problems (ex:optim.with lucky processor)
11/121
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Performances Evaluation of Parallel/Distributed Programs & Systems
Speedup
Generic metric to quantify performance enhancement
,!ratio of old time to new time for the same operation
Speedup =
Sequential execution time
Parallel execution time
=
T
seq
T
parallel
Ideal speedup with p processors?
linear p

T
parallel
=
T
seq
p

Expected speedup?
sublinear < p

T
parallel
>
T
seq
p

Can we get superlinear speedup S > p?
Yes!
,!eciency in memory access
,!some specic problems (ex:optim.with lucky processor)
11/121
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Performances Evaluation of Parallel/Distributed Programs & Systems
Speedup
Generic metric to quantify performance enhancement
,!ratio of old time to new time for the same operation
Speedup =
Sequential execution time
Parallel execution time
=
T
seq
T
parallel
Ideal speedup with p processors?
linear p

T
parallel
=
T
seq
p

Expected speedup?
sublinear < p

T
parallel
>
T
seq
p

Can we get superlinear speedup S > p?
Yes!
,!eciency in memory access
,!some specic problems (ex:optim.with lucky processor)
11/121
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Performances Evaluation of Parallel/Distributed Programs & Systems
Speedup
Generic metric to quantify performance enhancement
,!ratio of old time to new time for the same operation
Speedup =
Sequential execution time
Parallel execution time
=
T
seq
T
parallel
Ideal speedup with p processors?
linear p

T
parallel
=
T
seq
p

Expected speedup?
sublinear < p

T
parallel
>
T
seq
p

Can we get superlinear speedup S > p?
Yes!
,!eciency in memory access
,!some specic problems (ex:optim.with lucky processor)
11/121
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Speedup
Not detailed in the previous speedup definition:
↪ machine model (P-RAM) etc.
The ideal speedup is limited due to overheads in:
↪ data transfers (in general: communication among tasks)
↪ task startup/finalization
↪ load balancing
↪ inherent sequential portions of the computation
Work, speedup and efficiency
P: a problem of size n, solved by a parallel program A_p
T_seq: execution time of the best known sequential algorithm
T_1:   execution time of A_p on 1 processor   (T_1 >= T_seq)
T_p:   execution time of A_p on p processors

Work & Speedup
    W_p = p * T_p          S_p = T_seq / T_p  <=  T_1 / T_p

Efficiency
    e_p = T_seq / (p * T_p)  <=  T_1 / (p * T_p)  =  W_1 / W_p

Relative (or parallel) speedup and efficiency: use T_1 instead of T_seq
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Performances Evaluation of Parallel/Distributed Programs & Systems
First performance model
Two kinds of operations are performed by A_p:
1. Computations to be performed sequentially
   ↪ t_s: processing time of the serial part (using 1 processor)
2. Computations to be performed in parallel
   ↪ t_par(p): processing time of the parallel part using p processors
   ↪ includes the parallel overhead (communications etc.): t_overhead(p)
Note: t_overhead(1) ≈ 0
First performance model
Simple parallel program performance model:

    T_1 = t_s + t_par(1)
    T_p = t_s + t_par(p) ≈ t_s + t_par(1)/p + t_overhead(p) ≥ t_s + t_par(1)/p
First performance model
Non-scaled percentages of A_p
↪ non-scaled percentage of the serial part:   f_ns = t_s / T_1
↪ non-scaled percentage of the parallel part: 1 - f_ns = t_par(1) / T_1
p does not occur in the definition of f_ns.
f_ns is the non-parallelizable portion of A_p when executed on 1 processor.
↪ 0 ≤ f_ns ≤ 1          (reminder: T_1 = t_s + t_par(1))

    t_s = f_ns · T_1      and      t_par(1) = (1 - f_ns) · T_1
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Performances Evaluation of Parallel/Distributed Programs & Systems
First performance model
Non-scaled percentages of A
p
,!non-scaled percentage of the serial part:f
ns
=
t
s
T
1
,!non-scaled percentage of the parallel part:1 f
ns
=
t
par
(1)
T
1
p does not occur in the denition of f
ns
f
ns
non-parallelizable portion of A
p
on execution on 1 proc
,!0  f
ns
 1 reminder:T
1
= t
s
+t
par
(1)
t
s
= f
ns
:T
1
and t
par
(1) = (1 f
ns
)T
1
16/121
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Amdahl's law
Exercise: provide an upper bound for S_p using f_ns.

    S_p ≤ T_1 / T_p ≤ T_1 / (t_s + t_par(1)/p) = T_1 / (f_ns·T_1 + (1 - f_ns)·T_1/p)

Theorem (Amdahl's law)
The maximum speedup S_p achievable by a parallel program
executing on p processors is:

    S_p ≤ 1 / (f_ns + (1 - f_ns)/p)          lim_{p→∞} S_p = 1 / f_ns
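A tiny helper, provided here as a sketch, to tabulate this bound numerically:

    #include <stdio.h>

    /* Amdahl's upper bound on the speedup, for a non-scaled serial
       fraction f_ns and p processors */
    static double amdahl(double f_ns, double p)
    {
        return 1.0 / (f_ns + (1.0 - f_ns) / p);
    }

    int main(void)
    {
        for (int p = 1; p <= 1024; p *= 2)
            printf("f_ns = 0.10, p = %4d  ->  S_p <= %.2f\n", p, amdahl(0.10, p));
        printf("limit for p -> infinity: 1/f_ns = %.1f\n", 1.0 / 0.10);
        return 0;
    }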
Amdahl's law
[Plot: speedup S_p (0-5) vs. number of processors p (0-60) according to Amdahl's law, for serial fractions f = 20%, 50%, 75% and 90%; each curve flattens out towards its asymptote 1/f]
Lessons from Amdahl's law
Law of diminishing returns:
if a significant fraction of the code (in terms of the time spent in it)
is not parallelizable, then parallelization will not pay off much
↪ obvious, but HPC newcomers forget how bad Amdahl's law can be
Luckily, many applications can be almost entirely parallelized
↪ f_ns is small
Amdahl's law illustrations...
Exercise
What maximal speedup can be expected from a parallel version of a program
running on eight processors?
↪ hint: benchmarking reveals that 10% of the execution time is
  spent in functions that must be executed on a single processor.

Answer: S_p ≤ 1 / (0.1 + (1 - 0.1)/8) ≈ 4.7

Exercise
If 25% of the operations in a parallel program must be performed
sequentially, what is the maximum speedup achievable?

Answer: lim_{p→∞} S_p = 1 / 0.25 = 4
...and limitations
Exercise
Let A_p have time complexity Θ(n²) (n: dataset size), with
↪ t_s = 18000 + n seconds (input the dataset and output the result)
↪ the rest can be done in parallel, with t_par(1) = n²/100 seconds
1. What is the maximum speedup for problem size n = 10000?
2. Assume t_overhead(p) = Θ(n log n + n log p). Provide a better
   upper bound for the speedup, compare it to Amdahl's prediction
   and conclude.
3. Using your last estimation, illustrate the Amdahl effect, i.e.
   the fact that for a fixed number of processors, the speedup is usually
   an increasing function of the problem size.
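One possible way to explore these questions numerically; this is only a sketch that plugs the given expressions into the first performance model (assumptions: unit constants in the Θ(.) terms and natural logarithms):

    #include <math.h>
    #include <stdio.h>

    /* Model of the exercise: t_s = 18000 + n, t_par(1) = n*n/100,
       optional t_overhead(p) = n*log(n) + n*log(p)  (all in seconds) */
    static double speedup(double n, double p, int with_overhead)
    {
        double ts   = 18000.0 + n;
        double tpar = n * n / 100.0;
        double ovh  = with_overhead ? n * log(n) + n * log(p) : 0.0;
        double T1   = ts + tpar;
        double Tp   = ts + tpar / p + ovh;
        return T1 / Tp;
    }

    int main(void)
    {
        double n = 10000.0;
        for (int p = 2; p <= 1024; p *= 4)
            printf("p = %4d   no overhead: %8.1f   with overhead: %8.1f\n",
                   p, speedup(n, p, 0), speedup(n, p, 1));
        return 0;
    }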
Parallel Efficiency and Scalability
Parallel efficiency (reminder!):

    e_p = T_1 / (p · T_p) = S_p / p

Parallel efficiency is usually < 1, unless the speedup is [super]linear
Used to measure how well the processors are utilized
Ex: increasing p by a factor 10 only increases S_p by a factor 2
↪ perhaps it's not worth it: the efficiency drops by a factor 5
Important when purchasing a parallel machine
↪ if the efficiency is low, forget buying a large cluster
Parallel Efficiency and Scalability
Definition (Scalable Algorithm)
An algorithm that keeps a large efficiency even for larger numbers of processors.
Scalability measures the "effort" needed to keep the efficiency while adding processors.
The efficiency also depends on the problem size: e_p(n)
Parallel Efficiency and Scalability
Isoefficiency metric
For a growing number of processors p, how fast must the problem
size n grow so that the efficiency remains constant?

    find n_c(p) such that e_p(n_c(p)) = c

By making a problem ridiculously large, one can typically achieve
good efficiency!
Toward a scalable speedup
Reminder: T_p = t_s + t_par(p)   and   t_par(p) = t_par(1)/p + t_overhead(p)

Scaled percentages of A_p
↪ scaled percentage of the serial part:   f_s = t_s / T_p
↪ scaled percentage of the parallel part: 1 - f_s = t_par(p) / T_p
p does occur in the definition of f_s.
f_s is the non-parallelizable portion of A_p when executed on p processors.
↪ 0 ≤ f_s ≤ 1

    t_s = f_s · T_p      and      t_par(p) = (1 - f_s) · T_p
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Performances Evaluation of Parallel/Distributed Programs & Systems
Toward a scalable speedup
Reminder:T
p
= t
s
+t
par
(p) and t
par
(p) =
t
par
(1)
p
+t
overhead
(p)
Scaled percentages of A
p
,!scaled percentage of the serial part:f
s
=
t
s
T
p
,!scaled percentage of the parallel part:1 f
s
=
t
par
(p)
T
p
p does occur in the denition of f
s
f
s
non-parallelizable portion of A
p
on execution on p procs
,!0  f
s
 1
t
s
= f
s
:T
p
and t
par
(p) = (1 f
s
)T
p
25/121
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Performances Evaluation of Parallel/Distributed Programs & Systems
Gustafson's law
Exercise: provide an upper bound for S_p using f_s.
From t_s = f_s·T_p, t_par(p) = (1 - f_s)·T_p and t_par(1) ≤ p·t_par(p) = p·(1 - f_s)·T_p:

    S_p = T_seq / T_p ≤ T_1 / T_p = (t_s + t_par(1)) / T_p ≤ f_s + (1 - f_s)·p = p + (1 - p)·f_s

Theorem (Gustafson's law)
The maximum speedup S_p achievable by a parallel algorithm A_p
on p processors is:

    S_p ≤ p - (p - 1)·f_s
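The bound is just as easy to tabulate as Amdahl's; a small sketch (f_s being the scaled serial fraction measured on the parallel run):

    #include <stdio.h>

    /* Gustafson's upper bound on the (scaled) speedup */
    static double gustafson(double f_s, double p)
    {
        return p - (p - 1.0) * f_s;
    }

    int main(void)
    {
        for (int p = 1; p <= 64; p *= 2)
            printf("f_s = 0.10, p = %2d  ->  S_p <= %.2f\n", p, gustafson(0.10, p));
        return 0;
    }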
Gustafson's law
[Plot: speedup S_p (0-30) vs. number of processors p (0-60) according to Gustafson's law, for scaled serial fractions f = 20%, 50%, 75% and 90%; each curve grows linearly in p with slope 1 - f]
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Performances Evaluation of Parallel/Distributed Programs & Systems
Gustafson vs.Amdhal
Amdhal use the sequential execution as a starting point
Gustafson use the parallel computation as a starting point
,!Gustafson's speedup is often referred to as a scaled speedup
Gustafson's law historically claimed to break Amdhal's law
,!based on confusion in the context of each law
Gustafson vs. Amdahl
Exercise
An application executing on 10 processors requires 100 s to run. Benchmarking
reveals that 40% of the elapsed time is spent on parallel processing
(using the 10 processors) and 60% on sequential processing.
What is the speedup for this application?

Naive strategy: flip a coin
Heads (Amdahl):    f = 0.6, S_p ≤ 1 / (0.6 + 0.4/10) ≈ 1.56
Tails (Gustafson): f = 0.6, S_p ≤ 10 - 9 × 0.6 = 4.6
↪ one of the reasons for the expression "scaled speedup"
↪ Who's right? Both, when using the correct ratio!
Complete answer:
f_s = 0.6, p = 10
=> (Gustafson) S_p = 10 - 9 × 0.6 = 4.6
From T_p = 100 s: t_s = 60 s and T_1 = 60 + 40 × 10 = 460 s
=> f_ns = 60/460 ≈ 0.13
=> (Amdahl) S_p = 1 / (0.13 + 0.87/10) ≈ 4.6
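A short numerical cross-check of this answer (a sketch using the same numbers as above):

    #include <stdio.h>

    int main(void)
    {
        double Tp = 100.0, p = 10.0;
        double ts = 0.6 * Tp;            /* 60 s sequential on the parallel run  */
        double T1 = ts + 0.4 * Tp * p;   /* 60 + 40*10 = 460 s on one processor  */
        double f_ns = ts / T1;           /* ~0.13                                */
        double gustafson = p - (p - 1.0) * 0.6;
        double amdahl = 1.0 / (f_ns + (1.0 - f_ns) / p);
        printf("Gustafson: %.2f   Amdahl: %.2f\n", gustafson, amdahl); /* both ~4.6 */
        return 0;
    }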
Gustafson vs. Amdahl
Theorem (Gustafson's and Amdahl's equivalence)
Gustafson's and Amdahl's laws predict the same speedup, with

    f_ns = 1 / (1 + ((1 - f_s)/f_s)·p)          f_s = 1 / (1 + (1 - f_ns)/(f_ns·p))

Proof: t_s = f_ns·T_1 = f_s·T_p  =>  f_s / f_ns = T_1 / T_p = S_p
From Gustafson's law:
    S_p = f_s / f_ns = f_s + (1 - f_s)·p = f_s·(1 + ((1 - f_s)/f_s)·p)
    =>  1/f_ns - 1 = (1/f_s - 1)·p  =>  1 + (1 - f_ns)/(f_ns·p) = 1/f_s
...or with Amdahl's law:
    S_p = f_s / f_ns = 1 / (f_ns + (1 - f_ns)/p) = (1/f_ns) / (1 + (1 - f_ns)/(f_ns·p))
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Performances Evaluation of Parallel/Distributed Programs & Systems
Gustafson vs.Amdhal
Theorem (Gustafson's and Amdhal's equivalence)
Gustafson's and Amdhal's laws predicts the same speedup and
f
ns
=
1
1 +

1f
s
f
s

p
f
s
=
1
1 +

1f
ns
f
ns
:p

Proof:t
s
= f
ns
T
1
= f
s
T
p
=)
f
s
f
ns
=
T
1
T
p
= S
p
From Gustafson's law:S
p
=
f
s
f
ns
= f
s
+(1 f
s
)p = f
s

1 +

1 f
s
f
s

p

1
f
ns
1 =

1
f
s
1

p =)1 +
1 f
ns
f
ns
:p
=
1
f
s
...Or with Amdhal's law:S
p
=
f
s
f
ns
=
1
f
ns
+
1f
ns
p
=
1
f
ns

1 +
1f
ns
f
ns
p

30/121
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Performances Evaluation of Parallel/Distributed Programs & Systems
Gustafson vs.Amdhal
Theorem (Gustafson's and Amdhal's equivalence)
Gustafson's and Amdhal's laws predicts the same speedup and
f
ns
=
1
1 +

1f
s
f
s

p
f
s
=
1
1 +

1f
ns
f
ns
:p

Proof:t
s
= f
ns
T
1
= f
s
T
p
=)
f
s
f
ns
=
T
1
T
p
= S
p
From Gustafson's law:S
p
=
f
s
f
ns
= f
s
+(1 f
s
)p = f
s

1 +

1 f
s
f
s

p

1
f
ns
1 =

1
f
s
1

p =)1 +
1 f
ns
f
ns
:p
=
1
f
s
...Or with Amdhal's law:S
p
=
f
s
f
ns
=
1
f
ns
+
1f
ns
p
=
1
f
ns

1 +
1f
ns
f
ns
p

30/121
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Performances Evaluation of Parallel/Distributed Programs & Systems
Gustafson vs.Amdhal
Theorem (Gustafson's and Amdhal's equivalence)
Gustafson's and Amdhal's laws predicts the same speedup and
f
ns
=
1
1 +

1f
s
f
s

p
f
s
=
1
1 +

1f
ns
f
ns
:p

Proof:t
s
= f
ns
T
1
= f
s
T
p
=)
f
s
f
ns
=
T
1
T
p
= S
p
From Gustafson's law:S
p
=
f
s
f
ns
= f
s
+(1 f
s
)p = f
s

1 +

1 f
s
f
s

p

1
f
ns
1 =

1
f
s
1

p =)1 +
1 f
ns
f
ns
:p
=
1
f
s
...Or with Amdhal's law:S
p
=
f
s
f
ns
=
1
f
ns
+
1f
ns
p
=
1
f
ns

1 +
1f
ns
f
ns
p

30/121
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Performances Evaluation of Parallel/Distributed Programs & Systems
Summary
Parallel program performance metrics:
Parallel speedup S_p
Parallel efficiency e_p
Isoefficiency metric
Redundancy R(n) (additional workload in the parallel program)
System utilization U(n) (percentage of processors kept busy)
Quality of parallelism Q(n) (summary of the overall performance)
...
HPC Platforms Performance Evaluation
H. Meuer, H. Simon, E. Strohmaier and J. Dongarra:
the Top500 project — http://top500.org
Ranks the world's 500 most powerful computers (since 1993)
↪ updated twice a year (June/November)
↪ helps to follow the progression of HPC systems
Based on the High-Performance LINPACK (HPL) benchmark
↪ solves a dense system of linear equations Ax = b of order n
HPC Platforms Performance in Practice
Top500 #1 (06/2012): Sequoia — IBM BlueGene/Q
1,572,864 cores (based on Power BQC 16C @ 1.60 GHz)
16.32 PFlops, 7.89 MW
↪ 1.6 PB memory, 96 racks covering an area of ~280 m²
↪ 55 PB Lustre storage
Note: benchmarks other than HPL can be considered
↪ HPC Challenge (HPCC) — http://icl.cs.utk.edu/hpcc/
↪ mprime
↪ I/O: dd and IOZone
Top500 Performance Development
[Chart: Top500 performance development over time]
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Performances Evaluation of Parallel/Distributed Programs & Systems
Who's using Top500 systems?
[Chart: distribution of Top500 systems by user]
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Performances Evaluation of Parallel/Distributed Programs & Systems
What are the Top500 systems used for?
[Chart: distribution of Top500 systems by application area]
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Performances Evaluation of Parallel/Distributed Programs & Systems
Top500 milestones (Gordon Bell Prize)
1 GFlop/s   1988 - Cray Y-MP, 8 processors
1 TFlop/s   1998 - Cray T3E, 1024 processors
1 PFlop/s   2008 - Cray XT5, 1.5 x 10^5 processors
1 EFlop/s   ~2018 - ????, 10^7 processors (10^9 threads)
Estimation of the human brain's computational power:
↪ 10^14 neural connections @ 200 calculations/s ≈ 20 PFlop/s
=> What hardware architecture made this possible?
Computing systems architectures
SIMD machines
Vector machines
Element-wise operations on entire vectors in a single instruction
↪ Ex: MMX extensions on x86 architectures, AltiVec on PowerPC
[Figure: a vector register feeding 3 vector pipes; (#elements / #pipes) additions performed in parallel]
Mainly Graphics Processing Units (GPU)
↪ General-Purpose computing on Graphics Processing Units (GPGPU). Ex: NVIDIA Tesla cards
MIMD with Shared Memory
Symmetric Multi-Processors (SMP)
All processors access the same memory & I/O
[Figure: several processors attached to one shared memory M]
Also applies to multicore machines
↪ Intel Core i7 / AMD K10 / IBM Cell...
Massively Parallel Processing (MPP)
Virtual shared memory with a global address space
↪ over physically distributed memory
Ex: Jaguar Cray XT5 supercomputer (150,125 cores)
MIMD with Distributed Memory
[Figure: processors, each with its own local memory M, connected by a network interconnect]
Low-cost alternative to MPPs
[Large-scale] distributed systems
↪ Ex: clusters, grids etc.
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Computing systems architectures
MIMD with Distributed Memory
[Beowulf] Cluster — Ex: UL clusters
[Figure: a user in front of a set of interconnected computing nodes]
Loosely coupled computing nodes
Often homogeneous (same processor, memory, ...)
Fast local interconnect
↪ [10] GbE, InfiniBand, Myrinet...
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Computing systems architectures
MIMD with Distributed Memory
Computing Grids [Foster & al. 97] — Ex: Globus (US), EGI (EU), Grid5000 (EU)
[Figure: a user accessing Cluster 1 and Cluster 2 aggregated over the Internet]
Cluster aggregation via a high-latency network (ex: the Internet)
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Computing systems architectures
MIMD with Distributed Memory
Desktop Grids aka Volunteer Computing Platforms — Ex: BOINC
[Figure: a user harvesting desktop machines spread over the Internet]
Steals computing cycles from idle computers (which they are ~75% of the time)
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Computing systems architectures
The Cloud Computing Paradigm
Virtual computing/storage service provided "on the cloud"
↪ pay-as-you-go; scale up and down; eliminate HW considerations
↪ commercial clouds: Amazon EC2, Google Compute Engine, MS Azure
  (EC2: Linux server US$0.08/hour; storage US$0.10/GB/month)
Virtualization is at the core of clouds
↪ provides the basis for an on-demand, shared, configurable infrastructure
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Computing systems architectures
The Cloud: Abstraction Layers
IaaS, PaaS, SaaS
↪ {Infrastructure, Platform, Software} as a Service
Software as a Service (SaaS): on-demand access to any application
Infrastructure as a Service (IaaS): delivery of raw computer infrastructure
[Figure: the layer stack Network / Storage / Server / Virtualization / O/S / Middleware / Runtime / Application, showing which layers are cloud-provider managed vs. user managed in a traditional system, IaaS, PaaS and SaaS]
The Cloud virtualization layer is currently NOT adapted to HPC
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Computing systems architectures
Current HPC Architectures (Top500)
Mainly cluster-based (82%) (Top500, Nov 2011)
Reasons:
↪ scalable
↪ cost-effective
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Computing systems architectures
HPC Components: [GP]CPU
CPU
Always multi-core
Ex: Intel Core i7-970 (July 2010), R_peak ≈ 100 GFlops (DP)
↪ 6 cores @ 3.2 GHz (32 nm, 130 W, 1170 million transistors)
GPU / GPGPU
Always multi-core, optimized for vector processing
Ex: NVIDIA Tesla C2050 (July 2010), R_peak ≈ 515 GFlops (DP)
↪ 448 cores @ 1.15 GHz
≈ 10 GFlops for 50 €
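As a rule of thumb (not stated on the slide), the theoretical peak follows from R_peak = #cores x clock frequency x floating-point operations per cycle per core; e.g. for the Tesla C2050, assuming 1 DP flop per cycle per core:

    R_peak = 448 x 1.15 GHz x 1 ≈ 515 GFlops (DP)

For CPUs the per-cycle factor depends on the micro-architecture (SIMD width, FMA support).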
HPC Components: Local Memory
[Memory hierarchy — larger, slower and cheaper as you move away from the CPU:]
Level 1: registers           ~500 bytes       sub-ns
Level 2: L1 cache (SRAM)                      1-2 cycles
         L2 cache (SRAM)     64 KB to 8 MB    10 cycles
         L3 cache (DRAM)                      20 cycles
Level 3: memory (DRAM)       ~1 GB            hundreds of cycles          (memory bus)
Level 4: disk                ~1 TB            tens of thousands of cycles (I/O bus)
SSD: R/W 560 MB/s; 85,000 IOPS; ~1500 €/TB
HDD (SATA @ 7.2 krpm): R/W 100 MB/s; 190 IOPS; ~150 €/TB
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Computing systems architectures
HPC Components: Interconnect
Latency: time to send a minimal (0-byte) message from A to B
Bandwidth: maximum amount of data communicated per unit of time

Technology             Effective Bandwidth       Latency
Gigabit Ethernet       1 Gb/s   = 125 MB/s       40 to 300 µs
Myrinet (Myri-10G)     9.6 Gb/s = 1.2 GB/s       2.3 µs
10 Gigabit Ethernet    10 Gb/s  = 1.25 GB/s      4 to 5 µs
InfiniBand QDR         40 Gb/s  = 5 GB/s         1.29 to 2.6 µs
SGI NUMAlink           60 Gb/s  = 7.5 GB/s       1 µs
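A common first-order model (a sketch, not from the slides) combines both metrics: sending m bytes takes roughly latency + m / bandwidth. The GbE latency below is an assumed value within the range of the table:

    #include <stdio.h>

    /* first-order "latency + size/bandwidth" transfer-time model */
    static double transfer_time(double latency_s, double bandwidth_Bps, double bytes)
    {
        return latency_s + bytes / bandwidth_Bps;
    }

    int main(void)
    {
        double msg = 1e6;   /* 1 MB message */
        printf("GbE            : %.3f ms\n", 1e3 * transfer_time(50e-6,  125e6, msg));
        printf("InfiniBand QDR : %.3f ms\n", 1e3 * transfer_time(1.3e-6, 5e9,   msg));
        return 0;
    }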
HPC Components: Operating System
Mainly Linux-based OS (91.4%) (Top500, Nov 2011)
...or Unix-based (6%)
Reasons:
↪ stability
↪ open to developers
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Computing systems architectures
HPC Components: Software Stack
Remote connection to the platform: SSH
User SSO: NIS or OpenLDAP-based
Resource management: job/batch scheduler
↪ OAR, PBS, Torque, MOAB Cluster Suite
(Automatic) node deployment:
↪ FAI (Fully Automatic Installation), Kickstart, Puppet, Chef, Kadeploy etc.
Platform monitoring: Nagios, Ganglia, Cacti etc.
(Eventually) accounting:
↪ oarnodeaccounting, Gold allocation manager etc.
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Computing systems architectures
HPC Components: Data Management
Storage architectural classes & I/O layers
[Figure: the I/O stacks of the three storage architectural classes —
 DAS (Direct Attached Storage): SATA / SAS / Fibre Channel disks behind a local DAS interface;
 NAS (Network Attached Storage): a file system exported over the network (NFS, CIFS, AFP, ...);
 SAN (Storage Area Network): block devices exported over a dedicated network (iSCSI, Fibre Channel, ...);
 the application and the [distributed] file system sit on top]
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Computing systems architectures
HPC Components: Data Management
RAID standard levels (and combined levels)
[Figures: standard and nested RAID layouts]
Software vs. hardware RAID management
RAID controller card performance differs!
↪ basic (low cost): ~300 MB/s; advanced (expensive): ~1.5 GB/s
HPC Components: Data Management
File systems: a logical way to store, organize, manipulate and access data
Disk file systems: FAT32, NTFS, HFS, ext3, ext4, xfs...
Network file systems: NFS, SMB
Distributed parallel file systems: the HPC target
↪ data are striped over multiple servers for high performance
↪ generally add robust failover and recovery mechanisms
↪ Ex: Lustre, GPFS, FhGFS, GlusterFS...
HPC storage makes use of high-density disk enclosures
↪ including [redundant] RAID controllers
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Computing systems architectures
HPC Components: Data Center
Definition (Data Center)
A facility to house computer systems and associated components
↪ basic storage component: the rack (height: 42 RU)
Challenges: power (UPS, batteries), cooling, fire protection, security
Power / heat dissipation per rack:
↪ 'HPC' (computing) racks: 30 kW
↪ 'Storage' racks: 15 kW
↪ 'Interconnect' racks: 5 kW
Power Usage Effectiveness:

    PUE = (total facility power) / (IT equipment power)
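An illustrative, made-up example: if the whole facility draws 600 kW while the IT equipment itself consumes 400 kW, then PUE = 600 kW / 400 kW = 1.5.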
HPC Components: Summary
HPC platforms involve:
a carefully designed data center / server room
computing elements: CPU / GPGPU
interconnect elements
storage elements: HDD/SSD, disk enclosures
↪ disks are virtually aggregated by RAID / LUNs / file systems
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Computing systems architectures
Summary
1
Introduction
2
Performances Evaluation of Parallel/Distributed Programs & Systems
Parallel Program Performance evaluation
HPC Platforms Performance evaluation
3
Computing systems architectures
Architecture of the Current HPC Facilities
Overview of the Main HPC Components
Why all computers are now parallel?
A concrete example:UL HPC platforms
4
Parallel Programming Challenges
Review on computer architecture
Understanding the dierent Level of parallelism
Fault Tolerance in Distributed Computing Platforms
Dynamic Capabilities of the Applications
5
Conclusion
58/121
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Computing systems architectures
Current architecture revolution...
Chip density still doubles every 18-24 months
Clock speed is now fixed, or even slower
No more hidden parallelism (ILP) to exploit
The same [sequential] application won't be faster on a new machine
Challenge: the problem shifts from hardware to software
[Source: Olukotun, Hammond, Sutter and Smith 2007; Intel, Microsoft and Stanford]
...so called the Multicore revolution
"We are dedicating all of our future product development to multicore
designs. This is a sea change in computing."
— Paul Otellini, President, Intel (2005)
Industry shift to multicore
↪ Intel Nehalem (4 cores), Polaris (experimental, 80 cores)
↪ AMD Istanbul (6 cores), IBM Cell Broadband Engine (9 cores)
↪ Fujitsu Venus (8 cores), Sun Niagara 2 (8 cores) etc.
99% of Top500 systems are based on multicore processors
Moore's law reinterpreted:
the number of cores doubles every 18-24 months
=> Why such a revolution?
Limit #1: The power density issue
"Can soon put more transistors on a chip than can afford to turn on."
— Patterson, 2007
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Computing systems architectures
=> Multicore saves power!

    Power ∝ C · V² · f          Performance ∝ (#cores) · f
    (V: voltage, f: frequency, C: capacitance)

Using a single core:
↪ over-clocking (+20%):  performance x 1.2, power x 1.73
↪ under-clocking (-20%): performance x 0.8, power x 0.51

Using additional cores:
↪ higher density (more transistors = more capacitance)
↪ dual core at nominal frequency:            performance x 2,    power x 2
↪ dual core under-clocked (0.63 f, 0.4 V²):  performance x 1.26, power x 0.51
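A quick check of the figures above, under the assumption that the voltage scales roughly linearly with the frequency (so power behaves like f³ for a fixed design):

    P ∝ C · V² · f  with  V ∝ f   =>   P ∝ f³
    over-clock by 20%:   1.2³ ≈ 1.73        (performance x 1.2)
    under-clock by 20%:  0.8³ ≈ 0.51        (performance x 0.8)
    2 cores at 0.63 f:   2 x 0.63³ ≈ 0.50   (performance 2 x 0.63 = 1.26)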
The power evolution: the good...
1996: 1 TeraFlops — ASCI Red, ~10,000 processors @ 200 MHz, ~1 MW (≈ 10^6 Flops/Watt)
2010: 4.64 TeraFlops — ATI Radeon HD 5970, 3200 stream processors @ 725 MHz, 294 W (≈ 15.8 GFlops/Watt)
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Computing systems architectures
The power evolution: ...and the bad
Data center: power, cooling, buildings etc.
Out of 100 W: 99 W lost!
↪ 50 W power + cooling
↪ 32 W storage + network
↪ 13 W power supply, I/O, memory
↪ 4 W idle
1% of the world's electricity goes to cooling IT; 2% of world CO2 emissions
Efficiency: 1 to 5%
The power evolution: ...and the bad
Data center efficiency: 1-5%
For comparison, a steam engine: 10-15% efficiency
Sebastien Varrette,PhD (CSC research unit)
High Performance Computing Platforms
N
Computing systems architectures
Limit #2: Hidden Parallelism Tapped Out
SPEC (Standard Performance Evaluation Corporation) benchmarks — http://www.spec.org/
Application performance was increasing by 52% per year
↪ between 1986 and 2002
↪ half due to transistor density
↪ half due to architecture changes, i.e. (cf. "Levels of parallelism"):
    pipelining
    ILP (Instruction-Level Parallelism): superscalar, VLIW
    super-/hyper-threading
    etc.
These sources have been used up:
↪ since 2002, performance has increased by only 20% per year
Limit #3: The speed of light issue
Consider a 1 TFlop/s sequential machine:
data must travel some distance r to go from memory to the CPU.
1 TFlop/s => 10^12 elements per second; maximum speed: c = 3 x 10^8 m/s
Thus 10^12 · r < c  =>  r < 0.3 mm
Now put 1 TByte of data in 0.3 mm²:
each bit occupies about 1 Å², i.e. the size of an atom.
[Figure: a 1 TFlop/s, 1 TByte sequential machine would have to fetch 10^12 instructions and 10^12 data items per second from within radius r of the CPU]
=> No choice but parallelism!
UL HPC platforms at a glance
2 geographic sites
,!Kirchberg campus (AS.28, CS.43)
,!LCSB building (Belval)
3 clusters: chaos, gaia, granduc
,!151 nodes, 1576 cores, 14.26 TFlops
,!511 TB shared storage (raw capacity)
3 system administrators
3,197,110 € cumulative HW investment since 2007
,!hardware acquisition only: 1,228,960 € (excluding server rooms)
Open-source software stack
,!SSH, LDAP, OAR, Puppet, etc.
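As a rough back-of-the-envelope derived from the aggregate figures above (assumption: peak performance is spread evenly across cores, which is only approximately true on a heterogeneous cluster):

    # Rough per-core peak performance from the aggregate cluster figures.
    rpeak_tflops = 14.26   # aggregate Rpeak of the UL clusters
    cores = 1576           # total core count
    print(f"~{rpeak_tflops * 1e3 / cores:.1f} GFlops per core")   # ~9.0 GFlops/core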
UL HPC server rooms
2009: CS.43 (Kirchberg campus), 14 racks, 100 m², ~800,000 €
2011: LCSB 6th floor (Belval), 14 racks, 112 m², ~1,100,000 €
HPC & UL Attractivity
Institute                  Location               R_peak         Storage   Manpower
UL                         Luxembourg             14.26 TFlops   511 TB    3 FTEs
CRP GL                     Luxembourg             3.41 TFlops    96 TB     1.5 FTE
LORIA (Graphene+Griffon)   Nancy (France)         8.89 TFlops    5 TB      3 FTEs
URZ (bwGrid)               Heidelberg (Germany)   12.6 TFlops    32 TB     n/a
UCL (Lemaitre)             Louvain (Belgium)      13 TFlops      198 TB    4 FTEs
BSC (MareNostrum)          Barcelona (Spain)      94.21 TFlops   370 TB    14 FTEs
Platform Management
Node deployment by FAI (Fully Automatic Installation) http://fai-project.org/
Boot via network card (PXE)
,!ensures a running diskless Linux OS on the node
Get configuration data (NFS)
Run the installation
,!partition local hard disks and create filesystems
,!install software using the apt-get command
,!configure the OS and additional software
,!save log files to the install server, then reboot the new system
Average reinstallation time: ~500 s
[Embedded figure: the "Fully Automatic Installation" poster by Thomas Lange (Institute of Computer Science, University of Cologne), presenting what FAI is, why to use it, its three steps (boot host via PXE, get configuration data over HTTP/FTP/NFS, run the installation), example installation times, its feature list and a sample of FAI user sites -- see http://fai-project.org/]
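For a sense of scale, a minimal sketch combining the ~500 s average reinstallation time with the 151-node count quoted earlier (the number of concurrent installation waves is a hypothetical figure for illustration, not a property of the UL setup):

    # Back-of-the-envelope cluster reimaging time with FAI.
    nodes = 151          # UL cluster node count (quoted earlier)
    t_node = 500         # average FAI reinstallation time per node, seconds
    waves = 4            # hypothetical: nodes reimaged in 4 concurrent batches

    sequential_h = nodes * t_node / 3600
    parallel_min = waves * t_node / 60
    print(f"one node at a time: ~{sequential_h:.0f} h")             # ~21 h
    print(f"in {waves} concurrent waves: ~{parallel_min:.0f} min")  # ~33 min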
Platform Management
Server/Service configuration by Puppet http://puppetlabs.com
Automates sysadmin tasks
3 components:
,!a declarative language
,!a client/server model
,!a configuration realization library
Git-based versioning
,!git-flow workflow
,!Rake + Capistrano deployment
Security (PKI based)
[Figure: CSC Puppet infrastructure -- a root Puppet master (with its Certificate Authority and Git-backed devel/testing/production environments holding the Puppet modules, manifests and files) serves the CSC/LCSB services, while a dedicated Puppet master serves the CSC HPC chaos/gaia clusters; puppet agents (dhcp, dns, oar, gforge, debmirror, lims, shiva, ...) talk to their master over XMLRPC/REST over SSL]
Platform Monitoring
Monika http://hpc.uni.lu/{chaos,gaia,granduc}/monika
Drawgantt http://hpc.uni.lu/{chaos,gaia,granduc}/drawgantt