Parallelism - Faculty of Science, Hong Kong Baptist University


Cluster Workshop


For COMP RPG students


17 May, 2010


High Performance Cluster Computing Centre (HPCCC)

Faculty of Science

Hong Kong Baptist University


Outline

Overview of Cluster Hardware and Software

Basic Login and Running Program in a Job Queuing System

Introduction to Parallelism

Why Parallelism

Cluster Parallelism

OpenMP

Message Passing Interface

Parallel Program Examples

Policy for using sciblade.sci.hkbu.edu.hk
http://www.sci.hkbu.edu.hk/hpccc/sciblade

Overview of Cluster Hardware and Software

Cluster Hardware

This 256-node PC cluster (sciblade) consists of:

Master node x 2

IO nodes x 3 (storage)

Compute nodes x 256

Blade Chassis x 16

Management network

Interconnect fabric

1U console & KVM switch

Emerson Liebert Nxa 120kVA UPS

Sciblade Cluster

A 256-node cluster supported by funding from the RGC

Hardware Configuration


Master Node


Dell PE1950, 2x Xeon E5450 3.0GHz (Quad Core)


16GB RAM, 73GB x 2 SAS drive


IO nodes (Storage)


Dell PE2950, 2x Xeon E5450 3.0GHz (Quad Core)


16GB RAM, 73GB x 2 SAS drive


3TB storage Dell PE MD3000


Compute nodes x 256 each with


Dell PE M600 blade server w/ Infiniband network


2x Xeon E5430 2.66GHz (Quad Core)


16GB RAM, 73GB SAS drive


Hardware Configuration


Blade Chassis x 16


Dell PE M1000e


Each hosts 16 blade servers


Management Network


Dell PowerConnect 6248 (Gigabit Ethernet) x 6


Interconnect fabric

Qlogic SilverStorm 9120 switch

Console and KVM switch

Dell AS-180 KVM

Dell 17FP Rack console

Emerson Liebert Nxa 120kVA UPS

Software List


Operating System


ROCKS 5.1 Cluster OS


CentOS 5.3 kernel 2.6.18


Job Management System


Portable Batch System


MAUI scheduler


Compilers, Languages


Intel Fortran/C/C++ Compiler for Linux V11


GNU 4.1.2/4.4.0 Fortran/C/C++ Compiler


Software List


Message Passing Interface (MPI)
Libraries


MVAPICH 1.1


MVAPICH2 1.2


OPEN MPI 1.3.2


Mathematical libraries

ATLAS 3.8.3

FFTW 2.1.5/3.2.1

SPRNG 2.0a (C/Fortran) / 4.0 (C++/Fortran)

Software List


Molecular Dynamics & Quantum Chemistry


Gromacs 4.0.7


Gamess 2009R1

Gaussian 03

Namd 2.7b1

Third-party Applications

FDTD simulation

MATLAB 2008b

TAU 2.18.2, VisIt 1.11.2

Xmgrace 5.1.22

etc.

Software List


Queuing system


Torque/PBS


Maui scheduler


Editors


vi


emacs


Hostnames


Master node


External : sciblade.sci.hkbu.edu.hk


Internal : frontend-0

IO nodes (storage)

pvfs2-io-0-0, pvfs2-io-0-1, pvfs2-io-0-2

Compute nodes

compute-0-0.local, …, compute-0-255.local

Basic Login and Running Program in a Job Queuing System

Basic login


Remote login to the master node


Terminal login


using secure shell


ssh -l username sciblade.sci.hkbu.edu.hk

Graphical login

PuTTY & vncviewer, e.g.

[username@sciblade]$ vncserver

New 'sciblade.sci.hkbu.edu.hk:3 (username)' desktop is sciblade.sci.hkbu.edu.hk:3

It means that your session will run on display 3.

Graphical login


Using PuTTY to set up a secure connection:
Host Name = sciblade.sci.hkbu.edu.hk

Graphical login (cont'd)

ssh protocol version

Graphical login (cont'd)

Port = 5900 + display number (i.e. 3 in this case)

Graphical login (cont'd)

Next, click Open, and log in to sciblade

Finally, run VNC Viewer on your PC, and enter "localhost:3" (3 is the display number)

You should terminate your VNC session after you have finished your work. To terminate the VNC session running on sciblade, run the command

[username@sciblade]$ vncserver -kill :3

Linux commands


Both master and compute nodes are installed with
Linux


Frequently used Linux commands in the PC cluster:
http://www.sci.hkbu.edu.hk/hpccc/sciblade/faq_sciblade.php

cp       cp f1 f2 dir1            copy files f1 and f2 into directory dir1
mv       mv f1 dir1               move/rename file f1 into dir1
tar      tar xzvf abc.tar.gz      uncompress and untar a tar.gz format file
tar      tar czvf abc.tar.gz abc  create an archive file with gzip compression
cat      cat f1 f2                type the contents of files f1 and f2
diff     diff f1 f2               compare the text of two files
grep     grep student *           search all files for the word "student"
history  history 50               list the last 50 commands stored in the shell
kill     kill -9 2036             terminate the process with pid 2036
man      man tar                  display the on-line manual page for tar
nohup    nohup runmatlab a        run matlab (a.m) without hangup after logout
ps       ps -ef                   list all processes running in the system
sort     sort -r -n studno        sort studno in reverse numerical order

ROCKS specific commands


ROCKS provides the following commands for users to run programs on all compute nodes, e.g.

cluster-fork
Run a program on all compute nodes

cluster-fork ps
Check user processes on each compute node

cluster-kill
Kill user processes on all nodes at one time

tentakel
Similar to cluster-fork but runs faster

Ganglia

Web based management and monitoring


http://sciblade.sci.hkbu.edu.hk/ganglia

Why Parallelism

Why Parallelism


Passive reason

Suppose you are using the most efficient algorithm with an optimal implementation, but the program still takes too long or does not even fit into your machine's memory.

Parallelization is then the last resort.

Why Parallelism


Active reason

Faster

Finish the work earlier = same work in a shorter time

Do more work = more work in the same time

Most importantly, you may want to predict a result before the event occurs (e.g. a weather forecast must finish before the weather arrives).

Examples


Many scientific and engineering problems require enormous computational power. The following are a few such fields:


Quantum chemistry, statistical mechanics, and relativistic
physics


Cosmology and astrophysics


Computational fluid dynamics and turbulence


Material design and superconductivity


Biology, pharmacology, genome sequencing, genetic
engineering, protein folding, enzyme activity, and cell modeling


Medicine, and modeling of human organs and bones


Global weather and environmental modeling


Machine Vision


Parallelism


The upper bound for the computing power that can be obtained from a single processor is limited by the fastest processor available at any given time.


The upper bound for the computing power
available can be dramatically increased by
integrating a set of processors together.


Synchronization and exchange of partial
results among processors are therefore
unavoidable.

Parallel Computer Architecture: Multiprocessing vs. Clustering

[Diagram: a shared-memory machine (symmetric multiprocessors, SMP), in which all CPUs access one shared memory, versus a distributed-memory cluster, in which each CPU has its own local memory (LM) and exchanges data with the other nodes over an interconnecting network.]

Clustering: Pros and Cons


Advantages


Memory is scalable with the number of processors:
increasing the number of processors increases the total memory size and bandwidth as well.

Each processor can rapidly access its own memory without interference.

Disadvantages

Difficult to map existing data structures to this memory organization.

The user is responsible for sending and receiving data among processors.

TOP500 Supercomputer Sites (www.top500.org)

Cluster Parallelism

Parallel Programming Paradigms

Multithreading

OpenMP (shared memory only)

Message Passing (shared memory or distributed memory)

MPI (Message Passing Interface)

PVM (Parallel Virtual Machine)

Distributed Memory


Programmer's view:

Several CPUs

Several blocks of memory

Several threads of action

Parallelization

Done by hand

Example

MPI

[Diagram: serial execution vs. three processes (Process 0, 1, 2) running in parallel over time, exchanging data via the interconnection by message passing.]

Message Passing Model

Message Passing

The method by which
data

from one processor's memory
is copied to the memory of
another processor.

Process

A process is a set of executable
instructions (program) which runs on a
processor.

Message passing systems generally associate only one process per processor, and the terms "processes" and "processors" are used interchangeably. A minimal send/receive example is sketched below.

[Diagram: serial execution vs. three processes exchanging data over time by message passing.]
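As a concrete illustration of the copy described above (not part of the original slides), the following sketch uses the standard MPI C calls MPI_Send and MPI_Recv to move one integer from the memory of process 0 into the memory of process 1:

#include <mpi.h>
#include <stdio.h>

int main (int argc, char *argv[])
{
  int rank, data = 0;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) {
    data = 42;                         /* data lives in process 0's memory   */
    MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
  } else if (rank == 1) {
    MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    printf("Process 1 received %d from process 0\n", data);
  }

  MPI_Finalize();
  return 0;
}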

OpenMP


OpenMP Mission


The OpenMP Application Program Interface (API) supports multi-platform shared-memory parallel programming in C/C++ and Fortran on all architectures, including Unix platforms and Windows NT platforms.

Jointly defined by a group of major computer hardware and software vendors.

OpenMP is a portable, scalable model that gives shared-memory parallel programmers a simple and flexible interface for developing parallel applications for platforms ranging from the desktop to the supercomputer.

OpenMP compiler choice


gcc 4.4.0 or above

compile with -fopenmp

Intel 10.1 or above

compile with /Qopenmp on Windows

compile with -openmp on Linux

PGI compiler

compile with -mp

Absoft Pro Fortran

compile with -openmp

Sample OpenMP example

#include <omp.h>
#include <stdio.h>

int main () {
  #pragma omp parallel
  printf("Hello from thread %d, nthreads %d\n",
         omp_get_thread_num(), omp_get_num_threads());
}

serial-pi.c

#include <stdio.h>

static long num_steps = 10000000;
double step;

int main ()
{
  int i; double x, pi, sum = 0.0;

  step = 1.0/(double) num_steps;

  for (i=0; i< num_steps; i++){
    x = (i+0.5)*step;
    sum = sum + 4.0/(1.0+x*x);
  }
  pi = step * sum;
  printf("Est Pi= %f\n", pi);
}
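For reference (not stated on the original slide), the loop is a midpoint-rule approximation of the integral of 4/(1+x^2) from 0 to 1, which equals pi; step is the width of each of the num_steps sub-intervals.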

OpenMP version: spmd-pi.c

#include <omp.h>
#include <stdio.h>

static long num_steps = 10000000;
double step;
#define NUM_THREADS 8

int main ()
{
  int i, nthreads; double pi, sum[NUM_THREADS];

  step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);

  #pragma omp parallel
  {
    int i, id, nthrds;
    double x;
    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    if (id == 0) nthreads = nthrds;
    for (i=id, sum[id]=0.0; i< num_steps; i=i+nthrds) {
      x = (i+0.5)*step;
      sum[id] += 4.0/(1.0+x*x);
    }
  }
  for (i=0, pi=0.0; i<nthreads; i++)
    pi += sum[i] * step;
  printf("Est Pi= %f using %d threads\n", pi, nthreads);
}
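For comparison (not part of the original slides), the same computation can be written with an OpenMP work-sharing loop and a reduction clause instead of the hand-rolled SPMD decomposition; a minimal sketch, assuming OpenMP 2.0 or later:

#include <omp.h>
#include <stdio.h>

static long num_steps = 10000000;

int main ()
{
  long i;
  double x, sum = 0.0;
  double step = 1.0/(double) num_steps;

  /* the loop iterations are divided among the threads; each thread keeps
     private copies of x and sum, and the partial sums are added together
     automatically at the end of the loop */
  #pragma omp parallel for private(x) reduction(+:sum)
  for (i = 0; i < num_steps; i++) {
    x = (i+0.5)*step;
    sum += 4.0/(1.0+x*x);
  }
  printf("Est Pi= %f\n", step * sum);
}

The reduction clause removes the need for the per-thread sum[NUM_THREADS] array used in spmd-pi.c.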

Message Passing Interface (MPI)


MPI


MPI is a library, not a language, for parallel programming

An MPI implementation consists of

a subroutine library with all MPI functions

include files for the calling application program

some startup script (usually called mpirun, but not standardized)

Include the header file mpi.h (or whatever it is called) in the source code

Libraries are available for all major imperative languages (C, C++, Fortran …)

General MPI Program Structure

#include <mpi.h>                           /* MPI include file              */

void main (int argc, char *argv[])
{
  int np, rank, ierr;                      /* variable declarations         */

  ierr = MPI_Init(&argc, &argv);           /* initialize MPI environment    */

  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &np);

  /* Do some work and make message passing calls */

  ierr = MPI_Finalize();                   /* terminate MPI environment     */
}

Sample Program: Hello World!


In this modified version of the "Hello
World" program, each processor prints
its rank as well as the total number of
processors in the communicator
MPI_COMM_WORLD.


Notes:

Makes use of the pre-defined communicator MPI_COMM_WORLD.

Not testing for the error status of routines!

Sample Program: Hello World!

#include <stdio.h>
#include "mpi.h"                 // MPI compiler header file

void main (int argc, char **argv)
{
  int nproc, myrank, ierr;

  ierr = MPI_Init(&argc, &argv);                 // MPI initialization

  // Get number of MPI processes
  MPI_Comm_size(MPI_COMM_WORLD, &nproc);

  // Get process id for this processor
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

  printf("Hello World!! I'm process %d of %d\n", myrank, nproc);

  ierr = MPI_Finalize();                         // Terminate all MPI processes
}

Performance


When we write a parallel program, it is important to identify the fraction of the program that can be parallelized and to maximize it (quantified below).

The goals are:

load balance

memory usage balance

minimize communication overhead

reduce sequential bottlenecks

scalability
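One common way to quantify this (not given on the slide) is Amdahl's law: if a fraction p of the work can be parallelized over n processors, the best possible speedup is

    S(n) = 1 / ((1 - p) + p/n)

so with p = 0.9, for example, the speedup can never exceed 10 no matter how many processors are used.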

Compiling & Running MPI Programs


Using mvapich 1.1

1. Set the path; at the command prompt, type:
   export PATH=/u1/local/mvapich1/bin:$PATH
   (uncomment this line in .bashrc)

2. Compile using mpicc, mpiCC, mpif77 or mpif90, e.g.
   mpicc -o cpi cpi.c

3. Prepare a hostfile (e.g. machines) listing the compute nodes:
   compute-0-0
   compute-0-1
   compute-0-2
   compute-0-3

4. Run the program with a number of processor nodes:
   mpirun -np 4 -machinefile machines ./cpi

Compiling & Running MPI Programs


Using mvapich2 1.2

1. Prepare .mpd.conf and .mpd.passwd and save them in your home directory:
   MPD_SECRETWORD=gde1234-3
   (you may set your own secret word)

2. Set the environment for mvapich2 1.2:
   export MPD_BIN=/u1/local/mvapich2
   export PATH=$MPD_BIN:$PATH
   (uncomment these lines in .bashrc)

3. Compile using mpicc, mpiCC, mpif77 or mpif90, e.g.
   mpicc -o cpi cpi.c

4. Prepare a hostfile (e.g. machines), one hostname per line, as in the previous section

Compiling & Running MPI Programs

5. Boot the MPD ring with the hostfile:
   mpdboot -n 4 -f machines

6. Run the program with a number of processor nodes:
   mpiexec -np 4 ./cpi

7. Remember to clean up after running jobs with mpdallexit:
   mpdallexit

Compiling & Running MPI Programs


Using openmpi 1.2

1. Set the environment for openmpi:
   export LD_LIBRARY_PATH=/u1/local/openmpi/lib:$LD_LIBRARY_PATH
   export PATH=/u1/local/openmpi/bin:$PATH
   (uncomment these lines in .bashrc)

2. Compile using mpicc, mpiCC, mpif77 or mpif90, e.g.
   mpicc -o cpi cpi.c

3. Prepare a hostfile (e.g. machines), one hostname per line, as in the previous section

4. Run the program with a number of processor nodes:
   mpirun -np 4 -machinefile machines ./cpi

Submit parallel jobs into torque batch queue


Prepare a job script, say omp.pbs, like the following:

#!/bin/sh
### Job name
#PBS -N OMP-spmd
### Declare job non-rerunable
#PBS -r n
### Mail to user
##PBS -m ae
### Queue name (small, medium, long, verylong)
### Number of nodes
#PBS -l nodes=1:ppn=8
#PBS -l walltime=00:08:00
cd $PBS_O_WORKDIR
export OMP_NUM_THREADS=8
./omp-test
./serial-pi
./omp-spmd-pi

Submit it using qsub:

qsub omp.pbs

Another example of pbs scripts


Prepare a job script, say scripts.sh, like the following:

#!/bin/sh
### Job name
#PBS -N Sorting
### Declare job non-rerunable
#PBS -r n
### Number of nodes
#PBS -l nodes=4
#PBS -l walltime=08:00:00
# This job's working directory
echo Working directory is $PBS_O_WORKDIR
cd $PBS_O_WORKDIR
echo Running on host `hostname`
echo Time is `date`
echo Directory is `pwd`
echo This job runs on the following processors:
echo `cat $PBS_NODEFILE`
# Define number of processors
NPROCS=`wc -l < $PBS_NODEFILE`
echo This job has allocated $NPROCS nodes
# Run the parallel MPI executable
/u1/local/mvapich1/bin/mpirun -v -machinefile $PBS_NODEFILE -np $NPROCS ./bubblesort >> bubble.out

Submit it using qsub:

qsub scripts.sh

Parallel Program Examples

Example 1: Estimation of Pi (OpenMP)

#include <omp.h>
#include <stdio.h>

static long num_steps = 10000000;
double step;
#define NUM_THREADS 8

int main ()
{
  int i, nthreads; double pi, sum[NUM_THREADS];

  step = 1.0/(double) num_steps;
  omp_set_num_threads(NUM_THREADS);

  #pragma omp parallel
  {
    int i, id, nthrds;
    double x;
    id = omp_get_thread_num();
    nthrds = omp_get_num_threads();
    if (id == 0) nthreads = nthrds;
    for (i=id, sum[id]=0.0; i< num_steps; i=i+nthrds) {
      x = (i+0.5)*step;
      sum[id] += 4.0/(1.0+x*x);
    }
  }
  for (i=0, pi=0.0; i<nthreads; i++)
    pi += sum[i] * step;
  printf("Est Pi= %f using %d threads\n", pi, nthreads);
}

Example 2a: Sorting - Quick Sort

The quick sort is an in-place, divide-and-conquer, massively recursive sort.

The efficiency of the algorithm depends heavily on which element is chosen as the pivot point.

The worst-case efficiency of the quick sort, O(n^2), occurs when the list is already sorted and the left-most element is chosen as the pivot.

If the data to be sorted isn't random, randomly choosing a pivot point is recommended. As long as the pivot point is chosen randomly, the quick sort has an expected complexity of O(n log n). A minimal serial sketch is given below.

Pros: Extremely fast.

Cons: Very complex algorithm, massively recursive.
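A minimal serial sketch of the idea (illustrative only, not the course's sorting/qsort.c): choose a random pivot, partition the array around it, then recurse on the two parts.

#include <stdlib.h>

/* sort a[lo..hi] in place; the pivot is chosen at random so that already
   sorted input does not trigger the O(n^2) worst case */
static void quicksort(int a[], int lo, int hi)
{
    int i = lo, j = hi, tmp, pivot;

    if (lo >= hi) return;
    pivot = a[lo + rand() % (hi - lo + 1)];
    while (i <= j) {                 /* partition around the pivot value */
        while (a[i] < pivot) i++;
        while (a[j] > pivot) j--;
        if (i <= j) {
            tmp = a[i]; a[i] = a[j]; a[j] = tmp;
            i++; j--;
        }
    }
    quicksort(a, lo, j);             /* recurse on the two partitions    */
    quicksort(a, i, hi);
}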

Quick Sort Performance

Processes   Time
1           0.410000
2           0.300000
4           0.180000
8           0.180000
16          0.180000
32          0.220000
64          0.680000
128         1.300000

Example 2b: Sorting - Bubble Sort

The bubble sort is the oldest and simplest sort in use. Unfortunately, it is also the slowest.

The bubble sort works by comparing each item in the list with the item next to it, and swapping them if required.

The algorithm repeats this process until it makes a pass all the way through the list without swapping any items (in other words, all items are in the correct order).

This causes larger values to "bubble" to the end of the list while smaller values "sink" towards the beginning of the list. A minimal serial sketch is given below.
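A minimal serial sketch of one process's work (illustrative only, not the course's sorting/bubblesort.c, which distributes the data across MPI processes):

/* repeatedly sweep the array, swapping adjacent out-of-order pairs,
   until a complete pass makes no swaps */
static void bubblesort(int a[], int n)
{
    int i, tmp, swapped = 1;

    while (swapped) {
        swapped = 0;
        for (i = 0; i < n - 1; i++) {
            if (a[i] > a[i + 1]) {
                tmp = a[i]; a[i] = a[i + 1]; a[i + 1] = tmp;
                swapped = 1;
            }
        }
    }
}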

Bubble Sort Performance

Processes   Time
1           3242.327
2           806.346
4           276.4646
8           78.45156
16          21.031
32          4.8478
64          2.03676
128         1.240197

Monte Carlo Integration


"Hit and miss" integration


The integration scheme is to take a
large number of random points and
count the number that are within f(x) to
get the area



58

59

Monte Carlo Integration


Monte Carlo Integration to Estimate Pi

59

60
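A minimal serial sketch of the hit-and-miss estimate of Pi (illustrative only; pmatlab/sample-pi.m in the examples is assumed to follow the same idea): throw random points into the unit square and count those landing inside the quarter circle, so hits/trials is approximately pi/4.

#include <stdio.h>
#include <stdlib.h>

int main ()
{
    long trials = 10000000, hits = 0, i;
    double x, y;

    srand(12345);                        /* fixed seed for reproducibility   */
    for (i = 0; i < trials; i++) {
        x = (double) rand() / RAND_MAX;  /* random point in the unit square  */
        y = (double) rand() / RAND_MAX;
        if (x*x + y*y <= 1.0)            /* inside the quarter circle?       */
            hits++;
    }
    printf("Est Pi= %f\n", 4.0 * hits / trials);
}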

Example 1: omp

omp/test-omp.c
omp/serial-pi.c
omp/spmd-pi.c

Compile the programs with the command:
make

Run the program in parallel with:
./omp-spmd-pi

Submit the job to PBS with:
qsub omp.pbs

Example 2: Prime

prime/prime.c
prime/prime.f90
prime/primeParallel.c
prime/Makefile
prime/machines

Compile with the command:
make

Run the serial program with:
./primeC or ./primeF

Run the parallel program with:
mpirun -np 4 -machinefile machines ./primeMPI

Example 3: Sorting

sorting/qsort.c
sorting/bubblesort.c
sorting/script.sh
sorting/qsort
sorting/bubblesort

Submit the job to the PBS queuing system with:
qsub script.sh

Example 4: pmatlab

pmatlab/startup.m
pmatlab/RUN.m
pmatlab/sample-pi.m

Submit the job to PBS with:
qsub Qpmatlab.pbs

Policy for using sciblade.sci.hkbu.edu.hk

Policy

1. Every user shall apply for his/her own computer user account to log in to the master node of the PC cluster, sciblade.sci.hkbu.edu.hk.

2. The account owner must not share his/her account and password with other users.

3. Every user must deliver jobs to the PC cluster from the master node via the PBS job queuing system. Automatic dispatching of jobs using scripts or robots is not allowed.

4. Users are not allowed to log in to the compute nodes.

5. Foreground jobs on the PC cluster are restricted to program testing, and the time duration should not exceed 1 minute of CPU time per job.

Policy (continued)

6. Any background jobs run on the master node or compute nodes are strictly prohibited and will be killed without prior notice.

7. The current restrictions of the job queuing system are as follows:

The maximum number of running jobs in the job queue is 8.

The maximum total number of CPU cores used at one time cannot exceed 512.

8. The restrictions in item 7 will be reviewed from time to time to accommodate the growing number of users and their computation needs.