Programming with Shared Memory

1

ITCS4145/5145, Parallel Programming. B. Wilkinson, Feb 11, 2013. slides 8b-1.ppt

Programming with Shared Memory



Introduction to OpenMP


Part 1


2

OpenMP


Thread-based shared memory programming model.

Accepted standard developed in the late 1990s by a group of industry specialists.

Higher-level than using thread APIs such as Pthreads or Java threads.

Write programs in C/C++ (or Fortran!) and use OpenMP compiler directives to specify parallelism.

OpenMP also has a few supporting library routines and environment variables.

Several compilers can compile OpenMP programs, including recent Linux C compilers.
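As a concrete starting point, here is a minimal sketch of a complete OpenMP program in C (the file name and the compile line below are illustrative assumptions, not part of the original slides):

   /* minimal.c -- a minimal sketch of a complete OpenMP program (illustrative) */
   #include <stdio.h>
   #include <omp.h>                        /* OpenMP library routines */

   int main(void) {
       #pragma omp parallel                /* run the following block with a team of threads */
       {
           printf("Thread %d of %d\n",
                  omp_get_thread_num(),    /* this thread's ID */
                  omp_get_num_threads());  /* size of the team */
       }
       return 0;
   }

With gcc, such a program is typically compiled with the -fopenmp flag, e.g. gcc -fopenmp minimal.c -o minimal.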

3

OpenMP thread model

[Diagram: the master thread forks a team of threads at each parallel region; the threads synchronize at the end of each region, and code between parallel regions is executed by the master thread only.]

Initially, a single thread executes (the master thread).

A parallel directive creates a team of threads; the subsequent block of code is executed by the multiple threads in parallel.

The exact number of threads is determined in one of several ways; see later.

Other directives within a parallel construct specify parallel for loops and different blocks of code for the threads.

Code outside a parallel region is executed by the master thread only.

4

Number of threads in a team

Established in one of three ways, either:

1. A num_threads clause after the parallel directive, e.g.

      #pragma omp parallel num_threads(5)

or

2. The omp_set_num_threads() library routine being previously called, e.g.

      omp_set_num_threads(6);

or

3. The environment variable OMP_NUM_THREADS is defined, e.g.

      $ export OMP_NUM_THREADS=8
      $ ./hello

in the order given, or system dependent if none of the above. The number of threads available can be altered dynamically to achieve the best use of system resources.
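As a minimal runnable sketch (the thread counts are illustrative), the first two methods can be combined; the num_threads clause takes precedence over omp_set_num_threads() for that region, and the runtime may still provide fewer threads than requested:

   #include <stdio.h>
   #include <omp.h>

   int main(void) {
       omp_set_num_threads(6);                 /* method 2: request 6 threads */

       #pragma omp parallel num_threads(5)     /* method 1: clause takes precedence, request 5 */
       {
           if (omp_get_thread_num() == 0)      /* only thread 0 reports the team size */
               printf("Team size = %d\n", omp_get_num_threads());
       }
       return 0;
   }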

Finding number of threads and thread ID during program execution

omp_get_num_threads()

   Returns the total number of threads in the team.

omp_get_thread_num()

   Returns the thread number (ID), an integer from 0 to omp_get_num_threads() - 1, where thread 0 is the master thread.

The names of these two functions are similar and easy to confuse.

6

OpenMP Parallel Directive

#pragma omp parallel
   structured_block

The C "pragma" (pragmatic) directive instructs the compiler to use OpenMP features; all OpenMP directives start with omp.

The structured_block is a single statement or a compound statement created with { ... }, with a single entry point and a single exit point.

The directive creates multiple threads, each one executing the specified structured_block.

There is an implicit barrier at the end of the construct.

7

Hello world example

#pragma omp parallel
{
   printf("Hello World from thread %d of %d\n",
          omp_get_thread_num(), omp_get_num_threads());
}

VERY IMPORTANT: The opening brace must be on a new line (tabs and spaces are OK).

Output from an 8-processor/core machine:

Hello World from thread 0 of 8
Hello World from thread 4 of 8
Hello World from thread 3 of 8
Hello World from thread 2 of 8
Hello World from thread 7 of 8
Hello World from thread 1 of 8
Hello World from thread 6 of 8
Hello World from thread 5 of 8

8

Global "shared" variables/data

Any variable declared outside a parallel construct is accessible by all threads, unless otherwise specified:

int main(int argc, char *argv[]) {

   int x;            // accessible by all threads

   #pragma omp parallel
   {
      ...            // each thread sees the same x
   }
}

9

Private variables

Separate copies of variables for each thread.

Can be declared within each parallel region, but OpenMP provides the private clause.

int tid;

#pragma omp parallel private(tid)
{
   tid = omp_get_thread_num();
   printf("Hello World from thread = %d\n", tid);
}

Each thread has a local variable tid.

There is also a shared clause available for shared variables.

Another example of shared and private data

int main(int argc, char *argv[])
{
   int x;
   int tid;

   #pragma omp parallel private(tid)
   {
      tid = omp_get_thread_num();
      if (tid == 0) x = 42;
      printf("Thread %d, x = %d\n", tid, x);
   }
}

x is shared by all threads.

tid is private: each thread has its own copy.

Variables declared outside the parallel construct are shared unless otherwise specified.

Output

$ ./data

Thread 3, x = 0

Thread 2, x = 0

Thread 1, x = 0

Thread 0, x = 42

Thread 4, x = 42

Thread 5, x = 42

Thread 6, x = 42

Thread 7, x = 42


tid has a separate value for each thread

Why does x change?

Another Example: Shared versus Private

int a[100];

#pragma omp parallel private(tid, n)
{
   tid = omp_get_thread_num();
   n = omp_get_num_threads();
   a[tid] = 10*n;
}

OR, with the optional shared clause:

#pragma omp parallel private(tid, n) shared(a)
...

tid and n are private; a[ ] is shared.

13

Variations of private variables

private clause: creates private copies of variables for each thread.

firstprivate clause: as the private clause, but initializes each copy to the value the variable had immediately prior to the parallel construct.

lastprivate clause: as private, but "the value of each lastprivate variable from the sequentially last iteration of the associated loop, or the lexically last section directive, is assigned to the variable's original object."
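A small runnable sketch (the variable name is illustrative) of the firstprivate clause: each thread's copy starts with the value the variable had just before the construct, and the original variable is not affected by the threads' copies:

   #include <stdio.h>
   #include <omp.h>

   int main(void) {
       int offset = 100;                      /* value set before the parallel construct */

       #pragma omp parallel firstprivate(offset)
       {
           /* each thread's copy of offset starts at 100 */
           offset += omp_get_thread_num();    /* changes affect only this thread's copy */
           printf("Thread %d: offset = %d\n", omp_get_thread_num(), offset);
       }

       /* the original offset is unchanged by the threads' private copies */
       printf("After the region: offset = %d\n", offset);
       return 0;
   }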

14

Work-Sharing

Specifying work inside a parallel region.

Four constructs in this classification:

   sections / section
   for
   single
   master

In all cases, there is an implicit barrier at the end of the construct unless a nowait clause is included, which overrides the barrier.

Note: these constructs do not start a new team of threads. That is done by an enclosing parallel construct.

15

Sections

The construct:

#pragma omp parallel              /* enclosing parallel directive */
{
   #pragma omp sections
   {
      #pragma omp section
         structured_block

      #pragma omp section
         structured_block
   }
}

causes the structured blocks to be shared among the threads in the team; each block is executed by one of the available threads. The first section directive is optional.

16

Example

#pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
{
   tid = omp_get_thread_num();
   #pragma omp sections nowait
   {
      #pragma omp section          /* one thread does this */
      {
         printf("Thread %d doing section 1\n", tid);
         for (i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
            printf("Thread %d: c[%d]= %f\n", tid, i, c[i]);
         }
      }

      #pragma omp section          /* another thread does this */
      {
         printf("Thread %d doing section 2\n", tid);
         for (i = 0; i < N; i++) {
            d[i] = a[i] * b[i];
            printf("Thread %d: d[%d]= %f\n", tid, i, d[i]);
         }
      }
   } /* end of sections */
} /* end of parallel section */

Another sections example

#pragma omp parallel shared(a,b,c,d,nthreads) private(i,tid)
{
   tid = omp_get_thread_num();
   #pragma omp sections nowait     /* threads do not wait after finishing a section */
   {
      #pragma omp section          /* one thread does this */
      {
         printf("Thread %d doing section 1\n", tid);
         for (i = 0; i < N; i++) {
            c[i] = a[i] + b[i];
            printf("Thread %d: c[%d]=%f\n", tid, i, c[i]);
         }
      }

Sections example continued

      #pragma omp section          /* another thread does this */
      {
         printf("Thread %d doing section 2\n", tid);
         for (i = 0; i < N; i++) {
            d[i] = a[i] * b[i];
            printf("Thread %d: d[%d]= %f\n", tid, i, d[i]);
         }
      }
   } /* end of sections */

   printf("Thread %d done\n", tid);
} /* end of parallel section */

Output

Thread 0 doing section 1

Thread 0: c[0]= 5.000000

Thread 0: c[1]= 7.000000

Thread 0: c[2]= 9.000000

Thread 0: c[3]= 11.000000

Thread 0: c[4]= 13.000000

Thread 3 done

Thread 2 done

Thread 1 doing section 2

Thread 1: d[0]= 0.000000

Thread 1: d[1]= 6.000000

Thread 1: d[2]= 14.000000

Thread 1: d[3]= 24.000000

Thread 0 done

Thread 1: d[4]= 36.000000

Thread 1 done

Threads do not wait (i.e. no barrier).

Output if remove nowait clause

Thread 0 doing section 1

Thread 0: c[0]= 5.000000

Thread 0: c[1]= 7.000000

Thread 0: c[2]= 9.000000

Thread 0: c[3]= 11.000000

Thread 0: c[4]= 13.000000

Thread 3 doing section 2

Thread 3: d[0]= 0.000000

Thread 3: d[1]= 6.000000

Thread 3: d[2]= 14.000000

Thread 3: d[3]= 24.000000

Thread 3: d[4]= 36.000000

Thread 3 done

Thread 1 done

Thread 2 done

Thread 0 done

If we remove the nowait clause, there is a barrier at the end of the sections construct: threads wait until they are all done with their sections.

Barrier here

21

Combining parallel and sections constructs

If a parallel directive is followed by a single sections directive, they can be combined into:

#pragma omp parallel sections
{
   #pragma omp section
      structured_block

   #pragma omp section
      structured_block
}

with similar effect. (However, a nowait clause is not allowed.)

22

Parallel For Loop

#pragma omp parallel              /* enclosing parallel region */
{
   #pragma omp for                /* must have a new line here */
   for (i = 0; i < n; i++) {
      ...                         /* for loop body */
   }
}

causes the for loop to be divided into parts and the parts shared among the threads in the team (equivalent to a forall). Different iterations will be executed by the available threads.

Must be a "for" loop of a simple C form such as

   for (i = 0; i < n; i++)

where the lower bound and upper bound are constants.

23

Example

#pragma omp parallel shared(a,b,c,nthreads,chunk) private(i,tid)
{
   tid = omp_get_thread_num();
   if (tid == 0) {                       /* executed by one thread */
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
   }
   printf("Thread %d starting...\n", tid);

   #pragma omp for                       /* for loop divided among threads */
   for (i = 0; i < N; i++) {
      c[i] = a[i] + b[i];
      printf("Thread %d: c[%d]= %f\n", tid, i, c[i]);
   }
   /* without "nowait", threads wait here after finishing the loop */
} /* end of parallel section */

24

Combined parallel and for constructs

If a parallel directive is followed by a single for directive, it can be combined into:

#pragma omp parallel for
   <for loop>

with similar effects.

Combining Directives Example

#pragma omp parallel for shared(a,b,c,nthreads) private(i,tid)
for (i = 0; i < N; i++) {
   c[i] = a[i] + b[i];
   printf("Thread %d: c[%d]= %f\n", omp_get_thread_num(), i, c[i]);
}

Declares a parallel region and a parallel for in one directive.

Scheduling a Parallel For

By default, a parallel for is scheduled by mapping blocks (or chunks) of iterations to the available threads (static mapping).

Thread 1 starting...
Thread 1: i = 2, c[1] = 9.000000
Thread 1: i = 3, c[1] = 11.000000
Thread 2 starting...
Thread 2: i = 4, c[2] = 13.000000
Thread 3 starting...
Number of threads = 4
Thread 0 starting...
Thread 0: i = 0, c[0] = 5.000000
Thread 0: i = 1, c[0] = 7.000000

Default chunk size = number of iterations / number of threads (rounded up).

For example, with 5 iterations and 4 threads, threads 0 and 1 each get a chunk of 2 iterations, thread 2 gets the remaining 1, and thread 3 gets none, as in the output above.

Barrier at the end of the parallel for.

27

Loop Scheduling and Partitioning

OpenMP offers scheduling clauses to add to the for construct:

1. Static

   #pragma omp parallel for schedule(static, chunk_size)

   Partitions loop iterations into equal-sized chunks specified by chunk_size. Chunks are assigned to threads in round-robin fashion.

2. Dynamic

   #pragma omp parallel for schedule(dynamic, chunk_size)

   Uses an internal work queue. A chunk-sized block of the loop is assigned to each thread as it becomes available.

28


3. Guided

   #pragma omp parallel for schedule(guided, chunk_size)

   Similar to dynamic, but the chunk size starts large and gets smaller to reduce the time threads have to go to the work queue:

      chunk size = number of iterations remaining / (2 * number of threads)

4. Runtime

   #pragma omp parallel for schedule(runtime)

   Uses the OMP_SCHEDULE environment variable to specify which of static, dynamic or guided should be used.
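A minimal runnable sketch (the loop bound and chunk size are illustrative) showing a schedule clause in use; with dynamic scheduling, chunks of 4 iterations are handed out from the work queue as threads finish their previous chunk:

   #include <stdio.h>
   #include <omp.h>

   #define N 32

   int main(void) {
       double a[N];

       /* dynamic schedule: chunks of 4 iterations assigned from a work queue */
       #pragma omp parallel for schedule(dynamic, 4)
       for (int i = 0; i < N; i++) {
           a[i] = 2.0 * i;                     /* stand-in for uneven per-iteration work */
           printf("Thread %d computed a[%d]\n", omp_get_thread_num(), i);
       }
       return 0;
   }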

Question

Guided scheduling is similar to Static except that the chunk sizes start large and get smaller.

What is the advantage of using Guided versus Static?

Answer: Guided improves load balance.

Reduction

A reduction applies a commutative operator to an aggregate of values, creating a single value (similar to MPI_Reduce).

sum = 0;
#pragma omp parallel for reduction(+:sum)    /* operation : variable */
for (k = 0; k < 100; k++) {
   sum = sum + funct(k);
}

A private copy of sum is created for each thread by the compiler.

Each private copy will be added to sum at the end.

This eliminates the need for critical sections here.
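For completeness, a runnable sketch of the same pattern; the funct body here (squaring its argument) is an illustrative stand-in, not from the original slides:

   #include <stdio.h>

   /* illustrative stand-in for funct(k) */
   static int funct(int k) {
       return k * k;
   }

   int main(void) {
       int sum = 0;

       /* each thread accumulates into its own private copy of sum;
          the copies are combined with + when the loop finishes */
       #pragma omp parallel for reduction(+:sum)
       for (int k = 0; k < 100; k++) {
           sum = sum + funct(k);
       }

       printf("sum = %d\n", sum);    /* 0*0 + 1*1 + ... + 99*99 = 328350 */
       return 0;
   }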

31

Single

The directive

#pragma omp parallel
{
   #pragma omp single      /* must have a new line here */
      structured_block
}

causes the structured block to be executed by one thread only.
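A minimal runnable sketch (the messages are illustrative) of single inside a parallel region; one thread, not necessarily the master, executes the block, and the others wait at the implicit barrier at the end of the single construct:

   #include <stdio.h>
   #include <omp.h>

   int main(void) {
       #pragma omp parallel
       {
           #pragma omp single
           {
               /* executed by exactly one thread in the team */
               printf("Initialization done by thread %d\n", omp_get_thread_num());
           }
           /* all threads reach here only after the single block has completed */
           printf("Thread %d continuing\n", omp_get_thread_num());
       }
       return 0;
   }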

32

Master

The master directive:

#pragma omp parallel
{
   #pragma omp master
      structured_block
}

causes only the master thread to execute the structured block.

Different from the constructs in the work-sharing group in that there is no implied barrier at the end of the construct (nor at the beginning).

Other threads encountering the master directive will ignore it and the associated structured block, and will move on.

Master Example

#pragma omp parallel private(tid)
{
   tid = omp_get_thread_num();
   printf("Thread %d starting...\n", tid);

   #pragma omp master
   {
      printf("Thread %d doing work\n", tid);
      ...
   } /* end of master */

   printf("Thread %d done\n", tid);
} /* end of parallel section */

Is there any difference between these two approaches?

Master directive:

#pragma omp parallel
{
   ...
   #pragma omp master
      structured_block
   ...
}

Using an if statement:

#pragma omp parallel private(tid)
{
   ...
   tid = omp_get_thread_num();
   if (tid == 0)
      structured_block
   ...
}


Questions

35