SPMD


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007
ECE 498AL, University of Illinois, Urbana-Champaign

ECE 498AL
Lecture 16: Parallel Programming Basics
Part 3: Coding Styles


Objective


- To understand SPMD, the CUDA coding style:
  - How it relates to other common parallel programming coding styles
  - What the coding style typically entails for applications
  - How the coding style can be used to implement example applications at different levels of efficiency and complexity


Parallel Programming Coding Styles: Program and Data Models

Program Models:
- Fork/Join
- Master/Worker
- SPMD
- Loop Parallelism

Data Models:
- Distributed Array
- Shared Queue
- Shared Data

These are not necessarily mutually exclusive.


Program Models

- SPMD (Single Program, Multiple Data)
  - All PEs (processing elements) execute the same program in parallel, but each has its own data
  - Each PE uses a unique ID to access its portion of the data
  - Different PEs can follow different paths through the same code
  - This is essentially the CUDA grid model (see the kernel sketch after this list)
  - SIMD is a special case: the warp
- Master/Worker
- Loop Parallelism
- Fork/Join
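
A minimal sketch of the SPMD idea in the CUDA grid model, assuming a simple element-wise scaling task (the kernel name, array, and launch configuration are illustrative, not from the lecture):

#include <cuda_runtime.h>

// Every thread runs the same program (single program), but each computes a
// unique global ID and uses it to select its own element (multiple data).
__global__ void scale(float *data, float alpha, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;   // unique ID per thread
    if (id < n)                                       // threads may take different paths
        data[id] = alpha * data[id];
}

// Example launch for n = 1,000,000 elements with 256 threads per block:
// scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);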


Program Models

- SPMD (Single Program, Multiple Data)
- Master/Worker (a GPU-flavored bag-of-tasks sketch follows this list)
  - A master thread sets up a pool of worker threads and a bag of tasks
  - Workers execute concurrently, removing tasks until done
- Loop Parallelism
  - Loop iterations execute in parallel
  - FORTRAN do-all (truly parallel), do-across (with dependences)
- Fork/Join
  - The most general, generic way of creating threads
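
The master/worker style can also be expressed on a GPU. Below is a hedged bag-of-tasks sketch, assuming a global task counter and a placeholder process_task() helper (neither appears in the lecture); the host plays the master by filling the task list, resetting the counter, and launching the workers:

__device__ int next_task;                 // bag-of-tasks cursor; the host resets it to 0
                                          // with cudaMemcpyToSymbol before the launch

__device__ void process_task(int t)       // placeholder for the real per-task work
{
    (void)t;
}

__global__ void workers(int num_tasks)
{
    while (true) {
        int t = atomicAdd(&next_task, 1); // atomically claim the next task
        if (t >= num_tasks) break;        // bag is empty: this worker is done
        process_task(t);
    }
}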


Review: Algorithm Structure

Start
- Organize by Task
  - Linear: Task Parallelism
  - Recursive: Divide and Conquer
- Organize by Data
  - Linear: Geometric Decomposition
  - Recursive: Recursive Data
- Organize by Data Flow
  - Regular: Pipeline
  - Irregular: Event Driven


Algorithm Structures vs. Coding Styles

|               | Task Parallelism | Divide & Conquer | Geometric Decomp. | Recursive Data | Pipeline | Event-based |
|---------------|------------------|------------------|-------------------|----------------|----------|-------------|
| SPMD          | ☺☺☺              | ☺☺☺              | ☺☺☺               | ☺☺             | ☺☺☺      | ☺☺          |
| Loop Parallel | ☺☺☺              | ☺☺               | ☺☺☺               |                |          |             |
| Master/Worker | ☺☺☺              | ☺☺               |                   |                |          |             |
| Fork/Join     | ☺☺               | ☺☺☺              | ☺☺                |                | ☺☺☺      | ☺☺☺         |

Source: Mattson et al.


Programming Models vs. Program Models

|               | OpenMP | MPI  | CUDA  |
|---------------|--------|------|-------|
| SPMD          | ☺☺☺    | ☺☺☺☺ | ☺☺☺☺☺ |
| Loop Parallel | ☺☺☺☺   |      |       |
| Master/Slave  | ☺☺     | ☺☺☺  |       |
| Fork/Join     | ☺☺☺    |      |       |


More on SPMD


- The dominant coding style of scalable parallel computing
  - MPI code is mostly developed in SPMD style
  - Much OpenMP code is also written in SPMD style (next to loop parallelism)
  - Particularly suitable for algorithms based on task parallelism and geometric decomposition
- Main advantage
  - Tasks and their interactions are visible in one piece of source code; there is no need to correlate multiple sources
- SPMD is by far the most commonly used pattern for structuring parallel programs


Typical SPMD Program Phases


- Initialize
  - Establish localized data structures and communication channels
- Obtain a unique identifier
  - Each thread acquires a unique identifier, typically ranging from 0 to N-1, where N is the number of threads
  - Both OpenMP and CUDA have built-in support for this
- Distribute data
  - Decompose global data into chunks and localize them, or
  - Share/replicate the major data structures, using the thread ID to associate a subset of the data with each thread
- Run the core computation
  - More details on the next slide
- Finalize
  - Reconcile global data structures and prepare for the next major iteration
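
A hedged host-side sketch of these phases in CUDA, with an illustrative kernel and problem size (error checking omitted):

#include <cuda_runtime.h>
#include <stdlib.h>

__global__ void compute(float *data, int n)            // core computation (illustrative)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;    // obtain a unique identifier
    if (id < n) data[id] += 1.0f;
}

int main(void)
{
    const int n = 1000000;
    float *h = (float *)calloc(n, sizeof(float));
    float *d;

    cudaMalloc(&d, n * sizeof(float));                            // Initialize: set up device storage
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);  // Distribute data to the device

    compute<<<(n + 255) / 256, 256>>>(d, n);                      // Run the core computation

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);  // Finalize: reconcile global data
    cudaFree(d);
    free(h);
    return 0;
}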


Core Computation Phase


- Thread IDs are used to differentiate the behavior of threads
  - Use the thread ID in loop index calculations to split loop iterations among the threads
  - Use the thread ID, or conditions based on the thread ID, to branch to thread-specific actions (see the sketch below)
- The two approaches can yield very different performance and code complexity depending on how they are done
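
A small kernel sketch contrasting the two uses of the thread ID; the names and the trivial work are illustrative only:

__global__ void core(const float *in, float *out, float *flag, int n)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;

    // (1) ID in index calculations: split the n iterations among the threads.
    if (id < n)
        out[id] = 2.0f * in[id];

    // (2) ID in a condition: branch to a thread-specific action.
    if (id == 0)
        *flag = 1.0f;          // e.g., a single thread records that this pass ran
}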


A Simple Example


- Assume the computation being parallelized has 1,000,000 iterations.

Sequential code:

num_steps = 1000000;

for (i = 0; i < num_steps; i++) {
  ...
}




SPMD Code Version 1


- Assign a contiguous chunk of iterations to each thread
- The last thread also finishes the remainder iterations

num_steps = 1000000;

i_start = my_id * (num_steps / num_threads);
i_end   = i_start + (num_steps / num_threads);
if (my_id == (num_threads - 1)) i_end = num_steps;

for (i = i_start; i < i_end; i++) {
  ...
}

Reconciliation of results across threads follows, if necessary.
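
The same Version 1 split written as a CUDA kernel; a sketch in which my_id becomes the global thread index, num_threads the total number of launched threads, and the loop body is a placeholder since the lecture leaves it unspecified:

__global__ void version1(float *data, int num_steps)
{
    int my_id       = blockIdx.x * blockDim.x + threadIdx.x;
    int num_threads = gridDim.x * blockDim.x;

    int i_start = my_id * (num_steps / num_threads);
    int i_end   = i_start + (num_steps / num_threads);
    if (my_id == (num_threads - 1)) i_end = num_steps;   // last thread takes the remainder

    for (int i = i_start; i < i_end; i++)
        data[i] += 1.0f;                                  // placeholder per-iteration work
}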


Problems with Version 1


- The last thread executes more iterations than the others
- The number of extra iterations can be up to the total number of threads minus 1
  - For example, with 1,000,000 iterations and 3,000 threads, every thread gets 333 iterations while the last one gets 333 + 1,000 = 1,333
  - This is not a big problem when the number of threads is small
  - When there are thousands of threads, it can create serious load imbalance
- Also, the extra if statement is a typical source of branch divergence in CUDA programs


SPMD Code Version 2


- Assign one more iteration to some of the threads

int rem = num_steps % num_threads;

i_start = my_id * (num_steps / num_threads);
i_end   = i_start + (num_steps / num_threads);

if (rem != 0) {
  if (my_id < rem) {
    i_start += my_id;
    i_end   += (my_id + 1);
  }
  else {
    i_start += rem;
    i_end   += rem;
  }
}
...

- Less load imbalance
- More branch divergence
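
The same split written as a small helper that can be checked on the host before being used in a kernel; a sketch with a function name of my own, not from the lecture:

// Compute the half-open range [i_start, i_end) owned by thread my_id.
// The first (num_steps % num_threads) threads each take one extra iteration.
__host__ __device__ void version2_range(int my_id, int num_threads, int num_steps,
                                        int *i_start, int *i_end)
{
    int chunk = num_steps / num_threads;
    int rem   = num_steps % num_threads;

    *i_start = my_id * chunk;
    *i_end   = *i_start + chunk;

    if (rem != 0) {
        if (my_id < rem) { *i_start += my_id; *i_end += my_id + 1; }
        else             { *i_start += rem;   *i_end += rem;       }
    }
}

// Sanity check: with num_steps = 10 and num_threads = 4, threads 0..3 get
// [0,3), [3,6), [6,8), [8,10), so every iteration is covered exactly once.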



SPMD Code Version 3


- Use a cyclic distribution of the iterations

num_steps = 1000000;

for (i = my_id; i < num_steps; i += num_threads) {
  ...
}

- Less load imbalance
- Loop branch divergence only in the last warp
- Data padding further eliminates divergence
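
In CUDA this cyclic distribution is the familiar grid-stride loop; a minimal sketch with an illustrative loop body:

__global__ void version3(float *data, int num_steps)
{
    int my_id       = blockIdx.x * blockDim.x + threadIdx.x;
    int num_threads = gridDim.x * blockDim.x;

    // Cyclic (interleaved) distribution: thread k handles iterations
    // k, k + num_threads, k + 2*num_threads, ...
    for (int i = my_id; i < num_steps; i += num_threads)
        data[i] += 1.0f;                                  // placeholder per-iteration work
}

In each round of the loop, consecutive threads touch consecutive elements, which also tends to give coalesced memory accesses.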


Comparing the Three Versions

(Figure: how the iterations are assigned to thread IDs 0-3 under Version 1, Version 2, and Version 3.)

A padded Version 1 may be best for some data access patterns.


Data Styles


- Shared Data
  - All threads share a major data structure
  - This is what CUDA supports
- Shared Queue
  - All threads see a "thread-safe" queue that maintains the ordering of data communication
- Distributed Array
  - Decomposed and distributed among threads
  - Limited support in CUDA shared memory (see the sketch below)
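
A hedged sketch of the distributed-array idea in CUDA, where each block stages its own chunk of a global array into __shared__ memory; the tile size, kernel name, and per-block reduction are illustrative:

#define TILE 256   // launch the kernel with TILE threads per block

__global__ void tile_sum(const float *in, float *block_sums, int n)
{
    __shared__ float tile[TILE];                      // this block's chunk of the array

    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (gid < n) ? in[gid] : 0.0f;   // distribute: each thread copies its element
    __syncthreads();

    if (threadIdx.x == 0) {                           // one thread per block reduces its chunk
        float s = 0.0f;
        for (int i = 0; i < TILE; i++) s += tile[i];
        block_sums[blockIdx.x] = s;
    }
}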