Basic Concepts of Parallel Programming*

Wei Zhang
Dept of ECE, VCU
Adapted from the textbook and D. Kirk and W. Hwu's slides
Fundamentals of Parallel Computing
- Parallel computing requires that
  - The problem can be decomposed into sub-problems that can be safely solved at the same time
  - The programmer structures the code and data to solve these sub-problems concurrently
- The goals of parallel computing are
  - To solve problems in less time, and/or
  - To solve bigger problems, and/or
  - To achieve better solutions
- The problems must be large enough to justify parallel computing and to exhibit exploitable concurrency.
A Recommended Reading
Mattson, Sanders, and Massingill, Patterns for Parallel Programming, Addison-Wesley, 2005, ISBN 0-321-22811-1.
Traditional Programming Models
- Usually, a program begins at a defined point, such as the main() function
- It then works through a series of tasks in succession
- If the program relies on user interaction, the main processing instrument is a loop in which user events are handled
  - A right button click
- This model is simple and comfortable
  - Only one thing is happening at any given moment
  - At any point in the process, one step generally flows into the next, leading up to a predictable conclusion, based on predetermined parameters
Parallel Programming Model
- Instead of a single sequential execution sequence, programmers should identify those activities that can be executed in parallel.
- A program is viewed as a set of tasks with dependencies among them.
- Decomposition: breaking programs into individual tasks and identifying dependencies.
Key Parallel Programming Steps
1) To find the concurrency in the problem
2) To structure the algorithm so that concurrency can be exploited
3) To implement the algorithm in a suitable programming environment
4) To execute and tune the performance of the code on a parallel system
Challenges of Parallel Programming
- Finding and exploiting concurrency often requires looking at the problem from a non-obvious angle
  - Computational thinking (J. Wing)
- Dependences need to be identified and managed
  - The order of task execution may change the answers
    - Obvious: one step feeds its result to the next steps
- Performance can be drastically reduced by many factors
  - Overhead of parallel processing
  - Load imbalance among processor elements
  - Inefficient data sharing patterns
  - Saturation of critical resources such as memory bandwidth
Finding Concurrency – The Process
[Diagram: Decomposition (Task Decomposition, Data Decomposition) feeds Dependence Analysis (Group Tasks, Order Tasks, Data Sharing), which feeds Design Evaluation]
This is typically an iterative process. Opportunities exist for dependence analysis to play an earlier role in decomposition.
Finding Concurrency in Problems
- Identify a decomposition of the problem into sub-problems that can be solved simultaneously
  - A task decomposition that identifies tasks for potential concurrent execution
  - A data decomposition that identifies data local to each task
  - A way of grouping tasks and ordering the groups to satisfy temporal constraints
  - An analysis of the data sharing patterns among the concurrent tasks
  - A design evaluation that assesses the quality of the choices made in all the steps
Task Decomposition
- Task decomposition: decomposing a program by the function it performs
- Many large problems can be naturally decomposed into tasks
How to Find Tasks?
- In some cases, each task corresponds to a distinct call to a function (called functional decomposition).
- Another place to find tasks is in distinct iterations of the loops within an algorithm (loop-splitting algorithms); a sketch follows below.
- In data-driven decomposition, tasks are the units of execution that update individual chunks of data.
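As a concrete illustration of loop splitting, here is a minimal C++ (std::thread) sketch, not taken from the slides: the iterations of an independent loop are divided into contiguous chunks, one chunk per thread. The function name sqrt_all and the chunking scheme are illustrative assumptions.

    #include <algorithm>
    #include <cmath>
    #include <thread>
    #include <vector>

    // Split the iterations of an independent loop into contiguous chunks,
    // one chunk per thread; each iteration is one task.
    void sqrt_all(std::vector<double>& data, std::size_t nthreads) {
        std::vector<std::thread> workers;
        std::size_t chunk = (data.size() + nthreads - 1) / nthreads;
        for (std::size_t t = 0; t < nthreads; ++t) {
            std::size_t lo = t * chunk;
            std::size_t hi = std::min(data.size(), lo + chunk);
            workers.emplace_back([&data, lo, hi] {
                for (std::size_t i = lo; i < hi; ++i)
                    data[i] = std::sqrt(data[i]);   // independent per-iteration work
            });
        }
        for (auto& w : workers) w.join();
    }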
An Example of Task Decomposition
- Microsoft Word
Considerations of Task Decomposition
- Efficiency
- Simplicity
- Flexibility
Task Decomposition Example - Square Matrix Multiplication
- P = M * N of WIDTH × WIDTH
- One natural task (sub-problem) produces one element of P
- All tasks can execute in parallel in this example.
[Figure: M (WIDTH × WIDTH) multiplied by N (WIDTH × WIDTH) to produce P (WIDTH × WIDTH)]
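One possible realization of this decomposition with C++ threads (a sketch, not the course's actual implementation): each (row, col) element of P is an independent task, and a fixed pool of threads picks up tasks round-robin. The names matmul_by_element and element_task, and the row-major flat-vector layout, are assumptions.

    #include <thread>
    #include <vector>

    // Task decomposition for P = M * N: one task computes one element of P.
    // Matrices are WIDTH x WIDTH, stored row-major in flat vectors.
    void matmul_by_element(const std::vector<float>& M, const std::vector<float>& N,
                           std::vector<float>& P, int WIDTH, int nthreads) {
        auto element_task = [&](int row, int col) {     // a single task
            float sum = 0.0f;
            for (int k = 0; k < WIDTH; ++k)
                sum += M[row * WIDTH + k] * N[k * WIDTH + col];
            P[row * WIDTH + col] = sum;
        };
        std::vector<std::thread> workers;
        const int total = WIDTH * WIDTH;                // all tasks are independent
        for (int t = 0; t < nthreads; ++t)
            workers.emplace_back([&, t] {
                for (int idx = t; idx < total; idx += nthreads)
                    element_task(idx / WIDTH, idx % WIDTH);
            });
        for (auto& w : workers) w.join();
    }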
Data Decomposition
- Data decomposition: it breaks down tasks by the data they work on rather than the nature of the task.
- The most compute-intensive parts of many large problems manipulate a large data structure
  - Similar operations are being applied to different parts of the data structure, in a mostly independent manner.
- The data decomposition should lead to
An Example: Task vs Data Decomposition
Data Decomposition Example - Square Matrix Multiplication
- Row blocks
  - Computing each partition requires access to the entire N array
- Square sub-blocks
  - Partition a matrix into smaller rectangular matrices called blocks
  - Only bands of M and N are needed
[Figure: M, N, and P (each WIDTH × WIDTH) partitioned into row blocks and square sub-blocks]
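A sketch of the row-block variant using C++ threads, again assuming row-major flat vectors: each thread owns a horizontal band of P and reads only the matching rows of M, but every thread still reads all of N. The function name matmul_row_blocks is illustrative.

    #include <algorithm>
    #include <thread>
    #include <vector>

    // Data decomposition by row blocks: each thread owns a band of rows of P
    // (and reads the matching rows of M), while all threads read all of N.
    void matmul_row_blocks(const std::vector<float>& M, const std::vector<float>& N,
                           std::vector<float>& P, int WIDTH, int nthreads) {
        std::vector<std::thread> workers;
        int band = (WIDTH + nthreads - 1) / nthreads;
        for (int t = 0; t < nthreads; ++t) {
            int r0 = t * band;
            int r1 = std::min(WIDTH, r0 + band);
            workers.emplace_back([&, r0, r1] {
                for (int i = r0; i < r1; ++i)           // rows owned by this thread
                    for (int j = 0; j < WIDTH; ++j) {
                        float sum = 0.0f;
                        for (int k = 0; k < WIDTH; ++k)
                            sum += M[i * WIDTH + k] * N[k * WIDTH + j];
                        P[i * WIDTH + j] = sum;
                    }
            });
        }
        for (auto& w : workers) w.join();
    }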
Data Decomposition
- If a task-based decomposition has already been done, the data decomposition is driven by the needs of each task.
- A few common examples:
More on Task & Data Decomposition
- Viewing the problem decomposition in terms of two distinct dimensions is somewhat artificial.
- A task decomposition implies a data decomposition and vice versa.
- A problem decomposition usually proceeds most naturally by emphasizing one dimension over the other.
Task Grouping
- Sometimes natural tasks of a problem can be grouped together to improve efficiency
  - Reduced synchronization overhead: all tasks in the group can use a barrier to wait for a common dependence
  - All tasks in the group efficiently share data loaded into a common on-chip, shared storage (Shared Memory)
  - Grouping and merging dependent tasks into one task reduces the need for synchronization
Task Grouping Example - Square Matrix Multiplication?
- Tasks calculating a P sub-block
  - Extensive input data sharing; memory bandwidth demand is reduced by using Shared Memory
  - All synchronized in execution
[Figure: a sub-block of P computed from the corresponding bands of M and N]
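In the Kirk/Hwu setting this grouping maps to a GPU thread block that stages the needed tiles of M and N into on-chip Shared Memory. The sequential C++ sketch below shows the same grouping idea as cache blocking: all element tasks of one TILE × TILE sub-block of P are processed together so that they reuse the same bands of M and N while those bands are in fast storage. TILE = 32 and the function name are assumptions, not values from the slides.

    #include <algorithm>
    #include <vector>

    constexpr int TILE = 32;   // assumed tile size, not taken from the slides

    // All element tasks of one TILE x TILE sub-block of P are grouped so that
    // they reuse the same TILE-wide bands of M and N (cache blocking here;
    // on-chip Shared Memory in the GPU version the slides refer to).
    void matmul_grouped(const std::vector<float>& M, const std::vector<float>& N,
                        std::vector<float>& P, int WIDTH) {
        std::fill(P.begin(), P.end(), 0.0f);
        for (int ib = 0; ib < WIDTH; ib += TILE)                 // one group per P sub-block
            for (int jb = 0; jb < WIDTH; jb += TILE)
                for (int kb = 0; kb < WIDTH; kb += TILE)         // shared band of M and N
                    for (int i = ib; i < std::min(ib + TILE, WIDTH); ++i)
                        for (int j = jb; j < std::min(jb + TILE, WIDTH); ++j) {
                            float sum = 0.0f;
                            for (int k = kb; k < std::min(kb + TILE, WIDTH); ++k)
                                sum += M[i * WIDTH + k] * N[k * WIDTH + j];
                            P[i * WIDTH + j] += sum;
                        }
    }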
Steps to Group Tasks
- Note: there is no single way to find task groups.
- First, look at how the original problem was decomposed.
  - The tasks corresponding to a high-level operation (e.g., a loop) naturally group together.
- Second, do any task groups share the same constraint? If so, merge the groups together.
- Next, look at constraints between groups of tasks.
  - Groups may have a clear temporal ordering
  - There is a distinct chain of data moves between groups
  - It is often useful to merge independent tasks into a larger group of tasks for more scheduling flexibility and better scalability
Data Sharing
- Data sharing: it analyzes how data is shared among groups of tasks, so that access to shared data can be managed correctly
- Data sharing can be a double-edged sword
- Efficient memory bandwidth usage can be achieved by synchronizing the execution of task groups and coordinating their usage of memory data
  - Efficient use of on-chip, shared storage
Categories of Shared Data
- Read-only sharing can usually be done at much higher efficiency than read-write sharing, which often requires synchronization.
- Read-write sharing must be protected with some type of exclusive-access mechanism (locks, semaphores, etc.); a sketch follows below.
  - Accumulate:
  - Multiple-read/single-write:
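As a hedged illustration (the names global_sum, shared_counts, and worker are invented for this sketch), the C++ fragment below protects an accumulate with std::atomic and protects general read-write data with a mutex:

    #include <atomic>
    #include <mutex>
    #include <thread>
    #include <vector>

    std::atomic<long> global_sum{0};          // "accumulate" pattern: commutative updates
    std::mutex counts_mutex;
    std::vector<long> shared_counts(16, 0);   // general read-write shared data

    // Each task accumulates privately, then publishes with one atomic update;
    // arbitrary read-write access goes through an exclusive lock instead.
    void worker(const std::vector<int>& chunk) {
        long local = 0;
        for (int v : chunk) local += v;
        global_sum += local;                  // protected accumulate

        std::lock_guard<std::mutex> lock(counts_mutex);
        if (!chunk.empty())
            ++shared_counts[static_cast<std::size_t>(chunk.front()) % shared_counts.size()];
    }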
Data Flow Decomposition
- Data flow decomposition breaks up a program by how data flows between tasks
- A well-known model: producer/consumer
- In the gardening example, how to do data flow decomposition?
  - One gardener prepares the tools (puts gas in the mower, cleans the shears, etc.); then both gardeners begin to mow
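A minimal C++ producer/consumer sketch of data-flow decomposition, with an invented integer work item standing in for the real data that flows between the two tasks:

    #include <condition_variable>
    #include <iostream>
    #include <mutex>
    #include <queue>
    #include <thread>

    std::queue<int> work;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void producer() {
        for (int i = 0; i < 10; ++i) {
            { std::lock_guard<std::mutex> lock(m); work.push(i); }
            cv.notify_one();                       // data flows to the consumer as it is produced
        }
        { std::lock_guard<std::mutex> lock(m); done = true; }
        cv.notify_one();
    }

    void consumer() {
        while (true) {
            std::unique_lock<std::mutex> lock(m);
            cv.wait(lock, [] { return !work.empty() || done; });
            if (work.empty() && done) break;
            int item = work.front(); work.pop();
            lock.unlock();
            std::cout << "consumed " << item << '\n';  // stands in for the downstream task
        }
    }

    int main() {
        std::thread p(producer), c(consumer);
        p.join(); c.join();
    }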
Challenges in Managing Parallel Threads and Their Interactions
- Synchronization: the process by which two or more threads coordinate their activities.
- Communication: the bandwidth and latency issues associated with exchanging data between threads.
- Load balancing: the distribution of work across multiple threads so that they all perform roughly the same amount of work (a sketch follows below).
- Scalability: the challenge of making efficient use of a larger number of threads when software is run on more-capable systems.
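One common response to load imbalance, shown below as a hedged C++ sketch (the task cost and the names run_dynamic and results are invented), is dynamic scheduling: threads pull the next task index from a shared atomic counter, so a thread that draws cheap tasks simply takes more of them. results is assumed to hold at least ntasks entries.

    #include <atomic>
    #include <thread>
    #include <vector>

    // Dynamic scheduling: threads grab the next task index from a shared
    // counter instead of being assigned fixed chunks up front.
    void run_dynamic(std::vector<double>& results, int ntasks, int nthreads) {
        std::atomic<int> next{0};
        auto worker = [&] {
            for (int i = next.fetch_add(1); i < ntasks; i = next.fetch_add(1)) {
                double x = 0.0;
                for (int k = 1; k <= (i + 1) * 1000; ++k)   // deliberately uneven task cost
                    x += 1.0 / k;
                results[i] = x;
            }
        };
        std::vector<std::thread> pool;
        for (int t = 0; t < nthreads; ++t) pool.emplace_back(worker);
        for (auto& th : pool) th.join();
    }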