Ques 1 Explain grain packing and scheduling with an example.


Grain Packing and Scheduling


Two questions:

How can I partition a program into parallel "pieces" to yield the shortest execution time?

What is the optimal size of parallel grains?

There is an obvious tradeoff between the time spent scheduling and synchronizing parallel grains and the speedup obtained by parallel execution. One approach to the problem is called "grain packing."

Program Graphs and Packing

A program graph is similar to a dependence graph.

Nodes = { (n, s) }, where n = node name and s = size (a larger s means a larger grain size).

Edges = { (v, d) }, where v = the variable being "communicated" and d = the communication delay.

Packing two (or more) nodes produces a node with a larger grain size and possibly more edges to other nodes. Packing is done to eliminate unnecessary communication delays or to reduce overall scheduling overhead.
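To make this concrete, below is a minimal Python sketch of a program graph and of packing a group of nodes into one coarser grain. The node names, grain sizes, variables, and delays are hypothetical illustration values, not taken from any particular example in the text.

```python
# A minimal sketch of a program graph and grain packing, assuming a simple
# list/dict representation. Node names, sizes, and delays are hypothetical.

# Nodes: {name: grain size}; a larger size means a larger grain.
nodes = {"a": 4, "b": 1, "c": 1, "d": 6}

# Edges: (src, dst, variable, delay) -- the delay is paid only when src and
# dst end up on different processors.
edges = [
    ("a", "b", "x", 6),
    ("a", "c", "y", 6),
    ("b", "d", "u", 4),
    ("c", "d", "v", 4),
]

def pack(nodes, edges, group, new_name):
    """Merge the nodes in `group` into one coarser grain.

    The new grain's size is the sum of its members' sizes; edges internal to
    the group vanish (their communication delays are eliminated), and edges
    crossing the group boundary are redirected to the new node.
    """
    nodes[new_name] = sum(nodes.pop(n) for n in group)
    packed_edges = []
    for src, dst, var, delay in edges:
        src = new_name if src in group else src
        dst = new_name if dst in group else dst
        if src != dst:                      # keep only boundary-crossing edges
            packed_edges.append((src, dst, var, delay))
    return nodes, packed_edges

# Pack the small grains "b" and "c" together with "a" into one larger grain.
nodes, edges = pack(nodes, edges, {"a", "b", "c"}, "abc")
print(nodes)   # {'d': 6, 'abc': 6}
print(edges)   # [('abc', 'd', 'u', 4), ('abc', 'd', 'v', 4)]
```

Note how the two delays of 6 between "a" and the small grains disappear after packing, which is exactly the overhead the technique aims to remove.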

Scheduling

A schedule is a mapping of nodes to processors and start times such that communication delay requirements are observed and no two nodes execute on the same processor at the same time; a small scheduling sketch follows the goals below.

Some general scheduling goals:

Schedule all fine-grain activities in a node to the same processor to minimize communication delays.

Select grain sizes for packing to achieve better schedules for a particular parallel machine.
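The sketch below illustrates the scheduling idea on a small, assumed task graph: a greedy list scheduler assigns each node a processor and a start time, charging a predecessor's communication delay only when the two nodes land on different processors.

```python
# A minimal list-scheduling sketch over a small, hypothetical task graph.
# Each node gets a processor and a start time; a predecessor's communication
# delay is charged only when the two nodes run on different processors.

size = {"a": 2, "b": 3, "c": 3, "d": 2}      # node -> grain size (run time)
deps = {                                      # node -> [(predecessor, delay)]
    "a": [],
    "b": [("a", 1)],
    "c": [("a", 1)],
    "d": [("b", 1), ("c", 1)],
}

def list_schedule(size, deps, num_procs=2):
    """Greedily place each ready node on the processor where it can start
    earliest, respecting precedence and communication delays."""
    finish, placed = {}, {}                   # node -> finish time / processor
    proc_free = [0] * num_procs               # earliest free time per processor
    remaining = dict(deps)
    while remaining:
        # pick a node whose predecessors have all been scheduled
        node = next(n for n, ps in remaining.items()
                    if all(p in finish for p, _ in ps))
        best_proc, best_start = None, None
        for proc in range(num_procs):
            start = max([proc_free[proc]] +
                        [finish[p] + (0 if placed[p] == proc else d)
                         for p, d in remaining[node]])
            if best_start is None or start < best_start:
                best_proc, best_start = proc, start
        placed[node] = best_proc
        finish[node] = best_start + size[node]
        proc_free[best_proc] = finish[node]
        del remaining[node]
    return placed, finish

placed, finish = list_schedule(size, deps)
print(placed)                 # {'a': 0, 'b': 0, 'c': 1, 'd': 1}
print(max(finish.values()))   # 8 -- versus 10 if everything ran on one processor
```

With larger communication delays than grain sizes, the same scheduler keeps everything on one processor, which is the situation grain packing is meant to improve.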

Ques 2 Explain Gustafson's law for the scaled problem.

Gustafson's Law

Gustafson's Law (also known as Gustafson-Barsis' law) is a law in computer science which states that any sufficiently large problem can be efficiently parallelized. Gustafson's Law is closely related to Amdahl's law, which gives a limit to the degree to which a program can be sped up by parallelization. It was first described by John L. Gustafson.

The scaled speedup is

S(P) = P - α(P - 1)

where P is the number of processors, S is the speedup, and α is the non-parallelizable fraction of the process.
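As a quick check of the formula, the short sketch below evaluates the scaled speedup for a couple of assumed values of P and α (the 5% serial fraction is an illustration, not a measured figure).

```python
# Scaled speedup per Gustafson's law, S(P) = P - alpha * (P - 1).
# The processor counts and serial fraction are assumed example values.
def gustafson_speedup(P, alpha):
    return P - alpha * (P - 1)

print(gustafson_speedup(64, 0.05))    # 60.85: near-linear scaled speedup
print(gustafson_speedup(1024, 0.05))  # 972.85
```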

Gustafson's law addresses a shortcoming of Amdahl's law, which cannot scale to match the availability of computing power as the machine size increases. It removes the fixed problem size, or fixed computation load, on the parallel processors: instead, Gustafson proposed a fixed-time concept, which leads to scaled speedup for larger problem sizes.

Amdahl's law is based on a fixed workload or fixed problem size. It implies that the sequential part of a program does not change with respect to machine size (i.e., the number of processors), while the parallel part is evenly distributed over n processors.

The impact of the law was a shift in research toward problems for which solving a larger instance in the same amount of time is considered desirable.

Gustafson's Law approximately states:

Suppose a car has already been traveling for some time at less than 90 mph. Given enough time and distance to travel, the car's average speed can always eventually reach 90 mph, no matter how long or how slowly it has already traveled. For example, if the car spent one hour at 30 mph, it could achieve this by driving at 120 mph for two additional hours, or at 150 mph for one hour, and so on.




Limitations

Some problems do not have fundamentally larger datasets. For example, processing one data point per world citizen grows by only a few percent per year. The principal point of Gustafson's law is that such problems are not likely to be the most fruitful applications of parallelism.

Nonlinear algorithms may make it hard to take advantage of the parallelism "exposed" by Gustafson's law. Snyder points out that for an O(N³) algorithm, doubling the concurrency gives only about a 26% increase in problem size (the cube root of 2 is about 1.26). Thus, while it may be possible to occupy vast concurrency, doing so may bring little advantage over the original, less concurrent solution; in practice, however, there have been massive improvements.
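The 26% figure follows directly from how the work grows with N; the short sketch below repeats the calculation for a few assumed complexity exponents.

```python
# If doubling the processors lets us do double the work in the same time,
# an O(N**k) algorithm only lets the data size N grow by 2**(1/k).
# The exponents below are illustrative.
for k in (1, 2, 3):
    growth = 2 ** (1 / k)
    print(f"O(N^{k}): problem size grows by about {(growth - 1) * 100:.0f}%")
# O(N^1): problem size grows by about 100%
# O(N^2): problem size grows by about 41%
# O(N^3): problem size grows by about 26%
```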

Hill and Marty also emphasize that methods of speeding up sequential execution are still needed, even for multicore machines. They point out that locally inefficient methods can be globally efficient when they reduce the sequential phase.

Ques 3 Explain the properties of an interconnection network.



Network diameter = the maximum number of hops necessary to link the two most distant processors.

Network bisection width = the minimum number of links to be severed for the network to be cut into two halves (give or take one processor).

Network bisection bandwidth = the minimum sum of the bandwidths of the links to be severed for the network to be cut into two halves (give or take one processor).

Maximum degree of PEs = the maximum number of links to/from one PE.

Minimum degree of PEs = the minimum number of links to/from one PE.

These quantities are illustrated for a hypercube in the sketch after the table below.







Network Topology     Number of Nodes     Node Degree
Linear and Ring      d                   2
Shuffle-Exchange     2^d                 3
2D Mesh              d^2                 4
Hypercube            2^d                 d
Star                 m!                  m-1
De Bruijn            2^d                 4
Binary Tree          2^d - 1             3
Butterfly            (d+1)*2^d           d+1
Omega                2^d                 2
Pyramid              (4^(d+1) - 1)/3     9
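The definitions above and the hypercube row of the table can be checked with a small sketch: the code below builds a d-dimensional hypercube explicitly and reports its node count, node degree, diameter, and bisection width (d = 3 is an arbitrary illustration).

```python
# Properties of a d-dimensional hypercube, as a check against the table
# above: 2^d nodes, node degree d, diameter d (max hops between the two
# most distant nodes), and bisection width 2^(d-1).
from itertools import combinations

def hypercube_properties(d):
    n = 2 ** d
    # Two nodes are linked iff their binary labels differ in exactly one bit.
    links = [(u, v) for u, v in combinations(range(n), 2)
             if bin(u ^ v).count("1") == 1]
    degree = max(sum(1 for u, v in links if w in (u, v)) for w in range(n))
    # The farthest pair (e.g. node 0 and node n-1) differ in all d bits,
    # so the diameter is d.
    diameter = d
    # Links crossing the halves {0..n/2-1} and {n/2..n-1}.
    bisection = sum(1 for u, v in links if (u < n // 2) != (v < n // 2))
    return n, degree, diameter, bisection

print(hypercube_properties(3))   # (8, 3, 3, 4)
```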


Ques 4 Write a short note on the parallel computer model.

A Parallel Machine Model

The rapid penetration of computers into commerce, science, and education owed much to the early standardization on a single machine model, the von Neumann computer. A von Neumann computer comprises a central processing unit (CPU) connected to a storage unit (memory) (Figure 1.4). The CPU executes a stored program that specifies a sequence of read and write operations on the memory. This simple model has proved remarkably robust. Its persistence over more than forty years has allowed the study of such important topics as algorithms and programming languages to proceed to a large extent independently of developments in computer architecture. Consequently, programmers can be trained in the abstract art of "programming" rather than the craft of "programming machine X" and can design algorithms for an abstract von Neumann machine, confident that these algorithms will execute on most target computers with reasonable efficiency.







Figure 1.4: The von Neumann computer. A central processing unit (CPU) executes a program that performs a sequence of read and write operations on an attached memory.


Our study of parallel programming will be most rewarding if we can identify a parallel machine model that is as general and useful as the von Neumann sequential machine model. This machine model must be both simple and realistic: simple to facilitate understanding and programming, and realistic to ensure that programs developed for the model execute with reasonable efficiency on real computers.

1.2.1 The Multicomputer

A parallel machine model called the multicomputer fits these requirements. As illustrated in Figure 1.5, a multicomputer comprises a number of von Neumann computers, or nodes, linked by an interconnection network. Each computer executes its own program. This program may access local memory and may send and receive messages over the network. Messages are used to communicate with other computers or, equivalently, to read and write remote memories. In the idealized network, the cost of sending a message between two nodes is independent of both node location and other network traffic, but does depend on message length.




Figure 1.5: The multicomputer, an idealized parallel computer model. Each node consists of a von Neumann machine: a CPU and memory. A node can communicate with other nodes by sending and receiving messages over an interconnection network.


A defining attribute of the multicomputer model is that accesses to local (same-node) memory are less expensive than accesses to remote (different-node) memory. That is, read and write are less costly than send and receive. Hence, it is desirable that accesses to local data be more frequent than accesses to remote data. This property, called locality, is a third fundamental requirement for parallel software, in addition to concurrency and scalability.

The importance of locality depends on the ratio of remote to local access costs. This ratio can vary from 10:1 to 1000:1 or greater, depending on the relative performance of the local computer, the network, and the mechanisms used to move data to and from the network.
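The idealized multicomputer cost model can be sketched in a few lines; the startup, per-word, and local-access times below are assumed illustration values rather than measurements of any real machine.

```python
# Idealized multicomputer cost model: sending a message costs a fixed
# startup time plus a per-word transfer time, independent of which nodes
# communicate and of other traffic. The constants are assumed values.
T_STARTUP = 100e-6     # seconds per message (assumed)
T_WORD    = 1e-6       # seconds per word transferred (assumed)
T_LOCAL   = 0.01e-6    # seconds per local memory access (assumed)

def remote_access_cost(words):
    """Cost of reading/writing `words` words on another node (one message)."""
    return T_STARTUP + T_WORD * words

def local_access_cost(words):
    return T_LOCAL * words

words = 1000
ratio = remote_access_cost(words) / local_access_cost(words)
print(f"remote/local cost ratio for {words} words: {ratio:.0f}:1")
# With these assumed constants the ratio is 110:1, inside the 10:1 to
# 1000:1 range mentioned above -- which is why locality matters.
```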

1.2.2 Other Machine Models

Figure 1.6: Classes of parallel computer architecture. From top to bottom: a distributed-memory MIMD computer with a mesh interconnect, a shared-memory multiprocessor, and a local area network (in this case, an Ethernet). In each case, P denotes an independent processor.




We review important parallel computer architectures (several are illustrated in Figure 1.6) and discuss briefly how these differ from the idealized multicomputer model.

The multicomputer is most similar to what is often called the distributed-memory MIMD (multiple instruction multiple data) computer. MIMD means that each processor can execute a separate stream of instructions on its own local data; distributed memory means that memory is distributed among the processors, rather than placed in a central location. The principal difference between a multicomputer and the distributed-memory MIMD computer is that in the latter, the cost of sending a message between two nodes may not be independent of node location and other network traffic. These issues are discussed in Chapter 3.


Examples of this class of machine include the IBM SP, Intel Paragon, Thinking Machines CM5, Cray T3D, Meiko CS-2, and nCUBE.

Another important class of parallel computer is the multiprocessor, or shared-memory MIMD computer. In multiprocessors, all processors share access to a common memory, typically via a bus or a hierarchy of buses. In the idealized Parallel Random Access Machine (PRAM) model, often used in theoretical studies of parallel algorithms, any processor can access any memory element in the same amount of time. In practice, scaling this architecture usually introduces some form of memory hierarchy; in particular, the frequency with which the shared memory is accessed may be reduced by storing copies of frequently used data items in a cache associated with each processor. Access to this cache is much faster than access to the shared memory; hence, locality is usually important, and the differences between multicomputers and multiprocessors are really just questions of degree. Programs developed for multicomputers can also execute efficiently on multiprocessors, because shared memory permits an efficient implementation of message passing. Examples of this class of machine include the Silicon Graphics Challenge, Sequent Symmetry, and the many multiprocessor workstations.

A more specialized class of parallel computer is the SIMD (single instruction multiple data) computer. In SIMD machines, all processors execute the same instruction stream on a different piece of data. This approach can reduce both hardware and software complexity but is appropriate only for specialized problems characterized by a high degree of regularity, for example, image processing and certain numerical simulations. Multicomputer algorithms cannot in general be executed efficiently on SIMD computers. The MasPar MP is an example of this class of machine.

Two classes of computer system that are sometimes used as parallel computers are the local area network (LAN), in which computers in close physical proximity (e.g., the same building) are connected by a fast network, and the wide area network (WAN), in which geographically distributed computers are connected. Although systems of this sort introduce additional concerns such as reliability and security, they can be viewed for many purposes as multicomputers, albeit with high remote-access costs. Ethernet and asynchronous transfer mode (ATM) are commonly used network technologies.