# Course Code: MCSE-011, Course Title: Parallel Computing

1 Dec 2013

Question.1 Using Bernstein’s conditions, detect maximum parallelism between the instructions of the following code:

P1: X = Y * Z

P2: P = Q + X

P3: R = T + X

P4: X = S + P

P5: V = Q / Z

Hint: The input and output sets for the above instructions are:

I1 = {Y, Z}, O1 = {X}

I2 = {Q, X}, O2 = {P}

I3 = {T, X}, O3 = {R}

I4 = {S, P}, O4 = {X}

I5 = {Q, Z}, O5 = {V}

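Two statements Pi and Pj can be executed in parallel only if all three of Bernstein’s conditions hold: Ii ∩ Oj = ∅, Ij ∩ Oi = ∅ and Oi ∩ Oj = ∅. Applying these conditions to every pair, we get:

I1 ∩ O5 = ∅, I5 ∩ O1 = ∅ and O1 ∩ O5 = ∅. Therefore statements P1 and P5 can be executed in parallel.

I2 ∩ O3 = ∅, I3 ∩ O2 = ∅ and O2 ∩ O3 = ∅. Therefore statements P2 and P3 can be executed in parallel.

I2 ∩ O5 = ∅, I5 ∩ O2 = ∅ and O2 ∩ O5 = ∅. Therefore statements P2 and P5 can be executed in parallel.

I3 ∩ O5 = ∅, I5 ∩ O3 = ∅ and O3 ∩ O5 = ∅. Therefore statements P3 and P5 can be executed in parallel.

I4 ∩ O5 = ∅, I5 ∩ O4 = ∅ and O4 ∩ O5 = ∅. Therefore statements P4 and P5 can be executed in parallel.

All other pairs are dependent. P2 and P3 both read the X written by P1 (for example, I3 ∩ O1 = {X}, so P1 and P3 cannot run in parallel), P4 overwrites the X that P1 writes and that P2 and P3 read, and P4 reads the P written by P2. The maximum parallelism is therefore among P2, P3 and P5, which are pairwise independent.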
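The pairwise check can be sketched in a few lines of Python. This is only an illustration: the dictionary `STATEMENTS` and the helper `can_run_in_parallel` are names invented for this sketch, and the sets are copied from the hint.

```python
from itertools import combinations

# Input (read) and output (write) sets for P1..P5, copied from the hint.
STATEMENTS = {
    "P1": ({"Y", "Z"}, {"X"}),
    "P2": ({"Q", "X"}, {"P"}),
    "P3": ({"T", "X"}, {"R"}),
    "P4": ({"S", "P"}, {"X"}),
    "P5": ({"Q", "Z"}, {"V"}),
}


def can_run_in_parallel(a, b):
    """Return True when statements a and b satisfy all three Bernstein conditions."""
    in_a, out_a = STATEMENTS[a]
    in_b, out_b = STATEMENTS[b]
    # Each intersection must be empty: no anti, flow or output dependence.
    return not (in_a & out_b) and not (in_b & out_a) and not (out_a & out_b)


# Test every unordered pair of statements.
parallel_pairs = [
    pair for pair in combinations(STATEMENTS, 2) if can_run_in_parallel(*pair)
]
print(parallel_pairs)
```

Because P2, P3 and P5 appear together in every pair among themselves, they form the largest group that can run at the same time.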

Question.2 Discuss, along with diagram, an arithmetic pipeline for Multiplication of two 8-digit fixed numbers.

Hint: Arithmetic pipelining for multiplication of two 8-digit fixed numbers

X7 X6 X5 X4 X3 X2 X1 X0 = X

Y7 Y6 Y5 Y4 Y3 Y2 Y1 Y0 = Y

X7Y0 X6Y0 X5Y0 X4Y0 X3Y0 X2Y0 X1Y0 X0Y0 = P1

X7Y1 X6Y1 X5Y1 X4Y1 X3Y1 X2Y1 X1Y1 X0Y1 = P2

X7Y2 X6Y2 X5Y2 X4Y2 X3Y2 X2Y2 X1Y2 X0Y2 = P3

X7Y3 X6Y3 X5Y3 X4Y3 X3Y3 X2Y3 X1Y3 X0Y3 = P4

X7Y4 X6Y4 X5Y4 X4Y4 X3Y4 X2Y4 X1Y4 X0Y4 = P5

X7Y5 X6Y5 X5Y5 X4Y5 X3Y5 X2Y5 X1Y5 X0Y5 = P6

X7Y6 X6Y6 X5Y6 X4Y6 X3Y6 X2Y6 X1Y6 X0Y6 = P7

X7Y7 X6Y7 X5Y7 X4Y7 X3Y7 X2Y7 X1Y7 X0Y7 = P8

The pipeline has the following stages:

1. The first stage generates the partial products of the numbers, which form the eight rows of shifted multiplicands (P1 to P8).

2. In the second stage, the eight numbers are given to two carry-save adders (CSAs), which merge them into six numbers.

3. In the third stage, a single CSA merges the six numbers into five numbers.

4. Similarly, the next stages merge five numbers into four, four numbers into three and three numbers into two, since each CSA replaces three operands with two.

5. In the last stage, the two remaining numbers are added through a carry-propagate adder (CPA) to get the final result.

X and Y are two 8-digit fixed numbers, so the arithmetic pipeline for multiplication of two 8-digit fixed numbers is built from the stages listed above (the diagram itself is not reproduced in this text).
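The stage-by-stage reduction can be imitated in software. The sketch below assumes the 8-digit numbers are 8-bit binary integers and models each CSA with bitwise operations; the function names (`csa`, `partial_products`, `reduce_with_csas`, `pipelined_multiply`) are invented for this illustration and do not describe any particular hardware design.

```python
def csa(a, b, c):
    """Carry-save adder: three operands in, two operands out, same total."""
    partial_sum = a ^ b ^ c                      # bitwise sum without carries
    carry = ((a & b) | (b & c) | (a & c)) << 1   # carries, moved one place left
    return partial_sum, carry


def partial_products(x, y, bits=8):
    """Stage 1: one shifted copy of x for every bit of y (zero if the bit is 0)."""
    return [(x << i) if (y >> i) & 1 else 0 for i in range(bits)]


def reduce_with_csas(numbers, csa_count):
    """Feed the first 3 * csa_count numbers through CSAs, pass the rest along."""
    out = []
    for k in range(csa_count):
        out.extend(csa(*numbers[3 * k:3 * k + 3]))
    out.extend(numbers[3 * csa_count:])
    return out


def pipelined_multiply(x, y, bits=8):
    rows = partial_products(x, y, bits)   # 8 rows
    rows = reduce_with_csas(rows, 2)      # two CSAs: 8 -> 6
    while len(rows) > 2:
        rows = reduce_with_csas(rows, 1)  # one CSA per stage: 6 -> 5 -> 4 -> 3 -> 2
    return rows[0] + rows[1]              # final carry-propagate addition


print(pipelined_multiply(173, 219) == 173 * 219)
```

Each CSA call keeps the total unchanged while removing one operand, which is why only the final addition has to propagate carries.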

Question.3 Discuss important environment features for parallel programming.

Hint:

The parallel programming environment consists of an editor, a debugger, a performance evaluator and a program visualizer for enhancing the output of parallel computation. All programming environments have these tools in one form or another. Based on the features of the available tool sets, programming environments are classified as basic, limited and well developed.

a. A basic environment provides simple facilities for program tracing and debugging.

b. A limited integration environment provides some additional tools for parallel debugging and performance evaluation.

c. A well-developed environment provides the most advanced tools for debugging programs, for textual graphics interaction and for parallel graphics handling.

There are certain parallel overheads associated with parallel computing. Parallel overhead is the amount of time required to coordinate parallel tasks, as opposed to doing useful work. It includes the following factors:

i.

ii. Synchronization.

iii. Data communication.

Besides these hardware overheads, there are certain software overheads imposed by parallel compilers, libraries, tools and operating systems.
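Overhead can be estimated from measured run times. The sketch below uses one common textbook formulation, which is an assumption here and is not stated in the text above: total overhead is the processor time spent beyond the serial run time, p × Tp − T1. The function names and the sample timings are hypothetical.

```python
def total_overhead(serial_time, parallel_time, processors):
    """Processor-seconds not spent on useful work (assumed model: p * Tp - T1)."""
    return processors * parallel_time - serial_time


def efficiency(serial_time, parallel_time, processors):
    """Fraction of the total processor time that went into useful work."""
    return serial_time / (processors * parallel_time)


# Hypothetical measurement: 100 s serially, 30 s on 4 processors.
print(total_overhead(100.0, 30.0, 4))   # processor-seconds lost to coordination
print(efficiency(100.0, 30.0, 4))
```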

Parallel programming languages are developed for parallel computer environments. They are developed either by introducing new languages or by modifying existing languages. Normally, the language extension approach is preferred in most computer designs, because it reduces compatibility problems. High-level parallel constructs were added to FORTRAN and C to make these languages suitable for parallel computers. Besides these, optimizing compilers are designed to automatically detect the parallelism in program code and convert the code to parallel code.

Question.4 Explain the following:

a) Concurrent and parallel executions.

Hint: Concurrency and parallelism are NOT the same thing. Two tasks T1 and T2 are concurrent if the order in which the two tasks are executed in time is not predetermined:

T1 may be executed and finished before T2,

T2 may be executed and finished before T1,

T1 and T2 may be executed simultaneously at the same instant of time (parallelism),

T1 and T2 may be executed alternately.

If two concurrent threads are scheduled by the OS to run on one single-core non-SMT non-CMP processor, you may get concurrency but not parallelism. Parallelism is possible on multi-core, multi-processor or distributed systems.
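Concurrency without parallelism can be illustrated with a toy scheduler. In this sketch, generators stand in for threads and a round-robin loop stands in for the OS scheduler on a single core; `task` and `run_on_single_core` are names invented for the example.

```python
def task(name, steps):
    """A task that hands control back to the scheduler after every step."""
    for i in range(1, steps + 1):
        yield f"{name} step {i}"


def run_on_single_core(tasks):
    """Round-robin scheduler: only one task makes progress at any instant."""
    trace = []
    ready = list(tasks)
    while ready:
        current = ready.pop(0)
        try:
            trace.append(next(current))
            ready.append(current)   # not finished: back to the end of the queue
        except StopIteration:
            pass                    # finished: drop it from the queue
    return trace


trace = run_on_single_core([task("T1", 2), task("T2", 3)])
print(trace)
```

The two tasks overlap in time (their steps alternate), yet no two steps ever run at the same instant, which is the single-core case described above.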

Concurrency is often referred to as a property of a program, and is a concept more general than parallelism.

Interestingly, we cannot say the same thing for concurrent programming and parallel programming. They overlap, but neither is a superset of the other. The difference comes from the sets of topics the two areas cover. For example, concurrent programming includes topics like signal handling, while parallel programming includes topics like memory consistency models. The difference reflects the different original hardware and software backgrounds of the two programming practices.

b) Granularity of a parallel system.

Hint: Granularity is the extent to which a system is broken down into small parts, either the system itself or its description or observation. It is the "extent to which a larger entity is subdivided. For example, a yard broken into inches has finer granularity than a yard broken into feet."

In parallel computing, granularity means the amount of computation in relation to communication, i.e., the ratio of computation to the amount of communication.

Fine-grained parallelism means individual tasks are relatively small in terms of code size and execution time. The data are transferred among processors frequently in amounts of one or a few memory words. Coarse-grained is the opposite: data are communicated infrequently, after larger amounts of computation.

The finer the granularity, the greater the potential for parallelism and hence speed-up, but the greater the overhead of communication.

In order to attain the best parallel performance, the best balance between load and communication overhead needs to be found. If the granularity is too fine, the performance can suffer from the increased communication overhead. On the other hand, if the granularity is too coarse, the performance can suffer from load imbalance.
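The trade-off can be made concrete with a deliberately simple cost model. Everything in this sketch is an assumption made for illustration: the work is split into equal tasks, tasks are dealt out evenly, each finished task costs one fixed-size message, and the numbers are invented.

```python
def parallel_time(total_work, task_count, processors, time_per_unit, message_cost):
    """Estimated run time when total_work is split into task_count equal tasks."""
    # The busiest processor receives ceil(task_count / processors) tasks.
    tasks_on_busiest = -(-task_count // processors)
    compute = tasks_on_busiest * (total_work / task_count) * time_per_unit
    communicate = tasks_on_busiest * message_cost   # one message per finished task
    return compute + communicate


# Hypothetical workload: 1200 work units on 4 processors, 10 time units per message.
coarse = parallel_time(1200, 5, 4, 1.0, 10.0)      # 5 big tasks: uneven load
medium = parallel_time(1200, 40, 4, 1.0, 10.0)     # 40 tasks: balanced, few messages
fine = parallel_time(1200, 1200, 4, 1.0, 10.0)     # 1200 tiny tasks: message-bound
print(coarse, medium, fine)
```

Under these assumptions the medium granularity gives the shortest time: the coarse split leaves one processor with twice the work of the others, and the fine split spends most of its time on messages.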

Question.5 What are the advantages of parallel processing over sequential computations? Also explain the various levels of parallel processing.

Hint: Following are the advantages of parallel processing over sequential computation.

Save time and/or money: In theory, throwing more resources at a task will shorten its time to completion, with potential cost savings. Parallel clusters can be built from cheap, commodity components.

Solve larger problems: Many problems are so large and/or complex that it is impractical or impossible to solve them on a single computer, especially given limited computer memory. For example:

Web search engines/databases processing millions of transactions per second

Provide concurrency: A single compute resource can only do one thing at a time. Multiple computing resources can be doing many things simultaneously. For example, the Access Grid provides a global collaboration network where people from around the world can meet and conduct work "virtually".

Use of non-local resources: Using compute resources on a wide area network, or even the Internet, when local compute resources are scarce. For example:

SETI@home (setiathome.berkeley.edu) uses over 330,000 computers for a compute power over 528 TeraFLOPS (as of August 04, 2008)

Limits to serial computing: Both physical and practical reasons pose significant constraints to simply building ever faster serial computers:

Transmission speeds - the speed of a serial computer is directly dependent upon how fast data can move through hardware. Absolute limits are the speed of light (30 cm/nanosecond) and the transmission limit of copper wire (9 cm/nanosecond). Increasing speeds necessitate increasing proximity of processing elements.

Limits to miniaturization - processor technology is allowing an increasing number of transistors to be placed on a chip. However, even with molecular or atomic-level components, a limit will be reached on how small components can be.

Economic limitations - it is increasingly expensive to make a single processor faster. Using a larger number of moderately fast commodity processors to achieve the same (or better) performance is less expensive.

Current computer architectures are increasingly relying upon hardware level parallelism to improve performance:

Multiple execution units

Pipelined instructions

Multi-core

Parallel processing can be pursued at four programmatic levels:

Job / Program Level

It requires the development of parallel processable algorithms. The implementation of parallel algorithms depends on the efficient allocation of limited hardware and software resources to multiple programs being used to solve a large computational problem. Examples: weather forecasting, medical consulting, oil exploration etc.

Task / Procedure Level

It is conducted among procedures/tasks within the same program. This involves the decomposition of the program into multiple tasks (for simultaneous execution).

Inter-instruction Level

Inter-instruction level is to exploit concurrency among multiple instructions so that they can be executed simultaneously. Data dependency analysis is often performed to reveal parallelism among instructions. Vectorization may be desired for scalar operations within DO loops.

Intra-instruction Level

Intra-instruction level exploits faster and concurrent operations within each instruction e.g. use of carry

Question.6 a) What are various criteria for classification of parallel computers?

Hint: Following are the criteria for classification of parallel computers:

1. Organisation of the control and data flow.

3. Use of physical memory.

b) Define and discuss instruction and data streams.

Hint: A data stream is a sequence of digitally encoded coherent signals (packets of data or data packets) used to transmit or receive information that is in transmission. In electronics and computer architecture, a data stream determines at which time which data item is scheduled to enter or leave which port of a systolic array, a Reconfigurable Data Path Array or similar pipe network, or other processing unit or block. Often the data stream is seen as the counterpart of an instruction stream, since the von Neumann machine is instruction-stream-driven, whereas its counterpart, the Anti machine, is data-stream-driven.

The term "data stream" has many more meanings, such as by the definition from the context of systolic arrays.

In a formal way, a data stream is any ordered pair (s, Δ) where:

1. s is a sequence of tuples and

2. Δ is a sequence of positive

Question.7 Differentiate between UMA, NUMA and COMA. Also explain loosely coupled systems and tightly coupled systems.

Hint:

UMA

Uniform Memory Access (UMA) is a shared memory architecture used in parallel computers. All the processors in the UMA model share the physical memory uniformly. In a UMA architecture, access time to a memory location is independent of which processor makes the request or which memory chip contains the transferred data. Uniform Memory Access computer architectures are often contrasted with Non-Uniform Memory Access (NUMA) architectures.

In the UMA architecture, each processor may use a private cache. Peripherals are also shared in some fashion. The UMA model is suitable for general purpose and time sharing applications by multiple users. It can be used to speed up the execution of a single large program in time critical applications.

Types of UMA architectures

1. UMA using bus-based SMP architectures

2. UMA using crossbar switches

3. UMA using multistage switching networks

NUMA

Non-Uniform Memory Access or Non-Uniform Memory Architecture (NUMA) is a computer memory design used in multiprocessors, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors. NUMA architectures logically follow in scaling from symmetric multiprocessing (SMP) architectures.

COMA

Shared memory multiprocessor systems may use cache memories with every processor for reducing the execution time of an instruction. Thus, if cache memories are used instead of local memories in the NUMA model, it becomes the Cache Only Memory Access (COMA) model. The collection of cache memories forms a global memory space. The remote cache access is also non-uniform in this model.

Tightly coupled versus loosely coupled systems

When multiprocessors communicate through global shared memory modules, the organisation is called a shared memory computer or tightly coupled system. When every processor in a multiprocessor system has its own local memory and the processors communicate via messages transmitted between their local memories, the organisation is called a distributed memory computer or loosely coupled system.
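The two communication styles can be contrasted in a small program. This is an analogy only: both functions run threads inside one Python process, so the "loosely coupled" version merely imitates message passing between private memories. The function names are invented for the sketch.

```python
import queue
import threading


def shared_memory_sum(chunks):
    """Tightly coupled style: workers update one shared variable under a lock."""
    total = {"value": 0}
    lock = threading.Lock()

    def worker(chunk):
        partial = sum(chunk)
        with lock:                      # synchronise access to the shared memory
            total["value"] += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total["value"]


def message_passing_sum(chunks):
    """Loosely coupled style: workers keep private data and send results as messages."""
    mailbox = queue.Queue()

    def worker(chunk):
        mailbox.put(sum(chunk))         # the only interaction is a message

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(mailbox.get() for _ in chunks)


data = [[1, 2, 3], [4, 5], [6]]
print(shared_memory_sum(data), message_passing_sum(data))
```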

Question.8 What are the various rules and operators used in Handler’s classification for various machine types? What is the base for structural classification of parallel computers?

Hint: The following rules and operators are used to show the relationship between various elements of the computer:

The ‘*’ operator is used to indicate that the units are pipelined or macro-pipelined with a stream of data running through all the units.

The ‘~’ symbol is used to indicate a range of values for any one of the parameters.

If the value of the second element of any pair is 1, it may be omitted for brevity.
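The brevity rule can be shown with a small formatter. The sketch assumes a descriptor made of three pairs (count, pipelined count) joined by ‘*’; the helper names and the sample values are hypothetical and are not taken from any real machine description.

```python
def format_pair(count, pipelined=1):
    """Write a pair as 'count*pipelined', omitting the second element when it is 1."""
    return str(count) if pipelined == 1 else f"{count}*{pipelined}"


def handler_triple(control, alu, word):
    """Join the three pairs of a descriptor into one parenthesised string."""
    return "(" + ", ".join(format_pair(*pair) for pair in (control, alu, word)) + ")"


# Hypothetical machine: 1 control unit, 4 ALUs, 64-bit words with 8 pipeline stages.
print(handler_triple((1, 1), (4, 1), (64, 8)))
```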

Question.9 Determine the dependency relations among the following instructions:

I1: a = b+c;

I2: b = a+d;

I3: e = a/ f;

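Hint: Write In(Ik) for the input set and Out(Ik) for the output set of instruction Ik. For an earlier instruction Ii and a later instruction Ij, Bernstein’s conditions give:

If In(Ii) ∩ Out(Ij) ≠ ∅ then the statements are anti-dependent.

If In(Ij) ∩ Out(Ii) ≠ ∅ then the statements are flow dependent.

If Out(Ii) ∩ Out(Ij) ≠ ∅ then the statements are output dependent.

Here In(I1) = {b, c}, Out(I1) = {a}; In(I2) = {a, d}, Out(I2) = {b}; In(I3) = {a, f}, Out(I3) = {e}. Applying the conditions to these statements we get:

In(I1) ∩ Out(I2) = {b}, so I1 and I2 are anti-dependent.

In(I2) ∩ Out(I1) = {a}, so I1 and I2 are flow dependent.

In(I3) ∩ Out(I1) = {a}, so I1 and I3 are flow dependent.

Out(I1), Out(I2) and Out(I3) are pairwise disjoint, so there is no output dependence, and In(I2) ∩ Out(I3) = In(I3) ∩ Out(I2) = ∅.

Instructions I1 and I2 are therefore both flow dependent and anti-dependent, instructions I1 and I3 are flow dependent, and instructions I2 and I3 are independent.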
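The dependence types for I1 to I3 can be derived mechanically from the read and write sets. In this sketch the dictionary `INSTRUCTIONS` and the function `dependences` are names invented for the illustration; the sets are read off the three assignments in the question.

```python
# (variables read, variables written) for each instruction in program order.
INSTRUCTIONS = {
    "I1": ({"b", "c"}, {"a"}),   # a = b + c
    "I2": ({"a", "d"}, {"b"}),   # b = a + d
    "I3": ({"a", "f"}, {"e"}),   # e = a / f
}


def dependences(first, second):
    """Classify how `second` depends on `first` (first comes earlier in the program)."""
    in1, out1 = INSTRUCTIONS[first]
    in2, out2 = INSTRUCTIONS[second]
    found = {}
    if out1 & in2:
        found["flow"] = out1 & in2      # second reads what first wrote
    if in1 & out2:
        found["anti"] = in1 & out2      # second overwrites what first read
    if out1 & out2:
        found["output"] = out1 & out2   # both write the same variable
    return found


for pair in (("I1", "I2"), ("I1", "I3"), ("I2", "I3")):
    print(pair, dependences(*pair))
```

An empty result for a pair means all three intersections are empty, so the two instructions are independent.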

Question.10 Explain dataflow computation model.

Hint: A dataflow program is a graph, where nodes represent operations and edges represent data paths. Dataflow is a distributed model of computation: there is no single locus of control. It is asynchronous: execution ("firing") of a node starts when (matching) data is available at a node's input ports. In the original dataflow models, data tokens are consumed when the node fires. Some models were extended with "sticky" tokens: tokens that stay, much like a constant input, and match with tokens arriving on other inputs. Nodes can have varying granularity, from instructions to functions.
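A toy interpreter shows the firing rule in action. The graph below, for the expression (a + b) * (a - c), and the function `run_dataflow` are hypothetical; for simplicity tokens are kept after use (so they behave like the "sticky" tokens mentioned above) rather than being consumed.

```python
import operator

# Each node: name -> (operation, names of the inputs it waits for).
GRAPH = {
    "sum": (operator.add, ["a", "b"]),
    "diff": (operator.sub, ["a", "c"]),
    "product": (operator.mul, ["sum", "diff"]),
}


def run_dataflow(graph, initial_tokens):
    """Fire every node as soon as all of its input tokens are available."""
    tokens = dict(initial_tokens)
    pending = dict(graph)
    firing_order = []
    while pending:
        ready = [name for name, (_, inputs) in pending.items()
                 if all(i in tokens for i in inputs)]
        if not ready:
            raise ValueError("no node can fire: some input token is missing")
        for name in ready:               # nodes in `ready` are mutually independent
            op, inputs = pending.pop(name)
            tokens[name] = op(*(tokens[i] for i in inputs))
            firing_order.append(name)
    return tokens, firing_order


tokens, firing_order = run_dataflow(GRAPH, {"a": 5, "b": 3, "c": 2})
print(tokens["product"], firing_order)
```

No program counter decides the order: "sum" and "diff" become ready together once a, b and c are present, and "product" fires only after both of their results exist.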

c) Enumerate applications of parallel processing.

Hint:

Scientific Application

Image processing

Engineering Application

Database query

AI Application

Mathematical Simulation

Modeling Application

Weather forecasting

Predicting results of chemical and nuclear reactions

DNA structures of various species