# Read details - Swati Jain Education Group!

Dec 1, 2013 (4 years and 7 months ago)

Parallel computing

The term parallel computing implies many calculations carried out simultaneously. But why do we want it? Because we want our computers to do things like predict the climate 50 years from now, and such problems are estimated to require computers running at 1 Tflop = 1 Teraflop = 10^12 floating point operations per second, with a memory size of 1 TB = 1 Terabyte = 10^12 bytes. So our ultimate aim is to increase the overall speed of our computers as efficiently as possible.

Then the question arises: why only through parallelism?

The speed of light is an intrinsic limitation on the speed of computers. Suppose we wanted to build a completely sequential computer with 1 TB of memory running at 1 Tflop. If the data has to travel a distance r to get from the memory to the CPU, and it has to travel this distance 10^12 times per second at the speed of light c = 3e8 m/s, then r <= c/10^12 = 0.3 mm. So the computer has to fit into a box 0.3 mm on a side. Now consider the 1 TB memory. Memory is conventionally built as a planar grid of bits, in our case say a 10^6 by 10^6 grid of words. If this grid is 0.3 mm by 0.3 mm, then one word occupies about 3 Angstroms by 3 Angstroms, or the size of a small atom. It is hard to imagine where the wires would go!
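The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate: the constant and function names are illustrative, and the numbers are the ones quoted in the text.

```python
# Back-of-the-envelope check of the speed-of-light argument.
C = 3e8                   # speed of light in m/s (value used in the text)
RATE = 1e12               # memory-to-CPU trips per second at 1 Tflop
WORDS_PER_SIDE = 10**6    # memory laid out as a 10^6 x 10^6 grid of words


def max_distance_m(rate=RATE, c=C):
    """Largest distance r (in meters) light can cover `rate` times per second."""
    return c / rate


def word_pitch_m(side_m, words_per_side=WORDS_PER_SIDE):
    """Width (in meters) available to one word if the grid is side_m wide."""
    return side_m / words_per_side


r = max_distance_m()
pitch = word_pitch_m(r)
print(f"r     = {r * 1e3:.1f} mm")
print(f"pitch = {pitch * 1e10:.1f} Angstroms")
```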

So instead of having a single processor perform all the work, it would be better to have many processors working on the same or different data. In reality it is not that simple, because most software is optimized for only one CPU. An operating system can help considerably by distributing different programs over the CPUs, but this distribution has to be managed before it can be put to use.

Parallel computing is concerned with producing the desired results using multiple processors, at speeds that a single processor cannot reach. The problem to be solved is divided up among a number of processors, and the way a program is decomposed into subprograms depends on the program itself. Dividing the problem in a sensible and efficient manner is critical to achieving good performance on a parallel machine. The aim of efficient code is to keep all the processors busy (load balancing). Depending on the problem to be solved, the processors either perform the same job on different data (SIMD) or different jobs on different data (MIMD).

There are three standard functions any parallel architecture must provide: parallelism, interprocessor communication, and synchronization.

1]. Parallelism:- The two main styles of parallelism are SIMD and MIMD, or single-instruction-multiple-data and multiple-instruction-multiple-data.

2]. Communication:- It deals with how an architecture names the different memory locations to which instructions must refer. Since there are multiple memories belonging to multiple processors, there are two major ways to name their memory locations, called shared memory and distributed memory. With shared memory, each word in each memory has a unique address, which all processors agree on. On a distributed memory machine, explicit messages must be sent over the communication network, and they are processed differently from simple loads and stores.

3]. Synchronization:- Synchronization refers to the need for two or more processors to agree on a time or a value of some data. The three most common manifestations of synchronization are mutual exclusion, barriers, and memory consistency.

Mutual exclusion means permitting just one processor to access a particular memory location at a time; it provides a mechanism to avoid race conditions. A barrier involves each processor waiting at the same point in the program for all the others to "catch up" before proceeding. This is necessary to make sure that a parallel job on which all the processors had been cooperating is indeed finished. Memory consistency refers to a problem on shared memory machines with caches, when processors Proc_1 and Proc_2 both read the same memory location into their caches.
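Mutual exclusion and barriers can be illustrated with Python's standard `threading` module. The sketch below uses threads as stand-ins for processors, which is an assumption made purely for illustration; the function and variable names are hypothetical.

```python
import threading


def run_workers(num_workers=4, increments=1000):
    """Each worker updates a shared counter, then waits at a barrier."""
    counter = {"value": 0}
    lock = threading.Lock()                    # mutual exclusion
    barrier = threading.Barrier(num_workers)   # barrier synchronization
    seen_after_barrier = []

    def worker():
        for _ in range(increments):
            # Only one thread at a time may update the shared location.
            with lock:
                counter["value"] += 1
        # Wait here until every worker has finished its increments.
        barrier.wait()
        with lock:
            seen_after_barrier.append(counter["value"])

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter["value"], seen_after_barrier


total, seen = run_workers()
print(total, seen)
```

Because no worker passes the barrier until all increments are done, every value recorded after the barrier equals the final total.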

How parallelism, communication and synchronization are achieved:

(i) Data parallelism:- It means applying the same operation, or set of operations, to all the elements of a data structure. This model evolved historically from SIMD architectures. The simplest examples are array operations like C = A+B, where A, B and C are arrays, and each entry may be added in parallel.
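As a minimal sketch of the C = A+B example, the arrays can be split into contiguous chunks and each chunk handed to a separate worker. The helper names are hypothetical. In standard CPython, threads will not speed up pure-Python arithmetic, so this only illustrates how the work is divided.

```python
from concurrent.futures import ThreadPoolExecutor


def add_chunk(a_chunk, b_chunk):
    """The same operation (addition) applied to every element of a chunk."""
    return [x + y for x, y in zip(a_chunk, b_chunk)]


def parallel_add(a, b, num_workers=4):
    """Compute C = A + B by giving each worker one contiguous chunk."""
    if len(a) != len(b):
        raise ValueError("arrays must have the same length")
    n = len(a)
    chunk = max(1, -(-n // num_workers))  # ceiling division
    bounds = [(lo, min(lo + chunk, n)) for lo in range(0, n, chunk)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        parts = list(pool.map(
            lambda lohi: add_chunk(a[lohi[0]:lohi[1]], b[lohi[0]:lohi[1]]),
            bounds))
    # pool.map preserves chunk order, so concatenation rebuilds C.
    return [x for part in parts for x in part]


print(parallel_add([1, 2, 3, 4, 5], [10, 20, 30, 40, 50]))
```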

(ii) Message Passing:- It means running p independent sequential programs, which communicate by calling subroutines like send(data,destination_proc) and a matching receive subroutine to send data from one processor (source_proc) to another, as well as subroutines for computing global sums, barrier synchronization, etc. Since it is inconvenient to maintain p different program texts for p processors (since p will vary from machine to machine), there is usually a single program text executing on all processors. But since the program can branch based on the number of the processor on which it executes, the copies can run completely independently. This is called SPMD programming (single program multiple data).
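The SPMD style can be imitated on a single machine with threads and queues. In this sketch, each thread plays one processor and each queue is that processor's mailbox. The `send` and `receive` helpers are hypothetical stand-ins for the message-passing subroutines, not the interface of any real library.

```python
import queue
import threading


def spmd_global_sum(local_values):
    """Every 'processor' ends up holding the sum of all local values."""
    p = len(local_values)
    mailboxes = [queue.Queue() for _ in range(p)]
    results = [None] * p

    def send(data, destination_proc):
        mailboxes[destination_proc].put(data)

    def receive(my_rank):
        return mailboxes[my_rank].get()   # blocks until a message arrives

    def program(rank):
        # One program text for all processors; behavior branches on rank.
        if rank == 0:
            total = local_values[0]
            for _ in range(p - 1):
                total += receive(0)       # gather partial values
            for dest in range(1, p):
                send(total, dest)         # broadcast the global sum
            results[rank] = total
        else:
            send(local_values[rank], 0)
            results[rank] = receive(rank)

    threads = [threading.Thread(target=program, args=(r,)) for r in range(p)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(spmd_global_sum([1, 2, 3, 4]))
```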

(iii) Shared memory programming with threads is a natural programming model for shared memory machines. There is a single program text, which initially starts executing on one processor. This program may execute a statement like "spawn(proc,0)", which will cause some other processor to execute subroutine proc with argument 0. Subroutine proc can "see" all the variables it can normally see according to the scoping rules of the serial language.
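A rough Python analogue of "spawn(proc,0)" is sketched below. The `spawn` helper is hypothetical and simply wraps `threading.Thread`; the point is that `proc` reaches the shared variable through ordinary scoping rules.

```python
import threading

shared_log = []                 # visible to every thread via normal scoping
log_lock = threading.Lock()     # protects shared_log from concurrent updates


def proc(arg):
    """Subroutine run by the spawned thread; it sees shared_log directly."""
    with log_lock:
        shared_log.append(("proc", arg))


def spawn(target, arg):
    """Start `target(arg)` on another thread and return the thread handle."""
    t = threading.Thread(target=target, args=(arg,))
    t.start()
    return t


def main():
    shared_log.clear()
    handle = spawn(proc, 0)
    handle.join()               # wait for the spawned work to finish
    return list(shared_log)


print(main())
```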

Performance of a parallel algorithm:-

1]. Efficiency can be increased by giving each processor an approximately equal amount of work to do. Dividing up work equally in this way is also called load balancing.

2]. The amount of communication should be minimized.

Communication cost is essential for calculating the performance of a parallel program. Suppose we want to send n words of data from one processor to another. The simplest model that we will use for the time required by this operation is

Time to send n words = latency + n/bandwidth

We consider sending n words at a time because on most machines the memory hierarchy dictates that it is most efficient to send groups of adjacent words (such as a cache line) all at once. Latency (in units of seconds) measures the time to send an "empty" message. Bandwidth (in units of words/second) measures the rate at which words pass through the interconnection network. The time to process n words by a pipeline of s stages, each stage taking t seconds, is

(s-1)*t + n*t = latency + n/bandwidth

where latency = (s-1)*t and bandwidth = 1/t.

Thus, the interconnection network works like a pipeline, "pumping" data through it at a rate given by the bandwidth, and with a delay given by the latency for the first word in the message to make it across the network.
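The cost model and its pipeline interpretation can be written down directly. The function names below are illustrative; the formulas are the ones given in the text.

```python
def send_time(n_words, latency, bandwidth):
    """Time to send n words: latency + n / bandwidth."""
    return latency + n_words / bandwidth


def pipeline_time(n_words, stages, stage_time):
    """Time for n words to pass through a pipeline of s stages of t seconds."""
    return (stages - 1) * stage_time + n_words * stage_time


def pipeline_as_network(stages, stage_time):
    """Latency and bandwidth that make the two formulas agree."""
    latency = (stages - 1) * stage_time
    bandwidth = 1.0 / stage_time
    return latency, bandwidth


latency, bandwidth = pipeline_as_network(stages=5, stage_time=0.5)
print(send_time(8, latency, bandwidth), pipeline_time(8, 5, 0.5))
```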

3]. Overhead (one of the factors affecting efficiency) is the time a processor must spend either to send a message packet or to receive one. This is typically time spent packing or copying the message, time in the operating system, etc., and it cannot be used for other useful work. In other words, from the time one processor decides to send a message to the time another processor has received it, a total of 2*o+L time steps have passed, where o is the overhead and L is the latency (overhead on the sending side, latency in the network, overhead on the receiving side).

In parallel algorithms we decompose a program into subprograms to execute them on different processors, and this decomposition frequently introduces overhead, for example, message passing between processors. We have to include this overhead in the execution time of the parallel version of the subprogram, and it must be outweighed by the reduction in execution time that results from using (many) processors in parallel. If we achieve this goal, we will have reduced the runtime of the subprogram and therefore the runtime of the whole program.
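The trade-off can be made concrete with a deliberately simple model. It assumes the work divides perfectly among the processors and that each message costs 2*o+L with no overlap between communication and computation. These assumptions, and all names below, are illustrative only.

```python
def message_time(overhead, latency):
    """End-to-end cost of one message: send overhead + latency + receive overhead."""
    return 2 * overhead + latency


def parallel_time(serial_time, num_procs, num_messages, overhead, latency):
    """Assumed model: perfectly divided work plus non-overlapped messages."""
    compute = serial_time / num_procs
    communicate = num_messages * message_time(overhead, latency)
    return compute + communicate


def is_worthwhile(serial_time, num_procs, num_messages, overhead, latency):
    """True when the parallel version beats the serial one under this model."""
    return parallel_time(serial_time, num_procs, num_messages,
                         overhead, latency) < serial_time


print(parallel_time(100, 4, 5, overhead=2, latency=5))   # large job
print(is_worthwhile(40, 4, 5, overhead=2, latency=5))    # small job
```

Under this model the same communication cost that is negligible for a large job can cancel the whole gain for a small one.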

There are four different types of systems to which parallel computing can be applied effectively and efficiently:-

1) Simulating discrete event systems:- There are two types of discrete event systems.

In the first type, time is discrete, and we are only interested in the state at a discrete, evenly spaced set of times, as determined by the system clock. Such a simulation is called synchronous because changes of state only occur at clock ticks. The best example is a digital circuit, in which each signal has only the state 0 or 1. The second, slightly more complicated discrete event model lets time be continuous and permits changes of state, or events, to occur at any time. Such a model is called event driven, and the simulation technique for it is asynchronous, because at any given real time, different parts of the model may be updated to different simulated times. One such discrete event system would be a queuing model, say for the pedestrian traffic using the elevators in a new building being designed. The state consists of the number of people waiting for each elevator, the number of people in each elevator, and the buttons each person wants to push. Time may be continuous, rather than discrete, if we choose to model the times of arrival of new passengers in the queues as a Poisson process, with a random waiting time t between arrivals distributed exponentially with density z*exp(-z*x), so that Prob(t>x) = exp(-z*x). The governing rules assume passenger destinations are chosen randomly, and move people through the system accordingly.
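An event-driven simulation of this kind can be sketched with a priority queue of timestamped events. The model below is much simpler than the elevator system described above: it has a single queue, one server with a fixed service time, and Poisson arrivals with rate z. All names and parameters are assumptions made for illustration.

```python
import heapq
import random


def simulate_queue(arrival_rate, service_time, horizon, seed=0):
    """Asynchronous simulation: process events in order of simulated time."""
    rng = random.Random(seed)
    events = [(rng.expovariate(arrival_rate), "arrival")]
    arrivals = served = waiting = 0
    busy = False
    while events:
        time, kind = heapq.heappop(events)
        if time > horizon:
            break
        if kind == "arrival":
            arrivals += 1
            # Exponential gap to the next arrival (Poisson process).
            gap = rng.expovariate(arrival_rate)
            heapq.heappush(events, (time + gap, "arrival"))
            if busy:
                waiting += 1
            else:
                busy = True
                heapq.heappush(events, (time + service_time, "departure"))
        else:  # departure
            served += 1
            if waiting:
                waiting -= 1
                heapq.heappush(events, (time + service_time, "departure"))
            else:
                busy = False
    return {"arrivals": arrivals, "served": served,
            "waiting": waiting, "in_service": int(busy)}


print(simulate_queue(arrival_rate=2.0, service_time=0.3, horizon=50.0, seed=1))
```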

The main source of parallelism in discrete event simulation arises from partitioning the system into its physical components and having separate processors simulate each component. In the case of digital circuits, one can assign connected subcircuits to separate processors. If one subcircuit is connected to another subcircuit, the corresponding processors must communicate with one another. There are many ways one could imagine assigning subcircuits to processors. In the case of elevator simulations, one can imagine a separate processor for each elevator, or one processor for each floor. So again, the criterion is to divide the work so that it can be performed in parallel on different processors.

2) Simulating Particle Systems:- In a particle system a finite number of particles move about in space subject to certain laws of motion and certain kinds of interactions. Time is continuous. Examples in which parallelization methods are applied to calculate the forces on particles are:

particles in reactors; the forces on the particles are electrostatic, and Newton's laws govern the motion, along with random collisions of neutrons and nuclei, which could possibly fission and create more neutrons;

atoms in molecules; the forces between atoms are modeled as electrostatic, or perhaps as coming from fairly rigid chemical bonds with certain degrees of freedom; Newton's laws govern the motion;

the structure of the universe; Newton's laws and gravity determine the motion;

the motion of cars on a freeway, in transportation planning; the forces on cars depend on Newton's laws as well as more or less sophisticated models of the engine and how the driver reacts to the motions of nearby cars.

The force that moves each particle can typically be broken into 3 components:

force = external_force + nearby_force + far_field_force

The external force can be evaluated for each particle without knowledge of the other particles, and hence in a parallel fashion. Effective parallelization requires that each processor own an approximately equal number of particles. The nearby force refers to forces depending only on other particles very close by. Effective parallelization requires locality, i.e. that the nearest neighbors of a particle are likely to reside on the same processor as the particle. A far field force depends on every other particle in the system, no matter how distant. Gravitational and electrostatic forces are of this kind. At first glance, these kinds of forces appear to require global communication at every step, defeating any attempt to maintain locality and minimize communication. But there are a number of clever algorithms to overcome this.

There are two obvious ways to try to partition the particles among processors to compute the external force and the nearby force. A better way to partition for nearby forces is to partition the physical space in which the particles reside, with each processor being responsible for the particles that happen to reside in the part of space it owns.
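Partitioning the physical space can be sketched in one dimension by cutting the interval [x_min, x_max] into equal slabs, one per processor. Equal-width slabs are an assumption made for simplicity. They balance the load only when particles are spread roughly uniformly, and the helper names are hypothetical.

```python
def owner_of(x, x_min, x_max, num_procs):
    """Index of the processor whose slab of space contains position x."""
    width = (x_max - x_min) / num_procs
    index = int((x - x_min) / width)
    # Clamp so that points on or outside the boundaries still get an owner.
    return min(max(index, 0), num_procs - 1)


def partition_by_space(positions, x_min, x_max, num_procs):
    """Return, for each processor, the indices of the particles it owns."""
    owned = [[] for _ in range(num_procs)]
    for i, x in enumerate(positions):
        owned[owner_of(x, x_min, x_max, num_procs)].append(i)
    return owned


positions = [0.1, 0.9, 0.3, 0.6]
print(partition_by_space(positions, 0.0, 1.0, num_procs=2))
```

Particles that are close in space land on the same processor, so most nearby-force interactions need no communication.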

3) Simulating Systems of Lumped Variables Depending on Continuous Parameters

A standard example for this kind of simulation is a circuit. Here we think of a circuit as a graph with wires as edges (also called branches), and nodes where two or more wires are connected. Each edge contains a single resistor, capacitor, inductor, or voltage source. There is a finite ("lumped") set of state variables depending continuously on the parameter time.

Another example is structural analysis in civil engineering. A third example is chemical kinetics, or computing the concentrations of chemicals undergoing a reaction. To illustrate, suppose we have three chemicals O (atomic oxygen), O3 (ozone) and O2 (normal molecular oxygen).

Now we discuss the sources of parallelism in these problems. Even though we have differential equations and nonlinear systems to solve, the computational bottlenecks all reduce to linear algebra with sparse matrices. The simplest example is solving a system of linear algebraic equations A*x=b. The second example is solving a nonlinear system of equations, which we write as f(x)=0.

Sparse matrix-vector multiplication and related sparse matrix operations are the computational bottlenecks we need to parallelize. A good parallel algorithm for sparse matrix-vector multiplication depends on solving the graph partitioning problem in such a way as to

1. Balance the load,

2. Balance the storage, and

3. Minimize communication.

Load balancing and minimizing communication were goals discussed earlier; balancing storage means having each processor store an approximately equal fraction of the total data. This is important for scaling to large problems that cannot fit on a single processor.
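The three goals can be seen in a small sketch of sparse matrix-vector multiplication with rows divided into contiguous blocks. The dictionary-of-rows storage and the simple block partition are assumptions chosen for brevity. A real graph partitioner would choose the blocks so that few entries of x have to be fetched from other processors.

```python
def partition_rows(n, num_procs):
    """Contiguous, nearly equal row blocks: balances load and storage."""
    return [(k * n // num_procs, (k + 1) * n // num_procs)
            for k in range(num_procs)]


def local_spmv(rows, lo, hi, x):
    """y[i] for the rows lo..hi-1 owned by one processor."""
    return [sum(v * x[j] for j, v in rows.get(i, {}).items())
            for i in range(lo, hi)]


def remote_entries_needed(rows, lo, hi):
    """Indices of x owned by other processors: the communication volume."""
    return {j for i in range(lo, hi) for j in rows.get(i, {})
            if not lo <= j < hi}


# Tridiagonal 4x4 example stored as {row: {column: value}}.
A = {0: {0: 2, 1: -1}, 1: {0: -1, 1: 2, 2: -1},
     2: {1: -1, 2: 2, 3: -1}, 3: {2: -1, 3: 2}}
x = [1, 2, 3, 4]
y = []
for lo, hi in partition_rows(4, 2):
    y.extend(local_spmv(A, lo, hi, x))
    print((lo, hi), "needs x entries", remote_entries_needed(A, lo, hi))
print(y)
```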

4) Continuous variables depending on continuous parameters

The state is thus a 6-tuple of continuous variables, each of which depends on 4 continuous parameters. The governing equations are partial differential equations (PDEs), which describe how each of the 6 variables changes as a function of all the others. Simpler examples include

ion (Concentration(position, time))

ity(position, time))

We will deal with the heat equation in some detail.

The algorithm for the explicit solution of the 1D heat equation uses an explicit formula for the value of U(i,m+1) in terms of old data. Note that U(i,m+1) depends on exactly three values from the previous time step. This pattern is called the stencil of the algorithm.

Parallelism is available in this problem by dividing the mesh on the bar up into contiguous pieces and assigning them to different processors, with each processor responsible for updating the values at the mesh points it owns. Since the boundary points require no computation, the leftmost and rightmost processors have 1 fewer grid point to update. Because the stencil only requires data from grid points to the immediate left and right, only data at the boundaries between processors needs to be communicated. Thus, the parallelism shares the surface-to-volume effect with our parallelization of particle systems, because most work is internal to the data owned by a processor, and only data on the boundary needs to be communicated.
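The sketch below shows one time step of an explicit scheme, first serially and then with the mesh split into contiguous pieces. The text does not give the update formula, so the standard three-point form U(i,m+1) = z*U(i-1,m) + (1-2z)*U(i,m) + z*U(i+1,m) is assumed here, with z a constant determined by the time step and mesh spacing. The "processors" are simulated by a loop.

```python
def step_serial(u, z):
    """One explicit time step; the two boundary values stay fixed."""
    new = u[:]
    for i in range(1, len(u) - 1):
        new[i] = z * u[i - 1] + (1 - 2 * z) * u[i] + z * u[i + 1]
    return new


def step_partitioned(u, z, num_procs):
    """Same step, with each piece using one ghost value per neighbor."""
    n = len(u)
    bounds = [(k * n // num_procs, (k + 1) * n // num_procs)
              for k in range(num_procs)]
    new = u[:]
    for lo, hi in bounds:
        # The only communicated data: one value from each neighboring piece.
        left_ghost = u[lo - 1] if lo > 0 else None
        right_ghost = u[hi] if hi < n else None
        padded = [left_ghost] + u[lo:hi] + [right_ghost]
        for k in range(hi - lo):
            i = lo + k
            if i == 0 or i == n - 1:
                continue  # boundary points require no computation
            new[i] = (z * padded[k] + (1 - 2 * z) * padded[k + 1]
                      + z * padded[k + 2])
    return new


u0 = [0.0, 0.0, 1.0, 0.0, 0.0]
print(step_serial(u0, 0.25))
print(step_partitioned(u0, 0.25, num_procs=2))
```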

These four types of systems cover a wide range of fields. Many problems can be solved in less time by applying parallelization efficiently. As a consequence, parallel computing, which was once the sole preserve of academia and industrial research, is now opening up to a much wider audience, which is slowly awakening to the advantages and potential of parallelism. Traditionally conservative areas, such as commerce and finance, are embracing parallel computing as a cost-effective means of obtaining competitive advantage.

Its effectiveness can be seen in areas like weather modeling, aerodynamic analysis of
aircraft design and particle physics.

Different models such as PRAM, BSP and OCPC are based on the concepts of parallel computing; they add new prospects to the field and will help determine its future.

The PRAM model allows straightforward cost evaluation for complex algorithms. Nevertheless, it is not well suited to actual distributed memory machines, since it does not take communication overhead into account. Valiant and McColl propose another approach: they introduce a cost model called BSP. The main alternative has been to base algorithm design on restricted forms of PRAMs or on models that try to build in more computational realism. Our approach is to stay with the generality and simplicity of the basic PRAM models.

Computers such as shared memory machines, distributed memory machines, and networks of workstations are beginning to acquire a familiar appearance: a workstation-like processor-memory pair at the node level, and a fast, robust interconnection network that provides node-to-node communication. The aim of recent research into the Bulk Synchronous Parallel (BSP) model has been to take advantage of this architectural convergence. The central idea of BSP is to provide a high-level abstraction of parallel computing hardware whilst providing a realization of a parallel programming model. The advent of the BSP model has provided an underlying framework for devising both scalable parallel architectures and portable parallel software. This model enables the programmer to write architecture-independent software. Such a model should strike a balance between simplicity of usage and reflectivity of existing parallel architectures. It offers a mechanism for efficient barrier synchronization. BSP is a general-purpose parallel programming model which has been successfully applied to a wide range of numerical problems.

Another model attracting growing interest is the Optical Communication Parallel Computer (OCPC), also called Distributed Memory Machine, Direct Connection Machine or SPRAM. It models a completely connected optical network. All processors have their local control and memory modules attached. Every processor A can attempt to communicate directly with any processor B in a unit of time. If at the moment of communication A is the only processor attempting to communicate with B, then the communication is successful.
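The success rule for one OCPC communication step can be expressed in a few lines. Representing the attempts as a dictionary from sender to receiver is an assumption made for illustration.

```python
from collections import Counter


def ocpc_step(attempts):
    """Return the attempts that succeed in one unit of time.

    `attempts` maps each sending processor to the processor it targets.
    An attempt succeeds only if no other sender targets the same receiver.
    """
    targets = Counter(attempts.values())
    return {sender: receiver for sender, receiver in attempts.items()
            if targets[receiver] == 1}


# Processors 0 and 1 collide on processor 2; processor 3 reaches 4 alone.
print(ocpc_step({0: 2, 1: 2, 3: 4}))
```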

Despite many efforts, no consensus has been reached on which model should be used or which one will shape the future of parallel computing. For now we can only wait and watch.

Submitted by:

Hitesh Setpal (B.E. (Comp. Sc.) III Year, S.V.I.T.S)

Priyanka Kasliwal (B.E. (Comp. Sc.) III Year, S.V.I.T.S)