
Introduction to Parallel Processing

Shantanu Dutt

University of Illinois at Chicago

2

Acknowledgements

Ashish Agrawal, IIT Kanpur, “Fundamentals of Parallel Processing” (slides), w/ some modifications and augmentations by Shantanu Dutt

John Urbanic, Parallel Computing: Overview (slides), w/ some modifications and augmentations by Shantanu Dutt

John Mellor-Crummey, “COMP 422 Parallel Computing: An Introduction”, Department of Computer Science, Rice University (slides), w/ some modifications and augmentations by Shantanu Dutt



3

Outline

Moore's Law and its limits
Different uni-processor performance enhancement techniques and their limits
Classification of parallel computations
Classification of parallel architectures: distributed and shared memory
Simple examples of parallel processing
Example applications
Future advances
Summary

Some text from: Fundamentals of Parallel Processing, A. Agrawal, IIT Kanpur


4

Moore’s Law & Need for Parallel Processing

Chip performance doubles every 18-24 months
Power consumption is prop. to freq.
Limits of serial computing:
  Heating issues
  Limit to transmission speeds
  Leakage currents
  Limit to miniaturization
Multi-core processors already commonplace.
Most high-performance servers already parallel.


5

Quest for Performance

Pipelining
Superscalar Architecture
Out of Order Execution
Caches
Instruction Set Design Advancements
Parallelism (this is the future):
  Multi-core processors
  Clusters
  Grid

Top text from: Fundamentals of Parallel
Processing, A. Agrawal, IIT Kanpur

6

Pipelining

Illustration of a pipeline using the fetch, load, execute, store stages.
At the start of execution: wind-up.
At the end of execution: wind-down.
Pipeline stalls due to data dependencies (RAW, WAR), resource conflicts, or incorrect branch prediction hurt performance and speedup.
Pipeline depth: the number of stages, and hence the number of instructions that can be in execution simultaneously.
Intel Pentium 4: 35 stages.

7

Pipelining

T_pipe(n) = pipelined time to process n instructions = fill-time + n * max{t_i}, where t_i = exec. time of the i-th stage; for a k-stage pipeline, fill-time = (k-1) * max{t_i}.
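As a quick numerical check (not from the slides, and assuming the formula above with fill-time = (k-1) * max{t_i}), a small C program can compare serial and pipelined execution time for n instructions on a k-stage pipeline with hypothetical per-stage times:

#include <stdio.h>

/* Sketch: serial vs. pipelined time for n instructions on a k-stage pipeline,
 * assuming T_serial(n) = n * sum(t_i) and
 *          T_pipe(n)   = (k - 1) * max(t_i) + n * max(t_i). */
int main(void) {
    double t[] = {1.0, 1.5, 1.0, 2.0};   /* per-stage times (hypothetical values) */
    int k = (int)(sizeof t / sizeof t[0]);
    int n = 1000;                        /* number of instructions */

    double sum = 0.0, tmax = 0.0;
    for (int i = 0; i < k; i++) {
        sum += t[i];
        if (t[i] > tmax) tmax = t[i];
    }

    double t_serial = n * sum;
    double t_pipe   = (k - 1) * tmax + n * tmax;  /* fill time + steady state */

    printf("T_serial = %.1f  T_pipe = %.1f  speedup = %.2f\n",
           t_serial, t_pipe, t_serial / t_pipe);
    return 0;
}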


8

Cache

Desire for fast, cheap, and non-volatile memory.
Memory speed grows at ~7% per annum while processor speed grows at ~50% p.a.
Cache: fast, small memory.
L1 and L2 caches.
Retrieval from main memory takes several hundred clock cycles; retrieval from the L1 cache takes on the order of one clock cycle, and from the L2 cache on the order of 10 clock cycles.
Cache ‘hit’ and ‘miss’.
Prefetching is used to avoid cache misses at the start of the execution of the program.
Cache lines are used to amortize the latency of a cache miss.
Order of search: L1 cache -> L2 cache -> RAM -> Disk.
Cache coherency: correctness of data; important for distributed parallel computing.
Limit to cache improvement: improving cache performance will at most bring memory efficiency up to match processor efficiency.
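To make the cache-line point concrete, here is a small illustrative C sketch (not from the slides; the matrix size N is arbitrary): traversing a matrix along rows touches consecutive addresses and reuses each fetched cache line, while traversing along columns touches a new line on almost every access.

#include <stdio.h>
#include <time.h>

#define N 2048

/* Illustrative sketch: row-major (cache-line friendly) vs. column-major
 * (cache-line unfriendly) traversal of the same N x N matrix. */
static double a[N][N];

int main(void) {
    double sum = 0.0;
    clock_t t0, t1;

    t0 = clock();
    for (int i = 0; i < N; i++)      /* row-major: consecutive addresses,
                                        each fetched cache line fully used */
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    t1 = clock();
    printf("row-major:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    t0 = clock();
    for (int j = 0; j < N; j++)      /* column-major: large strides,
                                        a new cache line on almost every access */
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    t1 = clock();
    printf("column-major: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    return sum == 0.0 ? 0 : 1;       /* use sum so the loops are not optimized away */
}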

9

(Figure annotations from this slide: examples of limited data parallelism; examples of limited & low-level functional parallelism; single-instr. multiple data (SIMD); instruction-level parallelism, whose degree is generally low and dependent on how the sequential code has been written, so not v. effective.)
10

11


Thus we need the development of explicit parallel algorithms that are based on a fundamental understanding of the parallelism inherent in a problem, and that exploit that parallelism with minimum interaction/communication between the parallel parts.

12

13

14

(simultaneous multi-threading)
(multi-threading)

15

16


17


18

19


20

Applications of Parallel Processing

21

22

23

24

25


26

Example problems & solutions

Easy parallel situation: each data part is independent. No communication is required between the execution units solving two different parts.

  data1   data2   ...   data N

Heat equation:
  The initial temperature is zero on the boundaries and high in the middle.
  The boundary temperature is held at zero.
  The calculation of an element is dependent upon its neighbor elements.

Code from: Fundamentals of Parallel
Processing, A. Agrawal, IIT Kanpur

27

find out if I am MASTER or WORKER

if I am MASTER
  initialize array
  send each WORKER starting info and subarray
  do until all WORKERS converge
    gather from all WORKERS convergence data
    broadcast to all WORKERS convergence signal
  end do
  receive results from each WORKER

else if I am WORKER
  receive from MASTER starting info and subarray
  do until solution converged {
    update time
    non-blocking send neighbors my border info
    non-blocking receive neighbors border info
    update interior of my portion of solution array
    wait for non-blocking commun. to complete
    update border of my portion of solution array
    determine if my solution has converged
    if so { send MASTER convergence signal
            recv. from MASTER convergence signal }
  end do }
  send MASTER results
endif

Serial code:

do iy = 2, ny-1
  do ix = 2, nx-1
    u2(ix,iy) = u1(ix,iy) + cx*(u1(ix+1,iy) + u1(ix-1,iy) - 2.*u1(ix,iy)) &
                          + cy*(u1(ix,iy+1) + u1(ix,iy-1) - 2.*u1(ix,iy))
  enddo
enddo

(Figure: the problem grid is partitioned among the workers; the master can be one of the workers. A hedged MPI sketch of the worker's part follows.)
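The following C/MPI sketch is not the course's code; it is a minimal, simplified rendering of the WORKER part of the pseudocode above: 1-D decomposition into horizontal strips, non-blocking halo exchange, a fixed number of time steps instead of a convergence test, and no separate master. The sizes NX, NY, STEPS and constants CX, CY are illustrative, and NX is assumed divisible by the number of ranks.

/* Hedged sketch: 1-D block decomposition of the 2-D heat equation across
 * MPI ranks with non-blocking halo exchange, mirroring the WORKER pseudocode.
 * Compile: mpicc heat.c -o heat ; run: mpirun -np 4 ./heat */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NX 64          /* global rows (split across ranks); assumed divisible */
#define NY 64          /* columns */
#define STEPS 500
#define CX 0.1
#define CY 0.1

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = NX / size;
    int up   = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* local strip plus one ghost row above and below (zero boundary) */
    double (*u1)[NY] = calloc(rows + 2, sizeof *u1);
    double (*u2)[NY] = calloc(rows + 2, sizeof *u2);
    for (int i = 1; i <= rows; i++)          /* hot interior, cold frame */
        for (int j = 1; j < NY - 1; j++)
            u1[i][j] = 100.0;

    for (int step = 0; step < STEPS; step++) {
        MPI_Request req[4];
        /* non-blocking send/receive of border rows with neighbors */
        MPI_Isend(u1[1],      NY, MPI_DOUBLE, up,   0, MPI_COMM_WORLD, &req[0]);
        MPI_Isend(u1[rows],   NY, MPI_DOUBLE, down, 1, MPI_COMM_WORLD, &req[1]);
        MPI_Irecv(u1[0],      NY, MPI_DOUBLE, up,   1, MPI_COMM_WORLD, &req[2]);
        MPI_Irecv(u1[rows+1], NY, MPI_DOUBLE, down, 0, MPI_COMM_WORLD, &req[3]);

        /* update interior rows that do not need the ghost rows */
        for (int i = 2; i <= rows - 1; i++)
            for (int j = 1; j < NY - 1; j++)
                u2[i][j] = u1[i][j]
                         + CX * (u1[i+1][j] + u1[i-1][j] - 2.0 * u1[i][j])
                         + CY * (u1[i][j+1] + u1[i][j-1] - 2.0 * u1[i][j]);

        MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

        /* update the border rows now that the ghost rows have arrived */
        int border[2] = {1, rows};
        for (int b = 0; b < (rows > 1 ? 2 : 1); b++) {
            int i = border[b];
            for (int j = 1; j < NY - 1; j++)
                u2[i][j] = u1[i][j]
                         + CX * (u1[i+1][j] + u1[i-1][j] - 2.0 * u1[i][j])
                         + CY * (u1[i][j+1] + u1[i][j-1] - 2.0 * u1[i][j]);
        }

        memcpy(u1, u2, (rows + 2) * sizeof *u1);
    }

    if (rank == 0) printf("done: sample value %.3f\n", u1[rows/2][NY/2]);
    free(u1); free(u2);
    MPI_Finalize();
    return 0;
}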

28

How to interconnect the
multiple cores/processors
is a major consideration in
a parallel architecture

29



30

Parallelism - A simplistic understanding

Multiple tasks at once.
Distribute work into multiple execution units.
A classification of parallelism:
  Data Parallelism
  Functional or Control Parallelism
Data Parallelism: divide the dataset and solve each sector “similarly” on a separate execution unit.
Functional Parallelism: divide the 'problem' into different tasks and execute the tasks on different units. What would func. parallelism look like for the example on the right? (See the sketch below.)

(Figure: sequential execution vs. data-parallel decomposition of the example.)
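To make the distinction concrete, here is a small, hedged C/OpenMP sketch (not from the slides; the array x and the two tasks are illustrative): the data-parallel version splits one loop over the dataset across threads, while the functional-parallel version gives different tasks (sum and max) to different threads.

/* Hedged illustration of data vs. functional parallelism using OpenMP.
 * Compile, e.g.: gcc -fopenmp par_styles.c -o par_styles */
#include <omp.h>
#include <stdio.h>

#define N 1000000

static double x[N];

int main(void) {
    for (int i = 0; i < N; i++) x[i] = (double)i;

    /* Data parallelism: the same operation applied to different
     * sections of the dataset by different threads. */
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += x[i];

    /* Functional (control) parallelism: different tasks run
     * concurrently on different threads. */
    double total = 0.0, maxval = 0.0;
    #pragma omp parallel sections
    {
        #pragma omp section
        { for (int i = 0; i < N; i++) total += x[i]; }    /* task 1: sum */
        #pragma omp section
        { for (int i = 0; i < N; i++)                     /* task 2: max */
              if (x[i] > maxval) maxval = x[i]; }
    }

    printf("sum=%.0f total=%.0f max=%.0f\n", sum, total, maxval);
    return 0;
}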


31

Data Parallelism

Functional Parallelism

Flynn’s Classification

Flynn's classical taxonomy:

Single Instruction, Single Data (SISD): your single-core uni-processor PC.

Single Instruction, Multiple Data (SIMD): special-purpose low-granularity multi-processor m/c w/ a single control unit relaying the same instruction to all processors (w/ different data) every clock cycle.

Multiple Instruction, Single Data (MISD): pipelining is a major example.

Multiple Instruction, Multiple Data (MIMD): the most prevalent model. SPMD (Single Program Multiple Data) is a very useful subset. Note that this is v. different from SIMD. Why? (See the SPMD sketch below.)

Note that Data vs Control Parallelism is another classification, independent of the above.
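As a hint to the "Why?" above, a hedged SPMD sketch (not from the slides): every process runs the same program text, but each has its own control flow and can take different branches at its own pace, unlike SIMD's lockstep single instruction stream.

/* SPMD sketch: one program, many processes; behavior diverges by rank.
 * Compile and run, e.g.: mpicc spmd.c -o spmd && mpirun -np 4 ./spmd */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)                  /* this process takes a "coordinator" branch */
        printf("rank 0 of %d: coordinating\n", size);
    else                            /* the others execute a different code path  */
        printf("rank %d of %d: working\n", rank, size);

    MPI_Finalize();
    return 0;
}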


32

Flynn’s Classification (cont’d).

33

Flynn’s Classification (cont’d).

34

Flynn’s Classification (cont’d).

35

Flynn’s Classification (cont’d).

36

Flynn’s Classification (cont’d).

37

Flynn’s Classification (cont’d).

38


Data Parallelism: SIMD and SPMD fall into this category


Functional Parallelism: MISD falls into this category


39

Parallel Arch. Classification

Multi-processor Architectures:
  Distributed Memory: the most prevalent architecture model for # processors > 8
    Indirect interconnection n/ws
    Direct interconnection n/ws
  Shared Memory:
    Uniform Memory Access (UMA)
    Non-Uniform Memory Access (NUMA): distributed shared memory



40

Distributed Memory: Message Passing Architectures

Each processor P (with its own local cache C) is connected to exclusive local memory, i.e. no other CPU has direct access to it.
Each node comprises at least one network interface (NI) that mediates the connection to a communication network.
On each CPU runs a serial process that can communicate with other processes on other CPUs by means of the network.
Non-blocking vs blocking communication.
Direct vs indirect communication/interconnection network.

Example: a 2x4 mesh n/w (a direct connection n/w).

The ARGO Beowulf Cluster at UIC (http://accc.uic.edu/service/argo-cluster)


41


Has 56 compute nodes/computers and a master node.

"Master" here has a different meaning than the "master" node in a parallel algorithm (e.g., the one we saw for the finite-element heat distribution problem). Here it is generally a system front-end where you log in and perform various tasks before submitting your parallel code to run on several compute nodes. The algorithmic "master" would actually be one of the compute nodes; it generally distributes data to the other compute nodes, monitors progress of the computation, determines the end of the computation, etc., and may also perform a part of the computation.

Compute nodes are divided among 14 zones, each zone containing 4 nodes connected as a ring network. Zones are connected to each other by a higher-level n/w.

Each node (compute or master) has 2 processors. The processors on some nodes are single-core and on others dual-core; see http://accc.uic.edu/service/arg/nodes


System Computational Actions in a Message-Passing Program

42

(a) Two basic parallel processes X, Y, and their data dependency:

  Proc. X:  a := b+c;
  Proc. Y:  b := x*y;

(b) Their mapping to a message-passing multicomputer (message-passing mapping):

  Proc. X, on the processor/core P(X) containing X:  recv(P2, b);  a := b+c;
  Proc. Y, on the processor/core P(Y) containing Y:  b := x*y;  send(P1, b);

  Data item "b" is passed as a message from the processor/core containing Y to the processor/core containing X over a link (direct or indirect) between the two processors.
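A minimal MPI rendering of this two-process exchange (a hedged sketch, not from the slides; rank 1 plays Y, rank 0 plays X, and the values of c, x, y are arbitrary):

/* Sketch of the X/Y dependency above with blocking message passing.
 * Run with exactly 2 processes: mpicc xy.c -o xy && mpirun -np 2 ./xy */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {                       /* process Y */
        double x = 3.0, y = 4.0;
        double b = x * y;                  /* b := x*y  */
        MPI_Send(&b, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);   /* send(P1, b) */
    } else if (rank == 0) {                /* process X */
        double b, c = 1.0;
        MPI_Recv(&b, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);       /* recv(P2, b) */
        double a = b + c;                  /* a := b+c   */
        printf("a = %.1f\n", a);
    }

    MPI_Finalize();
    return 0;
}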


(Figure: dual-core and quad-core processors with L1 and L2 caches)


43

Distributed Shared Memory Arch.: UMA

Flat memory model.
Memory bandwidth and latency are the same for all processors and all memory locations.
Simplest example: a dual-core processor.
Most commonly represented today by Symmetric Multiprocessor (SMP) machines.
Cache-coherent UMA: consistent cache values of the same data item in different proc./core caches.


System Computational Actions in a Shared-Memory Program

44

(a) Two basic parallel processes X, Y, and their data dependency:

  Proc. X:  a := b+c;
  Proc. Y:  b := x*y;

(b) Their mapping to a shared-memory multiprocessor (shared-memory mapping):

  Proc. X runs on P(X) and Proc. Y runs on P(Y); both access the shared data item "b" in the shared memory.

Possible actions by the O.S. on behalf of Y (the writer of "b"):
(i) Since "b" is a shared data item (e.g., designated by the compiler or programmer), check "b"'s location to see if it can be written to (all prev. reads done: read_cntr for "b" = 0).
(ii) If so, write "b" to its location and mark its status bit as written by "Y". Initialize read_cntr for "b" to a pre-determined value.

Possible actions by the O.S. on behalf of X (the reader of "b"):
(i) Since "b" is a shared data item (e.g., designated by the compiler or programmer), check "b"'s location to see if it has been written to by "Y" or by any process (if we don't care about the writing process).
(ii) If so {read "b" & decrement read_cntr for "b"} else go to (i) and busy-wait (check periodically).
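The run-time actions described above amount to a status-flag-plus-read-counter handshake on the shared location. Below is a hedged C sketch of the same idea using C11 atomics and POSIX threads; this is not the slides' O.S.-level mechanism, and names such as writer, reader, and NUM_READERS are illustrative.

/* Sketch: the writer (role of Y) computes shared "b", sets the read counter
 * and a "written" status flag; each reader (role of X) busy-waits until "b"
 * is written, reads it, and decrements the counter.
 * Compile: cc -pthread handoff.c -o handoff */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define NUM_READERS 2                 /* pre-determined value for read_cntr */

static double b;                      /* the shared data item */
static atomic_int written = 0;        /* status bit: has "b" been written?  */
static atomic_int read_cntr = 0;      /* readers still to consume "b"       */

static void *writer(void *arg) {      /* plays the role of process Y */
    (void)arg;
    double x = 3.0, y = 4.0;
    b = x * y;                                        /* b := x*y */
    atomic_store(&read_cntr, NUM_READERS);
    atomic_store(&written, 1);                        /* publish "b" */
    return NULL;
}

static void *reader(void *arg) {      /* plays the role of process X */
    (void)arg;
    while (!atomic_load(&written))                    /* busy-wait for "b" */
        ;
    double c = 1.0, a = b + c;                        /* a := b+c */
    atomic_fetch_sub(&read_cntr, 1);                  /* one read consumed */
    printf("a = %.1f\n", a);
    return NULL;
}

int main(void) {
    pthread_t w, r[NUM_READERS];
    for (int i = 0; i < NUM_READERS; i++)
        pthread_create(&r[i], NULL, reader, NULL);
    pthread_create(&w, NULL, writer, NULL);
    pthread_join(w, NULL);
    for (int i = 0; i < NUM_READERS; i++)
        pthread_join(r[i], NULL);
    printf("read_cntr = %d\n", atomic_load(&read_cntr));
    return 0;
}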


Most text from Fundamentals of Parallel
Processing, A. Agrawal, IIT Kanpur

45

Distributed Shared Memory Arch.: NUMA

Memory is physically distributed but logically shared.
The physical layout is similar to the distributed-memory message-passing case.
The aggregated memory of the whole system appears as one single address space.
Due to the distributed nature, memory access performance varies depending on which CPU accesses which parts of memory (“local” vs. “remote” access).
Two locality domains linked through a high-speed connection called HyperTransport (in general via a link, as in message-passing arch’s, only here these links are used by the O.S. to transmit read/write non-local data to/from processor/non-local memory).
Advantage: scalability (compared to UMAs).
Disadvantages: (a) locality problems and connection congestion; (b) not a natural parallel prog./algo. model (it is easier to partition data among proc’s than to think of all of it occupying a large monolithic address space that each proc. can access).

(Figure: a 2x2 mesh connection)


46

47

An example of an SPMD message-passing parallel program

48

SPMD message-passing parallel program (contd.)

49


Most text from: Fund. of Parallel
Processing, A. Agrawal, IIT Kanpur

50

Summary

Serial computers / microprocessors will probably not get much faster; parallelization is unavoidable.
Pipelining, cache, and other optimization strategies for serial computers are reaching a plateau.
Data and functional parallelism.
Flynn's taxonomy: SIMD, MISD, MIMD/SPMD.
Parallel architectures intro:
  Distributed Memory
  Shared Memory
    Uniform Memory Access (UMA)
    Non-Uniform Memory Access (NUMA)
Application examples.
Parallel program/algorithm examples.






51

Additional References

Computer Organization and Design, Patterson and Hennessy
Modern Operating Systems, Tanenbaum
Concepts of High Performance Computing, Georg Hager and Gerhard Wellein
"Cramming more components onto Integrated Circuits", Gordon Moore, 1965
Introduction to Parallel Computing, https://computing.llnl.gov/tutorials/parallel_comp
The Landscape of Parallel Computing Research: A View from Berkeley, 2006