Parallel Computing Overview


Slides Prepared from the CI-Tutor Courses at NCSA

http://ci-tutor.ncsa.uiuc.edu/

By

S. Masoud Sadjadi

School of Computing and Information Sciences

Florida International University

March 2009


Parallel Computing Explained


Parallel Computing Overview

1

Agenda

1 Parallel Computing Overview

2 How to Parallelize a Code

3 Porting Issues

4 Scalar Tuning

5 Parallel Code Tuning

6 Timing and Profiling

7 Cache Tuning

8 Parallel Performance Analysis

9 About the IBM Regatta P690

2

Agenda

1 Parallel Computing Overview

1.1 Introduction to Parallel Computing

1.1.1 Parallelism in our Daily Lives

1.1.2 Parallelism in Computer Programs

1.1.3 Parallelism in Computers

1.1.4 Performance Measures

1.1.5 More Parallelism Issues

1.2 Comparison of Parallel Computers

1.3 Summary


3

Parallel Computing Overview


Who should read this chapter?

New Users: to learn concepts and terminology.

Intermediate Users: for review or reference.

Management Staff: to understand the basic concepts, even if you don't plan to do any programming.

Note: Advanced users may opt to skip this chapter.


4

Introduction to Parallel Computing


High performance parallel computers


can solve large problems much faster than a desktop computer


fast CPUs, large memory, high speed interconnects, and high speed
input/output


able to speed up computations


by making the sequential components run faster


by doing more operations in parallel


High performance parallel computers are in demand


need for tremendous computational capabilities in science, engineering, and business.

require gigabytes/terabytes of memory and gigaflops/teraflops of performance.


scientists are striving for petascale performance

5

Introduction to Parallel Computing


High performance parallel computers (HPPC) are used in a wide variety of disciplines.


Meteorologists: prediction of tornadoes and thunderstorms


Computational biologists: analyze DNA sequences


Pharmaceutical companies: design of new drugs


Oil companies: seismic exploration


Wall Street: analysis of financial markets


NASA: aerospace vehicle design


Entertainment industry: special effects in movies and
commercials


These complex scientific and business applications all need to
perform computations on large datasets or large equations.


6

Parallelism in our Daily Lives


There are two types of processes that occur in computers and
in our daily lives:


Sequential processes


occur in a strict order


it is not possible to do the next step until the current one is completed.


Examples


The passage of time: the sun rises and the sun sets.


Writing a term paper: pick the topic, research, and write the paper.


Parallel processes


many events happen simultaneously


Examples


Plant growth in the springtime


An orchestra


7

Agenda

1 Parallel Computing Overview

1.1 Introduction to Parallel Computing

1.1.1 Parallelism in our Daily Lives

1.1.2 Parallelism in Computer Programs

1.1.2.1 Data Parallelism

1.1.2.2 Task Parallelism

1.1.3 Parallelism in Computers

1.1.4 Performance Measures

1.1.5 More Parallelism Issues

1.2 Comparison of Parallel Computers

1.3 Summary


8

Parallelism in Computer Programs


Conventional wisdom:


Computer programs are sequential in nature


Only a small subset of them lend themselves to parallelism.


Algorithm: the "sequence of steps" necessary to do a computation.


For the first 30 years of computer use, programs were run sequentially.

The 1980's saw great successes with parallel computers.

Dr. Geoffrey Fox published a book entitled Parallel Computing Works!

many scientific accomplishments resulting from parallel computing

This experience reversed the conventional wisdom:

Computer programs are parallel in nature

Only a small subset of them need to be run sequentially


9

Parallel Computing


What a computer does when it carries out more than one
computation at a time using more than one processor.


By using many processors at once, we can speed up the execution.

If one processor can perform the arithmetic in time t, then ideally p processors can perform the arithmetic in time t/p.


What if I use 100 processors? What if I use 1000 processors?


Almost every program has some form of parallelism.


You need to determine whether your data or your program can be
partitioned into independent pieces that can be run simultaneously.


Decomposition is the name given to this partitioning process.


Types of parallelism:


data parallelism


task parallelism.

10

Data Parallelism


The same code segment runs concurrently on each processor,
but each processor is assigned its own part of the data to
work on.


Do loops (in Fortran) define the parallelism.


The iterations must be independent of each other.


Data parallelism is called "fine grain parallelism" because the
computational work is spread into many small subtasks.


Example


Dense linear algebra, such as matrix multiplication, is a perfect
candidate for data parallelism.


11

An example of data parallelism

Original Sequential Code

      DO K=1,N
        DO J=1,N
          DO I=1,N
            C(I,J) = C(I,J) + A(I,K)*B(K,J)
          END DO
        END DO
      END DO

Parallel Code

!$OMP PARALLEL DO
      DO K=1,N
        DO J=1,N
          DO I=1,N
            C(I,J) = C(I,J) + A(I,K)*B(K,J)
          END DO
        END DO
      END DO
!$OMP END PARALLEL DO
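Note that when the K loop (the summation index) is the one divided among threads, every thread updates the same C(I,J) entries, so strictly speaking the shared updates to C need a reduction to be correct. A minimal race-free variant, not from the original slides (which keep the K loop so that the iteration table on a later slide works out), is to parallelize the J loop instead, so each thread owns a distinct set of columns of C:

!$OMP PARALLEL DO
      DO J=1,N
        DO K=1,N
          DO I=1,N
            C(I,J) = C(I,J) + A(I,K)*B(K,J)
          END DO
        END DO
      END DO
!$OMP END PARALLEL DO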

12

Quick Intro to OpenMP


OpenMP is a portable standard for parallel directives
covering both data and task parallelism.


More information about OpenMP is available on the OpenMP website.

We will have a lecture on Introduction to OpenMP later.


With OpenMP, the loop that is performed in parallel is the
loop that immediately follows the Parallel Do directive.


In our sample code, it's the K loop:


DO K=1,N

13

OpenMP Loop Parallelism

Iteration-Processor Assignments

The code segment running on each processor:

      DO J=1,N
        DO I=1,N
          C(I,J) = C(I,J) + A(I,K)*B(K,J)
        END DO
      END DO

For example, with N=20 and 4 processors, the iterations of K and the data elements are assigned as follows:

Processor   Iterations of K   Data Elements
proc0       K=1:5             A(I, 1:5),   B(1:5, J)
proc1       K=6:10            A(I, 6:10),  B(6:10, J)
proc2       K=11:15           A(I, 11:15), B(11:15, J)
proc3       K=16:20           A(I, 16:20), B(16:20, J)

14

OpenMP Style of Parallelism


can be done incrementally as follows:

1. Parallelize the most computationally intensive loop.

2. Compute performance of the code.

3. If performance is not satisfactory, parallelize another loop.

4. Repeat steps 2 and 3 as many times as needed.

The ability to perform incremental parallelism is considered a positive feature of data parallelism.

It is contrasted with the MPI (Message Passing Interface) style of parallelism, which is an "all or nothing" approach.


15

Task Parallelism


Task parallelism may be thought of as the opposite of data
parallelism.


Instead of the same operations being performed on different parts
of the data, each process performs different operations.


You can use task parallelism when your program can be split into
independent pieces, often subroutines, that can be assigned to
different processors and run concurrently.


Task parallelism is called "coarse grain" parallelism because the
computational work is spread into just a few subtasks.


More code is run in parallel because the parallelism is
implemented at a higher level than in data parallelism.


Task parallelism is often easier to implement and has less overhead
than data parallelism.


16

Task Parallelism


The abstract code shown in the diagram is decomposed into
4 independent code segments that are labeled A, B, C, and D.
The right hand side of the diagram illustrates the 4 code
segments running concurrently.

17

Task Parallelism

Original Code

program main

   code segment labeled A

   code segment labeled B

   code segment labeled C

   code segment labeled D

end

Parallel Code

program main

!$OMP PARALLEL

!$OMP SECTIONS

   code segment labeled A

!$OMP SECTION

   code segment labeled B

!$OMP SECTION

   code segment labeled C

!$OMP SECTION

   code segment labeled D

!$OMP END SECTIONS

!$OMP END PARALLEL

end

18

OpenMP Task Parallelism


With OpenMP, the code that follows each SECTION(S)
directive is allocated to a different processor. In our sample
parallel code, the allocation of code segments to processors is
as follows.

Processor   Code
proc0       code segment labeled A
proc1       code segment labeled B
proc2       code segment labeled C
proc3       code segment labeled D

19

Parallelism in Computers


How parallelism is exploited and enhanced within the
operating system and hardware components of a parallel
computer:


operating system


arithmetic


memory


disk


20

Operating System Parallelism


All of the commonly used parallel computers run a version of the
Unix operating system. In the table below each OS listed is in fact
Unix, but the name of the Unix OS varies with each vendor.









For more information about Unix, a collection of Unix documents is available.

Parallel Computer      OS
SGI Origin2000         IRIX
HP V-Class             HP-UX
Cray T3E               Unicos
IBM SP                 AIX
Workstation Clusters   Linux

21

Two Unix Parallelism Features


background processing facility

With the Unix background processing facility you can run the executable a.out in the background and simultaneously view the man page for the etime function in the foreground. There are two Unix commands that accomplish this:

a.out > results &
man etime

cron feature

With the Unix cron feature you can submit a job that will run at a later time.


22

Arithmetic Parallelism


Multiple execution units

facilitate arithmetic parallelism.

The arithmetic operations of add, subtract, multiply, and divide (+ - * /) are each done in a separate execution unit. This allows several execution units to be used simultaneously, because the execution units operate independently.

Fused multiply and add

is another parallel arithmetic feature.

Parallel computers are able to overlap multiply and add. This arithmetic is named MultiplyADD (MADD) on SGI computers, and Fused Multiply Add (FMA) on HP computers. In either case, the two arithmetic operations are overlapped and can complete in hardware in one computer cycle. (A short loop that typically maps to this instruction follows this list.)

Superscalar arithmetic

is the ability to issue several arithmetic operations per computer cycle.

It makes use of the multiple, independent execution units. On superscalar computers there are multiple slots per cycle that can be filled with work. This gives rise to the name n-way superscalar, where n is the number of slots per cycle. The SGI Origin2000 is called a 4-way superscalar computer.
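A minimal sketch, not from the original slides (the routine and array names are illustrative): the loop body below is a multiply followed by a dependent add, which compilers for MADD/FMA-capable processors typically turn into a single fused instruction.

      SUBROUTINE AXPY(N, A, X, Y)
C     Computes Y = Y + A*X; the multiply and the add in the loop body
C     can be fused into one MADD/FMA instruction per iteration.
      INTEGER N, I
      REAL A, X(N), Y(N)
      DO I = 1, N
         Y(I) = Y(I) + A * X(I)
      END DO
      END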


23

Memory Parallelism


memory interleaving


memory is divided into multiple banks, and consecutive data elements are
interleaved among them. For example if your computer has 2 memory banks,
then data elements with even memory addresses would fall into one bank, and
data elements with odd memory addresses into the other.


multiple memory ports


Port means a bi-directional memory pathway. When the data elements that are interleaved across the memory banks are needed, the multiple memory ports allow them to be accessed and fetched in parallel, which increases the memory bandwidth (MB/s or GB/s).


multiple levels of the memory hierarchy



There is global memory that any processor can access. There is memory that is
local to a partition of the processors. Finally there is memory that is local to a
single processor, that is, the cache memory and the memory elements held in
registers.


Cache memory



Cache

is a small memory that has fast access compared with the larger main
memory and serves to keep the faster processor filled with data.


24

Memory Parallelism

(Figures on this slide: the memory hierarchy and cache memory.)

25

Disk Parallelism


RAID (Redundant Array of Inexpensive Disks)


RAID disks are on most parallel computers.


The advantage of a RAID disk system is that it provides a
measure of fault tolerance.


If one of the disks goes down, it can be swapped out, and the
RAID disk system remains operational.


Disk Striping


When a data set is written to disk, it is striped across the RAID
disk system. That is, it is broken into pieces that are written
simultaneously to the different disks in the RAID disk system.
When the same data set is read back in, the pieces are read in
parallel, and the full data set is reassembled in memory.


26

Agenda

1 Parallel Computing Overview

1.1 Introduction to Parallel Computing

1.1.1 Parallelism in our Daily Lives

1.1.2 Parallelism in Computer Programs

1.1.3 Parallelism in Computers

1.1.4 Performance Measures

1.1.5 More Parallelism Issues

1.2 Comparison of Parallel Computers

1.3 Summary


27

Performance Measures


Peak Performance


is the top speed at which the computer can operate.


It is a theoretical upper limit on the computer's performance.


Sustained Performance


is the highest consistently achieved speed.


It is a more realistic measure of computer performance.


Cost Performance


is used to determine if the computer is cost effective.


MHz


is a measure of the processor speed.


The processor speed is commonly measured in millions of cycles per second,
where a computer cycle is defined as the shortest time in which some work can be
done.


MIPS


is a measure of how quickly the computer can issue instructions.


Millions of instructions per second is abbreviated as MIPS, where the instructions are computer instructions such as: memory reads and writes, logical operations, floating point operations, integer operations, and branch instructions.

28

Performance Measures


Mflops (Millions of floating point operations per second)

measures how quickly a computer can perform floating-point operations such as add, subtract, multiply, and divide.

Speedup

measures the benefit of parallelism. (A formula follows this list.)

It shows how your program scales as you compute with more processors, compared to the performance on one processor.

Ideal speedup happens when the performance gain is linearly proportional to the number of processors used.

Benchmarks

are used to rate the performance of parallel computers and parallel programs.

A well known benchmark that is used to compare parallel computers is the Linpack benchmark.

Based on the Linpack results, a list is produced of the Top 500 Supercomputer Sites. This list is maintained by the University of Tennessee and the University of Mannheim.
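A standard way to state this (the formula is not spelled out on the original slide, but it follows the usual definition): if T(1) is the run time on one processor and T(p) is the run time on p processors, then

    speedup     S(p) = T(1) / T(p)
    efficiency  E(p) = S(p) / p

Ideal (linear) speedup means S(p) = p, that is, efficiency stays at 1 as processors are added.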


29

More Parallelism Issues


Load balancing

is the technique of evenly dividing the workload among the processors.

For data parallelism it involves how iterations of loops are allocated to processors. (A short OpenMP example follows this list.)

Load balancing is important because the total time for the program to complete is the time spent by the longest executing thread.

The problem size

must be large and must be able to grow as you compute with more processors.

In order to get the performance you expect from a parallel computer you need to run a large application with large data sizes, otherwise the overhead of passing information between processors will dominate the calculation time.

Good software tools

are essential for users of high performance parallel computers.

These tools include:

parallel compilers

parallel debuggers

performance analysis tools

parallel math software

The availability of a broad set of application software is also important.
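A minimal sketch of loop-level load balancing, assuming the OpenMP directives introduced earlier (the routine WORK is an illustrative placeholder, not from the original slides): when iterations have uneven cost, a dynamic schedule hands out small batches of iterations to whichever thread is free, instead of giving each thread one fixed block.

      SUBROUTINE BALANCE(N)
      INTEGER N, I
C     Threads grab iterations 4 at a time as they finish earlier work.
!$OMP PARALLEL DO SCHEDULE(DYNAMIC, 4)
      DO I = 1, N
         CALL WORK(I)
      END DO
!$OMP END PARALLEL DO
      END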


30

More Parallelism Issues


The high performance computing market is risky and chaotic. Many supercomputer vendors are no longer in business, making the portability of your application very important.

A workstation farm

is defined as a fast network connecting heterogeneous workstations.

The individual workstations serve as desktop systems for their owners.

When they are idle, large problems can take advantage of the unused cycles in the whole system.

An application of this concept is the SETI project. You can participate in searching for extraterrestrial intelligence with your home PC. More information about this project is available at the SETI Institute.

Condor

is software that provides resource management services for applications that run on heterogeneous collections of workstations.

Miron Livny at the University of Wisconsin at Madison is the director of the Condor project, and has coined the phrase high throughput computing to describe this process of harnessing idle workstation cycles. More information is available at the Condor Home Page.


31

Agenda

1 Parallel Computing Overview

1.1 Introduction to Parallel Computing

1.2 Comparison of Parallel Computers

1.2.1 Processors

1.2.2 Memory Organization

1.2.3 Flow of Control

1.2.4 Interconnection Networks

1.2.4.1 Bus Network

1.2.4.2 Cross-Bar Switch Network

1.2.4.3 Hypercube Network

1.2.4.4 Tree Network

1.2.4.5 Interconnection Networks Self-test

1.2.5 Summary of Parallel Computer Characteristics

1.3 Summary

32

Comparison of Parallel Computers


Now you can explore the hardware components of parallel
computers:


kinds of processors


types of memory organization


flow of control


interconnection networks



You will see what is common to these parallel computers,
and what makes each one of them unique.


33

Kinds of Processors


There are three types of parallel computers:

1.
computers with a small number of powerful processors


Typically have tens of processors.


The cooling of these computers often requires very sophisticated and
expensive equipment, making these computers very expensive for computing
centers.


They are general-purpose computers that perform especially well on applications that have large vector lengths.

The examples of this type of computer are the Cray SV1 and the Fujitsu VPP5000.

34

Kinds of Processors


There are three types of parallel computers:

2.
computers with a large number of less powerful processors


Named a Massively Parallel Processor (MPP), typically have thousands of processors.

The processors are usually proprietary and air-cooled.

Because of the large number of processors, the distance between the furthest processors can be quite large, requiring a sophisticated internal network that allows distant processors to communicate with each other quickly.

These computers are suitable for applications with a high degree of concurrency.

The MPP type of computer was popular in the 1980s.

Examples of this type of computer were the Thinking Machines CM-2 computer, and the computers made by the MassPar company.

35

Kinds of Processors


There are three types of parallel computers:

3.
computers that are medium scale in between the two extremes


Typically have hundreds of processors.


The processor chips are usually not proprietary; rather they are commodity
processors like the Pentium III.


These are general-purpose computers that perform well on a wide range of applications.


The most common example of this class is the Linux Cluster.


36

Trends and Examples


Processor trends:

Decade   Processor Type                    Computer Example
1970s    Pipelined, Proprietary            Cray-1
1980s    Massively Parallel, Proprietary   Thinking Machines CM-2
1990s    Superscalar, RISC, Commodity      SGI Origin2000
2000s    CISC, Commodity                   Workstation Clusters

The processors on today's commonly used parallel computers:

Computer               Processor
SGI Origin2000         MIPS RISC R12000
HP V-Class             HP PA 8200
Cray T3E               Compaq Alpha
IBM SP                 IBM Power3
Workstation Clusters   Intel Pentium III, Intel Itanium

37

Memory Organization


The following paragraphs describe the three types of
memory organization found on parallel computers:


distributed memory


shared memory


distributed shared memory


38

Distributed Memory


In distributed memory computers, the total memory is partitioned
into memory that is private to each processor.


There is a Non-Uniform Memory Access time (NUMA), which is proportional to the distance between the two communicating processors.

On NUMA computers, data is accessed the quickest from a private memory, while data from the most distant processor takes the longest to access.

Some examples are the Cray T3E, the IBM SP, and workstation clusters.

39

Distributed Memory


When programming distributed memory computers, the
code and the data should be structured such that the bulk of
a processor’s data accesses are to its own private (local)
memory.


This is called having good data locality.

Today's distributed memory computers use message passing, such as MPI, to communicate between processors, as shown in the following example:


40

Distributed Memory


One advantage of distributed memory computers is that they
are easy to scale. As the demand for resources grows,
computer centers can easily add more memory and
processors.


This is often called the LEGO block approach.


The drawback is that programming of distributed memory
computers can be quite complicated.


41

Shared Memory


In shared memory computers, all processors have access to a single pool
of centralized memory with a uniform address space.


Any processor can address any memory location at the same speed, so there is Uniform Memory Access time (UMA).

Processors communicate with each other through the shared memory.

The advantages and disadvantages of shared memory machines are roughly the opposite of distributed memory computers.

They are easier to program because they resemble the programming of single processor machines.

But they don't scale like their distributed memory counterparts.


42

Distributed Shared Memory


In Distributed Shared Memory (DSM) computers, a cluster or partition of processors has access to a common shared memory.

It accesses the memory of a different processor cluster in a NUMA fashion.

Memory is physically distributed but logically shared.

Attention to data locality again is important.

Distributed shared memory computers combine the best features of both distributed memory computers and shared memory computers.

That is, DSM computers have both the scalability of distributed memory computers and the ease of programming of shared memory computers.

Some examples of DSM computers are the SGI Origin2000 and the HP V-Class computers.

43

Trends and Examples


Memory organization trends:

Decade   Memory Organization         Example
1970s    Shared Memory               Cray-1
1980s    Distributed Memory          Thinking Machines CM-2
1990s    Distributed Shared Memory   SGI Origin2000
2000s    Distributed Memory          Workstation Clusters

The memory organization of today's commonly used parallel computers:

Computer               Memory Organization
SGI Origin2000         DSM
HP V-Class             DSM
Cray T3E               Distributed
IBM SP                 Distributed
Workstation Clusters   Distributed

44

Flow of Control


When you look at the flow of control you will see three types of parallel computers:

Single Instruction Multiple Data (SIMD)

Multiple Instruction Multiple Data (MIMD)

Single Program Multiple Data (SPMD)


45

Flynn’s Taxonomy


Flynn’s Taxonomy, devised in 1972 by Michael Flynn of Stanford
University, describes computers by how streams of instructions interact
with streams of data.


There can be single or multiple instruction streams, and there can be single or multiple data streams. This gives rise to 4 types of computers.

Flynn's taxonomy names the 4 computer types SISD, MISD, SIMD and MIMD.

Of these 4, only SIMD and MIMD are applicable to parallel computers.

Another computer type, SPMD, is a special case of MIMD.

46

SIMD Computers


SIMD stands for Single Instruction Multiple Data.

Each processor follows the same set of instructions, with different data elements being allocated to each processor.

SIMD computers have distributed memory with typically thousands of simple processors, and the processors run in lock step.

SIMD computers, popular in the 1980s, are useful for fine grain data parallel applications, such as neural networks.

Some examples of SIMD computers were the Thinking Machines CM-2 computer and the computers from the MassPar company.

The processors are commanded by the global controller that sends instructions to the processors.

It says add, and they all add.

It says shift to the right, and they all shift to the right.

The processors are like obedient soldiers, marching in unison.

47

MIMD Computers


MIMD stands for Multiple Instruction Multiple Data.

There are multiple instruction streams with separate code segments distributed among the processors.

MIMD is actually a superset of SIMD, so that the processors can run the same instruction stream or different instruction streams.

In addition, there are multiple data streams; different data elements are allocated to each processor.

MIMD computers can have either distributed memory or shared memory.

While the processors on SIMD computers run in lock step, the processors on MIMD computers run independently of each other.

MIMD computers can be used for either data parallel or task parallel applications.

Some examples of MIMD computers are the SGI Origin2000 computer and the HP V-Class computer.


48

SPMD Computers


SPMD stands for Single Program Multiple Data.

SPMD is a special case of MIMD.

SPMD execution happens when a MIMD computer is programmed to have the same set of instructions per processor.

With SPMD computers, while the processors are running the same code segment, each processor can run that code segment asynchronously.

Unlike SIMD, the synchronous execution of instructions is relaxed.

An example is the execution of an if statement on a SPMD computer. (A short sketch follows this list.)

Because each processor computes with its own partition of the data elements, it may evaluate the condition of the if statement differently from another processor.

One processor may take a certain branch of the if statement, and another processor may take a different branch of the same if statement.

Hence, even though each processor has the same set of instructions, those instructions may be evaluated in a different order from one processor to the next.

The analogies we used for describing SIMD computers can be modified for MIMD computers.

Instead of the SIMD obedient soldiers, all marching in unison, in the MIMD world the processors march to the beat of their own drummer.
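A minimal sketch of that if statement, not from the original slides (MYPART, POSITIVE_CASE, and NEGATIVE_CASE are illustrative placeholders): every processor runs the same code, but the branch taken depends on the processor's own partition of the data.

C     Same program on every processor; MYPART(:) holds that
C     processor's partition of the data, so the test below can
C     come out differently on different processors.
      IF (SUM(MYPART) .GT. 0.0) THEN
         CALL POSITIVE_CASE()
      ELSE
         CALL NEGATIVE_CASE()
      END IF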


49

Summary of SIMD versus MIMD

                    SIMD                      MIMD
Memory              distributed memory        distributed memory or shared memory
Code Segment        same per processor        same or different
Processors Run In   lock step                 asynchronously
Data Elements       different per processor   different per processor
Applications        data parallel             data parallel or task parallel

50

Trends and Examples


Flow of control trends:

Decade   Flow of Control   Computer Example
1980s    SIMD              Thinking Machines CM-2
1990s    MIMD              SGI Origin2000
2000s    MIMD              Workstation Clusters

The flow of control on today's commonly used parallel computers:

Computer               Flow of Control
SGI Origin2000         MIMD
HP V-Class             MIMD
Cray T3E               MIMD
IBM SP                 MIMD
Workstation Clusters   MIMD

51

Agenda

1 Parallel Computing Overview

1.1 Introduction to Parallel Computing

1.2 Comparison of Parallel Computers

1.2.1 Processors

1.2.2 Memory Organization

1.2.3 Flow of Control

1.2.4 Interconnection Networks

1.2.4.1 Bus Network

1.2.4.2 Cross-Bar Switch Network

1.2.4.3 Hypercube Network

1.2.4.4 Tree Network

1.2.4.5 Interconnection Networks Self-test

1.2.5 Summary of Parallel Computer Characteristics

1.3 Summary

52

Interconnection Networks


What exactly is the interconnection network?


The
interconnection network

is made up of the wires and cables that define how the
multiple processors of a parallel computer are connected to each other and to the
memory units.


The time required to transfer data is dependent upon the specific type of the
interconnection network.


This transfer time is called the communication time.


What network characteristics are important?


Diameter: the maximum distance that data must travel for 2 processors to
communicate.


Bandwidth: the amount of data that can be sent through a network connection.


Latency: the delay on a network while a data packet is being stored and forwarded.


Types of Interconnection Networks

The network topologies (geometric arrangements of the computer network connections) are:

Bus

Cross-bar Switch

Hypercube

Tree

53

Interconnection Networks


The aspects of network issues are:


Cost


Scalability


Reliability


Suitable Applications


Data Rate


Diameter


Degree


General Network Characteristics


Some networks can be compared in terms of their degree and diameter.


Degree: how many communicating wires are coming out of each processor.

A large degree is a benefit because it provides multiple paths.

Diameter: the distance between the two processors that are farthest apart.

A small diameter corresponds to low latency.


54

Bus Network


Bus topology is the original coaxial cable-based Local Area Network (LAN) topology in which the medium forms a single bus to which all stations are attached.

The positive aspects

It is a mature technology that is well known and reliable.

The cost is very low.

It is simple to construct.

The negative aspects

limited data transmission rate.

not scalable in terms of performance.

Example: the SGI Power Challenge, which only scaled to 18 processors.


55

Cross-Bar Switch Network

A cross-bar switch is a network that works through a switching mechanism to access shared memory.

It scales better than the bus network but it costs significantly more.

The telephone system uses this type of network. An example of a computer with this type of network is the HP V-Class.

The diagram on the original slide shows the processors talking through the switchboxes to store or retrieve data in memory.

There are multiple paths for a processor to communicate with a certain memory.

The switches determine the optimal route to take.

56

Hypercube Network

In a hypercube network, the processors are connected as if they were corners of a multidimensional cube. Each node in an N-dimensional cube is directly connected to N other nodes.

The fact that the number of directly connected, "nearest neighbor", nodes increases with the total size of the network is also highly desirable for a parallel computer.

The degree of a hypercube network is log n and the diameter is log n, where n is the number of processors. (A worked example follows this list.)

Examples of computers with this type of network are the CM-2, NCUBE-2, and the Intel iPSC860.
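For example (a direct consequence of the definition rather than a figure from the original slides): with n = 16 processors arranged as a 4-dimensional hypercube, each processor is wired to log2(16) = 4 neighbors, and a message needs at most 4 hops to reach the most distant processor.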

57

Tree Network


The processors are the bottom nodes of the tree. For a processor to retrieve data, it must go up in the network and then go back down.

This is useful for decision making applications that can be mapped as trees.

The degree of a tree network is 1. The diameter of the network is 2 log(n+1) - 2, where n is the number of processors.

The Thinking Machines CM-5 is an example of a parallel computer with this type of network.

Tree networks are very suitable for database applications because they allow multiple searches through the database at a time.

58

Interconnection Networks

Torus Network: A mesh with wrap-around connections in both the x and y directions.


Multistage Network: A network with more than one
networking unit.


Fully Connected Network: A network where every processor
is connected to every other processor.


Hypercube Network: Processors are connected as if they
were corners of a multidimensional cube.


Mesh Network: A network where each interior processor is
connected to its four nearest neighbors.


59

Interconnection Networks

Bus Based Network: Coaxial cable based LAN topology in which the medium forms a single bus to which all stations are attached.

Cross-bar Switch Network: A network that works through a switching mechanism to access shared memory.


Tree Network: The processors are the bottom nodes of the
tree.


Ring Network: Each processor is connected to two others
and the line of connections forms a circle.

60

Summary of Parallel Computer
Characteristics


How many processors does the computer have?


10s?


100s?


1000s?


How powerful are the processors?


what's the MHz rate


what's the MIPS rate


What's the instruction set architecture?


RISC


CISC


61

Summary of Parallel Computer
Characteristics


How much memory is available?


total memory


memory per processor


What kind of memory?


distributed memory


shared memory


distributed shared memory


What type of flow of control?


SIMD


MIMD


SPMD

62

Summary of Parallel Computer
Characteristics


What is the interconnection network?


Bus


Crossbar


Hypercube


Tree


Torus


Multistage


Fully Connected


Mesh


Ring


Hybrid

63

Design decisions made by some of the major parallel computer vendors

Computer               Programming Style   OS       Processors          Memory        Flow of Control   Network
SGI Origin2000         OpenMP, MPI         IRIX     MIPS RISC R10000    DSM           MIMD              Crossbar, Hypercube
HP V-Class             OpenMP, MPI         HP-UX    HP PA 8200          DSM           MIMD              Crossbar, Ring
Cray T3E               SHMEM               Unicos   Compaq Alpha        Distributed   MIMD              Torus
IBM SP                 MPI                 AIX      IBM Power3          Distributed   MIMD              IBM Switch
Workstation Clusters   MPI                 Linux    Intel Pentium III   Distributed   MIMD              Myrinet Tree

64

Summary


This completes our introduction to parallel computing.


You have learned about parallelism in computer programs, and
also about parallelism in the hardware components of parallel
computers.


In addition, you have learned about the commonly used parallel
computers, and how these computers compare to each other.


There are many good texts which provide an introductory
treatment of parallel computing. Here are two useful references:



Highly Parallel Computing, Second Edition, George S. Almasi and Allan Gottlieb, Benjamin/Cummings Publishers, 1994

Parallel Computing Theory and Practice, Michael J. Quinn, McGraw-Hill, Inc., 1994


65