Distributed Memory Multiprocessors


Graduate Computer Architecture, Fall 2005

Lecture 10


Distributed Memory Multiprocessors


Shih-Hao Hung

Computer Science & Information Engineering

National Taiwan University



Adapted from Prof. Culler’s CS252 Spring 2005 class notes,

Copyright 2005 UC Berkeley

2

Natural Extensions of Memory System

[Figure: three organizations, ordered by increasing scale. Shared cache: P1..Pn reach an interleaved first-level cache and interleaved main memory through a switch. Centralized memory ("dance hall", UMA): each processor has its own cache ($); all processors reach the memory modules through an interconnection network. Distributed memory (NUMA): each processor has its own cache and local memory (Mem), and nodes communicate through an interconnection network.]

3

Fundamental Issues

3 issues to characterize parallel machines:

1) Naming
2) Synchronization
3) Performance: Latency and Bandwidth (covered earlier)

4

Fundamental Issue #1: Naming

Naming:
  what data is shared
  how it is addressed
  what operations can access data
  how processes refer to each other

Choice of naming affects code produced by a compiler:
  via load, where the compiler just remembers an address, or
  keep track of processor number and local virtual address for msg. passing

Choice of naming affects replication of data:
  via load in cache memory hierarchy, or
  via SW replication and consistency

5

Fundamental Issue #1: Naming

Global physical address space:
  any processor can generate, address, and access it in a single operation
  memory can be anywhere: virtual addr. translation handles it

Global virtual address space:
  if the address space of each process can be configured to contain all shared data of the parallel program

Segmented shared address space:
  locations are named <process number, address> uniformly for all processes of the parallel program (see the sketch below)
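
A minimal sketch (not from the lecture) of the segmented option: a location is named by a <process number, address> pair, and a runtime helper decides whether a reference is local or needs a network transaction. The type and function names are illustrative assumptions.

```c
#include <stdint.h>
#include <stdio.h>

/* Segmented shared address space: every process names a location
   uniformly as <process number, address within that segment>. */
typedef struct {
    int      process;   /* owning process's segment */
    uint64_t offset;    /* address within the segment */
} seg_name_t;

/* Illustrative check a runtime might perform before choosing between
   a plain load and a remote (network) access. */
static int is_local(seg_name_t name, int my_process) {
    return name.process == my_process;
}

int main(void) {
    seg_name_t x = { .process = 3, .offset = 0x1000 };
    printf("reference to <%d, 0x%llx> from process 0 is %s\n",
           x.process, (unsigned long long)x.offset,
           is_local(x, 0) ? "local" : "remote");
    return 0;
}
```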

6

Fundamental Issue #2: Synchronization

To cooperate, processes must coordinate

Message passing is implicit coordination with transmission or arrival of data

Shared address => additional operations to explicitly coordinate:
  e.g., write a flag, awaken a thread, interrupt a processor (see the sketch below)
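
As a concrete illustration of "write a flag, awaken a thread" in a shared address space, here is a minimal pthreads sketch; the variable and function names are mine, not the lecture's.

```c
#include <pthread.h>
#include <stdio.h>

/* Shared data plus the explicit coordination state. */
static int data;
static int ready;                          /* the flag */
static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);
    data  = 42;                            /* communicate via memory  */
    ready = 1;                             /* write a flag ...        */
    pthread_cond_signal(&cv);              /* ... and awaken a thread */
    pthread_mutex_unlock(&m);
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    pthread_mutex_lock(&m);
    while (!ready)                         /* wait until the flag is set */
        pthread_cond_wait(&cv, &m);
    printf("consumer saw data = %d\n", data);
    pthread_mutex_unlock(&m);

    pthread_join(t, NULL);
    return 0;
}
```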

7

Parallel Architecture Framework

Layers:

Programming Model:
  Multiprogramming: lots of jobs, no communication
  Shared address space: communicate via memory
  Message passing: send and receive messages
  Data Parallel: several agents operate on several data sets simultaneously and then exchange information globally and simultaneously (shared or message passing)

Communication Abstraction:
  Shared address space: e.g., load, store, atomic swap
  Message passing: e.g., send, receive library calls (see the sketch below)
  Debate over this topic (ease of programming, scaling) => many hardware designs 1:1 programming model

Layer stack: Programming Model / Communication Abstraction / Interconnection SW/OS / Interconnection HW
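
A minimal sketch of the message-passing abstraction's send/receive library calls; MPI is used only as one familiar example realization, the lecture does not prescribe a particular library.

```c
#include <mpi.h>
#include <stdio.h>

/* Two processes communicate by explicit send/receive calls rather than
   by loads and stores to a shared address space. Run with two ranks. */
int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 123;
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* to rank 1, tag 0 */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```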

8

Scalable Machines

What are the design trade-offs for the spectrum of machines in between?
  specialized or commodity nodes?
  capability of node-to-network interface
  supporting programming models?

What does scalability mean?
  avoids inherent design limits on resources
  bandwidth increases with P
  latency does not
  cost increases slowly with P

9

Bandwidth Scalability


What fundamentally limits bandwidth?


single set of wires


Must have
many independent wires


Connect modules through
switches


Bus vs Network Switch?

[Figure: processor + memory modules (P, M, M) attached through switches (S); typical switches: bus, multiplexers, crossbar.]

10

Dancehall MP Organization


Network bandwidth?


Bandwidth demand?


independent processes?


communicating processes?


Latency?






[Figure: dancehall organization - processors with caches (P, $) on one side of a scalable network of switches, memory modules (M) on the other.]

11

Generic Distributed Memory Org.


Network bandwidth?


Bandwidth demand?


independent processes?


communicating processes?


Latency?






[Figure: generic distributed memory organization - each node has a processor, cache, memory, and communication assist (P, $, M, CA) attached to a scalable network of switches.]

12

Key Property


Large number of independent communication paths
between nodes

=> allow a large number of concurrent transactions
using different wires


initiated independently


no global arbitration


effect of a transaction only visible to the nodes
involved


effects propagated through additional transactions

13

Programming Models Realized by
Protocols

[Figure: layers realizing programming models. Parallel applications (CAD, database, scientific modeling) run over programming models (multiprogramming, shared address, message passing, data parallel); these map onto the communication abstraction at the user/system boundary (via compilation or library, with operating systems support), which in turn maps onto communication hardware and the physical communication medium at the hardware/software boundary. Network transactions are the underlying mechanism.]

14

Network Transaction


Key Design Issue:


How much interpretation of the message?


How much dedicated processing in the Comm. Assist?



[Figure: node architecture - each node is a processor, memory, and communication assist (P, M, CA) attached to a scalable network; messages flow between the CAs.]

Output Processing:
  checks
  translation
  formatting
  scheduling

Input Processing:
  checks
  translation
  buffering
  action
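
A hedged sketch of a network transaction and the two processing stages listed above; the types and function names are purely illustrative, not any real machine's interface.

```c
#include <stdio.h>

/* One network transaction: a small amount of information sent from an
   output buffer at the source to an input buffer at the destination. */
typedef struct {
    int  dest_node;
    int  opcode;       /* e.g. read request, read response, ... */
    long addr;         /* interpreted by the destination        */
    long payload;
} net_txn_t;

/* Output processing at the source CA: checks, translation, formatting,
   scheduling (only the formatting step is made explicit here). */
static net_txn_t output_process(int dest, int opcode, long addr, long payload) {
    net_txn_t t = { dest, opcode, addr, payload };
    return t;
}

/* Input processing at the destination CA: checks, translation, buffering,
   and finally the action (here just a print). */
static void input_process(const net_txn_t *t) {
    printf("node %d: opcode %d on addr 0x%lx (payload %ld)\n",
           t->dest_node, t->opcode, t->addr, t->payload);
}

int main(void) {
    net_txn_t t = output_process(/*dest=*/1, /*opcode=*/0, 0x100, 42);
    input_process(&t);      /* delivery over the scalable network elided */
    return 0;
}
```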

15

Shared Address Space Abstraction


Fundamentally a two-way request/response protocol
  writes have an acknowledgement

Issues
  fixed or variable length (bulk) transfers
  remote virtual or physical address, where is action performed?
  deadlock avoidance and input buffer full
  coherent? consistent?

[Figure: remote read in the shared address space. The source executes Load r <- [Global address], then: (1) initiate memory access; (2) address translation; (3) local/remote check; (4) request transaction - a read request crosses the network; (5) remote memory access at the destination; (6) reply transaction - the read response returns; (7) complete memory access at the source, which has been waiting on the read response.]

16

Key Properties of Shared Address
Abstraction


Source and destination data addresses are specified
by the source of the request


a degree of logical coupling and trust


no storage logically

outside the address space



may employ temporary buffers for transport


Operations are fundamentally request response


Remote operation can be performed on remote
memory


logically does not require intervention of the remote
processor

17

Consistency


write-atomicity violated without caching

[Figure: three processors (P1, P2, P3), each with a memory module, on an interconnection network. One processor executes A=1; flag=1; another spins on while (flag==0); and then prints A. A (initially 0) and flag (0 -> 1) live in different memories. The transactions are 1: A=1, 2: flag=1, 3: load A; if the path carrying A=1 is congested and delayed, flag=1 can be observed before A=1 arrives, and the print returns 0.]

18

Message passing


Bulk transfers


Complex synchronization semantics


more complex protocols


More complex action



Synchronous


Send completes after matching recv and source data
sent


Receive completes after data transfer complete from
matching send


Asynchronous


Send completes after send buffer may be reused
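
To make the completion rules concrete, a hedged MPI illustration (MPI is only one realization of these semantics): MPI_Ssend completes only after the matching receive is underway, while MPI_Isend returns immediately and MPI_Wait completes once the send buffer may be reused. Run with two ranks.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, buf[4] = {1, 2, 3, 4};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Synchronous: completes only after the matching receive
           at rank 1 has been posted and the transfer has begun. */
        MPI_Ssend(buf, 4, MPI_INT, 1, 0, MPI_COMM_WORLD);

        /* Asynchronous: MPI_Isend returns at once; MPI_Wait completes
           when the send buffer may safely be reused. */
        MPI_Request req;
        MPI_Isend(buf, 4, MPI_INT, 1, 1, MPI_COMM_WORLD, &req);
        /* ... computation could overlap with the transfer here ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);
    } else if (rank == 1) {
        MPI_Recv(buf, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(buf, 4, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 got both messages\n");
    }

    MPI_Finalize();
    return 0;
}
```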


19

Synchronous Message Passing


Constrained programming model.


Deterministic! What happens when threads added?


Destination contention very limited.


User/System boundary?

[Figure: synchronous message passing between source and destination over time. Source: Send Pdest, local VA, len: (1) initiate send; (2) address translation on Psrc; (3) local/remote check; (4) send-ready request to the destination, after which the source waits. Destination: Recv Psrc, local VA, len; tag check; (5) remote check for posted receive (assume success); (6) reply transaction (recv-rdy reply); (7) bulk data transfer (data-xfer req) from source VA to dest VA or ID.]

Processor Action?

20

Asynch. Message Passing: Optimistic


More powerful programming model

Wildcard receive => non-deterministic

Storage required within msg layer?

[Figure: optimistic asynchronous message passing. Source: Send (Pdest, local VA, len): (1) initiate send; (2) address translation; (3) local/remote check; (4) send data (data-xfer req). Destination: (5) remote check for posted receive; on fail, allocate data buffer; tag match against a later Recv Psrc, local VA, len.]

21

Asynch. Msg Passing: Conservative


Where is the buffering?


Contention control? Receiver initiated protocol?


Short message optimizations

[Figure: conservative asynchronous message passing. Source: Send Pdest, local VA, len: (1) initiate send; (2) address translation on Pdest; (3) local/remote check; (4) send-ready request, after which the source returns and computes. Destination: tag check; (5) remote check for posted receive (assume fail); record send-ready. When the matching Recv Psrc, local VA, len is posted: (6) receive-ready request (recv-rdy req) back to the source; (7) bulk data reply from source VA to dest VA or ID.]

22

Key Features of Msg Passing Abstraction

Source knows send data address, dest. knows receive data address
  after handshake they both know both

Arbitrary storage "outside the local address spaces"
  may post many sends before any receives
  non-blocking asynchronous sends reduce the requirement to an arbitrary number of descriptors
    fine print says these are limited too

Fundamentally a 3-phase transaction
  includes a request / response
  can use optimistic 1-phase in limited "safe" cases
    credit scheme

23

Active Messages


User-level analog of network transaction
  transfer data packet and invoke handler to extract it from the network and integrate with on-going computation

Request/Reply

Event notification: interrupts, polling, events?

May also perform memory-to-memory transfer

[Figure: a request carries data and names a handler that runs at the destination; the reply likewise invokes a handler back at the requester.]
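
A minimal sketch of the active-message idea with illustrative types of my own: each message names the handler that the destination runs to extract the data from the network and integrate it with the on-going computation.

```c
#include <stdio.h>

/* An active message: payload plus the handler to invoke on arrival. */
typedef struct am_msg {
    void (*handler)(struct am_msg *m);
    int  src_node;
    long arg;
} am_msg_t;

static long remote_sum;       /* state of the on-going computation */

/* Request handler: fold the payload into the computation; a real system
   would also format and send a reply active message here. */
static void add_handler(am_msg_t *m) {
    remote_sum += m->arg;
    printf("handled request from node %d, sum now %ld\n", m->src_node, remote_sum);
}

/* Stand-in for network delivery: the receiving side just runs the handler. */
static void deliver(am_msg_t *m) {
    m->handler(m);
}

int main(void) {
    am_msg_t m = { add_handler, /*src_node=*/0, /*arg=*/7 };
    deliver(&m);              /* as if the request had just arrived */
    return 0;
}
```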

24

Common Challenges

Input buffer overflow
  N-1 queue over-commitment => must slow sources
  reserve space per source (credit) (see the sketch below)
    when available for reuse?
      Ack or Higher level
  Refuse input when full
    backpressure in reliable network
    tree saturation
    deadlock free
    what happens to traffic not bound for congested dest?
    Reserve ack back channel
    drop packets
    Utilize higher-level semantics of programming model
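
A hedged sketch of reserve-space-per-source (credit) flow control: each sender keeps a credit counter per destination, stalls at zero, and regains a credit when the receiver frees the reserved buffer slot (how the credit is returned, e.g. piggybacked on an ack, is left abstract). All names and sizes are illustrative.

```c
#include <stdio.h>

#define NODES   4
#define CREDITS 2   /* input buffer slots reserved per source at each destination */

static int credit[NODES][NODES];  /* credit[src][dst]: sends src may still issue to dst */

static void init_credits(void) {
    for (int s = 0; s < NODES; s++)
        for (int d = 0; d < NODES; d++)
            credit[s][d] = CREDITS;
}

/* Try to send: consume a credit, or report that the source must stall. */
static int try_send(int src, int dst) {
    if (credit[src][dst] == 0) {
        printf("node %d stalls: no credit for node %d\n", src, dst);
        return 0;
    }
    credit[src][dst]--;           /* one reserved buffer slot now in use */
    printf("node %d -> node %d (credits left: %d)\n", src, dst, credit[src][dst]);
    return 1;
}

/* Receiver drains a buffer slot and returns the credit. */
static void return_credit(int src, int dst) {
    credit[src][dst]++;
}

int main(void) {
    init_credits();
    try_send(0, 1);
    try_send(0, 1);
    try_send(0, 1);       /* stalls: both reserved slots at node 1 are in use */
    return_credit(0, 1);  /* destination freed one slot */
    try_send(0, 1);       /* succeeds again */
    return 0;
}
```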

25

Challenges (cont)

Fetch Deadlock
  For network to remain deadlock free, nodes must continue accepting messages, even when they cannot source msgs
    what if incoming transaction is a request?
      each may generate a response, which cannot be sent!
      what happens when internal buffering is full?
  logically independent request/reply networks
    physical networks
    virtual channels with separate input/output queues
  bound requests and reserve input buffer space
    K(P-1) requests + K responses per node
    service discipline to avoid fetch deadlock?
  NACK on input buffer full (see the sketch below)
    NACK delivery?
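
A hedged sketch of one combination of the options above: requests and replies are handled as logically separate traffic, replies are always accepted (their buffer space can be reserved because each node bounds its outstanding requests), and an incoming request is NACKed when the request buffer is full. Everything here is illustrative.

```c
#include <stdio.h>

#define REQ_SLOTS 2          /* bounded input buffering for incoming requests */

typedef enum { REQ, REPLY } kind_t;

static int req_used;         /* request slots currently occupied      */
static int replies_taken;    /* replies sunk so far (always accepted) */

/* Returns 1 if accepted, 0 if the sender is NACKed and must retry later. */
static int accept(kind_t k) {
    if (k == REPLY) {
        replies_taken++;     /* reply space was reserved up front, so never refused */
        printf("reply accepted (%d so far)\n", replies_taken);
        return 1;
    }
    if (req_used == REQ_SLOTS) {
        printf("request NACKed: input buffer full\n");
        return 0;            /* requester retries; nothing is held while it waits */
    }
    req_used++;
    printf("request accepted (%d/%d slots used)\n", req_used, REQ_SLOTS);
    return 1;
}

int main(void) {
    accept(REQ);
    accept(REQ);
    accept(REQ);             /* NACKed: both request slots are full      */
    accept(REPLY);           /* still accepted, so replies keep draining */
    return 0;
}
```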


26

Challenges in Realizing Prog. Models in the Large

One-way transfer of information

No global knowledge, nor global control
  barriers, scans, reduce, global-OR give fuzzy global state

Very large number of concurrent transactions

Management of input buffer resources
  many sources can issue a request and over-commit destination before any see the effect

Latency is large enough that you are tempted to take risks
  optimistic protocols
  large transfers
  dynamic allocation

Many, many more degrees of freedom in design and engineering of these systems

27

Network Transaction Processing


Key Design Issue:


How much interpretation of the message?


How much dedicated processing in the Comm. Assist?

[Figure: the same node architecture - processor, memory, and communication assist (P, M, CA) on a scalable network.]

Output Processing:
  checks
  translation
  formatting
  scheduling

Input Processing:
  checks
  translation
  buffering
  action

28

Spectrum of Designs

None: Physical bit stream
  blind, physical DMA                  nCUBE, iPSC, . . .

User/System
  User-level port                      CM-5, *T
  User-level handler                   J-Machine, Monsoon, . . .

Remote virtual address
  Processing, translation              Paragon, Meiko CS-2

Global physical address
  Proc + Memory controller             RP3, BBN, T3D

Cache-to-cache
  Cache controller                     Dash, KSR, Flash

Increasing HW Support, Specialization, Intrusiveness, Performance (???)


30

Shared Physical Address Space


NI emulates memory controller at source


NI emulates processor at dest


must be deadlock free

[Figure: shared physical address space over a scalable network. At the source node (P, $, MMU, Mem), a load Ld R <- Addr to a remote location is taken over by the NI acting as pseudo-memory: output processing formats a read request packet (dest, read addr, src, tag). At the destination node, the NI acts as a pseudo-processor: input processing parses the request, the memory access is performed, and output processing returns a response packet (data, tag, rrsp, src); input processing at the source then completes the read.]

31

Case Study: Cray T3D


Build up info in "shell"

Remote memory operations encoded in address

[Figure: Cray T3D node. A 150-MHz DEC Alpha (64 bit; 8-KB instruction + 8-KB data caches; 43-bit virtual address; prefetch; load-lock/store-conditional; 32-bit DTB) sits inside a shell that adds a prefetch queue (16 x 64), a message queue (4,080 x 4 x 64), special registers (swaperand, fetch&add, barrier, PE# + FC), DMA and a block-transfer (BLT) engine, 32- and 64-bit memory and byte operations, and nonblocking stores with a memory barrier. Req out/in and resp out/in paths connect to a 3D torus of pairs of PEs that share the network and BLT, up to 2,048 PEs with 64 MB each; the remote operation is carried in the physical address presented to DRAM.]

32

Case Study: NOW


General purpose processor embedded in NIC

[Figure: Berkeley NOW node. The main processor (UltraSparc, with L2 $ and Mem) connects over the 25-MHz SBUS through a bus adapter to a Myricom Lanai NIC (37.5-MHz processor, 256-MB SRAM, 3 DMA units: host DMA, send DMA, receive DMA) whose link interface feeds the Myrinet fabric: 160-MB/s bidirectional links and eight-port wormhole switches (X-bar).]

33

Context for Scalable Cache Coherence






[Figure: the generic distributed-memory node (P, $, M, CA) on a scalable switch-based network, shown again as context.]

Realizing Pgm Models through net transaction protocols
  - efficient node-to-net interface
  - interprets transactions

Caches naturally replicate data
  - coherence through bus snooping protocols
  - consistency

Scalable Networks
  - many simultaneous transactions

Scalable distributed memory

Need cache coherence protocols that scale!
  - no broadcast or single point of order

34

Generic Solution: Directories


Maintain state vector explicitly


associate with memory block


records state of block in each cache


On miss, communicate with directory


determine location of cached copies


determine action to take


conduct protocol to maintain coherence

[Figure: each node has a processor, cache, communication assist, and a slice of memory with its directory; nodes are joined by a scalable interconnection network.]

35

Administrative Break


Project Descriptions due today


Properties of a good project


There is an idea


There is a body of background work


There is something that differentiates the idea


There is a reasonable way to evaluate the idea

36

A Cache Coherent System Must:


Provide set of states, state transition diagram, and actions


Manage coherence protocol


(0) Determine when to invoke coherence protocol


(a) Find info about state of block in other caches to
determine action


whether need to communicate with other cached copies


(b) Locate the other copies


(c) Communicate with those copies (inval/update)


(0) is done the same way on all systems


state of the line is maintained in the cache


protocol is invoked if an "access fault" occurs on the line


Different approaches distinguished by (a) to (c)

37

Bus-based Coherence

All of (a), (b), (c) done through broadcast on bus
  faulting processor sends out a "search"
  others respond to the search probe and take necessary action

Could do it in scalable network too
  broadcast to all processors, and let them respond

Conceptually simple, but broadcast doesn't scale with p
  on bus, bus bandwidth doesn't scale
  on scalable network, every fault leads to at least p network transactions

Scalable coherence:
  can have same cache states and state transition diagram
  different mechanisms to manage protocol

38

One Approach: Hierarchical Snooping

Extend snooping approach: hierarchy of broadcast media
  tree of buses or rings (KSR-1)
  processors are in the bus- or ring-based multiprocessors at the leaves
  parents and children connected by two-way snoopy interfaces
    snoop both buses and propagate relevant transactions
  main memory may be centralized at root or distributed among leaves

Issues (a)-(c) handled similarly to bus, but not full broadcast
  faulting processor sends out "search" bus transaction on its bus
  propagates up and down hierarchy based on snoop results

Problems:
  high latency: multiple levels, and snoop/lookup at every level
  bandwidth bottleneck at root

Not popular today

39

Scalable Approach: Directories



Every memory block has associated directory
information


keeps track of copies of cached blocks and their states


on a miss, find directory entry, look it up, and
communicate only with the nodes that have copies if
necessary


in scalable networks, communication with directory
and copies is through network transactions


Many alternatives for organizing directory
information

40

Basic Operation of Directory

• k processors.

• With each cache-block in memory: k presence-bits, 1 dirty-bit

• With each cache-block in cache: 1 valid bit, and 1 dirty (owner) bit

[Figure: two processor/cache nodes and memory with its directory (presence bits, dirty bit) on an interconnection network.]

Read from main memory by processor i:
  If dirty-bit OFF then { read from main memory; turn p[i] ON; }
  If dirty-bit ON then { recall line from dirty proc (cache state to shared); update memory; turn dirty-bit OFF; turn p[i] ON; supply recalled data to i; }

Write to main memory by processor i:
  If dirty-bit OFF then { supply data to i; send invalidations to all caches that have the block; turn dirty-bit ON; turn p[i] ON; ... }
  ...
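
A hedged C sketch of the bookkeeping just described: k presence bits and a dirty bit per memory block. The structures and names are mine, the actual network messages to the caches are elided, and the write case with the dirty bit ON (left as "..." on the slide) is filled in the obvious way.

```c
#include <stdbool.h>
#include <stdio.h>

#define K 4                          /* number of processors */

typedef struct {
    bool present[K];                 /* presence bit per processor */
    bool dirty;                      /* dirty (owner) bit          */
} dir_entry_t;

/* Read from main memory by processor i (recall/supply messages elided). */
static void dir_read(dir_entry_t *d, int i) {
    if (d->dirty) {
        /* recall the line from the dirty processor (its cache state goes
           to shared), update memory, then turn the dirty bit OFF */
        d->dirty = false;
    }
    d->present[i] = true;            /* turn p[i] ON; supply data to i */
}

/* Write (request for exclusive ownership) by processor i. */
static void dir_write(dir_entry_t *d, int i) {
    if (d->dirty)
        d->dirty = false;            /* recall/flush from the current owner first */
    for (int j = 0; j < K; j++)      /* send invalidations to all other copies */
        if (j != i)
            d->present[j] = false;
    d->present[i] = true;            /* i becomes the sole, dirty owner */
    d->dirty = true;
}

int main(void) {
    dir_entry_t d = { { false }, false };
    dir_read(&d, 0);
    dir_read(&d, 2);
    dir_write(&d, 1);
    printf("after write by P1: dirty=%d P0=%d P1=%d P2=%d\n",
           d.dirty, d.present[0], d.present[1], d.present[2]);
    return 0;
}
```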

41

Basic Directory Transactions


[Figure: basic directory transactions. (a) Read miss to a block in dirty state: 1. read request to directory; 2. reply with owner identity; 3. read req. to owner; 4a. data reply to the requestor; 4b. revision message to directory. (b) Write miss to a block with two sharers: 1. RdEx request to directory; 2. reply with sharers' identity; 3a/3b. inval. req. to each sharer; 4a/4b. inval. acks back to the requestor.]

42

Example Directory Protocol (1st Read)

[Figure: first read. P1 executes ld vA -> rd pA; the miss (R/req) becomes a Read pA transaction to the directory/memory, which replies (R/reply). P1's cache line enters S, and the directory entry goes from U (uncached) to S with sharer list P1: pA. Cache states shown: E, S, I; directory states: D, S, U.]

43

Example Directory Protocol (Read Share)

[Figure: read by a second sharer. P2 also executes ld vA -> rd pA; another R/req and R/reply, the directory adds P2: pA to the sharer list, and P1's line, P2's line, and the directory entry are all in S; subsequent reads hit locally (R/_).]

44

Example Directory Protocol (Wr to shared)

[Figure: write to a shared block. P1 executes st vA -> wr pA (W/req E), which reaches the directory as Read_to_update pA. The directory, in S with sharers P1: pA and P2: pA, sends Invalidate pA to P2 (P2 goes to I via Inv/_ and returns an Inv ACK), moves to D, and replies with the data and exclusive ownership (RX/invalidate&reply, reply xD(pA)). P1's line becomes E (EX), so later writes hit locally (W/_).]

45

Example Directory Protocol (Wr to Ex)

[Figure: write to a block held dirty elsewhere. P2 executes st vA -> wr pA (W/req E), seen at the directory as Read_toUpdate pA. The directory, in D with owner P1, sends Inv pA to P1; P1 writes the line back (Write_back pA) and drops to I; the directory replies with the data (Reply xD(pA), RX/invalidate&reply), and P2's line goes from I to E.]

46

Directory Protocol (other transitions)

[Figure: remaining transitions: read-exclusive on an uncached block (RX/reply), invalidation of a dirty line causing a write-back (Inv/write_back), eviction of a shared line (Evict/?), eviction of a dirty line (Evict/write_back), and the corresponding Write_back handling at the directory.]

47

A Popular Middle Ground

Two-level "hierarchy"

Individual nodes are multiprocessors, connected non-hierarchically
  e.g. mesh of SMPs

Coherence across nodes is directory-based
  directory keeps track of nodes, not individual processors

Coherence within nodes is snooping or directory
  orthogonal, but needs a good interface of functionality

Examples:
  Convex Exemplar: directory-directory
  Sequent, Data General, HAL: directory-snoopy

SMP on a chip?

48

Example Two-level Hierarchies

[Figure: four example two-level hierarchies. (a) Snooping-snooping: bus-based (B1) nodes with main memory, joined by a second bus or ring (B2) through snooping adapters. (b) Snooping-directory: bus-based (B1) nodes joined by a network through assists. (c) Directory-directory: directory-based nodes (P, C, A, M/D on Network 1) joined by Network 2 through directory adapters. (d) Directory-snooping: directory-based nodes on Network 1 joined by a bus (or ring) through dir/snoopy adapters.]

49

Latency Scaling


T(n) = Overhead + Channel Time + Routing Delay


Overhead?


Channel Time(n) = n/B
---

BW at bottleneck


RoutingDelay(h,n)


50

Typical example

max distance: log n
number of switches: a n log n
overhead = 1 us, BW = 64 MB/s, 200 ns per hop

Pipelined:
  T_64(128)      = 1.0 us + 2.0 us + 6 hops * 0.2 us/hop   = 4.2 us
  T_1024(128)    = 1.0 us + 2.0 us + 10 hops * 0.2 us/hop  = 5.0 us

Store and Forward:
  T_64^sf(128)   = 1.0 us + 6 hops * (2.0 + 0.2) us/hop    = 14.2 us
  T_1024^sf(128) = 1.0 us + 10 hops * (2.0 + 0.2) us/hop   = 23 us
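
To double-check the arithmetic above, a small C program under the slide's parameters (1 us overhead, 64 MB/s treated as 64 bytes/us, 0.2 us per hop):

```c
#include <stdio.h>

/* Pipelined: the channel time for the message is paid once. */
static double t_pipelined(double n_bytes, int hops) {
    return 1.0 + n_bytes / 64.0 + hops * 0.2;          /* microseconds */
}

/* Store-and-forward: the channel time is paid at every hop. */
static double t_store_forward(double n_bytes, int hops) {
    return 1.0 + hops * (n_bytes / 64.0 + 0.2);
}

int main(void) {
    printf("T_64(128)      = %.1f us\n", t_pipelined(128, 6));       /*  4.2 */
    printf("T_1024(128)    = %.1f us\n", t_pipelined(128, 10));      /*  5.0 */
    printf("T_64^sf(128)   = %.1f us\n", t_store_forward(128, 6));   /* 14.2 */
    printf("T_1024^sf(128) = %.1f us\n", t_store_forward(128, 10));  /* 23.0 */
    return 0;
}
```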

51

Cost Scaling

cost(p,m) = fixed cost + incremental cost(p,m)

Bus Based SMP?

Ratio of processors : memory : network : I/O ?

Parallel efficiency(p) = Speedup(P) / P

Costup(p) = Cost(p) / Cost(1)

Cost-effective: speedup(p) > costup(p) (see the numeric illustration below)

Is super-linear speedup possible?
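
A small, purely hypothetical numeric illustration of the cost-effectiveness test speedup(p) > costup(p); the machine size, cost, and speedup below are invented for the example, not taken from the lecture.

```c
#include <stdio.h>

int main(void) {
    /* Hypothetical 16-processor machine: costs 5x a uniprocessor
       and achieves a speedup of 8 on the workload of interest. */
    double p = 16.0, speedup = 8.0, cost_p = 5.0, cost_1 = 1.0;

    double efficiency = speedup / p;        /* 0.50 */
    double costup     = cost_p / cost_1;    /* 5.0  */

    printf("efficiency = %.2f, costup = %.1f, cost-effective: %s\n",
           efficiency, costup, speedup > costup ? "yes" : "no");
    return 0;
}
```

Even at 50% parallel efficiency this hypothetical machine is cost-effective, because its speedup (8) exceeds its costup (5).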