Graduate Computer Architecture I
Lecture 11: Distributed Memory Multiprocessors
CSE/ESE 560M – Graduate Computer Architecture I
Natural Extensions of Memory System
[Figure: three organizations, ordered by increasing scale — (a) shared cache: processors P1…Pn through a switch to a shared first-level cache and interleaved main memory; (b) centralized memory (dance hall, UMA): processors with private caches on one side of an interconnection network, interleaved memory on the other; (c) distributed memory (NUMA): each processor with its cache and local memory attached to the interconnection network]
Fundamental Issues
1. Naming
2. Synchronization
3. Performance: Latency and Bandwidth
Fundamental Issue #1: Naming
• Naming
  – what data is shared
  – how it is addressed
  – what operations can access data
  – how processes refer to each other
• Choice of naming affects
  – code produced by a compiler
    • via load, where we just remember an address, or by keeping track of processor number and local virtual address for message passing
  – replication of data
    • via load in the cache memory hierarchy, or via SW replication and consistency
Fundamental Issue #1: Naming
• Global physical address space
  – any processor can generate, address, and access it in a single operation
  – memory can be anywhere: virtual addr. translation handles it
• Global virtual address space
  – if the address space of each process can be configured to contain all shared data of the parallel program
• Segmented shared address space
  – locations are named <process number, address> uniformly for all processes of the parallel program
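Segmented naming can be sketched in a few lines. This is a toy illustration (not from the lecture); the `segments` layout and `read` helper are hypothetical, showing only that every shared location is named by the same `<process number, address>` pair in every process.

```python
# Toy sketch (assumed example) of segmented shared-address naming:
# every shared location is named <process number, address>, uniformly
# for all processes of the parallel program.
segments = {0: [10, 11], 1: [20, 21]}   # hypothetical per-process segments

def read(name):
    process, addr = name                # name = <process number, address>
    return segments[process][addr]

print(read((1, 0)))                     # 20
```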
Fundamental Issue #2: Synchronization
• Message passing
  – coordination is implicit
  – in the transmission of data
  – and in the arrival of data
• Shared address
  – must coordinate explicitly, e.g.
  – write a flag
  – awaken a thread
  – interrupt a processor
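The "write a flag / awaken a thread" pattern above can be sketched with ordinary threads. A minimal sketch, assuming a shared address space; the variable names and use of a condition variable are illustrative, not from the lecture.

```python
import threading

# Minimal sketch (assumed example) of explicit coordination in a
# shared address space: one thread writes a flag and awakens another.
shared = {"A": 0, "flag": 0}
cv = threading.Condition()
seen = []

def producer():
    with cv:
        shared["A"] = 1          # write the shared data
        shared["flag"] = 1       # write the flag
        cv.notify()              # awaken the waiting thread

def consumer():
    with cv:
        while shared["flag"] == 0:
            cv.wait()            # sleep until awakened
        seen.append(shared["A"])

t = threading.Thread(target=consumer)
t.start()
producer()
t.join()
print(seen)                      # [1]
```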
Parallel Architecture Framework
• Programming Model
  – Multiprogramming
    • lots of independent jobs
    • no communication
  – Shared address space
    • communicate via memory
  – Message passing
    • send and receive messages
• Communication Abstraction
  – Shared address space
    • load, store, atomic swap
  – Message passing
    • send, receive library calls
  – Debate over this topic
    • ease of programming vs. scalability
Scalable Machines
• Design trade-offs for the machines
  – specialized vs. commodity nodes
  – capability of the node-to-network interface
  – supporting programming models
• Scalability
  – avoids inherent design limits on resources
  – bandwidth increases with increase in resources
  – latency does not increase
  – cost increases slowly with increase in resources
Bandwidth Scalability
• What fundamentally limits bandwidth
  – amount of wires
  – bus vs. network switch
[Figure: typical switch organizations — bus, multiplexers, crossbar — connecting processors (P) and memory modules (M) through switches (S)]
Dancehall Multiprocessor Organization
[Figure: dancehall organization — processors with caches (P, $) on one side of a scalable network of switches, memories (M) on the other side]
Generic Distributed System Organization
[Figure: generic distributed-memory organization — each node contains a processor with cache (P, $), memory (M), and a communication assist (Comm Assist) attached to a scalable network of switches]
Key Property of Distributed System
• Large number of independent communication paths between nodes
  – allows a large number of concurrent transactions using different wires
• Independent initialization
• No global arbitration
• Effect of a transaction is only visible to the nodes involved
  – effects propagated through additional transactions
Programming Models Realized by Protocols
[Figure: layered view — parallel applications (CAD, database, scientific modeling) map onto programming models (multiprogramming, shared address, message passing, data parallel); these are realized via a communication abstraction at the user/system boundary by compilation or a library plus operating systems support; below the hardware/software boundary sit the communication hardware and the physical communication medium, linked by network transactions]
Network Transaction
• Interpretation of the message
  – complexity of the message
• Processing in the Comm. Assist
  – processing power
[Figure: node architecture — each node's processor (P), memory (M), and communication assist (CA) attach to the scalable network; messages travel between communication assists]
• Output processing
  – checks, translation, formatting, scheduling
• Input processing
  – checks, translation, buffering, action
Shared Address Space Abstraction
• Fundamentally a two-way request/response protocol
  – writes have an acknowledgement
• Issues
  – fixed or variable length (bulk) transfers
  – remote virtual or physical address
  – deadlock avoidance and input buffer full
  – memory coherency and consistency
[Figure: read timeline from source to destination — source executes Load [Global address]:
(1) initiate memory access;
(2) address translation;
(3) local/remote check;
(4) request transaction (read request sent over the network);
(5) remote memory access at the destination;
(6) reply transaction (read response);
(7) complete memory access — the source waits on the read response]
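The seven steps of the remote-read protocol above can be sketched as plain functions. This is an assumed toy model: the two-node memory map, `node_of` translation rule, and tuple "transactions" are all hypothetical, standing in for the communication assist.

```python
# Toy sketch (assumed example) of the request/response read protocol:
# local/remote check, request transaction, remote access, reply.
MEMS = {0: {0x10: 42}, 1: {0x20: 7}}    # hypothetical per-node memories

def node_of(addr):
    return 0 if addr < 0x20 else 1      # trivial address translation

def remote_read(src_node, addr):
    dest = node_of(addr)                # (2)-(3) translation + local/remote check
    if dest == src_node:
        return MEMS[src_node][addr]     # local: no network transaction needed
    request = ("read", addr, src_node)  # (4) request transaction
    value = MEMS[dest][request[1]]      # (5) remote memory access
    reply = ("data", value)             # (6) reply transaction
    return reply[1]                     # (7) complete memory access

print(remote_read(0, 0x20))             # 7
```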
Shared Physical Address Space
[Figure: shared physical address space implementation — a load (ld R ← Addr) passes through the memory management unit; on a remote address, the source's pseudo-memory issues a read request (Dest, Read, Addr, Src, Tag) over the scalable network; output processing at the home node's pseudo-processor performs the memory access and sends the response (Data, Tag, Rrsp, Src); input processing at the source parses the reply and completes the read; each node contains P, $, MMU, Mem, pseudo-memory, pseudo-processor, and a memory assist]
Shared Address Abstraction
• Source and destination data addresses are specified by the source of the request
  – implies a degree of logical coupling and trust
• No storage logically "outside the address space"
  – though it may employ temporary buffers for transport
• Operations are fundamentally request/response
• Remote operations can be performed on remote memory
  – logically they do not require intervention of the remote processor
Message passing
• Bulk transfers
• Synchronous
  – send completes after the matching recv has started and the source data is sent
  – receive completes after the data transfer from the matching send is complete
• Asynchronous
  – send completes as soon as the send buffer may be reused
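The synchronous completion rule can be sketched with a thread and a queue. An assumed example, not a real message-passing library: `sync_send` refuses to complete until a matching receive is posted, whereas an asynchronous send would complete as soon as the buffer was copied.

```python
import queue
import threading

# Sketch (assumed example) of the synchronous completion rule.
chan = queue.Queue()
recv_posted = threading.Event()

def sync_send(data):
    recv_posted.wait()        # block until the matching recv is posted
    chan.put(list(data))      # then transfer the source data

def recv():
    recv_posted.set()         # post the receive ...
    return chan.get()         # ... and complete after the data transfer

t = threading.Thread(target=sync_send, args=([1, 2, 3],))
t.start()
data = recv()
t.join()
print(data)                   # [1, 2, 3]
```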
Synchronous Message Passing
• Constrained programming model
• Destination contention very limited
• User/system boundary
[Figure: synchronous send/recv timeline — source executes Send(Pdest, local VA, len); destination executes Recv(Psrc, local VA, len) with a tag check —
(1) initiate send;
(2) address translation on Psrc;
(3) local/remote check;
(4) send-ready request; source waits;
(5) remote check for posted receive (assume success);
(6) reply transaction (recv-ready reply);
(7) bulk data transfer (data-xfer request) from source VA to dest VA or ID]
Asynch Message Passing: Optimistic
• More powerful programming model
• Wildcard receive → non-deterministic
• Storage required within msg layer?
[Figure: optimistic (asynchronous) send timeline — source executes Send(Pdest, local VA, len); destination executes Recv(Psrc, local VA, len) —
(1) initiate send;
(2) address translation;
(3) local/remote check;
(4) send data (data-xfer request);
(5) remote check for posted receive via tag match; on fail, allocate a data buffer]
Active Messages
• User-level analog of a network transaction
  – transfer a data packet and invoke a handler to extract it from the network and integrate it with the on-going computation
• Request/Reply
• Event notification: interrupts, polling, events?
• May also perform memory-to-memory transfer
[Figure: a request invokes a handler at the destination; the reply invokes a handler back at the source]
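Handler dispatch on arrival can be sketched in a few lines. A minimal sketch of the idea only, with assumed names (`HANDLERS`, `deliver`): each packet carries the name of a user-level handler, and the receiver invokes it to pull the data out of the network.

```python
# Minimal sketch (assumed example) of active-message dispatch: a packet
# names its handler; arrival invokes that handler to integrate the data
# with the ongoing computation.
HANDLERS = {}

def handler(fn):
    HANDLERS[fn.__name__] = fn      # register a user-level handler
    return fn

results = []

@handler
def accumulate(data):               # handler extracting the packet
    results.append(sum(data))

def deliver(packet):                # the "network" side: dispatch on arrival
    name, payload = packet
    HANDLERS[name](payload)

deliver(("accumulate", [1, 2, 3]))
print(results)                      # [6]
```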
Message Passing Abstraction
• Source knows the send data address, destination knows the receive data address
  – after the handshake they both know both
• Arbitrary storage "outside the local address spaces"
  – may post many sends before any receives
  – non-blocking asynchronous sends reduce the requirement to an arbitrary number of descriptors
• Fundamentally a 3-phase transaction
  – includes a request / response
  – can use an optimistic 1-phase protocol in limited "safe" cases
Data Parallel
• Operations can be performed in parallel
  – on each element of a large regular data structure, such as an array
  – data parallel programming languages lay out data to processors
• Processing Element (PE)
  – 1 control processor broadcasts to many PEs
  – when computers were large, could amortize the control portion over many replicated PEs
  – condition flag per PE so that a PE can skip an operation
  – data distributed across the memories
• Early 1980s: VLSI → SIMD rebirth
  – 32 1-bit PEs + memory on a chip was the PE
Data Parallel
• Architecture development
  – vector processors have similar ISAs, but no data placement restriction
  – SIMD led to data parallel programming languages
  – Single Program Multiple Data (SPMD) model
  – all processors execute an identical program
• Advanced VLSI technology
  – single-chip FPUs
  – fast µProcs (SIMD less attractive)
Cache Coherent System
• Invoking the coherence protocol
  – the state of the line is maintained in the cache
  – the protocol is invoked if an "access fault" occurs on the line
• Actions to maintain coherence
  – look at the states of the block in other caches
  – locate the other copies
  – communicate with those copies
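The "access fault" idea above can be sketched with a tiny per-line state table. An assumed toy model (not the lecture's protocol): three states M/S/I, and the protocol is invoked only when the line's current state does not permit the requested access.

```python
# Toy sketch (assumed example): the protocol is invoked on an "access
# fault", i.e. when the line's state does not permit the access.
state = {"lineX": "I"}                # per-line state: M, S, or I

def access(line, op):
    s = state[line]
    if op == "read" and s == "I":     # read fault: fetch a shared copy
        state[line] = "S"
        return "miss"
    if op == "write" and s != "M":    # write fault: obtain exclusive copy
        state[line] = "M"             # (other copies would be invalidated)
        return "miss"
    return "hit"                      # state already permits the access

print(access("lineX", "read"))        # miss
print(access("lineX", "read"))        # hit
print(access("lineX", "write"))       # miss
```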
Scalable Cache Coherence
[Figure: scalable network of switches; each node has a communication assist (CA), processor (P), cache ($), and memory (M)]
Realizing programming models through net transaction protocols
- efficient node-to-net interface
- interprets transactions
Caches naturally replicate data
- coherence through bus snooping protocols
- consistency
Scalable networks
- many simultaneous transactions
Scalable distributed memory
Need cache coherence protocols that scale!
- no broadcast or single point of order
Bus-based Coherence
• All actions done as broadcast on the bus
  – faulting processor sends out a "search"
  – others respond to the search probe and take necessary action
• Could do it in a scalable network too
  – broadcast to all processors, and let them respond
• Conceptually simple, but doesn't scale with p
  – on a bus, bus bandwidth doesn't scale
  – on a scalable network, every fault leads to at least p network transactions
One Approach: Hierarchical Snooping
• Extend the snooping approach
  – hierarchy of broadcast media
  – processors are in bus- or ring-based multiprocessors at the leaves
  – parents and children connected by two-way snoopy interfaces
  – main memory may be centralized at the root or distributed among the leaves
• Actions handled similarly to a bus, but not full broadcast
  – faulting processor sends out a "search" bus transaction on its bus
  – it propagates up and down the hierarchy based on snoop results
• Problems
  – high latency: multiple levels, and a snoop/lookup at every level
  – bandwidth bottleneck at the root
Scalable Approach: Directories
• Directory
  – maintains the set of cached block copies
  – maintains memory block states
  – on a miss in own memory:
    • look up the directory entry
    • communicate only with the nodes holding copies
  – scalable networks
    • communication through network transactions
  – different ways to organize the directory
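A directory entry can be sketched as a small record per memory block. An assumed illustration (the dict layout and block names are hypothetical): the home node consults the entry and contacts only the nodes that hold copies, never all p processors.

```python
# Sketch (assumed example) of a per-block directory entry: the home
# node records the block's state and which nodes hold copies, so a
# miss communicates only with those nodes.
directory = {
    "blockA": {"state": "shared", "sharers": {1, 3}},
    "blockB": {"state": "dirty", "owner": 2},
}

def nodes_to_contact(block):
    entry = directory[block]
    if entry["state"] == "dirty":
        return {entry["owner"]}          # fetch from the single owner
    if entry["state"] == "shared":
        return set(entry["sharers"])     # e.g. invalidate on a write miss
    return set()                         # uncached: home memory suffices

print(nodes_to_contact("blockA"))        # {1, 3}
```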
Basic Directory Transactions
(a) Read miss to a block in dirty state:
  1. read request to the directory node for the block;
  2. reply with owner identity;
  3. read request to the node with the dirty copy;
  4a. data reply to the requestor;
  4b. revision message to the directory.
(b) Write miss to a block with two sharers:
  1. RdEx request to the directory;
  2. reply with sharers' identity;
  3a/3b. invalidation requests to the sharers;
  4a/4b. invalidation acks to the requestor.
[Figure: each node contains a processor (P), communication assist (C), and memory with directory (M/D)]
Example Directory Protocol (1st Read)
[Figure: P1 executes ld vA → rd pA; the read request (R/req) reaches the directory controller, whose entry for pA goes from U (uncached) to S with P1 recorded as a sharer; memory supplies the data (Read pA) and the reply (R/reply) installs pA in P1's cache in state S]
Example Directory Protocol (Read Share)
[Figure: P2 also executes ld vA → rd pA; its read request (R/req) reaches the directory, which stays in S and adds P2 to the sharer list (P1: pA, P2: pA); the reply installs pA in P2's cache in state S alongside P1's copy]
Example Directory Protocol (Wr to shared)
[Figure: P1 executes st vA → wr pA (W/req E); a read-to-update request for pA goes to the directory, which invalidates sharer P2's copy (RX/invalidate&reply, Invalidate pA); P2's line moves to I and returns an Inv ACK; the directory entry moves to D with P1 as owner, the data is returned (reply xD(pA)), and P1's line becomes exclusive (E) and is written]
A Popular Middle Ground
• Two-level "hierarchy"
  – coherence across nodes is directory-based
    • the directory keeps track of nodes, not individual processors
  – coherence within a node is snooping or directory
    • orthogonal, but needs a good interface of functionality
• Examples
  – Convex Exemplar: directory-directory
  – Sequent, Data General, HAL: directory-snoopy
Two-level Hierarchies
[Figure: four organizations, each pairing an intra-node and inter-node coherence scheme —
(a) snooping-snooping: leaf buses (B1) of processor-cache nodes with main memory, bridged by snooping adapters to a second-level bus (B2);
(b) snooping-directory: leaf buses bridged by network assists to a network;
(c) directory-directory: leaf networks (Network1) of P/C/A/M-D nodes bridged by directory adapters to a global network (Network2);
(d) directory-snooping: leaf networks bridged by dir/snoopy adapters to a bus (or ring)]
Memory Consistency
• Memory coherence
  – gives a consistent view of the memory
  – but does not ensure how consistent
  – nor in what order of execution
[Figure: P1 executes A = 1; flag = 1; P3 executes while (flag == 0); print A; with A: 0 and flag: 0 → 1 held in distributed memories across the interconnection network — (a) a congested path can delay the propagation of 1: A = 1 relative to 2: flag = 1, so (b) P3's 3: load A may see flag = 1 yet print the stale A = 0]
Memory Consistency
• Relaxed consistency
  – allows out-of-order completion
  – different read and write ordering models
  – increase in performance, but possible errors
• Current systems
  – relaxed models
  – expectation that programs are synchronized
  – use of standard synchronization libraries
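The last point can be sketched concretely. A hedged example, not the lecture's code: under a relaxed model the raw flag-spinning idiom from the earlier slide can print a stale A, so a synchronized program uses a library primitive (here Python's `Event`, standing in for any such library) that carries the needed ordering guarantee.

```python
import threading

# Sketch (assumed example): a standard synchronization primitive
# replaces the raw flag spin and provides the ordering the
# programmer expects, even on a relaxed memory model.
A = 0
ready = threading.Event()
printed = []

def writer():
    global A
    A = 1                 # write the data ...
    ready.set()           # ... then signal; set()/wait() imply ordering

def reader():
    ready.wait()          # replaces: while (flag == 0);
    printed.append(A)     # guaranteed to observe A = 1

t = threading.Thread(target=reader)
t.start()
writer()
t.join()
print(printed)            # [1]
```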