Lecture 11: Distributed Memory Multiprocessors

1 Dec 2013

Graduate Computer Architecture I


CSE/ESE 560M

Natural Extensions of Memory System

[Figure: three natural extensions, in order of increasing scale —
(a) Shared cache: P1…Pn connect through a switch to an interleaved first-level $ and interleaved main memory.
(b) Centralized memory ("dance hall", UMA): P1…Pn, each with a private $, connect through an interconnection network to shared memory.
(c) Distributed memory (NUMA): each node pairs a processor and $ with local memory, and nodes connect through an interconnection network.]



Fundamental Issues

1. Naming
2. Synchronization
3. Performance: Latency and Bandwidth



Fundamental Issue #1: Naming


- Naming
  - what data is shared
  - how it is addressed
  - what operations can access the data
  - how processes refer to each other
- Choice of naming affects
  - code produced by a compiler: a load can simply use an address, or must keep track of a processor number plus a local virtual address for message passing
  - replication of data: via loads in the cache memory hierarchy, or via software replication and consistency



Fundamental Issue #1: Naming


- Global physical address space
  - any processor can generate, address, and access it in a single operation
  - memory can be anywhere: virtual address translation handles it
- Global virtual address space
  - if the address space of each process can be configured to contain all shared data of the parallel program
- Segmented shared address space
  - locations are named <process number, address> uniformly for all processes of the parallel program



Fundamental Issue #2: Synchronization


- Message passing: coordination is implicit
  - in the transmission of data
  - in the arrival of data
- Shared address: must explicitly coordinate
  - write a flag
  - awaken a thread
  - interrupt a processor
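The explicit coordination above can be sketched with Python threads: one thread writes the data, then "writes a flag" (here a threading.Event) that awakens the waiting thread. The names are illustrative, not from the slides.

```python
import threading

shared = {}               # stands in for shared memory
flag = threading.Event()  # the coordination flag
result = []

def producer():
    shared["A"] = 42      # write the data first...
    flag.set()            # ...then write the flag to awaken the waiter

def consumer():
    flag.wait()           # block until the flag is written
    result.append(shared["A"])

threads = [threading.Thread(target=f) for f in (consumer, producer)]
for t in threads: t.start()
for t in threads: t.join()
```

After the joins, `result` holds the value the producer wrote, regardless of which thread was scheduled first.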



Parallel Architecture Framework


- Programming Model
  - Multiprogramming: lots of independent jobs, no communication
  - Shared address space: communicate via memory
  - Message passing: send and receive messages
- Communication Abstraction
  - Shared address space: load, store, atomic swap
  - Message passing: send, receive library calls
- Debate over this topic
  - ease of programming vs. scalability
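The two communication abstractions can be contrasted in a small sketch, with Python threads standing in for processors and a queue standing in for the send/receive library calls; everything here is illustrative.

```python
import queue
import threading

# Shared address space: communicate via ordinary loads and stores
shared = [0]

def store():
    shared[0] = 7         # a plain store into common memory

w = threading.Thread(target=store)
w.start(); w.join()

# Message passing: communicate via explicit send and receive
chan = queue.Queue()

def sender():
    chan.put(7)           # "send"

s = threading.Thread(target=sender)
s.start(); s.join()
received = chan.get()     # "receive"
```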



Scalable Machines


- Design trade-offs for the machines
  - specialized vs. commodity nodes
  - capability of the node-to-network interface
  - supporting programming models
- Scalability
  - avoids inherent design limits on resources
  - bandwidth increases as resources are added
  - latency does not increase
  - cost increases slowly as resources are added



Bandwidth Scalability


- The amount of wiring fundamentally limits bandwidth
- Bus vs. network switch

[Figure: P/M node pairs connected by typical switches, a bus, multiplexers, or a crossbar.]



Dancehall Multiprocessor Organization











[Figure: dancehall organization — P+$ nodes on one side of a multistage scalable network of switches, memory modules M on the other side.]



Generic Distributed System Organization











[Figure: generic distributed-memory node — processor P with cache $, memory M, and a communication assist (Comm Assist) attached to a scalable network of switches.]



Key Property of Distributed System


- Large number of independent communication paths between nodes
  - allows a large number of concurrent transactions using different wires
- Transactions are initiated independently
  - no global arbitration
- Effect of a transaction is visible only to the nodes involved
  - effects are propagated through additional transactions



Programming Models Realized by Protocols

[Figure: layered view — parallel applications (CAD, database, scientific modeling) sit on programming models (multiprogramming, shared address, message passing, data parallel); below is the communication abstraction at the user/system boundary (compilation or library, operating systems support); below that, communication hardware at the hardware/software boundary, and the physical communication medium. The layers are realized by network transactions.]



Network Transaction


- Interpretation of the message
  - complexity of the message
- Processing in the communication assist
  - processing power

[Figure: node architecture — nodes (P, M, communication assist) on a scalable network. Output processing of a message: checks, translation, formatting, scheduling. Input processing: checks, translation, buffering, action.]



Shared Address Space Abstraction


- Fundamentally a two-way request/response protocol
  - writes have an acknowledgement
- Issues
  - fixed or variable length (bulk) transfers
  - remote virtual or physical address
  - deadlock avoidance and input buffer full
  - memory coherency and consistency

[Figure: timeline of a remote load, source vs. destination —
Source: Load [global address]: (1) initiate memory access, (2) address translation, (3) local/remote check, (4) request transaction (read request); then wait for the read response.
Destination: (5) remote memory access (read request → memory access), (6) reply transaction (read response); the source then (7) completes the memory access.]
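The seven steps can be sketched as a plain function; the page-to-home-node mapping and the memory contents here are illustrative assumptions, not from the slides.

```python
HOME = {0: "node0", 1: "node1"}                 # page -> home node
MEM = {"node0": {0x10: 5}, "node1": {0x20: 9}}  # per-node memories

def remote_load(global_addr):
    # (1) initiate memory access, (2) address translation
    page, offset = global_addr >> 8, global_addr & 0xFF
    # (3) local/remote check: find the home node of the page
    node = HOME[page]
    # (4)-(6) request transaction, remote memory access, reply transaction
    value = MEM[node][offset]
    # (7) complete the memory access
    return value
```

A load such as `remote_load(0x120)` is serviced by node1's memory without involving node1's processor, which is the point of the shared-address abstraction.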



Shared Physical Address Space

[Figure: implementation of a remote read in a shared physical address space. Source node: P, $, and memory management unit issue the load; the communication assist acts as a pseudo-memory, doing output processing (memory access, response) and sending a request packet (dest, read, addr, src, tag) over the scalable network. Destination node: the assist acts as a pseudo-processor, doing input processing (parse, complete read) against its memory and returning a response packet (data, tag, rrsp, src).]


Shared Address Abstraction


- Source and destination data addresses are specified by the source of the request
  - a degree of logical coupling and trust
- No storage logically "outside the address space"
  - may employ temporary buffers for transport
- Operations are fundamentally request/response
- Remote operations can be performed on remote memory
  - logically, they do not require intervention of the remote processor



Message passing


- Bulk transfers
- Synchronous
  - send completes after the matching recv is posted and the source data is sent
  - receive completes after the data transfer from the matching send is complete
- Asynchronous
  - send completes as soon as the send buffer may be reused



Synchronous Message Passing


- Constrained programming model
- Destination contention is very limited
- User/system boundary

[Figure: timeline of a synchronous message pass, source vs. destination —
Source: Send(Pdest, local VA, len): (1) initiate send, (2) address translation, (3) local/remote check, (4) send-ready request; then wait.
Destination: Recv(Psrc, local VA, len); tag check on the send-ready request, (5) remote check for a posted receive (assume success), (6) recv-ready reply transaction, (7) bulk data transfer (source VA → dest VA or ID) via a data-xfer request.]
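The rendezvous above can be sketched with two events: send blocks until the matching receive is posted, and the receive completes only after the transfer. The channel class is illustrative, not from the slides.

```python
import threading

class SyncChannel:
    """Sketch of a synchronous (rendezvous) channel."""
    def __init__(self):
        self.recv_posted = threading.Event()
        self.data_ready = threading.Event()
        self.slot = None

    def send(self, data):
        self.recv_posted.wait()   # send-ready request: wait for the recv
        self.slot = data          # bulk data transfer
        self.data_ready.set()     # send completes after the matching recv

    def recv(self):
        self.recv_posted.set()    # post the receive
        self.data_ready.wait()    # recv completes after the transfer
        return self.slot

ch = SyncChannel()
out = []
r = threading.Thread(target=lambda: out.append(ch.recv()))
s = threading.Thread(target=lambda: ch.send("payload"))
r.start(); s.start(); r.join(); s.join()
```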



Asynch Message Passing: Optimistic


- More powerful programming model
- Wildcard receive
  - non-deterministic
- Storage required within the message layer?

[Figure: timeline of an optimistic asynchronous send —
Source: Send(Pdest, local VA, len): (1) initiate send, (2) address translation, (3) local/remote check, (4) send data (data-xfer request).
Destination: tag match against Recv(Psrc, local VA, len); (5) remote check for a posted receive — on fail, allocate a data buffer.]
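The optimistic variant can be sketched without a handshake: the sender pushes data immediately, and the message layer buffers it if no receive has been posted yet. The structure is illustrative.

```python
from collections import deque

class OptimisticChannel:
    def __init__(self):
        self.buffered = deque()   # data that arrived before a matching recv

    def send(self, data):
        # send "completes" at once; on arrival with no posted receive,
        # the layer allocates a buffer for the data
        self.buffered.append(data)

    def recv(self):
        # take whatever arrived first (non-deterministic across senders
        # in a real system); None if nothing has arrived yet
        return self.buffered.popleft() if self.buffered else None

ch = OptimisticChannel()
ch.send("early")          # sender runs before any receive is posted
msg = ch.recv()           # the buffered message is drained later
```

This makes the slide's question concrete: the storage for `buffered` lives inside the message layer itself.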


Active Messages


- User-level analog of a network transaction
  - transfer a data packet and invoke a handler to extract it from the network and integrate it with the ongoing computation
- Request/Reply
- Event notification: interrupts, polling, events?
- May also perform memory-to-memory transfer

[Figure: a request invokes a handler at the destination; the reply invokes a handler back at the source.]
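An active message can be sketched as a packet that names a user-level handler, invoked at the destination to integrate the data with the ongoing computation. The handler table and names are illustrative.

```python
partial_sums = {"node1": 0}        # the destination's ongoing computation

def add_handler(node, value):      # user-level handler, runs on arrival
    partial_sums[node] += value

HANDLERS = {"add": add_handler}

def deliver(node, handler_name, payload):
    # the network transaction carries (handler, data); on arrival the
    # handler is invoked instead of the data merely being stored
    HANDLERS[handler_name](node, payload)

deliver("node1", "add", 5)
deliver("node1", "add", 7)
```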



Message Passing Abstraction


- Source knows the send data address, destination knows the receive data address
  - after the handshake, both know both
- Arbitrary storage "outside the local address spaces"
  - may post many sends before any receives
  - non-blocking asynchronous sends reduce the requirement to an arbitrary number of descriptors
- Fundamentally a 3-phase transaction
  - includes a request/response
  - can use an optimistic 1-phase protocol in limited "safe" cases



Data Parallel


- Operations can be performed in parallel on each element of a large regular data structure, such as an array
- Data parallel programming languages lay out data to processors
- Processing Element (PE)
  - 1 control processor broadcasts to many PEs
  - when computers were large, could amortize the control portion over many replicated PEs
  - condition flag per PE so that a PE can skip operations
  - data distributed across the memories
- Early 1980s VLSI: SIMD rebirth
  - 32 1-bit PEs + memory on a chip was the PE



Data Parallel


- Architecture development
  - vector processors have similar ISAs, but no data placement restriction
- SIMD led to data parallel programming languages
- Single Program Multiple Data (SPMD) model
  - all processors execute an identical program
- Advanced VLSI technology
  - single-chip FPUs
  - fast µProcs (SIMD less attractive)
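SPMD can be sketched by running the identical program once per "processor", with the rank selecting each one's slice of the data; the ranks and slicing are illustrative.

```python
DATA = list(range(8))
NPROCS = 4

def program(rank):
    """The single program every processor executes."""
    lo = rank * len(DATA) // NPROCS
    hi = (rank + 1) * len(DATA) // NPROCS
    return sum(x * x for x in DATA[lo:hi])  # local work on the local slice

partials = [program(r) for r in range(NPROCS)]  # same code, different data
total = sum(partials)
```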




Cache Coherent System


- Invoking the coherence protocol
  - state of the line is maintained in the cache
  - protocol is invoked if an "access fault" occurs on the line
- Actions to maintain coherence
  - look at the states of the block in other caches
  - locate the other copies
  - communicate with those copies



Scalable Cache Coherence






[Figure: node with communication assist (CA), P, $, and M on a scalable network of switches.]

- Realizing programming models through network transaction protocols
  - efficient node-to-network interface
  - interprets transactions
- Caches naturally replicate data
  - coherence through bus snooping protocols
  - consistency
- Scalable networks
  - many simultaneous transactions
- Scalable, distributed memory
- Need cache coherence protocols that scale!
  - no broadcast or single point of order



Bus-based Coherence


- All actions are done as broadcasts on the bus
  - the faulting processor sends out a "search"
  - others respond to the search probe and take the necessary action
- Could do it on a scalable network too
  - broadcast to all processors, and let them respond
- Conceptually simple, but doesn't scale with p
  - on a bus, bus bandwidth doesn't scale
  - on a scalable network, every fault leads to at least p network transactions



One Approach: Hierarchical Snooping


- Extend the snooping approach
  - hierarchy of broadcast media
  - processors are in the bus- or ring-based multiprocessors at the leaves
  - parents and children connected by two-way snoopy interfaces
  - main memory may be centralized at the root or distributed among the leaves
- Actions handled similarly to a bus, but not a full broadcast
  - faulting processor sends out a "search" bus transaction on its bus
  - propagates up and down the hierarchy based on snoop results
- Problems
  - high latency: multiple levels, and a snoop/lookup at every level
  - bandwidth bottleneck at the root



Scalable Approach: Directories


- Directory
  - maintains the set of cached copies of each block
  - maintains memory block states
- On a miss in own memory
  - look up the directory entry
  - communicate only with the nodes that hold copies
- Scalable networks
  - communication through network transactions
- Different ways to organize the directory
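A minimal directory sketch: the home node keeps, per block, a state and a sharer set, and a miss talks only to the nodes recorded there. States and fields are illustrative, not from the slides.

```python
directory = {"X": ("uncached", set())}    # block -> (state, sharers)

def read_miss(node, block):
    state, sharers = directory[block]
    # a dirty block would first be fetched back from its owner
    directory[block] = ("shared", sharers | {node})

def write_miss(node, block):
    state, sharers = directory[block]
    invalidated = sharers - {node}        # only listed copies are contacted
    directory[block] = ("dirty", {node})  # requester becomes sole owner
    return invalidated

read_miss("P1", "X")
read_miss("P2", "X")
invalidated = write_miss("P1", "X")
```

No broadcast is needed: the write invalidates exactly the sharers the directory lists.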



Basic Directory Transactions

[Figure: basic directory transactions among requestor, directory node for the block, and sharer/owner nodes.
(a) Read miss to a block in dirty state: 1. read request to directory; 2. reply with owner identity; 3. read request to owner; 4a. data reply to requestor; 4b. revision message to directory.
(b) Write miss to a block with two sharers: 1. RdEx request to directory; 2. reply with sharers' identity; 3a/3b. invalidation requests to the sharers; 4a/4b. invalidation acks.]



Example Directory Protocol (1st Read)

[Figure: P1 executes ld vA → rd pA. The read request reaches the directory controller; the memory block pA moves from uncached (U) to shared (S), the directory records P1: pA, and P1's cache line enters S.]



Example Directory Protocol (Read Share)

[Figure: P2 also executes ld vA → rd pA. The directory adds P2: pA, the block stays shared (S), and both P1 and P2 hold the line in S.]



Example Directory Protocol (Write to Shared)

[Figure: P1 executes st vA → wr pA. A read-to-update request for pA goes to the directory; the directory invalidates P2's copy (invalidate pA, inv ack), replies with the data, and marks the block dirty (D) with P1 as owner; P1's line goes from S to exclusive (E), P2's from S to I.]



A Popular Middle Ground


- Two-level "hierarchy"
- Coherence across nodes is directory-based
  - the directory keeps track of nodes, not individual processors
- Coherence within a node is snooping or directory
  - orthogonal, but needs a good interface of functionality
- Examples
  - Convex Exemplar: directory-directory
  - Sequent, Data General, HAL: directory-snoopy



Two-level Hierarchies

[Figure: four combinations —
(a) snooping-snooping: snooping adapters join local buses B1 to a global bus B2;
(b) snooping-directory: B1-based nodes with assists on a network;
(c) directory-directory: directory (P, C, A, M/D) nodes on Network1, joined by directory adapters over Network2;
(d) directory-snooping: directory nodes on Network1, joined by dir/snoopy adapters to a bus (or ring).]



Memory Consistency


- Memory coherence
  - consistent view of the memory
  - does not ensure how consistent
  - nor in what order of execution

[Figure: P1, P2, P3 and their memories on an interconnection network. P1 executes "A = 1; flag = 1;" while P3 executes "while (flag == 0); print A;" (initially A: 0, flag: 0 → 1). The operations travel as separate transactions — 1: A = 1, 2: flag = 1, 3: load A — and a delay on a congested path can deliver flag = 1 before A = 1, so P3 may print the stale A.]



Memory Consistency


- Relaxed consistency
  - allows out-of-order completion
  - different read and write ordering models
  - increases performance, but errors become possible
- Current systems
  - use relaxed models
  - expect properly synchronized programs
  - rely on standard synchronization libraries