Memory Consistency - LRR


SMP Systems

Taxonomy of MIMD shared-memory systems: UMA, NUMA (ccNUMA, nccNUMA), COMA

Uniform Memory Access (UMA) Architectures


UMA or Symmetric Multiprocessors (SMP) are shared memory systems with the following characteristics:


global physical address space


symmetric access to all main memory from any processor, i.e., the same latency for every access


SMPs dominate the server market and are becoming more common on the desktop.


Throughput engines for sequential jobs with varying memory
and CPU requirements.


Shared address space makes SMPs attractive for parallel programming. Efficient automatic parallelizers are available for those systems.


They are important building blocks for larger-scale systems.


Design of a Bus-based SMP

[Figure: processors P0 ... Pn, each with a cache ($), connected by a shared bus to memory and an IO device.]
Memory Semantics of a Sequential Computer


One program


A read should return the last value written to that location.


Operations are executed in program order.


Multiple programs


Time sharing


Same condition for read operations


Operations are executed in some order which respects the
individual program order of the programs.



The hardware ensures that these semantics are enforced, taking into account:


write buffers


caches


...

Cache Coherence Problem


Replicas in the caches of multiple processors in an
SMP have to be updated or kept coherent.

[Figure: P1, P2 and P3, each with a private cache, share a memory holding u = 5. Events: (1) P1 reads u, (2) P3 reads u, (3) P3 writes u = 7, (4) P1 reads u, (5) P2 reads u.]

Observations


Write-through cache:


P1 reads the old value from its own cache, P2 reads the new value.


Write-back cache:


P1 and P2 read the old value from the cache and from memory, respectively.


With multiple new values in the caches of different processors, the final value in memory is determined by the order in which the caches are flushed rather than the order in which the writes occurred.


Cache Coherence Problem in Sequential Computers


The problem occurs in the context of IO operations.


The DMA engine accesses data in memory independently of the processor cache.


Coarse solutions have been developed, based on the fact that IO operations occur much less frequently than memory operations:


Mark address ranges as uncacheable


Flush the cache before starting IO operations


Route IO operations through the cache hierarchy




Definition of a Coherent Memory System


A multiprocessor memory system is coherent if the result of any execution of a program is such that, for each location, it is possible to construct a hypothetical total order of all memory accesses that


is consistent with the result of the execution


and in which

1. operations by any particular process occur in the order they were issued, and

2. the value returned by each read operation is the value written by the last write to that location in the total order.


Cache Coherence in a Bus-based SMP


Key properties:


Cache controllers can snoop on the bus.


All bus transactions are visible to all cache controllers.


All controllers see the transactions in the same order.


A controller can take action if a bus transaction is relevant, i.e., it involves a memory block held in its cache.


Coherence is maintained at the granularity of a cache block.


[Figure: bus-based SMP as above: processors P0 ... Pn with caches ($) attached to a shared bus together with memory and an IO device.]
Protocols


Invalidation protocols


invalidate replicas if a processor writes a location.


Update protocols


update replicas with the written value.


Based on


Bus transactions with three phases


Bus arbitration


Command and address transmission


Data transfer


State transitions for a cache block


State information (e.g. invalid, valid, dirty) is available for blocks
in a cache.


State information for uncached blocks is implicitly defined (e.g.
invalid or not present).

Definition of a Snooping Protocol


A snooping protocol is a distributed algorithm
represented by a collection of cooperating finite state
machines. It is specified by the following components:


the set of states associated with memory blocks in the local
caches


the state transition diagram with the following input symbols


Processor requests


Bus transactions


actions associated with each state transition


The different state machines are coordinated by bus
transactions.


Protocol for Write-Through, Write No-Allocate Caches

States: valid (V) and invalid (I). Transitions (observed event / bus transaction issued, -- meaning none):

I → V on PrRd/BusRd

V → V on PrRd/--

V → V and I → I on PrWr/BusWr (write-through, no allocation on a write miss)

V → I on an observed BusWr/--

The PrRd and PrWr transitions are processor-initiated; the BusWr transition is bus-snooper-initiated.
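The two-state controller above can be sketched as a small C state machine; the state and event names mirror the diagram, while the function structure and the printed bus actions are only illustrative.

```c
#include <stdio.h>

/* Cache-block states of the write-through, write-no-allocate protocol. */
typedef enum { INVALID, VALID } BlockState;

/* Inputs: processor requests and observed (snooped) bus transactions.  */
typedef enum { PR_RD, PR_WR, BUS_WR } Event;

/* One step of the per-block controller: returns the next state and
   prints the bus transaction (if any) the controller issues.           */
static BlockState step(BlockState s, Event e) {
    switch (e) {
    case PR_RD:
        if (s == INVALID) { puts("issue BusRd"); return VALID; }
        return VALID;                 /* read hit: no bus transaction    */
    case PR_WR:
        puts("issue BusWr");          /* write-through: always on the bus */
        return s;                     /* no allocation on a write miss   */
    case BUS_WR:
        return INVALID;               /* another cache wrote: invalidate */
    }
    return s;
}

int main(void) {
    BlockState s = INVALID;
    s = step(s, PR_RD);   /* miss -> BusRd, block becomes VALID          */
    s = step(s, BUS_WR);  /* remote write observed -> INVALID            */
    s = step(s, PR_WR);   /* write goes to the bus, state stays INVALID  */
    printf("final state: %s\n", s == VALID ? "valid" : "invalid");
    return 0;
}
```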
Protocol Ensures Coherence


Determine a total order that gives the same result and in which:


Operations
by any particular process occur in the order they were
issued.


The
value returned by each read operation is the value written by
the last write to that location in the total order.


[Figure: the reads (R) and writes (W) issued by P0, P1 and P2 are interleaved by the bus into a single total order that respects each processor's program order.]
Drawback of Write-Through Caches


Every store instruction goes to memory.


Example:


A processor running at 2 GHz executes one instruction per cycle.


15% of all instructions are stores of 8 bytes of data.


The bus has a bandwidth of 3.2 GB/s (400 MHz, 8 bytes wide).



300 million stores per second require 2.4 GB/s of bandwidth.



A single processor nearly saturates the bus, even before read misses, address traffic, ... are counted.
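The arithmetic of the example can be checked with a few lines of C; the figures are the ones assumed above, nothing here is measured.

```c
#include <stdio.h>

int main(void) {
    double insns_per_sec  = 2e9;     /* 2 GHz, one instruction per cycle */
    double store_fraction = 0.15;    /* 15% of instructions are stores   */
    double bytes_per_store = 8.0;    /* 8-byte stores                    */
    double bus_bw = 3.2e9;           /* 400 MHz x 8-byte bus             */

    double stores_per_sec = insns_per_sec * store_fraction;   /* 3.0e8   */
    double store_bw = stores_per_sec * bytes_per_store;       /* 2.4 GB/s */

    printf("store traffic: %.1f GB/s (%.0f%% of the bus)\n",
           store_bw / 1e9, 100.0 * store_bw / bus_bw);
    return 0;
}
```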


A write-back cache reduces the load that writes place on the bus.


Writes are then not seen on the bus, and therefore more sophisticated protocols are required.

Requirements for Memory Consistency (1/2)


Memory coherence defines only properties for
accesses to a single location.


Programs need, in addition, guaranteed properties for
accesses to multiple locations.

Assume the initial value of A and flag is 0.

P0:             P1:
A = 1;          while (flag == 0);
flag = 1;       print A;

Assume the initial value of A is 0.

P0:             P1:
A = 1;          barrier(b1);
barrier(b1);    print A;
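The flag example only works if P1 cannot observe flag == 1 without also observing A == 1. A minimal sketch of the same pattern with C11 atomics and POSIX threads (variable and function names are illustrative); release/acquire ordering on the flag is one way to obtain this guarantee.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int A = 0;                       /* ordinary data                  */
static atomic_int flag = 0;             /* synchronization flag           */

static void *p0(void *arg) {            /* corresponds to P0 above        */
    A = 1;
    atomic_store_explicit(&flag, 1, memory_order_release);
    return NULL;
}

static void *p1(void *arg) {            /* corresponds to P1 above        */
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                               /* spin until the flag is set     */
    printf("A = %d\n", A);              /* guaranteed to print 1          */
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, NULL, p0, NULL);
    pthread_create(&t1, NULL, p1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    return 0;
}
```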

Requirements for Memory Consistency (2/2)


The programmer expects atomic execution of write
operations.


If a new value is assigned to the register in P2, this value should be 1.

Assume the initial value of a and b is 0.

P0:             P1:                 P2:
a = 1           if (a == 1)         if (b == 1)
                    b = 1               reg = a
                fi                  fi
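Expressed with C11 sequentially consistent atomics (a hypothetical test harness, names are illustrative), the expectation holds: if P2 observes b == 1, its read of a can only return 1.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int a = 0, b = 0;

static void *p0(void *x) {                       /* a = 1                 */
    atomic_store(&a, 1);
    return NULL;
}
static void *p1(void *x) {                       /* if (a == 1) b = 1     */
    if (atomic_load(&a) == 1)
        atomic_store(&b, 1);
    return NULL;
}
static void *p2(void *x) {                       /* if (b == 1) reg = a   */
    if (atomic_load(&b) == 1)
        printf("reg = %d\n", atomic_load(&a));   /* if printed, always 1  */
    return NULL;
}

int main(void) {
    pthread_t t[3];
    pthread_create(&t[0], NULL, p0, NULL);
    pthread_create(&t[1], NULL, p1, NULL);
    pthread_create(&t[2], NULL, p2, NULL);
    for (int i = 0; i < 3; i++) pthread_join(t[i], NULL);
    return 0;
}
```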

Memory Consistency Model


It specifies constraints on the order in which memory
operations become visible to the other processors.


It includes operations to the same location and to
different locations.


Therefore, it subsumes coherence.


The software and hardware have to agree on the
rules, i.e., it can be seen as a contract.

Memory Consistency

Assume the initial value of a, b, c is 0 and d is 5.

P0:             P1:                 P2:
a = 1           if (a == 1)         if (b == 1)
                    b = 1               c = 3
                fi                      d = a
                                    fi

Results for (c, d):

c = 3, d = 1 : Ok
c = 0, d = 5 : Ok
c = 3, d = 0 : No

Strict Consistency


A read returns the value of the most recent write.


Easy model for programmers but hinders many
optimizations.


Assume the initial value of a, b, c is 0 and d is 5.

P0:             P1:                 P2:
a = 1           if (a == 1)         if (b == 1)
                    b = 1               c = 3
                fi                      d = a
                                    fi

Results for (c, d):

c = 3, d = 1 : Ok
c = 0, d = 5 : No
c = 3, d = 0 : No

Sequential Consistency


Definition (Lamport 1979):


A multiprocessor is sequentially consistent if the result of any
execution is the same as if


the operations of all the processors were executed in some
sequential order,


and the operations of each individual processor occur in this
sequence in the order specified by its program.


Two constraints:


program order and atomicity of memory operations.

[Figure: the programmer's abstraction of sequential consistency: processors P0, P1, ..., Pn take turns issuing memory operations to a single shared memory.]
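A classic way to see what these two constraints forbid is the "store buffering" test: under sequential consistency the outcome r0 == r1 == 0 is impossible, because some interleaving must place one of the writes first. A small C11 sketch (harness and names are illustrative) that preserves this guarantee by using the default seq_cst atomics:

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int x = 0, y = 0;
static int r0, r1;

static void *t0(void *arg) {        /* program order: x = 1; r0 = y      */
    atomic_store(&x, 1);
    r0 = atomic_load(&y);
    return NULL;
}
static void *t1(void *arg) {        /* program order: y = 1; r1 = x      */
    atomic_store(&y, 1);
    r1 = atomic_load(&x);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t0, NULL);
    pthread_create(&b, NULL, t1, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    /* With seq_cst operations (the default) r0 == 0 && r1 == 0 cannot
       occur; with relaxed atomics or plain stores it can.               */
    printf("r0 = %d, r1 = %d\n", r0, r1);
    return 0;
}
```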

Sequential Consistency

Program (x is a shared location):

P1: W(x)100        P2: W(x)200        P3: R(x); R(x)        P4: R(x); R(x)

Three results that are sequentially consistent:

W1(x)100, W2(x)200, R3(x)200, R3(x)200, R4(x)200, R4(x)200

W1(x)100, R3(x)100, W2(x)200, R4(x)200, R3(x)200, R4(x)200

W2(x)200, R4(x)200, W1(x)100, R3(x)100, R4(x)100, R3(x)100


Impossible:


P3 gets (100, 200) and


P4 gets (200, 100)

Sequential Consistency

Assume the initial value of a, b, c is 0 and d is 5.

P0:             P1:                 P2:
a = 1           if (a == 1)         if (b == 1)
                    b = 1               c = 3
                fi                      d = a
                                    fi

Results for (c, d):

c = 3, d = 1 : Ok
c = 0, d = 5 : Ok
c = 3, d = 0 : No

Processor Consistency


Goodman, 1989


Rules


Writes by a CPU are seen by all CPUs in the order they were
issued.


For every memory word, all CPUs see all writes to it in the
same order.


Assume the initial value of a, b, c is 0 and d is 5.

P0:             P1:                 P2:
a = 1           if (a == 1)         if (b == 1)
                    b = 1               c = 3
                fi                      d = a
                                    fi

Results for (c, d):

c = 3, d = 1 : Ok
c = 0, d = 5 : Ok
c = 3, d = 0 : Ok

Processor Consistency

Assume the initial value of a, b, c is 0 and d is 5.

P0:             P1:                 P2:
a = 1           if (a == 1)         if (b == 1)
                    a = 2               c = 3
                    b = 1               d = a
                fi                  fi

Results for (c, d):

c = 3, d = 2 : Ok
c = 0, d = 5 : Ok
c = 3, d = 1 : No

Weak Consistency


Dubois et al., 1986


Rules


It does not guarantee that writes from a single CPU are seen in order.


A synchronization operation completes all pending memory operations


and holds all new ones until the synchronization is done.


Some order of the synchronization operations is chosen and seen by all CPUs.
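In C11 terms, the sync operation corresponds roughly to a fence: ordinary (relaxed) accesses may be reordered freely between syncs, and the fences order the epochs. A hedged sketch of the idea (the pthreads harness and names are illustrative, not part of the model's definition):

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int a = 0, flag = 0;

/* Writer: the store to a is relaxed; the fence ("sync") completes it    */
/* before anything after the fence can be observed.                      */
static void *writer(void *arg) {
    atomic_store_explicit(&a, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);             /* "sync"     */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
    return NULL;
}

/* Reader: after seeing flag == 1, the matching fence ("sync") makes     */
/* everything before the writer's fence visible, so a must be 1.         */
static void *reader(void *arg) {
    while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
        ;
    atomic_thread_fence(memory_order_acquire);             /* "sync"     */
    printf("a = %d\n", atomic_load_explicit(&a, memory_order_relaxed));
    return NULL;
}

int main(void) {
    pthread_t w, r;
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}
```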

Weak Consistency

Assume the initial value of a, b, c is 0 and d is 5.

P0:             P1:                 P2:
a = 1           if (a == 1)         if (b == 1)
                    a = 2               c = 3
                    b = 1               d = a
                fi                  fi

Results for (c, d):

c = 3, d = 2 : Ok
c = 0, d = 5 : Ok
c = 3, d = 1 : Ok

Weak Consistency with Sync

Assume the initial value of a, b, c is 0 and d is 5.

P0:             P1:                 P2:
a = 1           if (a == 1)         sync
sync                a = 2           if (b == 1)
                    b = 1               c = 3
                fi                      d = a
                sync                fi

Results for (c, d):

c = 3, d = 2 : Ok
c = 0, d = 5 : Ok
c = 3, d = 1 : No

The syncs split the execution into a sequence of epochs: epoch 1 contains the writes of P0 and P1, epoch 2 the reads of P2.

Basic MSI Writeback Invalidation Protocol


States


Invalid (I)


Shared (S): one or more copies


Dirty or Modified (M): exactly one copy


Processor events:


PrRd (read)


PrWr (write)


Bus transactions


BusRd: asks for a copy with no intent to modify


BusRdX: asks for a copy with intent to modify


BusWB: updates memory


Actions


Update state, perform bus transaction, flush value onto bus

State Transition Diagram


Write to a shared block:


The cache already has the latest data; an upgrade (BusUpgr) can be used instead of BusRdX.


Transitions (observed event / action, -- meaning none):

I: PrRd/BusRd → S; PrWr/BusRdX → M

S: PrRd/--; PrWr/BusRdX → M; BusRd/--; BusRdX/-- → I

M: PrRd/--; PrWr/--; BusRd/Flush → S; BusRdX/Flush → I
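The transition table can be sketched as a per-block state machine in C; the states, events and actions follow the diagram, while the function layout and printed actions are only illustrative.

```c
#include <stdio.h>

/* MSI block states and the inputs from the diagram above.               */
typedef enum { I, S, M } MsiState;
typedef enum { PR_RD, PR_WR, BUS_RD, BUS_RDX } MsiEvent;

/* One step of the per-block controller: returns the next state and      */
/* reports the action (bus transaction or flush) it would take.          */
static MsiState msi_step(MsiState st, MsiEvent ev) {
    switch (st) {
    case I:
        if (ev == PR_RD)  { puts("BusRd");  return S; }
        if (ev == PR_WR)  { puts("BusRdX"); return M; }
        return I;                               /* bus events: ignore     */
    case S:
        if (ev == PR_WR)   { puts("BusRdX"); return M; }
        if (ev == BUS_RDX) return I;            /* another cache writes   */
        return S;                               /* PrRd, BusRd: no action */
    case M:
        if (ev == BUS_RD)  { puts("Flush"); return S; }
        if (ev == BUS_RDX) { puts("Flush"); return I; }
        return M;                               /* PrRd, PrWr: hit        */
    }
    return st;
}

int main(void) {
    MsiState st = I;
    st = msi_step(st, PR_RD);    /* I -> S via BusRd                      */
    st = msi_step(st, PR_WR);    /* S -> M via BusRdX                     */
    st = msi_step(st, BUS_RD);   /* M -> S, flush the block to memory     */
    st = msi_step(st, BUS_RDX);  /* S -> I, another writer invalidates    */
    printf("final state: %d (0=I, 1=S, 2=M)\n", st);
    return 0;
}
```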
MESI (4-state) Invalidation Protocol


Problem with the MSI protocol


Reading and then modifying data takes two bus transactions, even if no one else is sharing the block


e.g., even in a sequential program


BusRd (I → S) followed by BusRdX or BusUpgr (S → M)


Add an exclusive state


Allows a write without a bus transaction


Not yet modified

MESI Protocol


States


invalid


exclusive or exclusive-clean (only this cache has a copy, and it is not modified)


shared (two or more caches may have copies)


modified (dirty)



I → E on PrRd if no other cache has a copy


requires a "shared" signal on the bus: a wired-OR line asserted in response to BusRd


The MESI protocol implements sequential consistency.

Classification of Cache Misses


Compulsory (Cold) misses


occur on the first reference to a memory block by a processor.


Capacity misses


occur when not all of the blocks that are referenced by a
processor fit in the cache, so some are replaced and later
accessed again.

Classification of Cache Misses


Conflict (Collision) misses


occur in a cache with less than full associativity when the
collection of blocks referenced by a program that maps to a
single cache set does not fit in the set.



Coherence misses


occur when blocks of data are invalidated due to the
coherence protocol.


True sharing
misses occur when a data word produced by one
processor is used (read or written) by another.


False sharing
misses occur when independent data words
accessed by different processors happen to be placed in the
same cache block, and at least one of the accesses is a write.
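A minimal C sketch of false sharing (the 64-byte block size, iteration count and padding trick are assumptions): two logically independent counters that land in the same cache block ping-pong between the two caches on every write, while the padding places them in separate blocks and removes the coherence misses.

```c
#include <pthread.h>
#include <stdio.h>

#define ITERS 100000000L

/* Two logically independent counters. Without the pad they share a
   cache block, so every increment invalidates the other core's copy
   (false sharing). The pad pushes them onto different 64-byte blocks.   */
struct counters {
    volatile long a;
    char pad[64];        /* remove this line to provoke false sharing    */
    volatile long b;
};

static struct counters c;

static void *inc_a(void *arg) { for (long i = 0; i < ITERS; i++) c.a++; return NULL; }
static void *inc_b(void *arg) { for (long i = 0; i < ITERS; i++) c.b++; return NULL; }

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, inc_a, NULL);
    pthread_create(&t2, NULL, inc_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("a = %ld, b = %ld\n", c.a, c.b);
    return 0;
}
```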


[Figure: miss-classification decision tree. Each miss is classified by asking whether it is the first reference to the block by this processor (and the first access system-wide), why the last copy was eliminated (replacement, invalidation, or an old copy with state = invalid still present), whether the block has been modified since replacement, and whether the modified words were actually accessed. The resulting categories are: cold, true-sharing-cold, false-sharing-cold, pure-true-sharing, pure-false-sharing, true-sharing-invalidation-capacity, false-sharing-invalidation-capacity, pure-capacity, true-sharing-capacity-invalid and false-sharing-capacity-invalid.]
Non-Uniform Memory Access Computers (NUMA)


Cache-Coherent NUMA Computers


Scalable machines, such as the CRAY T3E, disable caching of remote addresses.


Every access goes over the network, or


the programmer is responsible for keeping copies coherent.


Requirements for implicit caching and coherence on
physically distributed memory machines:


Latency and bandwidth scale well


Protocol scales well


In contrast to cache-only memory architectures (COMA), the home location of an address is fixed.


The focus here is on hardware-based, directory-based cache coherence.


A directory is a place where the state of a block in the caches is stored.

Scalable Multiprocessor with Directories

Simple Directory-Based Cache Coherence Protocol


Single writer - multiple readers


A cache miss leads to a transaction to the home node of the memory block.


The remote (home) node checks the state and performs the protocol actions:


Invalidating copies on write


Returning value on read


All requests, replies, invalidations etc. are network
transactions



Questions:


How is the directory information stored?


How may efficient protocols be designed?
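A flat, memory-based full-map directory entry can be sketched as a state plus one presence bit per processor; the structure below and its handlers illustrate the single-writer/multiple-reader idea and are not the protocol of any particular machine.

```c
#include <stdint.h>
#include <stdio.h>

#define NPROC 64

typedef enum { UNCACHED, SHARED, MODIFIED } DirState;

/* Full-map directory entry kept at the home node of one memory block.   */
typedef struct {
    DirState state;
    uint64_t presence;            /* bit p set: processor p has a copy   */
} DirEntry;

/* Read miss from processor p arriving at the home node.                 */
static void handle_read_miss(DirEntry *e, int p) {
    if (e->state == MODIFIED)
        printf("fetch dirty copy from the owner, write it back\n");
    e->state = SHARED;            /* memory now has a clean copy         */
    e->presence |= 1ull << p;     /* record the new sharer               */
}

/* Write miss (or upgrade) from processor p: invalidate all other copies. */
static void handle_write_miss(DirEntry *e, int p) {
    for (int q = 0; q < NPROC; q++)
        if (q != p && (e->presence & (1ull << q)))
            printf("send invalidation to P%d\n", q);
    e->presence = 1ull << p;      /* p becomes the single owner          */
    e->state = MODIFIED;
}

int main(void) {
    DirEntry e = { UNCACHED, 0 };
    handle_read_miss(&e, 3);      /* block becomes SHARED at P3          */
    handle_read_miss(&e, 7);      /* P7 joins the sharers                */
    handle_write_miss(&e, 7);     /* P3 is invalidated; P7 owns block    */
    printf("state = %d, presence = 0x%llx\n",
           (int)e.state, (unsigned long long)e.presence);
    return 0;
}
```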

Classification of Directory Implementations

Directory storage schemes - finding the source of the directory information:


Flat


Centralized


Hierarchical: a hierarchy of caches with the inclusion property.


Locating the copies in flat schemes:


Memory-based: the directory information is co-located with the memory block at its home location.


Cache-based: the caches holding a copy form a linked list; memory holds only the head pointer.

Protocol Scalability


Precondition for application: Small number of
sharers


Performance depends on


Number of transactions (bandwidth requirements)


Number of transactions on the critical path (latency)


Storage overhead


It can be quite severe since presence bits scale linearly with
memory size and number of processors


Example: 64-byte blocks, one presence bit per processor per block

Processors    Directory memory as a fraction of non-directory memory
64            12.5%
256           50%
1024          200%
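The table values follow from one presence bit per processor against 64 x 8 data bits per block; a small C check (the helper function is illustrative) also reproduces the two-level figure quoted later in this section.

```c
#include <stdio.h>

/* Full-map directory overhead: bits of directory state per block,
   relative to the block's data bits.                                    */
static double dir_overhead(int bits_per_entry, int block_bytes) {
    return (double)bits_per_entry / (block_bytes * 8);
}

int main(void) {
    int procs[] = { 64, 256, 1024 };
    for (int i = 0; i < 3; i++)                 /* 64-byte blocks         */
        printf("%4d processors: %.1f%%\n",
               procs[i], 100.0 * dir_overhead(procs[i], 64));
    /* Two-level variant: 4 processors per node, 128-byte blocks on a
       256-processor system -> 64 node bits per 1024 data bits = 6.25%.  */
    printf("256 procs, 4 per node, 128 B blocks: %.2f%%\n",
           100.0 * dir_overhead(256 / 4, 128));
    return 0;
}
```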

Properties of Hierarchical Schemes


Advantages:


Transactions need not go to the home node.


Multiple requests from different nodes can be
combined


Disadvantages:


Number of transactions to traverse tree might be greater than
in flat schemes.


If startup costs are high, this is worse than traversing a long distance.


Each transaction needs to look up the directory information, which increases the latency of transactions.


Summary


Hierarchical schemes are not popular due to latency and
bandwidth characteristics.


They have been used in systems providing data migration

Flat Memory-based Directory Schemes


Properties


The number of transactions to invalidate sharers is
proportional to the number of sharers.


The invalidation transaction can be overlapped or sent in
parallel so that latency is reduced.


The main disadvantage is the memory overhead.


Reduction of memory overhead:


Increase the cache-line size


Increase the number of processors per directory (two-level protocol)


Example:


Nodes of four processors and 128-byte cache blocks lead to an overhead of only 6.25% on a 256-processor system, instead of 50%.


The overhead is still proportional to P*M (P is the number of processors and M is the memory size).