Memory Models: A Case for Rethinking Parallel Languages and Hardware




Sarita V. Adve

University of Illinois

sadve@illinois.edu


Acks: Mark Hill, Kourosh Gharachorloo, Jeremy Manson, Bill Pugh, Hans Boehm, Doug Lea, Herb Sutter, Vikram Adve, Rob Bocchino, Marc Snir, Byn Choi, Rakesh Komuravelli, Hyojin Sung

Also a paper by S. V. Adve & H.-J. Boehm, to appear in CACM


Memory Consistency Models

Parallelism for the masses!
Shared-memory most common
Memory model = Legal values for reads


20 Years of Memory Models

Memory model is at the heart of concurrency semantics
20-year journey from confusion to convergence, at last!
Hard lessons learned
Implications for the future

Current way to specify concurrency semantics is too hard
Fundamentally broken
Must rethink parallel languages and hardware
E.g., Illinois Deterministic Parallel Java, DeNovo architecture

What is a Memory Model?

Memory model defines what values a read can return

Initially A = B = C = Flag = 0

Thread 1              Thread 2
A = 26                while (Flag != 1) {;}
B = 90                r1 = B
Flag = 1              r2 = A

Should r1 return 90 and r2 return 26, or can they still return 0?
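
Below is a minimal Java sketch of the example above (the class and field names are mine, for illustration). Declaring Flag volatile makes the flag accesses synchronization under the Java memory model, so r1 = 90 and r2 = 26 are guaranteed; removing volatile reintroduces a data race, and the reads may then return 0 (or the loop may spin forever).

// Minimal sketch of the Flag example; names are illustrative.
class FlagExample {
    static int a = 0, b = 0;
    static volatile int flag = 0;  // drop "volatile" and this becomes a data race

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> { a = 26; b = 90; flag = 1; });
        Thread t2 = new Thread(() -> {
            while (flag != 1) { /* spin until t1 signals */ }
            int r1 = b;  // 90, guaranteed while flag is volatile
            int r2 = a;  // 26, guaranteed while flag is volatile
            System.out.println("r1=" + r1 + ", r2=" + r2);
        });
        t1.start(); t2.start();
        t1.join(); t2.join();
    }
}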

Memory Model is Key to Concurrency Semantics

Interface between program and transformers of the program
Defines what values a read can return

[Figure: C++ program → Compiler → Dynamic optimizer → Hardware, with assembly as the intermediate interface]

Weakest system component exposed to the programmer
Language-level model has implications for hardware
Interface must last beyond trends

Desirable Properties of a Memory Model

3 Ps:
Programmability
Performance
Portability

Challenge: hard to satisfy all 3 Ps

Late 1980s-90s: Largely driven by hardware
Lots of models, little consensus

2000 onwards: Largely driven by languages/compilers
Consensus model for Java, C++ (C, others ongoing)
Had to deal with mismatches in hardware models

Path to convergence has lessons for the future


SC [Lamport79]

Programmability: Sequential consistency (SC) most intuitive
Operations of a single thread in program order
All operations in a total order (i.e., atomic)

But Performance?
Recent (complex) hardware techniques boost performance with SC
But compiler transformations still inhibited

But Portability?
Almost all h/w, compilers violate SC today

SC not practical, but…




Next Best Thing: SC Almost Always

Parallel programming too hard even with SC
Programmers (want to) write well-structured code
Explicit synchronization, no data races (a Java sketch of this pattern follows below)

Thread 1              Thread 2
Lock(L)               Lock(L)
Read Data1            Read Data2
Write Data2           Write Data1
…                     …
Unlock(L)             Unlock(L)

SC for such programs much easier: can reorder data accesses

Data-race-free model [AdveHill90]
SC for data-race-free programs
No guarantees for programs with data races
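
The following is a hedged Java sketch of the locked-update pattern above (class, lock, and field names are mine). Every access to the shared data happens while holding the same lock, so the program is data-race-free; under the data-race-free model the system may freely reorder the data accesses inside each critical section while the program still appears SC to the programmer.

// Well-structured, data-race-free sketch of the pattern above.
class SharedData {
    private final Object lock = new Object();  // plays the role of L
    private int data1, data2;

    void thread1Work() {
        synchronized (lock) {   // Lock(L)
            int r = data1;      // Read Data1
            data2 = r + 1;      // Write Data2
        }                       // Unlock(L)
    }

    void thread2Work() {
        synchronized (lock) {   // Lock(L)
            int r = data2;      // Read Data2
            data1 = r + 1;      // Write Data1
        }                       // Unlock(L)
    }
}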

Definition of a Data Race

Distinguish between data and non-data (synchronization) accesses
Only need to define for SC executions ⇒ total order

Two memory accesses form a race if
From different threads, to the same location, at least one is a write
They occur one after another in the total order

Thread 1              Thread 2
Write, A, 26
Write, B, 90
                      Read, Flag, 0
Write, Flag, 1
                      Read, Flag, 1
                      Read, B, 90
                      Read, A, 26

A race with a data access is a data race
Data-race-free program = No data race in any SC execution

Data-Race-Free Model

Data-race-free model = SC for data-race-free programs
Does not preclude races for wait-free constructs, etc.
Requires races be explicitly identified as synchronization
E.g., use volatile variables in Java, atomics in C++

Dekker's algorithm
Initially Flag1 = Flag2 = 0
volatile Flag1, Flag2

Thread 1              Thread 2
Flag1 = 1             Flag2 = 1
if (Flag2 == 0)       if (Flag1 == 0)
  // critical section   // critical section

SC prohibits both loads returning 0
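
A Java rendering of the simplified Dekker example above (a sketch, not full Dekker's algorithm). Declaring both flags volatile identifies the racing accesses as synchronization, so the data-race-free (and Java) model gives them SC semantics: at most one thread can see the other's flag as 0 and enter its critical section.

// Simplified Dekker-style flags; volatile marks the racy accesses as synchronization.
class DekkerFlags {
    static volatile int flag1 = 0, flag2 = 0;

    static void thread1() {
        flag1 = 1;
        if (flag2 == 0) {
            // critical section: SC forbids both threads reaching here
        }
    }

    static void thread2() {
        flag2 = 1;
        if (flag1 == 0) {
            // critical section
        }
    }
}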


Data-Race-Free Approach

Programmer's model: SC for data-race-free programs

Programmability
Simplicity of SC, for data-race-free programs

Performance
Specifies minimal constraints (for SC-centric view)

Portability
Language must provide way to identify races
Hardware must provide way to preserve ordering on races
Compiler must translate correctly

1990s in Practice (The Memory Models Mess)

Hardware
Implementation/performance-centric view
Different vendors had different models, most non-SC
Alpha, Sun, x86, Itanium, IBM, AMD, HP, Cray, …
Various ordering guarantees + fences to impose other orders
Many ambiguities, due to complexity, by design(?), …

High-level languages
Most shared-memory programming with Pthreads, OpenMP
Incomplete, ambiguous model specs
Memory model is a property of the language, not the library [Boehm05]

Java: commercially successful language with threads
Chapter 17 of the Java language spec on the memory model
But hard to interpret, badly broken


[Diagram: sequences of loads (LD) and stores (ST) reordered around a fence]

2000-2004: Java Memory Model

~2000: Bill Pugh publicized fatal flaws in the Java model
Lobbied Sun to form an expert group to revise the Java model
Open process via mailing list
Diverse participants
Took 5 years of intense, spirited debates
Many competing models
Final consensus model approved in 2005 for Java 5.0
[MansonPughAdve POPL 2005]

Java Memory Model Highlights

Quick agreement that SC for data-race-free was required

Missing piece: Semantics for programs with data races
Java cannot have undefined semantics for ANY program
Must ensure safety/security guarantees
Limit damage from data races in untrusted code

Goal: Satisfy security/safety, w/ maximum system flexibility
Problem: "safety/security, limited damage" w/ threads very vague

Java Memory Model Highlights

Initially X = Y = 0

Thread 1              Thread 2
r1 = X                r2 = Y
Y = r1                X = r2

Is r1 = r2 = 42 allowed? Data races produce a causality loop!

Definition of a causality loop was surprisingly hard
Common compiler optimizations seem to violate "causality"

Java Memory Model Highlights

Final model based on consensus, but complex
Programmers can (must) use "SC for data-race-free"
But system designers must deal with complexity
Correctness tools, racy programs, debuggers, …??
Recent discovery of bugs [SevcikAspinall08]


2005-: C++, Microsoft Prism, Multicore

~2005: Hans Boehm initiated C++ concurrency model
Prior status: no threads in C++, most concurrency w/ Pthreads
Microsoft concurrently started its own internal effort
C++ easier than Java because it is unsafe
Data-race-free is a plausible model

BUT multicore ⇒ new h/w, optimizations, more scrutiny
Mismatched h/w, programming views became painfully obvious
Debate that SC for data-race-free inefficient w/ hardware models

Hardware Implications of Data-Race-Free

Synchronization (volatiles/atomics) must appear SC
Each thread's synch must appear in program order

synch Flag1, Flag2

Thread 1              Thread 2
Flag1 = 1             Flag2 = 1
Fence                 Fence
if (Flag2 == 0)       if (Flag1 == 0)
  critical section      critical section

SC ⇒ both reads cannot return 0
Requires efficient fences between synch stores/loads



Implications of Atomic Synch Writes

All synchs must appear in a total order (atomic)
Independent reads, independent writes (IRIW):

Initially X = Y = 0

T1        T2        T3           T4
X = 1     Y = 1     … = Y (1)    … = X (1)
                    fence        fence
                    … = X (0)    … = Y (0)

SC ⇒ no thread sees the new value until old copies are invalidated,
so the outcome shown above (each reader seeing the other write as still 0) is forbidden

Shared caches w/ hyperthreading/multicore make this harder
Programmers don't usually use IRIW
Why pay cost for SC in h/w if not useful to s/w?
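
A hedged Java sketch of the IRIW example above (names are illustrative). Java's requirement that volatile writes appear in a single total order is exactly the atomicity discussed here: T3 and T4 may not disagree about which of the two writes happened first, so the combined outcome r1 = 1, r2 = 0, r3 = 1, r4 = 0 is forbidden, and the hardware must pay for that guarantee even though few programs rely on it.

// IRIW: independent reads of independent writes, with volatile variables.
class IRIW {
    static volatile int x = 0, y = 0;
    static int r1, r2, r3, r4;

    static void t1() { x = 1; }
    static void t2() { y = 1; }
    static void t3() { r1 = y; r2 = x; }  // reads y first, then x
    static void t4() { r3 = x; r4 = y; }  // reads x first, then y
    // Forbidden combined outcome: r1 == 1 && r2 == 0 && r3 == 1 && r4 == 0
}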

C++ Challenges

2006: Pressure to change Java/C++ to remove the SC baseline
To accommodate some hardware vendors

But what is the alternative?
Must allow some hardware optimizations
But must be teachable to undergrads

Showed such an alternative (probably) does not exist

C++ Compromise

Default C++ model is data-race-free
AMD, Intel, … on board

But
Some systems need expensive fence for SC
Some programmers really want more flexibility

C++ specifies a low-level model only for experts
Complicates the spec, but only for experts
We are not advertising this part
[BoehmAdve PLDI 2008]

Summary of Current Status

Convergence to "SC for data-race-free" as the baseline

For programs with data races
Minimal but complex semantics for safe languages
No semantics for unsafe languages


Lessons Learned

SC for data-race-free is the minimal baseline

Specifying semantics for programs with data races is HARD
But "no semantics for data races" also has problems
Not an option for safe languages; debugging; correctness checking tools

Hardware-software mismatch for some code
"Simple" optimizations have unintended consequences

State-of-the-art is fundamentally broken

Banish shared-memory? No: banish wild shared-memory!

We need
Higher-level disciplined models that enforce discipline
Hardware co-designed with high-level models

E.g., Deterministic Parallel Java [V. Adve et al.], DeNovo hardware

Research Agenda for Languages

Disciplined shared-memory models
Simple
Enforceable
Expressive
Performance

Key: What discipline?
How to enforce it?




Data-Race-Free

A near-term discipline: Data-race-free

Enforcement
Ideally, language prohibits by design
Else, runtime catches as exception

But data-race-free is still not sufficiently high level


Deterministic-by-Default Parallel Programming

Even data-race-free parallel programs are too hard
Multiple interleavings due to unordered synchronization (or races)
Makes reasoning and testing hard

But many algorithms are deterministic
Fixed input gives fixed output
Standard model for sequential programs
Also holds for many transformative parallel programs
Parallelism not part of the problem specification, only for performance

Why write such an algorithm in a non-deterministic style,
then struggle to understand and control its behavior?



Deterministic-by-Default Model

Parallel programs should be deterministic-by-default
Sequential semantics (easier than SC!)

If non-determinism is needed
it should be explicitly requested
it should be isolated from the deterministic parts

Enforcement:
Ideally, language prohibits by design
Else, runtime catches violations



State-of-the-Art

Many deterministic languages today
Functional, pure data parallel, some domain-specific, …
Much recent work on runtime, library-based approaches

Our work: Language approach for modern O-O methods
Deterministic Parallel Java (DPJ) [V. Adve et al.]






Deterministic Parallel Java (DPJ)

Object-oriented type and effect system
Use "named" regions to partition the heap
Annotate methods with effect summaries: regions read or written
If the program type-checks, it is guaranteed deterministic (see the sketch below)
* Simple, modular compiler checking
* No run-time checks today, may add in future

Side benefit: regions and effects are valuable documentation

Extended sequential subset of Java (DPC++ ongoing)
Initial evaluation for expressivity, performance [OOPSLA09]
Integrating disciplined non-determinism
Encapsulating frameworks and unchecked code
Semi-automatic tool for effect annotations [ASE09]
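
The sketch below illustrates the flavor of DPJ's region and effect annotations; the syntax is approximated from published DPJ examples and may not match the current compiler exactly. Fields are placed in named regions, methods summarize the regions they read or write, and the type system can then verify that the two branches of the cobegin write disjoint regions, so running them in parallel is deterministic.

// DPJ-style region/effect annotations (approximate syntax, not plain Java).
class Point {
    region X, Y;                 // named heap regions partitioning the fields
    int x in X;                  // field x lives in region X
    int y in Y;                  // field y lives in region Y

    void setX(int vx) writes X { x = vx; }
    void setY(int vy) writes Y { y = vy; }

    void setBoth(int vx, int vy) writes X, Y {
        cobegin {
            setX(vx);            // effect: writes X
            setY(vy);            // effect: writes Y (disjoint from X, so safe in parallel)
        }
    }
}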



Research Agenda for Hardware

Current hardware not matched even to the current model
Near term: ISA changes, speculation
Long term: Co-design hardware with new software models

Illinois DeNovo Project

Design hardware to exploit disciplined parallelism
Simpler hardware
Scalable performance
Power/energy efficiency

Working with DPJ as an example disciplined model
Exploit data-race-freedom, region/effect information
* Simpler coherence
* Efficient communication: point-to-point, bulk, …
* Efficient data layout: region- vs. cache-line-centric memory
* New hardware/software interface


Cache Coherence

Commonly accepted definition (software-oblivious)
All writes to the same location appear in the same order
Source of much complexity
Coherence protocols to scale to 1000 cores?

What do we really need (software-aware)?
Get the right data to the right task at the right time
Disciplined models make it easier to determine what is "right"

(Assume only for-each loops)
A read must return the value of
the last write in its own task, or
the last write in the previous for-each loop
(a small sketch of this pattern follows below)
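
A small, purely sequential Java sketch of the two-phase for-each pattern assumed above (method and array names are mine; a disciplined parallel version would run each loop's iterations concurrently). In the second loop, a read of a[i] must return either a value written earlier in its own iteration or the value written to a[i] in the previous loop, which is the "right data to the right task at the right time" requirement.

// Two-phase for-each pattern: phase 2 must observe phase 1's writes.
static void twoPhase(int[] a, int[] b) {
    for (int i = 0; i < a.length; i++) {                       // phase 1: each task writes a[i]
        a[i] = i * i;
    }
    for (int i = 0; i < Math.min(a.length, b.length); i++) {   // phase 2: each task reads a[i]
        b[i] = a[i] + 1;                                       // must see the phase-1 value of a[i]
    }
}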



Today's Coherence Protocols

Snooping
Broadcast, ordered networks

Directory
Avoid broadcast through a level of indirection

Complexity: Races in protocol
Race-free software ⇒ race-free coherence protocol

Performance: Level of indirection
But definition of coherence no longer requires serialization

Overhead: Sharer list
Region-effects enable self-invalidations

+ No false sharing, flexible communication granularity, region-based data layout

⇒ Simpler, more efficient DeNovo protocol




Conclusions

Current way to specify concurrency semantics is fundamentally broken
Best we can do is SC for data-race-free
* But cannot hide from programs with data races
Mismatched hardware-software
* Simple optimizations give unintended consequences

Need
High-level disciplined models that enforce discipline
Hardware co-designed with the high-level model

DPJ: deterministic-by-default parallel programming
DeNovo: hardware for disciplined parallel programming

Previous memory-model convergence came from a similar process
But this time, let's co-design s/w and h/w