Threads Cannot be Implemented as a Library


Hans-J. Boehm

About the Author


Hans-J. Boehm


Boehm conservative garbage collector


Parallel GC for C/C++


Participated in revising the Java Memory Model


Co-authored the memory model for multi-threaded C++


Compiler-centric background


Introduction


Multi-threaded programs are ubiquitous


Many programs need to manage logically concurrent interactions


Multiprocessors are becoming mainstream


Desktop computers support multiple hardware contexts, which makes them logically multiprocessors


Multi-threaded programs are a good way to utilize increasing hardware parallelism

Thread support


Threads included in language specification


Java


C#


Ada


Multi-threading not part of the language specification


C/C++


Thread support provided by add-on libraries


POSIX threads (Pthreads)


Pthreads standard does not specify formal semantics for concurrency

Memory Model


Defines which assignments to a variable by one thread can be seen by a concurrently executing thread


Sequential Consistency


All actions occur in a total order (the execution order) that is consistent with program order; furthermore, each read r of a variable v sees the value written by the write w to v such that:


w comes before r in the execution order, and


there is no other write w' to v such that w comes before w' and w' comes before r in the execution order (restated compactly below)
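A compact restatement of the visibility condition above (an added formalization, not in the original slides; $<_E$ denotes the execution order and $w'$ ranges over writes to $v$):

$$ w <_E r \quad\land\quad \nexists\, w' :\; w <_E w' <_E r $$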


Happens-Before


Simple version of the Java memory model, slightly too weak


Weak: allows for compiler optimizations

Surprising results caused by statement reordering


r1 & r2 are local, A & B are shared


Write in one thread


Read of the same variable in another thread


Write and read are not ordered by synchronization


Race Condition! (a sketch of the usual form of this example follows)
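The code for this slide is not reproduced in the text above. A minimal sketch of the standard form of this example, assuming A and B start at 0 and each of r1, r2 is written by only one thread:

#include <pthread.h>
#include <stdio.h>

int A = 0, B = 0;      /* shared */
int r1, r2;            /* each written by only one thread */

void *thread1(void *arg) { A = 1; r1 = B; return NULL; }
void *thread2(void *arg) { B = 1; r2 = A; return NULL; }

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, thread1, NULL);
    pthread_create(&t2, NULL, thread2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Under sequential consistency at least one of r1, r2 must be 1.
     * Because the write in one thread and the read of the same variable
     * in the other are not ordered by any synchronization, compiler or
     * hardware reordering can make r1 == 0 && r2 == 0 observable. */
    printf("r1=%d r2=%d\n", r1, r2);
    return 0;
}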

Pthread approach


Provided as an add-on library


Includes hardware instructions to prevent reordering


Avoids compiler reordering by appearing as an opaque function


Requires a disciplined style of synchronization (sketched below)


Valid 98% of the time


What about the other two percent?
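A minimal sketch of the disciplined style the Pthreads rules assume (illustrative names, not code from the paper): every access to shared data happens while holding a mutex, and the lock/unlock calls are opaque functions the compiler cannot reorder across.

#include <pthread.h>

static pthread_mutex_t counter_lock = PTHREAD_MUTEX_INITIALIZER;
static long counter = 0;                 /* shared data */

void increment_counter(void) {
    pthread_mutex_lock(&counter_lock);   /* opaque call, also a hardware barrier */
    ++counter;                           /* only touched while holding the lock */
    pthread_mutex_unlock(&counter_lock);
}

long read_counter(void) {
    long v;
    pthread_mutex_lock(&counter_lock);
    v = counter;
    pthread_mutex_unlock(&counter_lock);
    return v;
}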

Pthread correctness


Apparently correct programs may fail intermittently


A new compiler or new hardware can induce failures


Poor performance may force slight rule bending


Difficult for the programmer to reason about correctness


Let's see some examples why…

Concurrent modification


Pthread specifications prohibit races


But is this enough?

Initially x = y = 0

T1 as written:      if (x == 1) ++y;
T1 transformed:     ++y; if (x != 1) --y;

T2 as written:      if (y == 1) ++x;
T2 transformed:     ++x; if (y != 1) --x;

Is x==1, y==1 acceptable?


No, under a sequentially consistent interpretation: as written, neither condition can become true before the other thread increments, so neither thread ever writes


But if the compiler produces the transformed versions, there is a race, and the outcome x==1, y==1 becomes possible

Why threads cannot be
implemented as a library


Argument (1)


Since the compiler is unaware of threads, it is allowed to transform code subject only to sequential correctness constraints, and may thereby introduce a race


But this example is somewhat far-fetched


Rewriting of Adjacent Data


Bit fields on a little-endian 32-bit machine:

struct { int a:17; int b:15; } x;


Concurrent writes to the same memory location, not to the same variable


Implementation of x.a = 42 (generated word-at-a-time code):

{
    tmp = x;            // read the whole word containing a and b
    tmp &= ~0x1ffff;    // mask off old a
    tmp |= 42;          // insert new a
    x = tmp;            // write the whole word back
}


Updates to x.b by another thread can be lost: the rewrite introduces a race (see the sketch below)
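An illustrative sketch (not code from the slides) of how the word-level rewrite can lose an update: two threads assign to distinct fields of the same word, yet one assignment can overwrite the other.

#include <pthread.h>
#include <stdio.h>

struct { int a:17; int b:15; } x;     /* two fields sharing one 32-bit word */

void *writer_a(void *arg) { x.a = 42; return NULL; }   /* read-modify-write of the word */
void *writer_b(void *arg) { x.b = 7;  return NULL; }   /* read-modify-write of the word */

int main(void) {
    pthread_t ta, tb;
    pthread_create(&ta, NULL, writer_a, NULL);
    pthread_create(&tb, NULL, writer_b, NULL);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    /* If the two read-modify-write sequences interleave, one field's
     * update can be overwritten: a may not be 42, or b may not be 7. */
    printf("a=%d b=%d\n", x.a, x.b);
    return 0;
}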

Why threads cannot be
implemented as a library


Argument (2)


For languages like C, the specification does not define when adjacent data may be overwritten, so transformations like this can silently introduce races. If the specification did define this, the compiler would know to avoid the optimization



Register promotion


Code that repeatedly updates the globally shared variable x, acquiring the lock only when the flag mt is set:

for (...) {
    if (mt) pthread_mutex_lock(...);
    x = ... x ...;
    if (mt) pthread_mutex_unlock(...);
}


Using profile feedback or static heuristics, it becomes beneficial to promote x to a register r in the loop:

r = x;
for (...) {
    if (mt) {
        x = r; pthread_mutex_lock(...); r = x;
    }
    r = ... r ...;
    if (mt) {
        x = r; pthread_mutex_unlock(...); r = x;
    }
}
x = r;


Thus the promoted code reads and writes x outside the critical section (for example, the x = r just before pthread_mutex_lock), and these extra reads and writes introduce possible race conditions

Why threads cannot be
implemented as a library


Argument (3)


If the compiler is not aware of the existence of threads, and the language specification does not address thread-specific semantic issues, then optimizations can introduce race conditions

Implications


Compilers are forced into blanket removal of the optimization in many cases


Or perhaps a toned-down version of the optimization


This can degrade the performance of code that does not even use threads

Sieve of Eratosthenes

[Figure: boolean array indexed from 10,000 up to 100,000,000; entries are progressively marked true as multiples of each prime are crossed off, and the entries that remain false are prime.]

/* candidate primes up to 10,000; cross off multiples up to 100,000,000 */
for (mp = start; mp < 10000; ++mp)
    if (!get(mp)) {
        for (multiple = mp; multiple < 100000000; multiple += mp)
            if (!get(multiple))
                set(multiple);
    }

Synchronizing global array access



Each get/set access to the shared array in the loop above can be synchronized in one of four ways (the mutex and unsynchronized variants are sketched below):


Mutex


Spin-locks


Non-blocking (atomic operations)


None
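A hedged sketch of the two extreme variants named above (the actual benchmark code is not reproduced in these slides; the names and the single-lock layout are illustrative):

#include <pthread.h>

static unsigned char *sieve;   /* one flag per candidate; assume it is
                                  allocated once at startup, e.g. calloc(100000000, 1) */
static pthread_mutex_t sieve_lock = PTHREAD_MUTEX_INITIALIZER;

/* (1) Mutex: every access acquires a lock, as the Pthreads rules require.
 * (A real implementation might shard the lock to reduce contention.) */
int get_mutex(long i) {
    int v;
    pthread_mutex_lock(&sieve_lock);
    v = sieve[i];
    pthread_mutex_unlock(&sieve_lock);
    return v;
}
void set_mutex(long i) {
    pthread_mutex_lock(&sieve_lock);
    sieve[i] = 1;
    pthread_mutex_unlock(&sieve_lock);
}

/* (4) None: plain loads and stores.  The algorithm tolerates stale reads,
 * but expressing why this is safe requires memory-model guarantees that a
 * library-based thread standard cannot provide. */
int  get_none(long i) { return sieve[i]; }
void set_none(long i) { sieve[i] = 1; }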

Performance results


Pthreads library approaches (1) and (2) cannot reach optimal performance levels


This algorithm is designed for a weak memory model, which cannot be expressed using a thread library

Performance results


Similar results for a hyper-threaded P4 processor


Even more dramatic performance differences when moving to a more parallel processor

[Charts: results on Itanium and on a hyper-threaded P4]


Additional implications of the Pthreads approach


If we choose to allow concurrent accesses to shared variables within library code


Unpredictable results can occur without language-level specifications

Original code:

x = 1;
pthread_mutex_lock(lock);
y = 1;
pthread_mutex_unlock(lock);

The implementation may move the assignment to x into the critical section:

pthread_mutex_lock(lock);
x = 1;
y = 1;
pthread_mutex_unlock(lock);

Is this a problem?

Conclusion


Compilers can introduce race conditions where there are none in the source code


Library code cannot intervene


It is impossible to achieve the performance gains of a multiprocessor without direct fine-grained use of atomic operations


Which is impossible in a library-based thread implementation


Why not just use the Java memory model?


It was designed to preserve type-safety


which C/C++ do not have


C++ needs its own memory model

REFERENCES


JSR-133 Expert Group, "JSR-133: Java Memory Model and Thread Specification", http://www.cs.umd.edu/~pugh/java/memoryModel


Daniel P. Bovet, Marco Cesati, "Understanding the Linux Kernel, 3rd Edition", O'Reilly


Sarita V. Adve, Kourosh Gharachorloo, "Shared Memory Consistency Models: A Tutorial", Digital Western Research Laboratory

Appendix


Happens-Before


Appendix


Section 5


Appendix


Section 5 (cont.)