
>> Madan Musuvathi: Hi everyone. Thanks a lot for coming on this
sunny Seattle morning today. I am Madan Musuvathi from the Research
in Software Engineering group. And it's my pleasure to introduce
Angelina Lee today. She comes here from MIT. She just finished her,
you know, PhD, and she will be joining as a postdoc in the same group
starting August. And so her interests are broadly in parallel
programming, all aspects of it: programming models, programming
languages, runtime systems. And she likes working across the stack,
both on the theoretical foundations and on engineering problems. So,
Angelina.


>> Angelina Lee: Thank you Madan. That's a really nice introduction,
thanks. Okay. Let me just ask, who is in the field of parallel
programming? Wow, okay, so almost everybody. Well, I have some
background material just to give you the context of my research, but
given that pretty much everyone is in the field of parallel computing
I am just going to fly through the first few slides.


Okay, today I am going to talk about memory abstraction for parallel
programming. But before I get into that, let's just say we are in the
era of multicore, right? I think there is no doubt about that. So
these are just a list of top-selected, top-rated personal laptops
that I picked from www.newegg.com.

And what you notice is that with a few hundred dollars you can get a
personal laptop with multicore. And I think that's pretty striking,
because right now multicore is basically commodity hardware.


And so how did we get to this point? Well, let's look at the evolution
of processors. Moore's Law, which states that the number of
transistors on a chip doubles every two years, is still going strong,
ever since 1975. However, at some point the clock frequency sort of
plateaued, and really power is the fundamental cause of all this,
because we have reached a power density beyond what the device can
handle. We just can't cool these devices, and therefore we can no
longer keep on increasing the clock frequency.


So as a result, you know, the chip vendors, instead of increasing the
clock frequency every two years, now give you double the cores every
two years. And what that means is that now we have to write parallel
programs in order to unlock the computational power provided by
modern hardware.


But, unfortunately, writing a parallel program is not as easy as
writing serial code. And I am sure you can all sympathize with that.
And I believe that to help make writing parallel programs easier, a
concurrency platform can help. A concurrency platform is a software
abstraction layer that runs between the user application and the
underlying operating system and provides the linguistic interface for
the user to specify the parallelism of their application. And the
software abstraction layer basically handles load balancing and task
scheduling for the user code.


Okay. Many researchers are interested in the problem of making
parallel programming easier, and as a result there are a lot of
concurrency platforms being developed. This also includes the TPL
from Microsoft and PPL from Microsoft. And, you know, this is just a
list that I put together going back to 1993, but before that there
were also a lot of other parallel programming languages that I didn't
include. These are just, I guess, the modern concurrency platforms.


Okay. So why does a concurrency platform help? Because it provides a
parallelism abstraction. The job of task scheduling and load balancing
is abstracted away and handled automatically for the user application,
which brings us to the topic today: memory abstraction.

So I believe that properly designed memory abstractions can help ease
the task of parallel programming. So what do I mean by memory
abstraction? Well, a memory abstraction is an abstraction layer that
runs between the user program and the underlying memory, such that it
provides a different view of the memory depending on the execution
context in which the memory is accessed.


So just to make it a little more concrete, for example, transactional
memory is what I consider a memory abstraction, because memory
accesses dynamically enclosed by an atomic block appear to occur
atomically.


Okay. So in this talk my thesis is that properly designed memory
abstractions help ease the task of parallel programming. And I am
going to demonstrate that point by showing you two case studies.


The first one is the Cactus Stack Memory Abstraction, which is a
cactus stack that interoperates with a linear stack. And the second
one is Memory-Mapped Reducer Hyperobjects, a linguistic mechanism for
avoiding determinacy races in multithreaded computation.


Okay. Any questions so far?


So this is the outline of my talk today. I will mainly focus on the
Cactus Stack Memory Abstraction, then I will briefly touch on the
Memory-Mapped Reducer Hyperobjects, and then give a concluding remark.


Okay. So I believe there are three desirable criteria that a
concurrency platform should provide. The first one is what I call
serial-parallel reciprocity. And what that means is that the parallel
code should be able to seamlessly interoperate with serial code,
including legacy binaries.

And the second one is that it should provide good performance. And
what that means is that if the application has ample parallelism you
should see near-perfect linear speedup when you run on the concurrency
platform.

And finally the last one is known as stack space, which is to say that
when you run the computation in parallel it should consume a
reasonable amount of space compared to the serial execution.


Well, why are these three the desirable criteria? I'd like to
demonstrate the point using a cartoon. We have an engineer and then we
have a customer. Well, the customer would like to parallelize his
software, and the engineer says, "Sure, use my concurrency platform."
Well, if today the engineer can only satisfy two out of the three
criteria, then say he just decides to forgo the SP reciprocity, which
is to say the interoperability between parallel and serial code. He
needs to ask the customer to recompile his entire code base.


Well, but that may not work so well, because the customer actually
uses a third-party binary. He can't recompile the source because he
doesn't have the source. So if that's not an option, maybe the next
one to forgo is the space usage. And so what the engineer might say
is, "Well, upgrade your RAM, because I don't know how much space you
are going to consume." I don't think that would go over too well.


The last one to forgo, if that's also not an option, is basically
performance. But, well, if you can't get the performance writing a
parallel program then you might as well just write the serial code.


So I believe these are the three desirable criteria. And we really
want to operate in the space where we get all three criteria. Okay,
but it turns out that there seems to be a fundamental tradeoff among
the three criteria. Our group and practitioners in the field actually
thought about different strategies to build a concurrency platform
that can satisfy all three criteria. But it turns out that many of
the strategies we can think of satisfy only two out of the three
criteria, except for the last strategy, which is what I am going to
focus on today.


And so, the cactus stack problem is basically the problem of how to
satisfy all three criteria simultaneously. Okay, so now we know what
the cactus stack problem is. Let's see why we have this problem.


Well, if you think about it, in serial code people typically use a
linear stack. And an execution of a serial language can be thought of
as a serial walk of the invocation tree. So here on the left I am
showing you an invocation tree where A calls B and calls C, and C
calls D and E.

These are the views of the stack when each function is active. So
when B is active it sees A and B on the stack. When D is active it
sees A, C, D. And the linear stack works really well for serial
programs. When you call a function it allocates a stack frame; when
it returns it pops it off. So essentially, if you think about it,
this is very space efficient, because B and C occupy the same place
in the stack, and the parent and child are allocated in contiguous
memory address space.

So that's the linear stack. And furthermore, it follows the rule for
pointers where the parent can pass a pointer to its children, but not
the other way around. Okay. So that's the linear stack.


So what's a cactus stack? A cactus stack is essentially just like a
linear stack. It follows the same rule for pointers, except that now
it needs to support the stack views of all the functions that can be
active in parallel.


So in this case I am showing you the same invocation tree, except now
I am using a red arrow to indicate that these functions are spawned.
I haven't defined spawn yet, but basically what it says is that B and
C and their sub-computations can potentially execute concurrently in
parallel. And what that means is that the linear stack no longer
works, because B and C can now be active at the same time, but they
cannot occupy the same space. So a linear stack is no longer
sufficient.


Okay. Right, so that is a cactus stack. Now, obviously, people have
been building parallel languages for many decades, so what do they
do? Well, many parallel languages that use a cactus stack basically
build a heap-based cactus stack. What that means is that when a
function becomes active you allocate a frame in the heap. So the
parent and the child are not allocated in contiguous space; rather,
you have a pointer in the child pointing back to the parent so you
know where to return to.


This includes Cilk-5 and Cilk++. And with this strategy a good time
bound and space bound can be obtained, but it fails to interoperate
with legacy serial code, because now you have to follow this heap
linkage where the call and return are performed via frames allocated
in the heap. And when you have a piece of legacy serial code it does
not understand the heap-based cactus stack; it still allocates things
on the stack. And so therefore you fail to interoperate. And so, as I
said, many strategies we could think of failed, except for the last
strategy.


And I am not going to get into all the different strategies, but one
thing I am going to say is that the main constraint in all these
strategies is that once a frame is allocated, the frame's location in
virtual memory cannot change. And that's a problem.


Okay. So that is the cactus stack problem. And next I am going to
overview Cilk-M, which is a system that we built to basically solve
the cactus stack problem. Okay. So first of all, throughout the rest
of the talk I am going to use the programming model in Cilk-M. Here I
have a very naive way of calculating Fibonacci numbers, where you just
compute fib(n-1) and fib(n-2) and then you sum the results together.
And obviously fib(n-1) and fib(n-2) can be computed in parallel.


So to write this naive fib computation in Cilk, the way you do it is
basically you spawn the first one. And what spawn says is that the
named child function may execute in parallel with the continuation of
its parent. So in this case the continuation is the call to fib(n-2)
and the sum. So in this case fib(n-1) may execute in parallel with
the call to fib(n-2).


Sync, on the other hand, is the counterpart of spawn. It serves as a
local barrier, which states that control cannot pass beyond that
point until all previously spawned children in the lexical scope have
returned. And note that these Cilk keywords grant permission for
parallel execution; they do not command parallel execution.
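
[Editor's note: just to make the fib example concrete, here is a minimal
sketch in Cilk Plus syntax; the talk's spawn and sync keywords correspond
to cilk_spawn and cilk_sync under the Cilk Plus compiler mentioned later
in the talk. The code itself is a sketch, not taken from the slides.]

    #include <cilk/cilk.h>   /* cilk_spawn, cilk_sync */

    long fib(long n) {
        if (n < 2) return n;
        long x = cilk_spawn fib(n - 1);  /* named child: may run in parallel   */
        long y = fib(n - 2);             /* the continuation: an ordinary call */
        cilk_sync;                       /* local barrier: wait for spawned children */
        return x + y;
    }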


What that means is that the programmer specifies the logical
parallelism of his application using these keywords. The underlying
runtime system will schedule accordingly, and whether the true
parallelism between fib(n-1) and fib(n-2) gets realized or not during
actual execution depends on the underlying resources you have. And
the runtime scheduler will schedule accordingly.


Okay. Any questions so far? Okay. So what is Cilk-M? Well, Cilk-M is
basically a work-stealing runtime system based on Cilk that solves
the cactus stack problem using thread-local memory mapping. So what's
thread-local memory mapping, or TLMM? TLMM is a virtual address range
in which each thread can map physical memory independently. So what
that means is that in this range all threads see the same virtual
addresses, but they can potentially map different physical pages in
that region.


And so the high-level idea is that at system startup this is the
address space, where you have heap, stack, and data. The runtime
system will allocate the heap, the data, and the code in the shared
region, but allocate the stack in the TLMM region, where each thread
can map independently. And at a very high level this is what the
solution looks like. And right now, bear with me, with an
unreasonable simplification: let's assume that we can map with
arbitrary granularity. I will come back to this point later.


So we have three workers. We have again the same invocation tree, and
say P1 is now actively executing B, P2 is actively executing D, and
P3 is actively executing E. What you can do in the TLMM region is
allow the threads to share a frame by mapping the same physical page
for that frame at the same virtual address. And in this case, for
example, if P1 and P2 share the stack prefix of A, they can map the
physical memory for A at the same virtual address, mapping the exact
same physical page.

On the other hand, they don't share B, C and D, so they would map
different physical memory after frame A. And with that, because A is
allocated at the same virtual address for both threads, if A has a
reference down to its descendants, both workers will see the same
virtual address, and when they dereference that virtual address it
will actually point to the same physical memory.

Okay. Any questions so far? Okay.


>>: [indiscernible].


>> Angelina Lee: That's a good question. I will get to that, although
much later, so if I don't answer your question by the end, let's talk.


>>: Just a question, so that I know that I understand it. So suppose
in P1 I take a stack address of a local variable in B and I secretly
pass it along to P2. It will really map differently, like it will go
all wrong, right? It will map?


>> Angelina Lee: That’s correct.


>>: That’s right? Okay, I understand it then.



>> Angelina Lee: Okay. Although I have to say, in multithreaded
programming you shouldn't do that, because that's bad practice.


>>: Oh, no, no, sure. And that's not what I am saying. It's just that
I want to make sure I really understand it. It's not a critique,
because I don't [indiscernible] stack addresses in the first place.


>> Angelina Lee: Okay. Yes, so your first understanding is correct.
Okay. Right. So this is the high-level idea, and with a TLMM-based
cactus stack we guarantee a time bound and a space bound. But before
I get into the time bound and space bound of the TLMM-based cactus
stack, I have to first go over the time bound and space bound of the
heap-based cactus stack.


So in Cilk, using a heap-based cactus stack, say we call Tp the
execution time on P processors. Then T1 is basically the execution
time on a single processor, which means that's the total work of your
computation.


On the other hand, T infinity you can think of as the execution time
if you have infinitely many processors. And so what that translates
to is the span, or what some people call the critical path of the
computation, which is the longest dependency chain in your
computation.

So we define parallelism to be T1 over T infinity. On the other hand,
for stack space we call Sp the stack space consumed by P processors
and S1 the stack space consumed when we execute serially on one
worker. And with that, using a heap-based cactus stack, Cilk
guarantees the following time bound and space bound. First, the time
bound: Tp equals T1/P plus O(T infinity). And what that bound tells
you is that when the number of processors you execute on is much
smaller than the parallelism you have, i.e., your application has
ample parallelism, then the first term dominates, and therefore when
you execute on P processors you should see near-perfect linear
speedup.


On the other hand, it guarantees a space bound that states that each
worker does not use more space than S1. But the heap-based cactus
stack does not support SP reciprocity. Now, for Cilk-M using the
TLMM-based cactus stack, to define the time bound and space bound I
have to use an additional parameter called Cilk depth.


Cilk depth is the maximum number of Cilk functions nested on the
stack during your serial execution. Here, when I say Cilk function I
mean a function that can contain spawn and sync, meaning that the
function itself contains parallelism. So note that the Cilk depth is
not the same as the spawn depth.
same as Spawn depth.


For example, here I am showing you an invocation tree where I am
marking the parallel functions, the Cilk functions, in blue, and I am
marking the regular, ordinary serial functions in green; the red
arrows indicate spawn, and the black indicate call. In this case the
Cilk depth is 3, because you have at most 3 Cilk functions on the
stack, i.e., A, C, D. The spawn depth is different; it counts only
the spawns, for example where C spawns E and E spawns F. So Cilk
depth is defined as the maximum number of Cilk functions nested on
the stack during serial execution.


Okay. So this is the Cilk-M guarantee. Now, in terms of the time
bound, instead of O(T infinity) we have this term S1 plus D in front
of the T infinity. And so what that means is that with this time
bound your application needs to have a little bit more parallelism in
order to get near-perfect linear speedup. On the other hand, for the
space bound we have an additional plus D term.
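
[Editor's note: written out, the two sets of bounds as they are described
in the talk, not quoting the formal theorems from the paper. Here D is the
Cilk depth and, for Cilk-M, stack space is measured in pages.]

    \[
    \text{Cilk, heap-based cactus stack:}\quad
      T_P \;=\; \frac{T_1}{P} + O(T_\infty),
      \qquad S_P \;\le\; P \cdot S_1
    \]
    \[
    \text{Cilk-M, TLMM-based cactus stack:}\quad
      T_P \;=\; \frac{T_1}{P} + O\bigl((S_1 + D)\,T_\infty\bigr),
      \qquad S_P \;\le\; P\,(S_1 + D)
    \]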


And, right, one thing I forgot to mention: here we are always
measuring the stack space in terms of the number of pages. So in this
case we have an additional plus D term. But by giving up just a
little on the time bound and space bound, what you get back is
serial-parallel reciprocity, which means that the [indiscernible] no
longer needs to distinguish function types at the call site. And
whether you have parallelism or not is dictated only by how a
function is invoked. So you can actually spawn either a Cilk function
or a C function, or you can call a Cilk function or a C function, and
this was not possible in Cilk-5 or in Cilk with a heap-based cactus
stack.


Okay. So we implemented this. We basically implemented the Cilk-M
prototype based on the open-source Cilk-5 runtime, which used a
heap-based cactus stack. And for TLMM, well, a traditional operating
system does not provide such functionality as TLMM, so we also
modified an open-source Linux kernel to provide support for TLMM.
It's not a big modification, about 600 lines of code.


And we ported the runtime system to work with Intel's Cilk Plus
compiler. What that means is that we can actually compile a piece of
C or C++ code that contains spawn and sync, and it will run on our
runtime system.


Okay. So let's take a look at the performance. And as I said earlier,
Cilk-M has this time bound, where the C is S1 plus D, and we compared
that with Cilk Plus.



So Cilk Plus is also a Cilk-based runtime system, developed by Intel.
That runtime system has a time bound like the heap-based cactus
stack's. And so when you look at the performance here, what I am
showing you is a list of applications; I run each application 10
times and take the average, with standard deviation less than a few
percent. And I take the Cilk-M running time and normalize it by the
Cilk Plus running time.


So what that means is that if it's below one, Cilk-M runs faster, and
if it's above one, Cilk Plus runs faster. And you can see that the
performance of the two is actually comparable.


In terms of space consumption, now this is the space bound with S1
plus D. So again we used the same set of benchmarks, and we have the
parameter D, which is the Cilk depth I just defined, and S1, the
stack space used during serial execution measured in number of pages,
and this is what the bound predicted.


But the bound is actually kind of pessimistic. In practice, when you
run in parallel, this is the actual Sp/P, the number of pages on
average used per worker when you run on P workers. And you can see
the amount of pages used per worker is within a 2x factor of the
serial execution across benchmarks, even though the space bound
predicted that you would use much more. And later, as I describe how
each of the components works, you will see why the bound is kind of
pessimistic. But, yeah, okay. Any questions so far?
ar?


>>: Do you have a comparison of the space consumption between Cilk-M
and the basic Cilk Plus?


>> Angelina Lee: No, unfortunately I do not have that comparison. But
that's a good question. At the time when we did this work the runtime
source for Cilk Plus was not available, so we couldn't, like, it
wasn't open source. Now it's actually open source, but at the time it
wasn't, so we weren't able to take it and try to instrument it.


Okay. To answer your question, we do actually have a space-usage
comparison with Cilk-5, which uses a heap-based cactus stack and has
a better time bound. And the space usage that we have is comparable.
I don't remember the numbers right off the top of my head, but I can
show you later, after the talk.


Okay. So right, so that's the Cilk-M overview. And now I am going to
go into a little bit more detail on each of the components and how
things work.

The first one is the Cilk-M work-stealing scheduler. Okay, so in the
work-stealing scheduler, during execution each worker maintains a
work deque of frames. It basically keeps track of what functions the
worker is responsible for executing. And most of the time the worker
manipulates the bottom of the deque, just like a linear stack. So for
example, if a worker executes a call, it pushes the call frame onto
the bottom of the deque.


On the other hand, if it executes a spawn, it pushes the spawn frame
onto the bottom of its deque. And here I am marking the frames by how
they were invoked. And obviously the other workers can do that in
parallel. On the other hand, if a worker does a return, it pops the
frame off the bottom of the deque.


So as you can see, for the most part the worker operates on the deque
just like a linear stack; however, the worker's behavior diverges
when it runs out of work to do. So in this case the green worker has
run out of work to do, so it randomly chooses a victim and steals
from the top of the victim's deque.


So in this case, say the green worker happens to choose the blue
worker to steal from, and it steals from the top of the deque. Okay,
so can anyone tell me why the green worker stole two frames instead
of one in this case? I am checking if you are actually paying
attention.


>>: [indiscernible].


>> Angelina Lee: In this case, yes, you are right. When a successful
steal occurs, that's when true parallelism is realized. Yes?


>>: The continuation of the spawn is the candidate for parallel
execution. Continuation of [indiscernible] makes execution sequential.


>> Angelina Lee: That's correct. Yes, someone is paying attention.
Okay, yeah, well, I am not saying the rest of you are not, I am just
--.


Okay, yes, you are right: in this case the second frame is called.
And what that means is that its parent cannot be resumed until this
function returns. And therefore when the green worker steals, it has
to steal both frames together.


Okay, right. And so now the green worker has a successful steal and
it can continue execution, and, as you pointed out, this is the point
when true parallelism is realized. Okay, and the theorem from
[indiscernible] states that with sufficient parallelism workers steal
infrequently and you get near-perfect linear speedup. So that's the
gist of the work-stealing scheduler.
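
[Editor's note: a rough sketch of the worker loop just described. The names
below, such as deque_pop_bottom and deque_steal_top, are illustrative
placeholders, not the Cilk-M API; a real implementation also needs a
synchronization protocol between the victim and the thief.]

    #include <stdlib.h>                       /* rand */

    typedef struct frame frame_t;             /* illustrative types */
    typedef struct worker { struct deque *deque; } worker_t;

    frame_t *deque_pop_bottom(struct deque *); /* illustrative helpers */
    frame_t *deque_steal_top(struct deque *);
    void execute(frame_t *);

    void worker_loop(worker_t *self, worker_t **workers, int nworkers) {
        for (;;) {
            frame_t *f = deque_pop_bottom(self->deque);   /* own work, bottom of deque */
            if (f) {
                execute(f);   /* calls and spawns push frames back onto the bottom */
            } else {
                worker_t *victim = workers[rand() % nworkers];     /* random victim   */
                frame_t *stolen = deque_steal_top(victim->deque);  /* oldest frame(s) */
                if (stolen) execute(stolen);
            }
        }
    }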


So next let's look at the TLMM-based cactus stack, and see how we
maintain a cactus stack on top of this work-stealing runtime system.
So again, bear with me, we are still using that unreasonable
simplification. I have the same invocation tree: A spawns B and C,
and C spawns D and E. And here I am showing you three workers, and I
am showing you the TLMM region of each worker.

So during execution the worker basically uses the stack like a linear
stack. P1 executes A and subsequently spawns B. P2 comes in and
decides to steal, and it happens to steal A, because that's pretty
much the only thing available to steal right now. And when P2 steals
A, it maps, i.e. does not copy, the stolen prefix, which in this case
is just frame A, at the same virtual addresses. So it maps the
physical memory corresponding to frame A, aligned at the same virtual
address as in P1's space, into its own TLMM region, and then
continues execution.

At this point it just operates like a linear stack. P3 comes in and
steals C. Properly speaking it might steal A, fail to make progress,
and steal again, but just for the sake of simplification let's say it
steals C. So it steals C from P2. In this case the stolen prefix
corresponds to A and C. So it maps the physical memory corresponding
to frames A and C into its own TLMM region, aligned at the same
virtual addresses as in P2's space, and then resumes execution.


Okay. So again, just to emphasize: because they are mapping the same
physical memory, aligned at the same virtual address, a pointer to a
local variable will resolve to the same virtual address on all
workers. And when you dereference it, it will point to the right
physical memory.

So now, obviously, one cannot map at arbitrary granularity. So let's
see how we address this. Nope, oops, yeah. So here again I am showing
you three workers, and I am showing you the TLMM regions, and I am
using a horizontal line to denote a page-size boundary.


So again P1 executes B, and in this case when P2 comes in and steals,
it maps the page corresponding to frame A into its own TLMM region.
And since a worker can only map at page granularity, inevitably it is
going to map part of frame B. And so it needs to be careful not to
clobber frame B, which is executing on P1.


To do that, before P2 resumes, it basically advances its stack
pointer to the next page boundary; this creates some fragmentation.
And then it continues with execution. Similarly for P3: if it steals
C, again it maps the pages where A and C reside into its own TLMM
region. And again, inevitably, it's going to map part of B and D. So
again, before it resumes execution, it needs to advance its stack
pointer to the next page boundary, which creates further
fragmentation.


So in Cilk-M we use a space-reclaiming heuristic where we reset the
stack pointer back to where it was once the function syncs. Because
once a function successfully syncs, you know there is no longer any
parallel sub-computation, so you know you can reclaim the space.
Okay. So that's how the runtime system maintains the TLMM-based
cactus stack.
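
[Editor's note: a rough sketch of the two stack-pointer adjustments just
described, assuming a downward-growing stack and 4 KB pages. The field and
function names are illustrative, not the Cilk-M source.]

    #include <stdint.h>

    #define PAGE_SIZE 4096u

    typedef struct worker {
        uintptr_t sp;        /* current stack pointer (stack grows down)      */
        uintptr_t saved_sp;  /* where sp was before the steal-time adjustment */
    } worker_t;

    /* After a successful steal: drop sp to the enclosing page boundary so the
     * thief never writes into the part of the page still used by the victim.
     * The skipped bytes are the fragmentation mentioned in the talk. */
    void advance_sp_after_steal(worker_t *w) {
        w->saved_sp = w->sp;
        w->sp = w->sp & ~(uintptr_t)(PAGE_SIZE - 1);
    }

    /* Space-reclaiming heuristic: once the function syncs, no parallel
     * subcomputation remains, so the fragmentation can be given back. */
    void reclaim_sp_at_sync(worker_t *w) {
        w->sp = w->saved_sp;
    }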


And let's look at the analysis, how we maintain the bounds. Okay. So
again, let's review the heap-based cactus stack: let S1 be the stack
space required by a serial execution and Sp be the space required by
a parallel execution. Using a heap-based cactus stack, each worker
does not use more than S1 amount of space.


And the gist of the proof is basically that the work-stealing runtime
system maintains what we call the busy-leaves property, which is to
say that every active leaf frame, and when I say active I mean a
frame that has been invoked and has been allocated, has a worker
executing on it.

And so what that means is that you can have at most P active paths
from the root of the computation to a leaf, and there is one worker
at every leaf. And from the root to a leaf you use at most S1 amount
of space. So for the entire computation each worker uses at most S1
amount of space.


With the Cilk-M space bound it's very similar, except you have this
plus D term. This is because we also maintain the busy-leaves
property, but every time we have a successful steal we have to
advance the stack pointer to the next page boundary. So in the
worst-case scenario, if every Cilk function that realizes the Cilk
depth D gets stolen successfully, then you have that many
fragmentations in your stack. And so therefore you have this
additional plus D term.


But back to the numbers: as I said, in practice it actually uses
much, much less space, because the bound is kind of pessimistic. For
one, when you have a successful steal you are mapping the same
physical page, so the bound is actually double counting those pages.
And furthermore, the worst-case scenario described by that argument
rarely happens in practice, because with ample parallelism most of
the time the steals occur near the root of the computation and then
each worker just executes.


Okay. In terms of the performance bound, on the other hand, using a
heap-based cactus stack this is the bound. And the argument is just:
at every time step the worker either does work or steals, and after a
constant number of rounds of stealing you will successfully steal and
make progress on your critical path.


And so in this bound the constant hidden behind the big O roughly
corresponds to the overhead that you have for performing a successful
steal. Using a heap-based cactus stack that overhead is just
switching a few pointers. But in Cilk-M, using the TLMM-based cactus
stack, every time you have a successful steal you have to advance the
stack pointer and you have to do the page mapping. And in the worst
case you have to map S1 plus D pages, and that's why now the C term
is no longer a constant, but rather a function growing with S1 plus
D. But, as I said, the worst-case scenario rarely happens in
practice.


And what the corollary says is that now you need to have a bit more
parallelism, meaning you have this additional term here in order to get
near perfect linear speedup. Okay. Any questions so far?


>>: [indiscernible].


>> Angelina Lee: So recall that in the work-stealing algorithm we
always steal from the top of the victim's deque. And what that means
is that you are always stealing the oldest frame and executing it.
And when your application has ample parallelism, what that means is
that, if you imagine your computation tree, you are stealing near the
root of the computation tree. And then each worker just does work for
the rest of the time, because whenever you steal, you steal a big
chunk of work.


>>: So it's near the beginning of the stack and [indiscernible] page?


>> Angelina Lee: Yes. And it maintains the --. Yeah, well, we can
talk more after, yes. Right.


Okay. So that's the analysis of Cilk-M, and now I am going to tell
you more about how we maintain support for TLMM. So when we started
this work there were two possibilities. We could either make each
worker a process, which means that every worker has its own page
table and by default nothing is shared, because in a traditional
operating system, when you have a process, each process has its own
address space.


And what that means is that you don't have to make changes to the
operating system, but since these concurrency platforms typically
execute applications that are assumed to run on shared memory, now
you have to use mmap. The runtime system has to manually do mmap to
allow you to share the heap, the data, and the code.


And that's not so ideal, because when the user application does
something that calls mmap, the runtime system actually has to
intercept that mmap call and synchronize across the page tables of
all the workers, doing the same mapping in every page table.

And this actually happens more commonly than you may think, because
in, say, [indiscernible] or [indiscernible], when you allocate memory
that's larger than, like, 128 bytes, it may do an mmap underneath
that the user code is not even aware of.
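
[Editor's note: a minimal sketch of what the process-per-worker alternative
entails on stock Linux. This is the approach being argued against here, not
the Cilk-M design: the runtime would have to create shared mappings
explicitly, and replay any later mmap done by user code in every worker.]

    #include <sys/mman.h>
    #include <stddef.h>

    /* Create a region that forked worker processes will share.  Any mmap the
     * user code performs after the fork is private to that process, which is
     * why the runtime would have to intercept and replicate such calls. */
    void *create_shared_region(size_t bytes) {
        return mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                    MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    }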


So we took the second approach, which is to say we just make the
worker a thread, which is the original abstraction. And what that
means is that workers share a single page table and they share the
entire virtual address space. By default everything is shared. And
what we need to do is basically reserve a region to be mapped
independently. By using this strategy the user's calls to mmap
operate correctly.


Okay. So how do we do that?


>>: I am sorry, but why, will you explain just a little bit more why we
need to change the operating system?


>> Angelina Lee: Well, because traditionally, when you have multiple
threads within one process, they share one single page table and the
entire address space is shared. So you cannot have this abstraction
where each thread sees the same virtual address range but maps
different pages.


>>: [indiscernible].


>> Angelina Lee: Yes, but the mapping of what physical page is mapped
at what virtual address is handled by the operating system.


>>: The threads which have changed right?


>> Angelina Lee: Right, the way that a thread sees its virtual
address space. That's handled by the operating system.


Okay. So how do we support that? Well, ideally, this is what we would
like to do: take the region that's mapped independently, which we
call the TLMM region, and the rest is just shared. So we would have
two separate page tables for each thread, where one page table
manages the independent region and the other manages the shared
region. So when you access an address, if it falls in the TLMM region
the thread accesses its own private page table, and if it falls in
the shared region it accesses the shared page table.


But that doesn't quite work, because on some hardware the hardware
walks the page table and each thread must have a single root page
directory. So you can't have those two separate page tables. So what
we did instead is we allocated a unique root page directory for each
thread and we reserved one entry to be mapped independently. However,
if some thread accesses some address in the shared space that causes
the shared region to be initialized and mapped to some new page, then
we synchronize the root page directories across all threads to point
to the same thing.


And so what that means is that you must synchronize the root page
directories. But that does not occur often, because one entry in the
root page directory corresponds to a huge space, and you only need to
perform synchronization, at most one synchronization per root page
directory entry, when it's first initialized. So the synchronization
does not occur often. And once an entry is initialized you just
access it.


Any questions?


>>: So you just, in the previous slide you said, "Oh, you know,
processes don't work because of mmap, and threads don't work because
we have to have just one page table." So what do you do here? Because
now you are saying, "Oh, you can't have multiple page tables per
address space." So are you showing a process here, or are you showing
--?


>> Angelina Lee: Ah, okay, so each thread can only have one root page
directory. So what we did is that we have --.


>>: So each thread is really an OS thread.


>> Angelina Lee: Right. If each thread is an OS thread --.


>>: But there is only one root page.


>> Angelina Lee: There is only one page table, which means there is
only one root page directory. And when I say root page directory it's
--. So this is like one page table, right? That's the root page
directory.


>>: [indiscernible].


>> Angelina Lee: Right, so to enable the TLMM abstraction we
basically --. Each thread still has a single root page directory, but
they are different root page directories across threads.


>>: It's the same as having different page tables for a single
processor, right? So in a sense [indiscernible] which would link one
thread to another thread in the same process you would have to change
the [indiscernible].


>> Angelina Lee: That’s correct you would have to flush your
[indiscernible].


>>: So how do you do that? Like how --. I am wondering, what did you
run this in? Is it running only in Linux, or can I run this in
Windows, or does it run in a modified OS?


>> Angelina Lee: So we did modify the OS. This is only implemented on
Linux. I think you can do the same thing in Windows.


>>: Now I get it. So you did modify the OS.



>> Angelina Lee: Oh, yes.


>>: So you can do multiple threads in one process.


>> Angelina Lee: Yes.


>>: Okay, now I get it. I thought maybe you actually had some --. I
was hoping like after this slide you would say, "Oh, I have a trick
so you can actually do it anyway" or --.


>> Angelina Lee: Oh, yeah sorry, there are no tricks. Sorry to
disappoint.


>>: So actually so, it is a big drawback right?


>> Angelina Lee: To modify the operating system?


>>: Yes.


>> Angelina Lee: Some may argue that, but, yes, a customer might, but
--.


>>: And you kind of say the drawback it’s okay because you get like
amazing speedups right?


>> Angelina Lee: Well, for one it allows you to satisfy all three
criteria simultaneously. And furthermore I think it's actually a very
interesting abstraction for the operating system to provide, because
traditionally, you know, you either have threads that share
everything or you have processes where nothing is shared.


Well, why can't we have something in between, where you have
partially shared and partially private? And I will demonstrate, in my
second case study, that we actually also use this abstraction from
the operating system to implement a different linguistic abstraction.


And other people have studied different linguistic abstractions, so
this seems to be a really useful abstraction for the operating system
to provide.

>>: So I would think that there are other applications of, you know,
having different page protection domains within a single process. Are
you aware of prior work of systems community people trying to do this
for different pages?


>> Angelina Lee: So I am aware of different applications where this
would be useful, but they all did it by basically creating the
workers as processes and then doing the manual mmap. But, right, if
you have to do that to implement an abstraction, then this would be
really useful. Then you don't have to do this whole manual sharing
and, yeah.


Okay. So, right, so this is the TLMM-based cactus stack. But there is
one limitation, which is that it does not work for code that requires
one thread to see another thread's stack. Which is, as, I forgot his
name, sorry, what is your name?


>>: Dan.


>> Angelina Lee: Dan, right. Dan's point is, "Well, if you pass a
reference to a local variable on the stack from one thread to the
other thread it doesn't work." Right.


Okay. So in summary, Cilk-M is a C-based concurrency platform that
satisfies all three criteria simultaneously. To allow that, we
basically use the TLMM-based cactus stack memory abstraction, and we
also, on top of that, use legacy-compatible linkage to allow the
interoperability between parallel and serial code.


Okay. So that's the cactus stack memory abstraction. Next I am just
going to briefly touch on the Memory-Mapped Reducer Hyperobjects,
because I think I am running out of time.


Okay, reducer hyperobjects. What's a reducer hyperobject? It's a
useful linguistic mechanism for avoiding determinacy races in a
dynamically multithreaded computation.


So what do I mean by that? Let's see a piece of code. Say you have a
piece of code where you do some computation and store the result of
the computation into an array of strings. And then later you want to
concatenate the strings together in the same order, from index 0 to
N. If you want to parallelize this code naively, and you just say,
"Let me run all the iterations in parallel," there is a determinacy
race, because now logically parallel sub-computations can access the
same variable and at least one of the accesses is a write. And in
this case every single access is a write.


So it is a determinacy race. And as a result the output string is
non-deterministic. You no longer have the guarantee that the
concatenation is in order from index 0 to N. Okay. And so a reducer
is basically a linguistic mechanism to avoid exactly that.


To use a reducer, you basically declare res, the final output string,
as a reducer instead of just a regular string. And now you no longer
have a determinacy race. And furthermore the output is deterministic:
it's the same output as the sequential execution, provided that the
operator you use is associative.
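
[Editor's note: a rough before/after sketch of the code being described.
Here compute(), n, and the string_concat_reducer type are hypothetical
stand-ins; the real Cilk Plus reducer library spells this differently.]

    #include <cilk/cilk.h>
    #include <string>

    std::string compute(int i);          /* hypothetical per-iteration work */

    // Racy version: logically parallel iterations all write the shared `res`.
    std::string res;
    void concat_racy(int n) {
        cilk_for (int i = 0; i < n; ++i)
            res += compute(i);           /* determinacy race on res */
    }

    // Reducer version: each strand appends to its own local view, and the
    // views are combined so the result matches the serial left-to-right
    // concatenation (string concatenation is associative, which is all the
    // reducer needs).
    string_concat_reducer res2;          /* hypothetical reducer type */
    void concat_reduced(int n) {
        cilk_for (int i = 0; i < n; ++i)
            res2 += compute(i);          /* no race: updates go to a local view */
    }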


Okay. So conceptually, how does it work? Well, during parallel
execution, here I have three workers, and say the loop iterations are
divided across the three workers. Essentially, conceptually, what the
reducer does is create a local view for each worker to accumulate its
updates on. And at the appropriate points these local views get
reduced, or combined, together in a way that produces the same result
as the serial execution.


So the result is deterministic, provided that the operator is
associative. And in this case it is: string concatenation is
associative. And note that it just needs to be associative; it does
not have to be commutative.


Okay. And so, in short, the reducer allows different logical branches
of the parallel execution to maintain coordinated local views of the
same non-local variable.


So with memory-mapped reducers, the high-level idea is this: given
that the [indiscernible] of the reducer is to have local views for
the same non-local variable, coordinate the accesses between them,
and combine them at the appropriate moment, wouldn't it be cool if
you could leverage the virtual address translation provided by the
underlying hardware to provide this mapping from the reducer to
different local views for different workers?


And so that's what we did. And compared to the existing approach,
using the memory-mapped approach we get faster reducer access time.
Every time you access the reducer it needs to be mapped to some local
view, and this access is now much faster and has lower overhead,
which means that the application can potentially scale better. And
finally it's a simpler runtime implementation, which is always good
news for me.


Okay. So that's what we did. And here I am showing you, so we
implemented this in Cilk-M, and this is showing you the overhead
normalized to an L1-cache access. Using the memory-mapped reducer,
each access essentially translates to two memory accesses and a
predictable branch.


So when you measure that against, when you normalize that against one
memory access to the L1 cache, it's about three times overhead. On
the other hand, when you use the existing approach it's actually
about twelve times overhead. And just to show you a point of
comparison, if you use a spin lock to protect the accessed variable
it's over twelve times overhead.


Okay. And to implement this there are actually four questions we need
to address. What operating-system support is required to do the
memory-mapped reducer? How do you support reducers with different
types, sizes, and life spans? How do you support the data structure
that we need to use for reducers that allows you to do constant-time
lookups and efficient sequencing? And finally, how can you allow
efficient access to one worker's local view by another worker?


And I am not going to go into detail on all four questions. The only
thing I am going to focus on is the operating-system support. And the
TLMM mechanism I described is actually very good for that. The idea
is to allocate the local views for each worker in the TLMM region at
the same virtual address.


And so now in Cilk-M's TLMM region, besides the stack, we also
allocate a region for reducers for that purpose. Okay. So that's the
gist of memory-mapped reducers. I will be happy to talk about it more
if you have questions on that.
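
[Editor's note: a tiny sketch of why the lookup is cheap under this design.
The address, type, and helper below are hypothetical, not the Cilk-M layout:
the local view sits at a fixed virtual address inside the TLMM reducer
region, and every worker has its own physical page mapped there, so finding
"my" view is essentially just using that address.]

    #include <stddef.h>

    typedef struct { char *buf; size_t len; } view_t;    /* illustrative view   */
    #define MY_VIEW ((view_t *)0x7f0000000000ULL)        /* hypothetical address */

    void view_append(view_t *v, const char *s, size_t n); /* illustrative helper */

    /* Appending through MY_VIEW touches this worker's local view only, because
     * each worker mapped a different physical page at that same virtual
     * address. */
    void reducer_append(const char *s, size_t n) {
        view_append(MY_VIEW, s, n);
    }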


Okay. So, concluding remarks. Recently researchers have started
exploring ways to enable memory abstractions using page mapping or
page protection mechanisms. Right, so like you asked earlier, these
are all examples of that. So C# with atomic sections, I believe, was
coming out of Microsoft UK. And examples like Grace and Dthreads
basically use the page protection mechanism to enable deterministic
execution. There is also some work on using page mapping and
protection to detect races and avoid deadlock. And obviously adding
to the list are, you know, the TLMM-based cactus stack in Cilk-M and
memory-mapped reducers.


So this is the end of my talk. And if you are going to take away one
thing from this talk, it is this: I hope I convinced you that
properly designed memory abstractions can ease the task of parallel
programming, either by directly mitigating the complexity of
synchronization, as in the example of reducer hyperobjects, or
indirectly by allowing the concurrency platform to utilize resources
better, as in the case of the TLMM-based cactus stack and
memory-mapped reducers.


And finally, the second takeaway: it seems like the TLMM mechanism
may provide a general way for building memory abstractions for
multithreaded programs. Okay. Thank you.


>> Madan Musuvathi: So we have time for a couple of questions.


>>: So in the beginning you noted that this [indiscernible] to the
fact that you have to recompile existing code in order to use this.
Can you explain why you have to do that? It seems like maybe you
could just, you know, use the existing code and it grows its own
stack normally on [indiscernible].


>> Angelina Lee: Sure, yes. So I guess what you can do is, I mean,
the main problem --. So without recompiling the code you need to
enable some way to switch between the heap-based cactus stack and the
linear stack. But that's not really the main problem per se; you can
use a wrapper to enable that. The problem is that then you might run
into a space-bound issue where each worker actually uses more than S1
amount of space. And I can go into an example if you like, but I am
wondering if I should take more questions or if I should spend more
time. And we can chat afterwards.


>>: Okay, one more question. So you said in the reducer, which I like
by the way, that [indiscernible]. And I am a little bit surprised,
because it means that you have to make sure that in all indexes you
go over, somehow sequentially --.


>> Angelina Lee: Yes, so in the runtime system, basically --. So
notice that in the work-stealing scheduler we always steal the
continuation of the parent instead of stealing the child. And so what
that means is that, well, sorry, let me back up a step.


For Cilk, if you remove all the keywords --. A Cilk program actually
has a serial elision, in the sense that when you remove all the Cilk
keywords you are left with a valid C function, and when you execute a
Cilk program on a single worker the execution order is the same as
the serial execution.


And the way the work-stealing scheduler works is that, between
steals, the code that's executing on a worker actually runs in the
same order as it would in the serial execution. And what that means
is that you just need to keep track of the relationship between
parent and child whenever you perform a successful steal, which we
already maintain in the runtime system. And so when you combine, you
know exactly what the relationship between the two local views is,
and you just combine accordingly. So does that answer your question?


>>: No, sorry. Maybe I will take it up later.


>> Angelina Lee: Yeah.


>>: Okay, so to follow up on his question. Let's say there is a loop
of 100 things that is powered by this reducer object and then I want
to steal. Do you partition it by 50 percent, or how would you know
what to partition it by?



>> Angelina Lee: Right. So the way the parallel loop is implemented
in Cilk is that essentially you do divide and conquer on the loop
space. So the first worker will execute the first half, and when you
steal you actually steal the second half. So it's like, it's actually
a binary tree divide and conquer of the loop iteration space.
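
[Editor's note: a minimal sketch of that recursive divide-and-conquer
structure, illustrative only and not the actual cilk_for lowering; the
grain size here is 1 for simplicity.]

    #include <cilk/cilk.h>

    /* Recursively split the iteration range; a thief that steals the
     * continuation of the spawn ends up with the second half of the range. */
    void loop_dc(long lo, long hi, void (*body)(long)) {
        if (hi - lo <= 1) {
            if (hi > lo) body(lo);
            return;
        }
        long mid = lo + (hi - lo) / 2;
        cilk_spawn loop_dc(lo, mid, body);   /* first half                  */
        loop_dc(mid, hi, body);              /* continuation: second half   */
        cilk_sync;
    }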


>> Madan Musuvathi: Any more questions? So actually, Angelina has
some [indiscernible] scheduled today, so if you would like to meet
with her send me an e-mail at madanm@microsoft.com. So let's thank
the speaker.


>> Angelina Lee: Thank you.