>> Sumit Gulwani: It is my great pleasure to welcome Swarat Chaudhuri, who is an assistant professor at Penn State University. Swarat graduated in 2007 from UPenn, and he won the ACM SIGPLAN Doctoral Dissertation Award. He has a wide variety of interests: he works in program verification, parallel programming, logic, and automata theory. And today he's going to tell us about parallel programming.


>> Swarat Chaudhuri: Thank you, Sumit. It's a great pleasure to be here. So today I will tell you about this parallel programming language and programming model that I have been working on for almost a year now.


So the focus of this model, which is called Chorus, is data parallelism. So in the space of data parallelism, the main success stories have been either when the granularity of parallelism is high, coarse-grained -- an example is MapReduce -- or when we are working with dense data structures with highly fine-grained computations.


So an example of that is these numeric computations that are common in scientific computing on dense arrays. And in addition, there are some problem-specific methods. For example, people have worked a lot on parallelizing model simulation and certain other important problems like that.


So here our goal is not to address just one or two specific applications, but a range of data-parallel computations over large unstructured shared-memory graphs. And here the granularity of parallelism is not known in advance. In fact, it's not predictable by static analysis. So in the average case there is a lot of data parallelism in these problems. But in the worst case there is no parallelism at all.


And as we will see, there are lots and lots of applications which fall into this category. And our goal, when we offer this model, will be high-level correctness as well as efficiency. So, for example, we would aspire to achieve race freedom at the language level, as well as deadlock freedom.


Also we would like to express the essence of this kind of parallelism so our compiler and runtime system can take advantage of that.


Now, a classic example of an application of this sort is Delaunay mesh refinement. So here the problem instance is a triangulation of a set of points. You're given a set of points and you're building triangles with these points as nodes.


Now what you want to achieve are these two properties. So the first property is known as the Delaunay property, which is that no point is contained within the circumcircle of a triangle. As you see, that's the case here.


But the second property is what's known as the quality property, which is that all the triangles that you get in this mesh satisfy a certain goodness constraint, which in this case is that there is no triangle with an angle greater than 120 degrees.


So typically you get meshes where this quality constraint may not be satisfied. So what you want to do is you want to refine the mesh: you want to add new points to it and retriangulate it so that there are no bad triangles left in the mesh. So this is the problem of Delaunay mesh refinement.


Now, let's look at this a little bit more closely. So the one classic algorithm for this problem works as follows: you collect, in a data structure known as a cavity, all triangles whose circumcircle contains a new point -- you've added a new point to refine the mesh. Now what you do is you collect all these triangles, and as you're seeing here in this second picture, I have colored them in a different color, and then you retriangulate them, like I've done in the picture on the right.


So actually, it so happens that even after you do this, some new triangles may again violate this quality constraint, so you have to continue. But there is a guarantee that this algorithm will terminate. So that requires some reasoning.


So let's look at this algorithm in pseudocode. So I have a mesh, and then I have a worklist in which I'm going to collect all these bad triangles. And then I'm going to pick a triangle from this worklist. I'm going to create a cavity. I'm going to expand the cavity. Then I'm going to retriangulate it.


And, finally, I'm going to update the mesh, and I'm going to continue until there's nothing left in the worklist.
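
A rough sketch of this sequential loop in plain Java follows; Mesh, Triangle, and Cavity are hypothetical helper types standing in for the benchmark's data structures, not code from the talk.

    // Sequential Delaunay refinement, as narrated above.
    // Mesh, Triangle, and Cavity are assumed helper types.
    void refine(Mesh mesh) {
        java.util.Deque<Triangle> worklist =
            new java.util.ArrayDeque<>(mesh.badTriangles());
        while (!worklist.isEmpty()) {
            Triangle t = worklist.poll();
            if (!mesh.contains(t)) continue;      // t may already have been removed
            Cavity c = new Cavity(t);
            c.expand();                           // collect all affected triangles
            c.retriangulate();                    // replace them with fresh triangles
            mesh.update(c);
            worklist.addAll(c.newBadTriangles()); // new triangles may be bad too
        }
    }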


So observation number one -- so these cavities we have here, these are contiguous regions within the mesh. But the second observation, which is not obvious, is that in the worst case the cavities can encompass the whole mesh; you can create such an instance.


Now, what do we do if we want to parallelize this algorithm? So the first observation is that here we are working on a complex unstructured graph, which is a mesh. And we do not really have properties that we can necessarily use for parallelism. For example, the size of a cavity, as we have noted, can encompass the entire mesh.


So what we will do here is that we are going to view this problem more abstractly, as a computation over a graph. And in this case we can view the nodes in this graph as triangles, and the edges as representing adjacency between triangles.


And so by this definition, a cavity corresponds to a contiguous neighborhood region in this graph, and we are going to use this observation later. But let's continue.


So now, what's the atomicity property we want? We want cavities to be retriangulated atomically. And non-overlapping cavities may be processed in parallel. However, this question of what can be done in parallel seems impossible to handle with static analysis. The reason is that the shape of the data structure changes greatly over time. Moreover, it's extremely input-dependent; it's easy to create an instance where you get no parallelism at all. There's been a lot of work on this of late by [inaudible] and his group at UT Austin.


So just to note that this is not just one isolated application; there are lots and lots of similar problems. So in the space of meshing, there is mesh refinement, there's triangulation; problems in clustering, ray tracing, maintenance of social networks; graph problems such as spanning tree and flow computation; [inaudible] simulation; sparse problems in scientific computing, such as sparse matrix-vector multiplication; and even some program analysis problems. So I added this for effect, because I primarily work on program analysis -- iterative dataflow analysis and pointer analysis. And these represent a class of applications which is very large, but where these traditional static techniques of extracting parallelism don't really work. So what can we do about it?


So what I'm going to do here is that first I'm going to present our programming model more abstractly and show how it can be used to encode that application I showed, Delaunay mesh refinement. Then I'll present a more concrete language, JChorus, an embedding of the model on top of Java, and I'll talk about the implementation. And finally I'm going to talk about some ongoing work on embedding this model into Habanero Java, which is an offshoot of IBM's X10 language.


>>: With these kinds of applications, [inaudible] permutations, how are they performed?


>> Swarat Chaudhuri: So as you will see, our implementation has some stuff related to [inaudible], so we are using some of the ideas there. But I do not think that that alone will suffice. We'll come to that. Maybe at the end we can discuss this in more detail.


So one of the interesting observations here is that in all these applications there is a common property of locality, which is that even though you are doing computations that in the worst case can require global access to a gigantic data structure, in the typical scenario that is not at all so. So, for example, we ran some experiments on a mesh of about 100,000 triangles from these benchmarks from [inaudible], and we found that the average size of a cavity is actually only 3.75 triangles, and the maximum cavity size was 12 triangles. So in that case there is a lot of parallelism in this problem, and the essence of that is perhaps this average-case locality.


But, again, there is no way to know ahead of time what the extent of locality is for a particular application. So the main idea of Chorus is that since we are working on these local regions in the data structure, why not let these regions be the drivers of the parallel computation, just as objects are the drivers of computation in object-oriented programming, or concurrent objects are the drivers of computation in the actor model. We are going to make this more clear soon.


So let us first phrase the problem a little bit more abstractly. So I'm going to view these problems from now on as problems of parallel computation on graphs. So here I have a giant graph, and this represents the heap.


And the objects here correspond to the nodes, the edges here correspond to the pointers, and I'm going to refer to an induced subgraph of this heap as a region.


And so an assembly, in our terminology, is a region of this heap that is equipped with a thread of control. So the typical situation would be that these assemblies would be short-lived and speculative. And we are going to see what this means in the context of some applications.


So one point I would like to make here is that this might remind some of you of the actor model of computation. As we'll see, there are some differences. But for one, here we have this property that all the regions belonging to the various assemblies in the heap at any one time form a partition of the heap, and they're all isolated. Objects in them are completely isolated.

And second is that there is no a priori guarantee on how big a region might be. It might consist of one object. It might consist of the entire heap.


Okay. So in a situation like ours we need to talk about not just static data partitions but partitions that actually change over time, because of the very dynamic, adaptive nature of these applications.


So our primitive for synchronization -- in fact, this is the only primitive for synchronization here -- is merging of assemblies. Here we have assembly I and assembly J, and there's an edge from I to J. And now I can merge with J, and the result is the bigger assembly we see at the bottom.


This assembly J, which previously had ownership of that region, now dies. So in order to rule out races, we require that J must be in a so-called ready state; that is, it should not be in the middle of doing some sort of imperative update when this merge happens. I'll clarify what that means.


So merging is the way in which we coarsen the granularity of the parallelism in the heap. How do we refine the granularity of parallelism? Well, that's splitting. What we have here is assembly I. Now I has split into these assemblies I1 through I6. And then there's this tau.


So let me actually get to that in a little bit. For now, let's just get this picture that assembly I has split into these constituent assemblies. Now, one of the things to note here is that this J is completely oblivious to what has gone on inside I here. So I can split; J doesn't have to know. In other words, the splitting operation is not actually a synchronization construct.

Whereas, in the previous case, for merging, there is actually synchronization involved, because we are saying that J must be in a ready state while the merge happens.


Okay. And finally, an assembly, having an imperative computation associated with it, can actually modify its heap region as well as its local variables. The restriction, though, is that objects inside an assembly are isolated, which means that you're not allowed to follow a pointer out of the assembly, read some data, and do some modification of it. That's not permitted. Everything else is permitted.


So okay, that sounds nice. But what do programs here look like? So what we do here is that we just generalize -- we just seek inspiration from object-oriented programming, and we define the notion of an assembly class. An assembly class consists of a set of local variables; a set of guarded updates, so the computations here are in the guarded-command style; and a constructor, which creates an instance of an assembly from an assembly class. And then there are some things called public variables, which I'm going to explain soon.


And a program is simply a set of classes. And each assembly we are going to view as this sort of very simple state machine. So each assembly can be in a busy state, which means that it's now executing an update. So remember that we have these guarded updates, so the pattern is: evaluate the guard, execute the update, and then come back and do this again and again, right?


So busy is the state when an assembly is executing an update. And this ready state, which is where the assembly starts -- this is the default state, and it goes back to it after finishing an update -- is a state where an assembly can actually be preempted by another assembly.
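
A minimal sketch of this lifecycle in plain Java -- the names below are illustrative assumptions, not the actual JChorus runtime; the terminated state is the one explained next.

    // Assembly lifecycle: READY is the default state, and the only state in
    // which a neighboring assembly may preempt (merge away) this one.
    enum AssemblyState { READY, BUSY, TERMINATED }

    abstract class AssemblySketch {
        volatile AssemblyState state = AssemblyState.READY;

        abstract boolean guard();   // evaluated atomically
        abstract void update();     // may touch only objects this assembly owns

        void step() {
            if (state == AssemblyState.READY && guard()) {
                state = AssemblyState.BUSY;   // now protected from preemption
                update();
                state = AssemblyState.READY;  // back to the default state
            }
        }
    }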


So an assembly can now be merged by a neighboring assembly and die as a result. When an assembly dies, it just goes to this terminated state. So let's now look at the structure of computations here a little bit more closely. So, first of all, these guarded updates: there's a guard and an update.


And the rule is that the guard is executed atomically. After that, the update is only going to refer to the objects that are owned by the assembly. Therefore, there is no further need for trying to acquire locks on individual objects. So that's the idea: the guard is executed atomically, and the update refers to objects owned by the assembly. So now, the construct for merge actually happens within guards. So this is a slight difference from standard guarded-command languages, that our guards can actually have these effects of merging two assemblies. However, they cannot modify the heap. So they cannot have effects that are imperative; in that sense they cannot touch the heap. However, they can change the structure of concurrency in the heap.


So here at the top I have this guarded command: merge u.f, followed by S. So this says: try to do this merge. So when I executes it, it tries to do this merge along u.f. So u is that object at the top, and u.f is the reference which points outside of I. And it says: try to merge with this. If this merge goes through, you know, in that case the guard evaluates to true, and in that case you execute the command S on the new region that you acquire as a result.
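
To make the merge-guard semantics concrete, here is a toy model in plain Java. The tryMerge method is an assumption about what the runtime does, not real JChorus API: the merge succeeds only if the target is in the ready state, and only then does the body S run.

    import java.util.concurrent.atomic.AtomicReference;

    // Toy model of the guard "merge u.f : S". The target assembly dies iff it
    // was READY; on success the caller runs S on the enlarged region.
    class MergeGuardSketch {
        enum State { READY, BUSY, DEAD }

        static class Assembly {
            final AtomicReference<State> state = new AtomicReference<>(State.READY);
        }

        // The atomic test-and-set models the synchronization in the guard.
        static boolean tryMerge(Assembly i, Assembly j) {
            return j.state.compareAndSet(State.READY, State.DEAD);
        }

        public static void main(String[] args) {
            Assembly i = new Assembly(), j = new Assembly();
            if (tryMerge(i, j)) {
                System.out.println("guard true: run S on the region taken from j");
            }
        }
    }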


>>: For clarification, u and f, are those -- I think of those as two arguments, passing both the u and the u.f?


>> Swarat Chaudhuri: You should see u.f as in Java: it's a reference.


>>: But in that case how do you know what you're merging to? What's the capital S?


>> Swarat Chaudhuri: S is the statement that you're going to execute after that.


>>: You're saying that the parameter to merge, there's just one parameter.


>> Swarat Chaudhuri: Right.


>>: It's going to be whatever the [inaudible] points to, but how do you know what you're going to join that to?


>> Swarat Chaudhuri: You have no control over that, aside from one thing that I've not shown here: you can specify the type of assembly you're going to join with. But aside from that, there is no control over who you will merge with.


>>: In particular, it might not merge with the assembly that contains U.


>> Swarat Chaudhuri: No, u is what I contains. u is within I; I is trying to execute this. So u.f is the reference that points outside of I.


>>: So I refers to the whole assembly.


>> Swarat Chaudhuri: Yeah, yeah. Sorry. It may look like it's just that one object. No, it's the whole assembly. And then there's a slight refinement of that: you can also guard this with an additional guard G, which is a constraint that's evaluated on this assembly I. And when this property G is true, at that point you try to execute this merge; if it goes through, then you execute S. And for guards, that's really all we have. So we do not really need anything further. And the split -- as for splits, what we have is that it's just a command that's executed within that S thing that I showed; there is a guard and there is an update, and the split is executed within that update. When an assembly splits, it produces something like this, and the syntax is like this: you split over tau, where tau is the class of the assemblies being produced as a result of the split. The P1 through PN are the parameters that you pass to the constructors of these new assemblies, and that's the way you pass local state from the parent assembly to the children assemblies.
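
A toy model of split, again in plain Java with assumed names: the parent hands each of its objects to a fresh child assembly, passing a constructor parameter along, and no other assembly participates.

    import java.util.ArrayList;
    import java.util.List;

    // Toy model of "split tau(p1, ..., pn)": one child assembly of class tau
    // per owned object; the parameter is how local state reaches the children.
    class SplitSketch {
        static class Child {
            final Object ownedObject;
            final int parameter;               // stands in for p1..pn
            Child(Object ownedObject, int parameter) {
                this.ownedObject = ownedObject;
                this.parameter = parameter;
            }
        }

        // Purely local: unlike merge, a split involves nobody else.
        static List<Child> split(List<Object> owned, int parameter) {
            List<Child> children = new ArrayList<>();
            for (Object o : owned) children.add(new Child(o, parameter));
            return children;                   // the parent dies afterwards
        }
    }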


>>: Can you imagine a split that fails --



>> Swarat Chaudhuri: A split cannot fail, because it doesn't depend on anybody else, right? It's a completely local operation. Whereas merge is a synchronization construct, and as a result it might potentially fail. And finally, local updates are as I said before. So this is just like in Java, except, again, if you try to access something that's outside your assembly, then you are going to get an exception.


And there are some refinements of that. So one version of merge, for example, is that you can create a new assembly of a new type when you come out of the merge, because remember that previously, when I merged with J, the result was just I. So now you can also create a new thing in the process.


And likewise there is a refinement of split, which is that earlier we had to split into all these individual component assemblies that contained only one object each. However, you can also basically release just one of the objects in the assembly and create one new assembly out of that. And that would be a split-one.


But the most interesting constructs are really the ones I showed earlier. Okay. So enough about this sort of abstract discussion of the language. Let's look at an application. So here, let's say we try to implement this Delaunay mesh refinement application in this language. So how will we do this? So we will use two kinds of assembly classes. One will be these triangles, and another will be these cavities. And what does each triangle do? Each triangle determines if it's bad or not. If it is bad, then it merges with a neighboring triangle to form a cavity. Okay. What about a cavity? So a cavity can determine whether it has all the triangles needed to finish the retriangulation. Right? And that's a local operation.


So if it finds that it needs more, then it tries to merge with a neighbor. If it finds that it has enough, then it's going to just retriangulate locally and then split back into these triangles, because now you're sort of back in the default state -- all these triangles are back -- and this process is going to continue until there is no further activity in this solution, if you will: a solution of triangles which are just merging and splitting with each other.


So the code then looks like this: I have one assembly class called triangle, and the action within it is: merge with v.f -- let's say v.f corresponds to a neighbor -- and become a cavity, when this property is true, meaning this triangle is bad. That's the only thing that a triangle does. Otherwise it just stays inactive.


What about a cavity? The cavity determines if it's not complete; if it's not complete, then it's going to merge with -- again, let's say this points to an arbitrary neighbor -- and it's going to form a bigger cavity.


And if it's complete, then it's going to retriangulate and then it's going to split into the individual triangles that form it.
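
Putting the two classes together, a plain-Java paraphrase of the logic on the slide might look like this; tryMergeWithNeighbor, retriangulate, and splitIntoTriangles model the Chorus primitives rather than quoting actual JChorus code.

    // A bad triangle grows into a cavity; a cavity grows until complete,
    // then retriangulates locally and splits back into triangles.
    class DelaunaySketch {
        interface Assembly {
            boolean tryMergeWithNeighbor();   // the merge guard
            void retriangulate();             // local imperative update
            void splitIntoTriangles();        // the split primitive
        }

        static void triangleAction(Assembly triangle, boolean isBad) {
            if (isBad) triangle.tryMergeWithNeighbor();  // become a cavity
            // otherwise stay inactive
        }

        static void cavityAction(Assembly cavity, boolean isComplete) {
            if (!isComplete) {
                cavity.tryMergeWithNeighbor();  // grow the cavity
            } else {
                cavity.retriangulate();
                cavity.splitIntoTriangles();    // dissolve back into triangles
            }
        }
    }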


So what happens then if there is a conflict? Suppose you're trying to grow a cavity. You're trying to merge with more and more triangles, and in parallel there's another cavity being formed, and these two things collide. What happens then? Remember what we said: when one cavity, when one assembly, merges with another, the assembly that is on the receiving end of the merge dies as a result.


So what that means here is that there is one cavity that's being formed and another cavity that's being formed, and then one cavity gets absorbed by the other. So what this will lead to is that all the work that one of the cavities did up to this point will have to go. Right? In a sense, the computation that it was trying to achieve is rolled back.


However, there are some subtleties here. So in some sense the work that you set out to do will not be finished; you're back at the original state. However, at the end of all this, the cavity that is the killer, if you will -- the cavity that does the absorption -- at some point will have done its job, and then it's going to split again.


Then the original thing can start all over again from scratch. Okay. Are there questions? Yes?


>>: Every time you merge one cavity [inaudible] -- are there cases when you merge two cavities?


>> Swarat Chaudhuri: So I'm going to address the [inaudible] -- no, you cannot merge with two cavities at one time. You're going to merge with one assembly that is your neighbor. But when you do the merge, then the cavity that's on the receiving end of it has to go. And as a result, all the work that it has done until this point is basically rendered worthless, because it's not able to finish what it started.


However, the point is that eventually this cavity that killed the other cavity will have finished its work. And then it's going to split off again. And then, presumably, if that work still needs to be done, it will again start from the beginning, and it's going to reach its final point if it's not interrupted by anybody else, and then it's going to do the retriangulation and split back, and the process will go on.


So let's now look at some of the other approaches. So I'm going to give a few other applications that we can encode in this style, but first let's look at some other competing approaches. So the first question is, of course: what about threads and explicit locking? So here perhaps the main objection is that the heap in shared-memory languages is a completely global entity. And furthermore, you have arbitrary aliasing, and as a result, if you are trying to ensure that there are not multiple cavities trying to work on the same object at the same time, that's very hard to do. And it's also low level and error prone.


What about software transactions? So here what happens is that this burden of reasoning is really passed on to the transaction manager. And in most implementations of software transactions, the conflicts between two activities that are happening in parallel but happen to conflict on some data item are detected by monitoring reads and writes to the memory.


And as a result, this either leads to you doing some sort of conservative analysis and getting conservative results, or it is very expensive, because dynamically you have to really search the entire memory and try to see if there is a conflict or not. So what about static data partitioning, as there is, for example, in languages like X10? But here the problem is the unpredictable nature of the partition: you may have started with a data partition that's very nice, but after a while, after all these changes, it may quickly become bad. So you need some sort of an adaptive mechanism, and really what our constructs for merges and splits offer is a way to implement this adaptive kind of isolation using a few simple constructs.


So finally, what about actors? So the actor model is perhaps the most well-known and the oldest type of data-oriented parallel programming. So if you are going to model this problem of mesh refinement using actors, what you would do is give each actor a set of cavities. Right? And then what would expansion mean? It would mean that one actor passes around some triangles to another actor. Now there are two ways of doing this. One is that you would copy the triangles and send them to the other actor. But this presumably would be too expensive. But then the other option is to just pass around references; on the other hand, this introduces problems like aliasing and races. Note, on the other hand, that in our model there is no aliasing of objects. As a result of merges and splits, objects get transferred from one entity to another.


An entity does not have aliases to objects that are owned by other entities. So that's an important point. So finally, if I'm talking about irregular applications, I have to talk about Galois, which is [inaudible]'s system for irregular data-parallel applications.


So in this system, what they do is annotate data structures with information about commutativity and associativity, and they have iterators not just over lists but over sets and posets. And it's a different style of programming; we see ours as an alternative that's trying to address the same problem.


Okay. So let's now look at a few other applications. So another application that has been looked at a lot in this setting of parallelism is Boruvka's algorithm for minimum spanning tree. So the parallel version of this is as follows: you have this graph. In the beginning you start with all these small spanning trees, which correspond to just one or a few nodes each. And then these small spanning trees merge to form bigger and bigger spanning trees, until finally you get the spanning tree that covers the entire graph. So it's pretty obvious how to model this in our language. What you do is simply model these spanning trees as assemblies, and merging is just a direct construct in this system.
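
A sketch of the Boruvka step in this style, in plain Java: each assembly owns a partial spanning tree and repeatedly picks its minimum-weight outgoing edge; the actual fusion of the two trees would be a Chorus merge along that edge. Edge and TreeAssembly are illustrative stand-ins.

    import java.util.List;

    class BoruvkaSketch {
        static class Edge {
            final int weight;
            Edge(int weight) { this.weight = weight; }
        }

        static class TreeAssembly {
            final List<Edge> boundaryEdges;   // edges leaving this partial tree
            TreeAssembly(List<Edge> boundaryEdges) { this.boundaryEdges = boundaryEdges; }

            // Guard condition: the lightest outgoing edge; merging along it
            // fuses this partial spanning tree with the one on the far side.
            Edge lightestOutgoingEdge() {
                Edge best = null;
                for (Edge e : boundaryEdges)
                    if (best == null || e.weight < best.weight) best = e;
                return best;
            }
        }
    }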


Another application, from the [inaudible] benchmarks, is that of focused communities. So the goal is: you have a giant social network where there are updates being made continuously. And what you want to do is maintain communities in this network, where a community is defined as a subgraph such that the measure of closeness within the subgraph is bigger than the measure of closeness between any node in the subgraph and anything outside.


So basically, you're in a community if you're more tightly connected with people in the community than with people outside the community. So, of course, this graph is being updated constantly, and as a result the structure of these communities changes all the time.


And now, a sequential algorithm for this is as follows. Suppose you want to determine the community of one person. So you do this sort of greedy fixed-point computation: you keep on adding nodes if you find that the addition of a node increases some objective function, and otherwise you also shed some nodes.


So again, it's pretty obvious how to do this in our setting. You use an assembly to represent such a community, and then the shedding and growing -- the shrinking and growing -- of communities is captured using the merge and split primitives.
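
The greedy fixed point maps onto merge and split in roughly this shape; the objective methods below are placeholders for the closeness measure, and grow and shed stand in for the merge and split-one primitives.

    // One step of the greedy fixed-point computation for a community:
    // grow (merge) while the objective improves, shed a member when that
    // improves it instead, and otherwise sit at a local fixed point.
    interface CommunitySketch {
        double objective();           // closeness measure of the current subgraph
        double objectiveIfGrown();    // after absorbing a candidate neighbor
        double objectiveIfShed();     // after releasing a candidate member
        void grow();                  // models a merge
        void shed();                  // models releasing one object (split-one)

        default void step() {
            if (objectiveIfGrown() > objective()) grow();
            else if (objectiveIfShed() > objective()) shed();
        }
    }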


So a few notes. So, first of all, note that in the worst case there's no parallelism here at all: the entire heap merges into one neighborhood. Second, note that the merges and the splits here are unordered. So, for example, there is no real ordering between a merge that happens between, say, this and some assembly here, and the split of that other assembly. So it's a completely truly concurrent semantics.


So what about data races? So there are no data races, because each assembly only updates, only modifies, locally isolated objects, and merges happen only when the recipient of the merge is not in the middle of an update. And of course the definition of data-race freedom is that multiple imperative computations do not attack the same object at the same time, whether for reads or writes.


So what about deadlocks? So imagine a situation like this, where there are two assemblies, and this assembly is trying to merge with that assembly, and that assembly is trying to merge with this one. So in other words, I is waiting for J and J is waiting for I. So this is perhaps the closest to what we can define as a deadlock in a situation like this.


But note that merges here are unordered. So, for example, if you have a situation like this, then the runtime system can really execute one of the two merges, and when that happens, the other assembly is dead. There is no further requesting, and therefore there is no further waiting. Progress happens because a bigger assembly is formed, and that goes on to do whatever it needs to do.


So the key to this is the preemptive nature of assemblies: when an assembly is in the ready state, it can be killed by anybody. There is no protection whatsoever.


And furthermore, at the end of every update you have to come back to this top-level ready state. So, assuming updates terminate, you are going to be able to resolve such a situation eventually.


>>: So when J wants to merge, it says to the system: I want to merge with the assembly that owns j.next, the object j.next.


>> Swarat Chaudhuri: Absolutely, yes.


>>: So in the process, if I is splitting, it could actually merge with some sub-assembly, some child of I?


>> Swarat Chaudhuri: Absolutely.


>>: J cannot name I specifically.


>> Swarat Chaudhuri: Absolutely. That's key to the race freedom and deadlock freedom, yes. Of course, I said all this, but this assumes the existence of some kind of omniscient runtime system which is going to look at all of these things, all of these requests, and just order them and resolve them somehow.


But if you are going to write an implementation of this system on, say, a multi-core machine, then there's not going to be just one scheduler that is centralized somewhere. So we'll have to talk about a more distributed runtime system, and I'm going to talk about that in a little bit. So then we'll have to reason about deadlock at a slightly lower level, and we will do so.


Okay. But before I go there, let me just present the language for which that runtime system has been built. So this language is known as JChorus. This is simply Chorus on top of sequential Java. So in fact, these abstractions of assembly classes can very naturally be integrated with an object-oriented language, because we already have notions of object classes. So what we do is that, in addition to those, we also now have these classes of assemblies.


And we have these method calls from within the assembly class bodies. We saw those guarded commands, right? So inside the updates we can now make method calls, and these calls can be to any of the object classes that are defined in the sequential part of the program.


And there are a few extra features that I did not show in the previous core version of the model. So we now have objects which are read-only or mutable, and these are specified as part of the types. And, for example, a read-only data item can be shared ubiquitously by assemblies all over the system. And mutable objects are like what I explained earlier.
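
The read-only qualifier can be imitated in plain Java with a wrapper like the one below; the real JChorus feature is a type qualifier, and this sketch only illustrates the sharing rule.

    // A read-only item may be shared ubiquitously across assemblies, because
    // it exposes no mutating access; a mutable object, by contrast, must be
    // owned by exactly one assembly at a time.
    final class ReadOnly<T> {
        private final T value;
        ReadOnly(T value) { this.value = value; }
        T get() { return value; }
    }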


Okay. So in that case, the code for mesh refinement would look something like this. I'm not going to give you the details. But let's just look at this little piece here. So I have the declaration of the cavity assembly, and inside that I have this declaration of actions, and inside the actions I have calls to these methods, which are just Java methods.


And similarly we have these other assemblies. And in fact, we found that when we looked at this mesh refinement application and tried to write it in Chorus, most of the code was actually just sequential Java code. We only needed to add about 15 lines of code which corresponded to Chorus. So we really could just take the sequential version of the program and then add these 15 lines of code on top of that, and we got a Chorus version of the program, which I thought was quite nice.


So this is the language aspect of the problem. What about the systems aspect of the problem? That's what I'm going to discuss now. So one thing -- one problem that you hit as soon as you try to implement something like this is that the parallelism that you express in the language is potentially huge.


So on any realistic system, which perhaps has eight cores or 16 cores, you would have to have some mechanism for mapping these hundreds of thousands of possible assemblies to the cores in the system.


So how to manage this mapping is a fundamental question. And the questions that arise here are: what are the right mapping strategies, and what are the right scheduling strategies -- because a core now contains maybe tens of thousands of these assemblies, and you have to execute all of them in some order.


So that raises the question: how do we schedule things? How do we balance loads, right? So this graph that we have here is a completely unstructured thing, so it's conceivable that after a while some of the cores will be doing all the work. So we need some sort of load-balancing strategy. And finally, what are the right data structures for maintaining the relationships, say, between the objects and their assemblies? How do we determine which assembly an object belongs to?


So all of these questions need to be addressed, and this also needs to be hidden from the programmer and shoved into the body of the runtime system. So designing this runtime system is quite a challenging task.


So the main lower-level abstraction that we have is known as a division. So a division is a set of assemblies that's mapped to a core. So here I see these two divisions. So this division contains these assemblies, and likewise. So now, note that some of the merges are now actually local within divisions, because a division contains many assemblies, right?


So if one assembly within a division wants to merge with an assembly within the same division, that's like a local operation. It's not really parallelism. However, there are these remote merges, which is: what if this assembly wants to merge with that assembly? So in that case, there needs to be an actual request sent from one division to another, and that would require some real synchronization.


So this is the high-level structure of the problem. Now, perhaps what we want to do is minimize the amount of synchronization that happens across divisions. We want to make most of these merges local. So how do we do that?


Okay. So the main strategies here are these. First, the divisions are adaptive, which is our heuristic for reducing the number of remote merges. So the idea here is as follows: when, say, this assembly tries to merge with this other assembly, you are going to migrate that assembly to this division. Right? And along with that, you're also going to migrate some of the nearby assemblies to that division.


So the result of that is that, since there is this locality property in these applications, typically if you want to merge with an assembly, you are perhaps also going to want to merge with some nearby assemblies, if at all.


So that's why this heuristic of migrating some chunk, some local collection, of assemblies from one division to another actually turns out to help quite a bit. The second thing is the heuristic for load balancing, which is done by a simple modification of this migration strategy, which is basically: when you are doing the migration, you can also do some computation and make sure that the migration is happening from the division which has more assemblies to the division which has fewer assemblies. So that way you're just making sure that assemblies are always migrated from the core with more assemblies to the one with fewer assemblies.
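
The two heuristics together might look like the following sketch in plain Java; Division and Assembly are illustrative stand-ins for the runtime's bookkeeping, not its actual classes.

    import java.util.List;
    import java.util.Set;

    // On a remote merge, migrate the target assembly plus some neighbors,
    // and only in the direction from the heavier division to the lighter one.
    class MigrationSketch {
        static class Assembly { List<Assembly> neighbors; }

        static class Division {
            Set<Assembly> assemblies;
            int load() { return assemblies.size(); }
        }

        static void migrateChunk(Division from, Division to, Assembly seed) {
            if (from.load() <= to.load()) return;  // keep migration flowing downhill
            if (from.assemblies.remove(seed)) to.assemblies.add(seed);
            for (Assembly n : seed.neighbors) {    // locality heuristic: bring
                if (from.assemblies.remove(n)) {   // nearby assemblies along
                    to.assemblies.add(n);
                }
            }
        }
    }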


>>: So how does that handle the case when you have an assembly that has to merge through the whole space?


>> Swarat Chaudhuri: That would require a series of merges, right? So that gives --



>>: At one point where it's merging from a place that's smaller than --



>> Swarat Chaudhuri: Right. So of course these are just heuristics; you cannot just apply them completely blindly. You have to also put in other heuristics.


>>: A decision --



>> Swarat Chaudhuri: So in all these applications that we considered, it's not really that -- with the exception of the spanning tree algorithm, it doesn't really happen that you need to merge with everything to form one big heap. So if it turns out that you are forming bigger and bigger assemblies, and just one division is going to have lots of stuff, and this cannot be avoided, then what can you do? You'll just have to deal with that.


And the third heuristic is to use the union-find structure. The reason for that is that these assemblies here are disjoint sets, right? So in order to maintain the relationship between an object and the assembly it belongs to, we can use a union-find data structure. So basically, when there is a merge, what we do is use the union part of union-find. And when we want to find out what assembly an object belongs to, we use the find part of it.
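
The union-find structure itself is standard; a minimal Java version with path compression, using integer object ids, looks like this.

    // "union" implements a merge of two assemblies; "find" answers which
    // assembly an object currently belongs to.
    class UnionFind {
        private final int[] parent;

        UnionFind(int numObjects) {
            parent = new int[numObjects];
            for (int i = 0; i < numObjects; i++) parent[i] = i;
        }

        int find(int x) {
            while (parent[x] != x) {
                parent[x] = parent[parent[x]];  // path compression
                x = parent[x];
            }
            return x;
        }

        void union(int a, int b) {              // called when assemblies merge
            parent[find(a)] = find(b);
        }
    }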


This also seems to give quite a bit of benefit. And finally, we have this token-passing strategy to prevent deadlocks in this system. So as I mentioned earlier, there is no runtime system that's completely centralized, right, that's just going to resolve all these different merge requests that are being made simultaneously.


So what we do is we have this token-passing strategy, which is used in many other distributed programs as well, to handle the remote merges. So the local merges -- these you can do at any time, at will. But as for remote merges, you need access to the global token to do that. Therefore, it introduces serialization, and that prevents deadlocks.
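
A sketch of the token-passing scheme in plain Java: remote merge requests are batched per division and drained only while the division holds the unique token, which serializes cross-division merges. The types and method names are assumptions, not the runtime's actual API.

    import java.util.ArrayDeque;
    import java.util.Queue;

    class DivisionSketch {
        static class RemoteMergeRequest { /* source and target assemblies */ }

        private final Queue<RemoteMergeRequest> pending = new ArrayDeque<>();

        void requestRemoteMerge(RemoteMergeRequest r) {
            pending.add(r);   // batch it; the division keeps running meanwhile
        }

        // Called when the unique global token reaches this division.
        void onTokenArrival() {
            while (!pending.isEmpty()) {
                RemoteMergeRequest r = pending.poll();
                // Perform r here. If the target assembly died in the meantime
                // (someone merged it first), the request is simply voided.
            }
            // then pass the token on to the next division
        }
    }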


There's a little bit more subtlety to the deadlock freedom; I will skip that, but I can answer it if there are questions.


>>: [inaudible].


>> Swarat Chaudhuri: Right.


>>: [inaudible].


>> Swarat Chaudhuri: So what happens is that you batch all the remote requests. So, for example, you have a division, right, which is executing all its assemblies following some scheduling policy.


Now you see that there's a remote merge that's been requested by some assembly. So what you do is you batch all these requests into a queue. When the token comes to you, you try to execute all of these things at one time.


And this means, yes, that at any one time only one division can do these sorts of remote merges. But since you're batching, and since there are typically so many assemblies per core, it seems to work out reasonably.


>>: Wouldn't it be more efficient to have [inaudible]?


>> Swarat Chaudhuri: A what?


>>: [inaudible] a single scheduler?


>> Swarat Chaudhuri: I do not know what that --



>>: Having single scheduler.


>> Swarat Chaudhuri: But you need to distribute that, right? How are we going to have --



>>: It's not distributed.


>> Swarat Chaudhuri: Sure, sure. But if you have these multiple cores, where will this master scheduler run, and how will you find out --



>>: With a global view of the whole thing, maybe the load balancing would be much more efficient.


>> Swarat Chaudhuri: Sure, but when would the scheduler run?


>>: On the core.


>> Swarat Chaudhuri: That means in order to do the scheduling you have to communicate from that core to the core which possesses the assemblies. So you will need to basically communicate for every assembly that's going to be executed. And that will require a lot of synchronization.


>>: When you talked about batching with this token -- you have two assemblies that are trying to merge to the same point, and they're batched up, these are remote merges, they're batched up. Now I want to go with I and he wants to go with I. How do you prevent a deadly-embrace type --



>> Swarat Chaudhuri: When the token comes to one of the divisions, it's going to try to execute this thing, right? Once that merge is executed, then the other request is voided, because that assembly is dead. It's just not there any longer.


>>: So the other one is --



>> Swarat Chaudhuri: Yeah, it's just dead. So this is the structure of the system. So you take a program, and then you pass it to the compiler front end and create a Java translation, and finally we have this runtime system which is layered over the Java virtual machine. The code for this is available. Just send me an e-mail. This Web page should be up in a few days -- it's not up now -- but if you want to look at the code, send me an e-mail and I'll send it to you.


So let's look at some experiments. We encoded the Delaunay refinement problem from the Lonestar benchmarks in this language. And here the dataset had about 100,000 triangles, almost all of them initially bad. There were 1 to 8 threads.


So this is what the mesh looked like after a lot of retriangulation. Initially we started with a partitioning of the mesh which was fairly nice -- it's this grid, right. And we note that even after many thousands of retriangulations, the mesh that results still kind of looks like a grid, which means that there is a lot of locality in this computation. It's not as if the structure completely dissolves after all these operations.


And so that is the property we tried to exploit. So now, as for our numbers: we have a lot of overhead, actually, because of the fact that we are running on top of the Java virtual machine and we are making expensive method calls for pretty much every single thing. Even when we try to check if an assembly contains a certain object or not, we have to make a method call. And it turned out that these method calls also eat up a lot of time.


However, I must say that this is DSTM2, which is Maurice Herlihy's software transactional memory library. And we are only beating the sequential version at eight cores, but our overheads are way better than theirs, at least in this application.


And this is perhaps a more accurate measure of performance. So this is called -- this is self-relative speedup, which is speedup over the sequential version of the same system. So this is an attempt to rule out some of the effects caused by these overheads.


So here we noted that we are pretty much close to DSTM, but we are perhaps still a little better. But I would say we pretty much have the same curve as DSTM, and it's interesting why these two are so similar. That's something to observe.


>>: You're relatively flat from 4 to 7 CPUs; did you investigate what's going on there?


>> Swarat Chaudhuri: We tried to investigate, but we did not have any answers. The problem with this Java-based system is that there's no underlying performance model. I'll show you a graph in a little bit that's really mindboggling. We had absolutely no clue why this happened, but the performance started to drop when you moved from 7 to 8 threads. Not in our model -- when you tried to use fine-grained locking. It suddenly collapsed when you went from 7 to 8 threads. And we figured it was because of the garbage collector, but we have no clue what is happening there.


So what about the number of conflicts? So we define conflicts in DSTM2 in the usual way: how many transactions are rolled back. We also wrote a fine-grained locking implementation, which you saw in the slides also. Here a conflict meant you tried to build a cavity but you basically had to stop doing it and get back to the original state. Our fine-grained locks are not blocking locks: basically, you try to acquire a triangle and expand your cavity. If you cannot, you jump ship, go back to the original state, and start again.


We found the number of conflicts in our system is actually very, very low, which gives us hope that if we had a more efficient implementation we would be able to get better results.


>>: Load balance.


>> Swarat Chaudhuri: No, we do not have real numbers for load balancing, primarily because the load-balancing heuristic we gave is, we think, a pretty simple thing and can be improved; we'd like to investigate it.


>>: The balancing [inaudible] -- you synchronize the whole division. An assembly probably waits until later.


>>: Right.


>> Swarat Chaudhuri: It's more like you are making a request from a division which has lots of stuff to a division which has very little stuff. So now, just the normal migration would bring stuff from the division which has less to the division which has more. But that doesn't seem right. So what we do is we exploit the property that migration is symmetric, and we basically make the migration go in the other direction.


>>: You use the number of assemblies.


>> Swarat Chaudhuri: Right.


>>: But not size.


>> Swarat Chaudhuri: The reason is that the sizes are hard to track. One could imagine you could improve this by having some sort of heuristic on the number of objects that are in the division currently. So, as for Boruvka's algorithm: this was the bizarre graph I was telling you about, where in this fine-grained locking implementation the [inaudible] suddenly dipped when we went from seven threads to eight threads, and we checked very carefully to see if there were any bugs; none could we find. So we suspect that this is because of the garbage collector. But again, we don't really know why.


>>: Were you running with eight cores?


>>: We were running with eight cores, yes.


>>: Had overheads, other stuff going on.


>> Swarat Chaudhuri: Yes. An interesting thing to note is that our speedup on this spanning tree application is quite different. And the reason that there's no graph, no line, corresponding to DSTM is that we couldn't run DSTM on this application: it's so big it ran out of memory immediately.


Here's the sad part of the application: speedups were rather miserable, even though everything else was good. But we're hoping for a better implementation, and maybe this can be improved. But I'm not claiming this is only because of implementation overheads. It may be that our approach also has some fundamental problems. But we do not know, basically.


>>: [inaudible] what is the raw speedup?


>> Swarat Chaudhuri: For raw speedup, you run the sequential version of the code and you plot the speedup over that sequential version -- rather than versus your own performance on one thread. So basically, the sequential version of your own system is the baseline in the self-relative speedup, and for the raw speedup the baseline is the best sequential implementation, which is just written in Java.


>>: [inaudible] a perfect 1-to-1.


>> Swarat Chaudhuri: Yes, exactly.


>>: How are you going to make it ten times slower?


>> Swarat Chaudhuri: You'd be surprised. So initially it was about 25 times. And then we found that if we replaced these exceptions by just simple checks -- where you did not have to prepare a stack trace each time -- then we immediately jumped from 25 times to ten times. So now, our understanding of the overheads is not that deep. It would be useful to talk to somebody who knows the details of the Java runtime system, but this is the best we could do, at least for now.


And in ongoing work, I'm working with Professor Vivek Sarkar at Rice University to embed this Chorus model into Habanero Java, which is an offshoot of X10, whose development [inaudible] spearheaded.


The point here is that this Habanero Java already has these abstractions for isolation; however, those are coarse-grained and static. So what we want to have is this more fine-grained, adaptive construct for isolation, which will be as you saw here. And then there are some other things that are being considered in this setting.


So in particular, in the model that I showed you, the objects were either completely in the ownership of one assembly or not owned at all. However, we are considering abstractions for fractional ownership, where multiple assemblies can own an object; however, aliasing is very tightly controlled.


And other ongoing work includes a more optimized implementation -- my student is working hard at it right as we speak. And we are also looking at a foundational process calculus for these problems; actually, that's where we started with all of this. So I'm primarily a theoretician, and our goal was to develop foundational calculi for local computation in graphs. But then it became a more, perhaps, realistic thing. But we still want to address this question of what a pi calculus with splits and ownership would look like. We're investigating that. Finally, as somebody who is primarily very interested in program verification, I'm also interested in questions about type systems for Chorus: how do we reason about the typestate of these assemblies, and how can we do some static analysis. So these are all questions that we want to address.


And then finally, one other interesting application of Chorus seems to be in modular robotics, where there are these self-assembling robots which are just running around in parallel, getting together to do some stuff and breaking off again to do some other stuff. So that seems to be quite similar, at least at a superficial level, to our assemblies and their merging and splitting. We're looking into those applications and trying to see if we can program some of them.


So for more information you can go to this Web page. I don't think it's up yet, but it will be in a few days. If you want the code, ask me. The paper is published at OOPSLA. If you want to see that, e-mail me too. I'll end here and take questions.

[applause]


>>: So you say you're going to go back and start to work on type systems and static analysis.


>> Swarat Chaudhuri: So we have built a preliminary implementation of this language, which you can download and use, but we want to integrate it with this Habanero. And then there are questions of what the types are, what types mean for this, and what kinds of compiler optimizations we can do, and those questions. So I want to address these questions.


>>: Almost sounds like a linear type setup.


>> Swarat Chaudhuri: Yes.


>>: But the pitfalls are always fun.


>> Swarat Chaudhuri: Right.


>>: So he asked my question. So if I wanted to use -- if I wanted to use [inaudible] to parallelize this numeric computation, the notion of a job -- a job would correspond to, I guess, triangles which are bad. Or you can abstract.


>> Swarat Chaudhuri: Right.


>>: Then push them into queues on the threads -- steal jobs from the queue, right?


>> Swarat Chaudhuri: Right. The question is what kind of heuristics you would need to maintain the jobs. It can't be completely naive. What you would presumably do is use the same kind of locality strategy: if you're stealing a triangle, then perhaps you're always going to steal some local triangles along with it, because you will use them afterwards. So in some sense you can see our division runtime system as an implementation of what you said: each division has a queue, and stealing work from another division is basically about migrating these assemblies from one division to another division.


And there is load balancing, just as you're going to have load balancing in Cilk. But the point is that all of those things are hidden from the programmer, because we also want to guarantee these high-level properties: race freedom, deadlock freedom, and so on.


>>: You would compile your language to --



>> Swarat Chaudhuri: Yes, that's actually a part I did not have. That seems like a possibility, yes.

I mean, okay. So I would be -- yeah, I would be hesitant to say yes completely, and the reason is that for these questions like deadlock freedom we have to do some special things to ensure them. So I do not know if, with just a naive translation into Cilk, we would get all those properties automatically; that would require some thought.



>>: We can talk about it.


>>: Sure. Yes.


>>: [inaudible].


>> Swarat Chaudhuri: The only way you do any kind of passing of objects from one assembly to another is via this merge or split. So when you do a merge, you are just giving the entire region to another assembly. So you are not --



>>: So basically [inaudible] you're passing ownership and stuff, and you see it going through the local part of the heap?


>> Swarat Chaudhuri: Yes, that is right. That is right. Yeah. So that's where the point about separation comes in.


>>: A measure of the -- [inaudible] intercept and goes --



>> Swarat Chaudhuri: Right. So not every one of them --



>>: Ten percent slow down.


>> Swarat Chaudhuri: I agree that is perhaps -- yeah, so I agree that is perhaps the biggest bottleneck. But I think one clarification, though, is that we do not need to monitor every single write. So really we can store some notion of a boundary of these objects. So you can do some optimizations which will guarantee that certain accesses are guaranteed to be local. But sometimes you're not sure, and then, yes, you will have to do this sort of check.


>>: This is like transactional memory.


>> Swarat Chaudhuri: In one way, you're doing this only at a completely local level. In the case of transactional memory, if you have at least something that's global, your conflict-detection routine is going to have to go through the entire memory. Whereas here, this sort of check -- whether you are going outside or whether you're staying inside -- is something that you are doing using that find operation in union-find, right? And it's also a local thing: as you saw in that mesh example, an assembly contains, what, ten triangles. It's about checking these 9 or 10 edges that go out of an assembly and trying to see if --



>>: It depends on the size of the --



>> Swarat Chaudhuri: Absolutely.


>>: The size of the entire part of this assembly.


>> Swarat Chaudhuri: Sure. But what I'm saying is that the property that we're trying to exploit is that there is a lot of parallelism at a low level, at a fine-grained level. So in cases where there's no such parallelism, then, yes, it will just reduce to doing these sorts of global checks.


>>: The overhead of these checks --



>> Swarat Chaudhuri: Sure.


>>: Essentially the same as transactional memory. You can have [inaudible] implementation --



>> Swarat Chaudhuri: But the basic point I want to make is that the only difference between the sort of checking that goes on in transactional memory and here is just that we are using this locality property, that we're operating on local regions. So if you said that these assemblies contain the entire heap, then in that case, yes, there's no difference at all in terms of overheads.


>>: In multiple transactions you can access something outside the transaction, so you need to monitor.


>> Swarat Chaudhuri: But remember, in transactions there is no annotation saying this is the region that this transaction belongs to. So transactions refer to atomicity of control in some sense, and the data is one global blob: go do whatever you want to do with it.


>>: [inaudible]


>> Swarat Chaudhuri: The what?


>>: The writes themselves.


>> Swarat Chaudhuri: Sure, but if you have heap-allocated data structures and you have field expressions, then you will have to basically do either some kind of analysis -- some kind of shape analysis -- to see how far you can go, or you'll have to do it dynamically, by doing a check. So here all we're exploiting is the locality property: that in these applications the parallel computations are restricted to these local regions in the heap. And that's what we're exploiting, yes.


>>: Any further questions?


>>: I have a question. So, something about the ease of programming with respect to fine-grained locking and transactions --



>> Swarat Chaudhuri: Transactional memory is easy to program in, but fine-grained locking is pretty hard, even in these applications, which are not complicated -- just trying to make sure there are no deadlocks and no aliases going from here to there. So that's a huge pain.


So I think that in terms of ease of programming, you saw it in that application: ten lines of code over the sequential version. So I think it seems very natural, at least for these classes of applications.


>>: Ten lines, 90 percent.


>> Swarat Chaudhuri: Ten lines.


>>: Only get ten percent of the performance.


>> Swarat Chaudhuri: Oh, sure, sure.


>>: Yeah, that's right.


>> Swarat Chaudhuri: But there I would point you to DSTM.


>>: [laughter].


>> Sumit Gulwani: Let's thank the speaker again.