

DryadLINQ

A System for General-Purpose Distributed Data-Parallel Computing

Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey

Microsoft Research Silicon Valley



Presented by: TD (Tathagata Das)



Designing a general purpose language for writing distributed data-parallel programs for a compute cluster



General purpose

Single-thread abstraction

Familiar language / environment



[Figure: two layered stacks, one labeled Machine / Shell / Shell script, the other Cluster / Dryad / ???]



Dryad = Execution Engine


Nebula: limited to existing binaries

Scope: SQL-ish, not general purpose



Can we do better?


Can we get the general-purposeness of C#/Java and the conciseness of SQL?

And at the same time, be efficient too?

Can I have my cake and eat it too?


Language Integrated Query (LINQ)



The creamy goodness of SQL-like queries within a declarative programming model



Basic abstraction: collections




“All the world’s a collection,
And all the men and women merely iterate on collections”
- implied by Shakespeare

Collections, Iterators and LINQ

IEnumerable<T> + LINQ => IEnumerable<T>

    using System.Linq;

    var result = from num in numbers
                 where num % 2 == 0
                 orderby num
                 select num;

    List<int> result = new List<int>();
    foreach (int num in numbers) {
        if (num % 2 == 0)
            result.Add(num);
    }
    result.Sort();

Syntactical sweetness of LINQ

Query Style

    var result = from num in numbers
                 where num % 2 == 0
                 orderby num
                 select num;

Method Style

    var result = numbers.Where(num => num % 2 == 0)
                        .OrderBy(n => n);

LINQ Functionality


Select / SelectMany            Map (1-to-1 / 1-to-many)
Where                          Filter
GroupBy                        Reduce
OrderBy                        Sort
Join                           Join
Union / Intersect / Except     Set operations
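To make the operator-to-concept mapping concrete, here is a minimal sketch that runs each operator on small in-memory collections using plain LINQ-to-Objects. The sample data and variable names are hypothetical and do not come from the talk; the script assumes C# top-level statements.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sample input: three comma-separated lines.
string[] lines = { "a,b,a", "b,c", "a" };

// SelectMany: 1-to-many map (each line yields several words).
List<string> words = lines.SelectMany(line => line.Split(',')).ToList();

// Select: 1-to-1 map (each word yields its length).
List<int> lengths = words.Select(w => w.Length).ToList();

// Where: filter.
List<string> withoutC = words.Where(w => w != "c").ToList();

// GroupBy plus a per-group Count: a per-key reduce.
Dictionary<string, int> counts = words
    .GroupBy(w => w)
    .ToDictionary(g => g.Key, g => g.Count());

// OrderBy: sort.
List<string> sorted = words.OrderBy(w => w).ToList();

// Join: match each counted word with a label from a second collection.
var labels = new[] { (Word: "a", Label: "first"), (Word: "b", Label: "second") };
List<string> joined = counts
    .Join(labels, c => c.Key, l => l.Word, (c, l) => $"{l.Label}={c.Value}")
    .ToList();

// Union / Intersect / Except: set operations.
string[] left = { "a", "b" };
string[] right = { "b", "c" };
List<string> union = left.Union(right).ToList();
List<string> common = left.Intersect(right).ToList();
List<string> onlyLeft = left.Except(right).ToList();

Console.WriteLine(string.Join(", ", counts.Select(kv => $"{kv.Key}:{kv.Value}")));
```

Every call above takes a collection and returns a collection, which is why the same chain can, in principle, be handed to a different provider without rewriting the query.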

LINQ Providers

SQL

XML




Google

Wikipedia

Twitter


Select / SelectMany


Where


GroupBy


OrderBy


Join


Union / Intersect / Except




LINQ System Architecture

[Figure labels: .Net Program; Query; Objects; LINQ Provider Interface; LINQ-to-SQL; LINQ-to-XML; PLINQ; DryadLINQ]

Parallel Collections

Partition

Collection

Simplest example: GFS/HDFS file

Dryad + LINQ = DryadLINQ

    string uri = @"file://\\machine\directory\input.pt";

    PartitionedTable<LineRecord> input =
        PartitionedTable.Get<LineRecord>(uri);

    var lengths = input.Select(line => line.ToString().Length);


Word Count with DryadLINQ

    string uri = @"file://\\machine\directory\input.pt";

    PartitionedTable<LineRecord> input =
        PartitionedTable.Get<LineRecord>(uri);

    string separator = ",";

    var words = input.SelectMany(x => SplitLineRecord(x, separator));

    var groups = words.GroupBy(x => x);

    var counts = groups.Select(x => new Pair(x.Key, x.Count()));

    var ordered = counts.OrderByDescending(x => x.count);

    var top = ordered.Take(k);

    top.ToDryadPartitionedTable("matching.pt");



[Figure: Execution Plan Graph with nodes Get, SM, G, S, O, Take]

[Figure: Word Count shown as an Execution Plan Graph (SM, G, S, O), a Data Flow Graph and a Distributed Data Flow Graph (replicated vertices labeled SM, D, MS, G, S), with layers labeled DryadLINQ and Dryad]

DryadLINQ Architecture [1]

[Figure labels. Client machine: .Net Programs; Query Expr; DryadLINQ; Distributed Query Plan; Vertex code; Context; Query. Cluster: Dryad JM; Dryad Execution; Input Tables; Output Tables]

DryadLINQ Code Generation

    string uri = @"file://\\machine\directory\input.pt";

    PartitionedTable<LineRecord> input =
        PartitionedTable.Get<LineRecord>(uri);

    string separator = ",";

    var words = input.SelectMany(x => SplitLineRecord(x, separator));

    var groups = words.GroupBy(x => x);

    var counts = groups.Select(x => new Pair(x.Key, x.Count()));

    var ordered = counts.OrderByDescending(x => x.count);

    var top = ordered.Take(k);

    top.ToDryadPartitionedTable("matching.pt");



Conversion of subexpressions to code for Dryad vertices…

1. Local variables
2. Local libraries and functions
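The point about local variables can be illustrated with standard .NET expression trees. The sketch below is not DryadLINQ's code generator; it only shows, under the assumption of C# top-level statements, how a LINQ provider can see that a lambda refers to a client-side local and read that local's value before shipping anything to a remote vertex. The names `hasSeparator`, `call`, `captured` and `value` are hypothetical.

```csharp
using System;
using System.Linq.Expressions;

// A local variable on the client, like `separator` in the word count query.
string separator = ",";

// Assigning a lambda to Expression<...> gives the provider a data structure
// describing the code, instead of a compiled delegate.
Expression<Func<string, bool>> hasSeparator = line => line.Contains(separator);

// The lambda body is a method call node...
var call = (MethodCallExpression)hasSeparator.Body;

// ...whose argument is a member access on a compiler-generated closure object.
// This is how the reference to the local variable appears in the tree.
var captured = (MemberExpression)call.Arguments[0];

// The provider can evaluate just that subexpression on the client to obtain
// the current value, which could then be serialized alongside generated code.
var value = Expression.Lambda(captured).Compile().DynamicInvoke();

Console.WriteLine($"{call.Method.Name} uses a captured value: \"{value}\"");
```

Calls to local functions show up in the tree in a similar way, as method call nodes, so a provider can tell which methods, and therefore which libraries, the remote code would need.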

DryadLINQ Architecture [2]

[Figure labels. Client machine: .Net Programs; Query Expr; DryadLINQ; Distributed Query Plan; Vertex code; Context; Invoke; Query; Output PartitionedTable; Results; .Net Objects. Cluster: Dryad JM; Dryad Execution; Input Tables; Output Tables]

Combining with LINQ-to-SQL

[Figure labels: Query; DryadLINQ; Subquery (five boxes); LINQ-to-SQL (two boxes)]

DryadLINQ Optimizations


Some are similar to existing DB optimizations


Eliminate redundant partitioning steps


Aggregation steps moved up the graph, before partitioning steps



Existing Dryad optimizations as well


Dynamic reconfiguration of aggregation trees
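Moving aggregation ahead of partitioning can be sketched on in-memory data. The example below is a simulation with hypothetical partitions and a toy hash function, written as C# top-level statements; it is not DryadLINQ's optimizer. Each partition counts its own words first, so only one record per distinct word per partition has to cross the partitioning step.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Two input partitions, standing in for data held on two machines.
string[][] partitions =
{
    new[] { "a", "b", "a", "a" },
    new[] { "b", "c", "a" },
};
int reducers = 2;

// Step 1: aggregate inside each partition, before any data moves.
var localCounts = partitions
    .Select(p => p.GroupBy(w => w)
                  .Select(g => (Word: g.Key, Count: g.Count()))
                  .ToList())
    .ToList();

// Step 2: partition the partial counts by a toy hash of the word
// (sum of character codes), then sum the partial counts per word.
var totals = localCounts
    .SelectMany(p => p)
    .GroupBy(pc => pc.Word.Sum(ch => (int)ch) % reducers)
    .SelectMany(bucket => bucket
        .GroupBy(pc => pc.Word)
        .Select(g => (Word: g.Key, Count: g.Sum(pc => pc.Count))))
    .ToDictionary(t => t.Word, t => t.Count);

// Records that cross the partitioning step, with and without step 1.
int shippedWithout = partitions.Sum(p => p.Length);
int shippedWith = localCounts.Sum(p => p.Count);

Console.WriteLine($"shipped without pre-aggregation: {shippedWithout}, with: {shippedWith}");
```

The saving grows with the amount of repetition inside each partition; with all-distinct words the two numbers would be equal.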



Thoughts [1]


Easy to read, though reads more like a PL paper



What are system contributions that are different from Dryad?



Does the high level abstraction provide any extra information that allow






Thoughts [2]

Interesting anecdote…


DryadLINQ is inefficient for random-access workloads, but for some workloads it outperformed systems customized for random access

HDD performance characteristics are such that a sequential read (even if you discard 99% of the data) is better than small random accesses
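A back-of-envelope calculation shows why a full scan can win. Every number below is an illustrative assumption (dataset size, record size, seek cost, scan throughput), not a measurement from the talk or the paper; the script assumes C# top-level statements.

```csharp
using System;

// All figures are assumed for illustration only.
double datasetBytes = 1e12;          // 1 TB on disk
double recordBytes = 100;            // size of one record
double wantedFraction = 0.01;        // keep 1% of records, discard 99%
double seekSeconds = 0.010;          // assumed cost of one small random read
double scanBytesPerSecond = 100e6;   // assumed sequential throughput

// Random access: one seek per wanted record.
double wantedRecords = datasetBytes / recordBytes * wantedFraction;
double randomSeconds = wantedRecords * seekSeconds;

// Sequential access: read everything, throw most of it away.
double scanSeconds = datasetBytes / scanBytesPerSecond;

Console.WriteLine($"random: {randomSeconds:F0} s, scan: {scanSeconds:F0} s, " +
                  $"ratio: {randomSeconds / scanSeconds:F1}");
```

Under these assumptions the scan is roughly two orders of magnitude faster even though it discards 99% of what it reads; different hardware or much sparser lookups would shift the balance.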




Thoughts [3]


How different is FlumeJava from this?