SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets


Prepared by: Kholoud Alsmearat

Outline

SCOPE
Large-scale Distributed Computing
Map-Reduce programming model
SCOPE / Cosmos
Input and Output
Select and Join
User Defined Operators
Process
Reduce
Combine
Importing Scripts

SCOPE

Structured Computations Optimized for Parallel Execution

A declarative scripting language targeted at massive data analysis.

Easy to use: SQL-like syntax plus MapReduce-like extensions

Highly extensible:
Fully integrated with the .NET framework
Flexible programming style: nested expressions or a series of simple transformations

SCOPE cont.

SCOPE borrows several features from SQL.

Users can easily define their own functions and implement their own versions of operators:
I. Extractor: parsing rows from a file
II. Processor: row-wise processing
III. Reducer: group-wise processing
IV. Combiner: combining rows from two inputs




Large-scale Distributed Computing

Because of the sheer size of these data sets, traditional parallel database solutions can be prohibitively expensive at this scale, so several companies have developed distributed data storage and processing systems on large clusters of shared-nothing commodity machines.

Map-Reduce

The Map-Reduce programming model
Good abstraction of group-by-aggregation operations
Map function -> grouping
Reduce function -> aggregation
Map-Reduce achieves parallel processing (see the sketch below).

Limitations:
For some applications it is unnatural to use the Map-Reduce model
Such custom code is error-prone and hardly reusable.
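
For intuition about the model itself, here is a minimal word-count sketch in plain C#/LINQ (my illustration, not from the slides): Map emits key/value pairs, a grouping step collects equal keys, and Reduce aggregates each group.

using System;
using System.Collections.Generic;
using System.Linq;

static class MapReduceSketch
{
    // Map: emit one (word, 1) pair per word; the framework groups pairs by key.
    static IEnumerable<KeyValuePair<string, int>> Map(string line) =>
        line.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => new KeyValuePair<string, int>(w, 1));

    // Reduce: aggregate all values that share a key (here: sum the counts).
    static int Reduce(string key, IEnumerable<int> values) => values.Sum();

    static void Main()
    {
        string[] lines = { "scope scripts run on cosmos", "scope scripts scale" };

        var counts = lines
            .SelectMany(Map)                      // map phase
            .GroupBy(p => p.Key, p => p.Value)    // grouping (shuffle)
            .Select(g => new { Word = g.Key, Count = Reduce(g.Key, g) });  // reduce phase

        foreach (var c in counts)
            Console.WriteLine($"{c.Word}: {c.Count}");
    }
}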

SCOPE / Cosmos

Cosmos Storage System
A distributed storage subsystem designed to efficiently store large sequential files.
Data is compressed and replicated.

Cosmos Execution Environment
An environment for deploying, executing, and debugging distributed applications.

An Example: QCount
Compute the popular queries that have been requested at least 1000 times

SELECT query, COUNT(*) AS count
FROM "search.log" USING LogExtractor
GROUP BY query
HAVING count > 1000
ORDER BY count DESC;
OUTPUT TO "qcount.result";

The same query as a series of simple transformations:

e  = EXTRACT query FROM "search.log" USING LogExtractor;
s1 = SELECT query, COUNT(*) AS count FROM e GROUP BY query;
s2 = SELECT query, count FROM s1 WHERE count > 1000;
s3 = SELECT query, count FROM s2 ORDER BY count DESC;
OUTPUT s3 TO "qcount.result";

Every rowset has a well-defined schema.

Input and Output

EXTRACT and OUTPUT commands provide a relational abstraction of the underlying data sources.
Built-in/customized extractors and outputters (C# classes)

EXTRACT column[:<type>] [, …]
FROM <input_stream(s)>
USING <Extractor> [(args)]
[HAVING <predicate>]

OUTPUT [<input>]
TO <output_stream>
[USING <Outputter> [(args)]]


public class LineitemExtractor : Extractor
{
    public override Schema Produce(string[] requestedColumns, string[] args)
    { … }

    public override IEnumerable<Row> Extract(StreamReader reader, Row outputRow, string[] args)
    { … }
}
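
As a rough illustration (my sketch, not from the slides; the Row indexer and Set accessor are assumed API shapes), an extractor that parses tab-separated lines might look like:

public class TabSeparatedExtractor : Extractor
{
    public override Schema Produce(string[] requestedColumns, string[] args)
    {
        // Expose exactly the columns the script asked for (assumed Schema constructor).
        return new Schema(requestedColumns);
    }

    public override IEnumerable<Row> Extract(StreamReader reader, Row outputRow, string[] args)
    {
        string line;
        while ((line = reader.ReadLine()) != null)
        {
            // Split each line on tabs and copy the fields into the output row;
            // assumes the row has at least fields.Length columns.
            string[] fields = line.Split('\t');
            for (int i = 0; i < fields.Length; i++)
                outputRow[i].Set(fields[i]);
            yield return outputRow;
        }
    }
}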

Select and Join

Supports many aggregate functions: COUNT, COUNTIF, MIN, MAX, SUM, AVG, STDEV, VAR, FIRST, LAST.
No subqueries (but the same functionality is available via outer join)

SELECT [DISTINCT] [TOP count] select_expression [AS <name>] [, …]
FROM { <input stream(s)> USING <Extractor> |
       {<input> [<joined input> […]]} [, …] }
[WHERE <predicate>]
[GROUP BY <grouping_columns> [, …]]
[HAVING <predicate>]
[ORDER BY <select_list_item> [ASC | DESC] [, …]]

joined input:
<join_type> JOIN <input> [ON <equijoin>]
join_type: [INNER | {LEFT | RIGHT | FULL} OUTER]
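
For illustration (my sketch with hypothetical streams and columns, not from the slides), a join in SCOPE script could look like:

R = EXTRACT userId, query FROM "search.log" USING LogExtractor;
U = EXTRACT userId, market FROM "users.log" USING LogExtractor;
J = SELECT R.query, U.market
    FROM R INNER JOIN U ON R.userId == U.userId;
OUTPUT J TO "query_markets.result";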

Deep Integration with .NET (C#)

SCOPE supports C# expressions and built-in .NET functions/libraries

R1 = SELECT A+C AS ac, B.Trim() AS B1
FROM R
WHERE StringOccurs(C, "xyz") > 2

#CS
public static int StringOccurs(string str, string ptrn)
{ … }
#ENDCS
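
A possible body for StringOccurs (my sketch, not from the slides) that counts non-overlapping occurrences of a pattern:

#CS
public static int StringOccurs(string str, string ptrn)
{
    // Count non-overlapping occurrences of ptrn in str.
    if (string.IsNullOrEmpty(ptrn)) return 0;
    int count = 0;
    for (int i = str.IndexOf(ptrn); i >= 0; i = str.IndexOf(ptrn, i + ptrn.Length))
        count++;
    return count;
}
#ENDCS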

User Defined Operators

SCOPE supports three highly extensible commands: PROCESS, REDUCE, and COMBINE
Easy to customize by extending built-in C# components
Easy to reuse code in other SCOPE scripts.

Process

The PROCESS command takes a rowset as input, processes each row, and outputs a sequence of rows (zero, one, or multiple rows).
This flexible command enables users to implement processing that is difficult or impossible to express in SQL.

PROCESS [<input>]
USING <Processor> [ (args) ]
[PRODUCE column [, …]]
[WHERE <predicate>]
[HAVING <predicate>]

public class MyProcessor : Processor
{
    public override Schema Produce(string[] requestedColumns, string[] args, Schema inputSchema)
    { … }

    public override IEnumerable<Row> Process(RowSet input, Row outRow, string[] args)
    { … }
}
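
For illustration (my sketch, not from the slides; RowSet.Rows, the Row indexer, and the String/Set accessors are assumed API shapes), a processor that expands a query column into one row per term might look like:

public class TermSplitter : Processor
{
    public override Schema Produce(string[] requestedColumns, string[] args, Schema inputSchema)
    {
        // Output a single "term" column (assumed Schema constructor).
        return new Schema(requestedColumns);
    }

    public override IEnumerable<Row> Process(RowSet input, Row outRow, string[] args)
    {
        foreach (Row row in input.Rows)
        {
            // Emit one output row per whitespace-separated term of the first column.
            foreach (string term in row[0].String.Split(' '))
            {
                outRow[0].Set(term);
                yield return outRow;
            }
        }
    }
}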

Reduce

The REDUCE command takes a grouped rowset, processes each group, and outputs zero, one, or multiple rows per group.

REDUCE [<input> [PRESORT column [ASC | DESC] [, …]]]
ON grouping_column [, …]
USING <Reducer> [ (args) ]
[PRODUCE column [, …]]
[WHERE <predicate>]
[HAVING <predicate>]

public class MyReducer : Reducer
{
    public override Schema Produce(string[] requestedColumns, string[] args, Schema inputSchema)
    { … }

    public override IEnumerable<Row> Reduce(RowSet input, Row outRow, string[] args)
    { … }
}
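
Again as a rough sketch (not from the slides; it assumes Reduce is invoked once per group and that the Row accessors shown exist), a reducer that outputs each group's key and row count might look like:

public class GroupCounter : Reducer
{
    public override Schema Produce(string[] requestedColumns, string[] args, Schema inputSchema)
    {
        // Output the grouping column plus a count column (assumed Schema constructor).
        return new Schema(requestedColumns);
    }

    public override IEnumerable<Row> Reduce(RowSet input, Row outRow, string[] args)
    {
        long count = 0;
        string key = null;
        foreach (Row row in input.Rows)
        {
            // All rows passed to one invocation are assumed to share the grouping key.
            key = row[0].String;
            count++;
        }
        outRow[0].Set(key);
        outRow[1].Set(count);
        yield return outRow;
    }
}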

Combine

The COMBINE command takes two matching input rowsets, combines them in some way, and outputs a sequence of rows.

COMBINE <input1> [AS <alias1>] [PRESORT …]
WITH <input2> [AS <alias2>] [PRESORT …]
ON <equality_predicate>
USING <Combiner> [ (args) ]
PRODUCE column [, …]
[HAVING <expression>]

public class MyCombiner : Combiner
{
    public override Schema Produce(string[] requestedColumns, string[] args,
                                   Schema leftSchema, string leftTable, Schema rightSchema, string rightTable)
    { … }

    public override IEnumerable<Row> Combine(RowSet left, RowSet right, Row outputRow, string[] args)
    { … }
}

COMBINE S1 WITH S2
ON S1.A==S2.A AND S1.B==S2.B AND S1.C==S2.C
USING MyCombiner
PRODUCE A, B, C

Importing Scripts

Similar to a SQL table function.
Improves reusability and allows parameterization
Provides a security mechanism

IMPORT <script_file>
[PARAMS <par_name> = <value> [, …]]
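
For illustration (hypothetical script file and parameter names, my sketch), an import might look like:

IMPORT "MyView.script"
PARAMS maxCount = 1000, market = "en-us";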


Life of a SCOPE Query

[Diagram: SCOPE queries pass through the parser / compiler / security layer before execution on the cluster.]

Example Query Plan (QCount)

1. Extract the input cosmos file
2. Partially aggregate at the rack level
3. Partition on "query"
4. Fully aggregate
5. Apply filter on "count"
6. Sort results in parallel
7. Merge results
8. Output as a cosmos file

SELECT query, COUNT(*) AS count
FROM "search.log" USING LogExtractor
GROUP BY query
HAVING count > 1000
ORDER BY count DESC;
OUTPUT TO "qcount.result";


Conclusions

SCOPE: a new scripting language for large-scale analysis
Strong resemblance to SQL: easy to learn and to port existing applications
Very extensible
Fully benefits from the .NET library
Supports built-in C# templates for customized operations
High-level declarative language
Implementation details (including parallelism and system complexity) are transparent to users

Thanks for listening.

Any questions?