Pig Contributors Workshop

4 Nov 2013

-

2

-

Agenda


Introductions


What we are working on


Usability


Howl


TLP


Lunch


Turing Completeness


Workflow


Fun (Bocce ball)


-

3

-

Richard Ding


Usage stats collection


New top-level API


package org.apache.pig;

public class PigRunner {
    // Signature as shown on the slide; the method body is omitted.
    public static PigStats run(String[] args);
}


New Entries in Job XML


pig.script.id, pig.launcher.host, pig.command.line,
pig.parent.jobid, pig.alias, pig.script.features, pig.job.feature


pig.version, pig.hadoop.version


New Counter Groups


MultiStoreCounters, MultiInputCounters


-

4

-

Ashutosh Chauhan


UDFs in scripting languages


-

5

-

Daniel Dai


Optimizer rewrite


Why do we need an optimizer?


Complex scripts are hard to optimize


In practice, the optimizer kicks in quite often on user scripts


Brand-new framework that makes it easier to add a rule (PIG-1178)


Optimization rules (PIG-1319)


Split filter


Pushup Filter


Merge filter


Prune Columns


Pushdown foreach flatten


Expression optimizer


Merge foreach




-

6

-

Aniket Mokashi


Custom partitioner and Scalar


Custom partitioner


Use case


Controls how output is sprayed across reducers through the getPartition function


Allows a custom grouping policy





B = group A by $0 PARTITION BY org.apache.pig.test.utils.SimpleCustomPartitioner parallel 2;
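
In Hadoop, a partitioner's getPartition(key, value, numPartitions) returns the index of the reducer that should receive a record. The slide does not show what SimpleCustomPartitioner does internally, so the following is only a minimal sketch of the idea under stated assumptions: SketchPartitioner is a hypothetical stand-in for the Hadoop contract (so the code compiles without Hadoop), and the first-letter grouping policy is invented for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in that mirrors the shape of Hadoop's Partitioner,
// so the sketch compiles without Hadoop on the classpath.
interface SketchPartitioner<K, V> {
    int getPartition(K key, V value, int numPartitions);
}

// Hypothetical grouping policy: keys that share a first letter
// (case-insensitive) always go to the same partition.
final class FirstLetterPartitioner implements SketchPartitioner<String, Long> {
    @Override
    public int getPartition(String key, Long value, int numPartitions) {
        if (key == null || key.isEmpty()) {
            return 0; // keys with no first letter go to partition 0
        }
        // A char is never negative, so the remainder is a valid index.
        return Character.toLowerCase(key.charAt(0)) % numPartitions;
    }
}

final class PartitionerSketch {
    // Distributes keys into numPartitions buckets, as a shuffle would.
    static List<List<String>> spray(List<String> keys, int numPartitions) {
        SketchPartitioner<String, Long> partitioner = new FirstLetterPartitioner();
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(partitioner.getPartition(key, null, numPartitions)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> keys = new ArrayList<>();
        keys.add("Ohio");
        keys.add("oregon");
        keys.add("Texas");
        System.out.println(spray(keys, 2));
    }
}
```

With "parallel 2" there are two partitions, so numPartitions is 2 here. A real partitioner named in PARTITION BY would extend Hadoop's Partitioner class rather than this stand-in interface.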


Scalar


A = load 'censors_total' as (state, population);
B = group A all;
total = foreach B generate SUM(A.population);
C = foreach A generate state, population/(long)total as percentage;
store C into 'censors_percentage';
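
The scalar script computes one aggregate over the whole relation and then reuses that single value in every row. A minimal Java sketch of the same two-pass computation follows, assuming in-memory data; the class name, the state labels, and the population figures are made up for illustration. It uses floating-point division so the fractions are visible, which is a choice of this sketch and not something the slide specifies.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ScalarSketch {
    // Pass 1 mirrors "group A all" plus SUM: collapse the relation to one value.
    static long total(Map<String, Long> populationByState) {
        long sum = 0;
        for (long population : populationByState.values()) {
            sum += population;
        }
        return sum;
    }

    // Pass 2 mirrors the second foreach: the single total is reused for every row.
    static Map<String, Double> shares(Map<String, Long> populationByState) {
        long total = total(populationByState);
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Long> row : populationByState.entrySet()) {
            // Cast to double first; long / long would truncate toward zero.
            out.put(row.getKey(), (double) row.getValue() / total);
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, Long> a = new LinkedHashMap<>();
        a.put("X", 30L); // made-up rows standing in for (state, population)
        a.put("Y", 70L);
        System.out.println(shares(a));
    }
}
```

The point of the scalar feature is that the script can refer to the one-row relation total directly inside another foreach, instead of joining it back onto A.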

-

7

-

Olga Natkovich



Usability and error messages


New parser that allows better control over error messages


More meaningful error messages


Early error detection


Clarified language semantics


Resurrect support for illustrate


-

8

-

Howl, Why We Need It

What we have now


Hive has its own data catalog


Pig, MapReduce can


Use an InputFormat or loader that knows the schema (e.g. ElephantBird)


Describe the schema in code

A = load 'foo' as (x:int, y:float)


Still have to know where to read and write files themselves


Must write a Loader and a SerDe to read a new file type in Pig and Hive


Workflow systems must poll HDFS to see when data is available

-

9

-

Howl, What We Want


Given an InputFormat and OutputFormat, only need to write one piece of code to read/write data for all tools


Schema shared across tools


Disk location and storage format abstracted by service


Workflow notified of data availability by service


[Diagram: table mgmt service; tools: Pig, Hive, Map Reduce, Streaming; storage formats: RCFile, Sequence File, Text File]

-

10

-

TLP

-

11

-

Alan Gates


Turing complete Pig

Options on the table so far


Extend Pig Latin itself


Embed in a scripting language via a precompiler


Embed in a scripting language as a DSL

-

12

-

Pig Integration With Workflow


-

13

-

In Conclusion


Should we do this more often?


Thanks, everyone, for coming