David Walker

Princeton University

Computer Science


Pads: Simplified Data Processing For Scientists

2

Computer Science in the 21st Century


One part computation to determine the answer to your problem.


One part communication to tell someone about it.

Who: actress Jennifer Aniston and actor Brad Pitt

When: July 29, 2000

Where: The nuptials took place on the grounds of TV producer Marcy Carsey's Malibu estate

The Ceremony: As the sun sank low in the California sky, two hundred assembled guests watched as John Aniston, known to daytime television fans for his work on Days of Our Lives, walked his daughter down the aisle. Shielded by a flower-bedecked canopy, the bride and groom were able to say ....

4

5

Our Common Communication Infrastructure


Behind the scenes, much of this information is represented in standardized data formats


Standardized data formats:


Web pages in HTML


Pictures in JPEG


Movies in MPEG


“Universal” information format XML


Standard relational database formats


A plethora of data processing tools:


Visualizers (Browsers Display JPEG, HTML, ...)


Query languages let users extract information (SQL, XQuery)


Programmers get easy access through standard libraries


Java XML libraries (JAXP)


Many applications handle it natively and convert back and forth


MS Word



6

Ad Hoc Data


Massive amounts of data are stored in XML, HTML or relational databases, but there's even more data that isn't


An ad hoc data format is any nonstandard data format for which convenient parsing, querying, visualizing, and transformation tools are not available


ad hoc data is everywhere.

7

Ad Hoc data from www.investors.com

Date: 3/21/2005 1:00PM PACIFIC

Investor's Business Daily ®

Stock List Name: DAVE


Stock   Company                             Price   Price     Volume    EPS     RS
Symbol  Name                        Price   Change  % Change  % Change  Rating  Rating

AET     Aetna Inc                   73.68   -0.22   0%        31%       64      93
GE      General Electric Co         36.01   0.13    0%        -8%       59      56
HD      Home Depot Inc              37.99   -0.89   -2%       63%       84      38
IBM     Intl Business Machines      89.51   0.23    0%        -13%      66      35
INTC    Intel Corp                  23.50   0.09    0%        -47%      39      33


Data provided by William O'Neil + Co., Inc. © 2005. All Rights Reserved.

Investor's Business Daily is a registered trademark of Investor's Business Daily, Inc.

Reproduction or redistribution other than for personal use is prohibited.

All prices are delayed at least 20 minutes.


8

Ad Hoc data from www.geneontology.org

!autogenerated-by: DAG-Edit version 1.419 rev 3

!saved-by: gocvs

!date: Fri Mar 18 21:00:28 PST 2005

!version: $Revision: 3.223 $

!type: % is_a is a

!type: < part_of part of

!type: ^ inverse_of inverse of

!type: | disjoint_from disjoint from

$Gene_Ontology ; GO:0003673

<biological_process ; GO:0008150


%behavior ; GO:0007610 ; synonym:behaviour


%adult behavior ; GO:0030534 ; synonym:adult behaviour


%adult feeding behavior ; GO:0008343 ; synonym:adult feeding behaviour


% feeding behavior ; GO:0007631


%adult locomotory behavior ; GO:0008344 ;


...

9

Ad Hoc Data From Steve Kleinstein

(Immune Response Simulation Data)

0   8   125   8   3   2   6    0   (~6:0:0:0:0~1:0:0:0:1,1:1:0:0:0)
1   3   7     7   2   1   6    0   (~6:0:0:0:0~1:1:0:0:0)
2   7   37    6   2   1   5    0   (~5:0:0:0:0~1:1:0:0:0)
3   5   16    5   4   3   2    0   (~2:0:0:0:0~1:1:0:0:0,1:1:0:0:0,1:0:0:1:0)
4   8   161   2   2   1   1    0   (~1:0:0:0:0~1:0:0:1:0)
5   5   27    18  4   5   13   4   (~13:0:0:0:0~2:0:0:0:1,1:0:0:1:0,2:0:0:1:0)
6   6   50    5   1   0   5    0   5:0:0:0:0


....

10

Ad Hoc Data in Chemistry

O=C([C@@H]2OC(C)=O)[C@@]3(C)[C@]([C@](CO4)

(OC(C)=O)[C@H]4C[C@@H]3O)([H])[C@H]

(OC(C7=CC=CC=C7)=O)[C@@]1(O)[C@@](C)(C)C2=C(C)

[C@@H](OC([C@H](O)[C@@H](NC(C6=CC=CC=C6)=O)

C5=CC=CC=C5)=O)C1

[Figure: structural diagram of the molecule encoded by the string above]
11

Ad Hoc Data from Web Server Logs (CLF)

207.136.97.49 - - [15/Oct/1997:18:46:51 -0700] "GET /tk/p.txt HTTP/1.0" 200 30

tj62.aol.com - - [16/Oct/1997:14:32:22 -0700] "POST /scpt/dd@grp.org/confirm HTTP/1.0" 200 941

12

Ad Hoc Data: DNS packets

00000000: 9192 d8fb 8480 0001 05d8 0000 0000 0872 ...............r

00000010: 6573 6561 7263 6803 6174 7403 636f 6d00 esearch.att.com.

00000020: 00fc 0001 c00c 0006 0001 0000 0e10 0027 ...............'

00000030: 036e 7331 c00c 0a68 6f73 746d 6173 7465 .ns1...hostmaste

00000040: 72c0 0c77 64e5 4900 000e 1000 0003 8400 r..wd.I.........

00000050: 36ee 8000 000e 10c0 0c00 0f00 0100 000e 6...............

00000060: 1000 0a00 0a05 6c69 6e75 78c0 0cc0 0c00 ......linux.....

00000070: 0f00 0100 000e 1000 0c00 0a07 6d61 696c ............mail

00000080: 6d61 6ec0 0cc0 0c00 0100 0100 000e 1000 man.............

00000090: 0487 cf1a 16c0 0c00 0200 0100 000e 1000 ................

000000a0: 0603 6e73 30c0 0cc0 0c00 0200 0100 000e ..ns0...........

000000b0: 1000 02c0 2e03 5f67 63c0 0c00 2100 0100 ......_gc...!...

000000c0: 0002 5800 1d00 0000 640c c404 7068 7973 ..X.....d...phys

000000d0: 0872 6573 6561 7263 6803 6174 7403 636f .research.att.co

13

Who uses ad hoc data?


Ad hoc data sources are everywhere


containing valuable information of all kinds


everybody wants it:


chemists, physicists, biologists, economists, computer
scientists, network administrators, ...


just about anyone who writes their own programs


14

The challenge of ad hoc data


What can we do about ad hoc data?


how do we read it into programs?


how do we detect errors?


how do we correct errors?


how do we query it?


how do we view it?


how do we gather statistics on it?


how do we load it into a database?


how do we transform it into a standard format like XML?


how do we combine multiple ad hoc data sources?


how do we filter, normalize and transform it?


In short: how do we do all the things we take for granted when dealing with standard formats in a reliable, fault-tolerant and efficient, yet effortless way?


15

Most people use C / Perl / Shell scripts


But:


Writing hand-coded parsers is time consuming & error prone.


Reading and maintaining them in the face of even small format changes can be difficult.


Such programs are often incomplete, particularly with respect to errors.


They are not especially efficient unless the author invests extra effort.


For reliable, fault-tolerant, efficient data processing, we can do better!
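
To make the bullets above concrete, here is a minimal sketch of the kind of hand-coded C parser they describe, for a single Common Log Format line. The names `clf_entry` and `parse_clf_line` are hypothetical and not part of Pads. Note that the '-' content length needs its own special case, and that the function can only report that a line failed, not where.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record type for one CLF line (not generated by Pads). */
typedef struct {
    char host[64];
    char remote_id[32];
    char auth[32];
    char date[32];
    char request[256];
    int  response;
    long length;        /* -1 when the server logged '-' */
} clf_entry;

/* Returns 1 on success, 0 if the line does not match the expected layout. */
int parse_clf_line(const char *line, clf_entry *out) {
    char len[16];
    char *end;
    /* Widths are one less than each buffer; %[^]] reads up to the closing ']'. */
    int n = sscanf(line, "%63s %31s %31s [%31[^]]] \"%255[^\"]\" %d %15s",
                   out->host, out->remote_id, out->auth,
                   out->date, out->request, &out->response, len);
    if (n != 7) return 0;

    /* Special case that is easy to forget: '-' instead of a byte count. */
    if (strcmp(len, "-") == 0) {
        out->length = -1;
        return 1;
    }
    out->length = strtol(len, &end, 10);
    return *end == '\0' && out->length >= 0;
}
```

Every format change (a new field, a different timestamp layout) means editing the format string and the struct by hand, which is the maintenance burden the slide points at.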

16

Why not use traditional parsers?


Overall, a very heavy-weight solution


people just do not do it


specifying a lexer and parser separately can be a barrier


data specs as Lex and Yacc files are relatively complicated


lexing and parsing tools only solve a small part of the problem


internal data structures built by hand


printer by hand


transforms by hand


viewers by hand


query engine by hand


Error processing is fairly rigid


We can do better!

17

Enter Pads


Pads: a system for Processing Ad hoc Data Sources


Two main components:


a data description language


for concise and precise specifications of ad hoc data formats and properties


a compiler that automatically generates a suite of data processing tools


robust libraries for C programming


parser that flags all errors and automatically recovers


printing utilities


an interface that allows users to query ad hoc data


converter to XML


a statistical profiler


collects stats on common values appearing in all parts of the data; records error stats


visual interface & viewer (coming soon!)

18

The rest of the talk


Introduction to ad hoc data sources (check)


Pads Tools


Pads Language


Pads Semantics


Wrap-up


19

Pads Tool Generation Architecture

[Diagram: the Pads compiler takes a Gene Ontology description and generates tools that each run over the gene data: a statistical profiler tool (producing a profile such as "ACE 25%, BKJ 25%, ..."), an XML formatter tool (producing XML elements), and a viewer tool.]

20

Pads Tool Generation Architecture

[Diagram: the Pads compiler turns the Gene Ontology description into a generated Gene Ontology parser; the generated parser, the Pads base library, and glue code for the statistical profile together make up the Gene Ontology statistical profiler.]

21

Pads Programmer Tools

[Diagram: the Pads compiler turns the Gene Ontology description into a generated Gene Ontology parser; the generated parser and the Pads base library are combined with an ad hoc user program written in C.]

22

The Statistical Profiler Tool


for each part of a data source, the profiler reports errors & most common values.


from example weblog data:

<top>.length : uint32
+++++++++++++++++++++++++++++++++++++++++++
good: 53544    bad: 3824    pcnt-bad: 6.666
min: 35    max: 248591    avg: 4090.234
top 10 values out of 1000 distinct values:
tracked 99.552% of values
  val: 3082    count: 1254    %-of-good: 2.342
  val: 170     count: 1148    %-of-good: 2.144
  val: 43      count: 1018    %-of-good: 1.901

.....


23

The Statistical Profiler Tool


ad hoc data is often poorly documented or out-of-date


even the documentation of weblog data from our textbook was missing some information:


good: 53544    bad: 3824    pcnt-bad: 6.666


web servers sometimes return a '-' instead of the length in bytes, which wasn't mentioned in the textbook


data descriptions can be written in an iterative fashion


use the profiler at each stage to uncover additional information about the data and refine the description
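
As a rough illustration of where numbers like good, bad and pcnt-bad come from, the sketch below accumulates them over raw content-length fields. It is not the Pads profiler; the type `length_profile` and the three functions are hypothetical names, and a field counts as "bad" simply when it is not a plain decimal number (for example the undocumented '-').

```c
#include <limits.h>
#include <stdlib.h>

/* Hypothetical accumulator, loosely mirroring the fields of the report. */
typedef struct {
    unsigned long good, bad;
    unsigned long min, max;
    double sum;                 /* sum of good values, for the average */
} length_profile;

void profile_init(length_profile *p) {
    p->good = p->bad = 0;
    p->min = ULONG_MAX;
    p->max = 0;
    p->sum = 0.0;
}

/* Record one raw field: a decimal number is "good", anything else is "bad". */
void profile_add(length_profile *p, const char *field) {
    char *end;
    unsigned long v;
    if (field[0] < '0' || field[0] > '9') { p->bad++; return; }
    v = strtoul(field, &end, 10);
    if (*end != '\0') { p->bad++; return; }
    p->good++;
    if (v < p->min) p->min = v;
    if (v > p->max) p->max = v;
    p->sum += (double)v;
}

/* Percentage of all fields seen so far that were bad. */
double profile_pcnt_bad(const length_profile *p) {
    unsigned long total = p->good + p->bad;
    return total ? 100.0 * (double)p->bad / (double)total : 0.0;
}
```

With the slide's counts, 3824 bad fields out of 53544 + 3824 gives the reported 6.666%.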

Pads Language

25

PADS language


Based on Type Theory


in most modern programming languages, types (int, bool, struct, object ...) describe program data


the source of most of my research


in Pads, types describe


physical data formats,


semantic properties of data, and


a mapping into an internal program representation (i.e., a parser)


Can describe ASCII, binary, and mixed data formats.

26

PADS language


Basic Types


Rich and extensible.


Pint8, Puint8, Pint16, ...


Pstring(:term-char:)


Pstring_FW(:size:)


Pstring_ME(:reg_exp:)


Pdate, ...


Supports user-defined compound types to describe data source structure:


Pstruct, Parray, Punion, Ptypedef, Penum

27

Example: CLF web log


Common Log Format from Web Protocols and Practice (Krishnamurthy and Rexford)



Fields:


IP address of remote host


Remote identity (usually '-' to indicate name not collected)


Authenticated user (usually '-' to indicate name not collected)


Time associated with request


Request


Response code


Content length


207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013

28

Example: Pstruct


For reading a sequence of different data elements:

Pstruct http_weblog {
  host client;                      /- Client requesting service
  ' '; auth_id remoteID;            /- Remote identity
  ' '; auth_id auth;                /- Name of authenticated user
  " ["; Pdate(:']':) date;          /- Timestamp of request
  "] "; http_request request;       /- Request
  ' '; Puint16_FW(:3:) response;    /- 3-digit response code
  ' '; Puint32 contentLength;       /- Bytes in response
};

207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013

29

Example: Punion


Punion auth_id {
  Pchar unavailable : unavailable == '-';
  Pstring(:' ':) id;
};


Union declarations allow the user to describe variations.


Implementation tries branches in order.


Stops when it finds a branch whose constraints are all true.


207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
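
The "try branches in order" rule can be sketched by hand in C as follows. This is only an illustration of the behaviour described above, not the code Pads generates; `auth_tag`, `auth_id_rep` and `parse_auth_id` are hypothetical names, and treating an empty string as an error is an assumption of this sketch.

```c
#include <string.h>

/* Hypothetical in-memory form of auth_id: which branch matched, plus the id text. */
typedef enum { AUTH_UNAVAILABLE, AUTH_ID, AUTH_ERROR } auth_tag;

typedef struct {
    auth_tag tag;
    char id[64];
} auth_id_rep;

/* Try the branches in declaration order; return the number of bytes consumed. */
size_t parse_auth_id(const char *in, auth_id_rep *out) {
    size_t n;

    /* Branch 1: Pchar unavailable : unavailable == '-' */
    if (in[0] == '-') {
        out->tag = AUTH_UNAVAILABLE;
        out->id[0] = '\0';
        return 1;
    }

    /* Branch 2: Pstring(:' ':) id, i.e. everything up to the next space */
    n = strcspn(in, " ");
    if (n > 0 && n < sizeof out->id) {
        memcpy(out->id, in, n);
        out->id[n] = '\0';
        out->tag = AUTH_ID;
        return n;
    }

    /* No branch matched. */
    out->tag = AUTH_ERROR;
    out->id[0] = '\0';
    return 0;
}
```

Because the first matching branch wins, the order of the branches in the declaration matters: swapping them would make the string branch accept "-" as an ordinary id.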

30

Example: Parray


Parray nIP {
  Puint8[4] : Psep('.') && Pterm(' ');
};


Array declarations allow the user to specify:


Size (fixed, lower-bounded, upper-bounded, unbounded)


Boolean-valued constraints


Psep and Pterm predicates

Array terminates upon reaching EOF/EOR, reaching the terminator, or reaching maximum size.

207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013
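
A hand-written equivalent of the nIP description might look like the sketch below: four elements, a '.' separator between them, and a ' ' terminator after the last. The function name `parse_nip` is hypothetical, and leaving the terminator unconsumed is an assumption of this sketch rather than a statement about the generated parser.

```c
#include <stddef.h>

/* Parse exactly four Puint8 elements separated by '.' and followed by ' '.
   Returns the number of bytes consumed (terminator excluded), or 0 on error. */
size_t parse_nip(const char *in, unsigned char out[4]) {
    size_t pos = 0;
    int i;
    for (i = 0; i < 4; i++) {
        unsigned value = 0;
        size_t digits = 0;
        /* One array element: up to three decimal digits. */
        while (in[pos] >= '0' && in[pos] <= '9' && digits < 3) {
            value = value * 10 + (unsigned)(in[pos] - '0');
            pos++;
            digits++;
        }
        if (digits == 0 || value > 255) return 0;   /* not a Puint8 */
        out[i] = (unsigned char)value;
        if (i < 3) {
            if (in[pos] != '.') return 0;            /* separator missing */
            pos++;
        }
    }
    if (in[pos] != ' ') return 0;                    /* terminator missing */
    return pos;
}
```

On the sample line, the call consumes the 13 characters of `207.136.97.50` and stops at the following space.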

31

Example: User constraints


int checkVersion(http_v version, method_t meth) {


if ((version.major == 1) && (version.minor == 0)) return 1;


if ((meth == LINK) || (meth == UNLINK)) return 0;


return 1;

}


Pstruct http_request {
  '\"'; method_t meth;                               /- Request method
  ' '; Pstring(:' ':) req_uri;                       /- Requested uri.
  ' '; http_v version : checkVersion(version, meth);
                                                     /- HTTP version number of request
  '\"';
};
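
The constraint function can be exercised on its own to see which requests it accepts. In the sketch below, `method_t` and `http_v` are hypothetical stand-ins for the types Pads would derive from the description (their exact layout is an assumption); the body of `checkVersion` is the same as on the slide.

```c
/* Hypothetical stand-ins for the generated types. */
typedef enum { GET, PUT, POST, HEAD, DELETE, LINK, UNLINK } method_t;

typedef struct {
    int major;
    int minor;
} http_v;

/* Same logic as the slide's constraint function:
   any method is accepted under HTTP/1.0; under other versions,
   LINK and UNLINK are rejected. */
int checkVersion(http_v version, method_t meth) {
    if ((version.major == 1) && (version.minor == 0)) return 1;
    if ((meth == LINK) || (meth == UNLINK)) return 0;
    return 1;
}
```

For example, a LINK request passes with version 1.0 but fails with version 1.1, while GET passes with either.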

207.136.97.50 - - [15/Oct/1997:18:46:51 -0700] "GET /turkey/amnty1.gif HTTP/1.0" 200 3013

32

Example: Parameterization & Dependency


“Early” data often affects parsing of later data:


Lengths of sequences


Branches of switched unions


To accommodate this usage, we allow PADS types to be parameterized:

Pstruct packet_t (: Puint32 length :) {
  ...
  Pstring_FW(: length :) payload;
};
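
The dependency of later data on earlier data can be sketched in plain C as a length-prefixed read. This is an illustration only: the one-byte length field, the type `packet_rep` and the function `parse_packet` are assumptions made for the example, whereas the description above passes the length in as a Puint32 parameter.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical packet layout: a one-byte length, then that many payload bytes. */
typedef struct {
    unsigned char length;
    char payload[256];          /* room for 255 bytes plus a terminating NUL */
} packet_rep;

/* Returns bytes consumed, or 0 if the buffer is shorter than the header promises. */
size_t parse_packet(const unsigned char *buf, size_t buf_len, packet_rep *out) {
    if (buf_len < 1) return 0;
    out->length = buf[0];                        /* "early" data ...          */
    if (buf_len - 1 < out->length) return 0;
    memcpy(out->payload, buf + 1, out->length);  /* ... fixes how much to read */
    out->payload[out->length] = '\0';
    return 1 + (size_t)out->length;
}
```

The payload parser cannot be written without the value read just before it, which is the situation parameterized types are meant to capture.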

Pads Semantics


34

Semantics: The Big Picture


As a theorist, I want to be able to describe the meanings (semantics) of programs and programming languages


Why bother? What is the point?


communication


spread ideas, techniques and algorithms


often means extracting the essence of a language and reducing it to a simple set of mathematical relations


verification


prove properties of implementations


particularly security-relevant or safety-critical applications


generalization


the mathematics brings out the central principles and invariants


leads to more general, compositional, scalable solutions


it’s just fun


immensely satisfying to come up with the perfect formal system where all parts compose and blend seamlessly together


35

Semantics for Pads: Goals


Communication


Pads descriptions can be incorporated into just about any
language. ML? Java? Perl? Matlab?


Language designers need a precise specification to do so


Verification


In some cases, we find the implementation incomplete or
making arbitrary choices (eg: error correction semantics)


Every once in a while, the implementation is outright wrong (eg: array semantics)


Generalization


Semantics allows us to compare and contrast Pads with
related languages & add features (eg: intersection types
& overlays from PacketTypes; recursive types; more)

36

Semantics for Pads: Overview


Pads is a large language and if we tried to formalize the whole thing right from the get-go, we wouldn't succeed


we’d get lost in details and make mistakes


we’d be unable to structure our proofs of key properties


we wouldn't communicate the essential elements to our fellow researchers


Strategy:


pick out the key ingredients & eliminate the ugly, but unimportant details


develop an idealized version of the real language


each type in our idealized version of pads represents a single, simple
pure idea


each type composes with all others


we give a semantics to each individual construct; we get a semantics for
complex objects by putting several simple individual ones together

37

Semantics for Pads: Overview


Part 1: Specify idealized (abstract) syntax of types

T ::= True             (parse nothing successfully)
    | False            (parse nothing unsuccessfully)
    | {x:T | P(x)}     (constrained type; parse data as T and check P)
    | C (arg)          (parse parameterized base type; eg: string(:' ':))
    | T1 + T2          (union type; parse one or the other)
    | T1 & T2          (intersection type; parse data as both T1 and T2)
    | Σ x:T1.T2        (dependent pair; parse T1, call it x, then parse T2)
    | T seq(arg)       (sequence type; parse Ts until finding arg)
    | λ x.T            (type parameterized by argument x)
    | T (arg)          (parameterized type applied to argument)
    | hide T           (skip data described by T; eg: absorb '|')
    | spoof (arg)      (parse nothing; add arg to internal representation)

basics: True, False, {x:T | P(x)}, C (arg)

40

Semantics for Pads: Overview


Part 1: Specify idealized (abstract) syntax of types

T ::= True             (parse nothing successfully)
    | False            (parse nothing unsuccessfully)
    | {x:T | P(x)}     (constrained type; parse data as T and check P)
    | C (arg)          (parse parameterized base type; eg: string(:' ':))
    | T1 + T2          (union type; parse one or the other)
    | T1 & T2          (intersection type; parse data as both T1 and T2)
    | Σ x:T1.T2        (dependent pair; parse T1, call it x, then parse T2)
    | T seq(arg)       (sequence type; parse Ts until finding arg)
    | λ x.T            (type parameterized by argument x)
    | T (arg)          (parameterized type applied to argument)
    | absorb T         (skip data described by T; eg: absorb '|')
    | compute (arg)    (parse nothing; add arg to internal representation)

basics: True, False, {x:T | P(x)}, C (arg)
structured types: T1 + T2, T1 & T2, Σ x:T1.T2, T seq(arg)
parameterized types: λ x.T, T (arg)
transforms: absorb T, compute (arg)

41

Semantics for Pads: Overview


Part 2: Specify denotational semantics of types


in general, a denotational semantics describes one language (poorly understood) in terms of another language (better understood)


in our case, we specify the meaning of Pads types (poorly understood) in terms of the polymorphic λ-calculus (better understood, at least by me)

semantics(T) = λ bits.e

a parser function mapping external bits to data structures in the λ-calculus
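
The idea that each type denotes a parser function can be approximated in C with function pointers. The sketch below is not the formal semantics, only an analogy under simplifying assumptions: the internal representation is a single `long`, overflow is ignored, and every name (`parse_result`, `parser_fn`, `parse_constrained`, `parse_union`, and so on) is hypothetical. It covers two of the constructs: the constrained type {x:T | P(x)} and the union of two types.

```c
#include <stddef.h>

/* Hypothetical result of running a parser over a prefix of the input. */
typedef struct {
    int    ok;        /* 1 if the input matched the type */
    long   value;     /* internal representation (just a number here) */
    size_t consumed;  /* number of input bytes used */
} parse_result;

typedef parse_result (*parser_fn)(const char *bits);
typedef int (*predicate_fn)(long x);

/* A base type: an unsigned decimal integer. */
parse_result parse_uint(const char *bits) {
    parse_result r = { 0, 0, 0 };
    while (bits[r.consumed] >= '0' && bits[r.consumed] <= '9') {
        r.value = r.value * 10 + (bits[r.consumed] - '0');
        r.consumed++;
    }
    r.ok = r.consumed > 0;
    return r;
}

/* A base type: the single character '-' (represented as -1). */
parse_result parse_dash(const char *bits) {
    parse_result r = { 0, -1, 0 };
    if (bits[0] == '-') { r.ok = 1; r.consumed = 1; }
    return r;
}

/* Constrained type: parse the data as T, then check P on the result. */
parse_result parse_constrained(parser_fn t, predicate_fn p, const char *bits) {
    parse_result r = t(bits);
    if (r.ok && !p(r.value)) r.ok = 0;
    return r;
}

/* Union type: try T1; if it fails, try T2 on the same input. */
parse_result parse_union(parser_fn t1, parser_fn t2, const char *bits) {
    parse_result r = t1(bits);
    return r.ok ? r : t2(bits);
}

/* Example predicate: a plausible HTTP status code. */
int is_status(long x) { return x >= 100 && x <= 599; }
```

Because each construct is a small function built from the parsers of its parts, the meaning of a complex description is obtained by composing the meanings of simple ones, which is the compositional property the strategy aims for.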

42

Semantics for Pads: Overview


Part 3: Prove Pads has the required properties



Theorem: Parsers never generate "bad" internal representations of external data, i.e., representations are well-typed in the implementation language.


Theorem: Parsers check all semantic constraints.


Wrap-up

44

Challenges of Ad Hoc Data Revisited


Data arrives “as is”


Format determined by data source, not consumers.


The Pads language allows consumers to describe data in just
about any format.


Often has little documentation.


A Pads description can serve as documentation for data source.


The statistical profiler helps analysts understand data.


Some percentage of data is “buggy.”


Constraints allow consumers to express expectations about data.


Parsers check for errors and say where errors are located.


Ad hoc data is a rich source of information for chemists, biologists,
computer scientists, if they could only get at it.


Pads generates a collection of useful tools automatically from data
descriptions


Pads

is our answer to the challenge of ad hoc data sources.

45

Related work


DataScript [Back: GPCE 2002] & PacketTypes [McCann & Chandra: SIGCOMM 2000]


Primarily for networking data


Binary data formats only


Stop on first error


No value-added tools (Profiler; XML conversion; Query engine)


No semantics

46

Current and Future Work


Pads Language


recursion and pointers (eg: for tree- and graph-structured data)


integrated pre- and post-processing (eg: encryption, compression)


composition and reuse (via polymorphism, modules)


multi-source data integration


Pads Compiler


parsing and querying optimization (eg: dealing with massive data sets)


Pads Tools


new architecture for robust & reliable tool generation


application-specific customization


error correction, data normalization, ignoring or rearranging components


general data transformation


visual interface for nonprogrammers


Pads Applications


genomics data (with Olga Troyanskaya)


networking and telephony data (AT&T)


a great domain for interdisciplinary undergraduate research projects



47

Pads Summary


The overarching goal of Pads is to make understanding, querying and transforming ad hoc data an effortless task.


We do so with new programming language technology based on the principles of Type Theory.




AT&T Research:

Kathleen Fisher

Mary Fernandez

Joel Gottlieb

Robert Gruber (now Google)

Ricardo Medel (summer intern)

Princeton:

Joe Kovba (UGrad)

Yitzhak Mandelbaum (Grad)

David Walker

http://www.padsproj.org/

End!