Karma A Provenance Collection Framework for Data-driven ... - myGrid


Karma

Provenance Collection Framework for Data-driven Workflows

Yogesh Simmhan

Microsoft Research


Beth Plale, Dennis Gannon, Ai Zhang, Girish
Subramanian, Abhijit Borude, et al

Indiana University

Putting the ‘e’ in e-Science


Many scientific domains are moving to in silico experiments: Earth Sciences, Life Sciences, Astronomy


Common requirements


Complex & Dynamic Systems,


Adaptive Resources


Data Deluge


Need for Collaboration


Cyberinfrastructure to support these needs


Massively Parallel Systems


High Bandwidth Computer Networks


Petascale Data Archives


Grid Middleware provides the glue to tie these using a
Service Oriented Architecture

Workflows as Experiments


Data-driven applications designed as workflows


Data flows across applications as they are
transformed, fused and used generating
derived data


Control flows determine path to execute
but data flow determines data movement
and dependency


Manually keeping track of input & derived data to experiments is challenging given the number of data products and the complexity of the applications

Data Management Challenges


Complex, dynamic data-processing pipelines


Remote execution on Grid resources


How was a particular dataset created?



Collaboratory environments with shared resources


Large search space & missing metadata


How good is a given dataset for one’s
application?

Data Provenance


Metadata that describes the causality of an event


Along with context to interpret it


What, when, where, who, how, …


We consider provenance for


Workflow execution


Service invocations


Data products


Workflow & Service Provenance


Describes execution of a workflow & invocation of service


Data Provenance


Describes usage and generation of data products

Provenance /ˈprɒv ə nəns, -ˌnɑns/

The history or pedigree of a work of art, manuscript, etc. A record of the ultimate derivation and
passage of an item through its various owners.

Source: The Oxford English Dictionary

Benefits


What if the experiment fails?


Did the workflow run correctly? Completely?


Was the correct data/service/parameter used?


Verification, Validation


Can my peer run the experiment & get the
same result?


Repeatability


Can I use the results in my publication?


Attribution, Copyright


Can I trust the results of prediction?


Data Quality


How much did it cost? How much will it cost?


Resource Usage & Prediction

[7/43] [2007-08-16]

LEAD Science Gateway Architecture

User’s Grid Desktop

Grid Portal Server

Gateway Services: Proxy Certificate Server (Vault), Events & Messaging, Resource Broker, Community & User Metadata Catalog, Workflow engine, Resource Registry, Application Deployment

Core Grid Services: Execution Management, Information Services, Self Management, Data Services, Resource Management, Security Services

Resource Virtualization (OGSA)

Compute Resources, Data Resources, Instruments & Sensors

What is Karma

Provenance Framework


A standalone framework to collect data provenance for adaptive workflows with low overhead and lightweight schema, able to answer complex queries

Data Provenance is a form of metadata to track derivation history of data created by a workflow run executing across organizations (space) over a period of time

Data Usage: Move forward in time

Workflow trace: Inverse view from the actors

A Typical e-Science Experiment

Weather forecast using WRF in LEAD

Pre-Processing

Assimilation

Visualization

Forecast

Workflows

Abstract Workflow Model


Temporal & Spatial composition


Data Flow vs. Invocation Flow





Central vs. Distributed Orchestration


Assumption


Directed Graph of Service Nodes & Data Edges


Data Driven Applications


Hierarchical Composition: Workflows a form of Service


Workflow definition not required


Standalone, independent of Workflow System

Provides Port

Uses Port

Data Flow

Workflows

Simple & Complex Workflow Models

[Figure: a simple workflow model (Workflow Engine running Workflow WF with Services S1 and S2 and data products D1, D2, D3) and a complex workflow model (nested Workflows WF1 and WF2 with Services S1, S2, S3 and data products D1, D2, D3, D4)]



Activities

Collecting Provenance


Activities generated during lifecycle of workflow


“Sensors” generate activities: Instrumentation of
services, clients


Track execution across space, time, depth & operation


Space: which service


Time: when (logical time)


Depth: distance from invocation root (client » workflow »
service … nested workflows)


Operation: Track dataflow


18 activities defined

Support Dynamic, Adaptive Workflow

Karma Architecture

[Figure: Karma architecture. Labels recovered from the diagram:]

Instrumentation of Services & WF: WF Engine, Web Service, WS Client, each with WF Tracking

Workflow Instance: Workflow Engine invoking Service 1, Service 2, …, Service 9, Service 10; 10 Data Products Consumed & Produced by each Service (10P/10C)

Publish Provenance Activities as Notifications: Workflow Started & Finished Activities; Application Started & Finished, Data Produced & Consumed Activities

Message Bus: WS-Messenger Notification Broker, WS-Eventing Service API

Karma Provenance Service: Provenance Listener (Subscribe & Listen to Activity Notifications), Activity DB, Provenance Query API

Provenance Browser Client: Query for Workflow, Process, & Data Provenance

A Framework for Collecting Provenance in Data-Centric Scientific Workflows, Simmhan, Y., et al.; ICWS, 2006

Service Invocation State Diagram

[Figure: state diagram of a service invocation across the CLIENT and SERVICE lanes. States: Invoking Service, Service Invoked, Service Invocation Failed, Data Transfer In, Computation, Data Consumed, Data Produced, Data Transfer Out, Sending Result, Sending Fault, Received Response. Phases: Start, I/P Staging, Computation, O/P Staging, End.]

Activities

Types & Source

Activity                                 Generated By   Type
[Service | Workflow] Initialized         Service        Independent
[Service | Workflow] Terminated          Service        Independent
Invoking Service                         Client         Bounding
Service Invoked                          Service        Bounding
Invoking Service [Succeeded | Failed]    Client         Bounding
Data Transfer                            Service        Operational
Computation                              Service        Operational
Data Produced                            Service        Operational
Data Consumed                            Service        Operational
Sending [Result | Fault]                 Service        Bounding
Received [Result | Fault]                Client         Bounding
Sending Response [Succeeded | Failed]    Service        Bounding
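Reading each bracketed alternative in the activity list as a separate activity appears to account for the "18 activities defined" stated earlier. The sketch below only checks that arithmetic; the pairing of each activity with its source and type is an assumption read off the flattened slide table, and the names `ACTIVITY_ROWS` and `expand` are illustrative, not part of Karma.

```python
import re
from collections import Counter
from itertools import product
from typing import List

# (activity pattern, generated by, type), as read from the slide table.
# The row-to-type pairing is an assumption based on the order of the columns.
ACTIVITY_ROWS = [
    ("[Service | Workflow] Initialized", "Service", "Independent"),
    ("[Service | Workflow] Terminated", "Service", "Independent"),
    ("Invoking Service", "Client", "Bounding"),
    ("Service Invoked", "Service", "Bounding"),
    ("Invoking Service [Succeeded | Failed]", "Client", "Bounding"),
    ("Data Transfer", "Service", "Operational"),
    ("Computation", "Service", "Operational"),
    ("Data Produced", "Service", "Operational"),
    ("Data Consumed", "Service", "Operational"),
    ("Sending [Result | Fault]", "Service", "Bounding"),
    ("Received [Result | Fault]", "Client", "Bounding"),
    ("Sending Response [Succeeded | Failed]", "Service", "Bounding"),
]


def expand(pattern: str) -> List[str]:
    """Expand every [A | B] group in a pattern into concrete activity names."""
    # re.split with a capturing group alternates literal text and group contents.
    parts = re.split(r"\[([^\]]+)\]", pattern)
    choices = [
        [option.strip() for option in part.split("|")] if i % 2 else [part]
        for i, part in enumerate(parts)
    ]
    return ["".join(combo).strip() for combo in product(*choices)]


ALL_ACTIVITIES = [
    (name, source, kind)
    for pattern, source, kind in ACTIVITY_ROWS
    for name in expand(pattern)
]
COUNT_BY_TYPE = Counter(kind for _, _, kind in ALL_ACTIVITIES)

print(len(ALL_ACTIVITIES))
print(dict(COUNT_BY_TYPE))
```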


Provenance Framework in Support of Data Quality Estimation

Activities

Sequence Diagram for Basic Workflow

[Figure: sequence diagram of a Client invoking a Service that consumes D1 and produces D2. Axes: Time, Space (Client, Service), Depth, Operation (Transfer, Consume, Compute, Produce). Activities in time order, with C = Client and S = Service:]

S: Initialize
C: Invoke Service
S: Invoked
C: Invocation Successful
S: Transfer Input Data D1
S: Consume Data D1
S: Perform Computation
S: Produce Data D2
S: Transfer Output Data D2
S: Send Response
C: Receive Response
S: Send Response Successful
S: Terminate


Sequence Diagram for Simple Workflow

[Figure: sequence diagram of Workflow WF, run by the Workflow Engine, invoking Service S1 and then Service S2 over data products D1, D2, D3. Axes: Time, Space (WF, S1, S2), Depth, Operation (Consume, Produce). Activities in time order:]

S1,S2,WF: Initialize
WF: Invoke Service S1
S1: Invoked
WF: Invocation Successful
S1: Consume Data D1
S1: Produce Data D2
S1: Send Response
WF: Receive Response
S1: Send Response Successful
WF: Invoke Service S2
S2: Invoked
WF: Invocation Successful
S2: Consume Data D2
S2: Produce Data D3
S2: Send Response
WF: Receive Response
S2: Send Response Successful
S1,S2,WF: Terminate


Activities

Naming


Uniquely identifying data & services is critical for
provenance


Data product has GUID. Replicas have URLs.


Service & Workflow instances have GUID


Services defined in the context of workflows have a
Node ID in the workflow name space


Clients have GUID


Entity: 4-tuple

<Workflow ID, Service ID, Node ID, Timestep>

Invocation: 2-tuple

<Invoker Entity, Invokee Entity>
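The two tuples can be sketched as plain immutable records. This is only an illustration of the naming scheme: the class and field names are hypothetical, the ID values are borrowed from the XML activity example in this deck, and leaving the invoker's workflow, node and timestep empty is an assumption made for brevity.

```python
from typing import NamedTuple, Optional


class Entity(NamedTuple):
    """<Workflow ID, Service ID, Node ID, Timestep>; field names are hypothetical."""
    workflow_id: Optional[str]
    service_id: str
    node_id: Optional[str]
    timestep: Optional[int]


class Invocation(NamedTuple):
    """<Invoker Entity, Invokee Entity>."""
    invoker: Entity
    invokee: Entity


WORKFLOW_ID = (
    "tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1"
)

# The workflow instance acts as the invoker; it is itself identified as a service.
workflow_entity = Entity(None, WORKFLOW_ID, None, None)

# The invoked service, named within the workflow's namespace by its Node ID.
convert_entity = Entity(
    workflow_id=WORKFLOW_ID,
    service_id="urn:qname:http://www.extreme.indiana.edu/karma/challenge06:ConvertService",
    node_id="ConvertService_4",
    timestep=36,
)

invocation = Invocation(invoker=workflow_entity, invokee=convert_entity)
print(invocation.invokee.node_id, invocation.invokee.timestep)
```

Because both records are tuples, two activities that name the same entity compare equal, which is the property a provenance store would need to stitch activities together.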

Activities

Provenance Activity Contents


Activity Type


Source Entity: 4-tuple

<Workflow ID, Service ID, Node ID, Timestep>

Remote Entity: 4-tuple


Attributes


todo


Annotations

Activities

Modeling Activities in XML

<serviceInvoked xmlns="http://lead.extreme.indiana.edu/namespaces/2006/06/workflow_tracking">
  <notificationSource
      workflowNodeID="ConvertService_4"
      workflowTimestep="36"
      workflowID="tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1"
      serviceID="urn:qname:http://www.extreme.indiana.edu/karma/challenge06:ConvertService" />
  <timestamp>2006-09-10T23:56:28.677Z</timestamp>
  <description>Convert Service was Invoked</description>
  <request><header>...</header><body>...</body></request>
  <initiator
      serviceID="tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1" />
</serviceInvoked>

<dataProduced xmlns="http://lead.extreme.indiana.edu/namespaces/2006/06/workflow_tracking">
  <notificationSource
      workflowNodeID="ConvertService_4"
      workflowTimestep="36"
      workflowID="tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1"
      serviceID="urn:qname:http://www.extreme.indiana.edu/karma/challenge06:ConvertService" />
  <timestamp>2006-09-10T23:56:32.324Z</timestamp>
  <dataProduct>
    <id>lead:uuid:1157946992-atlas-x.gif</id>
    <location>gsiftp://tyr1.cs.indiana.edu/tmp/20060910235628_Convert/outputData/atlas-x.gif</location>
    <timestamp>2006-09-10T23:56:32.324Z</timestamp>
  </dataProduct>
</dataProduced>
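A consumer of such an activity might pull out the source entity and the data product roughly as follows. This is a sketch using Python's standard `xml.etree.ElementTree`, not Karma's own parsing code; it assumes the `dataProduced` fragments are reassembled into well-formed XML with straight quotes, and the function name and the keys of the returned dictionary are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the slide; the "wt" prefix is an arbitrary local alias.
NS = {"wt": "http://lead.extreme.indiana.edu/namespaces/2006/06/workflow_tracking"}

DATA_PRODUCED = """\
<dataProduced xmlns="http://lead.extreme.indiana.edu/namespaces/2006/06/workflow_tracking">
  <notificationSource
      workflowNodeID="ConvertService_4"
      workflowTimestep="36"
      workflowID="tag:gpel.leadproject.org,2006:69B/ProvenanceChallengeBrainWorkflow17/instance1"
      serviceID="urn:qname:http://www.extreme.indiana.edu/karma/challenge06:ConvertService" />
  <timestamp>2006-09-10T23:56:32.324Z</timestamp>
  <dataProduct>
    <id>lead:uuid:1157946992-atlas-x.gif</id>
    <location>gsiftp://tyr1.cs.indiana.edu/tmp/20060910235628_Convert/outputData/atlas-x.gif</location>
    <timestamp>2006-09-10T23:56:32.324Z</timestamp>
  </dataProduct>
</dataProduced>
"""


def parse_data_produced(xml_text: str) -> dict:
    """Extract the source entity and data product from a dataProduced activity."""
    root = ET.fromstring(xml_text)
    source = root.find("wt:notificationSource", NS)
    product = root.find("wt:dataProduct", NS)
    return {
        # root.tag looks like "{namespace}dataProduced"; keep the local name.
        "activity": root.tag.split("}", 1)[1],
        "workflow_id": source.get("workflowID"),
        "service_id": source.get("serviceID"),
        "node_id": source.get("workflowNodeID"),
        "timestep": int(source.get("workflowTimestep")),
        "data_id": product.findtext("wt:id", namespaces=NS),
        "location": product.findtext("wt:location", namespaces=NS),
    }


record = parse_data_produced(DATA_PRODUCED)
print(record["activity"], record["data_id"])
```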

Activities

Publishing Activities as Notifications


Activities are modeled as notifications that
are sent by different components


Loosely coupled, easy to generate provenance


XML Representation of provenance
activities


WS-Messenger Notification Broker acts as message bus

WS-Eventing & WS-Notification


Provenance service & interested clients
subscribe to notification
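The loose coupling described here can be illustrated with a tiny in-process publish/subscribe stand-in. It is not the WS-Messenger, WS-Eventing or WS-Notification API; the class, topic name and callbacks are all hypothetical, and real notifications would be XML messages sent over the network rather than Python dictionaries.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Activity = Dict[str, str]


class NotificationBroker:
    """In-process stand-in for a notification broker acting as a message bus."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic: str, callback: Callable[[Activity], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, activity: Activity) -> None:
        # The publisher does not know who is listening, or whether anyone is.
        for callback in list(self._subscribers.get(topic, [])):
            callback(activity)


broker = NotificationBroker()

# The provenance service stores every activity it hears about.
activity_store: List[Activity] = []
broker.subscribe("workflow-tracking", activity_store.append)

# Another interested client (for example a monitor) subscribes independently.
monitor_log: List[str] = []
broker.subscribe("workflow-tracking", lambda a: monitor_log.append(a["type"]))

# An instrumented service publishes activities without referencing either subscriber.
broker.publish("workflow-tracking", {"type": "serviceInvoked", "node": "ConvertService_4"})
broker.publish("workflow-tracking", {"type": "dataProduced", "node": "ConvertService_4"})

print(len(activity_store), monitor_log)
```

The point of the design is visible in the last lines: the publishing side never changes when subscribers are added or removed.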

Backend

Provenance Database


~Union of provenance model


Provenance incrementally built


Relational database (MySQL)

Information Model

Data Provenance View


Data Provenance


Entity is the state of a service or a client


Invocation relates a client (invoker) to a service (invokee).
Status.


Data provenance of produced data relates invocation with consumed data


Lightweight schema

Karma2: Provenance Management for Data Driven Workflows, Simmhan, Y., et al.; J. Web Svc. Res., 2008
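One way to picture the data provenance view is as three small relations: invocations, data produced, and data consumed. The sketch below is an assumption-laden illustration, not the Karma schema: the table and column names are invented, SQLite stands in for MySQL so the example is self-contained, and the rows are the simple workflow from the earlier sequence diagram (S1 consumes D1 and produces D2; S2 consumes D2 and produces D3).

```python
import sqlite3


def build_store() -> sqlite3.Connection:
    """Create an in-memory store holding the simple two-service workflow."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE invocation (
            id INTEGER PRIMARY KEY, invoker TEXT, invokee TEXT, status TEXT);
        CREATE TABLE data_produced (data_id TEXT, invocation_id INTEGER);
        CREATE TABLE data_consumed (data_id TEXT, invocation_id INTEGER);
        """
    )
    db.executemany(
        "INSERT INTO invocation VALUES (?, ?, ?, ?)",
        [(1, "WF", "S1", "Successful"), (2, "WF", "S2", "Successful")],
    )
    db.executemany("INSERT INTO data_consumed VALUES (?, ?)", [("D1", 1), ("D2", 2)])
    db.executemany("INSERT INTO data_produced VALUES (?, ?)", [("D2", 1), ("D3", 2)])
    return db


def data_provenance(db: sqlite3.Connection, data_id: str) -> list:
    """Return (invokee, consumed data) pairs for the invocation that produced data_id."""
    return db.execute(
        """
        SELECT i.invokee, c.data_id
        FROM data_produced p
        JOIN invocation i ON i.id = p.invocation_id
        LEFT JOIN data_consumed c ON c.invocation_id = i.id
        WHERE p.data_id = ?
        ORDER BY c.data_id
        """,
        (data_id,),
    ).fetchall()


def data_usage(db: sqlite3.Connection, data_id: str) -> list:
    """Return the services whose invocations consumed data_id."""
    rows = db.execute(
        """
        SELECT i.invokee
        FROM data_consumed c
        JOIN invocation i ON i.id = c.invocation_id
        WHERE c.data_id = ?
        ORDER BY i.id
        """,
        (data_id,),
    )
    return [row[0] for row in rows]


store = build_store()
print(data_provenance(store, "D3"))
print(data_usage(store, "D2"))
```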

[Figure: data provenance view. Labels: Client ENTITY (Invoker), Service ENTITY (Invokee), Request, Response]

Information Model

Data Provenance & Usage Views

[Figure: data provenance and usage views. Labels: Client ENTITY (Invoker), Service ENTITY (Invokee), Request, Response]

Information Model

Workflow & Process Provenance Views

[Figure: workflow and process provenance views. Labels: Client ENTITY (Invoker), Service ENTITY (Invokee), Request, Response]

Dissemination

Querying Provenance


All 5 provenance models can be queried by ID

Data Provenance (by Data ID)

Recursive Data Provenance (by Data ID, depth)

Data Usage (by Data ID)

Process provenance (by Invoker & Invokee)

Workflow Trace (by Invoker & Invokee, depth)

Service API to query and return results as XML Document

Provenance Challenge Workshop

Direct API, Incremental client, Graph matching algorithm

Incremental building of complex queries

Query Capabilities of the Karma Provenance Framework, Simmhan, Y., et al.; 1st Provenance Challenge & CCPE J., 2007
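The role of the depth argument in the recursive data provenance query can be sketched with a few lines of recursion. This is an illustration only, not the Karma query API: the function name, the dictionary layout and the result shape are hypothetical (the real service returns XML documents), and the data is again the simple workflow in which S1 turns D1 into D2 and S2 turns D2 into D3.

```python
from typing import Dict, List, Tuple

# data product -> (service that produced it, data products that service consumed)
PRODUCED_BY: Dict[str, Tuple[str, List[str]]] = {
    "D2": ("S1", ["D1"]),
    "D3": ("S2", ["D2"]),
}


def recursive_data_provenance(data_id: str, depth: int,
                              produced_by: Dict[str, Tuple[str, List[str]]] = PRODUCED_BY) -> dict:
    """Follow the derivation history of data_id back through at most `depth` steps."""
    if depth <= 0 or data_id not in produced_by:
        # Either the depth budget is spent or this is an input with no recorded producer.
        return {"data": data_id}
    service, inputs = produced_by[data_id]
    return {
        "data": data_id,
        "produced_by": service,
        "consumed": [
            recursive_data_provenance(item, depth - 1, produced_by) for item in inputs
        ],
    }


one_step = recursive_data_provenance("D3", 1)
two_steps = recursive_data_provenance("D3", 2)
print(one_step)
print(two_steps)
```

With depth 1 the answer stops at D2; with depth 2 it also names S1 and D1, which is the difference between plain and recursive data provenance in the list above.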

Applications: Process Monitoring

Realtime Monitoring using XBaya

Applications: Information Integration

Visual Exploration using Karma GUI

Performance & Scalability Study

Experimental Setup

[Figure: experimental setup. Provenance Clients on odin001 to odin128 (Dual-Processor 2.0 GHz 64-bit Opteron, 4GB RAM) Generate Provenance and Query Provenance over a Gbps Network against services on tyr10 to tyr13 (Dual-Processor 2.0 GHz 64-bit Opteron, 16GB RAM, Local IDE disk): Karma, WS-Messenger Broker, MySQL, and PReServ in Tomcat 5.0 with Embedded Java DB]


Karma Service, WS-Messenger Notification Broker, MySQL

PReServ in Tomcat 5.0 container

Tyr web-services cluster (16 Nodes)

Odin computer cluster (128 Nodes)

Gigabit Ethernet, local IDE disk storage

SLURM job manager for parallel job submission on Odin

Java 1.5, Jython

Provenance Service Components


Performance & Scalability Study

Collecting Provenance


Comparative Study of Karma with PReServ (U. Soton)

Provenance services on tyr (2 GHz/16GB/64bit) & clients on odin (2 GHz/4GB/64bit)

Time to collect provenance activities synchronously

1. Single service with increasing number of service invocations

Karma scales linearly

2. Linear workflow with increasing number of data produced/consumed

Karma scales linearly, PReServ constant

[Figure: Time to Record Provenance for Service Invocations (in Seconds) vs. Number of Service Invocations (25 to 250) and Cumulative Number of Invocations present in Provenance Service (1250 to 68750; x22 Karma activities; x4 PReServ assertions). Series: Karma, PReServ]

[Figure: Average Time to Record Provenance for Linear Workflow (in Seconds) vs. Number of Input/Output Data Products per Service in Linear Workflow (25 to 250) and Total Number of Input/Output Data Products in Linear Workflow (250 to 2500). Series: Karma, PReServ]

Performance Evaluation of the Karma Provenance Framework, Simmhan, Y., et al.; IPAW & LNCS 4145, 2006


Performance & Scalability Study

Collecting Provenance

[Figure: Average Time to Record Provenance per WRF Workflow (in Seconds) and Total Time to Record Provenance for all WRF Workflows (in Seconds) vs. Number of Simultaneous WRF Workflows (1 to 20). Series: Average Time (Karma), Average Time (PReServ), Total Time (Karma), Total Time (PReServ)]

Performance Evaluation of the Karma Provenance Framework, Simmhan, Y., et al.; IPAW & LNCS 4145, 2006


Time to collect provenance from simulated ensemble WRF forecasting workflow

Scalability with increasing # of parallel runs

1 to 20 concurrent workflows

Karma scales sub-linearly


Performance & Scalability Study

Querying Provenance

[Figure: Average Query Response Time per Client (in Seconds) and Total Query Response Time for All Clients (in Seconds) vs. Number of Simultaneous Query Clients (0 to 50). Series: Workflow Trace Query, Process Provenance Query, Data Provenance Query, each as Average Time and Total Time]

Performance Evaluation of the Karma Provenance Framework, Simmhan, Y., et al.; IPAW & LNCS 4145, 2006


Response time to query workflow, process, and data provenance from Karma (PReServ was an order of magnitude slower)

Scalability with increasing # of concurrent clients

Karma contains 1000 workflow invocations

Query for 20 workflow/200 process/200 data provenance documents

Related Work


PReServ, U. of Southampton (Luc Moreau)
Standalone, Annotation support
No data provenance, workflow concept; poor performance

VisTrails, U. of Utah (Juliana Freire)
Workflows for graphical modeling
Constrained to browser

PASS, Harvard U. (Margo Seltzer)
System level provenance
No service/data abstraction

Trio, Stanford U. (Jennifer Widom)
Tuple level provenance on Database operations
Restricted to databases

Data Collector, IBM (alphaworks)
Automatically record & track SOAP Messages
No data provenance

What is new in Karma 3?


Process control flow tracking


Vertical integration across applications


Support for database queries


Process & data abstraction


Mining provenance logs


WF composition


Semantic support (S-OGSA)