Golden-Trail: A Provenance Repository For Storing And Retrieving Data Lineage Information




Saumen Dey, Michael Agun, Michael Wang, Bertram Ludäscher, Shawn Bowers, and Paolo Missier

DataONE Provenance & Workflow Working Group (ProvWG)








Table of Contents

1  INTRODUCTION
2  USE CASE
3  PROVENANCE MODEL
4  GOLDEN-TRAIL APPLICATION
4.1  User Interface
4.1.1  Upload Trace File
4.1.2  Query Builder
4.1.3  Query Result
4.2  Trace Parser
4.3  Graph Visualization
4.4  Data Store
5  GOLDEN-TRAIL ARCHITECTURE
5.1  GWT (Google Web Toolkit) Framework
5.2  Graphviz (DOT)
5.3  JIT (JavaScript Infovis Toolkit)
5.4  Tomcat (Web and Application Server)
5.5  Database Server
6  DATAONE PROVENANCE QUERY LANGUAGE (DPQL)
7  QUERY EVALUATION
8  TECHNICAL DISCUSSION
9  CONCLUSION AND FUTURE WORK
10  REFERENCES
11  APPENDIX
11.1  Graph Database (Neo4j) Implementation Detail
11.2  Relational Database (MySQL) Implementation Detail
11.3  Server-Side Implementation Detail


1 INTRODUCTION

A scientific experiment may involve many domain scientists, various workflow systems (e.g., ASKALON, Galaxy, Kepler, Taverna, Pegasus, VisTrails), and computational and data analysis systems (e.g., KNIME, MATLAB, R). Scientists need to share data items for reuse so that every experiment does not have to start from raw data collection. The problem is that a data item by itself is not sufficient to understand it; further information is needed, e.g., how it was created. The data lineage, a.k.a. the provenance, captures the detailed processing history of the data. Thus, provenance data can be used to better understand the data.


Provenance is extremely valuable because it can be used to interpret, validate, debug, and repeat results ("reproducible science"). Most scientific workflow systems capture provenance, but this provenance data has limitations that prevent scientists from using it. First, different workflow systems use different provenance models to capture provenance data, which makes provenance very difficult to share. Second, there is a lack of global data identifiers, which makes it impossible to view the complete provenance when a scientific experiment is designed across multiple workflows. Third, there can be "provenance gaps," i.e., missing provenance, because one or more steps (for example, manual data manipulation steps) do not capture provenance. These problems can make understanding the complete provenance of a data item difficult. Current community standards, e.g., OPM and the W3C model, provide only a minimal "core provenance model" that does not capture workflow-specific aspects. A further problem is that there is no common provenance repository in which provenance data could be stored and shared.


What we need are (i) a common provenance model, so that provenance data captured by a workflow system can be mapped to this model and data in this common model can be mapped back to other provenance models; (ii) a provenance repository where scientists can submit their provenance data, view others' provenance, and download it into their local environment for further use; (iii) common identifiers, so that individual provenance data can be linked to provide a complete view of a scientific experiment; (iv) a query language with which users can (a) find the owner of a data item, actor/process, or workflow, (b) get the derivation history of a data item and select parts of it, and (c) view dependencies among users, workflows, etc.; and (v) a rich user interface with which scientists can interact with the provenance repository to (a) submit provenance data (in the form of trace files), (b) view and download others' provenance data, and (c) interactively query the repository.



In the 2010 summer project (DToL: Data Tree of Life), we investigated the common identifier issue that arises when a scientific experiment is divided among multiple workflows, and we developed a set of "stitching" techniques. We published this work in our [WORKS 2010] paper and developed a prototype to demonstrate these techniques, showing interoperability among Kepler, CoMaD, and Taverna. The source code, executables, and further details are provided in [DToL url].


In this year's summer project (Golden-Trail), we focused on (i) a common provenance model, (ii) a provenance repository, and (iii) an interactive user interface (Golden-Trail). We used the OPM (Open Provenance Model) as the basis of our common provenance model (D-OPM, DataONE Provenance Model) and extended it to add workflow-specific features and complex data structures. Golden-Trail allows users to upload provenance data (in the form of trace files) and converts the provenance model used into D-OPM while loading the data. It also allows users to query and visualize provenance data at different levels (e.g., user, workflow, run), including run-level dependencies (outside provenance) and invocation-level dependencies (inside provenance).


2 USE CASE

Our experimental testbed consists of a suite of pre-existing Kepler workflows, prepared from the "Tree of Life"/pPOD project (2008b). The pPOD testbed includes a suite of workflows for performing various phylogenetic analyses, using a library of reusable components for aligning biological sequences and inferring phylogenetic trees based on molecular and morphological data. The workflows are divided into various subtasks that can be run independently as smaller, exploratory workflows for testing different parameters and algorithms, or combined into larger workflows for automating multiple data access, tree inference, and visualization steps. A number of the smaller workflows within pPOD are designed explicitly to be run over output generated from other workflows within the suite.



Figure 1: Phylogenetics workflow (top) with provenance trace (bottom) from the Kepler/pPOD package using the COMAD module.


Having demonstrated provenance interoperability and integration as part of a previous effort [WORKS 2010], the emphasis here is less on experimenting with specific provenance integration techniques. Instead, we focused on populating the repository using multiple executions of multiple workflow fragments, each related to the others through their input, output, and sometimes intermediate data products, and on testing query functionality to extract Golden-Trails from the repository. More specifically, we demonstrate query capability with different views of the result, including returning and rendering all or a portion of a run graph, where nodes represent whole workflow runs, possibly with data nodes as intermediate connections, emphasizing the lineage of data across different e-science infrastructures.


To demonstrate all the query capabilities, we developed the following synthetic experiment involving three workflows. Two scientists (user1 and user2) participated in this experiment. The dependencies among the workflows are shown in Figure 2. The first workflow (wf1) was executed first; the second (wf2) and third (wf3) workflows then used output data items from wf1's execution.



Figure 2: wf2 (second workflow) and wf3 (third workflow) use data items from wf1 (first workflow)


During the execution of each of these workflows, the respective workflow systems capture processing histories in trace files. Many existing workflow systems can capture true dependencies (i.e., inside provenance), whereas some workflow systems can only capture the inputs to a run (a workflow execution) and the output data items that the run generated. Our provenance repository needs to load both types of provenance data.


The scientists may then submit the trace files into the Golden-Trail provenance repository using the web-based user interface. If two trace files maintain identifiers of the shared data items (if any), Golden-Trail should be able to stitch them automatically. In our synthetic experiment, workflows wf2 and wf3 use data items from workflow wf1, and the workflow system used (CoMaD) is able to maintain data identifiers. Thus, after loading all three trace files, the provenance data is stitched and represents the provenance of the experiment.


After all the trace files are loaded into the Golden-Trail provenance repository, scientists may want to query it. The following is a set of benchmark queries for which scientists would like to have answers:




- Who is the owner of data item "d"?
- Which run produced data item "d"?
- What are the input and output data items of invocation "i"?
- What are the input and output data items of run "r"?
- Which users used data item "d"?
- Did any of my runs use data items generated by run "r"?
- Am I collaborating with user "u"?
- Which data items are dependent on data item "d"?
- Which data items does data item "d" depend on?
- Which data items are dependent on data items "di" and "dj"?
- How am I dependent on other users?
- How are my runs dependent on other runs?
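Several of these benchmark queries reduce to simple joins over the D-OPM relations defined in Section 3. A minimal sketch of two of them, using invented identifiers and toy tuples purely for illustration:

```python
# Toy D-OPM tuples (invented identifiers, for illustration only).
workflow = {("wf1", "Alignment", "u1")}        # workflow(W, WN, U)
run = {("r1", "wf1")}                          # run(R, W)
invocation = {("i1", "a1", "r1")}              # invocation(I, A, R)
genBy = {("d2", "i1")}                         # data item d2 written by i1

def run_that_produced(d):
    """'Which run produced data item d?' -- join genBy with invocation."""
    return {r for (dd, i) in genBy if dd == d
            for (ii, _, r) in invocation if ii == i}

def owner_of(d):
    """'Who is the owner of data item d?' -- follow genBy up to the user."""
    return {u for r in run_that_produced(d)
            for (rr, w) in run if rr == r
            for (ww, _, u) in workflow if ww == w}
```

In a real deployment these joins would of course be issued against the MySQL or Neo4j back end rather than in-memory sets.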


3 PROVENANCE MODEL

A provenance (or lineage) graph is an acyclic graph G = (V, E), where the nodes V = D ∪ I represent either data items D or invocations I. The graph G is bipartite, i.e., the edges E = E_used ∪ E_genBy are either used edges E_used ⊆ I × D or generated-by edges E_genBy ⊆ D × I. Here, a used edge (i, d) ∈ E means that invocation i has read d as part of its input, while a generated-by edge (d, i) ∈ E means that d was output data, written by invocation i. An invocation can use many data items as input, but a data item is written by exactly one invocation. The following are the relations in our D-OPM:


User: scientists who are sharing provenance data
Workflow: a systematic orchestration of a set of processes/actors or sub-workflows to achieve a scientific task
Run: an execution of a workflow
Actor: a computational step to achieve some scientific task
Invocation: an execution of a process/actor or a sub-workflow
Data Item: a data artifact used or generated during the execution of a workflow
Dependency: captures the dependencies of a node (data item, invocation) on other nodes
Used: captures the dependencies of an invocation on data items
GenBy: captures the dependency of a data item on an invocation


user(U,UN)           U = user id, UN = user name
workflow(W,WN,U)     W = workflow id, WN = workflow name, U = user id
actor(A,S)           A = actor id, S = source code reference
run(R,W)             R = run id, W = workflow id
invocation(I,A,R)    I = invocation id, A = actor id, R = run id
data(D,DR)           D = data artifact id, DR = data artifact reference
used(I,D)            I = invocation id, D = data artifact id
genBy(D,I)           D = data artifact id, I = invocation id

Note: nesting of workflows and data is not considered for this prototype.
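To make the schema concrete, the relations above can be sketched as plain tuples, with the used/genBy edges forming the bipartite lineage graph. The identifiers below are invented for the example:

```python
# Minimal in-memory sketch of the D-OPM relations (illustrative ids only).
user = {("u1", "user1")}                  # user(U, UN)
workflow = {("wf1", "Alignment", "u1")}   # workflow(W, WN, U)
run = {("r1", "wf1")}                     # run(R, W)
invocation = {("i1", "a1", "r1")}         # invocation(I, A, R)
used = {("i1", "d1")}                     # invocation i1 read data item d1
genBy = {("d2", "i1")}                    # invocation i1 wrote data item d2

# dep(X, Y): X depends on Y, from either a used or a genBy edge.
dep = set(used) | set(genBy)

# Each data item is written by exactly one invocation (the bipartite
# graph's cardinality constraint from the definition above).
writers = {}
for d, i in genBy:
    assert d not in writers, f"{d} has more than one producer"
    writers[d] = i
```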


4 GOLDEN-TRAIL APPLICATION

The Golden-Trail solution is built on four logical components: the User Interface, Trace Parser, Graph Visualization, and the Data Store. The User Interface allows scientists to interact with the provenance repository; users can upload trace files and build queries using the query builder, and query results can be viewed both in tabular format and as an interactive provenance graph. The Trace Parser parses trace files from different workflow systems (e.g., Kepler, Taverna); new parsers can easily be added to Golden-Trail to load trace files from other workflow systems. The Graph Visualization component shows provenance data as a directed acyclic graph in an interactive manner. The Data Store stores stitched provenance data.

Figure 3: Golden-Trail Application


4.1 User Interface

The Golden-Trail user interface has three primary features: the Upload Trace File, Query Builder, and Query Result sections.

4.1.1 Upload Trace File

Upload Trace File allows scientists to upload provenance data (in the form of trace files) to the provenance repository. On the upload page, the scientist provides his/her user name, the workflow name, and which workflow system the trace file is from, as shown in Figure 4; the workflow system determines which trace parser to use. The scientist then chooses a trace file using the dialog box and initiates the upload by clicking the upload button.

Figure 4: Golden-Trail Application - Upload Trace File

4.1.2 Query Builder

Using this feature, scientists can query the provenance repository. The repository is expected to be very large, so it would be very inefficient, and probably not useful, to view everything in it. Instead, scientists use the query builder to view only the relevant parts. Scientists are expected to provide (i) a provenance view, (ii) a dependency view, and (iii) query conditions, as shown in Figure 5. The provenance view is used to view the provenance data at the required abstraction level.

Figure 5: Golden-Trail Application - Query Builder


For example, if the scientist only cares about what happens at the run level, then he/she does not need to view the details of individual invocations. Provenance data can be viewed at the user, workflow, run, actor, and invocation levels. The scientist may want to view only the data dependencies (i.e., how a data item depends on other data items), only the invocation dependencies (i.e., how an invocation depends on other invocations), or a combination; this is achieved using the dependency view. If the scientist wants to view only a part of the repository, he/she can specify a set of starting nodes (data items or invocations), a set of through nodes, and a set of nodes to end at using the query condition section.

4.1.3 Query Result

When a query is executed, Golden-Trail provides the result in two different ways: (i) as a table, where each dependency is printed as a row with either input or output type (if the type is “input”, the start node is a data item); or (ii) as a dependency graph, showing the dependencies from a right node (data item or invocation) to a left node.

(a) ….  (b) ….
Figure 6: Golden-Trail Application - Query Result


4.2 Trace Parser

In today's environment, a scientific experiment involves many workflows to accomplish different computational and data analysis steps. Some of these workflows share data items (i.e., one workflow run generates a data item and another workflow run uses it). In the case where two runs maintain the same identifiers for the shared data items, Golden-Trail links their gen-by/used relations based on those shared data identifiers.





(a) trace1  (b) trace2  (c) trace3
Figure 7: Individual trace files


For example, the runs of workflow1, workflow2, and workflow3 produce trace files trace1, trace2, and trace3, respectively. The workflow1 run generated the data item "d4" and the workflow2 run uses that same data item. In the same way, the workflow1 run generated the data item "d5" and the workflow3 run uses it. After all three trace files are loaded into the Provenance Repository, Golden-Trail automatically stitches (Automatic Stitching) trace1 & trace2 based on data item "d4" and trace1 & trace3 based on data item "d5", as shown in Figure 8.



Figure 8: Automatic Stitching
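The automatic stitching step can be sketched as follows: if each trace is represented as a set of used/genBy edges keyed by data identifiers, loading the traces into one edge set links them wherever identifiers coincide. This is an illustrative sketch with the ids from the example above, not the actual parser code:

```python
# Each trace is a dict of 'used' and 'genBy' edge sets (illustrative sketch).
trace1 = {"used": set(),          "genBy": {("d4", "i1"), ("d5", "i1")}}
trace2 = {"used": {("i2", "d4")}, "genBy": {("d6", "i2")}}
trace3 = {"used": {("i3", "d5")}, "genBy": {("d7", "i3")}}

# Loading all traces into one repository merges the edge sets; stitching is
# automatic because shared data items keep the same identifier across traces.
repo = {"used": set(), "genBy": set()}
for t in (trace1, trace2, trace3):
    repo["used"] |= t["used"]
    repo["genBy"] |= t["genBy"]

# A data item generated in one trace and used in another is a stitch point
# linking the two runs.
generated = {d for (d, _) in repo["genBy"]}
consumed = {d for (_, d) in repo["used"]}
stitch_points = generated & consumed   # d4 links wf1/wf2, d5 links wf1/wf3
```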


If two runs do not maintain the same identifiers for a shared data item, but the shared data item can still be identified, then the scientist can map the different identifiers of the same data item by creating a manual trace. For example, a run of workflow1 produces a data item with identifier "d4" and a run of workflow2 uses a data item with identifier "d4'". If the scientist knows that "d4" and "d4'" refer to the same data item, he can create a manual trace file, as shown in Figure 9(d), to state that fact. For the same reason, the scientist would create the manual trace file shown in Figure 9(e) to state that "d5" and "d5'" refer to the same data item. The Golden-Trail Upload Trace File feature uploads these manually created trace files in the same way it uploads system-generated trace files. After all three trace files and the two "manual trace" files shown in Figure 9 are loaded into the Provenance Repository, Golden-Trail stitches trace1 & trace2 using the manual trace in Figure 9(d) and trace1 & trace3 using the manual trace in Figure 9(e). The stitched provenance graphs in the Provenance Repository would look as shown in Figure 10.





(a) trace1  (b) trace2  (c) trace3
(d) identifying that d4 and d4' are the same data item with different data identifiers
(e) identifying that d5 and d5' are the same data item with different data identifiers
Figure 9: Individual trace files and "manual maps"

Figure 10: Stitched Provenance Graph using manual trace files


4.3 Graph Visualization

The scientist can view a query result (a selected part of the provenance repository) as a dependency graph in addition to the tabular format. The result can be viewed as (i) an interactive dependency graph: a dependency graph is drawn for a predefined number of dependency levels (i.e., depth of dependencies), and the scientist can change the level or click on nodes to view/expand further dependencies; or (ii) a static dependency graph: the scientist can save the selected provenance graph (a part of the provenance repository) based on the query result into a PDF (or other format) file for future use.


4.4 Data Store

Golden-Trail provides an extendable database layer, which is implemented using the abstract factory design pattern. Currently it supports a relational database and a graph database. In the relational database, the provenance model is implemented as tables and their respective relationships. In the graph database, the provenance model is implemented as a graph with nodes and their respective relationships; one of the node properties is "node type", which specifies whether the node represents a data item or an invocation, and the relationships specify the dependencies (i.e., used, genBy) among the nodes. These implementation details are hidden from the scientists, who use the Upload Trace File feature to load provenance data into the repository and the Query features to query it.
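A simplified sketch of such a factory-based database layer is shown below. The class and method names are hypothetical (the report does not give the actual interface), and the two "stores" are in-memory stand-ins for the MySQL and Neo4j back ends:

```python
from abc import ABC, abstractmethod

class ProvenanceStore(ABC):
    """Abstract product: a backend-neutral interface to the repository."""
    @abstractmethod
    def save_edge(self, kind: str, src: str, dst: str) -> None: ...
    @abstractmethod
    def edges(self, kind: str) -> set: ...

class RelationalStore(ProvenanceStore):
    """Stand-in for the MySQL store: edges land in per-relation 'tables'."""
    def __init__(self):
        self.tables = {"used": set(), "genBy": set()}
    def save_edge(self, kind, src, dst):
        self.tables[kind].add((src, dst))
    def edges(self, kind):
        return self.tables[kind]

class GraphStore(ProvenanceStore):
    """Stand-in for the Neo4j store: edges become typed relationships."""
    def __init__(self):
        self.rels = []
    def save_edge(self, kind, src, dst):
        self.rels.append((src, kind, dst))
    def edges(self, kind):
        return {(s, d) for (s, k, d) in self.rels if k == kind}

def store_factory(backend: str) -> ProvenanceStore:
    """Factory entry point: callers never see which backend they get."""
    return {"mysql": RelationalStore, "neo4j": GraphStore}[backend]()
```

Because the Upload and Query components only depend on the abstract interface, adding a third back end is a matter of registering one more concrete class in the factory.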


5 GOLDEN-TRAIL ARCHITECTURE

Golden-Trail is developed using the GWT (Google Web Toolkit) framework and built on a J2EE 3-tier architecture. The client-side code (Upload GUI, Query GUI, and GWT client-server interface) resides in the web server. The server-side code (Upload Trace File, Query Builder, and Query Result) resides in the application server. Tomcat serves as both the web server and the application server for the prototype. The final tier is our database server. The overall interactions of all the components are shown in Figure 11.



Figure 11: Golden-Trail Application Architecture


The GWT Client-Server Interface intercepts the scientist's Upload Trace File or Query requests, which make asynchronous calls to the respective server-side components. The server-side Upload component is invoked for an Upload Trace File request. This Upload component calls the appropriate Trace Parser (based on the selected workflow system). The Trace Parser parses the trace file and creates a provenance model object, which is passed to the Database Interface. The Database Interface prepares the DML statements for the targeted database and calls the respective database server API.


The server-side Query component is invoked for a Query request. This Query component calls the Database Query Interface, which rewrites the query for the intended database and calls the respective database server API. The database server executes the query and returns the result to the Database Query Interface, which converts the results into a provenance model object. This model object is passed to the Query component, which makes three calls: (i) Graphviz/DOT converts the object into a static dependency graph, where circles represent data items and rectangles represent invocations; (ii) JIT converts the object into a JSON object, which is passed to the client side, where the client-side JIT renders it as an interactive dependency graph; and (iii) the Query GUI renders the object as a table in which each row represents a dependency with either input or output type (if the type is “input”, the start node is a data item; if the type is “output”, the end node is the data node). Next we describe the technologies used for this prototype.


5.1 GWT (Google Web Toolkit) Framework

This framework has three components: client-side, server-side, and shared. The client-side code contains most of the GUI code, such as the placement and alignment of all the widget elements in every tab, and all the handlers related to GUI elements such as buttons, text fields, and drop-down lists. All of the JavaScript elements that have been hard-coded via GWT's JavaScript Native Interface (JSNI) are also in the client-side code. The shared code contains only a single class that shares a set of constants between the client-side and server-side code, mostly localized to the specific deployment platform. The GWT compiler transforms the client-side and shared code into JavaScript to be delivered to the user via a web browser. The server-side code contains all of the dispatches to the Neo4j server and forwards results back to the client for rendering via JIT.


5.2 Graphviz (DOT)

Graphviz is open-source graph visualization software. Graph visualization is a way of representing structural information as diagrams of abstract graphs and networks, with important applications in networking, bioinformatics, software engineering, database and web design, machine learning, and visual interfaces for other technical domains. In the graphs we produce with this tool, data items are drawn as circles and invocations (instances of an actor/process) as boxes. Graphviz is installed on the application server, and our application interacts with it through a driver class we developed; it uses the /war/dot/tmp directory for its internal workings. We have also written a DOT-file generator for our model object. Graphviz reads this DOT file and produces the graph as an image file in the /war/dot directory, which is presented on the Golden-Trail web site.
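A DOT-file generator along these lines could look like the following. This is a hypothetical minimal version, not the project's actual driver class; the node names are taken from the earlier stitching example:

```python
def to_dot(used, genBy):
    """Render used/genBy edges as a Graphviz DOT digraph: data items as
    circles, invocations as boxes, matching the report's convention."""
    data = {d for (_, d) in used} | {d for (d, _) in genBy}
    invocations = {i for (i, _) in used} | {i for (_, i) in genBy}
    lines = ["digraph provenance {"]
    for d in sorted(data):
        lines.append(f'  "{d}" [shape=circle];')
    for i in sorted(invocations):
        lines.append(f'  "{i}" [shape=box];')
    for i, d in sorted(used):            # invocation i read data item d
        lines.append(f'  "{d}" -> "{i}";')
    for d, i in sorted(genBy):           # invocation i wrote data item d
        lines.append(f'  "{i}" -> "{d}";')
    lines.append("}")
    return "\n".join(lines)

# i1 wrote d4; i2 later read it.
dot = to_dot(used={("i2", "d4")}, genBy={("d4", "i1")})
```

The resulting string would be written to a .dot file under /war/dot/tmp and handed to the `dot` executable to render the image.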


5.3 JIT (JavaScript Infovis Toolkit)

The program uses the JavaScript Infovis Toolkit (JIT) as its visualization widget; in particular, the subset of JIT that draws force-directed graphs. JIT is interfaced with GWT through GWT's JavaScript Native Interface (JSNI), which allows blocks of JavaScript to be inserted into Java code and interpreted as JavaScript. Communication of data between the server and the drawing of a new graph is handled through JavaScript Object Notation (JSON) objects. Strings containing a JSON object are computed on the server, based on the results of the query, and forwarded to the JavaScript code for loading.


5.4 Tomcat (Web and Application Server)

The Golden-Trail application is currently running on an Apache Tomcat server.


5.5 Database Server

We have implemented a relational database (MySQL) and a graph database (Neo4j) as the data servers for the Golden-Trail prototype. A typical provenance query is recursive in nature. Executing such queries in Neo4j is relatively easy, as it provides a set of REST APIs for querying with recursion, and we used these features in Golden-Trail. MySQL does not provide such constructs, so we developed a set of stored procedures to achieve this.

6 DATAONE PROVENANCE QUERY LANGUAGE (DPQL)

For this project we developed a simple provenance query language to describe queries on the provenance repository. The output of a query is a Provenance Graph (PG). It is captured as a directed acyclic graph (DAG) in which a node is dependent on another node (i.e., dependencies flow from right to left). The dependencies are captured using the following relation:

dep(X,Y) ← used(X,Y).
dep(X,Y) ← genBy(X,Y).


To allow seeing only the relevant level of detail in the provenance, DPQL supports 5 different Provenance Types (PT). They are: (1) Invocation provenance, i.e., inside provenance; (2) Actor provenance, the workflow-level equivalent of invocation provenance; (3) Run provenance, i.e., outside provenance; (4) Workflow provenance, a workflow being the abstract thing that a run is an execution of; and (5) User provenance. Below are the relations that describe the 5 PTs:


1. Invocation provenance (inside provenance): uses the base used, genBy, and dep relations directly.

2. Actor provenance:
   actor_used(A,D) ← used(I,D), invocation(I,A,_).
   actor_genBy(D,A) ← genBy(D,I), invocation(I,A,_).
   actor_dep(X,Y) ← actor_used(X,Y).
   actor_dep(X,Y) ← actor_genBy(X,Y).

3. Run provenance (outside provenance):
   run_used(R,D) ← used(I,D), invocation(I,_,R).
   run_genBy(D,R) ← genBy(D,I), invocation(I,_,R).
   run_dep(X,Y) ← run_used(X,Y).
   run_dep(X,Y) ← run_genBy(X,Y).

4. Workflow provenance:
   wf_used(W,D) ← used(I,D), invocation(I,_,R), run(R,W).
   wf_genBy(D,W) ← genBy(D,I), invocation(I,_,R), run(R,W).
   wf_dep(X,Y) ← wf_used(X,Y).
   wf_dep(X,Y) ← wf_genBy(X,Y).

5. User provenance:
   user_used(U,D) ← used(I,D), invocation(I,_,R), run(R,W), workflow(W,_,U).
   user_genBy(D,U) ← genBy(D,I), invocation(I,_,R), run(R,W), workflow(W,_,U).
   user_dep(X,Y) ← user_used(X,Y).
   user_dep(X,Y) ← user_genBy(X,Y).
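As a sanity check, these lifting rules can be evaluated directly over toy base relations. Here is a sketch for the run-provenance case, with invented identifiers; each set comprehension mirrors one rule body:

```python
# Base relations (illustrative tuples only).
invocation = {("i1", "a1", "r1"), ("i2", "a2", "r2")}   # invocation(I, A, R)
used = {("i2", "d4")}                                   # used(I, D)
genBy = {("d4", "i1")}                                  # genBy(D, I)

# run_used(R,D) <- used(I,D), invocation(I,_,R).
run_used = {(r, d) for (i, d) in used
            for (j, _, r) in invocation if j == i}

# run_genBy(D,R) <- genBy(D,I), invocation(I,_,R).
run_genBy = {(d, r) for (d, i) in genBy
             for (j, _, r) in invocation if j == i}

# run_dep combines both rule bodies, like dep does at the base level.
run_dep = run_used | run_genBy
```

So the single data item d4, written inside run r1 and read inside run r2, lifts to the outside-provenance facts run_genBy(d4, r1) and run_used(r2, d4).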


Within each provenance type, there are 3 Dependency Types (DT) that can be chosen so that the user sees only the dependencies that are relevant. The three types are: (1) invocation dependency (idep(X,Y)), which specifies that an invocation X used some data item that another invocation Y produced, without specifying the intermediate data item(s); (2) data dependency (ddep(X,Y)), which specifies that data item X somehow depends on data item Y; and (3) data-and-invocation dependency (dep(X,Y)), which describes how the data items and invocations are related to each other.


To narrow the provenance results to just the area of the graph that the user is interested in, Query Conditions (QC) can be specified. Query conditions come in three types, and any combination of them can be specified: (1) start_at(X): all paths containing the nodes on which this node directly or transitively depends are selected; (2) end_at(X): all paths containing the nodes which directly or transitively depend on this node are selected; and (3) through(X): only paths containing this node are selected. All paths satisfying the QC are returned.
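One way to read the three conditions is as filters over complete dependency paths. A minimal sketch under that reading (path enumeration over a small acyclic dep set, with node names borrowed from the examples below; the real evaluator works against the database back ends):

```python
def all_paths(dep, node, path=()):
    """Enumerate dependency paths starting at `node` in an acyclic dep set."""
    path = path + (node,)
    nexts = [y for (x, y) in dep if x == node]
    if not nexts:
        return [path]
    return [p for y in nexts for p in all_paths(dep, y, path)]

def query(dep, start_at, end_at=None, through=None):
    """Keep the paths from `start_at` that satisfy the end_at/through QCs."""
    paths = all_paths(dep, start_at)
    if end_at is not None:
        paths = [p for p in paths if end_at in p]
    if through is not None:
        paths = [p for p in paths if through in p]
    return paths

# d8 depends on i3, which read d5 and d6; d5 in turn depends on i1 and d3.
dep = {("d8", "i3"), ("i3", "d5"), ("i3", "d6"), ("d5", "i1"), ("i1", "d3")}
```

For example, query(dep, "d8", end_at="d3", through="d5") keeps only the path running through d5 down to d3.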


There are also two simple non-recursive (NR) queries, which can be expressed as ancestor(N,X,Y) and descendent(N,X,Y).


Query Examples: Assume the provenance graph in the provenance repository is as shown below:



(i) Recursive (R): below is a list of example recursive queries which can be made on the graph:




- Find the “data and invocation” dependencies where provenance type = “invocation provenance” and the query condition is (“start_at” node d8).

- Find the “data and invocation” dependencies where provenance type = “invocation provenance” and the query condition is (“start_at” node d7 or d8).

- Find the “data and invocation” dependencies where provenance type = “invocation provenance” and the query condition is (“start_at” node d8 and “end_at” node d3).

  result(X,Y) ← ancestor(N1,X1,Y1), descendent(N2,X2,Y2), N1=d7, N2=d3, X1=X2, Y1=Y2.






- Find the “data and invocation” dependencies where provenance type = “invocation provenance” and the query condition is (“start_at” node d8, “through” node d5, and “end_at” node d3).

- Find the “data” dependencies where provenance type = “invocation provenance” and the query condition is (“start_at” node d8, “through” node d5, and “end_at” node d3).

- Find the “invocation” dependencies where provenance type = “invocation provenance” and the query condition is (“start_at” node d8, “through” node d5, and “end_at” node d3).



  (example result subgraph over nodes i3, d7, d4, d3, i2)



- Find the “data and invocation” dependencies where provenance type = “run provenance” and the query condition is (“start_at” node d8).



(ii) Non-recursive (NR): below is the list of non-recursive queries which can be made against the provenance repository:




Find which data/invocations depend on the <dataId=?, invocationId=?, runId=?>

Which <userId, invocationId, runId> consumed this <dataId=?>

Which <userId, invocationId, runId> produced this <dataId=?>

What are the inputs used by the <invocationId=?>

What are the outputs produced by the <invocationId=?>
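Each of these non-recursive queries is a single-edge lookup rather than a closure. A minimal sketch, assuming hypothetical in-memory maps whose names follow the Neo4j relation types described in the appendix (the ids and maps are illustrative, not the repository's actual schema):

```java
import java.util.*;

public class DirectDeps {
    // Hypothetical USED and GEN_BY edges.
    static final Map<String, List<String>> USED = new HashMap<>();  // invocation -> data it consumed
    static final Map<String, String> GEN_BY = new HashMap<>();      // data -> invocation that produced it
    static {
        USED.put("i2", Arrays.asList("d3", "d4"));
        GEN_BY.put("d7", "i2");
    }

    // "What are the inputs used by the <invocationId=?>" -- one lookup, no recursion.
    static List<String> inputsOf(String invocationId) {
        return USED.getOrDefault(invocationId, Collections.<String>emptyList());
    }

    // "Which invocation produced this <dataId=?>" -- also a single lookup.
    static String producerOf(String dataId) {
        return GEN_BY.get(dataId);
    }

    public static void main(String[] args) {
        System.out.println(inputsOf("i2") + " / " + producerOf("d7"));
    }
}
```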



7 QUERY EVALUATION:





8 TECHNICAL DISCUSSION:


GWT Pro/Con

Overall, GWT is a very convenient tool for those who know Java but not JavaScript, letting them program web pages in a very Java-like structure without having to deal with what goes on underneath. The GWT compiler handles multi-browser compatibility by generating different JavaScript for different browsers as needed, and automatically sends clients the JavaScript that will work best for their browser. The GWT compiler also optimizes the compiled JavaScript to speed it up. Limitations on communication between server and client code made the project more difficult, and the biggest problem with GWT was the inability to write to local data files due to sand-boxing restrictions imposed by the GWT server. We were able to work around this problem by deploying the program on an Apache Tomcat server instead of a GWT server, while still getting the benefit of the ease of coding in Java by using the GWT compiler to compile the Java into JavaScript.


Client:

The GUI was for the most part arranged and programmed via the GWT Designer, another plug-in that allows streamlined addition and preview of widgets. However, this Designer is unwieldy to use for previewing: unless the user has a very large screen, the viewable part of the entire web page is rather small.

Implementation of mouse and click handlers for all of the widgets was done manually, as the code for the handlers got very long for some of the buttons. The GWT Designer normally does all of the adding of widgets and resizing in the main method, onModuleLoad, which causes onModuleLoad to be cluttered with lots of code that is hard to break up at times.


Client-Server Interaction:

All of the client-server interaction takes place through asynchronous calls to the server-based code. Interfaces are required for the client to have access to the function prototypes that the server-based code will use. All necessary interfaces are present in the client package of the code. A note for future changes: if the second part of the server-side code were split into more class files for readability and maintenance, more interfaces would have to be added on the client side to make the new server-side class files visible to the client.

Of note, the client-server interaction is limited by the JavaScript that the client code must compile into. The negative effect of this constraint is that only serializable types can be used to communicate between the client and server code; specifically, unless the class to be used is defined in the public section of the code, it cannot be used in communication between the client and server processes.


For example, in the implementation of the code, the invocation data type was not serializable, so the server-side dispatcher code had to encode the data of the ArrayList of invocations returned from Neo4j into an ArrayList of Strings before sending it back to the client.
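That encoding step can be sketched as flattening each invocation into a delimited String on the server and splitting it back apart on the client. This is a hedged illustration: the Invocation class, field names, and delimiter here are hypothetical stand-ins, not the project's actual types.

```java
import java.util.*;

public class InvocationCodec {
    // Hypothetical stand-in for the non-serializable invocation type returned from Neo4j.
    static class Invocation {
        final String id, runId;
        Invocation(String id, String runId) { this.id = id; this.runId = runId; }
    }

    // Server side: flatten each invocation into a delimited String, which GWT can serialize.
    static ArrayList<String> encode(List<Invocation> invocations) {
        ArrayList<String> out = new ArrayList<>();
        for (Invocation inv : invocations) out.add(inv.id + "|" + inv.runId);
        return out;
    }

    // Client side: split each String back into its fields.
    static String[] decode(String encoded) {
        return encoded.split("\\|");
    }

    public static void main(String[] args) {
        ArrayList<String> wire = encode(Arrays.asList(new Invocation("i2", "r1")));
        System.out.println(Arrays.toString(decode(wire.get(0))));
    }
}
```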


The architecture is split between client-side, server-side, and shared code, and the GWT compiler transforms the client-side and shared code into JavaScript to be passed to the user via a web browser.




Graphviz vs. JIT

Early in the project we used GraphViz to show a complete visualization of the output graphs. However, the limitation of the graph file that GraphViz outputs is that it has no interactive component. Thus, JIT was implemented to allow the user to interact with the graph. The GraphViz output is still provided to give the user a detailed output showing everything as it is for a normal provenance structure, as the current JIT implementation still lacks a feature to precompute node positions in a deterministic tree-like structure.



Graph DB vs. Relational DB in representing D-OPM

Our initial implementation used Neo4j as the underlying database for the datastore. This choice made sense initially because the graph database structure very closely lines up with the D-OPM provenance model. The main difficulty with using Neo4j was its in-progress nature: new features were still being added that would have made the implementation easier or more efficient, and at times we were using features that were not yet well documented. In the future Neo4j may be a very good fit for this type of work, but it is still a relatively new project. The MySQL implementation went very smoothly, as MySQL is a very stable database system and there are many resources for implementing systems with it.


9 CONCLUSION AND FUTURE WORK:

As an outcome of this project we developed a provenance repository prototype that allows users to upload, query, and visualize provenance at different levels of dependency, including the user level, the run level (outside provenance), and the invocation level (inside provenance). The repository also incorporates automatic stitching of provenance traces where possible. <<Saumen>> With our abstraction layer for the trace parser, we have already set up the system to be easily extended to new workflow systems.

Future work for this project would include first finalizing the D-OPM details, then implementing a repository system (based on what was learned in our Golden Trail project) in collaboration with the DataONE Provenance Working Group and possibly other contributors (e.g., CCIG or an upcoming MS thesis at UCD).


10 REFERENCES

11 APPENDIX

11.1 Graph Database (Neo4j) Implementation Detail



Our solution stores workflow provenance data in a Neo4j database using the REST API. Calls to the REST server are made using the Jersey API. All the Jersey API calls are encapsulated in a set of classes written for this project to make remote Neo4j server access more object oriented. Data is sent to, and returned from, the server in JSON payloads, which are created and processed using the set of Java classes provided at json.org.
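The JSON payloads themselves are small property maps. A minimal sketch of building such a payload and the relative REST path for a relationship (a hedged illustration: the property names follow our node schema, but the escaping is simplistic and the path layout is an assumption about the Neo4j REST API rather than a verified endpoint):

```java
public class Neo4jPayload {
    // Build the JSON body for creating a provenance node via the REST API.
    // Assumes plain identifiers; a real implementation should use a JSON
    // library (e.g., the json.org classes) to handle escaping.
    static String nodeJson(String type, String id) {
        return "{\"type\":\"" + type + "\",\"id\":\"" + id + "\"}";
    }

    // Hypothetical relative REST path for creating a relationship from a node.
    static String relationshipPath(long fromNodeId) {
        return "node/" + fromNodeId + "/relationships";
    }

    public static void main(String[] args) {
        System.out.println(nodeJson("data", "d8"));
        System.out.println(relationshipPath(42));
    }
}
```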


Being a graph-based database system, Neo4j has nodes and relations as its two types of entities. Both nodes and relations can have properties, though for our system all the properties are on the nodes.

Here is how our system stores the relations and nodes in the Neo4j database:

Nodes represent runs, data artifacts, or invocations, and each node has a type property which is currently one of {run, data, invoc}. Each node also has an id property to differentiate nodes. Invocation nodes have a run property to more efficiently check the run that an invocation belongs to. Data nodes have a value property to store whatever value was parsed from the trace file, e.g., a URI.


Relations used:

GEN_BY: data->invoc - connects data to the invocation that generated it

FINAL_OUTPUT: run->data - connects a run to its final outputs to more quickly find run outputs

ROOT_RUN: root node->run - connects the root node to each run to enumerate runs

INIT_INPUT: run->data - connects a run to its initial inputs to more quickly find run inputs

RUN_INVOC: invoc->run - connects invocations to their run to enumerate the invocations of a run

RUN_DATA: data->run - connects data artifacts to their run to enumerate the data artifacts of a run

USED: invoc->data - connects an invocation to a data item that it uses
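The relation schema above can be captured compactly in code, for instance as an enum pairing each relation name with the node types at its endpoints. This is only a sketch: in the actual system these are Neo4j relationship types, not Java enums.

```java
public class RelationTypes {
    // The seven relation types and their direction, as described above.
    enum Rel {
        GEN_BY("data", "invoc"),
        FINAL_OUTPUT("run", "data"),
        ROOT_RUN("root", "run"),
        INIT_INPUT("run", "data"),
        RUN_INVOC("invoc", "run"),
        RUN_DATA("data", "run"),
        USED("invoc", "data");

        final String from, to;
        Rel(String from, String to) { this.from = from; this.to = to; }
    }

    public static void main(String[] args) {
        for (Rel r : Rel.values())
            System.out.println(r + ": " + r.from + " -> " + r.to);
    }
}
```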


Our solution uses the findPaths call to the Neo4j server to find paths between nodes, and the traverse call to traverse the graph in a given direction.

traversal: When a start or end node is given, but not both, or when only a through node is given, our solution calls traverse node on the Neo4j server. If through nodes are specified in addition, then we call traverse path, which returns a list of paths, and we only return the paths that include all of the through nodes.

findPaths: When both start and end nodes are specified, our solution calls findPaths to get a list of paths between the start and end nodes. If through nodes are specified, only paths including all the through nodes are returned.
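The through-node filtering applied to the results of both calls can be sketched as a simple containment check over candidate paths (the path and node values below are hypothetical):

```java
import java.util.*;

public class ThroughFilter {
    // Keep only paths that contain every "through" node, mirroring the
    // post-filtering applied to traverse / findPaths results.
    static List<List<String>> filter(List<List<String>> paths, Set<String> through) {
        List<List<String>> kept = new ArrayList<>();
        for (List<String> path : paths)
            if (path.containsAll(through)) kept.add(path);
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> paths = Arrays.asList(
            Arrays.asList("d8", "i3", "d5", "i2", "d3"),
            Arrays.asList("d8", "i3", "d4", "i2", "d3"));
        // Only the first path passes through d5.
        System.out.println(filter(paths, new HashSet<>(Arrays.asList("d5"))));
    }
}
```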


Overall, the underlying implementation for the Neo4j back-end data store looks very much like the provenance that is displayed as output to the user, resulting in a very intuitive solution. As we implemented the Neo4j solution before we had added users to our model, our Neo4j model does not include users.


11.2 Relational Database (MySQL) Implementation Detail

Our MySQL solution is a much more traditional relational database approach to storing provenance. As MySQL is a very mature system, this solution was completed in a much shorter period of time and went more smoothly.


11.3 Server-Side Implementation Detail

The server code is broken into two parts: FileUpload and UploadServiceImpl.

The first part, FileUpload.java, is simply the form handler used to save trace files uploaded via a FileUpload widget present in the GUI. The trace file details, which are also parsed on the server side, are then uploaded into the provenance database for further querying.

An extra comment on the saving of local files to disk: GWT in its base implementation does not allow direct interaction with the hard disk, and forces the user to use the encrypted datastore that comes with any GWT application.

The second part, UploadServiceImpl.java, is the dispatcher that takes in requests and their arguments from the client and dispatches the proper method calls to the provenance server. Once the results are computed, they are passed back up to the client.
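The dispatch step can be sketched as a mapping from a request name to the matching provenance-store call. The request names and return values here are hypothetical placeholders, not the actual UploadServiceImpl API:

```java
import java.util.*;

public class Dispatcher {
    // Hypothetical sketch of the dispatcher: route a client request name,
    // with its arguments, to the matching provenance-store operation.
    static String dispatch(String request, Map<String, String> args) {
        switch (request) {
            case "ancestor":   return "ancestor query on " + args.get("node");
            case "descendent": return "descendent query on " + args.get("node");
            default:           return "unknown request: " + request;
        }
    }

    public static void main(String[] args) {
        System.out.println(dispatch("ancestor", Collections.singletonMap("node", "d8")));
    }
}
```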


In the code base, the JavaScript used in creating the graph visualization resides in GraphGenerator.java. The server sends data back to the client, which the client then forwards to GraphGenerator.java to load into the graph. The JavaScript in GraphGenerator.java calls rendering functions in the local JIT source files to render the graph inside of a widget.