Dec 13, 2013

Workflow management: motivation and vision

Ela Hunt
Ela.Hunt@SystemsX.ch

Plan

Overview of existing workflows
Gains to be achieved via workflows
Methodological assumptions: how to support and construct workflows with less effort and more effectively


Ela Hunt, SyBIT

Three areas of workflow use:

Deep sequencing
High content screening
Proteomics

Future: workflows combining those three methodologies, possibly including metabolomics, NMR, etc.

Deep sequencing

Management of reads (images) coming off the microscopy devices
Processing of images into sequence files
Alignment to a genome or genome assembly from short reads
Annotation with data from external sources
Candidate gene/drug target identification


Deep sequencing workflow status (Lausanne)

[Figure: workflow diagram. Labels: 1a. Web sample metadata capture; Perl; 1b. Illumina sequencing; 2. fileserver; 3. Web-browse; 4. Submit analysis pipeline; Sequence analysis; Meta-data; Sequence data. Possible extensions: 6. DAS server; 7. Microbe Browser; 8. Association Viewer.]

Deep sequencing workflow status

Lausanne: alignment via Eland (Emmanuel Beaudoing, Sylvain Pradervand)
Basel: under construction (Manuel Kohler)
Zurich, FGCZ: under construction (Remy Bruggmann)

Proteomics workflows

MS spectra
Mapping to proteins (merging output from various analysis programs)
Annotation with additional data

ETHZ: Perl scripts and KNIME (Andreas Quandt)
Lausanne, Geneva, Basel (?)


ETHZ proteomics example (drawn in KNIME by Andreas Quandt)

Screening workflows

Microscopy, image transfer, compression
Matlab scripts (light intensity adjustment, feature recognition, etc., leading to the identification of features) writing feature counts to a DB/files
Stats and chart generation, sometimes including a user interface showing images (also for training), KNIME, R, Matlab, etc.

Screening workflows

Lausanne: Petr Strnad's workflows in KNIME, Matlab, MySQL
iBRAIN developed by Berend Snijder - an end-to-end solution with a GUI (shell script, XML, XSLT, HTML)
ImageJ in S. Maerkl's lab in Lausanne, needing more automation and DB
HCDC (Postgres, Matlab, KNIME)

Lausanne workflow fragment

[Figure: KNIME workflow screenshot. Annotations: Loop for every plate…; Read available plates; …read cell data for the plate in the loop; Calculate the number of centrosomes for 7 different thresholds.]
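
The loop annotated in this fragment can be sketched in Python. This is only an illustration under stated assumptions: the per-cell record layout (a GFP-Centrin intensity and a centrosome count), the function name `centrosome_stats`, and the threshold values are hypothetical and are not taken from the KNIME workflow itself.

```python
from collections import Counter

def centrosome_stats(plates, thresholds):
    """For each plate and threshold, drop cells whose GFP-Centrin
    intensity is below the threshold and tally centrosome counts
    among the remaining cells.

    plates: dict mapping plate name -> list of (intensity, centrosomes)
    thresholds: iterable of intensity cut-offs (hypothetical values)
    """
    results = {}
    for plate, cells in plates.items():        # loop for every plate
        for t in thresholds:                   # the slide uses 7 thresholds
            kept = [n for intensity, n in cells if intensity >= t]
            excluded = len(cells) - len(kept)
            results[(plate, t)] = {
                "excluded_fraction": excluded / len(cells) if cells else 0.0,
                "centrosome_counts": Counter(kept),
            }
    return results

# Toy data (assumed): (GFP-Centrin intensity, centrosome count) per cell.
plates = {
    "plate_01": [(0.1, 1), (0.5, 2), (0.7, 2), (0.9, 4)],
}
stats = centrosome_stats(plates, thresholds=[0.0, 0.2, 0.6])
print(stats[("plate_01", 0.2)])
```

Keeping the result keyed by (plate, threshold) makes it straightforward to plot the excluded fraction against the threshold afterwards.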

iBRAIN overview

Purpose: plates, wells, images => compress images, classify cells into types, count cells of various types, graph
Submit project via drag-and-drop of a file
Monitor progress on cluster via HTML pages
Technology: bash, Matlab, cluster, XML, HTML; web pages generated from a bash script; paths and file names are embedded

iBRAIN use cases

OUR GOALS: addressing technical challenges

Maintainability (extendability) of the entire workflow
Portability
Automation (end-to-end execution)
Cost savings via code base sharing
Various architectures (storage, clusters)
Multiple logins (security, ease of administration)
Privacy

Most of those can be solved via extending KNIME (next talk)


Extending KNIME: see workflows wiki page

What is KNIME?

A Java workflow management system
Integrates Python, R, Perl, Java snippets, JDBC
GUI: can be used by a bioinformatician
Also server and cluster products (Sun Grid Engine)
Used at several locations (below: P. Strnad's at Lausanne)

KNIME Analysis (from P. Strnad)

[Figure. Labels: GFP-Centrin expression threshold; Percentage of cells below threshold. Annotations: 50% of cells have 2 centrosomes; usually exclude 10% of cells with low GFP-Centrin signal.]

KNIME Analysis

[Figure. Labels: Centrosome number; Cell count.]

Image Regions Viewer

[Figures: Image Regions Viewer screenshots.]

Goals of KNIME extension

Maintainability (extendability)
Portability
Automation (end-to-end execution)
Cost savings via code base sharing
Various architectures (storage, clusters)
Doing away with multiple logins or no logins (security, ease of administration, privacy)

Security

Security: one username/password per user, one login that carries out the whole workflow
Will include cluster/db logins
KNIME: needs the concepts of user/session, login, accounting of who did what
Allows for workflow tracking, scientific repeatability, accounting
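
The idea of a single session that also accounts for who did what can be illustrated with a small Python sketch. It is not KNIME code: the `Session` class, its `run_step` method and the log fields are hypothetical names chosen for the illustration, and real authentication is left out entirely.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class Session:
    """One login that carries a user through a whole workflow run."""
    user: str
    log: list = field(default_factory=list)

    def run_step(self, step_name, func, **params):
        """Run one workflow step and record who ran it, with what, and when."""
        result = func(**params)
        self.log.append({
            "user": self.user,
            "step": step_name,
            "params": params,
            "time": datetime.now(timezone.utc).isoformat(),
        })
        return result

# Hypothetical user and step, for illustration only.
session = Session(user="alice")
total = session.run_step("count_cells", lambda wells: sum(wells), wells=[120, 80])
print(total, [(e["user"], e["step"]) for e in session.log])
```

Because every step passes through the session, the log doubles as a record for workflow tracking and for repeating a run with the same parameters.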



Distributed data and computation

Data Mover as a KNIME node (expose input params, input and output as KNIME ports)
KNIME abstracts over those, and calls them ports
Usage of clusters (LSF and others, as needed): probably involving the spawning of several Java workflows distributed over a cluster, also reporting of status as jobs are being processed

Language additions

Wrapping for Matlab
Improved wrapping of Perl
Better facilities for R embedding (viewports)
CP2 embedding
Sequence: Eland, MAQ, Bowtie, BWA
Proteomics: Mascot, X!Tandem, OMSSA, SpectraST

GUI additions

Job submission GUI
Job monitoring GUI (to show errors in a manner appropriate for a biological user)
Workflow sharing GUI (choose workflow, associate with data)
GUI embedding facility for Java GUIs (current implementation is too fiddly)

Workflow portability

A reconfiguration tool, based on the XML workflow description format supported by KNIME, in XPath or XQuery (GUI?):
select all data paths and change them
select all software paths and change them
select db/login/cluster user data, update
check the updated values by testing all new parameters, report
for two identical workflow instances, report the config differences
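
A minimal sketch of such a reconfiguration tool, using the XPath subset in Python's standard `xml.etree.ElementTree`. The XML layout below (`workflow`, `node`, `entry` elements with `key` and `value` attributes) is a simplified, hypothetical stand-in; real KNIME settings files follow their own schema, so the selectors would need adapting.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified workflow description (not the real KNIME schema).
WORKFLOW_XML = """
<workflow>
  <node name="reader">
    <entry key="data_path" value="/old/storage/plates"/>
  </node>
  <node name="aligner">
    <entry key="software_path" value="/opt/tools/bin/aligner"/>
    <entry key="data_path" value="/old/storage/reads"/>
  </node>
</workflow>
"""

def rewrite_paths(root, key, old_prefix, new_prefix):
    """Select every entry with the given key and swap its path prefix."""
    changed = 0
    for entry in root.findall(f".//entry[@key='{key}']"):
        value = entry.get("value", "")
        if value.startswith(old_prefix):
            entry.set("value", new_prefix + value[len(old_prefix):])
            changed += 1
    return changed

def config_of(root):
    """Flatten a workflow into {(node name, key): value}."""
    return {
        (node.get("name"), e.get("key")): e.get("value")
        for node in root.findall("node")
        for e in node.findall("entry")
    }

def config_diff(a, b):
    """Report settings whose values differ between two workflow instances."""
    ca, cb = config_of(a), config_of(b)
    return {k: (ca.get(k), cb.get(k))
            for k in sorted(set(ca) | set(cb)) if ca.get(k) != cb.get(k)}

original = ET.fromstring(WORKFLOW_XML)
migrated = ET.fromstring(WORKFLOW_XML)
n = rewrite_paths(migrated, "data_path", "/old/storage", "/new/storage")
diff = config_diff(original, migrated)
print(n, diff)
```

The same `rewrite_paths` call with `key="software_path"` would cover the software paths; checking that the new values actually work (the "test all new parameters" step) is not shown here.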


Better workflow management

An open repository of workflow nodes, shared by all KNIME user groups (two parts: mature and beta)
Saving of graphing parameters, so that an entire workflow can be automated
Adding a workflow start node with iteration over directories
Data flow efficiency - data exchange between nodes: via hierarchical structures (XML?) and tables (for Perl?)
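
What a start node with iteration over directories might do can be sketched in plain Python. The function names `iter_plate_dirs` and `run_workflow` and the directory layout are assumptions made for the illustration; `run_workflow` is only a placeholder that counts files.

```python
from pathlib import Path
import tempfile

def iter_plate_dirs(root):
    """Yield each immediate subdirectory of root in sorted order,
    so that one workflow run can be started per directory."""
    for path in sorted(Path(root).iterdir()):
        if path.is_dir():
            yield path

def run_workflow(directory):
    """Placeholder for a real workflow run: count the files in the directory."""
    return sum(1 for p in directory.iterdir() if p.is_file())

# Build a throwaway directory tree (hypothetical plate folders) and iterate.
with tempfile.TemporaryDirectory() as tmp:
    for name, n_files in [("plate_A", 2), ("plate_B", 1)]:
        d = Path(tmp) / name
        d.mkdir()
        for i in range(n_files):
            (d / f"img_{i}.tif").touch()
    results = {d.name: run_workflow(d) for d in iter_plate_dirs(tmp)}

print(results)
```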

Image handling

Image type improvements (this type is under development and may not be mature yet)
Image storage in openBIS (various levels of resolution, by well, plate, etc.), with associated indexes, so that stats at various levels can be generated easily

openBIS/B-Fabric connectivity

Access to raw data from KNIME
Image indexing, so that KNIME can effectively query features
Analysis results storage
Dumping of workflow run parameters/outcomes to DB (maybe picking up a workflow from DB)
SQL handling
Better table merging (to merge data from several tables, supported by a query definition), as this is cumbersome
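
Table merging driven by a query definition can be illustrated with Python's built-in `sqlite3` module. The table names, columns and values below are invented for the example; the point is only that a single declarative query replaces a chain of manual merge steps.

```python
import sqlite3

# In-memory database with two hypothetical tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE wells (well_id INTEGER PRIMARY KEY, plate TEXT, treatment TEXT);
    CREATE TABLE features (well_id INTEGER, cell_count INTEGER);
    INSERT INTO wells VALUES (1, 'P1', 'control'), (2, 'P1', 'drug'), (3, 'P2', 'drug');
    INSERT INTO features VALUES (1, 120), (2, 80), (3, 95);
""")

# The "query definition": join the tables and aggregate per plate and treatment.
MERGE_QUERY = """
    SELECT w.plate, w.treatment, SUM(f.cell_count) AS total_cells
    FROM wells AS w
    JOIN features AS f ON f.well_id = w.well_id
    GROUP BY w.plate, w.treatment
    ORDER BY w.plate, w.treatment
"""
rows = conn.execute(MERGE_QUERY).fetchall()
conn.close()
print(rows)
```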

Summary

KNIME is used in Zurich and Lausanne, but does not provide end-to-end processing
A list of new requirements was gathered from workflow users
An outline grant was submitted to KTI
Your input is needed!
