Oct 2, 2013

Taverna and myExperiment: Designing, Exchanging and Sharing of Scientific Workflows

Katy Wolstencroft

University of Manchester

Connecting things Together


Data Resources


Genome databases


Kinetic/metabolite data


Analysis tools


Sequence alignment


Similarity searching


Pattern matching


Knowledge Resources


Ontologies


Controlled vocabularies


What is a Workflow?


A mechanism for connecting things together

Workflows provide a general technique for describing and enacting a process

Describes what you want to do, not how you want to do it

Simple language specifies how bioinformatics processes fit together

Processes are represented as web services

Sequence -> RepeatMasker (web service) -> GenScan (web service) -> Blast (web service) -> Predicted genes out
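The chain of services above can be sketched in Python as plain function composition: the workflow is just an ordered list of steps, and a small runner feeds each output into the next step. This is only an illustration of the idea, not how Taverna is implemented. The three functions are hypothetical stand-ins with toy logic; real steps would call the remote web services.

```python
from functools import reduce
from typing import Callable, Sequence

Step = Callable[[str], str]


def run_workflow(steps: Sequence[Step], data: str) -> str:
    """Pass the data through each step in order; each output feeds the next step."""
    return reduce(lambda value, step: step(value), steps, data)


# Hypothetical stand-ins for the remote services (toy logic only).
def repeat_masker(seq: str) -> str:
    # Pretend lowercase bases are repeats and mask them with N.
    return "".join("N" if base.islower() else base for base in seq)


def gen_scan(seq: str) -> str:
    # Pretend the longest unmasked stretch is the predicted gene.
    return max(seq.split("N"), key=len)


def blast(seq: str) -> str:
    # Pretend to annotate the prediction with a database hit.
    return f"hit_for:{seq}"


# The workflow says WHAT to do (the order of steps), not HOW each step works.
workflow = [repeat_masker, gen_scan, blast]
result = run_workflow(workflow, "ACGTacgtGGCCTTAA")
```

Swapping a step, or reusing the same list on a different sequence, does not require touching the runner, which is the sense in which a workflow can be shared and repurposed.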

What is a workflow?


Business Process workflows


Tasks, Schedules, dependencies (on staff time), and costs


Scientific Workflows


on in silico data


Data throughput, dependencies (on analysis results)


Input, algorithm, output


Flow of information, scheduling of order, collection of results, intermediate results and provenance


High level description of your experiment


Workflow is the model of the experiment


Methods section in your publication


Workflow can be shared and reused

Kepler

Triana

BPEL

Ptolemy II

Taverna

Workflow diagram

Tree view of workflow structure

Available services

Taverna

Open source and extensible

What is a web service?






NOT the same as services on the web (i.e. web forms)


Web services support machine-to-machine interaction over a network


Web Evolution

Programmability

Connectivity

Presentation

Browse the Web

Program the Web

Taken from: http://www.softstar-inc.com/

How do you use Web Services?



SOAP (Simple Object Access Protocol)


An XML protocol for passing messages


WSDL (Web Services Description Language)


A machine-readable description of the operations supported


Normally transferred over HTTP
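To make the message format concrete, here is a minimal sketch that builds a SOAP 1.1 envelope with Python's standard library. The envelope namespace is the standard SOAP 1.1 one; the operation name, service namespace and parameters are hypothetical and do not describe any real service. In practice the WSDL would tell a client which operations and parameters exist.

```python
import xml.etree.ElementTree as ET

SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace


def build_soap_request(operation: str, service_ns: str, params: dict) -> bytes:
    """Build a SOAP 1.1 envelope whose Body holds a single operation element."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{service_ns}}}{operation}")
    for name, value in params.items():
        # Each parameter becomes a child element of the operation element.
        ET.SubElement(call, f"{{{service_ns}}}{name}").text = str(value)
    return ET.tostring(envelope, encoding="utf-8")


# Hypothetical operation, namespace and parameters, for illustration only.
message = build_soap_request(
    "runBlast",
    "http://example.org/blast",
    {"sequence": "ACGT", "database": "nr"},
)
```

A client would normally send these bytes in the body of an HTTP POST to the endpoint address given in the service's WSDL.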


Who Provides the Services?




Open domain services and resources


Taverna accesses 3500+ services


Third party


we don’t own them


we didn’t build them


All the major providers


NCBI, DDBJ, EBI …


Enforce NO common data model.


Quality Web Services considered desirable


What types of service?


WSDL Web Services


BioMart


R-processor


BioMoby


Soaplab


Local Java services


Beanshell


Workflows



Coming soon: REST, Matlab...?


Create and run workflows

Share, discover and reuse workflows

Manage the metadata needed and generated

RDF, OWL

Discover and reuse services

Feta

A Collection of Components

What do Scientists use Taverna for?



Data gathering, annotation and model building


Data analysis from distributed tools


Data mining and knowledge management


Data curation and warehouse population


Parameter sweeps and simulation



Users from Systems Biology, Proteomics, Sequence analysis, Protein structure prediction, Gene/protein annotation, Microarray data analysis, QTL studies, Cheminformatics, Medical image analysis, Public Health care epidemiology, Heart model simulation, Phenotype studies, Phylogeny, Statistical analysis, Pharmacogenomics, Text mining, Astronomy, Music, Meteorology


Taverna: Successful cases of adoption


Selected Successful Cases of Adoption

Originally designed to support bioinformatics, now expanded into new areas

Annotation Pipelines


Genome annotation pipelines


Bergen Center for Computational Science


Gene Prediction in Algal Viruses, a case study.


Workflow assembles evidence for predicted genes / potential functions


Human expert can ‘review’ this evidence before submission to the genome database


Data warehouse pipelines


e-Fungi: model organism warehouse


ISPIDER: proteomics warehouse


Annotating the up/down regulated genes in a microarray experiment

Building models and knowledge management






SBML population



Comparing models and experimental data


Mining text resources and building knowledge models


[Peter Li, Doug Kell]

Systems Biology


Model Construction

Automatic reconstruction of genome-scale yeast metabolism from distributed data in the life sciences to create and manipulate Systems Biology Markup Language (SBML) models.

LibSBML Integration


API consumer used to integrate libSBML directly into Taverna




Performing statistical analyses on quantitative data in Taverna workflows: an example using R and maxdBrowse to identify differentially-expressed genes from microarray data. Peter Li, Juan I. Castrillo, Giles Velarde, Ingo Wassink, Stian Soiland-Reyes, Stuart Owen, David Withers, Tom Oinn, Matthew R. Pocock, Carole A. Goble, Stephen G. Oliver, Douglas B. Kell


Submitted to BMC Bioinformatics

Data Analysis Pipelines



Access to local and remote analysis tools


You start with your own data / public data of interest


You need to analyse it to extract biological knowledge

Trichuris muris




Mouse whipworm infection: parasite model of the human parasite Trichuris trichiura

Understanding Phenotype


Comparing resistant vs susceptible strains


Microarrays

Understanding Genotype


Mapping quantitative traits


Classical genetics QTL



Joanne Pennock, Richard Grencis

University of Manchester

Trichuris muris




Identified the biological pathways involved in sex dependence in the mouse model, previously believed to be involved in the ability of mice to expel the parasite.


Manual experimentation: Two year study of candidate genes, processes unidentified


Joanne Pennock, Richard Grencis

University of Manchester

JO IS A LAB BIOLOGIST


JO HAS NEVER BUILT A WORKFLOW


http://www.genomics.liv.ac.uk/tryps/trypsindex.html

Andy Brass

Steve Kemp

Paul Fisher


Sleeping Sickness in African Cattle


Caused by infection by parasite (Trypanosoma brucei)



Some cattle breeds more resistant than others


Differences between resistant and susceptible cattle?


Can we breed cattle resistant to infection?



Fisher et al (2007) A systematic strategy for large-scale analysis of genotype-phenotype correlations: identification of candidate genes involved in African trypanosomiasis. Nucleic Acids Res. 35(16):5625-33

Why was the Workflow Approach Successful?


Workflows are protocols: they can be reused or repurposed


Workflow analysed each piece of data systematically


Eliminated user bias and premature filtering of datasets and results leading to single sided, expert-driven hypotheses


The size of the QTL and amount of the microarray data made a manual approach impractical


Workflows capture exactly where data came from and how it was analysed


Workflow output produced a manageable amount of data for the biologists to interpret and verify


“make sense of this data” -> “does this make sense?”

Sharing Experiments




Taverna supports the in silico experimental process for individual scientists


How do you share your results/experiments/experiences with your

Research group


Collaborators


Scientific community

Just Enough Sharing….



myExperiment can provide a central location for workflows from one community/group


myExperiment allows you to say


Who can look at your workflow


Who can download your workflow


Who can modify your workflow


Who can run your workflow
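The four questions above amount to a per-workflow access policy. A minimal sketch of such a policy in Python is shown below; it is an illustration of the idea only and makes no claim about myExperiment's actual data model. The class names, the user names, and the rule that the owner may always do everything are all assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    VIEW = "view"
    DOWNLOAD = "download"
    MODIFY = "modify"
    RUN = "run"


@dataclass
class SharingPolicy:
    """Who may do what with one workflow (assumption: the owner may do everything)."""
    owner: str
    allowed: dict = field(default_factory=dict)  # maps Action -> set of user names

    def grant(self, action: Action, *users: str) -> None:
        self.allowed.setdefault(action, set()).update(users)

    def permits(self, user: str, action: Action) -> bool:
        return user == self.owner or user in self.allowed.get(action, set())


# Hypothetical users, for illustration only.
policy = SharingPolicy(owner="jo")
policy.grant(Action.VIEW, "paul", "andy")
policy.grant(Action.RUN, "paul")
```

Keeping each action separate is what allows "just enough sharing": a collaborator can be allowed to look at a workflow without being able to download or change it.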

The most important aspect of myExperiment

Designed by scientists


Ownership and Attribution



Packs


Packs allow you to collect different items together, like you might with a "wish list" or "shopping basket"


You can collect internal things (such as workflows, files and even other packs) as well as link to things outside myExperiment


Your packs can then be shared, tagged, discovered and discussed easily on myExperiment

Bringing myExperiment to the Taverna User

myExperiment Plugin in Taverna

Running Workflows Through myExperiment

Taverna Remote Execution (T-REX)

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX myexp: <http://rdf.myexperiment.org/ontology#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>

select ?friend1 ?friend2 ?acceptedat where {
  ?z rdf:type <http://rdf.myexperiment.org/ontology#Friendship> .
  ?z myexp:has-requester ?x .
  ?x sioc:name ?friend1 .
  ?z myexp:has-accepter ?y .
  ?y sioc:name ?friend2 .
  ?z myexp:accepted-at ?acceptedat
}

All accepted Friendships including accepted-at time

Semantically-Interlinked Online Communities
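A query like the friendship query above is usually sent to a SPARQL endpoint over HTTP, with the query text URL-encoded in a `query` parameter. The sketch below only builds such a request URL with the Python standard library and does not contact any server. The endpoint address is a placeholder, not myExperiment's real endpoint.

```python
from urllib.parse import urlencode, urlparse, parse_qs

QUERY = """\
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX myexp: <http://rdf.myexperiment.org/ontology#>
PREFIX sioc: <http://rdfs.org/sioc/ns#>

select ?friend1 ?friend2 ?acceptedat where {
  ?z rdf:type <http://rdf.myexperiment.org/ontology#Friendship> .
  ?z myexp:has-requester ?x .
  ?x sioc:name ?friend1 .
  ?z myexp:has-accepter ?y .
  ?y sioc:name ?friend2 .
  ?z myexp:accepted-at ?acceptedat
}
"""


def sparql_get_url(endpoint: str, query: str) -> str:
    """Encode a SPARQL query as a GET request URL (query=<url-encoded text>)."""
    return f"{endpoint}?{urlencode({'query': query})}"


# Placeholder endpoint, for illustration only.
url = sparql_get_url("http://example.org/sparql", QUERY)
```

URL-encoding matters here because the query contains characters such as `#`, `?` and `<` that would otherwise be misread as parts of the URL.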

Service Discovery

Feta “old School”


Semantic Discovery


Ability to find service mismatches


Complex queries


Closed curation


Ugly GUI interface

BioCatalogue


Discovery by tags, text and semantics


Social curation


Web based catalogue


Finding Services


There are over 3500 distributed services. How do we find an appropriate one?




We need to annotate services by their functions (and not their names!)


The services might be distributed, but a registry of service descriptions can be central and queried


Annotated with terms from the myGrid ontology


Questions we can ask: Find me all the services that perform a multiple sequence alignment and accept protein sequences in FASTA format as input
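The example question above can be sketched as a filter over a central registry of service descriptions. This is a deliberately simplified illustration: it matches annotation terms by exact string comparison, whereas an ontology-based registry could also reason over related terms. The service names and annotation strings are invented for the example.

```python
# Hypothetical registry entries; names and annotation terms are invented.
REGISTRY = [
    {"name": "clustalw_run", "task": "multiple sequence alignment",
     "inputs": {"protein sequence:FASTA"}},
    {"name": "blastp_search", "task": "similarity search",
     "inputs": {"protein sequence:FASTA", "database name"}},
    {"name": "msa_dna", "task": "multiple sequence alignment",
     "inputs": {"DNA sequence:EMBL"}},
]


def find_services(registry: list, task: str, required_input: str) -> list:
    """Return names of services annotated with the task AND accepting the input."""
    return [
        service["name"]
        for service in registry
        if service["task"] == task and required_input in service["inputs"]
    ]


matches = find_services(
    REGISTRY, "multiple sequence alignment", "protein sequence:FASTA"
)
```

Note that the search never looks at the service name: `msa_dna` is rejected because of its input annotation, not because of what it is called.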




myGrid Ontology

Logically separated into two parts:


Service ontology


Physical and operational features of web services


Domain ontology


Vocabulary for core bioinformatics data, data types and their relationships

Ontology developed in OWL

myGrid ontology


Example : BLAST (from the DDBJ)


Performs task: Alignment


Uses Method: Similarity Search Algorithm


Uses Resources: DNA/Protein sequence databases


Inputs:


biological sequence


database name


blast program


Outputs: Blast Report


Feta Search Result


Limitations of the Current Model



Feta discovery tool is only accessible from the Taverna Workbench


Only pertinent to Taverna users: other people need to find and use web services


Focuses on finding services, but not workflows. For reuse, we need to do both


Closed annotation system: myGrid curator provides service descriptions

BioCatalogue: A Community Resource



Expanding annotation to allow the community to join in


What is the minimum annotation we need to find the service, and to execute it?


Graduated annotation: bronze, silver, gold, platinum


Record who annotated what and when, to address service versioning and status


Service status monitors

[Figure: curation of service annotations. Sources: Curation by Experts, Curation by the Community, Automated Curation, Curation by Developers. Arrow labels: seed, refine, validate.]

BioCatalogue

Joint Manchester-EBI

Launch ISMB 2009

Current work

Speed and Scalability

Taverna 2 enactor


Support for long running workflows


Large scale data


industrial bioinformatics


Data streaming


Passing data by reference


Integration with established computing platforms


caGrid, EGEE, KnowArc, Dutch e-Science Grid
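The idea of passing data by reference, listed above under large scale data, can be sketched as follows: large values stay in one store, and workflow steps hand each other small reference strings instead of copies of the data. This is a generic illustration, not the Taverna 2 enactor's actual mechanism; the `DataStore` class and its methods are hypothetical.

```python
import uuid


class DataStore:
    """Keeps large values in one place and hands out small references to them."""

    def __init__(self) -> None:
        self._values = {}

    def put(self, value: bytes) -> str:
        ref = uuid.uuid4().hex  # short opaque identifier for the stored value
        self._values[ref] = value
        return ref

    def get(self, ref: str) -> bytes:
        return self._values[ref]


def count_bases(store: DataStore, ref: str) -> int:
    # Only the step that needs the content dereferences it.
    return len(store.get(ref))


store = DataStore()
ref = store.put(b"ACGT" * 1_000_000)  # about 4 MB of sequence data
total = count_bases(store, ref)
```

Steps that merely route data onward only ever touch the 32-character reference, which is what keeps long pipelines over large datasets manageable.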

caGrid Plugin for Taverna



Enables discovery of services in caGrid service registry


Taverna support for GAARDS-secured caGrid services

Lymphoma type prediction workflow

Extensibility and ease of use



Drag and drop workflow building


More content: greater pool of workflows from myExperiment


More components: gathering together commonly used sets of services


Service and workflow annotation checking


Shim libraries for connecting incompatible services
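A shim is a small adapter placed between two services whose output and input formats do not match. As a minimal sketch, assume an upstream service returns a FASTA record while a downstream service expects a bare sequence string. Both services and the sample record are hypothetical; only the FASTA convention of `>` header lines is standard.

```python
def fasta_to_raw_sequence(fasta: str) -> str:
    """Shim: drop FASTA header lines and join the remaining lines into one sequence."""
    lines = [line.strip() for line in fasta.splitlines()]
    return "".join(line for line in lines if line and not line.startswith(">"))


# Hypothetical output of an upstream service.
upstream_output = ">seq1 example protein\nMKTAYIAK\nQRQISFVK\n"


def downstream_service(sequence: str) -> int:
    # Stand-in for a service that rejects FASTA headers and wants bare residues.
    if not sequence.isalpha():
        raise ValueError("expected a bare sequence")
    return len(sequence)


# Without the shim the downstream call would raise; with it the two services connect.
length = downstream_service(fasta_to_raw_sequence(upstream_output))
```

Collecting such adapters in a shared library means each format mismatch only has to be solved once.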



Remote Execution


Taverna Remote Execution Service (T-REX)


Running workflows on a server


Running workflows inside other applications



Taverna is for informatics people (bioinformaticians, cheminformaticians etc). We need other interfaces for uptake by laboratory scientists and health workers

Toolkits “Taverna Inside”

Workflows under the hood


e-Laboratories (portals)


Systems Biology, e-Health


Web based execution


Running workflows over the web through myExperiment


Visualisation clients that call workflows in the background

UTOPIA

Pettifer, Kell, University of Manchester

Toolkits “Taverna Inside”

Workflow development pipeline

Workflows developed by bioinformaticians

Enacted locally

E-Labs and 3rd party clients

Social support for bioinformaticians to find and reuse workflows and expertise

Access to ready made workflows for biologists

Workflows enacted locally

Taverna remote execution service (T-REX)

Social support to find and reuse workflows and expertise

CONFIGURABLE access to ready made workflows for biologists

Workflows embedded in applications and combined with data management systems

myGrid Team

More Information


myGrid


http://www.mygrid.org.uk


Taverna


http://taverna.sourceforge.net


myExperiment


http://www.myexperiment.org


http://rubyforge.org/projects/myexperiment/


http://wiki.myexperiment.org/


BioCatalogue


http://www.biocatalogue.org



Thanks to Carole Goble, David De Roure, Stian Soiland-Reyes and Jiten Bhagat for slide contributions