Biogrid – Bioinformatics for the grid - NORDUnet

Oct 2, 2013

Biogrid


Bioinformatics for the grid

Joel Hedlund <yohell@ifm.liu.se>

Biogrid User and Developer

Linköping University, Sweden


Birds-of-a-feather session tonight: see me after this talk!

Outline


What is it?


What is it good for?


Does it really work?


Gory details.


Why did we do this?


Profit!


What is it?

NDGF BIO Community Grid



Bioinformatics for the Grid

What is it?


Unified interface

...to popular bioinformatic applications

...on shared, distributed computational resources

...using versioned and cached databases

What is it good for?


Burst computing


High demand for short periods of time


high during development / production


low during analysis / writing papers


Share resources to enable more efficient use


Database accessibility


Availability


Unified interface


What is NDGF?


Nordic Data Grid Facility


A WLCG Tier1 facility


Worldwide LHC Computing Grid


Stores and processes data from LHC at CERN


peak rate ≈ 1.6Gb/s, when the accelerator is running

(and that’s after most of the data have been filtered away)

”Does it really work, this distributed thingie?”

Why yes, very well thank you!

NDGF


96% availability

(highest of all Tier1 facilities)


Third largest Tier1 facility in the world


Lowest ratio of failed ATLAS jobs


Production goals met, and beyond


Goal: 8% of all ATLAS resources (10.5% provided)


Goal: 9% of all ALICE resources (12% provided)



* Data graciously stolen from Leif Nixon's NORDUnet 2008 talk. Thank you, Leif :-)

DISTRIBUTION

IS A

STRENGTH

It enforces unification


It ensures availability

Does it really work?

It’s good enough for LHC.


It’s good enough for Bioinformatics.

Gory details

Biogrid provides

Optimised applications:


BLAST


ClustalW


HMMER


Muscle


Mafft


Planned: molecular dynamics, phylogeny...

Biogrid provides

Versioned, indexed and cached databases


UniProtKB (subreleases)


Uniref (subreleases)


Planned: genomes (EnsEMBL), nucleotides (EMBL)...

Cached database access

Database files are transferred to the cluster at most once per project.
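The "at most once" rule can be illustrated with a small fetch-if-missing cache. This is only a sketch of the idea, not Biogrid's actual caching code: the function name `ensure_cached`, the `fetch` callback, and the `.part` temporary suffix are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable


def ensure_cached(name: str, cache_dir: Path,
                  fetch: Callable[[Path], None]) -> Path:
    """Return the cached copy of `name`, calling `fetch` only if it is missing.

    `fetch` is a hypothetical callback that writes the database file to the
    path it is given (for example by copying it from grid storage).
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / name
    if not target.exists():
        # Download to a temporary name first, so an interrupted transfer
        # never leaves a half-written file that looks like a valid cache hit.
        partial = target.with_name(target.name + ".part")
        fetch(partial)
        partial.replace(target)
    return target


if __name__ == "__main__":
    import tempfile

    def fake_fetch(dest: Path) -> None:
        # Stand-in for a real transfer; writes a tiny placeholder file.
        dest.write_text(">placeholder\nMKV\n")

    with tempfile.TemporaryDirectory() as tmp:
        first = ensure_cached("uniprot_sprot.fasta", Path(tmp), fake_fetch)
        second = ensure_cached("uniprot_sprot.fasta", Path(tmp), fake_fetch)
        print(first == second)
```

Every job in a project that asks for the same file after the first one gets the existing copy, so the transfer cost is paid once per cluster rather than once per job.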

Unified Interface

[Figure labels: DATA, RESULTS]

Unified Interface


XRSL Job Description

Standard in ARC Grid Middleware


Well defined runtime environments

$HMMERDIR: node-local (fast) scratch dir containing db files

prepare_db: download and unpack db files on the fly from front node to $HMMERDIR
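To make the runtime environment contract more concrete, here is a rough sketch of the unpacking step a job might perform on the compute node. It is not the real prepare_db helper, whose implementation is not shown in the talk; the function `unpack_db` is hypothetical, and the only names taken from the slides are the HMMERDIR variable and the sp.gz / tr.gz input file names.

```python
import gzip
import os
import shutil
from pathlib import Path


def unpack_db(gz_path: Path, scratch_dir: Path) -> Path:
    """Decompress a gzip'ed database file into the scratch directory.

    Hypothetical stand-in for the unpacking half of prepare_db.
    """
    scratch_dir.mkdir(parents=True, exist_ok=True)
    # Drop a trailing ".gz" to name the unpacked file (sp.gz -> sp).
    name = gz_path.name[:-3] if gz_path.name.endswith(".gz") else gz_path.name
    out_path = scratch_dir / name
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
        shutil.copyfileobj(src, dst)  # streams, so large files stay off the heap
    return out_path


if __name__ == "__main__":
    # HMMERDIR is set by the runtime environment; fall back to the current
    # directory when running this sketch outside the grid.
    scratch = Path(os.environ.get("HMMERDIR", "."))
    for staged in ("sp.gz", "tr.gz"):
        if Path(staged).exists():
            print(unpack_db(Path(staged), scratch))
```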

XRSL Job Description

(jobName=refinehmm-family023)
(runTimeEnvironment=APPS/BIO/HMMER2.3.2)
(cpuTime=3000)
(executable=refinehmm.jobscript.sh)
(inputFiles=
  (sp.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_sprot.fasta.gz)
  (tr.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_trembl.fasta.gz)
  (family023.hmm "")
)
(outputfiles=
  (family023.refined.hmm "")
)

XRSL Job Description

(jobName=refinehmm-$HMM_NAME)
(runTimeEnvironment=APPS/BIO/HMMER2.3.2)
(cpuTime=3000)
(executable=refinehmm.jobscript.sh)
(inputFiles=
  (sp.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_sprot.fasta.gz)
  (tr.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_trembl.fasta.gz)
  ($HMM_NAME.hmm "")
)
(outputfiles=
  ($HMM_NAME.refined.hmm "")
)
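The templated job description lends itself to generating one job file per HMM. The sketch below shows one way that substitution could be done with Python's standard `string.Template`; the talk does not say which tool was actually used to fill in $HMM_NAME, and the function name `render_xrsl` is an assumption.

```python
from string import Template

# The job description from the slide, with $HMM_NAME as the only placeholder.
XRSL_TEMPLATE = Template('''\
(jobName=refinehmm-$HMM_NAME)
(runTimeEnvironment=APPS/BIO/HMMER2.3.2)
(cpuTime=3000)
(executable=refinehmm.jobscript.sh)
(inputFiles=
  (sp.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_sprot.fasta.gz)
  (tr.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_trembl.fasta.gz)
  ($HMM_NAME.hmm "")
)
(outputfiles=
  ($HMM_NAME.refined.hmm "")
)
''')


def render_xrsl(hmm_name: str) -> str:
    """Fill in the placeholder; raises KeyError if a placeholder is left unset."""
    return XRSL_TEMPLATE.substitute(HMM_NAME=hmm_name)


if __name__ == "__main__":
    # Write one job description per family; the family names are examples.
    for hmm_name in ("family023", "family024"):
        with open(f"refinehmm-{hmm_name}.xrsl", "w") as fh:
            fh.write(render_xrsl(hmm_name))
```

Using `substitute` rather than `safe_substitute` means a misspelled placeholder fails loudly instead of ending up in a submitted job.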

Unified Interface


Run on any resource I can access:

$ ngsub myjob.xrsl


...or run on my buddy’s cluster:

$ ngsub -c kiniini.csc.fi myjob.xrsl


Check jobs:

$ ngstat refinehmm-family023

(or use Grid Monitor web interface at www.nordugrid.org)


Fetch results:

$ ngget refinehmm-family*
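The submit / check / fetch cycle above is easy to script when there are many jobs. The following sketch only builds the command lines, using nothing beyond the invocations shown on the slide (ngsub with an optional -c cluster, ngstat, ngget); the helper function names are made up for illustration, and the commented-out `subprocess.run` call assumes an ARC client is installed.

```python
import shlex
from typing import List, Optional


def submit_cmd(xrsl_file: str, cluster: Optional[str] = None) -> List[str]:
    """Command line for submitting a job, optionally pinned to one cluster."""
    cmd = ["ngsub"]
    if cluster is not None:
        cmd += ["-c", cluster]
    cmd.append(xrsl_file)
    return cmd


def status_cmd(job_name: str) -> List[str]:
    """Command line for checking a job by name."""
    return ["ngstat", job_name]


def fetch_cmd(job_name: str) -> List[str]:
    """Command line for fetching the results of a job by name."""
    return ["ngget", job_name]


if __name__ == "__main__":
    commands = [
        submit_cmd("myjob.xrsl"),
        submit_cmd("myjob.xrsl", cluster="kiniini.csc.fi"),
        status_cmd("refinehmm-family023"),
        fetch_cmd("refinehmm-family023"),
    ]
    for cmd in commands:
        print(shlex.join(cmd))
        # To actually run it (requires an ARC client on PATH):
        # import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with job names, at the cost that wildcards such as refinehmm-family* are handed to the client literally rather than expanded by a shell.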


[Figure labels: DATA, RESULTS, GRID]

What do I need?

1. A resource with ARC and Biogrid REs

2. An ARC client

3. A Grid Certificate (available from a number of global certificate authorities)

4. Time allowance on the resource

5. Biogrid VO Membership (not really necessary, but it will get you 1 & 4)

What do I need?

...or you can just grab the RE scripts off the biogrid website,

and your db of choice from the biogrid dCache.


Why did we do this?

Bioinformatic applications...



CPU intensive


Small input and output files


”Large” databases can be cached


...are very well suited for distributed computing.


Profit!

Subclassification of the MDR superfamily


15000 members

from all kingdoms of life


500 families

25% sequence identity


40 human members


Different substrate specificities


Different subunit & cofactor count


2 HMMs available for superfamily detection


None for any of the individual families

Subclassification of the MDR superfamily


We made HMMs for all MDR (sub)families with 20+ members.


86 families


34 subfamilies detected within 14 of these


11579 / 15000 sequences classified


≈5000 hmmsearch runs against UniProtKB
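For a sense of scale, the coverage figures above can be turned into percentages. The numbers are taken directly from the slides; the helper name `share` is only for illustration.

```python
def share(part: int, whole: int) -> float:
    """Fraction of `whole` covered by `part`."""
    return part / whole


if __name__ == "__main__":
    classified, total_sequences = 11579, 15000
    modelled_families, all_families = 86, 500

    print(f"sequences classified: {share(classified, total_sequences):.1%}")
    print(f"sequences left over:  {total_sequences - classified}")
    print(f"families with an HMM: {share(modelled_families, all_families):.1%}")
```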

Manuscript in preparation

refinehmm


Algorithm for automated HMM refinement


Produces stable and reliable HMMs


Developed using Biogrid REs and resources

Will also be open source software once the paper is out.

Acknowledgements


Olli Tourunen

Biogrid developer


Bengt Persson

Biogrid PI


NDGF

Michael Grønager

Josva Kleist


Biogrid co-applicants

Ann-Charlotte Berglund Sonnhammer

Erik Sonnhammer

Inge Jonassen


Supercomputing centers


NSC

Jens Larsson, Leif Nixon


HPC2N

Åke Sandgren


Others

C3SE, CSC, Uppmax, Lunarc, PDC,

Aalborg University, Oslo University


Birds-of-a-feather session tonight: see me after the talk!

Joel Hedlund

yohell@ifm.liu.se


Biogrid User and Developer

Linköping University, Sweden
