BBSRC E-science & Bioinformatics (with additional support from the DTI)



Grant Title:

A Distributed Pipeline for Structure-based Proteome Annotation using Grid Technology


Project name

Name: e-Protein


Aims and Objectives

The aim is to provide a structure-based annotation of the proteins in the major genomes by linking resources at three sites using Grid technology. The objectives are: (1) to establish local databases with structural and functional annotations, (2) to disseminate our proteome annotation to the biological community via a single web-based distributed annotation system (DAS), (3) to share computing power transparently between sites using Grid middleware such as the Globus Toolkit, (4) to use the developed system to compare alternative approaches to annotation and thereby identify methodological improvements, (5) to establish a pre-prototype at 6 months for demonstration purposes, (6) to provide a working system after two years, (7) to link to relevant bioinformatics and Grid resources that will be integrated into this project.


E-science issues involved

The project presents a complex Grid infrastructure of distributed computational resources with different databases and specialised analysis software that may not be deployed on every resource. It is therefore essential that the Grid infrastructure is able to accurately represent the state of the software and hardware resources at each site. This heterogeneous capability needs to be effectively exploited by the complex workflow presented within the protein annotation pipeline. The capture of this workflow and the mapping of its components to the distributed resources are the key e-science issues within this project.


Brief overview of the system architecture

The proposed system builds on a service-oriented architecture based around the current stable release of the Globus Toolkit (version 2), while research and development activities are undertaken with version 3. The scientists will define the required workflow within, say, a graphical environment by ‘dragging and dropping’ database and processing components. This application specification will then be mapped to the ‘best’ currently available resources through a scheduling infrastructure. The scheduler will examine the currently available services (both software and data sources) and evaluate the capability of the free resources to meet the requirements specified by the user, e.g. the inter-operation dependencies. Instantiation of the workflow takes place on the resources using an X.509-based security infrastructure.
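
The mapping step described above can be illustrated with a minimal sketch. It is not the project's scheduler: the `Resource` and `Step` classes, the service labels and the "most free CPUs" rule are all assumptions made for illustration. Steps are visited in dependency order, and each is assigned to a resource that offers every service the step requires.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Resource:
    """A compute resource at one site (hypothetical model)."""
    name: str
    services: set[str]   # software and data sources deployed on it
    free_cpus: int


@dataclass
class Step:
    """One workflow component and what it needs in order to run."""
    name: str
    requires: set[str]                             # services the step needs
    after: set[str] = field(default_factory=set)   # upstream step names


def schedule(steps: list[Step], resources: list[Resource]) -> dict[str, str]:
    """Map each step to the capable resource with the most free CPUs."""
    by_name = {s.name: s for s in steps}
    # Visit steps so that every step comes after the steps it depends on.
    order = TopologicalSorter({s.name: s.after for s in steps}).static_order()
    plan: dict[str, str] = {}
    for step_name in order:
        step = by_name[step_name]
        capable = [r for r in resources
                   if step.requires <= r.services and r.free_cpus > 0]
        if not capable:
            raise RuntimeError(f"no resource can run {step_name!r}")
        best = max(capable, key=lambda r: r.free_cpus)
        best.free_cpus -= 1   # reserve one CPU for this step
        plan[step_name] = best.name
    return plan


# Illustrative data only: two sites with different deployed services.
resources = [
    Resource("site_a", {"homology_search", "sequence_db"}, free_cpus=4),
    Resource("site_b", {"fold_recognition", "structure_db"}, free_cpus=2),
]
steps = [
    Step("search", {"homology_search", "sequence_db"}),
    Step("fold", {"fold_recognition", "structure_db"}, after={"search"}),
]
plan = schedule(steps, resources)
```

With this data, `plan` maps `search` to `site_a` and `fold` to `site_b`, because each site is the only one offering the services that step requires. A real scheduler would also need to weigh data transfer costs and queue state, which this sketch ignores.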


Use of metadata

The ICENI middleware used within the project has a rich metadata structure that will allow the current state of the resources to be captured. This metadata will allow the different library versions and application programs to be accurately represented. This information will be defined in an XML schema.
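
As an illustration of how such metadata might be consumed, the sketch below parses a small XML description of one resource and checks it against a set of required software versions. The element names, attribute names and package names are invented for this example and are not taken from the ICENI schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical resource description; the element and attribute names are
# illustrative and are not taken from ICENI's actual schema.
RESOURCE_XML = """
<resource name="cluster-a" site="site-1">
  <library name="libexample" version="1.2"/>
  <library name="libother" version="0.9"/>
  <application name="annotator" version="3.0"/>
</resource>
"""


def installed_versions(xml_text: str) -> dict[str, str]:
    """Return {software name: version} for every library and application."""
    root = ET.fromstring(xml_text)
    versions = {}
    for element in root:
        if element.tag in ("library", "application"):
            versions[element.get("name")] = element.get("version")
    return versions


def satisfies(xml_text: str, required: dict[str, str]) -> bool:
    """True if the resource has each required package at the exact version."""
    have = installed_versions(xml_text)
    return all(have.get(name) == version
               for name, version in required.items())
```

For example, `satisfies(RESOURCE_XML, {"annotator": "3.0"})` is true for the record above, while a request for version 2.0 of the same application is not. Exact version matching is the simplest possible rule; a fuller treatment would allow version ranges.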


Use of and contribution to evolving standards

The computer science researchers within the project are active in developing infrastructures using the Open Grid Services Architecture, an open standard defining the next generation of Grid middleware infrastructures.


Current state of the project


The project has met its first milestone and demonstrated:



inter-institution Grid computing



integrated web-based access to databases at different institutions


Specifically:




All staff have been recruited




The groups of Sternberg, Jones and Orengo each have a local pipeline for proteome annotation; the pipelines share common features but differ substantially.




The group of Thornton is developing libraries for structure-based assignment of protein function.




The groups of Darlington and Newhouse at ICL and of Sorensen at UCL have implemented Globus-based facilities for external sites to utilize their computing resources.




Proteome annotation has been run by Imperial at UCL and by UCL at Imperial using the Globus Toolkit V2 protocol.




The group of Birney (EBI) has developed software for Protein DAS that serves as a front end to the proteome annotations at different sites.




The DAS front end has successfully integrated access to databases at Imperial and UCL.
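
The idea of a single front end over annotations held at different sites can be sketched as follows. This is not the Protein DAS software or the DAS protocol: the sources are in-memory stand-ins for per-site servers, and the feature layout `(start, end, label)` is an assumption made for illustration.

```python
from typing import Callable

# A feature is (start, end, label) on a protein sequence. The sources are
# stand-ins for per-site annotation servers; a real front end would query
# each site over the network instead.
Feature = tuple[int, int, str]
Source = Callable[[str], list[Feature]]


def merge_annotations(protein_id: str,
                      sources: dict[str, Source]) -> list[dict]:
    """Collect features for one protein from every source, tagged by origin."""
    merged = []
    for site, fetch in sources.items():
        for start, end, label in fetch(protein_id):
            merged.append({"site": site, "start": start,
                           "end": end, "label": label})
    # Order along the sequence so that overlapping calls sit side by side.
    merged.sort(key=lambda f: (f["start"], f["end"], f["site"]))
    return merged
```

Keeping the originating site on every feature lets a viewer show where each annotation came from, which matters when local pipelines disagree.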



Institutions involved and their role




Imperial College



University College London



European Bioinformatics Institute




Each institution is involved in both the proteome annotation and the Grid computing.


Names of the team




Prof M Sternberg, Prof J Darlington and Dr S Newhouse (Imperial College London)



Prof D Jones, Prof C Orengo & Dr S Sorensen (University College London)



Prof J Thornton, Dr E Birney & Dr A Robinson (European Bioinformatics Institute, Cambridge)




Resources

The project obtained support for six PDRAs, two at each site, together with consumables, travel and local hardware. 36 months of support was provided by the BBSRC and a further 3 months by the DTI to promote links with industry. In addition, the project will use existing high performance computing at the three sites, such as that purchased recently under SRIF for e-science.