BBSRC E-science & Bioinformatics (with additional ... - e-Protein


Oct 1, 2013


BBSRC E-science & Bioinformatics (with additional support from the DTI)

Grant Title:

A Distributed Pipeline for Structure-based Proteome Annotation using Grid

Project name

Name: e-Protein

Aims and Objectives

The aim is to provide a structure-based annotation of the proteins in the major genomes,
linking resources at 3 sites by Grid technology. The objectives are: (1) to establish local
databases with structural and function annotations, (2) to disseminate to the biological
community our proteome annotation via a single web-based distributed annotation system
(DAS), (3) to share computing power transparently between sites using Grid
middleware such as the Globus Toolkit, (4) to use the developed system for comparison
of alternative approaches for annotation and thereby identify methodological
improvements, (5) to establish a pre-prototype at 6 months for demonstration purposes,
(6) to provide a working system after two years, (7) to link to relevant bioinformatic and
Grid resources that will be integrated into this project.

e-science issues involved

The project presents a complex Grid infrastructure of distributed computational resources
with different databases and specialised analysis software that may not be deployed on
every resource. It is therefore essential that the Grid infrastructure is able to accurately
represent the state of the software and hardware resources at each site. This
heterogeneous capability needs to be effectively exploited by the complex workflow
presented within the protein annotation pipeline. The capture of this workflow and the
mapping of its components to the distributed resources are the key e-science issues within
this project.

Brief overview of the system architecture

The proposed system builds on top of a service-oriented architecture based around the
current stable release of the Globus Toolkit (version 2), while research and development
activities are undertaken with version 3. The scientists will define the required workflow
within, say, a graphical environment by ‘dragging and dropping’ database and processing
components to build their required workflow. This application specification will then be
mapped to the ‘best’ currently available resources through a scheduling infrastructure.
The scheduler will examine the currently available services (both software and data
sources) and evaluate the capability of the free resources to meet the requirements
specified by the user, e.g. the inter-operation dependencies. Instantiation of the workflow
takes place on the resources using an X.509-based security infrastructure.
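The mapping step described above can be illustrated with a minimal sketch. It is not the project's scheduler: the class names, the service names, and the choice of "most free CPUs" as the meaning of ‘best’ are all assumptions made for illustration. Each workflow component is placed, in dependency order, on a resource that offers every service it requires and has enough free capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A site as the scheduler sees it (hypothetical model)."""
    name: str
    services: frozenset  # software and data sources deployed at the site
    free_cpus: int


@dataclass(frozen=True)
class Component:
    """One step of the user-defined workflow (hypothetical model)."""
    name: str
    requires: frozenset      # services the step needs
    cpus: int = 1
    depends_on: tuple = ()   # names of steps that must be placed first


def topological_order(components):
    """Return components so that each one follows its dependencies."""
    by_name = {c.name: c for c in components}
    ordered, done, visiting = [], set(), set()

    def visit(comp):
        if comp.name in done:
            return
        if comp.name in visiting:
            raise ValueError(f"dependency cycle at {comp.name}")
        visiting.add(comp.name)
        for dep in comp.depends_on:
            visit(by_name[dep])
        visiting.discard(comp.name)
        done.add(comp.name)
        ordered.append(comp)

    for comp in components:
        visit(comp)
    return ordered


def schedule(components, resources):
    """Map each component to a capable resource with the most free CPUs."""
    free = {r.name: r.free_cpus for r in resources}
    plan = {}
    for comp in topological_order(components):
        candidates = [
            r for r in resources
            if comp.requires <= r.services and free[r.name] >= comp.cpus
        ]
        if not candidates:
            raise LookupError(f"no free resource can run {comp.name}")
        best = max(candidates, key=lambda r: free[r.name])  # assumed 'best'
        plan[comp.name] = best.name
        free[best.name] -= comp.cpus
    return plan


if __name__ == "__main__":
    sites = [
        Resource("site_a", frozenset({"sequence_db", "fold_recognition"}), 4),
        Resource("site_b", frozenset({"sequence_db", "domain_db"}), 8),
    ]
    workflow = [
        Component("fetch", frozenset({"sequence_db"})),
        Component("fold", frozenset({"fold_recognition"}), cpus=2,
                  depends_on=("fetch",)),
        Component("domains", frozenset({"domain_db"}), depends_on=("fetch",)),
    ]
    print(schedule(workflow, sites))
```

With these invented sites, the step that needs only the sequence database goes to the site with more free CPUs, while the fold-recognition step can only run where that software is deployed. A real scheduler would also weigh data location, queue times, and security constraints.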

Use of metadata

The ICENI middleware used within the project has a rich meta-data structure that will
allow the current state of the resources to be captured. This meta-data will allow the
different library versions and application programs to be accurately represented. This
information will be defined in an XML schema.
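As a rough illustration of how such resource meta-data might be consumed, the sketch below parses a small XML description of one site. The element and attribute names are invented for this example and are not taken from ICENI or from the project's schema; the document shown is an instance, not the schema itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical resource description; every name here is an assumption.
RESOURCE_XML = """\
<resource name="site_a">
  <hardware cpus="16" free_cpus="4"/>
  <software>
    <library name="libexample" version="1.2"/>
    <application name="fold_recognition" version="3.0"/>
  </software>
</resource>
"""


def parse_resource(xml_text):
    """Extract the hardware and software state from one resource document."""
    root = ET.fromstring(xml_text)
    hardware = root.find("hardware")
    return {
        "name": root.get("name"),
        "free_cpus": int(hardware.get("free_cpus")),
        "libraries": {
            e.get("name"): e.get("version") for e in root.iter("library")
        },
        "applications": {
            e.get("name"): e.get("version") for e in root.iter("application")
        },
    }


def provides(resource, application, version):
    """True if the resource has exactly this version of the application."""
    return resource["applications"].get(application) == version


if __name__ == "__main__":
    site = parse_resource(RESOURCE_XML)
    print(site["name"], provides(site, "fold_recognition", "3.0"))
```

Recording versions explicitly is what lets a scheduler tell apart two sites that both list an application but have different releases installed.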

Use of and contribution to evolving standards

The computer science researchers within the project are active in developing
infrastructures using the Open Grid Services Architecture, an open standard defining
the next generation of Grid middleware infrastructures.

Current state of the project

The project has met its first milestone and demonstrated:

institution Grid c

integrated web-based access to databases at different institutions


All staff have been recruited

The groups of Sternberg, Jones and Orengo each have a local pipeline for proteome
annotation; these pipelines have common features but substantial differences.

The group of Thornton is developing libraries for structure-based assignment of
protein function.

The group of Darlington and Newhouse at ICL and Sorensen at UCL have
implemented Globus-based facilities for external sites to utilize their


Proteome annotation has been run by Imperial at UCL and by UCL at Imperial
using the Globus Toolkit V2 protocol.

The group of Birney (EBI) has developed software for Protein DAS that serves as a
front end to the proteome annotations at different sites.

The DAS front end has successfully integrated access to databases at Imperial and

Institutions involved and their role

Imperial College

University College London

European Bioinformatics Institute

Each institution is involved both with the proteome annotation and the GRID

Names of the team

Prof M Sternberg, Prof J Darlington and Dr S Newhouse (Imperial College)

Prof D Jones, Prof C Orengo & Dr S Sorensen (University College London)

Prof J Thornton, Dr E Birney & Dr A Robinson (European Bioinformatics Institute,


The project obtained support for six PDRAs, two at each site, together with
consumables, travel and local hardware. 36 months of support was provided by the BBSRC
and a further 3 months by the DTI to promote links with industry. In addition, the
project will use existing high performance computing at the three sites such as that
purchased recently under SRIF for e