Bioinformatics applications using GRID - IFIC


Oct 2, 2013


Accelerating bioinformatics with GRID computing

Angel Merino

Centro Nacional de Biotecnología,

Unidad de Biocomputación


Outline

Electron Microscopy (EM)

- What EM is.
- What the workflow is.

What is being solved with the GRID: processes/applications that have been "gridified"

- Maximum Likelihood
- CTF estimation

Overcoming the potential barrier

- Web portal
- Web/Grid Services & Workflows

Other applications in the field


What is EM (I)

EM is a structural analysis technique.

It lets us explore the molecular environment of the particles under study.

What is the workflow

Sample preparation.

Image acquisition.

Image processing and computation of 3D volumes.

Biological Material

- High H2O content
- Elevated radiation damage

Negative staining

- Dehydration
- Structural changes / crushing
- Image comes from the metal mold

Cryomicroscopy

- Hydrated / biologically friendly
- Fewer distortions
- Image comes from the biological specimen



What is EM (II)

What is EM (III)

Negative staining

Cryomicroscopy

Aberrations in the microscope optics affect the experimental images (blurring). These effects can be described by the contrast transfer function (CTF).

CTF estimation in Xmipp may take up to half a day per micrograph, and a user processes about 100 micrographs per experiment. Grid computing is therefore necessary.

Estimating the CTF allows the blurred images to be corrected.
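
To make the scale concrete, the figures above can be turned into a rough estimate. This is only a sketch: it assumes the worst case of half a day (12 hours) per micrograph, treats each micrograph as one independent job, and ignores scheduling and data-transfer overhead. The function name and the CPU counts tried are illustrative, not taken from the slides.

```python
import math

HOURS_PER_MICROGRAPH = 12   # "up to half a day" per micrograph (worst case)
MICROGRAPHS = 100           # "about 100 micrographs" per experiment


def wall_clock_days(micrographs: int, hours_each: float, workers: int) -> float:
    """Wall-clock days when independent jobs are spread over `workers` CPUs.

    Assumes each micrograph is one indivisible job and ignores grid overhead.
    """
    rounds = math.ceil(micrographs / workers)  # batches of jobs run back to back
    return rounds * hours_each / 24


if __name__ == "__main__":
    for workers in (1, 10, 100):
        days = wall_clock_days(MICROGRAPHS, HOURS_PER_MICROGRAPH, workers)
        print(f"{workers:>3} CPUs: {days:g} days")
```

Under these assumptions a single CPU needs about 50 days for one experiment, while 100 CPUs working in parallel need about half a day.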

CTF estimation (I)

CTF estimation (II)

CTF estimation (III)

Per micrograph

1000x

Maximum-Likelihood

Maximum-Likelihood (I)

“Slow” execution

1 iteration

Maximum-Likelihood (II)

“Fast” execution (MPI)

Maximum-Likelihood development using the EGEE GRID vs. a local cluster

Using the EGEE GRID

Grid

Last November, 17160 CPU hours were consumed (almost 2 years!)

23 CPUs full time

Using our local cluster (50%) (jumilla.cnb.uam.es) for the same activity

20 CPUs

Actual usage time = 50% of the total time, because of the development activity that was under way

46 CPUs!!!
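
The arithmetic behind these figures can be checked with a short sketch. It assumes a 30-day November and a 365-day year; the function names are illustrative. On that basis the full-time equivalent comes out slightly above the slide's 23 CPUs (the slide does not state how it rounded), and doubling the slide's 23 for a cluster usable 50% of the time gives the quoted 46.

```python
CPU_HOURS = 17160        # CPU hours consumed in November (from the slide)
LOCAL_UTILISATION = 0.5  # local cluster usable 50% of the time (from the slide)


def cpu_years(cpu_hours: float) -> float:
    """CPU hours expressed as years of a single CPU (365-day year assumed)."""
    return cpu_hours / (24 * 365)


def full_time_cpus(cpu_hours: float, days: int = 30) -> float:
    """CPUs that must run non-stop for `days` days to deliver `cpu_hours`."""
    return cpu_hours / (24 * days)


def cpus_needed(full_time: float, utilisation: float) -> float:
    """CPUs required when each one is only usable a fraction of the time."""
    return full_time / utilisation


if __name__ == "__main__":
    print(f"CPU-years:              {cpu_years(CPU_HOURS):.2f}")
    print(f"Full-time CPUs:         {full_time_cpus(CPU_HOURS):.1f}")
    print(f"CPUs at 50% (from 23):  {cpus_needed(23, LOCAL_UTILISATION):.0f}")
```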

Overcoming the potential barrier

4 simple steps to run all the jobs you need for your experiment

1. Select your application.

2. Log in to the UI.

3. Upload the files you need.

4. Submit your experiment, giving a notification e-mail address and your certificate password.

Overcoming the potential barrier (I)

Input from Grid portal

C++ Object

Submit job and publish the data (first time)

Checking status

Get output and retrieve the output data.

JDLs

Required scripts (3)

Required input tars

For each JDL

Aborted or not submitted

Done (success)

First script

Second script

Third script

Run the job and publish the output data when the job finishes.

Send e-mail to the notification e-mail address

The portal engine
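
The flowchart labels above suggest a loop that runs once per JDL: submit, check the status, retry when the job is aborted or was not submitted, fetch the output on success, and finally e-mail the user. The sketch below is one possible reading of that flow, not the portal's actual C++ engine. The grid operations are passed in as plain callables because the slides do not name the commands used; `run_experiment`, `max_attempts`, and the retry-on-abort behaviour are all assumptions.

```python
from typing import Callable, Dict, Iterable, Optional


def run_experiment(
    jdls: Iterable[str],
    submit: Callable[[str], Optional[str]],
    wait_for_status: Callable[[str], str],
    fetch_output: Callable[[str], None],
    notify: Callable[[Dict[str, str]], None],
    max_attempts: int = 3,
) -> Dict[str, str]:
    """Drive every JDL through submit -> status check -> output retrieval.

    All four callables are hypothetical stand-ins for the portal's scripts.
    `wait_for_status` is assumed to return a final state such as "Done"
    or "Aborted".
    """
    outcome: Dict[str, str] = {}
    for jdl in jdls:
        outcome[jdl] = "failed"
        for _ in range(max_attempts):
            job_id = submit(jdl)          # None means "not submitted"
            if job_id is None:
                continue                  # try to submit again
            if wait_for_status(job_id) == "Done":
                fetch_output(job_id)      # retrieve the output data
                outcome[jdl] = "done"
                break
            # any other final state (e.g. "Aborted"): resubmit
    notify(outcome)                       # e.g. send the notification e-mail
    return outcome
```

Passing the grid operations in as functions keeps the control flow testable without a grid: stubs can simulate an aborted job and confirm that it is resubmitted.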

Overcoming the potential barrier (II)

Workflows & Grid Services

Grid Protein Sequence Analysis




Scientific objectives

Bioinformatic analysis of the data produced by complete genome sequencing projects is one of the major challenges of the coming years. Integrating up-to-date databanks and relevant algorithms is a clear requirement of such an analysis.

Grid computing, such as the infrastructure provided by the EGEE European project, would be a viable solution to distribute data, algorithms, computing and storage resources for genomics.

Providing bioinformaticians with a good interface to the grid infrastructure will also be a challenge that must be met. The GPS@ web portal, Grid Protein Sequence Analysis, aims to be such a user-friendly interface to these genomic resources on the EGEE grid.




Method

A well-known web interface eases access to the algorithms offered.

Protein databases are stored on grid storage as flat files.

Most protein sequence analysis tools are reference legacy code that is run unchanged. These tools are wrapped in grid jobs to be executed on grid resources.

The algorithms' output is analysed and displayed in graphic format through the web interface.
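
As an illustration of what wrapping an unchanged legacy tool in a grid job can look like, the sketch below generates a minimal job description in JDL, the format mentioned earlier in the talk. It is not GPS@'s actual code. The wrapper script name, tool arguments, and file names are hypothetical, and the job description uses only the basic JDL attributes (`Executable`, `Arguments`, `StdOutput`, `StdError`, `InputSandbox`, `OutputSandbox`).

```python
from typing import Sequence


def _jdl_list(items: Sequence[str]) -> str:
    """Format a Python sequence as a JDL list: {"a", "b"}."""
    return "{" + ", ".join(f'"{item}"' for item in items) + "}"


def make_jdl(wrapper: str, tool_args: str, inputs: Sequence[str]) -> str:
    """Render a minimal JDL that runs `wrapper` with `tool_args`.

    `wrapper` is a hypothetical shell script that fetches the database
    from grid storage and then calls the legacy tool unchanged.
    """
    lines = [
        f'Executable = "{wrapper}";',
        f'Arguments = "{tool_args}";',
        'StdOutput = "std.out";',
        'StdError = "std.err";',
        f"InputSandbox = {_jdl_list([wrapper, *inputs])};",
        'OutputSandbox = {"std.out", "std.err"};',
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Illustrative names only; none of them come from the slides.
    print(make_jdl("run_tool.sh", "query.fasta", ["query.fasta"]))
```

Only the small query file travels in the input sandbox here; the large flat-file databases would stay on grid storage and be fetched by the wrapper, which is an assumption consistent with the storage model described above.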

Other applications

Other applications (I)

Scientific objectives

Provide docking information to help in the search for new drugs.

Biological goal: propose new inhibitors (drug candidates) targeting neglected diseases.

Bioinformatics goal: in silico virtual screening of drug candidate DBs.

Grid goal: demonstrate to the research communities active in drug discovery the relevance of grid infrastructures through the deployment of a compute-intensive application.

Method

Large-scale molecular docking on malaria to compute millions of potential drugs with some software and parameter settings.

Docking is about computing the binding energy of a protein target to a library of potential drugs using a scoring algorithm.

In silico Drug Discovery
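
Because each compound is scored independently, a screening campaign of this kind splits naturally into many grid jobs. The sketch below shows only that structure: the library is cut into batches (one batch per job), every compound is scored, and the results are merged and ranked. The scoring function is a placeholder supplied by the caller, not a real docking program, and ranking by lowest energy first is an assumed convention. All names are illustrative.

```python
from typing import Callable, Iterator, List, Sequence, Tuple


def chunks(items: Sequence[str], size: int) -> Iterator[Sequence[str]]:
    """Split the compound library into batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def screen(
    ligands: Sequence[str],
    score: Callable[[str], float],
    per_job: int,
    top_n: int,
) -> List[Tuple[float, str]]:
    """Score every ligand batch by batch and return the `top_n` best hits.

    Each batch stands for one grid job; here the batches simply run in turn.
    """
    hits: List[Tuple[float, str]] = []
    for batch in chunks(ligands, per_job):
        hits.extend((score(ligand), ligand) for ligand in batch)
    hits.sort()                 # lowest energy first (assumed to be best)
    return hits[:top_n]


if __name__ == "__main__":
    # Made-up energies for illustration only.
    toy_energy = {"L1": -5.0, "L2": -9.5, "L3": -3.2, "L4": -7.1, "L5": -6.0}
    print(screen(list(toy_energy), toy_energy.__getitem__, per_job=2, top_n=2))
```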

Genome evolution modeling




Scientific objectives

Study human evolutionary genetics and answer questions such as the geographic origin of modern human populations, the genetic signature of expanding populations, the genetic contacts between modern humans and Neanderthals, and the expected null distributions of genetic statistics applied to genome-wide data sets.

Method

Simulate the past demography (growth and migrations) of human populations in a geographically realistic landscape, taking into account the spatial and temporal heterogeneity of the environment.

Generate the molecular diversity of several samples of genes drawn at any location within the current human range, and compare it to the observed contemporary molecular diversity.

SPLATCHE uses a region sampling Bayesian framework that requires 10^5 independent demographic and genetic simulations.
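
Independent simulations like these are what make the problem a good fit for a grid: each one can run as its own job with its own random seed. The sketch below illustrates the idea with a simple accept/reject comparison against an observed value. It is a toy stand-in, not SPLATCHE and not necessarily the sampling scheme SPLATCHE uses; the one-parameter model, the flat prior, the noise level, and the tolerance are all assumptions made for the example.

```python
import random
from typing import List, Tuple


def one_simulation(seed: int) -> Tuple[float, float]:
    """Toy stand-in for one demographic + genetic simulation.

    Draws a growth rate from a flat prior and returns it together with a
    noisy "diversity" statistic. Each call uses its own seeded generator,
    so simulations are independent and reproducible.
    """
    rng = random.Random(seed)
    growth_rate = rng.uniform(0.0, 1.0)
    diversity = growth_rate + rng.gauss(0.0, 0.05)
    return growth_rate, diversity


def accept_reject(observed: float, n_sims: int, tolerance: float) -> List[Tuple[float, float]]:
    """Keep the simulations whose diversity is within `tolerance` of `observed`."""
    accepted = []
    for seed in range(n_sims):            # each iteration could be one grid job
        growth_rate, diversity = one_simulation(seed)
        if abs(diversity - observed) <= tolerance:
            accepted.append((growth_rate, diversity))
    return accepted


if __name__ == "__main__":
    kept = accept_reject(observed=0.3, n_sims=1000, tolerance=0.05)
    print(f"accepted {len(kept)} of 1000 simulations")
```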

Other applications (II)

For more information

Xmipp web page: www.cnb.uam.es/~bioinfo

Unit web page: http://biocomp.cnb.uam.es

NA4 EGEE biomed applications home: http://egee-na4.ct.infn.it/biomed/index.php

aj.merino@cnb.uam.es

Thank you