
Bioinformatics Applications on Grid in Estonia

Igor Kuzmitshov
EENet, Estonia

Applications meeting, Vilnius, 16-18 January 2007


Outline


BIIT Group


BI Applications


Experience and Problems



BIIT

Bioinformatics and Data Mining

Institute of Computer Science

University of Tartu, Estonia


Led by Jaak Vilo, Ph.D.



BIIT Group (Jaak Vilo)

Maarika Traat

Meelis Kull

Hedi Peterson

Pavlos Pavlidis

Jüri Reimand

Jelena Zaitseva

Asko Tiidumaa

Marion Reuter

Margus Jager

Konstantin Tretjakov

Priit Adler

Raivo Kolde

Jaanus Hansen

Kristo Tammeoja

Jaanus Uri

Darja Krushevskaja

Anton Litvinenko

Igor Kuzmitshov

Ilja Livenson



BIIT


Functional genomics and systems biology


Pattern matching and discovery


Gene expression data analysis (e.g., clustering)


Functional annotation


Information retrieval



Biological Research Questions (Examples)

How are genes regulated?

Which genes, when, how?

What are the causes and effects of diseases like cancer?

What is the function of each and every molecule?

What are the complex relationships, networks, and pathways that form the basis of the living organism?

How can we intervene in diseases (drug design)?

How can we predict all the effects of drugs?



BalticGrid Goals in Bioinformatics

Activity NA3 “Application Identification and Support”

Task 1: Pilot applications

Subtask: Bioinformatics (BI)

sequence pattern discovery and the gene regulatory network reconstruction,

modelling of biosensors and other reaction-diffusion processes.



Typical Structure of Applications

Main logic (application-specific): produce data/queries, analyze results

Common subroutines (some: to grid): Tool 1, Tool 2, Tool 3



Pattern Discovery

1. Choose the language (formalism) to represent the patterns

2. Choose the rating for patterns, to tell that one pattern is “better” than another

3. Design an algorithm that finds the best patterns from the pattern class, fast.

Brazma A, Jonassen I, Eidhammer I, Gilbert D. Approaches to the automatic discovery of patterns in biosequences. J Comput Biol. 1998;5(2):279-305.
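The three steps can be illustrated with a deliberately small Python sketch. The choices here are assumptions made for illustration only: the pattern language is plain substrings over ACGT, the rating is the number of sequences containing the pattern, and the algorithm is a level-wise search that stops extending a pattern once it is no longer frequent. The function names are hypothetical and do not come from any tool named in these slides.

```python
def support(pattern, sequences):
    """Step 2 (rating): number of sequences that contain the pattern."""
    return sum(1 for seq in sequences if pattern in seq)


def discover_patterns(sequences, min_support, max_length, alphabet="ACGT"):
    """Step 3 (algorithm): level-wise exhaustive search over substrings.

    Step 1 (language) is fixed here to plain substrings over `alphabet`.
    A pattern is extended only while it is frequent, because an extension
    can never occur in more sequences than the pattern it extends.
    """
    results = {}
    frontier = [""]
    for _ in range(max_length):
        next_frontier = []
        for prefix in frontier:
            for symbol in alphabet:
                candidate = prefix + symbol
                count = support(candidate, sequences)
                if count >= min_support:
                    results[candidate] = count
                    next_frontier.append(candidate)
        frontier = next_frontier
    # Best first: highest support, then longest pattern, then alphabetical.
    return sorted(results.items(), key=lambda item: (-item[1], -len(item[0]), item[0]))


sequences = ["ACGTACGT", "TTACGTAA", "GGACGTCC", "CCCCCCCC"]
best = discover_patterns(sequences, min_support=3, max_length=4)
```

Swapping in a richer pattern language or a statistical rating changes `support` and the candidate generation, but the overall shape of the search stays the same.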



SPEXS

Sequence Pattern Exhaustive Search

(Jaak Vilo, 1998, 2002)


User-definable pattern language: substrings, character groups, wildcards, flexible wildcards (cf. PROSITE)

Fast exhaustive search over pattern language

“Lazy suffix tree construction”-like algorithm (Kurtz, Giegerich)

Analyze multiple sets of sequences simultaneously

Restrict search to most frequent patterns only (in each set)

Report most frequent patterns, patterns over- or underrepresented in selected subsets, or patterns significant by various statistical criteria, e.g., by binomial distribution
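As an illustration of the binomial criterion mentioned in the last point, the following Python sketch computes the probability of seeing at least the observed number of matching sequences in a selected set, given the match rate in a background set. This is a generic binomial tail calculation under that assumption, not SPEXS's own code; the function names and the example counts are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def overrepresentation_pvalue(hits_in_set, set_size, hits_in_background, background_size):
    """Probability of at least `hits_in_set` matching sequences in the
    selected set if the pattern matched at the background rate."""
    background_rate = hits_in_background / background_size
    return binomial_tail(hits_in_set, set_size, background_rate)


# Hypothetical counts: pattern found in 8 of 10 selected sequences,
# but in only 100 of 1000 background sequences.
p_value = overrepresentation_pvalue(8, 10, 100, 1000)
```

A small value suggests the pattern is overrepresented in the selected set; when many patterns are tested, some correction for multiple testing would normally be applied on top of this.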



All Against All Approximate Matching

(Hendrik Nigul, Jaak Vilo)

For every subsequence of every sequence

Match approximately against all the sequences

Approximate hits define PWM matrices (not all positions vary equally)

Look for all PWMs derived from the data that are enriched in the data set (vs. background)
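The first three points can be sketched in Python as follows. The sketch assumes fixed-length subsequences, Hamming distance as the notion of approximate match, and sequences over ACGT only; none of these details are stated on the slide, and the function names are hypothetical. The enrichment test against a background set is not shown.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def approximate_hits(query, sequences, max_mismatches):
    """All windows of len(query) within max_mismatches of the query."""
    k = len(query)
    hits = []
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            window = seq[start:start + k]
            if hamming(query, window) <= max_mismatches:
                hits.append(window)
    return hits


def count_matrix(hits, alphabet="ACGT"):
    """Position-specific counts: one row per symbol, one column per position."""
    k = len(hits[0])
    matrix = {symbol: [0] * k for symbol in alphabet}
    for hit in hits:
        for pos, symbol in enumerate(hit):
            matrix[symbol][pos] += 1
    return matrix


def all_against_all(sequences, k, max_mismatches):
    """Build one count matrix per distinct length-k subsequence."""
    matrices = {}
    for seq in sequences:
        for start in range(len(seq) - k + 1):
            query = seq[start:start + k]
            if query not in matrices:
                hits = approximate_hits(query, sequences, max_mismatches)
                matrices[query] = count_matrix(hits)
    return matrices
```

Because every subsequence is matched against every sequence, the work grows roughly quadratically with the total sequence length, which is why this kind of computation is a candidate for splitting across many jobs.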



AlignACE

Aligns Nucleic Acid Conserved Elements (Roth et al.)


Finds sequence elements conserved in a set of
DNA sequences


Uses a Gibbs sampling strategy
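To show what a Gibbs sampling strategy for motif finding looks like in general, here is a minimal Python site sampler. It is a textbook-style sketch, not AlignACE's implementation: it assumes one motif occurrence per sequence, a fixed motif width, a single strand, and an ACGT alphabet. All function names and parameters are hypothetical.

```python
import random


def profile_from(motifs, width, alphabet="ACGT", pseudocount=1.0):
    """Per-position symbol probabilities estimated from aligned motifs."""
    profile = []
    for pos in range(width):
        counts = {symbol: pseudocount for symbol in alphabet}
        for motif in motifs:
            counts[motif[pos]] += 1
        total = sum(counts.values())
        profile.append({symbol: c / total for symbol, c in counts.items()})
    return profile


def window_probability(window, profile):
    """Probability of a window under the position-specific profile."""
    prob = 1.0
    for pos, symbol in enumerate(window):
        prob *= profile[pos][symbol]
    return prob


def gibbs_motif_search(sequences, width, iterations=200, seed=0):
    """Return one motif start position per sequence."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    for _ in range(iterations):
        # Leave one sequence out and build a profile from the others.
        i = rng.randrange(len(sequences))
        others = [
            s[st:st + width]
            for j, (s, st) in enumerate(zip(sequences, starts))
            if j != i
        ]
        profile = profile_from(others, width)
        # Resample the left-out start in proportion to how well each
        # window fits the profile (pseudocounts keep weights positive).
        seq = sequences[i]
        weights = [
            window_probability(seq[st:st + width], profile)
            for st in range(len(seq) - width + 1)
        ]
        starts[i] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts
```

The sampler is stochastic, so different seeds can give different alignments; in practice such a search is usually repeated from several starting points and the best-scoring alignment is kept.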



MEM

Multi-Experiment-Matrix (Adler)

Many data sets

Which experiments produce similar cellular states?

When are the genes co-expressed?


Which genes are related to each other?
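One way to ask "which genes are related to each other?" across many data sets is to rank genes by correlation with a query gene in each data set and then combine the ranks. The Python sketch below does this with Pearson correlation and a mean rank. Both choices are assumptions made for illustration, since the slide does not describe how MEM combines data sets, and all names are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def ranks_in_dataset(query, dataset):
    """dataset maps gene -> expression values; rank 1 is most correlated."""
    scores = {
        gene: pearson(dataset[query], values)
        for gene, values in dataset.items()
        if gene != query
    }
    ordered = sorted(scores, key=lambda gene: (-scores[gene], gene))
    return {gene: rank for rank, gene in enumerate(ordered, start=1)}


def coexpressed_genes(query, datasets):
    """Genes ordered by mean correlation rank over all data sets."""
    collected = {}
    for dataset in datasets:
        if query not in dataset:
            continue  # the query gene was not measured in this data set
        for gene, rank in ranks_in_dataset(query, dataset).items():
            collected.setdefault(gene, []).append(rank)
    mean_rank = {gene: sum(r) / len(r) for gene, r in collected.items()}
    return sorted(mean_rank.items(), key=lambda item: (item[1], item[0]))
```

Each data set is processed independently before the ranks are combined, so the per-data-set step is a natural unit of work to distribute.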



Finding Motifs Using Phylogenetic Information

The goal is to reveal conserved DNA segments in a range of species

Motif-finding tools:

trie*agrep (needs a lot of memory)

SPEXS (can be run on grid in some cases)



Sample Result



Refined Samples

[Figure panels: Random data; Real data]



Running Applications


Modifying programs

Separate main logic and resource-consuming subroutines/tools

Make those subroutines/tools run on grid

Add wrapper scripts

Putting data to SE (Storage Element)

Running grid jobs

Getting data from SE
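The separation of main logic from resource-consuming tools can be sketched in Python as below. The tool (`count_gc`) and the runner interface are hypothetical stand-ins chosen for illustration. Only a local runner is shown: a grid runner with the same signature would upload each chunk to the SE, submit one job per chunk, and download the outputs, but those submission details are not given in the slides and are left out.

```python
from typing import Callable, Iterable, List

# A "tool" is the resource-consuming step: one chunk in, one result out.
Tool = Callable[[str], int]
# A "runner" decides where the tool is executed for each chunk.
Runner = Callable[[Tool, Iterable[str]], List[int]]


def count_gc(sequence: str) -> int:
    """Stand-in for an expensive tool: count G and C bases in a sequence."""
    return sum(1 for base in sequence if base in "GC")


def run_locally(tool: Tool, chunks: Iterable[str]) -> List[int]:
    """Runner that executes the tool in the current process."""
    return [tool(chunk) for chunk in chunks]


def main_logic(sequences: List[str], runner: Runner) -> float:
    """Application-specific logic; it does not know where the tool runs."""
    # Produce data/queries.
    chunks = [sequence.upper() for sequence in sequences]
    # Delegate the heavy part to whichever runner was supplied.
    results = runner(count_gc, chunks)
    # Analyze results: overall GC fraction.
    return sum(results) / sum(len(chunk) for chunk in chunks)


gc_fraction = main_logic(["acgt", "GGCC"], run_locally)
```

Keeping the runner as a parameter means the same main logic can be tested locally first and switched to grid execution later without changes.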



Limitations


Only some programs are suitable for running on grid:

not memory-hungry

working on divisible data

Otherwise, a powerful local computer is better, or is the only choice
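"Divisible data" can be made concrete with a small Python sketch that splits a set of FASTA records into independent chunks, one per job. FASTA input and round-robin splitting are assumptions made for this example, and the function names are hypothetical.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, lines = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(lines)))
            header, lines = line[1:], []
        else:
            lines.append(line)
    if header is not None:
        records.append((header, "".join(lines)))
    return records


def split_into_chunks(records, n_chunks):
    """Distribute records round-robin; empty chunks are dropped."""
    chunks = [[] for _ in range(n_chunks)]
    for index, record in enumerate(records):
        chunks[index % n_chunks].append(record)
    return [chunk for chunk in chunks if chunk]
```

This only helps when each record can be processed on its own. An analysis that needs all sequences in memory at once, such as an all-against-all comparison over one large index, does not divide this simply.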



Problems: Users


CLI is quite difficult to use

CLI UI is too heavyweight and heterogeneous

File access is difficult

scp server:output/* .

lcg-cp --vo balticgrid lfn:/grid/balticgrid/biit/output/result01.tar.gz file://$PWD/result01.tar.gz
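One way to hide the long lcg-cp invocation from users is a thin wrapper script. The Python sketch below builds exactly the command form shown above and nothing more. It assumes that `lcg-cp` is on the PATH, that a valid grid proxy already exists, and that the local path is absolute; the function names are hypothetical.

```python
import subprocess


def lcg_cp_command(vo, lfn_path, local_path):
    """Build the lcg-cp argument list for copying a grid file to local disk.

    `local_path` is assumed to be an absolute path.
    """
    return ["lcg-cp", "--vo", vo, f"lfn:{lfn_path}", f"file://{local_path}"]


def fetch_from_grid(vo, lfn_path, local_path):
    """Run lcg-cp; raises CalledProcessError if the copy fails."""
    subprocess.run(lcg_cp_command(vo, lfn_path, local_path), check=True)


command = lcg_cp_command(
    "balticgrid",
    "/grid/balticgrid/biit/output/result01.tar.gz",
    "/tmp/result01.tar.gz",
)
```

With such a wrapper a user would call `fetch_from_grid("balticgrid", ..., ...)` instead of remembering the VO flag and the `lfn:` and `file://` prefixes.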



Problems: Developers


Reasonable API is missing (for Java and Python)

No good manuals and samples for writing grid-enabled applications



Conclusions


Basically, scripts can easily be run on grid (those without high memory requirements)


Main problems:


UI usability


File management



Thank you