Algorithms and bioinformatics


Oct 1, 2013


Völgyes Dávid
david@elte.hu

Structure of the presentation
- Computational complexity
- Classical problems: sorting and searching
- Data mining
- Bioinformatics
- Databases
- Algorithms: sequence alignment

Algorithmsandbioinformaticsp.2
Computational complexity of algorithms
- Complexity: the highest-order term (without constants) that determines the number of steps
- e.g. Time(n) ~ 1.2 n^2 + 3n + 1299 ⇒ O(n^2)
- Complexity classes:
  - O(c): problems solvable in a constant number of steps (C)
  - O(P): polynomial problems (P)
  - O(NP): polynomial problems on a non-deterministic machine (NP)
  - O(Exp): exponential problems (Exp)
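- P ≠ NP? and NP = Exp? Both are open questions. P ≠ NP is widely conjectured, and no polynomial-time algorithm is known for any NP-complete problem.
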
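As a worked check of the O(n^2) example on this slide, the lower-order terms can be bounded by multiples of n^2. The symbols c and n_0 are the usual constants from the definition of big-O and are assumptions here, not notation taken from the slides.

```latex
\begin{align*}
\mathrm{Time}(n) &= 1.2\,n^2 + 3n + 1299 \\
                 &\le 1.2\,n^2 + 3\,n^2 + 1299\,n^2 && \text{for } n \ge 1 \\
                 &= 1303.2\,n^2 ,
\end{align*}
\text{so } \mathrm{Time}(n) \le c\,n^2 \text{ with } c = 1303.2,\ n_0 = 1,
\qquad
\lim_{n \to \infty} \frac{\mathrm{Time}(n)}{n^2} = 1.2 .
```

The bound is deliberately loose; any valid pair (c, n_0) is enough to establish O(n^2).
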
Algorithmsandbioinformaticsp.3
Some examples
- Graph searches (A* in a chess game): O(Exp)
- Gauss elimination, matrix inversion: O(n^3)
- Bubble sort: O(n^2)
- Merge sort: O(n log n)
- Linear search: O(n)
- Bisection search: O(log n)
- Radix search: O(log n)
- Hash search: O(c)

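The gap between the O(n) and O(log n) entries above can be made concrete by counting comparisons. The following is a minimal sketch; the function names and the one-million-element test list are illustrative assumptions, not part of the slides.

```python
def linear_search_steps(items, target):
    """Scan left to right; return (index, comparisons), index -1 if absent."""
    steps = 0
    for index, value in enumerate(items):
        steps += 1
        if value == target:
            return index, steps
    return -1, steps


def bisection_search_steps(sorted_items, target):
    """Binary search on a sorted list; return (index, probes), index -1 if absent."""
    low, high = 0, len(sorted_items) - 1
    steps = 0
    while low <= high:
        mid = (low + high) // 2
        steps += 1
        if sorted_items[mid] == target:
            return mid, steps
        if sorted_items[mid] < target:
            low = mid + 1   # target can only be in the upper half
        else:
            high = mid - 1  # target can only be in the lower half
    return -1, steps


data = list(range(1_000_000))  # already sorted
linear_index, linear_steps = linear_search_steps(data, 999_999)
bisect_index, bisect_steps = bisection_search_steps(data, 999_999)
print("linear search comparisons:   ", linear_steps)
print("bisection search comparisons:", bisect_steps)
```

For the last element of a million-item list the linear scan needs one comparison per item, while bisection needs at most 20 probes, since 2^20 exceeds one million.
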
Algorithmsandbioinformaticsp.4
Computation capacities
- Nowadays ~10^9 to 10^11 FLOPS per PC (~10^15 FLOP per day)
- Problem sizes n solvable in 1 day on one PC:
  - O(Exp) → n ~ 20
  - O(n^4) → n ~ 5000
  - O(n^3) → n ~ 100000
  - O(n^2) → n ~ 3·10^7

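The polynomial rows above follow from solving n^k = 10^15 for n. A minimal sketch, assuming one elementary step per floating-point operation and a budget of exactly 10^15 operations (both simplifications); the function name is hypothetical.

```python
def max_problem_size(budget_ops, exponent):
    """Return the largest integer n with n**exponent <= budget_ops."""
    n = int(budget_ops ** (1.0 / exponent))  # floating-point first guess
    # Repair possible rounding error of the guess with exact integer arithmetic.
    while n ** exponent > budget_ops:
        n -= 1
    while (n + 1) ** exponent <= budget_ops:
        n += 1
    return n


DAILY_BUDGET = 10 ** 15  # operations per day, the order of magnitude quoted above

for exponent in (2, 3, 4):
    size = max_problem_size(DAILY_BUDGET, exponent)
    print(f"O(n^{exponent}): n ~ {size:,}")
```

This gives about 3.2·10^7 for O(n^2), exactly 100,000 for O(n^3) and 5,623 for O(n^4), in line with the rounded figures on the slide. The exponential case is left out because its result depends on the base of the exponent.
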
Algorithmsandbioinformaticsp.5
Requirements
- Google: 10^11 web pages, 10^14 words
- EMBL: DNA sequence: 10^11 base pairs (138 billion in September 2006)
- Social Security data (USA: 3·10^8 people)
- Hypermarkets (> 10^9 items/store/year)
- Bank sector
⇒ Classical methods don't apply.

Algorithmsandbioinformaticsp.6
Data mining
- Very large databases: rarely O(n^2), usually O(n) or O(c)
- We don't always know exactly what we are looking for:
  - patterns
  - similarities
  - frequent phenomena
  - very rare phenomena
⇒ classical methods are
  - slow and
  - inadequate

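One way to look for frequent phenomena within an O(n) budget is a single pass with a hash table. The sketch below counts k-mers (length-k substrings) of a DNA string; k-mer counting is chosen here only as an illustration and is not a method named on the slides.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every length-k substring in one left-to-right pass.

    For a fixed k this is O(n) expected time, because each hash-table
    update costs O(1) on average.
    """
    counts = Counter()
    for start in range(len(sequence) - k + 1):
        counts[sequence[start:start + k]] += 1
    return counts


counts = count_kmers("ACGTACGTAC", 3)
for kmer, frequency in counts.most_common():
    print(kmer, frequency)
```

The same pass also exposes very rare items: they are the entries whose count stays at 1.
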
Algorithmsandbioinformaticsp.7
Data mining in biology
- Gene expression patterns
- Genomics, proteomics
- Automatic diagnoses based on symptoms (cancer check, etc.)
- DNA sequence alignment, homology search
- ...

Algorithmsandbioinformaticsp.8
DNA sequence alignment
- In theory, a complete match can be achieved by inserting enough gaps
  ⇒ Gap penalty is needed
- Global and local alignment give different results

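The need for a gap penalty can be seen by scoring one pair of short sequences with and without it. This is a minimal sketch: the scoring values (match 1, mismatch 0, a fixed penalty per gap column) and the toy sequences ACGT and AGCT are assumptions made for illustration.

```python
def alignment_score(row_a, row_b, match=1, mismatch=0, gap_penalty=1):
    """Score two already-aligned rows of equal length; '-' marks a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    score = 0
    for x, y in zip(row_a, row_b):
        if x == "-" or y == "-":
            score -= gap_penalty  # every gap column costs the penalty
        elif x == y:
            score += match
        else:
            score += mismatch
    return score


# Two alignments of ACGT with AGCT.
ungapped = ("ACGT", "AGCT")    # 2 matches, 2 mismatches
gapped = ("ACG-T", "A-GCT")    # 3 matches, 2 gaps, no mismatch at all

for penalty in (0, 1):
    print("gap penalty", penalty,
          "| ungapped:", alignment_score(*ungapped, gap_penalty=penalty),
          "| gapped:", alignment_score(*gapped, gap_penalty=penalty))
```

With a penalty of 0 the gapped alignment scores higher (3 against 2) although it only moves residues apart; with a penalty of 1 it drops to 1 and the ungapped alignment is preferred.
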
Algorithmsandbioinformaticsp.9
Dot plot 1: Same sequences
Dot plot 2: Very similar sequences
Dot plot 3: Related sequences

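A dot plot like the ones titled above marks every position pair where the two sequences carry the same residue. The sketch below is a minimal text version without any window filtering; the function name and the marker characters are illustrative assumptions.

```python
def dot_plot(seq_a, seq_b, match_char="*", empty_char="."):
    """Return the plot as a list of strings: one row per residue of seq_a,
    one column per residue of seq_b, marked where the residues are equal."""
    return [
        "".join(match_char if x == y else empty_char for y in seq_b)
        for x in seq_a
    ]


for row in dot_plot("ADLG", "ADLG"):
    print(row)
```

Identical sequences give an unbroken main diagonal; insertions and deletions shift the diagonal, and substitutions leave holes in it.
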
Algorithmsandbioinformaticsp.12
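Needleman-Wunsch algorithm
- $a_i$ and $b_j$: the two amino acid sequences
- Initialization: $H_{i,j} = s(a_i, b_j)$, where $s$ is the similarity function (in the simplest case 0 or 1)
- Update, with $W_k$ the penalty for a gap of length $k$:
  $H_{i,j} \leftarrow H_{i,j} + \max\{\, H_{i+1,j+1},\ \max_{k \ge 1}(H_{i+k+1,j+1} - W_k),\ \max_{k \ge 1}(H_{i+1,j+k+1} - W_k) \,\}$
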
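The update rule above can be sketched directly in code. This is a minimal, unoptimized version under stated assumptions: the matrix is filled from the bottom-right corner (so that every term on the right-hand side is already final), the default similarity is 1 for identical residues and 0 otherwise, and the default gap penalty W_k is 0. The function and parameter names are hypothetical.

```python
def identity_similarity(x, y):
    """Simplest similarity function s: 1 for identical residues, else 0."""
    return 1 if x == y else 0


def zero_gap_penalty(k):
    """W_k = 0 for every gap length k (assumed default)."""
    return 0


def needleman_wunsch_matrix(a, b, similarity=identity_similarity,
                            gap_penalty=zero_gap_penalty):
    """Return the score matrix H (rows follow a, columns follow b)."""
    n, m = len(a), len(b)
    # Initialization: H[i][j] = s(a_i, b_j).
    H = [[similarity(a[i], b[j]) for j in range(m)] for i in range(n)]
    # The last row and last column keep their initial values.
    for i in range(n - 2, -1, -1):
        for j in range(m - 2, -1, -1):
            best = H[i + 1][j + 1]                    # no gap
            for k in range(1, n - i - 1):             # gap of length k, rows skipped
                best = max(best, H[i + k + 1][j + 1] - gap_penalty(k))
            for k in range(1, m - j - 1):             # gap of length k, columns skipped
                best = max(best, H[i + 1][j + k + 1] - gap_penalty(k))
            H[i][j] += best
    return H


rows = "ADLGRTQNCDRYYQ"
cols = "ADLGAVFALCDRYFQ"
H = needleman_wunsch_matrix(rows, cols)
print(H[0][0])  # 9 with the default 0/1 similarity and zero gap penalty
```

The triple loop costs O(n·m·(n+m)) steps, which is acceptable for two short sequences but not for database-scale searches. Passing a different `gap_penalty` function, for example one that returns `k`, changes the scores without touching the fill order.
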
Algorithmsandbioinformaticsp.13
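Needleman-Wunsch example 1.
  A D L G A V F A L C D R Y F Q
A 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0
D 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
L 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0
G 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
R 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
T 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Q 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
N 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
C 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0
D 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
R 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
Y 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
Y 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
Q 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
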
Algorithmsandbioinformaticsp.14
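Needleman-Wunsch example 2.
  A D L G A V F A L C D R Y F Q
A 1 0 0 0 1 0 0 1 0 4 3 2 1 1 0
D 0 1 0 0 0 0 0 0 0 4 4 2 1 1 0
L 0 0 1 0 0 0 0 0 1 4 3 2 1 1 0
G 0 0 0 1 0 0 0 0 5 4 3 2 1 1 0
R 0 0 0 0 0 0 0 0 5 4 3 3 1 1 0
T 0 0 0 0 0 0 0 0 5 4 3 2 1 1 0
Q 0 0 0 0 0 0 0 0 5 4 3 2 1 1 1
N 0 0 0 0 0 0 0 0 5 4 3 2 1 1 0
C 0 0 0 0 0 0 0 0 4 5 3 2 1 1 0
D 0 1 0 0 0 0 0 0 3 3 4 2 1 1 0
R 0 0 0 0 0 0 0 0 2 2 2 3 1 1 0
Y 0 0 0 0 0 0 0 0 2 2 2 2 2 1 0
Y 0 0 0 0 0 0 0 0 1 1 1 1 2 1 0
Q 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
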
Algorithmsandbioinformaticsp.15
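Needleman-Wunsch example 3.
(Bracketed cells are the ones set apart on the slide; they trace the alignment path.)
   A  D  L  G  A  V  F  A  L  C  D  R  Y  F  Q
A [9] 7  6  6  7  6  6  7  5  4  3  2  1  1  0
D  7 [8] 6  6  6  6  6  6  5  4  4  2  1  1  0
L  6  6 [7] 5  5  5  5  5  6  4  3  2  1  1  0
G  5  5  5 [6] 5  5  5  5  5  4  3  2  1  1  0
R  5  5  5  5 [5] 5  5  5  5  4  3  3  1  1  0
T  5  5  5  5  5 [5] 5  5  5  4  3  2  1  1  0
Q  5  5  5  5  5  5 [5] 5  5  4  3  2  1  1  1
N  5  5  5  5  5  5  5 [5] 5  4  3  2  1  1  0
C  4  4  4  4  4  4  4  4  4 [5] 3  2  1  1  0
D  3  4  3  3  3  3  3  3  3  3 [4] 2  1  1  0
R  2  2  2  2  2  2  2  2  2  2  2 [3] 1  1  0
Y  2  2  2  2  2  2  2  2  2  2  2  2 [2] 1  0
Y  1  1  1  1  1  1  1  1  1  1  1  1  2 [1] 0
Q  0  0  0  0  0  0  0  0  0  0  0  0  0  0 [1]
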
Algorithmsandbioinformaticsp.16
Modifications
- changing the gap penalty
- using an advanced similarity matrix
- Smith-Waterman: local search
Global alignment:
TTGACACCCTCC-CAATTGTA
    ::  ::   ::  :
ACCCCAGGCTTTACACAT---
Local alignment:
---------TTGACACCCTCCCAATTGTA        TTGACAC
         :: ::::              i.e.   :: ::::
ACCCCAGGCTTTACACAT-----------        TTTACAC

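The local search mentioned above can be sketched with the usual Smith-Waterman dynamic programme, in which scores are never allowed to fall below zero. The scoring values used here (match +2, mismatch -1, linear gap cost -1 per position) are assumptions chosen for illustration; the slides do not specify them.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return (score, aligned_a, aligned_b) for one best local alignment."""
    n, m = len(a), len(b)
    # H[i][j]: best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diagonal = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diagonal, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diagonal = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diagonal:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, top, bottom = smith_waterman("TTGACACCCTCCCAATTGTA", "ACCCCAGGCTTTACACAT")
print(score)
print(top)
print(bottom)
```

Under these assumed scores the local match shown on the slide, TTGACAC against TTTACAC, is worth 11 (six matches, one mismatch), so the reported optimum is at least that. Other scoring choices can select a different region.
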
Algorithmsandbioinformaticsp.17
Similarity matrix
Cysteine   C  12
Special    S   0  2
           T  -2  1  3
           P  -3  1  0  6
           A  -2  1  1  1  2
           G  -3  1  0 -1  1  5
Asx, Glx   N  -4  1  0 -1  0  0  2
           D  -5  0  0 -1  0  1  2  4
           E  -5  0  0 -1  0  0  1  3  4
           Q  -5 -1 -1  0  0 -1  1  2  2  4
Basic      H  -3 -1 -1  0 -1 -2  2  1  1  3  6
           R  -4  0  1  0 -2 -3  0 -1 -1  1  2  6
           K  -5  0  0 -1 -1  2  1  0  0  1  0  3  5
Aliphatic  M  -5 -2 -1 -2 -1 -3 -2 -3 -2 -1 -2  0  0  6
           I  -2 -1  0 -2 -1 -3 -2 -2 -2 -2 -2 -2 -2  2  5
           L  -6 -3 -2 -3 -2 -4 -3 -4 -3 -2 -2 -3 -3  4  2  6
           V  -2 -1  0 -1  0 -1 -2 -2 -2 -2 -2 -2 -2  2  4  2  4
Aromatic   F  -4 -3 -3 -5 -4 -5 -4 -6 -5 -5 -2 -4 -5  0  1  2 -1  9
           Y   0 -3 -3 -5 -3 -5 -2 -4 -4 -4  0 -4 -4 -2 -1 -1 -2  7 10
           W  -8 -2 -5 -6 -6 -7 -4 -7 -7 -5 -3  2 -3 -4 -5 -2 -6  0  0 17
               C  S  T  P  A  G  N  D  E  Q  H  R  K  M  I  L  V  F  Y  W

Algorithmsandbioinformaticsp.18
Problems
- They are too slow for searching
  ⇒ Ideas for speeding them up:
  - heuristic methods
  - parallel methods (BLASTN)
- They can handle only 2 sequences simultaneously
  ⇒ Multiple sequence alignment
⇒ FASTA, BLAST, Gapped BLAST, PSI-BLAST, ...

Algorithmsandbioinformaticsp.19
Applications
- Searching databases: genetic diseases
- Coding region prediction
- Exon-intron prediction
- Phylogenetic analysis
- Structure prediction
- Functional area prediction
  ⇒ Genetically conserved parts may be functionally important

Algorithmsandbioinformaticsp.20
Phylogenetic tree

Algorithmsandbioinformaticsp.21
Research directions
Algorithms
- faster
- fewer errors, fewer heuristics
Databases
- more data (more base pairs)
- storage of supplementary information (functionally important places, ...)
New applications

Algorithmsandbioinformaticsp.22
Thank you for your attention!
Algorithmsandbioinformaticsp.23