Advanced Computational Biology Project Presentation


Oct 2, 2013


Team Members:

Joshua Wu 11174269

Shuyu (Christine) Xu 11161640

OVERVIEW

Project Description
Introduction
Motivation
Bioinformatics Application
Explicit vs Implicit
Problem Analysis
Implement Files
Experimental Results
Conclusion
Possible Future Work

Project Description

Explicit Suffix Trees


Suppose that we want to explicitly store all strings that are edge labels of a suffix tree.

The main question of this project is how much space explicit suffix trees require compared to implicit suffix trees.

We implement a suffix tree algorithm and run it on substrings of real data.

Introduction


Any string of length m can be decomposed into m suffixes, and these suffixes can be stored in a suffix tree.

Setup time: O(m) (m is the length of the string)

Search time: O(n) (n is the length of the pattern)
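As a rough illustration of the points above, the sketch below lists the m suffixes of a string and stores them in an uncompressed suffix trie, so that a pattern of length n is found by following n characters. The names `suffixes`, `build_suffix_trie`, and `contains` are hypothetical, and this naive construction takes O(m^2) time and space; the O(m) setup time quoted above needs a linear-time construction such as Ukkonen's algorithm, which is not shown here.

```python
def suffixes(text):
    """Return the m suffixes of a string of length m, longest first."""
    return [text[i:] for i in range(len(text))]


def build_suffix_trie(text):
    """Insert every suffix into a nested-dict trie (naive, O(m^2))."""
    root = {}
    for suffix in suffixes(text):
        node = root
        for ch in suffix:
            # Reuse the child for this character, or create it.
            node = node.setdefault(ch, {})
    return root


def contains(trie, pattern):
    """Follow one child per pattern character: O(n) for a pattern of length n."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("ABCABC$")
print(suffixes("ABC$"))        # ['ABC$', 'BC$', 'C$', '$']
print(contains(trie, "CAB"))   # True
print(contains(trie, "CBA"))   # False
```

A real suffix tree merges chains of single-child nodes into one edge, which is what makes the edge labels discussed in the rest of the slides worth storing compactly.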

Motivation

"Suffix trees are widely used in the computer field... Recent improvements in the method have cut the memory requirement to 17 bytes per letter, which brings the method to the verge of practicality [for bioinformatics applications]"
-- Nat Goodman (Genome Technology).

Bioinformatics Application

1. Multiple genome alignment (Michael Hohl et al., 2002)

2. Selection of signature oligonucleotides for DNA arrays (Kaderali and Schliep, 2002)

3. Identification of sequence repeats (Kurtz and Schleiermacher, 1999)


Explicit vs Implicit


String:    A B C $
Positions: 1 2 3 4

Explicit edge labels: ABC$  BC$  C$  $

Implicit edge labels (start,end): 1,4  2,4  3,4  4,4
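The relationship between the two labelings can be sketched in a few lines: an implicit label is a pair of positions into the string, and the explicit label is the substring those positions denote. The helper name `to_explicit` is hypothetical, and the cost model (one unit per stored character, one unit per stored number) is an assumption made here for illustration.

```python
text = "ABC$"

# Implicit edge labels from the slide: 1-based, inclusive (start, end) pairs.
implicit_labels = [(1, 4), (2, 4), (3, 4), (4, 4)]


def to_explicit(text, label):
    """Turn a 1-based inclusive (start, end) pair into the substring it denotes."""
    start, end = label
    return text[start - 1:end]


explicit_labels = [to_explicit(text, lab) for lab in implicit_labels]
print(explicit_labels)  # ['ABC$', 'BC$', 'C$', '$']

# Assumed cost model: one unit per stored character or per stored number.
explicit_cost = sum(len(s) for s in explicit_labels)
implicit_cost = 2 * len(implicit_labels)
print(explicit_cost, implicit_cost)  # 10 8
```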

Problem Analysis


Best case for explicit and implicit suffix trees: all characters different

The best case is unlikely with DNA inputs: the alphabet has only 4 characters

Worst case: the same character throughout

Assumptions


In implicit trees, each number is counted as one unit of memory (one "bit" on these slides), whatever its value: the number 10 takes up 1 unit.

Only alphabetic characters appear in the sequence




Example: all different characters (best case)

String:    A B C D $
Positions: 1 2 3 4 5

Implicit edge labels: 1,5  2,5  3,5  4,5  5,5

N: string length

N = 5

Memory = 10
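The best-case figure can be written in closed form. When all N characters (including the terminator) are different, the tree has exactly N edges, one leaf per suffix. Assuming one unit per number for implicit labels and, as an additional assumption for the explicit side, one unit per stored character, the totals are:

```latex
\[
M_{\text{implicit}}(N) = 2N, \qquad
M_{\text{explicit}}(N) = \sum_{k=1}^{N} k = \frac{N(N+1)}{2}
\]
\[
N = 5:\quad M_{\text{implicit}} = 10, \qquad M_{\text{explicit}} = 15
\]
```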

Example

String:    A B C A B C $
Positions: 1 2 3 4 5 6 7

Implicit edge labels:
from the root: 1,3  2,3  6,6  7,7
below 1,3: 4,7  7,7
below 2,3: 4,7  7,7
below 6,6: 4,7  7,7

N: string length

N = 7

Memory = 20

Example: all same character (worst case)

String:    A A A A $
Positions: 1 2 3 4 5

Implicit edge labels, level by level:
1,1  5,5
2,2  5,5
3,3  5,5
4,5  5,5

N: string length

N = 5, 6, 7

Memory = 16, 20, 24

Memory = 4N - 4
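The memory figures in the three examples above can be checked with a small program. The following is a minimal sketch, not the project's own implementation: it builds a compressed suffix tree by naive O(N^2) insertion and counts two numbers per edge for the implicit form. The explicit cost of one unit per character, and the names `build_suffix_tree`, `edges`, and `memory`, are assumptions made for illustration. The input must end with a terminator that occurs nowhere else.

```python
def build_suffix_tree(text):
    """Naive O(N^2) compressed suffix tree.

    Each node is a dict mapping a first character to an edge
    [start, end, child], where text[start:end] is the edge label
    (0-based, end exclusive) and child is the node below the edge.
    """
    if not text or text.count(text[-1]) != 1:
        raise ValueError("text must end with a unique terminator such as $")
    root = {}
    n = len(text)
    for i in range(n):                      # insert suffix text[i:]
        node, pos = root, i
        while pos < n:
            edge = node.get(text[pos])
            if edge is None:                # no matching edge: add a new leaf
                node[text[pos]] = [pos, n, {}]
                break
            start, end, child = edge
            k = 0                           # length of the match along this edge
            while start + k < end and pos + k < n and text[start + k] == text[pos + k]:
                k += 1
            if start + k == end:            # whole edge matched: descend
                node, pos = child, pos + k
                continue
            # Mismatch inside the edge: split it and hang a new leaf.
            middle = {text[start + k]: [start + k, end, child]}
            middle[text[pos + k]] = [pos + k, n, {}]
            node[text[pos]] = [start, start + k, middle]
            break
    return root


def edges(node):
    """Yield the (start, end) pair of every edge in the tree."""
    for start, end, child in node.values():
        yield start, end
        yield from edges(child)


def memory(text):
    """Return (implicit, explicit): 2 numbers per edge vs. 1 unit per character."""
    all_edges = list(edges(build_suffix_tree(text)))
    implicit = 2 * len(all_edges)
    explicit = sum(end - start for start, end in all_edges)
    return implicit, explicit


for s in ["ABCD$", "ABCABC$", "AAAA$", "AAAAA$", "AAAAAA$"]:
    print(s, memory(s))
# ABCD$ (10, 15)
# ABCABC$ (20, 22)
# AAAA$ (16, 9)
# AAAAA$ (20, 11)
# AAAAAA$ (24, 13)
```

The first number in each pair reproduces the slide values 10, 20, and 16, 20, 24 (that is, 4N - 4). The second number depends on the label lengths, so under this cost model it grows quickly on long sequences whose leaf labels run to the end of the string.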

Program Input Data

DNA from many kinds of organisms:

Homo Sapiens, monkeys, chickens, …



Implement Files

Sample input: Homo Sapiens


cagctcctgagactgctggcatgaaggggagccgtgcc
ctcctgctggtggccctcaccctgttctgcatctgccggatg
gccacaggggaggacaacgatgagtttttcatggacttcc
tgcaaacactactggtggggaccccagaggagctctatg
aggggaccttgggcaagtacaatgtcaacgaagatgcc
aaggcagcaatgactgaactcaagtcctgcagagatgg
cctgcagccaatgcacaaggcggagctggtcaagctgc
tggtgcaagtgctgggcagtcaggacggtgcctaagtgg
acctcagacatggctcagccataggacctgccacacaa
gcagccgtggacacaacgcccactaccacctcccacat
ggaaatgtatcctcaaaccgtttaatcaataa

Sample result

Sample input 2: plants


EARPIVVGPPPPLSGGLPGTENSDQA
RDGTLPYTKDRFYLQPLPPTEAAQRA
KVSASEILNVKQFIDRKAWPSLQNDLR
LRASYLRYDLKTVISAKPKDEKKSLQEL
TSKLFSSIDNLDHAAKIKSPTEAEKYYG
QTVSNINEVLAKLG

Sample output:


Experimental Results

Homo Sapiens


Sample Input: Homo Sapiens


atgaaggggagccgtgccctcctgctggtggccctca
ccctgttctgcatctgccggatggccacaggggagga
caacgatgagtttttcatggacttcctgcaaacactact
ggtggggaccccagaggagctctatgaggggacctt
gggcaagtacaatgtcaacgaagatgccaaggcag
caatgactgaactcaagtcctgcagagatggcctgc
agccaatgcacaaggcggagctggtcaagctgctg
gtgcaagtgctgggcagtcaggacggtgcctaa

Comparisons: Homo Sapiens


Monkey Virus

Sample Input: Monkey Virus


GGSCFKCGKKGHFAKNCHEHAHNNA
EPKVPGLCPRCKRGKHWANECKSKT
DNQGNPIPPH

Monkey Virus

Plants

Sample Input: Plants


EARPIVVGPPPPLSGGLPGTENSDQA
RDGTLPYTKDRFYLQPLPPTEAAQRA
KVSASEILNVKQFIDRKAWPSLQNDLR
LRASYLRYDLKTVISAKPKDEKKSLQEL
TSKLFSSIDNLDHAAKIKSPTEAEKYYG
QTVSNINEVLAKLG

Plants

Tobacco


Sample input: tobacco


SYSITTPSQFVFLSSAWADPIELINLCT
NALGNQFQTQQARTVVQRQFSEVWK
PSPQVTVRFPDSDFKVYRYNAVLDPLV
TALLGAFDTRNRIIEVENQANPTTAETL
DATRRVDDATVAIRSAINNLIVELIRGTG
SYNRSSFESSSGLVWTSGPAT

Tobacco

Insects



Sample Input: Insects


DCLSGRYKGPCAVWDNETCRRVCKE
EGRSSGHCSPSLKCWCEGC

Insects

Birds

Sample Input: Birds


IDTCRLPSDRGRCKASFERWYFNGRT
CAKFIYGGCGGNGNKFPTQEACMKRC
AKA

Birds

SARS


Sample Input: SARS


ALNTLVKQLSSNFGAISSVLNDILSRLD
KVEAEV

SARS

Fish


Sample Input: Fish


GHHHHHHLEDPSGGTPYIGSKISLISK
AEIRYEGILYTIDTENSTVALAKVRSFGT
EDRPTDRPIAPRDETFEYIIFRGSDIKDL
TVCEPPKPIM

Fish

Chicken


Sample Input: Chicken


RVKRVWPLVIRTVIAGYNLYRAIKKK

Chicken

files


Code



Results



Conclusion


Explicit suffix trees require more space than implicit suffix trees on real data.

Data comparison: the worst case is DNA input (least variety of characters)

results

Implicit trees should be used when storage needs to be kept small



Figure: variety of string vs tree size (x-axis: # of alphabets, 1 to 25; y-axis: 0 to 3000)

Conclusion


Application:


It is easier to compare structures for implicit than for explicit suffix trees (number comparisons)


Save space


Easy to implement


Further improvement?

OVERVIEW




Project Description


Introduction


Motivation


Bioinformatics Application


Explicit vs Implicit


Problem Analysis


Implement Files


Experimental Results


Conclusion


Possible Future Work

Now we are here

Possible Future Work


Program speed is too slow



The interface of our program should be improved (Matlab).



More variety of input





References

Real Data

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=74273665
http://www.rcsb.org/pdb
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=search&db=nucleotide

Online info

http://en.wikipedia.org/wiki/Suffix_tree
http://marknelson.us/1996/08/01/suffix-trees/
http://homepage.usask.ca/~ctl271/857/suffix_tree.shtml
http://www.cs.uku.fi/~kilpelai/BSA05/lectures/print07.pdf



THANK YOU!