Faik Bioinformatics PowerPoint 1-2006

2 Oct 2013



Bioinformatics: Guide to bio-computing and the Internet

Copyright© Kerstin Wagner


Introduction: What is bioinformatics?

Bioinformatics can be defined as the body of tools and algorithms needed to handle large and complex biological information.

Bioinformatics is a new scientific discipline created from the interaction of biology and computer science.

The NCBI defines bioinformatics as:
"Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline"



Genomics era: High-throughput DNA sequencing

The first high-throughput genomics technology was automated DNA sequencing in the early 1990s.

In September 1999, Celera Genomics completed the sequencing of the Drosophila genome.

Baker's yeast, Saccharomyces cerevisiae (15 million bp), was the first eukaryotic genome to be sequenced.

The 3-billion-bp human genome sequence was generated in a competition between the publicly funded Human Genome Project and Celera.


High-throughput DNA sequencing

Top image: confocal detection by the MegaBACE sequencer of fluorescently labeled DNA
Bottom image: computer image of sequence read by automated sequencer


Genomics: Completed genomes as of 2002

Currently the genomes of over 600 organisms have been sequenced:

Organism                               Base pairs      Whole-genome shotgun   Map-based
54 Bacteria                            0.8-6 million   +
Yeast                                  15 million                             +
C. elegans (roundworm)                 100 million                            +
Drosophila (fruit fly)                 120 million     +
Arabidopsis (thale cress)              130 million                            +
Rice                                   435 million                            +
Human                                  3 billion       +                      +
Fugu (puffer fish)                     365 million     +
Anopheles (malaria-carrying mosquito)  278 million     +

This generates large amounts of information to be handled by individual computers.

The trend of data growth

The 21st century is a century of biotechnology:

Microarray: Global expression analysis: RNA levels of every gene in the genome analyzed in parallel.

Proteomics: Global protein analysis generates large mass spectra libraries.

Metabolomics: Global metabolite analysis: 25,000 secondary metabolites characterized.

Genomics: New sequence information is being produced at increasing rates. (The contents of GenBank double every year)

Glycomics: Global sugar metabolism analysis

How to handle the large amount of information?

Drew Sheneman, New Jersey -- The Newark Star Ledger

Answer: bioinformatics and the Internet

Bioinformatics history

IBM 7090 computer

In the 1960s: the birth of bioinformatics

Margaret Oakley Dayhoff created:
- The first protein database
- The first program for sequence assembly

There is a need for computers and algorithms that allow:
access, processing, storing, sharing, retrieving, visualizing, annotating…


Why do we need the Internet?

"Omics" projects and the information associated with them involve a huge amount of data that is stored on computers all over the world.

Because it is impossible to maintain up-to-date copies of all relevant databases within the lab, access to the data is via the Internet.

Figure labels: "You are here"; "Database storage"

The Commercial Market

The current bioinformatics market is worth $300 million / year (half software).

Prediction: $2 billion / year in 5-6 years

~50 bioinformatics companies:
Genomatrix Software, Genaissance Pharmaceuticals, Lynx, Lexicon Genetics, DeCode Genetics, CuraGen, AlphaGene, Bionavigation, Pangene, InforMax, TimeLogic, GeneCodes, LabOnWeb.com, Darwin, Celera, Incyte, BioResearch Online, BioTools, Oxford Molecular, Genomica, NetGenics, Rosetta, Lion BioScience, DoubleTwist, eBioinformatics, Prospect Genomics, Neomorphic, Molecular Mining, GeneLogic, GeneFormatics, Molecular Simulations, Bioinformatics Solutions…

Scope of this lab

The lab will touch on the following computational tasks:
- Similarity search
- Sequence comparison: alignment, multiple alignment, retrieval
- Sequence analysis: signal peptide, transmembrane domain,…
- Protein folding: secondary structure from sequence
- Sequence evolution: phylogenetic trees

It will also make you familiar with the bioinformatics resources available on the web to do these tasks.

You have just cloned a gene

Evolutionary relationship?
- Phylogenetic tree

Is it already in databases?
- Accession #?
- Annotation?

Protein characteristics?
- Sub-localization
- Soluble?
- 3D fold

Are there similar sequences?
- % identity?
- Family member?

Are there conserved regions?
- Alignments?
- Domains?

Other information?
- Expression profile?
- Mutants?

A critical failure of current bioinformatics is the lack of a single software package that can perform all of these functions.

Applying algorithms to analyze genomics data

DNA (nucleotide sequences) databases

They are big databases, and searching any one of them should produce similar results because they exchange information routinely.
- GenBank (NCBI): http://www.ncbi.nlm.nih.gov
- DDBJ (DNA DataBase of Japan): http://www.ddbj.nig.ac.jp
- TIGR: http://tigr.org/tdb/tgi
- Yeast: http://yeastgenome.org
- E. coli: http://colibase.bham.ac.uk/blast/

Specialized databases: tissues, species…
- ESTs (Expressed Sequence Tags)
  ~ at NCBI: http://www.ncbi.nlm.nih.gov/dbEST
  ~ at TIGR: http://tigr.org/tdb/tgi
- ...many more!



Protein (amino acid) databases

They are big databases too:
- Swiss-Prot (very high level of annotation): http://au.expasy.org/
- PIR (Protein Information Resource), the world's most comprehensive catalog of information on proteins: http://www.pir.uniprot.org/

Translated databases:
- TREMBL (translated EMBL): includes entries that have not yet been annotated into Swiss-Prot. http://www.ebi.ac.uk/trembl/access.html
- GenPept (translation of coding regions in GenBank)
- pdb (sequences derived from the 3D structures in the Brookhaven PDB): http://www.rcsb.org/pdb/

Database homology searching

Uses algorithms to efficiently provide a mathematical basis for searches that can be translated into statistical significance.

Assumes that sequence, structure, and function are inter-related.

All similarity searching methods rely on the concepts of alignment and distance between sequences.

A similarity score is calculated from a distance: the number of DNA bases or amino acids that are different between two sequences.
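As a minimal sketch of the distance idea above, assume two sequences of equal length that are already aligned without gaps. The distance is then a position-by-position count of differences, and it can be turned into a simple similarity score (percent identity). The function names are illustrative only and do not come from any particular package.

```python
def hamming_distance(seq_a: str, seq_b: str) -> int:
    """Count the positions at which two equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Turn the distance into a similarity score: percent of identical positions."""
    distance = hamming_distance(seq_a, seq_b)
    return 100.0 * (len(seq_a) - distance) / len(seq_a)


print(hamming_distance("GATTACA", "GACTATA"))
print(round(percent_identity("GATTACA", "GACTATA"), 1))
```

Real search tools also handle gaps and unequal lengths, which this sketch deliberately leaves out.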


Calculating alignment scores

Scoring system: Uses scoring matrices that allow biologists to quantify the quality of sequence alignments.

The raw score S is calculated by summing the scores for each aligned position and the scores for gaps. Gap creation/extension scores are inherent to the scoring system in use (BLAST, FASTA…).

The score for an identity or a mismatch is given by the specified substitution matrix (e.g., BLOSUM62).
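The summation described above can be sketched as follows. This is an illustration under stated assumptions, not the scoring used by BLAST or FASTA: a flat match/mismatch score stands in for a real substitution matrix such as BLOSUM62, and the gap creation and extension values are hypothetical.

```python
def raw_score(aligned_a: str, aligned_b: str,
              match: int = 2, mismatch: int = -1,
              gap_open: int = -5, gap_extend: int = -1) -> int:
    """Sum per-position scores and gap scores for two aligned strings.

    '-' marks a gap. The first position of a gap run costs gap_open and
    each further position costs gap_extend (all values are assumptions).
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            # A real implementation would look (a, b) up in a substitution matrix.
            score += match if a == b else mismatch
            in_gap = False
    return score


print(raw_score("GATTACA", "GA--ACA"))
```

In this example five identities contribute 10, and the two-position gap contributes -5 and -1, so the raw score is 4.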


Devising a scoring system

How the matrices were created:

Very similar sequences were aligned.

From these alignments, the frequency of substitution between each pair of amino acids was calculated and then the PAM1 matrix was built.

The full series of PAM matrices can be calculated by multiplying the PAM1 mutation probability matrix by itself; the results are then normalized to log-odds format.

Some popular scoring matrices are:

PAM (Percent Accepted Mutation): for evolutionary studies. For example, PAM1 corresponds to 1 accepted point mutation per 100 amino acids.

BLOSUM (BLOcks amino acid SUbstitution Matrix): for finding common motifs. For example, in BLOSUM62 the alignment is created using sequences sharing no more than 62% identity.
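To illustrate how a PAM series can be derived by repeated multiplication of PAM1, here is a small sketch. It uses a made-up 3-letter alphabet and a hypothetical PAM1-like mutation probability matrix (each row sums to 1); the real PAM1 is a 20 x 20 matrix estimated from aligned protein families.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_power(matrix, exponent):
    """Multiply a matrix by itself `exponent` times (PAM-n = PAM1 ** n)."""
    n = len(matrix)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(exponent):
        result = mat_mul(result, matrix)
    return result


# Hypothetical PAM1-like matrix: 99% chance a residue is unchanged.
toy_pam1 = [
    [0.990, 0.005, 0.005],
    [0.005, 0.990, 0.005],
    [0.005, 0.005, 0.990],
]

toy_pam250 = mat_power(toy_pam1, 250)
for row in toy_pam250:
    print([round(value, 3) for value in row])
```

The diagonal shrinks as the exponent grows, which mirrors the idea that higher PAM numbers model more divergent sequences.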


Devising a scoring system

Importance:

Scoring matrices appear in all analyses involving sequence comparison.

The choice of matrix can strongly influence the outcome of the analysis.

Understanding the theories underlying a given scoring matrix can aid in making the proper choice:
- Some matrices reflect similarity: good for database searching
- Some reflect distance: good for phylogenies




Log-odds matrices, a normalisation method for matrix values:

S_ij = log( q_ij / (p_i p_j) )

S_ij is the log of the ratio between the probability that two residues, i and j, are aligned by evolutionary descent and the probability that they are aligned by chance.
q_ij are the frequencies that i and j are observed to align in sequences known to be related. p_i and p_j are their frequencies of occurrence in the set of sequences.
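A log-odds entry can be computed directly from the quantities defined above. The sketch below assumes the usual form S = log(q_ij / (p_i p_j)) with base-2 logarithms; the frequencies in the example are invented for illustration and are not taken from any published matrix.

```python
import math


def log_odds(q_ij: float, p_i: float, p_j: float, base: float = 2.0) -> float:
    """Log of observed pair frequency over the frequency expected by chance."""
    return math.log(q_ij / (p_i * p_j), base)


# Hypothetical frequencies: the pair is seen twice as often as chance predicts.
print(round(log_odds(q_ij=0.02, p_i=0.1, p_j=0.1), 3))
# Hypothetical frequencies: the pair is seen half as often as chance predicts.
print(round(log_odds(q_ij=0.005, p_i=0.1, p_j=0.1), 3))
```

Positive values mark substitutions seen more often than expected by chance, and negative values mark substitutions seen less often.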



Database search methods: Sequence Alignment

Two broad classes of sequence alignments exist:

Global alignment: not sensitive
QKESGPSSSYC
VQQESGLVRTTC

Local alignment: faster
ESG
ESG


The most widely used local similarity algorithms are:
- Smith-Waterman (http://www.ebi.ac.uk/MPsrch/)
- Basic Local Alignment Search Tool (BLAST, http://www.ncbi.nih.gov)
- Fast Alignment (FASTA, http://fasta.genome.jp; http://www.ebi.ac.uk/fasta33/; http://www.arabidopsis.org/cgi-bin/fasta/nph-TAIRfasta.pl)
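To make the idea of local alignment concrete, here is a compact sketch of the Smith-Waterman recurrence with a linear gap penalty. The scoring values (match +2, mismatch -3, gap -4) are assumptions chosen for illustration; production tools use substitution matrices and affine gap penalties.

```python
def smith_waterman(seq_a, seq_b, match=2, mismatch=-3, gap=-4):
    """Return (score, aligned_a, aligned_b) for the best local alignment."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    scores = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Fill the matrix; a cell never drops below zero (local alignment).
    for i in range(1, rows):
        for j in range(1, cols):
            sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
            scores[i][j] = max(0,
                               scores[i - 1][j - 1] + sub,
                               scores[i - 1][j] + gap,
                               scores[i][j - 1] + gap)
            if scores[i][j] > best:
                best, best_pos = scores[i][j], (i, j)

    # Trace back from the best cell until a zero is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and scores[i][j] > 0:
        sub = match if seq_a[i - 1] == seq_b[j - 1] else mismatch
        if scores[i][j] == scores[i - 1][j - 1] + sub:
            out_a.append(seq_a[i - 1])
            out_b.append(seq_b[j - 1])
            i, j = i - 1, j - 1
        elif scores[i][j] == scores[i - 1][j] + gap:
            out_a.append(seq_a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("QKESGPSSSYC", "VQQESGLVRTTC"))
```

With these assumed scores, the best local alignment between the two example sequences is the shared ESG segment. The full matrix is filled for every pair of positions, which is why the method is slow on large databases.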

Which algorithm to use for database similarity search?

Speed:
BLAST > FASTA > Smith-Waterman (it is VERY SLOW and uses a LOT OF COMPUTER POWER)

Sensitivity/statistics:
- FASTA is more sensitive, misses fewer homologues
- Smith-Waterman is even more sensitive
- BLAST calculates probabilities
- FASTA is more accurate for DNA-DNA searches than BLAST



k-tuple methods provide optimal alignments

These methods are faster and excellent in comparing sequences.

BLAST and FASTA programs are based on k-tuple algorithms:
1- Using the query sequence, derive a list of words of length w (e.g., 3)
2- Keep high-scoring words using a scoring matrix (e.g., BLOSUM62)
3- High-scoring words are compared with database sequences
4- Sequences with many matches to high-scoring words are used for final alignments
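The four steps above can be sketched in simplified form. This is not the BLAST or FASTA implementation: step 2 is reduced to exact word matches (no scoring matrix or neighbourhood words), and the database entries and their names are hypothetical.

```python
def words(seq: str, w: int = 3) -> set:
    """Step 1: derive the set of words of length w from a sequence."""
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}


def rank_by_word_hits(query: str, database: dict, w: int = 3) -> list:
    """Steps 3-4: count query words in each database sequence and rank them."""
    query_words = words(query, w)
    hits = {}
    for name, seq in database.items():
        hits[name] = sum(1 for i in range(len(seq) - w + 1)
                         if seq[i:i + w] in query_words)
    # Sequences with the most word matches would go on to final alignment.
    return sorted(hits.items(), key=lambda item: item[1], reverse=True)


# Hypothetical mini-database for illustration.
toy_database = {
    "seq1": "VQQESGLVRTTC",
    "seq2": "MKESGPSAAYC",
    "seq3": "AAAAAAAA",
}

print(rank_by_word_hits("QKESGPSSSYC", toy_database))
```

Only the top-ranked candidates need a full alignment, which is where the speed advantage over an exhaustive Smith-Waterman search comes from.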



The dilemma: DNA or protein?

Is the comparison of two nucleotide sequences accurate?

By translating into amino acid sequence, are we losing information?

The genetic code is degenerate (two or more codons can represent the same amino acid).

Very different DNA sequences may code for similar protein sequences.

We certainly do not want to miss those cases!
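The effect of degeneracy can be shown with a short sketch. It builds the standard genetic code table and translates two invented DNA sequences that differ at most positions yet encode the same peptide. The example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate a DNA string in frame 1, ignoring any trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identical_positions(seq_a: str, seq_b: str) -> int:
    """Count positions where two equal-length sequences carry the same letter."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a == b)


dna_1 = "TTATCTCGT"  # hypothetical sequence: Leu-Ser-Arg
dna_2 = "CTGAGCAGA"  # different codons for the same three amino acids

print(translate(dna_1), translate(dna_2))
print(identical_positions(dna_1, dna_2), "of", len(dna_1), "bases identical")
```

A DNA-level search could easily miss this pair, while a protein-level comparison sees identical sequences.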

Search by similarity

Using nucleotide seq.

Using amino acid seq.

Tools to search databases



Reasons for translating

Comparing DNA sequences gives more random matches:
- A good alignment with end-gaps
- A very poor alignment: almost 50% identity!

Conservation of protein in evolution (DNA similarity decays faster!)

Conclusion:
It is almost always better to compare coding sequences in their amino acid form, especially if they are very divergent.
For very highly similar sequences, nucleotide comparison may give better results.
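One way to see why DNA comparisons give more random matches is to compute the identity expected purely by chance. The sketch below assumes independent positions, no gaps, and uniform residue frequencies, which real sequences do not have, so the numbers are only a rough guide.

```python
def chance_identity(frequencies) -> float:
    """Probability that two random residues match: the sum of p_i squared."""
    return sum(p * p for p in frequencies)


dna_background = [0.25] * 4       # assumed uniform over A, C, G, T
protein_background = [0.05] * 20  # assumed uniform over 20 amino acids

print(round(chance_identity(dna_background), 3))
print(round(chance_identity(protein_background), 3))
```

Under these assumptions unrelated DNA sequences match at about one position in four, against about one in twenty for proteins, and allowing gaps raises the apparent identity further.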


BLAST and FASTA variants

FASTA: Compares a DNA query to a DNA database, or a protein query to a protein database.
FASTX: Compares a translated DNA query to a protein database.
TFASTA: Compares a protein query to a translated DNA database.

BLASTN: Compares a DNA query to a DNA database.
BLASTP: Compares a protein query to a protein database.
BLASTX: Compares the 6-frame translations of a DNA query to a protein database.
TBLASTN: Compares a protein query to the 6-frame translations of a DNA database.
TBLASTX: Compares the 6-frame translations of a DNA query to the 6-frame translations of a DNA database (each sequence is comparable to BLASTP searches!)
PSI-BLAST: Performs iterative database searches. The results from each round are incorporated into a 'position specific' score matrix, which is used for further searching.

A practical example of sequence alignment

http://www.ncbi.nlm.nih.gov

BLAST results

Detailed BLAST results


E value: the expectation value, i.e. the number of hits with a score at least as good as yours that would be expected to occur by chance. The lower the E, the more significant the score.
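As a hedged illustration of how an E value relates to score and search size, the sketch below uses the commonly cited relation E = m * n * 2^(-S'), where S' is the bit score, m the query length and n the database length. This formula is not stated in the slides; the numbers in the example are invented.

```python
import math


def expect_value(bit_score: float, query_length: int, database_length: int) -> float:
    """Expected number of chance hits: E = m * n * 2 ** (-bit_score)."""
    return query_length * database_length * 2.0 ** (-bit_score)


def p_value(e_value: float) -> float:
    """Probability of at least one chance hit, P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


# Hypothetical search: 100-residue query against 1,000,000 database residues.
e = expect_value(bit_score=40.0, query_length=100, database_length=1_000_000)
print(f"E = {e:.2e}, P = {p_value(e):.2e}")
```

For small E the two quantities are nearly equal, which may be why the E value is often loosely described as a probability.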


Database searching tips

- Use the latest database version.
- Use BLAST first, then a finer tool (FASTA,…).
- Search both strands when using FASTA.
- Translate sequences where relevant.
- Search the 6-frame translation of a DNA database.
- E < 0.05 is statistically significant, usually biologically interesting.
- If the query has repeated segments, delete them and repeat the search.


Most widely used sites for sequence analysis

Sites for alignment of 2 sequences:
- T-COFFEE (http://www.ch.embnet.org/software/TCoffee.html): more accurate than ClustalW for sequences with less than 30% identity.
- ClustalW (http://www.ch.embnet.org/software/ClustalW.html; http://align.genome.jp)
- bl2seq (http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi)
- LALIGN (http://www.ch.embnet.org/software/LALIGN_form.html)
- MultiALIGN (http://prodes.toulouse.inra.fr/multalin/multalin.html)


Sites for DNA to protein translation:

These algorithms can translate DNA sequences in any of the three forward or three reverse sense frames.
- Translate (http://au.expasy.org/tools/dna.html)
- Translate a DNA sequence (http://www.vivo.colostate.edu/molkit/translate/index.html)
- Transeq (http://www.ebi.ac.uk/emboss/transeq)
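What these translation sites do can be approximated with a short sketch of six-frame translation. It assumes the standard genetic code and an input made only of the letters A, C, G and T; it is an illustration and not the code behind any of the tools listed above.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse strand read 5' to 3'."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate from the first base, ignoring a trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def six_frame_translations(dna: str) -> list:
    """Three forward frames followed by three reverse-strand frames."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(strand[offset:]) for strand in strands for offset in range(3)]


for frame, protein in enumerate(six_frame_translations("ATGGCC"), start=1):
    print(frame, protein)
```

Frames 1 to 3 read the given strand and frames 4 to 6 read its reverse complement, which matches the three forward and three reverse frames mentioned above.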