Bioinformatics 1-Chen 2011-powerpoint

Oct 4, 2013


Bioinformatics: Guide to bio-computing and the Internet

Copyright© Kerstin Wagner


Introduction: What is bioinformatics?

Bioinformatics can be defined as the body of tools and algorithms needed to handle large and complex biological information.

Bioinformatics is a new scientific discipline created from the interaction of biology and computer science.

The NCBI defines bioinformatics as:

"Bioinformatics is the field of science in which biology, computer science, and information technology merge into a single discipline."



Genomics era: High-throughput DNA sequencing

The first high-throughput genomics technology was automated DNA sequencing in the early 1990s.

In September 1999, Celera Genomics completed the sequencing of the Drosophila genome.

In 1995, Venter and Hamilton Smith used a whole-genome shotgun sequencing strategy to sequence the genomes of Mycoplasma and Haemophilus.

The 3-billion-bp human genome sequence was generated in a competition between the publicly funded Human Genome Project and Celera.


Top image: confocal detection by the MegaBACE sequencer of fluorescently labeled DNA

High-throughput DNA sequencing

That was then. How about now?

Next Generation Sequencing (2010) vol11:31


Genomics: Completed genomes as of 2010

Currently, the genomes of the following organisms have been sequenced: 1598 bacterial / 85 archaeal / 294 eukaryotic genomes.

This generates a large amount of information to be handled by individual computers.

The trend of data growth


The 21st century is a century of biotechnology:

Microarray: Global expression analysis: RNA levels of every gene in the genome analyzed in parallel. (OUT! Replaced by RNA-seq.)

Proteomics: Global protein analysis generates large mass spectra libraries.

Metabolomics: Global metabolite analysis: 25,000 secondary metabolites characterized.

Genomics: New sequence information is being produced at increasing rates. (The contents of GenBank double every year.)

Metagenomics

- "Who is there and what are they doing?"
- Cultivation-independent approaches to study the big impact of microbes

How to handle the large amount of information?

Drew Sheneman, New Jersey -- The Newark Star Ledger

Answer: bioinformatics and the Internet

Bioinformatics history

IBM 7090 computer

In the 1960s: the birth of bioinformatics

Margaret Oakley Dayhoff created:
- The first protein database
- The first program for sequence assembly

There is a need for computers and algorithms that allow: access, processing, storing, sharing, retrieving, visualizing, annotating…


Why do we need the Internet?

"Omics" projects and the information associated with them involve a huge amount of data that is stored on computers all over the world.

Because it is impossible to maintain up-to-date copies of all relevant databases within the lab, access to the data is via the Internet.

You are here

Database storage

The Commercial Market

The current bioinformatics market is worth 300 million / year (half software).

Prediction: $2 billion / year in 5-6 years

~50 bioinformatics companies:

Genomatrix Software, Genaissance Pharmaceuticals, Lynx, Lexicon Genetics, DeCode Genetics, CuraGen, AlphaGene, Bionavigation, Pangene, InforMax, TimeLogic, GeneCodes, LabOnWeb.com, Darwin, Celera, Incyte, BioResearch Online, BioTools, Oxford Molecular, Genomica, NetGenics, Rosetta, Lion BioScience, DoubleTwist, eBioinformatics, Prospect Genomics, Neomorphic, Molecular Mining, GeneLogic, GeneFormatics, Molecular Simulations, Bioinformatics Solutions….

Scope of this lab

The lab will touch on the following computational tasks:

- Similarity search
- Sequence comparison: alignment, multiple alignment, retrieval
- Sequence analysis: signal peptide, transmembrane domain, …
- Protein folding: secondary structure from sequence
- Sequence evolution: phylogenetic trees

It will make you familiar with the bioinformatics resources available on the web to do these tasks.

You have just cloned a gene

Evolutionary relationship?
- Phylogenetic tree

Is it already in databases?
- Accession #?
- Annotation?

Protein characteristics?
- Sub-localization
- Soluble?
- 3D fold

Are there similar sequences?
- % identity?
- Family member?

Are there conserved regions?
- Alignments?
- Domains?

Other information?
- Expression profile?
- Mutants?

A critical failure of current bioinformatics is the lack of a single software package that can perform all of these functions.

Applying algorithms to analyze genomics data

DNA (nucleotide sequences) databases

They are big databases, and searching either one should produce similar results because they exchange information routinely.

- GenBank (NCBI): http://www.ncbi.nlm.nih.gov
- DDBJ (DNA DataBase of Japan): http://www.ddbj.nig.ac.jp
- TIGR: http://tigr.org/tdb/tgi
- Yeast: http://yeastgenome.org
- Microbes: http://img.jgi.doe.gov/cgi-bin/pub/main.cgi

Specialized databases: tissues, species…

- ESTs (Expressed Sequence Tags)
  ~ at NCBI: http://www.ncbi.nlm.nih.gov/dbEST
  ~ at TIGR: http://tigr.org/tdb/tgi
- ...many more!



Protein (amino acid) databases

They are big databases too:

- Swiss-Prot (very high level of annotation): http://au.expasy.org/
- PIR (Protein Information Resource), the world's most comprehensive catalog of information on proteins: http://www.pir.uniprot.org/

Translated databases:

- TREMBL (translated EMBL): includes entries that have not yet been annotated into Swiss-Prot. http://www.ebi.ac.uk/trembl/access.html
- GenPept (translation of coding regions in GenBank)
- pdb (sequences derived from the 3D structures in the Brookhaven PDB): http://www.rcsb.org/pdb/

Database homology searching

Use algorithms to efficiently provide a mathematical basis for searches that can be translated into statistical significance.

Assumes that sequence, structure, and function are inter-related.

All similarity searching methods rely on the concepts of alignment and distance between sequences.

A similarity score is calculated from a distance: the number of DNA bases or amino acids that are different between two sequences.
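The distance described above can be sketched in a few lines of Python. This is a minimal illustration that assumes the two sequences are already aligned and of equal length; the function names and the example sequences are made up for this sketch.

```python
def hamming_distance(seq_a: str, seq_b: str) -> int:
    """Count the positions at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must already be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def percent_identity(seq_a: str, seq_b: str) -> float:
    """Convert the distance into a simple similarity score between 0 and 100."""
    if not seq_a:
        raise ValueError("sequences must not be empty")
    distance = hamming_distance(seq_a, seq_b)
    return 100.0 * (len(seq_a) - distance) / len(seq_a)


# Illustrative sequences, not taken from any database.
print(hamming_distance("GATTACA", "GACTATA"))            # 2 differing positions
print(round(percent_identity("GATTACA", "GACTATA"), 1))  # similarity derived from it
```

Real search tools score substitutions with a matrix rather than counting every difference equally, so this is only the simplest possible distance.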


Calculating alignment scores

Scoring system: Uses scoring matrices that allow biologists to quantify the quality of sequence alignments.

The raw score S is calculated by summing the scores for each aligned position and the scores for gaps. Gap creation/extension scores are inherent to the scoring system in use (BLAST, FASTA…).

The score for an identity or a mismatch is given by the specified substitution matrix (e.g., BLOSUM62).
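As a rough sketch of how a raw score S might be summed, the Python below walks over an alignment column by column. The tiny substitution matrix and the gap penalties (-5 to open, -1 to extend) are toy values chosen for illustration, not the settings of BLAST or FASTA.

```python
# Toy substitution scores for three amino acids (illustrative values only).
TOY_MATRIX = {
    ("A", "A"): 4, ("G", "G"): 6, ("S", "S"): 4,
    ("A", "G"): 0, ("A", "S"): 1, ("G", "S"): 0,
}


def pair_score(matrix, a, b):
    """Look up a symmetric substitution score for residues a and b."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def raw_score(aligned_a, aligned_b, matrix, gap_open=-5, gap_extend=-1):
    """Sum substitution scores and gap penalties over an alignment.

    Gaps are written as '-'. Consecutive gap columns count as one gap:
    the first column costs gap_open, each following column gap_extend.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    in_gap = False
    for a, b in zip(aligned_a, aligned_b):
        if a == "-" or b == "-":
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += pair_score(matrix, a, b)
            in_gap = False
    return score


# A/A + G/G + gap open + gap extend + A/S = 4 + 6 - 5 - 1 + 1
print(raw_score("AGSSA", "AG--S", TOY_MATRIX))  # 5
```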


Devising a scoring system

How the matrices were created:

- Very similar sequences were aligned.
- From these alignments, the frequency of substitution between each pair of amino acids was calculated, and then the PAM matrix was built.
- The full series of PAM matrices can be calculated by multiplying the PAM1 matrix by itself; the resulting matrices are normalized to log-odds format.
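The idea of deriving higher PAM matrices by repeated multiplication can be illustrated with a toy example. The sketch below assumes a hypothetical two-letter alphabet and an invented PAM1-style mutation probability matrix (99% chance that a residue is unchanged); real PAM matrices are 20 x 20, and the multiplication is applied to the probability matrix before conversion to log-odds scores.

```python
def mat_mul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [
        [sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
        for i in range(n)
    ]


def pam_n(pam1, n):
    """Return pam1 raised to the n-th power (n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    result = pam1
    for _ in range(n - 1):
        result = mat_mul(result, pam1)
    return result


# Hypothetical PAM1 for a two-letter alphabet: each row sums to 1.
TOY_PAM1 = [[0.99, 0.01],
            [0.01, 0.99]]

pam2 = pam_n(TOY_PAM1, 2)
pam250 = pam_n(TOY_PAM1, 250)
print(pam2[0][0])    # chance a residue is unchanged after 2 PAM units
print(pam250[0][0])  # approaches 0.5 as the sequences diverge
```

The diagonal shrinks as n grows, which is why high-numbered PAM matrices suit distantly related sequences.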


Some popular scoring matrices are:

- PAM (Percent Accepted Mutation): for evolutionary studies. For example, in PAM1, 1 accepted point mutation per 100 amino acids is required.
- BLOSUM (BLOcks SUbstitution Matrix): for finding common motifs. For example, in BLOSUM62, the alignment is created using sequences sharing no more than 62% identity.


Devising a scoring system

Importance:

- Scoring matrices appear in all analyses involving sequence comparison.
- The choice of matrix can strongly influence the outcome of the analysis.
- Understanding the theories underlying a given scoring matrix can aid in making the proper choice:
  - Some matrices reflect similarity: good for database searching
  - Some reflect distance: good for phylogenies




Log-odds matrices, a normalisation method for matrix values:

$S_{ij} = \log \dfrac{q_{ij}}{p_i \, p_j}$

S_ij is the log of the ratio between the probability that two residues, i and j, are aligned by evolutionary descent and the probability that they are aligned by chance.

q_ij are the frequencies that i and j are observed to align in sequences known to be related. p_i and p_j are their frequencies of occurrence in the set of sequences.
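A log-odds score compares the observed pair frequency q_ij with the frequency p_i * p_j expected by chance. The short Python sketch below computes log(q_ij / (p_i * p_j)); the base-2 logarithm and the example frequencies are assumptions made for illustration.

```python
import math


def log_odds(q_ij: float, p_i: float, p_j: float, base: float = 2.0) -> float:
    """Log of observed pair frequency over the frequency expected by chance."""
    if min(q_ij, p_i, p_j) <= 0:
        raise ValueError("frequencies must be positive")
    return math.log(q_ij / (p_i * p_j), base)


# Invented frequencies: each residue makes up 10% of the data set.
# A pair seen twice as often as chance scores about +1 bit.
print(log_odds(0.02, 0.1, 0.1))
# A pair seen half as often as chance scores about -1 bit.
print(log_odds(0.005, 0.1, 0.1))
```

Positive scores mark pairs that align more often than chance in related sequences; negative scores mark pairs that align less often.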



Database search methods: Sequence Alignment

Two broad classes of sequence alignments exist:

- Global alignment: not sensitive
- Local alignment: faster


QKESGPSSSYC

VQQESGLVRTTC

ESG

ESG
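To make the local match concrete, here is a minimal Smith-Waterman sketch in Python applied to the two example sequences. The scoring values (match +1, mismatch -3, gap -3) are deliberately strict assumptions chosen so that only the exact shared stretch survives; production tools use a substitution matrix and affine gap penalties instead.

```python
def smith_waterman(a, b, match=1, mismatch=-3, gap=-3):
    """Return (best score, aligned part of a, aligned part of b)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]  # H[i][j]: best local score ending here
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
        if H[i][j] == diag:
            out_a.append(a[i - 1]); out_b.append(b[j - 1])
            i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


score, part_a, part_b = smith_waterman("QKESGPSSSYC", "VQQESGLVRTTC")
print(score, part_a, part_b)  # 3 ESG ESG
```

With more lenient penalties the local alignment would extend past ESG, which is one reason the choice of scoring system matters.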


The most widely used local similarity algorithms are:

- Smith-Waterman (http://www.ebi.ac.uk/MPsrch/)
- Basic Local Alignment Search Tool (BLAST, http://www.ncbi.nih.gov)
- Fast Alignment (FASTA, http://fasta.genome.jp; http://www.ebi.ac.uk/fasta33/; http://www.arabidopsis.org/cgi-bin/fasta/nph-TAIRfasta.pl)

Which algorithm to use for database similarity search?

Speed:

BLAST > FASTA > Smith-Waterman (it is VERY SLOW and uses a LOT OF COMPUTER POWER)

Sensitivity/statistics:

- FASTA is more sensitive and misses fewer homologues.
- Smith-Waterman is even more sensitive.
- BLAST calculates probabilities.
- FASTA is more accurate than BLAST for DNA-DNA searches.



k-tuple methods provide optimal alignments

These methods are faster and excellent in comparing sequences.

BLAST and FASTA programs are based on k-tuple algorithms:

1- Using the query sequence, derive a list of words of length w (e.g., 3)
2- Keep high-scoring words using a scoring matrix (e.g., BLOSUM62)
3- High-scoring words are compared with database sequences
4- Sequences with many matches to high-scoring words are used for final alignments
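The word-based steps above can be sketched roughly as follows. This simplified Python version assumes exact word matches only, so it skips the matrix-based filtering of step 2; the database contents, sequence names, and the `min_hits` threshold are hypothetical.

```python
def query_words(query, w=3):
    """Step 1: every overlapping word of length w in the query."""
    return {query[i:i + w] for i in range(len(query) - w + 1)}


def count_word_hits(query, database, w=3):
    """Step 3: count positions in each database sequence that match a query word."""
    words = query_words(query, w)
    hits = {}
    for name, seq in database.items():
        hits[name] = sum(
            1 for i in range(len(seq) - w + 1) if seq[i:i + w] in words
        )
    return hits


def candidates(query, database, w=3, min_hits=2):
    """Step 4: keep sequences with enough word hits, best first."""
    hits = count_word_hits(query, database, w)
    kept = [name for name, n in hits.items() if n >= min_hits]
    return sorted(kept, key=lambda name: -hits[name])


# Hypothetical mini database for illustration.
DATABASE = {
    "seq1": "VQQESGLVRTTC",
    "seq2": "MKESGPSAAAC",
    "seq3": "WWWWWWWW",
}

print(count_word_hits("QKESGPSSSYC", DATABASE))  # {'seq1': 1, 'seq2': 4, 'seq3': 0}
print(candidates("QKESGPSSSYC", DATABASE))       # ['seq2']
```

Only the surviving candidates would then be passed to a slower, full alignment step.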



The dilemma: DNA or protein?

Is the comparison of two nucleotide sequences accurate?

By translating into amino acid sequence, are we losing information?

The genetic code is degenerate (two or more codons can represent the same amino acid).

Very different DNA sequences may code for similar protein sequences.

We certainly do not want to miss those cases!
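A small Python sketch can show the degeneracy at work. It builds the standard genetic code from the conventional TCAG-ordered table and translates two made-up DNA fragments that share only one base in six yet encode the same dipeptide; the fragments and helper names are illustrative assumptions.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna):
    """Translate a DNA string in frame 1; trailing partial codons are ignored."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def dna_identity(a, b):
    """Percent of identical positions between two equal-length DNA strings."""
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


# Two invented fragments using different codons for Leu and Ser.
first, second = "CTGAGC", "TTATCA"
print(translate(first), translate(second))   # LS LS
print(round(dna_identity(first, second), 1)) # low DNA identity, identical protein
```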

Search by similarity

Using nucleotide seq.

Using amino acid seq.

Tools to search databases

Reasons for translating

Comparing DNA sequences gives more random matches:

A good alignment with end-gaps

A very poor alignment

Almost 50% identity!

Conservation of protein in evolution (DNA similarity decays faster!)

Conclusion:

It is almost always better to compare coding sequences in their amino acid form, especially if they are very divergent.

Very highly similar nucleotide sequences may give better results.


BLAST and FASTA variants

FASTA: Compares a DNA query to a DNA database, or a protein query to a protein database.
FASTX: Compares a translated DNA query to a protein database.
TFASTA: Compares a protein query to a translated DNA database.

BLASTN: Compares a DNA query to a DNA database.
BLASTP: Compares a protein query to a protein database.
BLASTX: Compares the 6-frame translations of a DNA query to a protein database.
TBLASTN: Compares a protein query to the 6-frame translations of a DNA database.
TBLASTX: Compares the 6-frame translations of a DNA query to the 6-frame translations of a DNA database (each sequence is comparable to BLASTP searches!).
PSI-BLAST: Performs iterative database searches. The results from each round are incorporated into a 'position specific' score matrix, which is used for further searching.
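The choice among the BLAST variants follows directly from the query and database types, so it can be written as a small lookup. The Python helper below is a hypothetical convenience function for illustration, not part of any BLAST distribution.

```python
def pick_blast_program(query: str, database: str, translate_both: bool = False) -> str:
    """Return the BLAST variant for a query/database type ('dna' or 'protein').

    translate_both only matters for DNA against DNA: it selects the
    6-frame versus 6-frame comparison (tblastx) instead of blastn.
    """
    if query == "dna" and database == "dna":
        return "tblastx" if translate_both else "blastn"
    table = {
        ("protein", "protein"): "blastp",
        ("dna", "protein"): "blastx",
        ("protein", "dna"): "tblastn",
    }
    try:
        return table[(query, database)]
    except KeyError:
        raise ValueError("query and database must each be 'dna' or 'protein'")


print(pick_blast_program("dna", "protein"))    # blastx
print(pick_blast_program("protein", "dna"))    # tblastn
print(pick_blast_program("dna", "dna", True))  # tblastx
```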

A practical example of sequence alignment

http://www.ncbi.nlm.nih.gov

BLAST results

Detailed BLAST results


E value: the expectation value, that is, the number of hits with a score at least as good as this one that you would expect to find by chance. The lower the E, the more significant the score.
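For intuition, the relation between a normalized bit score S' and the E value can be sketched as E = m * n * 2^(-S'), where m is the query length and n the database length. The Python below is a simplified illustration; real programs use effective lengths that are corrected for edge effects, and the lengths used here are invented.

```python
def expected_chance_hits(bit_score: float, query_len: int, db_len: int) -> float:
    """Approximate E value from a bit score: E = m * n * 2 ** (-S')."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Invented search space: a 100-residue query against 1,000,000 residues.
weak = expected_chance_hits(20, 100, 1_000_000)
strong = expected_chance_hits(40, 100, 1_000_000)
print(weak)    # many such hits expected by chance
print(strong)  # far below one hit expected by chance
```

Each extra bit of score halves the number of chance hits, and a larger database raises E for the same score.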


Database searching tips

- Use the latest database version.
- Use BLAST first, then a finer tool (FASTA, …).
- Search both strands when using FASTA.
- Translate sequences where relevant.
- Search the 6-frame translation of the DNA database.
- E < 0.05 is statistically significant, usually biologically interesting.
- If the query has repeated segments, delete them and repeat the search.


Most widely used sites for sequence analysis

Sites for alignment of 2 sequences:

- T-COFFEE (http://tcoffee.vital-it.ch/cgi-bin/Tcoffee/tcoffee_cgi/index.cgi): more accurate than ClustalW for sequences with less than 30% identity.
- ClustalW (http://www.ch.embnet.org/software/ClustalW.html; http://align.genome.jp)
- bl2seq (http://www.ncbi.nlm.nih.gov/blast/bl2seq/wblast2.cgi)
- LALIGN (http://www.ch.embnet.org/software/LALIGN_form.html)
- MultiALIGN (http://prodes.toulouse.inra.fr/multalin/multalin.html)


Sites for DNA to protein translation:

These algorithms can translate DNA sequences in any of the three forward or three reverse sense frames.

- Translate (http://au.expasy.org/tools/dna.html)
- Translate a DNA sequence (http://www.vivo.colostate.edu/molkit/translate/index.html)
- Transeq (http://www.ebi.ac.uk/emboss/transeq)
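What these sites compute can be approximated in a few lines of Python. The sketch below assumes the standard genetic code and labels the frames +1 to +3 (forward) and -1 to -3 (reverse complement); the labels, function names, and the test sequence are choices made for this illustration, and the web tools may number the reverse frames differently.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from the first base; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the three forward and three reverse translations."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


# Invented sequence: start codon, alanine codon, stop codon.
for name, protein in sorted(six_frames("ATGGCCTAA").items()):
    print(name, protein)
```

Stop codons appear as `*`, which makes it easy to spot which frame contains an open reading frame.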



BioEdit, a sequence editing software package: http://www.mbio.ncsu.edu/bioedit/bioedit.html

Oligo Design and Analysis Tools: http://www.idtdna.com/scitools/scitools.aspx