Bioinformatics 2-Chen 2011-powerpoint

Oct 4, 2013 (3 years and 8 months ago)

Tools to analyze protein characteristics

Protein sequence
- Family member
- Multiple alignments
- Identification of conserved regions
- Evolutionary relationship (Phylogeny)
- 3-D fold model
- Protein sorting and sub-cellular localization
- Anchoring into the membrane
- Signal sequence (tags)


Some nascent proteins contain a specific signal, or targeting sequence, that directs them to the correct destination (ER, mitochondria, chloroplasts, lysosomes, vacuoles, Golgi, or cytosol).


Can we train computers:

- To detect signal sequences and predict protein destination?
- To identify conserved domains (or a pattern) in proteins?
- To predict the membrane-anchoring type of a protein? (Transmembrane domain, GPI anchor…)
- To predict the 3D structure of a protein?


Learning algorithms are good for solving problems in pattern recognition because they can be trained on a sample data set.

Classes of learning algorithms:
- Artificial neural networks (ANNs)
- Hidden Markov Models (HMMs)

Questions

Artificial neural networks (ANN)


Machine learning algorithms that mimic the brain. Real brains, however, are orders of magnitude more complex than any ANN.

ANNs, like people, learn by example. An ANN is not explicitly programmed to perform a specific task; it is trained on examples of that task.

An ANN is composed of a large number of highly interconnected processing elements (neurons) working simultaneously to solve specific problems.

The first artificial neuron was developed in 1943 by the neurophysiologist Warren McCulloch and the logician Walter Pitts.
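
To make the idea of a single processing element concrete, here is a minimal sketch of a threshold neuron in the spirit of the McCulloch and Pitts unit. The function names, weights, and thresholds are illustrative choices, not taken from the slides.

```python
def mcculloch_pitts_neuron(inputs, weights, threshold):
    """Return 1 (fire) if the weighted sum of the inputs reaches the threshold, else 0."""
    activation = sum(x * w for x, w in zip(inputs, weights))
    return 1 if activation >= threshold else 0


def logical_and(a, b):
    # Both inputs must be active to reach a threshold of 2.
    return mcculloch_pitts_neuron([a, b], weights=[1, 1], threshold=2)


def logical_or(a, b):
    # A single active input is enough to reach a threshold of 1.
    return mcculloch_pitts_neuron([a, b], weights=[1, 1], threshold=1)


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, "AND:", logical_and(a, b), "OR:", logical_or(a, b))
```

Here the weights and thresholds are fixed by hand. In a trained ANN they would be adjusted from example data.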

Hidden Markov Models (HMM)


An HMM is a probabilistic process over a set of states in which the states are “hidden”. Only the outcome is visible to the observer, hence the name Hidden Markov Model.


HMMs have many uses in genomics:
- Gene prediction (GENSCAN)
- SignalP
- Finding periodic patterns


Used to answer questions like:
- What is the probability of obtaining a particular outcome?
- What is the best model from many combinations?
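
The first question, the probability of a particular outcome, can be answered with the forward algorithm. The sketch below is a toy example: the two states ("signal", "mature"), the two observable symbols ("H" for hydrophobic, "P" for polar), and all probabilities are made-up assumptions for illustration, not parameters of any real predictor.

```python
def forward_probability(observations, states, start_p, trans_p, emit_p):
    """Probability of the observed symbols, summed over all hidden state paths."""
    # alpha[s] = probability of the symbols seen so far, ending in state s.
    alpha = {s: start_p[s] * emit_p[s][observations[0]] for s in states}
    for symbol in observations[1:]:
        alpha = {
            s: emit_p[s][symbol] * sum(alpha[prev] * trans_p[prev][s] for prev in states)
            for s in states
        }
    return sum(alpha.values())


# Hypothetical toy model: a hydrophobic-rich "signal" state followed by a "mature" state.
STATES = ("signal", "mature")
START_P = {"signal": 0.5, "mature": 0.5}
TRANS_P = {
    "signal": {"signal": 0.8, "mature": 0.2},
    "mature": {"signal": 0.0, "mature": 1.0},
}
EMIT_P = {
    "signal": {"H": 0.9, "P": 0.1},
    "mature": {"H": 0.4, "P": 0.6},
}

if __name__ == "__main__":
    print(forward_probability("HHHPP", STATES, START_P, TRANS_P, EMIT_P))
```

The observer only sees the string of H and P symbols. The state path that produced it stays hidden, and the algorithm sums over every possible path.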


The ExPASy (Expert Protein Analysis System) server (http://au.expasy.org) is dedicated to the analysis of protein sequences and structures.


Sequence analysis tools include:
- DNA -> Protein [Translate]
- Pattern and profile searches
- Post-translational modification and topology prediction
- Primary structure analysis
- Structure prediction (2D and 3D)
- Alignment
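
As an illustration of the DNA -> Protein step in the list above, here is a minimal sketch that translates one reading frame with the standard genetic code. It is not the ExPASy Translate implementation. The `translate` function is a hypothetical helper, and unknown codons are reported as "X" by assumption.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate the first reading frame of a DNA string; '*' marks a stop codon."""
    dna = dna.upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        # Codons containing ambiguous bases are reported as 'X'.
        protein.append(CODON_TABLE.get(dna[i:i + 3], "X"))
    return "".join(protein)


if __name__ == "__main__":
    print(translate("ATGGCCTAA"))
```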



PredictProtein: A service for sequence analysis and structure prediction
http://www.predictprotein.org/newwebsite/submit.html

TMpred: http://www.ch.embnet.org/software/TMPRED_form.html

TMHMM: Predicts transmembrane helices in proteins (CBS; Denmark)
http://www.cbs.dtu.dk/services/TMHMM-2.0/

big-PI: Predicts GPI-anchor site: http://mendel.imp.univie.ac.at/sat/gpi/gpi_server.html

DGPI: Predicts GPI-anchor site: http://129.194.185.165/dgpi/index_en.html

SignalP: Predicts signal peptide: http://www.cbs.dtu.dk/services/SignalP/

PSORT: Predicts sub-cellular localization: http://www.psort.org/

TargetP: Predicts sub-cellular localization: http://www.cbs.dtu.dk/services/TargetP/

NetNGlyc: Predicts N-glycosylation sites: http://www.cbs.dtu.dk/services/NetNGlyc/

PTS1: Predicts peroxisomal targeting sequences
http://mendel.imp.univie.ac.at/mendeljsp/sat/pts1/PTS1predictor.jsp

MITOPROT: Predicts mitochondrial targeting sequences
http://ihg.gsf.de/ihg/mitoprot.html

Hydrophobicity: http://www.vivo.colostate.edu/molkit/hydropathy/index.html
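
A hydrophobicity plot of the kind produced by the last tool can be sketched as a sliding-window average over a hydropathy scale. The example below uses the Kyte-Doolittle scale. The 19-residue window and the 1.6 cutoff are a commonly cited rule of thumb for spotting candidate transmembrane segments and are used here as assumptions, not as the settings of any server listed above.

```python
# Kyte-Doolittle hydropathy values (positive = hydrophobic).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, window=19):
    """Average hydropathy of every window of the given length along the sequence."""
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    return [
        sum(scores[i:i + window]) / window
        for i in range(len(scores) - window + 1)
    ]


def candidate_membrane_windows(sequence, window=19, cutoff=1.6):
    """Start positions (0-based) of windows whose average hydropathy reaches the cutoff."""
    profile = hydropathy_profile(sequence, window)
    return [i for i, average in enumerate(profile) if average >= cutoff]


if __name__ == "__main__":
    toy = "DDDD" + "L" * 19 + "KKKK"  # made-up sequence with a hydrophobic core
    print(candidate_membrane_windows(toy))
```

Dedicated predictors such as TMHMM use trained models instead of a fixed cutoff, so this is only a first-pass screen.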

Multiple alignment


Used to do phylogenetic analysis:
- Same protein from different species
- Evolutionary relationship: history

Used to find conserved regions
- Local multiple alignment reveals conserved regions
- Conserved regions usually are key functional regions
- These regions are prime targets for drug development
- Protein domains are often conserved across many species

Algorithm for search of conserved regions:
- Block Maker: http://blocks.fhcrc.org/blocks/make_blocks.html
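
Once sequences are aligned, conserved regions can be located by scanning alignment columns. The sketch below reports ungapped runs of identical columns. It is a simplified illustration and not the Block Maker algorithm. The `conserved_blocks` helper, the `min_length` parameter, and the toy sequences are assumptions.

```python
def conserved_blocks(aligned, min_length=3):
    """Return (start, residues) for runs of columns identical in every aligned sequence."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    blocks = []
    start = None
    # Iterate one position past the end so a run reaching the last column is closed.
    for i in range(length + 1):
        conserved = (
            i < length
            and aligned[0][i] != "-"
            and len({seq[i] for seq in aligned}) == 1
        )
        if conserved and start is None:
            start = i
        elif not conserved and start is not None:
            if i - start >= min_length:
                blocks.append((start, aligned[0][start:i]))
            start = None
    return blocks


if __name__ == "__main__":
    alignment = ["MKTAYIAKQR", "MRTAYIGKQR", "M-TAYIAKHR"]  # made-up example
    print(conserved_blocks(alignment))
```

Requiring strict identity is the harshest possible criterion. Real tools also score conservative substitutions.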

Multiple alignment tools


Free programs:
- Phylip and PAUP: http://evolution.genetics.washington.edu/phylip.html
- Phyml: http://atgc.lirmm.fr/phyml/

The most used websites:
- http://align.genome.jp/
- http://prodes.toulouse.inra.fr/multalin/multalin.html
- http://www.ch.embnet.org/index.html (T-COFFEE and ClustalW)


ClustalW:
- Standard popular software
- It aligns 2 sequences and keeps adding a new sequence to the alignment
- Problem: it is only a heuristic, so the result is not guaranteed to be the optimal alignment.
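
The "align 2, then keep adding" strategy starts from pairwise comparisons. The sketch below shows only that first step: a Needleman-Wunsch global alignment score and a greedy choice of the most similar pair to align first. It is not ClustalW itself, which builds a guide tree and aligns profiles. The scoring values and function names are illustrative assumptions.

```python
from itertools import combinations


def global_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Needleman-Wunsch score of the best global alignment of a and b (linear gap cost)."""
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]
        for j in range(1, len(b) + 1):
            diagonal = previous[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


def closest_pair(sequences):
    """Names of the two sequences with the highest pairwise score (aligned first)."""
    return max(
        combinations(sorted(sequences), 2),
        key=lambda pair: global_alignment_score(sequences[pair[0]], sequences[pair[1]]),
    )


if __name__ == "__main__":
    toy = {"a": "GATTACA", "b": "GATTACA", "c": "CCCCCCC"}  # made-up sequences
    print(closest_pair(toy))
```

Because later sequences are added to an alignment that is never revisited, early mistakes persist. That is the sense in which the approach is a heuristic.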


Motif discovery: use your own motif to search databases:
- PatternFind: http://myhits.isb-sib.ch/cgi-bin/pattern_search
- http://meme.nbcr.net/meme4_6_0/intro.html
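
Searching with your own motif can also be tried locally. The sketch below converts a simple PROSITE-style pattern into a Python regular expression and scans a sequence. It handles only the basic syntax (`x`, `[..]`, `{..}`, repeat counts, `<` and `>` anchors). The helper names and the test sequence are hypothetical, and this is not the PatternFind implementation.

```python
import re


def prosite_to_regex(pattern):
    """Convert a basic PROSITE-style pattern such as 'N-{P}-[ST]-{P}' to a regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")   # {P}    -> [^P]
        element = element.replace("(", "{").replace(")", "}")    # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                      # x      -> any residue
        element = element.replace("<", "^").replace(">", "$")    # terminus anchors
        parts.append(element)
    return "".join(parts)


def find_motif(sequence, prosite_pattern):
    """Return (0-based start, matched text) for every match, including overlapping ones."""
    # A lookahead lets the search report matches that overlap each other.
    regex = re.compile("(?=(" + prosite_to_regex(prosite_pattern) + "))")
    return [(m.start(), m.group(1)) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    # N-glycosylation sequon pattern applied to a made-up sequence.
    print(find_motif("MNGTANPSNKSA", "N-{P}-[ST]-{P}"))
```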

Phylogenetic analysis


Phylogenetic trees


Describe evolutionary relationships between sequences


Major modes that drive the evolution:



Point mutations modify existing sequences



Duplications (re-use existing sequence)



Rearrangement


Two most common methods


Maximum parsimony


Maximum likelihood


The most useful software:
http://www.megasoftware.net/mega4/m_con_select.html

Parsimony vs Maximum likelihood


Parsimony is the most popular method, in which the simplest answer is always the preferred one.

It involves statistical evaluation of the number of mutations needed to explain the observed data.

The best tree is the one that requires the fewest evolutionary changes.
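
Counting the changes a given tree requires can be sketched with Fitch's small-parsimony algorithm. The example below scores two candidate rooted binary trees for the same alignment and prefers the one needing fewer changes. The nested-tuple tree representation, species names, and sequences are made-up assumptions, and a real program would also search over many tree shapes.

```python
def fitch(tree, leaf_states):
    """Return (possible ancestral states, minimum changes) for one alignment column."""
    if isinstance(tree, str):
        # A leaf: its state is observed, and no change is needed below it.
        return {leaf_states[tree]}, 0
    left_states, left_changes = fitch(tree[0], leaf_states)
    right_states, right_changes = fitch(tree[1], leaf_states)
    shared = left_states & right_states
    if shared:
        return shared, left_changes + right_changes
    # The children disagree, so one extra change is required at this node.
    return left_states | right_states, left_changes + right_changes + 1


def parsimony_score(tree, alignment):
    """Total minimum number of changes over all columns of an aligned set of sequences."""
    n_sites = len(next(iter(alignment.values())))
    total = 0
    for site in range(n_sites):
        column = {name: seq[site] for name, seq in alignment.items()}
        total += fitch(tree, column)[1]
    return total


if __name__ == "__main__":
    toy = {"human": "ACGT", "chimp": "ACGT", "mouse": "ATGT", "rat": "ATGA"}
    tree_1 = (("human", "chimp"), ("mouse", "rat"))
    tree_2 = (("human", "mouse"), ("chimp", "rat"))
    print(parsimony_score(tree_1, toy), parsimony_score(tree_2, toy))
```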


Likelihood generally performs better than parsimony



In contrast, maximum likelihood does not simply count changes. Under an explicit model of sequence evolution, it attempts to answer the question: What parameters of evolutionary events were most likely to produce the current data set?

This is computationally difficult to do. It is the slowest of all methods.
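
The likelihood question can be illustrated on the smallest possible case: estimating the evolutionary distance between two aligned sequences under the Jukes-Cantor (JC69) model. This is a swapped-in one-parameter example and not a tree search. The function names and the counts (100 sites, 10 differences) are assumptions. A grid search over candidate distances is compared with the known closed-form estimate.

```python
import math


def jc69_log_likelihood(distance, n_sites, n_diffs):
    """Log-likelihood of observing n_diffs differing sites out of n_sites under JC69."""
    decay = math.exp(-4.0 * distance / 3.0)
    p_same = 0.25 + 0.75 * decay   # a site shows the same base in both sequences
    p_diff = 0.25 - 0.25 * decay   # a site shows one specific different base
    return (
        n_sites * math.log(0.25)
        + (n_sites - n_diffs) * math.log(p_same)
        + n_diffs * math.log(p_diff)
    )


def ml_distance_grid(n_sites, n_diffs, step=0.001, max_distance=3.0):
    """Candidate distance with the highest likelihood, found by brute-force grid search."""
    candidates = [step * i for i in range(1, int(max_distance / step) + 1)]
    return max(candidates, key=lambda d: jc69_log_likelihood(d, n_sites, n_diffs))


def ml_distance_closed_form(n_sites, n_diffs):
    """Closed-form JC69 distance: d = -3/4 * ln(1 - 4p/3), p = fraction of differing sites."""
    p = n_diffs / n_sites
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    print(ml_distance_grid(100, 10), ml_distance_closed_form(100, 10))
```

Even this one-parameter case needs thousands of likelihood evaluations by brute force. A full analysis repeats the optimization for branch lengths on every candidate tree, which is why the method is slow.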



Definitions


Homologous: Have a common ancestor. Homology is a yes-or-no property; unlike similarity, it cannot be measured as a percentage.

Orthologous: The same gene in different species. It is the result of speciation (common ancestral gene).

Paralogous: Related genes (already diverged) in the same species. It is the result of genomic rearrangements or duplication.

Determining protein structure


Direct measurement of structure



X-ray crystallography



NMR spectroscopy


Site-directed mutagenesis


Computer modeling



Prediction of structure



Comparative protein-structure modeling

Comparative protein-structure modeling


Goal: Construct a 3-D model of a protein of unknown structure (target), based on similarity of sequence to proteins of known structure (templates)

Blue: predicted model by PROSPECT
Red: NMR structure


Procedure:



Template selection



Template-target alignment



Model building



Model evaluation

The Protein 3-D Database

The Protein Data Bank (PDB) contains 3-D structural data for proteins

Founded in 1971 with a dozen structures

As of June 2004, there were 25,760 structures in the database.

All structures are reviewed for accuracy and data uniformity.

Structural data from the PDB can be freely accessed at http://www.rcsb.org/pdb/

80% come from X-ray crystallography

16% come from NMR

2% come from theoretical modeling

High-throughput methods

Most used websites for 3-D structure prediction


Protein Homology/analogY Recognition Engine (Phyre) at


http://www.sbg.bio.ic.ac.uk/phyre/html/index.html


PredictProtein
http://www.predictprotein.org/newwebsite/submit.html

UCLA Fold Recognition at
http://www.doe-mbi.ucla.edu/Services/FOLD/


Commercial bioinformatics software

CLC Genomics Workbench


Genomics:
- 454, Illumina Genome Analyzer and SOLiD sequencing data;
- De novo assembly of genomes of any size;
- Advanced visualization, scrolling, and zooming tools;
- SNP detection using advanced quality filtering;

Transcriptomics:
- RNA-seq including paired data and transcript-level expression;
- Small RNA analysis;
- Expression profiling by tags;

Epigenetics:
- Chromatin immunoprecipitation sequencing (ChIP-seq) analysis;
- Peak finding and peak refinement;
- Graph and table of background distribution and false discovery rate;
- Peak table and annotations;






VectorNTI:
- Sequence analysis and illustration;
- restriction mapping;
- recombinant molecule design and cloning;
- in silico gel electrophoresis;
- synthetic biology workflows

AlignX:

BioAnnotator:

ContigExpress:

GenomBench





The bioinformatics not covered in this class


Comparative genomics and Genome browser:
http://genome.lbl.gov/vista/index.shtml
http://www.sanger.ac.uk/resources/software/artemis/

Genome annotation:
http://linux1.softberry.com/berry.phtml
http://rast.nmpdr.org/

Metagenomics:
http://metagenomics.anl.gov/

Systems biology tools.