Usage of profile HMMs in Bioinformatics


4 Oct 2013


Using the Pfam database's profile HMMs in the
MATLAB Bioinformatics Toolbox


Presentation by: Athina Ropodi
University of Athens - Information Technology in Medicine and Biology


Introduction
  HMMs
  Profile HMMs

Pfam Database
  General info
  Useful links
  Available data

Bioinformatics Toolbox
  Function presentation

Other available software

Bibliography


To model sequential data while still exploiting the correlation between neighboring observations, we need a probabilistic model that describes the joint distribution of the whole sequence of observations.


A simple way to do this is to assume a Markov chain model. The probability of going from one state to another is called the transition probability.
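As a small illustration of the Markov chain idea, the joint probability of a state sequence is the probability of the first state multiplied by one transition probability per step. The sketch below uses a made-up two-state chain; the names initial, A and path and all numbers are assumptions chosen for the example, and only base MATLAB functions are used.

```matlab
% Toy first-order Markov chain with two states (all values are illustrative).
initial = [0.6 0.4];        % initial(i) = P(first state = i)
A = [0.7 0.3;               % A(i,j) = P(next state = j | current state = i)
     0.2 0.8];
path = [1 1 2 2 2];         % an example state sequence

% Joint probability = initial probability times the product of the transitions.
p = initial(path(1));
for t = 2:numel(path)
    p = p * A(path(t-1), path(t));
end
fprintf('P(path) = %.5f\n', p);
```

Each row of A sums to 1, because from any state the chain has to move to some state (possibly the same one).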


In Hidden Markov Models (HMMs), given a sequence of symbols X (e.g. nucleotides in a DNA sequence, or amino acids in a protein sequence), the emission probability is defined as the probability of observing symbol b when in state k.
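To show how transition and emission probabilities combine, here is a minimal sketch of the forward algorithm, which sums over all hidden state paths to obtain the probability of an observed sequence. The two-state model, the matrices A and E, and the sequence x are invented for the example and do not come from Pfam.

```matlab
% Toy two-state HMM over the DNA alphabet (all numbers are illustrative).
alphabet = 'ACGT';
initial = [0.5 0.5];             % initial(k) = P(first state = k)
A = [0.9 0.1;                    % A(k,l) = P(next state = l | state = k)
     0.1 0.9];
E = [0.4 0.1 0.1 0.4;            % E(k,b) = P(symbol b | state k)
     0.1 0.4 0.4 0.1];
x = 'ACG';                       % observed sequence

% Map each observed symbol to its column in E.
[~, idx] = ismember(x, alphabet);

% Forward recursion: f(k) = P(x_1..x_t, state at time t = k).
f = initial .* E(:, idx(1))';
for t = 2:numel(x)
    f = (f * A) .* E(:, idx(t))';
end
likelihood = sum(f);             % P(x) summed over all state paths
fprintf('P(x) = %.6f\n', likelihood);
```

For long sequences the recursion is normally carried out in log space or with rescaling, since the raw products underflow quickly.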



The M-states (match states) produce one of the 20 amino-acid letters, according to P(x|m_i).


For each match state there is a delete state (d_i), where no amino acid is produced.


There is a total of M+1 insert states, one on either side of each match state, which produce amino acids according to P(x|i_i). Here M is the number of match states.
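The state counts above can be checked with a short sketch that lists the states of a profile HMM with M match states. Numbering the insert states from i0 to iM is a common textbook convention and is assumed here; the variable names are hypothetical.

```matlab
M = 3;  % number of match states (illustrative)

% One match and one delete state per model position, M+1 insert states.
matchStates  = arrayfun(@(i) sprintf('m%d', i), 1:M, 'UniformOutput', false);
deleteStates = arrayfun(@(i) sprintf('d%d', i), 1:M, 'UniformOutput', false);
insertStates = arrayfun(@(i) sprintf('i%d', i), 0:M, 'UniformOutput', false);

% Only match and insert states emit symbols; delete states are silent.
nEmitting = numel(matchStates) + numel(insertStates);
fprintf('%d match, %d delete, %d insert states (%d emitting)\n', ...
    numel(matchStates), numel(deleteStates), numel(insertStates), nEmitting);
```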


Pfam is a collection of multiple sequence alignments and profile hidden Markov models (HMMs). Each Pfam HMM represents a protein family or domain.


For each Pfam entry there is a family page, which can be accessed in several ways.


Pfam contains two types of families, Pfam-A and Pfam-B. Pfam-A families are manually curated, HMM-based families, built from an alignment of a small number of representative sequences.



For each family two HMMs are built, one to represent fragment matches and one to represent full-length matches. The HMMER2 software is used to build and search the profile HMMs.



Available links:

http://pfam.sanger.ac.uk/

http://hmmer.janelia.org/


Each family has the following data:

A seed alignment, which is a hand-edited multiple alignment representing the family.

Hidden Markov Models (HMMs) derived from the seed alignment, which can be used to find new members of the domain and also to realign a set of sequences to the model. One HMM is an ls mode (global) model, the other is an fs mode (local) model.

A full alignment, which is an automatic alignment of all the examples of the domain, using the two HMMs to find and then align the sequences.

Annotation that contains a brief description of the domain, links to other databases and some Pfam-specific data, to record how the family was constructed.



Bioinformatics Toolbox v. 3.1 for MATLAB (R2008a)


Uses the profile HMMs found in PFAM.


The search is usually done by accession number or by name of the family.


Multiple sequence profiles: MATLAB implementations of multiple alignment and profile hidden Markov model algorithms (gethmmprof, gethmmalignment, gethmmtree, pfamhmmread, hmmprofalign, hmmprofestimate, hmmprofgenerate, hmmprofmerge, hmmprofstruct, showhmmprof).



>> HMMStruct = gethmmprof(2)





    Name: '7tm_2'
    PfamAccessionNumber: 'PF00002.14'
    ModelDescription: [1x42 char]
    ModelLength: 296
    Alphabet: 'AA'
    MatchEmission: [296x20 double]
    InsertEmission: [296x20 double]
    NullEmission: [1x20 double]
    BeginX: [297x1 double]
    MatchX: [295x4 double]
    InsertX: [295x2 double]
    DeleteX: [295x2 double]
    FlankingInsertX: [2x2 double]
    LoopX: [2x2 double]
    NullX: [2x1 double]


ModelLength: number of match states.

MatchEmission: symbol emission probabilities in the MATCH states.

NullEmission: symbol emission probabilities in the MATCH and INSERT states for the NULL model.
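To make the role of these fields more concrete, the sketch below builds a tiny stand-in structure and reads a consensus symbol out of the match emissions. The struct toy, the reduced four-letter alphabet and all probabilities are invented for illustration; a real structure from gethmmprof has 20 columns, and their amino-acid order should be checked in the toolbox documentation.

```matlab
% Toy stand-in for a profile HMM structure (not real Pfam data).
symbols = 'ACDE';                        % reduced alphabet, illustration only
toy.ModelLength   = 3;
toy.MatchEmission = [0.70 0.10 0.10 0.10;
                     0.05 0.05 0.80 0.10;
                     0.25 0.25 0.10 0.40];

% One row per match state; each row is a distribution over the alphabet.
rowSums = sum(toy.MatchEmission, 2);

% Most probable symbol in each match state gives a consensus string.
[~, best] = max(toy.MatchEmission, [], 2);
consensus = symbols(best');
disp(consensus);
```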

>> site = 'http://pfam.sanger.ac.uk/';
>> hmm = pfamhmmread([site 'family/gethmm?mode=ls&id=7tm_2']);

or

>> pfamhmmread('pf00002.ls');


>> model = pfamhmmread('pf00002.ls');
>> showhmmprof(model, 'Scale', 'logodds');
>> hydrophobic = 'IVLFCMAGTSWYPHNDQEKR';
>> showhmmprof(model, 'Order', hydrophobic);





Values for the 'Scale' option:

'logprob'   Log probabilities
'prob'      Probabilities
'logodds'   Log-odds ratios
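The relation between the three scales can be sketched with a few lines of base MATLAB. The vectors p and q are made-up emission and null-model probabilities over a four-letter toy alphabet, and the use of base-2 logarithms is an assumption for the example; the base used internally by showhmmprof is not stated here.

```matlab
p = [0.50 0.25 0.125 0.125];    % toy emission probabilities of one state
q = [0.25 0.25 0.25  0.25];     % toy null-model (background) probabilities

prob    = p;                    % 'prob' scale: the probabilities themselves
logprob = log2(p);              % 'logprob' scale: log probabilities
logodds = log2(p ./ q);         % 'logodds' scale: log of emission over background

disp([prob; logprob; logodds]);
```

A positive log-odds value marks a symbol that the state emits more often than the background model does, and a negative value marks one it emits less often.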

Choices for TypeValue are:

'seed'             Returns a tree with only the alignments used to generate the HMM model.
'full' (default)   Returns a tree with all of the alignments that match the model.


>> tree = gethmmtree(2, 'type', 'seed');

and

>> tr = phytreeread('pf00002.tree');


gethmmalignment: retrieves the multiple sequence alignment associated with an HMM profile from the Pfam database.

hmmprofalign: aligns a query sequence to a profile using hidden Markov model alignment.

>> load('hmm_model_examples','model_7tm_2');
>> load('hmm_model_examples','sequences');
>> SCCR_RABIT = sequences(2).Sequence;
>> [a,s] = hmmprofalign(model_7tm_2, SCCR_RABIT, 'showscore', true);


a =

  514.7448

s =

LLKLKVMYTVGYSSS-LVMLLVALGILCAFRRLHCTRNYIHMHLFLSFILRALSNFI
KDAVLFSSDdaihcdahrvgCKLVMVFFQYCIMANYAWLLVEGLYLHSLLVVS---
FFSERKCLQGFVVLGWGSPAMFVTSWAVTR------------HFLEDSGC-WDIN-
ANAAIWWVIRGPVILSILINFILFINILRILTRKLR----
TQETRGQDMNHYKRLARSTLLLIPLFGVHYIVFVFSPEG-----
AMEIQLFFELALGSFQGLVVAVLYCFLNGEV
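One way to read an aligned string like the one above is to count its character classes. The sketch assumes the usual profile-alignment convention, which should be confirmed in the hmmprofalign documentation: uppercase letters come from match states, lowercase letters from insert states, and dashes mark delete states. The excerpt is a contiguous piece of the output shown above.

```matlab
% Contiguous excerpt of the aligned sequence s shown above.
excerpt = 'KDAVLFSSDdaihcdahrvgCKLVMVFFQYCIMANYAWLLVEGLYLHSLLVVS---FFSERK';

% Assumed convention: uppercase = match, lowercase = insert, '-' = delete.
nMatch  = sum(excerpt >= 'A' & excerpt <= 'Z');
nInsert = sum(excerpt >= 'a' & excerpt <= 'z');
nDelete = sum(excerpt == '-');

% Match and delete states together cover the model positions spanned.
nModelPositions = nMatch + nDelete;
fprintf('match %d, insert %d, delete %d, model positions %d\n', ...
    nMatch, nInsert, nDelete, nModelPositions);
```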


hmmprofestimate: estimates profile hidden Markov model (HMM) parameters using pseudocounts.

hmmprofgenerate: generates a random sequence drawn from a profile hidden Markov model (HMM).

hmmprofmerge: concatenates prealigned strings of several sequences to a profile hidden Markov model (HMM).
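The idea behind pseudocounts can be sketched for a single alignment column: a small constant is added to every symbol count before normalizing, so that symbols never seen in the column still get a nonzero emission probability. This is the simplest (Laplace) scheme and is only an assumption for illustration; it is not claimed to be the method hmmprofestimate uses. The alphabet, the column and alpha are toy values.

```matlab
symbols = 'ACDE';       % reduced toy alphabet
column  = 'AAACA';      % residues observed in one alignment column (toy data)

% Count how often each symbol occurs in the column.
counts = arrayfun(@(s) sum(column == s), symbols);

% Add a pseudocount alpha to every symbol, then normalize (assumed scheme).
alpha = 1;
pEmit = (counts + alpha) / (sum(counts) + alpha * numel(symbols));

disp(pEmit);
```

Without the pseudocount, D and E would get probability zero, and any sequence containing them at this position would score as impossible.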


>> load('hmm_model_examples','model_7tm_2')   % load model
>> load('hmm_model_examples','sequences')     % load sequences

>> for ind = 1:length(sequences)
       [scores(ind), sequences(ind).Aligned] = ...
           hmmprofalign(model_7tm_2, sequences(ind).Sequence);
   end

>> hmmprofmerge(sequences, scores)

HMMER: http://hmmer.wustl.edu/

SAM: http://www.cse.ucsc.edu/research/compbio/sam.html

PFTOOLS: http://www.isrec.isb-sib.ch/ftp-server/pftools/

GENEWISE: http://www.ebi.ac.uk/Wise2/

PROBE: ftp://ftp.ncbi.nih.gov/pub/neuwald/probe1.0/

META-MEME: http://metameme.sdsc.edu/

PSI-BLAST: http://www.ncbi.nlm.nih.gov/BLAST/newblast.html


[1] Durbin et al., "Biological Sequence Analysis", Cambridge University Press, 1998

[2] Anders Krogh et al., "Hidden Markov Models in Computational Biology - Applications to protein modeling", 1994

[3] Sean R. Eddy, "Profile Hidden Markov Models", 1998

[4] Sean R. Eddy, "Hidden Markov Models", 1996

[5] http://hmmer.janelia.org/#thanks

[6] E.L.L. Sonnhammer, S.R. Eddy and R. Durbin, "Pfam: a comprehensive database of protein families based on seed alignments", 1997

[7] R.D. Finn et al., "Pfam: clans, web tools and services", 2006

[8] http://www.mathworks.com/