Using the Pfam Database's Profile HMMs in the MATLAB Bioinformatics Toolbox
Presentation by: Athina Ropodi
University of Athens - Information Technology in Medicine and Biology
Introduction
  HMMs
  Profile HMMs
Pfam Database
  General info
  Useful links
  Available Data
Bioinformatics Toolbox
  Function presentation
Other available software
Bibliography
To model sequential data while exploiting the correlation between neighbouring observations, we need a probabilistic model of the joint distribution of the sequence of observations.
A simple way to do this is to assume a Markov chain model. The probability of going from one state to another is called the transition probability.
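As a minimal sketch of the Markov chain idea, the joint probability of a short DNA sequence can be computed from an initial distribution and a transition matrix. The alphabet ordering, the sequence and all probability values below are illustrative assumptions, not values taken from any real model, and the code uses base MATLAB only.

```matlab
% Hypothetical first-order Markov chain over the DNA alphabet.
% All probabilities are illustrative values, not estimated from data.
alphabet = 'ACGT';
initial  = [0.25 0.25 0.25 0.25];     % P(x_1)
trans    = [0.4 0.2 0.2 0.2;          % trans(i,j) = P(x_t = j | x_{t-1} = i)
            0.2 0.4 0.2 0.2;
            0.2 0.2 0.4 0.2;
            0.2 0.2 0.2 0.4];

seq = 'AACGT';
[~, idx] = ismember(seq, alphabet);   % map letters to indices 1..4

% log P(x_1..x_L) = log P(x_1) + sum over t of log P(x_t | x_{t-1})
logp = log(initial(idx(1)));
for t = 2:numel(idx)
    logp = logp + log(trans(idx(t-1), idx(t)));
end
fprintf('log P(%s) = %.4f\n', seq, logp);
```

Working in log space avoids numerical underflow when the sequence is long.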
In Hidden Markov Models (HMMs), given a sequence of symbols X, e.g. nucleotides in a DNA sequence or amino acids in a protein sequence, the emission probability is defined as the probability of observing symbol b when in state k.
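To make the role of the emission probabilities concrete, the sketch below computes the joint probability of a symbol sequence and one assumed hidden state path in a toy two-state HMM. The states, the path and every number are hypothetical and chosen only for illustration.

```matlab
% Hypothetical two-state HMM over DNA; all numbers are illustrative.
alphabet = 'ACGT';
start = [0.5 0.5];              % P(first state = k)
trans = [0.9 0.1;               % trans(k,l) = P(next state l | state k)
         0.1 0.9];
emis  = [0.4 0.1 0.1 0.4;       % emis(k,b) = P(symbol b | state k)
         0.1 0.4 0.4 0.1];

seq  = 'ATGC';
path = [1 1 2 2];               % one assumed hidden state path
[~, sym] = ismember(seq, alphabet);

% P(x, path): start and first emission, then one transition and
% one emission per subsequent position
p = start(path(1)) * emis(path(1), sym(1));
for t = 2:numel(sym)
    p = p * trans(path(t-1), path(t)) * emis(path(t), sym(t));
end
fprintf('P(x, path) = %.7f\n', p);
```

In practice the state path is hidden, so one would sum over all paths (forward algorithm) or search for the best one (Viterbi algorithm) rather than fix it as done here.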
The M match states each emit one of the 20 amino-acid letters, according to P(x | m_i).
For each match state there is a delete state (d_i), in which no amino acid is produced.
There are M+1 insert states in total, one on either side of each match state, which emit amino acids according to P(x | i_i).
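As a small worked example of this architecture, the state counts for a profile of a given length follow directly from the description above. The length M = 10 is an arbitrary assumption, and begin and end states are left out of the count.

```matlab
% State counts for a profile HMM with M match positions.
M = 10;                           % arbitrary example length (assumption)
nMatch  = M;                      % one match state per position
nDelete = M;                      % one silent delete state per match state
nInsert = M + 1;                  % insert states on either side of the match states

nStates   = nMatch + nDelete + nInsert;   % 3M + 1 states in total
nEmitting = nMatch + nInsert;             % delete states emit nothing
nEmissionParams = nEmitting * 20;         % 20 amino-acid probabilities each

fprintf('%d states, %d emitting, %d emission probabilities\n', ...
        nStates, nEmitting, nEmissionParams);
```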
Pfam is a collection of multiple sequence alignments and profile hidden Markov models (HMMs). Each Pfam HMM represents a protein family or domain.
For each Pfam entry there is a family page, which can be accessed in several ways.
Pfam contains two types of families, Pfam-A and Pfam-B. Pfam-A families are manually curated, HMM-based families, which are built from an alignment of a small number of representative sequences.
For each family, two HMMs are built: one to represent fragment matches and one to represent full-length matches. The HMMER2 software is used to build and search the profile HMMs.
Available links:
http://pfam.sanger.ac.uk/
http://hmmer.janelia.org/
Each family has the following data:
A seed alignment, which is a hand-edited multiple alignment representing the family.
Hidden Markov Models (HMMs) derived from the seed alignment, which can be used to find new members of the domain and to realign a set of sequences to the model. One HMM is an ls mode (global) model, the other is an fs mode (local) model.
A full alignment, which is an automatic alignment of all the examples of the domain, using the two HMMs to find and then align the sequences.
Annotation, which contains a brief description of the domain, links to other databases and some Pfam-specific data recording how the family was constructed.
Version 3.1 for MATLAB (R2008a).
Uses the profile HMMs found in Pfam.
A model is usually looked up by the accession number or the name of the family.
Multiple sequence profiles: MATLAB implementations of multiple alignment and profile hidden Markov model algorithms (gethmmprof, gethmmalignment, gethmmtree, pfamhmmread, hmmprofalign, hmmprofestimate, hmmprofgenerate, hmmprofmerge, hmmprofstruct, showhmmprof).
>> HMMStruct = gethmmprof(2)
                   Name: '7tm_2'
    PfamAccessionNumber: 'PF00002.14'
       ModelDescription: [1x42 char]
            ModelLength: 296
               Alphabet: 'AA'
          MatchEmission: [296x20 double]
         InsertEmission: [296x20 double]
           NullEmission: [1x20 double]
                 BeginX: [297x1 double]
                 MatchX: [295x4 double]
                InsertX: [295x2 double]
                DeleteX: [295x2 double]
        FlankingInsertX: [2x2 double]
                  LoopX: [2x2 double]
                  NullX: [2x1 double]
ModelLength: number of match states.
MatchEmission: symbol emission probabilities in the MATCH states.
NullEmission: symbol emission probabilities in the MATCH and INSERT states for the NULL model.
>> site = 'http://pfam.sanger.ac.uk/';
>> hmm = pfamhmmread([site 'family/gethmm?mode=ls&id=7tm_2']);
or
>> pfamhmmread('pf00002.ls');
>> model = pfamhmmread('pf00002.ls');
>> showhmmprof(model, 'Scale', 'logodds');
>> hydrophobic = 'IVLFCMAGTSWYPHNDQEKR';
>> showhmmprof(model, 'Order', hydrophobic);
Choices for the 'Scale' value are:
'logprob': Log probabilities
'prob': Probabilities
'logodds': Log-odds ratios
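The relation between these three scales can be sketched for a single match state. The toy four-letter alphabet, the probability values and the choice of base-2 logarithms are assumptions made for illustration; the base used internally by showhmmprof is not stated here.

```matlab
% Illustrative numbers only: one match state over a toy 4-letter alphabet.
matchProb = [0.70 0.10 0.10 0.10];      % 'prob': P(symbol | match state)
nullProb  = [0.25 0.25 0.25 0.25];      % background (null model) probabilities

logProb = log2(matchProb);              % 'logprob': log probabilities
logOdds = log2(matchProb ./ nullProb);  % 'logodds': log of emission/background ratio

disp([matchProb; logProb; logOdds]);
```

A positive log-odds value marks a symbol that is more likely in this state than under the null model, and a negative value marks one that is less likely.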
Choices for TypeValue are:
'seed': Returns a tree with only the alignments used to generate the HMM model.
'full' (default): Returns a tree with all of the alignments that match the model.
>> tree = gethmmtree(2, 'type', 'seed');
and
>> tr = phytreeread('pf00002.tree');
gethmmalignment: retrieve the multiple sequence alignment associated with an HMM profile from the Pfam database
hmmprofalign: align a query sequence to a profile using hidden Markov model alignment
>> load('hmm_model_examples','model_7tm_2');   % load model
>> load('hmm_model_examples','sequences');     % load sequences
>> SCCR_RABIT = sequences(2).Sequence;
>> [a,s] = hmmprofalign(model_7tm_2,SCCR_RABIT,'showscore',true)
a =
  514.7448
s =
LLKLKVMYTVGYSSS-LVMLLVALGILCAFRRLHCTRNYIHMHLFLSFILRALSNFI
KDAVLFSSDdaihcdahrvgCKLVMVFFQYCIMANYAWLLV
EGLYLHSLLVVS-FFSERKCLQGFVVLGWGSPAMFVTSWAVTR-HFLEDSGC-WDIN-
ANAAIWWVIRGPVILSILINFILFINILRILTRKLR-
TQETRGQDMNHYKRLARSTLLLIPLFGVHYIVFVFSPEG-
AMEIQLFFELALGSFQGLVVAVLYCFLNGEV
hmmprofestimate: estimate profile hidden Markov model (HMM) parameters using pseudocounts
hmmprofgenerate: generate a random sequence drawn from a profile hidden Markov model (HMM)
hmmprofmerge: concatenate prealigned strings of several sequences to a profile hidden Markov model (HMM)
>> load('hmm_model_examples','model_7tm_2')   % load model
>> load('hmm_model_examples','sequences')     % load sequences
>> for ind = 1:length(sequences)
       [scores(ind), sequences(ind).Aligned] = ...
           hmmprofalign(model_7tm_2, sequences(ind).Sequence);
   end
>> hmmprofmerge(sequences, scores)
HMMER: http://hmmer.wustl.edu/
SAM: http://www.cse.ucsc.edu/research/compbio/sam.html
PFTOOLS: http://www.isrec.isb-sib.ch/ftp-server/pftools/
GENEWISE: http://www.ebi.ac.uk/Wise2/
PROBE: ftp://ftp.ncbi.nih.gov/pub/neuwald/probe1.0/
META-MEME: http://metameme.sdsc.edu/
PSI-BLAST: http://www.ncbi.nlm.nih.gov/BLAST/newblast.html
[1] Durbin et al., "Biological Sequence Analysis", Cambridge University Press, 1998
[2] Anders Krogh et al., "Hidden Markov Models in Computational Biology: Applications to protein modeling", 1994
[3] Sean R. Eddy, "Profile Hidden Markov Models", 1998
[4] Sean R. Eddy, "Hidden Markov Models", 1996
[5] http://hmmer.janelia.org/#thanks
[6] E.L.L. Sonnhammer, S.R. Eddy and R. Durbin, "Pfam: a comprehensive database of protein families based on seed alignments", 1997
[7] R.D. Finn et al., "Pfam: clans, web tools and services", 2006
[8] http://www.mathworks.com/