# A Hidden Markov Model-based approach to sequential data clustering


Antonello Panuccio, Manuele Bicego, and Vittorio Murino
University of Verona, Italy
{panuccio|bicego|murino}@sci.univr.it
## Objectives

Clustering of sequential or temporal data with a Hidden Markov Model (HMM)-based technique.

Main aspects:

- use of HMMs to derive new proximity distances, in the likelihood sense, between sequences;
- a partitional clustering algorithm which alleviates the computational burden of hierarchical agglomerative approaches.

The method is demonstrated on real-world data sequences, namely EEG signals, chosen for:

- their temporal complexity;
- the growing interest in the emerging field of Brain Computer Interfaces.
## Hidden Markov Models

A discrete HMM is formally defined by the following elements:

- A set $S = \{S_1, S_2, \dots, S_N\}$ of (hidden) states.
- A state transition probability distribution, also called transition matrix $A = \{a_{ij}\}$, representing the probability of going from state $S_i$ to state $S_j$:
  $$a_{ij} = P[q_{t+1} = S_j \mid q_t = S_i], \quad 1 \le i,j \le N, \qquad a_{ij} \ge 0, \quad \sum_{j=1}^{N} a_{ij} = 1.$$
- A set $V = \{v_1, v_2, \dots, v_M\}$ of observation symbols.
- An observation symbol probability distribution, also called emission matrix $B = \{b_j(k)\}$, indicating the probability of emitting symbol $v_k$ when the system state is $S_j$:
  $$b_j(k) = P[v_k \text{ at time } t \mid q_t = S_j], \quad 1 \le j \le N, \; 1 \le k \le M,$$
  with $b_j(k) \ge 0$ and $\sum_{k=1}^{M} b_j(k) = 1$.
- An initial state probability distribution $\pi = \{\pi_i\}$, representing the probabilities of the initial states:
  $$\pi_i = P[q_1 = S_i], \quad 1 \le i \le N, \qquad \pi_i \ge 0, \quad \sum_{i=1}^{N} \pi_i = 1.$$

For convenience, we denote an HMM as a triplet $\lambda = (A, B, \pi)$.
[Figure: a four-state discrete HMM. Arrows between the states $S_1, \dots, S_4$ carry the transition probabilities $a_{12}, a_{21}, a_{23}, a_{34}, a_{42}$; each state has a table of emission probabilities $b_j(k)$ over the symbols $v_1, \dots, v_4$, and the states carry initial probabilities $\pi_1, \dots, \pi_4$.]
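To make the triplet $\lambda = (A, B, \pi)$ concrete, here is a minimal sketch with a toy two-state, two-symbol model (all numbers invented) together with the scaled forward recursion for the sequence likelihood $P(O \mid \lambda)$, which the clustering method below relies on:

```python
import numpy as np

# Toy discrete HMM lambda = (A, B, pi); every number here is illustrative.
A = np.array([[0.9, 0.1],     # a_ij = P[q_{t+1} = S_j | q_t = S_i]
              [0.2, 0.8]])
B = np.array([[0.7, 0.3],     # b_j(k) = P[v_k | q_t = S_j], one row per state
              [0.1, 0.9]])
pi = np.array([0.5, 0.5])     # initial state distribution

def forward_loglik(obs, A, B, pi):
    """log P(O | lambda) via the scaled forward algorithm."""
    alpha = pi * B[:, obs[0]]
    logp = np.log(alpha.sum())
    alpha = alpha / alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # propagate, then weight by emission
        s = alpha.sum()
        logp += np.log(s)               # accumulate log of the scaling factor
        alpha = alpha / s
    return logp
```

For a single observation $O = (v_1)$ this reduces to $\log(\pi \cdot B[:, 0]) = \log 0.4$ with the numbers above.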
## Hidden Markov AR Models

A very interesting class of HMMs, which seems particularly suitable for EEG signals, is that of autoregressive HMMs. In this case the observation vectors $\{y_1, \dots, y_T\}$ are drawn from an autoregressive process, and $B$ is thus defined as

$$b_i(y_t) = P(y_t \mid q_t = S_i) = \mathcal{N}(y_t - F_t \hat{a}_i,\ \sigma_i^2) \tag{1}$$

where

- $F_t = [y_{t-1}, y_{t-2}, \dots, y_{t-p}]$;
- $\hat{a}_i$ is the (column) vector of AR coefficients for the $i$-th state;
- $\sigma_i^2$ is the observation noise for the $i$-th state, estimated using Jazwinski's method.

The prediction for the $i$-th state is $\hat{y}_t^i = F_t \hat{a}_i$; the order of the AR model is $p$.
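Equation (1) can be sketched as follows; the function name and interface are ours, and in the full system `a_hat` and `sigma2` would come from the initialization procedure described next:

```python
import numpy as np

def ar_emission_loglik(y, a_hat, sigma2):
    """Per-sample log b_i(y_t) from Eq. (1): the AR prediction residual
    y_t - F_t @ a_hat is scored under a zero-mean Gaussian with variance
    sigma2.  y is a 1-D signal, a_hat = [a_1, ..., a_p]."""
    p = len(a_hat)
    out = []
    for t in range(p, len(y)):
        F_t = y[t - p:t][::-1]            # [y_{t-1}, y_{t-2}, ..., y_{t-p}]
        resid = y[t] - F_t @ a_hat        # prediction error for this state
        out.append(-0.5 * np.log(2 * np.pi * sigma2)
                   - resid**2 / (2 * sigma2))
    return np.array(out)
```

For a signal generated exactly by $y_t = 0.5\, y_{t-1}$, the residual is zero and every sample scores the Gaussian maximum $-\tfrac12 \ln(2\pi\sigma^2)$.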
## HMM for EEG signal modeling

EEGs are a useful tool for understanding several aspects of the brain, from disease detection to sleep analysis and evoked potential analysis.

The system used to model the EEG signal is based on the [Penny and Roberts 1998] paper:

- an autoregressive HMM is trained directly on the EEG signal, rather than using an intermediate AR representation;
- a Kalman filter approach is used to preliminarily segment the signal into its different dynamic regimes;
- each HMM state is then associated with a different dynamic regime.
## Initialization of the training procedure

Problem: the HMM training procedure (the Baum-Welch re-estimation procedure) is sensitive to the initial parameter estimates.

Solution:

Initialization of the emission matrix B:

- pass a Kalman filter AR model over the data, obtaining a sequence of AR coefficients;
- coefficients corresponding to low evidence are discarded;
- the remaining coefficients are clustered with Gaussian Mixture Models;
- the center of each Gaussian cluster is then used to initialize the AR coefficients of each state of the HMM-AR model.

Initialization of the transition matrix A:

- prior knowledge from the problem domain about average state duration densities is used to initialize the matrix;
- each diagonal element is set to $a_{ii} = 1 - \frac{1}{d}$, so that the HMM remains in state $i$ for about $d$ samples;
- $d$ is computed knowing that EEG data is stationary for a period on the order of half a second.
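The transition-matrix initialization above can be sketched as follows. The 250 Hz sampling rate is taken from the preprocessing section; spreading the remaining off-diagonal mass uniformly is our assumption, not stated in the text:

```python
import numpy as np

def init_transition_matrix(n_states, fs=250.0, stationary_s=0.5):
    """Set a_ii = 1 - 1/d so the HMM stays in each state for about d
    samples, with d derived from the half-second EEG stationarity."""
    d = fs * stationary_s                    # expected state duration in samples
    a_ii = 1.0 - 1.0 / d                     # diagonal (self-transition) entry
    off = (1.0 - a_ii) / (n_states - 1)      # remaining mass, spread uniformly
    A = np.full((n_states, n_states), off)
    np.fill_diagonal(A, a_ii)
    return A
```

With fs = 250 Hz and half-second stationarity, $d = 125$ and $a_{ii} = 1 - 1/125 = 0.992$.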
## The proposed method for sequence clustering

The proposed approach, inspired by [Smyth 95], can be summarized by the following algorithm:

1. Train an $m$-state HMM for each sequence $S_i$ ($1 \le i \le N$) of the dataset $D$, initializing the training as explained above. The obtained $N$ HMMs are identified by $\lambda_i$ ($1 \le i \le N$).
2. For each model $\lambda_i$, evaluate its probability of generating all the sequences $S_j$, $1 \le j \le N$, obtaining a likelihood matrix $L$ where
   $$L_{ij} = P(S_j \mid \lambda_i), \quad 1 \le i,j \le N \tag{2}$$
3. Derive a sequence distance matrix from (2).
4. Apply a suitable clustering algorithm to the distance matrix, obtaining $K$ clusters on the dataset $D$.

Remarks:

- This method exploits the measure defined by (2), which naturally expresses the similarity between two observation sequences.
- Hidden Markov Models are able to model the similarity between sequences, reducing the difficult task of clustering sequences to a standard distance-based clustering problem.
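Steps 1 and 2 can be sketched as follows; `train_hmm` and `loglik` are placeholders for any HMM library's fit/score routines (in practice log-likelihoods are used for numerical stability):

```python
import numpy as np

def likelihood_matrix(sequences, train_hmm, loglik):
    """Step 1: train one HMM per sequence.
    Step 2: fill L[i, j] = P(S_j | lambda_i), as in Eq. (2), for all pairs."""
    models = [train_hmm(s) for s in sequences]   # one model per sequence
    N = len(sequences)
    L = np.empty((N, N))
    for i, lam in enumerate(models):
        for j, s in enumerate(sequences):
            L[i, j] = loglik(lam, s)             # score sequence j under model i
    return L
```

The diagonal $L_{ii}$ holds the "training likelihoods" that the BP distance below normalizes by.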
## Derivation of the distance

Three different distances derived from (2) were investigated:

1. Distance "SM", obtained by simply symmetrizing (2):
   $$L^{S}_{ij} = \frac{1}{2}\left[ L_{ij} + L_{ji} \right] \tag{3}$$
2. Distance "KL", similar to the Kullback-Leibler information number:
   $$L^{KL}_{ij} = L_{ii} \ln\frac{L_{ii}}{L_{ji}} + L_{ij} \ln\frac{L_{ij}}{L_{jj}} \tag{4}$$
3. Distance "BP", proposed in this paper:
   $$L^{BP}_{ij} = \frac{1}{2} \left( \frac{L_{ij} - L_{ii}}{L_{ii}} + \frac{L_{ji} - L_{jj}}{L_{jj}} \right) \tag{5}$$
Motivations for distance "BP":

- The measure (2) defines the similarity between two sequences $S_i$ and $S_j$ as the likelihood of the sequence $S_i$ with respect to the model $\lambda_j$, trained on $S_j$, without really taking the sequence $S_j$ itself into account.
- This kind of measure assumes that all sequences are modelled with the same quality, without considering how well the sequence $S_j$ is modelled by the HMM $\lambda_j$: this may not always be true.
- Our proposed distance also takes the modelling goodness into account, by evaluating the relative normalized difference between the sequence and the training likelihoods.
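Given a likelihood matrix $L$ as in Eq. (2), the three measures can be computed like this. This is a sketch: it assumes the entries of $L$ are positive likelihood values, so that the logarithms in (4) are defined:

```python
import numpy as np

def sm_distance(L):
    """Eq. (3): simple symmetrization of the likelihood matrix."""
    return 0.5 * (L + L.T)

def kl_distance(L):
    """Eq. (4): Kullback-Leibler-like measure built from the
    self-likelihoods L_ii and the cross-likelihoods."""
    d = np.diag(L)
    return d[:, None] * np.log(d[:, None] / L.T) + L * np.log(L / d[None, :])

def bp_distance(L):
    """Eq. (5): relative normalized difference between cross- and
    training (self-) likelihoods, symmetrized over i and j."""
    d = np.diag(L)
    t1 = (L - d[:, None]) / d[:, None]     # (L_ij - L_ii) / L_ii
    t2 = (L.T - d[None, :]) / d[None, :]   # (L_ji - L_jj) / L_jj
    return 0.5 * (t1 + t2)
```

By construction the diagonals of (4) and (5) are zero: a sequence is at distance zero from itself.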
## Clustering algorithm

For step 4 we introduce a new partitional clustering algorithm, called DPAM:

- this method obtains a single partition of the data;
- we compare it with hierarchical clustering, a standard class of algorithms that, instead of a single partition, produces a sequence of clusterings with a decreasing number of clusters at each step;
- partitional methods have advantages in applications involving large data sets, for which the construction of a dendrogram is computationally prohibitive.
## DPAM partitional clustering

Standard partitional clustering schemes, such as K-means, work as follows:

- at each iteration, evaluate the distance between each item and each cluster descriptor;
- assign the item to the nearest cluster;
- after re-assignment, re-evaluate the descriptor of each cluster by averaging its items.

A simple variation of the method, called "Partitioning Around Medoids (PAM)", determines each cluster representative by choosing the point nearest to the centroid.

Problem: in our context we cannot evaluate the centroid of each cluster, because we only have the distances between items, not their values.
Proposed approach, features:

- a novel partitional method is proposed, able to determine cluster descriptors in a PAM paradigm using item distances instead of their values;
- moreover, since the choice of the initial descriptors can affect the algorithm's performance, our approach adopts a multiple-initialization procedure, in which the best resulting partition is selected by a Davies-Bouldin-like criterion.
## The Algorithm

Let $\rho$ be the number of tested initializations, $N$ the number of sequences, $k$ the number of clusters, and $L$ the proximity matrix given by one of the previously defined distances (3), (4), and (5). The resulting algorithm is the following:

- for $t = 1$ to $\rho$:
  - Initial cluster representatives $\mu_j$ are randomly chosen ($j = 1, \dots, k$; $\mu_j \in \{1, \dots, N\}$).
  - Repeat:
    - *Partition evaluation step*: compute the cluster to which each sequence $S_i$, $i = 1, \dots, N$, belongs: $S_i$ lies in the cluster $j$ for which the distance $L(S_i, \mu_j)$, $j = 1, \dots, k$, is minimum.
    - *Descriptor update step*: for each cluster $C_j$,
      - compute the sum of the distances of each element of $C_j$ from every other element of the $j$-th cluster;
      - determine the element of $C_j$ for which this sum is minimal;
      - use that element as the new descriptor of cluster $C_j$.
  - Until the representatives $\mu_j$ do not change between two successive iterations.
  - $R_t = \{C_1, C_2, \dots, C_k\}$
  - Compute the Davies–Bouldin-like index defined as
    $$DBL^{(t)} = \frac{1}{k} \sum_{r=1}^{k} \max_{s \neq r} \left\{ \frac{S^L_c(C_r, \mu_r) + S^L_c(C_s, \mu_s)}{L(\mu_r, \mu_s)} \right\}$$
    where $S^L_c$ is an intra-cluster measure defined by
    $$S^L_c(C_r, \mu_r) = \frac{\sum_{i \in C_r} L(i, \mu_r)}{|C_r|}$$
- endfor

Final solution: the best clustering $R_p$ is the one with the minimum Davies–Bouldin-like index, viz.

$$p = \arg\min_{t = 1, \dots, \rho} \left\{ DBL^{(t)} \right\}$$
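The whole loop can be sketched on a precomputed distance matrix as follows. This is our reading of the scheme, not the authors' code: the names are ours, `n_init` plays the role of the number of tested initializations, and the empty-cluster handling is our choice:

```python
import numpy as np

def dpam(D, k, n_init=5, seed=0):
    """Medoid-based partitional clustering driven only by a distance
    matrix D (N x N), with the best of n_init runs selected by the
    Davies-Bouldin-like index described above."""
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    best = None
    for _ in range(n_init):
        medoids = rng.choice(N, size=k, replace=False)
        while True:
            # partition evaluation: each item goes to its nearest medoid
            labels = np.argmin(D[:, medoids], axis=1)
            new = medoids.copy()
            for j in range(k):
                members = np.flatnonzero(labels == j)
                if members.size:
                    # new descriptor: member minimizing the within-cluster sum
                    within = D[np.ix_(members, members)].sum(axis=1)
                    new[j] = members[np.argmin(within)]
            if np.array_equal(new, medoids):   # representatives unchanged
                break
            medoids = new
        # Davies-Bouldin-like index of this run's partition
        intra = np.array([D[labels == j, medoids[j]].mean()
                          if (labels == j).any() else 0.0 for j in range(k)])
        dbl = 0.0
        for r in range(k):
            ratios = [(intra[r] + intra[s]) / D[medoids[r], medoids[s]]
                      for s in range(k)
                      if s != r and D[medoids[r], medoids[s]] > 0]
            dbl += max(ratios) if ratios else 0.0
        dbl /= k
        if best is None or dbl < best[0]:
            best = (dbl, labels.copy(), medoids.copy())
    return best[1], best[2]
```

On a toy 1-D example with two well-separated groups, the DBL criterion selects the run that recovers the natural two-cluster partition.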
## Results

### Data set

EEG data recorded by Zak Keirn at Purdue University; the dataset contains EEG signals recorded from different subjects performing five mental tasks:

- baseline, for which the subjects were asked to relax as much as possible;
- math task, for which the subjects were given nontrivial multiplication problems, such as 27*36, and were asked to solve them without vocalizing or making any other physical movements;
- letter task, for which the subjects were instructed to mentally compose a letter to a friend without vocalizing;
- geometric figure rotation, for which the subjects were asked to visualize a particular 3D block figure being rotated about an axis;
- visual counting, for which the subjects were asked to imagine a blackboard and to visualize numbers being written on the board sequentially.

### Preprocessing and segmentation

We applied the method on a segment-by-segment basis, using 1 s signals sampled at 250 Hz, drawn from a dataset whose cardinality varies from 190 sequences (two mental states) to 473 sequences (five mental states); segments biased by signal spikes arising from human artifacts were removed.
### Experimental results

- The proposed HMM clustering algorithm was first applied to two mental states (baseline and math task); the trials were then extended to all the available data;
- accuracies are computed by comparing the clustering results with the true segment labels; the percentage is simply the ratio of correctly assigned labels to the total number of segments;
- first, we applied the hierarchical complete-link technique, varying the proximity measure: results are shown in the following table, with the number of mental states growing from two to five.
|                    | BP     | KL     | SM     |
|--------------------|--------|--------|--------|
| 2 natural clusters | 97.37% | 97.89% | 97.37% |
| 3 natural clusters | 71.23% | 79.30% | 81.40% |
| 4 natural clusters | 62.63% | 57.36% | 65.81% |
| 5 natural clusters | 46.74% | 54.10% | 49.69% |

- Accuracies are quite satisfactory; none of the experimented methods can be considered the best one;
- nevertheless, measures (3) and (4) seem to be more effective.
- We then applied the partitional algorithm to the same datasets, setting the number of initializations $\rho = 5$ in all the experiments. Results are presented in the following table:
Partitional DPAM Clustering

|                    | BP     | KL     | SM     |
|--------------------|--------|--------|--------|
| 2 natural clusters | 95.79% | 96.32% | 95.79% |
| 3 natural clusters | 75.44% | 72.98% | 65.61% |
| 4 natural clusters | 64.21% | 62.04% | 50.52% |
| 5 natural clusters | 57.04% | 46.74% | 44.80% |

- In this last case the BP distance is overall slightly better than the other experimented measures.
- A final comparison of the partitional and agglomerative hierarchical algorithms shows that there are no remarkable differences between the proposed approaches. Clearly, partitional approaches alleviate the computational burden, so they should be preferred when dealing with the clustering of complex signals (e.g. EEG).
- A comparison of the clustering results with the classification results obtained in earlier works shows that the latter are only slightly better. This strengthens the quality of the proposed method, considering that unsupervised classification is an inherently more difficult task.
## Conclusions

- We addressed the problem of unsupervised classification of sequences using an HMM-based approach.
- These models, very well suited to modelling sequential data, are used to characterize the similarity between sequences in different ways.
- We extend Smyth's ideas by defining a new metric, in the likelihood sense, between data sequences, and by applying two clustering schemes, a hierarchical agglomerative one and the proposed partitional one, to the resulting distance matrices.