A Hidden Markov Model-based approach to sequential data clustering

Antonello Panuccio, Manuele Bicego, and Vittorio Murino
University of Verona, Italy†
{panuccio|bicego|murino}@sci.univr.it

Objectives

Clustering of sequential or temporal data with a Hidden Markov Model (HMM)-based technique.

Main aspects:

- use of HMMs to derive new proximity distances, in the likelihood sense, between sequences;

- a partitional clustering algorithm which alleviates the computational burden characterizing traditional hierarchical agglomerative approaches.

The method is demonstrated on real-world data sequences, namely EEG signals:

- temporal complexity;

- growing interest in the emerging field of Brain-Computer Interfaces.

Hidden Markov Models

A discrete HMM is formally defined by the following elements:

- A set S = {S_1, S_2, ..., S_N} of (hidden) states.

- A state transition probability distribution, also called transition matrix, A = {a_ij}, representing the probability of going from state S_i to state S_j:

    a_ij = P[q_{t+1} = S_j | q_t = S_i],   1 ≤ i,j ≤ N,

  with a_ij ≥ 0 and Σ_{j=1}^{N} a_ij = 1.

- A set V = {v_1, v_2, ..., v_M} of observation symbols.

- An observation symbol probability distribution, also called emission matrix, B = {b_j(k)}, indicating the probability of emitting symbol v_k when the system state is S_j:

    b_j(k) = P[v_k at time t | q_t = S_j],   1 ≤ j ≤ N, 1 ≤ k ≤ M,

  with b_j(k) ≥ 0 and Σ_{k=1}^{M} b_j(k) = 1.

- An initial state probability distribution π = {π_i}, representing the probabilities of the initial states:

    π_i = P[q_1 = S_i],   1 ≤ i ≤ N,   π_i ≥ 0,   Σ_{i=1}^{N} π_i = 1.

For convenience, we denote an HMM as a triplet λ = (A, B, π).
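As a concrete illustration, the triplet λ = (A, B, π) can be written down directly; the following sketch uses arbitrary toy numbers (N = 2 states, M = 3 symbols) that are not taken from the poster:

```python
import numpy as np

# Toy discrete HMM lambda = (A, B, pi): N = 2 hidden states, M = 3 symbols.
# All numbers are illustrative, not from the poster.
A = np.array([[0.9, 0.1],        # a_ij = P[q_{t+1} = S_j | q_t = S_i]
              [0.2, 0.8]])
B = np.array([[0.7, 0.2, 0.1],   # b_j(k) = P[symbol v_k | q_t = S_j]
              [0.1, 0.3, 0.6]])
pi = np.array([0.5, 0.5])        # pi_i = P[q_1 = S_i]

# Each distribution must be row-stochastic, as the definitions require.
assert np.allclose(A.sum(axis=1), 1) and np.allclose(B.sum(axis=1), 1)
assert np.isclose(pi.sum(), 1)

def sample(A, B, pi, T, rng):
    """Draw an observation sequence of length T from the HMM."""
    q = rng.choice(len(pi), p=pi)                   # q_1 ~ pi
    obs = []
    for _ in range(T):
        obs.append(rng.choice(B.shape[1], p=B[q]))  # emit v_k ~ b_q(.)
        q = rng.choice(A.shape[1], p=A[q])          # q_{t+1} ~ a_q.
    return obs

seq = sample(A, B, pi, T=10, rng=np.random.default_rng(0))
```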

[Figure: a four-state HMM. The states S_1, ..., S_4 are linked by the transitions a_12, a_21, a_23, a_34, and a_42; each state S_j emits the symbols v_1, ..., v_4 with probabilities b_j1, ..., b_j4 ("Prob. Symb." tables).]

Hidden Markov AR Models

A very interesting class of HMMs, which seems particularly suitable for EEG signals, is that of autoregressive HMMs. In this case the observations {y_1, ..., y_T} are drawn from an autoregressive process, and thus B is defined as

    b_i(y_t) = P(y_t | q_t = S_i) = N(y_t - F_t â_i; σ²_i)   (1)

where

- F_t = [y_{t-1}, y_{t-2}, ..., y_{t-p}];

- â_i is the (column) vector of AR coefficients for the i-th state;

- σ²_i is the observation noise for the i-th state, estimated using the Jazwinski method.

The prediction for the i-th state is ŷ^i_t = F_t â_i, and p is the order of the AR model.
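Equation (1) is straightforward to evaluate once the per-state parameters are known; the sketch below (our illustration for scalar signals, with made-up AR(2) coefficients) computes the log of b_i(y_t) over a sequence:

```python
import numpy as np

def ar_emission_loglik(y, a_hat, sigma2):
    """log b_i(y_t) = log N(y_t - F_t a_i; sigma_i^2) for a scalar AR(p) state,
    with F_t = [y_{t-1}, ..., y_{t-p}], evaluated for every t >= p."""
    p = len(a_hat)
    ll = np.empty(len(y) - p)
    for t in range(p, len(y)):
        F_t = y[t - 1::-1][:p]          # [y_{t-1}, ..., y_{t-p}]
        resid = y[t] - F_t @ a_hat      # prediction error y_t - F_t a_i
        ll[t - p] = -0.5 * (np.log(2 * np.pi * sigma2) + resid ** 2 / sigma2)
    return ll

# Simulate an AR(2) process; its own coefficients give the largest likelihood.
rng = np.random.default_rng(1)
a_true = np.array([0.6, -0.3])
y = np.zeros(200)
for t in range(2, 200):
    y[t] = a_true @ np.array([y[t - 1], y[t - 2]]) + 0.1 * rng.standard_normal()

ll_true = ar_emission_loglik(y, a_true, sigma2=0.01)
ll_bad = ar_emission_loglik(y, np.zeros(2), sigma2=0.01)
```

As expected, the matched coefficients score a higher average log-likelihood than mismatched ones.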

HMM for EEG signal modeling

EEGs are a useful tool for understanding several aspects of the brain, from disease detection to sleep analysis and evoked potential analysis.

The system used to model the EEG signal is based on the [Penny and Roberts 1998] paper:

- an autoregressive HMM is trained directly on the EEG signal, rather than on an intermediate AR representation;

- a Kalman filter approach is used to preliminarily segment the signal into its different dynamic regimes;

- each HMM state is assigned to a different dynamic regime.

Initialization of the training procedure

Problem: the HMM training procedure (Baum-Welch re-estimation) is sensitive to the initial parameter estimates.

Solution:

Initialization of the emission matrix B:

- pass a Kalman filter AR model over the data, obtaining a sequence of AR coefficients;

- coefficients corresponding to low evidence are discarded;

- the remaining coefficients are clustered with Gaussian Mixture Models;

- the center of each Gaussian cluster is then used to initialize the AR coefficients in each state of the HMM-AR model.

Initialization of the transition matrix A:

Prior knowledge from the problem domain about the average state duration densities is used to initialize the matrix:

- each diagonal element is set to a_ii = 1 - 1/d, to let the HMM remain in state i for d samples;

- d is computed knowing that EEG data is stationary for a period on the order of half a second.
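The diagonal rule above is a one-liner; a possible sketch follows (the uniform spreading of the off-diagonal mass is our assumption, since only the diagonal is fixed here):

```python
import numpy as np

def init_transition_matrix(n_states, d):
    """Initialize A with a_ii = 1 - 1/d, so the expected stay in state i is
    about d samples; the remaining 1/d of probability mass is spread
    uniformly over the other states (our assumption)."""
    A = np.full((n_states, n_states), (1.0 / d) / (n_states - 1))
    np.fill_diagonal(A, 1.0 - 1.0 / d)
    return A

# EEG is roughly stationary for half a second; at 250 Hz that gives d = 125.
A0 = init_transition_matrix(n_states=4, d=125)
```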

The proposed method for sequence clustering

The proposed approach, inspired by [Smyth 95], can be described by the following algorithm:

1. Train an m-state HMM for each sequence S_i (1 ≤ i ≤ N) of the dataset D, initializing the training as explained above. The N obtained HMMs are identified by λ_i (1 ≤ i ≤ N).

2. For each model λ_i, evaluate its probability of generating all the sequences S_j, 1 ≤ j ≤ N, obtaining a likelihood matrix L where

     L_ij = P(S_j | λ_i),   1 ≤ i,j ≤ N   (2)

3. Derive a sequence distance matrix from (2).

4. Apply a suitable clustering algorithm to the distance matrix, obtaining K clusters on the dataset D.

Remarks:

- This method exploits the measure defined by (2), which naturally expresses the similarity between two observation sequences.

- Hidden Markov Models are able to model the similarity between sequences, reducing the difficult task of clustering sequences to a standard clustering problem.
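Step 2 can be sketched as follows for discrete HMMs, assuming each λ_i has already been trained (e.g. by Baum-Welch); the scaled forward algorithm returns log-likelihoods, the numerically safe form of (2):

```python
import numpy as np

def forward_loglik(obs, A, B, pi):
    """log P(obs | lambda) for a discrete HMM, via the scaled forward algorithm."""
    alpha = pi * B[:, obs[0]]
    ll = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]   # one forward recursion step
        s = alpha.sum()
        ll += np.log(s)
        alpha /= s                      # rescale to avoid underflow
    return ll

def likelihood_matrix(models, sequences):
    """L_ij = log P(S_j | lambda_i), the log version of equation (2)."""
    N = len(sequences)
    L = np.empty((N, N))
    for i, (A, B, pi) in enumerate(models):
        for j, s in enumerate(sequences):
            L[i, j] = forward_loglik(s, A, B, pi)
    return L

# Two toy single-state models, each matched to one of two toy sequences.
m0 = (np.array([[1.0]]), np.array([[0.9, 0.1]]), np.array([1.0]))
m1 = (np.array([[1.0]]), np.array([[0.1, 0.9]]), np.array([1.0]))
L = likelihood_matrix([m0, m1], [[0] * 10, [1] * 10])
```

Each model scores its "own" sequence higher than the other model does, which is exactly the structure the distances in the next section exploit.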

Derivation of Distance

Three different distances derived from (2) were investigated:

1. Distance "SM", obtained by simply symmetrizing (2):

     L^S_ij = (1/2) [L_ij + L_ji]   (3)

2. Distance "KL", similar to the Kullback-Leibler information number:

     L^KL_ij = L_ii ln(L_ii / L_ji) + L_ij ln(L_ij / L_jj)   (4)

3. Distance "BP", proposed in this paper:

     L^BP_ij = (1/2) { (L_ij - L_ii)/L_ii + (L_ji - L_jj)/L_jj }   (5)

Motivations for distance "BP":

- The measure (2) defines the similarity between two sequences S_i and S_j as the likelihood of the sequence S_i with respect to the model λ_j, trained on S_j, without really taking the sequence S_j into account.

- This kind of measure assumes that all sequences are modelled with the same quality, without considering how well sequence S_j is modelled by the HMM λ_j: this may not always be true.

- Our proposed distance also takes the modelling goodness into account, by evaluating the relative normalized difference between the sequence and the training likelihoods.
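Given the matrix of equation (2), the three distances (3)-(5) reduce to simple array expressions; the sketch below assumes L stores log-likelihoods (hence negative entries, which keep the ratios inside the logarithms of (4) positive):

```python
import numpy as np

def sm_distance(L):
    """Equation (3): L^S_ij = (L_ij + L_ji) / 2."""
    return 0.5 * (L + L.T)

def kl_distance(L):
    """Equation (4): L^KL_ij = L_ii ln(L_ii/L_ji) + L_ij ln(L_ij/L_jj)."""
    d = np.diag(L)
    return d[:, None] * np.log(d[:, None] / L.T) + L * np.log(L / d[None, :])

def bp_distance(L):
    """Equation (5): L^BP_ij = ((L_ij - L_ii)/L_ii + (L_ji - L_jj)/L_jj) / 2."""
    d = np.diag(L)
    return 0.5 * ((L - d[:, None]) / d[:, None] + (L.T - d[None, :]) / d[None, :])

# A small log-likelihood matrix: L[i, j] = log P(S_j | lambda_i).
L = np.array([[-1.0, -2.0],
              [-3.0, -1.5]])
```

Note that (4) and (5) vanish on the diagonal, as a self-distance should.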

Clustering Algorithm

For step 4 we introduce a new partitional clustering algorithm, called DPAM:

- this method obtains a single partition of the data;

- we compare it with Complete Link Agglomerative Hierarchical Clustering, a standard class of algorithms that, instead of a single partition, produces a sequence of clusterings with a decreasing number of clusters at each step;

- partitional methods have advantages in applications involving large data sets, for which the construction of a dendrogram is computationally prohibitive.

† Address for correspondence: Department of Informatics, University of Verona, Strada Le Grazie n. 15, I-37134 Verona, Italy


DPAM Partitional Clustering

Standard partitional clustering schemes, such as K-means, work as follows:

- at each iteration, evaluate the distance between each item and each cluster descriptor;

- assign the item to the nearest cluster;

- after the re-assignment, the descriptor of each cluster is re-evaluated by averaging its cluster items.

A simple variation of the method, called "Partitioning Around Medoids (PAM)", determines each cluster representative by choosing the point nearest to the centroid.

Problem: in our context we cannot evaluate the centroid of each cluster, because we only have the distances between items, not their values.

Proposed approach: features

- a novel partitional method is proposed, able to determine cluster descriptors in a PAM paradigm using item distances instead of their values;

- moreover, since the choice of the initial descriptors can affect the algorithm's performance, our approach adopts a multiple initialization procedure, where the best resulting partition is determined by a Davies-Bouldin-like criterion.

The Algorithm

Let τ be the number of tested initializations, N the number of sequences, k the number of clusters, and L the proximity matrix characterized by the previously defined distances (3), (4), and (5). The resulting algorithm is the following:

for t = 1 to τ

- Initial cluster representatives λ_j are randomly chosen (j = 1, ..., k; λ_j ∈ {1, ..., N}).

- Repeat:

  - Partition evaluation step: compute the cluster to which each sequence S_i, i = 1, ..., N, belongs; S_i lies in the cluster j for which the distance L(S_i, λ_j), j = 1, ..., k, is minimum.

  - Parameter update step:
    - compute the sum of the distances of each element of cluster C_j from every other element of the j-th cluster;
    - determine the index of the element of C_j for which this sum is minimal;
    - use that index as the new descriptor of cluster C_j.

- Until the representatives λ_j do not change between two successive iterations.

- R_t = {C_1, C_2, ..., C_k}

- Compute the Davies-Bouldin-like index defined as:

    DBL(t) = (1/k) Σ_{r=1}^{k} max_{s≠r} { [S^L_c(C_r, λ_r) + S^L_c(C_s, λ_s)] / L(λ_r, λ_s) }

  where S^L_c is an intra-cluster measure defined by:

    S^L_c(C_r, λ_r) = ( Σ_{i∈C_r} L(i, λ_r) ) / |C_r|

endfor t

Final solution: the best clustering R_p is the one with the minimum Davies-Bouldin-like index, viz.:

    p = arg min_{t=1,...,τ} { DBL(t) }
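A minimal sketch of this procedure follows (our reading of the steps above; ties and degenerate clusters are not handled). It takes a precomputed N x N dissimilarity matrix D, so it works directly on distances such as (3)-(5):

```python
import numpy as np

def dpam(D, k, n_init=5, seed=0):
    """Distance-only PAM-style clustering with multiple initializations,
    keeping the run with the smallest Davies-Bouldin-like index."""
    rng = np.random.default_rng(seed)
    N = D.shape[0]
    best_db, best_labels = np.inf, None
    for _ in range(n_init):
        medoids = rng.choice(N, size=k, replace=False)
        while True:
            # Partition evaluation step: assign each item to its nearest medoid.
            labels = np.argmin(D[:, medoids], axis=1)
            new = medoids.copy()
            # Parameter update step: the new medoid minimizes the sum of
            # distances to the other members of its cluster.
            for j in range(k):
                members = np.where(labels == j)[0]
                if len(members):
                    new[j] = members[np.argmin(D[np.ix_(members, members)].sum(axis=1))]
            if np.array_equal(new, medoids):
                break
            medoids = new
        # Davies-Bouldin-like index: intra-cluster scatter over medoid separation.
        S = np.array([D[labels == j, medoids[j]].mean() for j in range(k)])
        db = np.mean([max((S[r] + S[s]) / D[medoids[r], medoids[s]]
                          for s in range(k) if s != r) for r in range(k)])
        if db < best_db:
            best_db, best_labels = db, labels
    return best_labels

# Two well-separated 1-D groups; D holds the pairwise absolute differences.
x = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
D = np.abs(x[:, None] - x[None, :])
labels = dpam(D, k=2)
```

On this toy matrix the procedure recovers the two groups regardless of which random initialization wins.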

Results

Data set

EEG data recorded by Zak Keirn at Purdue University; the dataset contains EEG signals recorded from different subjects who were asked to perform five mental tasks:

- baseline task, for which the subjects were asked to relax as much as possible;

- math task, for which the subjects were given nontrivial multiplication problems, such as 27*36, and were asked to solve them without vocalizing or making any other physical movement;

- letter task, for which the subjects were instructed to mentally compose a letter to a friend without vocalizing;

- geometric figure rotation, for which the subjects were asked to visualize a particular 3D block figure being rotated about an axis;

- visual counting task, for which the subjects were asked to imagine a blackboard and to visualize numbers being written on the board sequentially.

Preprocessing and Segmentation

We applied the method on a segment-by-segment basis, using 1 s signals sampled at 250 Hz, drawn from a dataset whose cardinality varies from 190 sequences (two mental states) to 473 sequences (five mental states); segments biased by signal spikes arising from human artifacts (e.g. ocular blinks) were removed.
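The segmentation step amounts to cutting each recording into non-overlapping windows; a sketch (artifact rejection, e.g. dropping windows containing blink spikes, would be applied on top of this):

```python
import numpy as np

def segment(signal, fs=250, win_s=1.0):
    """Cut a 1-D recording into non-overlapping windows of win_s seconds,
    discarding the incomplete tail."""
    w = int(fs * win_s)
    n = len(signal) // w
    return signal[:n * w].reshape(n, w)

segs = segment(np.zeros(1000))   # 4 s of (fake) signal at 250 Hz
```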

Experiments

- The proposed HMM clustering algorithm was first applied to two mental states, the baseline and math tasks; trials were then extended to all available data.

- Accuracies are computed by comparing the clustering results with the true segment labels; the percentage is simply the ratio of correctly assigned labels to the total number of segments.

- First we applied the hierarchical complete link technique, varying the proximity measure. Results are shown in the following table, with the number of mental states growing from two to five.

Hierarchical Complete Link

                     BP       KL       SM
2 natural clusters   97.37%   97.89%   97.37%
3 natural clusters   71.23%   79.30%   81.40%
4 natural clusters   62.63%   57.36%   65.81%
5 natural clusters   46.74%   54.10%   49.69%

Accuracies are quite satisfactory. None of the experimented measures can be considered the best one; nevertheless, measures (3) and (4) seem to be more effective.

We then applied the partitional algorithm to the same datasets, setting the number of initializations to 5 for all experiments. Results are presented in the following table:

Partitional DPAM Clustering

                     BP       KL       SM
2 natural clusters   95.79%   96.32%   95.79%
3 natural clusters   75.44%   72.98%   65.61%
4 natural clusters   64.21%   62.04%   50.52%
5 natural clusters   57.04%   46.74%   44.80%

In this last case, the BP distance is overall slightly better than the other experimented measures.

Final comments:

- A final comparison of the partitional and agglomerative hierarchical algorithms shows that there are no remarkable differences between the proposed approaches. Clearly, partitional approaches alleviate the computational burden, so they should be preferred when dealing with the clustering of complex signals (e.g. EEG).

- The comparison of clustering results with the classification results obtained in earlier works shows that the latter are only slightly better. This strengthens the quality of the proposed method, considering that unsupervised classification is inherently the more difficult task.

Conclusions

- We addressed the problem of unsupervised classification of sequences using an HMM approach. These models, very suitable for modelling sequential data, are used to characterize the similarity between sequences in different ways.

- We extend Smyth's ideas by defining a new metric, in the likelihood sense, between data sequences, and by applying two clustering algorithms to the resulting distance matrices: the traditional hierarchical agglomerative method and a novel partitional technique.

- Partitional algorithms are generally less computationally demanding than hierarchical ones, but they could not be applied in this context without the proper adaptations proposed in this paper.

- Finally, we tested our approach on real data, using complex temporal signals, EEGs, which are growing in importance due to the recent interest in Brain-Computer Interfaces.

- The results show that the proposed method is able to infer the natural partitions of the patterns characterizing a complex and noisy signal such as the EEG.

