TIME-SERIES SEGMENTATION BY CLUSTER ANALYSIS AND HIDDEN MARKOV MODEL, WITH APPLICATION TO RATES OF RETURN

Stanley L. Sclove and Lon-Mu Liu
University of Illinois at Chicago
OUTLINE

1. Introduction and background
   Motivation
   Lack of Normality and lack of independence in rates of return
2. Context
   Labeling problems
   General definition of labeling problems
   Classification
   Cluster analysis
   Time-series segmentation
3. Time-Series Segmentation
   Time-ordered data
   Time-series data
   Clustering
   Clustering with time as a variable
   Epochs vs. classes (states)
   Comparison and contrast with change-point inference
   Hidden Markov Model
   EM algorithm
   Greedy algorithm for the E step
   Viterbi algorithm for the E step
   Examples
      SPC
      GNP
      Daily RORs
   3.3. Probabilistic prediction
4. Conclusions and Discussion
   Discussion
   Extensions
   ARIMA within classes
   2D image segmentation
   3D image segmentation
1. Introduction and background
1.1. Motivation
This paper deals with the segmentation of time series, with particular emphasis on economic and financial data. Typical segmentation problems include the separation of macroeconomic time series into segments of recession, recovery, and expansion, or of financial time series into Bull and Bear segments.

Such economic and financial data are often viewed in terms of rates of growth or return. Rates of return, especially daily rates of return, are neither Normal nor independent, so the classical model and even the standard financial model are not appropriate bases for their analysis. We propose segmentation methods that do not depend upon Normality of the whole series, or upon independence.
1.2. Statistical Context
Time-series segmentation may be viewed in terms of change-point inference or in terms of what we call labeling: assigning a label to each observation. We shall discuss both but focus mainly on the labeling approach; the reasons for this will be given later. Labeling problems include classification, cluster analysis, and segmentation.
The model for such problems involves pairs (x_t, τ_t), t = 1, 2, . . . , n, where x_t is observable and τ_t is an unobservable label, equal to one of several states, indicated by 1, 2, . . . , K. With each state is associated a probability distribution, the class-conditional distribution, f_k(x), k = 1, 2, . . . , K. Often these will be taken to be in a parametric family, i.e., the form is specified up to a parameter θ_k; that is, there is a specified form of probability density function, so that f_k(x) = g(x, θ_k). Sometimes these distributions will be Normal; then θ_k = (μ_k, σ_k), the mean and standard deviation, respectively.
Briefly, in classification, each observation is to be assigned to one of several known classes. In cluster analysis the classes are unknown; classes are formed and observations assigned to them in the same procedure. In time-series segmentation also, the classes may be unknown; they are formed and observations in the time series are assigned to them in the same procedure.
Classification. In classification, each of a number of observations is to be assigned to one of a set of known, pre-specified classes. In the simplest case, the distributional parameters θ_k also are known. Usually, these distributional parameters are unknown but can be estimated by supervised learning from a training set of n_1 observations for which the labels are known; from this set, the distributional parameters can be estimated. This set will consist of two variables, the measurements and the label characterizing each case. Later there will be a test set of n_2 cases with unknown labels; classification involves estimating these labels.
Cluster analysis. In cluster analysis, or unsupervised learning, the labels are unknown and yet we must allocate each observation to one of the K classes.

Time-series segmentation. When the observations are a time series, the temporal dependence can be combined with notions of cluster analysis to allocate each observation to one of K classes. This combination is the main topic of this paper.
3. Time-series segmentation

Time-ordered data are data recorded along a time continuum. A time series consists of time-ordered data recorded at equally spaced time intervals. Time-ordered data may be event triggered; time series are time triggered. Much of what we discuss here applies to time-ordered data and does not require the additional structure of a time series.

Segmentation of time-ordered data refers to the division of the time axis into epochs (time intervals) within which the behavior of the series is relatively homogeneous. It may suffice to consider only a few states which describe the behavior of the series within the epochs.
The epochs are intervals of time; the states are formed by considering intervals or regions of values of the observed variable(s). The number of states is generally much smaller than the number of epochs. Consider for example the analysis of macroeconomic data. Economists use the terms recession, depression, recovery, and expansion as states of the economy. As time goes on, the number of epochs will increase, but not necessarily the number of states. If we analyze suitable macroeconomic data, for example, suitably differenced quarterly GDP, over several decades, we might find a number of epochs, fit by two or three such states. In analyzing a mixed second difference (which may be viewed as a kind of acceleration of the economy), Sclove (1983a,b) reports such findings.

The main thrust of this paper is to fit classes, not to find epochs.
One way to segment time series would be to simply cluster the observations, not using the time structure explicitly. The states are the clusters. A finite-mixture model can be used. The hard-clustering estimates of the distributional parameters are the usual MLEs for the hard-classified data.

Many clustering algorithms involve putting cases that are close together into the same cluster. Time-series segmentation may be viewed as clustering, but with a time-ordered structure. The squared distance between x_t and x_u is (x_t - x_u)². One way to deal with time would be to use time as an additional variable and cluster the observations using a squared distance between cases which incorporates time, such as

d²(x_t, x_u) = (x_t - x_u)² + [(t - u)/c]².

Here the constant c accomplishes a suitable weighting of the two terms.
Example. If c = 100, then

d²(x_t, x_u) = (x_t - x_u)² + [(t - u)/100]².

If x_t = 5 and x_u = -1, then

d²(x_t, x_u) = (x_t - x_u)² + [(t - u)/100]² = 36 + [(t - u)/100]².

If t = 1 and u = 2, this is 36 + 1/10000 = 36.0001. If t = 100 and u = 600, it is 36 + 5² = 36 + 25 = 61 >> 36.0001. Whether or not x_t and x_u are clustered together will depend not only upon their values but also upon how far away they are in time. Using an algorithm with this distance will pick out epochs, time intervals when the observations are close in value. The extension of this approach to two- and three-dimensional "time" parameter models, that is, two- and three-dimensional images, can be successful.
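The weighted distance above is easy to sketch in code. A minimal Python illustration (the function name d2 is ours), reproducing the worked example:

```python
def d2(x_t, x_u, t, u, c=100.0):
    """Squared distance incorporating time: (x_t - x_u)^2 + ((t - u)/c)^2."""
    return (x_t - x_u) ** 2 + ((t - u) / c) ** 2

# Worked example from the text: x_t = 5, x_u = -1, c = 100.
print(d2(5, -1, 1, 2))      # 36 + 0.0001 = 36.0001
print(d2(5, -1, 100, 600))  # 36 + 25 = 61.0
```

Any distance-based clustering algorithm can then use d2 in place of the ordinary squared Euclidean distance.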
However, for time series, a perhaps more satisfactory approach is model-based clustering: rather than trying to estimate epochs, the researcher can try to fit the parameters of a few states. Distributions are fit to the states rather than the epochs. Rather than fitting maybe forty-some epochs, you can fit three or four states.
We can use K-means or ISODATA to fit clusters, either with or without time as a variable. K-means is ISODATA with drift: the centroids are updated as each assignment is made, rather than waiting for a complete pass through the whole dataset.
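The "with drift" updating can be sketched as a single MacQueen-style pass. This is an illustrative sketch, not the authors' code; treating each seed as the first member of its cluster is our assumption:

```python
def online_kmeans(points, seeds):
    """One pass of MacQueen-style K-means: assign each case to the nearest
    centroid and update that centroid immediately ("with drift")."""
    centroids = [list(s) for s in seeds]
    counts = [1] * len(seeds)          # each seed counts as a first member
    labels = []
    for p in points:
        # nearest centroid by squared Euclidean distance
        k = min(range(len(centroids)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
        counts[k] += 1
        # running-mean update of the winning centroid
        centroids[k] = [c + (a - c) / counts[k] for a, c in zip(p, centroids[k])]
        labels.append(k)
    return labels, centroids

# Two well-separated 1-D clusters, seeded at 0 and 10:
labels, cents = online_kmeans([[0.0], [0.1], [10.0], [10.2]], [[0.0], [10.0]])
print(labels)  # [0, 0, 1, 1]
```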
In using these algorithms, it is wise to scale the variables so that those with greater variability don't dominate the clustering. One can scale by dividing by standard deviations (converting to "z" scores), but a better procedure is to use statistical (Mahalanobis) distance, which also adjusts for the correlations among the variables.
In fact, if we agree that ordinary Euclidean distance is appropriate for variables that have the same variance and are uncorrelated, then it follows that Mahalanobis distance is appropriate in general. To see this, let Z be a vector of variables that are uncorrelated and have equal variances, which without loss of generality can be taken to be 1. Then the squared Euclidean distance between the random vector Z and its mean vector μ_Z can be written as

(Z - μ_Z)'(Z - μ_Z),

where, given a vector v, the notation v' denotes its transpose. Now, given a random vector X, whose elements would in general be correlated and have different variances, there exist matrices L such that the vector of linear combinations LX has uncorrelated elements with variances equal to 1. That is, squared Euclidean distance is appropriate for LX, and we write

(LX - Lμ_X)'(LX - Lμ_X) = (X - μ_X)'L'L(X - μ_X).

Since Cov(LX) = LΣL' = I, where Σ is the covariance matrix of X, we have L'L = Σ⁻¹, so this is the squared Mahalanobis distance (X - μ_X)'Σ⁻¹(X - μ_X).
A problem is that the covariance matrix to be used in Mahalanobis distance is the within-groups covariance matrix, and the groups (clusters) have not yet been determined. So one uses an adaptive metric, updating the within-groups covariance matrix as well as the centroids.
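For the two-variable case considered below, the statistical distance can be computed without matrix libraries; a minimal sketch (our helper, using the closed-form inverse of a 2 × 2 covariance matrix):

```python
def mahalanobis_sq_2d(u, v, S):
    """Squared statistical distance (u - v)' S^{-1} (u - v) for p = 2,
    where S = [[s11, s12], [s12, s22]] is the covariance matrix."""
    d0, d1 = u[0] - v[0], u[1] - v[1]
    s11, s12, s22 = S[0][0], S[0][1], S[1][1]
    det = s11 * s22 - s12 * s12            # must be nonzero (S nonsingular)
    # S^{-1} = (1/det) * [[s22, -s12], [-s12, s11]]
    return (s22 * d0 * d0 - 2.0 * s12 * d0 * d1 + s11 * d1 * d1) / det

# With S = I this reduces to ordinary squared Euclidean distance:
print(mahalanobis_sq_2d((0.0, 0.0), (3.0, 4.0), [[1.0, 0.0], [0.0, 1.0]]))  # 25.0
```

In an adaptive-metric algorithm, S would be the current estimate of the within-groups covariance matrix, recomputed as cluster memberships change.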
It is interesting to consider Mahalanobis distance in the case discussed above, where the two variables are x and t. Given any two p-vectors u and v and a p × p symmetric, nonsingular matrix M, define the function D² as

D²(u, v; M) = (u - v)' M⁻¹ (u - v),

where, given any vector v, the symbol v' denotes its transpose. When u - v is random, that is, u or v or both are random, with covariance matrix Σ_{u-v}, the squared statistical distance (Mahalanobis distance) between u and v is

D²(u, v; Σ_{u-v}) = (u - v)' Σ_{u-v}⁻¹ (u - v).

There exists a matrix L such that L Σ_{u-v} L' = I, so the squared statistical distance is the ordinary squared Euclidean distance between Lu and Lv.
Let us look in detail at the case p = 2. Let D²(x_t, x_u; S) denote the squared statistical distance, where S is the covariance matrix of the pairs (x, t). The squared statistical distance can be expressed in terms of a matric square root of the covariance matrix. One such square root corresponds to reexpressing the variables x and t as t and x adjusted for t; and adjusting x for t is just removing the "drift" from x. So in this case the squared statistical distance is, apart from scaling, the ordinary squared Euclidean distance computed from the values of t and the values of detrended x:

D²((x_t, t), (x_u, u); S) = (t - u)² + (x̃_t - x̃_u)²,

where x̃_t = x_t - (a + b t), the residual of x_t from the fitted drift line a + b t.
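The detrending step can be sketched as follows. This is a hypothetical helper; fitting the drift line a + b t by ordinary least squares is our assumption about how a and b would be obtained:

```python
def detrended_sq_distance(x, t, i, j):
    """Squared distance between cases i and j after removing linear drift:
    (t_i - t_j)^2 + (x~_i - x~_j)^2, with x~ the residual from x = a + b t."""
    n = len(t)
    tbar = sum(t) / n
    xbar = sum(x) / n
    # ordinary least-squares fit of the drift line a + b t
    b = sum((ti - tbar) * (xi - xbar) for ti, xi in zip(t, x)) / \
        sum((ti - tbar) ** 2 for ti in t)
    a = xbar - b * tbar
    resid = [xi - (a + b * ti) for ti, xi in zip(t, x)]   # detrended x
    return (t[i] - t[j]) ** 2 + (resid[i] - resid[j]) ** 2

# A series that is pure drift detrends to zero residuals, leaving only (t_i - t_j)^2:
print(detrended_sq_distance([0.0, 2.0, 4.0, 6.0], [0.0, 1.0, 2.0, 3.0], 0, 3))
```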
The FMM. The finite mixture model (FMM) involves class-conditional densities and their probabilities, the mixing probabilities. In the FMM the density is

f(x) = π_1 f_1(x) + π_2 f_2(x) + . . . + π_K f_K(x).
From FMM to HMM. Now let's focus on the states. Use K states, with transition probabilities between the states. The mixing probabilities of the FMM become the transition probabilities between states. In the HMM, the conditional distribution of X_t, the value of the process at time t, given that the process is in state j at time t - 1 (τ_{t-1} = j), is given by the density

f(x_t | τ_{t-1} = j) = p_{j1} f_1(x_t) + p_{j2} f_2(x_t) + . . . + p_{jK} f_K(x_t).

This is an FMM density in which the transition probabilities p_{jk} replace the mixing probabilities π_k.
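The conditional density above can be evaluated directly. A minimal sketch, assuming Normal class-conditional densities (the text allows general f_k); the function names are ours:

```python
import math

def normal_pdf(x, m, s):
    """Normal density with mean m and standard deviation s."""
    return math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def hmm_conditional_density(x, j, P, params):
    """f(x_t | tau_{t-1} = j) = sum_k p_{jk} f_k(x_t): an FMM density whose
    mixing weights are row j of the transition matrix P."""
    return sum(P[j][k] * normal_pdf(x, m, s) for k, (m, s) in enumerate(params))

# Hypothetical two-state example:
P = [[0.9, 0.1], [0.2, 0.8]]
params = [(0.0, 1.0), (3.0, 2.0)]   # (mean, s.d.) for each state
print(hmm_conditional_density(0.5, 0, P, params))
```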
Hidden Markov Model

EM algorithm

E step: the estimation (or prediction) step. The values of the missing data (here, the unobserved labels) are estimated.

M step: the maximization step. The likelihood is maximized with respect to the distributional parameters, given the current estimates of the missing data.

Greedy algorithm for the E step
In classification, the MAP estimate of the label τ_i of x_i is

g_i = arg max_{1 ≤ k ≤ K} { p_k f_k(x_i) },

where the p_k are the prior (mixing) probabilities. By analogy with this, we have the following rule for estimating the labels in the HMM: given that the estimate g_{t-1} of τ_{t-1} is g_{t-1} = j, the estimate g_t of τ_t is

g_t = arg max_{1 ≤ k ≤ K} { p_{jk} f_k(x_t) }.

This is an example of a greedy algorithm, a one-step look-ahead algorithm. Greedy algorithms are not necessarily optimal. In fact, in this case, the greedy algorithm is not optimal. Later we discuss an optimal algorithm.
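The greedy rule translates directly into code. A sketch (our implementation; the two-state densities and transition matrix in the illustration are hypothetical):

```python
def greedy_labels(xs, P, p0, f):
    """Greedy (one-step look-ahead) E step: label x_1 by the MAP rule with
    initial probabilities p0, then label each x_t by maximizing
    p_{jk} f_k(x_t) given the previous label j."""
    K = len(P)
    g = [max(range(K), key=lambda k: p0[k] * f(k, xs[0]))]
    for x in xs[1:]:
        j = g[-1]
        g.append(max(range(K), key=lambda k: P[j][k] * f(k, x)))
    return g

# Illustration: two states with ad hoc densities peaked at 0 and at 10.
import math
f = lambda k, x: math.exp(-abs(x - (0.0, 10.0)[k]))
P = [[0.9, 0.1], [0.1, 0.9]]
print(greedy_labels([0.1, 0.2, 9.9, 10.1], P, [0.5, 0.5], f))  # [0, 0, 1, 1]
```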
The Viterbi algorithm for the E step
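The Viterbi recursion finds the jointly most probable state sequence, in contrast to the greedy one-step rule. A minimal sketch of the standard algorithm (our implementation; a production version would work in log space to avoid underflow on long series):

```python
def viterbi(xs, P, p0, f):
    """Viterbi algorithm: jointly most probable state sequence under the HMM."""
    K = len(P)
    delta = [p0[k] * f(k, xs[0]) for k in range(K)]   # best path probabilities
    back = []                                         # backpointers
    for x in xs[1:]:
        prev = delta
        delta, ptr = [], []
        for k in range(K):
            j_best = max(range(K), key=lambda j: prev[j] * P[j][k])
            ptr.append(j_best)
            delta.append(prev[j_best] * P[j_best][k] * f(k, x))
        back.append(ptr)
    g = [max(range(K), key=lambda k: delta[k])]       # best final state
    for ptr in reversed(back):                        # backtrack
        g.append(ptr[g[-1]])
    return g[::-1]

# Same hypothetical two-state setup as in the greedy illustration:
import math
f = lambda k, x: math.exp(-abs(x - (0.0, 10.0)[k]))
P = [[0.9, 0.1], [0.1, 0.9]]
print(viterbi([0.1, 0.2, 9.9, 10.1], P, [0.5, 0.5], f))  # [0, 0, 1, 1]
```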
Examples

SPC. We illustrate with the data from Tracy et al. Let's look for two classes. If there is a burn-in (or Phase I) epoch, the beginning of the series would be different, so let's use the first and last observations for seed points. Two states; multivariate.
GNP
Daily RORs
3.3. Probabilistic prediction

Suppose that before period t - 1 ends, you know that τ_{t-1} = j. Given that, the probability that τ_t will be equal to k is estimated as p_{jk}. Thus you can predict that

with probability p_{j1}, y_t = m_1, with a standard deviation of s_1,
with probability p_{j2}, y_t = m_2, with a standard deviation of s_2,
. . .
with probability p_{jK}, y_t = m_K, with a standard deviation of s_K.
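This predictive table is simple to compute. A sketch, with hypothetical two-state (Bull/Bear) numbers for the mean returns m_k and standard deviations s_k:

```python
def predictive_mixture(j, P, params):
    """Given tau_{t-1} = j, the prediction for y_t: with probability p_{jk},
    y_t has mean m_k and standard deviation s_k."""
    return [(P[j][k], m, s) for k, (m, s) in enumerate(params)]

def predictive_mean(j, P, params):
    """Point forecast: the mean of the predictive mixture."""
    return sum(P[j][k] * m for k, (m, _) in enumerate(params))

# Hypothetical transition matrix and state parameters (daily ROR scale):
P = [[0.7, 0.3], [0.4, 0.6]]
params = [(0.001, 0.01), (-0.002, 0.03)]
print(predictive_mixture(0, P, params))
print(predictive_mean(0, P, params))   # 0.7*0.001 + 0.3*(-0.002)
```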
4. Conclusions and Discussion

Discussion

Extensions
   Preprocessing (natural with RORs)
   Within-state (class-conditional) distributions could be replaced by ARIMA models: ARIMA within classes
   2D image segmentation
   3D image segmentation
References

Ball, G.H., and Hall, D.J., 1967. A clustering technique for summarizing multivariate data. Behavioral Science, 12, 153-155. The invention of ISODATA.

Dempster, A.P., Laird, N.M., and Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.

Everitt, B.S., Landau, S., and Leese, M., 2001. Cluster Analysis. 4th ed. Arnold, London; Oxford, New York.

Fleiss, J.L., and Zubin, J., 1969. On the methods and theory of clustering. Multivariate Behavioral Research, 4, 235-250. A relatively early exposition of the concept of mixtures of populations.

Fraley, C., and Raftery, A.E., 1999. MCLUST: Software for Model-Based Cluster Analysis. Journal of Classification, 16(2), 297-306. A Web page with related links can be found at http://www.stat.washington.edu/fraley/software.html.

Hunter, D.

MacQueen, J., 1967. Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symposium, 1, 281-297. University of California Press, Berkeley. The invention of K-means.

McLachlan, G., and Basford, K.E., 1987. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.

McLachlan, G., and Peel, D., 2000. Finite Mixture Models. Wiley, New York. Their EMMIX software is available at http://www.maths.uq.edu.au/~gjm/emmix/emmix.html.

Sclove, S.L., 1977. Population mixture models and clustering algorithms. Communications in Statistics, A6, 417-434. Interprets ISODATA and K-means in terms of maximum likelihood estimation with the classification likelihood.

Sclove, S.L., 1983a. On segmentation of time series. In: S. Karlin, T. Amemiya, and L. Goodman (Eds.), Studies in Econometrics, Time Series, and Multivariate Statistics, Academic Press, New York, 311-330.

Sclove, S.L., 1983b. Time-series segmentation: a model and a method. Information Sciences, 29, 7-25.

Sclove, S.L., 1992. CLUSPAC: Cluster Analysis Package. Technical Report 92-1 (Center for Research in Information Management, University of Illinois at Chicago, Chicago, IL). http://www.uic.edu/classes/idsc/ids594/ISOPAC/CLUSPAC/

Sullivan, J.H., 2002. Estimating the locations of multiple change points in the mean. Computational Statistics and Data Analysis, 17(2), 289-296.

Titterington, D.M., Smith, A.F.M., and Makov, U.E., 1985. Statistical Analysis of Finite Mixture Distributions. New York: Wiley.

Tracy, N.D., Young, J.C., and Mason, R.L., 1992. Multivariate control charts for individual observations. Journal of Quality Technology, 24(2), 88-95.

Wolfe, J.H., 1967. NORMIX: Computational methods for estimating the parameters of multivariate normal mixtures of distributions. Research Memo. SRM 68-2. San Diego: U.S. Naval Personnel Research Activity.

Wolfe, J.H., 1970. Pattern clustering by multivariate mixture analysis. Multivariate Behavioral Research, 5, 329-350. An early definitive work on multivariate normal mixtures, presented at the 1970 meeting of CSNA at the University of Western Ontario, London, Ontario, Canada.