# Time-Series Segmentation - University of Illinois at Chicago

Nov 25, 2013

TIME-SERIES SEGMENTATION BY CLUSTER ANALYSIS AND HIDDEN
MARKOV MODEL, WITH APPLICATION TO RATES OF RETURN

Stanley L. Sclove and Lon-Mu Liu

University of Illinois at Chicago

OUTLINE

1. Introduction and background

Motivation

Lack of Normality and lack of independence in rates of return

2. Context

labeling problems

general definition of labeling problems

classification

cluster analysis

time-series segmentation

3. Time-Series Segmentation

time-ordered data

time-series data

clustering

clustering with time as a variable

epochs vs. classes (states)

comparison and contrast with change-point inference

Hidden Markov Model

EM algorithm

greedy algorithm for the E step

Viterbi algorithm for the E step

Examples

SPC

GNP

Daily RORs

3.3. Probabilistic prediction

4. Conclusions and Discussion

Discussion

Extensions

ARIMA within classes


2D image segmentation

3D image segmentation


1. Introduction and background

1.1. Motivation

This paper deals with the segmentation of time series, with particular emphasis on economic and financial data. Typical segmentation problems include the separation of macroeconomic time series into segments of recession, recovery, and expansion, or of financial time series into Bull and Bear segments.

Such economic and financial data are often viewed in terms of rates of growth or return. Rates of return, especially daily rates of return, are neither Normal nor independent, so the classical model and even the standard financial model are not appropriate bases for their analysis. We propose segmentation methods that do not depend upon Normality of the whole series, or upon independence.

1.2. Statistical Context

Time-series segmentation may be viewed in terms of change-point inference or in terms of what we call labeling: assigning a label to each observation. We shall discuss both but focus mainly on the labeling approach. The reasons for this will be given later.

Labeling problems include classification, cluster analysis, and segmentation. The model for such problems involves pairs (x_t, τ_t), t = 1, 2, . . . , n, where x_t is observable and τ_t is an unobservable label, equal to one of several states, indicated by 1, 2, . . . , K. With each state is associated a probability distribution, the class-conditional distribution, f_k(x), k = 1, 2, . . . , K. Often these will be taken to be in a parametric family, i.e., the form is specified up to a parameter θ_k; that is, there is a specified form of probability density function, so that f_k(x) = g(x, θ_k). Sometimes these distributions will be Normal; then θ_k = (μ_k, σ_k), the mean and standard deviation, respectively.


Briefly, in classification, each observation is to be assigned to one of several known classes. In cluster analysis the classes are unknown; classes are formed and observations assigned to them in the same procedure. In time-series segmentation also, the classes may be unknown; they are formed and observations in the time series are assigned to them in the same procedure.

Classification. In classification, each of a number of observations is to be assigned to one of a set of known, pre-specified classes. In the simplest case, the distributional parameters θ_k also are known. Usually, these distributional parameters are unknown but can be estimated by supervised learning from a training set of n_1 observations for which the labels are known; from this set, the distributional parameters can be estimated. This set will consist of two variables, the measurements and the label characterizing each case. Later there will be a test set of n_2 cases with unknown labels; classification involves estimating these labels.

Cluster analysis. In cluster analysis, or unsupervised learning, the labels are unknown and yet we must allocate each observation to one of the K classes.

Time-series segmentation. When the observations are a time series, the temporal dependence can be combined with notions of cluster analysis to allocate each observation to one of K classes. This combination is the main topic of this paper.

3. Time-series segmentation

Time-ordered data are data recorded along a time continuum. A time series consists of time-ordered data recorded at equally spaced time intervals. Time-ordered data may be event triggered. Time series are time triggered. Much of what we discuss here applies to time-ordered data and does not require the additional structure of a time series.

Segmentation of time-ordered data refers to the division of the time axis into epochs (time intervals) within which the behavior of the series is relatively homogeneous. It may suffice to consider only a few states which describe the behavior of the series within the epochs. The epochs are intervals of time; the states are formed by considering intervals or regions of values of the observed variable(s). The number of states is generally much smaller than the number of epochs. Consider, for example, the analysis of macroeconomic data. Economists use the terms recession, depression, recovery, and expansion as states of the economy. As time goes on, the number of epochs will increase, but not necessarily the number of states. If we analyze suitable macroeconomic data, for example suitably differenced quarterly GDP, over several decades, we might find a number of epochs, fit by two or three such states. In analyzing a mixed second difference (which may be viewed as a kind of acceleration of the economy), Sclove (1983a,b) reports such findings.

The main thrust of this paper is to fit classes, not to find epochs.

One way to segment time series would be to simply cluster the observations, not using the time structure explicitly. The states are the clusters. A finite-mixture model can be used.

The hard-clustering estimates of the distributional parameters are the usual MLEs for the hard-classified data.

Many clustering algorithms involve putting cases that are close together into the same cluster. Time-series segmentation may be viewed as clustering, but with a time-ordered structure. The squared distance between x_t and x_u is (x_t − x_u)². One way to deal with time would be to use time as an additional variable and cluster the observations using a squared distance between cases which incorporates time, such as

d²(x_t, x_u) = (x_t − x_u)² + [(t − u)/c]² .

Here the constant c accomplishes a suitable weighting of the two terms.

Example. If c = 100, then

d²(x_t, x_u) = (x_t − x_u)² + [(t − u)/100]² .

If x_t = 5 and x_u = −1, then

d²(x_t, x_u) = (x_t − x_u)² + [(t − u)/100]² = 36 + [(t − u)/100]² .

If t = 1 and u = 2, this is 36 + 1/10000 = 36.0001. If t = 100 and u = 600, it is 36 + 5² = 36 + 25 = 61 >> 36.0001. Whether or not x_t and x_u are clustered together will depend not only upon their values but also upon how far away they are in time. Using an algorithm with this distance will pick out epochs, time intervals when the observations are close in value. The extension of this approach to two- and three-dimensional "time" parameter models, that is, two- and three-dimensional images, can be successful.
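The worked example above can be checked with a short script (a sketch; the function name d2 is ours, not from the paper):

```python
def d2(x_t, x_u, t, u, c=100.0):
    """Squared distance combining value and time differences:
    d^2 = (x_t - x_u)^2 + ((t - u)/c)^2, with c weighting the time term."""
    return (x_t - x_u) ** 2 + ((t - u) / c) ** 2

# The example from the text: x_t = 5, x_u = -1, c = 100.
print(d2(5.0, -1.0, 1, 2))      # 36 + (1/100)^2, approximately 36.0001
print(d2(5.0, -1.0, 100, 600))  # 36 + 5^2 = 61.0
```

Observations far apart in time incur a large distance even when their values agree, which is what makes the clusters come out as epochs.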

However, for time series, a perhaps more satisfactory approach is model-based clustering: rather than trying to estimate epochs, the researcher can try to fit the parameters of a few states. Distributions are fit to the states rather than the epochs. Rather than fitting maybe forty-some epochs, you can fit three or four states.

We can use K-means or ISODATA to fit clusters, either with or without time as a variable. K-means is ISODATA with drift: the centroids are updated as each assignment is made, rather than waiting for a complete pass through the whole dataset.
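A minimal sketch of this drift-style updating for one-dimensional data, using a running-mean centroid update; the function name and the seeding scheme are ours, not from the paper:

```python
import numpy as np

def kmeans_with_drift(x, seeds, n_passes=5):
    """Assign each observation to the nearest centroid and update that
    centroid immediately (running mean), before the next assignment."""
    centroids = [float(s) for s in seeds]
    counts = [1] * len(seeds)            # each seed counts as one member
    labels = np.zeros(len(x), dtype=int)
    for _ in range(n_passes):
        for i, xi in enumerate(x):
            k = int(np.argmin([(xi - m) ** 2 for m in centroids]))
            labels[i] = k
            counts[k] += 1               # incremental running-mean update: the drift
            centroids[k] += (xi - centroids[k]) / counts[k]
    return labels, centroids

# Two well-separated value regimes; seed with the first and last observations.
x = np.array([0.1, 0.2, 0.0, 5.1, 4.9, 5.0])
labels, centroids = kmeans_with_drift(x, seeds=[x[0], x[-1]])
print(labels)  # the low-valued and high-valued observations separate
```

Including a scaled time coordinate as a second variable, as in the distance above, would turn the same procedure into an epoch-finding one.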

In using these algorithms, it is wise to scale the variables so that those with greater variability don't dominate the clustering. One can scale by dividing by standard deviations (converting to "z" scores), but a better procedure is to use statistical (Mahalanobis) distance, which also adjusts for the correlations among the variables.

In fact, if we agree that ordinary Euclidean distance is appropriate for variables that have the same variance and are uncorrelated, then it follows that Mahalanobis distance is appropriate in general. To see this, let Z be a vector of variables that are uncorrelated and have equal variances, which without loss of generality can be taken to be 1. Then the squared Euclidean distance between the random vector Z and its mean vector μ_Z can be written as

(Z − μ_Z)′(Z − μ_Z),

where, given a vector v, the notation v′ denotes its transpose. Now, given a random vector X, whose elements would in general be correlated and have different variances, there exist matrices L such that the vector of linear combinations LX has uncorrelated elements with variances equal to 1. That is, squared Euclidean distance is appropriate for LX, and we write

(LX − Lμ_X)′(LX − Lμ_X) = (X − μ_X)′L′L(X − μ_X) = (X − μ_X)′Σ⁻¹(X − μ_X),

since Cov(LX) = LΣL′ = I implies L′L = Σ⁻¹, where Σ is the covariance matrix of X. That is, squared Euclidean distance for LX is squared Mahalanobis distance for X.

A problem is that the covariance matrix to be used in Mahalanobis distance is the within-groups covariance matrix, and the groups (clusters) have not yet been determined. So one uses an adaptive metric, updating the within-groups covariance matrix as well as the centroids.

It is interesting to consider Mahalanobis distance in the case discussed above, where the two variables are x and t. Given any two p-vectors u and v and a p × p symmetric, nonsingular matrix M, define the function D² as

D²(u, v; M) = (u − v)′M⁻¹(u − v),

where, given any vector v, the symbol v′ denotes its transpose. When u − v is random, that is, u or v or both are random, with covariance matrix Σ_{u−v}, the squared statistical distance (Mahalanobis distance) between u and v is

D²(u, v; Σ_{u−v}) = (u − v)′Σ_{u−v}⁻¹(u − v).

There exists a matrix L such that LΣ_{u−v}L′ = I, so that this is the squared Euclidean distance between Lu and Lv.
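The squared statistical distance just defined can be sketched directly (an illustrative helper, not from the paper):

```python
import numpy as np

def mahalanobis2(u, v, M):
    """Squared statistical distance D^2(u, v; M) = (u - v)' M^{-1} (u - v).
    Uses a linear solve rather than forming the inverse explicitly."""
    d = np.asarray(u, float) - np.asarray(v, float)
    return float(d @ np.linalg.solve(M, d))

# With M = I this reduces to squared Euclidean distance.
u, v = np.array([5.0, 1.0]), np.array([-1.0, 2.0])
print(mahalanobis2(u, v, np.eye(2)))   # (5-(-1))^2 + (1-2)^2 = 37.0
```

With M taken as the (within-groups) covariance matrix, the same function gives the adaptive metric described above.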

Let us look in detail at the case p = 2. Let D²(x_t, x_u; S) denote the squared statistical distance, where S is the covariance matrix of (x, t). The squared statistical distance can be expressed in terms of a matrix square root of the covariance matrix. One such square root involves reexpressing the variables x and t as linear combinations: t itself, and x adjusted for t. Adjusting x for t is just to remove "drift" from x. So in this case the squared statistical distance is the ordinary Euclidean distance between the values of t and the values of detrended x: the squared distance between the cases at times t_i and t_j is

(x_i − x_j, t_i − t_j) S⁻¹ (x_i − x_j, t_i − t_j)′ = (t_i − t_j)² + (x̃_i − x̃_j)²,

where x̃_i = x_i − (a + b t_i), the value of x adjusted for the fitted trend a + bt, and the terms on the right are understood as standardized. Here the first variable is t and the second is the detrended x̃_t = x_t − (a + bt).


The FMM. The finite mixture model (FMM) involves class-conditional densities and their probabilities, the mixing probabilities. In the FMM the density is

f(x) = π_1 f_1(x) + π_2 f_2(x) + . . . + π_K f_K(x).
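The FMM density can be sketched as follows, taking the class-conditional densities to be Normal; the mixing probabilities and parameter values here are hypothetical:

```python
import math

def normal_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def fmm_density(x, mix_probs, params):
    """FMM density f(x) = sum_k pi_k f_k(x), with Normal f_k
    parameterized by theta_k = (mu_k, sigma_k)."""
    return sum(p * normal_pdf(x, mu, sigma)
               for p, (mu, sigma) in zip(mix_probs, params))

# Two states, e.g. a 'Bear' regime (mean -1) and a 'Bull' regime (mean +1).
pi = [0.3, 0.7]
theta = [(-1.0, 1.0), (1.0, 1.0)]
print(fmm_density(0.0, pi, theta))  # weighted sum of the two densities at x = 0
```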

From FMM to HMM. Now let's focus on the states. Use K states, with transition probabilities between the states. The mixing probabilities of the FMM become the transition probabilities between states. In the FMM the density is

f(x) = π_1 f_1(x) + π_2 f_2(x) + . . . + π_K f_K(x).

In the HMM the conditional distribution of X_t, the value of the process at time t, given that the process is in state j at time t − 1 (τ_{t−1} = j), is given by the density

f(x_t | τ_{t−1} = j) = p_{j1} f_1(x) + p_{j2} f_2(x) + . . . + p_{jK} f_K(x).

This is an FMM density in which the transition probabilities p_{jk} replace the mixing probabilities π_k.

Hidden Markov Model

EM algorithm

E step, or estimation (prediction) step: The values of the missing data are estimated.

M step, or maximization step: The likelihood is maximized with respect to the distributional parameters, given the current estimates of the missing data.

greedy algorithm for the E step

Example with exponential distributions, from ARO meeting

In classification, the MAP estimate of the label τ_i of x_i is

g_i = arg max_{1 ≤ k ≤ K} { p_k f_k(x_i) } .

By analogy with this, we have the following rule for estimating the labels in the HMM: Given that the estimate g_{t−1} of τ_{t−1} is g_{t−1} = j, the estimate g_t of τ_t is

g_t = arg max_{1 ≤ k ≤ K} { p_{jk} f_k(x_t) } .

This is an example of a greedy algorithm, a one-step look-ahead algorithm. Greedy algorithms are not necessarily optimal. In fact, in this case, the greedy algorithm is not optimal. Later we discuss an optimal algorithm.
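A sketch of this greedy labeling rule, taking the class-conditional densities to be Normal; the transition probabilities and state parameters are hypothetical:

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def greedy_labels(x, P, params, g0):
    """Greedy (one-step look-ahead) E step: given the previous label
    g_{t-1} = j, choose g_t = argmax_k p_{jk} f_k(x_t)."""
    labels = []
    j = g0
    for xt in x:
        scores = [P[j][k] * normal_pdf(xt, mu, s)
                  for k, (mu, s) in enumerate(params)]
        j = max(range(len(scores)), key=scores.__getitem__)
        labels.append(j)
    return labels

# Hypothetical two-state setup: sticky transitions, well-separated means.
P = [[0.9, 0.1], [0.1, 0.9]]
theta = [(0.0, 1.0), (5.0, 1.0)]
x = [0.2, -0.1, 4.8, 5.3, 5.0]
print(greedy_labels(x, P, theta, g0=0))  # [0, 0, 1, 1, 1]
```

Each label is chosen using only the previous label and the current observation, which is why the rule can be suboptimal for the sequence as a whole.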

the Viterbi algorithm for the E step
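Since this section is not filled in, here is a minimal sketch of the standard Viterbi recursion for the most probable state sequence; this illustrates the standard algorithm, not the authors' implementation, and the densities and parameters are hypothetical:

```python
import math

def viterbi(x, P, init, pdf):
    """Standard Viterbi recursion: finds the state sequence maximizing the
    joint probability of states and observations under the HMM.
    P[j][k] = transition probability, init[k] = initial state probability,
    pdf(k, x) = class-conditional density f_k(x). Log scores avoid underflow."""
    K = len(init)
    score = [math.log(init[k]) + math.log(pdf(k, x[0])) for k in range(K)]
    back = []
    for xt in x[1:]:
        prev, score, ptr = score, [], []
        for k in range(K):
            j = max(range(K), key=lambda j: prev[j] + math.log(P[j][k]))
            score.append(prev[j] + math.log(P[j][k]) + math.log(pdf(k, xt)))
            ptr.append(j)
        back.append(ptr)
    k = max(range(K), key=score.__getitem__)
    path = [k]
    for ptr in reversed(back):   # trace the best path backwards
        k = ptr[k]
        path.append(k)
    return path[::-1]

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hypothetical two-state example.
P = [[0.9, 0.1], [0.1, 0.9]]
theta = [(0.0, 1.0), (5.0, 1.0)]
pdf = lambda k, v: normal_pdf(v, *theta[k])
print(viterbi([0.2, -0.1, 4.8, 5.3, 5.0], P, [0.5, 0.5], pdf))  # [0, 0, 1, 1, 1]
```

Unlike the greedy rule, the backtracking step lets a later observation revise earlier labels, which is what makes the resulting sequence optimal.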

Examples

SPC

We illustrate with the data from Tracy et al. Let's look for two classes. If there is a burn-in (or Phase I) epoch, the beginning of the series would be different, so let's use the first and last observations for seed points.

Two states. Multivariate.

GNP

Daily RORs

3.3. Probabilistic prediction

Suppose that before period t − 1 ends, you know that τ_{t−1} = j. Given that, the probability that τ_t will be equal to k is estimated as p_{jk}. Thus you can predict that

with probability p_{j1}, y_t = m_1, with a standard deviation of s_1,

with probability p_{j2}, y_t = m_2, with a standard deviation of s_2,

. . .

with probability p_{jK}, y_t = m_K, with a standard deviation of s_K.
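This prediction scheme can be sketched as follows; the transition probabilities and state parameters are hypothetical:

```python
def predict_next(j, P, means, sds):
    """Probabilistic prediction: given tau_{t-1} = j, y_t equals m_k with
    probability p_{jk}, with standard deviation s_k. Returns a list of
    (probability, mean, sd) triples, one per state."""
    return [(P[j][k], means[k], sds[k]) for k in range(len(means))]

# Hypothetical two-state fit.
P = [[0.9, 0.1], [0.1, 0.9]]
m = [0.0, 5.0]
s = [1.0, 2.0]
for prob, mean, sd in predict_next(0, P, m, s):
    print(f"with probability {prob}: y_t = {mean} (sd {sd})")
```

The prediction is thus itself a finite mixture, with the row of the transition matrix for the current state serving as the mixing probabilities.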

4. Conclusions and Discussion

Discussion

Extensions

preprocessing

natural with RORs

Within-state (class-conditional) distributions could be replaced by ARIMA models.

ARIMA within classes

2D image segmentation

3D image segmentation


References

Ball, G.H., and Hall, D.J., 1967. A clustering technique for summarizing multivariate data. Behavioral Science, 12, 153-155. The invention of ISODATA.

Dempster, A.P., Laird, N.M., and Rubin, D.B., 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39, 1-38.

Everitt, B.S., Landau, S., and Leese, M., 2001. Cluster Analysis, 4th ed. Arnold, London; Oxford, New York.

Fleiss, J.L., and Zubin, J., 1969. On the methods and theory of clustering. Multivariate Behavioral Research, 4, 235-250. A relatively early exposition of the concept of mixtures of populations.

Fraley, C., and Raftery, A.E., 1999. MCLUST: Software for model-based cluster analysis. Journal of Classification, 16(2), 297-306. A Web page with related links can be found at http://www.stat.washington.edu/fraley/software.html .

Hunter, D.

MacQueen, J., 1967. Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symposium, 1, 281-297. University of California Press, Berkeley. The invention of K-means.

McLachlan, G., and Basford, K.E., 1987. Mixture Models: Inference and Applications to Clustering. Marcel Dekker, New York.

McLachlan, G., and Peel, D., 2000. Finite Mixture Models. Wiley, New York. Their EMMIX software is available at http://www.maths.uq.edu.au/~gjm/emmix/emmix.html .

Sclove, S.L., 1978. Population mixture models and clustering algorithms. Communications in Statistics, A6 (1977), 417-434. Interprets ISODATA and K-means in terms of maximum likelihood estimation with the classification likelihood.

Sclove, S.L., 1983a. On segmentation of time series. In: S. Karlin, T. Amemiya, and L. Goodman (Eds.), Studies in Econometrics, Time Series, and Multivariate Statistics, Academic Press, New York, 311-330.

Sclove, S.L., 1983b. Time-series segmentation: a model and a method. Information Sciences, 29, 7-25.

Sclove, S.L., 1992. CLUSPAC: Cluster Analysis Package. Technical Report 92-1 (Center for Research in Information Management, University of Illinois at Chicago, Chicago, IL). http://www.uic.edu/classes/idsc/ids594/ISOPAC/CLUSPAC/

Sullivan, J.H., 2002. Estimating the locations of multiple change points in the mean. Computational Statistics and Data Analysis, 17(2), 289-296.

Titterington, D.M., Smith, A.F.M., and Makov, U.E., 1985. Statistical Analysis of Finite Mixture Distributions. New York: Wiley.

Tracy, N.D., Young, J.C., and Mason, R.L., 1992. Multivariate control charts for individual observations. Journal of Quality Technology, 24(2), 88-95.

Wolfe, J.H., 1967. NORMIX: Computational methods for estimating the parameters of multivariate normal mixtures of distributions. Research Memo. SRM 68-2. San Diego: U.S. Naval Personnel Research Activity.

Wolfe, J.H., 1970. Pattern clustering by multivariate mixture analysis. Multivariate Behavioral Research, 5, 329-350. An early definitive work on multivariate normal mixtures, presented at the 1970 meeting of CSNA at the University of Western Ontario, London, Ontario, Canada.