A Bitmap Approach to Trend Clustering for Prediction in Time-Series Databases

Jong P. Yoon¹, Yixin Luo², and Junghyun Nam¹

¹University of Louisiana, Center for Advanced Computer Studies, Lafayette, LA 70504-4330
²Southern University, Computer Science Department, Baton Rouge, LA 70813
ABSTRACT
This paper describes a bitmap approach to clustering and prediction of trends in time-series databases. Similar trend patterns, rather than similar data patterns, are extracted from a time-series database. We consider four types of matches: (1) Exact match, (2) Similarity match, (3) Exact match by shift, and (4) Similarity match by shift. A pair of time-series data may be matched in one of these four types if the two are similar to each other, that is, if their similarity (the notion sim) exceeds a threshold. Matched data can be clustered by the same types of matching. To improve performance, we use the notion of the center of a cluster; the radius of a cluster is used to determine whether a given time-series data is included in the cluster. We also use a new notion of dissimilarity, called dissim, to make clusters more accurate. Using both notions, sim and dissim, a time-series data is more likely to fall into one cluster rather than another: a data is similar to one cluster while it is dissimilar to another. For a trend sequence, the cluster that is dissimilar to that sequence is called its dissimilar cluster. For a cluster, the notion dissim can also be used to identify the set of clusters that are dissimilar to the given cluster. A prediction of a trend can be made by (1) Intra-cluster Trend Prediction, which refers to the trends in the cluster containing that trend, and (2) Inter-cluster Trend Prediction, which refers to trends in a cluster that is dissimilar to that trend.
The contributions of this paper include (1) clustering by using not only similarity match but also dissimilarity match, which prevents false positives and false negatives; (2) prediction by using not only similar trend sequences but also dissimilar trend sequences; and (3) a bitmap approach that improves the performance of clustering and prediction.
Keywords: Trend similarity, Trend sequence, Bitmap Approach, Trend Clustering, Trend Prediction
1. INTRODUCTION
Discovering knowledge from time-series databases is a complex and multidimensional data process. Similar sequence patterns can be discovered from time-series databases [1,2,7,15,16,21]. Clustering similar patterns [8,22] and predicting the very next pattern [20] have been among the main concerns in knowledge discovery. The input to these algorithms is a set of time-series data, which is important in business as well as in scientific decision-support applications. The problem is to find time sequences (or subsequences) that are similar to a given sequence, or to find all pairs of similar sequences [1,2,7].
The earlier work can be classified into the following three approaches:
Time-domain approach: This approach handles time-series data in the time domain using techniques such as shifting, scaling, smoothing, and time warping [2]. With these techniques, similar patterns can be extracted and clustered. This approach is complex and yields low performance because it focuses on every single data point.
Frequency-domain approach: This approach uses the Discrete Fourier Transform (DFT) to map a time sequence to the frequency domain, takes the first few frequencies, and then uses those frequencies to index the sequence in an R*-tree [3] structure. It uses a sliding window over the data sequence with scaling of the magnitudes, and also identifies noise in sequences. Only a few frequencies are considered for data mining. The approach is less controllable by a user, and is therefore not well suited for finding patterns similar to user-given queries or for clustering time-series data sequences.
Qualitative approach: To resolve the drawbacks of the above two approaches, this approach, which is yet another time-domain approach, focuses only on meaningful data points rather than all data points [15,21]. This paper employs this approach to cluster trend sequences and to predict the very next trend indicator. These new terms are defined in the later sections.
By employing the qualitative approach, we propose in this paper the concept of the "trend" of time-series data sequences. The trend of a time-series data sequence is a higher-level description of the direction in which the original data sequence moves. The trend therefore indicates up, crossover, or down movements of sequences. One instance of such a trend is the original data sequence itself. For example, a stock sequence consists of selling prices over the course of time; the trend of the sequence may then describe n ups and m downs during the same period. For given time-series data sequences, as opposed to the previous approaches introduced above, the earlier work [21] 1) uses the smoothing technique of computing an l-term moving average, 2) uses the six trend indicators to single out a few points of the given sequence, and then 3) uses bitmap indexing to find similar sequences. The sequence of time-series data is called a data sequence, and the sequence of trends a trend sequence.
With this in mind, we propose a technique for clustering trend sequences by defining the new notion of "dissimilarity" in addition to the notion of "similarity." As a pre-process of clustering, similar trend patterns are extracted from the time-series database. We consider four types of matches: (1) Exact match, (2) Similarity match, (3) Exact match by shift, and (4) Similarity match by shift. The prediction of the very next trend indicator uses the earlier indicators of a trend in one cluster and those of another trend in a cluster that is dissimilar to the target trend.
The advantages of this paper are as follows:
Clustering by not only similarity but also dissimilarity. Clustering trend sequences takes into account not only similarity match but also dissimilarity match. In this way we prevent false positives and false negatives.
Correct prediction. The very next trend indicator is predicted by using not only similar trend sequences but also dissimilar trend sequences.
Fast processing. A bitmap approach is used to improve the performance of clustering and prediction.
The remainder of this paper is organized as follows. Section 2 describes related work. Section 3 describes preliminaries used by the new techniques we propose in this paper. Section 4 describes the technique for clustering trends. Section 5 describes the technique for predicting the very next trend indicator. Section 6 describes implementation details that demonstrate the concepts proposed in this paper. Section 7 presents our conclusion and future work.
2. RELATED WORK
Algorithms for discovering patterns from time-series databases can be classified into three approaches: (1) the first approach maps time-series data sequences into the frequency domain, (2) the second approach processes the time-series sequences directly in the time domain, and (3) a qualitative approach. The frequency-domain approach, pioneered by [1], in general computes a DFT (Discrete Fourier Transform) [9] for each sequence and selects the first few coefficients to index the respective original sequences. Sequences with matching coefficients are considered similar, or at least within a Euclidean distance bound, by Parseval's theorem [14]. The indexing mechanism is typically a multidimensional index structure such as the R-tree [9] or R*-tree [3]. Data mining from time-series data sequences has also been generalized to allow subsequence matching [7], extending [1]. In [16,19], moving average, time warping, and reversing are formulated, and the indexing methods are further examined for approximate subsequence matching. Further, [19] introduces a lower-bounding technique that can filter out sequences if they are not similar enough for time warping. This approach is inefficient because it deals with every single data point, and performance suffers accordingly. To resolve some difficulties of these approaches, a qualitative approach has been proposed: time-series data is still dealt with in the time domain, but only a few meaningful data points, rather than every single data point, are considered. This paper extends the qualitative approach to propose a new method of clustering trend indicators and predicting the next trend indicator.
The task considered in this paper is clustering, that is, grouping data sets into meaningful (or similar) subclasses. A number of clustering algorithms for large databases have been proposed [8,22]. Partitioning algorithms construct a partition of a database of n objects into a set of k clusters, where k is an input parameter. Each cluster is represented by the center of gravity of the cluster (k-means) or by one of the objects of the cluster located near its center (k-medoid) [11]. Hierarchical algorithms, on the other hand, create a hierarchical decomposition of a database. The decomposition is represented by a tree, called a dendrogram, which iteratively splits the database into smaller subsets until each subset consists of only one object. In such a hierarchy, each level of the tree represents a clustering of the database [4]. None of these algorithms is efficient on time-series databases. Therefore, focusing techniques have been proposed to increase the efficiency of clustering algorithms.
Recently, some work in the time domain has focused on a qualitative approach that deals with time-series data in the time domain but considers only meaningful data points rather than every single data point [15,21]. A meaningful data point may be an average, a peak, or a slope in data sequences [15]. Yoon et al. treat the notion of trend as a meaningful data point [21]. This paper again extends the notion of trend to clustering and prediction techniques. The prediction technique in this paper differs from the one in [20] in that it takes the relationship of dissimilar clusters into account.
Bitmap indexing has become a promising technique for pattern matching and data selection in data mining. Variations of bitmap indexes include bit-sliced indexes [5,13] and encoded bitmap indexes [18]. In this paper, we also use an encoded bitmap indexing technique to speed up the evaluation of selection and pattern-matching conditions. One difference from the previous approaches is that each "trend" obtained from time-series data consists of a varying number of attributes, so matching on one projected attribute does not make sense. Therefore, we propose a technique that considers the full, if not partial, sequence of trends for indexing and matching.
3. PRELIMINARIES
In this section, we consider data sequences in the time domain, taking not all data points but only trend indicators into account.
3.1 Sequence Smoothing by Moving Averages
Moving averages are widely used in stock data analysis [6]. Their primary use is to smooth out short-term fluctuations and depict the underlying trend of a stock. The computation is simple. Suppose that s_i = (v_i1 @ t_i1, v_i2 @ t_i2, ..., v_ii @ t_ii) denotes a sequence, where v_ik is a numeric value at time t_ik, as shown in Figure 1.
The l-term moving average of a sequence s = (v_1, ..., v_n) is computed as follows: the mean is computed for an l-term-wide window placed over the end of the sequence; this gives the moving average for day n - l/2. The subsequent values are obtained by stepping the window toward the beginning of the sequence, one day at a time. This produces a moving average of length n - l + 1.
Moving averages are the central parameter of analysis in Stock Trends, for example. It defines a trend by relating the current stock price of an equity to its historical price, as represented by a 13-week and a 40-week average price. To get a 13-week moving average, we add up the closing prices of the previous 12 weeks plus the current weekly closing price, then divide by 13. Some technical analysts have other ways of calculating averages, such as weighted and exponential moving averages, which give more weight to recent price movements.
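As an illustrative sketch (ours, in Python rather than the paper's Java implementation), the l-term moving average above can be computed in a few lines:

```python
def moving_average(values, l):
    """l-term moving average: the mean over each l-wide window,
    producing n - l + 1 smoothed values (Section 3.1)."""
    n = len(values)
    return [sum(values[i:i + l]) / l for i in range(n - l + 1)]

weekly_closes = [10, 11, 12, 11, 13, 14, 13]   # toy data, not from the paper
print(moving_average(weekly_closes, 3))        # 5 smoothed values
```

A 13-week average, as in the Stock Trends example, is simply `moving_average(closes, 13)`.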
3.2 Trend Indicators
As opposed to similar data patterns that consider each data point, only a limited number of meaningful data points, those that play a dominant role in forming a trend, are taken into account. These meaningful data points are called a trend sequence. Trend sequences can be defined in various ways (e.g., by averages, peaks, or slopes of lines). Of these, we focus on obtaining trend sequences by a data-smoothing technique [21].
A time-series data sequence may fluctuate if it is dynamic over the course of time. The trend sequence of a data sequence is defined as the interplay of the current sequence fluctuation with two moving averages: the short-term and long-term moving averages. If the data sequences are stock prices, the 13-week moving average and the 40-week moving average are compared to obtain the trend sequences (for more information, visit http://www.stocktrends.ca/handbook_1.html [12]). For a data sequence, we can obtain two smoothed sequences using two different moving averages.
A trend indicator indicates one of 6 degrees of difference between the two smoothed sequences. For a user-given threshold, two data sequences are evaluated to be similar or not, in general. These 6 factors are called trend indicators and are defined below.
Notice that in stock prices, the short-term price movement is represented by the interplay of a stock's current price with the 13-week moving average. More precisely, the data sequence in stock markets defines an "envelope" of 3% (or 5%, or 7%, depending on the analyst's input) around the 13-week moving average [12]. This envelope filters out some short-term price fluctuations by demanding a more definite penetration of the intermediate-term trend sequence.
For time t = 0, 1, 2, ..., n-1, let M_m(t) be the line for the m-term moving average and M_k(t) the line for the k-term moving average of a given data sequence s(t), where m < k. If m = 13 and k = 40, then, as in the stock market example, they are the 13-week moving average and the 40-week moving average. Assume that the size of the envelope is e. The 6 trend indicators are defined as follows:
Up Arrow (↑). Whenever the short-term moving average is above the long-term moving average, the data sequence is tagged with an up arrow. In this case the current price is still with the long-term trend and has not penetrated that smoothed sequence more than the envelope (e%) beyond the short-term moving average. That is the case where M_k(t) < M_m(t) and s(t) < M_k(t) - M_m(t)*e holds.
Up Arrow (⇑). Similar to the above, but the current value has penetrated a level e% beyond the short-term moving average. That is the case where either M_k(t) < M_m(t) and M_m(t) < s(t), or M_k(t) < M_m(t) and M_m(t) - M_m(t)*e < s(t), holds.
Solid Square (■). In the weeks where there is a crossover of the two moving averages, if the short-term moving average passes above the long-term moving average, the data sequence is tagged with the solid square. That is the case where M_k(t) < M_m(t) holds.
Empty Square (□). If M_m(t) < M_k(t) holds at a crossover, the data sequence is tagged with the empty square.
Down Arrow (↓). If either M_m(t) < M_k(t) and s(t) < M_m(t), or M_m(t) < M_k(t) and s(t) < M_m(t) + M_k(t)*e, holds, then the data sequence is tagged with this down arrow.
Down Arrow (⇓). Similar to the above, but the current price has penetrated a level e% beyond the short-term moving average. That is the case where M_m(t) < M_k(t) and M_m(t) + M_k(t)*e < s(t) holds.
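As an illustration only (the function name and the exact branch structure are ours; the inequalities follow the definitions above), the tagging of a single time step could be sketched as:

```python
def trend_indicator(s_t, mm_t, mk_t, e, crossover=False):
    """Tag one time step with one of the six trend indicators.
    s_t: current value; mm_t, mk_t: short- and long-term moving averages
    (m < k); e: envelope fraction; crossover: True on weeks where the two
    averages cross. Returns an index 0..5 in the order:
    up, up-penetrated, solid square, empty square, down, down-penetrated."""
    if crossover:
        return 2 if mk_t < mm_t else 3           # solid / empty square
    if mk_t < mm_t:                               # short-term above long-term
        # penetrated the envelope beyond the short-term moving average?
        if mm_t < s_t or mm_t - mm_t * e < s_t:
            return 1
        return 0
    else:                                         # short-term below long-term
        if mm_t + mk_t * e < s_t:
            return 5
        return 4
```

For example, `trend_indicator(12, 10, 8, 0.03)` tags a price above the short-term average with the penetrated up arrow (index 1).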
We define a trend sequence as follows:
Definition 3.1. A trend sequence is defined as a partial order of trend indicators, (I, <), where I denotes the set of the six trend indicators stated above, and < denotes a partial order amongst I.
A trend implies the direction in which a time-series data sequence moves. The direction of a data sequence is upward, downward, or a crossover between the long-term and short-term moving averages. Extracting trends from data sequences is an established practice in stock investment.
The advantage of using trend sequences instead of original data sequences is two-fold: (1) the smaller amount of data we must compare for similarity matching, and (2) the abstract level of sequence movement we obtain. With fewer data points we can still extract similar trend sequences [21].
3.3 Bitmap Indexing
A promising approach to selection and matching with complex queries in information processing is the use of bitmap indexing [5]. In a conventional B+-tree index or an R*-tree index, each distinct attribute value v is associated with a list of IDs of all the records associated with the attribute value v. The basic idea of a bitmap index is to replace the list of IDs with a bit vector (or bitmap). That is, a bitmap index is essentially a collection of bitmaps. The size of a bitmap is equal to the cardinality of the indexed records, and the i-th bit corresponds to the i-th record. In the simplest bitmap index design, the i-th bit of the bitmap associated with value v is set to 1 if and only if the i-th record has value v for the indexed attribute.

Figure 1: Time-Series Data Example

This simple bitmap makes data manipulation easy but requires somewhat more space than a so-called "encoded bitmap." The encoded bitmap requires a mapping table that interprets each record value as a bit pattern in the bitmap index.
Consider the trend indicators discussed in the previous subsection. Six indicators are considered. A simple bitmap index may require six bits to represent one indicator, as shown in Figure 2. However, with encoded bitmap indexing, five bits are enough to represent the six indicators, as shown in Figure 3.
As an example, consider four trend sequences s1, s2, s3, and s4. Figure 4 shows the simple bitmap index of these trend sequences using the mapping of Figure 2. Similarly, Figure 5 shows the encoded bitmap index of the same trend sequences using the mapping table of Figure 3.
A major advantage of bitmap indexes is that bitmap manipulations using bit-wise operators (AND, OR, XOR, NOT) are very efficiently supported by hardware. As an example in Figure 4, if we select or match with a query of bits "010000 100000", the outcome is obtained by applying the AND operator to the query and the rows of Figure 4; the answer is s2, s3, and s4. These indexes are very space-efficient, especially for attributes with low cardinality. The XOR operator is used to calculate a distance between two bitmaps, as defined below.
Figure 2: A simple bitmap requires six bits, and the distance between any two indicators is the same.

Trend Indicator | Bit setting
↑               | 100000
⇑               | 010000
■               | 001000
□               | 000100
↓               | 000010
⇓               | 000001

Figure 3: An encoded bitmap requires five bits, and the distance between two trend indicators can be obtained.

Trend Indicator | Bit setting
↑               | 00000
⇑               | 00001
■               | 00011
□               | 00111
↓               | 01111
⇓               | 11111
Figure 4. Simple Bitmap Index: each trend sequence s1 through s4 is stored as one row; every time position occupies six bits b0 through b5, with exactly one bit set per indicator.
Figure 5. Encoded Bitmap Index: the same trend sequences s1 through s4, with every time position occupying five bits b0 through b4 under the encoding of Figure 3.
Definition 3.2 (Size of Bitmap). |b_i| denotes the size of a bitmap b_i, which is the number of 1's in b_i, and [b_i] denotes the cardinality of b_i, which is the total number of bits (1's or 0's).
Definition 3.3 (Distance). The distance between two trend sequences is defined as dist(s_i, s_j) = |xOR(s_i, s_j)|, where xOR is the bit-wise exclusive-or operator.
For example, in the simple bitmap of Figure 2, the distance between any two trend indicators is not informative because it is always the same. In the encoded bitmap of Figure 3, however, a meaningful distance is obtained: the distance between ↑ and ⇑ is 1, while that between ↑ and ■ is 2. As another example, in Figure 4 the distance of the two trend sequences s1 and s2 is |xOR(s1, s2)| = 5. Notice that in a bitmap index, if a bit represents a word, the trend distance can still be obtained. Two trend sequences s_i and s_j are the same if |xOR(s_i, s_j)| = 0. Based on the notion of distance, we define the similarity that can be computed in a simple bitmap index and in an encoded bitmap index.
Definition 3.4 (Similarity in simple bitmap index). The similarity of two trend sequences (or bitmap rows) s_i and s_j is sim(s_i, s_j) = 1 - |xOR(s_i, s_j)| / (2*MAX([s_i], [s_j])). Two trend sequences s_i and s_j are θ-similar if sim(s_i, s_j) ≥ θ, where 0 ≤ θ ≤ 1 and θ is given.
Definition 3.5 (Similarity in encoded bitmap index). The similarity of two trend sequences (or bitmap rows) s_i and s_j is sim(s_i, s_j) = 1 - |xOR(s_i, s_j)| / MAX([s_i], [s_j]). Two trend sequences s_i and s_j are θ-similar if sim(s_i, s_j) ≥ θ, where 0 ≤ θ ≤ 1 and θ is given.
For example, in the simple bitmap index of Figure 4, the similarity between s1 and s2 is 1 - 10/(2*36) = 31/36, and that of s2 and s3 is again 31/36. They are the same, which does not reflect the difference visible in the figure. On the other hand, in the encoded bitmap index of Figure 5, the similarity between s1 and s2 is 1 - 17/30 = 13/30, and that of s2 and s3 is 1 - 5/30 = 25/30 = 5/6. That is, s2 is closer to s3 than to s1. Intuitively, the encoded bitmap index is more appropriate: encoded bitmap indexing distinguishes trend sequences more accurately.
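A minimal Python sketch of Definitions 3.3 and 3.5, assuming the five-bit encoding of Figure 3 (indicator number i is encoded as i one-bits; the helper names are ours):

```python
def encode(indicator, width=5):
    """Encoded bitmap of Figure 3: indicator 0..5 -> '00000'..'11111'."""
    return ('1' * indicator).rjust(width, '0')

def dist(si, sj):
    """Definition 3.3: number of 1's in the bit-wise XOR of two rows."""
    return sum(a != b for a, b in zip(si, sj))

def sim_encoded(si, sj):
    """Definition 3.5: 1 - |xOR(si, sj)| / MAX of the cardinalities."""
    return 1 - dist(si, sj) / max(len(si), len(sj))

# Two hypothetical four-indicator trend sequences (levels 0..5 per step)
s_i = ''.join(encode(v) for v in [1, 3, 3, 2])
s_j = ''.join(encode(v) for v in [1, 2, 3, 4])
print(dist(s_i, s_j), sim_encoded(s_i, s_j))
```

Because adjacent indicators differ by exactly one bit under this encoding, the XOR distance grows with how far apart the indicators are, which is the property the simple bitmap lacks.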
4. TREND CLUSTERING
In this section, we describe various similarity matches for trend sequences. Trend sequences can be in the same cluster if they are similar to one another. The notions of center and radius of a cluster are defined; based on these two notions, clusters can be merged or split.
4.1 Similarity Match
This section proposes four types of trend sequence match that are used to cluster trend sequences. We consider exact and similar matches, each with and without shift. Assume that s_i = (v_i1 @ t_i1, v_i2 @ t_i2, ..., v_ii @ t_ii) and s_j = (v_j1 @ t_j1, v_j2 @ t_j2, ..., v_jj @ t_jj), where each v_k is one of the six trend indicators and t_k is a time point (therefore, there is a temporal order between the t's). The following four types are illustrated in Figure 6 (a) through (d).
Exact match (without shifting). Two trend sequences s_i and s_j are exactly matched without shifting if all the following conditions hold: (1) the time ranges [t_i1, t_ii] and [t_j1, t_jj] overlap, i.e., [t_i1, t_ii] ∩ [t_j1, t_jj] ≠ ∅; and (2) during the overlapping time duration, the trend indicators v_ik and v_jk are the same, i.e., v_ik = v_jk for t_ik (= t_jk) ∈ [t_j1, t_jj].
Exact match by a shift. Two trend sequences s_i and s_j are exactly matched by shift if the following condition holds: the trend indicators v_ik and v_jk are the same no matter during what time period they occur, i.e., v_ik = v_jk for t_ik = t_jk + l ∈ [t_j1, t_jj], where l denotes an arbitrary time offset.
Similarity match (without shifting). Two trend sequences s_i and s_j are similarly matched without shifting if all the following conditions hold: (1) the time ranges [t_i1, t_ii] and [t_j1, t_jj] overlap, i.e., [t_i1, t_ii] ∩ [t_j1, t_jj] ≠ ∅; and (2) during the overlapping time duration, the trend indicators v_ik and v_jk are similar, i.e., sim(v_ik, v_jk) ≥ θ for t_ik (= t_jk) ∈ [t_j1, t_jj].
Similarity match by shift. Two trend sequences s_i and s_j are similarly matched by shift if the following condition holds: the trend indicators v_ik and v_jk are similar no matter during what time period they occur, i.e., sim(v_ik, v_jk) ≥ θ for t_ik = t_jk + l ∈ [t_j1, t_jj], where l denotes an arbitrary time offset.
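The four match types can be sketched as follows (a simplified illustration of ours, not the paper's implementation: each sequence is a dict from time point to indicator, and the per-indicator similarity function and threshold are passed in):

```python
def exact_match(si, sj, shift=0):
    """Exact match (shift=0) or exact match by shift l (shift=l):
    indicators must coincide on every overlapping, shifted time point."""
    common = [t for t in sj if t + shift in si]
    return bool(common) and all(si[t + shift] == sj[t] for t in common)

def similarity_match(si, sj, sim, theta, shift=0):
    """Similarity match (by shift): the per-indicator similarity must
    reach the threshold theta on every overlapping time point."""
    common = [t for t in sj if t + shift in si]
    return bool(common) and all(sim(si[t + shift], sj[t]) >= theta
                                for t in common)

s_i = {1: 0, 2: 2, 3: 3}
s_j = {0: 0, 1: 2, 2: 3}           # s_i delayed by one time step
print(exact_match(s_i, s_j))        # False: no exact match without shift
print(exact_match(s_i, s_j, 1))     # True: exact match by shift l = 1
```

Similarity match degenerates to exact match when `sim` returns 1 only for identical indicators and `theta` is 1.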
4.2 Center and Radius
This section describes two features of bitmap indexes: center and radius.
Definition 4.1 (Center). Each cluster is represented by the center of gravity of the cluster (as in k-means). In a cluster of trend sequences, the center is itself a trend sequence, i.e., a sequence of trend indicators.
Definition 4.2 (Radius). The radius of a cluster c is radius(c) = MAX(dist(s_c, s_j)), where s_c is the center of the bitmap index for the cluster and s_j is the bitmap of a trend sequence on the cluster boundary.
As an example in Figure 7, assume that s1, s2, s3, and s4 are in a cluster, say c1. The center for the first trend indicator in cluster c1 is 0.7 because (1+0+1)/3 = 0.7. The radius for the first trend indicator of c1 is radius(c1) = MAX{dist(s_c, s_j)} = MAX{dist(c1,s1), dist(c1,s3), dist(c1,s4)} = MAX{0.3, 0.7, 0.3} = 0.7. This can be applied to all indicators of the trend sequences in the cluster. The center of c1 is (00001 00111 00111 00111 00001) by rounding, and its radius is (.7 .7 .5 1.0 .7). The notions of center and radius of a cluster are incrementally computed by taking the states of the bitmap indexes into account.
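Treating each five-bit code as its count of one-bits (a level 0 to 5), the per-position center and radius of this example can be computed as below (our own sketch of the computation the example walks through; sequence layout follows cluster c1 of Figure 7):

```python
def center_and_radius(sequences):
    """Per time position: center = mean of the indicator levels of the
    sequences present there; radius = max |level - center| over them."""
    positions = sorted({t for s in sequences for t in s})
    center, radius = {}, {}
    for t in positions:
        levels = [s[t] for s in sequences if t in s]
        c = sum(levels) / len(levels)
        center[t] = c
        radius[t] = max(abs(v - c) for v in levels)
    return center, radius

# Cluster c1 of Figure 7 (levels = number of 1's in each 5-bit code)
s1 = {1: 1, 2: 3, 3: 3, 4: 2}
s2 = {2: 3, 3: 2, 4: 4, 5: 1}
s3 = {1: 0, 2: 2, 3: 3, 4: 2, 5: 1}
s4 = {1: 1, 2: 3, 3: 2, 4: 4, 5: 2}
center, radius = center_and_radius([s1, s2, s3, s4])
print(round(center[1], 1), round(radius[1], 1))   # 0.7 0.7
```

Both quantities update cheaply as sequences are added, which is what makes the incremental computation mentioned above practical.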
4.3 Merge and Split
This subsection describes two methods useful for clustering: merge and split. Two given clusters can be merged if they are similar enough to be in one cluster. By the same token, a given cluster can be split into two or more sub-clusters if its density is not uni-centric. The criteria that trigger either the merge or the split method are (1) the radius of the given clusters, and (2) the number of trend sequences in the clusters.
For example, in Figure 7, the cluster c1 may be split with respect to the maximum radius. In the cluster, the trend indicator at time t4 has the maximum radius 1.0. Based upon this trend indicator, c1 is split into two clusters: the trend sequences with the 00011 bits at t4 and those with the 01111 bits at t4, that is, c1: {s1, s3} and c1': {s2, s4}.
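The split step of this example can be sketched as follows (our own illustration; the grouping key is the indicator level at the position of maximum radius):

```python
def split_by_max_radius(sequences, radius):
    """Split a cluster at the time position with the largest radius,
    grouping the sequences by their indicator level at that position."""
    t_split = max(radius, key=radius.get)
    groups = {}
    for s in sequences:
        groups.setdefault(s.get(t_split), []).append(s)
    return list(groups.values())

# Cluster c1 of Figure 7; its radius (.7 .7 .5 1.0 .7) peaks at t4
s1 = {1: 1, 2: 3, 3: 3, 4: 2}
s2 = {2: 3, 3: 2, 4: 4, 5: 1}
s3 = {1: 0, 2: 2, 3: 3, 4: 2, 5: 1}
s4 = {1: 1, 2: 3, 3: 2, 4: 4, 5: 2}
radius = {1: .7, 2: .7, 3: .5, 4: 1.0, 5: .7}
parts = split_by_max_radius([s1, s2, s3, s4], radius)
print(len(parts))   # 2: {s1, s3} at level 2 and {s2, s4} at level 4
```

This reproduces the example's split into c1 = {s1, s3} and c1' = {s2, s4}.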
4.4 Two Approaches for Clustering
There are two approaches to clustering: top-down and bottom-up.
Top-down approach. The top-down approach uses the split method described above to partition a given cluster until each sub-cluster satisfies the thresholds of the two criteria. Notice that the thresholds of the two criteria are the radius of the cluster and the number of trend sequences.
Figure 6: Four Types of Trend Sequence Match: (a) Exact match, (b) Exact match by shift, (c) Similarity match, (d) Similarity match by shift. In (b) and (d), s_j is compared over the shifted interval [t_j1 + l, t_jj + l].
Bottom-up approach. The bottom-up approach uses the merge method described above to combine given clusters until the merged cluster satisfies the two thresholds.
5. TREND PREDICTION
In this section, we describe a new method of predicting the very next trend indicator. The prediction is based on the trend information already available, which may come from within a cluster or from outside the cluster. Depending on the situation, we classify two types of trend prediction: Intra-cluster Trend Prediction and Inter-cluster Trend Prediction. To describe these two types, consider the clusters example in Figure 7, and assume that clusters c2 and c3 are dissimilar.
Intra-cluster Trend Prediction. The very next trend indicator is predicted by using information within the cluster that contains the target trend sequence. Assume that we want to predict a trend indicator at time t+1, and let the center of the cluster at time t be c. Then the following rules can be used for the prediction.
o If the target trend sequence is time-lagged, then the very next trend indicator is simply c. As an example in Figure 7, the very next trend indicator of the trend sequence S1 in cluster c1 is 00001 at time t5.
o If the target trend sequence is not time-lagged, then the prediction can be obtained by the traditional method [10]: a trend indicator v_{t+1} at time t+1 is obtained from the center c at time t by compensating for differences from the previous trend indicators with an error ε. However, our proposed solution for this case is to combine this prediction with the one obtained from external cluster information.
Inter-cluster Trend Prediction. The very next trend indicator is predicted by using information from outside the cluster if an outside cluster is dissimilar to the target cluster. A "dissimilarity threshold" is given, just as the similarity threshold is given. Two clusters are dissimilar if the distance between them is greater than or equal to the given dissimilarity threshold. For example, clusters c2 and c3 in Figure 7 are dissimilar to each other if the dissimilarity threshold is 50%
time      t1      t2        t3        t4        t5
Cluster c1
S1:       00001   00111     00111     00011
S2:               00111     00011     01111     00001
S3:       00000   00011     00111     00011     00001
S4:       00001   00111     00011     01111     00011
Center:   2/3=.7  11/4=2.7  5/2=2.5   12/4=3    4/3=1.3
          00001   00111     00111     00111     00001
Radius:   .7      .7        .5        1.0       .7
Cluster c2
S6:       00000   00011     11111     11111
S7:               00001     01111     01111
S8:       00000   00011     01111     11111
Center:   0       5/3=1.7   13/3=4.3  14/3=4.7
          00000   00001     01111     11111
Radius:   0       .7        .7        .7
Cluster c3
S10:      11111   01111     00000     00001     11111
S12:              01111     00001     00000     01111
Center:   5       4         1/2=.5    1/2=.5    9/2=4.5
          11111   01111     00001     00001     11111
Radius:   0       0         .5        .5        .5

Figure 7: Clusters Example
The intra-cluster prediction formula is of the form
v_{t+1} = c_t + α_1 (c_t - c_{t-1}) + α_2 (c_{t-1} - c_{t-2}) + ... + ε,
where c_t is the center of the cluster at time t, the α's weight the compensating differences of the previous trend indicators, and ε is an error term.
(or 0.5). Assume that we want to predict a trend indicator in the cluster c2 at time t+1. Let the center of cluster c2 at time t be c, and let the center of cluster c3 at time t be d. Then the following rules can be used for the prediction.
o If the cluster c2 is time-lagged, then the very next trend indicator is simply d. As an example in Figure 7, the very next trend indicator of a trend sequence, say S8, is simply far different from 11111 at time t5. Therefore, S8 is expected to be 00000 00011 01111 11111 11111.
o If the target cluster c2 is not time-lagged, then the prediction should combine the information of the trend sequences in the same cluster with the information of the trend sequences in the dissimilar cluster. The proposed formula is as follows: a trend indicator v_{t+1} at time t+1 is obtained from the center d of the dissimilar cluster at time t by compensating for differences from the previous trend indicators with an error ε. The center of the dissimilar cluster is subtracted from 5 because each trend indicator may be opposite in value.
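A sketch of the time-lagged cases of both prediction types (our own illustration; indicator levels are one-bit counts 0 to 5, and the inversion `5 - d` follows the subtraction from 5 described above):

```python
def predict_intra(center, t_next):
    """Intra-cluster, time-lagged case: the next indicator level is
    simply the cluster center at the target time."""
    return round(center[t_next])

def predict_inter(dissimilar_center, t_next):
    """Inter-cluster, time-lagged case: invert the dissimilar cluster's
    center (5 - d), since each indicator may be opposite in value."""
    return 5 - round(dissimilar_center[t_next])

# Per-position center levels of clusters c1 and c3 from Figure 7
c1_center = {1: 0.7, 2: 2.7, 3: 2.5, 4: 3.0, 5: 1.3}
c3_center = {1: 5.0, 2: 4.0, 3: 0.5, 4: 0.5, 5: 4.5}
print(predict_intra(c1_center, 5))   # 1, i.e. code 00001 as in the example
print(predict_inter(c3_center, 2))   # 1, the inverse of c3's high level
```

The non-time-lagged cases would additionally apply the compensating-difference terms of the prediction formulas.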
6. IMPLEMENTATION
The concepts described in this paper are in part implemented in Java. All processes operate user-interactively. Figure 8 shows the user-interactive clustering tool. Time-series data sets can be read and smoothed by moving averages, with the short-term and long-term time periods given. Then trend indicators are obtained and converted into bitmap indexes. The four types of matches are applied in this figure as well. Figure 9 displays the outcomes of the clustering process.
Figure 8: Trend Sequence Clustering Tool
The inter-cluster prediction formula is of the form
v_{t+1} = (5 - d_t) + α_1 (d_t - d_{t-1}) + α_2 (d_{t-1} - d_{t-2}) + ... + ε,
where d_t is the center of the dissimilar cluster at time t.
7. CONCLUSION
This paper described a method of clustering trend sequences and predicting the very next trend indicator for time-series databases. We classified three approaches to handling time-series databases. As a qualitative approach, we defined the notion of "trend." Instead of data sequences, we dealt with trend sequences. Trend sequences are indexed using a bitmap. Using bit-wise operations for distance, similarity, center, and radius, similar trend sequences are clustered. Among the resulting clusters, we also developed a way of predicting the very next trend indicator. Two approaches were investigated: Intra-cluster trend prediction and Inter-cluster trend prediction.
The contribution of this paper includes (1) clustering by using not only similarity match but also dissimilarity match; in this way we prevent positive and negative failures; (2) prediction by using not only similar trend sequences but also dissimilar trend sequences; and (3) a bitmap approach that can improve the performance of clustering and prediction.
REFERENCES
1. R. Agrawal, C. Faloutsos, and A. Swami, "Efficient Similarity Search in Sequence Databases," in Proc. of the 4th Int'l Conf. on Foundations of Data Organization and Algorithms, 1993.
2. R. Agrawal, K. Lin, H. Sawhney, and K. Shim, "Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases," in Proc. of the VLDB Conf., 490-501, 1995.
3. N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger, "The R*-tree: an Efficient and Robust Access Method for Points and Rectangles," in Proc. of the ACM SIGMOD Conf. on Management of Data, 322-331, 1990.
4. A. Bouguettaya, "On-Line Clustering," IEEE Transactions on Knowledge and Data Engineering, Vol. 8, 333-339, 1996.
5. C. Chan and Y. Ioannidis, "An Efficient Bitmap Encoding Scheme for Selection Queries," in Proc. of the ACM SIGMOD Conf. on Management of Data, 215-226, 1999.
6. R. Edwards and J. Magee, Technical Analysis of Stock Trends, John Magee, Springfield, Massachusetts, 1969.
7. C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, "Fast Subsequence Matching in Time-Series Databases," in Proc. of the ACM SIGMOD Conf. on Management of Data, 419-429, 1994.
Figure 9: Outcome of Clustering
8. S. Guha, R. Rastogi, and K. Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes," IEEE Conf. on Data Engineering, 1999.
9. A. Guttman, "R-tree: a Dynamic Index Structure for Spatial Searching," in Proc. of the ACM SIGMOD Conf. on Management of Data, 45-57, 1984.
10. J. Hamilton, Time Series Analysis, Princeton University Press, 1994.
11. L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.
12. S. Kortje, Stock Trends - A Handbook for Investors, http://www.stocktrends.ca, 1998.
13. P. O'Neil and D. Quass, "Improved Query Performance with Variant Indexes," in Proc. of the ACM SIGMOD Conf. on Management of Data, 1997.
14. A. Oppenheim and R. Schafer, Digital Signal Processing, Prentice-Hall, 1975.
15. C. Perng, H. Wang, S. Zhang, and D. Parker, "Landmarks: A New Model for Similarity-Based Pattern Querying in Time Series Databases," IEEE Conf. on Data Engineering, 33-44, 2000.
16. D. Rafiei and A. Mendelzon, "Similarity-Based Queries for Time-Series Data," in Proc. of the ACM SIGMOD Conf. on Management of Data, 13-23, 1997.
17. R. Srikant and R. Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements," in Proc. of the Conf. on Extending Database Technology, 1996.
18. M. Wu, "Query Optimization for Selections using Bitmaps," Tech. Report, CS Dept., Technische Universitat Darmstadt, 1998.
19. B. Yi, H. Jagadish, and C. Faloutsos, "Efficient Retrieval of Similar Time Sequences under Time Warping," in Proc. of the 14th Int'l Conf. on Data Engineering, 1998.
20. B. Yi, N. Sidiropoulos, T. Johnson, H. Jagadish, C. Faloutsos, and A. Biliris, "Online Data Mining for Co-Evolving Time Sequences," IEEE Conf. on Data Engineering, 13-22, 2000.
21. J. Yoon, J. Lee, and S. Kim, "Trend Similarity and Prediction in Time-Series Databases," SPIE Conference on Data Mining and Knowledge Discovery: Theory, Tools, and Technology II, 201-212, 2000.
22. T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases," in Proc. of the ACM SIGMOD Conf. on Management of Data, 1996.