A Bitmap Approach to Trend Clustering for Prediction in Time-Series Databases


Jong P. Yoon (1), Yixin Luo (2), and Junghyun Nam (1)

(1) University of Louisiana, Center for Advanced Computer Studies, Lafayette, LA 70504-4330
(2) Southern University, Computer Science Department, Baton Rouge, LA 70813


ABSTRACT

This paper describes a bitmap approach to clustering and prediction of trends in time-series databases. Similar trend patterns, rather than similar data patterns, are extracted from a time-series database. We consider four types of matches: (1) Exact match, (2) Similarity match, (3) Exact match by shift, and (4) Similarity match by shift. Each pair of time-series data may be matched in one of these four types if the pair is similar, by a similarity (or sim) notion over a threshold. Matched data can be clustered in the same way. To improve performance, we use the notion of the center of a cluster. The radius of a cluster is used to determine whether a given time-series data sequence is included in the cluster. We also use a new notion of dissimilarity, called dissim, to make clusters more accurate. Using both notions, sim and dissim, a time-series data sequence is more likely to fall into one cluster rather than another: a sequence is similar to one cluster while it is dissimilar to another. For a trend sequence, the cluster that is dissimilar to that sequence is called its dissimilar-cluster. For a cluster, the notion dissim can also be used to identify the set of clusters that are dissimilar to the given cluster. A prediction of a trend can be made by (1) Intra-cluster Trend Prediction, which refers to the trends in the cluster to which that trend belongs, and (2) Inter-cluster Trend Prediction, which refers to trends in a cluster that is dissimilar to that trend.

The contribution of this paper includes (1) clustering by using not only similarity match but also dissimilarity match; in this way we prevent false positives and false negatives; (2) prediction by using not only similar trend sequences but also dissimilar trend sequences; and (3) a bitmap approach that improves the performance of clustering and prediction.


Keywords: Trend similarity, Trend sequence, Bitmap approach, Trend clustering, Trend prediction


1. INTRODUCTION


Discovering knowledge from time-series databases is a complex and multidimensional data process. Similar sequence patterns can be discovered from time-series databases [1,2,7,15,16,21]. Clustering similar patterns [8,22] and predicting the very next pattern [20] have been among the main concerns in knowledge discovery. The input data for these algorithms is a set of time-series data, which is important in business as well as in scientific decision-support applications. The problem is to find time sequences (or subsequences) that are similar to a given sequence, or to find all pairs of similar sequences [1,2,7]. The earlier work can be classified into the following three approaches:



- Time-domain approach: This approach handles time-series data in the time domain by techniques such as shifting, scaling, smoothing, and time warping [2]. Using these techniques, similar patterns can be extracted and clustered. This approach is complex and yields low performance because it focuses on every single data point.

- Frequency-domain approach: This approach uses the Discrete Fourier Transform (DFT) to map a time sequence to the frequency domain, takes the first few frequencies, and then uses those frequencies to index the sequence in an R*-tree [3] structure. It uses a sliding window over the data sequence with scaling of the magnitude of sequences, and also identifies noise in sequences. Only a few frequencies are considered for data mining. This approach is less controllable by a user, and is therefore not well suited for finding similar patterns with user-given queries or for clustering time-series data sequences.

- Qualitative approach: To resolve the drawbacks of the above two approaches, this approach, which is yet another time-domain approach, focuses only on meaningful data points rather than all data points [15,21]. This paper employs this approach to cluster trend sequences and predict the very next trend indicator. These new terms are defined in later sections.

By employing the qualitative approach, we propose in this paper the concept of the trend of time-series data sequences. The trend of a time-series data sequence is a higher-level description of the direction in which the original data sequence moves. The trend therefore indicates up, crossover, or down movements of sequences. One instance of such a trend is the original data sequence itself. For example, a stock sequence consists of selling-price values over the course of time; the trend of the sequence may then describe n ups and m downs during the same period. For given time-series data sequences, as opposed to the previous approaches introduced above, the earlier work [21] 1) uses the smoothing technique of computing an l-term moving average, 2) uses the six trend indicators to single out a few points of the given sequence, and then 3) uses bitmap indexing to find similar sequences. The sequence of time-series data is called a data sequence, and the sequence of trends a trend sequence.

With this in mind, we propose a technique for clustering trend sequences by defining the new notion of "dissimilarity" in addition to the notion of "similarity." As a pre-process of clustering, similar trend patterns are extracted from the time-series database. We consider four types of matches: (1) Exact match, (2) Similarity match, (3) Exact match by shift, and (4) Similarity match by shift. The prediction of the very next trend indicator uses the earlier indicators of a trend in one cluster and those of another trend in the cluster that is dissimilar to the target trend.

The advantages of this paper are as follows:

- Clustering by not only similarity but also dissimilarity. Clustering trend sequences takes into account not only similarity match but also dissimilarity match. In this way we prevent false positives and false negatives.

- Correct prediction. The very next trend indicator is predicted by using not only similar trend sequences but also dissimilar trend sequences.

- Fast processing. A bitmap approach is used to improve the performance of clustering and prediction.

The remainder of this paper is organized as follows. Section 2 describes related work. Section 3 describes preliminaries used by the new techniques we propose in this paper. Section 4 describes the technique for clustering trends. Section 5 describes the technique for predicting the very next trend indicator. Section 6 describes implementation details that demonstrate the concepts proposed in this paper. Section 7 describes our conclusion and future work.


2. RELATED WORK


Many algorithms for discovering patterns from time-series databases can be classified into three approaches: (1) the first approach maps time-series data sequences into the frequency domain, (2) the second approach processes the time-series sequences directly in the time domain, and (3) the third is a qualitative approach. The frequency-domain approach, pioneered by [1], in general computes a DFT (Discrete Fourier Transform) [9] for each sequence and selects the first few coefficients to index their respective original sequences. Sequences with matching coefficients are considered similar, or at least within a Euclidean distance bound by Parseval's theorem [14]. The indexing mechanism is typically a multidimensional index structure such as an R-tree [9] or R*-tree [3]. Data mining from time-series data sequences has also been generalized beyond [1] to allow subsequence matching [7]. In [16,19], moving average, time warping, and reversing are formulated, and the indexing methods are further examined for approximate subsequence matching. Further, [19] introduces a lower-bounding technique that can filter out sequences if they are not similar enough for time warping. This approach is not efficient because it deals with every single data point, with lower performance as a result. To resolve some difficulties of these approaches, a qualitative approach has been proposed: while time-series data is still dealt with in the time domain, only a few meaningful data points, rather than every single data point, are considered. This paper extends the qualitative approach to propose a new method of clustering trend indicators and predicting the next trend indicator.

The task considered in this paper is clustering, that is, grouping data sets into meaningful (or similar) subclasses. A number of clustering algorithms for large databases have been proposed [8,22]. Partitioning algorithms construct a partition of a database of n objects into a set of k clusters, where k is an input parameter. Each cluster is represented by the center of gravity of the cluster (k-means) or by one of the objects of the cluster located near its center (k-medoid) [11]. Hierarchical algorithms, on the other hand, create a hierarchical decomposition of a database. The hierarchical decomposition is represented by a tree, called a dendrogram, which iteratively splits a database into smaller subsets until each subset consists of only one object. In such a hierarchy, each level of the tree represents a clustering of the database [4]. None of these algorithms is efficient on time-series databases. Therefore, some focusing techniques have been proposed to increase the efficiency of clustering algorithms.

Recently, some work in the time domain has focused on a qualitative approach that deals with time-series data in the time domain but considers not every single data point, only meaningful data points [15,21]. A meaningful data point may be an average, a peak, or a slope in a data sequence [15]. Yoon et al. treat the notion of trend as a meaningful data point [21]. This paper extends the notion of trend to clustering and prediction techniques. The prediction technique in this paper differs from the one in [20] in that it takes the relationship of dissimilar clusters into account.

Bitmap indexing has become a promising technique for pattern matching and data selection in data mining. Variations of bitmap indexes include bit-sliced indexes [5,13] and encoded bitmap indexes [18]. In this paper, we also use an encoded bitmap indexing technique to speed up the evaluation of selection and pattern-matching conditions. One difference from the previous approaches is that each trend obtained from time-series data consists of a different number of attributes, and matching on one projected attribute does not make sense. Therefore, we propose a technique that considers the full, if not partial, sequence of trends for indexing and matching.


3. PRELIMINARIES


In this section, we consider data sequences in the time domain, taking into account not all data points but only the trend indicators.


3.1 Sequence Smoothing by Moving Averages


Moving averages are widely used in stock data analysis [6]. Their primary use is to smooth out short-term fluctuations and depict the underlying trend of a stock. The computation is simple: suppose that s_i = (v_i1 @ t_i1, v_i2 @ t_i2, ..., v_ii @ t_ii) denotes a sequence, where v_ik is a numeric value at time t_ik, as shown in Figure 1. The l-term moving average of a sequence s = (v_1, ..., v_n) is computed as follows: the mean is computed for an l-term-wide window placed over the end of the sequence; this gives the moving average for day n - l/2; the subsequent values are obtained by stepping the window toward the beginning of the sequence, one day at a time. This produces a moving average of length n - l + 1.

Moving averages are the central parameter of analysis in Stock Trends, as an example. A trend is defined by relating the current stock price of an equity to its historical price, as represented by a 13-week and a 40-week average price. To get a 13-week moving average, we add up the closing prices of the previous 12 weeks plus the current weekly closing price, then divide by 13. Some technical analysts have other ways of calculating averages, such as weighted and exponential moving averages, which give more weight to recent price movements.
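As an illustration, the l-term moving average described above can be computed with a short sliding-window routine. This is a minimal sketch, not the authors' implementation, and the sample prices are invented:

```python
def moving_average(values, l):
    """Return the l-term moving average of a sequence.

    Produces len(values) - l + 1 smoothed points, each the mean of an
    l-wide window stepped through the sequence one step at a time.
    """
    if l <= 0 or l > len(values):
        raise ValueError("window length must be in 1..len(values)")
    return [sum(values[i:i + l]) / l for i in range(len(values) - l + 1)]

# Hypothetical weekly closes smoothed with a short (3-term) window.
closes = [10.0, 11.0, 12.0, 11.0, 10.0, 12.0, 14.0]
print(moving_average(closes, 3))  # 7 - 3 + 1 = 5 smoothed points
```

A 13-week average of weekly closes would use `l=13` in the same way.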


3.2 Trend Indicators


As opposed to similar data patterns that consider each data point, we take into account a limited number of meaningful data points, namely those that play a dominant role in forming a trend. These meaningful data points are called a trend sequence. Trend sequences can be defined in various ways (e.g., by averages, peaks, or slopes of lines). Of the many possibilities, we focus on obtaining trend sequences by a data-smoothing technique [21].

A time-series data sequence may fluctuate if it is dynamic over the course of time. The trend sequence of a data sequence is defined by the interplay of the current sequence fluctuation with two moving averages: a short-term and a long-term moving average. If the data sequences are stock prices, the 13-week moving average and the 40-week moving average are compared to obtain the trend sequences (for more information, visit http://www.stocktrends.ca/handbook_1.html [12]). For a data sequence, we can obtain two smoothed sequences using two different moving averages. A trend indicator indicates one of 6 degrees of difference between the two trend sequences. For a user-given threshold, two data sequences are evaluated to be similar or not, in general. These 6 factors are called trend indicators and are defined below.

Notice that in stock prices, the short-term price movement is represented by the interplay of a stock's current price with the 13-week moving average. More precisely, the data sequence in stock markets uses an "envelope" of 3% (or 5%, or 7%, depending on the analyst's input) around the 13-week moving average [12]. This envelope filters out some short-term price fluctuations by demanding a more definite penetration of the intermediate-term trend sequence.

For time t = 0, 1, 2, ..., n-1, let M_m(t) be the line for the m-term moving average and M_k(t) the line for the k-term moving average of a given data sequence s(t), where m < k. If m = 13 and k = 40, then, as in the stock market example, they are the 13-week moving average and the 40-week moving average. Assume that the size of the envelope is e. The 6 trend indicators are defined as follows:



- Up Arrow. Whenever the short-term moving average is above the long-term moving average, the data sequence is tagged with an up arrow. In this case, the current price is still with the long-term trend and has not penetrated that smoothed sequence more than the envelope (e%) beyond the short-term moving average. That is the case where M_k(t) < M_m(t) and s(t) < M_k(t) - M_m(t)*e holds.

- Up Arrow (penetrating). Similar to the above, but the current value has penetrated a level e% beyond the short-term moving average. That is the case where either M_k(t) < M_m(t) and M_m(t) < s(t), or M_k(t) < M_m(t) and M_m(t) - M_m(t)*e < s(t), holds.

- Solid Square. On the weeks where there is a crossover of the two different-term moving averages, if the short-term moving average passes above the long-term moving average, the data sequence is tagged with the solid square. That is the case where M_k(t) < M_m(t) holds.

- Empty Square. If M_m(t) < M_k(t) holds, the data sequence is tagged with the empty square.

- Down Arrow. If either M_m(t) < M_k(t) and s(t) < M_m(t), or M_m(t) < M_k(t) and s(t) < M_m(t) + M_k(t)*e, holds, then the data sequence is tagged with this down arrow.

- Down Arrow (penetrating). Similar to the above, but the current price has penetrated a level e% beyond the short-term moving average. That is the case where M_m(t) < M_k(t) and M_m(t) + M_k(t)*e < s(t) holds.
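The six cases above can be sketched as a small classifier. This is a simplified reading of the definitions, not the authors' code: it treats the envelope as a band of e% around the short-term moving average Mm, collapses the two square (crossover) cases into one label, and uses invented label strings:

```python
def trend_indicator(s_t, Mm_t, Mk_t, e):
    """Classify one time step into a trend-indicator label.

    Simplified reading of Section 3.2: Mm_t is the short-term moving
    average, Mk_t the long-term one, s_t the current value, and e the
    envelope fraction (e.g. 0.03 for a 3% envelope).
    """
    if Mk_t < Mm_t:                        # short-term above long-term
        if s_t > Mm_t * (1 + e):
            return "up-arrow-penetrating"  # value beyond the envelope
        return "up-arrow"
    if Mm_t < Mk_t:                        # short-term below long-term
        if s_t < Mm_t * (1 - e):
            return "down-arrow-penetrating"
        return "down-arrow"
    return "crossover"                     # square cases (averages meet)

print(trend_indicator(105.0, 100.0, 95.0, 0.03))  # up-arrow-penetrating
```

A real implementation would detect crossovers by comparing consecutive time steps rather than by equality of the two averages.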


We define a trend sequence as follows:

Definition 3.1. A trend sequence is defined as a partial order of trend indicators, (Σ, <), where Σ denotes the set of six trend indicators stated above, and < denotes a partial order amongst Σ.


A trend implies the direction in which a time-series data sequence moves. The direction of a data sequence is up-ward, down-ward, or a crossover between the long-term and short-term moving averages. Extracting trends from data sequences is commonly done in the area of stock investment.

The advantage of using trend sequences instead of original data sequences is two-fold: (1) a smaller amount of data to compare for similarity matching, and (2) an abstract-level view of sequence movements. With fewer data points we can still extract similar trend sequences [21].


3.3 Bitmap Indexing


A promising approach to selection and matching with complex queries in information processing is the use of bitmap indexing [5]. In a conventional B+-tree index or an R*-tree index, each distinct attribute value v is associated with a list of the IDs of all the records having the attribute value v. The basic idea of a bitmap index is to replace the list of IDs with a bit vector (or bitmap); that is, a bitmap index is essentially a collection of bitmaps. The size of a bitmap is equal to the cardinality of the indexed records, and the i-th bit corresponds to the i-th record. In the simplest bitmap index design, the i-th bit of the bitmap associated with value v is set to 1 if and only if the i-th record has value v for the indexed attribute. This simple bitmap makes data easy to manipulate but requires somewhat more space than a so-called "encoded bitmap." The encoded bitmap requires a mapping table that interprets a record value as a bit pattern in the bitmap index.

Figure 1: Time-Series Data Example

Consider the trend indicators discussed in the previous subsection. Six indicators are considered. A simple bitmap index may require six bits to represent one indicator, as shown in Figure 2. However, with encoded bitmap indexing, five bits are enough to represent the six indicators, as shown in Figure 3.

For example, consider the following trend sequences:

s1 = (…), s2 = (…), s3 = (…), s4 = (…)

Figure 4 shows the simple bitmap index of the above trend sequences using the mapping shown in Figure 2. Similarly, Figure 5 shows the encoded bitmap index of the same trend sequences using the mapping table shown in Figure 3.

A major advantage of bitmap indexes is that bitmap manipulations using bit-wise operators (AND, OR, XOR, NOT) are very efficiently supported by hardware. As an example in Figure 4, if we select or match with a query of bits "010000 100000", then the outcome results from applying the AND operator to the query and Figure 4; the answer is s2, s3, and s4. These indexes are very space-efficient, especially for attributes with low cardinality. The XOR operator is used to calculate a distance between two bitmaps, as defined below.

Trend Indicator            Bit setting
Up Arrow                   100000
Up Arrow (penetrating)     010000
Solid Square               001000
Empty Square               000100
Down Arrow                 000010
Down Arrow (penetrating)   000001

Figure 2: A simple bitmap requires six bits, and the distance between any two indicators is the same.

Trend Indicator            Bit setting
Up Arrow                   00000
Up Arrow (penetrating)     00001
Solid Square               00011
Empty Square               00111
Down Arrow                 01111
Down Arrow (penetrating)   11111

Figure 3: An encoded bitmap requires five bits, and the distance between any two trend indicators can be obtained.
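The two mapping tables can be written down directly. A sketch, with hypothetical indicator names standing in for the glyphs:

```python
# Simple (one-hot, six-bit) and encoded (five-bit "thermometer") codes for
# the six trend indicators, as in Figures 2 and 3. The names are invented
# placeholders for the arrow and square glyphs.
INDICATORS = ["up", "up-pen", "solid-sq", "empty-sq", "down", "down-pen"]

SIMPLE = {ind: "0" * i + "1" + "0" * (5 - i) for i, ind in enumerate(INDICATORS)}
ENCODED = {ind: "0" * (5 - i) + "1" * i for i, ind in enumerate(INDICATORS)}

print(SIMPLE["up"])      # 100000
print(ENCODED["down"])   # 01111
```

With the simple code, any two distinct indicators differ in exactly two bit positions; with the encoded code, the number of differing bits grows with how far apart the indicators are in the table.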


Figure 4. Simple Bitmap Index (six bits per trend indicator, one row per trend sequence s1 through s4)

Figure 5. Encoded Bitmap Index (five bits per trend indicator, one row per trend sequence s1 through s4)


Definition 3.2 (Size of Bitmap). |b_i| denotes the size of a bitmap b_i, which is the number of 1's in the bitmap, and [b_i] denotes the cardinality of the bitmap b_i, which is the total number of bits (1's and 0's).


Definition 3.3 (Distance). The distance between two trend sequences is defined as dist(s_i, s_j) = |xOR(s_i, s_j)|, where xOR is the bit-wise exclusive-or operator.




For example, in a simple bitmap as shown in Figure 2, the distance between any two trend indicators is not properly obtained because it is always the same. However, in an encoded bitmap as shown in Figure 3, meaningful distances are obtained: the distance between two adjacent indicators in the table is 1, while the distance between indicators two steps apart is 2. As another example, in Figure 4 the distance between the two trend sequences s1 and s2 is |xOR(s1, s2)| = 5. Notice that in a bitmap index, if a bit represents a word, the trend distance can still be obtained. Two trend sequences s_i and s_j are the same if |xOR(s_i, s_j)| = 0. Based on the notion of distance, we define the similarity that can be computed in a simple bitmap index and in an encoded bitmap index.

Definition 3.4 (Similarity in simple bitmap index). The similarity of two trend sequences (or bitmap rows) s_i and s_j is sim(s_i, s_j) = 1 - |xOR(s_i, s_j)| / (2 * MAX([s_i], [s_j])). Two trend sequences s_i and s_j are ε-similar if sim(s_i, s_j) ≥ ε, where 0 ≤ ε ≤ 1 and ε is given.






Definition 3.5 (Similarity in encoded bitmap index). The similarity of two trend sequences (or bitmap rows) s_i and s_j is sim(s_i, s_j) = 1 - |xOR(s_i, s_j)| / MAX([s_i], [s_j]). Two trend sequences s_i and s_j are ε-similar if sim(s_i, s_j) ≥ ε, where 0 ≤ ε ≤ 1 and ε is given.




For example, in the simple bitmap index as in Figure 4, the similarity between s1 and s2 is 1 - 10/(2*36) = 31/36, and that of s2 and s3 is again 31/36. They are the same, which does not reflect the figure. On the other hand, in the encoded bitmap index as in Figure 5, the similarity between s1 and s2 is 1 - 17/30 = 13/30, and that of s2 and s3 is 25/30 = 5/6. That is, s2 is closer to s3 than to s1. Intuitively, the encoded bitmap index is more appropriate. In this way, we see that encoded bitmap indexing distinguishes trend sequences more correctly.
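Definitions 3.3 through 3.5 can be sketched as follows, representing each bitmap row as a list of bits. This is an illustration, not the authors' implementation:

```python
def xor_distance(a, b):
    """dist(si, sj): number of 1's in the bit-wise XOR of two
    equal-length bitmaps, given here as lists of 0/1 values."""
    return sum(x != y for x, y in zip(a, b))

def similarity(a, b, encoded=True):
    """sim from Definitions 3.4/3.5: 1 - dist / MAX(cardinality).

    For the simple (one-hot) index the denominator is doubled, since
    every indicator mismatch flips exactly two bits.
    """
    card = max(len(a), len(b))
    denom = card if encoded else 2 * card
    return 1 - xor_distance(a, b) / denom

# Encoded 5-bit codes one step apart in the Figure 3 table.
u = [0, 0, 0, 0, 0]
v = [0, 0, 0, 0, 1]
print(xor_distance(u, v))   # 1
print(similarity(u, v))     # 1 - 1/5 = 0.8
```

An ε-similarity test is then just `similarity(a, b) >= eps` for a given threshold `eps`.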


4. TREND CLUSTERING

In this section, we describe various similarity matches on trend sequences. Trend sequences can be in the same cluster if they are similar to one another. The notions of center and radius for a cluster are defined. Based on these two notions, clusters can be merged or split.

4.1 Similarity Match

This section proposes four types of trend sequence match that are used to cluster trend sequences. We consider exact and similar matches, each with and without shifting. Assume that s_i = (v_i1 @ t_i1, v_i2 @ t_i2, ..., v_ii @ t_ii) and s_j = (v_j1 @ t_j1, v_j2 @ t_j2, ..., v_jj @ t_jj), where v_k ∈ Σ, i.e., Σ is the set of the six trend indicators, and t_k is time (therefore, there is a temporal order between any two t's). The following four types are illustrated in Figure 6 (a) through (d).



- Exact match (without shifting). Two trend sequences s_i and s_j are exactly matched without shifting if all the following conditions hold: (1) the time ranges [t_i1, t_ii] and [t_j1, t_jj] overlap, i.e., [t_i1, t_ii] ∩ [t_j1, t_jj] ≠ ∅; (2) during the overlapped time duration, the trend indicators v_ik and v_jk are the same, i.e., v_ik = v_jk for t_ik (= t_jk) ∈ [t_j1, t_jj].

- Exact match by shift. Two trend sequences s_i and s_j are exactly matched by shift if the following condition holds: the trend indicators v_ik and v_jk are the same no matter in what time period they occur, i.e., v_ik = v_jk for t_ik = t_jk + l ∈ [t_j1, t_jj], where l denotes an arbitrary time duration.

- Similarity match (without shifting). Two trend sequences s_i and s_j are similarly matched without shifting if all the following conditions hold: (1) the time ranges [t_i1, t_ii] and [t_j1, t_jj] overlap, i.e., [t_i1, t_ii] ∩ [t_j1, t_jj] ≠ ∅; (2) during the overlapped time duration, the trend indicators v_ik and v_jk are similar, i.e., sim(v_ik, v_jk) ≥ ε for t_ik (= t_jk) ∈ [t_j1, t_jj].

- Similarity match by shift. Two trend sequences s_i and s_j are similarly matched by shift if the following condition holds: the trend indicators v_ik and v_jk are similar no matter in what time period they occur, i.e., sim(v_ik, v_jk) ≥ ε for t_ik = t_jk + l ∈ [t_j1, t_jj], where l denotes an arbitrary time duration.
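As a sketch of the two exact-match variants (not the authors' code), trend sequences can be represented as dicts mapping time steps to indicator codes; the similarity variants would replace the equality tests with sim(v_ik, v_jk) ≥ ε:

```python
def exact_match(si, sj):
    """Exact match (no shift): indicators agree at every overlapping
    time step. si and sj are dicts {time: indicator code}; the two
    time ranges must actually overlap."""
    common = si.keys() & sj.keys()
    return bool(common) and all(si[t] == sj[t] for t in common)

def exact_match_by_shift(si, sj):
    """Exact match by shift: the shorter indicator sequence appears,
    in order, somewhere inside the longer one, regardless of the
    absolute times at which the indicators occur."""
    a = [si[t] for t in sorted(si)]
    b = [sj[t] for t in sorted(sj)]
    if len(a) > len(b):
        a, b = b, a
    return any(b[l:l + len(a)] == a for l in range(len(b) - len(a) + 1))

s1 = {1: "up", 2: "down", 3: "up"}
s2 = {2: "down", 3: "up", 4: "up"}
print(exact_match(s1, s2))           # True: times 2 and 3 agree
s3 = {10: "up", 11: "down", 12: "up"}
print(exact_match_by_shift(s1, s3))  # True: same pattern, shifted
```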


4.2 Center and Radius

This section describes two features of bitmap indexes: the center and the radius.

Definition 4.1 (Center). Each cluster is represented by the center of gravity of the cluster (as in k-means). In a cluster of trend sequences, the center is a trend sequence, that is, a sequence of trend indicators.


Definition 4.2 (Radius). The radius of a cluster c is radius(c) = MAX(dist(s_c, s_j)), where s_c is the center of the bitmap index for the cluster and s_j is the bitmap of a trend sequence on the cluster's boundary.



As an example in Figure 7, assume that s1, s2, s3, s4 are in a cluster, say c1. The ce
nter for the first trend indicator in the
cluster c1, is 0.7 because (1+0+1)/3=0.7. The radius for the first trend indicator of c1 is,
radius
(
c1
) =
MAX
{
dist
(s
c
, s
j
)}=

MAX
{
dist(c1,s1), dist(c1,s3), dist(c1,s4)
}=

MAX
{
0.3, 0.7, 0.3
}=0.7. This can be applied to

entire indicators of the trend sequences
in the cluster. The center of c1 is (00001

00111

00111

00111

00001) by rounding off, and its radius is (.7 .7 .5
1.0 .7).

The notions of center and radius of a cluster are incrementally computed by taking
the states of bitmap indexes into account.
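One way to compute per-indicator centers and radii of the kind shown in the Figure 7 example is sketched below, under the assumption that the "level" of an encoded code is its count of 1-bits. This is an illustration, not the authors' implementation, and the sample codes are invented:

```python
def ones(code):
    """Level of a 5-bit encoded indicator: its number of 1-bits."""
    return code.count("1")

def center_and_radius(cluster):
    """Per-position center level, radius, and rounded center code.

    cluster: list of trend sequences, each an equal-length list of
    5-bit code strings. The center level is the mean number of 1-bits
    at that position; the radius is the maximum absolute deviation of
    any member from that level.
    """
    n = len(cluster[0])
    centers, radii, codes = [], [], []
    for pos in range(n):
        levels = [ones(seq[pos]) for seq in cluster]
        c = sum(levels) / len(levels)
        centers.append(round(c, 1))
        radii.append(round(max(abs(l - c) for l in levels), 1))
        k = round(c)                       # rounded-off center code
        codes.append("0" * (5 - k) + "1" * k)
    return centers, radii, codes

cluster = [["00001", "00111"], ["00000", "00011"], ["00001", "00111"]]
print(center_and_radius(cluster))
# ([0.7, 2.7], [0.7, 0.7], ['00001', '00111'])
```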


4.3 Merge and Split

This subsection describes two methods useful for clustering: merge and split. Two given clusters can be merged if they are similar enough to be in one cluster. In the same token, a given cluster can be split into two or more sub-clusters if its density is not one-centric. The criteria that lead to either the merge or the split method include (1) the radius of the given clusters, and (2) the number of trend sequences in the clusters.

For example, in Figure 7, the cluster c1 may be split with respect to the maximum radius. In the cluster, the trend indicator at time t4 has the maximum radius 1.0. Based upon this trend indicator, the cluster is split into two: the trend sequences with the 00011 bits, and those with the 01111 bits, that is, the two clusters are c1: {s1, s3} and c1': {s2, s4}.
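The split-by-maximum-radius step can be sketched as follows; an illustration whose member names and grouping rule mirror the c1 example above:

```python
def split_cluster(cluster, radii):
    """Split a cluster at the indicator position with maximum radius.

    cluster: dict name -> equal-length list of 5-bit code strings;
    radii: per-position radius list. Members are grouped by the code
    they carry at the widest (max-radius) position.
    """
    pos = max(range(len(radii)), key=lambda i: radii[i])
    groups = {}
    for name, seq in cluster.items():
        groups.setdefault(seq[pos], set()).add(name)
    return list(groups.values())

# Invented one-indicator cluster with codes 00011 vs 01111, as at t4.
cluster = {"s1": ["00011"], "s2": ["01111"], "s3": ["00011"], "s4": ["01111"]}
print(split_cluster(cluster, [1.0]))  # two groups: {s1, s3} and {s2, s4}
```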

4.4 Two Approaches for Clustering

There are two approaches for clustering: top-down and bottom-up.

- Top-down approach. The top-down approach uses the split method described above to partition a given cluster until each sub-cluster satisfies the thresholds of the two criteria. Notice that the thresholds of the two criteria are the radius of the cluster and the number of trend sequences.

Figure 6: Four Types of Trend Sequence Match: (a) exact match, (b) exact match by shift, (c) similarity match, (d) similarity match by shift.



- Bottom-up approach. The bottom-up approach uses the merge method described above to combine given clusters until the merged cluster satisfies the two thresholds.


5. TREND PREDICTION

In this section, we describe a new method of predicting the very next trend indicator. The prediction of the very next trend indicator is based on the trend information already available. The trend information may be available within a cluster or outside the cluster. Depending on these situations, we classify two types of trend prediction: Intra-cluster Trend Prediction and Inter-cluster Trend Prediction. To describe these two types of trend prediction, consider the clusters example in Figure 7, and assume that clusters c2 and c3 are dissimilar.



- Intra-cluster Trend Prediction. The very next trend indicator is predicted by using information within the cluster that contains the target trend sequence. Assume that we want to predict a trend indicator at time t+1, and let the center of the cluster at time t be c. Then the following rules can be used for the prediction.

  o If the target trend sequence is time-lagged, then the very next trend indicator is simply c. As an example in Figure 7, the very next trend indicator of the trend sequence S1 in cluster c1 is 00001 at time t5.

  o If the target trend sequence is not time-lagged, then the prediction can be obtained by the traditional method [10]: a trend indicator v(t+1) at time t+1 is obtained from the center c at time t by compensating for differences from the previous trend indicators with an error term. However, our proposed solution for this case is to combine the prediction that uses external cluster information with the one obtained here.



- Inter-cluster Trend Prediction. The very next trend indicator is predicted by using information outside the cluster if an outside cluster is dissimilar to the target cluster. A "dissimilarity threshold" is given, just as the similarity threshold is given. Two clusters are dissimilar if the distance between them is greater than or equal to the given dissimilarity threshold. For example, clusters c2 and c3 are dissimilar to each other in Figure 7 if the dissimilarity threshold is 50%
time      t1       t2        t3        t4        t5        t6

Cluster c1
S1:       00001    00111     00111     00011
S2:                00111     00011     01111     00001
S3:       00000    00011     00111     00011     00001
S4:       00001    00111     00011     01111     00011
Center:   2/3=.7   11/4=2.7  5/2=2.5   12/4=3    4/3=1.3
          00001    00111     00111     00111     00001
Radius:   .7       .7        .5        1.0       .7

Cluster c2
S6:       00000    00011     11111     11111
S7:                00001     01111     01111
S8:       00000    00011     01111     11111
Center:   0        5/3=1.7   13/3=4.3  14/3=4.7
          00000    00001     01111     11111
Radius:   0        .7        .7        .7

Cluster c3
S10:      11111    01111     00000     00001     11111
S12:               01111     00001     00000     01111
Center:   5        4         1/2=.5    1/2=.5    9/2=4.5
          11111    01111     00001     00001     11111
Radius:   0        0         .5        .5        .5

Figure 7: Clusters Example

(or 0.5). Assume that we want to predict a trend indicator in the cluster c2 at time
t+
1
. Let the center of the cluster
c2 at time
t

be
c
. Let the center of the cluster c3 at time
t

be
d
. Then the following formula can be used for the
prediction.

o

If the cluster c2 is time
-
lagged, then the very next trend indicator is simply
d
. As an example
in Figure 7,
the very next trend indicator of a trend sequence, say S8, is simply far different from 11111 at time
t5
.
Therefore, S8 is expected to be
00000 00011 01111 11111 11111
.

o If the target cluster c2 is not time-lagged, then the prediction should combine the information of the trend sequences in the same cluster together with the information of the trend sequences in the dissimilar cluster. The proposed formula is as follows:

A trend indicator v_(t+1) at time t+1 is obtained from the center d of the dissimilar cluster at time t by compensating for differences from the previous trend indicators with an error ε:

v_(t+1) = (5 - d_t) + (1/2)(d_t - d_(t-1)) + (1/2^2)(d_(t-1) - d_(t-2)) + ... + ε

The center of the dissimilar cluster is subtracted from 5 because each trend indicator may be opposite in value.
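A minimal sketch of this inter-cluster prediction rule follows. The geometric 1/2, 1/2^2, ... weighting of successive center differences and the error term eps are one plausible reading of the formula, not necessarily the authors' exact definition, and the class and method names are invented for the example.

```java
// Sketch: predict the next trend indicator of a target cluster from the
// center history d[0..t] of a dissimilar cluster. The starting point is
// the "opposite" value 5 - d[t]; recent differences are then added with
// geometrically decreasing weights (an assumed weighting), plus an error.
public class InterClusterPredict {

    static double predictNext(double[] d, double eps) {
        int t = d.length - 1;
        double v = 5.0 - d[t];          // opposite of the dissimilar center
        double w = 0.5;                  // most recent difference weighted 1/2
        for (int i = t; i > 0; i--) {
            v += w * (d[i] - d[i - 1]);
            w /= 2.0;                    // 1/2^2, 1/2^3, ... for older steps
        }
        return v + eps;
    }

    public static void main(String[] args) {
        // Centers of cluster c3 in Figure 7 at t1..t4: 5, 4, 0.5, 0.5.
        double[] d = {5.0, 4.0, 0.5, 0.5};
        System.out.println(predictNext(d, 0.0));  // 3.5
    }
}
```

With c3's centers up to t4 this yields 4.5 + 0 - 0.875 - 0.125 = 3.5, i.e., a rising indicator for c2 at t5, consistent with the expected 11111-range values in the example above.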


6.

IMPLEMENTATION

The concepts described in this paper are in part implemented in Java. All processes are operated user-interactively. Figure 8 shows the user-interactive clustering tool. Time-series data sets can be read and smoothed by moving average. The short-term and long-term time periods are given. Then, trend indicators are obtained. Those trend indicators are converted into bitmap indexes. The four types of matches are applied in this figure as well. Figure 9 displays the outcomes of the clustering process.
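Two of the pipeline steps just described can be sketched as follows. The moving-average window and the encoding of an indicator k as k trailing ones (e.g., 3 becomes 00111, matching the bitmaps in Figure 7) are assumptions for illustration; the names are invented and this is not the tool's actual code.

```java
// Sketch of two pipeline steps: moving-average smoothing of a raw series,
// and encoding a trend indicator in 0..5 as a 5-bit bitmap index.
public class TrendPipeline {

    // Simple moving average with the given window over a raw series.
    static double[] movingAverage(double[] series, int window) {
        double[] out = new double[series.length - window + 1];
        for (int i = 0; i < out.length; i++) {
            double sum = 0;
            for (int j = i; j < i + window; j++) sum += series[j];
            out[i] = sum / window;
        }
        return out;
    }

    // Encode an indicator k in 0..5 as a 5-bit string with k trailing ones.
    static String toBitmap(int k) {
        StringBuilder sb = new StringBuilder();
        for (int b = 4; b >= 0; b--) sb.append(b < k ? '1' : '0');
        return sb.toString();
    }

    public static void main(String[] args) {
        double[] smoothed = movingAverage(new double[]{1, 2, 3, 4, 5}, 3);
        System.out.println(java.util.Arrays.toString(smoothed)); // [2.0, 3.0, 4.0]
        System.out.println(toBitmap(3)); // 00111
    }
}
```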



Figure 8: Trend Sequence Clustering Tool


7.

CONCLUSION


This paper described a method of clustering trend sequences and predicting the very next trend indicator for time-series databases. We classified three approaches to handling time-series databases. As a qualitative approach, we defined the notion of "trend." Instead of data sequences, we dealt with trend sequences. Trend sequences are indexed using a bitmap. Using bitwise operations for distance, similarity, center, and radius, similar trend sequences are clustered.

Among various clusters, we also developed a way of predicting the very next trend indicator. Two approaches were investigated: intra-cluster trend prediction and inter-cluster trend prediction.

The contributions of this paper include (1) clustering by using not only similarity match but also dissimilarity match, which prevents false positive and false negative failures; (2) prediction by using not only similar trend sequences but also dissimilar trend sequences; and (3) a bitmap approach that improves the performance of clustering and prediction.


REFERENCES


1. R. Agrawal, C. Faloutsos, and A. Swami, "Efficient Similarity Search in Sequence Databases," in Proc. of the 4th Int'l Conf. on Foundations of Data Organization and Algorithms, 1993.

2. R. Agrawal, K. Lin, H. Sawhney, and K. Shim, "Fast Similarity Search in the Presence of Noise, Scaling, and Translation in Time-Series Databases," in Proc. of the VLDB Conf., 490-501, 1995.

3. N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger, "The R*-tree: An Efficient and Robust Access Method for Points and Rectangles," in Proc. of the ACM SIGMOD Conf. on Management of Data, 322-331, 1990.

4. A. Bouguettaya, "On-Line Clustering," IEEE Transactions on Knowledge and Data Engineering, Vol. 8, 333-339, 1996.

5. C. Chan and Y. Ioannidis, "An Efficient Bitmap Encoding Scheme for Selection Queries," in Proc. of the ACM SIGMOD Conf. on Management of Data, 215-226, 1999.

6. R. Edwards and J. Magee, Technical Analysis of Stock Trends, John Magee, Springfield, Massachusetts, 1969.

7. C. Faloutsos, M. Ranganathan, and Y. Manolopoulos, "Fast Subsequence Matching in Time-Series Databases," in Proc. of the ACM SIGMOD Conf. on Management of Data, 419-429, 1994.

Figure 9: Outcome of Clustering

8. S. Guha, R. Rastogi, and K. Shim, "ROCK: A Robust Clustering Algorithm for Categorical Attributes," in Proc. of the IEEE Conf. on Data Engineering, 1999.

9. A. Guttman, "R-trees: A Dynamic Index Structure for Spatial Searching," in Proc. of the ACM SIGMOD Conf. on Management of Data, 45-57, 1984.

10. J. Hamilton, Time Series Analysis, Princeton University Press, 1994.

11. L. Kaufman and P. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.

12. S. Kortje, Stock Trends: A Handbook for Investors, http://www.stocktrends.ca, 1998.

13. P. O'Neil and D. Quass, "Improved Query Performance with Variant Indexes," in Proc. of the ACM SIGMOD Conf. on Management of Data, 1997.

14. A. Oppenheim and R. Schafer, Digital Signal Processing, Prentice-Hall, 1975.

15. C. Perng, H. Wang, S. Zhang, and D. Parker, "Landmarks: A New Model for Similarity-Based Pattern Querying in Time Series Databases," in Proc. of the IEEE Conf. on Data Engineering, 33-44, 2000.

16. D. Rafiei and A. Mendelzon, "Similarity-Based Queries for Time-Series Data," in Proc. of the ACM SIGMOD Conf. on Management of Data, 13-23, 1997.

17. R. Srikant and R. Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements," in Proc. of the Conf. on Extending Database Technology, 1996.

18. M. Wu, "Query Optimization for Selections using Bitmaps," Tech. Report, CS Dept., Technische Universitat Darmstadt, 1998.

19. B. Yi, H. Jagadish, and C. Faloutsos, "Efficient Retrieval of Similar Time Sequences under Time Warping," in Proc. of the 14th Int'l Conf. on Data Engineering, 1998.

20. B. Yi, N. Sidiropoulos, T. Johnson, H. Jagadish, C. Faloutsos, and A. Biliris, "Online Data Mining for Co-Evolving Time Sequences," in Proc. of the IEEE Conf. on Data Engineering, 13-22, 2000.

21. J. Yoon, J. Lee, and S. Kim, "Trend Similarity and Prediction in Time-Series Databases," in SPIE Conf. on Data Mining and Knowledge Discovery: Theory, Tools, and Technology II, 201-212, 2000.

22. T. Zhang, R. Ramakrishnan, and M. Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases," in Proc. of the ACM SIGMOD Conf. on Management of Data, 1996.