SAX: a Novel Symbolic Representation of Time Series



Authors: Jessica Lin, Eamonn Keogh, Li Wei, Stefano Lonardi

Presenter: Arif Bin Hossain

Slides incorporate materials kindly provided by Prof. Eamonn Keogh

Time Series

A time series is a sequence of data points, typically measured at successive times spaced at uniform intervals. [Wiki]

Examples:

Economic, sales, and stock market forecasting

EEG, ECG, and BCI analysis




Problems

Join: Given two data collections, link items occurring in each.

Annotation: Obtain additional information from given data.

Query by content: Given a large data collection, find the k most similar objects to an object of interest.

Clustering: Given an unlabeled dataset, arrange the items into groups by their mutual similarity.


Problems (Cont.)

Classification: Given a labeled training set, classify future unlabeled examples.

Anomaly detection: Given a large collection of objects, find the one that is most different from all the rest.

Motif finding: Given a large collection of objects, find the pair that is most similar.



Data Mining Constraints

For example, suppose you have one gigabyte of main memory and want to do K-means clustering:

Clustering ¼ gig of data: 100 sec

Clustering ½ gig of data: 200 sec

Clustering 1 gig of data: 400 sec

Clustering 1.1 gigs of data: a few hours

Once the data no longer fits in main memory, the algorithm must page to disk, and the runtime jumps from roughly linear scaling to hours.

P. S. Bradley, U. M. Fayyad, C. Reina: Scaling Clustering Algorithms to Large Databases. KDD 1998: 9-15.

Generic Data Mining

Create an approximation of the data that will fit in main memory, yet retains the essential features of interest.

Approximately solve the problem at hand in main memory.

Make (hopefully very few) accesses to the original data on disk to confirm the solution.


Some Common Approximations

Why Symbolic Representation?

Reduce dimensionality

Numerosity reduction

Hashing

Suffix trees

Markov models

Stealing ideas from the text processing / bioinformatics community

Symbolic Aggregate approXimation (SAX)

Lower bounding of the Euclidean distance

Lower bounding of the DTW distance

Dimensionality reduction

Numerosity reduction

Example SAX word: baabccbc

SAX

Allows a time series of arbitrary length n to be reduced to a string of arbitrary length w (w << n).


Notations

C    a time series C = c_1, ..., c_n

Ć    the Piecewise Aggregate Approximation (PAA) of C: Ć = ć_1, ..., ć_w

Ĉ    the symbolic representation of C: Ĉ = ĉ_1, ..., ĉ_w

w    number of PAA segments representing C

a    alphabet size

How to obtain SAX?

Step 1: Reduce dimension by PAA

A time series C of length n can be represented in a w-dimensional space by a vector Ć = ć_1, ..., ć_w. The i-th element is calculated by

\bar{c}_i = \frac{w}{n} \sum_{j = \frac{n}{w}(i-1) + 1}^{\frac{n}{w} i} c_j

Example: reducing the dimension from 20 to 5, each segment averages n/w = 4 points, so the 2nd element is

\bar{c}_2 = \frac{5}{20} \sum_{j=5}^{8} c_j = \frac{c_5 + c_6 + c_7 + c_8}{4}
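A minimal sketch of the PAA step in Python (numpy assumed; the function name paa is illustrative, not from the authors' code), reproducing the worked example above:

```python
import numpy as np

def paa(series, w):
    """Piecewise Aggregate Approximation: reduce a length-n series
    to the means of w equal-sized segments (assumes w divides n)."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    assert n % w == 0, "this sketch assumes w divides n evenly"
    return series.reshape(w, n // w).mean(axis=1)

# Worked example from the slide: n = 20, w = 5, so each segment
# averages n/w = 4 points and the 2nd element is (c5+c6+c7+c8)/4.
C = np.arange(1.0, 21.0)   # toy series c_1..c_20 = 1, 2, ..., 20
print(paa(C, 5))           # [ 2.5  6.5 10.5 14.5 18.5]
```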


How to obtain SAX?

The data is divided into w equal-sized frames.

The mean value of the data falling within each frame is calculated.

The vector of these means becomes the PAA.

[Figure: a time series C and its PAA approximation Ć]



How to obtain SAX?

Step 2: Discretization

Normalize Ć so that it has a Gaussian distribution.

Determine breakpoints β_1, ..., β_{a-1} that divide the area under the Gaussian curve into a equal-sized regions.

Map each PAA coefficient to the symbol of the region it falls in.

[Figure: with a = 3, two breakpoints split the N(0,1) area into three equal-probability regions; mapping the PAA segments of C onto them yields the word baabccbc]

Word length: 8, alphabet size: 3
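A sketch of the discretization step (scipy assumed for the Gaussian quantiles; sax_symbols and the reuse of the paa helper above are illustrative names, not the paper's code):

```python
import numpy as np
from scipy.stats import norm

def sax_symbols(series, w, a):
    """Convert a time series to a SAX word: z-normalize, apply PAA,
    then map each segment mean to one of `a` symbols using the
    equal-probability breakpoints of the standard normal curve."""
    x = np.asarray(series, dtype=float)
    x = (x - x.mean()) / x.std()                  # normalize to N(0, 1)
    segments = paa(x, w)                          # PAA helper from the earlier sketch
    breakpoints = norm.ppf(np.arange(1, a) / a)   # a-1 interior breakpoints
    indices = np.digitize(segments, breakpoints)  # region index 0..a-1
    return ''.join(chr(ord('a') + i) for i in indices)

# With w = 8 and a = 3, a series like the slide's example would map
# to an 8-letter word over {a, b, c}, e.g. "baabccbc".
```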

Distance Measure

Given two time series Q and C of length n:

Euclidean distance:

D(Q, C) \equiv \sqrt{\sum_{i=1}^{n} (q_i - c_i)^2}

Distance after transforming the subsequences to PAA:

DR(\bar{Q}, \bar{C}) \equiv \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} (\bar{q}_i - \bar{c}_i)^2}
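A sketch of both distances (hypothetical helper names; n is the original series length):

```python
import numpy as np

def euclidean(Q, C):
    """Euclidean distance between two equal-length sequences."""
    return np.sqrt(np.sum((np.asarray(Q, float) - np.asarray(C, float)) ** 2))

def paa_dist(Q_bar, C_bar, n):
    """DR: distance between two w-dimensional PAA vectors, scaled by
    sqrt(n/w) so that it lower bounds the Euclidean distance between
    the original length-n series."""
    w = len(Q_bar)
    return np.sqrt(n / w) * euclidean(Q_bar, C_bar)
```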


Distance Measure

Define MINDIST after transforming to the symbolic representation:

MINDIST(\hat{Q}, \hat{C}) \equiv \sqrt{\frac{n}{w}} \sqrt{\sum_{i=1}^{w} \big( dist(\hat{q}_i, \hat{c}_i) \big)^2}

The dist() function is looked up in a table built from the breakpoints: identical or adjacent symbols have distance 0; otherwise dist(r, c) = β_{max(r,c)-1} − β_{min(r,c)}.

MINDIST lower bounds the true Euclidean distance between the original time series.
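A sketch of MINDIST under the same assumptions (symbols are letters starting at 'a'; the breakpoint table is rebuilt from scipy's normal quantiles):

```python
import numpy as np
from scipy.stats import norm

def mindist(word_q, word_c, n, a):
    """MINDIST between two SAX words of equal length w taken from
    length-n series with alphabet size a; lower bounds the Euclidean
    distance between the original series."""
    beta = norm.ppf(np.arange(1, a) / a)   # breakpoints beta_1..beta_{a-1}
    def cell(r, c):
        # identical or adjacent symbols are indistinguishable: distance 0
        if abs(r - c) <= 1:
            return 0.0
        return beta[max(r, c) - 1] - beta[min(r, c)]
    q = [ord(s) - ord('a') for s in word_q]
    c = [ord(s) - ord('a') for s in word_c]
    w = len(q)
    return np.sqrt(n / w) * np.sqrt(sum(cell(r, s) ** 2 for r, s in zip(q, c)))

# e.g. mindist('baabccbc', 'babcacca', n=128, a=3)
```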

Numerosity Reduction

Subsequences are extracted by a sliding window.

Neighboring subsequences are mostly repetitive.

If the sliding window finds aabbcc and the next subsequence is also aabbcc, just store its position instead of the word.

This optimization depends on the data, but typically yields a reduction factor of 2 or 3 (see the sketch below).

[Figure: space shuttle telemetry with subsequence length 32]
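A sketch of the sliding-window extraction with this numerosity reduction (reusing the illustrative sax_symbols helper from above):

```python
def sax_stream(series, window, w, a):
    """Slide a window across the series, convert each subsequence to
    a SAX word, and store a (position, word) pair only when the word
    differs from the previous one -- repeats keep just their position."""
    words, last = [], None
    for pos in range(len(series) - window + 1):
        word = sax_symbols(series[pos:pos + window], w, a)
        if word != last:
            words.append((pos, word))   # new word: store it with its position
            last = word
        # else: same word as before; only the position would be recorded
    return words
```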

Experimental Validation

Clustering: hierarchical, partitional

Classification: nearest neighbor, decision tree

Motif discovery

Hierarchical Clustering

The sample dataset consists of 3 decreasing-trend, 3 upward-shift, and 3 normal-class sequences.

Partitional Clustering (k-means)

Assign each point to the one of k clusters whose center is nearest.

Each iteration tries to minimize the sum of squared intra-cluster error (see the sketch below).
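A minimal generic k-means sketch (numpy assumed; this illustrates the algorithm itself, not the experiment's actual code):

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Lloyd's k-means: alternate between assigning each point to its
    nearest center and moving each center to its cluster mean, which
    decreases the sum of squared intra-cluster error."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(iters):
        # distances of every point to every center, shape (n_points, k)
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```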

Nearest Neighbor Classification

SAX beats Euclidean distance due to the smoothing effect of dimensionality reduction.


Decision Tree Classification

Since decision trees are expensive to use with high-dimensional datasets, a regression tree [Geurts, 2001] is a better approach for data mining on time series.

Motif Discovery

Implemented the random projection algorithm of Buhler and Tompa [ICMB2001].

Subsequences are hashed into buckets using a random subset of their features as the key.
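A sketch of that bucketing idea over SAX words (illustrative names; the real algorithm iterates over many random masks and verifies colliding pairs against the raw data):

```python
import collections
import random

def random_projection_pairs(words, mask_size, iters=10, seed=0):
    """Random projection: repeatedly hash every word on a random
    subset of its positions; pairs of words that land in the same
    bucket across many iterations are candidate motifs."""
    rng = random.Random(seed)
    wlen = len(words[0])
    collisions = collections.Counter()
    for _ in range(iters):
        mask = sorted(rng.sample(range(wlen), mask_size))  # positions kept
        buckets = collections.defaultdict(list)
        for idx, word in enumerate(words):
            buckets[''.join(word[p] for p in mask)].append(idx)
        for members in buckets.values():
            for i in range(len(members)):
                for j in range(i + 1, len(members)):
                    collisions[(members[i], members[j])] += 1
    return collisions.most_common(5)   # most frequently colliding pairs
```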

New Version: iSAX

Uses binary numbers to label the words.

Allows different alphabet sizes (cardinalities) within a single word.

Supports comparison of words with different cardinalities.
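A small sketch of the cardinality trick (an illustration, assuming iSAX's property that a lower-cardinality symbol is a binary prefix of its higher-cardinality refinements):

```python
def compatible(sym_lo, sym_hi):
    """iSAX-style comparison across cardinalities: a symbol of lower
    cardinality matches a higher-cardinality symbol exactly when it
    is a binary prefix of it, e.g. '1' covers both '10' and '11'."""
    return sym_hi.startswith(sym_lo)

# A cardinality-4 word uses 2-bit symbols; a cardinality-2 word, 1 bit.
word_c4 = ['10', '01', '11']
word_c2 = ['1', '0', '1']
print(all(compatible(lo, hi) for lo, hi in zip(word_c2, word_c4)))  # True
```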


Thank you

Questions?