SAX: A Novel Symbolic Representation of Time Series
Authors: Jessica Lin, Eamonn Keogh, Li Wei, Stefano Lonardi
Presenter: Arif Bin Hossain
Slides incorporate materials kindly provided by Prof. Eamonn Keogh
Time Series
A time series is a sequence of data points, typically measured at successive times spaced at uniform intervals. [Wiki]
Examples:
Economic, sales, and stock market forecasting
EEG, ECG, and BCI analysis
Problems
Join: Given two data collections, link the items occurring in each.
Annotation: Obtain additional information from the given data.
Query by content: Given a large data collection, find the k objects most similar to an object of interest.
Clustering: Given an unlabeled dataset, arrange the items into groups by their mutual similarity.
Problems (Cont.)
Classification: Given a labeled training set, classify future unlabeled examples.
Anomaly detection: Given a large collection of objects, find the one most different from all the rest.
Motif finding: Given a large collection of objects, find the pair that is most similar.
Data Mining Constraints
For example, suppose you have one gigabyte of main memory and want to do k-means clustering…
Clustering ¼ GB of data: 100 sec
Clustering ½ GB of data: 200 sec
Clustering 1 GB of data: 400 sec
Clustering 1.1 GB of data: a few hours
Bradley, Fayyad & Reina: Scaling Clustering Algorithms to Large Databases. KDD 1998: 9-15
Generic Data Mining
• Create an approximation of the data that will fit in main memory, yet retains the essential features of interest
• Approximately solve the problem at hand in main memory
• Make (hopefully very few) accesses to the original data on disk to confirm the solution
Some Common Approximations
Why Symbolic Representation?
• Dimensionality reduction
• Numerosity reduction
• Hashing
• Suffix trees
• Markov models
• Stealing ideas from the text processing / bioinformatics community
Symbolic Aggregate approXimation (SAX)
• Lower bounding of the Euclidean distance
• Lower bounding of the DTW distance
• Dimensionality reduction
• Numerosity reduction
baabccbc
SAX
Allows a time series of arbitrary length n to be reduced to a string of arbitrary length w (w << n)
Notations
C: a time series, C = c_1, …, c_n
Ć: a Piecewise Aggregate Approximation (PAA) of a time series, Ć = ć_1, …, ć_w
Ĉ: a symbolic representation of a time series, Ĉ = ĉ_1, …, ĉ_w
w: number of PAA segments representing C
a: alphabet size
How to obtain SAX?
Step 1: Reduce dimension by PAA
A time series C of length n can be represented in a w-dimensional space by the vector Ć = ć_1, …, ć_w.
The i-th element is the mean of the i-th frame:
ć_i = (w/n) · Σ c_j, summing j from (n/w)(i−1)+1 to (n/w)·i
Example: reducing the dimension from 20 to 5 (frame size 20/5 = 4), the 2nd element is
ć_2 = (c_5 + c_6 + c_7 + c_8) / 4
How to obtain SAX?
The data is divided into w equal-sized frames.
The mean value of the data falling within each frame is calculated.
The vector of these means becomes the PAA.
[Figure: a time series C and its PAA representation Ć]
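The frame-averaging step above can be sketched in a few lines of Python (a minimal sketch assuming n is divisible by w; the function name is illustrative, not from the paper):

```python
def paa(series, w):
    """Piecewise Aggregate Approximation: reduce a length-n series
    to w values, each the mean of one equal-sized frame."""
    n = len(series)
    frame = n // w  # assumes n is divisible by w for simplicity
    return [sum(series[i * frame:(i + 1) * frame]) / frame for i in range(w)]
```

For example, `paa(list(range(20)), 5)` averages frames of 4 points each, returning `[1.5, 5.5, 9.5, 13.5, 17.5]`.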
How to obtain SAX?
Step 2: Discretization
Normalize Ć so the values follow a standard Gaussian distribution.
Determine breakpoints that produce a equal-sized areas under the Gaussian curve.
[Figure: the normalized series with breakpoints; each PAA segment is mapped to a symbol]
Resulting SAX word: baabccbc
Word length: 8
Alphabet size: 3
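The breakpoint-and-lookup step can be sketched as follows, assuming the PAA values are already z-normalized (function names are illustrative; `NormalDist.inv_cdf` from the standard library supplies the Gaussian quantiles):

```python
from statistics import NormalDist

def sax_breakpoints(a):
    """Breakpoints giving a equal-probability regions under N(0, 1)."""
    nd = NormalDist()  # standard normal
    return [nd.inv_cdf(i / a) for i in range(1, a)]

def discretize(paa_values, a):
    """Map each normalized PAA value to a letter from an alphabet of size a."""
    bps = sax_breakpoints(a)
    alphabet = "abcdefghij"[:a]
    word = ""
    for v in paa_values:
        # the symbol index is the number of breakpoints below the value
        idx = sum(v > b for b in bps)
        word += alphabet[idx]
    return word
```

With a = 3 the breakpoints are roughly ±0.43, so `discretize([-1.0, 0.0, 1.0], 3)` yields `"abc"`.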
Distance Measure
Given two time series Q and C of the same length n:
Euclidean distance: D(Q, C) = sqrt( Σ_{i=1..n} (q_i − c_i)² )
Distance after transforming to PAA (with Q́, Ć the PAA vectors): DR(Q́, Ć) = sqrt(n/w) · sqrt( Σ_{i=1..w} (q́_i − ć_i)² )
Distance Measure
Define MINDIST on the symbolic representations Q̂ and Ĉ:
MINDIST(Q̂, Ĉ) = sqrt(n/w) · sqrt( Σ_{i=1..w} dist(q̂_i, ĉ_i)² )
where dist() is looked up in a table precomputed from the breakpoints.
MINDIST lower-bounds the true Euclidean distance between the original time series, so indexing with SAX produces no false dismissals.
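A sketch of MINDIST, using the standard breakpoint-table rule from the SAX paper (the distance between two symbols is 0 if they are equal or adjacent, otherwise the gap between their nearest breakpoints; function name is illustrative):

```python
from statistics import NormalDist

def mindist(word_q, word_c, n, a):
    """Lower bound on the Euclidean distance between the two length-n
    series behind the SAX words word_q and word_c (alphabet size a)."""
    w = len(word_q)
    bps = [NormalDist().inv_cdf(i / a) for i in range(1, a)]

    def cell(r, c):
        # equal or adjacent symbols contribute 0; otherwise the gap
        # between the closest breakpoints of the two regions
        if abs(r - c) <= 1:
            return 0.0
        return bps[max(r, c) - 1] - bps[min(r, c)]

    s = sum(cell(ord(q) - 97, ord(c) - 97) ** 2
            for q, c in zip(word_q, word_c))
    return (n / w) ** 0.5 * s ** 0.5
```

Note that adjacent symbols (e.g. `a` vs `b`) contribute nothing, which keeps the bound admissible.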
Numerosity
Reduction
Subsequences are extracted by a sliding window
Sequences are mostly repetitive subsequence
Sliding window finds
aabbcc
If the next sequence is also
aabbcc
, just store the position
This optimization depends on the data, but typically
yields a reduction factor of 2 or 3
Space shuttle telemetry with subsequence length 32
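The position-only storage idea can be sketched as a simple run-collapsing pass over the stream of SAX words (a simplified view; the function name is illustrative):

```python
def sliding_sax_positions(words):
    """Numerosity reduction: store a SAX word only when it differs from
    the previous window's word; repeated words keep just their position
    implicitly, via the gap to the next stored entry."""
    kept = []   # (position, word) pairs actually stored
    last = None
    for pos, word in enumerate(words):
        if word != last:
            kept.append((pos, word))
            last = word
    return kept
```

On highly repetitive telemetry, long runs of identical words collapse to a single entry each, which is where the factor-of-2-or-3 saving comes from.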
Experimental Validation
Clustering
Hierarchical
Partitional
Classification
Nearest neighbor
Decision tree
Motif discovery
Hierarchical Clustering
The sample dataset consists of three decreasing-trend, three upward-shift, and three normal sequences.
Partitional Clustering (k-means)
Assign each point to the one of k clusters whose center is nearest.
Each iteration tries to minimize the sum of squared intra-cluster error.
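The two steps above (nearest-center assignment, then center update) can be sketched as a minimal Lloyd's iteration on scalar points — an illustrative sketch, not the experiment's implementation:

```python
def kmeans_1d(points, centers, iters=10):
    """Minimal k-means sketch: assign each point to its nearest center,
    then recompute each center as the mean of its cluster."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: (p - centers[j]) ** 2)
            clusters[i].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers
```

Running SAX-transformed series through the same loop (with MINDIST or Euclidean distance on the reduced vectors) is what makes clustering gigabyte-scale data feasible in main memory.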
Nearest Neighbor Classification
SAX beats Euclidean distance due to the smoothing effect of dimensionality reduction.
Decision Tree Classification
Since decision trees are expensive to use with high-dimensional datasets, a regression tree [Geurts, 2001] is a better approach for data mining on time series.
Motif Discovery
Implemented the random projection algorithm of Buhler and Tompa [RECOMB 2001].
Subsequences are hashed into buckets using a random subset of their features as the key.
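One iteration of the bucketing idea can be sketched as follows (a hedged sketch: the mask size and function names are illustrative, and a real run repeats this with many random masks and collects collision counts):

```python
import random
from collections import defaultdict

def random_projection_buckets(words, mask_size, seed=0):
    """Hash each SAX word into a bucket keyed by the symbols at a random
    subset of positions; words colliding in a bucket are candidate
    motif matches to be verified against the raw data."""
    random.seed(seed)
    w = len(words[0])
    mask = sorted(random.sample(range(w), mask_size))  # positions kept in the key
    buckets = defaultdict(list)
    for i, word in enumerate(words):
        key = "".join(word[p] for p in mask)
        buckets[key].append(i)
    return dict(buckets)
```

Identical words always collide regardless of the mask, while near-identical words collide with high probability over repeated iterations.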
New Version: iSAX
Uses binary numbers for labeling the words.
Allows different alphabet sizes (cardinalities) within a word.
Supports comparison of words with different cardinalities.
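The cross-cardinality comparison rests on the binary labels: a symbol at cardinality 2^b is a b-bit number, and dropping its low-order bits gives its label at any coarser cardinality. A minimal sketch of that idea (function names are illustrative):

```python
def promote(symbol, bits_from, bits_to):
    """Reduce a symbol from cardinality 2**bits_from to 2**bits_to by
    keeping only its most significant bits (its binary prefix)."""
    assert bits_to <= bits_from
    return symbol >> (bits_from - bits_to)

def same_region(sym_a, bits_a, sym_b, bits_b):
    """Two symbols of different cardinality agree if their labels match
    once both are promoted to the coarser cardinality."""
    bits = min(bits_a, bits_b)
    return promote(sym_a, bits_a, bits) == promote(sym_b, bits_b, bits)
```

For example, the 3-bit symbol `0b110` promoted to 2 bits is `0b11`, so it matches the 2-bit symbol `0b11` even though the two words use different alphabet sizes.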
Thank you
Questions?