10/26/09, Wilfrid Laurier University
Temporal Relationship Among Clusters for Data Streams
Margaret H. Dunham, Michael Hahsler, Doug Raiford
Students: Yu Meng, Donya Quick, Jie Huang, Charlie Isaksson, Mallik Kotamarti
CSE Department
Southern Methodist University
Dallas, Texas 75275
mhd@lyle.smu.edu
This material is based upon work supported by the National Science Foundation under Grant No. IIS-0948893.
Objectives/Outline
Introduction
Background
TRAC-DS
TRAC-DS Applications
Conclusions/Future Work
Traditional Clustering of Data Streams
Ignores one of the most salient features of streams: ordering
Objectives/Outline
Introduction
Stream Data
Motivation
Background
TRAC-DS
TRAC-DS Applications
Conclusions/Future Work
Stream Data
A growing number of applications generate streams
of data.
Computer network monitoring data
Call detail records in telecommunications
Highway transportation traffic data
Online web purchase log records
Sensor network data
Stock exchange, transactions in retail chains, ATM
operations in banks, credit card transactions.
Clustering techniques play a key role in modeling
and analyzing this data.
Stream Data Format
Events arriving in a stream
At any time t, we can view the state of the problem as represented by a vector of n numeric values:
$V_t = \langle S_{1t}, S_{2t}, \ldots, S_{nt} \rangle$

        V_1     V_2     ...     V_q
S_1     S_11    S_12    ...     S_1q
S_2     S_21    S_22    ...     S_2q
...     ...     ...     ...     ...
S_n     S_n1    S_n2    ...     S_nq
Time (across columns)
Data Stream Modeling
Single pass: Each record is examined at most once
Bounded storage: Limited memory for storing synopsis
Real-time: Per-record processing time must be low
Summarization (synopsis) of data
Use data NOT SAMPLE
Temporal and Spatial
Dynamic
Continuous (infinite stream)
Learn
Forget
Sublinear growth rate - Clustering
Traditional Clustering
TRAC-DS
Motivation
Temporal Ordering is a major feature of
stream data.
Many stream applications depend on this
ordering
Prediction of future values
Anomaly (rare event) detection
Concept drift
Objectives/Outline
Introduction
Background
Clustering Stream Data
Extensible Markov Model - EMM
TRAC-DS
TRAC-DS Applications
Conclusions/Future Work
Stream Clustering Requirements
Dynamic updating of the clusters
Identify outliers
Barbara [2]:
compactness
fast
incremental processing
Stream Clustering Algorithms
LOCALSEARCH [4]
Partitions stream into segments
Clusters each segment individually by solving the k-medians problem
Iteratively reclusters the resulting centers
CluStream [1]
Micro-clusters represented by summary statistics
Micro-clusters are handled online
Micro-clusters merged offline
MONIC [13]
Evolution of clusters over time
Cluster transitions over time
MM
A first-order Markov chain is a finite or countably infinite sequence of events {E_1, E_2, ...} over discrete time points, where $P_{ij} = P(E_j \mid E_i)$, and at any time the future behavior of the process is based solely on the current state.
A Markov Model (MM) is a graph with m vertices or states, S, and directed arcs, A, such that:
$S = \{N_1, N_2, \ldots, N_m\}$, and
$A = \{L_{ij} \mid i \in \{1, 2, \ldots, m\},\ j \in \{1, 2, \ldots, m\}\}$, and each arc $L_{ij} = \langle N_i, N_j \rangle$ is labeled with a transition probability $P_{ij} = P(N_j \mid N_i)$.
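As a rough illustration of the transition probabilities $P_{ij} = P(N_j \mid N_i)$, the sketch below estimates them from an observed state sequence by counting consecutive pairs. The function name `estimate_transitions` and the sample sequence are hypothetical and chosen only for this example.

```python
from collections import defaultdict


def estimate_transitions(states):
    """Estimate first-order transition probabilities P_ij = P(N_j | N_i).

    `states` is an observed sequence of state labels (hypothetical input).
    """
    counts = defaultdict(lambda: defaultdict(int))
    # Count every consecutive pair (current state, next state).
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    probs = {}
    for i, row in counts.items():
        total = sum(row.values())
        # Normalize each row so outgoing probabilities sum to 1.
        probs[i] = {j: c / total for j, c in row.items()}
    return probs


# Illustrative sequence, not taken from the slides.
sequence = ["N1", "N2", "N1", "N1", "N2", "N3"]
P = estimate_transitions(sequence)
```

Only the current state is used to predict the next one, which is the first-order property stated above.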
Extensible Markov Model (EMM)
Time Varying Discrete First Order Markov Model
Nodes are clusters of real world states.
Learning continues during application phase.
Learning:
Transition probabilities between states (clusters)
State labels (cluster summary)
States are modified as clusters are
EMM for TRAC-DS Modeling
<18,10,3,3,1,0,0>
<17,10,2,3,1,0,0>
<16,9,2,3,1,0,0>
<14,8,2,3,1,0,0>
<14,8,2,3,0,0,0>
<18,10,3,3,1,1,0>
[Figure: EMM built incrementally from the vectors above, with states N1, N2, N3 and the transition probabilities on the arcs at each step]
Objectives/Outline
Introduction
Background
TRAC-DS
Definition
Relationship to Traditional Clustering
Operations
TRAC-DS Applications
Conclusions/Future Work
TRAC-DS NOTE
TRAC-DS is not:
Another stream clustering algorithm
TRAC-DS is:
A new way of looking at clustering
Built on top of an existing clustering algorithm
TRAC-DS may be used with any stream clustering algorithm
TRAC-DS Overview
Data Stream Clustering
At each point in time, a data stream clustering ζ is a partitioning of D', the data seen thus far.
Instead of the whole partitions C_1, C_2, ..., C_k, only synopses Cc_1, Cc_2, ..., Cc_k are available, and k is allowed to change over time.
The summaries Cc_i with i = 1, 2, ..., k typically contain information about the size, distribution, and location of the data points in C_i.
TRAC-DS Definition
Given a data stream clustering ζ, a temporal relationship among clusters (TRAC-DS) overlays ζ with an EMM M in such a way that the following are satisfied:
(1) There is a one-to-one correspondence between the clusters in ζ and the states S in M.
(2) A transition a_ij in the EMM M represents the probability that, given a data point in cluster i, the next data point in the data stream will belong to cluster j, with i, j = 1, 2, ..., k.
(3) The EMM M is created online together with the data stream clustering.
Clustering Operations
A clustering operation is a function $q : \zeta \times x \rightarrow \zeta$ which is used by the data stream clustering algorithm to update the clustering ζ given some additional information x, which is either a new data point or other information (e.g., the number of the cluster to be deleted to simplify the clustering).
TRAC-DS Operations
A TRAC-DS operation is a function $r : M \times s_c \times y \rightarrow M \times s_c$ that updates the temporal relationship among clusters represented by the EMM M with states S, given a current state $s_c \in S$ and additional information y, and returns an updated EMM and possibly a new current state.
In order to be able to dynamically update the EMM M, we need to store a transition count matrix C.
The count $c_{ij}$ in C contains the number of times we observed a new point being assigned by the clustering algorithm to cluster i followed by a point being assigned to cluster j.
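The bookkeeping for the count matrix C can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class `TransitionCounts`, its method names, and the choice of a nested dictionary for C are all assumptions made for the example. Each call records one cluster assignment reported by whatever stream clustering algorithm is in use.

```python
class TransitionCounts:
    """Hypothetical holder for the count matrix C and the current state s_c."""

    def __init__(self):
        self.counts = {}     # counts[i][j] holds c_ij
        self.current = None  # current state s_c (None before the first point)

    def assign_point(self, cluster_id):
        """Record that the newest data point was assigned to `cluster_id`."""
        # Create the state on first sight, which also covers a new cluster.
        self.counts.setdefault(cluster_id, {})
        if self.current is not None:
            row = self.counts[self.current]
            row[cluster_id] = row.get(cluster_id, 0) + 1
        self.current = cluster_id

    def probability(self, i, j):
        """Estimate a_ij as c_ij divided by the sum of row i of C."""
        row = self.counts.get(i, {})
        total = sum(row.values())
        return row.get(j, 0) / total if total else 0.0


# Illustrative sequence of cluster assignments, not taken from the slides.
tc = TransitionCounts()
for cluster in [1, 2, 1, 1, 2, 3]:
    tc.assign_point(cluster)
```

Storing counts rather than probabilities keeps each update to a single increment; probabilities are derived only when they are needed.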
Stream Clustering Operations *
q_assign_point(ζ, x): Assigns the new data point x to an existing cluster.
q_new_cluster(ζ, x): Creates a new cluster.
q_remove_cluster(ζ, x): Removes a cluster. Here x is the cluster, i, to be removed. In this case the associated summary Cc_i is removed from ζ and k is decremented by one.
q_merge_clusters(ζ, x): Merges two clusters.
q_fade_clusters(ζ, x): Fades the cluster structure.
q_split_clusters(ζ, x): Splits a cluster.
* Inspired by MONIC [13]
TRAC-DS Operations
r_assign_point(M, s_c, y): Assigns the new data point to the state representing an existing cluster.
r_new_cluster(M, s_c, y): Creates a state for a new cluster.
r_remove_cluster(M, s_c, y): Removes a state.
r_merge_clusters(M, s_c, y): Merges two states.
r_fade_clusters(M, s_c, y): Fades the transition probabilities using an exponential decay $f(t) = 2^{-\lambda t}$.
r_split_clusters(M, s_c, y): Splits states.
Y clustering operations.
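A few of these operations can be sketched on a count matrix stored as a nested dictionary. This is only an illustration under stated assumptions: the function names are hypothetical, the decay is assumed to be applied to stored transition counts (from which probabilities would then be derived), and merging is assumed to sum the counts of the two states.

```python
def remove_state(counts, i):
    """Drop state i together with its outgoing and incoming transitions."""
    counts.pop(i, None)
    for row in counts.values():
        row.pop(i, None)


def merge_states(counts, i, j, merged):
    """Return a new count matrix in which states i and j become `merged`."""
    def relabel(state):
        return merged if state in (i, j) else state

    result = {}
    for src, row in counts.items():
        new_row = result.setdefault(relabel(src), {})
        for dst, c in row.items():
            key = relabel(dst)
            new_row[key] = new_row.get(key, 0) + c
    return result


def fade(counts, lam, t=1.0):
    """Scale every count in place by f(t) = 2 ** (-lam * t)."""
    factor = 2.0 ** (-lam * t)
    for row in counts.values():
        for dst in row:
            row[dst] *= factor


# Illustrative counts, not taken from the slides.
counts = {"a": {"b": 2, "c": 1}, "b": {"a": 1}, "c": {"a": 1}}
merged = merge_states(counts, "b", "c", "bc")
fade(merged, lam=1.0)
```

With `lam=1.0` and `t=1.0` every count is halved, so older transitions carry less weight each time the fade is applied.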
TRAC-DS Example
TRAC-DS Advantages
Dynamic
Flexible: use any clustering algorithm
Supports clustering operations
Scalable
Merges Clustering & Markov Modeling
Objectives/Outline
Introduction
Background:
TRAC-DS
TRAC-DS Applications
Anomaly Detection
Bioinformatics
Conclusions/Future Work
What is an Anomaly in Stream Data?
Rare - Anomalous - Surprising
Out of the ordinary
Not outlier detection
No knowledge of data distribution
Data is not static
Must take temporal and spatial values into account
May be interested in sequence of events
Ex: Snow in upstate New York is not an anomaly
Snow in upstate New York in June is rare
Rare events may change over time
TRAC-DS Approach to Detect Anomalies
By learning what is normal, the model can predict what is not
Normal is based on likelihood of occurrence
Use TRAC-DS to build clusters and behavior between clusters
We view a rare event as:
Unusual event
Transition between event states which does not frequently occur
Continue learning
Determining Rare
Occurrence Frequency (OF_i) of an EMM state S_i is the normalized count of the state:
Normalized Transition Probability (NTP_mn), from one state, S_m, to another, S_n, is a normalized transition count:
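One plausible reading of these two measures is sketched below. It assumes, and this is an assumption rather than something stated in the text, that both the state count and the transition count are normalized by the total of all state counts. The function names and the `state_counts` and `transition_counts` containers are hypothetical.

```python
def occurrence_frequency(state_counts, i):
    """OF_i: count of state i divided by the total of all state counts.

    Normalizing by the total state count is an assumption of this sketch.
    """
    total = sum(state_counts.values())
    return state_counts[i] / total if total else 0.0


def normalized_transition_probability(state_counts, transition_counts, m, n):
    """NTP_mn: count of transitions m -> n divided by the same total.

    Using the same denominator as OF_i is an assumption of this sketch.
    """
    total = sum(state_counts.values())
    return transition_counts.get((m, n), 0) / total if total else 0.0


# Illustrative counts, not taken from the slides.
state_counts = {"S1": 6, "S2": 3, "S3": 1}
transition_counts = {("S1", "S2"): 2, ("S2", "S3"): 1}
of_s3 = occurrence_frequency(state_counts, "S3")
ntp_23 = normalized_transition_probability(
    state_counts, transition_counts, "S2", "S3"
)
```

Under this reading, a state or transition whose value falls below a chosen threshold would be treated as rare; the threshold itself is application specific.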
Datasets/Anomalies
MnDot - Minnesota Department of Transportation
Automobile Accident
Ouse and Serwent - River flow data from England
Flood
Drought
KDD Cup 1999 & 2000
http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
Intrusion
Cisco VoIP - VoIP traffic data obtained at Cisco
Unusual Phone Call
EMM Sublinear Growth
Serwent Data
TRAC-DS River Prediction
TRAC-DS Rare Event Detection
Weekdays / Weekend
Minnesota DOT Traffic Data
Detected unusual weekend traffic pattern
TRAC-DS Intrusion Detection
DARPA 1999/2000
Synthetic Dataset
MIT Lincoln Lab
The DARPA 1999 dataset, which is free of attacks for two weeks (1st week and 3rd week), is used as training data.
The DARPA 2000 dataset, which contains DDoS attacks, is used as test data.
DARPA 1999 and 2000
Threshold   Detection Rate   False Positive Rate
0.9         6%               94%
0.8         20%              80%
0.7         50%              50%
0.6         100%             0%
Table 8. EMM detection and false positive rates.
TRAC-DS Intrusion Detection
10/17/06
37
TRAC-DS & Bioinformatics
Analysis of DNA/RNA Sequences
Applications:
Classification
Differentiation
16S RNA
1542 nt rRNA
Highly conserved across species
miRNA
Short (20-25 nt) sequence of noncoding RNA
Known since 1993 but significance not widely appreciated until 2001
Impact / Prevent translation of mRNA
First - Convert Sequence to NSV
acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga
Moving Window
             A   C   G   T
Pos 0-8      2   3   3   1
Pos 1-9      1   3   3   2
...
Pos 34-42    2   4   2   1
Next - Apply TRAC-DS
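The moving-window conversion can be sketched as below: a window of 9 positions slides one base at a time, and each window is summarized by its counts of A, C, G, and T. The function name `sequence_to_nsv` and its parameters are hypothetical; the sequence and the window length (positions 0-8) come from the slide.

```python
def sequence_to_nsv(sequence, window=9, alphabet="acgt"):
    """Return one count vector [A, C, G, T] per window position."""
    seq = sequence.lower()
    return [
        [seq[start:start + window].count(base) for base in alphabet]
        for start in range(len(seq) - window + 1)
    ]


seq = "acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga"
nsvs = sequence_to_nsv(seq)
print(nsvs[0])   # window at positions 0-8
print(nsvs[1])   # window at positions 1-9
print(nsvs[-1])  # window at positions 34-42
```

Each resulting vector can then be fed, in order, to the clustering step as one point of the stream.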
TRAC-DS Prediction with miRNA
Positive Data Model
Cutoff Probability = 0.3
False Positive Rate = 0%
True Positive Rate = 66%
Test results could be improved by meta classifiers combining multiple positive and negative classifiers together.
Profile EMMs
Examples of three different Profile EMMs constructed for 16S data from 3 different bacteria families
Profile EMMs for Organism Classification
16S Classification Accuracy
Classification accuracy using different scoring metrics on 16S rRNA data from NCBI.
We learned 31 classification models (at the phylogenetic class level) from 98 organisms and tested with 23 randomly chosen organisms.
The Profile EMM approach was able to achieve classification accuracy of more than 90% after tuning the resolution settings.
TRAC-DS and Bioinformatics
Efficient
Alignment-free sequence analysis
Clustering reduces size of model
Flexible
Any sequence
Applicability to Metagenomics
Scoring based on similarity between EMMs or EMM and input sequence
Applications
Classification
Differentiation
Objectives/Outline
Introduction
Background
TRAC-DS
TRAC-DS Applications
Conclusions/Future Work
TRAC-DS Ongoing/Future
Create online tool suite
Improve TRAC algorithms:
Aging
Delete state
Merge states
Split states
Apply to Image Recognition
Bioinformatics
Build Profile EMM database of NCBI 16S Bacteria Data
Perform classification using Metagenomic Data collected from Yellowstone National Park
Bibliography
1) C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, “A framework for clustering evolving data streams,” Proceedings of the International Conference on Very Large Data Bases (VLDB), pp 81-92, 2003.
2) D. Barbara, “Requirements for clustering data streams,” SIGKDD Explorations, Vol 3, No 2, pp 23-27, 2002.
3) Margaret H. Dunham, Donya Quick, Yuhang Wang, Monnie McGee, Jim Waddle, “Visualization of DNA/RNA Structure using Temporal CGRs,” Proceedings of the IEEE 6th Symposium on Bioinformatics & Bioengineering (BIBE06), October 16-18, 2006, Washington D.C., pp 171-178.
4) S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering data streams: Theory and practice,” IEEE Transactions on Knowledge and Data Engineering, Vol 15, No 3, pp 515-528, 2003.
5) Michael Hahsler and Margaret H. Dunham, “TRACDS: Temporal Relationship Among Clusters for Data Streams,” October 2009, submitted to SIAM International Conference on Data Mining.
6) Jie Huang, Yu Meng, and Margaret H. Dunham, “Extensible Markov Model,” Proceedings IEEE ICDM Conference, November 2004, pp 371-374.
7) Charlie Isaksson, Yu Meng, and Margaret H. Dunham, “Risk Leveling of Network Traffic Anomalies,” International Journal of Computer Science and Network Security, Vol 6, No 6, June 2006, pp 258-265.
8) Charlie Isaksson and Margaret H. Dunham, “A Comparative Study of Outlier Detection,” July 2009, Proceedings of the IEEE MLDM Conference, pp 440-453.
9) Mallik Kotamarti, Douglas W. Raiford, M. L. Raymer, and Margaret H. Dunham, “A Data Mining Approach to Predicting Phylum for Microbial Organisms Using Genome-Wide Sequence Data,” Proceedings of the IEEE Ninth International Conference on Bioinformatics and Bioengineering, pp 161-167, June 22-24 2009.
10) Yu Meng and Margaret H. Dunham, “Efficient Mining of Emerging Events in a Dynamic Spatiotemporal,” Proceedings of the IEEE PAKDD Conference, April 2006, Singapore. (Also in Lecture Notes in Computer Science, Vol 3918, 2006, Springer Berlin/Heidelberg, pp 750-754.)
11) Yu Meng and Margaret H. Dunham, “Mining Developing Trends of Dynamic Spatiotemporal Data Streams,” Journal of Computers, Vol 1, No 3, June 2006, pp 43-50.
12) MIT Lincoln Laboratory, DARPA Intrusion Detection Evaluation, http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/index.html, (2008).
13) M. Spiliopoulou, I. Ntoutsi, Y. Theodoridis, and R. Schult, “MONIC: Modeling and monitoring cluster transitions,” Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, pp 706-711, 2006.