10/26/09, Wilfrid
Laurier University


Temporal Relationship Among Clusters for Data Streams



Margaret H. Dunham, Michael Hahsler, Doug Raiford

Students: Yu Meng, Donya Quick, Jie Huang, Charlie Isaksson, Mallik Kotamarti

CSE Department

Southern Methodist University

Dallas, Texas 75275

mhd@lyle.smu.edu


This material is based upon work supported by the National Science Foundation under Grant No. IIS-0948893.


Objectives/Outline


Introduction


Background


TRAC-DS


TRAC-DS Applications


Conclusions/Future Work


Traditional Clustering of Data Streams Ignores one of the most Salient Features of Streams: Ordering


Objectives/Outline


Introduction


Stream Data


Motivation


Background


TRAC-DS


TRAC-DS Applications


Conclusions/Future Work



Stream Data


A growing number of applications generate streams
of data.


Computer network monitoring data


Call detail records in telecommunications


Highway transportation traffic data


Online web purchase log records


Sensor network data


Stock exchange, transactions in retail chains, ATM
operations in banks, credit card transactions.


Clustering techniques play a key role in modeling
and analyzing this data.



Stream Data Format


Events arriving in a stream


At any time t, we can view the state of the problem as represented by a vector of n numeric values:

$V_t = \langle S_{1t}, S_{2t}, \ldots, S_{nt} \rangle$

Columns are ordered by time:

        V_1     V_2     ...     V_q
S_1     S_11    S_12    ...     S_1q
S_2     S_21    S_22    ...     S_2q
...     ...     ...     ...     ...
S_n     S_n1    S_n2    ...     S_nq


Data Stream Modeling


Single pass: Each record is examined at most once


Bounded storage: Limited Memory for storing synopsis


Real-time: Per record processing time must be low


Summarization (synopsis) of data


Use data NOT SAMPLE


Temporal and Spatial


Dynamic


Continuous (infinite stream)


Learn


Forget


Sublinear growth rate - Clustering

[Figure: panels titled Traditional Clustering and TRAC-DS]


Motivation


Temporal Ordering is a major feature of
stream data.


Many stream applications depend on this
ordering


Prediction of future values


Anomaly (rare event) detection


Concept drift


Objectives/Outline


Introduction


Background


Clustering Stream Data


Extensible Markov Model - EMM


TRAC-DS


TRAC-DS Applications


Conclusions/Future Work


Stream Clustering Requirements


Dynamic updating of the clusters


Identify outliers


Barbara [2]:


compactness


fast


incremental processing


Stream Clustering Algorithms


LOCALSEARCH [4]


Partitions stream into segments


Clusters each segment individually by solving the k-medians problem


Iteratively reclusters the resulting centers


CluStream [1]


Micro-clusters represented by summary statistics.


Micro-clusters are handled online


Micro-clusters merged offline


MONIC [13]


Evolution of clusters over time


Cluster transitions over time



MM

A first order Markov Chain is a finite or countably infinite sequence of events $\{E_1, E_2, \ldots\}$ over discrete time points, where $P_{ij} = P(E_j \mid E_i)$, and at any time the future behavior of the process is based solely on the current state.

A Markov Model (MM) is a graph with m vertices or states, S, and directed arcs, A, such that:

$S = \{N_1, N_2, \ldots, N_m\}$, and

$A = \{L_{ij} \mid i \in 1, 2, \ldots, m,\ j \in 1, 2, \ldots, m\}$, and each arc $L_{ij} = \langle N_i, N_j \rangle$ is labeled with a transition probability $P_{ij} = P(N_j \mid N_i)$.
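As a rough illustration of the definition above, the transition probabilities $P_{ij}$ can be estimated from an observed state sequence by counting consecutive pairs and normalizing each row. This is a minimal sketch; the function name and the sample sequence are hypothetical and not part of the talk.

```python
from collections import defaultdict


def transition_probabilities(sequence):
    """Estimate P(N_j | N_i) by counting consecutive state pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1

    probs = {}
    for state, row in counts.items():
        total = sum(row.values())  # number of transitions leaving `state`
        probs[state] = {nxt: c / total for nxt, c in row.items()}
    return probs


# Hypothetical sequence of visited states.
observed = ["N1", "N2", "N1", "N1", "N3", "N1"]
P = transition_probabilities(observed)
```

Each row of `P` sums to 1, which matches the requirement that the arcs leaving a state carry a probability distribution over next states.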


Extensible Markov Model (EMM)


Time Varying Discrete First Order Markov Model


Nodes are clusters of real world states.


Learning continues during application phase.


Learning:


Transition probabilities between states (clusters)


State labels (Cluster summary)


State are modified as clusters are


EMM for TRAC-DS Modeling

Input vectors:

<18,10,3,3,1,0,0>
<17,10,2,3,1,0,0>
<16,9,2,3,1,0,0>
<14,8,2,3,1,0,0>
<14,8,2,3,0,0,0>
<18,10,3,3,1,1,0>

[Figure: successive snapshots of the EMM as the vectors arrive, with states N1, N2, N3 and transition probabilities on the arcs]


Objectives/Outline


Introduction


Background


TRAC-DS


Definition


Relationship to Traditional Clustering


Operations


TRAC-DS Applications


Conclusions/Future Work


TRAC-DS NOTE


TRAC-DS is not:


Another stream clustering algorithm


TRAC-DS is:


A new way of looking at clustering


Built on top of an existing clustering algorithm


TRAC-DS may be used with any stream clustering algorithm



TRAC-DS Overview


Data Stream Clustering


At each point in time, a data stream clustering ζ is a partitioning of D', the data seen thus far.


Instead of the whole partitions $C_1, C_2, \ldots, C_k$, only synopses $C_{c1}, C_{c2}, \ldots, C_{ck}$ are available, and k is allowed to change over time.


The summaries $C_{ci}$ with $i = 1, 2, \ldots, k$ typically contain information about the size, distribution and location of the data points in $C_i$.
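One common way to keep such a summary in bounded memory is to store, per cluster, only a point count, a per-dimension sum, and a per-dimension sum of squares. The sketch below follows that idea; the class name and fields are illustrative assumptions and are not taken from the talk or from any specific algorithm's API.

```python
class ClusterSynopsis:
    """Incremental cluster summary: size, location, and spread."""

    def __init__(self, dim):
        self.n = 0                      # size: number of points absorbed
        self.linear_sum = [0.0] * dim   # per-dimension sum of values
        self.square_sum = [0.0] * dim   # per-dimension sum of squared values

    def add(self, point):
        """Absorb one point in O(dim) time without storing it."""
        self.n += 1
        for d, value in enumerate(point):
            self.linear_sum[d] += value
            self.square_sum[d] += value * value

    def center(self):
        """Location: the per-dimension mean."""
        return [s / self.n for s in self.linear_sum]

    def variance(self):
        """Distribution: the per-dimension population variance."""
        return [
            sq / self.n - (s / self.n) ** 2
            for s, sq in zip(self.linear_sum, self.square_sum)
        ]


synopsis = ClusterSynopsis(dim=2)
for p in [(1.0, 2.0), (3.0, 2.0)]:
    synopsis.add(p)
```

Because the three statistics are additive, two synopses can also be merged by summing their fields, which is what makes this kind of summary convenient for merge operations.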


TRAC-DS Definition

Given a data stream clustering ζ, a temporal relationship among clusters (TRAC-DS) overlays a data stream clustering ζ with an EMM M, in such a way that the following are satisfied:

(1) There is a one-to-one correspondence between the clusters in ζ and the states S in M.

(2) A transition $a_{ij}$ in the EMM M represents the probability that, given a data point in cluster i, the next data point in the data stream will belong to cluster j, with $i, j = 1, 2, \ldots, k$.

(3) The EMM M is created online together with the data stream clustering.



Clustering Operations

A clustering operation is a function $q : \zeta \times x \rightarrow \zeta$ which is used by the data stream clustering algorithm to update the clustering ζ given some additional information x, which is either a new data point or other information (e.g., the number of the cluster to be deleted in order to simplify the clustering).


TRAC-DS Operations


A TRAC-DS operation is a function $r : M \times s_c \times y \rightarrow M \times s_c$ that updates the temporal relationship among clusters represented by the EMM M with states S, given a current state $s_c \in S$ and additional information y, and returns an updated EMM and possibly a new current state.


In order to dynamically update the EMM M, we need to store a transition count matrix C. The count $c_{ij}$ in C contains the number of times we observed a new point being assigned by the clustering algorithm to cluster i followed by a point being assigned to cluster j.
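The count matrix can be maintained with one increment per arriving point. The following sketch assumes the clustering algorithm reports the cluster id of each new point, and it derives $a_{ij}$ by row-normalizing the counts, $a_{ij} = c_{ij} / \sum_{j'} c_{ij'}$; that normalization and all names here are assumptions made for illustration.

```python
class TransitionCounts:
    """Sparse transition count matrix C with a current state s_c."""

    def __init__(self):
        self.counts = {}     # counts[i][j] holds c_ij
        self.current = None  # current state s_c (None before the first point)

    def observe(self, cluster):
        """Record that the newest point was assigned to `cluster`."""
        self.counts.setdefault(cluster, {})
        if self.current is not None:
            row = self.counts[self.current]
            row[cluster] = row.get(cluster, 0) + 1
        self.current = cluster

    def probability(self, i, j):
        """Estimate a_ij from the counts; 0.0 if state i has no outgoing counts."""
        row = self.counts.get(i, {})
        total = sum(row.values())
        return row.get(j, 0) / total if total else 0.0


model = TransitionCounts()
for cluster in [1, 2, 1, 2, 1, 3]:  # hypothetical cluster assignments
    model.observe(cluster)
```

Storing counts rather than probabilities keeps each update constant-time, and probabilities are only computed when they are queried.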


Stream Clustering Operations *


$q_{\text{assign point}}(\zeta, x)$: Assigns the new data point x to an existing cluster.


$q_{\text{new cluster}}(\zeta, x)$: Creates a new cluster.


$q_{\text{remove cluster}}(\zeta, x)$: Removes a cluster. Here x is the cluster, i, to be removed. In this case the associated summary $C_{ci}$ is removed from ζ and k is decremented by one.


$q_{\text{merge clusters}}(\zeta, x)$: Merges two clusters.


$q_{\text{fade clusters}}(\zeta, x)$: Fades the cluster structure.


$q_{\text{split clusters}}(\zeta, x)$: Splits a cluster.


* Inspired by MONIC [13]





TRAC-DS Operations


$r_{\text{assign point}}(M, s_c, y)$: Assigns the new data point to the state representing an existing cluster.


$r_{\text{new cluster}}(M, s_c, y)$: Creates a state for a new cluster.


$r_{\text{remove cluster}}(M, s_c, y)$: Removes a state.


$r_{\text{merge clusters}}(M, s_c, y)$: Merges two states.


$r_{\text{fade clusters}}(M, s_c, y)$: Fades the transition probabilities using an exponential decay $f(t) = 2^{-\lambda t}$.


$r_{\text{split clusters}}(M, s_c, y)$: Splits states.

Y clustering operations.
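Two of these operations can be sketched directly on a transition count table. The code below is an assumption-laden illustration: it applies the decay $2^{-\lambda t}$ to the stored counts (so that later observations weigh more than older ones) and merges two states by summing their rows and columns. The function names, the dictionary layout, and the choice to fade counts are all hypothetical.

```python
def fade(counts, lam, t=1.0):
    """Scale every transition count by the decay factor 2^(-lam * t)."""
    factor = 2.0 ** (-lam * t)
    return {i: {j: c * factor for j, c in row.items()} for i, row in counts.items()}


def merge_states(counts, a, b, merged):
    """Merge states a and b into one state named `merged`, summing their counts."""
    def rename(state):
        return merged if state in (a, b) else state

    result = {}
    for i, row in counts.items():
        target = result.setdefault(rename(i), {})
        for j, c in row.items():
            target[rename(j)] = target.get(rename(j), 0) + c
    return result


# Hypothetical count table: counts[i][j] = c_ij
counts = {1: {2: 4, 3: 2}, 2: {1: 4}, 3: {1: 2}}
faded = fade(counts, lam=1.0)
merged = merge_states(counts, 2, 3, "2+3")
```

With `lam=1.0` and `t=1.0` every count is halved. After merging states 2 and 3, all six transitions out of state 1 point to the merged state.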


TRAC-DS Example


TRAC-DS Advantages


Dynamic


Flexible




Use any Clustering Algorithm


Supports any clustering operations


Scalable


Merges Clustering & Markov Modeling


Objectives/Outline


Introduction


Background


TRAC-DS


TRAC-DS Applications


Anomaly Detection


Bioinformatics


Conclusions/Future Work



What is an Anomaly in Stream Data?


Rare - Anomalous


Surprising


Out of the ordinary


Not outlier detection


No knowledge of data distribution


Data is not static


Must take temporal and spatial values into account


May be interested in sequence of events


Ex: Snow in upstate New York is not an anomaly


Snow in upstate New York in June is rare


Rare events may change over time


TRAC-DS Approach to Detect Anomalies


By learning what is normal, the model can predict what is not


Normal is based on likelihood of occurrence


Use TRAC-DS to build clusters and model the behavior between clusters


We view a rare event as:


Unusual event


Transition between event states that does not occur frequently.


Continue learning





Determining Rare


Occurrence Frequency ($OF_i$) of an EMM state $S_i$ is the normalized count of the state:


Normalized Transition Probability ($NTP_{mn}$) from one state, $S_m$, to another, $S_n$, is a normalized transition count:
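The formulas themselves are not present in this text. Purely as an illustration, the sketch below assumes that both quantities are normalized by the total number of state visits, i.e. $OF_i = CN_i / \sum_k CN_k$ and $NTP_{mn} = CL_{mn} / \sum_k CN_k$, where $CN_i$ is the visit count of state $S_i$ and $CL_{mn}$ is the count of transitions from $S_m$ to $S_n$. These normalizers, the symbols $CN$ and $CL$, and every name in the code are assumptions and may differ from the definitions shown in the talk.

```python
def occurrence_frequency(state_counts, i):
    """OF_i: visits to state i divided by total visits (assumed normalizer)."""
    total = sum(state_counts.values())
    return state_counts[i] / total


def normalized_transition_probability(transition_counts, state_counts, m, n):
    """NTP_mn: count of transitions m -> n divided by total visits (assumed normalizer)."""
    total = sum(state_counts.values())
    return transition_counts.get((m, n), 0) / total


# Hypothetical counts collected while building the model.
state_counts = {"S1": 6, "S2": 3, "S3": 1}
transition_counts = {("S1", "S2"): 3, ("S2", "S1"): 2, ("S1", "S3"): 1}

of_s3 = occurrence_frequency(state_counts, "S3")
ntp_s1_s3 = normalized_transition_probability(transition_counts, state_counts, "S1", "S3")
ntp_s3_s2 = normalized_transition_probability(transition_counts, state_counts, "S3", "S2")
```

Under these assumptions a state or transition would be flagged as rare when its value falls below a chosen threshold; a transition that has never been observed scores 0.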



Datasets/Anomalies


MnDot



Minnesota Department of Transportation


Automobile Accident


Ouse and Serwent



River flow data from England


Flood


Drought


KDD Cup 1999 & 2000


http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html


Intrusion


Cisco VoIP


VoIP traffic data obtained at Cisco


Unusual Phone Call


EMM Sublinear Growth

Serwent Data

TRAC-DS River Prediction


TRAC-DS Rare Event Detection

Minnesota DOT Traffic Data (panels: Weekdays, Weekend)

Detected unusual weekend traffic pattern

TRAC-DS Intrusion Detection


DARPA 1999/2000


Synthetic Dataset


MIT Lincoln Lab


The DARPA 1999 dataset, which is free of attacks for two weeks (1st week and 3rd week), is used as training data.


The DARPA 2000 dataset, which contains DDoS attacks, is used as test data.



DARPA 1999 and 2000

Threshold    Detection Rate    False Positive Rate
0.9          6%                94%
0.8          20%               80%
0.7          50%               50%
0.6          100%              0%

Table 8. EMM detection and false positive rates.

TRAC-DS Intrusion Detection

10/17/06


TRAC-DS & Bioinformatics


Analysis of DNA/RNA Sequences


Applications:


Classification


Differentiation


16S rRNA


1542 nt rRNA


Highly conserved across species


miRNA


Short (20-25 nt) sequence of noncoding RNA


Known since 1993 but significance not widely appreciated until 2001


Impact / Prevent translation of mRNA



First: Convert Sequence to NSV

acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga

Moving Window

             A    C    G    T
Pos 0-8      2    3    3    1
Pos 1-9      1    3    3    2
...
Pos 34-42    2    4    2    1

Next: Apply TRAC-DS
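The moving-window conversion can be reproduced with a few lines of code. This sketch slides a window of 9 positions over the sequence from the slide and counts A, C, G and T in each window; the function name and parameters are hypothetical, and the window length is inferred from the position ranges shown (0-8, 1-9, 34-42).

```python
from collections import Counter


def sequence_to_nsv(sequence, window=9, alphabet="acgt"):
    """Return one base-count vector (A, C, G, T) per window position."""
    sequence = sequence.lower()
    vectors = []
    for start in range(len(sequence) - window + 1):
        counts = Counter(sequence[start:start + window])
        vectors.append(tuple(counts[base] for base in alphabet))
    return vectors


seq = "acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga"
nsv = sequence_to_nsv(seq)

print(nsv[0])   # window at positions 0-8
print(nsv[1])   # window at positions 1-9
print(nsv[34])  # window at positions 34-42
```

The resulting vectors form the stream of numeric records that a clustering algorithm, and TRAC-DS on top of it, would then process in order.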


TRAC-DS Prediction with miRNA


Positive Data Model


Cutoff Probability = 0.3


False Positive Rate = 0%


True Positive Rate = 66%


Test results could be improved by meta classifiers combining multiple positive and negative classifiers together.

Profile EMMs



Examples of three different Profile EMMs constructed for 16S
data from 3 different bacteria families



Profile EMMs for Organism Classification


16S Classification Accuracy


Classification accuracy using different scoring metrics on 16S rRNA data from NCBI.


We learned 31 classification models (at the phylogenetic class level) from 98 organisms and tested with 23 randomly chosen organisms.


The Profile EMM approach was able to achieve classification accuracy of more than 90% after tuning the resolution settings.





TRAC-DS and Bioinformatics


Efficient


Alignment free sequence analysis


Clustering reduces size of model


Flexible


Any sequence


Applicability to Metagenomics


Scoring based on similarity between EMMs
or EMM and input sequence


Applications


Classification


Differentiation



Objectives/Outline


Introduction


Background


TRAC-DS


TRAC-DS Applications


Conclusions/Future Work


TRAC-DS Ongoing/Future


Create online tool suite


Improve TRAC algorithms:


Aging


Delete state


Merge states


Split states


Apply to Image Recognition


Bioinformatics


Build Profile EMM database of NCBI 16S Bacteria Data


Perform classification using Metagenomic Data collected from Yellowstone National Park


Bibliography

1) C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, “A framework for clustering evolving data streams,” Proceedings of the International Conference on Very Large Data Bases (VLDB), pp 81-92, 2003.

2) D. Barbara, “Requirements for clustering data streams,” SIGKDD Explorations, Vol 3, No 2, pp 23-27, 2002.

3) Margaret H. Dunham, Donya Quick, Yuhang Wang, Monnie McGee, Jim Waddle, “Visualization of DNA/RNA Structure using Temporal CGRs,” Proceedings of the IEEE 6th Symposium on Bioinformatics & Bioengineering (BIBE06), October 16-18, 2006, Washington D.C., pp 171-178.

4) S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering data streams: Theory and practice,” IEEE Transactions on Knowledge and Data Engineering, Vol 15, No 3, pp 515-528, 2003.

5) Michael Hahsler and Margaret H. Dunham, “TRACDS: Temporal Relationship Among Clusters for Data Streams,” October 2009, submitted to SIAM International Conference on Data Mining.

6) Jie Huang, Yu Meng, and Margaret H. Dunham, “Extensible Markov Model,” Proceedings IEEE ICDM Conference, November 2004, pp 371-374.

7) Charlie Isaksson, Yu Meng, and Margaret H. Dunham, “Risk Leveling of Network Traffic Anomalies,” International Journal of Computer Science and Network Security, Vol 6, No 6, June 2006, pp 258-265.

8) Charlie Isaksson and Margaret H. Dunham, “A Comparative Study of Outlier Detection,” July 2009, Proceedings of the IEEE MLDM Conference, pp 440-453.

9) Mallik Kotamarti, Douglas W. Raiford, M. L. Raymer, and Margaret H. Dunham, “A Data Mining Approach to Predicting Phylum for Microbial Organisms Using Genome-Wide Sequence Data,” Proceedings of the IEEE Ninth International Conference on Bioinformatics and Bioengineering, pp 161-167, June 22-24 2009.

10) Yu Meng and Margaret H. Dunham, “Efficient Mining of Emerging Events in a Dynamic Spatiotemporal,” Proceedings of the IEEE PAKDD Conference, April 2006, Singapore. (Also in Lecture Notes in Computer Science, Vol 3918, 2006, Springer Berlin/Heidelberg, pp 750-754.)

11) Yu Meng and Margaret H. Dunham, “Mining Developing Trends of Dynamic Spatiotemporal Data Streams,” Journal of Computers, Vol 1, No 3, June 2006, pp 43-50.

12) MIT Lincoln Laboratory: DARPA Intrusion Detection Evaluation. http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/index.html, (2008)

13) M. Spiliopoulou, I. Ntoutsi, Y. Theodoridis, and R. Schult, “MONIC: Modeling and monitoring cluster transitions,” Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, pp 706-711, 2006.