3/11/10, BYU
The Magnificent EMM
Margaret H. Dunham, Michael Hahsler, Mallik Kotamarti, Charlie Isaksson
CSE Department
Southern Methodist University
Dallas, Texas 75275
lyle.smu.edu/~mhd
mhd@lyle.smu.edu
This material is based upon work supported by the National Science Foundation under Grant No. IIS-0948893.
Objectives/Outline
EMM Overview
EMM + Stream Clustering
EMM + Bioinformatics
Objectives/Outline
EMM Overview
Why
What
How
EMM + Stream Clustering
EMM + Bioinformatics
Lots of Questions
Why don’t data miners practice what they preach?
Why is training usually viewed as a one-time thing?
Why do we usually ignore the temporal aspect of data streams?
Continuous Learning
Interleave learning & application
Add time to online clustering
MM
A first-order Markov Chain is a finite or countably infinite sequence of events $\{E_1, E_2, \ldots\}$ over discrete time points, where $P_{ij} = P(E_j \mid E_i)$, and at any time the future behavior of the process is based solely on the current state.
A Markov Model (MM) is a graph with m vertices or states, S, and directed arcs, A, such that:
$S = \{N_1, N_2, \ldots, N_m\}$, and
$A = \{L_{ij} \mid i \in \{1, 2, \ldots, m\},\ j \in \{1, 2, \ldots, m\}\}$, and
each arc $L_{ij} = \langle N_i, N_j \rangle$ is labeled with a transition probability $P_{ij} = P(N_j \mid N_i)$.
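The transition probabilities in the definition above can be estimated from an observed sequence of states by counting. The following is a minimal sketch, not taken from the slides: the function name `transition_probabilities` and the sample sequence are hypothetical, and each probability is assumed to be the arc count divided by the total number of transitions leaving the source state.

```python
from collections import Counter, defaultdict


def transition_probabilities(states):
    """Estimate P_ij = P(N_j | N_i) from an observed state sequence."""
    counts = defaultdict(Counter)
    # Count every consecutive (current, next) pair in the sequence.
    for current, nxt in zip(states, states[1:]):
        counts[current][nxt] += 1
    # Normalize each row by the number of transitions leaving that state.
    return {
        i: {j: c / sum(row.values()) for j, c in row.items()}
        for i, row in counts.items()
    }


# Hypothetical sequence of visited states.
probs = transition_probabilities(["N1", "N2", "N1", "N3", "N1", "N2"])
print(probs["N1"])
```

Because only the current state conditions the next one, a single pass over consecutive pairs is enough to fill in every arc label.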
Problem with Markov Chains
The required structure of the MC may not be certain at the model construction time.
As the real world being modeled by the MC changes, so should the structure of the MC.
Not scalable: grows linearly with the number of events.
Our solution:
Extensible Markov Model (EMM)
Cluster real world events
Allow Markov chain to grow and shrink dynamically
EMM (Extensible Markov Model)
Time Varying Discrete First Order Markov Model
Continuously evolves
Nodes are clusters of real world states.
Learning continues during prediction phase.
Learning:
Transition probabilities between nodes
Node labels (centroid of cluster)
Nodes are added and removed as data arrives
EMM Definition
Extensible Markov Model (EMM): at any time t, EMM consists of an MC with designated current node, Nn, and algorithms to modify it, where algorithms include:
EMMCluster, which defines a technique for matching between input data at time t + 1 and existing states in the MC at time t.
EMMIncrement algorithm, which updates the MC at time t + 1 given the MC at time t and the clustering measure result at time t + 1.
EMMDecrement algorithm, which removes nodes from the EMM when needed.
EMM Cluster
Nearest Neighbor
If none is “close”, create a new node
Labeling of cluster is the centroid of the members in the cluster
O(n), where n is the number of states
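The nearest-neighbor matching step can be sketched as a single scan over the existing node centroids. This is an illustration only: the function name `emm_cluster`, the Euclidean distance, and the fixed `threshold` that decides what counts as "close" are assumptions, since the slides do not name a distance measure or a threshold.

```python
import math


def emm_cluster(centroids, point, threshold):
    """Return the index of the nearest centroid, or None if none is close.

    Scans all n existing states once, hence O(n) per input vector.
    """
    best_index, best_dist = None, math.inf
    for index, centroid in enumerate(centroids):
        dist = math.dist(centroid, point)  # Euclidean distance (an assumption)
        if dist < best_dist:
            best_index, best_dist = index, dist
    if best_dist <= threshold:
        return best_index
    return None  # the caller would create a new node for this point


# Hypothetical centroids and inputs.
centroids = [(18, 10, 3), (14, 8, 2)]
match = emm_cluster(centroids, (17, 10, 2), threshold=2.0)
no_match = emm_cluster(centroids, (40, 1, 9), threshold=2.0)
```

A `None` result signals that a new node should be created with the input vector as its initial centroid.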
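EMM Increment
Input vectors shown in the example: <18,10,3,3,1,0,0>, <17,10,2,3,1,0,0>, <16,9,2,3,1,0,0>, <14,8,2,3,1,0,0>, <14,8,2,3,0,0,0>, <18,10,3,3,1,1,0>
[Figure: successive snapshots of the EMM as the vectors arrive. The model grows from a single node N1 to three nodes N1, N2, N3; arcs are labeled with transition probabilities such as 1/1, 2/2, 1/2, 1/3, and 2/3.]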
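The increment step can be sketched as count bookkeeping on a small graph. This is a toy illustration rather than the published EMMIncrement algorithm: the class `SimpleEMM`, its attribute names, and the input sequence of matched cluster ids are hypothetical, and each arc probability is assumed to be the arc count divided by the total outgoing count of its source node.

```python
from collections import defaultdict


class SimpleEMM:
    """Toy EMM: nodes are integer ids, arcs carry transition counts."""

    def __init__(self):
        self.node_counts = defaultdict(int)  # visits per node
        self.link_counts = defaultdict(int)  # (from, to) -> count
        self.current = None                  # designated current node

    def increment(self, node):
        """Record a move from the current node to `node` (new or existing)."""
        if self.current is not None:
            self.link_counts[(self.current, node)] += 1
        self.node_counts[node] += 1
        self.current = node

    def probability(self, src, dst):
        """Arc count divided by the total outgoing count of `src`."""
        total = sum(c for (s, _), c in self.link_counts.items() if s == src)
        return self.link_counts.get((src, dst), 0) / total if total else 0.0


emm = SimpleEMM()
for node in [1, 1, 2, 3, 1, 2]:  # hypothetical sequence of matched cluster ids
    emm.increment(node)
```

With this sequence, node 1 has three outgoing transitions, two of them to node 2, so the arc from 1 to 2 carries 2/3 and the self-loop carries 1/3.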
EMMDecrement
[Figure: EMMDecrement example. Deleting N2 from an EMM with nodes N1, N2, N3, N5, N6 leaves nodes N1, N3, N5, N6. Arc labels visible in the figure: 2/2, 1/2, 1/3, 1/6.]
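Removing a node can be sketched as dropping the node together with every arc that enters or leaves it. The slides do not say how the remaining arc labels are recomputed, so the renormalization below is an assumption, and the names `emm_decrement`, `outgoing_probabilities`, and the sample counts are hypothetical.

```python
def emm_decrement(link_counts, node):
    """Return a copy of the arc counts with `node` and its incident arcs removed."""
    return {
        (src, dst): count
        for (src, dst), count in link_counts.items()
        if node not in (src, dst)
    }


def outgoing_probabilities(link_counts, src):
    """Renormalize the remaining outgoing arcs of `src` (an assumption)."""
    out = {dst: c for (s, dst), c in link_counts.items() if s == src}
    total = sum(out.values())
    return {dst: c / total for dst, c in out.items()}


# Hypothetical arc counts before deletion.
links = {("N1", "N2"): 2, ("N1", "N3"): 1, ("N2", "N3"): 2, ("N3", "N1"): 1}
pruned = emm_decrement(links, "N2")
```

Returning a new dictionary keeps the original counts available, which is convenient when comparing the model before and after the deletion.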
EMM Advantages
Dynamic
Adaptable
Use of clustering
Learns rare events
Scalable: growth of the EMM is not linear in the size of the data.
Hierarchical feature of EMM
Creation/evaluation in quasi-real time
Distributed / Hierarchical extensions
EMM Sublinear Growth
[Figure: Servent Data]
Growth Rate Automobile Traffic
Minnesota Traffic Data
EMM River Prediction
Determining Rare Event
Occurrence Frequency ($OF_i$) of an EMM state $S_i$ is the normalized count of the state.
Normalized Transition Probability ($NTP_{mn}$), from one state, $S_m$, to another, $S_n$, is a normalized transition count.
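The two measures can be sketched as simple ratios. The formulas themselves did not survive in this text, so the normalization used here is an assumption: both the state count and the transition count are divided by the total count over all states. The function names and sample counts are hypothetical.

```python
def occurrence_frequency(node_counts, state):
    """State count divided by the total count of all states (assumed form)."""
    total = sum(node_counts.values())
    return node_counts[state] / total


def normalized_transition_probability(node_counts, link_counts, src, dst):
    """Transition count divided by the total count of all states (assumed form)."""
    total = sum(node_counts.values())
    return link_counts.get((src, dst), 0) / total


# Hypothetical counts collected while building an EMM.
node_counts = {"S1": 6, "S2": 3, "S3": 1}
link_counts = {("S1", "S2"): 3, ("S2", "S1"): 2, ("S1", "S3"): 1}
of_s3 = occurrence_frequency(node_counts, "S3")
ntp_s1_s3 = normalized_transition_probability(node_counts, link_counts, "S1", "S3")
```

Under this reading, a state or transition with a low value relative to the others is a candidate rare event; any cut-off would have to be chosen per application.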
EMM Rare Event Detection
Intrusion Data: train DARPA 1999, test DARPA 2000
Ozone Data: UCI ML, Jaccard similarity, 2536 instances, 73 attributes, 73 ozone days
Objectives/Outline
EMM Overview
EMM + Stream Clustering
Handle evolving clusters
Incorporate time in clustering
EMM + Bioinformatics
Stream Data
A growing number of applications generate streams of data.
Computer network monitoring data
Call detail records in telecommunications
Highway transportation traffic data
Online web purchase log records
Sensor network data
Stock exchange, transactions in retail chains, ATM operations in banks, credit card transactions.
Clustering techniques play a key role in modeling and analyzing this data.
Stream Data Format
Events arriving in a stream
At any time, t, we can view the state of the problem as represented by a vector of n numeric values:
$V_t = \langle S_{1t}, S_{2t}, \ldots, S_{nt} \rangle$

        V_1    V_2    …    V_q
S_1     S_11   S_12   …    S_1q
S_2     S_21   S_22   …    S_2q
…       …      …      …    …
S_n     S_n1   S_n2   …    S_nq
(columns ordered by time)
Traditional Clustering
TRAC-DS (Temporal Relationship Among Clusters for Data Streams)
Motivation
Temporal ordering is a major feature of stream data.
Many stream applications depend on this ordering
Prediction of future values
Anomaly (rare event) detection
Concept drift
Stream Clustering Requirements
Dynamic updating of the clusters
Completely online
Identify outliers
Identify concept drifts
Barbara [2]:
compactness
fast incremental processing
Data Stream Clustering
At each point in time a data stream clustering ζ is a partitioning of D', the data seen thus far.
Instead of the whole partitions C_1, C_2, ..., C_k only synopses Cc_1, Cc_2, ..., Cc_k are available, and k is allowed to change over time.
The summaries Cc_i with i = 1, 2, ..., k typically contain information about the size, distribution and location of the data points in C_i.
TRAC-DS NOTE
TRAC-DS is not:
Another stream clustering algorithm
TRAC-DS is:
A new way of looking at clustering
Built on top of an existing clustering algorithm
TRAC-DS may be used with any stream clustering algorithm
TRAC-DS Overview
TRAC-DS Definition
Given a data stream clustering ζ, a temporal relationship among clusters (TRAC-DS) overlays a data stream clustering ζ with an EMM M, in such a way that the following are satisfied:
(1) There is a one-to-one correspondence between the clusters in ζ and the states S in M.
(2) A transition a_ij in the EMM M represents the probability that, given a data point in cluster i, the next data point in the data stream will belong to cluster j, with i, j = 1, 2, ..., k.
(3) The EMM M is created online together with the data stream clustering.
Stream Clustering Operations *
q_assign_point(ζ, x): Assigns the new data point x to an existing cluster.
q_new_cluster(ζ, x): Creates a new cluster.
q_remove_cluster(ζ, x): Removes a cluster. Here x is the cluster, i, to be removed. In this case the associated summary Cc_i is removed from ζ and k is decremented by one.
q_merge_clusters(ζ, x): Merges two clusters.
q_fade_clusters(ζ, x): Fades the cluster structure.
q_split_clusters(ζ, x): Splits a cluster.
* Inspired by MONIC [13]
TRAC-DS Operations
r_assign_point(M, sc, y): Assigns the new data point to the state representing an existing cluster.
r_new_cluster(M, sc, y): Creates a state for a new cluster.
r_remove_cluster(M, sc, y): Removes a state.
r_merge_clusters(M, sc, y): Merges two states.
r_fade_clusters(M, sc, y): Fades the transition probabilities using an exponential decay $f(t) = 2^{-\lambda t}$.
r_split_clusters(M, sc, y): Splits states.
Y clustering operations.
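The exponential decay used by the fade operation can be sketched as a weight applied to stored transition information. In this illustration the decay is applied to raw transition counts rather than directly to probabilities, which is an assumption; the function name `fade_counts`, the `elapsed` parameter, and the sample arcs are hypothetical.

```python
def fade_counts(link_counts, lam, elapsed=1.0):
    """Multiply every transition count by f(t) = 2 ** (-lam * t)."""
    factor = 2.0 ** (-lam * elapsed)
    return {arc: count * factor for arc, count in link_counts.items()}


# Hypothetical transition counts between clusters c1, c2, c3.
links = {("c1", "c2"): 8.0, ("c1", "c3"): 4.0}
faded = fade_counts(links, lam=1.0, elapsed=2.0)
```

Fading all counts by the same factor leaves the current probabilities unchanged; its effect is that transitions observed afterwards weigh more than older ones. With λ = 1, a count halves for every unit of elapsed time.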
TRAC-DS Example
Objectives/Outline
EMM Overview
EMM + Stream Clustering
EMM + Bioinformatics
Background
Preprocessing
Classification
Differentiation
DNA
Basic building blocks of organisms
Located in nucleus of cells
Composed of 4 nucleotides
Two strands bound together
http://www.visionlearning.com/library/module_viewer.php?mid=63
Central Dogma: DNA -> RNA -> Protein
[Figure: DNA is transcribed to RNA, and RNA is translated to protein (amino acids).]
DNA: CCTGAGCCAACTATTGATGAA
RNA: CCUGAGCCAACUAUUGAUGAA
Source: www.bioalgorithms.info; chapter 6; Gene Prediction
RNA
Ribonucleic Acid
Contains A, C, G, but U (Uracil) instead of T
Single stranded
May fold back on itself
Needed to create proteins
Moves around cells: can act like a messenger
mRNA: moves out of nucleus to other parts of cell
The Magical 16S
Ribosomal RNA (rRNA) is at the heart of the protein creation process
16S rRNA
About 1542 nucleotides in length
In all living organisms
Important in the classification of organisms into phyla and classes
PROBLEM: An organism may actually contain many different copies of 16S, each slightly different.
OUR WORK: Can we use EMM to quantify this diversity? Can we use it to classify different species of the same genus?
Using EMM with RNA Data
acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga
Moving Window
             A  C  G  T
Pos 0-8      2  3  3  1
Pos 1-9      1  3  3  2
…
Pos 34-42    2  4  2  1
Construct EMM with nodes representing clusters of count vectors
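The moving-window preprocessing can be sketched as follows. The window width of 9 is inferred from the position ranges on the slide (Pos 0-8, Pos 1-9, Pos 34-42); the function name `window_counts` and its parameters are hypothetical.

```python
def window_counts(sequence, width=9, alphabet="acgt"):
    """Slide a window over the sequence and count each nucleotide in it."""
    sequence = sequence.lower()
    return [
        [sequence[start:start + width].count(base) for base in alphabet]
        for start in range(len(sequence) - width + 1)
    ]


seq = "acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga"
vectors = window_counts(seq)
print(vectors[0], vectors[1], vectors[-1])
```

Each resulting count vector, ordered as A, C, G, T, is one input to the EMM, so the sequence order is preserved as the order of transitions between clusters.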
EMM for Classification
TRAC-DS and Bioinformatics
Efficient
Alignment-free sequence analysis
Clustering reduces size of model
Flexible
Any sequence
Applicability to Metagenomics
Scoring based on similarity between EMMs or EMM and input sequence
Applications
Classification
Differentiation
Profile EMMs for Organism Classification
Profile EMM: E. coli
Differentiating Strains
Is it possible to identify different species of the same genus?
Initial test with EMM:
Bacillus has 21 species
Construct EMM for each species using training set (64%)
Test by matching unknown strains (36%) and placing each in the closest EMM
All unknown strains correctly classified except one: accuracy of 95%
Bibliography
1) C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, “A framework for clustering evolving data streams,” Proceedings of the International Conference on Very Large Data Bases (VLDB), pp 81-92, 2003.
2) D. Barbara, “Requirements for clustering data streams,” SIGKDD Explorations, Vol 3, No 2, pp 23-27, 2002.
3) Margaret H. Dunham, Donya Quick, Yuhang Wang, Monnie McGee, Jim Waddle, “Visualization of DNA/RNA Structure using Temporal CGRs,” Proceedings of the IEEE 6th Symposium on Bioinformatics & Bioengineering (BIBE06), October 16-18, 2006, Washington D.C., pp 171-178.
4) S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering data streams: Theory and practice,” IEEE Transactions on Knowledge and Data Engineering, Vol 15, No 3, pp 515-528, 2003.
5) Michael Hahsler and Margaret H. Dunham, “TRACDS: Temporal Relationship Among Clusters for Data Streams,” October 2009, submitted to SIAM International Conference on Data Mining.
6) Jie Huang, Yu Meng, and Margaret H. Dunham, “Extensible Markov Model,” Proceedings IEEE ICDM Conference, November 2004, pp 371-374.
7) Charlie Isaksson, Yu Meng, and Margaret H. Dunham, “Risk Leveling of Network Traffic Anomalies,” International Journal of Computer Science and Network Security, Vol 6, No 6, June 2006, pp 258-265.
8) Charlie Isaksson and Margaret H. Dunham, “A Comparative Study of Outlier Detection,” July 2009, Proceedings of the IEEE MLDM Conference, pp 440-453.
9) Mallik Kotamarti, Douglas W. Raiford, M. L. Raymer, and Margaret H. Dunham, “A Data Mining Approach to Predicting Phylum for Microbial Organisms Using Genome-Wide Sequence Data,” Proceedings of the IEEE Ninth International Conference on Bioinformatics and Bioengineering, pp 161-167, June 22-24 2009.
10) Yu Meng and Margaret H. Dunham, “Efficient Mining of Emerging Events in a Dynamic Spatiotemporal,” Proceedings of the IEEE PAKDD Conference, April 2006, Singapore. (Also in Lecture Notes in Computer Science, Vol 3918, 2006, Springer Berlin/Heidelberg, pp 750-754.)
11) Yu Meng and Margaret H. Dunham, “Mining Developing Trends of Dynamic Spatiotemporal Data Streams,” Journal of Computers, Vol 1, No 3, June 2006, pp 43-50.
12) MIT Lincoln Laboratory, DARPA Intrusion Detection Evaluation, http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/index.html, (2008).
13) M. Spiliopoulou, I. Ntoutsi, Y. Theodoridis, and R. Schult, “MONIC: Modeling and monitoring cluster transitions,” in Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, Philadelphia, PA, USA, pages 706-711, 2006.