The Magnificent EMM - Southern Methodist University


3/11/10, BYU

1

The Magnificent EMM



Margaret H. Dunham, Michael Hahsler, Mallik Kotamarti, Charlie Isaksson

CSE Department

Southern Methodist University

Dallas, Texas 75275

lyle.smu.edu/~mhd

mhd@lyle.smu.edu


This material is based upon work supported by the National Science Foundation under Grant No. IIS-0948893.

Objectives/Outline


EMM Overview


EMM + Stream Clustering


EMM + Bioinformatics


Objectives/Outline


EMM Overview


Why


What


How


EMM + Stream Clustering


EMM + Bioinformatics


Lots of Questions


Why don’t data miners practice what they preach?

Why is training usually viewed as a one time thing?

Why do we usually ignore the temporal aspect of data streams?


Continuous Learning

Interleave learning & application

Add time to online clustering


MM

A first order Markov Chain is a finite or countably infinite sequence of events $\{E_1, E_2, \ldots\}$ over discrete time points, where $P_{ij} = P(E_j \mid E_i)$, and at any time the future behavior of the process is based solely on the current state.

A Markov Model (MM) is a graph with m vertices or states, S, and directed arcs, A, such that:

$S = \{N_1, N_2, \ldots, N_m\}$, and

$A = \{L_{ij} \mid i \in \{1, 2, \ldots, m\},\ j \in \{1, 2, \ldots, m\}\}$, and each arc $L_{ij} = \langle N_i, N_j \rangle$ is labeled with a transition probability $P_{ij} = P(N_j \mid N_i)$.
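As a small illustration of the definition, the transition probabilities of a first order model can be estimated from an observed state sequence by counting. This is a sketch only: the function name `estimate_transitions` and the toy sequence are assumptions made for the example, not part of the slides.

```python
from collections import Counter, defaultdict

def estimate_transitions(states):
    """Estimate P_ij = P(N_j | N_i) from consecutive pairs in a state sequence."""
    counts = defaultdict(Counter)
    for cur, nxt in zip(states, states[1:]):
        counts[cur][nxt] += 1          # one more observed arc cur -> nxt
    return {
        i: {j: c / sum(row.values()) for j, c in row.items()}
        for i, row in counts.items()
    }

# Hypothetical sequence of visited states.
P = estimate_transitions(["N1", "N2", "N1", "N2", "N3", "N1"])
print(P["N2"])   # probabilities of leaving N2
```

Each row of `P` sums to one because every count is divided by the total number of transitions leaving that state.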


Problem with Markov Chains


The required structure of the MC may not be certain at model construction time.

As the real world being modeled by the MC changes, so should the structure of the MC.

Not scalable

Grows linearly with the number of events.

Our solution:

Extensible Markov Model (EMM)

Cluster real world events

Allow the Markov chain to grow and shrink dynamically


EMM (Extensible Markov Model)


Time Varying Discrete First Order Markov Model

Continuously evolves

Nodes are clusters of real world states.

Learning continues during prediction phase.

Learning:

Transition probabilities between nodes

Node labels (centroid of cluster)

Nodes are added and removed as data arrives


EMM Definition

Extensible Markov Model (EMM): at any time t, an EMM consists of an MC with a designated current node, $N_n$, and algorithms to modify it, where the algorithms include:

EMMCluster, which defines a technique for matching between input data at time t + 1 and existing states in the MC at time t.

EMMIncrement, which updates the MC at time t + 1 given the MC at time t and the clustering measure result at time t + 1.

EMMDecrement, which removes nodes from the EMM when needed.
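The three algorithms can be pictured with a toy class. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes Euclidean distance, a fixed distance `threshold` for deciding that a node is "close", a running-mean centroid update, and a decrement that simply drops the node and its arcs. The class name `EMM`, its attributes, and the two-dimensional sample vectors are all hypothetical.

```python
import math

class EMM:
    """Toy Extensible Markov Model: nodes are cluster centroids, arcs hold counts."""

    def __init__(self, threshold):
        self.threshold = threshold   # assumed distance cutoff for "close"
        self.centroids = {}          # node id -> centroid vector
        self.sizes = {}              # node id -> number of points absorbed
        self.counts = {}             # (from, to) -> transition count
        self.current = None          # designated current node
        self._next_id = 1

    def cluster(self, x):
        """EMMCluster: nearest node within the threshold, else None (linear scan)."""
        best, best_d = None, self.threshold
        for node, c in self.centroids.items():
            d = math.dist(c, x)
            if d <= best_d:
                best, best_d = node, d
        return best

    def increment(self, x):
        """EMMIncrement: absorb x (creating a node if needed) and count the arc."""
        node = self.cluster(x)
        if node is None:
            node = self._next_id
            self._next_id += 1
            self.centroids[node] = list(x)
            self.sizes[node] = 1
        else:
            n = self.sizes[node] + 1
            old = self.centroids[node]
            self.centroids[node] = [c + (xi - c) / n for c, xi in zip(old, x)]
            self.sizes[node] = n
        if self.current is not None:
            key = (self.current, node)
            self.counts[key] = self.counts.get(key, 0) + 1
        self.current = node
        return node

    def decrement(self, node):
        """EMMDecrement: drop a node together with every arc touching it."""
        self.centroids.pop(node, None)
        self.sizes.pop(node, None)
        self.counts = {k: v for k, v in self.counts.items() if node not in k}
        if self.current == node:
            self.current = None

    def probability(self, i, j):
        """Transition probability i -> j from the stored counts."""
        out = sum(v for (a, _), v in self.counts.items() if a == i)
        return self.counts.get((i, j), 0) / out if out else 0.0

emm = EMM(threshold=2.0)
for v in [(18, 10), (17, 10), (14, 8), (18, 10)]:
    emm.increment(v)
print(emm.probability(1, 2))
```

Learning and application are interleaved here in the sense that every call to `increment` both classifies the new vector and updates the model.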


EMM Cluster


Nearest Neighbor

If none “close”, create new node

Labeling of cluster is centroid of members in cluster

O(n), where n is the number of states


EMM Increment

[Figure: EMM Increment example. Input vectors shown: <18,10,3,3,1,0,0>, <17,10,2,3,1,0,0>, <16,9,2,3,1,0,0>, <14,8,2,3,1,0,0>, <14,8,2,3,0,0,0>, <18,10,3,3,1,1,0>. Successive panels show the EMM growing from a single node N1 to nodes N1, N2 and N3, with the transition probabilities on the arcs updated at each step.]


EMMDecrement

[Figure: EMMDecrement example. An EMM with nodes N1, N2, N3, N5 and N6 is shown before and after the operation "Delete N2", with transition probabilities labeled on the arcs.]


EMM Advantages


Dynamic


Adaptable


Use of clustering


Learns rare events

Scalable:

Growth of EMM is not linear in the size of the data.

Hierarchical feature of EMM

Creation/evaluation quasi-real time

Distributed / Hierarchical extensions


EMM Sublinear Growth

Servent Data


Growth Rate Automobile Traffic

Minnesota Traffic Data

EMM River Prediction


Determining Rare Event


Occurrence Frequency ($OF_i$) of an EMM state $S_i$ is the normalized count of the state.

Normalized Transition Probability ($NTP_{mn}$), from one state, $S_m$, to another, $S_n$, is a normalized transition count.
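A minimal sketch of these two measures, assuming the simplest normalization (each count divided by the total over all states, or over all transitions). The exact normalization used by the authors is not given in the text above, so treat the formulas in the code as an assumption; the state names, counts, and the 0.15 cutoff are purely illustrative.

```python
def occurrence_frequency(node_counts):
    """OF_i under the assumed normalization: count of state i over all state counts."""
    total = sum(node_counts.values())
    return {s: c / total for s, c in node_counts.items()}

def normalized_transition_probability(link_counts):
    """NTP_mn under the assumed normalization: count of arc m -> n over all arc counts."""
    total = sum(link_counts.values())
    return {link: c / total for link, c in link_counts.items()}

# Hypothetical counts collected by an EMM.
of = occurrence_frequency({"S1": 6, "S2": 3, "S3": 1})
ntp = normalized_transition_probability({
    ("S1", "S1"): 2, ("S1", "S2"): 3, ("S2", "S1"): 3,
    ("S1", "S3"): 1, ("S3", "S1"): 1,
})

# Flag states whose occurrence frequency falls below an illustrative cutoff.
rare_states = [s for s, f in of.items() if f < 0.15]
print(rare_states)
```

A state or transition with a low value under either measure is a candidate rare event, since the model has seldom seen it.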


EMM Rare Event Detection


Intrusion Data: Train DARPA 1999, Test DARPA 2000

Ozone Data: UCI ML, Jaccard similarity, 2536 instances, 73 attributes, 73 ozone days

Objectives/Outline


EMM Overview


EMM + Stream Clustering


Handle evolving clusters


Incorporate time in clustering


EMM + Bioinformatics


Stream Data


A growing number of applications generate streams
of data.


Computer network monitoring data


Call detail records in telecommunications


Highway transportation traffic data


Online web purchase log records


Sensor network data


Stock exchange, transactions in retail chains, ATM
operations in banks, credit card transactions.


Clustering techniques play a key role in modeling
and analyzing this data.



Stream Data Format


Events arriving in a stream


At any time, t, we can view the state of the problem as represented by a vector of n numeric values:

$V_t = \langle S_{1t}, S_{2t}, \ldots, S_{nt} \rangle$

        V_1     V_2     ...     V_q
S_1     S_11    S_12    ...     S_1q
S_2     S_21    S_22    ...     S_2q
...     ...     ...     ...     ...
S_n     S_n1    S_n2    ...     S_nq

Time runs across the columns.

Traditional Clustering


TRAC-DS (Temporal Relationship Among Clusters for Data Streams)

Motivation


Temporal ordering is a major feature of stream data.

Many stream applications depend on this ordering


Prediction of future values


Anomaly (rare event) detection


Concept drift


Stream Clustering Requirements


Dynamic updating of the clusters


Completely online


Identify outliers


Identify concept drifts


Barbara [2]:


compactness


fast


incremental processing


Data Stream Clustering


At each point in time a data stream clustering ζ is a partitioning of $D'$, the data seen thus far.

Instead of the whole partitions $C_1, C_2, \ldots, C_k$ only synopses $Cc_1, Cc_2, \ldots, Cc_k$ are available, and k is allowed to change over time.

The summaries $Cc_i$ with $i = 1, 2, \ldots, k$ typically contain information about the size, distribution and location of the data points in $C_i$.
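One common way to hold size, location and distribution without storing the points is to keep a count, a per-dimension sum, and a per-dimension sum of squares. The sketch below assumes that representation; the slides do not say which summary a given clustering algorithm uses, and the class name `ClusterSummary` is hypothetical.

```python
class ClusterSummary:
    """Synopsis of one cluster: size, location and spread, with no stored points."""

    def __init__(self, dim):
        self.n = 0                        # size: number of points absorbed
        self.linear_sum = [0.0] * dim     # per-dimension sum, gives the location
        self.square_sum = [0.0] * dim     # per-dimension sum of squares, gives the spread

    def add(self, x):
        """Absorb one point in O(dim) time and constant memory."""
        self.n += 1
        for d, v in enumerate(x):
            self.linear_sum[d] += v
            self.square_sum[d] += v * v

    def centroid(self):
        return [s / self.n for s in self.linear_sum]

    def variance(self):
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.linear_sum, self.square_sum)]

summary = ClusterSummary(dim=2)
for point in [(1.0, 2.0), (3.0, 4.0)]:
    summary.add(point)
print(summary.n, summary.centroid(), summary.variance())
```

Because the three quantities are additive, two such summaries can be merged by adding their fields, which suits incremental processing.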


TRAC-DS NOTE

TRAC-DS is not:

Another stream clustering algorithm

TRAC-DS is:

A new way of looking at clustering

Built on top of an existing clustering algorithm

TRAC-DS may be used with any stream clustering algorithm


TRAC-DS Overview

TRAC-DS Definition

Given a data stream clustering ζ, a temporal relationship among clusters (TRAC-DS) overlays a data stream clustering ζ with an EMM M, in such a way that the following are satisfied:

(1) There is a one-to-one correspondence between the clusters in ζ and the states S in M.

(2) A transition $a_{ij}$ in the EMM M represents the probability that, given a data point in cluster i, the next data point in the data stream will belong to cluster j, with i, j = 1, 2, ..., k.

(3) The EMM M is created online together with the data stream clustering.
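The overlay idea can be sketched independently of the clustering algorithm: whatever clusterer is in use reports a cluster id for each arriving point, and a separate structure counts the transitions between consecutive ids. The class `TransitionOverlay`, the helper `toy_clusterer`, and the one-dimensional stream are assumptions made for this example only.

```python
from collections import defaultdict

class TransitionOverlay:
    """Keeps one state per cluster id and counts transitions between consecutive points."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))
        self.current = None               # state of the most recent data point

    def observe(self, cluster_id):
        """Record that the next point in the stream fell into cluster_id."""
        if self.current is not None:
            self.counts[self.current][cluster_id] += 1
        self.current = cluster_id

    def a(self, i, j):
        """Estimated probability that a point in cluster i is followed by one in cluster j."""
        row = self.counts.get(i)
        total = sum(row.values()) if row else 0
        return row.get(j, 0) / total if total else 0.0

def toy_clusterer(x):
    """Stand-in for any stream clustering algorithm: two clusters split at 5."""
    return 0 if x < 5 else 1

overlay = TransitionOverlay()
for x in [1, 2, 8, 1, 9, 9]:
    overlay.observe(toy_clusterer(x))     # built online, together with the clustering
print(overlay.a(0, 1), overlay.a(1, 0))
```

Swapping `toy_clusterer` for a real stream clusterer leaves the overlay unchanged, which is the sense in which the temporal layer sits on top of an existing algorithm.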



Stream Clustering Operations *


q_assign_point(ζ, x): Assigns the new data point x to an existing cluster.

q_new_cluster(ζ, x): Creates a new cluster.

q_remove_cluster(ζ, x): Removes a cluster. Here x is the cluster, i, to be removed. In this case the associated summary $Cc_i$ is removed from ζ and k is decremented by one.

q_merge_clusters(ζ, x): Merges two clusters.

q_fade_clusters(ζ, x): Fades the cluster structure.

q_split_clusters(ζ, x): Splits a cluster.

* Inspired by MONIC [13]





TRAC-DS Operations

r_assign_point(M, s_c, y): Assigns the new data point to the state representing an existing cluster.

r_new_cluster(M, s_c, y): Creates a state for a new cluster.

r_remove_cluster(M, s_c, y): Removes a state.

r_merge_clusters(M, s_c, y): Merges two states.

r_fade_clusters(M, s_c, y): Fades the transition probabilities using an exponential decay $f(t) = 2^{-\lambda t}$.

r_split_clusters(M, s_c, y): Splits states. Y clustering operations.
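The fade operation can be illustrated on transition counts. The sketch assumes that fading multiplies every stored count by the decay factor 2^(-λt) for the elapsed time t, so that 1/λ acts as a half-life; the function name `fade`, the parameter `dt`, and the sample counts are hypothetical.

```python
def fade(counts, lam, dt=1.0):
    """Scale every transition count by the decay factor 2 ** (-lam * dt)."""
    factor = 2.0 ** (-lam * dt)
    return {link: c * factor for link, c in counts.items()}

# Hypothetical counts for arcs leaving state A.
counts = {("A", "B"): 4.0, ("A", "C"): 2.0}

faded = fade(counts, lam=1.0, dt=1.0)     # factor 0.5: old evidence is halved
faded[("A", "C")] += 1.0                  # a fresh A -> C transition arrives

p_recent = faded[("A", "C")] / sum(faded.values())
print(p_recent)
```

Fading alone leaves the ratios between existing arcs unchanged; its effect appears when new observations arrive, because they then weigh more than older ones.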

TRAC-DS Example

Objectives/Outline


EMM Overview


EMM + Stream Clustering


EMM + Bioinformatics


Background


Preprocessing


Classification


Differentiation


DNA


Basic building blocks of organisms


Located in nucleus of cells


Composed of 4 nucleotides


Two strands bound together


http://www.visionlearning.com/library/module_viewer.php?mid=63

Central Dogma: DNA -> RNA -> Protein

[Figure: DNA (CCTGAGCCAACTATTGATGAA) is transcribed to RNA (CCUGAGCCAACUAUUGAUGAA), which is translated to protein (amino acids). Source: www.bioalgorithms.info; chapter 6; Gene Prediction]

RNA


Ribonucleic Acid


Contains A, C, G but U (Uracil) instead of T

Single Stranded

May fold back on itself

Needed to create proteins

Move around cells: can act like a messenger

mRNA: moves out of nucleus to other parts of cell

The Magical 16s


Ribosomal RNA (rRNA) is at the heart of the protein creation process

16S rRNA

About 1542 nucleotides in length

In all living organisms

Important in the classification of organisms into phyla and class

PROBLEM: An organism may actually contain many different copies of 16S, each slightly different.

OUR WORK: Can we use EMM to quantify this diversity? Can we use it to classify different species of the same genus?


Using EMM with RNA Data

acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga

Moving Window

              A   C   G   T
Pos 0-8       2   3   3   1
Pos 1-9       1   3   3   2
...
Pos 34-42     2   4   2   1

Construct EMM with nodes representing clusters of count vectors
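The count vectors in the table can be reproduced with a short sliding-window routine. The window width of 9 is inferred from the position ranges shown (0-8, 1-9, ..., 34-42); the function name `window_counts` and its parameters are assumptions made for this sketch.

```python
def window_counts(seq, width=9, alphabet="acgt"):
    """Return one (A, C, G, T) count vector per window position, sliding by one base."""
    seq = seq.lower()
    return [
        tuple(seq[i:i + width].count(ch) for ch in alphabet)
        for i in range(len(seq) - width + 1)
    ]

seq = "acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga"
vectors = window_counts(seq)

print(vectors[0])    # window at positions 0-8
print(vectors[1])    # window at positions 1-9
print(vectors[-1])   # window at positions 34-42
```

Each vector is then treated as one event in the stream, so consecutive windows become consecutive inputs to the EMM. For a sequence written with U instead of T, the `alphabet` argument would be changed accordingly.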


EMM for Classification

TRAC-DS and Bioinformatics


Efficient


Alignment free sequence analysis


Clustering reduces size of model


Flexible


Any sequence


Applicability to Metagenomics

Scoring based on similarity between EMMs or EMM and input sequence


Applications


Classification


Differentiation



Profile EMMs for Organism Classification


Profile EMM


E Coli


Differentiating Strains


Is it possible to identify different species of the same genus?

Initial test with EMM:

Bacillus has 21 species

Construct EMM for each species using training set (64%)

Test by matching unknown strains (36%) and placing each in the closest EMM

All unknown strains correctly classified except one: accuracy of 95%



Bibliography

1) C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, “A framework for clustering evolving data streams,” Proceedings of the International Conference on Very Large Data Bases (VLDB), pp 81-92, 2003.

2) D. Barbara, “Requirements for clustering data streams,” SIGKDD Explorations, Vol 3, No 2, pp 23-27, 2002.

3) Margaret H. Dunham, Donya Quick, Yuhang Wang, Monnie McGee, Jim Waddle, “Visualization of DNA/RNA Structure using Temporal CGRs,” Proceedings of the IEEE 6th Symposium on Bioinformatics & Bioengineering (BIBE06), October 16-18, 2006, Washington D.C., pp 171-178.

4) S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering data streams: Theory and practice,” IEEE Transactions on Knowledge and Data Engineering, Vol 15, No 3, pp 515-528, 2003.

5) Michael Hahsler and Margaret H. Dunham, “TRACDS: Temporal Relationship Among Clusters for Data Streams,” October 2009, submitted to SIAM International Conference on Data Mining.

6) Jie Huang, Yu Meng, and Margaret H. Dunham, “Extensible Markov Model,” Proceedings IEEE ICDM Conference, November 2004, pp 371-374.

7) Charlie Isaksson, Yu Meng, and Margaret H. Dunham, “Risk Leveling of Network Traffic Anomalies,” International Journal of Computer Science and Network Security, Vol 6, No 6, June 2006, pp 258-265.

8) Charlie Isaksson and Margaret H. Dunham, “A Comparative Study of Outlier Detection,” July 2009, Proceedings of the IEEE MLDM Conference, pp 440-453.

9) Mallik Kotamarti, Douglas W. Raiford, M. L. Raymer, and Margaret H. Dunham, “A Data Mining Approach to Predicting Phylum for Microbial Organisms Using Genome-Wide Sequence Data,” Proceedings of the IEEE Ninth International Conference on Bioinformatics and Bioengineering, pp 161-167, June 22-24 2009.

10) Yu Meng and Margaret H. Dunham, “Efficient Mining of Emerging Events in a Dynamic Spatiotemporal,” Proceedings of the IEEE PAKDD Conference, April 2006, Singapore. (Also in Lecture Notes in Computer Science, Vol 3918, 2006, Springer Berlin/Heidelberg, pp 750-754.)

11) Yu Meng and Margaret H. Dunham, “Mining Developing Trends of Dynamic Spatiotemporal Data Streams,” Journal of Computers, Vol 1, No 3, June 2006, pp 43-50.

12) MIT Lincoln Laboratory, DARPA Intrusion Detection Evaluation, http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/index.html, (2008).

13) M. Spiliopoulou, I. Ntoutsi, Y. Theodoridis, and R. Schult, “MONIC: Modeling and monitoring cluster transitions,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Philadelphia, PA, USA, pages 706-711, 2006.

