# The Magnificent EMM - Southern Methodist University


3/11/10, BYU


The Magnificent EMM

Margaret H. Dunham, Michael Hahsler, Mallik Kotamarti, Charlie Isaksson

CSE Department
Southern Methodist University
Dallas, Texas 75275

lyle.smu.edu/~mhd

mhd@lyle.smu.edu

This material is based upon work supported by the National Science Foundation under Grant No. IIS-0948893.

Objectives/Outline

EMM Overview

EMM + Stream Clustering

EMM + Bioinformatics


Objectives/Outline

EMM Overview

Why

What

How

EMM + Stream Clustering

EMM + Bioinformatics


Lots of Questions

Why don’t data miners practice what they preach?

Why is training usually viewed as a one time thing?

Why do we usually ignore the temporal aspect of data streams?


Continuous Learning

Interleave learning & application

Online clustering


MM

A first-order Markov chain is a finite or countably infinite sequence of events $\{E_1, E_2, \ldots\}$ over discrete time points, where $P_{ij} = P(E_j \mid E_i)$, and at any time the future behavior of the process is based solely on the current state.

A Markov Model (MM) is a graph with $m$ vertices or states, $S$, and directed arcs, $A$, such that:

$S = \{N_1, N_2, \ldots, N_m\}$, and

$A = \{L_{ij} \mid i \in 1, 2, \ldots, m,\; j \in 1, 2, \ldots, m\}$, and each arc $L_{ij} = \langle N_i, N_j \rangle$ is labeled with a transition probability $P_{ij} = P(N_j \mid N_i)$.
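As a minimal sketch of the definition above, the arcs of a small Markov model can be stored as a nested dictionary of transition probabilities. The three states, their probabilities, and the helper names `rows_are_stochastic` and `path_probability` are all invented for illustration. Under the first-order assumption, the probability of a path is simply the product of its arc probabilities.

```python
import math

# Hypothetical 3-state model: P[i][j] = P(N_j | N_i). Values are illustrative only.
P = {
    "N1": {"N1": 0.5, "N2": 0.5},
    "N2": {"N3": 1.0},
    "N3": {"N1": 0.25, "N3": 0.75},
}


def rows_are_stochastic(model):
    """Check that the outgoing probabilities of every state sum to 1."""
    return all(math.isclose(sum(arcs.values()), 1.0) for arcs in model.values())


def path_probability(model, path):
    """Probability of following `path`, given that the process starts in path[0].

    Each step depends only on the current state, so the result is the
    product of the arc probabilities. A missing arc counts as probability 0.
    """
    prob = 1.0
    for current, nxt in zip(path, path[1:]):
        prob *= model.get(current, {}).get(nxt, 0.0)
    return prob
```

For example, `path_probability(P, ["N1", "N2", "N3", "N3"])` multiplies the three arcs N1 to N2, N2 to N3, and N3 to N3.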


Problem with Markov Chains

The required structure of the MC may not be certain at model construction time.

As the real world being modeled by the MC changes, so should the structure of the MC.

Not scalable: grows linearly with the number of events.

Our solution:

Extensible Markov Model (EMM)

Cluster real world events

Allow the Markov chain to grow and shrink dynamically


EMM (Extensible Markov Model)

Time Varying Discrete First Order Markov Model

Continuously evolves

Nodes are clusters of real world states.

Learning continues during prediction phase.

Learning:

Transition probabilities between nodes

Node labels (centroid of cluster)

Nodes are added and removed as data arrives


EMM Definition

Extensible Markov Model (EMM): at any time t, an EMM consists of an MC with a designated current node, $N_n$, and algorithms to modify it, where the algorithms include:

EMMCluster, which defines a technique for matching between input data at time t + 1 and existing states in the MC at time t.

EMMIncrement algorithm, which updates the MC at time t + 1 given the MC at time t and the clustering measure result at time t + 1.

EMMDecrement algorithm, which removes nodes from the EMM when needed.


EMM Cluster

Nearest Neighbor

If none “close”, create new node

Labeling of cluster is centroid of members in cluster

O(n), where n is the number of states
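The nearest-neighbor matching step can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the Euclidean distance, the fixed `threshold` that decides what counts as "close", and the class name `NearestNeighborClusters` are all choices made here for the example.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class NearestNeighborClusters:
    """Threshold-based nearest-neighbor clustering with centroid labels."""

    def __init__(self, threshold):
        self.threshold = threshold  # assumed cutoff for "close"
        self.centroids = []         # one centroid per node
        self.sizes = []             # number of members per node

    def assign(self, vector):
        """Return the node index for `vector`, creating a node if none is close."""
        best, best_dist = None, math.inf
        for idx, centroid in enumerate(self.centroids):  # O(n) scan over n nodes
            dist = euclidean(vector, centroid)
            if dist < best_dist:
                best, best_dist = idx, dist
        if best is None or best_dist > self.threshold:
            self.centroids.append([float(v) for v in vector])
            self.sizes.append(1)
            return len(self.centroids) - 1
        # Move the centroid to the running mean of the cluster members.
        n = self.sizes[best] + 1
        self.centroids[best] = [
            c + (v - c) / n for c, v in zip(self.centroids[best], vector)
        ]
        self.sizes[best] = n
        return best
```

The linear scan over existing centroids is what gives the O(n) cost per input, with n the current number of states.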


EMM Increment

[Figure: EMM Increment example. Input vectors <18,10,3,3,1,0,0>, <17,10,2,3,1,0,0>, <16,9,2,3,1,0,0>, <14,8,2,3,1,0,0>, <14,8,2,3,0,0,0>, <18,10,3,3,1,1,0>; successive snapshots of the EMM show nodes N1, N2, N3 with arcs labeled by transition probabilities such as 1/3, 2/3, 1/2 and 1/1.]
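A hedged sketch of the bookkeeping behind an increment step: each time the stream moves from the current node to the next one, a node count and a link count are updated, and transition probabilities are read off as ratios of those counts. The class name `TransitionCounts` and its fields are hypothetical, and the sketch assumes that clustering has already mapped each input vector to a node id.

```python
from collections import defaultdict


class TransitionCounts:
    """Online first-order transition counts over node ids."""

    def __init__(self):
        self.current = None                  # designated current node
        self.node_counts = defaultdict(int)  # times each node was left
        self.link_counts = defaultdict(int)  # (i, j) -> times j followed i

    def increment(self, node):
        """Record that the stream moved from the current node to `node`."""
        if self.current is not None:
            self.node_counts[self.current] += 1
            self.link_counts[(self.current, node)] += 1
        self.current = node

    def probability(self, i, j):
        """Estimate P(N_j | N_i) as the link count over the outgoing count of N_i."""
        total = self.node_counts.get(i, 0)
        return self.link_counts.get((i, j), 0) / total if total else 0.0
```

Feeding the node sequence N1, N2, N1, N1, N1 into `increment` gives N1 three outgoing transitions, two of them back to N1 and one to N2.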


EMMDecrement

[Figure: EMMDecrement example. An EMM with nodes N1, N2, N3, N5, N6 and arcs labeled by transition probabilities, shown before and after the operation "Delete N2".]


Dynamic

Use of clustering

Learns rare events

Scalable:

Growth of EMM is not linear in the size of data.

Hierarchical feature of EMM

Creation/evaluation in quasi-real time

Distributed / Hierarchical extensions


EMM Sublinear Growth

Servent Data


Growth Rate Automobile Traffic

Minnesota Traffic Data

EMM River Prediction


Determining Rare Event

Occurrence Frequency ($OF_i$) of an EMM state $S_i$ is the normalized count of the state.

Normalized Transition Probability ($NTP_{mn}$), from one state, $S_m$, to another, $S_n$, is a normalized transition count.
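The two measures can be sketched in a few lines. The normalization used here is an assumption made for illustration: both the state count and the transition count are divided by the total count over all states. The function names and the sample counts are hypothetical.

```python
def occurrence_frequency(node_counts, state):
    """Count of `state` divided by the total count over all states (assumed normalization)."""
    total = sum(node_counts.values())
    return node_counts.get(state, 0) / total if total else 0.0


def normalized_transition_probability(node_counts, link_counts, m, n):
    """Count of the transition m -> n divided by the total state count (assumed normalization)."""
    total = sum(node_counts.values())
    return link_counts.get((m, n), 0) / total if total else 0.0


# Illustrative counts only.
node_counts = {"S1": 6, "S2": 3, "S3": 1}
link_counts = {("S1", "S2"): 2, ("S2", "S3"): 1}
```

Under this reading, a state or transition with a low value relative to the others is a candidate rare event.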

EMM Rare Event Detection


Intrusion Data: Train DARPA 1999, Test DARPA 2000

Ozone Data: UCI ML, Jaccard similarity, 2536 instances, 73 attributes, 73 ozone days

Objectives/Outline

EMM Overview

EMM + Stream Clustering

Handle evolving clusters

Incorporate time in clustering

EMM + Bioinformatics


Stream Data

A growing number of applications generate streams of data.

Computer network monitoring data

Call detail records in telecommunications

Highway transportation traffic data

Online web purchase log records

Sensor network data

Stock exchange, transactions in retail chains, ATM operations in banks, credit card transactions.

Clustering techniques play a key role in modeling and analyzing this data.


Stream Data Format

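Events arriving in a stream

At any time, t, we can view the state of the problem as represented by a vector of n numeric values:

$V_t = \langle S_{1t}, S_{2t}, \ldots, S_{nt} \rangle$

| | $V_1$ | $V_2$ | … | $V_q$ |
|---|---|---|---|---|
| $S_1$ | $S_{11}$ | $S_{12}$ | … | $S_{1q}$ |
| $S_2$ | $S_{21}$ | $S_{22}$ | … | $S_{2q}$ |
| … | … | … | … | … |
| $S_n$ | $S_{n1}$ | $S_{n2}$ | … | $S_{nq}$ |

Time advances from $V_1$ to $V_q$.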


TRAC-DS (Temporal Relationship Among Clusters for Data Streams)


Motivation

Temporal ordering is a major feature of stream data.

Many stream applications depend on this ordering:

Prediction of future values

Anomaly (rare event) detection

Concept drift


Stream Clustering Requirements

Dynamic updating of the clusters

Completely online

Identify outliers

Identify concept drifts

Barbara [2]:

compactness

fast

incremental processing


Data Stream Clustering

At each point in time a data stream clustering ζ is a partitioning of $D'$, the data seen thus far.

Instead of the whole partitions $C_1, C_2, \ldots, C_k$ only synopses $C_{c1}, C_{c2}, \ldots, C_{ck}$ are available and $k$ is allowed to change over time.

The summaries $C_{ci}$ with $i = 1, 2, \ldots, k$ typically contain information about the size, distribution and location of the data points in $C_i$.
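One common way to keep such a summary, shown here as an assumption rather than as the synopsis format used in this work, is a cluster-feature style triple: the number of points, their per-dimension sum, and their per-dimension sum of squares. The class name `ClusterSynopsis` is hypothetical. Size, location (centroid) and spread (variance) can all be derived from the triple without storing the points.

```python
class ClusterSynopsis:
    """Summary of a cluster: count, linear sum and squared sum per dimension."""

    def __init__(self, dim):
        self.n = 0
        self.linear_sum = [0.0] * dim
        self.squared_sum = [0.0] * dim

    def add(self, point):
        """Absorb one data point; the point itself is not stored."""
        self.n += 1
        for d, value in enumerate(point):
            self.linear_sum[d] += value
            self.squared_sum[d] += value * value

    def centroid(self):
        """Location of the cluster: the per-dimension mean."""
        return [s / self.n for s in self.linear_sum]

    def variance(self):
        """Spread of the cluster: per-dimension population variance."""
        return [
            sq / self.n - (s / self.n) ** 2
            for s, sq in zip(self.linear_sum, self.squared_sum)
        ]
```

Because the three components are additive, two such summaries can be merged by adding them component-wise, which is convenient when clusters are merged online.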


TRAC-DS NOTE

TRAC-DS is not:

Another stream clustering algorithm

TRAC-DS is:

A new way of looking at clustering

Built on top of an existing clustering algorithm

TRAC-DS may be used with any stream clustering algorithm


TRAC-DS Overview


TRAC-DS Definition

Given a data stream clustering ζ, a temporal relationship among clusters (TRAC-DS) overlays a data stream clustering ζ with an EMM M, in such a way that the following are satisfied:

(1) There is a one-to-one correspondence between the clusters in ζ and the states S in M.

(2) A transition $a_{ij}$ in the EMM M represents the probability that, given a data point in cluster i, the next data point in the data stream will belong to cluster j, with i, j = 1, 2, ..., k.

(3) The EMM M is created online together with the data stream clustering.


Stream Clustering Operations *

$q_{\text{assign point}}(\zeta, x)$: Assigns the new data point x to an existing cluster.

$q_{\text{new cluster}}(\zeta, x)$: Creates a new cluster.

$q_{\text{remove cluster}}(\zeta, x)$: Removes a cluster. Here x is the cluster, i, to be removed. In this case the associated summary $C_{ci}$ is removed from ζ and k is decremented by one.

$q_{\text{merge clusters}}(\zeta, x)$: Merges two clusters.

clusters(
ζ,x

$q_{\text{split clusters}}(\zeta, x)$: Splits a cluster.

* Inspired by MONIC [13]


TRAC-DS Operations

$r_{\text{assign point}}(M, s_c, y)$: Assigns the new data point to the state representing an existing cluster.

$r_{\text{new cluster}}(M, s_c, y)$: Creates a state for a new cluster.

$r_{\text{remove cluster}}(M, s_c, y)$: Removes a state.

$r_{\text{merge clusters}}(M, s_c, y)$: Merges two states.

clusters(M, sc, y probabilities using an exponential decay $f(t) = 2^{-\lambda t}$

$r_{\text{split clusters}}(M, s_c, y)$: Splits states. Y clustering operations.
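The exponential decay $f(t) = 2^{-\lambda t}$ mentioned in the operations above can be illustrated with a small sketch. It assumes that the decay is applied to transition counts and that one call ages every count by `steps` time units; the function names `fade` and `transition_probability` are hypothetical and not taken from the TRAC-DS paper.

```python
def fade(link_counts, lam, steps=1):
    """Scale every transition count by 2 ** (-lam * steps)."""
    factor = 2.0 ** (-lam * steps)
    return {arc: count * factor for arc, count in link_counts.items()}


def transition_probability(link_counts, i, j):
    """P(j | i) from (possibly faded) counts of arcs leaving i."""
    total = sum(c for (src, _), c in link_counts.items() if src == i)
    return link_counts.get((i, j), 0.0) / total if total else 0.0


# Illustrative counts: A -> B and A -> C were each seen four times.
counts = {("A", "B"): 4.0, ("A", "C"): 4.0}
faded = fade(counts, lam=1.0)  # one time unit with lambda = 1 halves each count
faded[("A", "B")] += 1.0       # a fresh A -> B observation arrives
```

Fading all counts by the same factor leaves the probabilities unchanged on its own; the effect shows once new observations arrive, because they then outweigh older ones. With $\lambda = 1$, a count loses half its weight per time unit.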


TRAC-DS Example


Objectives/Outline

EMM Overview

EMM + Stream Clustering

EMM + Bioinformatics

Background

Preprocessing

Classification

Differentiation


DNA

Basic building blocks of organisms

Located in nucleus of cells

Composed of 4 nucleotides

Two strands bound together


http://www.visionlearning.com/library/module_viewer.php?mid=63

Central Dogma: DNA -> RNA -> Protein


[Figure: DNA (CCTGAGCCAACTATTGATGAA) is transcribed into RNA (CCUGAGCCAACUAUUGAUGAA), which is translated into the amino acids of a protein. Source: www.bioalgorithms.info; chapter 6; Gene Prediction]
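At the level of strings, the transcription step in the figure amounts to replacing T with U. The sketch below models only that substitution on the strand shown in the slide; the function name `transcribe` is hypothetical.

```python
def transcribe(dna):
    """Return the RNA string for a DNA string by replacing T with U."""
    return dna.upper().replace("T", "U")


rna = transcribe("CCTGAGCCAACTATTGATGAA")
```

Applied to the DNA string from the figure, this yields the RNA string shown next to it.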


RNA

Ribonucleic Acid

Contains A, C, G but U (Uracil) instead of T

Single stranded

May fold back on itself

Needed to create proteins

Moves around cells; can act like a messenger

mRNA moves out of nucleus to other parts of cell

The Magical 16s

Ribosomal RNA (rRNA) is at the heart of the protein creation process

16S rRNA

In all living organisms

Important in the classification of organisms into phyla and class

PROBLEM: An organism may actually contain many different copies of 16S, each slightly different.

OUR WORK: Can we use EMM to quantify this diversity? Can we use it to classify different species of the same genus?


Using EMM with RNA Data

acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga

| Moving Window | A | C | G | T |
|---|---|---|---|---|
| Pos 0-8 | 2 | 3 | 3 | 1 |
| Pos 1-9 | 1 | 3 | 3 | 2 |
| … | … | … | … | … |
| Pos 34-42 | 2 | 4 | 2 | 1 |

Construct EMM with nodes representing clusters of count vectors
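The moving-window counts above can be reproduced with a short sketch. The window width of 9 is inferred from the position ranges (0-8, 1-9, 34-42), and the function name `window_counts` is hypothetical.

```python
def window_counts(sequence, width=9, alphabet="ACGT"):
    """Count each base in every window of `width`, sliding one position at a time."""
    seq = sequence.upper()
    return [
        [seq[start:start + width].count(base) for base in alphabet]
        for start in range(len(seq) - width + 1)
    ]


rows = window_counts("acgtgcacgtaactgattccggaaccaaatgtgcccacgtcga")
```

Each row is one count vector in A, C, G, T order; these vectors are the inputs that would be clustered into EMM nodes.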


EMM for Classification


TRAC-DS and Bioinformatics

Efficient

Alignment free sequence analysis

Clustering reduces size of model

Flexible

Any sequence

Applicability to Metagenomics

Scoring based on similarity between EMMs or EMM and input sequence

Applications

Classification

Differentiation


Profile EMMs for Organism Classification


Profile EMM

E. coli


Differentiating Strains

Is it possible to identify different species of the same genus?

Initial test with EMM:

Bacillus has 21 species

Construct EMM for each species using training set (64%)

Test by matching unknown strains (36%) and placing each in the closest EMM

All unknown strains correctly classified except one: accuracy of 95%


Bibliography

1) C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, “A framework for clustering evolving data streams,” Proceedings of the International Conference on Very Large Data Bases (VLDB), pp 81-92, 2003.

2) D. Barbara, “Requirements for clustering data streams,” SIGKDD Explorations, Vol 3, No 2, pp 23-27, 2002.

3) Margaret H. Dunham, Donya Quick, Yuhang, “Visualization of DNA/RNA Structure using Temporal CGRs,” Proceedings of the IEEE 6th Symposium on Bioinformatics & Bioengineering (BIBE06), October 16-18, 2006, Washington D.C., pp 171-178.

4) S. Guha, A. Meyerson, N. Mishra, R. Motwani, and L. O'Callaghan, “Clustering data streams: Theory and practice,” IEEE Transactions on Knowledge and Data Engineering, Vol 15, No 3, pp 515-528, 2003.

5) Michael Hahsler and Margaret H. Dunham, “TRACDS: Temporal Relationship Among Clusters for Data Streams,” October 2009, submitted to SIAM International Conference on Data Mining.

6) Jie Huang, Yu Meng, and Margaret H. Dunham, “Extensible Markov Model,” Proceedings IEEE ICDM Conference, November 2004, pp 371-374.

7) Charlie Isaksson, Yu Meng, and Margaret H. Dunham, “Risk Leveling of Network Traffic Anomalies,” International Journal of Computer Science and Network Security, Vol 6, No 6, June 2006, pp 258-265.

8) Charlie Isaksson and Margaret H. Dunham, “A Comparative Study of Outlier Detection,” July 2009, Proceedings of the IEEE MLDM Conference, pp 440-453.

9) Mallik Kotamarti, Douglas W. Raiford, M. L. Raymer, and Margaret H. Dunham, “A Data Mining Approach to Predicting Phylum for Microbial Organisms Using Genome-Wide Sequence Data,” Proceedings of the IEEE Ninth International Conference on Bioinformatics and Bioengineering, pp 161-167, June 22-24 2009.

10) Yu Meng and Margaret H. Dunham, “Efficient Mining of Emerging Events in a Dynamic Spatiotemporal,” Proceedings of the IEEE PAKDD Conference, April 2006, Singapore. (Also in Lecture Notes in Computer Science, Vol 3918, 2006, Springer Berlin/Heidelberg, pp 750-754.)

11) Yu Meng and Margaret H. Dunham, “Mining Developing Trends of Dynamic Spatiotemporal Data Streams,” Journal of Computers, Vol 1, No 3, June 2006, pp 43-50.

12) MIT Lincoln Laboratory: DARPA Intrusion Detection Evaluation. http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/index.html, (2008)

13) M. Spiliopoulou, I. Ntoutsi, Y. Theodoridis, and R. Schult, “MONIC: Modeling and monitoring cluster transitions,” In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, Philadelphia, PA, USA, pages 706-711, 2006.
