So Hirai
The University of Tokyo
Currently NTT DATA Corp.
Kenji Yamanishi
The University of Tokyo
WITMSE 2012, Amsterdam, Netherlands
Presented at KDD 2012 on Aug. 13.
Contents
Problem Setting
Significance
Proposed Algorithm: Sequential Dynamic Model Selection with NML (normalized maximum likelihood) coding
How to compute the NML coding for Gaussian mixtures
Experimental Results
Marketing Applications
Conclusion
Problem Setting (1/2)
Clustering change detection
[Figure: clustering structures along a time axis, with two change points marked "Change"]
Tracking changes of clustering structures in a sequential setting to detect novelty in data
Ex. Market analysis
The structure of customer groups changes over time
Detect changes in the number of clusters as well as in their assignments
Problem Setting (2/2)
Examples of clustering structure changes
[Figure: customers A to F grouped into clusters in several panels; two kinds of change labeled α and β]
Existing customers change their patterns
New customers emerge to form a new group
There exist various types of clustering structures
Related works
Evolutionary clustering [Chakrabarti et al., 2006]
Hypothesis testing approach [Song and Wang, 2005]
Kalman filter approach [Krempl et al., 2011]
GraphScope [Sun et al., 2007]
Variational Bayes approach [Sato, 2001]
Clustering change detection issue
Significance
A novel clustering change detection algorithm
Key ideas:
・Sequential dynamic model selection (sequential DMS)
・NML (normalized maximum likelihood) code-length as the criterion
  ... First formulae for NML for Gaussian mixture models
Empirical demonstration of its superiority over existing methods
  Shown using artificial data sets
Demonstration of its validity in market analysis
  Shown using real beer consumption data sets
Sequential Dynamic Model Selection Algorithm
Proposed Alg. – background of DMS –
Dynamic Model Selection (DMS) [Yamanishi and Maruyama, 2007]
Batch DMS criterion :
  Total code-length = (code-length of data seq.) + (code-length of model seq.)
  Minimum w.r.t. the model sequence
~ Extension of the MDL (Minimum Description Length) principle [Rissanen, 1978] into model "sequence" selection
Proposed Alg. – Sequential DMS –
Sequential dynamic model selection (SDMS) Alg.
At each time t, given the data, sequentially select K_t, Z_t for clustering
Sequential variant of the DMS criterion [Yamanishi and Maruyama, 2007] :
  (code-length for data clustering ~ NML (normalized maximum likelihood) coding)
  + (code-length for the transition of the clustering structure)
  Minimum w.r.t. K_t, Z_t
s.t.
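The selection step above can be sketched as follows. This is only an illustration under stated assumptions: `sdms_step`, `data_code_length`, `transition_code_length`, and `k_max` are hypothetical names, the data term is assumed to already include fitting the clustering (for example by EM) for a given number of clusters, and candidates are limited to the neighbors of the previous K.

```python
from typing import Callable


def sdms_step(
    prev_k: int,
    data_code_length: Callable[[int], float],
    transition_code_length: Callable[[int, int], float],
    k_max: int,
) -> int:
    """Pick K_t among the neighbors of K_{t-1} by minimum total code-length.

    data_code_length(k): hypothetical callable returning the code-length of
        the current data when clustered with k clusters.
    transition_code_length(prev_k, k): hypothetical callable returning the
        code-length of moving from prev_k to k clusters.
    """
    candidates = [k for k in (prev_k - 1, prev_k, prev_k + 1) if 1 <= k <= k_max]
    return min(
        candidates,
        key=lambda k: data_code_length(k) + transition_code_length(prev_k, k),
    )


if __name__ == "__main__":
    # Toy code-lengths: the data favor 3 clusters, and any change costs 0.5.
    toy_data = lambda k: float((k - 3) ** 2)
    toy_transition = lambda a, b: 0.0 if a == b else 0.5
    print(sdms_step(2, toy_data, toy_transition, k_max=5))
```

Keeping the two terms as separate callables mirrors the two-part structure of the criterion, so either term can be swapped out without touching the selection logic.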
Proposed Alg. – model transition –
Consider three patterns of clustering changes
Run EM alg. with initial values below:
Case 1 : # of clusters does not change
  Initial parameter values remain the same
Case 2 : # of clusters decreases (e.g., merging)
  Assign data in a certain cluster to other ones randomly
Case 3 : # of clusters increases (e.g., splitting)
  Set data to a new cluster randomly
[Figure: illustrations of Case 2 and Case 3]
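A minimal sketch of how the three cases might produce initial cluster labels for EM. The slides do not say which cluster is dissolved in Case 2 or how many points move in Case 3, so both choices here are assumptions: the dissolved cluster is drawn uniformly, and each point joins the new cluster with probability 1/k_new. The function name `initial_labels` is hypothetical.

```python
import random
from typing import List


def initial_labels(labels: List[int], k_prev: int, k_new: int,
                   rng: random.Random) -> List[int]:
    """Initial labels (0..k_new-1) for EM after a transition k_prev -> k_new."""
    if k_new < 1:
        raise ValueError("need at least one cluster")
    if k_new == k_prev:
        # Case 1: the number of clusters is unchanged, keep the assignment.
        return list(labels)
    if k_new == k_prev - 1:
        # Case 2: dissolve one cluster (assumption: chosen uniformly at random)
        # and scatter its points over the remaining clusters.
        dropped = rng.randrange(k_prev)
        others = [c for c in range(k_prev) if c != dropped]
        out = []
        for z in labels:
            if z == dropped:
                z = rng.choice(others)
            # Shift labels above the dropped one so they stay contiguous.
            out.append(z - 1 if z > dropped else z)
        return out
    if k_new == k_prev + 1:
        # Case 3: open a new cluster (assumption: each point moves to it
        # with probability 1 / k_new).
        return [k_prev if rng.random() < 1.0 / k_new else z for z in labels]
    raise ValueError("only transitions to neighboring K are supported")


if __name__ == "__main__":
    rng = random.Random(0)
    old = [0, 0, 1, 1, 2, 2]
    print(initial_labels(old, 3, 2, rng))
    print(initial_labels(old, 3, 4, rng))
```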
Proposed Alg. – code-length for transition –
Model transition probability distribution
  Suppose K transits to neighbors only
  Employ the Krichevsky-Trofimov (KT) estimate [Krichevsky and Trofimov, 1981]
Code-length of the model transition
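The transition formula itself did not survive in these slides, so the following is a sketch under assumptions rather than the authors' exact definition. It assumes K stays put with probability 1 - α and moves to a valid neighbor with the remaining mass α split equally, and it estimates α with the standard KT rule (m + 1/2) / (n + 1) from m changes seen in n past transitions. All function and parameter names are hypothetical.

```python
import math


def kt_estimate(n_changes: int, n_transitions: int) -> float:
    """Krichevsky-Trofimov estimate of the change probability alpha."""
    if not 0 <= n_changes <= n_transitions:
        raise ValueError("need 0 <= n_changes <= n_transitions")
    return (n_changes + 0.5) / (n_transitions + 1.0)


def transition_code_length(k_prev: int, k_new: int, n_changes: int,
                           n_transitions: int, k_max: int) -> float:
    """Code-length in bits of moving from k_prev to k_new clusters."""
    alpha = kt_estimate(n_changes, n_transitions)
    if k_new == k_prev:
        p = 1.0 - alpha
    elif abs(k_new - k_prev) == 1 and 1 <= k_new <= k_max:
        # Assumption: alpha is shared equally among the valid neighbors.
        n_neighbors = sum(1 for k in (k_prev - 1, k_prev + 1) if 1 <= k <= k_max)
        p = alpha / n_neighbors
    else:
        return math.inf  # non-neighbor transitions are not allowed
    return -math.log2(p)


if __name__ == "__main__":
    # With no history, alpha = 0.5: staying costs 1 bit, a neighbor move 2 bits.
    print(transition_code_length(2, 2, 0, 0, k_max=5))
    print(transition_code_length(2, 3, 0, 0, k_max=5))
```

As more unchanged steps accumulate, the estimate of α shrinks, so staying becomes cheaper and a change must be paid for by a larger saving in the data code-length.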
How to compute the NML code-length for Gaussian mixtures
Criteria – NML code-length –
Model (Gaussian mixture model) :
NML (normalized maximum likelihood) code-length :
  Shortest code-length in the sense of the minimax criterion [Shtarkov, 1987]
  The code-length includes a normalization term
For Continuous Data
Normalization term
In case of , the data ranges over all domains
Problem:
  NML for Gaussian distribution
    Normalization term diverges
  NML for mixture distribution
    Normalization term is computationally intractable
    This comes from combinatorial difficulties
For Continuous Data (Example)
For the one-dimensional Gaussian distribution (σ² is given)
Normalization term
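The divergence in this example can be worked out explicitly. The derivation below is a standard one and not a reconstruction of the lost slide formula; the symbols C_n, g, and the bound R are introduced here for illustration only. It uses the fact that the sample mean is a sufficient statistic with a Gaussian density.

```latex
% One-dimensional Gaussian with known variance \sigma^2; x^n = (x_1, \dots, x_n).
\begin{align*}
L_{\mathrm{NML}}(x^n) &= -\log f\bigl(x^n; \hat{\mu}(x^n)\bigr) + \log C_n,
\qquad
C_n = \int f\bigl(y^n; \hat{\mu}(y^n)\bigr)\, dy^n, \\
\hat{\mu}(x^n) &= \frac{1}{n}\sum_{i=1}^{n} x_i \sim \mathcal{N}\!\bigl(\mu, \sigma^2/n\bigr)
\;\Longrightarrow\;
g(\hat{\mu}; \mu) = \sqrt{\frac{n}{2\pi\sigma^2}}\,
\exp\!\Bigl(-\frac{n(\hat{\mu}-\mu)^2}{2\sigma^2}\Bigr), \\
C_n &= \int_{-\infty}^{\infty} g(\hat{\mu}; \hat{\mu})\, d\hat{\mu}
     = \int_{-\infty}^{\infty} \sqrt{\frac{n}{2\pi\sigma^2}}\, d\hat{\mu} = \infty, \\
C_n(R) &= \int_{-R}^{R} \sqrt{\frac{n}{2\pi\sigma^2}}\, d\hat{\mu}
        = 2R\,\sqrt{\frac{n}{2\pi\sigma^2}} < \infty .
\end{align*}
```

The last line shows why restricting the data so that the MLE lies in a bounded range makes the normalization term finite, at the price of a dependence on the bound R.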
Approximate computation (1/2)
Use sufficient statistics
g_1 : Gaussian distribution
g_2 : Wishart distribution
Criteria – NML for GMM –
Efficiently computing an approximate variant of the NML code-length for a GMM [Hirai and Yamanishi, 2011]
Restrict the range of data so that the MLE lies in a bounded range specified by a parameter
  The normalization term does not diverge
  But it still depends strongly on the parameters :
NML
The normalization term is calculated as follows :
where, : number of data, : dim. of data
Criteria – RNML code-length –
Modify NML to develop the re-normalized maximum likelihood coding (RNML) [Rissanen, Roos, Myllymaki 2010] [Hirai and Yamanishi, 2012]
Re-normalize around the MLE of the parameter by restricting the range of data
Less dependent on hyper-parameters
Criteria – RNML code-length –
RNML code-length
Theorem [Hirai and Yamanishi 2012]
The RNML code-length for GMM is calculated as follows :
Definition
Problem
Computing , costs .
Criteria – efficient computing of RNML –
Straightforward computation of RNML requires time
⇒ But we can compute it efficiently
1) Theorem [Kontkanen and Myllymaki, 07]
Can compute the normalization term in for "mixture" models
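The formula and complexity on this slide are not recoverable, so as a related illustration only, here is the well-known linear-time recurrence of Kontkanen and Myllymaki (2007) for the NML normalizer of a K-valued multinomial with n samples: C_{K+2}(n) = C_{K+1}(n) + (n / K) C_K(n). This is the discrete multinomial case, not the GMM normalizer from the talk, and the function name is hypothetical.

```python
from math import comb


def multinomial_nml_normalizer(k: int, n: int) -> float:
    """NML normalizer C_k(n) of a k-valued multinomial over n samples."""
    if k < 1 or n < 0:
        raise ValueError("need k >= 1 and n >= 0")
    if n == 0:
        return 1.0  # only the empty sequence, with likelihood 1
    c_prev = 1.0  # C_1(n): a single outcome always has likelihood 1
    if k == 1:
        return c_prev
    # C_2(n) by direct summation over the count h of the first symbol.
    # Python evaluates 0.0 ** 0 as 1.0, which is the convention needed here.
    c_curr = sum(
        comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
        for h in range(n + 1)
    )
    # C_{j+2}(n) = C_{j+1}(n) + (n / j) * C_j(n)
    for j in range(1, k - 1):
        c_prev, c_curr = c_curr, c_curr + (n / j) * c_prev
    return c_curr


if __name__ == "__main__":
    print(multinomial_nml_normalizer(3, 2))
```

For small cases the result can be checked by brute force: with k = 3 and n = 2, three sequences repeat one symbol (likelihood 1 each) and six use two symbols (likelihood 1/4 each), giving 4.5.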
Criteria – efficient computing of RNML –
Straightforward computation of RNML requires time
⇒ But we can compute it efficiently
2) Theorem [Hirai and Yamanishi, 2012]
The normalization term satisfies a recursive formula
Experimental Results
– Artificial Data –
– Market Analysis –
Experimental Results – data generation –
Generate artificial data set according to GMM with
Experimental Results – comparison criteria –
Employ three comparison metrics
AR (accuracy rate) : Average rate of correctly estimating the true number of clusters over all time
IR (identification rate) : Probability of correctly identifying the change-points and the changes themselves
FAR (false alarm rate) : Ratio of the number of false alarms to the number of all detected change-points
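Two of these metrics can be sketched as below. The exact definitions used in the experiments are not given in the slides, so this assumes that a change-point is any time step where the number of clusters differs from the previous step, and that a detection counts as a false alarm unless it falls exactly on a true change-point. IR is left out because its matching rule is not specified. Function names are hypothetical.

```python
from typing import List, Sequence


def accuracy_rate(k_true: Sequence[int], k_est: Sequence[int]) -> float:
    """Fraction of time steps where the estimated # of clusters is correct."""
    if len(k_true) != len(k_est) or len(k_true) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(a == b for a, b in zip(k_true, k_est)) / len(k_true)


def change_points(ks: Sequence[int]) -> List[int]:
    """Time indices t >= 1 at which the number of clusters changes."""
    return [t for t in range(1, len(ks)) if ks[t] != ks[t - 1]]


def false_alarm_rate(k_true: Sequence[int], k_est: Sequence[int]) -> float:
    """Share of detected change-points that are not true change-points."""
    detected = change_points(k_est)
    if not detected:
        return 0.0  # assumption: no detections means no false alarms
    true_cp = set(change_points(k_true))
    return sum(t not in true_cp for t in detected) / len(detected)


if __name__ == "__main__":
    truth = [2, 2, 3, 3, 3]
    estimate = [2, 2, 3, 3, 4]
    print(accuracy_rate(truth, estimate), false_alarm_rate(truth, estimate))
```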
Experimental Results – artificial data –
Our alg. with NML was able to detect true change-points and identify the true # of clusters with higher probability than AIC and BIC
[Figure: Average Number of Clusters Over Time]
AIC : Akaike's information criterion [Akaike, 1974]
BIC : Bayesian information criterion [Schwarz, 1978]

        RNML    AIC     BIC
AR      0.903   0.103   0.135
IR      0.380   0.005   0.020
FAR     0.260   0.020   0.718
Comparison w.r.t. KL-divergence
Evaluated change detection accuracies by varying the Kullback-Leibler divergence (KLD) between the distributions before and after the change points
The larger the KLD between GMMs before and after the change-point was, the more accurately it was detected in terms of IR (identification rate).
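The KLD between two GMMs has no closed form in general, so one common way to obtain it is Monte Carlo estimation. The sketch below does this for one-dimensional mixtures using only the standard library; it is an illustration, not the procedure used in the experiments, and every function name and parameter layout is an assumption.

```python
import math
import random
from typing import Sequence, Tuple

# A mixture is (weights, means, standard deviations), all of equal length.
Gmm = Tuple[Sequence[float], Sequence[float], Sequence[float]]


def gmm_pdf(x: float, gmm: Gmm) -> float:
    """Density of a one-dimensional Gaussian mixture at x."""
    weights, means, stds = gmm
    return sum(
        w * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in zip(weights, means, stds)
    )


def gmm_sample(rng: random.Random, gmm: Gmm) -> float:
    """Draw one sample: pick a component by weight, then sample from it."""
    weights, means, stds = gmm
    i = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[i], stds[i])


def kl_monte_carlo(p: Gmm, q: Gmm, n_samples: int = 10000, seed: int = 0) -> float:
    """Estimate KL(p || q) as the average of log p(x) - log q(x), x drawn from p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = gmm_sample(rng, p)
        total += math.log(gmm_pdf(x, p) / gmm_pdf(x, q))
    return total / n_samples


if __name__ == "__main__":
    before = ([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
    after = ([0.5, 0.5], [-1.0, 3.0], [1.0, 1.0])
    print(kl_monte_carlo(before, after))
```

The estimate is only reliable when q has non-negligible density wherever p places mass; for widely separated mixtures the density of q can underflow to zero and the ratio fails.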
Experimental Results – vs SW Alg. –
SW algorithm : Hypothesis testing of whether clusters are identical or not, followed by splitting, merging, etc. [Song and Wang, 2005]
The sequential DMS with RNML significantly outperformed the SW alg.

            AR      IR      FAR
Proposed    0.988   0.950   0.050
SW-RNML     0.369   0.300   0.503
SW-BIC      0.019   0.000   0.841

Data : size/time = 512
Experimental Results – market analysis –
Data set provided by MACROMILL, Inc.
Clustering customers to detect their structure changes
14 kinds of beer, 3185 users, 78 days

          Beer 1   Beer 2   . . .
User 1    350      700      . . .
User 2    1050     350      . . .
. . .     . . .    . . .    . . .

Our alg. detected clustering changes that corresponded to the year's end demand
The cluster change at the change-point : 1/1,2
Consumption (ml)   cluster 1   cluster 2   cluster 3
Beer-A                   184           0         117
Beer-B                    91           0          95
Premium-A                108           0          80
Premium-B                113           0          43
Beer-C                     0           0         126
Beer-D                     0           0         140
Third-A                   93          41          43
Third-B                    0         198         121
Third-C                    0         303         103
Third-D                    0         120         182
LM-beer-A                  0          75          48
Off-A                      0           0         157
Off-B                      0         114          34
Off-C                      0           0          83
Total Volume             589         852        1373
# Customers              598         376         311

Consumption (ml)   cluster 1   cluster 2   cluster 3   cluster 4   cluster 5
Beer-A                    84           0         131          50         229
Beer-B                   123           0         248           0           0
Premium-A                153           0         174          73           0
Premium-B                176           0         105           0           0
Beer-C                     0           0         146         122           0
Beer-D                     0           0          72         192           0
Third-A                  101         131         130           0           0
Third-B                    0          34         406           0         131
Third-C                    0         107         112          46         236
Third-D                    0         202         431           0           0
LM-beer-A                  0         107          87           0           0
Off-A                      0           0         169         138           0
Off-B                      0         215          74           0           0
Off-C                      0           0          61          83           0
Total Volume             637         796        2348         705         596
# Customers              397         190         123         162         363
Many customers changed their patterns to purchase Beer-A and Third-Beer at the year's end
Conclusion
Proposed the sequential DMS algorithm to address the clustering change detection issue.
Key ideas :
  Sequential dynamic model selection based on the MDL principle
  The use of the NML code-length as the criterion and its efficient computation
It is able to detect cluster changes significantly more accurately than AIC/BIC-based methods and the existing statistical-test-based method on artificial data
Tracking changes of group structures leads to understanding changes of market structures
Why NML?
The shortest code-length in the sense of Shtarkov's minimax criterion [Shtarkov, 1987]
For a given class :
  The criterion is defined with the maximum likelihood estimator
  The minimum is attained by Q = NML distribution
Restrict the range of data
Restrict the range of data for Shtarkov's minimax criterion [Shtarkov, 1987]
For a given class :
Restrict the range of data.
We change Shtarkov's minimax criterion itself
Comparison with non-parametric Bayes
Sequential Dynamic Model Selection works better than non-parametric Bayes (Infinite HMM, etc.)
[Sakurai and Yamanishi, "Comparison of Dynamic Model Selection with Infinite HMM for Statistical Model Change Detection," to appear in ITW 2012]