
So Hirai (The University of Tokyo; currently NTT DATA Corp.)

Kenji Yamanishi (The University of Tokyo)

WITMSE 2012, Amsterdam, Netherlands

Presented at KDD 2012 on Aug. 13.

Contents

- Problem Setting
- Significance
- Proposed Algorithm
  - Sequential Dynamic Model Selection with NML (normalized maximum likelihood) coding
  - How to compute the NML coding for Gaussian mixtures
- Experimental Results
  - Marketing Applications
- Conclusion

Problem Setting (1/2)

[Figure: a data sequence whose clustering structure changes over time]

Clustering change detection: tracking changes of clustering structures in a sequential setting to detect novelty in data.

Example: market analysis. The structure of customer groups changes over time; we want to detect changes of the number of clusters as well as of their assignment.

Problem Setting (2/2)

[Figure: customer groups A-F at successive times, with changes α and β]

Examples of clustering structure changes:
- Existing customers change their patterns.
- New customers emerge to form a new group.

There exist various types of clustering structures.

Related works

- Evolutionary clustering [Chakrabarti et al., 2006]
- Hypothesis testing approach [Song and Wang, 2005]
- Kalman filter approach [Krempl et al., 2011]
- GraphScope [Sun et al., 2007]
- Variational Bayes approach [Sato, 2001]

All of these address the clustering change detection issue.

Significance

A novel clustering change detection algorithm. Key ideas:
- Sequential dynamic model selection (sequential DMS)
- NML (normalized maximum likelihood) code-length as the criterion
- The first formulae for NML for Gaussian mixture models

Empirical demonstration of its superiority over existing methods, shown using artificial data sets.

Demonstration of its validity in market analysis, shown using real beer consumption data sets.


Sequential Dynamic Model Selection Algorithm

Proposed Alg.: background of DMS

Dynamic Model Selection (DMS) [Yamanishi and Maruyama, 2007]

Batch DMS criterion: minimize, with respect to the model sequence, the total code-length, i.e., the code-length of the data sequence plus the code-length of the model sequence (a sketch in symbols follows). DMS is an extension of the MDL (Minimum Description Length) principle [Rissanen, 1978] to model "sequence" selection.
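In symbols (my notation, a sketch rather than the slide's own formula): writing $x^T$ for the data sequence and $M^T = (M_1, \dots, M_T)$ for the model sequence, the batch DMS criterion can be written as

$$\min_{M^T}\ \big[\, L(x^T \mid M^T) + L(M^T) \,\big],$$

where the first term is the code-length of the data sequence and the second is the code-length of the model sequence.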

Proposed Alg.: Sequential DMS

Sequential dynamic model selection (SDMS) algorithm: at each time t, given the data observed so far, sequentially select the number of clusters K_t and the cluster assignment Z_t by minimizing, w.r.t. K_t and Z_t, the sum of two terms: the code-length for data clustering, computed with NML (normalized maximum likelihood) coding, and the code-length for the transition of the clustering structure. This is a sequential variant of the DMS criterion [Yamanishi and Maruyama, 2007]; a sketch of one selection step follows.
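A minimal sketch of one SDMS selection step in Python, assuming hypothetical helpers nml_code_length (the NML clustering code-length) and transition_code_length (the KT-based transition code-length); this is an illustration, not the authors' implementation:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def sdms_step(x_t, K_prev, K_max, nml_code_length, transition_code_length):
        """One SDMS step: pick K_t, Z_t minimizing data + transition code-length."""
        best_total, best_K, best_Z = np.inf, None, None
        # K is assumed to transit to its neighbors only: K_prev - 1, K_prev, K_prev + 1.
        for K in range(max(1, K_prev - 1), min(K_max, K_prev + 1) + 1):
            gmm = GaussianMixture(n_components=K, random_state=0).fit(x_t)
            Z = gmm.predict(x_t)  # hard cluster assignment
            total = nml_code_length(x_t, Z, gmm) + transition_code_length(K, K_prev)
            if total < best_total:
                best_total, best_K, best_Z = total, K, Z
        return best_K, best_Z, best_total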

Proposed Alg.: model transition

Consider three patterns of clustering changes, and run the EM algorithm with initial values chosen as follows (a sketch of these initializations is given after this list):

- Case 1: the number of clusters does not change. Initial parameter values remain the same.
- Case 2: the number of clusters decreases (e.g., merging). Assign the data in a certain cluster to the other clusters randomly.
- Case 3: the number of clusters increases (e.g., splitting). Assign data to a new cluster randomly.
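A minimal sketch of the three initialization schemes, assuming hard assignments labeled 0 .. K-1 with EM re-estimation afterwards; the specific random choices below are my reading of "randomly" in the slide:

    import numpy as np

    rng = np.random.default_rng(0)

    def init_assignment(Z_prev, K_prev, K_new):
        """Initial cluster assignment for EM when moving from K_prev to K_new clusters."""
        Z = Z_prev.copy()
        if K_new == K_prev:
            # Case 1: keep the previous assignment (and parameter values).
            return Z
        if K_new < K_prev:
            # Case 2 (merging): dissolve one cluster, scatter its data randomly.
            removed = rng.integers(K_prev)
            idx = np.where(Z == removed)[0]
            others = np.array([k for k in range(K_prev) if k != removed])
            Z[idx] = rng.choice(others, size=len(idx))
            Z[Z > removed] -= 1  # relabel to 0 .. K_new - 1
            return Z
        # Case 3 (splitting): move a random subset of the data into the new cluster.
        idx = rng.choice(len(Z), size=max(1, len(Z) // K_new), replace=False)
        Z[idx] = K_new - 1
        return Z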

Proposed Alg.: code-length for transition

Code-length of the model transition: model the transition of the number of clusters K with a probability distribution, supposing that K transits to its neighbors only, and employ the Krichevsky-Trofimov (KT) estimate [Krichevsky and Trofimov, 1981]; a sketch of the estimate follows.
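The KT estimate itself is standard; a sketch under the assumption that the transition takes the three values down/stay/up, with past counts $n_{-1}, n_0, n_{+1}$:

$$P(K_t = K_{t-1} + \delta \mid K^{t-1}) = \frac{n_\delta + 1/2}{n_{-1} + n_0 + n_{+1} + 3/2}, \qquad \delta \in \{-1, 0, +1\},$$

and the code-length of the transition is $L(K_t \mid K^{t-1}) = -\log P(K_t \mid K^{t-1})$.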


How to compute the NML code-length for Gaussian mixtures

Criteria: NML code-length

Model: the Gaussian mixture model (GMM)

$$f(x; \theta) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(x; \mu_k, \Sigma_k).$$

NML (normalized maximum likelihood) code-length:

$$L_{\mathrm{NML}}(x^n) = -\log \frac{f(x^n; \hat{\theta}(x^n))}{\mathcal{C}(n)}, \qquad \mathcal{C}(n) = \int f(y^n; \hat{\theta}(y^n))\, dy^n,$$

where $\hat{\theta}(x^n)$ is the maximum likelihood estimator and $\mathcal{C}(n)$ is the normalization term. This is the shortest code-length in the sense of the minimax criterion [Shtarkov, 1987].

For Continuous Data

Normalization term: if the data may range over the whole domain, the normalization term diverges.

Problem:
- NML for the Gaussian distribution: the normalization term diverges.
- NML for a mixture distribution: the normalization term is computationally intractable; this comes from combinatorial difficulties.

For Continuous Data (Example)

For the one-dimensional Gaussian distribution (σ² is given), the normalization term diverges when the range of the data is unrestricted; a worked version of this example follows.
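A worked version of this example (my derivation, under the stated assumption of known σ²): the maximum likelihood estimator of the mean is $\hat{\mu}(x^n) = \bar{x}$, and integrating the maximized likelihood over $x^n$ gives

$$\mathcal{C}(n) = \int f(x^n; \hat{\mu}(x^n))\, dx^n = \sqrt{\frac{n}{2\pi\sigma^2}} \int d\hat{\mu},$$

which diverges as $\hat{\mu}$ ranges over all of $\mathbb{R}$; restricting $|\hat{\mu}| \le R$ yields the finite value $2R\sqrt{n / (2\pi\sigma^2)}$.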

Approximate computation (1/2)

Use sufficient statistics: the normalization integral is reduced to an integral over the distributions of the sufficient statistics, where g1 is a Gaussian distribution (for the MLE of the mean) and g2 is a Wishart distribution (for the scatter matrix); a sketch of the reduction follows.
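A sketch of why sufficient statistics help (standard facts about the Gaussian family, not the slide's own formula): the maximized likelihood $f(x^n; \hat{\theta}(x^n))$ depends on $x^n$ only through the MLE $\hat{\theta} = (\hat{\mu}, \hat{\Sigma})$, so the normalization integral reduces to an integral of the density of the MLE evaluated at its own parameter value,

$$\mathcal{C}(n) = \int f(x^n; \hat{\theta}(x^n))\, dx^n = \int g(\hat{\theta}\,;\, \hat{\theta})\, d\hat{\theta}, \qquad g(\hat{\mu}, \hat{\Sigma}\,;\, \mu, \Sigma) = g_1(\hat{\mu})\, g_2(\hat{\Sigma}),$$

where, for Gaussian data, the sample mean ($g_1$) is Gaussian, the scatter matrix ($g_2$) is Wishart, and the two are independent.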

Criteria: NML for GMM

Efficiently computing an approximate variant of the NML code-length for a GMM [Hirai and Yamanishi, 2011]: restrict the range of data so that the MLE lies in a bounded range specified by a parameter. The normalization term then does not diverge, but it still depends heavily on the range parameters.

NML

The restricted normalization term is calculated in closed form, where n is the number of data points and d is the dimension of the data.

Criteria: RNML code-length

Modify NML to develop the re-normalized maximum likelihood (RNML) coding [Rissanen, Roos, Myllymaki, 2010; Hirai and Yamanishi, 2012]: re-normalize around the MLE of the parameter by restricting the range of data. The resulting code-length is less dependent on the hyper-parameters.
Criteria: RNML code-length

Theorem [Hirai and Yamanishi, 2012]: the RNML code-length for a GMM is calculated in closed form.

Problem: computing its normalization terms straightforwardly is costly.
Criteria


efficient computing of RNML



Straightforward computation of RNML requires time



But we can compute it efficiently



Theorem

[Kontkanen and Myllymaki, 07]



22



)

Can compute the normalization term

in for “mixture” models

Criteria: efficient computing of RNML

Theorem [Hirai and Yamanishi, 2012]: the normalization term satisfies a recursive formula, which yields the efficient computation; a sketch of the analogous multinomial recursion follows.
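The GMM recursion is stated in [Hirai and Yamanishi, 2012]; as a concrete instance of this style of computation, here is a sketch of the linear-time recursion of [Kontkanen and Myllymaki, 07] for the multinomial normalizing sum (my code, for illustration):

    from math import comb

    def multinomial_norm_term(K, n):
        """Normalizing sum C(K, n) of the multinomial NML, via the recursion
        C(K, n) = C(K-1, n) + (n / (K-2)) * C(K-2, n) for K > 2
        (Kontkanen and Myllymaki, 2007). Plain floats suffice for moderate n."""
        c_km2 = 1.0  # C(1, n) = 1
        # C(2, n) by its defining sum; 0**0 evaluates to 1 here, as intended.
        c_km1 = sum(comb(n, h) * (h / n) ** h * ((n - h) / n) ** (n - h)
                    for h in range(n + 1))
        if K == 1:
            return c_km2
        if K == 2:
            return c_km1
        for k in range(3, K + 1):
            c_km2, c_km1 = c_km1, c_km1 + n / (k - 2) * c_km2
        return c_km1

The code-length contribution is the logarithm of this term; for example, multinomial_norm_term(3, 512) costs time linear in n + K.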

Experimental Results

- Artificial Data
- Market Analysis

Experimental Results: data generation

Generate artificial data sets according to a GMM whose clustering structure changes over time.

Experimental Results: comparison criteria

Employ three comparison metrics (a sketch of their computation is given after this list):

- AR (accuracy rate): average rate of correctly estimating the true number of clusters over all time steps.
- IR (identification rate): probability of correctly identifying change-points and the changes themselves.
- FAR (false alarm rate): rate of false alarms among all detected change-points.
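A minimal sketch of these metrics over a run, given per-time true and estimated numbers of clusters (my code; note that IR below checks change times only, while the paper also requires identifying the change itself):

    import numpy as np

    def change_points(ks):
        """Times t at which the number of clusters changes."""
        ks = np.asarray(ks)
        return set((np.where(ks[1:] != ks[:-1])[0] + 1).tolist())

    def evaluate(true_k, est_k):
        """Return (AR, IR, FAR) for sequences of cluster counts."""
        true_k, est_k = np.asarray(true_k), np.asarray(est_k)
        ar = float(np.mean(true_k == est_k))               # accuracy rate
        true_cp, est_cp = change_points(true_k), change_points(est_k)
        ir = len(true_cp & est_cp) / max(len(true_cp), 1)  # identification rate
        far = len(est_cp - true_cp) / max(len(est_cp), 1)  # false alarm rate
        return ar, ir, far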

Experimental Results: artificial data

Our alg. with NML was able to detect true change-points and identify the true number of clusters with higher probability than AIC and BIC.

[Figure: average number of clusters over time for each criterion]

AIC: Akaike's information criterion [Akaike, 1974]; BIC: Bayesian information criterion [Schwarz, 1978]

       RNML    AIC     BIC
AR     0.903   0.103   0.135
IR     0.380   0.005   0.020
FAR    0.260   0.020   0.718

Comparison w.r.t. KL-divergence

Evaluated change detection accuracies by varying the Kullback-Leibler divergence (KLD) between the distributions before and after the change points. The larger the KLD between GMMs before and after the change-point was, the more accurately the change was detected in terms of IR (identification rate).

Experimental Results: vs. SW Alg.

SW algorithm [Song and Wang, 2005]: hypothesis testing of whether clusters are identical or not, followed by splitting, merging, etc.

The sequential DMS with RNML significantly outperformed the SW algorithm (data size/time = 512):

             AR      IR      FAR
Proposed     0.988   0.950   0.050
SW-RNML      0.369   0.300   0.503
SW-BIC       0.019   0.000   0.841

Experimental Results: market analysis

Data set provided by MACROMILL, Inc.: purchase records of 14 kinds of beer, 3185 users, 78 days. Clustering customers to detect changes in their group structure.

Example data format (purchase volumes per user, tracked over time):

          Beer 1   Beer 2   ...
User 1    350      700      ...
User 2    1050     350      ...
...       ...      ...      ...

Our alg. detected clustering changes that corresponded to the year-end demand. The cluster change at the change-point 1/1,2 is shown below.

Structure with 3 clusters:

              cluster 1   cluster 2   cluster 3
Beer-A        184         0           117
Beer-B        91          0           95
Premium-A     108         0           80
Premium-B     113         0           43
Beer-C        0           0           126
Beer-D        0           0           140
Third-A       93          41          43
Third-B       0           198         121
Third-C       0           303         103
Third-D       0           120         182
LM-beer-A     0           75          48
Off-A         0           0           157
Off-B         0           114         34
Off-C         0           0           83
Total volume  589         852         1373
# customers   598         376         311

Structure with 5 clusters (rows in the same order):

              cluster 1   cluster 2   cluster 3   cluster 4   cluster 5
Beer-A        84          0           131         50          229
Beer-B        123         0           248         0           0
Premium-A     153         0           174         73          0
Premium-B     176         0           105         0           0
Beer-C        0           0           146         122         0
Beer-D        0           0           72          192         0
Third-A       101         131         130         0           0
Third-B       0           34          406         0           131
Third-C       0           107         112         46          236
Third-D       0           202         431         0           0
LM-beer-A     0           107         87          0           0
Off-A         0           0           169         138         0
Off-B         0           215         74          0           0
Off-C         0           0           61          83          0
Total volume  637         796         2348        705         596
# customers   397         190         123         162         363

Many customers changed their patterns to purchase Beer-A and Third-beer at the year's end.

Conclusion

- Proposed the sequential DMS algorithm to address the clustering change detection issue. Key ideas:
  - Sequential dynamic model selection based on the MDL principle
  - The use of the NML code-length as the criterion, and its efficient computation
- It is able to detect cluster changes significantly more accurately than AIC/BIC-based methods and the existing statistical-test-based method on artificial data.
- Tracking changes of group structures leads to understanding changes of market structures.

Why NML?

NML achieves the shortest code-length in the sense of Shtarkov's minimax criterion [Shtarkov, 1987]: for a given model class, the minimax regret relative to the maximum likelihood estimator is minimized by taking Q to be the NML distribution; a sketch in symbols follows.
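A sketch of the criterion in its standard form: for a model class $\{f(\cdot\,; \theta)\}$ with maximum likelihood estimator $\hat{\theta}(\cdot)$,

$$\min_{Q} \max_{x^n} \log \frac{f(x^n; \hat{\theta}(x^n))}{Q(x^n)}$$

is attained by taking Q to be the NML distribution, $Q^*(x^n) = f(x^n; \hat{\theta}(x^n)) / \mathcal{C}(n)$.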

Restrict the range of data

For a given model class, restrict the range of data over which the maximum in Shtarkov's minimax criterion [Shtarkov, 1987] is taken; that is, we change Shtarkov's minimax criterion itself.

Comparison with non-parametric Bayes

Sequential dynamic model selection works better than non-parametric Bayes (infinite HMM, etc.) [Comparison of Dynamic Model Selection with Infinite HMM for Statistical Model Change Detection, Sakurai and Yamanishi, to appear in ITW 2012].