Fast Algorithms for Mining Frequent Itemsets



Advisor: Prof. Chin-Chen Chang (張真誠)
Student: Yu-Chiang Li (李育強)

Dept. of Computer Science and Information Engineering,
National Chung Cheng University

Date: May 31, 2007

Ph.D. Dissertation Draft
A Study of Fast Algorithms for Mining Frequent Itemsets (探勘頻繁項目集合之快速演算法研究)


2

Outline

Introduction
Background and Related Work
NFP-Tree Structure
Fast Share Measure (FSM) Algorithm
Three Efficient Algorithms
Direct Candidate Generation (DCG) Algorithm
Isolated Items Discarding Strategy (IIDS)
Maximum Item Conflict First (MICF) Sanitization Method
Conclusions

3

Introduction

Data mining techniques have been developed to find a small set of precious nuggets from reams of data (Cabena et al., 1998; Kantardzic, 2002)

Mining association rules constitutes one of the most important data mining problems

Two sub-problems (Agrawal & Srikant, 1994):
  Identifying all frequent itemsets
  Using these frequent itemsets to generate association rules

The first sub-problem plays an essential role in mining association rules

4

Introduction (cont.)

Mining frequent itemsets
Mining share-frequent itemsets
Mining high utility itemsets
Hiding sensitive patterns



5

Support-Confidence Framework (1/4)

Apriori algorithm (Agrawal & Srikant, 1994): minSup = 40%
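As a sketch of the level-wise idea behind Apriori, the following C++ fragment (C++ being the language of the dissertation's implementations, per the experimental setup) mines the frequent itemsets of the six sorted transactions shown on the next slides with minSup = 40%. The candidate generation here is deliberately naive (extend each frequent itemset by one item) rather than the full Apriori-gen join-and-prune step.

```cpp
// Minimal level-wise mining sketch in the spirit of Apriori.
#include <algorithm>
#include <iostream>
#include <set>
#include <string>
#include <vector>
using namespace std;

using Itemset = set<char>;

int supportCount(const Itemset& x, const vector<Itemset>& db) {
    int c = 0;
    for (const auto& t : db)
        if (includes(t.begin(), t.end(), x.begin(), x.end())) ++c;   // X contained in t?
    return c;
}

int main() {
    vector<Itemset> db = {{'C','A','B','D'}, {'C','A'}, {'C','A'},
                          {'C','B','D'}, {'A','B','D'}, {'C','B','D'}};
    const string items = "ABCD";
    int minCount = 3;                        // minSup = 40% of 6 transactions -> count >= 3

    set<Itemset> frequent, prev;
    for (char i : items) {                   // frequent 1-itemsets
        Itemset x = {i};
        if (supportCount(x, db) >= minCount) { prev.insert(x); frequent.insert(x); }
    }
    while (!prev.empty()) {                  // extend frequent (k-1)-itemsets by one item
        set<Itemset> next;
        for (const auto& x : prev)
            for (char i : items) {
                Itemset c = x; c.insert(i);
                if (c.size() == x.size() + 1 && supportCount(c, db) >= minCount)
                    next.insert(c);
            }
        frequent.insert(next.begin(), next.end());
        prev = next;
    }
    for (const auto& x : frequent) {
        for (char i : x) cout << i;
        cout << " (support " << supportCount(x, db) << ")\n";
    }
}
```

For this toy data the output lists the four single items plus AC, BC, BD, CD, and BCD.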

6

Support-Confidence Framework (2/4)

FP-growth algorithm (Han et al., 2000; Han et al., 2004)

TID   Frequent 1-itemsets (sorted)
001   C A B D
002   C A
003   C A
004   C B D
005   A B D
006   C B D
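A minimal sketch of how the sorted transactions above are compressed into an FP-tree: shared prefixes are merged and each node carries a count. The header-table node links that FP-growth needs for mining are omitted for brevity, so this only illustrates the tree-building step.

```cpp
// Build a prefix tree (FP-tree skeleton) from the sorted transactions above.
#include <iostream>
#include <map>
#include <string>
#include <vector>
using namespace std;

struct FPNode {
    char item;
    int count = 0;
    map<char, FPNode*> children;
    explicit FPNode(char i = '*') : item(i) {}
    ~FPNode() { for (auto& kv : children) delete kv.second; }   // recursive cleanup
};

void insertTransaction(FPNode& root, const vector<char>& sortedItems) {
    FPNode* cur = &root;
    for (char it : sortedItems) {
        FPNode*& child = cur->children[it];
        if (!child) child = new FPNode(it);
        child->count++;                      // shared prefixes accumulate counts
        cur = child;
    }
}

void dump(const FPNode& n, int depth) {
    for (const auto& kv : n.children) {
        cout << string(depth * 2, ' ') << kv.first << ':' << kv.second->count << '\n';
        dump(*kv.second, depth + 1);
    }
}

int main() {
    FPNode root;
    vector<vector<char>> db = {{'C','A','B','D'}, {'C','A'}, {'C','A'},
                               {'C','B','D'}, {'A','B','D'}, {'C','B','D'}};
    for (const auto& t : db) insertTransaction(root, t);
    dump(root, 0);   // e.g. C:5 with children A:3 and B:2, plus a separate A:1 branch
}
```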





7


Support-Confidence Framework (3/4)

TID   Frequent 1-itemsets (sorted)
001   C A B D
002   C A
003   C A
004   C B D
005   A B D
006   C B D







8

Support-Confidence Framework (4/4)

Conditional FP-tree of D
Conditional FP-tree of BD


9


Share-Confidence Framework (1/4)

Measure value mv(ip, Tq): the purchased quantity of item ip in transaction Tq
  mv({D}, T01) = 1
  mv({C}, T03) = 3

Transaction measure value: tmv(Tq) = Σ_{ip ∈ Tq} mv(ip, Tq)
  tmv(T02) = 10

Total measure value: Tmv(DB) = Σ_{Tq ∈ DB} tmv(Tq)
  Tmv(DB) = 47

Itemset measure value: imv(X, Tq) = Σ_{ip ∈ X} mv(ip, Tq)
  imv({A, E}, T02) = 5

Local measure value: lmv(X) = Σ_{Tq ∈ dbX} imv(X, Tq), where dbX is the set of transactions containing X
  lmv({BC}) = 2 + 5 + 5 = 12

10

Share-Confidence Framework (2/4)

minShare = 30%

Itemset share: SH(X) = lmv(X) / Tmv(DB)
  SH({BC}) = 12/47 = 25.5%

SH-frequent: if SH(X) >= minShare, X is a share-frequent (SH-frequent) itemset
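The sketch below ties the definitions together: it computes lmv(X), Tmv(DB), and SH(X) = lmv(X)/Tmv(DB), then compares against minShare = 30% as on this slide. The quantity table is a made-up placeholder, not the slides' example database, so the resulting share differs from the 12/47 example.

```cpp
// Share-measure sketch: lmv(X), Tmv(DB), SH(X) vs. minShare.
#include <iostream>
#include <map>
#include <set>
#include <vector>
using namespace std;

using Tx = map<char, int>;   // item -> measure value mv(ip, Tq), i.e. purchased quantity

int imv(const set<char>& x, const Tx& t) {        // itemset measure value in one transaction
    int s = 0;
    for (char i : x) {
        auto it = t.find(i);
        if (it == t.end()) return 0;              // X not contained in Tq -> contributes nothing
        s += it->second;
    }
    return s;
}

int main() {
    vector<Tx> db = { {{'A',1},{'B',1},{'C',1},{'D',1}},   // placeholder quantities
                      {{'A',2},{'C',3},{'E',3},{'F',2}},
                      {{'B',2},{'C',3}} };
    set<char> x = {'B', 'C'};
    double minShare = 0.30;

    int Tmv = 0, lmv = 0;
    for (const auto& t : db) {
        for (const auto& kv : t) Tmv += kv.second;   // tmv(Tq) summed over DB
        lmv += imv(x, t);                            // only transactions containing X contribute
    }
    double SH = (double)lmv / Tmv;
    cout << "SH(X) = " << SH
         << (SH >= minShare ? "  -> SH-frequent\n" : "  -> not SH-frequent\n");
}
```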

11

Share-Confidence Framework (3/4)

ZP (Zero Pruning) and ZSP (Zero Subset Pruning) (Barber & Hamilton, 2003)
  variants of exhaustive search
  prune only the candidate itemsets whose local measure values are exactly zero

SIP (Share Infrequent Pruning) (Barber & Hamilton, 2003)
  Apriori-like, but with errors (it may miss SH-frequent itemsets)

The three algorithms are either inefficient or do not discover the complete set of share-frequent (SH-frequent) itemsets

12

Share-Confidence Framework (4/4)

ZSP Algorithm
SIP Algorithm

13


Utility Mining (1/2)

Internal utility iu(ip, Tq): the purchased quantity of item ip in transaction Tq
  iu({D}, T01) = 1
  iu({C}, T03) = 3

External utility eu(ip): e.g., the unit profit of item ip
  eu({D}) = 3
  eu({C}) = 1

Utility value in a transaction: util(X, Tq) = Σ_{ip ∈ X} iu(ip, Tq) × eu(ip)
  util({C, E, F}, T02) = util(C, T02) + util(E, T02) + util(F, T02) = 3×1 + 1×5 + 2×2 = 12

Local utility: Lutil(X) = Σ_{Tq ∈ dbX} util(X, Tq)
  Lutil({C, D}) = util({C, D}, T01) + util({C, D}, T04) + util({C, D}, T06) = 4 + 7 + 5 = 16

14

Utility Mining (2/2)

Total utility: Tutil(DB) = Σ_{Tq ∈ DB} Σ_{ip ∈ Tq} iu(ip, Tq) × eu(ip)
  Tutil(DB) = 122

The utility value of X in DB: UTIL(X) = Lutil(X) / Tutil(DB)
  UTIL({C, D}) = 16/122 = 13.1%

High utility itemset: if UTIL(X) >= minUtil, X is a high utility itemset
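A sketch of the utility definitions: util(X, Tq), Lutil(X), Tutil(DB), and the UTIL(X) >= minUtil test. The external utilities for C, D, and E match the slide's examples (eu(C)=1, eu(D)=3, eu(E)=5), but the transactions' internal utilities below are placeholders, not the slides' database.

```cpp
// Utility-mining sketch: Lutil(X), Tutil(DB), UTIL(X) vs. minUtil.
#include <iostream>
#include <map>
#include <set>
#include <vector>
using namespace std;

using Tx = map<char, int>;                  // item -> internal utility iu(ip, Tq)

int util(const set<char>& x, const Tx& t, const map<char, int>& eu) {
    int s = 0;
    for (char i : x) {
        auto it = t.find(i);
        if (it == t.end()) return 0;         // X not contained in Tq
        s += it->second * eu.at(i);          // iu(ip, Tq) * eu(ip)
    }
    return s;
}

int main() {
    map<char, int> eu = {{'C', 1}, {'D', 3}, {'E', 5}};   // external utilities (unit profits)
    vector<Tx> db = { {{'C', 1}, {'D', 1}},               // placeholder internal utilities
                      {{'C', 4}, {'E', 1}},
                      {{'C', 2}, {'D', 1}} };
    set<char> x = {'C', 'D'};
    double minUtil = 0.30;

    int Tutil = 0, Lutil = 0;
    for (const auto& t : db) {
        for (const auto& kv : t) Tutil += kv.second * eu.at(kv.first);  // total utility of DB
        Lutil += util(x, t, eu);                                        // local utility of X
    }
    double UTIL = (double)Lutil / Tutil;
    cout << "UTIL(X) = " << UTIL
         << (UTIL >= minUtil ? "  -> high utility\n" : "  -> not high utility\n");
}
```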



15

Privacy-Preserving in Mining Frequent Itemsets

NP-hard problem (Atallah et al., 1999)

DB: the original database; DB': the released (sanitized) database
RI: the set of restrictive itemsets
~RI: the set of non-restrictive itemsets

Misses cost = (|~RI(DB)| - |~RI(DB')|) / |~RI(DB)|, the fraction of non-restrictive frequent itemsets of DB that can no longer be discovered in DB'

Sanitization algorithms (Oliveira and Zaïane, 2002; Oliveira and Zaïane, 2003; Saygin et al., 2001)
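Assuming the usual definition of misses cost above (the fraction of non-restrictive frequent itemsets of DB that are lost after sanitization), a minimal sketch of the computation; the itemsets, written as strings, are placeholders rather than the slides' data.

```cpp
// Misses-cost sketch: fraction of ~RI itemsets frequent in DB but no longer frequent in DB'.
#include <iostream>
#include <set>
#include <string>
using namespace std;

int main() {
    set<string> nonRestrictiveFreqDB  = {"AB", "AC", "BD", "CD"};  // ~RI frequent in DB
    set<string> nonRestrictiveFreqDBp = {"AB", "CD"};              // ~RI still frequent in DB'

    int lost = 0;
    for (const auto& x : nonRestrictiveFreqDB)
        if (!nonRestrictiveFreqDBp.count(x)) ++lost;               // hidden by accident

    double missesCost = (double)lost / nonRestrictiveFreqDB.size();
    cout << "Misses cost = " << missesCost << '\n';                // 2/4 = 0.5 here
}
```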

16

NFP-Tree (1/4)

NFP-growth Algorithm
NFP-tree construction

17

NFP-Tree (2/4)

TID   Frequent 1-itemsets (sorted)
001   C A B D
002   C A
003   C A
004   C B D
005   A B D
006   C B D





18

NFP-Tree (3/4)

TID   Frequent 1-itemsets (sorted)
001   C A B D
002   C A
003   C A
004   C B D
005   A B D
006   C B D

19

NFP-Tree (4/4)

Conditional NFP-tree of D (3, 4)
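The slides state only (see Conclusions) that each NFP-tree node keeps two counters so that fewer tree nodes are needed; the exact counter semantics are not shown here. The following node layout is therefore a hypothetical sketch: the pair of counters mirrors the "D (3, 4)" label above, but their interpretation is an assumption, not the dissertation's actual definition.

```cpp
// Hypothetical NFP-tree node layout with two counters (interpretation assumed).
#include <map>

struct NFPNode {
    char item = '*';
    int  count1 = 0;          // assumed: e.g. number of transactions passing through this node
    int  count2 = 0;          // assumed: e.g. the second counter shown as "(3, 4)" on the slide
    NFPNode* parent = nullptr;
    std::map<char, NFPNode*> children;
    ~NFPNode() { for (auto& kv : children) delete kv.second; }
};

int main() {
    NFPNode root;
    NFPNode* d = new NFPNode;
    d->item = 'D'; d->count1 = 3; d->count2 = 4; d->parent = &root;
    root.children['D'] = d;   // freed by root's destructor
}
```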


20

Experimental Results (1/3)

PC: Pentium IV 1.5 GHz, 1 GB SDRAM, running Windows 2000 Professional
All algorithms were coded in VC++ 6.0

Datasets:
  Real: BMS-Web View-1, BMS-Web View-2, Connect 4
  Artificial: generated by the IBM synthetic data generator

|D|   Number of transactions in DB
|T|   Mean size of the transactions
|I|   Mean size of the maximal potentially frequent itemsets
|L|   Number of maximal potentially frequent itemsets
N     Number of items

21

Experimental Results (2/3)

22

Experimental Results (3/3)


23

Fast Share Measure (FSM) Algorithm

FSM: Fast Share Measure algorithm
ML: maximum transaction length in DB
MV: maximum measure value in DB
min_lmv = minShare × Tmv(DB)

Level Closure Property: given a minShare and a k-itemset X

Theorem 1. If lmv(X) + (lmv(X)/k) × MV < min_lmv, all supersets of X with length k+1 are infrequent

Theorem 2. If lmv(X) + (lmv(X)/k) × MV × k' < min_lmv, all supersets of X with length k+k' are infrequent

Corollary 1. If lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv, all supersets of X are infrequent

24


minShare = 30%

Let CF(X) = lmv(X) + (lmv(X)/k) × MV × (ML - k)

Prune X if CF(X) < min_lmv

CF({ABC}) = 3 + (3/3) × 3 × (6 - 3) = 12 < 14.1 = min_lmv
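The pruning test of Corollary 1 is a one-line check; the sketch below reproduces the slide's {ABC} numbers.

```cpp
// FSM pruning check (Corollary 1): prune X when
// CF(X) = lmv(X) + (lmv(X)/k) * MV * (ML - k) < min_lmv.
#include <iostream>
using namespace std;

bool pruneByCF(double lmv, int k, double MV, int ML, double min_lmv) {
    double CF = lmv + (lmv / k) * MV * (ML - k);
    return CF < min_lmv;          // true -> no superset of X can be SH-frequent
}

int main() {
    // Numbers from the slide: X = {ABC}, lmv = 3, k = 3, MV = 3, ML = 6, min_lmv = 14.1
    cout << boolalpha << pruneByCF(3, 3, 3, 6, 14.1) << '\n';   // CF = 12 < 14.1 -> true
}
```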

25

Experimental Results (1/2)

T4.I2.D100k.N50.S10
minShare = 0.8%
ML = 14


Pass (k)           ZSP        FSM(1)    FSM(2)    FSM(3)    FSM(ML-1)
k=1    Ck          50         50        50        50        50
       RCk         50         49        49        49        50
       Fk          32         32        32        32        32
k=2    Ck          1225       1176      1176      1176      1225
       RCk         1219       570       754       845       1085
       Fk          119        119       119       119       119
k=3    Ck          19327      4256      7062      8865      14886
       RCk         17217      868       1685      2410      5951
       Fk          65         65        65        65        65
k=4    Ck          165077     1725      3233      5568      24243
       RCk         107397     232       644       1236      6117
       Fk          9          9         9         9         9
k=5    Ck          406374     81        258       717       6309
       RCk         266776     5         40        109       1199
       Fk          0          0         0         0         0
k=6    Ck          369341     0         1         4         287
       RCk         310096     0         0         0         37
       Fk          0          0         0         0         0
k>=7   Ck          365975     0         0         0         0
       RCk         359471     0         0         0         0
       Fk          0          0         0         0         0
Time (sec)         10349.9    2.30      2.98      3.31      11.24

26

Experimental Results (2/2)

27

Three Efficient Algorithms

EFSM (Enhanced FSM): instead of joining arbitrary pairs of itemsets in RCk-1, EFSM joins each itemset of RCk-1 with a single item in RC1 to generate Ck efficiently

Reduces the time complexity of candidate generation from O(n^(2k-2)) to O(n^k)
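A sketch of EFSM-style candidate generation: every itemset of RCk-1 is extended by one item of RC1, and a set container collapses duplicate candidates produced by different joins. The RC2 and RC1 contents here are illustrative, not the slides' data.

```cpp
// Generate Ck by joining each (k-1)-itemset with a single item (EFSM-style join).
#include <iostream>
#include <set>
#include <vector>
using namespace std;

using Itemset = set<char>;

set<Itemset> generateCk(const set<Itemset>& rcPrev, const Itemset& rc1) {
    set<Itemset> ck;
    for (const auto& x : rcPrev)
        for (char i : rc1) {
            if (x.count(i)) continue;      // skip items already in X
            Itemset c = x;
            c.insert(i);                    // k-itemset candidate
            ck.insert(c);                   // duplicates from different joins collapse here
        }
    return ck;
}

int main() {
    set<Itemset> rc2 = {{'A','B'}, {'B','C'}};   // illustrative RC2
    Itemset rc1 = {'A', 'B', 'C', 'D'};          // illustrative RC1
    for (const auto& c : generateCk(rc2, rc1)) {
        for (char i : c) cout << i;
        cout << ' ';
    }
    cout << '\n';   // prints: ABC ABD BCD
}
```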

28


Xk+1: an arbitrary superset of X with length k+1 in DB

S(Xk+1): the set which contains all Xk+1 in DB

dbS(Xk+1): the set of transactions each of which contains at least one Xk+1

SuFSM and ShFSM are derived from EFSM and prune the candidates more efficiently than FSM

SuFSM (Support-counted FSM):

Theorem 3. If lmv(X) + Sup(S(Xk+1)) × MV × (ML - k) < min_lmv, all supersets of X are infrequent


29

SuFSM (Support-counted FSM)

lmv(X)/k >= Sup(X) >= Sup(S(Xk+1))

Ex. lmv({BCD})/k = 15/3 = 5, Sup({BCD}) = 3, Sup(S({BCD}k+1)) = 2

If no superset of X is an SH-frequent itemset, then the following conditions hold:

lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv

lmv(X) + Sup(X) × MV × (ML - k) < min_lmv

lmv(X) + Sup(S(Xk+1)) × MV × (ML - k) < min_lmv


30

ShFSM (Share-counted FSM)

Theorem 4. If Tmv(dbS(Xk+1)) < min_lmv, all supersets of X are infrequent

FSM:    lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv
SuFSM:  lmv(X) + Sup(S(Xk+1)) × MV × (ML - k) < min_lmv
ShFSM:  Tmv(dbS(Xk+1)) < min_lmv

31

ShFSM (Share-counted FSM)

Ex. X = {AB}
Tmv(dbS(Xk+1)) = tmv(T01) + tmv(T05) = 6 + 6 = 12 < 14 = min_lmv
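A sketch of the Theorem 4 check. Here dbS(Xk+1) is taken as the transactions that contain X together with at least one additional item (so each of them contains some (k+1)-superset of X). The quantity table is a placeholder, not the slides' database, so the totals differ from the example above.

```cpp
// ShFSM pruning sketch (Theorem 4): prune X when Tmv(dbS(X_{k+1})) < min_lmv.
#include <iostream>
#include <map>
#include <set>
#include <vector>
using namespace std;

using Tx = map<char, int>;   // item -> measure value (purchased quantity)

int main() {
    vector<Tx> db = { {{'A',2},{'B',1},{'C',3}},   // placeholder quantity table
                      {{'A',1},{'B',2}},
                      {{'C',4},{'D',2}} };
    set<char> x = {'A', 'B'};
    double min_lmv = 14.0;

    int tmvSum = 0;
    for (const auto& t : db) {
        bool containsX = true;
        for (char i : x) containsX = containsX && t.count(i);
        if (!containsX || t.size() <= x.size()) continue;   // need X plus >= 1 extra item
        for (const auto& kv : t) tmvSum += kv.second;       // add tmv of this transaction
    }
    cout << "Tmv(dbS) = " << tmvSum
         << (tmvSum < min_lmv ? "  -> prune all supersets of X\n" : "  -> keep\n");
}
```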

32

Experimental Results (1/3)

minShare = 0.3%

33

Experimental Results (2/3)

minShare = 0.3%

34

Experimental Results (3/3)

T6.I4.D100k.N200.S10, minShare = 0.1%, ML = 20

Pass (k)           FSM        EFSM       SuFSM     ShFSM     Fk
k=1    Ck          200        200        200       200       159
       RCk         200        200        199       197
k=2    Ck          19900      19900      19701     19306     1844
       RCk         16214      16214      13312     7199
k=3    Ck          829547     829547     564324    190607    101
       RCk         251877     251877     99765     9792
k=4    Ck          3290296    3290296    793042    20913     0
       RCk         332877     332877     41057     1420
k=5    Ck          393833     393833     25003     1050      5
       RCk         71420      71420      19720     959
k=6    Ck          26137      26137      11582     518       8
       RCk         25562      25562      11045     506
k=7    Ck          11141      11141      5940      204       7
       RCk         11099      11099      5827      196
k=8    Ck          4426       4426       2797      58        1
       RCk         4423       4423       2750      54
k>=9   Ck          2036       2036       1567      12        0
       RCk         2030       2030       1513      10
Time (sec)         13610.4    71.55      29.67     10.95


35

Direct Candidate Generation (DCG) Algorithm

36

Experimental Results (1/3)


37

Experimental Results (2/3)

38

Experimental Results (3/3)


39

Isolated Items Discarding Strategy (IIDS) for Utility Mining


40

IIDS (1/2)

ShFSM

minUtil = 30%

41

IIDS (2/2)

FUM

minUtil = 30%

42

Experimental Results (1/5)


43

Experimental Results (2/5)


44

Experimental Results (3/5)

45

Experimental Results (4/5)

46

Experimental Results (5/5)


minUtil = 0.12%

minUtil = 0.12%

47

Maximum Item Conflict First (MICF) Sanitization Method

Tdegree(Tq): the degree of conflict of a sensitive transaction Tq, i.e., the number of restrictive itemsets included in Tq

If Tdegree(Tq) > 1, Tq is a conflicting transaction

48


Idegree({D}, {D, F}, T05) = 1
Idegree({F}, {D, F}, T05) = 0

MaxIdegree: stores the maximum value of the conflict degree among the items in a transaction

MICF: selects an item with MaxIdegree to delete in each iteration

TID    Transaction        Tdegree(Tq)
T05    {B, D, F, H}       2
T06    {A, B, D, F, H}    3
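A sketch of one MICF-style iteration: in a conflicting transaction, pick the item that occurs in the largest number of restrictive itemsets contained in that transaction, so that a single deletion hides as many restrictive itemsets as possible. The restrictive itemsets below are assumed for illustration, and this simple conflict count is a simplification of the Idegree notation on this slide, not the dissertation's exact definition.

```cpp
// One MICF-style deletion step: choose the item with the maximum conflict degree.
#include <algorithm>
#include <iostream>
#include <set>
#include <vector>
using namespace std;

using Itemset = set<char>;

char pickVictimItem(const Itemset& tx, const vector<Itemset>& restrictive) {
    char best = 0;
    int bestDegree = -1;
    for (char item : tx) {
        int degree = 0;                               // conflict degree of this item in tx
        for (const auto& ri : restrictive)
            if (includes(tx.begin(), tx.end(), ri.begin(), ri.end()) && ri.count(item))
                ++degree;                             // ri is in tx and contains the item
        if (degree > bestDegree) { bestDegree = degree; best = item; }
    }
    return best;                                      // item with MaxIdegree (ties: first seen)
}

int main() {
    Itemset t06 = {'A', 'B', 'D', 'F', 'H'};
    vector<Itemset> restrictive = {{'D','F'}, {'B','H'}, {'A','D'}};  // assumed RIs
    char victim = pickVictimItem(t06, restrictive);
    t06.erase(victim);
    cout << "delete " << victim << " from T06\n";     // D, since it hits {D,F} and {A,D}
}
```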

49


Idegree({D}, {D, F}, T06) = 1
Idegree({F}, {D, F}, T06) = 0

TID    Transaction        Tdegree(Tq)
T06    {A, B, D, F, H}    3

50

Experimental Results (1/5)

51

Experimental Results (2/5)


|RI|=200

minSup=0.04%

|RI|=50

minSup=0.1%

52

Experimental Results (3/5)


|RI|=200

minSup=0.064%

|RI|=200

minSup=0.024%

53

Experimental Results (4/5)


minSup=0.004%

minSup=0.1%

minSup=0.064%

minSup=0.024%

54

Experimental Results (5/5)


55

Conclusions

Support measure
  NFP-growth is presented for mining frequent itemsets
    Uses two counters per tree node to reduce the number of tree nodes
    Applies a smaller tree and header table to discover frequent itemsets efficiently

Share measure
  The proposed algorithms efficiently decrease the number of candidates to be counted
  ShFSM and DCG perform the best

56


Utility mining
  IIDS is proposed to ignore isolated items in the process of candidate generation
  FUM and DCG+ were better than ShFSM and DCG, respectively

Hiding sensitive patterns
  The MICF algorithm is proposed to reduce the impact of sanitization on the source database
  MICF decreases the support of the maximum number of restrictive itemsets per deletion
  It outperforms all other algorithms on misses cost in several datasets for most cases
  MICF has the lowest sanitization rate

57

Future Work

Apply a constraint relaxation algorithm or develop a superior data structure to discover frequent itemsets
Develop superior algorithms to accelerate identifying all or long SH-frequent itemsets
Extend the application scope of IIDS to some classification models
Develop superior algorithms to further reduce the misses cost without hiding failure, so that sensitive data remain protected
Apply data mining techniques to image processing, for instance to improve interpolated color filter array images

58

References

D. Agrawal and C. Aggarwal, "On the design and quantification of privacy preserving data mining algorithms," in Proc. 20th ACM Symposium on Principles of Database Systems, Santa Barbara, CA, pp. 247-255, May 2001.

R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad, "A tree projection algorithm for generation of frequent itemsets," Journal of Parallel and Distributed Computing, vol. 61, no. 3, pp. 350-361, 2001.

R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," in Proc. 1993 ACM SIGMOD Intl. Conf. on Management of Data, Washington, D.C., pp. 207-216, May 1993.

R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proc. 20th Intl. Conf. on Very Large Data Bases, Santiago, Chile, pp. 487-499, Sep. 1994.

M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, and V. Verykios, "Disclosure limitation of sensitive rules," in Proc. 1999 Workshop on Knowledge and Data Engineering Exchange, Chicago, IL, pp. 45-52, Nov. 1999.

B. Barber and H. J. Hamilton, "Parametric algorithm for mining share frequent itemsets," Journal of Intelligent Information Systems, vol. 16, no. 3, pp. 277-293, 2001.

F. Berzal, J. C. Cubero, N. Marín, and J. M. Serrano, "TBAR: An efficient method for association rule mining in relational databases," Data & Knowledge Engineering, vol. 37, no. 1, pp. 47-64, 2001.

S. Brin, R. Motwani, J. D. Ullman, and S. Tsur, "Dynamic itemset counting and implication rules for market basket data," in Proc. 1997 ACM SIGMOD Intl. Conf. on Management of Data, Tucson, AZ, pp. 255-264, May 1997.

P. Cabena, P. Hadjinian, R. Stadler, J. Verhees, and A. Zanasi, Discovering Data Mining: From Concept to Implementation, Prentice Hall PTR, New Jersey, 1998.

C. L. Carter, H. J. Hamilton, and N. Cercone, "Share based measures for itemsets," in Lecture Notes in Computer Science 1263: 1st European Conf. on the Principles of Data Mining and Knowledge Discovery, H. J. Komorowski and J. M. Zytkow (eds.), Springer-Verlag, Berlin, pp. 14-24, 1997.

G. Grahne and J. Zhu, "Efficiently using prefix-trees in mining frequent itemsets," in Proc. IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Melbourne, FL, Nov. 2003.

J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," in Proc. 2000 ACM-SIGMOD Intl. Conf. on Management of Data, Dallas, TX, pp. 1-12, May 2000.

J. Han, J. Pei, Y. Yin, and R. Mao, "Mining frequent patterns without candidate generation: A frequent-pattern tree approach," Data Mining and Knowledge Discovery, vol. 8, no. 1, pp. 53-87, 2004.

T. Johnsten and V. V. Raghavan, "Impact of decision-region based classification mining algorithms on database security," in Proc. IFIP WG 11.3 13th Intl. Conf. on Database Security, Seattle, WA, pp. 177-191, Jul. 1999.

M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, John Wiley & Sons, Inc., New York, 2002.

S. R. M. Oliveira and O. R. Zaïane, "Privacy preserving frequent itemset mining," in Proc. IEEE ICDM Workshop on Privacy, Security and Data Mining, Maebashi City, Japan, pp. 43-54, Dec. 2002.

S. R. M. Oliveira and O. R. Zaïane, "Algorithms for balancing privacy and knowledge discovery in association rule mining," in Proc. 7th Intl. Database Engineering and Applications Symposium, Hong Kong, China, pp. 54-63, Jul. 2003.

Y. Saygin, V. S. Verykios, and C. Clifton, "Using unknowns to prevent discovery of association rules," ACM SIGMOD Record, vol. 30, no. 4, pp. 45-54, 2001.

H. Yao and H. J. Hamilton, "Mining itemset utilities from transaction databases," Data & Knowledge Engineering, vol. 59, no. 3, pp. 603-626, 2006.

H. Yao, H. J. Hamilton, and C. J. Butz, "A foundational approach to mining itemset utilities from databases," in Proc. 4th SIAM Intl. Conf. on Data Mining, Lake Buena Vista, FL, pp. 482-486, Apr. 2004.

Thank You!

62

Background and Related Work

Support-Confidence Framework
  Each item is a binary variable denoting whether the item was purchased
  Apriori (Agrawal & Srikant, 1994) and Apriori-like algorithms (Agrawal et al., 1993; Berzal et al., 2001; Brin et al., 1997)
  Pattern-growth algorithms (Agarwal et al., 2001; Grahne & Zhu, 2003; Han et al., 2000; Han et al., 2004)

Share-Confidence Framework (Carter et al., 1997)
  The support-confidence framework does not analyze the exact number of products purchased
  The support count method does not measure the profit or cost of an itemset
  Exhaustive search algorithms
  Fast algorithms


63


Utility mining (Yao et al., 2004; Yao and Hamilton, 2006)
  A generalized form of the share-confidence framework

Privacy-Preserving in Mining Frequent Itemsets
  Classification rules (Agrawal & Aggarwal, 2001; Johnsten & Raghavan, 1999)
  Association rules (Atallah et al., 1999; Oliveira & Zaïane, 2002)