# Fast Algorithms for Mining Frequent Itemsets


Dept. of Computer Science and Information Engineering,
National Chung Cheng University

Date: May 31, 2007

## Outline

- Introduction
- Background and Related Work
- NFP-Tree Structure
- Fast Share Measure (FSM) Algorithm
- Three Efficient Algorithms
- Direct Candidate Generation (DCG) Algorithm
- Maximum Item Conflict First (MICF) Sanitization Method
- Conclusions

## Introduction

- Data mining techniques have been developed to find a small set of precious nuggets from reams of data (Cabena et al., 1998; Kantardzic, 2002)
- Mining association rules constitutes one of the most important data mining problems
- Two sub-problems (Agrawal & Srikant, 1994):
  - Identifying all frequent itemsets
  - Using these frequent itemsets to generate association rules
- The first sub-problem plays an essential role in mining association rules

## Introduction (cont'd)

- Mining frequent itemsets
- Mining share-frequent itemsets
- Mining high utility itemsets
- Hiding sensitive patterns

## Support-Confidence Framework (1/4)

Apriori algorithm (Agrawal and Srikant, 1994): minSup = 40%

[Candidate-generation example figure omitted]
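To make the slide concrete, here is a minimal Apriori sketch, assuming the six transactions from the FP-growth example that follows and minSup = 40%; the join/prune steps are the textbook formulation, not necessarily the exact variant benchmarked in the thesis.

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori: generate k-candidates from frequent (k-1)-itemsets,
    prune by the downward-closure property, then count support."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items if support(frozenset([i])) >= min_sup}
    result = set(frequent)
    k = 2
    while frequent:
        # join step: unions of frequent (k-1)-itemsets that form a k-itemset
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # prune step: every (k-1)-subset must itself be frequent
        candidates = {c for c in candidates
                      if all(frozenset(s) in result for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if support(c) >= min_sup}
        result |= frequent
        k += 1
    return result

# The six transactions from the slide (TID 001-006), minSup = 40%
db = [set("CABD"), set("CA"), set("CA"), set("CBD"), set("ABD"), set("CBD")]
freq = apriori(db, 0.40)
```

With these transactions the frequent itemsets are the four single items, {CA}, {CB}, {CD}, {BD}, and {CBD}.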

## Support-Confidence Framework (2/4)

FP-growth algorithm (Han et al., 2000; Han et al., 2004)

| TID | Frequent 1-itemsets (sorted) |
|-----|------------------------------|
| 001 | C A B D |
| 002 | C A |
| 003 | C A |
| 004 | C B D |
| 005 | A B D |
| 006 | C B D |

## Support-Confidence Framework (3/4)

[FP-tree construction figure omitted; built from the same sorted transactions as above]

## Support-Confidence Framework (4/4)

Conditional FP-tree of D; conditional FP-tree of BD

[Conditional FP-tree figures omitted]

## Share-Confidence Framework (1/4)

- Measure value: mv(ip, Tq), the measure value (e.g. purchased quantity) of item ip in transaction Tq
  - mv({D}, T01) = 1
  - mv({C}, T03) = 3
- Transaction measure value: tmv(Tq) = sum of mv(ip, Tq) over all items ip in Tq
  - tmv(T02) = 10
- Total measure value: Tmv(DB) = sum of tmv(Tq) over all transactions Tq in DB
  - Tmv(DB) = 47
- Itemset measure value: imv(X, Tq) = sum of mv(ip, Tq) over all items ip in X
  - imv({A, E}, T02) = 5
- Local measure value: lmv(X) = sum of imv(X, Tq) over all transactions Tq containing X
  - lmv({BC}) = 2 + 5 + 5 = 12

## Share-Confidence Framework (2/4)

- minShare = 30%
- Itemset share: SH(X) = lmv(X) / Tmv(DB)
  - SH({BC}) = 12/47 = 25.5%
- SH-frequent: if SH(X) >= minShare, X is a share-frequent (SH-frequent) itemset
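A short sketch of the share-measure definitions above. The database here is hypothetical, invented for illustration; the thesis' example DB with Tmv(DB) = 47 is not shown in full on these slides.

```python
# Hypothetical DB: TID -> {item: measure value (e.g. quantity purchased)}
db = {
    "T01": {"A": 1, "B": 2, "C": 1},
    "T02": {"B": 1, "C": 3},
    "T03": {"A": 2, "D": 1},
    "T04": {"B": 2, "C": 2, "D": 1},
}

def tmv(tq):
    """Transaction measure value: sum of item measure values in Tq."""
    return sum(db[tq].values())

def Tmv():
    """Total measure value of the database."""
    return sum(tmv(t) for t in db)

def imv(X, tq):
    """Itemset measure value of X in Tq."""
    return sum(db[tq][i] for i in X)

def lmv(X):
    """Local measure value: imv(X, Tq) summed over transactions containing X."""
    return sum(imv(X, t) for t in db if X <= set(db[t]))

def SH(X):
    """Itemset share: lmv(X) / Tmv(DB)."""
    return lmv(X) / Tmv()

min_share = 0.30
# For this toy DB: lmv({B,C}) = (2+1) + (1+3) + (2+2) = 11, Tmv = 16,
# so SH({B,C}) = 11/16, which exceeds minShare = 30%.
```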

## Share-Confidence Framework (3/4)

- ZP (Zero Pruning) and ZSP (Zero Subset Pruning) (Barber & Hamilton, 2003)
  - Variants of exhaustive search
  - Prune the candidate itemsets whose local measure values are exactly zero
- SIP (Share Infrequent Pruning) (Barber & Hamilton, 2003)
  - Like Apriori, but with errors
- The three algorithms are either inefficient or do not discover the complete set of share-frequent (SH-frequent) itemsets

## Share-Confidence Framework (4/4)

ZSP Algorithm; SIP Algorithm

[Pseudocode figures omitted]

## Utility Mining (1/2)

- Internal utility: iu(ip, Tq), e.g. the quantity of item ip in transaction Tq
  - iu({D}, T01) = 1
  - iu({C}, T03) = 3
- External utility: eu(ip), e.g. the unit profit of item ip
  - eu({D}) = 3
  - eu({C}) = 1
- Utility value in a transaction: util(ip, Tq) = iu(ip, Tq) × eu(ip)
  - util({C, E, F}, T02) = util(C, T02) + util(E, T02) + util(F, T02) = 3×1 + 1×5 + 2×2 = 12
- Local utility: Lutil(X) = sum of util(X, Tq) over all transactions Tq containing X
  - Lutil({C, D}) = util({C, D}, T01) + util({C, D}, T04) + util({C, D}, T06) = 4 + 7 + 5 = 16
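The utility definitions can be sketched the same way. The external utilities below match the slide's worked example (eu(C) = 1, eu(E) = 5, eu(F) = 2, eu(D) = 3), but the transactions are hypothetical stand-ins, not the thesis' example DB.

```python
# External utilities (unit profits) from the slide; transactions are hypothetical.
eu = {"C": 1, "D": 3, "E": 5, "F": 2}
db = {  # TID -> {item: internal utility (quantity)}
    "T01": {"C": 1, "D": 1},
    "T02": {"C": 3, "E": 1, "F": 2},
    "T03": {"D": 2, "F": 1},
}

def util(X, tq):
    """Utility of itemset X in transaction Tq: sum of iu * eu per item."""
    return sum(db[tq][i] * eu[i] for i in X)

def Lutil(X):
    """Local utility: util(X, Tq) summed over transactions containing X."""
    return sum(util(X, t) for t in db if X <= set(db[t]))

# Reproduces the slide's per-transaction example:
# util({C,E,F}, T02) = 3*1 + 1*5 + 2*2 = 12
```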

## Utility Mining (2/2)

- Total utility: Tutil(DB) = sum of util(ip, Tq) over all items and transactions in DB
  - Tutil(DB) = 122
- The utility value of X in DB: UTIL(X) = Lutil(X) / Tutil(DB)
  - UTIL({C, D}) = 16/122 = 13.1%
- High utility itemset: if UTIL(X) >= minUtil, X is a high utility itemset

## Privacy-Preserving in Mining Frequent Itemsets

- An NP-hard problem (Atallah et al., 1999)
- DB: database; DB′: released (sanitized) database
- RI: the set of restrictive itemsets
- ~RI: the set of non-restrictive itemsets
- Misses cost: the fraction of non-restrictive itemsets hidden as a side effect of sanitization
- Sanitization algorithms (Oliveira and Zaïane, 2002; Oliveira and Zaïane, 2003; Saygin et al., 2001)

## NFP-Tree (1/4)

- NFP-growth Algorithm
  - NFP-tree construction

## NFP-Tree (2/4)

| TID | Frequent 1-itemsets (sorted) |
|-----|------------------------------|
| 001 | C A B D |
| 002 | C A |
| 003 | C A |
| 004 | C B D |
| 005 | A B D |
| 006 | C B D |

## NFP-Tree (3/4)

[NFP-tree construction figure omitted; built from the same sorted transactions as above]

## NFP-Tree (4/4)

Conditional NFP-tree of D (3,4)

[Conditional NFP-tree figure omitted]

## Experimental Results (1/3)

- PC: Pentium IV 1.5 GHz, 1 GB SDRAM, running Windows 2000 Professional
- All algorithms were coded in VC++ 6.0
- Datasets:
  - Real: BMS-Web View-1, BMS-Web View-2, Connect 4
  - Artificial: generated by the IBM synthetic data generator

| Parameter | Meaning |
|-----------|---------|
| \|D\| | Number of transactions in DB |
| \|T\| | Mean size of the transactions |
| \|I\| | Mean size of the maximal potentially frequent itemsets |
| \|L\| | Number of maximal potentially frequent itemsets |
| N | Number of items |

## Experimental Results (2/3)

[Runtime comparison figure omitted]

## Experimental Results (3/3)

[Runtime comparison figure omitted]

## Fast Share Measure (FSM) Algorithm

- FSM: Fast Share Measure algorithm
- ML: maximum transaction length in DB
- MV: maximum measure value in DB
- min_lmv = minShare × Tmv(DB)
- Level Closure Property: given a minShare and a k-itemset X
  - Theorem 1. If lmv(X) + (lmv(X)/k) × MV < min_lmv, all supersets of X with length k+1 are infrequent
  - Theorem 2. If lmv(X) + (lmv(X)/k) × MV × k′ < min_lmv, all supersets of X with length k+k′ are infrequent
  - Corollary 1. If lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv, all supersets of X are infrequent

- minShare = 30%
- Let CF(X) = lmv(X) + (lmv(X)/k) × MV × (ML - k)
- Prune X if CF(X) < min_lmv
- CF({ABC}) = 3 + (3/3) × 3 × (6 - 3) = 12 < 14.1 = min_lmv
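FSM's pruning test reduces to one arithmetic check. A minimal sketch reproducing the slide's numbers (lmv({ABC}) = 3, k = 3, MV = 3, ML = 6, min_lmv = 14.1):

```python
def cf(lmv_x, k, mv_max, ml):
    """FSM's critical function CF(X) = lmv(X) + (lmv(X)/k) * MV * (ML - k)."""
    return lmv_x + (lmv_x / k) * mv_max * (ml - k)

def fsm_prunable(lmv_x, k, mv_max, ml, min_lmv):
    """X and all its supersets can be discarded when CF(X) < min_lmv."""
    return cf(lmv_x, k, mv_max, ml) < min_lmv

# Slide example: CF({ABC}) = 3 + (3/3)*3*(6-3) = 12 < 14.1 = min_lmv
```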

## Experimental Results (1/2)

T4.I2.D100k.N50.S10, minShare = 0.8%, ML = 14

| Pass (k) |     | ZSP    | FSM(1) | FSM(2) | FSM(3) | FSM(ML-1) |
|----------|-----|--------|--------|--------|--------|-----------|
| k=1      | Ck  | 50     | 50     | 50     | 50     | 50        |
|          | RCk | 50     | 49     | 49     | 49     | 50        |
|          | Fk  | 32     | 32     | 32     | 32     | 32        |
| k=2      | Ck  | 1225   | 1176   | 1176   | 1176   | 1225      |
|          | RCk | 1219   | 570    | 754    | 845    | 1085      |
|          | Fk  | 119    | 119    | 119    | 119    | 119       |
| k=3      | Ck  | 19327  | 4256   | 7062   | 8865   | 14886     |
|          | RCk | 17217  | 868    | 1685   | 2410   | 5951      |
|          | Fk  | 65     | 65     | 65     | 65     | 65        |
| k=4      | Ck  | 165077 | 1725   | 3233   | 5568   | 24243     |
|          | RCk | 107397 | 232    | 644    | 1236   | 6117      |
|          | Fk  | 9      | 9      | 9      | 9      | 9         |
| k=5      | Ck  | 406374 | 81     | 258    | 717    | 6309      |
|          | RCk | 266776 | 5      | 40     | 109    | 1199      |
|          | Fk  | 0      | 0      | 0      | 0      | 0         |
| k=6      | Ck  | 369341 | 0      | 1      | 4      | 287       |
|          | RCk | 310096 | 0      | 0      | 0      | 37        |
|          | Fk  | 0      | 0      | 0      | 0      | 0         |
| k>=7     | Ck  | 365975 | 0      | 0      | 0      | 0         |
|          | RCk | 359471 | 0      | 0      | 0      | 0         |
|          | Fk  | 0      | 0      | 0      | 0      | 0         |
| Time (sec) | | 10349.9 | 2.30  | 2.98   | 3.31   | 11.24     |

## Experimental Results (2/2)

[Runtime comparison figure omitted]

## Three Efficient Algorithms

- EFSM (Enhanced FSM): instead of joining two arbitrary itemsets in RC_{k-1}, EFSM joins an arbitrary itemset of RC_{k-1} with a single item in RC_1 to generate C_k efficiently
- This reduces the time complexity of candidate generation from O(n^(2k-2)) to O(n^k)
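The EFSM join can be sketched in a few lines, assuming candidates are represented as frozensets; this illustrates the idea only, not the thesis' implementation.

```python
def efsm_candidates(rc_prev, rc1):
    """EFSM-style join: extend each remaining (k-1)-candidate with a single
    item from RC1, rather than joining arbitrary pairs of (k-1)-itemsets."""
    singles = {next(iter(s)) for s in rc1}  # the items in the 1-itemsets of RC1
    return {x | {i} for x in rc_prev for i in singles if i not in x}

# Tiny example: surviving 1- and 2-candidates produce the 3-candidates.
rc1 = {frozenset("A"), frozenset("B"), frozenset("C")}
rc2 = {frozenset("AB"), frozenset("BC")}
c3 = efsm_candidates(rc2, rc1)  # {A,B}+C and {B,C}+A both yield {A,B,C}
```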

- X_{k+1}: an arbitrary superset of X with length k+1 in DB
- S(X_{k+1}): the set which contains all X_{k+1} in DB
- dbS(X_{k+1}): the set of transactions each of which contains at least one X_{k+1}
- SuFSM and ShFSM are derived from EFSM and prune the candidates more efficiently than FSM
- SuFSM (Support-counted FSM):
  - Theorem 3. If lmv(X) + Sup(S(X_{k+1})) × MV × (ML - k) < min_lmv, all supersets of X are infrequent

## SuFSM (Support-counted FSM)

- lmv(X)/k >= Sup(X) >= Sup(S(X_{k+1}))
  - Ex. lmv({BCD})/k = 15/3 = 5, Sup({BCD}) = 3, Sup(S({BCD}_{k+1})) = 2
- If no superset of X is an SH-frequent itemset, then the following inequalities hold:
  - lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv
  - lmv(X) + Sup(X) × MV × (ML - k) < min_lmv
  - lmv(X) + Sup(S(X_{k+1})) × MV × (ML - k) < min_lmv
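Theorem 3's check is a one-liner. The sketch below uses the slide's lmv({BCD}) = 15, k = 3, and Sup(S({BCD}_{k+1})) = 2; MV = 3, ML = 6, and min_lmv = 34 are hypothetical values chosen only to show that SuFSM's bound can prune where FSM's looser bound cannot.

```python
def sufsm_prunable(lmv_x, sup_s_next, mv_max, ml, k, min_lmv):
    """Theorem 3: lmv(X) + Sup(S(X_{k+1})) * MV * (ML - k) < min_lmv
    implies every superset of X is SH-infrequent."""
    return lmv_x + sup_s_next * mv_max * (ml - k) < min_lmv

# SuFSM bound: 15 + 2*3*(6-3) = 33 < 34  -> prunable.
# FSM's bound uses lmv(X)/k = 5 instead of Sup(S) = 2:
# 15 + 5*3*(6-3) = 60, which is not < 34 -> FSM keeps the candidate.
```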

## ShFSM (Share-counted FSM)

- ShFSM (Share-counted FSM):
  - Theorem 4. If Tmv(dbS(X_{k+1})) < min_lmv, all supersets of X are infrequent
- The three pruning conditions compared:
  - FSM: lmv(X) + (lmv(X)/k) × MV × (ML - k) < min_lmv
  - SuFSM: lmv(X) + Sup(S(X_{k+1})) × MV × (ML - k) < min_lmv
  - ShFSM: Tmv(dbS(X_{k+1})) < min_lmv

## ShFSM (Share-counted FSM) (cont'd)

Ex. X = {AB}: Tmv(dbS(X_{k+1})) = tmv(T01) + tmv(T05) = 6 + 6 = 12 < 14 = min_lmv
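Theorem 4's check can be sketched on a hypothetical database arranged so the numbers mirror the slide (two qualifying transactions with tmv = 6 each, and min_lmv = 14); the slide's actual DB is not shown in full.

```python
# Hypothetical DB: TID -> {item: measure value}
db = {
    "T01": {"A": 2, "B": 1, "C": 3},
    "T02": {"C": 4},
    "T03": {"A": 1, "B": 1, "C": 2, "D": 2},
}

def tmv(t):
    return sum(db[t].values())

def db_S(X):
    """Transactions containing some (k+1)-superset of X, i.e. X plus at
    least one extra item; only these can contribute to a superset's lmv."""
    return [t for t in db if X < set(db[t])]  # strict subset test

def shfsm_prunable(X, min_lmv):
    """Theorem 4: Tmv(dbS(X_{k+1})) < min_lmv => all supersets infrequent."""
    return sum(tmv(t) for t in db_S(X)) < min_lmv

# For X = {A,B}: T01 and T03 contain a proper superset, tmv = 6 + 6 = 12 < 14.
```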

## Experimental Results (1/3)

minShare = 0.3% [runtime figure omitted]

## Experimental Results (2/3)

minShare = 0.3% [runtime figure omitted]

## Experimental Results (3/3)

T6.I4.D100k.N200.S10, minShare = 0.1%, ML = 20

| Pass (k) |     | FSM     | EFSM    | SuFSM  | ShFSM  | Fk   |
|----------|-----|---------|---------|--------|--------|------|
| k=1      | Ck  | 200     | 200     | 200    | 200    | 159  |
|          | RCk | 200     | 200     | 199    | 197    |      |
| k=2      | Ck  | 19900   | 19900   | 19701  | 19306  | 1844 |
|          | RCk | 16214   | 16214   | 13312  | 7199   |      |
| k=3      | Ck  | 829547  | 829547  | 564324 | 190607 | 101  |
|          | RCk | 251877  | 251877  | 99765  | 9792   |      |
| k=4      | Ck  | 3290296 | 3290296 | 793042 | 20913  | 0    |
|          | RCk | 332877  | 332877  | 41057  | 1420   |      |
| k=5      | Ck  | 393833  | 393833  | 25003  | 1050   | 5    |
|          | RCk | 71420   | 71420   | 19720  | 959    |      |
| k=6      | Ck  | 26137   | 26137   | 11582  | 518    | 8    |
|          | RCk | 25562   | 25562   | 11045  | 506    |      |
| k=7      | Ck  | 11141   | 11141   | 5940   | 204    | 7    |
|          | RCk | 11099   | 11099   | 5827   | 196    |      |
| k=8      | Ck  | 4426    | 4426    | 2797   | 58     | 1    |
|          | RCk | 4423    | 4423    | 2750   | 54     |      |
| k>=9     | Ck  | 2036    | 2036    | 1567   | 12     | 0    |
|          | RCk | 2030    | 2030    | 1513   | 10     |      |
| Time (sec) | | 13610.4 | 71.55   | 29.67  | 10.95  |      |

## Direct Candidate Generation (DCG) Algorithm

[Algorithm pseudocode omitted]

## Experimental Results (1/3)

[Figure omitted]

## Experimental Results (2/3)

[Figure omitted]

## Experimental Results (3/3)

[Figure omitted]

## Isolated Items Discarding Strategy (IIDS) for Utility Mining

## IIDS (1/2)

ShFSM, minUtil = 30% [example figure omitted]

## IIDS (2/2)

FUM, minUtil = 30% [example figure omitted]

## Experimental Results (1/5)

[Figure omitted]

## Experimental Results (2/5)

[Figure omitted]

## Experimental Results (3/5)

[Figure omitted]

## Experimental Results (4/5)

[Figure omitted]

## Experimental Results (5/5)

minUtil = 0.12% [figures omitted]

## Maximum Item Conflict First (MICF) Sanitization Method

- Tdegree(Tq): the degree of conflict of a sensitive transaction Tq is the number of restrictive itemsets which are included in Tq
- If Tdegree(Tq) > 1, Tq is a conflicting transaction

- Idegree({D}, {D, F}, T05) = 1
- Idegree({F}, {D, F}, T05) = 0
- MaxIdegree: stores the maximum value of the conflict degree among the items in a transaction
- MICF: selects an item with MaxIdegree to delete in each iteration

| TID | Transaction | Tdegree(Tq) |
|-----|-------------|-------------|
| T05 | {B, D, F, H} | 2 |
| T06 | {A, B, D, F, H} | 3 |
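The degree computations can be sketched as follows. The restrictive itemsets RI here are hypothetical, chosen only so that the slide's Tdegree and Idegree values come out; Idegree is interpreted as the number of other restrictive itemsets in the transaction that also contain the item, which is consistent with the slide's numbers but is an assumption.

```python
# Hypothetical restrictive itemsets (the thesis' actual RI set is not shown).
RI = [{"D", "F"}, {"D", "H"}, {"A", "B"}]

def tdegree(tq):
    """Tdegree(Tq): number of restrictive itemsets included in Tq."""
    return sum(1 for r in RI if r <= tq)

def idegree(item, x, tq):
    """Assumed reading of Idegree(item, X, Tq): how many restrictive itemsets
    other than X are contained in Tq and also contain `item`."""
    return sum(1 for r in RI if r != x and r <= tq and item in r)

T05 = {"B", "D", "F", "H"}
T06 = {"A", "B", "D", "F", "H"}
# tdegree(T05) = 2 and tdegree(T06) = 3, matching the slide's table;
# deleting D from {D, F} also damages {D, H}, hence Idegree(D) > Idegree(F).
```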

- Idegree({D}, {D, F}, T06) = 1
- Idegree({F}, {D, F}, T06) = 0

| TID | Transaction | Tdegree(Tq) |
|-----|-------------|-------------|
| T06 | {A, B, D, F, H} | 3 |

## Experimental Results (1/5)

[Figure omitted]

## Experimental Results (2/5)

|RI| = 200, minSup = 0.04%; |RI| = 50, minSup = 0.1% [figures omitted]

## Experimental Results (3/5)

|RI| = 200, minSup = 0.064%; |RI| = 200, minSup = 0.024% [figures omitted]

## Experimental Results (4/5)

minSup = 0.004%; minSup = 0.1%; minSup = 0.064%; minSup = 0.024% [figures omitted]

## Experimental Results (5/5)

[Figure omitted]

## Conclusions

- Support measure
  - NFP-growth is presented for mining frequent itemsets
  - Uses two counters per tree node to reduce the number of tree nodes
  - Applies a smaller tree and header table to discover frequent itemsets efficiently
- Share measure
  - The proposed algorithms efficiently decrease the number of candidates to be counted
  - ShFSM and DCG perform the best
- Utility mining
  - IIDS is proposed to ignore isolated items in the process of candidate generation
  - FUM and DCG+ were better than ShFSM and DCG, respectively
- Hiding sensitive patterns
  - The MICF algorithm is proposed to reduce the impact on the source database
  - MICF decreases the support of the maximum number of restrictive itemsets
  - Outperforms all other algorithms on misses cost in several datasets for most cases
  - MICF has the lowest sanitization rate

## Future Work

- Apply a constraint relaxation algorithm or develop superior data structures to discover frequent itemsets
- Develop superior algorithms to accelerate identifying all or long SH-frequent itemsets
- Extend the application scope of IIDS to some classification models
- Develop superior algorithms to further reduce the misses cost without hiding failure, to protect sensitive data
- Apply data mining techniques to image processing, for instance, to improve the interpolated color filter array image

## References

- D. Agrawal and C. Aggarwal, "On the design and quantification of privacy preserving data mining algorithms," in Proc. 20th ACM Symposium on Principles of Database Systems, Santa Barbara, CA, pp. 247-255, May 2001.
- R. C. Agarwal, C. C. Aggarwal, and V. V. V. Prasad, "A tree projection algorithm for generation of frequent itemsets," Journal of Parallel and Distributed Computing, vol. 61, no. 3, pp. 350-361, 2001.
- R. Agrawal, T. Imielinski, and A. Swami, "Mining association rules between sets of items in large databases," in Proc. 1993 ACM SIGMOD Intl. Conf. on Management of Data, Washington, D.C., pp. 207-216, May 1993.
- R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," in Proc. 20th Intl. Conf. on Very Large Data Bases, Santiago, Chile, pp. 487-499, Sep. 1994.
- M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim, and V. Verykios, "Disclosure limitation of sensitive rules," in Proc. 1999 Workshop on Knowledge and Data Engineering Exchange, Chicago, IL, pp. 45-52, Nov. 1999.
- B. Barber and H. J. Hamilton, "Parametric algorithm for mining share frequent itemsets," Journal of Intelligent Information Systems, vol. 16, no. 3, pp. 277-293, 2001.
- F. Berzal, J. C. Cubero, N. Marín, and J. M. Serrano, "TBAR: An efficient method for association rule mining in relational databases," Data & Knowledge Engineering, vol. 37, no. 1, pp. 47-64, 2001.
- S. Brin, R. Motwani, J. D. Ullman, and S. Tsur, "Dynamic itemset counting and implication rules for market basket data," in Proc. 1997 ACM SIGMOD Intl. Conf. on Management of Data, Tucson, AZ, pp. 255-264, May 1997.
- P. Cabena, P. Hadjinian, R. Stadler, J. Verhees, and A. Zanasi, Discovering Data Mining: From Concept to Implementation, Prentice Hall PTR, New Jersey, 1998.
- C. L. Carter, H. J. Hamilton, and N. Cercone, "Share based measures for itemsets," Lecture Notes in Computer Science 1263 (1st European Conf. on the Principles of Data Mining and Knowledge Discovery), H. J. Komorowski and J. M. Zytkow (eds.), Springer-Verlag, Berlin, pp. 14-24, 1997.
- G. Grahne and J. Zhu, "Efficiently using prefix-trees in mining frequent itemsets," in Proc. IEEE ICDM Workshop on Frequent Itemset Mining Implementations, Melbourne, FL, Nov. 2003.
- J. Han, J. Pei, and Y. Yin, "Mining frequent patterns without candidate generation," in Proc. 2000 ACM SIGMOD Intl. Conf. on Management of Data, Dallas, TX, pp. 1-12, May 2000.
- J. Han, J. Pei, Y. Yin, and R. Mao, "Mining frequent patterns without candidate generation: A frequent-pattern tree approach," Data Mining and Knowledge Discovery, vol. 8, no. 1, pp. 53-87, 2004.
- T. Johnsten and V. V. Raghavan, "Impact of decision-region based classification mining algorithms on database security," in Proc. IFIP WG 11.3 13th Intl. Conf. on Database Security, Seattle, WA, pp. 177-191, Jul. 1999.
- M. Kantardzic, Data Mining: Concepts, Models, Methods, and Algorithms, John Wiley & Sons, Inc., New York, 2002.
- S. R. M. Oliveira and O. R. Zaïane, "Privacy preserving frequent itemset mining," in Proc. IEEE ICDM Workshop on Privacy, Security and Data Mining, Maebashi City, Japan, pp. 43-54, Dec. 2002.
- S. R. M. Oliveira and O. R. Zaïane, "Algorithms for balancing privacy and knowledge discovery in association rule mining," in Proc. 7th Intl. Database Engineering and Applications Symposium, Hong Kong, China, pp. 54-63, Jul. 2003.
- Y. Saygin, V. S. Verykios, and C. Clifton, "Using unknowns to prevent discovery of association rules," ACM SIGMOD Record, vol. 30, no. 4, pp. 45-54, 2001.
- H. Yao and H. J. Hamilton, "Mining itemset utilities from transaction databases," Data & Knowledge Engineering, vol. 59, no. 3, pp. 603-626, 2006.
- H. Yao, H. J. Hamilton, and C. J. Butz, "A foundational approach to mining itemset utilities from databases," in Proc. 4th SIAM Intl. Conf. on Data Mining, Lake Buena Vista, FL, pp. 482-486, Apr. 2004.

Thank You!

## Background and Related Work

- Support-Confidence Framework
  - Each item is a binary variable denoting whether the item was purchased
  - Apriori (Agrawal & Srikant, 1994) and Apriori-like algorithms (Agrawal et al., 1993; Berzal et al., 2001; Brin et al., 1997)
  - Pattern-growth algorithms (Agarwal et al., 2001; Grahne & Zhu, 2003; Han et al., 2000; Han et al., 2004)
- Share-Confidence Framework (Carter et al., 1997)
  - The support-confidence framework does not analyze the exact number of products purchased
  - The support count method does not measure the profit or cost of an itemset
  - Exhaustive search algorithms
  - Fast algorithms
- Utility mining (Yao et al., 2004; Yao and Hamilton, 2006)
  - A generalized form of the share-confidence framework
- Privacy-Preserving in Mining Frequent Itemsets
  - Classification rules (Agrawal & Aggarwal, 2001; Johnsten & Raghavan, 1999)
  - Association rules (Atallah et al., 1999; Oliveira & Zaïane, 2002)