# Association Rule and Sequential Pattern Mining for Episode Extraction

Jonathan Yip

Nov 30, 2013

## Introduction to Association Rules

Association rule mining associates multiple objects or events with one another.

Example: a customer who buys a laptop also buys a wireless LAN card, a 2-itemset:

Laptop → Wireless LAN Card

## Association Rules (cont'd)

Measures of rule interestingness:

Support = P(Laptop ∪ LAN card)
- The probability that all of the studied items occur together.

Confidence = P(LAN card | Laptop) = P(Laptop ∪ LAN card) / P(Laptop)
- The conditional probability that a customer who bought a laptop also bought a wireless LAN card.

Thresholds:
- Minimum Support: 25%
- Minimum Confidence: 30%

Laptop → Wireless LAN Card [Support = 40%, Confidence = 60%]

## Association Rules (example)

| TID | Items |
|-----|-------|
| 1 | Coke, Milk |
| 2 | Milk, Eggs, Chips |
| 3 | Coke, Eggs, Milk |
| 4 | Coke |
| 5 | Coke, Eggs, Milk |

Min_Sup = 25%, Min_Conf = 25%

Rule: Milk → Eggs

Support:
P(Milk ∪ Eggs) = 3/5 = 60%

Confidence:
P(Eggs | Milk) = P(Milk ∪ Eggs) / P(Milk)
P(Milk) = 4/5 = 80%
P(Eggs | Milk) = 60% / 80% = 75%

(75% confidence that a customer who buys Milk also buys Eggs.)
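A minimal Python sketch (not from the slides; the helper names are mine, and transactions 1–2 are deduced from the item-support counts used later in the deck) that reproduces the support and confidence figures above:

```python
# Transactions from the example; TIDs 1-2 are deduced from the
# item-support counts used later in the deck.
transactions = [
    {"Coke", "Milk"},
    {"Milk", "Eggs", "Chips"},
    {"Coke", "Eggs", "Milk"},
    {"Coke"},
    {"Coke", "Eggs", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """P(consequent | antecedent) = support(both) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"Milk", "Eggs"}))       # 0.6  -> 60% support
print(confidence({"Milk"}, {"Eggs"}))  # 0.75 -> 75% confidence
```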

## Types of Association

- Boolean vs. quantitative
- Single dimension vs. multiple dimension
- Single level vs. multiple level analysis

Examples:

1. Gender(X, "Male") ^ Income(X, ">50K") ^ Age(X, "35…50") (multi-dimensional)
2. Income(X, ">50K") (single-dimensional)

## Association Rules (DBMiner)

(Demo slide: association-rule mining in the DBMiner tool.)

## Apriori Algorithm

Purpose:
- To mine frequent itemsets for Boolean association rules
- Uses prior knowledge of frequent itemsets to derive larger ones
- An itemset has to be frequent (support > min_sup)

Anti-monotone property:
- If a set cannot pass the min_sup test, all of its supersets will fail the test as well

## Apriori Algorithm Pseudo-code

Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

    L1 = {frequent items};
    for (k = 1; Lk != ∅; k++) do begin
        Ck+1 = candidates generated from Lk;
        for each transaction t in database do
            increment the count of all candidates in Ck+1
            that are contained in t
        Lk+1 = candidates in Ck+1 with min_support
    end
    return ∪k Lk;
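The pseudo-code above can be turned into a short runnable sketch (the function name and the join-based candidate generation are my own rendering; production implementations add subset-based pruning for efficiency):

```python
def apriori(transactions, min_count):
    """Return every frequent itemset (as a frozenset) with its support count."""
    items = {item for t in transactions for item in t}
    # L1: frequent 1-itemsets
    counts = {frozenset([i]): sum(i in t for t in transactions) for i in items}
    Lk = {s for s, c in counts.items() if c >= min_count}
    frequent = {s: counts[s] for s in Lk}
    k = 1
    while Lk:  # loop until Lk is empty, as in the pseudo-code
        # C(k+1): join Lk with itself -- unions that have exactly k+1 items
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Count the candidates contained in each transaction
        counts = {c: sum(c <= t for t in transactions) for c in Ck}
        Lk = {c for c, n in counts.items() if n >= min_count}
        frequent.update({c: counts[c] for c in Lk})
        k += 1
    return frequent

transactions = [
    {"Coke", "Milk"}, {"Milk", "Eggs", "Chips"},
    {"Coke", "Eggs", "Milk"}, {"Coke"}, {"Coke", "Eggs", "Milk"},
]
for itemset, count in apriori(transactions, 2).items():
    print(sorted(itemset), count)
```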

## Apriori Algorithm Procedures

Example revisited: 5 transactions (Min_Sup = 25%, so the minimum support count is 2; Min_Conf = 25%):

| TID | Items |
|-----|-------|
| 1 | Coke, Milk |
| 2 | Milk, Eggs, Chips |
| 3 | Coke, Eggs, Milk |
| 4 | Coke |
| 5 | Coke, Eggs, Milk |

Step 1: Scan and find the support of each item (C1):

| Item | Support |
|------|---------|
| Coke | 4 |
| Milk | 4 |
| Chips | 1 (fail) |
| Eggs | 3 |

Step 2: Compare with Min_Sup and eliminate (prune) items with support < Min_Sup (L1):

| Item | Support |
|------|---------|
| Coke | 4 |
| Milk | 4 |
| Eggs | 3 |

## Apriori Algorithm (cont'd)

Step 3: Join (L1 ⋈ L1) to form C2, then count supports:

- Coke & Milk: 3/5 = 60%
- Coke & Eggs: 2/5 = 40%
- Milk & Eggs: 3/5 = 60%

| Itemset | Support |
|---------|---------|
| Coke & Milk | 3 |
| Coke & Eggs | 2 |
| Milk & Eggs | 3 |

Repeated step: compare with Min_Sup and eliminate (prune) itemsets with support < Min_Sup (all three 2-itemsets meet the minimum count of 2), then join L2 ⋈ L2 to form C3:

| Itemset | Support |
|---------|---------|
| Coke & Milk & Eggs | 2 |

Conclusion:

Coke, Milk & Eggs have a strong correlation.
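The Step 3 pair counts can be double-checked by brute force (a sketch; the transaction list, including the TIDs 1–2 deduced from the C1 counts, matches the running example):

```python
from itertools import combinations

transactions = [
    {"Coke", "Milk"}, {"Milk", "Eggs", "Chips"},
    {"Coke", "Eggs", "Milk"}, {"Coke"}, {"Coke", "Eggs", "Milk"},
]
L1 = ["Coke", "Milk", "Eggs"]  # frequent items surviving Step 2

# Join L1 with itself and count each candidate pair's support
for pair in combinations(L1, 2):
    count = sum(set(pair) <= t for t in transactions)
    print(pair, count)
```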

## Sequential Pattern Mining

Introduction:
- Mining of frequently occurring patterns related to time or other sequences

Example:
- 70% of customers rent "Star Wars", then "The Empire Strikes Back", and then "Return of the Jedi"

  Star Wars → The Empire Strikes Back → Return of the Jedi

Applications:
- Intrusion detection on computers
- Web access patterns
- Predicting disease from a sequence of symptoms
- Many other areas

## Sequential Pattern Mining (cont'd)

Steps:
- Sort Phase: sort the database by Cust_ID (major key) and transaction time (minor key)
- Litemset Phase: find the large (frequent) itemsets
- Transform Phase: replace each transaction by the large itemsets it contains, eliminating items with support < min_sup
- Sequence Phase: find the desired sequences
- Maximal Phase: find the maximal sequences among the set of large sequences

## Sequential Pattern Mining (cont'd)

Example: database sorted by Cust_ID and transaction time (Min_Sup = 25%):

| Cust ID | Trans. Time | Items Bought |
|---------|-------------|--------------|
| 1 | June 25 '02 | 3 |
| 1 | June 30 '02 | 9 |
| 2 | June 10 '02 | 1, 2 |
| 2 | June 15 '02 | 3 |
| 2 | June 20 '02 | 4, 6, 7 |
| 3 | June 25 '02 | 3, 5, 7 |
| 4 | June 25 '02 | 3 |
| 4 | June 30 '02 | 4, 7 |
| 4 | July 25 '02 | 9 |
| 5 | June 12 '02 | 9 |

Organized into one sequence per Cust_ID:

| Cust ID | Original Sequence |
|---------|-------------------|
| 1 | {(3) (9)} |
| 2 | {(1,2) (3) (4,6,7)} |
| 3 | {(3,5,7)} |
| 4 | {(3) (4,7) (9)} |
| 5 | {(9)} |
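The sort phase above can be sketched in a few lines of Python (variable names are mine; the rows match the example table):

```python
from collections import defaultdict
from datetime import date

# Sort-phase sketch: group the transaction table by customer and order each
# customer's transactions by time.
rows = [
    (1, date(2002, 6, 25), (3,)),
    (1, date(2002, 6, 30), (9,)),
    (2, date(2002, 6, 10), (1, 2)),
    (2, date(2002, 6, 15), (3,)),
    (2, date(2002, 6, 20), (4, 6, 7)),
    (3, date(2002, 6, 25), (3, 5, 7)),
    (4, date(2002, 6, 25), (3,)),
    (4, date(2002, 6, 30), (4, 7)),
    (4, date(2002, 7, 25), (9,)),
    (5, date(2002, 6, 12), (9,)),
]

sequences = defaultdict(list)
for cust, _, items in sorted(rows):  # sort key: (Cust_ID, transaction time)
    sequences[cust].append(items)

for cust, seq in sequences.items():
    print(cust, seq)  # e.g. customer 4 -> [(3,), (4, 7), (9,)]
```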

## Sequential Pattern Mining (cont'd)

Step 1: Sort (examples for several customers):

| Cust ID | Original Sequence | Items to Study | Support Count |
|---------|-------------------|----------------|---------------|
| 1 | {(3) (9)} | {(3)} {(9)} {(3,9)} | 3, 3, 2 |
| 5 | {(9)} | {(9)} | 1 |

Conclusion: sequences with support > 25% Min_Sup: {(3) (9)} and {(3) (4,7)}

## Sequential Pattern Mining (cont'd)

Step 2: Litemset phase. The large itemsets are mapped to integers:

| Litemset | Mapped To |
|----------|-----------|
| (3) | 1 |
| (4) | 2 |
| (7) | 3 |
| (4, 7) | 4 |
| (9) | 5 |

Transform phase: each customer's data sequence is rewritten in terms of the large itemsets it contains, then mapped:

| Cust ID | Original Sequence | Transformed Sequence | After Mapping |
|---------|-------------------|----------------------|---------------|
| 1 | {(3) (9)} | {(3)} {(9)} | {1} {5} |
| 2 | {(1,2) (3) (4,6,7)} | {(3)} {(4) (7) (4,7)} | {1} {2 3 4} |
| 3 | {(3,5,7)} | {(3) (7)} | {1 3} |
| 4 | {(3) (4,7) (9)} | {(3)} {(4) (7) (4,7)} {(9)} | {1} {2 3 4} {5} |
| 5 | {(9)} | {(9)} | {5} |

Items with support < min_sup (e.g., 1, 2, 5, 6) are dropped during the transformation. Sequences with support > 25%: {(3) (9)} and {(3) (4,7)}.

The rightmost column gives each customer's sequence after the mapping.

## Sequential Pattern Mining Algorithms

AprioriAll:
- Counts all large sequences, including those that are not maximal

Pseudo-code:

Ck: candidate sequences of size k
Lk: frequent (large) sequences of size k

    L1 = {large 1-sequences};   // result of litemset phase
    for (k = 2; Lk-1 != ∅; k++) do begin
        Ck = candidates generated from Lk-1;
        for each customer sequence c in database do
            increment the count of all candidates in Ck
            that are contained in c
        Lk = candidates in Ck with min_support
    end
    return ∪k Lk;

AprioriSome:
- Generates every candidate sequence in a forward phase, but skips counting some large sequences that are not maximal; the remaining large sequences are counted in a backward phase.
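Both algorithms hinge on testing whether a candidate sequence is contained in a customer sequence: each element of the candidate must be a subset of a strictly later transaction than the previous element. A sketch of that test (function name mine):

```python
def contains(candidate, customer_seq):
    """True if `candidate` occurs in `customer_seq`: each element of the
    candidate is a subset of a strictly later transaction than the last."""
    pos = 0
    for elem in candidate:
        while pos < len(customer_seq) and not set(elem) <= set(customer_seq[pos]):
            pos += 1
        if pos == len(customer_seq):
            return False
        pos += 1  # the next element must match strictly later
    return True

# Customer sequences from the example
db = [
    [(3,), (9,)],
    [(1, 2), (3,), (4, 6, 7)],
    [(3, 5, 7)],
    [(3,), (4, 7), (9,)],
    [(9,)],
]
print(sum(contains([(3,), (9,)], s) for s in db))    # supported by 2 customers
print(sum(contains([(3,), (4, 7)], s) for s in db))  # supported by 2 customers
```

Greedily matching each element to the earliest possible transaction is safe here: any match for the remaining elements after a later position is also valid after an earlier one.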

## Episode Extraction

An episode is a partially ordered collection of events occurring together.

Goal: to analyze sequences of events and discover recurrent episodes, first finding small frequent episodes and then progressively looking for larger ones.

Types of episodes:
- Serial: E occurs before F (E → F)
- Parallel: no constraint on the relative order of A and B
- Non-serial/non-parallel: a combination, e.g., an occurrence of A and B (in either order) before C

## Episode Extraction (cont'd)

A sequence of events:

    E D F A B C E F C D B A D C E F C B E A E C F A
    30   35   40   45   50   55   60   65

S = {(A1, t1), (A2, t2), …, (An, tn)}

s = {(E,31), (D,32), (F,33), …, (A,65)}

- A time window is set to bound the interestingness
- W(s, win) is the set of windows of width win that slide over the whole sequence; e.g., the window (w, 35, 40) contains the episodes A, B, C, E
- The user specifies in how many windows an episode has to occur to be frequent

Formula:

    fr(α, s, win) = |{ w ∈ W(s, win) : α occurs in w }| / |W(s, win)|
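The window-based frequency can be sketched as follows (the event list here is a small hypothetical one, not the slide's full sequence; the window bounds follow Mannila et al., where the first window contains only the first event and the last window only the last):

```python
def windows(events, win):
    """All windows [t, t+win) overlapping the event sequence."""
    lo, hi = events[0][1], events[-1][1]
    return [(t, t + win) for t in range(lo - win + 1, hi + 1)]

def occurs_serial(episode, events, start, end):
    """Does the serial episode (a tuple of event types) occur, in order,
    inside the window [start, end)?"""
    t_prev = start - 1
    for sym in episode:
        hits = [t for e, t in events if e == sym and t_prev < t < end]
        if not hits:
            return False
        t_prev = min(hits)  # greedy earliest match is safe for serial episodes
    return True

def fr(episode, events, win):
    """Fraction of width-`win` windows in which the episode occurs."""
    W = windows(events, win)
    return sum(occurs_serial(episode, events, s, e) for s, e in W) / len(W)

# Hypothetical mini event sequence of (type, time) pairs
s = [("A", 35), ("B", 37), ("C", 38), ("E", 39), ("A", 42), ("B", 44)]
print(fr(("A", "B"), s, 5))  # share of width-5 windows with A before B
```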
 
## Episode Extraction (cont'd)

Minimal occurrences:
- Look at the exact occurrences of episodes and the relationships between those occurrences
- The window width can be modified
- Eliminates unnecessary repetition of the recognition effort

Example: mo(α) = {[35, 38), [46, 48), …}

When one episode is a subepisode of another, this relation is used for discovering all frequent episodes.

## Applications of Episode Extraction

- Computer security
- Bioinformatics
- Finance
- Market analysis
- And more…

## References

- Mannila, Toivonen, Verkamo. Discovery of Frequent Episodes in Event Sequences.
- Agrawal, Srikant. Mining Sequential Patterns.
- Hand, Mannila, Smyth. Principles of Data Mining. 2001.
- Han, Kamber. Data Mining: Concepts and Techniques. 2001.

END