Association Rule and
Sequential Pattern
Mining for Episode
Extraction



Jonathan Yip

Introduction to Association Rule

- Associating multiple objects/events together
- Example: A customer buying a laptop also buys a wireless LAN card (2-itemset)

  Laptop → Wireless LAN Card

Association Rule (con’t)

Measures of Rule Interestingness

- Support = P(Laptop ∪ LAN card)
  Probability that all studied sets occur (customer buys both)

- Confidence = P(LAN card | Laptop)
             = P(Laptop ∪ LAN card) / P(Laptop)
  Conditional probability that a customer who bought a laptop also bought a wireless LAN card

Thresholds:
Minimum Support: 25%
Minimum Confidence: 30%

Laptop → Wireless LAN Card  [Support = 40%, Confidence = 60%]

Association Rule (e.g.)

TID | Items
 1  | Bread, Coke, Milk
 2  | Chips, Bread
 3  | Coke, Eggs, Milk
 4  | Bread, Eggs, Milk, Coke
 5  | Coke, Eggs, Milk

Min_Sup = 25%
Min_Conf = 25%

Rule: Milk → Eggs

Support:    P(Milk ∪ Eggs) = 3/5 = 60%
Confidence: P(Eggs | Milk) = P(Milk ∪ Eggs) / P(Milk)
            P(Milk) = 4/5 = 80%
            P(Eggs | Milk) = 60% / 80% = 75%
(75% confidence that a customer buying milk also buys eggs)
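The support and confidence computed above can be checked with a short Python sketch over the slide's five transactions (the `rule_metrics` helper is illustrative, not from any library):

```python
# Compute support and confidence for the rule Milk -> Eggs
# over the five example transactions.

transactions = [
    {"Bread", "Coke", "Milk"},
    {"Chips", "Bread"},
    {"Coke", "Eggs", "Milk"},
    {"Bread", "Eggs", "Milk", "Coke"},
    {"Coke", "Eggs", "Milk"},
]

def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence) for the rule antecedent -> consequent."""
    n = len(transactions)
    # Transactions containing both the antecedent and the consequent
    both = sum(1 for t in transactions
               if antecedent <= t and consequent <= t)
    # Transactions containing the antecedent alone
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / n
    confidence = both / ante if ante else 0.0
    return support, confidence

support, confidence = rule_metrics({"Milk"}, {"Eggs"}, transactions)
print(f"support = {support:.0%}, confidence = {confidence:.0%}")
# support = 60%, confidence = 75%
```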

Types of Association

- Boolean vs. Quantitative
- Single dimension vs. Multiple dimension
- Single level vs. Multiple level analysis

Examples:
1.) Gender(X, "Male") ^ Income(X, ">50K") ^ Age(X, "35…50") → Buys(X, "BMW Sedan")
2.) Income(X, ">50K") → Buys(X, "BMW Sedan")
3.) Gender(X, "Male") ^ Income(X, ">50K") ^ Age(X, "35…50") → Buys(X, "BMW 540i")





Association Rule (DB Miner)

Apriori Algorithm

Purpose
- To mine frequent itemsets for Boolean association rules
- Uses prior knowledge to predict future values
- Itemsets have to be frequent (Support > Min_Sup)

Anti-monotone property
- If a set cannot pass the min_sup test, all of its supersets will fail as well


Apriori Algorithm Pseudo-Code

Ck: candidate itemset of size k
Lk: frequent itemset of size k

L1 = {frequent items};
for (k = 1; Lk != ∅; k++) do begin
    Ck+1 = candidates generated from Lk;
    for each transaction t in database do
        increment the count of all candidates in Ck+1 that are contained in t
    Lk+1 = candidates in Ck+1 with min_support
end
return ∪k Lk;
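The pseudo-code above can be turned into a minimal runnable Python sketch; itemsets are frozensets, and the join/prune details are one illustrative reading of the candidate-generation step:

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Mine all frequent itemsets with the generate-and-count
    loop from the pseudo-code."""
    transactions = [frozenset(t) for t in transactions]
    # L1: frequent 1-itemsets
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s for s, c in counts.items() if c >= min_support_count}
    frequent = set(Lk)
    k = 1
    while Lk:
        # Join step: candidate (k+1)-itemsets from unions of Lk members
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        # Prune step (anti-monotone): every k-subset must itself be frequent
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k))}
        # Count each candidate's occurrences across transactions
        counts = {c: sum(1 for t in transactions if c <= t) for c in Ck}
        Lk = {c for c, n in counts.items() if n >= min_support_count}
        frequent |= Lk
        k += 1
    return frequent

transactions = [
    {"Bread", "Coke", "Milk"},
    {"Chips", "Bread"},
    {"Coke", "Eggs", "Milk"},
    {"Bread", "Eggs", "Milk", "Coke"},
    {"Coke", "Eggs", "Milk"},
]
for itemset in sorted(apriori(transactions, 2), key=len):
    print(sorted(itemset))
```

On the slide's five transactions (min support count 2) this yields the frequent singletons, the five frequent pairs, and the two frequent triples {Bread, Coke, Milk} and {Coke, Milk, Eggs}.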


Apriori Algorithm Procedures

Example revisited: 5 items with 5 transactions
Min_Sup = 25% → Min Support Count = 2
Min_Conf = 25%

TID | Items
 1  | Bread, Coke, Milk
 2  | Chips, Bread
 3  | Coke, Eggs, Milk
 4  | Bread, Eggs, Milk, Coke
 5  | Coke, Eggs, Milk

Step 1: Scan & find the support of each item (C1):

Items | support
Bread | 3
Coke  | 4
Milk  | 4
Chips | 1 (fail)
Eggs  | 3

Step 2: Compare with Min_Sup and eliminate (prune) items < Min_Sup (L1):

Items | support
Bread | 3
Coke  | 4
Milk  | 4
Eggs  | 3

Apriori Algorithm (con’t)

Step 3: Join (L1 ⋈ L1) and count supports (C2):

Bread & Coke: 2/5 = 40%
Bread & Milk: 2/5 = 40%
Bread & Eggs: 1/5 = 20%
Coke & Milk:  4/5 = 80%
Coke & Eggs:  2/5 = 40%
Milk & Eggs:  3/5 = 60%

Repeated step: eliminate (prune) itemsets < min_sup (L2):

Bread & Coke
Bread & Milk
Coke & Milk
Coke & Eggs
Milk & Eggs

Join L2 ⋈ L2 and count supports (C3):

Items                      | Support
Bread & Coke & Milk        | 2
Bread & Coke & Eggs        | 1 (fail)
Bread & Coke & Milk & Eggs | 1 (fail)
Coke & Milk & Eggs         | 3

Compare with Min_Sup, then eliminate (prune) itemsets < Min_Sup.

Conclusion:
- Bread & Coke & Milk have strong correlation
- Coke & Milk & Eggs have strong correlation

Sequential Pattern Mining

Introduction
- Mining of frequently occurring patterns related to time or other sequences

Example
- 70% of customers rent "Star Wars", then "Empire Strikes Back", and then "Return of the Jedi"
  (Star Wars → Empire Strikes Back → Return of the Jedi)

Applications
- Intrusion detection on computers
- Web access patterns
- Predicting disease from a sequence of symptoms
- Many other areas

Sequential Pattern Mining (con’t)

Steps:
- Sort Phase: sort by Cust_ID, Transaction_ID
- Litemset Phase: find large itemsets
- Transform Phase: eliminate items < min_sup
- Sequence Phase: find the desired sequences
- Maximal Phase: find the maximal sequences among the set of large sequences

Sequential Pattern Mining (con’t)

Example: database sorted by Cust_ID & Transaction Time (Min_sup = 25%)

Cust ID | Trans. Time | Items Bought
   1    | June 25 ’02 | 3
   1    | June 30 ’02 | 9
   2    | June 10 ’02 | 1, 2
   2    | June 15 ’02 | 3
   2    | June 20 ’02 | 4, 6, 7
   3    | June 25 ’02 | 3, 5, 7
   4    | June 25 ’02 | 3
   4    | June 30 ’02 | 4, 7
   4    | July 25 ’02 | 9
   5    | June 12 ’02 | 9

Organized format with Cust_ID:

Cust ID | Original Sequence
   1    | {(3) (9)}
   2    | {(1,2) (3) (4,6,7)}
   3    | {(3,5,7)}
   4    | {(3) (4,7) (9)}
   5    | {(9)}
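The sort phase above can be sketched in Python; the ISO date strings are an assumed encoding of the slide's transaction times so that plain sorting orders them chronologically:

```python
from collections import defaultdict

# (cust_id, transaction time, items bought) rows from the table above.
rows = [
    (1, "2002-06-25", {3}), (1, "2002-06-30", {9}),
    (2, "2002-06-10", {1, 2}), (2, "2002-06-15", {3}),
    (2, "2002-06-20", {4, 6, 7}),
    (3, "2002-06-25", {3, 5, 7}),
    (4, "2002-06-25", {3}), (4, "2002-06-30", {4, 7}),
    (4, "2002-07-25", {9}),
    (5, "2002-06-12", {9}),
]

def sort_phase(rows):
    """Sort by (Cust_ID, transaction time) and collect each customer's
    itemsets into one ordered sequence."""
    sequences = defaultdict(list)
    for cust, _, items in sorted(rows, key=lambda r: (r[0], r[1])):
        sequences[cust].append(tuple(sorted(items)))
    return dict(sequences)

for cust, seq in sort_phase(rows).items():
    print(cust, seq)
```

The result matches the "organized format" table: e.g. customer 2 becomes the sequence (1,2), (3), (4,6,7).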

Sequential Pattern Mining (con’t)

Step 1: Sort (examples from several customers):

Cust ID | Original Sequence | Items to study      | Support Count
   1    | {(3) (9)}         | {(3)} {(9)} {(3,9)} | 3, 3, 2
   5    | {(9)}             | {(9)}               | 1

Conclusion: sequences above the 25% Min_sup are {(3) (9)} and {(3) (4,7)}

Sequential Pattern Mining (con’t)

Step 2: Litemset phase

Large itemsets are mapped to integers:

Litemset | Mapped To
  (3)    | 1
  (4)    | 2
  (7)    | 3
  (4 7)  | 4
  (9)    | 5

Data sequence of each customer:

Cust ID | Original Sequence   | Transformed Cust. Sequence  | After mapping
   1    | {(3) (9)}           | {(3)} {(9)}                 | ({1} {5})
   2    | {(1,2) (3) (4,6,7)} | {(3)} {(4) (7) (4,7)}       | ({1} {2 3 4})
   3    | {(3,5,7)}           | {(3) (7)}                   | ({1 3})
   4    | {(3) (4,7) (9)}     | {(3)} {(4) (7) (4,7)} {(9)} | ({1} {2 3 4} {5})
   5    | {(9)}               | {(9)}                       | ({5})

Sequences < min_support: {(1,2) (3)}, {(3)}, {(4)}, {(7)}, {(9)},
{(3) (4)}, {(3) (7)}, {(4) (7)}

Support > 25%: {(3) (9)} and {(3) (4 7)}

The rightmost column shows each customer’s buying pattern.

Sequential Pattern Mining Algorithms

AprioriAll
- Counts all large sequences, including those that are not maximal

Pseudo-code:
Ck: candidate sequence of size k
Lk: frequent or large sequence of size k

L1 = {large 1-sequences};   // result of the litemset phase
for (k = 2; Lk-1 != ∅; k++) do begin
    Ck = candidates generated from Lk-1;
    for each customer sequence c in database do
        increment the count of all candidates in Ck that are contained in c
    Lk = candidates in Ck with min_support
end
Answer = maximal sequences in ∪k Lk;

AprioriSome
- Generates every candidate sequence, but skips counting some large
  sequences (Forward Phase). Then discards candidates that are not
  maximal and counts the remaining large sequences (Backward Phase).
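The heart of AprioriAll's counting loop is the test of whether a candidate sequence is contained in a customer sequence; a minimal Python sketch using the customer sequences from the example (the `contains` helper is illustrative, not from the paper):

```python
def contains(candidate, sequence):
    """True if `candidate` (a list of itemsets) is a subsequence of
    `sequence`: each candidate itemset must be a subset of some
    strictly later element of the customer sequence."""
    pos = 0
    for itemset in candidate:
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1
    return True

# Customer sequences from the example.
sequences = [
    [{3}, {9}],
    [{1, 2}, {3}, {4, 6, 7}],
    [{3, 5, 7}],
    [{3}, {4, 7}, {9}],
    [{9}],
]

# Count the support of two candidate 2-sequences from the slides.
for cand in ([{3}, {9}], [{3}, {4, 7}]):
    n = sum(contains(cand, s) for s in sequences)
    print(cand, "support count =", n)
```

Both candidates are contained in two of the five customer sequences, which is why {(3) (9)} and {(3) (4,7)} clear the 25% minimum support.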




Episode Extraction

- A partially ordered collection of events occurring together
- Goal: to analyze sequences of events and to discover recurrent episodes
- First find small frequent episodes, then progressively look for larger episodes

Types of episodes
- Serial (E → F): E occurs before F
- Parallel (A, B): no constraints on the relative order of A & B
- Non-Serial/Non-Parallel: the occurrence of A & B precedes C







Episode Extraction (con’t)

Event sequence (events with timestamps 30 through 65):

E D F A B C E F C D B A D C E F C B E A E C F A

- S = {(A1,t1), (A2,t2), …, (An,tn)}
- s = {(E,31), (D,32), (F,33), …, (A,65)}
- A time window is set to bound the interestingness
- W(s,5) slides over and snapshots the whole sequence
  e.g. window (w,35,40) contains the A, B, C, E episodes
- The user specifies in how many windows an episode has to occur to be frequent
- Formula (frequency of an episode α in sequence s with window width win):

  fr(α, s, win) = |{ w ∈ W(s,win) : α occurs in w }| / |W(s,win)|
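The window-frequency computation can be sketched for a serial episode. The event list below is a prefix of the slide's sequence with assumed timestamps, and the window enumeration (every width-win window overlapping the sequence, as in Mannila et al.) is one reasonable reading of W(s, win):

```python
def occurs_serially(episode, symbols):
    """True if the episode's event types appear in order in `symbols`."""
    it = iter(symbols)
    return all(e in it for e in episode)

def window_frequency(episode, events, win):
    """fr(episode, s, win): fraction of width-`win` windows in which the
    serial episode (an ordered tuple of event types) occurs. Windows are
    [t, t+win) for every start time t whose window overlaps the sequence."""
    first, last = events[0][1], events[-1][1]
    starts = range(first - win + 1, last + 1)
    hits = 0
    for t in starts:
        # Event types falling inside this window, in time order
        inside = [e for e, ts in events if t <= ts < t + win]
        if occurs_serially(episode, inside):
            hits += 1
    return hits / len(starts)

# (event type, timestamp) pairs; an assumed prefix of the slide's sequence.
s = [("E", 31), ("D", 32), ("F", 33), ("A", 35), ("B", 37), ("C", 38)]
print(window_frequency(("E", "F"), s, 5))
```

Here 12 windows overlap the prefix and the serial episode E → F falls entirely inside 3 of them, so fr = 3/12 = 0.25.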
Episode Extraction

Minimal occurrences
- Look at the exact occurrences of episodes & the relationships between occurrences
- Can modify the width of the window
- Eliminates unnecessary repetition of the recognition effort
- Example: mo(α) = {[35,38), [46,48), [47,70)}
- When an episode is a subepisode of another, this relation is used for discovering all frequent episodes
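A minimal-occurrence scan for a two-event serial episode can be sketched as follows; the event timestamps are made up to echo the example, and intervals are reported as closed (start, end) pairs:

```python
def minimal_occurrences(pair, events):
    """Minimal occurrence intervals (ts, te) of the serial episode
    (A, B): for each B at time te, pair it with the latest earlier A.
    Any earlier A would give a strictly containing (non-minimal)
    interval, so the greedy choice is exactly the minimal one."""
    a, b = pair
    intervals = []
    last_a = None
    for ev, t in events:
        if ev == a:
            last_a = t          # remember only the latest A so far
        elif ev == b and last_a is not None:
            intervals.append((last_a, t))
            last_a = None       # this A is consumed by its closest B
    return intervals

# Illustrative (type, timestamp) events, not the slide's full sequence.
s = [("A", 35), ("B", 38), ("C", 40), ("A", 46), ("B", 48)]
print(minimal_occurrences(("A", "B"), s))  # [(35, 38), (46, 48)]
```

Because each minimal interval is found in one left-to-right pass, no window needs to be re-examined, which is the "eliminates unnecessary repetition" point above.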


Applications of Episodes
Extraction


Computer Security


Bioinformatics


Finance


Market Analysis


And more……





References

- Discovery of Frequent Episodes in Event Sequences (Mannila, Toivonen, Verkamo)
- Mining Sequential Patterns (Agrawal, Srikant)
- Principles of Data Mining (Hand, Mannila, Smyth), 2001
- Data Mining Concepts and Techniques (Han, Kamber), 2001

END