Association Rule and Sequential Pattern Mining for Episode Extraction
Jonathan Yip
Introduction to Association Rules
• Associating multiple objects/events together
• Example: a customer buying a laptop also buys a wireless LAN card (a 2-itemset)
[Diagram: Laptop → Wireless LAN card]
Association Rule (con’t)
Measures of Rule Interestingness
• Support = P(Laptop ∪ LAN card)
  Probability that all studied itemsets occur together
• Confidence = P(LAN card | Laptop) = P(Laptop ∪ LAN card) / P(Laptop)
  Conditional probability that a customer who bought a laptop also bought a wireless LAN card
Thresholds: Minimum Support = 25%, Minimum Confidence = 30%
Laptop ⇒ Wireless LAN card [Support = 40%, Confidence = 60%]
[Venn diagram: Laptop and Wireless LAN card circles; the overlap marks customers who buy both]
Association Rule (eg.)
Min_Sup = 25%, Min_Conf = 25%

TID | Items
 1  | Bread, Coke, Milk
 2  | Chips, Bread
 3  | Coke, Eggs, Milk
 4  | Bread, Eggs, Milk, Coke
 5  | Coke, Eggs, Milk

Rule studied: Milk ⇒ Eggs
Support: P(Milk ∪ Eggs) = 3/5 = 60%
Confidence: P(Eggs | Milk) = P(Milk ∪ Eggs) / P(Milk)
  P(Milk) = 4/5 = 80%
  P(Eggs | Milk) = 60% / 80% = 75%
(75% confidence that a customer who buys milk also buys eggs)
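The support and confidence arithmetic above can be checked with a short Python sketch. The transaction list mirrors the slide's TID table; the function names are illustrative, not from any library:

```python
# The slide's five transactions, one set per TID.
transactions = [
    {"Bread", "Coke", "Milk"},          # TID 1
    {"Chips", "Bread"},                 # TID 2
    {"Coke", "Eggs", "Milk"},           # TID 3
    {"Bread", "Eggs", "Milk", "Coke"},  # TID 4
    {"Coke", "Eggs", "Milk"},           # TID 5
]

def support(itemset, db):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(set(itemset) <= t for t in db) / len(db)

def confidence(antecedent, consequent, db):
    """P(consequent | antecedent) = support(A and C together) / support(A)."""
    both = set(antecedent) | set(consequent)
    return support(both, db) / support(antecedent, db)

print(support({"Milk", "Eggs"}, transactions))       # 0.6 (3/5)
print(confidence({"Milk"}, {"Eggs"}, transactions))  # ≈ 0.75 (60% / 80%)
```

This reproduces the slide's numbers: Milk ⇒ Eggs has 60% support and 75% confidence.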
Types of Association
• Boolean vs. quantitative
• Single-dimension vs. multi-dimension
• Single-level vs. multi-level analysis
Examples:
1.) Gender(X,”Male”) ^ Income(X,”>50K”) ^ Age(X,”35…50”) ⇒ Buys(X, “BMW Sedan”)
2.) Income(X,”>50K”) ⇒ Buys(X, “BMW Sedan”)
3.) Gender(X,”Male”) ^ Income(X,”>50K”) ^ Age(X,”35…50”) ⇒ Buys(X, “BMW 540i”)
Association Rule (DBMiner)
Apriori Algorithm
• Purpose: to mine frequent itemsets for Boolean association rules
• Uses prior knowledge to predict future values
• An itemset has to be frequent (Support > Min_Sup)
• Anti-monotone property: if a set cannot pass the min_sup test, all of its supersets will fail as well
Apriori Algorithm Pseudo-Code
• Pseudo-code:
  Ck: candidate itemset of size k
  Lk: frequent itemset of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1
          that are contained in t
      Lk+1 = candidates in Ck+1 with min_support
  end
  return ∪k Lk;
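The pseudo-code can be turned into a small runnable sketch. This is a straightforward Python translation, not an optimized implementation; the candidate-generation step uses the anti-monotone prune described earlier, and the dataset is the slide's five transactions:

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Frequent-itemset mining following the pseudo-code above.
    `min_sup` is an absolute support count (the slides use count >= 2)."""
    db = [frozenset(t) for t in transactions]
    # C1 -> L1: count single items and keep the frequent ones
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    Lk = {s: c for s, c in counts.items() if c >= min_sup}
    frequent = dict(Lk)
    k = 1
    while Lk:
        items = sorted({i for s in Lk for i in s})
        # Generate size-(k+1) candidates; the anti-monotone prune drops any
        # candidate that has an infrequent k-subset before counting.
        Ck1 = [frozenset(c) for c in combinations(items, k + 1)
               if all(frozenset(sub) in Lk for sub in combinations(c, k))]
        counts = {c: sum(c <= t for t in db) for c in Ck1}
        Lk = {s: n for s, n in counts.items() if n >= min_sup}
        frequent.update(Lk)
        k += 1
    return frequent

# The slide's five transactions, minimum support count = 2:
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Chips", "Bread"},
    {"Coke", "Eggs", "Milk"},
    {"Bread", "Eggs", "Milk", "Coke"},
    {"Coke", "Eggs", "Milk"},
]
freq = apriori(transactions, min_sup=2)
print(freq[frozenset({"Bread", "Coke", "Milk"})])  # 2
print(freq[frozenset({"Coke", "Milk", "Eggs"})])   # 3
```

The two printed counts match the strong itemsets found in the worked example on the following slides; Chips is pruned at the C1 stage, so no superset of it is ever counted.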
Apriori Algorithm Procedures
Example revisited: 5 items across 5 transactions
Min_Sup = 25% → min support count = 2
Min_Conf = 25%

TID | Items
 1  | Bread, Coke, Milk
 2  | Chips, Bread
 3  | Coke, Eggs, Milk
 4  | Bread, Eggs, Milk, Coke
 5  | Coke, Eggs, Milk

Step 1 – Scan and find the support of each item (C1):
Items | Support
Bread | 3
Coke  | 4
Milk  | 4
Chips | 1 (fail)
Eggs  | 3

Step 2 – Compare with Min_Sup and eliminate (prune) items < Min_Sup (L1):
Items | Support
Bread | 3
Coke  | 4
Milk  | 4
Eggs  | 3
Apriori Algorithm (con’t)
Step 3 – Join (L1 ⋈ L1) to form C2, then scan for supports:
  Bread & Coke: 2/5 = 40%
  Bread & Milk: 2/5 = 40%
  Bread & Eggs: 1/5 = 20%
  Coke & Milk: 4/5 = 80%
  Coke & Eggs: 3/5 = 60%
  Milk & Eggs: 3/5 = 60%

Repeated step – eliminate (prune) itemsets with support < min_sup (L2):
  Bread & Coke
  Bread & Milk
  Coke & Milk
  Coke & Eggs
  Milk & Eggs
Apriori Algorithm (con’t)
Join L2 ⋈ L2 (C3), then compare with Min_Sup and eliminate (prune) itemsets < Min_Sup:

Items                      | Support
Bread & Coke & Milk        | 2
Bread & Coke & Eggs        | 1 (fail)
Bread & Coke & Milk & Eggs | 1 (fail)
Coke & Milk & Eggs         | 3

Conclusion:
• Bread, Coke & Milk have a strong correlation
• Coke, Milk & Eggs have a strong correlation
Sequential Pattern Mining
Introduction
• Mining of frequently occurring patterns related to time or other sequences
Example
• 70% of customers rent “Star Wars”, then “Empire Strikes Back”, and then “Return of the Jedi”
Applications
• Intrusion detection on computers
• Web access patterns
• Predicting disease from a sequence of symptoms
• Many other areas
[Diagram: Star Wars → Empire Strikes Back → Return of the Jedi]
Sequential Pattern Mining (con’t)
Steps:
• Sort Phase – sort by Cust_ID, then by transaction time
• Litemset Phase – find the large itemsets
• Transform Phase – replace each transaction by the litemsets it contains (items with support < min_sup are eliminated)
• Sequence Phase – find the desired sequences
• Maximal Phase – find the maximal sequences among the set of large sequences
Sequential Pattern Mining (con’t)
Example: database sorted by Cust_ID & transaction time (Min_Sup = 25%)

Cust ID | Trans. Time | Items Bought
   1    | June 25 ’02 | 3
   1    | June 30 ’02 | 9
   2    | June 10 ’02 | 1, 2
   2    | June 15 ’02 | 3
   2    | June 20 ’02 | 4, 6, 7
   3    | June 25 ’02 | 3, 5, 7
   4    | June 25 ’02 | 3
   4    | June 30 ’02 | 4, 7
   4    | July 25 ’02 | 9
   5    | June 12 ’02 | 9

Organized format with Cust_ID:
Cust ID | Original Sequence
   1    | {(3) (9)}
   2    | {(1,2) (3) (4,6,7)}
   3    | {(3,5,7)}
   4    | {(3) (4,7) (9)}
   5    | {(9)}
Sequential Pattern Mining (con’t)
Step 1: Sort (examples for several customer sequences):

Cust ID | Original Sequence | Items to study      | Support Count
   1    | {(3) (9)}         | {(3)} {(9)} {(3,9)} | 3, 3, 2
   5    | {(9)}             | {(9)}               | 1

Conclusion – sequences with support > 25% Min_Sup: {(3) (9)} and {(3) (4,7)}
Sequential Pattern Mining (con’t)
Data sequence of each customer:

Cust ID | Original Sequence   | Transformed Cust. Sequence      | After Mapping
   1    | {(3) (9)}           | {{(3)} {(9)}}                   | {{1} {5}}
   2    | {(1,2) (3) (4,6,7)} | {{(3)} {(4), (7), (4,7)}}       | {{1} {2, 3, 4}}
   3    | {(3,5,7)}           | {{(3), (7)}}                    | {{1, 3}}
   4    | {(3) (4,7) (9)}     | {{(3)} {(4), (7), (4,7)} {(9)}} | {{1} {2, 3, 4} {5}}
   5    | {(9)}               | {{(9)}}                         | {{5}}

Sequences below min_support or not maximal:
{(1,2) (3)}, {(3)}, {(4)}, {(7)}, {(9)}, {(3) (4)}, {(3) (7)}, {(4) (7)}
Sequences with support > 25%: {(3) (9)} and {(3) (4,7)}
The rightmost column shows each customer’s buying pattern after mapping.
Step 2: Litemset phase – map each large itemset to an integer:

Large Itemset | Mapped To
    (3)       |     1
    (4)       |     2
    (7)       |     3
   (4,7)      |     4
    (9)       |     5
Sequential Pattern Mining Algorithm
Algorithms
• AprioriAll – counts all large sequences, including those that are not maximal
  Pseudo-code:
    Ck: candidate sequences of size k
    Lk: frequent (large) sequences of size k

    L1 = {large 1-sequences};   // result of the litemset phase
    for (k = 2; Lk-1 != ∅; k++) do begin
        Ck = candidates generated from Lk-1;
        for each customer sequence c in database do
            increment the count of all candidates in Ck
            that are contained in c
        Lk = candidates in Ck with min_support
    end
    Answer = maximal sequences in ∪k Lk;
• AprioriSome – generates every candidate sequence, but skips counting some large sequences (Forward Phase). Then discards candidates that are not maximal and counts the remaining large sequences (Backward Phase).
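The core operation inside AprioriAll is counting how many customer sequences contain a candidate: each element of the candidate must be a subset of some transaction, in order. A minimal sketch over the slide's five customer sequences (`is_subsequence` and `seq_support` are names I chose):

```python
def is_subsequence(pattern, customer_seq):
    """True if `pattern` (a list of itemsets) is contained in `customer_seq`:
    each pattern element is a subset of a later transaction, preserving order."""
    i = 0
    for trans in customer_seq:
        if i < len(pattern) and set(pattern[i]) <= set(trans):
            i += 1
    return i == len(pattern)

def seq_support(pattern, db):
    """Fraction of customers whose sequence contains `pattern`."""
    return sum(is_subsequence(pattern, s) for s in db) / len(db)

# The slide's five customer sequences:
db = [
    [{3}, {9}],
    [{1, 2}, {3}, {4, 6, 7}],
    [{3, 5, 7}],
    [{3}, {4, 7}, {9}],
    [{9}],
]
print(seq_support([{3}, {9}], db))     # 0.4 (customers 1 and 4)
print(seq_support([{3}, {4, 7}], db))  # 0.4 (customers 2 and 4)
```

Both candidate sequences clear the 25% minimum support, matching the conclusion {(3) (9)} and {(3) (4,7)} in the worked example.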
Episode Extraction
• A partially ordered collection of events occurring together
• Goal: to analyze sequences of events and discover recurrent episodes
• First find small frequent episodes, then progressively look for larger ones
• Types of episodes:
  Serial – E occurs before F
  Parallel – no constraint on the relative order of A & B
  Non-serial/non-parallel – an occurrence of A & B precedes C
[Diagrams: E → F (serial); A, B unordered (parallel); {A, B} → C (composite)]
Episode Extraction (con’t)
A sequence of events:
S = {(A1, t1), (A2, t2), …, (An, tn)}
e.g. s = {(E,31), (D,32), (F,33), …, (A,65)}:

E D F A B C E F C D B A D C E F C B E A E C F A
30   35   40   45   50   55   60   65

• A time window is set to bound the interestingness:
  W(s, 5) slides along and takes snapshots of the whole sequence;
  e.g. window (w, 35, 40) contains the events A, B, C, E
• The user specifies in how many windows an episode has to occur to be frequent
Formula:
fr(α, s, win) = |{w ∈ W(s, win) : α occurs in w}| / |W(s, win)|
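The window-frequency formula can be sketched in Python for a parallel episode (order ignored). The helper name and the six-event toy prefix of s are my own; a full implementation would also check the ordering constraints of serial episodes:

```python
def window_frequency(episode, events, win):
    """fr(episode, s, win): fraction of width-`win` sliding windows that
    contain every event type of a *parallel* episode (order ignored).
    `events` is a list of (event_type, time) pairs sorted by time; windows
    are half-open [t, t+win) and slide over every integer start, so the
    first window holds only the first event and the last only the last
    (the convention of Mannila et al.)."""
    Ts = events[0][1]   # first timestamp
    Te = events[-1][1]  # last timestamp
    starts = range(Ts - win + 1, Te + 1)
    hits = 0
    for t in starts:
        inside = {e for e, te in events if t <= te < t + win}
        if set(episode) <= inside:
            hits += 1
    return hits / len(starts)

# Hypothetical prefix of the slide's sequence s = {(E,31),(D,32),(F,33),...}
s = [("E", 31), ("D", 32), ("F", 33), ("A", 34), ("B", 35), ("C", 36)]
print(window_frequency({"A", "B"}, s, win=5))  # 0.4: 4 of 10 windows hold both
```

With win = 5 there are 10 windows (starts 27 through 36), and only the four starting at 31–34 contain both A (t=34) and B (t=35).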
Episode Extraction
Minimal occurrences
• Look at the exact occurrences of episodes and the relationships between occurrences
• The width of the window can be modified
• Eliminates unnecessary repetition of the recognition effort
• Example: mo(α) = {[35,38), [46,48), [67,70)}
• When an episode is a subepisode of another, this relation is used for discovering all frequent episodes
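A minimal-occurrence finder for a serial episode can be sketched as follows. An interval is a minimal occurrence if the episode occurs in it and in no proper sub-interval. The event data is hypothetical, chosen so the intervals resemble the slide's example; `minimal_occurrences` is my own name, and the intervals are reported as closed (start, end) timestamp pairs:

```python
def minimal_occurrences(episode, events):
    """Minimal occurrences of a *serial* episode (a tuple of event types)
    in `events`, a time-sorted list of (event_type, time) pairs."""
    if len(episode) == 1:
        return [(t, t) for e, t in events if e == episode[0]]
    cands = []
    for i, (e0, t0) in enumerate(events):
        if e0 != episode[0]:
            continue
        # Greedily match the rest of the episode as early as possible,
        # giving the earliest-ending occurrence from this start.
        k, end = 1, None
        for e, t in events[i + 1:]:
            if e == episode[k]:
                k += 1
                if k == len(episode):
                    end = t
                    break
        if end is not None:
            cands.append((t0, end))
    # Later starts can only end at the same time or later, so an interval is
    # non-minimal exactly when a later start reaches the same end; keep the
    # latest start for each end time.
    best = {}
    for st, en in cands:
        best[en] = max(best.get(en, st), st)
    return sorted((st, en) for en, st in best.items())

events = [("A", 35), ("B", 37), ("B", 38), ("A", 46),
          ("B", 48), ("A", 67), ("B", 70)]
print(minimal_occurrences(("A", "B"), events))  # [(35, 37), (46, 48), (67, 70)]
```

Each interval is tight: e.g. the occurrence starting at A(35) ends at the first following B(37), and no shorter occurrence lies inside it.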
Applications of Episode Extraction
• Computer security
• Bioinformatics
• Finance
• Market analysis
• And more…
References
• Mannila, Toivonen, Verkamo: Discovery of Frequent Episodes in Event Sequences
• Agrawal, Srikant: Mining Sequential Patterns
• Hand, Mannila, Smyth: Principles of Data Mining (2001)
• Han, Kamber: Data Mining: Concepts and Techniques (2001)
END