# On Efficiently Mining Generalized


20 Oct 2013


On Efficiently Mining Generalized
Association Rules from Large

Presenter: Jiang Tao, PhD Student

Supervisor: Assoc Prof. Tan Ah Hwee

Outline

Introduction

Generalized Association Rule Mining on

GP-Close Algorithm

Experiments

Conclusion


Introduction

Association Rule Mining

Association Rule Mining (ARM) was first proposed by Agrawal, Imielinski and Swami (1993).

Let $I = \{i_1, i_2, \ldots, i_n\}$ be the set of all items in a large database of transactions $D$, where each transaction is a set of items $T \subseteq I$ with a unique identifier tid. ARM finds rules about how one subset of items (an itemset) implies another subset.

Example: {milk}


Introduction

Association Rule & Frequent Pattern Mining

Find all frequent itemsets (or frequent patterns), i.e. the itemsets whose support is greater than minsup.

Frequent pattern mining is essential for association rule mining.

Generate rules based on frequent itemsets:

Given a frequent itemset $X$, for all $A \subset X$, generate the rules $A \rightarrow X - A$ whose confidence is greater than minconf.
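The two steps above (find frequent itemsets, then derive rules from them) can be sketched as follows. This is a minimal brute-force illustration, not an efficient miner: the toy transactions, the function names and the use of `>=` for the thresholds are all assumptions made for the example.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, minsup):
    """Brute-force enumeration; fine for toy data, not for large databases."""
    items = sorted(set().union(*transactions))
    result = {}
    for k in range(1, len(items) + 1):
        for cand in combinations(items, k):
            s = support(cand, transactions)
            if s >= minsup:  # assumption: threshold is inclusive
                result[frozenset(cand)] = s
    return result


def generate_rules(freq, minconf):
    """For each frequent X and non-empty proper subset A, emit A -> X - A."""
    rules = []
    for x, sup_x in freq.items():
        for k in range(1, len(x)):
            for a in combinations(sorted(x), k):
                a = frozenset(a)
                # Every subset of a frequent itemset is frequent, so freq[a] exists.
                conf = sup_x / freq[a]
                if conf >= minconf:
                    rules.append((a, x - a, sup_x, conf))
    return rules


# Hypothetical toy database (not from the slides).
transactions = [frozenset(t) for t in (
    {"milk", "bread"}, {"milk", "bread", "eggs"}, {"milk"}, {"bread", "eggs"},
)]
freq = frequent_itemsets(transactions, minsup=0.5)
rules = generate_rules(freq, minconf=0.6)
```

Each rule is a tuple (antecedent, consequent, support, confidence); on this toy data `{eggs} -> {bread}` has support 0.5 and confidence 1.0.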

Introduction

Closed Pattern Mining

Some patterns are subsumed by others, e.g. {a, b} and {a, b, c} with the same support of 30%. Closed pattern mining was proposed to remove this redundancy.

Closed pattern: a frequent itemset $X$ is closed if there does not exist an itemset $Y \supset X$ with support(X) = support(Y).

Algorithms: CHARM, Close, CLOSET, etc.
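The closedness test can be illustrated with a small filter over a table of frequent itemsets. This is only a sketch of the definition, not of CHARM, Close or CLOSET; the supports below are hypothetical values chosen to mirror the {a, b} / {a, b, c} example.

```python
def closed_itemsets(freq):
    """Keep the itemsets that have no proper superset with the same support."""
    return {
        x: s for x, s in freq.items()
        if not any(x < y and s == sy for y, sy in freq.items())
    }


# Hypothetical supports (itemset -> support).
freq = {
    frozenset("a"): 0.5,
    frozenset("b"): 0.3,
    frozenset("c"): 0.3,
    frozenset("ab"): 0.3,
    frozenset("ac"): 0.3,
    frozenset("bc"): 0.3,
    frozenset("abc"): 0.3,
}
closed = closed_itemsets(freq)
```

Here {a, b} is dropped because {a, b, c} has the same support, while {a} survives because every superset has a lower support.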

Introduction

Generalized Association Rule Mining

| tid | Items bought |
|---|---|
| 10 | Shirts |
| 20 | Ski Pants, Hiking Boots |
| 30 | Jackets, Hiking Boots |
| 40 | Shirts, Shoes |

Users may be interested in generating associations that span items at different levels of a taxonomy.

Patterns consisting only of items at leaf nodes may be trivial or may fail to reach minsup.

Example: minsup = 50%, minconf = 50%

"Outerwear → Hiking Boots" (support: 50%, confidence: 100%)
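The numbers in this example can be reproduced by extending each transaction with the ancestors of its items before counting. The taxonomy figure is not part of the text, so the `PARENT` map below is an assumption (Jackets and Ski Pants under Outerwear, and so on); the helper names are also hypothetical.

```python
# Assumed taxonomy (child -> parent); the slide's taxonomy figure is not in the text.
PARENT = {
    "Jackets": "Outerwear",
    "Ski Pants": "Outerwear",
    "Outerwear": "Clothes",
    "Shirts": "Clothes",
    "Shoes": "Footwear",
    "Hiking Boots": "Footwear",
}

# Transactions from the table above.
DB = {
    10: {"Shirts"},
    20: {"Ski Pants", "Hiking Boots"},
    30: {"Jackets", "Hiking Boots"},
    40: {"Shirts", "Shoes"},
}


def ancestors(item):
    """All ancestors of `item` in the taxonomy."""
    out = set()
    while item in PARENT:
        item = PARENT[item]
        out.add(item)
    return out


def extend(transaction):
    """Add every ancestor of every item to the transaction."""
    ext = set(transaction)
    for item in transaction:
        ext |= ancestors(item)
    return frozenset(ext)


EXTENDED = [extend(t) for t in DB.values()]


def support(itemset):
    return sum(set(itemset) <= t for t in EXTENDED) / len(EXTENDED)


sup = support({"Outerwear", "Hiking Boots"})
conf = sup / support({"Outerwear"})
leaf_sup = support({"Jackets", "Hiking Boots"})
```

Under the assumed taxonomy, `sup` is 0.5 and `conf` is 1.0, while the leaf-level itemset {Jackets, Hiking Boots} only reaches 0.25 and misses minsup.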

Introduction

Resource Description Framework

W3C RDF (Resource Description Framework)

A specification for describing web resources with semantic information and for interchanging semantic metadata in the Semantic Web environment.

Element: RDF statement, a triple <subject, predicate, object>.

An RDF statement reflects a binary relation between web resources.

Introduction

RDF Statement

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:ns1="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="…">
    <ns1:creator rdf:resource="http://www.example.org/staffid/85740"/>
  </rdf:Description>
</rdf:RDF>

Introduction

RDF Schema

RDF Schema allows RDF vocabularies to be defined:

RDFS classes and RDF properties

Class hierarchy

Property hierarchy

Mining generalized association rules among RDF statements

Applications: RDF query optimization, web resource recommendation, etc.

[Figure: class hierarchy with the classes Staff, Software Engineer and Project Manager, the resource http://www.example.org/staffid/85740, and edges labelled rdf:type and rdfs:subClassOf]


Mining Generalized Association Rules in RDF Datasets

Challenges

Large number of distinct RDF statements:

A small model may allow hundreds of thousands of distinct RDF statements (a large set of items $I$).

Each document may contain hundreds of RDF statements (long transactions).

There is a high probability that long generalized patterns exist.

Over-generalization problem

Over-Generalization: An Example

[Figure: sample RDF DB, RDF vocabulary, and frequent generalized relationsets (minsup = 50%)]

GARM on RDF

Relation Hierarchy (figure)

GARM on RDF

Over-Generalization: An Example

Two frequent relationsets extracted from the sample DB:

rs1: {<Terrorist Group, participate, Financial Crime>, <Terrorist Group, participate, Car Bombing>} (support: 50%)

rs2: {<Terrorist Group, participate, Financial Crime>, <Terrorist Group, participate, Bombing>} (support: 50%)

rs1 is more interesting than rs2:

rs2 is a generalization of rs1.

rs1 conveys more precise information; rs2 loses information.

A frequent relationset rs is over-generalized if it has a specialization with the same support.
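The over-generalization test can be sketched on a small example. The sample DB figure is not in the text, so `DOCS` below is a hypothetical database chosen so that rs1 and rs2 both have 50% support. The `specializes` helper encodes one assumed reading of "specialization" (same size, each relation equal or more specific); the slides do not spell out the formal definition.

```python
TG = "Terrorist Group"
FIN = (TG, "participate", "Financial Crime")
CAR = (TG, "participate", "Car Bombing")
BOMB = (TG, "participate", "Bombing")

# Relation -> direct generalization (only Car Bombing -> Bombing comes from the slides).
GENERALIZES = {CAR: BOMB}

# Hypothetical sample DB: each document is a set of RDF statements.
DOCS = [{FIN, CAR}, {FIN, CAR}, {FIN}, {CAR}]


def generalizations(r):
    """All generalizations of relation `r`."""
    out = set()
    while r in GENERALIZES:
        r = GENERALIZES[r]
        out.add(r)
    return out


# Each document extended with the generalizations of its statements.
EXTENDED = [set(d) | {g for r in d for g in generalizations(r)} for d in DOCS]


def support(rs):
    return sum(set(rs) <= d for d in EXTENDED) / len(EXTENDED)


def specializes(spec, gen):
    """Assumed definition: same size, and each relation of `gen` is matched
    by an equal or more specific relation of `spec`."""
    spec, gen = set(spec), set(gen)
    return (
        spec != gen
        and len(spec) == len(gen)
        and all(any(g == s or g in generalizations(s) for s in spec) for g in gen)
    )


def is_over_generalized(rs, candidates):
    return any(specializes(c, rs) and support(c) == support(rs) for c in candidates)


rs1 = {FIN, CAR}
rs2 = {FIN, BOMB}
```

With these assumptions, `is_over_generalized(rs2, [rs1, rs2])` is true and `is_over_generalized(rs1, [rs1, rs2])` is false, matching the claim that rs1 is the more interesting relationset.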

GARM on RDF

Generalization Closure

Over-generalization reduction using generalization closure.

Given a relationset $X$, its generalization closure is $gc(X) = \{\, r \mid r \in X \text{ or } \exists\, r^* \in X \text{ such that } r \text{ is a generalization of } r^* \,\}$.

Lemma 1: if $gc(X)$ is closed, then $X$ is not over-generalized.

GP-Close

We propose GP-Close: mining all closed generalization closures for over-generalization reduction.

Main features:

Generalization closure enumeration.

Hybrid support counting.

A sorting technique that guarantees closures are generated in a specialization-first manner.

GP-Close: Generalization Closure Enumeration

Calculate the 1-frequent relationsets (frequent relationsets of size 1).

Generate the generalization closures of the 1-frequent relationsets.

Recursively merge the smaller closures to generate larger closures.
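The three steps can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the authors' GP-Close implementation. It omits the closure sorting and hybrid counting, counts support by scanning, and filters closed closures only at the end. The sample documents, the `PARENT` hierarchy and all function names are hypothetical.

```python
TG = "Terrorist Group"
FIN = (TG, "participate", "Financial Crime")
CAR = (TG, "participate", "Car Bombing")
BOMB = (TG, "participate", "Bombing")
PARENT = {CAR: BOMB}  # relation -> direct generalization (assumed hierarchy)

# Hypothetical sample DB: each document is a set of RDF statements.
DOCS = [{FIN, CAR}, {FIN, CAR}, {FIN}, {CAR}]


def gc(relationset):
    """Generalization closure: the relations plus all their generalizations."""
    closure = set(relationset)
    for r in relationset:
        while r in PARENT:
            r = PARENT[r]
            closure.add(r)
    return frozenset(closure)


def mine_closed_closures(docs, minsup_count):
    extended = [gc(d) for d in docs]  # documents with generalizations added

    def count(closure):
        return sum(closure <= d for d in extended)

    # Steps 1 and 2: closures of the frequent single relations.
    seeds = []
    for r in sorted(set().union(*extended)):
        c = gc({r})
        if count(c) >= minsup_count:
            seeds.append(c)

    # Step 3: recursively merge smaller closures into larger ones.
    found = {}

    def expand(closure, start):
        found[closure] = count(closure)
        for i in range(start, len(seeds)):
            merged = closure | seeds[i]
            if merged != closure and count(merged) >= minsup_count:
                expand(merged, i + 1)

    for i, seed in enumerate(seeds):
        expand(seed, i + 1)

    # Keep the closures that have no proper superset of equal support.
    return {
        c: s for c, s in found.items()
        if not any(c < other and s == s2 for other, s2 in found.items())
    }


closed = mine_closed_closures(DOCS, minsup_count=2)
```

On this toy DB the closure {FIN, BOMB} is discarded because {FIN, CAR, BOMB} has the same support count, so the over-generalized relationset never reaches the output.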

GP-Close: Generalization Closure Enumeration and Closure Sorting

GP-Close: Hybrid Support Counting

Two kinds of support counting:

DB scan: for each transaction T, update the support of the candidate relationsets that T supports.

Using tid-sets: $X.tids = \{1, 2\}$ and $Y.tids = \{2, 3\}$ $\rightarrow$ $XY.tids = \{1, 2\} \cap \{2, 3\} = \{2\}$, $support(XY) = |XY.tids| = 1$. This requires extra physical memory to store the tid-sets.

Hybrid counting:

Allocate a tid-set buffer with a maximum buffer size.

Try to build tid-sets for a sub-tree of the closure search tree when the tid-sets can fit in the buffer.
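The trade-off between the two counting modes can be sketched as follows. This is a simplified illustration that assumes the buffer size is measured in stored tids and that the decision is made once per candidate batch, rather than per sub-tree of the closure search tree as described above. The function names and the three-transaction database are hypothetical; the database is chosen so that X.tids = {1, 2} and Y.tids = {2, 3}.

```python
def count_by_scan(candidates, transactions):
    """One pass over the DB; each transaction updates the candidates it supports."""
    counts = {c: 0 for c in candidates}
    for items in transactions.values():
        for c in candidates:
            if c <= items:
                counts[c] += 1
    return counts


def count_by_tidsets(candidates, transactions):
    """Build per-item tid-sets, then intersect them for each candidate."""
    wanted = set().union(*candidates)
    tids = {i: set() for i in wanted}
    for tid, items in transactions.items():
        for i in wanted & items:
            tids[i].add(tid)
    return {c: len(set.intersection(*(tids[i] for i in c))) for c in candidates}


def hybrid_count(candidates, transactions, max_buffer):
    """Use tid-sets only if they fit in the buffer (size measured in stored tids)."""
    wanted = set().union(*candidates)
    needed = sum(len(wanted & items) for items in transactions.values())
    if needed <= max_buffer:
        return count_by_tidsets(candidates, transactions), "tid-sets"
    return count_by_scan(candidates, transactions), "db-scan"


# Hypothetical DB matching the example: X occurs in tids 1 and 2, Y in tids 2 and 3.
transactions = {
    1: frozenset({"X"}),
    2: frozenset({"X", "Y"}),
    3: frozenset({"Y"}),
}
xy = frozenset({"X", "Y"})
big = hybrid_count([xy], transactions, max_buffer=10)
small = hybrid_count([xy], transactions, max_buffer=2)
```

Both calls report a support count of 1 for XY; only the strategy differs, with the small buffer forcing the fallback to a DB scan.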


Experiments

GP-Close vs. Cumulate

Dataset: foafPub
http://ebiquity.umbc.edu/resource/html/id/82/

FOAF (Friend Of A Friend) vocabulary for describing people and their relationships.

About 6,000 documents and 100,000 RDF statements in total.

[Figures: GP-Close vs. Cumulate]


Conclusion

Mining generalized association rules in RDF metadata involves new challenges; one of the main obstacles is the over-generalization problem.

We proposed mining closed generalization closures for over-generalization (OG) reduction.

We presented the GP-Close algorithm, which can mine generalized relationsets more efficiently than the Cumulate algorithm.

Questions?

Introduction

Association Rule Mining

$X, Y \subseteq I$ are called itemsets.

Find all the rules $X \rightarrow Y$ ($X \cap Y = \emptyset$) with minimum confidence and support:

support, $s$: the probability that a transaction contains $X \cup Y$.

confidence, $c$: the conditional probability that a transaction containing $X$ also contains $Y$.

Let min_support = 50%, min_conf = 50%:

$A \rightarrow C$ (50%, 66.7%)

$C \rightarrow A$ (50%, 100%)

| Transaction-id | Items bought |
|---|---|
| 10 | A, B, C |
| 20 | A, C |
| 30 | A, D |
| 40 | B, E, F |
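The support and confidence values quoted for A → C and C → A can be checked directly against this table. The helper names below are hypothetical; the transactions are the four rows above.

```python
# Transactions from the table above.
DB = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}


def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(set(itemset) <= t for t in DB.values()) / len(DB)


def confidence(x, y):
    """Conditional probability that a transaction containing x also contains y."""
    return support(set(x) | set(y)) / support(x)


s_ac = support({"A", "C"})          # 2 of 4 transactions
c_a_to_c = confidence({"A"}, {"C"})  # 2 of the 3 transactions containing A
c_c_to_a = confidence({"C"}, {"A"})  # 2 of the 2 transactions containing C
```

Both rules share the same support because support depends only on the union of the two sides; the confidences differ because A appears in one transaction without C.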

Introduction

Maximal and Closed Patterns

Apriori-like algorithms

Search the itemset lattice in a bottom-up and breadth-first manner.

Problem

Scan the DB many times.

Enumerating all patterns is costly and unnecessary, because if an itemset $X$ is frequent, any subset $Y \subseteq X$ must also be frequent.

[Figure: itemset lattice]

ABCD

ABC  ABD  ACD  BCD

AB  AC  BC  BD  CD

A  B  C  D

{}
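The bottom-up, level-by-level search and the subset property it relies on can be sketched as a minimal Apriori-style loop. This is a textbook-style illustration, not code from the presentation. The function name is hypothetical and the data reuses the A to F transaction table with a minimum support count of 2.

```python
from itertools import combinations


def apriori(transactions, minsup_count):
    """Level-wise search: one DB scan per level, with subset-based pruning."""
    transactions = [frozenset(t) for t in transactions]
    level = [frozenset([i]) for i in sorted(set().union(*transactions))]
    frequent = {}
    k = 1
    while level:
        # One scan of the DB counts all candidates of the current size.
        counts = {c: 0 for c in level}
        for t in transactions:
            for c in level:
                if c <= t:
                    counts[c] += 1
        current = {c: n for c, n in counts.items() if n >= minsup_count}
        frequent.update(current)

        # Join step: build candidates of size k + 1 from frequent k-itemsets.
        k += 1
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k - 1)-subset of a candidate must be frequent.
        level = [
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        ]
    return frequent


DB = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
frequent = apriori(DB, minsup_count=2)
```

The loop needs one DB scan per lattice level, which is the "scan DB many times" cost noted above; the prune step is where the subset property saves work.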