On Efficiently Mining Generalized Association Rules from Large RDF Metadata Collection
Presenter: Jiang Tao, PhD Student
Supervisor: Assoc Prof. Tan Ah Hwee
Outline
Introduction
Generalized Association Rule Mining on RDF Metadata
GP-Close Algorithm
Experiments
Conclusion
Introduction – Association Rule Mining
Association Rule Mining
First proposed by Agrawal, Imielinski and Swami (1993).
Let I = {i_1, i_2, …, i_n} be the set of all items in a large database of transactions D, where each transaction is a set of items T ⊆ I with a unique identifier tid.
ARM is to find rules about how one subset of items (itemset) implies another subset.
Example: {milk} → {bread} [0.5%, 60%]
Introduction – Association Rule & Frequent Pattern Mining
Two subtasks of ARM:
Find all frequent itemsets (or frequent patterns), the itemsets whose support is greater than minsup.
Frequent Pattern Mining is essential for association rule mining.
Generate rules based on frequent itemsets:
Given a frequent itemset X, for all A ⊂ X, generate rules A → X − A whose confidence is greater than minconf.
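The two subtasks can be sketched in a few lines of Python. This is a brute-force illustration only: the toy transactions, the function names, and the use of "greater than or equal to" for the thresholds are assumptions, not taken from the slides.

```python
from itertools import combinations

# Toy transactions (assumed for illustration, not data from the slides).
transactions = [
    {"milk", "bread"},
    {"milk", "bread", "butter"},
    {"bread"},
    {"milk", "butter"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(1 for t in db if itemset <= t) / len(db)

def frequent_itemsets(db, minsup):
    """Subtask 1: enumerate every itemset and keep those meeting minsup."""
    items = sorted(set().union(*db))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            s = support(frozenset(combo), db)
            if s >= minsup:  # assumption: threshold is inclusive
                result[frozenset(combo)] = s
    return result

def generate_rules(freq, minconf):
    """Subtask 2: for each frequent X and proper subset A, emit A -> X - A."""
    rules = []
    for x, sup_x in freq.items():
        for k in range(1, len(x)):
            for a in combinations(sorted(x), k):
                a = frozenset(a)
                conf = sup_x / freq[a]  # A is frequent because X is
                if conf >= minconf:
                    rules.append((set(a), set(x - a), sup_x, conf))
    return rules

freq = frequent_itemsets(transactions, minsup=0.5)
rules = generate_rules(freq, minconf=0.7)
```

Enumerating every itemset is exponential in the number of items, which is why practical miners prune the search space.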
Introduction – Closed Pattern Mining
Some patterns are subsumed by others, e.g. {a, b} and {a, b, c} with the same support of 30%.
Closed Pattern Mining was proposed to remove this redundancy.
Closed Pattern: a frequent itemset X is closed if there is no itemset Y ⊃ X with support(X) = support(Y).
Algorithms: CHARM, Close, CLOSET, etc.
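A minimal sketch of the closedness test, assuming the supports of all frequent itemsets are already known. The function name and the extra itemset {a} with 40% support are hypothetical; only {a, b} and {a, b, c} at 30% come from the slide. This is a direct check of the definition, not CHARM, Close, or CLOSET.

```python
def closed_itemsets(supports):
    """Keep itemsets that have no proper superset with the same support."""
    return {
        x: s
        for x, s in supports.items()
        if not any(x < y and s == sy for y, sy in supports.items())
    }

# Supports in percent. {a, b} and {a, b, c} mirror the slide's example;
# the entry for {a} is an assumed value added for contrast.
supports = {
    frozenset("a"): 40,
    frozenset("ab"): 30,
    frozenset("abc"): 30,
}

closed = closed_itemsets(supports)
```

Here {a, b} is dropped because {a, b, c} has the same support, while {a} and {a, b, c} are kept.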
Introduction – Generalized Association Rule Mining
tid   Items bought
10    Shirts
20    Ski Pants, Hiking Boots
30    Jackets, Hiking Boots
40    Shirts, Shoes
Users may be interested in generating associations that span items at different levels of a taxonomy.
Patterns consisting only of items at leaf nodes may be trivial or may not reach minsup.
Example: minsup=50%, minconf=50%
"Outerwear → Hiking Boots" (support: 50%, confidence: 100%)
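The example rule can be checked by extending each transaction with the ancestors of its items. The taxonomy below is an assumption (the taxonomy figure did not survive in the text): Jackets and Ski Pants are taken to be kinds of Outerwear, and the remaining parent links are illustrative.

```python
# Assumed taxonomy as child -> parent links (not given in the slide text).
parent = {
    "Jackets": "Outerwear",
    "Ski Pants": "Outerwear",
    "Outerwear": "Clothes",
    "Shirts": "Clothes",
    "Shoes": "Footwear",
    "Hiking Boots": "Footwear",
}

# Transactions from the slide's table.
transactions = {
    10: {"Shirts"},
    20: {"Ski Pants", "Hiking Boots"},
    30: {"Jackets", "Hiking Boots"},
    40: {"Shirts", "Shoes"},
}

def ancestors(item):
    """All taxonomy ancestors of an item."""
    out = set()
    while item in parent:
        item = parent[item]
        out.add(item)
    return out

def extend(t):
    """A transaction plus every ancestor of its items."""
    return t | {a for i in t for a in ancestors(i)}

extended = [extend(t) for t in transactions.values()]

def support(itemset):
    """Fraction of extended transactions containing the itemset."""
    return sum(1 for t in extended if itemset <= t) / len(extended)

sup = support({"Outerwear", "Hiking Boots"})
conf = sup / support({"Outerwear"})
```

Under the assumed taxonomy, transactions 20 and 30 both contain Outerwear and Hiking Boots, which reproduces the support and confidence stated on the slide.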
Introduction – Resource Description Framework
W3C RDF (Resource Description Framework)
A specification for describing web resources with semantic information and interchanging semantic metadata in the Semantic Web environment.
Element: RDF Statement, a triple <subject, predicate, object>.
Reflects a binary relation between web resources.
Introduction – RDF Statement
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ns1="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <ns1:creator rdf:resource="http://www.example.org/staffid/85740"/>
  </rdf:Description>
</rdf:RDF>
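To make the triple structure concrete, the following sketch reads an RDF/XML fragment of this shape with Python's standard xml.etree.ElementTree and prints <subject, predicate, object>. It is not a general RDF parser: it assumes every property carries an rdf:resource attribute, and the function name triples is hypothetical.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

rdf_xml = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ns1="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <ns1:creator rdf:resource="http://www.example.org/staffid/85740"/>
  </rdf:Description>
</rdf:RDF>"""

def triples(xml_text):
    """Extract (subject, predicate, object) from simple rdf:Description blocks."""
    root = ET.fromstring(xml_text)
    out = []
    for desc in root.findall(f"{{{RDF_NS}}}Description"):
        subject = desc.get(f"{{{RDF_NS}}}about")
        for prop in desc:
            # ElementTree writes qualified names as "{namespace}local";
            # dropping the braces yields the full predicate URI.
            predicate = prop.tag.replace("{", "").replace("}", "")
            # Assumption: the object is a resource, not a literal.
            obj = prop.get(f"{{{RDF_NS}}}resource")
            out.append((subject, predicate, obj))
    return out

for t in triples(rdf_xml):
    print(t)
```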
Introduction – RDF Schema
RDF Schema
Allows defining RDF vocabularies: RDFS Classes, RDF Properties
Class Hierarchy
Property Hierarchy
Mining Generalized Association Rules among RDF statements
Application: RDF query optimization, web resource recommendation, etc.
[Figure: Class Hierarchy, with nodes Staff, Software Engineer, Project Manager and http://www.example.org/staffid/85740, and edges labeled rdf:type and rdfs:subClassOf (twice)]
Mining Generalized Association Rules in RDF Datasets
Challenges
Large number of distinct RDF statements:
A small model may allow hundreds of thousands of different RDF statements (a large set of items I).
Each document may contain hundreds of RDF statements (long transactions).
There is a high probability that long generalized patterns exist.
Over-generalization problem
GARM in RDF Metadata – Over-Generalization: An Example
[Figure panels: Sample RDF DB; RDF Vocabulary; Frequent Generalized Relationset (minsup=50%)]
GARM on RDF – Relation Hierarchy
GARM on RDF – Over-Generalization: An Example
Two frequent relationsets extracted from the sample DB:
rs1: {<Terrorist Group, participate, Financial Crime>, <Terrorist Group, participate, Car Bombing>} (support: 50%)
rs2: {<Terrorist Group, participate, Financial Crime>, <Terrorist Group, participate, Bombing>} (support: 50%)
rs1 is more interesting than rs2:
rs2 is a generalization of rs1
rs1 conveys more precise information; rs2 loses information
A frequent relationset rs is over-generalized if it has a specialization with the same support.
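The definition can be turned into a small check. The sketch below assumes a one-edge hierarchy in which Car Bombing is a kind of Bombing, generalizes statements only through their object (subject and property hierarchies are ignored), and uses hypothetical helper names.

```python
# Assumed hierarchy: child -> parent.
parent = {"Car Bombing": "Bombing"}

def is_ancestor_or_self(a, b):
    """True if a equals b or lies on b's path to the hierarchy root."""
    while True:
        if a == b:
            return True
        if b not in parent:
            return False
        b = parent[b]

def stmt_generalizes(g, s):
    """Statement g generalizes s: same subject and predicate, broader object."""
    return g[0] == s[0] and g[1] == s[1] and is_ancestor_or_self(g[2], s[2])

def is_generalization(general, special):
    """Relationset `general` is a proper generalization of `special`."""
    return (
        general != special
        and all(any(stmt_generalizes(g, s) for s in special) for g in general)
        and all(any(stmt_generalizes(g, s) for g in general) for s in special)
    )

def over_generalized(rs, supports):
    """rs has a specialization with exactly the same support."""
    return any(
        is_generalization(rs, other) and sup == supports[rs]
        for other, sup in supports.items()
    )

FC = ("Terrorist Group", "participate", "Financial Crime")
CB = ("Terrorist Group", "participate", "Car Bombing")
B = ("Terrorist Group", "participate", "Bombing")

rs1 = frozenset({FC, CB})
rs2 = frozenset({FC, B})
supports = {rs1: 50, rs2: 50}  # percent, as on the slide
```

With these inputs, over_generalized(rs2, supports) holds and over_generalized(rs1, supports) does not, matching the slide's conclusion that rs2 adds no information beyond rs1.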
GARM on RDF – Generalization Closure
Over-Generalization Reduction Using Generalization Closure.
Given a relationset X, its generalization closure gc(X) = {r | r ∈ X, or ∃ r* ∈ X such that r is a generalization of r*}.
Lemma 1: gc(X) is closed → X is not over-generalized
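As an illustration of the definition, gc(X) can be computed by adding every generalization of every statement in X. The sketch assumes the hierarchy applies to statement objects only and that Car Bombing is a kind of Bombing; both the hierarchy and the function name are assumptions.

```python
# Assumed hierarchy: child -> parent.
parent = {"Car Bombing": "Bombing"}

def gc(relationset):
    """Generalization closure: X plus all generalizations of its statements."""
    closure = set(relationset)
    for subj, pred, obj in relationset:
        # Walk up the hierarchy, adding one generalized statement per step.
        while obj in parent:
            obj = parent[obj]
            closure.add((subj, pred, obj))
    return frozenset(closure)

FC = ("Terrorist Group", "participate", "Financial Crime")
CB = ("Terrorist Group", "participate", "Car Bombing")
B = ("Terrorist Group", "participate", "Bombing")

closure_rs1 = gc({FC, CB})  # gains the Bombing statement
closure_rs2 = gc({FC, B})   # already closed under generalization
```

Note that gc({FC, CB}) strictly contains {FC, B}, so a relationset and its generalizations end up in the same closure.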
GP-Close
We propose GP-Close: mining all closed generalization closures for over-generalization reduction.
Main Features:
Generalization closure enumeration.
Hybrid support counting.
A sorting technique that guarantees closures are generated in a specialization-first manner.
GP-Close: Generalization closure enumeration
Calculate 1-frequent relationsets.
Generate generalization closures of 1-frequent relationsets.
Recursively merge the smaller closures to generate larger closures.
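The three steps can be sketched as a naive depth-first enumeration. This is not the GP-Close implementation: it omits the closedness check, the specialization-first sorting, and hybrid support counting, and it may visit the same closure more than once. The hierarchy, the toy database, and all names are assumptions.

```python
# Assumed hierarchy: child -> parent (applied to statement objects only).
parent = {"Car Bombing": "Bombing"}

def gc(relationset):
    """Generalization closure of a set of (subject, predicate, object) triples."""
    out = set(relationset)
    for s, p, o in relationset:
        while o in parent:
            o = parent[o]
            out.add((s, p, o))
    return frozenset(out)

def enumerate_closures(db, min_count):
    """Return {closure: support count} for all frequent generalization closures."""
    extended = [gc(t) for t in db]

    def count(c):
        return sum(1 for t in extended if c <= t)

    # Steps 1 and 2: frequent single statements and their closures.
    statements = set().union(*extended)
    seeds = sorted(
        {gc({r}) for r in statements if count({r}) >= min_count}, key=sorted
    )

    found = {}

    # Step 3: recursively merge smaller closures into larger ones.
    def grow(current, start):
        for k in range(start, len(seeds)):
            merged = current | seeds[k]
            if merged == current:
                continue  # seed adds nothing new
            n = count(merged)
            if n >= min_count:
                found[merged] = n
                grow(merged, k + 1)

    grow(frozenset(), 0)
    return found

FC = ("Terrorist Group", "participate", "Financial Crime")
CB = ("Terrorist Group", "participate", "Car Bombing")
B = ("Terrorist Group", "participate", "Bombing")
K = ("Terrorist Group", "participate", "Kidnapping")  # hypothetical statement

db = [{FC, CB}, {FC, CB}, {FC}, {K}]  # toy database, assumed
found = enumerate_closures(db, min_count=2)
```

Because the union of two generalization closures is itself a generalization closure, every merged set stays inside the search space of closures.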
GP-Close: Generalization closure enumeration and Closure Sorting
GP-Close: Hybrid Support Counting
Two kinds of support counting
DB Scan: for each transaction T, update the support of candidate relationsets that T supports.
Increases I/O overhead.
Using tid-sets:
X.tids = {1,2} and Y.tids = {2,3} → XY.tids = {1,2} ∩ {2,3} = {2}, support(XY) = |XY.tids| = 1.
Requires extra physical memory to store tid-sets.
Hybrid counting:
Allocate a tid-set buffer with a max buffer size.
Try to build tid-sets for a sub closure search tree when the tid-sets can fit in the buffer.
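The tid-set variant amounts to a set intersection. The sketch below reproduces the X and Y example and also shows how tid-sets could be built from a horizontal database; the toy database and the helper name build_tidsets are assumptions, and the buffer management of the hybrid scheme is not modeled.

```python
def build_tidsets(db):
    """Map each item to the set of transaction ids that contain it."""
    tidsets = {}
    for tid, items in db.items():
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets

# Toy database chosen so that X and Y get the tid-sets from the slide.
db = {1: {"X"}, 2: {"X", "Y"}, 3: {"Y"}}
tidsets = build_tidsets(db)

x_tids = tidsets["X"]       # {1, 2}
y_tids = tidsets["Y"]       # {2, 3}
xy_tids = x_tids & y_tids   # intersection gives the tid-set of XY
support_xy = len(xy_tids)   # support as an absolute count
```

No database scan is needed once the tid-sets exist, at the cost of keeping them in memory.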
Experiments
GP-Close vs. Cumulate
Dataset: foafPub
http://ebiquity.umbc.edu/resource/html/id/82/
FOAF (Friend Of A Friend) vocabulary for describing people and their relationships.
About 6,000 documents and 100,000 RDF statements in total.
Conclusion
Mining generalized association rules in RDF metadata involves new challenges; one of the main obstacles is the over-generalization problem.
We proposed mining closed generalization closures for over-generalization (OG) reduction.
We presented the GP-Close algorithm, which mines generalized relationsets more efficiently than the Cumulate algorithm.
Questions?
Introduction – Association Rule Mining
X, Y ⊆ I, called itemsets.
Find all the rules X → Y (X ∩ Y = ∅) with minimum confidence and support:
support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction having X also contains Y.
Let min_support = 50%, min_conf = 50%:
A → C (50%, 66.7%)
C → A (50%, 100%)
[Figure: Venn diagram with regions "Customer buys milk", "Customer buys both", "Customer buys bread"]
Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F
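The two rule measures for the table above can be recomputed directly from the definitions. The function names are illustrative.

```python
# Transactions from the table.
db = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}

def support(items):
    """Probability that a transaction contains all of `items`."""
    return sum(1 for t in db.values() if items <= t) / len(db)

def confidence(x, y):
    """Conditional probability that a transaction with x also contains y."""
    return support(x | y) / support(x)

s_ac = support({"A", "C"})          # transactions 10 and 20 out of 4
c_a_to_c = confidence({"A"}, {"C"}) # 2 of the 3 transactions with A
c_c_to_a = confidence({"C"}, {"A"}) # both transactions with C
```

The rule C → A has higher confidence than A → C because every transaction containing C also contains A, while A appears once without C.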
Introduction – Maximal and Closed Patterns
Apriori-like algorithms
Search the itemset lattice in a bottom-up and breadth-first manner.
Problem
Scan the DB many times.
Enumerating all patterns is costly and unnecessary, because if itemset X is frequent, any subset Y ⊆ X must also be frequent.
[Figure: itemset lattice over {A, B, C, D}, listed by level]
ABCD
ABC  ABD  ACD  BCD
AB  AC  BC  AD  BD  CD
A  B  C  D
{}
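A minimal level-wise sketch in the Apriori style shows both the bottom-up, breadth-first search and the one-scan-per-level cost. It is a simplified illustration, not an optimized implementation; the database and the absolute threshold min_count are assumptions.

```python
from itertools import combinations

def apriori(db, min_count):
    """Level-wise search: frequent k-itemsets generate (k+1)-candidates."""
    items = sorted(set().union(*db))
    level = [frozenset([i]) for i in items]
    frequent = {}
    while level:
        # One pass over the database per lattice level.
        counts = {c: sum(1 for t in db if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(current)
        # Join: combine frequent k-itemsets that differ in one item.
        keys = list(current)
        candidates = {a | b for a in keys for b in keys if len(a | b) == len(a) + 1}
        # Prune: every k-subset of a candidate must itself be frequent.
        level = [
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, len(c) - 1))
        ]
    return frequent

# Assumed toy database with an absolute support threshold of 2.
db = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
frequent = apriori(db, min_count=2)
```

The prune step relies on the property stated above: a candidate with an infrequent subset cannot be frequent, so it is never counted.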