On Efficiently Mining Generalized Association Rules from Large RDF Metadata Collection

Presenter: Jiang Tao, PhD Student

Supervisor: Assoc Prof. Tan Ah Hwee

Outline

Introduction

Generalized Association Rule Mining on RDF Metadata

GP-Close Algorithm

Experiments

Conclusion

Introduction

Association Rule Mining

Association Rule Mining (ARM)

First proposed by Agrawal, Imielinski and Swami (1993).

Let I = {i_1, i_2, …, i_n} be the set of all items in a large database of transactions D, where each transaction T ⊆ I is a set of items with a unique identifier tid. ARM finds rules describing how one subset of items (an itemset) implies another subset.

Example: {milk} → {bread} [support 0.5%, confidence 60%]

Introduction

Association Rule & Frequent Pattern Mining

Two subtasks of ARM:

Find all frequent itemsets (or frequent patterns): the itemsets whose support is at least minsup. Frequent pattern mining is essential for association rule mining.

Generate rules based on the frequent itemsets: given a frequent itemset X, for every non-empty A ⊂ X, generate the rule A → X − A if its confidence is at least minconf.
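The two subtasks can be sketched in a few lines of Python. This is a minimal brute-force illustration, not an efficient miner: it reuses the four-transaction table (items A to F) from the deck's backup slide, and the function names `support`, `frequent_itemsets` and `generate_rules` are made up for this example. Thresholds are treated as inclusive, which matches the deck's example where rules with exactly 50% support qualify.

```python
from itertools import combinations

# The four-transaction example table from the deck (tid -> items bought).
TRANSACTIONS = {
    10: {"A", "B", "C"},
    20: {"A", "C"},
    30: {"A", "D"},
    40: {"B", "E", "F"},
}


def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)


def frequent_itemsets(transactions, minsup):
    """Subtask 1: enumerate all itemsets whose support reaches minsup (brute force)."""
    all_items = sorted(set().union(*transactions.values()))
    result = {}
    for size in range(1, len(all_items) + 1):
        for combo in combinations(all_items, size):
            s = support(frozenset(combo), transactions)
            if s >= minsup:
                result[frozenset(combo)] = s
    return result


def generate_rules(freq, minconf):
    """Subtask 2: for each frequent X and non-empty proper subset A, emit A -> X - A."""
    rules = []
    for x, sup_x in freq.items():
        for size in range(1, len(x)):
            for a in combinations(sorted(x), size):
                a = frozenset(a)
                # A is a subset of a frequent itemset, so it is in `freq` as well.
                conf = sup_x / freq[a]
                if conf >= minconf:
                    rules.append((a, x - a, sup_x, conf))
    return rules


freq = frequent_itemsets(TRANSACTIONS, minsup=0.5)
rules = generate_rules(freq, minconf=0.5)
```

With minsup = minconf = 50%, the only frequent itemset of size two is {A, C}, so the sketch yields exactly the two rules quoted in the deck: A → C (50%, 66.7%) and C → A (50%, 100%).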


Introduction

Closed Pattern Mining

Some patterns are subsumed by others, e.g. {a, b} and {a, b, c} with the same support of 30%. Closed pattern mining was proposed to remove this redundancy.

Closed pattern: a frequent itemset X is closed if there is no itemset Y ⊃ X with support(Y) = support(X).

Algorithms: CHARM, Close, CLOSET, etc.
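The closedness test can be illustrated with a small filter over a table of frequent itemsets. This is only a sketch of the definition, not of CHARM, Close or CLOSET; the support table is hypothetical apart from the {a, b} / {a, b, c} pair at 30% taken from the slide, and supports are stored as integer percentages to keep the equality test exact.

```python
# Hypothetical frequent itemsets with their support in percent.
# Only the {a, b} / {a, b, c} pair at 30% comes from the slide.
SUPPORTS = {
    frozenset("a"): 50,
    frozenset("ab"): 30,
    frozenset("abc"): 30,
}


def closed_itemsets(supports):
    """Keep X only if no proper superset Y in the table has the same support."""
    closed = {}
    for x, sup_x in supports.items():
        subsumed = any(x < y and sup_y == sup_x for y, sup_y in supports.items())
        if not subsumed:
            closed[x] = sup_x
    return closed


closed = closed_itemsets(SUPPORTS)
```

Here {a, b} is dropped because {a, b, c} has the same support, while {a} and {a, b, c} are kept. Checking only supersets inside the table is enough, since a superset with the same support as a frequent itemset is itself frequent.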

Introduction

Generalized Association Rule Mining

tid   Items bought
10    Shirts
20    Ski Pants, Hiking Boots
30    Jackets, Hiking Boots
40    Shirts, Shoes

Users may be interested in associations that span items at different levels of a taxonomy.

Patterns consisting only of items at leaf nodes may be trivial or may fail to reach minsup.

Example (minsup = 50%, minconf = 50%): "Outerwear → Hiking Boots" (support: 50%, confidence: 100%)
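The example numbers can be reproduced by extending every transaction with the ancestors of its items before counting. The taxonomy below is an assumption: the slide's taxonomy figure is not part of the text, so the sketch uses the familiar clothing hierarchy (Jackets and Ski Pants under Outerwear, Outerwear and Shirts under Clothes, Shoes and Hiking Boots under Footwear). The helper names are illustrative.

```python
# Assumed taxonomy (child -> parent); the slide's figure is not in the text.
PARENT = {
    "Jackets": "Outerwear",
    "Ski Pants": "Outerwear",
    "Outerwear": "Clothes",
    "Shirts": "Clothes",
    "Shoes": "Footwear",
    "Hiking Boots": "Footwear",
}

# Transaction table from the slide (tid -> items bought).
TRANSACTIONS = {
    10: {"Shirts"},
    20: {"Ski Pants", "Hiking Boots"},
    30: {"Jackets", "Hiking Boots"},
    40: {"Shirts", "Shoes"},
}


def ancestors(item):
    """All taxonomy ancestors of an item."""
    found = set()
    while item in PARENT:
        item = PARENT[item]
        found.add(item)
    return found


def extend(items):
    """A transaction plus the ancestors of every item it contains."""
    extended = set(items)
    for item in items:
        extended |= ancestors(item)
    return extended


EXTENDED = {tid: extend(items) for tid, items in TRANSACTIONS.items()}


def support(itemset):
    """Fraction of extended transactions containing the whole itemset."""
    return sum(1 for t in EXTENDED.values() if itemset <= t) / len(EXTENDED)


def confidence(lhs, rhs):
    """support(lhs and rhs together) divided by support(lhs)."""
    return support(lhs | rhs) / support(lhs)


rule_support = support({"Outerwear", "Hiking Boots"})
rule_confidence = confidence({"Outerwear"}, {"Hiking Boots"})
leaf_support = support({"Jackets", "Hiking Boots"})
```

Under this taxonomy the generalized rule reaches 50% support and 100% confidence, while the leaf-level pattern {Jackets, Hiking Boots} reaches only 25% and falls below minsup.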

Introduction

Resource Description Framework

W3C RDF (Resource Description Framework)

A specification for describing web resources with semantic information and for interchanging semantic metadata in the Semantic Web environment.

Basic element: the RDF statement, a triple <subject, predicate, object>.

Each statement reflects a binary relation between web resources.

Introduction

RDF Statement

<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ns1="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <ns1:creator rdf:resource="http://www.example.org/staffid/85740"/>
  </rdf:Description>
</rdf:RDF>
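To make the triple view concrete, the statement above can be read back as <subject, predicate, object> with Python's standard `xml.etree.ElementTree`. This is a minimal sketch that only handles the simple `rdf:Description` shape shown on the slide; it is not a general RDF parser, and `extract_triples` is a name chosen for this example.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The slide's statement, written as well-formed RDF/XML.
RDF_XML = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ns1="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://www.example.org/index.html">
    <ns1:creator rdf:resource="http://www.example.org/staffid/85740"/>
  </rdf:Description>
</rdf:RDF>"""


def extract_triples(rdf_xml):
    """Return (subject, predicate, object) tuples for rdf:Description elements."""
    root = ET.fromstring(rdf_xml)
    triples = []
    for description in root.findall(f"{{{RDF_NS}}}Description"):
        subject = description.get(f"{{{RDF_NS}}}about")
        for prop in description:
            # ElementTree tags look like "{namespace}local"; join them into one URI.
            predicate = prop.tag[1:].replace("}", "")
            # Use rdf:resource when present, otherwise fall back to the literal text.
            obj = prop.get(f"{{{RDF_NS}}}resource", prop.text)
            triples.append((subject, predicate, obj))
    return triples


triples = extract_triples(RDF_XML)
```

The single triple has http://www.example.org/index.html as subject, the Dublin Core `creator` property as predicate and the staff resource as object.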

Introduction

RDF Schema

RDF Schema allows RDF vocabularies to be defined:

RDFS classes and RDF properties

Class hierarchy

Property hierarchy

Mining generalized association rules among RDF statements

Applications: RDF query optimization, web resource recommendation, etc.

[Figure: Class Hierarchy. Nodes: Staff, Software Engineer, Project Manager, http://www.example.org/staffid/85740. Edge labels: rdf:type, rdfs:subClass, rdfs:subClass.]

Mining Generalized Association Rules in RDF Datasets

Challenges

A large number of distinct RDF statements:

Even a small model may allow hundreds of thousands of distinct RDF statements (a large item set I).

Each document may contain hundreds of RDF statements (long transactions).

There is a high probability that long generalized patterns exist.

The over-generalization problem

GARM on RDF Metadata

Over-Generalization: An Example

[Figure panels: Sample RDF DB; RDF Vocabulary; Frequent Generalized Relationsets (minsup = 50%)]

GARM on RDF


Relation Hierarchy

GARM on RDF

Over-Generalization: An Example

Two frequent relationsets extracted from the sample DB:

rs1: {<Terrorist Group, participate, Financial Crime>, <Terrorist Group, participate, Car Bombing>} (support: 50%)

rs2: {<Terrorist Group, participate, Financial Crime>, <Terrorist Group, participate, Bombing>} (support: 50%)

rs1 is more interesting than rs2:

rs2 is a generalization of rs1.

rs1 conveys more precise information; rs2 loses information.

A frequent relationset rs is over-generalized if it has a specialization with the same support.
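The definition can be checked mechanically on a toy database. Everything below is an assumption made for illustration: the deck's sample DB and relation hierarchy are figures that did not survive in the text, so the sketch invents four transactions that give rs1 and rs2 the stated 50% support, assumes Car Bombing is a direct child of Bombing, and uses a simple same-size notion of "specialization".

```python
FC = ("Terrorist Group", "participate", "Financial Crime")
CB = ("Terrorist Group", "participate", "Car Bombing")
B = ("Terrorist Group", "participate", "Bombing")

# Assumed relation hierarchy: specific relation -> more general relation.
PARENT = {CB: B}

# Hypothetical sample DB (tid -> relations stated in that document).
DB = {1: {FC, CB}, 2: {FC, CB}, 3: {FC}, 4: {CB}}


def generalizations(relation):
    """The relation itself plus all of its ancestors in the hierarchy."""
    found = {relation}
    while relation in PARENT:
        relation = PARENT[relation]
        found.add(relation)
    return found


def support_count(relationset, db):
    """Transactions in which every relation holds, directly or via a specialization."""
    count = 0
    for relations in db.values():
        covered = set().union(*(generalizations(r) for r in relations))
        if relationset <= covered:
            count += 1
    return count


def is_specialization(spec, gen):
    """Illustrative definition: same size, and the two sets match up relation
    by relation, each relation of `gen` equal to or more general than its match."""
    if spec == gen or len(spec) != len(gen):
        return False
    gen_covered = all(any(g in generalizations(s) for s in spec) for g in gen)
    spec_covered = all(any(g in generalizations(s) for g in gen) for s in spec)
    return gen_covered and spec_covered


def is_over_generalized(rs, frequent, db):
    """True if some frequent specialization of rs has the same support."""
    return any(
        is_specialization(other, rs) and support_count(other, db) == support_count(rs, db)
        for other in frequent
    )


rs1 = frozenset({FC, CB})
rs2 = frozenset({FC, B})
```

On this toy data both relationsets are supported by transactions 1 and 2, so `is_over_generalized(rs2, [rs1, rs2], DB)` holds while the same call for rs1 does not.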

GARM on RDF

Generalization Closure

Over-generalization reduction using generalization closures.

Given a relationset X, its generalization closure is gc(X) = {r | r ∈ X, or ∃ r* ∈ X such that r is a generalization of r*}.

Lemma 1: gc(X) is closed → X is not over-generalized.
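A generalization closure is straightforward to compute once the relation hierarchy is available as a child-to-parents map. The sketch below assumes such a map (`PARENTS`, hypothetical, with Car Bombing directly under Bombing) and applies `gc` to the two relationsets from the earlier example.

```python
FC = ("Terrorist Group", "participate", "Financial Crime")
CB = ("Terrorist Group", "participate", "Car Bombing")
B = ("Terrorist Group", "participate", "Bombing")

# Assumed hierarchy: each relation maps to its direct generalizations.
PARENTS = {CB: [B]}


def gc(relationset, parents):
    """gc(X): every relation in X plus every generalization of a relation in X."""
    closure = set(relationset)
    stack = list(relationset)
    while stack:
        relation = stack.pop()
        for parent in parents.get(relation, ()):
            if parent not in closure:
                closure.add(parent)
                stack.append(parent)
    return frozenset(closure)


gc_rs1 = gc({FC, CB}, PARENTS)  # closure of rs1
gc_rs2 = gc({FC, B}, PARENTS)   # closure of rs2
```

Under this hierarchy gc(rs2) is a proper subset of gc(rs1). Since the deck reports the same 50% support for both relationsets, gc(rs2) is not closed, which is consistent with Lemma 1 and with rs2 being over-generalized.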

GP-Close

We propose GP-Close: mining all closed generalization closures for over-generalization reduction.

Main features:

Generalization closure enumeration.

Hybrid support counting.

A sorting technique that guarantees closures are generated in a specialization-first manner.



GP-Close: Generalization Closure Enumeration

Calculate the frequent 1-relationsets.

Generate the generalization closures of the frequent 1-relationsets.

Recursively merge the smaller closures to generate larger closures.
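The three steps above can be sketched as a small depth-first enumeration. This is not the GP-Close implementation: it leaves out the specialization-first sorting and the hybrid counting, counts support with plain tid-set intersections, and filters closed closures only at the end. The database, the hierarchy and all function names are hypothetical; each string stands for a relation of the form <Terrorist Group, participate, ·>.

```python
def gc(relations, parents):
    """Generalization closure: the relations plus all of their ancestors."""
    closure, stack = set(relations), list(relations)
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in closure:
                closure.add(parent)
                stack.append(parent)
    return frozenset(closure)


def enumerate_closures(db, parents, minsup_count):
    """Return {generalization closure: support count} for frequent closures."""
    # tid-set of every relation; a transaction also counts for all ancestors.
    tids = {}
    for tid, relations in db.items():
        for r in gc(relations, parents):
            tids.setdefault(r, set()).add(tid)

    def tidset(closure):
        return set.intersection(*(tids[r] for r in closure))

    # Step 1: frequent 1-relationsets.
    frequent = [r for r, t in tids.items() if len(t) >= minsup_count]
    # Step 2: their generalization closures (sorted only for repeatable runs).
    seeds = sorted({gc({r}, parents) for r in frequent}, key=sorted)

    found = {}

    # Step 3: recursively merge smaller closures into larger ones.
    def extend(current, start):
        for i in range(start, len(seeds)):
            merged = current | seeds[i]
            if merged == current:
                continue  # this seed adds nothing new
            count = len(tidset(merged))
            if count >= minsup_count:
                found[merged] = count
                extend(merged, i + 1)

    for i, seed in enumerate(seeds):
        found[seed] = len(tidset(seed))
        extend(seed, i + 1)
    return found


def closed_only(found):
    """Drop closures that have a proper superset with the same support."""
    return {c: n for c, n in found.items()
            if not any(c < d and n == m for d, m in found.items())}


# Hypothetical data.
PARENTS = {"CarBombing": ["Bombing"]}
DB = {
    1: {"FinancialCrime", "CarBombing"},
    2: {"FinancialCrime", "CarBombing"},
    3: {"FinancialCrime"},
    4: {"CarBombing"},
}

found = enumerate_closures(DB, PARENTS, minsup_count=2)
closed = closed_only(found)
```

On this toy data the closure {Bombing, FinancialCrime}, which corresponds to the over-generalized rs2, is enumerated but removed by the closedness filter, because {Bombing, CarBombing, FinancialCrime} has the same support.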

GP-Close: Generalization Closure Enumeration and Closure Sorting


GP-Close: Hybrid Support Counting

Two kinds of support counting:

DB scan: for each transaction T, update the support of the candidate relationsets that T supports. This increases I/O overhead.

Using tid-sets: X.tids = {1, 2} and Y.tids = {2, 3} → XY.tids = {1, 2} ∩ {2, 3} = {2}, support(XY) = |XY.tids| = 1. This requires extra physical memory to store the tid-sets.

Hybrid counting:

Allocate a tid-set buffer with a maximum buffer size.

Build tid-sets for a sub-tree of the closure search tree whenever its tid-sets fit in the buffer.
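The trade-off between the two counting modes can be sketched as follows. This is an illustration under stated assumptions, not the GP-Close code: the buffer is modelled as a limit on the total number of tids stored, the per-item counts are assumed to be known from an earlier scan, and falling back to a DB scan when the tid-sets do not fit is an assumption of this sketch. The tiny database reproduces the slide's X.tids = {1, 2}, Y.tids = {2, 3} example.

```python
def build_tidsets(db, items):
    """One pass over the DB: map each item to the set of tids that contain it."""
    tidsets = {item: set() for item in items}
    for tid, transaction in db.items():
        for item in transaction & items:
            tidsets[item].add(tid)
    return tidsets


def count_by_scan(candidates, db):
    """DB scan: each transaction updates the candidates it supports."""
    counts = {c: 0 for c in candidates}
    for transaction in db.values():
        for c in candidates:
            if c <= transaction:
                counts[c] += 1
    return counts


def count_by_tidsets(candidates, tidsets):
    """Tid-set counting: support is the size of the intersected tid-sets."""
    return {c: len(set.intersection(*(tidsets[i] for i in c))) for c in candidates}


def hybrid_count(candidates, db, item_counts, max_buffer_size):
    """Use tid-sets when they fit in the buffer, otherwise scan the DB (assumed fallback)."""
    items = set().union(*candidates)
    needed = sum(item_counts[i] for i in items)  # tids that would be stored
    if needed <= max_buffer_size:
        return "tidset", count_by_tidsets(candidates, build_tidsets(db, items))
    return "scan", count_by_scan(candidates, db)


# The slide's example: X occurs in tids {1, 2}, Y in tids {2, 3}.
DB = {1: {"X"}, 2: {"X", "Y"}, 3: {"Y"}}
ITEM_COUNTS = {"X": 2, "Y": 2}
XY = frozenset({"X", "Y"})

mode_big, counts_big = hybrid_count([XY], DB, ITEM_COUNTS, max_buffer_size=10)
mode_small, counts_small = hybrid_count([XY], DB, ITEM_COUNTS, max_buffer_size=2)
```

Both modes report support(XY) = 1; only the way the count is obtained changes with the buffer size.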

Experiments

GP-Close vs. Cumulate

Dataset: foafPub, http://ebiquity.umbc.edu/resource/html/id/82/

FOAF (Friend of a Friend) vocabulary for describing people and their relationships.

About 6,000 documents and 100,000 RDF statements in total.

[Figures: GP-Close vs. Cumulate, experimental result charts on three slides]

Conclusion

Mining generalized association rules in RDF metadata involves new challenges; one of the main obstacles is the over-generalization problem.

We proposed mining closed generalization closures for over-generalization (OG) reduction.

We present the GP-Close algorithm, which mines generalized relationsets more efficiently than the Cumulate algorithm.

Questions?

Introduction

Association Rule Mining

X, Y ⊂ I are called itemsets.

Find all rules X → Y (X ∩ Y = ∅) with minimum confidence and support:

support, s: the probability that a transaction contains X ∪ Y

confidence, c: the conditional probability that a transaction containing X also contains Y

Let min_support = 50%, min_conf = 50%:

A → C (50%, 66.7%)

C → A (50%, 100%)

[Figure: Venn diagram with regions "Customer buys milk", "Customer buys both", "Customer buys bread"]

Transaction-id   Items bought
10               A, B, C
20               A, C
30               A, D
40               B, E, F

Introduction

Maximal and Closed Patterns

Apriori-like algorithms

Search the itemset lattice in a bottom-up, breadth-first manner.

Problems:

They scan the DB many times.

Enumerating all patterns is costly and unnecessary, because if an itemset X is frequent, every subset Y ⊂ X must also be frequent.

[Figure: itemset lattice over {A, B, C, D}]

ABCD
ABC  ABD  ACD  BCD
AB  AC  BC  AD  BD  CD
A  B  C  D
{}
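The bottom-up, breadth-first search can be sketched as a level-wise loop with one DB scan per level. This is a minimal Apriori-style illustration rather than a tuned implementation; it is run on the four-transaction table (items A to F) from the backup slide, and `apriori` is simply the name chosen here.

```python
from itertools import combinations


def apriori(transactions, minsup_count):
    """Level-wise search: frequent k-itemsets produce the (k+1)-candidates."""
    items = sorted(set().union(*transactions))
    level = [frozenset([i]) for i in items]  # candidate 1-itemsets
    frequent = {}
    k = 1
    while level:
        # One pass over the DB per lattice level.
        counts = {c: 0 for c in level}
        for t in transactions:
            for c in level:
                if c <= t:
                    counts[c] += 1
        current = {c: n for c, n in counts.items() if n >= minsup_count}
        frequent.update(current)

        # Join frequent k-itemsets into (k+1)-candidates, then prune every
        # candidate that has an infrequent k-subset.
        k += 1
        joined = {a | b for a in current for b in current if len(a | b) == k}
        level = [
            c for c in joined
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        ]
    return frequent


# Transaction table from the deck (items bought per transaction).
TRANSACTIONS = [
    {"A", "B", "C"},
    {"A", "C"},
    {"A", "D"},
    {"B", "E", "F"},
]

frequent = apriori(TRANSACTIONS, minsup_count=2)
```

With a minimum support count of 2 (50%), the search stops after the second level: only {A}, {B}, {C} and {A, C} are frequent, so no 3-itemset candidate is generated. Each level costs a full scan, which is the overhead the slide points out.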