Brief Final Report - Yu-Hui Tao (陶幼慧)


National Science Council, Executive Yuan: Research Project Final Report


※※※※※※※※※※※※※※※※※※※※※※※
Incremental Data Mining Using Pre-large Itemsets
※※※※※※※※※※※※※※※※※※※※※※※


Project type: ■ Individual project  □ Integrated project

Project number: NSC 89-2213-E-214-056

Execution period: August 1, 2000 to July 31, 2001



Principal investigator: Tzung-Pei Hong (洪宗貝), Professor

Co-principal investigator: Yu-Hui Tao (陶幼慧), Assistant Professor






This report includes the following required attachments:

□ One report on an overseas business trip or study visit

□ One report on a business trip or study visit to mainland China

■ One report on attending an international academic conference, together with the presented paper

■ One overseas research report for an international cooperative research project

Host institution: I-Shou University











October 20, 2001



Incremental Data Mining Using Pre-large Itemsets











National Science Council Research Project Final Report

利用準大項目集之漸進式資料挖掘
Incremental Data Mining Using Pre-large Itemsets

Project number: NSC 89-2213-E-214-056
Execution period: August 1, 2000 to July 31, 2001

Principal investigator: Tzung-Pei Hong, Department of Information Management, I-Shou University

Co-principal investigator: Yu-Hui Tao, Department of Information Management, I-Shou University

Project participants: 王慶堯, 林桂英, 王乾隆, 李詠騏, Institute of Information Engineering, I-Shou University



1. Chinese Abstract

In this project, we propose the concept of "pre-large itemsets" and design a novel, efficient incremental mining algorithm based on it. Pre-large itemsets are defined by two support thresholds, a lower support threshold and an upper support threshold, and serve to reduce the processing of the original database and to save the cost of maintaining the mined knowledge. The proposed algorithm does not need to process or rescan the original database unless the amount of newly added data becomes too large. Moreover, the algorithm has the property that the larger the database grows, the better the obtained results become, which is especially useful for real-world database applications.

Keywords: data mining, association rule, large itemset, pre-large itemset, incremental mining.


English Abstract

In this project, we propose the concept of pre-large itemsets and design a novel, efficient incremental mining algorithm based on it. Pre-large itemsets are defined using two support thresholds, a lower support threshold and an upper support threshold, to reduce rescanning of the original database and to save maintenance costs. The proposed algorithm does not need to rescan the original database until a certain number of new transactions have arrived. If the size of the database grows larger, then the allowed number of new transactions will be larger too. Therefore, along with the growth of the database, our proposed approach becomes increasingly efficient. This characteristic is especially useful for real applications.

Keywords: data mining, association rule, large itemset, pre-large itemset, incremental mining.


2. Background and Purpose


In the past, many algorithms for mining association rules from transactions were proposed, most of which were executed in level-wise processes. That is, itemsets containing single items were processed first, then itemsets with two items, and so on, adding one more item each time until some criteria were met. These algorithms usually considered the database size static and focused on batch mining. In real-world applications, however, new records are usually inserted into databases, and designing a mining algorithm that can maintain association rules as a database grows is thus critically important.
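The level-wise process described above can be sketched as a short routine. This is a minimal illustration in the spirit of the Apriori-style algorithms cited in this report; the function and variable names are our own, and the candidate-generation step is simplified (no prune step):

```python
def apriori(transactions, min_support):
    """Level-wise mining of large (frequent) itemsets: 1-itemsets are
    processed first, then 2-itemsets, adding one item per pass until
    no candidate reaches the minimum support ratio."""
    n = len(transactions)
    tx = [set(t) for t in transactions]
    # First pass: candidates are the single items.
    candidates = list({frozenset([i]) for t in tx for i in t})
    large = {}
    while candidates:
        counts = {c: sum(1 for t in tx if c <= t) for c in candidates}
        k_large = [c for c, cnt in counts.items() if cnt / n >= min_support]
        large.update({c: counts[c] for c in k_large})
        # Join step: (k+1)-candidates are unions of two large k-itemsets.
        candidates = list({a | b for a in k_large for b in k_large
                           if len(a | b) == len(a) + 1})
    return large
```

For example, with the four transactions [a b], [a c], [a b c], [b] and a minimum support of 0.5, the routine returns the five large itemsets {a}, {b}, {c}, {a, b} and {a, c} together with their counts.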

When new records are added to databases, the original association rules may become invalid, or new implicitly valid rules may appear in the resulting updated databases [7][8][9][11][12]. In these situations, conventional batch-mining algorithms must re-process the entire updated databases to find the final association rules. Two drawbacks may exist for conventional batch-mining algorithms in maintaining database knowledge:

(a) Nearly the same computation time as that spent in mining from the original database is needed to cope with each new transaction. If the original database is large, much computation time is wasted in maintaining association rules whenever new transactions are generated.

(b) Information previously mined from the original database, such as large itemsets and association rules, provides no help in the maintenance process.

Cheung and his co-workers proposed an incremental mining algorithm, called FUP (Fast UPdate algorithm) [7], for incrementally maintaining mined association rules and avoiding the shortcomings mentioned above. The FUP algorithm modifies the Apriori mining algorithm [3] and adopts the pruning techniques used in the DHP (Direct Hashing and Pruning) algorithm [10]. It first calculates large itemsets mainly from newly inserted transactions, and compares them with the previous large itemsets from the original database. According to the comparison results, FUP determines whether re-scanning the original database is needed, thus saving some time in maintaining the association rules. Although the FUP algorithm can indeed improve mining performance for incrementally growing databases, original databases still need to be scanned when necessary. In this report, we thus propose a new mining algorithm based on two support thresholds to further reduce the need for rescanning original databases. Since rescanning the database consumes much computation time, the maintenance cost can thus be reduced by the proposed algorithm.


3. Results and Discussions


In this project, we propose the concept of pre-large itemsets to solve the problem represented by case 3 in FUP: the situation in which a candidate itemset is large for the new transactions but is not recorded in the large itemsets already mined from the original database. A pre-large itemset is not truly large, but promises to be large in the future. A lower support threshold and an upper support threshold are used to realize this concept. The upper support threshold is the same as that used in conventional mining algorithms: the support ratio of an itemset must be larger than the upper support threshold in order to be considered large. On the other hand, the lower support threshold defines the lowest support ratio for an itemset to be treated as pre-large. An itemset with its support ratio below the lower threshold is thought of as a small itemset. Pre-large itemsets act like buffers in the incremental mining process and are used to reduce the movements of itemsets directly from large to small and vice versa.

Considering an original database and newly inserted transactions under the two support thresholds, an itemset may thus fall into one of the nine cases illustrated in Figure 1.

Figure 1: Nine cases arising from adding new transactions to existing databases


Cases 1, 5, 6, 8 and 9 above will not affect the final association rules according to the weighted average of the counts. Cases 2 and 3 may remove existing association rules, and cases 4 and 7 may add new association rules. If we retain all large and pre-large itemsets with their counts after each pass, then cases 2, 3 and 4 can be handled easily. Also, in the maintenance phase, the ratio of new transactions to old transactions is usually very small, and this is more apparent when the database is growing larger. The nine cases of Figure 1 are determined as follows:

                        New transactions
  Original database   Large     Pre-large   Small
  Large               Case 1    Case 2      Case 3
  Pre-large           Case 4    Case 5      Case 6
  Small               Case 7    Case 8      Case 9

An itemset in case 7 cannot possibly be large for the entire updated database as long as the number of new transactions is small compared to the number of transactions in the original database. This is shown in the following theorem.

Theorem 1: Let Sl and Su be respectively the lower and the upper support thresholds, and let d and t be respectively the numbers of the original and new transactions. If t <= (Su - Sl)d / (1 - Su), then an itemset that is small (neither large nor pre-large) in the original database but is large in the newly inserted transactions is not large for the entire updated database.
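The bound in Theorem 1 can be evaluated directly. The following sketch (with hypothetical threshold and database-size values; the symbols follow the theorem) shows how many new transactions can be absorbed safely before a rescan is needed:

```python
from math import floor

def safety_number(d, s_l, s_u):
    """Largest number of new transactions that cannot turn an originally
    small itemset into a large one (the bound of Theorem 1)."""
    return floor((s_u - s_l) * d / (1 - s_u))

# Hypothetical values: 1000 original transactions, Sl = 0.25, Su = 0.5.
f = safety_number(1000, 0.25, 0.5)   # 500 new transactions are safe
```

Note that the bound grows linearly with d: with 10 000 original transactions the same thresholds allow 5 000 new transactions, which is exactly the property exploited in the discussion of growing databases.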

The details of the proposed maintenance algorithm are described below. A variable, c, is used to record the number of new transactions since the last re-scan of the original database.


The proposed maintenance algorithm:

INPUT: A lower support threshold Sl, an upper support threshold Su, a set of large itemsets and pre-large itemsets in the original database consisting of (d + c) transactions, and a set of t new transactions.

OUTPUT: A set of final association rules for the updated database.

STEP 1: Calculate the safety number f of new transactions according to Theorem 1 as follows:

  f = floor( (Su - Sl) d / (1 - Su) ).

STEP 2: Set k = 1, where k records the number of items in the itemsets currently being processed.

STEP 3: Find all candidate k-itemsets Ck and their counts from the new transactions.

STEP 4: Divide the candidate k-itemsets into three parts according to whether they are large, pre-large or small in the original database.

STEP 5: For each itemset I in the originally large k-itemsets Dk^L, do the following substeps:

  Substep 5-1: Set the new count SU(I) = ST(I) + SD(I), where ST(I) is the count of I in the new transactions and SD(I) is its recorded count in the original database.

  Substep 5-2: If SU(I)/(d + t + c) >= Su, then assign I as a large itemset, set SD(I) = SU(I) and keep I with SD(I); otherwise, if SU(I)/(d + t + c) >= Sl, then assign I as a pre-large itemset, set SD(I) = SU(I) and keep I with SD(I); otherwise, neglect I.

STEP 6: For each itemset I in the originally pre-large k-itemsets Dk^P, do the following substeps:

  Substep 6-1: Set the new count SU(I) = ST(I) + SD(I).

  Substep 6-2: If SU(I)/(d + t + c) >= Su, then assign I as a large itemset, set SD(I) = SU(I) and keep I with SD(I); otherwise, if SU(I)/(d + t + c) >= Sl, then assign I as a pre-large itemset, set SD(I) = SU(I) and keep I with SD(I); otherwise, neglect I.

STEP 7: For each itemset I in the candidate itemsets that is not in the originally large itemsets Dk^L or pre-large itemsets Dk^P, do the following substeps:

  Substep 7-1: If I is in the large itemsets Tk^L or pre-large itemsets Tk^P from the new transactions, then put it in the rescan-set R, which is used when rescanning in Step 8 is necessary.

  Substep 7-2: If I is small for the new transactions, then do nothing.

STEP 8: If t + c <= f or R is null, then do nothing; otherwise, rescan the original database to determine whether the itemsets in the rescan-set R are large or pre-large.

STEP 9: Form candidate (k+1)-itemsets Ck+1 from the finally large and pre-large k-itemsets (Lk U Pk) that appear in the new transactions.

STEP 10: Set k = k + 1.

STEP 11: Repeat STEPs 4 to 10 until no new large or pre-large itemsets are found.

STEP 12: Modify the association rules according to the modified large itemsets.

STEP 13: If t + c > f, then set d = d + t + c and set c = 0; otherwise, set c = t + c.

After Step 13, the final association rules for the updated database have been determined.
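As a concrete reading of Steps 1-8 and 13, the following sketch maintains the counts for one level (k = 1) only. Function and parameter names are our own, and the rescan callback merely stands in for the database scan of Step 8:

```python
from math import floor

def maintain_1_itemsets(db_large, db_prelarge, d, c, new_tx, s_l, s_u, rescan):
    """One level (k = 1) of the maintenance algorithm.  db_large and
    db_prelarge map items to their recorded counts SD over the d + c
    original transactions; rescan(items) recounts items over the
    original database when Step 8 requires it."""
    t = len(new_tx)
    total = d + t + c
    f = floor((s_u - s_l) * d / (1 - s_u))        # Step 1: safety number
    st = {}                                       # Step 3: counts ST from
    for tx in new_tx:                             # the new transactions only
        for item in tx:
            st[item] = st.get(item, 0) + 1
    large, prelarge, rescan_set = {}, {}, []

    def classify(item, su_count):                 # Substeps 5-2 / 6-2
        if su_count / total >= s_u:
            large[item] = su_count
        elif su_count / total >= s_l:
            prelarge[item] = su_count             # below s_l: neglect

    for item, sd in {**db_large, **db_prelarge}.items():   # Steps 5 and 6
        classify(item, sd + st.get(item, 0))      # Substep x-1: SU = ST + SD
    for item, cnt in st.items():                  # Step 7: brand-new items
        if item not in db_large and item not in db_prelarge \
                and cnt / t >= s_l:
            rescan_set.append(item)
    if t + c > f and rescan_set:                  # Step 8: rescan if unsafe
        for item, sd in rescan(rescan_set).items():
            classify(item, sd + st.get(item, 0))
    if t + c > f:                                 # Step 13: reset counters
        d, c = d + t + c, 0
    else:
        c = c + t
    return large, prelarge, d, c
```

With the hypothetical state d = 8, c = 0, Sl = 0.25, Su = 0.5, stored counts {a: 5} (large) and {b: 3} (pre-large), and two new transactions [a b x] and [b x], the safety number is 4, so no rescan is triggered: a and b both end up large, the brand-new item x is only queued for a future rescan (safe by Theorem 1), and c becomes 2.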



4. Self-Evaluation


In this project, we have proposed the concept of pre-large itemsets and designed a novel, efficient, incremental mining algorithm based on it. Using two user-specified upper and lower support thresholds, the pre-large itemsets act as a gap to avoid small itemsets becoming large in the updated database when transactions are inserted. Our proposed algorithm also retains the features of the FUP algorithm [7][11].

Moreover, the proposed algorithm can effectively handle cases in which itemsets are small in an original database but large in newly inserted transactions, although it does need additional storage space to record the pre-large itemsets. Note that the FUP algorithm needs to rescan databases to handle such cases. The proposed algorithm does not require rescanning of the original database until a number of new transactions, determined from the two support thresholds and the size of the database, have been processed. If the size of the database grows larger, then the number of new transactions allowed before rescanning will be larger too. Therefore, as the database grows, our proposed approach becomes increasingly efficient. This characteristic is especially useful for real-world applications.

The contents of this report have been published in Intelligent Data Analysis, Vol. 5, No. 2, 2001, pp. 111-129.


5. References


[1] R. Agrawal, T. Imielinski and A. Swami, "Mining association rules between sets of items in large databases," The ACM SIGMOD Conference, pp. 207-216, Washington DC, USA, 1993.

[2] R. Agrawal, T. Imielinski and A. Swami, "Database mining: a performance perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 6, pp. 914-925, 1993.

[3] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," The International Conference on Very Large Data Bases, pp. 487-499, 1994.

[4] R. Agrawal and R. Srikant, "Mining sequential patterns," The Eleventh IEEE International Conference on Data Engineering, pp. 3-14, 1995.

[5] R. Agrawal, R. Srikant and Q. Vu, "Mining association rules with item constraints," The Third International Conference on Knowledge Discovery in Databases and Data Mining, pp. 67-73, Newport Beach, California, 1997.

[6] M.S. Chen, J. Han and P.S. Yu, "Data mining: an overview from a database perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, pp. 866-883, 1996.

[7] D.W. Cheung, J. Han, V.T. Ng and C.Y. Wong, "Maintenance of discovered association rules in large databases: an incremental updating approach," The Twelfth IEEE International Conference on Data Engineering, pp. 106-114, 1996.

[8] D.W. Cheung, S.D. Lee and B. Kao, "A general incremental technique for maintaining discovered association rules," Proceedings of Database Systems for Advanced Applications, pp. 185-194, Melbourne, Australia, 1997.

[9] M.Y. Lin and S.Y. Lee, "Incremental update on sequential patterns in large databases," The Tenth IEEE International Conference on Tools with Artificial Intelligence, pp. 24-31, 1998.

[10] J.S. Park, M.S. Chen and P.S. Yu, "Using a hash-based method with transaction trimming for mining association rules," IEEE Transactions on Knowledge and Data Engineering, Vol. 9, No. 5, pp. 812-825, 1997.

[11] N.L. Sarda and N.V. Srinivas, "An adaptive algorithm for incremental mining of association rules," The Ninth International Workshop on Database and Expert Systems, pp. 240-245, 1998.

[12] S. Zhang, "Aggregation and maintenance for database mining," Intelligent Data Analysis, Vol. 3, No. 6, pp. 475-490, 1999.