Sci.Int.(Lahore), 25(2), 249-259, 2013        ISSN 1013-5316; CODEN: SINTE

INCREMENTAL DISTANCE ASSOCIATION RULES MINING

Ahmed Z. Emam, Lama A. Al-Zaben

College of Computer and Information Science, King Saud University
Riyadh, Saudi Arabia 11543
aemam@ksu.edu.sa, lama.alzaben@gmail.com

ABSTRACT: Association Rules Mining (ARM) is, simply, mining a given dataset for items that appear together frequently in order to form rules, while Distance-Based Association Rule Mining (DARM) is an enhanced Apriori algorithm that also takes the distance between each pair of items within an itemset into consideration. Maintaining such discovered rules over large, changing datasets is crucial; this is the main idea behind Incremental Association Rules Mining (IARM), which has recently received much attention from data mining researchers. In this paper we add an incremental feature to the distance-enhanced association rule mining algorithm. The proposed algorithm updates a set of distance-preserving frequent itemsets to reflect changes made to the original dataset. The update is done by making use of knowledge generated from a previous mining run on the original dataset and kept for future updates. This knowledge includes the number of occurrences of each frequent itemset, in addition to the mean and standard deviation of the distances observed within those occurrences. When the original dataset is changed, the original dataset is accessed only to gather information about itemsets that were not previously frequent but have the potential to be frequent now. Finally, the resulting algorithm has been implemented in Weka.

Keywords: Association Rules Mining; Incremental Association Rules Mining; Distance-Based Association Rule Mining


I. INTRODUCTION

Association Rules Mining (ARM) is mining a given database for items that appear together frequently in order to form rules. The rules that are derived state that if some specific values occur, then it is likely that some other specific values will occur too. In ARM these values are called items and a set of values is an itemset. The idea was initially introduced as market basket analysis, in which experts wanted to know which items are usually bought together; for example, mining the transactional database of a supermarket might show that milk and bread are commonly bought together. Such information can help in developing marketing strategies to increase profits. ARM is not restricted to business areas; it may reveal valuable information in many domains, and therefore much research and many algorithms have been developed to improve this field. To state the problem formally, let $I = \{i_1, i_2, \ldots, i_n\}$ be the set of all items in a transactional database; the items are drawn from the domains of all attributes of interest. $D$ is a set of transactions and each transaction is a subset of $I$. The number of transactions in $D$ is denoted $|D|$. Assume that $X$ and $Y$ are two itemsets such that $X \subseteq I$, $Y \subseteq I$ and $X \cap Y = \emptyset$;

the rule $X \Rightarrow Y$ means that whenever $X$ occurs, there is a chance that $Y$ will occur too. The left-hand side of a rule is called the antecedent or premise and the right-hand side is called the consequent or succedent. An itemset can be of any size; a k-itemset is an itemset with k items. To state whether an itemset appears frequently, a user threshold needs to be defined. This threshold is usually a percentage defined by the user and multiplied by the number of transactions in the database. For an itemset to be frequent it needs to be supported by a number of transactions equal to or greater than this threshold, which is referred to as the minimum support. Moreover, rules formed from these frequent itemsets also have to pass a threshold in order to produce valuable knowledge. This threshold is called the minimum confidence and is a percentage defined by the user. The confidence of a rule is the ratio of the support of the whole itemset to the support of the antecedent part of the rule.
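To make these definitions concrete, the short sketch below computes the support count of an itemset and the confidence of a rule over a toy transaction list; the transactions and the 50% threshold are invented for illustration and are not taken from any dataset used later in this paper.

```python
# Illustrative sketch of support and confidence (toy data, invented values).
transactions = [
    {"milk", "bread", "eggs"},
    {"milk", "bread"},
    {"bread", "butter"},
    {"milk", "eggs"},
]

def support_count(itemset, data):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in data if itemset <= t)

def confidence(antecedent, consequent, data):
    """support(antecedent union consequent) / support(antecedent)."""
    return support_count(antecedent | consequent, data) / support_count(antecedent, data)

min_support = 0.5                              # user-defined percentage
threshold = min_support * len(transactions)    # absolute support count threshold

X, Y = {"milk"}, {"bread"}
print(support_count(X | Y, transactions) >= threshold)   # True: {milk, bread} is frequent
print(confidence(X, Y, transactions))                     # 0.666...: confidence of milk => bread
```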

By restricting the consequent part to a specific attribute in all generated rules we get a special kind of association rules called class association rules [1]. Class association rules may be used as classifiers. The association rule mining process can be divided into two phases: the first phase extracts all frequent itemsets, and the second phase forms rules from these frequent itemsets. Currently, several ARM algorithms exist and most of them are based on the Apriori algorithm [2]. In some domains there is interest in the distances between items within an itemset. Rules are then formed from frequent itemsets in which the distances between items are preserved across all their occurrences. Distance-Enhanced Association Rule Mining (DARM) [3, 4] is an algorithm that generates such rules and has proven useful in the gene expression field. When the original database is updated, the set of frequent itemsets may change accordingly. A number of contributions [5, 6, 7] address the problem of updating the set of frequent itemsets. Their goal is to make use of the previous set of frequent itemsets and update it with the minimum processing.


II. RELATED WORK

In this section we present three types of ARM. First we present general association rule mining, in which all we are concerned about is the presence of items in a transaction for a fixed-size database. Then we present an algorithm in which not only the presence of an item is considered, but the distance between specific items within a transaction is considered as well. Finally, we present an incremental association rule mining technique, which is one of the solutions to the problem of maintaining frequent itemsets.

A. Apriori-Based Algorithm



Apriori [2] is one of the earliest algorithms in ARM. The algorithm takes as input a set of transactions (the dataset), a user-defined minimum support and a minimum confidence. The output is a set of rules along with their confidence and support. Apriori is a level-wise algorithm; it starts by extracting frequent 1-itemsets and increases the size by one in each succeeding iteration. It uses prior knowledge gained from the previous iteration, hence its name. In each iteration it first forms the current level's candidate set, then after a database scan it prunes the infrequent itemsets so that the remaining itemsets are all the frequent itemsets of the current level. These frequent itemsets are self-joined in the next iteration to generate the candidates. Before scanning the database, Apriori prunes any itemset with an infrequent subset, because such an itemset cannot be part of any frequent itemset. This monotonicity property of Apriori helps in minimizing the set of candidates generated for each next level.

The algorithm works as follows. In the first iteration it starts with a set of candidates composed of all items in the database. Then a whole database scan is conducted to determine the support of each candidate. Next, candidates with a support below the minimum support threshold are pruned and the remaining candidates form the set of frequent 1-itemsets, labeled L1. The next level's candidates C2 are formed by self-joining L1 and are processed as C1 was in order to generate L2. Starting from the third iteration, candidates are formed by joining each pair of frequent itemsets from the previous level if all of their first k-1 items are identical and the k-th item of the first itemset is less than the k-th item of the second one. Furthermore, any candidate that contains an infrequent subset is removed. Next, the process continues as in the first two iterations by scanning the database and pruning infrequent itemsets. The algorithm continues until the set of candidates is empty.
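The level-wise loop just described can be sketched compactly as follows. This is an illustrative implementation written for this article (the toy transactions and the support count of 3 are assumptions), not the original Apriori code.

```python
from itertools import combinations

def apriori(transactions, min_support_count):
    """Level-wise frequent-itemset mining as described above (illustrative sketch)."""
    # Level 1: all single items as candidates.
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while level:
        # Scan the database and keep the candidates that meet the support threshold.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        current = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(current)
        # Self-join the frequent k-itemsets to build the (k+1)-candidates ...
        keys = sorted(current, key=sorted)
        candidates = set()
        for a, b in combinations(keys, 2):
            union = a | b
            if len(union) == k + 1:
                # ... and prune any candidate that has an infrequent k-subset.
                if all(frozenset(s) in current for s in combinations(union, k)):
                    candidates.add(union)
        level = list(candidates)
        k += 1
    return frequent

transactions = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}, {"A", "B", "C"}]
print(apriori(transactions, min_support_count=3))
```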
Once the set of all frequent itemsets is available, all that is needed is to generate the rules. First, for each frequent itemset we generate all possible subsets and then form a rule by placing the subset in the antecedent part and all the other items belonging to the same itemset in the consequent part. Such a rule is only kept if its confidence is equal to or greater than the minimum confidence threshold. Extracting the frequent itemsets is the major part of the work before generating the rules and consumes a lot of memory, CPU time and I/O operations. Generating rules, on the other hand, is a straightforward process; therefore much research has focused on improving the first part. Many algorithms based on the Apriori algorithm have been developed that reduce the number of scans, such as FAST Apriori and others.

B. Distance-Enhanced Association Rule Mining

DARM [3, 4] is an enhanced Apriori algorithm that takes the distance between each pair of items within an itemset into consideration. For a given itemset, meeting the minimum support threshold is not enough: each itemset also maintains the variation of the distances between each pair of its items across all of its occurrences. For a given itemset to be considered frequent, its variation of distances must not exceed a specific value, in addition to having a support greater than the minimum support. The variation in distances is captured using the coefficient of variation of distances (cvd). An itemset has a cvd for each pair of numeric items that belongs to that itemset, and each cvd is the ratio between the standard deviation and the mean of the distances observed for that pair over all of the itemset's occurrences. The DARM parameters are the minimum support, the minimum confidence and the maximum cvd. DARM works like Apriori in two phases. In the first phase it generates frequent itemsets that 'deviate properly'. 'Deviate properly' is a property used to prune itemsets whose supersets will either fail to reach the minimum support threshold or will contain a pair of items with a cvd above the maximum cvd. For an itemset to deviate properly, the cvd between each pair of its items needs to be below the maximum cvd in some subset of occurrences of size at least the minimum support threshold. This property has no effect on the output but was added to enhance the efficiency of the DARM algorithm by minimizing the number of candidates; note that the max-cvd constraint itself is a non-monotonic property. In the next phase DARM generates rules similarly to Apriori, but with an additional check: itemsets whose support is greater than the minimum support are removed if any cvd between any pair of their items is above the maximum cvd when all occurrences, and not only a subset, are considered. The remaining itemsets are used to form Distance-based Association Rules (DAR). The DARM algorithm was introduced to analyze gene expression, but it can be applied in other domains such as finance, retail and many others. DARM is also not restricted to distance; any other measurement that needs to be captured between a pair of items, such as time or cost, can be used. The limitation is that DARM considers only a single measurement.
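Since the cvd is simply the ratio of the sample standard deviation to the mean of the observed distances, the DPF test can be sketched as below. The occurrence data are invented for the example, and `statistics.stdev` computes the n-1 (sample) standard deviation, which matches the worked computations later in this paper.

```python
from statistics import mean, stdev

def cvd(distances):
    """Coefficient of variation of the distances observed for one numeric pair."""
    return stdev(distances) / mean(distances)

def is_distance_preserving(pair_distances, max_cvd):
    """An itemset is distance preserving when every numeric pair stays within max_cvd."""
    return all(cvd(d) <= max_cvd for d in pair_distances.values())

# Invented occurrences of one itemset that has two numeric pairs.
pair_distances = {("A", "B"): [12, 40, 9, 11], ("A", "C"): [30, 28, 33, 29]}
print(cvd(pair_distances[("A", "B")]))              # large spread -> cvd well above 0.5
print(is_distance_preserving(pair_distances, 0.5))  # False: the (A, B) pair violates max_cvd
```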
Since association rule algorithms such as Apriori generate the list of association rules by processing the whole database, reprocessing the whole database whenever a set of records is added or deleted is costly. Therefore a number of algorithms have been proposed to solve the issue of maintaining association rules for dynamic databases. The basic dilemma in updating the dataset is that adding transactions increases the number of transactions in the whole dataset, which raises the absolute minimum support threshold, so previously frequent itemsets may now become infrequent. The next sections explore the basic algorithms used as the basis for incremental association rule mining.

C. Fast Update Algorithm (FUP)

The first algorithm to maintain association rules against a dynamic database was FUP [5]. This algorithm is built upon Apriori [2] with the aim of minimizing the number of candidates as much as possible. This minimization is achieved by using previous information, namely the set of frequent itemsets, with their support counts, generated for the original database. To update the set of frequent itemsets, FUP separates old frequent itemsets from new, potentially frequent itemsets. The reason for this separation is that the support counts of the old frequent itemsets are already known; all that is needed is to scan the new set of records, count their occurrences there and add them to the old support. The result is the support of the itemset over the whole updated database, so for old frequent itemsets FUP can decide whether a candidate remains frequent without a database scan. Unfortunately this is not the case for newly frequent itemsets, whose support in the original database is unknown, and a scan is therefore needed. In this case FUP tries to minimize the number of candidates as much as possible. For an itemset to be frequent in the updated database it must at least be frequent in the newly added set of records, so FUP first scans that set to identify the itemsets that are frequent there. Only if there are frequent itemsets within the new set does the original database need to be scanned; the best case is when there are no potentially frequent new itemsets. As in Apriori, FUP is a level-wise algorithm that iterates from itemsets of size 1 up to k-itemsets, and a set of frequent itemsets is generated in each iteration.
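The first iteration of this addition-only update can be sketched as follows. Only the 1-itemset level is shown, and the data structures and helper names are ours, chosen for illustration; the point is that known (old) frequent itemsets need only the increment scanned, while unknown itemsets must first be frequent inside the increment before the original database is touched.

```python
def fup_add_first_level(old_frequent, db, increment, min_support):
    """Sketch of FUP's 1-itemset iteration for an appended increment.

    old_frequent maps frozenset itemsets to their support counts in `db`.
    """
    def count(itemset, data):
        return sum(1 for t in data if itemset <= t)

    threshold = min_support * (len(db) + len(increment))
    updated = {}
    # Old frequent itemsets: their support in `db` is known, so only scan the increment.
    for itemset, old_count in old_frequent.items():
        total = old_count + count(itemset, increment)
        if total >= threshold:
            updated[itemset] = total
    # Other items: they must be frequent inside the increment to have any chance,
    # and only then is the original database scanned for them.
    for item in {i for t in increment for i in t}:
        itemset = frozenset([item])
        if itemset in old_frequent:
            continue
        if count(itemset, increment) >= min_support * len(increment):
            total = count(itemset, db) + count(itemset, increment)
            if total >= threshold:
                updated[itemset] = total
    return updated
```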
The FUP algorithm deals only with additions, but deletion also needs to be considered, since it too changes the list of frequent itemsets. By deleting a set of records, some infrequent itemsets may become frequent, because the absolute minimum support threshold decreases with the number of transactions; on the other hand, some frequent itemsets may become infrequent because their occurrences were concentrated in the deleted records. An updated version of FUP that handles this is presented in the next section. Parthasarathy et al. (1999) proposed an active association rule mining algorithm that combined the features of Apriori and FUP while using a client-server mechanism that allowed the user to query interactively without accessing the dataset. As extensions and refinements of the aforementioned algorithms, several other algorithms were also introduced for mining association rules, including a frequent pattern-growth method, multidimensional association rules, a HITS and LOGSOM algorithm, and parallel and distributed algorithms. Interested readers should refer to Dunham (2003) and Kantardzic (2003) for further details of these algorithms.

D. Fast Update Algorithm 2 (FUP2)

In this section we present an updated version of FUP named FUP2 [6], which is also based on Apriori [2]. This algorithm behaves exactly as FUP in the case of addition and works in the complementary way in the case of deletion. In the case of deletion, the algorithm starts by forming the set of all 1-itemsets as candidates and separating them into two lists: the first is the list of old frequent itemsets, and the second contains all remaining candidates. After scanning the deleted set, the support count of each itemset within the decrement is known. For each itemset we then need to determine whether or not it is frequent in the remaining set of instances in the database. For the first list, FUP2 subtracts the support in the decrement from the total support in the original database. For the second list, on the other hand, the support in the original database is not known. As in FUP, the objective is to minimize the number of candidates; the algorithm therefore uses the fact that an itemset cannot be frequent after the deletion if it was infrequent in the original database and frequent in the decrement. So, by scanning the decrement, such itemsets can be identified and removed, and for the remaining itemsets a database scan is required to determine which become frequent. As in Apriori, all discovered frequent itemsets are then joined to form the next level's candidates and the process is repeated. When adding and deleting are done at the same time, the candidates are again divided into old frequent itemsets and all remaining itemsets. The algorithm starts by scanning the decrement to determine the corresponding support count of every candidate. For old frequent itemsets it checks whether there is any possibility that the itemset remains frequent: its support in the increment is not yet known, but at most it can occur in every added transaction. Candidates that cannot reach the minimum threshold even in this best case can therefore be removed; FUP2 subtracts the support in the decrement from the support in the original database, adds the number of transactions in the increment, and prunes the itemset if the result is less than the minimum support count of the updated database. After scanning the increment, it computes the final support of all remaining candidates in the first list and compares it to the minimum support threshold of the whole updated database to determine whether each is frequent. For the second list, the total support is the (unknown) support in the original database plus the support in the increment minus the support in the decrement, so a database scan is needed to determine the support in the original database. The number of candidates is minimized before this scan: since the support of any second-list itemset in the original database is below the original minimum support threshold, an itemset is pruned if its support in the increment minus its support in the decrement is less than the minimum support threshold applied to the increment size minus the decrement size,
because it then has no potential to be frequent. For the remaining candidates in the second list a database scan is required to determine their total support. After the set of all frequent itemsets of size 1 is defined, the algorithm generates the next level of candidates as Apriori does, and the process is repeated. To improve efficiency, the number of candidates in the second iteration and beyond can be reduced before scanning the increment. To do this, the support of each itemset in the increment is kept in addition to its total support. The increment support is used to determine a high bound for each itemset of the next iteration: the minimum increment support of the two joined itemsets is the high bound of the resulting itemset. Using this high bound the number of candidates can be reduced; the high bound is treated as if it were the actual support in the increment, and itemsets that do not meet the minimum support threshold are deleted. After that, the increment is scanned to update the increment supports of the remaining itemsets, and the algorithm continues as described previously.
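The pruning conditions of FUP2 reduce to simple arithmetic on an itemset's counts. The sketch below collects them in one place; the function names are ours, the variable names (sc, ic, dc) anticipate the notation used later in the IDARM pseudocode of Figure (1), and the numbers in the usage lines are invented.

```python
def old_itemset_may_stay_frequent(sc, dc, ms, tc, d_plus, d_minus):
    """Old frequent itemset: in the best case it occurs in every added transaction,
    so it is pruned when even sc - dc + |D+| misses the updated support count."""
    return sc - dc + d_plus >= ms * (tc - d_minus + d_plus)

def new_itemset_may_become_frequent(ic, dc, ms, d_plus, d_minus):
    """New itemset: it was infrequent in the original database, so it must gain at
    least ms * (|D+| - |D-|) net occurrences in the changed part to become frequent."""
    return ic - dc >= ms * (d_plus - d_minus)

# Invented example: 100 original transactions, 20 added, 10 deleted, ms = 30%.
print(old_itemset_may_stay_frequent(sc=31, dc=5, ms=0.3, tc=100, d_plus=20, d_minus=10))  # True
print(new_itemset_may_become_frequent(ic=4, dc=2, ms=0.3, d_plus=20, d_minus=10))         # False
```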

III. INCREMENTAL DISTANCE-BASED ASSOCIATION RULES MINING (IDARM)


The main goal of Incremental Distance-Based Association Rules Mining (IDARM) is to update a set of DARs when a number of records are added to, or deleted from, a dataset. The idea is to modify FUP2 so that it is applicable to DARs: we want to update a set of DARs without applying DARM to the updated dataset as if it were a completely new one. The mining process can be minimized by making use of the previous knowledge gained from the last mining run together with the FUP2 concepts. Basically, the framework of our algorithm is similar to that of the FUP2 algorithm, and the main problem of IDARM is deciding what needs to be kept as knowledge after a mining run.

A. Frequent Itemsets vs. DPF Itemsets

In IDARM we make use of the FUP2 concepts, so a set of itemsets needs to be kept as part of the gained knowledge. FUP2 keeps the set of frequent itemsets with their supports, but in IDARM the issue is which set is sufficient: the frequent itemsets, as in FUP2, or only the distance-preserving frequent (DPF) itemsets. Because the DPF set is a subset of the frequent set, keeping only the DPF set would be beneficial in terms of space. Unfortunately, it is not sufficient as previous knowledge in IDARM, because an itemset's cvd may increase or decrease whenever an occurrence is added or deleted. Assume an itemset x is frequent but not DPF within a dataset, and that a set of records is then added containing occurrences of x in such a way that x is not frequent within the new set but remains frequent within the whole updated dataset. Since the occurrences of x have increased, its cvd will change and may drop below the maximum cvd. Therefore, if we kept only DPF itemsets as knowledge, x (which is a DPF itemset within the updated dataset) would be missed, because x is not frequent in the new set of records and was not DPF in the original dataset. The following example clarifies the basic idea. Assume we have a dataset of 6 records, a minimum support of 50%, a minimum confidence of 70% and a maximum cvd of 0.5. A frequent itemset AB is found with 4 occurrences (a support of 66.6%), and the distances between A and B are 10, 4, 5 and 3. The corresponding cvd is calculated as follows: the mean is 5.5, the standard deviation is 3.11, and the cvd is 0.57. Therefore AB is not a DPF itemset, since its cvd is greater than the maximum cvd. Later the dataset is updated by adding 6 records. The itemset AB has two occurrences in this new set of records, which makes it infrequent in the new set; as a result, if only DPF itemsets were kept, AB would not be found. On the other hand, if we mine DPF itemsets from the whole updated dataset (12 records) we find that AB is frequent (a support of 50%). If the distances between A and B in the new set of records are 7 and 8, then the corresponding cvd is calculated as follows: the mean is 6.17, the standard deviation is 2.64, and the cvd is 0.43. AB has become a DPF itemset within the updated dataset, since its cvd is now less than the maximum cvd. As a result, we need to keep the frequent, and not only the DPF, itemsets as knowledge for future updates. Since we need to keep all frequent itemsets, the 'deviate properly' concept of DARM is not helpful in IDARM and thus is not used. We still need to distinguish between DPF and merely frequent itemsets, however, because rules must be generated only from DPF itemsets.
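The arithmetic of this example can be checked directly; the distances below are exactly those quoted above, and the sample standard deviation of the `statistics` module matches the one used in the paper.

```python
from statistics import mean, stdev

original = [10, 4, 5, 3]   # A-B distances in the 4 original occurrences
added = [7, 8]             # A-B distances in the 2 added occurrences

print(round(stdev(original) / mean(original), 2))                  # 0.57: AB is not DPF
print(round(stdev(original + added) / mean(original + added), 2))  # 0.43: AB becomes DPF
```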

B. Itemset

When a dataset is updated, all itemsets need to be updated accordingly. Each itemset includes the set of items it represents in addition to its support count. DARM deals with two kinds of items: numeric and nominal. In the case of nominal items, all occurrences of an itemset have exactly the same values, so they are kept only once; when a new occurrence is discovered, all we need to do is increment the support count. Numeric attributes, on the other hand, are different, since each occurrence has a different value and these values are needed to recalculate the cvd when a new occurrence is found. As a result they cannot be managed the way nominal items are. This can be solved by keeping only what is needed to update an itemset's cvd. To determine the new cvd we need the mean and standard deviation of the distances between each pair of numeric items. Keeping only the mean and standard deviation for each pair of numeric items, together with the support count of the itemset, is sufficient; there is no need to keep each distinct distance value of each pair. Whenever an occurrence is discovered, we update the mean and standard deviation based on the distances between the items in that occurrence. To illustrate, assume we have only one pair of numeric items within a specific itemset. For this itemset, when a new occurrence is found with a specific distance between the numeric items, we update the mean and standard deviation as follows, where d_i is the distance within the i-th occurrence, n is the support count after the new occurrence is included, and mean_n and stdv_n are the mean and standard deviation of the distances over those n occurrences. To calculate and update the mean we follow equations (1) and (2):

\sum_{i=1}^{n-1} d_i = mean_{n-1} \times (n - 1)                                        (1)

mean_n = \frac{\sum_{i=1}^{n-1} d_i + d_n}{n}                                           (2)
To calculate the standard deviation we follow equations (3) and (4). When the second occurrence is found (n = 2) the standard deviation is computed directly from the two known values, since mean_1 (the previously stored mean) equals the single previously observed distance d_1:

stdv_2 = \sqrt{\frac{(mean_1 - mean_2)^2 + (d_2 - mean_2)^2}{2 - 1}}                    (3)

When n > 2 the standard deviation is computed from the running sums:

stdv_n = \sqrt{\frac{\sum_{i=1}^{n} d_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} d_i\right)^2}{n - 1}}      (4)



Here the value of \sum_{i=1}^{n} d_i^2 is not stored directly; it is recovered from the stored standard deviation. Squaring both sides of equation (4) gives

stdv_n^2 = \frac{\sum_{i=1}^{n} d_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} d_i\right)^2}{n - 1}

then multiplying both sides by (n - 1):

stdv_n^2\,(n - 1) = \sum_{i=1}^{n} d_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} d_i\right)^2

and finally adding \frac{1}{n}\left(\sum_{i=1}^{n} d_i\right)^2 to both sides:

\sum_{i=1}^{n} d_i^2 = stdv_n^2\,(n - 1) + \frac{1}{n}\left(\sum_{i=1}^{n} d_i\right)^2

Applied to the previously stored values (the first n-1 occurrences), this recovers \sum_{i=1}^{n-1} d_i^2 from stdv_{n-1} and mean_{n-1}; adding d_n^2 then gives the sum of squares needed in equation (4).


The standard deviation is not calculated unless the number
of occurrences is greater than 1. Therefore, when a second
occurrence is found
,

and because this is the first standard
deviation to calculate
,

we use equation
Error! Reference source not found.

while

equation
Error! Reference source not found.

is used whenever the

support count is greater the 2. The minimum support
threshold needs to be greater or equal to the minimum
support that
was
applied on the original dataset. If this
condition was not satisfied then we don’t know the
complete set of frequent itemsets

for the original dataset.
In this case DARM algorithm need to be applied on the
whole data set and IDARM is useless. Therefore, the
minimum support that was used in generating the previous
set of frequent itemset needs to be kept
in order
to ensure
corr
ect results. The number of attributes and each attribute’s
label and type should be kept to
ensure

that the new
increment structure match
es

the original dataset
structure. Additionally the size of the original dataset
should be

kept to cal
culate the new minimum

support
count threshold for the whole dataset.


C.

IDARM Algorithm

The proposed algorithm is built on FUP2 [6], as shown in Figure (1). Starting from the second iteration, two numeric items may be present in the same itemset, and the distance between them needs to be observed and updated. Let D+ be the new set of instances, D- be the set of instances to be deleted, D be the original set of instances, and DKnowledge be an object containing the following information: ms, the minimum support; tc, the total transaction count; and F, the set of frequent itemsets, where each itemset carries sc (support count), ic (increment count), dc (decrement count) and a set of m (mean) and sdv (standard deviation) values, one pair of statistics per numeric item pair. In the initial case both the previous frequent set and the deleted set are empty, and since there is no previous knowledge we need to get the minimum support from the user and keep it for further updates. Figure (2) shows the initial phase of IDARM. The initial phase differs from plain DARM because all frequent itemsets, and not only the DPF itemsets, have to be kept on disk for future updates; it is accomplished by applying the algorithm in Figure (2).
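The knowledge object described above can be summarized as a small record structure. The sketch below is ours (only the field names follow the notation in the text) and is meant to make the bookkeeping explicit rather than to mirror the Weka implementation.

```python
from dataclasses import dataclass, field

@dataclass
class PairStats:
    m: float = 0.0        # mean of the distances for one numeric pair
    sdv: float = 0.0      # standard deviation of those distances

@dataclass
class ItemsetRecord:
    sc: int = 0           # support count in the whole dataset
    ic: int = 0           # increment count (occurrences inside D+)
    dc: int = 0           # decrement count (occurrences inside D-)
    pairs: dict = field(default_factory=dict)   # numeric pair -> PairStats

@dataclass
class DKnowledge:
    ms: float = 0.0       # minimum support used in the previous mining run
    tc: int = 0           # transaction count of the previously mined dataset
    F: dict = field(default_factory=dict)       # frequent itemset -> ItemsetRecord
```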


Input: D+, DKnowledge, D-, minConf, maxcvd
Output: DKnowledge

MAIN ALGORITHM:
1. From the D+ header get the set of all items as 1-itemsets.
2. For each x in 1-itemsets scan F; if a match exists, update x.sc and flag x as old.
3. Scan D- and for each x in 1-itemsets calculate x.dc; if x is old, x.sc = x.sc - x.dc.
4. For each x in 1-itemsets:
   4.1. if x is old and x.sc < F.ms × (DKnowledge.tc - |D-| + |D+|) - |D+| then delete x
   4.2. else if |D+| - x.dc < F.ms × (|D+| - |D-|) then delete x
5. Scan D+ and for each x in 1-itemsets calculate x.ic and set x.sc = x.sc + x.ic.
6. For each x in 1-itemsets:
   6.1. if x is old and x.sc < F.ms × (DKnowledge.tc - |D-| + |D+|) then delete x
   6.2. else if x.ic - x.dc < F.ms × (|D+| - |D-|) then delete x
7. If a new x in 1-itemsets still exists:
   7.1. scan D to update x.sc for each new x in 1-itemsets
   7.2. for each new x in 1-itemsets, if x.sc < F.ms × (DKnowledge.tc - |D-| + |D+|) then delete x
8. Add 1-itemsets to F+.
9. Set k = 1.
10. While F+.itemsets(k) is not empty:
    10.1. increment k
    10.2. self-join F+.itemsets(k-1) to get each x in k-itemsets and set x.ic to the minimum ic of the joined itemsets
    10.3. prune each x in k-itemsets with an infrequent subset
    10.4. for each x in k-itemsets scan F; if a match exists, update x.sc, x.m, x.sdv and flag x as old
    10.5. for each x in k-itemsets:
          10.5.1. if x is old and x.sc + x.ic < F.ms × (DKnowledge.tc - |D-| + |D+|) then delete x
          10.5.2. else if x.ic < F.ms × (|D+| - |D-|) then delete x
    10.6. scan D- and for each x in k-itemsets calculate x.dc; if x is old, x.sc = x.sc - x.dc and update x.m, x.sdv
    10.7. for each x in k-itemsets:
          10.7.1. if x is old and x.sc + x.ic < F.ms × (DKnowledge.tc - |D-| + |D+|) then delete x
          10.7.2. else if x.ic - x.dc < F.ms × (|D+| - |D-|) then delete x
    10.8. scan D+ and for each x in k-itemsets calculate x.ic and update x.sc = x.sc + x.ic, x.m, x.sdv
    10.9. for each x in k-itemsets:
          10.9.1. if x is old and x.sc < F.ms × (DKnowledge.tc - |D-| + |D+|) then delete x
          10.9.2. else if x.ic - x.dc < F.ms × (|D+| - |D-|) then delete x
    10.10. if a new itemset still exists:
           10.10.1. scan D and for each new x in k-itemsets update x.sc, x.m, x.sdv
           10.10.2. for each new x in k-itemsets, if x.sc < F.ms × (DKnowledge.tc - |D-| + |D+|) then delete x
    10.11. for each x in k-itemsets flag x as DPF if all its cvds < maxcvd
    10.12. add k-itemsets to F+
11. End while.
12. Generate rules only from itemsets flagged as DPF with confidence not less than minConf.
13. Set DKnowledge.F = F+ and DKnowledge.tc = DKnowledge.tc + |D+| - |D-|.
14. Save DKnowledge to the disk.

Figure (1): IDARM algorithm.


Input: D, minSupport, minConf, maxcvd
Output: DKnowledge

ALGORITHM:
1. From the D header get the set of all items as 1-itemsets.
2. Scan D and for each x in 1-itemsets calculate x.sc.
3. For each x in 1-itemsets, if x.sc < minSupport × |D| then delete x.
4. Add 1-itemsets to F.
5. Set k = 1.
6. While F.itemsets(k) is not empty:
   6.1. increment k
   6.2. self-join F.itemsets(k-1) to get each x in k-itemsets
   6.3. prune each x in k-itemsets with an infrequent subset
   6.4. scan D and for each x in k-itemsets calculate x.sc, x.m, x.sdv
   6.5. for each x in k-itemsets, if x.sc < minSupport × |D| then delete x
   6.6. for each x in k-itemsets flag x as DPF if all its cvds < maxcvd
   6.7. add the remaining k-itemsets to F
7. End while.
8. Generate rules only from itemsets flagged as DPF with confidence not less than minConf.
9. Set DKnowledge.F = F, DKnowledge.tc = |D| and DKnowledge.ms = minSupport.
10. Save DKnowledge to the disk.

Figure (2): IDARM Initial Phase.

E. IDARM Case Study

In the following example we first apply the initial IDARM phase to the dataset given in Table (1), which shows the original dataset, transactions T1 to T10.

Table 1: The original dataset (before deleting T1-T4).

TID    Itemset
T1     A(210)B(135)C(5)D,X
T2     B(414)C(139)D,Y
T3     A(140)C(50)D,X
T4     A(140)B(150)C,Y
T5     A(409)B(87)C,Z
T6     A(196)B,Y
T7     A(236)B,Y
T8     C(203)D,Z
T9     A(111)C,Z
T10    A(195)B(249)C(7)D,X

Assume that the minimum support is 30%, the minimum confidence is 50% and the maximum cvd is 0.5. Moreover, we only consider classification rules with at least two values in the antecedent part. Using Table (1), we produce the corresponding set of DARs and build the knowledge needed for future updates. We then apply IDARM to the updated dataset given in Table (2), using the previous knowledge built in the initial phase to produce the updated set of DARs.


Table 2: The updated dataset, after deleting T1-T4 and adding T11-T14.

TID    Itemset
T5     A(409)B(87)C,Z
T6     A(196)B,Y
T7     A(236)B,Y
T8     C(203)D,Z
T9     A(111)C,Z
T10    A(195)B(249)C(7)D,X
T11    B(32)D,Y
T12    B(414)C(139)D,Y
T13    A(116)B(200)C,Y
T14    A(91)B(390)C(208)D,Y

The initial phase of IDARM starts by forming the set of items as 1-itemsets. As a result we have 12 itemsets (each of the 4 items combined with one of the 3 class values). IDARM then scans the original dataset to determine each support count and removes itemsets with a support count less than 3; only the frequent items remain, as shown in Table (3).

Table 3: Frequent 1-itemsets.

Frequent 1-Itemset    Support
A,X                   3
A,Y                   3
B,Y                   4
C,X                   3
C,Z                   3
D,X                   3

Next, the frequent 1-itemsets are joined to generate the 2-itemset candidates, and itemsets that contain an infrequent subset are pruned. As this level's itemsets contain two numeric items, the mean and standard deviation need to be calculated and updated whenever a new occurrence is found, so while scanning the dataset we maintain the new support, mean and standard deviation for each itemset. To show the updating process, assume we have only one 2-itemset, (AC, X), which we call i. We start with T1 and find that it contains i, so we increment its support count by one, set its mean to the distance between the numeric items A and C, and set its standard deviation to zero, as shown in Table (4). Next is T2, which does not contain i, so nothing changes.

Then T3 is encountered, which contains another occurrence of i, so the support count of i is increased to 2. As explained in the section above, the mean is updated as follows:

Table 4: AC,X after scanning T1.

Itemset    Support Count    Numeric Pairs    Mean    Standard deviation
AC, X      1                AC               345     0


mean = \frac{mean \times (SupportCount - 1) + 140}{SupportCount} = \frac{345 + 140}{2} = \frac{485}{2} = 242.5

Updating the standard deviation, as shown in Table (5), is a special case (equation (3)), because the current support count equals 2:

stdv = \sqrt{\frac{((mean \times SupportCount - 140) - mean)^2 + (140 - mean)^2}{SupportCount - 1}} = \sqrt{(345 - 242.5)^2 + (140 - 242.5)^2} = 144.957

Table 5: AC,X after scanning T3.

Itemset    Support Count    Numeric Pairs    Mean     Standard deviation
AC, X      2                AC               242.5    144.957


The
final
support count and mean are calculated as
previously. Standard deviation
,

on the other hand
,

can’t be
calculated in the same way because we don’t know the
distance within each pair for each previous occurrence
.

I
t
can be calculated using the previous value of the standard
deviation. This is the general case of calculating the
standard deviation w
hen the number of occurrences exceeds
two

as shown in table(6)
.


\sum_{i=1}^{n-1} d_i = mean_{n-1} \times (n - 1) = 242.5 \times 2 = 485

\sum_{i=1}^{n-1} d_i^2 = stdv_{n-1}^2 \times (n - 2) + \frac{1}{n-1}\left(\sum_{i=1}^{n-1} d_i\right)^2 = (144.957^2 \times 1) + \frac{485^2}{2} = 138625.03

stdv_n = \sqrt{\frac{\sum_{i=1}^{n-1} d_i^2 + 444^2 - \frac{1}{n}\left(\sum_{i=1}^{n-1} d_i + 444\right)^2}{n - 1}} = \sqrt{\frac{138625.03 + 197136 - \frac{929^2}{3}}{2}} = 155.0495
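Because the three AC distances (345 in T1, 140 in T3 and 444 in T10) are known in this small example, the incremental result can be confirmed against the direct sample statistics:

```python
from statistics import mean, stdev

ac_distances = [345, 140, 444]     # AC distances observed in T1, T3 and T10
print(mean(ac_distances))          # 309.666..., as in Table 6
print(stdev(ac_distances))         # 155.049..., as in Table 6
```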
 




Now the scan is completed and the support count, mean and standard deviation of all 2-itemsets are calculated. Next, we remove those with a support below 30% of the transactions (i.e., a support count below 3).

Table 6: AC,X after scanning the original dataset.

Itemset    Support Count    Numeric Pairs    Mean       Standard deviation
AC, X      3                AC               309.667    155.0495


For the remaining set we label itemsets with cvds less than
0.5
,

as
DPF
shown in table (7)
.

This label is used to
generate rules from only DPF

itemsets. The set will

be
added to F.

Table 7: Frequent 2-itemsets.

2-itemset    Support Count    Numeric Pairs    Mean       Standard deviation    isDPF
AC,X         3                AC               309.667    155.049               N
AD,X         3                AD               343.667    136.530               Y
AB,Y         3                AB               190.667    48.222                Y
CD,X         3                CD               34.000     23.516                N


In a similar way the next iteration's frequent itemsets are generated, as shown in Table (8).


Table 8: Frequent 3-itemsets.

3-itemset    Support Count    Numeric Pairs    Mean       Standard deviation    isDPF
ACD,X        3                AC               309.667    155.049               N
                              AD               343.667    136.530
                              CD               34.000     23.516

The process terminates, as there are no candidates in the next iteration. Rules are then generated from each DPF itemset:

1. A D (3) ==> X (3)    confidence: 100%
2. A B (6) ==> Y (3)    confidence: 50%

All frequent itemsets are kept, each containing its support count, its list of items, and the mean and standard deviation for each pair of numeric items, as shown in Table (9). The minimum support and the number of transactions are kept as well.

Regarding the updating process, the dataset is updated by deleting T1-T4 and adding T11-T14. IDARM first forms the set of all 1-itemsets as candidates, which is the same set produced in the first step of the initial phase. The DKnowledge generated by the initial phase is retrieved, and each candidate is updated with its correspondence in the DKnowledge.F set, as shown in Table (9). Updated itemsets are labeled as old (marked with an asterisk) to distinguish them from new itemsets that were not previously frequent. Next, the deleted set (T1-T4) is scanned and the itemsets are updated accordingly, as shown in Table (10). No candidates are removed, because the old itemsets (AX, AY, BY, CX and DX) all have a support not less than DKnowledge.ms × (DKnowledge.tc - |D-| + |D+|) - |D+| (step 4.1).



New itemsets are all kept, because subtracting their decrement count from the number of added instances gives a result not less than DKnowledge.ms × (|D+| - |D-|), which equals zero in this case (step 4.2). Next, the added set (T11-T14) is scanned to update each itemset's support count and increment count, as shown in Table (11). The old itemsets are now completely updated, so we remove those whose support count is less than the minimum support count of the updated dataset, which is DKnowledge.ms × (DKnowledge.tc - |D-| + |D+|) = 3, as shown in Table (11).
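The thresholds quoted in this part of the example follow directly from the sizes involved (10 original transactions, 4 deleted, 4 added, 30% minimum support), as the short check below shows:

```python
ms, tc, d_plus, d_minus = 0.30, 10, 4, 4

updated_threshold = ms * (tc - d_minus + d_plus)   # minimum support count for the updated dataset
new_itemset_bound = ms * (d_plus - d_minus)        # bound applied to itemsets not previously frequent
step_4_1_bound = updated_threshold - d_plus        # best-case bound used before D+ is scanned (step 4.1)

print(updated_threshold, new_itemset_bound, step_4_1_bound)   # 3.0  0.0  -1.0
```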


Table 9: 1-itemsets after the DKnowledge.F scan.

1-Itemset    Support    ic    dc
A,X*         3          -     -
A,Y*         3          -     -
A,Z          -          -     -
B,X          -          -     -
B,Y*         4          -     -
B,Z          -          -     -
C,X*         3          -     -
C,Y          -          -     -
C,Z*         3          -     -
D,X*         3          -     -
D,Y          -          -     -
D,Z          -          -     -

Table 10: 1-itemsets after the D- scan.

1-Itemset    Support    ic    dc
A,X*         1          -     2
A,Y*         2          -     1
A,Z          -          -     0
B,X          -          -     1
B,Y*         2          -     2
B,Z          -          -     0
C,X*         1          -     2
C,Y          -          -     2
C,Z*         3          -     0
D,X*         1          -     2
D,Y          -          -     1
D,Z          -          -     0

For new itemsets, we need their support count in the unchanged portion of the dataset to determine whether they are frequent. Before scanning, we can reduce the number of candidates by removing itemsets for which the increment count minus the decrement count is less than DKnowledge.ms × (|D+| - |D-|). As a result, B,X is removed, and the unchanged portion of the dataset is then scanned with A,Z, B,Z and C,Y as candidates. The final set of frequent 1-itemsets is shown in Table (12).


Table 11: 1-itemsets after the D+ scan.

1-Itemset    Support    ic    dc
A,X*         1          0     2
A,Y*         4          2     1
A,Z          0          0     0
B,X          0          0     1
B,Y*         6          4     2
B,Z          0          0     0
C,X*         1          0     2
C,Y          3          3     2
C,Z*         3          0     0
D,X*         1          0     2
D,Y          3          3     1
D,Z          0          0     0

Table 12: Frequent 1-itemsets.

Frequent 1-Itemset    Support
A,Y                   4
B,Y                   6
C,Y                   3
C,Z                   3
D,Y                   3

The 2-itemset candidates are formed by joining the 1-itemsets, and the increment count of each candidate is set to the minimum increment count of the two joined itemsets (step 10.2). In our case, joining A,Y and B,Y gives AB,Y with an increment count of 2. This level's candidates are shown in Table (13) with their increment counts. The increment count before scanning the added set is the maximum possible increment count of an itemset; FUP2 calls this value the increment high bound, and it is used to reduce the number of candidates before scanning the added set and calculating the actual increment count.

Table 13: 2-itemsets.

2-Itemset    sc    ic    dc    m    stdv
AB,Y         -     2     -     -    -
AC,Y         -     2     -     -    -
AD,Y         -     2     -     -    -
BC,Y         -     3     -     -    -
BD,Y         -     3     -     -    -
CD,Y         -     3     -     -    -

As in the first iteration, the set of candidates is updated with their correspondences in DKnowledge.F, as shown in Table (14).





Table 14: 2-itemsets after the DKnowledge.F scan.

2-Itemset    sc    ic    dc    m          stdv
AB,Y*        3     2     -     190.667    48.222
AC,Y         -     2     -     -          -
AD,Y         -     2     -     -          -
BC,Y         -     3     -     -          -
BD,Y         -     3     -     -          -
CD,Y         -     3     -     -          -


Step 10.5 minimiz
es

the number of candidates before
scanning
D
-

and D+
.

Old itemsets can be removed if their
previous support
,

plus their increment count
,

is less than the
updated database minimum support
which is, in this
example, 3.
New itemsets can be removed if their increment
count is less than the minimum support of subtracting
the
decrement from increment

sizes which is, in this case,
zero.
This step

can be

beneficial only when |D+|>|D
-
|
. As a result
,
no itemsets will be removed at this stage.

Next, the D
-

will
be scanned and itemsets will be updated. For new itemsets
we only need to update the decrement count whenever a
match exists. Decrement count
is only used to help in
reducing the number of candidates. For old itemsets
,

on the
other hand, we should update their mean and standard
deviation in addition to their support count. In our case, only
one itemset
(
which is AB,Y
)

is old. The itemset AB,Y ha
s
only one occurrence in D
-

as shown in table (15)
,
which is in
T4. The effect of this occurrence on the itemset’s support
count, mean and standard deviation need to be removed

as
shown in table(16)
.


Table 15: AB,Y before the D- scan.

2-itemset    Support Count    Numeric Pairs    Mean      Standard deviation    isDPF
AB,Y         3                AB               190.67    48.222                Y


Support = Support - 1 = 2

mean_{new} = \frac{mean_{old} \times (Support + 1) - 140}{Support} = \frac{572.001 - 140}{2} = 216.001

\sum d_i = mean_{old} \times (Support + 1) = 572.001

\sum d_i^2 = stdv_{old}^2 \times Support + \frac{1}{Support + 1}\left(\sum d_i\right)^2 = (48.222^2 \times 2) + \frac{572.001^2}{3} = 113712.44

stdv_{new} = \sqrt{\frac{\sum d_i^2 - 140^2 - \frac{1}{Support}\left(\sum d_i - 140\right)^2}{Support - 1}} = \sqrt{\frac{113712.44 - 19600 - \frac{432.001^2}{2}}{1}} = 28.284

(the sums run over the three occurrences before the deletion; 140 is the AB distance of the removed occurrence T4)
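The deletion counterpart of the update function in Section III.B works the same way: the sums are recovered from the stored mean and standard deviation, the removed distance is subtracted, and both statistics are recomputed. The sketch and its name are ours; the usage line reproduces the AB,Y numbers above.

```python
from math import sqrt

def remove_occurrence(n, mean, stdv, d_old):
    """Remove one occurrence with distance d_old from a stored (count, mean, stdv)."""
    total = mean * n - d_old                                            # new sum of distances
    sum_sq = stdv ** 2 * (n - 1) + (mean * n) ** 2 / n - d_old ** 2     # new sum of squares
    n_new = n - 1
    mean_new = total / n_new
    stdv_new = 0.0 if n_new < 2 else sqrt((sum_sq - total ** 2 / n_new) / (n_new - 1))
    return n_new, mean_new, stdv_new

print(remove_occurrence(3, 190.667, 48.222, 140))   # (2, ~216.0, ~28.28), as in Table 16
```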

 


Table 16: AB,Y after the D- scan.

2-itemset    Support Count    Numeric Pairs    Mean       Standard deviation    isDPF
AB,Y         2                AB               216.001    28.284                Y


After updating itemsets we check of any candidate may be
removed before scanning D+

as shown in table(17)
.

Table 17: 2-itemsets after the D- scan.

2-Itemset    sc    ic    dc    m          stdv
AB,Y*        3     2     1     216.001    28.284
AC,Y         -     2     1     -          -
AD,Y         -     2     0     -          -
BC,Y         -     3     2     -          -
BD,Y         -     3     1     -          -
CD,Y         -     3     1     -          -


In this example no itemsets are removed, and the added set is scanned with the same set of candidates. Each itemset's support, mean and standard deviation are updated while scanning D+, as shown in Table (18). For old itemsets, their data now reflect the updated dataset, so itemsets with a support less than the minimum threshold of the updated dataset can be removed; in this case the only old itemset has a support not less than the threshold. For new itemsets we need to scan the unchanged portion of the dataset to determine their actual support count. Before performing this scan we remove any itemset for which the increment count minus the decrement count is less than zero; in our case this does not reduce the number of candidates. After completing the scan, as shown in Table (19), we keep only itemsets with a support not less than the minimum threshold and label the DPF itemsets, as shown in Table (20).


Table 18: 2-itemsets after the D+ scan.

2-Itemset    sc    ic    dc    m          stdv
AB,Y         4     2     1     159.750    67.746
AC,Y         2     2     1     398.500    116.673
AD,Y         1     1     0     689.000    0
BC,Y         3     3     2     334.667    117.240
BD,Y         3     3     1     394.333    314.596
CD,Y         2     2     1     173.500    48.790






Table 19: 2-itemsets after the D scan.

2-Itemset    sc    ic    dc    m          stdv
AB,Y         4     2     1     159.750    67.746
AC,Y         2     2     1     398.500    116.673
AD,Y         1     1     0     689.000    0
BC,Y         3     3     2     334.667    117.240
BD,Y         3     3     1     394.333    314.596
CD,Y         2     2     1     173.500    48.790

Table 20: Frequent 2-itemsets.

2-Itemset    sc    ic    dc    m          stdv       isDPF
AB,Y         4     2     1     159.750    67.746     Y
BC,Y         3     3     2     334.667    117.240    Y
BD,Y         3     3     1     394.333    314.596    N


In the next level
,

no candidates can be generated
,

so

the
process terminates. Next,
r
ules

are generated from only DPF
itemsets
,

which are the same rules generated in
the
above
section
,

wi
th a support not less than 30%.

Finally
,
t
he set of
frequent itemsets will be kept in the disk as knowledge for
future updates.

The final
Rules
Found

are
:

1.

A B

6 ==> class=Y 4 conf:(0.67)

2.

B C 5 ==> class=Y 3 conf:(0.6)

IV. IMPLEMENTATION AND EXPERIMENTAL RESULTS

To assess the performance of our algorithms for updating large itemsets, we performed several experiments using Weka [8]. The first challenge was selecting the implementation and testing platform; after surveying the existing data mining tools, the Weka data mining system and the Weka API were selected for implementation and testing. The implementation and the tests were performed on a dual-core Intel i3 system at 2.0 GHz with 4 GB of RAM and a 1 TB hard disk, running Windows 7 Professional. The proposed algorithms were implemented using the Weka API.



Figure (3): Snapshot of DARM and IDARM using Weka.

A first sample set of experiments was conducted as a proof of concept for the proposed algorithms; Figure (3) shows a snapshot of the proposed algorithms running in Weka. The second challenge was finding appropriate motif datasets of different sizes. Synthetic datasets of different sizes were therefore generated to evaluate the performance of the algorithms, using the same approach introduced in the IDARM algorithm section. Several experiments were conducted on these synthetic datasets with minimum supports of 1, 5 and 10%, and each experiment ran several insertions of increment sets into, and several deletions from, the dataset; the execution time of each insertion was recorded and compared against insertion with FUP2. Four experiments with dataset sizes ranging from 1,000 to 1,000,000 motifs were conducted, with thresholds ranging from 0.25 to 0.75. Figure (4) shows the accuracy of Apriori, DARM and the proposed algorithm IDARM: there is no significant difference in accuracy between IDARM and DARM, but there is a significant difference between Apriori and IDARM.




Figure (4): Accuracy of IDARM vs. DARM and Apriori.

Figure (5) shows the execution time for finding the large frequent itemsets using Apriori, DARM and IDARM. For small motif datasets (up to 10K motifs) the execution time of IDARM is less than or equal to that of DARM, while IDARM performs better than both Apriori and DARM when the dataset is larger than 10K.





Figure (5): Frequent-itemset finding time of IDARM vs. DARM and Apriori.


V. CONCLUSION AND FUTURE WORK

The dataset may be changed by the user either frequently or infrequently, and these changes modify the characteristics of the dataset, especially when a distance-based association rule mining approach is used. Insert, delete and update are the most significant operations affecting any incremental mining process. In this research we have combined DARM [3, 4] and FUP2 [6] to obtain an incremental version of DARM. Our algorithm generates distance association rules in addition to the set of all frequent itemsets. Each itemset is augmented with a set of means and standard deviations in addition to its support count and the items it represents. The frequent set is used as knowledge to be accessed whenever the original dataset is updated. IDARM and DARM have been implemented in the Weka workbench. The performance of IDARM will be close or equal to that of FUP2 if IDARM is applied to a set of instances that contains no numeric items; an implementation of IDARM is therefore an implementation of FUP2 as well. The proposed algorithms were implemented using the Weka API and evaluated on synthetic datasets of different sizes. Several sets of experiments were conducted with different minimum supports and confidences; the accuracy of IDARM was close to that of DARM and better than that of traditional Apriori, and IDARM showed significantly better execution time for finding the large itemsets than DARM and Apriori. The developed incremental distance association rules mining technique could reduce the time and computational power needed to mine a dynamic dataset staged over a multi-server, mirrored array, which would lead to more efficient processing.

In the future, IDARM could be implemented and tested over a set of servers distributed across a network, where the servers maintain mirrors of the same dataset that needs to be mined. Each server updates its dataset in isolation from the others, and at some point these updates are sent to the main server as patches; the main server can then use IDARM with its previous knowledge to update its DARs and minimize the number of candidates when scanning the whole dataset. Further work should also discuss the challenges of applying the IDARM technique in different application areas, such as an applied security context; in other words, what types of systems would benefit most from adopting this technique, and what that means for the chief information officer, analyst supervisor, policymaker or other beneficiary of this technical advance. Finally, future work should examine the criteria for setting significance thresholds, and the influence of exogenous factors on dataset behaviour over time, so that updates preserve the initial distance and variability characteristics of the data.



ACKNOWLEDGMENT

This work was supported by the Research Center of the College of Computer and Information Sciences (CCIS), King Saud University, Saudi Arabia.

REFERENCES

[1] B. Liu, W. Hsu and Y. Ma, "Integrating Classification and Association Rule Mining," 1998.
[2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules," 1994.
[3] A. Icev, "DARM: Distance-Based Association Rule Mining," Master's thesis, 2003.
[4] A. Icev, "Distance-enhanced association rules for gene expression," in BIOKDD'03, in conjunction with ACM SIGKDD, 2003.
[5] D. W. Cheung, J. Han, V. T. Ng and C. Y. Wong, "Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique," 1996.
[6] D. W. Cheung, S. D. Lee and B. Kao, "A General Incremental Technique for Maintaining Discovered Association Rules," in Proceedings of the Fifth International Conference on Database Systems for Advanced Applications, 1997.
[7] C.-H. Lee, C.-R. Lin and M.-S. Chen, "Sliding-Window Filtering: An Efficient Algorithm for Incremental Mining," Information Systems, vol. 30, no. 3, pp. 227-244, May 2005.
[8] "Weka homepage," [Online]. Available: http://www.cs.waikato.ac.nz/ml/weka/.
[9] A. Savasere, E. Omiecinski and S. Navathe, "An efficient algorithm for mining association rules in large databases," 1995.
[10] B. Goethals, "Survey on frequent pattern mining," 2002.
[11] J. Han, M. Kamber and J. Pei, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2005.
[12] I. H. Witten, E. Frank and M. A. Hall, Data Mining: Practical Machine Learning Tools and Techniques, 3rd ed., 2011.

