Large Dataset Compression Approach Using Intelligent Technique

Ahmed Tariq Sadiq (1,a), Mehdi G. Duaimi (2,b), Rasha Subhi Ali (2,c)

1 Computer Science Department, University of Technology, Baghdad, Iraq
2 Computer Science Department, Baghdad University, Baghdad, Iraq

a drahmaed_tark@yahoo.com, b mehdi_duaimi@scbaghdad.edu.iq, c danafush@Gmail.com


ABSTRACT

Data clustering is the process of putting similar data into groups. A clustering algorithm partitions a data set into several groups such that the similarity within a group is larger than that among groups. Association rule mining is one possible method for analyzing data, but association rule algorithms generate a huge number of rules, many of which are redundant. The main idea of this paper is to compress large databases by using clustering techniques together with association rule algorithms. In the first stage, the database is compressed using a clustering technique, followed by an association rule algorithm: an adaptive k-means clustering algorithm is proposed and combined with the apriori algorithm. Across many experiments, using the adaptive k-means algorithm and the apriori algorithm together gives a better compression ratio and a smaller compressed file size than using either algorithm alone. Several experiments were made on databases of several different sizes. The apriori algorithm increases the compression ratio of the adaptive k-means algorithm when they are used together, but it takes more compression time than the adaptive k-means algorithm alone. These algorithms are presented and their results are compared.

Keywords: Association rule, clustering techniques and compression algorithms.

1. Introduction

Compression is the art of representing information in a compact form rather than in its original or uncompressed form (Pu, 2006). In other words, using data compression, the size of a particular file can be reduced. This is very useful when processing, storing or transferring a huge file, which needs lots of resources (Kodituwakku, et al., 2007). Data compression is widely used in data management to save storage space and network bandwidth (Goetz, et al., 1991). In computer science and information theory, data compression, source coding, or bit-rate reduction involves encoding information using fewer bits than the original representation. Compression can be either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy (it eliminates unwanted redundancy). Compression is useful because it helps reduce the consumption of resources such as storage space or transmission capacity. Because compressed data must be decompressed to be used, this extra processing imposes computational or other costs through decompression. In lossy data compression, some loss of information is acceptable. Depending upon the application, details can be dropped from the data to save storage space. Lossy data compression is used for images, audio, and video (Graham, 1994). Lossless data compression is contrasted with lossy data compression. Lossless data compression has been suggested for many space science exploration mission applications, either to increase the scientific return or
to reduce the requirement for on-board memory, station contact time, and data archival volume. A lossless compression technique guarantees full reconstruction of the original data without incurring any distortion in the process. A lossless data compression technique preserves the source data accuracy by removing redundancy from the application source data. In the decompression process, the original source data are reconstructed from the compressed data by restoring the removed redundancy; the reconstructed data are an exact replica of the original source data. The amount of redundancy removed from the source data is variable and is highly dependent on the source data statistics, which are often non-stationary (Report Concerning Space Data System Standards, 2006). In this paper two intelligent techniques (clustering techniques and association rules) are used to compress large data sets. Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar among themselves and dissimilar to the objects of other groups (Berkhin, 2002). Association rule mining plays a major role in the process of mining data for frequent pattern matching: association rule mining involves picking out unknown interdependences in the data and finding out the rules among those items (Neeraj, et al., 2012). The main objective of this work is to discuss intelligent techniques to compress large data sets.

This paper is organized as follows. Section two reviews related work, Section three gives some terminology on association rules and the apriori algorithm, Section four explains the major clustering techniques and the k-means algorithm, Section five explains the methodology of the compression and decompression algorithms, Section six presents experimental results, and Section seven concludes with a discussion.

2. Related Work

Sonia Dora (Jacob, et al., 2012) focuses on lossless compression for relational databases at the attribute level: the proposed technique compresses three types of attribute (string, integer and date), and its most interesting feature is that it automatically identifies the type of each attribute. I Made Agus Dwi Suarjaya (Suarjaya, 2012) proposes a new algorithm for data compression, called j-bit encoding (JBE). This algorithm manipulates each bit of data in a file to minimize the size without losing any data after decoding, so it is classified as lossless compression. Heba Afify, Muhammad Islam and Manal Abdel Wahed (Afify, et al., 2011) present a differential compression algorithm that is based on producing difference sequences according to an op-code table in order to optimize the compression of homologous sequences in the dataset. István Szépkúti (Szépkúti, 2004) introduces a new method called difference sequence compression. Under some conditions, this technique is able to create a smaller multidimensional database than others such as single count header compression, logical position compression or base-offset compression.

3. Association Rules

Association rule mining is one of the most important and well researched techniques of data mining. It was first introduced by Agrawal, Imielinski, and Swami (Agrawal, et al., 1997). The discovery of "association rules" in databases may provide useful background
knowledge to decision support systems, selective marketing, financial forecasting, medical diagnosis, and many other applications (Yijun, et al., 2000). Mining association rules is an important data mining problem, and association rules are usually mined repeatedly in different parts of a database. Current algorithms for mining association rules work in two steps:

1. Discover the large itemsets, i.e. the itemsets that have support above a predetermined minimum support σ.

2. Use the large itemsets to generate the association rules for the database.

It is noted that the overall performance of mining association rules is determined by the first step, which usually requires repeated passes over the analyzed database. After the large itemsets are identified, the corresponding association rules can be derived in a straightforward manner (Saad et al., 2010).

3.1 Association rules concept

An association rule is a simple probabilistic statement about the co-occurrence of certain events in a database, and is particularly applicable to sparse transaction data sets (Hand et al., 2001). An association rule is a rule which infers certain association relationships among a set of objects (such as objects that occur together, or where one infers the other). In a database, association rule mining works as follows (Adriaans et al., 1998):

Let I be a set of items and D a database of transactions, where each transaction has a unique identifier (tid) and contains a set of items called an itemset. An itemset with k items is called a k-itemset. The support of an itemset X, denoted S(X), is the number of transactions in which that itemset occurs as a subset. A k-subset is a k-length subset of an itemset. An itemset is frequent or large if its support is more than a user-specified minimum support (min_sup) value. Fk is the set of frequent k-itemsets. A frequent itemset is maximal if it is not a subset of any other frequent itemset. An association rule is an expression A ⇒ B, where A and B are itemsets. The rule's support (S) is the joint probability of a transaction containing both A and B, and is given as S(A ∪ B). The confidence of the rule is the conditional probability that a transaction contains B, given that it contains A, and is given as S(A ∪ B)/S(A). A rule is frequent if its support is greater than min_sup and strong if its confidence is more than a user-specified minimum confidence (min_conf). Data mining involves generating all association rules in the database that have a support greater than min_sup (the rules are frequent) and a confidence greater than min_conf (the rules are strong) (Saad et al., 2010). The important measures for association rules are support (S) and confidence (C). They can be defined as:

Definition 1: Support (S)

Support(X, Y) = Pr(X ∪ Y) = count of (X ∪ Y) / total transactions ……………………... (1)

The support (S) of an association rule is the ratio (in percent) of the records that contain (X ∪ Y) to the total number of records in the database. Therefore, if we say that the support of a rule is 5%, it means that 5% of the total records contain (X ∪ Y) (Brin, et al., 1997).

Definition 2: Confidence (C)

Conf(X ⇒ Y) = Pr(X ∪ Y)/Pr(X) = support(X, Y)/support(X) ……………………………. (2)

For a given number of records, confidence (C) is the ratio (in percent) of the number of records that contain (X ∪ Y) to the number of records that contain X. Thus, if we say that a rule has a confidence of 15%, it means that 15% of the records containing X also contain Y. The confidence of a rule refers to the degree of correlation in the database between X and Y, and is also a measure of the rule's strength. Mining consists of finding all rules that meet the user-specified threshold support and confidence (Brin, et al., 1997). As there are two thresholds, we need two processes to mine the rules. The first step is to find all the itemsets whose supports are larger than the support threshold; an itemset is a set of items. Based on the large itemsets, we generate the rules from them, which is the second step. Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong (Han, et al., 2001). An association rule mining problem is thus broken down into two steps: 1) generate all the item combinations (itemsets) whose support is greater than the user-specified minimum support; such sets are called the frequent itemsets; and 2) use the identified frequent itemsets to generate the rules that satisfy a user-specified confidence. The frequent itemset generation requires more effort, and the rule generation is straightforward (Kona, 2003).
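To make these definitions concrete, the following is a minimal Python sketch (not from the paper) that computes equations (1) and (2), assuming transactions are represented as sets of items; the item names are hypothetical:

def support(itemset, transactions):
    # Eq. (1): fraction of transactions that contain the whole itemset.
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    # Eq. (2): support(X union Y) / support(X).
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

# Hypothetical five-transaction database:
T = [{"bread", "milk"}, {"bread"}, {"milk", "eggs"},
     {"bread", "milk", "eggs"}, {"milk"}]
print(support({"bread", "milk"}, T))       # 0.4: 40% of records contain both
print(confidence({"bread"}, {"milk"}, T))  # 0.666...: 2 of 3 bread records contain milk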

4. Unsupervised learning

Unsupervised learning is a branch of machine learning where the aim is to find patterns in raw, unlabeled data. Unlike in supervised learning, where the algorithm is first trained on labeled data so that it can learn and adapt to a particular problem (classification), the general, inherent properties of the data are used to find patterns and organize it (clustering) (Figueroa, 2011).

4.1 Clustering

Clustering is the process of examining a collection of "points" and grouping the points into "clusters" according to some distance measure (Rajaraman, et al., 2011). The goal of clustering is to define clusters of elements in the dataset such that the elements in the same cluster are similar to each other, and each cluster as a whole is distant from the others. Two important classes of clustering can be distinguished: 1) Hierarchical clustering: these techniques can be either agglomerative or divisive. In agglomerative clustering, we start by assigning each element to a different class; successive iterations of the algorithm cluster together the closest classes until all the elements belong to the same, main class. Divisive methods instead perform divisions of the classes into smaller ones. 2) Partitional clustering: the concept of partitional clustering is to divide the data immediately into a certain number of clusters. The most popular algorithm is k-means, which is presented in the next section (Figueroa, 2011).


4.1.1. Partitioning Methods

The partitioning methods generally result in a set of M clusters, with each object belonging to one cluster. Each cluster may be represented by a centroid or a cluster representative; this is some sort of summary description of all the objects contained in the cluster. The k-means method is an example of partitioning clustering (Rai, et al., 2010).

K-means

The k-means algorithm was proposed independently in various scientific fields over 50 years ago. MacQueen was the first, in 1967, to name k-means; his one-pass version of the algorithm defined the first k elements of the dataset as the k classes and successively assigned each next element to the closest class, updating the centroid after each assignment (Figueroa, 2011). The standard k-means algorithm is considered a simple but efficient partitioning algorithm. It divides the data into k clusters, minimizing the squared distance between each element and the center of its cluster. The distance measure is a parameter of the algorithm. The objective function, using the Euclidean distance, is defined as:

J = Σ_{k=1}^{m} Σ_{x_i ∈ C_k} ||x_i − g_k||² ………………………………………………… (3)

Where:

• m is the total number of clusters.
• C_k is the k-th cluster.
• x_i is the vector of the i-th element of the dataset.
• g_k is the vector of the center of the k-th cluster.

K-means then proposes the following iterative method to find a good solution:

1. Initialize the center of each cluster (the centroids); for example, we can arbitrarily choose some elements of the dataset to be the centroids.
2. Reassign each element to the closest centroid.
3. Recompute the center of each cluster.
4. Repeat steps 2 and 3 until the stopping criterion is satisfied. (Figueroa, 2011).
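The sketch below illustrates these four steps; it is a minimal Python version (not from the paper), assuming points are tuples of numbers and using the squared Euclidean distance of eq. (3):

import random

def kmeans(points, k, iters=100):
    # Step 1: initialize centroids by arbitrarily choosing k elements.
    centroids = random.sample(points, k)
    for _ in range(iters):
        # Step 2: reassign each element to the closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # Step 3: recompute each cluster center as the mean of its members.
        new_centroids = [
            tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 4: stop when the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters

# Example: kmeans([(1, 1), (1, 2), (8, 8), (9, 8)], 2) separates the two groups.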


4.1.2. Hierarchical Methods

Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. The basics of hierarchical clustering include the Lance-Williams formula, conceptual clustering, the classic algorithms SLINK and COBWEB, as well as the newer algorithms CURE and CHAMELEON. Hierarchical algorithms build clusters gradually (as crystals are grown). In hierarchical clustering the data are not partitioned into a particular cluster in a single step; instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters each containing a single object. Hierarchical clustering is subdivided into agglomerative methods, which proceed by a series of fusions of the n objects into groups, and divisive methods, which separate the n objects successively into finer groupings. Agglomerative techniques are more commonly used (Rai, et al., 2010).

A. Agglomerative technique

This is a "bottom up" approach: each observation starts in its own cluster, and pairs of
clusters are merged as one moves up the hierarchy. The algorithm forms clusters in a bottom
-
up manner, as follows (Rai, et al., 2010). :

1. Initially, put each
article in its own cluster.

2. Among all current clusters, pick the two clusters with the smallest distance.

3. Replace these two clusters with a new cluster, formed by merging the two original ones.

4. Repeat the above two steps until there is only one re
maining cluster in the pool.

Thus, the agglomerative clustering algorithm will result in a binary cluster tree with single
article clusters as its leaf nodes and a root node containing all the articles.
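The following Python sketch (not from the paper) shows this bottom-up merge loop, assuming single-linkage distance between clusters and a user-supplied point-distance function dist:

def agglomerative(points, dist):
    # Step 1: each point starts in its own cluster.
    clusters = [[p] for p in points]
    merges = []  # record of the binary cluster tree, built bottom up
    while len(clusters) > 1:
        # Step 2: pick the pair of clusters with the smallest
        # single-linkage distance.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        # Step 3: replace the two clusters with their merge.
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)] + [merged]
    # Step 4: one cluster remains, the root containing all points.
    return merges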

B. Divisive technique (Rai, et al., 2010)

This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

1. Put all objects in one cluster.
2. Repeat until all clusters are singletons:
   a) Choose a cluster to split.
   b) Replace the chosen cluster with the sub-clusters.

C. Advantages of hierarchical clustering (Rai, et al., 2010)

1) Embedded flexibility regarding the level of granularity.
2) Ease of handling any form of similarity or distance.
3) Applicability to any attribute type.

D. Disadvantages of hierarchical clustering (Rai, et al., 2010)

1. Ambiguity of termination criteria.
2. Most hierarchical algorithms do not revisit already constructed clusters for the purpose of improvement.

5. Methodology

In the first part of this work the adaptive k-means and apriori algorithms are utilized. With adaptive k-means, one can extract several clusters from each database table, and with the apriori algorithm, relationships among the sets of items can be extracted from each cluster. In the second part of this work the original data must be recovered from the compressed data; to do that, the adaptive k-means decompression algorithm and the apriori decompression algorithm are used.

5.1 Compression algorithms

5.1.1. Adaptive k-means algorithm

Adaptive k-means is a partitioning clustering algorithm used to extract all available clusters in any selected database. In standard k-means the user must determine the number of clusters and the center of each cluster, while in adaptive k-means the number of clusters and the centers of the clusters are determined automatically, without intervention of the user. In this algorithm the user selects a database file and the algorithm automatically selects two attributes; the items available in these selected attributes represent the centers of the clusters. This algorithm has several stages.



Algorithm: adaptive k-means

Input: database file

Output: two text files; the first saves the extracted clusters, and the second saves information about each extracted cluster (the number of items in each cluster and the name of each cluster).

Begin
i. Let DB = database file.
ii. Automatically select two attributes from the input database file. Let G and D be the selected attributes.
iii. The center of each cluster is determined automatically by selecting the items of the chosen attributes without repetition and considering them as centers of clusters. Let (G1, G2, ..., Gn and D1, D2, ..., Dm) be the centers of clusters, where n is the number of distinct items in the first selected attribute and m is the number of distinct items in the second selected attribute.
iv. U = unselected attributes.
v. For each closed item in G1, G2, ..., Gn and D1, D2, ..., Dm, select U.
vi. Print U in the first text file.
vii. Print the information of each cluster in the second file.
viii. Return the compressed files.
End


For example, if the dept table is selected for compression, then the gender and degree columns are selected so that their items become the centers of clusters. The items (male and female) are closed items belonging to the first selected attribute (gender), while (secondary school, B.Sc., Diploma, PhD, MSc, Higher Diploma) are closed items belonging to the second selected attribute. The center of each cluster is then formed as follows: the center of the first cluster is (male, secondary school), the center of the second cluster is (male, MSc), the center of the third cluster is (female, MSc), and so on, as determined by the adaptive k-means algorithm.
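As an illustration of this clustering step, here is a minimal Python sketch (hypothetical; the paper's implementation is in VB.net), assuming each table row is held as a dict, with attr_g and attr_d standing for the two automatically selected attributes (e.g. gender and degree):

from collections import defaultdict

def adaptive_kmeans_compress(rows, attr_g, attr_d):
    # Each distinct (G, D) value pair acts as a cluster center, so the
    # number of clusters is discovered automatically from the data.
    clusters = defaultdict(list)
    for row in rows:
        center = (row[attr_g], row[attr_d])
        # Store only the unselected attributes under the shared center.
        clusters[center].append({k: v for k, v in row.items()
                                 if k not in (attr_g, attr_d)})
    # First output file: the clustered items; second output file:
    # per-cluster info (cluster name and number of items).
    info = {center: len(items) for center, items in clusters.items()}
    return clusters, info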

5.1.2. Apriori Algorithm

The apriori algorithm can be used to generate all frequent itemsets. A frequent itemset is an itemset whose support is greater than the user-specified minimum support (the set of frequent k-itemsets is denoted Lk, where k is the size of the itemset). A candidate itemset is a potentially frequent itemset (denoted Ck, where k is the size of the itemset).

Algorithm: apriori

Input: adaptive k-means result files, min_sup, min_conf

Output: one text file saving the extracted rules and the remaining cluster data.

Begin:
For each itemset l1 ∈ Lk-1
  For each itemset l2 ∈ Lk-1
    If (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k-2] = l2[k-2]) ∧ (l1[k-1] < l2[k-1]) then
      c = l1 join l2; // join step: generate candidates
      If has_infrequent_subset(c, Lk-1) then
        Delete c; // prune step: remove unfruitful candidate
      Else add c to Ck;
Return Ck and the remaining items of clusters and write them in a text file;
End
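The join and prune steps above can be expressed compactly in Python; this is a minimal sketch (not from the paper), assuming Lk-1 is given as a set of sorted (k-1)-tuples, with the names apriori_gen and has_infrequent_subset mirroring the pseudocode:

from itertools import combinations

def has_infrequent_subset(candidate, prev_frequent):
    # Prune step: every (k-1)-subset of a frequent k-itemset must be frequent.
    return any(sub not in prev_frequent
               for sub in combinations(candidate, len(candidate) - 1))

def apriori_gen(prev_frequent):
    # Join step: merge itemsets agreeing on the first k-2 items, with the
    # last item of l1 ordered before that of l2; returns candidate set Ck.
    candidates = set()
    for l1 in prev_frequent:
        for l2 in prev_frequent:
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                c = l1 + (l2[-1],)
                if not has_infrequent_subset(c, prev_frequent):
                    candidates.add(c)
    return candidates

# Example: L2 = {(a,b), (a,c), (b,c)} yields C3 = {(a,b,c)}.
print(apriori_gen({("a", "b"), ("a", "c"), ("b", "c")}))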


5.2 Decompression algorithms

5.2.1. Adaptive k-means Decompression Algorithm

To recover the original data from the compressed data, we reverse the operation of the adaptive k-means compression algorithm. This operation is called the adaptive k-means decompression algorithm. Each input compressed file is used to read out the original data from the compressed file's data. To do this without loss of any information, it is necessary to keep the data sets with an index used in both the adaptive k-means and apriori decompression algorithms. Hence, the data are distributed in the clusters together with each cluster's name, the number of items in each cluster, and the selected attribute names. This information is used as the current output data and forms the next entry to be inserted into the database file. The detailed operation of the adaptive k-means decompression algorithm is as follows.

Adaptive k-means decompression algorithm

Input: compressed files.

Output: original database file.

Begin:
1. Read the cluster items from the first compressed file.
2. Read the cluster information from the second compressed file.
3. Split the data available in the two compressed files.
4. Save the data obtained from the second compressed file into a one-dimensional array.
5. Save the data obtained from the first compressed file into a buffer view.
6. Create a new database file.
7. Fill the new database file with the data needed from the array and the buffer view.
8. Return the original database file.
End
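Continuing the earlier compression sketch, the reverse operation simply re-attaches each cluster's center pair to the items stored under it; again a hypothetical Python illustration, not the paper's implementation:

def adaptive_kmeans_decompress(clusters, attr_g, attr_d):
    # Inverse of adaptive_kmeans_compress: rebuild every row by restoring
    # the (G, D) center values that were factored out during compression.
    rows = []
    for (g, d), items in clusters.items():
        for item in items:
            row = dict(item)
            row[attr_g] = g
            row[attr_d] = d
            rows.append(row)
    return rows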


5.2.2. Apriori Decompression Algorithm

To recover the original data from the compressed data, we reverse the operation of the apriori compression algorithm. This operation is called the apriori decompression algorithm. Each input compressed data file is used to read out the original data from the data set. To do this without loss of any information, it is necessary to keep the data sets with an index used in both the adaptive k-means and apriori decompression algorithms. The important correlations are extracted by the apriori compression algorithm, and both the correlated and the uncorrelated data are saved as clusters with their cluster names into a text file; the second file is the same as the second file used by the adaptive k-means algorithm. This information is used as the current output data and forms the next entry to be inserted into the database file. The detailed operation of the apriori decompression algorithm is as follows:

Apriori decompression algorithm

Input: compressed files.

Output: original database file.

Begin:
1. Read the cluster items from the apriori compressed file.
2. Read the cluster information from the second adaptive k-means compressed file.
3. Split the data available in the two compressed files.
4. Save the data obtained from the second compressed file into a one-dimensional array.
5. Save the data obtained from the apriori compressed file into a buffer view, and save the clusters given by the apriori compressed file into another buffer view.
6. Return the correlated data to the clusters they belong to by using their indexes.
7. Create a new database file.
8. Fill the new database file with the data needed from the array and the buffer views.
9. Return the original database file.
End


6. Experimental Results

For an experimental evaluation of the proposed algorithms, several experiments were performed on real databases. The proposed algorithms are implemented in the VB.net environment. Many databases are used to test the performance of these algorithms. In particular, several clusters are derived from each database file, and a large number of rules can be derived from those clusters depending on the support and confidence thresholds. Some of the results are given in tables 1 to 9. Table 1 shows the time taken to compress several databases by using both the adaptive k-means and apriori algorithms.


Table 1: adaptive k-means vs apriori compression time along with the database file sizes.

Original size (KB) | AK comp time (sec) | AP time (sec) | No. of clusters
128  | 1  | 2   | 6
168  | 1  | 2   | 11
176  | 2  | 2   | 6
356  | 2  | 10  | 14
372  | 11 | 147 | 697
640  | 3  | 31  | 6
704  | 4  | 63  | 43
852  | 7  | 18  | 107
872  | 3  | 82  | 4
1188 | 6  | 157 | 10
1444 | 11 | 81  | 150
2693 | 8  | 31  | 82


In the above table, AK comp time in sec (adaptive k-means compression time in seconds) represents the time taken to compress the database using the adaptive k-means compression algorithm. AP time in sec (apriori compression time in seconds) represents the time taken to compress the database by applying the apriori compression algorithm to the adaptive k-means results. The original size in kilobytes represents the original database size. From the above results we see that the apriori algorithm takes more time than the adaptive k-means algorithm because the apriori algorithm performs two calculations. First it calculates the support (the number of occurrences) of each item and compares it with min_sup; if the support ≥ min_sup, the item is moved to the frequent item list. Next it calculates the confidence of each set of two or more frequent items that appear together; if the confidence ≥ min_conf, the rule is extracted. These calculations delay the work of the algorithm, and the total time also increases as the test file size increases. The adaptive k-means algorithm does not need these operations: it only extracts the centers of the groups, checks whether each remaining database item shares a center, and finally distributes the items according to the shared centers.



Fig. 1. Illustration of the adaptive k-means vs apriori compression time along with the database file sizes.


Table 2 shows the compressed database size when applying adaptive k-means together with the apriori algorithm and when applying adaptive k-means alone on the original database file; it also shows the original database file size.


Table 2: Original database file size vs compressed file size.

Table name             | Original size (KB) | AK comp size (KB) | Apriori size (KB)
Dept                   | 128  | 12   | 12
niaid100               | 168  | 16   | 16
dept200                | 176  | 24   | 20
salary 653             | 356  | 52   | 40
medicin 831            | 372  | 140  | 120
dept3000               | 640  | 236  | 148
niaid2248              | 704  | 280  | 220
2012ss 2733            | 852  | 232  | 204
DWC_admin 3697         | 872  | 448  | 324
dept10000              | 1188 | 636  | 364
niaid 2248&2012ss 2733 | 1444 | 512  | 424
bacteria 4894          | 2693 | 1069 | 1059


In the above table, AK comp size in KB (adaptive k-means compressed file size) represents the compressed database file size, in kilobytes, obtained with the adaptive k-means compression algorithm, and the apriori comp size in KB (apriori compressed file size) represents the compressed database file size, also in kilobytes, obtained by applying the apriori compression algorithm to the data resulting from the adaptive k-means compression algorithm. Fig. 2 shows the compressed database size for each tested database using both adaptive k-means alone and apriori combined with adaptive k-means.



Fig. 2. Original database file size vs compressed database size.


The above figure shows that when the apriori algorithm is applied to the adaptive k-means result data, it produces a compressed file smaller than that obtained when the adaptive k-means algorithm is applied alone, because the apriori algorithm saves each frequent item with its support only once instead of saving the item several times; this decreases the amount of data. Table 3 shows the comparison between the compression ratios when using adaptive k-means alone and when applying adaptive k-means together with the apriori algorithm on the database files.


Table 3: The compression ratios for the adaptive k-means and apriori compression algorithms.

Original size (KB) | Comp ratio using clustering | Ratio using apriori with clustering | No. of clusters
128  | 91% | 91% | 6
168  | 90% | 90% | 11
176  | 86% | 89% | 6
356  | 85% | 89% | 14
372  | 62% | 68% | 697
640  | 63% | 77% | 6
704  | 60% | 70% | 43
852  | 73% | 76% | 107
872  | 49% | 63% | 4
1188 | 46% | 70% | 10
1444 | 65% | 71% | 150
2693 | 60% | 61% | 82


In the above table, the comp ratio by using clustering (compression ratio using a clustering technique) represents the compression ratio obtained with the adaptive k-means compression algorithm, and the ratio by using apriori with clustering represents the compression ratio obtained by applying the apriori compression algorithm to the adaptive k-means compression results. Fig. 3 shows the compression ratio when applying the adaptive k-means algorithm alone and together with the apriori algorithm on the original database files.



Fig. 3. Illustration of the compression ratios for the adaptive k-means and apriori compression algorithms.


The compression ratio is calculated by using the following equation:

Compression ratio = (1 − (Compressed file size / Original file size)) × 100 …………………. (4)
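As a worked check against Tables 2 and 3: the Dept table shrinks from 128 KB to 12 KB under adaptive k-means, so Compression ratio = (1 − 12/128) × 100 ≈ 90.6%, which rounds to the 91% reported in Table 3.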


Table 3 expresses that the apriori compression ratio is higher than the adaptive k-means compression ratio, except for small database file sizes, where the apriori and the adaptive k-means give the same compression ratio, such as for the dept and niaid100 tables. Table 4 shows the decompression time taken using the adaptive k-means decompression algorithm and the apriori decompression algorithm.









Table 4: Original database file size and the corresponding decompression time.

Original size (KB) | Clustering decomp time (sec) | Apriori decomp time (sec) | No. of clusters
128  | 2   | 2   | 6
168  | 4   | 4   | 11
176  | 4   | 4   | 6
356  | 11  | 12  | 14
372  | 157 | 14  | 697
640  | 92  | 114 | 6
704  | 73  | 77  | 43
852  | 71  | 74  | 107
872  | 123 | 140 | 4
1188 | 549 | 696 | 10
1444 | 144 | 151 | 150
2693 | 185 | 187 | 82


The clustering decomp time in sec (adaptive k-means decompression time in seconds) represents the time taken to recover the original data from the compressed data using the adaptive k-means decompression algorithm. The apriori decomp time in sec (apriori decompression time in seconds) represents the time taken to recover the original data from the compressed data using the apriori decompression algorithm. Tables 1 and 4 show that the compression time is less than the decompression time, and that the time increases as the number of clusters increases, because the number of loops in the program execution grows. The adaptive k-means decompression algorithm also takes less time than the apriori decompression algorithm, except for the medicine table, where the apriori decompression algorithm takes less time. Fig. 4 shows the comparison between the decompression times of the adaptive k-means and apriori decompression algorithms on the compressed files.

Table 5 shows the compressed file size when the adaptive k-means algorithm is used alone, when the apriori algorithm is used alone, and when the adaptive k-means algorithm is joined with the apriori algorithm.


Fig. 4. The decompression time for the adaptive k-means algorithm and the apriori algorithm.



Table 5: The compressed file sizes and the original file sizes.

Original size (KB) | AK comp size (KB) | Joint apriori with adaptive k-means size (KB) | Apriori compression size (KB) | No. of clusters
128  | 12   | 12   | 8    | 6
168  | 16   | 16   | 16   | 11
356  | 52   | 40   | 44   | 14
372  | 140  | 120  | 120  | 697
704  | 280  | 220  | 240  | 43
2693 | 1069 | 1059 | 1157 | 82


The table above shows that applying the adaptive k-means together with the apriori algorithm gives better results than applying either algorithm alone on the original database file. These results are also shown in Fig. 5.


Fig. 5. The compressed file size vs the original database file size.


Table 6 presents the compression ratio when using each algorithm separately and when using a combination of these two algorithms.

Table 6: The resulting compression ratios of the three algorithms.

Original size (KB) | Comp ratio using clustering | Ratio using apriori with clustering | Apriori ratio | No. of clusters
128  | 91% | 91% | 94% | 6
168  | 90% | 90% | 90% | 11
356  | 85% | 89% | 88% | 14
372  | 62% | 68% | 68% | 697
704  | 60% | 70% | 66% | 43
2693 | 60% | 61% | 57% | 82


The table above shows that applying the apriori algorithm to the results of the adaptive k-means algorithm gives a higher compression ratio than applying either algorithm alone on the original database, except when the number of records is small. In that case the apriori algorithm applied alone gives a better compression ratio, because the support of the LHS increases so that the confidence value does not reach min_conf, and only a few rules are produced. With a large number of records, the number of LHS and RHS items that appear together increases; this can make the confidence value reach min_conf even as the number of LHS items grows, so the number of extracted rules increases. This increases the compressed file size and decreases the compression ratio. Fig. 6 demonstrates the compression ratios after applying apriori alone, adaptive k-means alone, and the joint of the adaptive k-means algorithm with the apriori algorithm.


Fig. 6. The compression ratios corresponding to the original database file size.


A joint between the adaptive k-means and apriori algorithms means that the adaptive k-means is applied first on the database file and then the apriori algorithm is applied on the adaptive k-means result files.


Table 7: The compression times of the three discussed schemes, measured in seconds.

Original size (KB) | AK comp time (sec) | AP time (sec) | Apriori comp time on original database (sec) | No. of clusters
128  | 1  | 2   | 2   | 6
168  | 1  | 2   | 4   | 11
356  | 2  | 10  | 38  | 14
372  | 11 | 147 | 23  | 697
704  | 4  | 63  | 431 | 43
2693 | 8  | 31  | 887 | 82


Fig. 7 shows the time taken to compress the database file using the adaptive k-means algorithm alone, the apriori algorithm alone, and the joint of these two algorithms.




Fig. 7. The time of the compression process when using each algorithm separately and when using the joint of these algorithms.


The above figure shows that applying the apriori algorithm on the original database takes more time than applying it on the results of the adaptive k-means algorithm, except for the medicine table, where applying it on the adaptive k-means results takes more time because the number of clusters resulting from applying the adaptive k-means on the original database items is very large. That is, as the number of clusters increases, the time of the compression process increases.


Table 8: The decompression times of the three algorithms.

Original size (KB) | Clustering decomp time (sec) | Apriori decomp time (sec) | Apriori on DB decomp time (sec) | No. of clusters
128  | 2   | 2   | 2   | 6
168  | 4   | 4   | 3   | 11
356  | 11  | 12  | 15  | 14
372  | 157 | 14  | 14  | 697
704  | 73  | 77  | 86  | 43
2693 | 185 | 187 | 221 | 82


In the above table, the first column represents the original database size. The second column represents the time taken to recover the original database using the adaptive k-means decompression algorithm. The third column represents the time taken to recover the original database using the apriori decompression algorithm; in this case the apriori decompression algorithm is applied on the compressed files that result from applying the apriori compression algorithm on the adaptive k-means results. The fourth column represents the time taken to recover the original database using the apriori decompression algorithm applied on the compressed files that result from applying the apriori algorithm on the original database data. The table shows that the apriori decompression algorithm takes more time when it is applied on the compressed files resulting from applying the apriori compression algorithm on the original databases, while its decompression time is less when it is applied on the compressed files resulting from applying the apriori compression algorithm on the results of the adaptive k-means compression algorithm.


Fig. 8. Illustration of the decompression times for the three algorithms.


The following table and Fig. 9 show a comparison between the compression and decompression times.


Table 9: A comparison between the compression time and the decompression time.

AK comp time (sec) | AP time (sec) | Clustering decomp time (sec) | Apriori decomp time (sec) | No. of clusters
1  | 2   | 2   | 2   | 6
1  | 2   | 4   | 4   | 11
2  | 2   | 4   | 4   | 6
2  | 10  | 11  | 12  | 14
11 | 147 | 157 | 14  | 697
3  | 31  | 92  | 114 | 6
4  | 63  | 73  | 77  | 43
7  | 18  | 71  | 74  | 107
3  | 82  | 123 | 140 | 4
6  | 157 | 549 | 696 | 10
11 | 81  | 144 | 151 | 150
8  | 31  | 185 | 187 | 82

The above table states that the time taken to restore the original database is more than the time taken to compress it. In order to restore the original database, the algorithm reads the data from the compressed file and appends the data to the new database; this appending, which is done automatically, causes a delay in the time needed to recover the original database.




Fig. 9. Comparison of the compression time and the decompression time.


7. Conclusions

In this work, database files are exploited to discover groupings of objects and the correlated data in each group. A grouping scheme is proposed (the adaptive k-means and apriori algorithms): the adaptive k-means algorithm is used to discover the grouping of objects, while the apriori algorithm is used to extract the correlated data available in each group of objects. The experimental results show that the proposed compression algorithms are effective in reducing the quantity of transmitted data and improving compressibility, and thus in reducing the energy consumed for data transfer. The apriori algorithm achieves a better compression ratio than the adaptive k-means algorithm, but the apriori compression algorithm takes more time than the adaptive k-means compression algorithm, and the apriori decompression algorithm takes more time than the adaptive k-means decompression algorithm to recover the original database from the compressed one. Therefore the apriori algorithm is considered better than the adaptive k-means algorithm for compressing database files. Even if the apriori algorithm is applied alone on the database file, without joining it with the adaptive k-means algorithm, it gives a better compression ratio and a smaller compressed file size than the adaptive k-means algorithm alone. However, when the apriori algorithm is applied alone on the database files, the compression time is greater than when the adaptive k-means is applied alone and when the apriori algorithm is applied on the adaptive k-means results. The proposed adaptive k-means algorithm can deal with any data type, such as text, date, etc., while the standard k-means algorithm deals with numerical data only. In order to validate the results, we calculated several measurements used to evaluate the performance of compression algorithms, such as the compression ratio, compression time and decompression time. The compression ratios of joining the apriori algorithm with the adaptive k-means algorithm lie between 61% and 91%, while the compression ratios of the adaptive k-means algorithm alone lie between 46% and 91%. As shown in section 6, the compression time lies between 2 sec and 157 sec for the joint of the adaptive k-means with the apriori algorithm, and between 1 sec and 11 sec for the adaptive k-means algorithm. Finally, the decompression time of the apriori decompression algorithm lies between 2 sec and 696 sec, and the decompression time of the adaptive k-means decompression algorithm lies between 2 sec and 549 sec.

References

Adriaans, P., & Zantinge, D. (1998). Data Mining. Addison-Wesley.

Afify, H., Islam, M., & Wahed, M. A. (2011). DNA Lossless Differential Compression Algorithm Based on Similarity of Genomic Sequence Database. International Journal of Computer Science & Information Technology (IJCSIT), Vol. 3, No. 4.

Agrawal, R., Imielinski, T., & Swami, A. (1997). Database Mining: A Performance Perspective. IEEE Trans. Knowledge and Data Engineering, England.

Berkhin, P. (2002). Survey of Clustering Data Mining Techniques. Accrue Software, 1045 Forest Knoll Dr., San Jose, CA, 95129.

Brin, S., Motwani, R., Ullman, J. D., & Tsur, S. (1997). Dynamic Itemset Counting and Implication Rules for Market Basket Data. In Proceedings of SIGMOD '97, Tucson, Arizona, USA.

Figueroa, V. (2010-2011). Clustering methods applied to Wikipedia. Année académique.

Goetz, G., & Leonard, D. S. (1991). Data Compression and Database Performance. Oregon Advanced Computing Institute (OACIS) and NSF awards IRI-8805200, IRI-8912618, and IRI-9006348.

Graham, W. (1994). Signal Coding and Processing (2nd ed.). Cambridge University Press, p. 34.

Han, J., & Kamber, M. (2001). Data Mining Concepts and Techniques. Academic Press, USA.

Hand, D., Mannila, H., & Smyth, P. (2001). Principles of Data Mining. The MIT Press, Cambridge, London, England.

Jacob, N., Somvanshi, P., & Tornekar, R. (2012). Comparative Analysis of Lossless Text Compression Techniques. International Journal of Computer Applications (0975-8887), Volume 56, No. 3.

Kodituwakku, S. R., & Amarasinghe, U. S. (2007). Comparison of Lossless Data Compression Algorithms for Text Data. Indian Journal of Computer Science and Engineering, Vol. 1, No. 4, 416-425.

Kona, H. V. (2003). Association Rule Mining Over Multiple Databases: Partitioned and Incremental Approaches. Master Thesis, The University of Texas at Arlington.

Neeraj, S., & Swati, L. S. (2012). Overview of Non-redundant Association Rule Mining. Research Journal of Recent Sciences (Res. J. Recent Sci.), ISSN 2277-2502, Vol. 1(2), 108-112.

Pu, I. M. (2006). Fundamental Data Compression. Elsevier, Britain.

Rai, P., & Singh, S. (2010). A Survey of Clustering Techniques. International Journal of Computer Applications (0975-8887), Volume 7, No. 12.

Rajaraman, A., & Ullman, J. D. (2011). Mining of Massive Datasets.

Report Concerning Space Data System Standards. (2006). Lossless Data Compression. CCSDS Secretariat, Office of Space Communication (Code M-3), National Aeronautics and Space Administration, Washington, DC 20546, USA.

Saad K. Majeed & Hussein K. Abbas. (2010). An Improved Distributed Association Rule Algorithm. Eng. & Tech. Journal, Vol. 28, No. 18.

Suarjaya, I. M. A. D. (2012). A New Algorithm for Data Compression Optimization. International Journal of Advanced Computer Science and Applications (IJACSA), Vol. 3, No. 8.

Szépkúti, I. (2004). Difference Sequence Compression of Multidimensional Databases. Periodica Polytechnica Ser. El. Eng., Vol. 48, No. 3-4, pp. 197-218.

Yijun, Lin, X., & Tsang, C. (2000). An Efficient Distributed Algorithm for Computing Association Rules. Springer-Verlag Berlin Heidelberg.