Large Dataset Compression Approach Using Intelligent Technique

Ahmed Tariq Sadiq (1,a), Mehdi G. Duaimi (2,b), Rasha Subhi Ali (2,c)

(1) Computer Science Department, University of Technology, Baghdad, Iraq
(2) Computer Science Department, Baghdad University, Baghdad, Iraq

(a) drahmaed_tark@yahoo.com, (b) mehdi_duaimi@scbaghdad.edu.iq, (c) danafush@Gmail.com
ABSTRACT

Data clustering is a process of putting similar data into groups. A clustering algorithm partitions a data set into several groups such that the similarity within a group is larger than that among groups. Association rule mining is one of the possible methods for the analysis of data. Association rule algorithms generate a huge number of association rules, many of which are redundant. The main idea of this paper is to compress large databases by using clustering techniques with association rule algorithms. In the first stage, the database is compressed by using a clustering technique followed by an association rules algorithm. An adaptive k-means clustering algorithm is proposed together with the apriori algorithm. Many experiments show that using the adaptive k-means algorithm and the apriori algorithm together gives a better compression ratio and a smaller compressed file size than using either algorithm alone. Several experiments were made on databases of several different sizes. The apriori algorithm increases the compression ratio of the adaptive k-means algorithm when they are used together, but it takes more compression time than the adaptive k-means algorithm alone. These algorithms are presented and their results are compared.

Keywords: Association rule, clustering techniques and compression algorithms.
1. Introduction

Compression is the art of representing information in a compact form rather than its original, uncompressed form (Pu, 2006). In other words, using data compression, the size of a particular file can be reduced. This is very useful when processing, storing or transferring a huge file, which needs lots of resources (Kodituwakku, et al., 2007). Data compression is widely used in data management to save storage space and network bandwidth (Goetz, et al., 1991). In computer science and information theory, data compression, source coding, or bit-rate reduction involves encoding information using fewer bits than the original representation. Compression can be either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy (eliminating unwanted redundancy). Compression is useful because it helps reduce the consumption of resources such as storage space or transmission capacity. Because compressed data must be decompressed to be used, this extra processing imposes computational or other costs through decompression. In lossy data compression, some loss of information is acceptable. Depending upon the application, details can be dropped from the data to save storage space. Lossy data compression is used in image, audio, and video compression (Graham, 1994).
Journal of Advanced Computer Science and Technology Research, Vol.3 No.1, March 2013, 1-20

Lossless data compression is contrasted with lossy data compression. Lossless data compression has been suggested for many space science exploration mission applications, either to increase the scientific return or to reduce the requirement for on-board memory, station contact time, and data archival volume. A lossless compression technique guarantees full reconstruction of the original data without incurring any distortion in the process. The lossless data compression technique preserves the source data accuracy by removing redundancy from the application source data. In the decompression process the original source data are reconstructed from the compressed data by restoring the removed redundancy. The reconstructed data are an exact replica of the original source data. The amount of redundancy removed from the source data is variable and is highly dependent on the source data statistics, which are often non-stationary (Report Concerning Space Data System Standards, 2006). In this paper two intelligent techniques (clustering techniques and association rules) are used to compress large data sets. Clustering is a division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar among themselves and dissimilar to the objects of other groups (Berkhin, 2002). Association rule mining plays a major role in the process of mining data for frequent pattern matching. Association rule mining involves picking out the unknown interdependence of the data and finding out the rules among those items (Neeraj, et al., 2012). The main objective of this work is to discuss intelligent techniques to compress large data sets.
This paper is organized as follows. Section two shows the related work, section three gives some terminology on association rules and the apriori algorithm, section four explains major clustering techniques and the k-means algorithm, section five explains the methodology of the compression and decompression algorithms, section six presents some experimental results, and section seven concludes with a discussion.
2. Related Work

Sonia Dora (Jacob, et al., 2012) focuses on lossless compression for relational databases at the attribute level; the proposed technique compresses three types of attribute (string, integer and date type), and its most interesting feature is that it automatically identifies the type of attribute. I Made Agus Dwi Suarjaya (Suarjaya, 2012) proposes a new algorithm for data compression, called j-bit encoding (JBE). This algorithm manipulates each bit of data in a file to minimize the size without losing any data after decoding, which classifies it as lossless compression. Heba Afify, Muhammad Islam and Manal Abdel Wahed (Afify, et al., 2011) present a differential compression algorithm that is based on the production of difference sequences according to an op-code table, in order to optimize the compression of homologous sequences in the dataset. István Szépkúti (Szépkúti, 2004) introduces a new method called difference sequence compression. Under some conditions, this technique is able to create a smaller multidimensional database than others like single count header compression, logical position compression or base-offset compression.
3. Association Rules

Association rule mining is one of the most important and well-researched techniques of data mining. It was first introduced by Agrawal, Imielinski, and Swami (Agrawal, et al., 1997). The discovery of "association rules" in databases may provide useful background knowledge to decision support systems, selective marketing, financial forecasting, medical diagnosis, and many other applications (Yijun, et al., 2000). Mining association rules is an important data mining problem. Association rules are usually mined repeatedly in different parts of a database. Current algorithms for mining association rules work in two steps:

1. Discover the large itemsets, i.e. the sets of items that have support above a predetermined minimum support σ.
2. Use the large itemsets to generate the association rules for the database.

It is noted that the overall performance of mining association rules is determined by the first step, which usually requires repeated passes over the analyzed database. After the large itemsets are identified, the corresponding association rules can be derived in a straightforward manner (Saad et al., 2010).
3.1 Association rules concept

An association rule is a simple probabilistic statement about the co-occurrence of certain events in a database, and is particularly applicable to sparse transaction data sets (Hand et al., 2001). An association rule is a rule which infers certain association relationships among a set of objects (such as those which occur together, or where one infers the other). In a database, association rule mining works as follows (Adriaans et al., 1998).

Let I be a set of items and D a database of transactions, where each transaction has a unique identifier (tid) and contains a set of items called an itemset. An itemset with k items is called a k-itemset. The support of an itemset X, denoted S(X), is the number of transactions in which that itemset occurs as a subset. A k-subset is a k-length subset of an itemset. An itemset is frequent or large if its support is more than a user-specified minimum support (min_sup) value. Fk is the set of frequent k-itemsets. A frequent itemset is maximal if it is not a subset of any other frequent itemset. An association rule is an expression A ⇒ B, where A and B are itemsets. The rule's support (S) is the joint probability of a transaction containing both A and B, and is given as S(A ⇒ B). The confidence of the rule is the conditional probability that a transaction contains B, given that it contains A, and is given as S(A ∪ B)/S(A). A rule is frequent if its support is greater than min_sup and strong if its confidence is more than a user-specified minimum confidence (min_conf). Data mining involves generating all association rules in the database that have a support greater than min_sup (the rules are frequent) and that have a confidence greater than min_conf (the rules are strong) (Saad et al., 2010). The important measures for association rules are support (S) and confidence (C). They can be defined as:
Definition 1: Support (S)

Support(X, Y) = Pr(X ∪ Y) = count of (X ∪ Y) / Total transactions ………… (1)

The support (S) of an association rule is the ratio (in percent) of the records that contain (X ∪ Y) to the total number of records in the database. Therefore, if we say that the support of a rule is 5%, it means that 5% of the total records contain (X ∪ Y) (Brin, et al., 1997).

Definition 2: Confidence (C)

Conf(X ⇒ Y) = Pr(X ∪ Y)/Pr(X) = support(X, Y)/support(X) ………… (2)
For a given number of records, confidence (C) is the ratio (in percent) of the number of records that contain (X ∪ Y) to the number of records that contain X. Thus, if we say that a rule has a confidence of 15%, it means that 15% of the records containing X also contain Y. The confidence of the rule refers to the degree of correlation in the database between X and Y. Confidence is also a measure of rule strength. Mining consists of finding all rules that meet the user-specified threshold support and confidence (Brin, et al., 1997). As there are two thresholds, we need two processes to mine the rules. The first step is to get the large itemsets: it finds all the itemsets whose supports are larger than the support threshold. An itemset is a set of items. Based on the large itemsets, we can generate the rules, which is the second step. Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong (Han, et al., 2001). An association rule mining problem is broken down into two steps: 1) generate all the item combinations (itemsets) whose support is greater than the user-specified minimum support; such sets are called the frequent itemsets; and 2) use the identified frequent itemsets to generate the rules that satisfy a user-specified confidence. The frequent itemset generation requires more effort, and the rule generation is straightforward (Kona, 2003).
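Equations (1) and (2) can be sketched as follows; the toy transactions below are hypothetical illustrative data, not drawn from the paper's test databases:

```python
# Minimal sketch of support (Eq. 1) and confidence (Eq. 2) over toy data.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, db):
    """Eq. (1): fraction of transactions that contain every item in `itemset`."""
    s = set(itemset)
    return sum(1 for t in db if s <= t) / len(db)

def confidence(antecedent, consequent, db):
    """Eq. (2): support of X and Y together, divided by support of X."""
    return support(set(antecedent) | set(consequent), db) / support(antecedent, db)

print(support({"bread", "milk"}, transactions))       # 0.5
print(confidence({"bread"}, {"milk"}, transactions))  # 0.666...
```

A rule {bread} ⇒ {milk} here has support 50% and confidence about 67%, so it would survive thresholds of, say, min_sup = 40% and min_conf = 60%.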
4. Unsupervised Learning

Unsupervised learning is a branch of machine learning in which the aim is to find patterns in raw, unlabeled data. This is unlike supervised learning, where the algorithm is first trained on labeled data so it can learn and adapt to the particular problem (classification). The general, inherent properties of the data can be used to find patterns and organize it (clustering) (Figueroa, 2011).
4.1 Clustering

Clustering is the process of examining a collection of "points" and grouping the points into "clusters" according to some distance measure (Rajaraman, et al., 2011). The goal of clustering is to define clusters of elements in the dataset, such that the elements in the same cluster are similar to each other, and each cluster as a whole is distant from the others. Two important classes of clustering can be distinguished. 1) Hierarchical clustering: these techniques can be either agglomerative or divisive. In agglomerative clustering, we start by assigning each element to a different class; the successive iterations of the algorithm cluster together the closest classes, until all the elements belong to the same, main class. The divisive methods instead perform divisions of the classes into smaller ones. 2) Partitional clustering: the concept of partitional clustering is to divide the data immediately into a certain number of clusters. The most popular algorithm is k-means, which will be presented in the next section (Figueroa, 2011).
4.1.1. Partitioning Methods

The partitioning methods generally result in a set of M clusters, each object belonging to one cluster. Each cluster may be represented by a centroid or a cluster representative; this is a sort of summary description of all the objects contained in the cluster. The k-means method is an example of partitioning clustering (Rai, et al., 2010).
K-means

The k-means algorithm was proposed independently in various scientific fields over 50 years ago. MacQueen was the first, in 1967, to name k-means; his one-pass version of the algorithm defined the first k elements of the dataset as the k classes, and successively assigned each following element to the closest class, updating the centroid after each assignment (Figueroa, 2011). The standard k-means algorithm is considered to be a simple but efficient partitioning algorithm. It divides the data into k clusters, minimizing the squared distance between each element and the center of its cluster. The distance measure is a parameter of the algorithm. The objective function, using the Euclidean distance, is defined as:

J = Σ (k = 1..m) Σ (x_i ∈ C_k) ||x_i − g_k||² ………… (3)

Where:
• m: the total number of clusters.
• C_k: the k-th cluster.
• x_i: the vector of the i-th element of the dataset.
• g_k: the vector of the center of the k-th cluster.

K-means then proposes the following iterative method to find a good solution:
1. Initialize the center of each cluster (also called centroids). For example, we can arbitrarily choose some elements of the dataset to be the centroids.
2. Reassign each element to the closest centroid.
3. Recompute the center of each cluster.
4. Repeat steps 2 and 3 until the stopping criterion is satisfied. (Figueroa, 2011).
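The iterative method above can be sketched in Python (the paper's implementation is in VB.net); the sample points and the pick-k-elements initialization are illustrative assumptions:

```python
import random

# A compact sketch of steps 1-4 on 2-D points, using squared Euclidean distance.
def kmeans(points, k, iters=100):
    centroids = random.sample(points, k)              # step 1: initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:                              # step 2: closest centroid
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        # step 3: recompute each center as the mean of its cluster
        new = [tuple(sum(dim) / len(cl) for dim in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:                          # step 4: stop on convergence
            break
        centroids = new
    return centroids, clusters

centroids, _ = kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], k=2)
print(sorted(centroids))   # [(0.0, 0.5), (10.0, 10.5)]
```

With these two well-separated groups, any choice of two distinct initial centroids converges to the same pair of cluster means.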
4.1.2. Hierarchical Methods

Hierarchical clustering is a method of cluster analysis which seeks to build a hierarchy of clusters. The basics of hierarchical clustering include the Lance-Williams formula, conceptual clustering, the classic algorithms SLINK and COBWEB, as well as the newer algorithms CURE and CHAMELEON. The hierarchical algorithms build clusters gradually (as crystals are grown). Strategies for hierarchical clustering generally fall into two types. In hierarchical clustering the data are not partitioned into a particular cluster in a single step. Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters each containing a single object. Hierarchical clustering is subdivided into agglomerative methods, which proceed by a series of fusions of the n objects into groups, and divisive methods, which separate the n objects successively into finer groupings. Agglomerative techniques are more commonly used (Rai, et al., 2010).
A. Agglomerative technique

This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy. The algorithm forms clusters in a bottom-up manner, as follows (Rai, et al., 2010):
1. Initially, put each article in its own cluster.
2. Among all current clusters, pick the two clusters with the smallest distance.
3. Replace these two clusters with a new cluster, formed by merging the two original ones.
4. Repeat the above two steps until there is only one remaining cluster in the pool.
Thus, the agglomerative clustering algorithm will result in a binary cluster tree with single-article clusters as its leaf nodes and a root node containing all the articles.
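The four steps above can be sketched as a naive single-link version on 1-D points (O(n³), for illustration only; the data are hypothetical):

```python
# Toy agglomerative clustering: merge the two closest clusters until one
# remains, recording the merge order (steps 1-4 above).
def agglomerate(points):
    clusters = [[p] for p in points]          # step 1: one cluster per point
    merges = []
    while len(clusters) > 1:
        # step 2: pick the pair of clusters with the smallest single-link distance
        i, j = min(
            ((a, b) for a in range(len(clusters)) for b in range(a + 1, len(clusters))),
            key=lambda ab: min(abs(x - y) for x in clusters[ab[0]] for y in clusters[ab[1]]),
        )
        merged = clusters[i] + clusters[j]    # step 3: merge them into one cluster
        merges.append(sorted(merged))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges                             # step 4 ends with a single root cluster

print(agglomerate([0, 1, 10]))   # [[0, 1], [0, 1, 10]]
```

The list of merges is exactly the binary cluster tree read from the leaves up: each entry is an internal node, and the final entry is the root containing all points.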
B. Divisive technique (Rai, et al., 2010)

This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
1. Put all objects in one cluster.
2. Repeat until all clusters are singletons:
a) Choose a cluster to split.
b) Replace the chosen cluster with the sub-clusters.

C. Advantages of hierarchical clustering (Rai, et al., 2010)
1) Embedded flexibility regarding the level of granularity.
2) Ease of handling any form of similarity or distance.
3) Applicability to any attribute type.

D. Disadvantages of hierarchical clustering (Rai, et al., 2010)
1. Ambiguity of termination criteria.
2. Most hierarchical algorithms do not revisit once-constructed clusters with the purpose of improvement.
5. Methodology

In the first part of this work the adaptive k-means and apriori algorithms are utilized. By adaptive k-means, one can extract several clusters from each database table, and by the apriori algorithm, relationships among the sets of items can be extracted from each cluster. In the next part of this work the decompressed data must be recovered from the compressed data. To do that, the adaptive k-means decompression algorithm and the apriori decompression algorithm will be used to recover the original data from the compressed data.
5.1 Compression algorithms

5.1.1. Adaptive k-means algorithm

Adaptive k-means is a partitioning clustering algorithm used to extract all available clusters in any selected database. In standard k-means the user must determine the number of clusters and the center of each cluster, while in adaptive k-means the number of clusters and the centers of the clusters are determined automatically, without intervention of the user. In this algorithm the user selects a database file, then the algorithm automatically selects two attributes. The items that are available in these selected attributes represent the centers of the clusters. This algorithm has several stages.
Algorithm: adaptive k-means.
Input: database file.
Output: two text files, the first for saving the extracted clusters, and the second for saving information about each extracted cluster (this information contains the number of items in each cluster and the name of each cluster).
Begin
i. Let DB = database file.
ii. Automatically select two attributes from the input database file. Let G and D be the selected attributes.
iii. The center for each cluster is determined automatically by selecting items from the determined attributes without repetition of these items, and considering them as centers for clusters. Let (G1, G2, ..., Gn and D1, D2, ..., Dm) be the centers of the clusters.
n represents the number of items in the first selected attribute without repetition.
m represents the number of items in the second selected attribute without repetition.
iv. U = unselected attributes.
v. For each closed items in G1, G2, ..., Gn and D1, D2, ..., Dm select U.
vi. Print U in the first text file.
vii. Print the information of each cluster in the second file.
viii. Return the compressed files.
End
For example, if the dept table is selected to apply the compression algorithms on, then the gender and degree columns will be selected, and their items will become the center of each cluster. (Male, female) are closed items belonging to the first selected attribute (gender), while (secondary school, B.Sc., Diploma, PhD, MSc, Higher Diploma) are closed items belonging to the second selected attribute. Then the center of each cluster is determined as follows: the center of the first cluster is (male, secondary school), the center of the second cluster is (male, MSc), while the center of the third cluster is (female, MSc), and so on, as determined by the adaptive k-means algorithm.
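The center-selection stage can be sketched as follows, assuming in-memory rows rather than a database file and using hypothetical rows modeled on the dept example above (the real algorithm picks the two attributes automatically; here they are passed in explicitly):

```python
# A sketch of the cluster-center stage; rows, attribute names, and values are
# hypothetical stand-ins for the dept table described in the text.
def adaptive_kmeans_centers(rows, attr1, attr2):
    """Treat each distinct (attr1, attr2) value pair as a cluster center and
    group the rows that share it."""
    clusters = {}
    for row in rows:
        center = (row[attr1], row[attr2])          # e.g. ("male", "MSc")
        clusters.setdefault(center, []).append(row)
    return clusters

rows = [
    {"name": "a", "gender": "male",   "degree": "MSc"},
    {"name": "b", "gender": "female", "degree": "MSc"},
    {"name": "c", "gender": "male",   "degree": "MSc"},
]
clusters = adaptive_kmeans_centers(rows, "gender", "degree")
print(sorted(clusters))   # [('female', 'MSc'), ('male', 'MSc')]
```

Each key of the returned dictionary corresponds to one cluster center; writing the keys to one file and the member rows to another mirrors the two output files described in the algorithm.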
5.1.2. Apriori Algorithm

The apriori algorithm can be used to generate all frequent itemsets. A frequent itemset is an itemset whose support is greater than the user-specified minimum support (denoted Lk, where k is the size of the itemset). A candidate itemset is a potentially frequent itemset (denoted Ck, where k is the size of the itemset).
Algorithm: apriori.
Input: adaptive k-means resulting files, min_sup, min_conf.
Output: one text file for saving the extracted rules and the remaining cluster data.
Begin:
For each itemset l1 ∈ L(k−1)
  For each itemset l2 ∈ L(k−1)
    If (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ ... ∧ (l1[k−2] = l2[k−2]) ∧ (l1[k−1] < l2[k−1]) then
      c = l1 join l2; // join step: generate candidates
      If has_infrequent_subset(c, L(k−1)) then
        Delete c; // prune step: remove unfruitful candidate
      Else add c to Ck;
Return Ck and the remaining items of the clusters, written to a text file;
End
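The join and prune steps above can be sketched as a runnable function; the letter itemsets are hypothetical, and has_infrequent_subset is folded into the prune check:

```python
# Candidate generation (join + prune) from the pseudocode above.
def apriori_gen(Lk_1):
    """Lk_1 is a list of sorted (k-1)-itemsets as tuples; returns the
    candidate k-itemsets Ck."""
    candidates = []
    prev = set(Lk_1)
    for i, l1 in enumerate(Lk_1):
        for l2 in Lk_1[i + 1:]:
            # join step: first k-2 items equal, last item of l1 < last item of l2
            if l1[:-1] == l2[:-1] and l1[-1] < l2[-1]:
                c = l1 + (l2[-1],)
                # prune step: every (k-1)-subset of c must itself be frequent
                if all(c[:j] + c[j + 1:] in prev for j in range(len(c))):
                    candidates.append(c)
    return candidates

L2 = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D")]
print(apriori_gen(L2))   # [('A', 'B', 'C')]
```

Here ("B", "C") joined with ("B", "D") would yield ("B", "C", "D"), but it is pruned because its subset ("C", "D") is not frequent; this is exactly the work (support counting plus pruning) that makes apriori slower than adaptive k-means in the timing results later.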
5.2 Decompression algorithms

5.2.1. Adaptive k-means Decompression Algorithm

To recover the original data from the compressed data, we reverse the operation of the adaptive k-means compression algorithm. This operation is called the adaptive k-means decompression algorithm. Each input compressed file is used to read out the original data from the compressed file's data. To do this without loss of any information, it is necessary to keep the data sets with an index used in both the adaptive k-means and apriori decompression algorithms. Hence, the data are distributed in the clusters with their cluster names, the available number of items in each cluster, and the selected attribute names. This information is used as the current output data and is the next entry to be inserted into the database file. The detailed operation of the adaptive k-means decompression algorithm is described as follows.

Adaptive k-means decompression algorithm
Input: compressed files.
Output: original database file.
Begin:
1. Read the cluster items from the first compressed file.
2. Read the cluster information from the second compressed file.
3. Split the data that are available in the two compressed files.
4. Save the data obtained from the second compressed file into a one-dimensional array.
5. Save the data obtained from the first compressed file into a buffer view.
6. Create a new database file.
7. Fill this new database file with the data that are needed from the array and the buffer view.
8. Return the original database file.
End
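As a toy illustration of the recovery direction, the sketch below rebuilds flat rows from a clustered representation like the one produced in the compression example; the in-memory dictionaries and field names stand in for the paper's two text files and buffer views, and are assumptions for illustration:

```python
# A sketch of decompression: reattach each cluster's center values to its
# member rows. The data below mirror the hypothetical dept example.
def decompress_clusters(clusters, attr1, attr2):
    rows = []
    for (v1, v2), members in clusters.items():
        for partial in members:
            row = dict(partial)
            row[attr1] = v1        # restore the attribute values the center encoded
            row[attr2] = v2
            rows.append(row)
    return rows

clusters = {
    ("male", "MSc"):   [{"name": "a"}, {"name": "c"}],
    ("female", "MSc"): [{"name": "b"}],
}
rows = decompress_clusters(clusters, "gender", "degree")
print(len(rows))   # 3
```

Because each member row stores only the fields the center does not encode, every original row is reconstructed exactly, which is the lossless guarantee the text describes.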
5.2.2. Apriori Decompression Algorithm

To recover the original data from the compressed data, we reverse the operation of the apriori compression algorithm. This operation is called the apriori decompression algorithm. Each input compressed data file is used to read out the original data from the data set. To do this without loss of any information, it is necessary to keep the data sets with an index used in both the adaptive k-means and apriori decompression algorithms. Hence, important correlations are extracted by using the apriori compression algorithm, and the correlated and uncorrelated data are saved as clusters with their cluster names in a text file. The second file is the same as the second file used by the adaptive k-means algorithm. This information is used as the current output data and is the next entry to be inserted into the database file. The detailed operation of the apriori decompression algorithm is described as follows:

Apriori decompression algorithm
Input: compressed files.
Output: original database file.
Begin:
1. Read the cluster items from the apriori compressed file.
2. Read the cluster information from the second adaptive k-means compressed file.
3. Split the data that are available in the two compressed files.
4. Save the data obtained from the second compressed file into a one-dimensional array.
5. Save the data obtained from the apriori compressed file into a buffer view, and save the clusters given by the apriori compressed file into another buffer view.
6. Return the correlated data to the clusters they belong to by using their indexes.
7. Create a new database file.
8. Fill this new database file with the data that are needed from the array and the buffer view.
9. Return the original database file.
End
6. Experimental Results

For an experimental evaluation of the proposed algorithms, several experiments were performed on real databases. The proposed algorithms are implemented in the VB.net environment. Many databases were used to test the performance of these algorithms. In particular, several clusters are derived from each database file, and a large number of rules can be derived from those clusters depending on the support and confidence thresholds. Some of the results are given in tables 1, 2, 3, 4, 5, 6, 7, 8 and 9. Table 1 shows the time taken for compressing several databases by using both the adaptive k-means and apriori algorithms.

Table 1: adaptive k-means vs apriori compression time, along with the database file size.

Original size (KB)   AK comp time (sec)   AP comp time (sec)   No. of clusters
128                  1                    2                    6
168                  1                    2                    11
176                  2                    2                    6
356                  2                    10                   14
372                  11                   147                  697
640                  3                    31                   6
704                  4                    63                   43
852                  7                    18                   107
872                  3                    82                   4
1188                 6                    157                  10
1444                 11                   81                   150
2693                 8                    31                   82
In the above table, the AK comp time in sec (adaptive k-means compression time in seconds) represents the time taken to compress the database by using the adaptive k-means compression algorithm. The AP comp time in sec (apriori compression time in seconds) represents the time taken to compress the database by applying the apriori compression algorithm to the adaptive k-means algorithm results. The original size in kilobytes represents the original database size. From the above results we see that the apriori algorithm takes more time than the adaptive k-means algorithm because the apriori algorithm performs two calculations. It first calculates the support, i.e. the number of occurrences of each item, and compares this support with min_sup; if the support ≥ min_sup, the item is moved to the frequent item list. It then calculates the confidence of each set of two or more frequent items that appear together; if the confidence ≥ min_conf, the rule is extracted. These calculations delay the work of the algorithm. Also, the total time taken increases as the test file size increases. The adaptive k-means algorithm does not need these operations: it only extracts the centers of the groups, checks whether each remaining item in the database shares a center or not, and finally distributes the items according to the shared centers.
Fig. 1. Illustration of the adaptive k-means vs apriori compression time along with the database file size.
Table 2 shows the compressed database size when applying adaptive k-means together with the apriori algorithm and when applying adaptive k-means alone on the original database file, and also shows the original database file size.

Table 2: Original database file size vs compressed file size.

Table name                Original size (KB)   AK comp size (KB)   Apriori size (KB)
Dept                      128                  12                  12
niaid100                  168                  16                  16
dept200                   176                  24                  20
salary 653                356                  52                  40
medicin 831               372                  140                 120
dept3000                  640                  236                 148
niaid2248                 704                  280                 220
2012ss 2733               852                  232                 204
DWC_admin 3697            872                  448                 324
dept10000                 1188                 636                 364
niaid 2248&2012ss 2733    1444                 512                 424
bacteria 4894             2693                 1069                1059
In the above table, the AK comp size in KB (adaptive k-means compressed file size) represents the compressed database file size obtained by using the adaptive k-means compression algorithm, measured in kilobytes, and the apriori comp size in KB (apriori compressed file size) represents the compressed database file size obtained by applying the apriori compression algorithm to the data that result from the adaptive k-means compression algorithm, also measured in kilobytes. Fig. 2 shows the compressed database size for each tested database using both adaptive k-means alone and apriori with adaptive k-means.

Fig. 2. Demonstration of original database file size vs compressed database size.
The above figure shows that when the apriori algorithm is applied to the adaptive k-means resulting data, it yields a compressed file size smaller than when the adaptive k-means algorithm is applied alone, because the apriori algorithm saves each frequent item with its support only once instead of saving the item several times; this decreases the amount of data. Table 3 shows the comparison between the compression ratios when using adaptive k-means alone and when applying adaptive k-means together with the apriori algorithm on the database files.

Table 3: The compression ratios for the adaptive k-means and apriori compression algorithms.

Original size (KB)   Comp ratio using clustering   Ratio using apriori with clustering   No. of clusters
128                  91%                           91%                                   6
168                  90%                           90%                                   11
176                  86%                           89%                                   6
356                  85%                           89%                                   14
372                  62%                           68%                                   697
640                  63%                           77%                                   6
704                  60%                           70%                                   43
852                  73%                           76%                                   107
872                  49%                           63%                                   4
1188                 46%                           70%                                   10
1444                 65%                           71%                                   150
2693                 60%                           61%                                   82
In the above table, the comp ratio by using clustering (compression ratio by using a clustering technique) represents the compression ratio obtained by using the adaptive k-means compression algorithm, and the comp ratio by using apriori with clustering represents the compression ratio obtained by applying the apriori compression algorithm to the adaptive k-means compression algorithm results. Fig. 3 shows the compression ratio when applying the adaptive k-means algorithm alone and together with the apriori algorithm on the original database file.

Fig. 3. Illustration of the compression ratios for the adaptive k-means and apriori compression algorithms.

The compression ratio is calculated by using the following equation:

Compression ratio = (1 − Compressed file size / Original file size) * 100 ………… (4)

Table 3 shows that the apriori compression ratio is greater than the adaptive k-means compression ratio, except for small database file sizes, where the apriori and the adaptive k-means give the same compression ratio, such as in the (dept and niaid100) tables. Table 4 shows the decompression time taken using the adaptive k-means decompression algorithm and the apriori decompression algorithm.
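Equation (4) can be written as a one-line function and checked against the dept table figures from tables 2 and 3 (128 KB original, 12 KB compressed, about 91%):

```python
# Equation (4) in code; inputs can be in any consistent size unit.
def compression_ratio(original_size, compressed_size):
    """Percentage size reduction: (1 - compressed/original) * 100."""
    return (1 - compressed_size / original_size) * 100

# dept table: 128 KB original, 12 KB compressed -> about 91%, as in Table 3.
print(round(compression_ratio(128, 12)))   # 91
```

The same function reproduces the other entries, e.g. the 356 KB table compressed to 40 KB with apriori gives about 89%.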
Table 4: original database file size corresponding to the decompression time.

Original size (KB)   Clustering decomp time (sec)   Apriori decomp time (sec)   No. of clusters
128                  2                              2                           6
168                  4                              4                           11
176                  4                              4                           6
356                  11                             12                          14
372                  157                            14                          697
640                  92                             114                         6
704                  73                             77                          43
852                  71                             74                          107
872                  123                            140                         4
1188                 549                            696                         10
1444                 144                            151                         150
2693                 185                            187                         82
The clustering decomp time in sec (adaptive k-means decompression algorithm time in seconds) is the time taken to recover the original data from the compressed data using the adaptive k-means decompression algorithm. The apriori decomp time in sec (apriori decompression algorithm time in seconds) is the time taken to recover the original data from the compressed data using the apriori decompression algorithm. Tables 1 and 4 show that the compression time is less than the decompression time, and that the time grows as the number of clusters grows, because the number of loops in the program execution increases. Also, the adaptive k-means decompression algorithm takes less time than the apriori decompression algorithm, except for the medicine table, where the apriori decompression algorithm takes less time than the adaptive k-means decompression algorithm. Fig. 4 shows the comparison between the decompression times of the adaptive k-means and apriori decompression algorithms on the compressed files.

Table 5 shows the compressed file size when the adaptive k-means algorithm is used alone, when the apriori algorithm is used alone, and when the adaptive k-means algorithm is joined with the apriori algorithm.

Fig 4. The decompression time for the adaptive k-means algorithm and the apriori algorithm.
Table 5: The compressed file sizes and the original file sizes.

Original size (KB)   AK comp size (KB)   Joint apriori with adaptive k-means size (KB)   Apriori compression size (KB)   No of clusters
128                  12                  12                                              8                               6
168                  16                  16                                              16                              11
356                  52                  40                                              44                              14
372                  140                 120                                             120                             697
704                  280                 220                                             240                             43
2693                 1069                1059                                            1157                            82
The table above shows that applying the adaptive k-means together with the apriori algorithm gives better results than applying each algorithm alone on the original database file. These results are also shown in Fig. 5.

Fig 5. The compressed file size vs. the original database file size.

Table 6 presents the compression ratio when using each algorithm separately and when using an integration of the two algorithms.
Table 6: The resulting compression ratios of the three algorithms.

Original size (KB)   Comp ratio using clustering   Ratio using apriori with clustering   Apriori ratio   No of clusters
128                  91%                           91%                                   94%             6
168                  90%                           90%                                   90%             11
356                  85%                           89%                                   88%             14
372                  62%                           68%                                   68%             697
704                  60%                           70%                                   66%             43
2693                 60%                           61%                                   57%             82
The table above shows that applying the apriori algorithm on the result of the adaptive k-means algorithm gives a higher compression ratio than applying each algorithm alone on the original database, except when the number of records is small. In that case, the apriori algorithm applied alone gives a better compression ratio, because the support of the LHS increases so that the confidence value does not reach min_conf, and few rules are produced. With a large number of records, the number of LHS and RHS items that appear together increases. This increase may cause the confidence value to match the min_conf value even as the number of LHS items grows, so the number of extracted rules increases. This increase enlarges the compressed file size and decreases the compression ratio. Fig. 6 demonstrates the compression ratios after applying apriori alone, adaptive k-means alone, and the joint of the adaptive k-means algorithm with the apriori algorithm.

Fig 6. The compression ratios corresponding to the original database file size.
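The support/confidence filtering described above can be illustrated with a small sketch; the transactions, items, and min_conf value here are hypothetical, not the paper's datasets:

```python
from itertools import combinations

# Toy transactions: each row is a set of attribute values.
transactions = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "c"},
    {"a", "b", "c"},
]
min_conf = 0.8

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

# A rule LHS -> RHS is kept only if
# conf = support(LHS union RHS) / support(LHS) >= min_conf.
kept = []
for lhs, rhs in combinations(["a", "b", "c"], 2):
    conf = support({lhs, rhs}) / support({lhs})
    if conf >= min_conf:
        kept.append((lhs, rhs, round(conf, 2)))

# Every single item here has high support (0.8), so each rule's
# confidence is only 0.6 / 0.8 = 0.75 < min_conf: no rules survive,
# exactly the few-rules effect described for small record counts.
print(kept)  # []
```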
A joint between the adaptive k-means and apriori algorithms means that the adaptive k-means is applied first on the database file, and then the apriori algorithm is applied on the files produced by the adaptive k-means.
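This two-stage joint can be sketched as a pipeline. The clustering and rule-mining steps below are simplified stand-ins for the paper's adaptive k-means and apriori algorithms; the distance threshold for opening a new cluster and the pair-counting rule miner are assumptions made for illustration:

```python
from collections import Counter
from itertools import combinations

def cluster_records(records, threshold=2.0):
    """Simplified adaptive clustering: assign each record to the nearest
    centroid, or open a new cluster when no centroid is close enough."""
    centroids, clusters = [], []
    for r in records:
        best, best_d = None, None
        for i, c in enumerate(centroids):
            d = sum((x - y) ** 2 for x, y in zip(r, c)) ** 0.5
            if best_d is None or d < best_d:
                best, best_d = i, d
        if best is None or best_d > threshold:
            centroids.append(list(r))
            clusters.append([r])
        else:
            clusters[best].append(r)
            # Keep the centroid as the running mean of its members.
            n = len(clusters[best])
            centroids[best] = [((n - 1) * c + x) / n
                               for c, x in zip(centroids[best], r)]
    return clusters

def frequent_pairs(cluster, min_support=0.5):
    """Apriori-style second stage: value pairs that co-occur in many
    records of one cluster can be stored once instead of per record."""
    counts = Counter()
    for rec in cluster:
        for pair in combinations(sorted(rec), 2):
            counts[pair] += 1
    return [p for p, c in counts.items() if c / len(cluster) >= min_support]

# Stage 1 groups similar records, stage 2 mines each group separately.
records = [(1, 2), (1, 2), (1, 3), (9, 9), (9, 8)]
for cl in cluster_records(records):
    print(len(cl), frequent_pairs(cl))
```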
Table 7: The compression time of the three discussed schemes, measured in seconds.

Original size (KB)   AK comp time (sec)   Ap time (sec)   Apriori comp time on original DB (sec)   No of clusters
128                  1                    2               2                                        6
168                  1                    2               4                                        11
356                  2                    10              38                                       14
372                  11                   147             23                                       697
704                  4                    63              431                                      43
2693                 8                    31              887                                      82
Fig. 7 shows the time taken to compress the database file using the adaptive k-means algorithm alone, the apriori algorithm alone, and the joint of these two algorithms.
Fig 7. The time of the compression process when using each algorithm separately and when using a joint of these algorithms.
The above figure shows that applying the apriori algorithm on the original database takes more time than applying the apriori on the results of the adaptive k-means algorithm, except for the Medicine table, where applying it on the adaptive k-means results takes more time because applying the adaptive k-means on the original database items produces a very large number of clusters. In other words, as the number of clusters increases, the compression time increases.
Table 8: The decompression time of the three algorithms.

Original size (KB)   Clustering decomp time (sec)   Apriori decomp time (sec)   Apriori-on-DB decomp time (sec)   No of clusters
128                  2                              2                           2                                 6
168                  4                              4                           3                                 11
356                  11                             12                          15                                14
372                  157                            14                          14                                697
704                  73                             77                          86                                43
2693                 185                            187                         221                               82
In the above table, the first column is the original database size. The second column is the time taken to recover the original database using the adaptive k-means decompression algorithm. The third column is the time taken to recover the original database using the apriori decompression algorithm; here the apriori decompression algorithm is applied on the compressed files produced by applying the apriori compression algorithm on the adaptive k-means results. The fourth column is the time taken to recover the original database using the apriori decompression algorithm applied on the compressed files produced by applying the apriori algorithm on the original database data. The table shows that the apriori decompression algorithm takes more time when applied on the compressed files produced by applying the apriori compression algorithm on the original databases, whereas the decompression time of the apriori decompression algorithm applied on the compressed files produced by applying the apriori compression algorithm on the adaptive k-means results is less.
Fig 8. Illustration of the decompression time for the three algorithms.
The following table and Fig. 9 show a comparison between the compression time and the decompression time.
Table 9: A comparison between the compression time and the decompression time.

AK comp time (sec)   Ap time (sec)   Clustering decomp time (sec)   Apriori decomp time (sec)   No of clusters
1                    2               2                              2                           6
1                    2               4                              4                           11
2                    2               4                              4                           6
2                    10              11                             12                          14
11                   147             157                            14                          697
3                    31              92                             114                         6
4                    63              73                             77                          43
7                    18              71                             74                          107
3                    82              123                            140                         4
6                    157             549                            696                         10
11                   81              144                            151                         150
8                    31              185                            187                         82
The above table shows that the time taken to restore the original database is greater than the time taken to compress it. To restore the original database, the algorithm reads the data from the compressed file and appends this data to the new database, which delays the recovery of the original database. Appending data to the database is done automatically.
Fig 9. A comparison of the compression time and the decompression time.
7. Conclusions
In this work, database files are exploited to discover the grouping of objects and the correlated data in each group. A grouping scheme is proposed (adaptive k-means with the apriori algorithm). The adaptive k-means algorithm is used to discover the grouping of objects, while the apriori algorithm is used to extract the correlated data available in each group of objects. The experimental results show that the proposed compression algorithms are effective in reducing the quantity of transmitted data and improving compressibility, and thus reducing the energy consumed for data transfer. The apriori algorithm achieves a better compression ratio than the adaptive k-means algorithm, but the apriori compression algorithm takes more time than the adaptive k-means compression algorithm. The apriori decompression algorithm also takes more time than the adaptive k-means decompression algorithm to recover the original database from the compressed one. Therefore, the apriori algorithm is considered better than the adaptive k-means algorithm at compressing database files. Even when the apriori algorithm is applied alone on the database file, without being joined with the adaptive k-means algorithm, it gives a better compression ratio than the adaptive k-means algorithm alone, and it produces smaller files than those produced when the adaptive k-means is applied alone on the database file. However, when the apriori algorithm is applied alone on the database files, the compression takes more time than when the adaptive k-means is applied alone and when the apriori algorithm is applied on the adaptive k-means results. The proposed adaptive k-means algorithm can deal with any data type, such as text, dates, etc., while the conventional k-means algorithm deals with numerical data only. To validate the results, we calculated measurements used to evaluate the performance of compression algorithms, such as the compression ratio, compression time, and decompression time. The compression ratios of joining the apriori algorithm with the adaptive k-means algorithm range between 61% and 91%. The compression ratios of using the adaptive k-means algorithm alone range between 46% and 91%. As shown in section 4, the compression time ranges between 2 sec and 157 sec for the joint of the adaptive k-means with the apriori algorithm, and between 1 sec and 11 sec for the adaptive k-means algorithm. Finally, the decompression time of the apriori decompression algorithm has values between 2 sec and 696 sec, and the decompression time of the adaptive k-means decompression algorithm has values between 2 sec and 549 sec.
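The conclusions note that the adaptive k-means handles non-numeric attributes such as text and dates, unlike the conventional k-means. The paper does not spell out the distance measure that makes this possible; one common choice for mixed-type records is a Gower-style dissimilarity, sketched below (the field layout, ranges, and example records are hypothetical):

```python
from datetime import date

def mixed_distance(a, b, numeric_ranges):
    """Gower-style dissimilarity over mixed-type records:
    numeric fields use range-normalised absolute difference,
    dates are treated numerically via their ordinal value,
    and text fields contribute 0 on a match, 1 on a mismatch."""
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        if isinstance(x, date):
            x, y = x.toordinal(), y.toordinal()
        if isinstance(x, (int, float)):
            lo, hi = numeric_ranges[i]
            total += abs(x - y) / (hi - lo) if hi > lo else 0.0
        else:  # text field: simple matching coefficient
            total += 0.0 if x == y else 1.0
    return total / len(a)

# Hypothetical records: (age, job title, hire date).
r1 = (30, "nurse", date(2010, 1, 1))
r2 = (40, "nurse", date(2010, 1, 1))
ranges = {0: (20, 60),
          2: (date(2000, 1, 1).toordinal(), date(2020, 1, 1).toordinal())}
print(round(mixed_distance(r1, r2, ranges), 3))  # 0.083
```

Only the age field differs here, contributing 10/40 = 0.25 to an average over three fields, so the dissimilarity is small; any comparable measure would let a k-means-style assignment step work on text and date columns.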
References

Adriaans, P., & Zantinge, D. (1998). Data Mining. Addison-Wesley.
Afify, H., Islam, M., & Wahed, M. A. (2011). DNA Lossless Differential Compression Algorithm Based on Similarity of Genomic Sequence Database. International Journal of Computer Science & Information Technology (IJCSIT), Vol. 3, No. 4.
Agrawal, R., Imielinski, T., & Swami, A. (1997). Database Mining: A Performance Perspective. IEEE Trans. Knowledge and Data Engineering, England.
Berkhin, P. (2002). Survey of Clustering Data Mining Techniques. Accrue Software, 1045 Forest Knoll Dr., San Jose, CA, 95129.
Brin, S., Motwani, R., Ullman, J. D., & Tsur, S. (1997). Dynamic Itemset Counting and Implication Rules for Market Basket Data. In Proceedings of the ACM SIGMOD Conference (SIGMOD '97), Tucson, Arizona, USA.
Figueroa, V. (2010-2011). Clustering Methods Applied to Wikipedia. Année académique.
Goetz, G., & Leonard, D. S. (1991). Data Compression and Database Performance. Oregon Advanced Computing Institute (OACIS) and NSF awards IRI-8805200, IRI-8912618, and IRI-9006348.
Graham, W. (1994). Signal Coding and Processing (2nd ed.). Cambridge University Press, p. 34.
Han, J., & Kamber, M. (2001). Data Mining: Concepts and Techniques. Academic Press, USA.
Hand, D., Mannila, H., & Smyth, P. (2001). Principles of Data Mining. The MIT Press, Cambridge, London, England.
Jacob, N., Somvanshi, P., & Tornekar, R. (2012). Comparative Analysis of Lossless Text Compression Techniques. International Journal of Computer Applications (0975-8887), Vol. 56, No. 3.
Kodituwakku, S. R., & Amarasinghe, U. S. (2007). Comparison of Lossless Data Compression Algorithms for Text Data. Indian Journal of Computer Science and Engineering, Vol. 1, No. 4, 416-425.
Neeraj, S., & Swati, L. S. (2012). Overview of Non-redundant Association Rule Mining. Research Journal of Recent Sciences (Res. J. Recent Sci.), ISSN 2277-2502, Vol. 1(2), 108-112.
Pu, I. M. (2006). Fundamental Data Compression. Elsevier, Britain.
Rai, P., & Singh, S. (2010). A Survey of Clustering Techniques. International Journal of Computer Applications (0975-8887), Vol. 7, No. 12.
Rajaraman, A., & Ullman, J. D. (2010, 2011). Mining of Massive Datasets.
Report Concerning Space Data System Standards. (2006). Lossless Data Compression. CCSDS Secretariat, Office of Space Communication (Code M-3), National Aeronautics and Space Administration, Washington, DC 20546, USA.
Saad K. Majeed & Hussein K. Abbas. (2010). An Improved Distributed Association Rule Algorithm. Eng. & Tech. Journal, Vol. 28, No. 18.
Suarjaya, I. M. A. D. (2012). A New Algorithm for Data Compression Optimization. International Journal of Advanced Computer Science and Applications (IJACSA), Vol. 3, No. 8.
Szépkúti, I. (2004). Difference Sequence Compression of Multidimensional Databases. Periodica Polytechnica Ser. El. Eng., Vol. 48, No. 3-4, pp. 197-218.
Kona, H. V. (2003). Association Rule Mining over Multiple Databases: Partitioned and Incremental Approaches. Master's Thesis, The University of Texas at Arlington.
Yijun, Lin, X., & Tsang, C. (2000). An Efficient Distributed Algorithm for Computing Association Rules. Springer-Verlag, Berlin Heidelberg.