APPLICATION OF CURE DATA CLUSTERING ALGORITHM TO BATANGAS STATE UNIVERSITY STUDENT DATABASE




Nguyen Thi Linh

Department of Information Technology

ICT University


Thai Nguyen University

Thai Nguyen, Vietnam

Christopher C. Chua

Department of Informatics and Computing Sciences

Batangas State University

Batangas City, Philippines


ABSTRACT

Clustering is regarded as one of the most complex,
well-known, and most studied problems in data mining theory.
Data clustering is the process of grouping data into classes
or clusters so that objects within a cluster are highly
similar to one another but very dissimilar to objects in
other clusters. Increasing enrolment at Batangas State
University (BatStateU) means a growing student database,
which can be mined to discover patterns in large data sets.
The extracted patterns can be converted into understandable
information that is useful to the organization. A popular
data clustering algorithm, Clustering Using Representatives
(CURE), was implemented in the C# programming language to
cluster the student database of Batangas State University.

KEYWORDS: CURE algorithm, data clustering, data mining

I. INTRODUCTION

Data mining is one of the main steps in the process of
knowledge discovery. It is considered a complex process in
which intelligent methods are applied in order to extract
data patterns [1]. It integrates techniques from multiple
disciplines such as database and data warehouse technology,
statistics, machine learning, high-performance computing,
pattern recognition, neural networks, data visualization,
information retrieval, image and signal processing, and
spatial or temporal data analysis.

Research on data mining methods remains a central concern
of researchers and scientists. Given the vast and diverse
information resources available, discovering one general
method for data mining is impossible, because each kind of
information resource or database has its own methods that
are appropriate for mining it. The main objective of
researchers is therefore to find effective data mining
methods for each case.

One of the most complex, well-known, and most studied
problems in data mining theory is clustering. This term
refers to the process of grouping data into classes or
clusters so that objects within a cluster are highly similar
to one another but very dissimilar to objects in other
clusters [2]. As noted by Ma and Wu [3], dissimilarities are
assessed based on the attribute values describing the
objects, usually by means of distance measures.

CURE is an agglomerative hierarchical algorithm that builds
clusters gradually. It identifies clusters by using c
representative points, which are created by choosing
well-scattered points from the cluster and then shrinking
them toward the center of the cluster by a specified
fraction α [4]. The parameter α can also be used to control
the shapes of clusters. A smaller value of α contracts the
scattered points very little and thus favors elongated
clusters. With larger values of α, on the other hand, the
scattered points are located closer to the mean, and
clusters tend to be more compact [4]. During each iteration,
the two clusters with the closest pair of representative
points are merged, until the desired number of clusters is
reached. Having more than one representative point per
cluster allows CURE to adjust well to the geometry of
non-spherical shapes, and the shrinking helps to dampen the
effects of outliers.
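
The selection and shrinking of representative points can be
illustrated with a minimal C# sketch. It is not the code of
the developed application: the class CureSketch, its method
names, and the use of plain double[] arrays in place of the
POINT and CLUSTER classes are assumptions made for
illustration, and Euclidean distance is assumed as the
distance measure.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration; not the paper's POINT or CLUSTER classes.
public static class CureSketch
{
    // Euclidean distance (assumed distance measure).
    public static double Distance(double[] a, double[] b)
    {
        double sum = 0;
        for (int d = 0; d < a.Length; d++)
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        return Math.Sqrt(sum);
    }

    // Center (mean) of the points in a cluster.
    public static double[] Centroid(double[][] points)
    {
        var mean = new double[points[0].Length];
        foreach (var p in points)
            for (int d = 0; d < mean.Length; d++)
                mean[d] += p[d] / points.Length;
        return mean;
    }

    // Pick up to c well-scattered points: first the point farthest from the
    // center, then repeatedly the point farthest from those already chosen.
    public static double[][] SelectScattered(double[][] points, double[] center, int c)
    {
        var chosen = new List<double[]>();
        int target = Math.Min(c, points.Length);
        while (chosen.Count < target)
        {
            double[] best = points[0];
            double bestDist = -1;
            foreach (var p in points)
            {
                if (chosen.Contains(p)) continue; // same array instance already picked
                double dist = chosen.Count == 0
                    ? Distance(p, center)
                    : chosen.Min(q => Distance(p, q));
                if (dist > bestDist) { bestDist = dist; best = p; }
            }
            chosen.Add(best);
        }
        return chosen.ToArray();
    }

    // Move each scattered point toward the center by the fraction alpha:
    // alpha near 0 barely moves the point, alpha = 1 moves it onto the center.
    public static double[][] Shrink(double[][] scattered, double[] center, double alpha)
    {
        return scattered
            .Select(p => p.Select((x, d) => x + alpha * (center[d] - x)).ToArray())
            .ToArray();
    }

    // Distance between two clusters: the closest pair of representatives.
    public static double ClusterDistance(double[][] repsA, double[][] repsB)
    {
        return repsA.Min(a => repsB.Min(b => Distance(a, b)));
    }
}
```

For a cluster holding the points (0, 0) and (4, 0), the
center is (2, 0), and Shrink with α = 0.5 moves the two
points to (1, 0) and (3, 0). A larger α pulls them closer to
the center, which is the compacting effect described above.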


In this paper, the following objectives are attained:

1. Characterize the student database of BatStateU.

2. Develop a database clustering application using the
CURE algorithm.

3. Utilize the developed application to cluster the
student database of BatStateU.

II. METHODOLOGY


This paper used the constructive research method to come up
with a data clustering application. The constructive
research method deals with building an artifact (practical,
theoretical, or both) that solves a domain-specific problem
in order to create knowledge about how the problem can be
solved (or understood, explained, or modeled) in principle
[5]. The C# object-oriented programming language was used to
design the interface and to implement the CURE algorithm and
the functions of the application. SQL Server 2005 was used
for pre-processing data, designing data tables, and
implementing connections, queries, and stored procedures to
ensure interaction between the user and the application, as
well as between the application and the database system.



Pre-processing the raw data to prepare it for further
processing is needed in data mining [6]. In this context,
the BatStateU student database was pre-processed manually
and by queries using select, update, and insert procedures
in SQL. Each table is represented by an entity in the SQL
Server database and a class in the developed application. A
total of nine tables were created, namely tblGrade,
tblStudent, tblDepartment, tblCourse, tblSubject,
tblInstructor, tblYear, tblGradeFilter, and tblGradeCluster.

In the design of the interface, six menus were created,
namely BSU Data, User Data, Parameters, View, Window, and
Help. Under the BSU Data menu, the application has a main
form called frmBsuData, which presents all information
filtered by the user. Under this form, seven sub-forms can
be found. The forms frmGrade, frmStudent, frmCourse,
frmSubject, frmInstructor, and frmYear are used to input
information about grades, students, courses, subjects,
instructors, and school years. The form frmGradeCluster
allows the user to choose the object to cluster.

Under the User Data menu, the application has a form called
frmNewData, which allows the user to input data for
clustering. For the Parameters menu, the form named
DialogSetAll is defined in order to set the clustering
parameters. The application has a main clustering form
called frmOpenData, in which the data to be clustered and
the clustering results are presented. The View and Window
menus control the display of the toolbar, the status bar,
and the windows of the application. Under the Help menu, the
application has a form called frmHelp, which provides
instructions on how the application can be used.

CURE and its supporting functions were implemented using an
object-oriented structure of classes and objects. In
addition, two data structures, HEAP and KD_TREE, were used
by CURE. The main functions of the application allow
managing information about grades, students, subjects, and
courses; filtering the BatStateU data to be clustered;
setting clustering parameters; performing data clustering;
presenting the result; and saving or printing results. The
application also allows mined data to be entered directly or
loaded from files for clustering.

The database of the BatStateU Graduate School (GS) students
was used to cluster students, subjects, and courses. Data
clustering involves steps such as filtering the data to be
clustered, transforming the filtered data into mined data,
choosing the clustering object, setting the clustering
parameters of the CURE algorithm, and executing the
clustering [2].

In order to choose the best results, the clustering process
on specific data is repeated many times using different
clustering parameters of the CURE algorithm. These include
the number of clusters (k), the number of representative
points (c), and the shrink coefficient (α). The clustering
result changes when the value of any one of these parameters
changes.

III. RESULTS AND DISCUSSION

A. Characteristics of Student Database of BatStateU

The BatStateU student database is described by means of the
following fields: SRCODE for student, CODE for subject,
COURSE CODE for course, and FG for grade.

The SRCODE is used to identify each student. It is a unique
field consisting of the year of enrolment followed by five
digits (e.g. 2002-00979). The subject code is stored in the
CODE field and consists of the initial letters of the course
name or subject name followed by numbers (e.g. IT 512).
Courses in the BatStateU database are represented by the
COURSE CODE field and are coded using the initial letters of
the course name (e.g. DPENGLS).

The final grade is stored in the FG field. If a student
received an incomplete grade in a certain subject, the
developed application assigns a value of zero to the FG.
Possible values for this field are 1.00, 1.25, 1.50, 1.75,
2.00, 2.25, 2.50, 2.75, 3.00, 4.00, 5.00, or 0.00.
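
The handling of the FG field described above could look
roughly like the following C# sketch. The class name
FinalGradeSketch, the method Normalize, and the use of the
text "INC" as the marker of an incomplete grade are all
assumptions made for illustration; the paper does not state
how incomplete grades are recorded in the raw data.

```csharp
using System;
using System.Globalization;
using System.Linq;

// Hypothetical helper; names and the "INC" marker are assumptions.
public static class FinalGradeSketch
{
    // Values listed in the text as possible FG entries.
    private static readonly decimal[] Allowed =
    {
        0.00m, 1.00m, 1.25m, 1.50m, 1.75m, 2.00m,
        2.25m, 2.50m, 2.75m, 3.00m, 4.00m, 5.00m
    };

    // Map a raw grade entry to an FG value; an incomplete grade becomes 0.00.
    public static decimal Normalize(string raw)
    {
        string text = (raw ?? "").Trim();
        if (text.Equals("INC", StringComparison.OrdinalIgnoreCase))
            return 0.00m;

        bool parsed = decimal.TryParse(
            text, NumberStyles.Number, CultureInfo.InvariantCulture, out decimal value);
        if (parsed && Allowed.Contains(value))
            return value;

        // Anything else is rejected rather than silently stored.
        throw new FormatException($"Unexpected FG value: '{raw}'");
    }
}
```

Rejecting values outside the listed set keeps unexpected
entries from becoming grade kinds of their own when the
frequencies of grades are counted later.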

B. Development of the Data Clustering Application Using CURE Algorithm

The developed clustering application includes many classes
and objects, which were structured as a two-tier
architecture in a Windows-based application. The two tiers
are the presentation and implement tiers.

The graphical user interface of the application is designed
at the presentation tier. This tier contains the Windows
forms used to present data and accept input from users.

The implement tier contains the business logic, validations,
and calculations related to the data. This tier contains
classes that support data access and the clustering process
of the CURE algorithm. The classes that support data access
are named clsTblCourse, clsTblSubject, clsTblStudent,
clsTblGrade, clsTblYear, clsTblInstructor,
clsTblGradeCluster, and FILES.

Classes which support the clustering process of the CURE
algorithm are named POINT, CLUSTER, HEAP, KD_TREE, and
CLUSTERING_CURE. Specific methods for each class were
created.

The clustering process using the CURE algorithm is
implemented in the CLUSTERING_CURE class. Its main procedure
is named Cluster(). The procedure needs a set of points and
the desired number of clusters k. The result of this
procedure is the set of desired clusters. The
CLUSTERING_CURE class is defined as follows:



class CLUSTERING_CURE
{
    // The set of initial points to be clustered
    private POINT[] points;

    // Initialize a CLUSTERING_CURE object from a set of points
    public CLUSTERING_CURE(POINT[] points)
    {
        this.points = points;
    }

    public CLUSTER[] Cluster()
    {
        // Initialize the heap and the k-d tree
        CLUSTER[] resultArray;
        CLUSTER[] clusters = TOOLS.Points2Clusters(points);
        KD_TREE T = new KD_TREE(points);
        HEAP Q = new HEAP(clusters);

        // Clustering loop: merge until only k clusters remain
        while (Q.size() > TOOLS.k)
        {
            // Take the minimum cluster u and merge it with its closest cluster v
            CLUSTER u = Q.DeleteMin();
            CLUSTER v = u.Closest;
            Q.Delete(v);
            CLUSTER w = u.merge(v);

            // Delete u.Rep and v.Rep from the tree T and insert w.Rep
            bool deleteOK = true;
            foreach (POINT p in u.Rep)
                T.Delete(p, 0, ref deleteOK);
            foreach (POINT p in v.Rep)
                T.Delete(p, 0, ref deleteOK);
            foreach (POINT p in w.Rep)
                T.Insert(p);

            // Initialize w.Closest with the first cluster in Q
            w.Closest = Q.Data[0];
            w.DistCloset = w.distCluster(w.Closest);

            // Search for w.Closest and update Closest for the other clusters in Q
            for (int i = 0; i <= Q.Last; i++)
            {
                CLUSTER x = Q.Data[i];

                // Update w.Closest
                if (w.distCluster(x) < w.distCluster(w.Closest))
                {
                    w.Closest = x;
                    w.DistCloset = w.distCluster(w.Closest);
                }

                // Update x.Closest
                if (TOOLS.Equals(x.Closest, u) || TOOLS.Equals(x.Closest, v))
                {
                    if (x.distCluster(x.Closest) < x.distCluster(w))
                    {
                        CLUSTER closest = T.Closest_Cluster(x, x.distCluster(w), Q);
                        if (!TOOLS.Equals(x, closest))
                        {
                            x.Closest = closest;
                            x.DistCloset = x.distCluster(x.Closest);
                        }
                    }
                    else if (!TOOLS.Equals(x, w))
                    {
                        x.Closest = w;
                        x.DistCloset = x.distCluster(x.Closest);
                    }
                    Q.Relocate(i);
                }
                else if (x.distCluster(x.Closest) > x.distCluster(w))
                {
                    x.Closest = w;
                    x.DistCloset = x.distCluster(x.Closest);
                    Q.Relocate(i);
                }
            }
            Q.Insert(w);
        } // End while

        // Return the clustering result
        resultArray = new CLUSTER[Q.size()];
        for (int i = 0; i <= Q.Last; i++)
            resultArray[i] = Q.Data[i];
        return resultArray;
    }
}

Finally, the TOOLS class was designed to mediate between the
presentation tier and the implement tier. Properties and
methods supporting the clustering process in the implement
tier are created in this class.

C. Clustering the Student Database of BatStateU

Specifically, the BatStateU-GS database from the 2008-2009
to the 2009-2010 academic years, which includes 529
students, was used for clustering. The steps performed in
the clustering process were filtering data, transforming the
filtered data into mined data, choosing the clustering
object, setting the clustering parameters, and executing the
clustering.

Clustering of students was based on the statistical ratio of
the grades that each student achieved in all subjects. Table
1 shows the computation of the grades of a student.

Table 1. Computation of Grade of a Student




The statistical ratio of each kind of grade is calculated by
dividing the frequency of that kind of grade by the total
frequency of all kinds of grade. Hence, the statistical
ratio of grade 1.25 is computed as 6/13 = 0.4615. The
student has statistical ratios for the 0.00, 1.00, 1.25,
1.50, 1.75, and 2.00 grades of 0.0000, 0.0000, 0.4615,
0.4615, 0.0769, and 0.0000, respectively. The application
automatically calculates the statistical ratio of each kind
of grade and creates a data point (e.g. G2008-00148, 0.0000,
0.0000, 0.4615, 0.4615, 0.0769, 0.0000). This data point is
saved in a table called tblGradeCluster, which resides in
the SQL Server database. All data saved in the
tblGradeCluster table are used for clustering.
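
The computation of a data point from grade frequencies can
be sketched in C# as follows. This is an illustration under
stated assumptions, not the application's code: the class
GradeRatioSketch and its members are hypothetical names,
grades are keyed by their text form, and the frequencies 6,
6, and 1 for the grades 1.25, 1.50, and 1.75 are inferred
from the ratios quoted above (6/13, 0.4615, 0.0769) rather
than read from Table 1. Rounding to four decimal places is
also an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper; not part of the developed application.
public static class GradeRatioSketch
{
    // Grade kinds in the order used by the example data point in the text.
    public static readonly string[] GradeKinds =
        { "0.00", "1.00", "1.25", "1.50", "1.75", "2.00" };

    // Ratio of each grade kind = its frequency / total frequency of all grades.
    public static double[] ToRatios(IDictionary<string, int> frequency)
    {
        int total = frequency.Values.Sum();
        if (total == 0)
            throw new ArgumentException("No grades recorded.", nameof(frequency));

        return GradeKinds
            .Select(kind => frequency.TryGetValue(kind, out int count)
                ? Math.Round((double)count / total, 4)   // four decimals (assumed)
                : 0.0)                                   // grade kind never obtained
            .ToArray();
    }

    // Data point in the style of the text: identifier followed by the ratios.
    public static string ToDataPoint(string id, double[] ratios)
    {
        return id + ", " + string.Join(", ",
            ratios.Select(r => r.ToString("0.0000",
                System.Globalization.CultureInfo.InvariantCulture)));
    }
}
```

Calling ToRatios with the frequencies { "1.25": 6, "1.50": 6,
"1.75": 1 } gives a total of 13, so the ratio of grade 1.25
is 6/13, which rounds to 0.4615, and the ratio of grade 1.75
is 1/13, which rounds to 0.0769. Passing the result to
ToDataPoint with the identifier G2008-00148 reproduces the
data point quoted in the text.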

In using the developed application, the following steps are
performed:

1) Open the grade table of students from the BSU Data menu.

2) Transform the data to mined data by clicking the
'Transform into mined data' button.

3) Click the 'Yes' button to cluster the data.

4) Choose the desired clustering object (e.g. students).

5) Set the parameters (e.g. shrink coefficient = 0.7, no. of
representatives = 4, no. of clusters = 15).

6) Click the 'OK' button.

Fig. 1 shows the student data clustering result. Clustering
526 students corresponds to clustering 526 data points. The
number of clusters (k) can be set as desired from 1 to 526.
The number of representative points (c) varies from 1 to 10
and the shrink coefficient (α) from 0.1 to 0.9. After
several student clustering experiments with various
parameters, the researchers found that k = 15, c = 4, and
α = 0.7 gave the best clustering result.

Under these settings, most students are seen to have grades
of 1.5 and are rated as good in terms of their performance.


Figure 1. Student Data Clustering Result

There is no established method for selecting the number of
clusters, the number of representative points, or the shrink
coefficient that will give the best clustering result. What
is known is only that the greater the similarity
coefficient, the more similar the two data points of the two
clusters are [3]. The authors therefore performed many
experiments and used empirical evaluation to choose the best
results. A clustering result is considered the best when the
data points within a cluster have the highest similarity to
one another.

For clustering of courses, the developed application
clustered 25 courses, which correspond to 25 data points.
The clustering process for courses is similar to the steps
executed in student clustering, except that the 'course
object' is selected when choosing the desired clustering
object. After many course clustering experiments with
various parameters, the authors found that k = 8, c = 3, and
α = 0.6 gave the best result. Fig. 2 shows the course
clustering result.



Figure 2. Course Clustering Result

The statistical ratios of grades in each course were
obtained. The frequency of each grade in each course was
counted. The statistical ratio of each kind of grade was
calculated by dividing the frequency of that kind of grade
by the total frequency of all kinds of grade. Hence, the
statistical ratio of grade 0.00 was computed as
36/405 = 0.0888. The MSINTEC course had statistical ratios
for the 0.00, 1.00, 1.25, 1.50, 1.75, and 2.00 grades of
0.0888, 0.0123, 0.2716, 0.4567, 0.1753, and 0.0024,
respectively. The application automatically calculated the
statistical ratio of each kind of grade and created a data
point (e.g. MSINTEC, 0.0888, 0.0123, 0.2716, 0.4567, 0.1753,
0.0024). The computation is illustrated in Table 2.

Table 2. Grades Obtained by Students in Course MSINTEC



For subject clustering, the subject code and FG fields were
used. The clustering result was used to evaluate the level
of difficulty of each subject. The clustering process for
subjects was based on the statistical ratio of grades in
each subject. The statistical ratio of grades in the subject
IT 509 - E-learning and Related Technology is presented in
Table 3.


Table 3. Grades Obtained by Students in the Subject IT 509

The application automatically calculates the statistical
ratio of each kind of grade and creates a data point (e.g.
IT 509, 0.1569, 0.0000, 0.2745, 0.3529, 0.2157, 0.0000).
This data point is saved in the tblGradeCluster table in the
SQL Server database. After several subject clustering
experiments with various parameters, the authors found that
k = 10, c = 4, and α = 0.7 achieved the best result. Fig. 3
shows the subject clustering result.

IV. CONCLUSION AND FUTURE WORK



In this paper, the student database of BatStateU is
described. The database uses the SRCODE field for student
identification, the CODE field for subject name, the COURSE
CODE field for course name, and the FG field for grading. A
database clustering application using the CURE algorithm was
successfully developed using C# and SQL Server 2005. The
application is based on a two-tier architecture in which
several classes with specific methods were created to
support data access and the clustering process of the CURE
algorithm. With regard to the clustering of the database,
the best clustering results for students, courses, and
subjects are achieved with (k = 15, c = 4, α = 0.7),
(k = 8, c = 3, α = 0.6), and (k = 10, c = 4, α = 0.7),
respectively. Further analysis revealed that students are
performing well and that subjects in the BatStateU GS are
moderately difficult.

The application developed in this paper can be modified to
focus on a method for selecting the number of clusters k,
the number of representatives c, or the shrink coefficient α
that will give the best clustering result using the CURE
algorithm. Similar studies can be conducted using improved
variants of CURE, applying them to complicated databases
with mixed types of data, such as weather, business, and
geographical databases.


Figure 3. Subject Clustering Result

REFERENCES

1) M.J.A. Berry and G.S. Linoff, Mining the Web:
Transforming Customer Data, John Wiley & Sons, New York,
2002.

2) J.R. Dubes, Algorithms for Clustering Data,
Prentice-Hall, 1998.

3) G.G. Ma and J. Wu, Data Clustering: Theory, Algorithms,
and Applications, ASA-SIAM Series on Statistics and Applied
Probability, 2007.

4) Guha, R. Rastogi, and K. Shim, CURE: An Efficient
Clustering Algorithm for Large Databases,
ACM 0-89791~996.5/98/006, 1998.

5) G.D. Crnkovic, "Constructive Research and
Info-Computational Knowledge Generation," Model-Based
Reasoning in Science and Technology, Studies in
Computational Intelligence, Vol. 314, 2010, pp. 359-380.

6) I.B. Gul and A. Nosheen, MFP: A Mechanism for
Determining Associated Patterns of Stock, Proceedings of the
6th International Conference on Frontiers of Information
Technology, ISBN: 978-1-60558-642-7, 2009.












