IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING (TKDE), MANUSCRIPT ID
A Unifying Domain-Driven Framework for Clustering with Plug-In Fitness Functions and Region Discovery

Christoph F. Eick, Oner U. Celepcikay, Rachsuda Jiamthapthaksin and Vadeerat Rinsurongkawong
Abstract—The main challenge in developing methodologies for domain-driven data mining is incorporating domain knowledge and domain-specific evaluation measures into the data mining algorithms and tools, so that “actionable knowledge” can be discovered. In this paper a generic, domain-driven clustering framework is proposed that incorporates domain intelligence into domain-specific plug-in fitness functions that are maximized by the clustering algorithm. The framework provides a family of clustering algorithms and a set of fitness functions, along with the capability of defining new fitness functions. Fitness functions are the core components of the framework, as they capture a domain expert’s notion of interestingness. The fitness function is independent of the clustering algorithm employed. The framework also incorporates domain knowledge through preprocessing and post-processing steps and parameter selections. This paper introduces the framework in detail and illustrates it through demonstrations and case studies that center on spatial clustering and region discovery. Moreover, the paper introduces an ontology and a theoretical foundation for clustering with fitness functions in general, and region discovery in particular. Finally, intensional clustering algorithms that operate on cluster models are introduced.
Index Terms—Clustering, Data Mining, Spatial Databases and GIS, Domain-driven Data Mining
1 INTRODUCTION
To extract knowledge from the immense amount of data that has been generated by advances in data acquisition technologies has been a major focus of data mining research over the last 20 years. However, it has been observed that knowledge obtained from traditional data-driven data mining algorithms in domain-specific applications is not really actionable [1], because the extracted knowledge does not capture what domain experts are interested in. This observation can be explained by two limitations of traditional data mining: 1) traditional data mining algorithms insufficiently incorporate domain intelligence to aid the mining process, and 2) the algorithms use technical significance as their sole evaluation measure.
As far as the first limitation is concerned, domain intelligence includes the involvement of domain knowledge, domain-specific constraints and experts. Consider a situation in which a clustering algorithm is used to identify clusters in a specific domain. Different clustering algorithms have their own assumptions on clustering criteria, e.g. tightness, connectivity, separation and so on. Because clustering is NP-hard, clustering algorithms focus their search efforts on clusters that maximize those criteria, frequently generating “optimal” but out-of-interest clusters. Clustering with constraints intends to alleviate this problem by incorporating must-link and cannot-link constraints to better guide the search for good clusters [2]. The second limitation occurs because, in traditional data mining, the actionability of knowledge is determined solely by technical significance based on domain-independent criteria [1]; this type of measure usually differs from domain-specific expectations and measures of interestingness. To address this problem, both technical and domain-specific significance should be considered when assessing cluster quality. Consequently, the main challenge in developing methodologies and techniques for domain-driven data mining is to incorporate domain knowledge into data mining algorithms and tools so that actionable knowledge can be discovered.
In this paper, we propose a unifying domain-driven clustering framework that provides families of clustering algorithms with plug-in fitness functions capable of discovering actionable knowledge. The fitness function is the core component of the framework, as it captures the domain expert’s notion of interestingness. The fitness function is specifically designed to be externally plugged in to provide extensibility and flexibility; the fitness function component of the framework is independent of the clustering algorithms employed.
In general, families of task- and/or domain-specific fitness functions are employed to capture domain interestingness and to incorporate domain knowledge. For example, let us consider a data mining task in which geologists are interested in discovering hotspots in geographical space where deep earthquakes are in close
Authors are with the Department of Computer Science, University of Houston, Houston, TX 77204. Emails: (ceick, onerulvi, rachsuda, vadeerat)@cs.uh.edu.
Manuscript received 03/31/2009.
proximity to shallow earthquakes. That is, they are interested in identifying contiguous regions in an earthquake data set for which the variance of the variable earthquake_depth is high. When using our framework, the geologist’s notion of interestingness is captured in the form of a High Variance fitness function, formally defined in Section 3. The domain expert additionally selects parameters to instruct the clustering algorithm about which patterns they are really interested in: an earthquake-depth variance threshold and a parameter that controls cluster granularity and the size of the spatial clusters discovered. Next, a clustering algorithm is run with the parameterized High Variance fitness function, and high-variance earthquake depth hotspots are obtained, as displayed in Fig. 1.
Fig. 1. Examples of interesting regions discovered by a domain-driven clustering algorithm using a High Variance fitness function
Our framework incorporates domain knowledge not only through domain-specific fitness functions, but also through preprocessing and post-processing steps, fitness function parameter selections including seed patterns, threshold parameter values that are suitable for a specific domain, and desired cluster granularities. The family of clustering algorithms supported by the framework includes divisive, grid-based, prototype-based and agglomerative clustering algorithms, all of which support plug-in fitness functions.
The first high-level domain-driven data mining framework was introduced by Cao and Zhang [1]. In this framework, domain intelligence is incorporated into the KDD process towards actionable knowledge discovery, and the framework has been illustrated through mining activity patterns in social security data. They also proposed criteria to measure the actionability of the knowledge. Yang [3] introduced a framework with two techniques to produce actionable output from traditional KDD output models. The first technique uses an algorithm for extracting actions from decision trees such that each test instance falls in a desirable state. The second technique uses an algorithm that can learn relational action models from frequent item sets. This technique is applied to automatic planning systems, and Yang’s Action-Relation Modeling System (ARMS) automatically acquires action models from recorded user plans.
One subcomponent of the domain knowledge that must be incorporated into any domain-driven data mining framework is human intelligence, and Multiaspect Data Analysis (MDA) is an important Brain Informatics methodology. Brain Informatics considers the brain as an information-processing system in order to understand its mechanism for analyzing and managing data. But, since brain researchers cannot use MDA results directly, Zhong [4] proposes a methodology that employs an explanation-based reasoning process that combines multiple-source data into more general results to form actionable knowledge. Zhong’s framework basically takes traditional KDD output as input to an explanation-based reasoning process that generates actionable output. The concept of moving from method-driven or data-driven data mining to domain-driven data mining has recently been proposed and is featured in [5]. The authors describe four aspects of moving data mining from a method-driven approach to a process that focuses on domain knowledge. In general, the use of plug-in fitness functions is not very common in traditional clustering; the only exception is the CHAMELEON [6] clustering algorithm. However, fitness functions play a more important role in semi-supervised and supervised clustering [7] and in adaptive clustering [8].
The main contributions of this paper are that it:
1. Introduces a unifying domain-driven clustering framework for actionable knowledge discovery.
2. Proposes a novel domain-specific fitness function model that is plugged into clustering algorithms externally to capture domain interestingness.
3. Presents a set of fitness functions capable of serving clustering tasks in various domains.
4. Introduces a family of clustering algorithms, most of which have been developed in our previous work, as part of the framework, and introduces novel intensional clustering algorithms that directly manipulate cluster models.
5. Illustrates the deployment of the proposed framework and its benefits in challenging real-world case studies.
The remainder of this paper is organized as follows: In Section 2, we formally present our domain-driven clustering framework. Section 3 provides a detailed discussion of domain-specific plug-in fitness functions, including three examples. Section 4 introduces the family of clustering algorithms provided in our framework, and Section 5 illustrates the framework through demonstrations and case studies. Section 6 concludes the paper.
2 SPATIAL CLUSTERING WITH PLUG-IN FITNESS FUNCTIONS
2.1 Preview
As mentioned in the introduction, the goal of this paper is to introduce a highly generic clustering framework that supports plug-in fitness functions to capture domain interestingness. As we will discuss later, the framework is very general and can be used for traditional clustering. However, because almost all of our applications involve spatial data mining, the remainder of this paper will mostly focus on spatial clustering, and on region discovery in particular. The goal of spatial clustering is to identify interesting groups of objects in the subspace of the spatial attributes. Region discovery is a special type of spatial clustering that focuses on finding interesting places in spatial datasets. Moreover, in this section and in Section 4, a theoretical foundation and ontology for clustering with plug-in fitness functions is introduced. Finally, novel intensional clustering algorithms are introduced.
2.2 An Architecture for Region Discovery
As depicted in Fig. 2, the proposed region discovery framework consists of three key components. The first two components are families of clustering algorithms and fitness functions that play a major role in discovering interesting regions and their associated patterns. As we will discuss in more detail soon, the framework uses clustering algorithms that support plug-in fitness functions to find interesting regions in spatial datasets. Decoupling cluster evaluation from the search for good clusters creates flexibility in using any clustering algorithm with any fitness function. The role of the third component is to manage and integrate datasets residing in several repositories; it will not be discussed further in this paper.
Fig. 2. Region Discovery Framework
2.3 Goals and Objectives of Region Discovery
As mentioned earlier, the goal of region discovery is to find interesting places in spatial datasets. Our work assumes that the region discovery algorithms we develop operate on datasets containing objects o1,…,on: O = {o1,…,on} ⊆ F, where F is a relational database schema, and the objects belonging to O are tuples that are characterized by the attributes S ∪ N, where:
S = {s1,…,sq} is a set of spatial attributes.
N = {n1,…,np} is a set of non-spatial attributes.
Dom(S) and Dom(N) describe the possible values the attributes in S and N can take; that is, each object o ∈ O is characterized by a single tuple that takes values from Dom(S) × Dom(N)¹.
¹ If S is empty we call the problem a traditional clustering problem. One key characteristic of spatial clustering is that spatial and non-spatial attributes play different roles in the clustering process, which is not the case in traditional clustering.
In general, clustering algorithms can be subdivided into intensional clustering and extensional clustering algorithms: extensional clustering algorithms just create clusters for the data set O, partitioning O into subsets, but do
nothing else. Intensional clustering algorithms, on the other hand, create a clustering model based on O and other inputs. Most popular clustering algorithms have been introduced as extensional clustering algorithms, but it is not too difficult to generalize most extensional clustering algorithms so that they become intensional clustering algorithms, as we present in Section 5.
Extensional clustering algorithms create clusterings X on O that are sets of disjoint subsets of O:
X = {c1,...,ck} with ci ⊆ O (i=1,…,k) and ci ∩ cj = ∅ (i ≠ j)
Intensional clustering algorithms create a set of disjoint regions Y in F:
Y = {r1,...,rk} with ri ⊆ F (i=1,…,k) and ri ∩ rj = ∅ (i ≠ j)
In the case of spatial clustering and region discovery, cluster models have a peculiar structure in that they seek regions in the subspace Dom(S) and not in F itself: a region discovery model is a function² φ: Dom(S) → {1,…,k} ∪ {⊥} that assigns a region φ(p) to a point p in Dom(S), assuming that there are k regions in the spatial dataset; the number of regions k is chosen by the region discovery algorithm that creates the model. Models support the notion of outliers; that is, a point p′ can be an outlier that does not belong to any region: in this case, φ(p′) = ⊥.
Intensional region discovery algorithms obtain a clustering Y in Dom(S) that is defined as a set of disjoint regions in Dom(S)³:
Y = {r1,...,rk} with ri ⊆ F[S] (i=1,…,k) and ri ∩ rj = ∅ (i ≠ j)
Moreover, the regions r belonging to Y are described as functions over tuples in Dom(S), ℜr: Dom(S) → {t,f}, indicating if a point p ∈ Dom(S) belongs to r: ℜr(p) = t. ℜr is called the intension of r. ℜr can easily be constructed from the model of a clustering Y. Moreover, the extension ε(r) of a region r is defined as follows:
ε(r) = {o ∈ O | ℜr(o[S]) = t}
In the above definition o[S] denotes the projection of o on its spatial attributes.
Our approach requires discovered regions to be contiguous. To cope with this constraint in extensional clustering, we assume that we have a neighbor relationship no() between the objects in O and a cluster neighbor relationship nc() between regions in X defined with respect to O: if no(o,o′) holds, objects o and o′ are neighboring; if nc(r,r′) holds, regions r and r′ are neighboring.
no ⊆ O × O
nc ⊆ 2^O × 2^O
Moreover, neighboring relationships are solely determined by the attributes in S; that is, the temporal and spatial attributes in S are used to determine which objects and clusters are neighboring. A region r is contiguous if for each pair of points u and v in r there is a path between u and v that solely traverses r and no other regions.
² ⊥ denotes “undefined”.
³ F[S] denotes the projection of F on the attributes in S.
More
formally, contiguity⁴ is defined as a predicate over subsets c of O:
contiguous(c) ⇔ ∀w∈c ∀v∈c ∃m≥2 ∃x1,…,xm∈c: w = x1 ∧ v = xm ∧ no(xi, xi+1) (i=1,…,m−1)
contiguous(X) ⇔ ∀c∈X: contiguous(c)
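The contiguity predicate amounts to graph connectivity under the neighbor relation no(); a minimal sketch (the grid-style neighbor relation below is a hypothetical example, not a definition from the paper):

```python
from collections import deque

def contiguous(cluster, no):
    """Check the contiguity predicate for one cluster: every pair of
    objects must be connected by a path of neighboring objects
    (relation `no`) that stays inside the cluster. This is the usual
    graph-connectivity reading of the definition above."""
    members = list(cluster)
    if not members:
        return True
    seen = {members[0]}
    queue = deque([members[0]])
    while queue:  # breadth-first search over the induced neighbor graph
        u = queue.popleft()
        for v in members:
            if v not in seen and no(u, v):
                seen.add(v)
                queue.append(v)
    return len(seen) == len(members)

# Illustrative neighbor relation: grid cells are neighbors if adjacent.
def no(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1
```

The check contiguous(X) over a whole clustering is then just `all(contiguous(c, no) for c in X)`.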
Our approach employs arbitrary plug-in, reward-based fitness functions to evaluate the quality of a given set of regions. The goal of region discovery is to find a set of regions X that maximizes an externally given fitness function q(X); moreover, q is assumed to have the following structure:

q(X) = Σ_{c∈X} i(c)·|c|^β    (1)

where i(c) is the interestingness of a region c, a quantity designed by a domain expert to reflect the degree to which regions are “newsworthy". The number of objects in O belonging to a region c is denoted by |c|, and the quantity i(c)·|c|^β can be considered a “reward" given to a region c; we seek X such that the sum of the rewards over all of its constituent regions is maximized. The amount of premium put on the size of a region is controlled by the value of the parameter β (β>1). A region’s reward is proportional to its interestingness, but larger regions receive a higher reward than smaller regions having the same value of interestingness, to reflect a preference for larger regions. Furthermore, it is assumed that the fitness function q is additive; the reward associated with X is the sum of the rewards of its constituent regions.
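The additive, size-premium structure of equation (1) can be sketched in a few lines (a toy illustration; the interestingness measure `i` below is a hypothetical stand-in, not one of the paper's fitness functions):

```python
def fitness(clustering, interestingness, beta=1.2):
    """Additive, reward-based fitness q(X) in the shape of equation (1):
    each cluster c contributes the reward i(c) * |c|**beta."""
    return sum(interestingness(c) * len(c) ** beta for c in clustering)

# Toy interestingness: fraction of positive values in a cluster
# (a stand-in for a domain expert's i(c)).
def i(c):
    return sum(1 for o in c if o > 0) / len(c)

X = [[1, 1, -1], [1, 1, 1, 1]]
q = fitness(X, i, beta=1.2)   # (2/3)*3**1.2 + 1.0*4**1.2
```

Because q is additive, a clustering algorithm can evaluate a merge or split by recomputing the rewards of only the affected regions.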
The reader might ask why we restrict the form of fitness functions in our proposed framework. The main reason is our desire to develop efficient clustering algorithms for region discovery. Restricting the form of the fitness functions supported allows us to use knowledge about the structure of the fitness function to obtain faster clustering algorithms which employ pruning, incremental updating, and sophisticated search strategies. This topic will be revisited in Section 4 of this paper when specific clustering algorithms are introduced.
Given a spatial dataset O, there are many possible clustering algorithms to seek interesting regions in O with respect to a plug-in fitness function q. In general, the objective of region discovery with plug-in fitness functions is:
Given: O, q, and possibly other input parameters
Find: regions r1,...,rk that maximize q({r1,...,rk}) subject to the following constraints:
(1a) ri ⊆ O (i=1,…,k) for extensional clustering
(1b) ri ⊆ F[S] (i=1,…,k) for intensional clustering
(2) contiguous(ri) (i=1,…,k)
(3) ri ∩ rj = ∅ (i ≠ j)
It should be emphasized that the number of regions k is not an input parameter in the proposed framework; that is, region discovery algorithms are assumed to seek the optimal number of regions k.
⁴ Other alternative definitions of contiguity exist, but will not be discussed in this paper due to lack of space.
3 DOMAIN-SPECIFIC PLUG-IN FITNESS FUNCTIONS
The fitness function, whose general form was given in equation (1), is the core component of our framework for capturing the notion of interestingness of the domain. The main challenge in developing methodologies and techniques for domain-driven data mining is to incorporate domain knowledge into the data mining task so that “actionable knowledge” can be discovered. For example, in region discovery, the framework searches for interesting subspaces and then extracts regional knowledge from the obtained subspaces, which provides crucial knowledge for domain experts.
The fitness function is specifically designed to be externally plugged in to provide extensibility and flexibility. The fitness function component of the framework is independent of the clustering algorithm employed, and for each domain a domain-specific fitness function is designed to capture the domain's interestingness and incorporate domain knowledge. Because the fitness function is external and encapsulated from the rest of the framework, any change in the framework, such as a parameter change or a change in the clustering algorithm, will not affect the fitness function. Likewise, changes to the fitness function that come from domain requirements will not affect the contents of the clustering algorithm, and so on. This design enables the framework to be flexible and extensible to meet domain needs and requirements.
In order to illustrate how the notion of domain interestingness and domain-specific fitness functions are used in domain-driven data mining and in discovering actionable knowledge, we now provide several examples of such fitness functions in the remainder of this section.
3.1 PCA-based Fitness Function
Finding interesting regional correlation patterns that help summarize the characteristics of a region is important to domain and business people, since many patterns only exist at a regional level, but not at the global level. Moreover, using regional patterns, which are normally hidden globally, domain or business people can understand the structure of the data and make business or domain decisions by analyzing these correlation patterns. For example, a strong correlation between a fatal disease and a set of chemical concentrations in Texas water wells might not be detectable throughout Texas, but a strong correlation pattern might exist regionally, which is also a reflection of Simpson's paradox [9]. This type of regional knowledge is crucial for domain experts who seek to understand the causes of such diseases and predict future cases.
To identify a sub-region in South Texas with 35 water wells that demonstrates a unique and strong correlation between arsenic, another chemical in the water of those wells, and a high occurrence of the disease in this region might suggest to domain experts the possible existence of nearby toxic waste, and provide valuable actionable knowledge that will help them to understand the cause of dangerous amounts of arsenic in water wells, and then develop a solution to this problem and prevent future incidents. An example of discovered regions along with highly correlated attribute sets is given in Fig. 3. This is an application of our framework using the PCA-based fitness function on Texas Water Wells data [10]; the fact that the correlation sets for each region show significant differences emphasizes the importance of regional pattern discovery.
Fig. 3. An Example of Regional Correlation Patterns for Chemical Concentrations in Texas
In order to discover regions where sets of attributes are highly correlated, we need a fitness function that rewards high correlation and enables our framework to discover such regions. Principal Component Analysis (PCA) is a good candidate, since the directions identified by PCA are the eigenvectors of the correlation matrix, and each eigenvector has an associated eigenvalue that is a measure of the corresponding variance. The Principal Components (PCs) are ordered in descending order with respect to the variance associated with each component. The eigenvectors of the PCs can help to reveal correlation patterns among sets of attributes.
Ideally, it is desirable to have high eigenvalues for the first k PCs, since this means that a smaller number of PCs will be adequate to account for the threshold variance, which overall suggests that a strong correlation among the variables exists [11]. The PCA-based fitness function is defined next.
Let λ1, λ2,…, λk be the eigenvalues of the first k PCs, with k being a parameter. The PCA-based interestingness is estimated using formula (2):

i(c) = Σ_{j=1}^{k} λj²    (2)

The PCA-based fitness function then becomes:

q(X) = Σ_{c∈X} ( Σ_{j=1}^{k} λ_{j,c}² ) · |c|^β    (3)

where λ_{j,c} denotes the j-th eigenvalue computed for region c.
The fitness function rewards high eigenvalues for the first k PCs. By taking the square of each eigenvalue, we ensure that regions with a higher spread in their eigenvalues will obtain higher rewards, reflecting the higher importance assigned in PCA to higher-ranked principal components.
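Under the assumption that i(c) sums the squared eigenvalues of the first k PCs of a region's correlation matrix, the computation can be sketched as follows (a minimal illustration; the paper's exact normalization may differ, and the toy regions below are hypothetical):

```python
import numpy as np

def pca_interestingness(region_data, k):
    """Sketch of the PCA-based interestingness i(c): the sum of squared
    eigenvalues of the first k principal components of the region's
    correlation matrix (attributes in columns)."""
    corr = np.corrcoef(region_data, rowvar=False)
    eigvals = np.sort(np.linalg.eigvalsh(corr))[::-1]  # descending order
    return float(np.sum(eigvals[:k] ** 2))

# Two toy "regions": one with perfectly correlated attributes, one without.
rng = np.random.default_rng(0)
x = rng.normal(size=200)
correlated = np.column_stack([x, 3 * x])            # correlation = 1
uncorrelated = np.column_stack([x, rng.normal(size=200)])
```

A region whose attributes are strongly correlated concentrates variance in the first PCs and therefore scores higher than an uncorrelated region of the same size.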
Moreover, a generic pre-processing technique to select the best k value for the PCA-based fitness function is based on a variance threshold used to decide how many PCs to retrieve. This variance threshold is also domain-specific and is set based on the domain knowledge available, to ensure selecting an appropriate k value for each dataset from different domains, reflecting concerns and constraints implied by domain knowledge.
The PCA-based fitness function repeatedly applies PCA during the search for the optimal set of regions, maximizing the eigenvalues of the first k PCs in each region. Having an externally plugged-in PCA-based fitness function enables the clustering algorithm to probe for optimal partitionings, and encourages the merging of two regions that exhibit structural similarities in their correlation patterns. This approach is more advantageous than applying PCA just once or multiple times on the data using other tools, since the PCA-based fitness function is applied repeatedly to candidate regions to explore each possible region combination.
3.2 Co-location Fitness Function
Co-location mining is a data mining task that seeks interesting but implicit patterns in which two or more patterns co-locate in spatial proximity. In the following we introduce an interestingness function for co-location sets involving objects that are characterized by continuous attributes (see also [12] for background on the described approach). The pattern A↑ denotes that attribute A has high values, and the pattern A↓ indicates that attribute A has low values. For example, the pattern {A↑, B↓, D↑} describes that high values of A are co-located with low values of B and high values of D.
Let
O be a dataset,
c be a region,
o ∈ O be an object in the dataset O,
N = {A1,…,Aq} be the set of non-geo-referenced continuous attributes in the dataset O,
Q = {A1↑, A1↓,…, Aq↑, Aq↓} be the set of possible base co-location patterns,
B ⊆ Q be a set of co-location patterns, and
z-score(A,o) be the z-score of object o’s value of attribute A.

z(A↑,o) = max(0, z-score(A,o))    (4)
z(A↓,o) = max(0, −z-score(A,o))    (5)
The interestingness of an object o with respect to a co-location set B ⊆ Q is measured as the product of the z-values of the patterns in the set B. It is defined as follows:

i(B,o) = Π_{p∈B} z(p,o)    (6)

where z(p,o) is called the z-value of base pattern p ∈ Q for
object o.
In general, the interestingness of a region can be straightforwardly computed by using the average interestingness of the objects belonging to the region. However, using this approach, some very large products might dominate the interestingness computations. For some domain experts, just finding a few objects with very high products in close proximity to each other is important, even if the remaining objects in the region deviate from the observed pattern. In other cases, domain experts are more interested in patterns with highly regular products, so that all or almost all objects in a region share the pattern, and are less interested in a few very high products. To satisfy the needs of different domains, our approach additionally considers purity when computing region interestingness, where purity(B,c) denotes the percentage of objects o ∈ c for which i(B,o)>0. In summary, the interestingness of a region c with respect to a co-location set B, denoted by φ(B,c), is computed as follows:

φ(B,c) = ( (1/|c|) · Σ_{o∈c} i(B,o) ) · purity(B,c)^θ    (7)
The parameter θ ∈ [0,∞) controls the importance attached to purity in interestingness computations; θ=0 implies that purity is ignored, and using larger values increases the importance of purity.
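The computations in equations (4) through (7) can be sketched directly (a minimal illustration; the toy dataset below, in which attribute A is high exactly where attribute B is low, is hypothetical):

```python
import statistics

def z_value(pattern, obj, dataset):
    """z-value of a base pattern ('up' for A-high, 'down' for A-low)
    for one object, in the spirit of equations (4)-(5): contributions
    on the wrong side of the mean are clipped to zero."""
    attr, direction = pattern
    values = [o[attr] for o in dataset]
    z = (obj[attr] - statistics.mean(values)) / statistics.pstdev(values)
    return max(0.0, z) if direction == "up" else max(0.0, -z)

def interestingness(B, region, dataset, theta=1.0):
    """phi(B, c): average product of z-values over the region,
    weighted by purity(B, c) ** theta, as in equations (6)-(7)."""
    def i_obj(o):
        prod = 1.0
        for p in B:
            prod *= z_value(p, o, dataset)
        return prod
    products = [i_obj(o) for o in region]
    purity = sum(1 for v in products if v > 0) / len(products)
    return (sum(products) / len(products)) * purity ** theta

# Hypothetical toy data: attribute A is high where attribute B is low.
data = [{"A": a, "B": -a} for a in (-2, -1, 0, 1, 2)]
region = [o for o in data if o["A"] > 0]
score = interestingness([("A", "up"), ("B", "down")], region, data)
```

Here every object in the chosen region exhibits the pattern, so purity is 1 and θ has no effect; a region with mixed objects would be penalized for larger θ.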
Fig. 6 depicts regions r in Texas with their highest-valued co-location sets B; that is, the depicted co-location set B has the highest value for φ(B,r).
3.3 Variance Fitness Function
The High Variance fitness function is a fitness function used to discover regions where there is high contrast in the values of an attribute of interest. For example, in the study of earthquakes discussed in more detail in the case study in Section 5.2, where the attribute of interest is the depth of earthquakes, the domain expert may use the High Variance fitness function to find regions where shallow earthquakes are in close proximity to deep earthquakes. The interestingness of a region r, i(r), is defined as follows:

i(r) = Var(r,attr)/Var(O,attr) if Var(r,attr)/Var(O,attr) ≥ th, and 0 otherwise    (8)

where

Var(r,attr) = (1/|r|) · Σ_{o∈r} (attr(o) − μ(r,attr))²    (9)

with μ(r,attr) denoting the mean value of attr over the objects in r.
The interestingness function parameters β and th are determined in close collaboration with the domain experts. Attr is the attribute of interest, and in the formula attr(o) denotes the value of attr for object o. The interestingness function computes the ratio of the region's variance with respect to attr and the dataset's variance. Regions whose ratio is above a given threshold th receive rewards.
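The thresholded variance ratio of equations (8) and (9) is a one-liner in practice (a minimal sketch; the earthquake depths below are hypothetical):

```python
import statistics

def variance_interestingness(region_values, dataset_values, th):
    """Sketch of the High Variance interestingness i(r): the ratio of
    the region's variance (for the attribute of interest) to the
    dataset's variance; ratios below the threshold th earn no reward."""
    ratio = statistics.pvariance(region_values) / statistics.pvariance(dataset_values)
    return ratio if ratio >= th else 0.0

# Hypothetical earthquake depths: a region mixing shallow and deep
# events has high variance relative to the whole dataset.
depths = [5, 5, 6, 6, 300, 300, 310, 310]
mixed_region = [5, 300, 6, 310]
uniform_region = [5, 5, 6, 6]
```

A region mixing shallow and deep events scores near the dataset ratio of 1, while a homogeneous region falls below the threshold and receives no reward.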
Figure 1 in Section 1 shows the result of using the above variance interestingness function for an earthquake dataset with earthquake depth being the attribute of interest. The polygons in Figure 1 indicate regions with positive interestingness; usually, those regions will be further ranked by region reward to sort regions from most interesting to least interesting, providing search-engine-type capabilities to scientists who are interested in finding interesting places in spatial datasets.
4 CLUSTERING ALGORITHMS WITH PLUG-IN FITNESS FUNCTIONS
Another key component of the proposed framework is a family of clustering algorithms that allows domain experts to instruct clustering algorithms to seek clusters that satisfy their specific requirements. To achieve this flexible clustering capability, several clustering algorithms were designed and implemented that support externally given fitness functions which are maximized during the clustering process. Using different plug-in fitness functions in the algorithms results in obtaining different, alternative clusterings for the same data set. Existing clustering paradigms have been extended to support plug-in fitness functions, namely representative-based clustering, agglomerative clustering, divisive clustering, and grid-based clustering. Three such clustering algorithms, CLEVER [12], MOSAIC [13], and SCMRG [14], will be briefly introduced and formally described by extending the formal framework that was introduced in Section 2.
Different clustering paradigms are superior with respect to different aspects of clustering. For example, grid-based clustering algorithms are able to cluster large datasets quickly, whereas representative-based clustering algorithms discover clusters of better quality. Finally, agglomerative clustering algorithms are capable of identifying clusters of arbitrary shape, which is particularly important in spatial data mining. They can also be employed as a post-processing technique to enhance the quality of clusters that were obtained by running a representative-based clustering algorithm.
4.1 CLEVER: A Representative-based Clustering Algorithm
Representative-based clustering algorithms, sometimes called prototype-based clustering algorithms in the literature, construct clusters by seeking a set of representatives; clusters are then created by assigning objects in the dataset to their closest representative. In general, they compute the following function:

Ψ: O × q × d × {other parameters} → 2^Dom(S)
Ψ takes O, q, a distance function d over Dom(S), and possibly other parameters as input, and seeks an “optimal set”⁵ of representatives in Dom(S), such that the clustering X obtained by assigning the objects in O to their closest representative in Ψ(O,q,d,…) maximizes the fitness function q(X). Moreover, it should be noted that clustering is done in the spatial attribute space S, and not in F; the attributes in N are only used by the fitness function q when evaluating clusters.
⁵ In general, prototype-based clustering is NP-hard. Therefore, most representative-based clustering algorithms will only be able to find a suboptimal clustering X and not the global maximum of q.
CLEVER is an example of a representative-based clustering algorithm that uses randomized hill climbing and larger neighborhood sizes⁶ to battle premature convergence when greedily searching for the best set of representatives. Initially, the algorithm randomly selects k′ representatives from O. In the iterative process, CLEVER samples and evaluates p solutions in the neighborhood of the current solution; if the best one improves the fitness, it becomes the current solution. The neighboring solutions are created by applying one of the following operators on a representative of the current solution: Insert, Delete, and Replace. Each operator has a certain selection probability, and the representatives to be manipulated are chosen at random. Moreover, to battle premature convergence, CLEVER re-samples p′>p solutions before terminating. The pseudocode of CLEVER is given in Fig. 4.
Fig. 4. Pseudo-code of CLEVER.
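The search loop of Fig. 4 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the dataset is a list of points, q is any fitness function over a clustering, and the names k0, p, and p_resample stand in for the k', p, and p' parameters of the text.

```python
import random

def dist(a, b):
    """Euclidean distance (stand-in for the plug-in distance function d)."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def clever(objects, q, k0=3, p=10, p_resample=30, seed=0):
    """Sketch of CLEVER-style randomized hill climbing; k0, p, and
    p_resample play the roles of k', p, and p' in the text."""
    rng = random.Random(seed)

    def assign(reps):
        # Clustering induced by 1-nearest-representative assignment.
        clusters = [[] for _ in reps]
        for o in objects:
            j = min(range(len(reps)), key=lambda i: dist(o, reps[i]))
            clusters[j].append(o)
        return clusters

    def neighbor(reps):
        # Apply one of the operators Insert, Delete, Replace to a
        # randomly chosen representative.
        reps = list(reps)
        op = rng.choice(["insert", "delete", "replace"])
        if op == "insert" or len(reps) < 2:  # never empty the set
            reps.append(rng.choice(objects))
        elif op == "delete":
            reps.pop(rng.randrange(len(reps)))
        else:
            reps[rng.randrange(len(reps))] = rng.choice(objects)
        return reps

    current = rng.sample(objects, k0)
    best_q = q(assign(current))
    budget = p
    while True:
        sampled = [neighbor(current) for _ in range(budget)]
        top_q, top = max((q(assign(r)), r) for r in sampled)
        if top_q > best_q:
            current, best_q, budget = top, top_q, p
        elif budget < p_resample:
            budget = p_resample  # re-sample p' > p solutions once more
        else:
            return current, best_q
```

Note that q receives the induced clustering directly; in the framework, q would consult the non-spatial attributes in N, while dist operates on the spatial attributes S only.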
The cluster model ψ for the result obtained by running a representative-based clustering algorithm can be constructed as follows. Let Γ(O,q,d,…) = {rep_1,…,rep_k} ⊆ Dom(S); that is, the representative-based clustering algorithm returned R = {rep_1,…,rep_k}. Then the model ψ can be defined as follows:

∀p ∈ S: ψ(p) = m ⟺ d(p, rep_m) ≤ d(p, rep_j) for j = 1,…,k

that is, ψ assigns p to the cluster associated with the closest representative⁷.
Because representative-based clustering algorithms assign objects to clusters using 1-nearest-neighbor queries, the spatial extent of regions r_i ⊆ Dom(S) can be constructed by computing Voronoi diagrams; this implies that the shape of the regions obtained by representative-based clustering algorithms is limited to convex polygons in Dom(S). Neighboring relationships no() between objects in O and nc() between clusters obtained by a representative-based clustering algorithm can be constructed by computing the Delaunay triangulation for R. Moreover, representative-based clustering algorithms do not support the concept of outliers; therefore, representative-based models have to assign a cluster to every point p in S.

⁶ It modifies the current set of representatives by applying more than one operator to it; e.g., modifying the current set of representatives by replacing two representatives and inserting a new representative.

⁷ Our formulation ignores the problem of ties when finding the closest representative; in general, our representative-based clustering algorithms break ties randomly.
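The model ψ described above can be sketched directly from its definition; the Euclidean distance below is an assumed stand-in for the plug-in distance function d.

```python
def make_model(representatives, dist):
    """Intensional model psi for a representative-based result:
    psi(p) returns the index m of the representative closest to p.
    Ties are broken by lowest index here, whereas the text's
    algorithms break them randomly (footnote 7)."""
    def psi(p):
        return min(range(len(representatives)),
                   key=lambda m: dist(p, representatives[m]))
    return psi

def euclidean(a, b):
    # Assumed distance function over Dom(S).
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
```

Because assignment is a 1-nearest-neighbor query, the region claimed by index m is exactly the Voronoi cell of rep_m, matching the convex-polygon observation above.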
4.2 MOSAIC — An Agglomerative Clustering Algorithm
The agglomerative clustering problem can be defined as follows:

Given: O, F, S, N, a fitness function q, and an initial clustering X with contiguous(X).

Find: X' = {c'_1,…,c'_h} that maximizes q(X'), where all clusters in X' have been constructed using unions of neighboring clusters in X:

∀c_i ∈ X': c_i = c_i1 ∪ … ∪ c_ij with c_i1,…,c_ij ∈ X and nc(c_ik, c_ik+1) (for k = 1,…,j-1), and c_i ∩ c_j = ∅ (for i ≠ j).

Because the above definition assumes that only neighboring clusters are merged, contiguous(X') trivially holds.
In the following, we view results that are obtained by agglomerative methods as a meta-clustering X' over an initial clustering X of O; X' over X is defined as an exhaustive set of contiguous, disjoint subsets of X. More formally, the objectives of agglomerative clustering can be reformulated as follows:

Find: X' = {x_1,…,x_r} with x_i ⊆ X (i = 1,…,r) maximizing q(X'), subject to the following constraints:
(1) x_1 ∪ … ∪ x_r = X
(2) x_i ∩ x_j = ∅ (i ≠ j)
(3) contiguous(x_i) (for i = 1,…,r)
(4) ∀x ∈ X' ∃m ≥ 1 ∃x'_1,…,x'_m ∈ X: x = x'_1 ∪ … ∪ x'_m
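Constraint (3) can be checked mechanically: a meta-cluster is contiguous exactly when its members form a connected subgraph of the neighboring relation nc. A small sketch, with nc assumed to be given as a set of two-element frozensets of cluster ids:

```python
from collections import deque

def contiguous(meta_cluster, nc):
    """True iff the clusters in meta_cluster form a connected
    subgraph of the neighboring relation nc (constraint (3))."""
    members = set(meta_cluster)
    if len(members) <= 1:
        return True
    seen, frontier = set(), deque([next(iter(members))])
    while frontier:  # breadth-first search within the meta-cluster
        c = frontier.popleft()
        if c in seen:
            continue
        seen.add(c)
        for pair in nc:
            if c in pair and len(pair) == 2:
                (other,) = pair - {c}
                if other in members:
                    frontier.append(other)
    return seen == members
```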
We use the term meta-clustering because it is a clustering of clusters and not of objects, as is the case with traditional clustering. It should be noted that agglomerative clusters are exhaustive subsets of an initial clustering X; that is, we assume that outliers are not removed by the agglomerative clustering algorithm itself, but rather by the algorithm that constructs the input X for the agglomerative clustering algorithm. In general, an agglomerative clustering algorithm is composed of two algorithms:
1. a preprocessing algorithm that constructs the clustering X
2. the agglomerative clustering algorithm itself that derives X' from X.
The preprocessing algorithm is frequently degenerate; for example, its input could consist of single-object clusters, or X could be constructed based on a grid structure. However, for many applications it is beneficial to use a full-fledged clustering algorithm for the preprocessing step.
An agglomerative clustering algorithm, MOSAIC [13], has been introduced in previous work. MOSAIC takes the clustering X obtained by running a representative-based region discovery algorithm as its input and greedily merges neighboring regions as long as merging enhances q(X). For efficiency reasons, MOSAIC uses Gabriel graphs [15] (which are subsets of Delaunay graphs) to compute nc; nc is then used to identify merge candidates for MOSAIC, which are pairs of neighboring clusters whose merging enhances q(X). nc is updated incrementally as clusters are merged. Finally, when clusters are merged, q(X) is updated incrementally, taking advantage of the fact that our framework assumes that q is additive. Fig. 5 gives the pseudo-code for MOSAIC.
Moreover, models for the clusters obtained by an agglomerative region discovery algorithm can easily be constructed from the models of the input clusters in X that have been merged to obtain the region in question. Let us assume r has been obtained as r = r_1 ∪ … ∪ r_m; in this case the model for r can be defined as:

ψ_r(p) = ψ_r1(p) ∨ … ∨ ψ_rm(p)

In the case of MOSAIC, ψ_r(p) is implemented by characterizing MOSAIC clusters by sets of representatives⁸; new points are then assigned to the cluster whose set of representatives contains the representative that is closest to p. Basically, MOSAIC constructs regions as unions of Voronoi cells, and the above construction takes advantage of this property.
Fig. 5. Pseudo-code of MOSAIC.
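The merge loop of Fig. 5 can be sketched as follows. For brevity, the Gabriel-graph construction is assumed to have been done already, so nc arrives as a precomputed set of neighboring-cluster pairs, and the incremental update of q is modeled by re-evaluating only the two clusters a merge touches (valid because the framework assumes q is additive).

```python
def mosaic(clusters, nc, q):
    """Greedy agglomeration sketch: repeatedly merge the pair of
    neighboring clusters whose union improves the (additive)
    fitness q the most; stop when no merge helps.

    clusters: dict id -> set of objects; nc: set of frozenset
    pairs of neighboring cluster ids (from a Gabriel graph in
    MOSAIC, assumed precomputed here)."""
    clusters = {i: set(c) for i, c in clusters.items()}
    nc = set(nc)
    while True:
        best, best_gain = None, 0.0
        for pair in nc:
            i, j = tuple(pair)
            # Additivity of q: only the two merged clusters change.
            gain = (q([clusters[i] | clusters[j]])
                    - q([clusters[i], clusters[j]]))
            if gain > best_gain:
                best, best_gain = (i, j), gain
        if best is None:
            return clusters
        i, j = best
        clusters[i] |= clusters.pop(j)
        # Incremental nc update: j's neighbors become i's neighbors.
        nc = {frozenset(i if a == j else a for a in pair)
              for pair in nc} - {frozenset((i,))}
```

A fuller implementation would also cache the per-cluster rewards so that q itself never has to be recomputed from scratch.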
4.3 SCMRG — A Divisive Grid-based Clustering Algorithm
The divisive clustering problem can be defined as follows:

Given: O, F, S, N, a fitness function q, and an initial clustering X = {x_1,…,x_h} with contiguous(X).

Find: X' = {c'_1,…,c'_k} that maximizes q(X'), where X' has been obtained from X.

Procedure: Initially, X' is set to X. Then X' is modified to increase q(X') by recursively replacing an x ∈ X' by x'_1,…,x'_p as long as q(X') improves and the following conditions are satisfied:
(1) x'_j ⊆ x (j = 1,…,p)
(2) x'_j ∩ x'_i = ∅ (for j ≠ i)
(3) contiguous(x'_j) (j = 1,…,p)
(4) reward(x) < reward(x'_1) + … + reward(x'_p)

⁸ If r in X' has been constructed using r = r_1 ∪ … ∪ r_m from X, r would be characterized by the representatives of the regions r_1,…,r_m.
Region x is only replaced by regions at a lower level of resolution if the sum of the rewards of the regions at the lower level of resolution is higher than x's reward. It should be emphasized that the splitting procedure employs a variable number of decompositions; e.g., one region might not be split at all, another region might be split into just four regions, whereas a third region might be split into 17 subregions. Moreover, the splitting procedure is not assumed to be exhaustive; that is, x can be split into y1, y2, y3 with y1 ∪ y2 ∪ y3 ⊂ x. In other words, the above specification allows divisive region discovery algorithms to discard outliers when seeking interesting regions; basically, the objects belonging to the residual region x \ (y1 ∪ y2 ∪ y3) in the above example are considered to be outliers.
SCMRG (Supervised Clustering using Multi-Resolution Grids) [14] is a divisive, grid-based region discovery algorithm that was developed in our past work. SCMRG partitions the spatial space Dom(S) of the dataset into grid cells. Each grid cell at a higher level is partitioned further into a number of smaller cells at the lower level, and this process continues if the sum of the rewards of the lower-level cells is greater than the reward of the higher-level cell. The regions returned by SCMRG usually have different sizes, because they were obtained at different levels of resolution. Moreover, a cell is drilled down only if it is promising, i.e., if its fitness improves at a lower level of resolution. SCMRG uses a look-ahead splitting procedure that splits a cell into 4, 16, and 64 cells, respectively, and analyzes whether there is an improvement in fitness in any of these three splits; if this is not the case and the original cell receives a reward, this cell is included in the region discovery result. However, regions that do not receive any rewards, and whose successors at lower levels of resolution likewise receive none, will be treated as outliers and discarded from the final clustering X'.
SCMRG employs a queue to store cells that need further processing. SCMRG starts at a user-defined level of resolution and puts the cells associated with this level on the queue. Next, SCMRG generates a clustering from the cells in the queue by traversing the hierarchical structure, examining those cells, and considering the following three cases when processing a cell:

Case 1. If the cell c receives a reward, and its reward is greater than the sum of the rewards of its children and the sum of the rewards of its grandchildren, respectively, the cell is returned as a cluster by the algorithm.

Case 2. If the cell c does not receive a reward, and neither do its children and grandchildren, then neither the cell nor any of its descendants will be processed further or labeled as a cluster.

Case 3. Otherwise, if the cell c does not receive a reward but its children receive rewards, all the children of cell c are put into the queue for further processing.

Finally, all cells that have been labeled as clusters (case 1) are returned as the final result of SCMRG.
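The three cases can be sketched as a queue-driven traversal. Here cell, reward, and children are illustrative placeholders: in SCMRG, children(c) would split a grid cell into sub-cells at the next level of resolution, and reward would come from the fitness function.

```python
from collections import deque

def scmrg_drilldown(root_cells, reward, children):
    """Queue-based sketch of SCMRG's cell processing (cases 1-3).
    Cells whose subtrees never earn a reward are silently dropped,
    i.e., treated as outliers."""
    result = []
    queue = deque(root_cells)
    while queue:
        c = queue.popleft()
        kids = children(c)
        grandkids = [g for k in kids for g in children(k)]
        r = reward(c)
        r_kids = sum(reward(k) for k in kids)
        r_grand = sum(reward(g) for g in grandkids)
        if r > 0 and r > r_kids and r > r_grand:
            result.append(c)        # Case 1: c becomes a cluster
        elif r == 0 and r_kids == 0 and r_grand == 0:
            continue                # Case 2: prune c's whole subtree
        else:
            queue.extend(kids)      # Case 3: drill down into children
    return result
```

The else branch also drills down when c is rewarded but its descendants promise more, matching the look-ahead behavior described above.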
5 CASE STUDIES

5.1 Co-location Mining of Risk Patterns of Arsenic and Associated Chemicals in the Texas Water Supply
In this case study, we apply our domain-driven clustering framework to discover interesting regions where two or more attributes are co-located, along with the associated patterns. The employed procedure is summarized in Fig. 6 and explained step by step below:
Fig. 6. A procedure for applying the domain-driven clustering framework for actionable region discovery with the involvement of domain experts.
Step 1. Define the problem: Co-location mining is a data mining task that seeks interesting but implicit patterns in which two or more patterns co-locate in spatial proximity. For this case study, hydrologists helped us select subsets of chemicals and some external factors suspected of generating high levels of Arsenic concentrations. An interesting pattern B is defined as follows:

Given:
N = {A_1,…,A_q}, the set of non-spatial continuous attributes that measure chemical concentrations in Texas water wells.
Q = {A_1↑, A_1↓,…, A_q↑, A_q↓}, the set of base co-location patterns; in this case study, the domain expert is interested in finding associations of high/low concentrations (denoted by '↑' and '↓', respectively) with high/low concentrations of other chemicals.
B ⊆ Q, a set of co-location patterns, where P(B) is a predicate over B that restricts the co-location sets considered, e.g., P(B) = As↑ ∈ B ("only look for co-location sets involving high Arsenic concentrations").
Step 2. Create/Select a fitness function: First, the hydrologists formulate a measure of their interestingness in the form of a reward-based fitness function. Fitness functions express extrinsic characteristics, which vary across problems and domains. In our framework, the fitness function is a generic component, so hydrologists can define several fitness functions, some of which might be small variations of one another reflecting their diverse interests. The simplified version of the fitness function applied in co-location mining, called z-value, was given in Section 3.2.
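The z-value definition itself appears in Section 3.2; what matters for the framework is the general shape such reward-based fitness functions take. A minimal additive form is sketched below; the interestingness measure and the exponent governing the size trade-off are placeholders, not the paper's exact formula.

```python
def make_fitness(interestingness, beta=1.0):
    """Illustrative reward-based fitness: each cluster c contributes
    reward(c) = interestingness(c) * len(c) ** beta, and rewards sum
    over the clustering. Additivity is what allows MOSAIC and SCMRG
    to update q incrementally. 'interestingness' and 'beta' are
    assumed placeholders for the domain-specific measure and its
    tuning parameter."""
    def q(clustering):
        return sum(interestingness(c) * len(c) ** beta
                   for c in clustering if c)
    return q
```

A hydrologist could plug in, for example, a z-score-style measure of Arsenic concentration as the interestingness term; raising beta favors fewer, larger rewarded regions.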
Step 3. Select a clustering algorithm: The framework provides many algorithms that exemplify different clustering paradigms, e.g., representative-based clustering, divisive grid-based clustering, and agglomerative clustering. For this case study, CLEVER (CLustEring using representatiVEs and Randomized hill climbing) is employed to identify regions and the associated co-location patterns.
Step 4. Select parameters of the fitness function and the clustering algorithm: Tuning or setting the parameters of the fitness function and the region discovery framework helps obtain better results, or extends the search to focus on alternative patterns or patterns at different levels of granularity. For the particular fitness function employed, one parameter controls the importance of the purity of a pattern in interestingness computations; the larger this parameter is, the more weight the purity of a pattern receives. Besides such parameters, hydrologists can also specify seed patterns, e.g., As↑ as a mandatory item in the co-location patterns considered. Later on, they can simply change the seed patterns to force the co-location mining algorithm to seek alternative patterns, e.g., patterns that are co-located with both {As, F}. This bridges a gap between hydrologists' expectations and the results of clustering algorithms, permitting hydrologists to tune the comprehensive parameters in order to derive actionable patterns.
Step 5. Run the clustering algorithm to discover interesting regions and associated patterns: Results (a set of clusters) obtained from the clustering algorithm are ranked either by reward or by interestingness. An example of experimental results is given in Fig. 7. For instance, the first-ranked pattern indicates that a high level of Arsenic co-locates with high levels of Boron, Chloride, and Total Dissolved Solids in southern Texas.
Fig. 7. Example of the top 5 regions ranked by interestingness.
Step 6. Analyze the results: By the nature of the fitness functions, the clustering algorithm weeds out many regions having zero interestingness. The experimental results show the ability of the framework to identify interesting regions and associated patterns, exemplified in Fig. 7, which are comparable to the regions of high Arsenic concentration obtained from TCEQ, as depicted in Fig. 8. Steps 4-6 are usually repeated several times in order to enhance the results or to explore alternative regions and patterns.
Fig. 8. Arsenic pollution map.
In contrast to traditional clustering, our framework offers search-engine-type capabilities to domain experts, to help them identify the patterns they are interested in. Domain experts assist and incorporate their knowledge in several mining phases, especially before the clustering phase. By expressing their interestingness in the form of fitness functions, the clustering algorithms are able to seek clusters with extrinsic characteristics. Therefore, the clusters and associated patterns obtained represent actionable knowledge.
5.2 Change Analysis in Earthquake Data

A change analysis framework is developed using our framework to analyze how interesting regions differ between two time frames; for instance, analyzing changes in places where deep earthquakes are in close proximity to shallow earthquakes.
Fig. 9 summarizes the change analysis approach, and the steps are explained next.
Fig. 9. A procedure for applying the domain-driven clustering framework in change analysis.
First, geologists sample two datasets corresponding to two different time frames. Second, the domain-driven clustering framework is used to separately identify the interesting regions of each time frame; here, a fitness function measures high variance of earthquake depth. To generate intensional clusterings from the results of CLEVER, we construct Voronoi cells, in which polygons represent cluster models. Then, change analysis techniques are applied in Steps 3-6 in order to detect and identify different change patterns in those regions. Third, users select relevant change predicates to compare changes between the two intensional clusterings; the predicates also have thresholds that can be controlled externally. Then the changes between the two clusterings are instantiated with respect to the predicate thresholds. Finally, emergent patterns are summarized and further analyzed.
Fig. 10. An overlay of interesting regions discovered in O_old and O_new.
Fig. 10 illustrates an overlay of the interesting regions discovered in O_old and O_new; the red regions belong to the early time frame (labeled Region_old), whereas the blue regions belong to the late time frame (labeled Region_new). Examples of relationships discovered between the two clusterings are also given: regions 5 and 10 in Fig. 11 are considered new, whereas regions 0, 2, 3, and 7 in Fig. 12 are considered disappearances.
Fig. 11. Novelty areas of regions in the O_new data.
Fig. 12. Disappearance areas of regions in the O_old data.
5.3 Other Applications of the Framework for Actionable Regional Knowledge Discovery
Besides the uses of the domain-driven clustering framework to discover actionable knowledge described in the two aforementioned case studies, the framework can also be applied to aid knowledge discovery in other real applications. The first application, similar to the first case study, is co-location mining in planetary science [16]; we are interested in mining feature-based hotspots where extreme densities of deep ice and shallow ice co-locate on Mars. The fitness function employed is the absolute value of the product of the z-scores of the continuous non-spatial features in the spatial dataset. Outcomes of the framework are regions having either very high co-location or very high anti-co-location.
The second application is regional correlation pattern discovery using PCA in hydrology [10]. Finding regional patterns in spatial datasets is an important data mining task. A PCA-based fitness function is used to discover regional correlation patterns. This approach is more effective than solely applying PCA once or multiple times to the data, since PCA is applied repeatedly to candidate regions to explore each possible region combination. This case study uses a PCA-based fitness function that maximizes the eigenvalues of the first k PCs; it rewards regions with high correlation, since more highly correlated attribute sets result in higher eigenvalues; in other words, higher variance is captured.
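The idea of a PCA-based fitness can be illustrated for two non-spatial attributes, where the top eigenvalue of the covariance matrix has a closed form; this is a simplified stand-in for the "eigenvalues of the first k PCs" objective of [10].

```python
import math

def top_eigenvalue_2d(points):
    """Largest eigenvalue of the 2x2 covariance matrix of two
    attributes (closed form; stands in for 'eigenvalues of the
    first k PCs' with k=1 and two non-spatial attributes)."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    return (tr + math.sqrt(max(tr * tr - 4 * det, 0.0))) / 2

def pca_fitness(clustering):
    """Illustrative PCA-based fitness: sum of each region's top
    eigenvalue; highly correlated regions capture more variance on
    the first PC and hence score higher."""
    return sum(top_eigenvalue_2d(c) for c in clustering if len(c) > 1)
```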
The third application is multi-objective clustering, whose goal is to seek a set of clusters individually satisfying multiple objectives. For example, hydrologists are also interested in identifying regions that satisfy multiple patterns of chemical contamination in the water supply. We apply multi-run clustering as a tool to gather multi-objective clusters simultaneously. Multi-run clustering reduces extensive human effort by searching for and enhancing novel, high-quality clusters in an automated fashion. Since multi-run clustering is developed on top of the domain-driven clustering framework, it conforms to the framework and also inherits the framework's capability to plug in different clustering algorithms and fitness functions. Therefore, results obtained from multi-run clustering are also considered actionable.
6 CONCLUSION

In this paper a generic, domain-driven clustering framework is proposed that incorporates domain intelligence into domain-specific, plug-in fitness functions that are maximized by clustering algorithms. The framework provides a family of clustering algorithms and a set of fitness functions, along with the capability of defining new fitness functions. Moreover, an ontology and a theoretical foundation for clustering with fitness functions in general, and for region discovery in particular, are introduced. Fitness functions are the core components of the framework, as they capture a domain expert's notion of interestingness. The fitness function is independent of the clustering algorithm employed.

The framework was evaluated for different region discovery tasks in several case studies. The framework treats the region discovery problem as a clustering problem in which a given, plug-in fitness function has to be maximized. By integrating domain knowledge and domain-specific evaluation measures into parameterized, plug-in fitness functions, together with controlling thresholds, the framework is able to obtain actionable regional knowledge and the associated patterns satisfying domain-specific needs.

The case studies demonstrate the capability of the framework to integrate domain intelligence and to effectively guide clustering tasks by incorporating domain requirements into the clustering algorithms in the form of a fitness function. To the best of our knowledge, this capability has been little explored by past research in the field of clustering, and we are optimistic that our proposed framework will foster novel applications of domain-driven clustering.
REFERENCES

[1] L. Cao and C. Zhang, "The Evolution of KDD: Towards Domain-Driven Data Mining," Journal of Pattern Recognition and Artificial Intelligence, vol. 21, no. 4, pp. 677-692, World Scientific Publishing Company, 2007.
[2] I. Davidson and S. S. Ravi, "Clustering under Constraints: Feasibility Issues and the k-means Algorithm," Proc. Fifth SIAM Data Mining Conf., 2005.
[3] Q. Yang, K. Wu, and Y. Jiang, "Learning Action Models from Plan Examples using Weighted MAX-SAT," Artif. Intell., vol. 171, issues 2-3, pp. 107-143, 2007.
[4] N. Zhong, "Actionable Knowledge Discovery: A Brain Informatics Perspective," Special Trends and Controversies Department on Domain-Driven, Actionable Knowledge Discovery, IEEE Intelligent Systems, vol. 22, issue 4, pp. 85-86, 2007.
[5] W. Graco, T. Semenova, and E. Dubossarsky, "Toward Knowledge-Driven Data Mining," Proc. Domain Driven Data Mining Workshop, 2007.
[6] G. Karypis, E. H. S. Han, and V. Kumar, "Chameleon: Hierarchical Clustering using Dynamic Modeling," IEEE Computer, vol. 32, issue 8, pp. 68-75, 1999.
[7] C. F. Eick, N. Zeidat, and Z. Zhao, "Supervised Clustering - Algorithms and Benefits," Proc. Int. Conf. on Tools with AI, 2004.
[8] A. Bagherjeiran, C. F. Eick, C.-S. Chen, and R. Vilalta, "Adaptive Clustering: Obtaining Better Clusters Using Feedback and Past Experience," Proc. Fifth IEEE Int. Conf. on Data Mining, 2005.
[9] E. H. Simpson, "The Interpretation of Interaction in Contingency Tables," Journal of the Royal Statistical Society, ser. B, vol. 13, pp. 238-241, 1951.
[10] O. U. Celepcikay and C. F. Eick, "A Regional Pattern Discovery Framework using Principal Component Analysis," Proc. Int. Conf. on Multivariate Statistical Modeling & High Dimensional Data Mining, 2008.
[11] I. T. Jolliffe, Principal Component Analysis, NY: Springer, 1986.
[12] C. F. Eick, R. Parmar, W. Ding, T. Stepinski, and J.-P. Nicot, "Finding Regional Co-location Patterns for Sets of Continuous Variables in Spatial Datasets," Proc. Sixteenth ACM SIGSPATIAL Int. Conf. on Advances in GIS, 2008.
[13] J. Choo, R. Jiamthapthaksin, C.-S. Chen, O. Celepcikay, C. Giusti, and C. F. Eick, "MOSAIC: A Proximity Graph Approach to Agglomerative Clustering," Proc. Ninth Int. Conf. on Data Warehousing and Knowledge Discovery, 2007.
[14] C. F. Eick, B. Vaezian, D. Jiang, and J. Wang, "Discovery of Interesting Regions in Spatial Datasets Using Supervised Clustering," Proc. Tenth European Conf. on Principles and Practice of Knowledge Discovery in Databases, 2006.
[15] K. Gabriel and R. Sokal, "A New Statistical Approach to Geographic Variation Analysis," Systematic Zoology, vol. 18, pp. 259-278, 1969.
[16] W. Ding, R. Jiamthapthaksin, R. Parmar, D. Jiang, T. Stepinski, and C. F. Eick, "Towards Region Discovery in Spatial Datasets," Proc. Pacific-Asia Conf. on Knowledge Discovery and Data Mining, 2008.
Christoph F. Eick received his PhD from the University of Karlsruhe in Germany. He is currently an Associate Professor in the Department of Computer Science at the University of Houston. He is the Co-Director of the UH Data Mining and Machine Learning Group. His research interests include data mining, machine learning, evolutionary computing, and artificial intelligence. He has published more than 95 papers in these and related areas. He serves on the program committee of the IEEE International Conference on Data Mining (ICDM) and other major data mining and machine learning conferences.
Oner Ulvi Celepcikay is a senior PhD candidate in the Computer Science Department at the University of Houston. He received his bachelor's degree in Electrical Engineering in 1997 from Istanbul University, Istanbul, Turkey, and his M.S. degree in Computer Science from the University of Houston in 2003. He worked at the University of Houston Educational Technology Outreach (ETO) Department from 2000 to 2007. He has published a number of papers in his research fields, including cluster analysis, multivariate statistical analysis, and spatial data mining. He served as a session chair at the International Conference on Multivariate Statistical Modeling & High Dimensional Data Mining in 2008 and has been serving as a non-PC reviewer for many conferences.
Rachsuda Jiamthapthaksin received her bachelor's degree in Computer Science in 1997 and her master's degree in Computer Science, with Honors and the Dean's Prize for Outstanding Performance, in 1999 from Assumption University, Bangkok, Thailand. She was a faculty member in the Computer Science Department at Assumption University from 1997 to 2004. She is now a PhD candidate in Computer Science at the University of Houston, Texas. She has published papers in the areas of her research interests, including intelligent agents, fuzzy systems, cluster analysis, data mining, and knowledge discovery. She has served as a non-PC reviewer for many conferences and as a volunteer staff member at the 2005 IEEE ICDM Conference, November 2005, Houston, Texas.
Vadeerat Rinsurongkawong is a PhD candidate in Computer Science at the University of Houston. She received her M.S. degree in Information Technology from Assumption University, Thailand, and her B.Eng. degree in Electrical Engineering from Chulalongkorn University, Thailand. She has work experience in electrical engineering, information technology, and computer science. Her areas of interest are data mining and databases.