A Unifying Domain-Driven Framework for Clustering with Plug-In Fitness Functions and Region Discovery


IEEE TRANSACTIONS ON TKDE, MANUSCRIPT ID


Christoph F. Eick, Oner U. Celepcikay, Rachsuda Jiamthapthaksin, and Vadeerat Rinsurongkawong

Abstract: The main challenge in developing methodologies for domain-driven data mining is incorporating domain knowledge and domain-specific evaluation measures into data mining algorithms and tools, so that "actionable knowledge" can be discovered. In this paper a generic, domain-driven clustering framework is proposed that incorporates domain intelligence into domain-specific plug-in fitness functions that are maximized by the clustering algorithm. The framework provides a family of clustering algorithms and a set of fitness functions, along with the capability of defining new fitness functions. Fitness functions are the core components of the framework, as they capture a domain expert's notion of interestingness. The fitness function is independent of the clustering algorithm employed. The framework also incorporates domain knowledge through preprocessing and post-processing steps and parameter selections. This paper introduces the framework in detail and illustrates it through demonstrations and case studies that center on spatial clustering and region discovery. Moreover, the paper introduces an ontology and a theoretical foundation for clustering with fitness functions in general, and region discovery in particular. Finally, intensional clustering algorithms that operate on cluster models are introduced.

Index Terms: Clustering, Data Mining, Spatial Databases and GIS, Domain-driven Data Mining

1 INTRODUCTION

To extract knowledge from the immense amount of data that has been generated by advances in data acquisition technologies has been a major focus of data mining research over the last 20 years. However, it has been observed that knowledge obtained from traditional data-driven data mining algorithms in domain-specific applications is not really actionable [1], because the extracted knowledge does not capture what domain experts are interested in. This observation can be explained by two limitations of traditional data mining: 1) traditional data mining algorithms insufficiently incorporate domain intelligence to aid the mining process, and 2) the algorithms use technical significance as their sole evaluation measure.

As far as the first limitation is concerned, domain intelligence includes the involvement of domain knowledge, domain-specific constraints, and experts. Consider a situation in which a clustering algorithm is used to identify clusters in a specific domain. Different clustering algorithms have their own assumptions on clustering criteria, e.g. tightness, connectivity, separation, and so on. Because clustering is NP-hard, clustering algorithms focus their search efforts on clusters that maximize those criteria, frequently generating "optimal" but uninteresting clusters. Clustering with constraints intends to alleviate this problem by incorporating must-link and cannot-link constraints to better guide the search for good clusters [2]. The second limitation occurs because, in traditional data mining, the actionability of knowledge is determined solely by technical significance based on domain-independent criteria [1]; this type of measure usually differs from domain-specific expectations and measures of interestingness. To address this problem, both technical and domain-specific significance should be considered when assessing cluster quality. Consequently, the main challenge in developing methodologies and techniques for domain-driven data mining is to incorporate domain knowledge into data mining algorithms and tools so that actionable knowledge can be discovered.

In this paper, we propose a unifying domain-driven clustering framework that provides families of clustering algorithms with plug-in fitness functions capable of discovering actionable knowledge. The fitness function is the core component of the framework, as it captures the domain expert's notion of interestingness. The fitness function is specifically designed to be externally plugged in to provide extensibility and flexibility; the fitness function component of the framework is independent of the clustering algorithms employed.

In general, families of task- and/or domain-specific fitness functions are employed to capture domain interestingness and to incorporate domain knowledge. For example, let us consider a data mining task in which geologists are interested in discovering hotspots in geographical space where deep earthquakes are in close proximity to shallow earthquakes. That is, they are interested in identifying contiguous regions in an earthquake data set for which the variance of the variable earthquake_depth is high. When using our framework, the geologist's notion of interestingness is captured in the form of a High Variance fitness function, formally defined in Section 3. The domain expert additionally selects parameters to instruct the clustering algorithm in what patterns they are really interested in: an earthquake-depth variance threshold and a parameter that controls cluster granularity and the size of the spatial clusters discovered. Next, a clustering algorithm is run with the parameterized High Variance fitness function, and high-variance earthquake-depth hotspots are obtained, as displayed in Fig. 1.

————————————————
The authors are with the Department of Computer Science, University of Houston, Houston, TX 77204. Emails: (ceick, onerulvi, rachsuda, vadeerat)@cs.uh.edu.
Manuscript received 03/31/2009.






Fig. 1. Examples of interesting regions discovered by a domain-driven clustering algorithm using a High Variance fitness function

Our framework incorporates domain knowledge not only through domain-specific fitness functions, but also through preprocessing and post-processing steps, fitness function parameter selections including seed patterns, threshold parameter values that are suitable for a specific domain, and desired cluster granularities. The family of clustering algorithms supported by the framework includes divisive, grid-based, prototype-based, and agglomerative clustering algorithms, all of which support plug-in fitness functions.

The first high-level domain-driven data mining framework was introduced by Cao and Zhang [1]. In this framework, domain intelligence is incorporated into the KDD process towards actionable knowledge discovery, and the framework has been illustrated through mining activity patterns in social security data. They also proposed criteria to measure the actionability of the knowledge. Yang [3] introduced a framework with two techniques to produce actionable output from traditional KDD output models. The first technique uses an algorithm for extracting actions from decision trees such that each test instance falls in a desirable state. The second technique uses an algorithm that can learn relational action models from frequent item sets. This technique is applied to automatic planning systems, and Yang's Action-Relation Modeling System (ARMS) automatically acquires action models from recorded user plans.

One subcomponent of the domain knowledge that must be incorporated into any domain-driven data mining framework is human intelligence, and Multiaspect Data Analysis (MDA) is an important Brain Informatics methodology. Brain Informatics considers the brain as an information-processing system to understand its mechanism for analyzing and managing data. But since brain researchers cannot use MDA results directly, Zhong [4] proposes a methodology that employs an explanation-based reasoning process that combines multiple source data into more general results to form actionable knowledge. Zhong's framework basically takes traditional KDD output as input to an explanation-based reasoning process that generates actionable output. The concept of moving from method-driven or data-driven data mining to domain-driven data mining has recently been proposed and is featured in [5]. The authors describe four aspects of moving data mining from a method-driven approach to a process that focuses on domain knowledge. In general, the use of plug-in fitness functions is not very common in traditional clustering; the only exception is the CHAMELEON [6] clustering algorithm. However, fitness functions play a more important role in semi-supervised and supervised clustering [7] and in adaptive clustering [8].


The main contributions of this paper are that it:

1. Introduces a unifying domain-driven clustering framework for actionable knowledge discovery.

2. Proposes a novel domain-specific fitness function model that is plugged into clustering algorithms externally to capture domain interestingness.

3. Presents a set of fitness functions capable of serving clustering tasks in various domains.

4. Introduces a family of clustering algorithms, most of which have been developed in our previous work, as part of the framework, and introduces novel intensional clustering algorithms that directly manipulate cluster models.

5. Illustrates the deployment of the proposed framework and its benefits in challenging real-world case studies.


The remainder of this paper is organized as follows: In Section 2, we formally present our domain-driven clustering framework. Section 3 provides a detailed discussion of domain-specific plug-in fitness functions, including three examples. Section 4 introduces the family of clustering algorithms provided in our framework, and Section 5 illustrates the framework through demonstrations and case studies. Section 6 concludes the paper.

2 SPATIAL CLUSTERING WITH PLUG-IN FITNESS FUNCTIONS

2.1 Preview

As mentioned in the introduction, the goal of this paper is to introduce a highly generic clustering framework that supports plug-in fitness functions to capture domain interestingness. As we will discuss later, the framework is very general and can be used for traditional clustering. However, because almost all of our applications involve spatial data mining, the remainder of this paper will mostly focus on spatial clustering, and on region discovery in particular. The goal of spatial clustering is to identify interesting groups of objects in the subspace of the spatial attributes. Region discovery is a special type of spatial clustering that focuses on finding interesting places in spatial datasets. Moreover, in this section and in Section 4 a theoretical foundation and ontology for clustering with plug-in fitness functions is introduced. Finally, novel intensional clustering algorithms are introduced.

2.2 An Architecture for Region Discovery

As depicted in Fig. 2, the proposed region discovery framework consists of three key components. The first two components are families of clustering algorithms and fitness functions that play a major role in discovering interesting regions and their associated patterns. As we will discuss in more detail soon, the framework uses clustering algorithms that support plug-in fitness functions to find interesting regions in spatial datasets. Decoupling cluster evaluation from the search for good clusters creates flexibility in using any clustering algorithm with any fitness function. The role of the third component is to manage and integrate datasets residing in several repositories; it will not be discussed further in this paper.

Fig. 2. Region Discovery Framework

2.3 Goals and Objectives of Region Discovery

As mentioned earlier, the goal of region discovery is to find interesting places in spatial datasets. Our work assumes that the region discovery algorithms we develop operate on datasets containing objects o1,…,on: O = {o1,…,on} ⊆ F, where F is a relational database schema and the objects belonging to O are tuples that are characterized by the attributes S ∪ N, where:

S = {s1,…,sq} is a set of spatial attributes.
N = {n1,…,np} is a set of non-spatial attributes.

Dom(S) and Dom(N) describe the possible values the attributes in S and N can take; that is, each object o ∈ O is characterized by a single tuple that takes values from Dom(S) × Dom(N)¹.

In general, clustering algorithms can be subdivided into intensional clustering and extensional clustering algorithms: extensional clustering algorithms just create clusters for the data set O, partitioning O into subsets, but do nothing else. Intensional clustering algorithms, on the other hand, create a clustering model based on O and other inputs. Most popular clustering algorithms have been introduced as extensional clustering algorithms, but it is not too difficult to generalize most extensional clustering algorithms so that they become intensional clustering algorithms, as we present in Section 5.

¹ If S is empty we call the problem a traditional clustering problem. One key characteristic of spatial clustering is that spatial and non-spatial attributes play different roles in the clustering process, which is not the case in traditional clustering.

Extensional clustering algorithms create a clustering X of O that is a set of disjoint subsets of O:

X = {c1,…,ck} with ci ⊆ O (i=1,…,k) and ci ∩ cj = ∅ (i ≠ j)

Intensional clustering algorithms create a set of disjoint regions Y in F:

Y = {r1,…,rk} with ri ⊆ F (i=1,…,k) and ri ∩ rj = ∅ (i ≠ j)

In the case of spatial clustering and region discovery, cluster models have a peculiar structure in that they seek regions in the subspace Dom(S) and not in F itself: a region discovery model² is a function μ: Dom(S) → {1,…,k} ∪ {⊥} that assigns a region μ(p) to a point p in Dom(S), assuming that there are k regions in the spatial dataset; the number of regions k is chosen by the region discovery algorithm that creates the model. Models support the notion of outliers; that is, a point p′ can be an outlier that does not belong to any region: in this case μ(p′) = ⊥.

Intensional region discovery algorithms obtain a clustering Y in Dom(S) that is defined as a set of disjoint regions in Dom(S)³:

Y = {r1,…,rk} with ri ⊆ F[S] (i=1,…,k) and ri ∩ rj = ∅ (i ≠ j)

Moreover, the regions r belonging to Y are described as functions over tuples in Dom(S), χr: Dom(S) → {t,f}, indicating whether a point p ∈ Dom(S) belongs to r: χr(p) = t. χr is called the intension of r. χr can easily be constructed from the model μ of a clustering Y. Moreover, the extension εr of a region r is defined as follows:

εr = {o ∈ O | χr(o[S]) = t}

In the above definition, o[S] denotes the projection of o onto its spatial attributes.
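The relationship between a model μ, an intension χr, and an extension εr can be sketched in a few lines of Python; all function and variable names here are illustrative, not part of the framework's implementation:

```python
def make_intension(mu, region_id):
    """Build the intension chi_r of region r from a cluster model mu:
    chi_r(p) is True iff mu assigns point p to region_id."""
    return lambda p: mu(p) == region_id

def extension(chi_r, objects, spatial):
    """Extension of a region: all objects whose spatial projection o[S]
    (computed by `spatial`) satisfies the region's intension."""
    return [o for o in objects if chi_r(spatial(o))]

# Toy model: points with negative x belong to region 1, others to region 2.
mu = lambda p: 1 if p[0] < 0 else 2
chi1 = make_intension(mu, 1)
dataset = [(-1.0, 0.0, 'a'), (2.0, 3.0, 'b')]   # (x, y, non-spatial attr)
region1 = extension(chi1, dataset, lambda o: o[:2])
```

Note that the extension is computed purely from the intension and the spatial projection, mirroring the definition of εr above.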

Our approach requires discovered regions to be contiguous. To cope with this constraint in extensional clustering, we assume that we have a neighbor relationship no() between the objects in O and a cluster neighbor relationship nc() between regions in X defined with respect to O: if no(o,o′) holds, objects o and o′ are neighboring; if nc(r,r′) holds, regions r and r′ are neighboring.

no ⊆ O × O
nc ⊆ 2^O × 2^O

Moreover, neighboring relationships are solely determined by the attributes in S; that is, the temporal and spatial attributes in S are used to determine which objects and clusters are neighboring. A region r is contiguous if for each pair of points u and v in r there is a path between u and v that solely traverses r and no other regions. More

² ⊥ denotes "undefined".
³ F[S] denotes the projection of F on the attributes in S.


formally, contiguity⁴ is defined as a predicate over subsets c of O:

contiguous(c) ⇔ ∀w ∈ c ∀v ∈ c ∃m ≥ 2 ∃x1,…,xm ∈ c: w = x1 ∧ v = xm ∧ no(xi, xi+1) (i=1,…,m−1)

contiguous(X) ⇔ ∀c ∈ X: contiguous(c)
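In an implementation, the contiguity predicate amounts to a connectivity check on the graph that no() induces over the objects of a cluster. A minimal sketch, assuming no() is supplied as an adjacency map (function and parameter names are our own):

```python
from collections import deque

def contiguous(cluster, no):
    """Return True iff every pair of objects in `cluster` is connected by
    a path whose intermediate objects all lie inside `cluster`.
    `no` maps each object to the set of its neighbors."""
    objs = set(cluster)
    if len(objs) <= 1:
        return True
    start = next(iter(objs))
    seen = {start}
    queue = deque([start])
    while queue:            # breadth-first search restricted to `cluster`
        u = queue.popleft()
        for v in no.get(u, ()):
            if v in objs and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == objs     # contiguous iff the search reached every object
```

For example, with the chain neighborhood {1: {2}, 2: {1, 3}, 3: {2}}, the cluster {1, 2, 3} is contiguous, while {1, 3} is not, since every path between 1 and 3 must traverse the excluded object 2.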

Our approach employs arbitrary plug-in, reward-based fitness functions to evaluate the quality of a given set of regions. The goal of region discovery is to find a set of regions X that maximizes an externally given fitness function q(X); moreover, q is assumed to have the following structure:

q(X) = Σc∈X i(c) · |c|^β   (1)

where i(c) is the interestingness of a region c, a quantity designed by a domain expert to reflect the degree to which regions are "newsworthy". The number of objects in O belonging to a region c is denoted by |c|, and the quantity i(c)·|c|^β can be considered a "reward" given to a region c; we seek X such that the sum of rewards over all of its constituent regions is maximized. The amount of premium put on the size of a region is controlled by the value of the parameter β (β > 1). A region's reward is proportional to its interestingness, but larger regions receive a higher reward than smaller regions having the same value of interestingness, to reflect a preference for larger regions. Furthermore, it is assumed that the fitness function q is additive; the reward associated with X is the sum of the rewards of its constituent regions.
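The additive, reward-based structure of q in formula (1) can be sketched in a few lines; this is a toy illustration in which the interestingness function i and the size premium β are supplied externally, as the framework prescribes:

```python
def q(X, i, beta=1.5):
    """Additive, reward-based fitness of a clustering X: each region c
    contributes the reward i(c) * |c|**beta, so with beta > 1 larger
    regions earn a higher reward than smaller regions of equal
    interestingness."""
    return sum(i(c) * len(c) ** beta for c in X)
```

For instance, with a constant interestingness i(c) = 1 and β = 2, a clustering consisting of a 2-object and a 4-object region receives reward 2² + 4² = 20.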

The reader might ask why we restrict the form of the fitness function in our proposed framework. The main reason is our desire to develop efficient clustering algorithms for region discovery. Restricting the form of the fitness functions supported allows us to use knowledge about the structure of the fitness function to obtain faster clustering algorithms which employ pruning, incremental updating, and sophisticated search strategies. This topic will be revisited in Section 4 of this paper when specific clustering algorithms are introduced.

Given a spatial dataset O, there are many possible clustering algorithms to seek interesting regions in O with respect to a plug-in fitness function q. In general, the objective of region discovery with plug-in fitness functions is:

Given: O, q, and possibly other input parameters
Find: regions r1,…,rk that maximize q({r1,…,rk}) subject to the following constraints:
(1a) ri ⊆ O (i=1,…,k) for extensional clustering
(1b) ri ⊆ F[S] (i=1,…,k) for intensional clustering
(2) contiguous(ri) (i=1,…,k)
(3) ri ∩ rj = ∅ (i ≠ j)

It should be emphasized that the number of regions k is not an input parameter in the proposed framework; that is, region discovery algorithms are assumed to seek the optimal number of regions k.

⁴ Other alternative definitions of contiguity exist, but will not be discussed in this paper due to lack of space.

3 DOMAIN-SPECIFIC PLUG-IN FITNESS FUNCTIONS

The fitness function, whose general form was given in formula (1), is the core component of our framework in capturing the notion of the interestingness of the domain. The main challenge in developing methodologies and techniques for domain-driven data mining is to incorporate domain knowledge into the data mining task so that "actionable knowledge" can be discovered. For example, in region discovery, the framework searches for interesting subspaces and then extracts regional knowledge from the obtained subspaces, which provides very crucial knowledge for domain experts.



The fitness function is specifically designed to be externally plugged in to provide extensibility and flexibility. The fitness function component of the framework is independent of the clustering algorithm employed, and for each domain a domain-specific fitness function is designed to capture the domain interestingness and incorporate domain knowledge. Because the fitness function is external and encapsulated from the rest of the framework, any change in the framework, such as a parameter change or a change in the clustering algorithm, will not affect the fitness function. Likewise, changes to the fitness function that come from domain requirements will not affect the contents of the clustering algorithm, and so on. This design enables the framework to be flexible and extensible to meet domain needs and requirements.

To illustrate how the notion of domain interestingness and domain-specific fitness functions are used in domain-driven data mining and in discovering actionable knowledge, we provide several examples of such fitness functions in the remainder of this section.

3.1 PCA-based Fitness Function

Finding interesting regional correlation patterns that help summarize the characteristics of a region is important to domain and business people, since many patterns exist only at a regional level, but not at the global level. Moreover, using regional patterns, which are normally hidden globally, domain or business people can understand the structure of the data and make business or domain decisions by analyzing these correlation patterns. For example, a strong correlation between a fatal disease and a set of chemical concentrations in Texas water wells might not be detectable throughout Texas, but a strong correlation pattern might exist regionally, which is also a reflection of Simpson's paradox [9]. This type of regional knowledge is crucial for domain experts who seek to understand the causes of such diseases and predict future cases. Identifying a sub-region in South Texas with 35 water wells that demonstrates a unique and strong correlation between arsenic, another chemical in the water of those wells, and a high occurrence of the disease in this region might suggest to domain experts the possible existence of nearby toxic waste, and provide valuable actionable knowledge that will help them understand the cause of the dangerous amounts of arsenic in the water wells, then develop a solution to this problem and prevent future incidents.

An example of discovered regions along with highly correlated attribute sets is given in Fig. 3. This is an application of our framework using the PCA-based fitness function on Texas Water Wells data [10]; the fact that the correlation sets for each region show significant differences emphasizes the importance of regional pattern discovery.















Fig. 3. An Example of Regional Correlation Patterns for Chemical Concentrations in Texas

In order to discover regions where sets of attributes are highly correlated, we need a fitness function that rewards high correlation and enables our framework to discover such regions. Principal Component Analysis (PCA) is a good candidate, since the directions identified by PCA are the eigenvectors of the correlation matrix, and each eigenvector has an associated eigenvalue that is a measure of the corresponding variance. The Principal Components (PCs) are ordered in descending order with respect to the variance associated with each component. The eigenvectors of the PCs can help reveal correlation patterns among sets of attributes.

Ideally, it is desirable to have high eigenvalues for the first k PCs, since this means that a smaller number of PCs will be adequate to account for the threshold variance, which overall suggests that a strong correlation among the variables exists [11]. The PCA-based fitness function is defined next.

Let λ1, λ2,…, λk be the eigenvalues of the first k PCs, with k being a parameter. The PCA-based interestingness of a region c is estimated using formula (2):

i(c) = λ1² + λ2² + … + λk²   (2)

The PCA-based fitness function then becomes:

q(X) = Σc∈X (λ1² + λ2² + … + λk²) · |c|^β   (3)

The fitness function rewards high eigenvalues for the first k PCs. By taking the square of each eigenvalue we ensure that regions with a higher spread in their eigenvalues will obtain higher rewards, reflecting the higher importance assigned in PCA to higher-ranked principal components.

Moreover, a generic pre-processing technique to select the best k value for the PCA-based fitness function is based on a variance threshold to decide how many PCs to retrieve. This variance threshold is also domain-specific and is set based on the available domain knowledge, to ensure selecting an appropriate k value for each dataset from different domains and reflecting concerns and constraints implied by domain knowledge.

The PCA-based fitness function repeatedly applies PCA during the search for the optimal set of regions, maximizing the eigenvalues of the first k PCs in each region. Having an externally plugged-in PCA-based fitness function enables the clustering algorithm to probe for optimal partitionings, and encourages the merging of two regions that exhibit structural similarities in correlation patterns. This approach is more advantageous than applying PCA just once or multiple times on the data using other tools, since the PCA-based fitness function is applied repeatedly to candidate regions to explore each possible region combination.
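Assuming the interestingness i(c) = λ1² + … + λk² given in formula (2), the per-region computation can be sketched as follows (the region's objects are passed as a matrix with one row per object and one column per non-spatial attribute; names are our own):

```python
import numpy as np

def pca_interestingness(values, k):
    """PCA-based interestingness of a region: the sum of the squared
    top-k eigenvalues of the correlation matrix of its attribute values
    (rows = objects, columns = non-spatial attributes)."""
    corr = np.corrcoef(values, rowvar=False)       # attribute correlation matrix
    eigvals = np.linalg.eigvalsh(corr)[::-1]       # eigenvalues, descending
    return float(np.sum(eigvals[:k] ** 2))
```

For two perfectly correlated attributes the correlation matrix has eigenvalues 2 and 0, so with k = 1 the interestingness is 4, whereas two uncorrelated attributes (eigenvalues near 1) would score near 1, illustrating how the squaring rewards eigenvalue spread.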

3.2 Co-location Fitness Function

Co-location mining is a data mining task that seeks interesting but implicit patterns in which two or more patterns co-locate in spatial proximity. In the following we introduce an interestingness function for co-location sets involving objects that are characterized by continuous attributes (see also [12] for background on the described approach). The pattern A↑ denotes that attribute A has high values, and the pattern A↓ indicates that attribute A has low values. For example, the pattern {A↑, B↓, D↑} describes that high values of A are co-located with low values of B and high values of D.

Let
O be a dataset,
c be a region,
o ∈ O be an object in the dataset O,
N = {A1,…,Aq} be the set of non-geo-referenced continuous attributes in the dataset O,
Q = {A1↑, A1↓,…, Aq↑, Aq↓} be the set of possible base co-location patterns,
B ⊆ Q be a set of co-location patterns.

Let z-score(A,o) be the z-score of object o's value of attribute A. The z-value of a base pattern for an object o is then defined as:

z(A↑,o) = max(z-score(A,o), 0)   (4)
z(A↓,o) = max(−z-score(A,o), 0)   (5)

The interestingness of an object o with respect to a co-location set B ⊆ Q is measured as the product of the z-values of the patterns in the set B. It is defined as follows:

i(B,o) = Πp∈B z(p,o)   (6)

where z(p,o) is the z-value of base pattern p ∈ Q for object o.


In general, the interestingness of a region can be straightforwardly computed as the average interestingness of the objects belonging to the region. However, with this approach some very large products might dominate the interestingness computations. For some domain experts, just finding a few objects with very high products in close proximity to each other is important, even if the remaining objects in the region deviate from the observed pattern. In other cases, domain experts are more interested in patterns with highly regular products, so that all or almost all objects in a region share the pattern, and are less interested in a few very high products. To satisfy the needs of different domains, our approach additionally considers purity when computing region interestingness, where purity(B,c) denotes the percentage of objects o ∈ c for which i(B,o) > 0. In summary, the interestingness of a region c with respect to a co-location set B, denoted by φ(B,c), is computed as follows:

φ(B,c) = purity(B,c)^θ · (Σo∈c i(B,o)) / |c|   (7)

The parameter θ ∈ [0,∞) controls the importance attached to purity in interestingness computations; θ = 0 implies that purity is ignored, and using larger values increases the importance of purity.

Fig. 6 depicts regions r in Texas with their highest-valued co-location sets B; that is, each depicted co-location set B has the highest value of φ(B,r) for its region.

3.3 Variance Fitness Function

The High Variance fitness function is a fitness function to discover regions where there is high contrast in the values of an attribute of interest. For example, in the study of earthquakes discussed in more detail in the case study in Section 5.2, where the attribute of interest is the depth of earthquakes, the domain expert may use the High Variance fitness function to find regions where shallow earthquakes are in close proximity to deep earthquakes. The interestingness of a region r, i(r), is defined as follows:

i(r) = max(v(r) − th, 0)   (8)

where

v(r) = Var(r, attr) / Var(O, attr), with Var(c, attr) = (1/|c|) Σo∈c (attr(o) − avg(c, attr))²   (9)

The interestingness function parameters β and th are determined in close collaboration with the domain experts. attr is the attribute of interest, and in the formula attr(o) denotes the value of attr for object o. The interestingness function computes the ratio of the region's variance with respect to attr and the dataset's variance. Regions whose ratio is above the given threshold th receive rewards.
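A sketch of the variance interestingness, assuming the piecewise form of formula (8) in which only the variance ratio above the threshold is rewarded (the exact reward formula is calibrated with the domain experts; names are our own):

```python
import statistics

def variance_interestingness(region_vals, dataset_vals, th):
    """High Variance interestingness: ratio of the region's variance of
    the attribute of interest to the dataset's variance; ratios at or
    below the threshold th earn no reward."""
    ratio = statistics.pvariance(region_vals) / statistics.pvariance(dataset_vals)
    return max(ratio - th, 0.0)
```

For an earthquake-depth dataset [0, 0, 10, 10], a region mixing a shallow and a deep quake ([0, 10]) matches the dataset's variance (ratio 1) and is rewarded, while a region of uniformly shallow quakes ([0, 0]) has ratio 0 and earns nothing.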

Figure 1 in Section 1 shows the result of using the above variance interestingness function for an earthquake dataset with earthquake depth being the attribute of interest. The polygons in Figure 1 indicate regions with positive interestingness; usually, those regions are further ranked by region reward to sort regions from most interesting to least interesting, providing search-engine-type capabilities to scientists who are interested in finding interesting places in spatial datasets.

4 CLUSTERING ALGORITHMS WITH PLUG-IN FITNESS FUNCTIONS

Another key component of the proposed framework is a family of clustering algorithms that allows domain experts to instruct clustering algorithms to seek clusters that satisfy their specific requirements. To achieve this flexible clustering capability, several clustering algorithms were designed and implemented that support externally given fitness functions that are maximized during the clustering process. Using different plug-in fitness functions in the algorithms results in obtaining different, alternative clusterings for the same data set. Existing clustering paradigms have been extended to support plug-in fitness functions, namely representative-based clustering, agglomerative clustering, divisive clustering, and grid-based clustering. Three such clustering algorithms, CLEVER [12], MOSAIC [13], and SCMRG [14], will be briefly introduced and formally described by extending the formal framework that was introduced in Section 2. Different clustering paradigms are superior with respect to different aspects of clustering. For example, grid-based clustering algorithms are able to cluster large datasets quickly, whereas representative-based clustering algorithms discover clusters of better quality. Finally, agglomerative clustering algorithms are capable of identifying arbitrarily shaped clusters, which is particularly important in spatial data mining. They can also be employed as a post-processing technique to enhance the quality of clusters that were obtained by running a representative-based clustering algorithm.


4.1 CLEVER: A Representative-based Clustering Algorithm

Representative-based clustering algorithms, sometimes called prototype-based clustering algorithms in the literature, construct clusters by seeking a set of representatives; clusters are then created by assigning objects in the dataset to their closest representative. In general, they compute the following function Ψ:

Ψ: O × q × d × {other parameters} → 2^Dom(S)

Ψ takes O, q, a distance function d over Dom(S), and possibly other parameters as input, and seeks an "optimal set"⁵ of representatives in Dom(S), such that the clustering X obtained by assigning the objects in O to their closest representative in Ψ(O,q,d,…) maximizes the fitness function q(X). Moreover, it should be noted that

5

In general, prototype
-
based clustering is NP
-
hard. Therefore, most
representative
-
based clustering algorithm will only be able to
find a
subopt
i
mal clustering X and not the global maximum of q.

AUTHOR ET AL.: TITL
E

7


clu
s
tering is done in the spatial attribute space S, and not
in F; the a
t
tributes in N are only used by fitness function q
when evaluating clu
s
ters.


CLEVER is an example of a representative-based clustering algorithm; it uses randomized hill climbing and larger neighborhood sizes (Footnote 6) to battle premature convergence when greedily searching for the best set of representatives. Initially, the algorithm randomly selects k' representatives from O. In the iterative process, CLEVER samples and evaluates p solutions in the neighborhood of the current solution; if the best one improves fitness, it becomes the current solution. The neighboring solutions are created by applying one of the following operators to a representative of the current solution: Insert, Delete, and Replace. Each operator has a certain selection probability, and the representatives to be manipulated are chosen at random. Moreover, to battle premature convergence, CLEVER re-samples p' > p solutions before terminating. The pseudocode of CLEVER is given in Fig. 4.

Fig. 4. Pseudo-code of CLEVER.
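To make the search loop concrete, the following is a minimal Python sketch of the hill-climbing skeleton described above. It is not the authors' implementation: operator selection probabilities and the p' re-sampling phase before termination are omitted, and the fitness q is supplied by the caller as a function of the representative set.

```python
import random

def clever_sketch(points, q, k0=2, p=10, seed=0):
    """Randomized hill climbing over sets of representatives (simplified)."""
    rng = random.Random(seed)
    current = rng.sample(points, k0)  # k' randomly chosen initial representatives

    def neighbor(reps):
        """Create a neighboring solution via Insert, Delete, or Replace."""
        reps = list(reps)
        op = rng.choice(["insert", "delete", "replace"])
        if op == "insert" or len(reps) < 2:
            reps.append(rng.choice(points))
        elif op == "delete":
            reps.pop(rng.randrange(len(reps)))
        else:  # replace one randomly chosen representative
            reps[rng.randrange(len(reps))] = rng.choice(points)
        return reps

    improved = True
    while improved:
        improved = False
        candidates = [neighbor(current) for _ in range(p)]
        best = max(candidates, key=q)
        if q(best) > q(current):  # accept only strict improvements
            current, improved = best, True
    return current
```

A toy fitness, e.g. the negative sum of squared distances to the closest representative, suffices to exercise the loop; in the framework, q would be a reward-based region discovery fitness.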

The cluster model ψ for the result obtained by running a representative-based clustering algorithm can be constructed as follows. Let

Φ(O,q,d,…) = {rep_1,…,rep_k} ⊆ Dom(S)

that is, the representative-based clustering algorithm returned R = {rep_1,…,rep_k}. Then the model ψ can be defined as follows:

∀p ∈ S: ψ(p) = m ⟺ d(p,rep_m) ≤ d(p,rep_j) for j = 1,…,k

that is, ψ assigns p to the cluster associated with the closest representative (Footnote 7).

Because representative-based clustering algorithms assign objects to clusters using 1-nearest-neighbor queries, the spatial extent of regions r_i ⊆ Dom(S) can be constructed by computing Voronoi diagrams; this implies that the shape of regions obtained by representative-based clustering algorithms is limited to convex polygons in Dom(S). Neighboring relationships no() between objects in O and nc() between clusters obtained by a representative-based clustering algorithm can be constructed by computing the Delaunay triangulation for R. Moreover, representative-based clustering algorithms do not support the concept of outliers; therefore, representative-based models have to assign a cluster to every point p in S.

Footnote 6: CLEVER modifies the current set of representatives by applying more than one operator to it, e.g., replacing two representatives and inserting a new representative.

Footnote 7: Our formulation ignores the problem of ties when finding the closest representative; in general, our representative-based clustering algorithms break ties randomly.

4.2 MOSAIC: An Agglomerative Clustering Algorithm

The agglomerative clustering problem can be defined as follows:

Given: O, F, S, N, a fitness function q, and an initial clustering X with contiguous(X).

Find: X' = {c'_1,…,c'_h} that maximizes q(X'), where all clusters in X' have been constructed using unions of neighboring clusters in X:

∀c_i ∈ X': c_i = c_i1 ∪ … ∪ c_ij with c_i1,…,c_ij ∈ X and nc(c_ik, c_ik+1) (for k = 1,…,j-1)

c_i ∩ c_j = ∅ (for i ≠ j)

Because the above definition assumes that only neighboring clusters are merged, contiguous(X') trivially holds.

In the following, we view results that are obtained by agglomerative methods as a meta-clustering X' over an initial clustering X of O; X' over X is defined as an exhaustive set of contiguous, disjoint subsets of X. More formally, the objectives of agglomerative clustering can be reformulated as follows:

Find: X' = {x_1,…,x_r} with x_i ⊆ X (i = 1,…,r) maximizing q(X'), subject to the following constraints:

(1) x_1 ∪ … ∪ x_r = X
(2) x_i ∩ x_j = ∅ (i ≠ j)
(3) contiguous(x_i) (for i = 1,…,r)
(4) ∀x ∈ X' ∃m ≥ 1 ∃x'_1,…,x'_m ∈ X: x = x'_1 ∪ … ∪ x'_m

We use the term meta-clustering because it is a clustering of clusters and not of objects, as is the case with traditional clustering. It should be noted that agglomerative clusters are exhaustive subsets of an initial clustering X; that is, we assume that outliers are not removed by the agglomerative clustering algorithm itself, but rather by the algorithm that constructs the input X for the agglomerative clustering algorithm. In general, an agglomerative clustering algorithm is decomposed into two algorithms:

1. a preprocessing algorithm that constructs the clustering X
2. the agglomerative clustering algorithm itself that derives X' from X.

The preprocessing algorithm is frequently degenerate; for example, its input could consist of single-object clusters, or X could be constructed based on a grid structure; however, for many applications it is beneficial to use a full-fledged clustering algorithm for the preprocessing step.


An agglomerative clustering algorithm, MOSAIC [13], has been introduced in previous work. MOSAIC takes the clustering X obtained by running a representative-based region discovery algorithm as its input and greedily merges neighboring regions as long as merging enhances q(X). For efficiency reasons, MOSAIC uses Gabriel graphs [15], which are subgraphs of Delaunay graphs, to compute nc; nc is then used to identify merge candidates for MOSAIC, which are pairs of neighboring clusters whose merging enhances q(X); nc is updated incrementally as clusters are merged. Finally, when clusters are merged, q(X) is updated incrementally, taking advantage of the fact that our framework assumes that q is additive. Fig. 5 gives the pseudo-code for MOSAIC.

Moreover, models for the clusters obtained by an agglomerative region discovery algorithm can easily be constructed from the models of the input clusters in X that have been merged to obtain the region in question. Let us assume r has been obtained as r = r_1 ∪ … ∪ r_m; in this case, the model for r can be defined as:

ψ_r(p) = ψ_r1(p) ∨ … ∨ ψ_rm(p)

In the case of MOSAIC, ψ_r(p) is implemented by characterizing MOSAIC clusters by sets of representatives (Footnote 8); new points are then assigned to the cluster whose set of representatives contains the representative that is closest to p. Basically, MOSAIC constructs regions as unions of Voronoi cells, and the above construction takes advantage of this property.

Fig. 5. Pseudo-code of MOSAIC.
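The greedy merge loop can be sketched compactly in Python. The sketch assumes clusters are given as sets of object ids, the neighboring relation nc as a set of unordered pairs of cluster ids, and an additive fitness q over a clustering; the Gabriel-graph construction of nc and the incremental fitness update are replaced by naive recomputation for brevity.

```python
def mosaic_sketch(clusters, nc, q):
    """Greedily merge neighboring clusters while fitness improves."""
    clusters = {i: set(c) for i, c in clusters.items()}
    nc = set(nc)
    while True:
        best_pair, best_gain = None, 0.0
        for pair in nc:
            i, j = tuple(pair)
            merged = dict(clusters)          # candidate clustering after merge
            merged[i] = clusters[i] | clusters[j]
            del merged[j]
            gain = q(merged) - q(clusters)
            if gain > best_gain:
                best_pair, best_gain = (i, j), gain
        if best_pair is None:                # no merge improves q: done
            return clusters
        i, j = best_pair
        clusters[i] |= clusters.pop(j)
        # relink j's former neighbors to i (naive update of nc)
        nc = {frozenset(i if x == j else x for x in pair)
              for pair in nc if pair != frozenset({i, j})}
        nc = {pair for pair in nc if len(pair) == 2}
```

Since q is assumed additive, a real implementation would only re-evaluate the two clusters involved in each candidate merge rather than the whole clustering.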

4.3 SCMRG: A Divisive Grid-based Clustering Algorithm

The divisive clustering problem can be defined as follows:

Given: O, F, S, N, a fitness function q, and an initial clustering X = {x_1,…,x_h} with contiguous(X).

Find: X' = {c'_1,…,c'_k} that maximizes q(X') and has been obtained from X.

Procedure: Initially, X' is set to X. Then X' is modified to increase q(X') by recursively replacing an x ∈ X' by x'_1,…,x'_p with x = x'_1 ∪ … ∪ x'_p, as long as q(X') improves and the following conditions are satisfied:

(1) x'_j ⊆ x (j = 1,…,p)
(2) x'_j ∩ x'_i = ∅ (for j ≠ i)
(3) contiguous(x'_j) (j = 1,…,p)
(4) reward(x) < reward(x'_1) + … + reward(x'_p)

Region x is only replaced by regions at a lower level of resolution if the sum of the rewards of the regions at the lower level of resolution is higher than x's reward. It should be emphasized that the splitting procedure employs a variable number of decompositions; e.g., one region might not be split at all, another region might be split into just four regions, whereas a third region might be split into 17 subregions. Moreover, the splitting procedure is not assumed to be exhaustive; that is, x can be split into y1, y2, y3 with y1 ∪ y2 ∪ y3 ⊂ x; in other words, the above specification allows divisive region discovery algorithms to discard outliers when seeking interesting regions; basically, the objects belonging to the residual region x \ (y1 ∪ y2 ∪ y3) in the above example are considered to be outliers.

Footnote 8: If r in X' has been constructed as r = r_1 ∪ … ∪ r_m from X, r is characterized by the representatives of regions r_1,…,r_m.

SCMRG (Supervised Clustering using Multi-Resolution Grids) [14] is a divisive, grid-based region discovery algorithm that was developed in our past work. SCMRG partitions the spatial space Dom(S) of the dataset into grid cells. Each grid cell at a higher level is partitioned further into a number of smaller cells at the lower level, and this process continues if the sum of the rewards of the lower-level cells is greater than the reward of the higher-level cell. The regions returned by SCMRG usually have different sizes, because they were obtained at different levels of resolution. Moreover, a cell is drilled down only if it is promising (if its fitness improves at a lower level of resolution). SCMRG uses a look-ahead splitting procedure that splits a cell into 4, 16, and 64 cells, respectively, and analyzes whether there is an improvement in fitness in any of these three splits; if this is not the case and the original cell receives a reward, the cell is included in the region discovery result; however, regions that, together with their successors at lower levels of resolution, do not receive any rewards are treated as outliers and discarded from the final clustering X'.

SCMRG employs a queue to store cells that need further processing. SCMRG starts at a user-defined level of resolution and puts the cells associated with this level on the queue. Next, SCMRG generates a clustering from the cells in the queue by traversing the hierarchical structure and examining those cells, considering the following three cases when processing a cell:

Case 1. If the cell c receives a reward, and its reward is greater than the sum of the rewards of its children and the sum of the rewards of its grandchildren, respectively, this cell is returned as a cluster by the algorithm.

Case 2. If the cell c does not receive a reward, nor do its children and grandchildren, neither the cell nor any of its descendants will be further processed or labeled as a cluster.

Case 3. Otherwise, if the cell c does not receive a reward but its children receive rewards, all the children of the cell c are put into the queue for further processing.

Finally, all cells that have been labeled as clusters (Case 1) are returned as the final result of SCMRG.
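The drill-down logic can be illustrated with a simple recursion. This sketch collapses SCMRG's three-level look-ahead (children and grandchildren) into a one-level comparison, omits the queue, and uses a 1-D two-way split instead of the 4/16/64-way grid splits, so it illustrates the reward-comparison idea rather than SCMRG itself; reward and split are caller-supplied.

```python
def drill_down(cell, reward, split, depth=3):
    """Keep a cell when its own reward beats its children's summed rewards;
    otherwise recurse into the children.  Rewardless leaves are outliers."""
    children = split(cell) if depth > 0 else []
    r = reward(cell)
    if not children:
        return [cell] if r > 0 else []       # discard rewardless cells
    if r > 0 and r >= sum(reward(c) for c in children):
        return [cell]                        # Case 1: cell itself is a cluster
    result = []                              # Case 3: process children further
    for child in children:
        result += drill_down(child, reward, split, depth - 1)
    return result
```

Note how cells at different depths can survive side by side in the output, which is why the regions returned at the end have different sizes.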

5 CASE STUDIES

5.1 Co-location Mining of Risk Patterns of Arsenic and Associated Chemicals in the Texas Water Supply

In this case study, we apply our domain-driven clustering framework to discover interesting regions where two or more attributes are co-located, along with the associated patterns. The employed procedure is summarized in Fig. 6 and explained step by step below:

Fig. 6. A procedure for applying the domain-driven clustering framework for actionable region discovery with the involvement of domain experts.

Step 1. Define the problem: Co-location mining is a data mining task that seeks interesting but implicit patterns in which two or more patterns co-locate in spatial proximity. For this case study, hydrologists helped us select subsets of chemicals and some external factors suspected of generating high levels of Arsenic concentrations. An interesting pattern B is defined as follows.

Given:
• N = {A_1,…,A_q}, the set of non-spatial continuous attributes that measure chemical concentrations in Texas water wells;
• Q = {A_1↑, A_1↓,…, A_q↑, A_q↓}, the set of base co-location patterns; in this case study, the domain expert is interested in finding associations of high/low concentrations (denoted by '↑' and '↓', respectively) with high/low concentrations of other chemicals;
• B ⊆ Q, a set of co-location patterns, where
• P(B) is a predicate over B that restricts the co-location sets considered, e.g., P(B) = As↑ ∈ B ("only look for co-location sets involving high concentrations of Arsenic").

Step 2. Create/Select a fitness function: First, the hydrologists formulate a measure of their interestingness in the form of a reward-based fitness function. Fitness functions express extrinsic characteristics, which vary across problems and domains. In our framework, the fitness function is a generic component, so hydrologists can define several fitness functions, some of which might be small variations of each other, based on their diverse interests. The simplified version of the fitness function applied in co-location mining, called the z-value, was given in Section 3.2.
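As an illustration only (the exact z-value formula and its purity parameter are defined in Section 3.2 and not reproduced here), a reward-based fitness in this spirit might score a region by the absolute product of the mean z-scores of the pattern's attributes over the region's objects, summing size-weighted rewards over all clusters; the attribute names and the exponent beta below are assumptions:

```python
from statistics import mean, pstdev

def zscores(values):
    """Standardize an attribute over the whole dataset."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]

def interestingness(z_by_attr, members):
    """|product over pattern attributes of the mean z-score in the region|."""
    prod = 1.0
    for zs in z_by_attr:
        prod *= mean(zs[i] for i in members)
    return abs(prod)

def fitness(clustering, z_by_attr, beta=1.1):
    """Additive, reward-based fitness: sum of i(c) * |c|**beta over clusters."""
    return sum(interestingness(z_by_attr, c) * len(c) ** beta
               for c in clustering)
```

Regions in which the attributes deviate from their dataset means in a correlated way score high, whereas regions that mix high and low values average out to near-zero rewards.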


Step 3. Select a clustering algorithm: The framework provides many algorithms that exemplify different clustering paradigms, e.g., representative-based clustering, divisive grid-based clustering, and agglomerative clustering. For this case study, CLEVER (CLustEring using representatiVEs and Randomized hill climbing) is employed to identify regions and associated co-location patterns.

Step 4. Select parameters of the fitness function and the clustering algorithm: Tuning or setting the parameters of the fitness function and the region discovery framework helps obtain better results, or extends the search to focus on alternative patterns or on patterns at different levels of granularity. For the particular fitness function employed, one parameter controls the importance of the purity of a pattern in the interestingness computation: the larger it is, the more importance is given to the purity of a pattern. Besides these parameters, hydrologists can also specify seed patterns, i.e., As↑ as a mandatory item in the co-location patterns considered. Later on, they can simply change the seed patterns to force the co-location mining algorithm to seek alternative patterns, e.g., patterns that are co-located with both {As↑, F↑}. This bridges a gap between hydrologists' expectations and the results of clustering algorithms, permitting hydrologists to tune the comprehensive parameters in order to derive actionable patterns.

Step 5. Run the clustering algorithm to discover interesting regions and associated patterns: Results (a set of clusters) obtained from the clustering algorithm are ranked either by reward or by interestingness. An example of experimental results is given in Fig. 7. For instance, the first-ranked pattern indicates that high levels of Arsenic co-locate with high levels of Boron, Chloride, and Total Dissolved Solids in southern Texas.



Fig. 7. Example of the top five regions ranked by interestingness.

Step 6. Analyze the results: By the nature of the fitness functions, the clustering algorithm consequently weeds out many regions having zero interestingness. The experimental results show the ability of the framework to identify interesting regions and associated patterns, exemplified in Fig. 7, which are comparable to the regions of high Arsenic concentration obtained from the TCEQ, as depicted in Fig. 8. Steps 4 to 6 are usually repeated several times in order to enhance the results or to explore alternative regions and patterns.

Fig. 8. Arsenic pollution map.

In contrast to traditional clustering, our framework offers search-engine-type capabilities to domain experts to help them identify patterns they are interested in. Domain experts assist and incorporate their knowledge in several mining phases, especially before the clustering phase. By expressing their interestingness in the form of fitness functions, the clustering algorithms are able to seek clusters with extrinsic characteristics. Therefore, the clusters and associated patterns obtained represent actionable knowledge.

5.2 Change Analysis in Earthquake Data

A change analysis framework is developed using our framework to analyze how interesting regions differ between two time frames; for instance, analyzing changes in places where deep earthquakes are in close proximity to shallow earthquakes. Fig. 9 summarizes the approach of change analysis, and the steps are explained next.

Fig. 9. A procedure for applying the domain-driven clustering framework in change analysis.

First, geologists sample two datasets corresponding to two different time frames.

Secondly, the domain-driven clustering framework is used to separately identify the interesting regions of each time frame; a fitness function measures high variance of earthquake depth. To generate intensional clusterings from the results of CLEVER, we construct Voronoi cells, in which polygons represent cluster models. Then, change analysis techniques are applied in Steps 3 to 6 in order to detect and identify different change patterns in those regions.

Third, users select relevant change predicates to compare changes between the two intensional clusterings; the predicates also have thresholds that are controlled externally. Then, changes between the two clusterings are instantiated with respect to the predicate thresholds. Finally, emergent patterns are summarized and further analyzed.
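Change predicates of this kind can be sketched minimally by modeling each region's spatial extent as a set of grid cells (the actual framework compares the polygon-based intensional models); the predicate names, thresholds, and cell representation below are illustrative assumptions:

```python
def coverage(region, other_regions):
    """Fraction of a region's cells covered by another clustering's regions."""
    covered = set().union(*other_regions) if other_regions else set()
    return len(region & covered) / len(region)

def is_novel(new_region, old_clustering, threshold=0.2):
    """Novelty predicate: little of the new region existed in O_old."""
    return coverage(new_region, old_clustering) < threshold

def disappeared(old_region, new_clustering, threshold=0.2):
    """Disappearance predicate: little of the old region survives in O_new."""
    return coverage(old_region, new_clustering) < threshold
```

Raising or lowering the externally controlled threshold changes how strictly a region must be uncovered before it is flagged as a novelty or a disappearance.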

Fig. 10. An overlay of interesting regions discovered in O_old and O_new.

Fig. 10 illustrates an overlay of interesting regions discovered in O_old and O_new; the red regions belong to the early time frame (labeled Region_old), whereas the blue regions belong to the late time frame (labeled Region_new). Examples of relationships discovered between the two clusterings are also given: regions 5 and 10 in Fig. 11 are considered new, whereas regions 0, 2, 3, and 7 in Fig. 12 are considered disappearances.



Fig. 11. Novelty areas of regions in the O_new data.

Fig. 12. Disappearance areas of regions in the O_old data.

5.3 Other Applications of the Framework for Actionable Regional Knowledge Discovery

Besides the use of the domain-driven clustering framework to discover actionable knowledge as described in the two aforementioned case studies, the framework can also be applied to aid knowledge discovery in other real applications. The first application, similar to the first case study, is co-location mining in planetary science [16]; we are interested in mining feature-based hotspots where extreme densities of deep ice and shallow ice co-locate on Mars; the fitness function employed is the absolute value of the product of the z-scores of the continuous non-spatial features in the spatial dataset. The outcomes of the framework are regions having either very high co-location or very high anti-co-location.

The second application is regional correlation pattern discovery using PCA in hydrology [10]. Finding regional patterns in spatial datasets is an important data mining task. A PCA-based fitness function is used to discover regional correlation patterns. This approach is more effective than solely applying PCA once or multiple times to the data, since PCA is applied repeatedly to candidate regions to explore each possible region combination. This case study uses a PCA-based fitness function maximizing the eigenvalues of the first k principal components; it rewards regions with high correlation, since highly correlated sets result in higher eigenvalues; in other words, more variance is captured.

The third application is multi-objective clustering, whose goal is to seek a set of clusters that individually satisfy multiple objectives. For example, hydrologists are also interested in identifying regions that satisfy multiple patterns of chemical contamination in the water supply. We apply multi-run clustering as a tool to gather multi-objective clusters simultaneously. Multi-run clustering reduces extensive human effort by searching for and enhancing novel, high-quality clusters in an automated fashion. Since multi-run clustering is developed on top of the domain-driven clustering framework, it conforms to the framework and also inherits the framework's capability to plug in different clustering algorithms and fitness functions. Therefore, results obtained from multi-run clustering are also considered actionable.

6 CONCLUSION

In this paper, a generic, domain-driven clustering framework is proposed that incorporates domain intelligence into domain-specific, plug-in fitness functions that are maximized by clustering algorithms. The framework provides a family of clustering algorithms and a set of fitness functions, along with the capability of defining new fitness functions. Moreover, an ontology and a theoretical foundation for clustering with fitness functions in general, and for region discovery in particular, are introduced. Fitness functions are the core components of the framework, as they capture a domain expert's notion of interestingness. The fitness function is independent of the clustering algorithm employed.

The framework was evaluated for different region discovery tasks in several case studies. The framework treats the region discovery problem as a clustering problem in which a given, plug-in fitness function has to be maximized. By integrating and utilizing domain knowledge and domain-specific evaluation measures in parameterized, plug-in fitness functions, together with controlling thresholds, the framework is able to obtain actionable regional knowledge and the associated patterns satisfying domain-specific needs.

The case studies demonstrate the capability of the framework to integrate domain intelligence and effectively utilize it in clustering tasks by incorporating domain requirements into the clustering algorithms in the form of a fitness function that guides clustering. To the best of our knowledge, this capability has been explored very little by past research in the field of clustering, and we are optimistic that our proposed framework will foster novel applications of domain-driven clustering.

REFERENCES

[1] L. Cao and C. Zhang, "The Evolution of KDD: Towards Domain-Driven Data Mining," Journal of Pattern Recognition and Artificial Intelligence, vol. 21, no. 4, pp. 677-692, World Scientific Publishing Company, 2007.
[2] I. Davidson and S. S. Ravi, "Clustering under Constraints: Feasibility Issues and the k-means Algorithm," Proc. Fifth SIAM Data Mining Conf., 2005.
[3] Q. Yang, K. Wu, and Y. Jiang, "Learning Action Models from Plan Examples using Weighted MAX-SAT," Artificial Intelligence, vol. 171, no. 2-3, pp. 107-143, 2007.
[4] N. Zhong, "Actionable Knowledge Discovery: A Brain Informatics Perspective," Special Trends and Controversies Department on Domain-Driven, Actionable Knowledge Discovery, IEEE Intelligent Systems, vol. 22, no. 4, pp. 85-86, 2007.
[5] W. Graco, T. Semenova, and E. Dubossarsky, "Toward Knowledge-Driven Data Mining," Proc. Domain Driven Data Mining Workshop, 2007.
[6] G. Karypis, E. H. Han, and V. Kumar, "Chameleon: Hierarchical Clustering using Dynamic Modeling," IEEE Computer, vol. 32, no. 8, pp. 68-75, 1999.
[7] C. F. Eick, N. Zeidat, and Z. Zhao, "Supervised Clustering: Algorithms and Benefits," Proc. Int. Conf. on Tools with AI, 2004.
[8] A. Bagherjeiran, C. F. Eick, C.-S. Chen, and R. Vilalta, "Adaptive Clustering: Obtaining Better Clusters Using Feedback and Past Experience," Proc. Fifth IEEE Int. Conf. on Data Mining, 2005.
[9] E. H. Simpson, "The Interpretation of Interaction in Contingency Tables," Journal of the Royal Statistical Society, ser. B, vol. 13, pp. 238-241, 1951.
[10] O. U. Celepcikay and C. F. Eick, "A Regional Pattern Discovery Framework using Principal Component Analysis," Proc. Int. Conf. on Multivariate Statistical Modeling & High Dimensional Data Mining, 2008.
[11] I. T. Jolliffe, Principal Component Analysis, Springer, 1986.
[12] C. F. Eick, R. Parmar, W. Ding, T. Stepinski, and J.-P. Nicot, "Finding Regional Co-location Patterns for Sets of Continuous Variables in Spatial Datasets," Proc. Sixteenth ACM SIGSPATIAL Int. Conf. on Advances in GIS, 2008.
[13] J. Choo, R. Jiamthapthaksin, C.-S. Chen, O. Celepcikay, C. Giusti, and C. F. Eick, "MOSAIC: A Proximity Graph Approach to Agglomerative Clustering," Proc. Ninth Int. Conf. on Data Warehousing and Knowledge Discovery, 2007.
[14] C. F. Eick, B. Vaezian, D. Jiang, and J. Wang, "Discovery of Interesting Regions in Spatial Datasets Using Supervised Clustering," Proc. Tenth European Conf. on Principles and Practice of Knowledge Discovery in Databases, 2006.
[15] K. Gabriel and R. Sokal, "A New Statistical Approach to Geographic Variation Analysis," Systematic Zoology, vol. 18, pp. 259-278, 1969.
[16] W. Ding, R. Jiamthapthaksin, R. Parmar, D. Jiang, T. Stepinski, and C. F. Eick, "Towards Region Discovery in Spatial Datasets," Proc. Pacific-Asia Conf. on Knowledge Discovery and Data Mining, 2008.


Christoph F. Eick received his PhD from the University of Karlsruhe in Germany. He is currently an Associate Professor in the Department of Computer Science at the University of Houston and Co-Director of the UH Data Mining and Machine Learning Group. His research interests include data mining, machine learning, evolutionary computing, and artificial intelligence. He has published more than 95 papers in these and related areas. He serves on the program committee of the IEEE International Conference on Data Mining (ICDM) and other major data mining and machine learning conferences.


Oner Ulvi Celepcikay is a senior PhD candidate in the Computer Science Department at the University of Houston. He received his bachelor's degree in Electrical Engineering in 1997 from Istanbul University, Istanbul, Turkey, and his M.S. degree in Computer Science from the University of Houston in 2003. He worked in the University of Houston Educational Technology Outreach (ETO) Department from 2000 to 2007. He has published a number of papers in his research fields, including cluster analysis, multivariate statistical analysis, and spatial data mining. He served as a session chair at the International Conference on Multivariate Statistical Modeling & High Dimensional Data Mining in 2008 and has been serving as a non-PC reviewer for many conferences.


Rachsuda Jiamthapthaksin received her bachelor's degree in Computer Science in 1997 and her master's degree in Computer Science, with Honors and the Dean's Prize for Outstanding Performance, in 1999 from Assumption University, Bangkok, Thailand. She was a faculty member in the Computer Science Department at Assumption University from 1997 to 2004. She is now a PhD candidate in Computer Science at the University of Houston, Texas. She has published papers in the areas of her research interests, including intelligent agents, fuzzy systems, cluster analysis, data mining, and knowledge discovery. She has served as a non-PC reviewer for many conferences and as a volunteer staff member in the organization of the 2005 IEEE ICDM Conference, November 2005, Houston, Texas.


Vadeerat Rinsurongkawong is a PhD candidate in Computer Science at the University of Houston. She received her M.S. degree in Information Technology from Assumption University, Thailand, and her B.Eng. degree in Electrical Engineering from Chulalongkorn University, Thailand. She has work experience in electrical engineering, information technology, and computer science. Her areas of interest are data mining and databases.