The Journal of Systems and Software 84 (2011) 1524–1539
Enhancing grid-density based clustering for high dimensional data
Yanchang Zhao (a), Jie Cao (b,∗), Chengqi Zhang (c), Shichao Zhang (d,∗)

(a) Centrelink, Australia
(b) Jiangsu Provincial Key Laboratory of E-business, Nanjing University of Finance and Economics, Nanjing, 210003, P.R. China
(c) Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia
(d) College of CS & IT, Guangxi Normal University, Guilin, China
ARTICLE INFO

Article history:
Received 26 July 2010
Received in revised form 9 February 2011
Accepted 25 February 2011
Available online 8 March 2011
Keywords:
Clustering
Subspace clustering
High dimensional data
ABSTRACT
We propose an enhanced grid-density based approach for clustering high dimensional data. Our technique takes objects (or points) as atomic units, so that the size requirement on cells is waived without losing clustering accuracy. For efficiency, a new partitioning is developed to make the number of cells smoothly adjustable; a concept of ith-order neighbors is defined to avoid considering an exponential number of neighboring cells; and a novel density compensation is proposed for improving the clustering accuracy and quality. We experimentally evaluate our approach and demonstrate that our algorithm significantly improves the clustering accuracy and quality.
© 2011 Elsevier Inc. All rights reserved.
1. Introduction
Clustering, as one of the main techniques in data mining, aims to find "natural" groups in datasets. Not only can it be used stand-alone for database segmentation and data compression, but it can also be employed in the preprocessing steps of other data mining techniques, such as classification and association rule mining.
Density-based clustering (Ankerst et al., 1999; Ester et al., 1996; Hinneburg and Keim, 1998) and grid-based clustering (Sheikholeslami et al., 1998; Wang et al., 1997) are two well-known clustering approaches. The former is famous for its capability of discovering clusters of various shapes, effectively eliminating outliers and being insensitive to the order of inputs, whereas the latter is well known for its high speed. However, neither approach is scalable to high dimensionality. For density-based methods, the reason is that the index structures they rely on, such as the R*-tree, are not scalable to high-dimensional spaces. For grid-based approaches, the reason is that both the number of cells and the count of neighboring cells grow exponentially with the dimensionality of the data. Grid-based algorithms take cells as atomic units which are inseparable, and thus the intervals into which each dimension is partitioned must be small enough to ensure the resolution of clustering. Therefore, the number of cells increases exponentially with dimensionality. Some researchers try to break the curse of dimensionality by using the adaptive grid (Nagesh et al., 1999), the optimal grid (Hinneburg and Keim, 1999), or an a priori-like approach (Agrawal et al., 1998).

∗ Corresponding authors.
E-mail addresses: yanchang.zhao@centrelink.gov.au (Y. Zhao), caojie690929@163.com (J. Cao), chengqi@it.uts.edu.au (C. Zhang), zhangsc@gxnu.edu.cn (S. Zhang).
Previously, we developed an algorithm called AGRID (Advanced GRid-based Iso-Density line clustering), which combines density-based and grid-based approaches to cluster large, high-dimensional data (Zhao and Song, 2003). Based on the idea of density-based clustering, it employs a grid to reduce the complexity of distance computation and can discover clusters of arbitrary shapes efficiently. However, in order to reduce the complexity of density computation, only (2d + 1) out of all 3^d neighbors are considered for each cell when computing the densities of the objects in it. When the dimensionality is high, most neighboring cells are ignored and the accuracy becomes very poor.
In this paper, we present an enhanced grid-density based algorithm for clustering high dimensional data, referred to as AGRID+, which substantially improves the accuracy of density computation and clustering. AGRID+ has four distinct technical features. The first is that objects (or points), instead of cells, are taken as the atomic units. In this way, it is no longer necessary to make the intervals very small, so the number of cells does not grow dramatically with the dimensionality of the data. The second feature is the concept of ith-order neighbors, with which the neighboring cells are organized into groups to improve efficiency and meet different requirements of accuracy. As a result, we obtain a trade-off between accuracy and speed in AGRID+. The third is the technique of density compensation, which greatly improves the accuracy. Last but not least, a new distance measure, the minimal subspace distance, is designed for subspace clustering.
The rest of the paper is organized as follows. In Section 2, we present the related work and some necessary concepts. The AGRID+ clustering algorithm is designed in Section 3,
in which an idea to adapt the algorithm for subspace clustering is also given. Section 4 shows the results of experiments on both synthetic and public datasets. Some discussions are given in Section 5, and conclusions are drawn in Section 6.
2. Related work
Most clustering algorithms fall into four categories: partitioning clustering, hierarchical clustering, density-based clustering and grid-based clustering. The idea of partitioning clustering is to partition the dataset into k clusters, each represented by the centroid of the cluster (k-Means) or by one representative object of the cluster (k-Medoids). It uses an iterative relocation technique that improves the partitioning by moving objects from one group to another. Well-known partitioning algorithms are k-Means (Alsabti et al., 1998), k-Medoids (Huang, 1998) and CLARANS (Ng and Han, 1994).
Hierarchical clustering creates a hierarchical decomposition of the dataset in a bottom-up (agglomerative) or top-down (divisive) fashion. A major problem of hierarchical methods is that they cannot correct erroneous decisions. Well-known hierarchical algorithms are AGNES, DIANA, BIRCH (Zhang et al., 1996), CURE (Guha et al., 1998), ROCK (Guha et al., 1999) and Chameleon (Karypis et al., 1999).
The general idea of density-based clustering is to continue growing a given cluster as long as the density (i.e., the number of objects) in the neighborhood exceeds some threshold. Such a method can be used to filter out noise and discover clusters of arbitrary shapes. The density of an object is defined as the number of objects in its neighborhood, so the density of each object has to be computed first. A naive way is to calculate the distance between each pair of objects and count the number of objects in the neighborhood of each object as its density, which is not scalable with the size of datasets, since the computational complexity is O(N^2), where N is the number of objects. Typical density-based methods are DBSCAN (Ester et al., 1996), OPTICS (Ankerst et al., 1999) and DENCLUE (Hinneburg and Keim, 1998).
Grid-based algorithms quantize the data space into a finite number of cells that form a grid structure, and all of the clustering operations are performed on the grid structure. The main advantage of this approach is its fast processing time. However, it does not work effectively and efficiently in high-dimensional space due to the so-called "curse of dimensionality". Well-known grid-based approaches for clustering include STING (Wang et al., 1997), WaveCluster (Sheikholeslami et al., 1998), OptiGrid (Hinneburg and Keim, 1999), CLIQUE (Agrawal et al., 1998) and MAFIA (Nagesh et al., 1999); they are sometimes called density-grid based approaches (Han and Kamber, 2001; Kolatch, 2001).
STING is a grid-based multi-resolution clustering technique in which the spatial area is divided into rectangular cells organized into a hierarchy of statistical information (Wang et al., 1997). The statistical information associated with the spatial cells is captured, and queries and clustering problems can be answered without recourse to the individual objects. The hierarchical structure of grid cells and the statistical information associated with them make STING very fast. STING assumes that K, the number of cells at the bottom layer of the hierarchy, is much less than the number of objects, and the overall computational complexity is O(K). However, K can be much greater than N for high-dimensional data.
Sheikholeslami et al. (1998) proposed a technique named WaveCluster that looks at the multidimensional data space from a signal-processing perspective. The objects are taken as a d-dimensional signal, so the high-frequency parts of the signal correspond to the boundaries of clusters, while the low-frequency parts with high amplitude correspond to the areas of the data space where data are concentrated. It first partitions the data space into cells, then applies a wavelet transform on the quantized feature space and detects the dense regions in the transformed space. With the multi-resolution property of the wavelet transform, it can detect clusters at different scales and levels of detail. The time complexity of WaveCluster is O(dN log N).
The basic idea of OptiGrid is to use contracting projections of the data to determine optimal cutting hyper-planes for partitioning the data (Hinneburg and Keim, 1999). The data space is partitioned with arbitrary (non-equidistant, irregular) grids based on the distribution of the data, which avoids the effectiveness problems of existing grid-based approaches and guarantees that all clusters are found by the algorithm, while still retaining the efficiency of a grid-based approach. The time complexity of OptiGrid is between O(dN) and O(dN log N).
CLIQUE (Agrawal et al., 1998), MAFIA (Nagesh et al., 1999) and Random Projection (Fern and Brodley, 2003) are three algorithms for discovering clusters in subspaces. CLIQUE discovers clusters in subspaces in a way similar to the Apriori algorithm. It partitions each dimension into intervals and computes the dense units in all dimensions. These dense units are then combined to generate the dense units in higher dimensions.
MAFIA is an efficient algorithm for subspace clustering using a density- and grid-based approach (Nagesh et al., 1999). It uses adaptive grids to partition a dimension depending on the distribution of data in that dimension. The bins and cells that have a low density of data are pruned to reduce the computation. The boundaries of the bins are not rigid, which improves the quality of clustering.
Fern and Brodley proposed Random Projection to find the subspaces of clusters in a random-projection and ensemble manner (Fern and Brodley, 2003). The dataset is first projected into random subspaces, and the EM algorithm is then used to discover clusters in the projected dataset. The algorithm generates several groups of clusters with the above method and combines them into a similarity matrix, from which the final clusters are discovered with an agglomerative clustering algorithm.
Moise et al. (2008) proposed P3C, a robust algorithm for projected clustering. Based on the computation of so-called cluster cores, it can effectively discover projected clusters in the data while minimizing the number of required parameters. Moreover, it can work on both numerical and categorical datasets.
Assent et al. (2008) proposed an algorithm capable of finding parallel clusters in different subspaces in spatial and temporal databases. Although they also use the notions of neighborhood and density, their target problem is clustering sequence data rather than the generic data considered in this paper.
Previously, we proposed AGRID, a grid-density based algorithm for clustering (Zhao and Song, 2003). It has the advantages of both density-based clustering and grid-based clustering, and is effective and efficient for clustering large high-dimensional data. However, it is not accurate enough, and the reason is that only 2d immediate neighbors are taken into consideration. Moreover, it is incapable of discovering clusters in subspaces.
With the AGRID algorithm, firstly, each dimension is divided into multiple intervals and the data space is thus partitioned into many hyper-rectangular cells. Objects are assigned to cells according to their attribute values. Secondly, for an object α in a cell, we only compute the distances between it and the objects in its neighboring cells, and use the count of those objects which are close to α as its density. Objects that are not in the neighboring cells are far away from α and therefore do not contribute to the density of α. Thirdly, each object is taken as a cluster, and every pair of objects which are in the neighborhood of each other is checked to see whether they are close enough to be merged into one cluster. If so, the two clusters to which the two objects respectively belong are merged into a single cluster. All eligible pairs of clusters meeting the above requirement are merged to generate larger clusters, and the clustering finishes when all such object pairs have been checked.

Fig. 1. Two definitions of neighbors: (a) the 3^d neighbors; (b) the (2d + 1) immediate neighbors. The grey cell labelled "(i,j)" in the center is C_α, and the grey cells around it are its neighbors.
With the idea of grid and neighbor, either the 3^d neighboring cells (see Fig. 1(a)) or the (2d + 1) immediate neighboring cells (including C_α itself, see Fig. 1(b)) are considered when computing the densities of objects in cell C_α and clustering in the AGRID algorithm. If all 3^d neighboring cells are considered, the computation is prohibitively expensive when the dimensionality is high. Nevertheless, if only the (2d + 1) immediate neighboring cells are considered, many cells are ignored and the computed densities become inaccurate for high-dimensional data. To tackle this dilemma, the ith-order neighbor will be defined in this paper to classify the 3^d neighboring cells into groups according to their significance. By considering only the most significant neighboring cells, high speed is achieved while high accuracy is kept. To improve the accuracy further, density compensation and the minimal subspace distance are proposed, which will be described in the following sections.
For a comprehensive survey of other approaches to clustering, please refer to Berkhin (2002), Grabmeier and Rudolph (2002), Han and Kamber (2001), Jain et al. (1999) and Kolatch (2001).
3. AGRID+: an enhanced density-grid based clustering
The proposed AGRID+ algorithm for clustering high-dimensional data is presented in this section. The ith-order neighbor is first introduced to improve efficiency, and then density compensation is proposed to improve accuracy and make the algorithm more effective for clustering high-dimensional data. In addition, the minimal subspace distance measure is introduced to make the algorithm capable of finding clusters in subspaces effectively. Our techniques for partitioning the data space and choosing parameters are also discussed in this section.
The following notations are used throughout this paper. N is the number of objects (or points, or instances) and d is the dimensionality of the dataset. L is the length of an interval, r is the radius of the neighborhood, and DT is the density threshold. α is an object or a point, and C_α is the cell in which α is located. X is an object with coordinates (x_1, x_2, ..., x_d), and Dist_p(X, Y) is the distance between X and Y with the L_p-metric as the distance measure. C_{i_1 i_2 ... i_d} stands for the cell whose ID is i_1 i_2 ... i_d, where i_j is the ID of the interval in which the cell is located in the jth dimension. V_n and V_c are respectively the volume of the neighborhood and the volume of the considered part of the neighborhood. Cnt_q(α) is the count of points in the considered part of the neighborhood of α, and Den_q(α) is the compensated density of α when all ith-order neighbors of α (0 ≤ i ≤ q) are considered for density computation.
3.1. The ith-order neighbors
In this section, our definitions of neighbors will be presented and discussed. Note that neighborhood and neighbors (or neighboring cells) are two different concepts in this paper. The former is defined for a point, and its neighborhood is an area or a space, while the latter is defined for a cell, and its neighbors are those cells adjacent to it. Sometimes we use "the neighbors of point α" to denote the neighbors of cell C_α w.r.t. point α (see Definition 4), where C_α is the cell in which α is located.
An intuitive way is to define all the cells around a cell as its neighbors, as Definition 1 shows.
Definition 1 (Neighbors). Cells C_{i_1 i_2 ... i_d} and C_{j_1 j_2 ... j_d} are neighbors of each other iff ∀p, 1 ≤ p ≤ d, |i_p − j_p| ≤ 1, where i_1 i_2 ... i_d and j_1 j_2 ... j_d are respectively the interval IDs of cells C_{i_1 i_2 ... i_d} and C_{j_1 j_2 ... j_d}.
Generally speaking, there are altogether 3^d neighbors for each cell in a d-dimensional data space according to Definition 1 (see Fig. 1(a)). Assume α is an object and C_α is the cell that α is located in. When calculating the density of object α, we need to compute the distances between α and the objects in cell C_α and its neighboring cells only. Those objects in other cells are relatively far away from object α, so they contribute nothing or little to the density of α. Therefore, for object α, we do not care about the objects which are not in the neighboring cells of cell C_α.
With Definition 1, each cell has 3^d neighbors, which makes the computation very expensive when the dimensionality is high. Therefore, the notion of immediate neighbors is defined as follows to reduce the computational complexity.
Definition 2 (Immediate Neighbors). Cells C_{i_1 i_2 ... i_d} and C_{j_1 j_2 ... j_d} are immediate neighbors of each other iff ∃l, 1 ≤ l ≤ d, |i_l − j_l| = 1, and ∀p ≠ l, 1 ≤ p ≤ d, i_p = j_p, where l is an integer between 1 and d, and i_1 i_2 ... i_d and j_1 j_2 ... j_d are respectively the interval IDs of cells C_{i_1 i_2 ... i_d} and C_{j_1 j_2 ... j_d}.
Generally speaking, in a d-dimensional space, each cell has 2d immediate neighbors (see Fig. 1(b)).
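As a quick illustration of Definitions 1 and 2 (an editorial sketch in Python, not code from the paper; a cell is represented here simply by the tuple of its interval IDs), the two neighbor relations can be tested as follows:

def are_neighbors(cell_a, cell_b):
    # Definition 1: every interval ID differs by at most 1 (the cell itself qualifies too).
    return all(abs(i - j) <= 1 for i, j in zip(cell_a, cell_b))

def are_immediate_neighbors(cell_a, cell_b):
    # Definition 2: exactly one interval ID differs, and it differs by exactly 1.
    diffs = [abs(i - j) for i, j in zip(cell_a, cell_b)]
    return diffs.count(1) == 1 and diffs.count(0) == len(diffs) - 1

# In 2D, a cell and its neighbors form a 3 x 3 = 3^2 block, but the cell has only
# 2 * 2 = 4 immediate neighbors.
assert are_neighbors((3, 5), (4, 6))                # a diagonal cell is a neighbor ...
assert not are_immediate_neighbors((3, 5), (4, 6))  # ... but not an immediate one
assert are_immediate_neighbors((3, 5), (4, 5))      # differs in exactly one dimension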
With only immediate neighbors considered according to Definition 2, the computational complexity is greatly reduced, but at the cost of accuracy. It is effective when the clusters are compact and dense. Nevertheless, when the dimensionality is high and the data are sparse, the density values and the clustering become inaccurate, since many cells are ignored when computing densities. To improve the accuracy, we classify the neighbors according to their significance by defining ith-order neighbors as follows.

Fig. 2. The ith-order neighbors of cell C_α: (a) C_α (i = 0); (b) i = 1; (c) i = 2; (d) i = 3.
Definition 3 (ith-order Neighbors). Let C_α be a cell in a d-dimensional space. A cell which shares a (d − i)-dimensional facet with cell C_α is an ith-order neighbor of C_α, where i is an integer between 0 and d. In particular, the 0th-order neighbor of C_α is defined to be C_α itself.
Examples of ith-order neighbors in a 3D space are shown in Fig. 2. The grey cell in Fig. 2(a) is C_α, and the 0th-order neighbor of C_α is itself. The grey cells in Fig. 2(b)-(d) are the 1st-, 2nd- and 3rd-order neighbors of C_α, respectively. With the introduction of ith-order neighbors, the neighbors of cell C_α are classified into groups according to their positions relative to C_α, and an ith-order neighbor's contribution to the density of α is greater for lower i. Therefore, we only consider low-order neighbors when clustering. More specifically, only those neighbors whose order is not greater than q are taken into account, where q is a positive integer and 0 ≤ q ≤ d. The ith-order neighbor is a generalization of Definitions 1 and 2. When q is set to 1, only 0th- and 1st-order neighbors are considered, and these low-order neighbors are C_α itself and the immediate neighbors defined by Definition 2. In this case, the speed is very high, but the accuracy is poor. If q is set to d, all neighbors are considered, which is the same as Definition 1. Thus, the accuracy is guaranteed, but the computation is prohibitively costly. Since lower-order neighbors are of more significance, our technique of considering only low-order neighbors helps to improve performance while keeping accuracy as high as possible. Moreover, the accuracy can be further improved with our technique of density compensation, which will be discussed later in this paper.
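To make the grouping concrete, the following Python sketch (ours, not the authors') generates only the neighbors of order at most q, exploiting the fact that an ith-order neighbor differs from C_α in exactly i interval IDs, each by ±1, so that only Σ_{i≤q} C(d, i)·2^i cells are touched rather than all 3^d:

from itertools import combinations, product

def low_order_neighbors(cell, q):
    # Neighbors of `cell` with order i <= q, grouped by order i.
    d = len(cell)
    groups = {0: [tuple(cell)]}               # the 0th-order neighbor is the cell itself
    for i in range(1, q + 1):
        groups[i] = []
        for dims in combinations(range(d), i):        # which i interval IDs differ
            for signs in product((-1, 1), repeat=i):  # each differing ID changes by +1 or -1
                c = list(cell)
                for dim, s in zip(dims, signs):
                    c[dim] += s
                groups[i].append(tuple(c))
    return groups

groups = low_order_neighbors((2, 2, 2), q=2)
print({i: len(v) for i, v in groups.items()})   # {0: 1, 1: 6, 2: 12}; the 8 third-order cells are skipped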
In the following, the relationship between the radius of the neighborhood and the length of an interval is discussed to further improve the performance of our algorithm. Let r be the radius of the neighborhood and L the length of an interval. When r is large enough that all the objects in all the neighbors of a cell are within the neighborhood, AGRID+ will behave somewhat like grid-based clustering, in the sense that the densities of all the objects in a cell will be the same and the density is simply the count of those objects in all its neighboring cells. With a very large r, both the densities and the neighborhood become very large, which will lead to the merging of adjacent clusters into bigger clusters and to clusters consisting of noise. On the other hand, if r is much smaller than the lengths of all edges of the hyper-rectangular cell, AGRID+ will become somewhat like density-based clustering, because the density of an object is largely decided by the number of objects circumscribed by r. With a very small r, both the densities and the neighborhood become very small, so the result will be composed of many small clusters and a large number of objects will be taken as outliers. Therefore, it is reasonable to set r to be of the same order as the length of an interval.

Fig. 3. Neighborhood and neighbors. The black point is α, the grey cell in the center is C_α, and the other grey cells around it are its neighbors. The area within the dashed line is the neighborhood of α.
If r > L/2, all the 3^d cells around C_α should be considered to compute the density of object α accurately. If r < L/2, some of the 3^d cells around C_α will not overlap with the neighborhood and can be excluded from the density computation. An illustration of this observation is given in Fig. 3, which shows the neighborhood and neighbors in a 2D space. In the figure, the L∞-metric is used as the distance measure and the neighborhood of α becomes a hypercube. Note that the observation also holds for other distance measures. As Fig. 3 shows, if object α is located near the top-left corner of C_α, only the cells that are on the top-left side of C_α need to be considered, and the computation becomes less expensive. In what follows, we assume that r, the radius of the neighborhood, is less than L/2.

Fig. 4. The ith-order neighbors of C_α w.r.t. point α: (a) C_α (i = 0); (b) i = 1; (c) i = 2; (d) i = 3.
The above observation can be generalized as follows: if the radius of the neighborhood is less than L/2, only those neighbors which are located on the same side of C_α as α contribute to the density of α.
Therefore, for each point α in cell C_α, the neighbors that need to be considered are related to the relative position of α in C_α, so a new definition of ith-order neighbors with respect to the position of α is given as follows.
Definition 4 (ith-order Neighbor w.r.t. Point α). In a d-dimensional space, let α be a point and C_α be the cell in which α is located. Assume that the coordinate of α is (x_1, x_2, ..., x_d), the center of C_α is (a_1, a_2, ..., a_d) and the center of C_β is (b_1, b_2, ..., b_d). Point α and cell C_β are on the same side of cell C_α iff ∀i, 1 ≤ i ≤ d, (x_i − a_i)(b_i − a_i) ≥ 0. Cell C_β is an ith-order neighbor of C_α w.r.t. α (or an ith-order neighbor of α for short) iff: (1) C_β is an ith-order neighbor of C_α, and (2) C_β and α are on the same side of C_α.
Since an ith-order neighbor of α shares a (d − i)-dimensional facet with C_α, the ID sequence of an ith-order neighbor has i IDs that differ from those of C_α, and the difference between each such pair of IDs can be either +1 or −1. Because the ith-order neighbors of α lie on the same side of C_α as α, the number of ith-order neighbors of α is the binomial coefficient C(d, i). Examples of the ith-order neighbors of C_α w.r.t. α in a 3D space are shown in Fig. 4. Assume that α is a point on the top-right-back side of the center of C_α (the grey cell) in Fig. 4(a); then C_α is the 0th-order neighbor of C_α w.r.t. α. The grey cells in Fig. 4(b)-(d) are the 1st-, 2nd- and 3rd-order neighbors of C_α w.r.t. α, respectively.
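A small sketch of Definition 4 (again our own illustration; it assumes an equal interval length `cell_len` in every dimension and breaks the tie x_i = a_i toward the positive side) generates the ith-order neighbors of C_α w.r.t. α and confirms that order i contributes C(d, i) cells:

from itertools import combinations

def neighbors_wrt_point(alpha, cell, cell_len, q):
    # ith-order neighbors of C_alpha w.r.t. point alpha, for orders 0..q.
    d = len(alpha)
    center = [(c + 0.5) * cell_len for c in cell]
    # side[j] = +1 or -1: the side of the cell center on which alpha lies in dimension j
    side = [1 if alpha[j] >= center[j] else -1 for j in range(d)]
    groups = {0: [tuple(cell)]}
    for i in range(1, q + 1):
        groups[i] = [
            tuple(c + (side[j] if j in dims else 0) for j, c in enumerate(cell))
            for dims in combinations(range(d), i)     # only same-side offsets are generated
        ]
    return groups

# A point near the "upper" corner of cell (0, 0, 0) with interval length 1.0:
g = neighbors_wrt_point((0.9, 0.8, 0.7), (0, 0, 0), 1.0, q=3)
print({i: len(v) for i, v in g.items()})   # {0: 1, 1: 3, 2: 3, 3: 1}, i.e. C(3, i)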
3.2. Density compensation
With the introduction of ith-order neighbors, efficiency is much improved by considering low-order neighbors only. However, the clustering still becomes less accurate with the increase of dimensionality. To further improve accuracy, an idea of density compensation will be proposed in this section to make up for the loss introduced by ignoring high-order neighbors.
3.2.1. Idea of density compensation
Since only low-order neighbors are considered, a part of the neighborhood is ignored and the clustering becomes less accurate, especially as d increases. To make up for the loss, we propose a notion of density compensation. The idea is that, for each object, the ratio of the volume of the neighborhood to that of the considered part is calculated as a compensation coefficient, and the final density of an object is the product of its original density and its compensation coefficient. According to Definition 4, if all ith-order neighbors of α (i = 0, 1, ..., q, where 0 ≤ q ≤ d) are considered when computing the density of α, the density we get should be compensated as

Den_q(α) = (V_n / V_c) · Cnt_q(α),    (1)

where V_n and V_c are respectively the volume of the neighborhood and the volume of the considered part of the neighborhood, i.e., the part covered by the ith-order neighbors of α with 0 ≤ i ≤ q. Cnt_q(α) is the count of points in the considered part of the neighborhood, and Den_q(α) is the compensated density of α.
Unfortunately, since there are too many neighbors for each cell when q is a large integer, it is impractical to compute the contribution of each cell individually. Therefore, we simplify the density compensation by assuming that, for a specific i, the considered parts of all ith-order neighbors are of the same volume. Let V_{S_i} be the volume of the overlapped space of an ith-order neighbor and the neighborhood of α; based on Eq. (1), we can get

Den_q(α) = V_n / (Σ_{i=0..q} C(d, i) · V_{S_i}) · Cnt_q(α),    (2)

where S_i is the overlapped space of an ith-order neighbor and the neighborhood of α, V_{S_i} is the volume of S_i, and C(d, i) is the binomial coefficient. In this equation, the density becomes more accurate as q increases. When q = d, the density we obtain is the exact value of the density. Nevertheless, the number of neighbors considered increases dramatically with q. When q is set to 1, the 0th- and 1st-order neighbors together are the (2d + 1) neighbors defined in AGRID. The value of V_{S_i} in Eq. (2) varies with the distance measure. A method for computing V_{S_i} will be presented in the next section.
3.2.2. Density compensation
Euclidean distance is the most widely used distance measure. However, Euclidean distance increases with dimensionality, which makes it difficult to select a value for r. Assume that there are two points α(a, a, ..., a) and β(0, 0, ..., 0) in a d-dimensional space. The L_p-distance between them is Dist_p(α, β) = (d·a^p)^(1/p), that is, a·d^(1/p). We can see that the distance increases with the dimensionality, especially when p is small. For example, for a dataset within the unit cube in a 100D space, if Euclidean distance (p = 2) is used, the distance of the point (0.2, 0.2, ..., 0.2) from the origin is 2. If r is set to 1.5, it covers the whole range in every single dimension, but still cannot cover the above point! It is the same with most other L_p-metrics, especially when p is a small integer. However, when the L∞-metric is used, the distance becomes Dist_∞(α, β) = a. For the above example, r can be set to 0.3 to cover the point, and it covers only a part of each dimension. Therefore, the L∞-metric is more meaningful for measuring distance for clustering in high dimensional spaces. Moreover, for subspace clustering in a high-dimensional space, clusters are defined by researchers as axis-parallel hyper-rectangles in subspaces (Agrawal et al., 1998; Procopiuc et al., 2002). Therefore, it is reasonable to define a cluster in this paper to be composed of those objects which are in a hyper-rectangle in a subspace. Since a hyper-rectangle in a subspace can be obtained by bounding the subspace distance under the L∞-metric, we select the L∞-metric as the distance measure, which is defined as follows:
Dist_∞(X, Y) = max_{i=1..d} |x_i − y_i|.    (3)
When the L∞-metric is used as the distance measure, the neighborhood of an object becomes a hyper-cube with edge length 2r, and its volume is V_n = (2r)^d, where r is the radius of the neighborhood.
Let (a_1, a_2, ..., a_d) be the coordinates of α relative to the start point of C_α (see Fig. 5(a)). Let b_j = min{a_j, L_j − a_j}, where L_j is the length of the interval in the jth dimension. If b_j < r, then the neighborhood of α extends beyond the boundary of C_α in the jth dimension. Suppose that there are altogether d′ dimensions with b_j < r, and that a is the mean of those b_j.

To approximate the ratio of the overlapped spaces, we assume that the current object α is located on the diagonal of cell C_α and that (a, a, ..., a) is its coordinate relative to the start point of C_α (Fig. 5(b)), where a = (1/d) Σ_{i=1..d} a_i. With such an assumption, for a specific i, all the ith-order neighbors of α have the same volume of overlapped space with the neighborhood of α. Let S_i be the overlapped space of an ith-order neighbor and the neighborhood of α, and V_{S_i} the volume of S_i. S_i is a hyper-rectangle which has i edges of length (r − a) and (d − i) edges of length (r + a), so V_{S_i} = (r + a)^(d−i) (r − a)^i, where 0 ≤ a ≤ r. Since the neighborhood overlaps with C_α in only d′ dimensions, the following equation can be derived from Eq. (2) by replacing d, V_n and V_{S_i} with d′, (2r)^(d′) and (r + a)^(d′−i) (r − a)^i, respectively:

Den_q(α) = (2r)^(d′) / (Σ_{i=0..q} C(d′, i) · (r + a)^(d′−i) (r − a)^i) · Cnt_q(α),    (4)

where q is a positive integer no larger than d′.

Fig. 5. Density compensation. As (a) shows, the black point is α, and the area circumscribed by the dotted line is the neighborhood of α. To approximate the volumes of the spaces (in grey) in which the neighborhood overlaps each neighbor of α, we assume that α is located on the diagonal line of the current cell C_α, as (b) shows.
In fact, the number of ith-order neighbors of a point is much less than C(d′, i), especially in high-dimensional spaces where most cells are empty. Thus k_i, the actual number of ith-order neighbors, can be used to replace C(d′, i), leading to

Den_q(α) = (2r)^(d′) / (Σ_{i=0..q} k_i · (r + a)^(d′−i) (r − a)^i) · Cnt_q(α).    (5)
By tuning the parameter q, we can obtain different clustering accuracy. Clearly, both the accuracy and the cost increase as q increases, so we need a trade-off between accuracy and efficiency. The value of q can be chosen according to the required accuracy and the performance of the computer. A large value of q will improve the accuracy of the clustering result, but at the cost of time. On the contrary, high speed can be achieved by setting q to a small value, but the accuracy will be lower accordingly. Interestingly, our experiments show that setting q to two or three achieves good accuracy in most situations. The effect of different values of q will be shown in Section 4.
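As an illustration of Eq. (5) (an editorial sketch; the argument names are ours, and d_prime stands for d′), the compensated density can be computed directly from the per-order neighbor counts:

from math import comb

def compensated_density(cnt_q, k, r, a, d_prime):
    # Eq. (5): Den_q(alpha) = (2r)^d' / sum_i k_i (r+a)^(d'-i) (r-a)^i * Cnt_q(alpha).
    # cnt_q   -- Cnt_q(alpha), the points found in the considered part of the neighborhood
    # k       -- k[i], the actual number of ith-order neighbors of alpha, for i = 0..q
    # r, a    -- neighborhood radius and the mean offset a (0 <= a <= r)
    # d_prime -- number of dimensions in which the neighborhood crosses the boundary of C_alpha
    q = len(k) - 1
    v_n = (2 * r) ** d_prime
    v_c = sum(k[i] * (r + a) ** (d_prime - i) * (r - a) ** i for i in range(q + 1))
    return v_n / v_c * cnt_q

# With q = d', a = 0 and k[i] = C(d', i), every neighbor is fully counted, the
# compensation coefficient reduces to 1 and the density is left unchanged:
d_prime = 3
k = [comb(d_prime, i) for i in range(d_prime + 1)]
print(round(compensated_density(cnt_q=10, k=k, r=0.4, a=0.0, d_prime=d_prime), 6))   # 10.0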
3.3. Minimal subspace distance
Euclidean distance is the most widely used distance measure. However, the difference between the nearest and the farthest points becomes less discriminating with the increase of dimensionality (Hinneburg et al., 2000). Aggarwal et al. suggest using fractional distance metrics (i.e., the L_p-norm with 0 < p < 1) to measure the similarity between objects in high dimensional space (Aggarwal et al., 2001). Nevertheless, many researchers think that most meaningful clusters only exist in subspaces, so they use the traditional L_p-norm (p = 1, 2, 3, ...) to discover clusters in subspaces (Agrawal et al., 1998; Fern and Brodley, 2003; Nagesh et al., 1999; Procopiuc et al., 2002).

For subspace clustering in high-dimensional space, clusters are constrained to be axis-parallel hyper-rectangles in subspaces by Agrawal et al. (1998), and projective clusters are defined as axis-aligned boxes by Procopiuc et al. (2002). Therefore, it is reasonable to define a cluster to be composed of those objects which are in a hyper-rectangle in a subspace. To improve on the traditional L_p-norm (p = 1, 2, 3, ...) for subspace clustering in high-dimensional space, a new distance measure, the minimal subspace distance, is defined as follows.
Definition 5 (Minimal Subspace Distance). Suppose that X = (x_1, x_2, ..., x_d) and Y = (y_1, y_2, ..., y_d) are two objects or points in a d-dimensional space. The minimal k-dimensional subspace distance between X and Y is the minimal distance between them over all possible k-dimensional subspaces:

Dist^(k)(X, Y) = min over all J_k of {Dist(X_{J_k}, Y_{J_k})},  J_k ⊆ {1, 2, ..., d},  1 ≤ k < d,    (6)
where J_k = (j_1, j_2, ..., j_k) is a k-dimensional subspace, X_{J_k} and Y_{J_k} are respectively the projections of X and Y onto subspace J_k, and Dist(·) is a traditional distance measure in the full-dimensional space.
When the L_p-metric is used as the measure of distance, the minimal subspace distance is the L_p distance over the k minimal differences between the pairs x_i and y_i:

Dist_p^(k)(X, Y) = (Σ_{i=1..k} |x_{j_i} − y_{j_i}|^p)^(1/p).    (7)
If the L∞-norm is used as the distance measure, the minimal subspace distance becomes the k-th minimum of |x_i − y_i|, which can easily be obtained by sorting |x_i − y_i| (i = 1..d) in ascending order and picking the k-th value. Then Dist^(k)(X, Y) ≤ r means that X and Y are in a hyper-rectangle with edge length r in k dimensions and without limits in the other dimensions. Therefore, the above distance measure provides an effective measure for hyper-rectangular clusters in subspaces.
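For illustration (our own sketch, not the authors' implementation), both the L_p form of Eq. (7) and the L∞ form described above reduce to sorting the per-dimension differences:

def min_subspace_distance_lp(x, y, k, p):
    # Eq. (7): L_p distance over the k smallest per-dimension differences.
    diffs = sorted(abs(a - b) for a, b in zip(x, y))
    return sum(d ** p for d in diffs[:k]) ** (1.0 / p)

def min_subspace_distance_linf(x, y, k):
    # L_inf variant: the k-th smallest |x_i - y_i|.
    diffs = sorted(abs(a - b) for a, b in zip(x, y))
    return diffs[k - 1]

x = (1.0, 5.0, 2.0, 9.0)
y = (1.25, 0.0, 2.125, 3.0)
# The points nearly coincide in dimensions 1 and 3, so their 2D minimal subspace
# distance is small even though they are far apart in the full space.
print(min_subspace_distance_linf(x, y, k=2))     # 0.25
print(min_subspace_distance_lp(x, y, k=2, p=2))  # ~0.28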
With the help of the above minimal subspace distance, it becomes easier to discover clusters in subspaces. For two objects, it finds the subspace in which they are the most similar, or the nearest, to each other. Assume that the L∞-norm is used. For example, if the 4D minimal subspace distance between two objects is 7, it means that the two objects are within a 4D hyper-rectangle with edge length 7.
The minimal subspace distance tries to measure the distance between objects in the subspace where they are closest to each other, so it is effective for finding subspaces where clusters exist and then discovering clusters in those subspaces. With the above definition of the minimal subspace distance, our algorithm is capable of finding projected clusters and their subspaces automatically when the average dimensionality of the subspaces is given. The effectiveness of the above distance measure will be shown in the experiments.
3.4. Partitioning data space

3.4.1. Technique of partitioning
The performance of our algorithm largely depends on the partitioning of the data space. Given a certain number of objects, the more cells the objects occupy and the more uniformly the objects are distributed, the better the performance is. In some papers (Agrawal et al., 1998; Sheikholeslami et al., 1998), each dimension is divided into the same number (say, m) of intervals, so there are m^d cells in the data space. This method of partitioning is effective when the dimensionality is low. Nevertheless, it is inapplicable in a high dimensional data space, because the number of cells increases exponentially with the dimensionality and the computation becomes extremely expensive. For example, if d is 80, the number of cells is too large to be practical even if m is set to two.

However, the value of m cannot be any lower: if m is set to one, there is only one cell and the density calculation for each object needs to compute N distances. In addition, when the dimensionality is high, it is very difficult to choose an appropriate value for m, the interval number, and a small change of it can lead to a great variance in the number of cells. For example, if d is 30, m^d is 2.06 × 10^14 when m = 3 and 1.07 × 10^9 when m = 2. To tackle this problem, a technique of dividing different dimensions into different numbers of intervals is employed to partition the whole data space.

With our technique, different interval numbers are used for different dimensions. Each of the first p dimensions is divided evenly into m intervals, while (m − 1) intervals are used for each of the remaining (d − p) dimensions. With such a partitioning technique, the total number of cells is m^p (m − 1)^(d−p), and the number of cells can be adjusted smoothly by changing m and p.
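A minimal sketch of this mixed partitioning (ours; it assumes every attribute has already been scaled to [0, range_max]) maps an object to the ID sequence of its cell and shows how the cell count moves smoothly with p:

def cell_id(point, m, p, range_max=1000.0):
    # Interval IDs of the cell containing `point`: the first p dimensions use m
    # intervals, the remaining d - p dimensions use m - 1 intervals.
    ids = []
    for j, x in enumerate(point):
        k = m if j < p else m - 1
        ids.append(min(int(x / (range_max / k)), k - 1))   # clamp the upper boundary
    return tuple(ids)

# With d = 5, m = 3 and p = 2 the grid has 3^2 * 2^3 = 72 cells; lowering p to 1
# gives 3 * 2^4 = 48 cells, a much gentler step than jumping between the
# 2^5 = 32 and 3^5 = 243 cells of a uniform partitioning.
print(cell_id((120.0, 980.0, 500.0, 10.0, 760.0), m=3, p=2))   # (0, 2, 1, 0, 1)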
Let ω be the percentage of non-empty cells and N the number of objects. The number of non-empty cells is N_ne = ω m^p (m − 1)^(d−p). The average number of objects contained in each non-empty cell is N_avg = N / N_ne. Let N_nc be the average number of neighboring cells of a non-empty cell (including itself). For each non-empty cell, the number of distance computations is N_avg · N_nc · N_avg. So the total time complexity is

C_t = N_avg · N_nc · N_avg · N_ne = N_avg · N_nc · N = N^2 · N_nc / (ω m^p (m − 1)^(d−p)).    (8)
By setting the time complexity to be linear in both N and d, we get

N^2 · N_nc / (ω m^p (m − 1)^(d−p)) = N d,    (9)

that is,

N · N_nc / (ω m^p (m − 1)^(d−p)) = d.    (10)
Then the values of m and p can be derived from the above equation.
3.4.2. Average number of neighbors per cell
For simplicity, we consider the case q = 1 to select the values of m and p. The m and p calculated in this way are also used when q is set to other values in our algorithm.

With q = 1, when m is a large number, most cells have (2d + 1) neighbors, i.e., N_nc ≈ 2d + 1, so the following can be derived from Eq. (9):

N(2d + 1) / (ω m^p (m − 1)^(d−p)) = d,    (11)

where both m and p are positive integers, m ≥ 2 and 1 ≤ p ≤ d.

However, when the dimensionality is high, m may become small and the majority of cells would have fewer than (2d + 1) neighbors, so Eq. (11) would be inapplicable for computing the value of p. In the following, a theorem is presented to compute the average number of neighbors of a cell.
Theorem 1. In a d-dimensional data space, if each of the first p dimensions is evenly divided into m intervals and each of the remaining (d − p) dimensions into (m − 1) intervals, where m ≥ 2, the average number of immediate neighbors of a cell with q = 1 is

N_nc = 1 + (2(m − 1)/m) · p + (2(m − 2)/(m − 1)) · (d − p).    (12)
Proof. If a dimension is partitioned into m intervals, the total number of neighboring intervals over all the intervals in that dimension is (2m − 2), since each of the two intervals at the ends has one neighbor and each of the remaining (m − 2) intervals has two neighbors.
For the immediate neighbors (q = 1), the interval ID in exactly one dimension differs from the ID sequence of the current cell. If the difference is in one of the first p dimensions, there are p m^(p−1) (m − 1)^(d−p) cases, and in each case there are (2m − 2) neighbors, so the count of neighbors over the first p dimensions is n_1 = p m^(p−1) (m − 1)^(d−p) (2m − 2). If the difference is in one of the last (d − p) dimensions, there are (d − p) m^p (m − 1)^(d−p−1) cases, and in each case there are (2m − 4) neighbors, so the count of neighbors over the last (d − p) dimensions is n_2 = (d − p) m^p (m − 1)^(d−p−1) (2m − 4). The count of cells is n_3 = m^p (m − 1)^(d−p), and the average number of neighbors of each cell is

(n_1 + n_2) / n_3 = [p m^(p−1) (m − 1)^(d−p) (2m − 2) + (d − p) m^p (m − 1)^(d−p−1) (2m − 4)] / [m^p (m − 1)^(d−p)] = (2(m − 1)/m) · p + (2(m − 2)/(m − 1)) · (d − p).

In addition, each cell is also counted as a neighbor of itself, so N_nc = 1 + (2(m − 1)/m) · p + (2(m − 2)/(m − 1)) · (d − p). □
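Eq. (12) can be checked numerically by brute force on a small grid (an editorial verification sketch, not part of the paper):

from itertools import product

def avg_immediate_neighbors(d, m, p):
    # Average number of immediate neighbors per cell (the cell itself included),
    # counted exhaustively over the m^p * (m-1)^(d-p) grid.
    sizes = [m] * p + [m - 1] * (d - p)
    cells = list(product(*[range(s) for s in sizes]))
    total = 0
    for cell in cells:
        count = 1                                # the cell itself
        for j, s in enumerate(sizes):
            if cell[j] > 0:
                count += 1                       # neighbor with ID - 1 in dimension j
            if cell[j] < s - 1:
                count += 1                       # neighbor with ID + 1 in dimension j
        total += count
    return total / len(cells)

def eq12(d, m, p):
    return 1 + 2 * (m - 1) / m * p + 2 * (m - 2) / (m - 1) * (d - p)

d, m, p = 4, 3, 2
print(avg_immediate_neighbors(d, m, p), eq12(d, m, p))   # both ~5.667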
From Eq. (9) and Theorem 1, we get

N (1 + (2(m − 1)/m) · p + (2(m − 2)/(m − 1)) · (d − p)) / (ω m^p (m − 1)^(d−p)) = d,    (13)

where m ≥ 2 and 1 ≤ p ≤ d. For a given m, Eq. (13) is a transcendental equation and cannot be solved directly. In fact, for each m, p is an integer no less than one and no greater than d. Therefore, the values of p fall in a small range, and the optimal value can be found by trying every possible pair of values in Eq. (13).
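In practice, the pair (m, p) can be found with a small exhaustive search over Eq. (13) (our own sketch; ω, the fraction of non-empty cells, has to be estimated or guessed beforehand, and the candidate range for m is arbitrary):

def choose_m_p(n, d, omega, m_candidates=range(2, 11)):
    # Pick the (m, p) whose left-hand side of Eq. (13) is closest to d.
    best, best_err = None, float("inf")
    for m in m_candidates:
        for p in range(1, d + 1):
            n_nc = 1 + 2 * (m - 1) / m * p + 2 * (m - 2) / (m - 1) * (d - p)
            cells = (m ** p) * ((m - 1) ** (d - p))
            err = abs(n * n_nc / (omega * cells) - d)
            if err < best_err:
                best, best_err = (m, p), err
    return best

print(choose_m_p(n=10000, d=15, omega=0.3))   # (2, 15) with these (assumed) inputs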
3.5. Storage of cells
In a high dimensional data space, the number of cells can be huge and it is impossible to store all cells in memory. Fortunately, not all of the cells contain objects. Especially when the dimensionality is high, the space is very sparse and the majority of cells are empty, so it is not necessary to store all cells. With our technique, only the non-empty cells are stored, using a hash table. Because each non-empty cell contains at least one object, the number of non-empty cells is no more than N, the number of objects.
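A dictionary keyed by the interval-ID tuple plays the role of this hash table (a minimal sketch of our own; `to_cell` is any mapping from an object to its cell ID, such as the cell_id sketch shown earlier):

from collections import defaultdict

def build_cell_index(points, to_cell):
    # Store only the non-empty cells: map each cell-ID tuple to the indices of the
    # objects falling into it.  Memory stays O(N) no matter how large the grid is.
    index = defaultdict(list)
    for idx, pt in enumerate(points):
        index[to_cell(pt)].append(idx)
    return index

points = [(1.0, 2.0), (1.1, 2.2), (900.0, 50.0)]
index = build_cell_index(points, lambda pt: tuple(int(x // 100) for x in pt))
print(dict(index))   # {(0, 0): [0, 1], (9, 0): [2]}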
3.6. Parameters r and DT
While it is very easy to count the number of objects in the neighborhood of an object, it is not so easy to choose an appropriate value for r, the radius of the neighborhood. When r is large enough that all the objects in all the neighbors of a cell are within the neighborhood, AGRID+ will behave somewhat like grid-based clustering, in the sense that the densities of all the objects in a cell will be the same and the density is simply the count of those objects in all its neighboring cells. They differ, however, in that AGRID+ considers only the significant low-order neighbors, instead of all 3^d neighboring cells. On the other hand, if r is much smaller than the lengths of all edges of the hyper-rectangular cell, AGRID+ will become somewhat like density-based clustering, because the density of an object is largely decided by the number of objects circumscribed by r. However, the partitioning of the data space into cells helps to reduce the number of distance computations and makes AGRID+ much faster than density-based clustering.

Since Section 3.1 assumes that the radius of the neighborhood is less than L/2, r is simply set to a value less than L/2, where L is the length of the shortest interval over all dimensions. Because a small r can make the densities too low to find any useful clusters, r is set to a value between L/4 and L/2 in our algorithm.

Besides r, the result of clustering is also decided by the value of the density threshold, DT. With AGRID+, we calculate DT dynamically from the mean of the densities according to the following equation:

DT = (1/λ) × (Σ_{i=1..N} Density(i)) / N,    (14)

where λ is a coefficient which can be tuned to get clustering results at different levels of resolution. By tuning λ, various clustering results can be achieved with different DT. On the one hand, a small λ leads to a big DT; the merging condition becomes strict, the result is composed of many small clusters, and many objects are taken as noise. On the other hand, a large λ makes DT small, which leads to a few large clusters, because adjacent clusters are merged and some noise is mistaken for clusters.

With a set of different values of λ, a multi-resolution clustering can be obtained. Since the proposed algorithm is based on AGRID, the effect of DT and the multi-resolution clustering of the two algorithms are similar, and some experimental results and more discussion on this can be found in our previous work on AGRID (Zhao and Song, 2003).
3.7. The procedure of AGRID+
AGRID+ is composed of the following seven steps; detailed pseudo-code can be found in Figs. 6-8. A condensed sketch of the whole procedure is given after the list of steps.
(1) Partitioning. The whole data space is partitioned into cells according to m and p computed with Eq. (13). Each object is then assigned to a cell according to its coordinates, and the non-empty cells are inserted into a hash table.
(2) Computing the distance threshold. The distance threshold is computed from the interval lengths of every dimension with the method given in Section 3.6.
(3) Calculating densities. For each object, count the number of objects that are both in its neighboring cells and in its neighborhood as its density.
(4) Compensating densities. For each object α, compute the ratio of the volume of all neighbors to that of the neighbors considered, and use the product of this ratio and the density of α as the new density of α, according to Eq. (5).
(5) Calculating the density threshold DT. The average of all compensated densities is calculated, and then the density threshold DT is computed with Eq. (14).
(6) Clustering automatically. At first, each object whose density is greater than DT is taken as a cluster. Then, for each object α, check each object in the neighboring cells of C_α to see whether its density is greater than the density threshold and whether its distance from object α is less than the distance threshold. If so, merge the two clusters to which the two objects respectively belong. Continue this merging procedure until all eligible object pairs have been checked.
(7) Removing noise. Among the clusters obtained, many are too small to be considered meaningful clusters, so they are removed as noise.
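The following condensed sketch strings the seven steps together (an editorial illustration only: it is our own code, it leaves the densities uncompensated for brevity, and it uses a union-find structure in place of the explicit cluster-merging bookkeeping; `to_cell`, `neighbors_of` and `dist` are assumed to be supplied by the caller, for example along the lines of the earlier snippets):

def agrid_plus_sketch(points, to_cell, neighbors_of, dist, r, dt_coeff=1.0, min_size=5):
    # (1) Partitioning: hash table of non-empty cells.
    cells = {}
    for i, p in enumerate(points):
        cells.setdefault(to_cell(p), []).append(i)
    # (2)-(4) Densities: count close objects found in the low-order neighboring cells.
    density = [0] * len(points)
    for cid, members in cells.items():
        cand = [j for nb in neighbors_of(cid) for j in cells.get(nb, [])]
        for i in members:
            density[i] = sum(1 for j in cand if dist(points[i], points[j]) <= r)
    # (5) Density threshold from the mean density (Eq. (14) with coefficient dt_coeff).
    dt = sum(density) / len(points) / dt_coeff
    # (6) Merge dense, close objects with union-find.
    parent = list(range(len(points)))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for cid, members in cells.items():
        cand = [j for nb in neighbors_of(cid) for j in cells.get(nb, [])]
        for i in members:
            if density[i] < dt:
                continue
            for j in cand:
                if density[j] >= dt and dist(points[i], points[j]) <= r:
                    parent[find(i)] = find(j)
    # (7) Drop clusters that are too small to be meaningful (noise).
    clusters = {}
    for i in range(len(points)):
        if density[i] >= dt:
            clusters.setdefault(find(i), []).append(i)
    return [c for c in clusters.values() if len(c) >= min_size]

The sketch keeps only the overall control flow of the seven steps; the real algorithm additionally compensates the densities with Eq. (5) and derives r and DT as described in Section 3.6.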
3.8. Complexity analysis
The performance of AGRID+ depends on the values of N (the size of the data) and d (the dimensionality of the data). With the partitioning technique proposed in Section 3.4, the time complexity is controlled by m and p, the two parameters for space partitioning. The time complexity is set to be linear in N and d in Eq. (9) in Section 3.4.1. Nevertheless, the time complexity we computed assumes an ideal condition in which every cell contains the same number of objects. In nearly all cases, the number of objects varies from cell to cell, so the time complexity depends to some degree on the distribution of objects in the data. Our experimental results in the next section will show that the time complexity is nearly linear in both data size and dimensionality.
Regarding space complexity, our algorithm stores only the non-empty cells, in a hash table, and the number of non-empty cells is no more than the number of objects. Besides, the densities of objects and the discovered clusters are also kept in memory, and the space used to store densities and clusters is also linear in N. Therefore, the space complexity is linear in the size and the dimensionality of the data.
4. Experimental evaluation
Our experiments were performed on a PC with 256 MB RAM and an Intel Pentium III 1 GHz CPU. In the experiments, we will show the improvement of AGRID+ over AGRID in terms of scalability, performance and accuracy, and will also show the effectiveness of AGRID+ for discovering clusters in subspaces. In addition, we will compare AGRID+ with Random Projection (Fern and Brodley, 2003) on a public dataset.
4.1. Synthetic data generator
The function nngenc(X, C, N, D) from Matlab^1 is used to generate clusters of data points, where X is an R × 2 matrix of cluster bounds, C is the number of clusters, N is the number of data points in each cluster, and D is the standard deviation of the clusters. The function returns a matrix containing C × N R-element vectors arranged in C clusters with centers inside the bounds set by X, with N elements each, randomly distributed around the centers with a standard deviation of D.
The range is set to [0, 1000]. For some clusters, we set the values in some dimensions to follow a uniform distribution in order to create subspace clusters. Noise, uniformly distributed in all dimensions, is added to the data.
^1 http://www.mathworks.com/.
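Readers without Matlab can generate comparable synthetic data with a few lines of NumPy (our own approximation of the generator described above, not the authors' code; unlike the paper, it overwrites the listed dimensions with uniform values for every cluster rather than only for some):

import numpy as np

def make_clusters(d, n_clusters, n_per_cluster, dev, noise_frac=0.05,
                  uniform_dims=(), lo=0.0, hi=1000.0, seed=0):
    # Gaussian clusters in [lo, hi]^d with standard deviation `dev`; the dimensions
    # listed in `uniform_dims` are replaced by uniform values to create subspace
    # clusters, and uniformly distributed noise points are appended.
    rng = np.random.default_rng(seed)
    centers = rng.uniform(lo, hi, size=(n_clusters, d))
    data = np.vstack([rng.normal(c, dev, size=(n_per_cluster, d)) for c in centers])
    for j in uniform_dims:
        data[:, j] = rng.uniform(lo, hi, size=len(data))
    noise = rng.uniform(lo, hi, size=(int(noise_frac * len(data)), d))
    return np.vstack([data, noise])

X = make_clusters(d=15, n_clusters=4, n_per_cluster=2500, dev=130)
print(X.shape)   # (10500, 15)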
Fig. 6. Pseudo-code of AGRID+.
4.2. Superiority over AGRID
The first dataset is a 15D dataset of 10,000 points, generated with the deviation set to 130. There are 4 clusters, and 5% of the data are noise. All clusters are in the full-dimensional space. The clustering results of AGRID+ and AGRID are shown in Fig. 9. Fig. 9(a) shows the clusters discovered by AGRID, where the numbers below the sub-figures are the sizes of the clusters. AGRID+ also found four clusters, but with more objects. Fig. 9(b) shows the additional objects discovered by AGRID+ as opposed to those by AGRID, where the numbers below the sub-figures are the counts of additional objects in the clusters. The objects in Fig. 9(b) are missed by AGRID, which shows that the compensation of densities makes AGRID+ more accurate than AGRID.
Tables 1 and 3 show the comparison between AGRID+ and AGRID on the densities of objects and on the accuracy of clustering. The confusion matrix of densities is given in Table 1, in which DT stands for the density threshold and the figures are counts of objects.
Table 1
Comparison of densities.

Standard density    Density in AGRID+        Density in AGRID
                    ≥DT        <DT           ≥DT        <DT
≥DT                 9314       190           8292       1212
<DT                 222        274           16         480
Accuracy            95.9%                    87.7%
Fig. 7. Pseudo-code of Computing density().
It is clear from the table that the accuracy of AGRID+ is greater than that of AGRID (Table 2). We then further studied the effectiveness of ith-order neighbors and of density compensation, and the results are shown in Table 3, which gives the clusters discovered and the accuracy. It reports the clustering results of four algorithms: NAIVE, AGRID, IORDER and AGRID+. NAIVE is a naive density-based clustering algorithm that does not use any grid. AGRID adds a grid and the (2d + 1) neighboring cells on top of NAIVE. IORDER uses ith-order neighbors to improve the performance of NAIVE, but applies no compensation to the density computation. AGRID+ uses all the techniques designed in this paper.
ALGORITHM: Clustering
INPUT: data
OUTPUT: clusters
/* creating a new cluster for each object whose density is no less than DT */
FOR all objects O_i
    IF Den_q(O_i) ≥ DT
        cluster(O_i) = {O_i};
    ENDIF
ENDFOR
/* combining clusters */
FOR all cells C_i in the hash table
    FOR all C_j, non-empty kth-order neighbors of C_i (0 ≤ k ≤ q, j ≥ i)
        FOR all objects O_m in C_i
            FOR all objects O_n in C_j
                IF Den_q(O_m) ≥ DT AND Den_q(O_n) ≥ DT AND dist(O_m, O_n) ≤ r
                    cluster(O_m) = cluster(O_m) ∪ cluster(O_n);
                    cluster(O_n) = cluster(O_m);
                ENDIF
            ENDFOR
        ENDFOR
    ENDFOR
ENDFOR
Fig. 8. Pseudo-code of Clustering().
[Fig. 9 panels — (a) Results of AGRID: cluster 1: 1968, cluster 2: 2158, cluster 3: 2163, cluster 4: 2019; (b) Results of AGRID+: cluster 1: 325, cluster 2: 221, cluster 3: 211, cluster 4: 304.]
Fig. 9. Experimental results of AGRID and AGRID+. (a) The four clusters discovered by AGRID; the number under each sub-figure gives the size of the cluster. (b) The additional objects in each cluster found by AGRID+; the number under each sub-figure gives the count of additional objects in the cluster.
Table 2
Four algorithms and their techniques.

Algorithms    Grid    ith-order neighbors    Density compensation
NAIVE         –       –                      –
AGRID         √       –                      –
IORDER        √       √                      –
AGRID+        √       √                      √
The results show that NAIVE has the highest accuracy but also the longest running time (around 10 times as long as the other three), because it does not use any grid to reduce computation and the distance between every pair of objects has to be calculated. The other three algorithms are about 10 times faster than NAIVE, so the grid is very effective at speeding up density computation, at the cost of some accuracy. IORDER is more accurate than AGRID, which demonstrates the effectiveness of ith-order neighbors. The higher accuracy of AGRID+ over IORDER shows that density compensation enhances clustering quality, at the cost of a marginally longer running time. The table clearly demonstrates the effectiveness of the grid in improving speed, and of ith-order neighbors and density compensation in improving clustering quality.
While the clusters in the above dataset can be easily discovered by both algorithms, another dataset is used to demonstrate the superiority of AGRID+ over AGRID in subspace clustering. It is a 15D dataset of 20,000 points, and the clusters exist in 11D subspaces. As shown in Fig. 10, the first cluster exists in the first 11 dimensions, while the attribute values in the last 4 dimensions are uniformly distributed. For the second cluster, the attribute values in dimensions 3–6 are uniformly distributed, i.e., the second cluster exists in the subspace composed of dimensions 1, 2 and 7–15.
Table 3
Comparison of accuracy.

Algorithms    Cluster 1    Cluster 2    Cluster 3    Cluster 4    Accuracy    Time (s)
NAIVE         2368         2396         2374         2366         95.0%       44.21
AGRID         1968         2158         2163         2019         83.1%       3.62
IORDER        1990         2216         2240         2081         85.3%       4.54
AGRID+        2293         2379         2374         2323         93.7%       4.60
For the other three clusters, the uniformly distributed dimensions are 7–10, 8–11, and 6–9, respectively. The last sub-figure shows the noise, which accounts for 10% of the data. Our experiment shows that AGRID+ can discover the five clusters correctly with an accuracy of 91%. In contrast, AGRID cannot find the five clusters correctly even after fine-tuning the parameters r and DT.
4.3. Scalability
The performance of AGRID+ and AGRID is shown in Fig. 11, where the solid lines represent AGRID+ and the dashed lines represent AGRID. Ten experiments were conducted for each method and the average results are given in the figure. In Fig. 11(a), the dimensionality of every dataset is 20 and the sizes range from 10,000 to 100,000. In Fig. 11(b), the size is 100,000 and the dimensionalities range from 3 to 100. In each dataset, 10% of the objects are noise. From the figure, it is clear that the running time of AGRID+ is nearly linear in both the size and the dimensionality of the datasets, and is a little longer than that of AGRID.
In the above experiments, we set q = 1 when applying Eq. (5). To test the effect of different values of q, another experiment was conducted on a dataset of 100,000 objects and 15 dimensions. The running time and the accuracy of the clusters discovered with different q are shown in Fig. 12(a) and (b), respectively. When q is zero, the algorithm is fastest, but the accuracy is very low. As q increases, more neighbors are taken into consideration and the accuracy goes up dramatically, but the running time becomes longer. When q is larger than three, there is no significant increase in accuracy in the experiment. From the figure, it is reasonable to set q to 2 or 3 to achieve both high speed and high accuracy. Users can set the value of q according to the computing resources available and the accuracy requirements of their applications.
4.4. Multi-resolution clustering
Multi-resolution clustering can be achieved by using different values of the density threshold DT in Eq. (14), which helps to detect clusters at different levels, as shown in Fig. 13. Although the multi-resolution property of our technique is somewhat like that of WaveCluster, the two are quite different: AGRID+ achieves it by adjusting the value of the density threshold, whereas WaveCluster does so by "increasing the size of a cell's neighborhood".
Fig. 10. Experimental results of AGRID+. The first 5 sub-figures are clusters discovered by AGRID+ in a dataset of 11D subspace clusters, while the last sub-figure shows noise.
The clustering of a 2D dataset of 2000 objects is used to demonstrate the effectiveness of multi-resolution clustering. Fig. 13(a) shows the original data before clustering, and the other sub-figures are the clustering results with different DTs. The values of the density threshold are 5, 10, 20, 30, 35, 40 and 50 in Fig. 13(b)–(h), respectively. In Fig. 13(b), DT is set to 5 and three clusters are found. The two groups of objects at the top-right are in one cluster, because they are connected by some objects between them. When DT increases to 10, they are split into two clusters, as shown in Fig. 13(c). The bottom cloud of objects is classified into two clusters when DT is 20, as in Fig. 13(d). As DT increases further, all clusters shrink, resulting in the splitting or disappearance of some clusters, as shown in Fig. 13(e)–(h). When DT is set to 50, only three clusters are found, composed of objects in very densely populated areas (see Fig. 13(h)).
[Fig. 11 panels — (a) scalability with N: Size (×1000) vs. Time (s); (b) scalability with d: Dimensionality vs. Time (s); each panel shows AGRID+ and AGRID.]
Fig. 11. Scalability with the size and dimensionality of datasets.
[Fig. 12 panels — (a) Running Time: Order of Neighbors vs. Time (s); (b) Accuracy: Order of Neighbors (q) vs. Percentage (%), showing Contribution and Accuracy.]
Fig. 12. Experimental results of running time and accuracy with various values of q, the order of neighbors.
Generally speaking, the greater the density threshold is, the smaller the clusters are, and the more objects are treated as outliers. When DT is very small, the number of clusters is small and only a few objects are treated as outliers. As DT increases, some clusters break into more, smaller clusters. A hierarchical clustering tree can be built by selecting a series of DT values, and the appropriate resolution level for choosing clusters can be decided according to the needs of users; a minimal sketch of this multi-resolution procedure is given after Fig. 13.
[Fig. 13 panels — (a) data; (b) DT=5, 3 clusters; (c) DT=10, 4 clusters; (d) DT=20, 5 clusters; (e) DT=30, 5 clusters; (f) DT=35, 5 clusters; (g) DT=40, 4 clusters; (h) DT=50, 3 clusters.]
Fig. 13. Multi-resolution clustering.
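A minimal sketch of the multi-resolution procedure described above, assuming a hypothetical function agrid_plus(data, dt) that returns the list of clusters found for a given density threshold; the function name and interface are assumptions for illustration only.

def multi_resolution(data, dt_values, agrid_plus):
    """Run the clustering once per density threshold and keep every level.

    `agrid_plus` is an assumed callable (data, dt) -> list of clusters;
    `dt_values` is an increasing series such as [5, 10, 20, 30, 35, 40, 50].
    """
    levels = {}
    for dt in sorted(dt_values):
        levels[dt] = agrid_plus(data, dt)   # larger DT -> smaller, denser clusters
    return levels

# The user then inspects `levels` and picks the resolution (DT) whose clusters
# best match the granularity required by the application.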
Fig. 14. Control chart time series data.
4.5. Comparison with Random Projection on public data
In addition to the experiments with the above synthetic datasets, experiments were conducted with the control chart time series dataset from the UCI KDD Archive², and a comparison was made with Random Projection (Fern and Brodley, 2003), an algorithm for subspace clustering. The dataset has 60 dimensions and 600 records. There are six clusters: normal, cyclic, increasing trend, decreasing trend, upward shift and downward shift (see Fig. 14). To make the six clusters easy to see, only a few time series are shown for each cluster in the figure.
The clustering given in the UCI KDD Archive is used as the standard result, and Conditional Entropy (CE) and Normalized Mutual Information (NMI) are employed to measure the quality of clustering. Compactness (Zait and Messatfa, 1997) is also widely used to measure clustering quality, but it favours sphere-shaped clusters since it is based on the diameter. CE and NMI have been used to measure the quality of clustering by Strehl and Ghosh (2002), Fern and Brodley (2003) and Pfitzner et al. (2009), and similar entropy-based measures have also been used by Hu and Sung (2006). Conditional Entropy measures the uncertainty of the class labels given a clustering solution. For one clustering with m clusters and a second clustering with k clusters, the Conditional Entropy is defined as $CE = \sum_{j=1}^{k} \frac{n_j E_j}{N}$, where the entropy $E_j = -\sum_{i=1}^{m} p_{ij} \log(p_{ij})$, $n_j$ is the size of cluster $j$ in the second clustering, $p_{ij}$ is the probability that a member of cluster $i$ in the first clustering belongs to cluster $j$ in the second clustering, $p_i$ is the probability of cluster $i$, $p_j$ is the probability of cluster $j$, and $N$ is the size of the dataset. The value of CE is a non-negative real number. The smaller CE is, the closer the tested result is to the standard result; the two results are identical when CE is zero.

For

two

clustering

solutions

C
1
and

C
2
,
the

normalized

mutual

information

is

defined

as

NMI

=
MI

H(C
1
)H(C
2
)
,
where

mutual

information

MI

=
￿
i,j
p
ij
log
￿
p
ij
p
i
p
j
￿
,

and

H(C
1
)

and
H(C
2
)

denote

the

entropy

of

C
1
and

C
2
,

respectively.

The

value

of
NMI

lies

in

[0,1].

Contrary

to

CE,

the

larger

the

value

of

NMI

is,

the
2
http://kdd.ics.uci.edu/.
better

is

the

clustering.

If

NMI

is

one,

then

the

two

clusterings

are
the

same

as

each

other.

In

all,

we

would

like

to

minimize

CE

and
maximize

NMI.
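Both measures can be computed from the contingency counts of the two labelings. The sketch below is our own illustration of the definitions above; it uses natural logarithms, the conditional form p(i | j) inside the per-cluster entropy, and the square-root normalization for NMI.

import numpy as np

def ce_and_nmi(labels_true, labels_pred):
    """Conditional Entropy and Normalized Mutual Information of two labelings."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    N = len(labels_true)
    classes = np.unique(labels_true)
    clusters = np.unique(labels_pred)

    # n[i, j]: number of objects of class i placed in cluster j
    n = np.array([[np.sum((labels_true == c) & (labels_pred == k))
                   for k in clusters] for c in classes], dtype=float)
    p_ij = n / N
    p_i = p_ij.sum(axis=1)                # class probabilities
    p_j = p_ij.sum(axis=0)                # cluster probabilities

    # CE = sum_j (n_j / N) * E_j, with E_j = -sum_i p(i|j) log p(i|j)
    ce = 0.0
    for j in range(len(clusters)):
        p_cond = p_ij[:, j] / p_j[j]
        p_cond = p_cond[p_cond > 0]
        ce += p_j[j] * (-(p_cond * np.log(p_cond)).sum())

    # NMI = MI / sqrt(H(C1) * H(C2))
    mask = p_ij > 0
    mi = (p_ij[mask] * np.log(p_ij[mask] / np.outer(p_i, p_j)[mask])).sum()
    h_true = -(p_i[p_i > 0] * np.log(p_i[p_i > 0])).sum()
    h_pred = -(p_j[p_j > 0] * np.log(p_j[p_j > 0])).sum()
    return ce, mi / np.sqrt(h_true * h_pred)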
Since the dimensionality is high and the records are relatively few, subspace clustering is employed to find the clusters. The parameters were selected by studying the data and by fine-tuning. Generally speaking, the greater the average dimensionality of subspace clusters and the density threshold are, the stricter the condition for merging two objects or clusters into one cluster is, and the result tends to consist of many smaller clusters. On the contrary, the smaller the two parameters are, the looser the merging condition is, and the result tends to consist of a few bigger clusters. From Fig. 14, we can see that, for the time series in a cluster, the values in most dimensions are close, while there are big differences in around 10–20 dimensions. Therefore, it is reasonable to set the average dimensionality of subspace clusters to 40–50. In our experiments on the above data, we found that, with an average dimensionality of less than 40, the algorithm often merges increasing trend and upward shift into one cluster and decreasing trend and downward shift into another. The best result was achieved by setting the average dimensionality to 45 and the density threshold to 8. The results of clustering with our algorithm and with Random Projection are given in Table 4. From the table, we can see that the clustering of our algorithm has the lowest CE and the highest NMI, which shows that our algorithm performs better than Random Projection. The superiority of AGRID+ over IORDER also shows the effectiveness of density compensation.
Generally speaking, the average dimensionality of subspaces can be set based on domain knowledge in specific applications. However, if a user has no idea how to set an appropriate value, he may run the algorithm multiple times with various values of the parameter and then choose the best clustering with the help of internal validation measures or relative validation measures, such as Compactness, the Silhouette index, Figure of merit and Stability (Brun et al., 2007; Halkidi et al., 2001); an illustrative selection loop is sketched after Table 4.

Table 4
AGRID+ vs Random Projection.

        AGRID+    IORDER    Random Projection
CE      0.466     0.517     0.706
NMI     0.845     0.822     0.790
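As an illustration of such a selection loop (not part of the paper's method), the sketch below scores candidate parameter settings with the Silhouette index from scikit-learn and keeps the best one. The callable agrid_plus, its interface (one integer label per object, with -1 for noise) and the candidate grid are assumptions.

import numpy as np
from sklearn.metrics import silhouette_score

def pick_parameters(data, candidates, agrid_plus):
    """Try each candidate setting and keep the clustering with the best
    Silhouette index. `agrid_plus` is an assumed callable that returns a
    NumPy array with one label per object (-1 for noise)."""
    best = None
    for params in candidates:                   # e.g. dicts of DT, avg. dimensionality
        labels = agrid_plus(data, **params)
        mask = labels != -1                     # score the clustered objects only
        if len(np.unique(labels[mask])) < 2:
            continue                            # silhouette needs at least 2 clusters
        score = silhouette_score(data[mask], labels[mask])
        if best is None or score > best[0]:
            best = (score, params, labels)
    return best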
5. Discussions
To reduce the cost of computation, some assumptions are made in this paper. One assumption is made to reduce the computation cost of $V_c$ in Eq. (1) by approximating it in a simple way. For an object in a d-dimensional space, when all neighboring cells with order no more than q are considered, we need to calculate the volume of the overlapped space for every such neighboring cell. That is, the number of volume calculations would be $1 + \binom{d}{1} + \binom{d}{2} + \cdots + \binom{d}{q}$. This computation is costly, especially when d and q are large. To simplify the calculation, we use the $V_c$ for the object $(a, a, \ldots, a)$ to approximate the $V_c$ for the object $(a_1, a_2, \ldots, a_d)$, where $a$ is the mean of $a_1, a_2, \ldots, a_d$. With this approximation, we need to calculate only (d + 1) overlapped spaces (i.e., one for each order of neighbors). Although the volume for a specific neighbor may be over- or under-estimated, the overall volume $V_c$, which is the sum of the overlapped spaces with all neighbors of order no more than q, is well approximated. The effectiveness of the approximation is shown in Table 1, where the accuracy is improved from 87.7% to 95.9% with density compensation. It is also validated by the improvement in accuracy from 85.3% (IORDER) to 93.7% (AGRID+) shown in Table 3. Moreover, the above assumption is used only to calculate the volume for density compensation; the calculation of $Cnt_q(\alpha)$ in Eqs. (1) and (2) is not affected by it.
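For example, assuming (as above) one volume calculation per neighboring cell of each order up to q, taking $d = 15$ and $q = 3$ gives $1 + \binom{15}{1} + \binom{15}{2} + \binom{15}{3} = 1 + 15 + 105 + 455 = 576$ volume calculations for the exact computation, compared with only $d + 1 = 16$ under the approximation.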
Regarding space partitioning for producing the grid and cells, some other techniques, such as the adaptive grid (Nagesh et al., 1999) and the optimal grid (Hinneburg and Keim, 1999), partition the data space by considering the data distribution in every dimension. However, they do not fit our algorithm, for the following reasons. Firstly, equal-sized cells are preferred in our algorithm to cover a neighborhood with cells, since the density is defined based on the neighborhood, whereas distribution-based partitioning often produces cells whose sizes vary greatly. Secondly, the number of cells needs to be smoothly adjustable, so that it is easier to choose an appropriate value for DT or to fine-tune DT. These two features are important to our algorithm, but distribution-based partitioning provides neither. Although our proposed partitioning method looks simple, it addresses the above two issues well, and its effectiveness is shown in our experiments.
6. Conclusions
In this paper, we have presented a novel and efficient grid-density based clustering approach with four novel technical features. The first is that it takes objects (or points) as atomic units, so that the size requirement on cells is waived without losing clustering accuracy. The second is the concept of ith-order neighbors, with which the neighboring cells are organized into a number of groups to lower the computational complexity and to meet different accuracy requirements. The third is the idea of density compensation, which improves the accuracy of densities and of clustering. Last but not least, the measure of minimal subspace distance is used to help AGRID+ discover clusters in subspaces. We have experimentally evaluated our approach and demonstrated that our algorithm significantly reduces computation cost and improves clustering quality. In fact, besides AGRID+, our measure of minimal subspace distance can also help other algorithms find clusters in subspaces, which will be explored in our future work. Two other directions for future work are: (1) finding an optimal order of the dimensions, based on the distribution of data in every single dimension, before partitioning; and (2) using internal indices to obtain optimal parameter settings.
Acknowledgements
This research was done when Yanchang Zhao was an Australian Postdoctoral Fellow (Industry) at the Faculty of Engineering & IT, University of Technology, Sydney, Australia.

This work is supported in part by the Australian Research Council (ARC) under large grant DP0985456, the China "1000-Plan" Distinguished Professorship, the Jiangsu Provincial Key Laboratory of E-business at the Nanjing University of Finance and Economics, and the Guangxi NSF (Key) grants.
References
Aggarwal, C.C., Hinneburg, A., Keim, D.A., 2001. On the surprising behavior of distance metrics in high dimensional space. In: Proc. of the 8th International Conference on Database Theory.
Agrawal, R., Gehrke, J., et al., 1998. Automatic subspace clustering of high dimensional data for data mining applications. In: Proc. of the 1998 ACM-SIGMOD International Conference on Management of Data (SIGMOD'98), Seattle, WA, June, pp. 94–105.
Alsabti, K., Ranka, S., Singh, V., 1998. An efficient K-means clustering algorithm. In: Proc. of the First Workshop on High Performance Data Mining, Orlando, FL.
Ankerst, M., Breunig, M., et al., 1999. OPTICS: ordering points to identify the clustering structure. In: Proc. of the 1999 ACM-SIGMOD International Conference on Management of Data (SIGMOD'99), Philadelphia, PA, June, pp. 49–60.
Assent, I., Krieger, R., Glavic, B., Seidl, T., 2008. Clustering multidimensional sequences in spatial and temporal databases. Knowledge and Information Systems 16 (July (1)), 29–51.
Berkhin, P., 2002. Survey of Clustering Data Mining Techniques. Technical Report, Accrue Software.
Brun, M., Sima, C., Hua, J., Lowey, J., Carroll, B., Suh, E., Dougherty, E.R., 2007. Model-based evaluation of clustering validation measures. Pattern Recognition, vol. 40. Elsevier Science Inc, pp. 807–824.
Ester, M., Kriegel, H.-P., et al., 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. of the 1996 International Conference on Knowledge Discovery and Data Mining (KDD'96), Portland, Oregon, August, pp. 226–231.
Fern, X.Z., Brodley, E., 2003. Random projection for high dimensional data clustering: a clustering ensemble approach. In: Proc. of the 20th International Conference on Machine Learning (ICML'03), Washington, DC.
Grabmeier, J., Rudolph, A., 2002. Techniques of cluster algorithms in data mining. Data Mining and Knowledge Discovery 6, 303–360.
Guha, S., Rastogi, R., Shim, K., 1998. Cure: an efficient clustering algorithm for large databases. In: Proc. of the 1998 ACM-SIGMOD International Conference on Management of Data (SIGMOD'98), Seattle, WA, June, pp. 73–84.
Guha, S., Rastogi, R., Shim, K., 1999. Rock: a robust clustering algorithm for categorical attributes. In: Proc. of the 1999 International Conference on Data Engineering (ICDE'99), Sydney, Australia, March, pp. 512–521.
Halkidi, M., Batistakis, Y., Vazirgiannis, M., 2001. On clustering validation techniques. Journal of Intelligent Information Systems 17, 107–145.
Han, J., Kamber, M., 2001. Data Mining: Concepts and Techniques. Higher Education Press, Morgan Kaufmann Publishers.
Hinneburg, A., Keim, D.A., 1998. An efficient approach to clustering in large multimedia databases with noise. In: Proc. of the 1998 International Conference on Knowledge Discovery and Data Mining (KDD'98), New York, August, pp. 58–65.
Hinneburg, A., Keim, D.A., 1999. Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering. In: Proc. of the 25th International Conference on Very Large Data Bases, Edinburgh, Scotland.
Hu, T., Sung, S.Y., 2006. Finding centroid clusterings with entropy-based criteria. Knowledge and Information Systems 10 (November (4)), 505–514.
Huang, Z., 1998. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery 2, 283–304.
Jain, A.K., Murty, M.N., Flynn, P.J., 1999. Data clustering: a review. ACM Computing Surveys 31 (September (3)).
Karypis, G., Han, E.-H., Kumar, V., 1999. CHAMELEON: a hierarchical clustering algorithm using dynamic modelling. IEEE Computer, Special Issue on Data Analysis and Mining 32 (August (8)), 68–75.
Kolatch, E., 2001. Clustering Algorithms for Spatial Databases: A Survey. Dept. of Computer Science, University of Maryland, College Park.
Hinneburg, A., Aggarwal, C.C., Keim, D.A., 2000. What is the nearest neighbor in high dimensional spaces? In: Proc. of the 26th International Conference on Very Large Data Bases, Cairo, Egypt, pp. 506–515.
Moise, G., Sander, J., Ester, M., 2008. Robust projected clustering. Knowledge and Information Systems 14 (March (3)), 273–298.
Nagesh, H., Goil, S., Choudhary, A., 1999. MAFIA: efficient and scalable subspace clustering for very large data sets. Technical Report 9906-010, Northwestern University, June.
Procopiuc, M., Jones, M., Agarwal, P., Murali, T.M., 2002. A Monte-Carlo algorithm for fast projective clustering. In: Proc. of the 2002 International Conference on Management of Data.
Ng, R., Han, J., 1994. Efficient and effective clustering method for spatial data mining. In: Proc. of the 1994 International Conference on Very Large Data Bases (VLDB'94), Santiago, Chile, September, pp. 144–155.
Pfitzner, D., Leibbrandt, R., Powers, D., 2009. Characterization and evaluation of similarity measures for pairs of clusterings. Knowledge and Information Systems 19 (June (3)), 361–394.
Sheikholeslami, G., Chatterjee, S., Zhang, A., 1998. WaveCluster: a multi-resolution clustering approach for very large spatial databases. In: Proc. of the 1998 International Conference on Very Large Data Bases (VLDB'98), New York, August, pp. 428–429.
Strehl, A., Ghosh, J., 2002. Cluster ensembles—a knowledge reuse framework for combining multiple partitions. Machine Learning Research 3, 583–617.
Wang, W., Yang, J., Muntz, R., 1997. STING: a statistical information grid approach to spatial data mining. In: Proc. of the 1997 International Conference on Very Large Data Bases (VLDB'97), Athens, Greece, August, pp. 186–195.
Zait, M., Messatfa, H., 1997. A comparative study of clustering methods. Future Generation Computer Systems 13, 149–159.
Zhang, T., Ramakrishnan, R., Livny, M., 1996. BIRCH: an efficient data clustering method for very large databases. In: Proc. of the 1996 ACM-SIGMOD International Conference on Management of Data (SIGMOD'96), Montreal, Canada, June, pp. 103–114.
Zhao, Y., Song, J., 2003. AGRID: an efficient algorithm for clustering large high-dimensional datasets. In: Proc. of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'03), Seoul, Korea, April, pp. 271–282.
Yanchang Zhao is a Senior Data Mining Specialist at Centrelink, Australia. He was an Australian Postdoctoral Fellow (Industry) at the Data Sciences and Knowledge Discovery Research Lab, Centre for Quantum Computation and Intelligent Systems, University of Technology, Sydney, Australia, from 2007 to 2009. His research interests are clustering, sequential patterns, time series, association rules and their applications. He is a member of the IEEE.

Jie Cao is a Professor and the Chair of the Jiangsu Provincial Key Laboratory of E-business at the Nanjing University of Finance and Economics. He is a winner of the Program for New Century Excellent Talents in Universities (NCET). He received his PhD degree from Southeast University, China, in 2002. His main research interests include cloud computing, business intelligence and data mining. Dr. Cao has published one book and more than 40 refereed papers in various journals and conferences.

Chengqi Zhang has been a Professor of Information Technology at the University of Technology, Sydney (UTS) since December 2001. He has been the Director of the UTS Priority Investment Research Centre for Quantum Computation and Intelligent Systems since April 2008, and Chairman of the Australian Computer Society National Committee for Artificial Intelligence since November 2005. Prof. Zhang obtained his PhD degree from the University of Queensland in 1991, followed by a Doctor of Science (DSc – Higher Doctorate) from Deakin University in 2002. Prof. Zhang's research interests mainly focus on data mining and its applications. He has published more than 200 research papers, including several in first-class international journals such as Artificial Intelligence and IEEE and ACM Transactions. He has published six monographs and edited 16 books, and has delivered 12 keynote/invited speeches at international conferences over the last six years. He has attracted seven Australian Research Council grants. He is a Fellow of the Australian Computer Society (ACS) and a Senior Member of the IEEE Computer Society. He served as an Associate Editor for three international journals, including IEEE Transactions on Knowledge and Data Engineering from 2005 to 2008, and served as General Chair, PC Chair, or Organising Chair for five international conferences, including ICDM and WI/IAT. His personal web page can be found at: http://www-staff.it.uts.edu.au/∼chengqi/.

Shichao Zhang is a China "1000-Plan" Distinguished Professor and the Dean of the College of Computer Science and Information Technology at Guangxi Normal University, Guilin, China. He received his PhD degree in Applied Mathematics from the China Academy of Atomic Energy in 1997. His research interests include information quality and multi-source data mining. He has published 10 solely-authored international journal papers, about 50 international journal papers and over 60 international conference papers, and has been a CI on 10 national-level projects in China and Australia. He serves or has served as an associate editor for IEEE Transactions on Knowledge and Data Engineering, Knowledge and Information Systems, and the IEEE Intelligent Informatics Bulletin, and served as a (vice-)PC Chair for five international conferences.