The Journal of Systems and Software 84 (2011) 1524–1539
Enhancing grid-density based clustering for high dimensional data

Yanchang Zhao a, Jie Cao b,∗, Chengqi Zhang c, Shichao Zhang d,∗
a Centrelink, Australia
b Jiangsu Provincial Key Laboratory of E-business, Nanjing University of Finance and Economics, Nanjing, 210003, P.R. China
c Centre for Quantum Computation and Intelligent Systems, Faculty of Engineering and Information Technology, University of Technology, Sydney, Australia
d College of CS & IT, Guangxi Normal University, Guilin, China
ARTICLE INFO

Article history:
Received 26 July 2010
Received in revised form 9 February 2011
Accepted 25 February 2011
Available online 8 March 2011

Keywords:
Clustering
Subspace clustering
High dimensional data
ABSTRACT

We propose an enhanced grid-density based approach for clustering high dimensional data. Our technique takes objects (or points) as atomic units, so that the size requirement on cells is waived without losing clustering accuracy. For efficiency, a new partitioning is developed to make the number of cells smoothly adjustable; a concept of the ith-order neighbors is defined to avoid considering the exponential number of neighboring cells; and a novel density compensation is proposed for improving the clustering accuracy and quality. We experimentally evaluate our approach and demonstrate that our algorithm significantly improves the clustering accuracy and quality.
© 2011 Elsevier Inc. All rights reserved.
1. Introduction

Clustering, one of the main techniques in data mining, aims to find "natural" groups in datasets. Not only can it be used stand-alone in database segmentation and data compression, it can also be employed in the preprocessing procedures of other data mining techniques, such as classification, association rules, and so on.
Density-based clustering (Ankerst et al., 1999; Ester et al., 1996; Hinneburg and Keim, 1998) and grid-based clustering (Sheikholeslami et al., 1998; Wang et al., 1997) are two well-known clustering approaches. The former is famous for its capabilities of discovering clusters of various shapes, effectively eliminating outliers and being insensitive to the order of inputs, whereas the latter is well known for its high speed. However, neither approach is scalable to high dimensionality. For density-based ones, the reason is that the index structures, such as the R*-tree, are not scalable to high dimensional spaces. For grid-based approaches, the reason is that both the number of cells and the count of neighboring cells grow exponentially with the dimensionality of data. Grid-based algorithms take cells as atomic units which are inseparable, and thus the interval partitioned in each dimension must be small enough to ensure the resolution of clustering. Therefore, the number of cells will increase exponentially with dimensionality. Some researchers try to break the curse of dimensionality by using the adaptive grid (Nagesh et al., 1999), the optimal grid (Hinneburg and Keim, 1999), or in an a priori-like way (Agrawal et al., 1998).

∗ Corresponding authors.
E-mail addresses: yanchang.zhao@centrelink.gov.au (Y. Zhao), caojie690929@163.com (J. Cao), chengqi@it.uts.edu.au (C. Zhang), zhangsc@gxnu.edu.cn (S. Zhang).
Previously, we developed an algorithm called AGRID (Advanced GRid-based Iso-Density line clustering), which combines density-based and grid-based approaches to cluster large high-dimensional data (Zhao and Song, 2003). Based on the idea of density-based clustering, it employs a grid to reduce the complexity of distance computation and can discover clusters of arbitrary shapes efficiently. However, in order to reduce the complexity of density computation, only (2d + 1) out of all 3^d neighbors are considered for each cell when computing the densities of objects in it. When the dimensionality is high, most neighboring cells are ignored and the accuracy becomes very poor.
In this paper, we present an enhanced grid-density based algorithm for clustering high dimensional data, referred to as AGRID+, which substantially improves the accuracy of density computation and clustering. AGRID+ has four main distinct technical features. The first is that objects (or points), instead of cells, are taken as the atomic units. In this way, it is no longer necessary to set the intervals very small, so the number of cells does not grow dramatically with the dimensionality of data. The second feature is the concept of ith-order neighbors, with which the neighboring cells are organized into a couple of groups to improve efficiency and meet different requirements of accuracy. As a result, we obtain a trade-off between accuracy and speed in AGRID+. The third is the technique of density compensation, which improves the accuracy greatly. Last but not least, a new distance measure, the minimal subspace distance, is designed for subspace clustering.
The rest of the paper is organized as follows. In Section 2, we present the related work and some necessary concepts. The AGRID+ clustering is designed in Section 3, in which an idea to adapt our algorithm for subspace clustering is also given. Section 4 shows the results of experiments on both synthetic and public datasets. Some discussions are given in Section 5. Conclusions are made in Section 6.

0164-1212/$ – see front matter © 2011 Elsevier Inc. All rights reserved. doi:10.1016/j.jss.2011.02.047
2. Related work

Most clustering algorithms fall into four categories: partitioning clustering, hierarchical clustering, density-based clustering and grid-based clustering. The idea of partitioning clustering is to partition the dataset into k clusters, each represented by the centroid of the cluster (k-Means) or by one representative object of the cluster (k-Medoids). It uses an iterative relocation technique that improves the partitioning by moving objects from one group to another. Well-known partitioning algorithms are k-Means (Alsabti et al., 1998), k-Medoids (Huang, 1998) and CLARANS (Ng and Han, 1994).
Hierarchical clustering creates a hierarchical decomposition of the dataset in a bottom-up approach (agglomerative) or a top-down approach (divisive). A major problem of hierarchical methods is that they cannot correct erroneous decisions. Famous hierarchical algorithms are AGNES, DIANA, BIRCH (Zhang et al., 1996), CURE (Guha et al., 1998), ROCK (Guha et al., 1999) and Chameleon (Karypis et al., 1999).
The general idea of density-based clustering is to continue growing the given cluster as long as the density (i.e., the number of objects) in the neighborhood exceeds some threshold. Such a method can be used to filter out noise and discover clusters of arbitrary shapes. The density of an object is defined as the number of objects in its neighborhood. Therefore, the density of each object has to be computed first. A naive way is to calculate the distance between each pair of objects and count the number of objects in the neighborhood of each object as its density, which is not scalable with the size of datasets, since the computational complexity is O(N²), where N is the number of objects. Typical density-based methods are DBSCAN (Ester et al., 1996), OPTICS (Ankerst et al., 1999) and DENCLUE (Hinneburg and Keim, 1998).
Grid-based algorithms quantize the data space into a finite number of cells that form a grid structure, and all of the clustering operations are performed on the grid structure. The main advantage of this approach is its fast processing time. However, it does not work effectively and efficiently in high-dimensional space due to the so-called "curse of dimensionality". Well-known grid-based approaches for clustering include STING (Wang et al., 1997), WaveCluster (Sheikholeslami et al., 1998), OptiGrid (Hinneburg and Keim, 1999), CLIQUE (Agrawal et al., 1998) and MAFIA (Nagesh et al., 1999), and they are sometimes called density-grid based approaches (Han and Kamber, 2001; Kolatch, 2001).
STING is a grid-based multi-resolution clustering technique in which the spatial area is divided into rectangular cells and organized into a statistical information cell hierarchy (Wang et al., 1997). Thus, the statistical information associated with spatial cells is captured, and queries and clustering problems can be answered without recourse to the individual objects. The hierarchical structure of grid cells and the statistical information associated with them make STING very fast. STING assumes that K, the number of cells at the bottom layer of the hierarchy, is much less than the number of objects, and the overall computational complexity is O(K). However, K can be much greater than N in high-dimensional data.
Sheikholeslami et al. (1998) proposed a technique named WaveCluster to look at the multidimensional data space from a signal processing perspective. The objects are taken as a d-dimensional signal, so the high frequency parts of the signal correspond to the boundaries of clusters, while the low frequency parts which have high amplitude correspond to the areas of the data space where data are concentrated. It first partitions the data space into cells, then applies a wavelet transform on the quantized feature space and detects the dense regions in the transformed space. With the multi-resolution property of wavelet transform, it can detect clusters at different scales and levels of detail. The time complexity of WaveCluster is O(dN log N).
The basic idea of OptiGrid is to use contracting projections of the data to determine the optimal cutting hyperplanes for partitioning the data (Hinneburg and Keim, 1999). The data space is partitioned with arbitrary (non-equidistant, irregular) grids based on the distribution of data, which avoids the effectiveness problems of the existing grid-based approaches and guarantees that all clusters are found by the algorithm, while still retaining the efficiency of a grid-based approach. The time complexity of OptiGrid is between O(dN) and O(dN log N).
CLIQUE (Agrawal et al., 1998), MAFIA (Nagesh et al., 1999) and Random Projection (Fern and Brodley, 2003) are three algorithms for discovering clusters in subspaces. CLIQUE discovers clusters in subspaces in a way similar to the Apriori algorithm. It partitions each dimension into intervals and computes the dense units in all dimensions. Then these dense units are combined to generate the dense units in higher dimensions.
MAFIA is an efficient algorithm for subspace clustering using a density- and grid-based approach (Nagesh et al., 1999). It uses adaptive grids to partition a dimension depending on the distribution of data in the dimension. The bins and cells that have a low density of data are pruned to reduce the computation. The boundaries of the bins are not rigid, which improves the quality of clustering.
Fern and Brodley proposed Random Projection to find the subspaces of clusters in a random projection and ensemble way (Fern and Brodley, 2003). The dataset is first projected into random subspaces, and then the EM algorithm is used to discover clusters in the projected dataset. The algorithm generates several groups of clusters with the above method and then combines them into a similarity matrix, from which the final clusters are discovered with an agglomerative clustering algorithm.
Moise et al. (2008) proposed P3C, a robust algorithm for projected clustering. Based on the computation of so-called cluster cores, it can effectively discover projected clusters in the data while minimizing the number of required parameters. Moreover, it can work on both numerical and categorical datasets.
Assent et al. (2008) proposed an algorithm capable of finding parallel clusters in different subspaces in spatial and temporal databases. Although they also use the notions of neighborhood and density, their target problem is clustering sequence data instead of the generic data considered in this paper.
Previously, we proposed AGRID, a grid-density based algorithm for clustering (Zhao and Song, 2003). It has the advantages of both density-based clustering and grid-based clustering, and is effective and efficient for clustering large high-dimensional data. However, it is not accurate enough, because only the 2d immediate neighbors are taken into consideration. Moreover, it is incapable of discovering clusters in subspaces.
With the AGRID algorithm, firstly, each dimension is divided into multiple intervals and the data space is thus partitioned into many hyper-rectangular cells. Objects are assigned to cells according to their attribute values. Secondly, for an object α in a cell, we only compute the distances between it and the objects in its neighboring cells, and use the count of those objects which are close to α as its density. Objects that are not in the neighboring cells are far away from α, and therefore do not contribute to the density of α. Thirdly, each object is taken as a cluster, and every pair of objects which are in the neighborhood of each other is checked to see whether they are close enough to merge into one cluster. If so, the two clusters to which the two objects respectively belong are merged into a single cluster. All eligible pairs of clusters meeting the above requirement are merged to generate larger clusters, and the clustering finishes when all such object pairs have been checked.

Fig. 1. Two definitions of neighbors: (a) 3^d neighbors; (b) (2d + 1) immediate neighbors. The grey cell labelled "(i,j)" in the center is C_α, and the grey cells around it are its neighbors.
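The three steps above can be sketched as follows. This is a minimal illustration under our own naming, not the authors' implementation: it assumes equal-width intervals of length L, the L∞ metric, and density counting restricted to the 3^d cells around each point's cell.

```python
from itertools import product

def cell_of(point, L):
    """Map a point to its cell ID: the interval index in each dimension."""
    return tuple(int(x // L) for x in point)

def dist_inf(x, y):
    """L-infinity distance between two points."""
    return max(abs(a - b) for a, b in zip(x, y))

def densities(points, L, r):
    """Density of each point = count of points within radius r (L-inf),
    looking only at the 3^d cells around the point's own cell."""
    grid = {}
    for idx, p in enumerate(points):
        grid.setdefault(cell_of(p, L), []).append(idx)
    d = len(points[0])
    offsets = list(product((-1, 0, 1), repeat=d))
    dens = []
    for p in points:
        c = cell_of(p, L)
        count = 0
        for off in offsets:
            for j in grid.get(tuple(ci + oi for ci, oi in zip(c, off)), []):
                if dist_inf(p, points[j]) <= r:
                    count += 1
        dens.append(count)  # includes the point itself
    return dens
```

When r ≤ L, every point within distance r of α lies in one of the 3^d neighbors of C_α, so this restricted count matches a brute-force scan over all points.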
With the idea of grid and neighbors, the 3^d neighboring cells (see Fig. 1(a)) or the (2d + 1) immediate neighboring cells (including C_α itself; see Fig. 1(b)) are considered when computing the densities of objects in cell C_α and clustering in the AGRID algorithm. If all 3^d neighboring cells are considered, the computation is prohibitively expensive when the dimensionality is high. Nevertheless, if only the (2d + 1) immediate neighboring cells are considered, a lot of cells are ignored and the computed densities become inaccurate for high dimensional data. To tackle the above dilemma, the ith-order neighbor will be defined in this paper to classify the 3^d neighboring cells into groups according to their significance. By considering only the most significant neighboring cells, high speed is achieved and high accuracy is kept. To improve the accuracy further, density compensation and the minimal subspace distance are proposed, which will be described in the following section.
For a comprehensive survey of other approaches to clustering, please refer to Berkhin (2002), Grabmeier and Rudolph (2002), Han and Kamber (2001), Jain et al. (1999) and Kolatch (2001).
3. AGRID+: an enhanced density-grid based clustering

The proposed AGRID+ algorithm for clustering high dimensional data will be presented in this section. The ith-order neighbor is first introduced to improve efficiency, and then density compensation is proposed to improve accuracy and make the algorithm more effective for clustering high-dimensional data. In addition, the measure of minimal subspace distance is introduced to make the algorithm capable of finding clusters in subspaces effectively. Our techniques of partitioning the data space and choosing parameters will also be discussed in this section.
The following notations are used throughout this paper. N is the number of objects (or points, or instances) and d is the dimensionality of the dataset. L is the length of an interval, r is the radius of neighborhood, and DT is the density threshold. α is an object or a point, and C_α is the cell in which α is located. X is an object with coordinates (x_1, x_2, ..., x_d), and Dist_p(X, Y) is the distance between X and Y with the L_p metric as the distance measure. C_{i_1 i_2 ... i_d} stands for the cell whose ID is i_1 i_2 ... i_d, where i_j is the ID of the interval in which the cell is located in the jth dimension. V_n and V_c are respectively the volume of the neighborhood and the volume of the considered part of the neighborhood. Cnt_q(α) is the count of points in the considered part of the neighborhood of α, and Den_q(α) is the compensated density of α when all ith-order neighbors of α (0 ≤ i ≤ q) are considered for density computation.
3.1. The ith-order neighbors

In this section, our definitions of neighbors will be presented and discussed. Note that neighborhood and neighbors (or neighboring cells) are two different concepts in this paper. The former is defined for a point, and its neighborhood is an area or a space, while the latter is defined for a cell, and its neighbors are those cells adjacent to it. Sometimes we use "the neighbors of point α" to denote the neighbors of cell C_α w.r.t. point α (see Definition 4), where C_α is the cell in which α is located.
An intuitive way is to define all the cells around a cell as its neighbors, as Definition 1 shows.

Definition 1 (Neighbors). Cells C_{i_1 i_2 ... i_d} and C_{j_1 j_2 ... j_d} are neighbors of each other iff ∀p, 1 ≤ p ≤ d, |i_p − j_p| ≤ 1, where i_1 i_2 ... i_d and j_1 j_2 ... j_d are respectively the interval IDs of cells C_{i_1 i_2 ... i_d} and C_{j_1 j_2 ... j_d}.
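Definition 1 translates directly into a predicate on interval-ID tuples. The sketch below is illustrative only; the function name is ours, and cell IDs are assumed to be integer tuples.

```python
def are_neighbors(ci, cj):
    """Definition 1: two cells are neighbors iff every pair of interval IDs
    differs by at most 1 (note that a cell is a neighbor of itself)."""
    return len(ci) == len(cj) and all(abs(a - b) <= 1 for a, b in zip(ci, cj))
```

In a 2-D grid, for example, an interior cell has 3^2 = 9 neighbors, including itself.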
Generally speaking, there are altogether 3^d neighbors for each cell in a d-dimensional data space according to Definition 1 (see Fig. 1(a)). Assume α is an object and C_α is the cell that α is located in. When calculating the density of object α, we need to compute the distances between α and the objects in cell C_α and its neighboring cells only. Those objects in other cells are relatively far away from object α, so they contribute nothing or little to the density of α. Therefore, for object α, we do not care about the objects which are not in the neighboring cells of cell C_α.
With Definition 1, each cell has 3^d neighbors, which makes the computation very expensive when the dimensionality is high. Therefore, the idea of immediate neighbors is defined as follows to reduce computational complexity.

Definition 2 (Immediate Neighbors). Cells C_{i_1 i_2 ... i_d} and C_{j_1 j_2 ... j_d} are immediate neighbors of each other iff ∃l, 1 ≤ l ≤ d, |i_l − j_l| = 1, and ∀p ≠ l, 1 ≤ p ≤ d, i_p = j_p, where l is an integer between 1 and d, and i_1 i_2 ... i_d and j_1 j_2 ... j_d are respectively the interval IDs of cells C_{i_1 i_2 ... i_d} and C_{j_1 j_2 ... j_d}.
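As a sketch (again with our own naming), Definition 2 requires exactly one coordinate to differ, and by exactly 1:

```python
def are_immediate_neighbors(ci, cj):
    """Definition 2: exactly one interval ID differs, and it differs by 1."""
    diffs = [abs(a - b) for a, b in zip(ci, cj)]
    return diffs.count(0) == len(diffs) - 1 and max(diffs) == 1
```

An interior cell in a d-dimensional grid therefore has exactly 2d immediate neighbors, and a cell is not an immediate neighbor of itself.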
Generally speaking, in a d-dimensional space, each cell has 2d immediate neighbors (see Fig. 1(b)). With only immediate neighbors considered according to Definition 2, the computational complexity is greatly reduced, but at the cost of accuracy. It is effective when the clusters are compact and dense. Nevertheless, when the dimensionality is high and the data are sparse, the density values and the clustering become inaccurate, since many cells are ignored when computing densities. To improve the accuracy, we classify the neighbors according to their significance by defining ith-order neighbors as follows.

Fig. 2. The ith-order neighbors of cell C_α: (a) C_α (i = 0); (b) i = 1; (c) i = 2; (d) i = 3.
Definition 3 (ith-order Neighbors). Let C_α be a cell in a d-dimensional space. A cell which shares a (d − i)-dimensional facet with cell C_α is an ith-order neighbor of C_α, where i is an integer between 0 and d. In particular, we set the 0th-order neighbor of C_α to be C_α itself.
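Equivalently, the order of a neighboring cell is the number of coordinates in which its interval ID differs (by 1) from that of C_α. The sketch below uses that equivalence (our formulation, not the paper's) and enumerates the neighbors of an interior cell to verify the counts: there are C(d, i)·2^i ith-order neighbors, and summing over all orders recovers the 3^d of Definition 1.

```python
from itertools import product
from math import comb

def neighbor_order(c_alpha, c_beta):
    """Order of a neighboring cell: the number of differing interval IDs.
    Returns None if the cells are not neighbors (some ID differs by > 1)."""
    diffs = [abs(a - b) for a, b in zip(c_alpha, c_beta)]
    if max(diffs) > 1:
        return None
    return sum(1 for x in diffs if x == 1)

d = 3
center = (0,) * d
counts = {}
for cell in product((-1, 0, 1), repeat=d):
    i = neighbor_order(center, cell)
    counts[i] = counts.get(i, 0) + 1
# each order i occurs comb(d, i) * 2**i times, and the total is 3**d
```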
Examples of ith-order neighbors in a 3-D space are shown in Fig. 2. The grey cell in Fig. 2(a) is C_α, and the 0th-order neighbor of C_α is itself. The grey cells in Fig. 2(b)–(d) are the 1st-, 2nd- and 3rd-order neighbors of C_α, respectively.
With the introduction of ith-order neighbors, the neighbors of cell C_α are classified into groups according to their positions relative to C_α, and an ith-order neighbor's contribution to the density of α is greater with lower i. Therefore, we only consider low-order neighbors when clustering. More specifically, only those neighbors whose order is not greater than q are taken into account, where q is an integer and 0 ≤ q ≤ d.
The ith-order neighbor is a generalized notion of Definitions 1 and 2. When q is set to 1, only the 0th- and 1st-order neighbors are considered, and the low-order neighbors are C_α itself and the immediate neighbors defined by Definition 2. In this case, the speed is very fast, but the accuracy is poor. If q is set to d, all neighbors are considered, which is the same as Definition 1. Thus, the accuracy is guaranteed, but the computation is prohibitively costly.
Since lower-order neighbors are of more significance, our technique of considering only low-order neighbors helps to improve performance and keep accuracy as high as possible. Moreover, the accuracy can be further improved with our technique of density compensation, which will be discussed later in this paper.
In the following, the relationship between the radius of neighborhood and the length of interval will be discussed to further improve the performance of our algorithm.
Assume r to be the radius of neighborhood and L the length of an interval. When r is large enough that all the objects in all the neighbors of a cell are within the neighborhood, AGRID+ will behave somewhat like grid-based clustering, in the sense that the densities of all the objects in a cell will be the same and the density is simply the count of the objects in all its neighboring cells. With a very large r, both the densities and the neighborhood become very large, which will lead to the merging of adjacent clusters into bigger clusters and to clusters consisting of noise. On the other hand, if r is much smaller than the lengths of all edges of the hyper-rectangular cell, AGRID+ will become somewhat like density-based clustering, because the density of an object is largely decided by the number of objects circumscribed by r. With a very small r, both the densities and the neighborhood become very small, so the result will be composed of many small clusters and a large number of objects will be taken as outliers.

Fig. 3. Neighborhood and neighbors. The black point is α, the grey cell in the center is C_α, and the other grey cells around it are its neighbors. The area within the dashed line is the neighborhood of α.
Therefore, it is reasonable to set r to be of the same order as the length of an interval.
If r > L/2, all the 3^d cells around C_α should be considered to compute the density of object α accurately. If r < L/2, some of the 3^d cells around C_α will not overlap with the neighborhood and can be excluded from density computation.
An illustration of the above observation is given in Fig. 3, which shows the neighborhood and neighbors in a 2-D space. In the figure, the L∞ metric is used as the distance measure, so the neighborhood of α is a hypercube. Note that the above observation also holds for other distance measures.
As Fig. 3 shows, if object α is located near the top-left corner of C_α, only the cells that are on the top-left side of C_α need to be considered, and the computation becomes less expensive.

Fig. 4. The ith-order neighbors of C_α w.r.t. point α: (a) C_α (i = 0); (b) i = 1; (c) i = 2; (d) i = 3.
In what follows, we assume that r, the radius of neighborhood, is less than L/2. The above observation can be generalized as follows: if the radius of neighborhood is less than L/2, only those neighbors which are located on the same side of C_α as α contribute to the density of α.
Therefore, for each point α in cell C_α, the neighbors that need to be considered are related to the relative position of α in C_α, so a new definition of ith-order neighbors with respect to the position of α is given as follows.
Definition 4 (ith-order Neighbor w.r.t. Point α). In a d-dimensional space, let α be a point and C_α be the cell in which α is located. Assume that the coordinate of α is (x_1, x_2, ..., x_d), the center of C_α is (a_1, a_2, ..., a_d) and the center of C_β is (b_1, b_2, ..., b_d). Point α and cell C_β are on the same side of cell C_α iff ∀i, 1 ≤ i ≤ d, (x_i − a_i)(b_i − a_i) ≥ 0. Cell C_β is an ith-order neighbor of C_α w.r.t. α (or an ith-order neighbor of α for short) iff: (1) C_β is an ith-order neighbor of C_α, and (2) C_β and α are on the same side of C_α.
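A sketch of Definition 4 (function names are ours, and for illustration cells are identified by their centers on an integer grid with unit interval length). Enumerating the neighbors of an interior cell confirms that a point in general position has exactly C(d, i) ith-order neighbors for each i, since the side condition fixes the direction of every differing ID.

```python
from itertools import product
from math import comb

def same_side(x, a_center, b_center):
    """Definition 4: point x and the cell centered at b_center lie on the
    same side of the cell centered at a_center."""
    return all((xi - ai) * (bi - ai) >= 0
               for xi, ai, bi in zip(x, a_center, b_center))

def order_wrt_point(x, a_center, b_center):
    """Order of the neighbor centered at b_center w.r.t. point x, or None
    if it is not a neighbor of x's cell or not on x's side of it."""
    diffs = [abs(a - b) for a, b in zip(a_center, b_center)]
    if max(diffs) > 1 or not same_side(x, a_center, b_center):
        return None
    return sum(diffs)
```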
Since an ith-order neighbor of α shares a (d − i)-dimensional facet with C_α, the ID sequences of the ith-order neighbors of C_α have i different IDs from that of C_α, and the difference between each pair of IDs can be either +1 or −1. Because the ith-order neighbors of α lie on the same side of C_α as α, the number of ith-order neighbors of α is the binomial coefficient C(d, i).
Examples of the ith-order neighbors of C_α w.r.t. α in a 3-D space are shown in Fig. 4. Assume that α is a point on the top-right back side of the center of C_α (the grey cell) in Fig. 4(a), so C_α is the 0th-order neighbor of C_α w.r.t. α. The grey cells in Fig. 4(b)–(d) are the 1st-, 2nd- and 3rd-order neighbors of C_α w.r.t. α, respectively.
3.2. Density compensation

With the introduction of ith-order neighbors, efficiency is much improved by considering low-order neighbors only. However, the clustering still becomes less accurate with the increase of dimensionality. To further improve accuracy, an idea of density compensation is proposed in this section to make up for the loss introduced by ignoring high-order neighbors.
3.2.1. Idea of density compensation

Since only low-order neighbors are considered, a part of the neighborhood is ignored and the clustering becomes less accurate, especially as d increases. To make up for the loss, we propose a notion of density compensation. The idea is that, for each object, the ratio of the volume of the neighborhood to that of the considered part is calculated as a compensation coefficient, and the final density of an object is the product of its original density and its compensation coefficient.
According to Definition 4, if all ith-order neighbors of α (i = 0, 1, ..., q, where 0 ≤ q ≤ d) are considered when computing the density of α, the density we get should be compensated as

Den_q(α) = (V_n / V_c) · Cnt_q(α),    (1)

where V_n and V_c are respectively the volume of the neighborhood and the volume of the considered part of the neighborhood, which is covered by the ith-order neighbors of α with 0 ≤ i ≤ q. Cnt_q(α) is the count of points in the considered part of the neighborhood, and Den_q(α) is the compensated density of α.
Unfortunately, since there are too many neighbors for each cell when q is a large integer, it is impractical to compute the contribution of each cell individually. Therefore, we simplify the density compensation by assuming that, for a specific i, the considered parts of the ith-order neighbors are of the same volume.
Let V_{S_i} be the volume of the overlapped space of an ith-order neighbor and the neighborhood of α. Based on Eq. (1), we can get

Den_q(α) = [V_n / Σ_{i=0}^{q} C(d, i) · V_{S_i}] · Cnt_q(α),    (2)

where C(d, i) is the binomial coefficient (the number of ith-order neighbors of α), S_i is the overlapped space of an ith-order neighbor and the neighborhood of α, and V_{S_i} is the volume of S_i.
In the above equation, the density becomes more accurate with the increase of q. When q = d, the density we obtain is the exact value of the density. Nevertheless, the number of neighbors considered increases dramatically with the increase of q. When the value of q is set to 1, the 0th- and 1st-order neighbors together are the (2d + 1) neighbors defined in AGRID.
The value of V_{S_i} in Eq. (2) varies with the measure of distance. A method for computing V_{S_i} will be presented in the next section.
3.2.2. Density compensation

Euclidean distance is the most widely used distance measure. However, Euclidean distance increases with dimensionality, which makes it difficult to select a value for r. Assume that there are two points α(a, a, ..., a) and β(0, 0, ..., 0) in a d-dimensional space. The L_p distance between them is Dist_p(α, β) = (d·a^p)^{1/p}, that is, a·d^{1/p}. We can see that the distance increases with the dimensionality, especially when p is small. For example, for a dataset within the unit cube in a 100-D space, if Euclidean distance (p = 2) is used, the distance of the point (0.2, 0.2, ..., 0.2) from the origin is 2. If r is set to 1.5, it will cover the whole range in every single dimension, but still cannot cover the above point! It is the same case with most other L_p metrics, especially when p is a small integer.
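The arithmetic above is easy to check directly (a small verification sketch, not part of the algorithm):

```python
# L_p distance between alpha = (a, ..., a) and the origin in d dimensions:
# Dist_p = (d * a**p) ** (1/p) = a * d**(1/p)
d, a = 100, 0.2
dist_l2 = (d * a ** 2) ** 0.5   # Euclidean, p = 2: 0.2 * sqrt(100) = 2
dist_linf = a                   # L-infinity: the max coordinate difference
```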
However, when the L∞ metric is used, the distance becomes Dist(α, β) = a. For the above example, r can be set to 0.3 to cover the point, and it will cover only a part of every dimension. Therefore, the L∞ metric is more meaningful for measuring distance for clustering in high dimensional spaces.
Moreover, for subspace clustering in a high-dimensional space, clusters are defined by researchers as axis-parallel hyper-rectangles in subspaces (Agrawal et al., 1998; Procopiuc et al., 2002). Therefore, it is reasonable to define a cluster in this paper to be composed of those objects which are in a hyper-rectangle in a subspace. Since a hyper-rectangle in a subspace can be obtained by setting the subspace distance with the L∞ metric, we select the L∞ metric as the distance measure, which is shown as follows:

Dist_∞(X, Y) = max_{i=1...d} |x_i − y_i|    (3)
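Eq. (3) in code (a trivial sketch; the function name is ours):

```python
def dist_inf(x, y):
    """Eq. (3): L-infinity distance, the maximum per-dimension difference."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))
```

Under this metric the set {Y : Dist_∞(X, Y) ≤ r} is an axis-parallel hypercube of edge 2r centered at X, which is what makes the neighborhood volume simply (2r)^d.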
When the L∞ metric is used as the distance measure, the neighborhood of an object becomes a hypercube with edge length 2r, and its volume is V_n = (2r)^d, where r is the radius of neighborhood.
Let (a_1, a_2, ..., a_d) be the coordinate of α relative to the start point of C_α (see Fig. 5(a)). Let b_j = min{a_j, L_j − a_j}, where L_j is the length of the interval in the jth dimension. If b_j < r, then the neighborhood of α is beyond the boundary of C_α in the jth dimension. Suppose that there are altogether d′ dimensions with b_j < r, and a is the mean of such b_j.
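A sketch of this bookkeeping (names ours): compute each b_j, then d′ counts the dimensions where the neighborhood of radius r crosses the cell boundary, and a is the mean of those b_j.

```python
def boundary_stats(alpha_rel, lengths, r):
    """For alpha's coordinates relative to the start point of its cell,
    return (d_prime, a): the number of dimensions in which the
    neighborhood of radius r crosses the cell boundary, and the mean
    of b_j = min(a_j, L_j - a_j) over those dimensions."""
    b = [min(aj, Lj - aj) for aj, Lj in zip(alpha_rel, lengths)]
    crossing = [bj for bj in b if bj < r]
    d_prime = len(crossing)
    a = sum(crossing) / d_prime if d_prime else 0.0
    return d_prime, a
```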
To approximate the ratio of overlapped spaces, we assume that the current object α is located on the diagonal of cell C_α and that (a, a, ..., a) is its coordinate relative to the start point of C_α (Fig. 5(b)), where a = (1/d) Σ_{i=1}^{d} a_i. With such an assumption, for a specific i, all the ith-order neighbors of α have the same volume of overlapped space with the neighborhood of α.
Let S_i be the overlapped space of an ith-order neighbor and the neighborhood of α, and V_{S_i} be the volume of S_i. S_i is a hyper-rectangle which has i edges of length (r − a) and (d′ − i) edges of length (r + a), so V_{S_i} = (r + a)^{d′−i} (r − a)^i, where 0 ≤ a ≤ r. Since the neighborhood overlaps with C_α only in d′ dimensions, the following equation can be derived by replacing d, V_n and V_{S_i} in Eq. (2) with d′, (2r)^{d′} and (r + a)^{d′−i} (r − a)^i respectively:

Den_q(α) = [(2r)^{d′} / Σ_{i=0}^{q} C(d′, i) · (r + a)^{d′−i} (r − a)^i] · Cnt_q(α)    (4)

Fig. 5. Density compensation. As (a) shows, the black point is α, and the area circumscribed by the dotted line is the neighborhood of α. To approximate the volumes of the overlapped spaces (in grey) between the neighborhood and each neighbor of α, we assume that α is located on the diagonal line of the current cell C_α, as (b) shows.
where q is a positive integer no larger than d′. In fact, the actual number of ith-order neighbors of a point is often much less than C(d′, i), especially in high-dimensional spaces where most cells are empty. Thus k_i, the actual number of ith-order neighbors, can be used to replace C(d′, i), leading to

Den_q(α) = [(2r)^{d′} / Σ_{i=0}^{q} k_i · (r + a)^{d′−i} (r − a)^i] · Cnt_q(α)    (5)
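Eq. (5) as code (a sketch under our own naming; binomial coefficients substitute for k_i when the true neighbor counts are unknown). A useful sanity check follows from the binomial theorem: with q = d′ and k_i = C(d′, i), the denominator collapses to ((r + a) + (r − a))^{d′} = (2r)^{d′}, so the compensation factor is exactly 1.

```python
from math import comb

def compensated_density(cnt_q, d_prime, q, r, a, k=None):
    """Eq. (5): scale the raw count cnt_q by the ratio of the full
    neighborhood volume (2r)^d' to the volume actually covered by
    neighbors of order 0..q. k[i] is the number of ith-order neighbors;
    it defaults to the binomial coefficient C(d', i)."""
    if k is None:
        k = [comb(d_prime, i) for i in range(q + 1)]
    covered = sum(k[i] * (r + a) ** (d_prime - i) * (r - a) ** i
                  for i in range(q + 1))
    return (2 * r) ** d_prime / covered * cnt_q
```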
By tuning the parameter q, we can obtain different clustering accuracy. Clearly, both the accuracy and the cost increase with q, so a trade-off between accuracy and efficiency is needed. The value of q can be chosen according to the required accuracy and the available computing power: a large q improves the accuracy of the clustering result, but at the cost of time, while high speed can be achieved by setting q to a small value, at the cost of lower accuracy. Interestingly, our experiments show that setting q to two or three achieves good accuracy in most situations. The effect of different values of q will be shown in Section 4.
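To make the compensation concrete, Eq. (5) can be sketched in a few lines of Python. This is a minimal illustration under the paper's notation, not the authors' implementation; the function name and the representation of the counts k_i as a list indexed by neighbor order are our own.

```python
def compensated_density(cnt_q, k, r, a, d_prime):
    """Density compensation, Eq. (5): scale the raw neighbourhood count
    Cnt_q(alpha) by the ratio of the full neighbourhood volume (2r)^d'
    to the volume actually covered by the neighbours considered.

    k[i] is k_i, the actual number of ith-order neighbours (i = 0..q);
    a is the mean offset of alpha along the diagonal of its cell."""
    covered = sum(k_i * (r + a) ** (d_prime - i) * (r - a) ** i
                  for i, k_i in enumerate(k))
    return (2 * r) ** d_prime / covered * cnt_q
```

With a = 0 (the object at the cell's start point) and a single 0th-order neighbor, the correction reduces to the ratio (2r)^{d′}/r^{d′} = 2^{d′}, as expected.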
3.3. Minimal subspace distance
Euclidean distance is the most widely used distance measure. However, the difference between the nearest and the farthest points becomes less discriminating as dimensionality increases (Hinneburg et al., 2000). Aggarwal et al. suggest using fractional distance metrics (i.e., the L_p norm with 0 < p < 1) to measure the similarity between objects in high-dimensional space (Aggarwal et al., 2001). Nevertheless, many researchers believe that most meaningful clusters exist only in subspaces, so they use the traditional L_p norm (p = 1, 2, 3, ...) to discover clusters in subspaces (Agrawal et al., 1998; Fern and Brodley, 2003; Nagesh et al., 1999; Procopiuc et al., 2002). For subspace clustering in high-dimensional space, clusters are constrained to be axis-parallel hyper-rectangles in subspaces by Agrawal et al. (1998), and projective clusters are defined as axis-aligned boxes by Procopiuc et al. (2002). Therefore, it is reasonable to define a cluster as being composed of those objects which lie in a hyper-rectangle in a subspace. To improve on the traditional L_p norm (p = 1, 2, 3, ...) for subspace clustering in high-dimensional space, a new distance measure, the minimal subspace distance, is defined as follows.
Definition 5 (Minimal Subspace Distance). Suppose that X = (x_1, x_2, ..., x_d) and Y = (y_1, y_2, ..., y_d) are two objects or points in a d-dimensional space. The minimal k-dimensional subspace distance between X and Y is the minimal distance between them over all possible k-dimensional subspaces:

Dist^{(k)}(X, Y) = min_{all J_k} {Dist(X_{J_k}, Y_{J_k})},  J_k ⊂ {1, 2, ..., d}, 1 ≤ k < d    (6)
where J_k = (j_1, j_2, ..., j_k) is a k-dimensional subspace, X_{J_k} and Y_{J_k} are respectively the projections of X and Y onto subspace J_k, and Dist(·) is a traditional distance measure in the full-dimensional space. When the L_p metric is used as the distance measure, the minimal subspace distance is the L_p distance over the k smallest differences between the pairs x_i and y_i:

Dist_p^{(k)}(X, Y) = (Σ_{i=1..k} |x_{j_i} − y_{j_i}|^p)^{1/p}    (7)
If the L_∞ norm is used as the distance measure, the minimal subspace distance becomes the kth minimum of |x_i − y_i|, which can easily be obtained by sorting |x_i − y_i| (i = 1..d) in ascending order and picking the kth value.
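Under the L_∞ norm this observation gives a simple O(d log d) computation of the minimal subspace distance; for a general L_p norm the minimum over subspaces is attained on the k smallest coordinate-wise differences, as in Eq. (7). A small sketch (the function name and the p=None convention for L_∞ are our own):

```python
def min_subspace_dist(x, y, k, p=None):
    """Minimal k-dimensional subspace distance, Eqs. (6)-(7).
    With the L_p norm the minimising subspace consists of the k
    dimensions with the smallest |x_i - y_i|; with L_inf (p=None)
    the distance is simply the k-th smallest such difference."""
    diffs = sorted(abs(a - b) for a, b in zip(x, y))
    if p is None:               # L_inf: k-th minimum of |x_i - y_i|
        return diffs[k - 1]
    return sum(d ** p for d in diffs[:k]) ** (1 / p)
```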
Then Dist^{(k)}(X, Y) ≤ r means that X and Y lie in a hyper-rectangle with edges of length r in k dimensions and without limits in the other dimensions. Therefore, the above distance measure is an effective measure for hyper-rectangular clusters in subspaces.
With the help of the minimal subspace distance, it becomes easier to discover clusters in subspaces. For two objects, it finds the subspace in which they are most similar, i.e., nearest to each other. Assume that the L_∞ norm is used. For example, if the 4-D minimal subspace distance between two objects is 7, the two objects are within a 4-D hyper-rectangle with edge length 7. The minimal subspace distance measures the distance between objects in the subspace where they are closest to each other, so it is effective for finding the subspaces where clusters exist and then discovering the clusters in those subspaces.
With the above definition of minimal subspace distance, our algorithm is capable of finding projected clusters and their subspaces automatically when the average dimensionality of subspaces is given. The effectiveness of this distance measure will be shown in the experiments.
3.4. Partitioning data space

3.4.1. Technique of partitioning
The performance of our algorithm depends largely on the partitioning of the data space. Given a certain number of objects, the more cells the objects occupy and the more uniformly the objects are distributed, the better the performance. In some papers (Agrawal et al., 1998; Sheikholeslami et al., 1998), each dimension is divided into the same number (say, m) of intervals, giving m^d cells in the data space.
The above method of partitioning is effective when the dimensionality is low. Nevertheless, it is inapplicable in a high-dimensional data space, because the number of cells increases exponentially with the dimensionality and the computation becomes extremely expensive. For example, if d is 80, the number of cells is too large to be practical even if m is set to two. However, m cannot be made lower: if m is set to one, there is only one cell and the density calculation for each object requires N distance computations. In addition, when the dimensionality is high, it is very difficult to choose an appropriate value for m, the interval number, because a small change in m leads to a great variance in the number of cells. For example, if d is 30, m^d is 2.06 × 10^{14} when m = 3 and 1.07 × 10^9 when m = 2.
To tackle this problem, we partition the whole data space by dividing different dimensions into different numbers of intervals. With our technique, different interval numbers are used for different dimensions: each of the first p dimensions is divided evenly into m intervals, while each of the remaining (d − p) dimensions is divided into (m − 1) intervals. With such a partitioning, the total number of cells is m^p (m − 1)^{d−p}, and the number of cells can be adjusted smoothly by changing m and p.
Let ω be the percentage of non-empty cells and N the number of objects. The number of non-empty cells is N_ne = ω m^p (m − 1)^{d−p}, and the average number of objects contained in each non-empty cell is N_avg = N/N_ne. Let N_nc be the average number of neighboring cells of a non-empty cell (including itself). For each non-empty cell, the number of distance computations is N_avg · N_nc · N_avg, so the total time complexity is

C_t = N_avg N_nc N_avg N_ne = N_avg N_nc N = N^2 N_nc / (ω m^p (m − 1)^{d−p})    (8)
By setting the time complexity to be linear in both N and d, we get

N^2 N_nc / (ω m^p (m − 1)^{d−p}) = N d,    (9)

that is,

N N_nc / (ω m^p (m − 1)^{d−p}) = d.    (10)

Then the values of m and p can be derived from the above equation.
3.4.2. Average number of neighbors per cell
For simplicity, we consider the case with q = 1 to select the values of m and p. Actually, the m and p calculated in this way are also used when q is set to other values in our algorithm. With q = 1, when m is a large number, most cells have (2d + 1) neighbors, i.e., N_nc ≈ 2d + 1, so the following can be derived from Eq. (9):

N(2d + 1) / (ω m^p (m − 1)^{d−p}) = d,    (11)
where both m and p are positive integers, m ≥ 2 and 1 ≤ p ≤ d. However, when the dimensionality is high, m may become small and the majority of cells would have fewer than (2d + 1) neighbors, so Eq. (11) would be inapplicable for computing the value of p. In the following, a theorem is presented to compute the average number of neighbors of a cell.
Theorem 1. In a d-dimensional data space, if each of the first p dimensions is evenly divided into m intervals and each of the remaining (d − p) dimensions into (m − 1) intervals, where m ≥ 2, then the average number of immediate neighbors of a cell with q = 1 is

N_nc = 1 + (2(m − 1)/m) p + (2(m − 2)/(m − 1)) (d − p).    (12)
Proof. If a dimension is partitioned into m intervals, the total number of neighboring intervals over all intervals in the dimension is (2m − 2), since each of the two intervals at the ends has one neighbor and each of the remaining (m − 2) intervals has two neighbors. For the immediate neighbors (q = 1), the interval ID in exactly one dimension differs from the ID sequence of the current cell. If the difference is in one of the first p dimensions, there are p m^{p−1} (m − 1)^{d−p} cases, and in each case there are (2m − 2) neighbors, so the count of neighbors in the first p dimensions is n_1 = p m^{p−1} (m − 1)^{d−p} (2m − 2). If the difference is in one of the last (d − p) dimensions, there are (d − p) m^p (m − 1)^{d−p−1} cases, and in each case there are (2m − 4) neighbors, so the count of neighbors in the last (d − p) dimensions is n_2 = (d − p) m^p (m − 1)^{d−p−1} (2m − 4). The count of cells is n_3 = m^p (m − 1)^{d−p}, so the average number of neighbors of each cell is

(n_1 + n_2) / n_3 = [p m^{p−1} (m − 1)^{d−p} (2m − 2) + (d − p) m^p (m − 1)^{d−p−1} (2m − 4)] / [m^p (m − 1)^{d−p}] = (2(m − 1)/m) p + (2(m − 2)/(m − 1)) (d − p).

In addition, each cell is also considered a neighbor of itself, so N_nc = 1 + (2(m − 1)/m) p + (2(m − 2)/(m − 1)) (d − p).
From Eq. (9) and Theorem 1, we can get

N(1 + (2(m − 1)/m) p + (2(m − 2)/(m − 1)) (d − p)) / (ω m^p (m − 1)^{d−p}) = d,    (13)

where m ≥ 2 and 1 ≤ p ≤ d.
For a given m, Eq. (13) is a transcendental equation and cannot be solved directly. In fact, for each m, p is an integer no less than one and no greater than d. Therefore, the values of p fall in a small range, and the optimal values can be derived by trying every possible pair of values in Eq. (13).
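The exhaustive search described above can be sketched as follows; avg_neighbors implements Eq. (12), and the search scores each (m, p) by how closely Eq. (13) is satisfied. The upper bound m_max and the absolute-error scoring are our own assumptions; ω would in practice be estimated from the data.

```python
def avg_neighbors(m, p, d):
    """Average number of immediate neighbours of a cell, Eq. (12)."""
    return 1 + 2 * (m - 1) / m * p + 2 * (m - 2) / (m - 1) * (d - p)

def choose_partition(n, d, omega, m_max=10):
    """Search the small integer grid m >= 2, 1 <= p <= d for the pair
    whose cost ratio N * N_nc / (omega * m^p * (m-1)^(d-p)) is closest
    to d, i.e. the pair that best satisfies Eq. (13)."""
    best, best_err = None, float("inf")
    for m in range(2, m_max + 1):
        for p in range(1, d + 1):
            cells = m ** p * (m - 1) ** (d - p)
            err = abs(n * avg_neighbors(m, p, d) / (omega * cells) - d)
            if err < best_err:
                best, best_err = (m, p), err
    return best
```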
3.5. Storage of cells
In a high-dimensional data space, the number of cells can be huge and it is impossible to store all cells in memory. Fortunately, not all cells contain objects. Especially when the dimensionality is high, the space is very sparse and the majority of cells are empty, so it is not necessary to store all cells. With our technique, only the non-empty cells are stored, in a hash table. Because each non-empty cell contains at least one object, the number of non-empty cells is no more than N, the number of objects.
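A hash table keyed by the tuple of interval indices realizes this storage scheme directly; only cells that actually receive an object ever appear in the table. A minimal sketch (the names and the per-dimension mins/widths representation are our own; under the partitioning of Section 3.4, the interval widths differ between the first p and the remaining dimensions):

```python
def cell_id(point, mins, widths):
    """Map an object to its cell: the tuple of interval indices, one
    per dimension. Tuples are hashable, so they can key a dict that
    stores only the non-empty cells."""
    return tuple(int((x - lo) // w) for x, lo, w in zip(point, mins, widths))

def build_grid(points, mins, widths):
    grid = {}
    for pt in points:
        grid.setdefault(cell_id(pt, mins, widths), []).append(pt)
    return grid  # at most len(points) non-empty cells
```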
3.6. Parameters r and DT
While it is very easy to count the number of objects in the neighborhood of an object, it is not so easy to choose an appropriate value for r, the radius of the neighborhood. When r is large enough that all the objects in all the neighbors of a cell fall within the neighborhood, AGRID+ behaves somewhat like grid-based clustering, in the sense that the densities of all the objects in a cell are the same, and the density is simply the count of the objects in all its neighboring cells. They differ in that AGRID+ considers only the significant low-order neighbors, instead of all 3^d neighboring cells.
On the other hand, if r is much smaller than the lengths of all edges of the hyper-rectangular cell, AGRID+ becomes somewhat like density-based clustering, because the density of an object is largely decided by the number of objects circumscribed by r. However, partitioning the data space into cells helps to reduce the number of distance computations and makes AGRID+ much faster than density-based clustering.
Since Section 3.1 assumes that the radius of the neighborhood is less than L/2, where L is the length of the shortest interval over all dimensions, r is simply set to a value less than L/2. Because a small r can make the densities too low to find any useful clusters, r is set to between L/4 and L/2 in our algorithm.
Besides r, the result of clustering is also decided by the value of the density threshold, DT. With AGRID+, we calculate DT dynamically from the mean of the densities:

DT = (1/λ) × Σ_{i=1}^{N} Density(i) / N,    (14)
where λ is a coefficient which can be tuned to obtain clustering results at different levels of resolution. By tuning λ, various clustering results can be achieved with different DT. On the one hand, a small λ leads to a big DT: the merging condition becomes strict, the result is composed of many small clusters, and many objects are taken as noise. On the other hand, a large λ makes DT small, which leads to a few large clusters, because adjacent clusters are merged and some noise is mistaken for clusters. With a set of different values of λ, multi-resolution clustering can be obtained.
Since the proposed algorithm is based on AGRID, the effect of DT and the multi-resolution clustering of the two algorithms are similar; some experimental results and further discussion can be found in our previous work on AGRID (Zhao and Song, 2003).
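Eq. (14) is a one-liner to compute; in the sketch below, lam stands for the tuning coefficient of Eq. (14) (the function name is our own):

```python
def density_threshold(densities, lam):
    """Density threshold DT, Eq. (14): the mean density scaled by the
    tunable coefficient. A smaller coefficient gives a larger DT, a
    stricter merging condition, and more, smaller clusters."""
    return (1 / lam) * sum(densities) / len(densities)
```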
3.7. The procedure of AGRID+
AGRID+ is composed of the following seven steps. Detailed pseudocode can be found in Figs. 6–8.

(1) Partitioning. The whole data space is partitioned into cells according to the m and p computed with Eq. (13). Each object is then assigned to a cell according to its coordinates, and non-empty cells are inserted into a hash table.
(2) Computing the distance threshold. The distance threshold is computed from the interval lengths of every dimension with the method given in Section 3.6.
(3) Calculating densities. For each object, count as its density the number of objects that are both in its neighboring cells and in its neighborhood.
(4) Compensating densities. For each object α, compute the ratio of the volume of all neighbors to that of the neighbors considered, and use the product of this ratio and the density of α as the new density of α, according to Eq. (5).
(5) Calculating the density threshold DT. The average of all compensated densities is calculated and then the density threshold DT is computed with Eq. (14).
(6) Clustering automatically. At first, each object whose density is greater than DT is taken as a cluster. Then, for each object α, check each object in the neighboring cells of C_α to see whether its density is greater than the density threshold and whether its distance from α is less than the distance threshold. If so, merge the two clusters to which the two objects respectively belong. Continue this merging procedure until all eligible object pairs have been checked.
(7) Removing noise. Many of the clusters obtained are too small to be considered meaningful, so they are removed as noise.
3.8. Complexity analysis
The performance of AGRID+ depends on the values of N (the size of the data) and d (the dimensionality of the data). With the partitioning technique proposed in Section 3.4, the time complexity is controlled by m and p, the two parameters for space partitioning, and is set to be linear in N and d in Eq. (9) in Section 3.4.1. Nevertheless, this time complexity assumes the ideal condition that every cell contains the same number of objects. In nearly all cases, the number of objects varies from cell to cell, so the time complexity depends to some degree on the distribution of the objects in the data. Our experimental results in the next section will show that the time complexity is nearly linear in both data size and dimensionality.
Regarding space complexity, our algorithm stores only the non-empty cells in a hash table, and the number of non-empty cells is no more than the number of objects. Besides, the densities of objects and the discovered clusters are also kept in memory, and the space used to store densities and clusters is likewise linear in N. Therefore, the space complexity is linear in the size and the dimensionality of the data.
4. Experimental evaluation
Our experiments were performed on a PC with 256 MB RAM and an Intel Pentium III 1 GHz CPU. In the experiments, we show the improvement of AGRID+ over AGRID in terms of scalability, performance and accuracy, as well as the effectiveness of AGRID+ for discovering clusters in subspaces. In addition, we compare AGRID+ with Random Projection (Fern and Brodley, 2003) on a public dataset.
4.1. Synthetic data generator
The function nngenc(X, C, N, D) from Matlab¹ is used to generate clusters of data points, where X is an R × 2 matrix of cluster bounds, C is the number of clusters, N is the number of data points in each cluster, and D is the standard deviation of the clusters. The function returns a matrix containing C × N R-element vectors arranged in C clusters, with centers inside the bounds set by X and with N elements each, distributed randomly around the centers with standard deviation D. The range is set to [0, 1000]. For some clusters, we set the values in some dimensions to be uniformly distributed, to make subspace clusters. Noise, uniformly distributed in all dimensions, is added to the data.

¹ http://www.mathworks.com/.
Fig. 6. Pseudocode of AGRID+.
4.2. Superiority over AGRID
The first dataset is a 15-D dataset of 10,000 points, generated with the deviation set to 130. There are 4 clusters, and 5% of the data are noise. All clusters are in the full-dimensional space. The clustering results of AGRID+ and AGRID are shown in Fig. 9. Fig. 9(a) shows the clusters discovered by AGRID, where the numbers below the subfigures are the sizes of the clusters. AGRID+ also found four clusters, but with more objects. Fig. 9(b) shows the additional objects discovered by AGRID+ as opposed to those by AGRID, where the numbers below the subfigures are the counts of additional objects in the clusters. The objects in Fig. 9(b) are missed by AGRID, which shows that the compensation of densities makes AGRID+ more accurate than AGRID.
Tables 1 and 3 compare AGRID+ and AGRID on the densities of objects and the accuracy of clustering. The confusion matrix of densities is given in Table 1, in which DT stands for the density threshold and the figures are counts of objects. It is clear from the table that the accuracy of AGRID+ is greater than that of AGRID (Table 2).

Table 1
Comparison of densities.

                    Density in AGRID+       Density in AGRID
Standard density    ≥DT        <DT          ≥DT        <DT
≥DT                 9314       190          8292       1212
<DT                 222        274          16         480
Accuracy            95.9%                   87.7%

Fig. 7. Pseudocode of Computing density().
We then further studied the effectiveness of ith-order neighbors and density compensation; the results are shown in Table 3, which gives the clusters discovered and the accuracy. It reports the clustering results of four algorithms: NAIVE, AGRID, IORDER and AGRID+. NAIVE is a naive density-based clustering algorithm that uses no grid. AGRID adds the grid and the (2d + 1) neighboring cells on top of NAIVE. IORDER uses ith-order neighbors to improve the performance of NAIVE, but applies no compensation to the density computation. AGRID+ uses all of the techniques designed in this paper.
ALGORITHM: Clustering
INPUT: data
OUTPUT: clusters
/* create a new cluster for each object whose density is no less than DT */
FOR all objects O_i
    IF Den_q(O_i) ≥ DT
        cluster(O_i) = {O_i};
    ENDIF
ENDFOR
/* combine clusters */
FOR all cells C_i in hash table
    FOR all C_j, non-empty kth-order neighbors of C_i (0 ≤ k ≤ q, j ≥ i)
        FOR all objects O_m in C_i
            FOR all objects O_n in C_j
                IF Den_q(O_m) ≥ DT AND Den_q(O_n) ≥ DT AND dist(O_m, O_n) ≤ r
                    cluster(O_m) = cluster(O_m) ∪ cluster(O_n);
                    cluster(O_n) = cluster(O_m);
                ENDIF
            ENDFOR
        ENDFOR
    ENDFOR
ENDFOR

Fig. 8. Pseudocode of Clustering().

Fig. 9. Experimental results of AGRID and AGRID+. (a) The four clusters discovered by AGRID; the number under each subfigure gives the size of the cluster (cluster 1: 1968, cluster 2: 2158, cluster 3: 2163, cluster 4: 2019). (b) The additional objects in each cluster found by AGRID+; the number under each subfigure gives the count of additional objects (cluster 1: 325, cluster 2: 221, cluster 3: 211, cluster 4: 304).

Table 2
Four algorithms and their techniques.

Algorithms    Grid    ith-order neighbors    Density compensation
NAIVE
AGRID         √
IORDER        √       √
AGRID+        √       √                      √

The results show that
NAIVE has the highest accuracy but the longest running time (around 10 times as long as the other three), because it does not use any grid to reduce computation and the distance between every pair of objects has to be calculated. The other three algorithms are about 10 times faster than NAIVE, so the grid is very effective at speeding up density computation, at the cost of accuracy. IORDER is more accurate than AGRID, which demonstrates the effectiveness of ith-order neighbors. The higher accuracy of AGRID+ over IORDER shows that density compensation enhances clustering quality, at the cost of a marginally longer running time. The table clearly demonstrates the effectiveness of the grid in improving speed, and of ith-order neighbors and density compensation in improving clustering quality.
While the clusters in the above dataset can be easily discovered by both algorithms, another dataset is used to demonstrate the superiority of AGRID+ over AGRID in subspace clustering. It is a 15-D dataset of 20,000 points, and the clusters exist in 11-D subspaces. As shown in Fig. 10, the first cluster exists in the first 11 dimensions, while the attribute values in the last 4 dimensions are uniformly distributed. For the second cluster, the attribute values in dimensions 3–6 are uniformly distributed, i.e., the second cluster exists in the subspace composed of dimensions 1, 2 and 7–15. For the other three clusters, the uniformly distributed dimensions are respectively 7–10, 8–11, and 6–9. The last subfigure shows the noise, which accounts for 10% of the data. Our experiment shows that AGRID+ can discover the five clusters correctly with an accuracy of 91%. In contrast, AGRID cannot find the five clusters correctly even by fine-tuning the parameters r and DT.

Table 3
Comparison of accuracy.

Algorithms    Cluster 1    Cluster 2    Cluster 3    Cluster 4    Accuracy    Time (s)
NAIVE         2368         2396         2374         2366         95.0%       44.21
AGRID         1968         2158         2163         2019         83.1%       3.62
IORDER        1990         2216         2240         2081         85.3%       4.54
AGRID+        2293         2379         2374         2323         93.7%       4.60
4.3. Scalability
The performance of AGRID+ and AGRID is shown in Fig. 11, where the solid lines represent AGRID+ and the dashed lines represent AGRID. Ten experiments were conducted for each method, and the average results are given in the figure. In Fig. 11(a), the dimensionality of the datasets is 20 and the sizes range from 10,000 to 100,000. In Fig. 11(b), the size is 100,000 and the dimensionalities range from 3 to 100. In each dataset, 10% of the objects are noise. From the figure, it is clear that the running time of AGRID+ is nearly linear in both the size and the dimensionality of the datasets, and is a little longer than that of AGRID.
In the above experiments, we set q = 1 when applying Eq. (5). To test the effect of different q, another experiment was conducted on a dataset of 100,000 objects and 15 dimensions. The running time and the accuracy of the clusters discovered with different q are shown in Fig. 12(a) and (b), respectively. When q is zero, the algorithm is fastest, but the accuracy is very low. As q increases, more neighbors are taken into consideration and the accuracy goes up dramatically, but the running time becomes longer. When q is larger than three, there is no significant increase in accuracy in the experiment. From the figure, it is reasonable to set q to 2 or 3 to achieve both high speed and high accuracy. Users can set the value of q according to computer performance and accuracy requirements in their applications.
Fig. 10. Experimental results of AGRID+. The first 5 subfigures are clusters discovered by AGRID+ in a dataset with 11-D subspace clusters; the last subfigure shows the noise.

4.4. Multi-resolution clustering

Multi-resolution clustering can be achieved by using different values of DT (or λ) in Eq. (14), which helps to detect clusters at different levels, as shown in Fig. 13. Although the multi-resolution property of our technique resembles that of WaveCluster, the two differ considerably in that AGRID+ achieves it by adjusting the density threshold, while WaveCluster does so by "increasing the size of a cell's neighborhood". The clustering of a 2-D dataset of 2000 objects is used to demonstrate the effectiveness of multi-resolution clustering.
Fig. 13(a) shows the original data before clustering, and the other subfigures are clustering results with different DTs; the values of the density threshold are respectively 5, 10, 20, 30, 35, 40 and 50 in Fig. 13(b)–(h). In Fig. 13(b), DT is set to 5 and three clusters are found. The two groups of objects at the top right form one cluster, because they are connected by some objects between them. When DT increases to 10, they are split into two clusters, as shown in Fig. 13(c). The bottom cloud of objects is classified into two clusters when DT is 20, in Fig. 13(d). As DT increases further, all clusters shrink, resulting in the splitting or disappearance of some clusters, as shown in Fig. 13(e)–(h). When DT is set to 50, only three clusters are found, composed of objects in very densely populated areas (see Fig. 13(h)).
Fig. 11. Scalability with the size and dimensionality of datasets. (a) Scalability with N; (b) scalability with d.
Fig. 12. Experimental results of running time and accuracy with various values of q, the order of neighbors. (a) Running time; (b) accuracy.
Generally speaking, the greater the density threshold is, the smaller the clusters are, and the more objects are treated as outliers. When DT is a very small number, the number of clusters is small and just a few objects are treated as outliers. As DT increases, some clusters break into more, smaller clusters. A hierarchical clustering tree can be built by selecting a series of DTs, and the appropriate resolution level for choosing clusters can be decided by the needs of users.
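Such a multi-resolution view can be obtained by simply re-thresholding the already computed densities at a series of DT values, without re-running the density calculation. A minimal sketch (the function and its dict-of-lists output format are our own):

```python
def core_objects(densities, dts):
    """For each density threshold in a series, keep the indices of the
    objects dense enough to seed clusters at that resolution. A larger
    DT leaves fewer core objects, hence smaller, tighter clusters."""
    return {dt: [i for i, d in enumerate(densities) if d >= dt]
            for dt in dts}
```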
Fig. 13. Multi-resolution clustering. (a) Data; (b) DT = 5, 3 clusters; (c) DT = 10, 4 clusters; (d) DT = 20, 5 clusters; (e) DT = 30, 5 clusters; (f) DT = 35, 5 clusters; (g) DT = 40, 4 clusters; (h) DT = 50, 3 clusters.
Fig. 14. Control chart time series data.
4.5. Comparison with Random Projection on public data
In addition to the experiments with the above synthetic datasets, experiments were conducted with the control chart time series dataset from the UCI KDD Archive², and a comparison was made with Random Projection (Fern and Brodley, 2003), an algorithm for subspace clustering. The dataset has 60 dimensions and 600 records. There are six clusters: normal, cyclic, increasing trend, decreasing trend, upward shift and downward shift (see Fig. 14). To make the six clusters easy to see, only a few time series are shown for each cluster in the figure.
The clustering given in the UCI KDD Archive is used as the standard result, and Conditional Entropy (CE) and Normalized Mutual Information (NMI) are employed to measure the quality of clustering. Compactness (Zait and Messatfa, 1997) is also widely used to measure clustering quality, but it favours sphere-shaped clusters since the diameter is used. CE and NMI have been used to measure clustering quality by Strehl and Ghosh (2002), Fern and Brodley (2003) and Pfitzner et al. (2009), and similar measures based on entropy have also been used by Hu and Sung (2006).
Conditional Entropy measures the uncertainty of the class labels given a clustering solution. For one clustering with m clusters and a second clustering with k clusters, the Conditional Entropy is defined as CE = Σ_{j=1}^{k} n_j E_j / N, where the entropy E_j = −Σ_{i=1}^{m} p_ij log(p_ij), n_j is the size of cluster j in the second clustering, p_ij is the probability that a member of cluster i in the first clustering belongs to cluster j in the second clustering, p_i is the probability of cluster i, p_j is the probability of cluster j, and N is the size of the dataset. The value of CE is a non-negative real number: the smaller CE is, the closer the tested result is to the standard result, and the two results become identical when CE is zero.
For two clustering solutions C_1 and C_2, the normalized mutual information is defined as NMI = MI / √(H(C_1) H(C_2)), where the mutual information MI = Σ_{i,j} p_ij log(p_ij / (p_i p_j)), and H(C_1) and H(C_2) denote the entropies of C_1 and C_2, respectively. The value of NMI lies in [0, 1]. Contrary to CE, the larger the value of NMI, the better the clustering; if NMI is one, the two clusterings are identical. In all, we would like to minimize CE and maximize NMI.

² http://kdd.ics.uci.edu/.
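Both measures are straightforward to compute from the contingency counts of the two clusterings. The sketch below follows the definitions above, using natural logarithms; the function names are our own:

```python
import math
from collections import Counter

def conditional_entropy(truth, pred):
    """CE = sum_j (n_j / N) * E_j, with E_j = -sum_i p_ij log p_ij,
    where p_ij is the fraction of cluster j's members from class i."""
    n = len(truth)
    ce = 0.0
    for j in set(pred):
        members = [t for t, c in zip(truth, pred) if c == j]
        nj = len(members)
        ej = -sum((m / nj) * math.log(m / nj)
                  for m in Counter(members).values())
        ce += nj * ej / n
    return ce

def nmi(c1, c2):
    """NMI = MI / sqrt(H(C1) * H(C2)), with MI from the joint
    distribution p_ij of the two cluster labels."""
    n = len(c1)
    p1 = {k: v / n for k, v in Counter(c1).items()}
    p2 = {k: v / n for k, v in Counter(c2).items()}
    pij = {k: v / n for k, v in Counter(zip(c1, c2)).items()}
    mi = sum(p * math.log(p / (p1[i] * p2[j])) for (i, j), p in pij.items())
    h1 = -sum(p * math.log(p) for p in p1.values())
    h2 = -sum(p * math.log(p) for p in p2.values())
    return mi / math.sqrt(h1 * h2)
```

As a sanity check, two identical clusterings give CE = 0 and NMI = 1, while two independent clusterings give MI = 0 and hence NMI = 0.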
Since the dimensionality is high and the records are relatively few, subspace clustering is employed to find the clusters. The parameters were selected by studying the data and by fine-tuning. Generally speaking, the greater the average dimensionality of the subspace clusters and the density threshold are, the stricter the condition for merging two objects or clusters into one cluster is, and the result tends to be composed of many smaller clusters. On the contrary, the smaller the two parameters are, the looser the merging condition is, and the result tends to consist of a few bigger clusters.
From Fig. 14, we can see that, for the time series in a cluster, the values in most dimensions are close, with big differences in around 10–20 dimensions. Therefore, it is reasonable to set the average dimensionality of the subspace clusters to 40–50.
In our experiments on the above data, we found that, with an average dimensionality of less than 40, the algorithm often merges increasing trend and upward shift into one cluster, and decreasing trend and downward shift into another. The best result was achieved by setting the average dimensionality to 45 and the density threshold to 8.
The results of clustering with our algorithm and with Random Projection are given in Table 4. From the table, we can see that the clustering of our algorithm has the lowest CE and the highest NMI, which shows that our algorithm performs better than Random Projection. The superiority of AGRID+ over IORDER also shows the effectiveness of density compensation.
Generally speaking, the average dimensionality of subspaces can be set based on domain knowledge in specific applications. However, if a user has no idea how to set an appropriate value, he may run the algorithm multiple times with various values for the parameter and then choose the best clustering with the help of some internal validation measures or relative validation measures, such as Compactness, the Silhouette index, Figure of merit and Stability (Brun et al., 2007; Halkidi et al., 2001).

Table 4
AGRID+ vs Random Projection.

        AGRID+    IORDER    Random Projection
CE      0.466     0.517     0.706
NMI     0.845     0.822     0.790
5. Discussions
To reduce the cost of computation, some assumptions are made in this paper. One assumption reduces the computation cost of V_c in Eq. (1) by approximating it in a simple way. For an object in a d-dimensional space, when all neighboring cells with order no more than q are considered, we need to calculate the volume of the overlapped space for every such neighboring cell. That is, the number of volume calculations would be 1 + C(d, 1) + C(d, 2) + ... + C(d, q).
We can see that the above computation is costly, especially when d and q are big. To simplify the calculation, we use the V_c for the object (a, a, ..., a) to approximate the V_c for the object (a_1, a_2, ..., a_d), where a is the mean of a_1, a_2, ..., a_d.
With the approximation, we need to calculate only (d + 1) overlapped spaces (i.e., one for each order of neighbors). Although the volume for a specific neighbor may be over- or under-estimated, the overall volume V_c, which is the sum of the overlapped spaces with all neighbors of order no more than q, is well approximated.
The effectiveness of the approximation is shown in Table 1, where the accuracy is improved from 87.7% to 95.9% with density compensation. It is also validated by an improvement in accuracy from 85.3% (IORDER) to 93.7% (AGRID+), as shown in Table 3. Moreover, the above assumption is used only to calculate the volume for density compensation, and the calculation of Cnt_q(α) in Eqs. (1) and (2) is not affected by it.
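The saving from this approximation can be illustrated by counting the overlapped-volume calculations in both cases: a sum of binomial coefficients without the approximation, versus d + 1 with it. A small sketch:

```python
from math import comb

def volume_calcs_exact(d, q):
    """Volume calculations without the approximation:
    1 + C(d, 1) + C(d, 2) + ... + C(d, q)."""
    return sum(comb(d, i) for i in range(q + 1))

def volume_calcs_approx(d):
    """With the mean-coordinate approximation: one calculation
    per neighbor order, i.e. d + 1 in total."""
    return d + 1

# For example, with d = 30 dimensions and order q = 3:
print(volume_calcs_exact(30, 3))   # → 4526
print(volume_calcs_approx(30))     # → 31
```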
Regarding space partitioning for producing the grid and cells, some other techniques, such as adaptive grid (Nagesh et al., 1999) and optimal grid (Hinneburg and Keim, 1999), partition the data space by considering the data distribution in every dimension. However, they do not fit our algorithm for the following reasons. Firstly, equal-sized cells are preferred in our algorithm for covering a neighborhood with cells, since density is defined based on neighborhood, whereas distribution-based partitioning often produces cells with great variance in their sizes.
Secondly, the number of cells needs to be able to change smoothly, so that it is easier to choose an appropriate value for DT or to fine-tune DT. The above two features are important to our algorithm, but distribution-based partitioning fails to provide them.
Although our proposed partitioning method looks simple, it addresses the above two issues well, and its effectiveness is shown in our experiments.
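As a minimal illustration of the equal-sized-cell idea (this is generic equal-width gridding, not the paper's exact partitioning method, which additionally makes the number of cells smoothly adjustable), a point can be mapped to its cell by equal-width interval indices:

```python
def cell_index(point, mins, maxs, m):
    """Assign a point to an equal-width grid cell, with m intervals
    per dimension; equal-sized cells keep the neighborhood coverage
    uniform, which density estimation relies on."""
    idx = []
    for x, lo, hi in zip(point, mins, maxs):
        i = int((x - lo) / (hi - lo) * m) if hi > lo else 0
        idx.append(min(i, m - 1))  # clamp the upper boundary into the last cell
    return tuple(idx)

print(cell_index((0.5, 0.99), (0, 0), (1, 1), 4))  # → (2, 3)
```

Because every cell has the same width, increasing m refines all cells uniformly, which is the property distribution-based partitioning lacks.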
6. Conclusions

In this paper, we have presented a novel and efficient grid-density based clustering approach with four novel technical features. The first is that it takes objects (or points) as atomic units, whereby the size requirement on cells is waived without losing clustering accuracy.
The second is the concept of ith-order neighbors, with which the neighboring cells are organized into a small number of groups to lower the computational complexity and to meet different accuracy requirements. The third is the idea of density compensation, which improves the accuracy of densities and clustering. Last but not least, the measure of minimal subspace distance is used to help AGRID+ discover clusters in subspaces.
We have experimentally evaluated our approach and demonstrated that our algorithm significantly reduces computation cost and improves clustering quality.
In fact, besides AGRID+, our measure of minimal subspace distance can also help other algorithms find clusters in subspaces, which will be investigated in our future work. Two further directions for future work are: (1) finding an optimal order of the dimensions, based on the distribution of data in every single dimension, before partitioning; and (2) using internal indices to obtain optimal parameter settings.
Acknowledgements

This research was done when Yanchang Zhao was an Australian Postdoctoral Fellow (Industry) at the Faculty of Engineering & IT, University of Technology, Sydney, Australia.

This work is supported in part by the Australian Research Council (ARC) under large grant DP0985456, the China "1000-Plan" Distinguished Professorship, the Jiangsu Provincial Key Laboratory of E-business at the Nanjing University of Finance and Economics, and the Guangxi NSF (Key) grants.
References

Aggarwal, C.C., Hinneburg, A., Keim, D.A., 2001. On the surprising behavior of distance metrics in high dimensional space. In: Proc. of the 8th International Conference on Database Theory.
Agrawal, R., Gehrke, J., et al., 1998. Automatic subspace clustering of high dimensional data for data mining applications. In: Proc. of the 1998 ACM-SIGMOD International Conference on Management of Data (SIGMOD'98), Seattle, WA, June, pp. 94–105.
Alsabti, K., Ranka, S., Singh, V., 1998. An efficient K-means clustering algorithm. In: Proc. of the First Workshop on High Performance Data Mining, Orlando, FL.
Ankerst, M., Breunig, M., et al., 1999. OPTICS: ordering points to identify the clustering structure. In: Proc. of the 1999 ACM-SIGMOD International Conference on Management of Data (SIGMOD'99), Philadelphia, PA, June, pp. 49–60.
Assent, I., Krieger, R., Glavic, B., Seidl, T., 2008. Clustering multidimensional sequences in spatial and temporal databases. Knowledge and Information Systems 16 (July (1)), 29–51.
Berkhin, P., 2002. Survey of Clustering Data Mining Techniques. Technical Report, Accrue Software.
Brun, M., Sima, C., Hua, J., Lowey, J., Carroll, B., Suh, E., Dougherty, E.R., 2007. Model-based evaluation of clustering validation measures. Pattern Recognition, vol. 40. Elsevier Science Inc., pp. 807–824.
Ester, M., Kriegel, H.P., et al., 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. of the 1996 International Conference on Knowledge Discovery and Data Mining (KDD'96), Portland, Oregon, August, pp. 226–231.
Fern, X.Z., Brodley, E., 2003. Random projection for high dimensional data clustering: a clustering ensemble approach. In: Proc. of the 20th International Conference on Machine Learning (ICML'03), Washington, DC.
Grabmeier, J., Rudolph, A., 2002. Techniques of cluster algorithms in data mining. Data Mining and Knowledge Discovery 6, 303–360.
Guha, S., Rastogi, R., Shim, K., 1998. CURE: an efficient clustering algorithm for large databases. In: Proc. of the 1998 ACM-SIGMOD International Conference on Management of Data (SIGMOD'98), Seattle, WA, June, pp. 73–84.
Guha, S., Rastogi, R., Shim, K., 1999. ROCK: a robust clustering algorithm for categorical attributes. In: Proc. of the 1999 International Conference on Data Engineering (ICDE'99), Sydney, Australia, March, pp. 512–521.
Halkidi, M., Batistakis, Y., Vazirgiannis, M., 2001. On clustering validation techniques. Journal of Intelligent Information Systems 17, 107–145.
Han, J., Kamber, M., 2001. Data Mining: Concepts and Techniques. Higher Education Press, Morgan Kaufmann Publishers.
Hinneburg, A., Keim, D.A., 1998. An efficient approach to clustering in large multimedia databases with noise. In: Proc. of the 1998 International Conference on Knowledge Discovery and Data Mining (KDD'98), New York, August, pp. 58–65.
Hinneburg, A., Keim, D.A., 1999. Optimal grid-clustering: towards breaking the curse of dimensionality in high-dimensional clustering. In: Proc. of the 25th International Conference on Very Large Data Bases, Edinburgh, Scotland.
Hu, T., Sung, S.Y., 2006. Finding centroid clusterings with entropy-based criteria. Knowledge and Information Systems 10 (November (4)), 505–514.
Huang, Z., 1998. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery 2, 283–304.
Jain, A.K., Murty, M.N., Flynn, P.J., 1999. Data clustering: a review. ACM Computing Surveys 31 (September (3)).
Karypis, G., Han, E.H., Kumar, V., 1999. CHAMELEON: a hierarchical clustering algorithm using dynamic modelling. IEEE Computer, Special Issue on Data Analysis and Mining 32 (August (8)), 68–75.
Kolatch, E., 2001. Clustering Algorithms for Spatial Databases: A Survey. Dept. of Computer Science, University of Maryland, College Park.
Hinneburg, A., Aggarwal, C.C., Keim, D.A., 2000. What is the nearest neighbor in high dimensional spaces? In: Proc. of the 26th International Conference on Very Large Data Bases, Cairo, Egypt, pp. 506–515.
Moise, G., Sander, J., Ester, M., 2008. Robust projected clustering. Knowledge and Information Systems 14 (March (3)), 273–298.
Nagesh, H., Goil, S., Choudhary, A., 1999. MAFIA: Efficient and Scalable Subspace Clustering for Very Large Data Sets. Technical Report 9906-010, Northwestern University, June.
Procopiuc, M., Jones, M., Agarwal, P., Murali, T.M., 2002. A Monte-Carlo algorithm for fast projective clustering. In: Proc. of the 2002 International Conference on Management of Data.
Ng, R., Han, J., 1994. Efficient and effective clustering method for spatial data mining. In: Proc. of the 1994 International Conference on Very Large Data Bases (VLDB'94), Santiago, Chile, September, pp. 144–155.
Pfitzner, D., Leibbrandt, R., Powers, D., 2009. Characterization and evaluation of similarity measures for pairs of clusterings. Knowledge and Information Systems 19 (June (3)), 361–394.
Sheikholeslami, G., Chatterjee, S., Zhang, A., 1998. WaveCluster: a multi-resolution clustering approach for very large spatial databases. In: Proc. of the 1998 International Conference on Very Large Data Bases (VLDB'98), New York, August, pp. 428–429.
Strehl, A., Ghosh, J., 2002. Cluster ensembles - a knowledge reuse framework for combining multiple partitions. Machine Learning Research 3, 583–617.
Wang, W., Yang, J., Muntz, R., 1997. STING: a statistical information grid approach to spatial data mining. In: Proc. of the 1997 International Conference on Very Large Data Bases (VLDB'97), Athens, Greece, August, pp. 186–195.
Zait, M., Messatfa, H., 1997. A comparative study of clustering methods. Future Generation Computer Systems 13, 149–159.
Zhang, T., Ramakrishnan, R., Livny, M., 1996. BIRCH: an efficient data clustering method for very large databases. In: Proc. of the 1996 ACM-SIGMOD International Conference on Management of Data (SIGMOD'96), Montreal, Canada, June, pp. 103–114.
Zhao, Y., Song, J., 2003. AGRID: an efficient algorithm for clustering large high-dimensional datasets. In: Proc. of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'03), Seoul, Korea, April, pp. 271–282.
Yanchang Zhao is a Senior Data Mining Specialist at Centrelink, Australia. He was an Australian Postdoctoral Fellow (Industry) at the Data Sciences and Knowledge Discovery Research Lab, Centre for Quantum Computation and Intelligent Systems, University of Technology, Sydney, Australia, from 2007 to 2009. His research interests are clustering, sequential patterns, time series, association rules and their applications. He is a member of the IEEE.
Jie Cao is a Professor and the Chair of the Jiangsu Provincial Key Laboratory of E-business at the Nanjing University of Finance and Economics. He is a winner of the Program for New Century Excellent Talents in Universities (NCET). He received his PhD degree from Southeast University, China, in 2002. His main research interests include cloud computing, business intelligence and data mining. Dr. Cao has published one book and more than 40 refereed papers in various journals and conferences.
Chengqi Zhang has been a Professor of Information Technology at the University of Technology, Sydney (UTS) since December 2001. He has been the Director of the UTS Priority Investment Research Centre for Quantum Computation and Intelligent Systems since April 2008, and Chairman of the Australian Computer Society National Committee for Artificial Intelligence since November 2005. Prof. Zhang obtained his PhD degree from the University of Queensland in 1991, followed by a Doctor of Science (DSc, Higher Doctorate) from Deakin University in 2002. Prof. Zhang's research interests mainly focus on data mining and its applications. He has published more than 200 research papers, including several in first-class international journals such as Artificial Intelligence and IEEE and ACM Transactions, and has published six monographs and edited 16 books. He has delivered 12 keynote/invited speeches at international conferences over the last six years and has attracted seven Australian Research Council grants. He is a Fellow of the Australian Computer Society (ACS) and a Senior Member of the IEEE Computer Society. He served as an Associate Editor for three international journals, including IEEE Transactions on Knowledge and Data Engineering (2005–2008), and served as General Chair, PC Chair, or Organising Chair for five international conferences, including ICDM and WI/IAT. His personal web page can be found at: http://www-staff.it.uts.edu.au/~chengqi/.
Shichao Zhang is a China "1000-Plan" Distinguished Professor and the Dean of the College of Computer Science and Information Technology at Guangxi Normal University, Guilin, China. He received his PhD degree in Applied Mathematics from the China Academy of Atomic Energy in 1997. His research interests include information quality and multi-source data mining. He has published 10 solely-authored international journal papers, about 50 international journal papers, and over 60 international conference papers. He has been a CI on 10 national-class projects in China and Australia. He has served, or is serving, as an associate editor for IEEE Transactions on Knowledge and Data Engineering, Knowledge and Information Systems, and the IEEE Intelligent Informatics Bulletin, and served as a (vice-)PC Chair for 5 international conferences.