Clustering of Dynamic Data
Introduction
Data clustering is a field of active research in machine learning and data mining. Most of
the work has focused on static data sets. There has been little work on clustering of
dynamic data. We define a dynamic data set as a set of elements whose parameters
change over time. A flock of flying birds is an example of a dynamic data set. We are
interested in exploring algorithms that are capable of finding relationships amongst the
elements in a dynamic data set. In this paper we evaluate the use of data clustering
techniques developed for static data sets on dynamic data.
Hypothesis
Traditional clustering algorithms used in data mining will not perform well on dynamic
data sets. A clustering algorithm must consider the elements’ history in order to
efficiently and effectively find clusters in dynamic data.
Experiment
Characterize a set of traditional clustering algorithms against dynamic data sets.
o Data Sets:
    Swarm style data
    Traffic style data
    Ant Colony data
    ????
Augment traditional clustering to make the distance measure a function of the elements’ history.
o Use a moving average of each attribute
o Use past cluster labels in the distance measure
o Start partitioning algorithms with the output from the last time interval
Characterize augmented algorithms against dynamic data sets.
o Metrics:
    Timing
    Accuracy
    Consistency: the cluster label is consistent over time (the label does not thrash; the element stays with the “core” group)
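The first two augmentations listed above (a moving average of each attribute and past cluster labels in the distance measure) could be sketched roughly as follows. This is only an illustration under assumptions that are not part of the proposal: the function names, the window size, the Euclidean base distance, and the additive `label_weight` penalty are all hypothetical choices.

```python
import math

def moving_average(history, window):
    """Mean of the last `window` attribute vectors in `history` (oldest first)."""
    recent = history[-window:]
    n = len(recent)
    return [sum(vec[i] for vec in recent) / n for i in range(len(recent[0]))]

def history_distance(hist_a, hist_b, labels_a, labels_b, window=3, label_weight=1.0):
    """Hypothetical history-aware distance between two elements.

    hist_*   : list of attribute vectors, one per time step (oldest first)
    labels_* : list of past cluster labels, one per time step (oldest first)
    """
    # Distance between the smoothed (moving-average) attribute vectors.
    avg_a = moving_average(hist_a, window)
    avg_b = moving_average(hist_b, window)
    spatial = math.sqrt(sum((x - y) ** 2 for x, y in zip(avg_a, avg_b)))

    # Assumed penalty: fraction of recent time steps in which the two
    # elements carried different cluster labels.
    recent = list(zip(labels_a[-window:], labels_b[-window:]))
    if recent:
        disagreement = sum(1 for la, lb in recent if la != lb) / len(recent)
    else:
        disagreement = 0.0
    return spatial + label_weight * disagreement
```

With this form, two elements that have shared a cluster in recent intervals appear closer than their smoothed attributes alone would suggest, which is one possible way to discourage label thrashing.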
Clustering Algorithms
In general, clustering algorithms can be divided into two categories: hierarchical and
partitioning. Hierarchical algorithms build clusters gradually. An agglomerative
hierarchical algorithm starts with each element in its own cluster. The clusters are
iteratively combined to form a dendrogram. Divisive algorithms start with all elements
in one cluster and create the dendrogram from the top down. Partitioning algorithms
create clusters directly by optimizing a function (locally or globally) without creating a
structure like the dendrogram. Partitioning algorithms typically run faster than
hierarchical algorithms, but generally need the number of clusters to be specified in advance.
Partitioning algorithms can be further divided into relocation methods and density-based
methods. Relocation methods work by minimizing a cost function by iteratively relocating
elements to clusters. Density-based methods attempt to cluster densely connected
elements.
We plan to implement a representative set of clustering algorithms and evaluate their
performance on dynamic data sets. From the hierarchical clustering category we plan to
implement a single-link and a complete-link algorithm. These two algorithms differ in
the way they measure distance between clusters. The single-link algorithm measures the
distance between two clusters A and B as the minimum distance between any member of
A and any member of B. The complete-link algorithm measures the distance between
cluster A and cluster B as the maximum distance between any member of cluster A and
any member of cluster B. The complete-link algorithm can find compact clusters, while
the single-link algorithm finds elongated clusters.
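The two inter-cluster distances described above, together with a naive agglomerative merge loop, might look like the sketch below. The function names are illustrative, Euclidean distance between elements is an assumption, and the brute-force pair search is chosen for readability, not speed.

```python
import math
from itertools import product

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_link(cluster_a, cluster_b):
    """Minimum distance between any member of A and any member of B."""
    return min(euclidean(p, q) for p, q in product(cluster_a, cluster_b))

def complete_link(cluster_a, cluster_b):
    """Maximum distance between any member of A and any member of B."""
    return max(euclidean(p, q) for p, q in product(cluster_a, cluster_b))

def agglomerate(points, num_clusters, linkage):
    """Start with one cluster per element and repeatedly merge the two
    clusters that are closest under `linkage` until `num_clusters` remain."""
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

For example, with A = [(0, 0), (1, 0)] and B = [(3, 0), (5, 0)], `single_link(A, B)` is 2.0 and `complete_link(A, B)` is 5.0. Recording each merge instead of stopping at `num_clusters` would yield the dendrogram.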
From the partitioning-relocation category we plan to implement k-means, k-medoids, and
Expectation Maximization (EM). The k-means and k-medoids algorithms work by
randomly assigning elements to k clusters and then iteratively reassigning elements to
the closest cluster and re-computing the cluster’s parameters. They differ in how the clusters
are represented. K-means represents a cluster as the centroid of its members. The
k-medoids algorithm selects a member of the cluster to represent the cluster. The EM
algorithm attempts to determine the distribution of the elements in clusters. EM
iteratively reassigns elements to the clusters that maximize their probability of
membership and then re-estimates the distribution of each cluster.
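The difference between the k-means and k-medoids cluster representations can be shown with one shared relocation loop that takes the representative function as a parameter. This is a minimal sketch, not a tuned implementation: the names `relocate`, `centroid`, and `medoid`, the Euclidean distance, the fixed seed, and the handling of empty clusters (keeping the previous representative) are all assumptions made for the example.

```python
import math
import random

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(members):
    """k-means representative: the mean of the members."""
    n = len(members)
    return tuple(sum(m[i] for m in members) / n for i in range(len(members[0])))

def medoid(members):
    """k-medoids representative: the member closest to all other members."""
    return min(members, key=lambda c: sum(euclidean(c, m) for m in members))

def relocate(points, k, represent, iterations=20, seed=0):
    """Randomly assign elements to k clusters, then alternate between
    recomputing representatives and reassigning to the closest one."""
    rng = random.Random(seed)
    order = list(range(len(points)))
    rng.shuffle(order)
    labels = [0] * len(points)
    for rank, idx in enumerate(order):
        labels[idx] = rank % k  # random assignment with no empty cluster
    reps = [None] * k
    for _ in range(iterations):
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an emptied cluster keeps its old representative
                reps[c] = represent(members)
        new_labels = [min(range(k), key=lambda c: euclidean(p, reps[c]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
    return labels, reps
```

Calling `relocate(points, k, centroid)` gives k-means-style behavior and `relocate(points, k, medoid)` gives k-medoids-style behavior. For dynamic data, the labels from the previous time interval could replace the random initial assignment.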
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and DENCLUE
(DENsity-based CLUstEring) will be implemented to represent density-based
partitioning algorithms. DBSCAN creates clusters from highly connected elements while
DENCLUE clusters elements in highly populated areas. Both algorithms handle outliers
well and will not include them in any cluster.
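To illustrate how a density-based method leaves outliers unclustered, here is a compact sketch of the DBSCAN idea only (DENCLUE is not shown). The parameter names `eps` and `min_pts`, the Euclidean distance, and the brute-force neighborhood search are assumptions made for the example; a practical version would use a spatial index.

```python
import math

NOISE = -1  # label for elements that belong to no cluster

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def region_query(points, i, eps):
    """Indices of all points within `eps` of point i (including i itself)."""
    return [j for j, q in enumerate(points) if euclidean(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    labels = [None] * len(points)  # None means "not yet visited"
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
        cluster += 1
    return labels
```

On two tight groups of four points plus one distant point, with `eps=1.5` and `min_pts=3`, the two groups receive labels 0 and 1 and the distant point keeps the `NOISE` label.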