Nov 8, 2013

Clustering of Dynamic Data

Introduction

Data clustering is a field of active research in machine learning and data mining. Most of the work has focused on static data sets. There has been little work on clustering of dynamic data. We define a dynamic data set as a set of elements whose parameters change over time. A flock of flying birds is an example of a dynamic data set. We are interested in exploring algorithms that are capable of finding relationships amongst the elements in a dynamic data set. In this paper we evaluate the use of data clustering techniques developed for static data sets on dynamic data.


Hypothesis

Traditional clustering algorithms used in data mining will not perform well on dynamic data sets. A clustering algorithm must consider the elements' history in order to efficiently and effectively find clusters in dynamic data.


Experiment



- Characterize a set of traditional clustering algorithms against dynamic data sets.
    o Data sets:
        - Swarm-style data
        - Traffic-style data
        - Ant colony data
        - ????
- Augment traditional clustering to make the distance measure a function of the elements' history.
    o Use a moving average of each attribute.
    o Use past cluster labels in the distance measure.
    o Start partitioning algorithms with the output from the last time interval.
- Characterize augmented algorithms against dynamic data sets.
    o Metrics:
        - Timing
        - Accuracy
        - Consistency
            - Cluster label is consistent over time (label does not thrash -> stays with the "core" group)
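
The history-based augmentations and the consistency metric listed above could be sketched as follows. This is a minimal illustration in Python, not the implementation evaluated in this work. The names `MovingAverage`, `history_distance`, and `label_consistency`, the `window` and `discount` parameters, the use of Euclidean distance, and the idea of shrinking the distance between elements that shared a label are all assumptions made for the example.

```python
import math
from collections import deque


class MovingAverage:
    """Running mean of the last `window` attribute vectors of one element."""

    def __init__(self, window):
        self.history = deque(maxlen=window)  # oldest vectors fall off automatically

    def update(self, attributes):
        """Record the newest attribute vector and return the smoothed vector."""
        self.history.append(tuple(attributes))
        n = len(self.history)
        return tuple(sum(column) / n for column in zip(*self.history))


def history_distance(a, b, shared_last_label, discount=0.5):
    """Euclidean distance, shrunk when both elements had the same label last step.

    The multiplicative `discount` is a hypothetical way of folding past
    cluster labels into the distance measure.
    """
    distance = math.dist(a, b)
    return distance * discount if shared_last_label else distance


def label_consistency(labels_over_time):
    """Fraction of consecutive time steps at which elements kept their label.

    labels_over_time: list of dicts (one per time step), element id -> label.
    Assumes cluster labels are comparable from one time step to the next.
    """
    kept = total = 0
    for previous, current in zip(labels_over_time, labels_over_time[1:]):
        for element, label in current.items():
            if element in previous:
                total += 1
                kept += previous[element] == label
    return kept / total if total else 1.0
```

A consistency score of 1.0 means no element changed label between consecutive time steps; lower values indicate thrashing.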


Clustering Algorithms

In general, clustering algorithms can be divided into two categories: hierarchical and partitioning. Hierarchical algorithms build clusters gradually. An agglomerative hierarchical algorithm starts with each element in its own cluster. The clusters are iteratively combined to form a dendrogram. Divisive algorithms start with all elements in one cluster and create the dendrogram from the top down. Partitioning algorithms create clusters directly by optimizing a function (locally or globally) without creating a structure like the dendrogram. Partitioning algorithms typically run faster than hierarchical algorithms, but typically need the number of clusters to be specified in advance. Partitioning algorithms can be further divided into relocation methods and density-based methods. Relocation methods work by minimizing a cost function by iteratively relocating elements to clusters. Density-based methods attempt to cluster densely connected elements.


We plan to implement a representative set of clustering algorithms and evaluate their performance on dynamic data sets. From the hierarchical clustering category we plan to implement a single-link and a complete-link algorithm. These two algorithms differ in the way they measure distance between clusters. The single-link algorithm measures the distance between two clusters A and B as the minimum distance between any member of A and any member of B. The complete-link algorithm measures the distance between cluster A and cluster B as the maximum distance between any member of cluster A and any member of cluster B. The complete-link algorithm tends to find compact clusters, while the single-link algorithm tends to find elongated clusters.
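
The two linkage measures can be written down directly from their definitions. The sketch below is illustrative only: it assumes Euclidean distance between points, and the helper `agglomerate` (a hypothetical name) stops at a requested number of clusters instead of recording the full dendrogram.

```python
import math
from itertools import product


def single_link(cluster_a, cluster_b):
    """Minimum distance between any member of A and any member of B."""
    return min(math.dist(p, q) for p, q in product(cluster_a, cluster_b))


def complete_link(cluster_a, cluster_b):
    """Maximum distance between any member of A and any member of B."""
    return max(math.dist(p, q) for p, q in product(cluster_a, cluster_b))


def agglomerate(points, linkage, target_clusters):
    """Naive agglomerative clustering: merge the closest pair until
    `target_clusters` clusters remain."""
    clusters = [[p] for p in points]  # every element starts in its own cluster
    while len(clusters) > target_clusters:
        pairs = (
            (i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        # Pick the pair of clusters with the smallest linkage distance.
        i, j = min(pairs, key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters
```

Passing `single_link` or `complete_link` as the `linkage` argument is the only difference between the two algorithms in this sketch.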


From the partitioning-relocation category we plan to implement k-means, k-medoids, and Expectation Maximization (EM). The k-means and k-medoids algorithms work by randomly assigning elements to k clusters and then iteratively reassigning elements to the closest cluster and re-computing the cluster's parameters. They differ in how the clusters are represented. K-means represents a cluster as the centroid of its members. The k-medoids algorithm selects a member of the cluster to represent the cluster.
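
A minimal sketch of the two cluster representatives and of the k-means loop follows. It is an illustration under stated assumptions, not the planned implementation: it uses Euclidean distance, it seeds the centers by sampling k points instead of randomly assigning elements to clusters, and the optional `initial_centers` argument (a hypothetical name) shows one way the output of the last time interval could be used as a starting point.

```python
import math
import random


def centroid(members):
    """k-means representative: the coordinate-wise mean of the members."""
    n = len(members)
    return tuple(sum(column) / n for column in zip(*members))


def medoid(members):
    """k-medoids representative: the member with the smallest total
    distance to all other members."""
    return min(members, key=lambda m: sum(math.dist(m, other) for other in members))


def k_means(points, k, initial_centers=None, iterations=100, seed=0):
    """Return (labels, centers). `initial_centers` allows a warm start."""
    rng = random.Random(seed)
    centers = list(initial_centers) if initial_centers else rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Reassign every element to its closest center.
        new_labels = [
            min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments are stable: converged
            break
        labels = new_labels
        # Re-compute each center from its current members.
        for c in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = centroid(members)
    return labels, centers
```

Replacing `centroid` with `medoid` in the update step gives a simple k-medoids variant of the same loop.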

The EM algorithm attempts to determine the distribution of the elements in clusters. EM iteratively reassigns elements to the clusters that maximize their probability of membership and then re-estimates the distribution of each cluster.
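
As an illustration, the sketch below runs EM on one-dimensional data. Several choices here are assumptions, not details taken from this work: each cluster is modeled as a Gaussian, membership is soft (each element gets a probability for every cluster, and hard labels are read off at the end), and the names `em_1d`, `min_var`, and the initial unit variances are hypothetical.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_1d(values, initial_means, iterations=50, min_var=1e-6):
    """Fit a 1-D Gaussian mixture; return (hard labels, estimated means)."""
    k = len(initial_means)
    means = list(initial_means)
    variances = [1.0] * k
    weights = [1.0 / k] * k

    def responsibilities():
        """E-step: probability of membership of each value in each cluster."""
        table = []
        for x in values:
            scores = [weights[c] * gaussian_pdf(x, means[c], variances[c]) for c in range(k)]
            total = sum(scores) or 1e-300  # guard against underflow to zero
            table.append([s / total for s in scores])
        return table

    for _ in range(iterations):
        resp = responsibilities()
        # M-step: re-estimate weight, mean, and variance of each cluster.
        for c in range(k):
            mass = sum(r[c] for r in resp)
            if mass == 0:
                continue  # no element supports this cluster; leave it unchanged
            weights[c] = mass / len(values)
            means[c] = sum(r[c] * x for r, x in zip(resp, values)) / mass
            spread = sum(r[c] * (x - means[c]) ** 2 for r, x in zip(resp, values)) / mass
            variances[c] = max(min_var, spread)  # floor avoids a collapsed cluster

    resp = responsibilities()
    labels = [max(range(k), key=lambda c: r[c]) for r in resp]
    return labels, means
```

The final hard labels assign each element to the cluster with the highest probability of membership, which matches the reassignment step described above.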


DBSCAN (Density-Based Spatial Clustering of Applications with Noise) and DENCLUE (DENsity-based CLUstEring) will be implemented to represent density-based partitioning algorithms. DBSCAN creates clusters from highly connected elements while DENCLUE clusters elements in highly populated areas. Both algorithms handle outliers well and will not include them in any cluster.
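
To make the density-based idea concrete, here is a minimal DBSCAN sketch. It is a simplified illustration, not the planned implementation: it assumes Euclidean distance, finds neighbours by brute force, and uses the hypothetical parameter names `eps` (neighbourhood radius) and `min_points` (neighbours required for a dense region, counting the point itself).

```python
import math

NOISE = -1  # label given to outliers that belong to no cluster


def dbscan(points, eps, min_points):
    """Return one label per point: a cluster index, or NOISE for outliers."""

    def neighbours(i):
        # Brute-force neighbourhood query; includes the point itself.
        return [j for j in range(len(points)) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_points:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1  # point i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbours(j)
            if len(reach) >= min_points:  # j is also a core point: keep expanding
                queue.extend(reach)
    return labels
```

Points that are never reached from a dense region keep the `NOISE` label, which is how outliers are left out of every cluster.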

