Clustering Data Streams

sharpfartsAI and Robotics

Nov 8, 2013 (3 years and 7 months ago)

65 views

Clustering Data Streams

Chun Wei

Dept Computer & Information Technology

Advisor: Dr. Sprague

Data Stream


Massive data sets accumulated at an
astonishing rate.


Examples:


Tracking network data to study change in
traffic patterns and possible intrusions


Tracking meteorological data, such as
temperatures

NASA MISR satellite




Collects several TB of
satellite imagery
data daily

Challenges


Compactness of data representation


Fast, incremental processing of new data
points (one
-
pass and linear access of
data)


Clear and fast identification of changes
in evolving clustering models

Compactness


Utilize a data structure that summarizes
a group of data points, minimizing the
storage space


The space required does not grow
appreciably with the number of points
processed

Incremental Processing of data


When clustering new data points, the
algorithm should not require comparison
with all the points processed in the past


The data must be processed as they are
produced. Linear scan is required,
random access is prohibitively
expensive.

Identification of Changes

The algorithm must be able to:


diagnose changes in evolving data
streams


distinguish outliers from data points that
represent a new cluster

Current Algorithms


BIRCH


STREAM


CLU
-
STREAM




BIRCH


Use CF vectors to store data


CF = (N,

X
i
2

,

X
i

) X
i

is a vector


Store the number of points, the linear sum
and the square sum of all data points in a
micro
-
cluster


Sufficient to calculate centroids, radius,
diameter and distances



N
i
i
X
1


N
i
i
X
1
B
-
Tree


Root

16

22

29

2*

3*

5*

7*

14*

16*

19*

20*

22*

24*

27*

29*

33*

34*

38*

39*

7

Building of CF Tree


B
-
Tree, with a branch factor B, threshold
T and L maximum number of entries in a
leave node

CF3

CF1

CF2

CF3

CF6

CF4

CF5

CF6

Leaf node

Non
-
Leaf node

Adjusting CF Tree


Increases the threshold T so that each
leaf entry to absorb more points. T can
be set as radius or diameter.


Leaf entries with “far fewer” points are
regarded as “outliers” and written back to
disk.

STREAM


Process data streams in batches of
points


Use weighted centroids C
i

to represent
ith batch of points.


Recursively cluster the weighted
centroids until k
-
clusters

Problems with BIRCH & STREAM


Old data points are equally important as
new data points


May not be able to detect new trends in
evolving data stream

CLU
-
STREAM


Also use CF vectors to store data
summary


Use time stamps to record the elapsed
time from the beginning


Take snapshots at different time stamp,
favoring the most recent data

(Snapshot: micro
-
clusters stored at
particular moments in the stream)

CLU
-
STREAM (continue)


A snapshot contains q micro
-
clusters, q
depends on the memory available


New data points will be assigned to one
of the micro
-
clusters in previous
snapshot if it falls within the maximum
boundary of that micro
-
cluster.


CLU
-
STREAM (continue)


If a new data points fails to fit into any
current cluster, a new cluster is created,
and an existing one is deleted or two
merged.


A cluster is removed if the average time
-
stamp when it absorbs
m
new data
points is least recent.

Detect New Trends


Comparing clustering results from
snapshots to snapshots reveals trends in
evolving data stream.

References


Aggarwal, C. C., Han J., Wang, J. & Yu, P. S. (2003). A Framework for
Clustering Evolving Data Stream.
In Proc. of the 29
th

VLDB Conference
.


Barbara, D. (2003). Requirements for Clustering Data Streams.
SIGKDD
Explorations, 3 (2
), 23
-
27.


Ester M., Kriegel H.
-
P., Sander J. & Xu X (1998). Clustering for Mining in
Large Spatial Databases.
Special Issue on Data Mining, KI
-
Journal,
ScienTec Publishing, No. 1
.


Guha, S., Meyerson, A., Mishra, N. & Motwani, R., Callaghan, L. (2003).
Clustering Data Streams: Theory and Practice.
IEEE Transactions on
Knowledge and Data Engineering, 15 (3
), 515
-
528.


Zhang, T., Ramakrishnan R. & Livny, M. (1996). BIRCH: An Efficient
Data Clustering Method for Very Large Databases.
In Proc. of ACM
SIGMOD International Conference on Management of Data
.