# Clustering Data Streams

AI and Robotics

Nov 8, 2013 (4 years and 8 months ago)

102 views

Clustering Data Streams

Chun Wei

Dept Computer & Information Technology

Data Stream

Massive data sets accumulated at an
astonishing rate.

Examples:

Tracking network data to study change in
traffic patterns and possible intrusions

Tracking meteorological data, such as
temperatures

NASA MISR satellite

Collects several TB of
satellite imagery
data daily

Challenges

Compactness of data representation

Fast, incremental processing of new data
points (one
-
pass and linear access of
data)

Clear and fast identification of changes
in evolving clustering models

Compactness

Utilize a data structure that summarizes
a group of data points, minimizing the
storage space

The space required does not grow
appreciably with the number of points
processed

Incremental Processing of data

When clustering new data points, the
algorithm should not require comparison
with all the points processed in the past

The data must be processed as they are
produced. Linear scan is required,
random access is prohibitively
expensive.

Identification of Changes

The algorithm must be able to:

diagnose changes in evolving data
streams

distinguish outliers from data points that
represent a new cluster

Current Algorithms

BIRCH

STREAM

CLU
-
STREAM

BIRCH

Use CF vectors to store data

CF = (N,

X
i
2

,

X
i

) X
i

is a vector

Store the number of points, the linear sum
and the square sum of all data points in a
micro
-
cluster

diameter and distances

N
i
i
X
1

N
i
i
X
1
B
-
Tree

Root

16

22

29

2*

3*

5*

7*

14*

16*

19*

20*

22*

24*

27*

29*

33*

34*

38*

39*

7

Building of CF Tree

B
-
Tree, with a branch factor B, threshold
T and L maximum number of entries in a
leave node

CF3

CF1

CF2

CF3

CF6

CF4

CF5

CF6

Leaf node

Non
-
Leaf node

Increases the threshold T so that each
leaf entry to absorb more points. T can
be set as radius or diameter.

Leaf entries with “far fewer” points are
regarded as “outliers” and written back to
disk.

STREAM

Process data streams in batches of
points

Use weighted centroids C
i

to represent
ith batch of points.

Recursively cluster the weighted
centroids until k
-
clusters

Problems with BIRCH & STREAM

Old data points are equally important as
new data points

May not be able to detect new trends in
evolving data stream

CLU
-
STREAM

Also use CF vectors to store data
summary

Use time stamps to record the elapsed
time from the beginning

Take snapshots at different time stamp,
favoring the most recent data

(Snapshot: micro
-
clusters stored at
particular moments in the stream)

CLU
-
STREAM (continue)

A snapshot contains q micro
-
clusters, q
depends on the memory available

New data points will be assigned to one
of the micro
-
clusters in previous
snapshot if it falls within the maximum
boundary of that micro
-
cluster.

CLU
-
STREAM (continue)

If a new data points fails to fit into any
current cluster, a new cluster is created,
and an existing one is deleted or two
merged.

A cluster is removed if the average time
-
stamp when it absorbs
m
new data
points is least recent.

Detect New Trends

Comparing clustering results from
snapshots to snapshots reveals trends in
evolving data stream.

References

Aggarwal, C. C., Han J., Wang, J. & Yu, P. S. (2003). A Framework for
Clustering Evolving Data Stream.
In Proc. of the 29
th

VLDB Conference
.

Barbara, D. (2003). Requirements for Clustering Data Streams.
SIGKDD
Explorations, 3 (2
), 23
-
27.

Ester M., Kriegel H.
-
P., Sander J. & Xu X (1998). Clustering for Mining in
Large Spatial Databases.
Special Issue on Data Mining, KI
-
Journal,
ScienTec Publishing, No. 1
.

Guha, S., Meyerson, A., Mishra, N. & Motwani, R., Callaghan, L. (2003).
Clustering Data Streams: Theory and Practice.
IEEE Transactions on
Knowledge and Data Engineering, 15 (3
), 515
-
528.

Zhang, T., Ramakrishnan R. & Livny, M. (1996). BIRCH: An Efficient
Data Clustering Method for Very Large Databases.
In Proc. of ACM
SIGMOD International Conference on Management of Data
.