A Secure Clustering Algorithm
for Distributed Data Streams
Geetha Jagannathan
Rutgers University
Joint work with Krishnan Pillaipakkamnatt and D. Umano
Outline
The problem
Prior results
Clustering data streams
Experimental results and comparison
A privacy-preserving protocol
Conclusion
The problem
Alice and Bob each have a data stream,
defined on the same attributes
(a horizontal partition).
They wish to compute a clustering on the
combined data.
Alice's input: data stream D_1
Bob's input: data stream D_2
Output: a k-clustering of D_1^m U D_2^n, where
D_1^m is the first m elements of D_1, and
D_2^n is the first n elements of D_2.
Clustering on joint data
[Figures: Alice's data, Bob's data, and the combined data, each clustered with k = 4]
Trusted third party
Alice sends D_1^m and Bob sends D_2^n
to a trusted third party, which returns
a k-clustering to each of them.
Privacy requirements
Parties are semi-honest.
Same guarantee as a trusted third party:
the protocol reveals nothing but the final
output, in this case the k cluster centers.
Prior results
PPDM protocols convert distributed DM
algorithms into private ones.
The k-means algorithm is the basis for
many clustering protocols [VC03,
JKM05, JW05, BO07], which "leak"
intermediate information.
[JPW05] presents a leak-free clustering
protocol based on a new clustering
algorithm.
Our Contributions
A leak-free privacy-preserving protocol
for distributed data streams.
A data stream clustering algorithm:
Better than k-means (on average).
Comparable performance with BIRCH on
many data sets, but with lower memory
needs.
Data Stream Algorithms
Data arrives in "stream" fashion:
d_1, d_2, ..., d_n, ... (the "end" of the
stream is not known ahead of time).
Data is too large to fit entirely in memory.
Data can be accessed only in the order
that it arrives.
Each data item can only be "read" once.
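The one-pass constraint above can be sketched with a Python generator: each item can be consumed exactly once, in arrival order, with no way to rewind (the names here are illustrative, not from the paper).

```python
def data_stream(source):
    """Yield items one at a time; once consumed, an item is gone."""
    for item in source:
        yield item

stream = data_stream([3.0, 1.5, 2.2, 4.8])

first = next(stream)   # reads d_1; it cannot be re-read
rest = list(stream)    # consumes the remainder in arrival order

print(first)  # 3.0
print(rest)   # [1.5, 2.2, 4.8]
```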
The clustering algorithm
"Incrementally agglomerative": it merges
intermediate clusters without waiting for
all the data to be available.
Runs in time linear in n.
Overview of clustering algorithm
[Figure: with k = 5, level-0 clusterings are merged into level-1 and
level-2 clusterings; output expected after n = 25 data points]
Clustering Algorithm Outline
The algorithm maintains a list of
k-clusterings (each clustering is on some
partial data).
In each iteration:
Input the next k data points as a level-0
clustering.
If two clusterings at level i are in the list,
"merge" them into a level-(i + 1)
k-clustering.
Clustering algorithm outline
If output is needed after some n points
have been read, all k-clusterings are
"merged" into a single k-clustering.
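The level-merging loop above can be sketched in Python. The cluster representation (centroid plus weight dicts) and the helper names are assumptions for illustration; the paper's exact bookkeeping may differ.

```python
def merge_pair(c1, c2):
    """Union of two weighted clusters: weighted mean of centroids."""
    w = c1["weight"] + c2["weight"]
    centroid = [(a * c1["weight"] + b * c2["weight"]) / w
                for a, b in zip(c1["centroid"], c2["centroid"])]
    return {"centroid": centroid, "weight": w}

def merge_error(c1, c2):
    """Merge error: weight_1 * weight_2 * dist(c1, c2)^2."""
    d2 = sum((a - b) ** 2 for a, b in zip(c1["centroid"], c2["centroid"]))
    return c1["weight"] * c2["weight"] * d2

def merge_clusterings(clusters, k):
    """Greedily union the lowest-error pair until only k clusters remain."""
    clusters = list(clusters)
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: merge_error(clusters[p[0]], clusters[p[1]]))
        merged = merge_pair(clusters[i], clusters[j])
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
    return clusters

def stream_cluster(stream, k):
    """One pass over the stream, keeping at most one k-clustering per level."""
    levels = {}   # level -> k-clustering awaiting a same-level partner
    buffer = []
    for point in stream:
        buffer.append({"centroid": list(point), "weight": 1})
        if len(buffer) == k:        # the next k points form a level-0 clustering
            clustering, level = buffer, 0
            buffer = []
            # two clusterings at the same level: merge into level + 1
            while level in levels:
                clustering = merge_clusterings(levels.pop(level) + clustering, k)
                level += 1
            levels[level] = clustering
    # on demand, merge everything read so far into a single k-clustering
    leftovers = buffer + [c for cl in levels.values() for c in cl]
    return merge_clusterings(leftovers, k)
```

Since every point enters exactly one level-0 clustering and merging only combines weights, the output weights always sum to the number of points read.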
"Merging" clusterings
We have a set S of clusters, with |S| > k.
We need a set S' of k clusters.
S' = S
Repeat
Compute the merge error for every pair of clusters
Take the union of the pair with the lowest error
Until |S'| = k
Error(C_1 U C_2) = C_1.weight * C_2.weight * (dist(C_1, C_2))^2
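The merge step above can be sketched directly: repeatedly union the pair of clusters with the lowest merge error until only k remain. Here clusters are (centroid, weight) tuples, an illustrative representation not taken from the paper.

```python
def merge_error(c1, c2):
    """Error(C1 U C2) = weight_1 * weight_2 * dist(C1, C2)^2."""
    (x1, w1), (x2, w2) = c1, c2
    return w1 * w2 * sum((a - b) ** 2 for a, b in zip(x1, x2))

def merge_down(clusters, k):
    """Greedily merge the lowest-error pair until k clusters remain."""
    clusters = list(clusters)
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: merge_error(clusters[p[0]], clusters[p[1]]))
        (x1, w1), (x2, w2) = clusters[i], clusters[j]
        w = w1 + w2
        merged = (tuple((a * w1 + b * w2) / w for a, b in zip(x1, x2)), w)
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
    return clusters

# two tight groups of 1-D points collapse to their weighted means
result = merge_down([((0.0,), 1), ((0.1,), 1), ((5.0,), 1), ((5.2,), 1)], 2)
print(result)
```

Weighting the squared distance by both cluster weights penalizes merging two large, well-separated clusters more than absorbing a small outlier cluster into a nearby one.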
Sample results (offset grid)
Sample results (vs. k-means)
Sample result (vs. BIRCH)
Realistic Data (Network Intrusion)

Algorithm      Mem. allowed (× 24000 bytes)   ESS
StreamCluster    1                            4.1E14
BIRCH            1                            *
BIRCH            2                            *
BIRCH            4                            *
BIRCH           32                            *
BIRCH           64                            4.8E17
BIRCH          128                            4.8E17
The Secure Protocol
Input: Alice owns data stream D_1,
Bob owns data stream D_2.
Output: k clusters on D_1^m U D_2^n.
1. Alice computes O(k log(m/k)) cluster centers
and Bob computes O(k log(n/k)) cluster centers.
2. Alice and Bob securely share their cluster
centers.
3. They securely merge the clusters.
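The data flow of these three steps can be simulated without the cryptography: each party reduces its own stream to a small set of weighted centers, the centers are exchanged in the clear, and the union is merged down to k clusters. This is a non-private sketch only; all names and the greedy merge are illustrative assumptions, and the secure sharing and secure merging of the actual protocol are replaced here by plaintext exchange.

```python
def merge_error(c1, c2):
    (x1, w1), (x2, w2) = c1, c2
    return w1 * w2 * sum((a - b) ** 2 for a, b in zip(x1, x2))

def merge_down(clusters, k):
    """Greedily union the lowest-error pair until k clusters remain."""
    clusters = list(clusters)
    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: merge_error(clusters[p[0]], clusters[p[1]]))
        (x1, w1), (x2, w2) = clusters[i], clusters[j]
        w = w1 + w2
        merged = (tuple((a * w1 + b * w2) / w for a, b in zip(x1, x2)), w)
        clusters = [c for t, c in enumerate(clusters) if t not in (i, j)]
        clusters.append(merged)
    return clusters

def local_summary(stream, k):
    # Step 1: each party summarizes its own stream as weighted centers.
    return merge_down([((x,), 1) for x in stream], k)

alice_stream = [0.0, 0.2, 0.1, 9.8, 10.0]
bob_stream = [0.3, 10.1, 9.9, 10.2]

# Steps 2-3 (plaintext stand-in): exchange summaries, merge the union.
joint = merge_down(local_summary(alice_stream, 2) +
                   local_summary(bob_stream, 2), 2)
print(joint)
```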
Sample Run
(Distributed non-private protocol)
Complexity
Communication complexity: O((k log(mn/k^2))^2)
Non-private setting (one party sends the
intermediate clusters to the other):
Comm. complexity: O(k log(m/k))