# A Secure Clustering Algorithm

25 Nov 2013

A Secure Clustering Algorithm
for Distributed Data Streams

Geetha Jagannathan

Rutgers University

Joint work with Krishnan Pillaipakkamnatt and D. Umano

Outline

The problem

Prior results

Clustering data streams

Experimental results and comparison

A privacy-preserving protocol

Conclusion

The problem

Alice and Bob each have a data stream,
defined on the same attributes.

(horizontal partition)

They wish to compute a clustering on the
combined data.

Alice
Input: Data stream $D^1$

Bob
Input: Data stream $D^2$

Output: k-clustering of $D^1_m \cup D^2_n$

$D^1_m$: the first $m$ elements of $D^1$

$D^2_n$: the first $n$ elements of $D^2$

Clustering on joint data

[Figures: Alice's data, Bob's data, and the combined data, each clustered with k = 4.]

Trusted third party

[Figure: Alice sends $D^1_m$ and Bob sends $D^2_n$ to a trusted third party; each receives the k-clustering.]

Privacy requirements

Parties are semi-honest

Same as trusted third party

Reveals nothing but the final output

In this case, the k cluster centers

Prior results

PPDM protocols convert distributed DM
algorithms into private ones

The k-means algorithm is the basis for
many clustering protocols [VC03,
JKM05, JW05, BO07]

“Leak” intermediate information

[JPW05] presents a leak-free clustering
protocol based on a new clustering
algorithm.

Our Contributions

A leak-free privacy-preserving protocol
for distributed data streams.

A data stream clustering algorithm

Better than k-means (on average)

Comparable performance with BIRCH on
many data sets, but with lower memory
needs.

Data Stream Algorithms

Data arrives in “stream” fashion:
$d_1, d_2, \ldots, d_n, \ldots$ (the “end” of the stream is not
known ahead of time).

Data is too large to fit entirely in memory.

Data can be accessed only in the order
that it arrives.

Each data item can only be “read” once.
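
The constraints above can be illustrated with a minimal Python sketch, assuming numeric data items. It consumes a stream in arrival order, reads each item exactly once, and keeps only a constant-size summary (a running mean and a count). The names `running_summary` and `sensor` are hypothetical and do not come from the slides.

```python
from typing import Iterable, Iterator, Tuple


def running_summary(stream: Iterable[float]) -> Tuple[float, int]:
    """One pass over the stream; memory use is independent of its length."""
    mean, count = 0.0, 0
    for d in stream:          # items are seen once, in arrival order
        count += 1
        mean += (d - mean) / count   # incremental mean update
    return mean, count


def sensor() -> Iterator[float]:
    """Stand-in data source; a real stream would have no known end."""
    yield from (2.0, 4.0, 6.0, 8.0)


print(running_summary(sensor()))  # (5.0, 4)
```

A (mean, count) pair of this kind is also a natural summary for a cluster, which is how the later sketches represent a center and its weight.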

The clustering algorithm

“Incrementally agglomerative”: It merges
intermediate clusters without waiting for
all the data to be available.

Runs in time linear in n.

Overview of clustering algorithm

[Figure: example run with k = 5, showing the level-0, level-1, and level-2 clusterings and the output expected after n = 25 data points.]

Clustering Algorithm Outline

The algorithm maintains a list of k-clusterings (each clustering is on some
partial data).

In each iteration:

Input the next k data points as a level-0 clustering.

If two clusterings at level i are in the list,
“merge” them into a level-(i + 1) k-clustering.

Clustering algorithm outline

If output is needed after some n points
have been read, all k-clusterings are
“merged” into a single k-clustering.
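
The bookkeeping in this outline can be sketched in Python as follows. This is only a sketch of the level list, under stated assumptions: the merge step is passed in as a callable rather than implemented here, the final output folds the remaining clusterings together from the lowest level upward (the slides do not fix an order), and any leftover points that do not fill a batch of k are simply returned to the caller. The names `insert_clustering`, `process_stream`, and `final_clustering` are hypothetical.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple, TypeVar

P = TypeVar("P")  # a data point
C = TypeVar("C")  # a k-clustering (representation left to the caller)


def insert_clustering(levels: Dict[int, C], new: C,
                      merge: Callable[[C, C], C]) -> None:
    """Add a level-0 clustering; merge equal levels upward, like a binary counter."""
    level = 0
    while level in levels:
        new = merge(levels.pop(level), new)
        level += 1
    levels[level] = new


def process_stream(points: Iterable[P], k: int,
                   make_level0: Callable[[List[P]], C],
                   merge: Callable[[C, C], C]) -> Tuple[Dict[int, C], List[P]]:
    """Read points once, k at a time; return the level list and leftover points."""
    levels: Dict[int, C] = {}
    batch: List[P] = []
    for p in points:
        batch.append(p)
        if len(batch) == k:
            insert_clustering(levels, make_level0(batch), merge)
            batch = []
    return levels, batch


def final_clustering(levels: Dict[int, C],
                     merge: Callable[[C, C], C]) -> Optional[C]:
    """Fold all stored clusterings into one (order is an assumption)."""
    result: Optional[C] = None
    for level in sorted(levels):
        result = levels[level] if result is None else merge(levels[level], result)
    return result


# Placeholder demo: a "clustering" is just the number of points it summarizes.
count_merge = lambda a, b: a + b
levels, leftover = process_stream(range(25), 5, len, count_merge)
print(sorted(levels.items()))                 # [(0, 5), (2, 20)]
print(final_clustering(levels, count_merge))  # 25
```

With k = 5 and n = 25 points, the list holds one level-0 clustering (5 points) and one level-2 clustering (20 points) before the final merge, so only a logarithmic number of clusterings is kept at any time.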

“Merging” clusterings

Have a set S of clusters, with |S| > k.

Need a set S' of k clusters.

S' = S

Repeat

  Compute merge error for every pair of clusters in S'

  Take the union of the pair with lowest error

Until |S'| = k

$\mathrm{Error}(C_1 \cup C_2) = C_1.\mathrm{weight} \cdot C_2.\mathrm{weight} \cdot \mathrm{dist}(C_1, C_2)^2$
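
The merge loop and error formula above can be sketched in Python as follows. Several details are assumptions, not taken from the slides: a cluster is stored as a (center, weight) pair, dist is the Euclidean distance between centers, and the union of two clusters is their weight-averaged center with the weights added. The names `merge_error`, `union`, and `merge_clusters` are hypothetical.

```python
import math
from itertools import combinations
from typing import List, Tuple

Cluster = Tuple[Tuple[float, ...], float]  # (center, weight); assumed representation


def merge_error(c1: Cluster, c2: Cluster) -> float:
    """C1.weight * C2.weight * dist(C1, C2)^2, with dist assumed Euclidean."""
    (x1, w1), (x2, w2) = c1, c2
    return w1 * w2 * math.dist(x1, x2) ** 2


def union(c1: Cluster, c2: Cluster) -> Cluster:
    """Assumed union: weighted mean of the centers, sum of the weights."""
    (x1, w1), (x2, w2) = c1, c2
    w = w1 + w2
    center = tuple((w1 * a + w2 * b) / w for a, b in zip(x1, x2))
    return center, w


def merge_clusters(clusters: List[Cluster], k: int) -> List[Cluster]:
    """Repeatedly merge the pair with the lowest error until k clusters remain."""
    s = list(clusters)
    while len(s) > k:
        i, j = min(combinations(range(len(s)), 2),
                   key=lambda ij: merge_error(s[ij[0]], s[ij[1]]))
        merged = union(s[i], s[j])
        s = [c for idx, c in enumerate(s) if idx not in (i, j)] + [merged]
    return s


points = [((0.0,), 1), ((1.0,), 1), ((10.0,), 1), ((11.0,), 1)]
print(sorted(merge_clusters(points, 2)))  # [((0.5,), 2), ((10.5,), 2)]
```

Because the error is weighted by both cluster sizes, two heavy clusters are merged only when their centers are very close, while light clusters are absorbed more readily.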

Sample results (offset grid)

Sample results (vs. k-means)

Sample result (vs. BIRCH)

Realistic Data (Network Intrusion)

| Algorithm | Mem. allowed (× 24000 bytes) | ESS |
|---|---|---|
| StreamCluster | 1 | 4.1E14 |
| BIRCH | 1 | * |
| BIRCH | 2 | * |
| BIRCH | 4 | * |
| BIRCH | 32 | * |
| BIRCH | 64 | 4.8E17 |
| BIRCH | 128 | 4.8E17 |

The Secure Protocol

Input: Alice owns data stream $D^1$

Bob owns data stream $D^2$

Output: k-clusters on $D^1_m \cup D^2_n$

1. Alice computes $O(k \log(m/k))$ cluster centers
and Bob computes $O(k \log(n/k))$ cluster
centers

2. Alice and Bob securely share their cluster
centers

3. They securely merge clusters

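
One way to see where a bound of the form O(k log(m/k)) in step 1 can come from is to count the clusterings a party holds after reading m points. The Python sketch below assumes the level list behaves like a binary counter over batches of k points, which is an interpretation of the algorithm outline rather than a statement from the slides; it ignores any incomplete final batch. The names `centers_held` and `max_centers` are hypothetical.

```python
def centers_held(m: int, k: int) -> int:
    """Centers held after m points: k per stored clustering (binary-counter assumption)."""
    batches = m // k                         # complete level-0 batches read so far
    clusterings = bin(batches).count("1")    # one stored clustering per set bit
    return k * clusterings


def max_centers(m: int, k: int) -> int:
    """Worst case for up to m points: every level occupied, about k * log2(m/k)."""
    return k * (m // k).bit_length()


print(centers_held(1000, 5))  # 15
print(max_centers(1000, 5))   # 40
```

For m = 1000 and k = 5 there are 200 batches, so at most 8 levels can be occupied, giving at most 40 centers to share instead of 1000 points.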
Sample Run

(Distributed non-private protocol)

Complexity

Communication complexity: $O\big((k \log(mn/k^2))^2\big)$

Non-private setting (one party sends the
intermediate clusters to the other)

Communication complexity: $O(k \log(m/k))$