Distributed Clustering for Robust
Aggregation in Large Networks

Ittay Eyal, Idit Keidar, Raphi Rom

Technion, Israel

Aggregation in Sensor Networks


Applications

Temperature sensors thrown in the woods

Seismic sensors

Grid computing load


Aggregation in Sensor Networks

Large networks, light nodes, low bandwidth

Fault-prone sensors and network

Multi-dimensional data (location × temperature)

Target is a function of all sensed data:
average temperature, max location, majority…

What has been done?

Tree Aggregation

Hierarchical solution

Fast: O(height of tree)

Limited to static topology

No failure robustness
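The hierarchical scheme can be sketched in a few lines of Python. This is a generic illustration with a hypothetical `Node` class, not code from the systems the talk surveys: each node folds its children's partial aggregates into its own, so the root obtains the answer in O(height) steps.

```python
# Sketch of tree aggregation: each node folds its children's partial
# aggregates into its own, so the root's answer takes O(height) rounds.
# The Node class is hypothetical, for illustration only.

class Node:
    def __init__(self, value, children=()):
        self.value = value
        self.children = list(children)

    def aggregate(self):
        """Return (sum, count) for the subtree -- enough for an average."""
        total, count = self.value, 1
        for child in self.children:
            s, c = child.aggregate()
            total += s
            count += c
        return total, count

# A small tree: root 1 with children 3 and 9; node 9 has children 11 and 2.
root = Node(1, [Node(3), Node(9, [Node(11), Node(2)])])
s, n = root.aggregate()
print(s / n)  # average of 1, 3, 9, 11, 2 -> 5.2
```

The recursion also makes the scheme's weakness concrete: the aggregate is only defined relative to a fixed parent-child structure, which is why the approach breaks under topology changes and failures.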

Gossip

D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In FOCS, 2003.

S. Nath, P. B. Gibbons, S. Seshan, and Z. R. Anderson. Synopsis diffusion for robust aggregation in sensor networks. In SenSys, 2004.

Each node maintains a synopsis

Occasionally, each node contacts a neighbor and they improve their synopses

Indifferent to topology changes

Crash robust

Proven convergence

No data error robustness
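The gossip idea can be illustrated with the classic pairwise-averaging synopsis. This is a simplified sketch in the spirit of the cited work, not Kempe et al.'s exact push-sum algorithm: on each contact, both parties replace their estimates with the pair average, which preserves the global sum and drives every estimate toward the mean.

```python
import random

# Sketch of gossip averaging: each node keeps one synopsis (its current
# estimate); in every round each node contacts a random peer and both
# replace their estimates with the pair's average. The global sum is
# preserved, so all estimates converge to the true mean.

def gossip_average(values, rounds=200, seed=0):
    rng = random.Random(seed)
    est = list(values)
    n = len(est)
    for _ in range(rounds):
        for i in range(n):
            j = rng.randrange(n)
            avg = (est[i] + est[j]) / 2
            est[i] = est[j] = avg
    return est

est = gossip_average([11, 9, 3, 1, 7, 5])
print(est)  # every entry is (numerically) the true mean, 6.0
```

Note how this sketch also exhibits the flaw the talk targets: a single corrupted input value shifts the preserved sum, and with it every node's final estimate.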

A Closer Look at the Problem

The Implications of Irregular Data

A single erroneous sample can radically offset the data

The average (47°) doesn't tell the whole story

[Figure: readings 25°, 26°, 25°, 28°, 27°, 27°, with erroneous readings of 98° and 120°]

Sources of Irregular Data

Sensor malfunction: short circuit in a seismic sensor

Sensing error: an animal sitting on a temperature sensor

Software bugs: in grid computing, a machine reports negative CPU usage

Interesting info (fire outbreak): extremely high temperature in a certain area of the woods

Interesting info (DDoS): irregular load on some machines in a grid

Interesting info (intrusion): a truck driving by a seismic detector

It Would Help to Know the Data Distribution

The average is 47°

Bimodal distribution with peaks at 26.3° and 109°

[Figure: readings 25°, 26°, 25°, 28°, 27°, 27°, 98°, 120°]
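The slide's numbers check out, taking the eight readings shown:

```python
# The slide's eight readings: six regular samples and two outliers.
regular = [25, 26, 25, 28, 27, 27]
outliers = [98, 120]
readings = regular + outliers

print(sum(readings) / len(readings))   # 47.0 -- matches no actual reading
print(sum(regular) / len(regular))     # ~26.3, the regular peak
print(sum(outliers) / len(outliers))   # 109.0, the outlier peak
```

The global average lands between the two modes, far from every sensor's actual reading; reporting the two cluster means instead describes the data faithfully.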




Existing Distribution Estimation Solutions

Estimate a range of distributions [1, 2], or cluster the data according to values [3, 4]

Fast aggregation [1, 2]

Tolerate crash failures, dynamic networks [1, 2]

but either:

High bandwidth [3, 4], multi-epoch [2, 3, 4]

or:

One-dimensional data only [1, 2]

No data error robustness [1, 2]

1. M. Haridasan and R. van Renesse. Gossip-based distribution estimation in peer-to-peer networks. In International Workshop on Peer-to-Peer Systems (IPTPS 08), February 2008.

2. J. Sacha, J. Napper, C. Stratan, and G. Pierre. Reliable distribution estimation in decentralised environments. Submitted for publication, 2009.

3. W. Kowalczyk and N. A. Vlassis. Newscast EM. In Neural Information Processing Systems, 2004.

4. N. A. Vlassis, Y. Sfakianakis, and W. Kowalczyk. Gossip-based greedy Gaussian mixture learning. In Panhellenic Conference on Informatics, 2005.

Our Solution

Outliers: samples deviating from the distribution of the bulk of the data

Estimate a range of distributions by clustering the data according to values

Fast aggregation

Tolerate crash failures, dynamic networks

Low bandwidth, single epoch

Multi-dimensional data

Data error robustness by outlier detection

Outlier Detection Challenge

A double bind: to characterize the regular data distribution (~26°), one must first set the outliers aside; to recognize the outliers ({98°, 120°}), one must first know the regular distribution.

No single node in the system has enough information

Aggregating Data Into Clusters

Each cluster has its own mean and mass

A bounded number (k) of clusters is maintained

Example with k = 2 and samples a = 1, b = 3, c = 5, d = 10:

Original samples: four clusters of mass 1, at 1, 3, 5 and 10

Clustering a and b: {a, b} with mean 2 and mass 2; c and d unchanged

Clustering a, b and c: {a, b, c} with mean 3 and mass 3; d unchanged
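The merge step in this example reduces to adding masses and taking the mass-weighted mean. A minimal sketch, representing a cluster as a plain (mass, mean) pair:

```python
# A cluster is a (mass, mean) pair; merging adds the masses and takes
# the mass-weighted mean of the two means.

def merge(c1, c2):
    (m1, mu1), (m2, mu2) = c1, c2
    m = m1 + m2
    return (m, (m1 * mu1 + m2 * mu2) / m)

# The slide's samples a=1, b=3, c=5, d=10, each a cluster of mass 1:
ab = merge((1, 1), (1, 3))   # clustering a and b   -> (2, 2.0)
abc = merge(ab, (1, 5))      # clustering a, b, c   -> (3, 3.0)
print([abc, (1, 10)])        # the final k=2 clusters
```

Because masses travel with the means, merging is order-insensitive in its result: {a, b, c} ends at mean 3 whether b joins a first or c does.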

But What Does the Mean Mean?

The variance must be taken into account

[Figure: a new sample lies between mean A and mean B; Gaussian A is tight, Gaussian B is wide]
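In one dimension, "taking the variance into account" means measuring distance in standard deviations rather than in absolute units. The parameters below are made up purely to illustrate the point:

```python
# In 1-d, distance in units of standard deviation (the Mahalanobis
# distance) decides which Gaussian a sample plausibly belongs to.
# The sample and cluster parameters are hypothetical.

def mahalanobis(x, mu, sigma):
    return abs(x - mu) / sigma

x = 4.0
mu_a, sigma_a = 3.0, 0.1   # tight Gaussian A
mu_b, sigma_b = 8.0, 3.0   # wide Gaussian B

print(abs(x - mu_a), abs(x - mu_b))   # 1.0 vs 4.0: mean A is closer
print(mahalanobis(x, mu_a, sigma_a))  # ~10 standard deviations from A
print(mahalanobis(x, mu_b, sigma_b))  # ~1.3 standard deviations from B
```

The sample is closer to mean A in absolute terms, yet it sits ten standard deviations from A and well within B's spread, so B is the plausible owner.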

Gossip Aggregation of Gaussian Clusters

The distribution is described as k clusters

Each cluster is described by:

Mass

Mean

Covariance matrix (variance for 1-d data)
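A 1-d sketch of merging two such clusters. The representation is assumed to be a (mass, mean, variance) triple, and the update is the standard pooled-moments formula (law of total variance), not necessarily the authors' exact rule; a covariance-matrix version generalizes it to higher dimensions.

```python
# Merging two 1-d Gaussian clusters, each a (mass, mean, variance)
# triple. The merged variance is the pooled-moments result (law of
# total variance): weighted within-cluster variance plus the weighted
# squared gap between the means.

def merge_gaussian(c1, c2):
    (m1, mu1, v1), (m2, mu2, v2) = c1, c2
    m = m1 + m2
    w1, w2 = m1 / m, m2 / m
    mu = w1 * mu1 + w2 * mu2
    v = w1 * v1 + w2 * v2 + w1 * w2 * (mu1 - mu2) ** 2
    return (m, mu, v)

# Two unit-mass samples at 1 and 3 (each with zero variance):
merged = merge_gaussian((1, 1.0, 0.0), (1, 3.0, 0.0))
print(merged)  # (2, 2.0, 1.0): the variance of {1, 3} is recovered
```

The cross term w1·w2·(mu1 − mu2)² is what lets the merged cluster remember how far apart its members were, so repeated merges do not silently collapse spread information.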

Gossip Aggregation of Gaussian Clusters

[Figure: clusters a and b are merged; on a gossip exchange a node keeps half of each cluster's mass and sends the other half]

Distributed Clustering for Robust Aggregation

Our solution:

Aggregate a mixture of Gaussian clusters

Merge when necessary (exceeding k)

Recognize outliers

By the time we need to merge, we can estimate the distribution

Simulation Results

1. Data error robustness
2. Crash robustness
3. Elaborate multidimensional data

Simulation Results: It Works Where It Matters

[Plot: error with and without outlier detection; the extreme regimes are marked "not interesting" and "easy"]


Simulation Results: Protocol is Crash Robust

[Plot: average error vs. round, comparing outlier detection against no outlier detection, with and without 5% crash probability]


Simulation Results: Describe Elaborate Data

[Plot: temperature vs. distance, with separate "fire" and "no fire" clusters]

Theoretical Results (In Progress)

The algorithm converges:

Eventually all nodes have the same clusters forever

Note: this holds even without atomic actions; the invariant is preserved by both send and receive

… to the “right” output:

If outliers are “far enough” from other samples, then they are never mixed into non-outlier clusters

They are discovered

They do not bias the good samples’ aggregate (where it matters)

Summary

Robust aggregation requires outlier detection

[Figure: five 27° readings alongside outliers of 98° and 120°]

Summary

We present outlier detection by Gaussian clustering, merging clusters as needed

Our protocol provides:

Outlier detection (where it matters)

Crash robustness

Elaborate data description