Distributed Clustering for Robust
Aggregation in Large Networks
Ittay Eyal, Idit Keidar, Raphi Rom
Technion, Israel
Aggregation in Sensor Networks – Applications
• Temperature sensors thrown in the woods
• Seismic sensors
• Grid computing load
Aggregation in Sensor Networks – Applications
• Large networks, light nodes, low bandwidth
• Fault-prone sensors and network
• Multi-dimensional data (location × temperature)
• Target is a function of all sensed data: average temperature, max location, majority…
What has been done?

Tree Aggregation
• Hierarchical solution
• Fast: O(height of tree)
• Limited to static topology
• No failure robustness
[Figure: aggregation over an example tree of sensor values]
Gossip
• D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computation of aggregate information. In FOCS, 2003.
• S. Nath, P. B. Gibbons, S. Seshan, and Z. R. Anderson. Synopsis diffusion for robust aggregation in sensor networks. In SenSys, 2004.

Gossip:
• Each node maintains a synopsis
• Occasionally, each node contacts a neighbor and they improve their synopses
• Indifferent to topology changes
• Crash robust
• No data error robustness
• Proven convergence
[Figure: neighboring nodes repeatedly exchange and average their synopses]
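The gossip scheme above can be sketched in a few lines of Python. This is a minimal push-pull averaging toy in the spirit of the cited papers, not their code: each "synopsis" is just a running average, and the node values and round count are made up for illustration.

```python
import random

def gossip_round(values):
    """One gossip round: each node contacts a random peer and both
    replace their synopses with the pairwise average."""
    for node in list(values):
        peer = random.choice([n for n in values if n != node])
        avg = (values[node] + values[peer]) / 2.0
        values[node] = values[peer] = avg

# Made-up sensor readings; the true mean is 6.2.
values = {i: float(v) for i, v in enumerate([11, 9, 3, 1, 7])}
for _ in range(50):
    gossip_round(values)
# Pairwise averaging preserves the global sum, so every synopsis
# converges toward the true mean without any central coordinator.
```

Because each exchange replaces two synopses by their average, the global sum never changes, which is why the scheme tolerates arbitrary contact patterns and topology changes.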
A Closer Look at the Problem

The Implications of Irregular Data
A single erroneous sample can radically offset the data.
[Figure: samples 25°, 26°, 25°, 28°, 27°, 98°, 120°; the average (47°) doesn’t tell the whole story]
Sources of Irregular Data
• Sensor malfunction: a short circuit in a seismic sensor
• Sensing error: an animal sitting on a temperature sensor
• Software bugs: in grid computing, a machine reports negative CPU usage
• Interesting info (DDoS): irregular load on some machines in a grid
• Interesting info (fire outbreak): extremely high temperature in a certain area of the woods
• Interesting info (intrusion): a truck driving by a seismic detector
It Would Help to Know the Data Distribution
[Figure: samples 25°, 26°, 25°, 28°, 27°, 98°, 120°. The average is 47°, but the data is bimodal, with peaks at 26.3° and 109°.]
Existing Distribution Estimation Solutions
Estimate a range of distributions [1, 2] or cluster the data according to values [3, 4]:
• Fast aggregation [1, 2]
• Tolerate crash failures, dynamic networks [1, 2]
• High bandwidth [3, 4], multi-epoch [2, 3, 4]
• One-dimensional data only [1, 2]
• No data error robustness [1, 2]

1. M. Haridasan and R. van Renesse. Gossip-based distribution estimation in peer-to-peer networks. In International Workshop on Peer-to-Peer Systems (IPTPS 08), February 2008.
2. J. Sacha, J. Napper, C. Stratan, and G. Pierre. Reliable distribution estimation in decentralised environments. Submitted for publication, 2009.
3. W. Kowalczyk and N. A. Vlassis. Newscast EM. In Neural Information Processing Systems, 2004.
4. N. A. Vlassis, Y. Sfakianakis, and W. Kowalczyk. Gossip-based greedy Gaussian mixture learning. In Panhellenic Conference on Informatics, 2005.
Our Solution
Outliers: samples deviating from the distribution of the bulk of the data.

Estimate a range of distributions by clustering the data according to values:
• Fast aggregation
• Tolerate crash failures, dynamic networks
• Low bandwidth, single epoch
• Multi-dimensional data
• Data error robustness by outlier detection
Outlier Detection Challenge
A double bind: to spot the outliers one must know the regular data distribution, but to estimate that distribution one must first exclude the outliers.
• Regular data distribution: ~26°
• Outliers: {98°, 120°}
No one in the system has enough information.
[Figure: samples 25°, 26°, 25°, 28°, 27°, 98°, 120°]
Aggregating Data Into Clusters
• Each cluster has its own mean and mass
• A bounded number (k) of clusters is maintained
Example (here k = 2): original samples a = 1, b = 3, c = 5, d = 10, each of mass 1.
• Clustering a and b: cluster ab has mean 2 and mass 2
• Clustering a, b and c: cluster abc has mean 3 and mass 3
• Clustering all: two clusters remain, abc (mean 3, mass 3) and d (mean 10, mass 1)
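The slide's example can be reproduced with a small sketch. The helper names (`merge`, `add_sample`) are made up, and merging the closest pair of means is an assumed policy; the point is that mass-weighted merging keeps the number of clusters bounded by k.

```python
def merge(c1, c2):
    """Merge two (mean, mass) clusters into one carrying the total
    mass and the mass-weighted mean."""
    (m1, w1), (m2, w2) = c1, c2
    w = w1 + w2
    return ((m1 * w1 + m2 * w2) / w, w)

def add_sample(clusters, x, k):
    """Add sample x as a unit-mass cluster, then merge the closest
    pair of means until at most k clusters remain."""
    clusters = clusters + [(x, 1.0)]
    while len(clusters) > k:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: abs(clusters[p[0]][0] - clusters[p[1]][0]))
        merged = merge(clusters[i], clusters[j])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters

# The slide's example: samples a=1, b=3, c=5, d=10 with k=2.
clusters = []
for x in [1, 3, 5, 10]:
    clusters = add_sample(clusters, x, k=2)
# The nearby samples 1, 3, 5 collapse into one cluster (mean 3, mass 3),
# while the distant sample d=10 keeps its own cluster.
```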
But What Does the Mean Mean?
A new sample cannot be assigned to a cluster by its distance from Mean A or Mean B alone: the variance must be taken into account.
[Figure: a new sample lying between Gaussian A and Gaussian B]
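The point can be checked numerically with two made-up 1-d Gaussians; all the means and variances below are illustrative.

```python
import math

def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Made-up clusters: A is narrow, B is wide; the sample sits exactly
# between the two means.
mean_a, var_a = 0.0, 1.0
mean_b, var_b = 10.0, 25.0
x = 5.0

p_a = gaussian_pdf(x, mean_a, var_a)
p_b = gaussian_pdf(x, mean_b, var_b)
# The distance to each mean is the same (5), yet the sample is far
# more likely under the wide Gaussian B than under the narrow A.
```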
Gossip Aggregation of Gaussian Clusters
The distribution is described as k clusters, each described by:
• Mass
• Mean
• Covariance matrix (variance for 1-d data)
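Merging two such clusters in one dimension can be done by moment matching: total mass, mass-weighted mean, and a variance that folds in the offset of each mean from the merged mean. This is the standard moment-matching formula, offered as a sketch of how such a merge could work, not as a quote of the paper's exact rule; the cluster values are made up.

```python
def merge_gaussians(c1, c2):
    """Merge two (mass, mean, variance) clusters by moment matching."""
    w1, m1, v1 = c1
    w2, m2, v2 = c2
    w = w1 + w2
    m = (w1 * m1 + w2 * m2) / w
    # Each cluster contributes its own variance plus the squared
    # offset of its mean from the merged mean.
    v = (w1 * (v1 + (m1 - m) ** 2) + w2 * (v2 + (m2 - m) ** 2)) / w
    return (w, m, v)

# Made-up clusters: regular samples near 26° and an outlier cluster at 110°.
merged = merge_gaussians((3.0, 26.0, 1.0), (1.0, 110.0, 4.0))
# The merged cluster has mass 4.0 and mean 47.0; its large variance
# records that the two constituents were far apart.
```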
Gossip Aggregation of Gaussian Clusters
• Keep half, send half
[Figure: received clusters a and b are merged into the local set]
Distributed Clustering for Robust Aggregation
Our solution:
• Aggregate a mixture of Gaussian clusters
• Merge when necessary (when exceeding k clusters)
• Recognize outliers: by the time we need to merge, we can estimate the distribution
Simulation Results
1. Data error robustness
2. Crash robustness
3. Elaborate multidimensional data
Simulation Results: It Works Where It Matters
[Plot: error with outlier detection; the extremes of the range are marked “Not Interesting” and “Easy”, and the protocol helps in between]
Simulation Results: Protocol is Crash Robust
[Plot: average error per round for three configurations: no outlier detection with 5% crash probability, no outlier detection without crashes, and outlier detection]
Simulation Results: Describe Elaborate Data
[Plot: temperature vs. distance; the samples separate into “Fire” and “No Fire” clusters]
Theoretical Results (In Progress)

The algorithm converges
• Eventually all nodes have the same clusters forever
• Note: this holds even without atomic actions
• The invariant is preserved by both send and receive
… to the “right” output
• If outliers are “far enough” from other samples, then they are never mixed into non-outlier clusters
• They are discovered
• They do not bias the good samples’ aggregate (where it matters)
Summary
Robust aggregation requires outlier detection.
We present outlier detection by Gaussian clustering.
[Figure: regular samples around 27° merge into one cluster; the outliers 98° and 120° stay separate]
Summary – Our Protocol
• Outlier detection (where it matters)
• Crash robustness
• Elaborate data
[Plot: average error per round]