Diagnosing and Debugging
Wireless Sensor Networks


Eric Osterweil

Nithya Ramanathan

Contents


Introduction


Network Management


Parallel Processing


Distributed Fault Tolerance


WSNs


Calibration / Model Based


Conclusion

What do apples, oranges, and peaches have in common?

Well, they are all fruits, they all grow
in groves of trees, etc.

However, grapes are also fruits, but they grow on vines! ;)

Defining the Problem


Debugging


an iterative process of detecting and discovering the root cause of faults


Distinct debugging phases


Pre-deployment


During deployment


Post-deployment


Ongoing Maintenance / Performance Analysis


How is this different from debugging?

Characteristic Failures [1,2]


Pre-Deployment


Bugs characteristic of wireless, embedded, and
distributed platforms


During Deployment


Not receiving data at the sink


Neighbor density (or lack thereof)


Badly placed nodes


Flaky/variable link connectivity

[1] R. Szewczyk, J. Polastre, A. Mainwaring, D. Culler, “Lessons from a Sensor Network Expedition”. In EWSN, 2004.

[2] A. Mainwaring, J. Polastre, R. Szewczyk, D. Culler, “Wireless Sensor Networks for Habitat Monitoring”. In ACM International Workshop on Wireless Sensor Networks and Applications.

Characteristic Failures (continued)


Post-Deployment


Failed/rebooted nodes


“Funny” nodes/sensors


Batteries with low-voltage levels


Uncalibrated sensors


Ongoing Maintenance / Performance


Low bandwidth / dropped data from certain regions


High power consumption


Poor load balancing, or high retransmission rate

Scenarios


You have just deployed a sensor network in the
forest, and are not getting data from any node


what do you do?


You are getting wildly fluctuating averages from a
region


is this caused by


Actual environmental fluctuations


Bad sensors


Data randomly dropped


Calculation / algorithmic errors


Tampered nodes

Challenges


Existing tools fall short for sensor networks


Limited visibility


Resource-constrained nodes (can’t run “gdb”)


Bugs characteristic of embedded, distributed, and wireless
platforms


Can’t always use existing Internet fault-tolerance techniques (e.g., rebooting)


Extracting Debugging Information


With minimal disturbance to the network


Identifying information used to infer internal state


Minimizing central processing


Minimizing resource consumption

Challenges (continued)


Applications behave differently in the field


Testing configuration changes


Can’t easily log on to nodes


Identifying performance-blocking bugs


Can’t continually manually monitor the
network (often physically impossible
depending on deployment environment)

Contents


Introduction


Network Management


Parallel Processing


Distributed Fault Tolerance


WSNs


Calibration / Model Based


Conclusion

What is Network Management?

I don’t have to know anything about my neighbor to
count on them…



Network Management


Observing and tracking nodes


Routers


Switches


Hosts


Ensuring that nodes are providing
connectivity


i.e. doing their jobs

Problem


Connectivity failures versus device
failures


Correlating outages with their
cause(s)

Outage Example

[Diagram: hosts connect through switches and routers up to core switches]

Approach


Polling


ICMP


SNMP


“Downstream event suppression”


If routing has failed, ignore events about downstream nodes (see the sketch below)


Modeling
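The downstream event suppression idea can be made concrete with a short sketch. This is an illustrative implementation under assumed inputs (a parent-pointer topology and a set of nodes failing ICMP/SNMP polls), not the code of any particular management product:

```python
# Hedged sketch of "downstream event suppression": when a router/parent is
# down, alarms for nodes that can only be reached through it are suppressed,
# so the operator sees one root-cause event instead of a flood.
# The topology and node names here are illustrative, not from any real tool.

def suppress_downstream(down_nodes, parent):
    """Return only the 'topmost' failed nodes.

    down_nodes: set of node ids currently failing ICMP/SNMP polls
    parent:     dict mapping node id -> its upstream node (None at the core)
    """
    reported = set()
    for node in down_nodes:
        # Walk toward the core; if any ancestor is also down,
        # this node's outage is explained by the ancestor.
        ancestor, explained = parent.get(node), False
        while ancestor is not None:
            if ancestor in down_nodes:
                explained = True
                break
            ancestor = parent.get(ancestor)
        if not explained:
            reported.add(node)
    return reported

if __name__ == "__main__":
    parent = {"host-a": "switch-1", "host-b": "switch-1",
              "switch-1": "router-1", "router-1": None}
    down = {"router-1", "switch-1", "host-a", "host-b"}
    print(suppress_downstream(down, parent))  # {'router-1'}
```

Only the router is reported here, because the switch and hosts behind it are unreachable for a reason the operator already knows about.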

Outage Example (2)

How does this area differ from
WSNs?

Applied to WSNs


Similarities


Similar topologies


Intersecting operations


Network forwarding, routing, etc.


Connectivity vs. device failures


Differences


Network links


Topology dynamism

Contents


Introduction


Network Management


Parallel Processing


Distributed Fault Tolerance


WSNs


Calibration / Model Based


Conclusion

What is Parallel Processing?

If one car is fast, are 1,000 cars 1,000 times faster?

Parallel Processing


Coordinating large sets of nodes


Cluster sizes can range to the order of 10^4 nodes


Knowing nodes’ states


Efficient resource allocation


Low communication overhead

Problem


Detecting faults


Recovery from faults


Reducing communication
overhead


Maintenance


Software distributions, upgrades, etc.

Approach


Low-overhead state checks (see the sketch below)


ICMP


UDP-based protocols and topology sensitivity


Ganglia


Process recovery


Process checkpoints


Condor
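As a sketch of the low-overhead state checks above: each node emits a tiny periodic heartbeat, and a central monitor stores only a timestamp per node, flagging nodes whose heartbeat has gone stale. This mirrors the spirit of tools like Ganglia but is not Ganglia’s API; the class, timeout, and node names are assumptions for illustration:

```python
# Minimal sketch of heartbeat-based liveness checks in a cluster monitor.
# Names and thresholds are illustrative assumptions, not a real tool's API.
import time

class HeartbeatMonitor:
    def __init__(self, timeout_s=30.0):
        self.timeout_s = timeout_s
        self.last_seen = {}          # node id -> timestamp of last heartbeat

    def record_heartbeat(self, node_id, when=None):
        # Each node periodically sends a tiny datagram; we only store a timestamp.
        self.last_seen[node_id] = when if when is not None else time.time()

    def suspected_down(self, now=None):
        now = now if now is not None else time.time()
        return [n for n, t in self.last_seen.items() if now - t > self.timeout_s]

monitor = HeartbeatMonitor(timeout_s=30.0)
monitor.record_heartbeat("node-001", when=0.0)
monitor.record_heartbeat("node-002", when=25.0)
print(monitor.suspected_down(now=40.0))   # ['node-001']
```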

How does this area differ from
WSNs?

Applied to WSNs


Similarities


Potentially large sets of nodes


Tracking state is relatively difficult (due to resource constraints)


Communication overheads are
limiting

Applied to WSNs (continued)


Differences


Topology is more dynamic in WSNs


Communications are more
constrained


Deployment is not structured around
computation


Energy is limiting rather than
computation overhead


WSNs are much less latency sensitive


Contents


Introduction


Network Management


Parallel Processing


Distributed Fault Tolerance


WSNs


Calibration / Model Based


Conclusion

What is Distributed Fault Tolerance?

Put me in coach… PUT ME IN!

Distributed Fault Tolerance


High Availability is a broad
category


Hot backups (failover)


Load balancing


etc.

Problem(s)


HA


Track status of nodes


Keeping access to critical resources
available as much as possible


Sacrifice hardware for low latency


Load balancing


Track status of nodes


Keeping load even


Approach


HA


High frequency/low latency
heartbeats


Failover techniques


Virtual interfaces


Shared volume mounting


Load balancing


Metrics (round robin, least connections, etc.; see the sketch below)
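The load-balancing metrics mentioned above reduce to very small pieces of logic. A hedged sketch of round robin and least connections, with made-up backend names:

```python
# Illustrative sketch of two common load-balancing metrics:
# round robin and least connections. Backend names are assumptions.
from itertools import cycle

class RoundRobinBalancer:
    def __init__(self, backends):
        self._cycle = cycle(backends)            # hand out backends in a fixed rotation

    def pick(self):
        return next(self._cycle)

class LeastConnectionsBalancer:
    def __init__(self, backends):
        self.active = {b: 0 for b in backends}   # backend -> open connections

    def pick(self):
        backend = min(self.active, key=self.active.get)
        self.active[backend] += 1
        return backend

    def release(self, backend):
        self.active[backend] -= 1

rr = RoundRobinBalancer(["srv-a", "srv-b"])
print([rr.pick() for _ in range(4)])   # ['srv-a', 'srv-b', 'srv-a', 'srv-b']

lc = LeastConnectionsBalancer(["srv-a", "srv-b"])
lc.pick()                              # 'srv-a' gets the first connection
print(lc.pick())                       # 'srv-b', since 'srv-a' is now busier
```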

How does this area differ from
WSNs?

Applied to WSNs


HA / Load balancing


Similarities


Redundant resources


Differences


Where to begin…

MANY

Contents


Introduction


Network Management


Parallel Processing


Distributed Fault Tolerance


WSNs


Calibration / Model Based


Conclusion

What are WSNs?

Warning, any semblance of an orderly system is
purely coincidental…

BluSH [1]


Shell interface for Intel’s IMotes


Enables interactive debugging


can walk
up to a mote and access internal state





[1] Tom Schoellhammer

Sympathy [1,2]


Aids in debugging


pre, during, and post-deployment


Nodes collect metrics & periodically broadcast to the sink


Sink ensures “good qualities” specified by programmer


based on metrics and other gathered information


Faults are identified and categorized by metrics and tests (see the sketch below)


Spatio-temporal correlation of distributed events to root-cause failures


Test Injection


Proactively injects network probes to validate a fault hypothesis


Triggers self-tests (internal actuation)

[1] N. Ramanathan, E. Kohler, D. Estrin, “Towards a Debugging System for Sensor Networks”. International Journal of Network Management, 2005.

[2] N. Ramanathan, E. Kohler, L. Girod, D. Estrin, “Sympathy: A Debugging System for Sensor Networks”. In Proceedings of the First IEEE Workshop on Embedded Networked Sensors, Tampa, Florida, USA, November 16, 2004.
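A minimal sketch of the sink-side idea described above: nodes report metrics, and the sink flags nodes whose metrics violate programmer-specified “good qualities” or that have stopped reporting. This is not Sympathy’s actual code; the metric names, thresholds, and report interval are assumptions:

```python
# Illustrative sink-side fault classification in the spirit of Sympathy.
# Metric names, thresholds, and the report interval are assumptions.
import time

EXPECTED_REPORT_INTERVAL_S = 60

# Programmer-specified qualities: metric name -> predicate that should hold.
GOOD_QUALITIES = {
    "packets_delivered_pct": lambda v: v >= 90,
    "neighbor_count":        lambda v: v >= 2,
    "battery_voltage":       lambda v: v >= 2.7,
}

def classify_faults(reports, now=None):
    """reports: node id -> {"timestamp": t, "metrics": {name: value}}"""
    now = now if now is not None else time.time()
    faults = {}
    for node, report in reports.items():
        problems = []
        if now - report["timestamp"] > 3 * EXPECTED_REPORT_INTERVAL_S:
            problems.append("no recent data at sink")   # likely crash or route failure
        for name, ok in GOOD_QUALITIES.items():
            value = report["metrics"].get(name)
            if value is not None and not ok(value):
                problems.append(f"{name}={value} violates expected quality")
        if problems:
            faults[node] = problems
    return faults

reports = {
    "node-7": {"timestamp": 0,   "metrics": {"packets_delivered_pct": 55, "neighbor_count": 1}},
    "node-9": {"timestamp": 290, "metrics": {"packets_delivered_pct": 97, "neighbor_count": 4,
                                             "battery_voltage": 3.0}},
}
print(classify_faults(reports, now=300))   # only node-7 is flagged
```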

SNMS [1]


Enables interactive health monitoring of WSN in
the field


Three pieces (see the sketch below)


Parallel dissemination and collection


Query system for exported attributes


Logging system for asynchronous events


Small footprint / low overhead


Introduces overhead only with human querying

[1] Gilman Tolle, David Culler, “Design of an Application-Cooperative Management System for WSN”. Second EWSN, Istanbul, Turkey, January 31 - February 2, 2005.
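A rough sketch of the query and logging pieces described above, from a node’s point of view: attributes are exported for on-demand querying (so overhead is only incurred when a human asks), and asynchronous events are logged until collected. This is illustrative only, not SNMS’s real interface; all names are assumptions:

```python
# Illustrative node-side agent with an attribute query system and an
# asynchronous event log. Not SNMS's actual interface; names are assumptions.

class NodeManagementAgent:
    def __init__(self):
        self._attributes = {}     # attribute name -> zero-argument getter
        self._event_log = []      # asynchronous events, kept until queried

    def export_attribute(self, name, getter):
        """Register an attribute (e.g. uptime, battery voltage) for querying."""
        self._attributes[name] = getter

    def query(self, name):
        """Answer a query from the sink; costs nothing until someone asks."""
        getter = self._attributes.get(name)
        return getter() if getter else None

    def log_event(self, message):
        self._event_log.append(message)

    def drain_events(self):
        events, self._event_log = self._event_log, []
        return events

agent = NodeManagementAgent()
agent.export_attribute("uptime_s", lambda: 1234)
agent.log_event("reboot detected")
print(agent.query("uptime_s"))     # 1234
print(agent.drain_events())        # ['reboot detected']
```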

Contents


Introduction


Network Management


Parallel Processing


Distributed Fault Tolerance


WSNs


Calibration / Model Based


Conclusion

What is Calibration and Modeling?

Hey, if you and I both think the answer is true, then who’s to say we’re wrong? ;)

Modeling [1,2,3]


“Root-cause localization” in large-scale systems


Process of “identifying the source of problems in a
system using purely external observations”


Identify “anomalous” behavior based on
externally observed metrics


Statistical analysis and Bayesian networks used to identify faults (see the sketch below)


[1] E. Kiciman, A. Fox, “Detecting application-level failures in component-based internet services”. IEEE Transactions on Neural Networks, Spring 2004.

[2] A. Fox, E. Kiciman, D. Patterson, M. Jordan, R. Katz, “Combining statistical monitoring and predictable recovery for self-management”. In Proceedings of the Workshop on Self-Managed Systems, Oct 2004.

[3] E. Kiciman, L. Subramanian, “Root cause localization in large scale systems”.
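As an illustration of identifying “anomalous” behavior from externally observed metrics, the sketch below flags components whose metric values lie far from the population’s typical value. It uses a simple z-score test rather than the Bayesian networks of the cited papers, and all metric and component names are made up:

```python
# Hedged sketch of anomaly detection over externally observed metrics,
# using a per-metric z-score test. An illustration, not the papers' method.
import statistics

def anomalous_components(observations, threshold=3.0):
    """observations: metric name -> {component: value}.
    Flags components whose value is far from that metric's typical value."""
    flagged = {}
    for metric, per_component in observations.items():
        values = list(per_component.values())
        if len(values) < 3:
            continue                      # not enough data to call anything anomalous
        mean = statistics.mean(values)
        stdev = statistics.pstdev(values)
        if stdev == 0:
            continue
        for component, value in per_component.items():
            if abs(value - mean) / stdev > threshold:
                flagged.setdefault(component, []).append(metric)
    return flagged

obs = {"request_latency_ms": {"fe-1": 20, "fe-2": 22, "fe-3": 21, "fe-4": 400}}
print(anomalous_components(obs, threshold=1.5))   # {'fe-4': ['request_latency_ms']}
```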

Calibration [1,2]


Model physical phenomena in order to predict which sensors are faulty


Model can be based on:



Environment that is monitored


e.g. assume that the majority of sensors are providing correct data and then identify sensors that make this model inconsistent [1]


Assumptions about the environment


e.g. in a densely sampled area, values of neighboring sensors should be “similar” [2] (see the sketch below)


Debugging can be viewed as sensor network system calibration


Use system metrics instead of sensor data


Based on a model of what metrics in a properly behaving system should look like,
can identify faulty behavior based on inconsistent metrics.


Locating and using ground truth


In situ deployments


Low communication/energy budgets


Bias


Noise

[1] J. Feng, S. Megerian, M. Potkonjak, “Model-based calibration for Sensor Networks”. IEEE International Conference on Sensors, Oct 2003.

[2] V. Bychkovskiy, S. Megerian, et al., “A Collaborative Approach to In-Place Sensor Calibration”. In IPSN, 2003.
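The neighbor-similarity assumption above lends itself to a compact sketch: in a densely sampled area, a sensor whose reading deviates sharply from the median of its neighbors is flagged as suspect. The threshold, readings, and neighbor lists below are illustrative assumptions, not a published algorithm:

```python
# Illustrative neighbor-similarity check for suspect sensors.
# Thresholds, readings, and neighbor lists are assumptions for the example.
import statistics

def suspect_sensors(readings, neighbors, max_deviation=5.0):
    """readings:  sensor id -> latest value
       neighbors: sensor id -> list of nearby sensor ids"""
    suspects = []
    for sensor, value in readings.items():
        nearby = [readings[n] for n in neighbors.get(sensor, []) if n in readings]
        if len(nearby) < 3:
            continue                       # too sparse to apply the similarity assumption
        if abs(value - statistics.median(nearby)) > max_deviation:
            suspects.append(sensor)        # inconsistent with the local majority
    return suspects

readings  = {"s1": 21.0, "s2": 21.5, "s3": 20.8, "s4": 35.2}
neighbors = {"s1": ["s2", "s3", "s4"], "s2": ["s1", "s3", "s4"],
             "s3": ["s1", "s2", "s4"], "s4": ["s1", "s2", "s3"]}
print(suspect_sensors(readings, neighbors))   # ['s4']
```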

Contents


Introduction


Network Management


Parallel Processing


Distributed Fault Tolerance


WSNs


Calibration / Model Based


Conclusion

Promising Ideas


Management by Delegation


Naturally supports heterogeneous architectures
by distributing control over network


Dynamically tasks/empowers less-capable nodes using mobile code


AINs


Node can monitor its own behavior, detect,
diagnose, and repair issues


Model-based fault detection


Models of physical environment


Bayesian inference engines

Comparison


Network Management


Close, but includes some inflexible
assumptions


Parallel Processing


Many similar, but divergent constraints


Distributed Fault Tolerance


Almost totally different


WSNs


New techniques emerging


Calibration


WSN related work becoming available


Conclusion


Distributed debugging is as distributed debugging does [1]


WSNs are a particular class of
distributed system


There are numerous techniques for
distributed debugging


Different conditions warrant different
approaches


OR different spins on existing techniques

[1] F. Gump et al.

References


Todd Tannenbaum, Derek Wright, Karen Miller, and Miron Livny, “Condor - A Distributed Job Scheduler”. In Thomas Sterling, editor, Beowulf Cluster Computing with Linux, The MIT Press, 2002. ISBN 0-262-69274-0.


http://www.open.com/pdfs/alarmsuppression.pdf


http://www.top500.org/


D.E. Culler and J.P. Singh, Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, 1999. ISBN 1-55860-343-3.


Matthew L. Massie, Brent N. Chun, and David E. Culler, “The Ganglia Distributed Monitoring System: Design, Implementation, and Experience”. Parallel Computing, Vol. 30, Issue 7, July 2004.


HA-OSCAR Release 1.0 Beta: Unleashing HA-Beowulf. 2nd Annual OSCAR Symposium, Winnipeg, Manitoba, Canada, May 2004.

Questions?

No? Great! ;)