Triage: Diagnosing Production Run Failures at the User's Site


Joseph Tucek, Shan Lu, Chengdu Huang, Spiros Xanthos and Yuanyuan Zhou

University of Illinois at Urbana-Champaign

Triage: Diagnosing Production Run Failures at the User's Site

Motivation

- Software failures are a major contributor to system downtime and to security holes.
- Software has grown in size, complexity and cost.
- Software testing has become more difficult.
- Software packages inevitably contain bugs (even production ones).

Motivation

- Result: software failures during production runs at the user's site.
- One solution is offsite software diagnosis, which has drawbacks:
  - It is difficult to reproduce the failure-triggering conditions.
  - It cannot provide timely online recovery (e.g. from fast Internet worms).
  - Programmers cannot be provided to every site.
  - Privacy concerns.
- Goal: automatically diagnose software failures occurring in production runs at the end-user's site:
  - Understand a failure that has happened.
  - Find the root causes.
  - Minimize manual debugging.

Current state of the art

Offsite diagnosis:
- Interactive debuggers.
- Program slicing.
- Core dump analysis (partial execution-path construction).
- Large overhead makes these impractical for production sites.

Primitive onsite diagnosis:
- Unprocessed failure-information collection.
- Deterministic replay tools.

All of these require manual analysis and raise privacy concerns.

Onsite Diagnosis

An onsite diagnosis tool should:
- Efficiently reproduce the failure that occurred (i.e. quickly and automatically).
- Impose little overhead during normal execution.
- Require no human involvement.
- Require no prior knowledge.

Triage

- Captures the failure point and conducts just-in-time failure diagnosis with checkpoint/re-execution.
- Delta generation and delta analysis.
- An automated, top-down, human-like software failure diagnosis protocol.
- Reports:
  - The failure's nature and type.
  - The failure-triggering conditions.
  - The failure-related code/variables and the fault propagation chain.

Triage Architecture

Three groups of components:
1. Runtime group.
2. Control group.
3. Analysis group.

Checkpoint & Re-execution

- Uses Rx (previous work by the authors).
- Rx checkpointing (see the sketch below):
  - Uses fork()-like operations.
  - Keeps a copy of accessed files and file pointers.
  - Records messages using a network proxy.
  - Replay may be deliberately modified (this is what delta generation builds on).
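A minimal sketch of the fork()-style checkpointing idea, with Python standing in for the real mechanism; take_checkpoint and rollback are hypothetical names, and the real Rx also snapshots files and proxies network messages:

    import os, signal

    def take_checkpoint():
        """Fork-style checkpoint: the child is a copy-on-write snapshot of
        the process image, frozen until a rollback wakes it (a sketch of
        the Rx idea only)."""
        pid = os.fork()
        if pid == 0:                              # child: the checkpoint
            os.kill(os.getpid(), signal.SIGSTOP)  # freeze until needed
            return "resumed"                      # re-execution starts here
        return pid                                # parent: keep running

    def rollback(checkpoint_pid):
        """Resume execution from a stored snapshot."""
        os.kill(checkpoint_pid, signal.SIGCONT)

Copy-on-write is what keeps this cheap: the snapshot shares pages with the live process until one of them writes.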


Lightweight Monitoring for detecting failures

- Must not impose high overhead.
- The cheapest way is to catch fault traps:
  - Assertions.
  - Access violations.
  - Divide by zero.
  - More…
- Extensions: branch histories, system-call traces…
- Triage only uses exceptions and assertions (a toy handler sketch follows).
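A toy illustration of trapping those fault classes in user space. Catching a real SIGSEGV from Python is unreliable, so treat this purely as a sketch of the idea, not Triage's actual mechanism:

    import signal, sys

    def on_fault(signum, frame):
        """A fault trap fired: hand control to diagnosis instead of crashing."""
        print(f"fault trap: {signal.Signals(signum).name}", file=sys.stderr)
        sys.exit(1)   # here Triage would roll back and start diagnosis instead

    # No instrumentation runs until a fault actually fires, so the cost
    # during normal execution is essentially zero.
    for sig in (signal.SIGSEGV, signal.SIGBUS, signal.SIGFPE, signal.SIGABRT):
        signal.signal(sig, on_fault)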


Control layer

- Implements the Triage diagnosis protocol (TDP).
- Controls re-executions with different inputs based on past results.
- Chooses the analysis technique for each step.
- Collects results and sends them to off-site programmers.


TDP: Triage Diagnosis Protocol

Analysis-layer techniques, applied top-down; each step below is annotated with what it reports for the running example (a control-loop sketch follows this list):

1. Simple replay: the failure recurs, so it is a deterministic bug.
2. Core dump analysis: stack/heap OK; segmentation fault in strlen().
3. Dynamic bug detection: null-pointer dereference.
4. Delta generation: a collection of good and bad inputs.
5. Delta analysis: the code paths leading to the fault.
6. Report.
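The protocol is essentially a fixed decision sequence driven by re-execution results. A hypothetical control-loop sketch; every analysis step is passed in as a placeholder callable, since the real tools are outside this snippet:

    def triage_diagnosis(replay, analyze_coredump, run_detectors,
                         generate_deltas, delta_analysis, checkpoint, inp):
        """Top-down, human-like diagnosis driven by re-execution results.
        All analysis steps are callables standing in for the real tools."""
        report = {}
        # 1. Replay as-is: a repeatable failure means a deterministic bug.
        report["deterministic"] = replay(checkpoint, inp) == "FAIL"
        # 2. Core dump analysis locates the failure point.
        report["failure_point"] = analyze_coredump(checkpoint)
        # 3. Dynamic bug detectors on re-execution name the bug type.
        report["bug_type"] = run_detectors(checkpoint, inp)
        # 4. Delta generation collects similar failing/non-failing runs.
        good, bad = generate_deltas(checkpoint, inp)
        # 5. Delta analysis extracts the code paths leading to the fault.
        report["fault_paths"] = delta_analysis(good, bad)
        return report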

TDP: Triage Diagnosis Protocol

Example report (figure omitted).

Protocol extensions and variations

- Add different debugging techniques.
- Reorder diagnosis steps.
- Omit steps (e.g. memory checks for Java programs).
- The protocol may be custom-designed for specific applications.
- Try to fix bugs:
  - Filter failure-triggering inputs.
  - Dynamically delete code (risky).
  - Change variable values.
- Automatic patch generation: future work?

Delta Generation

Two goals:
1. Generate many similar replays: some that fail and some that don't.
2. Identify the signature of failure-triggering inputs.

Signatures may be used for:
- Failure analysis and reproduction.
- Input filtering (e.g. Vigilante, Autograph, etc.).

Delta Generation

Changing the input (see the mutation sketch below):
- Replay previously stored client requests via the proxy; try different subsets and combinations.
- Isolate the bug-triggering part via data "fuzzing".
- Find non-failing inputs with minimum distance from failing ones.
- Make protocol-aware changes.
- Use a "normal form" of the input, if the specific triggering portion is known.

Changing the environment:
- Pad or zero-fill new allocations.
- Change message order.
- Drop messages.
- Manipulate thread scheduling.
- Modify the system environment.
- Make use of information from prior steps (e.g. target specific buffers).
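A toy sketch of the input side: mutate the failing input byte by byte and keep the non-failing variants closest to the original. The byte-flip mutation and the fails callable are illustrative assumptions, not Triage's actual mechanism:

    import random

    def generate_input_deltas(failing_input: bytes, fails, tries=256):
        """Mutate a failing input ("data fuzzing") and sort the non-failing
        variants by distance from the original. `fails` is a placeholder
        callable that replays the run and reports whether it failed."""
        good, bad = [], [failing_input]
        for _ in range(tries):
            variant = bytearray(failing_input)   # assumes a non-empty input
            variant[random.randrange(len(variant))] = random.randrange(256)
            variant = bytes(variant)
            (bad if fails(variant) else good).append(variant)
        # prefer non-failing inputs with minimum distance from the failing one
        good.sort(key=lambda v: sum(a != b for a, b in zip(v, failing_input)))
        return good, bad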

Delta Generation

Results passed to the next stage:
- Break the code into basic blocks.
- For each replay, extract a vector of the exercise count of each block, plus the block trace.
- The granularity can be changed.

Example revisited

Good run:
- Trace: AHIKBDEFEF…EG
- Block vector: {A:1, B:1, D:1, E:11, F:10, G:1, H:1, I:1, K:1}

Bad run:
- Trace: AHIJBCDE
- Block vector: {A:1, B:1, C:1, D:1, E:1, H:1, I:1, J:1}

(A sketch of extracting these vectors follows.)
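The vectors fall straight out of the traces; a two-line sketch, shown only for the bad run since the good run's trace is elided on the slide:

    from collections import Counter

    def block_vector(trace: str) -> Counter:
        """Exercise count of each basic block, read straight off the trace."""
        return Counter(trace)

    # The bad run from the example: every block it touched executed once.
    print(block_vector("AHIJBCDE"))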


Delta Analysis

Follows three steps:
1. Basic Block Vector (BBV) comparison: find a pair of most-similar failing and non-failing replays, F and S.
2. Path comparison: compare the execution paths of F and S.
3. Intersection with the backward slice: find the differences that contribute to the failure.

Delta Analysis: BBV Comparison

- The number of times each block is executed is recorded using instrumentation.
- Calculate the Manhattan distance between every pair of failing and non-failing replays (the "most similar" requirement can be relaxed to merely "similar").
- In the example, the difference vector is {C:-1, E:10, F:10, G:1, J:-1, K:1}, giving a Manhattan distance of 24 (verified in the sketch below).
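A short sketch of the comparison, reproducing the distance of 24 from the example vectors:

    def manhattan(v1: dict, v2: dict) -> int:
        """L1 distance between two block vectors (missing blocks count as 0)."""
        return sum(abs(v1.get(b, 0) - v2.get(b, 0)) for b in set(v1) | set(v2))

    good = {"A": 1, "B": 1, "D": 1, "E": 11, "F": 10,
            "G": 1, "H": 1, "I": 1, "K": 1}
    bad  = {"A": 1, "B": 1, "C": 1, "D": 1, "E": 1,
            "H": 1, "I": 1, "J": 1}
    # |C:-1| + E:10 + F:10 + G:1 + |J:-1| + K:1 = 24
    assert manhattan(good, bad) == 24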

Delta Analysis: Path Comparison

- Considers execution order.
- Finds where the failing and non-failing runs diverge.
- Computes the minimum edit distance, i.e. the minimum number of insertion, deletion, and substitution operations needed to transform one path into the other (see the sketch below; the example figure is omitted).
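Per the efficiency slide, the implementation uses an O(ND) edit-distance computation; the classic O(N^2) dynamic program below shows the same quantity on the example traces (the good trace is shortened, since the slide elides its middle):

    def edit_distance(s: str, t: str) -> int:
        """Minimum insertions, deletions and substitutions turning s into t."""
        m, n = len(s), len(t)
        d = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = i                      # delete everything
        for j in range(n + 1):
            d[0][j] = j                      # insert everything
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if s[i - 1] == t[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution
        return d[m][n]

    # Divergence between a shortened good trace and the bad trace.
    print(edit_distance("AHIKBDEFEFEG", "AHIJBCDE"))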





Delta Analysis: Backward Slicing

- We want to eliminate differences that have no effect on the failure.
- Dynamic backward slicing extracts a program slice consisting of all and only those instructions that lead to a given instruction's execution.
- The starting point may be supplied by earlier steps of the protocol.
- The overhead is acceptable in post-hoc analysis.
- Optimization: dynamically build dependencies during replays.
- Experiments show the overhead is acceptably low (a toy slicing sketch follows).
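A toy sketch of the idea over a recorded trace: walk backwards from the faulting instruction, keeping exactly the entries whose definitions feed it. The (instruction, defs, uses) record format is invented for illustration, and control dependencies (which real slicers also track) are ignored:

    def backward_slice(trace, fail_idx):
        """Dynamic backward slice over data dependencies. Each trace entry
        is (instruction, locations_defined, locations_used)."""
        _, _, uses = trace[fail_idx]
        live = set(uses)             # locations whose definitions we still need
        sliced = [fail_idx]
        for i in range(fail_idx - 1, -1, -1):
            instr, defs, uses = trace[i]
            if live & set(defs):     # this entry feeds the failure...
                sliced.append(i)
                live -= set(defs)
                live |= set(uses)    # ...so chase its inputs in turn
        return list(reversed(sliced))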

Backward Slicing and Result Intersection

(Figure omitted.)

Limitations and Extensions

- A privacy policy must be defined for the results sent to programmers.
- Very limited success with patch generation.
- Does not handle memory leaks well.
- A failure must actually occur; incorrect (but non-crashing) operation is not handled.
- Bugs that take a long time to manifest are difficult to reproduce.
- No support for deterministic replay on multiprocessor architectures.
- False positives.

Evaluation Methodology

- Experimented with 10 real software failures in 9 applications.
- Triage is implemented on Linux (kernel 2.4.22).
- Hardware: 2.4 GHz Pentium 4, 512 KB L2 cache, 1 GB of memory, and 100 Mbps Ethernet.
- Triage checkpoints every 200 ms and keeps 20 checkpoints.
- User study: 15 programmers were given 5 bugs, with Triage's report provided for some of them; the time to locate each bug was compared with and without the report.

Bugs used for Evaluation

Name    | Program         | App Description                | #LOC  | Bug Type               | Root Cause Description
Apache1 | apache-1.3.27   | A web server                   | 114K  | Stack smash            | Long alias match pattern overflows a local array
Apache2 | apache-1.3.12   | A web server                   | 102K  | Semantic (NULL ptr)    | Missing part of the URL causes a NULL pointer dereference
CVS     | cvs-1.11.4      | GNU version control server     | 115K  | Double free            | Error-handling code placed in the wrong order leads to a double free
MySQL   | mysql-4.0.12    | A database server              | 1028K | Data race              | Database logging error in case of a data race
Squid   | squid-2.3       | A web proxy cache server       | 94K   | Heap buffer overflow   | Buffer length calculation misses special-character cases
BC      | bc-1.06         | Interactive algebraic language | 17K   | Heap buffer overflow   | Wrong variable used in a for-loop end condition
Linux   | linux-extract   | Extracted from linux-2.6.6     | 0.3K  | Semantic (copy-paste)  | Forgot to change a variable identifier after copy-paste
MAN     | man-1.5h1       | Documentation tools            | 4.7K  | Global buffer overflow | Wrong for-loop end condition
NCOMP   | ncompress-1.2.4 | File (de)compression           | 1.9K  | Stack smash            | Fixed-length array cannot hold a long input file name
TAR     | tar-1.13.25     | GNU tar archive tool           | 27K   | Semantic (NULL ptr)    | Directory-property corner case is not well handled

Experimental Results

(Results figure omitted; labeled "no input testing".)

Experimental Results

- For application bugs, delta generation only worked for BC and TAR.
- In all cases, Triage correctly diagnoses the nature of the bug (deterministic or non-deterministic).
- In all 6 applicable cases, Triage correctly pinpoints the bug type, buggy instruction, and memory location.
- When delta analysis is applied, it reduces the amount of data to be considered by 63% (best: 98%, worst: 12%).
- For MySQL, it finds an example interleaving pair as the trigger.

Case Study 1: Apache

- Failure at ap_gregsub.
- The bug detector catches a stack smash in lmatcher.
- How can lmatcher affect try_alias_list? The stack smash overwrites the stack frame above it, invalidating r.
- The trace shows how lmatcher is called by try_alias_list.
- The failure is independent of the headers.
- The failure is triggered by requests for a specific resource.

Case Study 2: Squid

- Core dump analysis suggests a heap overflow.
- It happens at a strcat of two buffers.
- Fault propagation shows how the buffers were allocated: t holds strlen(usr) bytes, while the other buffer holds strlen(user)*3.
- Input testing gives the failure-triggering input.
- It also gives minimally different non-failing inputs.

Efficiency and Overhead

Normal execution overhead:
- Negligible effect caused by checkpointing.
- In no case over 5%.
- With 400 ms checkpointing intervals, the overhead is 0.1%.

Efficiency and Overhead

Diagnosis efficiency:
- Except for delta analysis, all steps are efficient.
- All (other) diagnostic steps finish within 5 minutes.
- Delta analysis time is governed by the edit distance D in the O(ND) computation (N is the number of blocks).
- The comparison step of delta analysis may run in the background.

User Study

- Real bugs: on average, programmers took 44.6% less time to debug using Triage reports.
- Toy bugs: on average, programmers took 18.4% less time to debug using Triage reports.


Questions?