Automated Diagnosis of Chronic Problems in Production Systems


Soila Kavulya

Thesis Committee

Christos Faloutsos, CMU

Greg Ganger, CMU

Matti Hiltunen, AT&T

Priya Narasimhan, CMU (Advisor)

Outline

Motivation

Thesis Statement

Approach: end-to-end trace construction, anomaly detection, localization

Evaluation: VoIP, Hadoop

Critique & Related Work

Pending Work

Soila Kavulya @ March 2012

Motivation

Chronics are problems that are

Not transient

Not resulting in system-wide outage

Chronics occur in real production systems

VoIP

User’s calls fail due to version conflict between user and upgraded server

Hadoop (CMU’s OpenCloud)

A user job sporadically fails in map phase with cryptic block I/O error

User and admins spend 2 months troubleshooting

Traced to large heap size in tasktracker starving collocated datanodes

Chronics are due to a variety of root-causes

Configuration problems, bad hardware, software bugs

Thesis: Automate chronics diagnosis in production systems

Challenge for Diagnosis

Single manifestation, multiple possible causes

Due to a single node?

Due to complex interactions between nodes?

Due to multiple independent nodes?

[Figure: five nodes, Node1 to Node5]

Challenges in Production Systems

Labeled failure-data is not always available

Difficult to diagnose problems not encountered before

Sysadmins’ perspective may not correspond to users’

No access to user configurations, user behavior

No access to application semantics

First sign of trouble is often a customer complaint

Customer complaints can be cryptic

Desired level of instrumentation may not be possible

As-is vendor instrumentation with limited control

Cost of added instrumentation may be high

Granularity of diagnosis consequently limited


Objectives

“Is there a problem?” (anomaly detection)

Detect a problem despite potentially not having seen it before

Distinguish a genuine problem from a workload change

“Where is the problem?” (localization)

Drill down by analyzing different instrumentation perspectives

“What kind of problems?” (chronics)

Manifestation: exceptions, performance degradations

Root-cause: misconfiguration, bad hardware, bugs, contention

Origin: single/multiple independent sources, interacting sources

“What kind of environments?” (production systems)

Production VoIP system at AT&T

Hadoop: Open-source implementation of MapReduce

Thesis Statement

Peer-comparison* enables anomaly detection in production systems despite workload changes, and the subsequent incremental fusion of different instrumentation sources enables localization of chronic problems.

*Comparison of some performance metric across similar (peer) system elements

What was our Inspiration?

rika (Swahili), noun. peer, contemporary, age-set, undergoing rites of passage (marriage) at similar times.

What is a Peer?

Temporal similarity

Age-set: Born around the same time

Anomaly detection: Events within same time window

Spatial similarity

Age-set: Live in same location

Anomaly detection: Run on same node

Phase similarity

Age-set: (birth, initiation, marriage)

Anomaly detection: (map, shuffle, reduce)

Contextual similarity

Age-set: Same gender, clan

Anomaly detection: Same workload, h/w

Target Systems for Validation

VoIP system at large telecommunication provider

10s of millions of calls per day, diverse workloads

100s of network elements with heterogeneous hardware

24x7 Ops team uses alarm correlation to diagnose outages

Separate team troubleshoots long-term chronics

Labeled traces available

Hadoop: Open-source implementation of MapReduce

Diverse kinds of real workloads

Graph mining, language translation

Hadoop clusters with homogeneous hardware

Yahoo! M45 & OpenCloud production clusters

Controlled experiments in Amazon EC2 cluster

Long running jobs (> 100s): Hard to label failures

In Support of Thesis Statement

Objective | VoIP | Hadoop

Anomaly Detection | Heuristics-based, peer-comparison pending | Peer comparison without labeled data

Problem Localization | Localize to customer/network-element/resource/error-code | Localize to node/task/resource

Chronics | Exceptions, performance degradation, single/multiple-source | Exceptions, performance degradation, single-source; multiple-source pending

Production Systems | AT&T production system | EC2 test system, OpenCloud pending

Publications | OSR’11, DSN’12 | WASL’08, HotMetrics’09, ISSRE’09, ICDCS’10, NOMS’10, CCGRID’10


Goals & Non-Goals

Goals

Anomaly detection in the absence of labeled failure-data

Diagnosis based on available instrumentation sources

Differentiation of workload changes from anomalies

Non-goals

Diagnosis of system-wide outages

Diagnosis of value faults and transient faults

Root-cause analysis at code-level

Online/runtime diagnosis

Recovery based on diagnosis

Assumptions

Majority of system is working correctly

Problems manifest in observable behavioral changes

Exceptions or performance degradations

All instrumentation is locally timestamped

Clocks are synchronized to enable system-wide correlation of data

Instrumentation faithfully captures system behavior

Overview of Approach

[Figure: diagnosis pipeline. Inputs: Application Logs, Performance Counters. Stages: End-to-end Trace Construction, Anomaly Detection, Localization. Output: Ranked list of root-causes]

Target System #1: VoIP

[Figure: VoIP architecture. Labels: PSTN Access, IP Access, Gateway Servers, IP Base Elements, Application Servers, Call Control Elements, ISP’s network]

Target System #2: Hadoop

[Figure: Hadoop architecture. Labels: Master Node, JobTracker, NameNode, TaskTracker, DataNode, Map/Reduce tasks, HDFS blocks; Hadoop logs and OS data are collected on both sides of the diagram]

Performance Counters

For both Hadoop and VoIP

Metrics collected periodically from /proc in OS

Monitoring interval varies from 1 sec to 15 min

Examples of metrics collected

CPU utilization

CPU run-queue size

Pages in/out

Memory used/free

Context switches

Packets sent/received

Disk blocks read/written
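As a rough illustration of periodic collection from /proc, the sketch below derives CPU utilization and a context-switch count from two snapshots of /proc/stat text. The function names are hypothetical, and the snapshots are hand-written strings rather than real readings; the slides do not show the actual collection tooling.

```python
def parse_proc_stat(text):
    """Extract aggregate CPU jiffies and the context-switch count from /proc/stat text."""
    sample = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "cpu":
            # First eight columns: user, nice, system, idle, iowait, irq, softirq, steal.
            jiffies = [int(v) for v in fields[1:9]]
            sample["cpu_total"] = sum(jiffies)
            sample["cpu_idle"] = jiffies[3]
        elif fields[0] == "ctxt":
            sample["context_switches"] = int(fields[1])
    return sample


def cpu_utilization(prev, curr):
    """Fraction of non-idle CPU time between two samples."""
    total = curr["cpu_total"] - prev["cpu_total"]
    idle = curr["cpu_idle"] - prev["cpu_idle"]
    return 0.0 if total <= 0 else 1.0 - idle / total


# Hand-written snapshots standing in for two reads of /proc/stat one interval apart.
before = parse_proc_stat("cpu  100 0 100 800 0 0 0 0 0 0\nctxt 5000\n")
after = parse_proc_stat("cpu  150 0 150 900 0 0 0 0 0 0\nctxt 5600\n")
utilization = cpu_utilization(before, after)
switches = after["context_switches"] - before["context_switches"]
```

Because the counters in /proc/stat are cumulative, each metric is a difference between consecutive samples, which is why the monitoring interval bounds the granularity of diagnosis.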

End-to-End Trace Construction

Application Logs

Each node logs each request that passes through it

Timestamp, IP address, request duration/size, phone no., …

Log formats vary across components and systems

Application-specific parsers extract relevant attributes

Construction of end-to-end traces

Pre-defined schema used to stitch requests across nodes

Match on key attributes

In Hadoop, match tasks with same task IDs

In VoIP, match calls with same sender/receiver phone no.

Incorporate time-based correlation

In Hadoop, consider block reads in same time interval as maps

In VoIP, consider calls with same phone no. within same time interval
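The two stitching rules above (match on a key attribute, then fall back to time-based correlation) can be sketched as follows. This is a minimal sketch under assumed record layouts: the field names (`task_id`, `node`, `time`, `start`, `end`, `block`) and both function names are hypothetical, not taken from the slides.

```python
from collections import defaultdict


def stitch_by_key(records, key):
    """Group per-node log records that share a key attribute into time-ordered traces."""
    traces = defaultdict(list)
    for record in records:
        traces[record[key]].append(record)
    for events in traces.values():
        events.sort(key=lambda r: r["time"])
    return dict(traces)


def attach_by_time(tasks, block_reads):
    """Attach block reads (which carry no task ID) to tasks on the same node
    whose execution interval covers the read."""
    attached = {task["task_id"]: [] for task in tasks}
    for read in block_reads:
        for task in tasks:
            same_node = task["node"] == read["node"]
            in_window = task["start"] <= read["time"] <= task["end"]
            if same_node and in_window:
                attached[task["task_id"]].append(read["block"])
    return attached


records = [
    {"task_id": "map_1", "node": "node2", "time": 12, "event": "finish"},
    {"task_id": "map_1", "node": "node1", "time": 3, "event": "launch"},
    {"task_id": "map_2", "node": "node3", "time": 5, "event": "launch"},
]
traces = stitch_by_key(records, key="task_id")

tasks = [{"task_id": "map_1", "node": "node2", "start": 3, "end": 12}]
reads = [
    {"node": "node2", "time": 7, "block": "blk_A"},
    {"node": "node2", "time": 20, "block": "blk_B"},
]
blocks = attach_by_time(tasks, reads)
```

Time-based correlation can over-attach when several tasks overlap on one node, so it is best treated as a candidate set rather than a definite causal link.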

Application Logs: VoIP

Combine per-element logs to obtain per-call traces

Approximate match on key attributes

Timestamps, caller-callee numbers, IP, ports

Determine call status from per-element codes

Zero talk-time, callback soon after call termination

[Figure: log entries from the IP Base Element, Call Control Element, Application Server, and Gateway Server for one call]

10:03:59, START 973-123-8888 to 409-555-5555 192.156.1.2 to 11.22.34.1

10:03:59, STOP

10:03:59, ATTEMPT 973-123-8888 to 409-555-5555

10:04:01, ATTEMPT 973-123-xxxx to 409-555-xxxx 192.156.1.2 to 11.22.34.1
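One way to read "approximate match" for the entries above is to tolerate masked digits and small clock differences between elements. The sketch below assumes exactly that; the `x` wildcard rule, the 5-second skew limit, and the record fields are illustrative assumptions rather than the matching rules used in the deployed system.

```python
from datetime import datetime


def numbers_match(a, b, wildcard="x"):
    """Phone numbers match if every position agrees or is masked in either number."""
    return len(a) == len(b) and all(
        ca == cb or wildcard in (ca, cb) for ca, cb in zip(a, b)
    )


def same_call(rec_a, rec_b, max_skew_s=5):
    """Treat two per-element records as one call if their timestamps are close
    and both caller and callee numbers match approximately."""
    fmt = "%H:%M:%S"
    delta = datetime.strptime(rec_a["time"], fmt) - datetime.strptime(rec_b["time"], fmt)
    return (
        abs(delta.total_seconds()) <= max_skew_s
        and numbers_match(rec_a["caller"], rec_b["caller"])
        and numbers_match(rec_a["callee"], rec_b["callee"])
    )


first = {"time": "10:03:59", "caller": "973-123-8888", "callee": "409-555-5555"}
masked = {"time": "10:04:01", "caller": "973-123-xxxx", "callee": "409-555-xxxx"}
other = {"time": "10:04:01", "caller": "973-123-8888", "callee": "409-555-1234"}

matched = same_call(first, masked)
mismatched = same_call(first, other)
```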




Application Logs: Hadoop (1)

Peer-comparable attributes extracted from logs

Correlate traces using IDs and request schema

2009-03-06 23:06:01,572 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200903062245_0051_r_000005_0 Scheduled 10 of 115 known outputs (0 slow hosts and 105 dup hosts)

2009-03-06 23:06:01,612 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 2 bytes (2 raw bytes) into RAM from attempt_200903062245_0051_m_000055_0 …from ip-10-250-90-207.ec2.internal

Temporal similarity: Timestamps

Spatial similarity: Hostnames

Phase similarity: Map, Reduce

Context similarity: TaskType

Application Logs: Hadoop (2)

No global IDs for correlating logs in Hadoop & VoIP

Extract causal flows using predefined schemas

[Figure: Application logs, Extract events, NoSQL Database, Flow schema (JSON), Causal flows]

Application log entry:

2009-03-06 23:06:01,572 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200903062245_0051_r_000005_0 Scheduled 10 of 115 known outputs (0 slow hosts and 105 dup hosts)

Extracted event:

<time=t2, type=shuffle, reduceid=reduce1, mapid=map1, duration=2s>

Flow schema (JSON):

MapReduce: {
  "events" : { "Map" :
    { "primary-key" : "MapID",
      "join-key" : "MapID",
      "next-event" : "Shuffle"},…

Causal flows
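To make the schema-driven join concrete, the sketch below links extracted events into causal edges by following each event type's `next_event` and `join_key`. Only the "Map" rule mirrors the schema fragment on the slide; the "Shuffle" and "Reduce" rules, the event fields, and the function name are assumptions added for illustration, since the slide's schema is truncated.

```python
# Hypothetical flow schema; only the "Map" entry follows the slide's fragment.
SCHEMA = {
    "Map": {"join_key": "mapid", "next_event": "Shuffle"},
    "Shuffle": {"join_key": "reduceid", "next_event": "Reduce"},
    "Reduce": {"join_key": "reduceid", "next_event": None},
}


def causal_edges(events, schema):
    """Return (source_id, destination_id) pairs for events linked by the schema."""
    edges = []
    for src in events:
        rule = schema.get(src["type"])
        if rule is None or rule["next_event"] is None:
            continue
        key = rule["join_key"]
        for dst in events:
            if (
                dst["type"] == rule["next_event"]
                and src.get(key) is not None
                and dst.get(key) == src.get(key)
            ):
                edges.append((src["id"], dst["id"]))
    return edges


events = [
    {"id": "e1", "type": "Map", "mapid": "map1"},
    {"id": "e2", "type": "Shuffle", "mapid": "map1", "reduceid": "reduce1"},
    {"id": "e3", "type": "Reduce", "reduceid": "reduce1"},
    {"id": "e4", "type": "Shuffle", "mapid": "map9", "reduceid": "reduce2"},
]
edges = causal_edges(events, SCHEMA)
```

The nested loop keeps the sketch short; an index from join-key value to events would avoid the quadratic scan on real logs.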

Anomaly Detection

Anomaly Detection Overview

Some systems have rules for anomaly detection

Redialing number immediately after disconnection

Server reported error codes and exceptions

If no rules available, rely on peer-comparison

Identifies peers (nodes, flows) in distributed systems

Detect anomalies by identifying “odd-man-out”



Anomaly Detection (1)

Empirically determine best peer groupings

Window size, request-flow types, job information

Best grouping minimizes false positives in fault-free runs

Peer-comparison identifies “odd-man-out” behavior

Robust to workload changes

Relies on histogram-comparison

Less sensitive to timing differences

Multiple suspects might be identified

Due to propagating errors, multiple independent problems

Anomaly Detection (2)

Histogram comparison identifies anomalous flows

Generate an aggregate histogram that represents majority behavior

Compare each node’s histogram against the aggregate histogram, O(n)

Compute anomaly score using Kullback-Leibler divergence

Detect anomaly if score exceeds pre-specified threshold

[Figure: histograms (distributions) of flow durations for one faulty node and two normal nodes; y-axis: normalized counts (total 1.0)]
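The histogram comparison can be sketched in a few lines. This is a minimal sketch, not the thesis implementation: it assumes the aggregate histogram is the per-bin median across nodes, adds a small smoothing constant so that empty bins do not produce infinite divergence, and uses a threshold of 1.0 chosen only for the example.

```python
import math
from statistics import median


def normalize(counts, eps=1e-6):
    """Turn raw bin counts into a smoothed distribution that sums to 1.0."""
    smoothed = [c + eps for c in counts]
    total = sum(smoothed)
    return [c / total for c in smoothed]


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def anomaly_scores(histograms):
    """Score each node against an aggregate built from the per-bin median (assumed)."""
    dists = {node: normalize(h) for node, h in histograms.items()}
    aggregate = normalize([median(bin_values) for bin_values in zip(*dists.values())])
    return {node: kl_divergence(d, aggregate) for node, d in dists.items()}


# Flow-duration histograms (short, medium, long) for three peer nodes; values are made up.
histograms = {
    "node1": [8, 2, 0],
    "node2": [7, 3, 0],
    "node3": [1, 2, 7],
}
scores = anomaly_scores(histograms)
THRESHOLD = 1.0  # assumed value for illustration
suspects = sorted(node for node, score in scores.items() if score > THRESHOLD)
```

Each node is compared once against the aggregate, which matches the O(n) cost noted above; a pairwise comparison of all nodes would be quadratic.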

Localization

“Truth table” Request Representation

Request | Node1 | Node2 | Map | ReadBlock | Outcome

Req1 | 1 | 0 | 1 | 1 | SUCCESS

Req2 | 0 | 1 | 1 | 1 | FAIL

Log Snippet

Req1: 20100901064914,SUCCESS,Node1,Map,ReadBlock

Req2: 20100901064930,FAIL,Node2,Map,ReadBlock
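The conversion from the log snippet to the truth table is mechanical, and a short sketch may help. It assumes the comma-separated layout shown in the snippet (timestamp, outcome, then attributes); the function name is hypothetical.

```python
def to_truth_table(log_lines):
    """Convert 'ReqN: timestamp,OUTCOME,attr,...' lines into 0/1 attribute rows."""
    parsed = []
    for line in log_lines:
        request_id, rest = line.split(":", 1)
        _timestamp, outcome, *attrs = [field.strip() for field in rest.split(",")]
        parsed.append((request_id.strip(), outcome, set(attrs)))
    # Every attribute seen in any request becomes a column.
    columns = sorted(set().union(*(attrs for _, _, attrs in parsed)))
    rows = {
        request_id: ({col: int(col in attrs) for col in columns}, outcome)
        for request_id, outcome, attrs in parsed
    }
    return columns, rows


log = [
    "Req1: 20100901064914,SUCCESS,Node1,Map,ReadBlock",
    "Req2: 20100901064930,FAIL,Node2,Map,ReadBlock",
]
columns, rows = to_truth_table(log)
```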

Identify Suspect Attributes

Assume each attribute represented as coin toss

Estimate attribute distribution using Bayes

Success distribution: Prob(Attribute|Success)

Anomalous distribution: Prob(Attribute|Anomalous)

Anomaly score: KL-divergence between the two distributions

Indict attributes with highest divergence between distributions

[Figure: belief over Probability(Node2=TRUE) for successful requests and for anomalous requests]

http://www.pdl.cmu.edu/
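A simplified version of this scoring step is sketched below. It is an approximation under stated assumptions: each attribute's probability is summarized by the posterior mean under a uniform prior (rather than a full posterior distribution), the score is the KL divergence between the two resulting coin-toss distributions, and attributes that are no more likely in anomalous requests than in successful ones score zero. None of these choices is confirmed by the slides.

```python
import math


def posterior_mean(hits, total):
    """Mean of the Beta(1 + hits, 1 + total - hits) posterior (uniform prior assumed)."""
    return (hits + 1) / (total + 2)


def bernoulli_kl(p, q):
    """KL divergence between two coin tosses with head probabilities p and q."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def score_attributes(requests):
    """requests: list of (attribute_set, succeeded). Returns attribute -> anomaly score."""
    failed = [attrs for attrs, ok in requests if not ok]
    passed = [attrs for attrs, ok in requests if ok]
    names = set().union(*(attrs for attrs, _ in requests))
    scores = {}
    for name in names:
        p_anomalous = posterior_mean(sum(name in a for a in failed), len(failed))
        p_success = posterior_mean(sum(name in a for a in passed), len(passed))
        # Only indict attributes that are over-represented in anomalous requests.
        if p_anomalous > p_success:
            scores[name] = bernoulli_kl(p_anomalous, p_success)
        else:
            scores[name] = 0.0
    return scores


requests = (
    [({"Node1", "Map"}, True)] * 6
    + [({"Node2", "Map"}, False)] * 4
    + [({"Node2", "Map"}, True)] * 1
)
scores = score_attributes(requests)
suspect = max(scores, key=scores.get)
```

Note that "Map" appears in every failed request yet scores zero, because it is just as common in successful ones; this is the benefit of analyzing both successful and failed requests.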

Rank Problems by Severity

Indict path with highest anomaly score

Step 1: All requests

Problem1: Node2, Map

Step 2: Filter all requests except those matching Problem1

Problem2: Node3, Shuffle

[Figure: search trees over the attributes Node2, Node3, Map, Shuffle, ExceptionX, ExceptionY, with an anomaly score on each branch]
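The iterative search-and-filter idea can be sketched as a greedy loop: indict the highest-scoring attribute, set aside the failed requests it explains, and repeat on what remains. This is a deliberately reduced sketch: it indicts single attributes rather than paths (conjunctions) of attributes, and it reuses an assumed smoothed coin-toss KL score; the function names and data are hypothetical.

```python
import math


def _score(name, failed, passed):
    """Smoothed coin-toss KL score; zero unless the attribute is over-represented in failures."""
    p = (sum(name in a for a in failed) + 1) / (len(failed) + 2)
    q = (sum(name in a for a in passed) + 1) / (len(passed) + 2)
    if p <= q:
        return 0.0
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def rank_problems(requests, max_problems=5):
    """Return [(attribute, failed_requests_explained), ...] in order of discovery."""
    passed = [attrs for attrs, ok in requests if ok]
    failed = [attrs for attrs, ok in requests if not ok]
    problems = []
    while failed and len(problems) < max_problems:
        names = sorted(set().union(*failed))  # sorted for deterministic tie-breaking
        best = max(names, key=lambda n: _score(n, failed, passed))
        if _score(best, failed, passed) == 0.0:
            break
        explained = [attrs for attrs in failed if best in attrs]
        problems.append((best, len(explained)))
        # Filter out the failures already explained, then search again.
        failed = [attrs for attrs in failed if best not in attrs]
    return problems


requests = (
    [({"Node1", "Map"}, True)] * 20
    + [({"Node1", "Shuffle"}, True)] * 10
    + [({"Node2", "Map"}, False)] * 6
    + [({"Node3", "Shuffle"}, False)] * 3
)
problems = rank_problems(requests)
```

The count of explained failures gives a simple severity measure for ordering the reported problems.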

Incorporate Performance Counters (1)

Annotate requests on indicted nodes with performance counters based on timestamps

Identify metrics most correlated with problem

Compare distribution of metrics in successful and failed requests

Requests on node2

# Timestamp, CallNo, Status, Memory(%), CPU(%)

20100901064914, 1, SUCCESS, 54, 6

20100901065030, 2, SUCCESS, 54, 6

20100901065530, 3, SUCCESS, 56, 4

20100901070030, 4, FAIL, 52, 45
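Both steps on this slide, annotating requests by timestamp and then comparing metric distributions, can be sketched with the table's values. The counter sample timestamps below are invented for the example, and the separation score (difference of means scaled by the spread among successful requests) is one simple stand-in for a distribution comparison, not necessarily the one used in the thesis.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def annotate(requests, samples):
    """Pair each request with the latest counter sample taken at or before its timestamp."""
    sample_times = [t for t, _ in samples]
    annotated = []
    for timestamp, status in requests:
        index = bisect_right(sample_times, timestamp) - 1
        if index >= 0:
            annotated.append((status, samples[index][1]))
    return annotated


def rank_metrics(annotated):
    """Rank metrics by how far failed requests sit from successful ones."""
    ok = [metrics for status, metrics in annotated if status == "SUCCESS"]
    bad = [metrics for status, metrics in annotated if status == "FAIL"]
    scores = {}
    for name in ok[0]:
        ok_values = [m[name] for m in ok]
        bad_values = [m[name] for m in bad]
        spread = pstdev(ok_values) or 1.0  # avoid dividing by zero
        scores[name] = abs(mean(bad_values) - mean(ok_values)) / spread
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Counter samples sorted by time; the sample timestamps are hypothetical.
samples = [
    (20100901064900, {"Memory": 54, "CPU": 6}),
    (20100901065000, {"Memory": 54, "CPU": 6}),
    (20100901065500, {"Memory": 56, "CPU": 4}),
    (20100901070000, {"Memory": 52, "CPU": 45}),
]
requests = [
    (20100901064914, "SUCCESS"),
    (20100901065030, "SUCCESS"),
    (20100901065530, "SUCCESS"),
    (20100901070030, "FAIL"),
]
ranking = rank_metrics(annotate(requests, samples))
```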

Incorporate Performance Counters (2)

Incorporate performance counters in diagnosis

All requests

Problem1: Node2, Map, High CPU

[Figure: search tree over the attributes Node2, Node3, Map, Shuffle, with High CPU added beneath the indicted path and an anomaly score on each branch]

Why Does It Work?

Real-world data backs up utility of peer-comparison

Task durations peer-comparable in >75% of jobs [CCGrid’10]

Approach analyzes both successful and failed requests

Analyzing only failed requests might elevate common elements over causal elements

Iterative approach discovers correlated attributes

Identifies problems due to conjunctions of attributes

Filtering step identifies multiple ongoing problems

Handles unencountered problems

Does not rely on historical models of normal behavior

Does not rely on signatures of known defects


VoIP: Diagnosis of Real Incidents

8 out of 10 real incidents diagnosed

Examples of real-world incidents | Resource Indicted

Customers use wrong codec to send faxes | NA

Customer problem causes blocked calls at IPBE | NA

Blocked circuit identification codes on trunk group | NA

Software bug at control server causes blocked calls | NA

Problem with customer equipment leads to poor QoS | NA

Debug tracing overloads servers during peak traffic | CPU

Performance problem at application server | CPU/Memory

Congestion at gateway servers due to high load | CPU/Concurrent Sessions

Power outage causes brief outages | NA

PSX not responding to invites from app. server | Low responses at app. server

VoIP: Case Studies

Incident 1: Chronic due to unsupported fax codec

[Figure: failed calls for two customers, Day1 to Day6. Annotations: chronic nightly problem; customers stop using unsupported codec]

Incident 2: Chronic server problem

[Figure: failed calls for server, Day1 to Day6. Annotations: unrelated chronic server problem emerges; server reset]

Implementation of Approach

Draco: Deployment in Production at AT&T

~8500 lines of C code

[Figure: Draco interface with Search and Filter controls and a ranked list of problems]

1. Problem1: STOP.IP-TO-PS.487.3, STOP.IP-TO-PSTN.41.0.-.-, Chicago*, GSXServers, MemoryOverload

2. Problem2: STOP.IP-TO-PSTN.102.0.102.102, ServiceB, CustomerAcme, IP_w.x.y.z

VoIP: Ranking Multiple Problems

Draco performs better at ranking multiple independent problems

VoIP: Performance of Algorithm

Offline Analysis | Avg. Log Size | Avg. Data Load Time | Avg. Diagnosis Time

Draco simulated-1hr (C++) | 271 MB | 8s | 4s

Draco real-1day (C++) | 2.4 GB | 7min | 8min

Running on 16-core Xeon (@ 2.4GHz), 24 GB Memory


Hadoop: Target Clusters

10 to 100-node Amazon’s EC2 cluster

Commercial, pay-as-you-use cloud-computing resource

Workloads under our control, problems injected by us

gridmix, nutch, sort, random writer

Can harvest logs and OS data of only our workloads

4000-processor M45 & 64-node OpenCloud cluster

Production environment

Offered to CMU as free cloud-computing resource

Diverse kinds of real workloads, problems in the wild

Massive machine-learning, language/machine-translation

Permission to harvest all logs and OS data

Hadoop: EC2 Fault Injection

Injected fault on single node

Category | Fault | Description

Resource contention | CPU hog | External process uses 70% of CPU

Resource contention | Packet-loss | 5% or 50% of incoming packets dropped

Resource contention | Disk hog | 20GB file repeatedly written to

Application bugs (Source: Hadoop JIRA) | HADOOP-1036 | Maps hang due to unhandled exception

Application bugs (Source: Hadoop JIRA) | HADOOP-1152 | Reduces fail while copying map output

Application bugs (Source: Hadoop JIRA) | HADOOP-2080 | Reduces fail due to incorrect checksum

Hadoop: Peer-comparison Results

Without Causal Flows

[Figure: true positive rates by metric]

Different metrics detect different problems

Correlated problems (e.g., packet-loss) harder to localize

Hadoop: Peer-comparison Results

With Causal Flows + Localization

Examples of real-world incidents | Metrics Indicted

CPU hog | Node

Packet-loss | Node+Shuffle

Disk hog | Node

HADOOP-1036 | Node+Map

HADOOP-1152 | Node+Shuffle

HADOOP-2080 | Node+Shuffle

Correlated problems correctly identified


Critique of Approach

Anomaly detection thresholds are fragile

Need to use statistical tests

Anomaly detection does not address problems at master

Peer-groups are defined statically

Assumes homogeneous clusters

Need to automate identification of peers

False positives occur if root-cause not in logs

Algorithm tends to implicate adjacent network elements

Need to incorporate more data to improve visibility

Related Work

Chronics fly under the radar

Undetected by alarm mining [Mahimkar09]

Chronics can persist undetected for long periods of time

Hard to detect using change-points [Kandula09]

Hard to demarcate problem periods [Sambasivan11]

Multiple ongoing problems at a time

Single fault assumption inadequate [Cohen05, Bodik10]

Peer-comparison on its own inadequate

Hard to localize propagating problems [Kasick10, Tan10, Kang10]


Pending Work

Objective | VoIP | Hadoop

Anomaly Detection | Heuristics-based, peer-comparison pending | Peer comparison without labeled data

Problem Localization | Localize to customer/network-element/resource/error-code | Localize to node/task/resource

Chronics | Exceptions, performance degradation, single/multiple-source | Exceptions, performance degradation, single-source; multiple-source pending

Production Systems | AT&T production system | EC2 test system, OpenCloud pending

Publications | OSR’11, DSN’12 | WASL’08, HotMetrics’09, ISSRE’09, NOMS’10, CCGRID’10

Pending Work: Details

OpenCloud production cluster & multiple-source problems [April-June 2012]

64-node cluster housed at Carnegie Mellon

Obtained and parsed logs from 25 real OpenCloud incidents

Root-causes include misconfigurations, h/w issues, buggy apps

Yet to analyze logs

Peer comparison in VoIP [June-July 2012]

Examining data that is not labeled, and identifying peers

Notion of a peer might be determined by function and location

Root-causes under investigation are as before

Dissertation writing [June-August 2012]

Defense [September 2012]

Collaborators & Thanks

VoIP (AT&T)

Matti Hiltunen, Kaustubh Joshi, Scott Daniels

Hadoop diagnosis

Jiaqi Tan, Xinghao Pan, Rajeev Gandhi, Keith Bare, Michael Kasick, Eugene Marinelli

Hadoop visualization

Christos Faloutsos, U Kang, Elmer Garduno, Jason Campbell (Intel), HCI 05-610 team

OpenCloud

Greg Ganger, Garth Gibson, Julio Lopez, Kai Ren, Mitch Franzos, Michael Stroucken

Summary

Peer-comparison effective for anomaly detection

Robust to workload changes

Requires little training data

Incremental fusion of different instrumentation sources enables localization of chronics

Starts with user-visible symptoms of a problem

Drills down to localize root-cause of problem

Usefulness of approach in two production systems

VoIP system at large telecommunication provider (demonstrated)

Hadoop clusters (underway)



Questions?

Climbing Mt. Kilimanjaro comes a distant second to a thesis proposal!

Selected Publications (1)

Diagnosis in Production VoIP system

DSN12: Draco: Statistical Diagnosis of Chronic Problems in Large Distributed Systems. S. P. Kavulya, S. Daniels, K. Joshi, M. Hiltunen, R. Gandhi, P. Narasimhan. To appear, DSN 2012.

OSR12: Practical Experiences with Chronics Discovery in Large Telecommunications Systems. S. P. Kavulya, K. Joshi, M. Hiltunen, S. Daniels, R. Gandhi, P. Narasimhan. Best Papers from SLAML 2011 in Operating Systems Review, 2011.

Survey Paper & Workload Analysis of Production Hadoop Cluster

RAE12: Failure Diagnosis of Complex Systems. S. P. Kavulya, K. Joshi, F. Di Giandomenico, P. Narasimhan. To appear in Book on Resilience Assessment and Evaluation. Wolter, 2012.

An analysis of traces from a production MapReduce cluster. S. Kavulya, J. Tan, R. Gandhi, P. Narasimhan. CCGrid 2010.

Selected Publications (2)

Visualization in Hadoop

CHIMIT11: Understanding and improving the diagnostic workflow of MapReduce users. J. D. Campbell, A. B. Ganesan, B. Gotow, S. P. Kavulya, J. Mulholland, P. Narasimhan, S. Ramasubramanian, M. Shuster, J. Tan. CHIMIT 2011.

ICDCS10: Visual, log-based causal tracing for performance debugging of MapReduce systems. J. Tan, S. Kavulya, R. Gandhi, P. Narasimhan. ICDCS 2010.

Diagnosis in Hadoop (Application logs + performance counters)

NOMS10: Kahuna: Problem Diagnosis for MapReduce-Based Cloud Computing Environments. J. Tan, X. Pan, S. Kavulya, R. Gandhi, P. Narasimhan. NOMS 2010.

ISSRE09: Blind Men and the Elephant (BLIMEy): Piecing together Hadoop for Diagnosis. X. Pan, J. Tan, S. Kavulya, R. Gandhi, P. Narasimhan. ISSRE 2009.

Selected Publications (3)

Diagnosis in Hadoop (Performance counters)

HotMetrics09: Ganesha: Black-Box Fault Diagnosis for MapReduce Systems. X. Pan, J. Tan, S. Kavulya, R. Gandhi, P. Narasimhan. HotMetrics 2009.

Diagnosis in Hadoop (Application logs)

WASL: SALSA: Analyzing Logs as StAte Machines. J. Tan, X. Pan, S. Kavulya, R. Gandhi, P. Narasimhan. WASL 2008.

Diagnosis in Group Communication Systems

SRDS08: Gumshoe: Diagnosing Performance Problems in Replicated File-Systems. S. Kavulya, R. Gandhi, P. Narasimhan. SRDS 2008.

SysML07: Fingerpointing Correlated Failures in Replicated Systems. S. Pertet, R. Gandhi, P. Narasimhan. SysML, April 2007.

Related Work (1)

[Bodik10]: Fingerprinting the datacenter: automated classification of performance crises. Peter Bodík, Moisés Goldszmidt, Armando Fox, Dawn B. Woodard, Hans Andersen. EuroSys 2010.

[Cohen05]: Capturing, indexing, clustering and retrieving system history. Ira Cohen, Steve Zhang, Moises Goldszmidt, Julie Symons, Terence Kelly, Armando Fox. SOSP 2005.

[Kandula09]: Detailed diagnosis in enterprise networks. Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal, Jitendra Padhye, Paramvir Bahl. SIGCOMM 2009.

[Kasick10]: Black-Box Problem Diagnosis in Parallel File Systems. Michael P. Kasick, Jiaqi Tan, Rajeev Gandhi, Priya Narasimhan. FAST 2010.

[Kiciman05]: Detecting application-level failures in component-based Internet Services. Emre Kiciman, Armando Fox. IEEE Trans. on Neural Networks 2005.

Related Work (2)

[Mahimkar09]: Towards automated performance diagnosis in a large IPTV network. Ajay Anil Mahimkar, Zihui Ge, Aman Shaikh, Jia Wang, Jennifer Yates, Yin Zhang, Qi Zhao. SIGCOMM 2009.

[Sambasivan11]: Diagnosing Performance Changes by Comparing Request Flows. Raja R. Sambasivan, Alice X. Zheng, Michael De Rosa, Elie Krevat, Spencer Whitman, Michael Stroucken, William Wang, Lianghong Xu, and Gregory R. Ganger. NSDI 2011.