20080212 PracticalReports


31 Oct 2013


Practical Reports on Dependability

Manifestation of System Failure


Site unavailability


System exception/access violation


Incorrect result


Data loss/corruption


Slow down

PAGE UNAVAILABLE

System Exception

Performance Slowdown

DOWNTIME

15% contribution

DOWNTIME

Unplanned: 20%

Planned: 80%

UNPLANNED DOWNTIME

Software Errors

Triggers


Resource exhaustion


Logical errors


System Overload


Recovery code


Failed upgrade


Logical Error

SYSTEM OVERLOAD

Operator Errors

Triggers


Configurational


Incorrect parameter setting


Procedural


Omitted/incorrect maintenance action


Miscellaneous


FAILURE

DURATION


Short (minutes)


Long (weeks)


Implies large fault chains

FREQUENCY


Permanent


(down until problem fixed)


Transient


(resolves without intervention)


Intermittent

(transient + occasional)


SCOPE


Entire system


Parts of the system
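
The three failure dimensions above (duration, frequency, scope) can be captured as a small record type. The following is only a minimal sketch: the class names, the `FailureReport` record, and the `needs_intervention` helper are hypothetical illustrations, not part of the original slides.

```python
from dataclasses import dataclass
from enum import Enum


class Duration(Enum):
    SHORT = "minutes"
    LONG = "weeks"


class Frequency(Enum):
    PERMANENT = "down until problem fixed"
    TRANSIENT = "resolves without intervention"
    INTERMITTENT = "transient + occasional"


class Scope(Enum):
    ENTIRE_SYSTEM = "entire system"
    PARTIAL = "parts of the system"


@dataclass(frozen=True)
class FailureReport:
    """Hypothetical record classifying one failure along the three dimensions."""
    description: str
    duration: Duration
    frequency: Frequency
    scope: Scope

    def needs_intervention(self) -> bool:
        # Per the slide, only a permanent failure stays down until it is fixed.
        return self.frequency is Frequency.PERMANENT


# Illustrative (made-up) failure: a page timeout that clears on its own.
report = FailureReport(
    description="page timeout",
    duration=Duration.SHORT,
    frequency=Frequency.TRANSIENT,
    scope=Scope.PARTIAL,
)
```

Classifying every incident this way makes it possible to group reports later, for example to list all permanent, system-wide failures.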




Fault Chains


“the series of component failures that led up to a user-visible failure”


Uncoupled


Independent failures


Tightly Coupled


Cascading/correlated failure
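
One way to make the uncoupled/tightly coupled distinction concrete is to record, for each component failure in a chain, which earlier failure (if any) triggered it. This is a sketch under that assumption: `ComponentFailure`, its `caused_by` field, and `is_tightly_coupled` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ComponentFailure:
    component: str
    # Name of the component whose failure triggered this one (None if independent).
    caused_by: Optional[str] = None


def is_tightly_coupled(chain: List[ComponentFailure]) -> bool:
    """Return True if any failure in the chain was triggered by another failure in it."""
    failed = {f.component for f in chain}
    return any(f.caused_by in failed for f in chain if f.caused_by is not None)


# Made-up chains for illustration only.
independent = [ComponentFailure("disk"), ComponentFailure("network")]
cascading = [
    ComponentFailure("power unit"),
    ComponentFailure("disk", caused_by="power unit"),
]
```

Here `independent` models an uncoupled chain (no failure caused another), while `cascading` models a tightly coupled one.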

Non-Malicious Software Failure


Most Common Causes


Routine maintenance


Software upgrade


System integration


Other Causes


System overload


Resource exhaustion


Complex fault tolerant routines

“ROUTINE” MAINTENANCE


Danske Bank 2003


March 11: routine operation to replace a defective electrical unit in IBM DB2 disk system


System failure: disks become inaccessible


6 hours later: system restarted


March 12: batch systems running incorrectly


Three more errors discovered:

1. Recovery process on several tables won’t start

2. Recovery jobs won’t run simultaneously

3. Recovery jobs can’t re-establish data in tables


March 14: all data recovered and system functional