WebSphere Serviceability Team - November 2005 Best Practices for Problem Determination and Reliable Operation


IBM Software Group

WebSphere Serviceability Team - November 2005

Best Practices for Problem Determination and Reliable Operation
in Complex WebSphere Application Server Sites

Daniel P. Julin

Technical Area Lead, WebSphere Serviceability Team

dpj@us.ibm.com



Objective


Propose and discuss some Best Practices and practical tips for
people who undertake a complex Problem Determination task,
from beginning to end


Not a perfect method (yet?), but a start


Based on observations from many PMRs worked by IBM
Support and the WebSphere Serviceability Team


Focus on practical ideas that can be applied in today’s
systems


Top-down approach, focus on the big picture


But will point out a few specific tools and emerging
technologies that may help the process


Focus primarily on WebSphere Application Server


But most of those Best Practices probably apply to
many systems and products

Current Tool (example)

Future Tool (example)



Resources

Redbook SG24-6798-00

www.redbooks.ibm.com



A Few General Observations about Problems


Most problems are due to environment and configuration issues,
misunderstandings, hard-to-diagnose application issues, etc. -- not
actual WAS code defects


Non-Defect Oriented Problems (NDOP) (--> typically >90% of PMRs)



Many problems are seen more than once at different Customers


Rediscoveries (both DOP and NDOP) (--> typically >50% of PMRs)



Most problems are solved fairly quickly and easily


But a few problems are hard to diagnose, never seen before, critical
to Production, involve issues in the WAS internals, may involve
multiple interdependent issues, etc.


These take a disproportionate amount of time to resolve, and cause
the most frustration with the Problem Determination process




A Few General Observations about Problem Determination


Problem Determination is not an exact science


We’re getting closer for the “simple” problems, but we have much work
to do for the “complex” problems


A major challenge of PD is dealing with unanticipated problems


Much like Detective work: find clues, make educated guesses, verify
suspicions, etc… and see where it leads



Problem Determination is not Rocket Science


Most important skills are common sense, focus, thoroughness,
rigorous thinking



Problem Determination is often a cooperative and iterative process



The biggest cause of delays and frustration is confusion and poor
communication while working the problem



Not every problem requires the most complex Problem Determination skills
and techniques


Enable Customer Self-service and automation for common PD tasks




Goals for Problem Determination Methodology


Focus on the common/"easy" problems first


expand to more complex problems over time



Enable Customer self-help


but transition to IBM-assisted approach whenever desired



Enable a reliable process to avoid confusion, mistakes,
miscommunications, false starts


good for the common/"easy" problems


but also able to get into the "hard" problems




"Solve a Problem" Flow
- Big picture

Phase 1

START
Initial Characterization of the Problem
Collect basic diagnostics
Quick scan through basic diagnostics (common problems)
    Found potential remedy: Attempt and evaluate remedy
        Remedy OK: DONE
        Not OK: Go to Phase 2
    Found no remedy: Go to Phase 2

Phase 2

Consult Knowledge Base / Expert System to decide on next PD step (specific to each problem and current state)
Collect new specialized diagnostics (may or may not involve recreate)
Analyze and interpret new diagnostics
    Found potential remedy: Attempt and evaluate remedy
        Remedy OK: DONE
        Not OK: Iterate until resolved
    Found no remedy: Iterate until resolved



Key Steps for Problem Determination

Preparation: Before a problem occurs
    Problem prevention
    Prepare for problem management
    Topology diagram
    Diagnostic data collection plan

Organize the investigation
    Characterize the problem
    Table of issues and symptoms
    High-level timeline

Consider relief options
    Relief?

Initial investigation - "Phase 1"
    Pick key symptoms
    Research in knowledge base
    One step only
    Resolved?

In-depth investigation - "Phase 2"
    Identify problem category
    Find specific troubleshooting instructions
    Execute troubleshooting actions
    Iterate until resolved
    Resolved?



Preparation: Before a Problem Occurs


Good Problem Determination starts long before anything bad happens


There are Best Practices ... and Malpractices



Problem Prevention - The top-10 concerns or "Malpractices"


Sufficient Test environment


Load / Stress testing


Capacity plan


Keeping the System operating within the Capacity plan


Production traffic profile (network, etc.)


Process for rolling out changes into Production


Keeping a record of changes


Application Review and Best Practices


Education


Migration plan


Current Architecture plan


Tivoli Performance Viewer


Rational tools


Testing


Code validation


IBM Education Assistant



Preparation: Before a Problem Occurs


Prepare for Problem Management


Monitoring / Problem detection


System Documentation


Topology diagram


Establish baselines


Clean up spurious errors or learn to recognize them


Diagnostic data collection plan


Relief/Recovery plan



Maintenance Plan - scheduled and emergency




Best Practice: Topology Diagram


Show all the components and all the main flows in the system



Useful to


Communicate with all parties involved in the PD effort


Identify discrepancies between the expected environment and the
current reality


Identify points where monitoring / health checking can be done


Identify points where diagnostics data can be collected



Be specific and detailed


Indicate software versions, machine names, IP addresses if possible



Identify the main execution flows through this diagram

Customer Profile
Database



Example: Topology Diagram

[Diagram showing: Load Balancer; two Firewalls; WebServers on machines Web1 and Web2; machines Mars, Venus, and Jupiter hosting two AppServers (each with an NA) and one DM; LDAP; DB. Each component is labeled with its product version (WAS, DB, or LDAP version: x.y), OS version: x.y, and IP address: a.b.c.d.]
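
One way to make the topology diagram useful for spotting "discrepancies between the expected environment and the current reality" is to keep the same information as data. The following is only a sketch under stated assumptions: the Component record, its field names, and the find_discrepancies helper are hypothetical and not part of any WebSphere tooling, and the version strings and addresses in the usage lines are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One box of the topology diagram (hypothetical record layout)."""
    name: str             # e.g. "AppServer", "DM", "DB"
    machine: str          # e.g. "Mars"
    product_version: str  # WAS / DB / LDAP version, as written on the diagram
    os_version: str
    ip_address: str


def find_discrepancies(expected, observed):
    """Compare the documented topology with what was actually found.

    Returns a list of (machine, name, field, expected_value, observed_value).
    """
    observed_by_key = {(c.machine, c.name): c for c in observed}
    issues = []
    for exp in expected:
        act = observed_by_key.get((exp.machine, exp.name))
        if act is None:
            issues.append((exp.machine, exp.name, "presence", "present", "missing"))
            continue
        for field_name in ("product_version", "os_version", "ip_address"):
            want, got = getattr(exp, field_name), getattr(act, field_name)
            if want != got:
                issues.append((exp.machine, exp.name, field_name, want, got))
    return issues


if __name__ == "__main__":
    # Placeholder values for illustration only.
    documented = [Component("AppServer", "Mars", "x.y", "x.y", "a.b.c.d")]
    found = [Component("AppServer", "Mars", "x.z", "x.y", "a.b.c.d")]
    for issue in find_discrepancies(documented, found):
        print(issue)
```

Keeping the record keyed by machine and component name mirrors how the diagram itself is labeled, so the data and the picture can be reviewed side by side.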



Best Practice: Diagnostic Data Collection Plan


Decide in advance which Diagnostic elements should be captured by
default on the first occurrence of any problem


This is a trade-off: we want to capture as much as possible,
without too much impact to the normal operation of the system


Some things to consider


All the default WebSphere logs


WebSphere First Failure Data Capture facility (FFDC)


HTTP access logs (default levels?)


Enable core dumps, java dumps, etc.


Consider enabling verbosegc


Consider a minimal level of monitoring of OS resources


Consider a minimal level of monitoring of network health


Consider application logging



Identify active Diagnostic actions to take for the most likely problems


E.g. trigger a javacore/core


E.g. attempt pings, application checks


E.g. collect specific PMI stats
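
An active diagnostic action such as "trigger a javacore" can be scripted ahead of time so that it is ready when a problem occurs. The sketch below assumes a Unix-like system, where kill -3 corresponds to SIGQUIT. The function name, the default count, and the default interval are illustrative assumptions rather than recommended values, and the process ID must be supplied by the caller.

```python
import os
import signal
import time


def request_javacores(pid, count=3, interval_seconds=120,
                      send_signal=os.kill, sleep=time.sleep):
    """Send SIGQUIT (the signal behind `kill -3`) to a JVM process `count` times.

    Several dumps spaced apart make it easier to see which threads are not
    making progress. `send_signal` and `sleep` are injectable so the procedure
    can be rehearsed without touching a real process.
    Returns the number of signals sent.
    """
    sent = 0
    for i in range(count):
        send_signal(pid, signal.SIGQUIT)
        sent += 1
        if i < count - 1:
            sleep(interval_seconds)  # wait before asking for the next dump
    return sent
```

Passing a stand-in for send_signal is one way to practice the collection procedure in a Test environment before it is needed in Production.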


WAS FFDC


WAS PMI


Tivoli tools


Hung Thread Detection






Best Practice: Diagnostic Data Collection Plan


Make arrangements so that all the pre-determined Diagnostic elements
get reliably captured


Clearly identify where they are


Have pre-established procedures to collect them and practice


Synchronize the clocks between all machines to facilitate analysis


Manage the logs and all other Diagnostics artifacts, to avoid
surprises during normal operation


Make sure you have enough disk space available for diagnostic
artifacts



And don't forget the human element:


Who is in charge when there is a Production problem, to execute
the Diagnostic Data Collection plan and take recovery actions


What group(s) need to be informed / coordinated with
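
The point about disk space for diagnostic artifacts can be checked automatically as part of routine monitoring. This is a minimal sketch: the function name and the idea of passing a required byte count are assumptions, and the right threshold depends on the heap size and the dumps enabled at each site.

```python
import shutil


def has_room_for_diagnostics(path, required_bytes):
    """Return True if the filesystem holding `path` has at least
    `required_bytes` free, e.g. enough for the dumps the plan expects."""
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_bytes


if __name__ == "__main__":
    # Illustrative threshold only: 2 GiB in the current directory.
    print(has_room_for_diagnostics(".", 2 * 1024 ** 3))
```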


Collector Tool


IBM Support Assistant


...



Best Practice: Characterize the Problem


Take the time to carefully understand the problem and its context.
Listen and ask questions.


In many cases, this is all it takes to solve simple problems


For complex problems, failure to do so often results in considerable
delays



Crisp and specific description of what happened, esp. error message(s),
observed abnormal behavior, etc.


How would we recognize the same problem if it happens again?


What exactly would we expect to be different once the problem is
solved?


Beware of vague terms like crash, fails, etc.


E.g. a crash is not the same as a hang, is not the same as an exit


Be alert for possibly unrelated symptoms, and the possibility that there
may be several independent problems happening at once



Where exactly did the problem occur


what machine, what server, etc.





Best Practice: Characterize the Problem


When exactly did the problem occur


Did it happen only once, or does it happen repeatedly


If the problem happens repeatedly, characterize the circumstances


Apparently random times? What frequency?


Every time we do X, or every time some other external event
happens?



Consider why this problem occurs here and now, when it has not
occurred before


Is this the first time that we attempt something new (e.g. first product
installation)?


If not, has anything at all changed in the environment (e.g. config
changes in the failing system or any other system that is also present
in the environment)?


Does this problem occur in all similar environments (e.g. multiple
Production systems) or not?







Best Practice: Table of issues and symptoms


List all top-level issues reported by users of the system, and all
low-level symptoms observed during the investigation



Useful to


Organize the investigation: find all symptoms, always identify what to
check on next


Track progress


Make sure nothing is overlooked


Manage complex situations, with many unrelated high-level issues and
many unrelated symptoms



Prioritize and cluster entries to reflect the current state of the investigation


Constantly update and revise as the investigation progresses
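
A table of issues and symptoms does not need special tooling, and a spreadsheet is often enough. As a sketch of the idea in code, the following keeps each entry with a priority and a cluster so that "what to check on next" can always be answered. The Entry fields, the status values, and the helper names are hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """One row of the table (hypothetical layout)."""
    description: str
    kind: str                     # "issue" (reported by users) or "symptom" (observed)
    priority: int                 # 1 = investigate first
    cluster: str = "unassigned"   # group of entries believed to be related
    status: str = "open"          # "open" or "resolved"


def next_to_check(entries):
    """Return the open entry with the highest priority, or None if all are resolved."""
    open_entries = [e for e in entries if e.status == "open"]
    return min(open_entries, key=lambda e: e.priority, default=None)


def by_cluster(entries):
    """Group entry descriptions by cluster to show which ones are believed related."""
    groups = {}
    for e in entries:
        groups.setdefault(e.cluster, []).append(e.description)
    return groups
```

Revising priority and cluster as new evidence arrives corresponds to the "constantly update and revise" step.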






Example: table of issues and symptoms



Best Practice: High-level timeline




List all major events when they occur


Incidents (new occurrences of the same or other problem)


Data collection actions and experiments


Relief and Remedy actions


Other maintenance



Useful to


Help identify patterns, cause and effect, etc.


Avoid confusion over time about what really happened


Facilitate communication between all parties (Customer, IBM Support,
etc.)


Keep track of all available diagnostics artifacts and what they
correspond to, along with precise timestamps to look for in logs


Reinforce Change Control policies


if too many things change at once, it may never be possible to
understand the problem




Example: High-level timeline

Timestamp        Machine       Event / Action                            Artifacts
11/24 6:05 pm    Venus         AppServer unresponsive
11/24 6:16 pm    Venus         AppServer crash -> restart
11/25 5:31 pm    Mars          AppServer unresponsive (no crash)
11/25 5:32 pm    Mars          Attempt javacore (kill -3) -> no result
11/25 8:30 pm    Mars, Venus   Fix OS directory privileges
11/25 9:00 pm    Mars          Enable CM trace (no restart)
11/26 12:00 am   Mars          Increase Conn Pool size to 20
11/26 12:00 am   Mars          Scheduled restart
11/26 10:03 am   Mars          AppServer unresponsive
11/26 10:04 am   Mars          Collect javacore (kill -3)                Javacore1.txt
11/26 10:06 am   Mars          Collect javacore (kill -3)                Javacore2.txt
11/26 10:10 am   Mars          AppServer crash -> restart
11/26 10:10 am   Mars          Collect CM trace                          Trace1.log
11/26 1:15 pm    Venus         AppServer unresponsive
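
The same timeline can be kept as structured records, which makes it easy to answer questions such as "which artifacts cover this incident?". The sketch below is an assumption-laden illustration: the TimelineEvent fields and the artifacts_between helper are hypothetical, and the year 2005 in the usage lines is assumed because the example table gives only month and day.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TimelineEvent:
    """One row of the high-level timeline (hypothetical layout)."""
    when: datetime
    machine: str
    action: str
    artifact: str = ""   # file collected by this action, if any


def artifacts_between(events, start, end):
    """List the artifacts collected in [start, end], in chronological order."""
    ordered = sorted(events, key=lambda e: e.when)
    return [e.artifact for e in ordered if e.artifact and start <= e.when <= end]


if __name__ == "__main__":
    # Rows taken from the example table; the year is an assumption.
    events = [
        TimelineEvent(datetime(2005, 11, 26, 10, 10), "Mars", "Collect CM trace", "Trace1.log"),
        TimelineEvent(datetime(2005, 11, 26, 10, 4), "Mars", "Collect javacore (kill -3)", "Javacore1.txt"),
        TimelineEvent(datetime(2005, 11, 26, 10, 6), "Mars", "Collect javacore (kill -3)", "Javacore2.txt"),
        TimelineEvent(datetime(2005, 11, 26, 10, 3), "Mars", "AppServer unresponsive"),
    ]
    print(artifacts_between(events, datetime(2005, 11, 26, 10, 0), datetime(2005, 11, 26, 10, 7)))
```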

















Consider Relief Options


If the problem occurs in Production or other critical-availability system,
we may need to provide relief long before we fully resolve the problem



Consider relief timeline


How long do we have to investigate, before we need to provide some
relief?


Can we keep the system in its failed state, in case there is some
additional information that we want to collect later?



Identify the relief actions



Often, restart one or more components


But beware of chain reaction effects of stopping/starting some
components in a live system


Sometimes, change the usage characteristics of the application


Reduce the load


Avoid some “dangerous” operations



Re-evaluate relief options at every step through the investigation



Initial Investigation


The goal at this stage is to look for already known problems, and to get a
starting point for further, deeper investigation if necessary



Make an inventory of all pertinent anomalies (potential symptoms)


Errors, warnings, exceptions, out-of-range stats, any other unusual
behavior (scan (grep) through available logs)


Ideally should check everything, but this research might be guided by
understanding the flows in the topology diagram


Integrate into the table of Issues and Symptoms


Assess and prioritize symptoms


Not a perfect science: it is hard to tell which symptom is a cause, and
which symptom is a consequence of the original problem


Use the topology diagram and baselines prepared earlier



Build a low-level, detailed timeline of events for one incident, to clarify what
may have caused what


Include pertinent but normal events, in addition to anomalies/symptoms


Keep track of multiple components in parallel, attempt to correlate



Research the top symptoms in the Knowledge Bases
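
The log scan mentioned above (grep for errors and warnings) can be sketched in a few lines. This example assumes that message identifiers look like CONM6026W, that is, four or five capital letters, four digits, and a severity letter. That pattern, the function name, and the choice to count only W and E severities are assumptions for illustration. The resulting counts give candidate keywords for the knowledge base search.

```python
import re
from collections import Counter

# Assumed shape of a message ID: 4-5 capital letters, 4 digits, severity W or E.
MESSAGE_ID = re.compile(r"\b([A-Z]{4,5}\d{4})([WE])\b")


def count_message_ids(lines):
    """Count warning and error message IDs found in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        for prefix, severity in MESSAGE_ID.findall(line):
            counts[prefix + severity] += 1
    return counts


if __name__ == "__main__":
    # Illustrative lines only, not real log output.
    sample = [
        "10:02:15:859 CONM6026W: (message text)",
        "10:02:16:001 WSVR0001I: (message text)",
        "10:02:17:412 CONM6026W: (message text)",
    ]
    for message_id, n in count_message_ids(sample).most_common():
        print(message_id, n)
```

Comparing these counts against a baseline taken during normal operation helps separate spurious entries from real symptoms.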




Log/Trace Analyzer



Example: Low-level timeline

Columns: HTTPD Count | Mars Thread 1 | Mars Thread 2 | Mars Thread 3

12:00:00:***  RESTART --------------------------------
09:30:00:***  10
09:55:00:***  10
09:59:10:564  > getConnection
09:59:10:943  < getConnection
09:59:15:134  > getConnection
09:59:15:201  > getConnection
09:59:15:456  > getConnection
10:00:00:***  125
10:02:15:859  CONM6026W
10:05:00:***  192
10:10:00:***  193
10:10:01:034  > getConnection
10:10:01:750  GC allocation failure (2Meg) -----------
10:10:01:900  CRASH ----------------------------------
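
A low-level timeline like this one is usually assembled from several sources (for example HTTP server statistics and application server logs), each already in time order. The sketch below interleaves such sources into one sequence. It assumes zero-padded 24-hour timestamps of the form HH:MM:SS:mmm so that string comparison matches time order, and that the machines' clocks are synchronized. The tuple layout and function name are hypothetical.

```python
import heapq


def merge_timelines(*sources):
    """Interleave several event lists into one chronological list.

    Each source is an iterable of (timestamp, component, event) tuples that is
    already sorted by timestamp.
    """
    return list(heapq.merge(*sources, key=lambda entry: entry[0]))


if __name__ == "__main__":
    # Values adapted from the example above; the ":000" milliseconds on the
    # HTTPD samples are an assumption, since the original shows "***".
    httpd = [
        ("09:55:00:000", "HTTPD", "count 10"),
        ("10:00:00:000", "HTTPD", "count 125"),
        ("10:05:00:000", "HTTPD", "count 192"),
    ]
    mars = [
        ("09:59:15:134", "Mars", "> getConnection"),
        ("10:02:15:859", "Mars", "CONM6026W"),
        ("10:10:01:900", "Mars", "CRASH"),
    ]
    for timestamp, component, event in merge_timelines(httpd, mars):
        print(timestamp, component, event)
```

Seeing the request count climb between the getConnection calls and the crash is the kind of correlation the merged view is meant to expose.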



Knowledge Base Search


Pick keywords to research the problem in available knowledge sources:


Error codes, Exception names, etc. from the most promising
symptoms


If no explicit error: High-level problem description


See list of problem categories from the Support web site



Research in IBM Support Assistant or on the Support web page


Will automatically search product InfoCenter for error codes, etc.

IBM Support Assistant



Knowledge Base on the WAS Support Web Site


All the PD data is organized in a set of predefined problem categories or
“Components”


Currently 44 categories for WAS: “100% CPU Usage”, “Admin Console”,
“Classloader”, “Crash”, etc.


Also used to manage most Support processes and other PD assets



Library of Technotes and other articles


Maintained by a specialized Knowledge Engineering team


Includes known problems, APARs, common questions, troubleshooting
instructions for many specific problems, PD tools, etc.



One MustGather document for each problem category (few exceptions)


Provides instructions on how to start troubleshooting that problem, and
what information to provide to Support if opening a PMR



Extensive search facility




Searching with IBM Support Assistant


Search on the exception or error message you're seeing...


Search on APAR or fixpack


Search for PD tool


Search for MustGather







Searching on the WAS Support Page


Search on the exception or error message you're seeing...


Search on APAR or fixpack


Search for PD tool


Search for MustGather


http://www.ibm.com/software/webservers/appserv/was/support/



In-depth Investigation


The initial research did not find a solution, so we’re going to need to
undertake a more in-depth investigation


With a specific set of methods and tools for each specific problem



Several sources of information to start from:


Troubleshooting section in the InfoCenter


Problem Determination Redbook


MustGather documents on the Support web site


Define initial set of diagnostics to focus on (either for IBM
Support or for internal investigation by Customer)


From there, search for other Technotes with troubleshooting
instructions, tools, etc.


The Troubleshooting Guide on the Support web site also
provides a good starting point for finding relevant documents


...



WebSphere Infocenter



PD Redbook

Redbook SG24-6798-00

www.redbooks.ibm.com



Product Components and MustGather



Problem Categories


100% CPU Usage
Administrative Console (all non-scripting)
Administrative Scripting Tools (for example, wsadmin or ANT)
Application Client
Classloader
Crash
DB Connections / Connection Pooling
Deploy (for example, AAT or ANT or EAR/WAR/JAR)
Double Byte Character Set (DBCS)
Dynacache
Edge Component
EJB Container
EJBDeploy (WebSphere Studio Application Developer)
Embedded/Express
Enterprise Edition (EE)
General
Hangs/Performance Degradation
HTTP Transport
IBM HTTP Server
Install
Install SMP/E
Java 2 Connectivity (J2C)
Java Messaging Service (JMS)
Java Management Extensions (JMX) or JMX client API
Java Security (JSSE/JCE)
Java Transaction Service (JTS)
JDK™
JNDI/Naming
JSP
Migration
Object Level Trace/ Distributed Debugger (OLT/DD)
Object Request Broker (ORB)
PD tools (for example: Log Analyzer)
Plugin
Plugin (remote) Install
PMI/Performance Tools
Samples
Security
Sessions and Session Management
System Management/Repository
Trial
Web Services (for example: SOAP or UDDI, or WSGW or WSIF)
WebSphere Studio Application Developer Integration Edition
Workload Management (WLM)

One MustGather for each problem category (and possibly each version/platform)



MustGather Example: Crash on Linux


Operating system specific MustGather documents, when appropriate


Product components include symptoms like “Crash” that span product
components


Updated frequently to help resolve problems more quickly



Troubleshooting guide on the Support Web Site



In-depth Investigation (continued)


The Knowledge Base provides the starting point for the investigation
process. What happens next depends on what we find



Two main investigation techniques


Analysis Approach


Exploit available diagnostic data in increasing detail until
solution is found


Isolation Approach


Isolate and simplify the problem to the smallest, simplest
possible sub-unit of the original system, to reduce the amount
of data to analyze



In practice, these two approaches are often pursued in parallel, and feed
into one another



Consider if we might attempt to reproduce in a Test environment




Analysis Approach


Exploit available diagnostic data in increasing detail until solution is
found



Heavily driven by knowledge of the meaning of each diagnostic, and
of the internal operation of each component of the system



Many specialized analysis tools available



In practice, we often need to iterate to collect more specialized data


E.g. enable special tracing, collect heap dump, etc.



The low-level, detailed timeline of events created earlier is often a
good tool to help focus that analysis


Keep adding more detail in the timeline


Thread Analyzer


VerboseGC analyzer


Classloader viewer


Heap analyzer


Dump analyzer






Isolation Approach


Isolate and simplify the problem to the smallest, simplest possible
sub-unit of the original system, to reduce the amount of data to analyze



Heavily driven by knowledge of the structure of the system and its flows,
but can often treat each individual component as a black box


Use the topology diagram



Monitor flows at various control points in the topology



Inject “pings” at various control points to observe results



Eliminate one component at a time from the topology, rerun test to see if
problem re-occurs
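
Injecting "pings" at control points can be as simple as attempting a TCP connection to each component along a flow, in order, and noting the first one that does not answer. The sketch below treats every component as a black box. The probe and first_unreachable names, the tuple layout, and the host names and port numbers in the usage lines are placeholders and assumptions, not values from any particular installation.

```python
import socket


def probe(host, port, timeout_seconds=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        return False


def first_unreachable(control_points, probe_fn=probe):
    """Walk the control points in flow order and return the name of the
    first one that does not answer, or None if all of them respond.

    `control_points` is a list of (name, host, port) tuples.
    """
    for name, host, port in control_points:
        if not probe_fn(host, port):
            return name
    return None


if __name__ == "__main__":
    # Placeholder hosts and ports following the example topology.
    flow = [("WebServer on Web1", "web1", 80),
            ("AppServer on Mars", "mars", 9080),
            ("DB", "db", 50000)]
    print(first_unreachable(flow))
```

A successful connection only shows that the port is listening; an application-level check is still needed to confirm that the component is doing useful work.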



Working with IBM Support


Support is organized around the Problem Categories



Help keep the investigation on track:


Be clear about the problem statement, the environment, the
actions taken


Check and re-check for misunderstandings



Always try to provide full MustGather info, not partial



Provide context information along with any diagnostics data


How was this collected?


What remedy actions were taken?


What exactly happened and when?


Please don’t just send a long log and say “The incident is in there somewhere”



IBM Support Assistant


...



How Does Support Work?



Maintenance Strategy: A word about Fixpacks…


IBM Support often recommends upgrading to the latest available fixpack
during the course of an investigation



There is often a lot of concern about the risk of destabilizing the system by
installing a fixpack



It is a difficult trade-off:


Though we try very hard, the risk can never be completely eliminated


But on the other hand:


There is also considerable risk in using too many individual Fixes…
no one can possibly test all the possible interactions between all
individual Fixes


Because of the complexity of the system and the difficulty of
reproducing problems and gathering diagnostics, it is not always
practical to determine exactly which APAR resolved a particular
situation



Overall, the best bet is to have a strong maintenance strategy


Regular (small) updates


Reasonable re-testing before each update



WebSphere Resources: IBM Education Assistant



WebSphere Resources


WebSphere 6.0 Infocenter

Front page: http://publib.boulder.ibm.com/infocenter/ws60help/index.jsp

Troubleshooting section: http://publib.boulder.ibm.com/infocenter/ws60help/topic/com.ibm.websphere.nd.doc/info/ae/ae/welc6toptroubleshooting.html


Central WebSphere Support Site (tools, ASTK, Technotes, etc.; use search function)

Front page: http://www.ibm.com/software/webservers/appserv/was/support/

Troubleshooting guide: http://www.ibm.com/support/docview.wss?rs=180&context=SSEQTP&uid=swg27005324

MustGather documents for major problem categories: http://www.ibm.com/support/docview.wss?rs=180&context=SSEQTP&uid=swg21145599

Steps to getting Support: http://www.ibm.com/developerworks/websphere/support/appserver_support.html



WebSphere Resources


IBM Support Assistant

(integrated tools for common support tasks)


http://www-1.ibm.com/support/docview.wss?rs=180&uid=swg21192593


IBM Education Assistant

(education modules on many common tasks and issues)


http://www.ibm.com/software/info/education/assistant/


developerWorks

(PD tools, articles, best practices, etc.)


http://www.ibm.com/developerworks/websphere/


AlphaWorks

(contains a variety of WebSphere PD tools)


http://www.alphaworks.ibm.com/


Redbooks

(several books include WebSphere PD material)


http://www.redbooks.ibm.com/


“IBM WebSphere Application Server for Distributed Platforms and z/OS: An Administrator's Guide”

Ann Black et al.

IBM Press, Prentice Hall

ISBN 0-13-185587-5



Questions?



BACKUP CHARTS