Server Resiliency - Debugging Java deployments

Arya MirSoftware and s/w Development

Oct 6, 2011 (5 years and 10 months ago)


Rohit Kelapure IBM Advisory Software Engineer 29 September 2011

© 2011 IBM Corporation


Introduction to Speaker


Rohit Kelapure


Responsible for the resiliency of WebSphere Application Server

Team lead and architect of caching and data replication features in WebSphere

Called upon to hose down fires and resolve critical situations

Customer advocate for large banks

Active blogger: All Things WebSphere

Apache OpenWebBeans committer

Java EE, OSGi and Spring developer

kelapure@us.ibm.com

kelapure@gmail.com

LinkedIn

http://twitter.com/#!/rkela



Important Disclaimers


THE INFORMATION CONTAINED IN THIS PRESENTATION IS PROVIDED FOR
INFORMATIONAL PURPOSES ONLY.


WHILST EFFORTS WERE MADE TO VERIFY THE COMPLETENESS AND ACCURACY OF THE
INFORMATION CONTAINED IN THIS PRESENTATION, IT IS PROVIDED “AS IS”, WITHOUT
WARRANTY OF ANY KIND, EXPRESS OR IMPLIED.


ALL PERFORMANCE DATA INCLUDED IN THIS PRESENTATION HAVE BEEN GATHERED IN
A CONTROLLED ENVIRONMENT. YOUR OWN TEST RESULTS MAY VARY BASED ON
HARDWARE, SOFTWARE OR INFRASTRUCTURE DIFFERENCES.


ALL DATA INCLUDED IN THIS PRESENTATION ARE MEANT TO BE USED ONLY AS A GUIDE.


IN ADDITION, THE INFORMATION CONTAINED IN THIS PRESENTATION IS BASED ON IBM’S
CURRENT PRODUCT PLANS AND STRATEGY, WHICH ARE SUBJECT TO CHANGE BY IBM,
WITHOUT NOTICE.


IBM AND ITS AFFILIATED COMPANIES SHALL NOT BE RESPONSIBLE FOR ANY DAMAGES
ARISING OUT OF THE USE OF, OR OTHERWISE RELATED TO, THIS PRESENTATION OR
ANY OTHER DOCUMENTATION.


NOTHING CONTAINED IN THIS PRESENTATION IS INTENDED TO, OR SHALL HAVE THE
EFFECT OF:


- CREATING ANY WARRANTY OR REPRESENTATION FROM IBM, ITS AFFILIATED
COMPANIES OR ITS OR THEIR SUPPLIERS AND/OR LICENSORS



Copyright and Trademarks



© IBM Corporation 2011. All Rights Reserved.



IBM, the IBM logo, and ibm.com are trademarks or registered trademarks of
International Business Machines Corp., and registered in many jurisdictions
worldwide.



Other product and service names might be trademarks of IBM or other
companies.



A current list of IBM trademarks is available on the Web: see the IBM “Copyright
and trademark information” page at www.ibm.com/legal/copytrade.shtml



Outline


Server Resiliency Fundamentals


Common JVM Problems


Protecting your JVM


Hung thread detection, Thread Interruption, Thread hang recovery


Memory leak detection, protection & action


Scenario based problem resolution


Tooling


Eclipse Memory Analyzer


Thread Dump Analyzer


Garbage Collection and Memory Visualizer






Resiliency

Property of a material that can absorb external energy when it is
forced to deform elastically, and then be able to recover to its original
form and release the energy





Server Resiliency Concepts


1. Redundancy (data and processing)

Create replicas

High cost of initialization and reconfiguration

Redundant elements need to be synchronized from time to time

2. Partition

Splitting the data into smaller pieces and storing them in a distributed fashion

Allows for parallelization and divide and conquer

Partial failure isolation

3. Virtualization

Functionality of processing and data elements virtualized as a service

Loose coupling between the system and consumed services

Integration by enforcing explicit boundaries and schema-based interfaces

4. Decentralized Control

High communication overhead of centralized control for a system with heavy redundancy

Sometimes trapped in locally optimized solutions

Fixing issues requires shutting down the entire system, e.g. AWS outage

5. Explicit Messaging

Distributed shared memory: better consistency

Message passing: loose coupling

6. Uniform Interface

E.g. World Wide Web

Better scalability, reusability and reliability

Data, processes and other forms of computation identified by one mechanism

Semantics of operations in messages for operating on the data are unified

7. Self Management

Composed of self-managing components

Managed element, managers, sensors and effectors

E.g. TCP/IP congestion control



Most common JVM Problem Scenarios

Functional Problems


Unexpected Exceptions, Compatibility

OOM Errors, Memory Leaks


Memory leaks - Java heap, native heap, classloaders


Hangs


Synchronized resources, GC Pause times, CPU Contention

Crash


JVM errors, JIT errors, JNI errors

High CPU


Spin loops, liveness problems, livelock


Thread Hangs


Threading and synchronization issues are among the top 5
application performance challenges


Being too aggressive with shared resources causes data inconsistencies

Being too conservative leads to too much contention between threads

Application unresponsiveness

Adding users, threads or CPUs causes the application to slow down (less throughput,
worse response times)


High lock acquire times & contention


Race conditions, deadlock, I/O under lock


Tooling is needed to rescue applications and the JVM from itself


Identify these conditions


If possible remedy them in the short term for server resiliency


JVM Hung Thread Detection




Every X seconds an alarm thread wakes up and iterates over all managed thread pools.

It subtracts the "start time" of each thread from the current time and passes the result to a monitor.

A detection policy then determines, based on the available data, whether the thread is hung.

The stack trace of the hung thread is printed.
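The detection loop described above can be sketched roughly as follows. This is only an illustration, not the WebSphere implementation: the class HungThreadMonitor, its method names and the fixed-threshold detection policy are all assumptions made for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical hung-thread monitor; names and policy are illustrative assumptions. */
final class HungThreadMonitor {
    // Start time (in ms) of the task each managed thread is currently running.
    private final Map<Thread, Long> startTimes = new ConcurrentHashMap<>();
    private final long thresholdMillis;

    HungThreadMonitor(long thresholdMillis) {
        this.thresholdMillis = thresholdMillis;
    }

    void taskStarted(Thread t, long nowMillis) { startTimes.put(t, nowMillis); }

    void taskFinished(Thread t) { startTimes.remove(t); }

    /** Detection policy (assumed): a task running longer than the threshold is reported as hung. */
    boolean isHung(Thread t, long nowMillis) {
        Long start = startTimes.get(t);
        return start != null && nowMillis - start > thresholdMillis;
    }

    /** One pass of the alarm thread: prints the stack of each suspect and returns how many were found. */
    int scan(long nowMillis) {
        int hung = 0;
        for (Thread t : startTimes.keySet()) {
            if (isHung(t, nowMillis)) {
                hung++;
                System.err.println("Possibly hung thread: " + t.getName());
                for (StackTraceElement frame : t.getStackTrace()) {
                    System.err.println("    at " + frame);
                }
            }
        }
        return hung;
    }
}
```

In a real server the scan would typically be driven by a timer, for example a ScheduledExecutorService calling scan(System.currentTimeMillis()) every X seconds.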


Thread Interruption 101


Thread.stop() stops a thread by throwing ThreadDeath (an Error) in it. Deprecated.

Thread.interrupt(): cooperative mechanism for a thread to signal another thread that it should, at its convenience and if it feels like it, stop what it is doing and do something else.

Interruption is usually the most sensible way to implement task cancellation.

Because each thread has its own interruption policy, you should not interrupt a thread unless you know what interruption means to that thread.

Any method sensing interruption should

Assume the current task is cancelled and perform some task-specific cleanup

Exit as quickly and cleanly as possible, ensuring that callers are aware of the cancellation

Either propagate the exception, making your method an interruptible blocking method that throws InterruptedException

Or restore the interruption status so that code higher up on the call stack can deal with it: Thread.currentThread().interrupt()


Only code that implements a thread's interruption policy may swallow an
interruption request.
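The two options for a method that senses interruption can be sketched as follows. The class and method names (InterruptHandling, takePropagating, takeRestoring) are made up for this illustration; BlockingQueue.take() is used only as a convenient interruptible blocking call.

```java
import java.util.concurrent.BlockingQueue;

/** Illustrative helper; the class and method names are assumptions for this sketch. */
final class InterruptHandling {

    /** Option 1: propagate. The method itself becomes an interruptible blocking method. */
    static String takePropagating(BlockingQueue<String> queue) throws InterruptedException {
        return queue.take();
    }

    /** Option 2: the signature cannot throw, so restore the interrupt status for the callers. */
    static String takeRestoring(BlockingQueue<String> queue) {
        try {
            return queue.take();
        } catch (InterruptedException e) {
            // The flag was cleared when the exception was thrown; set it again
            // so that code higher up on the call stack can see the request.
            Thread.currentThread().interrupt();
            return null; // treat the task as cancelled
        }
    }
}
```

Swallowing the exception without either step would hide the cancellation request from the code that owns the thread.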



Interrupting threads


Cancelling Threads


Dealing with Non-interruptible Blocking


Many blocking library methods respond to interruption by returning early and throwing InterruptedException

This makes it easier to build tasks that are responsive to cancellation

Lock.lockInterruptibly, Thread.sleep, Object.wait, Thread.join

Not all blocking methods or blocking mechanisms are responsive to interruption

If a thread is blocked performing synchronous socket I/O, interruption has no effect other than setting the thread's interrupted status

If a thread is blocked waiting for an intrinsic lock, there is nothing you can do to stop it, short of ensuring that it eventually acquires the lock
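One way to keep lock waits cancellable is to use an explicit Lock with lockInterruptibly() instead of a synchronized block. The sketch below is a minimal demonstration under that assumption; the class name InterruptibleLocking and the 5 second join timeout are arbitrary choices for the example.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative demo; the class name and the timeout are assumptions. */
final class InterruptibleLocking {

    /** Returns true if a thread waiting in lockInterruptibly() was cancelled by an interrupt. */
    static boolean waiterIsCancellable() {
        ReentrantLock lock = new ReentrantLock();
        AtomicBoolean cancelled = new AtomicBoolean(false);
        lock.lock(); // the calling thread holds the lock, so the waiter cannot acquire it
        try {
            Thread waiter = new Thread(() -> {
                try {
                    lock.lockInterruptibly(); // responds to interruption, unlike synchronized
                    lock.unlock();
                } catch (InterruptedException e) {
                    cancelled.set(true);
                }
            }, "waiter");
            waiter.start();
            waiter.interrupt();
            waiter.join(5000);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            lock.unlock();
        }
        return cancelled.get();
    }
}
```

Had the waiter used a synchronized block on a monitor held by the caller, the interrupt would only have set its interrupted status and the thread would have stayed blocked.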


Thread Hang Recovery


Technique


Application-specific hacks for thread hang recovery

Byte code instrumentation

Transform the concrete subclasses of the abstract classes InputStream and OutputStream to make the socket I/O operations interruptible.

Transform an application class so that every loop can be interrupted by invoking Interrupter.interrupt(Thread, boolean)

Transform a monitorenter instruction and a monitorexit instruction so that the wait at entering into a monitor is interruptible

http://www.ibm.com/developerworks/websphere/downloads/hungthread.html




Memory Leaks


Leaks come in various types, such as


Memory leaks


Thread and ThreadLocal leaks


ClassLoader leaks


System resource leaks



Connection leaks


Customers want to increase application uptime without cycling the server.


Frequent application restarts without stopping the server.


Frequent redeployments of the application result in OOM errors


What do we have today?

Offline post-mortem analysis of a JVM heap with tools like JRockit Mission Control and MAT. IEMA are the IBM Extensions for Memory Analyzer.

Runtime memory leak detection using JVMTI and PMI (Runtime Performance Advisor)

We don't have application-level, i.e. top-down, memory leak detection and protection

Leak detection by looking at suspect patterns in application code




ClassLoader Leaks 101


A class is uniquely identified by


Its name + the class loader that loaded it

A class with the same name can be loaded multiple times in a single JVM, each time in a different class loader

Web containers use this for isolating web applications

Each web application gets its own class loader

Reference Chain

An object retains a reference to the class it is an instance of

A class retains a reference to the class loader that loaded it

The class loader retains a reference to every class it loaded

Retaining a reference to a single object from a web application pins every class loaded by the web application

These references often remain after a web application reload. With each reload, more classes get pinned, ultimately leading to an OOM





Tomcat pioneered approach - Leak Prevention


JRE triggered leak


Singleton / static initializer


Can be a Thread


Something that won’t get garbage collected


Retains a reference to the context class loader when loaded


If web application code triggers the initialization


The context class loader will be web application class loader


A reference is created to the web application class loader


This reference is never garbage collected


Pins the class loader (and hence all the classes it loaded) in memory


Prevention with a DeployedObjectListener

Call the various parts of the Java API that are known to retain a reference to the current context class loader

Initialize these singletons when the Application Server's class loader is the context class loader
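The prevention idea can be sketched as below: switch the thread context class loader to a server-level loader, touch the JRE singleton, then switch back. The class JreLeakPrevention and its methods are hypothetical names for this example, the system class loader stands in for the application server's loader, and DriverManager.getDrivers() is just one illustrative choice of a JRE call to pre-initialize.

```java
import java.sql.DriverManager;

/** Hypothetical sketch; not the listener shipped by any particular server. */
final class JreLeakPrevention {

    /** Runs init with the given loader as context class loader, then restores the previous one. */
    static void initializeWith(ClassLoader serverLoader, Runnable init) {
        Thread current = Thread.currentThread();
        ClassLoader previous = current.getContextClassLoader();
        current.setContextClassLoader(serverLoader);
        try {
            init.run();
        } finally {
            current.setContextClassLoader(previous);
        }
    }

    /** Pre-initializes a JRE facility before any web application can trigger it. */
    static void preloadJreSingletons() {
        // Assumption: the system class loader plays the role of the server's class loader here.
        initializeWith(ClassLoader.getSystemClassLoader(), () -> {
            DriverManager.getDrivers(); // forces DriverManager initialization under the server loader
        });
    }
}
```

Any reference the JRE keeps to the context class loader then points at the server's loader, which is never discarded on an application reload.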
















Leak Detection


Application Triggered Leaks


ClassLoader


Threads


ThreadLocal


JDBC Drivers


Non Application


RMI Targets


Resource Bundle


Static final references


IntrospectionUtils


Loggers


Prevention


Code executes when a web application is stopped, undeployed or reloaded

Check, via a combination of standard API calls and some reflection tricks, for known causes of memory leaks


Memory leak detection console


What is wrong with my application …?


Why does my application run slow every time I do X ?


Why does my application have erratic response times ?


Why am I getting Out of Memory Errors ?


What is my application's memory footprint ?


Which parts of my application are CPU intensive ?


How did my JVM vanish without a trace ?


Why is my application unresponsive ?


What monitoring do I put in place for my app. ?



What is your JVM up to ?

Windows-style task manager for displaying thread status and allowing their recovery and interruption

Leverage the ThreadMXBean API in the JDK to display thread information

https://github.com/kelapure/dynacache/blob/master/scripts/AllThreads.jsp

https://github.com/kelapure/dynacache/blob/master/scripts/ViewThread.jsp
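As a rough idea of what such a thread view builds on, the sketch below lists every live thread through ThreadMXBean. It is not the code of the JSPs linked above; the class ThreadLister and its output format are assumptions for the example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

/** Illustrative thread lister; class name and row format are assumptions. */
final class ThreadLister {

    /** Returns one "id name state" row per live thread in this JVM. */
    static List<String> describeThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> rows = new ArrayList<>();
        // false, false: skip locked monitors and synchronizers to keep the snapshot cheap
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            rows.add(info.getThreadId() + " " + info.getThreadName() + " " + info.getThreadState());
        }
        return rows;
    }

    public static void main(String[] args) {
        describeThreads().forEach(System.out::println);
    }
}
```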




Application runs slow when I do XXX ?


Understand impact of activity on components


Look at the thread & method profiles


IBM Java Health Center


Visual VM


JRockit Mission Control

JVM method and dump trace - pinpoint performance problems.

Shows entry and exit times of any Java method

Method trace to file for all methods in tests.mytest.package

Allows taking a javadump, heapdump, etc. when a method is hit

Dump a javacore when method testInnerMethod in an inner class TestInnerClass of a class TestClass is called

Use BTrace, -Xtrace

-Xdump to trigger dumps on a range of events

gpf, user, abort, fullgc, slow, allocation, thrstop, throw …

Stack traces, tool launching


Application has erratic response times ?


Verbose GC should be enabled by default

<2% impact on performance

VisualGC, GCMV and PMAT: visualize GC output


In use space after GC


Positive gradient over time indicates
memory leak


Increased load (use for capacity plan)


Memory leak (take HDs for PD.)



Choose the right GC policy


Optimized for “batch” type applications,
consistent allocation profile


Tight responsiveness criteria, allocations of
large objects


High rates of object “burn”, large # of
transitional objects


12, 16 core SMP systems with allocation
contention (AIX only)


GC overhead > 10%: wrong policy or more tuning needed


Enable compressed references for 64 bit JVM



Out Of Memory Errors ?


JVM Heap sized incorrectly


GC adapts heap size to keep occupancy between 40% and 70%

Determine heap occupancy of the application under load

-Xmx = 43% larger than the maximum occupancy of the application

For 700 MB occupancy, a 1000 MB maximum heap is required (700 + 43% of 700)

Analyze heapdumps and system dumps with tools like Eclipse Memory Analyzer

Lack of Java heap or native heap

Eclipse Memory Analyzer and IBM extensions

Finding which methods allocated large objects

Prints a stack trace for all objects above 1K

Enable Java heap and native heap monitoring

JMX and metrics output by the JVM

Classloader exhaustion
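The sizing rule and the JMX monitoring point above can be combined in a small sketch. The 43% figure follows from the 70% upper occupancy target, since 1 / 0.70 is roughly 1.43. The class HeapSizing and its method names are assumptions; note also that the used-heap figure read over JMX is an instantaneous value, not the post-GC occupancy under load that the rule really asks for.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/** Illustrative sizing helper; names are assumptions for this sketch. */
final class HeapSizing {
    // Upper bound of the occupancy range the GC tries to maintain.
    static final int TARGET_OCCUPANCY_PERCENT = 70;

    /** Smallest max heap (MB) that keeps the given peak occupancy at or below 70%. */
    static long recommendedXmxMb(long peakOccupancyMb) {
        // Integer ceiling of peakOccupancyMb / 0.70
        return (peakOccupancyMb * 100 + TARGET_OCCUPANCY_PERCENT - 1) / TARGET_OCCUPANCY_PERCENT;
    }

    /** Heap currently in use (MB), as reported by the platform MemoryMXBean. */
    static long usedHeapMb() {
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
        return heap.getUsed() / (1024 * 1024);
    }

    public static void main(String[] args) {
        long used = usedHeapMb();
        System.out.println("used=" + used + " MB, suggested -Xmx" + recommendedXmxMb(used) + "m");
    }
}
```

For the 700 MB example, recommendedXmxMb(700) yields 1000.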


Application's memory footprint ?


HPROF


profiler shipped with JDK


uses JVMTI


Analysis of memory usage: -Xrunhprof:heap=all

Performance Inspector tools - JPROF Java Profiling Agent

Capture state of the Java heap, later processed by HDUMP

Group a system dump by classloader

Since each app has its own classloader, you can get accurate information on how much heap each application is taking up

Use MAT to investigate heapdumps and system dumps

Find large clumps, inspect those objects. What retains them ?

Why is this object not being garbage collected ?

List Objects > incoming refs, Path to GC roots, Immediate dominators

Limit analysis to a single application in a JEE environment - Dominator tree grouped by Class Loader

Set of objects that can be reclaimed if we could delete X - Retained Size Graphs

Traditional memory hogs like HTTPSession, Cache - Use Object Query Language (OQL)


Using Javacores for Troubleshooting

Javacores are often the most critical piece of information to resolve a hang, high CPU, crash and sometimes memory problems

A Javacore is a text file that contains a lot of useful information

The date, time, Java™ version, full command path and arguments

All the threads in the JVM, including thread state, priority, thread ID, name

Thread call stacks

Javacores can be generated automatically or on demand

Automatically when an OutOfMemoryError is thrown

On demand with “kill -3 <pid>”

Message to the SystemOut when a javacore is generated

"WebContainer : 537" (TID:0x088C7200, sys_thread_t:0x09C19F00, state:CW, native ID:0x000070E8) prio=5
at java/net/SocketInputStream.socketRead0(Native Method)
at java/net/SocketInputStream.read(SocketInputStream.java:155)
at oracle/net/ns/Packet.receive(Bytecode PC:31)
at oracle/net/ns/DataPacket.receive(Bytecode PC:1)
at oracle/net/ns/NetInputStream.read(Bytecode PC:33)
at oracle/jdbc/driver/T4CMAREngine.unmarshalUB1(T4CMAREngine.java:1123)
at oracle/jdbc/driver/T4C8Oall.receive(T4C8Oall.java:480)
at oracle/jdbc/driver/T4CPreparedStatement.executeForDescribe(T4CPreparedStatement.java:813)
at oracle/jdbc/driver/OracleStatement.doExecuteWithTimeout(OracleStatement.java:1154)
at oracle/jdbc/driver/OraclePreparedStatement.executeQuery(OraclePreparedStatement.java:3415)
at com/ibm/commerce/user/objects/EJSJDBCPersisterCMPDemographicsBean_2bcaa7a2.load()
at com/ibm/ejs/container/ContainerManagedBeanO.load(ContainerManagedBeanO.java:1018)
at com/ibm/ejs/container/EJSHome.activateBean(EJSHome.java:1718)


CPU intensive parts of the app?


ThreadDumps or Javacores - poor man's profiler

Periodic javacores

Thread analysis using the Thread Monitor Dump Analyzer tool

High CPU is typically diagnosed by comparing two key pieces of information

Using Javacores, determine what code the threads are executing

Gather CPU usage statistics by thread

For each Javacore compare the call stacks between threads

Focus first on request processing threads

Are all the threads doing similar work?

Are the threads moving ?

Collect CPU statistics per thread

Is there one thread consuming most of the CPU?

Are there many active threads each consuming a small percentage of CPU?

High CPU due to excessive garbage collection ?

If this is a load/capacity problem then use the HPROF profiler

-Xrunhprof:cpu=samples, -Xrunhprof:cpu=time
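Per-thread CPU statistics can also be gathered from inside the JVM. The sketch below samples ThreadMXBean.getThreadCpuTime for every thread and computes the difference between two samples; the class ThreadCpuSampler and its methods are names invented for this example, and per-thread CPU timing is an optional JVM capability, so the sampler returns an empty map where it is unsupported or disabled.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadMXBean;
import java.util.HashMap;
import java.util.Map;

/** Illustrative per-thread CPU sampler; names are assumptions for this sketch. */
final class ThreadCpuSampler {

    /** Snapshot of CPU time (ns) consumed so far, keyed by thread id. */
    static Map<Long, Long> cpuNanosByThread() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        Map<Long, Long> result = new HashMap<>();
        if (!mx.isThreadCpuTimeSupported() || !mx.isThreadCpuTimeEnabled()) {
            return result; // capability missing or switched off in this JVM
        }
        for (long id : mx.getAllThreadIds()) {
            long nanos = mx.getThreadCpuTime(id);
            if (nanos >= 0) { // -1 means the thread has died in the meantime
                result.put(id, nanos);
            }
        }
        return result;
    }

    /** CPU time (ns) each thread consumed between two snapshots; new threads count from zero. */
    static Map<Long, Long> delta(Map<Long, Long> before, Map<Long, Long> after) {
        Map<Long, Long> d = new HashMap<>();
        for (Map.Entry<Long, Long> e : after.entrySet()) {
            long start = before.getOrDefault(e.getKey(), 0L);
            d.put(e.getKey(), e.getValue() - start);
        }
        return d;
    }
}
```

Taking two snapshots a few seconds apart and sorting the delta shows whether one thread dominates or whether many threads each take a small share.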





Diagnosis - Hangs

Often hangs are due to unresponsive synchronous requests

SMTP server, database, map service, store locator, inventory, order processing, etc.


3XMTHREADINFO "Servlet.Engine.Transports : 11" (TID:0x7DD38040, sys_thread_t:0x44618828, state:R, native ID:0x4A9F) prio=5
4XESTACKTRACE at COM.ibm.db2.jdbc.app.DB2PreparedStatement.SQLExecute()
4XESTACKTRACE at COM.ibm.db2.jdbc.app.DB2PreparedStatement.execute2(DB2PreparedStatement.java)
4XESTACKTRACE at COM.ibm.db2.jdbc.app.DB2PreparedStatement.executeQuery(DB2PreparedStatement.java)
4XESTACKTRACE at ...

3XMTHREADINFO "Servlet.Engine.Transports : 12" (TID:0x7DD37FC0, sys_thread_t:0x4461BDA8, state:R, native ID:0x4BA0) prio=5
4XESTACKTRACE at COM.ibm.db2.jdbc.app.DB2PreparedStatement.SQLExecute()
4XESTACKTRACE at COM.ibm.db2.jdbc.app.DB2PreparedStatement.execute2(DB2PreparedStatement.java)
4XESTACKTRACE at COM.ibm.db2.jdbc.app.DB2PreparedStatement.executeQuery(DB2PreparedStatement.java)
4XESTACKTRACE at ...

3XMTHREADINFO "Servlet.Engine.Transports : 13" (TID:0x7DD34C50, sys_thread_t:0x4465B028, state:R, native ID:0x4CCF) prio=5
4XESTACKTRACE at COM.ibm.db2.jdbc.app.DB2PreparedStatement.SQLExecute()
4XESTACKTRACE at COM.ibm.db2.jdbc.app.DB2PreparedStatement.execute2(DB2PreparedStatement.java)
4XESTACKTRACE at COM.ibm.db2.jdbc.app.DB2PreparedStatement.executeQuery(DB2PreparedStatement.java)



Not all hangs are waiting on an external resource

A JVM can hang due to a synchronization problem - one thread blocking several others



3XMTHREADINFO "Servlet.Engine.Transports : 11" (TID:0x7DD38040, sys_thread_t:0x44618828, state:R, native ID:0x4A9F) prio=5

3LKMONOBJECT com/ibm/ws/cache/Cache@0x65FB8788/0x65FB8794: owner "Default : DMN0" (0x355B4800)
3LKWAITERQ Waiting to enter:
3LKWAITER "WebContainer : 0" (0x3ACCD000)
3LKWAITER "WebContainer : 1" (0x3ACCCB00)
3LKWAITER "WebContainer : 2" (0x38D68300)
3LKWAITER "WebContainer : 3" (0x38D68800)
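The same "who owns the monitor, who is waiting to enter" picture can be obtained from a live JVM through ThreadMXBean, without parsing a javacore. The sketch below is one possible way to do it; the class LockContentionReport and the row format are assumptions for the example.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.ArrayList;
import java.util.List;

/** Illustrative contention report; class name and row format are assumptions. */
final class LockContentionReport {

    /** One row per thread that is blocked waiting to enter a monitor, with the current owner. */
    static List<String> blockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        List<String> rows = new ArrayList<>();
        for (ThreadInfo info : mx.dumpAllThreads(false, false)) {
            if (info.getThreadState() == Thread.State.BLOCKED) {
                rows.add(info.getThreadName() + " waiting to enter " + info.getLockName()
                        + " owned by " + info.getLockOwnerName());
            }
        }
        return rows;
    }
}
```

Many rows naming the same owner correspond to the pattern in the excerpt above, where a single thread holds the Cache monitor and several WebContainer threads queue behind it. ThreadMXBean.findDeadlockedThreads() covers the related case of threads blocked in a cycle.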


How did my JVM vanish without trace ?


JVM Process Crash Usual Suspects



Bad JNI calls, Segmentation violations, Call Stack Overflow


Native memory leaks - object allocation fails with sufficient space in the JVM heap

Unexpected OS exceptions (out of disk space, file handles), JIT failures

Monitor the OS process size

Runtime check of JVM memory allocations: -Xcheck:memory

Native memory usage - create a core dump on an OOM

JNI code static analysis: -Xcheck:jni (errors, warnings, advice)

GCMV provides scripts and graphing for native memory

Windows “perfmon”, Linux “ps” and AIX “svmon”


Find the last stack of native code executing on the thread during the crash


The signal info (1TISIGINFO) will show the Javacore was created due to a crash


Signal 11 (SIGSEGV) or GPF





What do I monitor ?


Top Malpractices

No architecture plan

No migration plan

No change records

No capacity plan

No production traffic profile

Changes put directly in production

No load and stress testing

Communication breakdown

No education

Application error

Test environment != Production


Support Assistant Workbench to help with Problem Determination


One stop shop for tools to analyze JVM issues


Tools

Problem | Artifact | Monitoring & Analysis

Memory leaks, Out of Memory errors, Application unresponsive | Verbose garbage collection log (native_stdout.log) | GCMV, VisualGC, jps, jstat, jstatd, jinfo

High CPU, Crash, Hang, Performance bottleneck, Unexpected termination | Javadump, Javacore (javacore*.txt) | Thread Monitor & Dump Analyzer (TMDA)

Lock contention, Low CPU at high load | Threads (connection to running JVM) | Sun VisualVM, JConsole, IBM Health Center, JRockit Mission Control

Memory leak, Out of Memory errors | Heapdump (*.phd, *.txt, *.hprof) | MAT, HeapAnalyzer, JHat

Native memory leak, Anomalies, Unexpected crash | System or core dump (core.dmp, user.dmp); files must be processed with the jextract tool | Monitor - GCMV; Examine - pmap & VMMap; Track - DebugDiag, libumem, valgrind, cmalloc & NJAMD


Runtime Serviceability aids


Troubleshooting panels in the administration console


Performance Monitoring Infrastructure metrics


Diagnostic Provider MBeans

Dump configuration and state, and run self-test

Application Response Measurement/Request Metrics

Follow a transaction end-to-end and find bottlenecks

Trace logs and First Failure Data Capture

Runtime Performance Advisors

Memory leak detection, session size, …

Specialized tracing and runtime checks

Tomcat classloader leak detection

Session crossover, connection leak, ByteBuffer leak detection

Runaway CPU thread protection


References



Java theory and practice: Dealing with InterruptedException

http://www.ibm.com/developerworks/java/library/j-jtp05236/index.html

Architectural design for resilience

http://dx.doi.org/10.1080/17517570903067751

IBM Support Assistant

http://www-01.ibm.com/software/support/isa/download.html

How Customers get into trouble

http://www-01.ibm.com/support/docview.wss?uid=swg27008359






Q&A

Thank You
