Part I: Planning and Auditing Your Firm's Capacity Planning Efforts

Planning and Auditing Your
Firm’s Capacity Planning Efforts

By Ron Kaminski

Ronald.R.Kaminski@kcc.com

ron@kaminski-family.com

Introduction


Over the past 20 years, I’ve started and expanded capacity planning
groups at dozens of firms; my most recent is now 15 months old


You learn things in that process


CMG is the place to share this information


I look forward to your presentation on this topic in a few years!


Today’s goal is to give you “planning and audit points” that you can use to
review how you do capacity planning, and maybe persuade you that other
methods might be more productive, or at least worth a shot!


There will also be “How to” information, that may have you adding some
“to do” items to your list


If you have a question, ask it!


I like nothing better than surfing off on a tangent that helps the class


Story Times!


New risks


© Ron Kaminski 2010, All Rights Reserved

Introduction


In the next few hours, we will cover


Defining your mission


Picking the right vendor partners


Going “Extra-Product”


Avoiding the “IT Mindset Traps”


The politics of capacity planning in organizations, the
key factor in your eventual success, or failure


Reporting, what you should and surprisingly should
not do


Classic capacity planning question descriptions and
proper answering techniques





Introduction


In the next few hours, we will cover


How clouds and “software as a service” will still need
capacity tracking and planning tools, and what new kinds
you will need


Modeling when all of the cards are stacked against you, or
“Tricks of the trade”


Goals to work towards


An audit list to compare to your systems


Capacity planning done well can change the fortunes of
a company and help all of our careers. Come sharpen
your methods and learn tricks that will make you part
of your firm’s future productive assets, and not an
expense to be controlled





Ron’s Rules


You can ask anything, at any time


Sometimes the answer is coming up soon in the examples, and in that
case I’ll tell you so



Quick Survey


Does anyone here already have…


A network queuing theory based modeling package?


Regular, automated process and workload pathology detection?


Fast web reporting of resource consumption by business useful workloads?



By the end of this talk, I hope that you will realize that workload
characterized views of consumption, web accessible, over business
useful time spans are a must have part of the best run IT shops


Let’s see why…

Defining your mission


Every site has its own “Hot button!” issues


“We are buying a new $23 million computer room every 6 months!”


Attack server sprawl with data, not words


“I don’t know why we hired a capacity planner, we just…”


“Our critical applications are slowing down!”


Use relative response times and historical information to show why


Chargeback used to be a big draw but it has really faded away in the
post .com world


It shows you when you are talking to an old vendor


The ITIL push and reality when facing outsourcing or “ZOG”


ITIL takes a back seat to cost control, at least in the states


“We need better reporting!”


Be careful to be holistic in what you deliver, cover everything that they can
buy, historically and ideally with business cycle peaks


When you start hearing terms like “focus on business priority” and “really look
at travel expenses” realize that cost cutting is in your future and report in ways
that enable them to cut power and machines





Defining your mission


You might think that all that variation would lead to very
different solutions, and you’d be wrong!


All effective capacity planning systems are based on having:


Efficient data collection, regrouping, reduction and storage


Effective graphical reporting of business meaningful spans of time


Components of workload response time that lead to diagnosis


Solving the desire for answers to “What if…?” questions


Problematic consumption diagnosis, reporting and ticketing


Some capacity planning product “features” marketed by vendors to
the naïve are actually seldom used in the real world, and for good
reasons


Linear Trending, when what you really need is business cycle discovery and
planning


The retail cycle at grocery chains and web payment system vendors


Real Time Monitors, when you might want to go home or on vacation some day.
Remember, problems happen 24 X 7, and humans won’t be watching “twitch
monitors” that consistently.


The mission control room story


Top 10 lists are often used to focus a newbie on peak consumption, all of which may be valid



Defining your mission


Who is doing the reporting?


Vendor supplied reports


Tend to be single metric


Often don’t include contextual information


Are often “generate on demand” and therefore any useful span of
time takes beyond the allowable attention span


Often have serious contextual clarity problems


Workloads change colors as

» the number present changes

» you switch machines

» Use black outlines that swamp the colors for small workloads


The “I’m only using vendor reports this time” and hit count story


Can take unimaginable resources to produce


Set yourself a consumption budget and manage to it


You want to trade more bonds? Stop looking at it!


May focus on reporting “right now” data rather than long term useful
decision support information


Seldom contain “disturbance to the status quo” notation capabilities



Defining your mission


Who is doing the reporting?


Write your own reports


Can be anything that you dream up (and can deliver the code for)


There are multiple “free” languages and infrastructure to pick
from


We’ve used Perl, PHP, Java and a whole lot more


Can be tailored for your firm’s decision makers’ specific needs


Can use “generate ahead” and other techniques to speed web
reporting


Writing your own can also have “down sides”


Staff turnover and the “Who is going to maintain this ___?” issues


Some staff are not gifted visual communicators


If the information used changes formats (and over time they all do),
someone is going to have to maintain that stuff





Defining your mission


What do you want to present?


Workload characterized subdivisions of consumption over
time?


Long term historical context for decision makers over
multiple natural business cycles?


Information subdivided into audience specific groupings
for ease of use by subgroups


Integration into your firm’s


CMDB


Ticketing systems


Software development life cycle


Totals over time


The sparklines counter-argument





Why sparklines of totals can be really useful


These are sparklines of total CPU used, average CPU used and the
average CPU used by all nodes in that O/S


Is there one in particular that draws your eyes to it, that wants
you to probe deeper?


Why sparklines of totals can be really useful


If you are like me, ustca102 has you wondering, “What made it step up like that?”


On our system, clicking on the tiny sparkline brings up a “zoomed in” image, which
really gets you wondering:













Clicking on that graphic brings up our normal web reporting system:






Why sparklines of totals can be really useful


OK, sometimes totals are useful


Sometimes they can draw your eye to issues


They can quickly dispel rumors that “All of our
machines are maxed out!”


For example, our applications specialists were
consistently maintaining that all of their machines were
barely big enough to make month end, and they would
argue mightily whenever we might suggest that there
was room for consolidation


I brought the chart on the next slide to the next
meeting, and suddenly their tune changed…




Why sparklines of totals can be really useful


What happened after the meeting?


In the next 9 months, using extremely conservative criteria, we


Virtualized: 230 machines ($1,521,000)


Retired: 55 machines ($390,553)


“Oh! You can just turn that off!”, or, “See steam come out of the operations
folk’s ears” stories


Planned: 10 machines ($40,000)


Potential: 28 machines ($112,000)


We then plan on going back over with slightly less conservative criteria
and finding a couple million more


We will also be doing more “application stacking” where it makes
more sense


Sort of makes capacity planning tools look cheap, doesn’t it?
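To put a single number on the consolidation results above, the four rows can simply be added up. This is a minimal sketch: the machine counts and dollar figures are copied from the slide, while the reading of the dollar amounts as savings is an assumption, since the slide lists them without a label.

```python
# Figures copied from the slide: category -> (machines, dollars).
# Treating the dollar amounts as savings is an assumption.
results = {
    "Virtualized": (230, 1_521_000),
    "Retired": (55, 390_553),
    "Planned": (10, 40_000),
    "Potential": (28, 112_000),
}

total_machines = sum(machines for machines, _ in results.values())
total_dollars = sum(dollars for _, dollars in results.values())

for category, (machines, dollars) in results.items():
    print(f"{category:<12}{machines:>5} machines  ${dollars:>10,}")
print(f"{'Total':<12}{total_machines:>5} machines  ${total_dollars:>10,}")
```

That comes to 323 machines and $2,063,553 across the four categories.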


Why sparklines of totals can be really useful



A DBA pal of mine asked for a review of memory on a box, asking for an
increase to add caching and improve performance


I didn’t really detect a memory shortage:


Why sparklines of totals can be really useful


Still, people don’t usually mention issues unless there is an
underlying cause. So, as a capacity planner, you have to
always look deeper and always check all of the following:


CPU


Disk I/O


Memory


Network


Response time for key workloads


If you don’t always check everything, something can
sneak by


Here is what I found when I followed the “always check
everything” rule


When I looked at CPU, I saw:



Why sparklines of totals can be really useful


Update!


They’ve since added 2
more CPUs and the issue
continues unabated


Some issues are not based
in physics and data!



Why sparklines of totals can be really useful


Now you see several reasons why longer term
sparklines can be pretty useful


Do you currently have ways to generate them?


If not, do you want to get ways to generate them?


Don’t you all think that your vendor ought to provide
them, in group and zoomed in formats?


So let’s start asking them to…


Do you also see why you should always check
everything and then sit back and ask yourself:


“If I had asked that question and then got this response,
what would I ask next?”
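If you do not yet have a way to generate sparklines, a text-only version is easy to home-brew. The sketch below is one possible approach, not the reporting system shown in these slides: the `sparkline` function and the sample CPU series are hypothetical, and it uses Unicode block characters instead of images.

```python
BLOCKS = "▁▂▃▄▅▆▇█"

def sparkline(values):
    """Render a sequence of numbers as a one-line Unicode sparkline."""
    if not values:
        return ""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # A flat series gets the lowest block for every point.
        return BLOCKS[0] * len(values)
    last = len(BLOCKS) - 1
    return "".join(BLOCKS[round((v - low) / span * last)] for v in values)

# Hypothetical daily average CPU % for one node, with a step up part-way through.
cpu = [12, 14, 13, 15, 12, 14, 41, 43, 40, 44, 42, 43]
print(sparkline(cpu))
```

Each series is scaled to its own minimum and maximum, so a step change like the one in the sample data stands out even when the absolute numbers are small.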



Defining your mission


Anticipate the “next questions” and always answer
them before being asked


The unanswered “next question” can be a huge time
waster


Often a stall technique used by the politically astute


It raises temporary doubt in your findings, and builds their case
for swift purchase, before you answer their question


Often a way for the old guard to show that they still are the “top
dogs” to management


Impatient or frightened management might run off and buy
something!


The undeclared war between Project Managers and
Capacity Planners


The “project manager weasel who never lost” story


Defining your mission


If you are going to shoot down someone’s
hypothesis that lack of CPU was the cause of a
problem, you’d better find out what really caused
the problem before the meeting!


Your goal:


One meeting or phone call per issue!


They may say “We just want a quick and dirty answer”
but they never really do!


Always cover at least:


CPU


Memory


Disk I/O


Workload response time changes


For web-centric systems, network distances and loads



Defining your mission


Cultural differences are real and might affect your
workload choices


Some cultures avoid direct blame or information that would
cause someone to “lose face”


Any workloads are better than none


The “No personal pronouns” story


Be consistent!


Always use the same groupings on all similar nodes


Use the same colors if you can!


Reduce the burden on your audience


Multiply the value of your workload creation efforts


Use consistent precedence order to decide where to put a
process that meets the criteria to be in several different
workloads
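One way to apply a consistent precedence order is to keep the workload rules in a single ordered list and let the first match win. The sketch below assumes processes are classified by command name; the workload names, patterns and the `classify` helper are all hypothetical.

```python
import re

# Hypothetical rules, listed in precedence order: the first match wins.
WORKLOAD_RULES = [
    ("backup", re.compile(r"backup")),
    ("database", re.compile(r"oracle|postgres")),
    ("web", re.compile(r"httpd|nginx")),
]

def classify(command, rules=WORKLOAD_RULES, default="other"):
    """Return the first workload whose pattern matches the process command."""
    for workload, pattern in rules:
        if pattern.search(command):
            return workload
    return default

# This command matches both "backup" and "database"; precedence decides.
print(classify("oracle_backup_agent"))
print(classify("/usr/sbin/httpd -k start"))
print(classify("sshd"))
```

Using the same ordered list on every similar node means a process that qualifies for several workloads always lands in the same one.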



Defining your mission


Whatever you decide:


Track your own tools usage!


There are multiple great freeware web usage reports that will tell you if
folks are using or snoozing your data (We use Webalizer:
http://www.mrunix.net/webalizer/)


Unviewed information is wasted time and effort


Use speed tests


If there are multiple ways to do something (CSV files versus a Performance
database) code for both and have a race


Will your web users want the slower one?


The capacity planning reporting challenge story


Don’t settle, always seek new audiences and better reports


Add new functions


Sadly, there is no shortage of bad vendor reporting on expensive
infrastructure

» Anyone here ever seen a great graphical historical display in business
useful terms of SAN information or LAN usage by segment?


Your firm may have business specific information that might be really useful to
decision makers if overlaid on, or graphically reported near, IT resource
consumption
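The “code for both and have a race” advice from the speed tests bullet can be tried in a few lines. This is a rough sketch under stated assumptions: SQLite stands in for the performance database, the data is synthetic, and the file names and helper functions are hypothetical. Real results depend entirely on your data volumes and storage.

```python
import csv
import os
import sqlite3
import tempfile
import time

def build_stores(rows, directory):
    """Write the same (node, cpu_pct) rows to a CSV file and a SQLite database."""
    csv_path = os.path.join(directory, "cpu.csv")
    db_path = os.path.join(directory, "cpu.db")
    with open(csv_path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE cpu (node TEXT, cpu_pct REAL)")
    conn.executemany("INSERT INTO cpu VALUES (?, ?)", rows)
    conn.commit()
    conn.close()
    return csv_path, db_path

def total_from_csv(csv_path):
    """Sum the CPU column by reading the whole CSV file."""
    with open(csv_path, newline="") as handle:
        return sum(float(cpu) for _node, cpu in csv.reader(handle))

def total_from_db(db_path):
    """Sum the CPU column with a SQL aggregate."""
    conn = sqlite3.connect(db_path)
    try:
        return conn.execute("SELECT SUM(cpu_pct) FROM cpu").fetchone()[0]
    finally:
        conn.close()

def race(label, func, arg, repeats=5):
    """Run func(arg) several times, print the mean elapsed time, return the result."""
    start = time.perf_counter()
    for _ in range(repeats):
        result = func(arg)
    elapsed = (time.perf_counter() - start) / repeats
    print(f"{label}: {elapsed * 1000:.2f} ms per run")
    return result

# Synthetic data: 20,000 samples spread over 50 hypothetical nodes.
rows = [(f"node{i % 50}", float(i % 100)) for i in range(20_000)]
with tempfile.TemporaryDirectory() as workdir:
    csv_path, db_path = build_stores(rows, workdir)
    csv_total = race("CSV file", total_from_csv, csv_path)
    db_total = race("SQLite", total_from_db, db_path)
```

Checking that both contestants return the same answer before comparing times keeps the race honest.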


Our site’s web usage:


Our shared long term mission


When you innovate and come up with new report ideas, share
them at CMG!


Or at least send me examples in mail and I’ll do it for you!



Share code in this or other user groups that make sense


We should all work together in user groups, public forums, on the
web, etc., to push all of our vendor partners to address these needs


The more they do for us, the less we carry the “home brew code”
weight


We should also all work to reduce the volume, impact and long
term storage requirements of our solutions


I have yet to encounter a vendor that isn’t carrying around a lot of
extra metrics in the bowels of their systems that will never be used


We should have a CMG sponsored “help wanted” section for
capacity planning specialist positions in the various countries



Picking the right vendor partners


I believe that all capacity planning efforts should have tools
that include:


Efficient resource usage and process consumption collectors


Network queuing theory based “what if…?” modeling based on
workloads, not total consumption


The bulge trap


Efficient, speedy web-based historical consumption data display


Ideally your chosen vendor would


support most or all of your differing operating systems and
devices


have ample training and consultants available; there is nothing
better than a co-pilot when you are starting out


participate in and support CMG!


Picking the right vendor partners


In the not too distant future, the best vendors should be:


Offering efficient “low impact” “cloud deployable wrappers”
that run with your applications in a cloud


“We don’t have to worry, it’s in a cloud” is nonsensical


Are you going to generate fake transactions and time them?


When you get a long time back, or significant variance, are you
going to have enough information to know why? I think that in
time people will realize this need, and want it in their contracts


Don’t you want to know the overhead of encryption and
decryption in the process, and its response time effects?


Stupidity is infinitely scalable, as long as you aren’t getting the
bill


If nobody cares to make their code efficient, because they just send it
to the cloud, how good is that code going to be?


Will it be running on the same machine as you tested?


Will it impact your users?


Picking the right vendor partners


In the not too distant future, the best vendors should be:


Offering efficient “low impact” “cloud deployable wrappers”
that run with your applications in a cloud (continued)


The internet will continue to grow exponentially


So those clouds could get mighty full, mighty quick


How do you want to find out that it is too full?

» Do you want your customers telling you?

» Or do you want your own reports based on scientifically accurately
collected consumption data?


Social media sites are becoming valuable business tools


Businesses “tweet” and have Facebook pages!


Do you think that a free application originally designed to let 14 year olds
share photos is designed for high performance business needs?


How will you be sure?



Picking the right vendor partners


In the not too distant future, the best vendors should be:


Thinking about SaaS user tools as well. Sure, SaaS vendors
maintain the code and pay if it is a hog, but are they:


running maintenance activities like backups and virus scans that slow
things down right during prime time for Australia in your globally
distributed firm?


suffering from office hours peaks of consumption that impact your
users’ response times?


Taking outages to horizontally scale that might impact your firm’s
ability to ship product?


Without your own data, you will never know


What responsibility do you have to your firm’s users?


Why is this network queuing theory based modeling stuff
so important?


Let’s understand what it means and then see an example…


Modeling Norms


Most modeling packages assume a Poisson or chi-squared
distribution of the arrival rate of transactions


Some simpler, yet often quite elegant, systems like Dr. Neil
Gunther’s PDQ modeling just use a quadratic and forget the tails


They aren’t all that different, despite what we modeling
junkies might say!


Don’t focus on the distribution selected, focus on whether they
use queuing theory models and give you relative response times


[Figure: quadratic and Poisson curves compared]

Why network queuing theory based modeling?


These concepts are also often illustrated with simple
queue graphics like the one at the right


An important implied assumption is that all
requests are served, none are lost


Response time is the sum of Queuing Time plus Service Time


[Figure: simple queue diagram. Arriving Transactions pass through Queuing Time and Service Time to become Completed Transactions; Queuing Time plus Service Time is the Response Time]

Why network queuing theory based modeling?


Methods do differ, but queues for interactive
workloads are usually computed based on load
percentage using a formula like:


Q = U / (1 - U)


where:


Q = Expected Queue


U = Utilization


Response time is the sum of Queuing Time plus Service Time


[Figure: Expected Queues, Q = U/(1-U), plotted against % Utilization from 0% to 90%]

[Figure: Expected Response Time plotted against % Utilization, showing Total Response Time as CPU Service Time plus Expected Queues]
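The Q = U/(1 - U) formula is easy to tabulate yourself. The sketch below is an illustration, not any vendor's method: it assumes a single queue, a service time of one unit, and queuing time equal to service time multiplied by Q. The function names are hypothetical.

```python
def expected_queue(utilization):
    """Expected queue, Q = U / (1 - U), for utilization as a fraction in [0, 1)."""
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    return utilization / (1 - utilization)

def relative_response_time(utilization, service_time=1.0):
    """Service time plus queuing time, assuming queuing time = service_time * Q."""
    return service_time * (1 + expected_queue(utilization))

for pct in range(0, 100, 10):
    u = pct / 100
    print(f"{pct:3d}%  queue={expected_queue(u):5.2f}  "
          f"response={relative_response_time(u):5.2f}")
```

At 50% utilization the expected queue is 1, and at 90% it is 9, which is why response times climb so steeply as a resource fills up.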

Why network queuing theory based modeling?


So, as a workload competes for resources
throughout a day, its response time is likely to vary


Computed relative response times show us
both the variations and the reason


The Y Axis metric does not matter!


Just pick a basis, the ratio is the important part!


[Figure: CPU Utilization (0% to 100%) by hour of day, 0:00 to 23:00]

[Figure: Response Time Components by hour of day, showing Relative Response Times split into CPU Service Time and Expected CPU Queueing Time]
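Here is a small sketch of how relative response times across a day might be computed from CPU utilization alone. The hourly utilization values are invented for illustration, the quietest hour is arbitrarily chosen as the basis, and queuing time is assumed to be service time multiplied by U/(1 - U).

```python
# Hypothetical hourly CPU utilization (fractions) for one node.
hourly_cpu = {0: 0.20, 3: 0.35, 9: 0.80, 12: 0.60, 15: 0.85, 21: 0.30}

def response_components(utilization, service_time=1.0):
    """Return (service, queuing) times, assuming queuing = service * U / (1 - U)."""
    queuing = service_time * utilization / (1 - utilization)
    return service_time, queuing

# Basis for the ratio: the response time in the quietest hour (an arbitrary choice).
baseline = sum(response_components(min(hourly_cpu.values())))

relative = {}
for hour, u in sorted(hourly_cpu.items()):
    service, queuing = response_components(u)
    relative[hour] = (service + queuing) / baseline
    print(f"{hour:02d}:00  U={u:.0%}  service={service:.2f}  "
          f"queuing={queuing:.2f}  relative={relative[hour]:.2f}")

worst_hour = max(relative, key=relative.get)
print(f"Slowest hour: {worst_hour:02d}:00")
```

Because everything is expressed as a ratio to the basis, the unit chosen for service time drops out, which is the point of the “Y Axis metric does not matter” remark.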

Why network queuing theory based modeling?


A workload’s typical transaction is likely to rely
on several resources


Imagine a workload running on a machine with four
CPUs, six disks and some network IO on one card


Note that when technologies differ, service
times can differ


[Figure: Workload Transaction Response Time]

Why network queuing theory based modeling?


Now do you see where a graph like this can come from?


If the warehouse folks are complaining about response
times at 3:00 AM, should you upgrade the CPU?


When do you suspect that the backups are running?


Would a CPU upgrade help daytime response?


But it also might make demand for I/Os faster and really slow
down the warehouse at 3:00 AM too, so you better address the
I/O issue!


[Figure: Relative Response Time Components by hour of day, 0:00 to 23:00, split into CPU Service, IO Service, Network Service, CPU Wait, IO Wait and Network Wait]
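The 3:00 AM question can be worked through with made-up numbers. In this sketch the service times and utilizations are hypothetical, and each resource is treated as a single queue with wait equal to service time multiplied by U/(1 - U). That is a simplification: a machine with four CPUs and six disks would call for a multi-server model in a real modeling package.

```python
# Hypothetical per-transaction service times (seconds) on each resource.
SERVICE = {"CPU": 0.020, "IO": 0.030, "Network": 0.005}

# Hypothetical utilization of each resource at two times of day.
UTILIZATION = {
    "03:00": {"CPU": 0.15, "IO": 0.92, "Network": 0.10},
    "14:00": {"CPU": 0.80, "IO": 0.40, "Network": 0.30},
}

def components(utilization):
    """Return {resource: (service, wait)}, assuming wait = service * U / (1 - U)."""
    return {
        name: (SERVICE[name], SERVICE[name] * u / (1 - u))
        for name, u in utilization.items()
    }

def biggest_wait(utilization):
    """Name the resource contributing the most waiting time."""
    parts = components(utilization)
    return max(parts, key=lambda name: parts[name][1])

for when, utilization in UTILIZATION.items():
    parts = components(utilization)
    total = sum(service + wait for service, wait in parts.values())
    print(f"{when}: total={total:.3f}s, largest wait is {biggest_wait(utilization)}")
```

With these numbers the 3:00 AM slowdown is dominated by IO wait, so a CPU upgrade would not help the warehouse, while the afternoon response time is dominated by CPU wait.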
Picking the right vendor partners


In my experience, network queuing theory based tools
move folks quickest to actionable answers


Once you understand relative response times, most issues are
quick and easy to diagnose


If a new vendor harps on linear “trending” graphics and
projections, don’t expect them to be around for very long


If a monitoring or other product vendor keeps adding “and
you can use this for capacity studies” it is probably because
the salesperson heard that you were looking for capacity
planning tools!


Stick with network queuing theory based packages and you
won’t go wrong!


Dozens of “And we can do capacity planning too!” stories


Ron Goes Off on VMware


VMware is the single biggest indictment of the poor
way most firms have done capacity planning in the
Windows space


The lack of workload characterized views of consumption
is why folks bought a server for each functional part


“We don’t want to stack multiple applications on one
server! So we VMware them!”


…which is just stacking with the added joy of paying for not only
extra copies of the OS and tools, but $900+ for VMware as well


And in the end, the code is running on the same box!


VMware’s “so called” capacity planning tool is proof that
they never attended a CMG!


It is as near useless as any marketed tool that I have ever seen, but
at least it is expensive…



Going “Extra-Product”


Once you get used to your vendor’s product, if you are like me,
you’ll start wishing for more functions tailored to your specific
needs


In the old days, a grey haired expert would whip out a spreadsheet or
other mathematical package and start creating some “home-brew”
solution


I use Perl and GD:Graphics, PHP, JavaScript and anything else that I can
think of; you can use what makes sense to you


Check out old CMG papers, they are laced with great ideas


In other words, don’t feel limited to what your vendor does “out of
the box”


Find buddies that use the same vendor and start sharing ideas and
code


Things that you will see later in this presentation are shared among
dozens of firms and they wouldn’t live without them


You don’t have to agree 100%, take what fits best and leave the rest


Going “Extra-Product”


There is a whole group of us running many of the extensions
that we’ve developed over time


Some of our extensions have made it into some products, but
nowhere near enough of them!


We probably get 50% of our firm’s benefit from the tools from
our own extensions


We regularly meet with the vendors and implore them to add
the features that we like


Having more singing from the same hymnal might just get
through to them!


Come join us! The best ideas might be in your head! Share!



Avoiding the “IT Mindset Traps”


Capacity planners come in several flavors, because people from
several different camps end up in this role


Scientists: scientifically minded users of network queuing theory tools
and simulation models who want to subdivide consumption into
different behavioral groups and analyze them


Application specialists: application subject matter experts who
“know the application”, are trusted by management, and care deeply
about its success. They often come from the application side of the
firm


Old Timers: they know everybody, have worked on everything and
have connections and favors to call in to get things done. They often
come from the operations side of the firm


Each of these can be successful, but some are more prone to
certain behaviors that can limit your capacity planning effectiveness
and raise the costs of doing it


Let’s look at the typical pros, cons and peccadilloes of each


The Scientists


The Scientist capacity planner


loves to get data from everywhere and everything that they can


Willingly tackles huge tasks as long as there is a possible learning
benefit


Will constantly tweak the automation to be able to get yet more data


Will go “extra product” and build tools for specific functions without
fear, because they are used to building things from scratch and being
successful


Pros


No fear, they view no problem as intractable and are sure that if they
can get real data into a scientifically designed framework, business
useful learning will result


No agenda, all applications and systems are equally important to
them, they will not lobby for one application to get resources instead
of another, preferring instead a rising tide that raises all boats


Willing to try new methods and tools in search of solutions




The Scientists


Cons


Scientists can be viewed as “remote” or “doesn’t know the business”
by some in management, particularly application development


They may want some really expensive and/or tricky software, and on
every machine, and these tools produce copious amounts of data that
needs to be processed, graphed and stored


The volume of tools and special case software that they accumulate
over time can be hard to support by others


Good ones are relatively rare, ones that can teach/mentor others are
extremely scarce


Mindset Traps


Scientists can go off on tangents, they really need a manager who can


Help them get the most productive subset of tools working first


translate their outputs into terms understandable to the business


help keep them focused on what the business deems most valuable


Their pursuit of the “one scientifically superior way” left unchecked
can lead to ongoing high costs







The Application Specialist


The application specialist in the capacity planning role


Will often drop everything else to don their fire-fighter jacket and
“save the firm” by working on emergencies


Will rely strictly on simple O/S tools and minimal data, often just totals
because “that was all we needed when we started this thing, and look
how far we’ve come”


Seldom tracks historical consumption data over time, or if they do,
seldom presents it in a format that is easily understood by others


Pros


They really do know the application, the folks who are powerful, and
they have a lot of chips at the bargaining table when it comes time to
get things negotiated


Their application specific knowledge can really come in handy when
strange behaviors are noticed


Their continuing drive to make an application succeed and the lengths
that they go to are often very favorably viewed by non-technical
management






The Application Specialist


Cons


EGO!


Our conference rooms are named after comic book super heroes!


The Application Specialist


Cons


Their self confidence can lead to large egos, they dismiss opposing
views of how to address issues other than “the way that we’ve always
done it”


Their extreme willingness to join in every fire-fight eats a lot of time
and delays the deployment of tools and systems (like long term
historical consumption tracking) that would help others understand
and make better decisions


Tend to enjoy being the “go to guy” and thus seldom share the basis
for their decisions


This is sometimes covering up the fact that the basis for their decisions is gut
feel, not data


They will commit in public forums where management is present to
supporting the scientists to get some application specific technical
need, and then fail to do so in a timely manner, if ever


They really know their silo, but they are very uncomfortable when
asked to go outside of it





The Application Specialist


Mindset traps


These folks’ career successes have been built on “thinking on their
feet” as issues occur, so they seldom take the time to build data
collection and reporting structures that lead to well informed
decisions


“When you need to know something, just ask me.”


They may even resist or delay deployment of capacity planning systems, calling
them “costly, unnecessary and not our application’s highest priority”


They will resist changes to their sacred “architectures” from the 1980s


They can be initially really interested in capacity planning information
about their application, and use it to point out the positive impacts of
their past decisions and successes


…but don’t expect them to mention immense over capacity


Often their interest stops immediately at the edge of their application


When there are issues larger than one application, they view it as their duty to
“defend their application’s turf” and will move to segregate the environments
into “us” and “them” groupings that need not share any infrastructure


They think that “The vendor will tell us when to…”





The Old Timers


The old timers in the capacity planning role


Are a calming presence in meetings


Have stories of a time when we faced something similar


Have the best jokes


Know and address the VPs as “Phil” and “Sandy”


Have capacity tracking systems that tend to the super-inclusive, when
asked, they alone can root out data about darn near anything, but
they have to be asked


Pros


They have the trust and respect of nearly everyone, because everyone
has worked successfully with them over time


When they need tools or space to get or keep their data, they just go
ask “Phil” or “Sandy”


Are among the few to have worked on many of the systems, not just
one or two, and so they understand deeply the inter-reliance of many
of the systems and how an issue in one can manifest elsewhere





The Old Timers


Cons


Old timers are often tired of learning. They seldom want to embrace
radical new methods when they are retiring in a few years


Old timers are survivalists, or they wouldn’t be old timers. They have a
great political sense of when “not to rock the boat” and “who not to
mess with” that can prevent or delay the introduction of useful new
information


Mindset Traps


They approach capacity planning like they approached most of the IT
issues that they’ve faced in their long careers


“Let’s start with a database with thousands of metrics!”


“You never know what will come in handy”, so they resist deleting them while disk
can still be purchased


Their reporting systems evolved over a long time, hence can be
hopeless for someone new to decipher or change


They can be based on large tables of numbers that only a select few can
successfully use





Avoiding the “IT Mindset Traps”


So what do we do?


How do we get the “pros” of each type and minimize the downsides?


You must build a “matrix-ed” team containing some of each type


The team concept must have support from the highest levels


It must have priority from each of their respective management


They must be charged with:


enabling the scientists to integrate new tools into the environment


getting graphical reporting working that management can understand


maintaining just enough information to provide long term historical context for
decisions, but no more


Sometimes, you’ll have to bring in outside expertise, and the only way
that will succeed is to have “friends in high places”


It is critical to put this under an excellent manager


Each of the three types has useful and less useful behavior patterns


You need a manager that all can respect, who doesn’t try to be the
expert, rather one who coaches each to be part of a well functioning
whole




The politics of capacity planning

in organizations


Organizational politics are often the key factor in your capacity
planning group’s eventual success or failure


Long experience has taught many of us the importance of


Friends in high places


Try to get the capacity planning issue instigated by a knowledgeable VP or at
least a director


Often a major initial stumbling block is even getting permission to install
collectors on production systems, much less the physics of actually doing it,
and there is nothing better than having their boss’s boss saying, “Yes, you
must do this, it is a priority”


Determining and rating the skills and power balances in your
organization, usually by O/S


Managerial chaos can be a severe issue


Diagnosing and surmounting the barriers to success


Describing the type


Their common barriers and techniques to surmount them


Identifying and surmounting barriers


Barrier: The “not invented here” über-geek


Identification clues


Often are early members of a firm


Usually position themselves as masters of several related technologies, but
can be rather sparse on details


The younger the firm, the more often you find them, internet firms in high
growth areas are full of them


They are convinced that “If we didn’t need it then, we don’t need it now!”


Their typical barrier methods


“This is not an organizational priority”


“This collector code is not proven on our sensitive production systems”


Techniques to surmount their barriers


Friends in high places compel them


Share credit for successes with them to their management


Involve them in the model setup, ideally model alongside them, letting them
suggest probable growth steps



Identifying and surmounting barriers


Barrier: “The high priests of the old tool set”



Identification clues


They like “twitch monitoring” and often have built an extensive installation of
them with impressive sounding names like “The war room” or “mission
control”


Whenever you enter it during non-emergencies, notice how few people are actually
using the displays


They prefer current “totals” like total CPU because they’ve never had
consumption by business identifiable sub-groupings


They react to brief workload peaks by demanding upgrades


Their typical barrier methods


Stalling. They ask streams of technical questions, and each answer that you
give prompts another


Requests to integrate: new capacity tools must feed information to their “war
room”


Techniques to surmount their barriers


Ask them to put long term, workload characterized consumption on their
displays


Have them tasked to help address pathologies automatically detected (that
their monitors did not seem to surface)



Identifying and surmounting barriers


Barrier: The application architects


Identification clues


They rigorously defend their current multi-node spread as vital for


The organization


Uptime


Scalability


90% of their machines will be empty or nearly so


The architecture was set in stone a decade ago, and was designed to solve the
issues of that time: minuscule PCs


Their typical barrier methods


Lecturing you on how their way is the “only way”


“Don’t you realize that these are business critical systems?” is used to justify
all manner of excessive purchasing


They will lecture you on availability and scalability at the drop of a hat


Techniques to surmount their barriers


Show them the serious speedups possible by collapsing application layers onto
fewer machines and removing network time from chatty applications


Ask them for estimates on just how much more their application will need to
scale, given that it is 7 years old and already in use firm wide



Identifying and surmounting barriers


Barrier: The entrenched fire fighting squad


Identification clues


They offer to work with you, but not today as there is an emergency


They position themselves as “the experts” in an application


They are hyper-sensitive to any changes in the environment, they view them as
“dangerous”


“Our conference rooms are named after comic book super heroes!” revisited, when you
fly in to interview, everyone is fighting a fire


Their typical barrier methods


They position themselves as “must have” team members and then are never available


Beware their commitments to make data or specifics available, they will often be “too
busy” later to do it in a timely manner if at all


Techniques to surmount their barriers


Agree to work with them as valued members of the team, then ignore them in your plans
as they will always be too busy to help anyway


Never trust them to come through with a key item, always plan for another way to get
what they promise that does not involve them


Over time, train them that many of the “time consuming fires” that they fight are simple
pile ups of multiple pathologies that won’t bite if addressed in a timely manner


Identifying and surmounting barriers


Barrier: The overwhelmed, outsource-able and scared


Identification clues


They have single functions, often somewhat amorphous, and difficult to tag a
dollar value on


They are not in politically savvy management’s structures


Their typical barrier methods


They stall, seemingly frightened to take on any task without exact instructions
from their management


They view tasks related to capacity planning as “Not their priority”


They view all new functions as threats


They seem to ignore all information not generated by their own function


Techniques to surmount their barriers


These are politically weak people in politically weak areas, stay away from
them so as not to have to rely on them


If forced to work with them, work with their manager to emphasize that
capacity planning is an important priority that they cannot stall


Help the good ones get out of that group



Identifying and surmounting barriers


Barrier: “This is a database server only” DBAs


Identification clues


They claim that “In order to save the firm database license money, we are
concentrating the databases from multiple applications on just a few servers”
and “nothing else can run on these servers”


Their typical barrier methods


Outright refusal to try collapsing micro-applications onto database servers


Claim remaining capacity on the 1/3 used database server is “for growth” but
are real hard to pin down for specifics, usually because there aren’t any


Techniques to surmount their barriers


Try to get them to allow/install only a certain small percentage of application
code on their machines due to “a network emergency”. That seems tiny and
reasonable.


Use a number like 10% to 20%. They don’t need to know that that was all of the
applications that you ever dreamed of doing.


Show them how your automated process pathology code works, to ease their
fears about rogue applications eating their machines alive and harming other
applications


Praise them to their boss as “innovative and balanced problem solvers”




Identifying and surmounting barriers


Barrier: Lying, manipulative project leaders


Identification clues


You are originally asked to model 400 users from a sample of 30. Later
they say, “Oh no! We meant 1000 users!”


Their typical barrier methods


Some project leaders view themselves as risk minimizers. Sadly, they
often feel that 60% excess hardware is a proper sized “cushion”, so
they inflate their usage estimate 60% to make the modelers justify
excess hardware for them


They took 3 extra months to get all these whacky features in, way past
their deadline, but now time is an emergency and they need their
results immediately or they just need to buy hardware right away
because they have no time to test properly


Techniques to surmount their barriers


Speed. You can model this stuff far faster than they can get a load test
to work without half of those whacky features blowing up


Ask more people for how many users really are going to be there






Identifying and surmounting barriers


Barrier: Enthusiastic but “We went to Load Runner
Class and we absolutely have to run huge
saturation load tests” drones


Identification clues


They don’t understand that mesa tests and modeling are all that is
needed. Even if you can get a decent mesa test out of them, they
still want to do a saturated load test anyway


They REALLY BELIEVE two seemingly counterintuitive things:

1. Your operations group must run out and buy exactly the machine
and memory that they dreamed up from dubious research for
their tests

2. They do not have to run against realistic data volumes with similar
indexes and size as intended production. They will NEVER create a
statistically relevant data source. They will frankly state: “It is
impossible!”








Identifying and surmounting barriers


Barrier: Enthusiastic but “We went to Load Runner
Class and we absolutely have to run huge
saturation load tests” drones


Their typical barrier methods


No matter how many times you say not to, they will always strive
to ramp up users at the start and ramp down afterward. Get ready
to lose your first and last measurement periods


If you can get a realistic transaction mix from them, they will still
strive to run them too fast


The 30 second contract review, 8 hours a day story


Techniques to surmount their barriers


Always question their user think times, then adjust your model to
deal with the silliness that you uncover. Maybe 20% of the samples
that I get have realistic transaction arrival rates, so beware


Be consistent, over a series of tests you will wear them down, or
get them fired
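To see why think times matter so much, here is a minimal sketch using the interactive response time law, X = N / (Z + R), where N is the number of users, Z the think time and R the response time. The function name and the 2 second response time are assumptions made for illustration only; the 20 users, 5 minute and 30 second figures echo the mesa test mail later in this section.

```python
def arrival_rate(users: int, think_time_s: float, response_time_s: float) -> float:
    """Transactions per second for a closed population of users.

    Interactive response time law: X = N / (Z + R).
    """
    if users < 0 or think_time_s < 0 or response_time_s < 0:
        raise ValueError("inputs must be non-negative")
    cycle = think_time_s + response_time_s
    if cycle == 0:
        raise ValueError("think time plus response time must be positive")
    return users / cycle


# Assumed response time of 2 seconds (hypothetical, not from the slides).
realistic = arrival_rate(users=20, think_time_s=300.0, response_time_s=2.0)
scripted = arrival_rate(users=20, think_time_s=30.0, response_time_s=2.0)

# How many times heavier the scripted load is than the realistic one.
inflation = scripted / realistic
print(f"realistic: {realistic:.4f} tx/s, scripted: {scripted:.4f} tx/s, "
      f"inflation: {inflation:.1f}x")
```

Under these assumptions the same 20 simulated users drive more than nine times the transaction rate, which is why a sample with silly think times has to be adjusted before it is modeled.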









A mail message to a new fleet of “Load
Runner” enthused contractor drones

The purpose of load tests can be manifold, to test functionality, capacity, and “feel”. Modeling
based on a sample does the same things and more, and usually much faster and cheaper.



If you choose to run a load test, be sure to run a “realistic transaction mix” with the expected
blend of all commands, not just one kind. If you are limited to simulating a subset of intended
loads by physics (for accuracy, we don’t recommend simulating more than 20 users per load running PC),
we can then take that load and model much higher ones and any alternate hardware that you
might dream of.




We have these caveats to improve accuracy:

1. Perform the tests on real, not virtual, servers for measurement accuracy

2. Run a proper “mesa test” for sampling which includes:

   A. Make sure that the CPP group has a collector on your intended test machine days before the test

   B. Start your test precisely on an hour boundary

   C. Do not, repeat, DO NOT “ramp up” or “ramp down” users. Just start and go, 20 users per load
   runner box will not overwhelm anything. Ramping is not required for models, indeed it is wrong to
   do it.

   D. Stop precisely on an hour boundary

   E. Send mail to us telling us

      I. How many users you simulated

      II. The precise timings

      III. How many more users we should add in the models

      IV. Anything else pertinent


3. The purpose of the test is to produce a flat topped “mesa” of usage that depicts your
users acting normally. A graph of CPU consumption should look like a rectangle with a
flat steady top, nowhere near saturated. We then take that sample of happy users
unconstrained and model what hardware is needed for more happy unconstrained
users.

4. Do a “practice run” several days before your real test to flush out issues and tell us so
we can see how well you followed mesa instructions

5. DO NOT do any of the following, which will waste your time, ruin the data and cause
rework

   A. DO NOT “ramp up” or “ramp down” usage at the start or end of your tests. It just makes us throw out that
   data

   B. DO NOT try to “saturate the machine”. The models will find that saturation load, don’t waste your time.
   Concentrate on producing an unsaturated load of happy users getting great response times

   C. DO NOT try to simulate hundreds of users from one PC with one network card. It will fail or worse, produce
   incorrect data leading to massive errors

   D. DO NOT create loads with unrealistically fast “think times”. If the user is likely to do a transaction, then wait
   5 minutes reading it or processing it, then set the inter-transaction wait time to 5 minutes, not 30 seconds.
   Remember, your goal is to be realistic, not to have high unrealistic loads.



Mesa tests may seem odd at first, but in time you will learn to love mesa tests and their time and
cost savings to projects. After a few of them, you’ll never load test the old way ever again.



Questions? Please ask, or invite us to your team meetings for a confab!
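The mesa rules in the mail above lend themselves to an automated sanity check on a practice run. What follows is only a sketch under stated assumptions: the function names, the 10% flatness tolerance and the 70% "nowhere near saturated" ceiling are all hypothetical illustrations, not values from the original mail.

```python
from datetime import datetime
from statistics import mean, pstdev


def on_hour_boundary(ts: datetime) -> bool:
    """True when the timestamp falls exactly on the hour."""
    return ts.minute == 0 and ts.second == 0 and ts.microsecond == 0


def check_mesa(start: datetime, stop: datetime, cpu_pct: list[float],
               capacity_pct: float = 100.0,
               max_cv: float = 0.10,      # assumed flatness tolerance
               max_util: float = 0.70     # assumed saturation ceiling
               ) -> list[str]:
    """Return a list of problems; an empty list means the run looks like a mesa."""
    problems = []
    if not on_hour_boundary(start):
        problems.append("start is not on an hour boundary")
    if not on_hour_boundary(stop):
        problems.append("stop is not on an hour boundary")
    if stop <= start:
        problems.append("stop is not after start")
    if not cpu_pct:
        problems.append("no CPU samples")
        return problems
    avg = mean(cpu_pct)
    # Coefficient of variation: a ramped or bursty run has a large spread.
    if avg > 0 and pstdev(cpu_pct) / avg > max_cv:
        problems.append("CPU is not flat (ramping or bursty load?)")
    if max(cpu_pct) > max_util * capacity_pct:
        problems.append("CPU is too close to saturation")
    return problems


good = check_mesa(datetime(2010, 3, 1, 9, 0), datetime(2010, 3, 1, 11, 0),
                  [31, 30, 32, 31, 30, 31])
ramped = check_mesa(datetime(2010, 3, 1, 9, 5), datetime(2010, 3, 1, 11, 0),
                    [5, 15, 30, 31, 30, 10])
print(good)
print(ramped)
```

Running a check like this on the practice run gives the testers concrete feedback days before the real test.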



The politics of capacity planning

in organizations


How to win friends and influence people in the operations group


Set up “being on the capacity planning team” as an aspirational goal, a
promotion path, for the operations folks


Try to find an operations or O/S expert at the top of their game and
get them assigned to the capacity planning effort


These are often the best acolytes and really take well to capacity planning


As the operations staff start to use the capacity planning reporting and
pathology detection systems


Praise their efforts and successes to management


Coach their failures privately


Get them (and their management) to realize that keeping process
pathology counts down reduces emergencies and call-outs, and
greatly contributes to system stability


Train them on the tools so they start to use them and build new skills


If the only users of the capacity planning reports are on the capacity
planning team, you are doing something wrong!





The politics of capacity planning

in organizations


How to win friends and influence people in the application
development group


In addition to the barriers presented previously, you may also
encounter


The earnest improver, who takes the time to learn about new technologies
and tries to integrate their benefits into their software development lifecycle


The non-technical manager, who may never understand all of the math and
formulas, but who will be far better at the political skill required for success


External vendors whose future profits hinge on success


Try to become an asset to each of these groups


make sure that they see you as a willing partner in their success


work late on their models


help them succeed and get the resources that they need when they need
them


Send mail when you work early, late or on the weekends (and CC your
boss of course), it shows that you are really trying to help


The politics of capacity planning

in organizations


How to win over and influence your boss


There are several types of bosses


The experienced true believer


The unbeliever


The unconvinced cost counter


There are techniques to deal with each


Your goal is to convert the last two into the first one!


Keeping all happy will involve deploying collectors, generating
workload characterized historical consumption web pages and
“What if..?” models of future consumption


The key is to survive long enough to


get a proper network queuing theory model based software
purchased in sufficient quantity to make a difference


Get some applications leadership on your side


keep the last two from canning you before you start to get
meaningful results on a large scale




The experienced true believer


Usually you have worked with or for this boss before, so they
already know


How expensive the tools can be, so they are not shocked


What a reasonable time for results is


How to help enable your success


What battles to fight, and what battles to avoid


My last 4 gigs have been for someone who I had either consulted
for or worked for


Delivering results delivers career options for you!


Characteristics of the experienced believer


Patience


Helps get the software quickly


Helps break through organizational politics to get your collectors
quickly deployed


Projects confidence in meetings with other management









The unbeliever


These folks (often with a development background) are distrustful
of fancy methods like network queuing theory


This is often based on an insecurity, they don’t understand complex
tools and thus distrust them


Have made their career by betting on simple solutions and
extrapolating linearly


Are often in their position due to management turmoil


In several gigs I’ve had non-believers in the management structure
above me


Characteristics of the non-believer


Initial open contempt of scientific capacity planning methods


Demand results before they help you get collectors in place to answer
it with a historical basis


Often will throw CPU and memory at disk I/O slowness


Can be turned, but wow, it sure takes patience!





The unconvinced cost counter


These can be great bosses in time, because like scientists, they demand
proof before supporting you, but once they have it, they will be true
believers


They either have no experience with sophisticated capacity planning, or
have had running the group forced on them by higher ups who have


Characteristics of the unconvinced cost counter


Repeated references early in the process to how much your group and your
software costs, and lots of implying that savings results had better surpass
that soon


Caution early on, so they will spend the time with other departments getting
them to go along with you


Thrive on informational updates, so show steady progress


You don’t have to be perfect, just constantly getting better


You’ll know when they switch to true believers when


They start buying you more licenses!


They stop complaining about costs


The “We need to show results!” to “Do you need more licenses?”
conversion





Reporting


There are a lot of tragically bad business graphics and
especially capacity planning reports out there. Issues
include:


Graphics that distort the viewer’s perceptions


Quasi-3D


Black outlines around bar charts


Non-calendar displays of long spans of time


No color consistency


Foolish consistency may be the hobgoblin of little minds, but it is also the
key to getting management to use your site for decision making (don’t
pay attention to “little minds” and “management” appearing in the same
sentence…)


Lots of chrome, little content


Tufte: “Question every pixel. Basically, any pixel that isn’t conveying new
data, get rid of it!”




Reporting


Other issues that limit effectiveness


Multi-page reports that nobody ever reads


If your answer is so complex that it requires that much evidence, start
over on a new one


They paid $10,000! It has to hit the desk with a thud!


The “same thud” lives on!


Relying on the untrained user to wade in and find the answers
themselves


Some you can train, most no


If any correlation of graphics requiring memory is needed, forget it


Ron’s Position:


Non-web presentations in general are useless relics of a bygone age.
Most of your readers’ data comes in hyperlinked form, so get with it or
be left behind


Web reports of all nodes in the firm


Most users really appreciate ways to see only their span of control




Reporting


There are also some “Must haves”


Automated context that graphically highlights when something is out of the
ordinary (managers love this stuff)


Automated business and hardware context, ideally driven by your CMDB,
that includes


Hardware and software specifics


Business Purpose


Business owner


Primary and backup technical contacts


Ideally a text description of its business function


Other helpful links



The Zen of Great Reporting


Seek minimalism in all parts of it


Reduce graphic clutter


Reduce user perceived complexity


Workload color consistency is a simple “must-have”


Reduce user choices and actions


If the user needs to know 4 things to make a decision, they had
better be close on the same web page


Add extra information that lets the user more fully
understand odd behaviors and situations


Sorting it by date is nice too


Don’t restrict yourself to measured quantities


Workload response time detail is one of the most powerful
graphics that you can use



Reporting Examples


Reporting Examples (UNIX)


Reporting Examples (Windows)


Reporting Examples (Windows)

Tangent, Multiple Memory Leaks


Here is an example of a rather severe repeating set of memory leaks


See the saw-toothing memory?


See the climbing Commit Bytes in a different sequence?
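A saw-tooth like this can also be flagged automatically rather than by eye. The sketch below is one hypothetical way to do it, not the detection code behind these slides: it splits a memory series wherever the value drops sharply (assumed here to mean a restart or crash) and reports a leak when every remaining segment climbs. The function names, the 50% drop threshold and the sample values are all assumptions.

```python
def slope(values: list[float]) -> float:
    """Least-squares slope of the values against their sample index."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def split_at_resets(values: list[float], drop_fraction: float = 0.5) -> list[list[float]]:
    """Start a new segment whenever a sample falls by more than drop_fraction."""
    if not values:
        return []
    segments = [[values[0]]]
    for prev, cur in zip(values, values[1:]):
        if cur < prev * (1 - drop_fraction):
            segments.append([cur])       # the tooth fell: process restarted
        else:
            segments[-1].append(cur)
    return segments


def looks_like_leak(values: list[float], min_segment: int = 3,
                    drop_fraction: float = 0.5) -> bool:
    """True when every sufficiently long segment between resets is climbing."""
    segments = [s for s in split_at_resets(values, drop_fraction)
                if len(s) >= min_segment]
    return bool(segments) and all(slope(s) > 0 for s in segments)


saw_tooth = [100, 200, 300, 400, 90, 190, 290, 390, 80, 180, 280]
steady = [300, 305, 298, 302, 299, 301]
print(looks_like_leak(saw_tooth), looks_like_leak(steady))
```

A production version would want a minimum growth threshold as well, since noisy but healthy data can show a small positive slope by chance.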


Reporting Examples (Windows)

Tangent, Multiple Memory Leaks


When you dig deeper, you can see memory totals by process owner


People often want to “blame someone”


Alas sometimes the “Someone” is harder to pin down by just username


Reporting Examples (Windows)

Tangent, Multiple Memory Leaks


When you dig deeper, you can see the individual process names leaking


In time you’ll find the best way to keep them unique; we use process start
date/time and PID


You can show these to the Fake_Name vendor and then it is hard for
Fake_Name to deny a memory leak


I believe that Java is Finnish for “memory leak”



Reporting Examples (Windows)

Tangent, Multiple Memory Leaks


Well it is hard to deny a leak, but some Fake_Name vendor might want raw data,
so…


Since you already have it, put out some csv files to be easily mailed to
the vendor, eliminating one of their stall tactics
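As a rough illustration of the csv hand-off, here is a minimal sketch that writes per-process memory samples keyed by process start date/time and PID, the uniqueness scheme mentioned earlier. The column names, the key format, the `example.exe` process and the sample values are all hypothetical; the original reports were built with different tooling.

```python
import csv
import io
from datetime import datetime


def process_key(start: datetime, pid: int) -> str:
    """Unique process identifier built from start date/time and PID."""
    return f"{start:%Y%m%dT%H%M%S}-{pid}"


def write_leak_csv(rows: list[dict], out) -> None:
    """Write one line per memory sample to any file-like object."""
    writer = csv.writer(out)
    writer.writerow(["process_key", "process_name", "sample_time", "memory_bytes"])
    for row in rows:
        writer.writerow([
            process_key(row["start"], row["pid"]),
            row["name"],
            row["sample_time"].isoformat(),
            row["memory_bytes"],
        ])


# Hypothetical samples for a single leaking process.
samples = [
    {"start": datetime(2010, 3, 23, 2, 0), "pid": 4312, "name": "example.exe",
     "sample_time": datetime(2010, 3, 23, 3, 0), "memory_bytes": 52428800},
    {"start": datetime(2010, 3, 23, 2, 0), "pid": 4312, "name": "example.exe",
     "sample_time": datetime(2010, 3, 23, 4, 0), "memory_bytes": 73400320},
]

buffer = io.StringIO()
write_leak_csv(samples, buffer)
print(buffer.getvalue())
```

Keying on start time plus PID keeps two runs of the same executable apart even when the operating system reuses a PID after a restart.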


Reporting Examples (Windows)

Tangent, Multiple Memory Leaks


The right way to convey the message


We detected the issue, and sent mail to the application owner, stating


The exact processes with the issue


They can expect to keep crashing every day or so until they get the vendor to
fix it


Offers to help with data or technical calls


We get no response at all


Three weeks later, we get a request to add memory to the
machine…


The owner “Can’t get the vendor to respond quickly” and wants to
reduce outage counts in the mean time


Don’t get mad…


Stay positive and helpful in tone, they are just trying to help their
users have fewer outages…


but continue to urge them to turn up the heat on their vendors, but do
it in a nice way…


Reporting Examples


Reporting Examples


New! Reporting Examples Windows


New! Reporting Examples UNIX


Reporting Examples


Classic capacity planning question descriptions

and proper answering techniques


Capacity issues are usually an emergency to someone


Roughly 93% of the requests for upgrades are nonsensical if you have any
historical workload based resource consumption information


So you have to say no in a way that makes the evidence clear


What to expect when you say no:


The 5 stages of grief (also called the Kübler-Ross model)
http://en.wikipedia.org/wiki/K%C3%BCbler-Ross_model



Denial


Anger


Bargaining


Depression


Acceptance


Always give them a way to succeed along with your “no”; remember that they may
still have a real problem!


“No, you don’t need CPU or memory, but you are doing 5500 I/Os a second to your slow,
locally attached C drive.”


Can you turn down logging?


Can you send those I/Os to fast SAN or RAM drives?


Can you get help from your DBA pals?


“No, you don’t need more CPUs, you need to fix those looping processes.”



Classic capacity planning question descriptions

and proper answering techniques


Here is the pattern for this next section:


Real quotes from the users (disguised, slightly)


The evidence


The answer


What happened


I want some interaction on these, if you did it better, speak up!
Share! That is what CMG is for!


These graphs used in the examples are all homebrew perl and
GD:Graphics, and they are used at several firms


Yes I will share the code if you want it, but sheesh, you can do better!


You are going to want some form of screen graphics capture tool


I use freeware ZScreen, downloadable from many sources, it is fabulous


Classic capacity planning question descriptions

and proper answering techniques


User quote


“We are keeping these machines
rather heavily loaded.” but they
won’t tell you why


The evidence





Classic capacity planning question descriptions

and proper answering techniques


The answer


It turns out that this
application was on
three nodes, two
heavily used and one
lightly used


They wanted a review
of each


Is ustca027 too
empty?


Is ustwa007 too full?


Is ustca031 too full?


Let’s use Relative
Response Time by
hour to answer them




Is ustwa007 too full?




Is ustca031 too full?


Classic capacity planning question descriptions

and proper answering techniques


What happened


The users are initially shocked to see that the capacity planners, whom
they view as machine stealers for VMs, are recommending that they get
more hardware!


Once they started to understand relative response time graphs, they
became quite sophisticated at moving workloads around


You’ll know that you’ve converted them when they e-mail you asking if
their IO_Wait could be solved if they split them over more drives or
better RAID choices


The morals of the story


Any vendor can show totals


Favor vendors that show workload characterized historical views of
consumption


Favor vendors that can show you workload relative response times, so
that your answers make sense to the business



Classic capacity planning question descriptions

and proper answering techniques


We started getting warnings from our
automated checks:

10/03/23 CPU_SATURATION_WARNING: Windows2003 node in04sqp001 used
up to 392.920% of an available 400% from 2010/03/23 at 0200 until 2300.

10/03/26 CPU_SATURATION_WARNING: Windows2003 node in04sqp001 used
up to 394.572% of an available 400% from 2010/03/26 at 0000 until 2300.

10/03/27 CPU_SATURATION_WARNING: Windows2003 node in04sqp001 used
up to 396.000% of an available 400% from 2010/03/27 at 0000 until 2300.

10/03/28 CPU_SATURATION_WARNING: Windows2003 node in04sqp001 used
up to 392.920% of an available 400% from 2010/03/23 at 0300 until 2300.
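Warnings in this format are straightforward to generate from hourly peak data. The sketch below is a guess at the general shape of such a check, not the actual code behind these messages: the function name, the 95% threshold and the "first hot hour until last hot hour" rule are assumptions.

```python
from datetime import date
from typing import Optional


def saturation_warning(day: date, os_name: str, node: str,
                       hourly_peak_pct: dict[int, float], cpus: int,
                       threshold: float = 0.95) -> Optional[str]:
    """Return a warning line when any hour peaks near total CPU capacity.

    hourly_peak_pct maps hour of day (0-23) to peak CPU percent, where
    100% equals one fully busy CPU. The 95% threshold is an assumption.
    """
    capacity = 100 * cpus
    hot_hours = sorted(hour for hour, pct in hourly_peak_pct.items()
                       if pct >= threshold * capacity)
    if not hot_hours:
        return None
    peak = max(hourly_peak_pct[hour] for hour in hot_hours)
    return (f"{day:%y/%m/%d} CPU_SATURATION_WARNING: {os_name} node {node} "
            f"used up to {peak:.3f}% of an available {capacity}% "
            f"from {day:%Y/%m/%d} at {hot_hours[0]:02d}00 "
            f"until {hot_hours[-1]:02d}00.")


# Hypothetical hourly peaks: quiet until 0200, then pinned for the rest of the day.
peaks = {hour: (392.92 if hour >= 2 else 120.0) for hour in range(24)}
message = saturation_warning(date(2010, 3, 23), "Windows2003", "in04sqp001",
                             peaks, cpus=4)
print(message)
```

Because the check returns `None` on quiet days, a nightly job can mail only the nodes that actually produced a line.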


The evidence (here’s what the sparkline
looked like):






Classic capacity planning question descriptions

and proper answering techniques


More evidence:


Classic capacity planning question descriptions

and proper answering techniques


My initial suspicions were “Code improvement opportunities”
so I contacted my DBA pals:




Classic capacity planning question descriptions

and proper answering techniques


Those CPU graphs with response time increases due
to CPU_Wait when they hit the “knee in the curve”:
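The “knee in the curve” can be illustrated with a textbook single-queue model. This is a deliberate simplification, assuming an M/M/c queue (Poisson arrivals, exponential service, c identical CPUs) rather than the network queuing models that commercial capacity tools use; the function name is hypothetical. It computes relative response time, meaning response time divided by pure service time, from the Erlang C formula.

```python
from math import factorial


def relative_response_time(utilization: float, cpus: int) -> float:
    """Response time divided by service time for an M/M/c queue.

    utilization is per-CPU busy fraction (0 <= utilization < 1).
    A result of 1.0 means no queueing; 2.0 means half the time is CPU wait.
    """
    if not 0 <= utilization < 1:
        raise ValueError("utilization must be in [0, 1)")
    if cpus < 1:
        raise ValueError("need at least one CPU")
    offered = utilization * cpus                      # offered load in Erlangs
    tail = offered ** cpus / factorial(cpus) / (1 - utilization)
    head = sum(offered ** k / factorial(k) for k in range(cpus))
    p_wait = tail / (head + tail)                     # Erlang C: chance of queueing
    return 1 + p_wait / (cpus * (1 - utilization))


for busy in (0.50, 0.80, 0.90, 0.95, 0.98):
    print(f"{busy:.0%} busy on 4 CPUs -> "
          f"{relative_response_time(busy, 4):.2f}x service time")
```

The curve stays nearly flat through moderate utilization and then climbs steeply, which matches the pattern of a node pinned near 400% of 400% showing response times dominated by CPU wait.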



Classic capacity planning question descriptions

and proper answering techniques


The answer from my DBA pals:





Classic capacity planning question descriptions

and proper answering techniques


What happened (the changes went in on Mar 29th):




Classic capacity planning question descriptions

and proper answering techniques


What about the charts, Ron?


Classic capacity planning question descriptions

and proper answering techniques


Things to learn from this example:


Not all code “innovations” work as efficiently as desired


SQL developed in far flung places for even farther flung places is
especially suspect


“When the answer is correct, the code is done”, well maybe not…


Not all innovations will go through a rigid capacity planning
review


You need either automated warnings or to take the time to scan
thousands of graphs often to detect and correct these


You need fast graphical evidence to get fast reactions


You need to go out of your way to be nice to DBAs, they will
save your firm millions if you let them, and if you only ring them
up when there is real evidence of mayhem


Always ask their boss to praise their efforts, those memos come in
handy at review time





Classic capacity planning question descriptions

and proper answering techniques


Many of you will
be deploying
virtual terminal
environments to
hundreds of
users


What if
something
goes a little
wrong?


The evidence:




Classic capacity planning question descriptions

and proper answering techniques


The answer:


We started ticketing suspicious CPU consuming VMware slices on Feb 3rd


Most of it was Bezier curve screen savers! We banned them


What happened:


We got back more than half of our VMware farm!





Classic capacity planning question descriptions

and proper answering techniques

User quote:



I was wondering if we could get the memory increased on our Exchange 2007 CAS
servers USTCAX100 and USTWAX100? Right now both servers are running 4.25GB
and I would like to move them to 8GB. We are seeing performance issues with
those servers and we are noticing that RAM usage is at 80%-90% or higher all of
the time. Users are starting to notice this with Communicator. Due to the fact that
it can’t get a response quick enough from CAS, it is putting an exclamation point
on the communicator alerting them to address book issues. If we are not able to
increase the memory, the only other option would be to add more CAS servers in
the environment to balance the load.




We also are going to be increasing the load on these servers with the 2000 users
we will be adding to the North America environment from the XYZ Co. acquisition
and moving South American users to North America servers.




Please let me know if this is feasible or not?



The evidence:


First, look to see if anything has gone wrong recently. They might be
reacting to a recent problem, but don’t stop there



The evidence:


Looking deeper, we don’t see a memory shortage (there is evidence of a
slight leak): paging is very low, CommitBytes isn’t anywhere near
CommitLimit, but …


CPU seems in short supply, and the CPU Wait component of relative
response time is huge


Their short term performance issue is due to CPU shortage, not memory!
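
The reasoning on this slide can be sketched as a small triage routine: check the memory indicators (commit ratio, paging) and the CPU indicators (utilization, CPU wait share) separately, and let the evidence name the bottleneck. Everything below is an assumption for illustration: the `Snapshot` fields, the `triage` function and all thresholds are invented here and are not the author's tooling or recommended limits.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    commit_bytes: float    # committed virtual memory, in bytes
    commit_limit: float    # commit limit, in bytes
    pages_per_sec: float   # paging rate
    cpu_busy_pct: float    # CPU utilization, 0-100
    cpu_wait_share: float  # fraction of response time spent waiting for CPU, 0-1


def triage(s, commit_ratio=0.8, paging_rate=100.0, cpu_busy=90.0, cpu_wait=0.5):
    """Return the resources that look short, based on illustrative thresholds."""
    findings = []
    if s.commit_bytes / s.commit_limit >= commit_ratio or s.pages_per_sec >= paging_rate:
        findings.append("memory")
    if s.cpu_busy_pct >= cpu_busy or s.cpu_wait_share >= cpu_wait:
        findings.append("cpu")
    return findings or ["none"]


# Hypothetical numbers shaped like the case above: commit well below the
# limit, very little paging, but the CPU pegged and CPU wait dominating.
cas_server = Snapshot(
    commit_bytes=3.0e9,
    commit_limit=8.0e9,
    pages_per_sec=5.0,
    cpu_busy_pct=97.0,
    cpu_wait_share=0.7,
)
print(triage(cas_server))
```

Note that "RAM usage is at 80%-90%" never enters the check. High memory occupancy alone is not evidence of a shortage, which is why the sketch looks at commit and paging instead.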



The Answer:


Along with the graphs from the previous page (and getting them to
address the lsass loop) we added two virtual processors to this
VMware slice


Note that if you disagree with their solution, give them an
alternative that fixes present issues


We may give them more memory later, when they’ve earned it



What happened:


The CPU Wait disappeared immediately


The user’s immediate issues were solved


The users now know that decisions will be based on evidence, the
results will be real, and they like it!


Hardware in use for a growing application will grow, but slowly



Hey folks, there is still one more issue, with the imjpmig
process, the Input Method Editor, which lets you use
Japanese characters. It is looping regularly:



10/01/15 LOOP_PROBLEM: 3444 running imjpmig CPU looped from Jan 15 04:59:54 until Jan 15 23:54:53 and may still be looping.

10/01/16 LOOP_PROBLEM: 3444 running imjpmig CPU looped from Jan 16 00:07:48 until Jan 16 23:54:58 and may still be looping.

10/01/21 LOOP_PROBLEM: 5344 running imjpmig CPU looped from Jan 21 13:59:59 until Jan 21 23:54:58 and may still be looping.

10/01/22 LOOP_PROBLEM: 5344 running imjpmig CPU looped from Jan 22 00:01:27 until Jan 22 23:54:56 and may still be looping.

10/01/23 LOOP_PROBLEM: 5344 running imjpmig CPU looped from Jan 23 00:01:25 until Jan 23 23:54:53 and may still be looping.



I changed the workload to just highlight Input Method
Editor by itself. I also found a bunch of patches available:
http://search.microsoft.com/Results.aspx?q=imjpmig+downloads&mkt=en-US&FORM=QBME1&l=1&refradio=0&qsc0=0






Sometimes your own systems detect
problems, so answer in a way that
provides all required information
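
A loop detector that produces messages shaped like the LOOP_PROBLEM lines above could look like the sketch below. It is an assumption-laden illustration, not the author's pathology detection code: the sample layout, the 95% busy threshold, the four-hour minimum and the names `find_loops` and `format_warning` are all invented here.

```python
from datetime import datetime, timedelta


def find_loops(samples, busy_pct=95.0, min_duration=timedelta(hours=4)):
    """Find processes that stay at or above busy_pct for at least min_duration.

    samples: iterable of (timestamp, pid, command, cpu_pct), sorted by timestamp.
    Returns a list of (pid, command, start, end) tuples.
    """
    runs = {}   # (pid, command) -> [start, end] of the current busy run
    found = []

    def close(key):
        start, end = runs.pop(key)
        if end - start >= min_duration:
            found.append((key[0], key[1], start, end))

    for ts, pid, command, cpu_pct in samples:
        key = (pid, command)
        if cpu_pct >= busy_pct:
            if key in runs:
                runs[key][1] = ts        # extend the current busy run
            else:
                runs[key] = [ts, ts]     # start a new busy run
        elif key in runs:
            close(key)                   # the process calmed down
    for key in list(runs):
        close(key)                       # runs still busy at end of data
    return found


def format_warning(pid, command, start, end):
    """Render one finding in the style of the messages shown above."""
    return (f"{start:%y/%m/%d} LOOP_PROBLEM: {pid} running {command} CPU looped "
            f"from {start:%b %d %H:%M:%S} until {end:%b %d %H:%M:%S} "
            f"and may still be looping.")


# Hypothetical hourly samples: one pegged process and one normal one.
base = datetime(2010, 1, 15, 5, 0, 0)
samples = [(base + timedelta(hours=h), 3444, "imjpmig", 99.0) for h in range(6)]
samples += [(base + timedelta(hours=h), 812, "sqlservr", 40.0) for h in range(6)]
samples.sort(key=lambda sample: sample[0])

warnings = [format_warning(*loop) for loop in find_loops(samples)]
for line in warnings:
    print(line)
```

Putting the process id, command name and exact time window in the message is what lets the recipient act without asking follow-up questions.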

What happened?


Eventually they got the fix migrated to production and everything
worked fine from then on


Don’t get discouraged if folks don’t always do what you want
immediately


Change controls, priority conflicts and other issues may stall the fix


With enough graphical evidence, eventually you will win!


Ron logs in on a Saturday to work on slides for UKCMG (“Again! And
what do you get paid to do this?” asks my dear wife) and sees the
following:


The evidence (from my pathology detection code’s morning mail)


CPU saturation found:


CPU_SATURATION_WARNING: Windows2000 node ustca337 used up to 99.000% of an available 100% from 2010/03/12 at 0400 until 2300.

CPU_SATURATION_WARNING: Windows2003 node ustwasbx16 used up to 99.000% of an available 100% from 2010/03/12 at 1400 until 2300.

CPU_SATURATION_WARNING: Windows2003 node uktcas06 used up to 99.000% of an available 100% from 2010/03/12 at 0300 until 2300.

CPU_SATURATION_WARNING: Windows2003 node ustca227 used up to 99.000% of an available 100% from 2010/03/12 at 0400 until 2300.

CPU_SATURATION_WARNING: Windows2003 node ustca724 used up to 99.000% of an available 100% from 2010/03/12 at 0400 until 2300.

CPU_SATURATION_WARNING: Windows2003 node ustcas44 used up to 99.000% of an available 100% from 2010/03/12 at 0400 until 2300.

CPU_SATURATION_WARNING: Windows2003 node ustcas54 used up to 99.000% of an available 100% from 2010/03/12 at 0400 until 2300.

CPU_SATURATION_WARNING: Windows2003 node ustca088 used up to 99.000% of an available 100% from 2010/03/12 at 0800 until 2300.




The evidence continued


Whenever a whole bunch of bad things happen
synchronized over many machines, think global tool
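
The "think global tool" rule can be automated by grouping the morning-mail warnings by date and counting how many distinct nodes fired. This is a rough sketch under assumptions: the regular expression is written against the warning text shown on the earlier slide (one warning per line), and the function name `fleet_events` and the three-node threshold are made up for illustration.

```python
import re
from collections import defaultdict

# Matches one-line warnings in the format shown in the morning mail.
PATTERN = re.compile(
    r"CPU_SATURATION_WARNING: (?P<os>\S+) node (?P<node>\S+) used up to "
    r"(?P<pct>[\d.]+)% of an available 100% from (?P<date>\d{4}/\d{2}/\d{2}) "
    r"at (?P<start>\d{4}) until (?P<end>\d{4})"
)


def fleet_events(lines, min_nodes=3):
    """Return {date: sorted node names} for dates on which at least
    min_nodes distinct nodes reported CPU saturation."""
    by_date = defaultdict(set)
    for line in lines:
        match = PATTERN.search(line)
        if match:
            by_date[match["date"]].add(match["node"])
    return {
        date: sorted(nodes)
        for date, nodes in by_date.items()
        if len(nodes) >= min_nodes
    }


mail = [
    "CPU_SATURATION_WARNING: Windows2000 node ustca337 used up to 99.000% "
    "of an available 100% from 2010/03/12 at 0400 until 2300.",
    "CPU_SATURATION_WARNING: Windows2003 node uktcas06 used up to 99.000% "
    "of an available 100% from 2010/03/12 at 0300 until 2300.",
    "CPU_SATURATION_WARNING: Windows2003 node ustca227 used up to 99.000% "
    "of an available 100% from 2010/03/12 at 0400 until 2300.",
]
events = fleet_events(mail)
print(events)
```

A date that shows up here is a hint to look for something deployed everywhere at once (a monitoring agent, a script push, a patch) before chasing each server separately.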










This is really bad news: a Business Sensitive / Critical production
server doing its normal, real sqlservr workload, with a Tool process
going on a CPU binge and causing excessive response times due to
CPU_Wait



The answer


A new piece of monitoring code was installed, BREAKING THE “NO NEW
CODE INSTALLS ON A FRIDAY” rule!


What happened


The code creator had deployed a new script, and he reviewed it after
getting mail about all of the warnings:


“This was a bug in a script update that I made; we should be seeing this behavior on
most of the attached server list. ______ is pushing out an update to the script now;
once this is done we’ll have to log into each of the affected servers, verify the
looping process is running sqlcheck.vbs, and kill it.”




We were able to swiftly detect and fix the issue


How would your site do this?



What we saw:


We started getting Commit_Bytes approaching Commit_Limit warnings:

10/04/05 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 5 18:00:00 until Apr 5 23:59:00 and may still be.

10/04/06 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 6 00:00:00 until Apr 6 23:59:00 and may still be.

10/04/07 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 7 00:00:00 until Apr 7 23:59:00 and may still be.

10/04/09 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from Apr 9 00:00:00 until Apr 9 23:59:00 and may still be.

10/04/10 COMMIT_BYTES_PROBLEM: Commit Bytes were within 80% of Commit Limit from