Nagios Implementation Case: Eastman Kodak Company

Eric Loyd

Founder & CEO

Bitnetix Incorporated

eric@bitnetix.com

www.bitnetix.com

877.BITNETIX

© 2012 Bitnetix Incorporated

About Eric Loyd and Bitnetix

Founder and CEO of Bitnetix Incorporated

VOIP services and IT/network consulting

25 Years in IT at places like

Eastman Kodak

Frontier Communications

Global Crossing

Bitnetix started its seventh year in July, 2012

2012 Digital Rochester GREAT Award Finalist in Communications Technology

Using Nagios to monitor our client equipment and VOIP platform, and still using it at Kodak (since 2004)

2011 Nagios World Conference

A History of Eastman Kodak’s kodak.com Web Server Infrastructure (non-confidential)


History of kodak.com

Pre-2004

Machines located in Rochester, NY

Public Apache servers

Reverse proxy Apache servers

Application servers (ATG/Dynamo, Tomcat, etc)

Database boxes, Production Support, etc.

2004

Moved ~80 machines from ROC -> ???

ROC <-> ??? Firewalls

Bandwidth requirements

Minimal user impact

Flipped the switch, went live



History of kodak.com

Some of the things kodak.com did at the time

Consumer store and product information

B2B portal and wholesaler purchasing

“Picture Of The Day” (www.kodak.com/go/potd)

Warranty registration

Photo lab calibration strips

“Phone home” reports for printers, docks, cameras, etc

Software/firmware updates

Corporate press releases, bios, and regulatory information

Reverse proxy for internal information through secure channels

Dozens of sitelets for products and campaigns


Why Kodak Chose Nagios to Monitor kodak.com


Why Nagios?

No centralized corporate monitoring software

Nothing to compete with internally

Nothing to build on, either

Cost

No additional cost beyond existing human resources

Framework

Nagios worked with firewalls without needing agents

Leverage SSH, HTTP and other remote protocols

Custom checks and notifications (very important)


Initial Hurdles in the New Complex Server Environment

kodak.com Network


Initial hurdles

Firewalls

Public load balancers on external Internet IPs

Public Apaches in Zone 1, Kodak network

Reverse proxy, app servers in Zone 2, semi-secure

Nagios machine in internal Zone 3, most secure

Complex “top” and “bottom” checks for web site

Is the site working from the user’s perspective (top)?

From the application side (bottom)?

How to separate apparent from actual failure
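One way to picture the top/bottom distinction is a small helper that combines the two check results. This is only an illustrative sketch: the function name `classify_site` and its four labels are assumptions and do not come from the Kodak setup. The arguments are Nagios plugin exit codes, where 0 means OK and anything else means a problem.

```shell
#!/bin/sh
# Hypothetical helper: combine a "top" (user-perspective) check result with a
# "bottom" (application-side) check result. Names and labels are illustrative.
classify_site() {
    cs_top=$1       # exit code of the check made from the user's side
    cs_bottom=$2    # exit code of the check made directly against the app

    if [ "$cs_top" -eq 0 ] && [ "$cs_bottom" -eq 0 ]; then
        echo "ok"       # both views agree the site works
    elif [ "$cs_top" -ne 0 ] && [ "$cs_bottom" -eq 0 ]; then
        echo "path"     # app answers directly; suspect something in between
    elif [ "$cs_top" -eq 0 ] && [ "$cs_bottom" -ne 0 ]; then
        echo "hidden"   # users are still served, but an app instance is unhealthy
    else
        echo "app"      # both views fail; the application itself is the suspect
    fi
}
```

For example, `classify_site 2 0` reports `path`: the user-facing check is critical while the application check is fine, which points at the load balancer, Apache, proxy, or firewall layers rather than the application.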


Initial hurdles

No Internal Nagios Knowledge

It was a contractor who set up Nagios (me)

Contractors typically have a finite lifespan at Kodak

Contractor made custom checks, event handlers, and all Nagios configurations. Uh-oh…

Escalation and Paging

Screw it - let’s email everyone, every time, and let Thunderbird sort it all out

Paging done via texting gateway email address

Which means email gateway failure = notification failure

Twitter API as backup / current primary notification


SSH to Remote Servers


SSH to the rescue

One user, one key, infinite access

Software apps run as second user, with SSH auth

Additional robot accounts can be added at any time

Wrap existing checks in an SSH shell

Provides additional control, error handling, reporting

Allows all checks to submit results to SQL database
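Wrapping a check in SSH might look roughly like the sketch below. It is not the Kodak wrapper: `remote_check` and the `REMOTE_RUNNER` variable are hypothetical names introduced here. The sketch relies on two documented behaviours: Nagios plugins exit with 0 to 3 (OK, WARNING, CRITICAL, UNKNOWN), and `ssh` exits with 255 when the connection itself fails.

```shell
#!/bin/sh
# Hypothetical wrapper: run a check on a remote host over SSH and translate
# transport problems into a Nagios UNKNOWN (3) result.

# REMOTE_RUNNER is an assumption of this sketch. BatchMode makes ssh fail
# instead of prompting when key authentication does not work.
REMOTE_RUNNER=${REMOTE_RUNNER:-"ssh -o BatchMode=yes -o ConnectTimeout=10"}

remote_check() {
    rc_host=$1
    shift
    rc_status=0
    # Word splitting of $REMOTE_RUNNER is intentional (command plus options).
    rc_output=$($REMOTE_RUNNER "$rc_host" "$@" 2>&1) || rc_status=$?

    case "$rc_status" in
        0|1|2|3)
            ;;  # a valid plugin status: pass it through unchanged
        255)
            rc_output="UNKNOWN: could not reach $rc_host: $rc_output"
            rc_status=3
            ;;
        *)
            rc_output="UNKNOWN: unexpected exit $rc_status: $rc_output"
            rc_status=3
            ;;
    esac

    echo "$rc_output"
    return "$rc_status"
}
```

Keeping the SSH call in one function gives a single place for the extra error handling and reporting the slide mentions, and a natural spot to log each result.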

SQL Database Side Note: All custom scripts executed CLI Perl code that locked a file, logged to it, and unlocked it. A Perl cron job woke up every 5 minutes, locked the file, read it, pushed things to Oracle, unlocked, and deleted the log file. A second cron job pruned Oracle daily to 400 days of data and collapsed checks older than 30 days so that successive checks with the same status were removed.
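The lock, log, unlock cycle and the "collapse successive identical statuses" step can be sketched in shell. The original used Perl and Oracle; this sketch swaps in an atomic `mkdir` as the lock and a flat file as the store. Every name in it (`LOG_FILE`, `log_result`, `collapse_runs`, the pipe-separated record layout) is an assumption made for illustration.

```shell
#!/bin/sh
# Hypothetical log location; records are "epoch|host|service|status".
LOG_FILE=${LOG_FILE:-/tmp/check_results.log}

acquire_lock() {
    al_tries=0
    # mkdir is atomic: it either creates the lock directory or fails.
    until mkdir "$LOG_FILE.lock" 2>/dev/null; do
        al_tries=$((al_tries + 1))
        [ "$al_tries" -ge 50 ] && return 1   # give up after about 50 seconds
        sleep 1
    done
}

release_lock() {
    rmdir "$LOG_FILE.lock"
}

# log_result HOST SERVICE STATUS: append one record under the lock.
# date +%s is widely available but not strictly POSIX.
log_result() {
    acquire_lock || return 1
    printf '%s|%s|%s|%s\n' "$(date +%s)" "$1" "$2" "$3" >> "$LOG_FILE"
    release_lock
}

# collapse_runs: read time-ordered records on stdin and keep only the first
# record of each run of identical statuses per host/service pair.
collapse_runs() {
    awk -F'|' '
        { key = $2 "|" $3 }
        !(key in last) || last[key] != $4 { print }
        { last[key] = $4 }
    '
}
```

A periodic job could take the same lock, move the file aside, and load it into the database, which mirrors the five-minute cron job described above.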


Managing Nagios Configuration Files


Configuration Management


SCCS - Solaris’s “poor man’s CVS”

Pre-installed, no additional cost, existing expertise

Current configuration is managed through SVN

Rsync - the workhorse to move config files

Configuration Repository and Push (CRaP) directory

Cfengine

Local versus remote execution

Post-install, ignore pid files, deploy/restart, etc.

Makefile - the “CLI” to the entire process
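The rsync push step might be sketched as follows. The function name `push_config`, the overridable `RSYNC` variable, and the `*.pid` exclude pattern are assumptions based on the "ignore pid files" bullet, not the actual CRaP tooling. The rsync options used (`-a`, `--delete`, `--exclude`) are standard.

```shell
#!/bin/sh
# Hypothetical push step: mirror a checked-out config directory to a target.
# RSYNC can be overridden, for example with a function that only prints.
RSYNC=${RSYNC:-rsync}

push_config() {
    pc_src=$1     # local repository directory
    pc_dest=$2    # local path or user@host:/path

    # -a preserves permissions and timestamps; --delete removes files that
    # no longer exist in the repository; excluded pid files are left alone
    # on the destination because --delete skips excluded names.
    $RSYNC -a --delete --exclude='*.pid' "$pc_src/" "$pc_dest/"
}
```

A Makefile target could call this after validating the configuration, for instance with `nagios -v` on the main config file, so that a broken file is never pushed.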


Common Event Handler


Common Event Handler

EKrestart - That Which Does
Setup -> Arguments -> Conversions -> do_soft/hard? -> do_something? -> do_restart

do_restart
Lock, logs, SQL -> send_nagios -> SSH to remote -> Remote EKrestart -> Process args -> do_<service> -> send_nagios -> Unlock, log, SQL -> Terminate

do_<service>
Locks (level 2) -> Instance mapping -> Port mapping -> App restart -> Email & log -> Exit


A Closer Look at EKrestart

#!/bin/sh
PATH=...

# With -r as the first argument, run the client-side code instead.
# (client_code and the do_* functions must be defined above this point.)
[ "$1" = "-r" ] && client_code "$@"

# Arguments passed in by Nagios
host="$1"
service="$2"
baseService=`echo "$service" | awk -F: '{print $1}'`
state="$3"
type="$4"
tries="$5"
perfdata="$6"
class="<based on machine name, e.g., x-y-CLASS-nnn.kodak.com>"
number="<based on machine name, e.g., x-y-class-NNN.kodak.com>"

# Dispatch on the service state
case "$state" in
    OK)       do_fixit ;;
    WARNING)  do_nothing ;;
    UNKNOWN)  do_nothing ;;
    CRITICAL) do_something ;;
    *)        do_nothing ;;
esac
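The slide elides how `class` and `number` are derived from the machine name. Purely as an illustration, and assuming a dash-separated name whose third and fourth fields carry the class and number, the derivation could look like the sketch below. The helper names `parse_host` and `base_service`, the example host name, and the field positions are all hypothetical.

```shell
#!/bin/sh
# Hypothetical parsing helpers; the real naming rules are not on the slide.

# parse_host FQDN: print "<class> <number>" for names shaped like
# a-b-CLASS-NNN.example.com (field positions are an assumption).
parse_host() {
    ph_short=${1%%.*}                            # drop the domain part
    ph_class=$(echo "$ph_short" | cut -d- -f3)   # third dash-separated field
    ph_number=$(echo "$ph_short" | cut -d- -f4)  # fourth dash-separated field
    echo "$ph_class $ph_number"
}

# base_service NAME: text before the first colon, equivalent to the
# awk -F: '{print $1}' call in the slide but without a subprocess.
base_service() {
    echo "${1%%:*}"
}
```

With these definitions, `parse_host a-b-web-012.example.com` yields `web 012`, and `base_service Workers:3` yields `Workers`.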


A Closer Look at EKrestart

do_fixit() {
    case "$baseService" in
        Workers) do_restart ;;
        *)       do_nothing ;;
    esac
}

do_nothing() {
    # Defaults to silent when $debug is unset
    ${debug:-false} && echo "$service is in $state state ($type) for $tries tries."
}

do_something() {
    case "$type" in
        SOFT) do_soft ;;     # Take action before it's too late?
        HARD) do_restart ;;  # Hard CRITICAL - our last chance to take action
        *)    do_nothing ;;
    esac
}

do_soft() {
    case "$tries" in
        3|4|5) do_restart ;;  # Okay, let's restart it before it goes hard
        *)     do_nothing ;;  # Don't restart yet
    esac
}
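The CRITICAL branch of the decision logic above can be condensed into one pure function that is easy to exercise by hand. This is a sketch, not the production handler: `decide_action` is a hypothetical name, it only prints the decision instead of restarting anything, and it leaves out the OK/do_fixit branch. In a Nagios event handler the three inputs typically come from the `$SERVICESTATE$`, `$SERVICESTATETYPE$`, and `$SERVICEATTEMPT$` macros.

```shell
#!/bin/sh
# Hypothetical condensed decision function: prints "restart" or "nothing".
decide_action() {
    da_state=$1    # OK, WARNING, UNKNOWN or CRITICAL
    da_type=$2     # SOFT or HARD
    da_tries=$3    # current check attempt number

    case "$da_state" in
        CRITICAL)
            case "$da_type" in
                HARD)
                    echo "restart" ;;           # last chance to act
                SOFT)
                    case "$da_tries" in
                        3|4|5) echo "restart" ;;  # act before the state goes hard
                        *)     echo "nothing" ;;  # too early to restart
                    esac ;;
                *)
                    echo "nothing" ;;
            esac ;;
        *)
            echo "nothing" ;;
    esac
}
```

Note that shell `case` patterns separate alternatives with `|`; a pattern written as `3,4,5` would only match the literal string "3,4,5".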


A Closer Look at EKrestart

do_restart() {
    # <figure some stuff out, set up lock files, send_nagios, log to SQL, etc>

    # Placeholders in <...> are elided on the slide
    ssh $machine <EKrestart> -r do_$service <parameters>

    # <tear down, unlock, close log, send_nagios, log to SQL, etc>
    exit
}

# On the client side, we use the same EKrestart script, but start at client_code()
client_code() {
    host=`hostname`
    function="$2"
    service="$3"
    # (etc)
    eval $function
    exit
}

# Example function
do_Dynamo() {
    # lock file processing
    # turn off new sessions, wean existing ones
    # /etc/init.d/restart_dynamo_$instance
    # tear down
    return
}
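Because the client side runs whatever function name arrives over SSH, a defensive variant of the dispatch step may be worth considering. The sketch below is not the original code: it replaces `eval` with a direct call and checks the name first. `dispatch` and `do_Demo` are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Hypothetical handler standing in for a real do_<service> function.
do_Demo() {
    echo "restarting demo instance $1"
}

# dispatch NAME [ARGS...]: call NAME only if it looks like do_<word> and is
# actually defined; otherwise refuse with a non-zero status.
dispatch() {
    dp_fn=$1
    shift
    case "$dp_fn" in
        do_*[!A-Za-z0-9_]*)
            echo "refusing unsafe name: $dp_fn" >&2
            return 2 ;;
        do_?*)
            ;;  # looks like a handler name
        *)
            echo "refusing non-handler: $dp_fn" >&2
            return 2 ;;
    esac

    if ! command -v "$dp_fn" >/dev/null 2>&1; then
        echo "unknown handler: $dp_fn" >&2
        return 3
    fi

    "$dp_fn" "$@"   # direct call instead of eval
}
```

Calling the function directly avoids re-parsing the argument as shell code, so a value such as `do_Demo;id` is rejected rather than executed.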


Integrating Nagios into Operational Procedures


Integration with Operations

Homebrew API

nchart, send_nagios, nlog - all portable to other installations of Nagios on other machines

Integrate with start/stop scripts

Lock files. Lots of lock files! TOO MANY lock files!!

The “Rippler”

Leverage EKrestart, cron, and send_nagios

Pager / Twitter and lots of private Twitter feeds

Inter-group notifications

Predominantly with procmail
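The slides do not show how send_nagios works internally. One common way for a script to report a result back to Nagios is a passive check submitted through the external command file, using the documented `PROCESS_SERVICE_CHECK_RESULT` command. The sketch below assumes that approach; the function names and the default command-file path are hypothetical (the real path is whatever `command_file` is set to in nagios.cfg).

```shell
#!/bin/sh
# Hypothetical default; check command_file in nagios.cfg for the real path.
NAGIOS_CMD_FILE=${NAGIOS_CMD_FILE:-/usr/local/nagios/var/rw/nagios.cmd}

# format_passive_result EPOCH HOST SERVICE CODE OUTPUT
# Prints one external-command line in the format Nagios expects.
format_passive_result() {
    printf '[%s] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%s;%s\n' \
        "$1" "$2" "$3" "$4" "$5"
}

# submit_passive_result HOST SERVICE CODE OUTPUT
# Stamps the result with the current time and appends it to the command file.
# date +%s is widely available but not strictly POSIX.
submit_passive_result() {
    format_passive_result "$(date +%s)" "$@" >> "$NAGIOS_CMD_FILE"
}
```

A start/stop script could call `submit_passive_result web1 Dynamo 0 "OK: restarted by operator"` so that planned restarts show up in Nagios instead of looking like failures.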


Predictive Failure Recovery and a Good Night’s Sleep


Predictive Failure Recovery

On ATG/Dynamo (and other) services

do_soft triggers do_restart on third failure

do_hard always triggers restart

Notifications on fourth failure

Escalation to pager only on fifth notification

Nagios has time to restart things that are bad, or are going bad, prior to sending out notifications

Service check dependencies allow us to know whether it’s a bad application, server, or user experience

Twitter - follow private tweets with smartphone, use apps to acknowledge problems, and get an even better night’s sleep!!
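The timeline above can be written down as two small lookup functions. This is a reading of the slide rather than actual Kodak configuration: it assumes the service goes HARD on the fourth consecutive failure, and it treats "email" and "email+pager" as stand-ins for the real notification channels. The function names are hypothetical.

```shell
#!/bin/sh
# failure_stage N: what happens at the Nth consecutive failed check,
# assuming the HARD state is reached on the fourth failure.
failure_stage() {
    case "$1" in
        1|2) echo "wait" ;;            # too early to act
        3)   echo "restart" ;;         # soft restart before anyone is told
        4)   echo "restart+notify" ;;  # HARD: restart again, start notifying
        *)   echo "n/a" ;;             # beyond the scope of this sketch
    esac
}

# notification_channel N: who hears about the Nth notification.
notification_channel() {
    if [ "$1" -ge 5 ]; then
        echo "email+pager"   # escalation reaches the pager
    else
        echo "email"
    fi
}
```

Read together, the two functions show the point of the slide: a failing service gets two automated restart attempts and several quiet notifications before anyone is paged.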



Questions

Eric Loyd

Founder & CEO

Bitnetix Incorporated

eric@bitnetix.com

www.bitnetix.com

877.BITNETIX