VCS Testing


DRAFT
Version: 5/12/10 jl o


VCS High Availability Production Testing

Procedures and Contingency Preparation

This document describes the rationale and procedures, including precautions and contingency planning, for regular testing of Veritas Cluster Server (VCS) high availability operations for CDL applications.

VCS operations and testing are coordinated between the CDL (through its TechCouncil and Infrastructure unit) and IR&C (primarily through its Unix and Network Operations groups).

Rationale for testing

VCS high availability is provided at the application level, with VCS scripts that monitor applications and their resources and move them (offline, online, clean) to secondary nodes/hosts under certain conditions or when manually invoked. CDL/IR&C use of VCS is well documented (https://wiki.ucop.edu/display/CDLTC/VCS), but its mission-critical functionality and complexity, in an environment of constant application and infrastructure changes, require regular testing to ensure that it is working as expected and needed, i.e. that stakeholders have evidence-based confidence that CDL applications are, in fact, well protected and highly available.
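For reference, cluster, node, and service group state can be inspected from any cluster node before and after a test with the standard VCS commands below (a minimal sketch; grp_xyz is a placeholder group name):

    # Summary of cluster, node, and service group status
    hastatus -sum

    # State of every configured service group on every node
    hagrp -state

    # State of a single group (grp_xyz is a placeholder)
    hagrp -state grp_xyz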

Testing is performed for three cases:

1. To ensure proper VCS setup in new clusters or hosts coming into production. Primary responsibility resides with the Unix group.

2. To ensure successful failover during unexpected system problems in existing clusters, by simulating system faults. Responsibility is shared by CDL and IR&C, with IR&C conducting the actual tests through simulated system and network failures.

3. To ensure successful failover/switching for scheduled application or infrastructure maintenance. Primary responsibility resides with CDL application owners; testing is done by them, at their convenience, with courtesy notifications sent to the Unix group. Good practice is to conduct these tests prior to a maintenance that involves VCS switches (otherwise the maintenance can be slowed by troubleshooting VCS problems).

Procedures, including precautions and contingencies

VCS testing always includes these steps:

1. Scheduling (in consultation with all involved groups; explicit agreement recorded in TechCouncil and/or CDL-UnixSys meeting minutes and/or Outlook calendars)

2. Precautionary and contingency preparations

3. Testing and recording of results

4. Sharing of results and debriefing

5. Tear-down of precautions/contingencies, if necessary

Details for steps 2 and 3 are provided below for each test case.

Procedure details

Case 1: New cluster or replacement hosts

Goal: verify that applications move successfully during simulated system faults on all nodes of the cluster before release to CDL.

Step | Timing | Task | Who
Test1: simulate system crash (all nodes) | Discretionary | See details in Appendix I | Unixgrp
Test2: heartbeat links down | Discretionary | See details in Appendix I | Unixgrp & NOC
Test3: service link(s) down | Discretionary | See details in Appendix I | Unixgrp & NOC
Test4: simulate VCS processes crash | Discretionary | See details in Appendix I | Unixgrp
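For each of these tests, a before/after snapshot of cluster state makes it easy to confirm that every group ended up where expected. A minimal sketch (the file paths are only examples):

    # Before the test: record cluster, node, and group state
    hastatus -sum > /var/tmp/vcs_pretest.txt
    hagrp -state >> /var/tmp/vcs_pretest.txt

    # After the test: capture the same information and compare
    hastatus -sum > /var/tmp/vcs_posttest.txt
    hagrp -state >> /var/tmp/vcs_posttest.txt
    diff /var/tmp/vcs_pretest.txt /var/tmp/vcs_posttest.txt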


Case 2: Simulated system failures

Goal: applications move successfully during simulated system faults on one node of the cluster at least once/year (rotate to a different tested node each year).

Step | Timing | Task | Who
Prep/precaution (optional) | 1 week prior | Sync stage with production; prepare emergency DNS switch request (stage can become production) | App owner
Prep/precaution (optional) | 1 week prior | Test app/data restore from TSM | App owner & Unixgrp
Prep/precaution (optional) | 1 week prior | Confirm alternate host(s) for emergency install (from TSM or disk copy) | App owner & Unixgrp
Prep/precaution | 24 hours prior | Migrate sensitive apps to secondary node | App owner
Prep/precaution | 24 hours prior | Duplicate zpools < 50 GB (disk "backup"); see the sketch after this table | Unixgrp
Prep/precaution | 1-2 hours prior | Alert Ops & other relevant parties | Unixgrp
Prep/precaution | At start of tests | Ensure all hands on board | Unixgrp
Test1: simulate system crash | Discretionary; prefer early a.m. on a weekday | See details in Appendix II | Unixgrp
Test2: heartbeat links down | Discretionary; prefer early a.m. on a weekday | See details in Appendix I | Unixgrp & NOC
Test3: service link(s) down | Discretionary; prefer early a.m. on a weekday | See details in Appendix I | Unixgrp & NOC
Test4: simulate VCS processes crash | Discretionary; prefer early a.m. on a weekday | See details in Appendix I | Unixgrp
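One possible way to take the small-zpool disk "backup" noted above is a ZFS snapshot, optionally copied to another pool for an independent on-disk copy. This is only a sketch under assumptions: the pool and dataset names (apppool, scratchpool) are placeholders, and the Unix group's actual procedure may differ.

    # List pools and their sizes; candidates are pools under 50 GB
    zpool list

    # Take an instant, same-pool snapshot of the application dataset
    zfs snapshot apppool/data@pretest

    # Optionally copy the snapshot to another pool as an independent "backup"
    zfs send apppool/data@pretest | zfs receive scratchpool/data_pretest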


Case 3: Application failover/switch

Goal: the application moves successfully in advance of using VCS during application or infrastructure maintenance; apps should reside on each cluster node at least once/quarter.

Step | Timing | Task | Who
Prep/precaution | 24 hours prior | Notify IAS and Unix group of pending VCS activity | App owner
Test1: can VCS detect an application shutdown | Discretionary | Shut down the application without freezing the group (or failover, if the RestartLimit of the resource is 0 or equal to the failure/restart count so far) | App owner
Test2: manual failover | Discretionary | Click or run failover (CLI: 'hagrp -switch [grp_xyz] -to [abc]'); see the sketch after this table | App owner
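A minimal sketch of the Test2 manual switch from the command line, using the same placeholder names as the table above (grp_xyz for the group, abc for the target node):

    # Confirm where the group is currently online
    hagrp -state grp_xyz

    # Switch the group to the other cluster node
    hagrp -switch grp_xyz -to abc

    # Watch the switch complete, then confirm the new placement
    tail -f /var/VRTSvcs/log/engine_A.log
    hagrp -state grp_xyz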





Appendix I: Detailed Procedures

I. Tests before a VCS cluster becomes production (by UNIX admin)

Goal: to make sure VCS reacts as expected in various fault situations.

With at least one group configured in the cluster, simulate these events:

a. Host/OS level fault: crash/halt/reset all nodes, one at a time.
This should trigger failover of groups on the node-down event and trigger paging when the last cluster node is down.
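One way to force the crash on a Solaris node is to trigger a panic/crash dump; the method is discretionary, and the commands below are only an illustrative sketch:

    # On the node under test: force a crash dump and reboot
    reboot -d

    # From a surviving node: confirm that the groups fail over as expected
    hastatus -sum
    hagrp -state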


b. 2 cluster heartbeat links down (this disables auto-failover and puts the cluster into 'jeopardy mode': manual operation only).
Ask NOC to disable the ports instead of unplugging cables, to minimize human error.

This should trigger email to NOC/UNIX when 2 heartbeat links are down on any node. When the links are restored, the cluster will clear this fault automatically. Note that no email will be sent when 1 link is up and the service LAN line is also up.

To verify results before/after, run 'lltconfig -a list' on every node. It should display all the cluster nodes on every heartbeat link. There should be 3 links, with their NICs specified in /etc/llttab.
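A small sketch of that verification across the cluster, assuming password-free ssh between nodes (the node names are the ones used in Appendix II and are placeholders for the cluster under test):

    # On every node, list the nodes visible on each heartbeat link
    # and confirm 3 link lines in /etc/llttab
    for node in bastet leto zeus rhea
    do
        echo "=== $node ==="
        ssh $node "lltconfig -a list"
        ssh $node "grep '^link' /etc/llttab"
    done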


c. Service network (i.e., 120 subnet) link down: on one node, then on both nodes.
This should trigger failover of groups on the link-down event and trigger paging when the link is down on all nodes.
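During this test it helps to know which interface carries the service (120 subnet) address and to watch the VCS engine log while the port is disabled; a brief sketch:

    # Show all interfaces; note which one carries the address on the 120 (service) subnet
    ifconfig -a

    # Watch the engine log while the link goes down and the groups fail over
    tail -f /var/VRTSvcs/log/engine_A.log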


d. VCS engine (server) processes crash: had, hashadow. These 2 processes watch/restart each other and perform other HA functions.

Kill them, one at a time, and verify that the dead one gets restarted by the other process within a few seconds. Verify with ps:

    ps -ef | grep -v grep | egrep '/had$|/hashadow$'
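A compact sketch of one kill-and-verify cycle, run as root on the node under test (pkill -x is just one way to send the signal; Appendix II shows a manual run with kill <pid>, and the sleep interval is only a suggestion):

    # Kill had; hashadow should restart it within a few seconds
    pkill -x had
    sleep 5

    # had should be back with a new PID
    ps -ef | grep -v grep | egrep '/had$|/hashadow$'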


II. Simulated system failures for clusters already in production

Note 1: Unixgrp trigger scripts under /opt/VRTSvcs/bin/triggers on each VCS node are monitored/synced hourly by the configuration management tool, Puppet.

Note 2: Application owner(s) will need to verify that the applications are up, or restart the groups themselves, as arranged during setup discussions.


a. Schedule a [reboot | halt] to simulate a crash, if a node hasn't been rebooted for one year.

Procedures:

Simulate a crash to make sure group failovers complete as expected. Note down the online groups on the node, e.g., on leto:

    hagrp -state | grep -i online | grep leto

Fail the groups back when done. Applications with sensitive DB/data, or that may not be robust enough, should be migrated off the node in advance. Application owners may have to confirm that their service groups are up afterward, unless the agent's 'monitor' script performs comprehensive tests rather than just checking files and processes.
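A sketch of the bookkeeping around this test, assuming the node under test is leto, Bourne shell syntax, and an example temp-file path:

    # Before the reboot/halt: record which groups are online on leto
    hagrp -state | grep -i online | grep leto > /var/tmp/leto_online_groups.txt

    # After leto is back up and has rejoined the cluster, fail each recorded group back
    # (the group name is the first column of 'hagrp -state' output)
    for grp in `awk '{print $1}' /var/tmp/leto_online_groups.txt | sort -u`
    do
        hagrp -switch $grp -to leto
    done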


b. Cluster heartbeat links down

Cluster heartbeat LAN line(s) going down has been the most common event so far. The test will expose human errors in the wiring, LAN ports, trigger scripts, or other network problems.

If there is no human error while NOC verifies and disables/enables the LAN ports of the heartbeat links, no action is required from the UNIX sysadmin or application owner; VCS will re-enable automatic mode when the heartbeat links are detected as up again. If NOC makes a mistake, say, disabling the service LAN line, then depending on how long the outage lasts and how the application handles network failure, the service groups may be affected. It's safer to offline and online all VCS groups on the node unless asked not to.


Procedures:

The UNIX sysadmin opens a ticket to NOC with the node list and the date/time that CDL/UNIX/NOC have agreed upon, then notes down the online groups on the node, e.g., on leto:

1. hagrp -state | grep -i online | grep leto

2. Watch the VCS log for any human error from NOC. Should it happen, ask NOC to restore all links, then

3. restart the applications (groups) on the node. E.g., on leto, for each group seen in step 1:
   hagrp -offline grp_xyz -sys leto

4. watch progress in the log files, then ...

5. hagrp -online grp_xyz -sys leto

A scripted version of steps 3-5 is sketched below.
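A sketch of steps 3-5 as a loop over every group that was online on leto, assuming the step 1 output was saved to a file (the path is a placeholder, as in the earlier sketch) and Bourne shell syntax:

    # Offline, then online, every group that was online on leto before the test
    for grp in `awk '{print $1}' /var/tmp/leto_online_groups.txt | sort -u`
    do
        hagrp -offline $grp -sys leto
        sleep 60          # crude pause; confirm in engine_A.log that the offline completed
        hagrp -online $grp -sys leto
    done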


c. Service network (i.e., 120 subnet) link down: on one node, then on both nodes.
This should trigger failover of groups on the link-down event and trigger paging when the link is down on all nodes.

Procedures (normally done in sequence with the heartbeat link testing):
[Details needed]


d. VCS engine (server) processes crash: had, hashadow.
These 2 processes watch/restart each other and perform other HA functions.

Procedures:

1. Kill them, one at a time, and verify that the dead one gets restarted by the other process within a few seconds.

2. Verify with ps: ps -ef | grep -v grep | egrep '/had$|/hashadow$'





Appendix II: Sample test runs & output

1. Heartbeat link tests: use lltconfig to verify every link is up, i.e., all nodes are present:

leto (mholm) 122 %lltconfig -a list
Link 0 (ce1):
    Node 0 bastet : 00:14:4F:16:F3:2F
    Node 1 leto   : 00:03:BA:B4:5C:B9 permanent
    Node 2 zeus   : 00:03:BA:B4:5D:B5
    Node 3 rhea   : 00:14:4F:B7:A7:52
Link 1 (eri0):
    Node 0 bastet : 00:14:4F:3B:47:41
    Node 1 leto   : 00:14:4F:00:F7:68 permanent
    Node 2 zeus   : 00:14:4F:23:B5:3E
    Node 3 rhea   : 00:21:28:06:D3:EF
Link 2 (ce0):
    Node 0 bastet : 00:14:4F:16:F3:2E
    Node 1 leto   : 00:03:BA:B4:5C:B8 permanent
    Node 2 zeus   : 00:03:BA:B4:5D:B4
    Node 3 rhea   : 00:21:28:06:D3:EE

leto (mholm) 123 %grep link /etc/llttab
link ce1 /dev/ce:1 - ether - -
link eri0 /dev/eri:0 - ether - -
link-lowpri ce0 /dev/ce:0 - ether - -



2. Killing had or hashadow:

Kill had or hashadow; it should be restarted within a second. If had isn't restarted gracefully, it gets stuck in 'restart' mode. To restart VCS on this node you must use the '-force' option with hastop, so that the service groups are left running.

Check the log file on any node of the cluster during the process: 'tail -f /var/VRTSvcs/log/engine_A.log'

leto (mholm) 1 #ps -ef | grep -v grep | egrep '/had$|/hashadow$'
    root  4637     1  0   Feb 14 ?        0:00 /opt/VRTSvcs/bin/hashadow
    root  4635     1  0   Feb 14 ?       35:54 /opt/VRTSvcs/bin/had

leto (mholm) 2 #kill 4635
...
...
leto (mholm) 3 #ps -ef | grep -v grep | egrep '/had$|/hashadow$'
    root  4637     1  0   Feb 14 ?        0:00 /opt/VRTSvcs/bin/hashadow

leto (mholm) 4 #ps -ef | grep -v grep | egrep 'bin/had|/hashadow'
    root  4637     1  0   Feb 14 ?        0:00 /opt/VRTSvcs/bin/hashadow
    root 22377     1  0 12:16:13 ?        0:01 /opt/VRTSvcs/bin/had -restart

... it's stuck in 'restart mode' and some automatic features are disabled.
... to restart VCS without disturbing the applications on it now,
... use the '-force' option with hastop:

leto (mholm) 5 #hastop -local -force

... wait one minute before starting it:

leto (mholm) 6 #hastart

leto (mholm) 7 #ps -ef | grep -v grep | egrep 'bin/had|bin/hashadow'
    root 23909     1  0 12:27:15 ?        0:00 /opt/VRTSvcs/bin/hashadow
    root 23907     1  0 12:27:15 ?        0:01 /opt/VRTSvcs/bin/had

... also watch progress in the VCS log file to verify that no applications got brought down. If you need to run other VCS jobs, wait a few minutes to allow VCS to inventory what is running.