
PostalOne! Outage 2/5/2010

MTAC Report

John Edgar

Vice President

Information Technology Solutions


PostalOne! Architecture

[Architecture diagram: a NetScaler load balancer in the Eagan DMZ fronts www.uspspostalone.com and distributes traffic across four WebSphere cells, each containing web servers and application servers. Behind them, the Eagan PostalOne! Secure Enclave hosts RAC listeners and three clustered database servers on the Eagan SAN, with a daily BCV copy in Eagan and a disaster recovery database in San Mateo.]
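
The diagram can be summarized as a simple layered model. The sketch below is not USPS code; the names and structures are illustrative assumptions that just restate the tiers shown on the slide and print the path a user request would take from the DMZ to the clustered database.

```python
# Minimal sketch (illustrative only): the PostalOne! tiers from the
# architecture slide expressed as plain data, plus a helper that lists
# the layers a request traverses. All names/values are assumptions.

TOPOLOGY = {
    "dmz": {"load_balancer": "NetScaler", "vip": "www.uspspostalone.com"},
    "websphere_cells": [
        {"cell": f"cell-{i}", "tiers": ["Web Servers", "Application Servers"]}
        for i in range(1, 5)  # the slide shows four cells
    ],
    "secure_enclave": {
        "rac_listeners": True,
        "clustered_db_servers": 3,
        "storage": "Eagan SAN",
    },
    "copies": {"daily_bcv": "Eagan", "dr_db": "San Mateo"},
}

def request_path(topology: dict) -> list[str]:
    """Return the tiers a user request traverses, per the diagram."""
    return [
        topology["dmz"]["vip"],
        topology["dmz"]["load_balancer"],
        "one WebSphere cell (web tier, then app tier)",
        "RAC listeners",
        "clustered database on " + topology["secure_enclave"]["storage"],
    ]

if __name__ == "__main__":
    print(" -> ".join(request_path(TOPOLOGY)))
```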


PostalOne! Details


PostalOne! is a very large scale database with a high data change rate:

- Currently 10 TB of live data, with 25% of the data changing per week

Redundant backup processes are intended to ensure complete data recovery in the event of a system outage:

- Daily disk image of the full database
- Weekly tape archival of the full database
- Multiple daily transaction-level logs archived

The daily disk image is currently ~14 TB. Recovery time from the image is ~14 hours. The image is released daily to create a new image.

Tape images are kept for 30 days, which provides four images for recovery if needed. Recovery from tape takes about 5 times longer than from disk.

Multiple daily transaction backups are taken and logged to support incremental recovery activities. (A rough arithmetic sketch of these recovery figures follows.)
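
To make the figures above concrete, here is a minimal arithmetic sketch. The 14 TB image size, ~14-hour disk restore, ~5x tape slowdown, and 30-day tape retention come from the slide; the function and variable names are illustrative assumptions, not USPS tooling.

```python
# Rough arithmetic sketch of the recovery figures quoted above.

IMAGE_SIZE_TB = 14           # daily BCV disk image size from the slide
DISK_RESTORE_HOURS = 14      # quoted restore time from the disk image
TAPE_SLOWDOWN = 5            # tape restore is ~5x slower than disk
TAPE_RETENTION_DAYS = 30     # weekly tapes kept 30 days -> ~4 images on hand

def restore_estimate(source: str) -> float:
    """Return an estimated restore time in hours for a given backup source."""
    if source == "disk_image":
        return DISK_RESTORE_HOURS
    if source == "tape":
        return DISK_RESTORE_HOURS * TAPE_SLOWDOWN
    raise ValueError(f"unknown backup source: {source}")

if __name__ == "__main__":
    print(f"Disk image restore: ~{restore_estimate('disk_image'):.0f} h "
          f"(~{IMAGE_SIZE_TB / DISK_RESTORE_HOURS:.1f} TB/h effective)")
    print(f"Tape restore:       ~{restore_estimate('tape'):.0f} h")
    print(f"Weekly tapes available: ~{TAPE_RETENTION_DAYS // 7}")
```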


PostalOne! Outage Timeline

Friday 2/5/2010

6:08 PM - End users reported system response problems with PostalOne!. Technical teams began investigating.

6:47 PM - Errors identified within the database. Technical teams worked through multiple options to restore the system.

8:00 PM - Scheduled daily disk image backup kicked off.

9:45 PM - Attempted restore of the corrupted table from the daily disk image backup. The restore failed.

10:57 PM - Began a full restore from the previous week's tape backup and reapplication of incremental transactions.

Wednesday 2/10/2010

12:30 AM - PostalOne! system operational and transaction processing resumed (see the duration check below).

Friday 2/12/2010

COB - Based on counts of processed postage statements for the week, the majority of the backlog has been addressed.
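
For reference, a quick calculation of the outage window implied by the timestamps above, from the first user reports on Friday evening to the resumption of transaction processing early Wednesday. The dates and times come from the timeline; the calculation itself is only an illustration.

```python
# Quick arithmetic check of the outage window from the timeline above.

from datetime import datetime

first_reports = datetime(2010, 2, 5, 18, 8)      # Fri 6:08 PM, problems reported
service_restored = datetime(2010, 2, 10, 0, 30)  # Wed 12:30 AM, processing resumed

outage = service_restored - first_reports
hours = outage.total_seconds() / 3600
print(f"Outage window: {outage} (~{hours:.0f} hours)")  # roughly 4+ days
```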


Cause and Future Prevention

Cause of Outage: Disk-level storage corruption

Preventive Actions to be Taken:

- Acquire and implement additional storage for a second BCV copy
  - Revise recovery procedures to keep one BCV as a database clone
- Implement SNAP backups to run twice daily, in addition to the BCV (a rough comparison sketch follows this list)
  - Provides further incremental copies of data to minimize future recovery effort
  - Enables more rapid recovery
- Develop/acquire additional automated disk management and error-checking routines to enhance recovery capabilities
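
As a rough illustration of the SNAP change, the sketch below compares the worst-case gap between recoverable images under the current daily BCV alone and under a daily BCV plus two SNAP copies, assuming roughly even spacing. This framing is a simplification of my own: the archived transaction logs already allow finer-grained incremental recovery, so the gap shown concerns full-image restore points only.

```python
# Illustrative sketch only: effect of adding twice-daily SNAP copies to the
# daily BCV on the gap between recoverable images (assumes even spacing).

def max_gap_hours(copies_per_day: int) -> float:
    """Worst-case hours since the most recent image, given N copies per day."""
    return 24 / copies_per_day

current = max_gap_hours(copies_per_day=1)       # daily BCV only
proposed = max_gap_hours(copies_per_day=1 + 2)  # daily BCV + 2 SNAP copies

print(f"Current:  up to {current:.0f} h since the last image")
print(f"Proposed: up to {proposed:.0f} h since the last image")
```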




PostalOne! Outage

Contingency Operations Review


Pritha Mehra

Vice President

Business Mail Entry and Payment Technologies


Contingency Operations


- Revised Contingency Operations to extend the 72-hour limit
- Requested Manual Postage Statements and Summary Postage Statement Reports
- Utilized a Manual Log to record all transactions (a hypothetical log-record sketch follows)
- Communications to Mailers and USPS
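
A hypothetical sketch of the kind of record the manual log would need to capture so that transactions accepted during the outage can be keyed back into PostalOne! after restoration. The field names and the replay helper are assumptions for illustration; the actual USPS manual log format is not described in this report.

```python
# Hypothetical manual-log record and replay helper (illustrative only).

from dataclasses import dataclass
from datetime import datetime

@dataclass
class ManualLogEntry:
    recorded_at: datetime
    facility: str            # BMEU / DMU where the mailing was accepted
    permit_number: str
    statement_form: str      # postage statement form used
    pieces: int
    postage_amount: float

def replay(entries: list[ManualLogEntry]) -> None:
    """Placeholder: list each logged transaction in order for re-entry."""
    for e in sorted(entries, key=lambda e: e.recorded_at):
        print(f"{e.recorded_at:%m/%d %H:%M} {e.facility} "
              f"permit {e.permit_number}: {e.pieces} pcs, ${e.postage_amount:.2f}")

if __name__ == "__main__":
    replay([ManualLogEntry(datetime(2010, 2, 6, 9, 15), "Example BMEU",
                           "PI-1234", "PS Form 3602", 10000, 2500.00)])
```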




Contingency Operations


Conducted Extensive Communications

- DMM Advisory, MTAC and MTAC workgroups
- PostalOne! users, PostalOne! User Group
- RIBBS, Gateway
- Officers
- Stakeholders
- Sales, BSN, BME
- P&C Weekly
- BME Newsletter
- Webinars
- Mailers
- CPP Publishers
- Area Marketing Managers & CSPAs
- BSN and Sales
- Business Mail Acceptance, DMU Clerks



Contingency Operations


Lessons Learned

- Communications worked very well
- Revised Contingency Procedures worked
- Contingency Plan Revisions for greater clarity:
  - Updated Manual Logs
  - Contingency time limits subject to expected recovery timeframe
  - Checklist for HDQs on communications
  - Restoration procedures for eDOC/Postage Statements
  - Restoration for Special Postage Payment System
  - Verification results recording upon restoration
  - Continuation of Operations Plan