VETS_Data_Systems_v3.1.doc

Dec 16, 2012

NEXT GENERATION VETS DATA MANAGEMENT SYSTEMS

Requirements and Proposed Architecture

VETS/TDD/CISL


1. Introduction


The purpose of this document is to analyze the requirements of a new computing environment suitable to support the extensive VETS data management activities for the medium-term future (3 to 4 years). From a very high-level perspective, this system must be able to:



- support web-based data distribution and analysis to an already large and growing user community (more than 7000 users across multiple projects like ESG, CDP, VSTO and TIGGE, with other projects like NARCCAP, CADIS, WIS and ESC under development, and yet others proposed)



- empower specific applications and usage by university collaborators (COLA)



- integrate with TeraGrid activities



- support the VETS software development process


Please note that the requirements analysis and the architecture options discussed in the document already reflect the necessity to minimize the heat footprint of the overall system, as dictated by the physical limits of the current CISL computing room. Also, this analysis does not include the hardware requirements necessary to support the Chronopolis project, since these were separately addressed in the Chronopolis proposal and will be funded separately. The Chronopolis system is included for completeness in the architecture picture at the end of the document.


2. High Level Functional Requirements


When architecting the next generation computing environment, the following functional requirements must be considered (see Figure 1 for a draft proposed architecture that captures these requirements):


1. Separation of Computing Resources: at present, all VETS data management applications run within a shared, multi-processor environment consisting of two machines: dataportal.ucar.edu and datagrid.ucar.edu. A more flexible configuration for the next generation of data systems would be to partition the available computing resources across different areas of functionality, i.e. allocate separate resources for web hosting, data transfer, data processing, database access, and custom user applications. This more modular architecture would present several advantages:



- different applications could be hosted on the hardware best suited for the problem (for example, web applications don’t need fast I/O capabilities, while database services do)



- different applications could be scaled differently to meet growing demand (web applications can usually be scaled out by purchasing more machines, while databases need to be scaled up by increasing memory and I/O capabilities)



- a sudden peak request in one functional area (for example, a large data transfer request from one user, or a full database replication process) would be limited to a single computing resource without affecting the response time across the whole spectrum of available services, as is presently the case



- arbitrary shell access from external users could be confined to only one hardware partition with a hardened security configuration



- the performance of each application, running on separate hardware, could be monitored and debugged independently


2. Distinct Operational and Development Systems: currently, the same computing resources are used to run the operational systems serving the user community as well as to support the late phases (deployment and integration, load testing, debugging, etc.) of software development executed by VETS engineers. This is clearly a non-optimal situation because the necessities of the software development process may end up impacting day-to-day operations. At the same time, periods of heavy resource demands from the users result in slow hardware performance, which affects software development. There is a clear need to have two distinct environments, with a clear separation of concerns as described later in more detail.


3. Shared Access to a Large Disk Array: current and future VETS data management projects need to enable fast access to increasingly larger data holdings from computational models and observational platforms. Additionally, the same data must be exposed via multiple services, for example to allow one user to download a set of complete files and another to request an expensive regridding operation over a specific geographic region. The current SAN system is well suited to support shared access from multiple computing resources and to be incrementally expanded for future storage needs. It must be noted that recently identified security constraints require the SAN array, and all machines attached to it, to be outside of the UCAR firewall. This requirement has the obvious consequence that the whole data system proposed in this document must be fully positioned outside of the UCAR security perimeter. We believe that this topology does not conflict with the services exposed by the system, since the only protected resource that needs to be accessed is the NCAR MSS, which can be (and actually already is) enabled via a single machine proxying MSS requests through the firewall.


4. Fast Access to Network Resources (Internet2 and TeraGrid): most of the data services supported by the system involve data distribution to end users over the network. Additionally, we plan to routinely execute large data transfers over the TeraGrid from collaborating institutions (ORNL, SDSC, Purdue) to publish new data collections onto the system.


5. Fast Disaster Recovery: in general, none of the data services provided by VETS is mission-critical, meaning that a sudden interruption of services due to mechanical or other failure would not have catastrophic human or economic repercussions for the community they serve. So while there is not a requirement to be operational on a 24/7 basis, we do need to strive to recover from an accident by bringing back most of the system functionality within a short period of time, typically by the next business day. Preferably, the system should minimize the chances of global failure across all of the functionality areas and allow each logical unit (web portals, database services, data transfer, etc.) to be reinstalled and reconfigured independently, possibly after replacing the underlying hardware. Replacement of individual data collections may necessarily take longer if they are to be transferred from deep or external storage, but they should be made accessible via the appropriate services on an incremental basis as soon as they are available.


3. Operational and Development Systems Requirements


We propose that two distinct systems be set up and maintained to support the following functionality areas:


1. Operational System: responsible for sustaining day-to-day operations to end users with the maximum possible performance, the minimum interruption of services, fast recovery capabilities, and the overall best user experience. It should be composed of:



- One or more logical units with large virtual memory and moderate disk allocation to enable rapid feedback from hosted web portal user interfaces.



- One logical unit, and possibly multiple sibling units (to allow for database replication and failover), with large virtual memory and fast I/O for all database services.



- One logical unit with a very large disk cache visible to other logical units for temporary data storage during massive data transfers from ESG and TeraGrid partners via SRM and SRB middleware.



- One or more logical units with large memory, excellent integer and floating point performance, and a large cache with fast I/O to support server-side CPU-intensive operations like sub-setting, regridding, and post-processing.



- One or more logical units with fast I/O for access to data on the shared disk array via multiple protocols like HTTP, FTP, GridFTP, WCS, WMS, and WFS.



- One logical unit with good processing capabilities, a large disk cache, and hardened security constraints to support specific operations of university and other collaborators, most prominently COLA.


2. Development System: while VETS engineers will still use their personal computers for the coding phase of the software development process, a separate development system is still needed to support the following functionality:



- Integration testing before final software release in an environment that exactly mirrors the operational system.



- Load testing without affecting the operational system.



- Testing OS upgrades before pushing them to the operational system.



- Demonstrating new and upgraded applications to internal and external collaborators (ESC, ESG, CDP, etc.)



- Possibly, providing an easy replacement for logical units of the operational system in case of failure.


A possible configuration would be to compose the development system out of three logical units: one for web hosting (the envisioned “Gateway” functionality), one for data services (the “Data Node” functionality), and a separate unit for test database services.


4. Current and Projected Disk Storage Requirements


Currently, the combined disk storage used and already allocated to all of the VETS projects is approximately 36 TB. Best-effort estimates project the size of data holdings to double within the next year, and to double again over the next 3 years. Hopefully, appropriate provisions can be made for expanding the SAN system to accommodate these requirements.


Project              Current usage        One year         Three years
CDP (except COLA)    3.6TB                6.7TB            10-16TB
CDP (COLA)           4TB                  7TB              10TB
ESG ***              10TB *, 3.5TB used   40TB             80-100TB
ESG xserve **        2.6TB                0TB (retired)    0TB
TIGGE                6TB                  6TB              6TB
NARCCAP              10TB *               10TB             10TB
CADIS                unallocated          2TB              4TB
Total                36.2TB               71.7TB           120-146TB

* allocated disk space not yet used
** disk to be retired, included in ESG 1-year estimate
*** Estimates from Gary Strand (CGD)
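The doubling assumption above can be sketched as a quick calculation. Starting from the 36.2TB current total, it lands slightly above the table’s one-year sum and at the top of its three-year range; the figures below are arithmetic only, not new estimates:

```python
def project_storage(current_tb):
    """Apply the document's growth assumption: holdings double within one
    year, then double again over the following three years."""
    one_year = current_tb * 2
    three_years = one_year * 2
    return one_year, three_years

# 36.2TB today -> roughly 72TB in one year, roughly 145TB in three years,
# consistent with the 71.7TB and 120-146TB totals in the table above.
one_year, three_years = project_storage(36.2)
```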



5. Operating System Options


While the current VETS data systems run under Solaris 10, it has been our collective experience that there is an increasing trend among modern applications and middleware to offer better support for standard (i.e. mainstream) Linux OSs. Typically, application binary executables are made available for Linux, while compilation from source code may be necessary for Solaris. Besides always requiring additional labor, the time difference involved for the two OSs could really affect the critical process of disaster recovery. Furthermore, the installation, configuration, and execution of many packages (Globus, LAS, OPeNDAP, etc.) seem to be much better tested on Linux than on Solaris. Therefore, it is our recommendation to switch from Solaris to Linux if possible.


6. Database Replication and Failover Options


The logical computing unit hosting the database services has been identified as the most critical part of the overall system because a system failure would not only disable access across all services (similar to the web portals hosting unit), it would also be much more difficult to recover from in terms of data completeness and integrity. Therefore, backup and failover technologies must be properly built into the system architecture.


Several backup/recovery options are available, ranging from cold standby systems to hot failover systems. Given the recovery timeframes and power limitations necessary in this system, utilizing a cold backup/recovery strategy should meet the requirements with the lowest power footprint, as during normal operation only one machine would be in a powered-up state.


In a cold backup strategy, two servers of identical type and software stack would be available for use. The primary server would be powered up and the secondary server powered down during normal use, while both the primary and secondary servers would be powered up during software stack synchronization or change. Both servers would be configured to write transaction log archives to a shared file system. Only the primary server would be configured to initialize the PostgreSQL server during the boot process, to prevent the overwriting of log files.


In the event of a primary server failure, the secondary server would be brought up and the database restored from the base backup and transaction log archives stored on the shared file system. At this point the failed database server would be taken off the network and repaired. Once repaired, the failed server would remain in the powered-down state until a failure in the new active database server is detected or the software stack changes and the servers need to be synchronized. To minimize the loss of unarchived transactions between failure and log shipping, transaction logs will be flushed at some interval yielding a reasonable loss/performance balance.


Base backup and transaction log archives would be tied into the overall incremental backup strategy. In the event of file system corruption, backups would be restored from this backup system to the point where the archives are valid.
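The log-shipping scheme described above maps onto PostgreSQL’s continuous-archiving settings. A minimal sketch follows; the archive path and flush interval are illustrative assumptions, not decided values:

```
# postgresql.conf fragment (primary server only; illustrative values)
archive_mode    = on                                # ship completed WAL segments
archive_command = 'cp %p /shared/wal_archive/%f'    # copy to the shared file system
archive_timeout = 300                               # force a segment switch every 5 minutes,
                                                    # bounding the window of lost transactions
```

The `archive_timeout` value is the “some interval” mentioned above: a smaller value narrows the transaction-loss window at the cost of more, mostly empty, archived segments.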



Figure 1: Logical System Architecture Emerging From Functional Requirements Analysis



7. Proposed Physical Hardware Implementation


Figure 2: Summary of Physical Hardware Implementation Recommendations

Web Server
    Single Server:          Not needed (can be split)
    Number of CPU cores:    8
    Memory:                 32GB per server
    Storage:                500GB SATA II system disk
    Network Connections:
        Internal:           Gigabit
        External:           TG requirements
    SAN Connection:         Yes
    Preferred Hardware:     None
    Supported OS:           Linux

Database Server
    Single Server:          Yes
    Number of CPU cores:    8
    Memory:                 32GB
    Storage:
        System Disks:       500GB SATA II
        Database Storage:
            RAID Controller:      SAS controller
            RAID Level Needed:    10
            Battery-Backed Cache: Yes
            Disk Array:           500GB minimum initial usable storage
            Type:                 SAS/SATA
            Number of Disks:      8
            Disk Size:            146GB
    Network Connections:
        Internal:           Gigabit
    SAN Connection:         No (backup location?)
    Preferred Hardware:     Intel
    Supported OS:           Linux
    Chassis Requirements:   12-16 SAS drive bays for storage growth

Mass Data Transfer Server
    Single Server:          Yes
    Number of CPU cores:    4
    Memory:                 32GB
    Storage:
        System Disks:       1TB SATA II
    Network Connections:
        Internal:           Gigabit
        External:           TG requirements
    SAN Connection:         Yes
    Preferred Hardware:     None
    Supported OS:           Linux

Data Processing/Distribution Server
    Single Server:          Yes
    Number of CPU cores:    8
    Memory:                 32GB
    Storage:
        System Disks:       1TB SATA II
    Network Connections:
        Internal:           Gigabit
        External:           TG requirements
    SAN Connection:         Yes
    Preferred Hardware:     None
    Supported OS:           Linux



Where appropriate, blade chassis such as a SuperMicro SuperBlade chassis (7U, with support for 10 blades) will be used to consolidate rack space and associated components. In cases where blades do not offer sufficient capabilities, appropriately sized rack-mount chassis will be utilized.



1. Web Application Server (Web Portals): the web portals are Java applications that are expected to be primarily network-resource and CPU bound. Consequently, the physical hardware proposed to implement the operational web portal hosting and the development “Gateway” logical units focuses on CPU and internal memory resources. Disk access will be primarily limited to operating system and application loading, with no significant I/O as part of normal web application processing, so I/O is not a major consideration for this hardware.


The recommended physical hardware is a blade enclosure with the following specifications:

- 8 CPU cores
- 32GB memory per blade
- Single SATA II 500GB system disk per blade




Although the expected actual user load is unknown, it is currently believed that 8 CPU cores should be a sufficient starting estimate to support an increasing user load. This should allow us to handle roughly 16 concurrent requests. Most requests will likely involve some network and/or disk I/O and thus allow for a reasonable number of context switches.
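The 16-request figure appears to assume roughly two in-flight requests per core (one running while another waits on I/O). A toy calculation under that assumption, which is ours rather than a measured value:

```python
def concurrent_capacity(cores, overlap_factor=2):
    """Estimate concurrently served requests for mostly I/O-bound work.

    The overlap factor of 2 (one running request plus one blocked on
    network/disk I/O per core) is an illustrative assumption."""
    return cores * overlap_factor

capacity = concurrent_capacity(8)  # 16, matching the estimate in the text
```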


32GB of memory and a single SATA II 500GB system disk per blade are currently believed to be sufficient to run the web applications. The applications will most likely be running with the maximum available heap sizes; therefore the amount of virtual memory needed can be approximated by the following calculation:

(Max JVM Heap Size * expected number of applications) + OS memory
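As a sanity check, the calculation above can be sketched with illustrative numbers; the 2GB heap, ten applications, and 4GB of OS memory are assumptions for the example, not measured figures:

```python
def required_memory_gb(max_heap_gb, num_apps, os_memory_gb):
    """(Max JVM heap size * expected number of applications) + OS memory."""
    return max_heap_gb * num_apps + os_memory_gb

# Ten web applications with 2GB heaps plus 4GB for the OS would need
# about 24GB, fitting comfortably within the proposed 32GB per blade.
needed = required_memory_gb(2, 10, 4)
```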


Significant external storage access can be achieved via NFS to other external storage resources or via SAN access. We have not researched any benchmarks comparing AMD versus Intel platforms for this application and thus have no particular affinity for either platform.



2. Database Server: the database services server is expected to be primarily CPU and local disk space bound, so the physical hardware recommended to implement the operational, Chronopolis cluster, and development “Data Node” and test database logical units focuses on internal disk storage and CPU.


A blade enclosure is not applicable for this server, as blade enclosures do not provide adequate storage (though perhaps DAS options might negate this). It is expected that this machine will need a 3U or 4U chassis with at least 8 SCSI or SAS bays.

The proposed storage subsystem layout is:



- System disks: 1 x 500GB SATA II
- RAID subsystem (RAID 5 pattern, 600GB effective storage): 3 x 300GB 15k rpm SCSI/SAS drives
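The 600GB figure follows from RAID 5 sacrificing one disk’s worth of capacity to parity; a quick check of the arithmetic:

```python
def raid5_effective_gb(num_disks, disk_gb):
    """Usable capacity of a RAID 5 array: parity equivalent to one disk is
    spread across the array, leaving (n - 1) disks of usable space."""
    return (num_disks - 1) * disk_gb

effective = raid5_effective_gb(3, 300)  # 600 GB, as stated above
```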


The initial recommendation on memory is 16-32GB.


The recommended number of CPU cores ranges from 0.5 to 1 times the total number of expected client cores. If this machine will also serve Chronopolis SRB/iRODS, perhaps we should go for 8 cores as well. The next generation system will run on the PostgreSQL database. Recent benchmarks have shown PostgreSQL to run significantly better on Intel than on AMD CPUs, and we therefore recommend an Intel platform for this server.


Disk performance is critical for database performance. It is believed that the SAN does not provide the desired performance for these storage needs. Given the typical I/O patterns of a database/file server, a SCSI RAID array appears most applicable. Beyond the benefits of SCSI in this environment, this will also give us better failure modes as well as volume expansion as needed.



3. Mass Data Transfer: this component is expected to be network and disk bound; therefore this machine should not need a large number of CPUs. It will need access to a 1TB disk cache, which is expected to be best achieved through local disk. The expected queue depths for this application are unknown. Given small queue depths, a single large SATA II disk may prove very suitable, thus allowing this component to be housed as a blade as well. Otherwise this server will have to be in a separate chassis with expanded storage options.




- 4 CPU cores
- 32GB memory
- Single SATA II 1TB system/cache disk



4. Data Processors / Data Servers: these components are expected to process and serve large volumes of data. The data is expected to be stored on the SAN, and thus these machines don’t require large volumes of local disk other than scratch space.




- 4-8 CPU cores
- 32GB memory
- Single SATA II 1TB system/cache disk





8. Example Hardware Implementation


Assumptions:

1) The blade enclosure provides enough fault tolerance for all blades.





Blade Enclosure
    Web Server (Production)                         CPUs x 8
    Web Server (Development)                        CPUs x 8
    Data Processors / Data Servers (Production)     CPUs x 4-8
    Data Processors / Data Servers (Development)    CPUs x 4-8
    Mass Data Transfer (Production)                 CPUs x 4
    Mass Data Transfer (Development)                CPUs x 4
    Database Server (Production)                    CPUs x 8
    Database Server (Development)                   CPUs x 8