Enhancements in the DoD HPC


Solving the hard problems . . .

Distribution A: Approved for public release, distribution unlimited.

HPC User Forum

7 Apr 2011

User Environment Enhancements in the DoD HPC Modernization Program

7 April 2011

Steve Scherr, DoD HPCMP


Topics

  Background: HPCMP Storage Initiative
  Enhanced User Environment
  HPC EUE Infrastructure
  HPC Portal


HPC Modernization Program

Vision

A pervasive culture existing among DoD's scientists and engineers where they routinely use advanced computational environments to solve the most demanding problems, transforming the way DoD does business: finding better solutions faster.

Mission

Accelerate development and transition of advanced defense technologies into superior warfighting capabilities by exploiting and strengthening US leadership in supercomputing, communications and computational modeling.


HPCMP Serves a Large, Diverse DoD User Community

FY11 statistics
  501 active projects with 4,408 users at 250 sites
  5,098 Habus* batch requirements

FY10 statistics (as of 9/30/2010)
  496 projects with 4,345 users
  2,866 Habus* non-real-time requirements

* Requirements and usage measured in Habus

Users by CTA
  Computational Structural Mechanics: 465 users
  Electronics, Networking, and Systems/C4I: 211 users
  Computational Chemistry, Biology & Materials Science: 690 users
  Computational Electromagnetics & Acoustics: 323 users
  Computational Fluid Dynamics: 1,223 users
  Environmental Quality Modeling & Simulation: 163 users
  Signal/Image Processing: 586 users
  Integrated Modeling & Test Environments: 105 users
  Climate/Weather/Ocean Modeling & Simulation: 315 users
  Forces Modeling & Simulation: 235 users
  92 users are self-characterized as "Other"
  New CTA: Space and Astrophysical Science (SAS)

Source: Portal to the Information Environment, July 2010
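
As a quick consistency check on the figures above, the per-area user counts plus the 92 "Other" users can be summed and compared with the 4,408 users quoted for FY11. The sketch below only restates numbers from the slide; the dictionary layout and variable names are illustrative, and the pairing of counts with labels follows the order in which they appear in the extracted text.

```python
# User counts per computational technology area, copied from the slide.
users_by_cta = {
    "Computational Structural Mechanics": 465,
    "Electronics, Networking, and Systems/C4I": 211,
    "Computational Chemistry, Biology & Materials Science": 690,
    "Computational Electromagnetics & Acoustics": 323,
    "Computational Fluid Dynamics": 1223,
    "Environmental Quality Modeling & Simulation": 163,
    "Signal/Image Processing": 586,
    "Integrated Modeling & Test Environments": 105,
    "Climate/Weather/Ocean Modeling & Simulation": 315,
    "Forces Modeling & Simulation": 235,
    "Other (self-characterized)": 92,
}

# Total across all areas, and the area with the most users.
total_users = sum(users_by_cta.values())
largest_cta = max(users_by_cta, key=users_by_cta.get)

print(total_users)   # 4408, the FY11 user count on the slide
print(largest_cta)   # Computational Fluid Dynamics
```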

Customer Focus


DoD Supercomputing Resource Centers (DSRCs)

Six Large HPC Centers

DSRC systems support classified, unclassified and open computing capabilities

17 large HPC systems
  1 system: 44,000+ cores
  6 systems: 10,000 to 22,000+ cores
  10 systems: 2,000 to 9,000+ cores
  1.873 peak PetaFlops
  4,750 Habus

Three new FY10 HPC systems
  773 TeraFlops
  2,251 Habus

14 Petabytes single copy data storage
  28 Petabytes including Disaster Recovery

Connections to Customers
  212 locations


HPCMP Data Storage Growth

43% increase over FY 2008

34% increase over FY 2009


User View of HPCMP Storage

[Diagram labels: HPC A and HPC B, each with an HPC File System ($WORKDIR short-term storage); Archive Server; Center Archive Cache; Tape; DR Cache; DR Tape]

Computational results used in many different ways
  Source for additional computation
  Interrogated for post-processing
  Archived for scientific value

Users are mobile within HPCMP


Storage Lifecycle Management (SLM) Rationale

HPCMP can provide enough storage for NEW data
Centers support 2+ generations of storage media
Older media unreadable after tech obsolescence
Users: we can live with constraints & manage data
  Need tools to manage data
  Need intermediate-length storage

[Lifecycle diagram: Creation, Active Use, Archival Use, Removal]

ENHANCED USER ENVIRONMENT


Evolving Enterprise Service Model

[Diagram with three tiers: Customers, Services, Infrastructure. Labels shown: Research Community; T & E; Software Development; Acquisition Community; Single Authentication; Advance Reservations; Web Portal Framework; Remote SciViz; ezHPC; Remote Job Management; Computational Infrastructure for Software Development (Tools / Environment); Data Management Tools; Metadata; Batch; Interactive Grid Generation]


HPC Enhanced User Environment Architecture

[Architecture diagram. Labels shown:
  HPC System A and HPC System B, each with Temporary Storage (10 days)
  Utility Server
  Services: Single Point of Access; Center-wide Job Management; Data Analysis Services; Grid-Generation Capabilities; DR&E Portal; Software Development Environment; Storage Lifecycle Management
  Compute; Storage
  Center-wide ILM-managed File System (30 days)
  SLM Metadata Catalog Service; Metadata Replication Between all DSRCs
  Archive Server; Local Tape Archive
  Remote Disaster Recovery Facility]


HPC Enhanced User Environment

Interactive Computing
  Single point of access
  Center-wide job management
  Remote data analysis
Center-wide filesystem
  Medium-term storage
  User-specified metadata
Data Management Tools
  Insight into file archives
  Program-wide visibility
HPC Portal
  Supercharge the engineering desktop


HPC EUE INFRASTRUCTURE


Hardware Components

Center-wide File System: Panasas PAS 8
  340 blades, 4 TB unformatted
  Arista 7508 switch

Utility Server: Appro 1U Tetra, 88 nodes
  44 compute: 2 AMD Opteron 2.3 GHz CPUs, 16 cores, 128 GB memory
  22 large memory: 4 AMD Opteron 2.3 GHz CPUs, 32 cores, 256 GB memory
  22 graphics: 2 AMD Opteron 2.3 GHz CPUs, 16 cores, 256 GB memory, NVIDIA Tesla M2050
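
The utility server figures above can be totalled with a few lines of arithmetic. This is a sketch only: it assumes the core and memory figures on the slide are per node (the slide does not say so explicitly), and the `NodeType` class and its field names are illustrative, not part of any HPCMP tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    """One class of utility server node, as listed on the slide."""
    name: str
    count: int
    cores_per_node: int       # assumption: the slide's core count is per node
    memory_gb_per_node: int   # assumption: the slide's memory figure is per node


NODE_TYPES = [
    NodeType("compute", 44, 16, 128),
    NodeType("large memory", 22, 32, 256),
    NodeType("graphics", 22, 16, 256),
]

# Aggregate capacity across all node types.
total_nodes = sum(n.count for n in NODE_TYPES)
total_cores = sum(n.count * n.cores_per_node for n in NODE_TYPES)
total_memory_gb = sum(n.count * n.memory_gb_per_node for n in NODE_TYPES)

print(total_nodes)       # 88, matching the node count on the slide
print(total_cores)       # 1760
print(total_memory_gb)   # 16896
```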




System Configuration

$HOME
  10 GB quota
$WORKDIR
  200 TB
  100 TB user quota
  Standard scrubbing
$CENTER
  800 TB
  Possible user quota (200 TB)
  30-day scrub policy
  SLM compatible
$ARCHIVE
  Managed by SLM
  Accessed through SLM
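
To illustrate what a 30-day scrub policy on $CENTER could mean for a user, the sketch below lists files that have not been modified within a given number of days. It is not the centers' scrubber: the use of modification time as the ageing criterion, the function name `scrub_candidates`, and its parameters are all assumptions made for the example.

```python
import os
import time
from pathlib import Path

SECONDS_PER_DAY = 86_400


def scrub_candidates(root, max_age_days=30, now=None):
    """Return files under root whose modification time is older than max_age_days.

    Assumption: age is judged by mtime; a real scrubber may use other criteria.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                if path.stat().st_mtime < cutoff:
                    stale.append(path)
            except OSError:
                continue  # file disappeared or is unreadable; skip it
    return sorted(stale)


if __name__ == "__main__":
    # $CENTER is the variable named on the slide; fall back to the current directory.
    for path in scrub_candidates(os.environ.get("CENTER", ".")):
        print(path)
```

A user could run something like this before the scrub window closes to decide which results to move to $ARCHIVE.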



Center-wide Job Management
  qsub, qstat, qdel
  Resource Requests
  PBS Pro
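
As an illustration of a resource request under PBS Pro, the sketch below assembles a minimal job script as text. The `#PBS -N`, `-q`, `-l select=...:ncpus=...` and `-l walltime=...` directives and the `PBS_O_WORKDIR` variable are standard PBS Pro; the helper name `build_pbs_script`, the queue name `standard`, and the command line are hypothetical, and a real site may require further directives such as a project or account string.

```python
def build_pbs_script(job_name, queue, nodes, cpus_per_node, walltime, command):
    """Return the text of a minimal PBS Pro job script (illustrative only)."""
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",                             # job name
        f"#PBS -q {queue}",                                # queue (site-specific)
        f"#PBS -l select={nodes}:ncpus={cpus_per_node}",   # resource request
        f"#PBS -l walltime={walltime}",                    # wall-clock limit
        "cd $PBS_O_WORKDIR",                               # start where qsub was run
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical request: 2 nodes with 16 cores each for one hour.
    print(build_pbs_script("demo", "standard", 2, 16, "01:00:00", "mpirun ./a.out"))
```

Saved to a file, such a script would be submitted with qsub, monitored with qstat, and cancelled with qdel followed by the job identifier.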



Storage Lifecycle Management

Based on Nirvana SRB and SAM-QFS
  Manages $ARCHIVE
  Set metadata to specify retention period
  Can register files on $CENTER; target to automate registration by end 2011
HPC access to $ARCHIVE through transfer queue
  Also working PBS parameter mechanism: future just-in-time
Customer Experience workgroup developing auxiliary commands (Sdata) for user-defined metadata
Global visibility
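
The idea of setting metadata to specify a retention period can be pictured with a small data model. This is a conceptual sketch under stated assumptions, not the SLM, SRB, or Sdata interface: the class `ArchiveRecord`, its fields, and the example path are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class ArchiveRecord:
    """Hypothetical record pairing an archived file with its retention metadata."""
    path: str
    archived_on: date
    retention_days: int

    def expires_on(self):
        """Date on which the retention period ends."""
        return self.archived_on + timedelta(days=self.retention_days)

    def is_expired(self, today):
        """True once the retention period has fully elapsed."""
        return today >= self.expires_on()


if __name__ == "__main__":
    record = ArchiveRecord("results/run42.tar", date(2011, 4, 7), retention_days=365)
    print(record.expires_on())
    print(record.is_expired(date(2011, 12, 31)))
```

With the retention period stored next to the file, a lifecycle tool could list records approaching expiry so that owners can extend or release them.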


HPC PORTAL


HPC Desktop Portal Initiative

Goals
  Enable DoD scientists and engineers to apply the power of HPC without being HPC experts
  Provide access to HPC resources using current web technology; attract and retain new technology experts to DoD

Methods
  Provide HPC Software as a Service over web with zero or minimal footprint
  Provide common analysis tools enabled for seamless HPC use (MATLAB)
  Provide accessible optimized tools for technology domains (CREATE, institutes)
  Extension of desktop; interactive response
  Single sign-on through CAC


HPC Portal

Engaging with DoD engineering organizations
  Understand their requirements and how we can support
Examining Cloud Computing Concepts
  Software as a Service
  Infrastructure as a Service
Phase 1: Parallel MATLAB capability
  ARL lead, deliver in June
  Built on Microsoft HPC Server
  Additional available applications, FMS, CFD, etc.
Phase 2: Present CREATE capability
  Identifying API, middleware, design framework


HPC Modernization Program


BACKUP


Storage Lifecycle Management

Layered Software Capability
  Information Lifecycle Management
    Metadata: user and system defined
    Policies: drive HSM
    Reporting
  Hierarchical Storage Management
    Tiered Storage
    Disaster Recovery
Multi-system, multi-center
  Assign metadata attributes from all HPC systems
  Work toward "shared" files between centers


Storage Lifecycle Management

Information Lifecycle Management (ILM)
  Provide capability to users and administrators
  Control costs
Hierarchical Storage Management
  Based on ILM information
  Includes disaster recovery
Common user interface
Work toward shared files

ILM Requirements

Metadata attributes
  User-assignable
  System-assignable
  Defaults
Tools and Reports
  Enable management of data files
Policies
  Based on attributes
  Used to drive HSM
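
The relationship between attributes and policies can be sketched as a small rule table: each policy names the attribute values it matches and the storage tier it selects, and the first matching policy wins. Everything here is hypothetical, including the attribute names, the tier names, the precedence of user values over system values over defaults, and the first-match rule; the slides do not specify how the actual ILM layer evaluates policies.

```python
# Hypothetical defaults applied when neither the system nor the user sets a value.
DEFAULT_ATTRIBUTES = {"retention_days": 365, "dr_copy": False, "access": "cold"}


def effective_attributes(system_attrs, user_attrs):
    """Merge defaults, system-assigned and user-assigned attributes.

    Assumption: user values override system values, which override defaults.
    """
    merged = dict(DEFAULT_ATTRIBUTES)
    merged.update(system_attrs)
    merged.update(user_attrs)
    return merged


def choose_tier(attributes, policies, default_tier="tape"):
    """Return the tier of the first policy whose conditions all match."""
    for policy in policies:
        if all(attributes.get(key) == value for key, value in policy["when"].items()):
            return policy["tier"]
    return default_tier


# Hypothetical policy table, evaluated top to bottom.
POLICIES = [
    {"when": {"access": "hot"}, "tier": "disk cache"},
    {"when": {"dr_copy": True}, "tier": "tape + DR tape"},
]

if __name__ == "__main__":
    attrs = effective_attributes({"access": "cold"}, {"dr_copy": True})
    print(choose_tier(attrs, POLICIES))   # tape + DR tape
```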


ILM Attribute Requirements

Associated with all objects
Arbitrary number, size, type
Attribute permissions separate from underlying files
  System read/write
  Creator/Owner read/write
  Collections of other users
Inheritance or default-setting at creation
Settable via templates or functions
ILM must scale to 1B files today
No impact on I/O performance for HSM
Attributes can be output textually
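
Two of the requirements above, inheritance at creation and textual output, can be pictured with a short sketch. The function names, the attribute keys, and the `key=value` output format are assumptions chosen for illustration; the slides do not define a concrete format or an ordering between inherited, template, and explicitly set values.

```python
import json


def attributes_at_creation(parent_attrs, template=None, overrides=None):
    """Build the attribute set for a new object.

    Assumption: values are inherited from the parent collection, then a
    template is applied, then explicit overrides.
    """
    attrs = dict(parent_attrs)
    attrs.update(template or {})
    attrs.update(overrides or {})
    return attrs


def render_attributes(attrs):
    """Render attributes as sorted key=value lines, with JSON-encoded values."""
    return "\n".join(f"{key}={json.dumps(attrs[key])}" for key in sorted(attrs))


if __name__ == "__main__":
    parent = {"project": "A", "retention_days": 365}
    attrs = attributes_at_creation(parent, template={"retention_days": 730})
    print(render_attributes(attrs))
```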


ILM Tool Requirements

Tools for manipulating files under ILM control
  Attribute-aware
  Attribute-preserving
  Operate on files, directories, or lists of objects
  Create/modify attributes
Reports
  Based on multiple criteria, attribute values
  Status of pending operations
  Consistent with attribute permissions


HPCMP Storage Initiative

Computing power grows annually; so do stored files
Archived data is hard for users to use and manage
Costs: User time, labor, hardware, software and media

Storage Initiative
  Objective: Refresh to manage data for next 10 years
  Goals: 10-year architecture
    Leverage advances in technology
    Improve user productivity
    Improve reliability & adaptability
    Sustain within current storage budget


HPCMP Data Storage Growth

[Chart: Single Copy of HPCMP Storage in Petabytes, 2001 to 2010; vertical axis 0 to 16; annotation "22 x"]

Single Copy Data Storage

Impact of 16x growth in eight years
  Data Analysis
  Data Locality and Movement
  Data Duplication
  Disaster Recovery
  Network Loading
  Storage Technologies
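
For a sense of scale, 16x growth in eight years can be converted into an equivalent steady annual rate and a doubling time. The sketch assumes constant compound growth, which the chart does not claim; the function names are illustrative. For comparison, the deck elsewhere cites year-over-year increases of 43% and 34%.

```python
import math


def annual_growth_rate(factor, years):
    """Constant yearly growth rate that compounds to `factor` over `years`."""
    return factor ** (1.0 / years) - 1.0


def doubling_time(factor, years):
    """Years needed to double at the same constant compound rate."""
    return years / math.log2(factor)


rate = annual_growth_rate(16, 8)   # 16 ** (1/8) - 1 = sqrt(2) - 1
print(f"{rate:.1%}")               # 41.4%
print(doubling_time(16, 8))        # 2.0
```

Under that assumption, stored data doubles every two years, which is why data locality, duplication, disaster recovery, and network loading all appear as impacts.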


HPC Enhanced User Environment (HEUE)

Purpose
  Provide computational scientists more tools and capabilities to perform research more efficiently and effectively
Benefit
  Decrease time-to-solution, increase S&E productivity and analytical power, reduce future costs of data archive
Tasks
  Storage lifecycle management implementation
  Metadata for file management and identification
  Program-wide datafile visibility and access
  Center-wide filesystem: efficient storage for data analysis and extraction
  Center-wide job management: single point-of-access, increase user productivity
  Remote visualization for large datasets
  Web-based access to HPC capability


Requested Software

System Software
  PBS Pro, OpenMPI
  InfiniBand Software Stack
  NVIDIA Linux x86_64 driver set
  Compliance with BCT policies

Development Tools
  PGI Compiler Suite (C/C++/Fortran)
  GNU Compiler Suite & debugger
  TotalView debugger
  NVIDIA GPGPU development environment (OpenCL and CUDA)
  Common Set of Open Source Utilities
    BC policy: PAPI, SCALASCA, TAU, PDT, Valgrind
  DDT and DDT with CUDA debugger

Data Analysis Tools
  CEI EnSight Suite
  Intelligent Light FieldView
  RSI, Inc. IDL
  MathWorks MATLAB
  NCAR Graphics Library
  Kitware ParaView
  Tecplot, Inc. Tecplot
  VisIt Visualization Tool
  Computational Science Environment (CSE)
  ezVIZ


Requested Software

Pre/Post Processing Software
  ANSYS CFD
  Abaqus
  LS-PrePost
  Parasolid Designer (pre)
  Pointwise Gridgen

Math Libraries
  ARPACK, FFTW, PETSc, SuperLU, LAPACK, ScaLAPACK, BLAS, ATLAS, GotoBLAS, SPRNG, GSL

New
  Pipeline Pilot (Accelrys product): automation of the process of predicting compute intensity on the fly and submitting jobs to the US
  Isight (DSS product): design optimization & process integration (some portions are interactive & some are for batch processing)

Secure Remote Visualization
  PKI-VNC
  Longhorn