Oak Ridge Leadership Computing Facility


Oak Ridge Leadership Computing Facility

Don Maxwell
HPC Technical Coordinator
October 8, 2010

Presented to: HPC User Forum, Stuttgart

www.olcf.ornl.gov

Oak Ridge Leadership Computing Facility


Mission: Deploy and operate the computational resources required to tackle global challenges
• Provide world-class computational resources and specialized services for the most computationally intensive problems
• Provide a stable hardware/software path of increasing scale to maximize productive application development
• Deliver transforming discoveries in materials, biology, climate, energy technologies, etc.
• Provide the ability to investigate otherwise inaccessible systems, from supernovae to nuclear reactors to energy grid dynamics

Our vision for sustained leadership and scientific impact
• Provide the world’s most powerful open resource for capability computing
• Follow a well-defined path for maintaining world leadership in this critical area
• Attract the brightest talent and partnerships from all over the world
• Deliver cutting-edge science relevant to the missions of DOE and key federal and state agencies
• Unique opportunity for multi-agency collaboration for science, based on synergy of requirements and technology

With UT, we are NSF’s National Institute for Computational Sciences for academia


• 1 PF system to the UT-ORNL Joint Institute for Computational Sciences
• Largest grant in UT history
• Other partners: Texas Advanced Computing Center, National Center for Atmospheric Research, ORAU, and core universities
• 1 of up to 4 leading-edge computing systems planned to increase the availability of computing resources to U.S. researchers
• A new phase in our relationship with UT
  - Computational Science Initiative
  - Governor’s Chair and joint faculty
  - Engagement with the scientific community
  - Research, education, and training mission



Oak Ridge National Laboratory Leadership Computing Systems

Jaguar (world’s most powerful computer): peak performance 2.33 PF/s, memory 300 TB, disk bandwidth > 240 GB/s, 5,000 square feet, 7 MW power
Kraken (NSF’s most powerful computer): peak performance 1.03 PF/s, memory 132 TB, disk bandwidth > 50 GB/s, 2,300 square feet, 3 MW power
NOAA CMRS (NOAA’s most powerful computer): peak performance 1.1 PF/s, memory 248 TB, disk bandwidth 104 GB/s, 1,600 square feet, 2.2 MW power

Jaguar History
• Jan 2005: XT3 development cabinet
• Mar 2005: 10-cabinet single-core system
• Apr 2005: +30 XT3 cabinets
• Jun 2005: +16 cabinets for a total of 56 XT3 cabinets (25 TF)
• Jul 2006: XT3 dual-core 2.6 GHz (50 TF)
• Nov 2006: XT4 dual-core 2.6 GHz, 32 then 36 cabinets
• Mar 2007: XT3 and XT4 combined for a total of 124 cabinets (100 TF)
• May 2008: XT4, 68 cabinets, quad-core (250 TF)
• Dec 2008: 200-cabinet quad-core XT5 (1 PF)
• Nov 2009: 200-cabinet six-core XT5 (2 PF)

What is Jaguar Today?

Jaguar combines a 263 TF Cray XT4 system at ORNL’s OLCF with a 2,332 TF Cray XT5 to create a 2.5 PF system.

System attribute              XT5                   XT4
AMD Opteron processors        37,376 hex-core       7,832 quad-core
Memory DIMMs                  75,772                31,776
Node architecture             Dual-socket SMP       Single socket
Memory per core/node (GB)     1.3 / 16              2 / 8
Total system memory (TB)      300                   62
Disk capacity (TB)            10,000                750
Disk bandwidth (GB/s)         240                   44
Interconnect                  SeaStar2+ 3D torus    SeaStar2+ 3D torus
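As a quick sanity check, the memory rows follow from the socket counts and node architecture in the table (dual-socket XT5 nodes, single-socket XT4 nodes); a minimal sketch of the arithmetic in Python:

```python
# Quick arithmetic check of the memory figures in the table above.  Node
# counts are inferred from the socket counts and node architecture given
# on the slide (XT5: dual-socket nodes, XT4: single-socket nodes).

systems = {
    "XT5": {"sockets": 37_376, "cores_per_socket": 6, "sockets_per_node": 2, "gb_per_node": 16},
    "XT4": {"sockets": 7_832,  "cores_per_socket": 4, "sockets_per_node": 1, "gb_per_node": 8},
}

for name, s in systems.items():
    nodes = s["sockets"] // s["sockets_per_node"]
    cores_per_node = s["cores_per_socket"] * s["sockets_per_node"]
    gb_per_core = s["gb_per_node"] / cores_per_node
    total_tb = nodes * s["gb_per_node"] / 1000   # decimal TB, as the slide appears to use
    print(f"{name}: {nodes:,} nodes, {gb_per_core:.1f} GB/core, ~{total_tb:.0f} TB total")

# Expected: XT5 -> 18,688 nodes, 1.3 GB/core, ~299 TB (slide: 300 TB)
#           XT4 ->  7,832 nodes, 2.0 GB/core,  ~63 TB (slide: 62 TB)
```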

“Spider”: Center-wide High-Speed Parallel File System

• “Spider” provides a shared, parallel file system for all systems
  - Based on the Lustre file system
• Demonstrated bandwidth of over 240 GB/s
• Over 10 PB of RAID-6 capacity
  - DDN 9900 storage controllers with 8+2 disks per RAID group
  - 13,440 1-TB SATA drives
• 192 Dell PowerEdge storage servers
  - 3 TB of memory
• Available from all systems via our high-performance scalable I/O network
  - Over 3,000 InfiniBand ports
  - Over 3 miles of cables
  - Scales as storage grows
• Spider is the parallel file system for Jaguar
• Spider uses approximately 400 kW of power
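The capacity bullet follows from the drive count and the 8+2 RAID-6 geometry listed above; a minimal arithmetic sketch in Python:

```python
# Sanity check on the Spider capacity figure above: 13,440 one-terabyte SATA
# drives arranged in 8+2 RAID-6 groups (8 data + 2 parity disks per group).

total_drives = 13_440
data_disks, parity_disks = 8, 2        # "8+2 disks per RAID group"
tb_per_drive = 1

groups = total_drives // (data_disks + parity_disks)
usable_tb = groups * data_disks * tb_per_drive

print(f"{groups} RAID-6 groups, ~{usable_tb:,} TB usable (~{usable_tb / 1000:.1f} PB)")
# -> 1344 RAID-6 groups, ~10,752 TB usable (~10.8 PB), i.e. "over 10 PB"
```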


Jaguar combines a 2.33 PF Cray XT5 with a 263 TF Cray XT4

System components are linked by 4x DDR InfiniBand (IB) using three Cisco 7024D switches:
• XT5 has 192 IB links
• XT4 has 48 IB links
• Spider has 192 IB links

[Diagram: the Cray XT5, Cray XT4, Spider, and external login nodes connected through the IB switches]
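For a rough sense of scale, the link counts above can be converted into an aggregate bandwidth estimate. The sketch below assumes the usual 4x DDR InfiniBand data rate of about 2 GB/s per link per direction, a figure not stated on the slide:

```python
# Rough aggregate-bandwidth estimate for the IB fabric described above.
# Assumes ~2 GB/s per 4x DDR link per direction (20 Gb/s signaling,
# 16 Gb/s after 8b/10b encoding); link counts are taken from the slide.

gb_per_link = 2.0                          # assumed 4x DDR data rate, one direction
links = {"XT5": 192, "XT4": 48, "Spider": 192}

for name, n in links.items():
    print(f"{name}: {n} links -> ~{n * gb_per_link:.0f} GB/s raw link bandwidth")

# Spider's 192 links give ~384 GB/s of raw link bandwidth, comfortably above
# the >240 GB/s file-system bandwidth demonstrated on the previous slide.
```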

Building an Exabyte Archive

High-Performance Storage System adds capacity and speed
• Supercomputers addressing Grand Challenges need to quickly store massive amounts of data
• The High-Performance Storage System meets the big-storage demands of big science
• 25 PB of tape storage
• Planning for 750 PB by 2012

Stanley White, National Center for Computational Sciences:
“Fifteen years ago, [national] labs realized they needed something of this size. They recognized Grand Challenge problems were coming up that would require petaflops of computing power. And they realized those jobs had to have a place to put the data.”

Scheduling to Maximize Capability Computing

• Scheduling is provided by Moab
• Capability jobs get maximum priority and walltime
• Jobs are prioritized using several factors to meet DOE goals and to provide flexibility

Priority factors (each factor’s value is multiplied by its weight, expressed in minutes):
• Quality of Service (# of days, weight 1440): Highest (90), High (12), Medium (2)
• Account Priority (# of days, weight 1440): Allocated project (1), No allocation/staff (0), No hours (-365)
• Job Size (# of days, weight 1440): 0 (90), >120,000 cores (15), 80,000-120,000 (10), 40,000-80,000 (5), <40,000 (0)
• Fairshare (# of minutes, weight 1440): user share over/under 5% target: +/- 30 minutes; account share over/under 10% target: +/- 1 hour
• Queue Time (weight 1 per minute queued)
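To make the weighting concrete, here is a minimal sketch in Python of how the factors above could combine into a single priority value (illustrative only, not actual Moab configuration; the ambiguous “0 (90)” job-size entry is omitted):

```python
# Minimal sketch of the priority scheme described above.  Each factor value is
# scaled by its weight (1440 minutes = 1 day); queue time counts 1 per minute.

DAY = 1440  # minutes

QOS = {"highest": 90, "high": 12, "medium": 2}              # days
ACCOUNT = {"allocated": 1, "staff": 0, "no_hours": -365}    # days

def job_size_days(cores: int) -> int:
    """Priority boost (in days) by requested core count, from the table above."""
    if cores > 120_000:
        return 15
    if cores > 80_000:
        return 10
    if cores > 40_000:
        return 5
    return 0

def priority(qos: str, account: str, cores: int, queued_minutes: int,
             fairshare_minutes: int = 0) -> int:
    """Total priority in minutes; bigger jobs and longer waits rank higher."""
    return (DAY * QOS[qos]
            + DAY * ACCOUNT[account]
            + DAY * job_size_days(cores)
            + fairshare_minutes          # e.g. +/-30 or +/-60 minute adjustments
            + queued_minutes)            # queue time, weight 1 per minute

# Example: a 150,000-core capability job from an allocated project, queued for
# two days, outranks a 30,000-core job that has waited just as long.
print(priority("high", "allocated", 150_000, queued_minutes=2 * DAY))
print(priority("high", "allocated", 30_000,  queued_minutes=2 * DAY))
```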

Job Failure Trends

[Chart: “Failures Due to Hardware by Job Size”, showing the hardware-related job failure rate (0% to 6%) versus job size in cores (2,000; 5,000; 40,000; 80,000; 120,000; 225,000). Additional slide labels: MPI Forum, OpenMPI, HWPOISON]


ORNL’s Current and Planned Data Centers

Computational Sciences Building (40,000 ft²)
• Maximum building power to 25 MW
• 6,600-ton chiller plant
• 1.5 MW UPS and 2.25 MW generator
• LEED Certified

Multiprogram Research Facility (30,000 ft²)
• Capability computing for national defense
• 25 MW of power and 8,000-ton chillers
• LEED Gold Certification

Multiprogram Computing & Data Center (140,000 ft²)
• Up to 100 MW of power
• Lights-out facility
• Planned for LEED Gold certification

National Center for Computational Sciences
J. Hack, Director; A. Bland, OLCF Project Director; L. Gregg, Division Secretary

Operations Council: W. McCrosky, Finance Officer; H. George, HR Rep.; K. Carter, Recruiting; M. Richardson*, Facility Mgmt.; M. Disney, ES&H Officer; R. Adamson and M. Disney, Cyber Security

Chief Technology Officer: A. Geist
Director of Operations: J. Rogers
OLCF System Architect: S. Poole
Director of Science: B. Messer (Acting)
INCITE Program: J. White
Industrial Partnerships: S. Tichenor

Deputy Project Director: K. Boudwin
• B. Hammontree, Site Preparation
• J. Rogers, Hardware Acquisition
• R. Kendall, Test & Acceptance Development
• A. Baker, Commissioning
• D. Hudson, Project Management
• K. Stelljes, Cray Project Director

Advisory Committee: J. Dongarra, T. Dunning, K. Droegemeier, S. Karin, D. Reed, J. Tomkins

High-Performance Computing Operations (A. Baker; S. Allen):
R. Adamson, M. Bast, J. Becklehimer (4), J. Breazeale (6), J. Brown (6), M. Disney, A. Enger (4), C. England, J. Evanko (4), A. Funk (4), D. Garman (4), D. Giles, M. Hermanson (2), J. Hill, S. Koch, H. Kuehn, C. Leach (6), D. Leverman, D. Londo (4), J. Lothian, D. Maxwell (@), M. McNamara (4), J. Miller (6), D. Pelfrey, G. Phipps, Jr. (6), R. Ray, S. Shpanskiy, C. St. Pierre, B. Tennessen (4), K. Thach, T. Watts (4), S. White, C. Willis (4), T. Wilson (6)

Scientific Computing (R. Kendall; A. Fields):
S. Ahern (#), E. Apra (5), R. H. Baker, D. Banks (3), M. Brown, J. Daniel, M. Eisenbach, M. Fahey, J. Gergel (5), S. Hampton (7), W. Joubert (#), S. Klasky (#), Q. Liu (7), A. Lopez-Bezanilla (7), M. Matheson, R. Mills (5), B. Mintz (7), H. Nam, G. Ostrouchov (5), N. Podhorszki, D. Pugmire, R. Sankaran, R. Sisneros (7), R. Tchoua, A. Tharrington (#), R. Toedte

Technology Integration (G. Shipman; S. Mowery):
T. Barron, D. Dillow, D. Fuller, R. Gunasekaran, S. Hicks (5), Y. Kim, K. Matney, R. Miller, S. Oral, B. Settlemyer (5), J. Simmons, D. Steinert, V. Tipparaju (5), S. Vazhkudai (5), F. Wang, V. White, Z. Zhang

User Assistance and Outreach (A. Barker; A. Fields):
J. Buchanan, J. Eady (5), D. Frederick, C. Fuson, B. Gajus (5), E. Gedenk (1), M. Griffith, S. Hempfling, J. Hines (#), S. Jones, C. Kerns (1), D. Levy (5), M. Miller, L. Rael, B. Renaud, C. Rockett (1), D. Rose (5), J. Smith, W. Wade (1), B. Whitten, L. Williams (5)

Application Performance Tools (5) (R. Graham; T. Darland):
R. Barrett, W. Bland, L. Broto (7), O. Hernandez, S. Hodson, T. Jones, R. Keller, G. Koenig, J. Kuehn

Cray Supercomputing Center of Excellence:
J. Levesque, N. Wichmann, J. Larkin, D. Kiefer, L. DeRose

Key: 1 Student, 2 Post Graduate, 3 JICS, 4 Cray, Inc., 5 Matrixed, 6 Subcontract, 7 Post Doc, * Acting, # Task Lead, @ Technical Coordinator

78 FTEs

ORNL is managed and operated by UT-Battelle, LLC under contract with the DOE.

Scientific Computing

Scientific Computing facilitates the delivery of leadership science by partnering with users to effectively utilize computational science, visualization, and workflow technologies on OLCF resources through:
• Science team liaisons
• Developing, tuning, and scaling current and future applications
• Providing visualizations to present scientific results and augment discovery processes

We allocate time on the DOE systems through the Innovative and Novel Computational Impact on Theory and Experiment (INCITE) Program

INCITE provides awards to academic, government, and industry organizations worldwide needing large allocations of computer time, supporting resources, and data storage to pursue transformational advances in science and industrial competitiveness.

User Demographics

[Chart: Active Users by Sponsor]

System time is allocated to each project. We do not charge for time except for proprietary work by commercial companies.

Some INCITE research topics
• Glimpse into dark matter
• Supernovae ignition
• Protein structure
• Creation of biofuels
• Replicating enzyme functions
• Protein folding
• Chemical catalyst design
• Efficient coal gasifiers
• Combustion
• Algorithm development
• Global cloudiness
• Regional earthquakes
• Carbon sequestration
• Airfoil optimization
• Turbulent flow
• Propulsor systems
• Nano-devices
• Batteries
• Solar cells
• Reactor design

Next INCITE Call for Proposals: April 2011
• Awards for 1, 2, or 3 years
• Average award > 20 million processor hours per year
• Contact us about discretionary time for INCITE preparation

Contact information: Julia C. White, INCITE Manager, whitejc@DOEleadershipcomputing.org

Gordon Bell Prize Awarded to ORNL Team

Three of six Gordon Bell finalists ran on Jaguar
• A team led by ORNL’s Thomas Schulthess received the prestigious 2008 Association for Computing Machinery (ACM) Gordon Bell Prize at SC08
• For attaining the fastest performance ever in a scientific supercomputing application
• Simulation of superconductors achieved 1.352 petaflops on ORNL’s Cray XT Jaguar supercomputer
• By modifying the algorithms and software design of the DCA++ code, the team was able to boost its performance tenfold

Gordon Bell Finalists
• DCA++ (ORNL)
• LS3DF (LBNL)
• SPECFEM3D (SDSC)
• RHEA (TACC)
• SPaSM (LANL)
• VPIC (LANL)

UPDATE: with upgraded Jaguar, DCA++ has exceeded 1.9 PF

OLCF is working with users to produce scalable, high-performance apps for the petascale

Science area    Code        Contact         Cores      Total performance          Notes
Materials       DCA++       Schulthess      213,120    1.9 PF*                    2008 Gordon Bell Winner
Materials       WL-LSMS     Eisenbach       223,232    1.8 PF                     2009 Gordon Bell Winner
Chemistry       NWChem      Apra            224,196    1.4 PF                     2009 Gordon Bell Finalist
Nanomaterials   OMEN        Klimeck         222,720    >1 PF                      2010 Gordon Bell Submission
Seismology      SPECFEM3D   Carrington      149,784    165 TF                     2008 Gordon Bell Finalist
Weather         WRF         Michalakes      150,000    50 TF
Combustion      S3D         Chen            144,000    83 TF
Fusion          GTC         PPPL            102,000    20 billion particles/sec
Materials       LS3DF       Lin-Wang Wang   147,456    442 TF                     2008 Gordon Bell Winner
Chemistry       MADNESS     Harrison        140,000    550+ TF


Scientific Progress at the Petascale

• Nuclear Energy: High-fidelity predictive simulation tools for the design of next-generation nuclear reactors to safely increase operating margins.
• Fusion Energy: Substantial progress in the understanding of anomalous electron energy loss in the National Spherical Torus Experiment (NSTX).
• Nanoscience: Understanding the atomic and electronic properties of nanostructures in next-generation photovoltaic solar cell materials.
• Turbulence: Understanding the statistical geometry of turbulent dispersion of pollutants in the environment.
• Energy Storage: Understanding the storage and flow of energy in next-generation nanostructured carbon tube supercapacitors.
• Biofuels: A comprehensive simulation model of lignocellulosic biomass to understand the bottleneck to sustainable and economical ethanol production.




Nanoscience / Nanotechnology: Petascale Simulations of Nano-Electronic Devices

Science Objectives and Impact
• Identify next-generation nano-transistor architectures; reduce power consumption and increase manufacturability
• Model, understand, and design carrier flow in nano-scale semiconductor transistors

Science Results
• Coherent transport simulations in band-to-band tunneling devices with simulation times of less than an hour => rapidly explore the design space
• Incoherent transport simulations coupling all energies through phonon interactions; production runs on 70,000 cores in 12 hours => first atomistic incoherent transport simulations

Research Team
• M. Luisier and G. Klimeck, Purdue University
• 3-year INCITE award, with 20 million hours in 2010

OMEN: 3D, 2D, and 1D atomistic devices

Computational Fluid Dynamics: Smart-Truck Optimization

Research Team
• Mike Henderson, BMI Corp.
• Participant in the Industrial Partnerships Program

Science Results
• Unprecedented detail and accuracy of a Class 8 tractor-trailer aerodynamic simulation
• Minimizes drag associated with the trailer underside
• Compresses and accelerates incoming air flow and injects high-energy air into the trailer wake
• => The UT-6 Trailer Under Tray System reduces tractor/trailer drag by 12%

Science Objectives and Impact
• Apply advanced computational techniques from the aerospace industry to substantially improve the fuel efficiency and reduce the emissions of trucks by reducing drag and increasing aerodynamic efficiency
• If all 1.3 million long-haul trucks operated with the drag of a passenger car, the US would annually:
  - Save 6.8 billion gallons of diesel
  - Reduce CO2 emissions by 75 million tons
  - Save $19 billion in fuel costs
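As a rough consistency check on those fleet-wide figures, the sketch below redoes the arithmetic; the CO2 emission factor (about 22.4 lb of CO2 per gallon of diesel) is an assumed value, and the fuel price is simply implied by the numbers on the slide:

```python
# Rough consistency check of the fleet-wide savings quoted above.  The CO2
# factor is an assumed emission factor for diesel; the gallons and dollar
# figures come from the slide itself.

gallons_saved = 6.8e9               # from the slide
co2_lb_per_gallon = 22.4            # assumed emission factor for diesel
fuel_cost_total = 19e9              # from the slide, in dollars

co2_tons = gallons_saved * co2_lb_per_gallon / 2000   # short tons
implied_price = fuel_cost_total / gallons_saved       # $/gallon implied by the slide

print(f"CO2 avoided: ~{co2_tons / 1e6:.0f} million tons")     # ~76 million tons (slide: 75)
print(f"Implied diesel price: ${implied_price:.2f}/gallon")   # ~$2.79/gallon
```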

Aerodynamic Performance Testing Methods: Jaguar CFD analysis of truck and mirrors

Examples of OLCF Industrial Projects

• Developing new add-on parts to reduce drag and increase fuel efficiency of Class 8 (18-wheeler) long-haul trucks. This will reduce fuel consumption by up to 3,700 gallons per truck per year and reduce CO2 by up to 41 tons (82,000 lb) per truck per year. BMI is using NASA FUN3D, and a NASA team is assisting BMI with code refinement. (OLCF Director’s Discretionary Award)
• Analyzing unsteady versus steady flows in low-pressure turbomachinery and their potential effects on more energy-efficient designs. (OLCF Director’s Discretionary Award)
• Studying, at the nanoscale, catalysts that can selectively produce hydrogen from biomass (hydrogen to be used as energy for fuel cells). (OLCF Director’s Discretionary Award)
• Developing a unique CO2 compression technology for significantly lower-cost carbon sequestration. (ALCC award)

INCITE awards

10-Year Strategy: Moving to the Exascale

• The U.S. Department of Energy requires exaflops computing by 2018 to meet the needs of the science communities that depend on leadership computing
• Our vision: provide a series of increasingly powerful computer systems and work with the user community to scale applications to each of the new computer systems
• OLCF-3 Project: a new 10-20 petaflops computer based on early hybrid multi-core technology

OLCF Roadmap from 10-year plan:
[Timeline, 2008-2019: today’s 1 PF and 2 PF (6-core) systems in the ORNL Computational Sciences Building; OLCF-3 at 10-20 PF; future systems of 100 PF, 300 PF, and 1 EF; facilities include the ORNL Multiprogram Research Facility and the planned ORNL Extreme Scale Computing Facility (140,000 ft²)]

OLCF-3 “Titan” System Description

• Similar number of cabinets, cabinet design, and cooling as Jaguar
• Operating system upgrade of today’s Cray Linux Environment
• New Gemini interconnect
  - 3-D torus
  - Globally addressable memory
  - Advanced synchronization features
• New accelerated node design using GPUs
• 20 PF peak performance
  - 9x the performance of today’s XT5
• 3x larger memory
• 3x larger and 4x faster file system
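As a rough cross-check, these multipliers line up with the Jaguar XT5 figures quoted earlier in this deck (2.33 PF peak, 300 TB of memory, over 240 GB/s of demonstrated file-system bandwidth); a short sketch of the arithmetic:

```python
# Rough check of the OLCF-3 ("Titan") multipliers against the Jaguar XT5
# figures quoted earlier in this deck.

xt5_peak_pf, xt5_mem_tb, xt5_fs_gbs = 2.33, 300, 240

print(f"Peak:        20 PF is ~{20 / xt5_peak_pf:.1f}x the XT5's 2.33 PF")    # ~8.6x, rounded to "9x"
print(f"Memory:      3x of {xt5_mem_tb} TB is ~{3 * xt5_mem_tb} TB")          # ~900 TB
print(f"File system: 4x of {xt5_fs_gbs} GB/s is ~{4 * xt5_fs_gbs} GB/s")      # ~960 GB/s, i.e. ~1 TB/s
```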