
NWPerf: System Wide Performance Monitoring

Understanding and evaluating the utilization of the system under normal workloads.


R. Scott Studham
studham@ornl.gov



The majority of this work was done while I was working at PNNL with
Ryan Mooney, Ken Schmidt and Jarek Nieplocha.


OAK RIDGE NATIONAL LABORATORY
U.S. DEPARTMENT OF ENERGY

ORNL has a history of Platform Evaluations

[Timeline, 1988 to today: Intel i/PSC-2 (1988), Intel i/PSC-860 (1990), KSR-1 (1991), Intel Paragon XP/S-35 (1992), Intel Paragon MP X/PS-150 (1995), SRC Prototype (1999), IBM S80 (1999), GSN Switch (2000), and more recently IBM Power3, IBM Power4, SC Cluster, SGI Altix, IBM Federation, Cray X1, Cray XD1, Scalable IO, Reconfigurable Computing, BlueGene, Redstorm, and IBM Power5]


Platform Evaluation Methodology

Microbenchmarks

Kernels

Applications

System Infrastructure & IO

Open Environment

Report


HPC Evaluation Methods

Platform Evaluation


Determine a new platform's applicability to
a range of applications.

Microbenchmarks

IO & System Characteristics

Application Benchmarking

Work with manufacturers and system
architects to make the systems better



Focus: Evaluation of emerging platforms
to understand their benefits.

Workload Characterization


Determine how existing platforms are
utilized by applications and users.

“Efficiency” of application on platform

System Utilization

Average job sizes

Work with application developers, users
and manufacturers to decrease runtime.



Focus: Evaluation of deployed platforms
to understand how they are used by
general users.


PNNL’s HPCS2 Configuration

[System diagram:]

1,976 1.5GHz Itanium® processors (11.8TF)

6.8TB memory

1/2PB of disk

928 compute nodes on Elan3 and Elan4 interconnects

4 login nodes with 4Gb-Enet

2 system management nodes

2Gb SAN / 53TB Lustre


HPCS2 Node Architecture

Each node has:

Two 1.5 GHz processors with 6MB cache

8GB of RAM

1/2TB local filesystem (2x SCSI160) sustaining a 200MB/s write rate and a 400MB/s read rate

PCI-X2 (1GB/s), Elan3 and Elan4 interconnect connections, a 1000T Ethernet connection, and a serial connection


HPCS2 Design Requirements

The MSCF focuses on computational chemistry (NWChem), which is:

Based on a Non-Uniform Memory Architecture

Built on one-sided communications

Able to use multiple IO algorithms that exploit local and global disk

Leading to an ideal design requirement of:

The majority of the nodes have ~1/2TB of local IO (each) at 25MB/s per GF of DGEMM

>200MB/s write and >400MB/s read

One-sided remote memory access

Efficient processors for BLAS3 operations

Large cache

[Chart: FY02 workload mix by application: NWChem (MD, PW, ab initio), VASP, ADF, Jaguar, Gaussian, climate code, users' own codes, other]


Platform Evaluation of HPCS2

HPCS2 provides the best capability and the fastest time-to-solution for the NWChem DFT benchmark



Workload Characterization of HPCS2

We wanted to develop a method to answer the following:


What percent of jobs use all that memory/disk?


If we had a shared memory system, or a shared disk pool could
we get away with less storage on the next system?


What is the sustained performance for the average job?


What is the average user doing vs. what the box was designed
for?


What are the impacts of slow memory on the average user?


Where is the bottleneck for the average run?


What is the average job size? How does job size impact
efficiency of the calculation?


Design requirements

Load >20 metrics (e.g. FLOPs, integer ops, memory BW, network BW, disk BW) for all jobs run on the system into a central database.

Have a fine enough granularity in the data that you can see the needs of the different algorithms used.

Keep all data and develop a mathematical “center profile”.

Can't impact a job's performance by more than 1%.

[Chart: FLOPS efficiency (0-80%) over time (0-900) for the ccsd(t) and ccsd phases of a run; 24% average efficiency over 128 nodes (~1/3TF)]


NWPerf

A high-performance, low-overhead system analysis tool



Low overhead synchronized data collection


Limited impact on time to solution (< 1%)


Fine Granularity Data (1 minute samples)


Multiple performance metrics, easily extensible to add more


simple initialize(), action(), close() C API


Automatic correlation between jobs and performance metrics


Job post processing tied into batch job scheduler (LSF)


Data archival for reporting on long term historical data (multiple months of
detail, longer for summary data)


2.8 billion detail data points on 19,883 unique jobs between Jun 30, 2004 and Sep 22, 2004; an average of just over 111 thousand data points/job (over 52 million data points for the largest job, a 3-day 900-CPU job).


Scalable Architecture works on larger systems (over 1000 Nodes with minimal
data loss)


Other Solutions Considered


Ganglia


great tool, easy to use but didn’t entirely fit our requirements:


Ganglia V2 was difficult to extend in a low impact fashion (gmetric via cron is not low impact)


Unsynchronized collection (ref:
Petrini) causes variable system performance


RRD storage is nice for fixed time-series graphs, but difficult for ad hoc queries


Supermon showed good per node performance, however…


Polling based systems are almost implicitly unsynchronized


Still need to provide a storage solution


Node management seemed excessively cumbersome and manual without major code change.


kernel module reliance reduces portability and increases administrative cost of deployment (same with dproc)


Other solutions such as dproc, CARD, PARAMON were not proven to scale to large clusters


Clumon/PCP are oriented toward “point-in-time” data, not long-term systematic analysis
(although the potential exists for extension)


Wanted to avoid requiring users to do anything


Transparent and runs “alongside” all jobs (ala Ganglia, Supermon)


Ruled out approaches like srun/perfsuite (these are not necessarily competitive, and are likely
complementary)


Complements tools like Vtune, Vprof, Perfsuite, gprof, etc… by helping point out jobs that
may benefit from more detailed analysis.



Benefits


Benefit system designers

what is really important (memory, flops, something else, all of the above?)


Benefit users

Is my application sized properly? Is it having obvious problems?


Benefit code developers

What resources is the application using, and does it have good performance?


Stop the hype

What is “performance”? We needed a simple way to compare performance
characteristics for a large set of real-world applications.












Version 1 Block Diagram

[Block diagram: NWPerf clients on the compute nodes each load several collection modules and send data over a GigE network to the collection server. On the server, packet handlers on listen sockets feed lightweight shmem queues; queue drainers move the data into the database. Statistics summarization, an SQL command line, and reporting/visualization run against the database, and collection is synchronized by an NTP clock.]


The Collection Module

27+ metrics collected (12 metrics in V1; * marks data stored in V1)

Each metric is collected on all nodes once per minute

Itanium Performance Counters (each of these metrics is collected separately for both
CPUs in a compute node)


*Flops


Floating point operations per second as a percent of theoretical peak


*Memory Bytes/Cycle


Average main memory accesses per CPU clock cycle


Total Stalls


Total stalls per CPU clock cycle; this may be any of 5 different stall types


Local Scratch Usage (obtained via fstat() )


Blocks Used and Block Free


Inodes used and inodes free (this yields an estimate of files open)


VMStat information (obtained via /proc/meminfo and /proc/stat)


*Memory swapped out (total), swap blocks in and out


*Memory free, used, and used as system buffers (cache + buffers)


*Block I/O in, and out


*Kernel Scheduler CPU allocation to user, kernel, and idle time


Processes running, and blocked


Interrupts, and Context Switches per second.


Lustre I/O (Shared global Filesystem)


Bytes in/out (both client and OST)



Advanced Query Features


Using the PostgreSQL database
(it has nice date and other data-type handling features; http://www.postgresql.org)


Use of views and other features can make complex details more accessible.


No averaging or interpolation, so no data is lost in storage
(unlike RRD a la Ganglia or Cricket, although these tools definitely have a place).


Similar in concept to RTG
(Router Traffic Grapher, http://rtg.sourceforge.net)


For example, we can “easily” find the top FLOPS for >32-processor jobs:

db=> select jobid, username, runtime, numproc, point, avg from job_average_detail where
avg in (select max(avg) from job_average_detail where point in ('0_PEAK_FLOPS',
'1_PEAK_FLOPS') and numproc > 32 group by point);


 jobid | username | runtime  | numproc |    point     |  avg
-------+----------+----------+---------+--------------+--------
 47xxx | xxxxxx   | 04:14:27 |     256 | 0_PEAK_FLOPS | 38.736
 47xxx | xxxxxx   | 04:14:27 |     256 | 1_PEAK_FLOPS | 38.826


System Impact


Low overhead synchronized data collection

Effectively reduces the cluster clock by at most the collection window (i.e. 0.6s out of each
1-minute cycle)


Limited impact on time to solution

(< 1%) Enforced by fixed collection interval times


Fine Granularity Data (1 minute samples)


[Timing diagram: with unscheduled collection, Node0, Node1, and Node2 each report at a different point during the application's run, so a tightly coupled job may suffer the sum of the interrupt times on all nodes. With NWPerf's scheduled collection, every node reports inside the same 0.6s window of each 60s cycle, leaving 59.4s undisturbed; increasing the overlap of the interruptions reduces the total possible interruption.]


System Impact (v1)

Measured impact of NWPerf on All-Reduce and All-to-All (llcbench, 256 CPUs; 12K all-to-all runs, 20K all-reduce runs) for:

no monitoring

normal one-sample-per-minute monitoring (1X)

one sample every 6 seconds, ten times the normal rate (10X)

no NWPerf and no LSF running

[Charts: min and mean execution time for the None, 1X, 10X, and No-LSF cases]

LSF noise dwarfs the effects of NWPerf noise; we suspect that Quadrics RMS has a significant impact as well, but haven't yet been able to test it.

This is true even though 10X NWPerf has almost as much total interrupt time per node as LSF; however, LSF interruptions tend to be normally distributed, whereas NWPerf's are more synchronized.



High Burst Data Gathering

High peak burst data rates create a challenging problem for data collection.

Burst UDP multicast places a large burden on the server to keep up.

[Diagram: packets arriving into the socket & network buffers, overflowing when the server falls behind. Data loss: V1 ~10%, V2 <1%]

A system clock synced to NTP only gave us 88% of the samples within 0.6 seconds.

We were, however, gathering and storing over 90% of the transmitted data points.


Version 1 Problems


Each metric was a separate message

Lots of overhead (the packet overhead approached the size of the data)


Messages contained free-form text: variable-length and long (which also added
overhead in the database)


Scheduling of synchronous collection, while OK, was susceptible
to clock drift (up to 3 seconds)


Protocol decoding done by packet receivers added
overhead/latency


Lightweight Shmem queue not lightweight enough


Client was synchronous and could not (easily) enforce collection
timeouts


Client could be crashed by a misbehaving module


Poor user interface












Version 2 Block Diagram

[Block diagram: NWPerf clients on the compute nodes fork their collection modules and send bundled packets over a GigE network to the collection server. An activity scheduler coordinates the clients. Packet handlers on listen sockets feed non-blocking shmem queues; queue drainers decode the packets and load the database. Statistics summarization and a web server for reporting and visualization run against the database.]


Version 2 Enhancements


Centralized scheduler


Single clock


good synchronization


Tells clients where to transmit (opens up interesting possibilities for load
balancing, etc…)


Bundled messages (~30 per packet), fixed format with lookups
for textual information (host names, point names)


Protocol decoding done by queue drainer


Non-blocking shmem queue: very low latency in moving packets
off of the network.


Client forks child for each module and can enforce scheduling
(with prejudice) and cleanup/disable misbehaving modules.


User interface in progress (not saying it's “good” yet)


Major Unresolved Issues


We can correlate performance to specific job runs, but NOT to
specific applications


NWPerf is relatively difficult to deploy and tune compared to the
competition


Could use security enhancements.


Future Work


Cross system portability, mostly requires more collection
modules and some source code cleanup (autoconf).


SGI Altix mostly working (collection working; needs scheduler
integration work)


AIX 5.2 port in progress (collection mostly working, no work done on
scheduler integration)


Open Source (GPL) in progress, working through Intellectual
Property process


Automatically report on jobs with suspected problems


Better user interface with more detailed help on what the
various metrics mean and what might be done to improve
application performance based on the results


Real-time visualization of and access to job data to detect
problems with running jobs


Future Advanced Analysis


Cross-node and cross-job event correlation

(e.g. job one is using the shared file system, so jobs ten and twenty are slowed down).


Perform validation tests at different sampling rates to
determine preferred sample rates for different data
points.


Event anomaly detection and prediction based on
gathered profile.


Potential predictive hardware fault analysis.


Finally, we hope to more accurately characterize which parts
of the system are important for efficient job performance.


User Interface

Users can access a list of all their recent jobs, and pull up quick
summaries and charts for all the performance metrics.


A “Good” job

The 3 graphs here are for a 3-day 600-CPU job.

It was a CCSD(t) calculation of octane.

SCF at the beginning hit 61GB/s of IO.

Utilized 1.8TB of memory and sustained 36% efficiency (1.3TF).


A “Normal” job

https://statman.hpcs2.emsl.pnl.gov/nwperf/


Job 159999


A “Bad” Job

11KB/s mean I/O to disk

74% mean time in kernel space

Our “star” bad job suffered from floating-point assists due to a compiler “feature” that caused optimistic prefetching of invalid data in some types of variable-length loops:

double foo(int len) {
    ...
    for (int i = 0; i < len; i++) {
        if (d_array[i] < d) {
            blah();
        }
    }
}


Finding Problem Jobs

Job went into swap

CPU user space: 20% of runtime (over 10% idle)

CPU kernel space: 70% of runtime

Found based on a previous job:

db=> select jobid, avg
     from job_average_detail
     where avg > 50
     and point = 'cpu_user';

High kernel-space versus user-space CPU time is usually indicative of
floating-point assists (non-normalized floating-point operations).

Over 80% of cycles experienced a stall. The end result: less than 3% of
peak FLOPS; this job probably could do much better.


Most users do not utilize the full capability of the system.

“I need 6GB of memory per CPU”

[Chart: memory footprint per node versus CPUs used, as a percent of jobs, during FY04 on the 11.4TF HPCS2 system at PNNL]

Most jobs use <25% of available memory (max available is 6-8GB).

Large jobs use more memory.


Aggregate Results

Busy cycles as a function of job size

10% of the >256-CPU jobs have the CPU scheduled idle >50% of the time.

Sustained performance as a function of CPU count

The median sustained performance for jobs over 256 CPUs is 3%.


Aggregate Results

Stalled cycles as a function of job size

Sum of all stalls due to:

BE_FLUSH_BUBBLE_ALL Branch
misprediction flush or exception

BE_EXE_BUBBLE_ALL Execution
unit stalls

BE_L1D_FPU_BUBBLE_ALL Stalls
due to the L1D (L1 data cache)
micropipeline or the FPU (floating-point
unit) micropipeline

BE_RSE_BUBBLE_ALL Register
stack engine (RSE) stalls

BACK_END_BUBBLE_FE Front-end stalls



Job size

Primarily focused on small capacity jobs.

[Chart: percent of cycles used by job size, monthly from Oct through Jul, in bands of <12%, <33%, <50%, and >50% of the computer]


Summary: We developed a tool to profile every code
and get an unbiased assessment of the real needs.

[Chart: NWPerf impact on all-to-all, 256 CPUs (12K llcbench runs): min and mean execution time for the None, 1X, 10X, and No-LSF cases]
27 metrics are collected on all nodes once per minute


Hardware Performance Counters including: Flops, Memory
Bytes/Cycle, Total Stalls


Local Scratch Usage (obtained via fstat() )


Memory swapped out (total), swap blocks in and out


Memory free, used, and used as system buffers


Block I/O in, and out


Kernel Scheduler CPU allocation to user, kernel, and
idle time


Processes running, and blocked


Interrupts, and Context Switches per second.


Lustre I/O (Shared global Filesystem)

The 3 graphs are from the same 3-day 600-CPU run

NWPerf: Ryan Mooney, Scott Studham, Ken Schmidt


Discovery

[Diagram: quadrants of user expertise (low to high) versus scalability of the application on the platform, spanning computer design, platform evaluation, and the region where most utilization falls]
After analyzing 19,883 jobs…


Median sustained FLOP performance for
jobs over 256 CPUs is 3%


Less than 20% of the memory is in use at
any given point in time.


Less than 5% of the jobs use “heavy IO”
(>15MB/s per GF DGEMM)


Over 60% of the cycles are for jobs that
use less than 1/8 of the system.


10% of the >256CPU jobs have the
CPUs scheduled for idle >50% of the
time.


…we have determined that most users do not use the system as designed.


NLCF: Three purpose-built architectures optimized
for applications


Proven architecture for
performance and reliability


Most-powerful processors and
interconnect


Scalable, globally addressable
memory and bandwidth


Leverages commodity where
possible


Offers capability computing for
key applications



Extremely low latency, high
bandwidth, interconnect


Efficient scalar processors,
balanced interconnect


Known system architecture


based on ASCI Red


Shares interconnect
technology with Cray X2


Front-end-hosted environment: system calls
and I/O


Very low power and space
requirements


Interconnects balanced

with processors


Unique pairing of mesh and tree,
low latency


Much higher parallelism

(tens of thousands of CPU)


Potential for new capabilities
for selected applications


Scalability platform for
algorithms and CS research

Cray X1

BG/L


Science teams enabled through “End Stations”

[Diagram: ultrascale hardware (HW teams) and software & libraries (SW teams) underpin a capability platform; computational end stations (a development end station, an open end station, end station 1, end station 2, ...) pair research teams and tuned code with high-end science problems to produce breakthrough science]


Computational End Station

NLCF deploys a fundamentally new approach for long-term engagement of research
communities, modeled on the “end station” concept through which major experimental
facilities provide specialized instruments to specific user groups.

An End Station is defined by three characteristics:

1. National problem: addresses problems that are of national importance (e.g., nanotech)

2. Scientific team: willing to create and maintain the end station

3. Application suite: a suite of scientific codes in the area, tuned to NLCF resources


The Challenge

[Diagram: the same user-expertise versus application-scalability quadrants, with the capacity center at low expertise and low scalability, and end-station calculations at high expertise and high scalability]

Work with the application community, via end stations, to educate users on how to use the systems efficiently, and to ensure that users' applications scale on the given NLCF platforms.


Acknowledgements

Ryan Mooney and Ken Schmidt for their tireless work to develop NWPerf.

Jarek Nieplocha for his guidance on how to quantify system impacts.

The research described in this presentation was performed using the Molecular Science
Computing Facility (MSCF) in the William R. Wiley Environmental Molecular Sciences
Laboratory, a national scientific user facility sponsored by the U.S. Department of Energy's
Office of Biological and Environmental Research and located at the Pacific Northwest
National Laboratory. PNNL is operated for the Department of Energy by Battelle.

This research is sponsored by the Office of Advanced Scientific Computing Research, U.S.
Department of Energy. The work was performed at the Oak Ridge National Laboratory,
which is managed by UT-Battelle, LLC under Contract No. DE-AC05-00OR22725.

Experiments and data collection were performed on the Pacific Northwest National
Laboratory (PNNL) 977-node Linux 11.8 TFLOPs cluster (HPCS2) with 1954 Itanium-2
processors.

The data collection server is a dual-Xeon Dell system with a 1TB ACNC IDE-to-SCSI RAID
array.