SHARCNET TASP Annual Report 2005-2006









Ge Baolai, David McCaughan, Doug Roberts, Jemmy Hu, Sergey Mashchenko

High Performance and Technical Computing

Shared Hierarchical Academic Research Computing Network


May 2, 2006



TASP Annual Report 2006


Page
2

About SHARCNET


SHARCNET is a consortium of colleges and universities operating a "cluster of clusters" of high performance computers, linked by an advanced, dedicated optical network provided through the Optical Regional Advanced Network of Ontario (ORION). This unique computational infrastructure is combined with an active academic-industry partnership, enabling world-class computational research.



Formally established in June of 2001, SHARCNET now consists of 16 leading academic institutions in Ontario. The consortium is administered by The University of Western Ontario, and includes the Universities of Guelph, McMaster, Wilfrid Laurier, Windsor, Waterloo, Brock, York, Laurentian, Trent, Lakehead, Ontario Institute of Technology, Fanshawe and Sheridan Colleges and most recently Perimeter Institute and the Ontario College of Art & Design.


SHARCNET is founded on academic-industrial collaboration. Its private sector partners include Hewlett-Packard, Platform Computing, Bell Canada, Nortel Networks, Quadrics, Silicon Graphics, and the Optical Regional Advanced Network of Ontario.


With total funding of over $100m to date, SHARCNET is generously supported by both
Federal and Ontario Provincial Funding Programs, as well as by its institutional members
and private sector partners.


Resources and Usage


SHARCNET is currently in a phase of significant infrastructure expansion, which is raising the visibility of high performance computing in both Ontario and Canada. The year 2005-2006 has seen the arrival and deployment of over 6000 new processors and almost 500 Terabytes of storage in a variety of configurations from our vendor partners, Hewlett-Packard and Silicon Graphics. A partnership with the Optical Regional Advanced Network of Ontario has allowed SHARCNET to deploy a dedicated optical network with 10 Gb of bandwidth over ORION, connecting system components into a single coordinated resource to facilitate increased collaboration and more computationally intensive research.


Table 1 lists all systems (new and existing) deployed across SHARCNET. Together with the Canada Foundation for Innovation (CFI) funded SHARCNET Phase-II expansion, we now make available over 7200 processors to our user community, and to researchers from Canada more generally. To better accommodate the unique needs of our member institutions, smaller-scale point-of-presence (POP) clusters (not shown) have been deployed at most sites to enable custom deployments, which constitute additional computational resources.










Table 1: SHARCNET deployed systems at a glance.

Name        Type     Make     CPU            # CPUs  OS             State
bull        cluster  hp       Opteron           384  XC 3.0         Testing
cat         cluster  generic  Xeon, Opteron     162  RedHat 8.0     Online
coral       cluster  hp       Itanium2           64  XC 2.0         Online
goblin      cluster  Sun      Opteron            56  Fedora Core 2  Online
greatwhite  cluster  Compaq   Alpha             388  RedHat 7.2     Online
idra        cluster  Compaq   Alpha             128  Tru64          Online
narwhal     cluster  hp       Opteron          1068  XC 3.0         Online
requin      cluster  hp       Opteron          1536  XC 3.0         Testing
silky       SMP      SGI      Itanium2          128  Linux 2.4      Online
tiger       cluster  Compaq   Alpha               8  RedHat 7.2     Online
typhon      SMP      Compaq   Alpha              16  Tru64          Online
whale       cluster  hp       Opteron          3072  XC 3.0         Testing
wobbe       cluster  hp       Opteron           208  Fedora Core 3  Online
Total                                         7218

The SHARCNET user community has been steadily growing. As of May 1, 2006 there have been 1372 user accounts opened; a significant increase of 67.9% compared to 2005. Figure 1 shows the accumulated number of user accounts. In the graph, accounts are identified as sponsored users (labeled as "users") and "sponsors". Sponsors are supervisors and principal investigators who supervise and direct students, postdocs and fellow researchers in research projects.



With new systems being added to the existing computational resource pool, we have seen an elevated demand for resources. Figure 2 shows the SHARCNET-wide usage in CPU years accumulated since 2001. Taken together with the dramatic increase in the volume of large parallel jobs shown in Figure 3, the statistical data clearly demonstrate the demand for additional HPC resources within the academic community. The investment in HPC by CFI and the provincial government is thus well justified and well utilized, in turn enabling the leading research performed by researchers in the community.






Figure 1: Cumulative number of users at SHARCNET since 2001.


It should be pointed out that the apparent slowdown in the cumulative CPU usage from January 2006 to May 2006 in Figure 2 was due to service interruptions for machine room renovation and system testing. All legacy hardware had to be disassembled and transported to new locations, and the space for the new machines required extensive renovations. Acceptance testing of the new systems took longer than expected due to various unforeseen issues, resulting in some delay in making these resources available to users.


In addition to the traditional presence of users from physics, chemistry, mathematics, computer science and engineering, there has been a notable increase in users from the social sciences as well. Alongside the traditional HPC applications, we have observed the emergence of non-traditional scientific work, including studies in pension policies, simulation of catastrophic events and hazard prevention, as well as research into HPC services themselves, for example batch scheduling across systems on a wide area network.






Figure 2: Cumulative CPU usage (years).


Figure 3: Cumulative number of jobs.



Accessibility


In SHARCNET Phase-I, access to the various systems housed at different sites was handled in isolation. This was proving highly inconvenient, particularly when users migrated from one system to another in search of cycles. A major initiative during the past year was the design and implementation of our unified resource management system. Users now deal with only a single account across all SHARCNET hardware and software services. A unified model is in place for handling distributed home directories (with redundancy in case of network failure), and we continue to explore the idea of a single batch scheduling front-end obviating the need for separate job submission, and the possibility of job migration for load balancing across distributed resources. We will highlight the following areas in more detail, where accessibility has improved most dramatically:


Uniform Accessibility. In a process that began in 2005, we have designed and established protocols for a unified user environment across our entire infrastructure. Dedicated, high bandwidth optical networking over ORION efficiently connects distributed HPC facilities spread throughout SHARCNET institutions. Users are now able to access systems and web-based resources from anywhere using a single user account and password. There is now a single set of home directories that appears across the consortium, with limited replication to ensure users can still access their data even in the event of internal network failure.

It is no longer necessary for users to move their files when they migrate from one system to another, and we can now consider implementing more germane measures for keeping large data sets close to the hardware that is making use of them, with sufficient transparency that the user need not consider such issues in practice.
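As a toy illustration of the replica fallback idea described above, the following Python sketch resolves a user's home directory to a local replica when the primary copy is unreachable. The mount points and function name are invented for illustration; they are not SHARCNET's actual filesystem layout.

```python
import os

def resolve_home(user, primary="/global/home", replica="/replica/home"):
    """Return a path to the user's home directory, preferring the
    primary copy and falling back to a (possibly read-only) replica."""
    primary_path = os.path.join(primary, user)
    if os.path.isdir(primary_path):
        return primary_path
    # Primary filesystem unreachable or missing: fall back to the local
    # replica so the user can still read their data during an outage.
    return os.path.join(replica, user)
```

In practice the fallback would be handled transparently by the filesystem layer rather than by user code; the sketch only makes the policy explicit.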


Unified Software Environment. We have also set policies with regard to a unified software environment across all systems. Regardless of the physical hardware, operating system or software, users see the same system installation, the same packages in the same places, and the same compilation and execution environment, and invoke most software tools in the same way (i.e. we provide meta-level scripts wrapped around the actual tools to keep the user experience consistent across the infrastructure). These policies have been in place from the beginning on all newly deployed systems, and migration of the legacy hardware systems to the new model is underway.
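To make the meta-level wrapper idea concrete, here is a minimal hypothetical sketch of such a front end: a single `mpicc` command that dispatches to whichever vendor compiler driver the local system actually provides. The candidate paths are illustrative guesses, not SHARCNET's real installation layout.

```python
import os
import sys

# Candidate locations for the real compiler driver on different vendor
# systems. These paths are invented for illustration.
MPICC_CANDIDATES = [
    "/opt/hpmpi/bin/mpicc",
    "/usr/local/mpich/bin/mpicc",
]

def find_tool(candidates):
    """Return the first candidate that exists and is executable, else None."""
    for path in candidates:
        if os.path.isfile(path) and os.access(path, os.X_OK):
            return path
    return None

def main(argv):
    tool = find_tool(MPICC_CANDIDATES)
    if tool is None:
        sys.exit("mpicc: no MPI compiler driver found on this system")
    # Replace this wrapper with the real tool, passing arguments through,
    # so the user-visible command is identical on every cluster.
    os.execv(tool, [tool] + argv[1:])
```

Because the wrapper hides the per-system paths, users type the same command everywhere and their build scripts port between clusters unchanged.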


User Database and Web Portal
.
SHARCNET’s web portal has grown to be an
increasingly integral part of the user experie
nce, in addition to the physical resources we
provide.


Online i
nformation is dynamic, and presented at two levels: public and user
essential.

At the public level, users as well as the general public can obtain information
regarding current research, even
ts, system status and other announcements,
help
documents and more

either directly on
web
pages or
though RSS news feeds.

Based on
the specific user account, additional information is made available in the form of internal
SHARCNET information, personal p
rofiles, usage statistics (by cluster, by research
group, by user, by job type, etc.), problem tracking and publications.




Problem Tracking System. The web portal is also the gateway to our online problem tracking system, developed entirely in-house by SHARCNET staff. Users can submit various issues regarding computations and the use of various systems. By logging in to the web portal, users can see the status of each problem they have submitted, and review the comments and resolution made by staff. Users can also add comments to provide more information or to discuss issues further with support staff. The use of the problem tracking system has several advantages.

First, it builds a knowledgebase from which users at large benefit. When submitting a problem, one has the option to search for possible solutions by keywords. The search may immediately return an answer if the problem has already been reported and/or dealt with. Second, by submitting a problem to the tracking system, anyone on the support team will see it, so the user may receive a faster response than by contacting a staff member by email or phone. In addition, the interaction between the user and staff members in dealing with a submitted problem is handled in an organized fashion, and helps to bring forward technical details that might contribute significantly to the knowledgebase. Lastly, with a record of submitted problems stored and managed by our staff and systems rather than by the individuals themselves, users can review problems they have had, which helps them to learn and improve their skills.
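The keyword-lookup step described above can be sketched as follows. This is a toy illustration only; the ticket fields and sample data are invented, not the schema of SHARCNET's actual tracking system.

```python
def search_knowledgebase(tickets, keywords):
    """Return resolved tickets whose subject or resolution mentions
    every given keyword (case-insensitive)."""
    words = [k.lower() for k in keywords]
    hits = []
    for t in tickets:
        text = (t["subject"] + " " + t["resolution"]).lower()
        if t["status"] == "resolved" and all(w in text for w in words):
            hits.append(t)
    return hits

# Invented sample data for illustration.
tickets = [
    {"subject": "MPI job hangs on startup", "status": "resolved",
     "resolution": "Rebuilt the code against the system MPI libraries."},
    {"subject": "Disk quota exceeded", "status": "open", "resolution": ""},
]
matches = search_knowledgebase(tickets, ["mpi", "hangs"])
```

A user searching before filing a ticket would see the resolved MPI ticket immediately, saving both their time and the support team's.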



The problem tracking system has been beneficial on many fronts. User problems are now better managed and are less likely to age for undue periods of time. The ever growing knowledgebase is an invaluable resource for users and staff alike as problems recur over time.


Research Publications. By making use of a SHARCNET account, users are obligated to provide information on their research achievements enabled by the use of SHARCNET facilities. One of the most important aspects of this is data regarding research publications. Users can now enter their own publication data through the web portal, or send it to staff for data entry. Although this publication data is not open to the public, the tracking of this information is critical for reporting, and the process has been streamlined significantly by having a central repository where it is collected over time. There has been some interest in developing a common internal format for this data, together with other consortia in Ontario, to facilitate public access in response to expectations from federal and provincial granting agencies.


Staffing


We have been suffering from staffing shortages, particularly of HPTC analysts, for a significant period of time.


In early 2005, HPTC analyst Daniel Stubbs left to pursue other career options. His position remained unfilled until very recently. In February 2006, SHARCNET welcomed Sergey Mashchenko as our newest software analyst, based out of McMaster University.

Additionally, new system administrators have been hired. Robert Schmidt assumed system administration duties at the University of Waterloo in November 2005, and Thomas Hu of the University of Ontario Institute of Technology (UOIT) has held a half-time position assisting with the hardware deployments at that site since early 2006. Fraser McCrossan has recently been hired as a second system administrator at the University of Western Ontario site.


John Morton, previously the system administrator at the University of Guelph site, was promoted into the newly created role of Technical Manager for SHARCNET. Kaizaad Bilimorya, previously working at Brock University, assumed the vacant System Administrator position in April 2006. The extended period between John's promotion and Kaizaad's hire required an expanded role for existing System Administrators and HPTC analysts, as the Guelph-housed cluster was the first large cluster installed and accepted by SHARCNET.


When compared to other HPC centers, we remain significantly short of human resources given the vast size of the consortium. New, large scale deployments and increasing demands from users and infrastructure management have required each person to constantly perform in multiple roles.

HPTC Support and Activities


SHARCNET users are becoming accustomed to using the Problem Tracking System when they encounter problems. The system has provided an efficient and reliable means of having their problems solved effectively, as well as of tracking support activities and providing a ready knowledgebase.


HPTC support generally takes the form of general programming inquiries and requests for research project consultations. Requests such as the parallelization of generalized eigenvalue problems in arbitrary precision, or the parallel solution of high dimensional partial differential equations in finance, require close engagement of HPTC analysts in users' research projects.


One of the more significant activities last year was HPCS'05, hosted by SHARCNET at the University of Guelph site. TASP analysts were integral to the preparation and delivery of this conference, with one sitting on the Scientific and Organizing Committee. The Parallel Programming Contest was particularly noteworthy as being entirely conceived and run by SHARCNET staff (and we regret that this successful competition is not being repeated at HPCS'06 this year).


HPTC analysts also provide the primary means by which talks and lectures are delivered through workshops and training courses.


In addition to local, site-specific workshops and seminars, the annual SHARCNET Fall Workshops have become a staple within the consortium since 2001. In 2005, the annual Fall Workshop/Symposium was held at York University from October 17 to 19. This year's workshop focused on programming techniques, particularly for multithreaded programming (germane given our new 128-processor Altix deployment), computational theory and large scale applications.




In October 2005, SHARCNET and HPCVL held a joint symposium on high performance computing and applications at Ryerson University in Toronto. The symposium was attended by people from SHARCNET and HPCVL institutions, as well as attendees from the private sector.


One other annual educational activity of note is the Parallel Processing Architectures course, run in conjunction with the Department of Computing and Information Science at the University of Guelph from January through April. As always, this is a key academic resource that allows SHARCNET researchers to have their graduate students educated in an in-depth applied environment (with a graduate credit attached).


Future Development


The following details our immediate short term plans and initiatives.


AccessGrid. SHARCNET is in the midst of deploying AccessGrid rooms at each of our member sites. The typical installation will see a site having three large plasma displays, three servers and two cameras, and will leverage our dedicated fiber network for connectivity between sites. Our goal is to have the rooms accommodate small groups of people and be used for remote collaboration and meetings. The size of these initial deployments will make them unsuitable for large scale remote teaching or other events involving bigger groups of individuals, although small group collaborations and workshops are possible, with the option to allow people to participate via desktop clients where appropriate/feasible.


Four sites will be up and running by the end of May, in time for our Phase II grand opening, with the others coming online as local renovations and networking infrastructure are completed.


New Hardware and Software Trials. One of the perennial areas of interest to SHARCNET, given our distributed nature and large user base, is batch scheduling policies. We continue to work toward a set of sensible, implementable scheduling policies to both unify the view of SHARCNET resources and provide the best utilization of the hardware given the highly competitive demands of the user community (serial farming vs. dedicated parallel, short large jobs vs. long-running small jobs, etc.).
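The kind of trade-off involved can be illustrated with a toy priority function, not SHARCNET's actual scheduler policy: rank queued jobs so that wide parallel jobs are not starved by serial farming, while groups with heavy recent usage yield to lighter ones (a simple fair-share term). The weights are made-up tunables.

```python
def job_priority(ncpus, hours_waiting, group_recent_cpu_years,
                 size_weight=1.0, wait_weight=0.5, fairshare_weight=2.0):
    """Higher score = dispatched sooner. The three weights are invented
    tunables balancing job width, queue aging and fair share."""
    size_term = size_weight * ncpus              # favour wide parallel jobs
    wait_term = wait_weight * hours_waiting      # aging prevents starvation
    fairshare_penalty = fairshare_weight * group_recent_cpu_years
    return size_term + wait_term - fairshare_penalty

# A wide parallel job outranks a serial-farming job from a group with
# heavy recent usage, even after both have waited a day:
p_parallel = job_priority(ncpus=64, hours_waiting=24, group_recent_cpu_years=5)
p_serial = job_priority(ncpus=1, hours_waiting=24, group_recent_cpu_years=40)
```

Tuning such weights against real workloads is precisely the policy question the consortium continues to work on.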



Training and Education. As the new infrastructure continues to be deployed, we are ramping up our schedule of training and workshop sessions at new deployment sites. At the same time, we continue to formalize our training program by producing modular, topical materials that can be assembled in arbitrary formats for use in workshops, courses, and for the HPC community at large. A system of user certification has recently been designed, and is currently under implementation, in order both to ensure resources are being used as effectively as possible, and to provide additional value to the teaching resources provided by the consortium, possibly outside academia. Two streams of certification have been created.




The qualification stream permits users to demonstrate their knowledge of the systems and software in order to qualify for access to more sophisticated (read: expensive) computing resources. Background for these certification levels is provided through workshops, online reference material, and possibly other forms of instruction if they suitably utilize SHARCNET in the process.


The technical stream is intended to provide SHARCNET certification for levels of in-depth instruction received in areas such as MPI programming, OpenMP programming and POSIX Threads programming, and would typically be issued after significant hands-on workshops or academic courses. We hope to export this system of certification in particular, so that it carries value with other consortia and/or industry.


Concluding Remarks


The user community at SHARCNET is vibrant and active, and continues to grow at a strong, steady pace; the user demand for HPC resources grows along with it. While CFI and provincial funding agencies provide reasonable mechanisms for funding the infrastructure, it is the HPTC analysts and other staff who complete the picture, making it possible for these systems to be adequately maintained (hardware and software), and for the research programs of the users themselves to be supported, enabling our user community to apply the HPC infrastructure to their research as effectively as possible.