Biweekly Report - FutureGrid


FutureGrid Report

September 4, 2012

Geoffrey Fox


Introduction

This report is the seventy-fifth for the project and continues with the status of each team/committee and the collaborating sites. It is the last official biweekly report.

Summary

Operations and
Change Management Committee

PY3 Annual Report in progress. New hiring in process. Partner invoice processing ongoing.

Software Team (includes Performance and Systems Management Teams)

GLUE2 data is now being published for Foxtrot and Sierra to the FutureGrid messaging system, Inca data is now being received by the FutureGrid messaging system using X509 authentication, and collaboration continues with the cloud metric work. The 10G upgrades to the perfSONAR machines and cluster node are stalled waiting on one hardware order to be received. The UF team implemented and tested ViNe2 management extensions that collect overlay operation conditions; in particular, connectivity and round-trip latencies among all active ViNe routers can be tested. IU has changed the DB model of FG Move: all the information is now stored in a MongoDB-based database. To ease the initialization of the service, a simple inventory file can be created in which clusters and services are described; the first time FG Move is started, this file can be indicated to initialize the database, and once the information is in the database the inventory file is no longer needed. IU has spent a significant amount of time coordinating the yearly report. UTK released a new version of PAPI-V, which contained various improvements intended for deployment on FutureGrid. Mats Rynge reported that Gideon Juve has created a new set of Pegasus tutorial images for Pegasus 4.1.0. These are in addition to the existing FutureGrid images, but what makes the new ones interesting is that the Amazon EC2, FutureGrid and VirtualBox images are all autogenerated from one master image. This should help keep the different images in sync.

Hardware and Network Team



Daily log information is now collected from clusters with a github repository and UBMoD

fg-move utility in testing at IU.



Successfully tested using the AuthZ plugin to call LDAP with the Globus toolkit 5.0.x and
5.2.x. UC will implement this with 5.2 and share the results.



perfSONAR monitoring systems are now in place and recording data throughput between sites. 10Gbps upgrade hardware delivered for all the pieces except optics.



Alamo moved to a higher performance switch with redundant connections to the XSEDE
network.

Training, Education and Outreach Team (includes user support)

Members of the TEOS team focused on annual report preparations, approaches to present and
disseminate FutureGrid information through videos, interactions with educators using FG in Fall
classes, and user support.


Knowledgebase Team

No report received

Site Reports

University of Virginia

No report received


University of Southern California Information Sciences

Mats Rynge reported that Gideon Juve has created a new set of Pegasus tutorial images for Pegasus 4.1.0. These are in addition to the existing FutureGrid images, but what makes the new ones interesting is that the Amazon EC2, FutureGrid and VirtualBox images are all autogenerated from one master image. This should help keep the different images in sync.


University of Texas at Austin/Texas Advanced Computing Center

Modest Progress


University of Chicago/Argonne National Labs

Nimbus, Outreach and Support


University of Florida

UF performed system maintenance activities to resolve file system failures in foxtrot, developed
and tested ViNe software extensions to collect overlay operation conditions, and contributed to
the writing of the annual report. Fortes chaired the operations committee, and Figueiredo chaired the TEOS team.


San Diego Supercomputer Center at University of California San Diego

UCSD got an initial quote for a small 8-node cluster to experiment with Flash SSD drives, updated the authentication mechanism used to publish Inca data to the FutureGrid messaging system, and worked on sections of the annual report.


University of Tennessee Knoxville

We released a new version of PAPI-V, which contained various improvements intended for deployment on FutureGrid.



Detailed Descriptions

Operations and Change Management Committee

Operations Committee Chair:
Jose Fortes

Change Control Board Chair: Gary Miksik, Project Manager



FutureGrid PY3 Annual Report is in progress. We are organizing the report along the lines
of the XSEDE Service Provider Quarterly Report format, along with the requisite annual
reporting web pages in Fastlane. Much of the data accumulation effort to date has been in the Software and TEOS sections of the report. We are targeting September 17th for the final submission of the report in Fastlane.




Administration. In process of hiring a system administrator and 2-3 user support consultants for FutureGrid.




Financials. NSF approved spending through September 30, 2012. Partner amendments have been prepared and sent out for signatures. Florida and Texas have not yet returned their signed amendments.


Partner Invoice Processing to Date:

        IU PO #   Spend Auth  Y1 PTD   Y2 PTD   Y3 PTD   Total PTD  Remaining  Thru
UC      760159    1,115,453   213,235  343,809  271,367  828,412    287,041    Jul-12
UCSD    750476    784,516     192,295  183,855  156,825  532,975    251,541    May-12
UF      750484    587,443     83,298   110,891  92,230   286,419    301,024    May-12
USC     740614    599,999     154,771  115,407  179,822  450,000    149,999    Jul-12
UT      734307    1,086,115   547,509  159,987  181,105  888,601    197,514    Jul-12
UV      740593    348,623     53,488   103,878  125,453  282,819    65,804     Jun-12
UTK     1014379   283,333     N/A      28,271   155,089  183,360    99,973     Jun-12
Total             4,805,482                              3,452,586  1,352,896




Software Team

Lead: Gregor von Laszewski

ADMINISTRATION (FG-907 - IU Gregor von Laszewski)

We did inform the group to prepare the yearly report and are conducting the coordination of it via a Google Doc.

We defined the new versions for the upcoming year and informed all teams to use the new versions to classify their tasks. We mentioned that it is important to focus on features rather than tasks, such as "we will release version 4.1 of, for example, Pegasus, Nimbus, or Inca," to name just a few examples.


The versions defined for the upcoming year are:

3.4 Final 30/Sep/13
3.3 Sep 1, 2013 release 01/Sep/13
3.2 Jul 1, 2013 release 01/Jul/13
3.1 May 1, 2013 release 01/May/13
3.0 Feb 1, 2013 release 01/Feb/13
2.3 Nov 1, 2012 release 01/Nov/12
2.2 Sept 1, 2012 release 01/Sep/12
2.1 July 1, 2012 release 01/Jul/12
4.0 Things We do not do 01/Oct/13 (things that we are not able to do)


Sharif Islam has left the IU systems team for NCSA. Tasks have been transitioned to Koji Tanaka. A new hire has been identified and will start within the next three weeks.

Defining Activities (FG-1223 - Gregor von Laszewski)

Updated: 138 issues (includes closed and resolved tasks)

Closed: 3 issues

Resolved: 37 issues

Created: 25 issues


We observed that some team members resolved tasks without setting the progress in the task. Thus we have not closed them yet, so as to give team members the chance of correcting this issue without reopening the task.

HPC SERVICES

EXPERIMENT MANAGEMENT

Experiment Management (FG-518 - Warren Smith, Mats Rynge)

Mats Rynge reported that workflow experiment management tasks this period have mostly been preparing for the annual report.

Integration of Pegasus into Experiment Management (FG-412 - ISI Mats Rynge)

Mats Rynge reported that Pegasus 4.1.0 has been released. The release contains smaller improvements to Condor vanilla and CondorC workflows, which will be helpful for experiments based on Pegasus.

Experiment management with support of Experiment Harness (FG-906 - TACC Warren Smith)

Please see site report from TACC.

Image Management (FG-899 - Creation, Repository, Provisioning - IU Javier Diaz)

We have changed the DB model of FG Move. Now, all the information is stored in a MongoDB-based database. To ease the initialization of the service, a simple inventory file can be created in which clusters and services are described. Thus, the first time you start FG Move, you can indicate this file to initialize the database. Once the information is in the database, the inventory file is not needed anymore.
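As an illustration of how such an inventory-driven initialization could work, the sketch below loads a YAML inventory into MongoDB with pymongo the first time the service runs; the file layout, field names, and the "fg_move" database name are hypothetical and are not taken from the actual FG Move code.

    import yaml                       # assumed: PyYAML is available
    from pymongo import MongoClient

    def initialize_db(inventory_path, mongo_uri="mongodb://localhost:27017"):
        """Load the inventory file into MongoDB the first time FG Move starts."""
        db = MongoClient(mongo_uri)["fg_move"]   # database name is an assumption
        if db.clusters.count_documents({}) > 0:
            return                               # already initialized; the inventory file is no longer needed
        with open(inventory_path) as f:
            inventory = yaml.safe_load(f)
        for cluster in inventory.get("clusters", []):
            # e.g. {"name": "india", "site": "IU", "services": ["openstack", "hpc"]}
            db.clusters.insert_one(cluster)

    # Hypothetical usage: initialize_db("inventory.yaml")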

ACCOUNTING

Accounting for Clouds (FG-1301 - IU Hyungro Lee)

We asked Hyungro to develop scripts to detect failures of the metric database, and to deploy them with the Inca team. The Nimbus metrics integration has unfortunately not yet been completed. Hyungro Lee is still working on it.
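As a rough illustration only (the report does not describe the actual scripts or the metric database deployment), a failure-detection check of this kind could be a small probe that exits non-zero when the database host stops accepting connections; the host name and port below are placeholders.

    import socket
    import sys

    # Placeholders: the real host/port of the cloud-metric database are not given in this report.
    DB_HOST = "metrics.futuregrid.example.org"
    DB_PORT = 3306

    def metric_db_reachable(host=DB_HOST, port=DB_PORT, timeout=5):
        """Return True if the metric database host accepts TCP connections."""
        try:
            socket.create_connection((host, port), timeout=timeout).close()
            return True
        except (socket.error, socket.timeout):
            return False

    if __name__ == "__main__":
        ok = metric_db_reachable()
        print("metric database %s" % ("reachable" if ok else "UNREACHABLE"))
        sys.exit(0 if ok else 1)   # non-zero exit lets an Inca reporter or cron job flag the failure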

FG SUPPORT SOFTWARE AND FG CLOUD SERVICES

Nimbus (FG-842 - John Bresnahan)

Please see site report provided by UC.

Eucalyptus (FG-1587 - IU Fugang Wang)

We have identified the issue that caused the LDAP integration to break for the Eucalyptus 3.1 deployment. A fix has been tested in the testing environment (see https://wiki.futuregrid.org/index.php/Eucalyptus3_LDAP_Integration#Changes_Since_Euca_V3.1). We will work on the production deployment to re-enable the LDAP integration and synchronization.

OpenStack (FG-1203 - IU Koji Tanaka)

On the first day of this report period the OpenStack management node had a disk failure and went down. I rebooted with a netboot image and checked and repaired the file system. Two days later, many instances started failing to boot. After investigation, we found out that OpenStack did not release the IP addresses. We found a reported bug of OpenStack (https://bugs.launchpad.net/nova/+bug/1014769). We manually edited the MySQL database to resolve this issue.
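For illustration, a manual cleanup along these lines might look like the sketch below; the table and column names are assumptions based on the nova database schema of that era, the credentials are placeholders, and this is not the exact statement that was run.

    import MySQLdb   # MySQL-python, assumed available on the OpenStack management node

    # Connection parameters are placeholders.
    conn = MySQLdb.connect(host="localhost", user="nova", passwd="secret", db="nova")
    cur = conn.cursor()

    # Release fixed IPs that are still marked allocated/leased although their instance is gone.
    # Table and column names are assumed from the nova schema of that era.
    cur.execute("""
        UPDATE fixed_ips
           SET allocated = 0, leased = 0, instance_id = NULL
         WHERE instance_id IS NOT NULL
           AND instance_id NOT IN (SELECT id FROM instances WHERE deleted = 0)
    """)
    conn.commit()
    print("released %d leaked fixed IPs" % cur.rowcount)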


Javier Diaz built a tool for users to change their OpenStack GUI (Dashboard) password.

Inca (FG-877 - Shava Smallen, UCSD)

We made improvements to the Inca plugin used to publish Inca monitoring data into the FutureGrid messaging system. Specifically, we changed the authentication mechanism to X509 authentication in both our hello client example code and the Inca server plugin. Also, we updated the versions of the bouncycastle security jar files used by the Inca server.
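For context, publishing to an AMQP broker with X509 client-certificate (EXTERNAL) authentication can look like the Python sketch below, assuming the messaging system is RabbitMQ-compatible and reachable with the pika client; the host, exchange, routing key, and certificate paths are placeholders, and the actual Inca plugin is Java-based (hence the bouncycastle jars mentioned above).

    import ssl
    import pika   # assumed: pika AMQP client

    # Client certificate (X509) authentication against the broker; paths are placeholders.
    context = ssl.create_default_context(cafile="/etc/futuregrid/ca.pem")
    context.load_cert_chain(certfile="/etc/futuregrid/client-cert.pem",
                            keyfile="/etc/futuregrid/client-key.pem")

    params = pika.ConnectionParameters(
        host="messaging.futuregrid.org",                      # placeholder host
        port=5671,                                            # AMQPS
        ssl_options=pika.SSLOptions(context),
        credentials=pika.credentials.ExternalCredentials(),   # EXTERNAL = authenticate via the cert
    )

    connection = pika.BlockingConnection(params)
    channel = connection.channel()
    channel.basic_publish(exchange="monitoring",               # placeholder exchange
                          routing_key="inca.status",           # placeholder routing key
                          body='{"resource": "india", "status": "up"}')
    connection.close()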

ViNe (FG-140 - UF Renato F., Mauricio T., Jose Fortes)

The UF team developed and tested ViNe software extensions to collect overlay operating conditions. In particular, mechanisms to test the connectivity and to measure the round-trip latencies of active ViNe routers have been implemented. The ViNe management server keeps track of active ViNe routers and, upon request, sends the necessary information (endpoint information, including the IP address, protocol, and port with which active routers are configured) to routers. Both connectivity tests and round-trip latency measurements are based on sending probe messages through ViNe overlays and waiting for responses. Taking advantage of the ViNe firewall/NAT traversal mechanism that enables all-to-all communication among ViNe routers, full-scale network tests can be performed. Connectivity tests can be triggered by both end users and the ViNe management server. The ViNe management server periodically starts connectivity tests to detect potential connectivity disruptions. If a connectivity disruption is detected, the ViNe management server tries to change the ViNe overlay topology to recover the communication.
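To make the probe mechanism concrete, the sketch below shows a minimal UDP probe/echo round-trip measurement of the kind described; it is an illustration only and does not reflect the actual ViNe message format, port numbers, or protocol.

    import socket
    import time

    PROBE_PORT = 51000   # placeholder port

    def probe(router_ip, timeout=2.0):
        """Send one probe to a router and return the round-trip time in seconds, or None."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        sock.settimeout(timeout)
        start = time.time()
        try:
            sock.sendto(b"VINE-PROBE", (router_ip, PROBE_PORT))
            sock.recvfrom(1024)            # wait for the echo reply
            return time.time() - start
        except socket.timeout:
            return None                    # connectivity disruption suspected
        finally:
            sock.close()

    # The management server would iterate over all active routers it tracks:
    for ip in ["10.0.1.1", "10.0.2.1"]:    # placeholder router addresses
        rtt = probe(ip)
        print(ip, "unreachable" if rtt is None else "rtt %.1f ms" % (rtt * 1000))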

KnowledgeBase (FG-1222 - Jonathan Bolte, IU)

Gregor von Laszewski and Fugang Wang held an almost hour-long meeting to demonstrate some features of the portal that were unknown to the IUKB team. We demonstrated that they have access to all editing functionality and showed them how to add, delete, and reorganize pages in the user manual. This was done via our usual Breeze session. As part of this we found out that IUKB entries that are added to books do not show up in the table of contents. Gregor and Fugang suggested that this ought to be fixed. Once this is done, the IUKB team will have an easier time making their entries visible to the user as part of the manual.


Gregor and Fugang also provided detailed statistics based on Google Analytics about the usage of FutureGrid in regard to IUKB. This has the advantage that the IUKB team, which is required by their management to report certain usage metrics, does not have to parse log entries in cumbersome ways. Instead, usage statistics can be provided via Google Analytics. Hence the team can focus on improving the documentation rather than spending time on developing tools to derive metrics.

PERFORMANCE (UCSD Shava Smallen)

Vampir (FG-955 - Thomas Williams)

No change in status -- Thomas William is currently on parental leave.

PAPI (FG-957 - Piotr Luszczek (UTK))

See attached UTK site report.

Performance: Interface monitoring data to the Experiment Harness (FG-1098 - Shava Smallen, SDSC)

The FutureGrid messaging system is used to provide a uniform API to real-time monitoring data and is designed for application use. Currently Inca and GLUE2 data are provided. In the last few weeks, the authentication method for publishing Inca data was updated to use X509 authentication (see Inca section above). GLUE2 was also deployed to Foxtrot for publishing Nimbus data and to Sierra for publishing Torque/Moab data in addition to Nimbus data. Collaboration continues with the FutureGrid cloud metric work, and plans are being made to leverage each other's existing work (i.e., GLUE2 will leverage Eucalyptus data and the FutureGrid cloud metric work will leverage Nimbus data).
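As an illustration of the "uniform API for applications" idea, a minimal consumer could look like the following, again assuming an AMQP (RabbitMQ-style) broker reachable with pika; the host, exchange, and routing keys are placeholders rather than the documented FutureGrid topic names.

    import pika   # assumed: pika AMQP client

    def on_message(channel, method, properties, body):
        # Inca and GLUE2 records arrive as messages; an application would parse body here.
        print(method.routing_key, body[:80])

    connection = pika.BlockingConnection(
        pika.ConnectionParameters(host="messaging.futuregrid.org"))   # placeholder host
    channel = connection.channel()

    # Bind a private queue to the monitoring exchange for Inca and GLUE2 data.
    queue = channel.queue_declare(queue="", exclusive=True).method.queue
    for key in ("inca.#", "glue2.#"):                                 # placeholder routing keys
        channel.queue_bind(exchange="monitoring", queue=queue, routing_key=key)

    channel.basic_consume(queue=queue, on_message_callback=on_message, auto_ack=True)
    channel.start_consuming()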


FG-1094 - Performance: Help coordinate setup of perfSONAR

The GRNOC received the requested order for the Myricom 10G cards but is still waiting for the SFP+ cables to arrive from CDW. Once that hardware has been received, the GRNOC will bundle and send it out to each of the FutureGrid sites.


Hardware and Network Team

Lead: David Hancock

Networking



All FutureGrid network milestones are complete and networking is in a fully operational
state.



New switches for India to support tagged/untagged port mode have been installed,
required to support tagged VLANs and PXE booting for multiple cloud
provisioning
systems on a single switch.



perfSONAR monitoring systems are now in place and recording data throughput between
sites.



perfSONAR systems will be upgraded to support 10Gbps, all hardware delivered to IU
except optics.



Network performance issues between IU and SDSC are being diagnosed; the problem has been narrowed to the SDSC FutureGrid ingress or the NLR connection to SDSC from CENIC. FG is waiting on a primary SDSC engineer to return from vacation.

Compute & Storage Systems



IU iDataPlex (india)
o RHEL6 upgrade is on hold until after the upcoming software releases
o Move to a new storage volume in progress, will be 100TB
o Ubuntu on this system was having issues with IPMI responding. Administrators have resolved the issue on all nodes except for one that may require a firmware update.
o Installed fg-move and tested provisioning an OpenStack compute node several times. This worked fine. It will be tested with an HPC node, and then a eucalyptus node during the next period.
o An image repository for testing has also been provisioned on the FG test hardware.
o Eucalyptus LDAP sync encountered problems due to the 3.0 to 3.1 upgrade, will be fixed in the LDAP modification in September.
o System operational for production users.



IU Cray (xray)
o New software release available (danub, SLES 11 based).
o System operational for production HPC users



IU HP (bravo)
o System being used for some Lustre FS testing, software problems prevented a clean build on all nodes but the issue has been resolved.
o Hardware problems on b006 require action from HP support.
o System operational for production users



IU GPU System (delta)
o 16 nodes available for production HPC jobs, cloud software not yet integrated
o BMC setup for remote management has been configured on the system.
o IB cables need to be wired to the Bravo IB switch.
o System operational for early users



SDSC iDataPlex (sierra)
o Troubleshooting disk failures on the Sierra storage environment, no outages taken to repair issues to date.
o Network troubleshooting (see networking above)
o System operational for production Eucalyptus, Nimbus, and HPC users.



UC iDataPlex (hotel)
o System operational for production Nimbus and HPC users.
o RHEL 6 with SE Linux now installed successfully, software is being built before switching it over as the primary HPC OS.
o Storage problems stemming from heat related failures have been resolved after working with DDN and IBM, this work continues.
o Successfully tested using the AuthZ plugin to call LDAP with the Globus toolkit 5.0.x and 5.2.x. UC will implement this with 5.2 and share the results.



UF iDataPlex (foxtrot)
o Jumbo frames enabled which has improved WAN data transfer rates.
o System operational for production Nimbus users.



Dell system at TACC (alamo)
o Changed the network connection from a Dell switch to a higher performance switch; Alamo now has redundant connections to XSEDE.
o Nimbus moved to CentOS 6 installs, HPC nodes still on CentOS 5, HPC software being built for CentOS 6. Login node rebuilds are the last time-consuming task remaining.
o Some nodes have been reprovisioned for OpenStack testing, 2 images being tested.
o HPC software is being installed for CentOS 6.
o Test nodes for XWFS released.

All system outages are posted at
https://portal.futuregrid.org/outages_all


Training, Education and Outreach Team (includes user support)

Lead: Renato Figueiredo


TEOS report:



Barbara created a draft document and pulled together TEOS-related project information and listed a few activities to get things started; also gathered together TEOS Bi-Weekly reports and began culling them for information



TEOS members have begun contributing to the report



Kate contributed outreach report



Barbara has started work on the publications/presentations list, but the list is not yet complete


Portal report:



Barbara worked with Carrie and Gregor to create this segment of the Software report


Videos:



Added links to YouTube videos Stephen Wu created for his summer school presentations
to Tumblr



The TEOS team discussed during a conference call a vision for creating additional videos, including a short “about FutureGrid” video as well as how-to videos



The team discussed the possible benefits and shortcomings of using YouTube as a repository for videos. One of the main benefits is the ability for users to discover information through YouTube searches, related videos, etc.; technical issues need to be ironed out first. Barbara is in the process of reviewing YouTube as a possible repository for summer school videos



Worked with Carrie to begin shaping a Video Gallery for the FG Portal, to house Science Cloud Summer School 2012 videos on the FG Portal and additional videos as they become available. Design is underway



Educational Use of FG:



Based on discussions in a TEOS conference call, the team suggested contacting project leaders of educational projects in Fall ’12 to ensure they have information on how to get started and pointers to request support, if needed. Barbara has emailed 6 educators running class projects on FG in Fall 2012, heard from Andy Li (FG-247), Dirk Grunwald (FG-244), and Sergio Maffioletti (FG-247), and followed up with support.



Barbara communicated with Fugang and Gregor regarding project sign-up and assignment. Fugang provided streamlined, customized registration pages to facilitate sign-up and project assignment, e.g. https://portal.futuregrid.org/projects/249/register. Barbara forwarded this information to the educators.


User Support:



Barbara worked with Gary to review applications and interview applicants for part-time hourly user support positions. We anticipate hiring 2 or 3 masters students early next week. They will handle tickets and participate in improvements to FG Portal content and handle other tasks as assigned.


Knowledgebase Team

Lead: Jonathan Bolte, Chuck Aikman

No report received


Tickets

Lead: Sharif Islam and Koji Tanaka

No report received


Site Reports

University of Virginia

Lead: Andrew Grimshaw

No report received


University of Southern California
Information Sciences

Lead: Ewa Deelman

Mats Rynge reported that workflow experiment management tasks this period have mostly been preparing for the annual report.

Mats Rynge reported that Pegasus 4.1.0 has been released. The release contains smaller improvements to Condor vanilla and CondorC workflows, which will be helpful for experiments based on Pegasus.

University of Texas at Austin/Texas Advanced Computing Center

Lead:
Warren Smith

Dell cluster:



Continued OpenStack installation
o One node configured as service/control node
o Another node configured as a compute node
o Working on the network configuration



Ordered and received 8 1TB hard drives that will be installed in 4 Alamo nodes
o In support of project 146 that is demonstrating and evaluating a candidate distributed file system for XSEDE


See hardware and network team section for additional updates.

Experiment harness:



TeraGrid/XSEDE glue2 software
o Beta version installed on Sierra
o Implemented a few improvements and fixed several small problems based on beta testing
o A few more improvements to make before generating a beta version for installation on Hotel

FutureGrid user portal:



See software team section


University of Chicago/Argonne National Labs

Lead: Kate Keahey

Our activities in this reporting period were dominated by preparation for the 2.10 RC2 release of Nimbus Infrastructure, work with users, and preparations for the annual report.


Updated and created JIRA tasks in preparation for the annual report, went over the activities in the past year and wrote out the contributions to the report.



Updated documentation on the multi-cloud service.



Began work on the 2nd release of this service



Prepared and released the 2.10RC2 release
o Improved the error codes returned by Nimbus to reflect those returned by AWS
o Investigated the new AWS security signing protocol. Nimbus may need to be ported to it to work with future releases of boto and other clients
o Fixed bugs in the cloudinitd libcloud driver
o Discussed a possible service that will allow VMs to mount GPFS partitions.


Support



Helped Paul Marshall use phantom



Continued to work on a cloudinit.d launch plan for a bioinformatics group on FG



Worked with ATLAS to make better use of backfill. Unfortunately this does not look like it will work out.


Outreach:



Preparations for the SC12 tutorial (materials due mid-September), and attendance of the Cloud Federation conference


University of Florida

Lead: Jose Fortes

System maintenance activities

File system failures have been detected on 3 foxtrot nodes. Investigations indicated hard disk failures and IBM was contacted. Replacement disks were shipped by IBM, and the failed nodes have been rebuilt. The node failures left nimbus services in an inconsistent state, causing them to generate a substantial amount of error logs. The nimbus database was manually edited to restore the services to a clean state.

ViNe activities:

The UF team developed and tested ViNe software extensions to collect overlay operation conditions. In particular, mechanisms to test the connectivity and to measure the round-trip latencies of active ViNe routers have been implemented. The ViNe management server keeps track of active ViNe routers and, upon request, sends the necessary information (endpoint information, including the IP address, protocol, and port with which active routers are configured) to routers. Both connectivity tests and round-trip latency measurements are based on sending probe messages through ViNe overlays and waiting for responses. Taking advantage of the ViNe firewall/NAT traversal mechanism that enables all-to-all communication among ViNe routers, full-scale network tests can be performed. Connectivity tests can be triggered by both end users and the ViNe management server. The ViNe management server periodically starts connectivity tests to detect potential connectivity disruptions. If a connectivity disruption is detected, the ViNe management server tries to change the ViNe overlay topology to recover the communication.

ViNe overlay topology changes can also be triggered by analyzing the collected router-to-router round-trip latencies. Based on the measurements, routes that minimize latencies can be configured in order to improve the end-to-end overlay performance. Improvements to the ViNe management software that automate the process of changing overlay topologies are under development and testing. Changing overlay topologies involves dynamically reconfiguring ViNe routing tables on active routers, and the ViNe management server is responsible for generating the appropriate reconfiguration commands.
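As a simple illustration of how the collected router-to-router latencies could drive topology choices, the sketch below selects the lowest-latency relay path with Dijkstra's algorithm; the graph, latency values, and output format are made up and do not represent the actual ViNe management server logic.

    import heapq

    # Measured round-trip latencies (ms) between ViNe routers; values are made up.
    latency = {
        ("A", "B"): 12, ("B", "A"): 12,
        ("B", "C"): 20, ("C", "B"): 20,
        ("A", "C"): 80, ("C", "A"): 80,
    }

    def lowest_latency_route(src, dst):
        """Return the list of routers on the lowest total-latency path from src to dst."""
        nodes = {n for edge in latency for n in edge}
        dist, prev, heap = {src: 0}, {}, [(0, src)]
        while heap:
            d, u = heapq.heappop(heap)
            if u == dst:
                break
            for v in nodes:
                w = latency.get((u, v))
                if w is not None and d + w < dist.get(v, float("inf")):
                    dist[v], prev[v] = d + w, u
                    heapq.heappush(heap, (d + w, v))
        path, node = [dst], dst
        while node != src:
            node = prev[node]
            path.append(node)
        return list(reversed(path))

    # A -> C is cheaper via B (32 ms) than directly (80 ms), so the management
    # server would reconfigure routing tables to relay through B.
    print(lowest_latency_route("A", "C"))    # ['A', 'B', 'C']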

San Diego Supercomputer Center at University of California San Diego

Lead: Shava Smallen

In the past few weeks, UCSD got an initial quote from Aeon for a small 8-node cluster for experimenting with different types of Flash SSD drives, initially for VM testing as well as Hadoop and other applications. We also updated the authentication mechanism used to publish Inca monitoring data to the FutureGrid messaging system, as well as a test client to subscribe for data. Work continues with the IU GRNOC to add 10G cards for the perfSONAR machines and cluster node. All items are described further in the software section of this report. UCSD is also working on sections for the annual report and continues to lead the performance group, with a group call on August 22nd.

University of Tennessee Knoxville

Lead: Jack Dongarra

During this reporting period, we put out a new release of PAPI, *PAPI-V*. Along with a major overhaul of the component interface and being ported to run on Intel IvyBridge and Intel Atom Cedarview, this version contains the following new/improved components:



nVidia Management library component - support for various system health and power measurements on supported nVidia GPUs.



steal time - When running in a VM, this provides information on how much time was "stolen" by the hypervisor due to the VM being disabled. This is currently only supported on KVM.



RAPL - a SandyBridge RAPL (Running Average Power Limit) component providing for energy measurement at the package level.



VMware component for VMware pseudo-counters



appio - This application I/O component enables PAPI to determine the I/O used by the application, which should be applicable for FutureGrid applications.


As noted in previous reports, much of the work that has gone into PAPI-V focused on VMware and KVM because these two virtualization platforms offer counter support. For FutureGrid in particular, support for KVM is important, and we are looking into the question of how we can run more experiments on the FutureGrid infrastructure. As we reported last week, the main requirement for full support of PAPI-V is installation of a recent Linux kernel. At this point, kernel version 3.3 supports the most relevant virtualization features. We were looking into bare metal provisioning of FutureGrid infrastructure to "rain" the required kernel and virtualization platform on top of it. We are currently assembling the appropriate software pieces.