
SA1 status


Int.Eu.Grid Integration Meeting

Lisbon, November 2007

J. Gomes, G. Borges, M. Montecelo

LIP


Timeline


Overview

- Site Resources (Core and Local)
  - CPUs, Storage
- Site Status and Operational Issues
- Test & Validation Activities
  - CrossBroker, RAS, JIMS, MPI-START and others
- Interoperability
- Accounting and Monitoring
  - SAM, GRIDICE


Core services status (09/11/2007)

- Production core services duplicated at LIP and IFCA
- Development core services at FZK
  - No CrossBroker or LFC being published
  - Two top BDIIs for development (iwrbdii2.fzk.de, iwrbdii3.fzk.de)
  - R-GMA at CESGA and FZK
- Production LFC and VOMS have a read-only copy at IFCA
- Production CrossBrokers at LIP and IFCA
  - i2g-rb01.lip.pt presently used for T&V activities (v0.5.12)
  - i2g-rb02.lip.pt, i2grb01.ifca.es (v0.4.20)
  - i2grb01.ifca.es is not being published (and has not been for a long time!)
- Production RASes at LIP and IFCA
  - i2g-ras02.lip.pt (and banana.man.poznan.pl) with v5.0.7 (certified)
  - i2g-ras01.lip.pt / i2gras01.ifca.es need to be upgraded (v5.0.6-3)



Site CPU capacity (09/11/2007)

Site      | May 2007      | November 2007          | Promised
----------|---------------|------------------------|----------------------
LIP       | 27            | 60 (shared with EGEE)  | 20 AMD 2.2 + Xeon 2.8
IFCA      | 58            | 350 (shared with EGEE) | 100 Xeon + 150 PIII
CESGA     | 14            | 20                     | 20 Pentium IV 3.2
BIFI      | 34            | 22                     | 20 Xeon 3.6
PSNC      | 186 (Itanium) | 75                     | 40 Itanium-2
CYFRONET  | 20            | 20                     | 20 Xeon 2.8
ICM       | 32            | 16 (1/2 cluster down)  | 32 AMD 2.2
IISAS     | 28            | 32                     | 32 Pentium IV 3.2
FZK       | 64            | 120 / 2                | 32 Xeon 3.0
GUP       | 6             | 10                     | 10 AMD MP1900+
UAB       | 20            | 30                     | 20 Pentium IV
TCD       | 28            | 26                     | 16 Pentium IV 2.8


Site storage capacity (09/11/2007)

Site      | May 2007            | November 2007    | Promised
----------|---------------------|------------------|---------------------
LIP       | 1.8 TB              | 1.6 TB + 73 GB   | 1.5 TB on
IFCA      | 1.5 TB              | 1.1 TB           | 100 TB off, 20 TB on
CESGA     | 10 GB               | 10 GB            | 5 TB on, 20 TB off
BIFI      | 230 GB              | 110 GB           | 2 TB on
PSNC      | 4.3 TB shared       | 4.7 TB           | 5 TB on
CYFRONET  | 1 TB (30 TB shared) | 30 TB            | 1 TB on
ICM       | 1.5 TB shared       | 1.4 TB           | 1.5 TB on
IISAS     | 437 GB              | 492 GB           | 200 GB
FZK       | 5 GB                | 1.8 TB / 3.4 GB  | 1.5 TB on
GUP       | 280 GB              | Not published OK | 500 GB
UAB       | 280 GB              | Not published OK | 360 GB on
TCD       | 6 GB                | 4.8 TB (?)       | 400 GB


Infrastructure Tests

- Submit different applications from the UI
- Check the sites' ability to run:
  - Batch jobs
  - Interactive jobs
  - Open MPI jobs
  - PACX-MPI jobs
  - (Open MPI + Interactive) jobs
  - (PACX-MPI + Interactive) jobs
- For the parallel applications we use the cpi.c program (a π computation; see the sketch below)
- Last results on the WIKI page (09/11/2007):
  https://wiki.fzk.de/i2g/index.php/Status_report_%2809/11/2007%29
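The deck does not reproduce cpi.c itself; the following is a minimal sketch of that classic MPI π benchmark, which applies the midpoint rule to π = ∫₀¹ 4/(1+x²) dx and reduces the partial sums onto rank 0. The interval count n is illustrative.

  /* cpi.c (sketch): midpoint-rule computation of pi with MPI.
   * Each rank sums every size-th midpoint of 4/(1+x^2) on [0,1];
   * rank 0 gathers the partial sums. Interval count is illustrative. */
  #include <mpi.h>
  #include <math.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      const double PI25DT = 3.141592653589793238462643; /* reference value */
      const int n = 100000;          /* number of intervals (illustrative) */
      int rank, size;
      double h, x, sum = 0.0, mypi, pi;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      h = 1.0 / (double)n;
      for (int i = rank + 1; i <= n; i += size) { /* strided over ranks */
          x = h * ((double)i - 0.5);
          sum += 4.0 / (1.0 + x * x);
      }
      mypi = h * sum;

      /* combine the partial results on rank 0 */
      MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
      if (rank == 0)
          printf("pi is approximately %.16f, error is %.16f\n",
                 pi, fabs(pi - PI25DT));

      MPI_Finalize();
      return 0;
  }

The same source can be submitted unchanged as an Open MPI or a PACX-MPI job, which is what makes it convenient for exercising every parallel job type listed above.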


Infrastructure results (09/11/2007)

Site      | 1) Batch | 2) Inter | 3) OpenMPI | 3)+2) | 4) PACXMPI | 4)+2)
----------|----------|----------|------------|-------|------------|------
BIFI      | OK       | OK       | OK         | OK    | OK         | X(e)
CESGA     | OK       | OK       | OK         | OK    | OK         | X(e)
CYFRONET  | OK       | OK       | OK         | OK    | X(c)       | X(c)
FZK       | X        | OK       | OK         | OK    | X(d)       | X(d)
ICM       | X(a)     | -        | -          | -     | -          | -
IFCA      | OK       | OK       | OK         | OK    | X(b)       | X(b)
IISAS     | OK       | OK       | OK         | OK    | OK         | OK
LIP       | OK       | OK       | OK         | OK    | OK         | X(e)
PSNC      | -        | -        | -          | -     | -          | -

a) Cluster downtime
b) Missing PACX-MPI tag
c) Check firewalls
d) Check LRMS, JM
e) Cannot find .hostfile

PSNC is the only I2G site running the pbs JM and NFS-shared homes.


More on infrastructure results

- Site results from the UI (on 09/11/2007)
  - Batch, Interactive, Open MPI and (Interactive + Open MPI) jobs OK
  - PACX-MPI is OK in half of the production sites
    - See the WIKI page for details and possible failure reasons
  - (Interactive + PACX-MPI) is failing at all sites except IISAS
    - Infrastructure problem? MPI-START problem (can't find the .hostfile?!)
    - Possible solution in Ahmad's email to SA1 on 06/11/2007...

- Site results from the MD (Michal Owsiak on 07/11/2007)
  - Tests from the MD have some added value: md-job-starter + gsiftp
  - Batch, Interactive and (Interactive + Open MPI) jobs mostly OK
  - PACX-MPI results mostly bad: only OK at CESGA and LIP
  - (Interactive + PACX-MPI) is failing everywhere except at CYFRONET


Some considerations on PACX-MPI

- Setup and Communication
  - Internal communication: communication between processes running
    inside the same local cluster is performed via the local, optimized
    MPI implementation
  - External communication:
    - The sender passes the message to the "out" daemon using local MPI
    - The "out" daemon sends the message to the destination host over
      the network using TCP
    - The "in" daemon delivers the message to the destination process
      using local MPI

[Figure: two clusters of worker nodes (processes numbered 2-5 on each
side), each fronted by a relay node running the "in"/"out" daemons;
inter-cluster messages travel between the relays over TCP]
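To make the transparency of this relay scheme concrete, here is a minimal sketch using only plain MPI calls (PACX-MPI requires no source changes): the same MPI_Send is used whether the destination rank lives in the local cluster or in the remote one, in which case the library routes the message through the out/in daemons as described above. The token payload is illustrative.

  /* Sketch: unmodified MPI code as run under PACX-MPI.
   * If rank size-1 lives in the other cluster, the send below is
   * relayed transparently: local MPI -> "out" daemon -> TCP ->
   * "in" daemon -> local MPI. */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char *argv[])
  {
      int rank, size, token = 42;   /* illustrative payload */

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      if (size > 1) {
          if (rank == 0) {
              /* possibly a cross-cluster message: the code cannot tell */
              MPI_Send(&token, 1, MPI_INT, size - 1, 0, MPI_COMM_WORLD);
          } else if (rank == size - 1) {
              MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                       MPI_STATUS_IGNORE);
              printf("rank %d received token %d\n", rank, token);
          }
      }

      MPI_Finalize();
      return 0;
  }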


Some considerations on PACX-MPI (2)

- The failure rate of PACX-MPI jobs is much higher...
  - We are trying to request more and more CPUs
- If the allocation of CPUs at one site fails, the whole job fails
- If the job gets stuck in a queue at one site, the whole job will fail
- The availability of resources changes very rapidly, making the
  allocation of the required resources at all sites very difficult
- Policies limiting the amount of resources available to single users or
  VOs may fool the resource broker matchmaking
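To see why co-allocation is so fragile, note that (under the simplifying, illustrative assumption of independent sites) a job spanning k sites, each delivering its slots in time with probability p, only succeeds with probability p^k. With p = 0.9 and k = 4, for instance, p^k = 0.9^4 ≈ 0.66, so roughly a third of such jobs fail even though every individual site looks reliable.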



Some considerations on PACX-MPI (3)

- The CrossBroker has support for obtaining the policy limits for slots
  per VO from the GRIS RunTimeEnvironment
  - Set when I2G had GlueSchema 1.2
  - If possible, please DO NOT IMPOSE policies limiting the slots
- An approach similar to the EGEE "short deadline jobs" one, enabling
  no-queueing, could be used
  - A similar strategy was implemented at the level of the SGE JM and is
    working for all i2g queues at LIP
  - The jobs either start or fail; due to the slowness of the lcg
    jobmanager, the failure still takes some time to be reported
  - At least jobs don't stay queued forever...


I2G middleware directives

- MPI-START versions differed between sites
  - i2g-mpi-start-0.0.34 vs i2g-mpi-start-0.0.44
  - Be sure you have i2g-mpi-start-0.0.44 deployed
  - Necessary to let PACX-MPI + MD work together...
- Bug found by Álvaro (IFCA) in i2g-profile-0.0.17
  - i2g-profile-0.0.18 already available in the repositories
- IFCA VOMS changed certificate and DN
  - New rpm already in the production repository: i2g-vomscerts-1.2-0
  - Upgrade your site and don't forget to change the VOMS DN and
    definitions in your configuration files siteinfo/vo.d/<VO>
  - New DN: /DC=es/DC=irisgrid/O=ifca/CN=host/i2gvoms01.ifca.es


I2G middleware directives

- What about I2G repositories for SLC4?
  - EGEE is moving to SLC4... SLC3 is no longer supported...
  - Everyone still supporting SLC3 should move to the SL repositories
  - We need these SLC3 repositories to still claim interoperability
  - We need these features to make our infrastructure desirable...
  - What is the effort to port the I2G middleware to SLC4?
    - i2glogin
    - CrossBroker
    - Open MPI, PACX-MPI


Test & Validation Reports (09/11/2007)

- Present middleware under T&V:
  - i2g-gcb v0.1.4 @LIP, in the CrossBroker framework (in progress)
  - CrossBroker/UI @LIP (in progress)
    - v0.5.11 (not OK)
    - v0.5.12 @LIP (in progress)
  - RAS v5.0.7 @LIP (done)
  - JIMS v3.0.0 (submitted by kbalos@agh.edu.pl)
  - Marmot v2.0.5 @IFCA (in progress)
  - Mpitrace v1.0.0 (submitted by dichev@hlrs.de)
  - mpi-start v0.0.51 (submitted by dichev@hlrs.de)
  - ASI v0.1.0 (submitted by stuart.kenny@cs.tcd.ie)
- We need more manpower for these tasks


CrossBroker T&V - v0.5.11

- i2g-wl-wm-0.4.20-1 wasn't upgraded
- The Condor version was not upgraded
  - 6.7.10-1 had problems with expired proxies
  - Had to be installed from a different repository
- The YAIM I2G plugin didn't work anymore:
  - The node-info.def file changed location (solved)
  - "RB" needed to be substituted by "I2G_RB" (solved)
  - The /opt/gridice/monitoring/etc/gridice-role.cfg file was not
    created correctly (solved)
- Some daemons were not properly started
  - Probably because i2g-wl-wm was not the latest version


CrossBroker T&V - v0.5.12

- i2g daemons were not properly started, again...
  - CONDOR env vars were still pointing to the old configurations
  - Re-export CONDOR_CONFIG & CONDORG_INSTALL_PATH
  - Run the YAIM configuration again, and it works
- glide-in mechanism:
  - When submitting to the CE directly (using the "-r" option), the
    glide-in mechanism doesn't work
  - It's a specific feature of the CrossBroker matchmaking process
  - If your WNs are in a private network, you'll have to deploy i2g-gcb
    on the CE to allow connections from the CB to the WNs


CrossBroker T&V - v0.5.12 (cont.)

- glide-in mechanism (cont.):
  - Batch jobs remain «Ready» for about 30 m, then «Aborted». Logs show
    the following mechanism:
    1) Monitoring Condor, we see that for each JobWrapper, 3
       remote-setup* instances are launched (every ~10 m or so...)
    2) They're submitted to the CE and end up in different WNs. In each
       WN, the correct condor processes are started (condor_master,
       condor_startd)
    3) The batch system shows that each instance runs for about 20
       minutes and then dies in the WNs
    4) From the CrossBroker point of view, only the first submitted
       instance dies in Condor, right before the JobWrapper ends. The
       two remaining instances stay on hold.
    5) From the UI point of view, only when the JobWrapper ends can we
       see that the job reaches the new status «Aborted».


User Interface T&V - v0.5.12

- Packages installed (and their versions) seem OK
- YAIM config fails because it doesn't know the "I2G_UI" node
  - There are 2 middleware releases for UIs/WNs: gLite 3.0 (on SLC3)
    and gLite 3.1 (on SLC4)
  - YAIM checks which middleware is installed...
  - If it is gLite 3.0, it searches for files with the "_30" suffix
  - Since this file doesn't exist for the I2G_UI node (it is not
    distributed in the release), YAIM fails to configure the node
  - Workaround: copy the glite-i2g_ui file to glite-i2g_ui_30; after
    that, the configuration goes OK
- Check the WIKI page for details and to see how things go...




RAS-MD - Principle of Operation

- The Roaming Access Server needs:
  - Initially, the same services as a UI
  - Additionally, Tomcat providing WebServices, with auxiliary software:
    Apache HTTP server, MySQL server, LDAP server
- The client is a Java application (Migrating Desktop)
  - The user interacts with the Grid via this GUI app
  - It uses the WebServices from the RAS; the RAS acts as a backend for
    it, the two together acting as a (graphical) UI
  - It stores most of the configuration at the server, so users can use
    the same environment almost anywhere


RAS-MD - Installation/Upgrade (v5.0.7)

- Server:
  - Install the software via the i2g-RAS package
  - Install and configure a UI machine (not needed for upgrade)
  - Some manual configuration:
    - 1st time: configure Tomcat, LDAP and MySQL
    - Load (or update) the DB schema for MySQL
    - Create and fix some file/directory permissions related to Tomcat
      (usually needed only the 1st time)
  - Open the firewall ports related to the services offered (80, 8080,
    8443, 2811, 20000-25000)
- Client:
  - It doesn't need configuration; it runs with Java Web Start,
    accessing a URL on the webserver at the RAS machine


RAS-MD - Testing and Validation (v5.0.7)

- Installation/Configuration main problems:
  - Upgrading RAS-LDAP did not remove the old package in the 1st patch;
    both the new and old versions remained installed
  - MD not working in the 1st patch; old .war packages not removed
- Runtime main problems:
  - Interactive jobs not working in the first patch
  - MD not consuming the error stream in the first patch
  - PACX-MPI interactive not tested; not supported in the infrastructure
- New features:
  - Java 1.6 support
  - PACX-MPI jobs available, both batch and interactive
  - New RASAdmin console in patch 2


RAS-MD - Evaluation (v5.0.7)

- Pros:
  - It allows working in a GUI environment
    - Sometimes more pleasant
    - Usually a more discoverable way to use the Grid
  - It allows running graphical and interactive applications not
    available in classic interfaces
    - even games such as Quake 2 or Doom, as a proof-of-concept
  - People can interact with the Grid from anywhere
    - It preserves their own environment
- Cons:
  - Random errors while trying to access some MD functionalities
    - dialogs hanging forever, ...
  - Usability glitches (not always intuitive or autodiscoverable)


Interoperability

- Our cluster interoperability solution works for sharing WNs
  - But only LIP and IFCA are applying it
  - What about ICM, PSNC and other I2G sites in EGEE?
- Our I2G UI package works and is very simple to deploy
  - That's what we did for the tutorial: install the I2G rpms on top of
    an EGEE UI...
- No feedback/requests from EGEE users...
  - Some occasional conversation after the interoperability talk at
    EGEE'07, but nothing more...
  - Should we push further? How?
  - It's a pity, because considerable time was devoted to this...


Monitoring and Accounting

- Monitoring
  - The daily report is wonderful
    - It makes site admins react instantly
  - Some issues with the SAM tests:
    - The SWDIR, JS, RM, MPI JS and GFAL tests recently became Critical
      Tests (CTs)
    - Should the PACX-MPI test become a CT also?
    - Should we implement alerting emails?
    - The general openmpi job test report only shows the mpi-start debug
      info; we do not see the stdout generated by the program
    - If the submit filter is not installed at a PBS site, the MPI job
      runs with only 1 CPU and SAM thinks it's OK...
    - The i2glogin test is now being implemented. Should it become a CT
      also?
- Accounting (see the CESGA talk...)