Grid Operations: accounting and monitoring - GridPP


Dave Kant

D.Kant@rl.ac.uk


Monitoring and Accounting

Dave Kant

CCLRC e-Science Centre, UK

GridPP 12

Jan 31st - Feb 1st 2005

Overview

1. GOC Database
2. Monitoring Tools
3. Accounting
4. Issues
5. Future Plans



GOC Database


What features?


Configuration of monitoring tools


Security


Organisations


Administrative Roles


Replication


What role will it play in the future?


New site registration procedure


BDII generation


GRID Configuration Database

GOCDB


GridSite

MySQL

Resource Centre

Resources & Site Information

EDG, LCG-1, LCG-2,

ce, se, bdii, rb


Monitoring Services




Operations Maps



Configure other Tools



Resource Provider



Organisation Structures



Secure services:
- Site News
- Self Certification
- Accounting

Secure Database Management via HTTPS / X.509

Store a Subset of the Grid Information system

People, Contact Information, Resources

Maintenance Bit

RC

SQL

https

SERVER

GOC DB can also contain information that is
not present in the IS such as:

Scheduled maintenance; News;
Organisational Structures; Geographic
coordinates for maps.
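
As a rough illustration of the kind of records described above, the sketch below models sites, their nodes and a scheduled-maintenance flag, and selects the nodes a monitoring tool should probe. It uses Python's built-in sqlite3 as a stand-in for MySQL. All table names, column names and sample hosts are hypothetical and are not the actual GOCDB schema.

```python
import sqlite3

# Hypothetical schema: table and column names are illustrative only.
SCHEMA = """
CREATE TABLE site (
    name TEXT PRIMARY KEY,
    contact_email TEXT,
    latitude REAL,
    longitude REAL,
    in_maintenance INTEGER NOT NULL DEFAULT 0
);
CREATE TABLE node (
    hostname TEXT PRIMARY KEY,
    site_name TEXT NOT NULL REFERENCES site(name),
    node_type TEXT NOT NULL CHECK (node_type IN ('ce', 'se', 'bdii', 'rb'))
);
"""


def open_db():
    """Create an in-memory database with the illustrative schema."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    return db


def nodes_to_monitor(db, node_type):
    """Hostnames of the given type at sites not in scheduled maintenance."""
    rows = db.execute(
        "SELECT node.hostname FROM node "
        "JOIN site ON site.name = node.site_name "
        "WHERE node.node_type = ? AND site.in_maintenance = 0 "
        "ORDER BY node.hostname",
        (node_type,),
    )
    return [hostname for (hostname,) in rows]


db = open_db()
# Made-up sample data: SiteB has its maintenance flag set.
db.executemany("INSERT INTO site (name, in_maintenance) VALUES (?, ?)",
               [("SiteA", 0), ("SiteB", 1)])
db.executemany("INSERT INTO node VALUES (?, ?, ?)",
               [("ce.sitea.example", "SiteA", "ce"),
                ("se.sitea.example", "SiteA", "se"),
                ("ce.siteb.example", "SiteB", "ce")])
print(nodes_to_monitor(db, "ce"))
```

Keeping the maintenance flag next to the site record means every tool that reads its host list from the database skips sites in downtime without extra configuration.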


EGEE ROC Structure


EGEE is made up of regions.


Each region contains many computing centres.


Regional Operational Centres are a focus for
operational activities.


Organisational Structures

Developed a tool to manage organisational structures.

Modelled on GridPP Tier1/2 Structure

Materialised Path Encoding

Provide ROCs with a package to monitor the resources in the region

Tailored Monitoring

Administrative roles to the coordinators in GOCDB

EGEE (1)
    France (1.1)
    UK/I (1.2)
        GridPP (1.2.1)
            LondonT2
                IMPERIAL
                QMUL
            ScotGrid
                Edinburgh
    S.E.E (1.3)
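
A minimal sketch of how materialised path encoding supports queries on such a tree: each organisation stores its full path from the root, so finding everything under a node is a prefix match (in SQL, roughly a `LIKE '1.2.%'`). The path labels are the ones shown on the slide. The function names are illustrative, and no labels are invented for the sites whose paths the slide does not give.

```python
# Path labels copied from the slide; entries without a label there are omitted.
ORGS = {
    "1": "EGEE",
    "1.1": "France",
    "1.2": "UK/I",
    "1.3": "S.E.E",
    "1.2.1": "GridPP",
}


def descendants(path, orgs=ORGS):
    """Names of all organisations strictly below the given path."""
    prefix = path + "."  # the trailing dot stops '1.2' matching a path like '1.20'
    return sorted(name for p, name in orgs.items() if p.startswith(prefix))


def ancestors(path, orgs=ORGS):
    """Names of the organisations above the given path, root first."""
    parts = path.split(".")
    return [orgs[".".join(parts[:i])] for i in range(1, len(parts))]


print(descendants("1.2"))
print(ancestors("1.2.1"))
```

The appeal of this encoding is that a whole subtree or a chain of parents can be read with one string comparison per row, without recursive queries.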




Total List of all sites is derived from GOCDB (via RGMA)



GOC bit: sites which have opted out e.g. scheduled maintenance



White List: Sites that failed one or more core tests but are well supported are put back in, e.g. a Tier1 site



Core tests are a subset of the site functional tests run by CERN every day



Black List: Sites that are not trusted
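
The selection rules above can be sketched as a simple set filter. This is only an illustration: the function and parameter names are hypothetical, and the precedence (black list and GOC bit override the white list) is an assumption that the slide does not state.

```python
def trusted_sites(all_sites, goc_bit_out, passed_core, white_list, black_list):
    """Return the sorted list of sites that would be kept.

    Assumed precedence: a site that opted out (GOC bit) or is black-listed
    is always dropped; otherwise it is kept if it passed the core tests
    or is on the white list.
    """
    trusted = set()
    for site in all_sites:
        if site in goc_bit_out or site in black_list:
            continue
        if site in passed_core or site in white_list:
            trusted.add(site)
    return sorted(trusted)


# Made-up example: B is in maintenance, C failed but is white-listed,
# D failed, E is white-listed but also black-listed.
print(trusted_sites(
    all_sites={"A", "B", "C", "D", "E"},
    goc_bit_out={"B"},
    passed_core={"A", "B"},
    white_list={"C", "E"},
    black_list={"E"},
))
```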


[Diagram labels: 100's of Sites, Monitoring Services, Total List of all sites, Sites pass core tests, Trusted Sites, Black List, White List, BDII, RGMA, GOC Bit, GOC DB Site info, Gstat Data, Site Functional Tests, GOC Hourly Tests]

Generation of BDII configuration file via feedback into IS

Adaptive Job Brokering Based on the Monitoring System

Environments: Production, VO, GridPP, …



How Are New Sites Added?

Site, ROC, GOCDB, EGEE

[1] Site and ROC liaise

[2] “candidate” site

[3] Site installs middleware

[4] “uncertified” Site

[5] Certification Testing

[6] “certified” Site

1. JSPG have written a “Site Registration Policy & Procedure” Document: https://edms.cern.ch/document/503198/

2. New GOCDB portal to streamline the site registration process.
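
The steps [1] to [6] describe a site moving through three states. A minimal sketch of that life cycle as a state machine follows. The state names come from the slide, while the event names (`install_middleware`, `pass_certification`) are hypothetical labels for steps [3] and [5].

```python
from enum import Enum


class SiteStatus(Enum):
    CANDIDATE = "candidate"
    UNCERTIFIED = "uncertified"
    CERTIFIED = "certified"


# Allowed forward transitions; event names are illustrative, not from GOCDB.
TRANSITIONS = {
    (SiteStatus.CANDIDATE, "install_middleware"): SiteStatus.UNCERTIFIED,
    (SiteStatus.UNCERTIFIED, "pass_certification"): SiteStatus.CERTIFIED,
}


def advance(status, event):
    """Return the next status, or raise ValueError if the step is out of order."""
    try:
        return TRANSITIONS[(status, event)]
    except KeyError:
        raise ValueError(f"{event!r} not allowed from {status.value!r}") from None


status = SiteStatus.CANDIDATE
status = advance(status, "install_middleware")
status = advance(status, "pass_certification")
print(status.value)
```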





Replication


Two replicas, each with different security considerations


“Services” replica managed by Taipei


Direct connections to the database by the
monitoring tools from known hosts


“Users” replica to be set up at IN2P3


Web portal based on X.509 certificates


CIC on duty





Monitoring Tools


What are the main tools that are used in the day-to-day operations of the LCG Grid?


GPPMON


GSTAT


Site Functional Tests


Other monitoring tools exist, but I won’t
discuss them here


GridIce

Operations Map - Job Submission Tests


GPPMON

Displays the results of
tests against sites.

Test: Job Submission


Job is a simple test of the grid middleware components e.g. Gatekeeper service, RB service, and the Information System via JDL requirements.

This kind of test deals with the functional behaviour of core grid services: do simple jobs run? They are lightweight tests which run hourly. However, they have certain limitations e.g. Dteam VO; WN reach (specialised monitoring queues).

Operations Map - Certificate Lifetime


GPPMON

Displays the results of
tests against sites.

Test: Certificate Lifetime

Many grid services require a valid certificate for security.

By probing the host certificates on CEs and SEs at sites with a simple SSL client service, we can identify certificates which are due to expire and send an early warning to the sites concerned. A predictive tool!
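
A minimal sketch of such a certificate lifetime probe using Python's standard `ssl` module. This is not the GPPMON implementation. The 30-day threshold and the function names are assumptions, and `fetch_not_after` assumes the issuing CA is in the client's default trust store, which may not hold for grid CAs.

```python
import socket
import ssl
import time


def days_left(not_after, now=None):
    """Days until a certificate's notAfter time, e.g. 'Jan 31 00:00:00 2005 GMT'."""
    expiry = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expiry - now) / 86400.0


def needs_warning(not_after, warn_days=30, now=None):
    """True if the certificate expires within warn_days (assumed threshold)."""
    return days_left(not_after, now) <= warn_days


def fetch_not_after(host, port, timeout=10.0):
    """Return the notAfter field of the certificate presented by host:port.

    Assumes the issuing CA is trusted by the default context; otherwise
    the handshake fails before the certificate can be inspected.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A monitoring loop would call `fetch_not_after` for each CE and SE host and raise a warning where `needs_warning` returns true.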


GIIS Monitor


Developed by Min Tsai (GOC Taipei)


Tool to display and check information published by the site GIIS (sanity checks, fault detection)


http://goc.grid.sinica.edu.tw/gstat/

Regional Plot:

http://map.gridpp.ac.uk


Site Certification Service




In terms of middleware, the installation and configuration of a site is
quite a complicated procedure.


When there is a new release, sites don’t upgrade at the same time


Some upgrades don’t always go smoothly


Unexpected things happen (who turned off the power?)


Day-to-day problems; robustness of service under load?



It's necessary to actively hunt for problems





Site certification testing is carried out by the CERN deployment team on a daily basis. The first step toward providing this service involves running a series of replica manager tests which register files onto the grid, move them around and delete them, and perform 3rd party copies from a remote SE.



Unlike the simple job submission tests implemented in GPPMON, these tests are more heavyweight and attempt to simulate the life cycle of real applications.

Certification Test Results

http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/listreports.cgi

Syndication of Monitoring Information

Aggregator: RSSReader (Windows Client)

GOC generates RSS feeds which clients can pull using an RSS aggregator.

How can we integrate feeds and ticketing systems?
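
As a sketch of what pulling such a feed involves on the client side, the following parses RSS 2.0 items with Python's standard library. The feed content is made up for illustration and is not real GOC data. A script like this is one possible starting point for passing feed items to a ticketing system.

```python
import xml.etree.ElementTree as ET

# Made-up feed content for illustration only.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example monitoring feed</title>
    <item>
      <title>SiteA: job submission test failed</title>
      <link>http://example.org/reports/1</link>
    </item>
    <item>
      <title>SiteB: host certificate expires soon</title>
      <link>http://example.org/reports/2</link>
    </item>
  </channel>
</rss>
"""


def read_items(feed_xml):
    """Return (title, link) pairs for every item in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("title", default=""), item.findtext("link", default=""))
        for item in root.iterfind("./channel/item")
    ]


for title, link in read_items(SAMPLE_FEED):
    print(title, link)
```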

Real Time Grid Monitor

http://www.hep.ph.ic.ac.uk/e-science/projects/demo/index.html

A visualisation tool to track jobs currently running on the grid.

Applet queries the logging and bookkeeping service to get information about grid jobs.

Why are jobs failing?

Why are jobs queued at sites while others are empty?

Problems with Existing Tools


Lots of monitoring tools around which have things in common:

- all the information which they generate is hidden away or difficult to access

- limited interfaces: the data can only be accessed in specific ways



Therefore, it's difficult to build “on-demand” services to allow communities (“Players”) to interact with the data.



The idea is for the services to collect information and put it into a
common repository such as an RGMA Archiver. In this way, the
information can be shared and accessible to all.



Services (EGEE parlance: ROC and CIC services) munch the data and
present it to the community.



How much CPU in UKI ROC?


How much in GridPP?


How much in each Tier2?

=> Integrate data from different sources to provide this information
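
A minimal sketch of that integration: one source supplies the organisational tree, another supplies a per-site figure, and the figure is rolled up to every level above the site. The parent links are one reading of the organisational tree shown earlier in the talk, and the CPU counts are made up.

```python
# Parent links: an assumed reading of the organisational tree from the talk.
PARENT = {
    "IMPERIAL": "LondonT2",
    "QMUL": "LondonT2",
    "Edinburgh": "ScotGrid",
    "LondonT2": "GridPP",
    "ScotGrid": "GridPP",
    "GridPP": "UK/I",
}

# Made-up CPU counts per site, standing in for a second data source.
SITE_CPUS = {"IMPERIAL": 100, "QMUL": 50, "Edinburgh": 20}


def cpus_by_organisation(site_cpus, parent):
    """Add each site's CPU count to the site itself and to all its ancestors."""
    totals = {}
    for site, cpus in site_cpus.items():
        node = site
        while node is not None:
            totals[node] = totals.get(node, 0) + cpus
            node = parent.get(node)  # None once the top of the tree is reached
    return totals


totals = cpus_by_organisation(SITE_CPUS, PARENT)
for name in ("UK/I", "GridPP", "LondonT2", "ScotGrid"):
    print(name, totals[name])
```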



Monitoring Paradigm

A Better way to unify monitoring information.

GOC Services collect information and publish into an archiver.

ROC/CIC Services provide a means for the community to interact with this information on-demand. GOC provides services tailored to the requirements of the community.

[Diagram labels: Information Repository (RGMA), Accounting, Monitoring, GSTAT, Testing, ROC Services, Self Certification, CIC Services, Communities, VOs, ROCs, EGEE, Sites, Organisations, GOC Services]


Use Cases


Monitoring services which use RGMA as the
backbone for data transport and data location via
the registry service.


Grid Event Monitoring System


“Site Functional Test” Reporting Tool


Accounting

Use Cases - GEMS


Grid Event Monitoring System


List of resources to monitor is provided by GOCDB

Alert system that uses RGMA

Looks for changes of state in the monitoring data tables

Generates an alert and displays on the GEMS console.

Notification features

Event filtering
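
The change-of-state idea can be sketched by comparing two snapshots of the monitoring data and passing the differences through a filter. This is an illustration only, not the GEMS code: the snapshot layout (resource name mapped to a state string), the alert text format and the `keep` filter are all assumptions.

```python
def state_changes(previous, current):
    """Yield (resource, old_state, new_state) for each resource whose state changed.

    Resources seen for the first time are reported with old_state None.
    """
    for resource, new_state in sorted(current.items()):
        old_state = previous.get(resource)
        if old_state != new_state:
            yield resource, old_state, new_state


def alerts(previous, current, keep=lambda resource, old, new: True):
    """Format one alert line per state change accepted by the keep filter."""
    return [
        f"{resource}: {old} -> {new}"
        for resource, old, new in state_changes(previous, current)
        if keep(resource, old, new)
    ]


# Made-up snapshots of a monitoring table.
before = {"ce.a": "ok", "se.a": "ok"}
after = {"ce.a": "fail", "se.a": "ok", "rb.b": "ok"}
print(alerts(before, after))
print(alerts(before, after, keep=lambda r, old, new: new == "fail"))
```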



Reporting Tool Prototype

Organisational Identities taken from GOCDB


Accounting


Information collected at each site from batch logs, gatekeeper logs etc.


Information joined at site level to select grid jobs and stored in a database on the R-GMA MON box at the site.


Information published through R-GMA and collected centrally in an R-GMA archive at GOC


Web site presents various views of this data



Information schema based on GGF Usage Group


Structure of Grid taken from GOC DB, the grid configuration database.


Only normalised CPU time collected (at the moment)
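
The site-level join can be sketched as follows: a batch record counts as a grid job only if the gatekeeper log has an entry for the same local job. The record layout and field names (`local_job_id`, `user_dn`, `cpu_seconds`) are hypothetical, not the real log formats or accounting schema.

```python
def select_grid_jobs(batch_records, gatekeeper_records):
    """Keep batch jobs that also appear in the gatekeeper log, adding the user DN."""
    dn_by_job = {rec["local_job_id"]: rec["user_dn"] for rec in gatekeeper_records}
    joined = []
    for rec in batch_records:
        dn = dn_by_job.get(rec["local_job_id"])
        if dn is not None:  # local (non-grid) jobs have no gatekeeper entry
            joined.append({**rec, "user_dn": dn})
    return joined


# Made-up records: job 102 was submitted locally, not through the grid.
batch = [
    {"local_job_id": "101", "cpu_seconds": 3600},
    {"local_job_id": "102", "cpu_seconds": 50},
    {"local_job_id": "103", "cpu_seconds": 120},
]
gatekeeper = [
    {"local_job_id": "101", "user_dn": "/C=UK/CN=example user one"},
    {"local_job_id": "103", "user_dn": "/C=UK/CN=example user two"},
]
grid_jobs = select_grid_jobs(batch, gatekeeper)
print([job["local_job_id"] for job in grid_jobs])
```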



GOC Accounting Services

http://goc.grid-support.ac.uk/gridsite/accounting/index.html

BaseCpuSeconds Aggregated across EGEE

Each Site, per VO, per Month

Simple interface to customise views of data: VO, time frame and Region (default = EGEE)

Each Region, per VO, per Month

On Demand Services to EGEE Community

Other Distributions: Normalised CPU, # Jobs

Web form to apply selection criteria on the data

Aggregate data across an organisation structure (Default = All ROCs)

Select VOs (Default = All)

Select date range


VO Index


Summed CPU (Seconds) consumed by resources in selected Region

Selected Date Range



List of Sites Belonging to the Selected ROC

A breakdown of the resource usage per Site, per VO, per Month
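
A per Site, per VO, per Month breakdown is a grouped sum over the job records. A minimal sketch follows, assuming hypothetical record fields (`site`, `vo`, `end_date` in `YYYY-MM-DD` form, `cpu_seconds`) and made-up values.

```python
from collections import defaultdict


def summarise(records):
    """Sum CPU seconds per (site, VO, month)."""
    totals = defaultdict(int)
    for rec in records:
        month = rec["end_date"][:7]  # 'YYYY-MM-DD' -> 'YYYY-MM'
        totals[(rec["site"], rec["vo"], month)] += rec["cpu_seconds"]
    return dict(totals)


# Made-up job records for illustration.
records = [
    {"site": "SiteA", "vo": "atlas", "end_date": "2005-01-03", "cpu_seconds": 100},
    {"site": "SiteA", "vo": "atlas", "end_date": "2005-01-20", "cpu_seconds": 250},
    {"site": "SiteA", "vo": "lhcb", "end_date": "2005-01-20", "cpu_seconds": 40},
    {"site": "SiteB", "vo": "atlas", "end_date": "2004-12-31", "cpu_seconds": 10},
]
summary = summarise(records)
for key in sorted(summary):
    print(key, summary[key])
```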


Deployment


Package was released to LCG in August 2004 and
certified soon afterwards.


There was no LCG release after that until LCG2_3_0
on 18th December 2004


Today there are still very few 2_3_0 sites. There are
28 sites producing accounting records today.


The 2_3_0 release has some bugs which are fixed in
a new release that is available on the accounting
home page


Recommend that sites upgrade accounting to APEL version 3.4.40, available on the accounting home page:

http://goc.grid-support.ac.uk/gridsite/accounting/index.html


Future Plans


Support for the LSF batch system.


Understand Normalisation issues; do we
have faith in the numbers we present?


Extend accounting schema to include
information about the worker node, Job
efficiency and globalJobID.


Integrate the LCG schema with de facto grid accounting standards, namely GGF


Share data with other Grid Communities


NorduGrid, Grid03


Summary


GOCDB to take a more important role in the operations environment


A shift in the monitoring paradigm which relies on
sharing data through RGMA


Accounting Information gathering infrastructure and
reporting web site


Development towards on-demand services to provide the community with up-to-date information, aggregated at different levels.


Development of Visualisation tools to enhance our
understanding of the grid.


Adaptive Job brokering based on the monitoring
system