Supervision of Production Computers


Supervision of Production Computers WG


Web page: http://cern.ch/wg-supc



Minutes of the meetings no. 1 and 2, held on 22 and 24.09.2003




Participants: Maite BARROSO LOPEZ (IT), Roberto BARTOLOME (ST), Peter CHOCHULA (EP), Uwe EPTING (ST), Bruce FLOCKHART (IT), Helge MEINHARD (IT), Tono RIESCO (ST)


Distribution: Participants




1 Introduction

This was the first meeting of the working group. After a short introduction of the members, each person presented their current activities and ideas. As time was insufficient on Monday, we met a second time on Wednesday. The minutes cover both meetings.


2 Presentations in chronological order


Alastair Bland: The CYAN Jaguar project for Infrastructure Surveillance


The project covers the control infrastructure of PS, SPS and LHC. The supervised systems are LynxOS, HP-UX and Linux. The architecture is based on the common LHC alarm system (LASER).

SL/TCR world: one agent on every host (clic), info gathered by central logger (Clogger), Xcluc for display (remote debug and control)

Developed in 1988 by J.M. Jouanigot with input from Alastair Bland.

Can be configured for around 200 hosts. The interface has not changed in more than 10 years.

Big Brother is used for workstation monitoring, but not for PCR and TCR.

LHC/IAS: uses NAGIOS


PS: the X-Windows tool ALARM, introduced about 10 years ago. Uses clic/Clogger but a different Xcluc.


Next generation:

Enterprise JavaBeans, Oracle 9i for the application layer, SonicMQ for messaging

NetBeans (from Sun) for Accelerator GUIs


Aim: phase out CERN-developed RPC and polling systems, homogenise on one technology for the complete AB/CO infrastructure by using MOM (SonicMQ) and NetBeans. The GUI is partly running; actions/commands and the full GUI are still to be implemented (planned for completion by end of October 2003).
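
For illustration (not shown at the meeting): a minimal publish/subscribe sketch in Python of the MOM pattern that replaces polling. The standard-library queue stands in for the SonicMQ broker (SonicMQ itself is a Java/JMS product), and the host name is invented.

import queue
import threading

# Stand-in for a SonicMQ topic: hosts publish status messages, the display
# subscribes, so no component has to poll another.
broker = queue.Queue()

def host_agent(host):
    # Publisher: a monitored host pushes its status instead of being polled.
    broker.put({"host": host, "status": "OK"})

def display():
    # Subscriber: the GUI consumes messages as they arrive.
    while True:
        msg = broker.get()
        if msg is None:            # sentinel used here only to stop the sketch
            return
        print("%(host)s: %(status)s" % msg)

t = threading.Thread(target=display)
t.start()
host_agent("cs-ccr-demo1")         # hypothetical host name
broker.put(None)
t.join()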


Developed on Linux; first deployment will be on Windows.

Bottom layer: LynxOS, PC, PowerPC, HP-UX, Linux, CYGWIN (not ready yet), Solaris (first try done). In general this is quite complicated, because supervision should not use a lot of process time and is thus very optimised. The application layer will probably run on Linux.



Tono Riesco: Supervision of Access System (ACIS)

The Access Control Information System (ACIS) is operational. The application is web based using standards.


NAGIOS is running on Linux or any other Unix flavour.


The monitoring is done using NAGIOS with specific plug-ins for actions and services. The supervised systems are Linux and HP-UX (~30), ~30 PLCs, special devices (15, e.g. card readers) and Windows, possibly LynxOS (to be verified).


Checks: disk space, access keys, CPU load, logs, user databases, SCADA, zombie processes, PLCs: OK/NOK


Future: NAGIOS plus necessary plug-ins for Windows and Sun etc. (using shell scripts or SQL scripts)
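
As an illustration of such a check plug-in: a minimal sketch in Python (the presented plug-ins use shell or SQL scripts; the disk-space thresholds below are invented). The exit-code convention 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN and the performance data after '|' are the standard NAGIOS plug-in interface.

#!/usr/bin/env python
import os
import sys

WARN_PCT, CRIT_PCT = 80, 90        # assumed thresholds, not from the minutes

def check_disk(path="/"):
    # One check = one plug-in: print a status line, return the NAGIOS exit code.
    try:
        st = os.statvfs(path)
    except OSError as err:
        print("DISK UNKNOWN - %s" % err)
        return 3
    used_pct = 100.0 * (st.f_blocks - st.f_bavail) / st.f_blocks
    status, code = "OK", 0
    if used_pct >= CRIT_PCT:
        status, code = "CRITICAL", 2
    elif used_pct >= WARN_PCT:
        status, code = "WARNING", 1
    # Text after '|' is NAGIOS performance data.
    print("DISK %s - %.0f%% used on %s|used=%.0f%%" % (status, used_pct, path, used_pct))
    return code

if __name__ == "__main__":
    sys.exit(check_disk())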


Maite Barroso Lopez: EDG WP4 Fabric Management: Fabric Monitoring and Fault Tolerance

Development project, used by different collaborators around the world.

Computing fabric monitoring of between 100 and many thousands of servers; scalability is very important.

The project covers configuration management and monitoring, including the installation of several nodes using a central database.

Monitoring & Fault Tolerance part: gathering information from all servers, analysing, fault detection and error handling including actions.

Messaging between servers and central database through UDP or TCP/IP

Rule definition is done in XML using a web interface, including the action to be triggered. Servers can subscribe to the central database and get a message if a status changes, triggering an action. The same is also possible locally.
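
As a sketch of the XML rule idea (the actual WP4 rule schema is not given in these minutes; all tag, metric and action names below are invented for illustration):

import xml.etree.ElementTree as ET

RULE_XML = """
<rule metric="cpu_load" operator="gt" threshold="10">
  <action>restart_daemon</action>
</rule>
"""

def evaluate(rule_xml, value):
    # Return the action to trigger if the rule fires, else None.
    rule = ET.fromstring(rule_xml)
    fired = rule.get("operator") == "gt" and value > float(rule.get("threshold"))
    return rule.findtext("action") if fired else None

print(evaluate(RULE_XML, 12.5))    # -> restart_daemon
print(evaluate(RULE_XML, 3.0))     # -> None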

Sensors mainly run on Linux and Solaris, developed in C, C++, Perl and shell scripts; the GUI runs on Linux, and an alarm display is being developed using Java.


Helge Meinhard: LEMON - LHC Era MONitoring

Uses the above tools for monitoring of ~1500 computers.

MSA has been running stably on Linux for 18 months.

Metrics are configured to run at different intervals (ranging currently from 1 minute to once per day).
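
A minimal sketch of per-metric sampling intervals (metric names and intervals below are illustrative; the real MSA configuration is not reproduced here):

import time

METRICS = {                       # metric -> sampling interval in seconds (assumed)
    "cpu_load": 60,               # once per minute
    "disk_usage": 86400,          # once per day
}

def due_metrics(last_run, now):
    # Return the metrics whose interval has elapsed since their last run.
    return [m for m, interval in METRICS.items()
            if now - last_run.get(m, 0) >= interval]

last_run = {}
now = time.time()
for metric in due_metrics(last_run, now):
    print("sampling %s" % metric)
    last_run[metric] = now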

Data transfer can be done by UDP or TCP; currently only UDP is deployed. TCP could be used if security issues become more important.

The repository (for logging) uses Oracle; a timeframe of some months up to two to three years is required. An alternative approach is a PVSS-based repository. Both work fine and will be used in the future. The alarm interface is done using PVSS, with some adaptations to the standard PVSS alarm screen (grouping, bulk acknowledge).

In addition: derived metrics combining several metrics from different servers.
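
A sketch of what such a derived metric could look like, combining one metric sampled on several servers (host names and the choice of aggregation are invented for illustration):

samples = {                       # latest cpu_load readings per host (hypothetical)
    "lxplus001": 2.1,
    "lxplus002": 7.8,
    "lxplus003": 4.4,
}

def derived_cluster_load(per_host):
    # Derived metric: average and worst-case load over the whole cluster.
    values = list(per_host.values())
    return {"avg_load": sum(values) / len(values),
            "max_load": max(values)}

print(derived_cluster_load(samples))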

Windows is not done yet, but would possibly be feasible.

It has not been decided whether OraMon or PVSS will be used in the future.


Bruce Flockhart: Control Systems Supervision

An important factor is not only looking at the computers but also remote configuration and restarting/acting on the equipment. This is required for Linux and Windows, will be developed over the next 2-3 years, and will be common to the four LHC experiments.


1. DCS (Detector Control System):

Access to specific process data must be possible; remote rebooting/installation/configuration; secure access is very important; local monitoring agents and an HMI interface for alarms are needed.


2. DSS (Detector Safety System):

Runs on a private LAN connected through a PC gateway to the network; PVSS; Siemens S7 connected via Profibus; Windows XP; Windows Terminal Services for monitoring. Access is needed from the CERN site; access from home is under discussion.


3. GCS (Gas Control System):

~23 systems around the LHC plus some other projects; Schneider PLCs; PVSS; Windows; PLCs connected directly to Ethernet, but could be connected through a gateway if necessary.


Requirements:

Centrally managed OS ("controls NICE"), custom load for a specific OS (particularly different languages), blocking of unwanted NICE software, inclusion of standard non-NICE software (e.g. Siemens).


Virus scanning is seen as a possible problem source; controlled access to software is required, preventing unauthorised access.


DSS and GCS will be operational very soon, and supervision has to be included soon.


Peter Chochula: Supervision of Production Computers in ALICE

DCS network; mainly the same requirements as presented by Bruce.

Some hundreds of computers: Windows and Linux, PLCs, intelligent power supplies, VME masters, readout controllers with FPGAs running Linux.

A significant number of computers are inaccessible during a run because they run critical tasks. Patches cannot be deployed during a run, only during shutdown.

Supervision should cover more than OS-related processes, e.g. OPC and front-end monitoring and control servers.


Alarm/error handling should be merged with DCS operation, and thus should run within PVSS.


Currently, test systems use PCMON (from the JCOP framework), using PVSS and DIM for data subscription, plus some other (non-PCMON) systems also based on DIM-PVSS.


2003-2004: lab systems will be installed where monitoring is needed; 2005: pre-installation at the experimental site; 2006: monitoring must be operational.


Special requirements: PVSS, remote reboot, secured access to equipment.


Roberto Bartolome: SPIME (System Performance Information Measurement)

Project was started with User Requirements Documents

Monitoring requirements: CPU, processes, memory usage ...

OS: Windows, Linux

Common interfaces needed.

ALIVE messages are to be sent from bottom to top; the solution must be cheap/free -> this results in NAGIOS for monitoring and VNC for remote access.
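
A sketch of such an ALIVE message, assuming a plain UDP datagram as transport (the minutes do not specify the protocol; the port number and message format are invented, while lnxspime1 is the monitoring machine named below):

import json
import socket
import time

MONITOR_HOST, MONITOR_PORT = "lnxspime1", 9999   # port number is an assumption

def send_alive(hostname):
    # One cheap heartbeat datagram from a monitored node to the central machine.
    msg = json.dumps({"host": hostname, "alive": True, "ts": time.time()})
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(msg.encode("ascii"), (MONITOR_HOST, MONITOR_PORT))
    sock.close()

send_alive(socket.gethostname())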


The system runs on a dedicated Linux machine for monitoring (lnxspime1).

A test installation is under evaluation; for the moment only supervision is done, with no actions defined, as this was outside the scope of the project.




3 Discussion

The discussion showed that some systems have not yet been covered, e.g. the LHC online farms and the IT-IS Windows farms. We should also invite somebody from those teams to report on the supervision activities of those installations.