Correlation Engine Prototype



V. Pose, JINR Dubna

B. Panzer-Steindel, CERN IT



Overview


The Correlation Engine Prototype was developed during three months of work in the CERN IT Division, financed by the CERN-Intas project.


The CERN monitoring prototype, part of the fabric management work package (WP4) of the DataGrid project, gathers monitoring data from farm nodes at CERN into a central monitoring database.

Performing correlations on the data in the monitoring database should help to:

- foresee exceptions on individual nodes and on node groups

- analyse the performance of the farm.


The results of the correlations can therefore be used to save a system administrator's work in the following ways:

- to trigger automatic remedy actions

- to gather additional monitoring data that give a detailed view of the current exceptional state of nodes; this additional information should reduce the effort a system administrator needs to decide how to treat the affected node(s).


The correlation engine prototype was developed to make it easy to add new correlations of monitoring data and new actions triggered in case of exceptions. New functionality is added in the form of new engines.

The current prototype is written in Perl, and the engines are subroutines placed in a separate Perl module. Two engines are implemented: ayt and procpu.


The ayt engine tries to connect to hosts for which the monitoring data are missing or out-of-date, and to get some basic information about their state.


The procpu engine:

- looks for nodes with a higher than usual number of processes and a lower than usual CPU usage

- looks for nodes with a very high number of processes

- connects to such a node using the ayt engine

- gathers more information about the node state and stores it as a report in a text file in a file database.


Currently no remedy actions are implemented.

The results of the correlation engine can be accessed through a web-interface. The main pages of the web-interface are shown on the poster.


ayt engine


The ayt engine does the following for a given node:

1. tests whether the data read from the monitoring database are up-to-date:
   o the timeout is the sampling period of the metric plus a general wait interval
   o the sampling period for each metric is set in the configuration module Cfg.pm
2. if the data are out-of-date, tests the node for a ping response
3. if the node responds to ping, tries to open the telnet port 23
4. if the port is open, tries to log in to a telnet session
5. if the login succeeds, runs the ps command
6. if ps succeeds, runs df -x afs
7. if df succeeds, runs df -t afs
8. if df -t afs succeeds, runs ls for a directory on the AFS partition
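
The staleness test and the "stop at the first failing probe" behaviour of the steps above can be sketched as follows. This is a Python illustration, not the prototype's Perl code; the function names, the 60-second wait interval and the probe callables are assumptions made for the example.

```python
import time
from typing import Callable, List, Optional, Tuple

# Hypothetical general wait interval in seconds (the real value is not given).
WAIT_INTERVAL = 60.0


def is_out_of_date(last_timestamp: float, sampling_period: float,
                   now: Optional[float] = None,
                   wait_interval: float = WAIT_INTERVAL) -> bool:
    """A sample is stale once it is older than sampling period + wait interval."""
    if now is None:
        now = time.time()
    return now - last_timestamp > sampling_period + wait_interval


def run_probe_chain(
        probes: List[Tuple[str, Callable[[], bool]]]) -> List[Tuple[str, bool]]:
    """Run the probes in order and stop after the first one that fails."""
    results = []
    for name, probe in probes:
        ok = probe()
        results.append((name, ok))
        if not ok:
            break  # later probes depend on this one, so they are skipped
    return results
```

In such a sketch the real probes (ping, opening port 23, login, ps, df, ls) would be passed in as callables, so the chain records how far the engine got on a node before something failed.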



procpu engine


The procpu engine does the following for a given node:

- tests the data read from the monitoring database against threshold sets configured in the configuration module Cfg.pm:
   o each threshold set can contain a minimum and a maximum threshold for each metric, so a given metric can have different thresholds in different sets

- runs the ayt engine on the node

- uses the telnet session opened by ayt to get the information reflected in the report.
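
A threshold-set test of this kind might look like the Python sketch below. It assumes that a set "matches" when every metric it names lies within its minimum/maximum bounds; the text does not spell out the exact matching rule, and the metric names, set names and numbers are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (minimum, maximum); None means "no bound on that side".
Threshold = Tuple[Optional[float], Optional[float]]
ThresholdSet = Dict[str, Threshold]

# Hypothetical sets mirroring the two procpu conditions.
THRESHOLD_SETS: Dict[str, ThresholdSet] = {
    "many_procs_low_cpu": {"nproc": (200, None), "cpu_usage": (None, 10.0)},
    "very_many_procs": {"nproc": (1000, None)},
}


def set_matches(values: Dict[str, float], threshold_set: ThresholdSet) -> bool:
    """True if every metric named in the set lies within its bounds."""
    for metric, (low, high) in threshold_set.items():
        value = values.get(metric)
        if value is None:
            return False  # no data for this metric: treat as no match
        if low is not None and value < low:
            return False
        if high is not None and value > high:
            return False
    return True


def matching_sets(values: Dict[str, float],
                  threshold_sets: Dict[str, ThresholdSet]) -> List[str]:
    """Names of all threshold sets that the measured values satisfy."""
    return [name for name, tset in threshold_sets.items()
            if set_matches(values, tset)]
```

Because each set carries its own bounds, the same metric (here `nproc`) can have a different threshold in each set, as described above.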



Report


A report produced by the procpu engine for a node contains:

- the monitoring data read from the monitoring database and their timestamps

- the thresholds set for the different metrics which are correlated

- values for the same metrics measured by the correlation engine

- a list (top 10) of open files sorted by the number of links to the open file

- a list (top 10) of processes running the same command sorted by the number of processes running the command

- a list (top 10) of hosts to which the tested node is connected through TCP sorted by the number of connections to the host

- a list (top 5) of states of internet sockets sorted by the number of connections in the state

- virtual memory usage summary

- a list (top 5) of processes with top virtual memory usage

- a list (top 5) of processes with top physical memory usage.
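
Several of the report entries are "top N by count" lists. As a rough Python illustration (how the prototype actually parses the remote command output is not described, so the input here is simply a list of already extracted names):

```python
from collections import Counter
from typing import Iterable, List, Tuple


def top_n(items: Iterable[str], n: int) -> List[Tuple[str, int]]:
    """Return the n most frequent items with their counts, highest first."""
    return Counter(items).most_common(n)
```

For example, feeding it one command name per running process would yield the "processes running the same command" list, and one remote host per TCP connection would yield the list of most connected hosts.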


Implementation


The correlation engine prototype consists of the following Perl modules:

- main module ce.pl

- module containing the code of the engines, Engine.pm

- library module Celib.pm

- configuration module Cfg.pm.


The web-interface consists of a couple of CGI-scripts written in Perl.

Data are exchanged between the correlation engine and the web-interface through text files.


The main module ce.pl implements in particular the following functionality:

- initializes common data structures of the engines from the configuration module Cfg.pm

- reads data from the monitoring database

- periodically runs the engines

- saves the results of the engines into a file database.
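
One pass of such a main loop could be sketched in Python as below. The directory layout of the file database, the engine signature and the `read_metrics` callable are all assumptions for illustration; the text only states that results are saved as text files.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict, List, Optional

# Hypothetical engine signature: (node, metric values) -> report text or None.
Engine = Callable[[str, Dict[str, float]], Optional[str]]


def run_pass(nodes_by_cluster: Dict[str, List[str]],
             engines: Dict[str, Engine],
             read_metrics: Callable[[str], Dict[str, float]],
             out_dir: Path) -> List[Path]:
    """Run every engine once on every node; save each report as a text file."""
    written = []
    for cluster, nodes in nodes_by_cluster.items():
        for node in nodes:
            data = read_metrics(node)  # stands in for the monitoring database
            for name, engine in engines.items():
                report = engine(node, data)
                if report is None:
                    continue  # nothing exceptional on this node
                path = out_dir / cluster / node / f"{name}.txt"
                path.parent.mkdir(parents=True, exist_ok=True)
                path.write_text(report)
                written.append(path)
    return written


def demo() -> List[str]:
    """Run one pass with a toy engine and return the relative report paths."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        written = run_pass(
            {"cluster_a": ["node001"]},
            {"procpu": lambda node, data:
                f"{node}: nproc={data['nproc']}" if data["nproc"] > 200 else None},
            lambda node: {"nproc": 250.0},
            root,
        )
        return [p.relative_to(root).as_posix() for p in written]
```

Writing plain text files keeps the engine and the web-interface decoupled: the CGI scripts only need read access to the report directory.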


The Engine.pm module currently contains the code for the ayt and procpu engines.


The library module Celib.pm contains subroutines used by the engines. In particular:

- an envelope to run commands on the node where the correlation engine is running:
   o implements timeout functionality: on timeout, the process running the command is killed
   o measures command execution time
   o saves stdout, stderr and the exit code of the command

- an envelope to run commands on a remote node:
   o implements timeout functionality: in case of timeout, sends the telnet BREAK signal
   o measures command execution time
   o saves stdout and stderr of the command

- a subroutine to limit the maximum execution frequency of an engine on a given node:
   o a minimal time interval is set for each engine in the configuration module Cfg.pm.
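
The local command envelope and the frequency limiter can be illustrated with the following Python sketch. It is not the prototype's Perl implementation; `CommandResult`, `run_local` and `Throttle` are hypothetical names, and the remote (telnet) envelope is not covered.

```python
import subprocess
import sys
import time
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class CommandResult:
    stdout: str
    stderr: str
    exit_code: Optional[int]  # None if the command was killed on timeout
    elapsed: float            # wall-clock execution time in seconds
    timed_out: bool


def run_local(argv: List[str], timeout: float) -> CommandResult:
    """Run a local command, kill it on timeout, and record time and output."""
    start = time.monotonic()
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE, text=True)
    try:
        out, err = proc.communicate(timeout=timeout)
        timed_out = False
    except subprocess.TimeoutExpired:
        proc.kill()                    # on timeout the process is killed
        out, err = proc.communicate()  # collect whatever was written so far
        timed_out = True
    elapsed = time.monotonic() - start
    exit_code = None if timed_out else proc.returncode
    return CommandResult(out, err, exit_code, elapsed, timed_out)


class Throttle:
    """Allow an engine to run on a node at most once per minimal interval."""

    def __init__(self, min_interval: Dict[str, float]) -> None:
        self.min_interval = min_interval  # seconds per engine name
        self.last_run: Dict[Tuple[str, str], float] = {}

    def allow(self, engine: str, node: str,
              now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        key = (engine, node)
        last = self.last_run.get(key)
        if last is not None and now - last < self.min_interval[engine]:
            return False  # ran too recently on this node
        self.last_run[key] = now
        return True
```

A caller would check `Throttle.allow(engine, node)` before starting an engine, and wrap each external command in `run_local` so that a hanging command cannot block the whole correlation engine.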


The configuration module Cfg.pm contains subroutines returning hashes with the following configuration information:

- nodes watched by the correlation engine, grouped by clusters

- metrics read from the monitoring database, with the following attributes:
   o sampling period
   o description

- engines executed by the ce.pl module, with the following attributes:
   o name of the subroutine to call in Engine.pm
   o minimal execution interval of the engine on a node
   o engine-specific information, e.g. the threshold sets for the procpu engine.
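
The shape of such a configuration can be sketched with Python functions returning dictionaries, in place of Perl subroutines returning hashes. Every cluster, node and metric name and every number below is invented for the example; only the attribute structure follows the description above.

```python
from typing import Any, Dict, List


def nodes() -> Dict[str, List[str]]:
    """Watched nodes grouped by cluster (hypothetical names)."""
    return {"cluster_a": ["node001", "node002"], "cluster_b": ["node101"]}


def metrics() -> Dict[str, Dict[str, Any]]:
    """Metrics read from the monitoring database."""
    return {
        "nproc": {"sampling_period": 300,
                  "description": "number of processes"},
        "cpu_usage": {"sampling_period": 60,
                      "description": "CPU usage in percent"},
    }


def engines() -> Dict[str, Dict[str, Any]]:
    """Engines run by the main module, with engine-specific settings."""
    return {
        "ayt": {"subroutine": "ayt", "min_interval": 600},
        "procpu": {
            "subroutine": "procpu",
            "min_interval": 1800,
            # threshold sets: metric -> (minimum, maximum), None = no bound
            "threshold_sets": {
                "many_procs_low_cpu": {"nproc": (200, None),
                                       "cpu_usage": (None, 10.0)},
                "very_many_procs": {"nproc": (1000, None)},
            },
        },
    }


def undefined_threshold_metrics() -> List[str]:
    """List metrics used in threshold sets but missing from metrics()."""
    known = set(metrics())
    missing = []
    for engine in engines().values():
        for tset in engine.get("threshold_sets", {}).values():
            for metric in tset:
                if metric not in known:
                    missing.append(metric)
    return missing
```

A small consistency check like `undefined_threshold_metrics()` is one way to catch a threshold that refers to a metric without a configured sampling period.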



Figure 1. Main screen of the web interface of the Correlation Engine Prototype

- the Nodes box contains 2 nodes currently in an exceptional state
- the Report buttons show the last report or the last 10 reports for the selected nodes
- the history links provide a 24-hour or 72-hour history of exceptions
- the Threshold section on the right shows the thresholds used for each cluster



Figure 2. The cluster status page shows the status of the 3 watched clusters




Figure 3.1. 24-hour history of exceptions, page 1

Figure 3.2. 24-hour history of exceptions, continuation of page 1


Figure 4. Search form for archived reports