Monitoring service in distributed systems: review of the Inca monitoring system


Harshad Joshi

Abstract:

Distributed computing is currently one of the most widely exploited computing platforms, driven by emerging techniques such as cloud computing. It is therefore becoming increasingly important to understand all aspects of distributed computing. From setting up the hardware to running various software on distributed systems, probably the most important supporting factor is the monitoring service. In this survey, the Inca service, as implemented on the TeraGrid computing facility, is analyzed in detail, along with related services.

Keywords:

monitoring, distributed computing, Grid Computing, publish/subscribe systems


Introduction:

As the size of a distributed system increases, the control of its operation and maintenance becomes increasingly difficult. For a geographically distributed system (such as TeraGrid, a high-performance scientific computing facility spread across the US), monitoring and control is quite a challenge. The main challenge arises from the requirement that some operations be real-time or quasi-real-time [1]. In this paper, some standard publish/subscribe middleware candidates, specially designed and developed for Grids, are examined with respect to their architecture and functionality, and their advantages and disadvantages are discussed.

Distributed systems (DS) are becoming very popular, and in the near future a large number of services based on DS concepts will become routine for many operations. Previously, when CPU power and/or memory were limited, the main driving force for DS was to build systems with greater compute power and large amounts of (shared) memory to tackle more compute-intensive tasks (mainly scientific and engineering problems). This also led to supercomputer and cluster architectures. However, with the advent of new hardware technologies, not only in CPUs (Moore's law is still driving the growth rate of CPU power [2]) but also in memory and GPU computing, cluster computing is achieving petascale performance with modest-sized clusters [3]. With the internet and other technologies such as mobile computing, new concepts have been realized in the last decade, such as geographically distributed computing and cloud computing; this has attracted both academia (which has constructed large-scale scientific computing facilities such as TeraGrid [4]) and industry [5-7]. These systems are highly dispersed across different physical locations. On one hand, these advances are truly attractive, making computing boundaries invisible and making what is now called the "Internet of Things" a reality [8]. On the other hand, the ever-increasing complexity makes controlling and monitoring these systems, to ensure that everything works as intended, a rather challenging task [9].

Monitoring a DS typically requires producing data that can be collected remotely and updated frequently. This calls for algorithms that work efficiently even over slow connections, accurately yielding real-time (or at least quasi-real-time) control over the system of interest. For example, if a compute node of a remote supercomputer has been switched on but does not respond for a long time, it will be considered to be malfunctioning. A real-time system does not need to be very fast, but it should be stable and respond within a reasonable predefined time limit. For a monitoring system to be considered a distributed real-time monitoring system, most of the monitoring data should be received within such a time limit. Traditional monitoring systems are highly centralized and may not scale very well.
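The node-liveness check described above can be sketched as a simple heartbeat monitor. This is a minimal illustration of the quasi-real-time idea only; the class and timeout names are hypothetical and do not come from any particular monitoring system:

```python
import time

# Maximum silence (seconds) before a node is flagged as malfunctioning.
# In a quasi-real-time monitor this bound is predefined, not "as fast as possible".
HEARTBEAT_TIMEOUT = 60.0

class NodeMonitor:
    """Tracks the last heartbeat received from each remote compute node."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT):
        self.timeout = timeout
        self.last_seen = {}  # node name -> timestamp of last heartbeat

    def record_heartbeat(self, node, now=None):
        self.last_seen[node] = time.time() if now is None else now

    def malfunctioning(self, now=None):
        """Nodes that were switched on (seen at least once) but have been
        silent for longer than the predefined time limit."""
        now = time.time() if now is None else now
        return [n for n, t in self.last_seen.items() if now - t > self.timeout]

monitor = NodeMonitor(timeout=60.0)
monitor.record_heartbeat("node-a", now=0.0)
monitor.record_heartbeat("node-b", now=100.0)
# At t = 120 s, node-a has been silent for 120 s > 60 s and is flagged.
print(monitor.malfunctioning(now=120.0))  # ['node-a']
```

The key point is that the check is driven by a fixed, predefined time bound rather than by raw speed, which is exactly the real-time property discussed above.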

Detailed reviews of how a monitoring service should be designed can be found in the literature (e.g., Ref. [11]). Here we restrict the discussion by stating that a publish/subscribe (pub/sub) system seems to be the best solution for monitoring services, due both to its ability to disseminate data many-to-many and to the highly distributed nature of a DS. Publishers publish data, and subscribers receive the data they are interested in. Publishers and subscribers in a pub/sub system are independent and need to know nothing about each other. The middleware not only delivers data to its destination but also exhibits higher-level functionality such as data discovery, dissemination, filtering, persistence, and reliability. A subscriber can be automatically notified when new data becomes available. Compared to a traditional centralized client/server communication model, a pub/sub system is asynchronous and is usually distributed and scalable.
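The decoupling just described can be illustrated with a minimal in-process topic-based broker. This is a sketch of the pattern only, not of any specific Grid middleware, and all names here are hypothetical:

```python
from collections import defaultdict

class Broker:
    """Minimal topic-based pub/sub: publishers and subscribers only know
    the broker and a topic name, never each other."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Many-to-many dissemination: every subscriber to the topic is notified.
        for callback in self.subscribers[topic]:
            callback(message)

broker = Broker()
received = []
broker.subscribe("cpu.load", received.append)  # subscriber A
broker.subscribe("cpu.load", received.append)  # subscriber B
broker.publish("cpu.load", {"host": "node-a", "load": 0.7})
print(len(received))  # 2: both subscribers were notified independently
```

Note that the publisher never names a subscriber; adding or removing subscribers requires no change on the publishing side, which is what makes the model scale well for monitoring.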

A variety of monitoring and discovery services exist: ZenOSS, VMware vCloud, xCAT, MonALISA, and Inca, to name a few [12]. In the following, we review the features of Inca, as successfully implemented on the TeraGrid computing platform [13].

Inca architecture and features

Inca was developed at SDSC to provide the monitoring system for the TeraGrid portal. Inca is deployed on a wide variety of production Grids, including TeraGrid, GEON, TEAM, University of California (UC Grid), ARCS, DEISA, NGS, and ZIH. Inca has also been used to monitor Open Source DataTurbine deployments on CREON and GLEON, as well as to execute and collect performance data from IPM-instrumented applications. Inca offers a variety of web status pages, from cumulative summaries to reporter execution details and result histories [see Fig. 1]. While other Grid monitoring tools provide system-level information on the utilization of Grid resources, the Inca system provides user-level Grid monitoring with periodic, automated, user-level testing of the software and services required to support Grid operation. Thus, Inca can be used by Grid operators, system administrators, and application users to identify, analyze, and troubleshoot user-level Grid failures, thereby improving Grid stability.

User-level Grid monitoring provides Grid infrastructure testing and performance measurement from a generic, impartial user's perspective. The goal of user-level monitoring is to detect and fix Grid infrastructure problems before users notice them; user complaints should not be the first indication of Grid failures. A successful user-level Grid monitoring system needs to include the following features (cf. Inca technical reports from the Inca website):


- Runs from a standard user account in order to reflect regular user experiences.

- Executes with a standard user GSI credential mapped to a standard user account when tests or performance measurements require authentication to Grid services.

- Emulates a regular user by using tests and performance measurements created and configured based on user documentation, rather than on system administrator knowledge (of hostnames, ports, pathnames, etc.). In cases where documentation and tests are developed simultaneously during pre-production, test development should be closely coordinated with the documentation as it is written.

- Centrally manages the configuration of user-level tests or performance measurements in order to ensure consistent testing across resources.

- Easily updates and maintains user-level tests and performance measurements. This is important because tests and measurements are often updated when the Grid infrastructure changes. Also, multiple iterations of test development are often required to determine whether a detected test failure stems from a faulty test, incomplete user documentation, or a failed Grid resource.

- Provides a representative indication of Grid status by testing documented user commands and individual Grid software components.

- Automates the periodic execution of user-level tests or performance measurements to understand Grid behavior over time.

- Executes locally on Grid resources to verify user-accessible Grid access points. Executes from each resource to every other resource (all-to-all) to detect site-to-site configuration errors such as authentication problems.

The Inca implementation provides these features to deliver a user-level Grid monitoring system.
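The all-to-all requirement above grows quadratically with the number of resources. As a small illustration (the resource names are hypothetical, and this is not Inca's actual scheduler), enumerating the directed site-to-site test pairs looks like:

```python
from itertools import permutations

def all_to_all_pairs(resources):
    """Every ordered (source, destination) pair of distinct resources;
    ordered, because authentication can fail in one direction only."""
    return list(permutations(resources, 2))

resources = ["sdsc", "ncsa", "psc"]
pairs = all_to_all_pairs(resources)
print(len(pairs))  # 3 resources -> 6 directed site-to-site tests
```

For n resources this yields n(n-1) directed tests, which is why measuring and tuning the execution frequency of such tests (feature 8 below in the Inca feature list) matters on large Grids.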

Inca

Features

Inca is a system that provides user-level monitoring of Grid functionality and performance. It was designed to be general, flexible, scalable, and secure, in addition to being easy to deploy and maintain. Inca benefits Grid operators who oversee the day-to-day operation of a Grid, system administrators who provide and manage resources, and users who run applications on a Grid.

The Inca system (taken from the Inca user manual [15]):

1. Collects a wide variety of user-level monitoring results (e.g., from simple test data to more complex performance benchmark output).

2. Captures the context of a test or benchmark as it executes (e.g., executable name, inputs, source host, etc.) so that system administrators have enough information to understand the result and can troubleshoot system problems without having to know the internals of Inca.

3. Eases the process of writing tests or benchmarks and deploying them into Inca installations.

4. Provides means for sharing tests and benchmarks between Inca users.

5. Easily adapts to new resources and monitoring requirements in order to facilitate maintenance of a running Inca deployment.

6. Stores and archives monitoring results (especially any error messages) in order to understand the behavior of a Grid over time. The results are available through a flexible querying interface.

7. Securely manages short-term proxies for testing of Grid services using MyProxy.

8. Measures the system impact of tests and benchmarks executing on the monitored resources in order to tune their execution frequency and reduce the impact on resources as needed.


Figure 1 shows the architecture of Inca, which incorporates three core components (highlighted boxes): the agent, the depot, and the reporter managers. The agent and the reporter managers coordinate the execution of tests and performance measurements on the Grid resources, and the depot stores and archives the results. The inputs to Inca are one or more reporter repositories, which contain user-level tests and benchmarks called reporters, and a configuration file describing how to execute them on the Grid resources. This configuration is normally created using an administration GUI tool called incat (the Inca Administration Tool). The output, i.e., the results collected from the resources, is queried by the data consumer and displayed to users. The following steps describe how an Inca administrator would deploy user-level tests and/or performance measurements to their resources.


1. The Inca administrator either writes reporters to monitor the user-level functionality and performance of their Grid or uses existing reporters in a published repository.

2. The Inca administrator creates a deployment configuration file that describes the user-level monitoring for their Grid using incat and submits it to the agent.

3. The agent fetches reporters from the reporter repository, creates a reporter manager on each resource, and sends the reporters and instructions for executing them to each reporter manager.

4. Each reporter manager executes reporters according to its schedule and sends data to the depot.

5. Data consumers display collected data by querying the depot.




A reporter is an executable program that tests or measures some aspect of the system or installed software.

A report is the output of a reporter and is an XML document complying with the reporter schema.

A suite specifies a set of reporters to execute on selected resources, their configuration, and their frequency of execution.

A reporter repository contains a collection of reporters and is available via a URL.

A reporter manager is responsible for managing the schedule and execution of reporters on a single resource.

An agent is a server that implements the configuration specified by the Inca administrator.

incat is a GUI used by the Inca administrator to control and configure the Inca deployment on a set of resources.

A depot is a server that is responsible for storing the data produced by reporters.

A data consumer is typically a web page client that queries a depot for data and displays it in a user-friendly format.
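To make the reporter/report distinction concrete, a minimal reporter might look like the following sketch. The XML layout here is purely illustrative; the real Inca reporter schema and Reporter APIs differ (see the Inca user manual [15]):

```python
import shutil
import socket
import sys
from xml.etree import ElementTree as ET

def gcc_presence_reporter():
    """Illustrative reporter: tests that a compiler is on the PATH and
    emits the result as a schematic (not schema-compliant) XML report."""
    found = shutil.which("gcc") is not None
    report = ET.Element("report")
    ET.SubElement(report, "name").text = "compiler.gcc.presence"
    ET.SubElement(report, "host").text = socket.gethostname()
    ET.SubElement(report, "completed").text = "true" if found else "false"
    return ET.tostring(report, encoding="unicode")

if __name__ == "__main__":
    # A reporter manager would capture this stdout and forward it to the depot.
    sys.stdout.write(gcc_presence_reporter())
```

The executable is the reporter; the XML string it writes is the report. Keeping the two separate is what lets the depot archive and query results without knowing how they were produced.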


Fig. 1. Inca architecture (taken from Ref. [14]).

Fig. 2. Inca provides detailed test results, useful both for users and for system administrators (taken from Ref. [14]).

Conclusions:

Inca has proved to be a highly successful monitoring service and is at the core of the TeraGrid facility. It detects Grid infrastructure problems by executing periodic, automated, user-level testing of Grid software and services. It emulates a Grid user by running under a standard user account and executing tests, thus ensuring consistent testing across resources with centralized test configuration. Inca manages and collects a large number of results through a GUI interface (incat). It measures the resource usage of tests and benchmarks to help Inca administrators balance data freshness with system impact. Data is collected by reporters, executables that measure particular aspects of the system and output the results as XML. Multiple types of data can be collected, since Inca offers a number of prewritten test scripts, called reporters, for monitoring Grid health. Reporter APIs make it easy to create new Inca tests. By storing and archiving complete monitoring results, Inca allows system administrators to debug detected failures using archived execution details. Inca offers a variety of Grid data views, from cumulative summaries to reporter execution details and result histories. Inca components communicate using SSL, making it a secure monitor for DS service testing.

References:

[1] Chenxi Huang, Peter R. Hobson, Gareth A. Taylor, Paul Kyberd, "A Study of Publish/Subscribe Systems for Real-Time Grid Monitoring," 2007 IEEE International Parallel and Distributed Processing Symposium, 2007, p. 360.

[2] http://news.cnet.com/New-life-for-Moores-Law/2009-1006_3-5672485.html

[3] Glotzer, S.C.; Panoff, B.; Lathrop, S.; "Challenges and Opportunities in Preparing Students for Petascale Computational Science and Engineering," Computing in Science & Engineering, 2009, Vol. 11, Issue 5, pp. 22-27.

[4] https://www.teragrid.org/

[5] Yi Wei, M. Brian Blake, "Service-Oriented Computing and Cloud Computing: Challenges and Opportunities," IEEE Internet Computing, vol. 14, no. 6, pp. 72-75, Nov./Dec. 2010, doi:10.1109/MIC.2010.147.

[6] Microsoft Azure: http://www.microsoft.com/windowsazure/

[7] Amazon cloud service: http://aws.amazon.com/ec2/

[8] Neil Gershenfeld, Raffi Krikorian, Danny Cohen; "The Internet of Things," Scientific American 291:44, pp. 76-81, 10/2004.

[9] G. A. Taylor, M. R. Irving, P. R. Hobson, C. Huang, P. Kyberd and R. J. Taylor, "Distributed Monitoring and Control of Future Power Systems via Grid Computing," IEEE PES General Meeting 2006, Montreal, Quebec, Canada, 18-22 June 2006.

[10] http://www.globus.org/grid_software/monitoring

[11] Andrzej Goscinski, Michael Brock; "Toward dynamic and attribute based publication, discovery and selection for cloud computing," Future Generation Computer Systems 26 (2010), pp. 947-970.

[12] http://sites.google.com/site/cloudcomputingsystem/research/monitoring

[13] http://inca.sdsc.edu/drupal/

[14] Shava Smallen, Kate Ericson, Jim Hayes, Catherine Olschanowsky; "User-level Grid Monitoring with Inca 2."

[15] http://inca.sdsc.edu/releases/2.5/guide/