MICE0266 - International Muon Ionization Cooling Experiment

Final Report



MICE Data Acquisition and Controls Review



Review committee: F. Bartlett (FNAL), J. Cobb (Oxford), T. Nicholls (RAL), E. Radicioni (INFN), P. Vande Vyvre (CERN) (Chair)




1 Introduction

The committee was asked to review the MICE Data Acquisition and Controls. The review was performed on 4 June 2009 and consisted of a set of presentations covering all aspects of the MICE Data Acquisition and Controls. This document is the report of the review committee.

The committee was given a charge which included a list of questions. The questions are answered in the first section of this document. The following sections give more detailed information about the answers and report on findings that the committee identified as important issues to be communicated to the MICE collaboration.

2 General comments

The committee would like to thank the participants of the review for their contributions during the presentations and in particular for the open and receptive manner in which they engaged with the discussions and questions throughout the review. As reviewers we certainly learnt a great deal about the MICE online systems and we hope that the discussion and our report are of benefit to the future success of the experiment.

It is clear that a significant amount of progress has already been achieved by a small, committed team, for which the MICE online group is to be commended. However, there are critical areas, most notably the control systems, where manpower is extremely limited, and we would urge the management of the collaboration to identify additional effort as a matter of priority.

The review material, both as presented on the day and as made available in advance, was limited in some areas, which made it difficult to answer a number of the specific questions listed in the charge given to the review. In particular, an overall view of the architecture and inter-relationship of the various online systems was lacking; it seems that the boundaries and interfaces between the control and DAQ systems have evolved organically and are imprecisely defined. In addition, no detailed technical specifications for the various systems were provided, so it is not possible to judge whether the systems as presented meet the design specification. However, the data acquisition requirements document (MICE-Note 222) does specify the overall performance requirements of the system, and these were addressed during the presentations.


3 Answers to the charge questions


1. Are the technical specifications of the systems under review clearly stated and documented?

Not really for the highest level of interaction between the Data Acquisition and the Control systems. Although many detailed presentations on specific topics were given, a presentation on the global architecture of the overall MICE online systems, with clear indications of the control flows, was missing.


2. Does the baseline design meet the experiment's objectives for all Steps of MICE?

The system as designed is well adapted to the present needs of the experiment during a phase that can be characterised as test and commissioning. The committee considers that it should evolve for the production phase of the experiment. This evolution should aim at making the MICE online system more reliable and at streamlining its global control architecture.


3. Will the systems be fully functional for Step III?

The system will most probably be ready for Step III, contingent on the addition of some manpower to the control system and the addition of some redundant hardware to avoid single points of failure.


Specifically, the committee should consider:


1. Does the committee see any potential problems with the front-end electronics systems in the experiment?

The committee is concerned with the potential issue of synchronizing the electronics relative to the T0 time of the interaction, and with the de-synchronization between the readout in different VME crates, which has already been seen to occur.

2. Does the committee see any potential problems with the electronics used for the Controls system interfaces?

Limited information was provided about the electronics used for the Controls system interfaces. The committee recommends adding a few instances of a general interface providing spare inputs to control the additional devices that will most probably be added during the natural evolution of the control system.


3. Is the overall DAQ architecture appropriate for the experiment and the expected data rates?

The committee considers that the overall DAQ architecture is not completely appropriate for the coming phase of the experiment. A substantial benefit (in terms of reliability and processing performance) could be obtained by using the parallelism built into the DATE package at the level of event building.


4. Is the overall Controls architecture appropriate for the experiment and will it be capable of providing a transparent and efficient interface to the various components of the MICE experiment?

The committee considers that the overall Controls architecture is appropriate for the experiment. This architecture could be improved by adding an explicit overall control of the whole experiment, for example by implementing a finite-state machine modelling the global behaviour.


5. Can the Controls system easily accommodate the numerous configuration changes anticipated for MICE?



Technically the Controls system can accommodate the numerous configuration changes anticipated for MICE, but the manpower does not seem to be available to implement those changes. Access to the configuration database, supported by a software expert, will be required.


6. Will the DAQ be able to handle the desired data rate of 600 recorded muon events per spill, or an instantaneous rate of 600 kHz?

The MICE DAQ will be able to handle the desired rate of 600 recorded muon events per spill. Given the architecture selected, using one super-event per spill, the DAQ will not be directly confronted with the instantaneous rate of 600 kHz.
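
As a rough illustration using only the numbers quoted above (and assuming that all 600 events arrive during the active part of the spill at the quoted instantaneous rate), the implied active spill length is

$$ T_{\text{active}} \approx \frac{600\ \text{events}}{600\ \text{kHz}} = 1\ \text{ms}, $$

so the event fragments are concentrated into roughly a millisecond, while the event builder only has to assemble and ship one super-event per spill, at the spill repetition rate.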


7. Is the particle Trigger of an appropriate design for the application? Is it likely to attain the efficiency and purity specifications of the experiment?

Limited information was provided regarding the implementation of the physics trigger. It is therefore not possible for the committee to make any evaluation of the expected efficiency and purity.

The committee had a small concern about the pile-up of events given the RF structure of ISIS.

8. Is the scheme for synchronizing the data readout with the trigger system and ISIS beam structure well conceived?

The synchronization scheme between MICE and the ISIS beam seems to be well conceived.


9. Was the choice of the DATE Framework for the online system appropriate for the application and does it meet the needs of the experiment?

The choice of the DATE framework is appropriate for the MICE application. The system seems to be under-exploited and its use could be extended to improve the reliability and the online processing.


10. Is the interface between the DAQ and the Control system well thought out?

The committee considers that the interface between the DAQ and the Controls system is not well thought out. This interface requires a systematic approach to identify all the control flows inside the experiment and to implement these flows with a top-down approach.


11. Is the proposed Online Monitoring application adequate?

The committee questions two aspects of the online monitoring application. First, it would seem appropriate to develop and maintain inside MICE a single library for decoding and processing the data, to be used both online and offline. Second, given the relatively limited time needed to reconstruct the data, it would seem appropriate, and technically and financially feasible, to reconstruct online a significant fraction (if not all) of the data.
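
A minimal sketch of what such a shared decoding layer might look like is given below; the module name, fragment layout and field names are purely illustrative assumptions, since the actual MICE raw-data format was not described to the committee.

```python
# shared_decoder.py -- illustrative sketch of a single decoding library used
# by both the online monitoring and the offline reconstruction.
# The fragment layout below is invented for illustration only.

import struct
from dataclasses import dataclass
from typing import List

@dataclass
class Hit:
    channel: int
    tdc: int    # raw TDC count
    adc: int    # raw ADC count

def decode_fragment(raw: bytes) -> List[Hit]:
    """Unpack one detector fragment into a list of hits.

    Assumed (hypothetical) layout: a 32-bit little-endian hit count followed
    by triplets of 32-bit words (channel, tdc, adc)."""
    (n_hits,) = struct.unpack_from("<I", raw, 0)
    hits, offset = [], 4
    for _ in range(n_hits):
        channel, tdc, adc = struct.unpack_from("<III", raw, offset)
        hits.append(Hit(channel, tdc, adc))
        offset += 12
    return hits

# Online monitoring and offline reconstruction would both import
# decode_fragment(), so any change to the raw format is made in one place.
```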

4 Front-end electronics (FEE) interface, trigger and timing

These elements of the system were presented in depth during the first presentation. An overview of the FEE and trigger architecture, in the form of a top-level architecture and trigger logic chart, would have been useful to set the overall context of the system. Nevertheless, there was sufficient information presented to make an informed impression of the system.

The choice of front-end electronics and readout components seems generally well-matched to the requirements of the experiment. The particular components (TDC, flash ADC boards) are appropriate, although their performance is not necessarily well characterised and it would be worth quantifying these.

For instance, the Particle Trigger is timed on the TDCs
and it is used as a time reference for the experiment. Given its importance, the linearity of the TDCs should be checked experimentally at regular intervals.

The tracker is read out using the custom electronics chain derived from D0 (AFE2-t / VLSB). This appears to be correctly interfaced to the DAQ system. A minor concern is that the setup and control of the front-end boards is achieved via a MIL1553 bus integrated into the control system; this is an example of where the interface and operation of the DAQ and control systems are tightly coupled.

Management of event pile-up is achieved by offline analysis and it was not clear that its impact on trigger efficiency, or on the ability to resolve multiple particles, can be fully characterised by this method. This may be relevant since the particle trigger rate will be dependent on the position in the spill due to the trajectory of the target through the beam. Based on Poisson statistics, pile-up has been estimated by the experts at 10-15% for 500 muons per spill; in practice the intensity may well be significantly lower, mitigating any impact. We would recommend that this be monitored carefully.
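
As a back-of-the-envelope illustration (assuming a simple model in which the number of muons in a given RF bucket is Poisson distributed with mean $\mu$; this is the committee's reading of the estimate, not the experts' actual calculation), the fraction of triggers accompanied by at least one additional muon in the same bucket is

$$ P_{\text{pile-up}} = 1 - e^{-\mu}, $$

so the quoted 10-15% corresponds to a mean bucket occupancy of roughly $\mu \approx 0.11$ to $0.16$ at 500 muons per spill.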

The choice of trigger architecture and interface to the machine timing has been carefully thought out and is appropriate for the demands of the experiment. The beam structure is such that the usual handling of individual events in DATE cannot be used. For this reason, free-running clocks in the front-end digitizers are used to associate event fragments at a later analysis stage. The particle trigger concept, with a single DAQ trigger at the end of each spill, should achieve the necessary acquisition rate. As an added precaution we suggest, if possible, resetting the clocks at each start-of-spill. Given that this point is very delicate, we also suggest that in the presence of any inconsistency the complete spill be discarded.
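
A minimal sketch of the kind of consistency check this suggestion implies is given below; the fragment fields, tolerance, and spill-number convention are illustrative assumptions, not the actual MICE readout code.

```python
# Illustrative sketch: associate the fragments of one spill from several VME
# crates using their free-running clock counters, and discard the complete
# spill if the crates are found to be inconsistent.
# Field names and the tolerance are assumptions for illustration only.

from dataclasses import dataclass
from typing import List, Optional

CLOCK_TOLERANCE = 2   # maximum allowed counter spread between crates (illustrative)

@dataclass
class CrateFragment:
    crate_id: int
    spill_number: int       # spill counter latched by the crate at start-of-spill
    clock_at_trigger: int   # free-running clock value latched at the DAQ trigger

def build_super_event(fragments: List[CrateFragment]) -> Optional[List[CrateFragment]]:
    """Return the assembled super-event, or None if the spill must be discarded."""
    if not fragments:
        return None
    # All crates must report the same spill number...
    if len({f.spill_number for f in fragments}) != 1:
        return None   # inconsistency: discard the complete spill
    # ...and their free-running clocks must agree to within the tolerance.
    clocks = [f.clock_at_trigger for f in fragments]
    if max(clocks) - min(clocks) > CLOCK_TOLERANCE:
        return None   # de-synchronization between crates: discard the spill
    return fragments
```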

It is clear that considerable attention has been paid to characterising the trigger timing relative to the beam and to addressing the demands of early commissioning. Out-of-spill calibration triggers have been designed into the system and should prove capable of providing the necessary information.

The trigger and clock logic is implemented and distributed in a “classical” fashion with discrete NIM/CAMAC/VME components. This approach is adequate, but a possible area of concern is that it could prove difficult to scale the clock and trigger distribution to accommodate additional future front-end subsystems should they arise. In comparison, it would prove relatively straightforward to scale the DAQ in the same situation by adding additional VME/LDC branches.

No statement was made on the design trigger efficiency or purity, so it is not possible for the review to answer whether the system as designed will achieve the required performance. Trigger efficiency will require detailed analysis with minimum-bias triggers, which appears possible with the architecture presented. Purity could only be established in conjunction with a detailed end-to-end simulation of the system, and it was not clear if this is planned or necessary.

5 Data acquisition and dataflow

The measured performance in terms of data rate indicates that, as expected, the limiting element is the VME bus; the actual numbers (25 MB/s per LDC) are in line with the expected performance.

There is a general lack of redundancy in the GDC and storage part. There are already plans to add one GDC to the chain; care should be taken to make sure the system is able to sustain the load (all tasks included: building, monitoring, data transfer to offline) even when one GDC fails.


There was considerable discussion about the impact of the near-online dataflow (for both reconstruction and data archival) on the performance of the DAQ. While the data rate is not particularly high (30 MB/s), consideration must be given to the impact on the DAQ of archiving data files into the Tier-1 storage systems. In particular, attention should be given to the fact that with a single RAID file system, concurrent read and write accesses could lead to contention.

The disk system is presently implemented with one RAID-6 volume on the GDC. This approach protects against single and double disk failures, but it is not adequate to sustain data migration to offline when operating the DAQ continuously. More than one volume is needed, in order to decouple the data writing of the GDC from the reading done by the process migrating data to the offline storage.
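
One common way of realising this decoupling is sketched below, purely as an illustration; the paths, the "closed file" convention and the archive interface are assumptions, not the MICE layout.

```python
# Illustrative sketch: decouple GDC writing from offline migration by using
# two separate disk volumes. Paths and the ".closed" convention are
# assumptions for illustration only.

import shutil
from pathlib import Path

WRITE_VOLUME = Path("/data/gdc-write")    # volume the GDC writes to (hypothetical)
EXPORT_VOLUME = Path("/data/gdc-export")  # second volume read by the migration process

def hand_over_closed_files() -> None:
    """Move files the DAQ has finished writing onto the export volume.

    Here a file is considered closed when it carries a ".closed" suffix;
    the real criterion would come from the run control."""
    for closed in WRITE_VOLUME.glob("*.closed"):
        shutil.move(str(closed), str(EXPORT_VOLUME / closed.name))

def migrate_to_offline(archive) -> None:
    """Read only from the export volume, so migration I/O does not compete
    with the GDC's writes on the same spindles."""
    for data_file in sorted(EXPORT_VOLUME.glob("*.closed")):
        archive.store(data_file)   # e.g. transfer to Tier-1 storage (hypothetical API)
        data_file.unlink()
```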

6 Online monitoring and reconstruction

Good progress has been made in the online monitoring and reconstruction. In the case of the latter, there has not yet been any development of the event distribution scheme for passing events from the DAQ for reconstruction. We suggest considering the use of the DATE framework's capability to distribute events to a large number of GDCs and, at the same time, to provide data to processes running online reconstruction via the monitoring library. The actual number of GDC computers can be right-sized as the need arises or as the experiment evolves.

This approach could provide the possibility of adjusting the number of GDC pipelines to the needed processing power, in order to obtain a complete analysis of the run without any delay. As an added bonus, it could also provide redundancy in a very natural way.
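
A minimal sketch of the idea follows; it is generic Python rather than the DATE API, and the round-robin fan-out, queue-based hand-off and reconstruction hook are illustrative assumptions.

```python
# Illustrative sketch: fan spills out over several GDC pipelines, each of
# which builds its super-event and also feeds an online-reconstruction
# consumer. Generic Python, not the DATE API; names are illustrative.

from itertools import cycle
from queue import Queue
from threading import Thread

N_GDC = 3                                   # number of GDC pipelines (adjustable)
gdc_queues = [Queue() for _ in range(N_GDC)]

def dispatch(spills) -> None:
    """Send each incoming spill to the next GDC pipeline in turn."""
    for spill, q in zip(spills, cycle(gdc_queues)):
        q.put(spill)

def gdc_pipeline(q: Queue, reconstruct) -> None:
    """One GDC: assemble the super-event, store it, and pass a copy to the
    online reconstruction via the monitoring path."""
    while True:
        spill = q.get()
        if spill is None:                   # shutdown marker
            break
        super_event = b"".join(spill)       # stand-in for the event-building step
        # write_to_disk(super_event)        # storage step omitted in this sketch
        reconstruct(super_event)            # online reconstruction / monitoring hook

# Example wiring: one worker thread per GDC with a do-nothing "reconstruction".
workers = [Thread(target=gdc_pipeline, args=(q, lambda ev: None), daemon=True)
           for q in gdc_queues]
for w in workers:
    w.start()
```

Adding a pipeline (or letting one drop out) then only changes the number of queues and workers, which is the right-sizing and redundancy referred to above.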

The online reconstruction serves multiple purposes:

- to monitor the detectors;
- to monitor the data quality;
- to provide diagnostics for the muon beam; and
- to provide diagnostics for the cooling section.


The online reconstruction is well developed and can already deliver (e.g.) detector channel histograms and decoded hit maps for several of the detectors. It is foreseen that it will also be able to produce quantities and distributions of physics interest, such as emittance and amplitude, in near-to-real-time. Other quantities of interest (for example the beam momentum spectrum and emittance determination with the TOF counters), for which the analysis is being developed offline, should be incorporated when available. The online reconstruction group should consult the analysis group to avoid possible duplication of effort. Similarly, the beamline group should be consulted to ensure that all the information required to tune the beam is available in a convenient form (e.g. hits in spatial, rather than detector, coordinates). Consideration should be given as to whether there are any specific cooling-section diagnostics which can and should be developed.

Fast reconstruction is essential to MICE since, in contrast to many experiments, it will make measurements for many different operating ‘modes’ of the cooling section. It is possible that some of the parameters of the channel may have to be altered in real time, and feedback from the reconstruction is essential. It seems possible that, with some increase in processing power, 100% of the data could be reconstructed online.


7 Controls

7.1 Introduction

Overall monitoring and control of MICE will be accomplished with EPICS. The higher-level technicalities of the system were presented. The overall system seems well thought out, although several sub-systems were not yet covered in terms of hardware or manpower. The ‘Uber-GUI’ should be developed to simplify access to information for the physicist user.

Both of the controls development groups, the MICE collaboration group and the Daresbury group, appear knowledgeable about general control system design principles and the use of EPICS, upon which the MICE control/monitoring system is based. The controls/monitoring architecture for the experiment appears well designed, adequate for the needs of the experiment, and scalable should future expansion be required. Further, the controls staff has incorporated existing, freely available components where available. Noteworthy among these selections are:

- The EPICS system and its Channel Access (CA) protocol as the basis for the control/monitoring system
- MEDM for the majority of the operator displays
- Gateway nodes to isolate remote client programs and to minimize the load on the controls network and CA server nodes
7.2 Staffing and responsibilities

The MICE controls group is significantly understaffed considering the scale of the effort required to complete the project. The controls are being developed by only two people, with effort effectively bought from Daresbury laboratory. The committee felt this was an inadequate level of effort. In particular, the following events are straining, or will strain, the already marginal staffing of the MICE controls/monitoring group:

- The possible departure of Pierrick Hanlet in the near future
- The transfer of responsibility for further development and maintenance of Daresbury group products to the MICE collaboration

The collaboration should promptly add one more person to the MICE controls effort; considering the schedule, this person must already be familiar with the implementation of EPICS-based control systems.

This critical staffing situation is exacerbated by the recent funding situation and an apparent disconnection between the team at Daresbury and the “MICE local community” [sic] efforts on Control and Monitoring. It was not apparent who had overall responsibility for the control system. This was clearly evidenced by the separate presentations and the level of working discussion between the two groups during the course of the review.

No details were given of the levels to which some of the parameters critical to the stated physics goals of MICE (a 1/1000 measurement of emittance), such as magnetic fields, absorber density, and RF phases and voltages, would be monitored. Although perhaps these details are beyond the scope of this review, a written specification for these would be desirable.

The committee felt that in view of the possible risks (principally exceeding the safe loading of the cold-mass supports) the controls for all the superconducting magnets, including the spectrometer solenoids, should be a separate critical sub-system where only pre-defined sets of currents are permissible; other configurations would require the evaluation of the magnetic loads, including the possibility of the quenching of one or more
coils, before being permitted. Such a sub-system should be developed in consultation with the cooling channel experts and the magnet coordinator.

There are also technical compatibility issues that need to be addressed (EPICS versions for VxWorks versus Linux, development environment licensing, provision of infrastructure services). There are a number of single points of failure in the system, notably the controls system server (miceecserv); not providing redundancy may prove to be false economy.

7.3 Experiment/Controls Interface

MICE has very specific needs in terms of DCS, and interaction between the DCS and DAQ systems is foreseen at several levels, but the nature of the interface between the data acquisition and the control/monitoring systems is unclear. The present approach of inserting explicit code fragments in the DAQ to communicate to/from the DCS will not scale to the needed complexity, and it adds interdependencies between the state machines of the two systems which cannot be easily modelled and managed.

To avoid unforeseen behaviours, it is recommended that a formal, unified state machine be used to coordinate the various components of the online and control/monitoring subsystems. The control system must be able to read this state and other DAQ system parameters that relate to controls actions.
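
A minimal sketch of what such a unified state machine might look like is given below; the states, transitions and publication hook are illustrative assumptions rather than a MICE design.

```python
# Illustrative sketch of a unified run-control state machine whose state is
# published so that the control/monitoring system can read it.
# States, transitions and the publish() hook are assumptions for illustration.

from enum import Enum, auto

class RunState(Enum):
    IDLE = auto()
    CONFIGURED = auto()
    RUNNING = auto()
    PAUSED = auto()
    ERROR = auto()

# Allowed transitions: command -> (required current state, resulting state)
TRANSITIONS = {
    "configure": (RunState.IDLE,       RunState.CONFIGURED),
    "start":     (RunState.CONFIGURED, RunState.RUNNING),
    "pause":     (RunState.RUNNING,    RunState.PAUSED),
    "resume":    (RunState.PAUSED,     RunState.RUNNING),
    "stop":      (RunState.RUNNING,    RunState.CONFIGURED),
    "reset":     (RunState.ERROR,      RunState.IDLE),
}

class RunControl:
    def __init__(self, publish):
        self.state = RunState.IDLE
        self.publish = publish            # e.g. write the state to a controls variable
        self.publish(self.state.name)

    def handle(self, command: str) -> RunState:
        """Apply a command; reject anything not allowed from the current state."""
        required, target = TRANSITIONS.get(command, (None, None))
        if required is not self.state:
            raise RuntimeError(f"'{command}' not allowed in state {self.state.name}")
        self.state = target
        self.publish(self.state.name)     # DCS and DAQ both see one authoritative state
        return self.state

# Example: rc = RunControl(publish=print); rc.handle("configure"); rc.handle("start")
```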

It is also possible that the control/monitoring system may need to sense the state of the accelerator, and this will require a gateway or data-exchange link between the two control systems.

7.4 General Purpose Device Interface

A general-purpose analog/digital interface module should be selected and interfaced to the EPICS system. The module should provide several (> 10) analog input channels, several words (16 bits) of digital input, and several words (16 bits) of digital output. Experience shows that experiments often require the addition of unanticipated pieces of electronic equipment at short notice. The EPICS community should be polled for the existence of such a module that has already been integrated into EPICS.

7.5 Single-Point Failures

The miceecserv node appears to be a single-point failure component for the control/monitor system. A second, fallback computer node (not necessarily of the same capacity) should be provided for the programs that run on this node.

Operation of the elog system requires an active network connection to the Daresbury laboratory. The possibility of running the remote component of elog on a computer at RAL should be investigated.

The control/monitoring system should be reviewed by the controls staff for additional single-point failure components.

7.6 Application Development Environment

The application development environment was not discussed during the review; however, it is an important issue. The only API (Application Programming Interface) mentioned for the controls/monitoring system was for the C and C++ languages. Unless the collaboration has sufficient existing C and C++ expertise to build the required online tasks, consideration should be given to the use of one of the object-oriented interpretive languages (Java, Python, Ruby ...) for high-level online applications that interact with the DAQ and
controls/monitoring subsystems. The principal advantages of using one of these scripting languages (a short illustration follows this list) are:

- Compared with compiled languages, the program development cycle is shorter
- Because of the interactive execution mode, program testing proceeds more rapidly
- Most have extensive support libraries that provide interfaces to standard protocols, graphics packages, etc.
- The object-oriented language features are more intuitive and less complex than those of C++
- Since skill in the use of these languages is more easily acquired than for C and C++, members of the collaboration who otherwise might not participate in program development for the online system will be able to write applications for their sub-systems
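
As an illustration of how compact such a high-level application can be, the sketch below uses the pyepics Python binding for Channel Access; pyepics is offered only as one example of a scripting interface, and the PV names are hypothetical placeholders.

```python
# Illustrative high-level online application using the pyepics Channel Access
# binding. The process-variable (PV) names are hypothetical placeholders.

import time
import epics

# Read a setting and write a new value via Channel Access.
depth = epics.caget("MICE:ABS:TARGET_DEPTH")        # hypothetical PV
epics.caput("MICE:ABS:TARGET_DEPTH", 12.5)          # hypothetical set point

# React to changes of a monitored value with a callback.
def on_change(pvname=None, value=None, **kwargs):
    print(f"{pvname} changed to {value}")

run_state = epics.PV("MICE:DAQ:RUN_STATE")          # hypothetical PV
run_state.add_callback(on_change)

time.sleep(10)   # keep the script alive long enough to receive a few updates
```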

8 Configuration database

A bi-temporal scheme for the configuration database was presented. Addition of entries into the database is labelled with a “valid” time and a “transaction” time, which was presented as allowing multiple valid calibrations to exist for a given period and to be selected as appropriate. The committee had some concerns that the structure of the database would not easily provide access to online distributions taken at some arbitrary date in the past for comparison with current data.
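
A minimal sketch of the bi-temporal selection rule as the committee understood it is given below; the record layout and query are illustrative only, not the actual MICE schema. Each entry carries a valid-time interval and a transaction time, and a lookup asks which calibration was valid at a given time according to what the database knew as of a chosen transaction time.

```python
# Illustrative sketch of bi-temporal selection: pick the calibration valid at
# `valid_at`, as recorded by the latest transaction not after `as_of`.
# The record layout is an assumption for illustration, not the MICE schema.

from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional

@dataclass
class CalibrationEntry:
    payload: dict            # the calibration constants themselves
    valid_from: datetime     # start of the validity interval
    valid_to: datetime       # end of the validity interval
    transaction: datetime    # when this entry was inserted into the database

def lookup(entries: List[CalibrationEntry],
           valid_at: datetime,
           as_of: datetime) -> Optional[CalibrationEntry]:
    """Return the entry valid at `valid_at`, as known at transaction time `as_of`."""
    candidates = [e for e in entries
                  if e.valid_from <= valid_at < e.valid_to    # valid-time match
                  and e.transaction <= as_of]                 # ignore later insertions
    if not candidates:
        return None
    # Among competing calibrations for the same period, take the most recently recorded.
    return max(candidates, key=lambda e: e.transaction)
```

Pinning as_of to the present returns the current best calibration, while pinning it to an earlier date reproduces what was in force at that time, which is the kind of retrospective access the concern above refers to.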

It was not clear from the presentation, either, that this mechanism allows one to capture multiple possible configurations for the online systems (for instance, a setup for normal beam data-taking versus special calibration modes), to choose between them as necessary at run time, and then retrospectively to determine which was in force.

The scheme for a database API was presented. Here the term API is overloaded, being used to refer both to an API proper and to the run-time interface to the database. It appears to be a partial implementation of a model-view-controller architecture on top of the configuration database. A major concern here is that the proposed architecture allows multiple concurrent offline accesses to the configuration database, which is a critical element of the online system. As such, its performance must not be allowed to degrade under the load of external connections. No authentication, access control or throttling was presented in the design, which presents significant performance and security risks. One example is where multiple offline reconstruction jobs, distributed across the grid for instance, could effectively mount an inadvertent denial-of-service attack against the online systems by overloading the database. Replication of the database to an additional copy accessible externally would mitigate the risk, but the appropriate security measures need to be designed into the system as a priority.