CHEP2000 9-Feb 2000

A comparison of data analysis packages

A Comparison of Data Analysis Packages

Irwin Gaines, Jeff Kallenbach

Fermilab


Outline


Introduction: a little history


Build vs. Buy: general considerations


User Requirements


Basic Features


Advanced features


Conclusions


Introduction


Previous generation HEP experiments have used a ubiquitous homemade product: PAW


Why? Commercial systems did not offer either functionality or, more important, performance


Use of a universal product allows:


data sharing (ntuple files)


procedure and environment sharing (kumac files)


Build vs. Buy


Old days (70’s-80’s): in-house development effort “free”, any software purchase is expensive


More recently (90’s): attractive licensing terms; development costs should be amortized over as large a user
base as possible. Support?


Now: Consider full product lifetime costs, including
development, licensing, support. Does product need to be
customized or enhanced to meet HEP needs?

[Figure: build vs. buy]


Project Scope


Selecting events based on programmed selection criteria


Preparing various statistical distributions of various mathematical functions of
data in the selected events


Linking in high level language programs to process event data prior to plotting


Modifying selection criteria and plotted functions interactively


Fitting the distributions


Comparing and performing calculations on different distributions


Preserving selection criteria and functions for later use or to pass to others


Saving samples of events in a variety of specialized formats for later analysis


Accessing these specially formatted event samples to make plots, fits,
statistical outputs, etc.


User Requirements


Web reference:
http://www.fnal.gov/projects/runii/pasrec/


Data Access


Data Analysis


Data Presentation


Usability


Support and Maintenance


User Requirements: Data Access


Access rates (online)


Access rates (offline)


Serial vs. random access


Granularity of access


Foreign I/O Formats


Specialized optimized output formats


User Requirements: Data Analysis


Scripting language


User control


Data selection


Input/Output


Numerical and mathematical functionality


Offline compatibility


Prototyping


User Requirements: Data Presentation


Interactive visualization


Presentation quality graphical output


Formal publication graphical output


User Requirements: Usability


Batch vs. interactive


Sharing data structures


Shared access by several clients


Parallel processing (using distinct data streams)


Debugging and profiling


Modularity (user code)


Modularity (system code)


Access to source code


Robustness


Web based documentation


Use of standards


Portability


Scalability


Performance


User Friendliness



User Requirements: Support


Maturity


customer base


product lifetime


product survivability


product support


licensing



User Requirements: Maintenance


who provides maintenance


what does it cost


maintenance infrastructure


maturity and completeness


modularity


portability


standards


reliability and security


application specific issues


Main Contenders


Homemade package: ROOT


Commercial package: IDL (other commercial packages offer similar features; IDL appeared to
be most aggressive in licensing terms)



Basic Features


plotting


fitting


event selection


command languages


event I/O


Gee Whiz plots


Plots, Fits, Event selection


ROOT: from browser, from tree viewer, from command line


All plots are active, can be manipulated, saved for later use, printed in a variety of formats


IDL: command line examples on following slides


plots can be either static or active, displayed or printed


Displaying a Histogram


Display a histogram


The Canvas

Open a ROOT file

Browse the file


Fitting, Coloring, and Zooming


Adding a Gaussian fit


Coloring the histogram


Zooming


The Tree Viewer

Tree Viewer buttons:


Variables


Slider


XYZ


Draw, Scan, Break


Ilist, Olist


Gopt


Weight




Scripting language


ROOT


CINT C++ interpreter (almost full C++ syntax)


commands are methods of ROOT classes


full access to compiled code (in any language)


IDL


“natural” control language (see examples)


commands are part of scripting syntax


full access to compiled code (in any language)



IDL command language

; concatenate several files of ntuples
chain=["d3_51.nhis","d3_68.nhis","d3_99.nhis","d3_19.nhis","d3_04.nhis"]

; read in a variable
mass=htGetVar(chain,"Rmass")

; event selection (cut on several variables)
cut4=where(lsig gt 5 and iso1 lt .05 and clsec gt .05 and iso2 lt .03)

; plot histogram
plot,histogram(mass(cut4),binsize=mybin)
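
For readers who do not know IDL, the same four steps (concatenate, read a variable, cut, histogram) can be sketched in plain Python. This is only an illustrative analogue and not part of ht2IDL: the helper names `concatenate`, `select_events` and `histogram`, the dictionary-of-columns layout, the bin size and all sample values are assumptions. Only the variable names and cut thresholds are taken from the slide.

```python
# Illustrative analogue of the IDL session; data layout, helper names and
# sample values are hypothetical. Only the cut thresholds come from the slide.

def concatenate(files):
    """Merge several column-wise ntuples (dicts of equal-length lists)."""
    merged = {}
    for ntuple in files:
        for name, values in ntuple.items():
            merged.setdefault(name, []).extend(values)
    return merged


def select_events(nt):
    """Return the indices of events passing the cut, like IDL's where()."""
    return [
        i for i in range(len(nt["Rmass"]))
        if nt["lsig"][i] > 5 and nt["iso1"][i] < 0.05
        and nt["clsec"][i] > 0.05 and nt["iso2"][i] < 0.03
    ]


def histogram(values, binsize, lo):
    """Count values into fixed-width bins starting at lo."""
    inside = [v for v in values if v >= lo]
    if not inside:
        return []
    nbins = int((max(inside) - lo) // binsize) + 1
    counts = [0] * nbins
    for v in inside:
        counts[int((v - lo) // binsize)] += 1
    return counts


# Two made-up "files", each a small column-wise ntuple.
file_a = {"Rmass": [1.82, 1.87, 1.97], "lsig": [6.0, 4.0, 7.5],
          "iso1": [0.01, 0.01, 0.02], "clsec": [0.2, 0.3, 0.1],
          "iso2": [0.01, 0.01, 0.02]}
file_b = {"Rmass": [1.86, 1.88], "lsig": [9.0, 8.0],
          "iso1": [0.04, 0.10], "clsec": [0.06, 0.5],
          "iso2": [0.02, 0.01]}

chain = concatenate([file_a, file_b])
cut4 = select_events(chain)
mass = [chain["Rmass"][i] for i in cut4]
dist = histogram(mass, binsize=0.05, lo=1.7)
```

The index list `cut4` plays the same role as the result of `where` in the IDL version: the cut is computed once and can then be applied to any column.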


IDL Command Language


Fit plot and draw fit


plot -> liveplot for interactive plots

dist = histogram(mass(cut4),binsize=mybin)

x=findgen(134)*mybin+1.7

dfit=gaussfit(x,dist,a)

plot,x,dist

oplot,x,dfit,color=20
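
As a rough illustration of what the fit step produces, the sketch below estimates Gaussian parameters from a binned distribution in plain Python. It uses a simple moment estimate (weighted mean and standard deviation) rather than the least-squares fit that IDL's `gaussfit` performs, so it is a stand-in for the idea and not an equivalent. The function names, the bin width `mybin = 0.0025` and the synthetic peak are assumptions; the slide never states the value of `mybin`.

```python
import math


def gaussian(x, height, mean, sigma):
    """Value of a Gaussian curve with the given peak height at x."""
    return height * math.exp(-0.5 * ((x - mean) / sigma) ** 2)


def moment_estimate(xs, counts):
    """Estimate (height, mean, sigma) from bin positions and bin contents.

    This is a moment-based estimate, not a least-squares fit; it is only
    reasonable for a single clean peak with no background.
    """
    total = sum(counts)
    mean = sum(x * c for x, c in zip(xs, counts)) / total
    var = sum(c * (x - mean) ** 2 for x, c in zip(xs, counts)) / total
    sigma = math.sqrt(var)
    binsize = xs[1] - xs[0]
    # Area under the curve equals total * binsize for equally spaced bins.
    height = total * binsize / (sigma * math.sqrt(2.0 * math.pi))
    return height, mean, sigma


# Same x axis as the slide: 134 bins starting at 1.7 (mybin is assumed).
mybin = 0.0025
x = [1.7 + i * mybin for i in range(134)]

# Synthetic distribution: a clean peak with made-up parameters.
dist = [gaussian(xi, 100.0, 1.865, 0.02) for xi in x]

height, mean, sigma = moment_estimate(x, dist)
dfit = [gaussian(xi, height, mean, sigma) for xi in x]
```

On real data with background under the peak, a moment estimate is biased and a proper least-squares or likelihood fit is needed; the sketch only shows how the fitted curve `dfit` relates to `x` and `dist`.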



Reading ntuples with IDL

ht2IDL: An Interface between HEP Data Files and IDL

As part of our investigation of the Interactive Data Language (IDL) for use in our environment, we have
assembled a prototype of what we call ht2IDL (for "hepTuple to IDL"). This is a small package of C++ code
and IDL procedure files that enables the user to access HEP data stores, such as HBOOK files, from the
IDL session. It uses the HepTuple package from PAT.


How the package works

Like most modern tools, IDL provides the capability to interface with external functions written by the user.
This is accomplished by writing some code, using a C-based interface, then compiling it and linking it into a
shared-object file. Then, by creating some simple helper files for IDL, and starting IDL from the correct
directory, where all of the new interface code lies, the user has access to all of the new functionality provided
by the written code and the IDL "External Interface". In our prototype, this was all accomplished on an SGI/IRIX
system. To achieve maximum compatibility with the RunII environment, it was decided to
use KCC. In principle there is no reason it should not work with CC or g++. Then, referring to the IDL
External Developers' Guide, we wrote some code which uses the HepTuple library to read HBOOK files, load
the data into data structures compatible with IDL, and then return them to the IDL session. We have written a
prototype that provides an interface to the HBOOK files (using HepTuple), makefiles and some documentation on
how to use them, and sample IDL scripts (called "procedure" files) to invoke the ht2IDL functions and
display and manipulate the results.



http://patwww.fnal.gov/pas/idl/ht2idl.html


Support Features


Commercial products have excellent documentation, generally good support, but


you pay for it


hard to customize, usually don’t get source


Homemade products moving to free software support model (support by community)


can modify source to enhance or customize


relatively easy to use others’ code


Both require a local support organization


ROOT How To’s


Advanced Features


Optimized I/O and very large data samples


Using native user objects


Customized GUIs


Accessing over web


Optimized I/O


Two separate issues:


data in memory vs. data on disk (efficient disk access necessary for large data files)


can’t improve on disk speed unless objects that are read together are next to each other on disk
(column-wise n-tuple and generalizations)
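
The point about keeping together what is read together can be made concrete with a small sketch. The Python below writes the same events row-wise and column-wise and then reads a single variable from each file, counting the bytes touched. It is a toy model under stated assumptions (four double-precision variables per event, hypothetical function names, a flat binary layout) and does not reproduce the actual HBOOK or ROOT file formats.

```python
import os
import struct
import tempfile
from array import array

N_VARS = 4  # assumption: four double-precision variables per event
ROW = struct.Struct(f"={N_VARS}d")


def write_row_wise(path, events):
    """One record per event: all variables of an event are adjacent."""
    with open(path, "wb") as f:
        for ev in events:
            f.write(ROW.pack(*ev))


def write_column_wise(path, events):
    """One block per variable: all values of a variable are adjacent."""
    with open(path, "wb") as f:
        for j in range(N_VARS):
            array("d", (ev[j] for ev in events)).tofile(f)


def read_var_row_wise(path, j, n_events):
    """Every record must be read to extract one variable."""
    values, bytes_read = [], 0
    with open(path, "rb") as f:
        for _ in range(n_events):
            buf = f.read(ROW.size)
            bytes_read += len(buf)
            values.append(ROW.unpack(buf)[j])
    return values, bytes_read


def read_var_column_wise(path, j, n_events):
    """Seek to the variable's block and read only that block."""
    col = array("d")
    with open(path, "rb") as f:
        f.seek(j * n_events * col.itemsize)
        col.fromfile(f, n_events)
    return list(col), n_events * col.itemsize


events = [(float(i), i * 0.5, i * 0.25, i * 2.0) for i in range(1000)]

with tempfile.TemporaryDirectory() as tmp:
    row_path = os.path.join(tmp, "rows.dat")
    col_path = os.path.join(tmp, "cols.dat")
    write_row_wise(row_path, events)
    write_column_wise(col_path, events)
    row_vals, row_bytes = read_var_row_wise(row_path, 2, len(events))
    col_vals, col_bytes = read_var_column_wise(col_path, 2, len(events))
```

Both reads return the same values, but the row-wise read touches every byte of the file while the column-wise read touches only one variable's block, which is the access pattern a typical histogramming pass needs.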


ROOT I/O


Many years of struggle/experience to use
disk based data


optimized data formats for efficient access: CWNT -> split trees


Formats designed with HEP type data
access in mind


IDL I/O


Basically memory based


Associated I/O allows mapping an IDL array or structure
variable onto a file:


I/O occurs automatically when the associated variable
is subscripted, accessing only the desired object


data set size limited by file size rather than memory size


direct access to each element in the file; including
convenient event selection by indexing


files can have multiple associated structures (full events, tracks, hits, etc.)


performance still limited by record structure
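
The idea behind associated I/O, where subscripting a variable triggers a read of just that record, can be imitated in Python with a memory-mapped file. The sketch below is an analogue for illustration only, not IDL code: the class name `AssociatedFile`, the record layout (an integer id plus three doubles) and the sample data are all assumptions.

```python
import mmap
import os
import struct
import tempfile

# Assumed record layout: event id (int) followed by three doubles.
RECORD = struct.Struct("=i3d")


class AssociatedFile:
    """Index-addressable, read-only view of fixed-size records in a file."""

    def __init__(self, path):
        self._file = open(path, "rb")
        self._map = mmap.mmap(self._file.fileno(), 0, access=mmap.ACCESS_READ)

    def __len__(self):
        return len(self._map) // RECORD.size

    def __getitem__(self, i):
        # Only the bytes of record i are decoded; nothing else is unpacked.
        if not 0 <= i < len(self):
            raise IndexError(i)
        return RECORD.unpack_from(self._map, i * RECORD.size)

    def close(self):
        self._map.close()
        self._file.close()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "events.dat")
    with open(path, "wb") as f:
        for i in range(100):
            f.write(RECORD.pack(i, i * 1.0, i * 2.0, i * 3.0))

    events = AssociatedFile(path)
    n_events = len(events)
    event_42 = events[42]                             # direct access to one record
    picked = [events[i][0] for i in (3, 97, 10)]      # selection by indexing
    events.close()
```

As on the slide, the data set is bounded by the file size rather than by memory, but each access still pulls in a whole record, so a job that needs one field per event gains nothing over a row-wise layout.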


Access to user objects


ROOT's script language is C++; user classes can be used by the interpreter if their header
files are run through rootcint to create a dictionary


IDL supports structures: collections of scalars, arrays and other structures. It needs
an external structure definition file to allow their use in commands; there is no automatic way to
create these from class headers



IDL GUI Builder

Available in IDL 5.3, the IDL GUIBuilder enables you to build intuitive GUIs with drag-and-drop ease.
A convenient control palette with icons such as radio buttons, checkboxes, and horizontal and vertical
sliders lets you quickly construct interfaces that users understand. Widget properties are easily editable.
Pre-made bitmaps give you graphical cues for customizing buttons relevant to their function. Also, widgets are
arranged in row and column geometry for on-screen consistency. At the code level, built-in comments help
you understand what each widget and event will accomplish.




What Is ION?


An easy method for users to leverage the graphics and
analysis power of IDL in web based applets and
applications


Allows users to share IDL applications with non-IDL users


Easy set-up, use and management


ION Overview

[Diagram: client machines (web browser, ION client, ION application) connect over the Internet to a
server machine (web server, ION server, IDL). Traffic labels: HTTP data, Java classes, IDL commands,
graphic primitives.]


ION Applications



Web publishing is obvious, but what else?


Applications based on ION


Workgroups can develop and easily deploy data processing and
visualization apps with ION


Thin clients download fast and can be updated easily


Applications can exist on any Java-enabled machine and still access the power of IDL


Conclusions


Both satisfy user requirements


Commercial products offer all basic functionality and many attractive advanced features


Homemade products still better optimized for specific HEP use


Support models evolving (open source model)


Can we mix and match to get best of both worlds?