
Comparison of Oracle, MySQL and PostgreSQL DBMS
in the context of ALICE needs

Wiktor Peryt, Warsaw University of Technology, Faculty of Physics

We have taken the following approach: first of all, we determined which features of a DBMS are important from the point of view of such a large experiment.

We chose the following features:

Elementary features


basic data types

SQL language features

declarative integrity constraints

programming abstractions

automatic generation of identifiers

national characters support

Transactions and multi-user access


transactions


locks


multi-user access

Programming in database



stored procedures


triggers

Elements of database administration



access control


backup copies


data migration

Portability and scalability


portability of DBMS


scalability

Performance and VLDB

(Very Large Databases)


query optimization


structures supporting query optimization


support for analytical processing


allocation of disk space


data size limits


VLDB implementations

Distributed databases


access to multiple databases


heterogeneous systems support


Special data types


large objects in database


post-relational extensions


support for special data types

Application development and interfaces


embedded SQL


standard interfaces, additional interfaces


interoperability with Web technology


XML, CASE

Reliability


failure recovery

Commercial issues


technical support available


market position


Having completed step one, we carried out subsequent work in 3 subgroups; each of them dealt with only one DBMS.

The members of each subgroup had their own practical experience with the DBMS investigated by their subgroup.

Such a procedure gave us the possibility of verifying the information contained in manuals and other available documentation (for instance on the Internet).

As a result, 3 extended documents devoted to Oracle, MySQL and PostgreSQL were created.

Konrad Bohuszewicz, undergraduate student
Maciej Czyzowicz, undergraduate student
Michal Janik, Ph.D. student
Dawid Jarosz, undergraduate student
Piotr Mazan, undergraduate student
Marcin Mierzejewski, undergraduate student
Mikolaj Olszewski, undergraduate student
Wiktor S. Peryt
Sylwester Radomski, undergraduate student
Piotr Szarwas, Ph.D. student
Tomasz Traczyk
Dominik Tukendorf, undergraduate student
Jacek Wojcieszuk, undergraduate student

Faculty of Electronics and Information Technology
Faculty of Mathematics and Information Sciences
Faculty of Physics

About “Comparison …”

discussion by all people involved in this task

the compilation was made by Dr. Tomasz Traczyk

the compilation was circulated within the whole group a few times to make sure we avoided omissions or mistakes

this version of the document is accepted by all co-authors

we consider it a quite comprehensive and objective comparison

it also contains a kind of "weights", called by us "importance", given separately for the Central database and for Lab-participants. The Central database should be a kind of data warehouse at CERN, containing all the data, including the data transferred periodically from Lab-participants

the term "Lab-participants" denotes the smaller databases in the labs involved in the ALICE experiment preparation

a few explanations of terminology used in the database domain are also included to make this document easy to comprehend for non-specialists

Summary

The table gives the importance of each problem (separately for the Central database and for Lab-participants) and the assessment of MySQL, Oracle8 and PostgreSQL, on an A-D scale.

| Category | Problem | Importance: Central database | Importance: Lab-participants | MySQL | Oracle8 | PostgreSQL |
|---|---|---|---|---|---|---|
| Elementary features | Basic data types | C | C | B | C | A |
| | SQL | B | B | C | B | B |
| | Declarative constraints | B | B | C | A | A |
| | Programming abstractions | A | C | D | A | C |
| | Generation of ids | C | C | C | A | A |
| | National chars | B | C | B | A | B |
| Transactions | Transactions | A | C | D | A | A |
| | Locks | A | C | D | A | A |
| | Multiuser access | A | D | C | A | C |
| Programming in DB | Stored procedures and triggers | B | C | D | A | A |
| Administration | Access control | B | D | A | A | B |
| | Backup | A | C | C | A | C |
| | Data migration | C | C | A | B | A |
| Portability and scalability | Portability | B | C | B | A | B |
| | Scalability | A | C | B | A | C |
| Performance and VLDB | Query optimization | A | C | B | A | B |
| | Structures supporting optimization | B | D | D | A | B |
| | Support for OLAP | B | D | D | A | D |
| | Allocation of the disk space | A | C | C | A | C |
| | Size limits | A | B | B | A | C |
| | VLDB implementations | A | C | D | A | B |
| Distributed databases | Access to multiple databases | C | D | C | A | C |
| | Heterogeneous systems support | B | D | D | B | D |
| Special data types | Large objects | B | B | B | A | C |
| | Post-relational extensions | C | C | D | A | B |
| | Support for special data types | C | C | D | A | C |
| Application development and interfaces | Embedded SQL | C | C | D | A | B |
| | Standard interfaces | B | C | B | A | B |
| | Additional interfaces | A | A | A | A | A |
| | Web technology | A | A | B | A | B |
| | XML | B | C | D | A | D |
| | CASE | B | C | D | A | D |
| Reliability | Recovery | A | B | C | A | C |
| Commercial issues | Prices | C | A | A | D | A |
| | Technical support | A | B | C | B | D |
| | Position on the market | A | C | D | A | D |
Our preliminary conclusions:

for the Central Data Repository for ALICE at CERN:

only ORACLE can be taken into account seriously

for Lab-participants (mainly for production phase databases):

Oracle is also the best, but using MySQL or PostgreSQL is possible

the choice of one of them is not obvious at the moment


Some extended tests concerning MySQL and PostgreSQL performance, stability, etc. with real data for the STAR SSD are still in progress in Warsaw. They will be published in 1-2 weeks on the website:

http://ITS_DB_ALICE.if.pw.edu.pl

the same place as the document “Comparison of Oracle, MySQL and PostgreSQL DBMS”

Questions for ALICE


How to start with databases for ALICE and how
to manage the project?


General concept of system architecture


Databases in production phase


Software technologies recommended


DBMS platform choice


How to proceed?

Databases types for ALICE

The following main categories of information should go into databases:

production and assembly phase measurements and descriptive data -> ProdPhase database

calibrations data -> Calibration database

configuration data -> Configuration database

detector condition data -> Condition database

run logs data -> RunLog database

geometry data (?) -> Geometry database or part of Calibration DB (?)

some others? ... to be defined later, during "phase one" work

Databases contents (1)

ProdPhase database

all information coming from test-beds, from manufacturers, assembly processes, object flow between manufacturers and labs, etc.

RunLog database

to store the summary information describing the contents of an experimental run and to point to the locations where detailed information associated with the run is stored

An example of a Web based interface developed by Sylwester Radomski (undergraduate student from the Faculty of Physics, WUT) for STAR can be seen at http://www.star.bnl.gov -> Computing, and from the table New, the first item



The environment in which the archive facility operates is composed of many sources of information

We have to deal with data:

produced by various test-bench systems

entered manually by operators

submitted by collaborating institutes and companies

Usually there is a number of distinct data formats

Files are stored in many locations

Consequently, without a database it is not only hard to locate the right piece of information, but also to ensure the safety and good quality of the data


Why database in production phase?

secure archiving of all the test results in a repository

easy availability of info on the location of objects (in the geographic sense: manufacturers, labs); this makes the assembly arrangement easier

creating the possibility of automatic assignment of quality attributes according to well defined criteria

statistical analysis of the quality can be made easily and at any time

preparing data for future on-line use by slow-control, DCS and DAQ

easy access to all data during the production and assembly phase

in the future: easy access to all data during the experiment run

Goals for production phase database

Basic requirements:

data should be stored in a central repository to make management and maintenance easy and reliable

access to the data should be assured for everybody who participates in tests during the production phase, i.e. software allowing the use of Web browsers is necessary

objects' registration should be possible manually (by an operator with suitable privileges) as well as automatically (from a LabVIEW application, for example, or other software)

the software should allow creating (SQL) queries to the database even for inexperienced users

DB production phase

From the point of view of domain experts ... (1)

there is an ever-increasing demand for centralized storage of data, for consistent and easy to use search and retrieval facilities

experts want to be able to retrieve and analyze the information in a user-friendly way, regardless of its origin

they do not want to be forced to perform several queries just because the data in question was taken by different data acquisition systems

From the point of view of domain experts ... (2)

they wish to do statistics on data sets spanning months (and more) without having to browse tens of subdirectories on backup storage devices

usually they prefer to use industry-standard, versatile software tools to process and analyze data

they certainly would not mind being able to automate their routine, everyday tasks

Their task is to look at the information, not to look for it


we should address those issues by providing a modular framework for archiving and for platform-independent retrieval of data in a heterogeneous distributed computing environment

our database system must be open enough to follow the inevitable evolution of the information gathering systems related to the development of the particular detectors

we should be able to cope with the fast evolving new Internet technologies in order to take full advantage of the facilities they provide

Requirements addressed to software developers

Use of:

PHP4 software running on the server side

C/C++ for API

JAVA + SWING & JDBC (JDBC = JAVA DataBase Connectivity) for applications requiring more interactivity

"... seems to be the right choice of tools used for client side software development"
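To make the "C/C++ for API" item concrete, here is a minimal sketch using the MySQL C API; the host, credentials, database and table names are hypothetical placeholders, not taken from the original project:

```cpp
// Minimal sketch: query a production-phase MySQL database via the C API.
// All connection parameters and the table name are hypothetical.
#include <cstdio>
#include <mysql/mysql.h>

int main() {
    MYSQL *conn = mysql_init(nullptr);
    if (!mysql_real_connect(conn, "db.example.edu", "alice", "secret",
                            "prodphase", 0, nullptr, 0)) {
        std::fprintf(stderr, "connect failed: %s\n", mysql_error(conn));
        return 1;
    }

    // Fetch object ids and quality attributes from a hypothetical table.
    if (mysql_query(conn, "SELECT object_id, quality FROM test_results") == 0) {
        MYSQL_RES *res = mysql_store_result(conn);
        for (MYSQL_ROW row; (row = mysql_fetch_row(res)) != nullptr; )
            std::printf("object %s: quality %s\n", row[0], row[1]);
        mysql_free_result(res);
    }

    mysql_close(conn);
    return 0;
}
```

The same handful of calls can be wrapped once and reused by command line utilities or other client software.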

DB for STAR - software technologies used

On the flight plots creation ...

From SQL query to plot ...

Generation of plots and histograms from the database and putting them on the Web. Attempt made by S. Radomski:

Data chain:

The http server (Apache Tomcat) calls the servlet (dbPlot) with a parameter: the SQL query.

The servlet in the http server connects to a ROOT based server through a socket and sends the query.

"ROOT server" means a ROOT script which handles connections and scripts the dbPlot class.

dbPlot::Init() reuses an existing connection to the database or creates a new one if the old one does not exist.

dbPlot::TakeData() sends the query to the DB and fetches the data using the TSQLServer class.

dbPlot::TakeData() takes the data from the TSQLResult and puts them into a TNtuple. This function can recognise and parse a 'private' format of data stored in a BLOB.

dbPlot::PlotData() calls TTree->Draw() with the proper parameters.

dbPlot::Style() sets colors and labels.

The ROOT server saves the canvas as EPS and then calls convert to produce a PNG.

The ROOT server sends the identification of the created PNG file to the servlet.

The servlet sends the data in PNG format to the user agent, i.e. the Web browser.

The tool can also be used in a normal ROOT session without the http server.
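A minimal sketch of the server-side part of this chain, written as a ROOT macro; the connection string, credentials and query are hypothetical, and the real dbPlot class (with BLOB parsing, connection reuse and styling) is not reproduced here:

```cpp
// Sketch of the dbPlot idea: SQL query -> TNtuple -> plot -> EPS file.
#include <cstdlib>
#include "TSQLServer.h"
#include "TSQLResult.h"
#include "TSQLRow.h"
#include "TNtuple.h"
#include "TCanvas.h"

void dbplot_sketch(const char *query = "SELECT x, y FROM measurements")
{
   // Init(): connect to the database (hypothetical DSN and credentials).
   TSQLServer *db = TSQLServer::Connect("mysql://dbhost/prodphase",
                                        "reader", "secret");
   if (!db) return;

   // TakeData(): run the query and move the result into an ntuple.
   TNtuple nt("nt", "query result", "x:y");
   if (TSQLResult *res = db->Query(query)) {
      while (TSQLRow *row = res->Next()) {
         nt.Fill(std::atof(row->GetField(0)), std::atof(row->GetField(1)));
         delete row;
      }
      delete res;
   }

   // PlotData(): draw and save; the original chain then ran `convert`
   // on the EPS file to produce the PNG sent back to the servlet.
   TCanvas c("c", "dbPlot sketch");
   nt.Draw("y:x");
   c.SaveAs("plot.eps");

   db->Close();
}
```

Run inside a normal ROOT session, this corresponds to the "without http server" use mentioned above.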

Performance and problems ...

One histogram takes about 1-2 sec.

The slowest element in the chain is convert, which makes use of GhostScript. Creation of PostScript and then conversion to PNG is overcomplicated, and a rather simple TGraph with ~758 lines takes ~10 s.

ROOT cannot generate GIF in -b (batch) mode.

Problems with memory deallocation in ROOT: after about 100 plots ROOT crashes.

Modification of Draw() in TTree:

For a 1-D histogram, Draw() always makes 100 bins.

If the data has its own grid (measurement precision), plots look terrible, especially when histogramming integers.

A small modification in TTreePlayer makes it possible to recognize whether the data are "gridy" and to set the number of bins depending on the bounds, the grid and the number of entries.

We suggest including this code in the ROOT release.
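A minimal sketch of the grid-detection idea follows; this is not the actual TTreePlayer patch, and the tolerances and the bin cap are illustrative assumptions:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Decide how many bins a 1-D histogram should get: if the values sit on
// a regular grid (integers, fixed measurement precision), use one bin
// per grid cell instead of the fixed default of 100.
int ChooseBins(const std::vector<double> &v, int defaultBins = 100)
{
   if (v.size() < 2) return defaultBins;
   double lo = *std::min_element(v.begin(), v.end());
   double hi = *std::max_element(v.begin(), v.end());
   if (hi <= lo) return 1;

   // Estimate the grid step as the smallest nonzero gap between
   // sorted values (the real patch could be more careful here).
   std::vector<double> s(v);
   std::sort(s.begin(), s.end());
   double step = hi - lo;
   for (std::size_t i = 1; i < s.size(); ++i) {
      double d = s[i] - s[i - 1];
      if (d > 1e-12 && d < step) step = d;
   }

   // Data are "gridy" if every gap is (nearly) a multiple of the step.
   for (std::size_t i = 1; i < s.size(); ++i) {
      double r = std::fmod(s[i] - s[i - 1], step);
      if (std::min(r, step - r) > 1e-6 * step) return defaultBins;
   }

   // One bin per grid cell, capped to keep huge ranges drawable.
   int bins = static_cast<int>(std::lround((hi - lo) / step)) + 1;
   return std::min(bins, 1000);
}
```

Draw() would then book `bins` bins between `lo - step/2` and `hi + step/2`, so each grid value falls in the centre of its own bin.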

Typical architecture for local site/lab, i.e. Lab-participant

measurements are performed on a dedicated computer

data are transferred over the Ethernet local network to the database

users can access the measurements by means of JAVA applets or PHP applications

a graphical user interface makes the construction of complex queries easy even for users with no database experience

another capability of this applet is the visualisation of selected data

it is clear that using JAVA, JDBC and PHP makes it possible to access the database over the Internet or a local network with the user's favourite browser


[Diagram: a local-site setup in which a LabVIEW application reading the DUT, JAVA applets, PHP applications and ROOT or AliROOT all talk to the DB server (daemon) and its repository.]

Production phase database for ALICE

[Diagram: each lab (Lab 1 somewhere in Europe, Lab 2, Lab 3, Lab 4, ..., Lab n) runs a MySQL server (daemon) with its own MySQL repository, accessed by JAVA applets, PHP applications, a LabVIEW application with a DUT, and ROOT or AliROOT; all labs feed the central installation at CERN: a DATA repository with data services (Data Archive, Server Library), an ORACLE server (daemon), application services, the DBMS and AliROOT.]

Central data warehouse at CERN: three tier architecture

one can easily distinguish the three logical tiers, according to present tendencies: the client layer, the application services layer and the data services layer

each layer contains several components (not all shown in the picture)

the top level is a layer containing client applications, responsible for data transfer into the database and for visualisation

the middle layer is composed of application services; this layer knows the logical structure and the physical locations of the data

the bottom layer contains the data and the Database Management System


[Diagram: the client layer holds interactive software (WWW browsers, JAVA applets, PHP, HTML, command line utilities, etc.) and other modules and applications; the application services layer holds filters, a generic data loader and a custom data loader; the data services layer holds the ORACLE server, the DATA repository with Data Archive and Server Library, the DBMS and AliROOT.]

Project should be managed in a few phases

The project is large, so I strongly suggest applying a methodology proven in a "commercial environment".

How to start with databases for ALICE? How to manage the project? (1)


Phase 1: strategy (or planning) for the WHOLE project:

determination of the scope of the project

partitioning into subsystems (the natural way: subdetectors, but not only)

formulation of general models which could be applied

creation of a list of actors/participants

approximate time schedule for particular tasks

initial choice of software technologies


Successive phases should be performed in a "spiral cycle". It simply means that particular subsystems are elaborated successively.

Each subsystem must go through the following phases:

analysis/conceptual design

software design

development (programming)

implementation

Improvements/corrections in earlier completed subsystems must be continued during the work on successive subsystems.

Simultaneous work on several subsystems is a good practice, easy to apply, especially when more working teams are involved.


How to manage the project? (2)

work on a pilot project, in parallel to the main one; the same software technology; it should contain the most urgent things

efficiency tests; creation of "simulated data" with volumes similar to the expected ones

creation of "conceptual models" during the analysis phase is necessary before the design of subsystems; an appropriate formalism and a class of CASE tools are needed for that. For Linux, UML (Unified Modelling Language) is an appropriate option

elaboration for the whole project of such standards as: system of keys, terminology, security, access rights, etc.

How to manage the project? (3)

First steps ...

Start to formally organize a central database group for ALICE

After that: begin phase 1 of the project, i.e. the strategy for the WHOLE project/experiment

Partial, highest-priority tasks for this group:

determination of the scope of the project

formulation of general models which could be applied

creation of a list of actors/participants (including 1-2 representatives from each subdetector!)

initial choice of software technologies which could be used

partitioning into subsystems

analysis/conceptual design