Comparison of
Oracle, MySQL and
PostgreSQL DBMS
in the context of ALICE needs
Wiktor Peryt, Warsaw University
of Technology, Faculty of Physics
We have taken the following approach: first of all we determined which features of a DBMS are important from the point of view of such a large experiment.

We chose the following features:

Elementary features
- basic data types
- SQL language features
- declarative integrity constraints
- programming abstractions
- automatic generation of identifiers
- national characters support

Transactions and multi-user access
- transactions
- locks
- multi-user access

Programming in database
- stored procedures
- triggers

Elements of database administration
- access control
- backup copies
- data migration

Portability and scalability
- portability of DBMS
- scalability

Performance and VLDB (Very Large Databases)
- query optimization
- structures supporting query optimization
- support for analytical processing
- allocation of disk space
- data size limits
- VLDB implementations

Distributed databases
- access to multiple databases
- heterogeneous systems support

Special data types
- large objects in database
- post-relational extensions
- support for special data types

Application development and interfaces
- embedded SQL
- standard interfaces, additional interfaces
- interoperability with Web technology
- XML, CASE

Reliability
- failure recovery

Commercial issues
- technical support available
- market position

Having completed step one, we carried out the subsequent work in 3 subgroups; each of them dealt with only one DBMS.

The members of each subgroup had their own practical experience with the DBMS being investigated by their subgroup.

Such a procedure gave us the possibility of verifying the information contained in manuals and other available documentation (for instance on the Internet).

As a result, 3 extended documents devoted to Oracle, MySQL and PostgreSQL were created.
Konrad Bohuszewicz
undergraduate student
Maciej Czyzowicz
undergraduate student
Michal Janik
Ph.D. student
Dawid Jarosz
undergraduate student
Piotr Mazan
undergraduate student
Marcin Mierzejewski
undergraduate student
Mikolaj Olszewski
undergraduate student
Wiktor S. Peryt
Sylwester Radomski
undergraduate student
Piotr Szarwas
Ph.D. student
Tomasz Traczyk
Dominik Tukendorf
undergraduate student
Jacek Wojcieszuk
undergraduate student
Faculty of Electronics and Information Technology
Faculty of Mathematics and Information Sciences
Faculty of Physics
About “Comparison …”
- discussion by all people involved in this task
- the compilation was made by Dr. Tomasz Traczyk
- the compilation was circulated within the whole group a few times to make sure we avoided omissions or mistakes
- this version of the document is accepted by all co-authors
- we consider it a quite comprehensive and objective comparison
- it also contains a kind of "weights", which we call "importance", differentiated for the Central database and Lab-participants. The Central database should be a kind of data warehouse at CERN, containing all the data, including data transferred periodically from Lab-participants
- the term "Lab-participants" denotes the smaller databases in labs involved in ALICE experiment preparation
- a few explanations of terminology used in the database domain are also included, to make this document easy to comprehend for non-specialists
Summary
(Importance is given for the Central database and for Lab-participants; Assessment for MySQL, Oracle8 and PostgreSQL.)

Category                               | Problem                            | Central database | Lab-participants | MySQL | Oracle8 | PostgreSQL
---------------------------------------|------------------------------------|------------------|------------------|-------|---------|-----------
Elementary features                    | Basic data types                   | C | C | B | C | A
                                       | SQL                                | B | B | C | B | B
                                       | Declarative constraints            | B | B | C | A | A
                                       | Programming abstractions           | A | C | D | A | C
                                       | Generation of ids                  | C | C | C | A | A
                                       | National chars                     | B | C | B | A | B
Transactions                           | Transactions                       | A | C | D | A | A
                                       | Locks                              | A | C | D | A | A
                                       | Multiuser access                   | A | D | C | A | C
Programming in DB                      | Stored procedures and triggers     | B | C | D | A | A
Administration                         | Access control                     | B | D | A | A | B
                                       | Backup                             | A | C | C | A | C
                                       | Data migration                     | C | C | A | B | A
Portability and scalability            | Portability                        | B | C | B | A | B
                                       | Scalability                        | A | C | B | A | C
Performance and VLDB                   | Query optimization                 | A | C | B | A | B
                                       | Structures supporting optimization | B | D | D | A | B
                                       | Support for OLAP                   | B | D | D | A | D
                                       | Allocation of the disk space       | A | C | C | A | C
                                       | Size limits                        | A | B | B | A | C
                                       | VLDB implementations               | A | C | D | A | B
Distributed databases                  | Access to multiple databases       | C | D | C | A | C
                                       | Heterogeneous systems support      | B | D | D | B | D
Special data types                     | Large objects                      | B | B | B | A | C
                                       | Post-relational extensions         | C | C | D | A | B
                                       | Support for special data types     | C | C | D | A | C
Application development and interfaces | Embedded SQL                       | C | C | D | A | B
                                       | Standard interfaces                | B | C | B | A | B
                                       | Additional interfaces              | A | A | A | A | A
                                       | Web technology                     | A | A | B | A | B
                                       | XML                                | B | C | D | A | D
                                       | CASE                               | B | C | D | A | D
Reliability                            | Recovery                           | A | B | C | A | C
Commercial issues                      | Prices                             | C | A | A | D | A
                                       | Technical support                  | A | B | C | B | D
                                       | Position on the market             | A | C | D | A | D
Our preliminary conclusions:
- for the Central Data Repository for ALICE at CERN: only ORACLE can be taken into account seriously
- for Lab-participants (mainly for production phase databases): Oracle is also the best, but using MySQL or PostgreSQL is possible; the choice of one of them is not obvious at the moment

Some extended tests concerning MySQL and PostgreSQL performance, stability etc., with real data for the STAR SSD, are still in progress in Warsaw.
They will be published in 1-2 weeks on the website:
http://ITS_DB_ALICE.if.pw.edu.pl
the same place as the document “Comparison of Oracle, MySQL and PostgreSQL DBMS”
Questions for ALICE
- How to start with databases for ALICE and how to manage the project?
- General concept of system architecture
- Databases in production phase
- Software technologies recommended
- DBMS platform choice
- How to proceed?
Database types for ALICE
The following main categories of information should go into databases:
- production and assembly phase measurements and descriptive data → ProdPhase database
- calibration data → Calibration database
- configuration data → Configuration database
- detector condition data → Condition database
- run log data → RunLog database
- geometry data (?) → Geometry database, or part of the Calibration DB (?)
- some others? ... to be defined later, during "phase one" work
Databases contents (1)

ProdPhase database
- all information coming from test-beds, from manufacturers, assembly processes, object flow between manufacturers and labs, etc.

RunLog database
- stores the summary information describing the contents of an experimental run and points to the locations where the detailed information associated with the run is stored
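As a sketch of what such a RunLog record might contain (the type and field names below are illustrative assumptions, not the actual ALICE schema): a summary of the run plus pointers to where the detailed data live.

```cpp
#include <string>
#include <vector>

// Hypothetical shape of one RunLog entry: summary information about a
// run, plus the locations of the detailed data associated with it.
struct RunLogEntry {
    long        runNumber;                    // experimental run identifier
    std::string startTime;                    // e.g. an ISO-8601 timestamp
    std::string summary;                      // short description of run contents
    std::vector<std::string> detailLocations; // paths/URLs of detailed data
};
```

The point of the design is that queries against the RunLog stay small and fast, while bulky detector data stay wherever they were produced.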
An example of a Web-based interface developed for STAR by Sylwester Radomski (undergraduate student from the Faculty of Physics, WUT) can be seen at http://www.star.bnl.gov -> Computing, and from the table "New", the first item.
Why a database in the production phase?
- The environment in which the archive facility operates is composed of many sources of information
- We have to deal with data:
  - produced by various test-bench systems
  - entered manually by operators
  - submitted by collaborating institutes and companies
- Usually there are a number of distinct data formats
- Files are stored in many locations

Consequently, without a database it is not only hard to locate the right piece of information, but also to ensure the safety and good quality of the data.
Goals for the production phase database
- secure archiving of all the test results in a repository
- easy availability of information on the location of objects (in the geographic sense: manufacturers, labs), which makes the assembly arrangement easier
- creating the possibility of automatic assignment of quality attributes according to well-defined criteria
- statistical analysis of the quality should be easy to perform at any time
- preparing data for future on-line use by slow-control, DCS and DAQ
- easy access to all data during the production and assembly phase
- in the future: easy access to all data during the experiment run
DB production phase
Basic requirements:
- data should be stored in a central repository to make management and maintenance easy and reliable
- access to the data should be assured for everybody who participates in tests during the production phase, i.e. software allowing the use of Web browsers is necessary
- objects' registration should be possible manually (by an operator with suitable privileges) as well as automatically (from a LabVIEW application, for example, or other software)
- the software should allow creating (SQL) queries to the database even for inexperienced users
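A minimal sketch of what the last requirement implies: a form-based tool can assemble the SQL for the user from (column, operator, value) selections, so inexperienced users never type SQL themselves. All names here are illustrative assumptions, and a real implementation would also have to quote and escape the values:

```cpp
#include <string>
#include <vector>

// One condition picked in a (hypothetical) GUI: column, operator, value.
struct Condition {
    std::string column, op, value;
};

// Assemble a SELECT statement from GUI selections.
// NOTE: a real builder must escape/parameterize values to avoid SQL injection.
std::string buildQuery(const std::string& table,
                       const std::vector<Condition>& conds) {
    std::string sql = "SELECT * FROM " + table;
    for (size_t i = 0; i < conds.size(); ++i) {
        sql += (i == 0 ? " WHERE " : " AND ");
        sql += conds[i].column + " " + conds[i].op + " " + conds[i].value;
    }
    return sql;
}
```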
From the point of view of domain experts ... (1)
- there is an ever-increasing demand for centralized storage of data, with consistent and easy-to-use search and retrieval facilities
- experts want to be able to retrieve and analyze the information in a user-friendly way, regardless of its origin
- they do not want to be forced to perform several queries just because the data in question were taken by different data acquisition systems
From the point of view of domain experts ... (2)
- they wish to do statistics on data sets spanning months (and more) without having to browse tens of subdirectories on backup storage devices
- usually they prefer to use industry-standard, versatile software tools to process and analyze data
- they certainly would not mind being able to automate their routine, everyday tasks

Their task is to look *at* the information, not to look *for* it.
Requirements addressed to software developers
- we should address those issues by providing a modular framework for archiving, and for platform-independent retrieval of data in a heterogeneous distributed computing environment
- our database system must be open enough to follow the inevitable evolution of the information gathering systems related to the development of the particular detectors
- we should be able to cope with fast-evolving new Internet technologies in order to take full advantage of the facilities they provide
DB for STAR - software technologies used
Use of:
- PHP4 software running on the server side
- C/C++ for the API
- JAVA + SWING & JDBC (JDBC = JAVA DataBase Connectivity) for applications requiring more interactivity

This seems to be the right choice of tools for client-side software development.
On-the-fly plot creation ... from SQL query to plot ...
Generation of plots and histograms from the database and putting them on the Web. Attempt made by S. Radomski:

Data chain:
- The HTTP server (Apache - Tomcat) calls a servlet (dbPlot) with an SQL query as parameter.
- The servlet in the HTTP server connects to a ROOT-based server through a socket and sends the query. "ROOT server" means a ROOT script which handles connections and scripts the dbPlot class.
- dbPlot::Init() reuses the existing connection to the database, or creates a new one if the old one does not exist.
- dbPlot::TakeData() sends the query to the DB and takes the data using the TSQLServer class.
- dbPlot::TakeData() takes the data from a TSQLResult and puts them into a TNtuple. This function can recognise and parse the 'private' format of data stored in BLOBs.
- dbPlot::PlotData() calls TTree->Draw() with the proper parameters.
- dbPlot::Style() sets colors and labels.
- The ROOT server saves the canvas as EPS and then calls convert to produce a PNG.
- The ROOT server sends the identification of the created PNG file to the servlet.
- The servlet sends the data in PNG format to the user agent - the Web browser.

The tool can also be used in a normal ROOT session, without the HTTP server.
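The 'private' BLOB format that dbPlot::TakeData() parses is not specified in this talk. Assuming, purely for illustration, a simple comma-separated list of numbers, the parsing step before filling the ntuple would look roughly like this (plain C++, no ROOT dependencies):

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch: decode one BLOB (assumed here to be a
// comma-separated list of numbers, e.g. "1.5,2.0,3") into values that
// could then be filled into a TNtuple. The real dbPlot format may differ.
std::vector<double> parseBlob(const std::string& blob) {
    std::vector<double> values;
    std::stringstream ss(blob);
    std::string item;
    while (std::getline(ss, item, ','))   // split on commas
        values.push_back(std::stod(item)); // convert each field to double
    return values;
}
```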
Performance and problems ...
- One histogram takes about 1-2 sec.
- The slowest element in the chain is convert, which makes use of GhostScript. Creating PostScript and then converting it to PNG is overcomplicated: a rather simple TGraph with ~758 lines takes ~10 s.
- ROOT cannot generate GIF in -b mode.
- Problems with memory deallocation in ROOT: after about 100 plots ROOT crashes.

Modification of Draw() in TTree:
- For a 1-D histogram, Draw() always makes 100 bins.
- If the data have their own grid (measurement precision), the plots look terrible, especially when histogramming integers.
- A small modification in TTreePlayer makes it possible to recognize whether the data are "gridy", and sets the number of bins depending on the bounds, grid and number of entries.
- We suggest including this code in the ROOT release.
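The TTreePlayer modification described above amounts to a small binning heuristic. A minimal self-contained sketch (the function name and the exact rule are assumptions, not the actual patch): choose one bin per grid cell, fall back to ROOT's default of 100 bins when no grid is known, and never use far more bins than there are entries.

```cpp
#include <algorithm>
#include <cmath>

// Hypothetical sketch of a grid-aware bin count for a 1-D histogram:
// lo/hi are the data bounds, grid the measurement step, nEntries the
// number of entries. Not the actual TTreePlayer code.
int gridAwareBinCount(double lo, double hi, double grid, long nEntries) {
    if (grid <= 0.0 || hi <= lo)
        return 100;                              // ROOT's default bin count
    // one bin per grid cell, so bin edges fall between grid points
    long cells = std::lround((hi - lo) / grid) + 1;
    // no point in having many more bins than entries
    long cap = std::max<long>(1, nEntries);
    return static_cast<int>(std::min<long>(cells, cap));
}
```

For integer data (grid = 1) over [0, 9] this gives 10 bins instead of 100, so every integer value gets exactly one bin.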
Typical architecture for a local site/lab, i.e. a Lab-participant
- measurements are performed on a dedicated computer
- data are transferred over the Ethernet local network to the database
- users can access the measurements by means of JAVA applets or PHP applications
- a graphical user interface makes the construction of complex queries easy, even for users with no database experience
- another capability of this applet is the visualisation of selected data
- it is clear that using JAVA, JDBC and PHP allows access to the database over the Internet or a local network with the user's favourite browser
[Diagram: DUT, LabVIEW application, DB server (daemon), repository; client access via JAVA applet, PHP applications, ROOT or AliROOT]
Production phase database for ALICE

[Diagram: labs (Lab 1 somewhere in Europe, Lab 2, Lab 3, Lab 4, ..., Lab n), each with a DUT, a LabVIEW application, a MySQL server (daemon) and a MySQL repository, accessed via JAVA applets, PHP applications, ROOT or AliROOT; at CERN, a central DATA repository with Data services (Data Archive Server Library, ORACLE server (daemon), DBMS), Application services and AliROOT]
Central data warehouse at CERN - three tier architecture
- one can easily distinguish the three logical tiers, according to present tendencies: the client layer, the application services layer and the data services layer
- each layer contains several components (not all shown in the picture)
- the top level is the layer containing client applications, responsible for data transfer into the database and for visualisation
- the middle layer is composed of application services; this layer knows the logical structure and physical locations of the data
- the bottom layer contains the data and the Database Management System

[Diagram: Client layer → modules and applications (interactive software: WWW browsers, JAVA applets, PHP, HTML, command line utilities etc., plus filters, a generic data loader and custom data loaders); Application services; Data services → Data Archive Server Library, ORACLE server, DBMS, DATA repository, AliROOT]
How to start with databases for ALICE?
- The project should be managed in a few phases
- The project is large, so I strongly suggest applying a methodology proven in the "commercial environment"
How to manage the project? (1)
- Phase 1: strategy (or planning) for the WHOLE project:
  - determination of the scope of the project
  - partitioning into subsystems (the natural way: subdetectors, but not only)
  - formulation of general models which could be applied
  - creation of a list of actors/participants
  - approximate time schedule for particular tasks
  - initial choice of software technologies
How to manage the project? (2)
- Successive phases should be performed in a "spiral cycle". This simply means that particular subsystems are elaborated successively. Each subsystem must go through the following phases: analysis/conceptual design → software design → development (programming) → implementation
- improvements/corrections in earlier completed subsystems must be continued during the work on successive subsystems
- simultaneous work on several subsystems is a good practice, and easy to apply, especially when more working teams are involved
How to manage the project? (3)
- work on a pilot project, in parallel to the main one, using the same software technology; it should contain the most urgent things
- efficiency tests; creation of "simulated data" with volumes similar to the expected ones
- creation of "conceptual models" during the analysis phase is necessary before the design of subsystems; an appropriate formalism and CASE tools are needed for that. For Linux, UML (Unified Modelling Language) is an appropriate option
- elaboration, for the whole project, of such standards as: system of keys, terminology, security, access rights etc.
First steps ...
- Start to formally organize a central database group for ALICE
- After that: begin phase 1 of the project, i.e. the strategy for the WHOLE project/experiment
- A partial list of the highest-priority tasks for this group:
  - determination of the scope of the project
  - formulation of general models which could be applied
  - creation of a list of actors/participants (including 1-2 representatives from each subdetector!)
  - initial choice of software technologies which could be used
  - partitioning into subsystems
  - analysis/conceptual design