Scientific Databases: the story behind the scenes

Martin Kersten

Milena Ivanova


M.Kersten Mar 2010

DIR Edinburgh


Departure for a journey


CWI Database Architecture Group



Core business:


To research efficient and effective database technology


To deploy this technology in real-life application settings


To disseminate this knowledge as open-source software



Key research issues


What is the ultimate (virtual) machine architecture and
software stack for database processing?


The Big Data Bang


Outline


Departure for a journey


Mapping unknown territory


Crossing the Great Divide


Stepping stone 1: Multimedia Dimension


Stepping stone 2: Geometric Dimension


Stepping stone 3: Lineage Dimension


Stepping stone 4: Heterogeneous Databases


Stepping stone 5: Semantic Search


Stepping stone 6: Wireless sensor databases


Stepping stone 7: Distributed Databases


Arrival and outlook


SciDB and SciLens ambitions


Teaming up and making it a success





SkyServer provides public access to the Sloan Digital Sky Survey (SDSS) for astronomers, students, and the wider public

A project to make a map of
a large part of the Universe

230 million object images

1 million spectra

4TB catalog data

9TB images


SkyServer Schema

446 columns

>370 million rows

Vertical fragment of 100+
popular columns

Materialized join of
Photo and Spectra


Initial exploration


Mapping unknown territory

Multimedia Images

Geometric Mapping

Features Space

Annotations

Modelling (Atlas)

Astronomy

Neuroscience













Geophysics


Biosciences


One size fits all?


Figure axes: pico scale to mega scale; structured, semi-structured, documents, images

Oracle

MS SQL Server

DB2







Vertica MonetDB

PostgreSQL

MySQL, MariaDB

SQLite

MongoDB

LucidDB


NoSQL

We have to stand the storm


Stepping stone 1: Multimedia Dimension


Storage challenges:


Large volumes (>Tbyte, >Pbyte) of raw data


Partitioning based on image, video segmentation


Indexing based on feature vectors



Query challenges:


Proximity and probability based search


CPU intensive, user defined predicates


Content-based information retrieval




Stepping stone 1: Multimedia Dimension


The database consists of 100,000 images.

From each image we extract 25 patches.

For each patch a 14-dimensional feature vector is derived.

This yields 2,500,000 feature vectors.



Challenge: find similar images based on Euclidean distance with sub-second response time.

Solution: novel database algorithms for k-nearest neighbours (k-NN) search.

Lessons: start from generative models.
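The k-NN search named above can be illustrated with a brute-force baseline. This is only a sketch: the function names and the toy 3-dimensional vectors are assumptions for illustration, and the slides do not describe the actual algorithms, which would need to avoid this linear scan to reach sub-second response times over millions of vectors.

```python
import heapq
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_nearest(query, vectors, k):
    """Return the k (distance, index) pairs closest to `query` by linear scan."""
    scored = ((euclidean(query, v), i) for i, v in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Toy data: 3-dimensional stand-ins for the 14-dimensional patch vectors.
patches = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (0.1, 0.0, 0.0), (5.0, 5.0, 5.0)]
nearest = k_nearest((0.0, 0.0, 0.0), patches, k=2)
```

The scan touches every vector once per query, which is the cost an index or a model-based scheme has to beat.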


Stepping stone 1: Multimedia Dimension


Alternative scheme: determine the probability that an image can be generated by a limited number of Gaussian mixtures.

Fix a limited number of GMMs (Gaussian mixture models) and use an Expectation-Maximization algorithm to fit the model over the image.

Search for similar images by comparing the GMM parameters.
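A minimal sketch of the Expectation-Maximization fit, under strong simplifying assumptions: the features are one-dimensional scalars rather than image patches, the initial means are spread evenly over the data range, and `fit_gmm` and its parameters are names introduced here for illustration only.

```python
import math


def gaussian(x, mean, var):
    """Density of a 1-D normal distribution."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm(samples, k=2, iterations=50, min_var=1e-6):
    """Fit a k-component (k >= 2) 1-D Gaussian mixture with plain EM."""
    lo, hi = min(samples), max(samples)
    # Assumption: spread the initial means evenly over the data range.
    means = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    centre = sum(samples) / len(samples)
    spread = max(sum((x - centre) ** 2 for x in samples) / len(samples), min_var)
    variances = [spread] * k
    weights = [1.0 / k] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in samples:
            p = [w * gaussian(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p) or 1.0
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            if nj == 0.0:
                continue  # leave an empty component unchanged
            weights[j] = nj / len(samples)
            means[j] = sum(r[j] * x for r, x in zip(resp, samples)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, samples)) / nj
            variances[j] = max(var, min_var)  # floor avoids a collapsed component
    return weights, means, variances


weights, means, variances = fit_gmm([0.9, 1.0, 1.1, 4.9, 5.0, 5.1])
```

The fitted triple (weights, means, variances) is the compact model that would be stored and compared in place of the raw feature vectors.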


Probabilistic Image Dimension


Query:













Which of the models is most likely to generate these 24
samples?
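The query can be read as a maximum-likelihood selection over the stored models. The sketch below assumes 1-D mixtures stored as lists of (weight, mean, variance) triples; the model names and sample values are hypothetical.

```python
import math


def log_likelihood(samples, model):
    """Log-likelihood of 1-D samples under a mixture [(weight, mean, variance), ...]."""
    total = 0.0
    for x in samples:
        density = sum(
            w * math.exp(-((x - m) ** 2) / (2.0 * v)) / math.sqrt(2.0 * math.pi * v)
            for w, m, v in model
        )
        # Assumes the density does not underflow to zero for the given samples.
        total += math.log(density)
    return total


def most_likely_model(samples, models):
    """Name of the stored model with the highest log-likelihood for the samples."""
    return max(models, key=lambda name: log_likelihood(samples, models[name]))


# Hypothetical per-image models and query samples.
models = {
    "image_a": [(0.5, 0.0, 1.0), (0.5, 4.0, 1.0)],
    "image_b": [(0.5, 10.0, 1.0), (0.5, 14.0, 1.0)],
}
best = most_likely_model([0.2, 3.9, 0.1, 4.2], models)
```

Working in log space keeps the product of many small densities from underflowing as the number of samples grows.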


Stepping stone 2: Geometric Dimension


Any geometric abstraction of reality provides a good
navigational map



Database storage and indexing support for 2D is mature


R-trees and quad-trees


Commercial database vendors do ‘not like them’


Open research issue is to support 2D query embedding


Scaling out towards 3 and 4 dimensions and temporal support

Examples: researched extensively in Geographical Information Systems. Google Maps is omnipresent, as is OpenGIS.

Lessons: avoid an abundance of reference models; baroque data structures do not necessarily scale.
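For readers unfamiliar with the 2D index structures mentioned above, a point quad-tree with a rectangular range query can be sketched as follows. The class name, the node capacity, and the sample points are assumptions for illustration; production indexes handle bulk loading, deletion, and disk layout that this sketch ignores.

```python
class QuadTree:
    """Point quad-tree over the square [x, x + size) x [y, y + size)."""

    MIN_SIZE = 1e-6  # cells this small are never split further

    def __init__(self, x, y, size, capacity=4):
        self.x, self.y, self.size = x, y, size
        self.capacity = capacity
        self.points = []
        self.children = None  # four sub-quadrants once this node splits

    def _contains(self, px, py):
        return self.x <= px < self.x + self.size and self.y <= py < self.y + self.size

    def insert(self, px, py):
        """Insert a point; return False if it lies outside this node."""
        if not self._contains(px, py):
            return False
        if self.children is None:
            # Keep the point here while there is room, or when the cell is too
            # small to split (guards against many duplicate points).
            if len(self.points) < self.capacity or self.size <= self.MIN_SIZE:
                self.points.append((px, py))
                return True
            self._split()
        return any(child.insert(px, py) for child in self.children)

    def _split(self):
        half = self.size / 2
        self.children = [
            QuadTree(self.x + dx, self.y + dy, half, self.capacity)
            for dx in (0, half)
            for dy in (0, half)
        ]
        old, self.points = self.points, []
        for px, py in old:
            for child in self.children:
                if child.insert(px, py):
                    break

    def query(self, x0, y0, x1, y1):
        """Return all points with x0 <= x <= x1 and y0 <= y <= y1."""
        if x1 < self.x or x0 >= self.x + self.size or y1 < self.y or y0 >= self.y + self.size:
            return []  # query box does not overlap this cell
        found = [(px, py) for px, py in self.points if x0 <= px <= x1 and y0 <= py <= y1]
        for child in self.children or []:
            found.extend(child.query(x0, y0, x1, y1))
        return found


tree = QuadTree(0, 0, 100)
for point in [(10, 10), (20, 20), (80, 80), (15, 12), (60, 40), (11, 11)]:
    tree.insert(*point)
hits = sorted(tree.query(5, 5, 25, 25))
```

The range query prunes whole quadrants that do not overlap the query box, which is the property that makes such trees attractive for 2D data.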


Stepping stone 3: Lineage Dimension


The problem encountered in many scientific databases is to ensure data lineage, the ability to travel back in time to understand, redo and judge the derivations.



How to keep track of the complete context?


Data, software, parameter settings,…


How to redo part of the analysis?


How to store and remember the lineage trails?



Example: the AstroWise project in Groningen keeps track of a complete workflow for telescope data analysis in a large Oracle database. All derivations are 5-line Python programs.

Lesson: don't be afraid of storage cost; be an accountant.
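The bookkeeping questions above (what context to keep, how to redo a step) can be made concrete with a small sketch. It is not the AstroWise design: `LineageLog`, `calibrate`, and the version string are hypothetical names, and the log lives in memory rather than in a database.

```python
import hashlib
import json


def fingerprint(obj):
    """Stable SHA-256 digest of any JSON-serialisable value."""
    encoded = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


class LineageLog:
    """Append-only record of derivations: inputs, software, parameters, output."""

    def __init__(self):
        self.records = {}

    def derive(self, func, inputs, params, software_version):
        """Run `func` and record everything needed to redo the step later."""
        output = func(inputs, **params)
        record = {
            "function": func.__name__,
            "software_version": software_version,
            "params": params,
            "input_ids": [fingerprint(i) for i in inputs],
            "inputs": inputs,  # stored in full: storage is cheap, context is not
        }
        output_id = fingerprint(output)
        self.records[output_id] = record
        return output_id, output

    def redo(self, output_id, registry):
        """Re-run a recorded derivation using a name -> function registry."""
        rec = self.records[output_id]
        return registry[rec["function"]](rec["inputs"], **rec["params"])


def calibrate(frames, gain):
    """Hypothetical derivation: scale each raw frame by a gain factor."""
    return [[pixel * gain for pixel in frame] for frame in frames]


log = LineageLog()
out_id, calibrated = log.derive(calibrate, [[1, 2], [3, 4]], {"gain": 2}, "v0.1")
replayed = log.redo(out_id, {"calibrate": calibrate})
```

Keying each record by a digest of its output lets any later result be traced back to the exact inputs, parameters, and software version that produced it.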


Stepping stone 4: Heterogeneous Databases


A key problem is to share heterogeneous information


Use commonly approved vocabulary and standard syntax


XML is the language inter-galactica for self-descriptive data and its exchange between software systems


RDF claims to be the next king



The database community was actively working on XML, XQuery, and XUpdate database engines, but it is not easy!

Challenges: how to scale to large XML stores? How to efficiently search components? How to realize structural information retrieval?

The RDF world brings in graph algorithms

Lessons: science is done, jewels are captured by bandits


Database and Informatics Working Group


FBIRN 2005


David Keator



Figure: fBIRN pipeline “big picture”. Labels: MR scanner; scanner- or software-specific file formats; XML-based events file; XML-based image header; image pre-processing; event analysis.


Stepping stone 5: Semantic search


Ontology integration is one of the most pressing challenges
for the semantic web to take off.



Integration of technology with databases is still immature.



RDF and OWL are the leading paradigms, SPARQL is an
attempt to bridge the gap between traditional database
management and semantic web technology.



Lessons: not a technological issue, but an educational and cultural one



http://e-culture.multimedian.nl/demo/search


Stepping stone 6: Sensor Databases


Database management functionality can be downscaled to the level of small sensor-enabled devices. They can form ad-hoc networks and provide a straightforward SQL interface for aggregation. The focus is on network-based aggregation under severe energy limitations.

Embedded database systems are not up to the job. Positive case studies include TinyDB on TinyOS (Berkeley).

The DataCell project at CWI (and Philips) aims to provide a more expressive query language and application interface.
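Network-based aggregation can be sketched as each node merging its own reading with the partial aggregates of its children and forwarding a single (sum, count) pair. The routing tree, node names, and readings below are hypothetical, and the code does not use the TinyDB API; it only illustrates why radio traffic, and thus energy use, stays low.

```python
def aggregate(node, readings, children):
    """Return the (sum, count) partial aggregate for the subtree rooted at `node`."""
    total, count = readings[node], 1
    for child in children.get(node, []):
        child_total, child_count = aggregate(child, readings, children)
        total += child_total
        count += child_count
    return total, count


# Hypothetical routing tree: the base station "root" hears nodes a and b;
# node a relays for c and d.
children = {"root": ["a", "b"], "a": ["c", "d"]}
readings = {"root": 20.0, "a": 22.0, "b": 19.0, "c": 21.0, "d": 23.0}

total, count = aggregate("root", readings, children)
average = total / count
messages_sent = len(readings) - 1  # one partial aggregate per non-root node
```

Without in-network aggregation every raw reading would be relayed hop by hop to the root, so nodes close to the base station would transmit far more often.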


Research World Perspective

Figure: sensor database research from past to future. Labels: PC-less sensor net, AmbientDB, Semantic Sensors; stationary sensor cluster, mobile sensor cluster, distributed sensor net; integrated management, distributed management.


Stepping stone 7: MR/DDBMS



HPC … Grids … Clouds …

Grids are focussed on high-performance computing, with a focus on Authentication-Authorization-Access and data shipping over wide-area networks.

Map-reduce technology is a re-invention of re-scaled distributed database technology and distributed programming.

Data distribution, replication, and parallel query processing have been well studied over the last 3 decades!

Lessons: application programmers are infected by the “not-written-by-me” hype bacteria
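The overlap between map-reduce and distributed query processing can be seen in a small sketch: a partitioned average per group, which is what a distributed DBMS does for a GROUP BY aggregate. The table contents and function names are hypothetical, and the partitions are processed sequentially here rather than on separate machines.

```python
from collections import defaultdict


def map_phase(partition):
    """Local pre-aggregation on one partition: station -> (sum, count)."""
    partial = defaultdict(lambda: (0.0, 0))
    for station, value in partition:
        s, c = partial[station]
        partial[station] = (s + value, c + 1)
    return dict(partial)


def reduce_phase(partials):
    """Merge the per-partition results into a global average per station."""
    merged = defaultdict(lambda: (0.0, 0))
    for partial in partials:
        for station, (s, c) in partial.items():
            ms, mc = merged[station]
            merged[station] = (ms + s, mc + c)
    return {station: s / c for station, (s, c) in merged.items()}


# Two hypothetical horizontal partitions of a (station, temperature) table.
partitions = [
    [("north", 10.0), ("south", 20.0), ("north", 14.0)],
    [("south", 22.0), ("north", 12.0)],
]
averages = reduce_phase(map_phase(p) for p in partitions)
```

Shipping (sum, count) pairs instead of raw rows is the same partial-aggregation step a parallel query plan would use.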


MonetDB in the large


MonetDB/Map-reduce

Pure map-reduce approach driven by query streams, leading to a self-organising distributed database.

MonetDB/Octopus

Dynamic partial replication of databases, with an economic model for reallocation and recycler technology

MonetDB/Datacyclotron

Let the database hotset flow like a stream of particles through a large, fast ring of connected machines, e.g. a data collider




Get our hands dirty


Toys, Tools & Techniques

The MonetDB product family

Figure: product family diagram. Labels: End-user application; JDBC, ODBC, C-mapi lib, Perl, PHP, Python, RoR; MAPI protocol; SQL, XQuery; MonetDB kernel.

The MonetDB Software Stack

An advanced column-oriented DBMS

Figure: software stack diagram. Labels: XQuery, SQL 03, SQL/XML, GIS, Open-GIS, SOAP; Optimizers; MonetDB 4, MonetDB 5; MonetDB kernel; compile.

Extensions

Orthogonal extension of SQL03


Clear computational semantics


Minimal extension to MonetDB



Source: An Architecture for Recycling Intermediates. M. Ivanova, M. L. Kersten, N. Nes, R. Goncalves. SIGMOD'09, Providence, RI, 30/06/2009.

MonetDB Recycler Architecture

Figure: architecture diagram. Labels: SQL, XQuery; MAL; MonetDB Server; Tactical Optimizer; Recycler Optimizer; MonetDB Kernel; Run-time Support; Recycle Pool.

function user.s1_2(A0:date, ...):void;
    X5 := sql.bind("sys","lineitem",...);
    X10 := algebra.select(X5,A0);
    X12 := sql.bindIdx("sys","lineitem",...);
    X15 := algebra.join(X10,X12);
    X25 := mtime.addmonths(A1,A2);
    ...

Admission & Eviction
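The idea of a recycle pool with admission and eviction can be sketched as a bounded cache of intermediates keyed by instruction and arguments. This is an illustration only: the admit-everything and least-recently-used policies are assumptions made here, not the policies of the MonetDB recycler, and `RecyclePool` and `select_lt` are hypothetical names.

```python
from collections import OrderedDict


class RecyclePool:
    """Bounded cache of intermediates keyed by (operator name, arguments)."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.pool = OrderedDict()
        self.hits = 0

    def run(self, op_name, func, *args):
        """Return a cached result for this instruction, or execute and admit it."""
        key = (op_name, args)
        if key in self.pool:
            self.hits += 1
            self.pool.move_to_end(key)  # mark as most recently used
            return self.pool[key]
        result = func(*args)
        self.pool[key] = result  # admission: this sketch admits everything
        if len(self.pool) > self.capacity:
            self.pool.popitem(last=False)  # eviction: least recently used
        return result


def select_lt(column, bound):
    """Hypothetical stand-in for a selection: values below `bound`."""
    return tuple(v for v in column if v < bound)


pool = RecyclePool()
lineitem_qty = (5, 17, 3, 42, 8)  # tuples are hashable, so they can be cache keys
first = pool.run("select_lt", select_lt, lineitem_qty, 10)
second = pool.run("select_lt", select_lt, lineitem_qty, 10)  # served from the pool
```

A repeated instruction with identical arguments is answered from the pool, so overlapping queries skip the recomputation of shared intermediates.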

SciDB and SciLens projects


Design and implement a database management system better geared to the requirements of scientific applications



SciDB vision (http://www.scidb.org)


Array data model is missing

Distributed, map-reduce processing from the start

No-cost loading of data


… redo all the hard work from the ground up



SciLens


Multi-paradigm software layer


Database summarisation is the key


… build on the shoulders of the MonetDB team



Teaming up and making it a success

Crossing the Great Divide is challenging and rewarding iff

Building the bridge starts from both ends

Parties recognize and respect each other's core business

Open-source database technology provides a sound basis to manage sizeable scientific databases

To capitalize and steer expertise development

The database community can provide knowledge on modelling, query processing, algorithms, data structures, scalability, persistency, … and flexible database systems


The MonetDB team seeks new frontiers in scalable structured database
management







