Database Engine Design


August 2007

Slide
1


SERC Research Symposium


Database Engine Design

a.k.a. Research @ DSL




Jayant Haritsa


Database Management Systems (DBMS)


Efficient and convenient mechanisms for storing, querying, and maintaining enterprise data




Cornerstone of computer industry


Uses more than 80 percent of computers worldwide


Employs more than 70 percent of computer professionals


Largest monetary sector of computer business





DBMS FEATURES

Handle data of arbitrary size
  Income-Tax records are in Petabytes (10^15 bytes)

Self-contained
  contains both data and meta-data

Program-Data insulation
  application s/w not affected by storage changes

  SR No | Name | Address | Hostel | GPA
  SR No | Name | Address | GPA | Hostel
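
The two column layouts above can be tried out in a minimal sketch, assuming Python's built-in sqlite3 module as a stand-in DBMS. The table name student, the helper fetch_name_and_gpa, and the sample row are made up for illustration and do not come from the talk. The application query names the columns it needs, so it returns the same answer whichever physical column order the table was created with.

```python
import sqlite3

# Hypothetical sample row, keyed by column name (invented for this sketch).
SAMPLE_ROW = {"sr_no": 1, "name": "Asha", "address": "Bangalore",
              "hostel": "A", "gpa": 7.5}

def fetch_name_and_gpa(column_order):
    """Create the table with the given physical column order, then run
    the same application query against it."""
    conn = sqlite3.connect(":memory:")
    columns = ", ".join(column_order)
    placeholders = ", ".join("?" for _ in column_order)
    conn.execute(f"CREATE TABLE student ({columns})")
    conn.execute(
        f"INSERT INTO student ({columns}) VALUES ({placeholders})",
        [SAMPLE_ROW[c] for c in column_order],
    )
    # The application refers to columns by name, never by position.
    rows = conn.execute(
        "SELECT name, gpa FROM student WHERE sr_no = ?", (1,)
    ).fetchall()
    conn.close()
    return rows

layout_a = ["sr_no", "name", "address", "hostel", "gpa"]
layout_b = ["sr_no", "name", "address", "gpa", "hostel"]
print(fetch_name_and_gpa(layout_a))
print(fetch_name_and_gpa(layout_b))
```

Both calls print the same row, which is the insulation the slide describes: the storage layout changed, the application query did not.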


DBMS FEATURES (contd)


Declarative Access


state what you want, not how to get it


On-the-Fly Questions


ask new questions without writing new programs


PEACE OF MIND


changes to the database are guaranteed to be
immune to subsequent system failures


Sri Sri Ravishankar of the Information World
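
Declarative access and on-the-fly questions can be illustrated with a small sketch, again assuming Python's built-in sqlite3 module as the DBMS. The student table, its rows, and the helper ask are hypothetical. Each question is stated as a SQL string that says what is wanted; no new retrieval program is written for the second question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE student (sr_no INTEGER, name TEXT, hostel TEXT, gpa REAL)"
)
# Made-up rows, for illustration only.
conn.executemany(
    "INSERT INTO student VALUES (?, ?, ?, ?)",
    [(1, "Asha", "A", 7.5), (2, "Ravi", "B", 6.0), (3, "Meena", "A", 8.0)],
)

def ask(sql):
    """Hand a declarative question to the engine; it decides how to answer."""
    return conn.execute(sql).fetchall()

# First question: who has a GPA above 7?
high_gpa = ask("SELECT name FROM student WHERE gpa > 7 ORDER BY name")

# A new question asked on the fly, with no new program: head count per hostel.
per_hostel = ask(
    "SELECT hostel, COUNT(*) FROM student GROUP BY hostel ORDER BY hostel"
)

print(high_gpa)
print(per_hostel)
```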


Current Database Systems



Commercial



IBM DB2 / Oracle / Microsoft SQL Server / Sybase




Public-domain



PostgreSQL / MySQL / Berkeley DB














DBMS Myths



Databases? Isn’t that the boring part of
accounting?


Hazaar dumb Cobol programming!


Maha-bore: almost as dull as watching Rahul Dravid bat!


High-tech name for data entry!


Will only get job with TCS!


...


DBMS Realities



Design of database engines has lots of really, really interesting intellectual problems with practical impact


theory, algorithms, data structures, experiments,
prototypes



Turing awards


1981: Edgar Codd (relational data model)


1998: Jim Gray (transaction model)



Ullman, Silberschatz, Papadimitriou, …


Rajaraman, Patnaik, Balakrishnan,
Jacob/Govindarajan …





Database Systems Lab (DSL)



Established 1995


Research Topics


Real-Time Database Systems
Distributed Transaction Management
OODBMS
Web Databases
Data Mining
XML Databases
Biological Databases
Query Optimization
Multilingual Databases
Music Databases

[Timeline labels on slide: 1995-2000 | 2000-2005 | Last few years]


Research Trajectory

[Diagram labels: Mining; XML; MIDDLEWARE; OO Models; CORE DB TECHNOLOGY; Access Methods; Transaction Processing; Query Processing]


Research Techniques


Theory
  real-time, data mining, query optimization

Simulation studies
  real-time, distributed, web dbms

Empirical evaluation
  data mining, biological, multilingual dbms, query optimization

Prototype development
  OODBMS (Flexible Manufacturing [MIDAS], VLSI [DIAS], Bio-diversity [Oshadhi, Bodhi])
  XML (Storage [LegoDB], Compression [XGrind])
  Query Optimization (Clustering [Plastic], Visualization [Picasso])
  Multilingual Databases (Cross-lingual SQL [Mira])





SPINE: Putting Backbone into
Genomic Sequence Indexing







Standard Genomic Index: Suffix Tree [Weiner 1973]

Vertically-compressed trie of suffixes augmented with links

Positions: 0 1 2 3 4 5 6 7 8 9
Data =     G T T A A T T A C T $
Query =    TTA

[Figure: suffix tree for the data string, showing Tree Edges and Suffix Links (xW → W), with a search for the query]
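
As a rough illustration of searching a trie of suffixes, here is a minimal sketch in Python. It is an assumption-laden simplification, not the structure on the slide: it builds an uncompressed suffix trie by brute force (quadratic time and space), with no vertical compression and no suffix links, so it does not have the linear-time behaviour of a real suffix tree. The names Node, build_suffix_trie, and find_occurrences are invented for this example.

```python
class Node:
    """One trie node; `starts` lists the start positions of every
    suffix whose path passes through this node."""
    def __init__(self):
        self.children = {}
        self.starts = []

def build_suffix_trie(data, terminator="$"):
    """Insert every suffix of data + terminator, one character at a time."""
    text = data + terminator
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.setdefault(ch, Node())
            node.starts.append(start)
    return root

def find_occurrences(root, query):
    """Walk down the trie along the query; the node reached knows
    every position in the data where the query occurs."""
    node = root
    for ch in query:
        node = node.children.get(ch)
        if node is None:
            return []
    return list(node.starts)

root = build_suffix_trie("GTTAATTACT")
print(find_occurrences(root, "TTA"))  # → [1, 5]
```

The query TTA from the slide is found at data positions 1 and 5, which can be checked by eye against GTTAATTACT.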


Locate all Maximal Matching Substrings [Chang & Lawler 1990]

For each position in query sequence Q, locate all longest matching substrings of length ≥ λ (a minimum-length threshold) in the indexed data sequence D

Example: D = GTTAATTACT$, Q = CTAATGA and λ = 3

Result: { TAAT:<2,1>  AAT:<3,2> }
(each pair is <start position in D, start position in Q>)
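
The example can be reproduced with a brute-force sketch in Python. This is not the suffix-tree method the talk goes on to describe; it simply tries every data position for every query position, and the function names longest_match and maximal_matches, as well as the parameter name threshold for the minimum length, are chosen here for illustration.

```python
def longest_match(data, query, q_pos):
    """Length of the longest substring of `data` that matches `query`
    starting at q_pos, plus every data position where it occurs."""
    best_len, best_positions = 0, []
    for d_pos in range(len(data)):
        length = 0
        while (d_pos + length < len(data)
               and q_pos + length < len(query)
               and data[d_pos + length] == query[q_pos + length]):
            length += 1
        if length > best_len:
            best_len, best_positions = length, [d_pos]
        elif length == best_len and length > 0:
            best_positions.append(d_pos)
    return best_len, best_positions

def maximal_matches(data, query, threshold):
    """For each query position, report the longest matches whose
    length is at least `threshold`, as (substring, d_pos, q_pos)."""
    results = []
    for q_pos in range(len(query)):
        length, positions = longest_match(data, query, q_pos)
        if length >= threshold:
            for d_pos in positions:
                results.append((query[q_pos:q_pos + length], d_pos, q_pos))
    return results

print(maximal_matches("GTTAATTACT", "CTAATGA", 3))
# → [('TAAT', 2, 1), ('AAT', 3, 2)]
```

The output agrees with the result on the slide: TAAT at data position 2 and query position 1, and AAT at data position 3 and query position 2.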



Maximal Substring Search with Suffix Tree Index

Positions: 0 1 2 3 4 5 6 7 8 9
D =        G T T A A T T A C T $
Q =        CTAATGA
Threshold λ = 3

[Figure: the suffix tree for D, traversed along tree edges and suffix links while matching Q]



Features of Suffix Tree Index

Accurate retrieval
  no false negatives (unlike BLAST)

Linear Time Complexity for both Construction and Search!
  because of Suffix-links

Widely used
  More than 40-50 applications over biological sequences [Gusfield, 2002]
  MUMmer [Celera Genomics], AVID, …


Crippling Limitation


Viable only for sequences that are short
enough for their associated suffix tree
to fit completely in main memory …


[Baeza-Yates and Navarro, 2000]




Best that has been built so far is for sequences of
~ 10 Mbp (Human Genome is 300 times longer!)


Difficulties in Supporting Suffix Trees on Long Sequences - 1

Space overheads are enormous
  Order(s) of magnitude larger than data!
  Human Genome can be easily stored in main memory (~1 GB) but the index could be of the order of 10-100 GB

Disk-resident suffix trees for long sequences
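
The space figures can be sanity-checked with a little arithmetic, sketched here in Python. Two inputs are assumptions rather than numbers from the talk: the genome length is taken as 300 times the ~10 Mbp quoted earlier, and each base is assumed to be packed into 2 bits. The index sizes of 10 and 100 GB are the slide's own range.

```python
# Assumption: genome length = 300 x 10 Mbp, from the "300 times longer" remark.
GENOME_BASES = 300 * 10_000_000
# Assumption: the four bases A, C, G, T are packed into 2 bits each.
BITS_PER_BASE = 2

def gigabytes(num_bytes):
    return num_bytes / 1e9

# Raw sequence size under the 2-bit packing assumption.
raw_gb = gigabytes(GENOME_BASES * BITS_PER_BASE / 8)

def implied_bytes_per_base(index_gb):
    """Index bytes spent per base if the whole index is index_gb gigabytes."""
    return index_gb * 1e9 / GENOME_BASES

low = implied_bytes_per_base(10)
high = implied_bytes_per_base(100)

print(f"raw sequence: {raw_gb:.2f} GB")
print(f"index cost: {low:.1f} to {high:.1f} bytes per base")
print(f"index / data: {10 / raw_gb:.0f}x to {100 / raw_gb:.0f}x")
```

Under these assumptions the raw sequence is about 0.75 GB, consistent with the "~1 GB" on the slide, and a 10-100 GB index works out to roughly 13 to 133 times the data, which is the "order(s) of magnitude" gap.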


Difficulties in Supporting Suffix Trees on Long Sequences - 2

Tree Construction on Disk is Very Slow
  Due to disk thrashing from random seeks

The active suffix creeps through the text like a caterpillar … corresponding active node swings through the tree like a butterfly [Giegerich and Kurtz, 1995]


Difficulties in Supporting Suffix Trees on Long Sequences - 3

Searching on Disk is Very Slow

Unbalanced Tree Structure
  Shape of tree depends on sequence stochastic properties

“Multi-directional” traversals
  cause disk thrashing
  Tree-Edge: “Vertical Walk-Down”
  Suffix-Link: “Horizontal Jump-Across”








Suffix Tree Search

Edge + Link mesh

Two phase Search
  Locate
  Report

Combination of Batman and Spiderman!


The SPINE* Index

A Horizontally-Compacted Trie Index

[*Sequence Processing INdexing Engine]



SPINE Index Structure

D = ACCACAC

Nodes

Forward Edges
  Vertebras (Backbone)
  Ribs / Ext-Ribs

Backward Edges
  Links

[Figure: SPINE index for D, with callouts for Root node, Vertebra, Rib, Extension rib, and Link]

Complete horizontal compaction into single linear chain!!


Structural Advantages of SPINE w.r.t. Suffix Trees

1) Number of nodes is equal to length of string, whereas in a suffix tree it can go up to double.
2) Entire data sequence explicitly embedded in index
     throw away the data!
3) On-line incremental algorithm (by definition)
     do not need to possess entire data sequence in advance
4) Node creation order and logical order are the same
     prefix-partitionable

[Figure: SPINE for D = ACCA, with nodes 0 to 4 joined by backbone characters A, C, C, A and further labels C (0) and A (1)]
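
The "up to double" node count for suffix trees in point 1 can be checked with a small Python sketch. It makes no attempt to build SPINE itself. It builds a brute-force suffix trie (with an assumed "$" terminator) and counts only the nodes that would survive path compression into a suffix tree: the root, branching nodes, and leaves. The function name suffix_tree_node_count is invented for this example.

```python
def suffix_tree_node_count(data, terminator="$"):
    """Count nodes of the suffix tree of data + terminator by building
    the uncompressed suffix trie and skipping single-child nodes."""
    text = data + terminator
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})

    count = 0
    stack = [(root, True)]
    while stack:
        node, is_root = stack.pop()
        # Nodes with exactly one child are merged into an edge by
        # path compression, so only root, branching nodes (2+ children)
        # and leaves (0 children) are counted.
        if is_root or len(node) != 1:
            count += 1
        stack.extend((child, False) for child in node.values())
    return count

for s in ("ACCA", "AAAA"):
    n = len(s) + 1  # string length including the terminator
    print(s, "length", n, "suffix tree nodes", suffix_tree_node_count(s))
```

For ACCA (5 characters with the terminator) the suffix tree has 8 nodes, and for AAAA it has 9, one short of double the length, whereas the slide's claim for SPINE is a node count equal to the string length.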



Advantages of SPINE (contd)

5) Each node represents a set of suffixes, whereas in a suffix tree each node represents only a single suffix
     Number of suffixes processed for construction and searching is smaller
6) Easy to develop buffering strategies for persistent implementations


SPINE Performance Summary

Data Sets
  E. coli: 3.5 Mbp    C. elegans: 15.5 Mbp
  HC21: 28.5 Mbp      HC19: 57.5 Mbp

Suffix Tree (MUMmer - Celera Genomics)

Spine Space
  ~ 2/3 of Suffix Tree

Spine Time
  Construction: ~ 1/2 of Suffix Tree
  Searching: ~ 1/2 of Suffix Tree


SPINE Summary


First index based on horizontal (inter-path) compaction of the trie
  Collapses into a single linear structure

Improved features and performance w.r.t. suffix trees, the classical index
  Prefix-partitionable (first index to have this property)
  Easily amenable to persistent disk implementation
  Retains linear time/space complexity
  Better construction speed and capacity
  Better search response times








Full details at http://dsl.serc.iisc.ernet.in




Questions?

