SQL Rewriting Engine and its Applications



Joshwini Pereira
Tzi-cker Chiueh
Computer Science Department
Stony Brook University



Abstract



There have been numerous efforts directed towards improving the durability of databases; applications such as Dependency Tracking, Query Level Database Backup and Query Level Restore have been built which provide novel ways of maintaining the consistency of the database. All these applications share a common trait: they rewrite the query issued by the user. The SQL Rewriting Engine is built to integrate the common framework behind the three applications and to provide a user-level API offering all the functionality of Dependency Tracking, Query Level Database Backup and Query Level Restore. The Engine also provides additional features such as Phantom Dependency Tracking, which can determine phantom dependencies in addition to the regular dependencies identified. In this paper we describe the design and implementation of the SQL Rewriting Engine and present the results of a performance evaluation run on the Engine. The Engine has been built using the .NET framework and is implemented specifically for Oracle 10g as the backend database. Thorough evaluations have been carried out and the results obtained are described in detail.


1 Introduction



Databases are corrupted either by the actions of a malicious intruder or by common human error. Restoring the database to a prior stable state involves discarding valid transactions along with the anomalous ones. In order to carry out a selective undo, Dependency Tracking is provided. This functionality tracks the interdependence of transactions and helps to isolate the malicious transactions from the legitimate ones.

Dependency Tracking [1] identifies two types of operations: DDL and DML. Any transaction involving the creation or alteration of tables is grouped into DDL operations, and all other operations such as Select, Delete, Update and Insert fall into the category of DML operations. Dependency is tracked by obtaining the Ids of the transactions that had previously modified the read set of the current transaction, where the read set consists of the rows being read or modified (DML operations) by the current transaction. By implication, the current transaction is said to 'depend' on the transactions that had previously modified its read set.
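The bookkeeping described above can be sketched in a few lines. The following is an illustrative in-memory model in Python, not the engine's .NET implementation; the table representation and function name are hypothetical.

```python
# Hypothetical in-memory model of row-based dependency tracking: each
# row carries a TRANS_ID value naming the transaction that last wrote
# it, exactly the bookkeeping the rewritten queries maintain.

def track_dependencies(table, read_set, current_tid):
    """Collect the ids of the transactions that previously modified the
    read set, then stamp those rows with the current transaction's id."""
    deps = set()
    for row_id in read_set:
        writer = table[row_id]["TRANS_ID"]
        if writer is not None and writer != current_tid:
            deps.add(writer)  # current transaction 'depends' on writer
        table[row_id]["TRANS_ID"] = current_tid
    return deps

# Rows 1 and 2 were last written by transactions 7 and 9; transaction
# 12 updates both, so it depends on {7, 9}.
xyz = {
    1: {"A1": 10, "TRANS_ID": 7},
    2: {"A1": 30, "TRANS_ID": 9},
}
deps = track_dependencies(xyz, [1, 2], current_tid=12)
```

In the real engine this dependency set is persisted in the supporting tables rather than returned to the caller.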


Dependency Tracking does not always yield accurate results and can identify false positives [2]. Dependency Tracking therefore has two separate implementations: Row Based Dependency Tracking and Column Based Dependency Tracking, where the latter provides a finer granularity for tracking the dependencies and eliminates false positives.


In order to simplify the restoration process, during DML operations (usually Update and Delete) the rows being modified by the transaction are backed up before any operation is carried out on them. This is termed the pre-image of the rows being backed up and is specifically addressed as Query Level Backup, because the backup is carried out on a query-level basis rather than by storing an entire snapshot of the database.


After the detection of database corruption, the reconstruction process involves restoring the database to a prior stable state. This is made possible through a Query Level Restore process, where the restoration of the database is carried out in reverse temporal order, with only the modified rows being updated based on their pre-images stored by the Query Level Backup process. The row Id of the row being restored, together with the specified timestamp, forms a unique identifier that helps restore the database to a specific point in time.


Once a set of malicious transactions is determined, and before any subsequent restore process can be carried out to undo these specific transactions, all the transactions with dependencies on the malicious transactions need to be identified. Apart from the direct dependencies, the transactions that have phantom dependencies on the initial Undo Set also need to be determined. Phantom Dependency Tracking determines the phantom dependencies in addition to the inter-transaction dependencies and helps in identifying the correct final Undo Set.


The SQL Rewriting Engine is built as an integral whole of three applications: Dependency Tracking, Query Level Backup and the Restoration Procedure. The common factor between the applications is that they rewrite the query issued by the user to suit their individual functionalities. The rewriting engine has an inbuilt parser that parses the incoming SQL queries and stores the parsed statements in a format that can be easily utilized by the applications. The engine offers a wide array of user-level APIs, giving the user the flexibility of choosing any desired functionality individually or of using all three functionalities as an integrated whole. The SQL Rewriting Engine rewrites the incoming queries and returns to the user the execution sequence for the rewritten queries. This ordering helps the user understand the exact sequence of events occurring at the backend and gives a clear picture of the state of the database. The engine also gives the user the option of executing the set of rewritten queries on the backend database, further simplifying the user's effort.


The Engine also provides an enhanced feature, Phantom Dependency Tracking, that can be used to determine the transactions that have phantom dependencies and that need to be undone in the face of database corruption. This auxiliary component offers additional support for identifying and recovering from failures caused by erroneous transactions.


2 Related Work



Extensive research has previously been carried out on Dependency Tracking among transactions. The ideas expressed in the papers 'A Portable Implementation Framework for Intrusion-Resilient Database Management Systems' [1] and 'Accurate Inter-Transaction Dependency Tracking for Repairable DBMS' [2] by Chiueh and Smirnov provide lucid algorithms and solutions for accurate inter-transaction dependency tracking.


The motivation behind Dependency Tracking was to determine the relationships among transactions throughout their execution sequence. The information obtained helps in identifying the exact set of dependencies shared among the transactions and in isolating the different dependencies observed. This proves very useful in the case of database corruption, where transactions affected by the actions of any erroneous transaction can be identified and selectively rolled back. The technique is useful because the entire database does not have to be rolled back, only the subset of erroneous transactions.


The Repairable DBMS [2], as it is known, is essentially a proxy between the application accessing the backend and the database being accessed. The proxy intercepts the outgoing queries from the application, transforms them based on the type of SQL query and redirects them to the backend. The transformation carried out on the query maintains additional information pertaining to dependency tracking on the backend.


The dependency information tracked by this tool helps database administrators determine the exact set of transactions to be undone. With this method, the actions of valid transactions can be preserved and need not be undone in the face of database failure.





The prototype built describes three separate implementations of dependency tracking: Row based, Column based and Selective Column based tracking. Column based tracking provides a finer granularity of tracking but at the same time incurs higher overhead. Selective Column based tracking improves on this method: a predetermined column set is provided and tracking is carried out only with respect to these specific attributes, leading to a lower overhead compared to Complete Column based tracking. Row based tracking allows a coarser level of tracking and may lead to the identification of false positives, but it has a lower overhead compared to the above-mentioned methods, as the amount of tracking information to be maintained is relatively lower.


The RDB also provides for the detection of 'phantoms', where transactions do not share direct dependencies but could have had a different effect on the database if not for certain erroneous transactions. The framework allows successful detection of these phantoms and provides the administrator with this additional information.




The RDB framework is built for portability and was successfully implemented for the PostgreSQL, Oracle and Sybase databases. Finally, performance evaluations showed that the proxy added an overhead of nearly 6%-13% at runtime.



3 SQL Rewriting Engine Design

3.1 Architecture



The Engine is built as a client-side tool that offers a rich set of APIs to the developer designing the application. Based on the functionality desired, specific APIs can be utilized to rewrite the outgoing SQL queries. The engine offers the developer the flexibility of either obtaining just the rewritten queries or having the rewritten queries executed on the backend. This allows the developer to choose from varied features and to incorporate specific functions into the application, making the engine as flexible as possible for designing any application.



The engine is designed as three distinct sections: Connection, Common and Parsing. The Connection module sets up and maintains the connection to the backend database. It also stores the connection information in the Common module, from which the other sub-modules inherit the connection details.


The Common module contains information that is common to all the sub-modules, such as the connection parameters and functions to set up the supporting tables.






The Parsing module functions as the parser and is built as a wrapper around the DLLs provided by the General SQL Parser [3]. It uses the primitive functions provided by the GSQL Parser to identify the type of each query, obtain the various parts of the statement such as the table name and the where clause, and finally store the query in a format compatible with the functioning of the rewriting engine. Additional functions, such as parsing of the Alter Table command and obtaining the parameter list of a Create Table command, had to be developed, as they were not supported by the General SQL Parser.


The Connection, Common and Parsing modules provide auxiliary functions to the core components of the Engine: Statement and SQLRewtStatement, from which the Dependency Tracking, Query Level Backup and Restore modules branch out.



The Statement and SQLRewtStatement components are the two main stubs of the rewriting engine. They each contain functions to carry out Dependency Tracking and Query Level Database Backup, but Statement provides these functionalities independently of each other, whereas SQLRewtStatement integrates them and provides the user with a common interface to manipulate both features.


This distinction was made to provide small improvements to the working of the engine. The engine requires a set of tables to store the metadata pertaining to the tables and the transactions, and this number can vary based on the features desired.


In the case of Statement, three tables need to be set up: Transactions, DDL_Dep and Trans_Dep.




- Transactions (Trans_ID, TS) stores information regarding the transactions, such as the Id and the timestamp when the transaction was initiated.

- Trans_Dep (Trans_ID, Dep_ID) stores the Ids of the transactions that a particular transaction depends upon.

- DDL_Dep (Trans_ID, Table_Name, Type, TS) keeps track of the transactions that carry out DDL operations and records the corresponding table and the type of operation (Create/Alter) carried out.





The Transactions table is common to both the dependency tracking and the query level backup modules, but Trans_Dep and DDL_Dep are needed by dependency tracking alone. Thus, depending on the type of functions invoked, the corresponding tables are created.
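The create-on-demand behaviour can be sketched as follows. This is a hedged illustration: the paper names only the columns of the supporting tables, so the SQL types and the Python helper are assumptions.

```python
# Sketch of on-demand creation of the Statement module's supporting
# tables. Column lists follow the paper; the SQL types are assumed.
SUPPORT_TABLES = {
    "Transactions": "Trans_ID INTEGER, TS TIMESTAMP",
    "Trans_Dep": "Trans_ID INTEGER, Dep_ID INTEGER",
    "DDL_Dep": ("Trans_ID INTEGER, Table_Name VARCHAR2(30), "
                "Type VARCHAR2(10), TS TIMESTAMP"),
}

def create_support_tables(features):
    """Return CREATE TABLE statements for the requested features only:
    dependency tracking needs all three tables, while query level
    backup alone needs just Transactions."""
    needed = ["Transactions"]
    if "dependency_tracking" in features:
        needed += ["Trans_Dep", "DDL_Dep"]
    return ["CREATE TABLE %s (%s)" % (n, SUPPORT_TABLES[n]) for n in needed]

ddl = create_support_tables({"dependency_tracking"})
```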



In the case of SQLRewtStatement, only two tables are created: Transactions (Trans_ID, TS, Dep_ID) and DDL_Dep (Trans_ID, Table_Name, Type, TS). This is because both dependency tracking and query level backup are carried out compulsorily when any function is invoked, so the Trans_Dep and Transactions tables can be integrated into the single table Transactions (Trans_ID, TS, Dep_ID).


Apart from the number of tables created, the Statement and SQLRewtStatement modules also differ in the manner in which the statements are rewritten.


3.2 Dependency Tracking



The dependency tracking module carries out transformations on the incoming queries and obtains and maintains the dependency information among the transactions. Trans_Dep and DDL_Dep are the tables created in order to support tracking. All the dependency information is collected over the execution sequence of a transaction and is stored in the corresponding table once the transaction is ended or committed.



The tracking carried out is specifically row based dependency tracking, where dependencies are recorded based on the set of rows being modified. Dependencies exist among transactions that update or read a set of rows that were previously created or modified by a different set of transactions.

Figure 1: The SQL Rewriting Engine resides on the client side, where it either [1] transforms and redirects the query to the backend, OR [2.1] returns the rewritten queries to the user, AND [2.2] the queries are subsequently executed on the server by the user.


Transformations on the queries are carried out based on the type of SQL statement. In the case of DDL statements such as Create Table, an additional field called TRANS_ID is added to the schema of the table being created. This field records the Id of the transaction that inserted or modified each specific row. Additionally, the Id of the transaction running the command, the current timestamp, the table name and the type of command (Create/Alter) are stored in the DDL_Dep table.
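A minimal sketch of this DDL rewrite, assuming plain string manipulation rather than the engine's GSQL-parser-based implementation; the function name and the returned DDL_Dep tuple layout are hypothetical.

```python
# Sketch: rewrite a Create Table statement to append the TRANS_ID
# column, and produce the row to be recorded in DDL_Dep alongside it.

def rewrite_create_table(sql, cur_tid, ts):
    sql = sql.rstrip()
    assert sql.endswith(")"), "expected 'Create Table name(cols)'"
    # Splice the extra bookkeeping column in before the closing paren.
    rewritten = sql[:-1] + ", TRANS_ID Integer)"
    # Table name is the token after 'Create Table', up to the '('.
    table_name = sql.split(None, 2)[2].split("(")[0].strip()
    ddl_dep_row = (cur_tid, table_name, "Create", ts)
    return rewritten, ddl_dep_row

q, dep = rewrite_create_table(
    "Create Table XYZ(A1 Integer, A2 Varchar(10))", cur_tid=5, ts="TS1")
```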



In the case of an Insert command, the query is rewritten to include the Id of the transaction issuing the Insert. For an Update statement, the Ids of the transactions that had modified the read set of the query are obtained and stored in the Dep_ID field of the Trans_Dep table. The query is then rewritten to insert the Id of the transaction currently issuing the Update statement into the TRANS_ID field of the table. For a Select statement, the read set Ids are stored in the Trans_Dep table, as the transaction is reading a set of rows that were previously updated by other transactions. Thus, with the exception of the Delete command, all the other statements can be rewritten with respect to dependency tracking.
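The Update rewrite can be sketched as a two-step execution sequence. This is a hedged illustration with hypothetical names; the engine's actual output format differs.

```python
# Sketch of the dependency-tracking rewrite for an Update: step 1
# fetches the writer ids of the read set (to be stored in Trans_Dep),
# step 2 runs the update while stamping TRANS_ID with the current id.

def rewrite_update_for_tracking(table, set_clause, where, cur_tid):
    fetch_deps = "SELECT TRANS_ID FROM %s WHERE %s" % (table, where)
    stamped = "UPDATE %s SET %s, TRANS_ID = %d WHERE %s" % (
        table, set_clause, cur_tid, where)
    return [fetch_deps, stamped]  # execution sequence, in order

seq = rewrite_update_for_tracking("XYZ", "A3 = 1.0", "A1 < 100", cur_tid=42)
```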


Finally, the rewritten transactions are ordered according to the sequence in which they need to be executed on the backend. Based on the feature selected by the user, the set of queries is either executed on the database or the rewritten set is returned directly to the user.



3.4 Query Level Database Backup



Query Level Backup entails storing only the pre-images of the rows affected by a query; this selective backup incurs a smaller overhead compared to storing the entire database image. Fundamentally, whenever a DML statement is encountered, all the rows that are to be affected by the statement are backed up into a separate table. At any point in time, when a rollback to a previous consistent state is required, the stored values can be restored onto the affected rows.


Query level backup also carries out a transformation, or rewriting, of the queries being issued to the database. For a Create command, a corresponding backup table named <table_name>_BAK is created. The backup table contains additional columns along with the attributes of the original table. The additional fields contain the Ids of the transactions modifying the set of rows, the timestamp of the operation, the row identifiers of the rows being backed up and a flag indicating whether the row was deleted or updated.


An Alter Table command is rewritten only if the command would alter the number of columns of the table, such as the deletion or insertion of additional fields.

Original Query | Rewritten Query
Create Table XYZ(A1 Integer, A2 Varchar(10), A3 Real) | Create Table XYZ_BAK(A1 Integer, A2 Varchar(10), A3 Real, Trans_id Integer, TS Timestamp, IS_DEL Integer, Rid Rowid)
Alter Table XYZ drop column A1 | Alter Table XYZ_BAK drop column A1
Update XYZ set A3 = 1.0 where C1 | Insert into XYZ_BAK (Select * from XYZ where C1)
Delete from XYZ where C2 | Insert into XYZ_BAK (Select * from XYZ where C2)


For an Update/Delete statement, all the rows to be affected by the command are stored as a backup before the actual query is executed.


Query level backup can be carried out for DML statements such as Update and Delete and for DDL statements such as Create Table and Alter Table, but Select and Insert statements cannot be rewritten with respect to query level backup, as they do not modify any existing rows.
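The backup rewrite for Update/Delete can be sketched as below. Like the paper's own Figure 2, this is only illustrative: the exact SQL the engine emits differs, and the column value encoding here is an assumption based on the backup-table schema described above.

```python
# Sketch of the backup rewrite for Update/Delete: the pre-images of the
# affected rows are copied into <table>_BAK before the statement runs.
# Extra values = transaction id, timestamp, delete flag, rowid.

def backup_rewrite(table, where, cur_tid, ts, is_delete):
    backup = ("INSERT INTO {0}_BAK SELECT t.*, {1}, TIMESTAMP '{2}', "
              "{3}, t.ROWID FROM {0} t WHERE {4}").format(
                  table, cur_tid, ts, 1 if is_delete else 0, where)
    return backup

q = backup_rewrite("XYZ", "A3 < 150", cur_tid=7,
                   ts="2013-11-02 00:00:00", is_delete=False)
```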


3.5 Database Restore



This functionality helps reconstruct the image of the database to a prior state by carrying out a query level restore. The restoration is carried out by utilizing the pre-images of the rows stored in the backup tables. Based on the timestamp, all the rows affected after the specified time are retrieved from the backup table. The retrieved rows are then restored onto the original table in reverse temporal order, with the younger rows being restored before the older ones.


The restoration process involves recreating the image of the table prior to the effects of either Update or Delete statements. During restore, row Ids of rows retrieved from the backup table that match ones in the original table indicate that the row was updated. These rows in the original table are then replaced by their corresponding values from the backup table.



Row Ids from the backup table that do not match the original table, along with flags indicating that the row was deleted, identify deleted rows. These rows are then inserted into the original table.

_____________________________________________
* The rewritten query is not in the exact format used by our implementation; Figure 2 is for illustration purposes only.

Figure 2:* A simple example of the transformations carried out on a query with respect to query level backup.


The user is required to specify a timestamp for the restore application, and all the rows that were modified after the specified point in time are restored onto the original table in reverse temporal order.


This feature provides the user with the flexibility to restore the rows of only a particular table, restore rows modified by a particular transaction, restore all the rows that were updated in the original table or restore all the rows that were deleted from the original table.
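The restore procedure can be sketched over in-memory rows; this is an illustrative model, not the engine's SQL, and the dictionary-based row layout is an assumption.

```python
# Illustrative query level restore: backup rows newer than the target
# timestamp are replayed youngest first (reverse temporal order);
# IS_DEL rows are re-inserted, matching row ids are overwritten.

def query_level_restore(live, backup_rows, target_ts):
    for row in sorted(backup_rows, key=lambda r: r["TS"], reverse=True):
        if row["TS"] <= target_ts:
            continue  # row was not modified after the restore point
        pre_image = {k: v for k, v in row.items()
                     if k not in ("RID", "TS", "TRANS_ID", "IS_DEL")}
        if row["IS_DEL"] or row["RID"] in live:
            live[row["RID"]] = pre_image
    return live

xyz = {1: {"A1": 99}}  # current (corrupted) image
bak = [{"RID": 1, "A1": 10, "TS": 5, "TRANS_ID": 3, "IS_DEL": 0},
       {"RID": 2, "A1": 30, "TS": 6, "TRANS_ID": 4, "IS_DEL": 1}]
restored = query_level_restore(xyz, bak, target_ts=4)
```

After the restore, the updated row regains its pre-image and the deleted row is re-inserted.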


4 Extensions to Dependency Tracking

4.1 Motivation



Row based dependency tracking is not a panacea for determining the exact interrelationships shared among the various transactions. Dependency is tracked by recording the Ids of the transactions that had previously modified the rows being read by the current transaction. This method of tracking can lead to the identification of false positives.


Consider a scenario in which two transactions T1 and T2 are both modifying a table XYZ.

Row Id | A1 | A2 | A3
1 | 10 | 20 | 100
2 | 30 | 40 | 200

T1: Update XYZ set A1 = 20 where A3 < 150
T2: Update XYZ set A2 = 50 where A3 < 150




In the above scenario, the Id of T1 will be recorded as the modifying transaction for row 1, because it updates an attribute of that row. When the query is issued by T2, the Id of T1 is retrieved as a transaction upon which T2 has a dependency, because T1 had previously modified the row being accessed by T2. Upon careful scrutiny, we realize that the two transactions are updating different attributes of the same row and do not share any interdependency. Such a situation is termed a false positive.



Row based dependency tracking can lead to
similar conditions
where it

incorrectly identifies

relationships
that do not really

exist
.


4.2 Column Based Dependency Tracking



In order to prevent the detection of false dependencies, a finer level of tracking needs to be carried out. Column Based Dependency Tracking plays a pivotal role in eliminating false positives and identifies the exact set of dependencies among the transactions. Tracking is carried out at a finer granularity, where transactions are tracked based on the attributes that they modify. This eliminates the possibility of incorrectly identifying false positives, as only transactions that had previously modified the attribute being accessed are recorded as dependencies.



Every table created contains additional fields that store the dependency information on a per-attribute basis. This is known as Complete Column Based Dependency Tracking [2], where auxiliary information is stored for every attribute in the table. For a Create Table statement, a field <attribute_name>_DEP corresponding to every attribute of the table is also created. This new field records the Id of the transaction that most recently updated the corresponding attribute.

For an Insert statement, the Id of the transaction issuing the command is stored in every <attribute_name>_DEP field.


In the case of an Update statement, the values of the <attribute_name>_DEP fields corresponding to the fields in the where clause of the statement are retrieved. Next, the Id of the transaction executing the command is inserted into the corresponding '_DEP' field based on the attribute being modified. The retrieved set of Ids identifies the transactions that the current transaction has a dependency upon.


A Select statement is rewritten in a manner similar to an Update statement, but the '_DEP' fields are not updated with the Id of the transaction issuing the command.


Original Query | Rewritten Query
Create Table XYZ (A1 Integer, A2 Varchar(10), A3 Real) | Create Table XYZ (A1 Integer, A2 Varchar(10), A3 Real, A1_DEP Integer, A2_DEP Integer, A3_DEP Integer)
Insert into XYZ(A1,A2,A3) values (1,ABC,2.0) | Insert into XYZ(A1,A2,A3,A1_DEP,A2_DEP,A3_DEP) values (1,ABC,2.0,Cur_TID,Cur_TID,Cur_TID)
Update XYZ set A1 = 10 where A2 = MNO | Select A2_DEP from XYZ where A2 = MNO; Update XYZ set A1 = 10, A1_DEP = Cur_TID where A2 = MNO
Select A3 from XYZ where A1 = 10 | Select A1_DEP from XYZ where A1 = 10

Figure 3: An example of the rewriting carried out with respect to complete column based tracking.


Thus the above method helps isolate disjoint sets of transactions that modified the same row but different columns.
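The column-based rewrite for an Update can be sketched as follows (compare Figure 3); the helper and its parameters are hypothetical, and the engine's real output is produced by its parser rather than string formatting.

```python
# Sketch of the column-based rewrite for an Update: dependency ids come
# from the _DEP fields of the where-clause attributes, and only the
# modified attribute's _DEP field is stamped with the current id.

def rewrite_update_column_based(table, set_attr, set_val,
                                where_attrs, where, cur_tid):
    dep_cols = ", ".join(a + "_DEP" for a in where_attrs)
    fetch = "Select %s from %s where %s" % (dep_cols, table, where)
    update = "Update %s set %s = %s, %s_DEP = %s where %s" % (
        table, set_attr, set_val, set_attr, cur_tid, where)
    return [fetch, update]

# Reproduces the Update row of Figure 3.
seq = rewrite_update_column_based(
    "XYZ", "A1", "10", ["A2"], "A2 = MNO", cur_tid="Cur_TID")
```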



4.3 Identifying Phantom Dependencies




Column based tracking can also lead to false negatives [2], where a transaction could have produced a different outcome if some previous transaction had not modified a certain set of rows. Although such transactions are independent of each other, in that no dependency exists between them, a subtle relationship does exist in certain scenarios.


Consider the two transactions T1 and T2 updating a table XYZ.

Row Id | A1 | A2 | A3
1 | 10 | 20 | 100
2 | 30 | 40 | 200

T1: Update XYZ set A3 = 500 where A2 = 40
T2: Update XYZ set A1 = 1 where A3 < 400




In this scenario, T1 updates only row 2. The read set of T2 corresponds only to row 1, and it updates attribute A1 accordingly. If we identify T1 as an erroneous transaction, then T2 will not be present in the Undo Set, as it does not have a direct dependency on T1. But in the absence of T1, the outcome of T2 would have been different: it would have updated both rows. Thus T2 is said to have a phantom dependency on T1.


In the face of database corruption due to a malicious transaction, the exact set of transactions to be rolled back needs to be determined. This process also involves the detection of phantom dependencies. For this we maintain two images of the table to be rolled back: a Pre Image table and a Current Image table.

The Pre Image table, <original_table_name>_PRE, consists of all the rows that were present in the table before any modification was made by the erroneous transaction. The Current Image table contains the effects of the execution of the erroneous transaction. To determine phantoms, the 'WHERE' clause of every transaction succeeding the erroneous transaction is executed against both the Pre Image and Current Image tables. If the two tables yield differing result sets, we conclude that phantoms exist. The result sets differ when rows returned from the Pre Image table are absent from the result set of the Current Image table.


Let us assume that all the transactions belong to a universal set T. Let the set of transactions that need to be undone be represented by the set Sc, and let Sn = T - Sc represent the set of transactions that are not in the Undo Set and need to be processed before we consider undoing them. The following algorithm is used to determine whether any phantom dependency exists between the elements of Sc and Sn.



- We process the transactions in ascending order of timestamps. For every transaction T' in Sc, obtain all the columns modified by it and add them to a set C if they were not already present. For any new column added to C, add the column to the Pre Image table and restore the values of that column to the point in time before any modification was carried out by T'.

- For every transaction T'' in Sn, we run the 'WHERE' clause of the transaction against the Pre Image table and obtain a set of rows X. Next, we use the Backup table (created by the Query Level Backup function) to obtain the rows corresponding to the Current Image table. The Backup table consists of the pre-images of all rows that were modified by transaction T'', that is, the rows that satisfy the 'WHERE' clause of T''. These images correspond to the effects of all previous transactions that had already modified the table prior to T''; they capture the exact sequence of events that occurred due to the actions of the malicious transaction and hence map to the 'Current Image' of the table. All the rows Y corresponding to T'' are retrieved from the Backup table, and these form the Current Image rows.

- A comparison is carried out, and if X is not equal to Y, then a phantom is detected. The transaction T'' is added to the set Sc and is processed as a regular erroneous transaction.
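The comparison step for a single T'' can be sketched as a small Python model; the row representation and function name are hypothetical, and the real engine performs these set computations through SQL.

```python
# Illustrative phantom check for one transaction T'' in Sn: X is T''s
# where clause evaluated over the Pre Image rows, Y is the set of rows
# T'' actually touched according to the backup table.

def has_phantom(pre_image, backup_rows, t2_id, where_pred):
    X = {rid for rid, row in pre_image.items() if where_pred(row)}
    Y = {r["RID"] for r in backup_rows if r["TRANS_ID"] == t2_id}
    return X != Y  # differing result sets => phantom dependency

# Paper's worked example: with T1 undone, A3 of row 2 is back to 200,
# so T2's predicate A3 < 400 matches both rows, but T2 only touched
# row 1, hence a phantom is detected.
pre = {1: {"A3": 100}, 2: {"A3": 200}}
bak = [{"RID": 1, "TRANS_ID": 2}]
phantom = has_phantom(pre, bak, t2_id=2, where_pred=lambda r: r["A3"] < 400)
```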



In order to support the efficient detection of phantoms, we made the following assumptions: every transaction modifies only a single table, and every transaction issues only a single SQL command. With these assumptions in mind, we made the following design change:



- The TRANSACTIONS table that stores metadata pertaining to the transactions affecting the database was modified with an updated schema: TRANSACTIONS (TRANS_ID, TS, DEP_ID, TABLE_NAME, WHERE_CLAUSE, COL_LIST), where TRANS_ID stores the Id of the transaction issuing the query, TS corresponds to the timestamp of the operation, DEP_ID contains the set of dependencies for that transaction, TABLE_NAME corresponds to the table being accessed, WHERE_CLAUSE contains the 'WHERE' clause of the query and COL_LIST contains all the columns that were modified by the transaction.


Let us consider a simple example with two transactions T1 and T2, both accessing the table XYZ.



Row Id | A1 | A2 | A3
1 | 10 | 20 | 100
2 | 30 | 40 | 200

T1: Update XYZ set A3 = 500 where A2 = 40
T2: Update XYZ set A1 = 1 where A3 < 400



After the execution of the above queries, the table XYZ will have the following image.

Row Id | A1 | A2 | A3
1 | 1 | 20 | 100
2 | 30 | 40 | 500



The backup table XYZ_BAK corresponding to table XYZ will appear as follows.

Row Id | A1 | A2 | A3 | Trans_ID
2 | 30 | 40 | 200 | 1
1 | 10 | 20 | 100 | 2



Let us set Sc = {T1}; this automatically implies that T2 is in Sn. Initially, the Pre Image table XYZ_PRE will contain the pre-image values of only the attribute A3, as it was the column updated by T1.


Row Id | A1 | A2 | A3
1 | 1 | 20 | 100
2 | 30 | 40 | 200



Next, the 'WHERE' clause of T2 is run against the Pre Image table XYZ_PRE, and the rows corresponding to row Ids 1 and 2 are obtained. This implies X = {1, 2}.


Then the row corresponding to the Current Image of XYZ, that is, the backup image of the rows modified by T2 in the backup table XYZ_BAK, is obtained, and Y is set to {1}.


Next, X and Y are compared, and we find that an additional row has been retrieved from the Pre Image table XYZ_PRE. From this we conclude that T2 has a phantom dependency on T1. T2 is then added to the set Sc and is processed as a regular erroneous transaction.


5 Performance Evaluation

5.1 Experimental Setup


The experimental setup consisted of two machines on a 100 Mbps local LAN. The machines were organized as in Figure 1 and consisted of a server and a client machine. The server housed the Oracle 10g database and was set up on a Linux 2.6 FC4 box with a processor speed of 2.8 GHz, 92 GB of disk space and 1 GB of memory. The client was a Microsoft Windows XP laptop with 80 GB of hard disk capacity, a processor speed of 1.5 GHz and 512 MB of RAM. The .NET 2.0 framework was set up on the client machine to support the SQL Rewriting Engine.



5.2 TPCC Benchmark




Performance testing was carried out on the Engine by measuring the overhead incurred by employing the functions of the engine. Tests were carried out according to the TPCC Benchmark [6], which simulates the real-time transactions of a business processing customer orders. The benchmark specifies a mix of read/write and read-intensive workloads. The footprint size of the database was varied by altering the 'warehouse' size. The measurements for the read/write transactions and the read-intensive transactions were taken separately, and the overhead was recorded for varying footprint sizes.




The read/write transactions consisted of an interspersed mix of New Order, Payment and Delivery operations. The read-intensive transactions consisted of a series of read-only Stock Level queries. The initial tests were carried out with W=2, which had about 700 read/write and 3000 read-intensive transactions. Subsequent tests were carried out by varying the warehouse factor in increments of 2 up to W=10, which had a total of 19000 read/write transactions and 5000 read-intensive transactions.


5.3 Dependency Tracking with Query Level Database Backup




The tests were carried out by invoking the Dependency Tracking and Query Level Backup functionalities from the Engine. Four tests were carried out: Column-Based Dependency Tracking with Query Level Backup under read/write load, Row-Based Dependency Tracking with Query Level Backup under read/write load, Column-Based Dependency Tracking with Query Level Backup under read-intensive load, and Row-Based Dependency Tracking with Query Level Backup under read-intensive load. The results are shown in Figures 4 and 5 respectively. The measurements displayed indicate the ratio of increase in processing time with the invocation of functions from the Engine.

The results indicate that there exists a very large overhead (on the order of 1000%) for both column- and row-based dependency tracking along with query level backup. The reason for this stems from the fact that the Dependency Tracking and Query Level Backup functionalities are integrated together, so a higher load is placed on the system in terms of query interception, parsing and rewriting. Due to the additional burden of these operations, the overhead turns out to be quite overwhelming.


5.5 Phantom Dependency Tracking



The phantom dependency detection module was executed after the initial run of all the transactions. An undo set was provided as an input, based on which phantoms were detected, and the corresponding direct dependencies along with the phantoms were added to a final undo set. The overhead of invoking Phantom Dependency Tracking (Figure 6) was noted and was found to hover around a constant range of 100%-110% even when provided with larger undo sets.

Results also indicate that the initial discovery of new dependencies (Figure 7) with an undo set of size 20 is quite high, but with larger undo sets the percentage significantly decreases. This is because, with the larger undo sets, the same set of phantoms is likely to be detected, which does not lead to a significant increase in the number of new dependencies discovered.













Figure 4: Row/Column Based Dependency Tracking along with Query Level Database Backup under read/write intensive load. The figure indicates the ratio of increase in processing time after employing the above mentioned functionalities.

Figure 5: Row/Column Based Dependency Tracking along with Query Level Database Backup under read-intensive load. The figure indicates the ratio of increase in overhead after employing the above mentioned functionalities.

Figure 6: Phantom detection overhead with varying Undo Set sizes on the X axis and the corresponding overhead on the Y axis.

Figure 7: Percentage of additional dependencies discovered due to the detection of phantoms given an initial Undo Set. The Undo Set sizes are varied along the X axis and the % of additional dependencies detected is recorded on the Y axis.


6 Future Work

6.1 Multi Tenant Data Architecture




Multi Tenant Data Architecture [8] is a SaaS (Software as a Service) application pattern that aims to offer the user differing levels of flexibility for centralized control and storage of user data. SaaS deals with the centralized management of user data, where accessing data is more efficient than with a locally installed application. The user needs to surrender some level of control over their data to the SaaS vendor, and the vendor in turn needs to provide efficient measures to ensure that the user data will not be compromised at any point in time.


The user can choose from varying levels of data isolation; that is, they can decide the manner in which their data is to be stored. Multi Tenant Data Architecture (MTDA) offers three levels of isolation: Separate Databases, Shared Database Separate Schema, and Shared Database Shared Schema. Separate Databases involves storing the tenant data in separate databases and is more suitable for larger organizations that store a large amount of data and maintain highly sensitive data. The Shared Database Separate Schema architecture allows tenants to share the same database but maintains separate tables for different tenants. Shared Database Shared Schema has tenants sharing the same database and the same tables, where a level of security is built into the system to prevent unauthorized access to other tenants' data. The latter two schemes are more suitable for tenants that prefer a lower cost of maintaining their data and are ready to take the risk of co-locating their data with other customers.


The two shared approaches discussed above have a high initial setup cost because customized services need to be built to ensure that data is accessed only by authorized users. At the same time, in case of corruption of a particular tenant's data, a costly restoration procedure must be carried out in order to avoid affecting the other tenants' data.


The architecture suggests the use of a Tenant View Filter for ensuring isolation among tenants in a shared approach. Here, based on the SID of the tenant, only rows belonging to that tenant are retrieved. This offers a filtered view to any tenant accessing the database.
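A minimal sketch of how a Tenant View Filter might rewrite a query, assuming a shared table carrying an SID column. A real implementation would go through the engine's SQL parser rather than string matching, and the table and column names here are illustrative:

```python
def apply_tenant_filter(query, sid):
    """Append a tenant-id predicate so only the caller's rows are visible.

    Naive sketch: handles a single-table SELECT with or without an
    existing WHERE clause. Numeric SID assumed for simplicity.
    """
    q = query.rstrip().rstrip(';')
    clause = f"SID = {sid}"
    if ' where ' in q.lower():
        return f"{q} AND {clause}"
    return f"{q} WHERE {clause}"

print(apply_tenant_filter("SELECT * FROM Orders", 42))
# -> SELECT * FROM Orders WHERE SID = 42
```

In practice the predicate would be added as a bind variable through the parser's AST, since string concatenation of tenant identifiers invites SQL injection.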


The architecture also discusses customization of the schema for various tenants accessing the shared table. Pre-allocated Fields and Name-Value Pairs were some of the approaches suggested. Pre-allocated Fields consists of preset custom columns that any tenant can extend based on their requirements. Name-Value Pairs consists of two levels of indirection, where a pointer from the original table points to a table containing the custom field value. The custom field value table contains an additional pointer that points to a table containing the metadata pertaining to that specific column. This does not restrict the tenant to a preset number of columns, and extensibility is made as flexible as possible. Thus, with these varied features, tenants are given a choice of different levels of customization.
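The two levels of indirection in the Name-Value Pairs scheme can be illustrated with in-memory dictionaries standing in for the three tables; all table, key and field names below are hypothetical:

```python
# Metadata table: describes each custom column.
field_metadata = {
    "m1": {"name": "loyalty_tier", "type": "VARCHAR2"},
}
# Custom value table: each value points to its metadata row.
custom_values = {
    "v1": {"meta_id": "m1", "value": "gold"},
}
# Base row: points to its custom value rows (first level of indirection).
base_row = {"order_id": 1001, "custom_value_ids": ["v1"]}

def resolve_custom_fields(row):
    """Follow both pointers to produce {column_name: value} pairs."""
    resolved = {}
    for vid in row["custom_value_ids"]:
        entry = custom_values[vid]          # first hop: value row
        meta = field_metadata[entry["meta_id"]]  # second hop: metadata
        resolved[meta["name"]] = entry["value"]
    return resolved

print(resolve_custom_fields(base_row))  # {'loyalty_tier': 'gold'}
```

The cost of the scheme is visible here: every custom-field read requires two extra lookups (joins, in the database), which is the price paid for unbounded extensibility.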


In the face of corruption of a particular tenant's data, the restore process involves the complete restoration of the database onto a separate temporary database. Next, only the tables corresponding to the affected tenant are exported from the temporary database onto the original database. This expensive procedure is carried out to prevent modification of the unaffected tenants' data in the database.


6.2 Extensions to SQL Rewriting Engine




The SQL Rewriting Engine can be further enhanced to support Multi Tenant Data Architecture. MTDA employs costly methods to restore tenants' data in the case of data corruption. We can adapt the Query Level Backup functionality and the Restore functionality of the Rewriting Engine to carry out restore in MTDA. We can store the pre-images of rows corresponding to different tenants in the backup table prior to their modification. This data can also be replicated on different servers to ensure durability. During the restore process, only the rows corresponding to a particular tenant are updated onto the original table, thus preserving the data of other tenants. This selective method of restoration will work out to be cheaper than the current method employed by MTDA and can service each tenant independently of the others.
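This selective restore can be sketched as follows, with dictionaries standing in for the shared table and the pre-image backup table; the schema (a tenant_id column on every row) and all names are illustrative assumptions, not the engine's actual data model:

```python
def selective_restore(live_table, backup_rows, tenant_id):
    """Restore only one tenant's rows from the pre-image backup.

    live_table:  {row_id: row_dict} for the shared table
    backup_rows: pre-images saved by Query Level Backup before modification
    Rows belonging to other tenants are left untouched.
    """
    for row_id, pre_image in backup_rows.items():
        if pre_image["tenant_id"] == tenant_id:
            live_table[row_id] = dict(pre_image)
    return live_table

# Tenant A's row was corrupted; tenant B's row must survive the restore.
live = {1: {"tenant_id": "A", "qty": 9}, 2: {"tenant_id": "B", "qty": 5}}
backup = {1: {"tenant_id": "A", "qty": 3}, 2: {"tenant_id": "B", "qty": 4}}
restored = selective_restore(live, backup, "A")
print(restored[1]["qty"], restored[2]["qty"])  # 3 5
```

Note how tenant B's current value (5) is preserved even though a pre-image for it exists; only the requested tenant's rows are rolled back.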


Features for the extensibility patterns of the MTDA can also be incorporated into the Rewriting Engine. Based on the tenant issuing the query, the command can be transformed depending on the custom columns created by that tenant. The rewriting can be carried out for both Pre-allocated Fields and Name-Value Pairs, depending on the functionality chosen for the architecture. After analyzing the source of the query, the command can be split or rewritten to access the different tables storing the custom column information. In this manner the tenant will not have to keep track of the complexity involved on the backend; rather, the rewriting is done transparently in order to service each customer. Thus the Rewriting Engine can be enhanced to incorporate additional features that support the Multi Tenant Data Architecture.






7 Conclusion




The SQL Rewriting Engine integrates the common functionalities of three different applications, Dependency Tracking, Query Level Database Backup and the Query Level Restore function, and provides a common framework to manipulate the various features of these applications. The user level API offers sufficient flexibility for users to understand and employ the features provided. Phantom detection allows for accurate determination of the exact set of transactions to be undone and thus proves to be a very helpful tool before any restore process. Results indicate that Phantom Dependency Tracking incurs a relatively low overhead for determining the final undo set. Also, a large number of new interdependencies are detected by this function, providing a more accurate picture of the transactions to be undone. The Engine can be further refined to provide additional functionalities to support Multi Tenant Data Architecture. Thus, with the APIs provided, complex applications can be developed to suit the requirements of any data driven architecture.



8 References

[1] Alexey Smirnov, Tzi-cker Chiueh, A Portable Implementation Framework for Intrusion-Resilient Database Management Systems, in Proceedings of DSN 2004, Florence, Italy, 2004.

[2] Shweta Bajpai, Alexey Smirnov and Tzi-cker Chiueh, Accurate Inter-Transaction Dependency Tracking for Repairable DBMS.

[3] General SQL Parser - http://www.sqlparser.com/

[4] Microsoft .NET 2.0 Framework - http://msdn2.microsoft.com/enus/netframework/default.aspx

[5] Oracle 10g Database Server - http://www.oracle.com/technology/software/products/database/index.html

[6] TPCC Benchmarking - http://www.tpc.org/tpcc/

[7] C# Reference Library - http://msdn2.microsoft.com/en-us/library/default.aspx

[8] Multi Tenant Data Architecture - http://msdn2.microsoft.com/enus/library/aa479086.aspx