Umbrella - People - Kansas State University

assistantashamedData Management

Nov 29, 2012 (4 years and 10 months ago)

294 views


1



Umbrella
DB

Virtual Data Base
Architecture

to Integrate

Heterogeneous Data Sources


Eric T. Matson


CIS 864 Data Engineering


Kansas State University


Manhattan, KS, USA, 66506

etm7766@cis.ksu.edu




Abstract


The UmbrellaDB Virtual Database Architectur
e’s main purpose is to reconcile the
heterogeneity
occurring

in large organizations.
Umbrel
la
DB
provides a

generic

architecture for
multiple,
heterogeneous data

sources to be efficiently accessed and
utilized from a unified, simple

and intuitive

applicati
on
.

It presents the varied data
sources to the user as a single virtual database schema and allows the user to query and
manipulate the data sources with an easy
-
to
-
u
s
e
,

“point and click” Graphical User
Interface

(GUI)
.

The UmbrellaDB tool uses the Stand
ard Query Language

(SQL)

as its
base query language so there is no requirement to learn a new language or syntax.



1

Introduction


Large organizations, with a great number of people, products, customers, and
services are vast repositories of information.


As organizations grow and change, the data
organization requirements and needs
evolve
. Far too often the
ability to

proactively

manage the
organizational information requirements cannot keep pace with the actual


2

information growth.

Enterprises tend to r
epresent their data using a variety of data
models and schemas while users drive for dat
a

integration and cohesiveness. [3]


Organizations naturally create
information storage and classification
heterogeneity. The heterogeneity can be purposeful, which me
ans there are specific and
just reasons for selecting and implementing multiple platforms, systems and
architectures. Heterogeneity can be driven by acquisitions and mergers. If an
organization takes control of

or inherits

another organization, the data
needs of both have
to be

considered and

managed for the greater good of the resulting new organization

structure
. The worst cast is of the heterogeneity resulting from un
intended or accidental
purposes. Normally this is prop
a
gated by a lack of strategy,
lack o
f

discipline or

instances of
poor execution.

The result is a unorganized and fragmented organization
-
wide data schema with very few or no intersections of which to integrate the data to
provide information.


There are many perfectly justifiable rea
sons for not changing the
residence or
format of organizational information. It may be cost prohibitive to implement new
technology, either from a human resource or financial resource perspective

[1]
.

Implemented technology or infrastructure may be obsol
ete not allowing the organization
to migrate to a newer instance or release. The implementation may be of a short life span
where there is no Return on Investment (ROI) or Net Present Value (NPV) justification
for migrating from the information platform.


To approach the problem of data and information heterogeneity, an organization
must be able and willing to develop an enterprise level schema capable of representing
the minimal set of relations required to

provide information to
answer questions
and


3

solv
e problems. One perspective to
view this type of technology architecture is that of a
virtual data warehouse. The movement toward data warehousing is recognition of the
fact that the fragmented information that an organization uses to support day
-
to
-
day
operations at a department level can have immense
s
trategic
value when brought
together
.

[2]


In this paper, I address the problem of data source heterogeneity and integration
very specifically.

I will introduce the UmbrellaDB tool which implement
s a

Virt
ual
Database Architecture (VDA).



2

Problem Statement


An organization’s need to integrate its data is not a single requirement with minimal
constraints.

Th
e organization must have the ability to change the configuration of a
global data schema very quickl
y to adjust to changes in business and to satis
f
y temporal
aspects of organization information evolution.

The organization must also have the
ability to integrate data sources of many types existing on many platforms.


The tool
used to integrate the data

sources must be intuitive to use and not require a steep learning
curve or intimate knowledge of all data source architectures to configure.



3

UmbrellaDB Virtual D
atabase Introduction




The
UmbrellaDB

architecture

that I
present in this paper, simplifies

the process of
integrating data of different formats across heterogeneous platforms and architectures.
Data is integrated at the user level. The user sees a virtual view of all data sources
integrated into a single conceptual database. All technical ar
chitecture details, of where

4

the data is located and configured, are hidden from the user. The user sees only a
graphical database where they can “point
-
and
-
click” to formulate queries in order to
retrieve the required information.


UmbrellaDB is a
n exa
mple of
a
V
irtual
D
atabase
A
rchitecture. Schemas of all
data sources are integrated within the UmbrellaDB workbench so that it appears a user is
working with a single
,
unif
ie
d global schema.

A conceptual, high
-
level description of the
UmbrellaDB VDA is
d
escribed by Figure 1.


France
Oracle
Database
Object
Database
China
PostgreSQL
Database
USA
Text
File
USA
Text
File
local
CLIPS
Database
local
TCP/IP
TCP/IP
UmbrellaDB

Figure 1: UmbrellaDB Virtual Database



4

Architecture Overview



The general architecture of the UmbrellaDB VDA is shown in Figure 2.

The architecture
consists of three main components: The UmbrellaDB GUI and Application
Programmi
ng Interface (API). The GUI is the tool to be used if the user needs to execute
queries across data sources and look at information within the tool’s workbench. The

5

API is necessary if the user wants to develop additional programs in Java or C or C++ to
use all of the data sources in a real time environment as a single active database schema.


The second part of the architecture is the DataServer. The DataServer acts similar to a
Relational Database Management System (RDBMS) except it accesses and manag
es
formatted and delimited text files. The allows the
textual files to easily be accessed and
manipulated similar to RDBMS tables.


API
Umbrella
GUI
Parser
Splitter
Router
Unifier
Profile
Engine
D
a
t
a
C
o
n
n
e
c
t
o
r
s
TCP/IP
TCP/IP
Text
File
KB
File
Relational
Database
Object
Database
Data Sources
Data Server

Figure 2: UmbrellaDB Architecture



The third architecture piece is the UmbrellaDB Engine which is the heart of the
appli
cation. IT has six distinct parts that require additional explanation: Parser, Splitter,
Router, Unifier, Profile and Data Connectors.


The data sources are represented as schemas. A data source
Profile

stores that
source’s metadata. Profile metadata
such as the number of records and size of records is
used by the
UmbrellaDB

engine to optimize queries prior to execution.

The
UmbrellaDB

Engine Agent continually monitors the defined data sources to retrieve and update the
data source’s Profile. A
Data
Connector

is an intelligent agent that acts as an interpreter

6

for each type of data source. It interprets and communicates the query to each data
source and translates query results back to the
UmbrellaDB

engine. The GUI displays
the
parts of a

in a use
r
-
friendly and organized manner, which assists the user to easily
build queries using “point
-
and
-
click” technology.


The main components of the
UmbrellaDB

Engine Agent are:
Parser
,
Splitter
,
Router

and
Unifier
. Upon execution of a query, the
UmbrellaDB

e
ngine propagates
through a series of steps:



Parse

the query to insure correctness



Split

the query into sub
-
queries relevant to each data source being accessed



Route

the queries to the data source for processing



Unify

the result sets of all of the returned
queries into a single, unified relation


Query optimization across the data sources will be conducted by two general processing
algorithms: parallel and sequential execution. A decision is made at the time of the query
execution whether to execute the que
ry in parallel or sequential mode. The intelligent
agent engine that executes and manages the query plan is based on a metadata description
of the data sources. The goal of the query optimization process is to minimize the
defined cost during query execu
tion. Cost is defined by amount of data transported
across the network, access time, input/output, response time and other relevant factors.



5

Data Sources integrated with UmbrellaDB


UmbrellaDB is a flexible architecture designed to support
m
any data s
ource type
s
.

Examples of the data

source types supported are: formatted text files, delimited text files
and relational databases that can be accessed via a Java Database Connector (JDBC).
Future research and development will provide for integration wit
h knowledge based
formats such as Prolog and CLIPS files, non
-
relational data sources, such as Key

7

Sequenced Data Sets (KSDS), Open Data Base Connectivity (ODBC) data sources,
B
-
tree data sources and traditional mainframe technologies such as IMS.



6

E
xampl
e
Schema


UmbrellaDB
Informix
MS Excel
Text
PostgreSQL
Data Connector
JDBC
JDBC
ODBC
Room
room_no
hotel_no
type
price
Hotel
hotel_no
hotel_name
address
Booking
hotel_no
guest_no
date_from
date_to
room_no
Guest
guest_no
guest_name
no_adults
no_children
address



Figure 3: Example Virtual Schema


Figure
3

provides an

example to be used throughout the description of the Query Engine
processes.

It is a

specific e
xample of
the

UmbrellaDB virtual database schema. There
are sub
-
schemas representing four

data sources: an Informix Relational Database
Management System (RDBMS), a PostgreSQL Object Relational Database Management
System (ORDBMS), a Microsoft Excel spreadsheet ODBC data source, and a formatted
text file.


7

Query Process


The Qu
e
ry Engine is res
ponsible for processing queries

provided by the GUI or the API.
.
A query is pro
cessed in four distinct steps: Parsing, Splitting, Routing and Unification.
The query is represented by SQL language constructs.


8


There are two exec
u
tion processes that the Qu
ery Engine

can use to most
efficiently process a query over a set of heterogenenous data sources
: parallel and
sequential. The decision is made by the Query Engine Agent upon the execution of a
query. If it is more efficient to split a query into paralle
l processes and run them
simultaneously, the Engine Agent will select that option. If it is more efficient to
sequentially execute a query, the agent will select that option.

The decision is made by
evaluating the meta
-
data, of each data source involved
in the query, and calculating the
most efficient solution.


7.1

Query
Parsing


When a Query is generated graphically through the Umbrella GUI Workbench or passed
as an SQL statement from the API, it is parsed to test correctness and validity.

Splitting
the qu
ery into parts is the initial step of the pars
ing process
. The parts are strings that
contain a specific part of the query. The second step

check
s

the query for the correct
syntax.

A

sample query:


SELECT



hotel.hotel_no, hotel.hotel_name, booking.room
_no, room.price

FROM

hotel, booking, room

WHERE


hotel.hotel_no

=
booking.hotel_no

AND



room.room_no

=
booking.room_no
;


Is decomposed into a set of Tokens:


Token
: SELECT

Token
: Field List

Token
: FROM

Token
: Table List

Token
: WHERE

Token
: Conditionals

List



9

The SELECT, FROM, and WHERE Tokens are checked to insure the spellings are
correct. The Field List is checked against the Profile database to make sure all requested
fields are valid, spelled correctly, and exist in the schema. The Table List is
checked to
insure that all listed tables are valid, spelled correctly and are registered data sources in
the global schema. The Conditionals List checks to make sure that all conditional fields
are valid, spelled correctly, and exist in the schema. It al
so checks the operands and
syntax of each conditional statement. The final check it performs is to insure that the
data
types of any conditional comparison
match.



7.2

Query Splitting


After the query is successfully parsed, it is forwarded to the Splitter.

The Splitter
decomposes the query into a set of sub
-
queries. Each data source represented will
process a

sub
-
query.

Using the sample query again:


SELECT


hotel.hotel_no, hotel.hotel_name, booking.room_no, room.price

FROM

hotel, booking, room

WHERE


h
otel.hotel_no = booking.hotel_no AND



room.room_no = booking.room_no;


Is decomposed into a set of sub
-
queries:


SELECT

hotel_no, hotel_name
FROM

hotel;

SELECT

room_no, hotel_no
FROM

booking;

SELECT

room_no, price from
FROM

room;


These sub
-
queries are
not optimized in this example.

The sub
-
queries are passed to the
Router and the conditionals would be used by the Unifier to perform the necessary joins
upon successful return from the Router.



10


7.3

Query Routing


The Query Router will take the output of the
Query Splitting process and send it to the
appropriate data source for processing.

If the query is successful and data is returned it
will be in the form of a UmbrellaDB dataset that can then be used by the Unifer.

The sub
-
queries:


SELECT

hotel_no, hotel
_name
FROM

hotel;

The hotel query is routed to the Informix database by a JDBC call and returned as a
dataset.


SELECT

room_no, hotel_no
FROM

booking;

The booking query is routed to the PostgreSQL database by a JDBC call and returned as
a dataset.


SELECT

room_no, price from
FROM

room;

The room query is router to a DataServer object and processed against the formatted
textual file and returned as a dataset.



7.4

Query Unification


The goal of the Unification step of the Query process is to take all of the retu
rned
datasets, from each data source involved in the query, and join the data sets into a single
relation.

Using the set of sub
-
queries:


SELECT

hotel_no, hotel_name
FROM

hotel;

SELECT

room_no, hotel_no
FROM

booking;

SELECT

room_no, price from
FROM

room;


And the conditional statement from the initial query:


WHERE


hotel.hotel_no = booking.hotel_no AND



room.room_no = booking.room_no;


The query is joined together as a universal relation.



11

7.5

Query
Optimization


An interesting research are
a

of the Umbrella
DB project is query optimization. With
virtual database architecture
s
, there are new design issues to be addressed that don’t
occur in traditional, integrated database architectures.

Firstly, virtual

database
architecture
s

are

not represented by a native

global schema.

The absence of a global
schema indicates that integration
must be defined by the user.


The integration is managed at the user level by an active and intelligent agent that
monitors the data sources and collects metadata to continuously up
date the virtual
database schema.

With JDBC related data sources the Engine Agent will make calls to
the database and get changes in size and schema on a planned basis. The textual data
sources have no meta
-
data environment, so they will be monitored and

updated by the
user.


Most RDBMS, ORDBMS, and commercial systems have query optimizations in
use that execute
in

the
database
server
run time environment
. Our optimization issues
are
focus on

t
hree
distinct areas. First we examine the ability to effici
ently retrieve
textual data from formatted and delimited data sources.
The second consideration is the
order in which the queries are executed. If a sub
-
query’s result dataset is very small, it
can be executed and then used against a dataset that is very
large. This will reduce the
overall data transfer across network interfaces
.
The
third

area is
u
nification of data sets
from multiple heterogeneous sources.



The concept of

query tree decomposition

[5]
[6]

is used to design the
algorithms for the economi
c analysis and query evaluation. Depending on a selection of
parallel or sequential execution the Engine will decomposed the queries into query trees

12

and use the cost of projections, selections, joins and Cartesian products to evaluate the
economic factor
s associated with executing a query. Once the total cost is calculated the
selection of the parallel or sequential algorithm is made and the query prop
a
gate
s

through
its steps
to
successful
completion.

The evaluation will consider all e
conomic factors
po
ssible to calculate the overall cost. There are some factors that cannot be factored
because of the potential for errors and unforeseen issues such as network bandwidth and
traffic
[5].


7.5.1

Sequential

Query

Execution

Example


We present an example of a seque
ntial execution of a query.

This example is based on
the virtual database exhibited in Figure
4
. This query accesses to data sources: a
PostgreSQL ORDBMS and an MS Excel spreadsheet via ODBC.

The two virtual
database tables are related using a
guest_no

field that relates the data.

The objective is to
find the guest name and address of every registered guest who stayed at the hotel on July
4
th

who has children.


Step 1 is to split the query into sub
-
queries so that the Query Engine can evaluate.
Step 2
uses

the

sub
-
query of each data source and factors it against the current

metadata

for that data

source

to evaluate if the query should be executed in sequential or parallel
order. Once the decision is made to evaluate in sequential order the Query Engine

Agent
develops the query plan to most efficiently execute the query across the data sources.


I
n

this example the Agent decides to process the data from the PostgreSQL data
source first because it will have a small set of guest numbers
that meet the query

constraints
. The data returned will be smaller than if w
e

went to the Excel data source

13

first.

A selection is executed on the PostgreSQL data source to return all
records where
the date is equal to July 4
th
. A projection of guest numbers from the July

4
th


data set is
then returned to the Umbrella Engine.


Step 4 will use the lis
t

of the data sets to query the MS Excel data source. First a
selection is run to return all relations where the number of children is greater than zero.
A selection is used
on that dataset to only return the guest name and address of each
relation.

To complete the query the return of the MS Excel dataset is minimized by the
guest number list from the PostgreSQL dataset to produce the final guest name and
address dataset.


SELECT
guest_name, address
FROM
Booking, Guest
WHERE
date_from = 07
-
04
-
2002
AND
no_children > 0;
1
2
3
Split Queries
Evaluate Impact
Process Query
(smallest return)
4
Process 2
nd
Query
Booking
hotel_no
guest_no
date_from
date_to
room_no
Guest
guest_no
guest_name
no_adults
no_children
address
PostgreSQL
MS Excel
1

Date_from=07
-
04
-
2002

no_children > 0
AND
guest_no
guest_no

guest_name, address

Dist. Server
Local Server
Over the network
guest_name, address
3
4
2


Figure
4
: Sequential Execution Example



7.5.2

Parallel
Query
Execution Example


We present an example of a
parallel query

execution

.

This example is based on the
virtual database exhibited in Figure
5
. This query accesses to data sources: a PostgreSQL
ORDBM
S and an MS Excel spreadsheet via ODBC. The two virtual database tables are

14

related using a
guest_no

field that relates the data. The objective is to find the guest
name and address of every registered guest who stayed at the hotel on July 4
th

who has
ch
ildren.


Step 1 is to split the query into sub
-
queries so that the Query Engine can evaluate.
Step 2 uses the sub
-
query of each data source and factors it against the current metadata
for that data source to evaluate if the query should be executed in seq
uential or parallel
order. Once the decision is made to evaluate in sequential order the Query Engine Agent
develops the query plan to most efficiently execute the query across the data sources.


In this case
,
the

Query Engine Agent makes the decision to
process in parallel
because
it is
the most efficient execution
.

A selection is executed against the Booking
table

of the
PostgreSQL
database where the date from is July 4
th
, 2002. At the same time
a selection is being executed against the Guest MS Excel
spreadsheet
where the number
of children is greater than zero. Once the projections are complete both resulting datasets
have selections executed against them to return the guest number on the dataset from the
PostgreSQL dataset and the
guest number, gues
t name and address from the MS Excel
dataset, respectively.

The two returned datasets are joined by guest number to derive the
result set of guest name and address.



15

SELECT
guest_name, address
FROM
Booking, Guest
WHERE
date_from = 07
-
04
-
2002
AND
no_children > 0;
Booking
hotel_no
guest_no
date_from
date_to
room_no
Guest
guest_no
guest_name
no_adults
no_children
address
MS Excel
PostgreSQL

Date_from=07
-
04
-
2002

no_children > 0
Guest_no, guest_name, address

guest_no

1a
1b
guest_no
guest_name, address
Local Server
Over the network
Distributed Servers


Figure
5
: Sequential Execution

Example




10

Conclusion


In the present paper we prese
nt a working prototype of
UmbrellaDB
. The prototype
successfully integrates delimited text, formatted text, MySQL Relational Database
Management System (RDBMS) and PostgreSQL Object
-
Relational Database
Management System (ORDBMS) data sources. The data so
urces reside on different
physical hardware systems consisting of Sun Solaris, Red Hat Linux, Windows XP and
Windows NT connected via a TCP/IP based network. Virtual schemas of each data
source have been defined to
UmbrellaDB

and queries have been success
fully executed
across the heterogeneous sources. We have been able to test that

the
sequential and
parallel execution of queries are optimal.


16


The ability to integrate numerous heterogeneous data sources has been tested and
proven with this architecture
. User level integration has been achieved. Queries are
easily constructed and optimally executed to completion.


11

Future Work


There are numerous additional developments planned for the UmbrellaDB Virtual
Database tool. A representative list of futur
e

research topics and

enhancements is listed
below:



Extension to numerous data source types



Enhancement of the Query Engine Algorithms



Inclusion d
ata
d
ec
ompression techniques to drastically reduce data transmission
time



Use of

proposed

Huffman Code metho
ds [4] will allow for fast retrieval and
proce
ssing of large text datasets.



17

References

[1] E. Matson, “Querying Distributed, Heterogeneous Databases and Knowledge Bases
with a Single Unified Approach”, Research Paper, CIS 860, 1997.

[2] I. Witten, E. Fra
nk, “Data Mining”, Morgan
-
Kaufman Publishers, San Francisco,
2000.

[3] Y. Papakonstantinou, H. Garcia
-
Molina, J. Widom, “Object Exchange Across
Heterogeneous Information Sources”, Research sponsored by Wright Laboratory,
Aeronautical Systems Center, Air Fo
rce Material Command, USAF.

[4] E. Moura, G. Navarro, N. Ziviani, and R. Baeza
-
Yates
,

“Direct pattern matching on
compressed text”. In Proc. 5th International Symposium on String Processing and
Information Retrieval (SPIRE'98) (September 1998), pp. 90
--
95
. IEEE Computer
Society.

[5] M. Oszu, P. Valduriez, “Principles of Distributed Database Systems”, 1991, Prentice
-
Hall, Inc. New Jersey.

[6] J. Ullman, Principles of Database and Knowledge
-
Base Systems, Volume II: The
New Technologies”. 1989, Computer Sci
ence Press, Rockville, Maryland.

[7] J. Ullman, Principles of Database and Knowledge
-
Base Systems, Volume I: Classical
Database Systems” 1988, Computer Science Press, Rockville, Maryland.