Using Machine Learning Technique to Parallelize Databases: Where each query answered by a single node


Jozsef Patvarczki, Elke A. Rundensteiner, Craig E. Wills, and Neil T. Heffernan


Abstract

Web-based applications often scale poorly under higher loads because the database becomes a bottleneck. We propose a rule-based data replication middleware that lets web applications use multiple database servers. Knowing each query template in advance allows us to propose better solutions for balancing load across multiple servers than are available for traditional applications. Prior knowledge of all incoming query templates and of the workload lets us select a table placement in which each query template can be answered by a single database server. Our goal is to minimize the effective response time of the database by distributing the data effectively across multiple nodes.

Rather than relying on theory alone for database layout, we need a system that collects empirical data on when the horizontal partitioning (HP), vertical partitioning (VP), de-normalization (DN), and full replication (FR) operators are effective. We have implemented a brute-force search technique that tries different operators, and we use the empirically measured data to see whether any speedup has occurred. After creating a large data set in which these four operators have been applied to produce different databases, we can employ machine learning to induce rules that help govern the physical design of the database across an arbitrary number of computer nodes. This, in turn, would allow the database placement algorithm to converge more quickly as it trains over a larger set of examples.
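The brute-force search over operators described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each table receives exactly one of the four operators, and `measure_cost`, `toy_cost`, and the table names are hypothetical stand-ins for an empirical benchmark run.

```python
from itertools import product

# The four layout operators named in the text.
OPERATORS = ("FR", "HP", "VP", "DN")


def brute_force_layout(tables, measure_cost):
    """Try every operator assignment and keep the cheapest measured one.

    `measure_cost` is a hypothetical callback: it takes a layout
    (dict of table -> operator) and returns a measured cost such as
    mean response time. Lower is better.
    """
    best_layout, best_cost = None, float("inf")
    for choice in product(OPERATORS, repeat=len(tables)):
        layout = dict(zip(tables, choice))
        cost = measure_cost(layout)
        if cost < best_cost:
            best_layout, best_cost = layout, cost
    return best_layout, best_cost


def toy_cost(layout):
    # Stand-in for a real benchmark run; the numbers are invented.
    per_op = {"FR": 3.0, "HP": 2.0, "VP": 2.5, "DN": 4.0}
    return sum(per_op[op] for op in layout.values())


layout, cost = brute_force_layout(["users", "problems"], toy_cost)
```

With n tables this sketch evaluates 4^n layouts, which is why learned rules that prune or order the search are attractive.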



Problem Statement

A characteristic of web applications such as our ASSISTment Intelligent Tutoring System (www.ASSISTment.org) is that we know all the incoming query templates beforehand, because users typically interact with the system through a web interface [1];

Prior knowledge of all the incoming query templates and of the query workload gives us the ability to select an appropriate table placement;

We are given a query workload that describes all the query templates of a web-based application and the percentage of queries of each template that the application typically processes;

Given this workload and the optimization goal, determine the best possible placement using the four operators (FR, HP, VP, and DN) and an arbitrary number of database servers, such that each query is answered by a single node;

Our optimization goal is to maximize the total system throughput [2].
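To make the single-node constraint concrete, the sketch below represents a workload as query templates with their share of traffic and checks which nodes can answer each template alone. The template names, table names, node names, and shares are all invented for illustration, and treating a template as simply the set of tables it reads is a simplifying assumption.

```python
# Hypothetical workload: template name -> (tables it reads, share of queries).
workload = {
    "get_student": ({"students"}, 0.6),
    "get_progress": ({"students", "logs"}, 0.3),
    "list_problems": ({"problems"}, 0.1),
}

# Hypothetical placement: node name -> tables stored on that node.
placement = {
    "node1": {"students", "logs"},
    "node2": {"students", "problems"},
}


def nodes_for(tables, placement):
    """Return the nodes that hold every table a template needs."""
    return sorted(n for n, held in placement.items() if tables <= held)


def single_node_coverage(workload, placement):
    """Map each template to the nodes able to answer it alone."""
    return {name: nodes_for(tables, placement)
            for name, (tables, _share) in workload.items()}


def uncovered_share(workload, placement):
    """Fraction of the workload whose templates no single node can answer."""
    return sum(share for tables, share in workload.values()
               if not nodes_for(tables, placement))


coverage = single_node_coverage(workload, placement)
```

A placement satisfies the constraint when `uncovered_share` is zero. Here `students` is stored on both nodes so that `get_progress` and `list_problems` can each be answered by one node.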

References

1. Tobias Groothuyse, Swaminathan Sivasubramanian, and Guillaume Pierre, "GlobeTP: template-based database replication for scalable web applications", WWW07, Banff, Canada, pp. 301-310.

2. Jozsef Patvarczki, Murali Mani, and Neil Heffernan, "Performance Driven Database Design for Scalable Web Applications", in J. Grundspenkis, T. Morzy & G. Vossen (Eds.), Advances in Databases and Information Systems, Springer-Verlag: Berlin, ISBN 978-3-642-03972-0, pp. 43-58.

Proposed Solution



We characterize the problem as an AI search over layouts;

Our hypothesis is that we can learn rules that capture human-like expertise and use these rules to better partition a given database;

With the help of the learned rules, we will be able to fit layout characteristics, so layout generation becomes progressively faster;

We will perform the layout and empirically measure the cost, since we want to know what is effective and under what conditions;

We explore multiple ways to represent this knowledge (possibly a decision tree);

We apply cross-validation to prevent overfitting our rules to the training data.
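The learning step can be illustrated with a very small example. The sketch below is an assumption-laden stand-in, not the authors' learner: it uses a decision stump (a one-level decision tree) on a single invented feature, the fraction of writes hitting a table, and evaluates it with k-fold cross-validation. The data points and labels are made up.

```python
from collections import Counter


def majority(labels):
    """Most common label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows):
    """Learn one rule: 'if x <= t then A else B' on a single feature.

    rows is a list of (x, label) pairs. A stump is a one-level
    decision tree, used here as the simplest learnable rule.
    """
    best = None
    for t in sorted({x for x, _ in rows}):
        left = [y for x, y in rows if x <= t]
        right = [y for x, y in rows if x > t]
        lo = majority(left)
        hi = majority(right) if right else lo
        errors = sum(y != (lo if x <= t else hi) for x, y in rows)
        if best is None or errors < best[0]:
            best = (errors, t, lo, hi)
    _, t, lo, hi = best
    return lambda x: lo if x <= t else hi


def cross_validate(rows, k):
    """Average held-out accuracy over k folds (fold i = every k-th row)."""
    scores = []
    for i in range(k):
        test = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        rule = train_stump(train)
        scores.append(sum(rule(x) == y for x, y in test) / len(test))
    return sum(scores) / k


# Invented examples: (fraction of writes hitting a table, best operator).
data = [(0.05, "FR"), (0.10, "FR"), (0.15, "FR"), (0.20, "FR"),
        (0.60, "HP"), (0.70, "HP"), (0.80, "HP"), (0.90, "HP")]
accuracy = cross_validate(data, k=4)
```

A stump trained on all eight rows separates them perfectly, yet the cross-validated accuracy is 0.875: the fold that holds out the 0.20 example learns a threshold of 0.15 and misclassifies it. This gap between training fit and held-out accuracy is what cross-validation is meant to expose.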
















Core parts of the system:

(a) A data placement algorithm that can converge quickly over time as it trains over a set of examples and machine-learned rules;

(b) Parameterized, machine-learned rules that help govern the physical design of the database across an arbitrary number of computer nodes;

(c) A shared-nothing data replication middleware for web-based applications that can be built easily from low-cost existing resources to realize database scaling without expensive storage area networks.
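As a rough sketch of what the middleware in (c) has to do at run time, the class below routes each incoming query to one node that stores every table its template needs, preferring the least-loaded candidate. The `Router` class, its least-loaded policy, and all node, table, and template names are hypothetical; the source does not specify the routing policy.

```python
class Router:
    """Send each query to one node that holds all tables its template needs."""

    def __init__(self, placement, templates):
        # placement: node -> set of tables; templates: name -> set of tables.
        self.load = {node: 0 for node in placement}
        self.candidates = {
            name: [n for n, held in placement.items() if tables <= held]
            for name, tables in templates.items()
        }

    def route(self, template):
        nodes = self.candidates[template]
        if not nodes:
            raise LookupError(f"no single node can answer {template!r}")
        # Least-loaded candidate; ties go to the first node listed.
        node = min(nodes, key=self.load.__getitem__)
        self.load[node] += 1
        return node


router = Router(
    placement={"node1": {"students", "logs"},
               "node2": {"students", "problems"}},
    templates={"get_student": {"students"},
               "get_progress": {"students", "logs"}},
)
routes = [router.route(t)
          for t in ("get_progress", "get_student", "get_student")]
```

Because the candidate nodes are computed once from the known query templates, routing a query is a dictionary lookup plus a comparison of load counters, with no cross-node coordination.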




Contact: Jozsef Patvarczki, patvarcz@cs.wpi.edu


[Figure labels: ASSISTment, MediaWiki, Moodle, TPC-W]