Distributed and Non-Relational Databases


Mark Feltner
Department of Computer Science
University of Wisconsin-Platteville
Platteville, WI 53818
feltnerm@uwplatt.edu


Abstract

The data storage climate was much different when SQL, the standard data storage model, was conceived in the 1970s. Today's data is adaptive, moves quickly, can be astronomical in size, and may not even be located in a single geographic spot. Consider: the New York Stock Exchange generates about one terabyte of trade data per day, Facebook hosts approximately ten billion photos which take up about one petabyte of storage, and the Large Hadron Collider in Geneva, Switzerland will produce about fifteen petabytes of data per year [1]. New data storage solutions have been, and are being, conceived which specialize in handling these sorts of problems. These non-relational models (so called because they stray from the foundations of relational, SQL databases) are flexible to design and develop with, can easily be scaled horizontally, and are specialized for dealing with Big Data.


Theory

This section deals broadly with the theory of databases and distributed systems. The foundational concepts of databases and both the relational and non-relational models are described. A few important concepts relevant to data storage with distributed systems are also featured.


Relational Databases

The relational database management system was first proposed in 1970 by Edgar Codd at the IBM Almaden Research Center. Codd proposed a relational model, based on set theory, to structure stored data [2].

The relational model defines a relation as a set of tuples which share the same attributes. Usually one tuple represents an object and its associated information. Basically, each tuple is a row in the table and each attribute is a column [3]. Both rows and attributes are usually unordered sets. A relation requires a primary key, or at least a set of attributes which are guaranteed to be distinct among all tuples in the relation [4].

Relations are referenced via foreign keys. This allows a single tuple to be referenced by any number of other tuples (i.e., a one-to-many relationship).

The relational model also allows one to define precompiled indexes which can find tuples more quickly than checking each single tuple for a match.

Data in the relational model is often normalized to retain integrity and remove redundancy [1].

Database transactions are the standard units of work for a relational database management system (DBMS). Transactions are small, simplified work units; examples include INSERT, UPDATE, SELECT, and DELETE. A good DBMS is usually able to take a complex query and divide it up into efficient, small, discrete, and reliable transactions. This provides the DBMS with the ability to keep the database in a consistent state even if a transaction fails. Transaction-based systems benefit from being able to ROLLBACK changes by simply reversing the transactions that were executed.

Besides querying data with the SELECT statement, users can modify relations using INSERT, DELETE, UPDATE, and, of course, other DBMS-provided procedures.


ACID

ACID is one of the core characteristics of a "reliable" database. ACID is an acronym for Atomicity, Consistency, Isolation, and Durability. This set of properties guarantees that reliable database transactions may take place, and it has been a mainstay of modern relational database systems.

Atomicity simply means that each transaction is all-or-nothing. If any operation in the transaction group fails, then the entire transaction must be undone. If one were to insert a million rows into a table, and the 999,999th one failed before a COMMIT was issued, the database would revert to its previous state.
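This all-or-nothing behavior can be seen even in Ruby's simple, built-in PStore (the same transactional key-value interface used for an example later in this paper). The sketch below is illustrative only; the file name and the :balance key are invented for the demonstration.

```ruby
require "pstore"
require "tmpdir"

# All-or-nothing: PStore wraps every block in a transaction, so an
# aborted transaction leaves the store exactly as it was before.
path  = File.join(Dir.mktmpdir, "atomic-demo.pstore")
store = PStore.new(path)

store.transaction { store[:balance] = 100 }   # committed

store.transaction do
  store[:balance] = -50   # tentative change...
  store.abort             # ...discarded: roll back instead of committing
end

store.transaction(true) { store[:balance] }   # previous state (100) intact
```

The same principle scales up: a DBMS that fails mid-transaction replays or reverses the work so that no partial state is ever visible.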

For a database to be consistent, the data must always be in a valid, or consistent, state. Relational databases are designed with rules in mind. Strict schemas are defined which specify maximum lengths, value types, relations, and other rules. An ACID-compliant database will never break the rules it is defined with. Of course, this consistency does not guarantee programmer correctness, just that the database will follow the rules laid out for it.

Isolation defines a DBMS property that enables concurrent execution of transactions. Isolation ensures that transactions executed serially will result in the same system state as transactions executed concurrently.

Durability means that once a programmer has executed a COMMIT, the transaction is permanent across all clients.

All of these checks and rules, of course, add overhead to each transaction that occurs in an ACID-compliant DBMS. For systems in the banking or government sectors that do require a strong degree of data reliability, founding their models on ACID will benefit them.

The relational model is backed by decades of use, strong foundations in well-studied mathematics, and a variety of ready-to-use implementations. That being said, it is not the end-all-be-all of data storage. It can be inflexible and tedious to work with, and the benefits of ACID do not always outweigh the costs [4]. Relational databases lend themselves to systems which require a large degree of reliability and availability, and are generally better suited for single-node systems. The rise of distributed systems and web services, and the difficulty found when scaling traditional DBMSs, have led to new paradigms in data storage.


Non-Relational Databases

By 2009, the landscape of databases had changed vastly from that conceived by Codd in the 1970s. Nowadays, we are dealing with systems made to handle thousands of concurrent connections a second, store petabytes of data, and sync with other similar servers and services around the globe simultaneously.

NoSQL (so named because of its focus on data storage techniques sans SQL) defines non-relational, distributed data stores that often do not attempt to provide atomicity, consistency, isolation, and durability, the components of ACID. As consumers became more and more connected, data needed to be delivered faster. Two methods of solving this problem were conceived. First, spread out the data stores geographically. Second, specialize the data stores for the types of data they hold and process [4].

NoSQL systems store semi-structured or unstructured data. Semi-structured data is much like a spreadsheet: you know that the data is organized in cells, rows, and columns, but you only have a loose idea of what kind of data is stored. Unstructured data has no particular internal structure (e.g., plain text or image data).

Recently, there has been a surge of NoSQL and NoSQL-esque database management systems. Most, if not all, of these systems lack fixed schemas, avoid joins, and scale well.

Different systems satisfy different needs. Document-oriented ones, for instance, offer immense ease of use, while key-value or column-oriented systems sacrifice this in order to more efficiently scale data over clusters [4].


Key-Value

Key-value stores are basically "big, distributed, persistent, fault-tolerant hash" tables. Based on an extremely simple concept, the hash table, key-value stores are prized for their simplicity and speed [4]. Key-value stores allow the application developer to store schema-less data. This is beneficial when using a database for persistent storage of object-oriented data.

The data in a key-value store is usually just a string that represents a key, and a value which is some type of primitive of whatever programming language is being used (string, integer, array) [8]. There are a variety of ways to implement this. The most common key-value store is the file system: the key, for example, would be the path to a file, and the value is the binary data associated with that file. This simplicity means that key-value stores are not only easy to use, but the code produced is simple and readable. An example of accessing a key-value store in Ruby:

require "pstore" # the key-value store interface

store = PStore.new("data-file.pstore")

store.transaction do # begin transaction
  # load some data into the store...
  store[:single_object] = "Lorem ipsum dolor sit amet..."
  store[:obj_hierarchy] = { "Marc Seeger" => ["ruby", "nosql"],
                            "Rainer Wahnsinn" => ["php", "mysql"] }
end # commit changes to data store file
[8]

Some of the most popular key-value storage solutions include Apache Cassandra, Redis, memcached, and Google's BigTable.

Document-oriented

Document-oriented databases are defined as semi-structured data storages implemented without tables and relations [4]. Document stores are highly flexible and have no fixed schema. The idea is to allow the client application to add and remove attributes on any single tuple without wasting any space by creating empty attributes for all the other tuples.
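The idea can be sketched in plain Ruby (this is an illustration of the model, not any real document-database API; the album data and attribute names are invented):

```ruby
# A collection is just a list of documents; each document is a
# free-form hash, so attributes can vary per document and no empty
# columns are reserved for attributes the other documents lack.
albums = [
  { title: "British Steel", artist: "Judas Priest", year: 1980 },
  { title: "Powerslave", artist: "Iron Maiden", year: 1984,
    genre: "heavy metal" },           # extra attribute, no schema change
]

albums[0][:remastered] = true         # add an attribute to one document
albums[1].delete(:genre)              # remove one; the others are untouched

albums[0].keys  # => [:title, :artist, :year, :remastered]
```

Contrast this with a relational table, where adding the :remastered attribute would mean altering the schema and reserving a column in every row.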

Document-oriented databases are well suited for dynamic data because of their lack of fixed schemas and their developer flexibility. Because of this, though, document-oriented databases have a lack of safety (much like using a duck-typed language versus a statically typed one) and there is a relative performance hit.

Many popular document-oriented databases provide layers over already standardized semi-structured file formats such as JSON, YAML, or XML. This allows them to map easily to web services and other distributed systems which operate by passing these small, structured files around.

Some of the most popular document-oriented databases include MongoDB, CouchDB, and SimpleDB.


Graph

Graph-based databases use graph-like structures with nodes, edges, and properties to represent data. Graph databases are powerful tools, especially for graph-like queries such as shortest-path computation and community analysis.

These databases provide another advantage: index-free adjacency. This basically means no index lookups are necessary because each element contains a direct pointer to its adjacent elements.
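Index-free adjacency can be sketched in plain Ruby (node names and the tiny graph are invented for illustration; real graph databases persist and distribute this structure):

```ruby
# Each node keeps direct references to its neighbors, so traversal
# follows pointers rather than consulting an index.
Node = Struct.new(:name, :neighbors)

a, b, c, d = %w[a b c d].map { |n| Node.new(n, []) }
a.neighbors << b << d       # a -> b, a -> d
b.neighbors << c            # b -> c
d.neighbors << c            # d -> c

# Breadth-first search over the pointers: shortest path length in hops.
def shortest_path(start, goal)
  queue, visited = [[start, 0]], [start]
  until queue.empty?
    node, dist = queue.shift
    return dist if node.equal?(goal)
    node.neighbors.each do |nbr|
      unless visited.include?(nbr)
        visited << nbr
        queue << [nbr, dist + 1]
      end
    end
  end
  nil
end

shortest_path(a, c)   # => 2 (either a -> b -> c or a -> d -> c)
```

Notice that no lookup table is ever consulted: every hop is a direct pointer dereference, which is exactly what makes these queries cheap in a graph store.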

Graph databases scale more naturally for large datasets and are often faster than object-oriented applications for associative data sets. Graphs typically don't require expensive JOIN operations, and their flexible schemas are suitable for ad-hoc and changing data.

Some popular graph-based solutions include AllegroGraph, InfiniteGraph, and FlockDB.


Other

Of course, there are many other types of distributed databases, including tuple stores, tabular databases, RDF databases, object databases, multivalue databases, RAM stores, and many other sub-categories of the previously listed categories. All in all, there exist quite a number of solutions for almost any data storage need.



Distributed Systems

Distributed systems, also commonly referred to as "grids", are rising in popularity. There has been a growing need for distributed systems in both commercial and scientific domains. Grid systems can be classified into computational, data, and service types. For the purposes of this paper, we will focus on the data grid. The data grid is defined as "systems that provide a hardware and software infrastructure for synthesizing new information from data repositories that are distributed in a wide area network" [9].

For almost any application that relies on a database housing a large volume of data, the bottleneck of that application will generally be its database. I/O operations are expensive, disk reads are slow, writes are slower, and since many databases live in remote data centers one must take network latency into account as well. A simple solution is to add more hardware to the database system, but this is innately unsustainable.

Another solution is to parallelize and distribute your data processing. A parallel database management system can achieve both high throughput, through interquery parallelism, and low response time, through intraoperation parallelism [10].

Parallelism does add architectural complexity, though, and creates a lot of system administration overhead. The biggest disadvantage of parallel computing is the lack of standards and applicable examples [10].

The structure of distributed systems grants them different performance metrics and different goals. These systems are measured on their efficiency, or utilization rate of resources; dependability; adaptability, or the ability to support billions of job requests over massive data sets and virtualized cloud servers under various workloads and service models; and flexibility, or the ability of the system to run in both high-performance and high-throughput computing environments [11].




CAP Theorem

In 2000, Eric Brewer laid out the CAP Theorem in a keynote at the ACM Symposium on Principles of Distributed Computing, and Seth Gilbert and Nancy Lynch formalized his ideas into a theorem in 2002. The CAP Theorem states that it is impossible to simultaneously guarantee consistency, availability, and partition tolerance in a distributed system [4]. This idea has vast implications, especially today when almost every system is distributed in one way or another.

A service that is consistent operates fully or not at all. That is, a system is consistent if an update is applied to all relevant nodes at the same logical time [5]. One way to drop consistency from your service is to simply deal with inconsistencies. Some services simply accept that things are eventually consistent.

Eventual consistency is a consistency model used in the parallel programming domain. It means that given a sufficiently long period of time over which no changes are sent, all updates can be expected to propagate throughout the system and the replicas will be consistent [6].

"For a distributed system to be continuously available, every request received by a non-failing node in the system must result in a response" [7]. Availability is one of the hardest things to test and plan for because it tends to desert you when you need it most. It is inevitable that services go down at busy times precisely because they are busy. Implementing a system that can deal with availability gracefully can be non-trivial [4].

"In order to model partition tolerance, the network will be allowed to lose arbitrarily many messages sent from one node to another. When a network is partitioned, all messages sent from nodes in one component of the partition to nodes in another component are lost" [7]. Once a system has become distributed, there is a high chance that it will be in a partitioned state. Say the data has not traversed the wires to reach your entire data store yet, or perhaps some of your machines have crashed and are no longer online. To handle issues of consistency and availability, we can scale horizontally, but as the number of partitions increases, the more atomic entities you need to keep synchronized. You can drop partition tolerance by placing everything on a single machine, but this severely limits your ability to scale [4].

To have a consistent system you must sacrifice some level of availability or partition tolerance; a system with good availability will have to sacrifice some consistency or partition tolerance; and a system that is partitioned will need to sacrifice some degree of consistency or availability.

Choosing which part of CAP to drop has been argued over and over again since CAP's inception. Hale believes that you cannot drop partition tolerance, because the probability that one or more nodes in your system will fail increases exponentially as you add nodes. Instead, Hale says that you should be asking yourself: "In the event of failures, which will this system sacrifice? Consistency or availability?" [5]



Algorithms

Distributed Systems and Data Parallelization Techniques

MapReduce

MapReduce is a popular programming model used for scalable data processing on large clusters over large data sets [4]. MapReduce is suited for applications where data is written once and read many times [1].

The user provides two functions: a map function, which generates a list of intermediate key/value pairs, and a reduce function, which merges all the intermediate values associated with the same intermediate key. Being purely functional, MapReduce operates well on semi-structured or unstructured data.

The map function serves to process a subset of the original problem and returns the result as a list of (A, B)-tuples. The reduce function, given an A value and a list of B values, produces a list of results that is the final answer for all A-parts. The beauty of these two operations is that they are purely functional and have no side-effects. What does this mean? It means they tend to scale well, are highly parallelizable, and are extremely flexible [4].
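The map and reduce phases can be mimicked in a few lines of single-machine Ruby, using the canonical word-count example (a sketch of the model only; a real MapReduce framework would distribute each phase across a cluster):

```ruby
# Word count: map emits intermediate (word, 1) pairs; reduce merges
# all values that share the same intermediate key.
docs = ["the quick brown fox", "the lazy dog", "the fox"]

# Map phase: each input record yields a list of (key, value) pairs.
pairs = docs.flat_map { |doc| doc.split.map { |word| [word, 1] } }

# Shuffle: group the intermediate pairs by key.
grouped = pairs.group_by { |word, _| word }

# Reduce phase: merge the values for each key.
counts = grouped.transform_values { |ps| ps.sum { |_, n| n } }

counts["the"]  # => 3
```

Because neither phase mutates shared state, the map calls can run on different machines and the reduce calls can run per-key in parallel, which is precisely the property the paragraph above describes.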


Figure 1: MapReduce Visual Representation


This operation is linearly scalable and massively parallelizable [1]; if you double the amount of data to operate on, the operation takes twice as much time, but if you also double the size of the cluster, the operation takes half as much time [1]. It has been proven to work on terabytes of data across hundreds of thousands of client machines with hundreds of simultaneous MapReduce programs. In fact, an estimated several thousand of these operations occur on Google's clusters every day [11].


Databases

Table 1: Example collection used to compare row- and column-oriented operations

Title                 Artist         Album            Year
Breaking the Law      Judas Priest   British Steel    1980
Aces High             Iron Maiden    Powerslave       1984
Kickstart My Heart    Motley Crue    Dr. Feelgood     1989
Raining Blood         Slayer         Reign in Blood   1986
I Wanna Be Somebody   W.A.S.P.       W.A.S.P.         1984



Row-oriented

Row-oriented databases represent objects as tuples, with each attribute being one piece of information about that object. Generally, we think of a tuple as a row and an attribute as a column. This is the primary way of thinking about data in the relational model [4].

A row-oriented database would store the data somewhat like this:

Breaking the Law, Judas Priest, British Steel, 1980, Aces High, Iron Maiden, Powerslave, 1984 ... [4]


Column-oriented

Column-oriented databases are tuned to operate more efficiently on a set of attributes for all tuples. The table above in a column-oriented database:

Breaking the Law, Aces High, Kickstart My Heart, Raining Blood, I Wanna Be Somebody, Judas Priest, Iron Maiden, Motley Crue, Slayer, W.A.S.P., British Steel, Powerslave, Dr. Feelgood, Reign in Blood ... [4]


These two distinctly different ways of storing data produce very different results when database operations are performed.

To create a new tuple (CREATE) in a row-oriented database, the system only needs to append the new tuple to the end of the collection (table). In a column-oriented database, careful insertions would need to be made into each column to keep the data lined up correctly.

Conversely, creating a new attribute in a row-oriented database requires a more expensive and non-trivial operation, whereas it is as simple as appending the new data to the end of a column in a column-oriented database.

Select statements, such as a query listing all the albums from the year 1984, are more efficient in a row-oriented system because the query simply needs to check the 4th element of each tuple and return the ones matching the year.

Aggregate statements, on the other hand, are more efficient in a column-oriented solution. An operation such as summing all the years is much easier in a column-oriented system because all the years are found consecutively and need not be compared to other attributes.
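The two layouts can be compared directly in plain Ruby, using the first two records from Table 1 (an in-memory sketch; real systems lay these out on disk, where contiguity matters far more):

```ruby
# The same records, once row-wise and once column-wise.
rows = [
  ["Breaking the Law", "Judas Priest", "British Steel", 1980],
  ["Aces High",        "Iron Maiden",  "Powerslave",    1984],
]

columns = {
  title:  ["Breaking the Law", "Aces High"],
  artist: ["Judas Priest", "Iron Maiden"],
  album:  ["British Steel", "Powerslave"],
  year:   [1980, 1984],
}

# Aggregating one attribute: the row layout must pick a field out of
# every tuple, while the column layout scans one contiguous array.
rows.sum { |r| r[3] }   # row-oriented: touch every tuple
columns[:year].sum      # column-oriented: one contiguous scan
# both => 3964
```

In memory the difference is cosmetic; on disk, the column layout means the years are read as one sequential run instead of being interleaved with titles, artists, and albums.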

In summary, we can conclude that row- and column-oriented solutions both have trade-offs. A row-oriented system may be best when the system requires the creation of many rows or the retrieval of many attributes from the same tuple at the same time. Column-oriented systems might be the better option when new columns are created often, and when aggregations over single columns across many rows happen much of the time [4].


Technology

Enabling Technology

It has only been in the last decade or so that NoSQL solutions have really started to appeal to developers and engineers. This is the result of a couple of enabling technologies in the hardware sector, specifically advances in computation and in memory and storage systems.

Today's hard drives can reach sizes in excess of one terabyte with read speeds around 100 MB/s, so it takes more than two and a half hours to read all the data off the disk. This is a long time to read data, and it takes even longer to write. The solution to this problem is to spread the data out onto separate drives. By spreading the data across, say, 100 separate drives, we can take advantage of the ability to read from multiple disks at once.
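The arithmetic behind those figures is simple enough to check:

```ruby
# Reading 1 TB (~1,000,000 MB) sequentially at ~100 MB/s:
seconds = 1_000_000.0 / 100      # 10,000 s
hours   = seconds / 3600         # ~2.78 hours, "more than two and a half"

# The same terabyte split evenly across 100 drives read in parallel:
parallel_seconds = seconds / 100 # 100 s, well under two minutes
```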

There are two problems with this approach, though. First is how to deal with hardware failure, a common occurrence. Commonly, you can use replication to store redundant copies of the data so that if there is a failure, extra copies are available. The second big issue is that you are going to need to combine that data in some way eventually. Distributed data-storage systems abstract these problems away and provide reliable, shared storage and analysis systems.



Distributed Systems

Hadoop

Hadoop was created by Doug Cutting, who named it after his son's stuffed elephant, and it grew out of Apache Nutch, an open-source web search engine that is part of the Lucene project [1].

In 2008, Yahoo! announced that its production search index was being powered by a 10,000-core Hadoop cluster [1]. In April of 2008, Hadoop broke the world record to become the fastest system to sort a terabyte of data. It ran on a 910-node cluster and took only 209 seconds, beating the previous winner by 88 seconds [1].


HBase

HBase is another Apache Software Foundation project that stemmed from the Hadoop project. HBase is basically a Hadoop database running over the Hadoop Distributed File System (HDFS). It is modeled after Google's BigTable. HBase provides nice developer tools, including a Java API, a REST interface, and a Thrift gateway.

HBase follows a column-oriented architecture and integrates easily with MapReduce jobs run in Hadoop. Tables can be used as both input and output for MapReduce jobs. Its data model is based on a four-dimensional dictionary:

hbase_db[table][row][column][time-stamp] -> value
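That four-dimensional dictionary can be pictured as nested hashes (a plain-Ruby sketch of the data model only, not the HBase API; the table name, column name, and timestamps are invented):

```ruby
# table -> row -> column -> timestamp -> value
hbase = Hash.new do |t, table|
  t[table] = Hash.new do |r, row|
    r[row] = Hash.new { |c, col| c[col] = {} }
  end
end

hbase["albums"]["row1"]["info:artist"][100] = "Judas Priest"
hbase["albums"]["row1"]["info:artist"][200] = "Iron Maiden"  # newer version

# A cell keeps every timestamped version; a read usually wants the latest.
cell = hbase["albums"]["row1"]["info:artist"]
cell[cell.keys.max]   # => "Iron Maiden"
```

The timestamp dimension is what lets the store keep multiple versions of a cell side by side instead of overwriting in place.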

HBase has excellent fault tolerance. It has block-level replication, which is achieved by storing files as sets of blocks which are then distributed across the entire cluster. Block replication also gives HDFS high throughput across large data sets. By splitting individual files into 64 MB blocks, HDFS is able to decrease the amount of metadata storage per file and read data quickly thanks to sequential blocks.

HBase and the HDFS file system it is built on will sometimes compromise reliability in order to achieve lower communication costs. HBase supports heartbeat and block-report periodic messaging between nodes.



Cassandra

Cassandra is a "distributed storage solution for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure" [12]. Cassandra is now an Apache Software Foundation project, but it was designed by Facebook to fix their Inbox Search problem. The ability of Facebook users to search through their inboxes required massive amounts of read and write throughput, billions of writes per day, and needed to be scalable for the future. Facebook created Cassandra to resolve this problem, and so far Cassandra has held to its duty by scaling well from 100 million users to 250 million users [12].

Cassandra is a column-oriented, distributed DBMS. It features a four- to five-dimensional key-value store with eventual consistency [4].

The Cassandra API is extremely simple, and only exposes three methods:

- Insert(table, key, rowMutation)
- Get(table, key, columnName)
- Delete(table, key, columnName) [12]
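A toy, in-memory imitation of that three-method surface might look like the following (purely a sketch: the class name and storage layout are invented here, and real Cassandra distributes the data across many nodes):

```ruby
# Minimal in-memory imitation of the three-method API. Data lives in
# table -> key -> { column => value } hashes; a row mutation is just
# a hash of column updates.
class TinyColumnStore
  def initialize
    @tables = Hash.new { |t, name| t[name] = Hash.new { |r, k| r[k] = {} } }
  end

  def insert(table, key, row_mutation)
    @tables[table][key].merge!(row_mutation)  # apply the column updates
  end

  def get(table, key, column_name)
    @tables[table][key][column_name]
  end

  def delete(table, key, column_name)
    @tables[table][key].delete(column_name)
  end
end

db = TinyColumnStore.new
db.insert("users", "feltnerm", "dept" => "CS", "campus" => "Platteville")
db.get("users", "feltnerm", "dept")   # => "CS"
```

The narrowness of the interface is the point: with no ad-hoc queries or joins to support, every operation maps to a cheap, partitionable key lookup.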

Cassandra has the backing of a large and popular company in Facebook, and it has proven itself at the largest scale. It is extremely simple to use and get started with, yet extremely powerful.


Databases

Relational

MySQL is one of the most popular relational database management systems in the world. It is a core component of web development and is the 'M' in the LAMP (Linux, Apache, MySQL, PHP) acronym. You will find MySQL used in small- to medium-scale single-server deployments. MySQL does support scaling, by using full replication to increase read capacity or sharding to increase write capacity. It also supports multiple storage engines: native, partner-developed, and open-source.


MongoDB

One popular document-oriented database is MongoDB. MongoDB is a scalable, open-source, document-oriented database that has been around since October 2007. MongoDB is based on B-trees and stores data as binary JSON-style documents. MongoDB scales very easily horizontally and supports sharding and replication. MongoDB is mainly used for problems without strong transaction requirements. MongoDB is also very flexible and developer-friendly: its lack of a fixed schema means modifying database definitions, columns, and tuples is easy.


Conclusions

Any software that depends on persistent storage will end up using a database one way or another, and many times the database can and will be the bottleneck of an application. It is important to know what the implications are in choosing a database system. Today's data storage climate is such that traditional, SQL-based DBMSes are not the best solution to every problem. Through a better understanding of the theory, algorithms, and implementations behind relational and non-relational databases, one will be better suited to making data-driven decisions. For a full account of performance considerations that is out of the scope of this paper, the author would like to suggest Chapter 7 of Lith and Mattsson's paper, "Investigating storage solutions for large data". The paper investigates relational versus non-relational databases in a specific case study and includes pages of graphs showing specific performance benchmarks for different databases.









References

[1] T. White, Hadoop: The Definitive Guide, Sebastopol, CA: O'Reilly Media, Inc., 2012.

[2] M. Stonebraker and U. Çetintemel, "'One Size Fits All': An Idea Whose Time Has Come and Gone," Providence, 2007.

[3] C. Ricardo, Databases Illuminated, Second Edition, Ontario: Jones & Bartlett Learning, 2012.

[4] A. Lith and J. Mattsson, "Investigating storage solutions for large data," Chalmers University of Technology, pp. 1-70, 2010.

[5] C. Hale, "You Can't Sacrifice Partition Tolerance," 07 October 2010. [Online]. Available: http://codahale.com/you-cant-sacrifice-partition-tolerance/. [Accessed 07 April 2013].

[6] W. Vogels, "Eventually Consistent," ACM Queue, 2008.

[7] S. Gilbert and N. Lynch, "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services," p. 12, 2002.

[8] M. Seeger, "Key-Value Stores: A Practical Overview," medien informatik, 2009.

[9] F. Magoules, Fundamentals of Grid Computing, Boca Raton: Chapman and Hall, 2010.

[10] A. J. Wells, Grid Database Design, Boca Raton: Auerbach Publications, 2005.

[11] K. Hwang, G. C. Fox and J. J. Dongarra, Distributed and Cloud Computing: From Parallel Processing to the Internet of Things, Morgan Kaufmann, 2011.

[12] A. Lakshman and P. Malik, "Cassandra - A Decentralized Structured Storage System," Ithaca, 2009.

[13] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes and R. E. Gruber, "Bigtable: A Distributed Storage System for Structured Data," in OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, 2006.

[13]

F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes
and R. E. Gruber, "Bigtable: A Distributed Storage System for
Structured Data," in
OSDI'06:
Seventh Symposium on Operating System Design and Implementation
, Seattle, WA, 2006.


Acknowledgements

Figure 1: http://upload.wikimedia.org/wikipedia/en/3/35/Mapreduce_Overview.svg