What's wrong with relational databases?

gayheadtibburInternet and Web Development

Feb 5, 2013 (4 years and 8 months ago)


NoSQL: what’s all the buzz about?

http://nosql-database.org/

Next-generation databases are:
Non-relational
Distributed
Open-source
Horizontally scalable

Often more characteristics: schema-free, easy replication support, simple API, eventually consistent / BASE (not ACID), a huge data amount

List of NoSQL databases [122+]

Wide Column Store / Column Families
HBase, Cassandra, Hypertable, Cloudata, Cloudera, Amazon SimpleDB


Document Stores
CouchDB, MongoDB, Terrastore, ThruDB, OrientDB, RavenDB, Citrusleaf, SisoDB


Key Value / Tuple Store
Azure Table Storage, MEMBASE, Riak, Redis, Chordless, GenieDB, Scalaris, Tokyo Cabinet / Tyrant, Keyspace, Berkeley DB, MemcacheDB, Faircom C-Tree, Mnesia, LightCloud, Hibari, HamsterDB, STSdb, Pincaster, RaptorDB


Eventually Consistent Key Value Stores
Amazon Dynamo, Voldemort, Dynomite, KAI


Graph Databases
Neo4J, Infinite Graph, Sones, InfoGrid, HyperGraphDB, Trinity, AllegroGraph, Bigdata, DEX, OpenLink Virtuoso, VertexDB, FlockDB


Object Databases
db4o, Versant, Objectivity, Gemstone, Progress, Starcounter, Perst, Caching, ZODB, NEO, PicoLisp, Sterling


More and more databases

So what’s wrong with relational databases?

Main principles of RDBMS


SQL
ACID
  Atomic: “all or nothing”.
  Consistent: data moves from one correct state to another correct state, with no possibility that readers could view different values that don’t make sense together.
  Isolated: transactions executing concurrently will not become entangled with each other.
  Durable: once a transaction has succeeded, the changes will not be lost.
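
To make the "all or nothing" property concrete, here is a minimal sketch using Python's built-in sqlite3 module. The accounts table, the names and the transfer helper are assumptions made up for illustration; they do not come from the slides.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst inside a single transaction."""
    try:
        # Used as a context manager, the connection commits on success
        # and rolls back if an exception escapes the block.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
        return True
    except ValueError:
        return False


def balances(conn):
    """Return {name: balance} for every account."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


# Hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()
```

A transfer that would overdraw the source account raises inside the transaction, so both UPDATE statements are rolled back together and neither balance changes.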

Shortcomings of RDBMS

Transactions under heavy load
Complexities of vertical scaling
2-phase commit (2PC) protocol
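
As a rough illustration of why 2PC is costly, the sketch below models a coordinator that first collects a vote from every participant and only then commits or rolls back everywhere. The Participant class and its can_commit flag are hypothetical stand-ins for real database nodes; there is no networking, timeout handling or recovery here.

```python
class Participant:
    """Hypothetical resource manager taking part in a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self):
        # Phase 1: vote yes only if the local work can be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Run both phases and return the global outcome."""
    # Phase 1 (voting): ask every participant, even after a "no" vote.
    votes = [p.prepare() for p in participants]

    # Phase 2 (completion): commit only on a unanimous yes.
    if all(votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "aborted"
```

Every participant has to answer before anyone can finish, so the transaction is as slow as its slowest node, and a single "no" vote aborts the work everywhere.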

Sharding

“If you can’t split it, you can’t scale it” (Randy Shoup, distinguished architect, eBay)



Sharding approaches

Feature-based shard or functional segmentation
Key-based sharding
Lookup table
Shared-nothing or Cassandra-like sharding
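
The first three approaches can be sketched in a few lines each. The shard names, the feature map and the LookupTableRouter class are assumptions made up for this example, and MD5 is used only as a stable hash, not as anything a particular product prescribes.

```python
import hashlib

# Hypothetical shard names.
SHARDS = ["db0", "db1", "db2", "db3"]

# Feature-based sharding (functional segmentation): one database per feature.
FEATURE_SHARDS = {"users": "users_db", "orders": "orders_db"}


def feature_based_shard(feature):
    """Route by the kind of data, regardless of the key."""
    return FEATURE_SHARDS[feature]


def key_based_shard(key, shards=SHARDS):
    """Key-based sharding: hash the key and take it modulo the shard count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return shards[int.from_bytes(digest[:8], "big") % len(shards)]


class LookupTableRouter:
    """Lookup-table sharding: an explicit key -> shard directory."""

    def __init__(self, default_shard):
        self.table = {}
        self.default_shard = default_shard

    def assign(self, key, shard):
        self.table[key] = shard

    def shard_for(self, key):
        return self.table.get(key, self.default_shard)
```

Key-based routing needs no extra state but reshuffles most keys when the shard count changes. The lookup table lets individual keys be moved freely, at the price of one more component to keep available.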

The real question is not “What’s wrong with relational databases?” but rather, “What problem do you have?”

Brewer’s CAP Theorem

Availability
Consistency
Partition Tolerance


Amazon Dynamo derivatives: Cassandra, Voldemort, Riak, CouchDB
Neo4j, Google Bigtable and its derivatives: MongoDB, Redis, Hypertable
Relational: MySQL, Oracle, MSSQL

Cassandra in 50 words or less

Apache Cassandra is an open source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tuneably consistent, column-oriented database that bases its distribution design on Amazon’s Dynamo and its data model on Google’s Bigtable. Created at Facebook, it is now used at some of the most popular sites on the Web.

Cassandra case studies

Cassandra outlines

BASE (Basically Available, Soft-state, Eventual consistency) and not ACID (Atomicity, Consistency, Isolation, Durability)
Distributed and decentralized
Elastic scalability
High availability and fault tolerance
Tunable consistency


Use cases for Cassandra

Large deployments
Lots of writes, statistics and analysis
Geographical distribution
Evolving applications

Writes

Diagram: a write is appended to the commit log and applied to the Memtable; once the Memtable reaches a threshold it is flushed to an SSTable.

No reads
No seeks
Fast
Sequential disk access
Atomic within a column family
Any node
Always writable (hinted hand-off)
≈ 0.2 ms
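
The write path can be modelled as a toy, single-node sketch. ToyNode and its memtable_threshold parameter are hypothetical names. Clearing the whole commit log on flush is a simplification that only works because this toy has a single Memtable; nothing here reflects Cassandra's actual on-disk formats.

```python
class ToyNode:
    """Toy model of one node's write path: commit log -> Memtable -> SSTable."""

    def __init__(self, memtable_threshold=3):
        self.commit_log = []   # append-only, gives durability
        self.memtable = {}     # in-memory, holds the latest value per key
        self.sstables = []     # immutable sorted tables, oldest first
        self.memtable_threshold = memtable_threshold

    def write(self, key, value):
        # A write is a sequential append plus an in-memory update:
        # no read and no disk seek is needed.
        self.commit_log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_threshold:
            self.flush()

    def flush(self):
        # Write the Memtable out as one sorted, immutable SSTable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        # Flushed data no longer needs to be replayed after a crash.
        self.commit_log.clear()

    def read(self, key):
        # Newest data wins: Memtable first, then SSTables newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None
```

Because a write never has to look at existing data, it stays cheap however large the data set grows. The work of reconciling versions is pushed to the read side.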

Reads

Diagram: a read consults the Memtable and the SSTables; each SSTable carries a Bloom filter (Bf) and an index (Idx).

Bloom filter field to determine whether a provided key is in the SSTable
Index field for quick read
Any node
Read repair
15 ms
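
The role of the Bloom filter and the index on the read side can be sketched as follows. BloomFilter and ToySSTable are hypothetical classes, and the bit-array size and number of hash functions are arbitrary choices, not Cassandra's real parameters.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: may report false positives, never false negatives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive several bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class ToySSTable:
    """Immutable sorted table with a Bloom filter and a key index."""

    def __init__(self, rows):
        self.rows = sorted(rows.items())
        self.index = {key: i for i, (key, _) in enumerate(self.rows)}
        self.bloom = BloomFilter()
        for key in rows:
            self.bloom.add(key)
        self.lookups = 0  # counts how often the table itself was consulted

    def get(self, key):
        if not self.bloom.might_contain(key):
            return None  # definitely absent: skip this SSTable
        self.lookups += 1
        pos = self.index.get(key)
        return None if pos is None else self.rows[pos][1]
```

A negative answer from the filter is always correct, so a read can skip most SSTables that do not hold the key and use the index only where the filter says "maybe".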

The tenets of the column-oriented model

Keyspace: outer container that contains column families (sort of like a relational database)
Column Family: logical division that associates similar data (very roughly analogous to tables in the relational world)
Column: name/value pair (and a client-supplied timestamp of when it was last updated)
Super Column Family: container for super columns sorted by their names
Super Column: structure with a name and a set of dependent columns

Column Family \ Column

Column Family
A container for columns sorted by their names. Column Families are referenced and sorted by row keys.

Diagram: row key -> (column name 1, column value 1) ... (column name n, column value n)

Column
A name-value pair (also contains a timestamp for conflict resolution on the server side).

Diagram: column name : byte[], column value : byte[], + timestamp : long
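
Timestamp-based conflict resolution can be sketched as a "last write wins" rule. The Column dataclass mirrors the three fields on the slide. The resolve function and its tie-breaking (the first argument wins on equal timestamps) are simplifying assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: bytes
    value: bytes
    timestamp: int  # client-supplied, e.g. microseconds since the epoch


def resolve(a, b):
    """Pick the surviving version of one column: the newer timestamp wins."""
    if a.name != b.name:
        raise ValueError("can only resolve two versions of the same column")
    # Assumption: on a tie the first argument is kept.
    return a if a.timestamp >= b.timestamp else b
```

Since the timestamp is supplied by the client, clocks that drift between clients can make an older write win, which is one reason to keep client clocks synchronized.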

Super Column Family \ Super Column

Super Column Family
A container for super columns sorted by their names. Like Column Families, Super Column Families are referenced and sorted by row keys.

Diagram: row key -> super column name 1 -> (column name 1, column value 1) ... (column name n1, column value n1); ...; super column name m -> (column name 1, column value 1) ... (column name nm, column value nm)

Super Column
A sorted associative array of columns.

Diagram: super column name -> (column name 1, column value 1) ... (column name n, column value n)

Addressing Super Column Family

Five-dimensional hash
[Keyspace][ColumnFamily][Key][SuperColumn][SubColumn]

Addressing Column Family

Four-dimensional hash
[Keyspace][ColumnFamily][Key][Column]
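
Both addressing schemes map naturally onto nested dictionaries. In the sketch below the keyspace "Blog", the column families "Users" and "Posts" and all values are hypothetical sample data. A real store keeps names and values as byte arrays and keeps columns sorted by name.

```python
# keyspace -> column family -> row key -> column (-> sub-column)
store = {
    "Blog": {
        # Column family: four levels down to a value.
        "Users": {
            "alice": {"city": "Oslo", "email": "alice@example.com"},
        },
        # Super column family: five levels down to a value.
        "Posts": {
            "alice": {
                "post-1": {"body": "First post", "title": "Hello"},
            },
        },
    },
}


def get_column(store, keyspace, column_family, key, column):
    """[Keyspace][ColumnFamily][Key][Column]"""
    return store[keyspace][column_family][key][column]


def get_subcolumn(store, keyspace, column_family, key, super_column, sub_column):
    """[Keyspace][ColumnFamily][Key][SuperColumn][SubColumn]"""
    return store[keyspace][column_family][key][super_column][sub_column]
```

For example, get_column(store, "Blog", "Users", "alice", "email") walks four levels, while get_subcolumn(store, "Blog", "Posts", "alice", "post-1", "title") walks five.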

Cassandra client options

Thrift (12 different languages)
Avro (data serialization system)
Java:
  Hector: http://github.com/rantav/hector (abstraction over Thrift)
  Pelops: http://github.com/s7/scale7-pelops (abstraction over Thrift)
  CQL: JDBC driver for Cassandra, available starting from version 0.8 (SQL-like language)
  Hector JPA: https://github.com/riptano/hector-jpa (ORM client)
  Cassandrelle: http://demoiselle.sf.net/component/demoiselle-cassandra/ (documentation ???)
  Kundera: http://code.google.com/p/kundera/ (buggy ???)
Python: Pycassa, Telephus
Grails: grails-cassandra
.NET: Aquiles, FluentCassandra
Ruby: Cassandra
PHP: phpcassa, SimpleCassie
Cassandra \ RDBMS query differences

No update query
Record-level atomicity on writes
No duplicate keys
Basic write properties: consistency level (ZERO, ANY, ONE, QUORUM, ALL)
Basic read properties: consistency level (ONE, QUORUM, ALL)
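
The consistency levels interact through a simple counting argument: a read is guaranteed to see the latest write when the replicas written plus the replicas read exceed the replication factor, so the two sets overlap. The helper names below are hypothetical, and the sketch assumes ZERO and ANY give no guarantee that any replica holds the write yet.

```python
def replicas_required(level, replication_factor):
    """How many replicas must acknowledge at a given consistency level."""
    table = {
        "ZERO": 0,  # fire and forget
        "ANY": 1,   # a stored hint counts, so no replica may hold the data yet
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }
    return table[level]


def read_sees_latest_write(write_level, read_level, replication_factor):
    """True when the read set must overlap the write set (W + R > N)."""
    if write_level in ("ZERO", "ANY"):
        return False  # assumption: the write may not be on any replica yet
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return w + r > replication_factor
```

With a replication factor of 3, QUORUM writes plus QUORUM reads (2 + 2 > 3) always overlap, while ONE plus ONE (1 + 1) does not. That trade between latency and consistency is what "tunable consistency" refers to.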

Integrating Hadoop

Hadoop (http://hadoop.apache.org) is a set of open source projects that deal with large amounts of data in a distributed way.

Hadoop Distributed File System (HDFS): a distributed file system that provides high-throughput access to application data.
Hadoop MapReduce: a software framework for distributed processing of large data sets on compute clusters.

Other Hadoop-related projects at Apache include:

Cassandra™: a scalable multi-master database with no single points of failure.
Hive™: a data warehouse infrastructure that provides data summarization and ad hoc querying.
Mahout™: a scalable machine learning and data mining library.
Pig™: a high-level data-flow language and execution framework for parallel computation.
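
The MapReduce programming model itself fits in a few lines. The sketch below is a single-process word count with made-up function names. It illustrates the map, shuffle and reduce steps only and does not use the Hadoop API.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

In a real cluster the map and reduce calls run in parallel on many machines and the shuffle moves data across the network. Here everything happens in one process to keep the data flow visible.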



The end

Questions?