Column Oriented Database Technologies


Monday, July 23, 2012


By: Dale T. Anderson


Principal Consultant


DB Best Technologies, LLC


My recent blog (Big Data & NoSQL Technologies) discussed various NoSQL technologies and market vendors. Today let's dive into column-oriented databases and why they should play an important role in any data warehouse whose focus is on aggregations or metrics (and whose isn't?).


So you are all probably familiar with row-oriented databases: tables of data where rows of fields (also called columns) represent the structural storage, and the corresponding SQL queries that select, insert, update, and delete that data. Most database vendors like Oracle, Microsoft, Sybase, Informix, and many others base their technology on this ANSI standard. Column-oriented databases are indeed what you might surmise: tables of data where columns of data values represent the structural storage. What you might not expect is that on the surface many column-oriented databases look and feel like row-oriented databases, also using SQL queries in much the same way. Creating tables, storing data, and querying them are all pretty much identical. They may appear similar, but the significant differences are under the hood, in particular in physical storage and query optimization.
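To make that concrete, here is a hypothetical aggregation query (the table and column names are purely illustrative) of the kind that benefits most from columnar storage, since the engine only needs to read the referenced columns rather than every row in full:

    -- Hypothetical fact table; only sale_date, region, and amount are read.
    SELECT region,
           COUNT(*)    AS order_count,
           SUM(amount) AS total_sales,
           AVG(amount) AS avg_sale
    FROM   fact_sales
    WHERE  sale_date BETWEEN '2012-01-01' AND '2012-06-30'
    GROUP BY region;

A row-oriented engine would typically read entire rows to answer the same query, which is where the columnar advantage on wide tables comes from.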




As noted in my previous blogs on NoSQL, there is also a column-store technology out there. Let's not confuse that with column-oriented databases. They are different. Since several NoSQL column-store vendors were highlighted before, we will focus instead on the column-oriented database vendors here.


First, some key benefits of column-oriented databases:




• High performance on aggregation queries (like COUNT, SUM, AVG, MIN, MAX)
• Highly efficient data compression and/or partitioning
• True scalability and fast data loading for Big Data
• Accessible by many 3rd-party BI analytic tools
• Fairly simple systems administration


Due to their aggregation capabilities, which compute over large numbers of similar data items, column-oriented databases offer key advantages for certain types of systems, including:




• Data Warehouses and Business Intelligence
• Customer Relationship Management (CRM)
• Library Card Catalogs
• Ad hoc query systems


Column-oriented database technology has actually been around for many years, originating in 1969 with an application called TAXIR which provided abstracts for mathematical biosciences. In 1976, Canada implemented the RAPID system for processing and retrieval of population and housing census statistics. Sybase IQ was the only commercially available column-oriented database for many years, yet that has changed rapidly in the last few years. Let's take a quick look at some of today's key players:




SAP Sybase IQ (www.sybase.com)


A highly optimized analytics server designed specifically to deliver superior performance for mission-critical business intelligence, analytics, and data warehousing solutions on any standard hardware and operating system. Its column-oriented grid-based architecture, patented data compression, and advanced query optimizer deliver high performance, flexibility, and economy in challenging reporting and analytics environments.



Essentially a data-partitioned, index-based storage technology, Sybase IQ's engine offers several key features, which include:



• Web-enabled analytics
• Communications & Security
• Fast Data Loading
• Query Engine supporting Full Text Search
• Column Indexing Sub System
• Column Storage Processor
• User-Friendly GUI-based Administration & Monitoring
• Multiplex Grid Architecture
• Information Lifecycle Management


The Sybase IQ Very Large Data Base (VLDB) option provides partitioning and placement, where a table can have a specified column partition key with value ranges. This partitioning allows data that belongs together to be grouped together and separates data that should be kept apart. The drawback to this methodology is that it is not always known in advance which is which.
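As a minimal sketch, a range-partitioned table in Sybase IQ might be declared roughly like this (the exact clause varies by IQ version, and the table, column, and partition names are purely illustrative):

    -- Illustrative only: range partitioning on a date column.
    CREATE TABLE fact_sales (
        sale_date DATE,
        region    VARCHAR(32),
        amount    NUMERIC(12,2)
    )
    PARTITION BY RANGE (sale_date)
    (   p2011 VALUES <= ('2011-12-31'),
        p2012 VALUES <= ('2012-12-31')
    );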




Infobright (www.infobright.com)


Offering both a commercial (IEE) and a free community (ICE) edition, the combination of a column-oriented database with their Knowledge Grid architecture delivers a self-managed, scalable, high-performance analytics query platform. Allowing 50TB using a single server, their industry-leading data compression (10:1 up to 40:1) significantly reduces storage requirements and expensive hardware infrastructures. Delivered as a MySQL engine, Infobright runs on multiple operating systems and processors, needing only a minimum of 4GB of RAM (however, 16GB is a recommended starting point).




Avoiding partition schemes, Infobright data is stored in data packs, with each data pack node containing pre-aggregated statistics about the data stored within it. The Knowledge Grid above provides related metadata, giving a high-level view of the entire content of the database. Indexes, projections, partitioning, or aggregated tables are not needed, as these metadata statistics are managed automatically. The granular computing engine processes queries using the Knowledge Grid information to optimize query processing, eliminating or significantly reducing the amount of data that must be decompressed and accessed to answer a query. Some queries may not need to access the data at all, finding instead the answer in the Knowledge Grid itself.
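For example, a query like the following can often be answered entirely from those pack statistics, without decompressing any data packs at all (the table and columns here are hypothetical):

    -- MIN/MAX/COUNT can frequently be resolved from pack-level metadata.
    SELECT MIN(sale_date), MAX(sale_date), COUNT(*)
    FROM   fact_sales;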


The Infobright Data Loader is highly efficient, so data inserts are very fast. This performance gain does come at a price, so avoid updates unless absolutely necessary, design de-normalized tables, and don't plan on any deletes. New features of the data loader include a reject option which allows valid rows to commit while invalid rows are logged. This is highly useful when loading millions of rows and only having a few rows with bad data. Without this feature the entire data load would be rolled back.
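Since Infobright is driven through the MySQL interface, a load with the reject option might look roughly like the sketch below; the session variable names are my assumption of the IEE loader settings, so verify them against your release's documentation:

    -- Assumed IEE reject-handling settings (verify names for your version).
    SET @BH_REJECT_FILE_PATH = '/tmp/sales_rejects.txt';
    SET @BH_ABORT_ON_COUNT   = 100;  -- tolerate up to 100 bad rows before aborting
    LOAD DATA INFILE '/data/sales.csv'
    INTO TABLE fact_sales
    FIELDS TERMINATED BY ',' ENCLOSED BY '"';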




Vertica (HP) (www.vertica.com)


Recently acquired by Hewlett Packard, this platform was purpose-built from the ground up for data with high-performance, real-time analytics needs. With extensive data loading, query, columnar storage, MPP architecture, and data compression features, diverse communities can develop and scale with a seamless integration ecosystem.


Claiming elasticity, scale, performance, and simplicity, the Vertica analytics platform uses transformation partitioning to specify which rows belong together and parallelism for speed (a minimal partitioning sketch follows the feature list below). Several key features include:



• Columnar Storage & Execution
• Real-Time Query & Loading
• Scale-out MPP Architecture
• Automatic High Availability
• Aggressive Data Compression
• Extensible In-Database Analytics Framework
• In-Database Analytics Library
• Database Designer & Administration Tools
• Native BI & ETL support for MapReduce & Hadoop
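As promised above, here is a minimal sketch of grouping rows via Vertica's partitioning clause; the table and expression are purely illustrative and the exact options vary by release:

    -- Illustrative only: rows are grouped into partitions by year,
    -- so a whole year can be dropped or archived as a unit.
    CREATE TABLE fact_sales (
        sale_date DATE NOT NULL,
        region    VARCHAR(32),
        amount    NUMERIC(12,2)
    )
    PARTITION BY EXTRACT(YEAR FROM sale_date);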


The Vertica Optimizer is the brains of the analytics platform, producing optimal query execution plans where several choices exist. It does this through traditional considerations like disk I/O, and further incorporates CPU, memory, network, concurrency, and parallelism factors, plus the unique details of the columnar operators and runtime environment.




ParAccel (www.paraccel.com)


Analytics-driven companies need a platform, not just a database, where speed, agility, and complexity drive the data ecosystem. The ParAccel Analytic Platform streamlines the delivery of complex business decisions through its high-performance analytic database. Designed for speed, its extensible framework supports on-demand integration and embedded functions.



The ParAccel Database (PADB) presents four main components: the 'Leader' node, the 'Compute' nodes, the Parallel Communications Fabric, and an optional Storage Area Network (SAN). The 'Leader' controls the execution of the 'Compute' nodes, and all nodes communicate with each other via the 'Fabric' running on standard x86 Linux servers. Each 'Compute' node is subdivided into a set of parallel processes called 'slices' that include a CPU core and their own allocation of memory and local disk storage. The 'Communication Fabric' provides a low-level MPP network protocol for increased performance.




Key PADB features include:




• High Performance & Scalability
• Columnar Orientation
• Extensible Analytics
• Query Compilation
• High Availability
• Solution Simplicity


The ParAccel Integrated Analytics Library and Extensibility Framework incorporates advanced functions along with an API to add your own functions, helping address complex business problems right in the core database and enabling customers to focus on their specific data complexities.





Microsoft SQL Server 2012 (www.microsoft.com)


Released this year, SQL Server 2012 shows that Microsoft has now embraced the columnar database idea. The latest release includes xVelocity, a column-store index feature that stores data similar to a column-oriented DBMS. While not a true column-oriented database, this technique allows for the creation of a memory-optimized index that groups and stores data for each column and then joins them together to complete the index. For certain types of queries, like aggregations, the query processor can take advantage of the column-store index to significantly improve execution times. Column-store indexes can be used with partitioned tables, providing a new way to think about how to design and process large datasets.
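A short, hedged example of how this is typically used (the fact table and column names are hypothetical; the CREATE syntax is SQL Server 2012's nonclustered columnstore form):

    -- Hypothetical star-schema fact table; names are illustrative only.
    CREATE NONCLUSTERED COLUMNSTORE INDEX ix_FactSales_cs
    ON dbo.FactSales (OrderDateKey, ProductKey, RegionKey, SalesAmount);

    -- Aggregations over the indexed columns can then run against the
    -- columnstore index.
    SELECT RegionKey, SUM(SalesAmount) AS TotalSales
    FROM   dbo.FactSales
    GROUP BY RegionKey;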




The column-store index can be very useful on large fact tables in a Star schema, improving overall performance; however, the cost-model approach utilized may choose the column-store index for a table when a row-based index would have been better. Using the IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX query hint will work around this if it occurs. When data is stored with a column-store index, it can often be compressed more effectively than with a row-based index. This is accomplished because there is typically more redundancy within a column than within a row. Higher compression means less I/O is required to retrieve data into memory, which can significantly reduce response times.
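Applied to the hypothetical table above, the hint is added per query like this:

    -- Force a row-based plan when the columnstore choice is a poor fit.
    SELECT OrderDateKey, SalesAmount
    FROM   dbo.FactSales
    WHERE  OrderDateKey = 20120723
    OPTION (IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX);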



There are several restrictions and limitations in using a column-store index. For example, not all data types are supported, and you can only create one column-store index on any table, which can be problematic. Become familiar with what it can do and where best to use it. Currently the column-store index is not supported on Microsoft Azure.


Column-oriented databases, applied correctly, provide significant advantages over traditional row-oriented systems, in particular for data warehouse and business intelligence environments where aggregations prevail. It would not be fair, however, to ignore the disadvantages. Let's look at both sides:




Column-Oriented Advantages

• Efficient storage and data compression
• Fast data loads
• Fast aggregation queries
• Simplified administration & configurations




Column-Oriented Disadvantages

• Transactions are to be avoided or just not supported
• Queries with table joins can reduce high performance
• Record updates and deletes reduce storage efficiency
• Effective partitioning/indexing schemes can be difficult to design


The real value in using column-oriented database technology comes from high-performance, scalable storage and retrieval of large to massive datasets (Big Data) focused on aggregation queries. Simply put: Reports! You can design Star schemas or Data Vaults (The Data Vault: What is it? Why do we need it?) incorporating these technologies, and you will find that column-oriented databases provide a clear solution in data warehousing and business intelligence.


Look for future blogs on Hadoop/Hive/HBase and Extract-Transform-Load (ETL) technologies, and don't be afraid to comment, question, or debate; there is always room to learn new things...