Why We Chose MongoDB to Put Big Data ‘On the Map’


14 Dec 2013


WHY WE CHOSE MONGODB TO PUT BIG-DATA ‘ON THE MAP’

MARCH 2011








@nknize

+Nicholas Knize

“The 3D UDOP allows near real time visibility of all SOUTHCOM Directorates information in one
location…this capability allows for unprecedented situational awareness and information sharing”













- Gen. Doug Frasier

TST PRODUCTS

ACCOMPLISHING THE IMPOSSIBLE


Expose enterprise data in a geo-temporal user defined environment

Provide a flexible and scalable spatial indexing framework for heterogeneous data

Visualize spatially referenced data on 3D globe & 2D maps

Manage real-time data feeds and mobile messaging

View data over geo-rectified imagery with 3D terrain

Support mission planning and simulation

Provide real-time collaboration and sharing

ISPATIAL OVERVIEW

Why NoSQL?!?

(CAVEATS)


Use the right tool for the job

WHY NOSQL?


Understand your needs!


Relational is not always bad

Engineering with Constraints

Unbounded Engineering


Horizontally scalable



Large volume / elastic



Vertically scalable


Heterogeneous data types (“Data Stack”)



Widely Distributed


Reduce the distance bits must travel



Fault Tolerant


Replication Strategy and Consistency model



High Availability


Node recovery



Fast


Reads or writes (can’t always have both)

BIG DATA STORAGE CHARACTERISTICS



Desired Data Store Characteristics for ‘Big Data’


Battle tested, Battle proven


Relational Model dates back to 1969



Plethora of Relational Experience


Full-Time DBAs, Training & Certs



Company Backed


Safe choice for business / mission critical systems



Fewer Alternatives


Non-relational is a 5-year-old know-it-all



Mostly Standardized


SQL ISO/IEC 9075 Accepted Standard



Theoretically Sound


Based on 100 years of First-Order Logic theory

RDBMS STRENGTHS


RDBMS Strengths


Atomicity


If one fails, we all fail!



Consistency


All data constraints (normalized schema), cascades, triggers, etc. must be met before a transaction succeeds. (LATENCY)



Isolation


Synchronization: no operation can see a transaction that hasn’t yet completed



Durability


Once a transaction is committed it will remain committed, even through power loss, crashes, or other hardware errors.
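A minimal sketch of atomicity and consistency working together, using Python's built-in sqlite3 module as a stand-in relational store. The accounts table, the transfer helper, and the balances are hypothetical and exist only for illustration; they are not from the original deck.

```python
import sqlite3

# In-memory database with a consistency rule: balances may never go negative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.execute("INSERT INTO accounts VALUES ('alice', 100)")
conn.execute("INSERT INTO accounts VALUES ('bob', 0)")
conn.commit()


def transfer(db, src, dst, amount):
    """Move funds in one transaction: either both updates apply or neither."""
    try:
        with db:  # commits on success, rolls back if an exception escapes
            db.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # If this debit violates the CHECK constraint, the credit above
            # is rolled back as well ("if one fails, we all fail").
            db.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(db):
    return dict(db.execute("SELECT name, balance FROM accounts"))


failed = transfer(conn, "alice", "bob", 150)   # would overdraw alice
after_failed = balances(conn)
succeeded = transfer(conn, "alice", "bob", 60)
after_success = balances(conn)
```

The failed transfer leaves both rows untouched, which is the behavior the Atomicity and Consistency bullets describe, and also the source of the latency the slide flags.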


ACID THEORY



Relational on ACID



Writes are accomplished using in-place update on disk (crazy disk swapping rate)



Table joins, updates, and large queries quickly outgrow disk cache, requiring many random disk seeks (performance bottleneck!!)



Strict consistency requirements impact scalability (e.g. Postgres uses Multiversion Concurrency Control, commonly resulting in stale data)



As data centers grow, the probability of node failure (due to Disk Writes, Consistency, and Atomic operations) increases


RDBMS WEAKNESSES

RDBMS Weaknesses


Cassandra


Nice Bring Your Own Index (BYOI) design


… but Java, Java, Java… Memory management can be an issue


Adding new nodes can be a pain (Token Changes, nodetool)


Key-Value store…good for simple data models



HBase


Nice BigTable model


Theory grounded heavily in C.A.P, inflexible trade-offs


Complicated setup and maintenance



CouchDB


Provides some GeoSpatial functionality


HEAVILY dependent on Map-Reduce model (complicated design)


Erlang based: poor multi-threaded heap management




NOSQL OPTIONS


Subset of Evaluated NoSQL Options


Why MongoDB for Thermopylae?


Documents based on JavaScript Object Notation (JSON)


A GeoJSON match made in heaven!



C++: No Garbage Collection Overhead! Efficient memory management design reduces disk swapping and paging



Disk storage is memory mapped, enabling fast swapping when necessary



Built-in auto-failover with replica sets and fast recovery with journaling



Tunable Consistency


Consistency defined at application layer



Schema Flexible


Retains friendly properties of SQL while enabling ad-hoc queries



Provided initial spatial indexing support


Point based only!
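To make the JSON and schema-flexibility points concrete, here is a small sketch of two documents that could sit in the same collection. Every field name and value below is hypothetical; only the GeoJSON Point layout (a "type" plus "coordinates" in longitude, latitude order) follows the GeoJSON convention. How the database indexes such a field is not shown.

```python
import json

# A hypothetical check-in document: a GeoJSON Point plus free-form attributes.
checkin = {
    "name": "Example venue",
    "category": "stadium",
    "loc": {"type": "Point", "coordinates": [-81.6882, 41.4965]},  # [lon, lat]
    "tags": ["sports", "events"],
}

# A second hypothetical document with a different shape. No schema migration
# is needed for the two to coexist, which is the "schema flexible" property.
report = {
    "title": "Example report",
    "loc": {"type": "Point", "coordinates": [-80.1918, 25.7617]},
    "classification": "UNCLASSIFIED",
}


def lon_lat(doc):
    """Return the (longitude, latitude) pair of a document's point location."""
    lon, lat = doc["loc"]["coordinates"]
    return lon, lat


# Documents serialize to JSON text and back without loss.
decoded = json.loads(json.dumps(checkin))
```

Because the location is just another field of the document, spatial and attribute data travel together.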


WHY TST LIKES MONGODB

MONGODB SPATIAL INDEXER


... The Spatial Indexer wasn’t quite right


MongoDB (like nearly all relational DBs) uses a B-Tree


Data structure that keeps data sorted and supports search, insert, and delete in logarithmic time


Great for indexing numerical and text documents (attribute data)


Cannot store multi-dimensional data


NOT GEOMETRY FRIENDLY


DIMENSIONALITY REDUCTION


How does MongoDB solve the dimensionality problem?


Space Filling Curve


A continuous line that passes through every point in a two-dimensional plane



Use Geohash to represent lat/lon values


Interleave the bits of a lat/lon pair


Base32 encode the result
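The two steps above can be sketched as follows. This is a minimal version of the standard public Geohash encoding, not MongoDB's internal implementation; the function name and the default precision are illustrative choices.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=9):
    """Interleave longitude/latitude bits, then Base32 encode 5 bits per char."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    bits = []
    use_lon = True  # Geohash starts with a longitude bit
    while len(bits) < precision * 5:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits.append(1)
                lon_lo = mid
            else:
                bits.append(0)
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits.append(1)
                lat_lo = mid
            else:
                bits.append(0)
                lat_hi = mid
        use_lon = not use_lon

    chars = []
    for i in range(0, len(bits), 5):
        value = 0
        for bit in bits[i:i + 5]:
            value = (value << 1) | bit
        chars.append(BASE32[value])
    return "".join(chars)


example = geohash_encode(57.64911, 10.40744, precision=5)
```

The result is a one-dimensional, sortable string, which is what lets a B-Tree index a two-dimensional point.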


GEOHASH BTREE ISSUES


Neighbors aren’t so close!


Neighboring points on the Geoid may end up on opposite ends of the plane


Impacts search efficiency



What about Geometry?


Doesn’t support > 2D


Mongo uses Multi-Location documents, which really just index multiple points that link back to a single document
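The neighbor problem can be seen with a compact Geohash encoder (the standard public algorithm, written here as a sketch). The sample coordinates are arbitrary: two points a few meters apart on either side of the equator and prime meridian, and two points in the same top-level cell that are thousands of kilometers apart.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision=6):
    """Standard Geohash: alternate lon/lat bisection bits, 5 bits per char."""
    ranges = {"lon": [-180.0, 180.0], "lat": [-90.0, 90.0]}
    values = {"lon": lon, "lat": lat}
    axis, value, out = "lon", 0, []
    for bit_index in range(precision * 5):
        lo, hi = ranges[axis]
        mid = (lo + hi) / 2
        bit = 1 if values[axis] >= mid else 0
        ranges[axis][0 if bit else 1] = mid  # keep the half containing the point
        value = (value << 1) | bit
        if bit_index % 5 == 4:
            out.append(BASE32[value])
            value = 0
        axis = "lat" if axis == "lon" else "lon"
    return "".join(out)


def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Two points roughly 30 m apart, straddling the equator and prime meridian.
north_east = geohash_encode(0.0001, 0.0001)
south_west = geohash_encode(-0.0001, -0.0001)

# Two points far apart that still share a leading character.
far_a = geohash_encode(1.0, 1.0)
far_b = geohash_encode(40.0, 40.0)

near_shared = common_prefix_len(north_east, south_west)
far_shared = common_prefix_len(far_a, far_b)
```

The near pair shares no prefix at all, so a prefix range scan on the B-Tree cannot find both, while the far pair shares a prefix despite the distance. A search therefore has to probe several neighboring cells, which is the efficiency impact the slide refers to.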



Issues with the Geohash over B-Tree approach


Constrain the system to single point searches


Multi-dimension support will be exponentially complex (won’t scale)




Interpolate points along the edge of the shape


Multi-dimension support will be exponentially complex (won’t scale)




Customize the spatial indexer


Selected approach


SOLUTIONS TO GEOHASH PROBLEM


Potential Solutions

Mongo Multi-location Document Clipping Issues

($within search doesn’t always work w/ multi-location)

[Figure: a Multi-Location Document (a.k.a. Polygon) tested against a Search Polygon in four cases; two are labeled Success! and two are labeled Fail!]
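One plausible reading of the failure cases, sketched under a simplifying assumption: a multi-location document is modeled as nothing more than its list of vertex points, and a box search matches only if some indexed vertex lies inside the box. The coordinates and helper names are hypothetical, and the original figure is not reproduced here.

```python
def point_in_box(point, box):
    """box = (min_x, min_y, max_x, max_y); edges count as inside."""
    x, y = point
    min_x, min_y, max_x, max_y = box
    return min_x <= x <= max_x and min_y <= y <= max_y


def matches_by_vertices(vertices, box):
    """Simplified multi-location search: true if ANY indexed vertex is in the box."""
    return any(point_in_box(v, box) for v in vertices)


def bounding_box(vertices):
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_intersect(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# A square "polygon" stored as four vertex points.
polygon = [(0, 0), (10, 0), (10, 10), (0, 10)]

corner_search = (-1, -1, 1, 1)   # covers the vertex at (0, 0)
inner_search = (4, 4, 6, 6)      # lies entirely inside the polygon

hit_corner = matches_by_vertices(polygon, corner_search)
hit_inner = matches_by_vertices(polygon, inner_search)

# A bounding-box test sees the overlap that the vertex test misses.
mbr_hit_inner = boxes_intersect(bounding_box(polygon), inner_search)
```

The inner search overlaps the shape geometrically but contains none of its vertices, so the point-only test misses it. Interpolating extra points along the edges narrows the gap without closing it, which is consistent with the deck choosing a customized indexer instead.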

CUSTOM TUNED SPATIAL INDEXER


Thermopylae Custom Tuned MongoDB for Geo

TST leverages Guttman’s 1984 research on R-Trees and the later R*-Tree refinements


R-Trees organize data of any dimension by representing each object as a minimum bounding box.


Each node bounds its children. A node can have many objects in it (max: m, min: ceil(m/2))


Inserts and merges optimized by minimizing overlaps


The leaves point to the actual objects (stored on disk probably)


Height balanced: tree height is O(log n), so search is O(log n) when few bounding boxes overlap
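A minimal sketch of the structure described above: each node's bounding box covers its children, leaves hold the object references, and a search descends only into nodes whose box intersects the query. The tree is built by hand here; insertion, splitting, and the fan-out limits are omitted, and all names are illustrative rather than taken from TST's indexer.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def union(rects: List[Rect]) -> Rect:
    return (
        min(r[0] for r in rects),
        min(r[1] for r in rects),
        max(r[2] for r in rects),
        max(r[3] for r in rects),
    )


@dataclass
class Node:
    children: List["Node"] = field(default_factory=list)        # internal node
    entries: List[Tuple[Rect, str]] = field(default_factory=list)  # leaf node


def node_mbr(node: Node) -> Rect:
    """Minimum bounding rectangle of everything stored beneath this node."""
    rects = [node_mbr(c) for c in node.children] + [r for r, _ in node.entries]
    return union(rects)


def search(node: Node, query: Rect) -> List[str]:
    """Collect ids of objects whose rectangle intersects the query."""
    hits = [oid for rect, oid in node.entries if intersects(rect, query)]
    for child in node.children:
        if intersects(node_mbr(child), query):  # prune non-overlapping subtrees
            hits.extend(search(child, query))
    return hits


leaf_1 = Node(entries=[((0, 0, 1, 1), "a"), ((2, 2, 3, 3), "b")])
leaf_2 = Node(entries=[((10, 10, 11, 11), "c"), ((12, 12, 13, 13), "d")])
root = Node(children=[leaf_1, leaf_2])

near_origin = search(root, (0.5, 0.5, 2.5, 2.5))
single = search(root, (10.5, 10.5, 10.6, 10.6))
nothing = search(root, (5, 5, 6, 6))
```

A real implementation would cache each node's box rather than recompute it, and overlap between sibling boxes is what pushes a search beyond a single root-to-leaf path.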


Spatial Indexing at Scale with R-Trees

RTREE THEORY

Spatial data represented as minimum bounding rectangles (n-dimension)


Index represented as: <I, tuple> where:


I = (I_0, I_1, …, I_{n-1}) : n = number of dimensions


Each I_i is an interval of the form [min, max] describing the MBR range along dimension i


tuple-identifier includes a key that contains a data-center and server identifier
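The <I, tuple> entry can be sketched as a small data structure. The field names (data_center, server, key) are assumptions about what the tuple identifier carries, based only on the description above; the sample values are made up.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

Interval = Tuple[float, float]  # [min, max] along one dimension


@dataclass(frozen=True)
class TupleId:
    data_center: str  # hypothetical field
    server: str       # hypothetical field
    key: str          # hypothetical document key


@dataclass(frozen=True)
class IndexEntry:
    intervals: Tuple[Interval, ...]  # I = (I_0, ..., I_{n-1})
    tuple_id: TupleId


def mbr_intervals(points: Sequence[Sequence[float]]) -> Tuple[Interval, ...]:
    """Per-dimension [min, max] covering every point of a geometry."""
    dims = len(points[0])
    return tuple(
        (min(p[d] for p in points), max(p[d] for p in points))
        for d in range(dims)
    )


def overlaps(a: IndexEntry, b: IndexEntry) -> bool:
    """Two MBRs intersect only if their intervals overlap in every dimension."""
    return all(
        lo_a <= hi_b and lo_b <= hi_a
        for (lo_a, hi_a), (lo_b, hi_b) in zip(a.intervals, b.intervals)
    )


# A 3-dimensional example (x, y, z); the same code works for any n.
shape = IndexEntry(
    mbr_intervals([(0, 0, 5), (2, 3, 1), (1, -1, 4)]),
    TupleId("dc-east", "server-03", "doc-42"),
)
query_hit = IndexEntry(((1, 4), (2, 6), (0, 1)), TupleId("-", "-", "query"))
query_miss = IndexEntry(((3, 4), (0, 1), (0, 9)), TupleId("-", "-", "query"))
```

Because the intervals are stored per dimension, nothing in the entry is specific to two dimensions, which is the property the Geohash approach lacked.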


Spatial Index Example


Sample insertion result for 4th degree tree


Objective: Minimize overlaps

[Figure: R-Tree over leaf entries a through p]

T-Sciences Custom Tuned Spatial Indexer


Optimized Spatial Search


Finds intersecting MBRs and recurses into those nodes



Optimized Spatial Inserts


Uses the Hilbert Value of MBR centroid to guide search


28% reduction in number of nodes touched



Optimized Deletes


Leverages R* split approach for rebalancing tree when nodes become underfull



Low maintenance


Leverages MongoDB’s automatic data compaction and partitioning
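The Hilbert value of an MBR centroid can be computed as sketched below, using the standard grid-to-curve conversion. How TST's indexer uses that value during inserts is not described in the deck; ordering entries by it, as in a Hilbert R-Tree, is an assumption here, and the grid order and sample boxes are arbitrary.

```python
def hilbert_index(order, x, y):
    """Distance along a Hilbert curve covering a 2^order x 2^order grid."""
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the curve stays continuous.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def mbr_hilbert_value(mbr, order=16):
    """Hilbert value of an MBR centroid; mbr = (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = mbr
    center_lon = (min_lon + max_lon) / 2
    center_lat = (min_lat + max_lat) / 2
    n = 1 << order
    # Quantize the centroid onto the grid, clamping the upper edge.
    gx = min(n - 1, int((center_lon + 180.0) / 360.0 * n))
    gy = min(n - 1, int((center_lat + 90.0) / 180.0 * n))
    return hilbert_index(order, gx, gy)


# Hypothetical bounding boxes, ordered by the Hilbert value of their centroids.
boxes = {
    "miami": (-80.3, 25.7, -80.1, 25.9),
    "cleveland": (-81.8, 41.4, -81.6, 41.6),
    "sydney": (151.1, -34.0, 151.3, -33.8),
}
ordered = sorted(boxes, key=lambda name: mbr_hilbert_value(boxes[name]))
```

Unlike the bit interleaving behind Geohash, consecutive Hilbert values always map to adjacent grid cells, so entries that are close on the curve are close on the map.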


Example Use Case


OSINT (Foursquare Data)


Sample Foursquare data set mashed with Government Intel Data

1 million Geo Document test (points and polys)

4 server replica set

~350ms query response

~300% improvement over PostGIS

Community Support


Thermopylae contributes fixes to the codebase


http://github.com/mongodb



TST will work with 10gen to fold into the baseline



Active developer collaboration


IRC: #mongodb freenode.net



THANK YOU

Questions?


Nicholas Knize

nknize@t-sciences.com


Backup


Thermopylae Sciences & Technology


Who are we?


Advanced technology w/ 160+ employees


Core customers in national security, venues and events, military and police, and city planning


Partnered with Google and imagery providers


Long term relationship focused


TS/SCI Staff



TST + 10gen + Google = Game-changing approach



WHO ARE THESE GUYS?


ENTERPRISE PARTNER

Key Customers - Government



US Dept of State Bureau of Diplomatic Security


Build and support 30 TB Google Earth Globe with multi-terabytes of individual globes sent to embassies throughout the world. Integrated Google Earth and iSpatial framework.


US Army Intelligence Security Command


Provide expertise in managing technology integration


Prime contractor providing operations, intelligence, and IT support worldwide. Partners include IBM, Lockheed Martin, Google, MIT, Carnegie Mellon. Integrated Google Earth and iSpatial framework.


US Southern Command


Coordinate Intelligence management systems spatial data collection, indexing, and distribution. Integrated Google Earth, iSpatial, and iHarvest.


Index large volume imagery and expose it for different services (Air Force, Navy, Army, Marines, Coast Guard)



GOVERNMENT CUSTOMERS


COMMERCIAL CUSTOMERS


Key Customers - Commercial


Cleveland Cavaliers

USGIF

Las Vegas Motor Speedway

Baltimore Grand Prix

iSpatial framework serves thousands of mobile devices

MONGODB BEST OF BOTH WORLDS



MongoDB



The Best of Both Worlds!

Big Data Scaling - Terminology


Shard


Stores a single partition (subset) of the big dataset.


Replica


A copy of a partition following a consistency model (delta, eventual, causal, etc.)


Slice


Single Operating System in a large pool of heterogeneous operating systems (virtualization).