6. Big Data and Cloud Computing


http://en.wikipedia.org/wiki/Big_data


What is big data?

- Big data is data that cannot be analyzed using a traditional relational database: there is so much of it!
- Companies that develop the database platforms to analyze big data will make (are making) a fortune!
- Big data is the next technology problem looking for a solution!

We live in a world of data …

Using a Traditional Database (RDBMS)

- Storing large data in a traditional database: it is easier to get the data in than out.
- Most RDBMSs are designed for efficient transaction processing:
  - Adding, updating, searching for, and retrieving small amounts of information
  - Data is acquired in a transactional fashion

Then, what is the problem with "Big Data"?

The trouble comes with …

- managing massive amounts of accumulated data, collected over months or years,
- and learning something from the data,
- and naturally we want the answer in seconds or minutes.

Primarily, it is about the analysis of large data sets.

Big Data Cloud

Source Data:

- Log Files
  - Event Logs / Operating System (OS)-Level
  - Appliance / Peripherals
  - Analyzers / Sniffers
- Multimedia
  - Image Logs
  - Video Logs
- Web Content Management (WCM)
  - Web Logs
  - Search Engine Optimization (SEO)
  - Web Metadata

Data in the cloud

- Storing the data: Google BigTable, Amazon S3, NoSQL stores (Cassandra, MongoDB), etc. (a storage sketch follows below)
- Processing the data: MapReduce (Hadoop), Mahout, etc.
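To make the storage bullet concrete, here is a minimal sketch of cloud object storage using the boto3 client for Amazon S3. The bucket and key names are hypothetical, and it assumes AWS credentials are already configured in the environment.

```python
import boto3

s3 = boto3.client("s3")

# S3 has no schema: you store raw bytes under a key ...
# (bucket/key names here are invented for illustration)
s3.put_object(
    Bucket="my-logs-bucket",
    Key="2013/11/18/web.log",
    Body=b"GET /index.html 200\n",
)

# ... and read them back later by the same key.
obj = s3.get_object(Bucket="my-logs-bucket", Key="2013/11/18/web.log")
print(obj["Body"].read())
```

The same key/value style of access is what NoSQL stores such as Cassandra and MongoDB generalize with richer data models.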

Big Data Cloud

Cloud-Based Big Data Solutions:

- DBaaS (a usage sketch follows this list)
  - Amazon Web Services (AWS)
    » DynamoDB
    » SimpleDB
    » Relational Database Service (RDS): Oracle 11g / MySQL
  - Google App Engine
    » Datastore
  - Microsoft SQL Azure
- Processing
  - AWS Elastic MapReduce (EMR)
  - Google App Engine MapReduce: Mapper API
  - Microsoft: Apache Hadoop for Azure
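As an illustration of the DBaaS idea, here is a hedged sketch of writing and reading one item in Amazon DynamoDB through boto3. The table name and key schema are assumptions made for this example; the point is that the provider handles provisioning, replication, and scaling.

```python
import boto3

db = boto3.client("dynamodb")

# Hypothetical table "Users" keyed by the string attribute "user_id".
# DynamoDB items are schemaless attribute maps.
db.put_item(
    TableName="Users",
    Item={"user_id": {"S": "u42"}, "name": {"S": "Ada"}},
)

# Read the item back by its primary key.
resp = db.get_item(TableName="Users", Key={"user_id": {"S": "u42"}})
print(resp.get("Item"))
```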

Cloud File Systems

Traditional distributed file systems (DFSs) need modifications.

As in traditional DFSs, we need:

- performance
- scalability
- reliability
- availability

Differences:

- Component failures are the norm (large numbers of commodity machines).
- Files are huge (>> 100 GB).
- Appending new data at the end of files is better than overwriting existing data (see the sketch below).
- High, sustained bandwidth is more important than low latency.

Examples of cloud DFSs: Google File System (GFS), Amazon Simple Storage Service (S3), Hadoop Distributed File System (HDFS).
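The append-over-overwrite point can be shown with a minimal local sketch, with plain Python file I/O standing in for a real DFS client: records only ever land at the end of the file, which is the access pattern GFS and HDFS optimize for.

```python
# A local stand-in for the append-only pattern of cloud DFSs:
# new records go at the end of the file; existing bytes are never rewritten.
def log_record(path: str, record: str) -> None:
    with open(path, "a") as f:  # append mode: writes land after existing data
        f.write(record + "\n")

log_record("events.log", "2013-11-18T10:00:01 sensor=7 temp=21.4")
log_record("events.log", "2013-11-18T10:00:02 sensor=7 temp=21.5")
```

Sequential appends also deliver the high, sustained bandwidth noted above, since the writer never seeks back into old data.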

Storage as a Service

- Many companies already provide access to massive data sets as a service (e.g. Amazon, Google).
- They also provide access to raw storage as a service.
- Advantages:
  - They already know how to manage storage clusters
  - More reliable than personal storage
  - Available anywhere
- Disadvantages:
  - Security?

The Cloud Scales: Amazon S3 Growth

(S3 = Simple Storage Service)

Overview

- As the Internet reaches every corner of the world and technology keeps advancing, the amount of digital data generated by the web and digital devices grows exponentially.
- According to estimates by the Economist, a total of 1,200 exabytes of data were generated in 2010, and 7,900 exabytes are predicted by 2015 (an exabyte is equal to one billion gigabytes).
- The amount of data, the speed at which it is generated, and the variety of data formats raise new challenges, not only in technology, but also in all other fields where data is utilized as one of the critical resources in making decisions and predictions.

- As defined by the McKinsey Global Institute, "Big Data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze" and it "is the next frontier for innovation, competition, and productivity."
- The US government announced in March 2012 a Big Data Initiative with $200 million in new funding to support research into improving "the ability to extract knowledge and insights from large and complex collections of digital data."
- Big Data has moved to centre stage for advancing research, technology, and productivity, not only in mathematics, statistics, the natural sciences, computer science, technology, and business, but also in the humanities, medical studies, and the social sciences.

Big Data - What is it?

- The term big data refers to collections of data sets so large and complex that it becomes difficult to process them using traditional database management tools or data processing applications.
- Big data is difficult to work with using most conventional relational database management systems and desktop statistics and visualization packages, requiring instead massively parallel software running on tens, hundreds, or even thousands of servers.

http://www.zdnet.com/blog/virtualization/what-is-big-data/1708
http://queue.acm.org/detail.cfm?id=1563874


Some Trends in Computing

• The Data Deluge is a clear trend in commercial (Amazon, e-commerce), community (Facebook, search), and scientific applications.
• Lightweight clients: smartphones and tablets with sensors.
• Multicore processors are reawakening parallel computing (see the sketch after this list).
• Clouds offer cheaper, greener, easier-to-use IT for (some) applications.
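As a small illustration of the multicore point, here is a sketch using Python's standard multiprocessing pool to spread one computation across all available cores:

```python
from multiprocessing import Pool, cpu_count

def square(n: int) -> int:
    return n * n

if __name__ == "__main__":
    # The input range is split among worker processes, one per core.
    with Pool(processes=cpu_count()) as pool:
        print(pool.map(square, range(10)))
```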


Internet of Things and the Cloud

- It is projected that there will be 24 billion devices on the Internet by 2020.
- Most will be small sensors that send streams of information into the cloud, where it will be processed, integrated with other streams, and turned into knowledge that will help our lives in a multitude of small and big ways. At least, that is the hype!
- The cloud will become increasingly important as a controller of, and resource provider for, the Internet of Things: the use of computers to enrich every aspect of everyday life.


Data produced …

The World of Data

Our Data-driven World:

- Science: databases from astronomy, genomics, environmental data, transportation data, …
- Humanities and Social Sciences: scanned books, historical documents, social-interaction data, …
- Business and Commerce: corporate sales, stock market transactions, census data, airline traffic, …
- Entertainment: Internet images, Hollywood movies, MP3 files, …
- Medicine: MRI and CT scans, patient records, …

Data-rich World

- Data capture and collection:
  - Highly instrumented environments
  - Sensors and smart devices
  - Networks
- Data storage is cheap:
  - Seagate 1 TB Barracuda @ $68.74 from Amazon.com


Cloud Computing Modalities

- Hosted applications and services
- Pay-as-you-go model
- Scalability, fault-tolerance, elasticity, and self-manageability
- Very large data repositories
- Complex analysis
- Distributed and parallel data processing

"Can we outsource our IT software and hardware infrastructure?"

"We have terabytes of click-stream data - what can we do with it?"


Data in the Cloud - Platforms for Data Analysis

Data warehousing, data analytics, and decision-support systems:

- Used to manage and control a business.
- Data is transactional: historical or point-in-time.
- Optimized for inquiry rather than update.
- Use of the system is loosely defined and can be ad hoc.
- Used by managers and analysts to understand the business and make judgments.


Data Analytics in the Web Context

- Data is now captured at the user-interaction level, in contrast to the client-transaction level of the enterprise context.
- As a consequence, the amount of data increases significantly.
- There is a greater need to analyze such data to understand user behaviour.
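An illustrative contrast (all field names invented for this sketch): one transaction-level record from the enterprise context versus the stream of interaction-level events the web context captures around the same purchase.

```python
# One enterprise-style transaction record ...
transaction = {"order_id": 1001, "user": "u42", "item": "book", "total": 12.99}

# ... versus the interaction-level events captured on the web for it.
interaction_events = [
    {"user": "u42", "event": "page_view", "page": "/books/blue"},
    {"user": "u42", "event": "click", "target": "add_to_cart"},
    {"user": "u42", "event": "page_view", "page": "/cart"},
    {"user": "u42", "event": "click", "target": "checkout"},
    {"user": "u42", "event": "purchase", "order_id": 1001},
]

# One business transaction fans out into many captured events,
# which is why data volume grows so sharply in the web context.
print(len(interaction_events), "events captured for 1 transaction")
```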

Data Analytics in the Cloud

Scalability to large data volumes:

- Scan 100 TB on 1 node @ 50 MB/sec = 23 days
- Scan on a 1000-node cluster = 33 minutes
- Divide-and-Conquer (i.e. data partitioning, a.k.a. sharding); the arithmetic is worked out just below
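The arithmetic behind those two scan figures, worked out in a few lines (decimal units assumed):

```python
volume = 100e12          # 100 TB in bytes
rate = 50e6              # 50 MB/sec per node

one_node = volume / rate             # 2,000,000 seconds on a single node
print(one_node / 86400, "days")      # ~23.1 days

cluster = one_node / 1000            # ideal speedup on 1000 nodes
print(cluster / 60, "minutes")       # ~33.3 minutes
```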


Cost-efficiency:

- Commodity nodes (cheap, but unreliable)
- Commodity network
- Automatic fault-tolerance (fewer administrators)
- Easy to use (fewer programmers)

Limits to Computation

- Processor cycles are cheap and getting cheaper.
- What limits the application of infinite cores?
  1. Data: the inability to get data to the processor when it is needed.
  2. Power: its cost is rising and will come to dominate.
- Attributes that need the most innovation:
  - Infinite cores require infinite power.
  - Getting data to processors in time to use the next cycle (caches, multi-threading, …).
  - All such techniques consume power: more memory lanes drive bandwidth, but more pins cost power.
- Power and data movement remain the key constraints.

Platforms for Large-scale Data Analysis

Parallel DBMS technologies:

- Proposed in the late eighties
- Matured over the last two decades
- A multi-billion-dollar industry: proprietary DBMS engines intended as data warehousing solutions for very large enterprises

MapReduce:

- Pioneered by Google (the theory)
- Popularized by Yahoo! (via the Hadoop implementation)

Parallel DBMS technologies

- Popularly used for more than two decades
- Research projects
- Commercial: a multi-billion-dollar industry, but accessible to only a privileged few
- Relational data model
- Indexing
- Familiar SQL interface
- Advanced query optimization
- Well understood and well studied


MapReduce

Overview:

- A data-parallel programming model
- An associated parallel and distributed implementation for commodity clusters
- Pioneered by Google
  - Processes 20 PB of data per day
- Popularized by the open-source Hadoop project
  - Used by Yahoo!, Facebook, Amazon, and the list is growing …
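A minimal, single-process sketch of the MapReduce programming model applied to word counting; the real implementations (Google's, or Hadoop) run the same map, shuffle, and reduce phases distributed across a commodity cluster.

```python
from itertools import groupby

def map_phase(doc):
    # Map: emit one (word, 1) pair per word in the document.
    for word in doc.split():
        yield (word, 1)

def reduce_phase(pairs):
    # Shuffle: group pairs by key; Reduce: sum the counts for each word.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

docs = ["big data in the cloud", "data in the data center"]
pairs = [kv for doc in docs for kv in map_phase(doc)]
print(dict(reduce_phase(pairs)))
# {'big': 1, 'center': 1, 'cloud': 1, 'data': 3, 'in': 2, 'the': 2}
```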

Defining big data

Big data refers to any data that cannot be analyzed by a traditional database due to three typical characteristics: high volume, high velocity, and high variety.

- High volume: big data's sheer volume slows down traditional database racks.
- High velocity: big data often streams in at high speed and can be time-sensitive.
- High variety: big data tends to be a mix of several data types, typically with an element of unstructured data (e.g. video), which is difficult to analyze.

As the big data industry evolves, four trends are emerging:

1. Unstructured data: Data is moving from structured to unstructured formats, raising the costs of analysis. This creates a highly lucrative market for analytical search engines that can interpret this unstructured data.

2. Open source: Proprietary database standards are giving way to new, open-source big data technology platforms such as Hadoop. This means that barriers to entry may remain low for some time.

3. Cloud: Many corporations are opting to use cloud services to access big data analytical tools instead of building expensive data warehouses themselves. This implies that most of the money in big data will be made from selling hybrid cloud-based services rather than selling big databases.

4. M2M: In future, a growing proportion of big data will be generated from machine-to-machine (M2M) communication using sensors. M2M data, much of which is business-critical and time-sensitive, could give telecom operators a way to profit from the big data boom.

Today, 90% of data warehouses hold less than 5 terabytes of data. Yet Twitter alone produces over 7 terabytes of data every day! As a result of this data deluge, the database industry is going through a significant transformation.

The first businesses that had to deal with big data were the leading Internet companies, such as Google, Yahoo and Amazon.

- Google and Yahoo, for example, run search engines, which have to gather unstructured data (like web pages) and process them within milliseconds to produce search rankings.
- Worse, they have to deal with millions of concurrent users, all submitting different search queries at once.
- So Google and Yahoo developers designed entirely new database platforms to deal with this type of unstructured query at lightning speed. They built everything themselves, from the physical infrastructure to the storage and processing layers.
- Their technique was to scale out horizontally (rather than vertically), adding more nodes to the database network.
- Horizontal scaling out involves breaking down large databases and distributing them across multiple servers (a sketch follows below).
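A sketch of the hash-based flavour of this partitioning (the server names are invented): each key is routed to one shard by hashing, so rows spread evenly across servers and a lookup still touches only one node.

```python
import hashlib

SHARDS = ["db-node-0", "db-node-1", "db-node-2", "db-node-3"]

def shard_for(key: str) -> str:
    # A stable hash, so the same key always routes to the same server.
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SHARDS[digest % len(SHARDS)]

for user in ["alice", "bob", "carol"]:
    print(user, "->", shard_for(user))
```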


These innovations resulted in the first "distributed databases" and provided the foundation for two of today's most advanced database technology standards, commonly referred to as NoSQL and Hadoop.