The Big Deal About Big Data


Dean Compher

Data Management Technical Professional for UT, NV

dcomphe@us.ibm.com

www.db2Dean.com

@db2Dean

facebook.com/db2Dean

Slides Created and Provided by:

Paul Zikopoulos
Tom Deutsch

© 2012 IBM Corporation


Why Big Data

How We Got Here


In 2005 there were 1.3 billion RFID tags in circulation… by the end of 2011, this was about 30 billion and growing even faster.



An increasingly sensor-enabled and instrumented business environment generates HUGE volumes of data with MACHINE SPEED characteristics…

1 BILLION lines of code
EACH engine generating 10 TB every 30 minutes!


350B transactions/year
Meter reads every 15 min.
3.65B meter reads/day
120M meter reads/month



In August of 2010, Adam Savage, of “MythBusters,” took a photo of his vehicle using his smartphone. He then posted the photo to his Twitter account, including the phrase “Off to work.”

Since the photo was taken by his smartphone, the image contained metadata revealing the exact geographical location where the photo was taken.

By simply taking and posting a photo, Savage revealed the exact location of his home, the vehicle he drives, and the time he leaves for work.


The Social Layer in an Instrumented Interconnected World

2+ billion people on the Web by end of 2011
30 billion RFID tags today (1.3B in 2005)
4.6 billion camera phones worldwide
100s of millions of GPS-enabled devices sold annually
76 million smart meters in 2009… 200M by 2014
12+ TBs of tweet data every day
25+ TBs of log data every day
? TBs of data every day


Twitter Tweets per Second Record Breakers of 2011


Extract Intent, Life Events, Micro-Segmentation Attributes

[Slide graphic: sample social posts from four users (Jo Jobs, Tina Mu, Tom Sit, Pauline) annotated with categories such as Name/Birthday/Family, Not Relevant (Noise), Monetizable Intent, Relocation, Location, Wishful Thinking, and SPAMbots]


Big Data Includes Any of the Following Characteristics

Extracting insight from an immense volume, variety and velocity of data, in context, beyond what was previously possible

Variety: Manage the complexity of data in many different structures, ranging from relational, to logs, to raw text
Velocity: Streaming data and large-volume data movement
Volume: Scale from Terabytes to Petabytes (1K TBs) to Zettabytes (1B TBs)




Bigger and Bigger Volumes of Data

Retailers collect click-stream data from Web site interactions and loyalty card data
This traditional POS information is used by retailers for shopping-basket analysis, inventory replenishment, +++
But the data is also being provided to suppliers for customer buying analysis

Healthcare has traditionally been dominated by paper-based systems, but this information is getting digitized

Science is increasingly dominated by big-science initiatives
Large-scale experiments generate over 15 PB of data a year that can't be stored within the data center, so it is sent out to laboratories

Financial services are seeing larger and larger volumes through smaller trading sizes, increased market volatility, and technological improvements in automated and algorithmic trading

Improved instrument and sensor technology
The Large Synoptic Survey Telescope's gigapixel camera generates 6 PB+ of image data per year; or consider the Oil and Gas industry


The Big Data Conundrum

Data AVAILABLE to an organization vs. data an organization can PROCESS

The percentage of available data an enterprise can analyze is decreasing proportionately to the data available to it
Quite simply, this means that as enterprises, we are getting more naïve about our business over time
We don't know what we could already know…



Why Not All of Big Data Before? We Didn't Have the Tools


Applications for Big Data Analytics

Homeland Security
Finance
Smarter Healthcare
Multi-channel sales
Telecom
Manufacturing
Traffic Control
Trading Analytics
Fraud and Risk
Log Analysis
Search Quality
Retail: Churn, NBO


Most Requested Uses of Big Data

Log Analytics & Storage
Smart Grid / Smarter Utilities
RFID Tracking & Analytics
Fraud / Risk Management & Modeling
360° View of the Customer
Warehouse Extension
Email / Call Center Transcript Analysis
Call Detail Record Analysis
+++


So What Is Hadoop?


Hadoop Background

Apache Hadoop is a software framework that supports data-intensive applications under a free license. It enables applications to work with thousands of nodes and petabytes of data. Hadoop was inspired by the Google MapReduce and Google File System papers.

Hadoop is a top-level Apache project being built and used by a global community of contributors, using the Java programming language. Yahoo has been the largest contributor to the project, and uses Hadoop extensively across its businesses.

Hadoop is a paradigm that says you send your application to the data rather than sending the data to the application


What Hadoop Is Not

It is not a replacement for your database & warehouse strategy
Customers need hybrid database/warehouse & Hadoop models

It is not a replacement for your ETL strategy
Existing data flows aren't typically changed; they are extended

It is not designed for real-time complex event processing like Streams
Customers are asking for Streams & BigInsights integration



So What Is Really New Here?

Cost-effective, linear scalability
Hadoop brings massively parallel computing to commodity servers. You can start small and scale linearly as your work requires.
Storage and modeling at Internet scale rather than small sampling
Cost profile of supercomputer-level compute capabilities
Cost per TB of storage enables a superset of information to be modeled

Mixing structured and unstructured data
Hadoop is schema-less, so it doesn't care about the form the stored data is in, and thus allows a superset of information to be commonly stored. Further, MapReduce can be run effectively on any type of data and is really limited only by the creativity of the developer.
Structure can be introduced at MapReduce run time based on the keys and values defined in the MapReduce program. Developers can create jobs that run against structured, semi-structured, and even unstructured data (see the mapper sketch after this list).

Inherent flexibility in what is modeled and which analytics are run
Ability to change direction literally on a moment's notice without any design or operational changes
Since Hadoop is schema-less and can introduce structure on the fly, the type of analytics and the nature of the questions being asked can be changed as often as needed without upfront cost or latency
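
To make the schema-on-read idea concrete, here is a minimal, hypothetical mapper sketch (not from the original slides): raw, schema-less log lines sit in HDFS as plain text, and structure (here, an HTTP status code) is imposed only when the job runs. The class name and the assumed field layout are illustrative.

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Hypothetical mapper: extracts the HTTP status code from raw access-log lines.
    // No schema is declared up front; the parsing happens at MapReduce run time.
    public class StatusCodeMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Text statusCode = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumed layout: the status code is the second-to-last space-separated
            // field, e.g.  127.0.0.1 - - [10/Oct/2012:13:55:36] "GET / HTTP/1.1" 200 2326
            String[] fields = line.toString().split(" ");
            if (fields.length >= 2) {
                statusCode.set(fields[fields.length - 2]);
                context.write(statusCode, ONE);   // e.g. ("200", 1)
            }
            // Lines with fewer than two fields are ignored; there is no upfront
            // schema to violate.
        }
    }

A reducer could then sum the counts per status code, and the same raw files could be re-parsed differently tomorrow without reloading anything.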



Break It Down For Me Here…

Hadoop is a platform and framework, not a database
It uses both the CPU and disk of single commodity boxes, or nodes
Boxes can be combined into clusters
New nodes can be added as needed, without needing to change:
Data formats
How data is loaded
How jobs are written
The applications on top




So How Does It Do That?

At its core, Hadoop is made up of:

Map/Reduce
How Hadoop understands and assigns work to the nodes (machines)

Hadoop Distributed File System = HDFS
Where Hadoop stores data
A file system that runs across the nodes in a Hadoop cluster
It links together the file systems on many local nodes to make them into one big file system


What is HDFS?

The HDFS file system stores data across multiple machines.
HDFS assumes nodes will fail, so it achieves reliability by replicating data across multiple nodes
Default is 3 copies: two on the same rack, and one on a different rack.
The filesystem is built from a cluster of data nodes, each of which serves up blocks of data over the network using a block protocol specific to HDFS.
They also serve the data over HTTP, allowing access to all content from a web browser or other client
Data nodes can talk to each other to rebalance data, to move copies around, and to keep the replication of data high.
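
As a rough illustration of how a client interacts with HDFS, the following minimal Java sketch (not part of the original slides) copies a local file into the cluster and prints each file's replication factor. The hdfs://namenode:8020 address and the paths are assumptions for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsCopyExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Assumed NameNode address -- normally picked up from core-site.xml.
            conf.set("fs.defaultFS", "hdfs://namenode:8020");

            FileSystem fs = FileSystem.get(conf);

            // Copy a local file into HDFS; the NameNode decides block placement
            // and the DataNodes replicate each block (3 copies by default).
            fs.copyFromLocalFile(new Path("/tmp/weblog.txt"),
                                 new Path("/data/logs/weblog.txt"));

            // List the directory and show how many replicas each file carries.
            for (FileStatus status : fs.listStatus(new Path("/data/logs"))) {
                System.out.println(status.getPath() + " replication="
                        + status.getReplication());
            }
        }
    }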


File System on my Laptop


HDFS File System Example


Map/Reduce Explained

"Map" step:
The program is chopped up into many smaller sub-problems.
A worker node processes some subset of the smaller problems under the global control of the JobTracker node and stores the result in the local file system where a reducer is able to access it.

"Reduce" step: Aggregation
The reduce step aggregates data from the map steps. There can be multiple reduce tasks to parallelize the aggregation, and these tasks are executed on the worker nodes under the control of the JobTracker.


The MapReduce Programming Model

"Map" step:
Program split into pieces
Worker nodes process individual pieces in parallel (under global control of the JobTracker node)
Each worker node stores its result in its local file system where a reducer is able to access it

"Reduce" step:
Data is aggregated ("reduced" from the map steps) by worker nodes (under control of the JobTracker)
Multiple reduce tasks can parallelize the aggregation


Map/Reduce Job Example



[Slide diagram: input records pairing a city with a reading (Murray 38, Salt Lake 39, Bluffdale 35, Sandy 32, Salt Lake 42, Murray 31, …) flow through the Map, Shuffle, and Reduce stages; the shuffle groups records by city, and the reduce output keeps the maximum value per city: Murray 38, Bluffdale 37, Sandy 40, Salt Lake 42]
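
A minimal Java sketch of a job like the one pictured (not from the original slides): it assumes tab-separated input lines of the form city<TAB>value and emits the maximum value seen for each city. Class names and paths are illustrative.

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MaxValuePerCity {

        // Map: parse "city<TAB>value" and emit (city, value).
        public static class CityMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                String[] parts = line.toString().split("\t");
                if (parts.length == 2) {
                    context.write(new Text(parts[0]),
                                  new IntWritable(Integer.parseInt(parts[1].trim())));
                }
            }
        }

        // Reduce: the shuffle has grouped values by city; keep the maximum.
        public static class MaxReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text city, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int max = Integer.MIN_VALUE;
                for (IntWritable v : values) {
                    max = Math.max(max, v.get());
                }
                context.write(city, new IntWritable(max));   // e.g. Salt Lake  42
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "max value per city");
            job.setJarByClass(MaxValuePerCity.class);
            job.setMapperClass(CityMapper.class);
            job.setReducerClass(MaxReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, this would be submitted with hadoop jar, and the JobTracker (described on the following slides) would schedule the map and reduce tasks close to the data.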


MapReduce In More Detail

Map/Reduce applications specify the input/output locations and supply map and reduce functions via implementations of appropriate Hadoop interfaces, such as Mapper and Reducer.

These, and other job parameters, comprise the job configuration. The Hadoop job client then submits the job (jar/executable, etc.) and configuration to the JobTracker

The JobTracker then assumes the responsibility of distributing the software/configuration to the slaves, scheduling tasks and monitoring them, and providing status and diagnostic information to the job client.

The Map/Reduce framework operates exclusively on <key, value> pairs; that is, the framework views the input to the job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job, conceivably of different types.

The vast majority of Map/Reduce applications executed on the grid do not directly implement the low-level Map/Reduce interfaces; rather, they are implemented in a higher-level language, such as Jaql, Pig or BigSheets


JobTracker and TaskTrackers

Map/Reduce requests are handed to the JobTracker, which is a master controller for the map and reduce tasks.
Each worker node contains a TaskTracker process which manages work on the local node.
The JobTracker pushes work out to the TaskTrackers on available worker nodes, striving to keep the work as close to the data as possible
The JobTracker knows which node contains the data, and which other machines are nearby
If the work cannot be hosted on the actual node where the data resides, priority is given to nodes in the same rack
This reduces network traffic on the main backbone network. If a TaskTracker fails or times out, that part of the job is rescheduled


How To Create Map/Reduce Jobs (ordered roughly from most to least skill required)

Map/Reduce development in Java: hard, and few resources know this
Pig: open source language / Apache sub-project; becoming a "standard"
Hive: open source language / Apache sub-project; provides a SQL-like interface to Hadoop
Jaql: invented by IBM Research; more powerful than Pig when dealing with loosely structured data; Visa has been a development partner
BigSheets: a browser-based BigInsights application; little development required; you'll use this most often


Taken Together, What Does This Result In?

Easy to scale
Simply add machines as your data and jobs require

Fault tolerant and self-healing
Hadoop runs on commodity hardware and provides fault tolerance through software.
Hardware losses are expected and tolerated
When you lose a node, the system just redirects work to another copy of the data and nothing stops, nothing breaks; jobs, applications and users don't even know.

Hadoop is data agnostic
Hadoop can absorb any type of data, structured or not, from any number of sources.
Data from many sources can be joined and aggregated in arbitrary ways, enabling deeper analyses than any one system can provide.
Hadoop results can be consumed by any system necessary if the output is structured appropriately

Hadoop is extremely flexible
Start small, scale big
You can turn nodes "off" and use them for other needs if required (really)
Throw any data, in any form or format, at it
What you use it for can be changed on a whim



The IBM Big Data Platform

© 2012 IBM Corporation

34

Analytic Sandboxes

aka "Production"
Hadoop capabilities exposed to LOB with some notion of IT support
Not really production in an IBM sense
Really "just" ad-hoc made visible to more users in the organization
Formal declaration of direction as part of the architecture
"Use it, but don't count on it"
Not built for security



Production Usage with SLAs

SLA-driven workloads
Guaranteed job completion
Job completion within operational windows

Data security requirements
Problematic if it fails or loses data
True DR becomes a requirement
Data quality becomes an issue
Secure data marts become a hard requirement

Integration with the rest of the enterprise
Workload integration becomes an issue

Efficiency becomes a hot topic
Inefficient utilization on 20 machines isn't an issue; on 500 or 1000+ it is

Relatively few are really here yet outside of Facebook, Yahoo, LinkedIn, etc.
Few are thinking of this, but it is inevitable


IBM Delivers a Platform, Not a Product

Hardened environment
Removes single points of failure
Security
All components tested together
Operational processes

Ready for production
Mature / pervasive usage
Deployed and managed like other mature data center platforms

BigInsights
Text analytics, data mining, Streams, others




The IBM Big Data Platform

Hadoop: InfoSphere BigInsights (Hadoop-based, low-latency analytics for variety and volume)
Stream Computing: InfoSphere Streams (low-latency analytics for streaming data)
Information Integration: InfoSphere Information Server (high-volume data integration and transformation)
MPP Data Warehouse:
IBM Netezza High Capacity Appliance (queryable archive for structured data)
IBM Netezza 1000 (BI and ad hoc analytics on structured data)
IBM Smart Analytics System (operational analytics on structured data)
IBM InfoSphere Warehouse (large-volume structured data analytics)
IBM Informix TimeSeries (time-structured analytics)


What Does a Big Data Platform Do?

Analyze information in motion
Streaming data analysis
Large-volume data bursts and ad-hoc analysis

Analyze a variety of information
Novel analytics on a broad set of mixed information that could not be analyzed before

Discover and experiment
Ad-hoc analytics, data discovery and experimentation

Analyze extreme volumes of information
Cost-efficiently process and analyze PBs of information
Manage and analyze high volumes of structured, relational data

Manage and plan
Enforce data structure, integrity and control to ensure consistency for repeatable queries



Big Data Enriches the Information Management Ecosystem

Who ran what, where, and when? Audit MapReduce jobs and tasks
Managing a governance initiative
OLTP optimization (SAP, checkout, +++)
Master data enrichment via life events, hobbies, roles, +++
Establishing Information as a Service
Active archive / cost optimization


Get More Information…


www.bigdatauniversity.com


Get the Book
