Hadoop: An Industry Perspective





Amr Awadallah
Founder/CTO, Cloudera, Inc.

Massive Data Analytics over the Cloud (MDAC’2010)
Monday, April 26th, 2010

Amr Awadallah, Cloudera Inc

Outline


What is Hadoop?
Overview of HDFS and MapReduce
How Hadoop augments an RDBMS?
Industry Business Needs:
  Data Consolidation (Structured or Not)
  Data Schema Agility (Evolve Schema Fast)
  Query Language Flexibility (Data Engineering)
  Data Economics (Store More for Longer)
Conclusion


What is Hadoop?


A scalable fault-tolerant distributed system for data storage and processing
Its scalability comes from the marriage of:
  HDFS: Self-Healing High-Bandwidth Clustered Storage
  MapReduce: Fault-Tolerant Distributed Processing
Operates on structured and complex data
A large and active ecosystem (many developers and additions like HBase, Hive, Pig, …)
Open source under the Apache License
http://wiki.apache.org/hadoop/




Hadoop History


2002-2004: Doug Cutting and Mike Cafarella started working on Nutch
2003-2004: Google publishes GFS and MapReduce papers
2004: Cutting adds DFS & MapReduce support to Nutch
2006: Yahoo! hires Cutting, Hadoop spins out of Nutch
2007: NY Times converts 4TB of archives over 100 EC2s
2008: Web-scale deployments at Y!, Facebook, Last.fm
April 2008: Yahoo does fastest sort of a TB, 3.5 mins over 910 nodes
May 2009:
  Yahoo does fastest sort of a TB, 62 secs over 1460 nodes
  Yahoo sorts a PB in 16.25 hours over 3658 nodes
June 2009, Oct 2009: Hadoop Summit, Hadoop World
September 2009: Doug Cutting joins Cloudera


Hadoop Design Axioms

1. System Shall Manage and Heal Itself
2. Performance Shall Scale Linearly
3. Compute Shall Move to Data
4. Simple Core, Modular and Extensible


HDFS: Hadoop Distributed File System

Block Size = 64MB
Replication Factor = 3
Cost/GB is a few ¢/month vs $/month
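
To make the two settings above concrete, here is a minimal sketch of how a single file would be split and replicated. The function name and the 1000 MB example file are illustrative assumptions; only the 64MB block size and replication factor of 3 come from the slide.

```python
import math

BLOCK_SIZE_MB = 64       # block size quoted on the slide
REPLICATION_FACTOR = 3   # replication factor quoted on the slide


def hdfs_footprint(file_size_mb: float) -> dict:
    """Estimate how one file is laid out under the slide's settings."""
    # A file is cut into fixed-size blocks; the last block may be partial.
    num_blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    # Every block is stored REPLICATION_FACTOR times across the cluster.
    block_replicas = num_blocks * REPLICATION_FACTOR
    # A partial block only occupies the bytes it actually holds.
    raw_storage_mb = file_size_mb * REPLICATION_FACTOR
    return {
        "num_blocks": num_blocks,
        "block_replicas": block_replicas,
        "raw_storage_mb": raw_storage_mb,
    }


example = hdfs_footprint(1000)  # hypothetical 1000 MB file
print(example)
```

Under these assumptions the file occupies 16 blocks, stored as 48 block replicas, for roughly 3000 MB of raw disk.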


MapReduce: Distributed Processing


Apache Hadoop Ecosystem

HDFS (Hadoop Distributed File System)
HBase (key-value store)
MapReduce (Job Scheduling/Execution System)
Pig (Data Flow)
Hive (SQL)
BI Reporting
ETL Tools
Avro (Serialization)
ZooKeeper (Coordination)
Sqoop
RDBMS
(Streaming/Pipes APIs)


Use The Right Tool For The Right Job

Relational Databases: When to use?
  Interactive Reporting (<1sec)
  Multistep Transactions
  Lots of Inserts/Updates/Deletes

Hadoop: When to use?
  Affordable Storage/Compute
  Structured or Not (Agility)
  Resilient Auto Scalability





Typical Hadoop Architecture

[Diagram components: Hadoop (Storage and Batch Processing), Data Collection, OLAP Data Mart, Business Intelligence, OLTP Data Store, Interactive Application. Actors: Business Users, End Customers, Engineers]


Complex Data is Growing Really Fast

Gartner (2009):
  Enterprise Data will grow 650% in the next 5 years.
  80% of this data will be unstructured (complex) data
IDC (2008):
  85% of all corporate information is in unstructured (complex) forms
  Growth of unstructured data (61.7% CAGR) will far outpace that of transactional data
[Chart legend: Data types: Complex, Structured]

Data Consolidation: One Place For All

A single data system to enable processing across the universe of data types.

Complex Data:
  Documents
  Web feeds
  System logs
  Online forums
  SharePoint
  Sensor data
  EMB archives
  Images/Video

Structured Data (“relational”):
  CRM
  Financials
  Logistics
  Data Marts
  Inventory
  Sales records
  HR records
  Web Profiles




Data Agility: Schema on Read vs Write

Schema-on-Write:
  Schema must be created before data is loaded.
  An explicit load operation has to take place which transforms the data to the internal structure of the database.
  New columns must be added explicitly before data for such columns can be loaded into the database.
  Read is Fast.
  Standards/Governance.

Schema-on-Read:
  Data is simply copied to the file store, no special transformation is needed.
  A SerDe (Serializer/Deserializer) is applied during read time to extract the required columns.
  New data can start flowing anytime and will appear retroactively once the SerDe is updated to parse them.
  Load is Fast
  Evolving Schemas/Agility
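
The schema-on-read behaviour can be illustrated with a small sketch. Everything here is a toy stand-in rather than Hive's actual SerDe interface: the tab-separated log format, the column names, and the functions `serde_v1`, `serde_v2`, and `read_table` are assumptions made for the example.

```python
# Raw records are appended to the "file store" exactly as they arrive.
raw_store = [
    "2010-04-26\talice\t/home",
    "2010-04-26\tbob\t/search\tfirefox",  # a new trailing field starts flowing
]


def serde_v1(line: str) -> dict:
    """Original deserializer: knows only three columns."""
    fields = line.split("\t")
    return {"date": fields[0], "user": fields[1], "page": fields[2]}


def serde_v2(line: str) -> dict:
    """Updated deserializer: also exposes the optional fourth column."""
    fields = line.split("\t")
    row = serde_v1(line)
    row["browser"] = fields[3] if len(fields) > 3 else None
    return row


def read_table(store, serde, columns):
    """Apply the SerDe at read time and project the requested columns."""
    return [[serde(line)[col] for col in columns] for line in store]


before = read_table(raw_store, serde_v1, ["user", "page"])
after = read_table(raw_store, serde_v2, ["user", "browser"])
print(before)
print(after)
```

Note that the stored lines never change: the extra field was already on disk, and it becomes queryable only once the updated deserializer is used.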




Query Language Flexibility



Java MapReduce: Gives the most flexibility and performance, but with a potentially long development cycle (the “assembly language” of Hadoop).
Streaming MapReduce: Allows you to develop in any programming language of your choice, but with slightly lower performance and less flexibility.
Pig: A relatively new language out of Yahoo, suitable for batch data flow workloads.
Hive: A SQL interpreter on top of MapReduce, also includes a meta-store mapping files to their schemas and associated SerDes. Hive also supports User-Defined-Functions and pluggable MapReduce streaming functions in any language.
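
As a rough illustration of the Streaming option, the sketch below follows the usual streaming convention of passing tab-separated key/value text lines between stages, with the reducer receiving its input sorted by key. The "user page" log format and the hits-per-page job are hypothetical, and the stages are chained in-process here instead of reading stdin and writing stdout.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'page<TAB>1' line per log record."""
    for line in lines:
        user, page = line.split()
        yield f"{page}\t1"


def reducer(sorted_lines):
    """Sum the counts per key; the input must already be sorted by key."""
    pairs = (line.split("\t") for line in sorted_lines)
    for page, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{page}\t{sum(int(count) for _, count in group)}"


# Local stand-in for: cat input | mapper | sort | reducer
log_lines = ["alice /home", "bob /search", "carol /home"]
result = list(reducer(sorted(mapper(log_lines))))
print(result)
```

In a real streaming job each function would live in its own script, and the framework would take care of the sort between them.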



Hive Extensible Data Types


STRUCTS:
  SELECT mytable.mycolumn.myfield FROM …
MAPS (Hashes):
  SELECT mytable.mycolumn[mykey] FROM …
ARRAYS:
  SELECT mytable.mycolumn[5] FROM …
JSON:
  SELECT get_json_object(mycolumn, objpath)
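
For readers less familiar with these accessors, the following sketch shows the equivalent lookups on an ordinary in-memory row. The row contents and column names are invented for illustration, and this is an analogy to the Hive expressions above, not Hive code.

```python
import json

# One hypothetical row; each key stands in for a Hive column of a complex type.
row = {
    "profile": {"name": "alice", "age": 30},        # STRUCT
    "settings": {"theme": "dark", "lang": "en"},    # MAP
    "visits": [3, 1, 4, 1, 5, 9],                   # ARRAY
    "payload": '{"store": {"city": "Cairo"}}',      # JSON kept as a string
}

struct_field = row["profile"]["name"]   # like mytable.mycolumn.myfield
map_value = row["settings"]["theme"]    # like mytable.mycolumn[mykey]
array_item = row["visits"][5]           # like mytable.mycolumn[5], zero-based
# Like get_json_object(mycolumn, '$.store.city'): parse the string at query time.
json_value = json.loads(row["payload"])["store"]["city"]

print(struct_field, map_value, array_item, json_value)
```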


Data Economics (Return On Byte)

Return on Byte = value to be extracted from that byte / cost of storing that byte.
If ROB is < 1 then it will be buried into tape wasteland, thus we need cheaper active storage.
[Chart axis: Low ROB to High ROB]
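
The ROB rule can be written down as a short calculation. This is a sketch under stated assumptions: the function names are illustrative, the $1000 of value per TB is hypothetical, and the two storage costs are simply a high and a low price per TB chosen to show the threshold flipping.

```python
def return_on_byte(value_extracted: float, storage_cost: float) -> float:
    """ROB = value to be extracted from the data / cost of storing it."""
    if storage_cost <= 0:
        raise ValueError("storage_cost must be positive")
    return value_extracted / storage_cost


def storage_tier(rob: float) -> str:
    """Per the slide, data with ROB < 1 ends up on tape."""
    return "tape" if rob < 1 else "active"


value_per_tb = 1000.0  # hypothetical value extracted from one TB, in dollars
for cost_per_tb in (5000.0, 250.0):  # hypothetical expensive vs cheap storage
    rob = return_on_byte(value_per_tb, cost_per_tb)
    print(cost_per_tb, rob, storage_tier(rob))
```

The same data moves from the tape tier to active storage purely because the denominator shrinks, which is the argument for cheaper active storage.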


Case Studies: Hadoop World ‘09


VISA: Large Scale Transaction Analysis
JP Morgan Chase: Data Processing for Financial Services
China Mobile: Data Mining Platform for Telecom Industry
Rackspace: Cross Data Center Log Processing
Booz Allen Hamilton: Protein Alignment using Hadoop
eHarmony: Matchmaking in the Hadoop Cloud
General Sentiment: Understanding Natural Language
Yahoo!: Social Graph Analysis
Visible Technologies: Real-Time Business Intelligence
Facebook: Rethinking the Data Warehouse with Hadoop and Hive

Slides and Videos at http://www.cloudera.com/hadoop-world-nyc



Cloudera Desktop for Hadoop


Conclusion

Hadoop is a scalable distributed data processing system which enables:

1. Consolidation (Structured or Not)
2. Data Agility (Evolving Schemas)
3. Query Flexibility (Any Language)
4. Economical Storage (ROB > 1)



Contact Information

Amr Awadallah
CTO, Cloudera Inc.
aaa@cloudera.com
http://twitter.com/awadallah

Online Training Videos and Info:
http://cloudera.com/hadoop-training
http://cloudera.com/blog
http://twitter.com/cloudera

(c) 2008 Cloudera, Inc. or its licensors.

"Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved.


MapReduce: The Programming Model

[Diagram: input splits (Split 1, Split i, Split N) feed map tasks (Map 1, Map i, Map M); a Shuffle stage feeds reduce tasks (Reduce 1, Reduce i, Reduce R), which write Output File 1, Output File i, Output File R]

Map input: (docid, text)
Map output: (words, counts)
Reduce input, after Shuffle: (sorted words, counts)
Reduce output: (sorted words, sum of counts)

Example: for the text “To Be Or Not To Be?”, the maps emit Be, 5; Be, 12; Be, 7; Be, 6 and a reduce outputs Be, 30

SELECT word, COUNT(1) FROM docs GROUP BY word;

cat *.txt | mapper.pl | sort | reducer.pl > out.txt
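
The same word count can be sketched in memory to show where map, shuffle, and reduce sit relative to each other. This is a single-process toy, not Hadoop's API: the function names, the two-reducer default, and the byte-sum partitioner are assumptions chosen so the example is deterministic.

```python
from collections import defaultdict


def map_fn(docid, text):
    """Map: (docid, text) -> list of (word, 1) pairs."""
    words = (w.strip("?.,!\"").lower() for w in text.split())
    return [(w, 1) for w in words if w]


def shuffle(mapped_pairs, num_reducers):
    """Shuffle: route each word to exactly one reducer and group its counts."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for word, count in mapped_pairs:
        # Deterministic toy partitioner (an assumption, not Hadoop's hash).
        index = sum(word.encode()) % num_reducers
        partitions[index][word].append(count)
    return partitions


def reduce_fn(word, counts):
    """Reduce: (word, [counts]) -> (word, sum of counts)."""
    return word, sum(counts)


def word_count(docs, num_reducers=2):
    mapped = [pair for docid, text in docs.items() for pair in map_fn(docid, text)]
    outputs = []
    for partition in shuffle(mapped, num_reducers):
        # Each reducer sees its keys in sorted order and writes one output "file".
        outputs.append([reduce_fn(w, partition[w]) for w in sorted(partition)])
    return outputs


docs = {"doc1": "To Be Or Not To Be?", "doc2": "To be"}
files = word_count(docs)
totals = dict(pair for output in files for pair in output)
print(totals)
```

Because every occurrence of a word is routed to the same reducer, each word appears in exactly one output file, which is what makes the per-file sums globally correct.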



Hadoop High-Level Architecture

Name Node: Maintains mapping of file blocks to data node slaves
Job Tracker: Schedules jobs across task tracker slaves
Data Node: Stores and serves blocks of data
Hadoop Client: Contacts Name Node for data or Job Tracker to submit jobs
Task Tracker: Runs tasks (work units) within a job
Data Node and Task Tracker: Share Physical Node
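
The storage half of this architecture can be sketched as a toy model: the name node holds only metadata, and the client fetches the actual bytes from data nodes. The class and method names, the file name, and the use of two replicas are illustrative assumptions; the Job Tracker and Task Tracker side is left out.

```python
class DataNode:
    """Stores and serves blocks of data."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def serve(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Keeps only metadata: which data nodes hold each block of each file."""

    def __init__(self):
        self.file_blocks = {}      # file name -> ordered list of block ids
        self.block_locations = {}  # block id -> list of DataNode objects

    def register(self, filename, block_id, nodes):
        self.file_blocks.setdefault(filename, []).append(block_id)
        self.block_locations[block_id] = list(nodes)

    def locate(self, filename):
        return [(b, self.block_locations[b]) for b in self.file_blocks[filename]]


def client_read(name_node, filename):
    """Ask the name node where blocks live, then fetch bytes from data nodes."""
    chunks = []
    for block_id, nodes in name_node.locate(filename):
        chunks.append(nodes[0].serve(block_id))  # any replica would do
    return b"".join(chunks)


nodes = [DataNode(f"dn{i}") for i in range(3)]
name_node = NameNode()
for index, chunk in enumerate([b"hello ", b"world"]):
    block_id = f"blk_{index}"
    replicas = [nodes[index % 3], nodes[(index + 1) % 3]]  # 2 replicas for brevity
    for node in replicas:
        node.store(block_id, chunk)
    name_node.register("greeting.txt", block_id, replicas)

content = client_read(name_node, "greeting.txt")
print(content)
```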


Economics of Hadoop Storage


Typical Hardware:
  Two Quad Core Nehalems
  24GB RAM
  12 * 1TB SATA disks (JBOD mode, no need for RAID)
  1 Gigabit Ethernet card
Cost/node: $5K/node
Effective HDFS Space:
  ¼ reserved for temp shuffle space, which leaves 9TB/node
  3 way replication leads to 3TB effective HDFS space/node
  But assuming 7x compression that becomes ~ 20TB/node
Effective Cost per user TB: $250/TB
Other solutions cost in the range of $5K to $100K per user TB
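
The arithmetic behind these figures can be reproduced step by step. The function and variable names are illustrative; the inputs are the numbers quoted on the slide. The exact result is 21TB per node, which the slide rounds to about 20TB, giving $250 per TB instead of roughly $238.

```python
def effective_storage(disks=12, disk_tb=1.0, shuffle_reserve=0.25,
                      replication=3, compression=7.0):
    """Walk through the slide's steps from raw disk to user-visible TB."""
    raw_tb = disks * disk_tb                      # 12 * 1TB disks
    usable_tb = raw_tb * (1 - shuffle_reserve)    # a quarter kept for shuffle
    replicated_tb = usable_tb / replication       # 3 way replication
    user_tb = replicated_tb * compression         # assumed 7x compression
    return raw_tb, usable_tb, replicated_tb, user_tb


NODE_COST_USD = 5000  # cost per node quoted on the slide

raw_tb, usable_tb, replicated_tb, user_tb = effective_storage()
cost_per_user_tb = NODE_COST_USD / user_tb
print(usable_tb, replicated_tb, user_tb, round(cost_per_user_tb))
```

The compression ratio dominates the result: without the assumed 7x, the same node would cost about $1667 per user TB.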


Data Engineering vs Business Intelligence

Business Intelligence:
  The practice of extracting business numbers to monitor and evaluate the health of the business.
  Humans make decisions based on these numbers to improve revenues or reduce costs.

Data Engineering:
  The science of writing algorithms that convert data into money
  Alternatively, how to automatically transform data into new features that increase revenues or reduce costs.