Cloudera_BigDataUser Group 111511 - Meetup


Oct 2, 2013


Cloudera & Hadoop Use Cases

Rob Lancaster | Omer Trajman

"Big Data" ... Applications From Enterprises to Individuals

The ‘Big Data’ Phenomenon

©2011 Cloudera, Inc. All Rights Reserved.


Big Data Drivers:

The proliferation of data capture and creation technologies

Increased “interconnectedness” drives consumption (creating more data)

Inexpensive storage makes it possible to keep more, longer

Innovative software and analysis tools turn data into information

Big Data encompasses not only the content itself, but how it’s consumed.

More Devices

More Consumption

More Content

New & Better Information


Every gigabyte of stored content can generate a petabyte or more of transient data*

The information about you is much greater than the information you create

*Source: IDC 2011

Big Data Challenges

It’s not just about “big”


Cost-effectively managing the volume, velocity and variety of data

Deriving value across structured and unstructured data

Adapting to context changes and integrating new data sources and types

Common Challenges


1. Network Analysis and Sessionization
2. Content Optimization and Engagement Modeling
3. Usage Analysis and Mediation
4. Entity Surveillance and Signal Monitoring
5. Recommendations and Modeling
6. Loyalty, Promotion Analysis and Targeting
7. Fraud Analysis, Reconciliation and Risk
8. Time Series Analysis, Mapping and Modeling

What is Apache Hadoop?




CORE HADOOP COMPONENTS

Hadoop Distributed File System (HDFS)

MapReduce

Consolidates Mixed Data: complex and relational data into a single repository

Stores Inexpensively: keep raw data always available

Processes at the Source: eliminate ETL bottlenecks; mine data first, govern later

Apache Hadoop is a platform for data storage and processing that is:

Scalable

Fault tolerant

Open source
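To make the MapReduce component concrete, here is a minimal sketch of the map, shuffle and reduce phases as a word count. It runs in memory in plain Python and does not use any Hadoop API; the function names (`map_words`, `reduce_counts`, `run_mapreduce`) are hypothetical and chosen only for illustration. On a real cluster the framework would distribute the map and reduce calls across nodes and read the input from HDFS.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(line: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(key: str, values: Iterable[int]) -> Tuple[str, int]:
    """Reduce phase: sum all the counts that were emitted for one word."""
    return key, sum(values)


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> List[Tuple[str, int]]:
    """Run mapper and reducer locally, with an in-memory shuffle in between."""
    # Shuffle: group every value emitted by the mapper under its key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce each key group; keys are sorted, as reducers see them in Hadoop.
    return [reducer(key, groups[key]) for key in sorted(groups)]
```

Calling `run_mapreduce(lines, map_words, reduce_counts)` on a list of text lines returns one `(word, count)` pair per distinct word.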

©2011 Cloudera, Inc. All Rights Reserved. Confidential. Reproduction
or redistribution without written permission is prohibited.

Cloudera in Production


Logs

Files

Web Data

Relational Databases

IDEs

BI / Analytics

Enterprise Reporting

Enterprise Data Warehouse

Operational Rules Engines

Management Tools

OPERATORS

ENGINEERS

ANALYSTS

BUSINESS USERS

Cloudera’s Distribution Including Apache Hadoop (CDH) & SCM Express

Cloudera Enterprise: Cloudera Management Suite, Cloudera Support

Consulting Services

Cloudera University

Web Application

CUSTOMERS

What Can Hadoop Do For You?


Two Core Use Cases Applied Across Industries

INDUSTRY | 1. ADVANCED ANALYTICS (INDUSTRY TERM) | 2. DATA PROCESSING (INDUSTRY TERM)
Web | Social Network Analysis | Clickstream Sessionization
Media | Content Optimization | Engagement
Telco | Network Analytics | Mediation
Retail | Loyalty & Promotions Analysis | Data Factory
Financial | Fraud Analysis | Trade Reconciliation
Federal | Entity Analysis | SIGINT
Bioinformatics | Sequencing Analysis | Genome Mapping

Genomics

Cost of DNA Sequencing Falling Very Fast

Raw data needs to be aligned and matched

Scientists want to collect and analyze these sequences

Hadoop Can Read Native Format

hadoop-bam: Java library for manipulation of Binary Alignment/Map (BAM) files

Alignment, SNP discovery, genotyping


Genomic Tools Based On Hadoop

SEAL: distributed short read alignment

BlastReduce: parallel read mapping

Crossbow: whole genome re-sequencing analysis

CloudBurst: sensitive MapReduce alignment




Biodiversity Indexing

Consolidation and serving of Biological data

Provide free and open access to biodiversity data

Collection, search, discovery and access to a variety of data

Data matching and cleansing

Geography, Water/land mapping

Dictionaries and taxonomic services

Data is harvested into multiple RDBMS

Sqoop to Hadoop for processing workflows and index generation

Sqoop back to MySQL for Web app serving

Future development is to crawl into and serve from HBase


Processing Seismic Data

Optimize the IO-intensive phases of seismic processing

Incorporate additional parallelism where it makes sense

Simplify gather/transpose operations with MapReduce

Seismic Unix for Core Algorithms

Well-known, used at many grad programs in geophysics

SU file format can be easily transformed for processing on HDFS

Hadoop Streaming

Seismic Unix, SEPlib, Javaseis: non-Java code in MR

Framework is aware of parameter files needed by SU commands
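As an illustration of why a gather maps naturally onto MapReduce, the sketch below re-groups traces recorded shot by shot into common-depth-point (CDP) gathers: the map step keys each trace by its CDP, the shuffle collects traces that share a key, and the reduce step orders them by offset. This is a plain-Python sketch under stated assumptions, not the framework described in the slides; the `Trace` type and its fields are hypothetical stand-ins for SU trace headers, and no SU file parsing is shown.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple


class Trace(NamedTuple):
    """Hypothetical, simplified trace: three header fields plus samples."""
    shot: int
    cdp: int
    offset: int
    samples: Tuple[float, ...]


def map_trace(trace: Trace) -> Iterator[Tuple[int, Tuple[int, Trace]]]:
    """Map: key each trace by CDP and carry the offset along for sorting."""
    yield trace.cdp, (trace.offset, trace)


def reduce_gather(cdp: int, keyed: List[Tuple[int, Trace]]) -> Tuple[int, List[Trace]]:
    """Reduce: order the traces of one CDP by offset to form the gather."""
    ordered = sorted(keyed, key=lambda pair: pair[0])
    return cdp, [trace for _, trace in ordered]


def gather_by_cdp(traces: Iterable[Trace]) -> Dict[int, List[Trace]]:
    """Run map, an in-memory shuffle, and reduce over all traces."""
    groups: Dict[int, List[Tuple[int, Trace]]] = defaultdict(list)
    for trace in traces:
        for key, value in map_trace(trace):
            groups[key].append(value)
    return dict(reduce_gather(cdp, groups[cdp]) for cdp in sorted(groups))
```

With Hadoop Streaming, the map and reduce steps would instead be separate executables that read and write key/value lines on standard input and output.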



Targeted Offers


The checkout lane is everywhere

Cookies track users through ad impressions

Purchasing behavior is time sensitive

Logs collected from on-site and off-site browsing

Data is ingested incrementally

Process happens at a variety of time scales

Data logged to HBase as primary store

Some events naturally associate, others require deeper analysis

Random access useful for debugging algorithms
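One simple way in which events "naturally associate" is sessionization: group logged events by cookie and start a new session whenever the gap between consecutive events exceeds a threshold. The sketch below assumes events arrive as `(cookie, timestamp_in_seconds)` pairs and uses a 30-minute gap; both the event shape and the threshold are assumptions made for illustration, not details from the slides.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Assumed inactivity threshold; the slides do not specify one.
SESSION_GAP_SECONDS = 30 * 60


def sessionize(
    events: Iterable[Tuple[str, int]],
    gap: int = SESSION_GAP_SECONDS,
) -> Dict[str, List[List[int]]]:
    """Split each cookie's event timestamps into sessions by inactivity gap."""
    by_cookie: Dict[str, List[int]] = defaultdict(list)
    for cookie, timestamp in events:
        by_cookie[cookie].append(timestamp)

    sessions: Dict[str, List[List[int]]] = {}
    for cookie, stamps in by_cookie.items():
        stamps.sort()  # logs from different sites may arrive out of order
        current = [stamps[0]]
        cookie_sessions = [current]
        for timestamp in stamps[1:]:
            if timestamp - current[-1] > gap:
                # Too long since the previous event: open a new session.
                current = [timestamp]
                cookie_sessions.append(current)
            else:
                current.append(timestamp)
        sessions[cookie] = cookie_sessions
    return sessions
```

Events that do not fall out of a rule this simple are the ones the slide describes as requiring deeper analysis.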


Recommendations and Forecasting


Collect and serve personalization information

Wide variety of constantly changing data sources

Data guaranteed to be messy

Data ingestion includes collection of raw data

Filtering and fixing of poorly formatted data

Normalization and matching across data sources

Analysis looks for reliable attributes and groupings

Interpretation (e.g. gender by name)

Aggregation across likely matching identifiers

Identify possible predicted attributes or preferences
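The ingestion steps above (filter and fix, normalize, match across sources, aggregate) can be sketched in a few lines. This is a minimal illustration, assuming records are dictionaries with `email`, `name` and `source` keys and that a normalized email address serves as the matching identifier; those field names, the regular expression and the matching rule are all assumptions, not details from the slides.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Optional

# Deliberately loose check, just enough to drop obviously malformed values.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def normalize(record: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Fix formatting where possible; return None for unusable records."""
    email = record.get("email", "").strip().lower()
    if not EMAIL_RE.match(email):
        return None
    # Collapse repeated whitespace and unify capitalization of the name.
    name = " ".join(record.get("name", "").split()).title()
    return {"email": email, "name": name, "source": record.get("source", "unknown")}


def aggregate(records: Iterable[Dict[str, str]]) -> Dict[str, Dict[str, set]]:
    """Merge cleaned records that share the same normalized identifier."""
    profiles: Dict[str, Dict[str, set]] = defaultdict(
        lambda: {"names": set(), "sources": set()}
    )
    for raw in records:
        clean = normalize(raw)
        if clean is None:
            continue  # filtered out as poorly formatted
        profile = profiles[clean["email"]]
        if clean["name"]:
            profile["names"].add(clean["name"])
        profile["sources"].add(clean["source"])
    return dict(profiles)
```

A profile that collects several distinct names or sources under one identifier is a candidate for the later analysis step, which looks for reliable attributes and groupings.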

Who is Cloudera?


The #1 commercial and non-commercial Apache Hadoop distribution.

Complete, Integrated Hadoop Stack


Helps organizations profit from all their data

Largest contributor to Hadoop ecosystem

Provides the most widely used open source distribution

Develops the most sophisticated Hadoop operations software

Supports mission-critical Hadoop clusters

Trained the largest number of Hadoop Developers and Administrators

Coordination: APACHE ZOOKEEPER

Data Integration: APACHE FLUME, APACHE SQOOP

Fast Read/Write Access: APACHE HBASE

Languages / Compilers: APACHE PIG, APACHE HIVE

Workflow: APACHE OOZIE

Scheduling: APACHE OOZIE

Metadata: APACHE HIVE

File System Mount: FUSE-DFS

UI Framework: HUE

SDK: HUE SDK


Cloudera helps you profit from all your data.

cloudera.com

+1 (888) 789-1488

sales@cloudera.com

twitter.com/cloudera

facebook.com/cloudera

Get Hadoop