Big Data Tools

Hadoop

S. S. Mulay
Sr. V.P. Engineering
February 1, 2013

Hadoop - A Prelude

Apache Projects and Animal-Friendly Names

Some of the projects under the Apache Foundation, to mention a few:

Apache ZooKeeper
Apache Tomcat
Apache Pig

And now, Hadoop

Hadoop - The Name

Hadoop is named after a toy elephant belonging to the son of its creator, Doug Cutting.

Hadoop - The Relevance

Two important things to know when discussing Big Data:

MapReduce
Hadoop
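To make the model concrete before the slides that follow, here is a minimal, purely local sketch of MapReduce (a hypothetical word-count example in plain Python, not Hadoop itself): a map function emits key/value pairs, a shuffle step groups them by key, and a reduce function aggregates each group.

    from collections import defaultdict

    # Minimal local simulation of the MapReduce model (word count).
    # Illustrates the programming model only; real Hadoop distributes
    # these phases across a cluster.

    def map_phase(line):
        # Emit a (word, 1) pair for every word in the input line.
        for word in line.split():
            yield (word.lower(), 1)

    def shuffle(pairs):
        # Group values by key, as the framework does between map and reduce.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(key, values):
        # Aggregate all values emitted for one key.
        return (key, sum(values))

    lines = ["big data tools", "hadoop is a big data tool"]
    pairs = [p for line in lines for p in map_phase(line)]
    results = [reduce_phase(k, v) for k, v in shuffle(pairs).items()]
    print(sorted(results))  # [('a', 1), ('big', 2), ('data', 2), ...]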

Hadoop - How Was It Born?

To process huge volumes of data, as the amount of generated data continued to increase rapidly (Big Data).

Also, the Web was generating more and more information, and indexing that content was becoming quite challenging.

Hadoop - Reality vs. Myth

Hadoop is not a direct replacement for enterprise data warehouses, data marts and other data stores that are commonly used to manage structured or transactional data.

It is used to augment enterprise data architectures by providing an efficient and cost-effective means for storing, processing, managing and analyzing the ever-increasing volumes of semi-structured or unstructured data.

Hadoop is useful across virtually every vertical industry.

Hadoop - Some Use Cases

Digital marketing automation
Log analysis and event correlation
Fraud detection and prevention
Predictive modeling for new drugs
Social network and relationship analysis
ETL (Extract, Transform, Load) functions on unstructured data
Image correlation and analysis
Collaborative filtering

Hadoop - What Do We Expect from It?

If we analyze the use cases mentioned above, we realize that:

The data is coming in varied formats and from varied sources.
Need to handle the incoming stream of data and also process it, sometimes in real time.
Need a connector to the existing RDBMS.
Need for a distributed file system.
Need data-warehousing capability over and above the processed data.
Need "map-only" capability to perform image matching and correlation.
Need for a scalable database.
Growing need for a GUI to operate and develop applications for Hadoop.
Need for a framework for parallel computation.
Need for a distributed computing environment.
Need for machine learning and data mining capabilities.
Almost all of these workloads need a way to manage data-processing jobs.

Hadoop - Components Which Come to the Rescue

HDFS - Distributed file system
MapReduce - Distributed processing of large data sets
ZooKeeper - Coordination service for distributed applications
HBase - Scalable distributed database; supports structured data
Avro - Data serialization system
Sqoop - Connector to structured databases
Chukwa - Monitoring for large distributed systems
Flume - Moves large volumes of data efficiently after processing
Hue - GUI to operate and develop Hadoop applications
Hive - Data warehousing framework
Mahout - Machine learning and data mining library
Pig - Framework for parallel computation
Oozie - Workflow service to manage data-processing jobs
Many more ...
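As a rough illustration of the "distributed file system" entry above, the sketch below (plain Python; the block size, replication factor and node names are illustrative, not HDFS's actual placement policy) shows the core idea: a large file is split into fixed-size blocks, and each block is stored on several nodes for redundancy.

    import itertools

    # Sketch of HDFS-style block placement: split a file into fixed-size
    # blocks and assign each block to several data nodes (replication).
    # Values are illustrative; real HDFS placement is rack-aware.

    BLOCK_SIZE = 64 * 1024 * 1024   # 64 MB, a classic HDFS block size
    REPLICATION = 3

    def place_blocks(file_size, nodes):
        """Yield (block_index, [nodes holding a replica]) per block."""
        node_cycle = itertools.cycle(nodes)
        num_blocks = (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling
        for block in range(num_blocks):
            yield block, [next(node_cycle) for _ in range(REPLICATION)]

    nodes = ["node1", "node2", "node3", "node4"]
    for block, replicas in place_blocks(200 * 1024 * 1024, nodes):
        print(f"block {block} -> {replicas}")
    # A 200 MB file becomes 4 blocks, each stored on 3 of the 4 nodes.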

Hadoop - Who's Using It?

(Each entry was attributed to a company by logo on the original slide.)

Uses Hadoop and HBase for:
Social services
Structured data storage
Processing for internal use

Uses Hadoop for:
Amazon's product search indices; they process millions of sessions daily for analytics.

Uses Hadoop for:
Search optimization
Research

Uses Hadoop for:
Databasing and analyzing Next Generation Sequencing (NGS) data produced for the Cancer Genome Atlas (TCGA) project and other groups

Uses Hadoop for:
Internal log reporting/parsing systems designed to scale to infinity and beyond
A web-wide analytics platform

Uses Hadoop:
As a source for reporting/analytics and machine learning

And many more ...

Hadoop - The Various Forms Today

Apache Hadoop - Native Hadoop distribution from the Apache Foundation
Yahoo! Hadoop - Hadoop distribution from Yahoo
CDH - Hadoop distribution from Cloudera
Greenplum Hadoop - Hadoop distribution from EMC
HDP - Hadoop platform from Hortonworks
M3 / M5 / M7 - Hadoop distributions from MapR
Project Serengeti - VMware's implementation of Hadoop on vCenter

And more ...

Hadoop - Use Case Example: Log Processing

Some of the practical use cases for log processing generally in use today:

Analytics - application / web site performance
Reporting - page views, user sessions
Event detection and correlation

Assuming a situation where we have huge logs, generated over a period of time and ranging into TBs, and we want to know:

Page views / user sessions, weekly / monthly
Users and their behavioral patterns
An IP under investigation and its behavioral pattern

Hadoop - Use Case Example: Log Processing

In the conventional method, parallelism is on a per-file basis, not within a single file:

Log file 1 -> Task 1 (grep [pattern] | awk)
Log file 2 -> Task 2 (grep [pattern] | awk)
...
Log file n -> Task n (grep [pattern] | awk)

A new task then concatenates the per-file outputs into the final data set.
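A minimal sketch of this conventional per-file parallelism (plain Python with multiprocessing; the file names and pattern are hypothetical) is shown below. Note that each worker still has to read its whole file serially, so a single huge file gains nothing.

    import re
    from multiprocessing import Pool

    # Conventional approach: one task per log file (a grep-like filter),
    # then a final task concatenates the per-file results. Parallelism
    # stops at the file level.

    PATTERN = re.compile(r"ERROR")  # stand-in for grep [pattern]

    def scan_file(path):
        # Equivalent of `grep ERROR <file> | awk ...` for one whole file.
        with open(path) as f:
            return [line.rstrip("\n") for line in f if PATTERN.search(line)]

    if __name__ == "__main__":
        log_files = ["log1.txt", "log2.txt", "log3.txt"]  # hypothetical
        with Pool() as pool:
            per_file_results = pool.map(scan_file, log_files)
        # The "concatenate data set" task from the diagram:
        final_data_set = [ln for res in per_file_results for ln in res]
        print(len(final_data_set), "matching lines")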

Hadoop - Use Case Example: Log Processing

With MapReduce, a single log file is split into chunks, and the chunks are processed in parallel:

Log file 1, chunk 1 -> Task 1 (grep [pattern] | awk)
Log file 1, chunk 2 -> Task 2 (grep [pattern] | awk)
...
Log file 1, chunk n -> Task n (grep [pattern] | awk)

The per-chunk outputs are combined into the resultant data set.
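Hadoop Streaming (a standard Hadoop facility that runs mapper and reducer scripts over stdin/stdout) lets the same grep/awk-style job be expressed so that the framework, not the operator, handles the chunking. A minimal sketch follows; the log format, with the date as the first field, and the ERROR pattern are assumptions for illustration.

    #!/usr/bin/env python3
    # mapper.py - Hadoop Streaming mapper sketch; Hadoop runs one copy
    # per input chunk and feeds it raw log lines on stdin.
    import sys

    for line in sys.stdin:
        if "ERROR" in line:           # the grep [pattern] step
            date = line.split()[0]    # assume the first field is the date
            print(f"{date}\t1")       # emit key<TAB>value

    #!/usr/bin/env python3
    # reducer.py - Hadoop Streaming reducer sketch; input arrives sorted
    # by key, so equal keys are adjacent.
    import sys

    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{count}")
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print(f"{current_key}\t{count}")

The two scripts are wired together with the streaming jar shipped with Hadoop, along the lines of: hadoop jar hadoop-streaming.jar -input logs/ -output out/ -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py (the exact jar path varies by distribution).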

Hadoop - Use Case Example: Log Processing

Infrastructure realities in the conventional method:

1 server with a 1 Gbps NIC can copy a 100 GB file in about 14 minutes.
1 server with 1 disk can typically copy a 100 GB file in about 20 to 25 minutes.

How things change with MapReduce:

The network bottleneck is eliminated, since multiple servers, each with a 1 Gbps NIC, read the same 100 GB of data in smaller chunks.
The disk bottleneck is eliminated, since each individual server has multiple disks with underlying RAID to improve disk performance.

Assuming a single disk can transfer data at 75 MB/s, and a Hadoop cluster of 4,000 nodes with 6 disks per server, the overall throughput of the setup would be:

6 * 75 MB/s * 4000 = approx. 1.8 TB/s

As a result, reading 1 PB of data would take approximately 10 minutes.
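The arithmetic above can be checked directly; this back-of-the-envelope sketch uses only the figures from the slide and ignores network and coordination overhead.

    # Back-of-the-envelope check of the slide's throughput figures.
    disk_mb_per_sec = 75        # single-disk transfer rate (MB/s)
    disks_per_server = 6
    servers = 4000

    total_mb_per_sec = disk_mb_per_sec * disks_per_server * servers
    print(total_mb_per_sec / 1e6, "TB/s")        # 1.8 TB/s

    petabyte_in_mb = 1e9                         # 1 PB expressed in MB
    seconds = petabyte_in_mb / total_mb_per_sec
    print(seconds / 60, "minutes to read 1 PB")  # ~9.3 minutes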

Hadoop - Big Data Integration Challenges

Technology / Tools

A successful big data initiative requires acquiring, integrating and managing several big data technologies such as Hadoop, MapReduce, NoSQL databases, Pig, Sqoop, Hive, Oozie and others.

Conventional data management tools fail when trying to integrate, search and analyze big data sets, which range from terabytes to multiple petabytes of information.

People

As with any new technology, staff need to be trained in big data technologies to learn the proper skills and best practices. The two biggest challenges are:

Finding in-house expertise
Allocating sufficient budget, time and resources

Processes

Being a niche area, not many documented procedures and processes are available. Also, requirements change depending on the application use case.

Hadoop - Native Solutions & Challenges

Inherent knowledge of the various components and their dependencies is required.

Configuration and implementation need specific skills, not only to implement but also to manage.

Data scientists depend on the backend programming team.

Any version upgrades need to be tested thoroughly before upgrading the current setup.

Support is available only through the community, which can lead to issues for an enterprise implementing Hadoop.

Any integration, and the problems arising out of it, can become a show stopper.

Hadoop - Advantages of Commercial Solutions

Comes fully integrated as a documented package.

Implementation is a straightforward activity.

Comes with a configuration manager that helps set up the infrastructure quickly.

Gives an easy connection to enterprise applications / architecture.

Some come with GUI capabilities that eliminate most of the programming requirements, thus giving control to the data scientists all by themselves.

Comes with many add-on capabilities, including a GUI for management.

Most of these commercial editions work closely with the Apache Foundation and hence are compatible.

The distribution is pre-tested, so package dependencies and their version changes are assured.

Hadoop - Commercial Solutions for Hadoop

The solutions fit into two categories: infrastructure automation and application automation.

Infrastructure Automation - Cloudera
Infrastructure Automation - Hortonworks
Application Automation - Karmasphere Studio
Application Automation - Talend
Application Automation - Pentaho

These are just some of them.

Gartner Report - Magic Quadrant for Data Integration Tools

Hadoop & Cloud - Hand in Hand?

What advantages does the Cloud bring in?

Reduced physical infrastructure
Quick deployment using cloud cloning / templates
Elasticity
Auto-scaling capabilities of the cloud to spawn / de-spawn instances as and when required

Thus, Hadoop on the Cloud does bring the above advantages to the table for enterprises.

All the commercial distributions available today offer a virtual image option to deploy on a cloud / virtualization platform.

Virtualization solution providers like VMware have come up with Project "Serengeti" to support quick deployment and management of Hadoop on the cloud.

Cloud service providers like Amazon, Netmagic and others offer a deployment option for Hadoop infrastructure on the cloud.

Contact Details

For related queries / feedback, mail to:
ssmulay@netmagicsolutions.com
+91-9820453568

Thank You

http://www.linkedin.com/companies/netmagic
http://twitter.com/netmagic
http://www.facebook.com/NetmagicSolutions
http://www.youtube.com/user/netmagicsolutions