What is MapReduce? - Decideo


Dec 4, 2013


www.decideo.fr/bruley

MapReduce

michel.bruley@teradata.com

April 2012

Extract from various presentations: Sudarshan, Chungnam, Teradata Aster, …


What is MapReduce?

- Restricted parallel programming model meant for large clusters
    - User implements Map() and Reduce() functions
- Parallel computing framework
    - Libraries take care of EVERYTHING else
        - Parallelization
        - Fault Tolerance
        - Data Distribution
        - Load Balancing
- Useful model for many practical tasks
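
To make this division of labour concrete, here is a toy sketch in Python. It is not how Google's library or Hadoop is implemented: the helper names `run_job` and `run_with_retry` are hypothetical, a thread pool stands in for a cluster, and simply re-running a failed task stands in for fault tolerance. The point is only that the user writes `map_fn` and `reduce_fn` and nothing else.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(task, arg, attempts=3):
    """Re-run a failed task: a crude stand-in for fault tolerance."""
    for attempt in range(attempts):
        try:
            return task(arg)
        except Exception:
            if attempt == attempts - 1:
                raise  # give up after the last attempt


def run_job(records, map_fn, reduce_fn, workers=4):
    """Toy 'library' (hypothetical): the caller supplies only map_fn and reduce_fn."""
    # Parallelization: map tasks run concurrently in a thread pool.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        mapped = list(pool.map(lambda r: run_with_retry(map_fn, r), records))

    # Data distribution: group intermediate values by key.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)

    return {key: reduce_fn(key, values) for key, values in groups.items()}


# User code: the only part a MapReduce programmer writes.
def map_fn(record):
    # Emit (word length, 1) for every word in the record.
    return [(len(word), 1) for word in record.split()]


def reduce_fn(key, values):
    return sum(values)


result = run_job(["a bb cc", "dd e"], map_fn, reduce_fn)
print(result)
```

The example counts words by length, so the result maps each word length to the number of words of that length.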


Map and Reduce

- The idea of Map and Reduce is 40+ years old
    - Present in all functional programming languages
    - See, e.g., APL, Lisp and ML
- Alternate name for Map: Apply-All
- Higher-order functions
    - take function definitions as arguments, or
    - return a function as output
- Map and Reduce are higher-order functions.
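
As a small illustration of these ideas outside any cluster, Python's built-in `map` and the standard library's `functools.reduce` are both higher-order functions. The helper `make_adder` is a hypothetical name added here to show the other case, a function that returns a function.

```python
from functools import reduce

# map takes a function as an argument and applies it to every element.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))

# reduce takes a two-argument function and folds a sequence into one value.
total = reduce(lambda acc, x: acc + x, squares, 0)


def make_adder(n):
    """A higher-order function that returns a function as output."""
    return lambda x: x + n


add_ten = make_adder(10)
print(squares, total, add_ten(5))
```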


Map and Reduce Functions

- Functions borrowed from functional programming languages (e.g. Lisp)
- Map()
    - Processes a key/value pair to generate intermediate key/value pairs
- Reduce()
    - Merges all intermediate values associated with the same key
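
The two signatures can be written down as types. The sketch below is a sequential, single-machine illustration in Python; the names `MapFn`, `ReduceFn`, `map_reduce` and the temperature data are assumptions made for this example, not part of any MapReduce library.

```python
from collections import defaultdict
from typing import Any, Callable, Iterable, List, Tuple

# Map():    (key, value)          -> intermediate (key, value) pairs
# Reduce(): (key, list of values) -> merged value for that key
MapFn = Callable[[Any, Any], Iterable[Tuple[Any, Any]]]
ReduceFn = Callable[[Any, List[Any]], Any]


def map_reduce(inputs, map_fn: MapFn, reduce_fn: ReduceFn) -> dict:
    """Sequential sketch: apply map_fn, group by key, apply reduce_fn."""
    intermediate = defaultdict(list)
    for key, value in inputs:
        for out_key, out_value in map_fn(key, value):
            intermediate[out_key].append(out_value)
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}


def max_temp_map(station, readings):
    # The input key (station id) is ignored; emit one pair per reading.
    for city, temp in readings:
        yield city, temp


def max_temp_reduce(city, temps):
    # Merge all temperatures seen for one city into a single maximum.
    return max(temps)


inputs = [
    ("station-1", [("Paris", 21), ("Lyon", 25)]),
    ("station-2", [("Paris", 24), ("Lyon", 19)]),
]
hottest = map_reduce(inputs, max_temp_map, max_temp_reduce)
print(hottest)
```

Note that the intermediate key (the city) differs from the input key (the station), which is the usual situation: Map() chooses the key that Reduce() will group on.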


Example: Counting Words

- Map()
    - Input: <filename, file text>
    - Parses the file and emits <word, count> pairs
        - e.g. <"hello", 1>
- Reduce()
    - Sums all values for the same key and emits <word, TotalCount>
        - e.g. <"hello", (3 5 2 7)> => <"hello", 17>
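
The word-count example can be sketched in a few lines of Python. This is a single-process illustration only; the function names, the sample file contents and the simple regular-expression tokenizer are assumptions made for the example.

```python
import re
from collections import defaultdict


def map_word_count(filename, text):
    """Map(): input <filename, file text>, emits <word, 1> pairs."""
    for word in re.findall(r"[a-z']+", text.lower()):
        yield word, 1


def reduce_word_count(word, counts):
    """Reduce(): sums all values for the same key, emits <word, TotalCount>."""
    return word, sum(counts)


# Hypothetical input files.
files = {
    "a.txt": "Hello world, hello MapReduce",
    "b.txt": "Hello again",
}

# Group the intermediate pairs by word (the framework's job in a real system).
intermediate = defaultdict(list)
for filename, text in files.items():
    for word, count in map_word_count(filename, text):
        intermediate[word].append(count)

totals = dict(reduce_word_count(w, c) for w, c in intermediate.items())
print(totals)

# The slide's example: partial counts (3 5 2 7) for "hello" sum to 17.
print(reduce_word_count("hello", [3, 5, 2, 7]))
```

Because Reduce() only sums, it works equally well on single 1s and on partial counts such as 3, 5, 2 and 7.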


Execution on Clusters

1. Input files split (M splits)
2. Assign Master & Workers
3. Map tasks
4. Writing intermediate data to disk (R regions)
5. Intermediate data read & sort
6. Reduce tasks
7. Return
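
The seven steps can be walked through in a single-process simulation. This is only a sketch under simplifying assumptions: the function `run` is hypothetical, in-memory lists stand in for disk, loop iterations stand in for workers, and Python's built-in `hash` is used to pick a region (acceptable here because everything runs in one process).

```python
from itertools import groupby
from operator import itemgetter


def run(lines, map_fn, reduce_fn, M=3, R=2):
    # 1. Split the input into M splits.
    splits = [lines[i::M] for i in range(M)]

    # 2. Assign master and workers: here this function plays the master
    #    and each loop iteration below plays one worker.

    # 3. Map tasks, and 4. write intermediate data into R regions per map task.
    regions = [[[] for _ in range(R)] for _ in range(M)]
    for m, split in enumerate(splits):
        for line in split:
            for key, value in map_fn(line):
                regions[m][hash(key) % R].append((key, value))

    output = {}
    for r in range(R):
        # 5. Each reduce task reads its region from every map task and sorts by key.
        pairs = sorted(
            (pair for m in range(M) for pair in regions[m][r]),
            key=itemgetter(0),
        )
        # 6. Reduce tasks: one call per distinct key.
        for key, group in groupby(pairs, key=itemgetter(0)):
            output[key] = reduce_fn(key, [value for _, value in group])

    # 7. Return the combined result.
    return output


def words(line):
    return [(word, 1) for word in line.split()]


def total(key, values):
    return sum(values)


counts = run(["a b a", "b c", "a"], words, total)
print(counts)
```

Changing M or R changes how the work is divided but not the result, which is what lets a real framework choose them to suit the cluster.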


Map/Reduce Cluster Implementation

[Figure: input files (split 0 to split 4) feed M map tasks, which write intermediate files; R reduce tasks read them and write the output files (Output 0, Output 1).]

- Several map or reduce tasks can run on a single computer
- Each intermediate file is divided into R partitions, by a partitioning function
- Each reduce task corresponds to one partition
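
A common choice of partitioning function is a hash of the key modulo R. The sketch below assumes that choice and uses `zlib.crc32` as the hash; the name `partition` and the sample pairs are hypothetical. A stable hash matters because Python's built-in `hash` for strings can differ between processes, so independent workers could disagree on where a key belongs.

```python
import zlib


def partition(key: str, R: int) -> int:
    """Hypothetical partitioning function: stable hash of the key, modulo R."""
    return zlib.crc32(key.encode("utf-8")) % R


R = 4
intermediate = [("hello", 1), ("world", 1), ("hello", 1), ("map", 1)]

# One map task's intermediate file, divided into R partitions.
partitions = [[] for _ in range(R)]
for key, value in intermediate:
    partitions[partition(key, R)].append((key, value))

for r, part in enumerate(partitions):
    print(r, part)
```

Every pair with the same key lands in the same partition, so the reduce task for that partition sees all values for the key.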


Map Reduce vs. Parallel Databases

- Map Reduce is widely used for parallel processing
    - Google, Yahoo, and hundreds of other companies
    - Example uses: compute PageRank, build keyword indices, do data analysis of web click logs, ...
- Database people say:
    - but parallel databases have been doing this for decades
- Map Reduce people say:
    - we operate at scales of thousands of machines
    - we handle failures seamlessly
    - we allow procedural code in map and reduce and allow data of any type
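
The two viewpoints can be put side by side on a tiny click log. In this sketch SQLite stands in for a database purely to show the declarative formulation; it is a single-node engine, not a parallel database, and the table, column and function names are made up for the example.

```python
import sqlite3
from collections import defaultdict

# Hypothetical web click log: (visitor, url).
clicks = [
    ("alice", "/home"),
    ("bob", "/home"),
    ("alice", "/cart"),
    ("alice", "/home"),
]

# Database view: the aggregation is one declarative GROUP BY query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE clicks (visitor TEXT, url TEXT)")
conn.executemany("INSERT INTO clicks VALUES (?, ?)", clicks)
sql_counts = dict(conn.execute("SELECT url, COUNT(*) FROM clicks GROUP BY url"))
conn.close()


# MapReduce view: the same aggregation as procedural map and reduce code.
def map_click(visitor, url):
    yield url, 1


def reduce_click(url, ones):
    return sum(ones)


groups = defaultdict(list)
for visitor, url in clicks:
    for key, value in map_click(visitor, url):
        groups[key].append(value)
mr_counts = {url: reduce_click(url, ones) for url, ones in groups.items()}

print(sql_counts)
print(mr_counts)
```

Both approaches give the same counts here; the debate on the slide is about scale, failure handling and how freely the per-record logic can be written, not about whether the result can be computed.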


Typical MapReduce Cluster


Map Reduce Implementations

- Google
    - Not available outside Google
- Hadoop
    - An open-source implementation in Java
    - Uses HDFS for stable storage
    - Download: http://lucene.apache.org/hadoop/
- Teradata Aster
    - Cluster-optimized SQL database that also implements MapReduce
    - IITB alumnus among founders
- And several others, such as Cassandra at Facebook, etc.


MapReduce v. Hadoop

                      MapReduce   Hadoop
Org                   Google      Yahoo/Apache
Impl                  C++         Java
Distributed File Sys  GFS         HDFS
Database              Bigtable    HBase
Distributed lock mgr  Chubby      ZooKeeper


Solutions Stack for Teradata Aster

[Figure: stack diagram. Labels: Aster Data Ecosystem; Business Intelligence Tools; Analytics Specialists; Data Integration / ETL; Systems Management; Security; Query Tools; Aster Data nCluster; Aster Data Platform Infrastructure; Servers; Operating System; Cloud Infrastructure; Storage.]


Teradata Aster Platform Infrastructure

For physical infrastructure (non-cloud) deployments

Layer                         Description
Server Hardware               Certified commodity (x86) server hardware with internal storage
Operating System              Certified Linux operating system
Aster Data Analytic Platform  Aster Data nCluster packaged software


Teradata Aster Infrastructure

For cloud deployments

Layer                         Description                                                          Examples
Compute Instance              Compute instance from cloud provider (e.g. Amazon Web Services EC2)  CC, xLarge
Storage                       Storage connected to cloud computing capacity                        EBS, Ephemeral
Operating System              Linux operating system
Aster Data Analytic Platform  Aster Data nCluster packaged software


Teradata Aster Architecture for Analytics

[Figure: layered diagram. Your Analytics & Advanced Reporting Applications (four "App" boxes) and Aster Data nCluster, whose layers are labeled Unified Interface (SQL, SQL-MapReduce), Analytic Functions and Frameworks, Analytics Processing Engines (SQL, MapReduce) and Massively Parallel Data Stores.]

- Massively Parallel Data Stores
    - Hybrid row/column DBMS
    - Linear, incremental scalability
    - Commodity hardware
- Standard SQL interface
- MapReduce processing integrated with SQL via SQL-MapReduce interface
- Rich libraries of MapReduce analytics from Aster Data and partners
- Visual development environment -- develop in hours
- Optimized SQL engine
- Fully-integrated in-database MapReduce
- Support for in-database processing of custom applications written in a broad variety of languages
- Integration with third-party packaged software via ODBC/JDBC or in-database integration


Teradata Aster Ecosystem

Partner        Product                 Product release                      Platform for Certification
MicroStrategy  Intelligence Server     9.2.1 32-bit                         Windows 7, Enterprise Edition SP1, 32-bit, 64-bit
SAP            Business Objects        XI 3.1                               Windows 2008, 32-bit
Informatica    PowerCenter             9.0.1                                Client: Windows 2003/2008 Server 32 bit
                                                                            Server: Windows 2003/2008 Server 32 bit and 64 bit
IBM            Cognos                  10.1FP1                              n/a
Tableau        Tableau Server          6                                    Windows (SS: TBU)
Microsoft      SSLS, SSAS, SSFS, SSIS  SQL Server 2008; .NET Framework 2.0  Windows Server 2008, 64-bit; Windows 2003, 32-bit

*Oracle BIEE certification currently in process