HAMS Technologies www.hams.co.in director@hams.co.in priyank@hams.co.in vivek@hams.co.in


Dec 13, 2013

Hadoop overview

» A framework that lets one easily write and run applications that process vast amounts of data. Its ecosystem includes components such as MapReduce, HDFS, Hive, HBase, and Pig.

» Yahoo is the biggest contributor. Other major contributors are Facebook, Google, and Amazon/A9.

» Here's what makes it especially useful:

    Scalable and reliable
    Easy to implement
    Efficient
    Lots of tools available
    Supports many well-known languages and scripts



How Hadoop works

» MapReduce divides applications into small blocks of work.

» HDFS creates the desired number of replicas of each data block for reliability, placing them on compute nodes around the cluster.

» MapReduce can then process the data locally, followed by aggregation of the intermediate results.



General flow in MapReduce architecture

1. Create a clustered network.
2. Load the data into the cluster using Map (mapper task).
3. Fetch the data to be processed with the help of Map (mapper task).
4. Aggregate the results with Reduce (reducer task).

[Figure: three Map tasks each read their own Local Data and produce Partial Result-1, Partial Result-2, and Partial Result-3; a single Reduce task combines them into the Aggregated Result.]
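The flow above can be sketched in plain Java without any Hadoop classes. This is only a single-process illustration: the class name `MapReduceFlowSketch`, its methods, and the choice of summing integers as the "work" are assumptions made for the example, not part of Hadoop.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Single-process illustration of the Map -> partial results -> Reduce flow.
// All names here are hypothetical; no Hadoop classes are involved.
final class MapReduceFlowSketch {

    // "Map" step: each node summarizes only the data it holds locally.
    static long mapPartition(List<Integer> localData) {
        long partial = 0;
        for (int value : localData) {
            partial += value;
        }
        return partial;
    }

    // "Reduce" step: combine the partial results into one aggregated result.
    static long reduce(List<Long> partialResults) {
        long total = 0;
        for (long partial : partialResults) {
            total += partial;
        }
        return total;
    }

    // Runs one map call per partition, then a single reduce over the partials.
    static long run(List<List<Integer>> partitions) {
        List<Long> partials = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            partials.add(mapPartition(partition));
        }
        return reduce(partials);
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 2, 3),   // local data on node 1
                Arrays.asList(4, 5),      // local data on node 2
                Arrays.asList(6));        // local data on node 3
        System.out.println(run(partitions)); // prints 21
    }
}
```

In a real cluster the three map calls would run in parallel on the machines that hold the data; here they run one after another only to keep the sketch short.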


General attributes of MapReduce architecture

1. Distributed file system (DFS)
2. Data locality
3. Data redundancy for fault tolerance
4. Map tasks are applied to partitioned data and scheduled so that their input blocks are on the same machine.
5. Reduce tasks are applied to process the data partitioned by the Map tasks.



Hadoop is an open source implementation of the MapReduce architecture, maintained by Apache.

[Figure: Hadoop comprises HDFS (Hadoop Distributed File System) and MapReduce (job trackers). Master nodes: name node/s and job tracker node/s. Slave nodes: several Data Nodes, each hosting data node/s and tracker node/s. Also shown: Hive (Hadoop interactIVE).]


» Hadoop Streaming allows you to create and run MapReduce jobs with any executable or script as the Mapper and/or the Reducer.

» HDFS (Hadoop Distributed File System) is a clustered network used to store data. HDFS contains the logic to replicate and track the different data blocks. An HDFS write is shown below. Data is retrieved from HDFS in the reverse manner.



[Figure: HDFS write. The file hams.txt is split into Block-1, Block-2, and Block-3. The client asks the Name Node: "I have a file that contains 3 blocks. Where should I write these?" The Name Node answers: "Okay, write these on data nodes 1, 2, and 3." The blocks are then written to Data Node-1, Data Node-2, and Data Node-3, out of Data Node-1 through Data Node-n, each hosting data node/s and tracker node/s.]


When to use Hadoop

» Unstructured data for analysis
» Very large amounts of data
» Write once (or rarely), read many times
» Multiple modules written in different languages


Kinds of people working on application development using Hadoop

1. Hadoop admin/technical person: People who configure the Hadoop environment, setting up the required clusters with the details of all data sources and the different nodes.

2. Hadoop programmer: People who write the different map and reduce functions that perform the data analysis.

*Here we take the perspective of the Hadoop programmer.


Map/Reduce is a programming model for efficient distributed computing.

It works like a Unix pipeline:

    Unix   -> cat input | grep | sort | uniq -c | cat > output
    Hadoop -> Input | Map | Shuffle & Sort | Reduce | Output

A simple model, but good for a lot of applications:

» Log processing
» Web index building
» Count of URL access frequency
» Reverse web-link graph: list of all source URLs associated with a given target URL
» Inverted index: produces <word, list(Document ID)> pairs
» Distributed sort
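To make the pipeline stages concrete, here is a minimal single-process sketch of the inverted-index use case, written in plain Java (Java 8 or later). The class `InvertedIndexSketch`, its method names, and the sample documents are hypothetical. Hadoop itself would run the map and reduce stages on different machines and perform the shuffle over the network.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Single-process walk through Input | Map | Shuffle & Sort | Reduce | Output
// for the inverted-index use case. Names are hypothetical; no Hadoop involved.
final class InvertedIndexSketch {

    // Map: emit one (word, documentId) pair per word occurrence.
    static List<String[]> map(String documentId, String text) {
        List<String[]> pairs = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new String[] {word, documentId});
            }
        }
        return pairs;
    }

    // Shuffle & Sort: group values by key, with keys in sorted order.
    static Map<String, List<String>> shuffleAndSort(List<String[]> pairs) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    // Reduce: collapse each word's document ids into a sorted, duplicate-free list.
    static List<String> reduce(List<String> documentIds) {
        return new ArrayList<>(new TreeSet<>(documentIds));
    }

    // Chains the stages: map every document, shuffle, then reduce each group.
    static Map<String, List<String>> buildIndex(Map<String, String> documents) {
        List<String[]> pairs = new ArrayList<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            pairs.addAll(map(doc.getKey(), doc.getValue()));
        }
        Map<String, List<String>> index = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> group : shuffleAndSort(pairs).entrySet()) {
            index.put(group.getKey(), reduce(group.getValue()));
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> documents = new LinkedHashMap<>();
        documents.put("doc1", "hadoop stores data");
        documents.put("doc2", "hadoop processes data data");
        // {data=[doc1, doc2], hadoop=[doc1, doc2], processes=[doc2], stores=[doc1]}
        System.out.println(buildIndex(documents));
    }
}
```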



Here we need to take care of the implementation of the Map and Reduce functions, and we need to write the code that launches the application.

Mapper
    Input:  value: lines of text of input
    Output: key: word, value: 1

Reducer
    Input:  key: word, value: set of counts
    Output: key: word, value: sum

Launching program
    Defines the job
    Submits the job to the cluster
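The contract above can be traced on a laptop before touching a cluster. The sketch below uses plain Java (Java 8 or later) and no Hadoop classes. `WordCountTrace` and its methods are hypothetical stand-ins for the mapper, the reducer, and the launching program, and it splits on any whitespace, which is an assumption of this sketch.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java trace of the word count contract: no cluster, no Hadoop classes.
// The class and method names are hypothetical.
final class WordCountTrace {

    // Mapper: value = one line of text; output = (word, 1) for every word.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<String, Integer>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    // Reducer: key = word, values = its counts; output = their sum.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Launcher stand-in: maps every line, groups the pairs by word, then reduces.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            totals.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("to be or");
        lines.add("not to be");
        System.out.println(run(lines)); // {be=2, not=1, or=1, to=2}
    }
}
```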


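Mapper (example for word count)

    // Imports needed at the top of the enclosing source file.
    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Nested inside the driver class, hence "static".
    public static class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // Tokens are split on tab characters only; use
            // new StringTokenizer(line) to split on any whitespace.
            StringTokenizer tokenizer = new StringTokenizer(line, "\t");
            // System.out.println(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);   // emit (word, 1)
            }
        }
    }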




Reducer (example for word count)

    // Additional import needed: org.apache.hadoop.mapreduce.Reducer
    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Add up all the counts emitted for this word.
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);   // emit (word, sum)
        }
    }


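Map reduce launcher

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    // Body of the driver's main(String[] args) method.
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(WordCountMap.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    // Input and output paths are read from the second and third arguments.
    FileInputFormat.addInputPath(job, new Path(args[1]));
    FileOutputFormat.setOutputPath(job, new Path(args[2]));

    job.waitForCompletion(true);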


Running the complete program

» Build the jar file, either directly from Eclipse or with the jar command.
» Configure Hadoop.
» Place the jar file in the appropriate location.

Let's move to the demo :)


Documentation:

» Hadoop Wiki
    Introduction: http://hadoop.apache.org/core/
    Getting Started: http://wiki.apache.org/hadoop/GettingStartedWithHadoop
    Map/Reduce Overview: http://wiki.apache.org/hadoop/HadoopMapReduce
    DFS: http://hadoop.apache.org/core/docs/current/hdfs_design.html

» Javadoc
    http://hadoop.apache.org/core/docs/current/api/index.html


Thank you



Kindly drop us a mail at the addresses below with any suggestions or requests for clarification. We would like to hear from you.


HAMS Technologies


www.hams.co.in


director@hams.co.in


priyank@hams.co.in


vivek@hams.co.in