Utilizing R with HADOOP

DUKE UNIVERSITY DEPARTMENT OF COMPUTER SCIENCE

Utilizing R with HADOOP

A strategy for end user development


Chao Chen & John Engstrom

12/15/2010






Currently, a collection of methods has been developed to enable R programmers to utilize the benefits of parallel computing via HADOOP. Each system requires the user to understand HADOOP development, low-level computer programming, and parallel computing. This investigation outlines the methods required to reduce that complexity for the end user while giving applied examples.




Contents

Motivation for Research
Scope
RHIPE
    Word Count in RHIPE
Application in Practice
    Programming Architecture
    Performance Metrics
Requirements for Full Scale Development & Extensibility
    Mapper & Reducer Optimizer
    Robust Math Library & Functions with Multiple Operators
    Interactive General User Interface & Graphics Packages
Pros and Cons
    Pros
        Mappers & Reducers
        Programming Independent
        Efficient Runtime
    Cons
        Existing Community
        User Control over Jobs
        Proof of Robustness
Conclusion









Motivation for Research

Currently, individuals given the task of industrial analysis have a large suite of options available within their analytical toolkit. Systems such as SAS, SAP, Statistica, and Minitab give analysts a large toolkit with which they can analyze data sets in local memory under given conditions. These systems require hands-on involvement at each step of the analytical process and offer little or no extensibility. R is widely becoming a de facto standard for analysts not intimidated by declarative programming, because it gives the end user full control over the statistical models offered and enables a much more automated execution of experiments after development. As with all good analysis, more data enables much greater insight, and local memory is simply not enough for even the most powerful machines.

Integrating the benefits of parallel computing into the world of statistics will have profound effects on companies' ability to derive useful and actionable results from their information in real time. Currently, systems such as RHIPE enable programmers to write mappers and reducers within the R development environment and send them to preconfigured nodes within a cluster. This is extremely effective on internally owned computing systems with settings and nodes preconfigured for the experiment's needs. Unfortunately, it does not allow for the rapid elasticity required for use on the cloud, as it requires a significant initial setup time investment.

The solution proposed will enable an R programmer to use the language they are familiar with within a friendly GUI that then automates the parallel computing process, so as to reduce this barrier to entry and streamline rapid development.

Scope

We investigate RHIPE as a credible method and use this knowledge to develop a position on what we believe a system of this nature should look like. The full-scale scope of this project is to develop simple real-world applications in JAVA for testing with a real data set. This development will enable us to gain a better understanding of what exactly will be required to make methods of this nature work seamlessly in a solution-driven environment. The solution given should be able to read in specific arguments, typed in R, and parse them in such a manner that it can then produce the JAR files required to send jobs to HADOOP. Commands for variance, standard deviation, and mean were selected because they are fundamental both in and of themselves and to every other statistical experiment used in practice.
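To make the intended input concrete, the commands envisioned are ordinary R calls such as the following (the data set is only a placeholder for illustration):

    data <- c(2, 4, 4, 4, 5, 5, 7, 9)   # placeholder data set
    mean(data)   # mean
    var(data)    # variance
    sd(data)     # standard deviation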

RHIPE

RHIPE is an internally contained library within R and is used as a "wrapper" placed around R code, so that your code is sent directly to HADOOP. The incremental programming knowledge required is negligible, but a user has to become comfortable with the idea of mappers and reducers, which can be challenging to do properly. We give a word count example in RHIPE in the next section to illustrate the syntax of the mapper and reducer in RHIPE.

Word Count in RHIPE

First, set up RHIPE on the Amazon cloud: start from ami-6159bf08, then install Google Protocol Buffers, R, and RHIPE on the Amazon nodes. After that, we can start programming.

library(Rhipe)

Load the RHIPE library.

rhinit()

Initialize RHIPE.

m <- expression({
  y <- strsplit(unlist(map.values), " ")
  lapply(y, function(r) rhcollect(r, T))
})

This is the mapper function. We split map.values by white space; each resulting word r becomes a key, and rhcollect(r, T) emits the value T (TRUE) for that key. (An expression in R plays a role here loosely similar to a class in Java.)

r <- expression(
  pre = {
    count <- 0
  },
  reduce = {
    count <- sum(as.numeric(unlist(reduce.values)), count)
  },
  post = {
    rhcollect(reduce.key, count)
  }
)

This is the reducer function. We initialize the count to 0, then add up the values arriving for each specific key, and lastly collect the key-value pair as the output.

z <- rhmr(map = m, reduce = r, comb = T, inout = c("text", "sequence"),
          ifolder = "/Users/xiaochao1777/Desktop/test_data",
          ofolder = "/Users/xiaochao1777/Desktop/test_data_output")




This is the job configuration. We set m as the mapper and r as the reducer, and comb=T enables a combiner, which in this example is the same as the reducer (it is optional). We specify the input format as text and the output format as sequence. ifolder is the input folder, and ofolder is the output folder.

A useful feature is that we can specify the number of mappers and reducers in this job configuration. In that case, the above code looks like:

z <- rhmr(map = m, reduce = r, comb = T, inout = c("text", "sequence"),
          ifolder = "/Users/xiaochao1777/Desktop/test_data",
          ofolder = "/Users/xiaochao1777/Desktop/test_data_output",
          mapred = list(mapred.map.tasks = 100, mapred.reduce.tasks = 20))

Here we specify the number of mappers as 100 and the number of reducers as 20.

rhex(z)

The last step is to run the job.
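Once the job finishes, the word counts can be pulled back into the R session for inspection. Assuming the installed RHIPE release provides the rhread reader (the call below is a sketch and may differ across RHIPE versions):

    out <- rhread("/Users/xiaochao1777/Desktop/test_data_output")
    # out is a list of key-value pairs: each element holds a word and its count
    head(out)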

A major detractor from this type of system is that it is not extensible or flexible for users to customize. In addition, the replacement of an old API with a new one may cause problems: while we were doing this project, the old "RHLAPPLY" API was deleted and substituted with a new one. Sometimes this can make users very uncomfortable. On the other hand, this system has all of the benefits of programming with R, including but not limited to the user support community, robust libraries, graphics packages, and a GUI. These features are a very large promoter of the RHIPE system and make it easy to understand why this methodology was used.

Application in Practice

Programming Architecture

For the local version, you input the R command as you do in R, and the application then opens a GUI for you to pick your data file (any delimited file format, e.g. csv or txt, with "," as the delimiter). After you choose the data file, the application shows the result and the running time in seconds in a dialogue.

For the distributed version, you input the R command as you do in the local version (also the same as in R). The application then generates the job configuration Java file, the mapper Java file, and the reducer Java file under the current working path in Eclipse or another SDK. After that, you send these files to the HADOOP server along with the specific data file path, compile the files, and run them. The results are stored under the path you choose. You can download the results or inspect the results and running time from the command line.
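As a rough illustration of what the local version does under the hood (a sketch only; the calls below are ordinary base R and the standard-deviation example is just one of the supported commands, not the application's actual code):

    # Sketch of the local workflow: pick a file, compute, and report the timing
    path   <- file.choose()                          # GUI file picker
    data   <- read.csv(path, header = FALSE)         # comma-delimited input
    timing <- system.time(result <- sd(data[[1]]))   # e.g. standard deviation of column 1
    cat("Result:", result, "  Running time (s):", timing[["elapsed"]], "\n")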





Performance Metrics

We tested 10 data files in CSV format with both the local version and the distributed version. The data files range from 10K entries to 5M entries. When the data file is relatively small, the running times of the local version and the distributed version are almost the same, around 20 seconds. However, once the data file is bigger than 8M bytes, the advantages of the distributed version show: at the 8M-byte point, the running time of the local version is almost triple the running time of the distributed version.

In addition, the size limit of the input file for the local version is 8M bytes; it will crash if the input file is bigger than that. There is no such limit on the size of the input file for the distributed version. We tested a 1G-byte file with the distributed version, calculating the standard deviation, and the total running time was 27 minutes 54 seconds.






Requirements for Full Scale Development & Extensibility

This section focuses primarily on the requirements for turning this methodology into an industrial-scale end user software package. In this section we speak from a very high-level view, concentrating on the major milestones required to make such an endeavor successful both as an analytical toolkit and as a competitive business model.

Mapper & Reducer Optimizer

As all mappers and reducers are developed internally, it is absolutely critical that jobs are created in as efficient a manner as possible, and these methods must be extremely robust. This optimizer's job would be to determine how many mappers and reducers to create, and also whether using HADOOP is even necessary, as it is sometimes much more efficient to use local memory. These optimization algorithms would be a function of many variables, including but not limited to the job request (the R code), the data file format, local computer memory, I/O to the cluster, and total runtime on the cluster. This can be done in one of two primary ways (a generalization): maintain a small database of rules designed from experiments run prior to runtime, or run predictive analytics at the time of the request. Depending on the given scenario, either could be most beneficial. It is very important to note that, for the latter option, these optimization algorithms would execute at runtime, and it is critical that they not become so taxing as to significantly increase the total job runtime.
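A minimal sketch of the first option, the rule database, assuming a single file-size rule learned from prior experiments (the threshold, names, and task sizing below are illustrative only, not part of any existing package):

    # Illustrative rule-based dispatcher: run locally on small inputs, on HADOOP otherwise
    choose_backend <- function(input_file, threshold_bytes = 8 * 1024^2) {
      size <- file.info(input_file)$size
      if (is.na(size)) stop("input file not found: ", input_file)
      if (size < threshold_bytes) "local" else "hadoop"
    }

    # Example: pick a backend, then a rough map-task count from the input size
    input   <- "test_data/sample.csv"                 # hypothetical input path
    backend <- choose_backend(input)
    if (backend == "hadoop") {
      map_tasks <- max(1, ceiling(file.info(input)$size / (64 * 1024^2)))
    }

The 8M-byte threshold here simply echoes the crossover point observed in the performance metrics above.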




Robust Math Library & Functions with Multiple Operators

It is important that a mathematical tool contain at least the most important mathematical functions. Consumers are not concerned simply with how good an idea something is; it either fulfills their needs or it does not. Selection of the most utilized mathematical functions could easily be done by assessing what is most used, by density or volume, in analytical practice. It would also be important to ensure that end users could add their own novel functions.

Furthermore, most mathematical functions do not reduce to single operations, as many are complex and layered. An example of this is order of operations. Below is the formula for the cumulative binomial distribution, which is used very frequently and considered simple.

P(X \le x) = \sum_{k=0}^{x} \frac{n!}{k!\,(n-k)!} \, p^{k} (1-p)^{n-k}
Upon inspection of this formula, to the right of the equals sign we see one summation, three factorials, one division, three multiplications, two exponentials, and two subtractions. Not only is it important that each of these is computed independently, but they must also be computed in the correct order.
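The point is easy to check in R: the explicit sum below mirrors the formula term by term, while pbinom is the base R reference implementation (the specific numbers are arbitrary):

    # Binomial cumulative distribution written out exactly as in the formula above
    binom_cdf <- function(x, n, p) {
      k <- 0:x
      sum(factorial(n) / (factorial(k) * factorial(n - k)) * p^k * (1 - p)^(n - k))
    }

    binom_cdf(3, 10, 0.4)              # explicit order-of-operations version
    pbinom(3, size = 10, prob = 0.4)   # base R reference; the two agree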

Interactive General User Interface & Graphics Packages

From a computing point of view this is a small detail, but it is extremely important to the end user. Having an interactive general user interface is something that makes end users quickly fall in love with their abilities. Because software development is largely a fixed-cost endeavor, it is important to ensure repeat business. This is done by keeping end users happy with easy-to-use programming interfaces that are nearly self-explanatory.

Pros and Cons

In this section we focus on the pros and cons of this type of methodology as compared to a system such as RHIPE, and not necessarily only on what was programmed. There are many pros and cons to this type of analysis system, so we select a small sample to represent the largest concerns. The major pros we focus on are the ability to separate the user from the need to develop and understand mappers and reducers, independence from computer programming, and highly efficient runtimes. The cons we focus on are the lack of an existing support community for such a development method, user control over jobs, and proof of robustness.




Pros

Mappers & Reducers

Separating the end user from the need to develop mappers and reducers ensures that even a
user who has no knowledge of parallel computing may have as robust an analytical toolkit as
any.

Programming Independent


The ability to eliminate this barrier to entry is absolutely critical for many R end users. Because many programming languages already contain numerical libraries, it is very reasonable to assume that anyone using R does not know, or is not comfortable enough with, those languages. Because we make the assumption that these users cannot program these models on their own, it would be unreasonable to hope that they could interact with parallel computing clusters via those languages.

Efficient Runtime

Since the process of mapper and reducer optimization would be internal to the system, allocation of these resources would be much more efficient than any single typical programmer would be able to produce. The idea behind this is that a small collection of experts developing these algorithms for the software would do a much better job than any given typical R programmer ever could.

When programs like RHIPE send jobs to a cluster, each node must already have R prepared for use in order to understand the internal controls of the system. Because this system would read the R code to determine what is desired and then call the proper JAVA class, that is not necessary. This eliminates the need to initialize a cluster, which has vast benefits in many scenarios.
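As a rough sketch of that "read the R code, call the proper JAVA class" step (the class names and lookup table below are purely hypothetical):

    # Hypothetical dispatch from an R command to a generated Java job class
    job_class_for <- function(r_command) {
      dispatch <- c(mean = "MeanJob", var = "VarianceJob", sd = "StdDevJob")
      fn <- all.names(parse(text = r_command))[1]   # first function named in the call, e.g. "sd"
      if (!fn %in% names(dispatch)) stop("unsupported command: ", fn)
      dispatch[[fn]]
    }

    job_class_for("sd(x)")   # returns "StdDevJob"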

Cons

Existing Community

As with any new, novel idea, few people are currently involved. This is a large issue both for programmers looking to develop these libraries and for end users seeking debugging assistance. The reason programs like RHIPE are so successful is that R currently has a very vast community of people involved with such issues who can assist with them.

User Control over Jobs

An inherent complaint with doing all operations internally is that end users have no control over what actually happens within the system. We openly concede that this is true and of concern. In a fully robust system, an experienced programmer would have the ability to "turn off" the internal optimizer and code reader and simply do it themselves. This would be similar to the extensibility measures within the Microsoft Office suite, where even in the most user-encapsulated experience a simple press of ALT+F11 brings up the Visual Basic editor for full-scale control.

Proof of Robustness

In this paper some very large generalizations, assumptions, and wishful thinking have been employed to make a case for the successful implementation of this type of software. It is important to point out that, while each of the requirements for full-scale development and extensibility is a somewhat independent programming project, they are all critically important to the success of such a system.

Conclusion

It suffices to show, from the proof of practice and the observations made in this paper, that a technology of this nature is not only possible but would find a very large market of individuals needing to do advanced statistics without adequate computer programming knowledge. Successful development of such a project would find itself fully ingrained in many different analytical communities: engineers, mathematicians, managers, biostatisticians, government officials, and the like. While programs such as RHIPE would still have their respective place in critical applications, there is a large market share neglected by current offerings, and this type of method would aim to fill that void.