Big Data Workshop Lab Guide - Oracle DWH




http://www.oracle-developer-days.com
Copyright © 2012, Oracle and/or its affiliates. All rights reserved.














Big Data Workshop


Lab Guide












TABLE OF CONTENTS

1. Introduction
2. Hadoop Hello World
   2.1 Introduction to Hadoop
   2.2 Overview of Hands-on Exercise
   2.3 Word Count
   2.4 Summary
3. Pig Exercise
   3.1 Introduction to Pig
   3.2 Overview of Hands-on Exercise
   3.3 Working with Pig
   3.4 Summary
4. Hive Coding
   4.1 Introduction to Hive
   4.2 Overview of Hands-on Exercise
   4.3 Queries with Hive
   4.4 Summary
5. Oracle ODI and Hadoop
   5.1 Introduction to Oracle Connectors
   5.2 Overview of Hands-on Exercise
   5.3 Setup and Reverse Engineering in ODI
   5.4 Using ODI to Import a Text File into Hive
   5.5 Using ODI to Import a Hive Table into Oracle
   5.6 Using ODI to Import a Hive Table into Hive
   5.7 Summary
6. Working with External Tables
   6.1 Introduction to External Tables
   6.2 Overview of Hands-on Exercise
   6.3 Configuring External Tables
   6.4 Summary
7. Working with Mahout
   7.1 Introduction to Mahout
   7.2 Overview of Hands-on Exercise
   7.3 Clustering with K-means
   7.4 Summary
8. Programming with R
   8.1 Introduction to Enterprise R
   8.2 Overview of Hands-on Exercise
   8.3 Taking Data from R and Inserting It into the Database
   8.4 Taking Data from the Database and Using It in R for Clustering
   8.5 Summary
9. Oracle NoSQL Database
   9.1 Introduction to NoSQL
   9.2 Overview of Hands-on Exercise
   9.3 Insert and Retrieve Key-Value Pairs
   9.4 Summary
Appendix A
   A.1 Setup of a Hive Data Store





1. INTRODUCTION

Big data is not just about managing petabytes of data. It is also about managing large numbers of complex unstructured data streams which contain valuable data points. However, which data points are the most valuable depends on who is doing the analysis and when they are doing the analysis. Typical big data applications include: smart grid meters that monitor electricity usage in homes, sensors that track and manage the progress of goods in transit, analysis of medical treatments and drugs that are used, analysis of CT scans, etc. What links these big data applications is the need to track millions of events per second, and to respond in real time. Utility companies will need to detect an uptick in consumption as soon as possible, so they can bring supplementary energy sources online quickly. Probably the fastest growing area relates to location data being collected from mobile always-on devices. If retailers are to capitalise on their customers' location data, they must be able to respond as soon as a customer steps through the door.

In the conventional model of business intelligence and analytics, data is cleaned, cross-checked and processed before it is analysed, and often only a sample of the data is used in the actual analysis. This is possible because the kind of data that is being analysed - sales figures or stock counts, for example - can easily be arranged in a pre-ordained database schema, and because BI tools are often used simply to create periodic reports.


At the center of the big data movement is an open source software framework called Hadoop. Hadoop has become the technology of choice to support applications that in turn support petabyte-sized analytics utilizing large numbers of computing nodes. The Hadoop system consists of three projects: Hadoop Common, a utility layer that provides access to the Hadoop subprojects; the Hadoop Distributed File System (HDFS); and Hadoop MapReduce. HDFS acts as the data storage platform for the Hadoop framework and can scale to massive size when distributed over numerous computing nodes.

Hadoop MapReduce is a framework for processing data sets across clusters of Hadoop nodes. The Map and Reduce process splits the work by first mapping the input across the control nodes of the cluster, then splitting the workload into even smaller data sets and distributing it further throughout the computing cluster. This allows it to leverage massively parallel processing (MPP). With MPP, Hadoop can run on inexpensive commodity servers, dramatically reducing the upfront capital costs traditionally required to build out a massive system.

As the nodes "return" their answers, the Reduce function collects and combines the information to deliver a final result.

To extend the basic Hadoop ecosystem capabilities, a number of new open source projects have added functionality to the environment. A typical Hadoop ecosystem will look something like this:

- Avro is a data serialization system that converts data into a fast, compact binary data format. When Avro data is stored in a file, its schema is stored with it
- Chukwa is a large-scale monitoring system that provides insights into the Hadoop distributed file system and MapReduce
- HBase is a scalable, column-oriented distributed database modeled after Google's BigTable distributed storage system. HBase is well-suited for real-time data analysis
- Hive is a data warehouse infrastructure that provides ad hoc query and data summarization for Hadoop-supported data. Hive utilizes a SQL-like query language called HiveQL. HiveQL can also be used by programmers to execute custom MapReduce jobs
- Pig is a high-level programming language and execution framework for parallel computation. Pig works within the Hadoop and MapReduce frameworks
- ZooKeeper provides coordination, configuration and group services for distributed applications working over the Hadoop stack



Data exploration of Big Data result sets requires displaying millions or billions of data points to uncover hidden patterns or records of interest.




Many vendors are talking about Big Data in terms of managing petabytes of data. For example, EMC has a number of Big Data storage platforms such as its new Isilon storage platform. In reality the issue of big data is much bigger, and Oracle's aim is to focus on providing a big data platform which provides the following:

- Deep Analytics - a fully parallel, extensive and extensible toolbox full of advanced and novel statistical and data mining capabilities
- High Agility - the ability to create temporary analytics environments in an end-user driven, yet secure and scalable environment to deliver new and novel insights to the operational business
- Massive Scalability - the ability to scale analytics and sandboxes to previously unknown scales while leveraging previously untapped data potential
- Low Latency - the ability to instantly act based on these advanced analytics in your operational, production environment




2. HADOOP HELLO WORLD

2.1 Introduction to Hadoop

Map/Reduce is a programming paradigm that expresses a large distributed computation as a sequence of distributed operations on data sets of key/value pairs. The Hadoop Map/Reduce framework harnesses a cluster of machines and executes user defined Map/Reduce jobs across the nodes in the cluster. A Map/Reduce computation has two phases, a map phase and a reduce phase. The input to the computation is a data set of key/value pairs.

In the map phase, the framework splits the input data set into a large number of fragments and assigns each fragment to a map task. The framework also distributes the many map tasks across the cluster of nodes on which it operates. Each map task consumes key/value pairs from its assigned fragment and produces a set of intermediate key/value pairs. For each input key/value pair (K,V), the map task invokes a user defined map function that transmutes the input into a different key/value pair (K',V').

Following the map phase the framework sorts the intermediate data set by key and produces a set of (K',V'*) tuples so that all the values associated with a particular key appear together. It also partitions the set of tuples into a number of fragments equal to the number of reduce tasks.

In the reduce phase, each reduce task consumes the fragment of (K',V'*) tuples assigned to it. For each such tuple it invokes a user-defined reduce function that transmutes the tuple into an output key/value pair (K,V). Once again, the framework distributes the many reduce tasks across the cluster of nodes and deals with shipping the appropriate fragment of intermediate data to each reduce task.
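To make the notation concrete, consider the word-count job used later in this chapter and a hypothetical input line (the actual lab files may differ): a map call receiving the pair (offset, "Hello World Bye World") would emit ("Hello",1), ("World",1), ("Bye",1) and ("World",1); after the sort and partition step, the reduce call receives ("World",[1,1]) and emits ("World",2).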

Tasks in each phase are executed in a fault-tolerant manner; if nodes fail in the middle of a computation the tasks assigned to them are re-distributed among the remaining nodes. Having many map and reduce tasks enables good load balancing and allows failed tasks to be re-run with small runtime overhead.

Architecture

The Hadoop Map/Reduce framework has a master/slave architecture. It has a single master server or jobtracker and several slave servers or tasktrackers, one per node in the cluster. The jobtracker is the point of interaction between users and the framework. Users submit map/reduce jobs to the jobtracker, which puts them in a queue of pending jobs and executes them on a first-come/first-served basis. The jobtracker manages the assignment of map and reduce tasks to the tasktrackers. The tasktrackers execute tasks upon instruction from the jobtracker and also handle data motion between the map and reduce phases.

Hadoop DFS

Hadoop's Distributed File System is designed to reliably store very large files across machines in a large cluster. It is inspired by the Google File System. Hadoop DFS stores each file as a sequence of blocks; all blocks in a file except the last block are the same size. Blocks belonging to a file are replicated for fault tolerance. The block size and replication factor are configurable per file. Files in HDFS are "write once" and have strictly one writer at any time.

Architecture

Like Hadoop Map/Reduce, HDFS follows a master/slave architecture. An HDFS installation consists of a single Namenode, a master server that manages the filesystem namespace and regulates access to files by clients. In addition, there are a number of Datanodes, one per node in the cluster, which manage storage attached to the nodes that they run on. The Namenode makes filesystem namespace operations like opening, closing, renaming etc. of files and directories available via an RPC interface. It also determines the mapping of blocks to Datanodes. The Datanodes are responsible for serving read and write requests from filesystem clients; they also perform block creation, deletion, and replication upon instruction from the Namenode.

2.2 Overview of Hands-on Exercise

To get an understanding of what is involved in running a Hadoop job, and of all the steps one must undertake, we will set up and run a "hello world" type exercise on our Hadoop cluster.

In this exercise you will:

1) Compile a Java Word Count program written to run on a Hadoop cluster
2) Create some files to run word count on
3) Upload the files into HDFS
4) Run Word Count
5) View the results

NOTE: During this exercise you will be asked to run several scripts. If you would like to see the content of these scripts, type cat scriptName and the contents of the script will be displayed in the terminal.


2.3 Word Count

1. All of the setup and execution for the Word Count exercise can be done from the terminal, hence to start out this first exercise please open the terminal by double-clicking on the Terminal icon on the desktop.

2. To get into the folder where the scripts for the first exercise are, type in the terminal:

cd /home/oracle/exercises/wordCount

Then press Enter.

3. Let's look at the Java code which will run word count on a Hadoop cluster. Type in the terminal:

gedit WordCount.java

Then press Enter.





4.

A new window will open with the Java code for word count. We would like you to look at lines 14 and 28 of the code. You can see there that the Mapper and Reducer interfaces are being implemented.
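For reference, the Mapper and Reducer in the classic org.myorg.WordCount example typically look roughly like the sketch below, using the older org.apache.hadoop.mapred API (the file in the lab VM may differ slightly in detail; it also needs imports from java.io, java.util, org.apache.hadoop.io and org.apache.hadoop.mapred):

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();
    // Emit (word, 1) for every token found in the input line
    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    // Sum all of the 1s emitted for a given word
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}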





5. When you are done evaluating the code you can click on the X in the upper right corner of the window to close it.





6. We can now go ahead and compile the Word Count code. We need to run the compile.sh script, which will set the correct classpath and output directory while compiling WordCount.java. Type in the terminal:

./compile.sh

Then press Enter.

7. We can now create a jar file from the compile directory of Word Count. This jar file is required because the code for word count will be sent to all of the nodes in the cluster and run simultaneously on all nodes that have appropriate data. To create the jar file, type in the terminal:

./createJar.sh

Then press Enter.





8. For the exercise to be more interesting we need to create some files on which word count will be executed. To create the files, go to the terminal and type:

./createFiles.sh

Then press Enter.

9. To see the contents of the files, type in the terminal:

cat file01 file02

Then press Enter.





In the terminal window you will see the contents of the two files, each file having 4 words in it. Although these are quite small files, the code would run identically with more than 2 files and with files that are several gigabytes or terabytes in size.

10. Now that we have the files ready we must move them into the Hadoop Distributed File System (HDFS). Hadoop cannot work with files on other file systems; they must be within HDFS for them to be usable. It is also important to note that files within HDFS are split into multiple chunks and stored on separate nodes for parallel parsing. To upload our two files into HDFS you need to use the copyFromLocal command of Hadoop. Run the command by typing at the terminal:



hadoop dfs -copyFromLocal file01 /user/oracle/wordcount/input/file01

Then press Enter.

For convenience you can also run the script copyFiles.sh and it will upload the files for you, so you do not need to type in this and the next command.






11. We should now upload the second file. Go to the terminal and type:

hadoop dfs -copyFromLocal file02 /user/oracle/wordcount/input/file02

Then press Enter.
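As a side note, copyFromLocal accepts several local source files at once, so both files could also be uploaded in a single command (a sketch, assuming the wordcount/input directory already exists in HDFS, for example created beforehand with hadoop dfs -mkdir /user/oracle/wordcount/input):

hadoop dfs -copyFromLocal file01 file02 /user/oracle/wordcount/input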





12. We can now run our MapReduce job to do a word count on the files we just uploaded. Go to the terminal and type:

hadoop jar WordCount.jar org.myorg.WordCount /user/oracle/wordcount/input /user/oracle/wordcount/output

Then press Enter.

For your convenience you can also run the script runWordCount.sh and it will run the Hadoop job for you, so you do not need to type in the above command.





A lot of text will roll by in the terminal window. This is informational data coming from the Hadoop infrastructure to help track the status of the job. Wait for the job to finish; this is signaled by the command prompt coming back.

13. Once you have your command prompt back, your MapReduce task is complete. It is now time to look at the results. We can display the results file directly from HDFS by using the cat command from Hadoop. Go to the terminal and type the following command:

hadoop dfs -cat /user/oracle/wordcount/output/part-00000

Then press Enter.

For your convenience you can also run the script viewResults.sh and it will run the Hadoop command for you to see the results.





In the terminal the word count results are displayed. You will see that the job counted the number of times each word appears.





14. As an experiment, let's try to run the Hadoop job again. Go to the terminal and type:

hadoop jar WordCount.jar org.myorg.WordCount /user/oracle/wordcount/input /user/oracle/wordcount/output

Then press Enter.

For your convenience you can also run the script runWordCount.sh and it will run the Hadoop job for you, so you do not need to type in the above command.





15. You will notice an error message appears and no MapReduce task is executed. This is easily explained by the immutability of data in HDFS: since Hadoop does not allow an update of data files (just read and write), a job cannot write into an output directory that already exists, hence the execution has nowhere to place its output. For you to re-run the MapReduce job you must either point it to another output directory or clean out the current output directory. Let's go ahead and clean out the previous output directory. Go to the terminal and type:

hadoop dfs -rmr /user/oracle/wordcount/output

Then press Enter.

For convenience you can also run the script deleteOutput.sh and it will delete the files for you, so you do not need to type in this command.






16. Now that we have cleared the output directory, we can re-run the MapReduce task. Let's just go ahead and make sure it works again. Go to the terminal and type:

hadoop jar WordCount.jar org.myorg.WordCount /user/oracle/wordcount/input /user/oracle/wordcount/output

Then press Enter.

For your convenience you can also run the script runWordCount.sh and it will run the Hadoop job for you, so you do not need to type in the above command.





Now the MapReduce job ran fine again, as shown by the output on the screen.






17. This completes the word count example. You can now close the terminal window; go to the terminal window and type:

exit

Then press Enter.





2.4 Summary

In this exercise you were able to see the basic steps required in setting up and running a very simple MapReduce job. You saw which interfaces must be implemented when creating a MapReduce task, you saw how to upload data into HDFS, and you saw how to run the MapReduce task. It is important to talk about execution time for the exercise: the amount of time required to count 8 words is quite high in absolute terms. It is important to understand that Hadoop needs to start a separate Java Virtual Machine to process each file or chunk of a file on each node of the cluster. As such, even a trivial job has some processing time, which limits the possible applications of Hadoop as it can only handle batch jobs. Real-time applications where immediate answers are required cannot be run on a Hadoop cluster. At the same time, as data volumes increase, processing time does not increase that much as long as there are enough processing nodes. A recent benchmark of a Hadoop cluster saw the complete sorting of 1 terabyte of data in just over 3 minutes on 910 nodes.


3. PIG EXERCISE

3.1 Introduction to Pig

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:

Ease of programming: It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.

Optimization opportunities: The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.

Extensibility: Users can create their own functions to do special-purpose processing.

3.2 Overview of Hands-on Exercise

In this exercise we will be analyzing data coming from the New York Stock Exchange; specifically, we would like to evaluate the dividends given by different companies. We have a tab-delimited file with four columns: exchange name, stock name, date, and dividend. For our analysis we want to find the companies which had the highest average dividend.

In this exercise we will:

1. Load our stock exchange data into HDFS
2. Run a Pig script which will find the companies with the highest dividends
3. View the results

NOTE: During this exercise you will be asked to run several scripts. If you would like to see the content of these scripts, type cat scriptName and the contents of the script will be displayed in the terminal.

NOTE 2: This exercise and the dataset for this exercise were inspired by the following website:
http://ofps.oreilly.com/titles/9781449302641/running_pig.html

3.3 Working with Pig

1. All of the setup and execution for this exercise can be done from the terminal, hence open the terminal by double-clicking on the Terminal icon on the desktop.





2. To get into the folder where the scripts for this exercise are, type in the terminal:

cd /home/oracle/exercises/pig

Then press Enter.

3. To get an idea of what our dividends file looks like, let's look at the first couple of rows. In the terminal type:

head NYSE_dividends

Then press Enter.

The first 10 rows of the data file will be displayed on the screen.





4. Now that we have an idea what our data file looks like, let's load it into HDFS for processing. To load the data we use the copyFromLocal function of Hadoop. Go to the terminal and type:

hadoop dfs -copyFromLocal NYSE_dividends /user/oracle/NYSE_dividends

Then press Enter.

For convenience you can also run the script loadData.sh and it will upload the file for you, so you do not need to type in the command above.

5. We will be running our Pig script in interactive mode so we can see each step of the process. For this we will need to open the Pig interpreter, called grunt. Go to the terminal and type:

pig

Then press Enter.






6. Once at the grunt shell we can start typing Pig Latin. The first thing we need to do is load the data file from HDFS into Pig for processing. The data is not actually copied, but a handler is created for the file so Pig knows how to interpret the data. Go to the grunt shell and type:

dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);

Then press Enter.





7. Now that the data is loaded as a four-column table, let's see what the data looks like in Pig. Go to the grunt shell and type:

dump dividends;

Then press Enter.





You will see output similar to the first exercise on the screen. This is normal: Pig is merely a high-level language, and all commands which process data simply run MapReduce tasks in the background, so the dump command simply becomes a MapReduce task that is run. This applies to all of the commands you will run in Pig. The output on the screen will show you all of the rows of the file in tuple form.






8. The first step in analyzing the data will be grouping the data by stock symbol, so we have all of the dividends of one company grouped together. Go to the grunt shell and type:

grouped = group dividends by symbol;

Then press Enter.





9. Let's go ahead and dump this grouped variable to the screen to see what its contents look like. Go to the grunt shell and type:

dump grouped;

Then press Enter.





On the screen you will see all of the groups displayed as tuples of tuples. As the output might look a bit confusing, only one tuple is highlighted in the screenshot below for clarity. The highlighted region shows all of the rows of the table for the CATO stock symbol.







10. In the next step we will go through each group tuple and get the group name and the average dividend. Go to the grunt shell and type:

avg = foreach grouped generate group, AVG(dividends.dividend);

Then press Enter.





11. Let's go ahead and see what this output looks like. Go to the grunt shell and type:

dump avg;

Then press Enter.





Now you can see on the screen a dump of all stock symbols with their respective average dividend. A couple of them are highlighted in the image below.





12. Now that we have the dividends for each company, it would be ideal if we had them in order from highest to lowest dividend. Let's get that list; go to the grunt shell and type:

sorted = order avg by $1 DESC;

Then press Enter.





We can now see what the sorted list looks like. Go to the grunt shell and type:

dump sorted;

Then press Enter.





On the screen you now see the list sorted in descending order. On the screen are the lowest dividends, but you can scroll up to see the rest of the values.
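If you only care about the top entries, Pig's LIMIT operator can trim the relation before dumping it (a small sketch using the aliases defined above):

top10 = limit sorted 10;
dump top10;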





13. We now have the final results we want. It might be worth writing these results out to HDFS. Let's do that. Go to the grunt shell and type:

store sorted into 'average_dividend';

Then press Enter.





14. The newly calculated data is now permanently stored in HDFS. We can now exit the grunt shell. Go to the grunt shell and type:

quit;

Then press Enter.





15. Now, back at the terminal, let's view the top 10 companies by average dividend directly from HDFS. Go to the terminal and type:

hadoop dfs -cat /user/oracle/average_dividend/part-r-00000 | head

Then press Enter.

For convenience you can also run the script viewResults.sh and it will display the file for you, so you do not need to type in the command above.





This command simply ran cat on the results file stored in HDFS. The results are shown on the screen.






16. That concludes the Pig exercise. You can now close the terminal window. Go to the terminal and type:

exit

Then press Enter.





3.4 Summary

In this exercise you saw what a Pig script looks like and how to run it. It is important to understand that Pig is a scripting language which ultimately runs MapReduce jobs on a Hadoop cluster; hence all of the power of a distributed system, and the large data volumes which HDFS can accommodate, are exploitable through Pig. Pig provides an easier interface to the MapReduce infrastructure, allowing scripting paradigms to be used rather than direct Java coding. The individual grunt commands used above can also be collected into a single script, as sketched below.
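A minimal sketch of the whole exercise as one Pig Latin script (saved under a hypothetical name such as average_dividend.pig and run with pig average_dividend.pig; it assumes NYSE_dividends has already been copied into HDFS as in step 4):

-- load the tab-delimited NYSE dividend data from HDFS
dividends = load 'NYSE_dividends' as (exchange, symbol, date, dividend);
-- group all rows for the same stock symbol together
grouped = group dividends by symbol;
-- compute the average dividend per symbol
avg = foreach grouped generate group, AVG(dividends.dividend);
-- sort from highest to lowest average dividend
sorted = order avg by $1 DESC;
-- write the result back to HDFS
store sorted into 'average_dividend';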


4. HIVE CODING

4.1 Introduction to Hive

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.

4.2 Overview of Hands-on Exercise

In this exercise you will use Hive Query Language to create tables, insert data into those tables and run queries on that data. For this exercise we will use the same data file as the Pig exercise above, a tab-delimited file with four columns: exchange name, stock name, date, and dividend. For our analysis we want to find the companies which had the highest average dividend.

In this exercise you will:

1. Upload the data file into HDFS
2. Create a table in Hive
3. Load the data into the Hive table
4. Run queries on the Hive table

4.3 Queries with Hive

1. All of the setup and execution for this exercise can be done from the terminal, hence open a terminal by double-clicking on the Terminal icon on the desktop.

2. To get into the folder where the scripts for the Hive exercise are, type in the terminal:

cd /home/oracle/exercises/hive

Then press Enter.





3. We already have an idea what our data file looks like, so let's load it into HDFS for processing. This is done identically to the way it was done in the first two exercises; we will see a better way to load data in the next exercise. To load the data we use the copyFromLocal function of Hadoop. Go to the terminal and type:

hadoop dfs -copyFromLocal NYSE_dividend /user/oracle/NYSE_dividend

Then press Enter.

For convenience you can also run the script loadData.sh and it will upload the file for you, so you do not need to type in the command above.

4. Let's now enter the Hive interactive shell environment to create tables and run queries against those tables. To give an analogy, this is similar to SQL*Plus, but this environment is specifically for the HiveQL language. To enter the environment go to the terminal and type:

hive

Then press Enter.




5. The first thing we need to do in Hive is create a table. We will create a table named dividends with four fields called exchange, symbol, dates and dividend, which looks very natural given the data set. Go to the terminal and type:

create table dividends(exchange string, symbol string, dates string, dividend float);

Then press Enter.





An OK should be printed on the screen, indicating the success of the operation. This OK message will be printed for all operations, but we will only mention it this time; it is left up to the user to check for this message on future HiveQL commands.
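As a side note, Hive's default field delimiter is the Ctrl-A character rather than a tab, so outside a pre-built lab environment a tab-delimited file such as NYSE_dividend is usually mapped with an explicit row format when the table is created (a hedged sketch; the workshop VM may already be configured to handle this):

create table dividends(exchange string, symbol string, dates string, dividend float)
row format delimited fields terminated by '\t';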




6. We can now run a command to see all of the tables available to this OS user. Go to the Hive shell and type:

show tables;

Then press Enter.





You can see the only table currently available is the one we just created.

7. As with normal SQL, you also have a describe command available to see the columns in the table. Go to the terminal and type:

describe dividends;

Then press Enter.





As you can see, the dividends table has the 4 fields, each with its own Hive-specific data type. This is to be expected, as this is the way we created the table.

8. Let's go ahead and load some data into this table. Data is loaded into Hive from flat files available in the HDFS file system. Go to the terminal and type:

load data inpath '/user/oracle/NYSE_dividend' into table dividends;

Then press Enter.

The data is now loaded into the table.


9. We can now see the data that has been loaded into the table. Go to the terminal and type:

select * from dividends limit 5;

Then press Enter.





Five lines from the table are printed to the screen; only 3 of the lines are highlighted in the image below.





10. Now that we have all of the data loaded into a Hive table, we can run SQL-like queries on the data. As we have the same data set as in the Pig exercise, let's try to extract the same result. We will look for the top 10 companies by average dividend. Go to the terminal and type:

select symbol, avg(dividend) avg_dividend from dividends group by symbol order by avg_dividend desc limit 10;

Then press Enter.





On the screen you will see a lot of log information scroll by. Most of this is generated by Hadoop, as Hive (just like Pig) takes the queries you write, rewrites them as MapReduce jobs and then executes them. The query we wrote can take full advantage of the distributed computational power of Hadoop as well as the striping and parallelism that HDFS enables.

When the query is done you should see on the screen the top 10 companies in descending order. This output shows the exact same information as we got in the previous exercise. As the old idiom goes, there is more than one way to skin a cat; with Hadoop there is always more than one way to achieve any task.




11. This is the end of the Hive exercise. You can now exit the Hive interpreter. Go to the terminal and type:

exit;

Then press Enter.





12. Then close the terminal. Go to the terminal and type:

exit

Then press Enter.





4.4 Summary

In this exercise you were introduced to the Hive Query Language. You saw how to create and view tables using HiveQL, and once tables were created you were introduced to Hive's native DDL and DML interface as well as some of the standard SQL constructs which HiveQL has available. It is important to understand that Hive is an abstraction layer for Hadoop and MapReduce jobs. All queries written in HiveQL get transformed into a DAG (Directed Acyclic Graph) of MapReduce tasks which are then run on the Hadoop cluster, hence taking advantage of all of Hadoop's performance and scalability capabilities, but also retaining all of Hadoop's limitations.

HiveQL has most of the functionality available with standard SQL and has a series of DDL and DML functions implemented, but at the same time it does not strictly adhere to the SQL-92 standard. HiveQL offers extensions not in SQL, including multitable inserts and create table as select, but offers only basic support for indexing. Also, HiveQL lacks support for transactions and materialized views, and has only limited subquery support. It is intended for long-running queries of a data warehousing type rather than a transactional OLTP type of data load.
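As a side note, the interactive session above could also be scripted: the Hive command line accepts a query file with hive -f or an inline query with hive -e (a sketch using a hypothetical file name):

hive -f top_dividends.hql
hive -e "select symbol, avg(dividend) avg_dividend from dividends group by symbol order by avg_dividend desc limit 10;"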


5. ORACLE ODI AND HADOOP

5.1 Introduction to Oracle Connectors

Apache Hadoop is designed to handle and process data from data sources that are typically non-RDBMS and data volumes that are typically beyond what is handled by relational databases.

The Oracle Data Integrator Application Adapter for Hadoop enables data integration developers to integrate and transform data easily within Hadoop using Oracle Data Integrator. Employing familiar and easy-to-use tools and preconfigured knowledge modules, the adapter provides the following capabilities:



- Loading data into Hadoop from the local file system and HDFS.
- Performing validation and transformation of data within Hadoop.
- Loading processed data from Hadoop to Oracle Database for further processing and generating reports.

Typical processing in Hadoop includes data validation and transformations that are programmed as MapReduce jobs. Designing and implementing a MapReduce job requires expert programming knowledge. However, using Oracle Data Integrator and the Oracle Data Integrator Application Adapter for Hadoop, you do not need to write MapReduce jobs. Oracle Data Integrator uses Hive and the Hive Query Language (HiveQL), a SQL-like language for implementing MapReduce jobs. The Oracle Data Integrator graphical user interface enhances the developer's experience and productivity while enabling them to create Hadoop integrations.

When implementing a big data processing scenario, the first step is to load the data into Hadoop. The data source is typically in the local file system, HDFS, Hive tables, or external Hive tables.

After the data is loaded, you can validate and transform the data using HiveQL like you do in SQL. You can perform data validation such as checking for NULLs and primary keys, and transformations such as filtering, aggregations, set operations, and derived tables. You can also include customized procedural snippets (scripts) for processing the data.
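For instance, a NULL check followed by a simple aggregation of the kind described here could look roughly like this in HiveQL (a hedged sketch with hypothetical table names, not the exact code the adapter emits):

insert overwrite table dividends_clean
select * from dividends where symbol is not null and dividend is not null;

insert overwrite table dividends_summary
select symbol, avg(dividend) from dividends_clean group by symbol;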

When the data has been aggregated, condensed, or crunched down, you can load it into Oracle Database for further processing and analysis. Oracle Loader for Hadoop is recommended for optimal loading into Oracle Database.

Knowledge Modules:

IKM File To Hive (Load Data): Loads data from local and HDFS files into Hive tables. It provides options for better performance through Hive partitioning and fewer data movements. (Source: File System; Target: Hive)

IKM Hive Control Append: Integrates data into a Hive target table in truncate/insert (append) mode. Data can be controlled (validated); invalid data is isolated in an error table and can be recycled. (Source: Hive; Target: Hive)

IKM Hive Transform: Integrates data into a Hive target table after the data has been transformed by a customized script such as Perl or Python. (Source: Hive; Target: Hive)

IKM File-Hive to Oracle (OLH): Integrates data from an HDFS file or Hive source into an Oracle Database target using Oracle Loader for Hadoop. (Source: File System or Hive; Target: Oracle Database)

CKM Hive: Validates data against constraints. (Source: NA; Target: Hive)

RKM Hive: Reverse engineers Hive tables. (Source: Hive Metadata; Target: NA)

5.2 Overview of Hands-on Exercise

In this workshop we have so far been loading data into HDFS using a cumbersome command line utility, one file at a time. We also viewed results from within HDFS using a command line utility. Although this is fine for smaller jobs, it would be a good idea to integrate the moving of data with an ETL tool such as Oracle Data Integrator (ODI). In this exercise we will see how Oracle Data Integrator integrates seamlessly with Hadoop and, more specifically, Hive.

In this exercise you will:

1. Use ODI to reverse engineer a Hive table
2. Use ODI to import a text file into a Hive table
3. Use ODI to move data from a Hive table into the Oracle Database
4. Use ODI to move data from a Hive table to another table with check constraints

5.3 Setup and Reverse Engineering in ODI

1. All of the setup for this exercise can be done from the terminal, hence open a terminal by double-clicking on the Terminal icon on the desktop.

2. To get into the folder where the scripts for this exercise are, type in the terminal:

cd /home/oracle/exercises/odi

Then press Enter.





3. Next we need to run a script to set up the environment for this exercise. We will be loading the same data as in the previous exercise (the dividends table), only this time we will be using ODI to perform the task. For this we need to drop that table and recreate it so it is empty for the import. We also need to start the Hive server to enable ODI to communicate with Hive. We have a script which will perform both tasks. Go to the terminal and type:

./setup.sh

Then press Enter.

4. Next we need to start Oracle Data Integrator. Go to the terminal and type:

./startODI

Then press Enter.





5. Once ODI opens we need to connect to the repository. In the upper right corner of the screen click on Connect To Repository…

6. In the dialog that pops up, all of the connection details should already be configured:

Login Name: DEFAULT_LOGIN1
User: SUPERVISOR
Password: Welcome1

If all of the data is entered correctly you can simply click OK.





7. Once you log in, make sure you are on the Designer tab. Near the top of the screen on the left side click on Designer.

8. Near the bottom of the screen on the left side there is a Models tab; click on it.




You will notice that we have already created a File, Hive, and Oracle model for you. These were pre-created to reduce the number of steps in the exercise. For details on how to use flat files and the Oracle database with ODI, please see the excellent Oracle by Example tutorials found at http://www.oracle.com/technetwork/tutorials/index.html.


9. The first feature of ODI we would like to show involves reverse engineering a data store. The reverse engineering function takes a data store and finds all of its tables and their structure automatically. In the Models tab on the left side of the screen there is a Hive model; let's click on the + to expand it.

10. You will notice there is no information about the data that is stored in that particular location. Right click on the Hive folder and select Reverse Engineer.





11. You will see two items appear in the Hive folder, called dividends and dividends2. These are the tables we created in Hive. You can click on the + beside dividends to see some more information.






12. You can also expand the Columns folder to see all of the columns in this particular table. You will see the columns created in step 3 displayed.





This is the power of the Hive Reverse Engineering Knowledge Module (RKM) integrated in Oracle Data Integrator. Once you define a data store (in our case a Hive source) the RKM will automatically discover all tables and their corresponding columns available at that source. Once a data model is created there is no need to rewrite it in ODI; ODI will automatically discover that model for you, so you can get straight to the development of the data movement.

5.4 Using ODI to Import a Text File into Hive

1. Now that we have configured our models, we can begin creating interfaces to perform ETL tasks. Near the top left corner of the screen in the Projects tab there is an icon with 3 squares. Click on it and select New Project.





2. In the new window that opened up on the right side of the screen, enter the following information:

Name: Hadoop
Code: HADOOP

Then click on Save All in the upper right corner of the screen.

3. In the left hand menu in the Projects section a new item appeared called Hadoop. Click on the + to expand it.




4. Next to the folder called First Folder there is another +; expand that folder as well by clicking the +.

5. Right click on the item Interfaces and select New Interface.





6. We can now start to define the new interface. In this interface we will map out the columns in the text file and move the data into the Hive table. To start out, let's give the interface a name. In the new tab that opened on the right side of the screen type in the following information:

Name: File_To_Hive

7. Next we need to move to the Mapping tab of the File_To_Hive interface. Click on Mapping at the bottom of the screen.





8. We now need to define the source and target data stores. On the bottom left of the screen in the Models section, expand the File folder by clicking on the + beside it.

9. Now we can drag and drop the Dividends table from the File model into the source section of the interface.





10. Next we will drag and drop the dividends Hive table into the target section of the interface.








11. A pop-up window will appear asking if you would like to create automatic mapping. This will try to automatically match source columns with target columns based on column name. Click on Yes to see what happens.




12. By name it was able to match all of the columns. The mapping is now complete. Let's go back to the Overview tab to set up one last thing. Click on the Overview tab on the left side of the screen.


In the definitions tab, tick the box Staging Area Different From Target.




13. A drop-down menu below the tick box now becomes active. Select File: Comments File.


14. We can now click on the Flow tab at the bottom of the screen to see what the interface will look like.




15. In the top right box on the screen you will see a diagram of the data flow. Let's see all of the options for the integration. Click on the Target (Hive Server) header.


16. At the bottom of the screen a new window appears: the Property Inspector. There you can inspect and modify the configuration of the integration process. Let's change one of the properties. We don't need a staging table, so let's disable it. Set the following option:

USE_STAGING_TABLE: false




Let's now execute this Interface. Click on the Execute button at the top of the screen.




17. You will be asked to save your interface before running it. Click Yes.




18. Next you will be asked for the Execution options. Here you can choose agents, contexts and other elements. You can just accept the default options and click OK.



19. An informational screen will pop up telling you the session has started. Click OK.




20. We will now check if the execution of the interface was successful. In the left menu click on the Operator tab.




21. In the menu on the left side make sure the Date tab is expanded. Then expand the Today folder.



You will see a green checkmark beside the File_To_Hive execution. This means the integration
process was successful.




You have now successfully moved data from a flat file into a Hive table without touching the terminal. All of the data was moved without a cumbersome command line interface, and you retain access to the full functionality of a powerful ETL tool.
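For context, the interface you just executed does work that you could also express as a manual HiveQL load. A minimal sketch, assuming a comma-delimited dividends file at a hypothetical local path (the statements actually generated by the knowledge module may differ):

-- Load the delimited file into the existing Hive table, replacing its current contents
LOAD DATA LOCAL INPATH '/home/oracle/dividends.csv' OVERWRITE INTO TABLE dividends;

The difference is that ODI generates and runs the appropriate statements for you, and the Operator tab gives you monitoring of every run.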

22. You can now move back to the Designer tab in the left menu and close all of the open tabs on the right side menu.




This was to prepare for the next exercise.

5.5 Using ODI to import Hive Table into Oracle

1. Another useful process ODI can perform is moving data from a Hive table into the Oracle database. Once data processing has occurred in the Hive table you might want to move the data back into your Oracle database for integration into your data warehouse. Let's move the data we just loaded into our Hive table into an Oracle database.

The first step is to create a new interface. In the Projects tab in the left hand menu right click on Interfaces and select New Interface.



2. On the right side of the screen a new window pops up. Enter the following name for the Interface:

Name: Hive_To_Oracle




3. Next we will need to move to the Mapping tab to set up the interface. At the bottom of the screen click on Mapping.



4. To free up more viewing space let's clean up the Models tab in the bottom left part of the screen. Minimize the File tab and the dividends table.




5. In the same tab (the Models tab) we now see the Oracle folder. Let's expand that out as we will need the Oracle tables as our target. Click on the + beside the Oracle folder.


6. We can now drag and drop the Hive dividends table into the sources window.


7. Similarly you can drag the Oracle DIVIDENDS table into the destination window.


8. As before you will be asked if you would like to perform automatic mapping. Click on Yes.




9. Unfortunately, due to a capitalization difference no mapping could be done automatically. We will need to map the columns manually. Drag and drop each source column (from the source table window in the upper left part of the right tab) to its corresponding target column (in the upper right part of the right tab) given the following mapping (a sketch of the Oracle target table follows the list):

exchange -> STOCK
dates    -> DATES
dividend -> DIVIDEND
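For reference, the three target columns belong to the Oracle DIVIDENDS table shown in the Oracle model. The table already exists in the lab schema; a minimal sketch of a compatible definition, with assumed data types, would be:

CREATE TABLE DIVIDENDS (
  STOCK    VARCHAR2(30),   -- will hold exchange and symbol concatenated (see the next step)
  DATES    VARCHAR2(30),   -- date value mapped from the Hive dates column
  DIVIDEND NUMBER          -- dividend amount mapped from the Hive dividend column
);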


10. One of the advantages of an ETL tool can be seen when doing transformations during the data movement. Let's concatenate the exchange and symbol into one string and load that into the STOCK column in the database. Go to the Property Inspector of the STOCK column by clicking on it in the targets window.




11. The Property Inspector window should open at the bottom of the screen. In the implementation edit box type the following:

concat(DIVIDENDS.exchange, DIVIDENDS.symbol)
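Because the staging area for this interface will be set to Hive in the following steps, this expression is ordinary HiveQL. As an illustration only (column and table names are taken from this lab), the complete set of mappings behaves like the following query run against the Hive source:

SELECT concat(exchange, symbol) AS STOCK,     -- joins the two strings, e.g. 'AB' and 'CD' become 'ABCD'
       dates                    AS DATES,
       dividend                 AS DIVIDEND
FROM   dividends;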


12. The transformation is now set up. Let's now go back to the Overview tab to configure the staging area. Click on the Overview tab.




13. In the definitions tab, tick the box Staging Area Different From Target.




14. A drop-down menu below the tick box now becomes active. Select Hive: Hive Store.




15. We are now ready to run the interface. To run the interface go to the upper left corner of the screen and click the Execute button.


16. A window will pop up telling you a save is required before the execution can continue. Just click Yes.






17. Another window will pop up asking you to configure the Execution Context and Agent. The default options are fine; just click OK.




18. A final window will pop up telling you the session has started. Click OK.


19. Let's now go to the Operator tab to check if the execution was successful. In the top left corner of the screen click on the Operator tab.




20. When you get to the Operator tab you might see a lightning bolt beside the Hive_To_Oracle execution. This means the integration is still executing; wait a little bit until the checkmark appears.



The movement of data from Hive to Oracle has completed successfully.
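If you want to double-check the result outside ODI, a quick query against the target table is enough. A minimal sketch, assuming you connect as the lab schema that owns the DIVIDENDS table used above:

-- Inspect a few of the rows the interface loaded
SELECT STOCK, DATES, DIVIDEND
FROM   DIVIDENDS
WHERE  ROWNUM <= 10;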


21. One of the great features of ODI is that it allows you to look at the code that was executed as part of the ETL process. Let's drill down and see some of the code executed. In the Operator tab click on the + next to the latest execution.



22. Continue to drill down by clicking on the + next to the 1 - Hive_to_Oracle <date>.





23. You can now see all of the steps taken to perform this particular mapping. Let's investigate further the fourth step in the process. Double click on 4 - Integration