
WELCOME TO

February 28, 2012

Systems Engineering & Administration Technology User Group

Upcoming Meetings

March 21 - Implementing and Managing a BYOD Policy In Your Organization

April 26 - Recap of Microsoft Management Summit

May - Review of Two Recent Data Center Moves

June - What do you want to see? Please send feedback to CARETAKER@SEA-TUG.COM

Tools and Industry News

Local (Seabrook) secondary market parts / server supplier: AdaptableComputer.com

Good job hunting resources: Indeed.com, Glassdoor.com

Linux Newbie Guide: http://www.faqs.org/docs/lnag/

EXT2 driver for Windows: http://www.fs-driver.org/

Sandbox individual applications for testing: http://www.sandboxie.com/

OS X Hardening Guides: http://isc.sans.edu/diary.html?storyid=12616

Did you know you can open a free technical support incident with Microsoft if you're having an issue related to security or service pack installations? 866-PC-SAFETY

Upcoming Webcasts

Microsoft
http://www.microsoft.com/events/webcasts/calendar/MonthView.aspx?stdate=2/1/2012&audience=IT%20Professional&series=0&product=0&presenter=0&tz=0
- or -
http://bit.ly/AkqIbK

February 28 - Cloud Computing Soup to Nuts (Part 4): Introduction to SQL Azure (Level 100)

February 29 - Private Cloud Chat (Episode 4) (Level 200)

March 7 - Database Development Management (Level 300)

VMware
http://webcasts.vmware.com/event/19/26/21/rt/index.html

February 28 - Maximum Performance and Availability for Virtualized Oracle Databases with VMware and EMC

O'Reilly
http://oreilly.com/webcasts/index.html

March 1 - Building a Bomb-Proof Backup Strategy

March 14 - Working with Office 365 for Small Business

Tonight's Presentation

Big Data: What is it and What is it Good For?

Geoff Noel
Technical Architect, SAS

Big Data
Datasets so large that they're difficult to work with using conventional toolsets (terabytes, exabytes, and zettabytes of data)

Map Reduce
A framework originally developed by Google to deal with the large amount of search data it had/kept

Hadoop
The open source project which was born from Map Reduce

No SQL
Non-relational DBMS; accessed through non-SQL languages (for example, Pig Latin in the Hadoop ecosystem)

HBase
Open source DB product which implements the No SQL methodology

Cassandra
Another No SQL product, used and developed by Facebook

HDFS
Hadoop Distributed File System - designed to handle large file sizes; it doesn't handle lots of small files very well

[Volume]
Total number of bytes associated with the data. Unstructured data are estimated to account for 70-85% of the data in existence, and the overall volume of data is rising.

[Velocity]
The pace at which the data are to be consumed. As volumes rise, the value of individual data points tends to diminish more rapidly over time.

[Variety]
The complexity of the data in this class. This complexity defies traditional means of analysis.

[Variability]
The differing ways in which the data may be interpreted. Differing questions require differing interpretations.

Where does this data come from?

Weblogs, cameras, RFID scanners, Internet search histories, retail transactions, social
media activity, genomic/biomed research, etc.

Vertical vs. Horizontal Scaling

Vertical: Add more memory, disk, CPU resources to server

Horizontal: Add more servers, each with relatively small memory, disk, CPU resources

Concepts and Terminology

Putting it to use

or: Umm, Dad… How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did

Source: http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/

[Pole] ran test after test, analyzing the data, and before long some useful patterns emerged. Lotions, for example. Lots of people buy lotion, but one of Pole's colleagues noticed that women on the baby registry were buying larger quantities of unscented lotion around the beginning of their second trimester. Another analyst noted that sometime in the first 20 weeks, pregnant women loaded up on supplements like calcium, magnesium and zinc. Many shoppers purchase soap and cotton balls, but when someone suddenly starts buying lots of scent-free soap and extra-big bags of cotton balls, in addition to hand sanitizers and washcloths, it signals they could be getting close to their delivery date.

Target assigns every customer a Guest ID number, tied to their credit card, name, or email address, that becomes a bucket that stores a history of everything they've bought and any demographic information Target has collected from them or bought from other sources. Using that, Pole looked at historical buying data for all the ladies who had signed up for Target baby registries in the past.

As Pole's computers crawled through the data, he was able to identify about 25 products that, when analyzed together, allowed him to assign each shopper a "pregnancy prediction" score. More important, he could also estimate her due date to within a small window, so Target could send coupons timed to very specific stages of her pregnancy.

Andrew Pole

via LinkedIn

So Target started sending coupons for baby items to customers according to their pregnancy scores. What Target discovered fairly quickly is that it creeped people out that the company knew about their pregnancies in advance.

Target got sneakier about sending the coupons. The company can create personalized booklets; instead of sending people with high pregnancy scores books o' coupons solely for diapers, rattles, strollers, and the "Go the F*** to Sleep" book, they more subtly spread them about:

"Then we started mixing in all these ads for things we knew pregnant women would never buy, so the baby ads looked random. We'd put an ad for a lawn mower next to diapers. We'd put a coupon for wineglasses next to infant clothes. That way, it looked like all the products were chosen by chance."

"And we found out that as long as a pregnant woman thinks she hasn't been spied on, she'll use the coupons. She just assumes that everyone else on her block got the same mailer for diapers and cribs. As long as we don't spook her, it works."

Another article, about Diapers.com, includes this bold statement:

"We can predict how much money we'll earn from you over a lifetime across the existing site and all six sites."

Source: http://www.forbes.com/sites/meghancasserly/2012/02/16/pampers-or-huggies-how-diapers-com-profiles-customers-from-first-click/

Who Am I and Why Am I Here?

SAS Principal Solutions Architect (Retail and Manufacturing Vertical)

Geoff.Noel@SAS.com or Gnoel24@hotmail.com



What is This?

One of the first commercial disk drives from IBM. It has a 5 MB capacity and it's stored in a cabinet roughly the size of a luxury refrigerator.

In contrast, a 32 GB microSD card measures around 5/8 x 3/8 inch and weighs about 0.5 gram.

Photo: Mike Loukides. Disk drive on display at IBM Almaden Research.

O'Reilly Radar Team (2011-09-06). Big Data Now: Current Perspectives from O'Reilly Radar (Kindle Locations 126-128). O'Reilly Media, Inc. Kindle Edition.


Big Data

Mission

Track and control a 14.5-ton spaceship, orbiting at 2,175 miles per hour around the moon, land it safely within yards of a specified location, and guide it back from the surface to rendezvous with a command ship in lunar orbit.

The system has to work the first time, and minimize fuel consumption, because the spacecraft only contains enough fuel for one landing attempt.

The Computer

5,000 primitive integrated circuits, weighs 66 pounds and costs over $150,000.

Software Storage

No disk drive

74 kilobytes of ROM memory that has been literally hard-wired

4 Kb of something that is sort of like RAM

http://www.abc.net.au/science/moon/computer.htm


"Big Data" Myths

Data Volumes are "Exploding"

Did Wal-Mart suddenly sell more stuff?
Did eBay suddenly hold more auctions?
Did NYSE suddenly do more stock trades?
Did Chrysler suddenly sell more cars?
Did Netflix suddenly rent more movies?
Did Amazon suddenly sell more books?

No - often this is existing data that previously went unanalyzed:

Too large to manage
Too costly to store
Lack of the "analytic chops" to capitalize
SAS & 'Big Data' - The Ultimate Solutions For Retail

[Chart: data size grows from today into the future along the dimensions of volume, variety, velocity, and relevance]

High Performance Buzz Words Defined

BIG DATA
When the volume, velocity and variety of data exceeds an organization's storage or compute capacity for accurate and timely decision-making.

ANALYTICS
The process surrounding the development, interpretation, and useful application of statistics to solve a problem.

BIG ANALYTICS
The combination of using ANALYTICS on BIG DATA and/or the capability to run advanced or complex analytics on any size data.

Corporate Skills

Big data: sorting out the skill sets

Companies surveyed by the Economist Intelligence Unit fall into four loosely defined categories of big data management. Each group has specific characteristics, which were assessed by cross-referencing the responses of each group against those of the rest of the survey respondents:

Strategic data managers
Companies that have well-defined data management strategies that focus resources on collecting and analyzing the most valuable data

Aspiring data managers
Companies that understand the value of data and are marshalling resources to take better advantage of them

Data collectors
Companies that collect a large amount of data but do not consistently maximize their value

Data wasters
Companies that collect data but severely underuse them

When it comes to big data, you either use it or lose it.


"Big analytics is the brains to cloud computing's brawn" [Economist].

Here, however, the speed is less of a challenge; given an addressable data set in memory, most statistical algorithms can yield results in seconds. The challenge is scaling these out to address large datasets, and rewriting algorithms to operate in an online, distributed manner across many machines. Because data is heavy, and algorithms are light, one key strategy is to push code deeper to where the data lives, to minimize network IO.

This often requires a tight coupling between the data storage layer and the analytics, and algorithms often need to be re-written as user-defined functions (UDFs) in a language compatible with the data layer. Greenplum, leveraging its Postgres roots, supports UDFs written in both Java and R. Following Google's BigTable, HBase is introducing coprocessors, which allow Java code to be associated with data tablets and minimize data transfer over the network.

Netezza pushes even further into hardware, embedding an array of functions into FPGAs that are physically co-located with the disks of its storage appliances. The field of what's alternatively called business or predictive analytics is nascent, and while a range of enabling tools and platforms exist (such as R, SPSS, and SAS), most of the algorithms developed are proprietary and vertical-specific. As the ecosystem matures, one may expect to see the rise of firms selling analytical services (such as recommendation engines) that interoperate across data platforms. But in the near-term, consultancies like Accenture and McKinsey are positioning themselves to provide big analytics via billable hours. Outside of consulting, firms with analytical strengths push upward, surfacing focused products or services to achieve success.



SAS High Performance Computing: Key Business Challenges

Underutilized resources
Support incremental growth
Unnecessary data movement
Guarantee uptime & continuity
Increasing costs

Growth in data and user volumes; complexity
Slow time to results
Slow response time
Limited analysis due to lack of resources
Low productivity

SAS High Performance Computing

SAS Grid Computing
SAS In-Database
SAS In-Memory Analytics

Big Data has Been Here (NCR)

1996: A Teradata database becomes the world's largest database at 11 terabytes.

1997: Teradata customer creates world's largest production database at 24 terabytes.

1999: Teradata customer has world's largest database with 130 terabytes.

(In 2000 Oracle claimed the largest warehouse at ~140 TB, hosted on two massive Oracle instances; today 140 TB is almost passé.)

Teradata began to associate itself with the term "Big Data" in 2010. CTO Stephen Brobst attributes the rise of big data to "new media sources, such as Social Media."

The increase in semi-structured and unstructured data gathered from online interactions prompted Teradata to form the "Petabyte club" in 2011 for its heaviest big data users. The rise of big data resulted in many traditional data warehousing companies updating their products and technology.

For Teradata, big data prompted the acquisition of Aster Data in 2011 for the company's MapReduce capabilities and ability to store and analyze semi-structured data.

Public interest in big data resulted in a 13% increase in Teradata's global sales.


Others like them

Sequent / NUMA-Q -> acquired by IBM

Tandem / NonStop -> Compaq -> HP -> HP Neoview (NSK Guardian) and HP Neoview SQL (NonStop)

Greenplum (modified PostgreSQL), owned by EMC

MySQL (owned by Oracle)

Vertica -> acquired by HP

Big Data has Been Here (SAS)

In 1996 SAS introduced the Scalable Performance Data Server.

Some existing SAS Scalable Performance Data Server deployments contain more than 8 terabytes of data with single tables exceeding 500 gigabytes. SAS Scalable Performance Data Server has demonstrated scalability to databases and tables containing billions of rows and was designed with a petabyte-sized address space to support massive data warehouses.

Scalable I/O

SAS Scalable Performance Data Server speeds the processing of large amounts of data by partitioning the data across multiple disks and I/O channels. This enables parallelization of many SAS I/O functions over multiple data partitions. It is designed to use all of the resources available on a machine, and maximum benefits are gained on machines with multiple CPUs, I/O channels and disks where there are large amounts of data to be manipulated.



[Diagram: SAS RP 7.2 High-Level Architecture, prepared by SAS Institute Inc., last updated GMN 20120215. Client Tier: SAS Web Client (required components: Adobe Flash Player, web browser - IE 8 / Firefox); System Management Client (varies depending on solution: SAS Data Integration Studio, SAS Management Console, SAS Foundation Services, SAS Workflow Studio); SAS Rich Client application (varies depending on solution, not for all solutions: SAS Foundation Services, SAS Merchandise Intelligence Plug-ins for SAS Management Console, SAS Rich Client Platform Plug-ins, SAS Rich Client Platform). SAS Solution Server Tier (logical servers): base data store, fact data. SAS Framework Server Tier (logical servers). Data/Grid Tier: SPDS (grid), SAS (grid), MySQL (OEM), RTOLAP In-Memory Data Server (Apache SOLR), Lustre FS.]

Moore's Law and Big Data

Since the early '80s, processor speed has increased from 10 MHz to 3.6 GHz - an increase of 360x (not counting increases in word length and number of cores). But we've seen much bigger increases in storage capacity, on every level. RAM has moved from $1,000/MB to roughly $25/GB - a price reduction of about 40,000x, to say nothing of the reduction in size and increase in speed. Hitachi made the first gigabyte disk drives in 1982, weighing in at roughly 250 pounds; now terabyte drives are consumer equipment, and a 32 GB microSD card weighs about half a gram. Whether you look at bits per gram, bits per dollar, or raw capacity, storage has more than kept pace with the increase of CPU speed.




What is all this stuff?!?

"Big Data" - Why Now? Storage Cost

Cost of storage dropping:

In 2000 a GB averaged $16.06; a 1 TB data warehouse was rare.

Today a GB averages $0.0621; a terabyte can be had for < $100.

1 TB of MPP appliance: $100-200k. 1 TB of SAN/NAS: avg $2-5k.

Storage costs continue to plunge as storage needs grow.

[Chart: average cost of 1 TB of storage for Hadoop, SAN/NAS, and MPP RDBMS; scale $0 to $80,000]

What is Hadoop?

A scalable, fault-tolerant grid operating system for data storage and processing

Its scalability comes from the marriage of:

HDFS: self-healing, high-bandwidth clustered storage

MapReduce: fault-tolerant distributed processing

Operates on unstructured and structured data

A large and active ecosystem (many developers and additions like HBase, Hive, Pig, …)

Open source under the Apache License

http://wiki.apache.org/hadoop/



Commodity Hardware Cluster

Hadoop Architecture

Typical 2-level architecture: nodes & switches

Nodes are commodity PCs

40 nodes/rack

Uplink from rack is 8 gigabit

Rack-internal is 1 gigabit

Hadoop & Map/Reduce

categorize data | map | sort | reduce
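To make that pipeline concrete, here is a minimal single-machine sketch in plain Java (not Hadoop itself) that walks some sample lines through the same map, sort/group, and reduce stages to produce word counts; the input lines are made up for illustration.

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MapSortReduceSketch {
    public static void main(String[] args) {
        // Hypothetical input "records" standing in for lines of a log or document
        List<String> lines = List.of("big data is big", "data about data");

        // MAP: emit one token per word in each line
        List<String> mapped = new ArrayList<>();
        for (String line : lines) {
            for (String word : line.split("\\s+")) {
                mapped.add(word);
            }
        }

        // SORT/GROUP + REDUCE: group identical keys and sum their counts
        Map<String, Integer> counts = new TreeMap<>(); // TreeMap keeps keys sorted
        for (String word : mapped) {
            counts.merge(word, 1, Integer::sum);
        }

        counts.forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}

Hadoop performs the same map, shuffle/sort, and reduce steps, but spreads them across the nodes of the cluster described above.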

SMAQ stack for big data

by Edd Dumbill

O'Reilly Radar Team (2011-09-06). Big Data Now: Current Perspectives from O'Reilly Radar (Kindle Location 356). O'Reilly Media, Inc. Kindle Edition.

"Big data" is data that becomes large enough that it cannot be processed using conventional methods.

Creators of web search engines were among the first to confront this problem. Today, social networks, mobile phones, sensors and science contribute to petabytes of data created daily. To meet the challenge of processing such large data sets, Google created MapReduce. Google's work and Yahoo's creation of the Hadoop MapReduce implementation has spawned an ecosystem of big data processing tools.

As MapReduce has grown in popularity, a stack for big data systems has emerged, comprising layers of Storage, MapReduce and Query (SMAQ). SMAQ systems are typically open source, distributed, and run on commodity hardware. In the same way the commodity LAMP stack of Linux, Apache, MySQL and PHP changed the landscape of web applications, SMAQ systems are bringing commodity big data processing to a broad audience. SMAQ systems underpin a new era of innovative data-driven products and services, in the same way that LAMP was a critical enabler for Web 2.0. Though dominated by Hadoop-based architectures, SMAQ encompasses a variety of systems, including leading NoSQL databases.



Analytics

Map Reduce

Created at Google (to solve the problem of creating web search indexes)

The MapReduce framework is the powerhouse behind most of today's big data processing. The key innovation of MapReduce is the ability to take a query over a data set, divide it, and run it in parallel over many nodes. This distribution solves the issue of data too large to fit onto a single machine.

In order for MapReduce to do its job, the map and reduce phases must obey certain constraints that allow the work to be parallelized. Translating queries into one or more MapReduce steps is not an intuitive process. An important way in which MapReduce-based systems differ from conventional databases is that they process data in a batch-oriented fashion. Work must be queued for execution, and may take minutes or hours to process.


Using MapReduce to solve problems entails three distinct operations - E, T, and L (simplified ;->):

Loading the data - the data set must first be placed into the storage layer where the processing nodes can reach it.

MapReduce - this phase will retrieve data from storage, process it, and return the results to the storage.

Extracting the result - once processing is complete, for the result to be useful to humans, it must be retrieved from the storage and presented.

Many SMAQ systems have features designed to simplify the operation of each of these stages.

Hadoop MapReduce

Hadoop is the dominant open source MapReduce implementation.

Funded by Yahoo (aka Hortonworks)

It emerged in 2006

Creator: Doug Cutting

The Hadoop project is now hosted by Apache.

It has grown into a large endeavor, with multiple subprojects that together comprise a full SMAQ stack. Since it is implemented in Java, Hadoop's MapReduce implementation is accessible from the Java programming language. Creating MapReduce jobs involves writing functions to encapsulate the map and reduce stages of the computation. The data to be processed must be loaded into the Hadoop Distributed Filesystem.


The process of running a MapReduce job with Hadoop involves the following steps (see the sketch below):

Defining the MapReduce stages in a Java program

Loading the data into the filesystem

Submitting the job for execution

Retrieving the results from the filesystem

Run via the standalone Java API

Hadoop MapReduce jobs can be complex to create, and necessitate programmer involvement. A broad ecosystem has grown up around Hadoop to make the task of loading and processing data more straightforward.
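As a hedged illustration of step one, here is the classic word-count job written against the org.apache.hadoop.mapreduce API (Hadoop 2.x-style signatures); the input and output paths are placeholders, not anything from the deck.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map stage: emit (word, 1) for every token in the input line
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce stage: sum the counts emitted for each word
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Hypothetical input and output locations in HDFS
        FileInputFormat.addInputPath(job, new Path("/user/demo/input"));
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/output"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaging this into a jar and submitting it with the hadoop jar command covers the "define" and "submit" steps; loading and retrieving data are handled through HDFS, as described in the Storage section that follows.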


Storage

MapReduce requires storage from which to fetch data and in which to store the results of the computation. The data expected by MapReduce is not relational data, as used by conventional databases. Instead, data is consumed in chunks, which are then divided among nodes and fed to the map phase as key-value pairs. This data does not require a schema, and may be unstructured. However, the data must be available in a distributed fashion, to serve each processing node. The design and features of the storage layer are important not just because of the interface with MapReduce, but also because they affect the ease with which data can be loaded and the results of computation extracted and searched.













Hadoop Distributed File System

The standard storage mechanism used by Hadoop is the Hadoop Distributed File System, HDFS.

A core part of Hadoop, HDFS has the following features, as detailed in the HDFS design document:

Fault tolerance - assuming that failure will happen allows HDFS to run on commodity hardware.

Streaming data access - HDFS is written with batch processing in mind, and emphasizes high throughput, NOT random access to data.

Extreme scalability - HDFS will scale to petabytes; such an installation is in production use at Facebook.

Portability - HDFS is portable across operating systems.

Write once - by assuming a file will remain unchanged after it is written, HDFS simplifies replication and speeds up data throughput.

Locality of computation - due to data volume, it is often much faster to move the program near to the data, and HDFS has features to facilitate this.

HDFS provides an interface similar to that of regular filesystems. Unlike a database, HDFS can only store and retrieve data, not index it. Simple random access to data is not possible. However, higher-level layers have been created to provide finer-grained functionality to Hadoop deployments, such as HBase.


HBase, the Hadoop Database

One approach to making HDFS more usable is HBase. Modeled after Google's BigTable database, HBase is a column-oriented database designed to store massive amounts of data. It belongs to the NoSQL universe of databases, and is similar to Cassandra and Hypertable.


HBase uses HDFS as a storage system, and thus is capable of storing a large volume of data through fault-tolerant, distributed nodes. Like similar column-store databases, HBase provides REST and Thrift based API access. Because it creates indexes, HBase offers fast, random access to its contents, though with simple queries. For complex operations, HBase acts as both a source and a sink (destination for computed data) for Hadoop MapReduce. HBase thus allows systems to interface with Hadoop as a database, rather than the lower level of HDFS.
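For a flavor of that "database rather than filesystem" interface, here is a hedged sketch using the HBase Java client of that era (HTable, Put, Get); the "guests" table and "purchases" column family are invented for the example.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical table; must already exist with a "purchases" column family
        HTable table = new HTable(conf, "guests");

        // Write: random, real-time write access keyed by row
        Put put = new Put(Bytes.toBytes("guest-42"));
        put.add(Bytes.toBytes("purchases"), Bytes.toBytes("lotion"), Bytes.toBytes("unscented"));
        table.put(put);

        // Read: fast random read of a single row
        Get get = new Get(Bytes.toBytes("guest-42"));
        Result result = table.get(get);
        byte[] value = result.getValue(Bytes.toBytes("purchases"), Bytes.toBytes("lotion"));
        System.out.println("lotion = " + Bytes.toString(value));

        table.close();
    }
}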


Query

Specifying MapReduce jobs in terms of defining distinct map and reduce functions in a programming language is unintuitive and inconvenient, as is evident from the Java code listings shown above. To mitigate this, SMAQ systems incorporate a higher-level query layer to simplify both the specification of the MapReduce operations and the retrieval of the result.


Many organizations using Hadoop will have already written in-house layers on top of the MapReduce API to make its operation more convenient. Several of these have emerged either as open source projects or commercial products. Query layers typically offer features that handle not only the specification of the computation, but the loading and saving of data and the orchestration of the processing on the MapReduce cluster. Search technology is often used to implement the final step in presenting the computed result back to the user.

About Hive

Why Hive? "People are comfortable writing SQL."

Developed at Facebook

Hive is a higher-level interface for Hadoop (queries run as MapReduce batch processes)

Interactive shell "Hive CLI"

Declarative, SQL-like language ~ Hive QL (query language)

Hive engine compiles Hive QL into MapReduce

Operations:

DDL
CREATE TABLE invites (foo INT, bar STRING) PARTITIONED BY (ds STRING);

DML (load operations)
LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;

SQL

SELECT
SELECT a.foo FROM invites a WHERE a.ds='2008-08-15';

INSERT SELECT
INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a;

CREATE TABLE SELECT
CREATE TABLE catalogue_number AS SELECT rowSequence(), catalogue_number FROM verbatim_record;
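Beyond the CLI, Hive QL can also be driven from Java over JDBC. The sketch below assumes the HiveServer1-era driver (org.apache.hadoop.hive.jdbc.HiveDriver) on the default port 10000; the driver class and URL scheme changed with later HiveServer2 releases, so treat the connection details as assumptions.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcSketch {
    public static void main(String[] args) throws Exception {
        // HiveServer1-era driver class; HiveServer2 later moved to org.apache.hive.jdbc.HiveDriver
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = con.createStatement();

        // The same query shown above; Hive compiles it into MapReduce jobs behind the scenes
        ResultSet rs = stmt.executeQuery(
                "SELECT a.foo FROM invites a WHERE a.ds='2008-08-15'");
        while (rs.next()) {
            System.out.println(rs.getInt(1));
        }
        con.close();
    }
}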



Cassandra and Hypertable

Cassandra and Hypertable are both scalable column-store databases that follow the pattern of BigTable, similar to HBase. An Apache project, Cassandra originated at Facebook and is now in production in many large-scale websites, including Twitter, Facebook, Reddit and Digg. Hypertable was created at Zvents and spun out as an open source project. Both databases offer interfaces to the Hadoop API that allow them to act as both a source and a sink for MapReduce. At a higher level, Cassandra offers integration with the Pig query language and Hypertable works with Hive.

About Pig

Why Pig? "Writing map-reduce routines is like coding in assembly."

Pig is a higher-level interface for Hadoop

Interactive shell "Grunt"

Declarative, kind-of-SQL-like language

Pig engine compiles Pig Latin into MapReduce

Extensible via Java files

http://pig.apache.org/


myinput = LOAD 'input/access.log' USING extLoader();
words   = FOREACH myinput GENERATE FLATTEN(TOKENIZE($0));
grouped = GROUP words BY $0;
counts  = FOREACH grouped GENERATE group, COUNT(words);
ordered = ORDER counts BY $0;
STORE ordered INTO 'output/pigOut' USING PigStorage();

Apache SOLR

Search with Solr

An important component of large-scale data deployments is retrieving and summarizing data.

The addition of database layers such as HBase provides easier access to data, but does not provide sophisticated search capabilities.

To solve the search problem, the open source search and indexing platform Solr is often used alongside NoSQL database systems. Solr uses Lucene search technology to provide a self-contained search server product.

For example, consider a social network database where MapReduce is used to compute the influencing power of each person, according to some suitable metric. This ranking would then be reinjected into the database. Using Solr indexing allows operations on the social network, such as finding the most influential people whose interest profiles mention mobile phones, for instance. Originally developed at CNET and now an Apache project, Solr has evolved from being just a text search engine to supporting faceted navigation and results clustering. Additionally, Solr can manage large data volumes over distributed servers. This makes it an ideal solution for result retrieval over big data sets, and a useful component for constructing business intelligence dashboards.
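As a hedged illustration of that pattern, the snippet below indexes and queries a document with the SolrJ client. Class names have shifted across Solr releases, so this assumes the 4.x-era HttpSolrServer, and the "people" core with its "name", "interests" and "influence" fields is invented for the example.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SolrSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical Solr core holding per-person profiles plus a computed influence score
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/people");

        // Index (re-inject) a document produced by an upstream MapReduce job
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "guest-42");
        doc.addField("name", "Jane Doe");
        doc.addField("interests", "mobile phones");
        doc.addField("influence", 0.87);
        solr.add(doc);
        solr.commit();

        // Query: people whose profiles mention mobile phones, most influential first
        SolrQuery query = new SolrQuery("interests:\"mobile phones\"");
        query.set("sort", "influence desc");
        query.setRows(10);
        QueryResponse response = solr.query(query);
        for (SolrDocument hit : response.getResults()) {
            System.out.println(hit.getFieldValue("name"));
        }
    }
}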



Cascading (API)

Cascading, the API Approach

The Cascading project provides a wrapper around Hadoop's MapReduce API to make it more convenient to use from Java applications. It is an intentionally thin layer that makes the integration of MapReduce into a larger system more convenient. Cascading's features include:

A data processing API that aids the simple definition of MapReduce jobs.

An API that controls the execution of MapReduce jobs on a Hadoop cluster.

Access via JVM-based scripting languages such as Jython, Groovy, or JRuby.

Integration with data sources other than HDFS, including Amazon S3 and web servers.

Validation mechanisms to enable the testing of MapReduce processes.

Cascading's key feature is that it lets developers assemble MapReduce operations as a flow, joining together a selection of "pipes". It is well suited for integrating Hadoop into a larger system within an organization. While Cascading itself doesn't provide a higher-level query language, a derivative open source project called Cascalog does just that. Using the Clojure JVM language, Cascalog implements a query language similar to that of Datalog. Though powerful and expressive, Cascalog is likely to remain a niche query language, as it offers neither the ready familiarity of Hive's SQL-like approach nor Pig's procedural expression.

The listing below shows the word-count example in Cascalog: it is significantly terser, if less transparent.

(defmapcatop split [sentence]
  (seq (.split sentence "\\s+")))

(?<- (stdout) [?word ?count]
  (sentence ?s)
  (split ?s :> ?word)
  (c/count ?count))


The language of Hadoop

HDFS
Hadoop Distributed File System

Hive
SQL-like query language "Hive QL" for HDFS; think of EL-T style processing: "create table as select ..."

Pig
AKA Pig Latin, a high-level programming language which produces a sequence of MR programs; think of ET-L or DataStep style processing

HBASE
The database of Hadoop. Use it when you need random, realtime read/write access to your Big Data

MR
Map Reduce, a parallel processing software framework

MAP STEP
The master node takes the input and partitions it into smaller sub-problems, which are distributed to worker nodes

REDUCE
The master node then takes the answers to all the sub-problems and combines them in some way to get the output

HCatalog
Storage management service; think of metadata and table cataloging

SQOOP
Sqoop is a tool designed to import data from relational databases into Hadoop. Sqoop uses JDBC to connect to a database.

Mahout
Data mining library for Hadoop, particularly used for recommender engines

Distributions (often referred to as Distros)

Cloudera
Currently the leading distribution provider of Hadoop, offering support and services

Hortonworks
New Yahoo spinoff offering Hadoop distributions; expected to be the de facto standard for Hadoop in the near future because they provide 90% of the Hadoop R&D. Think Linux/RedHat of Hadoop

MapR
A more performant distribution of Hadoop; this is what Greenplum bundles

IBM
BigInsights is IBM's Hadoop distribution

Platform Computing
Yet another distro; late to the game but now acquired by IBM

Big Data - Summary

"By all accounts, we are in the early years of the era of big data."

Companies are still struggling to understand the nature of this shift and its implications for their business.

When the Economist Intelligence Unit asked survey respondents about the most challenging aspects of data management, most said they had their storage and security needs under control. They believe the costs are manageable. Of much greater concern, however, is ensuring that their data are accurate and reliable. And by far the most difficult process right now is reconciling disparate data sources.

Levi Strauss is assembling non-standard, siloed information across multiple regions onto one common platform, using standard taxonomies and a single language.

"Through this process, we quickly realized that we had a number of different processes and systems across the various geographies. In some cases, we had duplicate entries or inconsistent data. Getting on top of this was critical for the company to unlock valuable insight from information such as sell-out or customer programmes."

Wim Vriens, Director of Business Improvement, Levi Strauss

This critical step in the management of big data is perhaps the least mature of all data management disciplines. Companies struggle with it for many reasons. BUT to succeed they have to master it.




Thanks


Drive Safe