and Hive with CAT

unknownlippsΤεχνίτη Νοημοσύνη και Ρομποτική

16 Οκτ 2013 (πριν από 3 χρόνια και 8 μήνες)

135 εμφανίσεις

Cindy Gross

CAT PM

SQL, BI, Big Data


HDInsight: Jiving about Hadoop
and Hive with CAT

http://blogs.msdn.com/cindygross/

@SQLCindy

Cindy.Gross@microsoft.com


Why are we
here?

Objectives


Quick Overview: Big Data,
Hadoop, HDInsight, Open
Source


What Hive is


Why Hive for Hadoop?


Why Hive for SQL Pros?


How Hive fits into
Hadoop/HDInsight


Hive is better together with
SQL, AS, BI


Key Takeaways


How Hive fits


Hive DDL and DML


Formats, Structure


Storage options

Big Data

How do I optimize
my fleet based on
weather and traffic
patterns?

What’s the social
sentiment for my
brand or
products

How do I better
predict future
outcomes?


Increases ad revenue by processing 3.5
billion events per day


Massive Volumes


Processes 464 billion rows per quarter, with
average query time under 10 secs.



Measures and ranks online user influence
by processing 3 billion signals per day


Cloud Connectivity


Connects across 15 social networks via the
cloud for data and API access


Uses sentiment analysis and web analytics
for its internal cloud


Real
-
Time Insight


Improves operational decision making for
IT managers and users

Cloud (
Azure
) Flexibility + On
-
Premises Option

Big Data Technologies

SQL
Server and PDW
Office, Analysis
Services
HDInsight
Analysis
Services
Office, Analysis
Services
How it fits together

Open Source
Community


We Consume Code

We Contribute Code


Core Code Same Across
Distributions



Apache Hadoop

Microsoft Partner


Heavy Contributors to
Open Source Hadoop


Trusted in Open Source
Community




Hortonworks

HDInsight Service,

HDInsight Server Built
on Hortonworks
Platform


Additional
Functionality



HDInsight

Distributed Storage

(HDFS)

Query

(Hive)

Distributed Processing

(MapReduce)

Scripting

(Pig)

NoSQL Database

(HBase)

Metadata

(
HCatalog
)

Data Integration

( ODBC / SQOOP/ REST)

Machine
Learning

(Mahout)

Graph

(Pegasus)

Stats
processing

(RHadoop)

Event Pipeline

(Flume)

Pipeline /
Workflow

(
Oozie
)

Legend

Red = Core
Hadoop

Blue = Data
processing

Purple = Microsoft
integration points
and value adds

Yellow = Data
Movement

Green = Packages

White = Coming
Soon

Hive

Hadoop

Hive Architecture

Microsoft Confidential

Enables BI tools via ODBC, structure

Structure without full relational modeling

Familiar HiveQL
-

skillset reuse

Simplifies Hadoop data access

Why Hive for
Hadoop?

Microsoft Confidential

Batch oriented

Data Warehouse focused

Entire data sets (table scans)

Generates/runs MapReduce (not faster than MR!)

Limited indexing, no stats, no cache

Programmer is the optimizer

Append only (mostly)

Hive Characteristics

Microsoft Confidential

Someone in your org will be doing it, why not you?

Fit projects to appropriate tech

Adds to, complements SQL, AS, BI

New opportunities for biz and for
you


Explore, archive, prototype, pre
-
aggregate,


refine algorithms, some self
-
service

Why Hive for SQL
Pros?

Microsoft Confidential

Updates, OLTP, ACID

Subsets, indexes/
aggs
, built
-
in optimizer, caching

Apps, data, structure, infrastructure already exists

Each query has to be fast

You know what you need to know

Where it makes sense

SQL/AS still needed
for….

Not Partitioned

CREATE
EXTERNAL TABLE
baconUnPart

(type string COMMENT 'type of bacon')


COMMENT 'SQL Bacon!'


ROW FORMAT DELIMITED


FIELDS TERMINATED BY '
\
t'


STORED AS TEXTFILE


LOCATION '/user/demo/bacon
';


Partitioned

CREATE EXTERNAL TABLE
baconPart

(type string COMMENT 'type of bacon strips')


COMMENT 'SQL Bacon strips'


PARTITIONED BY (year string)


ROW FORMAT DELIMITED


FIELDS TERMINATED BY '
\
t'


STORED AS TEXTFILE
;

ALTER TABLE
baconPart

ADD PARTITION (Year = ‘1’)


LOCATION '/user/demo/bacon1';

ALTER TABLE
baconPart

ADD PARTITION (Year = ‘2’)


LOCATION '/user/demo/bacon2';



Create Table

DATA TYPES

EXTERNAL / INTERNAL

PARTITIONED BY | CLUSTERED BY | SKEWED BY

Terminators

ROW FORMAT DELIMITED | SERDE

STORED AS

Fields/Collection Items/Map Keys

TERMINATED BY

LOCATION


Inside a Hive
Table

Metadata

is stored in a
MetaStore

database such as

Derby

SQL Azure

SQL Server

See Schema

SHOW TABLES
'
ba
.*';

DESCRIBE
baconunpart
;

DESCRIBE
baconunpat.type
;

DESCRIBE EXTENDED
baconunpart
;

DESCRIBE FORMATTED
baconunpart
;

SHOW FUNCTIONS "x.*";

SHOW FORMATTED INDEXES ON
baconunpart
;

MetaData

Primitives

Numbers:
Int
,
SmallInt
,
TinyInt
,
BigInt
, Float, Double

Characters: String

Special: Binary, Timestamp


Collections

STRUCT<
City:String
,
State:String
> |
Struct

(‘Boise’, ‘Idaho’)

ARRAY <String> | Array (‘Boise’, ‘Idaho’)

MAP <String, String> | Map (‘City’, ‘Boise’, ‘State’, ‘Idaho’)

UNIONTYPE <
BigInt
, String, Float>


Properties

No fixed
string lengths

NULL handling depends on
SerDe


Data Types

CREATE
EXTERNAL
TABLE
baconUnPart
(…)


LOCATION '/
user/demo/bacon';


LOCATION ‘
hdfs
:///
user/demo/bacon';


LOCATION ‘
asv
://
user/demo/bacon';

Use

EXTERNAL

when

Data also used outside of Hive

Data needs to remain even after a DROP TABLE

Use custom location such as ASV

Hive should not own data and control settings,
dirs
, etc.

Use

INT
ERNAL

when


You want Hive to manage the data and storage


Short term usage


Creating
table based on existing table (AS SELECT)


Storage


External and
Internal

CREATE EXTERNAL TABLE
baconPart

(…)


PARTIONED BY

(Year string)


CLUSTERED BY

(type)
into 256
BUCKETS


Partition

Directory for each distinct combo of string partition values

Partition key name cannot be defined in table itself

Allows partition elimination

Useful in range searches

Can slow performance if partition is not referenced in query

Buckets

Split data based on hash of a column

One HDFS file per bucket within partition sub
-
directory

Performance may improve for aggregates and join queries

Sampling

set
hive.enforce.bucketing

= true;




Storage


Partition and
Bucket

CREATE EXTERNAL TABLE
baconPart

(…)


ROW FORMAT DELIMITED


FIELDS TERMINATED by


\
001‘


STORED AS TEXTFILE
, RCFILE, SEQUENCEFILE, AVRO



Format

Generally remove headers before loading files

TEXTFILE
is common, useful when data is shared and alphanumeric

Extensible storage formats via custom input, output formats

Extensible on disk/in
-
memory representation via custom
SerDes



Storage


File
Formats

CREATE EXTERNAL TABLE
CustomSerDeUsage
(…)


ROW FORMAT SERDE

'
com.cloudera.hive.serde.JSONSerDe
'


LOCATION ….


SerDes

Create your own Java Serialization/Deserialization

Includes parse input/output, optimization

Usually overrides CREATE TABLE DDL

Common
SerDes
: CSV, XML, JSON, Custom

Library:
org.apache.hadoop.hive.serde2



Storage

SerDes

ASV
://containername@account.blob.core.windows.net/user/demo/

HDFS
:///user/demo/


Storage Format

HDFS is Hadoop distributed file system

ASV is Azure Storage Vault using an API on top of HDFS

ASV allows reuse across clusters and with other apps

ASV data quickly available to new HDInsight clusters



Storage

HDFS and
ASV

CREATE INDEX
baconPart_idx



ON TABLE
baconPart

(type)


AS '
org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler
'


WITH DEFERRED REBUILD


IN TABLE
baconPart_index
;

ALTER INDEX
baconPart_idx

ON
baconPart

REBUILD
;


Key Points

No keys

Index data is another table

Requires REBUILD to include new data

SHOW FORMATTED INDEXES on
MyTable
;


Indexing May Help

Avoid many small partitions

GROUP BY


CREATE
INDEX

CREATE VIEW

baconOneYear

(type)


AS SELECT
type


FROM
baconPart


WHERE

year = 1


ORDER BY
type;


Sample Code

SELECT * FROM
baconOneYear
;

DESCRIBE FORMATTED
baconOneYear
;


Key Points

Not materialized

Can have ORDER BY or LIMIT

Create View

SELECT

c.state_fips
,
c.county_fips
,
c.population



FROM

census c


WHERE

c.median_household_income

> 100000


GROUP

BY

c.state_fips
,
c.county_fips
,
c.population


ORDER

BY

county_fips


LIMIT

100;


Key Points

Minimal caching, statistics, or optimizer

Generally reads entire data set for every query


Performance

The order of columns, tables can make a difference to performance

Use partition elimination for range filtering


Query

ORDER BY

One reducer does final sort, can be a big bottleneck


SORT BY


Sorted
only within each reducer, much faster


DISTRIBUTE BY


Determines
how map data is distributed to reducers


SORT BY + DISTRIBUTE BY = CLUSTER BY


Can
mimic ORDER BY, better
perf

if even distribution


Sorting

Supported Hive Join Types

Equality

OUTER
-

LEFT, RIGHT, FULL

LEFT SEMI


Not Supported

Non
-
Equality

IN/EXISTS
subqueries

(rewrite as LEFT SEMI JOIN)


Characteristics

Multiple MapReduce jobs unless same join columns in all tables

Put largest table last in query to save memory

Joins are done left to right in query order

JOIN ON completely evaluated before WHERE starts

Joins

EXPLAIN

SELECT * FROM
baconPart
;

EXPLAIN

SELECT * FROM
baconPart

WHERE year > 1;

EXPLAIN EXTENDED

SELECT * FROM
baconPart
;

Characteristics

Does not execute the query

Shows parsing

Lists stages, temp files, dependencies, modes, output operators, etc
.

ABSTRACT SYNTAX TREE:


(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME
baconPart
))) (TOK_INSERT (TOK_DESTINATION
(TOK_DIR TOK_TMP_FIL

E)) (TOK_SELECT (TOK_SELEXPR TOK_ALLCOLREF))))


STAGE DEPENDENCIES:


Stage
-
0 is a root stage


STAGE PLANS:


Stage: Stage
-
0


Fetch Operator


limit:
-
1

EXPLAIN

Configuration

cd %
hive_home
%
\
bin

<
install
-
dir
> currently: C:
\
Hadoop
\
hadoop
-
1.1.0
-
SNAPSHOT

Hive default configuration <install
-
dir
>/
conf
/hive
-
default.xml

Configuration variables <install
-
dir
>/
conf
/hive
-
site.xml

Hive configuration directory HIVE_CONF_DIR environment variable

Log4j configuration <install
-
dir
>/
conf
/hive
-
log4j.properties

Typical Log: c:
\
Hadoop
\
hive
-
0.9.0
\
logs
\
hive.log

Configure
Hive

BUZZ!

BI on Big Data

Cross
-
pollinate your existing SQL skills!

Makes Hadoop cross
-
correlations, joins, filters easier

Allows storage of intermediate results for faster/easier querying

Batch based processing

E2E
insight may be much
faster

Get the right projects on the right technologies







Why Use
Hive

Get Involved

Read a bit

http://sqlblog.com/blogs/lara_rubbelke/archive/2012/09/10/big
-
data
-
learning
-
resources.aspx


Programming Hive Book


http://blogs.msdn.com/cindygross

Subscribe to
Windows Azure HDInsight Service
http://HadoopOnAzure.com

(Cloud CTP)

Download Microsoft HDInsight Server
http://microsoft.com/bigdata

(On
-
Prem CTP)

Think about how you can fit Big Data into your company data strategy

Suggest uses, be prepared to combat misuses







Next Steps

Microsoft Big Data
http://microsoft.com/bigdata

Denny Lee
http://dennyglee.com/category/bigdata/


Carl Nolan
http://tinyurl.com/6wbfxy9


Cindy Gross
http://tinyurl.com/SmallBitesBigData


Big Data References

Hadoop
: The Definitive Guide by Tom White

SQL Server
Sqoop

http://bit.ly/rulsjX


JavaScript

http://bit.ly/wdaTv6

Twitter
https://twitter.com/#!/search/%23bigdata


Hive
http://hive.apache.org

Excel to
Hadoop

via Hive ODBC
http://tinyurl.com/7c4qjjj

Hadoop

On Azure Videos
http://tinyurl.com/6munnx2

Klout

http://tinyurl.com/6qu9php


What we
covered

Objectives


Quick Overview: Big Data,
Hadoop, HDInsight, Open
Source


What Hive is


Why Hive for Hadoop?


Why Hive for SQL Pros?


How Hive fits into
Hadoop/HDInsight


Hive is better together with
SQL, AS, BI


Key Takeaways


How Hive fits


Hive DDL and DML


Formats, Structure


Storage options

Cindy Gross

CAT PM

SQL, BI, Big Data


HDInsight: Jiving about Hadoop
and Hive with CAT

http://blogs.msdn.com/cindygross/

@SQLCindy

Cindy.Gross@microsoft.com