Developing and Deploying Apache Hadoop Security


Owen O'Malley
Hortonworks Co-founder and Architect
owen@hortonworks.com
@owen_omalley

© Hortonworks Inc. 2011

July 25, 2011

Who am I


An architect working on Hadoop full time since the beginning of the project (Jan '06)
Primarily focused on MapReduce
Tech-lead on adding security to Hadoop
Co-founded Hortonworks this month

Before Hadoop: Yahoo Search WebMap
Before Yahoo: NASA, Sun
PhD from UC Irvine

What is Hadoop?


A framework for storing and processing big data on lots of commodity machines.
Up to 4,500 machines in a cluster
Up to 20 PB in a cluster
Open Source Apache project
High reliability done in software
Automated failover for data and computation
Implemented in Java

Primary data analysis platform at Yahoo!
40,000+ machines running Hadoop
More than 1,000,000 jobs every month






Case Study: Yahoo Front Page

Personalized for each visitor
Result: twice the engagement
+160% clicks vs. one size fits all
+79% clicks vs. randomly selected
+43% clicks vs. editor selected
(Front-page modules shown: Recommended links, News Interests, Top Searches)

Problem


Yahoo! has more yahoos than clusters.
Hundreds of yahoos using Hadoop each month
40,000 computers in ~20 Hadoop clusters.
Sharing requires isolation or trust.

Different users need different data.
Not all yahoos should have access to sensitive data (financial data and PII)
In Hadoop 0.20, it was easy to impersonate another user.
The alternative: segregate different data on separate clusters

Solution


Prevent unauthorized HDFS access
All HDFS clients must be authenticated.
Including tasks running as part of MapReduce jobs
And jobs submitted through Oozie.

Users must also authenticate servers
Otherwise fraudulent servers could steal credentials

Integrate Hadoop with Kerberos
Provides a well-tested, open source, distributed authentication system.

Requirements


Security must be optional.
Not all clusters are shared between users.
Hadoop commands must not prompt for passwords
Must have single sign-on.
Otherwise trojan-horse versions are easy to write.
Must support backwards compatibility
HFTP must be secure, but allow reading from insecure clusters

Primary Communication Paths


Definitions


Authentication
Determining the user
Hadoop 0.20 completely trusted the user
The user passed their username and groups over the wire
We need it on both RPC and the Web UI.

Authorization
What can that user do?
HDFS has had owners, groups, and permissions since 0.16.
Map/Reduce had nothing in 0.20.

Auditing
Who did what?
Available since 0.20

Authentication


Changes the low-level transport
RPC authentication using SASL:
Kerberos (GSSAPI)
Token (Digest-MD5)
Simple
Browser HTTP secured via plugin
Tool HTTP (e.g. fsck) via SSL/Kerberos


Authorization


HDFS
Command line unchanged
Web UI enforces authentication

MapReduce added Access Control Lists
Lists of users and groups that have access.
mapreduce.job.acl-view-job: view the job
mapreduce.job.acl-modify-job: kill or modify the job
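As an illustrative sketch of the two ACL properties above (the user names and the "analysts" group are hypothetical examples; the value format is a comma-separated user list, a space, then a comma-separated group list):

```xml
<!-- Per-job configuration sketch; "alice", "bob", and "analysts"
     are hypothetical. An empty group list means users only. -->
<property>
  <name>mapreduce.job.acl-view-job</name>
  <value>alice,bob analysts</value>   <!-- users, then groups -->
</property>
<property>
  <name>mapreduce.job.acl-modify-job</name>
  <value>alice</value>
</property>
```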

Auditing


A critical part of security is an accurate method for determining who did what.
Almost useless until you have strong authentication

HDFS audit log tracks reading and writing of files
MapReduce audit log tracks launching or modifying job properties


Kerberos and Single Sign-on

Kerberos allows a user to sign in once
Obtains a Ticket Granting Ticket (TGT)
kinit: get a new Kerberos ticket
klist: list your Kerberos tickets
kdestroy: destroy your Kerberos tickets
TGTs last for 10 hours, renewable for 7 days by default

Once you have a TGT, Hadoop commands just work:
hadoop fs -ls /
hadoop jar wordcount.jar in-dir out-dir



API Changes


Very minimal API changes
Most applications work unchanged
UserGroupInformation *completely* changed.

MapReduce added secret credentials
Available from JobConf and JobContext
Never displayed via the Web UI

Automatically get tokens for HDFS
Primary HDFS, File{In,Out}putFormat, and DistCp
Can set mapreduce.job.hdfs-servers


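As a hedged sketch of the mapreduce.job.hdfs-servers property mentioned above: a job that also reads from a second HDFS cluster can list that NameNode so delegation tokens are fetched for it at submission time. The host names below are hypothetical.

```xml
<!-- Job configuration sketch; nn1/nn2 host names are hypothetical.
     Tokens are obtained for each listed NameNode when the job is submitted. -->
<property>
  <name>mapreduce.job.hdfs-servers</name>
  <value>hdfs://nn1.example.com:8020,hdfs://nn2.example.com:8020</value>
</property>
```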

MapReduce task-level security

MapReduce tasks run as the submitting user.
No more accidentally killing TaskTrackers!
Uses a setuid C program.
Task output logs aren't globally visible.
Task work directories aren't globally visible.

Distributed cache is split:
Public: shared between all users
Private: shared between jobs of the same user


Web UIs


Hadoop relies on Web User Interfaces served from embedded Jetty.
These need to be authenticated also…

Web UI authentication is pluggable.
SPNEGO or static-user plug-ins are available
Companies may need or want their own systems
All servlets enforce permissions based on the authenticated user.
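As a sketch of how the pluggable Web UI authentication is wired in Apache Hadoop releases that ship the hadoop-auth filter (property names from that module; the realm, principal, and keytab path are hypothetical examples):

```xml
<!-- core-site.xml sketch; EXAMPLE.COM and the keytab path are hypothetical. -->
<property>
  <name>hadoop.http.filter.initializers</name>
  <value>org.apache.hadoop.security.AuthenticationFilterInitializer</value>
</property>
<property>
  <name>hadoop.http.authentication.type</name>
  <value>kerberos</value>   <!-- SPNEGO; "simple" gives the static-user plugin -->
</property>
<property>
  <name>hadoop.http.authentication.kerberos.principal</name>
  <value>HTTP/_HOST@EXAMPLE.COM</value>
</property>
<property>
  <name>hadoop.http.authentication.kerberos.keytab</name>
  <value>/etc/security/keytab/http.keytab</value>
</property>
```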

Proxy-Users

Some services access HDFS and MapReduce as other users.
Configure the service masters (NameNode and JobTracker) with the proxy user.

For each proxy user, the configuration defines:
Who the proxy service can impersonate
Which hosts they can impersonate from

New admin commands to refresh
Don't need to bounce the cluster


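As a hedged sketch of that proxy-user configuration, using Oozie (mentioned earlier) as the example service; the host name and group are hypothetical:

```xml
<!-- core-site.xml on the NameNode and JobTracker.
     "oozie" is the service's own login user; hosts and groups
     bound whom it may impersonate and from where. -->
<property>
  <name>hadoop.proxyuser.oozie.hosts</name>
  <value>oozieserver.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.oozie.groups</name>
  <value>users</value>
</property>
```

The refresh is then done with the admin commands rather than a restart (the exact subcommand name varies by release).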

Out of Scope


Encryption
RPC transport
Block transport protocol
On disk

File Access Control Lists
Still use Unix-style owner, group, other permissions

Non-Kerberos Authentication
Much easier now that the framework is available

Deployment


The security team worked hard to get security added to Hadoop on schedule.
Roll-out was the smoothest major Hadoop release in a long time.
In the 0.20.203.0 and upcoming 0.20.204.0 releases.
Measured performance degradation < 3%

Security development team: Devaraj Das, Ravi Gummadi, Jakob Homan, Owen O'Malley, Jitendra Pandey, Boris Shkolnik, Vinod Vavilapalli, Kan Zhang

Currently deployed on all shared clusters (alpha, science, and production) at Yahoo!

Incident after Deployment


The only tense incident involved one cluster where 1/3 of the machines dropped out of the cluster after a day.
Had to diagnose what had gone wrong.
The dropped machines had newer keytab files!
An operator had regenerated the keys on 1/3 of the cluster after it was running. Servers failed when they tried to renew their tickets.


Hadoop Eco-system

Security percolates upward…
You can only be as secure as the lower levels

Pig finished integrating with security
Oozie supports security
HBase is being updated for security
All backing data files are owned by the HBase user.
Doesn't support reading/writing files directly by the application
Hive is also being updated
Doesn't support column-level permissions



Questions?


Questions should be sent to:
common/hdfs/mapreduce-user@hadoop.apache.org

Security holes should be sent to:
security@hadoop.apache.org

Thanks!