Real World Cloud Application Security

Posted: 13 Dec 2013

chan@netflix.com

About Me


Director of Engineering @ Netflix


Responsible for:


Cloud app, product, infrastructure, ops security


Previously:


Led security team @ VMware


Earlier, primarily security consulting at @stake, iSEC Partners

Netflix, Inc.

“Netflix is the world’s leading Internet television
network with more than 33 million members in 40
countries enjoying more than one billion hours of TV
shows and movies per month, including original
series . . .”

Source: http://ir.netflix.com

APPSEC CHALLENGES

Lots of Good Advice


BSIMM


Microsoft SDL


SAFECode


But, what works?

Forrester Consulting, 12/10

Especially, given phenomena such as DevOps, cloud, agile, and the unique characteristics of an organization?

CLOUD @ NETFLIX

Availability

“Undifferentiated Heavy Lifting”

Netflix Culture

“may well be the most important document ever to come out of the Valley.”

Sheryl Sandberg, Facebook COO

Scale and Usage Curve

Netflix is now ~99% in the cloud

On the way to the cloud . . . (architecture)

On the way to the cloud . . . (organization)

(or NoOps, depending on definitions)

DEPLOYING CODE

A common graph @ Netflix

Lots of watching in prime time

Not as much in early morning

Old way - pay and provision for peak, 24/7/365

Multiply this pattern across the dozens of apps that comprise the Netflix streaming service

Weekend afternoon ramp-up

Solution: Load-Based Autoscaling

Autoscaling


Goals:


# of systems matches load requirements


Load per server is constant


Happens without intervention (the ‘auto’ in autoscaling)


Results:


Clusters continuously add & remove nodes


New nodes must mirror existing
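
The first two goals (system count matches load, load per server stays constant) can be illustrated with a minimal sketch. This is not Netflix's scaling policy: the function name, the requests-per-second metric, and the per-node target are all assumptions made for the example.

```python
import math


def desired_node_count(total_load_rps, target_rps_per_node,
                       min_nodes=1, max_nodes=1000):
    """Return how many nodes keep per-node load at or below the target.

    All parameters are hypothetical; a real autoscaling policy would be
    driven by whatever load metric the cluster actually exposes.
    """
    if target_rps_per_node <= 0:
        raise ValueError("target_rps_per_node must be positive")
    needed = math.ceil(total_load_rps / target_rps_per_node)
    # Clamp to the cluster's configured bounds.
    return max(min_nodes, min(max_nodes, needed))


if __name__ == "__main__":
    # Early morning, prime time, weekend afternoon (made-up numbers).
    for load in (1_000, 12_000, 4_500):
        nodes = desired_node_count(load, target_rps_per_node=500, min_nodes=3)
        print(f"load={load} rps -> {nodes} nodes")
```

Because the node count follows the load curve, nodes are added and removed continuously, which is why every new node has to come up identical to the existing ones.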

Every change requires a new cluster push

(not an incremental change to existing systems)

Deploying code must be easy

(it is)

Netflix Deployment Pipeline

Perforce/Git: code change, config change

YUM: RPM with app-specific bits

Bakery/Aminator: base image + RPM

AMI: VM template ready to launch

ASG: cluster config, running systems

Operational Impact


No changes to running systems


No systems mgmt infrastructure (Puppet, Chef, etc.)


Fewer logins to prod


No snowflakes


Trivial “rollback”

Security Impact


Need to think differently on:


Vulnerability management


Patch management


User activity monitoring


File integrity monitoring


Forensic investigations

Architecture, organization, deployment are all different.

What about security?

We’ve adapted too.

Some principles we’ve found useful.

POINTS OF EMPHASIS

Points of Emphasis


Integrate


Make the right way easy


Self-service, with exceptions


Trust, but verify



Two contexts:

1. Integration with your engineering ecosystem

2. Integration of your security controls

Organization


SCM, build and release


Monitoring and alerting

Integration: Base AMI Testing

Base AMI

VM/instance template used for all cloud systems

Average instance age = ~24 days (one-time sample)

The base AMI is managed like other packages, via P4, Jenkins, etc.

We watch the SCM directory & kick off testing when it changes

Launch an instance of the AMI, perform vuln scan and other checks

SCAN COMPLETED ALERT


Site name: AMI1


Stopped by: N/A


Total Scan Time: 4 minutes 46 seconds


Critical Vulnerabilities: 5

Severe Vulnerabilities: 4

Moderate Vulnerabilities: 4
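
An alert like the one above is easy to consume programmatically. The sketch below parses the severity counts and applies a pass/fail gate. The gate and its thresholds are assumptions for illustration; the slides do not say the scan result blocks an AMI from being published.

```python
import re

# Sample alert text, copied from the slide.
ALERT = """\
Site name: AMI1
Stopped by: N/A
Total Scan Time: 4 minutes 46 seconds
Critical Vulnerabilities: 5
Severe Vulnerabilities: 4
Moderate Vulnerabilities: 4
"""

_COUNT_LINE = re.compile(r"^(\w+) Vulnerabilities:[ \t]*(\d+)[ \t]*$",
                         re.MULTILINE)


def parse_alert(text):
    """Extract {severity: count} from a scan-completed alert."""
    return {m.group(1).lower(): int(m.group(2))
            for m in _COUNT_LINE.finditer(text)}


def ami_passes(counts, max_critical=0, max_severe=0):
    """Hypothetical gate: fail the AMI if it exceeds either threshold."""
    return (counts.get("critical", 0) <= max_critical
            and counts.get("severe", 0) <= max_severe)


if __name__ == "__main__":
    counts = parse_alert(ALERT)
    print(counts, "pass" if ami_passes(counts) else "fail")
```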

Integration: Control Packaging and Installation

From the RPM spec file of a webserver:

Requires: ossec cloudpassage nflx-base-harden hyperguard-enforcer

Pulls in the following RPMs:

HIDS agent

Config assessment/firewall agent

Host hardening package

WAF

Integration: Timeline (Chronos)

What IP addresses have been blacklisted by the WAF in the last few weeks?

GET /api/v1/event?timelines=type:blacklist&start=20130125000000000

Which security groups have changed today?

GET /api/v1/event?timelines=type:securitygroup&start=20130206000000000
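
The two queries above differ only in event type and start time, so a small helper can build them. This is a sketch under assumptions: the host name is hypothetical, and the timestamp layout (year through seconds followed by three millisecond digits) is inferred from the 17-digit values on the slide, not taken from any Chronos documentation.

```python
from datetime import datetime
from urllib.parse import urlencode

# Hypothetical host; the slides show only the request path.
BASE_URL = "https://chronos.example.internal"


def chronos_timestamp(dt):
    """Format a datetime as yyyyMMddHHmmss plus milliseconds (assumed layout)."""
    return dt.strftime("%Y%m%d%H%M%S") + f"{dt.microsecond // 1000:03d}"


def event_query_url(event_type, start):
    """Build a timeline query URL for one event type from a start time."""
    query = urlencode(
        {"timelines": f"type:{event_type}", "start": chronos_timestamp(start)},
        safe=":",  # keep the colon readable, as in the slide's examples
    )
    return f"{BASE_URL}/api/v1/event?{query}"


if __name__ == "__main__":
    print(event_query_url("blacklist", datetime(2013, 1, 25)))
    print(event_query_url("securitygroup", datetime(2013, 2, 6)))
```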



Integration: Static Analysis


Available self-service through build environment

FindBugs, PMD

Jenkins plugin to display graphs and support drill through to results


Points of Emphasis


Integrate


Make the right way easy


Self-service, with exceptions


Trust, but verify



Developers are lazy

Making it Easy: Cryptex

Crypto: DDIY (“Don’t Do It Yourself”)

Many uses of crypto in web/distributed systems:

Encrypt/decrypt (cookies, data, etc.)

Sign/verify (URLs, data, etc.)

Netflix also uses crypto heavily for device activation, DRM playback, etc.
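
To make the "sign/verify (URLs)" use case concrete, here is a minimal sketch of URL signing with an HMAC from the Python standard library. It is not the Cryptex API, which the slides do not show; the function names and the `sig` parameter are assumptions. In keeping with DDIY, application code would call a shared service for this rather than hold the key itself, as this example does for brevity.

```python
import hashlib
import hmac


def sign_url(url, key):
    """Append an HMAC-SHA256 signature of the URL as a `sig` parameter."""
    sig = hmac.new(key, url.encode("utf-8"), hashlib.sha256).hexdigest()
    separator = "&" if "?" in url else "?"
    return f"{url}{separator}sig={sig}"


def verify_url(signed_url, key):
    """Return True only if the trailing `sig` matches the rest of the URL."""
    head, found, sig = signed_url.rpartition("sig=")
    if not found or not head or head[-1] not in "?&":
        return False
    original = head[:-1]  # drop the '?' or '&' that sign_url added
    expected = hmac.new(key, original.encode("utf-8"),
                        hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(sig.encode("utf-8"), expected.encode("utf-8"))


if __name__ == "__main__":
    demo_key = b"demo-key-not-for-production"
    signed = sign_url("https://example.com/watch?id=42", demo_key)
    print(verify_url(signed, demo_key))
    print(verify_url(signed.replace("id=42", "id=43"), demo_key))
```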

Making it Easy: Cryptex

Multi-layer crypto system (HSM basis, scale out layer)

Easy to use

Key management handled transparently

Access control and auditable operations

Making it Easy: Cloud-Based SSO

In the AWS cloud, access to data center services is problematic

Examples: AD, LDAP, DNS

But, many cloud-based systems require authN, authZ

Examples: Dashboards, admin UIs

Asking developers to securely handle/accept credentials is also problematic

Making it Easy: Cloud-Based SSO

Solution: Leverage OneLogin SaaS SSO (SAML) used by IT for enterprise apps (e.g. Workday, Google Apps)

Provides a single & centralized login page

Built base module to make SSO/authN trivial

Points of Emphasis


Integrate


Make the right way easy


Self-service, with exceptions


Trust, but verify



Self-service is perhaps the most transformative cloud characteristic

Failing to adopt this for security controls will lead to friction

Self-Service: Security Groups

Asgard cloud orchestration tool allows developers to configure their own firewall rules

Limited to same AWS account, no IP-based rules
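
The two limits above (same AWS account, no IP-based rules) amount to a simple policy check that separates self-service changes from exceptions. The sketch below is an illustration only: the `IngressRule` type, its fields, and the account IDs are hypothetical and are not Asgard's data model.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class IngressRule:
    """Hypothetical, simplified firewall rule."""
    port: int
    source_group: Optional[str] = None    # e.g. "sg-frontend"
    source_account: Optional[str] = None  # account that owns source_group
    source_cidr: Optional[str] = None     # e.g. "10.0.0.0/8"


def self_service_allowed(rule: IngressRule, own_account: str) -> Tuple[bool, str]:
    """Return (allowed, reason) for a developer-requested rule."""
    if rule.source_cidr is not None:
        return False, "IP-based rules need a security exception"
    if rule.source_group is None:
        return False, "rule must reference a source security group"
    if rule.source_account != own_account:
        return False, "source group is in a different AWS account"
    return True, "ok"


if __name__ == "__main__":
    account = "111111111111"  # placeholder account ID
    rules = [
        IngressRule(port=7001, source_group="sg-frontend", source_account=account),
        IngressRule(port=22, source_cidr="0.0.0.0/0"),
    ]
    for rule in rules:
        print(rule.port, self_service_allowed(rule, account))
```

Anything the check rejects is the "with exceptions" path: it goes to the security team instead of being applied directly.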



Points of Emphasis


Integrate


Make the right way easy


Self-service, with exceptions


Trust, but verify


Culture precludes traditional “command and control” approach

Organizational desire for agile, DevOps, CI/CD blurs traditional security engagement touchpoints

Trust but Verify: Security Monkey

Cloud APIs make verification and analysis of configuration and running state simpler

Security Monkey created as the framework for this analysis



Includes:


Certificate checking


Firewall analysis


IAM entity analysis


Limit warnings


Resource policy analysis
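
Firewall analysis of this kind typically starts from a diff between two configuration snapshots. The sketch below compares security group snapshots and reports what changed. It is not Security Monkey's code: the snapshot shape (group name mapped to a set of rule tuples) is an assumption, and fetching the snapshots from the cloud API is left out.

```python
def diff_security_groups(previous, current):
    """Compare two {group_name: set_of_rule_tuples} snapshots.

    Each rule is a hypothetical (protocol, port, source) tuple.
    """
    changes = {"added": [], "removed": [], "changed": {}}
    for name in sorted(set(previous) | set(current)):
        if name not in previous:
            changes["added"].append(name)
        elif name not in current:
            changes["removed"].append(name)
        elif previous[name] != current[name]:
            changes["changed"][name] = {
                "rules_added": sorted(current[name] - previous[name]),
                "rules_removed": sorted(previous[name] - current[name]),
            }
    return changes


if __name__ == "__main__":
    yesterday = {"web": {("tcp", 80, "sg-elb")},
                 "db": {("tcp", 3306, "sg-web")}}
    today = {"web": {("tcp", 80, "sg-elb"), ("tcp", 22, "sg-bastion")},
             "cache": {("tcp", 11211, "sg-web")}}
    print(diff_security_groups(yesterday, today))
```

A non-empty result is what would feed a "Changes Detected" notification to the security alerts list.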

Trust but Verify: Security Monkey

From: Security Monkey

Date: Wed, 24 Oct 2012 17:08:18 +0000

To: Security Alerts

Subject: prod Changes Detected

Table of Contents:

Security Groups

Changed Security Group

<sgname> (eu-west-1 / prod)

<#Security Group/<sgname> (eu-west-1 / prod)>



Trust but Verify: Exploit Monkey

AWS Autoscaling group is unit of deployment, so changes signal a good time to rerun dynamic scans

On 10/23/12 12:35 PM, Exploit Monkey wrote:

I noticed that testapp-live has changed current ASG name from testapp-live-v001 to testapp-live-v002.

I'm starting a vulnerability scan against test app from these private/public IPs:

10.29.24.174
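
The trigger described in that message can be sketched as a comparison of the active ASG name per application between two polls. This is not Exploit Monkey's implementation: the function names and the mapping shape are assumptions, and both the lookup of current ASG names and the scan itself are left out.

```python
def detect_asg_changes(previous, current):
    """Return (app, old_asg, new_asg) for apps whose active ASG changed.

    Both arguments are hypothetical {app_name: active_asg_name} mappings.
    """
    changed = []
    for app, asg in sorted(current.items()):
        old = previous.get(app)
        if old is not None and old != asg:
            changed.append((app, old, asg))
    return changed


def scan_notice(app, old_asg, new_asg, ips):
    """Compose a notice modeled on the message shown on the slide."""
    return (
        f"I noticed that {app} has changed current ASG name "
        f"from {old_asg} to {new_asg}.\n"
        f"I'm starting a vulnerability scan against {app} "
        f"from these private/public IPs:\n" + "\n".join(ips)
    )


if __name__ == "__main__":
    before = {"testapp-live": "testapp-live-v001"}
    after = {"testapp-live": "testapp-live-v002"}
    for app, old, new in detect_asg_changes(before, after):
        print(scan_notice(app, old, new, ["10.29.24.174"]))
```

Tying the scan to ASG changes means dynamic testing follows the deployment cadence without anyone having to schedule it.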

Takeaways


Netflix runs a large, dynamic service in AWS


Newer concepts like cloud & DevOps need an updated approach to application security

Specific context can help jumpstart a pragmatic and effective security program

Don’t swim upstream - integrate and collaborate with your engineering partners

Netflix References

http://netflix.github.com

http://techblog.netflix.com

http://slideshare.net/netflix

Other References

http://www.webpronews.com/netflix-outage-angers-customers-2008-08

http://www.pcmag.com/article2/0,2817,2395372,00.asp

http://www.readwriteweb.com/archives/etech_amazon_cto_aws.php

http://bsimm.com/online/

http://www.microsoft.com/en-us/download/confirmation.aspx?id=29884

http://www.slideshare.net/reed2001/culture-1798664

http://techcrunch.com/2013/01/31/read-what-facebooks-sandberg-calls-maybe-the-most-important-document-ever-to-come-out-of-the-valley/

http://www.gauntlt.org

Questions?

chan@netflix.com