
A Scalable, Commodity Data Center Network Architecture

Mohammad Al-Fares
Alexander Loukissas
Amin Vahdat

Presenter: Xin Li

Outline

Introduction
  Data Center Network
Motivation
  Scale, Disadvantages of Current Solutions
Fat-Tree
  Topology, Multi-path Routing, Addressing, Two-Level Routing, Flow Scheduling, Fault-Tolerance, Power & Heat
Evaluation and Result
Conclusion


Introduction

[Figure: data center hosting model - virtual machines (App on a Guest OS) under client web-service control, sharing an oversubscribed network]

Motivation

The principal bottleneck in large-scale clusters is often inter-node communication bandwidth.

Two solutions:

Specialized hardware and communication protocols
  e.g. InfiniBand, Myrinet (supercomputer environments)
  Cons: no commodity parts (expensive)
  Cons: the protocols are not compatible with TCP/IP

Commodity Ethernet switches
  Cons: scale poorly
  Non-linear cost increase with cluster size
  High-end core switches, oversubscription (a tradeoff)

Oversubscription Ratio

[Figure: n servers (Server 1 ... Server n), each with bandwidth B, sharing an upper link of bandwidth UB]

Oversubscription Ratio = (B * n) / UB
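For example (illustrative numbers, not from the slides): n = 20 servers at B = 1 Gb/s behind a single UB = 10 Gb/s uplink give an oversubscription ratio of (1 * 20) / 10 = 2:1, so each host can count on only about 500 Mb/s when all hosts transmit at once.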

Current Data Center Topology

Edge hosts connect to 1G Top of Rack (ToR) switches

ToR switches connect to 10G End of Row (EoR) switches

Large clusters: EoR switches connect to 10G core switches

Oversubscription of 2.5:1 to 8:1 typical in guidelines

No story for what happens as we move to 10G to the edge

Key challenges: performance, cost, routing, energy, cabling

Data Center Cost

Design Goals

Scalable interconnection bandwidth
  Arbitrary host communication at full bandwidth

Economies of scale
  Commodity switches

Backward compatibility
  Compatible with hosts running Ethernet and IP

Fat-Tree Topology

k/2 servers in each rack
k/2 edge switches in each pod
k/2 aggregation switches in each pod
(k/2)^2 core switches
k pods
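To make these counts concrete, here is a minimal Python sketch (mine, not from the slides) that derives the size of a k-ary fat-tree from the figures above:

```python
def fat_tree_sizes(k: int) -> dict:
    """Element counts for a k-ary fat-tree (k even), derived from the slide's figures."""
    assert k % 2 == 0, "k must be even"
    pods = k
    edge_per_pod = k // 2          # k/2 edge switches per pod
    agg_per_pod = k // 2           # k/2 aggregation switches per pod
    hosts_per_edge = k // 2        # k/2 servers per rack (one rack per edge switch)
    core = (k // 2) ** 2           # (k/2)^2 core switches
    return {
        "pods": pods,
        "core_switches": core,
        "pod_switches": pods * (edge_per_pod + agg_per_pod),
        "hosts": pods * edge_per_pod * hosts_per_edge,   # k^3 / 4
    }

# Example: the 4-port fat-tree used in the evaluation.
print(fat_tree_sizes(4))   # {'pods': 4, 'core_switches': 4, 'pod_switches': 16, 'hosts': 16}
print(fat_tree_sizes(48))  # with 48-port switches: 27,648 hosts
```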

Fat-tree Topology Equivalent

Routing

IP needs an extension here!

(k/2)*(k/2) shortest paths between any pair of hosts in different pods!

Single-Path Routing vs. Multi-Path Routing

Static vs. Dynamic

ECMP (Equal-Cost Multi-Path Routing)
  Static flow scheduling
  Limits path multiplicity to 8-16
  Increases routing table size multiplicatively, and hence lookup latency
  Advantage: no reordering needed; supported by modern switches!


Hash-Threshold

[Flowchart: extract the source and destination addresses, apply a hash function (CRC16), and determine which region of the hash space the value falls in; the 16-bit hash range is split into one region per next hop (regions 1-4 in the figure)]
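A minimal Python sketch of hash-threshold next-hop selection as described in the flowchart; zlib.crc32 truncated to 16 bits stands in for the CRC16 named in the slide, and the port count is illustrative:

```python
import zlib

def hash_threshold_port(src_ip: str, dst_ip: str, num_ports: int) -> int:
    """Pick an equal-cost next hop by splitting the 16-bit hash space into
    num_ports equal regions (hash-threshold ECMP)."""
    key = f"{src_ip}->{dst_ip}".encode()
    h = zlib.crc32(key) & 0xFFFF          # stand-in for CRC16: 16-bit value in [0, 2**16)
    region_size = (1 << 16) // num_ports  # width of each region
    return min(h // region_size, num_ports - 1)

# Packets of the same flow always hash to the same port, so no reordering.
print(hash_threshold_port("10.0.1.2", "10.2.0.3", 4))
print(hash_threshold_port("10.0.1.3", "10.2.0.3", 4))
```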

Two-level Routing Table

Routing Aggregation

192.168.1.2, 192.168.1.10, 192.168.1.45, 192.168.1.89   -> port 0
192.168.2.3, 192.168.2.8, 192.168.2.10                  -> port 1

aggregates to:

192.168.1.0/24  -> port 0
192.168.2.0/24  -> port 1

Two-level Routing Table

Addressing

Using the 10.0.0.0/8 private IP address space

Pod switch: 10.pod.switch.1
  pod range is [0, k-1] (left to right)
  switch range is [0, k-1] (left to right, bottom to top)

Core switch: 10.k.i.j
  (i, j) is the switch's position in the (k/2)*(k/2) core grid

Host: 10.pod.switch.ID
  ID range is [2, k/2+1] (left to right)
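A small Python sketch (an illustration; it assumes the left-to-right and bottom-to-top orderings above map to 0-based indices, and core grid coordinates starting at 1 as in the figure that follows) that enumerates the addresses this scheme assigns:

```python
def fat_tree_addresses(k: int) -> dict:
    """Enumerate fat-tree addresses for switches and hosts (k even)."""
    assert k % 2 == 0
    pod_switches = {(p, s): f"10.{p}.{s}.1"
                    for p in range(k)              # pod in [0, k-1]
                    for s in range(k)}             # switch in [0, k-1]: 0..k/2-1 edge, k/2..k-1 aggregation
    core_switches = {(i, j): f"10.{k}.{i}.{j}"
                     for i in range(1, k // 2 + 1)
                     for j in range(1, k // 2 + 1)}
    hosts = {(p, s, h): f"10.{p}.{s}.{h}"
             for p in range(k)
             for s in range(k // 2)                # hosts hang off edge switches
             for h in range(2, k // 2 + 2)}        # ID in [2, k/2+1]
    return {"pod_switches": pod_switches, "core": core_switches, "hosts": hosts}

addrs = fat_tree_addresses(4)
print(addrs["hosts"][(0, 0, 2)])   # 10.0.0.2 (first host in pod 0)
print(addrs["core"][(1, 1)])       # 10.4.1.1 (first core switch for k = 4)
```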

Two-level Routing Table

[Figure: 4-port fat-tree labeled with example addresses: 10.0.0.1, 10.0.0.2, 10.0.0.3 in pod 0; 10.2.0.2, 10.2.0.3 in pod 2; core switches 10.4.1.1, 10.4.1.2, 10.4.2.1, 10.4.2.2]

Two-level Routing Table

Two-level Routing Table Structure

[Figure: a primary prefix table whose entries may point to a small secondary table of (suffix, port) pairs]







Two-level Routing Table Implementation

TCAM = Ternary Content-Addressable Memory
  Parallel searching
  Priority encoding

Two-level Routing Table --- Example

Routing tables from the worked example (each 0.0.0.0/0 entry falls through to the suffix table listed directly below it):

Prefix        Outgoing Port
10.0.0.0/24   0
10.0.1.1/24   1
0.0.0.0/0     (suffix table)

Suffix        Outgoing Port
0.0.0.2/8     3
0.0.0.3/8     2

Prefix        Outgoing Port
10.0.1.2/32   0
10.0.1.3/32   1
0.0.0.0/0     (suffix table)

Suffix        Outgoing Port
0.0.0.2/8     2
0.0.0.3/8     3

Prefix        Outgoing Port
10.0.0.0/16   0
10.1.0.0/16   1
10.2.0.0/16   2
10.3.0.0/16   3

Prefix        Outgoing Port
10.2.0.0/24   0
10.2.1.0/24   1
0.0.0.0/0     (suffix table)

Suffix        Outgoing Port
0.0.0.2/8     2
0.0.0.3/8     3

Prefix        Outgoing Port
10.2.0.2/32   0
10.2.1.3/32   1
0.0.0.0/0     (suffix table)

Suffix        Outgoing Port
0.0.0.2/8     2
0.0.0.3/8     3
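A minimal Python sketch of how a two-level lookup resolves a destination (in hardware the TCAM does this in parallel): longest prefix match first, and if the matching entry is the default /0, the last octet decides the port via the suffix table. The tables used below are the first prefix/suffix pair from the example, with 10.0.1.1/24 written in its canonical form 10.0.1.0/24:

```python
import ipaddress

def two_level_lookup(dst: str, prefixes, suffixes) -> int:
    """Left-handed longest-prefix match; the default /0 entry falls through to a
    right-handed match on the host octet (the suffix table)."""
    addr = ipaddress.ip_address(dst)
    best = max((entry for entry in prefixes
                if addr in ipaddress.ip_network(entry[0], strict=False)),
               key=lambda entry: ipaddress.ip_network(entry[0], strict=False).prefixlen)
    _, port = best
    if port is not None:                 # terminating prefix
        return port
    host_octet = int(dst.split(".")[-1])
    return suffixes[host_octet]          # secondary table keyed by the last octet

# Prefix table 10.0.0.0/24 -> 0, 10.0.1.0/24 -> 1; suffix table .2 -> 3, .3 -> 2.
prefixes = [("10.0.0.0/24", 0), ("10.0.1.0/24", 1), ("0.0.0.0/0", None)]
suffixes = {2: 3, 3: 2}
print(two_level_lookup("10.0.1.3", prefixes, suffixes))   # intra-pod: port 1
print(two_level_lookup("10.2.0.3", prefixes, suffixes))   # inter-pod: suffix .3 -> port 2
```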

Two-level Routing Table Generation

Aggregation Switch Two-Level Routing Table Generator
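The generator itself appears only as a figure in the deck; below is a hedged Python sketch of the scheme described in the Al-Fares et al. paper for aggregation (upper pod) switches. The port convention (ports 0..k/2-1 face down, k/2..k-1 face up) is an assumption:

```python
def aggregation_switch_tables(k: int):
    """Sketch: two-level tables for pod aggregation switches."""
    tables = {}
    for pod in range(k):                                   # pod in [0, k-1]
        for z in range(k // 2, k):                         # upper-layer switches in the pod
            prefixes = [(f"10.{pod}.{i}.0/24", i)          # intra-pod subnet -> downward port i
                        for i in range(k // 2)]
            prefixes.append(("0.0.0.0/0", None))           # inter-pod traffic falls through to suffixes
            suffixes = [(f"0.0.0.{i}/8", (i - 2 + z) % (k // 2) + k // 2)
                        for i in range(2, k // 2 + 2)]     # spread flows over upward ports by host ID
            tables[f"10.{pod}.{z}.1"] = (prefixes, suffixes)
    return tables

# Example: switch 10.0.2.1 in a k = 4 fat-tree.
prefixes, suffixes = aggregation_switch_tables(4)["10.0.2.1"]
print(prefixes)   # [('10.0.0.0/24', 0), ('10.0.1.0/24', 1), ('0.0.0.0/0', None)]
print(suffixes)   # [('0.0.0.2/8', 2), ('0.0.0.3/8', 3)]
```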


Two-level Routing Table Generation

Core Switch Two-Level Routing Table Generator
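Correspondingly, a hedged sketch for core switches, which need only terminating /16 prefixes, one per destination pod; the assumption that port x of a core switch faces pod x is mine:

```python
def core_switch_tables(k: int):
    """Sketch: core switches carry only terminating /16 prefixes, one per pod."""
    tables = {}
    for i in range(1, k // 2 + 1):
        for j in range(1, k // 2 + 1):
            # Assumption: port x of every core switch connects to pod x.
            tables[f"10.{k}.{i}.{j}"] = [(f"10.{pod}.0.0/16", pod) for pod in range(k)]
    return tables

print(core_switch_tables(4)["10.4.1.1"])
# [('10.0.0.0/16', 0), ('10.1.0.0/16', 1), ('10.2.0.0/16', 2), ('10.3.0.0/16', 3)]
```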




Two-Level Routing Table

Avoids packet reordering
  Traffic diffusion occurs only in the first half of a packet's journey

Centralized protocol to initialize the routing tables

Flow Classification (Dynamic)

Soft state (compatible with the two-level routing table)

A flow = packets with the same source and destination IP addresses

Avoids reordering within a flow

Balancing
  Assignment and updating


Flow Classification: Flow Assignment

[Flowchart] For each packet, hash (source, destination):
  Seen this hash value before? Yes: look up the previously assigned port x and send the packet on port x
  No: record a new flow record f, assign f to the least-loaded port x, and send the packet on port x
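A minimal Python sketch of this assignment step; the data structures and the load metric (bytes sent per port) are assumptions, since the slide gives only the flowchart:

```python
import zlib
from collections import defaultdict

class FlowClassifier:
    """Sketch of a switch-local flow classifier: new flows go to the least-loaded
    upward port; packets of an existing flow reuse its port."""

    def __init__(self, uplink_ports):
        self.uplink_ports = list(uplink_ports)
        self.flow_to_port = {}                     # hash(src, dst) -> assigned port
        self.port_load = defaultdict(int)          # port -> bytes sent (assumed load metric)
        self.flow_size = defaultdict(int)          # flow -> bytes seen (used by the update step)

    def assign(self, src: str, dst: str, size: int) -> int:
        flow = zlib.crc32(f"{src}->{dst}".encode())
        if flow not in self.flow_to_port:          # new flow: pick the least-loaded port
            self.flow_to_port[flow] = min(self.uplink_ports, key=self.port_load.__getitem__)
        port = self.flow_to_port[flow]
        self.port_load[port] += size
        self.flow_size[flow] += size
        return port

clf = FlowClassifier(uplink_ports=[2, 3])
print(clf.assign("10.0.1.2", "10.2.0.3", 1500))    # first packet of the flow
print(clf.assign("10.0.1.2", "10.2.0.3", 1500))    # same flow -> same port
```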

Flow Classification: Update

Performed every t seconds, for every aggregation switch:

D = P_max - P_min, the gap in aggregate traffic between the most and least loaded ports

Find the largest flow f assigned to port p_max whose size is smaller than D; then reassign this flow to p_min
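Extending the FlowClassifier sketch above, a hedged version of this periodic rebalancing step (the per-flow size bookkeeping is an assumption):

```python
class RebalancingFlowClassifier(FlowClassifier):
    """Adds the periodic update step: shift the largest flow that fits the load
    gap D from the most-loaded port to the least-loaded one."""

    def rebalance(self):
        p_max = max(self.uplink_ports, key=self.port_load.__getitem__)
        p_min = min(self.uplink_ports, key=self.port_load.__getitem__)
        d = self.port_load[p_max] - self.port_load[p_min]      # D = P_max - P_min
        candidates = [(size, f) for f, size in self.flow_size.items()
                      if self.flow_to_port[f] == p_max and size < d]
        if not candidates:
            return
        size, flow = max(candidates)                           # largest flow smaller than D
        self.flow_to_port[flow] = p_min                        # reassign it to p_min
        self.port_load[p_max] -= size
        self.port_load[p_min] += size

# Called every t seconds, e.g. from a timer loop: clf.rebalance()
```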

Flow Scheduling

The distribution of transfer times and burst lengths of Internet traffic is long-tailed

Large flows dominate

Large flows should be handled specially

Flow Scheduling

Eliminates global congestion

Prevents long-lived flows from sharing the same links

Assigns long-lived flows to different links

[Flowchart] An edge switch detects a flow whose size grows above a threshold, notifies the central controller, and the controller assigns that flow to a non-conflicting path
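A hedged Python sketch of the central controller's bookkeeping; representing a path as a set of link IDs and treating "non-conflicting" as "no link already reserved for another large flow" are assumptions:

```python
class CentralScheduler:
    """Sketch: track which links are reserved by large flows and place each newly
    reported large flow on a path whose links are all free."""

    def __init__(self, candidate_paths):
        # candidate_paths: list of paths, each a tuple of link IDs (illustrative names).
        self.candidate_paths = candidate_paths
        self.reserved_links = set()
        self.flow_to_path = {}

    def report_large_flow(self, flow_id, paths=None):
        """Called by an edge switch when a flow exceeds the size threshold."""
        for path in (paths or self.candidate_paths):
            if not self.reserved_links.intersection(path):   # non-conflicting path found
                self.reserved_links.update(path)
                self.flow_to_path[flow_id] = path
                return path
        return None                                          # no free path: fall back to the flow classifier

sched = CentralScheduler([("e0-a0", "a0-c0", "c0-a2", "a2-e2"),
                          ("e0-a1", "a1-c2", "c2-a3", "a3-e2")])
print(sched.report_large_flow("10.0.1.2->10.2.0.3"))
print(sched.report_large_flow("10.0.1.3->10.2.0.2"))
```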

Fault-Tolerance

Bidirectional Forwarding Detection (BFD) sessions detect failures

  Between lower- and upper-layer switches

  Between upper-layer and core switches

With flow scheduling, failures are much easier to handle.

Failure b/w upper-layer and core switches

Outgoing inter-pod traffic: the local routing table marks the affected link as unavailable and chooses another core switch

Incoming inter-pod traffic: the core switch broadcasts a tag to the upper-layer switches directly connected to it, signifying its inability to carry traffic to that entire pod; those upper-layer switches then avoid that core switch when assigning flows destined to that pod


Failure b/w lower- and upper-layer switches

Outgoing inter- and intra-pod traffic from the lower layer: the local flow classifier sets the link cost to infinity, assigns it no new flows, and chooses another upper-layer switch

Intra-pod traffic using the upper-layer switch as an intermediary: the switch broadcasts a tag notifying all lower-layer switches, which check the tag when assigning new flows and avoid the failed switch

Inter-pod traffic coming into the upper-layer switch: the switch sends a tag to all of its core switches signifying its inability to carry that traffic; the core switches mirror this tag to all connected upper-layer switches, which then avoid the affected core switch when assigning new flows


Power and Heat

Experiment Description

Fat-tree, Click

A 4-port fat-tree: 16 hosts, four pods (each with four switches), and four core switches.

We multiplex these 36 elements onto ten physical machines, interconnected by a 48-port ProCurve 2900 switch with 1 Gigabit Ethernet links.

Each pod of switches is hosted on one machine; each pod's hosts are hosted on one machine; and the two remaining machines run two core switches each.

Links are bandwidth-limited to 96 Mbit/s to ensure that the configuration is not CPU-limited.

Each host generates a constant 96 Mbit/s of outgoing traffic.

Experiment Description

Hierarchical tree, Click

Four machines run four hosts each, and four machines each run four pod switches with one additional uplink.

The four pod switches are connected to a 4-port core switch running on a dedicated machine.

3.6:1 oversubscription on the uplinks from the pod switches to the core switch.

Each host generates a constant 96 Mbit/s of outgoing traffic.



Results


Conclusion

Bandwidth is the scalability bottleneck in large-scale clusters

Existing solutions are expensive and limit cluster size

Fat-tree topology with scalable routing and backward compatibility with TCP/IP and Ethernet


Q&A

Thank you