Scale-Out Processors - HiPEAC


Scale-Out Processors

Michael Ferdman
Pejman Lotfi-Kamran, Boris Grot, Stavros Volos, Onur Kocberber, Javier Picorel, Almutaz Adileh, Cansu Kaynak, Djordje Jevdjic, and Babak Falsafi


eurocloudserver.com


Our world is Data-Driven!

Datacenters: workhorses of the information age

Scale-Out Datacenters 101


Massive Scale
- $100+M investment
- 5-20 MW power budget

Why?
- Big data sets, redundancy, …
- Efficient!

Applications
- Web search, media streaming, social connectivity
- Scale-out by design

Scale-Out Applications
- Many independent requests/tasks
- Huge dataset split into shards
- Minimal communication among servers

[Figure: client requests arrive at a load balancer / master node, which dispatches them across servers, each holding a shard of the dataset]
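The request-routing structure above lends itself to a very small sketch. The snippet below is an illustrative (hypothetical) example of how a load balancer can map each independent request to the server holding the right shard; the server names and hash scheme are assumptions, not from the talk.

```python
import hashlib

SERVERS = ["server-0", "server-1", "server-2"]  # hypothetical shard servers

def route(key: str) -> str:
    """Return the server holding the shard for this key.

    Requests are independent and each touches one shard, so the
    balancer needs only this mapping; servers never talk to each other.
    """
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return SERVERS[h % len(SERVERS)]

print(route("user:42"))  # deterministically picks one of the servers
```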

Scale-Out Datacenters 102


- Vast data sharded across servers
- Memory-resident workloads
  - Necessary for performance
  - Major cost burden
- Processors access data in memory
- Abundant request-level parallelism
  - Performance scales with core count

Maximize performance for better TCO

[Figure: cores sharing an LLC ($), with the dataset resident in memory]

Grow perf, not TCO: the easy way



- Smaller transistors: more performance in fixed area (Gordon Moore)
- Less energy per transistor: more performance at fixed power (Robert H. Dennard)

More performance with constant area & power

Life with Moore & Dennard

More performance with constant area & power

[Figure: ITRS data, power-supply voltage (Vdd) by year, 2001-2021; the scaling slope flattens from -0.053 (2001-2009) to -0.026 (2009 onward): a slowdown in Dennard scaling]

Supply voltage scaling has slowed dramatically
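Why the flattening Vdd curve matters: a back-of-the-envelope using the standard CMOS dynamic-power relation (textbook reasoning, not taken from the slides):

$$ P_{\text{dyn}} \approx \alpha \, C \, V_{dd}^{2} \, f $$

Under classic Dennard scaling, shrinking feature size by 1/κ per generation cuts C and Vdd by 1/κ while f rises by κ, so power per transistor falls by 1/κ², exactly offsetting the κ² growth in transistor count at fixed area. With Vdd nearly flat, the Vdd² term stops shrinking, and a die filled with κ² times more transistors overshoots the fixed power budget: the energy wall.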

Life with Moore alone

More performance with constant area & power

Massive Data meets Energy Wall

Must innovate to enable sustainable data-centric IT

Outline


- What do scale-out workloads need?
- Why not today’s processors?
- Scale-Out Processors
  - Overview
  - Key results
- Microarchitectural considerations

CloudSuite 1.0 (Linux 2.6.32)
- MapReduce: machine learning on Hadoop
- SAT Solver: symbolic VM constraint solver
- Web Search: Apache Nutch
- Media Streaming: Apple QuickTime Server
- Data Serving: Cassandra NoSQL
- Web Frontend: Nginx, PHP server

(released @ parsa.epfl.ch/cloudsuite)

What do Scale-Out Apps Need?


- Cores share instructions
  - Large code footprint fits in the LLC
- Data is in memory
  - Data footprint dwarfs the LLC
- Cores don’t communicate
  - Independent requests
  - Mostly read-only accesses

Poor match to existing server processors

[Figure: server requests drive cores that fetch code from the shared LLC ($) and data from memory]

[ASPLOS’12]


Conventional Processors: Major Mismatch
- Few fat cores
  - Limited on-chip parallelism
- Large LLC
  - Dwarfed by vast datasets
  - Slow to access
  - Takes up much die area

Large cores & LLC wasteful on scale-out apps

[Figure: conventional server chip, e.g., Intel Westmere: a few fat cores around a large shared LLC ($)]

Small-Chip Processors: Not Enough
- Few lean cores
  - Low parallelism
- Modestly-sized cache
  - Fast to access, but…
  - Takes up a large fraction of die area

[Figure: small-chip design, e.g., Calxeda: a handful of lean cores (C) with a modest cache ($)]

Emerging Tiled Processors: Not Optimal
- Many lean cores
  - High parallelism
- Large distributed cache
  - Dwarfed by vast datasets
  - Slow to access
  - Takes up much die area
- Tiled organization
  - Distance hurts performance

Large LLC & distance prevent scalability

[Figure: tiled many-core, e.g., Tilera Tile64: a grid of core (C) + cache slice ($) tiles]


Outline


- What do scale-out workloads need?
- Why not today’s processors?
- Scale-Out Processors
  - Overview
  - Key results
- Microarchitectural considerations

Roadmap for a Scale-Out Processor
1. Eliminate wasteful capacity in the LLC
   - Shrink the cache, add cores in freed space
2. Reduce delay to LLC
   - Partition the die into independent multi-core pods
3. Enable technology scalability
   - Do not interconnect the pods

[ISCA’12]

Step 1: Less cache, more cores
- Small LLC
  - Capture instructions
  - More area for the cores
- More cores for higher throughput

[Figure: the tiled chip’s distributed LLC slices are consolidated into a small LLC, and the freed area is filled with additional cores]



Step 2: Form “pods” to reduce distance
- Small LLC
- More cores for higher throughput
- Fast path to LLC for instructions

How to choose the optimal pod?

[Figure: the sea of cores is partitioned into pods, each pairing a group of cores (C) with its own LLC ($)]

Sizing Pods: Not So Simple


- Too few cores: limited parallelism
- Too many cores: long distance to LLC, slower instruction fetch

How do we characterize optimality?

[Figure: pods from few to many cores; with few cores the cache dominates die area, with many cores distance to the LLC hurts performance. Where is the optimal pod?]

Our Optimization Criterion: Performance Density (PD) = Perf/mm²

[Figure: as core count grows, performance per core falls while performance per chip rises; PD peaks in between. PD pinpoints optimal use of silicon]
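A toy model can make the peak concrete. The sketch below is purely illustrative: the area constants come from the methodology slide later in the talk, but the distance penalty and its coefficient are invented for the example, not the paper’s model.

```python
# Toy performance-density model: PD = pod throughput / pod area.
# Area constants follow the talk's 20 nm estimates; the distance
# penalty (0.05 * sqrt(area)) is an assumed stand-in for slower
# LLC access in bigger pods.
CORE_AREA_MM2 = 1.1
CACHE_MM2_PER_MB = 1.2
LLC_MB = 4

def pod_pd(n_cores: int) -> float:
    area = n_cores * CORE_AREA_MM2 + LLC_MB * CACHE_MM2_PER_MB
    per_core = 1.0 / (1.0 + 0.05 * area ** 0.5)  # distance slows each core
    return n_cores * per_core / area             # throughput per mm^2

best = max(range(1, 257), key=pod_pd)
print(best, round(pod_pd(best), 3))  # PD peaks at a mid-sized pod
```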

Step 3: Scale-Out Processors
- One or more pods
- Each pod is a standalone server
  - Runs a full software stack
- No inter-pod connectivity or coherence
  - Enhances scalability

Inherently optimal & scalable

[Figure: from 40 nm to 20 nm to 10 nm, the same optimally-sized pod is replicated; each technology node fits more identical, unconnected pods on the die]

Methodology

- Flexus simulation infrastructure [Wenisch ’06]
- CMP analytic model [Hardavellas ’09]
- TCO model [Hardy ’11]

Chip budget
- Die area: ~260 mm²
- Power: 95 W
- BW: 6 x 3.2 GT/s DDR4

Datacenter
- Power: 20 MW
- Rack: 42U, 17 kW
- Server: 64 GB, N sockets

Core parameters
- Out-of-order, 3-way, 2 GHz
- L1 (I & D): 32 KB, 2-way

Chip components
- Technology node: 20 nm
- Core: 1.1 mm²
- 1 MB cache: 1.2 mm²
- Memory channel: 12 mm²


Pod Design Space Exploration

Optimal Pod: 32 cores & 4 MB of cache

[Figure: performance density vs. core count (1-256) for 1, 2, 4, and 8 MB LLCs; PD peaks at the 32-core, 4 MB pod]
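A quick sanity check that such pods tile onto the stated die: the arithmetic below uses the methodology slide’s per-component areas; treating everything outside cores, cache, and memory channels as negligible is a simplifying assumption for illustration.

```python
# How many optimal pods fit in the 260 mm^2, 20 nm budget?
DIE_MM2 = 260
CORE_MM2 = 1.1
CACHE_MM2_PER_MB = 1.2
MEM_CHANNEL_MM2 = 12
N_CHANNELS = 6

POD_CORES, POD_LLC_MB = 32, 4  # the reported optimal pod
pod_area = POD_CORES * CORE_MM2 + POD_LLC_MB * CACHE_MM2_PER_MB
io_area = N_CHANNELS * MEM_CHANNEL_MM2
pods = int((DIE_MM2 - io_area) // pod_area)
print(f"pod = {pod_area:.1f} mm^2; {pods} pods "
      f"({pods * POD_CORES} cores) fit beside {io_area} mm^2 of memory I/O")
```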

Processor Efficiency

[Figure: two panels, die area breakdown (core / cache / other, in mm²) and performance improvement, for Tiled, Tiled Optimal, and Scale-Out organizations; Scale-Out devotes the most area to cores and delivers the largest performance improvement]

SOP: highest performance per mm²

Datacenter Efficiency

[Figure: TCO, performance, and performance/TCO for Conventional, Small-Chip, Tiled, and Scale-Out processors, normalized to Conventional, with annotated gains of 160%, 92%, and 54%]

SOP: highest performance/TCO

Outline


- What do scale-out workloads need?
- Why not today’s processors?
- Scale-Out Processors
  - Overview
  - Key results
- Microarchitectural considerations

How do we architect a pod?


- Pod characteristics:
  - Several MB of LLC
  - Many cores
- Challenge:
  - Efficiently connect the pieces
  - Maximize PD

[Figure: a pod’s many cores (C) and LLC banks ($), interconnect yet to be chosen]

[MICRO’12]

What do Scale-Out Apps Need?
- Many cores
  - Exploit abundant parallelism
- Modestly-sized LLC
  - Capture large instruction footprint
  - Avoid caching massive & non-reusable data
- Fast interconnect
  - Accelerate accesses to LLC

How to organize a processor for scale-out apps?

[Figure: a sea of cores (C) and a few LLC banks ($) awaiting an interconnect]

Tiled Pod

Mesh NOC
- Nearest-neighbor connectivity
- 5-ported routers
- Distributed LLC
- Low complexity
- Small NOC area
- High LLC access latency

Low cost, low performance

[Figure: 4x4 grid of core (C) + LLC slice ($) tiles connected by a mesh]

Fast Tiled Pod

Flattened Butterfly NOC [Kim, MICRO’07]
- Rich connectivity
- Many-ported routers
- Distributed LLC
- High complexity
- Large NOC area (7x that of a mesh!)
- Low LLC access latency
- k²/2 links per row/column

High cost, high performance

[Figure: 4x4 tiled pod with flattened-butterfly links fully connecting each row and each column]
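The area gap follows directly from link counts. A small sketch, using the standard link-count formulas for a k x k mesh versus a flattened butterfly (the script itself is illustrative):

```python
# Link counts for a k x k on-chip network.
def mesh_links(k: int) -> int:
    return 2 * k * (k - 1)        # one link between each adjacent tile pair

def fbfly_links(k: int) -> int:
    per_line = k * (k - 1) // 2   # full connectivity within a row/column (~k^2/2)
    return 2 * k * per_line       # k rows plus k columns

for k in (4, 8):
    print(k, mesh_links(k), fbfly_links(k))
# k=4: 24 vs 48 links; k=8: 112 vs 448 -- and fbfly links are long,
# so its wiring area grows much faster than the mesh's.
```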

Off-the-shelf designs not ideal
- Flattened butterfly: large area, low latency
- Mesh: small area, high latency

Exploit workload characteristics for efficiency

[Figure: cost vs. latency plane; mesh and flattened butterfly sit at opposite extremes, with the ideal design at low cost and low latency]

How do threads access data?


- Where is the code and data?
  - Code: in the LLC
  - Data: in memory
  - Nothing useful in remote L1s!
- Data access pattern: bilateral
  - Cores exchange traffic with the LLC, not with each other

[Figure: server requests drive cores; code comes from the shared LLC, data from memory]

Access Pattern vs. Traffic Pattern

- Access pattern: bilateral
  - Cores exchange traffic with the LLC, not with each other
- Traffic pattern: all-to-all
  - LLC distributed to all tiles, so every tile exchanges traffic with every tile

[Figure: a tiled layout spreads LLC slices across all tiles, producing all-to-all traffic, versus cores paired with a consolidated LLC, restoring the bilateral pattern]

Turn all-to-all traffic into bilateral to reduce cost!

Roadmap for an “Ideal” Pod

Starting point: fast tiled pod
1. LLC in the center
   - Decouple cores and LLC banks
2. Remove links from the flattened butterfly
   - Leverage the bilateral traffic pattern to lower cost
3. Share and specialize the core-to-LLC interconnect

Step 1: LLC in the Center
- Decouple core and LLC tiles
- Natural fit for the bilateral traffic pattern

[Figure: LLC banks consolidated in the center of the die, ringed by core tiles]

Step 2: Remove unneeded links
- Core-to-core connectivity not needed
  - Less cost, same performance
- Rich intra-LLC connectivity helps performance
  - Expense limited to a fraction of the die

[Figure: flattened-butterfly links among cores removed; rich connectivity kept only among the central LLC banks]

Step 3: Share links and specialize
- Cores share links to the LLC
  - Dedicated core-to-LLC links have poor cost/perf
- Specialize the request/reply (to/from LLC) networks
  - Maximize cost/perf

[Figure: groups of cores sharing links into the central LLC, with separate request and reply networks]

NOC-Out: Request Network
- Tree topology
- Each node: a 2-to-1 flow-controlled mux (fast and cheap)

[Figure: requests from each local group of cores funnel through a tree of 2-to-1 muxes into the LLC network]
NOC
-
Out: Reply Network



Tree topology


Each node: 1
-
to
-
2 flow
-
controlled
demux

C

C

$

C

C

C

$

C

C

C

$

C

C

C

$

C

C

C

C

C

Network

1
-
to
-
2
Demux

fast and cheap

Local

Evaluation Highlights

NOC-Out: FBfly’s performance at 1/10th the cost

[Figure: left, performance normalized to Mesh for Mesh, FBFly, and NOC-Out; right, NOC area (mm²) broken into links, buffers, and crossbars, with FBFly’s network annotated as the area of 8 cores]

Summary


- We have: Scale-out datacenters
  - Vast datasets sharded across many servers
  - Wide range of apps, but quite specific characteristics
  - Want tailored high-throughput / low-TCO processors
- We want: Scale-out Processors
  - Peak-PD pod: cores tightly coupled to a modestly-sized LLC
  - Minimal connectivity: multiple physically decoupled pods
  - Technology scaling for free

Thank You!

Questions?

eurocloudserver.com