Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures


Nikos Hardavellas, Northwestern University
Team: M. Ferdman, B. Falsafi, A. Ailamaki
Northwestern, Carnegie Mellon, EPFL


Moore’s Law Is Alive And Well

[Image: a 90nm transistor (Intel, 2005) next to the Swine Flu A/H1N1 virus (CDC), at roughly the same scale]

Process roadmap: 65nm (2007), 45nm (2010), 32nm (2013), 22nm (2016), 16nm (2019)

Device scaling continues for at least another 10 years


Moore’s Law Is Alive And Well

[Chart annotation: “Good days ended Nov. 2002” [Yelick09]]

- The “new” Moore’s Law: 2x cores with every generation
- On-chip cache grows commensurately to supply all cores with data


Larger Caches Are Slower Caches

[Plot: access latency vs. cache size; large caches mean slow access]

Increasing access latency forces caches to be distributed


Cache design trends

As caches become bigger, they get slower. Split the cache into smaller “slices” and balance cache-slice access latency against network latency.


Modern Caches: Distributed

Split the cache into “slices” and distribute them across the die.

[Diagram: cores and L2 slices interleaved across the die]

Data Placement Determines Performance



[Diagram: tiled multicore; each tile pairs a core with an L2 cache slice]

Goal: place data on chip close to where they are used


Our proposal: R-NUCA (Reactive Nonuniform Cache Architecture)

- Data may exhibit arbitrarily complex behaviors...
- ...but few that matter!
- Learn the behaviors at run time & exploit their characteristics
- Make the common case fast, the rare case correct
- Resolve conflicting requirements



Reactive Nonuniform Cache Architecture

- Cache accesses can be classified at run time; each class is amenable to a different placement
- Per-class block placement: simple, scalable, transparent
- No need for HW coherence mechanisms at the LLC
- Up to 32% speedup (17% on average); within 5% on average of an ideal cache organization
- Rotational interleaving: data replication and fast single-probe lookup

[Hardavellas et al., ISCA 2009]
[Hardavellas et al., IEEE Micro Top Picks 2010]


Outline

- Introduction
- Why do Cache Accesses Matter?
- Access Classification and Block Placement
- Reactive NUCA Mechanisms
- Evaluation
- Conclusion


Cache accesses dominate execution

[Chart: execution-time breakdown on a 4-core CMP running DSS (TPC-H on DB2, 1GB database); lower is better, with an “Ideal” bar for reference. Hardavellas et al., CIDR 2007]

Bottleneck shifts from memory stalls to L2-hit stalls


How much do we lose?

[Chart: throughput of a 4-core CMP running DSS (TPC-H on DB2, 1GB database); higher is better]

We lose half the potential throughput


Outline

- Introduction
- Why do Cache Accesses Matter?
- Access Classification and Block Placement
- Reactive NUCA Mechanisms
- Evaluation
- Conclusion


Terminology: Data Types

[Diagram: three sharing patterns]
- Private: read or written by a single core
- Shared Read-Only: read by multiple cores
- Shared Read-Write: read and written by multiple cores


Distributed shared L2

[Diagram: tiled multicore; every block maps to exactly one slice across the die]

- Home slice: address mod <#slices>
- Unique location for any block (private or shared)
- Maximum capacity, but slow access (30+ cycles)
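For illustration, a minimal C sketch of this interleaving (the block size and slice count are assumed values, not from the talk):

```c
/* Home-slice selection in a distributed shared L2:
 * block address mod #slices. */
#include <stdint.h>

#define BLOCK_BITS 6    /* 64-byte cache blocks (assumption) */
#define NUM_SLICES 16   /* one slice per tile, 16 tiles (assumption) */

static inline unsigned shared_home_slice(uint64_t paddr)
{
    return (unsigned)((paddr >> BLOCK_BITS) % NUM_SLICES);
}
```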


Distributed private L2

[Diagram: tiled multicore; each core fills its own slice]

- On every access, allocate the data at the local L2 slice
- Fast access to core-private data


Distributed private L2: shared-RO access

[Diagram: the same read-only blocks cached in many slices]

- On every access, allocate the data at the local L2 slice
- Shared read-only data are replicated across L2 slices
- Wastes capacity due to replication


Distributed private L2: shared-RW access

[Diagram: a read-write block cached in multiple slices; coherence is maintained via indirection through a directory (dir)]

- On every access, allocate the data at the local L2 slice
- Shared read-write data require coherence via indirection through a directory
- Slow for shared read-write accesses
- Wastes capacity (directory overhead) and bandwidth


Conventional Multi-Core Caches

Shared: address-interleave blocks
  + high capacity
  - slow access

Private: each block cached locally
  + fast access (local)
  - low capacity (replicas)
  - coherence via indirection (distributed directory)

We want: high capacity (shared) + fast access (private)


Where to Place the Data?

Close to where they are used!
- Accessed by a single core: migrate locally
- Accessed by many cores: replicate (?)
  - If read-only, replication is OK
  - If read-write, coherence becomes a problem
  - If reuse is low, evenly distribute across the sharers

[Decision space: read-only blocks with many sharers are replicated; read-write blocks either migrate (few sharers) or are shared at an address-interleaved location (many sharers)]
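A minimal C sketch of this decision space (the enum names and the single-sharer test are illustrative assumptions):

```c
/* Map an access class and sharer count to a placement policy. */
enum access_class { PRIVATE_DATA, SHARED_RO, SHARED_RW };
enum placement    { MIGRATE_LOCAL, REPLICATE, ADDR_INTERLEAVE };

enum placement place(enum access_class cls, int sharers)
{
    if (cls == PRIVATE_DATA || sharers <= 1)
        return MIGRATE_LOCAL;     /* accessed by a single core */
    if (cls == SHARED_RO)
        return REPLICATE;         /* read-only: replication is safe */
    return ADDR_INTERLEAVE;       /* read-write, low reuse: distribute */
}
```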


Methodology

Flexus: full-system, cycle-accurate timing simulation
[Hardavellas et al., SIGMETRICS PER 2004; Wenisch et al., IEEE Micro 2006]

Model parameters:
- Tiled CMP, LLC = L2
- Server/scientific workloads: 16 cores, 1MB/core
- Multi-programmed workloads: 8 cores, 3MB/core
- OoO cores, 2GHz, 96-entry ROB
- Folded 2D torus: 2-cycle router, 1-cycle link
- 45ns memory

Workloads:
- OLTP: TPC-C 3.0, 100 warehouses (IBM DB2 v8, Oracle 10g)
- DSS: TPC-H queries 6, 8, 13 (IBM DB2 v8)
- Web: SPECweb99 on Apache 2.0
- Multiprogrammed: Spec2K
- Scientific: em3d


Cache Access Classification Example

[Bubble chart: each bubble is a set of cache blocks shared by x cores; the size of a bubble is proportional to its % of L2 accesses; the y axis shows the % of blocks in the bubble that are read-write]


Cache Access Clustering

[Charts: % of read-write blocks vs. number of sharers, for server apps and for scientific/MP apps]

Accesses naturally form 3 clusters:
- R/W blocks with few sharers: migrate locally
- R/W blocks with many sharers: share (address-interleave)
- R/O blocks with many sharers (instructions): replicate

Instruction Replication

[Diagram: instruction blocks distributed within clusters of neighboring slices; the clusters are replicated across the die]

- The instruction working set is too large for one cache slice
- Distribute within a cluster of neighbors, replicate across clusters


Reactive NUCA in a nutshell

To place cache blocks, we first need to classify them. Classify accesses:
- private data: like the private scheme (migrate)
- shared data: like the shared scheme (interleave)
- instructions: controlled replication (middle ground)


Outline

- Introduction
- Access Classification and Block Placement
- Reactive NUCA Mechanisms
- Evaluation
- Conclusion


Classification Granularity

- Per-block classification:
  - High area/power overhead (would cut L2 size by half)
  - High latency (indirection through a directory)
- Per-page classification (utilizing the OS page table):
  - Persistent structure
  - The core accesses the page table on every access anyway (via the TLB)
  - Utilizes already existing SW/HW structures and events
  - Page classification is accurate (<0.5% error)

Classify entire data pages; use the page table/TLB for bookkeeping


Classification Mechanisms

- Instruction classification: all accesses from the L1-I (per-block)
- Data classification: private/shared, per page, at TLB-miss time


Data Classification at TLB Misses

[Diagram: OS-driven classification]
- On the 1st access: core i takes a TLB miss on Ld A, and the OS marks page A as private to i
- On an access by another core: core j takes a TLB miss on Ld A, and the OS reclassifies page A as shared

Bookkeeping through the OS page table and TLB
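A minimal C sketch of this OS-side classification (structure and helper names are assumptions, not the authors' OS code):

```c
enum page_class { UNTOUCHED, PRIVATE_PAGE, SHARED_PAGE };

struct pte {
    enum page_class cls;
    int owner;                  /* core that first touched the page */
    /* translation fields omitted */
};

/* Hypothetical helper: invalidate the owner's TLB entry (and flush its
 * cached blocks of the page) so the page can re-home as shared. */
static void tlb_shootdown(int owner, struct pte *p) { (void)owner; (void)p; }

/* Invoked when core `core` takes a TLB miss on the page mapped by `p`. */
void classify_on_tlb_miss(struct pte *p, int core)
{
    switch (p->cls) {
    case UNTOUCHED:             /* 1st access ever: private to this core */
        p->cls   = PRIVATE_PAGE;
        p->owner = core;
        break;
    case PRIVATE_PAGE:
        if (p->owner != core) { /* a second core touched the page */
            p->cls = SHARED_PAGE;
            tlb_shootdown(p->owner, p);
        }
        break;
    case SHARED_PAGE:           /* stays shared */
        break;
    }
}
```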

Page Table and TLB Extensions


Page table entry: vpage | ppage | P/S/I (2 bits) | L2 id (log2(n) bits)
TLB entry: vpage | ppage | P/S (1 bit)

- The core accesses the page table on every access anyway (via the TLB)
- Pass information from the “directory” (page table) to the core
- Utilize already existing SW/HW structures and events

Page granularity allows simple + practical HW
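A minimal C sketch of these entry formats (field names and LOG2_NSLICES are assumptions; the widths follow the slide):

```c
#include <stdint.h>

#define LOG2_NSLICES 4              /* log2(n) for n = 16 slices (assumption) */

struct rnuca_pte {                  /* page table entry */
    uint64_t vpage;
    uint64_t ppage;
    unsigned cls   : 2;             /* P/S/I: private, shared, instructions */
    unsigned l2_id : LOG2_NSLICES;  /* owning slice, e.g., for private pages */
};

struct rnuca_tlb_entry {            /* TLB entry at the core */
    uint64_t vpage;
    uint64_t ppage;
    unsigned shared : 1;            /* P/S: one bit suffices at the core */
};
```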


Data Class Bookkeeping and Lookup

Physical address: tag | cache index | offset

Private page: page table entry vpage | ppage | P | L2 id; TLB entry vpage | ppage | P
Shared page: page table entry vpage | ppage | S | L2 id; TLB entry vpage | ppage | S

- Private data: place in the local L2 slice
- Shared data: place in the aggregate L2 (address-interleaved)
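Putting the two cases together, a minimal lookup sketch in C (assumed names; the interleaving function repeats the earlier sketch):

```c
#include <stdint.h>

#define BLOCK_BITS 6    /* 64-byte cache blocks (assumption) */
#define NUM_SLICES 16   /* 16 slices (assumption) */

static inline unsigned shared_home_slice(uint64_t paddr)
{
    return (unsigned)((paddr >> BLOCK_BITS) % NUM_SLICES);
}

/* `shared` is the P/S bit from the TLB entry of the page holding paddr. */
unsigned l2_target_slice(int shared, uint64_t paddr, unsigned local_slice)
{
    return shared ? shared_home_slice(paddr)  /* aggregate L2, interleaved */
                  : local_slice;              /* private data: local slice */
}
```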


Coherence: No Need for HW Mechanisms at LLC

[Diagram: private data served from the local slice; shared data at address-interleaved slices]

Reactive NUCA placement guarantee:
- Private data: local slice
- Shared data: address-interleaved
- Each R/W datum lives in a unique & known location
- Each slice caches the same blocks on behalf of any cluster

Fast access, eliminates HW overhead, SIMPLE


Instructions Lookup: Rotational Interleaving

[Diagram: tiles labeled with rotational IDs (RIDs) 0-3; RIDs advance by +1 along a row and by +log2(k) between rows, so a tile and its 3 nearest neighbors cover all 4 RIDs]

- Size-4 clusters: local slice + 3 neighbors
- Lookup: address bits (e.g., of PC 0xfa480) select the target RID, which maps to exactly one cluster member
- Fast access (nearest-neighbor, simple lookup)
- Balance access latency with capacity constraints
- Equal capacity pressure at overlapped slices
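A minimal C sketch of size-4 rotational interleaving (the RID formula and the 4x4 torus size are assumptions consistent with the +1-per-column, +log2(4) = +2-per-row pattern; not the authors' exact scheme):

```c
#include <stdint.h>

#define K          4   /* cluster size */
#define BLOCK_BITS 6   /* 64-byte cache blocks (assumption) */
#define DIM        4   /* 4x4 = 16 tiles; wrap-around models the torus */

/* Rotational ID of the tile at (x, y): +1 per column, +2 per row. */
static inline unsigned rid(unsigned x, unsigned y)
{
    return (x + 2u * y) % K;
}

/* Slice holding the instruction block at paddr, looked up from the core
 * at (x, y): the member of its size-4 cluster whose RID matches the
 * address bits, reachable in a single probe. */
static void instr_lookup(uint64_t paddr, unsigned x, unsigned y,
                         unsigned *home_x, unsigned *home_y)
{
    unsigned want = (unsigned)(paddr >> BLOCK_BITS) & (K - 1); /* target RID */
    unsigned diff = (want - rid(x, y)) & (K - 1);

    *home_x = x;
    *home_y = y;
    switch (diff) {
    case 0:                                break; /* local slice */
    case 1: *home_x = (x + 1) % DIM;       break; /* east neighbor  (+1) */
    case 2: *home_y = (y + 1) % DIM;       break; /* south neighbor (+2) */
    case 3: *home_x = (x + DIM - 1) % DIM; break; /* west neighbor  (-1) */
    }
}
```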


Outline

- Introduction
- Access Classification and Block Placement
- Reactive NUCA Mechanisms
- Evaluation
- Conclusion


Evaluation

[Chart: performance of the Shared (S), R-NUCA (R), and Ideal (I) organizations on OLTP, DSS, Web, and MIX workloads]

- vs. shared: same performance for Web and DSS; 17% faster for OLTP and MIX
- vs. private: 17% faster for OLTP, Web, and DSS; same for MIX

Delivers robust performance across workloads


Conclusions

- Data may exhibit arbitrarily complex behaviors... but few that matter!
- Learn the behaviors that matter at run time; make the common case fast, the rare case correct
- Reactive NUCA: near-optimal cache block placement
  - Simple, scalable, low-overhead, transparent, no coherence at the LLC
  - Robust performance: matches the best alternative or beats it by 17%; up to 32% speedup
  - Near-optimal placement (within 5% on avg. of an ideal organization)

For more information: http://www.eecs.northwestern.edu/~hardav/


Thank You!

N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Near-Optimal Cache Block Placement with Reactive Nonuniform Cache Architectures. IEEE Micro Top Picks, Vol. 30(1), pp. 20-28, January/February 2010.

N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches. ISCA 2009.


BACKUP SLIDES

Why Are Caches Growing So Large?

- Increasing number of cores: the cache grows commensurately (fewer but faster cores have the same effect)
- Increasing datasets: growing faster than Moore’s Law!
- Power/thermal efficiency: caches are “cool” while cores are “hot”, so it’s easier to fit more cache in a power budget
- Limited bandwidth: a larger cache keeps more data on chip, so off-chip pins are used less frequently
© Hardavellas

38

Backup Slides

ASR

ASR vs. R-NUCA Configurations

                          ASR-1      ASR-2    R-NUCA
Core Type                 In-Order   OoO      OoO
L2 Size (MB)              4          16       16
Memory latency (cycles)   150        500      90
Local L2 (cycles)         12         20       16
Avg. Shared L2 (cycles)   25         44       22
Memory / Local L2         12.5×      25.0×    5.6×
Avg. Shared / Local L2    2.1×       2.2×     +38%


ASR design space search

[Chart: results of the ASR design-space search]

Backup Slides

Prior Work

Prior Work

- Several proposals for CMP cache management: ASR, cooperative caching, victim replication, CMP-NuRapid, D-NUCA
- ...but they suffer from shortcomings:
  - complex, high-latency lookup/coherence
  - don’t scale
  - lower effective cache capacity
  - optimize only for a subset of accesses

We need a simple, scalable mechanism for fast access to all data

Shortcomings of prior work

- L2-Private: wastes capacity; high latency (3 slice accesses + 3 hops on shared data)
- L2-Shared: high latency
- Cooperative Caching: doesn’t scale (centralized tag structure)
- CMP-NuRapid: high latency (pointer dereference, 3 hops on shared data)
- OS-managed L2: wastes capacity (migrates all blocks); spilling to neighbors is useless (all cores run the same code)

Shortcomings of Prior Work

- D-NUCA: no practical implementation (how is lookup done?)
- Victim Replication: high latency (like L2-Private); wastes capacity (the home slice always stores the block)
- Adaptive Selective Replication (ASR): high latency (like L2-Private); capacity pressure (replicates at slice granularity); complex (4 separate HW structures to bias a coin)

Backup Slides

Classification and Lookup

Data Classification Timeline

[Diagram: classification events over time]
- Core i, Ld A, TLB miss: the OS marks the page private to i (page table: vpage | ppage | i | P); core i allocates A locally
- Core j (j ≠ i), Ld A, TLB miss: the OS reclassifies the page as shared (vpage | ppage | x | S), invalidates the entry in TLBi, and core i evicts A
- Core k, Ld A: the request goes to A’s interleaved home, which allocates A and replies

Fast & simple lookup for data

Misclassifications at Page Granularity

[Chart: fraction of accesses from pages with multiple access types]

- A page may service multiple access types, causing some access misclassifications
- But one type always dominates a page’s accesses

Classification at page granularity is accurate

Backup Slides

Placement


Private Data Placement

- Store in the local L2 slice (as in a private cache)
- Spill to neighbors if the working set is too large? NO!!! Each core runs similar threads

Private Data Working Set

- OLTP: small per-core working set (3MB / 16 cores ≈ 200KB per core)
- Web: primary working set <6KB/core; the rest accounts for <1.5% of L2 references
- DSS: policy doesn’t matter much (>100MB working set, <13% of L2 references, very low reuse on private data)


Shared Data Placement

- Address-interleave in the aggregate L2 (as in a shared cache)
- Shared data are read-write with large working sets and low reuse: a block is unlikely to still be in the local slice when it is reused
- Also, the next sharer is random [WMPI’04]

Shared Data Working Set

[Chart: shared-data working-set sizes]

Instruction Placement

- Share in clusters of neighbors, replicate across clusters
- The instruction working set is too large for one slice
- Slices store private & shared data too!
- 4 L2 slices provide sufficient capacity

Instructions Working Set

[Chart: instruction working-set sizes]

Backup Slides

Rotational Interleaving

Instruction Classification and Lookup

[Diagram: tiled multicore; instruction blocks shared within clusters of neighbors and replicated across clusters]

- Identification: all accesses from the L1-I
- But the working set is too large to fit in one cache slice

Share within a neighbors’ cluster, replicate across clusters


Rotational Interleaving

[Diagram: 32-tile grid with TileIDs 0-31; each tile also carries a rotational ID (RID), advancing by +1 along a row and by +log2(k) between rows]

- Fast access (nearest-neighbor, simple lookup)
- Equalize capacity pressure at overlapping slices
- Nearest-neighbor size-8 clusters