CCNoC: On-Chip Interconnects for Cache-Coherent Manycore ...

Oct 31, 2013

CCNoC: On-Chip Interconnects for Cache-Coherent Manycore Server Chips

Ciprian Seiculescu
Stavros Volos
Naser Khosro Pour
Babak Falsafi
Giovanni De Micheli

LSI Integrated Systems Laboratory

NoCs: Major Power Consumer

Move towards manycore
Tiled architectures
Network-on-Chip (NoC)
  Significant power consumer
    40% MIT RAW
    30% Intel Tera-scale
Cache coherent CMP
  Server workloads

[Figure: tiled CMP with 16 tiles, each pairing a core (C) with a cache ($); detail labels: Core, $, Crossbar]

Proposals to Reduce NoC Power

Multiple networks
  Better area and power [Balfour & Dally ICS 2006]
Commercial server workloads
  Traffic patterns are different
Run on cache coherent CMPs
  Strong relation between coherence protocol and NoC
Not optimized for commercial server workload traffic

Contributions

Commercial server workloads
  Optimized for reuse in L1, little sharing
  Full blown coherence protocol in CMPs
  Only some transitions are frequent
Duality in Request/Response message size
CCNoC
  Full advantage of heterogeneity
  Same number of buffers
  16% less power, same performance as Mesh

Outline

Overview
Why CCNoC?
Dual-router design
Evaluation
Conclusions

Dual Router is More Efficient

Dual router
  Two crossbars per routing node
Wires less expensive on-chip
  Use more wires for better performance
Area and power grow faster than connectivity
  Balfour & Dally ICS 2006
Dual router: better performance, power and area

[Figure: one N-bit-wide router compared with a dual router built from two N/2-bit-wide crossbars]
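
The point that area and power grow faster than connectivity can be illustrated with a rough cost model. The sketch below assumes that crossbar cost scales with the square of (ports x width), a common first-order approximation that is not taken from these slides; the function name, the port count, and the example width are hypothetical.

```python
def crossbar_cost(width_bits: int, ports: int = 5) -> int:
    """First-order crossbar cost, ASSUMED proportional to (ports * width)^2."""
    return (ports * width_bits) ** 2


N = 128  # example datapath width in bits (assumed for illustration)

single = crossbar_cost(N)          # one N-bit-wide crossbar
dual = 2 * crossbar_cost(N // 2)   # two N/2-bit-wide crossbars

print(f"dual / single cost ratio: {dual / single:.2f}")
```

Under this assumed quadratic model, the two half-width crossbars together cost half as much as the single full-width one while offering the same total wire width.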

Right Dual Router Design

Avoid protocol level deadlock
  Separate
    Requests
    Responses
  Use Virtual Channels
CCNoC
  Sub-networks
    Request / Response
    No VCs needed
  Same number of buffers
    Buffers are power hungry

[Figure: MIT RAW, buffers compared with crossbar + links; H.S. Wang & L.S. Peh, MICRO 2003]

Protocol Activity

CMPs implement full blown coherence protocol
Some transitions are frequent [Hardavellas ISCA 2009]
  Read clean block
  Evict clean block
  Write to unshared block
Other transitions needed for correctness (infrequent)
  Read dirty block
  Evict dirty
  Write to shared block

Frequent Read Protocol Activity

[Figure: message sequence among Reader, Directory, and Writer. Messages: Read Req, Read Resp, Evict Clean Req. Legend: Short Req, Short Resp, Long Resp]

Frequent Write Protocol Activity

[Figure: message sequence between Writer and Directory. Messages: Fetch/Upgrade Req, Fetch Resp, Upgrade Resp. Legend: Short Req, Short Resp, Long Resp]

Infrequent Read Protocol Activity

[Figure: message sequence among Reader, Directory, and Writer. Messages: Read Req, Read Resp, Downgrade Req, Downgrade Resp. Legend: Short Req, Short Resp, Long Resp]

Infrequent Write Protocol Activity

[Figure: message sequence among Writer, Directory, Reader 1, and Reader 2. Messages: Fetch/Upgrade Req, Fetch Resp, Upgrade Resp, Inv Req (two), Inv Resp (two), Evict Dirty Req. Legend: Short Req, Short Resp, Long Resp]
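
The message names on the four protocol slides can be collected into a small lookup table that tags each one with a network class and a size. This is only a sketch: the slide text does not say which message is short and which is long, so the sizes below rest on the assumption that messages carrying a cache block are long and control-only messages are short. The `MESSAGES` table and the `tally` helper are hypothetical names.

```python
from collections import Counter

# (network class, size) per message type. Sizes are ASSUMPTIONS: messages
# that would carry a cache block are marked long, control-only ones short.
MESSAGES = {
    "Read Req": ("request", "short"),
    "Fetch/Upgrade Req": ("request", "short"),
    "Evict Clean Req": ("request", "short"),
    "Downgrade Req": ("request", "short"),
    "Inv Req": ("request", "short"),
    "Evict Dirty Req": ("request", "long"),    # assumed to carry the dirty block
    "Read Resp": ("response", "long"),
    "Fetch Resp": ("response", "long"),
    "Upgrade Resp": ("response", "short"),
    "Downgrade Resp": ("response", "long"),    # assumed to carry the dirty block
    "Inv Resp": ("response", "short"),
}


def tally(trace):
    """Count messages in a trace by (network class, size)."""
    return Counter(MESSAGES[name] for name in trace)


# The frequent read case from the slides: read a clean block, later evict it.
frequent_read = ["Read Req", "Read Resp", "Evict Clean Req"]
for (net, size), count in sorted(tally(frequent_read).items()):
    print(f"{net:8s} {size:5s} {count}")
```

With this assumed mapping, the frequent read case produces only short requests and long responses, which matches the duality the talk builds on.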

Traffic Analysis

[Figure: Traffic Distribution (0% to 100%) per workload: DB2, Oracle (OLTP); DB2 Mix (DSS); Apache, Zeus (Web); EM3D (Sci); SPEC2K (Mix). Series: Short Req, Long Req, Short Resp, Long Resp]

Request: 93% short
Response: 86% long

CCNoC Router

Request network narrow: optimized for short messages
Response network wide: optimized for long messages

[Figure: router containing a Request Switch and a Response Switch, both attached to the NI]

Previous Work

Balfour et al. ICS 2006
  Better than single large router
  Read/Write traffic
  Same number of reads and writes
Yoon et al. DAC 2010
  Physical channel better than virtual channel
Not optimized for cache coherent CMP
  Running commercial server workloads

Outline

Overview
Why CCNoC?
Dual-router design
Evaluation
Conclusions

Evaluation Methodology

FLEXUS
  Full system simulation
  16 or 8 UltraSPARC III ISA cores
  Split I/D, 64KB L1
  1 or 2 MB L2
ORION 2.0
  Power estimation
  Area estimation

Workloads
  OLTP: TPC-C
    IBM DB2 and Oracle
  DSS: TPC-H
    IBM DB2
    Q1, Q6, Q13, Q16
  Web: SPECweb99
    Apache and Zeus
  Scientific: EM3D
  Multiprogrammed: SPEC2K
    2x: gcc, twolf, art, mcf

Evaluation NoCs

Mesh-128: baseline
  128 bit flit width
Torus: reference
  128 bit flit width
Mesh-176: high performance
  176 bit flit width
CCNoC
  Request: 48 bit flit width
  Response: 128 bit flit width

Switches
  Wormhole flow control
  Input queued
Transmission protocol
  On/Off
Input buffers
  2 entry
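
To see what the flit widths above mean for serialization, the following sketch counts how many flits a short request and a long response would need on each network. The flit widths come from the slide; the message sizes do not. `SHORT_BITS` and `LONG_BITS` are assumptions chosen for illustration (a 48-bit header/address, and that header plus a 64-byte cache block), and `flits` is a hypothetical helper.

```python
def flits(message_bits: int, flit_bits: int) -> int:
    """Number of flits needed to carry a message (ceiling division)."""
    return -(-message_bits // flit_bits)


# ASSUMED message sizes (not from the slides): a short message is a 48-bit
# header/address; a long message adds a 64-byte (512-bit) cache block.
SHORT_BITS = 48
LONG_BITS = 48 + 512

# Flit widths from the slide; single-network designs use one width for both.
NETWORKS = {
    "Mesh-128": {"request": 128, "response": 128},
    "Torus": {"request": 128, "response": 128},
    "Mesh-176": {"request": 176, "response": 176},
    "CCNoC": {"request": 48, "response": 128},
}

for name, widths in NETWORKS.items():
    short_req = flits(SHORT_BITS, widths["request"])
    long_resp = flits(LONG_BITS, widths["response"])
    print(f"{name}: short request = {short_req} flit(s), "
          f"long response = {long_resp} flit(s)")
```

Under these assumed sizes, a short request still fits in a single flit on the 48-bit request network, and a long response needs the same number of flits on the CCNoC response network as on Mesh-128. The cost shows up only for the rare long request, which needs many more flits on the narrow network.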


Performance

[Figure: Normalized IPC (to Torus), axis 0 to 1.2, per workload: DB2, Oracle (OLTP); DB2 Mix (DSS); Apache, Zeus (Web); EM3D (Sci); SPEC2K (Mix). Series: Mesh-128, Mesh-176, CCNoC]

Performance loss: 2% against Torus, 8% against Mesh-176

Power Savings

Power savings: 16% against Mesh-128, 22% against Torus, 38% against Mesh-176

[Figure: Normalized Total Power (%), axis 0 to 1.4, per workload: DB2, Oracle (OLTP); DB2 Mix (DSS); Apache, Zeus (Web); EM3D (Sci); SPEC2K (Mix). Series: Torus, Mesh-128, Mesh-176, CCNoC]
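
The three savings figures can be cross-checked against each other with a little arithmetic. The sketch below takes only the stated percentages, assumes each one means "CCNoC power = (1 - saving) x baseline power" on the same average, and derives how the baselines would then compare with Torus. The variable names are hypothetical and the result is a consistency check, not data from the talk.

```python
# Stated CCNoC power savings against each baseline (from the slide).
savings = {"Mesh-128": 0.16, "Torus": 0.22, "Mesh-176": 0.38}

# ASSUMPTION: each saving s means CCNoC = (1 - s) * baseline, so with CCNoC
# fixed at 1.0 each baseline's power is 1 / (1 - s).
ccnoc_power = 1.0
baseline_power = {name: ccnoc_power / (1.0 - s) for name, s in savings.items()}

# Re-express everything relative to Torus.
torus = baseline_power["Torus"]
relative_to_torus = {name: p / torus for name, p in baseline_power.items()}
relative_to_torus["CCNoC"] = ccnoc_power / torus

for name, value in relative_to_torus.items():
    print(f"{name}: {value:.2f} x Torus power")
```

Under that reading, Mesh-128 would sit at roughly 0.93 of Torus power and Mesh-176 at roughly 1.26, with CCNoC at 0.78.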

Conclusions

Duality in Request/Response traffic
  Request: dominated by short messages
  Response: dominated by long messages
Proposed CCNoC
  Narrow request network
  Wide response network
Showed significant power savings
  22% against Torus
  38% against Mesh-176

Thank you!

Q&A