Galaxy: High-Performance Energy-Efficient Multi-Chip Architectures Using Photonic Interconnects


Nov 2, 2013 (3 years and 11 months ago)







Nikos Hardavellas

PARAG@N, Parallel Architecture Group

Northwestern University


Team: Y. Demir, P. Yan, S. Song, J. Kim, G. Memik

Chip Power Scaling

© Hardavellas



Chip power does not scale

[Azizi 2010]

Voltage Scaling Has Slowed

In the last decade: 13x transistors but 30% lower voltage

Cannot run all transistors fast enough

[Figure: scaling factor vs. year, 2003-2015. Series: Transistor Scaling (Moore's Law)]

Pin Bandwidth Scaling


[TU Berlin]


Cannot feed cores with data fast enough to keep them busy

[Figure: scaling factor vs. year, 2003-2015. Series: Transistor Scaling (Moore's Law), Pin Bandwidth]

Data Scaling


SPEC, TPC dataset growth: faster than Moore

Same trends in scientific, personal computing

Large Hadron Collider: March’11: 1.6 PB data (Tier-1)

Large Synoptic Survey Telescope: 30 TB/night, 2x Sloan Digital Sky Surveys/day

Sloan: more data than entire history of astronomy before it



[Figure: scaling factor vs. year, 2004-2019. Series: OS Dataset Scaling (Myhrvold's Law), Transistor Scaling (Moore's Law), TPC Dataset (Historic)]

More data → more computing power to process them

Galaxy: Optically-Connected Disintegrated Processors

Physical constraints limit single-chip designs

Area, Yield, Power, Bandwidth

Multi-chip designs break free of these limitations

Processor disintegration

Macro-chip integration


[Pan, WINDS 2010]

Outline

Introduction

Background

Galaxy Architecture

Experimental Methodology

Results

Sensitivity Studies

Single-Chip Comparisons (Processor Disintegration)

Multi-Chip Comparisons (Macrochip Integration)

Thermal Modeling

Conclude

Overview of Other Research

Nanophotonic Components

[Figure labels: off-chip laser source, coupler, resonant modulators, resonant detectors, Ge-doped, waveguide]

Selective: couple optical energy of a specific wavelength

Modulation and Detection

[Figure: example bit streams 11010101 and 10001011]

16-64 wavelengths DWDM

5-20 μm waveguide pitch

10 Gbps per link

8 Tbps/mm bandwidth density or more!
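The 8 Tbps/mm figure can be reproduced from the other three numbers on the slide. The sketch below is only an illustration: it assumes each wavelength carries one 10 Gbps link and that waveguides sit side by side in a single layer, and the function name is hypothetical.

```python
def bandwidth_density_tbps_per_mm(wavelengths: int,
                                  gbps_per_link: float,
                                  pitch_um: float) -> float:
    """Aggregate bandwidth per millimeter of edge, in Tbps/mm.

    Assumes one link per wavelength and a single layer of waveguides.
    """
    waveguides_per_mm = 1000.0 / pitch_um            # 1 mm = 1000 um
    gbps_per_waveguide = wavelengths * gbps_per_link
    return waveguides_per_mm * gbps_per_waveguide / 1000.0  # Gbps -> Tbps


# Conservative corner of the slide's ranges: 16 wavelengths, 20 um pitch.
low = bandwidth_density_tbps_per_mm(16, 10.0, 20.0)
# Aggressive corner: 64 wavelengths, 5 um pitch.
high = bandwidth_density_tbps_per_mm(64, 10.0, 5.0)
print(f"{low} Tbps/mm to {high} Tbps/mm")
```

Under these assumptions the conservative corner gives 8 Tbps/mm, which is consistent with the "or more" on the slide.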



Galaxy Architecture


Routing Example


Galaxy Architecture


Galaxy MWSR Optical Crossbar

More energy-efficient than SWMR at that scale


MWSR avoids broadcast bus, but requires arbitration

Token-Based Arbitration

8 cycles on average for token arbitration (5 chiplets)
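The slides do not describe the arbitration protocol in detail. As a rough illustration of why token waits grow with the number of chiplets, the sketch below models a single token that circulates in one direction and computes the average wait. The hop latency, the uniform-request assumption, and both function names are hypothetical and not taken from the talk.

```python
def token_wait_cycles(token_at: int, requester: int,
                      n_chiplets: int, hop_cycles: int) -> int:
    """Cycles until a one-directional circulating token reaches the requester."""
    hops = (requester - token_at) % n_chiplets
    return hops * hop_cycles


def average_wait(n_chiplets: int, hop_cycles: int) -> float:
    """Average wait over all (token position, requester) pairs, assumed equally likely."""
    total = sum(
        token_wait_cycles(t, r, n_chiplets, hop_cycles)
        for t in range(n_chiplets)
        for r in range(n_chiplets)
    )
    return total / (n_chiplets * n_chiplets)


# Five chiplets, one cycle per hop (placeholder latency).
print(average_wait(5, 1))
```

In this simplified model the average wait is (n - 1) / 2 hops. The 8-cycle average reported on the slide also depends on hop latency, contention, and token hold time, none of which are modeled here.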

Dense Off-Chip Coupling

Dense optical fiber array [Lee, OSA/OFC/NFOEC 2010]

~3.8 dB loss, 8 Tbps/mm demonstrated

Misalignment <0.7 μm, 0.4 μm, 0.7 μm> → loss <1 dB

Loss comparable to optical proximity couplers

Nanophotonic Parameters


Architectural Parameters


Modeling Infrastructure

[Figure labels: 3D-stack model; SimFlex sampling, 95% confidence; photonic-layer ring heating]


Load-Latency Curves


16 tokens provide optimal buffer depth

Laser Power Sensitivity to Optical Parameters

[Figure panels: Coupler Loss; Off-Ring Loss; Waveguide & Filter Drop Loss; Modulator Insertion Loss]

Highly sensitive to coupler loss, insensitive to other losses
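A standard optical link budget gives one way to see why coupler loss can dominate: losses in dB add along the path, and the required laser power grows exponentially with the total. The sketch below is an assumption-laden illustration, not the model used in the talk. Only the 3.8 dB coupler figure comes from the slides; the other loss values, the detector sensitivity, and the number of coupler crossings are placeholders.

```python
def required_laser_power_mw(sensitivity_mw: float,
                            losses_db: dict[str, float]) -> float:
    """Laser power needed so that `sensitivity_mw` still reaches the detector."""
    total_loss_db = sum(losses_db.values())   # dB losses add along the path
    return sensitivity_mw * 10 ** (total_loss_db / 10)


# Assumed: light crosses a coupler three times (laser to chip, chip to fiber,
# fiber to chip). This count is a guess for illustration only.
COUPLER_CROSSINGS = 3

budget = {
    "couplers": COUPLER_CROSSINGS * 3.8,   # 3.8 dB per coupler, from the slides
    "waveguide_and_filter_drop": 1.5,      # placeholder
    "modulator_insertion": 0.5,            # placeholder
    "off_ring": 0.5,                       # placeholder
}
baseline = required_laser_power_mw(0.01, budget)

# Make every coupler 1 dB worse and compare.
worse = dict(budget, couplers=COUPLER_CROSSINGS * 4.8)
ratio = required_laser_power_mw(0.01, worse) / baseline
print(f"+1 dB per coupler -> {ratio:.2f}x laser power")
```

Because a loss that is paid several times per path is multiplied in the exponent, a small per-coupler change moves the laser power far more than the same change in a loss paid once.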

Sensitivity to Fiber Density










116 mm² chiplets

43 mm along the chip edge

Enough room for 172 fibers @ 250 μm pitch


128 fibers: within 3% of max performance
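The fiber count follows from simple geometry. The sketch below assumes a square chiplet and reads "43 mm along the chip edge" as its full perimeter; that reading is an assumption, though it reproduces the slide's numbers. The function name is hypothetical.

```python
import math


def fibers_along_perimeter(area_mm2: float, pitch_um: float) -> tuple[float, int]:
    """Return (perimeter in mm, number of fibers that fit) for a square chiplet."""
    side_mm = math.sqrt(area_mm2)          # assumes a square die
    perimeter_mm = 4 * side_mm
    fibers = int(perimeter_mm * 1000 // pitch_um)   # whole fibers only
    return perimeter_mm, fibers


perimeter, fibers = fibers_along_perimeter(116, 250)
print(f"perimeter = {perimeter:.1f} mm, fibers = {fibers}")
```

With 172 fibers available under these assumptions, the 128-fiber configuration leaves spare room along the edge.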


Performance Against “Unlimited” Designs

Unlimited power (max speed of design, irrespective of temp.)


Mesh_20MC & Corona_20MC


Also unlimited bandwidth (20 MCs per chip, 5x more pins)


Galaxy matches the performance of “unlimited” designs

Performance Against Realistic Designs










Realistic: within power and bandwidth envelopes


Galaxy chiplets within 66.2 °C

chiplets run at max speed


Galaxy: 2.2x speedup on average (3.4 max)

Energy-Delay Product

Cool chiplets minimize leakage

Galaxy: 2.4x-2.8x smaller EDP on average (6.8x max)
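For reference, energy-delay product (EDP) is energy multiplied by execution time, so a design improves EDP by running faster, by using less energy, or both. The sketch below uses made-up energy and delay values purely to show the arithmetic; none of the inputs are measurements from the talk.

```python
def edp(energy_j: float, delay_s: float) -> float:
    """Energy-delay product in joule-seconds."""
    return energy_j * delay_s


def edp_improvement(base_energy_j: float, base_delay_s: float,
                    new_energy_j: float, new_delay_s: float) -> float:
    """How many times smaller the new design's EDP is than the baseline's."""
    return edp(base_energy_j, base_delay_s) / edp(new_energy_j, new_delay_s)


# Hypothetical numbers: 2.2x faster and 10% less energy.
print(f"{edp_improvement(10.0, 2.2, 9.0, 1.0):.2f}x smaller EDP")

# At equal energy, a speedup of s improves EDP by exactly s.
print(f"{edp_improvement(10.0, 2.2, 10.0, 1.0):.2f}x smaller EDP")
```

Any EDP gain beyond the speedup itself has to come from lower energy, which is where reduced leakage from cooler chiplets would contribute.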


Comparison Against Multi-Chip Alternatives

[Figure label: Fiber]

Galaxy: 2.5x over Oracle Macrochip (6.8x max)

Tapered vs. Optical Proximity Couplers

6x less laser power than Oracle Macrochip with demonstrated couplers


80-core 5-chiplet Galaxy Thermal CFD Modeling

8 cm spacing allows cooling with cheap passive heatsinks

[Figure label: 88.2 °C]

9-chiplet Dense Array (Oracle Macrochip)

Tight arrangement points to liquid cooling requirement

[Figure label: 249 °C]

9-chiplet Galaxy 2D

Cooling 9 chiplets with passive heatsinks

[Figure label: 110 °C]

9-chiplet Galaxy 3D

Flexible fibers allow “virtual chip” to break free of 2D planar designs

[Figure label: 83.6 °C]

Galaxy Summary



“Virtual chips” with the performance of unlimited designs

Breaks free of typical physical constraints

Large aggregate area

Improved yield (break-even point: 60% yield for photonics)

Tb/s/mm bandwidth density

Pushes back power wall

Processor disintegration

2.2x avg. speedup (3.4 max)

2.4x-2.8x avg. smaller EDP (6.8x max)

Macrochip integration

2.5x speedup over Oracle Macrochip (6.8x max)

6x more power-efficient links


Energy is Shaping the IT Industry

#1 of Grand Challenges for Humanity in the Next 50 Years [Smalley Institute for Nanoscale Research and Technology, Rice U.]

Computing worldwide: ~408 TWh in 2010 [Gartner]

Datacenter energy consumption in US: ~150 TWh in 2011 [EPA]

3.8% of domestic power generation, $15B

CO₂-equiv. emissions ≈ Airline Industry (2%)

Carbon footprint of world’s data centers ≈ Czech Republic

Exascale @ 20MW: 200x lower energy/instr. (2nJ → 10pJ)

3% of the output of an average nuclear plant!

10% annual growth on installed computers worldwide [Gartner]


Exponential increase in energy consumption


Overall Focus: Energy-Efficient Computing

Integer add: 0.5pJ; FP-FMA: 50pJ. Where does energy go?

Data movement: 1200pJ across 400mm² chip, 16000pJ memory

Elastic caches: minimize data transfers through adapting caches to workload demands [ISCA’09, IEEEMicro’10, DATE’12]

Processing: ~1500pJ to schedule the operation

SeaFire: specialized computing on dark silicon to eliminate general-purpose computing’s overheads [IEEEMicro’11, USENIX-Login’11]

Circuits: wide voltage guardbands

Low voltages, process variation → timing errors → computing errors

Elastic fidelity: allow errors at select code/data segments to save energy while maintaining fidelity contract with user [CoRR abs/1111.4279]

Chips fundamentally limited by physical constraints. Need to break free.

Galaxy: processor disintegration/macrochip integration using photonic interconnects [WINDS’10]
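To put the per-operation energy figures in proportion, the sketch below computes what fraction of the total goes to the arithmetic itself. It assumes, purely for illustration, that each operation pays one scheduling cost and one data-movement cost; real workloads amortize these differently. The dictionary keys and function name are hypothetical, while the picojoule values are the ones quoted on the slide.

```python
# Energy per event in picojoules, as quoted on the slide.
ENERGY_PJ = {
    "integer_add": 0.5,
    "fp_fma": 50.0,
    "on_chip_data_movement": 1200.0,   # across a 400 mm^2 chip
    "memory_access": 16000.0,
    "scheduling": 1500.0,
}


def useful_fraction(op: str, overheads: list[str]) -> float:
    """Share of total energy spent on the operation itself.

    Assumes each listed overhead is paid exactly once per operation.
    """
    useful = ENERGY_PJ[op]
    total = useful + sum(ENERGY_PJ[name] for name in overheads)
    return useful / total


on_chip = useful_fraction("fp_fma", ["scheduling", "on_chip_data_movement"])
off_chip = useful_fraction("fp_fma", ["scheduling", "memory_access"])
print(f"FP-FMA, operand on chip: {on_chip:.1%} of energy is arithmetic")
print(f"FP-FMA, operand in memory: {off_chip:.1%} of energy is arithmetic")
```

Under this one-overhead-per-operation assumption the arithmetic accounts for only a few percent of the energy at best, which is the motivation for attacking data movement and scheduling overheads separately.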

Thank You!


Overcoming Data Movement and Processing Overheads


Elastic caches: adapt cache to workload’s demands

Significant energy on data movements and coherence requests

Co-locate data, metadata, and computation

Decouple address from placement location

Capitalize on existing OS events → simplify hardware

Cut on-chip interconnect traffic by half

SeaFire: specialized computing on dark silicon

Repurpose dark silicon to implement specialized cores

Application cherry-picks a few cores, rest of chip is powered off

Vast unused area → many specialized cores → likely to find good matches

12x lower energy (conservative)


Overcoming Voltage Guardbands

Elastic fidelity: selectively trade accuracy for energy

We don’t always need 100% accuracy, but HW always provides it

Language constructs specify required fidelity for code/data segments

Steer computation to exec/storage units with appropriate fidelity and lower voltage

35% lower energy

[Figure panels: No errors; 10% errors]