Galaxy: High-Performance Energy-Efficient Multi-Chip Architectures Using Photonic Interconnects
Nikos Hardavellas
PARAG@N – Parallel Architecture Group
Northwestern University
Team: Y. Demir, P. Yan, S. Song, J. Kim, G. Memik
Chip Power Scaling
© Hardavellas
Chip power does not scale [Azizi 2010]
Voltage Scaling Has Slowed
In last decade: 13x transistors but 30% lower voltage
Cannot run all transistors fast enough
[Figure: Scaling Factor (0-14) vs. Year (2003-2015); series: Transistor Scaling (Moore's Law)]
Pin Bandwidth Scaling
[TU Berlin]
Cannot feed cores with data fast enough to keep them busy
[Figure: Scaling Factor (0-14) vs. Year (2003-2015); series: Transistor Scaling (Moore's Law), Pin Bandwidth]
Data Scaling
• SPEC, TPC datasets growth: faster than Moore
• Same trends in scientific, personal computing
• Large Hadron Collider, March’11: 1.6 PB data (Tier-1)
• Large Synoptic Survey Telescope: 30 TB/night, 2x Sloan Digital Sky Surveys/day
Sloan: more data than entire history of astronomy before it
[Figure: Scaling Factor (0-20) vs. Year (2004-2019); series: OS Dataset Scaling (Myhrvold's Law), Transistor Scaling (Moore's Law), TPC Dataset (Historic)]
More data → more computing power to process them
Galaxy: Optically-Connected Disintegrated Processors
• Physical constraints limit single-chip designs
  Area, Yield, Power, Bandwidth
• Multi-chip designs break free of these limitations
  Processor disintegration
  Macro-chip integration
[Pan, WINDS 2010]
Outline
• Introduction
➔ Background
• Galaxy Architecture
• Experimental Methodology
• Results
  Sensitivity Studies
  Single-Chip Comparisons (Processor Disintegration)
  Multi-Chip Comparisons (Macrochip Integration)
  Thermal Modeling
• Conclude
• Overview of Other Research
Nanophotonic Components
[Figure labels: off-chip laser source, coupler, resonant modulators, resonant detectors (Ge-doped), waveguide]
Selective: couple optical energy of a specific wavelength
Modulation and Detection
[Figure: example bit streams 11010101 and 10001011]
16-64 wavelengths DWDM
5-20 μm waveguide pitch
10 Gbps per link
8 Tbps/mm bandwidth density or more!
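The bandwidth-density figure can be checked with a short calculation. This is a minimal sketch, assuming that "10 Gbps per link" means 10 Gbps per wavelength and that waveguides sit side by side at a uniform pitch. The function name and its parameters are illustrative and do not come from the slides.

```python
def bandwidth_density_tbps_per_mm(wavelengths, gbps_per_wavelength, pitch_um):
    """Aggregate bandwidth per millimetre of edge for parallel DWDM waveguides."""
    # Assumption: one waveguide every `pitch_um` micrometres.
    waveguides_per_mm = 1000.0 / pitch_um
    total_gbps = waveguides_per_mm * wavelengths * gbps_per_wavelength
    return total_gbps / 1000.0  # Gbps -> Tbps


# Conservative end of the quoted ranges: 16 wavelengths at 20 um pitch.
low = bandwidth_density_tbps_per_mm(16, 10, 20)
# Aggressive end of the quoted ranges: 64 wavelengths at 5 um pitch.
high = bandwidth_density_tbps_per_mm(64, 10, 5)
print(low, high)  # 8.0 128.0
```

Under these assumptions the conservative end of the ranges already gives 8 Tbps/mm, which is consistent with the "or more" on the slide.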
Outline ➔ Galaxy Architecture
Galaxy Architecture
Routing Example
Galaxy Architecture
Galaxy MWSR Optical Crossbar
More energy-efficient than SWMR at that scale
MWSR avoids broadcast bus, but requires arbitration
Token-Based Arbitration
8 cycles on average for token arbitration (5 chiplets)
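An illustrative sketch of token-based arbitration on one MWSR channel follows. Everything in it is an assumption rather than the Galaxy protocol: a single token per channel visits the nodes in fixed ring order, each hop takes `hop_cycles`, and a node that captures the token sends one message before releasing it. It does not try to reproduce the 8-cycle average above, which depends on timing parameters not given in these slides.

```python
from collections import deque


def simulate_token_channel(requests, n_nodes, hop_cycles=1, send_cycles=1):
    """Simulate one MWSR channel guarded by a single circulating token.

    requests: list of (cycle, node) pairs; `node` wants to write at `cycle`.
    Returns the wait of each request (grant cycle minus request cycle),
    in the order the requests are granted.
    """
    pending = {node: deque() for node in range(n_nodes)}
    arrivals = sorted(requests)
    waits = []
    token_at, cycle, i = 0, 0, 0
    while len(waits) < len(arrivals):
        # Register every request issued up to the current cycle.
        while i < len(arrivals) and arrivals[i][0] <= cycle:
            pending[arrivals[i][1]].append(arrivals[i][0])
            i += 1
        if pending[token_at]:
            # The node holding the token sends one message, then releases it.
            waits.append(cycle - pending[token_at].popleft())
            cycle += send_cycles
        # Pass the token to the next node on the ring.
        token_at = (token_at + 1) % n_nodes
        cycle += hop_cycles
    return waits


# Five nodes; node 3 requests at cycle 0 while the token is at node 0.
single = simulate_token_channel([(0, 3)], n_nodes=5)
# Nodes 0 and 1 both request at cycle 0; node 1 waits for node 0 to send.
contended = simulate_token_channel([(0, 0), (0, 1)], n_nodes=5)
print(single, contended)  # [3] [0, 2]
```

In this toy model the wait grows with the ring distance between the token and the requester, which is why arbitration latency scales with the number of chiplets.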
Dense Off-Chip Coupling
• Dense optical fiber array [Lee, OSA/OFC/NFOEC 2010]
• ~3.8 dB loss, 8 Tbps/mm demonstrated
• Misalignment <0.7 μm, 0.4 μm, 0.7 μm → loss <1 dB
Loss comparable to optical proximity couplers
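To put the dB figures in linear terms, here is a minimal sketch of the standard decibel conversion. The function names are illustrative. The only inputs taken from the slide are the 3.8 dB and 1 dB loss values.

```python
def db_to_fraction(loss_db):
    """Fraction of optical power that survives a loss given in decibels."""
    return 10 ** (-loss_db / 10)


def laser_power_multiplier(loss_db):
    """Factor by which source power must grow to deliver the same power after the loss."""
    return 10 ** (loss_db / 10)


coupler_fraction = db_to_fraction(3.8)           # roughly 0.42 of the light gets through
coupler_multiplier = laser_power_multiplier(3.8)  # roughly 2.4x more source power
misalign_fraction = db_to_fraction(1.0)          # roughly 0.79 for a 1 dB penalty
```

Because losses in dB add along the optical path, the required laser power grows exponentially with the total loss. Every coupler the light crosses multiplies the power requirement by this factor.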
Nanophotonic Parameters
Outline ➔ Experimental Methodology
Architectural Parameters
Modeling Infrastructure
3D-stack model
SimFlex sampling, 95% confidence
photonic-layer ring heating
Outline ➔ Results: Sensitivity Studies
Load-Latency Curves
16 tokens provide optimal buffer depth
Laser Power Sensitivity to Optical Parameters
[Figure panels: Coupler Loss; Off-Ring Loss; Waveguide & Filter Drop Loss; Modulator Insertion Loss]
Highly sensitive to coupler loss, insensitive to other losses
Sensitivity to Fiber Density
• 116 mm² chiplets: 43 mm along the chip edge
• Enough room for 172 fibers @ 250 μm pitch
128 fibers: within 3% of max performance
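The fiber count follows from simple geometry. This is a minimal sketch, assuming a square chiplet so that the 43 mm figure is the full perimeter. The square shape is an assumption, not something stated on the slide, and the function names are illustrative.

```python
import math


def chiplet_perimeter_mm(area_mm2):
    """Perimeter of a square die with the given area (square shape is an assumption)."""
    return 4 * math.sqrt(area_mm2)


def max_fibers(edge_mm, pitch_um):
    """Number of fibers that fit along `edge_mm` of die edge at the given pitch."""
    return int(edge_mm * 1000 // pitch_um)


perimeter = chiplet_perimeter_mm(116)  # about 43.08 mm
fibers = max_fibers(43, 250)           # 172
used_fraction = 128 / fibers           # 128 fibers use about 74% of the available slots
```

Under this assumption the 128-fiber configuration leaves roughly a quarter of the edge unused.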
Outline ➔ Results: Single-Chip Comparisons (Processor Disintegration)
Performance Against “Unlimited” Designs
• Unlimited power (max speed of design, irrespective of temp.)
• Mesh_20MC & Corona_20MC
  Also unlimited bandwidth (20 MCs per chip, 5x more pins)
Galaxy matches the performance of “unlimited” designs
Performance Against Realistic Designs
• Realistic: within power and bandwidth envelopes
• Galaxy chiplets within 66.2 °C → chiplets run at max speed
Galaxy: 2.2x speedup on average (3.4 max)
Energy-Delay Product
• Cool chiplets minimize leakage
Galaxy: 2.4x-2.8x smaller EDP on average (6.8x max)
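For reference, here is a minimal sketch of how an energy-delay product comparison is computed. The energy and delay values are hypothetical placeholders chosen for easy arithmetic. They are not measurements from this work.

```python
def energy_delay_product(energy_j, delay_s):
    """EDP: energy consumed multiplied by execution time (lower is better)."""
    return energy_j * delay_s


def edp_improvement(baseline, candidate):
    """How many times smaller the candidate's EDP is than the baseline's.

    Both arguments are (energy_j, delay_s) pairs.
    """
    return energy_delay_product(*baseline) / energy_delay_product(*candidate)


# Hypothetical designs: the candidate is 2x faster but uses 20% more energy.
baseline = (10.0, 4.0)
candidate = (12.0, 2.0)
ratio = edp_improvement(baseline, candidate)  # 40 / 24, about 1.67x smaller EDP
```

A speedup at equal energy shrinks EDP by the same factor. Any energy saved on top of that, for example lower leakage from cooler chiplets, compounds the gain.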
Outline ➔ Results: Multi-Chip Comparisons (Macrochip Integration)
Comparison Against Multi-Chip Alternatives
Comparison Against Multi-Chip Alternatives
Galaxy: 2.5x over Oracle Macrochip (6.8x max)
Tapered vs. Optical Proximity Couplers
6x less laser power than Oracle Macrochip with demonstrated couplers
Outline ➔ Results: Thermal Modeling
80-core 5-chiplet Galaxy Thermal CFD Modeling
8 cm spacing allows cooling with cheap passive heatsinks
88.2 °C
9-chiplet Dense Array (Oracle Macrochip)
Tight arrangement points to liquid cooling requirement
249 °C
9-chiplet Galaxy 2D
Cooling 9 chiplets with passive heatsinks
110 °C
9-chiplet Galaxy 3D
Flexible fibers allow “virtual chip” to break free of 2D planar designs
83.6 °C
Galaxy Summary
• “Virtual chips” with the performance of unlimited designs
• Breaks free of typical physical constraints
  Large aggregate area
  Improved yield (break-even point: 60% yield for photonics)
  Tb/s/mm bandwidth density
  Pushes back power wall
• Processor disintegration
  2.2x avg. speedup (3.4 max)
  2.4x-2.8x avg. smaller EDP (6.8x max)
• Macrochip integration
  2.5x speedup over Oracle Macrochip (6.8x max)
  6x more power efficient links
Outline ➔ Overview of Other Research
Energy is Shaping the IT Industry
#1 of Grand Challenges for Humanity in the Next 50 Years
[Smalley Institute for Nanoscale Research and Technology, Rice U.]
• Computing worldwide: ~408 TWh in 2010 [Gartner]
• Datacenter energy consumption in US ~150 TWh in 2011 [EPA]
  3.8% of domestic power generation, $15B
  CO2-equiv. emissions ≈ Airline Industry (2%)
• Carbon footprint of world’s data centers ≈ Czech Republic
• Exascale @ 20MW: 200x lower energy/instr. (2nJ → 10pJ)
  3% of the output of an average nuclear plant!
• 10% annual growth on installed computers worldwide [Gartner]
Exponential increase in energy consumption
Overall Focus: Energy-Efficient Computing
• Integer add: 0.5pJ; FP-FMA: 50pJ. Where does energy go?
  Data movement: 1200pJ across 400mm² chip, 16000pJ memory
    Elastic caches: minimize data transfers through adapting caches to workload demands [ISCA’09, IEEEMicro’10, DATE’12]
  Processing: ~1500pJ to schedule the operation
    SeaFire: specialized computing on dark silicon to eliminate general-purpose computing’s overheads [IEEEMicro’11, USENIX-Login’11]
  Circuits: wide voltage guardbands
    Low voltages, process variation → timing errors → computing errors
    Elastic fidelity: allow errors at select code/data segments to save energy while maintaining fidelity contract with user [CoRR abs/1111.4279]
• Chips fundamentally limited by physical constraints. Need to break free.
  Galaxy: processor disintegration/macrochip integration using photonic interconnects [WINDS’10]
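The per-operation energies quoted on this slide can be compared directly. This is a minimal sketch that uses only those five figures. The dictionary keys and the helper name are illustrative labels, not terms from the slides.

```python
# Energy per event in picojoules, as quoted on the slide.
ENERGY_PJ = {
    "integer_add": 0.5,
    "fp_fma": 50.0,
    "on_chip_move": 1200.0,   # across a 400 mm^2 chip
    "schedule": 1500.0,       # approximate cost to schedule the operation
    "memory_access": 16000.0,
}


def overhead_ratio(operation, overhead):
    """How many times more energy the overhead costs than the useful operation."""
    return ENERGY_PJ[overhead] / ENERGY_PJ[operation]


move_vs_fma = overhead_ratio("fp_fma", "on_chip_move")      # 24.0
memory_vs_fma = overhead_ratio("fp_fma", "memory_access")   # 320.0
sched_vs_add = overhead_ratio("integer_add", "schedule")    # 3000.0
```

The ratios show why the research threads listed here target data movement and scheduling overheads rather than the arithmetic itself.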
Thank You!
Overcoming Data Movement and Processing Overheads
• Elastic caches: adapt cache to workload’s demands
  Significant energy on data movements and coherence requests
  Co-locate data, metadata, and computation
  Decouple address from placement location
  Capitalize on existing OS events → simplify hardware
  Cut on-chip interconnect traffic by half
• SeaFire: specialized computing on dark silicon
  Repurpose dark silicon to implement specialized cores
  Application cherry-picks a few cores, rest of chip is powered off
  Vast unused area → many specialized cores → likely to find good matches
  12x lower energy (conservative)
Overcoming Voltage Guardbands
• Elastic fidelity: selectively trade accuracy for energy
  We don’t always need 100% accuracy, but HW always provides it
  Language constructs specify required fidelity for code/data segments
  Steer computation to exec/storage units with appropriate fidelity and lower voltage
  35% lower energy
[Figure panels: No errors; 10% errors]