Serial Code Accelerators for Heterogeneous Multi-core Processor with 3D memory


Philip Jacob

Thesis Defense

July 26, 2010


Committee members

John F. McDonald

Tong Zhang

Paul Schoch

Christopher D. Carothers

Outline

Need for Serial Code Accelerator
  Clock Race
  Multi-core CMOS
  Amdahl's law
Alternate technologies
  SiGe / FinFET etc.
  ECL / I2L
Architectural studies
  HCRU CPI
  Multi-core
  3D memory
Processor core and 3D memory
  FPGA core model
  Chip designs
  Thermal Analysis
Conclusion & Future Research


Motivation for High Clock Rate CPU: HPCS

Faster processing nodes to execute MPI code using SiGe HBTs.
Improve packet handling to reduce communication latency.

Ref 1: http://www.nas.nasa.gov/About/Projects/Columbia/columbia.html

Previous decade: Clock Race suggested need for 3D Memory

Memory Wall

Ref 2: Hennessy, Patterson, "Computer Architecture: A Quantitative Approach"

The Clock Race for CMOS has Ended

6 Clock Doubling Times = 64 GHz!

Ref 3: Wilfried Haensch, 2008 IBM TAPO meeting

CMOS Repeater Crisis - Wires Don't Scale Well

Number of Repeaters is Exploding as a Power of 10 per 33% Shrink

Mx resistance is increasing with technology scaling.
High resistance requires increased repeater counts.
Increased power consumption, as buffers are leaky and account for >50% of logic leakage.
Forced to reduce/hold clock rate.

Ref 4: Ruchir Puri, IBM, 2007 Sematech/ACM, Thermal and Design Issues in 3D ICs

Chip Integration Technology Challenges
Result: Multi-cores in CMOS

Dual core to Quad Core to 50 core Generation

[Figure labels: Dual Core; Quad Core; 50-core Knights Corner cloud computing chip]

Is adding more cores the right solution?

Amdahl's 1967 Figure of Merit (FOM) estimates the speedup of an overall system when only part of the system is improved.

Speeding up parallel code by adding "n" cores, where S and P are the serial and parallel portions of the run time:

$$\text{Performance FOM} = \frac{S + P}{S + \frac{P}{n}}$$

$$\lim_{S \to 0}[\mathrm{FOM}] = n; \qquad \lim_{n \to \infty}[\mathrm{FOM}] = 1 + \frac{P}{S}$$

Ref 5: Gene Amdahl, "Validity of the single processor approach to achieving large scale computing capabilities", AFIPS Conference, 1967
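
As a minimal sketch of Amdahl's figure of merit from this slide, assuming S and P are the serial and parallel fractions of the single-core run time and that the parallel part scales ideally across n cores. The function name and the 10% serial fraction are illustrative choices, not values from the slides:

```python
def amdahl_fom(serial, parallel, n):
    """Amdahl figure of merit: (S + P) / (S + P / n)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (serial + parallel) / (serial + parallel / n)


if __name__ == "__main__":
    S, P = 0.1, 0.9  # assumed split: 10% serial, 90% parallel
    for n in (2, 4, 8, 50, 1000):
        print(f"n={n:5d}  FOM={amdahl_fom(S, P, n):6.2f}")
    # As n grows, the FOM approaches the ceiling 1 + P/S
    print("limit as n -> infinity:", 1 + P / S)
```

However many cores are added, the serial fraction alone sets the ceiling, which is the argument for accelerating the serial code instead.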

Speed up The Serial Code

Heterogeneous Multi-core System with MCUs, and a single HCRU for Serial Code

[Block diagram: one HCRU and eight MCUs, MCU0 to MCU7]

Turn off the high clock rate processor during parallel operation to save power.
Integration could be either on the same chip or through a silicon carrier.
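
One way to read this slide is as Amdahl's law with a second knob: the serial portion runs on the HCRU at some multiple of the MCU rate, while the parallel portion is spread across the MCUs. The sketch below works under those assumptions; the ideal-scaling model, the `hcru_factor` parameter and the sample numbers are illustrative, not measurements from this work:

```python
def hetero_speedup(serial, parallel, n_mcu, hcru_factor):
    """Speedup over a single MCU when serial code runs on an HCRU that is
    hcru_factor times faster and parallel code is split over n_mcu MCUs."""
    baseline = serial + parallel
    accelerated = serial / hcru_factor + parallel / n_mcu
    return baseline / accelerated


if __name__ == "__main__":
    S, P = 0.1, 0.9  # assumed serial/parallel split
    print(hetero_speedup(S, P, 8, 1.0))  # 8 MCUs, no serial accelerator
    print(hetero_speedup(S, P, 8, 4.0))  # same MCUs plus a 4x faster HCRU
```

With these assumed numbers the overall speedup rises from about 4.7 to about 7.3 without adding a single extra parallel core.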


Alternate technologies

SiGe HBT
Strained Si
FinFETs

SiGe HBT

Vertical device.
3 regions of operation: OFF, forward active, saturation.
Current equations are exponential, making them better drivers of wires.
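
For reference, the exponential dependence mentioned here is the standard forward-active bipolar relation. This is the textbook form, not an equation taken from the slides; I_S is the saturation current and V_T = kT/q is the thermal voltage:

```latex
I_C \approx I_S \exp\!\left(\frac{V_{BE}}{V_T}\right), \qquad
g_m = \frac{\partial I_C}{\partial V_{BE}} = \frac{I_C}{V_T}
```

The transconductance grows in proportion to the collector current, which is why a small input swing can steer a large drive current into a wire load.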

Doping Profile to form Hetero-junction

Ge introduced into the base region reduces the potential barrier to injection of electrons from the emitter into the base.
Drift field accelerates e-.
Results in increased Ic and reduced base transit time.

Ref 6: "On the potential of SiGe HBTs for extreme environment electronics", Cressler, Proceedings of the IEEE, Sept 2005

Scaling in SiGe HBTs

FOM: cutoff frequency.
Solomon-Tang scaling rule:
  * Circuit delay scales with emitter size.
  * Shrink the emitter for constant TOTAL current.
Collector current density goes up.
Supply voltage and swing voltage are constant.

Ref 6: "On the potential of SiGe HBTs for extreme environment electronics", Cressler, Proceedings of the IEEE, Sept 2005

[Figure labels: 180nm, 130nm, 90nm]

Emitter Coupled Logic Design

Current steering circuits.
Differential inputs/outputs.
Low voltage swings.
Taller trees for more complex gates, but higher static power consumption.

[Circuit schematic: NAND gate]

D Flip Flop

[Circuit schematics: latch; cross-coupled inverters]

Low Power in Bipolar: I2L / Integrated Injection Logic

[Circuit schematics: INV, NAND, NOR]

Vcc = 1V
Signal levels: Low = 0.2V, High = 0.7V

NPN only IIL

[Circuit schematic labels: VEE, VCC, Out, In]

1.1V power supply
4.4ps rise time
300mV swing

Ref 7: J.H. Pugsley and C.B. Silio, Proceedings of the 8th International Symposium on Multiple-Valued Logic, pp. 21-31, 1978

In collaboration with Tuhin, Srikumar

ITRS Roadmap for CMOS Microprocessor Power

Apple Sponsored Exponential PowerPC

0.7M Hitachi Si-bipolars.
0.3 µm x 1.0 µm emitter, 20 GHz fT, 1995.
2.0M 0.5 µm FETs.
Die size 15mm x 10mm.
Metal pitch 2 µm.
~80 Watts.
0.75~0.85 GHz (last tapeout).
Mixed ECL 500mV and CML 250mV swing.
Main power supply was 3.5V (most contemporary designs would use 2.5V).


High Level Architecture

CPI vs. Clock vs. Bus width

Cache structure
- unified L0 (1KB)
- unified L1 (16KB)
- A huge L2
- CPI = 7.82

Trace driven simulator: Dinero
Cache access time: CACTI

Access time improvement in BiCMOS over CMOS L1 cache (16K cache)

1. Decoder data
2. Word line
3. Sense amp data
4. Comparator
5. Mux
6. Sel inverter
7. o/p driver

CMOS access time = 0.718ns
BiCMOS access time = 0.431ns

Ref 8: CACTI 4.2, 5.0 http://quid.hpl.hp.com:9081/cacti/detailed.y?new
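
The two access times quoted on this slide can be compared directly. A small sketch of the arithmetic follows; the variable names are illustrative, and the single-cycle clock bound ignores every other timing overhead, which is an assumption made only for illustration:

```python
cmos_ns = 0.718    # CMOS L1 access time quoted on the slide
bicmos_ns = 0.431  # BiCMOS L1 access time quoted on the slide

speedup = cmos_ns / bicmos_ns
reduction_pct = 100.0 * (cmos_ns - bicmos_ns) / cmos_ns

# Highest clock that would still allow a single-cycle access
# (assumption: access time is the only constraint)
max_clock_ghz = {"CMOS": 1.0 / cmos_ns, "BiCMOS": 1.0 / bicmos_ns}

print(f"speedup: {speedup:.2f}x, access time reduction: {reduction_pct:.1f}%")
print({name: round(ghz, 2) for name, ghz in max_clock_ghz.items()})
```

That works out to roughly a 1.67x faster access, or about a 40% shorter access time.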

SimpleScalar

Execution driven simulator
3D cache with wide bandwidth

Ref 9: www.simplescalar.com

Reducing CPI for HCRU

SimpleScalar simulator
3 level cache
SPECint benchmarks
CPI around 2.5 to 3
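
CPI matters because run time depends on instruction count, CPI and clock rate together. The sketch below applies that standard relation to the CPI values quoted on these slides; the instruction count and the 16 GHz clock are placeholder assumptions, not figures from this work:

```python
def exec_time_s(instructions, cpi, clock_hz):
    """Classic CPU time equation: instructions * CPI / clock rate."""
    return instructions * cpi / clock_hz


if __name__ == "__main__":
    n_instr = 1_000_000_000  # assumed workload size
    clock = 16e9             # assumed HCRU clock, for illustration only
    for cpi in (7.82, 3.0, 2.5):
        t = exec_time_s(n_instr, cpi, clock)
        print(f"CPI={cpi:<5} time={t * 1e3:.1f} ms")
```

At a fixed clock, run time scales linearly with CPI, so a high clock rate only pays off if the memory system keeps CPI low.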

3D processor memory stack solution

Multi-tier
Multi-bank

Higher bandwidth through 3D vias translates to a multi-port cache that accesses multiple banks or tiers simultaneously.
Good for multi-cores, where bus arbitration can be avoided.

[Figure label: Multi-core]

Multiprocessor simulator - RSIM

Symmetric multiprocessor simulator adapted for 3D memory over multi-core

Ref 10: RSIM http://rsim.cs.uiuc.edu/rsim/

Multi-core processor RSIM results

FFT benchmark


7 stage Pipelined processor core

[Block diagram labels: Instruction Decode; Register File Stage 1; Register File Stage 2; Operand preparation; ALU; Post Ex / Write Back Queue; Update Remote PC; L0 i-cache + Remote Program Counter; L0 d-cache; Pipeline controller (FSM); Instruction queue; Core Test input (instruction sequence generator); Pipeline stage control signals; Signals to FSM; Data Bus; External signals & traps; ALU feed forward; Output Scan Chain; Data Reg File]

Dual Ported 8HP Register File

Ref 11: Okan Erdogan, PhD Thesis, 2008

Read Port A operation at 18.4 GHz (measured)
2 read ports / 1 write port
Size = 8 words
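
A behavioural sketch of a register file with the port configuration listed on this slide (two read ports, one write port, eight words). The 32-bit word width, the read-before-write ordering and the class and method names are assumptions for illustration; the model says nothing about the 8HP circuit itself:

```python
class RegisterFile:
    """Behavioural model: 2 read ports, 1 write port."""

    def __init__(self, words=8, width=32):  # width is an assumed value
        self.mask = (1 << width) - 1
        self.regs = [0] * words

    def cycle(self, read_a, read_b, write_addr=None, write_data=0):
        """One clock cycle. Assumption: both reads return the values
        held before this cycle's write takes effect."""
        out_a, out_b = self.regs[read_a], self.regs[read_b]
        if write_addr is not None:
            self.regs[write_addr] = write_data & self.mask
        return out_a, out_b


if __name__ == "__main__":
    rf = RegisterFile()
    rf.cycle(0, 1, write_addr=3, write_data=0xDEADBEEF)
    print([hex(v) for v in rf.cycle(3, 0)])
```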


CLA carry chain test structure

Ref 12: Paul Belemjian, PhD Thesis, 2008

Measured waveform of the 8HP adder test chip: 26.67GHz

Operand Preparation block

S2  S1  S0  ALU
L   L   L   CLEAR
L   L   H   B MINUS A
L   H   L   A MINUS B
L   H   H   A PLUS B
H   L   L   A xor B
H   L   H   A + B
H   H   L   AB
H   H   H   PRESET
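
The function-select table on this slide can be modelled in a few lines. In this sketch "A PLUS B" is read as arithmetic addition, "A + B" as bitwise OR, "AB" as bitwise AND and PRESET as all ones. That is the usual convention for this style of table but is an assumption here, as are the 32-bit width, the wrap-around arithmetic and the names used:

```python
WIDTH = 32                 # assumed operand width
MASK = (1 << WIDTH) - 1

# Keys are (S2, S1, S0) with L = 0 and H = 1
ALU_OPS = {
    (0, 0, 0): lambda a, b: 0,       # CLEAR
    (0, 0, 1): lambda a, b: b - a,   # B MINUS A
    (0, 1, 0): lambda a, b: a - b,   # A MINUS B
    (0, 1, 1): lambda a, b: a + b,   # A PLUS B
    (1, 0, 0): lambda a, b: a ^ b,   # A xor B
    (1, 0, 1): lambda a, b: a | b,   # A + B, read as OR (assumption)
    (1, 1, 0): lambda a, b: a & b,   # AB, read as AND (assumption)
    (1, 1, 1): lambda a, b: MASK,    # PRESET, read as all ones (assumption)
}


def alu(s2, s1, s0, a, b):
    """Apply the selected function and wrap the result to WIDTH bits."""
    return ALU_OPS[(s2, s1, s0)](a & MASK, b & MASK) & MASK


if __name__ == "__main__":
    for sel in sorted(ALU_OPS):
        print(sel, hex(alu(*sel, 5, 3)))
```

A lookup table keyed on the select lines mirrors the slide's table row for row, which keeps the model easy to check against it.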

Pipeline Controller FSM chip

[Block diagram labels: CLOCK; SET; HLT; STALL_CACHE; STALL_BR; UNSTALL_CACHE; X; Y; Z; Pipe Clear; FSM States; Data I/p; counter; Test output; STAGE 1; STAGE 2; Pipe control signal]

3D FDSOI CMOS Process - MITLL

Ref 13: MIT LL process documentation

3D cache: Floor plan & Microphotograph

[Floor plan labels: Way 0; Way 1; Way 2; Way 3; Tag Array; 3D Via; Controller; 3D Via]

In collaboration with Aamir Zia

Measured Results of 3D memory chip

Measured waveform of alternating read after write from Tier 1 at 500MHz clock
Measured waveform with a string of consecutive 0s from Tier 3

Floor planning (5mm * 5mm)

[Floor plan blocks, with sizes as labeled:
i-cache (reg file) 5w
Inst Q, 4 words, 1.4w
Inst Decoder 1w
FSM (Pipeline Ctrl) 1w
Reg File 5w
Op. Prep 1w
Adder 2.5w
Write/store queue 1.4w
SERDES 2.5w
L0 d-cache (reg file) 5w (four instances)
Test Inst generator
i-cache (reg file) 5w
L1 CACHE]

Thermal Studies of Processor floor plan using COMSOL

335K

Substrate is so thick that the heat does not spread into the bottom sink.
Deep trench isolation in SiGe HBT prevents lateral heat spreading.

In collaboration with Okan Erdogan

Use of Diamond Heat Spreaders

Ref 14: J.C. Sung et al., "Semiconductor on Diamond (SOD) for System on Chip (SoC) Architectures", VMIC Conference, Sept. 2006, pp. 35-38.

View at diamond-Cu boundary for 50 µm diamond layer under CPU with one tier of 3D memory
Silicon thinning to 50 µm, and bonding to 50 µm diamond

Thermal studies with Processor - 3D memory

313K

Wafer thinning
Diamond substrate
Cu heat spreading interface layers


Milestones

Fall 2004-2005
  Preliminary study of 3D architecture
2005-2006
  DQE, IEEE D&T paper accepted, processor design on FPGA, MS degree
2006-2007
  Processor redesign on FPGA, multi-core processor evaluations, completion of course work, candidacy
2007-2008
  Chip implementation, testing blocks.
  Operand preparation blocks
  Pipeline controller implementation in 8HP SiGe.
2009-2010
  Amdahl's law and heterogeneous core integration
  Thesis Defense

Publications

"Mitigating Memory wall effects in High clock rate and Multi-core CMOS 3D ICs - Processor Memory Stacks", Philip Jacob, Aamir Zia, Mike Chu, Jin Woo Kim, Russell Kraft, John F. McDonald, and Kerry Bernstein, Proceedings of the IEEE, 3D IC special issue, Vol. 97, No. 1, Jan 2009, pp 108-122.

"Predicting the Performance of a 3D Processor-Memory Chip Stack", Philip Jacob, Okan Erdogan, Aamir Zia, Paul M. Belemjian, Russell Kraft and John F. McDonald, IEEE Design and Test, Nov-Dec 2005, pp 540-547. (cited 14 times)

"A Three-Dimensional L2 cache with Ultra-Wide Data Bus for 3D Processor-Memory Integration", Aamir Zia, Philip Jacob, Russell P. Kraft and John F. McDonald, Transactions in VLSI, IEEE, Vol. 18, No. 6, June 2010, pp 967-977.

"A 40Gs/s Time Interleaved ADC using SiGe BiCMOS technology", Michael Chu, Philip Jacob, Jin-Woo Kim, Mitchell LeRoy, Russell Kraft, John F. McDonald, JSSC, IEEE, Vol. 45, No. 2, Feb 2010, pp 380-390.

"A Reconfigurable 40 GHz BiCMOS Uniform Delay Crossbar Switch for Broadband and Wide Tuning Range Narrowband Applications", Jin-woo Kim, Michael Chu, Philip Jacob, Aamir Zia, Russell Kraft, John F. McDonald, IET Circuits, Devices and Systems. [Accepted]

Conclusion & Future Research goals

Need for a fast core
Possible alternative technologies, especially SiGe
Chip designs in 3D memory and SiGe for processor core
Thermal analysis using COMSOL
Heterogeneous core integration with 3D memory - the way forward!
IIL logic for low power operations
Serial code / parallel code separation.

Questions

Thank you for your attention