VLSI Datapath Choices: Cell-Based Versus Full-Custom - DAC

Nov 26, 2013

Explaining The Gap Between ASIC and Custom Power:

A Custom Perspective

Andrew Chang

Cadence Design Systems*


William J. Dally

Computer Systems Laboratory

Stanford University


* Work done while Author was at Stanford


Design Tradeoffs: Power vs. Performance

1. Move to More Energy Efficient Operating Point
   More Energy Efficient w/ Custom

2. Trade Performance for Power
   Larger Range w/ Custom

3. Move to Different Power vs. Performance Curve
   More Architectural Choice with Custom

[Figure: Power vs. Performance curves with moves 1, 2, and 3 marked]

Dynamic Power Dissipation

$P_{dyn} = \alpha C V_{dd}^2 f = \alpha E_{circuit} f$

Reduce V_dd
  Static, dynamic, voltage islands, power gating

Reduce α and/or f
  Clock gating, block enables, bus encoding, glitch identification and elimination

Reduce E_circuit
  Engineer interconnects, increase circuit efficiency, subthreshold circuit techniques
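
As a rough illustration of the dynamic power relation P_dyn = α C V_dd² f, the sketch below evaluates it at a made-up operating point and shows the effect of two of the levers listed on the slide. All numeric values (activity factor, capacitance, voltages, frequency) are hypothetical and chosen only to make the ratios easy to read.

```python
def dynamic_power(alpha, c_farads, vdd, f_hz):
    """Dynamic power in watts: alpha * C * Vdd^2 * f."""
    return alpha * c_farads * vdd ** 2 * f_hz


# Hypothetical operating point, for illustration only.
baseline = dynamic_power(alpha=0.1, c_farads=1e-9, vdd=1.3, f_hz=400e6)

# Lever 1: halve Vdd (power falls with the square of the supply).
half_vdd = dynamic_power(alpha=0.1, c_farads=1e-9, vdd=0.65, f_hz=400e6)

# Lever 2: halve the activity factor, e.g. through clock gating.
clock_gated = dynamic_power(alpha=0.05, c_farads=1e-9, vdd=1.3, f_hz=400e6)

print(f"baseline:    {baseline * 1e3:.1f} mW")
print(f"half Vdd:    {half_vdd * 1e3:.1f} mW ({baseline / half_vdd:.1f}x lower)")
print(f"clock gated: {clock_gated * 1e3:.1f} mW ({baseline / clock_gated:.1f}x lower)")
```

The quadratic dependence on V_dd is why supply scaling appears first in the list: halving the supply cuts dynamic power by four at a fixed frequency, while halving α or f only halves it.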


Static Power Dissipation

$P_{static} = V_{dd} (I_{sub} + I_{ox})$

$I_{sub} = K_1 W e^{-V_t / (n V_\theta)} (1 - e^{V_{gs}/V_\theta})$

$I_{ox} = K_2 W (V_{gs}/t_{ox})^2 e^{\alpha t_{ox} / V_{gs}}$

With $K_1$, $K_2$, $n$, and $\alpha$ experimentally determined

Reduce V_dd
  Static, dynamic, voltage islands, power gating

Increase effective V_t
  Substituting high-threshold devices, transistor stacking, static and active body bias

Reduce effective W
  Reduce number and size of devices in design
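
To give a feel for why raising the effective V_t is so effective against subthreshold leakage, the sketch below evaluates only the exponential term e^(-V_t / (n V_θ)) from the I_sub expression. The slope factor n = 1.5, the 300 K temperature, and the 100 mV threshold shift are assumptions for illustration; the slide says only that n is determined experimentally.

```python
import math

K_BOLTZMANN = 1.380649e-23    # J/K
Q_ELECTRON = 1.602176634e-19  # C


def thermal_voltage(temp_kelvin=300.0):
    """Thermal voltage V_theta = kT/q in volts."""
    return K_BOLTZMANN * temp_kelvin / Q_ELECTRON


def subthreshold_reduction(delta_vt, n=1.5, temp_kelvin=300.0):
    """Factor by which e^(-Vt / (n * V_theta)) shrinks when V_t rises by delta_vt volts.

    n = 1.5 is an assumed slope factor, not a value from the slides.
    """
    return math.exp(delta_vt / (n * thermal_voltage(temp_kelvin)))


ratio = subthreshold_reduction(delta_vt=0.1)
print(f"V_theta at 300 K: {thermal_voltage() * 1e3:.2f} mV")
print(f"+100 mV of V_t cuts the exponential term by about {ratio:.0f}x")
```

Under these assumptions a 100 mV increase in V_t reduces the exponential term by roughly an order of magnitude, which is the motivation for high-threshold devices, stacking, and body bias.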

Which Design Is More Efficient?

0.7um CMOS 173MHz chip w/ 460K T’s
  Vdd (typ) = 3.3V, Vdd (min) = 1.1V
  Power = 845mW

0.18um CMOS 10kHz chip w/ 640K T’s
  Vdd (max) = 1.8V, Vdd (min) = 0.18V
  Power = 1.6mW

Talk Outline

Normalized Metric: E_bit
Effect of Architecture
ASIC vs. Custom
  Building Blocks
  Achievable Energy Efficiency
16b 1024 FFT Example
Answer to “Which Design is More Efficient”

Defining E_bit

$E_{bit} = C_{bit} \cdot V_{dd}^2$

$C_{bit} = 4 \cdot 2\,\mathrm{fF/\mu m} \cdot W_{min}$

Energy needed to write a 1-bit SRAM cell
Approximates minimum useful capacitance
The ratio of E_bit to the energy for a range of circuits remains largely constant with technology scaling
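
The definition above can be evaluated directly. The sketch below is a minimal calculator for E_bit; the W_min of 0.13 µm and the 1.8 V supply are hypothetical inputs chosen for illustration, not values stated on this slide.

```python
FF_PER_UM = 2.0      # gate capacitance per micron of width, from the slide (fF/um)
WIDTHS_PER_BIT = 4   # the factor of 4 in C_bit = 4 * 2 fF/um * W_min


def c_bit_ff(w_min_um):
    """C_bit in femtofarads for a minimum device width in microns."""
    return WIDTHS_PER_BIT * FF_PER_UM * w_min_um


def e_bit_fj(w_min_um, vdd):
    """E_bit = C_bit * Vdd^2 in femtojoules (fF * V^2 = fJ)."""
    return c_bit_ff(w_min_um) * vdd ** 2


# Hypothetical inputs, for illustration only.
example = e_bit_fj(w_min_um=0.13, vdd=1.8)
print(f"C_bit = {c_bit_ff(0.13):.2f} fF, E_bit = {example:.2f} fJ")
```

Because fF times V² is fJ, no unit conversion is needed as long as W_min is given in microns.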

Technology Scaling for E_bit

χ is a normalized unit of distance equal to the M1 pitch

Technology   0.5 µm   0.18 µm
µm²          58       5.7
χ²           18       18


Technology Scaling for Nand2

χ is a normalized unit of distance equal to the M1 pitch

[Figure: NAND2 cell (inputs A, B; output YN) with dimensions marked 4χ = 2.24 µm and 8χ = 4.48 µm]
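
The χ unit can be recovered from the dimensions marked on the NAND2 figure and then used to convert areas. The sketch below does this for the 5.7 µm² area quoted on the E_bit scaling slide; treating that area and this NAND2 as belonging to the same technology is an assumption made for illustration.

```python
def chi_um(length_um, length_chi):
    """Size of one chi (M1 pitch) in microns, from a dimension given in both units."""
    return length_um / length_chi


def area_in_chi2(area_um2, chi):
    """Convert an area in square microns to units of chi squared."""
    return area_um2 / chi ** 2


chi = chi_um(2.24, 4)                 # from "4 chi = 2.24 um"
height_um = 8 * chi                   # should reproduce "8 chi = 4.48 um"
area_chi2 = area_in_chi2(5.7, chi)    # assumed to be the same technology

print(f"chi = {chi:.2f} um, 8 chi = {height_um:.2f} um")
print(f"5.7 um^2 is about {area_chi2:.1f} chi^2")
```

Expressing layout in χ is what lets the same cell be described by one number across technologies.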

Applying E_bit

Energy          180nm            130nm           90nm            65nm
E_bit (fJ)      3.3              1.4             0.5             0.36

Relative        180nm            130nm           90nm            65nm
E_bit           1                1               1               1
1b FO4          ~10              ~10             ~10             ~10
1b SP-SRAM      0.3-7            0.3-7           0.3-7           0.3-7
1b RF           4-20+            4-20+           4-20+           4-20+
1b DFF          20-30+           15-30+          10-30+          10-30+
1b Nand2        11-30 (typ 19)   5-30 (typ 14)   5-30 (typ 14)   5-30 (typ 14)
Move 1b 1000χ   ~100             ~100            ~100            ~100
Move 1b 1.5mm   268              367             467             714
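
Relative costs in the table become absolute energies when multiplied by E_bit for the same node. The sketch below does this for the "Move 1b 1.5mm" row, using only numbers from the table; the function and variable names are illustrative.

```python
# E_bit in fJ and the relative cost of moving one bit 1.5 mm, per node (from the table).
EBIT_FJ = {"180nm": 3.3, "130nm": 1.4, "90nm": 0.5, "65nm": 0.36}
MOVE_1B_1P5MM_REL = {"180nm": 268, "130nm": 367, "90nm": 467, "65nm": 714}


def absolute_energy_fj(relative_cost, node):
    """Convert a cost in E_bit units to femtojoules at the given node."""
    return relative_cost * EBIT_FJ[node]


absolute = {
    node: absolute_energy_fj(rel, node) for node, rel in MOVE_1B_1P5MM_REL.items()
}

for node, energy in absolute.items():
    print(f"{node}: {MOVE_1B_1P5MM_REL[node]} E_bit = {energy:.1f} fJ")
```

The relative cost of a fixed 1.5 mm wire grows with each node because the wire length does not shrink with the technology while E_bit does.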

Effect of Architecture

ASIC Architecture: 6x Efficiency

               NVIDIA GeForceFX      Intel Pentium-4
Design Style   ASIC                  Custom
Clock          400MHz                2600MHz
Transistors    125M                  55M
Power          ~20 Watts             ~60 Watts
Throughput     10GFlops & 13 GB/s    5GFlops & 5 GB/s

Custom Circuits: 9x (7x) Efficiency

               NVIDIA GeForceFX      Intel Pentium-4
Design Style   Custom                Custom
Clock          400MHz                2600MHz
Transistors    125M                  55M
Power          ~3 Watts              ~60 Watts
Throughput     10GFlops & 13 GB/s    5GFlops & 5 GB/s
Vdd            0.65V                 1.3V

Combined Architecture and Circuits

40x+ Improvement, but 1.5 Years vs. 3+ Years

               NVIDIA GeForceFX      Intel Pentium-4
Design Style   Custom                Custom
Clock          400MHz                2600MHz
Transistors    125M                  55M
Power          ~3 Watts              ~60 Watts
Throughput     10GFlops & 13 GB/s    5GFlops & 5 GB/s
Vdd            0.65V                 1.3V
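
The headline ratios on these slides can be approximately reproduced from the quoted throughput and power figures. The sketch below assumes the efficiency metric is GFlops per watt, which the slides do not state explicitly, and uses the approximate ("~") power numbers as given.

```python
def gflops_per_watt(gflops, watts):
    """Energy efficiency as throughput per unit power (assumed metric)."""
    return gflops / watts


pentium4 = gflops_per_watt(5, 60)         # custom design, ~60 W
geforce_asic = gflops_per_watt(10, 20)    # ASIC design, ~20 W
geforce_custom = gflops_per_watt(10, 3)   # projected custom circuits, ~3 W at 0.65 V

architecture_gain = geforce_asic / pentium4      # architecture alone
circuit_gain = geforce_custom / geforce_asic     # custom circuits on the same architecture
combined_gain = geforce_custom / pentium4        # both together

print(f"architecture: {architecture_gain:.1f}x")
print(f"circuits:     {circuit_gain:.1f}x")
print(f"combined:     {combined_gain:.1f}x")
```

Under this metric the architecture step gives 6x and the combined figure gives 40x, matching the slide titles; the circuit step comes out near the parenthesized 7x figure.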

ASIC vs. Custom

ASIC Methods
  Provide only coarse-grain control (100K+ gates), but require much less effort and historically scale with complexity

Custom Methods
  Offer fine-grain control (individual transistors & gates), but require large effort and scale poorly with complexity
  Exploit Design Structure
  Exploit Circuit Techniques

Custom Methods Emphasize
Fine-Grain Manual Control + Custom Library

Design Style                 Custom              ASIC
Gate Library                 Complex, Specific   Simple, Generic
Floorplanning/Partitioning   Manual              Manual/Automated
Coarse Placement             Manual              Automated
Detailed Placement           Manual              Automated
Coarse Routing               Manual              Automated
Detailed Routing             Manual              Automated

ASIC note: Automated w/ Hints

Operation and Performance Characterized for the Specific Case

ASIC Methods Substitute
Coarse-Grain Control
Automation + Generic Library

Operation and Performance Characterized for the Typical/Generic Case

ASIC Focus on 100K+ Gates
Lost Opportunities to Exploit Structure

Designs reuse similar basic building blocks

Building blocks: 1-10K gates, not 100K+ gates
  64-bit adder: 1K gates
  64x64 RF: 2K gates
  64x64 multiplier: 20K gates

Opportunities to exploit these structures are lost when the design is viewed in large chunks



Different Architectures
Similar Building Blocks

[Figure: two chip floorplans with blocks labeled L, C, EX, RF, SRAM, XCVRS, and Bus]
  1998 “MAP” 64b Microprocessor - 5M T’s (MIT/Stanford): Bank 0-1, CLST 0-2, NIF/ROUTER, MEMORY SWITCH, CLUSTER SWITCH, EMI, LTLB
  2002 “Imagine” 32b Stream Processor - 22M T’s (Stanford): Cluster0-Cluster7, Microcontroller

Significant Structure Exists Within 100K-gates

[Figure: the MAP and Imagine floorplans with the same L, C, EX, RF, SRAM, XCVRS, and Bus blocks labeled]

Energy of 100K-gate Equivalent

ASIC (N2)      = 1400K E_bits (typ)
Custom Logic   = 424K E_bits *
SRAM (small)   = 1085K E_bits
SRAM (med)     = 155K E_bits
SRAM (large)   = 50K E_bits

* Based on data extracted from Intel McKinley
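
The ASIC figure appears to follow from the typical Nand2 cost in the Applying E_bit table: 100K gates at a typical 14 E_bits each. That reading is an inference rather than something the slide states. The sketch below reproduces it and compares it with the custom-logic figure.

```python
GATES = 100_000
NAND2_TYP_EBITS = 14      # typical Nand2 cost in E_bit units (130nm to 65nm columns)
CUSTOM_LOGIC_EBITS = 424_000

asic_ebits = GATES * NAND2_TYP_EBITS
asic_vs_custom = asic_ebits / CUSTOM_LOGIC_EBITS

print(f"ASIC (N2): {asic_ebits // 1000}K E_bits")
print(f"ASIC / custom logic: {asic_vs_custom:.1f}x")
```

The resulting ratio of a little over 3x is in line with the 3x energy advantage for custom designs quoted in the talk's summary.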

Exploiting Circuit Techniques

Custom circuits more efficient
  Reduced parasitics
  1.7x circuit techniques and flops
  1.4x libraries
  1.4x due to engineering interconnects

Subthreshold Circuits
  Low performance but ultra-low power
  Requires Architecture, Gates, Memories, CAD Tools



Relating Power to Performance
CV/I, I_dsat, t_FO4: Relating V_dd and V_t to t_FO4

$I_{dsat} = K_3 L_{eff}^{-0.5} t_{ox}^{-0.8} (V_{gs} - V_t)^{1.25}$

$t_{FO4} = K_4 [C_{eff} V_{dd} / I_{dsat}]$  ($K_4 \approx 13.5$)
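
Combining the two expressions gives a delay that scales as V_dd / (V_dd - V_t)^1.25 when C_eff, L_eff, and t_ox are held fixed and V_gs is taken equal to V_dd. The sketch below uses that proportionality to estimate the slowdown from halving the supply; the 0.3 V threshold is a hypothetical value, and the model only applies above threshold.

```python
def relative_fo4_delay(vdd, vt, exponent=1.25):
    """FO4 delay up to a constant factor: Vdd / (Vdd - Vt)^exponent, with Vgs = Vdd."""
    if vdd <= vt:
        raise ValueError("model only applies for Vdd above Vt")
    return vdd / (vdd - vt) ** exponent


VT = 0.3  # hypothetical threshold voltage in volts

slowdown = relative_fo4_delay(0.65, VT) / relative_fo4_delay(1.3, VT)
print(f"Halving Vdd from 1.3 V to 0.65 V slows an FO4 by about {slowdown:.2f}x")
```

With these assumed numbers the gate slows by less than 2x while dynamic energy per operation falls by 4x, which is the trade the talk's second design move relies on.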

Relating Power to Performance
Correlation to Reported Foundry Data

Technology Node    CV/I est (ps)   CV/I reported (ps)   t_FO4 est (ps)
Foundry A 180-nm   3.94            3.70                 53
Foundry A 130-nm   2.55            2.17                 34
Foundry A 90-nm    1.85            2.04                 25
Foundry A 65-nm    1.45            1.00                 20
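
The t_FO4 estimates in the table follow from the estimated CV/I values and the constant K_4 ≈ 13.5 via t_FO4 = K_4 · CV/I. The sketch below checks this using only the tabulated numbers; the dictionary and function names are illustrative.

```python
K4 = 13.5

# Estimated CV/I and t_FO4 per node, in picoseconds, copied from the table.
CV_OVER_I_EST_PS = {"180-nm": 3.94, "130-nm": 2.55, "90-nm": 1.85, "65-nm": 1.45}
T_FO4_EST_PS = {"180-nm": 53, "130-nm": 34, "90-nm": 25, "65-nm": 20}


def t_fo4_ps(cv_over_i_ps, k4=K4):
    """FO4 delay estimate in picoseconds from a CV/I figure in picoseconds."""
    return k4 * cv_over_i_ps


for node, cv_over_i in CV_OVER_I_EST_PS.items():
    estimate = t_fo4_ps(cv_over_i)
    print(f"{node}: {estimate:.2f} ps (table: {T_FO4_EST_PS[node]} ps)")
```

Each computed value rounds to the table entry for its node.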

Achievable Power Improvement
(Assuming 50/50 Split of Logic and Memory)

Technique                         Power Type   Custom vs. ASIC   Energy   Component Type
Circuit Styles and Flops          Dynamic      1.7               0.815    Logic
Libraries + V_dd Scaling          Dynamic      1.4               0.855    Logic
SRAM Circuits                     Dynamic      2                 0.95     SRAM
Interconnect + V_dd Scaling       Dynamic      1.4               0.855    Interconnect
Bit Encoding                      Dynamic      1                 0.84     Interconnect
Clock Gating                      Dynamic      1                 0.84     Chip
Frequency Scaling                 Dynamic      1                 0.5      Chip
Subthreshold Circuits             Dynamic      N/A               0.062    Chip
V_dd Scaling                      Static       1                 0.79     Chip
MT-CMOS                           Static       1                 0.5      Chip
Stacking and input state vector   Static       1.4               0.7      Chip *
Body Bias                         Static       2                 0.5      Chip *
Supply Gating                     Static       10                0.1      Chip *

* Typically only one of these three is applied

Achievable Power Improvement
Assuming 50/50 Split of Logic and Memory

Type          130-nm ASIC (Custom)   90-nm ASIC (Custom)
Net Dynamic   45% (32%)              28% (20%)
Net Static    8% (4%)                20% (10%)
Total         53% (36%)              48% (30%)

130nm uP assumes 80% Dynamic and 20% Static
90nm uP assumes 50% Dynamic and 50% Static

16b 1024 point FFT

Generally, k N log N operations (complex multiplies) with pre-computation
Radix-2, Radix-4, etc. implementations
Decimation in time and/or decimation in frequency
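
As one concrete instance of the options listed above, the sketch below is a textbook recursive radix-2 decimation-in-time FFT in floating point. It is not the algorithm of any of the designs compared in this talk, and it ignores the 16-bit fixed-point arithmetic they use. The operation count assumes one twiddle multiply per butterfly, which corresponds to k = 1/2 with a base-2 logarithm.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft_radix2(x[0::2])   # transform of even-indexed samples
    odd = fft_radix2(x[1::2])    # transform of odd-indexed samples
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled            # butterfly, upper output
        out[k + n // 2] = even[k] - twiddled   # butterfly, lower output
    return out


def butterfly_multiplies(n):
    """Complex multiplies for a radix-2 FFT of power-of-two size n: (n/2) * log2(n)."""
    return (n // 2) * (n.bit_length() - 1)


print(fft_radix2([1, 1, 1, 1]))
print(butterfly_multiplies(1024), "complex multiplies for N = 1024")
```

Higher radices and precomputed twiddle tables reduce the constant k further, which is the "pre-computation" the slide refers to.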

Range of Implementations

MIT FFT (2005)
  0.18um CMOS, 628K T’s, 10kHz: Architecture and subthreshold circuits, 180mV operation

Spiffee (1999)
  0.7um CMOS, 460K T’s, 173MHz: Cached FFT architecture and algorithm, 1.1V operation

SA-1100 (1999)
  0.35um CMOS, 2.6M T’s, 74MHz: Commercial embedded processor, custom circuits, 1.5V operation

Imagine (2003)
  0.15um CMOS, 22M T’s, 232MHz: Streaming media processor, tiled standard cells, 1.2V operation

Stratix IS25F627C8 (2005)
  0.13um CMOS, 3.9K logic elements, 123K memory bits, 24 DSP blocks, 272MHz: Commercial FPGA co-processor

Intel P4 (2003)
  0.13um CMOS, 3GHz, SSE: Commercial general-purpose processor, custom circuits, 1.5V operation

TI ‘C6416 (2003)
  0.13um CMOS, 720MHz: Commercial digital signal processor

E_bit Energy 16b 1024 point FFT

Design      Fab (nm)   V_dd (V)   MHz    mW      Cycles
MIT FFT     180        1.8        0.01   1.6     95
Spiffee     700        3.3        173    845     5190
SA-1100     350        2          74     39      31500
Imagine     150        1.5        232    4000    3708
Stratix     130        1.3        275    884     1291
Intel P4    130        1.2        3000   51200   71680
TI 'C6416   130        1.2        720    1200    6526

E_bit Energy 16b 1024 point FFT

Design      EDP (rel norm)   E_bit (fJ)   E_fft (nJ)   Normalized to E_bit (1e6)   Energy Ratio
MIT FFT     143              3.3          154          47                          1
Spiffee     1                91           25350        277                         6
SA-1100     283              4.2          16601        3953                        85
Imagine     148              2.2          63931        29726                       637
Stratix     24               1.4          4149         2964                        64
Intel P4    12548            1.4          1E+06        873813                      18591
TI 'C6416   27               1.4          10877        7769                        166
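
For several rows, the E_fft and normalized columns can be reproduced from the power, cycle count, and clock frequency in the first table: energy per FFT is power times cycles divided by frequency, and the normalized figure is that energy divided by E_bit. The sketch below checks four rows. The MIT FFT row is omitted because its listed power, cycles, and frequency do not reproduce its E_fft entry under this simple formula. Function names are illustrative.

```python
# (power in mW, cycles per FFT, clock in MHz, E_bit in fJ), copied from the two tables.
DESIGNS = {
    "Spiffee": (845, 5190, 173, 91),
    "SA-1100": (39, 31500, 74, 4.2),
    "Stratix": (884, 1291, 275, 1.4),
    "TI 'C6416": (1200, 6526, 720, 1.4),
}


def fft_energy_nj(power_mw, cycles, clock_mhz):
    """Energy per FFT in nJ: mW * (cycles / MHz) = mW * us = nJ."""
    return power_mw * cycles / clock_mhz


def normalized_to_ebit_millions(e_fft_nj, ebit_fj):
    """E_fft in units of 1e6 E_bit: nJ / fJ is already a factor of 1e6."""
    return e_fft_nj / ebit_fj


results = {}
for name, (power_mw, cycles, clock_mhz, ebit_fj) in DESIGNS.items():
    e_fft = fft_energy_nj(power_mw, cycles, clock_mhz)
    results[name] = (e_fft, normalized_to_ebit_millions(e_fft, ebit_fj))
    print(f"{name}: E_fft = {e_fft:.0f} nJ, normalized = {results[name][1]:.0f} x 1e6 E_bit")
```

Small differences from the table are expected because the tabulated E_bit values are rounded.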

Which Design Is More Efficient?
Depends on the Metric!

0.7um CMOS 173MHz chip w/ 460K T’s
  Vdd (typ) = 3.3V, Vdd (min) = 1.1V
  Power = 845mW
  EDP 143x better

0.18um CMOS 10kHz chip w/ 640K T’s
  Vdd (max) = 1.8V, Vdd (min) = 0.18V
  Power = 1.6mW
  Absolute energy 6x better

Summary

Normalized metric E_bit enables meaningful comparisons across designs and technologies

Custom designers can exploit a wide range of optimizations: enabling architecture with circuits and circuits with architecture

Custom designs can readily achieve a 3x advantage in energy, with the potential for over 10x

Selective application of custom techniques and automated support for performance characterization at specific instead of generic operating points can enable ASIC designers to begin to bridge this power gap

Back-Up Slides


ASICs Rely on General Optimization Techniques
Focus: Improve the Average Case

Partitioning: Hypergraph min-cut, ratio cut
Solutions: move-based, geometric & combinatorial forms, clustering

[Figure: a circuit and its hypergraph H(V,E), E = { e1, e2, … } nets, with vertices V1-V5 and nets e1-e8; vertex & edge weights used to encode costs]
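
To make the min-cut and ratio-cut objectives concrete, the sketch below scores a bipartition of a small hypergraph. The nets and the partition are hypothetical, since the connectivity in the slide's figure is not recoverable from the text, and the ratio-cut cost used here is the common definition cut / (|A| · |B|).

```python
def cut_weight(nets, part_a, weights=None):
    """Total weight of nets that have pins on both sides of the bipartition."""
    weights = weights or {}
    total = 0
    for name, pins in nets.items():
        pins_in_a = len(pins & part_a)
        if 0 < pins_in_a < len(pins):   # net spans both parts
            total += weights.get(name, 1)
    return total


def ratio_cut(nets, part_a, vertices, weights=None):
    """Ratio-cut cost: cut weight divided by the product of the two part sizes."""
    part_b = vertices - part_a
    if not part_a or not part_b:
        raise ValueError("both parts must be non-empty")
    return cut_weight(nets, part_a, weights) / (len(part_a) * len(part_b))


# Hypothetical hypergraph: each net is the set of vertices it connects.
VERTICES = {"V1", "V2", "V3", "V4", "V5"}
NETS = {
    "e1": {"V1", "V2"},
    "e2": {"V2", "V3", "V4"},
    "e3": {"V4", "V5"},
    "e4": {"V1", "V5"},
}

part_a = {"V1", "V2"}
print("cut weight:", cut_weight(NETS, part_a))
print("ratio cut: ", ratio_cut(NETS, part_a, VERTICES))
```

Both objectives treat every net alike unless weights are supplied, which is one way a generic partitioner ends up optimizing the average case.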

Designs with Structure Do Not Exhibit Average Characteristics

[Figure: routing density of a 64b multiplier (half-array), showing a clear disparity in resource usage]