Reliability in VLSI Design - CPDEE - UFMG


PPGEE ’08

Reliability in Nanometer Technologies

Problems and Solutions

Dr.-Ing. Frank Sill

Department of Electrical Engineering, Federal University of Minas Gerais,
Av. Antônio Carlos 6627, CEP: 31270-010, Belo Horizonte (MG), Brazil


franksill@ufmg.br

http://www.cpdee.ufmg.br/~frank/

Copyright Sill, 2008

Agenda


Motivation


Failures in Nanometer Technologies


Techniques to Increase Reliability


Shadow Transistors

PPGEE‘08, Reliability


Motivation


Reliability important for


Normal user


Companies


Medical applications


Cars


Air / Space Environment





Motivation

Probability of failures increases due to:


Increasing transistor count


Shrinking technology

[Figure: Transistor count [Mill.] and technology node [nm] versus year, 2002 to 2008: Northwood (130 nm, 55 Mill.), Prescott (90 nm, 125 Mill.), Yonah (65 nm, 151 Mill.), Wolfdale (45 nm, 410 Mill.)]

Dimensions

[Figure: Length scale from 1 m down to 100 nm (1 m, 10 cm, 1 cm, 1 mm, 100 µm, 10 µm, 100 nm), ending at a „65 nm“ transistor. Sources: Intel; „Spektrum der Wissenschaften“]

Failures in Nanometer Technologies

Process Failures


Occur at production phase


Based on


Process Variations


Particles




Source: Mak

Sub-wavelength Lithography

[Figure: Lithography wavelength (365 nm, 248 nm, 193 nm, 13 nm EUV) and technology generation (180 nm, 130 nm, 90 nm, 65 nm, 45 nm, 32 nm) versus year, 1980 to 2020, on a logarithmic scale; the gap between wavelength and generation is marked. Source: Mark Bohr, Intel]

Field-dependent Aberrations

[Figure: The same cell A placed at three positions (X0, Y0), (X1, Y1), and (X2, Y2) on a big chip, shown below the lens in the wafer plane. Center: minimal aberrations. Edge: high aberrations. Source: R. Pack, Cadence]

Varying Line Width

[Figure: Line width [nm] as a function of the position on the wafer (Wafer X, Wafer Y). Source: Zhou, 2001]

Random Dopant Fluctuations

Uniform vs. non-uniform dopant distribution

Causes Vth variations

[Figure: Mean number of dopant atoms (logarithmic scale, 10 to 10000) versus technology node (1000 nm down to 32 nm). Source: Borkar, Intel]

Power Density

[Figure: Power density (W/cm2, logarithmic scale, 1 to 10000) versus year (1970 to 2010) for the 4004, 8008, 8080, 8085, 8086, 286, 386, 486, Pentium®, P4, and Prescott Pentium® processors, compared with a hot plate, a nuclear reactor, a rocket nozzle, and the Sun’s surface. Source: Moore, ISSCC 2003]

Temperature Variation



Power density is not uniformly distributed across the chip


Silicon is not a good heat conductor


Max junction temperature is determined by hot-spots

Impact on packaging, cooling

[Figure: Power map and on-die temperature. Source: Borkar, Intel]

Temperature Variation cont’d

[Figure: Power4 Server Chip. Source: Devgan, ICCAD’03]

Temperature Variation cont’d

Threshold voltage Vth changes with temperature

→ Drain-source current changes

→ Delay changes

[Figure: Drain current IDS [pA] and delay [s] versus temperature [°C]. Source: Burleson, UMASS, 2007]

Supply Voltage Drop

Source: Trester, 2005

Failures Through Increasing Delay

[Figure: Flip-flops (FF) with logic in between, driven by the clock (Clk); the logic delay depends on VDD, temperature, ...]

Data are processed before the clock phase is over

Logic too slow!

→ Data processing takes longer than the clock phase

→ Wrong data in the next clock phase!
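The timing condition described on this slide can be written as a small check. The following is a minimal sketch in Python; the function names, the delay values, and the slow-down factors are illustrative assumptions and do not come from the talk.

```python
def meets_timing(logic_delay_ns: float, clock_period_ns: float,
                 setup_time_ns: float = 0.0) -> bool:
    """Return True if the data settles before the capturing clock edge."""
    return logic_delay_ns + setup_time_ns <= clock_period_ns


def derated_delay(nominal_delay_ns: float, derating_factor: float) -> float:
    """Scale a nominal logic delay by one factor that lumps together the
    effects of a lower VDD, a higher temperature, and so on."""
    return nominal_delay_ns * derating_factor


clock_period_ns = 1.0      # assumed 1 GHz clock
nominal_delay_ns = 0.8     # assumed logic delay under nominal conditions

for factor in (1.0, 1.2, 1.4):  # hypothetical slow-down factors
    delay = derated_delay(nominal_delay_ns, factor)
    ok = meets_timing(delay, clock_period_ns)
    print(f"factor {factor:.1f}: delay {delay:.2f} ns ->",
          "ok" if ok else "timing failure (wrong data captured)")
```

With these assumed numbers, only the largest slow-down factor pushes the logic delay beyond the clock period.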

Soft Errors

Source: Automotive 7-8, 2004


Observed in the 1970s: DRAMs occasionally flip bits for no apparent reason

Ultimately linked to alpha particles and cosmic rays

Collisions with particles create electron-hole pairs in the substrate

These carriers are collected on dynamic nodes, disturbing the voltage

Soft Errors cont’d


Internal state of a node flips briefly

If the error isn’t masked by

Logic: Wrong input doesn’t lead to wrong output

Electrical: Pulse is attenuated by following gates

Timing: Data based on the pulse reach the flip-flop after the clock transition

→ wrong data

Electromigration

Electromigration:

Transport of material caused by the gradual movement of ions in a conductor

One of the major failure mechanisms in interconnects

Time to failure is proportional to the width and thickness of the metal lines

Time to failure is inversely proportional to the current density

[Figure: Top view and cross-section view of Metal 1 and Metal 2 lines on thick oxide, showing a void and a whisker/hillock. Source: Plusquellic, UMBC]
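Electromigration lifetime is commonly described by Black's equation, MTTF = A · J^(-n) · exp(Ea / (k·T)), which the slides do not state explicitly. The sketch below assumes that model; the constant A, the exponent n = 2, the activation energy, and the line dimensions are illustrative assumptions, so only ratios between results are meaningful.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K


def current_density(current_a: float, width_m: float, thickness_m: float) -> float:
    """Current density J = I / (w * t) in A/m^2 for a rectangular line."""
    return current_a / (width_m * thickness_m)


def em_mttf(j: float, temp_k: float, n: float = 2.0,
            ea_ev: float = 0.7, a: float = 1.0) -> float:
    """Black's equation in arbitrary time units (a, n, ea_ev are assumed)."""
    return a * j ** (-n) * math.exp(ea_ev / (BOLTZMANN_EV_PER_K * temp_k))


# Same current, same thickness, but the second line is twice as wide
narrow = em_mttf(current_density(1e-3, 100e-9, 200e-9), temp_k=378.0)
wide = em_mttf(current_density(1e-3, 200e-9, 200e-9), temp_k=378.0)
print(f"doubling the line width extends the MTTF by {wide / narrow:.1f}x")
```

For a fixed current, a wider or thicker line lowers the current density, which is how the line geometry enters this model.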

Electromigration cont’d

Void in 0.45 µm Al-0.5%Cu line

Source: IMM-Bologna

Hillocks in ZnSn

Source: Ku&Lin, 2007

Whiskers in Sn

Source: EPA Centre


Time-Dependent Dielectric Breakdown (TDDB)

Tunneling currents

→ Wear out of gate oxide

Creation of conducting path between Gate and Substrate, Drain, Source

Depending on the electrical field over the gate oxide, temperature (exponentially), and gate oxide thickness (exponentially)

Also: abrupt damage due to extreme overvoltage (e.g. Electro-Static Discharge)

Source: Pey&Tung

Variability Trends

[Figure: % variability (0 to 70) of Vdd, Vth, performance, power, and Lgate versus technology node (90 nm down to 28 nm). Source: Burleson, UMASS, 2007]

Variability Trends cont’d

[Figure: Relative soft error rate (SER) per chip (logic & memory) versus technology (180 nm down to 16 nm). Source: Borkar, Intel]

Variability Trends cont’d

Frequency and sub-threshold leakage variations

[Figure: Normalized leakage (Isub, 1 to 5) versus normalized frequency (0.9 to 1.4) for ~1000 samples at 130 nm: frequency varies by ~30%, leakage power by ~5-10X. Source: Borkar, Intel]

Variability Trends cont’d

Increasing probability for Gate-Oxide-Breakdown

[Figure: Current density Jox (logarithmic scale, 1 to 10000) versus technology (180 nm, 90 nm, 45 nm, 22 nm). Source: Borkar, Intel]

[Figure: Reliability (Weibull slope β, 0 to 16) versus gate oxide thickness [nm] (0 to 12), with “high-k?” marked. Source: Kauerauf, EDL, 2002]

Future Designs

100 Billion Transistors

100 BT integration capacity

Billions unusable (variations)

Some will fail over time

Intermittent failures

Source: Borkar, Intel

Approaches to Increase Reliability


Failure Measurement

Reliability R(t):

Probability that a system performs as desired until time t

Example: R(t_x) = 0.8 → 80 % chance that the system is still running at time t_x

Mean Time To Failure MTTF:

Average time that a system runs until it fails

Failure rate λ:

Probability that the system fails in a given time interval

$R(t) = e^{-\lambda t}$

$MTTF = \int_0^\infty R(t)\,dt = \frac{1}{\lambda}$
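For a constant failure rate λ, the reliability is R(t) = exp(-λ·t), and integrating R(t) from 0 to infinity gives MTTF = 1/λ. The following minimal Python sketch checks this numerically; the failure rate of 1e-5 per hour is an assumed example value, not a figure from the talk.

```python
import math


def reliability(t: float, failure_rate: float) -> float:
    """R(t) = exp(-lambda * t) for a constant failure rate lambda."""
    return math.exp(-failure_rate * t)


def mttf_numeric(failure_rate: float, steps: int = 200_000) -> float:
    """Approximate MTTF = integral of R(t) dt with the trapezoidal rule.

    The integral is truncated at 40 / lambda, where R(t) is negligible.
    """
    t_max = 40.0 / failure_rate
    dt = t_max / steps
    total = 0.5 * (reliability(0.0, failure_rate) + reliability(t_max, failure_rate))
    for i in range(1, steps):
        total += reliability(i * dt, failure_rate)
    return total * dt


failure_rate = 1e-5  # assumed example: failures per hour
print("closed-form MTTF:", 1.0 / failure_rate, "hours")
print("numeric MTTF:    ", round(mttf_numeric(failure_rate)), "hours")

# Time t_x at which R(t_x) = 0.8, matching the example on the slide
t_x = -math.log(0.8) / failure_rate
print("R(t_x) = 0.8 at t_x =", round(t_x), "hours")
```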

Bathtub Failure Model

[Figure: Failure rate versus time (bathtub curve); the time axis is marked with 1-40 weeks and 7-15 years]

Infant mortality

Declining failure rate

Based on latent reliability defects

Normal lifetime

Constant failure rate

Based on TDDB, EM, hot-electrons…

Wearout period

Increasing failure rate

Based on TDDB, EM, etc.
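One common way to model the three regions is the Weibull failure rate h(t) = (β/η)·(t/η)^(β-1): β < 1 gives a declining rate, β = 1 a constant rate, and β > 1 an increasing rate. The slides do not give a formula for the curve, so the sketch below is an assumption, and all shape and scale parameters are purely illustrative.

```python
def weibull_hazard(t: float, beta: float, eta: float) -> float:
    """Weibull failure rate h(t) = (beta / eta) * (t / eta) ** (beta - 1)."""
    return (beta / eta) * (t / eta) ** (beta - 1.0)


def bathtub_hazard(t_years: float) -> float:
    """Sum of three competing failure rates; all parameters are hypothetical."""
    infant = weibull_hazard(t_years, beta=0.5, eta=200.0)   # declining
    random_ = weibull_hazard(t_years, beta=1.0, eta=50.0)   # constant
    wearout = weibull_hazard(t_years, beta=5.0, eta=15.0)   # increasing
    return infant + random_ + wearout


for t in (0.05, 0.5, 5.0, 10.0, 15.0, 20.0):
    print(f"t = {t:5.2f} years: failure rate = {bathtub_hazard(t):.4f} per year")
```

Adding the three rates treats the mechanisms as independent and competing, which is a simplification.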

Classification

Failure

  Permanent: Defects, wearout, out of range parameters, EM, TDDB, ...

  Temporary

    Transient

      Radiation: Soft errors

      Non-Radiation: Power supply, coupling, operation peaks

    Intermittent: Process variations, infant mortality, random dopant fluctuation, ...

Source: Mitra, 2007

The Whole System Counts!

Triple Module Redundancy (TMR)

[Figure: The input feeds Logic L and two copies of Logic L; their outputs A, B, and C go to a voter, which produces the output]

Triple Module Redundancy: Voter

Hardware realization of 1-bit majority voter

OUT = AB+AC+BC

Requires 2 gate delays

A  B  C  OUT
0  0  1  0
0  1  0  0
0  1  1  1
1  1  0  1
:  :  :  :
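The voter equation OUT = AB+AC+BC can be tried out in a few lines of Python. This is a minimal sketch: the `tmr` helper, its `fault_in_copy` test hook, and the parity function used as example logic are hypothetical and only serve to illustrate the idea.

```python
from itertools import product


def majority_vote(a: int, b: int, c: int) -> int:
    """1-bit majority voter: OUT = AB + AC + BC."""
    return (a & b) | (a & c) | (b & c)


def tmr(logic, x, fault_in_copy=None):
    """Run three copies of `logic` on input x and vote on the 1-bit results.

    `fault_in_copy` (0, 1, 2, or None) inverts the output of one copy to
    emulate a single faulty module; it is a hypothetical test hook.
    """
    outputs = [logic(x) for _ in range(3)]
    if fault_in_copy is not None:
        outputs[fault_in_copy] ^= 1
    return majority_vote(*outputs)


# Complete truth table of the voter
print("A B C | OUT")
for a, b, c in product((0, 1), repeat=3):
    print(f"{a} {b} {c} |  {majority_vote(a, b, c)}")


def parity(x: int) -> int:
    """Example 1-bit logic function: parity of the bits of x."""
    return bin(x).count("1") & 1


# A single faulty copy is outvoted by the two correct ones
print("faulty copy masked:", tmr(parity, 0b1011, fault_in_copy=1) == parity(0b1011))
```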

Triple Module Redundancy cont’d


After a certain time, the reliability of a TMR system is lower than that of a simplex system

Why: After some time, the probability that 2 modules have failed is higher than the probability that 2 modules are still working!

[Figure: Reliability (0 to 1.0, with 0.5 marked) versus time for TMR and simplex (only 1 module). Note: For a constant module failure rate]
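The crossover can be reproduced with the standard 2-out-of-3 formula R_TMR = 3R² - 2R³, where R is the reliability of one module. The sketch below assumes a perfect voter, independent modules, and a constant module failure rate; the normalized rate of 1.0 is an arbitrary choice.

```python
import math


def simplex_reliability(t: float, failure_rate: float) -> float:
    """Single module with constant failure rate: R(t) = exp(-lambda * t)."""
    return math.exp(-failure_rate * t)


def tmr_reliability(t: float, failure_rate: float) -> float:
    """TMR works while at least 2 of 3 modules work (voter assumed perfect):
    R_TMR = R^3 + 3 * R^2 * (1 - R) = 3 * R^2 - 2 * R^3.
    """
    r = simplex_reliability(t, failure_rate)
    return 3.0 * r ** 2 - 2.0 * r ** 3


failure_rate = 1.0  # normalized: time is measured in units of 1 / lambda
crossover = math.log(2.0) / failure_rate  # where R = 0.5 and both curves meet

for t in (0.1, 0.3, crossover, 1.0, 2.0):
    print(f"t = {t:.3f}: simplex = {simplex_reliability(t, failure_rate):.3f}, "
          f"TMR = {tmr_reliability(t, failure_rate):.3f}")
```

Under these assumptions TMR is better than simplex only while the module reliability is above 0.5, that is, before t = ln(2)/λ.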

Self Adaptive Design


Extend idea of clock domains to Adaptive Power Domains


Tackle static process and slowly varying timing variations

Control VDD, Vth (indirectly by body bias), and fclk by calibration at Power On

[Figure: A test block applies test inputs to the module, evaluates the responses, and sets VDD, VBB, and fclk of the module]

Self Adaptive Design: Example


21 submodules per die


Applying 0.5V Forward/Reverse Body Biasing (FBB/RBB) in steps
of 32 mV, respectively

[Figure: Accepted dies (0% to 100%) per frequency bin, up to the higher frequencies, for noBB, ABB (100% yield), and within-die ABB (97% in the highest bin)]

For given frequency and power density:

100% yield with ABB

97% highest frequency bin with ABB for within-die variability

Source: Borkar, Intel
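A power-on calibration of the body bias could look roughly like the following Python sketch. Only the bias range of 0.5 V and the step size of 32 mV come from the slide; the module model `toy_module`, the frequency target, and the leakage limit are hypothetical stand-ins for measurements on real hardware.

```python
def bias_settings_mv(max_bias_mv: int = 500, step_mv: int = 32):
    """All body-bias settings from reverse (negative) to forward (positive)
    bias that fit into +/- max_bias_mv with the given step size."""
    n = max_bias_mv // step_mv
    return [k * step_mv for k in range(-n, n + 1)]


def toy_module(bias_mv: int, speed_offset: float):
    """Hypothetical module model: forward bias raises frequency and leakage.
    Returns (normalized frequency, normalized leakage)."""
    frequency = 1.0 + speed_offset + 0.0004 * bias_mv
    leakage = 2.0 ** (bias_mv / 100.0)
    return frequency, leakage


def calibrate(speed_offset: float, f_target: float = 1.0,
              leakage_limit: float = 8.0):
    """Return the most reverse (lowest-leakage) bias that meets the frequency
    target within the leakage limit, or None if no setting qualifies."""
    for bias in bias_settings_mv():
        frequency, leakage = toy_module(bias, speed_offset)
        if frequency >= f_target and leakage <= leakage_limit:
            return bias
    return None


for offset in (-0.10, 0.0, 0.10):  # slow, typical, and fast dies
    print(f"speed offset {offset:+.2f}: chosen body bias = {calibrate(offset)} mV")
```

In this toy model, slow dies end up with forward bias and fast dies with reverse bias, which matches the intent of adaptive body biasing.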

Razor Flip-Flop

For uncertainty- and variation-tolerant design

Razor methodology

Voltage-scaling methodology based on real-time detection and correction of circuit timing errors

Use the actual hardware to check for errors

Latch the input data twice:

Once on the clock edge, and then a little later

If the data is not the same, you are going too fast

Source: Austin, Computer Magazine, 2004

Razor Flip-Flop cont’d

[Figure: Razor flip-flop between logic stage n and logic stage n+1: a main flip-flop clocked by CLK, a shadow latch clocked by CLK_delayed, a comparator that generates the Error signal, and a MUX at the input of the main flip-flop. Timing diagram: CLK, CLK_delayed, D (Instr 1, Instr 2), Q (Instr 1, Instr 2), Error. Source: Austin, 2004]
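The double-sampling idea can be illustrated with a small behavioral model in Python. This is a sketch under simplifying assumptions: the data line changes exactly once, the shadow sample is always taken as the correct one, and the times of the clock edge and the shadow delay are invented for the example.

```python
def settled_value(t_ns: float, arrival_ns: float, old: int, new: int) -> int:
    """Value on the data line D at time t: `old` until the logic output
    arrives, `new` afterwards."""
    return new if t_ns >= arrival_ns else old


def razor_sample(arrival_ns: float, old: int, new: int,
                 clk_edge_ns: float = 1.0, shadow_delay_ns: float = 0.3):
    """Sample D at the clock edge (main flip-flop) and again a little later
    (shadow latch). Returns (q, error): on a mismatch the shadow value is
    taken as correct and the error flag is raised."""
    main = settled_value(clk_edge_ns, arrival_ns, old, new)
    shadow = settled_value(clk_edge_ns + shadow_delay_ns, arrival_ns, old, new)
    error = main != shadow
    return (shadow if error else main), error


for arrival in (0.8, 1.1):  # data arrives before / after the clock edge
    q, error = razor_sample(arrival, old=0, new=1)
    print(f"data arrival at {arrival} ns: Q = {q}, error = {error}")
```

Data that arrives even later than the delayed clock would go undetected in this model, so the delay of the shadow sample bounds the timing errors that can be caught.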

Shadow Transistor Approach

TDDB model

TDDB between gate and channel

Model:

[Figure: Transistor cross-section (gate, gate oxide, source, drain) with a breakdown resistance RGC between gate and channel; the transistor of width W is split into two transistors of widths W1 and W2, with W = W1 + W2]

For an Inverter, 65nm-BPTM:

[Figure: Vout/VDD (0% to 100%) and relative delay (0 to 20) versus RGC [kΩ]]

Based on: Segura et al., “A Detailed Analysis of GOS Defects in MOS Transistors: Testing Implications at Circuit Level”, 1995.

TDDB Model cont’d

TDDB between gate and source/drain

Model:

[Figure: Transistor cross-section (gate, gate oxide, source, drain) of width W with breakdown resistances RGS between gate and source and RGD between gate and drain]

For an Inverter, 65nm-BPTM:

[Figure: Vout/VDD (0% to 100%) versus RGC [kΩ]]

Based on: Segura et al., “A Detailed Analysis of GOS Defects in MOS Transistors: Testing Implications at Circuit Level”, 1995.

Shadow Transistors

1. Insertion of additional transistors in parallel to vulnerable transistors

→ Shadow transistors (ST)

[Figure: For an Inverter, 65nm-BPTM: relative delay (0 to 10) and Vout/VDD (0% to 100%) versus RGC [kΩ], with (w/ ST) and without (wo/ ST) shadow transistors]

Shadow Transistors cont’d

2. Application of H-Vt/To transistors with:

Higher threshold voltage

Thicker gate oxide

→ Less vulnerable to TDDB

$MTTF \propto 10^{t_{ox}/0.22}$

$\frac{MTTF_{H\text{-}Vt/To}}{MTTF_{L\text{-}Vt/To}} = 10^{0.15/0.22} \approx 4.81$

MTTF: Mean Time To Failure

Source:

Srinivasan, “RAMP: A Model for Reliability Aware Microprocessor Design”

Stathis, J., “Reliability Limits for the Gate Insulator in CMOS Technology”
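Reading the oxide-thickness relation on this slide as MTTF ∝ 10^(t_ox / 0.22), with t_ox in nm, the gain for a thicker gate oxide follows directly. The short Python sketch below evaluates it; the function name and the extra thickness values besides 0.15 nm are chosen here for illustration.

```python
def tddb_mttf_ratio(delta_tox_nm: float, nm_per_decade: float = 0.22) -> float:
    """MTTF gain from a gate oxide that is delta_tox_nm thicker, using
    MTTF ~ 10 ** (t_ox / 0.22) with t_ox in nm."""
    return 10.0 ** (delta_tox_nm / nm_per_decade)


for delta in (0.05, 0.10, 0.15, 0.22):
    print(f"delta t_ox = {delta:.2f} nm -> MTTF x {tddb_mttf_ratio(delta):.2f}")
```

For an oxide that is 0.15 nm thicker, this yields the factor of about 4.81 given on the slide.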


Shadow Transistors cont’d

3. Selective insertion of shadow transistors in parallel to vulnerable transistors:

Component reliability depends on

Activity, state, temperature, size, fabrication …

→ Most vulnerable can be identified

Shadow transistors only added in parallel to most vulnerable devices.

Netlist modification:

Estimation of stress factors

Determination of component reliability

Adding redundancy only at most vulnerable components

Advantage: Lower area, power and delay penalty compared to complete redundancy or random insertion [Sri04]

New Approach

Source: [Sri04] Sirisantana, D&T, 2004
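The three steps (estimate stress, determine component reliability, add redundancy only where needed) can be sketched as a small selection routine. The Python example below is hypothetical throughout: the `Transistor` record, the stress formula, the threshold, and the netlist values are invented for illustration and are not the algorithm from the talk.

```python
from dataclasses import dataclass


@dataclass
class Transistor:
    name: str
    activity: float        # switching activity, 0..1 (assumed input)
    temperature_c: float   # estimated local temperature (assumed input)
    has_shadow: bool = False


def stress_score(t: Transistor) -> float:
    """Hypothetical stress metric; the slides name the factors but not
    the formula, so this weighting is purely illustrative."""
    return 100.0 * t.activity * (t.temperature_c / 100.0)


def insert_shadow_transistors(netlist, threshold: float):
    """Mark every transistor whose stress score exceeds the threshold and
    return the names of the devices that received a shadow transistor,
    ordered from most to least stressed."""
    selected = []
    for t in sorted(netlist, key=stress_score, reverse=True):
        if stress_score(t) > threshold:
            t.has_shadow = True
            selected.append(t.name)
    return selected


netlist = [
    Transistor("M1", activity=0.9, temperature_c=95.0),
    Transistor("M2", activity=0.2, temperature_c=70.0),
    Transistor("M3", activity=0.6, temperature_c=85.0),
    Transistor("M4", activity=0.1, temperature_c=60.0),
]
print("shadow transistors added to:", insert_shadow_transistors(netlist, 40.0))
```

Raising the threshold trades reliability gain against area, power, and delay overhead, since fewer devices receive a shadow transistor.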

Shadow Transistors cont’d

Advantages

Increased reliability with respect to TDDB

H-Vt/To: Reliability increases by ~5x (for Δtox = 0.15 nm)

Remarkable increase of system life time

Drawbacks

Higher input capacitance → higher delay and dynamic power dissipation

Area increase

Remarks

Only slight improvements for Gate-Drain/Source breakdown

H-Vt/To has to be supported by technology

Improvement MTTF

[Figure: Improvement of MTTF as regards TDDB (0% to 20%) by insertion of L-Vt/To shadow transistors (ST) for the benchmark circuits c17, c432, c499, c880, c1355, c1908, c2670, c3540, c5315, c6288, and c7552; our algorithm versus random insertion]

≈ 23 % additional transistors

Improvement MTTF (H-Vt/To)

[Figure: Improvement of MTTF as regards TDDB (0% to 250%) by insertion of H-Vt/To shadow transistors (ST) for the benchmark circuits c432, c499, c880, c1355, c1908, c2670, c3540, c5315, c6288, and c7552, for SPth = 30 and SPth = 55]

Take Home Messages


Integrated circuits face several kinds of failures

Decreasing structure sizes create more failure sources

Future designs should (have to) be failure tolerant

Possible approaches:

Triple Module Redundancy (TMR)

Self-Adapting Designs

Razor Flip-Flops

Shadow Transistors

There’s still a lot to do!

Thank you!

franksill@ufmg.br
