Intermittent Faults: Frequency, Causes and Models - UBC Blogs


Nov 2, 2013


Layali Rashid


Errors that occur in bursts, at the same location, when the fault is activated [WDSN07].

Faults which occur frequently and irregularly for a period of time [ASPLOS08].

A persistent defect that causes zero or more failures, such as a speck of conductive dust partially bridging two traces [EuroSys11].

Bursts of errors that recur non-deterministically [Layali].
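
The last definition (bursts of errors that recur non-deterministically) can be mimicked with a two-state Markov chain. This is only a sketch: the model, the function names and every probability below are assumptions chosen for illustration, not something taken from the cited papers.

```python
import random

def simulate_intermittent_fault(cycles, p_activate, p_deactivate, p_error, seed=0):
    """Return a list of 0/1 error flags, one per cycle.

    Hypothetical two-state model (an assumption, not from the cited papers):
    the fault is either dormant or active, and errors appear only while active.
    """
    rng = random.Random(seed)
    active = False
    errors = []
    for _ in range(cycles):
        # Non-deterministic activation and deactivation of the fault.
        if active:
            if rng.random() < p_deactivate:
                active = False
        elif rng.random() < p_activate:
            active = True
        # While active, the fault corrupts the same location with p_error.
        errors.append(1 if active and rng.random() < p_error else 0)
    return errors

def count_bursts(errors, max_gap=5):
    """Count bursts: groups of errors separated by more than max_gap clean cycles."""
    bursts, gap = 0, max_gap + 1
    for e in errors:
        if e:
            if gap > max_gap:
                bursts += 1
            gap = 0
        else:
            gap += 1
    return bursts

trace = simulate_intermittent_fault(10_000, p_activate=0.001, p_deactivate=0.05, p_error=0.8)
print(sum(trace), "errors in", count_bursts(trace), "bursts")
```

Raising p_deactivate shortens the bursts, and lowering p_activate spaces them further apart. A permanent fault corresponds to p_deactivate = 0.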




257 servers for 1.2 years.

Memory SBE rate:

Number of Errors    Frequency (%)
0                   47.5
1-5                 31.5
6-99                13.3
100-1000            5.8
>1000               1.9

6.2% of the memory subsystems were affected by ifaults.

Processor buses in 2 servers had 15 to 7104 SBE bursts.
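
A quick check of the error-count distribution above. The sketch assumes the Frequency (%) column is the share of observed systems in each bucket; the variable names are illustrative.

```python
# Error-count histogram from the table above: bucket -> frequency in percent.
sbe_histogram = {
    "0": 47.5,
    "1-5": 31.5,
    "6-99": 13.3,
    "100-1000": 5.8,
    ">1000": 1.9,
}

total = sum(sbe_histogram.values())
with_errors = total - sbe_histogram["0"]
heavy = sbe_histogram["100-1000"] + sbe_histogram[">1000"]

print(f"total: {total:.1f}%")
print(f"at least one error: {with_errors:.1f}%")
print(f"100 or more errors: {heavy:.1f}%")
```

The buckets add up to 100.0%. Under the stated assumption, 52.5% of the systems saw at least one error and 7.7% saw 100 or more.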



Rates of failures (regardless of the error type).

Failure                   P(1 fail)   P(2 fail | 1 fail)   P(3 fail | 2 fail)
CPU (5 working days)      0.3%        30%                  56%
CPU (30 working days)     0.5%        34%                  59%
DRAM (5 working days)     0.03%       11%                  46%
DRAM (30 working days)    0.05%       8%                   50%
Disk (5 working days)     0.2%        29%                  53%
Disk (30 working days)    0.3%        29%                  59%
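
The conditional rates above can be chained to estimate how often a machine reaches three failures, and to show how much more likely a second failure is than a first. This is a sketch, assuming P(1 fail) means at least one failure in the window and each later column is conditioned on the previous one; the function names are illustrative.

```python
# (P(1 fail), P(2 fail | 1 fail), P(3 fail | 2 fail)) as fractions, from the table.
failure_rates = {
    ("CPU", 5): (0.003, 0.30, 0.56),
    ("CPU", 30): (0.005, 0.34, 0.59),
    ("DRAM", 5): (0.0003, 0.11, 0.46),
    ("DRAM", 30): (0.0005, 0.08, 0.50),
    ("Disk", 5): (0.002, 0.29, 0.53),
    ("Disk", 30): (0.003, 0.29, 0.59),
}

def p_at_least_three(p1, p2_given_1, p3_given_2):
    """Chain rule: P(3 failures) = P(1) * P(2 | 1) * P(3 | 2)."""
    return p1 * p2_given_1 * p3_given_2

def recurrence_factor(p1, p2_given_1):
    """How many times more likely a second failure is than a first one."""
    return p2_given_1 / p1

for (part, days), (p1, p2, p3) in failure_rates.items():
    print(f"{part:4s} {days:2d} days: P(3 failures) = {p_at_least_three(p1, p2, p3):.2e}, "
          f"second failure {recurrence_factor(p1, p2):.0f}x more likely than the first")
```

For the CPU over 5 working days the factor is about 100: a machine that has failed once is roughly a hundred times more likely to fail again than a machine picked at random is to fail at all.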



Rates of ifaults out of total failures:

Fault location    Rate of ifaults
CPU               39%
DRAM              19%
Disk              39%

When do ifaults recur?

Fault location    Recur within 10 days    Recur within a month
CPU               84%                     97%
Disk              86%                     99%

How many times does an ifault recur?

MTTF decreases as more failures occur.

Not exponentially distributed.
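
One way to see the departure from an exponential distribution is to fit an exponential recurrence time to the 10-day figure and compare its prediction with the one-month figure. This is a rough sketch: it assumes a month is 30 days and treats the rounded percentages above as exact.

```python
import math

def exponential_prediction(p_within_10_days, horizon_days=30.0):
    """Fit an exponential recurrence time to the 10-day figure and
    predict the fraction of recurrences within horizon_days."""
    rate = -math.log(1.0 - p_within_10_days) / 10.0  # recurrences per day
    return 1.0 - math.exp(-rate * horizon_days)

# (recur within 10 days, recur within a month), as fractions.
observed = {"CPU": (0.84, 0.97), "Disk": (0.86, 0.99)}

for part, (p10, p_month) in observed.items():
    predicted = exponential_prediction(p10)
    print(f"{part}: exponential fit predicts {predicted:.3f}, observed {p_month:.2f}")
```

The fit predicts 0.996 for the CPU and 0.997 for the disk, both above the observed values. Recurrences are front-loaded with a longer tail than an exponential model allows.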


Location            Source                                   Results in ifault?
Wires               Electromigration                         Yes
                    Stress migration                         Not mentioned
                    Crosstalk                                Yes
Transistor          Gate oxide breakdown                     Yes
                    Hot carrier injection                    Not mentioned
                    Negative bias temperature instability    Not mentioned
Package and pins    Thermal cycling                          Not mentioned
Other               Manufacturing defects                    Yes
                    Dust                                     Yes


Gate oxide breakdown

[Figure: gate stack showing the PolySi gate, SiO2 and substrate. From Wikipedia]

Traps exist in SiO2 due to manufacturing defects or high voltage.

Soft breakdown, then hard breakdown (I_leak).

Consequences:
↑ Leakage current
↑ Power consumption

Possible solutions:
High-k dielectric.
Burn-in.



Electromigration

Thinner wires, high current density and temperature.

Imperfections in metal films.*

* University of Kiel

Consequences:
Voids → stuck opens
Hillocks → stuck shorts
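
The dependence on current density and temperature is commonly summarized by Black's equation, MTTF = A * J^(-n) * exp(Ea / (k * T)). The slides do not give a formula, so this standard empirical model is brought in here only as an illustration. The exponent n = 2 and the activation energy Ea = 0.7 eV are assumed values that vary with metal and process, and the function name is hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K

def relative_mttf(j_ratio, temp_k, ref_temp_k, n=2.0, ea_ev=0.7):
    """MTTF relative to a reference wire under Black's equation,
    MTTF = A * J**(-n) * exp(Ea / (k * T)).

    j_ratio is J / J_ref. n and ea_ev are illustrative assumptions;
    real values depend on the metal and the process.
    """
    current_term = j_ratio ** (-n)
    thermal_term = math.exp(
        (ea_ev / BOLTZMANN_EV_PER_K) * (1.0 / temp_k - 1.0 / ref_temp_k)
    )
    return current_term * thermal_term

# Doubling the current density at constant temperature.
print(relative_mttf(2.0, 358.0, 358.0))
# Same current density, 20 K hotter than the reference.
print(relative_mttf(1.0, 378.0, 358.0))
```

With n = 2, doubling the current density cuts the expected lifetime to a quarter, which is why thinner wires carrying the same current wear out sooner.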



Stress migration

Thermal stress.

Growth of voids.

Contributes to electromigration.

Consequences:
Voids → stuck opens





Crosstalk

Major problem during layout synthesis.

Consequences: delays and glitches.



Thermal cycling

Appears at the package and die interface (e.g. solder joints).

Large cycles vs. small cycles.

Consequences: ?


Other Wearout Mechanisms

Hot Carrier Injection
Negative Bias Temperature Instability



Hot carrier injection

Dominant reliability concern for nMOS transistors.

Happens during normal operating-temperature ranges.

[Figure: transistor cross-section labeled p+, n+, n+, Vd, Vg, Vbs, Vs and Ig]

Consequences:
o Decreased drain current
o Slower IC



Negative bias temperature instability

Dominant reliability concern for pMOS.

Happens at high temperature.

[Figure: transistor cross-section labeled n+, p+, p+, Vd, Vg, Vbs and Vs]

Consequences:
o Reduces Vt
o Reduces IC speed (~20%)
o Path delay error

[Figure: Si-H bonds, H and H2 across the substrate, SiO2 and PolySi layers]


Location            Source                                   Model               Duration
Wires               Electromigration                         Short and open
                    Stress migration                         Short and open
                    Crosstalk                                Delay and glitch
Transistor          Gate oxide breakdown                     I_leakage
                    Hot carrier injection                    Path delay
                    Negative bias temperature instability    Path delay          S: 1x10^4 s + R: 2x10^4 s
Package and pins    Thermal cycling
Other               Manufacturing defects
                    Dust

Supply voltage fluctuation lasts from 5 to 30 cycles.

Temperature effects evolve over hundreds of microseconds or milliseconds.

Soft breakdown evolves over a few days, then becomes hard breakdown.



Transistor
    Stuck-open: last output
    Stuck-short: IDDQ
    Delay


Wire
    Open: stuck-open (last output), stuck-at
    Short: bridging (IDDQ, logical AND/OR)
    Delay
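
The logical wire models above can be sketched in a few lines. This is a minimal illustration with hypothetical names: it covers only the stuck-at, logical AND/OR bridging and stuck-open (last output) behaviors on single bits. IDDQ and delay effects are electrical or timing phenomena and are not modeled here.

```python
def stuck_at(driven, stuck_value):
    """Stuck-at: the wire reads a fixed value regardless of its driver."""
    return stuck_value

def bridge(a, b, kind="and"):
    """Bridging short: both wires read the logical AND or OR of their drivers."""
    value = (a & b) if kind == "and" else (a | b)
    return value, value

class StuckOpenWire:
    """Stuck-open: while the fault is active, the receiver keeps the last output."""

    def __init__(self, initial=0):
        self.last = initial

    def read(self, driven, fault_active):
        # The stored value is refreshed only while the wire is intact.
        if not fault_active:
            self.last = driven
        return self.last

wire = StuckOpenWire()
inputs = [1, 0, 1, 1, 0]
active = [False, True, True, False, False]
print([wire.read(d, a) for d, a in zip(inputs, active)])  # [1, 1, 1, 1, 0]
print(bridge(1, 0, "and"), bridge(1, 0, "or"))            # (0, 0) (1, 1)
```

Passing a per-cycle fault_active flag makes the same model usable for intermittent faults: the fault changes the observed value only during the cycles in which it is active.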



Intermittent faults are loosely defined and their causes are not well explored.

We need more accurate results on the rates of ifaults.
Rates and number of recurrences.

Do NBTI, stress migration, thermal cycling and hot carrier injection cause ifaults?
Evidence from scientific studies or field data.



Backup Slides





From [AdancesinRadioScience09]



Example from Dr. Ivanov's course


Copyright 2001, Agrawal & Bushnell


RAM

Pattern Sensitivity

BDS

Coupling

BDS


Wire
    Open: stuck-open (last output), stuck-at
    Short: bridging
        Wired-AND/OR: IDDQ
        Dominant: IDDQ
        Dominant AND/OR: IDDQ
    Delay


[Wikipedia] Many articles.

[WDSN07] Impact of Intermittent Faults on Nanocomputing Devices, WDSN, 2007.

[D3T] Emphasis on the existence of intermittent faults in embedded systems. IEEE Workshop on Defect and Data Driven Testing (D3T), 2010.

[ASPLOS08] Adapting to intermittent faults in multicore systems.

[EuroSys11] Cycles, Cells and Platters: An Empirical Analysis.

[IEEETrans.onElectronDevices96] Soft breakdown of ultra-thin gate oxide layers.

[ACMSurveys10] Electromigration for Microarchitects, Intel.

[Applied Physics Letters91] Stress-migration related electromigration damage mechanism in passivated, narrow interconnects.

[AdancesinRadioScience09] Impact of negative and positive bias temperature stress on 6T-SRAM cells.
