An Architectural Framework for Evaluating Soft Errors in Data Path Designs
Penn State University
Data paths and combinational logic are inherently resistant to soft errors due to electrical masking, logical masking, and latching-window masking. However, their susceptibility is increasing as technology scales, owing to shrinking device sizes and increasing speeds. In this work we first present a circuit-level estimation of soft errors for one-bit and four-bit adders. We then discuss solutions based on concurrent error detection, along with other techniques. Next we present an architectural-level study of the effect of soft errors, followed by a few solutions for data path error detection and correction.
Soft errors pose a major challenge for the continued scaling of CMOS circuits. They result from excess charge carriers induced primarily by external radiation; the circuit, however, is not permanently damaged by such radiation. Rapid technology scaling has accelerated the reduction in storage node capacitance and supply voltages, resulting in increased susceptibility of combinational circuits. These circuits are more difficult to protect than memories by conventional methods such as parity and error correcting codes. Moreover, as technology scales, various factors cause the susceptibility of data path designs to increase, and hence more attention needs to be given to them.
In this paper we present a detailed analysis of the effect of soft errors on data path designs. At the circuit level, we present the results of SEU testing on one-bit adders and analyze four-bit adder designs. Next we analyze the effect of such errors on the execution of programs when soft errors occur in data path structures, especially adders, by incorporating the errors into data path simulation.
Soft errors can be induced by three different radiation sources: alpha particles from naturally occurring radioactive impurities in device materials, high-energy cosmic-ray-induced neutrons, and neutron-induced 10B fission [ ]. Recent works [ , ] have shown the effect of technology scaling on soft errors. In [ ], a study of the radiation flux noted that particles of lower energy occur far more frequently than particles of higher energy. Thus smaller CMOS devices are easily affected by lower-energy particles, leading to a much higher rate of soft errors. A soft error occurs when the charge Q collected at a particular node is greater than a critical charge Qcritical, which results in a bit flip at that node. This concept of critical charge is used for the estimation of the soft error rate (SER). According to the proposed model, SER is proportional to F x CS x exp(-Qcritical/Qs), where F is the intensity of the neutron flux, CS is the cross-sectional area of the node, and Qs is the charge collection efficiency, which strongly depends on doping. Qcritical is proportional to the node capacitance and the supply voltage.
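The exponential dependence of SER on Qcritical can be sketched numerically. The snippet below implements the model above up to its technology-dependent constant; all parameter values are illustrative assumptions, not measured data:

```python
import math

def soft_error_rate(flux, cross_section, q_critical, q_s):
    """SER model (up to a technology-dependent constant):
    SER ~ F * CS * exp(-Qcritical / Qs).

    flux          -- neutron flux intensity F
    cross_section -- sensitive cross-sectional area CS of the node
    q_critical    -- critical charge of the node
    q_s           -- charge collection efficiency Qs (doping-dependent)
    """
    return flux * cross_section * math.exp(-q_critical / q_s)

# Illustrative numbers: halving Qcritical from 20 to 10 (with Qs = 5)
# multiplies the relative SER by exp(2), roughly 7.4x.
base = soft_error_rate(flux=56.5, cross_section=1e-9, q_critical=20.0, q_s=5.0)
scaled = soft_error_rate(flux=56.5, cross_section=1e-9, q_critical=10.0, q_s=5.0)
print(scaled / base)
```

Since Qcritical tracks node capacitance and supply voltage, this ratio illustrates why device scaling sharply increases susceptibility.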
Soft Errors in Memory Circuits
High-energy neutrons that strike a sensitive region in a semiconductor device deposit a dense track of electron-hole pairs as they pass through a p-n junction. Some of the deposited charge will recombine to form a very short duration pulse of current at the internal circuit node that was struck by the particle. The magnitude of the collected charge depends on the particle type, the physical properties of the device, and the circuit topology. When a particle strikes a sensitive region of an SRAM cell, the charge that accumulates could be large enough to flip the value stored in the cell, resulting in a soft error. The smallest charge that results in a soft error is called the critical charge of the SRAM cell [ ]. The rate at which soft errors occur is typically expressed in terms of Failures In Time (FIT), which measures the number of failures per 10^9 hours of operation. A number of studies of soft errors in SRAMs have concluded that the SER for constant-area SRAM arrays will increase as device sizes decrease [ ], although researchers differ on the rate of this increase.
Soft Errors in Combinational Logic
A particle that strikes a p-n junction within a combinational logic circuit can alter the value produced by the circuit. However, a transient change in the value of a logic circuit will not affect the results of a computation unless it is captured in a memory circuit. Therefore, we define a soft error in combinational logic as a transient error in the result of a logic circuit that is subsequently stored in a memory circuit of the processor.
A transient error in a logic circuit might not be captured in a memory circuit because it could be masked by one of the following three phenomena:
Logical masking occurs when a particle strikes a portion of the combinational logic that is blocked from affecting the output by a subsequent gate whose result is completely determined by its other input values.
Electrical masking occurs when the pulse resulting from a particle strike is attenuated by subsequent logic gates, due to the electrical properties of the gates, to the point that it does not affect the result of the circuit.
Latching-window masking occurs when the pulse resulting from a particle strike reaches a latch, but not at the clock transition where the latch captures its input value.
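The first of these effects can be illustrated with a toy single-gate sketch (the gate and values are our own illustration, not a circuit from this work):

```python
# Logical masking: a transient flip on one input of a gate is blocked
# when the gate's other input holds a controlling value (0 for AND).

def and_gate(a, b):
    return a & b

struck = 1            # node value before the particle strike
flipped = struck ^ 1  # transient pulse flips the node to 0

# Other input 0 controls the AND: the flip is logically masked.
print(and_gate(struck, 0), and_gate(flipped, 0))   # 0 0

# Other input 1 is non-controlling: the flip propagates to the output.
print(and_gate(struck, 1), and_gate(flipped, 1))   # 1 0
```

Whether a given strike is masked therefore depends on the values at the other gate inputs at that moment, which is why only a fraction of strikes become soft errors.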
These masking effects have been found to result in a significantly lower rate of soft errors in combinational logic than in storage circuits in equivalent device technology. However, these effects could diminish significantly as feature sizes decrease and the number of stages in the processor pipeline increases. Electrical masking could be reduced by device scaling because smaller transistors are faster and therefore may have less attenuating effect on a pulse. Also, deeper processor pipelines allow higher clock rates, meaning the latches in the processor will cycle more frequently, which may reduce latching-window masking.
SOFT ERROR ANALYSIS IN ADDERS
We have the results of SEU testing for several adder designs. These Qcritical values will be used to model error rates in the architectural framework. The main objective in testing many types of adders is to take advantage of design diversity for fault-tolerant designs.
SEU testing was done on four types of one-bit full adders. A summary of these tests is presented below.
Before we move on to each design, there are general characteristics common to all the designs. Since all adders are combinational, the absence of feedback paths results only in temporary bit flips. Permanent bit flips occur in data paths where a register file is used to store intermediate data, and such a data path should be kept in mind when adders are tested. So here the adders have registers (a TGFF is used for this purpose) at their sum and carry outputs. This way, a temporary bit flip in the sum or carry occurring at the positive edge of the clock may result in a permanent bit flip at the output.
Extensive testing was therefore done for almost every node in each of the designs, and the Qcritical for 0->1 and 1->0 transitions was calculated. The following observations were recorded for each design.
Single bit adders
And/Xor half adder based full adder
This cell, being purely combinational, results in a higher Qcritical, as will be seen in the results here.
The critical charge values are much higher in the case of adders than in flip-flops or SRAMs. Sometimes they are so high that no radiation can cause such a transition (SEU).
As can be seen from the above results, the carry output is more vulnerable. This in turn would have an adverse effect on multi-bit adders, because there the SEUs get propagated. There are also more nodes in the carry path that get affected here.
Mirror adder
The mirror adder has only two nodes in it, so the tests on these two nodes are presented. An SEU at the node Cob affects both the sum and the carry output.
As seen, the critical charges affecting the sum and carry differ for different transitions. SEUs at the other node, Sb, affect only the sum output, as shown. Overall, this could be called the best design, as the number of nodes is small and hence the design is less vulnerable to SEUs. Also, this design, when sized for better performance, will result in a higher Qcritical.
Transmission gate based delay balanced adder
In the TG based adder, though the number of nodes affected by current pulses is large, the Qcritical is generally high, and as can be seen, radiation cannot deposit such a high charge.
Delay balanced adder
Xor based adder
In the Xor based adder, the number of nodes that affect the output directly is negligible. In fact, there are no internal nodes at which an SEU can affect the carry out. The sum output flips when there are SEUs at three internal nodes. Here only the node for which Qcritical is minimum is presented.
Figure 5 and Figure 6 present the critical charges that affect the sum and carry outputs for the various adders considered.
Figures 5 and 6: Critical charge for 0->1 and 1->0 transitions at the nodes evaluated in the various adder designs.
Four bit adders
This analysis can be extended to different four-bit adders to see how different adder structures behave under such SEUs, using either theoretical explanations or HSPICE simulations. More such adder designs could be tested similarly. Some of the types of adders that are potentially of interest include the following.
Ripple carry adder
Consider an input pattern such that a bit flip on a lower-order carry results in the propagation of the carry up to the most significant bit. The highest bits of the ripple carry adder would be affected only if a very large pulse occurs in the lower-order bits, but such a flip could potentially corrupt all the outputs as the carry propagates. So though the Qcritical might be large in the case of ripple carry adders, an SEU could potentially lead to multiple-bit faults, which may be very hard to recover from.
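The scenario above can be sketched with a behavioral bit-level model (a stand-in for the transistor-level designs tested earlier, not one of them); a single flipped carry at stage 0 corrupts every higher sum bit for the pattern 15 + 1:

```python
# A 4-bit ripple carry adder built from 1-bit full adders; flipping the
# carry leaving stage 0 corrupts all higher-order sum bits when the
# input pattern lets the wrong carry propagate.

def full_adder(a, b, cin):
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_add(a_bits, b_bits, flip_carry_at=None):
    """Add two LSB-first bit lists; optionally flip the carry leaving
    stage `flip_carry_at` to model an SEU on the carry chain."""
    carry, out = 0, []
    for i, (a, b) in enumerate(zip(a_bits, b_bits)):
        s, carry = full_adder(a, b, carry)
        if i == flip_carry_at:
            carry ^= 1
        out.append(s)
    return out, carry

a = [1, 1, 1, 1]   # 15, LSB first
b = [1, 0, 0, 0]   # 1
good, _ = ripple_add(a, b)                  # 15 + 1 = 16 -> sum 0000
bad, _ = ripple_add(a, b, flip_carry_at=0)  # one SEU on the carry chain
print(good, bad)   # [0, 0, 0, 0] vs [0, 1, 1, 1]: three sum bits corrupted
```

This is exactly the multiple-bit fault pattern that makes recovery hard even though the adder's own Qcritical is large.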
Carry skip adders
In carry skip adders, an error in the carry might propagate to the most significant bit at a much lower Qcritical. The lower significant bits, however, might not be affected, hence a lesser chance of multiple-bit errors. Also, since these adders involve extra logic, they are much more prone to errors than ripple carry adders.
Prefix adders like Brent-Kung and Kogge-Stone adders will behave differently, and will usually have a much lower Qcritical for a carry to propagate to the most significant bit. Here, though, the carry error propagates, it results in fewer multiple-bit errors than in ripple carry adders. A detailed experimental analysis will be presented in the final report.
Existing circuit level solutions
Concurrent error detection (CED) techniques work well for these adder designs. [ ] proposes that design diversity results in more robust designs. With the existing trade-offs in the various adder designs, diversity could be used to build a robust CED design. Other proposed techniques include arithmetic coding techniques like carry checking/parity prediction adders [ ] and other redundancy techniques like time redundancy [ ].
IMPACT OF SOFT ERRORS IN THE PROCESSOR DATAPATH
Here our focus is on the combinational logic in the processor datapath, since storage structures such as the instruction fetch buffer, issue queue, reorder buffer, and register file can be protected by parity checking or ECC coding. Few such protection schemes can be applied to combinational logic, especially the functional units, and designing error-tolerant functional units presents a great challenge. Conventional designs utilize two types of redundancy, time redundancy and space redundancy, to achieve reliable system design.
These design schemes normally
introduce significant performance overhead or hardware cost.
Most research focuses on how to combine these two schemes and use them more effectively, for example in the AR-SMT processor [ ] and the dual use of the superscalar datapath [ ]. Different from previous research, our special interest is to evaluate the impact of soft errors in different functional units on the execution of application programs.
In this project, especially, we want to answer the following questions:
How does a soft error in the functional units affect program behavior during execution?
Which type of functional unit is most critical to the robustness of the system?
How can we design cost-effective reliable systems?
Modeling a Superscalar Processor Core
In this section, we model a detailed superscalar processor core with a separate issue queue and reorder buffer (ROB). The design is similar to that of the MIPS R10000 [ ], except that we use a unified issue queue instead of separate integer and floating-point queues. The baseline datapath pipeline is given in Figure 7.
Figure 7: (a) The datapath diagram and (b) pipeline stages of the modeled baseline superscalar microprocessor.
A brief description of an instruction going through the datapath is as follows. The fetch unit fetches instructions from the instruction cache and performs branch prediction and next-PC generation. Fetched instructions are then sent to the decoder for decoding. Decoded instructions are register-renamed and dispatched into the issue queue. At the same time, each instruction is allocated an entry in the ROB in program order. Instructions with all source operands ready are woken up, selected to issue to the appropriate available functional units for execution, and removed from the issue queue. The status of the corresponding ROB entry is updated as the instruction proceeds. The results coming from the functional units or the data cache are written back to the register file. Instructions in the ROB are committed in order.
As mentioned earlier, we focus on soft errors in the functional units. Notice that the floating-point functional units perform numerical operations on data that seldom have an impact on the control flow during program execution. However, the results produced by the integer functional units have a wide range of impacts on program execution, such as determining the outcome of the condition on which a branch instruction depends, the address of a data reference, or the address of a function reference. Any error incurred during these operations may finally lead to abnormal program behavior and program crashes. In the following discussion, we focus on the integer functional units.
Error Injection Scheme
The error injection scheme is a critical part of evaluating the impact of soft errors in the processor datapath. It should simulate the error occurrence in real systems as closely as possible. At the same time, the error injection should not incur a large performance overhead during architectural simulation.
One possible error injection scheme is hardware based: flip the value of some node in the circuit. This scheme would be very accurate in simulating soft error occurrences. However, it has several difficulties at the architecture level. First, a detailed circuit implementation (netlist) is required for each functional unit in order to insert a single event upset (SEU) at a particular node. Notice that different processors may have different implementation styles for their functional units, which would require the simulator to maintain a library of implementations of different functional units. Inserting errors into the circuit at the architectural level would be very difficult. The most discouraging part of this scheme is not being able to evaluate the result after inserting the errors: should we simulate the circuit? That is not practical for architectural simulation.
To overcome the difficulties mentioned above, we propose a more effective and low-cost error injection scheme. Since our focus is to inject soft errors into the functional units, the scheme introduces a random single event upset in the inputs to a functional unit rather than in the functional unit itself. When inputs with one bit flipped go through the functional unit, they may or may not cause an error in the resulting output, which is similar to the case when an error happens in a circuit node of the unit. However, how closely the behavior introduced by our error injection scheme matches that of a hardware scheme is still a question under investigation.
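A minimal sketch of this input-side injection follows; the function names and the 32-bit operand width are our assumptions, not the simulator's actual interface:

```python
import random

def inject_seu(value, width=32, rng=random):
    """Model a single event upset by flipping one uniformly chosen bit
    of an input operand before it enters the functional unit."""
    return value ^ (1 << rng.randrange(width))

def add_with_injected_seu(a, b, width=32, rng=random):
    """32-bit add whose first operand suffers one input-bit flip."""
    return (inject_seu(a, width, rng) + b) & ((1 << width) - 1)

rng = random.Random(42)
clean = (1000 + 2000) & 0xFFFFFFFF
faulty = add_with_injected_seu(1000, 2000, rng=rng)
# The flip always changes the raw sum here, but whether the *program*
# fails depends on how the corrupted value is used downstream.
print(clean, faulty)
```

The corrupted output differs from the clean one by a power of two (modulo 2^32), mirroring a single upset bit on the operand.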
We use the SimpleScalar/PISA version 3.0 tool set [ ] to implement the microprocessor model of Section 4.1. The processor and memory hierarchy configuration for the simulated processor is given in Table 1.
Table 1: The configuration of the processor simulated in this study.
Fetch/issue width: 4 inst. per cycle
Functional units: 4 IALU, 2 IMULT/IDIV, 4 FALU, 2 FMULT/FDIV, 2 memory ports
Branch predictor: bimodal, 2048 entries, 512
L1 I-cache: 32KB, 1-way, 32B blocks, 1 cycle latency
L1 D-cache: 32KB, 4-way, 32B blocks, 1 cycle latency
L2 cache: 256KB, 4-way, 64B blocks, 8 cycle latency
Memory: 80 cycles first chunk, 8 cycles rest
TLB: 30 cycle miss penalty
We take a set of applications from the SPEC2000 benchmark suite and use their PISA binaries and reference inputs for execution. Each benchmark is first fast-forwarded half a billion instructions, and then simulated for the next half a billion committed instructions.
How to Evaluate the Impact of Soft Errors
The major questions we want to answer in this experimental evaluation are: how well can a program survive under the attack of soft errors, and what is the immunity of a program to soft errors in different functional units? Thus the main goal of our experiments is to evaluate the survivability of programs under soft errors rather than to detect errors in the results. We introduce two metrics for this purpose: the number of effective errors inserted and the ratio of completed instructions. Effective errors are those soft errors inserted during a regular functional-unit operation; if the functional unit is inactive, a soft error does not have any negative impact. Due to limited simulation time, we can only simulate a fixed number of instructions for each benchmark instead of completing the whole execution. In our experiments, we simulate each benchmark for up to half a billion committed instructions. The ratio of completed instructions is calculated as the number of instructions committed before the program crashes divided by half a billion. The larger the ratio, the longer the program survives under soft error attacks. If the benchmark completes half a billion committed instructions, we say that the program survives.
The Impact of a Single Error
We first investigate the impact of a single soft error in the adders on the execution of the program. We ran a set of experiments that insert this single error at different times during execution: after the first 1,000 cycles, 10,000 cycles, 100,000 cycles, and 1,000,000 cycles. Although injected at different time points, our results show that a single error in the adders seldom crashes the program. All the benchmarks complete the half billion committed instructions under the attack of a single soft error.
Inserting Soft Errors into Adders at a Fixed Interval
Now the question is how many soft errors a program can tolerate before it is crashed by those errors. This experiment is designed to inject errors at a fixed interval (10,000 cycles) until the execution of the program is broken.
Figure 8: Impact of soft errors in the adders at a fixed interval (10,000 cycles): (a) number of effective errors inserted; (b) ratio of completed instructions (/0.5B) until crashing.
In Figure 8 (a), the x axis shows the total number of soft errors that we intend to insert at the fixed interval of 10,000 cycles. If only one error is inserted, we can see from Figure 8 (b) that all benchmarks survive to finish executing 0.5 billion instructions, which agrees with our first set of experiments. When the number of intended errors increases to 4, benchmark 175.vpr crashes after 2 errors are inserted while the others survive. Further increasing the number of errors to 16, benchmark 254.gap crashes after receiving 7 errors and 256.bzip2 crashes after 12 errors are inserted. 197.parser still survives until the cumulative error count reaches 55. The results from this experiment show that different benchmarks (programs) have different immunity to the same soft errors injected at a fixed rate.
Uniform Error Rate
In the following experiments, we use uniformly distributed error rates to investigate how programs survive under different intensities of soft errors in different functional units. For three types of functional units (adder, logic unit, and multiplier/divider), we use four uniform error rates: 0.000001, 0.00001, 0.0001, and 0.001 per cycle.
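One plausible reading of a per-cycle uniform error rate is an independent Bernoulli trial each cycle; a sketch of such an injection schedule (function and parameter names are ours) is:

```python
import random

def injection_schedule(error_rate, cycles, seed=0):
    """Return the cycles at which an error is injected, drawing an
    independent Bernoulli trial with probability `error_rate` per cycle."""
    rng = random.Random(seed)
    return [c for c in range(cycles) if rng.random() < error_rate]

# At 0.001 errors/cycle over 100,000 cycles, roughly 100 injections are
# expected; the lower rates scale the count down proportionally.
hits = injection_schedule(error_rate=0.001, cycles=100_000)
print(len(hits))
```

Seeding the generator keeps the schedule reproducible across simulation runs, which matters when comparing benchmarks under the same error stream.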
Injecting Errors into Adders at a Uniform Rate
We first evaluate the impact of soft errors occurring at different uniform error rates in the adders. An assumed trend under different error rates would be as follows: the number of effective errors inserted will increase as the error rate becomes larger, and will start to decrease at some point, since the program will crash much earlier under the attack of a larger error rate. On the other hand, the ratio of completed instructions should keep decreasing as the error rate increases.
Looking at the results in Figure 9 (a) and Figure 9 (b), we find that some benchmarks, such as 175.vpr, are not consistent with the anticipated behavior: 175.vpr crashes much earlier, with smaller error rates and smaller numbers of errors inserted. The other two benchmarks, 254.gap and 256.bzip2, match the predicted behavior. For all benchmarks, however, the uniform error rate of 0.0001 presents a very interesting point: the results suggest that this error rate has the most significant impact on program execution. Due to the limited number of samples, we do not have an answer at present for this abnormal behavior.
Figure 9: Impact of soft errors in the adders at uniform error rates: (a) number of effective errors inserted; (b) ratio of completed instructions (/0.5B) until crashing.
Injecting Errors into the Logic Functional Unit at a Uniform Error Rate
As is well known, logic operations have a strong relation with the control flow of the program in execution. If a logic operation is used to evaluate the condition of a branch instruction, an error during the evaluation might cause the branch to take the wrong path. If the correct path was going to perform some critical operation such as memory allocation, such an error might finally lead to a program crash. An intuition is that control-intensive programs will be more vulnerable to errors in logic operations. We present our simulation results for the impact of uniformly distributed errors in logic operations in Figure 10.
Figure 10 (a) gives the number of effective errors inserted until the program crashes or finishes committing 0.5 billion instructions, at different uniform error rates from 0.000001 to 0.001. Correspondingly, Figure 10 (b) shows the ratio of completed instructions under the different uniform error rates.
Figure 10: Impact of soft errors in the logic functional unit at uniform error rates: (a) number of effective errors inserted; (b) ratio of completed instructions (/0.5B) until crashing.
Compared to Figure 9, the impacts of these two types of soft errors on program execution are quite different. Benchmark 175.vpr is an application for FPGA placement and routing that is control intensive. A very small number of errors in logic operations immediately causes the program to crash, which is very different from its behavior under soft errors in the adders as shown in Figure 9. On the other hand, benchmark 256.bzip2, which is a compression application and is very data intensive, has very good immunity to logic operation errors, as shown in Figure 10: it crashes only when the error rate is increased to 0.001. However, errors in the adders have a direct impact and cause a quick crash in 256.bzip2, as shown in Figure 9 (b).
Injecting Errors into the Multiplier/Divider at a Uniform Error Rate
Different from ALU operations, multiplication and division are data oriented, which means that errors in the multiplier or divider have less impact on the execution of the program. However, these errors will result in wrong data values in the memory system.
Figure 11: Impact of soft errors in the multiplier/divider at uniform error rates: (a) number of effective errors inserted; (b) ratio of completed instructions (/0.5B) until crashing.
The experimental results shown in Figure 11 confirm what we expected: three out of four benchmarks survive under the different error rates. For benchmark 175.vpr, our guess is that some of the multiplication/division operations have a direct or indirect impact on the decision making during FPGA placement and routing.
From our experimental evaluation, we have some initial observations. Soft errors in different functional units have different impacts on the execution of the program. Errors during addition/subtraction operations have a significant impact on data-intensive applications, while control-intensive applications are more sensitive to errors in logic operations.
CONCLUSIONS
Soft error effects in adder designs and their architectural effects have been presented in this paper. As mentioned earlier, concurrent error detection methods based on design diversity might be one option to counter soft errors at the circuit level. At the architectural level, the following conclusions are derived. Errors in different integer operations have different impacts on the program, and different programs behave differently under error injection: control-intensive (lower IPB) applications are more sensitive to logic operation errors, while multiplication/division operations have less impact on program execution. Future work includes a more detailed characterization of program behavior under error impact, modeling the soft error rate from the circuit-level Qcritical values for the arithmetic units, and using this information to develop selective error protection/detection/recovery schemes.
[1] Hazucha, P. and Svensson, C., "Impact of CMOS Technology Scaling on the Atmospheric Neutron Soft Error Rate," IEEE Transactions on Nuclear Science, Vol. 47, No. 6, 2000.
[2] Baumann, R., "Soft Error Characterization and Modeling Methodologies at TI: Past, Present, and Future," 4th Annual Topical Conference on Reliability, 2000.
[3] Seifert, N., Moyer, D., Leland, N., and Hokinson, R., "Historical Trend in Alpha-Particle Induced Soft Error Rates of the Alpha Microprocessor," IEEE 39th Annual Reliability Physics Symposium, 2001, p. 259.
[4] Hareland, S., Maiz, J., Alavi, M., Mistry, K., Walsta, S., and Dai, C., "Impact of CMOS Process Scaling and SOI on the Soft Error Rates of Logic Processes," 2001 Symposium on VLSI Technology Digest of Technical Papers, p. 73.
[5] Ziegler, J., "Terrestrial Cosmic Ray Intensities," IBM Journal of Research and Development, Vol. 40, No. 1, January 1996.
[6] Freeman, L. B., "Critical Charge Calculations for a Bipolar SRAM Array," IBM Journal of Research and Development, Vol. 40, pp. 119-129, January 1996.
[7] Johansson, K., Dyreklev, P., Granbom, B., Calvet, M., Fourtine, S., and Feuillatre, O., "In-Flight and Ground Testing of Single Event Upset Sensitivity in Static RAMs," IEEE Transactions on Nuclear Science, 45:1628-1632, June 1998.
[8] Liden, P., Dahlgren, P., Johansson, R., and Karlsson, J., "On Latching Probability of Particle Induced Transients in Combinational Networks," Proceedings of the 24th Symposium on Fault-Tolerant Computing (FTCS-24), pp. 340-349, 1994.
[9] Peterson, E., Shapiro, P., Adams, J., and Burke, E., "Calculation of Cosmic-Ray Induced Soft Upsets and Scaling in VLSI Devices," IEEE Transactions on Nuclear Science, Vol. 29, p. 2055, December 1982.
[10] Pickel, J., "Effect of CMOS Miniaturization on Cosmic-Ray-Induced Error Rate," IEEE Transactions on Nuclear Science, 29:2049-2054, December 1982.
[11] Rotenberg, E., "AR-SMT: A Microarchitectural Approach to Fault Tolerance in Microprocessors," Proceedings of the 29th Fault-Tolerant Computing Symposium, June 1999.
[12] Ray, J., Hoe, J. C., and Falsafi, B., "Dual Use of Superscalar Datapath for Transient-Fault Detection and Recovery," Proceedings of the 34th Annual International Symposium on Microarchitecture, 2001.
[13] Yeager, K. C., "The MIPS R10000 Superscalar Microprocessor," IEEE Micro, Vol. 16, No. 2, pp. 28-40, April 1996.
[14] Burger, D., Kagi, A., and Hrishikesh, M. S., "Memory Hierarchy Extensions to SimpleScalar 3.0," Technical Report TR99, Department of Computer Sciences, The University of Texas at Austin.
[15] Mitra, S. and McCluskey, E. J., "Which Concurrent Error Detection Scheme to Choose?" Proceedings of the International Test Conference, October 2000.
[16] Nicolaidis, M., "Time Redundancy Based Soft-Error Tolerance to Rescue Nanometer Technologies," Proceedings of the 17th IEEE VLSI Test Symposium, 25-29 April 1999.
[17] Nicolaidis, M., "Carry Checking/Parity Prediction Adders and ALUs," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 11, No. 1, Feb. 2003.
[18] Shivakumar, P., Kistler, M., Keckler, S. W., Burger, D., and Alvisi, L., "Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic," Proceedings of the International Conference on Dependable Systems and Networks, June 2002.