Survey on Hardware Implementations of Cryptographic Systems:
Elliptic Curve Cryptography and Others

Goodman & Chandrakasan Implementation (Domain Specific Reconfigurable Cryptographic Processor, DSRCP)
• Capable of performing an entire suite of cryptographic primitives over integers modulo N, binary Galois fields, and nonsupersingular elliptic curves. This includes RSA and ECC.
• Fully programmable parameters for many cryptographic systems. Data size can vary from 8 to 1024 bits.
• Reconfigurability limited to the subset of functions (domain) required for public-key cryptography as defined in IEEE 1363. Requires only a small set of configurations for performing all required operations for RSA, ECC, etc.
Instruction Set Architecture
• As defined by the IEEE 1363 Public Key Cryptography standard document.
• 24 instructions broken up into 6 types of operations: conventional arithmetic, modular integer arithmetic, GF(2^n) arithmetic, elliptic curve arithmetic, register manipulation, and processor configuration.
SET_LENGTH length            Sets width of processor to length + 1
REG_CLEAR rd, rs0            Clears registers specified in mask formed by (rd, rs0) = R<7:0>
REG_MOVE rd, rs0             rd = rs0
REG_LOAD rd                  rd is loaded from I/O interface
REG_UNLOAD rs1               rs1 is unloaded to I/O interface
COMP rs0, rs1                Sets “gt” and “eq” flags according to the result
ADD/SUB rd, rs0, rs1, rs2    rd = rs0 + rs1 + rs2<0>           if rs2<2:1> = 00
                             rd = (rs0 + rs1 + rs2<0>) >> 1    if rs2<2:1> = 01
                             rd = rs0 - rs1                    if rs2<2:1> = 10
                             rd = (rs0 - rs1) >> 1             if rs2<2:1> = 11
MOD_ADD rd, rs0, rs1, rs2    rd = (rs0 + rs1 + rs2<0>) mod N
MOD_SUB rd, rs0, rs1         rd = (rs0 - rs1) mod N
MONTRED_A                    (Pc, Ps) = A . 2^(-n) mod N
MONTMULT                     (Pc, Ps) = A . B . 2^(-n) mod N
MONTRED                      (Pc, Ps) = (Pc, Ps) . 2^(-n) mod N
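The Montgomery instructions above all return a result scaled by 2^(-n). A minimal Python sketch of a bit-serial (radix-2) Montgomery product is given here as a software model only; it assumes N is odd, N < 2^n and both operands are below N, and it does not model the processor's (Pc, Ps) register pair. The function name `montmult` and its parameters are chosen for illustration.

```python
def montmult(a, b, n_mod, n_bits):
    """Return a * b * 2^(-n_bits) mod n_mod (n_mod odd, a and b below n_mod)."""
    if n_mod % 2 == 0:
        raise ValueError("Montgomery arithmetic needs an odd modulus")
    acc = 0
    for i in range(n_bits):
        # Add a when bit i of b is set.
        if (b >> i) & 1:
            acc += a
        # Adding the modulus makes acc even without changing it mod n_mod,
        # so the halving below is exact.
        if acc & 1:
            acc += n_mod
        acc >>= 1
    # acc stays below 2 * n_mod, so one conditional subtraction is enough.
    if acc >= n_mod:
        acc -= n_mod
    return acc


def montred_a(a, n_mod, n_bits):
    """Model of MONTRED_A: a * 2^(-n_bits) mod n_mod."""
    return montmult(a, 1, n_mod, n_bits)
```

For example, with N = 5 and n = 3, `montmult(3, 4, 5, 3)` evaluates 3 . 4 . 2^(-3) mod 5.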
DSRCP ISA - cont.
MOD rd, rs0, rs1, rs2            rd = (rs1 . 2^n + rs0) mod N; correction factor of 2^(2n) mod N stored in rs2
MOD_MULT rd, rs0, rs1, rs2       rd = (rs0 . rs1) mod N; correction factor of 2^(2n) mod N stored in rs2
MOD_INV rd, rs0                  rd = (1 / rs0) mod N
MOD_EXP rd, rs0, rs2, length     rd = rs0^Exp mod N; Exp has (length + 1) bits; correction factor of 2^(2n) mod N stored in rs2
GF_ADD rd, rs0, rs1              rd = rs0 + rs1 over GF(2^n) (equivalent to rs0 XOR rs1)
GF_MULT                          Pc = A . B
GF_INV                           A = 1 / Pc
GF_INVMULT                       A = B / Pc
GF_EXP rd, rs0, length           rd = rs0^Exp mod N; Exp has (length + 1) bits
EC_ADD rd, rs0, rs1, rs2         (rd, rd+1) = (rs0, rs0+1) + (rs1, rs1+1), over the curve defined by parameters in (rs2, N)
EC_DOUBLE rd, rs0, rs2           (rd, rd+1) = 2 . (rs0, rs0+1), over the curve defined by parameters in (rs2, N)
EC_MULT length                   (R4, R5) = Exp . (R2, R3); Exp has (length + 1) bits; over the curve defined by parameters in (R6, N)
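The correction factor 2^(2n) mod N in rs2 can be read as the value that cancels the 2^(-n) scaling left behind by a Montgomery product. The sketch below shows that reading for MOD_MULT; it is an assumption about how the factor is used, not the processor's microcode. Here `montmult` is a reference model built on Python's three-argument `pow` (negative exponents need Python 3.8 or later), and all names and the sample modulus are illustrative.

```python
def montmult(a, b, n_mod, n_bits):
    # Reference model of the MONTMULT result: a * b * 2^(-n_bits) mod n_mod.
    # pow with a negative exponent computes the modular inverse of 2^n_bits.
    return (a * b * pow(2, -n_bits, n_mod)) % n_mod


def mod_mult(rs0, rs1, rs2, n_mod, n_bits):
    # rs2 is assumed to hold the correction factor 2^(2 * n_bits) mod n_mod.
    scaled = montmult(rs0, rs1, n_mod, n_bits)   # rs0 * rs1 * 2^(-n)
    # Second product: (rs0 * rs1 * 2^(-n)) * 2^(2n) * 2^(-n) = rs0 * rs1.
    return montmult(scaled, rs2, n_mod, n_bits)


N_MOD, N_BITS = 0xF123, 16                  # illustrative odd modulus, N < 2^n
CORRECTION = pow(2, 2 * N_BITS, N_MOD)      # value assumed to be preloaded in rs2
product = mod_mult(1234, 5678, CORRECTION, N_MOD, N_BITS)
```

Two Montgomery products per modular multiplication is the cost of keeping operands in ordinary (non-Montgomery) form between instructions.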
Top-Level system architecture
[Figure: block diagram. Blocks: reconfigurable datapath (32 x 32 bits), shutdown controller, global controller, µ-code ROMs, I/O interface. Bus labels: Data (32), Instruction (32).]
Reconfigurable Datapath
• Adder: adds or subtracts two n-bit operands (8 <= n <= 1024) in 3 cycles.
• Comparator: single-cycle magnitude comparisons between two n-bit operands; also provides the XOR of the two operands.
• Local registers in the Reconfigurable Logic Unit (Pc, Ps, A, B, Exp and N) for special purposes and operations. These eliminate the need to access the register file every cycle.
• Two operand buses (rs0 and rs1) and one write-back bus.
Modular arithmetic
• Complex operations (multiplication, reduction, inversion, exponentiation) use microcoded instructions.
• Simple operations (addition, subtraction, comparisons) are implemented directly in hardware.
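Inversion is a good example of a complex operation that decomposes into the simple ones: the binary extended Euclidean algorithm named in these slides needs only shifts, additions, subtractions, and comparisons. A minimal Python sketch follows, assuming an odd modulus; the function name `mod_inv` and its structure are illustrative and are not taken from the processor's microcode.

```python
from math import gcd


def mod_inv(a, n_mod):
    """Return 1/a mod n_mod using the binary extended Euclidean algorithm."""
    if n_mod % 2 == 0:
        raise ValueError("this sketch assumes an odd modulus")
    u, v = a % n_mod, n_mod
    if u == 0 or gcd(u, v) != 1:
        raise ValueError("a has no inverse modulo n_mod")
    # Invariants: x1 * a == u (mod n_mod) and x2 * a == v (mod n_mod).
    x1, x2 = 1, 0
    while u != 1 and v != 1:
        while u % 2 == 0:
            u >>= 1
            # Halve x1 modulo n_mod; add n_mod first if x1 is odd.
            x1 = x1 >> 1 if x1 % 2 == 0 else (x1 + n_mod) >> 1
        while v % 2 == 0:
            v >>= 1
            x2 = x2 >> 1 if x2 % 2 == 0 else (x2 + n_mod) >> 1
        if u >= v:
            u, x1 = u - v, x1 - x2
        else:
            v, x2 = v - u, x2 - x1
    return (x1 if u == 1 else x2) % n_mod
```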
Modular arithmetic - cont.
• Multiplication: Montgomery multiplication, MONTMULT(A, B, N) = A . B . 2^(-n) mod N.
• Modular inversion: extended binary Euclidean algorithm.
• Modular exponentiation: square-and-multiply algorithm. Precomputes and stores the values {2^n, rs0 . 2^n, rs0^2 . 2^n, rs0^3 . 2^n} in {R0, R1, R2, R3}.
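The four precomputed values are 1, rs0, rs0^2 and rs0^3 in Montgomery form, which suggests that the exponent is consumed two bits at a time. The Python sketch below follows that reading; the 2-bit window is an assumption drawn from the table size, not something the slides state. `montmult` is a reference model (Python 3.8 or later for the negative exponent in `pow`), and all names are illustrative.

```python
def montmult(a, b, n_mod, n_bits):
    # Reference model of MONTMULT: a * b * 2^(-n_bits) mod n_mod.
    return (a * b * pow(2, -n_bits, n_mod)) % n_mod


def mod_exp(base, exp, n_mod, n_bits):
    """Return base^exp mod n_mod (n_mod odd) with a 2-bit window."""
    one_m = pow(2, n_bits, n_mod)                 # 2^n: Montgomery form of 1
    corr = pow(2, 2 * n_bits, n_mod)              # correction factor 2^(2n)
    base_m = montmult(base % n_mod, corr, n_mod, n_bits)   # base * 2^n
    sq_m = montmult(base_m, base_m, n_mod, n_bits)         # base^2 * 2^n
    cube_m = montmult(sq_m, base_m, n_mod, n_bits)         # base^3 * 2^n
    table = [one_m, base_m, sq_m, cube_m]         # plays the role of R0..R3
    acc = one_m
    windows = (exp.bit_length() + 1) // 2
    for i in reversed(range(windows)):
        # Two squarings raise the accumulator to the fourth power.
        acc = montmult(acc, acc, n_mod, n_bits)
        acc = montmult(acc, acc, n_mod, n_bits)
        # Multiply by the table entry selected by the next two exponent bits.
        acc = montmult(acc, table[(exp >> (2 * i)) & 3], n_mod, n_bits)
    # A final product with 1 removes the 2^n scaling.
    return montmult(acc, 1, n_mod, n_bits)
```

Multiplying by the R0 entry when the two exponent bits are 00 keeps the operation sequence uniform, at the cost of one extra product per window.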
GF(2^n) arithmetic
• Addition: XOR function in the comparator unit.
• Multiplication and inversion: implemented directly in hardware using the reconfigurable datapath.
• Exponentiation: same manner as modular exponentiation, with {1, rs0, rs0^2, rs0^3} stored in {R0, R1, R2, R3}.
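These operations can be modelled in software by treating integers as bit vectors of polynomial coefficients. The Python sketch below shows addition as XOR, a shift-and-add multiplication, and an exponentiation that uses the table {1, rs0, rs0^2, rs0^3}. The 2-bit window is an assumption suggested by the table size; the function names and the example field GF(2^4) with polynomial x^4 + x + 1 are illustrative.

```python
def gf_add(a, b):
    # Addition in GF(2^n) is a bitwise XOR of the coefficient vectors.
    return a ^ b


def gf_mult(a, b, poly, n):
    """Multiply a and b in GF(2^n); poly includes the x^n term."""
    acc = 0
    for i in reversed(range(n)):
        acc <<= 1                      # multiply the running result by x
        if (acc >> n) & 1:
            acc ^= poly                # reduce when the degree reaches n
        if (b >> i) & 1:
            acc ^= a                   # add a for a set coefficient of b
    return acc


def gf_exp(a, exp, poly, n):
    """Return a^exp in GF(2^n) using a 2-bit window."""
    a2 = gf_mult(a, a, poly, n)
    a3 = gf_mult(a2, a, poly, n)
    table = [1, a, a2, a3]             # plays the role of R0..R3
    acc = 1
    for i in reversed(range((exp.bit_length() + 1) // 2)):
        acc = gf_mult(acc, acc, poly, n)
        acc = gf_mult(acc, acc, poly, n)
        acc = gf_mult(acc, table[(exp >> (2 * i)) & 3], poly, n)
    return acc


POLY, N = 0b10011, 4                   # x^4 + x + 1 over GF(2)
x_to_the_4 = gf_mult(0b0010, 0b1000, POLY, N)   # x * x^3 reduces to x + 1
```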
Elliptic Curve arithmetic
• Point addition and doubling: implemented in microcode, with curve points stored as register pairs (R_i, R_i+1) = (x, y).
• Point multiplication: performed using a repeated double-and-add algorithm.
• Exponentiation: same manner as modular exponentiation, with {1, rs0, rs0^2, rs0^3} stored in {R0, R1, R2, R3}.
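A minimal Python sketch of point addition, doubling and double-and-add multiplication over a binary field is given here. It assumes the usual affine formulas for a nonsupersingular curve y^2 + xy = x^3 + ax^2 + b, represents the point at infinity as `None`, and uses a toy field GF(2^4) with polynomial z^4 + z + 1 and illustrative curve parameters. Inversion is done by repeated multiplication (Fermat), which is a stand-in and not the hardware method.

```python
N_BITS = 4
POLY = 0b10011             # z^4 + z + 1
A, B = 0b1000, 0b1001      # illustrative curve parameters a = z^3, b = z^3 + 1


def gf_mult(a, b):
    acc = 0
    for i in reversed(range(N_BITS)):
        acc <<= 1
        if (acc >> N_BITS) & 1:
            acc ^= POLY
        if (b >> i) & 1:
            acc ^= a
    return acc


def gf_inv(a):
    # Fermat: a^(2^n - 2) is the inverse of a nonzero field element.
    result = 1
    for _ in range(2 ** N_BITS - 2):
        result = gf_mult(result, a)
    return result


def ec_double(p):
    # A point with x = 0 has order two, so its double is the point at infinity.
    if p is None or p[0] == 0:
        return None
    x, y = p
    lam = x ^ gf_mult(y, gf_inv(x))
    x3 = gf_mult(lam, lam) ^ lam ^ A
    y3 = gf_mult(x, x) ^ gf_mult(lam ^ 1, x3)
    return (x3, y3)


def ec_add(p, q):
    if p is None:
        return q
    if q is None:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:
        # Same x: q is either p itself (double) or -p = (x, x + y).
        return ec_double(p) if y1 == y2 else None
    lam = gf_mult(y1 ^ y2, gf_inv(x1 ^ x2))
    x3 = gf_mult(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    y3 = gf_mult(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)


def ec_mult(k, p):
    # Scan the scalar from its most significant bit: double, then add.
    result = None
    for i in reversed(range(k.bit_length())):
        result = ec_double(result)
        if (k >> i) & 1:
            result = ec_add(result, p)
    return result
```

A scalar of (length + 1) bits costs one doubling per bit plus one addition per set bit.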
Summary
• Tier I: implemented directly in hardware.
• Tier II: microcoded instructions composed of sequences of first-tier instructions.
• Tier III: microcoded instructions composed of sequences of both first- and second-tier instructions.
Orlando and Paar Implementation
• Elliptic curve co-processor. Configurable for any field size (number of bits in the key). Needs a host system.
• Main features: optimized bit-parallel squarer, digit-serial multiplier, and two programmable processors.
• Most dramatic example: squaring can be performed in one clock cycle, whereas a general architecture usually requires m/2 clock cycles (m >= 160 for reasonable security).
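One way to see why squaring is so cheap: assuming a polynomial-basis representation, squaring only spreads the coefficient bits apart (bit i moves to bit 2i) and then reduces modulo the field polynomial, and both steps are fixed XOR patterns. The Python sketch below models this in software; the function name and the toy field GF(2^4) with polynomial x^4 + x + 1 are illustrative and are not the co-processor's parameters.

```python
def gf_square(a, poly, m):
    """Square a in GF(2^m); poly includes the x^m term."""
    # Spread the bits: coefficient i of a becomes coefficient 2i.
    spread = 0
    for i in range(m):
        if (a >> i) & 1:
            spread |= 1 << (2 * i)
    # Reduce the result (degree at most 2m - 2) from the top bit down.
    for i in range(2 * m - 2, m - 1, -1):
        if (spread >> i) & 1:
            spread ^= poly << (i - m)
    return spread
```

Because the bit positions touched depend only on the field polynomial, the same computation can be wired as a fixed network of XOR gates.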
Processor Architecture
• MC: main controller. Controls the computation of kP and interacts with the host system.
• AUC: arithmetic unit controller. Controls the computation of point additions, point doublings, etc.
• AU: arithmetic unit. Performs field additions, squarings, multiplications, and inversions under control of the AUC.
Example: Point multiplication
1) Host loads k into the MC.
2) Host loads the coordinates of P into the AU.
3) Host commands the MC to start processing.
4) MC initializes (series of operations).
5) MC commands the AUC to perform its initialization.
6) Computation is performed by the MC, AUC and AU.
Arithmetic Unit
• Responsible for field arithmetic.
• Contains a register file, a least-significant-digit-first (LSD-first) multiplier, a squarer, an accumulator, and a zero-test circuit.
Arithmetic Unit - cont.
• Field addition / subtraction: 2 clock cycles.
• Field multiplication: AB + C mod F takes k_D clock cycles (1 <= k_D <= ⌈m/D⌉), where k_D is the number of digits of B and m is the number of bits of the field.
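The cycle count k_D follows from processing B one D-bit digit per step, least significant digit first. A minimal Python sketch of that scheme is given below as a software model, not the co-processor's circuit: each step adds A times the current digit of B into the accumulator and then multiplies A by x^D modulo F. The function names and the example field GF(2^4) with F = x^4 + x + 1 are illustrative.

```python
def reduce_poly(v, poly, m):
    # Reduce the binary polynomial v modulo poly (degree m), top bit first.
    for i in range(v.bit_length() - 1, m - 1, -1):
        if (v >> i) & 1:
            v ^= poly << (i - m)
    return v


def clmul(a, digit):
    # Carry-less (XOR) product of a and a small digit.
    out, shift = 0, 0
    while digit:
        if digit & 1:
            out ^= a << shift
        digit >>= 1
        shift += 1
    return out


def digit_serial_mult(a, b, c, poly, m, d):
    """Return (A*B + C mod F, number of digit steps) with D = d bits per step."""
    mask = (1 << d) - 1
    steps = max(1, -(-b.bit_length() // d))     # ceil(bits of B / d), at least 1
    acc = c
    for _ in range(steps):
        # Accumulate A times the least significant remaining digit of B.
        acc = reduce_poly(acc ^ clmul(a, b & mask), poly, m)
        # Advance A by one digit position: A = A * x^d mod F.
        a = reduce_poly(a << d, poly, m)
        b >>= d
    return acc, steps
```

With m = 4 and D = 2 a full-width B takes two steps; a larger D trades more XOR logic per step for fewer steps.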
Prototypes
• 16-bit MC processor with 256 words of program memory.
• 24-bit AUC processor with 512 words of program memory.
• 128 registers of 167 bits each.
• 32-bit I/O interface to the host system.
• 4-, 8- and 16-bit multipliers (3 prototypes).
Number of cycles to compute curve operations (D: size of the multiplier; 167 is the size of the field)
Software ECC Profiling - unoptimized
Profile of ONB ECCDSA with 158-bit key

%time   seconds   #calls    msec/call   Name
50.9    2.74      3360427   0.00081     rot_right
29.6    1.59      10700     0.1486      opt_mul
13.0    0.70      1712760   0.0004      copy
2.4     0.13      130658    0.0010      rot_left
ECC Profiling - optimized
Profile of ONB ECCDSA with 158-bit key

%time   seconds   #calls    msec/call   Name
90 +    1.0       10661     0.0938      opt_mul
4.8     0.05      820       0.06        opt_inv
ECC Profiling - unoptimized
Profile of polynomial ECCDSA with 111-bit key

%time   seconds   #calls    msec/call   Name
67.5    3.78      3996909   0.0009      mul_shift
12.1    0.68      33321     0.0204      poly_mul_partial
6.1     0.34      333882    0.0010      div_shift
3.8     0.21      36544     0.0057      poly_div
ECC Profiling - optimized
Profile of polynomial ECCDSA with 111-bit key

%time   seconds   #calls    msec/call   Name
69.1    0.76      567       1.34        poly_inv
22.7    0.25      36544     0.0068      poly_div
5.5     0.06      2134      0.028       poly_mul
Software performance remains highly constrained by
memory access, large integer data and complex arithmetic
operations
Conclusions
• Hardware for cryptosystems relies on building a full processor or co-processor.
• Large register files are required.
• Reconfigurability is suitable for using different elliptic curves within the same hardware.
• Small additions to the ISA are not being exploited (would they be useful?).