# Survey on Hardware Implementations of Cryptographic Systems

Nov 21, 2013

Elliptic Curve Cryptography and others.

Goodman & Chandrakasan Implementation (Domain Specific Reconfigurable Cryptographic Processor)

Capable of performing an entire suite of cryptographic
primitives over integers modulo N, binary Galois fields and
nonsupersingular elliptic curves. This includes RSA and ECC.

Fully programmable parameters for many cryptographic
systems. Data size can vary from 8 to 1024 bits.

Reconfigurability limited to the subset of functions
(domain) required for public-key cryptography as defined
in IEEE 1363. Requires only a small set of configurations
for performing all required operations for RSA, ECC, etc.

Instruction Set Architecture

As defined by IEEE 1363 Public Key Cryptography
Standard Document.

24 instructions broken up into 6 types of operations:
conventional arithmetic, modular integer arithmetic,
GF($2^n$) arithmetic, elliptic curve arithmetic, register
manipulation and processor configuration.

DSRCP ISA

| Instruction | Operation |
|---|---|
| SET_LENGTH length | Sets width of processor to length+1 |
| REG_CLEAR rd, rs0 | Clears registers specified in mask formed by (rd, rs0) = R<7:0> |
| REG_MOVE rd, rs0 | rd = rs0 |
| (mnemonic missing) | rd is loaded from I/O interface |
| (mnemonic missing) | rs1 is unloaded to I/O interface |
| COMP rs0, rs1 | Sets "gt" and "eq" flags according to the result |
| (mnemonic missing) | rd = rs0 + rs1 + rs2<0>, for rs2<2:1> = 00 |
| | rd = (rs0 + rs1 + rs2<0>) >> 1, for rs2<2:1> = 01 |
| | rd = rs0 - rs1, for rs2<2:1> = 10 |
| | rd = (rs0 - rs1) >> 1, for rs2<2:1> = 11 |
| (mnemonic missing) | rd = (rs0 + rs1 + rs2<0>) mod N |
| MOD_SUB rd, rs0, rs1 | rd = (rs0 - rs1) mod N |
| MONTRED_A | (Pc, Ps) = $A \cdot 2^{-n} \bmod N$ |
| MONTMULT | (Pc, Ps) = $A \cdot B \cdot 2^{-n} \bmod N$ |
| MONTRED | (Pc, Ps) = $(Pc, Ps) \cdot 2^{-n} \bmod N$ |
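The three Montgomery instructions all return a value scaled by $2^{-n}$ modulo N. As a rough illustration of that arithmetic, here is a minimal Python sketch of bit-serial (radix-2) Montgomery multiplication. It assumes an odd modulus, `a < 2^n` and `b < N`. The function name `montmult` and its signature are hypothetical, and it returns one integer instead of the (Pc, Ps) register pair used on the slides.

```python
def montmult(a: int, b: int, n_mod: int, n_bits: int) -> int:
    """Return a * b * 2^(-n_bits) mod n_mod, consuming one bit of a per step."""
    assert n_mod % 2 == 1, "Montgomery reduction needs an odd modulus"
    acc = 0
    for i in range(n_bits):
        acc += ((a >> i) & 1) * b   # add b when bit i of a is set
        if acc & 1:                 # adding the odd modulus makes acc even
            acc += n_mod
        acc >>= 1                   # exact division by 2
    # acc stays below 2 * n_mod, so one conditional subtraction is enough
    return acc - n_mod if acc >= n_mod else acc


def montred_a(a: int, n_mod: int, n_bits: int) -> int:
    """Model of a reduction-only step: a * 2^(-n_bits) mod n_mod."""
    return montmult(a, 1, n_mod, n_bits)
```

Multiplying the result by $2^n$ modulo N recovers the ordinary product, which is a convenient way to check the sketch.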

DSRCP ISA (cont.)

| Instruction | Operation |
|---|---|
| MOD rd, rs0, rs1, rs2 | rd = $(\mathrm{rs1} \cdot 2^{n} + \mathrm{rs0}) \bmod N$, correction factor of $2^{2n} \bmod N$ stored in rs2 |
| MOD_MULT rd, rs0, rs1, rs2 | rd = $(\mathrm{rs0} \cdot \mathrm{rs1}) \bmod N$, correction factor of $2^{2n} \bmod N$ stored in rs2 |
| MOD_INV rd, rs0 | rd = (1 / rs0) mod N |
| MOD_EXP rd, rs0, rs2, length | rd = $\mathrm{rs0}^{Exp} \bmod N$, Exp has (length + 1) bits, correction factor of $2^{2n} \bmod N$ stored in rs2 |
| (mnemonic missing) | rd = rs0 + rs1 over GF($2^n$) (equiv. to rs0 XOR rs1) |
| GF_MULT | Pc = A . B |
| GF_INV | A = 1 / Pc |
| GF_INVMULT | A = B / Pc |
| GF_EXP rd, rs0, length | rd = $\mathrm{rs0}^{Exp} \bmod N$, Exp has (length + 1) bits |
| (mnemonic missing) | (rd, rd+1) = (rs0, rs0+1) + (rs1, rs1+1), over curve defined by parameters in (rs2, N) |
| EC_DOUBLE rd, rs0, rs2 | (rd, rd+1) = 2 . (rs0, rs0+1), over curve defined by parameters in (rs2, N) |
| EC_MULT length | (R4, R5) = Exp . (R2, R3), Exp has (length + 1) bits, over curve defined by parameters in (R6, N) |
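Several instructions expect a correction factor of $2^{2n} \bmod N$ in rs2. One plausible reading is that it cancels the $2^{-n}$ scaling left by Montgomery steps. The sketch below illustrates that reading only: the helper names `mont` and `mod_mult` are hypothetical, and the actual microcode sequence is not given in the slides.

```python
def mont(a, b, n_mod, n_bits):
    """Montgomery product a * b * 2^(-n_bits) mod n_mod (odd n_mod < 2^n_bits)."""
    acc = 0
    for i in range(n_bits):
        acc += ((a >> i) & 1) * b
        if acc & 1:
            acc += n_mod
        acc >>= 1
    return acc - n_mod if acc >= n_mod else acc


def mod_mult(a, b, n_mod, n_bits):
    """Plain modular product built from two Montgomery products."""
    correction = pow(2, 2 * n_bits, n_mod)     # value assumed to sit in rs2
    t = mont(a, b, n_mod, n_bits)              # a * b * 2^-n
    return mont(t, correction, n_mod, n_bits)  # (a*b*2^-n) * 2^2n * 2^-n = a * b
```

The second Montgomery product multiplies by $2^{2n}$ and divides by $2^{n}$, so the two scalings cancel and the plain product modulo N remains.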

Top-level system architecture

[Figure: block diagram showing the reconfigurable datapath (32 x 32 bits), shutdown controller, global controller, µ-code ROMs and I/O interface, with 32-bit Data and Instruction inputs.]

Reconfigurable Datapath

n-bit (8 <= n <= 1024) operands in 3 cycles.

Comparator: single-cycle magnitude comparisons between
two n-bit operands. XOR of the two operands.

Local registers in Reconfigurable Logic Unit (Pc, Ps, A, B,
Exp and N) for special purposes and operations. Eliminates
the need for accessing the register file every cycle.

Two operand buses (rs0 and rs1) and one write-back bus.

Modular arithmetic

Complex operations (multiplication, reduction, inversion,
exponentiation) use microcoded instructions

Simple operations (addition, subtraction, comparisons) are
implemented directly in hardware

Modular arithmetic (cont.)

Multiplication: Montgomery multiplication

MONTMULT(A, B, N) = $A \cdot B \cdot 2^{-n} \bmod N$

Modular inversion: extended binary Euclidean algorithm.

Modular exponentiation: square-and-multiply algorithm.

Precomputes and stores the values {$2^n$, $\mathrm{rs0} \cdot 2^n$, $\mathrm{rs0}^2 \cdot 2^n$, $\mathrm{rs0}^3 \cdot 2^n$} in {R0, R1, R2, R3}.
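A table of four precomputed powers suggests that the exponent is consumed two bits at a time, although the slides do not say so explicitly. The following Python sketch shows square-and-multiply under that assumption, working in the Montgomery domain throughout. The names `mont` and `mod_exp` are hypothetical, and the modulus must be odd and smaller than $2^n$.

```python
def mont(a, b, n_mod, n_bits):
    """Montgomery product a * b * 2^(-n_bits) mod n_mod (odd n_mod < 2^n_bits)."""
    acc = 0
    for i in range(n_bits):
        acc += ((a >> i) & 1) * b
        if acc & 1:
            acc += n_mod
        acc >>= 1
    return acc - n_mod if acc >= n_mod else acc


def mod_exp(base, exp, n_mod, n_bits):
    """Compute base^exp mod n_mod with a 2-bit window (assumed, see lead-in)."""
    r2 = pow(2, 2 * n_bits, n_mod)                   # correction factor 2^(2n) mod N
    base_m = mont(base % n_mod, r2, n_mod, n_bits)   # base * 2^n
    table = [pow(2, n_bits, n_mod), base_m]          # R0 = 2^n, R1 = base * 2^n
    table.append(mont(base_m, base_m, n_mod, n_bits))    # R2 = base^2 * 2^n
    table.append(mont(table[2], base_m, n_mod, n_bits))  # R3 = base^3 * 2^n

    acc = table[0]                                   # Montgomery form of 1
    for w in reversed(range((exp.bit_length() + 1) // 2)):
        acc = mont(acc, acc, n_mod, n_bits)          # two squarings per window
        acc = mont(acc, acc, n_mod, n_bits)
        digit = (exp >> (2 * w)) & 3
        acc = mont(acc, table[digit], n_mod, n_bits)  # digit 0 multiplies by "1"
    return mont(acc, 1, n_mod, n_bits)               # leave the Montgomery domain
```

Multiplying by the table entry on every window, including digit 0, keeps the instruction sequence uniform; whether the processor skips that step is not stated in the source.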

GF($2^n$) arithmetic

Addition: XOR function in the comparator unit.

Multiplication and inversion: implemented directly in
hardware using the reconfigurable datapath.

Exponentiation: same manner as modular exponentiation,
with {1, rs0, $\mathrm{rs0}^2$, $\mathrm{rs0}^3$} stored in {R0, R1, R2, R3}.
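To make the GF($2^n$) operations concrete, here is a small Python model using a polynomial basis, with field elements held as integers whose bits are the polynomial coefficients. The function names, the shift-and-add multiplier and the 2-bit exponent window are illustrative assumptions; the slides say multiplication is done directly in the datapath but do not describe the circuit.

```python
def gf_add(a, b):
    """Addition in GF(2^n) is a bitwise XOR."""
    return a ^ b


def gf_mult(a, b, poly, n):
    """Shift-and-add product of a and b modulo poly (poly includes the x^n term)."""
    acc = 0
    for i in reversed(range(n)):
        acc <<= 1                 # multiply the running result by x
        if acc >> n:              # degree reached n: subtract the field polynomial
            acc ^= poly
        if (b >> i) & 1:
            acc ^= a
    return acc


def gf_exp(base, exp, poly, n):
    """base^exp using the table {1, base, base^2, base^3} and a 2-bit window."""
    table = [1, base, gf_mult(base, base, poly, n)]
    table.append(gf_mult(table[2], base, poly, n))
    acc = 1
    for w in reversed(range((exp.bit_length() + 1) // 2)):
        acc = gf_mult(acc, acc, poly, n)
        acc = gf_mult(acc, acc, poly, n)
        acc = gf_mult(acc, table[(exp >> (2 * w)) & 3], poly, n)
    return acc
```

For example, with the AES field polynomial `0x11B` and `n = 8`, `gf_mult(0x57, 0x83, 0x11B, 8)` gives `0xC1`, the product worked out in FIPS 197.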

Elliptic Curve arithmetic

Point addition and doubling: implemented in microcode,
with curve points stored as register pairs ($R_i$, $R_{i+1}$) = (x, y).

Point multiplication: performed using a repeated double-and-add algorithm.
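Point multiplication built from point addition and doubling can be sketched as follows. This is a toy model in affine coordinates on a nonsupersingular binary curve $y^2 + xy = x^3 + ax^2 + b$. The field GF($2^4$), its polynomial, the curve parameters `A` and `B`, and all function names are illustrative assumptions, not values from the slides, and the processor's microcode may order the steps differently.

```python
M, POLY = 4, 0b10011      # toy field GF(2^4), reduction polynomial x^4 + x + 1
A, B = 0b1000, 0b1001     # toy curve y^2 + xy = x^3 + A*x^2 + B
INF = None                # point at infinity


def gf_mult(a, b):
    acc = 0
    for i in reversed(range(M)):
        acc <<= 1
        if acc >> M:
            acc ^= POLY
        if (b >> i) & 1:
            acc ^= a
    return acc


def gf_inv(a):
    """a^(2^M - 2) by repeated multiplication; only sensible for a toy field."""
    result = 1
    for _ in range(2 ** M - 2):
        result = gf_mult(result, a)
    return result


def ec_double(p):
    if p is INF or p[0] == 0:          # x = 0 means P = -P, so 2P is infinity
        return INF
    x, y = p
    lam = x ^ gf_mult(y, gf_inv(x))
    x3 = gf_mult(lam, lam) ^ lam ^ A
    y3 = gf_mult(x, x) ^ gf_mult(lam ^ 1, x3)
    return (x3, y3)


def ec_add(p, q):
    if p is INF:
        return q
    if q is INF:
        return p
    (x1, y1), (x2, y2) = p, q
    if x1 == x2:                       # either the same point or P and -P
        return ec_double(p) if y1 == y2 else INF
    lam = gf_mult(y1 ^ y2, gf_inv(x1 ^ x2))
    x3 = gf_mult(lam, lam) ^ lam ^ x1 ^ x2 ^ A
    y3 = gf_mult(lam, x1 ^ x3) ^ x3 ^ y1
    return (x3, y3)


def ec_mult(k, p):
    """Double-and-add, scanning the bits of k from most to least significant."""
    acc = INF
    for i in reversed(range(k.bit_length())):
        acc = ec_double(acc)
        if (k >> i) & 1:
            acc = ec_add(acc, p)
    return acc
```

The loop has the same shape as square-and-multiply, with doubling in place of squaring and point addition in place of multiplication.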

Summary

Tier I: implemented directly in hardware.

Tier II: microcoded instructions composed of sequences of
first-tier instructions.

Tier III: microcoded instructions composed of sequences
of both first- and second-tier instructions.

Orlando and Paar Implementation

Elliptic Curve co-processor. Configurable for any size of
the field (number of bits in the key). Needs a host system.

Main features: optimized bit-parallel squarer, digit-serial
multiplier and two programmable processors.

Most dramatic example: squaring can be performed in one
clock cycle, whereas a general architecture usually requires
m/2 clock cycles (m >= 160 for reasonable security).
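Squaring is cheap in GF($2^m$) because it is a linear operation: squaring a polynomial over GF(2) just spreads its coefficients to the even positions, and only a reduction remains. The Python sketch below illustrates this for a polynomial basis. The function name `gf_square` is hypothetical, and a hardware squarer would do both steps as fixed wiring and XOR gates, not as loops.

```python
def gf_square(a, poly, m):
    """Square a in GF(2^m); poly is the field polynomial including the x^m term."""
    # Step 1: bit i of a moves to bit 2*i, since (sum a_i x^i)^2 = sum a_i x^(2i).
    spread = 0
    for i in range(m):
        if (a >> i) & 1:
            spread |= 1 << (2 * i)
    # Step 2: fold the bits of degree m and above back using the field polynomial.
    for i in reversed(range(m, 2 * m - 1)):
        if (spread >> i) & 1:
            spread ^= poly << (i - m)
    return spread
```

Both steps depend only on the field polynomial, not on the operand, which is consistent with the slide's claim that a dedicated squarer finishes in one clock cycle.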

Processor Architecture

MC: main controller. Controls the computation of kP and
interacts with the host system.

AUC: arithmetic unit controller. Controls the
computation of point operations.

AU: arithmetic unit. Performs field
multiplications and inversions under control of the AUC.

Example: Point multiplication

1) Host loads k into the MC.

2) Host loads the coordinates of P into the AU.

3) Host commands the MC to start processing.

4) MC initializes (series of operations).

5) MC commands AUC to perform its initialization.

6) Computation is performed by MC, AUC and AU.

Arithmetic Unit

Responsible for field arithmetic

Contains a register file, LSD first multiplier, squarer,
accumulator and a zero test circuit.

Arithmetic Unit (cont.)

Field addition / subtraction: 2 clock cycles.

Field multiplication: $AB + C \bmod F$ takes $k_D$ clock cycles ($1 \le k_D \le \lceil m/D \rceil$).

$k_D$ represents the number of digits of $B$ and $m$ the number
of bits of the field.
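The cycle count $k_D$ follows from consuming $B$ one $D$-bit digit per step, least significant digit first. The Python sketch below models that loop and returns the step count alongside the result. The function names and the way the partial products are accumulated are assumptions for illustration; the slides do not describe the multiplier's internal structure.

```python
def clmul(a, b):
    """Carry-less (polynomial) product of a and b over GF(2)."""
    acc = 0
    while b:
        if b & 1:
            acc ^= a
        a <<= 1
        b >>= 1
    return acc


def gf_reduce(v, poly, m):
    """Reduce v modulo the field polynomial poly of degree m."""
    for i in reversed(range(m, v.bit_length())):
        if (v >> i) & 1:
            v ^= poly << (i - m)
    return v


def lsd_mult_add(a, b, c, poly, m, d):
    """Return (A*B + C mod F, steps) processing d bits of B per step, LSD first."""
    steps = max(1, -(-b.bit_length() // d))    # k_D: number of d-bit digits in B
    mask = (1 << d) - 1
    acc = c
    for _ in range(steps):
        acc ^= clmul(a, b & mask)              # add A * (current digit of B)
        a = gf_reduce(a << d, poly, m)         # A <- A * x^d mod F for next digit
        b >>= d
    return gf_reduce(acc, poly, m), steps
```

A wider digit size `d` reduces the number of steps at the cost of a larger partial-product circuit, which is the trade-off the 4-, 8- and 16-bit prototypes explore.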

Prototypes

16-bit MC processor with 256 words of program memory.

24-bit AUC processor with 512 words of program memory.

128 registers of 167 bits each.

32-bit I/O interface to the host system.

4-, 8- and 16-bit multipliers (3 prototypes).

Number of cycles to compute curve operations (D: size
of the multiplier; 167 is the size of the field)

Software ECC Profiling - unoptimized

| %time | seconds | #calls | msec/call | Name |
|---|---|---|---|---|
| 50.9 | 2.74 | 3360427 | 0.00081 | rot_right |
| 29.6 | 1.59 | 10700 | 0.1486 | opt_mul |
| 13.0 | 0.70 | 1712760 | 0.0004 | copy |
| 2.4 | 0.13 | 130658 | 0.0010 | rot_left |

Profile of ONB ECCDSA with 158-bit key

ECC Profiling - optimized

| %time | seconds | #calls | msec/call | Name |
|---|---|---|---|---|
| 90+ | 1.0 | 10661 | 0.0938 | opt_mul |
| 4.8 | 0.05 | 820 | 0.06 | opt_inv |

Profile of ONB ECCDSA with 158-bit key

ECC Profiling - unoptimized

| %time | seconds | #calls | msec/call | Name |
|---|---|---|---|---|
| 67.5 | 3.78 | 3996909 | 0.0009 | mul_shift |
| 12.1 | 0.68 | 33321 | 0.0204 | poly_mul_partial |
| 6.1 | 0.34 | 333882 | 0.0010 | div_shift |
| 3.8 | 0.21 | 36544 | 0.0057 | poly_div |

Profile of polynomial ECCDSA with 111-bit key

ECC Profiling - optimized

| %time | seconds | #calls | msec/call | Name |
|---|---|---|---|---|
| 69.1 | 0.76 | 567 | 1.34 | poly_inv |
| 22.7 | 0.25 | 36544 | 0.0068 | poly_div |
| 5.5 | 0.06 | 2134 | 0.028 | poly_mul |

Profile of polynomial ECCDSA with 111-bit key

Software performance remains highly constrained by
memory access, large integer data and complex arithmetic
operations.

Conclusions

Hardware for cryptosystems relies on the full construction of
a processor or co-processor.

Large register files are required.

Reconfigurability is suitable for using different elliptic
curves within the same hardware.

Small additions to the ISA are not being exploited (would they
be useful?)
