Workshop on Cryptographic Hardware and Embedded Systems (CHES 2006)
13/10/2006
Superscalar Coprocessor for High-speed Curve-based Cryptography
K. Sakiyama, L. Batina, B. Preneel, I. Verbauwhede
Katholieke Universiteit Leuven / IBBT
Department Electrical Engineering - ESAT/COSIC

Overview
  Introduction
  Curve-based Cryptography
  HW/SW Partitioning
  Superscalar Coprocessor
  Results
  Conclusions

Introduction: Motivation
  High-speed curve-based cryptography in HW/SW co-design
    How much instruction-level parallelism can we obtain from coprocessor instructions?
    Performance improvement for different operation forms in the datapath
      AB+C mod P vs. A(B+D)+C mod P, where A, B, C, D, P are polynomials
  Performance comparison of three different curve-based cryptosystems
    Which is the fastest of ECC, HECC, and ECC over a composite field?
  Programmability and scalability
    Programmable in order to support different cryptosystems?
    Scalable in field sizes?
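
The two operation forms compared above can be sketched in software. This is a minimal illustration only, not the coprocessor's implementation: binary polynomials are stored as Python integers (bit i is the coefficient of x^i), and the helper names `gf2_mod`, `gf2_mul`, `ab_plus_c`, `a_bd_plus_c` as well as the toy modulus x^3 + x + 1 are assumptions made for the example.

```python
def gf2_mod(a, p):
    """Reduce binary polynomial a modulo p (integers used as bit vectors)."""
    deg = p.bit_length() - 1
    while a.bit_length() > deg:
        # Cancel the leading term of a with a shifted copy of p.
        a ^= p << (a.bit_length() - 1 - deg)
    return a


def gf2_mul(a, b):
    """Carry-less (polynomial) product of a and b over GF(2)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        a <<= 1
        b >>= 1
    return r


def ab_plus_c(a, b, c, p):
    """Operation form AB + C mod P."""
    return gf2_mod(gf2_mul(a, b) ^ c, p)


def a_bd_plus_c(a, b, d, c, p):
    """Operation form A(B + D) + C mod P; addition over GF(2) is XOR."""
    return gf2_mod(gf2_mul(a, b ^ d) ^ c, p)


# Toy modulus x^3 + x + 1 (an assumption chosen to keep the numbers small).
P = 0b1011
a, b, d, c = 0b010, 0b100, 0b001, 0b111

# A(B+D)+C in one operation equals two chained AB+C operations: AD + (AB + C).
fused = a_bd_plus_c(a, b, d, c, P)
chained = ab_plus_c(a, d, ab_plus_c(a, b, c, P), P)
```

Under these assumptions the fused form computes in one operation what the AB+C form needs two operations for, which is the kind of saving the comparison asks about.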

Introduction: Target Architecture
  Curve-based cryptography over binary fields
    Hardware can be smaller and faster than for prime fields
    ECC over a binary field, e.g. GF(2^163)
    HECC of genus 2
      Field length can be shorter by a factor of 2, e.g. GF(2^83)
    ECC over a composite field
      Field length can be shorter by a factor of 2, e.g. GF((2^83)^2)
  The datapath can be shared
    Programmable coprocessor supporting three curve-based cryptosystems by defining coprocessor instruction(s)
    (Coprocessor) instruction-level parallelism by superscalar

Curve-based Cryptography: HW/SW partitioning (1)
  General hierarchy in coprocessor for curve-based cryptography
  [Figure: hierarchy of operations, top to bottom]
    Point/Divisor Multiplication
    Point/Divisor Addition, Point/Divisor Doubling
    Finite Field Addition, Finite Field Multiplication, Finite Field Inversion
  [Figure labels: HW Datapath; SW or HW controller (twice)]

Curve-based Cryptography: Proposed Hierarchy (1)
  Single instruction for all finite field operations
  Fixed-cycle execution enables efficient implementation
  [Figure: the general hierarchy (Point/Divisor Multiplication; Point/Divisor Addition, Point/Divisor Doubling; Finite Field Addition, Finite Field Multiplication, Finite Field Inversion) next to the proposed hierarchy (Point/Divisor Multiplication; Point/Divisor Addition, Point/Divisor Doubling; Finite Field Inversion; Finite Field Operation, e.g. AB+C mod P)]
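
How a single AB+C mod P instruction can cover the other finite field operations may be sketched as follows. This is an illustrative software model only: the function names are hypothetical, operands are assumed to be already reduced, and inversion via Fermat's little theorem is an assumption made for the example (the slides do not say which inversion method is used).

```python
def malu(a, b, c, p):
    """Model of one instruction: A*B + C mod P over GF(2)[x].

    Operands are integers used as bit vectors and must already be reduced
    (degree below deg P).
    """
    k = p.bit_length() - 1
    t = 0
    for i in reversed(range(k)):   # scan A from its most significant bit
        t <<= 1                    # T <- x * T
        if (a >> i) & 1:
            t ^= b                 # add B when the current bit of A is set
        if (t >> k) & 1:
            t ^= p                 # reduce modulo P
    return t ^ c


def ff_add(a, c, p):
    """Addition as A*1 + C."""
    return malu(a, 1, c, p)


def ff_mul(a, b, p):
    """Multiplication as A*B + 0."""
    return malu(a, b, 0, p)


def ff_inv(a, p):
    """Inversion as a^(2^k - 2), using only multiplications (assumed method)."""
    k = p.bit_length() - 1
    result = 1
    power = a
    for _ in range(k - 1):
        power = ff_mul(power, power, p)    # a^(2^i)
        result = ff_mul(result, power, p)  # accumulates a^(2 + 4 + ... + 2^i)
    return result


P = 0b1011  # toy modulus x^3 + x + 1, chosen for the example
inverse_of_x = ff_inv(0b010, P)
```

Because every operation above is a fixed sequence of calls to the same instruction, the cycle count does not depend on the operand values, which fits the fixed-cycle execution noted on the slide.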

Curve-based Cryptography: Modular Arithmetic Logic Unit (MALU)
  (a) Building block: regular XOR chains
  (b) Scalable in digit size (d) and field size (k) by interconnecting several building blocks
  We use MALU_83 (n=83, d=12) as building block
  2xMALU_83 can be configured as 1xMALU_163
  [Figure, panels (a) and (b): signals a_i, m_i, c_i, a_k, m_k, c_(k+1), B(x), P(x), T(x), T_next(x); blocks joined by "Interconnection"; dimensions d and n]
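
The effect of the digit size d on the cycle count can be illustrated with a small software model. It is a sketch under stated assumptions, not the MALU design itself: it assumes an MSB-first update of the form T_next(x) = x*T(x) + a_i*B(x) + m_i*P(x), with d such bit steps folded into one clock cycle, and the function name `malu_digit_serial` is hypothetical.

```python
def malu_digit_serial(a, b, c, p, d):
    """Compute A*B + C mod P, consuming d bits of A per modelled clock cycle.

    Returns (result, cycles). Operands are reduced binary polynomials stored
    as integers; k is the degree of P.
    """
    k = p.bit_length() - 1
    cycles = -(-k // d)                       # ceil(k / d)
    t = 0
    for cycle in range(cycles):
        for j in range(d):                    # d chained bit steps per cycle
            i = (cycles - cycle) * d - 1 - j  # bit index of A, MSB first
            t <<= 1                           # x * T(x)
            if (a >> i) & 1:
                t ^= b                        # + a_i * B(x)
            if (t >> k) & 1:
                t ^= p                        # + m_i * P(x)
    return t ^ c, cycles


# Toy field with P = x^3 + x + 1: x * x^2 = x + 1, in ceil(3/2) = 2 cycles.
result, cycles = malu_digit_serial(0b010, 0b100, 0, 0b1011, d=2)

# Cycle count only, for a degree-83 modulus and d = 12 as on the slide.
# The modulus here is a placeholder and is not claimed to be irreducible.
_, cycles_83 = malu_digit_serial(1, 1, 0, (1 << 83) | 1, d=12)
```

In this model a larger d trades more logic per cycle for fewer cycles: with k = 83 and d = 12 one operation takes ceil(83/12) = 7 modelled cycles.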

HW/SW Partitioning: TYPE I: Smallest implementation (baseline)
  [Block diagram: Main CPU, Program ROM, SRAM, Instruction Bus (32-bit instructions), Data Bus (32-bit data), Memory Mapped I/O; Coprocessor with IBC, DBC, MALU_83]

HW/SW Partitioning: TYPE II: TYPE I + μ-code RAM
  [Block diagram: Main CPU, Program ROM, SRAM, Instruction Bus (32-bit instructions), Data Bus (32-bit data), Memory Mapped I/O; Coprocessor with IBC, DBC, μ-code RAM, FSM, MALU_83]

HW/SW Partitioning: TYPE III: TYPE I + Coprocessor Memory
  [Block diagram: Main CPU, Program ROM, SRAM, Instruction Bus (32-bit instructions), Data Bus (32-bit data), Memory Mapped I/O; Coprocessor with IBC, DBC, Coprocessor Memory, MALU_83]

HW/SW Partitioning: TYPE IV: TYPE I + Copro. Mem. & μ-code RAM
  [Block diagram: Main CPU, Program ROM, SRAM, Instruction Bus (32-bit instructions), Data Bus (32-bit data), Memory Mapped I/O; Coprocessor with IBC, DBC, Coprocessor Memory, μ-code RAM, FSM, MALU_83]

HW/SW Partitioning: Co-design flow with GEZEL
  [Flow diagram]
    C/C++ codes for PKCs
    Partitioning of functions
    C/C++ codes & H/W behavior blocks w/ interface
    ARM (SW): C/C++ codes w/ physical memory map, Cross compile, Program codes
    Co-processor (HW): GEZEL FDL codes, Synthesis, VHDL codes
    Cycle-true sim. (GEZEL)

HW/SW Partitioning: Result: Vertical Exploration of System
  HECC performance for different HW/SW partitioning
  (Performance: Point/Divisor multiplication)
  [Stacked bar chart. x-axis: System Configuration (TYPE I, TYPE II, TYPE III, TYPE IV); y-axis: Required Clock Cycles [K], 0 to 500; series: I/O Transfer Overhead + Others, Coprocessor Data Memory, Datapath; data labels as extracted: 38, 38, 67, 67, 67, 67, 0, 0, 187, 2,859, 0, 2,672]

  Coprocessor Configuration   μ-code RAM   Data Mem.
  TYPE I
  TYPE II                     X
  TYPE III                                 X
  TYPE IV                     X            X

Superscalar Coprocessor: Proposed Hierarchy (2)
  Multiple Modular Arithmetic Logic Units (MALUs) in coprocessor
  [Figure: the proposed hierarchy with a single Finite Field Operation (e.g. AB+C mod P) next to the same hierarchy with several parallel Finite Field Operation units (e.g. AB+C mod P each); upper levels in both: Point/Divisor Multiplication; Point/Divisor Addition, Point/Divisor Doubling; Finite Field Inversion]

Superscalar Coprocessor: Parallel Processing Architecture (TYPE IV-based)
  [Block diagram: Main CPU, Program ROM, SRAM, Instruction Bus (32-bit instructions), Data Bus (32-bit data), Memory Mapped I/O; Coprocessor with IBC, DBC, IQB, FSM, μ-code RAM, Coprocessor Memory, 4 x MALU_83; signal "Buffer Full"]

Superscalar Coprocessor: Horizontal Exploration of System
  Performance of ECC and HECC
  [Stacked bar chart. x-axis: Coprocessor Configuration, grouped by operation (AB+C, A(B+D)+C); bar labels: 1xMALU_83 (twice), 2xMALU_83, 3xMALU_83, 4xMALU_83 for HECC and 1xMALU_163, 2xMALU_163 for ECC; y-axis: Required Clock Cycles [K], 0 to 100; series: Coprocessor Data Memory, Datapath; data labels as extracted: 67, 58, 30, 36, 22, 20, 20, 38, 41, 25, 13, 22, 22, 8]

Results: Performance for ECC over GF(2^83)
  Fastest of three
  x1.8 speed-up by 2-way superscaling (ILP D P=6) with A(B+D)+C
  Still more improvement is possible by adding MALUs
  [Chart series: AB+C, A(B+D)+C]

Results: Performance of HECC over GF(2^83)
  Faster than ECC over a composite field
  x2.7 speed-up by 4-way superscaling (ILP D P=5) with A(B+D)+C
  Less improvement as the number of MALUs increases
  [Chart series: AB+C, A(B+D)+C]

Results: Performance for ECC over GF((2^83)^2)
  Slowest of three
  x2.5 speed-up by 4-way superscaling (ILP D P=6) with A(B+D)+C
  Less improvement as the number of MALUs increases
  [Chart series: AB+C, A(B+D)+C]

Results: Comparison of ECC/HECC implementations on FPGAs
  [11] T. Wollinger, PhD thesis, 2004.
  [13] G. Orlando and C. Paar, CHES 2000.
  [14] N. Gura et al., CHES 2002.
  [29] Nazar A. Saqib et al., International Journal of Embedded Systems, 2005.

Conclusions
  Performance improvement / Comparison
    ECC was improved by a factor of 1.8 (2-way)
    HECC (genus 2) was improved by a factor of 2.7 (4-way)
    ECC over a composite field was improved by a factor of 2.5 (4-way)
    A(B+D)+C offers better performance than AB+C
    ECC is the fastest in this case study
  Programmability & flexibility
    Supports three different curve-based cryptosystems over a binary field
    Arbitrary irreducible polynomial
    Field size up to 332 bits by using 4xMALU_83
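
The reported improvement factors can be put side by side as speed-up per issue way. This is simple arithmetic on the figures quoted above, offered as a rough reading aid; the function name `speedup_per_way` is hypothetical, and the factors also include the gain from using A(B+D)+C, so the ratio is not a pure measure of parallel efficiency.

```python
# (speed-up factor, number of issue ways) as stated in the conclusions.
reported = {
    "ECC": (1.8, 2),
    "HECC (genus 2)": (2.7, 4),
    "ECC over a composite field": (2.5, 4),
}


def speedup_per_way(speedup, ways):
    """Speed-up divided by the number of parallel issue ways."""
    return speedup / ways


per_way = {name: speedup_per_way(s, w) for name, (s, w) in reported.items()}

for name, value in per_way.items():
    print(f"{name}: {value:.3f} speed-up per way")
```

The lower ratios for the 4-way configurations are consistent with the observation in the results that the improvement shrinks as more MALUs are added.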
Thank you!

Parallel issue of instructions: Case of using 4 MALUs
  [Pipeline timing diagram: MALU#0 to MALU#3 along the clock-cycle axis, with stages IF/D, R0 to R3, EX, W0 to W3; clock-cycle labels: 1, 4(3*), k/d, 4]
  IF/D: Instruction Fetch & Decode
  R_: Read operands (dependent on the type of operation)
  EX: Execution (dependent on MALU configuration, k & d)
  W_: Write (dependent on # of instructions issued in parallel)

Parallel issue of instructions: Out-of-order Execution
  Check RAW (Read After Write) dependency for in-order/out-of-order execution
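
A RAW check for selecting instructions to issue in parallel might look like the sketch below. The instruction format (`Instr` with one destination and a tuple of sources), the register names, and the function `select_issue_group` are all hypothetical; the slides do not describe the actual issue logic. The sketch checks RAW dependencies only, as named on the slide; a complete out-of-order design would also have to handle WAR and WAW hazards.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    dest: str               # register written by the instruction
    srcs: Tuple[str, ...]   # registers read by the instruction


def has_raw(earlier: Instr, later: Instr) -> bool:
    """True if `later` reads the register that `earlier` writes."""
    return earlier.dest in later.srcs


def select_issue_group(queue, ways, in_order=False):
    """Pick indices of up to `ways` queued instructions to issue together.

    An instruction is eligible only if it has no RAW dependency on any
    earlier instruction still in the queue. With in_order=True, selection
    stops at the first blocked instruction.
    """
    group = []
    for i, ins in enumerate(queue):
        if len(group) == ways:
            break
        blocked = any(has_raw(queue[j], ins) for j in range(i))
        if blocked:
            if in_order:
                break
            continue
        group.append(i)
    return group


# Four AB+C style instructions; the second and fourth depend on their predecessor.
queue = [
    Instr("r1", ("a", "b", "c")),
    Instr("r2", ("r1", "b", "c")),
    Instr("r3", ("d", "e", "f")),
    Instr("r4", ("r3", "e", "f")),
]
out_of_order_group = select_issue_group(queue, ways=2)
in_order_group = select_issue_group(queue, ways=2, in_order=True)
```

In this example the out-of-order selection issues the first and third instructions together, while the in-order selection issues only the first, because the second instruction must wait for r1.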