Workshop on Cryptographic Hardware and Embedded Systems (CHES 2006)

13/10/2006


Superscalar Coprocessor for High-speed Curve-based Cryptography

K. Sakiyama, L. Batina, B. Preneel, I. Verbauwhede


Katholieke Universiteit Leuven / IBBT

Department Electrical Engineering - ESAT/COSIC



Overview

  Introduction
  Curve-based Cryptography
  HW/SW Partitioning
  Superscalar Coprocessor
  Results
  Conclusions



Introduction

Motivation

  High-speed curve-based cryptography in HW/SW co-design
    How much instruction-level parallelism can we obtain from coprocessor instructions?

  Performance improvement for different operation forms in the datapath
    AB+C mod P vs. A(B+D)+C mod P, where A, B, C, D, P are polynomials

  Performance comparison of three different curve-based cryptosystems
    Which is fastest: ECC, HECC, or ECC over a composite field?

  Programmability and scalability
    Programmable in order to support different cryptosystems?
    Scalable in field sizes?


Introduction

Target Architecture

  Curve-based cryptography over binary fields
    Hardware can be smaller and faster than for prime fields

  ECC over a binary field, e.g. GF(2^163)

  HECC of genus 2
    Field length can be shorter by a factor of 2, e.g. GF(2^83)

  ECC over a composite field
    Field length can be shorter by a factor of 2, e.g. GF((2^83)^2)

  The datapath can be shared
    Programmable coprocessor supporting three curve-based cryptosystems by defining coprocessor instruction(s)
    (Coprocessor) instruction-level parallelism by superscalar execution
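
One way to see why a base-field datapath can be shared with the composite-field case: a multiplication in a quadratic extension decomposes into a handful of base-field AB+C operations. The sketch below is illustrative only. The slides do not state which extension polynomial is used; the choice y^2 + y + 1 (irreducible over GF(2^n) whenever n is odd, as for n = 83), the toy base field GF(2^3) and the names `fused_op` and `ext_mul` are assumptions made here.

```python
def fused_op(a, b, c, p):
    """A*B + C mod P over GF(2); polynomials are stored as Python ints."""
    acc = 0
    while b:                      # carry-less shift-and-XOR multiplication
        if b & 1:
            acc ^= a
        a <<= 1
        b >>= 1
    acc ^= c                      # addition in GF(2)[x] is XOR
    deg_p = p.bit_length() - 1
    while acc.bit_length() > deg_p:   # reduce until degree < deg(P)
        acc ^= p << (acc.bit_length() - 1 - deg_p)
    return acc


# Toy base field GF(2^3) with P(x) = x^3 + x + 1 (assumption, for illustration).
P = 0b1011


def ext_mul(a, b, p=P):
    """Multiply a0 + a1*y by b0 + b1*y, assuming y^2 = y + 1.

    (a0 + a1*y)(b0 + b1*y) = (a0*b0 + a1*b1) + (a0*b1 + a1*b0 + a1*b1)*y
    """
    a0, a1 = a
    b0, b1 = b
    t = fused_op(a1, b1, 0, p)      # a1*b1
    c0 = fused_op(a0, b0, t, p)     # a0*b0 + a1*b1
    c1 = fused_op(a0, b1, t, p)     # a0*b1 + a1*b1
    c1 = fused_op(a1, b0, c1, p)    # a1*b0 + a0*b1 + a1*b1
    return (c0, c1)


# y * y reduces to y + 1, i.e. the pair (1, 1)
print(ext_mul((0, 1), (0, 1)))
```

In this sketch every extension-field product costs four base-field AB+C operations, so the same instruction stream format serves both field types.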



Curve-based Cryptography

HW/SW partitioning (1)

  General hierarchy in a coprocessor for curve-based cryptography

[Figure: three-level hierarchy. Top: Point/Divisor Multiplication. Middle: Point/Divisor Addition, Point/Divisor Doubling. Bottom: Finite Field Addition, Finite Field Multiplication, Finite Field Inversion. Side labels: "SW or HW controller" (twice) and "HW Datapath".]



Curve-based Cryptography

Proposed Hierarchy (1)

  Single instruction for all finite field operations
  Fixed-cycle execution enables efficient implementation

[Figure: two hierarchies side by side. The general one lists Point/Divisor Multiplication, Point/Divisor Addition, Point/Divisor Doubling, Finite Field Addition, Finite Field Multiplication and Finite Field Inversion. The proposed one lists Point/Divisor Multiplication, Point/Divisor Addition, Point/Divisor Doubling, a single Finite Field Operation (e.g. AB+C mod P) and Finite Field Inversion.]
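
A minimal sketch of how one fused operation can stand in for the separate field operations, assuming polynomials over GF(2) stored as Python integers. The toy field GF(2^4) with P(x) = x^4 + x + 1 and all function names are hypothetical choices for illustration. The Fermat-style inversion is one common construction; the slides do not say which inversion method the coprocessor uses.

```python
def clmul(a: int, b: int) -> int:
    """Carry-less (polynomial) product of two GF(2)[x] polynomials."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result


def polymod(a: int, p: int) -> int:
    """Reduce polynomial a modulo polynomial p over GF(2)."""
    deg_p = p.bit_length() - 1
    while a.bit_length() - 1 >= deg_p:
        a ^= p << (a.bit_length() - 1 - deg_p)
    return a


def fused_op(a: int, b: int, c: int, p: int) -> int:
    """The single operation form A*B + C mod P."""
    return polymod(clmul(a, b) ^ c, p)


def fused_op_pre_add(a: int, b: int, d: int, c: int, p: int) -> int:
    """The alternative operation form A*(B+D) + C mod P."""
    return fused_op(a, b ^ d, c, p)


# Toy field GF(2^4), P(x) = x^4 + x + 1 (assumption, for illustration only).
P = 0b10011


def ff_add(a, c):
    return fused_op(a, 1, c, P)      # A*1 + C


def ff_mul(a, b):
    return fused_op(a, b, 0, P)      # A*B + 0


def ff_sqr(a):
    return fused_op(a, a, 0, P)      # A*A + 0


def ff_inv(a, n=4):
    """a^(2^n - 2) via repeated squaring and multiplication."""
    result = 1
    for _ in range(n - 1):
        a = ff_sqr(a)
        result = ff_mul(result, a)
    return result


print(ff_mul(0b0010, 0b1000))  # x * x^3 = x^4 = x + 1, i.e. 3
```

Because every call goes through `fused_op`, each field operation costs the same number of datapath cycles in this model, which is the property the fixed-cycle bullet relies on.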




Curve-based Cryptography

Modular Arithmetic Logic Unit (MALU)

  (a) Building block: regular XOR chains
  (b) Scalable in digit size (d) and field size (k) by interconnecting several building blocks

  We use MALU83 (n=83, d=12) as the building block
  2xMALU83 can be configured as 1xMALU163

[Figure: (a) one building block with signals labelled a_i, B(x), m_i, P(x), T(x), c_i, a_k, m_k, c_{k+1} and T_next(x); (b) several building blocks joined by interconnections. Dimensions are labelled d and n.]
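
A behavioural sketch of a digit-serial multiply-accumulate, to make the roles of n and d concrete. This is a software model under stated assumptions, not the MALU circuit: it consumes d bits of A per modelled clock cycle, most significant digit first, and adds C at the end. The function name and the choice to return the cycle count are hypothetical. The 163-bit polynomial x^163 + x^7 + x^6 + x^3 + 1 is the reduction polynomial of the NIST 163-bit binary curves and is used here only as a convenient test value.

```python
from math import ceil


def malu_mul_add(a, b, c, p, n, d):
    """Digit-serial A*B + C mod P over GF(2^n), d bits of A per cycle.

    Assumes a, b, c have degree < n and p has degree exactly n.
    Returns (result, cycles).
    """
    cycles = ceil(n / d)
    t = 0
    for i in reversed(range(cycles)):
        digit = (a >> (i * d)) & ((1 << d) - 1)
        # One modelled clock cycle: d chained steps T <- T*x + a_j*B (mod P).
        for j in reversed(range(d)):
            t <<= 1
            if (t >> n) & 1:        # degree reached n: subtract (XOR) P
                t ^= p
            if (digit >> j) & 1:
                t ^= b
    return t ^ c, cycles


P163 = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

# x^162 * x = x^163, which reduces to x^7 + x^6 + x^3 + 1 (0xc9).
result, cycles = malu_mul_add(1 << 162, 0b10, 0, P163, n=163, d=12)
print(hex(result), cycles)  # 0xc9 14
```

With n=83 and d=12 the same model needs ceil(83/12) = 7 cycles per operation, against 14 for n=163.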

HW/SW Partitioning

TYPE I: Smallest implementation (baseline)

[Figure: block diagram. Blocks: Main CPU, Program ROM, SRAM, Instruction Bus (32-bit instructions), Data Bus (32-bit data), Memory Mapped I/O, and a Coprocessor with IBC, DBC and MALU83.]


HW/SW Partitioning

TYPE II: TYPE I + μ-code RAM

[Figure: block diagram. Blocks: Main CPU, Program ROM, SRAM, Instruction Bus (32-bit instructions), Data Bus (32-bit data), Memory Mapped I/O, and a Coprocessor with IBC, DBC, μ-code RAM, FSM and MALU83.]


HW/SW Partitioning

TYPE III: TYPE I + Coprocessor Memory

[Figure: block diagram. Blocks: Main CPU, Program ROM, SRAM, Instruction Bus (32-bit instructions), Data Bus (32-bit data), Memory Mapped I/O, and a Coprocessor with IBC, DBC, Coprocessor Memory and MALU83.]


HW/SW Partitioning

TYPE IV: TYPE I + Coprocessor Memory & μ-code RAM

[Figure: block diagram. Blocks: Main CPU, Program ROM, SRAM, Instruction Bus (32-bit instructions), Data Bus (32-bit data), Memory Mapped I/O, and a Coprocessor with IBC, DBC, Coprocessor Memory, μ-code RAM, FSM and MALU83.]


HW/SW Partitioning

Co-design flow with GEZEL

[Figure: flow chart with the following boxes: C/C++ codes for PKCs; Partitioning of functions; C/C++ codes & H/W behavior blocks w/interface; GEZEL FDL codes; C/C++ codes w/physical memory map; Cross compile; Synthesis; Program codes; VHDL codes; Cycle-true sim. (GEZEL). The two targets are labelled ARM (SW) and Co-processor (HW).]


HW/SW Partitioning

Result: Vertical Exploration of System

  HECC performance for different HW/SW partitioning
  (Performance: Point/Divisor multiplication)

[Figure: stacked bar chart. X axis: System Configuration (TYPE I to TYPE IV). Y axis: Required Clock Cycles [K]. Series: I/O Transfer Overhead + Others, Coprocessor Data Memory, Datapath. Data labels as extracted: 38, 38, 67, 67, 67, 67, 0, 0, 187, 2,859, 0, 2,672.]

Coprocessor Configuration   μ-code RAM   Data Mem.
TYPE I
TYPE II                     X
TYPE III                                 X
TYPE IV                     X            X


Superscalar Coprocessor

Proposed Hierarchy (2)

  Multiple Modular Arithmetic Logic Units (MALUs) in the coprocessor

[Figure: two hierarchies side by side. One lists Point/Divisor Multiplication, Point/Divisor Addition, Point/Divisor Doubling, a single Finite Field Operation (e.g. AB+C mod P) and Finite Field Inversion. The other lists the same upper levels and Finite Field Inversion with four parallel Finite Field Operation blocks (each e.g. AB+C mod P).]



Superscalar Coprocessor

Parallel Processing Architecture (TYPE IV-based)

[Figure: block diagram. Blocks: Main CPU, Program ROM, SRAM, Instruction Bus (32-bit instructions), Data Bus (32-bit data), Memory Mapped I/O, and a Coprocessor with IBC, IQB, DBC, FSM, μ-code RAM, Coprocessor Memory and four MALU83 units. A "Buffer Full" signal is also shown.]


Superscalar Coprocessor

Horizontal Exploration of System

  Performance of ECC and HECC

[Figure: stacked bar chart. X axis: Coprocessor Configuration. Y axis: Required Clock Cycles [K]. Series: Coprocessor Data Memory, Datapath. Configuration labels: 1xMALU83 (twice), 2xMALU83, 3xMALU83, 4xMALU83, 1xMALU163, 2xMALU163; each bar is marked HECC or ECC, and the bars are grouped by operation form (AB+C and A(B+D)+C). Data labels as extracted: 67, 58, 30, 36, 22, 20, 20, 38, 41, 25, 13, 22, 22, 8.]

Results

Performance for ECC over GF(2^163)

  Fastest of the three
  x1.8 speed-up by 2-way superscaling (ILP D P=6) with A(B+D)+C
  Still more improvement is possible by adding MALUs

[Figure: performance chart with series AB+C and A(B+D)+C.]


Results

Performance of HECC over GF(2^83)

  Faster than ECC over a composite field
  x2.7 speed-up by 4-way superscaling (ILP D P=5) with A(B+D)+C
  Less improvement as the number of MALUs increases

[Figure: performance chart with series AB+C and A(B+D)+C.]


Results

Performance for ECC over GF((2^83)^2)

  Slowest of the three
  x2.5 speed-up by 4-way superscaling (ILP D P=6) with A(B+D)+C
  Less improvement as the number of MALUs increases

[Figure: performance chart with series AB+C and A(B+D)+C.]


Results

Comparison of ECC/HECC implementations on FPGAs

[11] T. Wollinger, PhD thesis, 2004.
[13] G. Orlando and C. Paar, CHES 2000.
[14] N. Gura et al., CHES 2002.
[29] N. A. Saqib et al., International Journal of Embedded Systems, 2005.



Conclusions

  Performance improvement / comparison
    ECC was improved by a factor of 1.8 (2-way)
    HECC (genus 2) was improved by a factor of 2.7 (4-way)
    ECC over a composite field was improved by a factor of 2.5 (4-way)
    A(B+D)+C offers better performance than AB+C
    ECC is the fastest in this case study

  Programmability & flexibility
    Supports three different curve-based cryptosystems over a binary field
    Arbitrary irreducible polynomial
    Field size up to 332 bits by using 4xMALU83


Thank you!


Parallel issue of instructions

Case of using 4 MALUs

[Figure: pipeline timing diagram against clock cycles for MALU#0 to MALU#3. Each instruction shows an IF/D stage, operand reads R0 to R3, an EX stage and a write stage (W0 to W3). Cycle annotations as extracted: 1, 4(3*), k/d, 4.]

  IF/D : Instruction Fetch & Decode
  R_ : Read operands (dependent on the type of operation)
  EX : Execution (dependent on MALU configuration, k & d)
  W_ : Write (dependent on # of instructions issued in parallel)


Parallel issue of instructions

Out-of-order Execution

  Check RAW (Read After Write Dependency) for in-/out-of-order execution
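
A small sketch of the RAW check, assuming each instruction is a (destination, sources) pair and that every fixed-cycle operation occupies one issue slot on one MALU. The register names, the example program and the `schedule` function are hypothetical. The model also assumes each register is written at most once and defined before use, so RAW is the only hazard that needs checking.

```python
def schedule(program, n_malus, out_of_order):
    """Group instructions into issue slots of at most n_malus instructions.

    An instruction is ready when every source register it reads was either
    never written by the program (an input) or written in an earlier slot.
    """
    producer = {dest: i for i, (dest, _) in enumerate(program)}
    done = set()                      # instructions finished in earlier slots
    pending = list(range(len(program)))
    slots = []
    while pending:
        slot = []
        for i in pending:
            if len(slot) == n_malus:
                break
            _, srcs = program[i]
            ready = all(s not in producer or producer[s] in done for s in srcs)
            if ready:
                slot.append(i)
            elif not out_of_order:
                break                 # in-order issue stalls at the first RAW hazard
        if not slot:
            raise ValueError("no instruction can issue: register used before definition")
        done.update(slot)
        pending = [i for i in pending if i not in slot]
        slots.append(slot)
    return slots


# Hypothetical dataflow: instruction 1 reads t0, instruction 4 reads t1, t2, t3.
program = [
    ("t0", ("a", "b", "c")),
    ("t1", ("t0", "b", "c")),
    ("t2", ("a", "c", "d")),
    ("t3", ("b", "d", "a")),
    ("t4", ("t1", "t2", "t3")),
]

print(schedule(program, n_malus=2, out_of_order=False))  # [[0], [1, 2], [3], [4]]
print(schedule(program, n_malus=2, out_of_order=True))   # [[0, 2], [1, 3], [4]]
```

In this example out-of-order issue fills the second MALU in the first slot with an independent instruction, reducing four slots to three.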