Systolic Design in Cryptography

jabgoldfishΤεχνίτη Νοημοσύνη και Ρομποτική

19 Οκτ 2013 (πριν από 3 χρόνια και 8 μήνες)

63 εμφανίσεις

Systolic Architectures

Nilda Quintana

Ridhanie Suryawan

Tom Tichy

Systolic Arrays


A class of parallel processors, named after the data flow
through the array, analogous to the rhytmic flow of blood
through human arteries after each heartbeat.


The concept of systolic processing combines a highly
parallel array of identical processor may span several
integrated circuit chips.


A set of simple Processing Elements with regular and local
connections takes external inputs and processes them in a
predetermined manner in a pipelined fashion.

Systolic Arrays


Replace single processor with array of regular
Processing Elements.


Orchestrate data flow for high throughput with less
Memory access.

M

PE

M

PE

PE

PE

Generic Systolic Arrays


In
Generic Systolic Arrays; processing units are connected
in linear array. Each cell is connected with its immediate
neighbours; each cell can exchange data and results with the
outside. Furthermore, each cell can receive data from the top
and transmit result to the bottom.
(The WARP machine can be
viewed as GSA of size 10)


It is also to possible to obtained 2
-
dimensional arrays by
stacking several linear arrays and adequately connecting
channels together. Other topologies (Ring, Cylinder, Torrus)
can be obtained in a similar way.

Generic Systolic Arrays


Cell P
i

admits three input channels; P
i

can receive data from P
i
-
1

through
channel LR
i

(Left to Right), from P
i+1

through RL
i
, and from the outside
through U
i
(Up).


P
i
has also three output channels, which allow transmission of results to
the left and right neighbours and to the outside.

LR
i+1

P
i

D
i

U
i

RL
i
-
1

RL
i

LR
i





P
n

D
n

U
n

RL
n
-
1

RL
n

LR
n

LR
n+1

P
1

D
1

U
1

RL
0

RL
1

LR
1

LR
2

Generic Systolic Arrays

B
[i]

A
[i]

C
[i]

E
[i]

G
[i]

F
[i]

D
i

U
i

RL
i

LR
i+1

LR
i

RL
i
-
1

Generic Systolic Arrays


The internal memory of cell P
i

contains six communica
-
tions
registers, denoted A
[i]
, B
[i]
, C
[i]
, E
[i]
, F
[i]

and G
[i]
. The
remaining part of the memory is denoted M
[i]
; its size is
independent from the size
n
of the network.


The program executed by every cell is a
loop
, whose body is
finite, partially ordered by set of statement that specify three
kinds of actions:

1.
Receiving values (data) from some input channels,

2.
Performing computations within the internal memory,

3.
Transmitting values (results) to output channels.

Generic Systolic Arrays


The processing units acts with high synchronism (often
provided by a global, broadcasted clock). But this can lead
to implementation problems.


Another solution is the synchronization by communication,
named
rendezvous.
Value can be transmitted from a cell to
another
only

when both cells are prepared to do so.


During
communication

phase, only
input registers
A, C and
G are changed; during
computation

phase, only
storage
register

M and the
output registers
B, E and F are changed.

Generic Systolic Arrays


Communication Scheme: an order between the internal
computation, the reception of data, and the transmission of
results.

communication ; computation


Other communication scheme with less restrictive:

input ; computation ; output


The internal computation phase is not restricted; it can be
modelled by function

. More specifically, the computation
phase consists in executing the assignment (
F,E,B,M
) :=


(A,C,G,M).


Space
-
Time Methodology


The algorithms to be mapped is specified as a set of
equations attached to integral points, and mapped on the
architecture using a regular time and space allocation
scheme.


Four main steps using this methodology:


The index localization (computations to be performed are defined by
equation).


Uniformization (indicating where data need to be and where the results are
being produced).


Space
-
Time Transformation (a time and a processor allocations functions
are being chosen).


Interface Design (the loading of the data and the unloading of the results
are considered).


Space
-
Time Methodology


The drawbacks of Space Time Methodology:


The algorithm must be specified as a set of recurrence
equation, or nested do
-
loop instructions. Difficult to
implement.


Location in space is associated to each index value (well
suited for systhesis of regular arrays in which data will
be introduced in a regular order). Eliminates possibility
of synthesizing with other architectures.


Synthesis of initializations; initial algorithm are slightly
modified.


Program
-
Oriented
Methodology


Various attemps have been made to overcome the drawbacks
of the space
-
time transformation methods. Those based on
viewing systolic design as program design.


A program
-
oriented methodology to develop more adequate
method of systolic design:


The specifications of the system are formalized with an input and an
output predicates, just as in structured programming.


A couple (program and invariant) is deduced in an incremental way
from the specification.


The sequential while
-
program is further transformed into a systolic
program; the statement of this program are concurrent assignments.


The systolic array is (easily) obtained from the systolic program.

Systolic Design in Cryptography

Cryptography


Background information


Cryptography


disguising messages so only
certain people can read them


Cryptography will play an increasing role in
future computer and communication system

Types of cryptography


Symmetric


same key used to encrypt and decrypt


Fast algorithm


Requires a one
-
time use of a secret channel for key
exchange


Asymmetric (public key)


different keys to encrypt
and decrypt the message


More secure


Slower algorithm

Cryptography

RSA algorithm


Today’s systems (SSL, SSH, S/MIME,etc) are
based on public key cryptosystem such as RSA


Proposed in 1978


Based on difficulty of factoring large integers


Three keys


N, E and D


N = (P X Q)
; P, Q
-

large prime numbers


Choose a large number
E

such that it is coprime with
(P
-
1)x(Q
-
1)



Cryptography

RSA algorithm


Compute D such that

E x D = 1 mod (P
-
1)x(Q
-
1)


The pair
(E,N)

is the public key


D is the private key


To encrypt a message M:


C = M
E

mod (N)


To decrypt C back to M:


M = C
D
mod (N)


Montgomery algorithm


Computations with integers of length > 500 are
very complex and time
-
consuming


Montgomery algorithm(1985)



reduces modular exponentiation to a series of modular
multiplications


Compute
(A


B) mod N

without trial division (
q


(A


B)
/

N)


Need for faster speed (increasing bit size) and
stronger security leads to hardware implementation


Systolic Implementation


Goal: to create fast, secure and efficient way
to perform modular multiplication


Number representation:
X

=

i
=0

x
i
r
i



r

= 2
k

is the radix



x
i

is the
i
th digit



n



max no. of digits in any number



n
-
1

Systolic architectures


Bit
-
serial architecture


processes one input bit during a clock cycle. Is well suited
for low speed applications.


Bit
-
parallel architecture


processes one input word during a clock cycle. Well suited
for high
-
speed applications, but is area
-
inefficient


Digit
-
serial architecture


attempts to utilize the best of both worlds. The speed of bit
-
parallel and the relative simplicity of bit
-
serial.

Example: compute
A x B


Use
n

digit multipliers to form
a
i

B

and
add to a partial product
P
:



P := 0

;



For

i := n

1

downto
0

do




P := r

P + a
i

B


{ Result:
P

=
A

B

}

Example: compute
A x B


Bit
-
serial


addition of
a
i

B

over
n

cycles



carry

P := P + a
i

B
(bit
-
serial)

time

j+1


time

j


time

j
-
1


carry

carry

carry

p
j+1


p
j


p
j
-
1


p
j+1
b
j+1


p
j
b
j


p
j
-
1
b
j
-
1


a
i


a
i


a
i


a
i


Cell

j

computes

a
i
b
j

in

cycle

i+j



Example: compute
A x B


Bit
-
Parallel

add
a
i
x

B

in one clock cycle

P := P + a
i

B
(bit
-
parallel)

cell

j


cell

j
-
1


cell

j+1


p
j,c


p
j+1,c


p
j+1,s


p
j,s


p
j
-
1,c


p
j
-
1,s


p
j+1,s


p
j,c


p
j,s


p
j
-
1,c


p
j
-
1,s


b
j+1


b
j


b
j
-
1


a
i


a
i


Cell

j

computes

a
i
b
j

in

cycle

i



Montgomery Algorithm

P
0
:=0;

q
0
:=0;

For
i:=0

to
n

do

P
i+1
:=[(P
i

+ q
i
N)/2) + b
i
A;

q
i+1
:=P
i

mod 2;

End

P=Pn;

The systolic array

PE for Montogmery


At
i
th

step, the term A
i
B+Q
i
N
is computed in the upper part.


Results are shifted,
accumulated in the lower part


Calculations in first
n
cycles


Output in next
n

cycles


Zero bit interleaving enables
synchronization with the next
iteration of the algorithm


Digit
-
serial array

Digit
-
serial PE

Digit
-
serial implementation


Width of processing elements is
u


Only need
n/u

instead of
n

processing elements


N_reg

(
u
bits): storage of the modulus


B
-
reg
(
u
bits): storage of the B multiplier


B+N
-
reg
(
u

bits): storage of intermediate result
B+N


Add
-
reg
(
u
+1 bits): storage of intermediate results


Control
-
reg
(3 bits): multiplexer control/clock enable


Result
-
reg
(
u
bits): storage of the result

Conclusion


The need for inexpensive hardware device
for real
-
time RSA encryption/decryption will
be rising.


Systolic computer architecture may offer an
elegant and flexible solution.

Artificial Neural Networks
Introduction


The brain processes information extremely fast and
accurately. To top that, the trained network still
works even if certain neurons fail. For example, it
is amazing how one can recognize a friend in a
noisy football stadium. This ability of the brain
(signal processing) to recognize information,
literally buried in noise, and retrieve it correctly is
one of the incredible processes that we wish that
could be implement in machines. If a machine could
be built with only 0.1 percent of the performance
capacity of the brain we would already have an
extraordinary information and controlling machine.

Artificial Neural Networks
Introduction


A neural network is a processing device, either an algorithm, or actual
hardware, whose design was motivated by the design and functioning of
human brains and its components.


Neurons have elementary computational skills, but they operate in a
highly parallel fashion.


An artificial neural network (ANN) can be described as a network of
many very simple processors called "units", each possibly having a
small amount of local memory or “weight”. The units are connected by
unidirectional communication channels called "connections", which
carry numeric, as opposed to symbolic, data. The units operate only on
their local data and on the inputs they receive via the connections.


Existing neural networks offer improved performance over conventional
technologies in areas which includes: Machine Vision, Robust Pattern
Detection, Virtual Reality, Data Compression, Data Mining, Text
Mining, Artificial Life and more
.

Artificial Neural Networks

Systolic architectures


Neural networks are non
-
linear static or dynamical systems
that learn to solve problems from examples. Most of the
learning algorithms require a lot of computing power and,
therefore, could benefit from fast dedicated hardware. One
of the most common architectures used for this special
-
purpose hardware is the systolic array.


There is an underlying similarity between the simple, special
purpose computational units of a neural network, and the
dedicated processing elements of a systolic array that apply
a predefined computation on data elements as the data are
pumped through the array.

Artificial Neural Networks

Systolic Arrays


S.Y. Kung and J.N. Hwang were the first researchers to publish a neural
model using systolic architecture by devising a scheme to map neural
algorithms into an iterative matrix operation. Several matrices mapping
models have been proposed since and as technology improves so do the
models.


These models have shown that the proper ordering of the elements of the
weight matrix makes it possible to design a cascaded dependency graph
for consecutive matrix
-
vector multiplication.


As the research documents accumulates it has become apparent that
different systolic architectures have become useful to a very narrow and
specific implementation. (e.g. systolic ring arrays and signal processing)

Artificial Neural Networks

Systolic Arrays


Since the early 80s numerous neural algorithms
have been mapped into systolic arrays and have
been created for systolic arrays. Among the later
we have the one
-
dimensional systolic algorithm
created in the 1980s. And the new two
-
dimensional
systolic algorithm implemented in ten years later.


Feature: uses horizontal and vertical product
-
sum
operations for the forward and backward path
computing of a multi
-
layered network.

Artificial Neural Networks

Systolic Arrays:
Multi
-
layered feed
-
forward neural network

Artificial Neural Networks

Systolic Arrays


backward path 2D

Artificial Neural Networks

Systolic Arrays


forward path 2D