Systolic Architectures
Nilda Quintana
Ridhanie Suryawan
Tom Tichy
Systolic Arrays
A class of parallel processors, named after the flow of data
through the array, analogous to the rhythmic flow of blood
through human arteries after each heartbeat.
The concept of systolic processing combines a highly
parallel array of identical processors that may span several
integrated circuit chips.
A set of simple Processing Elements with regular and local
connections takes external inputs and processes them in a
predetermined manner in a pipelined fashion.
Systolic Arrays
Replace the single processor with an array of regular
Processing Elements.
Orchestrate data flow for high throughput with fewer
memory accesses.
[Figure: memory units (M) feeding a chain of Processing Elements (PE)]
Generic Systolic Arrays
In a Generic Systolic Array, processing units are connected
in a linear array. Each cell is connected with its immediate
neighbours; each cell can exchange data and results with the
outside. Furthermore, each cell can receive data from the top
and transmit results to the bottom. (The WARP machine can be
viewed as a GSA of size 10.)
It is also possible to obtain two-dimensional arrays by
stacking several linear arrays and adequately connecting the
channels together. Other topologies (Ring, Cylinder, Torus)
can be obtained in a similar way.
Generic Systolic Arrays
Cell P_i admits three input channels; P_i can receive data from
P_{i-1} through channel LR_i (Left to Right), from P_{i+1} through
RL_i, and from the outside through U_i (Up).
P_i also has three output channels, which allow transmission of results to
the left and right neighbours and to the outside.
[Figure: linear array P_1 … P_n; each cell P_i with channels LR_i, LR_{i+1}, RL_{i-1}, RL_i, U_i and D_i]
Generic Systolic Arrays
[Figure: cell P_i with its six communication registers A[i], B[i], C[i], E[i], F[i], G[i] and channels U_i, D_i, LR_i, LR_{i+1}, RL_{i-1}, RL_i]
Generic Systolic Arrays
The internal memory of cell P_i contains six communications
registers, denoted A[i], B[i], C[i], E[i], F[i] and G[i]. The
remaining part of the memory is denoted M[i]; its size is
independent of the size n of the network.
The program executed by every cell is a loop whose body is a
finite, partially ordered set of statements that specify three
kinds of actions:
1. Receiving values (data) from some input channels,
2. Performing computations within the internal memory,
3. Transmitting values (results) to output channels.
Generic Systolic Arrays
The processing units act with high synchronism (often
provided by a global, broadcast clock). But this can lead
to implementation problems.
Another solution is synchronization by communication,
named rendezvous: a value can be transmitted from one cell to
another only when both cells are prepared to do so.
During the communication phase, only the input registers A, C and
G are changed; during the computation phase, only the storage
register M and the output registers B, E and F are changed.
Generic Systolic Arrays
Communication scheme: an order between the internal
computation, the reception of data, and the transmission of
results:
communication ; computation
Another, less restrictive communication scheme:
input ; computation ; output
The internal computation phase is not restricted; it can be
modelled by a function f. More specifically, the computation
phase consists in executing the assignment (F, E, B, M) := f(A, C, G, M).
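As a loose illustration (the mapping of F, E and B to particular output channels, and the Python framing, are assumptions not fixed by the slides), one cell's input ; computation ; output step might be sketched as:

```python
def cell_step(state, f, lr_in, rl_in, u_in):
    """One loop iteration of cell P_i under the scheme
    input ; computation ; output (register names A..G, M follow the slides).

    state holds the storage register M; f is the cell's computation
    function, mapping (A, C, G, M) to new (F, E, B, M)."""
    # Communication phase: only input registers A, C, G change.
    A, C, G = lr_in, rl_in, u_in          # receive from left, right, above
    # Computation phase: only M and output registers B, E, F change.
    F, E, B, M = f(A, C, G, state["M"])
    state["M"] = M
    # Transmit results (channel assignment below is an assumption).
    return {"LR_out": F, "RL_out": E, "D_out": B}
```

For instance, with f = lambda A, C, G, M: (A, C, G, M + A) the cell forwards its inputs unchanged while accumulating the left input into M.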
Space-Time Methodology
The algorithm to be mapped is specified as a set of
equations attached to integral points, and mapped onto the
architecture using a regular time and space allocation
scheme.
Four main steps of this methodology:
The index localization (the computations to be performed are defined by
equations).
Uniformization (indicating where data need to be and where the results are
produced).
Space-Time Transformation (a time allocation function and a processor
allocation function are chosen).
Interface Design (the loading of the data and the unloading of the results
are considered).
Space-Time Methodology
The drawbacks of the Space-Time Methodology:
The algorithm must be specified as a set of recurrence
equations, or nested do-loop instructions. Difficult to
implement.
A location in space is associated with each index value (well
suited for the synthesis of regular arrays in which data are
introduced in a regular order). This eliminates the possibility
of synthesizing other architectures.
Synthesis of initializations: the initial algorithms are slightly
modified.
Program-Oriented Methodology
Various attempts have been made to overcome the drawbacks
of the space-time transformation methods. Those discussed here
are based on viewing systolic design as program design.
A program-oriented methodology to develop a more adequate
method of systolic design:
The specifications of the system are formalized with an input and an
output predicate, just as in structured programming.
A couple (program and invariant) is deduced in an incremental way
from the specification.
The sequential while-program is further transformed into a systolic
program; the statements of this program are concurrent assignments.
The systolic array is (easily) obtained from the systolic program.
Systolic Design in Cryptography
Cryptography
Background information
Cryptography – disguising messages so only
certain people can read them.
Cryptography will play an increasing role in
future computer and communication systems.
Types of cryptography:
Symmetric – the same key is used to encrypt and decrypt.
Fast algorithm.
Requires a one-time use of a secret channel for key
exchange.
Asymmetric (public key) – different keys encrypt
and decrypt the message.
More secure.
Slower algorithm.
Cryptography
RSA algorithm
Today's systems (SSL, SSH, S/MIME, etc.) are
based on public-key cryptosystems such as RSA.
Proposed in 1978.
Based on the difficulty of factoring large integers.
Three keys – N, E and D:
N = P × Q; P, Q are large prime numbers.
Choose a large number E such that it is coprime with
(P−1)×(Q−1).
Cryptography
RSA algorithm
Compute D such that
E × D ≡ 1 (mod (P−1)×(Q−1)).
The pair (E, N) is the public key.
D is the private key.
To encrypt a message M: C = M^E mod N.
To decrypt C back to M: M = C^D mod N.
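The key setup and the two formulas above can be checked end-to-end with toy primes (far too small for real security; pow(E, -1, phi) needs Python 3.8+):

```python
# Toy RSA with deliberately tiny primes -- illustration only.
P, Q = 61, 53
N = P * Q                    # modulus N = P x Q
phi = (P - 1) * (Q - 1)      # (P-1)(Q-1)
E = 17                       # public exponent, coprime with phi
D = pow(E, -1, phi)          # private exponent: E*D = 1 mod phi
M = 65                       # message, must satisfy M < N
C = pow(M, E, N)             # encrypt: C = M^E mod N
assert pow(C, D, N) == M     # decrypt: M = C^D mod N
```

Real keys use primes hundreds of digits long, which is what motivates the hardware arithmetic discussed next.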
Montgomery algorithm
Computations with integers of length > 500 bits are
very complex and time-consuming.
The Montgomery algorithm (1985)
reduces modular exponentiation to a series of modular
multiplications.
It computes (A × B) mod N without the trial division
q = ⌊(A × B) / N⌋.
The need for faster speed (increasing bit sizes) and
stronger security leads to hardware implementations.
Systolic Implementation
Goal: to create a fast, secure and efficient way
to perform modular multiplication.
Number representation:
X = Σ_{i=0}^{n−1} x_i · r^i
r = 2^k is the radix
x_i is the i-th digit
n is the maximum number of digits in any number
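The representation is easy to check numerically; a short sketch (helper name made up) splitting a number into radix-2^k digits:

```python
def to_digits(X, k, n):
    """Split X into n radix-(2**k) digits x_i, so X = sum of x_i * r**i."""
    r = 2 ** k
    return [(X >> (k * i)) & (r - 1) for i in range(n)]

# Example: 0xBEEF as four radix-16 digits, least significant first.
digits = to_digits(0xBEEF, 4, 4)
assert sum(d * 16**i for i, d in enumerate(digits)) == 0xBEEF
```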
Systolic architectures
Bit-serial architecture:
processes one input bit during a clock cycle. Well suited
for low-speed applications.
Bit-parallel architecture:
processes one input word during a clock cycle. Well suited
for high-speed applications, but area-inefficient.
Digit-serial architecture:
attempts to utilize the best of both worlds: the speed of
bit-parallel and the relative simplicity of bit-serial.
Example: compute A × B
Use n digit multipliers to form a_i·B and add to a partial product P:
P := 0;
for i := n−1 downto 0 do
    P := r·P + a_i·B
{ Result: P = A·B }
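The loop translates directly into software (helper name made up; digits of A are scanned most significant first, as in the slide's downto loop):

```python
def digit_mul(A, B, k, n):
    """Compute A*B with the slide's loop P := r*P + a_i*B, where a_i are
    the n radix-(2**k) digits of A, scanned from most significant down."""
    r = 2 ** k
    P = 0
    for i in range(n - 1, -1, -1):       # for i := n-1 downto 0
        a_i = (A >> (k * i)) & (r - 1)   # i-th digit of A
        P = r * P + a_i * B
    return P                             # Result: P = A*B

assert digit_mul(123, 45, 4, 2) == 123 * 45
```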
Example: compute
A x B
Bit-serial – addition of a_i·B over n cycles:
P := P + a_i·B (bit-serial)
[Figure: cells at times j−1, j, j+1; each cell receives p_j and b_j, the common multiplier bit a_i, and a rippling carry]
Cell j computes a_i·b_j in cycle i+j.
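In software the cell-per-bit timing collapses to a sequential ripple; a rough rendering of one such pass (helper name made up, nbits must be large enough to hold the result):

```python
def bitserial_accumulate(P, a_i, B, nbits):
    """Simulate one bit-serial pass P := P + a_i*B, one bit per cycle
    (LSB first). Cell j handles bit j; the carry ripples to the next cycle."""
    carry = 0
    out = 0
    for j in range(nbits):                  # cell j fires at cycle i+j
        p_j = (P >> j) & 1
        b_j = (B >> j) & 1
        s = p_j + (a_i & 1) * b_j + carry   # cell j computes a_i * b_j
        out |= (s & 1) << j                 # sum bit stays at position j
        carry = s >> 1                      # carry moves to the next cycle
    return out
```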
Example: compute
A x B
Bit-parallel – add a_i × B in one clock cycle:
P := P + a_i·B (bit-parallel)
[Figure: cells j−1, j, j+1 operating in parallel; each cell j takes b_j, the broadcast bit a_i, and exchanges carry/sum pairs p_{j,c}, p_{j,s} with its neighbours]
Cell j computes a_i·b_j in cycle i.
Montgomery Algorithm
P_0 := 0; q_0 := 0;
for i := 0 to n do
    P_{i+1} := (P_i + q_i·N)/2 + b_i·A;
    q_{i+1} := P_{i+1} mod 2;
end
P = P_{n+1}
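A direct Python transcription of this recurrence (with the quotient bit taken from the parity of the updated P, so the halving stays exact; N must be odd, as it always is for RSA). The result carries the usual Montgomery factor 2^(−n):

```python
def mont_mul(A, B, N, n):
    """Montgomery multiplication following the slide's recurrence:
    P_{i+1} := (P_i + q_i*N)/2 + b_i*A, with q the parity of P.
    Returns P with P = A*B*2^(-n) (mod N); N must be odd."""
    P, q = 0, 0
    for i in range(n + 1):               # for i := 0 to n
        b_i = (B >> i) & 1               # i-th bit of B (0 past its length)
        P = (P + q * N) // 2 + b_i * A   # exact: q = P mod 2 and N is odd
        q = P % 2
    return P

# Toy check against the definition (pow with a negative exponent
# computes the modular inverse; Python 3.8+):
A, B, N, n = 35, 61, 97, 8
assert mont_mul(A, B, N, n) % N == (A * B * pow(2, -n, N)) % N
```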
The systolic array
PE for Montgomery
At the i-th step, the term A_i·B + Q_i·N
is computed in the upper part of the PE.
The results are shifted and
accumulated in the lower part.
Calculations take place in the first n cycles;
output follows in the next n cycles.
Zero-bit interleaving enables
synchronization with the next
iteration of the algorithm.
Digit-serial array
Digit-serial PE
Digit-serial implementation
The width of the processing elements is u.
Only n/u processing elements are needed instead of n.
N_reg (u bits): storage of the modulus
B-reg (u bits): storage of the B multiplier
B+N-reg (u bits): storage of the intermediate result B+N
Add-reg (u+1 bits): storage of intermediate results
Control-reg (3 bits): multiplexer control / clock enable
Result-reg (u bits): storage of the result
Conclusion
The need for inexpensive hardware devices
for real-time RSA encryption/decryption will
keep rising.
Systolic computer architecture may offer an
elegant and flexible solution.
Artificial Neural Networks
Introduction
The brain processes information extremely fast and
accurately. On top of that, a trained network still
works even if certain neurons fail. For example, it
is amazing how one can recognize a friend in a
noisy football stadium. This ability of the brain
(signal processing) to recognize information
literally buried in noise, and retrieve it correctly, is
one of the incredible processes that we wish could
be implemented in machines. If a machine could
be built with only 0.1 percent of the performance
capacity of the brain, we would already have an
extraordinary information and control machine.
Artificial Neural Networks
Introduction
A neural network is a processing device, either an algorithm or actual
hardware, whose design was motivated by the design and functioning of
the human brain and its components.
Neurons have elementary computational skills, but they operate in a
highly parallel fashion.
An artificial neural network (ANN) can be described as a network of
many very simple processors called "units", each possibly having a
small amount of local memory or "weight". The units are connected by
unidirectional communication channels called "connections", which
carry numeric, as opposed to symbolic, data. The units operate only on
their local data and on the inputs they receive via the connections.
Existing neural networks offer improved performance over conventional
technologies in areas that include: Machine Vision, Robust Pattern
Detection, Virtual Reality, Data Compression, Data Mining, Text
Mining, Artificial Life and more.
Artificial Neural Networks
Systolic architectures
Neural networks are non-linear static or dynamical systems
that learn to solve problems from examples. Most of the
learning algorithms require a lot of computing power and,
therefore, could benefit from fast dedicated hardware. One
of the most common architectures used for this
special-purpose hardware is the systolic array.
There is an underlying similarity between the simple,
special-purpose computational units of a neural network, and the
dedicated processing elements of a systolic array that apply
a predefined computation on data elements as the data are
pumped through the array.
Artificial Neural Networks
Systolic Arrays
S.Y. Kung and J.N. Hwang were the first researchers to publish a neural
model using a systolic architecture, by devising a scheme to map neural
algorithms into an iterative matrix operation. Several matrix-mapping
models have been proposed since, and as technology improves so do the
models.
These models have shown that the proper ordering of the elements of the
weight matrix makes it possible to design a cascaded dependency graph
for consecutive matrix-vector multiplication.
As the research documents accumulate, it has become apparent that
different systolic architectures are each useful for a very narrow and
specific implementation (e.g. systolic ring arrays and signal processing).
Artificial Neural Networks
Systolic Arrays
Since the early 80s numerous neural algorithms
have been mapped onto systolic arrays, and new
algorithms have been created for systolic arrays.
Among the latter we have the one-dimensional
systolic algorithm created in the 1980s, and the
newer two-dimensional systolic algorithm
implemented ten years later.
Feature: uses horizontal and vertical product-sum
operations for the forward and backward path
computing of a multi-layered network.
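The horizontal product-sum of the forward path is, at its core, a matrix-vector product. A minimal pure-Python simulation of how a linear systolic array would accumulate it (this is an illustrative sketch, not the Kung-Hwang mapping itself; names and weights are made up):

```python
def systolic_matvec(W, x):
    """Simulate a linear systolic array computing y = W @ x for one
    network layer: cell i holds row i of W; input x_j is pumped through
    on beat j while every cell does one multiply-accumulate."""
    n_out, n_in = len(W), len(x)
    y = [0.0] * n_out
    for j in range(n_in):            # one beat of the array per input digit
        for i in range(n_out):       # all cells fire in parallel in hardware
            y[i] += W[i][j] * x[j]   # product-sum step of cell i
    return y

# One forward step of a tiny two-neuron layer:
W = [[0.5, -1.0], [2.0, 0.25]]
x = [4.0, 8.0]
assert systolic_matvec(W, x) == [0.5*4 - 1.0*8, 2.0*4 + 0.25*8]
```

In hardware the inner loop is what the array does in a single beat; the sequential outer loop is the only part that consumes time.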
Artificial Neural Networks
Systolic Arrays: Multi-layered feed-forward neural network
Artificial Neural Networks
Systolic Arrays – backward path 2D
Artificial Neural Networks
Systolic Arrays – forward path 2D