A High Performance Fully Pipelined Architecture of MLP Neural Networks in FPGA

Antonyus Ferreira, Edna Barros
Informatics Center, Federal University of Pernambuco
Recife, Brazil
{apaf, ensb}@cin.ufpe.br
Abstract—This paper presents the generalized architecture of an FPGA-based implementation of a Multilayer Perceptron (MLP) Artificial Neural Network (ANN). The proposed architecture aims to allow the implementation of large ANNs in FPGA while addressing area consumption, interconnection resources, and the area/performance trade-off. Among the results is the use of ⌊log₂ m⌋ adders for an ANN layer with m inputs. An ANN with a 256-10-10 topology reached a 36x speed-up compared to a conventional software implementation.
Keywords—FPGA, MLP, ANN, hardware architecture, high performance, reconfigurable system.
I. INTRODUCTION
Artificial Neural Networks (ANNs) are used in many areas, such as signal processing systems, medical image analysis, time series forecasting, and robot vision. In some of these areas, the amount of data to be processed is very large, or the required response time is very short, which can make portable solutions based on conventional software designs unviable. Proposing faster implementations of ANNs is therefore reasonable. In the literature, many kinds of implementations can be found, from digital hardware [2], [7], [8], [9], and [10], through GPUs [15], to analog systems. Another characteristic of ANNs that motivates parallel implementations is the intrinsic data parallelism of the model.
This work is aligned with the digital hardware solutions, motivated by the increasing performance obtained by FPGAs year after year. A great example of the potential of FPGAs is the CHREC (Center for High-Performance Reconfigurable Computing) initiative [17] to create the NOVO-G, which links 96 top-end Altera Stratix III FPGAs in 24 servers with 576 GB of memory and a 20 Gb/s InfiniBand interconnect.
First, Section II presents some basic concepts concerning ANNs; Section III shows some related works; and Section IV describes in detail the architecture that is the goal of this work. Section V discusses problems commonly found in FPGA implementations and how we address those issues. Finally, Sections VI and VII present the results and the conclusions, respectively.
II. MLP ARTIFICIAL NEURAL NETWORKS
All animal brains are composed of billions of cells interconnected in a giant net, and ANNs are computational models whose organization and architecture are inspired by the structure of animal brains. The model inherits from its biological counterpart its parallel and distributed nature.

ANNs can be found in many areas such as signal processing, medical image analysis, diagnostic systems, and time series forecasting. Some desired properties [1] of ANNs are:
a. Learning through examples (non-parametric statistical inference)
b. Adaptability
c. Generalization
d. Fault tolerance
A. Artificial Neuron
An artificial neuron is the basic unit of a neural network architecture. A neuron's structure comprises:
a) an input set that receives the neuron's input signals;
b) a synaptic set whose intensity is represented by the associated weights;
c) an activation function that compares the inputs and their synaptic weights against the function threshold to define the neuron's output.

In Figure 1, each Wi represents the weight associated with each input Xi, and Φ is the activation function. The synaptic result u is given by the sum of products of the input vector by the weight vector, and the output by the computation of Φ(u).
Some commonly used activation functions are:
a) Step function: Φ(u) = 1 if u > 0; Φ(u) = 0 otherwise
b) Ramp function: Φ(u) = max{0.0, min{1.0, u + 0.5}}
c) Sigmoid function: Φ(u) = a / (1 + exp(−bu))
Figure 1. Artificial neuron components
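The neuron model above can be sketched in software as follows (a minimal illustration; the function names and the choice a = b = 1 for the sigmoid are ours, not from the paper):

```python
import math

def neuron_output(x, w, bias, phi):
    """Activation state u = sum of products of inputs by weights,
    with the bias modeled as an extra input fixed at -1; output = phi(u)."""
    u = sum(wi * xi for wi, xi in zip(w, x)) + (-1.0) * bias
    return phi(u)

# The three activation functions listed above (sigmoid with a = b = 1)
step = lambda u: 1.0 if u > 0 else 0.0
ramp = lambda u: max(0.0, min(1.0, u + 0.5))
sigmoid = lambda u: 1.0 / (1.0 + math.exp(-u))

print(neuron_output([0.5, -0.2], [0.8, 0.4], 0.1, sigmoid))
```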
III. RELATED WORKS
A. FPGA implementation of a face detector using neural networks [3]

In this work, Yongsoon Lee and Seok-Bum Ko implemented an MLP ANN using floating-point arithmetic [14], chosen for the dynamic range of this representation, together with an approximation of the sigmoid function. Their implementation of the sigmoid seems simplistic and uses too many floating-point computations. They obtained an fmax of 38 MHz. This work makes evident the utilization of ANNs in real-time applications, as does [5].
B. VANNGen: a Flexible CAD Tool for Hardware Implementation of Artificial Neural Networks [19]

This work proposes a generic generator of ANNs in FPGA. It uses fixed-point representation, LUT-based multipliers, and a LUT implementation of the activation function. The authors' architecture uses dedicated hardware for each neuron, and they do not show whether there is any pipeline structure, either for a single neuron or for the entire network. They validated their architecture using a network (topology 1:2:1) to approximate the sinusoid function. The performance reached 100 MHz on a Spartan 3 XC3S500 FPGA.
IV. PROPOSED ARCHITECTURE
A. Computation of the Activation State
As mentioned in Section II, each neuron computes the sum of products (also called the activation state) of the weights by the respective inputs. A layer of neurons computes a matrix product (n neurons, an m-dimensional input vector, plus the bias), given by:

C[n×1] = W[n×(m+1)] · X[(m+1)×1]

Each element of the matrix C is the result of the operation ci = ∑j wij·xj (j = 1, ..., m+1), which includes m sums of m+1 independent products for each of the n neurons.
So, if we could instantiate m+1 multipliers and m adders in a reconfigurable resource, the result would be computed in n clock cycles (after the pipeline is full).
Inspired by a high-performance matrix multiplication solution [18], we perform the matrix product using a column of W and a single element of X. This way, we need a single value of X at a time, and it is used only once. Thus, we first compute the independent products (W11·X1, ..., Wn1·X1) and then add them to the results of the second column's products, and so on. Figure 2 shows an example for a layer with 4 neurons and 4 inputs.
Figure 2. Architecture example of a layer with 4 inputs and 4 neurons
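In software, the column-streaming scheme of Figure 2 can be sketched as below (our illustration; the hardware uses one physical multiplier per datapath, unrolled here as a loop):

```python
def layer_activation_states(W, X):
    """Compute the activation states of a layer by streaming one element
    of X at a time, multiplying it by the corresponding column of W, and
    accumulating into n partial sums (one per neuron)."""
    n = len(W)       # number of neurons (rows of W)
    m = len(X)       # number of inputs (bias handled as one more input)
    acc = [0.0] * n
    for j in range(m):      # one X value per "time step"...
        xj = X[j]           # ...read once and then discarded
        for i in range(n):
            acc[i] += W[i][j] * xj
    return acc

# 4-input, 4-neuron example, as in Figure 2
W = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
X = [1, 0, -1, 2]
print(layer_activation_states(W, X))  # same result as the row-wise product
```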
The output of the multiplier passes through a shift register (SR) that aligns W11·X1 with W12·X2 in order to compute the sum of the products. This way, only the datapath is used to compute the activation state (additionally, a valid signal is issued together with the first input datum and carried along to signal to the next layer that the computation is done). So the multiplication takes n·m clock cycles (after the pipeline is full), using ⌊log₂ m⌋ adders and 1 multiplier.
Obviously, this datapath changes depending on the number of inputs. For example, if m is an odd number, we use the bias to make it even (introducing a −1 input and multiplying it by the bias).
So the computation rate of a layer is 1 result set every n·m clock cycles, or n·(m+1) if m is odd, and the rate of the entire network is given by the layer with the lowest rate.
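The rate expressions above can be checked with a small sketch (ours, not from the paper):

```python
def layer_cycles(n, m):
    """Clock cycles per result set for a layer with n neurons and m inputs:
    n*m if m is even, n*(m+1) if m is odd (bias used to pad the odd case)."""
    return n * m if m % 2 == 0 else n * (m + 1)

def network_cycles(topology):
    """Cycles per pattern for the whole pipeline: the slowest layer dominates.
    topology lists layer widths, e.g. [256, 10, 10]."""
    return max(layer_cycles(n, m) for m, n in zip(topology, topology[1:]))

print(network_cycles([256, 10, 10]))  # the first hidden layer dominates: 2560
```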
The layers are connected by a module that stores the values computed by the previous layer, propagates the valid signal, and feeds the next layer's inputs sequentially.

As expected, the procedure always aligns the corresponding factors 2-by-2. This process is the same as iteratively dividing a given number by 2. When the number of inputs (not including the bias) is a power of 2, the datapath differs from Figure 2 only in the number of adders, and at the end we add the stored −bias. Figure 3 illustrates the step-by-step flow to produce the datapath for a given number of inputs. Several situations may occur:

Inserting the bias between adders: we use the valid signal to compute the exact moment to switch the MUX and insert the −bias;

Carrying values forward: when the number of inputs of an adder must be decreased or increased by one to make the next computation possible.
Figure 3. Flow to generate the datapath
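The 2-by-2 pairing flow of Figure 3 can be sketched as below (our illustration: each stage halves the number of partial terms, and an odd count carries one value forward, matching the situations listed above):

```python
def pairing_schedule(m):
    """Return the number of partial terms after each 2-by-2 pairing stage,
    starting from m products; an odd term is carried forward to the next stage."""
    counts = [m]
    while counts[-1] > 1:
        pairs, carry = divmod(counts[-1], 2)
        counts.append(pairs + carry)
    return counts

print(pairing_schedule(4))   # [4, 2, 1]  -- power of 2: a clean adder tree
print(pairing_schedule(10))  # [10, 5, 3, 2, 1]  -- odd counts carry forward
```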
B. The activation function
As mentioned in Section II, Subsection A, several functions can be used as the activation function of an artificial neuron. For several reasons, the sigmoid and the hyperbolic tangent are the most frequently used. Several implementations of these complex functions have been proposed [11][12][13], because a direct hardware implementation requires too many resources.

In some applications, the activation function can be represented as a lookup table. In this case, the solution can be made as precise as needed by increasing the number of values stored in the lookup table, although the concern about the occupied area remains.
In our architecture, a layer of neurons can be represented by a single datapath, and this fact allows us to use a single structure to compute the activation function for the entire layer. Another point is that the next layer (or the output of the ANN) only needs one output at a time. So the hardware that computes the activation function was attached to the datapath of the sum-of-products computation.
Representing the x ∈ [a, b] axis in increments of 1/k, we store (b − a)·k values of the function y. Thus, for a given x, ⌊(x − a)·k⌋ gives the position of x in the table for [a, b]. Simplifying, the term −a·k is a constant, and x is the result of the activation state of the neuron. Additionally, if k is a power of 2 (k = 2^z), the x·k term can be resolved with an integer combinational sum of z to the exponent of x.

Thus, we perform a single floating-point sum with a constant and two parallel comparisons (to check for out-of-bounds values) in order to access the table of the target function, and this structure is invariant with respect to the domain interval [a, b] and the step 1/k.
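A software model of the lookup-table access described above (our sketch; the interval [-8, 8] and k = 2^10 are illustrative choices, not values from the paper):

```python
import math

def build_sigmoid_lut(a, b, k):
    """Table of sigmoid values over [a, b] sampled with step 1/k."""
    n = int((b - a) * k) + 1
    return [1.0 / (1.0 + math.exp(-(a + i / k))) for i in range(n)]

def lut_sigmoid(x, lut, a, b, k):
    """Index = floor((x - a) * k): the -a*k part is a constant, and for
    k = 2**z the *k is just an exponent adjustment in hardware. The two
    comparisons saturate out-of-bounds inputs to the table ends."""
    if x <= a:
        return lut[0]
    if x >= b:
        return lut[-1]
    return lut[int((x - a) * k)]

a, b, k = -8.0, 8.0, 2 ** 10
lut = build_sigmoid_lut(a, b, k)
print(abs(lut_sigmoid(0.3, lut, a, b, k) - 1.0 / (1.0 + math.exp(-0.3))))
```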
V. PROBLEMS AND SOLUTIONS
A. Interconnection resources, number of pins, and logic fan-out
Using the conventional matrix product, the entire input vector must be read and stored until the computation of all neurons of the layer is complete. The approach used in this work has O(1) interconnection and pin complexity, while the conventional product has O(m) (for m inputs).
Our architecture also needs a lower transfer rate for the input data. For example, a layer with 256 inputs and 10 neurons needs 1 input every 10 clock cycles, and this input value is discarded after that.
Another point is the number of pins available in the FPGA. The largest Altera Stratix III device has 1120 I/O pins, which makes an ANN with 256 parallel floating-point inputs impractical (several applications in digital image processing may require a large number of inputs, as also noted by Jang [15]).
B. Logic consumption
Previously, in Section IV, Subsection A, we calculated the number of adders and multipliers used. Compared with Braga [19], which uses m+1 multipliers and computes each neuron in isolation, our ⌊log₂ m⌋ adders for the entire layer is a much better result (even using floating-point cores).
C. Area/Performance trade-off
An important aspect of the proposed architecture is the possibility of tuning the performance: the datapath of a layer can be replicated, raising the issue rate. For example, the layer of Figure 2 can be replicated 2 or 4 times (submultiples of the number of neurons), up to one datapath per neuron. Note that this parameter is independent of the one discussed in Subsection A.
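The effect of datapath replication can be modeled with a quick sketch (ours), using the Semeion first hidden layer from the results section as the example:

```python
def replicated_layer_cycles(n, m, r):
    """Cycles per result set when the layer datapath is replicated r times;
    r must be a submultiple of the neuron count n."""
    assert n % r == 0, "replication factor must divide the number of neurons"
    m_eff = m if m % 2 == 0 else m + 1   # odd input counts are padded via the bias
    return (n // r) * m_eff

# Semeion first hidden layer: 256 inputs, 10 neurons
for r in (1, 2, 5, 10):
    print(r, replicated_layer_cycles(10, 256, r))
```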
VI. RESULTS
Three neural networks were designed, two for classification and one for function approximation (approximating a sinusoid function). They were first modeled using the Matlab™ Neural Networks Toolbox and afterwards implemented in hardware. The two classification networks represent solutions for databases available in the UCI repository [20], namely the Iris and Semeion problems:

Iris: 3 classes of 50 instances each; input: 4 measurements of iris flowers;

Semeion: 1593 scanned handwritten digits, each stretched into a rectangular 16x16 box in a gray scale of 256 values.
Tables 1 and 2 show the synthesis results of the three ANNs, implemented in Verilog. We used floating-point adders, multipliers, and comparators generated by the Altera MegaWizard tool, and the ANNs were synthesized with the Synopsys Synplify Premier and Altera Quartus II 9.1 tools.
Semeion2 (Table 1) uses 5 datapaths of 2 neurons each for the first hidden layer (256 inputs), and Semeion3 uses 10 datapaths (one per neuron). The Iris ANNs share the same hardware, but we replicated the input patterns to show that the number of instances does not affect the speed-up.
All software performance values (averages over the core computations only) were measured on a Xeon E5310 (1.6 GHz) with 8 GB of DDR2 memory (an HP workstation running 64-bit Debian) and compiled with GCC.
Table 1. Performance reports for the ANNs, SW vs. HW

            samples   topology    neurons   SW (ms)   HW (ms)    speed-up
Sinusoid    249       1-5-1       6         0.345     0.010137   34.03
Iris        150       4-8-3-3     14        0.517     0.019463   26.56
Iris        300       4-8-3-3     14        1.0       0.037963   26.34
Iris        600       4-8-3-3     14        2.0       0.074963   26.67
Semeion     1593      256-10-10   20        50.07     13.605     3.68
Semeion2    1593      256-10-10   20        50.07     6.808      8.10
Semeion3    1593      256-10-10   20        50.07     1.371      36.52
Table 2 shows the resource usage of all three networks. Note that the Semeion network (256 inputs, 2 layers, 10 outputs) used 30% of the ALUTs of the smallest Altera Stratix III device. All networks achieved a 300 MHz fmax, which demonstrates the low impact of fan-out complexity as the network grows.
Table 2. Absolute area reports for the ANNs and relative usage of the EP3SL50F484C2 Altera device

            adders   ALUTs         registers     memory bits    DSP blocks
Sinusoid    4        4591 (12%)    5294 (14%)    116405 (2%)    8 (2%)
Iris        9        8782 (23%)    9791 (26%)    181901 (3%)    12 (3%)
Semeion     12       11297 (30%)   10967 (29%)   521847 (10%)   8 (2%)
Analyzing the error of all network outputs, we can see that the lookup-table approximation of the sigmoid function is the only source of imprecision (the computation of the activation state is lossless in single-precision floating-point arithmetic). For all Iris patterns we obtained a max error of 0.0021 and an SSE of 2.4x10^-5; for Semeion, a max error of 0.0241 and an SSE of 0.0372; for Sinusoid, a max error of 0.0051 and an SSE of 4.9x10^-5. The overall mean error is about 1.5x10^-4.
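The claim that the lookup table is the only imprecision source can be probed in software; the sketch below estimates the worst-case LUT error by dense sampling (the parameters are ours, chosen only for illustration):

```python
import math

def lut_max_error(a, b, k, probes=10000):
    """Estimate the worst-case error of a step-1/k sigmoid lookup table
    over [a, b] by dense sampling; roughly max|sigmoid'| / k = 0.25 / k."""
    sigmoid = lambda x: 1.0 / (1.0 + math.exp(-x))
    lut = [sigmoid(a + i / k) for i in range(int((b - a) * k) + 1)]
    worst = 0.0
    for p in range(probes):
        x = a + (b - a) * p / (probes - 1)
        idx = min(int((x - a) * k), len(lut) - 1)
        worst = max(worst, abs(lut[idx] - sigmoid(x)))
    return worst

print(lut_max_error(-8.0, 8.0, 2 ** 10))  # on the order of 0.25 / 1024
```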
VII. FUTURE WORKS
The current architecture may be extended to include more ANN types; additionally, the datapath HDL code could be automatically generated.
VIII. REFERENCES
[1] Braga, A. P.; Carvalho, A. P. L. F.; Ludermir, T. B. Redes Neurais Artificiais, LTC, 2007.
[2] Omondi, A. R.; Rajapakse, J. C.; Bajger, M. FPGA Neurocomputers. In: Omondi, A. R.; Rajapakse, J. C. (eds) FPGA Implementations of Neural Networks. Springer-Verlag, 2006, p. 37-56.
[3] Lee, Y.; Ko, S. B. FPGA implementation of a face detector using neural networks, IEEE CCECE/CCGEI, 2006.
[5] Azhar, M. A. H. B.; Dimond, K. R. Design of an FPGA Based Adaptive Neural Controller for Intelligent Robot Navigation. In: Proceedings of the Euromicro Symposium on Digital System Design, 2002.
[7] Bernard, G. FPNA: Concepts and Properties. In: Omondi, A. R.; Rajapakse, J. C. (eds) FPGA Implementations of Neural Networks. Springer-Verlag, 2006, p. 63-101.
[8] Canas, A. et al. FPGA Implementation of a Fully and Partially Connected MLP. In: Omondi, A. R.; Rajapakse, J. C. (eds) FPGA Implementations of Neural Networks. Springer-Verlag, 2006, p. 271-296.
[9] Girau, B. FPNA: Applications and implementations. In: Omondi, A. R.; Rajapakse, J. C. (eds) FPGA Implementations of Neural Networks. Springer-Verlag, 2006, p. 103-136.
[10] Girones, R. G.; Agundis, A. R. FPGA Implementation of Non-Linear Predictors. In: Omondi, A. R.; Rajapakse, J. C. (eds) FPGA Implementations of Neural Networks. Springer-Verlag, 2006, p. 297-323.
[11] Zhang, M.; Vassiliadis, S.; Delgado-Frias, J. G. Sigmoid generators for neural computing using piecewise approximations, IEEE Trans. Comput., 1996, p. 1045-1049.
[12] Amin, H.; Curtis, K. M.; Hayes-Gill, B. R. Piecewise linear approximation applied to nonlinear function of a neural network, IEEE Proc. Circuits Devices Syst., 1997, p. 313-317.
[13] Basterretxea, K.; Tarela, J. M.; Del Campo, I. Approximation of sigmoid function and the derivative for hardware implementation of artificial neurons, IEEE Proc. Circuits Devices Syst., Vol. 151, 2004.
[14] IEEE Computer Society: IEEE Standard 754 for Binary Floating-Point Arithmetic, 1985.
[15] Jang, H.; Park, A.; Jung, K. Neural Network Implementation using CUDA and OpenMP, Digital Image Computing: Techniques and Applications.
[17] Webpage of the CHREC consortium, http://www.chrec.org/
[18] Souza, V. L.; Medeiros, V. W.; de Lima, M. E. Architecture for Dense Matrix Multiplication on a High-Performance Reconfigurable System. In: Proceedings of the 21st Annual Symposium on Integrated Circuits and System Design (Natal, Brazil, 2009), SBCCI '09.
[19] Braga, A. L. S.; Llanos, C. H.; Ayala-Rincón, M.; Jacobi, R. P. VANNGen: a Flexible CAD Tool for Hardware Implementation of Artificial Neural Networks. In: Proceedings of the 2005 International Conference on Reconfigurable Computing and FPGAs (ReConFig 2005).
[20] UCI repository webpage, http://archive.ics.uci.edu/ml/index.html