
A High Performance Fully Pipelined Architecture of MLP Neural Networks in FPGA

Antonyus Ferreira, Edna Barros
Informatics Center, Federal University of Pernambuco
Recife, Brazil
{apaf, ensb}@cin.ufpe.br



Abstract

This paper presents the generalized architecture of an FPGA-based implementation of a Multilayer Perceptron (MLP) Artificial Neural Network (ANN). The proposed architecture aims to allow the implementation of large ANNs in FPGA, with attention to area consumption, interconnection resources and the area/performance trade-off. Among the results is the use of log2 m adders for an ANN layer with m inputs. An ANN with a 256-10-10 topology could reach a 36x speed-up compared to a conventional software implementation.


Keywords - FPGA, MLP, ANN, hardware architecture, high performance, reconfigurable system.

I. INTRODUCTION

Artificial Neural Networks (ANNs) are used in many areas, such as signal processing systems, medical image analysis, time series forecasting, robot vision, etc. In some of those areas the amount of data to be processed is very large or the required response time is too short, which can make portable solutions using conventional software designs unviable.

So the proposal of faster implementations of ANNs seems reasonable. In the literature, many types can be found, from digital hardware implementations [2], [7], [8], [9] and [10], passing through GPUs [15], to analog systems. Another characteristic of ANNs that motivates parallel implementations is the intrinsic data parallelism of the model.

This work is aligned with the digital hardware solutions, motivated by the increasing performance obtained by FPGAs year after year. A great example of the potential of FPGAs is the CHREC (Center for High-Performance Reconfigurable Computing) initiative [17] to create the NOVO-G, which links 96 top-end Altera Stratix III FPGAs in 24 servers with 576 GB of memory and 20 Gb/s InfiniBand.


First, Section II presents some basic concepts concerning ANNs; Section III shows some related works; and Section IV describes in detail the architecture that is the goal of this work. Section V discusses problems commonly found in FPGA implementations and how we intend to address those issues. Finally, Sections VI and VII present the results and conclusions, respectively.



II. MLP ARTIFICIAL NEURAL NETWORKS

Animal brains are composed of billions of cells interconnected in a giant net, and ANNs are computational models whose organization and architecture are inspired by the structure of animal brains. This model inherits from the biological model its parallel and distributed nature.

ANNs can be found in many areas such as signal processing, medical image analysis, diagnostic systems and time series forecasting. Some desired properties [1] of ANNs are:

a. Learning through examples - non-parametric statistical inference

b. Adaptability

c. Generalization

d. Fault tolerance

A. Artificial Neuron

An artificial neuron is the basic unit of a neural network's architecture. In a neuron's structure we can identify:

a) An input set that receives the neuron's input signals;

b) A set of synapses whose intensity is represented by associated weights;

c) An activation function that compares the inputs and their synaptic weights with the function threshold to define the neuron's output.

In Figure 1, each Wi represents the weight associated with each input Xi, and Φ is the activation function. The result of the synapses is given by the sum of products (u) of the input vector by the weight vector, and the output by the computation of Φ(u).

Some activation functions used are:

a) Step function: Φ(u) = 1 if u > 0, Φ(u) = 0 otherwise

b) Ramp function: Φ(u) = max{0.0, min{1.0, u + 0.5}}

c) Sigmoid function: Φ(u) = a / (1 + exp(−bu))
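These three functions can be written down directly; the following is a minimal Python sketch (ours, for reference only, with a and b as free parameters):

```python
import math

def step(u: float) -> float:
    # Step function: 1 if u > 0, 0 otherwise.
    return 1.0 if u > 0 else 0.0

def ramp(u: float) -> float:
    # Ramp function: clamps u + 0.5 to [0.0, 1.0].
    return max(0.0, min(1.0, u + 0.5))

def sigmoid(u: float, a: float = 1.0, b: float = 1.0) -> float:
    # Sigmoid function with amplitude a and slope b.
    return a / (1.0 + math.exp(-b * u))
```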


Figure 1. Artificial neuron components

III. RELATED WORKS

A. FPGA implementation of a face detector using neural networks [3]

In this work, Yongsoon Lee and Seok-Bum Ko implemented an MLP ANN using floating point arithmetic [14], chosen for the dynamic range of this representation, together with an approximation of the sigmoid function. Their implementation of the sigmoid seems simplistic and uses too many floating point computations. They obtained an fmax of 38 MHz. This work, like [5], makes evident the utilization of ANNs in real-time applications.
.

B. VANNGen: a Flexible CAD Tool for Hardware Implementation of Artificial Neural Networks [19]

This work proposes a generic generator of ANNs in FPGA. The authors use fixed-point representation, LUT-based multipliers and a LUT implementation of the activation function.

Their architecture uses dedicated hardware for each neuron, and they do not show whether there is any pipeline structure, either for a single neuron or for the entire network.

They validated their architecture using a network (topology 1:2:1) to approximate a sinusoid function. The performance reached 100 MHz on a Spartan 3 XC3S500 FPGA.

IV. PROPOSED ARCHITECTURE

A. Computation of the Activation State

As mentioned in Section II, each neuron computes the sum of products (also called activation state) of the weights by the respective inputs. A layer of neurons computes the matrix product U = W·X + b (n neurons, an m-dimensional input vector X, and bias vector b).

The matrix multiplication algorithm is given by:

( U1 )   ( W11 W12 … W1m )   ( X1 )   ( b1 )
( …  ) = ( …             ) · ( …  ) + ( …  )
( Un )   ( Wn1 Wn2 … Wnm )   ( Xm )   ( bn )

Each element Ui of the result is given by Ui = Wi1·X1 + Wi2·X2 + … + Wim·Xm + bi, which includes m sums over m+1 independent products for each of the n neurons (the bias accounts for the extra product). So if we could instantiate m+1 multipliers and m adders in a reconfigurable resource, the result would be computed in n clock cycles (after the pipeline is full).

Inspired by a high performance matrix multiplication solution [18], we perform the matrix product using the columns of W multiplied by a single element of X. This way, we need a single value of X at a time, and it is used only once.

Thus, we first compute the independent products of the first column (W11·X1, …, Wn1·X1) and afterwards sum them with the results of the second column products, and so on.
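In software form, this schedule can be sketched as follows (a behavioral model under our reading of the scheme; the actual design is a pipelined hardware datapath):

```python
# Behavioral sketch (ours, not the paper's RTL) of the column-order
# schedule: each input X[j] is fetched once and multiplied by the whole
# j-th column of W before the next input is needed. A single multiplier
# suffices because the n products of one column are issued sequentially.
def layer_activation(W, x):
    n, m = len(W), len(x)
    acc = [0.0] * n                 # running sums, one per neuron
    for j in range(m):              # one input element per burst
        xj = x[j]                   # x[j] is read exactly once
        for i in range(n):          # n cycles: one multiply per cycle
            acc[i] += W[i][j] * xj  # in hardware the alignment is done
                                    # by the shift register of Figure 2
    return acc

# Example: 4-neuron, 4-input layer as in Figure 2.
W = [[0.1 * (i + j) for j in range(4)] for i in range(4)]
print(layer_activation(W, [1.0, 2.0, 3.0, 4.0]))
```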

Figure 2 shows an example for a layer with 4 neurons and 4 inputs.


Figure 2. Architecture example of a 4-input, 4-neuron layer

The output of the multiplier passes through a shift-register (SR) that aligns W11·X1 with W12·X2 in order to compute the sum of the products. This way, the activation state is computed using only the datapath (additionally, a valid signal is put together with the first input data and carried along to signal to the next layer that the computation is done).

So to multiply W by X we spend n·m clock cycles (after the pipeline is full) with ⌈log2 m⌉ adders and 1 multiplier.

Obviously, this datapath changes depending on the number of inputs. For example, if m is an odd number, we use the bias to make it even (introducing a −1 input and multiplying it by the bias). So the computation rate of a layer is one layer every n·m, or n·(m+1) if m is odd, clock cycles, and the rate of the entire network is given by the layer with the lowest rate.
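These counts can be restated compactly (a sketch under our reading; the ceil on log2 follows the abstract's "log2 m adders" claim, and the exact adder count also depends on how the bias input is inserted):

```python
import math

def layer_cycles(n: int, m: int) -> int:
    # One result vector every n*m cycles; n*(m+1) when m is odd and
    # the -1/bias input is appended to make the input count even.
    return n * (m + 1 if m % 2 else m)

def layer_adders(m: int) -> int:
    # ceil(log2 m) pipelined adders (plus one multiplier) per layer.
    return math.ceil(math.log2(m))

def network_cycles(topology) -> int:
    # topology like (256, 10, 10); the slowest layer sets the rate.
    return max(layer_cycles(n, m) for m, n in zip(topology, topology[1:]))

print(network_cycles((256, 10, 10)))        # 2560 cycles per pattern
print(layer_adders(256), layer_adders(10))  # 8 and 4 adders
```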


The layers are connected using a module that stores the values computed by the previous layer, propagates the valid signal, and sequentially feeds the inputs of the next layer.

As expected, the procedure always aligns the corresponding factors 2-by-2. This process is the same as iteratively dividing a given number by 2. If the number of inputs (not including the bias) is a power of 2, the datapath differs from Figure 2 only in the number of adders, and at the end we introduce the stored bias. Figure 3 illustrates the step-by-step flow to produce the datapath for a given number of inputs.

Several situations may occur:

• Inserting the bias between adders: we use the valid signal to compute the exact moment to switch the MUX and insert the bias;

• Carrying values forward: when the number of inputs of an adder stage must be decreased or increased by one to make the next computation possible.

A software sketch of this pairing flow is given after Figure 3.


Figure 3. Flow to generate the datapath
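The pairing flow can be sketched as follows (our illustrative reconstruction of the Figure 3 procedure, with a hypothetical helper name, not the authors' generator):

```python
def reduction_stages(num_products: int):
    # One physical adder per stage: products stream in sequentially,
    # so each stage only ever adds two values at a time. An odd stage
    # width means one value is carried forward to the next stage.
    stages, width = [], num_products
    while width > 1:
        carried = width % 2 == 1
        stages.append(carried)
        width = width // 2 + (1 if carried else 0)
    return stages

# Example: 10 products need 4 adders (= ceil(log2 10)), with values
# carried forward at the second and third stages.
stages = reduction_stages(10)
print(len(stages), "adders; carries at stages",
      [i + 1 for i, c in enumerate(stages) if c])
```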

B. The activation function

As mentioned in Section II-A, several functions can be used as the activation function of the artificial neuron. For several reasons, the sigmoid and the hyperbolic tangent are the most frequently used. Some implementations of these complex functions have been proposed [11][12][13], because a direct hardware implementation requires too many resources.

In some applications, the activation function can be represented as a lookup table. In this case, the precision of the solution is as accurate as needed, at the cost of increasing the number of values stored in the lookup table. The concern about the occupied area is also present.

In our architecture a layer of neurons can be represented by a single datapath, and this fact allows us to use a single structure to compute the activation function for the entire layer. Another point is that the next layer (or the output of the ANN) only needs one output at a time. So the hardware that computes the activation function was attached to the datapath of the sum-of-products computation.


Representing the x axis over [a, b] in increments of 1/k, we will have k·(b − a) + 1 stored values of the function y. Thus, for a given x, the expression (x − a)·k = x·k − a·k gives the position of x in the table over [a, b]. Simplifying, the a·k term is a constant and x is the result of the activation state of the neuron. Additionally, if k is a power of 2 (k = 2^z), the x·k term can be resolved with a combinational integer sum of z to the exponent of x.

Thus we perform a single floating point sum with a constant and two parallel comparisons (to check for out-of-bounds values) in order to access the table of the target function, and this structure is invariant with respect to the domain interval and the step 1/k.
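The indexing scheme can be sketched as follows (our Python restatement; in hardware the paper uses one floating point adder, two parallel comparators and an exponent adjustment instead):

```python
import math

def build_table(f, a, b, z):
    # Tabulate f over [a, b] with step 1/k, k = 2**z:
    # k*(b - a) + 1 stored values.
    k = 2 ** z
    n = int(k * (b - a)) + 1
    return [f(a + i / k) for i in range(n)]

def lut_eval(table, a, b, z, x):
    k = 2 ** z
    x = min(max(x, a), b)      # the two out-of-bounds comparisons
    idx = int(x * k - a * k)   # a*k is a constant; x*k is an exponent
                               # shift in hardware when k = 2**z
    return table[idx]

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

table = build_table(sigmoid, -8.0, 8.0, z=5)      # step 1/32
print(lut_eval(table, -8.0, 8.0, 5, 0.4), sigmoid(0.4))
```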

V. PROBLEMS AND SOLUTIONS

A. Interconnection resources, number of pins and logic fan-out

Using the conventional matrix product, the entire input vector must be read and stored until the computation of all neurons of the layer is complete. The approach used in this work has O(1) interconnection and pin complexity, while the conventional product has O(m) (for m inputs).

Our architecture also needs a lower transfer rate for the input data. For example, a layer with 256 inputs and 10 neurons needs one input every 10 clock cycles, and this input value is discarded after that.

Another point is the number of pins available in the FPGA. The largest Altera Stratix III board has 1120 I/O pins, which makes an ANN with 256 parallel floating point inputs impractical (several applications in digital image processing may require a large number of inputs, as also noted by Jang [15]).

B. Logic consumption

Previously, in Section IV-A, we calculated the number of adders and multipliers used. Compared with Braga [19], who uses m+1 multipliers and computes each neuron in isolation, our single multiplier and ⌈log2 m⌉ adders for the entire layer is a much better result (even using floating point cores).

C. Area / Performance trade-off


An important aspect of the proposed architecture is the
possibility of tuning the performance.
T
he datapath of a
layer can be
replicated, and thus, raise the issue rate. For
example, the layer of the
Figure 2

can be replicated in 2 or
4 (submultiples of th
e number of neurons) and at the end we
could have one datapath for each neuron. It can be noticed
that this parameter is independent of the discussed in
subsection A
.
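As an illustration of this tuning knob (our restatement, with a hypothetical helper name):

```python
# With r replicated datapaths (r dividing n), each handles n/r neurons,
# so a layer issues one result every (n/r)*m cycles instead of n*m.
def layer_cycles_replicated(n: int, m: int, r: int = 1) -> int:
    assert n % r == 0, "r must be a submultiple of the neuron count"
    return (n // r) * m

# First semeion layer (256 inputs, 10 neurons); r = 1, 5, 10 roughly
# correspond to the semeion, semeion2 and semeion3 variants of Table 1.
for r in (1, 5, 10):
    print(f"r={r}: {layer_cycles_replicated(10, 256, r)} cycles per pattern")
```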


VI. RESULTS

Three neural networks were designed, two for classification and one for function approximation (a sinusoid), first using the Matlab neural networks toolbox and afterwards implemented in hardware. The two classification networks are solutions for databases available in the UCI repository [20], the iris and semeion problems:



• Iris: 3 classes of 50 instances each; input: 4 measurements of the iris flowers;

• Semeion: 1593 scanned handwritten digits, each stretched into a 16x16 rectangular box in a gray scale of 256 values.

Tables 1 and 2 show the synthesis results of the three ANNs, implemented in Verilog. We used floating point adders, multipliers and comparators generated by the Altera MegaWizard tool, and the ANNs were synthesized with the Synopsys Synplify Premier and Altera Quartus II 9.1 tools.

The semeion2 network (Table 1) uses 5 datapaths of 2 neurons each for the first hidden layer (256 inputs), and semeion3 uses 10 datapaths (one per neuron). The iris ANNs share the same hardware, but we replicated the input patterns to show that the number of instances does not affect the speed-up.

All software performance average values (covering only the core computations) were measured on a Xeon E5310 (1.6 GHz) with 8 GB of DDR2 memory (Debian 64-bit OS, HP workstation) and compiled with GCC.

Table 1. Performance reports for the ANNs, SW vs. HW

Network    Samples  Topology    Neurons  SW (ms)  HW (ms)    Speed-up
Sinusoid   249      1-5-1       6        0.345    0.010137   34.03
Iris       150      4-8-3-3     14       0.517    0.019463   26.56
Iris       300      4-8-3-3     14       1.0      0.037963   26.34
Iris       600      4-8-3-3     14       2.0      0.074963   26.67
Semeion    1593     256-10-10   20       50.07    13.605     3.68
Semeion2   1593     256-10-10   20       50.07    6.808      8.10
Semeion3   1593     256-10-10   20       50.07    1.371      36.52

Table 2 shows the resource usage of all three networks. Note that the semeion network (256 inputs, 2 layers, 10 outputs) used 30% of the ALUTs of the smallest Altera Stratix III device. All networks achieved a 300 MHz fmax, which demonstrates the low impact of fan-out complexity as the network grows.


Table 2. Absolute area reports for the ANNs and relative usage of the EP3SL50F484C2 Altera device

Network   Adders  ALUTs        Registers    Memory bits    DSP blocks
Sinusoid  4       4591 (12%)   5294 (14%)   116405 (2%)    8 (2%)
Iris      9       8782 (23%)   9791 (26%)   181901 (3%)    12 (3%)
Semeion   12      11297 (30%)  10967 (29%)  521847 (10%)   8 (2%)

Analyzing the error of all network outputs, we can see that the lookup table approximation of the sigmoid function is the only source of imprecision (the computation of the activation state is lossless with respect to single precision floating point arithmetic).

For all iris patterns we obtained a maximum error of 0.0021 and an SSE of 2.4×10⁻⁵; for semeion, a maximum error of 0.0241 and an SSE of 0.0372; for the sinusoid, a maximum error of 0.0051 and an SSE of 4.9×10⁻⁵. The overall mean error is about 1.5×10⁻⁴.


VII. FUTURE WORK

The current architecture may be extended to include more ANN types, and additionally the datapath HDL code could be generated automatically.


VIII. REFERENCES

[1] Braga, A. P.; Carvalho, A. P. L. F.; Ludermir, T. B. Redes Neurais Artificiais, LTC, 2007.

[2] Omondi, A. R.; Rajapakse, J. C.; Bajger, M. FPGA Neurocomputers. In: Omondi, A. R.; Rajapakse, J. C. (eds.) FPGA Implementations of Neural Networks. Springer-Verlag, 2006, pp. 37-56.

[3] Lee, Y.; Ko, S. B. FPGA implementation of a face detector using neural networks. IEEE CCECE/CCGEI, 2006.

[5] Azhar, M. A. H. B.; Dimond, K. R. Design of an FPGA Based Adaptive Neural Controller for Intelligent Robot Navigation. In: Proceedings of the Euromicro Symposium on Digital System Design, 2002.

[7] Girau, B. FPNA: Concepts and Properties. In: Omondi, A. R.; Rajapakse, J. C. (eds.) FPGA Implementations of Neural Networks. Springer-Verlag, 2006, pp. 63-101.

[8] Canas, A.; et al. FPGA Implementation of a Fully and Partially Connected MLP. In: Omondi, A. R.; Rajapakse, J. C. (eds.) FPGA Implementations of Neural Networks. Springer-Verlag, 2006, pp. 271-296.

[9] Girau, B. FPNA: Applications and implementations. In: Omondi, A. R.; Rajapakse, J. C. (eds.) FPGA Implementations of Neural Networks. Springer-Verlag, 2006, pp. 103-136.

[10] Girones, R. G.; Agundis, A. R. FPGA Implementation of Non-Linear Predictors. In: Omondi, A. R.; Rajapakse, J. C. (eds.) FPGA Implementations of Neural Networks. Springer-Verlag, 2006, pp. 297-323.

[11] Zhang, M.; Vassiliadis, S.; Delgado-Frias, J. G. Sigmoid generators for neural computing using piecewise approximations. IEEE Trans. Comput., 1996, pp. 1045-1049.

[12] Amin, H.; Curtis, K. M.; Hayes-Gill, B. R. Piecewise linear approximation applied to nonlinear function of a neural network. IEE Proc. Circuits, Devices Syst., 1997, pp. 313-317.

[13] Basterretxea, K.; Tarela, J. M.; Del Campo, I. Approximation of sigmoid function and the derivative for hardware implementation of artificial neurons. IEE Proc. Circuits, Devices Syst., Vol. 151, 2004.

[14] IEEE Computer Society: IEEE Standard 754 for Binary Floating-Point Arithmetic, 1985.

[15] Jang, H.; Park, A.; Jung, K. Neural Network Implementation using CUDA and OpenMP. Digital Image Computing: Techniques and Applications.

[17] Webpage of the CHREC consortium, http://www.chrec.org/

[18] Souza, V. L.; Medeiros, V. W.; de Lima, M. E. Architecture for Dense Matrix Multiplication on a High-Performance Reconfigurable System. In: Proceedings of the 21st Annual Symposium on Integrated Circuits and System Design (Natal, Brazil, 2009), SBCCI '09.

[19] Braga, A. L. S.; Llanos, C. H.; Ayala-Rincón, M.; Jacobi, R. P. VANNGen: a Flexible CAD Tool for Hardware Implementation of Artificial Neural Networks. In: Proceedings of the 2005 International Conference on Reconfigurable Computing and FPGAs (ReConFig 2005).

[20] UCI repository webpage, http://archive.ics.uci.edu/ml/index.html