FPGA Systems Speed analysis



Reason for not having much analysis in this area:

In a normal industrial situation, when using an FPGA, all operations are coded, simulated, and tested prior to choosing an FPGA model. However, due to our time constraints, coding and testing before designing the circuit board is not feasible. Hence, we are using the resources we have to estimate, as best as possible, the needs of our operations on this FPGA.


Analysis:

Traditionally, the term neural network has been used to refer to a network or circuit of biological neurons. The modern usage of the term often refers to artificial neural networks, which are composed of artificial neurons or nodes. Artificial neural networks are made up of interconnecting artificial neurons (programming constructs that mimic the properties of biological neurons). Artificial neural networks may either be used to gain an understanding of biological neural networks, or for solving artificial intelligence problems without necessarily creating a model of a real biological system. The real, biological nervous system is highly complex and includes some features that may seem superfluous based on an understanding of artificial networks.


The cognitive modelling field involves the physical or mathematical modelling of the behaviour of neural systems, ranging from the individual neural level (e.g. modelling the spike response curves of neurons to a stimulus), through the neural cluster level (e.g. modelling the release and effects of dopamine in the basal ganglia), to the complete organism (e.g. behavioural modelling of the organism's response to stimuli). For more detailed information about neural networks, please see external sources such as Wikipedia or GCCIS faculty.








This design, shown below in figure 1, was implemented on a Digilent Basys Spartan 3E-100 development board. It currently performs the function of XOR; however, it has no heuristic coding to help out. Instead it uses the theory touched on above to learn acceptable and unacceptable responses to input. This is not a simple design.

The results of this network are outputs within 10% of the goal values for "high" and "low". These values are relatable to digital logic values in hardware and can be used as such. The results of the implementation in VHDL using the Xilinx WebPack ISE are shown below in table 1.





Node        Levels    Time
Data In     29        13 ns
Data Out    2         5 ns

Table 2: Data speeds

Looking at the data from Table 1, we can see how few resources this design took up: a little less than 19% overall. This is not much, considering the complexity of the design and the simplicity of the Spartan 3E-100 FPGA. For example, the last row of table 1 shows that only 4 built-in, hardware-optimized 18x18-bit multipliers exist in this device, and all are used. This design does a substantial amount of math to calculate the weights on the connections between neurons, so more ALUs needed to be created using general-purpose slices, as can be seen in the slice usage of about 20%.



This design was very fast and was able to process changes in inputs very quickly. We can see that 29 levels of logic needed to be traversed from the input to the end of the processing pipeline. However, this only takes 13 ns, so we can calculate the equivalent frequency to be 76.92 MHz. This number implies that we can handle roughly 76 million changes of inputs per second on each pin that has input to this logic design. From this point in the design, getting to the output is only 2 levels and takes 5 ns (or a speed of 200 MHz). This is quite speedy on a device that was released as the low-priced, slowest device in the product line in 2005. There are not many people still using computers considered "hot items" in 2005 (think first dual-core processors, Celeron D, ...). If the internal logic were simpler, or more pipelined, we would be able to reduce these delays further in the final design.
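The frequency figures above follow directly from the measured propagation delays. As a quick sanity check (a sketch, with the delays from Table 2 hard-coded):

```python
# Convert the propagation delays from Table 2 into equivalent frequencies.
# 1 ns corresponds to 1 GHz, so freq_MHz = 1000 / delay_ns.
delays_ns = {"Data In": 13, "Data Out": 5}

freq_mhz = {node: 1000 / delay for node, delay in delays_ns.items()}

for node, f in freq_mhz.items():
    print(f"{node}: {delays_ns[node]} ns -> {f:.2f} MHz")
# Data In: 13 ns -> 76.92 MHz
# Data Out: 5 ns -> 200.00 MHz
```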



Correlation to this project:




Unfortunately, we have not been able to simulate this design on a Spartan 6LX75T core; we have been having problems getting the Xilinx software set up. However, we can attempt to build a ratio between the two devices. The table at the end of this section ("Comparing the Spartan models") compares directly the resources that both devices have available, and the table below shows how much of the Spartan 3E-100 this design consumed.

Table 1: Device Utilization Summary for Spartan 3E-100 (total resources used: 19%)

Logic Utilization            Used    Available    Utilization
Number of Slice Latches      22      1,920        1%
Occupied Slices              197     960          20%
4-input LUTs                 360     1,920        18%
  Logic                      328                  17%
  Route-thru                 32                   3%
Number of bonded IOBs        10      108          9%
MULT18X18SIOs                4       4            100%

As you can see, there is a large increase in on-die resources available. This does not mean that we can do the same task with fewer resources, just that the design will take up less space on this model. However, this model is built using a different transistor process, and thus will be able to run faster, since the distance between individual elements on the chip is smaller (up to a limit). We can see that the standard clock speed of the Spartan 6 is 2.5 times that of the tested unit. This will directly correlate to the speed of the device running from input to output. However, this is not a 1:1 ratio; we cannot say that because the clock speed is higher it will run at 2.5x the speed of the Spartan 3. But we can say that the speed available for the specific sequential calculations will be higher by some factor above 1x and below 2.5x. There are various factors to consider in this calculation, including the amount of actual processing that will be done (currently unknown) and the clock speeds of the other components in the design (memory, SATA, etc.). However, due to the amount of resources available and the low end of the speed spectrum we know (250 MHz), we can estimate that the speed will be closer to 2.5x.
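As a rough sketch of that reasoning (the 1x and 2.5x bounds are assumptions taken from the clock-speed ratio above, not measurements):

```python
# Bound the Spartan-6 pin-to-pin speed using the Spartan 3E-100 measurement
# and the ratio of the two devices' standard clock speeds (100 MHz vs 250 MHz).
measured_mhz = 76.92          # input-to-pipeline-end speed on the Spartan 3E-100
clock_ratio = 250 / 100       # Spartan-6 clock is 2.5x the tested unit's

lower_bound = measured_mhz * 1.0          # at worst, no faster than the 3E-100
upper_bound = measured_mhz * clock_ratio  # at best, scales with the clock ratio

print(f"Estimated Spartan-6 speed: {lower_bound:.2f} to {upper_bound:.2f} MHz")
```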



For example, say we are processing a pixel, and we need to do X amount of math that takes Y seconds. Let's say that Y is longer than 1/30th of a second (the IR camera picture rate); this causes a problem: we cannot process pictures fast enough. We will solve this not by making the FPGA faster or raising the clock speed, but instead by parallelizing the math done in X. This will reduce the time Y needed to process the pixel. Now, we realize that only so much can be done in parallel, and that we will not be utilizing all of the resources of this large FPGA. We will solve this next problem by parallel processing of multiple pixels simultaneously. There's no reason why we wouldn't be able to, say, copy the image pipeline above (A) and create another one called B. From this we can say that if the pixel data is received for the next image and pipeline A is not done yet, we can start the processing in B. This technique is very scalable, so the amount of processing we do is directly proportional to the number of parallel pipelines we will need to process all of the data in "real time".
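A back-of-the-envelope version of that scaling rule (the per-frame processing time here is a made-up placeholder, since the actual amount of math is still unknown):

```python
import math
from fractions import Fraction

# How many duplicated pipelines (A, B, C, ...) are needed so a new frame can
# always start in an idle pipeline, given a per-frame processing time.
# Fractions keep the arithmetic exact, so the ceiling is not thrown off by
# floating-point rounding.
frame_period_s = Fraction(1, 30)      # IR camera delivers a frame every 1/30 s
processing_time_s = Fraction(1, 10)   # placeholder: assumed 100 ms per frame

pipelines_needed = math.ceil(processing_time_s / frame_period_s)
print(pipelines_needed)  # 3 pipelines keep up with 30 frames/s in this example
```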



Comparing the Spartan models:

Resource Type               Spartan 3E-100   Spartan 6LX75T   % of 3E-100
Slices                      960              11,662           1210%
LUTs                        1,920            46,648           2430%
Latch/FFs                   1,920            93,296           4860%
User I/O                    108              296              274%
Diff. Pairs                 40               148              370%
18x18SIO/DSP48A slices      4                132              3300%
Functional Clock Speed      100 MHz          250 MHz          250%
Size of transistors         90nm process     45nm process     50%





Initially we will just use the FPGA as a large and super-fast MUX. This will allow us to connect multiple cameras to the OEM board. The complexity of this logic is much less than that of the neural network simulated in test condition 1. This implies that the delay from input pin to output pin will be less (how much less is irrelevant for this analysis, since 76 MHz already exceeds our needs), and here's why:




Visual camera

10MP image size = 3664 x 2748 = 10,068,672 pixels
1 pixel = 12 bits of data (width of interface is 16 bits, so this works and gets passed in 1 clock cycle)
Clock cycles / pixel = 1
Number of images / second = 1
Speed = cycles/pixel * pixels * images/second = 1 * 10,068,672 * 1
Total required speed to get data in = 10,068,672 Hz (or ~10.07 MHz)




IR camera

0.3MP image size = 640 x 480 = 307,200 pixels
1 pixel = 8 bits of data (width of interface is 16 bits, so this works and gets passed in 1 clock cycle, max)
Clock cycles / pixel = 1
Number of images / second = 30
Speed = cycles/pixel * pixels * images/second = 1 * 307,200 * 30 = 9,216,000
Total required speed to get data in = 9,216,000 Hz (or ~9.2 MHz)
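The two camera bandwidth figures above can be checked with a few lines (a sketch; the resolutions, bit depths, and frame rates are taken straight from the numbers above):

```python
# Required input data rate for each camera: one pixel per clock cycle,
# so rate (Hz) = pixels per image * images per second.
def input_rate_hz(width, height, frames_per_second):
    return width * height * frames_per_second

visual_hz = input_rate_hz(3664, 2748, 1)   # 10MP visual camera at 1 frame/s
ir_hz = input_rate_hz(640, 480, 30)        # 640x480 IR camera at 30 frames/s

print(visual_hz)  # 10068672  (~10.07 MHz)
print(ir_hz)      # 9216000   (~9.2 MHz)
```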











INS unit

Total size of data / capture = unknown
Total size of data / second = 1 kB
RS-232 rate of device = unknown (serial, so 1 bit / cycle)
# of captures / second = 30 (same as fastest image rate)
Total data needed to be received / capture = #bits / #captures = 8000 / 30 ≈ 267 bits ≈ 34 bytes of data
Speed = 8000 bits / second = 8000 baud
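A quick check of the INS arithmetic (1 kB/s is taken as 8,000 bits/s, as above):

```python
import math

bits_per_second = 8000          # 1 kB/s of INS data over RS-232
captures_per_second = 30        # matched to the fastest camera frame rate

bits_per_capture = bits_per_second / captures_per_second
bytes_per_capture = math.ceil(bits_per_capture / 8)

print(round(bits_per_capture))  # ~267 bits per capture
print(bytes_per_capture)        # 34 bytes
```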


Conclusion:

From the calculations above, the FPGA is able to handle getting the data in. Using the strategies above, we can parallelize processing appropriately to get all of the data in and processed successfully in the required time.