HPEC using FPGAs

yakzephyr · Artificial Intelligence and Robotics

24 Nov 2013



HPEC using FPGAs

Challenges and Benefits

Utah State University

Cache Valley, 90 miles north of Salt Lake City

David G. Sant Engineering Innovation Building

Agenda


On-board computing for Spacecraft


A primer on FPGAs (5 slides)


HPEC using FPGAs (26 slides)


The Polymorphic Systolic Array Framework


Improving productivity


Enabling real time and responsive reconfiguration


Future technologies for FPGAs


Acknowledgements


On-board Computing


Civilian and Military space missions getting more complex


Need to support several types of data from several types of sensors


Missions will require spacecraft computer to be more responsive


Need for in-situ data processing (signal processing)


Not just compression, but data analysis, decision making etc.


Power budget, form factors of spacecraft computer extremely tight


State-of-the-art RadHard microprocessor from BAE Systems or RISC processor?


Aging workhorse, time to upgrade big time




So, what do we upgrade to?


Commodity Microprocessors


Cell, GPU, Many/Multi core


Very powerful


Blows out the power budget


RadHard parts need to be custom ordered


Commodity DSP chips


Good as long as you stick to just one chip


RadHard parts can be custom ordered


Commodity Reconfigurable chips


FPGAs (field programmable gate arrays)


Can perform like a custom silicon chip


Best performance/power ratios


RadHard parts already available with steady roadmap from Xilinx

Programming perspective


Microprocessors: Frozen pizza

DSP chips: Take ‘n’ bake

FPGAs: Raw ingredients

Optimistic view point

Quick Primer on FPGAs


Mixture of blocks on a die


Some dedicated


DSP (MAC units)


PPC (optional)


RAM


Some programmable


Look Up Tables (LUT)


Gazillions of network switches


Hidden


Special circuit


ICAP (internal configuration access port)

Simple View of Programming an FPGA


An FPGA is essentially a vast set of SRAM cells waiting to be loaded with 0s and 1s to mimic Boolean logic

NMOS transistor

All computations are assumed to be based on Boolean Logic

So,

Problem solving concept => algorithms

Algorithms => Discrete set of simple tasks (add/multiply…)

Simple tasks => A set of Boolean functions talking to each other

Boolean function => simple manipulation of 1 and 0 bits

Each bit stored in a small memory cell (SRAM)

Programming an FPGA


Each Look Up Table (LUT) has a unique mailing address


16 bits go into each Look Up Table (LUT)


Each routing switch has a unique mailing address


One bit for each switch


Executable for an FPGA is a sequence of bits that have to be delivered precisely to each LUT and Switch Box


This binary/executable is called the “Configuration Bitstream” or simply “Bitstream”
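The "16 bits per LUT" figure can be made concrete with a toy model. This is a minimal sketch, assuming a 4-input LUT (2^4 = 16 truth-table entries); the `Lut4` class and its bit ordering are illustrative only and are not the vendor's actual configuration format.

```python
class Lut4:
    """Toy 4-input look-up table: 16 configuration bits act as a truth table."""

    def __init__(self, config_bits: int):
        if not 0 <= config_bits < (1 << 16):
            raise ValueError("a 4-input LUT holds exactly 16 configuration bits")
        self.config_bits = config_bits

    @classmethod
    def from_function(cls, func):
        """Build the 16 bits by evaluating func on every input combination."""
        bits = 0
        for index in range(16):
            a, b, c, d = [(index >> k) & 1 for k in range(4)]
            if func(a, b, c, d):
                bits |= 1 << index
        return cls(bits)

    def evaluate(self, a, b, c, d):
        """The four inputs select which stored bit drives the output."""
        index = a | (b << 1) | (c << 2) | (d << 3)
        return (self.config_bits >> index) & 1


# "Programming" the LUT means loading 16 bits; here they mimic AND and XOR
and4 = Lut4.from_function(lambda a, b, c, d: a & b & c & d)
xor4 = Lut4.from_function(lambda a, b, c, d: a ^ b ^ c ^ d)
print(f"AND config bits: {and4.config_bits:016b}")
print(f"XOR config bits: {xor4.config_bits:016b}")
```

The same physical LUT becomes a different Boolean function purely by changing the stored bits, which is the sense in which the bitstream is the FPGA's executable.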






Programming an FPGA


Programming the FPGA is like having a Mailman deliver bits to each address correctly


Slow process


But a Bitstream is slightly more complex


Each FPGA is like a Country (has a unique code)


A “Bitstream” before entering the chip has to undergo security clearance (CRC or cyclic redundancy check)


Port of Entry = ICAP


FPGA addresses are hierarchical (state, county, city, suburb, house address)


Term used for encoding all this overhead is “Frame Address”



All this address stuff is overhead


Actual useful stuff is inside the mail envelope
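The hierarchical "mailing address" idea can be sketched as bit fields packed into one address word. The field names and widths below are hypothetical, chosen only to illustrate the hierarchy; a real device defines its own frame address layout in its configuration guide.

```python
from collections import namedtuple

# Hypothetical field layout (name, bit width), most significant field first.
FIELDS = [("block_type", 3), ("row", 5), ("column", 8), ("minor", 6)]

FrameAddress = namedtuple("FrameAddress", [name for name, _ in FIELDS])


def pack(addr):
    """Pack the hierarchical fields into a single integer address word."""
    word = 0
    for (name, width), value in zip(FIELDS, addr):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack(word):
    """Split an address word back into its fields (inverse of pack)."""
    values = []
    for _name, width in reversed(FIELDS):
        values.append(word & ((1 << width) - 1))
        word >>= width
    return FrameAddress(*reversed(values))


addr = FrameAddress(block_type=0, row=2, column=17, minor=5)
word = pack(addr)
print(hex(word), unpack(word))
```

In this picture the address word is the writing on the envelope, and the frame data that follows it is the letter inside.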


So what does a real configured/programmed FPGA look like?

Before Programming


Nice clean plate


Empty LUTs, Switches….

After Programming


Messy plate of spaghetti


Configured LUTs, Switches….

All those green things are wires that have been set up to carry data between LUTs, FFs etc…

High Performance Embedded Computing (HPEC) using FPGAs


Signal processing algorithms


Wildly useful and hence widely used


Computationally quite parallel/pipeline-amenable


Proven to accelerate well with Systolic Array designs on FPGAs


The Good of FPGAs:


FPGA vendors claim orders-of-magnitude performance advantage over DSP chips (www.xilinx.com, www.altera.com)


They can be reconfigured partially and dynamically


The Bad (no, the Ugly):


Productivity is the biggest barrier


The number of signal processing folks willing to adopt FPGAs is small and stagnant


Partial dynamic reconfiguration is very slow compared to processing speeds
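To make the systolic array idea mentioned above concrete, here is a cycle-by-cycle software model of a small array computing a matrix product. It is a minimal sketch of a generic output-stationary systolic array (each processing element owns one accumulator, operands hop between neighbours every cycle); it is not the design used in this work, and the function name is made up for illustration.

```python
def systolic_matmul(a, b):
    """Model an output-stationary systolic array computing the product a x b."""
    n, k, m = len(a), len(b), len(b[0])
    acc = [[0] * m for _ in range(n)]      # one multiply-accumulate unit per PE
    a_reg = [[0] * m for _ in range(n)]    # operand each PE hands to its right neighbour
    b_reg = [[0] * m for _ in range(n)]    # operand each PE hands to the PE below

    for t in range(n + m + k - 2):         # enough cycles for the last PE to finish
        new_a = [[0] * m for _ in range(n)]
        new_b = [[0] * m for _ in range(n)]
        for i in range(n):
            for j in range(m):
                # Left edge: row i of a enters delayed by i cycles (skewed input)
                if j == 0:
                    a_in = a[i][t - i] if 0 <= t - i < k else 0
                else:
                    a_in = a_reg[i][j - 1]
                # Top edge: column j of b enters delayed by j cycles
                if i == 0:
                    b_in = b[t - j][j] if 0 <= t - j < k else 0
                else:
                    b_in = b_reg[i - 1][j]
                acc[i][j] += a_in * b_in   # every PE does one MAC per cycle
                new_a[i][j] = a_in
                new_b[i][j] = b_in
        a_reg, b_reg = new_a, new_b
    return acc


print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Every PE only talks to its immediate neighbours and does the same simple operation each cycle, which is why this style of computation maps well onto a regular grid of FPGA resources.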


Elaborating the Good of FPGAs:


Extreme DSP computing


Elaborating the Good of FPGAs:

Partial Dynamic Reconfiguration

At some point in time……

Abruptly… say we need to quickly increase parallelism support for application α (from 4 circuits to 5)


At the cost of taking away parallelism support for the other application, β,


Because we did not have enough space on the chip to support high levels of parallelism for both applications, or


There was a power budget we couldn’t satisfy

Can we dynamically reconfigure the chip, without disturbing the execution of either application?


And do it fast enough?


Remember, programming the FPGA is a very, very, very slow process RELATIVE to execution speeds of applications

[Figure: animation of a single FPGA being repartitioned.
Start: 4 parallel processing circuits for application α, 7 for application β.
Middle: 4 parallel processing circuits for application α, 6 for application β.
End: 5 parallel processing circuits for application α, 6 for application β.]

Productivity


It’s a funny thing in the FPGA world


FPGA programmers are essentially VLSI design guys


They don’t buy $5K parts to get average performance


Every clock cycle is precious


Every LUT/FF/MAC/BRAM is precious


They don’t adopt new programming languages in a hurry


They love to have full control over every operation


Productivity, so what does it mean?


Wants an entire system on FPGA modeled, performance
predicted, designed, implemented, debugged, verified,
guaranteed timing closure, low power, high throughput….


Done really really fast, just like software


And then wants to make some minor changes and do it quickly
all over again, just like software…


Why can’t new designs be compiled, loaded onto FPGAs and tested super fast?


Need to look at traditional design flow

1. Hardware-Software partition (quick)
2. Create macro and micro architectures for hardware portion (a month, two months..)
3. Write bug free VHDL/Verilog code for architectures (a few months)
4. Synthesize, translate, map, place and route (5 to 15 hours)
5. Simulate
   If there is a functional or timing bug, you pay a penalty of a few days to weeks
6. Load configuration onto chip
   Test again.
   If there is a timing bug, you pay a penalty of several weeks
7. If you decide to make a micro architecture change, go back to step 2
8. Good luck trying to finish your project on time and budget
9. This will still not get you a dynamically reconfigurable design

One way to Improve Productivity


Stick to the traditional design flow as much as possible


FPGA users are once bitten twice shy


Very conservative and believe in the existing flow


But introduce structure into the flow, i.e. physical structure, macro-architecture structure


Make Partial Dynamic Reconfiguration (PDR) almost automatic


FPGA designers are not conversant with PDR designs


Augmented Design Flow: Exclusively for Signal Processing Algorithms


Hardware-Software Partitioning (just a concept and specific to an application)


Structured Macro-architecture via Floor Planning


Generic structure applicable to many algorithms


Structured Micro-architecture design


Project and schedule the data flow model of the signal processing kernel onto things called Sockets of the Macro-architecture


Well understood process


Embed dynamic reconfiguration capability


New technology


Works in tandem with Macro-architecture


Code, Synthesize….


Test on chip

Structured Macro-architecture


Some important Terms/Elements:


Socket: A physical region on the FPGA chip reserved by the designer to be loaded/configured with a PE. This is also called a Partial Reconfiguration Region (PRR)


Switch Box: A circuit that makes the array of Sockets re-partitionable


PE/Processing Element: A circuit/bitstream to implement a signal processing kernel’s systolic array data-flow functionality. To activate a socket, a PE must be loaded into it

Socket/PRR: Under the Hood

Yellow box: A socket/PRR

It contains BRAMs, MACs and LUTs/FFs (purple and blue/green/black stuff)

If you want to dynamically reconfigure the parallelism of Systolic Arrays on an FPGA:

All PRRs must be created with identical resources of MACs, BRAMs, LUTs, FFs.

Physical fabric of Virtex 4 SX 35 FPGA
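The identical-resources rule for PRRs can be expressed as a simple check. This is a minimal sketch: the `Resources` type, the function names and the socket capacity are hypothetical, while the DWT node requirements are the figures listed in the results table of this deck (with its DSP column mapped to `macs`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resources:
    luts: int
    ffs: int
    macs: int
    brams: int


def interchangeable(sockets):
    """True when every socket (PRR) offers exactly the same resources."""
    return len(set(sockets.values())) <= 1


def fits(needs, provides):
    """True when a PE's requirements fit inside one socket."""
    return (needs.luts <= provides.luts and needs.ffs <= provides.ffs
            and needs.macs <= provides.macs and needs.brams <= provides.brams)


# Hypothetical socket capacity, chosen only for illustration
capacity = Resources(luts=1600, ffs=1600, macs=8, brams=8)
sockets = {f"PRR{i}": capacity for i in range(5)}

# DWT node requirements as listed in the results table of this deck
dwt_node = Resources(luts=940, ffs=389, macs=0, brams=4)

print(interchangeable(sockets))   # all PRRs identical, so a PE can go in any of them
print(fits(dwt_node, capacity))   # the DWT PE fits in one socket

# A single odd-sized socket violates the identical-resources rule
sockets["PRR4"] = Resources(luts=1600, ffs=1600, macs=4, brams=8)
print(interchangeable(sockets))
```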

Switch Box: Stuff that makes the Array of Sockets Re-partitionable

Simple circuit

Need to set mux sel lines & FIFO controls

Resides in static region on FPGA

Change SB connections to change partitioning of sockets/PRRs between systolic array kernels’ nodes

Ok, time to port Macro-architecture Framework onto Chip

Virtex 4 SX 35

Static region (luminescent green stuff)


Microprocessor


Switch Boxes


Cache


Controller

PRRs/Sockets (white boxes)


To be filled with Systolic Array Processing Elements

What really happened when we tried it

Now to the Micro-architecture…

First, Hardware Software Partitioning

Example: Extended Kalman Filter (EKF). A critical navigation algorithm and a nasty signal processing kernel.

All stuff with rounded edges are tasks that can change based on the physics of the problem. So put it all in software (MicroBlaze).

All else is consistent, so put it in hardware (PolySAF).

Designing/Deriving the Processing Element: Example EKF

Works on Faddeev Algorithm to compute Schur complement

One of the many possible ways
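For readers unfamiliar with the Faddeev algorithm, the sketch below shows the textbook sequential form: stack the matrices as [[A, B], [-C, D]], use row operations from the top block to zero out -C, and the lower-right block becomes D + C·A⁻¹·B. Entering C without the negation gives the Schur complement D − C·A⁻¹·B. This is plain Gaussian elimination in software with exact fractions, assuming A is square and non-singular; it is not the systolic processing element derived in this work, and the function names are made up for illustration.

```python
from fractions import Fraction


def faddeev(a, b, c, d):
    """Return D + C * A^-1 * B by eliminating -C in [[A, B], [-C, D]]."""
    n = len(a)
    top = [[Fraction(x) for x in ra + rb] for ra, rb in zip(a, b)]
    bottom = [[Fraction(-x) for x in rc] + [Fraction(x) for x in rd]
              for rc, rd in zip(c, d)]

    for col in range(n):
        # Pick a non-zero pivot among the remaining rows of the top block
        pivot = next((r for r in range(col, n) if top[r][col] != 0), None)
        if pivot is None:
            raise ValueError("A is singular")
        top[col], top[pivot] = top[pivot], top[col]
        # Clear this column in the lower top rows and in every bottom row
        for row in top[col + 1:] + bottom:
            factor = row[col] / top[col][col]
            for j in range(col, len(row)):
                row[j] -= factor * top[col][j]

    return [row[n:] for row in bottom]


def schur_complement(a, b, c, d):
    """D - C * A^-1 * B, obtained by feeding -C to the Faddeev elimination."""
    negated_c = [[-x for x in row] for row in c]
    return faddeev(a, b, negated_c, d)


A = [[2, 1], [1, 3]]
B = [[1], [2]]
C = [[1, 1]]
D = [[0]]
print(faddeev(A, B, C, D), schur_complement(A, B, C, D))
```

Because the whole computation is one regular elimination sweep, many matrix expressions (products, inverses, sums) can be obtained just by choosing what to place in A, B, C and D.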


Port

Code, Synthesize, …Optimize


Port: Code, synthesize, Translate, Map, Place and Route


For One Socket/PRR (just a few days’ worth of work)


Move Nets around to meet timing: Manually pick up a wire in this small bowl of spaghetti of wires, and move it around.


Nuisance of a task, but necessary


But you need to do it only in one PRR (just a few hours’ worth of work)


Copy the locally optimized bitstream/circuit of the one PRR to all PRRs


Automatically obtain Global Timing closure for the PolySAF


If Microprocessor, Cache are retained for multiple designs, then global timing closure for whole chip is also automatically gifted to you

Have we answered the Productivity problem?

Time to Grade the Approach


Need to look at traditional design flow

1. Hardware-Software partition (quick)
2. Create macro and micro architectures for hardware portion (a month, two months..)
   Applicable to a wide range of Sig. Proc. Algorithms
3. Write bug free VHDL/Verilog code for architectures (a few months)
   Reuse most of the macro structure and code only for one PRR
4. Synthesize, translate, map, place and route (5 to 15 hours)
   Do for only one PRR
5. Simulate
   If there is a functional or timing bug, you pay a penalty of a few days to weeks
6. Load configuration onto chip
   Test again.
   If there is a timing bug, you pay a penalty of several weeks
7. If you decide to make a micro architecture change, go back to step 3
8. Good luck trying to finish your project on time and budget



Want the details, the math, the algorithms etc?


Read this paper


A. Sudarsanam, R. Barnes, A. Dasu, J. Carver, and R. Kallam, “Dynamically Reconfigurable Systolic Array Accelerators: A case study with EKF and DWT Algorithms,” IET/IEE Computers & Digital Techniques, Vol. 4, Issue 1, Jan 2010.


Author preprint available online at Reconfigurable Computing Group


www.usu.edu/rcg

Now, onto Partial Dynamic Reconfiguration in the PolySAF

3 nodes EKF, 2 nodes DWT

Detach Socket

2 nodes EKF, 2 nodes DWT

Reconfigure, Reset new PRR, Re-attach

2 nodes EKF, 3 nodes DWT


DWT: discrete wavelet transform. The kernel used in JPEG 2000 image compression
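The detach, reconfigure, reset and re-attach sequence above can be walked through in a few lines. This is a bookkeeping sketch only: sockets are plain dictionaries, and the field and function names are hypothetical. It models which kernel each socket serves, not the hardware that does the switching.

```python
def node_counts(sockets):
    """Count attached sockets per kernel, e.g. {'EKF': 3, 'DWT': 2}."""
    counts = {}
    for socket in sockets:
        if socket["attached"]:
            counts[socket["kernel"]] = counts.get(socket["kernel"], 0) + 1
    return counts


def repartition(sockets, index, new_kernel, log):
    """Move one socket from its current kernel's array to new_kernel's array."""
    socket = sockets[index]
    socket["attached"] = False            # 1. detach: switch boxes bypass the socket
    log.append(("detach", node_counts(sockets)))
    socket["kernel"] = new_kernel         # 2. reconfigure: load the new PE
    socket["state"] = 0                   # 3. reset the newly configured PRR
    socket["attached"] = True             # 4. re-attach to the other array
    log.append(("re-attach", node_counts(sockets)))


sockets = [{"kernel": k, "attached": True, "state": 1}
           for k in ["EKF", "EKF", "EKF", "DWT", "DWT"]]
log = []
repartition(sockets, 2, "DWT", log)
for step, counts in log:
    print(step, counts)
```

The intermediate state (2 EKF nodes, 2 DWT nodes) is the point of the detach step: both kernels keep running on their remaining nodes while the one socket is being reloaded.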

How to Physically Reconfigure PRR?


Known Methods


Comparison of all known options


Best known technique: from Microsoft Research Labs (2008) eMIPS project

Too slow, too expensive (hogs up valuable on-chip BRAMs)

Embedding Dynamic Reconfiguration into the System


Active Bitstream (PRR) to PRR: Hardware Circuit

[Figure: block diagram of the FPGA. Labels: ARC, ICAP, ICAP wrapper, snoop, active bitstream, PRR (source), PRR (destination), PRR (destination).]

Accelerated Relocation Circuit (ARC)


Manipulate Frame addresses


FAR is Frame address register


Lots of unnecessary overhead can be avoided


No need for CRC processing
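The core idea of relocation by frame address manipulation can be sketched in software: the configuration data is copied untouched and only the address attached to each frame changes. The frame representation below, a `(row, column, minor)` tuple plus a payload list, is a hypothetical simplification for illustration and not the actual ARC hardware or device address format.

```python
def relocate(frames, src_column, dst_column):
    """Retarget frames from a source PRR to a destination PRR.

    frames: list of ((row, column, minor), payload) pairs from the source PRR.
    Only the column field of each address is rewritten; payloads pass through.
    """
    offset = dst_column - src_column
    relocated = []
    for (row, column, minor), payload in frames:
        relocated.append(((row, column + offset, minor), payload))
    return relocated


# Three frames of a source PRR sitting at column 10 (made-up contents)
source_frames = [((0, 10, minor), [0xA5A5A5A5] * 4) for minor in range(3)]
moved = relocate(source_frames, src_column=10, dst_column=22)
print([address for address, _payload in moved])
```

Since the payload is never interpreted or regenerated, the sketch has no per-frame processing beyond one address rewrite, which mirrors the overhead-avoidance argument in the bullets above.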


Results…reconfiguration times in millisecs

All systems run @ 100 MHz

Footprint of ARC: 1064 LUTs, 638 FFs and 1 BRAM

* Estimated values for state of the art competing technologies

Column groups, left to right: Test circuit; Resources (LUT, FF, DSP, BRAM); Bitstream size (bytes); No. of frames; ARC; BiRF* (IEEE TVLSI 2009); Microsoft* (Tech. Report 2008)

Test circuit          LUT   FF    DSP  BRAM  Bitstream  Frames  Same side/  Same    BRAM  Same   Opp
                                             (bytes)            Opp side    side          side   side
PolySAF node
FSA_no_DSP            486   273   0    0     31159      195     0.48        84.7    14    3.38   8.86
DSA_no_DSP            438   273   0    0     30693      195     0.48        83.4    14    3.33   8.73
Matrix_Mult no_DSP    1234  988   0    0     68469      432     1.07        186.1   30    7.42   19.47
FSA_with_DSP          423   216   1    0     32349      195     0.48        87.9    15    3.50   9.20
DSA_with_DSP          375   216   1    0     32349      195     0.48        89.8    15    3.58   9.20
Matrix_Mult with_DSP  502   466   8    0     65261      432     1.07        177.3   29    7.07   18.56
RFT cases
DCT                   1419  1636  8    8     44397      540     1.34        120.64  22    4.81   12.62
CSC                   318   438   1    12    17313      301     0.74        47.04   9     1.87   4.92
DWT                   940   389   0    4     47897      303     0.75        130.2   21    5.19   13.62
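A quick consistency check on the table: dividing each ARC time by the frame count at the stated 100 MHz clock gives a nearly constant cost per frame. The sketch below assumes the first timing value in each row is the ARC reconfiguration time in milliseconds; the function name is made up.

```python
CLOCK_HZ = 100e6  # "All systems run @ 100 MHz"

# (test circuit, number of frames, ARC reconfiguration time in ms) from the table
arc_results = [
    ("FSA_no_DSP", 195, 0.48),
    ("Matrix_Mult no_DSP", 432, 1.07),
    ("DCT", 540, 1.34),
    ("CSC", 301, 0.74),
    ("DWT", 303, 0.75),
]


def cycles_per_frame(frames, time_ms):
    """Clock cycles spent per configuration frame."""
    return time_ms * 1e-3 * CLOCK_HZ / frames


for name, frames, time_ms in arc_results:
    print(f"{name:20s} {cycles_per_frame(frames, time_ms):6.1f} cycles/frame")
```

All five rows land between roughly 245 and 250 cycles per frame, which suggests the ARC time scales linearly with the number of frames in the PRR rather than with the logic inside it.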

Next steps…

Improve, Formalize and Collaborate


Performance prediction model


Predict how big the circuit will be and how it will perform, using Excel and Matlab


Big leap in productivity


Arithmetic precision manipulation is extraordinarily powerful when it comes to FPGAs


If the right non-IEEE precision can be chosen for a signal processing application, then you can save medium to massive amounts of area and power in the circuit mapped onto the FPGA


Great opportunity for Small Satellites


Efficient communication between Microprocessor and PolySAF via threads


Validate and brutally test this on a large number of algorithms (FFTs, Filters, Hyperspectral processing…..)


NASA can help with this


Technology is attractive for software defined radios, precision navigation…
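The arithmetic precision point above can be explored before any hardware is written. The sketch below quantizes a test signal to fixed-point formats with different numbers of fractional bits and reports the worst-case rounding error. It is an illustration under simple assumptions (a sine test signal, round-to-nearest, no overflow handling) and says nothing about the resulting area or power, which depend on the device and design.

```python
import math


def quantize(value, frac_bits):
    """Round value to the nearest multiple of 2**-frac_bits (fixed point)."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


# One period of a sine wave as a stand-in for real sensor data
samples = [math.sin(2 * math.pi * n / 64) for n in range(64)]

for frac_bits in (8, 12, 16):
    worst = max(abs(s - quantize(s, frac_bits)) for s in samples)
    print(f"{frac_bits:2d} fractional bits: worst-case error {worst:.2e}")
```

A designer would pick the narrowest width whose error still meets the application's budget, since every bit removed shrinks the multipliers and adders that have to be mapped onto the FPGA.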


Kaleidoscope: Future of FPGA


Near term


Maybe better tools to program and debug FPGAs?


Mentor’s Catapult, AutoESL compiler, Synfora compiler….


Maybe some sort of standardization in FPGA programming


Hopefully DARPA HPCS program will produce something


Longer term (Revolutionary things to come)


Vertically Integrated FPGA + DRAM on a single chip


1000x improvement in performance/watt


Visit Micron Research Center at USU to learn more


www.usu.edu/mrc




Acknowledgements


Joe Bredekamp and the NASA AISR program


Applied Information Systems Research


Funding from NASA is valuable


Focused research


Want my technology to be adopted for real missions


Xilinx and Mentor Graphics (donated > $100K worth of software)


My Grad Students



