Application-Specific Customization and Scalability of Soft Multiprocessors


Deepak Unnikrishnan
Chair: Prof. Russell Tessier
Master’s Thesis Defense
Funded by Altera Corporation, National Science Foundation

Outline

  Motivation
  Previous work
  Design components
  Approach
  Results
  Conclusion



Motivation

  Emerging soft multiprocessor systems and applications
  Fully automated soft multiprocessor design
    Ease of use
    Verifiability: existing parallel benchmarks
    Flexibility: application-specific customizations
  Applications:
    Multi-core prototyping
    End-to-end product designs



Soft multiprocessor synthesis

  FPGA-based soft multiprocessor systems for specific applications
    IP packet forwarding [1]
    MPEG/JPEG
    Synthesis of latency/throughput-constrained stream applications [2]
  Limitations
    Tuned for a specific application
    No individual processor optimizations
    Not scalable

  [1] “An FPGA-based soft multiprocessor system for IPv4 packet forwarding,” Ravindran et al., FPL 2005.
  [2] “Efficient automated synthesis, programming and implementation of multi-processor platforms on FPGA chips,” Nikolov et al., FPL 2006.



Optimization/Interconnection

  Soft-processor optimization techniques
    Pipeline stages, ISA, shifter, forwarding logic [1]
    Custom hardware [2]
    Instruction scheduling and recoding [3]
  Interconnects
    Bus/Network on Chip
    Topologies
      Ring, star, mesh, hypercube [4]
  Limitations
    Isolated evaluation of design tradeoffs
    Limited benchmarks

  [1] “Application-specific customization of soft processor microarchitecture,” Yiannacouras et al., FPGA 2006.
  [2] “CUSTARD - A customizable threaded FPGA soft processor and tools,” Dimond et al., FPL 2007.
  [3] “Combining Instruction Coding and Scheduling to Optimize Energy in System-on-FPGA,” Dimond et al., FCCM 2006.
  [4] “Routability of Network Topologies in FPGAs,” Saldana et al., TVLSI, March 2007.



Design Flow

  [Figure: design flow block diagram. Inputs: Streamit app, number of processors, topology, custom features, processor templates (SPREE). Blocks: Streamit compiler, SoftCoreMapper, SPREE gcc, binary profiler, soft multiprocessor generator, Quartus flow. Intermediate products: computation and communication code for soft multiprocessors, multiprocessor system designs. Output: area, performance, and power evaluation.]



Streamit Example

  [Figure: Software Radio stream graph. Nodes: AtoD, FMDemod, Duplicate, three parallel branches (LPF1 and HPF1, LPF2 and HPF2, LPF3 and HPF3), RoundRobin, Adder, Speaker.]

    void->void pipeline FMRadio(int N, int freq1, int freq2) {
        add AtoD();
        add FMDemod();
        add splitjoin {
            split duplicate;
            for (int i = 0; i < N; i++) {
                add pipeline {
                    add LowPassFilter();
                    add HighPassFilter();
                }
            }
            join roundrobin();
        }
        add Adder();
        add Speaker();
    }

  Courtesy: “Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs,” Gordon et al., ASPLOS 2006.
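  The splitjoin in the FMRadio example can be read as follows: split duplicate hands a copy of each input item to every branch, and join roundrobin collects one output item from each branch in turn. The Python sketch below models only that dataflow, under the assumption that each branch is a stateless one-in, one-out function. The real LowPassFilter and HighPassFilter stages are not modeled, and the names splitjoin_duplicate_roundrobin, adder, and the gain lambdas are illustrative, not taken from the slides.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for a branch (e.g. LPF followed by HPF): one item in, one item out.
Branch = Callable[[float], float]


def splitjoin_duplicate_roundrobin(branches: Sequence[Branch],
                                   samples: Sequence[float]) -> List[float]:
    """Model 'split duplicate' followed by 'join roundrobin'."""
    out: List[float] = []
    for x in samples:
        # split duplicate: every branch receives the same input item.
        # join roundrobin: outputs are collected one per branch, in branch order.
        for branch in branches:
            out.append(branch(x))
    return out


def adder(stream: Sequence[float], n: int) -> List[float]:
    """Sum each consecutive group of n items, as a stand-in for the Adder stage."""
    return [sum(stream[i:i + n]) for i in range(0, len(stream) - n + 1, n)]


if __name__ == "__main__":
    # Three branches with made-up gains in place of the band filters.
    branches = [lambda x: 1.0 * x, lambda x: 2.0 * x, lambda x: 3.0 * x]
    joined = splitjoin_duplicate_roundrobin(branches, [1.0, 2.0])
    print(joined)                        # interleaved branch outputs
    print(adder(joined, len(branches)))  # one sum per input sample
```

  With N branches the joiner emits N items per input item, which is why the stand-in Adder consumes N items at a time.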







Streamit Compiler Extensions

  [Figure: compiler flow from a Streamit application to soft multiprocessor code. Stage labels: parsing and graph expansion, partitioning, layout, scheduling, static order calculation, code generation, dependency analysis, topology-based rescheduling, dead code elimination. The Streamit and SoftCoreMapper blocks are each shown with computation and communication outputs.]



SPREE

  Soft Processor Rapid Exploration Environment
  Automatic processor generation from processor descriptions
  Fine-grained microarchitectural customizations
    Pipeline stages
    Data path
    Instruction set
  Excellent platform for hardware-software co-design evaluation

  [Figure: SPREE flow. Labels: processor description, RTL generator, Verilog processor designs, Quartus CAD flow, area/power/frequency, C app, MIPS gcc.]

  Courtesy: “Application-specific customization of soft processor microarchitecture,” P. Yiannacouras et al., FPGA 2006.



Multiprocessor architecture

  [Figure: processors numbered 0 to 5; two processor pipelines are drawn in detail with stages IF/D, EX/M, WB.]

  Key architectural features:
    Software flow control
    Memory-mapped I/O ports
    Local on-chip memories
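  To illustrate what software flow control over memory-mapped ports can look like, here is a small Python model. It is a sketch under stated assumptions, not the thesis implementation: the Link class stands in for a bounded interconnect buffer whose status a processor can poll, and run simulates a sender that is given two write attempts for every read attempt by the receiver. The class, its method names, and the 2:1 rate are all hypothetical.

```python
from collections import deque


class Link:
    """Bounded FIFO standing in for an interconnect buffer between two processors."""

    def __init__(self, depth: int):
        self.depth = depth
        self.fifo = deque()

    def can_write(self) -> bool:
        # Models polling a "not full" status flag before a store to the port.
        return len(self.fifo) < self.depth

    def can_read(self) -> bool:
        # Models polling a "not empty" status flag before a load from the port.
        return len(self.fifo) > 0

    def write(self, word: int) -> None:
        assert self.can_write(), "software must check can_write() first"
        self.fifo.append(word)

    def read(self) -> int:
        assert self.can_read(), "software must check can_read() first"
        return self.fifo.popleft()


def run(depth: int, words):
    """Send words over a Link; return (received words, number of sender stalls)."""
    link = Link(depth)
    pending = deque(words)
    received = []
    sender_stalls = 0
    while len(received) < len(words):
        # Sender: two attempts per step (assumed to be faster than the receiver).
        for _ in range(2):
            if pending:
                if link.can_write():
                    link.write(pending.popleft())
                else:
                    sender_stalls += 1  # buffer full: the sender has to retry
        # Receiver: one attempt per step.
        if link.can_read():
            received.append(link.read())
    return received, sender_stalls


if __name__ == "__main__":
    print(run(depth=1, words=[1, 2, 3]))
    print(run(depth=4, words=[1, 2, 3]))
```

  In this toy model a deeper buffer absorbs the rate mismatch and the sender stalls less often, which is the kind of tradeoff the interconnect buffer size parameter exposes.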



Multiprocessor Optimization

  Topology
  Interconnect buffer size
  Pipeline stages
  Instruction set architecture
  Memory size



Experimental Framework

  Multiprocessor systems of size 4, 6, 9, and 16
  Altera Quartus II 8.0 / Modelsim 6.1g
  Target platforms
    90 nm Cyclone II (Altera DE2 board)
    90 nm Stratix II
    65 nm Stratix III (Altera DE3 board)
  16 soft-multiprocessor systems implemented on DE3 (65 nm Stratix III)



Benchmarks

  Application   Description
  FMRadio       FM radio with multi-band equalizer
  Equalizer     Multiband equalizer for audio applications
  Autocor       Filter to generate autocorrelation series for an input
  Lattice       Ten-stage lattice filter
  Filterbank    Filterbank for multirate signal processing
  FFT           Fast Fourier transform kernel
  BitonicSort   High-performance sorting network
  DES           Implementation of the DES encryption algorithm



Experimental Results - Topology

  Topology
    Mesh
    Point-to-point

  1. Communication schedule graph
  2. For each generated data item
  3. {
  4.   Discover hop edges
  5.   Eliminate hop edges
  6.   Insert point-to-point edges
  7. }
  8. Reschedule communication

  [Figure: mesh topology with processors 0 to 5, annotated with per-switch routes (S->In, Out->E, W->E, S->E, W->S, Out->N, W->In, N->In), next to the point-to-point topology with processors 0, 3, 4, 5, annotated with direct routes (3->In, Out->5, Out->0, Out->4, 5->In).]
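  The eight steps above can be sketched in Python as follows. This is a minimal illustration under assumptions, not the SoftCoreMapper code: a route is represented as the list of processor ids a data item visits on the mesh (a made-up format), hop elimination replaces the route with one direct source-to-destination edge, and "rescheduling" simply emits send and receive operations in item order. The function name mesh_to_point_to_point and the example routes are hypothetical.

```python
from collections import defaultdict


def mesh_to_point_to_point(routes):
    """Replace multi-hop mesh routes with direct point-to-point edges.

    routes: dict mapping a data item name to the processor ids on its mesh
            route, source first and destination last (assumed format).
    Returns (links, hops_removed, schedule).
    """
    links = set()
    hops_removed = 0
    schedule = defaultdict(list)
    for item, path in routes.items():
        if len(path) < 2:
            raise ValueError(f"route for {item!r} needs a source and a destination")
        # Discover hop edges: consecutive processor pairs along the mesh route.
        hop_edges = list(zip(path, path[1:]))
        # Eliminate hop edges: every edge beyond the first was a forwarding hop.
        hops_removed += len(hop_edges) - 1
        # Insert one point-to-point edge from source to destination.
        src, dst = path[0], path[-1]
        links.add((src, dst))
        # Reschedule communication: naive in-order send/receive lists per processor.
        schedule[src].append(("send", item, dst))
        schedule[dst].append(("recv", item, src))
    return links, hops_removed, dict(schedule)


if __name__ == "__main__":
    # Made-up routes on a 2x3 mesh numbered 0..2 (top row) and 3..5 (bottom row).
    routes = {"a": [3, 4, 5], "b": [3, 0], "c": [3, 4]}
    links, hops_removed, schedule = mesh_to_point_to_point(routes)
    print(sorted(links))
    print(hops_removed)
    for proc in sorted(schedule):
        print(proc, schedule[proc])
```

  In the example, processor 4 no longer forwards item "a", at the cost of a dedicated link from 3 to 5.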



Layout of 16 processors on a Stratix II device [1]

  [Figure: chip layout with regions labeled 1 to 4.]

  [1] Generated using Quartus Chip Planner for Stratix II



Future Work

  Evaluate the impact of Streamit compiler optimizations on soft-multiprocessor systems
    Example: choice of partitioning (greedy, dynamic programming)
  Evaluate the effect of increasing processor pipelining on soft-multiprocessor systems
  More aggressive processor optimizations
    Application-specific on-chip memory size reduction
  Target 32-64 soft processors on FPGA with larger applications
  Impact of optimizations on system power



Thesis Completion Timeline

  Action item                                                        Date
  Development of framework                                           In progress
  Evaluation of application-specific buffer sizing and memory
    sizing on application performance                                December 30th
  Evaluation of impact of optimizations on power consumption         January 15th
  Compiler-level soft-multiprocessor optimizations                   February 20th
  Study of varying pipeline stages on multiprocessor systems         March 20th
  Document completion and defense                                    April 30th



Conclusion

  Fully automatic and scalable flow for design evaluation of large soft-multiprocessor systems
    Choice of interconnect topologies
    Choice of multiple system-level optimizations
    Diverse set of benchmarks
  Preliminary results indicate 3-5X speedup for 16-processor systems and 27% area savings
  Mesh vs. point-to-point interconnect topologies were evaluated
  Work presented at FCCM 2009, Napa Valley, CA



References

  [1] J. Cong, G. Han, W. Jiang, “Synthesis of an application-specific soft multiprocessor system,” in International Conference on Field Programmable Logic and Applications, 2007.
  [2] K. Ravindran, N. Satish, Y. Jin, K. Keutzer, “An FPGA-based soft multiprocessor system for IPv4 packet forwarding,” in International Conference on Field Programmable Logic and Applications (FPL), August 2005.
  [3] H. Nikolov, T. Stefanov, E. Deprettere, “Efficient automated synthesis, programming, and implementation of multi-processor platforms on FPGA chips,” in International Conference on Field Programmable Logic and Applications (FPL), August 2006.
  [4] M. I. Gordon, W. Thies, S. Amarasinghe, “Exploiting coarse-grained task, data, and pipeline parallelism in stream programs,” in International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 2006.
  [5] P. Yiannacouras, J.G. Steffan, J. Rose, “Application-specific customization of soft processor microarchitecture,” in International Symposium on Field-Programmable Gate Arrays (FPGA), February 2006.
  [6] M. Taylor, “The Raw prototype design document,” Technical Report, Massachusetts Institute of Technology, 2005.
  [7] J.P. Derutin, L. Damez, A. Desportes, J.L.L. Galilea, “Design of a scalable network of communicating soft processors on FPGA,” in International Workshop on Computer Architecture for Machine Perception and Sensing (CAMPS), September 2006.



References (continued)

  [8] O. Hebert, I.C. Kraljic, Y. Savaria, “A method to derive application-specific embedded processing cores,” in International Conference on Hardware Software Codesign (CODES), September 2000.
  [9] R. Dimond, O. Mencer, W. Luk, “CUSTARD - A customizable threaded FPGA soft processor and tools,” in International Conference on Field Programmable Logic and Applications (FPL), August 2007.
  [10] R.G. Dimond, O. Mencer, W. Luk, “Combining Instruction Coding and Scheduling to Optimize Energy in System-on-FPGA,” in IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), April 2006.
  [11] B. Fort, D. Capalija, Z. Vranesic, S. Brown, “A multithreaded soft processor for SoPC area reduction,” in IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM), April 2006.
  [12] M. Labrecque, J.G. Steffan, “Improving pipelined soft processors with multithreading,” in International Conference on Field-Programmable Logic and Applications (FPL), August 2007.
  [13] M.A.R. Saghir, M. El-Majzoub, P. Akl, “Datapath and ISA customization for soft VLIW processors,” in IEEE International Conference on Reconfigurable Computing and FPGAs (ReConFig), September 2006.
  [14] M. Saldana, L. Shannon, J.S. Yue, S. Bian, J. Craig, P. Chow, “Routability of Network Topologies in FPGAs,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, March 2007.



[Figure: detailed design flow. Labels: Streamit application; Streamit compiler (RAW backend); tile code; switch code; SoftCoreMapper; tile code for soft processors; SPREE MIPS gcc compiler; memory initialization binaries; binary profiler; target number of processors; interconnect topology; customizable processor templates; application-specific soft-multiprocessor system generator; Verilog multiprocessor system designs; Quartus compiler flow; area, performance, and power evaluation.]