Architecture and Design Automation for Application-Specific Processors

Philip Brisk

Assistant Professor

Dept. of Computer Science and Engineering

University of California, Riverside

IEEE 9th International Conference on ASIC (ASICON)

Xiamen, China


October 26, 2011

Acknowledgment

The vast majority of slides in this presentation are taken from the Ph.D. thesis of my friend and collaborator, Dr. Theo Kluter (Ph.D., EPFL, 2010)


Five-Stage RISC Pipeline

[Figure: I$, RF, D$, and RF blocks over the pipeline stages Fetch, Decode, Execute, Memory, Write-back]

Application-Specific Custom Unit (ASCU) for Instruction Set Extensions (ISEs)

[Figure: five-stage pipeline (I$, RF, D$, RF; Fetch, Decode, Execute, Memory, Write-back) with an ASCU added]

Automatic ISE Identification

[Figure: five-stage pipeline with ASCU (I$, RF, D$, RF; Fetch, Decode, Execute, Memory, Write-back); additional labels: Applications, Compiler, HW Synthesis, Assembly code with ISEs]

Overview


Architecture


Compilation and Synthesis


Conclusion


Overview


Architecture


Custom ISE Logic


I/O Bandwidth


Local memories and coherence


Compilation and Synthesis


Conclusion


Example: Luminance Conversion in JPEG Compression

19 cycles in software

17-bit values

Fixed-point
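As a rough illustration of the kernel on this slide, the sketch below computes luminance in fixed point. The weights (0.299, 0.587, 0.114) are the standard JFIF luma coefficients; the 16-bit scaling and the names `SCALE_BITS`, `fix`, and `luminance` are assumptions made for illustration, since the slides do not give the exact arithmetic behind the 19-cycle software version.

```python
SCALE_BITS = 16  # assumed fixed-point fraction width (not stated on the slides)


def fix(x: float) -> int:
    """Convert a real coefficient to a rounded fixed-point integer."""
    return int(x * (1 << SCALE_BITS) + 0.5)


# Standard JFIF luma weights, scaled to integers.
C_R, C_G, C_B = fix(0.299), fix(0.587), fix(0.114)


def luminance(r: int, g: int, b: int) -> int:
    """Y = 0.299 R + 0.587 G + 0.114 B using only integer multiply, add, shift."""
    half = 1 << (SCALE_BITS - 1)  # rounding term added before the shift
    return (C_R * r + C_G * g + C_B * b + half) >> SCALE_BITS


print(luminance(255, 255, 255), luminance(255, 0, 0), luminance(0, 0, 0))
```

The three multiplies, the adds, and the final shift are the kind of operations an ISE would collapse into one custom instruction.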

Custom Hardware Implementation

One single-ported memory

4 to 5 cycles (3 loads, 1 arithmetic, 1 store)

Speedup: 3.8x to 4.8x

R, G, B, and Y Memories

1 cycle for everything

Speedup: 19x
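The speedups on this slide follow from dividing the 19-cycle software count by the cycle count of each hardware option. A minimal sketch of that arithmetic, where the function name `speedup` is an illustrative choice:

```python
def speedup(sw_cycles: int, hw_cycles: int) -> float:
    """Ratio of software cycles to cycles with custom hardware."""
    return sw_cycles / hw_cycles


SW_CYCLES = 19  # software cycle count quoted on the slides

options = [
    ("single-ported memory, 5 cycles", 5),
    ("single-ported memory, 4 cycles", 4),
    ("separate R, G, B, and Y memories, 1 cycle", 1),
]
for label, hw in options:
    print(f"{label}: {speedup(SW_CYCLES, hw):.2f}x")
```

The 4-cycle case evaluates to 4.75x, which the slide rounds to 4.8x.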

Custom ISE Logic

RF has 2 read ports

RF has 1 write port

Architectural Limitations


Load data from memory into RF


RF I/O bandwidth


Performance


7 cycles (3 loads, 2 ASCU, 1 store)


Speedup: 3.1x

I/O Bandwidth Constraint


AES Algorithm


Single round


4 stages



Best ISE


22 inputs


22 outputs


[Verma, Brisk, and Ienne, CASES 2007 & TCAD 2010]



RF I/O constraints


Noticeable slowdown



Pipeline Forwarding

[Jayaseelan et al., DAC 2006]

1 output

I/O Bandwidth limitations


Input bandwidth depends on number of pipeline stages


Does not increase output bandwidth

Complicates instruction scheduling

Register File Clustering

4 inputs

1 output

[Karuri et al., ICCAD 2007]

I/O Bandwidth limitations


Input bandwidth depends on number of clusters


Does not increase output bandwidth

Compiler must eliminate inter-cluster copies

More clusters => more copies

NP-Hard


Shadow Registers

[Cong et al., FPGA 2005]

1 output

I/O Bandwidth


No limitation on input bandwidth


Does not increase output bandwidth

Increases ISA bitwidth


Architecturally Visible Storage


DMA transfers data between memory and AVS


Coherence problem between AVS and D$

[Biswas et al., DATE 2006, TCAD 2007]

Example: IDCT (from JPEG)

The Coherence Problem

Overview


Architecture


Custom ISE Logic


I/O Bandwidth


Local memories and coherence


Coherent and Speculative DMA


Virtual Ways


Way Stealing


Compilation and Synthesis


Conclusion


Coherent DMA

Speculative DMA

Coherent DMA loads and evicts the array from AVS during each iteration.

Speculative DMA waits until the array is overwritten in AVS memory by other data, or until the data is read or written by the D$.

Virtual Ways

AVS vs. Traditional Cache

AVS and Cache Ways are Similar

if AVS Memory has 1-input, 1-output

Way Stealing

No AVS memories (reduced area)

No coherence protocol

Coherent AVS Summary


Speculative DMA


Requires a coherence protocol


Lots of bus traffic


Good solution for coherent multiprocessor systems


No limit on AVS memory organization


Uses standard cache IPs



Virtual Ways


Requires non-traditional cache controller


No limit on AVS memory organization



Way Stealing


Requires non-traditional cache


Number of ways limits number of AVS memories


All AVS memories have 1-input, 1-output


Keeps AVS memories within the cache


Overview


Architecture


Compilation and Synthesis


ISE Identification Algorithms


Conclusion


SW and HW Costs

Convex and Non-Convex Cuts
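A cut is convex when no path in the data-flow graph leaves the cut and later re-enters it; otherwise the custom instruction would need a result that depends on its own output. The sketch below tests this on a small DAG. The representation (a `succ` dictionary of successor lists) and the name `is_convex` are assumptions for illustration.

```python
def is_convex(succ, cut):
    """Return True if no path leaves `cut` and re-enters it.

    succ maps every node of the DAG to a list of its successors.
    """
    cut = set(cut)
    # Start from nodes outside the cut that are fed directly by the cut.
    stack = [w for v in cut for w in succ[v] if w not in cut]
    seen = set()
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        for w in succ[v]:
            if w in cut:
                return False  # an outside node feeds back into the cut
            stack.append(w)
    return True


# Hypothetical graph: a -> b -> c and a -> c.
succ = {"a": ["b", "c"], "b": ["c"], "c": []}
print(is_convex(succ, {"a", "b"}))  # b is inside, nothing re-enters
print(is_convex(succ, {"a", "c"}))  # path a -> b -> c passes through b outside
```

In the example, `{a, b}` is convex while `{a, c}` is not, because `b` lies on a path between two nodes of the cut.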

Integrating AVS Memories

Single Cycle ISE Identification Problem


Legality Constraints:


Convex cut


Contains no forbidden nodes


Number of inputs/outputs matches architectural constraints


(e.g., 2 RF inputs, 1 RF output)



Objective:


Find the legal cut that maximizes speedup
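The forbidden-node and I/O constraints above can be checked by counting the distinct values that cross the boundary of a cut. The following is a minimal sketch under assumed conventions: the graph is a `succ` dictionary of successor lists, external values and loads are modeled as forbidden nodes, and `live_out` marks results used after the basic block. Convexity is assumed to be checked separately. All names are hypothetical.

```python
def cut_io(succ, cut, live_out=frozenset()):
    """Return (number of inputs, number of outputs) of `cut`."""
    cut = set(cut)
    # Inputs: outside nodes whose value is consumed by some node in the cut.
    inputs = {u for u, ws in succ.items()
              if u not in cut and any(w in cut for w in ws)}
    # Outputs: cut nodes whose value is used outside the cut or is live-out.
    outputs = {v for v in cut
               if v in live_out or any(w not in cut for w in succ[v])}
    return len(inputs), len(outputs)


def meets_constraints(succ, cut, forbidden, live_out=frozenset(),
                      max_in=2, max_out=1):
    """Check the forbidden-node and RF port constraints for `cut`."""
    if set(cut) & set(forbidden):
        return False
    n_in, n_out = cut_io(succ, cut, live_out)
    return n_in <= max_in and n_out <= max_out


# Hypothetical graph: add = (x * y) + ld, where ld is a memory load.
succ = {"x": ["mul"], "y": ["mul"], "ld": ["add"], "mul": ["add"], "add": []}
forbidden = {"x", "y", "ld"}
live_out = {"add"}

print(meets_constraints(succ, {"mul"}, forbidden, live_out))
print(meets_constraints(succ, {"mul", "add"}, forbidden, live_out))
print(meets_constraints(succ, {"mul", "add"}, forbidden, live_out, max_in=3))
```

With 2 RF inputs and 1 RF output, `{mul}` is acceptable but `{mul, add}` needs three inputs and is rejected until the input limit is raised.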

Algorithms for ISE Identification


Optimal (Exponential worst-case runtime)

Branch-and-bound search

Integer Linear Program Formulation

Iterative Improvement

Evolutionary algorithms

Simulated annealing

Polynomial-time Heuristics
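As a baseline for the optimal methods listed above, the sketch below enumerates every subset of candidate nodes, keeps the legal ones, and returns the cut with the largest estimated gain. This is plain brute force, not the pruned branch-and-bound search or the ILP formulation from the literature; those explore the same space more cleverly. The merit function (software cycles saved minus the cut's critical-path delay rounded up to whole cycles), the example graph, and all names and latencies are assumptions for illustration.

```python
from itertools import combinations
from math import ceil


def is_legal(succ, cut, live_out, max_in, max_out):
    """Check the I/O limits and convexity of `cut` (a frozenset of nodes)."""
    inputs = {u for u, ws in succ.items()
              if u not in cut and any(w in cut for w in ws)}
    outputs = {v for v in cut
               if v in live_out or any(w not in cut for w in succ[v])}
    if len(inputs) > max_in or len(outputs) > max_out:
        return False
    # Convexity: no path may leave the cut and re-enter it.
    stack = [w for v in cut for w in succ[v] if w not in cut]
    seen = set()
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        for w in succ[v]:
            if w in cut:
                return False
            stack.append(w)
    return True


def hw_cycles(succ, cut, hw_delay):
    """Critical-path delay through the cut, rounded up to whole cycles."""
    memo = {}

    def longest(v):
        if v not in memo:
            tail = max((longest(w) for w in succ[v] if w in cut), default=0.0)
            memo[v] = hw_delay[v] + tail
        return memo[v]

    return ceil(max(longest(v) for v in cut))


def best_cut(succ, sw_cycles, hw_delay, forbidden, live_out,
             max_in=2, max_out=1):
    """Exhaustively search all subsets of non-forbidden nodes."""
    candidates = [v for v in succ if v not in forbidden]
    best, best_gain = frozenset(), 0
    for k in range(1, len(candidates) + 1):
        for nodes in combinations(candidates, k):
            cut = frozenset(nodes)
            if not is_legal(succ, cut, live_out, max_in, max_out):
                continue
            gain = sum(sw_cycles[v] for v in cut) - hw_cycles(succ, cut, hw_delay)
            if gain > best_gain:
                best, best_gain = cut, gain
    return best, best_gain


# Hypothetical data-flow graph: shr = ((x * y) + (x - y)) >> k
succ = {"x": ["mul", "sub"], "y": ["mul", "sub"], "mul": ["add"],
        "sub": ["add"], "add": ["shr"], "shr": []}
sw_cycles = {"mul": 2, "sub": 1, "add": 1, "shr": 1}          # assumed latencies
hw_delay = {"mul": 0.8, "sub": 0.3, "add": 0.3, "shr": 0.1}  # fractions of a cycle
cut, gain = best_cut(succ, sw_cycles, hw_delay,
                     forbidden={"x", "y"}, live_out={"shr"})
print(sorted(cut), gain)
```

For this graph the search selects all four operations, replacing 5 software cycles with a 2-cycle custom instruction. The number of subsets doubles with every candidate node, which is why the exact methods have exponential worst-case runtime.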

Branch-and-Bound Search Example

[Atasu et al., DAC 2003]

ISE Identification Algorithms

Conferences

[Kastner et al., ICCAD 2001]
[Brisk et al., CASES 2002]
[Sun et al., ICCAD 2002]
[Lee et al., ICCAD 2002]
[Atasu et al., DAC 2003]
[Goodwin and Petkov, DATE 2003]
[Peymandoust et al., ASAP 2003]
[Clark et al., MICRO 2003]
[Sun et al., ICCAD 2003]
[Lee et al., ISLPED 2003]
[Cong et al., FPGA 2004]
[Biswas et al., DAC 2004]
[Yu and Mitra, DAC 2004]
[Borin et al., ESTIMedia 2004]
[Kastens et al., LCTES 2004]
[Yu and Mitra, CASES 2004]
[Pozzi and Ienne, CASES 2005]
[Biswas et al., DAC 2005]
[Atasu et al., CODES-ISSS 2005]
[Sun et al., VLSI Design 2005]
[Biswas et al., DATE 2006]
[Galuzzi et al., CODES-ISSS 2006]
[Sun et al., VLSI Design 2006]
[Wong et al., HiPEAC 2007]
[Verma et al., CASES 2007]
[Pothineni et al., CDES 2007]
[Pothineni et al., VLSI Design 2007]
[Bonzini and Pozzi, DATE 2007]
[Atasu et al., DATE 2007]
[Noori et al., DATE 2007]
[Galuzzi et al., SAMOS 2007]
[Galuzzi et al., ARC 2007]
[Bonzini and Pozzi, ASAP 2007]
[Wolinski and Kuchcinski, ASAP 2007]
[Galuzzi et al., ARC 2007]
[Yu and Mitra, FPL 2007]
[Bennet et al., LCTES 2007]
[Verma et al., ASPDAC 2008]
[Wolinski and Kuchcinski, DATE 2008]
[Galuzzi and Bertels, ARC 2008]
[Atasu et al., ASAP 2008]
[Galuzzi and Bertels, ReConFig 2008]
[Pothineni et al., VLSI Design 2008]
[Galuzzi et al., DATE 2009]
[Martin et al., ASAP 2009]
[Martin et al., SAMOS 2009]
[Kamal et al., ASAP 2010]
[Pothineni et al., VLSI Design 2010]
[Ahn et al., ASPDAC 2011]
[Xiao and Casseau, GLS-VLSI 2011]
[Xiao and Casseau, ASAP 2011]
[Ahn et al., CODES-ISSS 2011]

Journals

[Atasu et al., IJPP 2003]
[Clark et al., IJPP 2003]
[Sun et al., TCAD 2004]
[Clark et al., TCOMP 2005]
[Pozzi et al., TCAD 2006]
[Biswas et al., TVLSI 2006]
[Sun et al., TCAD 2006]
[Sun et al., TVLSI 2006]
[Biswas et al., TCAD 2007]
[Chen et al., TCAD 2007]
[Sun et al., TCAD 2007]
[Lee et al., TODAES 2007]
[Bonzini and Pozzi, TVLSI 2008]
[Zhao et al., IEICE Trans. Fund. 2008]
[Atasu et al., TCAD 2008]
[Murray et al., TECS 2009]
[Verma et al., TCAD 2010]
[Galuzzi and Bertels, TRETS 2011]

Overview


Architecture


Compilation and Synthesis


Conclusion


Summary and Future Research Directions


Conclusion


ASIP Architecture


Supply data bandwidth to ASCU


Ensure coherence when using local memories



ISE Identification


Problem formulation is well-understood


Extensions needed to support memory operations


Many effective algorithms exist


Future ASIP Research Directions


Parallel and Multi-core ASIPs


Balance ISE speedup across many threads


ISE identification for parallel models of computation


Concurrent state machines


Synchronous Data Flow / Kahn Process Networks


MapReduce



Identify ASCU for Current AND Future Applications


Some ISEs are not known at design time


Must insert generality or programmability into the ASCU



Application-specific GPUs


Identify vectorized and threaded ISEs

ASCU by hundreds of near-identical threads concurrently