Slides (ppt ~4.5 MB) - CES


Invited Talk, VLSI Conference, Mumbai, Jan. 9th, 2004

http://ces.univ-karlsruhe.de

Lars Bauer, CES, University of Karlsruhe, DAC 2007

Lars Bauer, Muhammad Shafique, Simon Kramer and Jörg Henkel


Chair for Embedded Systems (CES)



University of Karlsruhe

RISPP: Rotating Instruction Set Processing Platform


Outline


Motivation


Related Work


Our RISPP Approach:


Special Instructions (SIs) composition


Forecasting SI usages


Run-time architecture


Results & Evaluation


Development of Embedded Systems


Typical:
  Static analysis of hot spots
  Building tightly optimized system

Nowadays:
  Increasing complexity
  More functionality

Problem:
  Statically chosen design point has to match all requirements
  Typically inefficient for individual components (e.g. tasks or hot spots)


nokia.com


Possible Solution: Extensible Processors

[Figure: design space spanned by "Flexibility, 1/time-to-market" and "Efficiency: $/Mips, mW/MHz, Mips/area", ranging from the hardware solution to the software solution]

ASIC (hardware solution):
  - Non-programmable
  - Highly specialized

ASIP (extensible processor):
  - Instruction set extension
  - Parameterization
  - Inclusion/exclusion of functional blocks

General purpose processor (software solution)


Related Work: Extensible Processors

S Kobayashi, K Mita, Y Takeuchi, M Imai: "Design space exploration for DSP applications using the ASIP development system PEAS-III", ICASSP 2002

A Hoffmann, T Kogel, A Nohl, G Braun, O Schliesbusch, O Wahlen, A Wieferink, H Meyr: "A novel methodology for the design of application-specific instruction-set processors (ASIPs) using a machine description language", IEEE Trans. on CAD of Int. Circ. and Syst., 2001

K Atasu, L Pozzi, P Ienne: "Automatic application-specific instruction-set extensions under microarchitectural constraints", DAC 2003

F Sun, S Ravi, A Raghunathan, NK Jha: "A scalable application-specific processor synthesis methodology", ICCAD 2003

N Cheung, S Parameswaran, J Henkel: "A quantitative study and estimation models for extensible instructions in embedded processors", ICCAD 2004





Problem: Various Hot-Spots

[Figure: bar chart "Hot-Spots in H.324 Video Conferencing Application"; x-axis: Processing Functions (I_ME, PMV, TQ_IL, LF, MC_C, MD_I4, CAVLC, get_pos, CABAC_d, FM, UP, Qt, Reconst, BC, BA, NF, HPF, DRF, FGA, H223_M, V34Mod, MAC); y-axis: Processing Time [%], 0 to 12]

Related Work: Reconfigurable Computing

K Compton, S Hauck: "Reconfigurable computing: a survey of systems and software", ACM Computing Surveys, 2002

F Barat, R Lauwereins: "Reconfigurable instruction set processors: a survey", RSP 2000

RD Wittig, P Chow: "OneChip: an FPGA processor with reconfigurable logic", IEEE Symp. FCCM, 1996

S Vassiliadis, S Wong, G Gaydadjiev, K Bertels, G Kuzmanov, EM Panainte: "The MOLEN polymorphic processor", IEEE Transactions on Computers, 2004





Dynamic System Behavior

Extensible Processor: choosing points in design space at design time

Reconfigurable Computing: typically fix at compile time when and how to deploy reconfigurable hardware

Depending on input data (e.g. different computational paths in video encoder)

Which tasks/applications will be executed together?

How to handle situations that are unknown at design- & compile-time?
(while still supporting various extensible instructions)



Our New Concept: Basic Idea and Overview

At design time: fix the amount of reconfigurable hardware

At compile time: compose Special Instructions (SIs) out of highly re-usable data paths

At run time: dynamically determine the implementation of an SI

Altogether: Rotate the Instruction Set

[Figure: processor pipeline (Instruction Memory, PC, Register File, Sign Extend, ALU, Control, Data Memory Access; pipeline registers IF/ID, ID/EXE, EXE/MEM, MEM/WB) extended with Dynamic Hardware, an Interconnect Bus, the Rotation Manager, a Data Memory Hierarchy Arbiter and Temporary Storage for sw-emul.]

Fundamental Idea: Atom / Molecule Model

Atom: elementary data path (smaller granularity)

Molecule: combination of Atoms (bigger granularity)

Special Instr.: application-specific assembly instruction

Key:
- Multiple implementations per SI (Molecules)
- Each Molecule is composed out of Atoms
- Implementation hierarchy
- Atoms are more reusable
- Molecules are more specific
- Advantage: enables dynamic trade-off
- Drawback: higher design effort

[Figure: three-level hierarchy.
  Special Instructions: HT_4x4, DCT_4x4, SATD_4x4
  Molecules: HT 1, HT 2, HT 3, HT SW; DCT 1, DCT 2, DCT SW; SATD 1, SATD 2, SATD SW
  Atoms: QUAD_SUB, PACK_LSB_MSB, TRANSFORM, SATD (some Atoms replicated for simplicity of figure)
  The number on each edge denotes the # Atom-instances required for this Molecule.
  Insets: an example Atom (add, subtract and shift-by-1 data path from inputs X00, X01, X10, X11 to outputs T00, T01, T10, T11) and an example Molecule composed of Atoms A, B and C]

Formal Atom / Molecule Model: Example

[Figure: Molecules plotted as points over "# Atoms A1" and "# Atoms A2" (in general: n-dimensional), with labeled points (3,5) and (1,4). Legend: Molecule; relation "is greater than or equal to"; Infimum of the Molecules; Supremum of the Molecules]

Molecule relations are needed, e.g., when Molecules comprise each other

In such cases we can first configure the smallest possible Molecule with the required functionality and then upgrade to faster implementations


Formal Atom / Molecule Model: Details

Main data structure: Set of all Molecules

Meta-Molecule to implement two Molecules, such that they can be executed consecutively, i.e. temporal domain (Abelian Group)

Meta-Molecule for the common Atoms (indicator for compatibility)

Relation (Complete Lattice), with
  Supremum: Meta-Molecule that is needed to implement all Molecules
  Infimum: Meta-Molecule that is collectively needed for all Molecules



Formal Atom / Molecule Model: Details

Determinant: number of Atoms needed to implement a Molecule

Upgrading: Atoms that are additionally needed to implement o, assuming m is already available
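The operators described on these two slides can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a Molecule is a vector of Atom counts (one entry per Atom type, as the n-dimensional example suggests), that the Meta-Molecule for consecutive execution is the component-wise maximum, and that the Meta-Molecule of common Atoms is the component-wise minimum. All function names and the two sample Molecules are hypothetical.

```python
from typing import Sequence, Tuple

# Assumption: entry i holds the number of instances of Atom type i.
Molecule = Tuple[int, ...]


def union(m: Molecule, o: Molecule) -> Molecule:
    """Atoms needed so that m and o can both be executed (component-wise max)."""
    return tuple(max(a, b) for a, b in zip(m, o))


def intersection(m: Molecule, o: Molecule) -> Molecule:
    """Atoms that m and o have in common (component-wise min)."""
    return tuple(min(a, b) for a, b in zip(m, o))


def leq(m: Molecule, o: Molecule) -> bool:
    """True if o contains at least the Atoms of m in every dimension."""
    return all(a <= b for a, b in zip(m, o))


def supremum(molecules: Sequence[Molecule]) -> Molecule:
    """Meta-Molecule needed to implement all given Molecules."""
    result = molecules[0]
    for m in molecules[1:]:
        result = union(result, m)
    return result


def infimum(molecules: Sequence[Molecule]) -> Molecule:
    """Meta-Molecule that is collectively needed by all given Molecules."""
    result = molecules[0]
    for m in molecules[1:]:
        result = intersection(result, m)
    return result


def determinant(m: Molecule) -> int:
    """Total number of Atoms needed to implement the Molecule."""
    return sum(m)


def upgrade(available: Molecule, target: Molecule) -> Molecule:
    """Atoms additionally needed for `target` if `available` is already loaded."""
    return tuple(max(0, t - a) for a, t in zip(available, target))


# Hypothetical Molecules over two Atom types (A1, A2).
small = (1, 4)
fast = (3, 2)

print(supremum([small, fast]))      # (3, 4)
print(infimum([small, fast]))       # (1, 2)
print(determinant(fast))            # 5
print(upgrade(small, fast))         # (2, 0)
```

Under these assumptions, upgrading from `small` to `fast` requires loading only two further instances of the first Atom type, which is the situation the "upgrade to faster implementations" remark refers to.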


Instruction Set Rotation Time

Execution and Reconfiguration times for SATD_4x4 for 1 frame:

[Figure: bar chart, Execution Time [cycles], logarithmic axis from 1 to 1,000,000]
  Reconfiguration time (4 Atoms): 349,600
  HW execution time (4 Atoms): 6,000
  SW execution time: 139,264

Loading time depends on:
  a) Atom size
  b) Reconfiguration bandwidth

For our examples: 0.84 to 0.95 ms

Altogether: Hardware has to be available when needed, so start loading early
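A rough break-even estimate shows why rotation has to start early. The sketch below assumes the three bar values are listed in legend order (reconfiguration 349,600 cycles, hardware execution 6,000 cycles, software execution 139,264 cycles) and that the reconfiguration cost is paid once while the per-frame saving recurs; the function name is hypothetical.

```python
import math


def break_even_frames(reconfig_cycles: float, sw_cycles: float, hw_cycles: float) -> float:
    """Number of frames after which one reconfiguration has paid for itself."""
    saved_per_frame = sw_cycles - hw_cycles
    if saved_per_frame <= 0:
        raise ValueError("hardware execution must be faster than software")
    return reconfig_cycles / saved_per_frame


frames = break_even_frames(349_600, 139_264, 6_000)
print(f"break-even after {frames:.2f} frames")
print("whole frames needed:", math.ceil(frames))  # 3
```

With these figures, a reconfiguration that is not overlapped with other work only pays off from the third frame onward, which is why forecasting the SI usage ahead of time matters.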



SI Forecasting: Example

Control-flow graph
  Each node is a Base Block (BB)

At compile time:
  Determine points to forecast an SI
  Add Forecast Instructions with forecast values (about the SI importance) to these points

At run time:
  Use the Forecasts to determine the Instruction Set rotation
  Dynamically update the importance of the forecasted SIs

[Figure: control-flow graph with a "forecast SATD_4x4, 42" instruction placed ahead of the executions of SATD_4x4, leaving time for the Instruction Set rotation; a "Return from subroutine" edge is marked]


Inserting Forecast Points (FCs): General Idea of Algorithm

I. Pre-computations from profiling data for each Special Instruction (SI)

II. For every SI determine Forecast Candidates

III. Optimize list of FC-Candidates and select final forecasts


I. Pre-Computations

Pre-computations are done on the control-flow graph using profiling information:
  Temporal distance from Base Block to SI execution
  Probability that the SI executions are reached
  Number of executions of this SI (if it is executed)
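The three quantities above can be illustrated on a small profiled control-flow graph. This is a simplified sketch under assumptions that the slides do not state: the graph is acyclic, each edge carries a branch probability from profiling, each Base Block has an average cycle count, and the first block that executes the SI ends the search. The `BaseBlock` type, the `precompute` function and the sample graph are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class BaseBlock:
    cycles: float                       # average cycles spent in this block
    si_executions: int = 0              # SI executions inside this block
    successors: List[Tuple[str, float]] = field(default_factory=list)


def precompute(cfg: Dict[str, BaseBlock], block: str, memo=None):
    """Return (probability, temporal distance, executions) seen from `block`.

    Distance and executions are conditional on the SI being reached.
    """
    if memo is None:
        memo = {}
    if block in memo:
        return memo[block]
    bb = cfg[block]
    if bb.si_executions > 0:
        result = (1.0, 0.0, float(bb.si_executions))
    else:
        prob = dist = execs = 0.0
        for succ, edge_p in bb.successors:
            p, d, n = precompute(cfg, succ, memo)
            if p == 0.0:
                continue                # this path never reaches the SI
            weight = edge_p * p
            prob += weight
            dist += weight * d
            execs += weight * n
        if prob > 0.0:
            result = (prob, bb.cycles + dist / prob, execs / prob)
        else:
            result = (0.0, float("inf"), 0.0)
    memo[block] = result
    return result


# Hypothetical graph: entry branches to A (80 %) or B (20 %); only A leads to the SI.
cfg = {
    "entry": BaseBlock(cycles=10, successors=[("A", 0.8), ("B", 0.2)]),
    "A": BaseBlock(cycles=20, successors=[("S", 1.0)]),
    "B": BaseBlock(cycles=5, successors=[("exit", 1.0)]),
    "S": BaseBlock(cycles=100, si_executions=256),
    "exit": BaseBlock(cycles=1),
}

probability, distance, executions = precompute(cfg, "entry")
print(probability, distance, executions)
```

Seen from `entry`, the SI is reached with probability 0.8 after roughly 30 cycles and then executes 256 times; values of this kind are what a forecast decision would be based on.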


II. Forecast Decision Function (FDF)

[Figure: surface plot of the FDF.
  Output: minimal number of SI usages required to issue a Forecast Candidate [#SI usages], 0 to 500 in bands of 50
  Axis: temporal distance until usage of SI, relative to rotation time of SI, logarithmic scale [t / T_Rot], 0.1 to 100
  Axis: probability of SI usage [%], 10 to 100]

III. Optimize list of FC Candidates

General Idea:
  While the forecasted SIs in a Base Block consume too much area:
  remove the forecast with the worst ratio of Achieved Speedup to Exclusively used Atoms

// S_1, …, S_k are the SIs of the FC Candidates in this BB
// Rep(S): a Meta-Molecule that represents the Molecule implementations for the SI S
M ← ∅;
∀ S_i: M ← M ∪ {Rep(S_i)};
while (|sup(M)| > #AvailableAtomContainers) {
    worstRelation ← 0;
    candidate ← null;
    ∀ m ∈ M {
        // Atoms that only m requires (upgrade from sup(M\{m}) to sup(M)),
        // relative to the speedup expected from m
        temp ← |sup(M) ⊖ sup(M\{m})| / ExpectedSpeedup(m);
        if (temp > worstRelation) {
            worstRelation ← temp;
            candidate ← m;
        }
    }
    if (candidate ≠ null) {
        M ← M\{candidate};
    } else {
        break;
    }
}
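The general idea on this slide can be sketched as a greedy pruning loop. This is an illustration under assumptions, not the authors' code: Molecules are tuples of Atom counts, the supremum is the component-wise maximum, and "exclusively used Atoms" are those that could be dropped if one forecast were removed. The SI names, Atom vectors and speedup values are hypothetical.

```python
from typing import Dict, Tuple

Molecule = Tuple[int, ...]


def supremum(molecules, width: int) -> Molecule:
    """Component-wise maximum; the zero vector for an empty collection."""
    result = [0] * width
    for m in molecules:
        result = [max(a, b) for a, b in zip(result, m)]
    return tuple(result)


def exclusive_atoms(total: Molecule, rest: Molecule) -> int:
    """Atoms contained in `total` that `rest` does not already provide."""
    return sum(max(0, t - r) for t, r in zip(total, rest))


def prune_forecasts(candidates: Dict[str, Tuple[Molecule, float]],
                    containers: int) -> Dict[str, Tuple[Molecule, float]]:
    """Drop forecasts until the remaining ones fit into the Atom Containers.

    `candidates` maps an SI name to (representative Meta-Molecule, expected speedup).
    """
    kept = dict(candidates)
    width = len(next(iter(kept.values()))[0]) if kept else 0
    while kept:
        total = supremum([rep for rep, _ in kept.values()], width)
        if sum(total) <= containers:
            break
        worst_name, worst_ratio = None, 0.0
        for name, (_, speedup) in kept.items():
            rest = supremum([r for n, (r, _) in kept.items() if n != name], width)
            ratio = exclusive_atoms(total, rest) / speedup
            if ratio > worst_ratio:
                worst_name, worst_ratio = name, ratio
        if worst_name is None:
            break                       # removing any single forecast frees nothing
        del kept[worst_name]
    return kept


# Hypothetical forecast candidates over three Atom types.
candidates = {
    "SI_X": ((2, 1, 0), 20.0),
    "SI_Y": ((1, 0, 2), 4.0),
    "SI_Z": ((1, 1, 0), 10.0),
}
print(sorted(prune_forecasts(candidates, containers=4)))  # ['SI_X', 'SI_Z']
```

In this example all three forecasts together need five Atoms. `SI_Y` occupies two Atoms exclusively for a small speedup, so it is the one removed, after which the remaining forecasts fit into four Atom Containers.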


Main Tasks of the Run-Time Architecture

a) Monitoring Forecasts and Special Instructions:
   Fine-tune the forecasted importance to reflect the varying run-time situation

b) Selecting Molecules to implement SIs:
   Dynamically choose an SI implementation that matches the current needs of the application

c) Realizing the decisions taken:
   Determine a loading sequence for the Atoms & control the SI execution



Run-time Architecture example

2 tasks run alternately, sharing the available Atom Containers

Only one task may determine the content of an Atom Container, but both can use it

[SASO'07]: "A Self-Adaptive Extensible Embedded Processor" (IEEE International Conference on Self-Adaptive and Self-Organizing Systems, Boston, July 9-11)


Results & Evaluation: Flow of Test Application

Core part of Encoding Engine of ITU-T H.264

Special Instructions (# executions per MacroBlock):
  SATD_4x4 (256)
  DCT_4x4 (16)
  HT_4x4 (1)

Focus: Proof of concept, not automatic SI detection

[Figure: flow chart of the test application. Labels: "For each 4x4 candidate Sub-Block (SB)", "16 candidates", SATD_4x4, "SATD < SATD_min"; "For each 4x4 Sub-Block (SB)", "16 Sub-Blocks", DCT_4x4; HT_4x4]

Designing an Atom for the three transform operations

Consider constraints
  Max size of data path
  Number of I/O signals
  Number of control signals

Increase re-usability
  Combine similar data paths (MUX)

[Figure: the add/subtract/shift data paths of the individual transforms (inputs X00, X01, X10, X11; outputs T00, T01, T10, T11; shifts <<1 and >>1) combined into one Atom, with control signals DCT and HT selecting the shift operations]


Composing Molecules for SATD_4x4

[Figure: data flow from INPUT to OUTPUT through QuadSUB-Atom, PackLSBMSB-Atom, Transform-Atom (control settings shown: DCT=0, DCT=0, HT=1, HT=0), SATD-Atom and ADD (in core cpu); annotation: Increased re-usability]


Performance vs. Area Trade-off

[Figure: "RISPP SI Trade-off: Performance vs. Resources"; x-axis: RISPP Resources [#Atoms], 0 to 18; y-axis: Execution Time [#Cycles], 5 to 25; series: SATD_4x4, DCT_4x4, HT_4x4; annotation: dynamic trade-off. A second scale shows the area requirements [# loaded Atoms]: 0, 5, 10, 15, max]


Hardware Feasibility Study

Xilinx Virtex II 3000 (xc2v3000-6ff1152)

Board: Xilinx HW-AFX-FF1152-200

Floor-planning with PlanAhead

Atom                 Transform   SATD   Pack   QuadSUB
# Slices                   517    407    406       352
Utilization                50%    39%    39%       34%
Bitstream [KByte]           59     58     66        59
Atom loading [µs]          857    840    949       848

[Figure: floorplan with the Core Processor and 4 Atom Containers holding Transform, SATD, Pack and QuadSub]
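Since the loading time depends on Atom size and reconfiguration bandwidth, the figures above allow a quick estimate of the effective bandwidth. The sketch below divides each bitstream size by its loading time; it assumes 1 KByte = 1000 bytes, which the slides do not specify, and the function name is hypothetical.

```python
# (bitstream size in KByte, loading time in microseconds), taken from the table above
atoms = {
    "Transform": (59, 857),
    "SATD": (58, 840),
    "Pack": (66, 949),
    "QuadSUB": (59, 848),
}


def bandwidth_mb_per_s(kbyte: float, micros: float) -> float:
    """Effective reconfiguration bandwidth in MByte/s (assuming 1 KByte = 1000 bytes)."""
    bytes_total = kbyte * 1000
    seconds = micros * 1e-6
    return bytes_total / seconds / 1e6


for name, (kbyte, micros) in atoms.items():
    print(f"{name}: {bandwidth_mb_per_s(kbyte, micros):.1f} MByte/s")
```

All four Atoms come out at roughly 69 MByte/s under this assumption, so the differences in loading time are explained almost entirely by bitstream size.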


Special Instruction Execution Time for Different Resources

[Figure: bar chart, Execution Time [Cycles] on a logarithmic axis (1 to 1000) over RISPP Resources]

             Opt. SW   4 Atoms   5 Atoms   6 Atoms
SATD_4x4         544        24        20        18
DCT_4x4          488        24        19        15
HT_4x4           298        22        22        17

Application Execution Time

Overall performance for H.264 Encoding Engine

[Figure: bar chart, Execution Time [Cycles] over RISPP Resources]
  Opt. SW: 201,065
  4 Atoms: 60,244
  5 Atoms: 59,135
  6 Atoms: 58,287
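The speedup over optimized software follows directly from these cycle counts. The short calculation below assumes that 201,065 cycles belong to Opt. SW and that 60,244, 59,135 and 58,287 cycles belong to 4, 5 and 6 Atoms respectively; the variable names are hypothetical.

```python
SW_CYCLES = 201_065                      # assumed: optimized software baseline
rispp_cycles = {4: 60_244, 5: 59_135, 6: 58_287}   # assumed: cycles per Atom count

speedups = {atoms: SW_CYCLES / cycles for atoms, cycles in rispp_cycles.items()}

for atoms, speedup in speedups.items():
    print(f"{atoms} Atoms: {speedup:.2f}x faster than Opt. SW")
```

Under this reading, four Atoms already give a speedup of about 3.34x, and two additional Atoms raise it only to about 3.45x, which illustrates the diminishing return of extra area for this application.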

Time matters!

Design Time:
  Fix the available reconfigurable hardware resources

Compile Time:
  Determine Special Instructions
  Determine composition out of Atoms / Molecules
  Profile the application
  Add Forecast Points to the application

Run Time:
  Dynamically update the forecasted Importance of the SIs
  Choose Molecule implementation for SIs

The art is to find the right trade-off between design-/compile-time and run-time


Summary & Conclusion

Hierarchical Special Instruction (SI) composition
  Atom / Molecule model
  Use resources more efficiently
  Offer multiple SI implementations

Forecasting SI usages at compile time
  Pre-computations from profiling and graph analysis
  Forecast Decision Function

Push more decisions to run time
  Which SI implementation (dynamic trade-off)
  Adapting to run-time situation

There is a large potential for improving the way current Extensible Processors work



Thank you for your attention!



Atom-Container Interconnections

[Figure: Core Processor and Rotation Manager connected to the Reconfigurable Functional Units, which consist of four Atom Containers, each with its Interconnections]

III. Select final Forecasts

Optimization goals
  As few FCs as possible (smaller code size, fewer executed cycles), as many as needed (provide all necessary information to the run-time system)
  Choose FCs with a good trade-off between 'sufficiently early' and a 'high execution probability'

For each SI start Depth-First-Searches on the FC Candidates on the transposed Base Block graph (i.e. all edges reversed)

[Figure legend: Green BB: FC-Candidate; Final Forecast]


RISPP Area Savings

[Figure: "Explaining Idea of RISPP Rotation Concept" for the ITU-T H.264 Encoding Engine with its 4 major functional blocks ME, MC, TQ and LF]
  Processing time [%] of the 4 major functional blocks: 70%, 4.5%, 8.5%, 17%
  ASIP area [GE] to target the hot spots in these blocks: 27,483 GE (9%), 199,812 GE (63%), 67,032 GE (21%), 23,096 GE (7%)
  ASIP H/W = 317,423 GE
  RISPP HW = α x GE_reqd
  Panels: RISPP HW status after ME exec., after MC exec., after TQ exec. and after LF exec., each showing the completed rotation and the rotation in advance for the next block; the difference between ASIP area and RISPP area is marked as area saving

II. FDF-Details

Explanation and Parameter Description:
  T: Time (Rot: for Rotation; SW: for SW Execution)
  p: Probability
  E: Energy
  α: Parameter for Energy vs. Speedup fine-tuning

FDF(p, t) := offset … max(0, (T_Rot … t … T_SW * p) / (T_Rot * t … p))

offset … * E_Rot / (T_SW … T_HW)