Design and Synthesis of Image Processing Systems using Reconfigurable Dataflow Graphs

Mainak Sen and Shuvra S. Bhattacharyya


Department of Electrical and Computer Engineering, and

Institute for Advanced Computer Studies

University of Maryland at College Park


Maryland DSPCAD Research Group

http://www.ece.umd.edu/DSPCAD/home/dspcad.htm

November 22, 2005

Leiden University, The Netherlands

University of Maryland at College Park

Design and Synthesis of Image Processing Systems,

Outline


Dataflow-based model of computation for modeling the behavior of DSP applications

Decidable dataflow models

Example: use of decidable dataflow as a model of computation for modeling the mapping of (decidable) dataflow behaviors onto embedded multiprocessors

Structured reconfiguration of dataflow graphs

Examples of meta-modeling techniques that can be classified as structured, reconfigurable dataflow

Parameterized dataflow and its application to SDF

Homogeneous-parameterized dataflow and its application to SDF and CSDF


Experiments on a gesture recognition application


Summary


Dataflow-based design for DSP

(Example from Agilent ADS tool)


DSP-oriented Dataflow Models of Computation

Used widely in design tools for DSP

Application is modeled as a directed graph

Nodes (actors) represent functions

Edges represent communication channels between functions

Nodes produce and consume data from edges

Edges buffer data in FIFO (first-in first-out) fashion

Data-driven execution model

A node can execute whenever it has sufficient data on its input edges

The order in which nodes execute is not part of the specification

The order is typically determined by the compiler, the hardware, or both

Iterative execution

Body of loop to be iterated a large or infinite number of times
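
The data-driven execution rule above can be illustrated with a minimal sketch. The `Actor` class, the `run` helper, and the two-actor graph are hypothetical names introduced only for this illustration; they are not taken from any tool mentioned in the slides. Edges are plain FIFO queues, and an actor fires only when each input edge holds enough tokens.

```python
from collections import deque


class Actor:
    """A node that fires when every input edge holds enough tokens."""

    def __init__(self, name, func, inputs=(), outputs=(), need=1):
        self.name = name
        self.func = func              # maps a list of input tokens to one output token
        self.inputs = list(inputs)    # FIFO edges read by this actor
        self.outputs = list(outputs)  # FIFO edges written by this actor
        self.need = need              # tokens consumed per input edge per firing

    def can_fire(self):
        return all(len(e) >= self.need for e in self.inputs)

    def fire(self):
        tokens = [e.popleft() for e in self.inputs for _ in range(self.need)]
        result = self.func(tokens)
        for e in self.outputs:
            e.append(result)


def run(actors, max_firings=100):
    """Fire any enabled actor until none is enabled.

    The graph itself does not fix the firing order; this sketch simply
    picks the first enabled actor in the list.
    """
    trace = []
    while len(trace) < max_firings:
        ready = [a for a in actors if a.inputs and a.can_fire()]
        if not ready:
            break
        ready[0].fire()
        trace.append(ready[0].name)
    return trace


# Hypothetical graph: input edge -> add (consumes 2, produces 1) -> scale -> output edge
e_in, e_mid, e_out = deque([1, 2, 3, 4]), deque(), deque()
add = Actor("add", sum, [e_in], [e_mid], need=2)
scale = Actor("scale", lambda t: 10 * t[0], [e_mid], [e_out])
trace = run([scale, add])
print(trace, list(e_out))
```

Listing `scale` before `add` does not matter: `scale` cannot fire until `add` has placed a token on the middle edge, so the data dependencies alone determine a valid order.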


Dataflow Features and Advantages


Exposes coarse-grain parallelism.

Exposes high-level structure that facilitates analysis, verification, and optimization.

Captures multi-rate behavior.

Complementary to ongoing advances in DSP compiler technology for procedural languages, such as C and MATLAB.

Encourages desirable software engineering practices: modularity and code reuse

Amenable also to aspect-oriented design.

Intuitive to DSP algorithm designers: signal flow graphs.


Evolution of Dataflow Models for DSP


Synchronous dataflow: static multirate behavior

Agilent ADS, Cadence SPW, etc.

Well-behaved dataflow: schemas for bounded dynamics

Boolean/integer dataflow: Turing complete models

Multidimensional synchronous dataflow: image and video

Scalable synchronous dataflow: block processing

Synopsys COSSAP

Cyclo-static dataflow: phased behavior

Synopsys El Greco, Eonic Systems Virtuoso Synchro, System Canvas

Bounded dynamic dataflow: bounded dynamics

The processing graph method: reconfigurable dynamic DF

US Naval Research Laboratory, MCCI Autocoding Toolset

Parameterized dataflow: dynamically-reconfigurable static DF

Blocked dataflow: image and video in terms of reconfigurable dataflow


Modeling Design Space

[Figure: dataflow models plotted by expressive power (vertical axis) against verification / synthesis power (horizontal axis). The plotted points are labeled: C, BDF, DDF; SDF; CSDF; CSDF, SSDF, MDSDF, WBDF; PSDF; PCSDF.]

(Third dimension: simplicity and intuitive appeal)


Decidable Dataflow Models


Modeling flow for representing static flowgraph behavior:

Cyclo-static dataflow (CSDF), multiphase modeling

Synchronous dataflow (SDF), multirate modeling

Homogeneous synchronous dataflow (HSDF)

Acyclic homogeneous synchronous dataflow ("task graphs")

These are in decreasing order of generality

Designs represented in the more general models can be converted to equivalent representations in the less general ones

e.g., CSDF → SDF → HSDF → task graph

HSDF: each actor (graph node) produces/consumes exactly one data value to/from each incident output/input edge

Suitable for exposing parallelism

Not the best model for minimizing memory requirements
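
As an illustration of the SDF to HSDF step, the sketch below solves the SDF balance equations for the smallest positive integer firing counts (the repetitions vector) and then lists one HSDF node per firing. The function name `repetitions_vector` and the three-actor graph with its rates are assumptions chosen for this example; the graph is assumed to be connected.

```python
from fractions import Fraction
from math import lcm


def repetitions_vector(actors, edges):
    """Solve q[src] * produce == q[dst] * consume for the smallest positive integers q.

    edges: list of (src, dst, produce, consume). Assumes a connected graph.
    """
    q = {actors[0]: Fraction(1)}
    pending = list(edges)
    while pending:
        progressed = False
        for edge in list(pending):
            src, dst, p, c = edge
            if src in q and dst not in q:
                q[dst] = q[src] * p / c
            elif dst in q and src not in q:
                q[src] = q[dst] * c / p
            elif src in q and dst in q:
                if q[src] * p != q[dst] * c:
                    raise ValueError("inconsistent sample rates: no periodic schedule")
            else:
                continue  # neither endpoint reached yet; retry on a later pass
            pending.remove(edge)
            progressed = True
        if not progressed:
            raise ValueError("graph is not connected")
    scale = lcm(*(f.denominator for f in q.values()))
    return {a: int(f * scale) for a, f in q.items()}


# Hypothetical multirate chain: A produces 2, B consumes 3; B produces 1, C consumes 2.
q = repetitions_vector(["A", "B", "C"], [("A", "B", 2, 3), ("B", "C", 1, 2)])
print(q)

# Each SDF actor that fires q[a] times per iteration becomes q[a] HSDF nodes.
hsdf_nodes = [f"{a}{k}" for a, n in q.items() for k in range(1, n + 1)]
print(hsdf_nodes)
```

The HSDF graph has as many nodes as the sum of the repetition counts, which is why it exposes parallelism well but can be much larger than the SDF graph it was derived from.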



Synthesis Techniques for Decidable Models


Static scheduling: low overhead, predictability


Performance analysis through synchronization graphs


Loop scheduling


Implicit repetition in the dataflow graph (through changes in sample rate) needs to be translated into explicit repetition in the form of loops on the execution target.

Complex design space exists for such translation

Complementary to procedural language techniques for nested loop compilation


Loop scheduling techniques


Simulation speedup (minimization of scheduling complexity)


Code/data minimization


Hierarchical parallel scheduling


Block processing


Task scheduling for latency/throughput optimization


Probabilistic design: exploiting tolerances to deadline misses
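
The code/data trade-off in loop scheduling can be sketched with a small interpreter for looped schedules. The schedule encoding (nested `(count, body)` tuples), the function `run_schedule`, and the single-edge graph are assumptions made for this illustration only. The interpreter fires actors in schedule order and records the peak number of tokens buffered on each edge.

```python
def run_schedule(schedule, edges):
    """Execute a looped schedule; return (peak tokens per edge, final tokens per edge).

    schedule: list whose items are actor names or (count, sub_schedule) loops.
    edges: dict mapping (src, dst) -> (tokens produced, tokens consumed) per firing.
    """
    tokens = {e: 0 for e in edges}
    peak = {e: 0 for e in edges}

    def fire(actor):
        # Consume from every input edge, then produce on every output edge.
        for (src, dst), (p, c) in edges.items():
            if dst == actor:
                if tokens[(src, dst)] < c:
                    raise RuntimeError(f"{actor} fired without enough input")
                tokens[(src, dst)] -= c
        for (src, dst), (p, c) in edges.items():
            if src == actor:
                tokens[(src, dst)] += p
                peak[(src, dst)] = max(peak[(src, dst)], tokens[(src, dst)])

    def run(items):
        for item in items:
            if isinstance(item, tuple):
                count, body = item
                for _ in range(count):
                    run(body)
            else:
                fire(item)

    run(schedule)
    return peak, tokens


# Hypothetical edge: A produces 2 tokens per firing, B consumes 3.
edges = {("A", "B"): (2, 3)}
flat = [(3, ["A"]), (2, ["B"])]          # (3A)(2B): each actor appears once
interleaved = ["A", "A", "B", "A", "B"]  # longer code, smaller buffer

flat_peak, flat_left = run_schedule(flat, edges)
inter_peak, inter_left = run_schedule(interleaved, edges)
print(flat_peak, inter_peak)
```

Both schedules return the edge to its initial state, but the compact looped form needs a larger buffer than the interleaved form, which is one axis of the design space mentioned above.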




Example: Intermediate representations for synthesis from decidable dataflow models

Consider a decidable dataflow behavior that is to be implemented on a self-timed, embedded multiprocessor

Natural way to implement DSP multiprocessors from decidable dataflow

Actor assignment and ordering are performed statically

Invocation (dispatch) of actors is performed dynamically, through synchronization

Candidate mappings of the behavior onto the architecture can be represented through an intermediate representation that also has decidable dataflow semantics

This representation is useful for understanding the performance, communication overhead, and synchronization structure associated with the candidate mapping

Facilitates the separation of communication and synchronization functionality

This is a useful modeling methodology for design space exploration


Interprocessor Communication Graph (G_ipc)

[Figure: a self-timed schedule and its IPC graph for an application with actors 1 through 9 mapped onto three processors. Labels appearing in the figure: 2r1, 4s1, 4s2, 4s3, 5s1, 7r1, 8r1, 9r1.]

Self-timed schedule:

Proc 1: (1, 2, 3, 4, 6)

Proc 2: (5, 7, 8)

Proc 3: (9)

Every edge (v_i, v_j) of the IPC graph induces a precedence constraint between v_i and v_j.


The synchronization graph G_s

Derived from the interprocessor communication graph

Synchronization edges are distinguished from interprocessor communication (IPC) edges

Synchronization edges represent precedence constraints that are enforced by synchronization protocols

IPC edges represent data transfers

Interprocessor connections

Coincident synchronization and IPC edges: communication together with synchronization protocol (conventional approach)

IPC edge only: communication without synch. protocol

Synchronization edge only: synchronization protocol only


Applications of Synchronization Graphs


Simulation

Throughput estimation through cycle mean analysis

Removal of redundant synchronizations

Resynchronization

Conversion to more efficient synchronization protocols (strongly connected synchronization graphs)

Statically determining and minimizing the sizes of interprocessor communication buffers

All are post-processing methods that can be applied to improve a wide range of existing task graph scheduling techniques on a wide range of multiprocessor architectures.

These techniques benefit from good execution time estimates, but do not depend on exact execution time values to deliver useful results.
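
Throughput estimation through cycle mean analysis can be sketched as follows. The sketch assumes the usual definition, in which the mean of a cycle is the sum of the execution times of its actors divided by the sum of the delays on its edges, and the maximum over all cycles bounds the iteration period. The function `max_cycle_mean`, the brute-force cycle enumeration, and the three-node graph with its execution times and delays are hypothetical and only suitable for small graphs.

```python
from fractions import Fraction


def max_cycle_mean(exec_time, edges):
    """Return the maximum cycle mean of a small graph by enumerating simple cycles.

    exec_time: dict node -> integer execution time estimate.
    edges: list of (src, dst, delay) with integer delays.
    """
    succ = {}
    for src, dst, delay in edges:
        succ.setdefault(src, []).append((dst, delay))
    best = None

    def dfs(start, node, time, delay, visited):
        nonlocal best
        for nxt, d in succ.get(node, []):
            if nxt == start:
                total_delay = delay + d
                if total_delay == 0:
                    raise ValueError("delay-free cycle: the graph deadlocks")
                mean = Fraction(time, total_delay)
                best = mean if best is None else max(best, mean)
            elif nxt not in visited and nxt > start:
                # Only visit nodes larger than the start so each cycle is found once.
                dfs(start, nxt, time + exec_time[nxt], delay + d, visited | {nxt})

    for v in exec_time:
        dfs(v, v, exec_time[v], 0, {v})
    return best


# Hypothetical synchronization graph with execution time estimates.
exec_time = {1: 3, 2: 2, 3: 4}
edges = [(1, 2, 0), (2, 1, 1), (2, 3, 0), (3, 2, 2), (3, 1, 1)]
mcm = max_cycle_mean(exec_time, edges)
print(mcm, 1 / mcm)  # iteration period bound and the corresponding throughput
```

Because the result depends only on sums of execution time estimates, moderately inaccurate estimates still give a usable ranking of candidate mappings.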


Beyond Decidable Models


Limited expressive power: DSP applications increasingly employ high-level dynamics in their behavior

User interface functionality

Mode changes

Adaptive algorithms

Reconfiguration of processing resources/parameters

However, key subsystems still exhibit large amounts of "quasi-static" structure: structure that stays fixed across significant windows of time.

Various dynamic dataflow models have been proposed that address the limitation above by abandoning most or all restrictions related to decidable dataflow

However, these methods are correspondingly limited in their ability to exploit the quasi-static structure described above


Parameterized Dataflow: Structured Control of Dynamic Parameters

The key discipline that is imposed on reconfiguration is that each subsystem must have a consistent view of each of its actors (hierarchical or primitive) throughout any given iteration of that subsystem.


Parameterized Dataflow


Hierarchical modeling

[Figure: a subsystem embedded in a parent graph. The subsystem contains subinit, init, and body graphs and has parameters n, ...; the figure marks where n is written and where n is read.]

A parameterized DF subsystem is composed of 3 parameterized DF graphs: init, subinit, body

Subsystem parameters are configured in init/subinit and used in body

Dynamically reconfigurable



Meta-modeling with parameterized dataflow

Parameterized dataflow can be applied to any dataflow model of computation ("base model") to augment that model with dynamic reconfiguration capabilities in a structured way

Provides for efficient quasi-static scheduling

Enables execution to be viewed in terms of a sequence of dataflow graphs in the base model

Parameterized dataflow + XYZ → "Parameterized XYZ"

Examples of parameterized dataflow models of computation that we are developing and experimenting with

parameterized synchronous dataflow (PSDF)

parameterized cyclo-static dataflow (PCSDF)


Parameterized Synchronous Dataflow (PSDF)

"Local synchrony" conditions can be formulated and checked in a quasi-static fashion to ensure that bounded token production and consumption along with bounded delays lead to bounded memory requirements overall.

This is not true of unstructured dynamic dataflow models, such as general dynamic dataflow, Boolean dataflow, and bounded dynamic dataflow

Techniques for construction of streamlined looped schedules for synchronous dataflow graphs have natural and efficient extensions to the construction of parameterized looped schedules for PSDF graphs.


PSDF Example: CD to DAT Conversion

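[Figure: PSDF specification of the CD to DAT sample rate converter, with params i1, d1, ..., i4, d4. The init graph contains the actor setFac, which sets i1, ..., d4. The body graph is the chain CD → PF1 → PF2 → PF3 → PF4 → DAT: CD produces 1 token per firing, each filter PFk consumes dk tokens and produces ik tokens per firing, and DAT consumes 1 token per firing. Other labels in the figure: initChild, preamble.]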

repeat 5 times {
    fire setFac    /* sets i1, d1, i2, d2, i3, d3, i4, d4 */
    int _g1 = gcd(i1, d2);
    int _g2 = gcd((i2 * i1) / _g1, d3);
    int _g3 = gcd((i3 * i2 * i1) / (_g2 * _g1), d4);
    repeat (d4 / _g3) times {
        repeat (d3 / _g2) times {
            repeat (d2 / _g1) times {
                repeat (d1) times { fire CD }
                fire PF1
            }
            repeat (i1 / _g1) times { fire PF2 }
        }
        repeat ((i2 * i1) / (_g2 * _g1)) times { fire PF3 }
    }
    repeat ((i3 * i2 * i1) / (_g3 * _g2 * _g1)) times {
        fire PF4
        /* each PF4 firing produces i4 tokens; DAT consumes one token per firing */
        repeat (i4) times { fire DAT }
    }
}
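
The loop bounds of the parameterized schedule can be checked numerically. The sketch below recomputes the gcd terms and the per-iteration firing counts for one parameter setting, assuming that the DAT firings accompany each PF4 firing so that the last edge balances. The function name `cd_dat_firings` is hypothetical, and the factor values 2/1, 4/3, 4/7, 5/7 are an assumption: they are a commonly used factorization of the 44.1 kHz to 48 kHz ratio 160/147, not values stated on the slide.

```python
from math import gcd


def cd_dat_firings(i1, d1, i2, d2, i3, d3, i4, d4):
    """Firing counts per schedule iteration, following the loop bounds of the schedule."""
    g1 = gcd(i1, d2)
    g2 = gcd((i2 * i1) // g1, d3)
    g3 = gcd((i3 * i2 * i1) // (g2 * g1), d4)
    n_pf1 = (d4 // g3) * (d3 // g2) * (d2 // g1)
    n_pf4 = (i3 * i2 * i1) // (g3 * g2 * g1)
    return {
        "CD": n_pf1 * d1,
        "PF1": n_pf1,
        "PF2": (d4 // g3) * (d3 // g2) * (i1 // g1),
        "PF3": (d4 // g3) * ((i2 * i1) // (g2 * g1)),
        "PF4": n_pf4,
        "DAT": n_pf4 * i4,  # assumes i4 DAT firings per PF4 firing
    }


# Assumed parameter values produced by setFac for a 44.1 kHz -> 48 kHz conversion.
params = dict(i1=2, d1=1, i2=4, d2=3, i3=4, d3=7, i4=5, d4=7)
n = cd_dat_firings(**params)
print(n)

# Balance check: tokens produced on each edge equal tokens consumed from it.
balanced = (
    n["CD"] * 1 == n["PF1"] * params["d1"]
    and n["PF1"] * params["i1"] == n["PF2"] * params["d2"]
    and n["PF2"] * params["i2"] == n["PF3"] * params["d3"]
    and n["PF3"] * params["i3"] == n["PF4"] * params["d4"]
    and n["PF4"] * params["i4"] == n["DAT"] * 1
)
print(balanced)
```

With these values, 147 CD firings correspond to 160 DAT firings, matching the 147:160 sample ratio; a different output of setFac changes only the loop bounds, not the loop structure.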


PSDF Example: Speech Compression


PCSDF Version of Speech Compression


Homogeneous Parameterized Dataflow (HPDF)

Parameterized dataflow model that can encapsulate the dynamicity of an application.

Meta-modeling technique. Hierarchical actors can have any other underlying dataflow model (SDF, CSDF, PSDF, etc.)

Data production and consumption rates, though dynamic, are equal across an edge for a large number of applications, hence the name homogeneous.

Reconfiguration can be performed without introducing hierarchy when it is more natural to do so (an advantage over parameterized dataflow).

Parameterized dataflow is a more powerful technique and thus can be used to represent a wider set of applications.



Applications



Applications with dynamic run-time data and aggregating final-stage processes are especially well suited to HPDF compared with SDF semantics.

Many applications in image and speech processing seem well suited for our model.

We applied the model to two applications:

- A real-time video processing algorithm for a smart camera developed at Princeton

- A face detection algorithm developed at CFAR labs at UMD.


Application characteristics

[Figure: application structure with actors A, B, M, and N, annotated "Dynamic but balanced amount of data" and "Aggregating final-stage".]

This structure seems to be abundant in many audio/video applications.

Our HPDF model is a natural fit for applications with the above structure.


Gesture recognition algorithm



Real-time video processing for gesture recognition.

Does low-level (red oval) and high-level processing.

Low-level processing recognizes body parts and identifies movements.

High-level processing recognizes actions.

We concentrate on low-level processing.

Ref: W. Wolf, B. Ozer, and T. Lv. Smart cameras as embedded systems. IEEE Computer Magazine, 35(9):48-53, September 2002.


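HPDF model of gesture recognition algorithm

[Figure: Ptolemy II implementation of the HPDF model as the chain Region finding → Contour following → Ellipse Fitting → Graph Matching. Two edges are annotated "Dynamic data", with matched rates (n, n) and (p, p); the figure also carries the annotation "Aggregating final-stage".]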


Modeling with HPDF/CSDF

[Figure: HPDF/CSDF model of the application as the chain VIDEO INPUT → REGION EXTRACTION → CONTOUR FOLLOWING → ELLIPSE FITTING → MATCH. Edge annotations as they appear in the figure: (s 1) (s 1), three times; (s 1) (X_i, Y_i), twice; (I 0, I k_i); (n 1); p (p_i 1, q_i 0).]

#phases = #pixels = s

p phases with 1 token and (n-p) phases with 0 token production

$\sum_i p_i = p$ and $\sum_i q_i = n - p$


Integrating HPDF and CSDF


Number of phases in a fundamental period can vary dynamically.


Number of tokens produced or consumed in a given phase can also
vary dynamically.


HPDF constraint: the total number of tokens produced by a source
actor of a given edge in a given invocation (a fundamental period)
must equal the total number of tokens consumed by the sink in its
corresponding invocation.
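
The HPDF constraint can be expressed as a simple run-time check. In this sketch, each invocation of an actor is described by the list of token counts for its phases; the function name `hpdf_edge_is_consistent` and the example phase lists are hypothetical. The number of phases and the tokens per phase are free to change between invocations, as long as the totals on the two sides of the edge agree.

```python
def hpdf_edge_is_consistent(produced_per_phase, consumed_per_phase):
    """Check the HPDF constraint for one invocation (fundamental period) of an edge.

    produced_per_phase: tokens produced by the source actor in each of its phases.
    consumed_per_phase: tokens consumed by the sink actor in each of its phases.
    """
    return sum(produced_per_phase) == sum(consumed_per_phase)


# Invocation 1: source has 5 phases, 3 of which produce a token; sink consumes 3 at once.
first = hpdf_edge_is_consistent([1, 0, 1, 1, 0], [3])

# Invocation 2: both the phase count and the totals change, but they still match.
second = hpdf_edge_is_consistent([1, 1, 0, 0, 0, 0], [1, 1])

# A mismatch violates the constraint and would leave tokens on the edge.
broken = hpdf_edge_is_consistent([1, 1, 1], [2])
print(first, second, broken)
```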




Finer granularity and Input modeling

Each frame has 384x240 pixels, so we model the input as a CSDF actor with s = 92160 phases.

The model captures the pixel-level parallelism present in Region.

It also captures the frame-level parallelism through the number of phases in Input (s).

[Figure: VIDEO INPUT connected to REGION EXTRACTION by three edges, each annotated (s 1) at both ends; #phases = #pixels = s.]


Modeling dynamicity: Contour

2 phases for Contour

The first phase scans until it finds a contour.

Output = 0 tokens

The second phase follows this contour and all the overlapping ones.

Output = k_i tokens; each token is a list of pixels from a contour

Homogeneous condition remains: $\sum_i (X_i + Y_i) = s$

Scheduling


VRCEM

(s V)(s R)(2 I C)(n E)M

(s VR)(2 I C)(n E)M





Results

We also applied HPDF successfully to model a face detection algorithm.

We developed a TI DSP implementation of the HPDF model of the gesture recognition algorithm.

The application was run on a TMS320C64xx fixed-point processor.

When implemented with our HPDF model, the runtime was 21405671 cycles.

With a 40 ns cycle period, the execution time for the application was 0.86 sec.


Results (contd.)



Scheduling overhead was minimal because a highly streamlined quasi-static schedule was obtained.

The worst-case buffer size was 642 Kb for input images of 384x240 pixels. HPDF modeling suggested buffer reuse between the edges.

The original C code had a runtime of 27741882 cycles; its execution time was 1.11 sec with the same clock period of 40 ns.

HPDF improved runtime by 23%.

Efficient hardware code generation is being investigated using a hardware synthesis framework developed in our research group.
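
The reported times and the 23% figure follow directly from the cycle counts and the 40 ns cycle period. The short calculation below reproduces them; the variable names are chosen for this illustration.

```python
CYCLE_NS = 40             # cycle period stated in the results
hpdf_cycles = 21_405_671  # runtime of the HPDF-based implementation
c_cycles = 27_741_882     # runtime of the original C code

hpdf_seconds = hpdf_cycles * CYCLE_NS * 1e-9
c_seconds = c_cycles * CYCLE_NS * 1e-9
improvement = (c_cycles - hpdf_cycles) / c_cycles

print(round(hpdf_seconds, 2))    # 0.86
print(round(c_seconds, 2))       # 1.11
print(round(100 * improvement))  # 23
```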


Summary


Dataflow-based models of computation are attractive for modeling the behavior of DSP applications

Decidable dataflow models are useful for exposing and exploiting static structure in synthesis tools for DSP

Decidable dataflow models in conjunction with structured reconfiguration techniques allow for efficient handling of application dynamics

Examples of structured, reconfigurable dataflow techniques that we discussed:

Parameterized dataflow and its application to SDF

Homogeneous-parameterized dataflow and its application to SDF and CSDF


Experiments on a gesture recognition application


Other examples include dynamic configuration of graph topologies,
and blocked dataflow modeling.


References


B. Bhattacharya and S. S. Bhattacharyya. Parameterized dataflow modeling for DSP systems. IEEE Transactions on Signal Processing, 49(10):2408-2421, October 2001.

S. S. Bhattacharyya, R. Leupers, and P. Marwedel. Software synthesis and code generation for DSP. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, 47(9):849-875, September 2000.

G. Bilsen, M. Engels, R. Lauwereins, and J. A. Peperstraete. Cyclo-static dataflow. IEEE Transactions on Signal Processing, 44(2):397-408, February 1996.

D. Ko and S. S. Bhattacharyya. Dynamic configuration of dataflow graph topology for DSP system design. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pages V-69-V-72, Philadelphia, Pennsylvania, March 2005.

E. A. Lee and D. G. Messerschmitt. Static scheduling of synchronous dataflow programs for digital signal processing. IEEE Transactions on Computers, February 1987.

S. Neuendorffer and E. Lee. Hierarchical reconfiguration of dataflow models. In Proceedings of the International Conference on Formal Methods and Models for Codesign, June 2004.

M. Sen, S. S. Bhattacharyya, T. Lv, and W. Wolf. Modeling image processing systems with homogeneous parameterized dataflow graphs. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, pages V-133-V-136, Philadelphia, Pennsylvania, March 2005.