Experimenting with machine vision and Zynq

M. Viti (1), E. Primo (2), C. Salati (2)
14-15-16/02/2012

(1) Datalogic Automation Srl
(2) T3LAB

VIALAB Origins

- Research program funded by Regione Emilia-Romagna
- From industrial to technological districts
- Projects are led by industrial partners
- Project duration: 2 years
- VIALAB = Applied Industrial Vision Laboratory
- Computer Vision as an enabling technology for the manufacturing automation industry

VIALAB Partners

- Datalogic
  - Automation (Prime Contractor)
  - Scanning (ADC)
- System SPA
  - Logistics
- T3LAB, Emilia-Romagna Region High Technology Network
- DEIS, University of Bologna
- CRIT

VIALAB topics

- Computing platforms and computational models
- HW acceleration of machine vision algorithms
- 3D Vision
- Benchmark of machine vision libraries

Machine Vision & Image Processing

- Image Processing: input = image, output = image
- Computer Vision: input = image, output = information
- Machine vision = computer vision in industrial applications

FPGA in machine vision: conventional approach

[Diagram: video stream in → PL → RAM → CPU → info out]

time →
PL:   ...  Img[n]    Img[n+1]  ...
CPU:  ...  Img[n-1]  Img[n]    ...

1. Limited to image processing (image-in → image-out)
2. Limited to videoStream-in → videoStream-out applications
3. Limited to pre-processing in full computer vision applications

- Pipeline architecture
- PL (Programmable Logic): pre-filtering

Image processing

- Kernel and sliding window (a C sketch follows below)
- Examples: average, 4-connectivity, 8-connectivity, horizontal Sobel gradient

Horizontal Sobel kernel (x: columns, y: rows):
  -1  0  +1
  -2  0  +2
  -1  0  +1
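
To make the kernel/sliding-window model concrete, here is a minimal software sketch in plain C (the PL implementation is VHDL; the row-major image layout and function name below are assumptions for illustration): the horizontal Sobel kernel above is slid over a grayscale image.

#include <stdint.h>

/* Horizontal Sobel over an 8-bit grayscale image (row-major, width*height
 * pixels). Border pixels are left untouched. This is only a behavioral model
 * of the sliding-window computation done in the PL. */
void sobel_horizontal(const uint8_t *in, int16_t *out, int width, int height)
{
    static const int k[3][3] = {
        { -1, 0, +1 },
        { -2, 0, +2 },
        { -1, 0, +1 },
    };

    for (int y = 1; y < height - 1; y++) {
        for (int x = 1; x < width - 1; x++) {
            int acc = 0;
            /* slide the 3x3 window centred on (x, y) */
            for (int ky = -1; ky <= 1; ky++)
                for (int kx = -1; kx <= 1; kx++)
                    acc += k[ky + 1][kx + 1] * in[(y + ky) * width + (x + kx)];
            out[y * width + x] = (int16_t)acc;
        }
    }
}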

Filters

- Smoothing filters
  - Median
  - Mean
  - Gauss (composed for optimization purposes: horizontal → vertical)
  - Morphological smooth (with customizable mask) (composed: open → close)
- Morphological operators (with customizable mask)
  - Erode
  - Dilate
  - Open (composed: erode → dilate; see the sketch after this list)
  - Close (composed: dilate → erode)
- Edge detection
  - Sobel
  - Morphological gradient (with customizable mask) (composed: dilate - erode)
- Punctual operators
  - Binarization (with customizable threshold)
- Edge thinning
  - Canny-like (composed: median → Sobel → non-maximal suppression)
- Edge sharpening
  - 3 different masks
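
As an illustration of the composed filters above (open = erode followed by dilate), a minimal plain-C sketch of binary erosion/dilation with a customizable 3x3 mask; the buffer layout and names are assumptions, not the project's code.

#include <stdint.h>
#include <string.h>

/* 3x3 binary erosion: output pixel is 1 only if every masked neighbour is 1.
 * mask[3][3] is the structuring element (e.g. 4- or 8-connectivity). */
static void erode3x3(const uint8_t *in, uint8_t *out, int w, int h,
                     const uint8_t mask[3][3])
{
    memset(out, 0, (size_t)w * h);
    for (int y = 1; y < h - 1; y++)
        for (int x = 1; x < w - 1; x++) {
            uint8_t v = 1;
            for (int ky = -1; ky <= 1; ky++)
                for (int kx = -1; kx <= 1; kx++)
                    if (mask[ky + 1][kx + 1] && !in[(y + ky) * w + (x + kx)])
                        v = 0;
            out[y * w + x] = v;
        }
}

/* 3x3 binary dilation: output pixel is 1 if any masked neighbour is 1. */
static void dilate3x3(const uint8_t *in, uint8_t *out, int w, int h,
                      const uint8_t mask[3][3])
{
    memset(out, 0, (size_t)w * h);
    for (int y = 1; y < h - 1; y++)
        for (int x = 1; x < w - 1; x++) {
            uint8_t v = 0;
            for (int ky = -1; ky <= 1; ky++)
                for (int kx = -1; kx <= 1; kx++)
                    if (mask[ky + 1][kx + 1] && in[(y + ky) * w + (x + kx)])
                        v = 1;
            out[y * w + x] = v;
        }
}

/* Open = erode followed by dilate, using a scratch buffer. */
void open3x3(const uint8_t *in, uint8_t *out, uint8_t *tmp, int w, int h,
             const uint8_t mask[3][3])
{
    erode3x3(in, tmp, w, h, mask);
    dilate3x3(tmp, out, w, h, mask);
}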

Future filters

- Punctual operators
  - Histogram stretching
  - Histogram equalization
- Corner detection
  - Harris (composed: Sobel → cornerness → non-maximal suppression)
- Pyramidization
- Bilateral

Filters’ architecture

[Diagram: Filter = BRAM extractor + filter logic;
video stream in → BRAM extractor (row buffers: BRAM, BRAM, BRAM, ...) → NxN pixel window → filter logic → video stream out
(a behavioral C model of the extractor follows below)]
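
A behavioral sketch, in plain C, of what the BRAM extractor does (the real block is programmable logic fed directly by the video stream; the kernel size, row size and names below are assumptions): the most recent rows are kept in line buffers so that an NxN window is available for every incoming pixel.

#include <stdint.h>

#define N 3            /* kernel size modeled here (the PL supports up to 7x7) */
#define ROW_SIZE 1280  /* pixels per row, as in the demo configuration */

/* Line buffers holding the last N rows (the one being received included),
 * mimicking the BRAMs of the extractor. */
static uint8_t lines[N][ROW_SIZE];

/* Called once per incoming pixel (x = column, y = row of the video stream).
 * When enough pixels have arrived it fills 'window' with the NxN
 * neighbourhood whose bottom-right corner is (x, y) and returns 1,
 * otherwise it returns 0. */
int extractor_push(int x, int y, uint8_t pixel, uint8_t window[N][N])
{
    lines[y % N][x] = pixel;   /* store the new pixel in the circular row set */

    if (y < N - 1 || x < N - 1)
        return 0;              /* not enough rows/columns buffered yet */

    for (int wy = 0; wy < N; wy++)
        for (int wx = 0; wx < N; wx++)
            window[wy][wx] = lines[(y - (N - 1) + wy) % N][x - (N - 1) + wx];
    return 1;
}

On the PL each BRAM holds one row, so all NxN pixels of the window are available to the filter logic in the same clock cycle.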

Filters’ architecture

- Maximum kernel size is statically configurable
- Actual kernel size is dynamically configurable
- Changing the maximum kernel size affects
  - time performance
  - latency
  - FPGA area
- The actual kernel size has no effect on performance characteristics
- The maximum operating frequency (with peaks of 200 MHz) depends only on the slowest combinatorial stage of a filter logic
- Designed to
  - maximize parallelism
  - minimize dependence on the kernel size

Filters’ architecture

[Diagram: a composed filter, Filter.n = Filter.n.1 followed by Filter.n.2 (two cascaded filter blocks)]

Selectable filtering

[Diagram: the input video stream feeds Filter.1, Filter.2, ..., Filter.n in parallel (Filter.n being the cascade Filter.n.1 → Filter.n.2), each producing its own video stream out; a selection register, written by the CPU over the bus, chooses which output is used; the CPU and the DDR memory controller sit on the same bus]

Demo characteristics

- Screen = 1280 x 1024 pixels, 60 fps
- b/w pixel rate
  - raw: ~80 Mp/s
  - within a row: ~110 Mp/s
  - compiled at 166 Mp/s
- Target FPGA: Xilinx® Spartan®-6 LX150T
- EvalBoard: AVNET Xilinx® Spartan®-6 FPGA Industrial Video Processing Kit
- FPGA utilization:
  - Slice registers: ~25,000 (~14%)
  - LUTs: ~40,000 (~40%)
  - BRAM: 12
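
For reference, 1280 x 1024 pixels x 60 fps ≈ 78.6 Mp/s, consistent with the quoted ~80 Mp/s raw rate; the higher ~110 Mp/s figure within a row presumably reflects the faster pixel clock during the active portion of each line, with blanking lowering the average.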

Selectable filtering

[Diagram: video stream in → BRAM extractor (BRAM, BRAM, BRAM, ...) → filter.1, filter.2, ..., filter.n in parallel, each producing a video stream out; the selection register, written by the CPU over the bus, chooses which stream goes out; CPU and DDR memory controller share the bus]

Selectable filtering

- Kernel size is the same for all filters and the BRAM extractor
- Kernel size is defined at compile time
  - A VHDL constant
- Kernel size can be changed through download of a different bitstream
  - After having changed the value of the VHDL constant
- The value of the kernel of each filter is wired-in
  - Wired in an attached external register
- The CPU program can dynamically select which filter will be applied to the input video stream
- Latency = pixelTime * rowSize * kernelSize (worked example below)
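
As an illustration, using the demo characteristics reported earlier: at a raw b/w pixel rate of ~80 Mp/s, pixelTime ≈ 12.5 ns; with rowSize = 1280 and a 7x7 kernel the latency is roughly 12.5 ns x 1280 x 7 ≈ 112 µs, and about 48 µs for a 3x3 kernel.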

Dynamic configurable filtering

[Diagram: same scheme as selectable filtering (CPU, DDR memory controller, bus, selection register, BRAM extractor, filter.1 ... filter.n), extended with configuration registers for the kernel size and the kernel value; the BRAM extractor always delivers a 7x7 pixel window to the filters]

Dynamic configurable filtering

- The CPU program can dynamically select which filter will be applied to the input video stream
- The CPU program can dynamically configure the size of the kernel that will be used by the selected filter (a register-write sketch follows below)
  - Each filter is designed to handle its maximum kernel size
  - Kernel sizes: 3x3, 5x5, 7x7
  - The maximum kernel size depends on the specific filter
- When relevant, the CPU program can dynamically configure the value of the kernel that will be used by the selected filter
  - 4-connectivity vs. 8-connectivity
  - Application-specific masks in morphological filters
- The BRAM extractor provides enough parallelism to support the largest possible kernel
  - BRAM extractor output parallelism: 7x7 pixels
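
A minimal sketch of what the CPU-side configuration could look like in C, assuming a memory-mapped register block; the base address, offsets and encodings below are purely illustrative assumptions, not the design's actual register map.

#include <stdint.h>

/* Hypothetical register block of the filtering pipeline (illustrative only). */
#define FILT_BASE        0x40000000u          /* assumed AXI base address   */
#define REG_SELECT       (FILT_BASE + 0x00)   /* which filter drives output */
#define REG_KERNEL_SIZE  (FILT_BASE + 0x04)   /* 3, 5 or 7                  */
#define REG_KERNEL_VALUE (FILT_BASE + 0x08)   /* e.g. 4- vs 8-connectivity  */

static inline void reg_write(uintptr_t addr, uint32_t value)
{
    *(volatile uint32_t *)addr = value;   /* uncached register access */
}

/* Select a filter and configure its kernel at run time. */
void configure_filter(uint32_t filter_id, uint32_t kernel_size,
                      uint32_t kernel_value)
{
    reg_write(REG_KERNEL_SIZE, kernel_size);    /* 3x3, 5x5 or 7x7        */
    reg_write(REG_KERNEL_VALUE, kernel_value);  /* mask/connectivity bits */
    reg_write(REG_SELECT, filter_id);           /* route this filter out  */
}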

Programmable filtering

[Diagram: video stream in → Stage 1 → Stage 2 → Stage 3 → Stage 4 → video stream out; each stage contains a BRAM extractor plus one filter chosen from the available set (filter.1, filter.2, ..., filter.n, ..., filter.m); per-stage association registers (S1..S4 ass. reg.) and filter-specific registers (connectivity, threshold, ...) are written by the CPU over the bus; CPU and DDR memory controller share the bus]

Programmable filtering

- A set of filters is available
- The filter that is embedded in each stage is configurable
- The kernel size of each filter is determined at compile time, but the kernel sizes of different filters may be different
  - The kernel size of a filter depends on the specific characteristics of the filter
- When relevant, the CPU program can dynamically set filter-specific parameters, e.g. the value of the kernel in morphological filters:
  - 4-connectivity vs. 8-connectivity
  - Application-specific masks
- Latency = sum of the per-stage latencies (pixelTime * rowSize * kernelSize of each stage)
- This would have been done better through partial reconfiguration

Video stream to memory

[Diagram: the dynamically configurable filtering scheme (CPU, DDR memory controller, selection register, configuration registers for kernel size and kernel value, BRAM extractor, filter.1 ... filter.n), extended with a memory interface so that the selected output video stream can also be written to DDR memory]

FPGA in machine vision: conventional approach

[Diagram: video stream in → PL → RAM → CPU → info out]

time →
PL:   ...  Img[n]    Img[n+1]  ...
CPU:  ...  Img[n-1]  Img[n]    ...

- PL = pre-processing = image processing
- CPU = machine vision
- Parallelism through pipeline

Alternative approach: hardware acceleration

char *inputData       = 0xNNNNNNNN;
char *outputData      = 0xMMMMMMMM;

char *accelInputData  = 0xKKKKKKKK;
char *accelOutputData = 0xLLLLLLLL;
char *accelControl    = 0xJJJJJJJJ;

// Pure SW processing
Process_data_sw(inputData, outputData);

// HW Accelerator-based processing
Send_data_to_accel(inputData, accelInputData);
Process_data_hw(accelControl);
Recv_data_from_accel(accelOutputData, outputData);


Alternative approach: hardware acceleration

- How long does Process_data_hw(accelControl); last?
- Many CPU machine instructions:
  - Asymmetric MultiProcessing (AMP)
  - Interrupt-based interaction
  - CPU performs other jobs in the meantime
- Few CPU machine instructions:
  - Co-Processing
  - Busy-wait interaction
  - CPU does nothing in the meantime

Alternative approaches: asymmetric multi-processing

- Zynq-7000 EPP Software Developers Guide: “Asymmetric multi-processing is a processing model in which each processor in a multiple-processor system executes a different operating system image while sharing the same physical memory”
  - Related to SW
  - Limited to the 2 cores of the ARM Cortex™-A9 multiprocessor
- A stronger meaning of AMP: a subset of the PL is seen as a processor performing a computational task in parallel with the CPU processing
  - PL_AMP is not restricted to pre-processing
  - PL_AMP may access data in DDR
  - The PL_AMP asymmetric multi-processing is represented by a computational thread in the CPU SW environment (see the sketch below)
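
A minimal sketch of the "computational thread" idea, assuming a Linux userspace where the PL_AMP block is exposed through a UIO-style device whose interrupt can be waited on; the device path, the start/parameter details and all names are illustrative assumptions, not the project's actual driver interface.

#include <fcntl.h>
#include <pthread.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* Thread representing the PL_AMP computation in the CPU SW environment.
 * It starts the PL block and then blocks until the completion interrupt,
 * so the CPU cores are free to run other threads in the meantime.
 * "/dev/uio0" and the start/fetch steps are design specific. */
static void *pl_amp_thread(void *arg)
{
    (void)arg;
    int fd = open("/dev/uio0", O_RDWR);  /* UIO device tied to the PL IRQ */
    if (fd < 0) {
        perror("open");
        return NULL;
    }

    uint32_t irq_on = 1, irq_count;

    /* ... write PL_AMP job parameters / start bit here (design specific) ... */

    write(fd, &irq_on, sizeof(irq_on));      /* re-enable the interrupt   */
    read(fd, &irq_count, sizeof(irq_count)); /* blocks until PL_AMP done  */

    /* ... fetch results from DDR / on-chip memory here ... */

    close(fd);
    return NULL;
}

int main(void)
{
    pthread_t tid;
    pthread_create(&tid, NULL, pl_amp_thread, NULL);
    /* the CPU cores keep running other SW threads here */
    pthread_join(tid, NULL);
    return 0;
}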

Alternative approaches: asymmetric multi-processing

[Diagram: video stream in → PL_PP → DDR; CPU and PL_AMP both access DDR; info out from the CPU]

time →
PL_PP:   ...  Img[n]    Img[n+1]  ...
CPU:     ...  Img[n-1]  Img[n]    ...
PL_AMP:  ...  Img[n-1]  Img[n]    ...

Asymmetric multi-processing
Case study: blob analysis

1. PL_PP (pre-processing) produces an image containing all binarized blobs and stores it in central memory
2. The CPU performs blob labeling, computes the related regions of interest (ROIs) and produces the list of blobs
3. PL_AMP and the CPU compute in parallel a set of descriptors for each blob
   - CPU: orientation, rectangularity, ...
   - PL_AMP: area, perimeter, center, Euler's number, ...
   - Each computation is represented by a SW thread
   - SW threads may interact with each other
4. Based on the related descriptors, the CPU classifies each blob

Asymmetric multi-processing
Case study: blob analysis

[Figure: example images illustrating steps 1, 2 and 4 of the blob-analysis flow]

Asymmetric multi-processing
Case study: blob analysis

[Data flow:
video stream in → PL_PP → DDR: image with all binarized blobs
→ CPU → DDR: image with all blobs labeled & blob ROIs list
→ CPU + PL_AMP (computation of descriptors) → DDR: descriptor array for all blobs
→ CPU → DDR: descriptor array & classification for all blobs]

Blob analysis & AMP: computation of blob descriptors

[Timing diagram of the descriptor computation for one blob:]

- PL_AMP, single sweep through the ROI: Area, Euler’s number, preliminary data for center, preliminary data for perimeter, Perimeter, Compactness
- PL_AMP, single sweep through the ROI: preliminary data for pose
- CPU: Center, Pose, Length, Width, Eccentricity, Bounding box area
- Bounding box: CPU or PL_AMP?

Blob analysis & pre-processing: computation of blobs

[PL_PP pre-processing pipelines:]

- Preliminary smoothing: video stream in → Median → Gaussian → Binarization → video stream out
- Morphological smoothing: video stream → Open → Close → video stream

Blob analysis, pre-filtering & asymmetric multi-processing

time →
PL_PP:   Img[n+1]
CPU:     blob labeling   Img[n].blob[1]   Img[n].blob[2]   ...   Img[n].blob[k]
PL_AMP:                  Img[n].blob[1]   Img[n].blob[2]   ...   Img[n].blob[k]

Asymmetric multi-processing

1. CPU
   - No parallelism/pipeline
   - Floating-point operations
2. PL_AMP
   - Parallelism
   - Pipeline
   - Simple operations with short integers
   - Operations that can be based on look-up tables

Alternative approaches: co-processing

- Each ARM Cortex™-A9 MPCore™ already includes an FPU/NEON™ accelerator
- But one may think of an application-specific co-processor
  - A subset of the PL is seen as a co-processor implementing an application-specific extension instruction set
- PL_CP includes
  - An extension register file (e.g. PL distributed memory)
  - An extension “ALU”
- An ARM SW thread (see the sketch below)
  - Loads the operands into the extension register file
    - Operands may be references to data in DDR or On-Chip Memory
  - Activates the extension ALU
  - (Busy-)waits for the completion of the operation
  - Fetches the results from the extension register file
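
A minimal sketch of the co-processing interaction seen from the ARM SW thread, assuming a memory-mapped extension register file with a start/status register; every address, offset and bit meaning below is an illustrative assumption, not the actual PL_CP interface.

#include <stdint.h>

/* Hypothetical layout of the PL_CP extension register file (illustrative). */
#define PLCP_BASE    0x43C00000u
#define PLCP_OP_A    (PLCP_BASE + 0x00)  /* first operand (or DDR reference)   */
#define PLCP_OP_B    (PLCP_BASE + 0x04)  /* second operand (or DDR reference)  */
#define PLCP_CTRL    (PLCP_BASE + 0x08)  /* write 1 to start the operation     */
#define PLCP_STATUS  (PLCP_BASE + 0x0C)  /* bit 0 set when the result is ready */
#define PLCP_RESULT  (PLCP_BASE + 0x10)  /* result register                    */

#define REG(a) (*(volatile uint32_t *)(uintptr_t)(a))

/* Co-processing call: load operands, start the extension ALU, busy-wait,
 * fetch the result. The CPU does nothing else in the meantime. */
uint32_t plcp_execute(uint32_t op_a, uint32_t op_b)
{
    REG(PLCP_OP_A) = op_a;          /* load the extension register file */
    REG(PLCP_OP_B) = op_b;
    REG(PLCP_CTRL) = 1;             /* activate the extension "ALU"     */

    while ((REG(PLCP_STATUS) & 1) == 0)
        ;                           /* busy-wait for completion         */

    return REG(PLCP_RESULT);        /* fetch the result                 */
}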

Alternative approaches: co-processing

[Diagram: video stream in → PL_PP → DDR; CPU, PL_AMP and PL_CP also attached on the DDR/CPU side; info out from the CPU]

time →
PL_PP:   ...  Img[n+1]             ...
CPU:     ...  Img’[n]   Img”[n]    ...
PL_AMP:  ...  Img[n]               ...
PL_CP:              op

Co-processing
Case study: Sobel gradients

- hgrad[x, y] = normalize(p[x+1, y-1] + 2*p[x+1, y] + p[x+1, y+1]
                          - p[x-1, y-1] - 2*p[x-1, y] - p[x-1, y+1])
- vgrad[x, y] = normalize(p[x-1, y+1] + 2*p[x, y+1] + p[x+1, y+1]
                          - p[x-1, y-1] - 2*p[x, y-1] - p[x+1, y-1])

- CPU operations: 4 (left) shifts + (5*2) sums + 2 (right) shifts (see the C sketch below)
- PL operations:
  - Cost of a constant shift: 0 (it is just wiring)
  - Depth of each summation tree = 3
  - Total time complexity: 3 sums
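
A small plain-C rendering of the CPU-side cost counted above, with the multiplications by 2 written as left shifts; normalize() is assumed here to be a division by 8, i.e. one right shift per gradient (the image layout and names are illustrative).

#include <stdint.h>

/* p(x, y): pixel access over a row-major 8-bit image (illustrative helper). */
static inline int p(const uint8_t *img, int w, int x, int y)
{
    return img[y * w + x];
}

/* Horizontal and vertical Sobel gradients for one pixel, written the way a
 * CPU executes them: per gradient, 2 left shifts + 5 additions/subtractions
 * + 1 right shift (arithmetic shift on typical targets). */
void sobel_gradients(const uint8_t *img, int w, int x, int y,
                     int *hgrad, int *vgrad)
{
    int h = (p(img, w, x + 1, y - 1) + (p(img, w, x + 1, y) << 1) + p(img, w, x + 1, y + 1))
          - (p(img, w, x - 1, y - 1) + (p(img, w, x - 1, y) << 1) + p(img, w, x - 1, y + 1));

    int v = (p(img, w, x - 1, y + 1) + (p(img, w, x, y + 1) << 1) + p(img, w, x + 1, y + 1))
          - (p(img, w, x - 1, y - 1) + (p(img, w, x, y - 1) << 1) + p(img, w, x + 1, y - 1));

    *hgrad = h >> 3;   /* "normalize": assumed to be /8 */
    *vgrad = v >> 3;
}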

Co-processing
Case study: blob analysis

- Classification based on a nearest-neighbor strategy
- Distance between the descriptor array of a blob and the descriptor arrays of all possible object classes
- Two levels of parallelism:
  - Distance from several possible object classes
  - Distance of two descriptor arrays = f(distance of individual descriptors)
- PL-based co-processor to compute the distance of two descriptor arrays (see the sketch below)
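
For illustration, a plain-C version of the classification step that such a co-processor would accelerate; the descriptor count, the per-descriptor distance (squared difference) and all names are assumptions, not the project's actual metric.

#include <stdint.h>

#define NUM_DESCRIPTORS 8   /* assumed size of a blob descriptor array */

/* Distance of two descriptor arrays as a function of the distances of the
 * individual descriptors (here: sum of squared differences). This is the
 * part a PL-based co-processor could compute, evaluating the per-descriptor
 * differences in parallel. */
static uint32_t descriptor_distance(const int16_t *a, const int16_t *b)
{
    uint32_t d = 0;
    for (int i = 0; i < NUM_DESCRIPTORS; i++) {
        int32_t diff = (int32_t)a[i] - b[i];
        d += (uint32_t)(diff * diff);
    }
    return d;
}

/* Nearest-neighbor classification: return the index of the class whose
 * reference descriptor array is closest to the blob's descriptor array. */
int classify_blob(const int16_t *blob,
                  const int16_t classes[][NUM_DESCRIPTORS], int num_classes)
{
    int best = 0;
    uint32_t best_d = descriptor_distance(blob, classes[0]);
    for (int c = 1; c < num_classes; c++) {
        uint32_t d = descriptor_distance(blob, classes[c]);
        if (d < best_d) {
            best_d = d;
            best = c;
        }
    }
    return best;
}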

Acknowledgements

- Xilinx
- Silica
  - Stefano Tabanelli
- Università di Bologna, DEIS
  - Prof. Luigi Di Stefano, VIALAB Scientific Director
  - Prof. Stefano Mattoccia, VIALAB OR4 Advisor
- Michele Benedetti (Datalogic Automation), VIALAB Director
- Luca Turrini (System), VIALAB OR4 responsible