An FPGA Implementation of Image Component Labelling

designpadAI and Robotics

Dec 1, 2013 (3 years and 11 months ago)

116 views

An FPGA Implementation of Image Component Labelling


D Crookes and K Benkrid


School of Computer Science, The Queen’s University of Belfast, Belfast, BT7 1NN, UK



ABSTRACT


Connected Component Labelling (CCL) is an important task in intermediate image pro
cessing. Several algorithms have
been developed to handle this problem. Hardware implementations have typically been based on massively parallel
architectures, with one logical processing element per pixel. This approach requires a great deal of logic, so

current
solutions are often implemented in VLSI rather than on FPGAs, and are limited in the size of image which can be labelled.

This paper describes an architecture based on a serial, recursive algorithm for CCL which is well suited to
implementation on

an FPGA with off
-
chip RAM. The algorithm iteratively scans the input image, performing a non
-
zero
maximum neighbourhood operation. A complete forward pass is followed by an inverse pass in which the image is scanned
in reverse order. In the general cas
e, the process is repeated until no change in the image occurs.

For a 256 by 256 image, the whole circuit fits easily on a XC4010E FPGA chip (20 by 20 CLBs). Using bit serial
arithmetic, it runs at 76 MHz (over 70 passes per second for 16 bits per pixel).

In the general case, the time to label a whole
image depends on the shapes of the objects, which is a disadvantage. However, given prior knowledge of the shapes, an
upper limit on the number of required passes can be set in advance. The big advantage of t
he algorithm is that its hardware
implementation is easily scaleable to images of any size. It can be defined at a high level within our application
environment, which enables automatic generation of architectures for an FPGA
-
based image coprocessor.


Key
words:
Labelling, Intermediate level image processing, Serial recursive algorithm, FPGA(s).


1.

INTRODUCTION


Connected Component Labelling is an important task in intermediate image processing with a large number of
applications
1,2
. The problem is to ass
ign a unique label to each connected component in the image while ensuring a
different label for each distinct object as illustrated in Figure 1. By assigning a unique label to each connected region,
higher level image processing operations can identify,
extract, and process each object separately.


10
10
10
10
10
10
10
10
10
20
20
20
20
20
20
20
20
20
20
20
20
30
30
30
30
30
30
30
30
30
Figure 1: A
labelled Image.







Email
: (D.Crookes,K.Benkrid)@qub.ac.uk; Telephone: +44 (0) 1232 274626; Fax: +44 (0) 1232 683890


To date, many algorithms have been developed to handle this problem
3
. Some have been designed for implementation on
software, and some for hardware. Hardware implementations have
typically been based on massively parallel architectures.
Early parallel techniques for image labelling were based on a nearest neighbour algorithm, dedicating one Processing
Element (PE) for each pixel where labels are broadcast locally between neighbour
s’ processors. They were proposed for 2D
mesh
-
connected cellular arrays and ILLIAC
-
type computers
4
. Using this approach, a time complexity of
O
(
N
2
)

was obtained
for an N
x
N image.


Another algorithm is based on a global approach, and is designed for a reco
nfigurable
-
mesh type architecture
5,6
. It reduces
the complexity to
O
(
N
). The reconfigurable mesh is an N
x
N array of PE’s that are connected to a reconfigurable broadcast
bus.


Yet another architecture based on a boundary processing method
7
, reduces the c
omplexity to
O
(Log
N
). The underlying
strategy of this approach is based on constructing a boundary list for each connected component, labelling these boundaries,
and then propagating the labels to the pixels within each component.


Another parallel archite
cture has been proposed which also takes a global approach. It runs on a one
-
dimensional array of
processors, and aims to minimise the electronic structure
8
. The algorithm is based on a divide and conquer paradigm in
which the image is first partitioned i
nto several sub
-
images; then, local labels are computed within each sub
-
image. These
results are then merged to compute the labels for the entire image. It leads to a complexity of
O
(
N
X
Log
N
). Though
implementable on FPGAs, it requires four XC4025 chips (3
2
x
32 CLBs each) to implement a labelling circuit for a 32
x
32
image size only.


These massively parallel approaches (based on one PE per pixel) inevitably require a great deal of logic, so current solution
s
are often implemented in VLSI rather than on FPG
As. Their main disadvantage is that they are generally limited in the size
of image, which can be labelled on FPGAs (e.g. 32
x
32).


This paper describes an architecture based on a serial, recursive algorithm for CCL which is well suited to implementation
o
n an FPGA with off
-
chip RAM. Its main advantage is that it is readily scaleable, and that it can label images of any size
(within reason), given the appropriate FPGA. In the following section, we describe the algorithm used. We then describe
the hardware

architecture, after which we give details of the FPGA implementation (on a Xilinx XC4010E FPGA).
Simulation results are given.


2.
THE PROPOSED CCL ALG
ORITHM


Given an arbitrary input image, the CCL algorithm which we use is an iterative one. Initially eac
h object pixel is taken to
be a separate object; then, by repeatedly applying a recursive neighbourhood operation, connected pixels are ‘merged’ by
being made to share a common pixel value (the maximum in the neighbourhood). The key to requiring only a sm
all number
of iterations is the recursive nature of the neighbourhood operation, in which the result pixel is stored back in the image
being labelled: this enables intermediate labels to be propagated across the image in a single pass.


In more detail, the

algorithm involves the following steps:


Step 1:

Threshold the input image to obtain a binary image. This will make all pixels in the objects equal to 1 and all
other pixels equal to 0.

Step 2:

The thresholded binary image is initially labelled by assigni
ng a different label to every non
-
zero pixel (i.e.
initially treating each pixel as a separate object).

Step 3:

Apply a recursive ‘non
-
zero maximum’ neighbourhood operation on the image, using the window given in
Figure 2. During this operation, each resul
t pixel is stored back in the source image. A complete forward pass
is followed by an inverse pass.


Step 4: Repeat Step 3 until there is no change in the image.


The following pseudo
-
code therefore expresses the basic algorithm:





Threshold the image;
Do initial labelling of all object pixels
;
Repeat
Apply a forward recursive ‘non-zero
max’ neighbourhood operation; //Forward pass
Apply a backward recursive ‘non-zero
max’ neighbourhood operation; // Backward pass
Until (No change in the image);


The application of the ‘non
-
zero maximum’ operations is illustrated in Figure 2.


P
1
P
2
P
3
P
4
P
5
If (P
1
=0) then return 0
Else return
maximum(P
1
, P
2
, P
3
, P
4
, P
5
)
Figure 2: CCL neighbourhood operation.
Backward pass
Input Image
P
5
P
4
P
3
P
2
P
1
Result Image
Forward Pass

3.

THE HARDWARE ARCHITE
CTURE


Previous work by Donachy et al
9

has resulted in an FPGA architecture for recursive neighbourhood operations

such as ‘non
-
zero maximum’ above. This architecture is a serial one, where pixels are fed to the FPGA one at a time. A one
-
line on
-
chip
buffer is required, with five maximum
-
of
-
two operators, plus the necessary control to implement the recursive aspect
of the
operation.


However, even given this architecture, the hardware implementation of this algorithm faces several further challenges:




The multi
-
pass scheme suggests repeated reconfiguration of the circuit between the forward pass and the backward pass

(and vice
-
versa). This reconfiguration would slow the overall operating speed considerably. The solution to this
problem is to keep the same circuit configuration, but reverse the order in which the image is supplied to the circuit: i.e.
reverse the ima
ge instead of the window. The latter is achieved by storing the output pixels in reverse order in both
cases (so that after a forward+backward pass the image pixels are in the original order). Thus, a forward pass would
translate to a forward scan of the

image (from the top left corner to the bottom right one), whereas a backward pass
would translate to a backward scan of the image (from the bottom right corner to the top left one), but using the same
circuit.




As mentioned before, the initial labelling

has to assign a different label to every non
-
zero pixel. For a N
x
N image size,
this could require 2Log
2
N bits per pixel to represent intermediate label values. In order to reduce the number of bits
per pixel, the initial labelling technique we adopt is:

firstly to give the first non
-
zero pixel the highest label, and
decrease the label value for subsequent non
-
zero pixels. Secondly, it gives adjacent non
-
zero pixels, within the same
column (assuming a vertical scan), the same label; this will use fewer la
bel values, thus reducing the required pixel
word length. The latter could be determined in advance by knowing of the shape and maximum number of objects.
Note that starting this initial labelling technique at the
maximum

label allows us to predetermine th
e label of one object
in advance, which is the maximum pixel value chosen.




If we wish to include the initial thresholding and initial labelling on
-
the
-
fly (which we do), then the first pass is special.
Subsequent passes should bypass these stages. An una
ttractive solution would be to reconfigure the whole circuit after
the first pass. To avoid that, our implementation provides extra logic for bypassing this stage in subsequent passes
without reconfiguring.




Detecting if a pass has resulted in any change
needs to be done on
-
the
-
fly, to avoid a separate pass for this purpose. A
flag is maintained during processing, and is set to 1 if and when any result pixel differs from its original value. To test
for termination of the whole algorithm, this flag must b
e 0 for two consecutive passes (one forward and one backward).


The organisation of the proposed architecture is given in Figure 3. The ‘maximum of five’ neighbourhood operation is split
into four ‘maximum of two’ operations. The recursive storing back o
f result pixels into the source image is implemented by
a feedback loop. In order to eliminate the propagation of the pixel label from the bottom of an image column to the top of
the following one, a column counter is provided to inhibit this propagation.


In order to make the design scaleable and compact, we used a bit serial pipelining. For efficient implementation of the
‘maximum of two’ operators, data is processed Most Significant Bit First (MSBF).



Figure 3: Architecture of the Labelling Unit.
T
H
R
E
S
H
O
L
D
INITIAL
LABELLING
COLUMN
COUNTER
MAXIMUM
PIXEL
DELAY
L
I
N
E
B
U
F
F
E
R
INPUT IMAGE
OUTPUT IMAGE
MAXIMUM
MAXIMUM
PIXEL
DELAY
4
5
3
1
2
1
2
3
4
5
NON-ZERO
MAXIMUM


A forward pass is follo
wed by an inverse one in which the input image (the output image of the previous pass) is scanned in
reverse order (actually implemented by storing the output image in reverse order). The process is repeated until no change
in the image occurs. The syste
m architecture is illustrated in Figure 4.




Labelling Unit
Controller
Off-chip RAM
FPGA Chip
Figure 4: System architecture.
Source Image
Comparator
“Finish”
“Start”


The controller block generates the proper off
-
chip RAM address sequences. In addition, it generates the signal ‘Start’, which
initiates the first pass, when it is set to one. It is
set to zero in subsequent passes, thus bypassing the Threshold and the Initial
Labelling.


Given prior knowledge of the shapes in the source image, an upper limit on the number of required passes can be set in
advance, and in some common cases labelling m
ay need only one forward+backward pass.


4.

FPGA IMPLEMENTATION


The CCL circuit has been implemented for Xilinx XC4000 series FPGAs. The Threshold block is bit parallel, and uses
dedicated fast carry logic; it is highly scaleable. A 16 bit threshold block
occupies 9 CLBs.


The Initial Labelling block occupies 3 CLBs per bit. The non
-
zero maximum unit occupies 14 CLBs whereas the other
maximum units occupy 4 CLBs.


The line buffer is implemented using XC4000 Select
-
RAM
TM
. An N pixel line buffer occupies N
/2 +2 CLBs.


The whole CCL circuit has been implemented in an XC4010E
-
1 FPGA chip (20 by 20 CLBs). For a 256 by 256 image, it
fits easily on to the XC4010E.


5.

SIMULATION RESULTS


Timing simulations show that the circuit can run at a speed of 76 MHz (over

70 passes per second for 16 bits per pixel). For
256
x
256 images in which the objects can be labelled by a single forward+backward pass, input images can be thresholded
and labelled in real time at approximately 35 fps. Larger images will take proportion
ately longer, since the algorithm is
O
(
N
2
) for an N
X
N image.


In order to verify the correctness of the design, we have simulated it using Xilinx Foundation Simulator for a variety of
images. Figure 5 presents the simulation results for an 8
X
8 image.








In this particular case, one forward pass is sufficient to label the image. However, that is not true in the general case. Ob
jects
of the forms given in Figure 6 cannot be labelled in one pass. In this case two passes


one forward and one backward
-

are
necessary.





Original binary image
64
64
64
63
64
63
62
61
61
61
62
Initial
Labelling
Figure 5
: Simulation Example of Connected Component
Labelling.
First pass (Forward scan)
64
64
64
64
64
64
62
62
62
62
62
Original binary
image
Initial
Labelling
First pass (Forward scan)
Figure 6: Another simulation example.
63
64
60
62
59
62
61
60
60
60
61
62
62
59
63
64
60
64
61
64
61
61
61
61
61
64
64
61
Second pass (Backward scan)
64
64
61
64
61
64
61
61
61
61
61
64
64
61
6.

CONCLUSION


In this paper, we have presented an architecture based on a serial, recursive algorithm for Connected Component Labelling.
This architecture is well suited to implementation on an FPGA with off
-
chip RAM. The algorithm ite
ratively scans the input
image, performing a non
-
zero maximum neighbourhood operation. The resulting pixels are stored back into the input
image, so an intermediate label is partially propagated throughout a component during a single pass. An inverse pass

in
which the image is scanned in reverse order follows a complete forward pass. In the general case, the process is repeated
until no change in the image occurs. For 256
x
256 images, using 16 bits per pixel, the current architecture delivers 70 passes
per

second. For the common case where in practice a single forward+backward pass is sufficient, a frame rate of
approximately 35 fps can be achieved. In practice, in some circumstances, a smaller word length can be used (e.g. 10 bits),
which results in propor
tionately faster operation.



Although the algorithm has been implemented in bit serial, it is possible to implement it using digit serial arithmetic (or
even fully parallel) in order to speed up the circuit at the expense of higher complexity.


The main a
dvantage of this algorithm is that its hardware implementation is easily scaleable to realistic size images. It can
be defined at a high level within our application environment, which enables automatic generation of architectures for an
FPGA
-
based image c
oprocessor
10
. On the other hand, because a single pass is not sufficient to label the whole image in the
general case, this algorithm requires an off
-
chip memory to store intermediate images. In addition, it may not be possible to
predict the number of pa
sses required. Both these factors mean that the algorithm is not yet well suited to real time general
purpose labelling using an FPGA embedded in a camera.


REFERENCES


1.

D. L. Milgram, “Region extraction using convergent evidence”,
Computer Graphics and

Image Processing
, vol. 5, no.
2, pp. 561
-
572, 1988.

2.

L. C. Sanz and D. Petkovic, “Machine vision algorithms for automated inspection of thin
-
film disk heads”,
IEEE
Transactions on Pattern Analysis and Machine Intelligence,
Vol. 10, no. 6, pp. 830
-
848, 1
988.

3.

H. M. Alnuweiri, and V. K. Prasanna, “Parallel architectures and algorithms for image component labeling”,
IEEE
Transactions on Pattern Analysis and Machine Intelligence
, Vol. 14, No. 10, pp. 1014
-
1034, Oct. 1992.

4.

C. R. Dyer and A. Rosenfeld, “P
arallel image processing by memory augmented cellular automata”,
IEEE
Transactions on Pattern Analysis and Machine Intelligence, pp. 29
-
41, Jan. 1981.


5.

M. Maresca, and H. Li, M. Lavin, “Connected component labeling on Polymorphic
-
Torus architecture”,
IE
EE Int.
Conf. Computer Vision and Pattern Recognition
, Ann Arbor, pp. 951
-
956, 1988.

6.

R. Miller, and V. K. Prasanna
-
Kumar, “Meshes with reconfigurable buses
”, Proc. 5
th

MIT Conf. Advanced Research in
VLSI
, pp. 163
-
178, March 1988

7.

S. Olariu, J. L. Sc
hwing and J. Zhang, “Fast component labelling and convex hull computation on reconfigurable
meshes”
, Image and Vision Computing,
Vol. 11, No. 7, pp.447
-
455, 1993.

8.

E. Mozef, S. Weber, J. Jaber and G. Prieur, “Parallel architecture dedicated to image comp
onent labeling in O(nLog
n): FPGA implementation”,
Proc. SPIE
, Vol. 2784, pp.120
-
125, 1996.

9.


P. Donachy, D. Crookes, A. Bouridane, K. Alotaibi, and A. Benkrid, “Design and implementation of a high level
image processing machine using reconfigurable hard
ware”
, Proceedings SPIE’98
, Vol. 3526, p. 2
-
13, 1998.

10.

K. Benkrid, D. Crookes, A. Bouridane, P. Corr and K.Alotaibi, ‘A High Level Software Environment for FPGA Based
Image Processing’, To appear in
Proceedings IPA 99’
, Manchester, July 1999.