MORPHOLOGICAL IMAGE PROCESSING USING CUSTOM INSTRUCTIONS ON DISTRIBUTED NIOS PROCESSORS


Haichen Ren, David J. Jackson

Electrical and Computer Engineering

The University of Alabama

Tuscaloosa, AL 35487-0286 USA


Abstract


As a fundamental image processing block, morphological processing involves intensive computation and contributes significantly to system overhead. Because they depend only on spatially local data, several morphological operations can be implemented with parallel hardware to reduce the computation overhead. In this paper, we implement morphological image operations, including dilation, erosion, and edge detection based on a 3x3 mask, on a distributed Altera NIOS® soft core system. We also implement custom instructions to improve system performance. Compared with a non-distributed system without custom instructions, the distributed system with custom instructions reaches a speedup of up to 11.8 on several morphological operations. The system architecture and implementation details are presented.


Keywords: image processing, morphological operation,
embedded processor, programmable logic, soft core.


1. INTRODUCTION


Recently, the image processing community has become aware of the potential for massive parallelism and high computational density in hardware. Among available hardware, embedded programmable processors give the system designer unprecedented freedom in determining which functions should be executed in software and which would benefit most from dedicated hardware implementation in the form of custom peripherals or coprocessor elements. This flexibility allows a designer not only to rapidly prototype new designs, easily integrate different digital components into one design, and fully realize, in hardware, several iterations of a system in a shorter amount of time, but also to explore different partitioning options to deliver the best possible combination in a product while still meeting the design's functionality and performance requirements [1].



Based upon the Altera Excalibur embedded processor solution, we implemented several morphological operations on distributed hardware and evaluated the custom instruction design of the Altera NIOS soft core processor on both non-distributed and distributed systems.


This paper is organized as follows. Section 2 presents the Altera Excalibur™ embedded processor solution. Section 3 reviews several morphological image operations and details the ORD-4C algorithm for morphological image processing. Section 4 details a distributed NIOS hardware implementation with custom instructions using the ORD-4C algorithm. Section 5 presents results and performance.


2. NIOS SOFT CORE PROCESSOR


One of the more popular ways of combining embedded processors with programmable logic is the use of soft core microprocessors. Soft core processors are a recent digital design approach that combines the advantages of programmable logic devices with those of conventional hard core processors. Soft core processors function like hard core processors but are implemented on programmable logic devices (PLDs), such as Field Programmable Gate Arrays (FPGAs).


Scalability and flexibility are the two main advantages that soft core processors derive from implementation on PLDs. Soft cores are flexible in that custom-defined logic can be easily integrated into the processor with minimal interfacing requirements. Some soft core processors even allow their internal architecture to be changed to suit a particular design. This gives the designer more flexibility when interfacing the soft core processor to the rest of the embedded system. The scalability of soft core processors allows more than one processor to be implemented in a particular design, limited only by the capacity and resources of the PLD used. Scalability and flexibility make the soft core processor suitable for a variety of applications, such as communications or digital signal processing.


2.1 NIOS soft core processor


Altera Excalibur™ solutions include both soft core and hard core embedded processors. As part of the Altera Excalibur embedded processor solutions, and based here on the Altera APEX 20K200EFC484-2x device, the NIOS soft core embedded processor is a configurable, general-purpose RISC microprocessor with a 16-bit instruction set, a user-selectable 32- or 16-bit datapath, and a configurable register file and barrel shifter size. It can provide up to 50 MIPS performance while being optimized for area in a PLD, and it fits easily into an Altera APEX device, leaving most of the logic available for peripherals and custom logic functions. Figure 1 shows the structure of the NIOS embedded processor [2].


2.2 Custom Instructions


The Altera NIOS processor is one of a few soft core processors that allow custom instructions. By designing custom instructions for the NIOS soft core, system designers can add up to five custom-defined functions to the NIOS processor's arithmetic logic unit (ALU) and instruction set, as shown in Figure 2. A custom instruction consists of a custom logic block and a software macro. The custom logic block is the hardware that performs the operation. The NIOS processor can include up to five user-defined custom logic blocks, which become part of the NIOS microprocessor's ALU. The software macro is the user interface that allows the system designer to access the custom logic from software code.



Figure 2. Custom instruction logic block and interface of 32-bit NIOS processor [3].


3. MORPHOLOGICAL OPERATIONS


Morphology offers a unified and powerful approach to numerous image processing problems [4]. The goal of morphological operations is to smooth the contours of objects and to decompose an image into its fundamental geometrical shapes.


3.1 Fundamental morphological image processing


Morphological operations can be separated into two categories: 1) binary morphological filtering of binary images, and 2) grey-level morphological filtering of grey-level images. Four basic types of binary morphological filtering operations are available: erosion, dilation, opening, and closing. Each of these filters uses a mask, or structuring element, to determine the geometrical filtering process. The erosion filter reduces the geometrical size of an object, while the dilation filter enlarges it. However, these are generally not reversible operations. An opening filter is simply an erosion filter followed by a dilation filter, and a closing filter is a dilation filter followed by an erosion filter [4][5]. Most of the more complex morphological operations are combinations of these four. For example, edge detection can be obtained by calculating the difference between the dilated image and the original image, or between the original image and the eroded image.


A morphological operation on an input image proceeds much like a convolution between the input image and the structuring element (SE). For example, with a 3x3 all-1's SE, a block of nine pixels is covered by the SE at each step, and the maximum or minimum value among the nine pixels is picked as the dilation or erosion output for the central pixel. After the current operation is complete, the SE moves right by one pixel, or down to the beginning of the next row if the end of the current row has been reached.
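The raster-scan procedure above can be sketched in software as follows. This is a minimal illustration, not the implementation used in this paper: the function names (`morph3x3`, `dilate`, `erode`, `opening`, `closing`, `edge`) are hypothetical, the image is a plain list of rows, and the handling of border pixels (using only the neighbors that lie inside the image) is an assumption, since the text does not specify it.

```python
def morph3x3(image, pick):
    """Slide a 3x3 all-ones SE over `image` in raster order.

    `pick` is max for dilation and min for erosion. Border pixels use only
    the neighbors that lie inside the image (an assumption of this sketch).
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(rows):                 # move down one row at a time
        for c in range(cols):             # move right one pixel at a time
            window = [image[rr][cc]
                      for rr in range(max(r - 1, 0), min(r + 2, rows))
                      for cc in range(max(c - 1, 0), min(c + 2, cols))]
            out[r][c] = pick(window)
    return out


def dilate(image):
    return morph3x3(image, max)


def erode(image):
    return morph3x3(image, min)


def opening(image):
    # Erosion followed by dilation.
    return dilate(erode(image))


def closing(image):
    # Dilation followed by erosion.
    return erode(dilate(image))


def edge(image):
    """Edge map computed as the dilated image minus the original image."""
    dilated = dilate(image)
    return [[d - p for d, p in zip(drow, prow)]
            for drow, prow in zip(dilated, image)]


# A 5x5 test image: dark background (10) with one bright pixel (200).
speck = [[10] * 5 for _ in range(5)]
speck[2][2] = 200
```

With this input, `opening(speck)` returns a uniform background: the erosion removes the bright pixel and the following dilation cannot restore it, which illustrates why erosion and dilation are not reversible.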


3.2 ORD-4C algorithm


If we trace the movement of the SE, we find that some pixels overlap between successive steps and between adjacent rows. For example, in Figure 4, with a 3x3 all-1's SE, the nine double-lined pixels are covered at each movement step. Between adjacent movements, six bold-lined pixels always overlap, and four pixels overlap among the four movement steps shown in Figure 4.


Clearly, the comparison among nine pixels is the dominant computational cost of the algorithm. If we implement the nine-pixel comparison in hardware, we can significantly reduce the system overhead compared with a high-level language implementation. The custom instruction feature of the NIOS soft core processor allows at most two input operands and one output, plus an 11-bit control field that provides further choices for the operation on the input operands. We can build a custom instruction on the 32-bit processor, so that up to eight 8-bit pixel values can be embedded in the instruction operands.
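As an illustration of how 8-bit pixels can share a 32-bit operand, the following sketch packs four pixels into one word and models, in software, a comparison over the four byte lanes. The names `pack4`, `unpack4`, and `cmp4` are hypothetical, and the byte order and the `select_max` flag (standing in for a control bit) are assumptions; the text does not specify the operand layout.

```python
def pack4(p0, p1, p2, p3):
    """Pack four 8-bit pixels into one 32-bit word (p0 in the low byte)."""
    for p in (p0, p1, p2, p3):
        if not 0 <= p <= 0xFF:
            raise ValueError("pixel values must fit in 8 bits")
    return (p3 << 24) | (p2 << 16) | (p1 << 8) | p0


def unpack4(word):
    """Split a 32-bit word back into its four 8-bit lanes, low byte first."""
    return [(word >> shift) & 0xFF for shift in (0, 8, 16, 24)]


def cmp4(word, select_max=True):
    """Software model of a 4-pixel comparison on one 32-bit operand.

    Returns the largest lane (dilation) or the smallest lane (erosion).
    """
    lanes = unpack4(word)
    return max(lanes) if select_max else min(lanes)


operand = pack4(12, 200, 7, 99)
brightest = cmp4(operand)                    # lane maximum
darkest = cmp4(operand, select_max=False)    # lane minimum
```

Two such operands would carry the eight pixel values mentioned above.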


For morphological operations on an 8-bit grey-scale image with the 3x3 all-1's SE mentioned above, six pixels overlap between successive comparison steps. Based on this observation, we can embed six 8-bit pixel values into two 32-bit input operands. The comparison among nine pixels can then be done in two levels: first compare the left six pixels and the right six pixels separately, and then compare the results of these two 6-pixel comparisons to get the final morphological operation result for the current central pixel. This design has two drawbacks. First, the intermediate results cannot be reused between two adjacent rows. Second, and more wasteful, the final result for the current central pixel is generated by comparing two 8-bit values, the intermediate result L-6 of the left six pixels and the intermediate result R-6 of the right six pixels, so this comparison uses only two 8-bit fields of the two 32-bit input operands.

Figure 1. The NIOS embedded programmable processor [2]. [Block diagram labels: NIOS CPU, SRAM, FLASH, Serial Port, UART, Timer, IRQ, PBM, Area Available For Customization, APEX Device.]



As mentioned above, four pixels always overlap among the comparisons of four adjacent steps in the same row and in adjacent rows. If we use only one input operand for the 32-bit customized comparison instruction, embedding four 8-bit pixels, then two levels with five 4-pixel comparisons are enough for each nine-pixel comparison. We need to create two temporary buffers. For each row, every time a new pixel value is input, we first perform the first-level comparison: the new pixel, the pixel to its left in the same row, and the two corresponding pixels in the previous row are combined into the input operand of the custom instruction, and the comparison result is saved in the corresponding buffer. With this new result, we can then perform the second-level comparison by combining four values from the two temporary buffers. The result is the morphological operation result for a pixel in the previous row. So, with two comparisons, we generate one result for the previous row. All 32 bits of the input operand are used, both for the pixel values and for the intermediate results, which makes this scheme much more efficient than the six-pixel comparison custom instruction. We call it the One Row Delay 4-Pixel Comparison (ORD-4C) algorithm.
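The two-level scheme described above can be sketched in software as follows. This is one reading of the description, not the hardware implementation: the function name `ord4c`, the buffer names, and the choice to leave border pixels at their input values are assumptions. Each call to `pick` on four values stands in for one 4-pixel custom instruction.

```python
def ord4c(image, pick=max):
    """One Row Delay 4-Pixel Comparison, modeled on lists of rows.

    pick=max gives dilation, pick=min gives erosion. Only interior pixels
    are computed; border pixels keep their input values (an assumption).
    """
    rows, cols = len(image), len(image[0])
    out = [row[:] for row in image]
    prev_buf = [0] * cols          # first-level results of the previous row
    for r in range(1, rows):
        cur_buf = [0] * cols       # first-level results of the current row
        for c in range(1, cols):
            # Level 1: the new pixel, its left neighbor, and the two
            # pixels directly above them (a 2x2 block).
            cur_buf[c] = pick(image[r][c], image[r][c - 1],
                              image[r - 1][c], image[r - 1][c - 1])
            # Level 2: four overlapping 2x2 results cover the 3x3 block
            # centered on (r - 1, c - 1), i.e. a pixel of the previous row.
            if r >= 2 and c >= 2:
                out[r - 1][c - 1] = pick(cur_buf[c], cur_buf[c - 1],
                                         prev_buf[c], prev_buf[c - 1])
        prev_buf = cur_buf
    return out


# Small deterministic test image (5 rows by 6 columns).
sample = [[(7 * r + 13 * c * c + r * c) % 256 for c in range(6)]
          for r in range(5)]
dilated = ord4c(sample, max)
eroded = ord4c(sample, min)
```

In this sketch every interior output costs two 4-value comparisons, because each first-level result is computed once and then reused by four neighboring outputs.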


4. HARDWARE IMPLEMENTATION



In most image processing applications, significant computation is required, even for the simple morphological operations previously mentioned. For example, consider the overhead of a dilation on a 256x256 8-bit grey-scale image with a 3x3 SE. Minimally, 256x256x9 8-bit comparisons must be performed.
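The figure quoted above can be worked out directly. The snippet below also gives, for comparison, a rough count for the ORD-4C scheme of Section 3.2 under the assumption of two 4-pixel custom instructions per image pixel with border effects ignored. These are operation counts only, not measured execution times, and the variable names are illustrative.

```python
WIDTH, HEIGHT, SE_PIXELS = 256, 256, 9

# Figure quoted in the text: one 8-bit comparison per SE pixel per image pixel.
naive_comparisons = WIDTH * HEIGHT * SE_PIXELS

# Assumption: ORD-4C issues two 4-pixel custom instructions per image pixel
# (one per level); border effects are ignored.
ord4c_instructions = WIDTH * HEIGHT * 2

print(naive_comparisons)    # 589824
print(ord4c_instructions)   # 131072
```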


Simplifying the intensive computation is the first concern in improving system performance. Considering the morphological operations presented in Section 3, we can see that they perform local operations on the image: each step operates only on the image pixels covered by the SE. This implies that a parallel algorithm can be used to further improve system performance.


4.1 Architecture of distributed system


To determine a reasonable and efficient parallel distributed architecture for a problem, there are many cost metrics that can be used to investigate cost-performance tradeoffs in the network [6]. Here, a message-passing parallel architecture with four NIOS processors connected via 10M Ethernet network cards is applied [7], as shown in Figure 5.


4.2 Parallel morphological operation


Figure 4. Related pixels in each step of morphological operations. (a) Pixels for step (1, 1). (b) Pixels for step (1, 2). (c) Pixels for step (2, 1). (d) Pixels for step (2, 2). (e) Step (1, 1), step (1, 2), step (2, 1), step (2, 2). [The panels label the pixels as Pi,j,k, with k = 1 to 9 for step (i, j).]

Assume we need to perform a morphological operation on a P x Q matrix. For a non-parallel architecture with one processor used for computation, we have a host that holds the data matrix and a processor that performs the morphological operation. The time for the operation consists of the communication time for data transmission between host and processor and the operation time of the processor on the data. For a parallel architecture, we can split the data matrix into N (here N = 4) independent parts. Compared with the non-parallel operation, the theoretical speedup of the parallel architecture can reach N.
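The timing argument above can be written as a simple model. The symbols below are introduced here for illustration and do not appear in the original text: T_comm and T_op are the communication and operation times of the non-parallel system, and T_comm^(N) is the communication time per processor when the data is split into N parts.

```latex
\begin{align}
T_{1} &= T_{\mathrm{comm}} + T_{\mathrm{op}}, \\
T_{N} &= T_{\mathrm{comm}}^{(N)} + \frac{T_{\mathrm{op}}}{N}, \\
S_{N} &= \frac{T_{1}}{T_{N}}
       = \frac{T_{\mathrm{comm}} + T_{\mathrm{op}}}
              {T_{\mathrm{comm}}^{(N)} + T_{\mathrm{op}}/N}.
\end{align}
```

Under the idealized assumption that communication also scales with the data share, T_comm^(N) = T_comm / N, the model gives S_N = N, which is the theoretical limit of 4 for the four-processor system. Any communication cost that does not shrink with N lowers the achievable speedup.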




Figure 5. Architecture of the distributed system.



For the morphological operation implementation on the distributed NIOS processor system, if we split the image symmetrically into four parts, we suffer from an inner boundary problem. As shown in Figure 6 (a), when we perform an operation on pixel P1 in partial image part 1, the three bottom pixels of its original 8-neighborhood are cut off from the range of computation, so we get an incorrect result for pixel P1. For pixel P2 in partial image part 1, the three right-most pixels and the three bottom-most pixels of its original neighborhood are outside the range of computation. To avoid this problem, we need to apply asymmetrical splitting to the image, as shown in Figure 6 (b).
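The inner boundary problem and one way around it can be sketched as follows. This is one plausible reading of the asymmetrical splitting, assuming that each part is sent one extra row and column of pixels across every inner boundary (the `halo` parameter); the function names and this exact overlap scheme are assumptions, not details taken from Figure 6. The four parts are processed in a loop here, standing in for the four processors.

```python
def dilate3x3(image):
    """3x3 all-ones dilation; border pixels use in-image neighbors only."""
    rows, cols = len(image), len(image[0])
    return [[max(image[rr][cc]
                 for rr in range(max(r - 1, 0), min(r + 2, rows))
                 for cc in range(max(c - 1, 0), min(c + 2, cols)))
             for c in range(cols)]
            for r in range(rows)]


def split_four(rows, cols, halo):
    """Return four (owned, sent) index ranges as (r0, r1, c0, c1) tuples.

    `owned` is the quadrant a processor is responsible for; `sent` is the
    quadrant widened by `halo` pixels across the inner boundaries.
    """
    rmid, cmid = rows // 2, cols // 2
    parts = []
    for r0, r1 in ((0, rmid), (rmid, rows)):
        for c0, c1 in ((0, cmid), (cmid, cols)):
            sent = (max(r0 - halo, 0), min(r1 + halo, rows),
                    max(c0 - halo, 0), min(c1 + halo, cols))
            parts.append(((r0, r1, c0, c1), sent))
    return parts


def distributed_dilate(image, halo=1):
    """Dilate each part separately and stitch the owned pixels together."""
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for (r0, r1, c0, c1), (s_r0, s_r1, s_c0, s_c1) in split_four(rows, cols, halo):
        block = [row[s_c0:s_c1] for row in image[s_r0:s_r1]]  # data sent to one processor
        result = dilate3x3(block)
        for r in range(r0, r1):
            for c in range(c0, c1):
                out[r][c] = result[r - s_r0][c - s_c0]
    return out


# One bright pixel just across the inner boundary from the top-left part.
image = [[0] * 6 for _ in range(6)]
image[3][3] = 255
```

With `halo=0` (symmetrical splitting) the pixel at row 2, column 2 misses its bright neighbor and comes out as 0 instead of 255; with `halo=1` the stitched result matches the dilation of the whole image.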


5. SYSTEM ARCHITECTURE OF NIOS IMPLEMENTATION


Based upon NIOS development boards with the APEX EP20K200EFC484-2x programmable logic device, we reconfigure and compile the built-in customized 32-bit soft core for the NIOS processor with custom instructions implementing the ORD-4C algorithm for the comparison.


We compared the dilation, erosion, opening, closing, and edge detection morphological operations on the 256x256 8-bit grey-scale image "lena.png" on different systems. Figure 7 shows the morphological image operations on the 256x256 8-bit lena image, and Figure 8 shows the system performance. According to our results, compared with the C/C++ approach on a non-distributed NIOS soft core processor system, the custom instruction approach on the non-distributed system shows a speedup of approximately 2.99 on the basic dilation and erosion operations and a speedup of 1.7 on the edge detection operation. The same results are obtained on a distributed NIOS soft core processor system. Compared with the non-distributed C/C++ approach, the customized distributed system gains a speedup of around 11.8.
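As a rough consistency check on these figures, and assuming that the custom instruction speedup and the distribution speedup combine multiplicatively (an assumption made here, not a statement from the measurements), the contribution of distribution alone can be estimated. The symbols S_total, S_custom, and S_dist are introduced only for this illustration.

```latex
\begin{align}
S_{\mathrm{total}} &\approx S_{\mathrm{custom}} \times S_{\mathrm{dist}}, \\
S_{\mathrm{dist}} &\approx \frac{S_{\mathrm{total}}}{S_{\mathrm{custom}}}
                  = \frac{11.8}{2.99} \approx 3.95 .
\end{align}
```

Under that assumption, the estimate is close to the theoretical limit of N = 4 for the four-processor system.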





Figure 7. Morphological operations on lena: (a) lena.png, (b) dilation, (c) erosion, (d) opening, (e) closing, (f) edge detection.


Figure 8. System performance of different implementations on lena. [Bar chart; vertical axis from 0.0E+00 to 7.0E+07; categories: C/C++ approach (non-distributed), Custom Instruction (non-distributed), C/C++ approach (distributed), Custom Instruction (distributed); series: Dilation, Edge Detection.]

Figure 6. Inner boundary of the splitting of the image: (a) symmetrical splitting, (b) asymmetrical splitting. [Diagram labels: Part 1 to Part 4 and pixels P1 and P2 in both panels; panel (b) also marks the rows and columns of each part.]

6. REFERENCES


[1] White Paper of Excalibur Backgrounder, version 1, http://www.altera.com, June 2000.

[2] NIOS Embedded Processor Development Board, version 2.1, http://www.altera.com, April 2002.

[3] Custom Instructions for the Nios Embedded Processor, version 1.1, http://www.altera.com, April 2002.

[4] Rafael C. Gonzalez and Richard E. Woods, Digital Image Processing, third edition, Addison-Wesley, 1993.

[5] Harley R. Myler and Arthur R. Weeks, Computer Imaging Recipes in C, Prentice-Hall, 1993.

[6] Vipin Kumar, Ananth Grama, Anshul Gupta and George Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, Benjamin/Cummings, 1994.

[7] Nios Ethernet Development Kit User Guide, version 2.1, http://www.altera.com, April 2002.