High Speed FPGA Implementation of Median Filters


Béla Fehér, Gábor Szedő

Technical University of Budapest, Department of Measurement and Information Systems
Budapest, H-1521 Hungary

feher@mmt.bme.hu, szedo@mmt.bme.hu

Abstract - For some low-level data processing functions, such as FIR filtering, pattern recognition or correlation, where parallel implementation is supported by special-purpose arithmetic matched to the architecture, high-throughput FPGA circuits easily outperform even the most advanced DSP processors. This paper presents another DSP application, a high-speed non-linear median filter implementation. A general scheme is proposed which, with minimal modifications, is able to realise both 1D and 2D, standard and recursive median filters. Finally, results of implementations on XC6200 and XC4003 FPGAs are reported.

INTRODUCTION

Median filters

One- or two-dimensional median filtering is a non-linear operation known for preserving sharp edges in signals or images. It is particularly effective in removing non-Gaussian, impulsive noise. The standard median filter is characterised as follows: the output value of the filter is the input sample value located in the center of the ordered list of the samples in the window. The sampling window is shifted through the full data set.

If the window size is 2*N+1, the input sample values in the window are x(n+N), x(n+N-1), ..., x(n), ..., x(n-N+1), x(n-N). Let the magnitude-ordered sample values of the window be m(0), m(1), ..., m(2*N); then the filter output is y(n) = m(N).
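The definition above can be sketched in a few lines of Python. This is only a software reference model that sorts every window, not the hardware method of this paper; the function name median_filter and the edge handling (repeating the first and last samples) are assumptions made for the example.

```python
def median_filter(x, N):
    """Standard 1D median filter with a window of 2*N + 1 samples."""
    # Assumption: pad by repeating the edge samples so every window is full.
    padded = [x[0]] * N + list(x) + [x[-1]] * N
    y = []
    for n in range(len(x)):
        window = padded[n:n + 2 * N + 1]   # x(n-N) ... x(n+N)
        m = sorted(window)                 # m(0) <= m(1) <= ... <= m(2*N)
        y.append(m[N])                     # y(n) = m(N)
    return y


noisy = [1, 1, 9, 1, 1, 5, 5, 5]           # one spike, then a step edge
filtered = median_filter(noisy, 1)
```

With N = 1 the isolated spike is removed, while the step from 1 to 5 stays sharp.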

In recursive median filters [4], half of the median window contains the latest outputs (medians). Using the same notation, the window of 2*N+1 values contains x(n+N), x(n+N-1), ..., x(n), y(n-1), ..., y(n-N).
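A minimal sketch of the recursive variant follows, assuming the window consists of the inputs x(n), ..., x(n+N) and the N most recent outputs y(n-1), ..., y(n-N). The function name, the repeated edge samples and the initial values of the past outputs are assumptions of the example, not details taken from the paper.

```python
def recursive_median_filter(x, N):
    """Recursive median filter, window of 2*N + 1 samples (N >= 1)."""
    # Assumption: the last sample is repeated at the end, and the outputs
    # that would precede the signal are initialised to the first sample.
    padded = list(x) + [x[-1]] * N
    y = [x[0]] * N
    for n in range(len(x)):
        past_outputs = y[-N:]                  # y(n-N) ... y(n-1)
        inputs = padded[n:n + N + 1]           # x(n) ... x(n+N)
        y.append(sorted(past_outputs + inputs)[N])
    return y[N:]                               # drop the assumed initial outputs


signal = [0, 0, 5, 0, 5, 5, 0, 5, 5, 5]
smoothed = recursive_median_filter(signal, 1)
```

Because earlier outputs feed back into the window, the oscillation at the start of the signal is suppressed more strongly than by a single pass of the standard filter.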

In contrast to linear filters, when median filtering is applied repeatedly to a set of input data, the output of the filter converges to a stable signal in a finite number of steps. These stable signals extracted from the input signals are called roots [4]. The advantage of recursive median filters is their ability to extract a root from the input signal in a single pass. As shown later, the hardware realisations of the two filters are almost identical.
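The convergence to a root can be illustrated by repeating a standard median pass until the signal stops changing. The sketch below is an illustration only; the helper names median_pass and find_root, the repeated edge samples and the max_passes guard are assumptions.

```python
def median_pass(x, N):
    """One pass of a standard median filter (edge samples repeated)."""
    padded = [x[0]] * N + list(x) + [x[-1]] * N
    return [sorted(padded[n:n + 2 * N + 1])[N] for n in range(len(x))]


def find_root(x, N, max_passes=1000):
    """Repeat the filter until the signal no longer changes.

    Returns the root and the number of passes that changed the signal.
    """
    current = list(x)
    for passes in range(max_passes):
        filtered = median_pass(current, N)
        if filtered == current:
            return current, passes
        current = filtered
    raise RuntimeError("no root reached within max_passes")


root, passes = find_root([0, 0, 5, 0, 5, 5, 0, 5, 5, 5], 1)
```

For this short test signal a single pass already reaches the root; longer or noisier signals generally need more passes.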

Selecting a sample other than the central one can also be meaningful. When the signal is corrupted by impulsive noise and the noise spikes can only increase the signal, it is reasonable to select an element to the left of the center of the increasingly ordered window. In image filtering, this operation adjusts the brightness of the image, so a specific element should be selected for the desired brightness.
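Selecting an element other than the central one only changes the index read from the ordered window. The following sketch shows this for positive-only spikes; the name rank_filter, the rank parameter and the repeated edge samples are assumptions of the example.

```python
def rank_filter(x, N, rank):
    """Select m(rank) instead of the median m(N); 0 <= rank <= 2*N."""
    # Assumption: edge samples are repeated so every window is full.
    padded = [x[0]] * N + list(x) + [x[-1]] * N
    return [sorted(padded[n:n + 2 * N + 1])[rank] for n in range(len(x))]


signal = [2, 2, 9, 9, 2, 2, 2]            # a two-sample positive spike
median_out = rank_filter(signal, 1, 1)    # central element of the window
lower_out = rank_filter(signal, 1, 0)     # element left of the center
```

Here the two-sample spike is too wide for the three-sample median and passes through unchanged, while the lower-ranked element removes it.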

Two-dimensional median filters are commonly used in image processing, where spike noise should be removed from an image while sharp edges are retained. A 2D median filter with an NxN window can be built from one 1D median filter with a buffer of NxN samples. When the 2D window is shifted by one pixel, N new samples (pixels) enter the filter and the N least recent elements are discarded. Only after these N steps is the central element selected.
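The 2D scheme can be modelled in software with a FIFO of N*N samples that receives one image column of N pixels per window shift. The sketch below assumes a left-to-right sweep over one band of N rows and an odd N; the function name median_2d_row and the test image are hypothetical.

```python
from collections import deque


def median_2d_row(image, top, N):
    """Medians of the N x N windows whose top row is `top`, left to right."""
    fifo = deque(maxlen=N * N)          # models the buffer of N*N samples
    medians = []
    for col in range(len(image[0])):
        # Shifting the window by one pixel injects one column of N samples;
        # the bounded deque drops the N least recent samples automatically.
        for row in range(top, top + N):
            fifo.append(image[row][col])
        # A median is taken only once the whole column has been injected
        # and the buffer holds a complete N x N window.
        if len(fifo) == N * N:
            medians.append(sorted(fifo)[(N * N) // 2])
    return medians


image = [[1, 2, 3, 4, 5],
         [6, 7, 99, 9, 10],
         [11, 12, 13, 14, 15]]
row_medians = median_2d_row(image, 0, 3)
```

The outlier 99 lies inside all three windows but never appears at the output.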

As seen, the implementation of a standard median filter requires an ordering operation on the samples inside the window. The complexity of this operation is strongly affected by the size N of the data sample window: the cost grows at least as O(N log(N)). General-purpose DSP processors with a single operational core therefore exhibit strongly decreasing performance and are not able to provide real-time median filtering in high-speed applications. FPGAs, on the other hand, are the best candidates in this field, for the following reasons:


1. The internal arithmetic of a median filter is based on comparison, data transfer and selection operations only; no multiplications are required.

2. Because of the simplicity of the basic array processor elements, a direct, optimised mapping of the algorithm onto the FPGA's logic resources can be defined, where almost all operations and data transfers are constrained to local, neighbour-to-neighbour communication. Only the actual input and output values need external signal propagation.

FPGAs

A Field Programmable Gate Array contains a large array of configurable cells (or logic blocks) on a single chip. Each cell can implement one logic function and/or perform routing to allow inter-cell communication. All of these operations can take place simultaneously across the whole array of cells. The basic architecture of an FPGA consists of a 2-D array of cells. Communication between the cells takes place through interconnection resources. The outer edge of the array consists of special blocks capable of performing certain I/O operations to connect the chip to the surrounding circuits. The architecture of a typical FPGA is illustrated in Figure 1.


Figure 1. FPGA structure (a 2-D array of cells connected by interconnection resources and surrounded by I/O blocks)


Programmable switches set the computation unit function and the routing configuration of each cell. Several technologies are used to implement these programmable switches.

FPGA cells differ greatly in their size and implementation capacity. FPGA cell complexity extends from implementing a single gate to cells containing look-up tables capable of implementing logic functions of up to 5 inputs. At the other extreme, some FPGAs contain thousands of fine-grain cells that consist of only a few transistors.

The SRAM based FPGA

FPGAs provide the benefits of a custom CMOS VLSI chip, while avoiding the initial cost, time delay, and inherent risk of a conventional masked gate array. The FPGAs are customised by loading configuration data into the internal memory cells. The FPGA can be programmed an unlimited number of times and supports system clock rates up to 200 MHz.

FPGA devices have proved their advantages in high-performance custom computing machines and reconfigurable accelerators. Applications of FPGAs in DSP and multimedia environments have demonstrated their capability for direct hardware implementation of computation-intensive algorithms.

IMPLEMENTATION

Operation fundamentals

The median filter is realised by 2*N+1 simple processor elements. The activity of each processor element is determined by the value of the current input, the neighbours and the time stamp of the data item, which gives the "age" of the sample value in the window. The implementation of the time stamp is crucial for the high-speed, on-the-fly computation of a real-time median filter. The time stamps of the data items tell every processor node whether its current sample value must be preserved in the window. The oldest sample of the window, with time index x(2*N+1), is discarded in every iteration, independently of its position and sample value. This is realised by a chain of independent index arithmetic units, so the filter architecture is actually composed of a double chain of parallel processing nodes.


The main functions of the processor nodes are as follows. The oldest sample value is located somewhere in the chain of nodes, say at index k, with 0 <= k <= 2*N (index 0 stores the smallest and index 2*N the largest input sample value of the window). Every node with an index less than k is prepared to pass its stored sample value to the right neighbour, i.e. to fill up the ordered list of values after the oldest sample is dropped. Nodes with indices greater than k stay in the default state. All nodes compare their locally stored, older sample value to the new input, which is broadcast to them. Every node whose stored value is less than the input is prepared to pass its value to the left neighbour, i.e. to preserve the magnitude-ordered list of the sample values in the window. Let the index of the new input sample be n. Three cases are possible.

a) k = n. The new sample simply replaces the oldest one; no other samples are moved.

b) k < n. The nodes from 0 to k-1 are prepared to move right, and the nodes from 0 to n are also prepared to move left. For the first k nodes the two assignments cancel each other, so they remain in place, while the n-k samples move left to make room for the new sample. The new list is the properly ordered list of the sample values.

c) k > n. The nodes from 0 to k-1 are prepared to move right, and the nodes from 0 to n are also prepared to move left. For the first n nodes the two assignments cancel each other, so they remain in place, while the k-n-1 samples move right to make room for the new sample. The new list is the properly ordered list of the sample values.
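The three cases can be traced with a small sequential model in which each node is a (value, age) pair. This is a sketch of the bookkeeping only, since the hardware performs all moves in parallel. It assumes n is the index the new sample occupies after the oldest one has been dropped, so shift counts may differ by one from the text's convention; the name update_window is hypothetical.

```python
def update_window(values, ages, x):
    """Replace the oldest sample by x, keeping `values` in ascending order.

    values[i] and ages[i] model the sample and the age counter of node i.
    Returns the median (the content of the central node).
    """
    size = len(values)
    k = ages.index(max(ages))                 # node holding the oldest sample
    # Assumed convention: n is the index x occupies once the oldest is dropped.
    n = sum(1 for i in range(size) if i != k and values[i] < x)
    if k < n:                                 # nodes k+1 .. n move left
        for i in range(k, n):
            values[i], ages[i] = values[i + 1], ages[i + 1]
    elif k > n:                               # nodes n .. k-1 move right
        for i in range(k, n, -1):
            values[i], ages[i] = values[i - 1], ages[i - 1]
    for i in range(size):                     # every stored sample ages by one
        ages[i] += 1
    values[n], ages[n] = x, 0                 # the new sample is the youngest
    return values[size // 2]


values, ages = [0] * 5, list(range(5))        # window of 2*N + 1 = 5 zeros
outputs = [update_window(values, ages, x) for x in [5, 3, 8, 1, 9, 2, 7]]
```

When k equals n neither loop runs and the new sample overwrites the oldest one in place, which corresponds to case a).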

Operations of the processing nodes are scheduled by the input samples and the system clock. The output is usually provided by the central processor node with index N, but if necessary any node can be configured to provide the output.

In the recursive median filter implementation, the “age” of the currently selected processing element (PE) is cleared, while the counters of all other nodes are incremented. Then the least recent sample is discarded and a new sample is injected. After reordering, a new median is selected, its age is cleared, the oldest element is discarded, and so on.

The available operational speed allows the filter to be applied in real-time digital image processing or on-line video filtering.

Realisations

The high-speed median filter was first designed as a modular design in schematic form. The integrated XILINX Foundation Series [5] tools allowed the design to be simulated and timing information to be calculated. The design was implemented on a custom test card and on the H.O.T. Works board by VCC. Both implementations can easily be reconfigured for almost arbitrary data width and window size, as well as data representation (signed/unsigned). Using this information, the application-dependent, custom median filter blocks are ready to be synthesised. A bit-stream for the XC4003 can be generated directly from the schematic, as this device is supported by Foundation. Finally, the bit-stream (configuration information) is downloaded to a custom test card.

The XC6200 implementation involves several other design steps:

1. The modules of the design should be described in VELAB, a special structural VHDL subset for building XC6200 applications.

2. The design should be placed and routed by Xact6000, and a bit-stream (CAL file) should be generated.

3. A simple user interface should be developed in C++, which communicates with the H.O.T. Works card plugged into a PCI slot of a PC.

The architecture shown in Figure 2 is flexible enough to realise any of the filters mentioned above. Loading the new sample into the input register starts the operation. In each PE, the new sample is compared to the sample value that PE contains. From the counter values, all the PEs can be informed through the dedicated nets whether the least recent sample is on the left or on the right. From this and the result of the comparison, each PE decides whether the sample and counter values should be passed on and, if so, in which direction. The output, the median value, is simply obtained from one of the PEs. When the filter is used in 2D mode, N samples are injected first and only the Nth output is used; the first N-1 output values are omitted.

Figure 2. Block diagram of the architecture (the INPUT Sample enters the INPUT REGISTER, which feeds the chain of processing elements PE (N-1), PE (N) and PE (N+1), each containing a REGISTER, COMPARATOR, COUNTER and LOGIC; the chain provides Median OUT)
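The local decision taken in each PE can be modelled as a choice between four sources for its next content: its left neighbour, its right neighbour, its own register, or the input register. The sketch below is one possible reading of that rule, not the authors' logic equations. The names pe_array_step, goes_right and goes_left are hypothetical, and the oldest sample is located with a global search where the hardware would use the counter nets.

```python
def pe_array_step(values, ages, x):
    """One clock step of a chain of PEs holding an ascending window."""
    size = len(values)
    k = ages.index(max(ages))                  # PE holding the least recent sample

    def goes_right(j):                         # oldest is to the right, value >= x
        return 0 <= j < k and values[j] >= x

    def goes_left(j):                          # oldest is to the left, value < x
        return k < j < size and values[j] < x

    next_values, next_ages = [], []
    for i in range(size):
        if goes_right(i - 1):
            src = i - 1                        # load from the left neighbour
        elif goes_left(i + 1):
            src = i + 1                        # load from the right neighbour
        elif i != k and not goes_right(i) and not goes_left(i):
            src = i                            # keep the stored sample
        else:
            src = None                         # load the new input sample
        if src is None:
            next_values.append(x)
            next_ages.append(0)
        else:
            next_values.append(values[src])
            next_ages.append(ages[src] + 1)
    values[:], ages[:] = next_values, next_ages
    return values[size // 2]                   # output of the central PE


values, ages = [0] * 5, list(range(5))
outputs = [pe_array_step(values, ages, x) for x in [5, 3, 8, 1, 9, 2, 7]]
```

Each PE needs only its own comparison result, those of its two neighbours and the position of the oldest sample relative to itself, which is consistent with the neighbour-to-neighbour communication the architecture relies on.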

CONCLUSIONS

In this paper a general, high-speed parallel median filtering architecture was proposed. The parameterised architectural description allows custom filter realisations in terms of input sample resolution, filter window width, and 1D or 2D implementation.

Compared to other parallel median filter realisations [2], where stack decomposition was brought to the fore, this architecture does not involve huge additional matrices that make realisation impossible. Compared to solutions carried out on real multiprocessor structures [3], our architecture requires only one FPGA, placed on a single board plugged into a PC.

Our application is expected to process one sample in less than 100 ns. This operational speed is sufficient for real-time 1D median filtering with an arbitrarily sized data window, as long as the logic and routing resources of the FPGA are not fully utilised. In the case of the 6216 Reconfigurable Processing Unit, sixteen nodes with 16-bit data words could be realised.
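The quoted sample time translates into throughput as follows. The calculation treats 100 ns as an upper bound on the time per sample and assumes that in 2D mode each output pixel costs N sample injections; the 3x3 window and the function name output_rate_hz are illustrative assumptions, not figures from the implementation.

```python
def output_rate_hz(sample_time_ns, samples_per_output=1):
    """Outputs per second when each output needs `samples_per_output` steps."""
    return 1e9 / (sample_time_ns * samples_per_output)


rate_1d = output_rate_hz(100)        # 1D: one output per input sample
rate_2d = output_rate_hz(100, 3)     # 2D, 3x3 window: 3 new pixels per output
```

Under these assumptions the filter delivers at least 10 million outputs per second in 1D mode and roughly 3.3 million pixels per second with a 3x3 window.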

REFERENCES

[1] Olli Vainio, Yrjö Neuvo, Steven E. Butner, A Signal Processor for Median-Based Algorithms, IEEE Transactions on Acoustics, Speech, Processing, Vol. 37, No. 9, September 1989.

[2] V.V. Bapeswara Rao and K. Sankara Rao, A New Algorithm for Real-Time Median Filtering, IEEE Transactions on Acoustics, Speech, Processing, Vol. ASSP-34, No. 6, December 1986.

[3] M. O. Ahmad and D. Sundararajan, Parallel Implementation of a Median Filtering Algorithm, Int. Symp. on Signals and Systems, 1988.

[4] Dobrowiecki Tadeusz, Medián Szűrők, Mérés és Automatika, 37. Évf., 1989. 3. szám

[5] Xilinx Foundation Series Quick Start Guide, 1991-1997. Xilinx, Inc.