High Speed FPGA Implementation of Median Filters
Béla Fehér
Gábor Szedő
Technical University of Budapest, Department of Measurement and Information Systems
Budapest, H-1521 Hungary
feher@mmt.bme.hu
szedo@mmt.bme.hu
Abstract

For some low-level data processing functions, such as FIR filtering, pattern recognition or correlation, where the parallel implementation is supported by special-purpose arithmetic matched to the architecture, high-throughput FPGA circuits easily outperform even the most advanced DSP processors. In this paper another DSP application, a high-speed non-linear median filter implementation, is presented. A general scheme is proposed which, with minimal modifications, is able to realise both 1D and 2D, standard and recursive median filters. Finally, results of implementations on XC6200 and XC4003 FPGAs are reported.
INTRODUCTION
Median filters
One- or two-dimensional median filtering is a non-linear operation which is known for preserving sharp edges in signals or images. It is particularly effective in removing non-Gaussian, impulsive noise. The standard median filter is characterised by the following method: the output value of the median filter is the input sample value located in the center of the list of ordered samples. The sampling window is shifted through the full data window.
If the window size is 2*N+1, the actual input sample values in the window are x(n+N), x(n+N-1), ..., x(n), ..., x(n-N+1), x(n-N). Let the magnitude-ordered sample values of the window in the data vector be m(0), m(1), ..., m(2*N); then the filter output is determined as y(n)=m(N).
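The definition above can be sketched in a few lines of Python. This is a minimal software illustration, not the hardware described in this paper; the function name median_filter_1d and the border handling (repeating the first and last samples so that every output has a full window) are assumptions made for the example.

```python
def median_filter_1d(x, N):
    """Standard median filter with window size 2*N+1.

    Assumption: the signal is extended at both ends by repeating its
    first and last samples.
    """
    padded = [x[0]] * N + list(x) + [x[-1]] * N
    y = []
    for n in range(len(x)):
        window = padded[n:n + 2 * N + 1]   # x(n-N) ... x(n+N)
        m = sorted(window)                 # m(0) <= m(1) <= ... <= m(2*N)
        y.append(m[N])                     # y(n) = m(N)
    return y


# A single spike (9) is removed while the step from 1 to 5 stays sharp.
signal = [1, 1, 9, 1, 1, 5, 5, 5, 5]
print(median_filter_1d(signal, 1))  # [1, 1, 1, 1, 1, 5, 5, 5, 5]
```

Sorting every window from scratch, as done here, is exactly the per-sample cost that a software implementation has to pay.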
In the case of recursive median filters [4], half of the median window contains the latest outputs (medians). Using the same notation, the window contains: x(n+N), x(n+N-1), ..., x(n), y(n), y(n-1), ..., y(n-N).
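A recursive median filter can be sketched in software as follows. The sketch assumes the common formulation in which the window for output y(n) holds the N previous outputs y(n-1), ..., y(n-N) together with the inputs x(n), ..., x(n+N), giving 2*N+1 entries; the function name and the border handling (repeating the first and last samples) are likewise assumptions of the example.

```python
def recursive_median_filter_1d(x, N):
    """Recursive median filter with window size 2*N+1.

    Assumption: y(n) = median of y(n-N)..y(n-1), x(n)..x(n+N), with the
    signal extended by repeating its first and last samples.
    """
    buf = [x[0]] * N + list(x) + [x[-1]] * N   # buf[n + N] holds x(n)
    for n in range(len(x)):
        # Entries left of position n + N have already been overwritten by
        # outputs, so this slice mixes past medians with future inputs.
        window = buf[n:n + 2 * N + 1]
        buf[n + N] = sorted(window)[N]
    return buf[N:N + len(x)]


oscillating = [0, 5, 0, 5, 0, 5, 0]
print(recursive_median_filter_1d(oscillating, 1))  # [0, 0, 0, 0, 0, 0, 0]
```

In this example the oscillation is flattened completely in a single run, because each output is fed back into the following windows.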
In contrast to linear filters, when median filtering is applied repeatedly to a set of input data, the output of the filter converges to a stable signal in a finite number of steps. These stable signals extracted from the input signals are called roots [4]. The advantage of recursive median filters is the ability to extract the roots from input signals in one run. As shown later, the hardware realisations of the two filters are almost identical.
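The convergence to a root can be observed with a short experiment that applies a standard median filter repeatedly until the signal stops changing. The helper names median_pass and find_root, the border handling (repeating the edge samples) and the max_passes safety limit are assumptions of this sketch.

```python
def median_pass(x, N):
    """One pass of a standard median filter (window 2*N+1, edges repeated)."""
    padded = [x[0]] * N + list(x) + [x[-1]] * N
    return [sorted(padded[n:n + 2 * N + 1])[N] for n in range(len(x))]


def find_root(x, N, max_passes=1000):
    """Filter repeatedly until the signal is unchanged.

    Returns the root and the number of passes that modified the signal.
    """
    current = list(x)
    for passes in range(max_passes):
        filtered = median_pass(current, N)
        if filtered == current:
            return current, passes
        current = filtered
    raise RuntimeError("no root reached within max_passes")


root, passes = find_root([0, 5, 0, 5, 0, 5, 0], 1)
print(root, passes)  # [0, 0, 0, 0, 0, 0, 0] 3
```

For this input three passes of the standard filter are needed before the signal becomes stable.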
Selecting a sample other than the central one can also be meaningful. When the signal is corrupted by impulsive noise and the noise spikes can only increase the signal, it is reasonable to select an element to the left of the center of the increasingly ordered window. In image filtering, this operation adjusts the brightness of the image, so the element to select is determined by the desired brightness.
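Selecting another element of the ordered window only changes the index that is read out. The sketch below is illustrative: rank_filter_1d and its rank parameter are hypothetical names, and edge samples are repeated at the borders as an assumption.

```python
def rank_filter_1d(x, N, rank):
    """Output m(rank) of each ordered window of size 2*N+1.

    rank = N gives the standard median; a smaller rank selects an element
    to the left of the center of the increasingly ordered window.
    """
    if not 0 <= rank <= 2 * N:
        raise ValueError("rank must lie between 0 and 2*N")
    padded = [x[0]] * N + list(x) + [x[-1]] * N
    return [sorted(padded[n:n + 2 * N + 1])[rank] for n in range(len(x))]


# A positive disturbance three samples wide survives the median of a
# five-sample window, but not the selection of the second-smallest element.
noisy = [2, 2, 9, 9, 9, 2, 2, 2]
print(rank_filter_1d(noisy, 2, 2))  # [2, 2, 9, 9, 9, 2, 2, 2]
print(rank_filter_1d(noisy, 2, 1))  # [2, 2, 2, 2, 2, 2, 2, 2]
```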
Two-dimensional median filters are commonly used in image processing, where spike noise should be removed from an image while sharp edges should be retained. A 2D median filter with an NxN window can be built from one 1D median filter with a buffer of NxN samples. When the 2D window is shifted by one pixel, N new samples (pixels) enter the filter and the N least recent elements are discarded. Only after these N steps is the central element selected.
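The column-by-column feeding scheme can be emulated in software with a FIFO buffer of NxN samples. This is a sketch under assumptions: the function name is hypothetical, N is odd, the image is a list of rows, and only window positions that lie completely inside the image are computed.

```python
from collections import deque


def median_filter_2d(image, N):
    """N x N median over the valid region, using a single 1D buffer."""
    rows, cols = len(image), len(image[0])
    out = []
    for top in range(rows - N + 1):
        # FIFO of N*N samples: appending to a full deque drops the oldest.
        buf = deque(maxlen=N * N)
        out_row = []
        for c in range(cols):
            for r in range(top, top + N):   # N injection steps per shift
                buf.append(image[r][c])
            if len(buf) == N * N:           # read out only after the N-th step
                out_row.append(sorted(buf)[(N * N) // 2])
        out.append(out_row)
    return out


image = [
    [1, 1, 1, 1],
    [1, 9, 1, 1],
    [1, 1, 1, 1],
]
print(median_filter_2d(image, 3))  # [[1, 1]]
```

Each shift of the window costs N buffer updates but only one sort here; in the hardware the ordering is maintained continuously, so no separate sorting step exists.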
As seen, the implementation of a standard median filter requires an ordering operation to be applied to the samples inside the window. The complexity of this operation is strongly affected by the size N of the data sample window: the cost grows at least as O(N log(N)). General-purpose DSP processors with a single operational core therefore exhibit strongly decreasing performance and are not able to provide real-time median filtering in high-speed applications. FPGAs, on the other hand, are the best candidates in this field, for the following reasons:
1. The internal arithmetic of a median filter is based on comparisons, data transfer and selection operations only; no multiplications are required.
2. Because of the simplicity of the basic array processor elements, a direct, optimised mapping of the algorithm onto the FPGA's logic resources can be defined, where almost all the operations and data transfers are constrained to local, neighbour-to-neighbour communications. Only the actual input and output values need external signal propagation.
FPGAs
A Field Programmable Gate Array contains a large array of configurable cells (or logic blocks) on a single chip. Each cell can implement one logic function and/or perform routing to allow inter-cell communication. All of these operations can take place simultaneously across the whole array of cells. The basic architecture of an FPGA consists of a 2-D array of cells. Communication between the cells takes place through interconnection resources. The outer edge of the array consists of special blocks capable of performing certain I/O operations to connect the chip to the surrounding circuits. The architecture of a typical FPGA is illustrated in Figure 1.
Figure 1. FPGA structure: a 2-D array of cells linked by interconnection resources and bordered by I/O blocks.
Programmable switches set the computation unit function and the routing configuration of each cell. Several technologies are used to implement these programmable switches.
FPGA cells differ greatly in their size and implementation capacity. Cell complexity ranges from cells implementing a single gate to cells containing look-up tables capable of implementing logic functions of up to 5 inputs. At the fine-grain end of this range, some FPGAs contain thousands of cells that consist of only a few transistors.
The SRAM based FPGA
FPGAs provide the benefits of a custom CMOS VLSI chip, while avoiding the initial cost, time delay, and inherent risk of a conventional masked gate array. The FPGAs are customised by loading configuration data into the internal memory cells. The FPGA can be programmed an unlimited number of times and supports system clock rates up to 200 MHz.
FPGA devices have proved their advantages in high-performance custom computing machines and reconfigurable accelerators. Applications of FPGAs in the DSP or multimedia environment have verified their capability for the direct hardware implementation of computation-intensive algorithms.
IMPLEMENTATION
Operation fundamentals
The median filter is implemented with 2*N+1 simple processor elements. The activity of each processor element is determined by the value of the current input, by its neighbours and by the time stamp of the data item, which gives the "age" of the sample value in the window. The implementation of the time stamp is crucial for the high-speed, on-the-fly computation of a real-time median filter. The time stamps of the data items tell every processor node whether its current sample value must be preserved in the window. The oldest sample of the window, with time index 2*N+1, is discarded in every iteration, independently of its position and value. This is realised by a chain of independent index arithmetic units, so the filter architecture is actually composed of a double chain of parallel processing nodes.
The main functions of the processor nodes are as follows. The oldest sample value is located somewhere in the chain of nodes; let us assume it is at index k, with 0 <= k <= 2*N (index 0 stores the smallest and index 2*N the largest input sample value of the window). Every node with an index less than k is prepared to pass its stored sample value to the right neighbour, i.e. to fill up the ordered list of values after the oldest sample is dropped. Nodes with indices greater than k stay in the default state. All nodes compare their locally stored, older sample value with the broadcast new input. Every node whose stored value is less than the input is prepared to pass its value to the left neighbour, i.e. to preserve the magnitude-ordered list of the sample values in the window. Let the index of the new input sample be n. Three cases are possible.
a) k = n. The new sample just replaces the oldest one; no other samples are moved.
b) k < n. The nodes from 0 to k-1 are prepared to move right, and the nodes from 0 to n are also prepared to move left. For the first k nodes the two assignments cancel each other, so they remain in place, while the n-k samples move left to make room for the new sample. The new list is the properly ordered list of the sample values.
c) k > n. The nodes from 0 to k-1 are prepared to move right, and the nodes from 0 to n are also prepared to move left. For the first n nodes the two assignments cancel each other, so they remain in place, while the k-n-1 samples move right to make room for the new sample. The new list is the properly ordered list of the sample values.
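A behavioural software model of one update step may help in following the three cases. It is only a sketch under stated assumptions: the names update_window and streaming_median are hypothetical, ages are kept as plain integers, the window is assumed to start filled with the first sample, and the index pos is defined here as the final position of the new sample, which may differ by one from the index n used in the text. The PEs work in parallel in hardware; the loops below only emulate the resulting data movement.

```python
def update_window(values, ages, new_sample):
    """One iteration on the ordered window; returns the new (values, ages).

    values is sorted ascending, ages[i] is the age of values[i], and the
    largest age marks the sample to be discarded.
    """
    size = len(values)
    k = max(range(size), key=lambda i: ages[i])   # node holding the oldest sample
    # Final position of the new sample: number of surviving samples below it.
    pos = sum(1 for i in range(size) if i != k and values[i] < new_sample)
    values, ages = list(values), [a + 1 for a in ages]
    if k < pos:      # samples k+1..pos each move one node towards index 0
        for i in range(k, pos):
            values[i], ages[i] = values[i + 1], ages[i + 1]
    elif k > pos:    # samples pos..k-1 each move one node towards index 2*N
        for i in range(k, pos, -1):
            values[i], ages[i] = values[i - 1], ages[i - 1]
    values[pos], ages[pos] = new_sample, 0        # k == pos: plain replacement
    return values, ages


def streaming_median(samples, N):
    """Feed samples one by one and read node N after every update."""
    size = 2 * N + 1
    values, ages = [samples[0]] * size, list(range(size))
    out = []
    for s in samples:
        values, ages = update_window(values, ages, s)
        out.append(values[N])
    return out


print(streaming_median([3, 1, 4, 1, 5, 9, 2, 6], 1))  # [3, 3, 3, 1, 4, 5, 5, 6]
```

No sorting is performed at any point: the list stays ordered because each update moves a contiguous run of samples by at most one position.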
Operations of the processing nodes are scheduled by the input samples and the system clock. The output is usually provided by the central processor node with index N, but if necessary any node can be configured to provide the output.
In the recursive median filter implementation, the “age” of the currently selected processing element (PE) is cleared, while the counters of all other nodes are incremented. Then the least recent sample is discarded and a new sample is injected. After re-ordering, a new median is selected, its age is cleared, the oldest element is discarded, and so on.
The available operational speed allows the filter to be applied in real-time digital image processing or on-line video filtering.
Realisations
The high-speed median filter was first designed as a modular design in schematic form. The integrated XILINX Foundation Series [5] tools allowed the design to be simulated and timing information to be calculated.
The design was implemented on a custom test card and on the H.O.T. Works board by VCC. Both implementations can be easily reconfigured for almost arbitrary data width and window size, as well as for the data representation (signed/unsigned). Using this information, the application-dependent, custom median filter blocks are ready to be synthesised.
A bit-stream for the XC4003 can be generated directly from the schematic, as this device is supported by Foundation. Finally the bit-stream (configuration information) is downloaded to a custom test card.
The XC6200 implementation involves several other design steps:
1. The modules of the design should be described in VELAB, a special structural VHDL subset for building XC6200 applications.
2. The design should be placed and routed by Xact6000, and a bit-stream (CAL file) should be generated.
3. A simple user interface should be developed in C++, which communicates with the H.O.T. Works card plugged into a PCI slot of a PC.
The architecture shown in Figure 2 is flexible enough to realise any of the filters mentioned above. The new sample loaded into the input register starts the operation. In each PE, the new sample is compared with the sample value that PE contains. From the counter values, all the PEs are informed through dedicated nets whether the least recent sample is on their left or on their right. From this and the result of the comparison, each PE decides whether the sample and counter values should be passed on and, if so, in which direction. The output, the median value, is simply taken from one of the PEs. When the filter is used in 2D mode, N samples are injected first and only the Nth output is used; the first N-1 output values are simply omitted.
Figure 2. Block Diagram of the Architecture. Labelled blocks: INPUT Sample, INPUT REGISTER, the processing elements PE (N-1), PE (N) and PE (N+1), each containing a REGISTER, a COMPARATOR, a COUNTER and LOGIC, and Median OUT.
CONCLUSIONS
In the paper a general, high-speed parallel median filtering architecture was proposed. The parameterised architectural description allowed custom filter realisations, in terms of input sample resolution, filter window width, and 1D or 2D implementation.
Compared to other parallel median filter realisations [2], where stack decomposition was brought to the front, this architecture does not involve huge additional matrices that make realisation impossible. Compared to solutions carried out on real multiprocessor structures [3], our architecture contains only one FPGA, mounted on a single board plugged into a PC.
Our application is expected to process one sample in less than 100 ns. This operational speed is sufficient for real-time 1D median filtering with an arbitrarily sized data window, as long as the logic and routing resources of the FPGA are not fully utilised. In the case of the 6216 Reconfigurable Processing Unit, sixteen nodes with 16-bit data words could be realised.
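The quoted sample time translates directly into a throughput figure. The short calculation below is only illustrative arithmetic: the function names are hypothetical, the 100 ns value is used as an upper bound on the sample time, and the 2D rate relies on the N injection steps per output pixel described for 2D operation, with N = 3 chosen as an example.

```python
def samples_per_second(sample_period_ns):
    """Sample rate reached when one sample is processed per period."""
    return 1_000_000_000 / sample_period_ns


def pixels_per_second(sample_period_ns, N):
    """2D pixel rate: every output pixel needs N injected samples."""
    return samples_per_second(sample_period_ns) / N


print(samples_per_second(100))    # 10000000.0, i.e. at least 10 million samples/s
print(pixels_per_second(100, 3))  # about 3.3 million pixels/s for a 3x3 window
```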
REFERENCES
[1] Olli Vainio, Yrjö Neuvo, Steven E. Butner, A Signal Processor for Median-Based Algorithms, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 37, No. 9, September 1989.
[2] V.V. Bapeswara Rao and K. Sankara Rao, A New Algorithm for Real-Time Median Filtering, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. ASSP-34, No. 6, December 1986.
[3] M. O. Ahmad and D. Sundararajan, Parallel Implementation of a Median Filtering Algorithm, Int. Symp. on Signals and Systems, 1988.
[4] Dobrowiecki Tadeusz, Medián Szűrők (Median Filters), Mérés és Automatika, Vol. 37, No. 3, 1989.
[5] Xilinx Foundation Series Quick Start Guide, 1991-1997, Xilinx, Inc.