SYSTOLIC ARCHITECTURES FOR
1D and 2D RECURSIVE FILTERS
D. Chikouche, R. E. Bekka
Département d'Electronique, Faculté des Sciences de l’Ingénieur
Université de Sétif, 19000 Sétif Algérie
E-mail: dj_chikou@yahoo.fr
Key Words: Recursive filters, Systolic, Cylindrical, CTP, Switched-capacitor.
Abstract
In this paper, discrete state-space recursive filters are implemented in the form of systolic array processors. We show that the recursivity inherent to the filtering algorithm introduces a latency proportional to the filter order. The use of the CTP decomposition technique together with cylindrical-type structures significantly reduces this latency and improves the computation throughput of these arrays.
Résumé
In this article, recursive filters described in the state space are implemented in the form of a systolic array. We show that the recursivity inherent to the filtering algorithm introduces a latency proportional to the filter order. The use of the CTP decomposition and of cylindrical structures considerably reduces this latency and improves the data throughput of these arrays.
1. Introduction
The concept of systolic architecture was first developed during 1979 and 1980 at Carnegie-Mellon University [1], and many versions of systolic processors have since been designed and built by several industrial groups [1-11].
In previous work [12-19], we presented a methodology for the implementation of state-space recursive filters on systolic architectures of the Kung type [1] and the cylindrical type [3]. In this paper, we review the application of the systolic concept (of both the Kung type and the cylindrical type) to the realization of discrete recursive filters described in the state space by a simple matrix equation [20-21]. We will show that the recursivity inherent to the filtering algorithm introduces a latency proportional to the filter order, which has a direct effect on the computation throughput of these architectures. Furthermore, the use of the CTP decomposition technique [15,17,18] together with the cylindrical structures can considerably reduce the latency of the array, thus improving its computation throughput rate.
We start our study by introducing the principle of the Kung-type systolic implementation of 1D discrete recursive filters. Systolic structures of the cylindrical type together with the CTP technique are considered in section 3 for the implementation of discrete recursive filters. In the last section, we propose a design of the processing elements of the different systolic architectures presented in this paper, using switched-capacitor architectures.
2. Systolic structure for discrete recursive filters
A discrete recursive filter can be described in the state space domain by the following two equations [21]:
\[
\begin{aligned}
x(n+1) &= A\,x(n) + B\,e(n) \\
y(n) &= C\,x(n) + D\,e(n)
\end{aligned}
\tag{1}
\]
or, in a matrix form according to [21] as:
\[
\begin{bmatrix} x(n+1) \\ y(n) \end{bmatrix}
=
\begin{bmatrix} A & B \\ C & D \end{bmatrix}
\begin{bmatrix} x(n) \\ e(n) \end{bmatrix}
\tag{2}
\]
where: A, B, C, and D are the state matrices of the filter, $x(n) \in \mathbb{R}^N$ the state signal vector of dimension $(N \times 1)$, $e(n) \in \mathbb{R}$ the input signal and $y(n) \in \mathbb{R}$ the output signal.
The internal state-space description of the filter allows the filtering algorithm to be represented as a simple product of a square matrix with a column vector [21]. This description can be obtained either directly in the state-space domain from the specifications of the amplitude and the phase of the filter frequency response, or by transforming the transfer function computed from these specifications.
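As an illustration of this matrix-vector form, the following Python sketch advances the filter one sample at a time by multiplying the global matrix by the stacked vector of states and input. The function names, the zero initial state and the first-order coefficients are assumptions chosen only to keep the example short; they do not come from the paper.

```python
def filter_step(H, x, e):
    """One sample of the recursion [x(n+1); y(n)] = H [x(n); e(n)]."""
    u = x + [e]                                    # stack state and input
    v = [sum(h * s for h, s in zip(row, u)) for row in H]
    return v[:-1], v[-1]                           # next state, output


def run_filter(A, B, C, D, samples):
    """Filter a list of input samples, starting from an assumed zero state."""
    n = len(A)
    # Global matrix H = [[A, B], [C, D]] assembled row by row.
    H = [A[i] + [B[i]] for i in range(n)] + [C + [D]]
    x, out = [0.0] * n, []
    for e in samples:
        x, y = filter_step(H, x, e)
        out.append(y)
    return out


# Hypothetical first-order filter: x(n+1) = 0.5 x(n) + e(n), y(n) = x(n).
impulse_response = run_filter([[0.5]], [1.0], [1.0], 0.0, [1.0, 0.0, 0.0, 0.0])
print(impulse_response)  # [0.0, 1.0, 0.5, 0.25]
```

Each call to `filter_step` is exactly one square-matrix by column-vector product, which is the operation the systolic array is asked to perform at every sample.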
[Figure 1: array of processing elements holding the coefficients $a_{ij}$, $b_i$, $c_i$ and $d$; inputs $x_1(n)$, $x_2(n)$, $x_3(n)$ and the input sample $e$; outputs $x_1(n+1)$, $x_2(n+1)$, $x_3(n+1)$ and $y(n)$.]
Fig. 1. Systolic implementation of a third order discrete recursive filter.
The systolic array implementation of the discrete filter, represented in Fig. 1, uses the global state matrix elements to load the PE's memories of the systolic array. The PE (a) computes the first term of $x_i(n+1)$, the PE (b) computes the following term of $x_i(n+1)$ and adds it to the previous term, and the third PE (c) computes the different terms of $y(n)$.
The systolic architecture of Fig. 1, of dimension $(N+1) \times (N+1)$, proposed for the realization of the sampled-data recursive filter of order $N$, has a computation throughput of:
\[
\frac{1}{(2N+1)(t_m + t_a)}
\]
where $t_m$ and $t_a$ are respectively the times required to perform a multiplication and an addition.
In the next section, we will show that the use of the CTP technique together with systolic architectures of the cylindrical type [15,17,18] improves the computation throughput of these structures.
3. Fast systolic architectures with dynamic reconfiguration for discrete recursive filters
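Consider an $(N-1)$th order 1D discrete recursive filter ($N = pq$) described by equation (2).
Let:
\[
H = \begin{bmatrix} A & B \\ C & D \end{bmatrix}, \qquad
v = \begin{bmatrix} x(n+1) \\ y(n) \end{bmatrix}, \qquad
u = \begin{bmatrix} x(n) \\ e(n) \end{bmatrix}
\]
Equation (2) is then equivalent to the following linear relation:
\[
v = Hu \tag{3}
\]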
In this section, we will apply the CTP decomposition technique [15] to our recursive filtering algorithm (3) in order to obtain a faster form.
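Consider the example of a third order recursive filter described by the state space equation (3) with $N = 4 = 2 \times 2$, $p = q = 2$, and:
\[
H = \begin{bmatrix}
a_{11} & a_{12} & a_{13} & b_1 \\
a_{21} & a_{22} & a_{23} & b_2 \\
a_{31} & a_{32} & a_{33} & b_3 \\
c_1 & c_2 & c_3 & d
\end{bmatrix}, \qquad
v = \begin{bmatrix} x_1(n+1) \\ x_2(n+1) \\ x_3(n+1) \\ y(n) \end{bmatrix}, \qquad
u = \begin{bmatrix} x_1(n) \\ x_2(n) \\ x_3(n) \\ e(n) \end{bmatrix}
\]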
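A single term CTP decomposition of $H$ can be found by using the methods of [18]. This decomposition is defined by the following $(2 \times 2)$ matrices $L$ and $R$:
\[
L = \begin{bmatrix} l_{11} & l_{12} \\ l_{21} & l_{22} \end{bmatrix}, \qquad
R = \begin{bmatrix} r_{11} & r_{12} \\ r_{21} & r_{22} \end{bmatrix}
\]
such that $H$ is the tensor product of $L$ and $R$.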
Mapping the vector $u$ on a $(p \times q)$ matrix $U$ by using segments of $u$ as columns of $U$, we get:
\[
U = \begin{bmatrix} x_1(n) & x_3(n) \\ x_2(n) & e(n) \end{bmatrix}, \qquad
V = \begin{bmatrix} x_1(n+1) & x_3(n+1) \\ x_2(n+1) & y(n) \end{bmatrix}
\]
The matrix $V$ is obtained by the same procedure from the vector $v$.
The CTP expansion associated with equation (3) then takes the following fast form:
\[
V = LUR \tag{4}
\]
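The equivalence between the matrix-vector form (3) and the matrix-matrix form (4) can be checked numerically. The Python sketch below assumes that $u$ and $v$ are stacked column by column into $U$ and $V$, in which case the standard identity vec(LUR) = (R^T ⊗ L) vec(U) gives H = R^T ⊗ L. The paper only states that $H$ is the tensor product of $L$ and $R$, so this particular ordering is an assumption, and the helper names and numerical values are illustrative only.

```python
def matmul(X, Y):
    """Plain matrix product of two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def kron(X, Y):
    """Kronecker (tensor) product of two matrices."""
    return [[x * y for x in xrow for y in yrow] for xrow in X for yrow in Y]


def vec(M):
    """Stack the columns of M into one vector."""
    return [M[i][j] for j in range(len(M[0])) for i in range(len(M))]


L = [[1, 2], [3, 4]]          # illustrative values only
R = [[5, 6], [7, 8]]
U = [[1, 3], [2, 4]]          # laid out as [[x1, x3], [x2, e]]

R_t = [list(col) for col in zip(*R)]
H = kron(R_t, L)              # assumed ordering: H = R^T (x) L

v_direct = [sum(h * s for h, s in zip(row, vec(U))) for row in H]  # v = H u
v_ctp = vec(matmul(matmul(L, U), R))                               # V = L U R

print(v_direct)            # [102, 230, 118, 266]
print(v_direct == v_ctp)   # True
```

Both routes give the same four outputs; the second one only ever multiplies $(2 \times 2)$ matrices, which is what the cylindrical array exploits.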
The cylindrical arrays of [3] are compatible with the CTP decomposition. Fig. 2 represents a cylindrical array performing the $(2 \times 2)$ matrix-matrix product $LU$. The triangular figures denote local memory wherein elements of the matrix $L$ are stored as indicated in Fig. 2a. We transmit the columns of $U$ down the longitudinal paths. At each node, the longitudinal input is multiplied by the scalar stored in its internal register. The resulting product is added to the input arriving along the transversal path. This sum is retransmitted transversally. The longitudinal sequence is retransmitted without alteration. Fig. 2a depicts the calculation at the start of the second step. Fig. 2b shows the computation at the second step.
We assume that our array operates synchronously. The sequences available on the transversal paths at the bottom of the array are the rows of $LU$. We can verify that the top row nodes complete their computations at the same time as the bottom row nodes complete the computation of the first row of $LU$. At the $p$th step (here $p = q = 2$), the array is switched as indicated in Fig. 2b. The row sequences of $LU$ are fed back on the transversal paths of the input nodes. The $R$ row sequences follow the $U$ row sequences on the longitudinal paths. When the new computation starts down the array, the node operation changes to another form. This time, the node retransmits all input sequences unchanged while iteratively calculating the dot product of these sequences. This product is stored in the node memory as indicated in Fig. 2. The switch in function of the nodes propagates down the array together with the first arrival of $LU$ and $R$ data. Fig. 2c shows the computational wave front reaching the second row.
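The two node behaviours described above can be summarised functionally. The Python sketch below is a deliberate simplification: it ignores the timing skew of the wave fronts and the cylindrical wiring, and only reproduces what each node accumulates in its two modes. The function names and the numerical values are illustrative assumptions.

```python
def first_wave_front(L, U):
    """Multiply-accumulate mode: the node holding l_ik adds l_ik * u_kj to
    the partial sum travelling along its transversal path."""
    p, q = len(L), len(U[0])
    LU = [[0.0] * q for _ in range(p)]
    for j in range(q):                  # columns of U sent down in turn
        for i in range(p):
            partial = 0.0               # transversal input starts at zero
            for k in range(len(U)):
                partial += L[i][k] * U[k][j]
            LU[i][j] = partial          # leaves the array as a row entry of LU
    return LU


def second_wave_front(LU, R):
    """Dot-product mode: node (i, j) passes its inputs on unchanged and
    accumulates V_ij = V_ij + (LU)_ik * r_kj in its local memory."""
    p, q = len(LU), len(R[0])
    V = [[0.0] * q for _ in range(p)]   # one memory cell per node
    for k in range(len(R)):             # k-th pair of (LU, R) data to arrive
        for i in range(p):
            for j in range(q):
                V[i][j] += LU[i][k] * R[k][j]
    return V


L = [[1, 2], [3, 4]]                    # illustrative values only
U = [[1, 3], [2, 4]]
R = [[5, 6], [7, 8]]

LU = first_wave_front(L, U)
V = second_wave_front(LU, R)
print(LU)  # [[5.0, 11.0], [11.0, 25.0]]
print(V)   # [[102.0, 118.0], [230.0, 266.0]]
```

The same set of nodes is used in both functions; only the operation performed at each node changes, which is the dynamic reconfiguration referred to in the section title.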
[Figure 2, panels (a) to (d): the $(2 \times 2)$ cylindrical array, with nodes labelled 11, 12, 21 and 22, at four successive steps. The data shown on the paths are $x_1(n)$, $x_2(n)$, $x_3(n)$, $e(n)$, the entries of $L$ and $R$, and the entries of $LU$.]
Fig. 2.a. Step 1. Fig. 2.b. Step 2. Fig. 2.c. Step 3. Fig. 2.d. Step 4.
Quantities annotated on the panels:
\[
\begin{aligned}
(LU)_{11} &= l_{11}\,x_1(n) + l_{12}\,x_2(n), & (LU)_{21} &= l_{21}\,x_1(n) + l_{22}\,x_2(n), \\
(LU)_{12} &= l_{11}\,x_3(n) + l_{12}\,e(n), & (LU)_{22} &= l_{21}\,x_3(n) + l_{22}\,e(n),
\end{aligned}
\]
\[
\begin{aligned}
V_{11} &= (LU)_{11}\,r_{11} + (LU)_{12}\,r_{21}, & V_{12} &= (LU)_{11}\,r_{12} + (LU)_{12}\,r_{22}, \\
V_{21} &= (LU)_{21}\,r_{11} + (LU)_{22}\,r_{21}, & V_{22} &= (LU)_{21}\,r_{12} + (LU)_{22}\,r_{22}.
\end{aligned}
\]
Fig. 2. Operating principle of the fast cylindrical array with dynamic reconfiguration of a third order filter.
The components of $V = LUR$ are stored in the memories at the $(p+q)$th step of this sequence. The indices i, j on the nodes of Fig. 2d represent the location of $V_{ij}$. Therefore, using the same cylindrical arrays, the matrix-matrix operation $V = LUR$ can be computed in $O(p+q)$ time units while the matrix-vector operation $v = Hu$ takes $O(pq)$ time. We can clearly see the superiority in computational speed of the first linear operation over the last one. This implementation technique of 1D IIR filters could achieve a throughput rate of
\[
\frac{1}{(p+q)(t_m + t_a)}
\]
much higher than the throughput rate of
\[
\frac{1}{(2pq-1)(t_m + t_a)}
\]
of the Kung-type systolic array of Fig. 1.
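To give a feel for the gain, the two throughput expressions can be evaluated side by side. The Python sketch below uses hypothetical function names; the ratio of the two rates does not depend on $t_m$ and $t_a$, so only $p$ and $q$ are needed for the comparison.

```python
def kung_throughput(p, q, t_m, t_a):
    """Kung-type array of Fig. 1: 1 / ((2pq - 1)(t_m + t_a))."""
    return 1.0 / ((2 * p * q - 1) * (t_m + t_a))


def cylindrical_throughput(p, q, t_m, t_a):
    """Cylindrical array with CTP: 1 / ((p + q)(t_m + t_a))."""
    return 1.0 / ((p + q) * (t_m + t_a))


def speedup(p, q):
    """Ratio of the two rates; the factor (t_m + t_a) cancels out."""
    return (2 * p * q - 1) / (p + q)


print(speedup(2, 2))  # 1.75   (third order example, p = q = 2)
print(speedup(4, 4))  # 3.875  (hypothetical larger case, p = q = 4)
```

For the third order example the cylindrical array is 1.75 times faster under these formulas, and the advantage grows with the filter order because the numerator grows with $pq$ while the denominator grows only with $p + q$.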
In the above discussion, the ability to dynamically switch and reconfigure the array implies added hardware complexity. This hardware complexity needs careful evaluation in any specific design process.
4. Design of processing elements by using switched-capacitor architectures
Because of the sampled nature of the sampled-data recursive filters considered in this paper [12], we must construct the processing elements of our systolic arrays with sampled-data techniques. In this paper, we propose the use of switched-capacitor architectures to build the PEs. These architectures are mainly based on the switched-capacitor element of Fig. 3. This basic element can be used to construct adders, multipliers, and delay elements [22-26], which are the basic blocks of all types of processing elements of a systolic array.
[Figure 3: (a) SC circuit, with capacitor C, switch phases O1 and O2, and terminal voltages V1 and V2; (b) switch timing of O1 and O2 against t, with marks at T/2, T, 2T and 3T.]
Fig. 3. The basic switched-capacitor element.
4.1. Design of the PEs used in the Kung-type systolic array of figure 1.
Each PE of the systolic array is built from a Switched-Capacitor Multiplier/Adder, a one time-unit delay, and a memorization component [22-26]. The Switched-Capacitor Multiplier/Adder performs the computation $y_s = y_e + a_{ij} x_e$, the memorization component is used to load the $a_{ij}$ coefficient of the filter, and the one time-unit delay transmits the vertical input of the PE to its vertical output with a delay of one time unit, $x_s = x_e$.
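The behaviour of such a PE can be sketched in Python as follows. This is a purely behavioural model, not a circuit description: the class and attribute names are hypothetical, and it assumes that the multiplier/adder output is available within the same time unit while the vertical output lags the vertical input by one time unit.

```python
class KungPE:
    """Behavioural model of one PE: multiplier/adder, coefficient memory
    and a one time-unit delay on the vertical path."""

    def __init__(self, coefficient):
        self.a = coefficient       # memorization component (a_ij)
        self.delayed_x = 0.0       # content of the delay element

    def step(self, x_e, y_e):
        """Consume the inputs (x_e, y_e) and return the outputs (x_s, y_s)."""
        y_s = y_e + self.a * x_e   # multiplier/adder: y_s = y_e + a_ij * x_e
        x_s = self.delayed_x       # sample that entered one time unit earlier
        self.delayed_x = x_e
        return x_s, y_s


pe = KungPE(2.0)
print(pe.step(3.0, 1.0))  # (0.0, 7.0)
print(pe.step(5.0, 0.0))  # (3.0, 10.0)
```

In the second call the vertical output is the value 3.0 that entered during the first call, which is the one time-unit delay described above.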
[Figure 4: (a) operation of the (a)-type PE, with stored coefficient $a_{ij}$, input $x_e$ and outputs $x_s$, $y_s$: $y_s = a_{ij} x_e$, $x_s = x_e$ (delay of one time unit); (b) construction of the (a)-type PE: Multiplier/Adder, memorisation of coefficient $a_{ij}$, delay of one unit.]
Fig. 4. The (a)-type PE's construction using SC techniques.
4.2. Design of the PEs used in the cylindrical-type systolic array of figure 2.
Each cylindrical-type PE of the systolic array of Fig. 2 is built from a Switched-Capacitor Multiplier/Adder, a one time-unit delay, and a memorization component (Fig. 7) [22-26]. The Switched-Capacitor Multiplier/Adder performs the computation $y_s = y_e + a_{ij} x_e$, the memorization component is used to load the $a_{ij}$ coefficient of the filter during the first wave front, or to store the result $V_{ij} = V_{ij} + (LU)_{ik} r_{kj}$ locally at the PE, and the one time-unit delay transmits the vertical input of the PE to its vertical output with a delay of one time unit, $x_s = x_e$.
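A behavioural Python sketch of this two-mode PE is given below. The class name, the mode labels and the choice to clear the memory when the PE is switched are assumptions made for illustration; the sketch also assumes that, after switching, the transversal input carries the $(LU)_{ik}$ values and the vertical input carries the $r_{kj}$ values.

```python
class CylindricalPE:
    """Behavioural model of a cylindrical-type PE with two operating modes."""

    def __init__(self, coefficient):
        self.memory = coefficient  # stored coefficient during the first wave front
        self.mode = "mac"          # multiply-accumulate mode
        self.delayed_x = 0.0       # content of the one time-unit delay

    def switch(self):
        """Reconfigure the PE for the second wave front."""
        self.mode = "dot"
        self.memory = 0.0          # the memory now accumulates V_ij

    def step(self, x_e, y_e):
        """Consume the inputs (x_e, y_e) and return the outputs (x_s, y_s)."""
        if self.mode == "mac":
            y_s = y_e + self.memory * x_e   # y_s = y_e + coefficient * x_e
        else:
            self.memory += y_e * x_e        # V_ij = V_ij + (LU)_ik * r_kj
            y_s = y_e                       # transversal data pass unchanged
        x_s, self.delayed_x = self.delayed_x, x_e
        return x_s, y_s


pe = CylindricalPE(2.0)
print(pe.step(3.0, 1.0))  # (0.0, 7.0)
pe.switch()
print(pe.step(4.0, 5.0))  # (3.0, 5.0)
print(pe.step(2.0, 3.0))  # (4.0, 3.0)
print(pe.memory)          # 26.0
```

After switching, the PE no longer alters the data it forwards; the result 26.0 = 5.0 × 4.0 + 3.0 × 2.0 stays in its local memory.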
[Figure 5: (a) operation of the (b)-type PE, with stored coefficient $a_{ij}$, inputs $x_e$, $y_e$ and outputs $x_s$, $y_s$: $y_s = y_e + a_{ij} x_e$, $x_s = x_e$ (delay of one time unit); (b) construction of the (b)-type PE: Multiplier/Adder, memorisation of coefficient $a_{ij}$, delay of one unit.]
Fig. 5. The (b)-type PE's construction using SC techniques.
[Figure 6: (a) operation of the (c)-type PE, with stored coefficient $c_i$, inputs $x_e$, $y_e$ and output $y_s$: $y_s = y_e + c_i x_e$; (b) construction of the (c)-type PE: Multiplier/Adder, memorisation of coefficient $c_i$.]
Fig. 6. The (c)-type PE's construction using SC techniques.
Conclusion
In this paper, we have presented and analyzed the systolic architectures that we proposed in previous work for realizing sampled-data recursive filters. All these structures, of both the Kung type and the cylindrical type, are obtained in a straightforward manner from a matrix representation of the filters in the state-space domain. We also note that a latency proportional to the filter order is the main disadvantage of the Kung-type systolic architectures. We have shown that the use of the CTP technique together with the cylindrical structures leads to an improvement of the computation throughput of these systolic arrays. Switched-capacitor techniques are proposed, in this paper, to build all types of processing elements used in these structures.
[Figure 7: (a) operation of the cylindrical-type PEs, with stored coefficient $l_{ij}$, inputs $x_e$, $y_e$ and outputs $x_s$, $y_s$; (b) construction of the cylindrical-type PE: Multiplier/Adder, delay of one unit, memorisation of coefficient $l_{ij}$ or of the result $V_{ij}$.
At the first wave front: $y_s = y_e + l_{ij} x_e$, $x_s = x_e$ (delay of one time unit).
At the second wave front: $y_s = y_e$, $x_s = x_e$ (delay of one time unit), $V_{ij} = V_{ij} + (LU)_{ik} r_{kj}$.]
Fig. 7. PE's construction of the cylindrical type using SC techniques.
References
[1] H. T. Kung, "Why systolic architectures?", IEEE Computer, Vol. 15, N° 1, 1982, pp. 37-46.
[2] S. Y. Kung, K. S. Arun, R. J. Gal-Ezer, D. V. Bhaskar Rao, "Wavefront array processor: language, architecture, and applications", IEEE Trans. Comput., Special Issue on parallel and distributed computers, Vol. C-31, N° 11, Nov. 1982, pp. 1054-1066.
[3] W. A. Porter, J. L. Aravena, "Orbital architectures with dynamic reconfiguration", Proc. IEE, part E, Vol. 134, N° 6, Nov. 1987, pp. 281-287.
[4] T. Zhang, K. K. Parhi, "VLSI implementation-oriented (3,k)-regular low-density parity-check codes", IEEE Workshop on Signal Processing Systems (SiPS) 2001, Antwerp, Belgium, Sept. 2001.
[5] S. Jain, L. Song, K. K. Parhi, "Efficient semi-systolic VLSI architectures for finite field arithmetic", IEEE Trans. on VLSI Systems, Vol. 6, N° 1, Mar. 1998, pp. 101-113.
[6] J. P. Ma, K. K. Parhi, E. F. Deprettere, "Pipelining of cordic based IIR digital filters", Proc. of IEEE Int. Conf. on Acoustics, Speech and Signal Processing, Munich, April 1997, pp. 643-646.
[7] A. Härmä, "Implementation of frequency-warped recursive filters", Signal Processing, Vol. 80, 2000, pp. 543-548.
[8] K. Z. Pekmestzi, N. K. Moshopoulos, "A bit-interleaved systolic architecture for a high-speed RSA system", Integration: the VLSI Journal, Vol. 30, N° 2, 2001, pp. 169-175.
[9] C. Souani, M. Abid, K. Torki, R. Tourki, "VLSI design of 1-D DWT architecture with parallel filters", Integration: the VLSI Journal, Vol. 29, N° 2, 2000, pp. 181-207.
[10] D. Massicotte, "A parallel VLSI architecture of Kalman-filter-based algorithms for signal reconstruction", Integration: the VLSI Journal, Vol. 28, N° 2, 1999, pp. 185-196.
[11] S. Ramanathan, V. Visvanathan, "Low-power pipelined LMS adaptive filter architectures with minimal adaptation delay", Integration: the VLSI Journal, Vol. 27, N° 1, 1999, pp. 1-32.
[12] D. Chikouche, D. T. Davis, "Sampled-Data Recursive Filters Using Systolic Architectures", Technical Report, Elect. Eng. Dept. OSU, EE 793, Jan. 1984.
[13] D. Chikouche, S. B. Bibyk, "Ion Implantation: a Standard Technique for Introducing Controlled Amounts of Dopants into Silicon during VLSI Processing", Technical Report, Elect. Eng. Dept. OSU, EE 631, Feb. 1984.
[14] D. Chikouche, R. E. Bekka, "Architectures systoliques et toriques des filtres numériques RII 1D et 2D", Proc. 4ème colloque africain sur la recherche en informatique CARI'98, Dakar (Sénégal), 12-15 Oct. 1998, pp. 25.
[15] D. Chikouche, R. E. Bekka, "Cylindrical architectures for 1-D recursive digital filters: a state space approach", IEE Proc.-Comput. Digit. Tech., Vol. 145, No. 4, July 1998, pp. 1-6.
[16] D. Chikouche, R. E. Bekka, A. Khellaf, A. Boucenna, "Etude des environnements de simulations des architectures parallèles du type systolique", Actes des journées d'études TSC'95, 11-13 septembre 1995, pp. 31-36.
[17] D. Chikouche, R. E. Bekka, "Architectures systoliques rapides des filtres numériques RII 1D", Proc. of Int. Conf. SSA2'99, Blida, Algérie, 10-12 Mai 1999, pp. 144-148.
[18] D. Chikouche, R. E. Bekka, "Architectures rapides dynamiquement reconfigurables des filtres numériques récursifs 1-D et 2-D", Revue Traitement du signal, Vol. 16, N° 1, 1999, pp. 1-12.
[19] R. E. Bekka, D. Chikouche, "Application des structures systoliques aux filtres RII 1-D et 2-D: Amélioration du flot en données", Conférence Internationale IMCES'99, Université de Sidi Bel-Abbes, 17-18 Mai 1999.
[20] D. Chikouche, R. E. Bekka, "Etude et réalisation d'un filtre numérique programmable à base du microprocesseur Z80", Revue Sciences et technologies, Université de Constantine, Algérie, 1996, pp. 51-56.
[21] F. J. Taylor, Digital filter design handbook, Marcel Dekker, Inc, New York, 1983.
[22] K. Martin, A. S. Sedra, "Exact design of switched capacitor bandpass filters using coupled biquad structures", IEEE Trans. Circuits Syst., CAS-27, June 1980, pp. 469-475.
[23] D. J. Allstot, W. C. Black, "Technological design considerations for monolithic MOS switched capacitor filtering systems", Proc. IEEE, Vol. 71, Aug. 1983, pp. 967-986.
[24] R. Gregorian, K. W. Martin, G. C. Temes, "Switched-Capacitor circuit design", Proc. IEEE, Vol. 71, Aug. 1983, pp. 941-966.
[25] D. Brodarac, D. Herbst, B. J. Hosticka, B. Hoefflinger, "A novel sampled-data MOS multiplier", Electron. Lett., Vol. 18, 1982, pp. 229-230.
[26] E. Kettel, W. Schneider, "An accurate analog multiplier and divider", IRE Trans. Electronic Computers, Vol. ED-7, 1961, pp. 269-274.