A NOVEL VLSI ARCHITECTURE FOR MULTIDIMENSIONAL
DISCRETE WAVELET TRANSFORM
Xinjian Chen, Qionghai Dai
Department of Automation
Tsinghua University, Beijing, 100084, China
Cxjian99@mails.tsinghua.edu.cn, Qhdai@tsinghua.edu.cn
ABSTRACT

In this short paper we propose a novel VLSI architecture for multidimensional discrete wavelet transform (m-D DWT) based on a systolic array and a non-separable approach. The proposed architecture performs a decomposition of an N_1 × N_2 × ⋯ × N_m image in about N_1 N_2 ⋯ N_m / (2^m − 1) clock cycles (ccs). This result considerably speeds up other known architectures. Besides, the advantages of the proposed architecture include very low hardware complexity, regular data flow and low control complexity.
1. INTRODUCTION

The discrete wavelet transform (DWT) [1] has recently emerged as a powerful tool for audio and image compression. Besides, the DWT has important applications in areas as diverse as signal processing, digital communications, numerical analysis, computer graphics and radar target recognition. Since the DWT demands massive computation, dedicated VLSI ASIC solutions should be considered to meet the real-time requirements of practical applications.
Although many VLSI architectures for 1-D DWT have been proposed, it is hard to design a highly efficient VLSI architecture with low hardware cost and high throughput for 2-D DWT. Lewis and Knowles [2] used the four-tap Daubechies filter to design a 2-D DWT architecture without multipliers. Parhi and Nishitani [3] proposed two architectures that combine the word-parallel and digit-serial methodologies. Vishwanath et al. [4] proposed a systolic-parallel architecture using a combination of systolic and parallel filters for the 2-D DWT. Chakrabarti and Vishwanath [5] proposed two efficient non-separable architectures, the parallel filter and the SIMD 2-D array, which optimize both area and time. Chuang and Chen [6] presented a parallel pipelined VLSI array architecture for the 2-D DWT. Chen and Bayoumi [7] proposed a scalable systolic array architecture. Chakrabarti and Mumford [8] presented folded architectures and scheduling algorithms for computing the 2-D DWT for analysis and synthesis filters. Among the various VLSI architectures, there exist two best-known designs for the 2-D DWT in terms of computing time and hardware cost. One was proposed by Wu et al. for the separable 2-D DWT [9], employing the polyphase decomposition technique and the coefficient folding technique; the other was reported by Marino for the non-separable 2-D DWT [10], based on a modified recursive pyramid algorithm (MRPA) [5]. Either of these two architectures can perform a decomposition of an N × N image in about 2N^2/3 clock cycles (ccs). At present, little literature presents VLSI architectures for the multidimensional DWT, due to the complexity of the design.
Based on a systolic array and a non-separable approach, in this short paper we propose a novel VLSI architecture for the multidimensional (m-D) DWT. The proposed architecture performs a decomposition of an N_1 × N_2 × ⋯ × N_m image in about N_1 N_2 ⋯ N_m / (2^m − 1) clock cycles (ccs). This result considerably speeds up other known architectures. In the particular case of the 2-D DWT, the proposed architecture performs a decomposition of an N × N image in approximately N^2/3 clock cycles (ccs), only half of that in [9] or [10]. Another advantage is that the proposed architecture requires fewer multipliers and accumulators (MACs) than other non-separable approaches when the filter length is large. Besides, the proposed architecture has very low hardware complexity, regular data flow and low control complexity.
The rest of the paper is organized as follows. In Section 2 we propose the architecture for multidimensional DWT, and in Section 3 we provide comparative evaluations. Finally, in Section 4, we give conclusions.
2. ARCHITECTURE FOR MULTIDIMENSIONAL DWT

When the m-D wavelet basis functions are separable, the m-D DWT can be divided into m 1-D operations, i.e., the row-column method. However, this separable approach requires extra, huge memory to save the data that must be transposed for row (column) by column (row) processing. In our non-separable approach, the wavelet basis functions are still separable. For convenience of description, assume that the wavelet basis functions are identical in each dimension. Let h_i (i = 0, 1, ⋯, K_h − 1) and g_i (i = 0, 1, ⋯, K_g − 1) denote, respectively, the coefficients of the low-pass and high-pass filter bases. Let ll⋯l^j_{n_1,n_2,⋯,n_m} denote the coefficients of the low-low-⋯-low subband produced at decomposition level j, and so on. Then the mathematical formulas for the multidimensional DWT can be defined as follows.
ll⋯l^j_{n_1,n_2,⋯,n_m} = Σ_{i_1=0}^{K_h−1} Σ_{i_2=0}^{K_h−1} ⋯ Σ_{i_m=0}^{K_h−1} h_{i_1} h_{i_2} ⋯ h_{i_m} ll⋯l^{j−1}_{2n_1−i_1, 2n_2−i_2, ⋯, 2n_m−i_m}   (1)
ll⋯h^j_{n_1,n_2,⋯,n_m} = Σ_{i_1=0}^{K_h−1} Σ_{i_2=0}^{K_h−1} ⋯ Σ_{i_m=0}^{K_g−1} h_{i_1} h_{i_2} ⋯ g_{i_m} ll⋯l^{j−1}_{2n_1−i_1, 2n_2−i_2, ⋯, 2n_m−i_m}   (2)
⋮

0-7803-7965-9/03/$17.00 ©2003 IEEE — ICME 2003
hh⋯h^j_{n_1,n_2,⋯,n_m} = Σ_{i_1=0}^{K_g−1} Σ_{i_2=0}^{K_g−1} ⋯ Σ_{i_m=0}^{K_g−1} g_{i_1} g_{i_2} ⋯ g_{i_m} ll⋯l^{j−1}_{2n_1−i_1, 2n_2−i_2, ⋯, 2n_m−i_m}   (3)
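As a concrete check of the non-separable formulation, the 2-D case of (1) can be computed directly as a non-separable convolution and compared against the row-column method it is mathematically equivalent to. The sketch below is an illustrative software model, not the paper's hardware; the two-tap filter and the zero-padded border handling are assumptions chosen for brevity.

```python
import numpy as np

h = np.array([0.5, 0.5])  # hypothetical low-pass filter, K_h = 2

def ll_nonseparable(x, h):
    """LL subband via the 2-D case of Eq. (1):
    ll[n1,n2] = sum_{i1,i2} h[i1] h[i2] x[2n1-i1, 2n2-i2] (zero padding)."""
    K = len(h)
    N1, N2 = x.shape
    out = np.zeros((N1 // 2, N2 // 2))
    for n1 in range(N1 // 2):
        for n2 in range(N2 // 2):
            s = 0.0
            for i1 in range(K):
                for i2 in range(K):
                    a, b = 2 * n1 - i1, 2 * n2 - i2
                    if 0 <= a < N1 and 0 <= b < N2:
                        s += h[i1] * h[i2] * x[a, b]
            out[n1, n2] = s
    return out

def ll_rowcolumn(x, h):
    """Same subband via the separable row-column method."""
    K = len(h)
    def filt_down(v):  # 1-D low-pass filter and downsample by 2
        N = len(v)
        return np.array([sum(h[i] * v[2 * n - i] for i in range(K)
                             if 0 <= 2 * n - i < N)
                         for n in range(N // 2)])
    rows = np.array([filt_down(r) for r in x])         # along dimension 2
    return np.array([filt_down(c) for c in rows.T]).T  # then along dimension 1

x = np.arange(64, dtype=float).reshape(8, 8)
assert np.allclose(ll_nonseparable(x, h), ll_rowcolumn(x, h))
```

Since the basis functions remain separable, both computations produce identical coefficients; the non-separable form merely evaluates the whole m-dimensional convolution at once, which is what the proposed hardware exploits.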
In a similar way, we adopt Wu's approach [9] to design our architecture for the m-D DWT, as illustrated in Fig. 1. The proposed architecture includes a multiplexer, a RAM module and 2^m subband filters. The RAM module saves the j-th level low-low-⋯-low subband data for performing the next-level decomposition. The multiplexer selects the proper data for decomposition: only when performing the first-level decomposition does it select the original input data; otherwise it selects data from the RAM module. This procedure repeats until the desired level is finished.
[Figure: the input ll⋯l^{j−1}_{n_1,⋯,n_m} and the RAM output feed a multiplexer (MUX); the selected stream drives the 2^m subband filters LL⋯L, LL⋯H, ⋯, HH⋯H, each followed by subsampling by 2 in each dimension; the LL⋯L output ll⋯l^j_{n_1,⋯,n_m} is written back to the RAM for the next level.]

Fig. 1. Architecture for m-D DWT.
The approach to designing the circuits for each subband filter based on a systolic array is similar, so we take the LL⋯L filter as an example. Since the parallel-pipeline structure is one of the main characteristics of a systolic array, we give a new definition

x_{n_1} = x_{n_2}(M)   (4)

where M denotes the latency from x_{n_1} to x_{n_2} in the given pipelined data stream {x_n} (n = 0, 1, ⋯). If the data stream pipelines one sample per clock cycle, then M = n_2 − n_1 clock cycles.
Define

A_{j_1,j_2,⋯,j_m} = Σ_{i_1=j_1,j_1+2,⋯}^{i_1<K_h} h_{i_1} Σ_{i_2=j_2,j_2+2,⋯}^{i_2<K_h} h_{i_2} ⋯ Σ_{i_m=j_m,j_m+2,⋯}^{i_m<K_h} h_{i_m} ll⋯l^{j−1}_{2n_1−i_1, 2n_2−i_2, ⋯, 2n_m−i_m}   (5)

where each summation index i_k steps by 2 from j_k while i_k < K_h.
From (1) and (5), we have

ll⋯l^j_{n_1,n_2,⋯,n_m} = A_{1,1,⋯,1} + A_{1,1,⋯,0} + ⋯ + A_{0,0,⋯,0}   (6)
The right-hand side of equation (6) includes 2^m independent parts, and each part is an m-dimensional convolution. Thus we can run the circuit at 2^m times the input clock rate and obtain 2^m parallel data streams:

{ll⋯l^{j−1}_{2n_1, 2n_2, ⋯, 2n_m}}
{ll⋯l^{j−1}_{2n_1, 2n_2, ⋯, 2n_m+1}}
⋮
{ll⋯l^{j−1}_{2n_1+1, 2n_2+1, ⋯, 2n_m+1}}
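The grouping into 2^m polyphase streams, one per parity pattern (k_1, ⋯, k_m) ∈ {0,1}^m of the sample indices, can be sketched in software as strided slices of the input array. The function name and the NumPy formulation below are illustrative assumptions, not part of the paper:

```python
import itertools
import numpy as np

def polyphase_streams(x):
    """Split an m-D array into the 2^m parity streams of Eq. (6):
    stream (k_1,...,k_m) holds the samples x[2n_1+k_1, ..., 2n_m+k_m]."""
    m = x.ndim
    streams = {}
    for parity in itertools.product((0, 1), repeat=m):
        # stride-2 slice starting at offset k in each dimension
        sl = tuple(slice(k, None, 2) for k in parity)
        streams[parity] = x[sl]
    return streams

x = np.arange(4 * 4 * 4).reshape(4, 4, 4)  # m = 3 -> 2^3 = 8 streams
s = polyphase_streams(x)
assert len(s) == 8
assert s[(0, 0, 0)][0, 0, 0] == x[0, 0, 0]
assert s[(1, 1, 1)][0, 0, 0] == x[1, 1, 1]
# the streams partition the input: the total sample count is preserved
assert sum(v.size for v in s.values()) == x.size
```

In the hardware each stream is simply the input sequence demultiplexed at 2^m times the input clock rate, so no extra memory is needed to form them.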
Assume that the input image is of size N_1 × N_2 × ⋯ × N_m; then in each data stream, 0 ≤ n_1 < N_1/2^j, 0 ≤ n_2 < N_2/2^j, ⋯, 0 ≤ n_m < N_m/2^j.
From the definition in (4), we have

ll⋯l^{j−1}_{2n_1, 2n_2, ⋯, 2n_m} = ll⋯l^{j−1}_{2n_1, 2n_2, ⋯, 2l_m}(l_m − n_m)   (7)

ll⋯l^{j−1}_{2n_1, 2n_2, ⋯, 2n_{m−1}, 2n_m} = ll⋯l^{j−1}_{2n_1, 2n_2, ⋯, 2l_{m−1}, 2n_m}((l_{m−1} − n_{m−1}) N_m/2^j)   (8)
⋮

ll⋯l^{j−1}_{2n_1, 2n_2, ⋯, 2n_m} = ll⋯l^{j−1}_{2l_1, 2n_2, ⋯, 2n_m}((l_1 − n_1) N_m N_{m−1} ⋯ N_2 / 2^{j(m−1)})   (9)
For k_1, k_2, ⋯, k_m = 0, 1, define

P_{k_1} = h_{k_1} Σ_{i_2=0}^{K_h−1} ⋯ Σ_{i_m=0}^{K_h−1} h_{i_2} ⋯ h_{i_m} ll⋯l^{j−1}_{2n_1−k_1, 2n_2−i_2, ⋯, 2n_m−i_m}   (10)
then

ll⋯l^j_{n_1,n_2,⋯,n_m} = Σ_{i=0}^{i<(K_h−1)/2} (h_{2i+1}/h_1) P_1(i N_m N_{m−1} ⋯ N_2 / 2^{j(m−1)}) + Σ_{i=0}^{i<K_h/2} (h_{2i}/h_0) P_0(i N_m N_{m−1} ⋯ N_2 / 2^{j(m−1)})   (11)
Define

P_{k_1,k_2} = h_{k_1} h_{k_2} Σ_{i_3=0}^{K_h−1} ⋯ Σ_{i_m=0}^{K_h−1} h_{i_3} ⋯ h_{i_m} ll⋯l^{j−1}_{2n_1−k_1, 2n_2−k_2, ⋯, 2n_m−i_m}   (12)
then

P_{k_1} = Σ_{i=0}^{i<(K_h−1)/2} (h_{2i+1}/h_1) P_{k_1,1}(i N_m N_{m−1} ⋯ N_3 / 2^{j(m−2)}) + Σ_{i=0}^{i<K_h/2} (h_{2i}/h_0) P_{k_1,0}(i N_m N_{m−1} ⋯ N_3 / 2^{j(m−2)})   (13)
⋮

Define

P_{k_1,k_2,⋯,k_{m−1}} = h_{k_1} h_{k_2} ⋯ h_{k_{m−1}} Σ_{i_m=0}^{K_h−1} h_{i_m} ll⋯l^{j−1}_{2n_1−k_1, 2n_2−k_2, ⋯, 2n_{m−1}−k_{m−1}, 2n_m−i_m}   (14)
then

P_{k_1,k_2,⋯,k_{m−2}} = Σ_{i=0}^{i<(K_h−1)/2} (h_{2i+1}/h_1) P_{k_1,⋯,k_{m−2},1}(i N_m/2^j) + Σ_{i=0}^{i<K_h/2} (h_{2i}/h_0) P_{k_1,⋯,k_{m−2},0}(i N_m/2^j)   (15)
and

P_{k_1,k_2,⋯,k_{m−1}} = Σ_{i=0}^{i<(K_h−1)/2} h_{k_1} h_{k_2} ⋯ h_{k_{m−1}} h_{2i+1} ll⋯l^{j−1}_{2n_1−k_1, ⋯, 2n_{m−1}−k_{m−1}, 2n_m}(i) + Σ_{i=0}^{i<K_h/2} h_{k_1} h_{k_2} ⋯ h_{k_{m−1}} h_{2i} ll⋯l^{j−1}_{2n_1−k_1, ⋯, 2n_{m−1}−k_{m−1}, 2n_m+1}(i)   (16)
The latency between any two data in a data stream can be implemented using shift registers. According to (11), (13), (15) and (16), we can design a parallel-pipeline architecture for the implementation of the LL⋯L filter by decomposing ll⋯l^j_{n_1,n_2,⋯,n_m} step by step. Assuming K_h = 5, the corresponding circuits for (15) and (16) are illustrated in Fig. 2 and Fig. 3, respectively. The delay-block in Fig. 2 delays data by N_m/2^j clock cycles (ccs), and it can be implemented by pushing the data into an SRAM module of size N_m/2^j. Fig. 4 shows the architecture for the LL⋯L filter. Each pair of arrows in Fig. 4 is implemented by a specific circuit (e.g., the circuit in Fig. 2 or Fig. 3).
[Figure: the input streams P_{k_1,⋯,k_{m−2},0} and P_{k_1,⋯,k_{m−2},1} pass through chains of multipliers (coefficients h_2/h_0 and h_4/h_0 on the even branch, h_3/h_1 on the odd branch), N_m/2^j delay-blocks and adders, and are summed into P_{k_1,⋯,k_{m−2}}.]

Fig. 2. Circuit for (15). In this example, K_h = 5.
The number of multipliers required in the architecture in Fig. 4 is

2^{m−1} K_h + (2^{m−2} + 2^{m−3} + ⋯ + 1)(K_h − 2) = (2^m − 1) K_h − 2^m + 2   (17)
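The closed form in (17) can be sanity-checked against the term-by-term sum it abbreviates. The sketch below (Python, with illustrative function names) verifies the identity over a range of m and K_h:

```python
def mult_count_sum(m, Kh):
    """Multiplier count as summed in Eq. (17): 2^(m-1) leaf circuits with
    K_h multipliers each, plus (2^(m-2)+...+1) combining circuits with
    K_h - 2 multipliers each."""
    return 2**(m - 1) * Kh + sum(2**p for p in range(m - 1)) * (Kh - 2)

def mult_count_closed(m, Kh):
    """Closed form of Eq. (17)."""
    return (2**m - 1) * Kh - 2**m + 2

for m in range(1, 8):
    for Kh in range(2, 12):
        assert mult_count_sum(m, Kh) == mult_count_closed(m, Kh)
```

The check rests on the geometric-series identity 2^{m−2} + ⋯ + 1 = 2^{m−1} − 1, which collapses the sum to the stated closed form.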
and the number of adders is

(2^{m−1} + 2^{m−2} + ⋯ + 1)(K_h − 1) = (2^m − 1)(K_h − 1)   (18)
Besides, the storage size required for the delay-blocks in this architecture is

(2^{m−2} N_m/2^j + 2^{m−3} N_m N_{m−1}/2^{2j} + ⋯ + N_m N_{m−1} ⋯ N_2/2^{j(m−1)})(K_h − 2)   (19)
In the same way, we can design the other subband filters. It is easy to see that the total number of multipliers required for the architectures of all 2^m filters is

2^{m−1}((2^m − 1)(K_h + K_g) − 2^{m+1} + 4)   (20)
[Figure: the two input streams ll⋯l^{j−1}_{2n_1−k_1,⋯,2n_{m−1}−k_{m−1},2n_m} and ll⋯l^{j−1}_{2n_1−k_1,⋯,2n_{m−1}−k_{m−1},2n_m+1} pass through registers, multipliers with coefficients h_{k_1} h_{k_2} ⋯ h_{k_{m−1}} h_0 through h_{k_1} h_{k_2} ⋯ h_{k_{m−1}} h_4, and adders, producing P_{k_1,k_2,⋯,k_{m−1}}.]

Fig. 3. Circuit for (16). In this example, K_h = 5.
the total number of adders is

2^{m−1}(2^m − 1)(K_h + K_g − 2)   (21)
and the total storage size required for the delay-blocks at decomposition level j is

2^{m−1}(2^{m−2} N_m/2^j + 2^{m−3} N_m N_{m−1}/2^{2j} + ⋯ + N_m N_{m−1} ⋯ N_2/2^{j(m−1)})(K_h + K_g − 4)   (22)
All filters can share some registers, so the number of registers required in Fig. 1 is max{2^{m−1} K_h, 2^{m−1} K_g}.
Obviously, the first-level decomposition of the proposed architecture requires N_1 N_2 ⋯ N_m / 2^m clock cycles (ccs). Since the number of ccs at each level is 1/2^m of that at the previous level, the total number of ccs required for the decomposition down to level J is

Σ_{i=1}^{J} N_1 N_2 ⋯ N_m / 2^{mi} = (1 − 2^{−Jm}) N_1 N_2 ⋯ N_m / (2^m − 1)   (23)
If J is large enough, (23) converges to

N_1 N_2 ⋯ N_m / (2^m − 1)   (24)
During the second and subsequent level decompositions, the storage used in the first-level decomposition can be partly reused, so the hardware cost is determined by the first-level decomposition. The storage size of the RAM module in Fig. 1 should be N_1 N_2 ⋯ N_m / 2^m, enough to save the LL⋯L subband data for the next-level decomposition.
3. PERFORMANCE AND COMPARISONS

For the purpose of comparison with other architectures, we choose m = 2 and K_h = K_g = K (the filter length). Table 1 compares the performance of our architecture with other architectures in terms of computing time, number of multipliers, number of adders, storage size and control complexity. The parameter N^2 denotes the input image size and J denotes the 2-D DWT level. From the comparison data, we see that the efficiency of our architecture
[Figure: a binary tree of combining circuits; the 2^m input streams ll⋯l^{j−1}_{2n_1,2n_2,⋯,2n_m} through ll⋯l^{j−1}_{2n_1+1,2n_2+1,⋯,2n_m+1} are combined pairwise into the 2^{m−1} partial results P_{k_1,⋯,k_{m−1}}, then into the P_{k_1,⋯,k_{m−2}}, and so on through P_{0,0}, P_{0,1}, P_{1,0}, P_{1,1} and P_0, P_1, down to the single output ll⋯l^j_{n_1,n_2,⋯,n_m}.]

Fig. 4. Architecture for the LL⋯L filter. Each pair of arrows is implemented by a specific circuit (e.g., the circuit in Fig. 2 or Fig. 3).
Table 1. Performance Comparisons of Several 2-D DWT Architectures. (K: Filter Length, N^2: Image Size, and J: 2-D DWT Level)

Architectures          | Computing Time (ccs) | Multipliers | Adders     | Storage Size            | Control Complexity
-----------------------|----------------------|-------------|------------|-------------------------|-------------------
Ours                   | ≈ N^2/3              | 12K − 8     | 12(K − 1)  | N^2/4 + 2N(K − 2) + 2K  | Simple
Wu's [9]               | ≈ 2N^2/3             | 4K          | 4K         | N^2/4 + KN + K          | Moderate
Quadri-Filter [10]     | ≈ 2N^2/3             | 2K^2        | 2(K^2 − 1) | 2KN                     | Moderate
Non-Separable [5]      | N^2                  | 2K^2        | 2(K^2 − 1) | 2KN                     | Complex
SIMD [5]               | K^2 J                | 2N^2        | 2N^2       | N^2                     | Complex
Systolic-Parallel [4]  | N^2                  | 4K          | 4K         | 2KN + 4N                | Complex
is slightly lower than Wu's but clearly better than the other architectures. We should mention that Wu's approach is only suitable for the two-dimensional case. The RAM module of size N^2/4 is not necessary if we use memory already existing in the system to save the low-low subband data for the next-level decomposition [9]. The number of registers required for the proposed architecture is also included in the storage size. Besides its regularity, our architecture only includes very simple processing elements (PEs), so the control complexity of our architecture is lower than that of the others.
4. CONCLUSIONS

This paper proposes a novel VLSI architecture for the multidimensional DWT. The comparative evaluation provided for the two-dimensional case shows several advantages of our architecture. Besides, we give a detailed architecture for the multidimensional DWT, which has not been discussed in the literature.
5. REFERENCES

[1] S. G. Mallat, "A theory for multiresolution signal decomposition: The wavelet representation," IEEE Trans. Pattern Anal. Machine Intell., vol. 11, pp. 674-693, July 1989.

[2] A. S. Lewis and G. Knowles, "VLSI architecture for 2-D Daubechies wavelet transform without multipliers," Electron. Lett., vol. 27, pp. 171-173, Jan. 1991.

[3] K. K. Parhi and T. Nishitani, "VLSI architectures for discrete wavelet transforms," IEEE Trans. VLSI Syst., vol. 1, pp. 191-202, June 1993.

[4] M. Vishwanath, R. M. Owens, and M. J. Irwin, "VLSI architectures for the discrete wavelet transform," IEEE Trans. Circuits Syst. II, vol. 42, pp. 305-316, May 1995.

[5] C. Chakrabarti and M. Vishwanath, "Efficient realizations of the discrete and continuous wavelet transforms: From single chip implementations to mappings on SIMD array computers," IEEE Trans. Signal Processing, vol. 43, pp. 759-771, Mar. 1995.

[6] H. Y. H. Chuang and L. Chen, "VLSI architecture for fast 2-D discrete orthogonal wavelet transform," J. VLSI Signal Processing, vol. 10, pp. 225-236, Aug. 1995.

[7] J. Chen and M. A. Bayoumi, "A scalable systolic array architecture for 2-D discrete wavelet transforms," in Proc. IEEE VLSI Signal Processing Workshop, 1995, pp. 303-312.

[8] C. Chakrabarti and C. Mumford, "Efficient realizations of analysis and synthesis filters based on the 2-D discrete wavelet transform," in Proc. IEEE ICASSP, May 1996, pp. 3256-3259.

[9] P. C. Wu and L. G. Chen, "An efficient architecture for two-dimensional discrete wavelet transform," IEEE Trans. Circuits Syst. Video Technol., vol. 11, pp. 536-545, Apr. 2001.

[10] F. Marino, "Two fast architectures for the direct 2-D discrete wavelet transform," IEEE Trans. Signal Processing, vol. 49, pp. 1248-1259, June 2001.