A NOVEL VLSI ARCHITECTURE FOR MULTIDIMENSIONAL DISCRETE WAVELET TRANSFORM


Xinjian Chen, Qionghai Dai
Department of Automation
Tsinghua University, Beijing 100084, China
Cxjian99@mails.tsinghua.edu.cn, Qhdai@tsinghua.edu.cn
ABSTRACT
In this short paper we propose a novel VLSI architecture for the multidimensional discrete wavelet transform (m-D DWT) based on a systolic array and a non-separable approach. The proposed architecture performs a decomposition of an $N_1 \times N_2 \times \cdots \times N_m$ image in about $N_1 N_2 \cdots N_m/(2^m - 1)$ clock cycles (ccs), considerably faster than other known architectures. In addition, the proposed architecture offers very simple hardware, regular data flow and low control complexity.
1. INTRODUCTION

The discrete wavelet transform (DWT) [1] has recently emerged as a powerful tool for audio and image compression. The DWT also has important applications in areas as diverse as signal processing, digital communications, numerical analysis, computer graphics and radar target recognition. Since the DWT demands massive computation, dedicated VLSI ASIC solutions should be considered to meet the real-time requirements of practical applications.

Although many VLSI architectures for the 1-D DWT have been proposed, it is hard to design a highly efficient VLSI architecture with low hardware cost and high throughput for the 2-D DWT. Lewis and Knowles [2] used the four-tap Daubechies filter to design a 2-D DWT architecture without multipliers. Parhi and Nishitani [3] proposed two architectures that combine word-parallel and digit-serial methodologies. Vishwanath et al. [4] proposed a systolic-parallel architecture using a combination of systolic and parallel filters for the 2-D DWT. Chakrabarti and Vishwanath [5] proposed two efficient non-separable architectures, the parallel filter and the SIMD 2-D array, which optimize both area and time. Chuang and Chen [6] presented a parallel pipelined VLSI array architecture for the 2-D DWT. Chen and Bayoumi [7] proposed a scalable systolic array architecture. Chakrabarti and Mumford [8] presented folded architectures and scheduling algorithms for computing the 2-D DWT with analysis and synthesis filters. Among the various VLSI architectures, there exist two best-known designs for the 2-D DWT in terms of computing time and hardware cost. One was proposed by Wu et al. for the separable 2-D DWT [9], employing the polyphase decomposition and coefficient folding techniques; the other was reported by Marino for the non-separable 2-D DWT [10], based on a modified recursive pyramid algorithm (MRPA) [5]. Either of these two architectures can perform a decomposition of an $N \times N$ image in about $2N^2/3$ clock cycles (ccs). At present, little literature presents VLSI architectures for the multidimensional DWT, due to the complexity of the design.
Based on a systolic array and a non-separable approach, in this short paper we propose a novel VLSI architecture for the multidimensional (m-D) DWT. The proposed architecture performs a decomposition of an $N_1 \times N_2 \times \cdots \times N_m$ image in about $N_1 N_2 \cdots N_m/(2^m - 1)$ clock cycles (ccs), considerably faster than other known architectures. In the particular case of the 2-D DWT, the proposed architecture performs a decomposition of an $N \times N$ image in approximately $N^2/3$ clock cycles (ccs), only half of that in [9] or [10]. Another advantage is that the proposed architecture requires fewer multipliers and accumulators (MACs) than other non-separable approaches when the filter length is large. Besides, the proposed architecture has very simple hardware, regular data flow and low control complexity.

The rest of the paper is organized as follows. In Section 2 we propose the architecture for the multidimensional DWT, and in Section 3 we provide comparative evaluations. Finally, in Section 4, we give conclusions.
2. ARCHITECTURE FOR MULTIDIMENSIONAL DWT

When the m-D wavelet basis functions are separable, the m-D DWT can be divided into m 1-D operations, i.e., the row-column method. However, this separable approach requires extra, large memory to save data that must be transposed for row (column) by column (row) processing. In our non-separable approach, the wavelet basis functions are still separable. For convenience of description, assume that the wavelet basis functions are identical in each dimension. Let $h_i$ ($i = 0, 1, \cdots, K_h - 1$) and $g_i$ ($i = 0, 1, \cdots, K_g - 1$) denote, respectively, the coefficients of the low-pass and high-pass filter bases. Let $ll\cdots l^{j}_{n_1, n_2, \cdots, n_m}$ denote the coefficients of the low-low-$\cdots$-low subband produced at decomposition level $j$, and so on. Then the mathematical formulas for the multidimensional DWT can be defined as follows.
$$ll\cdots l^{j}_{n_1,n_2,\cdots,n_m} = \sum_{i_1=0}^{K_h-1}\sum_{i_2=0}^{K_h-1}\cdots\sum_{i_m=0}^{K_h-1} h_{i_1}h_{i_2}\cdots h_{i_m}\; ll\cdots l^{j-1}_{2n_1-i_1,\,2n_2-i_2,\,\cdots,\,2n_m-i_m} \qquad (1)$$
$$ll\cdots h^{j}_{n_1,n_2,\cdots,n_m} = \sum_{i_1=0}^{K_h-1}\sum_{i_2=0}^{K_h-1}\cdots\sum_{i_m=0}^{K_g-1} h_{i_1}h_{i_2}\cdots g_{i_m}\; ll\cdots l^{j-1}_{2n_1-i_1,\,2n_2-i_2,\,\cdots,\,2n_m-i_m} \qquad (2)$$

$$\vdots$$
0-7803-7965-9/03/$17.00 ©2003 IEEE — ICME 2003


$$hh\cdots h^{j}_{n_1,n_2,\cdots,n_m} = \sum_{i_1=0}^{K_g-1}\sum_{i_2=0}^{K_g-1}\cdots\sum_{i_m=0}^{K_g-1} g_{i_1}g_{i_2}\cdots g_{i_m}\; ll\cdots l^{j-1}_{2n_1-i_1,\,2n_2-i_2,\,\cdots,\,2n_m-i_m} \qquad (3)$$
Following Wu's approach [9], we design our architecture for the m-D DWT as illustrated in Fig. 1. The proposed architecture includes a multiplexer, a RAM module and $2^m$ subband filters. The RAM module saves the $j$th-level low-low-$\cdots$-low subband data for performing the next-level decomposition. The multiplexer selects the proper data for decomposition: only when performing the first-level decomposition does it select the original input data; otherwise it selects data from the RAM module. This procedure repeats until the desired level is finished.
Fig. 1. Architecture for m-D DWT. (The MUX selects between the original input and the RAM output; after subsampling by 2 in each dimension, the $2^m$ subband filters LL$\cdots$L, LL$\cdots$H, $\cdots$, HH$\cdots$H take $ll\cdots l^{j-1}_{n_1,n_2,\cdots,n_m}$ and produce $ll\cdots l^{j}$, $ll\cdots h^{j}$, $\cdots$, $hh\cdots h^{j}$; the LL$\cdots$L output is written back to the RAM.)
The approach to designing the circuits for each subband filter based on a systolic array is similar, so we choose the LL$\cdots$L filter as an example. Since the parallel-pipeline structure is one of the main characteristics of a systolic array, we introduce a new notation:
$$x_{n_1} = x_{n_2}(M) \qquad (4)$$
where $M$ denotes the latency from $x_{n_1}$ to $x_{n_2}$ in the given pipelined data stream $\{x_n\}$ ($n = 0, 1, \cdots$). If the data stream pipelines one sample per clock cycle, then $M = n_2 - n_1$ clock cycles.
Dene
A
j
1
,j
2
,∙∙∙,j
m
=
i
1
<K
h
￿
i
1
=j
1
i
1
=i
1
+2
h
i
1
i
2
<K
h
￿
i
2
=j
2
i
2
=i
2
+2
h
i
2
∙ ∙ ∙
i
m
<K
h
￿
i
m
=j
m
i
m
=i
m
+2
h
i
m
ll ∙ ∙ ∙l
j−1
2n
1
−i
1
,2n
2
−i
2
,∙∙∙,2n
m
−i
m
(5)
From (1) and (5), we have
$$ll\cdots l^{j}_{n_1,n_2,\cdots,n_m} = A_{1,1,\cdots,1} + A_{1,1,\cdots,0} + \cdots + A_{0,0,\cdots,0} \qquad (6)$$
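Equation (6) can be spot-checked numerically in the 2-D case: the four terms $A_{0,0}, A_{0,1}, A_{1,0}, A_{1,1}$, each summing only even- or odd-indexed taps per dimension, must add up to the full double sum of (1). The code below is an illustrative check with a made-up input and filter; the circular boundary indexing is our own choice.

```python
def A(j1, j2, x, h, n1, n2):
    """One term of eq. (5) for m = 2: tap indices start at j1 (resp. j2)
    and advance in steps of 2, as in the definition above."""
    N1, N2 = len(x), len(x[0])
    return sum(h[i1] * h[i2] * x[(2*n1 - i1) % N1][(2*n2 - i2) % N2]
               for i1 in range(j1, len(h), 2)
               for i2 in range(j2, len(h), 2))

def full_sum(x, h, n1, n2):
    """The complete double sum of eq. (1) for m = 2."""
    N1, N2 = len(x), len(x[0])
    return sum(h[i1] * h[i2] * x[(2*n1 - i1) % N1][(2*n2 - i2) % N2]
               for i1 in range(len(h)) for i2 in range(len(h)))

x = [[(3 * r + c) % 11 for c in range(6)] for r in range(6)]
h = [0.1, 0.3, 0.4, 0.2]  # hypothetical 4-tap filter
for n1 in range(3):
    for n2 in range(3):
        parts = (A(0, 0, x, h, n1, n2) + A(0, 1, x, h, n1, n2)
                 + A(1, 0, x, h, n1, n2) + A(1, 1, x, h, n1, n2))
        assert abs(parts - full_sum(x, h, n1, n2)) < 1e-12  # eq. (6)
```

Each $A$ term touches a disjoint even/odd subset of the input samples, which is what makes the $2^m$ parallel data streams of the next paragraph possible.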
The right-hand side of equation (6) includes $2^m$ independent parts, and each part is an m-dimensional convolution. Thus we can run the internal clock at $2^m$ times the input clock rate and obtain $2^m$ parallel data streams:
$$\{ll\cdots l^{j-1}_{2n_1,\,2n_2,\,\cdots,\,2n_m}\}$$
$$\{ll\cdots l^{j-1}_{2n_1,\,2n_2,\,\cdots,\,2n_m+1}\}$$
$$\vdots$$
$$\{ll\cdots l^{j-1}_{2n_1+1,\,2n_2+1,\,\cdots,\,2n_m+1}\}$$
Assume that the input image is of size $N_1 \times N_2 \times \cdots \times N_m$; then in each data stream, $0 \le n_1 < N_1/2^j$, $0 \le n_2 < N_2/2^j$, $\cdots$, $0 \le n_m < N_m/2^j$.
Fromthe denition of (4),we have
ll ∙ ∙ ∙ l
j−1
2n
1
,2n
2
,∙∙∙,2n
m
= ll ∙ ∙ ∙ l
j−1
2n
1
,2n
2
,∙∙∙,2l
m
(l
m
−n
m
) (7)
ll ∙ ∙ ∙ l
j−1
2n
1
,2n
2
,∙∙∙,2n
m−1
,2n
m
=
ll∙ ∙ ∙l
j−1
2n
1
,2n
2
,∙∙∙,2l
m−1
,2n
m
((l
m−1
−n
m−1
)N
m
/2
j
)
(8)
.
.
.
ll ∙ ∙ ∙ l
j−1
2n
1
,2n
2
,∙∙∙,2n
m
=
ll∙ ∙ ∙l
j−1
2l
1
,2n
2
,∙∙∙,2n
m
((l
1
−n
1
)N
m
N
m−1
∙ ∙ ∙N
2
/2
j(m−1)
)
(9)
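Equations (7)-(9) state the clock-cycle latency between samples of a raster-ordered stream: neighbors along the last dimension are one cc apart, along dimension $m-1$ they are $N_m/2^j$ ccs apart, and so on. A small positional model (our own illustration, with arbitrary extents) makes this concrete:

```python
def stream_position(indices, dims):
    """Clock-cycle position of a sample in a raster-ordered m-D stream
    whose per-dimension extents are `dims` (here playing the role of N_k / 2^j)."""
    pos = 0
    for n, d in zip(indices, dims):
        pos = pos * d + n
    return pos

dims = [8, 4, 16]                       # hypothetical N_1/2^j, N_2/2^j, N_3/2^j (m = 3)
base = stream_position([2, 1, 5], dims)
# Eq. (7): stepping the last index by (l_m - n_m) costs that many ccs.
assert stream_position([2, 1, 9], dims) - base == 9 - 5
# Eq. (8): stepping dimension m-1 costs (l_{m-1} - n_{m-1}) * (N_m / 2^j) ccs.
assert stream_position([2, 3, 5], dims) - base == (3 - 1) * 16
# Eq. (9): stepping the first index costs (l_1 - n_1) * (N_2/2^j) * (N_3/2^j) ccs here.
assert stream_position([6, 1, 5], dims) - base == (6 - 2) * 16 * 4
```

These latencies are exactly the delay-block sizes that appear later in the hardware cost expressions.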
For $k_1, k_2, \cdots, k_m = 0, 1$, define
$$P_{k_1} = h_{k_1} \sum_{i_2=0}^{K_h-1} \cdots \sum_{i_m=0}^{K_h-1} h_{i_2} \cdots h_{i_m}\; ll\cdots l^{j-1}_{2n_1-k_1,\,2n_2-i_2,\,\cdots,\,2n_m-i_m} \qquad (10)$$
then
$$ll\cdots l^{j}_{n_1,n_2,\cdots,n_m} = \sum_{i=0}^{i<(K_h-1)/2} \frac{h_{2i+1}}{h_1}\, P_1\big(i\,N_m N_{m-1} \cdots N_2 / 2^{j(m-1)}\big) + \sum_{i=0}^{i<K_h/2} \frac{h_{2i}}{h_0}\, P_0\big(i\,N_m N_{m-1} \cdots N_2 / 2^{j(m-1)}\big) \qquad (11)$$
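The coefficient ratios in (11) (and in the multiplier labels of Fig. 2) arise because $P_1$ already carries a factor $h_1$ and $P_0$ a factor $h_0$ by (10), so reusing delayed copies of the $P$ streams only requires multipliers $h_{2i+1}/h_1$ and $h_{2i}/h_0$. A 1-D analogue, our own illustrative reduction of (10)-(11) to $m = 1$ with circular indexing, can be checked numerically:

```python
def direct(x, h, n):
    """y_n = sum_i h_i * x_{2n-i}: the 1-D analogue of eq. (1)."""
    return sum(h[i] * x[(2 * n - i) % len(x)] for i in range(len(h)))

def via_P(x, h, n):
    """Eq. (11)'s structure for m = 1: P_k at position p is h_k * x_{2p-k};
    delayed copies of P_1, P_0 are scaled by h_{2i+1}/h_1 and h_{2i}/h_0."""
    P = lambda k, pos: h[k] * x[(2 * pos - k) % len(x)]
    odd = sum(h[2*i + 1] / h[1] * P(1, n - i) for i in range(len(h) // 2))
    even = sum(h[2*i] / h[0] * P(0, n - i) for i in range((len(h) + 1) // 2))
    return odd + even

x = [1.0, 4.0, 2.0, 8.0, 5.0, 7.0, 3.0, 6.0]
h = [0.5, 1.0, 0.75, 0.25, 0.5]   # hypothetical K_h = 5 filter (h_0, h_1 nonzero)
assert all(abs(direct(x, h, n) - via_P(x, h, n)) < 1e-12 for n in range(4))
```

The ratio trick requires $h_0 \neq 0$ and $h_1 \neq 0$; the delayed-and-rescaled $P$ streams then reproduce the full convolution tap by tap.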
Dene
P
k
1
,k
2
= h
k
1
h
k
2
K
h
−1
￿
i
3
=0
∙ ∙ ∙
K
h
−1
￿
i
m
=0
h
i
3
∙ ∙ ∙ h
i
m
ll∙ ∙ ∙l
j−1
2n
1
−k
1
,2n
2
−k
2
,∙∙∙,2n
m
−i
m
(12)
then
P
k
1
=
i<(k
h
−1)/2
￿
i=0
i=i+1
h
2i+1
h
1
P
k
1
,1
(iN
m
N
m−1
∙ ∙ ∙N
3
/2
j(m−2)
)
+
i<k
h
/2
￿
i=0
i=i+1
h
2i
h
0
P
k
1
,0
(iN
m
N
m−1
∙ ∙ ∙ N
3
/2
j(m−2)
)
(13)
.
.
.
Dene
P
k
1
,k
2
,∙∙∙,k
m−1
= h
k
1
h
k
2
∙ ∙ ∙ h
k
m−1
K
h
−1
￿
i
m
=0
h
i
m
ll∙ ∙ ∙l
j−1
2n
1
−k
1
,2n
2
−k
2
,∙∙∙,2n
m−1
−k
m−1
,2n
m
−i
m
(14)
then
P
k
1
,k
2
,∙∙∙,k
m−2
=
i<(k
h
−1)/2
￿
i=0
i=i+1
h
2i+1
h
1
P
k
1
,k
2
,∙∙∙,k
m−2
,1
(iN
m
/2
j
)
+
i<k
h
/2
￿
i=0
i=i+1
h
2i
h
0
P
k
1
,k
2
,∙∙∙,k
m−2
,0
(iN
m
/2
j
)
(15)


and
$$P_{k_1,k_2,\cdots,k_{m-1}} = \sum_{i=0}^{i<(K_h-1)/2} h_{k_1} h_{k_2} \cdots h_{k_{m-1}} h_{2i+1}\; ll\cdots l^{j-1}_{2n_1-k_1,\,2n_2-k_2,\,\cdots,\,2n_{m-1}-k_{m-1},\,2n_m}(i) + \sum_{i=0}^{i<K_h/2} h_{k_1} h_{k_2} \cdots h_{k_{m-1}} h_{2i}\; ll\cdots l^{j-1}_{2n_1-k_1,\,2n_2-k_2,\,\cdots,\,2n_{m-1}-k_{m-1},\,2n_m+1}(i) \qquad (16)$$
The latency between any two data in the data stream can be implemented using shift registers. According to (11), (13), (15) and (16), we can design a parallel-pipeline architecture for the implementation of the LL$\cdots$L filter by decomposing $ll\cdots l^{j}_{n_1,n_2,\cdots,n_m}$ step by step. Assuming $K_h = 5$, the corresponding circuits for (15) and (16) are illustrated in Fig. 2 and Fig. 3, respectively. The delay-block in Fig. 2 delays data by $N_m/2^j$ clock cycles (ccs); it can be implemented by pushing the data through an SRAM module of size $N_m/2^j$. Fig. 4 shows the architecture for the LL$\cdots$L filter. Each pair of arrows in Fig. 4 is implemented by a specific circuit (e.g., the circuit in Fig. 2 or Fig. 3).



Fig. 2. Circuit for (15). In this example, $K_h = 5$. (The inputs $P_{k_1,k_2,\cdots,k_{m-2},0}$ and $P_{k_1,k_2,\cdots,k_{m-2},1}$ pass through a chain of $N_m/2^j$ delay-blocks; multipliers with coefficients $h_2/h_0$, $h_3/h_1$, $h_4/h_0$ and adders combine the delayed taps into $P_{k_1,k_2,\cdots,k_{m-2}}$.)
The number of multipliers required in the architecture in Fig. 4 is
$$2^{m-1} K_h + (2^{m-2} + 2^{m-3} + \cdots + 1)(K_h - 2) = (2^m - 1) K_h - 2^m + 2 \qquad (17)$$
and the number of adders is
$$(2^{m-1} + 2^{m-2} + \cdots + 1)(K_h - 1) = (2^m - 1)(K_h - 1) \qquad (18)$$
Besides, the storage size required for the delay-blocks in this architecture is
$$\big(2^{m-2} N_m/2^j + 2^{m-3} N_m N_{m-1}/2^{2j} + \cdots + N_m N_{m-1} \cdots N_2/2^{j(m-1)}\big)(K_h - 2) \qquad (19)$$
In the same way, we can design the other subband filters. It is easy to see that the total number of multipliers required for the architectures of all $2^m$ filters is
$$2^{m-1}\big((2^m - 1)(K_h + K_g) - 2^{m+1} + 4\big) \qquad (20)$$







Fig. 3. Circuit for (16). In this example, $K_h = 5$. (The two input streams $ll\cdots l^{j-1}_{2n_1-k_1,\cdots,2n_{m-1}-k_{m-1},2n_m}$ and $ll\cdots l^{j-1}_{2n_1-k_1,\cdots,2n_{m-1}-k_{m-1},2n_m+1}$ pass through registers; multipliers with coefficients $h_{k_1} h_{k_2} \cdots h_{k_{m-1}} h_0$ through $h_{k_1} h_{k_2} \cdots h_{k_{m-1}} h_4$ and adders combine the taps into $P_{k_1,k_2,\cdots,k_{m-1}}$.)
the total number of adders is
$$2^{m-1}(2^m - 1)(K_h + K_g - 2) \qquad (21)$$
and the total storage size required for the delay-blocks at decomposition level $j$ is
$$2^{m-1}\big(2^{m-2} N_m/2^j + 2^{m-3} N_m N_{m-1}/2^{2j} + \cdots + N_m N_{m-1} \cdots N_2/2^{j(m-1)}\big)(K_h + K_g - 4) \qquad (22)$$
All lters can share some registers,so the number of registers
required in Fig.1 is max{2
m−1
K
h
,2
m−1
K
g
}.
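The closed forms in (17), (20) and (21) are easy to sanity-check in software. The helper below is our own verification sketch; it also reproduces the $m = 2$, $K_h = K_g = K$ hardware counts quoted later in Table 1 ($12K - 8$ multipliers and $12(K - 1)$ adders).

```python
def multipliers_one_filter(m, Kh):
    # Left-hand side of eq. (17): the level-by-level count for one filter.
    return 2**(m - 1) * Kh + sum(2**p for p in range(m - 1)) * (Kh - 2)

def total_multipliers(m, Kh, Kg):
    # Eq. (20): total over all 2^m subband filters.
    return 2**(m - 1) * ((2**m - 1) * (Kh + Kg) - 2**(m + 1) + 4)

def total_adders(m, Kh, Kg):
    # Eq. (21): total adders over all filters.
    return 2**(m - 1) * (2**m - 1) * (Kh + Kg - 2)

# Eq. (17)'s closed form matches the term-by-term sum for a range of (m, K_h).
for m in range(2, 6):
    for Kh in range(2, 10):
        assert multipliers_one_filter(m, Kh) == (2**m - 1) * Kh - 2**m + 2

# m = 2 with K_h = K_g = K reproduces the "Ours" row of Table 1.
for K in range(2, 10):
    assert total_multipliers(2, K, K) == 12 * K - 8
    assert total_adders(2, K, K) == 12 * (K - 1)
```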
Obviously,the rst-level decomposition of the proposed ar-
chitecture requires N
1
N
2
∙ ∙ ∙ N
m
/2
m
clock cycles (ccs).Since
the quantity of ccs in each level is 1/2
m
of that in the previous
level,the total quantity of ccs required for the decomposition to
Jth level is
J
￿
i=1
N
1
N
2
∙ ∙ ∙N
m
/2
mj
= (1 −2
−Jm
)N
1
N
2
∙ ∙ ∙N
m
/(2
m
−1)
(23)
If $J$ is large enough, (23) converges to
$$N_1 N_2 \cdots N_m / (2^m - 1) \qquad (24)$$
During the second and subsequent level decompositions, the storage used in the first-level decomposition can be partly reused, so the hardware cost is determined by the first-level decomposition. A RAM module of size $N_1 N_2 \cdots N_m / 2^m$ in Fig. 1 is enough to save the LL$\cdots$L subband data for the next-level decomposition.
3. PERFORMANCE AND COMPARISONS

For the purpose of comparison with other architectures, we choose $m = 2$ and $K_h = K_g = K$ (the filter length). Table 1 compares the performance of our architecture with other architectures in terms of computing time, number of multipliers, number of adders, storage size and control complexity. The parameter $N^2$ denotes the input image size and $J$ denotes the 2-D DWT level. From the comparison data, we see that the hardware efficiency of our architecture



Fig. 4. Architecture for the LL$\cdots$L filter. Each pair of arrows is implemented by a specific circuit (e.g., the circuit in Fig. 2 or Fig. 3). (The $2^m$ parallel input streams $\{ll\cdots l^{j-1}_{2n_1,2n_2,\cdots,2n_m}\}$ through $\{ll\cdots l^{j-1}_{2n_1+1,2n_2+1,\cdots,2n_m+1}\}$ are combined pairwise into the terms $P_{k_1,\cdots,k_{m-1}}$, then successively into $P_{k_1,\cdots,k_{m-2}}, \cdots, P_{0,0}, P_{0,1}, P_{1,0}, P_{1,1}, P_0, P_1$, and finally into $ll\cdots l^{j}_{n_1,n_2,\cdots,n_m}$.)
Table 1. Performance Comparisons of Several 2-D DWT Architectures. ($K$: Filter Length, $N^2$: Image Size, and $J$: 2-D DWT Level)

Architecture          | Computing Time (ccs) | Multipliers | Adders     | Storage Size            | Control Complexity
----------------------|----------------------|-------------|------------|-------------------------|-------------------
Ours                  | ≈ N²/3               | 12K − 8     | 12(K − 1)  | N²/4 + 2N(K − 2) + 2K   | Simple
Wu's [9]              | ≈ 2N²/3              | 4K          | 4K         | N²/4 + KN + K           | Moderate
Quadri-Filter [10]    | ≈ 2N²/3              | 2K²         | 2(K² − 1)  | 2KN                     | Moderate
Non-Separable [5]     | N²                   | 2K²         | 2(K² − 1)  | 2KN                     | Complex
SIMD [5]              | K²J                  | 2N²         | 2N²        | N²                      | Complex
Systolic-Parallel [4] | N²                   | 4K          | 4K         | 2KN + 4N                | Complex
is slightly lower than Wu's but obviously better than that of the other architectures. We should mention that Wu's approach is only suitable for the two-dimensional case. The RAM module of size $N^2/4$ is not necessary if we use already-existing memory in the system to save the low-low subband data for the next-level decomposition [9]. The number of registers required for the proposed architecture is also included in the storage size. Besides its regularity, our architecture includes only very simple processing elements (PEs), so the control complexity of our architecture is lower than that of the others.
4. CONCLUSIONS

This paper proposes a novel VLSI architecture for the multidimensional DWT. The comparative evaluation provided for the two-dimensional case shows the advantages of our architecture. Besides, we give a detailed architecture for the multidimensional DWT, which has not been discussed in the previous literature.
5. REFERENCES

[1] S. G. Mallat, "A theory for multiresolution signal decomposition: The wavelet representation," IEEE Trans. Pattern Anal. Machine Intell., vol. 11, pp. 674–693, July 1989.

[2] A. S. Lewis and G. Knowles, "VLSI architecture for 2-D Daubechies wavelet transform without multipliers," Electron. Lett., vol. 27, pp. 171–173, Jan. 1991.

[3] K. K. Parhi and T. Nishitani, "VLSI architectures for discrete wavelet transforms," IEEE Trans. VLSI Syst., vol. 1, pp. 191–202, June 1993.

[4] M. Vishwanath, R. M. Owens, and M. J. Irwin, "VLSI architectures for the discrete wavelet transform," IEEE Trans. Circuits Syst. II, vol. 42, pp. 305–316, May 1995.

[5] C. Chakrabarti and M. Vishwanath, "Efficient realizations of the discrete and continuous wavelet transforms: From single chip implementations to mappings on SIMD array computers," IEEE Trans. Signal Processing, vol. 43, pp. 759–771, Mar. 1995.

[6] H. Y. H. Chuang and L. Chen, "VLSI architecture for fast 2-D discrete orthogonal wavelet transform," J. VLSI Signal Processing, vol. 10, pp. 225–236, Aug. 1995.

[7] J. Chen and M. A. Bayoumi, "A scalable systolic array architecture for 2-D discrete wavelet transforms," in Proc. IEEE VLSI Signal Processing Workshop, 1995, pp. 303–312.

[8] C. Chakrabarti and C. Mumford, "Efficient realizations of analysis and synthesis filters based on the 2-D discrete wavelet transform," in Proc. IEEE ICASSP, May 1996, pp. 3256–3259.

[9] P. C. Wu and L. G. Chen, "An efficient architecture for two-dimensional discrete wavelet transform," IEEE Trans. Circuits Syst. Video Technol., vol. 11, pp. 536–545, Apr. 2001.

[10] F. Marino, "Two fast architectures for the direct 2-D discrete wavelet transform," IEEE Trans. Signal Processing, vol. 49, pp. 1248–1259, June 2001.