A NOVEL VLSI ARCHITECTURE FOR MULTIDIMENSIONAL
DISCRETE WAVELET TRANSFORM

Xinjian Chen, Qionghai Dai

Department of Automation
Tsinghua University, Beijing, 100084, China

Cxjian99@mails.tsinghua.edu.cn, Qhdai@tsinghua.edu.cn

ABSTRACT

In this short paper we propose a novel VLSI architecture for multidimensional discrete wavelet transform (m-D DWT) based on a systolic array and a non-separable approach. The proposed architecture performs a decomposition of an N_1 × N_2 × ⋯ × N_m image in about N_1 N_2 ⋯ N_m / (2^m − 1) clock cycles (ccs). This result considerably speeds up other known architectures. Besides, the advantages of the proposed architecture include very simple hardware complexity, regular data flow and low control complexity.

1. INTRODUCTION

The discrete wavelet transform (DWT) [1] has recently emerged as a powerful tool for audio and image compression. Besides, the DWT has important applications in areas as diverse as signal processing, digital communications, numerical analysis, computer graphics and radar target recognition. Since the DWT demands massive computation, dedicated VLSI ASIC solutions should be considered to meet the real-time requirements of practical applications.

Although many VLSI architectures for the 1-D DWT have been proposed, it is hard to design a highly efficient VLSI architecture with low hardware cost and high throughput for the 2-D DWT. Lewis and Knowles [2] used the four-tap Daubechies filter to design a 2-D DWT architecture without multipliers. Parhi and Nishitani [3] proposed two architectures that combine the word-parallel and digit-serial methodologies. Vishwanath et al. [4] proposed a systolic-parallel architecture using a combination of systolic and parallel filters for the 2-D DWT. Chakrabarti and Vishwanath [5] proposed two efficient non-separable architectures, the parallel filter and the SIMD 2-D array, which optimize both area and time. Chuang and Chen [6] presented a parallel pipelined VLSI array architecture for the 2-D DWT. Chen and Bayoumi [7] proposed a scalable systolic array architecture. Chakrabarti and Mumford [8] presented folded architectures and scheduling algorithms for computing the 2-D DWT for analysis and synthesis filters. Among the various VLSI architectures, there exist two best-known designs for the 2-D DWT in terms of computing time and hardware cost. One was proposed by Wu et al. for the separable 2-D DWT [9] by employing the polyphase decomposition technique and the coefficient folding technique, and the other was reported by Marino for the non-separable 2-D DWT [10] based on a modified recursive pyramid algorithm (MRPA) [5]. Either of these two architectures can perform a decomposition of an N × N image in about 2N^2/3 clock cycles (ccs). At present, there is little literature presenting VLSI architectures for the multidimensional DWT, due to the complexity of the design.

Based on a systolic array and a non-separable approach, in this short paper we propose a novel VLSI architecture for the multidimensional (m-D) DWT. The proposed architecture performs a decomposition of an N_1 × N_2 × ⋯ × N_m image in about N_1 N_2 ⋯ N_m / (2^m − 1) clock cycles (ccs). This result considerably speeds up other known architectures. In the particular case of the 2-D DWT, the proposed architecture performs a decomposition of an N × N image in approximately N^2/3 clock cycles (ccs), only half of that in [9] or [10]. Another advantage is that the proposed architecture requires fewer multipliers and accumulators (MACs) than other non-separable approaches when the filter length is large. Besides, the proposed architecture has very simple hardware complexity, regular data flow and low control complexity.

The rest of the paper is organized as follows. In Section 2 we propose the architecture for the multidimensional DWT, and in Section 3 we provide comparative evaluations. Finally, in Section 4, we give conclusions.

2. ARCHITECTURE FOR MULTIDIMENSIONAL DWT

When the m-D wavelet basis functions are separable, the m-D DWT can be divided into m 1-D operations, i.e., the row-column method. However, this separable approach requires extra, large memory to save data that must be transposed for row (column) by column (row) processing. In our non-separable approach, the wavelet basis functions are still separable. For convenience of description, assume that the wavelet basis functions are identical in each dimension. Let h_i (i = 0, 1, ⋯, K_h − 1) and g_i (i = 0, 1, ⋯, K_g − 1) denote, respectively, the coefficients of the low-pass and high-pass filter bases. Let ll⋯l^j_{n_1,n_2,⋯,n_m} denote the coefficients of the low-low-⋯-low subband produced at decomposition level j, and so on. Then the mathematical formulas for the multidimensional DWT can be defined as follows.

ll\cdots l^{j}_{n_1,n_2,\cdots,n_m} = \sum_{i_1=0}^{K_h-1} \sum_{i_2=0}^{K_h-1} \cdots \sum_{i_m=0}^{K_h-1} h_{i_1} h_{i_2} \cdots h_{i_m}\, ll\cdots l^{j-1}_{2n_1-i_1,\,2n_2-i_2,\,\cdots,\,2n_m-i_m}    (1)

ll\cdots h^{j}_{n_1,n_2,\cdots,n_m} = \sum_{i_1=0}^{K_h-1} \sum_{i_2=0}^{K_h-1} \cdots \sum_{i_m=0}^{K_g-1} h_{i_1} h_{i_2} \cdots g_{i_m}\, ll\cdots l^{j-1}_{2n_1-i_1,\,2n_2-i_2,\,\cdots,\,2n_m-i_m}    (2)

⋮

0-7803-7965-9/03/$17.00 ©2003 IEEE    ICME 2003

hh\cdots h^{j}_{n_1,n_2,\cdots,n_m} = \sum_{i_1=0}^{K_g-1} \sum_{i_2=0}^{K_g-1} \cdots \sum_{i_m=0}^{K_g-1} g_{i_1} g_{i_2} \cdots g_{i_m}\, ll\cdots l^{j-1}_{2n_1-i_1,\,2n_2-i_2,\,\cdots,\,2n_m-i_m}    (3)
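Equations (1)-(3) can be sanity-checked with a direct, unoptimized evaluation. The sketch below is not the paper's architecture, only a numerical check: it assumes m = 2, normalized Haar filters, and periodic border handling; the function names are our own.

```python
# Direct evaluation of Eqs. (1)-(3) for m = 2: each subband sample is
# an m-fold sum of filter-coefficient products over the previous-level
# ll...l data. Periodic indexing is an assumed border treatment.

def dwt2_level(x, h, g):
    """One non-separable 2-D DWT level; returns (ll, lh, hl, hh)."""
    N1, N2 = len(x), len(x[0])
    M1, M2 = N1 // 2, N2 // 2

    def band(f1, f2):
        out = [[0.0] * M2 for _ in range(M1)]
        for n1 in range(M1):
            for n2 in range(M2):
                s = 0.0
                for i1, c1 in enumerate(f1):
                    for i2, c2 in enumerate(f2):
                        # the 2n - i index of Eqs. (1)-(3), wrapped
                        s += c1 * c2 * x[(2 * n1 - i1) % N1][(2 * n2 - i2) % N2]
                out[n1][n2] = s
        return out

    return band(h, h), band(h, g), band(g, h), band(g, g)

# Haar low/high filters (K_h = K_g = 2) on a tiny 2x2 image
h = [0.5, 0.5]
g = [0.5, -0.5]
ll, lh, hl, hh = dwt2_level([[1.0, 2.0], [3.0, 4.0]], h, g)
```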

In a similar way, we adapt Wu's approach [9] to design our architecture for the m-D DWT, as illustrated in Fig. 1. The proposed architecture includes a multiplexer, a RAM module and 2^m subband filters. The RAM module saves the jth-level low-low-⋯-low subband data for performing the next-level decomposition. The multiplexer selects the proper data for decomposition. Only when performing the first-level decomposition does the multiplexer select the original input data; otherwise it selects data from the RAM module. This procedure repeats until the desired level is finished.
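This feedback loop can be sketched behaviorally. In the sketch below, `dwt_level` is a placeholder standing in for the bank of 2^m subband filters, and all names are assumptions for illustration, not part of the paper:

```python
# Fig. 1 behavior: level 1 reads the original input; every later
# level reads the LL...L data held back in the RAM module.

def decompose(x, levels, dwt_level):
    """Run `levels` decompositions; dwt_level returns (ll, *others)."""
    ram = None
    detail_subbands = []
    for j in range(levels):
        src = x if j == 0 else ram   # the multiplexer's choice
        bands = dwt_level(src)
        ram = bands[0]               # save LL...L for the next level
        detail_subbands.append(bands[1:])
    return ram, detail_subbands

# toy "filter": split a 1-D list into halves (stand-in for real subbands)
halve = lambda s: (s[:len(s) // 2], s[len(s) // 2:])
final_ll, details = decompose(list(range(8)), 2, halve)
```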

[Fig. 1 (schematic): the input and the RAM output feed a multiplexer (MUX); the selected stream ll⋯l^{j−1}_{n_1,n_2,⋯,n_m} drives the 2^m subband filters LL⋯L, LL⋯H, …, HH⋯H, each followed by subsampling by 2 in each dimension; the LL⋯L output ll⋯l^{j}_{n_1,n_2,⋯,n_m} is written back to the RAM.]

Fig. 1. Architecture for m-D DWT.

The approach to designing the circuits for each subband filter based on a systolic array is similar, so we choose the LL⋯L filter as an example. Since the parallel-pipeline structure is one of the main characteristics of a systolic array, we give a new definition

x_{n_1} = x_{n_2}(M)    (4)

where M denotes the latency from x_{n_1} to x_{n_2} in the given pipelined data stream {x_n} (n = 0, 1, ⋯). If the data stream pipelines one sample per clock cycle, then M = n_2 − n_1 clock cycles.

Define

A_{j_1,j_2,\cdots,j_m} = \sum_{\substack{i_1=j_1,\,j_1+2,\,\ldots \\ i_1<K_h}} h_{i_1} \sum_{\substack{i_2=j_2,\,j_2+2,\,\ldots \\ i_2<K_h}} h_{i_2} \cdots \sum_{\substack{i_m=j_m,\,j_m+2,\,\ldots \\ i_m<K_h}} h_{i_m}\; ll\cdots l^{j-1}_{2n_1-i_1,\,2n_2-i_2,\,\cdots,\,2n_m-i_m}    (5)

From (1) and (5), we have

ll\cdots l^{j}_{n_1,n_2,\cdots,n_m} = A_{1,1,\cdots,1} + A_{1,1,\cdots,0} + \cdots + A_{0,0,\cdots,0}    (6)

The right-hand side of equation (6) includes 2^m independent parts, and each part is an m-dimensional convolution. Thus we can run the input at 2^m times the output clock rate and get 2^m parallel data streams:

{ll\cdots l^{j-1}_{2n_1,\,2n_2,\,\cdots,\,2n_m}}
{ll\cdots l^{j-1}_{2n_1,\,2n_2,\,\cdots,\,2n_m+1}}
⋮
{ll\cdots l^{j-1}_{2n_1+1,\,2n_2+1,\,\cdots,\,2n_m+1}}
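These streams are simply the polyphase components of the previous-level ll⋯l subband, one per parity pattern of the m indices. A minimal sketch for m = 2 (the function name is an assumption for illustration):

```python
# Split a 2-D array into its 2^m = 4 polyphase components: component
# (p1, p2) holds the samples whose two indices have parities p1, p2.

from itertools import product

def polyphase_streams(x):
    """Return {(p1, p2): subarray of samples x[2a+p1][2b+p2]}."""
    return {
        (p1, p2): [row[p2::2] for row in x[p1::2]]
        for p1, p2 in product((0, 1), repeat=2)
    }

s = polyphase_streams([[0, 1], [2, 3]])
```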

Assume that the input image is of size N_1 × N_2 × ⋯ × N_m, so in each data stream, 0 ≤ n_1 < N_1/2^j, 0 ≤ n_2 < N_2/2^j, ⋯, 0 ≤ n_m < N_m/2^j.

From the definition (4), we have

ll\cdots l^{j-1}_{2n_1,\,2n_2,\,\cdots,\,2n_m} = ll\cdots l^{j-1}_{2n_1,\,2n_2,\,\cdots,\,2l_m}\,(l_m - n_m)    (7)

ll\cdots l^{j-1}_{2n_1,\,2n_2,\,\cdots,\,2n_{m-1},\,2n_m} = ll\cdots l^{j-1}_{2n_1,\,2n_2,\,\cdots,\,2l_{m-1},\,2n_m}\,\big((l_{m-1}-n_{m-1})N_m/2^j\big)    (8)

⋮

ll\cdots l^{j-1}_{2n_1,\,2n_2,\,\cdots,\,2n_m} = ll\cdots l^{j-1}_{2l_1,\,2n_2,\,\cdots,\,2n_m}\,\big((l_1-n_1)N_m N_{m-1}\cdots N_2/2^{j(m-1)}\big)    (9)
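Equations (7)-(9) simply read latencies off the raster-scan ordering of each stream: advancing the last index by one costs one cc, the next-to-last index costs N_m/2^j ccs, and so on. A small numerical check (the helper function is an assumption, not part of the paper):

```python
# Latency between two samples of a raster-ordered stream, as used in
# Eqs. (7)-(9): flatten each multi-index, then subtract.

def latency(idx_a, idx_b, dims):
    """Clock cycles from idx_a to idx_b at one sample per cc."""
    def flat(idx):
        f = 0
        for i, d in zip(idx, dims):
            f = f * d + i
        return f
    return flat(idx_b) - flat(idx_a)

dims = (8, 8)  # stream extents N_1/2^j x N_2/2^j (example values)
```

With these extents, stepping the last index costs 1 cc and stepping the first index costs N_2/2^j = 8 ccs, as (7) and (9) predict for m = 2.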

For k_1, k_2, ⋯, k_m = 0, 1, define

P_{k_1} = h_{k_1} \sum_{i_2=0}^{K_h-1} \cdots \sum_{i_m=0}^{K_h-1} h_{i_2} \cdots h_{i_m}\, ll\cdots l^{j-1}_{2n_1-k_1,\,2n_2-i_2,\,\cdots,\,2n_m-i_m}    (10)

then

ll\cdots l^{j}_{n_1,n_2,\cdots,n_m} = \sum_{i=0}^{i<(K_h-1)/2} \frac{h_{2i+1}}{h_1}\, P_1\big(i N_m N_{m-1}\cdots N_2/2^{j(m-1)}\big) + \sum_{i=0}^{i<K_h/2} \frac{h_{2i}}{h_0}\, P_0\big(i N_m N_{m-1}\cdots N_2/2^{j(m-1)}\big)    (11)
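For one output sample, the recombination in (11) is a pair of tap-weighted sums over delayed partial products. A scalar sketch (all names are our own; `P0` and `P1` hold the already-computed partial sums at successive delays, and the divisions by h_0 and h_1 mirror the ratio multipliers h_2/h_0, h_3/h_1, h_4/h_0 visible in Fig. 2):

```python
# Odd-tap / even-tap recombination of Eq. (11) for one output sample.

def combine(h, P0, P1):
    """h: low-pass taps; P0[i], P1[i]: P_0, P_1 delayed by i blocks."""
    odd = sum(h[2 * i + 1] / h[1] * P1[i]
              for i in range((len(h) - 1) // 2))
    even = sum(h[2 * i] / h[0] * P0[i]
               for i in range((len(h) + 1) // 2))
    return odd + even

# K_h = 5 example: taps chosen only to make the arithmetic visible
out = combine([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 1.0], [1.0, 1.0])
```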

Define

P_{k_1,k_2} = h_{k_1} h_{k_2} \sum_{i_3=0}^{K_h-1} \cdots \sum_{i_m=0}^{K_h-1} h_{i_3} \cdots h_{i_m}\, ll\cdots l^{j-1}_{2n_1-k_1,\,2n_2-k_2,\,\cdots,\,2n_m-i_m}    (12)

then

P_{k_1} = \sum_{i=0}^{i<(K_h-1)/2} \frac{h_{2i+1}}{h_1}\, P_{k_1,1}\big(i N_m N_{m-1}\cdots N_3/2^{j(m-2)}\big) + \sum_{i=0}^{i<K_h/2} \frac{h_{2i}}{h_0}\, P_{k_1,0}\big(i N_m N_{m-1}\cdots N_3/2^{j(m-2)}\big)    (13)

⋮

Define

P_{k_1,k_2,\cdots,k_{m-1}} = h_{k_1} h_{k_2} \cdots h_{k_{m-1}} \sum_{i_m=0}^{K_h-1} h_{i_m}\, ll\cdots l^{j-1}_{2n_1-k_1,\,2n_2-k_2,\,\cdots,\,2n_{m-1}-k_{m-1},\,2n_m-i_m}    (14)

then

P_{k_1,k_2,\cdots,k_{m-2}} = \sum_{i=0}^{i<(K_h-1)/2} \frac{h_{2i+1}}{h_1}\, P_{k_1,k_2,\cdots,k_{m-2},1}\big(i N_m/2^j\big) + \sum_{i=0}^{i<K_h/2} \frac{h_{2i}}{h_0}\, P_{k_1,k_2,\cdots,k_{m-2},0}\big(i N_m/2^j\big)    (15)


and

P_{k_1,k_2,\cdots,k_{m-1}} = \sum_{i=0}^{i<(K_h-1)/2} h_{k_1} h_{k_2} \cdots h_{k_{m-1}} h_{2i+1}\, ll\cdots l^{j-1}_{2n_1-k_1,\,2n_2-k_2,\,\cdots,\,2n_{m-1}-k_{m-1},\,2n_m}(i) + \sum_{i=0}^{i<K_h/2} h_{k_1} h_{k_2} \cdots h_{k_{m-1}} h_{2i}\, ll\cdots l^{j-1}_{2n_1-k_1,\,2n_2-k_2,\,\cdots,\,2n_{m-1}-k_{m-1},\,2n_m+1}(i)    (16)

The latency between any two data in the data stream can be implemented using shift registers. According to (11), (13), (15) and (16), we can design a parallel-pipeline architecture for the implementation of the LL⋯L filter by decomposing ll⋯l^j_{n_1,n_2,⋯,n_m} step by step. Assuming K_h = 5, the corresponding circuits for (15) and (16) are illustrated in Fig. 2 and Fig. 3, respectively. The delay-block in Fig. 2 delays data by N_m/2^j clock cycles (ccs), and it can be implemented by pushing the data through an SRAM module of size N_m/2^j. Fig. 4 shows the architecture for the LL⋯L filter. Each pair of arrows in Fig. 4 is implemented by a specific circuit (e.g., the circuit in Fig. 2 or Fig. 3).
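The delay-block behavior can be modeled as a fixed-length delay line, matching the x_{n_1} = x_{n_2}(M) notation of (4). The class below is our own sketch of that behavior, not the paper's circuit:

```python
# A D-cycle delay line: a sample pushed in emerges D clock cycles
# later, as a shift register or small SRAM would provide.

from collections import deque

class DelayBlock:
    def __init__(self, d):
        self.q = deque([0.0] * d, maxlen=d)  # pipeline starts empty (zeros)

    def step(self, x):
        y = self.q[0]     # oldest sample leaves...
        self.q.append(x)  # ...as the newest one enters
        return y

d = DelayBlock(3)
out = [d.step(x) for x in [1, 2, 3, 4, 5]]
```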

[Fig. 2 (schematic): the inputs P_{k_1,k_2,⋯,k_{m-2},0} and P_{k_1,k_2,⋯,k_{m-2},1} pass through chains of N_m/2^j delay-blocks; the tapped samples are scaled by the multipliers h_2/h_0, h_3/h_1 and h_4/h_0 and accumulated by adders into the output P_{k_1,k_2,⋯,k_{m-2}}.]

Fig. 2. Circuit for (15). In this example, K_h = 5.

The number of multipliers required in the architecture in Fig. 4 is

2^{m-1} K_h + (2^{m-2} + 2^{m-3} + \cdots + 1)(K_h - 2) = (2^m - 1)K_h - 2^m + 2    (17)

and the number of adders is

(2^{m-1} + 2^{m-2} + \cdots + 1)(K_h - 1) = (2^m - 1)(K_h - 1)    (18)
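The closed forms in (17) and (18) follow from the geometric sums 2^{m-2} + ⋯ + 1 = 2^{m-1} − 1 and 2^{m-1} + ⋯ + 1 = 2^m − 1. A quick numerical check of both identities (function names are our own):

```python
# Verify the algebraic identities of Eqs. (17) and (18) over a range
# of dimensions m and filter lengths K_h.

def mult_sum(m, kh):       # left-hand side of Eq. (17)
    return 2 ** (m - 1) * kh + (2 ** (m - 1) - 1) * (kh - 2)

def mult_closed(m, kh):    # right-hand side of Eq. (17)
    return (2 ** m - 1) * kh - 2 ** m + 2

def adders_sum(m, kh):     # left-hand side of Eq. (18)
    return sum(2 ** p for p in range(m)) * (kh - 1)

def adders_closed(m, kh):  # right-hand side of Eq. (18)
    return (2 ** m - 1) * (kh - 1)
```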

Besides, the storage size required for the delay-blocks in this architecture is

\big(2^{m-2} N_m/2^j + 2^{m-3} N_m N_{m-1}/2^{2j} + \cdots + N_m N_{m-1}\cdots N_2/2^{j(m-1)}\big)(K_h - 2)    (19)

In the same way, we can design the other subband filters. It is easy to see that the total number of multipliers required for the architectures of all 2^m filters is

2^{m-1}\big((2^m - 1)(K_h + K_g) - 2^{m+1} + 4\big)    (20)

[Fig. 3 (schematic): the two polyphase streams ll⋯l^{j−1}_{2n_1−k_1,⋯,2n_{m−1}−k_{m−1},2n_m} and ll⋯l^{j−1}_{2n_1−k_1,⋯,2n_{m−1}−k_{m−1},2n_m+1} pass through register chains; the taps are scaled by the multipliers h_{k_1}h_{k_2}⋯h_{k_{m−1}}h_0 through h_{k_1}h_{k_2}⋯h_{k_{m−1}}h_4 and accumulated by adders into P_{k_1,k_2,⋯,k_{m−1}}.]

Fig. 3. Circuit for (16). In this example, K_h = 5.

the total number of adders is

2^{m-1}(2^m - 1)(K_h + K_g - 2)    (21)
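For m = 2 and K_h = K_g = K, (20) and (21) reduce to the 12K − 8 multipliers and 12(K − 1) adders quoted for our architecture in Table 1. A quick consistency check (function names are our own):

```python
# Total resource counts over all 2^m subband filters, Eqs. (20)-(21).

def total_multipliers(m, kh, kg):
    return 2 ** (m - 1) * ((2 ** m - 1) * (kh + kg) - 2 ** (m + 1) + 4)

def total_adders(m, kh, kg):
    return 2 ** (m - 1) * (2 ** m - 1) * (kh + kg - 2)
```

Setting m = 2 and kh = kg = K reproduces the 2-D figures of Table 1.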

and the total storage size required for the delay-blocks at decomposition level j is

2^{m-1}\big(2^{m-2} N_m/2^j + 2^{m-3} N_m N_{m-1}/2^{2j} + \cdots + N_m N_{m-1}\cdots N_2/2^{j(m-1)}\big)(K_h + K_g - 4)    (22)

All filters can share some registers, so the number of registers required in Fig. 1 is max{2^{m-1} K_h, 2^{m-1} K_g}.

Obviously, the first-level decomposition of the proposed architecture requires N_1 N_2 ⋯ N_m / 2^m clock cycles (ccs). Since the number of ccs at each level is 1/2^m of that at the previous level, the total number of ccs required for the decomposition to the Jth level is

\sum_{i=1}^{J} N_1 N_2 \cdots N_m / 2^{mi} = (1 - 2^{-Jm})\, N_1 N_2 \cdots N_m / (2^m - 1)    (23)

If J is large enough, (23) will converge to

N_1 N_2 \cdots N_m / (2^m - 1)    (24)
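The geometric-series sum of (23) and its limit (24) can be checked numerically (the function name and the 512 × 512 example are our own choices):

```python
# Total clock cycles for a J-level m-D decomposition, Eq. (23),
# and its large-J limit N_1 N_2 ... N_m / (2^m - 1), Eq. (24).

def total_ccs(dims, J):
    prod = 1
    for n in dims:
        prod *= n
    m = len(dims)
    return sum(prod / 2 ** (m * i) for i in range(1, J + 1))

dims = (512, 512)                    # a 2-D example
limit = 512 * 512 / (2 ** 2 - 1)     # Eq. (24): about N^2 / 3
```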

During the second and following level decompositions, the storage used in the first-level decomposition can be partly reused, so the hardware cost is determined by the first-level decomposition. A storage size of N_1 N_2 ⋯ N_m / 2^m for the RAM module in Fig. 1 is enough to save the LL⋯L subband data for the next-level decomposition.

3. PERFORMANCE AND COMPARISONS

For the purpose of comparison with other architectures, we choose m = 2 and K_h = K_g = K (filter length). Table 1 compares the performance of our architecture with other architectures in terms of computing time, the number of multipliers, the number of adders, storage size and control complexity. The parameter N^2 denotes the input image size and J denotes the 2-D DWT level. From the comparison data, we know that the efficiency of our architecture

[Fig. 4 (schematic): a tree of circuits combines, stage by stage, the 2^m polyphase input streams ll⋯l^{j−1}_{2n_1,2n_2,⋯,2n_m} through ll⋯l^{j−1}_{2n_1+1,2n_2+1,⋯,2n_m+1} into the 2^{m−1} partial sums P_{0,0,⋯,0,0}, …, P_{1,1,⋯,1,1}, then into P_{0,0}, P_{0,1}, P_{1,0}, P_{1,1}, then into P_0 and P_1, and finally into the output ll⋯l^j_{n_1,n_2,⋯,n_m}.]

Fig. 4. Architecture for LL⋯L filter. Each pair of arrows is implemented by a specific circuit (e.g., the circuit in Fig. 2 or Fig. 3).

Table 1. Performance Comparisons of Several 2-D DWT Architectures. (K: filter length, N^2: image size, J: 2-D DWT level)

Architecture           | Computing Time (ccs) | Multipliers | Adders     | Storage Size            | Control Complexity
-----------------------|----------------------|-------------|------------|-------------------------|-------------------
Ours                   | ≈ N^2/3              | 12K − 8     | 12(K − 1)  | N^2/4 + 2N(K − 2) + 2K  | Simple
Wu's [9]               | ≈ 2N^2/3             | 4K          | 4K         | N^2/4 + KN + K          | Moderate
Quadri-Filter [10]     | ≈ 2N^2/3             | 2K^2        | 2(K^2 − 1) | 2KN                     | Moderate
Non-Separable [5]      | N^2                  | 2K^2        | 2(K^2 − 1) | 2KN                     | Complex
SIMD [5]               | K^2 J                | 2N^2        | 2N^2       | N^2                     | Complex
Systolic-Parallel [4]  | N^2                  | 4K          | 4K         | 2KN + 4N                | Complex

is slightly lower than Wu's but obviously better than the other architectures. We should mention that Wu's approach is only suitable for the two-dimensional case. The RAM module of size N^2/4 is not necessary if we use already existing memory in the system to save the low-low subband data for the next-level decomposition [9]. The number of registers required for the proposed architecture is also included in the storage size. Besides its regularity, our architecture includes only very simple processing elements (PEs), so the control complexity of our architecture is lower than that of the others.

4. CONCLUSIONS

This paper proposes a novel VLSI architecture for the multidimensional DWT. The comparative evaluations provided for the two-dimensional case show several advantages of our architecture. Besides, we give a detailed architecture for the multidimensional DWT, which is not discussed in other literature.

5. REFERENCES

[1] S. G. Mallat, "A theory for multiresolution signal decomposition: The wavelet representation," IEEE Trans. Pattern Anal. Machine Intell., vol. 11, pp. 674-693, July 1989.

[2] A. S. Lewis and G. Knowles, "VLSI architecture for 2D Daubechies wavelet transform without multipliers," Electron. Lett., vol. 27, pp. 171-173, Jan. 1991.

[3] K. K. Parhi and T. Nishitani, "VLSI architectures for discrete wavelet transforms," IEEE Trans. VLSI Syst., vol. 1, pp. 191-202, June 1993.

[4] M. Vishwanath, R. M. Owens, and M. J. Irwin, "VLSI architectures for the discrete wavelet transform," IEEE Trans. Circuits Syst. II, vol. 42, pp. 305-316, May 1995.

[5] C. Chakrabarti and M. Vishwanath, "Efficient realizations of the discrete and continuous wavelet transforms: From single chip implementations to mappings on SIMD array computers," IEEE Trans. Signal Processing, vol. 43, pp. 759-771, Mar. 1995.

[6] H. Y. H. Chuang and L. Chen, "VLSI architecture for fast 2-D discrete orthogonal wavelet transform," J. VLSI Signal Processing, vol. 10, pp. 225-236, Aug. 1995.

[7] J. Chen and M. A. Bayoumi, "A scalable systolic array architecture for 2-D discrete wavelet transforms," in Proc. IEEE VLSI Signal Processing Workshop, 1995, pp. 303-312.

[8] C. Chakrabarti and C. Mumford, "Efficient realizations of analysis and synthesis filters based on the 2-D discrete wavelet transform," in Proc. IEEE ICASSP, May 1996, pp. 3256-3259.

[9] P. C. Wu and L. G. Chen, "An efficient architecture for two-dimensional discrete wavelet transform," IEEE Trans. Circuits Syst. Video Technol., vol. 11, pp. 536-545, Apr. 2001.

[10] F. Marino, "Two fast architectures for the direct 2-D discrete wavelet transform," IEEE Trans. Signal Processing, vol. 49, pp. 1248-1259, June 2001.
