VLSI Architecture for Low Power Motion Estimation Using High Data Access Reuse


Bo-Sung Kim and Jun-Dong Cho
VLSI Algorithmic Design Automation Lab
SungKyunKwan University
300, Chunchun-dong, Changan-gu, Suwon, Kyunggi-do, Korea.
Phone: (0331)290-7127
Fax: (0331)290-7127
e-mail: {kdream, jdcho}@{nature, yurim}.skku.ac.kr

Presenting part: number 2
Digital Circuits



























1. Introduction

This paper presents a new VLSI architecture for motion estimation in MPEG-2. Various full-search block-matching algorithms (BMA) and systolic-array architectures have been proposed for motion estimation. The BMA divides the image into square blocks and compares each block in the current frame (reference block) with those within a reduced area of the previous frame (search area), looking for the most similar one, as shown in Figure 1. This matching procedure is carried out by determining the optimum of the selected cost function. During the matching procedure, whenever a current block in a frame moves to the next block, previously searched data are repeatedly accessed to generate a motion vector. This method is inefficient due to the excessive data access. Therefore, the problem in generating motion vectors is how to remove the unnecessary search-data accesses. Recently, to reduce the number of accesses, [1] used a so-called cascade method within a search area to reuse the search data using multiple processing elements. To further reduce the data access and computation time during block matching, in this paper we propose a new approach based on reuse of the previously searched data. The difference between our algorithm and that of [1] is that we reuse the part of the search area shared between two consecutive search areas, while [1] reuses the part of the reference block shared between two consecutive reference blocks.


2. Our algorithm compared with the previous algorithms


The cost function for motion estimation in MPEG-2 is the Minimum Mean Absolute Error (MAE). Because it is computationally simpler than the MSE (Mean Square Error), requiring no hardware multiplier, the MAE is the most widely adopted cost function for motion estimation. The cost function is presented in (1), where N is the block size in a frame, the x's are the pels in the reference block, and the y's are the pels within the search area. left_search, right_search, up_search, and down_search define the search range of the candidate block.


S(m,n) = \sum_{i=1}^{N} \sum_{k=1}^{N} | x(i,k) - y(i+m, k+n) |    (1)


where left_search ≤ m ≤ right_search and down_search ≤ n ≤ up_search.

u = \min_{(m,n)} \{ S(m,n) \},  v = (m,n)_u    (2)

The sum S(m,n) in (1) of the absolute differences between corresponding pels of the reference-block data x(i,k) in the current frame and the candidate-block data y(i+m,k+n) of the previous frame is computed for each candidate block. The minimum error u in (2), over all sums S(m,n) within a search area, denotes the position (m,n)_u of the best-fitting candidate block, which provides the displacement vector v in (2). In this method, the following algorithm is executed for each reference block. Here m is the number of row pels in the search range of the previous block, and n is the number of column pels in the search range of the previous block.
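The full-search procedure defined by (1) and (2) can be sketched as a behavioral model in Python (a sketch only, not the paper's hardware; the function and variable names here are our own):

```python
def sad(ref, cand):
    # Eq. (1): sum of absolute differences between two N x N blocks
    return sum(abs(x - y)
               for ref_row, cand_row in zip(ref, cand)
               for x, y in zip(ref_row, cand_row))

def full_search(ref_block, prev_frame, bx, by, search):
    # Eq. (2): find the displacement (m, n) minimizing S(m, n)
    # within +/- `search` pels of block position (bx, by)
    N = len(ref_block)
    rows, cols = len(prev_frame), len(prev_frame[0])
    best_err, best_mv = None, None
    for m in range(-search, search + 1):
        for n in range(-search, search + 1):
            x0, y0 = bx + m, by + n          # candidate top-left corner
            if x0 < 0 or y0 < 0 or x0 + N > rows or y0 + N > cols:
                continue                     # candidate falls outside frame
            cand = [row[y0:y0 + N] for row in prev_frame[x0:x0 + N]]
            err = sad(ref_block, cand)
            if best_err is None or err < best_err:
                best_err, best_mv = err, (m, n)
    return best_err, best_mv                 # minimum error u, vector v
```

When the reference block is an exact copy of a displaced region of the previous frame, the search returns error u = 0 at that displacement.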


Figure 1 An instance of the general block-matching algorithm [1]


Next, let us consider the new search with new reference blocks in a frame. The number of reference blocks in a column of a frame is column_pel_number_of_frame / column_pel_number_of_current_block, and the number of rows of a frame is row_pel_number_of_frame / row_pel_number_of_current_block. Then the overall motion-estimation algorithm for an entire frame is as follows.


PREVIOUS ALGORITHM

for column := 1 to column_number do
{
    left_search := left_search + N
    right_search := right_search + N
    for row := 1 to row_number do
    {
        up_search := up_search + N
        down_search := down_search + N
        Function(left_search, right_search, up_search, down_search)
        /* computation between one current block and one candidate block */
    }
}

Figure 2 Previous algorithm [1]


In this previous method, as in Figure 2, search data is overlapped between the previous search block and the next current search block. Practically, in this case, the number of overlapped pels is (up_search + down_search) * (N + left_search + right_search) + N * (left_search + right_search) per reference block.

Our approach reduces the number of references by reusing the previously accessed data. During computation in a column, the number of data overlaps decreases by reusing the up_search and down_search areas.
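The two per-block overlap counts can be compared directly with a small sketch (function names are ours; the frame totals in Table 1 additionally depend on the number of blocks per frame and on boundary blocks, which reuse less):

```python
def overlap_previous(N, left, right, up, down):
    # overlapped pels per reference block in the previous method [1]
    return (up + down) * (N + left + right) + N * (left + right)

def overlap_ours(N, left, right):
    # with the up/down search areas reused, only the horizontal
    # strip of N rows is still re-accessed
    return N * (left + right)

# first parameter set of Table 1: N=4, left=4, right=3, up=4, down=3
print(overlap_previous(4, 4, 3, 4, 3))  # 105 pels per block
print(overlap_ours(4, 4, 3))            # 28 pels per block
```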

OUR ALGORITHM

for column := 1 to column_number do
    left_search := left_search + N
    right_search := right_search + N
    for row := 1 to row_number do
        Function(left_search, right_search, up_search, down_search)
        /* computation between one current block and one candidate block */
        Function(left_search, right_search, up_search, down_search)
        /* computation between the next current block and one candidate block */
    end
end


[Figure: search blocks of reference block A and next reference block B, with their overlapped area]

Figure 3 Our algorithm



Practically, the number of overlapped pels in our architecture is N * (left_search + right_search) per reference block. Table 1 shows a simulation using the C language.



Table 1 Comparison of the number of overlapped pels between [1] and ours (144 x 176 pel/frame)

Parameter sets (reference block / up-search / down-search / right-search / left-search):
  A:  4 /  4 /  3 /  3 /  4
  B:  8 /  8 /  7 /  7 /  8
  C: 16 / 16 / 15 / 15 / 16

Architecture                      A          B          C
Previous architecture [1]     161,847    174,090    173,538
Our architecture               44,247     47,565     49,290




A parallel architecture in [1] uses a cascade method. If two chips are cascaded, the number of overlapped pels is halved. With this in mind, if our chip were cascaded by a factor of frame/N, the number of overlapped pels would be zero.

[Figure: reused areas, (left_search + right_search + N) wide and (up_search + down_search) high, between an N-by-N reference block and the next reference block]

Our architecture contains N^2 modified processing elements (PE), each composed of an additional subtracter and one accumulator. If the search range, for example left_search, is N, then the number of required subtracters is two. If the search range is more than N and less than twice N, then the number of required subtracters is three or four. If the search range is more than twice N, then the number of required subtracters is more than four. One subtracter of a PE computes on the first search candidate block and the first reference block, while another subtracter computes on the next search candidate block and the next reference block. The latter subtracter reuses the data of the search area. Therefore, the computation time is reduced and the number of data accesses decreases greatly. Furthermore, the size of the address-generating block becomes small, and the number of external pins is decreased, because this architecture reuses search data. The proposed architecture based on this reuse scheme is shown in Figure 4. In the right part of Figure 4, the left subtracter receives the reference data of the right subtracter under a proper control signal. At the same time, the right unit receives new reference data. In this way, the number of search-data accesses is greatly decreased.


The PE array in Figure 4 computes search data in parallel and transfers the sum of the front PE's output and the current subtracter's output to the next PE. This structure is a systolic array. The shift register moves new search data to the first PE after every N clocks. The number of search-data accesses decreases by using this method.
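A functional (untimed) model of the PE chain described above might look as follows: each PE adds its own absolute difference to the partial sum arriving from the PE in front of it, so the SAD emerges from the last PE. The names are ours, and the clock-level pipelining of the real systolic array is abstracted away:

```python
def pe_chain_sad(ref_pels, search_pels):
    # Each PE i holds one reference pel; the partial sum ripples through
    # the chain, with PE i adding |ref_pels[i] - search_pels[i]|.
    partial = 0                      # sum entering the first PE
    for x, y in zip(ref_pels, search_pels):
        partial += abs(x - y)        # one PE: subtracter + accumulator
    return partial                   # output of the last PE
```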


In a hardware implementation, the number of gates increases in the presented architecture. Table 2 shows the comparison of the number of gates for a 4x4 reference block between our algorithm and [1].




Table 2 The number of gates for the computational block

Architecture                   Number of gates in computation block
Previous architecture [1]      321
Our architecture               697

In the computation block, the number of gates in our architecture is increased, but the number of gates in the address-generating block is decreased because the data flow is very simple.

As a result, the total number of gates increases slightly, but the speed, the number of data accesses, and the number of external access pins are significantly improved. Because of the decreased data accesses and external pins, the chip consumes lower power. Because of the speed enhancement and cascadability, the chip is suitable for HDTV (High Definition TV).


Reference

[1] Cesar Sanz, Matias J. Garrido, Juan M. Meneses, "VLSI Architecture for Motion Estimation using the Block-Matching Algorithm".