
Scalable Video Coding Working Draft 1

Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG
(ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6)
14th Meeting: Hong Kong, CN, 17-21 January, 2005

Document: JVT-N021
Filename: JVT-N021.doc


Title:      Joint Scalable Video Model JSVM 0

Status:     Output Document of JVT

Purpose:    Information

Editor(s)/Contact(s):

  Julien Reichel
  VisioWave Switzerland,
  Rte. de la Pierre 22,
  CH-1032 Ecublens, Switzerland
  Tel:   +41 (21) 695-0041
  Fax:   +41 (21) 695-0001
  Email: julien.reichel@visiowave.com

  Heiko Schwarz
  Heinrich Hertz Institute (FhG),
  Einsteinufer 37,
  D-10587 Berlin, Germany
  Tel:   +49 (30) 31002-226
  Fax:   +49 (30) 39272-00
  Email: heiko.schwarz@hhi.fhg.de

  Mathias Wien
  Institut für Nachrichtentechnik,
  RWTH Aachen University,
  D-52056 Aachen, Germany
  Tel:   +49 (241) 80-27681
  Fax:   +49 (241) 80-22196
  Email: wien@ient.rwth-aachen.de

Source:     JVT

_____________________________


Note: This document contains the text of the MPEG-21 Scalable Video Model 3.0, document ISO/IEC JTC 1/SC 29/WG 11 N6716. It is provided as information on the status of the Scalable Video Coding effort before it became a work item for JVT.

Index

1 Glossary
2 Introduction
  2.1 Framework
    2.1.1 Framework implementation
    2.1.2 Motion-Compensated Temporal Filtering (MCTF)
    2.1.3 Base Layer Compatibility with AVC Main Profile
  2.2 Scalability Dimensions
    2.2.1 Temporal Scalability
    2.2.2 SNR (Quality) Scalability
    2.2.3 Spatial Scalability
    2.2.4 Combined Scalability
3 Encoder Description
  3.1 Intra Prediction
  3.2 Inter Prediction
  3.3 Inter-Layer Prediction
    3.3.1 Inter-Layer Intra Texture Prediction
    3.3.2 Inter-Layer Motion Prediction
  3.4 Motion Compensated Temporal Filtering
    3.4.1 Adaptive prediction/update steps
  3.5 Residual Coding
    3.5.1 4x4 and 8x8 Transforms
    3.5.2 Quantization and scaling
  3.6 Motion Coding
  3.7 Entropy Coding
    3.7.1 Quality base layer
    3.7.2 Quality enhancement layer
  3.8 Deblocking
    3.8.1 Deblocking filter process
4 Operational Encoder Control
  4.1 Scalability and Decomposition Structure
  4.2 Fine grain SNR scalability
  4.3 Motion estimation and mode decision process
  4.4 Coding of Subband Pictures
  4.5 Quantizer Selection
5 Syntax and Semantics
  5.1 Specification of functions and variables
    5.1.1 Specification of functions
    5.1.2 Specification of variables
  5.2 Syntax in tabular form
    5.2.1 NAL unit syntax
    5.2.2 Sequence parameter set RBSP syntax
    5.2.3 Slice layer in scalable extension RBSP syntax
    5.2.4 Slice header in scalable extension syntax
    5.2.5 Slice data in scalable extension syntax
    5.2.6 Macroblock layer in scalable extension syntax
    5.2.7 Progressive refinement slice data syntax in scalable extension
  5.3 Semantics
    5.3.1 NAL unit semantics
    5.3.2 Scalable slice header in scalable extension semantics
    5.3.3 Macroblock layer in scalable extension semantics
    5.3.4 Progressive refinement slice data semantics in scalable extension
6 Decoding Process
  6.1 Parsing Process
  6.2 Decoding process for prediction data
  6.3 Decoding process for subband pictures
  6.4 Decoding process of progressive refinements
  6.5 Reconstruction process of a group of pictures
    6.5.1 Inverse motion-compensated temporal filtering process
    6.5.2 Reconstruction of a set of low-pass pictures
    6.5.3 Reference list construction process
    6.5.4 General prediction process
    6.5.5 Derivation of prediction data for the update steps
    6.5.6 Deblocking filter process
7 References




1 Glossary

ABT    Adaptive Block size Transforms
AVC    ITU-T Recommendation H.264 | ISO/IEC MPEG-4 Part 10: Advanced Video Coding
CABAC  Context-based Adaptive Binary Arithmetic Coding
FGS    Fine Grained Scalability
GOP    Group of Pictures
IDR    Instantaneous Decoder Refresh
MC     Motion Compensation
MCTF   Motion-Compensated Temporal Filtering
ME     Motion Estimation
MMCO   Memory Management Control Operation
RPLR   Reference Picture List Reordering
UMCTF  Unconstrained MCTF
SNR    Signal-to-Noise Ratio
SVC    Scalable Video Codec (or 'Coding', depending on the context)
SVM    Scalable Video Model







2 Introduction

The SVM shall function as a guideline for the development of the future SVC reference software. Many basic building blocks of the SVM are related to MPEG-4 AVC. Further detail on these parts can be found in [1].

2.1 Framework

The SVM encoder is composed of the generic building blocks presented in Figure 1. The three main scalability aspects, i.e. temporal, spatial, and quality scalability, are controlled by the algorithms used to implement each of those building blocks.

- Temporal scalability: Mostly controlled by the temporal transform (or prediction). The texture and motion coding should not prevent it. This aspect is intrinsic if the texture and motion coding are restricted to the current frame.
- Spatial scalability: Mostly controlled by the pyramidal representation of the spatial scalability levels. The motion coding should be compatible with the spatial scalability for efficiency reasons.
- Quality scalability: Mostly controlled by the texture coding. The motion coding might also be compatible with quality scalability in order to optimize the trade-off between motion and texture coding.

Figure 1: Generic Encoder

2.1.1 Framework implementation

The overall structure of the encoder is presented in Figure 2. One of the main difficulties of this approach is caused by the spatial feedback which is necessary to reduce the redundancy of the transform. As both the encoder and the decoder must use the same prediction, the scalability features of the algorithm are reduced if decoded signals are used for the inter-scale prediction. On the other hand, if the whole range of scalability is exploited, the inter-scale prediction might be different on the encoder side from the decoder side. This will cause a "spatial" drift and degrade the quality of higher resolution videos.

Figure 2: Scalable codec using a multi-scale pyramid and a "2D+t" structure. Example with 3 levels of spatial scalability.

2.1.2 Motion-Compensated Temporal Filtering (MCTF)

Motion-Compensated Temporal Filtering (MCTF) is based on the lifting scheme [6]. The lifting scheme has two main advantages: it provides an efficient way to compute the wavelet transform (up to 4 times fewer operations than a direct implementation of the wavelet), and it ensures perfect reconstruction of the input in the absence of quantization of the wavelet coefficients. This last property holds even if non-linear operations are used during the lifting operation.

The generic lifting scheme consists of three types of operations: polyphase decomposition, prediction(s), and update(s). In most cases the MCTF is restricted to a special case of the lifting scheme with only one prediction and one update step. Figure 3 illustrates the lifting representation of an analysis-synthesis filter bank.

Figure 3: Lifting representation of an analysis-synthesis filter bank.

At the analysis side (a), the odd samples s[2k+1] of a given signal s are predicted by a linear combination of the even samples s[2k] using a prediction operator P(s[2k]), and a high-pass signal h[k] is formed by the prediction residuals. A corresponding low-pass signal l[k] is obtained by adding a linear combination of the prediction residuals h[k] to the even samples s[2k] of the input signal s using the update operator U(h[k]):

    h[k] = s[2k+1] - P(s[2k])
    l[k] = s[2k] + U(h[k])
Since both the prediction and the update step are fully invertible, the corresponding transform can be interpreted as a critically sampled perfect reconstruction filter bank. The synthesis filter bank simply consists of the application of the prediction and update operators in reverse order with inverted signs in the summation process, followed by the reconstruction process using the even and odd polyphase components. For a normalization of the low- and high-pass components, appropriately chosen scaling factors Fl and Fh are applied, respectively. However, in practice, these scaling factors are included into the quantization step sizes as described in Section 4.5.

Let s[x, k] be a video signal with the spatial coordinate x = (x, y)^T and the temporal coordinate k. The prediction and update operators for the temporal decomposition using the lifting representation of the Haar wavelet are given by

    P(s[2k]) = s[2k]
    U(h[k]) = h[k] / 2

For the 5/3 transform, the prediction and update operators are given by

    P(s[2k]) = ( s[2k] + s[2k+2] ) / 2
    U(h[k]) = ( h[k-1] + h[k] ) / 4
The extension to motion-compensated temporal filtering is realized by modifying the prediction and update operators as follows:

    Haar:  P(s[2k])[x] = s[x + m_P0, 2k - 2 r_P0]
           U(h[k])[x] = h[x + m_U0, k + r_U0] / 2

    5/3:   P(s[2k])[x] = ( s[x + m_P0, 2k - 2 r_P0] + s[x + m_P1, 2k + 2 + 2 r_P1] ) / 2
           U(h[k])[x] = ( h[x + m_U0, k - r_U0] + h[x + m_U1, k + 1 + r_U1] ) / 4

where the reference indices r ≥ 0 allow a general frame-adaptive motion-compensated filtering. The motion vectors m are not restricted to sample-accurate displacements. In case of sub-sample accurate motion vectors, the term s[x + m, k] has to be interpreted as a spatially interpolated value.

As can be seen from the above equations, both the prediction and update operators for the motion-compensated filtering using the lifting representation of the Haar wavelet are equivalent to uni-directional motion-compensated prediction. For the 5/3 wavelet, the prediction and update operators specify bi-directional motion-compensated prediction.
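The lifting steps above can be sketched in a few lines of Python. This is an illustrative 1-D toy without motion compensation, and the index clamping at the signal boundaries is a simplification, not the normative extension; it only demonstrates the perfect-reconstruction property of the prediction/update structure.

```python
# Illustrative 1-D sketch of the Haar and 5/3 lifting steps (no motion
# compensation; simple boundary clamping instead of the normative handling).

def analyze_53(s):
    """5/3 lifting analysis of an even-length signal s -> (l, h)."""
    n = len(s) // 2
    even = lambda k: s[2 * min(max(k, 0), n - 1)]   # clamped even sample
    # prediction: h[k] = s[2k+1] - (s[2k] + s[2k+2]) / 2
    h = [s[2 * k + 1] - (even(k) + even(k + 1)) / 2 for k in range(n)]
    # update: l[k] = s[2k] + (h[k-1] + h[k]) / 4
    l = [s[2 * k] + (h[max(k - 1, 0)] + h[k]) / 4 for k in range(n)]
    return l, h

def synthesize_53(l, h):
    """Inverse 5/3 lifting: same operators, reverse order, inverted signs."""
    n = len(l)
    s = [0.0] * (2 * n)
    for k in range(n):                              # undo the update step
        s[2 * k] = l[k] - (h[max(k - 1, 0)] + h[k]) / 4
    even = lambda k: s[2 * min(max(k, 0), n - 1)]
    for k in range(n):                              # undo the prediction step
        s[2 * k + 1] = h[k] + (even(k) + even(k + 1)) / 2
    return s

def analyze_haar(s):
    """Haar lifting: P(s[2k]) = s[2k], U(h[k]) = h[k] / 2."""
    h = [s[2 * k + 1] - s[2 * k] for k in range(len(s) // 2)]
    l = [s[2 * k] + h[k] / 2 for k in range(len(s) // 2)]
    return l, h

def synthesize_haar(l, h):
    """Inverse Haar lifting."""
    s = []
    for lk, hk in zip(l, h):
        even = lk - hk / 2
        s += [even, even + hk]
    return s

x = [3.0, 7.0, 4.0, 1.0, 8.0, 6.0, 2.0, 5.0]
for analyze, synthesize in ((analyze_53, synthesize_53),
                            (analyze_haar, synthesize_haar)):
    l, h = analyze(x)
    assert synthesize(l, h) == x                    # perfect reconstruction
```

The assertions hold exactly: the synthesis side recomputes the very same intermediate quantities as the analysis side, which is what makes the scheme invertible even with non-linear operators.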

Since bi-directional motion-compensated prediction generally reduces the energy of the prediction residual but increases the motion vector rate in comparison to uni-directional prediction, it is desirable to switch dynamically between uni- and bi-directional prediction, and thus between the lifting representations of the Haar and the 5/3 wavelet. The choice of the transform can be made on a macroblock basis and is signalled as part of the motion information as explained in Section 3.6.

2.1.3 Base Layer Compatibility with AVC Main Profile

Base layer compatibility with the AVC Main Profile is achieved by using a representation with hierarchical B pictures at the lowest provided spatial resolution. With hierarchical B pictures, the temporal update step is omitted, leading to a purely predictive coding structure. This structure can be represented using the syntax of AVC.

The coding structure depicted in Figure 4 is employed. Similarly to the layers encoded with MCTF, the first picture is independently coded as an IDR picture, and all remaining pictures are coded in "B...BP" or "B...BI" groups of pictures using the concept of hierarchical B pictures. The coding order of the pictures inside a GOP is "A B1 B2 B2 B3 B3 B3 B3 ...".
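The dyadic coding order described above can be generated programmatically. The following Python sketch is illustrative (the function name and the (label, display-index) output convention are not part of the draft): the anchor A is coded first, then the B pictures level by level, each level sitting midway between already-coded pictures.

```python
# Sketch of the dyadic coding order for one GOP of hierarchical B pictures
# (labels and the display-index convention are illustrative).

def hierarchical_b_order(gop_size):
    """Return (label, display_index) pairs in coding order for one GOP."""
    order = [("A", gop_size)]            # anchor picture coded first
    level, step = 1, gop_size
    while step > 1:
        half = step // 2
        # B pictures of this level lie midway between already-coded pictures
        order += [("B%d" % level, pos) for pos in range(half, gop_size, step)]
        level, step = level + 1, half
    return order

print(hierarchical_b_order(8))
# [('A', 8), ('B1', 4), ('B2', 2), ('B2', 6),
#  ('B3', 1), ('B3', 3), ('B3', 5), ('B3', 7)]
```

For a GOP of 8 this reproduces the "A B1 B2 B2 B3 B3 B3 B3" label sequence quoted in the text.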

The B pictures of the first level (the picture labeled as B1 in Figure 4) use only the surrounding anchor frames A for motion-compensated prediction. The B pictures Bi of level i > 1 can use the surrounding anchor frames A as well as the B pictures Bj of a level j < i that are located inside the same group of pictures for motion-compensated prediction. Thus, the same dependency structure as for an MCTF layer without update steps is used. In order to restrict the reference picture usage in such a way, reference picture list reordering (RPLR) commands are transmitted when necessary. Furthermore, in order to reduce the required frame memory,

- the B pictures of the highest level (the pictures B3 in the example of Figure 4) are transmitted as non-reference pictures (all other B pictures need to be transmitted as reference pictures, since they are used as reference for motion-compensated prediction of following pictures), and
- memory management control operation (MMCO) commands are coded in the slice headers of the anchor pictures A. With these MMCO commands, all B pictures of the previous GOP are marked as unused for reference, so that they can be removed from the decoded picture buffer.

Figure 4: Coding structure of the AVC Main Profile compatible base layer

2.2 Scalability Dimensions

There are two different ways of introducing scalability in a codec: either by using a technique that is intrinsically scalable (such as bitplane arithmetic coding) or by using a layered approach (the same concept as the one used in many previous standards [7-9]). Here, a combination of the two approaches is used to enable a fully spatio-temporal and quality scalable codec. Temporal scalability is enabled by Motion-Compensated Temporal Filtering (MCTF) [2-6], whereas spatial scalability is provided using a layered approach. For quality (SNR) scalability, an embedded quantization approach is pursued.

A detailed block diagram of the encoder is presented in Figure 5. As the codec is based on a layered approach to enable spatial scalability, the encoder provides a down-sampling filter stage that generates the lower resolution signal for each spatial layer. Depending on the application requirements in terms of spatio-temporal scalability, the inputs of the different layers might have different frame rates.

The input video signal of each layer is temporally transformed using MCTF. The output of this process is a set of temporal lowpass {L} and highpass {H} frames with residual texture information and motion description information. The encoding of the motion and the residual texture is based on the AVC [1] algorithm, with a few modifications to handle the spatial and SNR scalability aspects. The scalability of the motion information is enabled using multi-scale prediction of the motion vectors as described in 3.3.2, where each new spatial layer refines the block size and/or the pixel precision of the motion vectors.

As in the case of the AVC standard, the residual texture information is encoded on a macroblock basis. Each macroblock can be coded either intra or inter. Here, inter coding denotes the application of a motion prediction/compensation scheme to predict the current macroblock. In case of intra coding, a prediction from surrounding macroblocks or from other spatial layers is possible. These prediction techniques do not employ motion information and hence are referred to as intra prediction techniques.


For each spatial resolution, the residual texture is encoded using an SNR base layer at a certain quantization level and enhancement layers that are provided in an embedded structure.

Figure 5: Example for the encoder structure providing three spatial resolution layers.

2.2.1 Temporal Scalability

The temporal decomposition framework of MCTF inherently provides temporal scalability. By using n decomposition stages, up to n levels of temporal scalability can be provided.

In Figure 6, an example for the temporal decomposition of a group of 12 pictures using 3 decomposition stages is illustrated. This structure provides a non-dyadic decomposition in the coarsest layer. Note that other decomposition structures are available providing dyadic decomposition for all layers. If only the low-pass pictures {L}3 that are obtained after the third (coarsest) decomposition stage are transmitted, the picture sequence {L}3* that can be reconstructed at the decoder side has 1/12 of the temporal resolution of the input sequence. The picture sequence {L}3 at the coarsest temporal layer is also referred to as the temporal base layer. By additionally transmitting the high-pass pictures {H}3, the decoder can reconstruct an approximation of the picture sequence {L}2 that has 1/4 of the temporal resolution of the input sequence. The high-pass pictures {H}3 are also referred to as the first temporal enhancement layer. By further adding the high-pass pictures {H}2, a picture sequence {L}1* with half the temporal resolution can be reconstructed. And finally, if the remaining high-pass pictures {H}1 are transmitted, a reconstructed version of the original input sequence with the full temporal resolution is obtained.

Figure 6: Illustration of temporal scalability.

In general, by using n decomposition stages, the decomposition structure can be designed in a way that n levels of temporal scalability are provided with temporal resolution conversion factors of 1/m0, 1/(m0·m1), ..., 1/(m0·m1·...·m(n-1)), where mi represents any integer number greater than 1. Therefore, a picture sequence has to be coded in groups of N0 = j·m0·m1·...·m(n-1) pictures, with j being an integer number greater than 0. The GOP size does not need to be constant within the picture sequence.
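The arithmetic above can be made concrete with a short Python sketch (the helper names are illustrative, not part of the draft): it computes each level's temporal resolution conversion factor from the per-stage factors mi and tests whether a GOP size N0 is admissible.

```python
# Sketch of the temporal-scalability arithmetic: per-stage decomposition
# factors m_i (> 1) determine the resolution conversion factor of each
# level and the admissible GOP sizes N0 = j * m0 * m1 * ... * m(n-1).

def resolution_factors(m):
    """1/m0, 1/(m0*m1), ... for decomposition factors m = [m0, m1, ...]."""
    factors, prod = [], 1
    for mi in m:
        prod *= mi
        factors.append(1.0 / prod)
    return factors

def valid_gop_size(n0, m):
    """N0 must equal j * m0 * m1 * ... * m(n-1) for some integer j > 0."""
    prod = 1
    for mi in m:
        prod *= mi
    return n0 > 0 and n0 % prod == 0

# Figure 6 structure: 3 stages, non-dyadic coarsest stage, GOPs of 12
m = [2, 2, 3]
assert resolution_factors(m) == [1 / 2, 1 / 4, 1 / 12]
assert valid_gop_size(12, m) and not valid_gop_size(10, m)
```

With m = [2, 2, 3] this reproduces the 1/2, 1/4, and 1/12 temporal resolutions of the Figure 6 example.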

It should be mentioned that a similar degree of temporal scalability could be realized with standard AVC coding (or the presented coding scheme with GOPs of only 1 picture) by using sub-sequences and/or regularly inserted non-reference pictures.

2.2.2 SNR (Quality) Scalability

The open-loop structure of the presented subband approach provides the possibility to efficiently incorporate SNR scalability. As indicated before, the texture information is encoded in an AVC compatible texture base layer that provides a minimum quality at a given quantization level.

The texture base layer is encoded using AVC entropy coding, including the block transformation, quantization, and CABAC as specified in AVC. In the higher spatial resolution layers, the size of the block transform can be chosen adaptively between 4x4 and 8x8 pixels as specified in the Fidelity Range Extensions profiles of AVC. For the lowest spatial resolution, compatibility with the AVC Main Profile is retained. Within each spatial resolution, SNR scalability is achieved by encoding successive refinements of the transform coefficients, starting with the minimum quality provided by the AVC compatible texture encoding. This is done by repeatedly decreasing the quantization step size and applying a modified CABAC entropy coding process akin to sub-bitplane coding, as explained in Sections 3.5.2.2 and 3.7.2. This coding mode is referred to as progressive refinement.
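The embedded-quantization idea behind progressive refinement can be illustrated with a toy Python sketch. This is a conceptual model only, not the normative FGS/CABAC process: each layer halves the quantization step size and codes the remaining error, so the decoder may truncate the stream after any layer and still obtain a usable reconstruction.

```python
# Conceptual sketch of embedded quantization (not the normative FGS/CABAC
# process): the base layer quantizes with step size `delta`; every
# refinement halves the step size and re-quantizes the remaining error.

def encode_refinements(coeffs, delta, num_refinements):
    """Produce (step_size, quantized_values) layers for a coefficient list."""
    layers, residual = [], list(coeffs)
    for _ in range(num_refinements + 1):          # base layer + refinements
        q = [round(r / delta) for r in residual]  # quantize current error
        layers.append((delta, q))
        residual = [r - qi * delta for r, qi in zip(residual, q)]
        delta /= 2                                # finer step next time
    return layers

def decode(layers, n):
    """Reconstruct from the first n layers only (bitstream truncation)."""
    rec = [0.0] * len(layers[0][1])
    for delta, q in layers[:n]:
        rec = [r + qi * delta for r, qi in zip(rec, q)]
    return rec

coeffs = [13.7, -5.2, 0.9, 2.4]                   # toy transform coefficients
layers = encode_refinements(coeffs, delta=8.0, num_refinements=3)
errs = [max(abs(c - r) for c, r in zip(coeffs, decode(layers, n)))
        for n in range(1, 5)]
assert errs == sorted(errs, reverse=True)         # error shrinks per layer
```

Decoding after one layer gives the coarse base-layer quality; each additional refinement layer shrinks the maximum reconstruction error.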


2.2.3 Spatial Scalability

In the block transform based approach of MCTF, spatial scalability is provided by concepts used in the video coding standards H.262/MPEG-2 Visual [7], H.263 [8], or MPEG-4 Visual [9]. Conceptually, a pyramid of spatial resolutions is provided. The AVC compatible base layer represents the lowest spatial resolution that can be decoded from an SVM bitstream.

2.2.4 Combined Scalability

The concepts of temporal, SNR, and spatial scalability presented in Sections 2.2.1, 2.2.2, and 2.2.3, respectively, can easily be combined into a general scalable coding scheme, which can provide a wide range of temporal, SNR, and spatial scalability.


Example

In Figure 7, an example for combined scalability is illustrated. In this example, the spatial base layer (QCIF) is coded at a frame rate of 15 Hz using standard AVC, where the structure with hierarchical B pictures depicted in Section 2.1.3 is used to provide temporal scalability. The spatial enhancement layer in CIF resolution is coded at a frame rate of 30 Hz, using the MCTF coding scheme with 4 decomposition stages. Each of these spatial layers provides SNR scalability by using FGS.

The QCIF layer (layer 0) is encoded at a maximum bitrate of 80 kbit/s. By exploiting the SNR scalability provided by progressive refinement, a QCIF 15 Hz bitstream can be extracted and transmitted at a bitrate ranging from 41 kbit/s (minimum quality provided without any progressive refinements) to 80 kbit/s. A reduced frame rate bitstream can also be extracted by dropping the frames labeled as B3. In combination with SNR scalability, a QCIF 7.5 Hz bitstream can be extracted and transmitted with a bitrate ranging from 32 to 66 kbit/s by exploiting progressive refinements on the remaining frames.

The CIF layer (layer 1), which includes layer 0, is encoded at a maximum bitrate of 256 kbit/s. By exploiting the SNR scalability, a CIF 30 Hz bitstream can be extracted with bitrates between 115 and 256 kbit/s. By exploiting the temporal scalability of the MCTF coding scheme, bitstreams with reduced frame rates of 15, 7.5, 3.75, and 1.875 Hz can be extracted by dropping the sets of temporal subbands {H}0, {{H}0, {H}1}, {{H}0, {H}1, {H}2}, and {{H}0, {H}1, {H}2, {H}3}, respectively. In combination with SNR scalability, the range of bitrates shown in Figure 7 can be extracted for each frame rate.

Figure 7: Example for combined scalability.
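The frame-rate side of this extraction can be sketched as follows (an illustrative helper, not part of the draft): dropping each further set of temporal high-pass subbands {H}i halves the frame rate of the dyadic CIF layer.

```python
# Sketch of frame-rate extraction for the CIF layer of the example above:
# each dropped set of temporal high-pass subbands {H}i halves the rate.

def extractable_frame_rates(full_rate_hz, num_subband_sets):
    """Frame rates reachable by dropping {H}0, then {H}1, ... in turn."""
    rates, rate = [full_rate_hz], full_rate_hz
    for _ in range(num_subband_sets):
        rate /= 2
        rates.append(rate)
    return rates

print(extractable_frame_rates(30.0, 4))   # [30.0, 15.0, 7.5, 3.75, 1.875]
```

This reproduces the 30, 15, 7.5, 3.75, and 1.875 Hz operating points listed for the CIF layer; the bitrate dimension comes on top of this through FGS truncation.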

3 Encoder Description

Single layer coding is mostly based on AVC technology with the MCTF extension.

3.1 Intra Prediction

Directional prediction may be employed for intra coding of macroblocks. For macroblocks coded in INTRA_4x4 mode or INTRA_8x8 mode, the directions shown in Figure 8 may be employed. For macroblocks coded in INTRA_16x16 mode, a reduced set of directions is used; see the specification of AVC [1].

Figure 8: Intra prediction mode directions for intra 4x4 and 8x8 blocks.

3.2 Inter Prediction

The inter macroblock modes specified in AVC are employed. The macroblock partitioning for the inter mode is shown in Figure 9. The prediction block sizes are organized in a tree structure starting from 16x16 down to 8x8. The 8x8 block may be further divided into partitions of size 8x8 down to 4x4.

Figure 9: Prediction block sizes of the tree-structured inter prediction modes.

For motion compensation of partitions coded in predictive mode, one motion vector is employed. If the MCTF with Haar filters is used, this mode is always used. For motion compensation of partitions coded in bi-predictive mode, two motion vectors are employed. This mode is used in MCTF with 5/3 filters. Note that with 5/3 filters, the macroblock modes can be locally selected to be predictive or bi-predictive for adaptation to the signal properties.

3.3 Inter-Layer Prediction

Although MCTF is independently applied in each spatial layer, a large degree of inter-layer prediction is incorporated. Intra macroblocks and residual macroblocks representing temporal high-pass signals can be predicted using the corresponding interpolated reconstruction signals of previous layers. The motion description of each MCTF layer can be used for a prediction of the motion description of following enhancement layers.

3.3.1

Inter
-
Layer Intra Texture Prediction

Intra texture
prediction using information from the next lower spatial resolution is provided in the I_BL macroblock
mode. The usage of the I_BL mode in a high
-
pass picture is only allowed for macroblocks, for which the corresponding
8x8 block of the base layer is locat
ed inside an intra
-
coded macroblock. With this mode, t
he inverse MCTF is only
required for the spatial layer that is actually decoded.

13

Scalable Video Coding


Working Draft
1

For generating the intra prediction signal for high
-
pass macroblocks coded in I_BL mode, the corresponding 8x8 blocks
of
the base layer high
-
pass signal are directly de
-
blocked and interpolated as illustrated in
Figure
10
. Therefore, after
padding the corresponding 8x8 bloc
k, the de
-
blocking filter as specified in H.64/ AVC is applied and the interpolation is
performed using the half
-
pel interpolation filter of AVC.



Figure
10
: Low
-
complexity inter
-
layer prediction of intra macroblocks

For applying the de-blocking filter and performing the interpolation process, the intra macroblocks of the base layer are extended by a 4-pixel border in each direction using the padding process specified in Section 6.3.
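As an illustration of the interpolation step, the following sketch applies the H.264/AVC 6-tap half-pel filter (1, −5, 20, 20, −5, 1)/32 to one row of samples; the border padding is simplified to sample replication here, and the function name is illustrative, not taken from the SVM software.

```python
def halfpel_row(row):
    """Half-sample interpolation of one row using the
    H.264/AVC 6-tap filter (1, -5, 20, 20, -5, 1) / 32."""
    padded = [row[0]] * 4 + list(row) + [row[-1]] * 4  # replicate a 4-pixel border
    out = []
    for i in range(4, 4 + len(row) - 1):               # half-pel positions between samples
        v = (padded[i - 2] - 5 * padded[i - 1] + 20 * padded[i]
             + 20 * padded[i + 1] - 5 * padded[i + 2] + padded[i + 3])
        out.append(min(255, max(0, (v + 16) >> 5)))    # round and clip to 8 bit
    return out
```

On a flat signal the filter is transparent, since its coefficients sum to 32.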


3.3.2 Inter-Layer Motion Prediction

For encoding the motion field of an enhancement layer, two macroblock modes are possible in addition to the modes applicable in the base layer: “BASE_LAYER_MODE” and “QPEL_REFINEMENT_MODE”, as depicted in Figure 11.


Figure 11: Extension of prediction data syntax: macroblock mode.


If the “BASE_LAYER_MODE” is used, no further motion information is transmitted for the corresponding macroblock. This macroblock mode indicates that the motion/prediction information, including the macroblock partitioning, of the corresponding macroblock of the base layer is used. When the base layer represents a layer with half the spatial resolution, the motion vector field including the macroblock partitioning is scaled accordingly (see Figure 12). In this case the current macroblock covers the same region as an 8x8 sub-macroblock of the base layer motion field. Thus, if the corresponding base layer macroblock is coded in Direct, 16x16, 16x8, or 8x16 mode, or if the corresponding base layer sub-macroblock is coded in 8x8 mode or in Direct8x8 mode, then the 16x16 mode is used for the current macroblock. Otherwise, if the base layer sub-macroblock is coded in 8x4, 4x8, or 4x4 mode, the macroblock mode for the current macroblock is set equal to 16x8, 8x16, or 8x8 (with all sub-macroblock modes equal to 8x8), respectively. If the base layer macroblock represents an intra macroblock, the current macroblock mode is set to I_BL (intra macroblock with prediction from base layer). For the macroblock partitions of the current macroblock, the same reference indices as for the corresponding macroblock/sub-macroblock partitions of the base layer block are used. The associated motion vectors are multiplied by a factor of 2.
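The mode and motion-vector mapping described above can be sketched as follows (a simplified illustration; the mode names are string stand-ins for the actual syntax elements):

```python
def upsampled_mb_mode(base_mb_mode, base_sub_mb_mode=None):
    """Enhancement-layer macroblock mode for BASE_LAYER_MODE when the base
    layer has half the spatial resolution (sketch of the rules in the text)."""
    if base_mb_mode == "Intra":
        return "I_BL"                      # intra base macroblock -> I_BL
    if base_mb_mode in ("Direct", "16x16", "16x8", "8x16"):
        return "16x16"
    # base macroblock coded in 8x8 mode: inspect the co-located sub-macroblock
    return {"8x8": "16x16", "Direct8x8": "16x16",
            "8x4": "16x8", "4x8": "8x16", "4x4": "8x8"}[base_sub_mb_mode]

def upsampled_mv(mv):
    """Base-layer motion vectors are scaled by a factor of 2."""
    return (2 * mv[0], 2 * mv[1])
```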


The “QPEL_REFINEMENT_MODE” is used only if the base layer represents a layer with half the spatial resolution of the current layer. The “QPEL_REFINEMENT_MODE” is similar to the “BASE_LAYER_MODE”. The macroblock partitioning as well as the reference indices and motion vectors are derived as for the “BASE_LAYER_MODE”. However, for each motion vector a quarter-sample motion vector refinement (−1, 0, or +1 for each motion vector component) is additionally transmitted and added to the derived motion vectors.


Figure 12: Upsampling of motion data.


If none of the techniques above are used, the macroblock mode as well as the corresponding reference indices and
motion vector differences are encoded according to the AVC syntax.

3.4 Motion Compensated Temporal Filtering

In the SVM software [17] the temporal decomposition relies on the basic MCTF structure as described in Section 2.1.2. The MCTF is applied on a GOP basis as depicted, e.g., in Figure 13.

The temporal decomposition structure allows for prediction over GOP boundaries. In the prediction steps, the low-pass picture of the previous GOP that is obtained after performing all N decomposition stages is used as an additional reference picture for motion-compensated prediction. However, the motion-compensated update is only performed inside the GOP; i.e., the low-pass picture of the previous GOP that is used for prediction is not updated. An example of this structure is depicted in Figure 13. Note that this decomposition structure is conceptually similar to the open-GOP structure used in hybrid video coding schemes. The delay associated with this GOP structure is identical to the delay introduced by the independent GOP structure. However, the subjectively disturbing temporal blocking artefacts are significantly reduced.


Figure 13: MCTF structure for one GOP with prediction over GOP boundaries.



3.4.1 Adaptive prediction/update steps

The decomposition structure above is directly coupled to a structural delay of 2^N − 1 frames at the highest temporal resolution. This delay can be adjusted by limiting the reference picture lists (list 0 and list 1) that are used for the prediction and update steps.

Let f_in be the frame rate of the highest temporal level in Hz, and d_max be the maximum structural delay in seconds. Then, the delay in frames for the highest temporal resolution is given by

d_f0 = ⌊ f_in · d_max ⌋.

In order to enable structural encoding-decoding delays d_f0 that are less than 2^N − 1 frames, with N being the number of dyadic temporal decomposition stages, a group of pictures is partitioned into sub-groups. Neither backward prediction steps nor update steps (backward or forward) are allowed across the corresponding partition boundaries, so that these sub-groups of pictures can be encoded and decoded independently without influencing previous or following sub-groups.

Let l specify the temporal level. l is equal to zero for the first decomposition stage and is increased by one for each following dyadic temporal decomposition stage. The partitioning for the l-th decomposition level is controlled by two parameters: the partition size G_l and the sub-partition size C_l. These parameters are determined by the following algorithm, where D_l is an auxiliary variable:

for( D_l = 0; d_f0 >> ( D_l + l ); D_l++ );
D_l = min( D_l, N − l );
G_l = ( 1 << D_l );
C_l = max( 0, G_l − ( d_f0 >> l ) − 1 );

Figure 14 illustrates the partitioning and the corresponding decomposition structure for a group of 16 pictures (N = 4) and a structural encoding-decoding delay d_f0 of 4 frames. In that case, the parameters G_l and C_l with l = 0..3 are given by

G_l[l] = { 8, 4, 2, 1 }
C_l[l] = { 3, 1, 0, 0 }.
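The parameter derivation above can be transcribed directly into code; for N = 4 and d_f0 = 4, the sketch below reproduces exactly the G_l and C_l values listed (the function name is illustrative):

```python
def partition_params(N, d_f0, l):
    """Partition size G_l and sub-partition size C_l for temporal level l
    (direct transcription of the algorithm in the text)."""
    D = 0
    while d_f0 >> (D + l):          # for( D_l = 0; d_f0 >> ( D_l + l ); D_l++ );
        D += 1
    D = min(D, N - l)
    G = 1 << D
    C = max(0, G - (d_f0 >> l) - 1)
    return G, C
```

With the maximum delay d_f0 = 2^N − 1 = 15 the algorithm yields a single partition of the whole GOP with no sub-partitions.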


Figure 14: Illustration of the GOP partitioning for a group of 16 pictures and a structural encoding-decoding delay of 4 frames at the highest temporal resolution. For clarity, only the prediction and update steps using directly neighboring reference pictures are illustrated.

Note that the partitioning is always consistent across all decomposition stages. That means the location of the partition and sub-partition boundaries is identical for all temporal decomposition stages. Furthermore, the decomposition structure for the temporal levels 1 to N−1 and a delay of d_f frames is always identical to the decomposition structure for the temporal levels 0 to N−2 and a delay of d_f / 2 frames.

Given the parameters G_l and C_l, and thus the partitioning of a group of pictures, the size of the reference picture lists that are used in the prediction and update steps is restricted in such a way that any backward prediction or update across a sub-partition or partition boundary is discarded.

Let n_P0[i] specify the size (in frames) of prediction list 0 (forward prediction list) for the picture i, and let n_P1[i] specify the size of prediction list 1 (backward prediction list) for picture i. Similarly, let n_U0[i] specify the size (in frames) of update list 0 (forward update list) for the picture i, and let n_U1[i] specify the size of update list 1 (backward update list) for picture i. The picture index i is related to the low-pass size before the l-th decomposition stage is performed. The picture index 0 corresponds to the low-pass frame of the previous GOP, which is used as an additional reference frame in the prediction steps. The pictures 1, 3, 5, … are predicted from the pictures 0, 2, 4, … in the prediction steps and replaced by the corresponding high-pass pictures. Thereafter, the pictures 2, 4, 6, … are updated using the high-pass pictures 1, 3, 5, … Note that picture 0 is never updated since it represents a picture of the previous GOP which is already encoded.

Given the variables G_l and C_l, the maximum sizes of the reference picture lists that are used in the prediction and update steps are determined as follows. For the following considerations, it is assumed that the default derivation process for reference picture lists as described in 6.5.4 is used. In case reference picture re-ordering is employed, the algorithm needs to be adjusted accordingly.



Prediction list 0:

n_P0[i] = ( i + 1 ) >> 1

The prediction list 0 contains the pictures that are used for forward prediction, and thus its size is only limited by the “left” GOP boundary.


Prediction list 1:

if( ( i % G_l ) > C_l )
    n_P1[i] = ( G_l − ( i % G_l ) + 1 ) >> 1
else
    n_P1[i] = ( C_l − ( i % G_l ) + 1 ) >> 1

The prediction list 1 specifies the pictures that can be used for backward prediction. In case the current frame with index i is contained in the second sub-partition of a GOP partition ( ( i % G_l ) > C_l ), its size is limited by the “right” partition or GOP border, which is specified by G_l. Otherwise, the current frame is contained in the first sub-partition of a GOP partition, and the size of prediction list 1 is limited by the corresponding sub-partition border, which is specified by C_l.



Update list 0:

w = ( i == 0 ? 0 : ( ( i − 1 ) % G_l ) + 1 )
if( w > C_l )
    n_U0[i] = ( w − C_l ) >> 1
else
    n_U0[i] = w >> 1

The update list 0 specifies the set of “preceding” high-pass pictures that can be used for updating the current low-pass picture. w represents a frame index inside the current GOP partition. If w is greater than C_l, that is the current frame is located inside the second sub-partition of a GOP partition, the size of update list 0 is restricted by the corresponding sub-partition boundary. Otherwise, when the current frame is located inside the first sub-partition of a GOP partition, the size of update list 0 is restricted by the “left” partition or GOP boundary.



Update list 1:

w = ( i == 0 ? 0 : ( ( i − 1 ) % G_l ) + 1 )
if( w > C_l )
    n_U1[i] = ( G_l − w ) >> 1
else
    n_U1[i] = ( i == 0 ? 0 : ( C_l − w + 1 ) >> 1 )

The update list 1 specifies the set of “following” high-pass pictures that can be used for updating the current low-pass picture. w represents a frame index inside the current GOP partition. If w is greater than C_l, that is the current frame is located inside the second sub-partition of a GOP partition, the size of update list 1 is restricted by the “right” partition or GOP border. Otherwise, when the current frame is located inside the first sub-partition of a GOP partition, the size of update list 1 is restricted by the corresponding sub-partition boundary. Note that the low-pass frame with index i = 0 is never updated, since it represents the low-pass picture of the previous GOP, which is already coded and cannot be modified.
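The four list-size rules can be collected into one sketch (the helper name is illustrative; G and C are the partition parameters of the current decomposition level):

```python
def list_sizes(i, G, C):
    """Sizes of prediction lists 0/1 and update lists 0/1 for picture i,
    following the four rules in the text."""
    m = i % G
    n_p0 = (i + 1) >> 1                                   # limited by the left GOP boundary
    n_p1 = ((G - m + 1) >> 1) if m > C else ((C - m + 1) >> 1)
    w = 0 if i == 0 else ((i - 1) % G) + 1                # frame index inside the partition
    n_u0 = ((w - C) >> 1) if w > C else (w >> 1)
    n_u1 = ((G - w) >> 1) if w > C else (0 if i == 0 else (C - w + 1) >> 1)
    return n_p0, n_p1, n_u0, n_u1
```

For G = 8 and C = 3 (the l = 0 values of the example above), picture 0 gets empty update lists, as required for the low-pass picture of the previous GOP.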



3.5 Residual Coding

The entropy coding is mostly based on the AVC technology [12]. The video information is encoded on a macroblock basis (16x16 luma pixels). The encoding of a macroblock is divided into three phases: spatial transform, quantization, and arithmetic coding.

The transform size as well as the size of the spatial intra predictors (4x4 or 8x8) for the luminance component can be adaptively chosen on a macroblock basis. This choice is indicated by a flag for each macroblock. Inter-coded macroblocks with partitions smaller than 8x8 are restricted to a transform block size of 4x4. For the AVC standard, it was reported that the concept of adaptive block-size transforms (ABT) improves the coding efficiency both objectively (PSNR) and subjectively.

3.5.1 4x4 and 8x8 Transforms

The 2-D forward 4x4 and 8x8 transforms are computed in a separable way as a 1-D horizontal (row) transform followed by a 1-D vertical (column) transform, where the corresponding 1-D transforms are given by the 4x4 and 8x8 integer transform matrices of AVC [1].

3.5.2 Quantization and scaling

The quantization and scaling process is carried out differently for a quality base layer and a quality enhancement layer, as follows.

3.5.2.1 Quality base layer

The following quantization formula is used:

l = sgn( c ) · ⌊ |c| · CoeffQuant[ QP ] + |f| ⌋,

where c denotes the transform coefficient, l denotes the corresponding quantized value (level), QP is the quantization parameter, and f is the deadzone/offset parameter with an absolute value ranging between 0 and ½ and with the same sign as the coefficient that is being quantized. The quantization table CoeffQuant[] is derived from the matrix M_4x4 for 4x4 blocks and from the matrix M_8x8 for 8x8 blocks.

The reconstruction of a transform coefficient using a given level l is calculated as

c = l · DequantCoeff[ QP ],

with the same notations as given above, and where the dequantization table DequantCoeff[] is given by

DequantCoeff[ QP ] = S[ QP % 6 ] · 2^( QP / 6 ),

with the matrices S_4x4 and S_8x8 holding the scaling values for 4x4 and 8x8 blocks, respectively.

Note that for 8x8 blocks each row of S_8x8 represents scaled step sizes equivalent to the corresponding basic step sizes in {0.625, 0.6875, 0.8125, 1.0, 1.125}. The entries of M_8x8 were derived from the corresponding entries of S_8x8 by using the relation M_8x8 · S_8x8 = N², where N² takes the six different values of the squared norms of the underlying 2-D basis functions.
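The structure of this quantizer can be illustrated with a deadzone quantizer whose step size doubles when QP increases by 6; the base step sizes in the sketch below are illustrative placeholders, not the actual CoeffQuant/S tables:

```python
import math

BASE_STEP = [0.625, 0.6875, 0.8125, 0.875, 1.0, 1.125]   # illustrative values only

def q_step(qp):
    """Quantization step size: doubles for every increment of QP by 6."""
    return BASE_STEP[qp % 6] * (1 << (qp // 6))

def quantize(c, qp, f=0.5):
    """Deadzone quantization: level carries the same sign as the coefficient."""
    return int(math.copysign(math.floor(abs(c) / q_step(qp) + f), c))

def dequantize(level, qp):
    """Reconstruction of a coefficient from its level."""
    return level * q_step(qp)
```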

3.5.2.2 Quality enhancement layer

Each enhancement layer contains the residue between the spatial (4x4 or 8x8) transform coefficients of the original subband picture obtained after the MCTF of the corresponding spatial layer and their reconstructed base layer representation (or the subordinate enhancement layer representation). Each enhancement layer contains a refinement signal that corresponds to a bisection of the quantization step size.


The quantization parameters QP_i for the macroblocks of the i-th enhancement layer (with i = 0 specifying the base layer), which are used in the inverse scaling process (see below), are determined as follows:

- If the macroblock does not contain any transform coefficient for which a transform coefficient level not equal to zero has been transmitted in the base layer representation or any previous enhancement layer representation, the quantization parameter is calculated as specified in AVC [1] using the syntax element mb_qp_delta.

- Otherwise (the macroblock contains at least one transform coefficient for which a transform coefficient level not equal to zero has been transmitted in the base layer or any previous enhancement layer representation), the quantization parameter is calculated as follows:

QP_i = max( 0, QP_i−1 − 6 )

At the decoder side, the reconstruction of a transform coefficient c_k at scanning position k is obtained by

c_k = Σ_i InverseScaling( l_i,k, QP_i ),

where l_i,k represents the transform coefficient level that has been coded in the i-th enhancement layer for the transform coefficient c_k and QP_i is the corresponding macroblock quantization parameter. The function InverseScaling(.) represents the coefficient reconstruction process specified in Section 3.5.2.1 above.
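The layer-wise QP derivation and the summation over inverse-scaled levels can be sketched as follows; inverse_scaling is a placeholder for the reconstruction process of Section 3.5.2.1, and the step-size bisection per layer is assumed to correspond to decreasing QP by 6, clamped to be non-negative:

```python
def layer_qps(base_qp, num_layers):
    """QP per quality layer: each refinement bisects the step size (QP - 6)."""
    qps = [base_qp]
    for _ in range(num_layers - 1):
        qps.append(max(0, qps[-1] - 6))
    return qps

def reconstruct_coeff(levels, qps, inverse_scaling):
    """c_k as the sum of the inverse-scaled levels of all quality layers."""
    return sum(inverse_scaling(l, qp) for l, qp in zip(levels, qps))
```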

3.6 Motion Coding

The encoding of the motion and the residual texture is based on the AVC [1] algorithm, with a few modifications to handle the spatial and SNR scalability aspects.

3.7 Entropy Coding

3.7.1 Quality base layer

CABAC entropy coding is used for encoding the 4x4 and 8x8 blocks in a quality base layer. For details, see the AVC specification [1].

3.7.2 Quality enhancement layer

The quantized levels of the 4x4 and 8x8 blocks in a quality enhancement layer are coded as specified below.

For each quality enhancement layer the coding process for the transform coefficient refinement levels is divided into 3 scans.

In the first scan, the refinement levels of all transform coefficients with the following properties are coded:

- The transform coefficient levels that have been coded in the base layer representation and all subordinate enhancement layer representations are equal to zero. Such transform coefficients are also referred to as non-significant transform coefficients in the following.
- The transform coefficient is located inside a transform block that includes at least one transform coefficient for which a transform coefficient level not equal to zero has been transmitted in the base layer representation or any subordinate enhancement layer representation. Those transform coefficient blocks are also referred to as significant transform coefficient blocks in the following.

In the second scan, the refinement levels of all transform coefficients with the following property are coded:

- A transform coefficient level not equal to zero has been coded in the base layer or any previous enhancement layer representation. Such transform coefficients are also referred to as significant transform coefficients in the following.

Finally, in the third scan, all remaining refinement levels are coded. The corresponding transform coefficients have the following properties:

- The transform coefficient levels that have been coded in the base layer representation and all subordinate enhancement layer representations are equal to zero (non-significant transform coefficients).
- The transform coefficient is located inside a transform block that does not include any transform coefficient for which a transform coefficient level not equal to zero has been transmitted in the base layer representation or any subordinate enhancement layer representation. Those transform coefficient blocks are also referred to as non-significant transform coefficient blocks.
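The three-scan classification can be summarized by a small decision function (an illustration, not part of the SVM software):

```python
def refinement_scan(coeff_significant, block_significant):
    """Which of the three scans carries a coefficient's refinement level.

    coeff_significant: a non-zero level was coded for this coefficient in the
                       base layer or any subordinate enhancement layer.
    block_significant: the surrounding transform block holds at least one
                       significant coefficient.
    """
    if coeff_significant:
        return 2                     # second scan: significant coefficients
    return 1 if block_significant else 3
```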

In each scan the corresponding transform coefficients are transmitted in the order that is specified by the following pseudo-code:

for( scan_index = 0; scan_index < 16; scan_index++ )
{
    //===== luma coefficients =====
    for( block_y = 0; block_y < 4*frame_height_in_mb; block_y++ )
        for( block_x = 0; block_x < 4*frame_width_in_mb; block_x++ )
        {
            if( transform_size( MB[ block_y / 4, block_x / 4 ] ) == 8x8 )
            {
                b8x8_y  = block_y / 2
                b8x8_x  = block_x / 2
                scan8x8 = 4 * scan_index + 2 * ( block_y % 2 ) + ( block_x % 2 )
                encode_8x8luma_coefficient( b8x8_y, b8x8_x, scan8x8 )
            }
            else
            {
                encode_4x4luma_coefficient( block_y, block_x, scan_index )
            }
        }

    if( scan_index == 0 )
    {
        //===== chroma DC coefficients =====
        for( DC_index = 0; DC_index < 4; DC_index++ )
            for( component = 0; component < 2; component++ )
                for( mb_y = 0; mb_y < frame_height_in_mb; mb_y++ )
                    for( mb_x = 0; mb_x < frame_width_in_mb; mb_x++ )
                    {
                        encode_chromaDC_coefficient( component, mb_y, mb_x, DC_index )
                    }
    }
    else
    {
        //===== chroma AC coefficients =====
        for( component = 0; component < 2; component++ )
            for( block_y = 0; block_y < 2*frame_height_in_mb; block_y++ )
                for( block_x = 0; block_x < 2*frame_width_in_mb; block_x++ )
                {
                    encode_chromaAC_coefficient( component, block_y, block_x, scan_index )
                }
    }
}

The variables frame_width_in_mb and frame_height_in_mb specify the frame width and height in macroblock units, respectively. The function transform_size(MB[y,x]) returns the transform size (8x8 or 4x4) of the macroblock at the macroblock location (x, y). The highest-level index specifies the frequency band of transform coefficients. In each scan, the frequency bands are transmitted in a global zig-zag scan order from low to high frequency bands using the zig-zag scan that is specified in AVC for the scanning of transform coefficient levels inside a 4x4 transform block. Note that the transform coefficients of 8x8 transform blocks are mapped onto four neighboring 4x4 blocks. Within a frequency band, first all corresponding luma coefficient levels are transmitted in raster scan order, and thereafter the corresponding chroma coefficient levels are coded.

The transmitted syntax elements for each transform coefficient level depend on the current scan. In the first and third scan, in which transform coefficient levels for non-significant transform coefficients (see above) are transmitted, the following syntax elements (cp. [1]) are coded for each transform coefficient in the specified order:

- significant_coeff_flag: This syntax element specifies whether a transmitted transform coefficient level is equal to zero. If this flag is equal to zero, the transform coefficient level is equal to zero and no further syntax elements are transmitted for the transform coefficient level. In AVC [1], the syntax element significant_coeff_flag is never transmitted for the last transform coefficient (in scanning order) inside a block. For the progressive refinement packets, this syntax element is always transmitted (if no previously coded syntax elements indicate that the transform coefficient level is equal to zero). Therefore, we added 4 additional CABAC contexts, one for the last scanning position of each used block category (cp. [1]).



- last_significant_coeff_flag: This syntax element is only transmitted when the flag significant_coeff_flag is equal to 1. It indicates whether the current transform coefficient level represents the last significant transform coefficient level in scanning order inside a transform block. If this flag is equal to 1, no further information is transmitted for all remaining transform coefficient levels of the transform block. That is, the corresponding transform coefficient levels are excluded from the first and third scan (coding of non-significant transform coefficient refinements).



- coeff_sign_flag: This syntax element is only transmitted when the flag significant_coeff_flag is equal to 1. It specifies the sign of the transform coefficient level.

- coeff_abs_level_minus1: This syntax element is only transmitted when the flag significant_coeff_flag is equal to 1. It specifies the absolute value minus 1 of the transform coefficient level.

In addition to the syntax elements specified above, several macroblock-based syntax elements can be transmitted just before the flag significant_coeff_flag. These syntax elements, including their semantics and the conditions under which they are transmitted, are summarized in Table 1. Note that all transform coefficient levels that are signaled to be equal to zero by a bit of the coded_block_pattern or by the coded_block_flag are excluded from the first and third scan of the enhancement layer coding process. This also includes the current transform coefficient level. Thus, if for example a bit equal to zero of the syntax element coded_block_pattern is coded, no further syntax elements are transmitted for the current transform coefficient level.

Table 1: Macroblock-based syntax elements for encoding the first and third scan of a progressive refinement layer

syntax element | transmitted when … | semantics (cp. AVC [1])

1st bit of coded_block_pattern
Transmitted when: the current transform coefficient is the first non-significant transform coefficient inside the first 8x8 luma block of a macroblock.
Semantics: If this bit is equal to zero, it indicates that the transform coefficient levels for all non-significant transform coefficients inside the first 8x8 luma block of a macroblock are equal to zero. Thus, these transform coefficients are excluded from the first or third scan.

2nd bit of coded_block_pattern
Transmitted when: the current transform coefficient is the first non-significant transform coefficient inside the second 8x8 luma block of a macroblock.
Semantics: If this bit is equal to zero, it indicates that the transform coefficient levels for all non-significant transform coefficients inside the second 8x8 luma block of a macroblock are equal to zero. Thus, these transform coefficients are excluded from the first or third scan.

3rd bit of coded_block_pattern
Transmitted when: the current transform coefficient is the first non-significant transform coefficient inside the third 8x8 luma block of a macroblock.
Semantics: If this bit is equal to zero, it indicates that the transform coefficient levels for all non-significant transform coefficients inside the third 8x8 luma block of a macroblock are equal to zero. Thus, these transform coefficients are excluded from the first or third scan.

4th bit of coded_block_pattern
Transmitted when: the current transform coefficient is the first non-significant transform coefficient inside the fourth 8x8 luma block of a macroblock.
Semantics: If this bit is equal to zero, it indicates that the transform coefficient levels for all non-significant transform coefficients inside the fourth 8x8 luma block of a macroblock are equal to zero. Thus, these transform coefficients are excluded from the first or third scan.

5th bit of coded_block_pattern
Transmitted when: the current transform coefficient is the first non-significant chroma coefficient inside the macroblock.
Semantics: If this bit is equal to zero, it indicates that the transform coefficient levels for all non-significant chroma transform coefficients inside the macroblock are equal to zero. Thus, these transform coefficients are excluded from the first or third scan.

6th bit of coded_block_pattern
Transmitted when: the current transform coefficient is the first non-significant chroma AC coefficient and the already coded 5th bit of the syntax element coded_block_pattern is equal to 1.
Semantics: If this bit is equal to zero, it indicates that the transform coefficient levels for all non-significant chroma AC transform coefficients inside the macroblock are equal to zero. Thus, these transform coefficients are excluded from the first or third scan.

coded_block_flag
Transmitted when: the current transform coefficient is the first non-significant transform coefficient of a 4x4 transform block or a 2x2 chroma DC transform block.
Semantics: If this bit is equal to zero, it indicates that the transform coefficient levels for all non-significant transform coefficients inside the current transform block are equal to zero. Thus, these transform coefficients are excluded from the first or third scan.

mb_qp_delta
Transmitted when: for the current transform coefficient a bit not equal to zero of the syntax element coded_block_pattern is transmitted, all previously transmitted bits of the syntax element coded_block_pattern are equal to zero, and all bits of the syntax elements coded_block_pattern that have been coded for the base layer representation or previous enhancement layer representations are equal to zero.
Semantics: This syntax element specifies the quantization parameter for the current macroblock. The quantization parameter is computed as specified in AVC.

transform_size_8x8_flag
Transmitted when: the syntax element transform_8x8_mode_flag is equal to 1, for the current transform coefficient a bit not equal to zero of the syntax element coded_block_pattern is transmitted, all previously transmitted bits of the syntax element coded_block_pattern are equal to zero, and all bits of the syntax elements coded_block_pattern that have been coded for the base layer representation or previous enhancement layer representations are equal to zero.
Semantics: This syntax element specifies the transform size (4x4 or 8x8) for the luminance signal of the current macroblock.


In the second scan of an enhancement layer representation, in which transform coefficient levels for significant transform coefficients (coefficients for which non-zero levels have been coded in the base layer or any subordinate enhancement layer representation) are transmitted, the following syntax elements are transmitted for each transform coefficient in the specified order:

- coeff_refinement_flag: This syntax element indicates whether the transform coefficient refinement level is equal to zero. If the syntax element is equal to zero, no further information is transmitted for the current transform coefficient. For the encoding of this syntax element an additional CABAC context has been added.

- coeff_refinement_direction_flag: This syntax element specifies the sign of the transform coefficient refinement level. If this syntax element is equal to 1, the sign of the transform coefficient refinement level is equal to the sign of its base layer representation (which is coded in the base layer or in a subordinate enhancement layer); otherwise, the refinement level has the opposite sign. For the encoding of this syntax element an additional CABAC context has been added.

Note that for a transform coefficient refinement level in the second scan only the values −1, 0, and +1 are supported by the syntax.
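Since only −1, 0, and +1 are representable, decoding a second-scan refinement level reduces to the following sketch (flag names as in the text; the function itself is illustrative):

```python
def refinement_level(coeff_refinement_flag, coeff_refinement_direction_flag, base_sign):
    """Refinement level in {-1, 0, +1} from the two second-scan flags.

    base_sign is the sign (+1 or -1) of the coefficient's base representation."""
    if coeff_refinement_flag == 0:
        return 0
    # direction flag equal to 1: same sign as the base representation
    return base_sign if coeff_refinement_direction_flag == 1 else -base_sign
```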

3.8 Deblocking

3.8.1 Deblocking filter process

The AVC deblocking filter is employed to reduce the blocking artefacts induced by block-based motion compensation and quantization of the block transform coefficients. This section describes the deblocking filter process.

Inputs to this process are a low-pass picture L_k, a prediction data array M_P,k+1, and an array C_H,k+1 specifying quantisation parameters and transform coefficient levels for each macroblock of the low-pass picture L_k.

Output of the process is a modified low-pass picture L_k.

The deblocking filter process for the picture L_k is applied as specified in the AVC standard [1], where

- the macroblock modes, the sub-macroblock modes, the reference indices, and the motion vectors are extracted from the given prediction data array M_P,k+1, and
- the transform coefficient levels and quantisation parameters are extracted from the array C_H,k+1.

4 Operational Encoder Control

In this section, the operational encoder control implemented in the SVM software [17] is described. Section 4.1 explains the general coding and decomposition structure, while Section 4.2 describes how fine granular scalability is realized. In Sections 4.3 and 4.4, the algorithm for determining the prediction data arrays used for motion-compensated temporal filtering and the mode decision algorithm used for encoding the subband representation are described, respectively. The concept for selecting the quantisation parameters that are used for encoding the subband pictures is presented in Section 4.5.

4.1 Scalability and Decomposition Structure

Two types of base layer coding can be applied. The base layer can either be encoded using MCTF as explained in Section 3.4, or it can be encoded using a single-layer AVC coding scheme as explained in Section 2.1.3.

Several types of enhancement layers can be included to achieve the desired combined spatio-temporal-SNR scalability. These scalability dimensions are explained in Section 2.2 and, in particular, their combination in Section 2.2.4.

4.2 Fine grain SNR scalability

Fine grain SNR scalability (FGS) is obtained using the progressive refinement representations presented in Sections 3.5.2.2 and 3.7.2. The corresponding NAL units can be truncated at any arbitrary point. For any spatio-temporal resolution, a minimum bit-rate, which represents the corresponding spatial base layer representation (including the base layer representations of the lower-resolution layers), must be transmitted; these bit-rates can be adjusted in a way that the corresponding reconstructions represent the minimally acceptable video quality. Above the minimum bit-rate for a spatio-temporal resolution, any bit-rate can be extracted by truncating the progressive refinement NAL units of the corresponding spatio-temporal layer and all lower resolution layers in a suitable way.

In order to extract a requested bit-rate from the overall bit-stream, the following algorithm is used. Let R_t specify the target bit-rate for a spatio-temporal resolution S_t-T_t, and let R_0 be the base layer bit-rate for this spatio-temporal resolution, i.e. the bit-rate that corresponds to the base layer representation including all spatial lower-resolution layers for the temporal resolution T_t.

If R_0 is greater than R_t, the requested spatio-temporal rate point cannot be extracted from the overall bit-stream. Otherwise, the following applies:



The target rate is modified to R
t

= R
t



R
0



The progressive refinement packets are processed from the lowest supported spatial resolution to the target
spatial

resolution S
t
, and for each spatial resolution, the progressive refinement packets are processed from the
lowest refinement layer to the highest refinement layer. For each progressive refinement layer of a spatio
-
temporal resolution S
-
T
t
, the following ap
plies:

o

Let
R
F

be the bit
-
rate of the
i
-
th progressive refinement representation for the spatio
-
temporal
resolution
S
-
T
t
.

o

If
R
F

is less or equal to
R
t
, the corresponding progressive refinement packets are fully included into the
extracted bit
-
stream, and th
e target rate is modified to
R
t

=
R
t



R
F

o

Otherwise, the corresponding progressive refinement packets are truncated, and the target rate is set to
zero:
R
t

= 0. Let
L

be the original length of a progressive refinement packet. During truncation, the
length
is set to

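The extraction procedure above can be sketched in Python. This is an illustrative sketch, not the SVM extractor: the representation of the refinement packets as an ordered list of (bit-rate, length) pairs and the function name extract_bitstream are assumptions made for the example; truncation is applied proportionally to the remaining rate budget.

```python
def extract_bitstream(r_target, r_base, refinement_layers):
    """Select or truncate progressive refinement packets for a target rate.

    r_target          -- requested bit-rate R_t for the spatio-temporal point
    r_base            -- base layer bit-rate R_0 (all mandatory base packets)
    refinement_layers -- list of (rate, length) pairs, ordered from the lowest
                         spatial resolution / refinement layer upwards
                         (hypothetical representation of the NAL units)
    Returns the length to keep of each refinement packet, or None if the
    requested rate point cannot be extracted.
    """
    if r_base > r_target:
        return None                      # rate point is not extractable
    r_t = r_target - r_base              # budget left for refinement packets
    kept = []
    for r_f, length in refinement_layers:
        if r_f <= r_t:
            kept.append(length)          # packet fits: include it completely
            r_t -= r_f
        else:
            # truncate proportionally to the remaining budget; budget is
            # exhausted afterwards, so later packets are truncated to zero
            kept.append(int(length * r_t / r_f))
            r_t = 0
    return kept
```

A usage example: with a base layer of 100 kbit/s and refinement packets of 100 and 200 kbit/s, a 300 kbit/s request keeps the first packet whole and truncates the second to half its length.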

4.3 Motion estimation and mode decision process

This section describes an example method for determining the prediction data arrays M_P used in the prediction steps. This method is implemented in the SVM software [17]. The algorithm employs Lagrangian optimisation techniques that are widely used to optimise the rate-distortion efficiency of hybrid video coders [11]. A similar algorithm was integrated into the test model JM-2 [12] for the AVC standard.

Inputs to this process are a variable idxLP, reference index lists refIdxList0 and refIdxList1, and an ordered set of low-pass pictures { L_k[0], ..., L_k[N_k - 1] }.

Output of this process is a prediction data array M_P.

Furthermore, for controlling the motion estimation / mode decision process, the encoder control has to select the number of active entries for both reference index lists refIdxList0 and refIdxList1 as well as a quantisation parameter QP ∈ [0; 51]. The selected quantisation parameter determines the operating point of the encoder control.

Based on the given quantisation parameter QP, two Lagrangian multipliers λ_SAD and λ_SSD are derived by

	λ_SSD = 0.85 · 2^((QP - 12) / 3),	λ_SAD = sqrt( λ_SSD )

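As an illustration, the multipliers can be computed as follows. This sketch assumes the JM-style relation λ_SSD = 0.85 · 2^((QP - 12) / 3) with λ_SAD = sqrt(λ_SSD); the exact constants used by the SVM software may differ.

```python
import math

def lagrange_multipliers(qp):
    """Derive the Lagrangian multipliers from the quantisation parameter QP.

    Assumes the JM-style relation lambda_SSD = 0.85 * 2^((QP - 12) / 3),
    with lambda_SAD as its square root (used for SAD-based motion search).
    """
    if not 0 <= qp <= 51:
        raise ValueError("QP must lie in [0; 51]")
    lambda_ssd = 0.85 * 2.0 ** ((qp - 12) / 3.0)
    lambda_sad = math.sqrt(lambda_ssd)
    return lambda_sad, lambda_ssd
```

Higher QP values yield larger multipliers, i.e. the encoder control trades more distortion for rate savings at lower operating points.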

Let R_0 and R_1 specify the sets of active entries of the reference index lists refIdxList0 and refIdxList1, respectively.

The prediction data array M_P is estimated in a macroblock-wise manner by using the following process.

1.	For each possible macroblock partition P (and sub-macroblock mode p_sub-mb if applicable) of the current macroblock, the prediction method p_pred together with the associated reference indices r_0 and/or r_1 and motion vectors {m_0} and {m_1} is determined by the following algorithm.



-	For all sub-macroblock partitions P_i of the current macroblock partition P (and sub-macroblock modes p_sub-mb if applicable), list 0 and list 1 motion vector candidates m_0(r_0, i) and m_1(r_1, i) for all reference indices r_0 ∈ R_0 and r_1 ∈ R_1 are obtained by minimizing the Lagrangian functional

	m_0/1(r_0/1, i) = argmin over m ∈ S of [ D_SAD( P_i, r_0/1, m ) + λ_SAD · ( R( r_0/1 ) + R( m ) ) ]

with the distortion term being given as

	D_SAD( P_i, r_0/1, m ) = Σ over (i, j) ∈ P_i of | l_org[ i, j ] - l_ref,0/1[ i + m_x, j + m_y ] |

l_org[] represents the luma sample array of the picture L_k[idxLP], and l_ref,0/1[] represents the luma sample array of the picture L_k[refIdxList0/1[r_0/1]], which is referenced by the reference index r_0/1. S is the motion vector search range. The terms R( r_0/1 ) and R( m_0/1 ) specify the number of bits needed to transmit the reference index r_0/1 and all components of the motion vector m_0/1, respectively.

The motion search first proceeds over all integer-sample accurate motion vectors in the given search range S. Then, given the best integer motion vector, the eight surrounding half-sample accurate motion vectors are tested, and finally, given the best half-sample accurate motion vector, the eight surrounding quarter-sample accurate motion vectors are tested. For the half- and quarter-sample accurate motion vector refinement, the term l_ref,0/1[ i + m_0/1,x, j + m_0/1,y ] has to be interpreted as an interpolation operator.
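The hierarchical motion search can be sketched in Python. This is an illustrative sketch under simplifying assumptions: the names sad_cost and motion_search, the sample-list representation of a partition, and the callable ref_sample standing in for the interpolation operator l_ref are all hypothetical. Motion vectors are held in quarter-sample units so that integer, half- and quarter-sample candidates can share one cost function; for simplicity the centre vector is re-evaluated in each refinement stage.

```python
import itertools

def sad_cost(block_org, ref_sample, mv, lam_sad, rate_bits):
    """Lagrangian cost D_SAD + lambda_SAD * R(m) for one candidate vector.

    block_org  -- list of (x, y, value) luma samples of the partition
    ref_sample -- callable taking quarter-sample coordinates and returning an
                  interpolated reference luma sample (stand-in for l_ref[])
    rate_bits  -- callable mv -> bits needed to code the motion vector
    """
    mvx, mvy = mv
    d_sad = sum(abs(v - ref_sample(4 * x + mvx, 4 * y + mvy))
                for x, y, v in block_org)
    return d_sad + lam_sad * rate_bits(mv)

def motion_search(block_org, ref_sample, lam_sad, rate_bits, search_range):
    """Integer-sample search followed by half- and quarter-sample refinement.

    Vectors are in quarter-sample units: integer candidates are multiples
    of 4, half-sample offsets are multiples of 2, quarter offsets are 1.
    """
    # full integer-sample search over the range S
    candidates = [(4 * dx, 4 * dy)
                  for dx in range(-search_range, search_range + 1)
                  for dy in range(-search_range, search_range + 1)]
    best = min(candidates,
               key=lambda mv: sad_cost(block_org, ref_sample, mv,
                                       lam_sad, rate_bits))
    # refine around the winner: half-sample neighbours first,
    # then the quarter-sample neighbours of the half-sample winner
    for step in (2, 1):
        neigh = [(best[0] + step * dx, best[1] + step * dy)
                 for dx, dy in itertools.product((-1, 0, 1), repeat=2)]
        best = min(neigh,
                   key=lambda mv: sad_cost(block_org, ref_sample, mv,
                                           lam_sad, rate_bits))
    return best
```

On a synthetic horizontal-ramp block displaced by 1.25 integer samples, the search converges to the quarter-sample vector (5, 0) in quarter-sample units.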



-	Given the motion vector candidates for all sub-macroblock partitions P_i and reference indices r_0 ∈ R_0 and r_1 ∈ R_1, the list 0 and list 1 reference indices r_0 and r_1 and the associated motion vectors {m_0} and {m_1} for list 0 and list 1 prediction are selected by minimizing the Lagrangian functional

	r_0/1 = argmin over r ∈ R_0/1 of Σ_i [ D_SAD( P_i, r, m_0/1(r, i) ) + λ_SAD · ( R( r ) + R( m_0/1(r, i) ) ) ]

where the summation proceeds over all sub-macroblock partitions P_i (with i being the sub-macroblock partition index) of a macroblock partition P.

-	Given the determined reference indices r_0 and r_1 and the associated motion vectors {m_0} and {m_1}, the Lagrangian costs for list 0 and list 1 prediction, J_L0 and J_L1, are calculated by

	J_L0/L1 = Σ_i [ D_SAD( P_i, r_0/1, m_0/1(r_0/1, i) ) + λ_SAD · ( R( r_0/1 ) + R( m_0/1(r_0/1, i) ) ) ]

-	Given the reference indices r_0 and r_1 and the associated motion vectors {m_0} and {m_1} for list 0 and list 1 prediction, the reference indices r_B0 and r_B1 and the associated motion vectors {m_B0} and {m_B1} for bi-prediction are obtained by the following iterative algorithm.

Initially, the reference indices and motion vectors for bi-prediction are set equal to the reference indices and motion vectors that have been determined for list 0 and list 1 prediction,

	r_B0 = r_0,	{m_B0} = {m_0},	r_B1 = r_1,	{m_B1} = {m_1},

an iteration index iter is set equal to 0,

	iter = 0,

and the Lagrangian cost for bi-prediction J_BI is set equal to J( r_B0, r_B1, {m_B0}, {m_B1} ) with

	J( r_B0, r_B1, {m_B0}, {m_B1} ) = Σ_i [ D_BI( P_i, r_B0, r_B1, m_B0(i), m_B1(i) ) + λ_SAD · ( R( r_B0 ) + R( r_B1 ) + R( m_B0(i) ) + R( m_B1(i) ) ) ]

and the distortion term being given as

	D_BI( P_i, r_B0, r_B1, m_B0, m_B1 ) = Σ over (i, j) ∈ P_i of | l_org[ i, j ] - ( l_ref,0[ i + m_B0,x, j + m_B0,y ] + l_ref,1[ i + m_B1,x, j + m_B1,y ] ) / 2 |
Subsequently, in each iteration step, the following applies.

o	If ( iter % 2 ) is equal to 0, the following applies.

	-	A list 0 reference index r*_B0 and associated list 0 motion vectors {m*_B0} are determined by minimizing the following Lagrangian functional

		( r*_B0, {m*_B0} ) = argmin over r ∈ R_0, m(i) ∈ S*( m_B0(i) ) of Σ_i [ D_BI( P_i, r, r_B1, m(i), m_B1(i) ) + λ_SAD · ( R( r ) + R( r_B1 ) + R( m(i) ) + R( m_B1(i) ) ) ]

	where the search range S*( m_B0(i) ) specifies a small area around the motion vector m_B0(i).

	-	If the associated cost measure J_iter = J( r*_B0, r_B1, {m*_B0}, {m_B1} ) is less than the minimum cost measure J_BI, the list 0 reference index r*_B0 is assigned to r_B0, the associated list 0 motion vectors {m*_B0} are assigned to {m_B0}, and the minimum cost measure J_BI is set equal to J_iter.

o	Otherwise ( ( iter % 2 ) is equal to 1 ), the following applies.

	-	A list 1 reference index r*_B1 and associated list 1 motion vectors {m*_B1} are determined by minimizing the following Lagrangian functional

		( r*_B1, {m*_B1} ) = argmin over r ∈ R_1, m(i) ∈ S*( m_B1(i) ) of Σ_i [ D_BI( P_i, r_B0, r, m_B0(i), m(i) ) + λ_SAD · ( R( r_B0 ) + R( r ) + R( m_B0(i) ) + R( m(i) ) ) ]

	-	If the associated cost measure J_iter = J( r_B0, r*_B1, {m_B0}, {m*_B1} ) is less than the minimum cost measure J_BI, the list 1 reference index r*_B1 is assigned to r_B1, the associated list 1 motion vectors {m*_B1} are assigned to {m_B1}, and the minimum cost measure J_BI is set equal to J_iter.

o