Scalable Video Coding – Working Draft 1

Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG
(ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6)
14th Meeting: Hong Kong, CN, 17-21 January, 2005

Document: JVT-N020
Filename: JVT-N021.doc
Title: Joint Scalable Video Model JSVM 0
Status: Output Document of JVT
Purpose: Information

Editor(s)/Contact(s):

Julien Reichel
VisioWave Switzerland, Rte. de la Pierre 22, CH-1032 Ecublens, Switzerland
Tel: +41 (21) 695-0041
Fax: +41 (21) 695-0001
Email: julien.reichel@visiowave.com

Heiko Schwarz
Heinrich Hertz Institute (FhG), Einsteinufer 37, D-10587 Berlin, Germany
Tel: +49 (30) 31002-226
Fax: +49 (30) 39272-00
Email: heiko.schwarz@hhi.fhg.de

Mathias Wien
Institut für Nachrichtentechnik, RWTH Aachen University, D-52056 Aachen, Germany
Tel: +49 (241) 80-27681
Fax: +49 (241) 80-22196
Email: wien@ient.rwth-aachen.de

Source: JVT

_____________________________

Note: This document contains the text of the MPEG-21 Scalable Video Model 3.0, document ISO/IEC JTC 1/SC 29/WG 11 N6716. It is provided as information on the status of the Scalable Video Coding effort before it became a work item for JVT.
Index

1 Glossary
2 Introduction
2.1 Framework
2.1.1 Framework implementation
2.1.2 Motion-Compensated Temporal Filtering (MCTF)
2.1.3 Base Layer Compatibility with AVC Main Profile
2.2 Scalability Dimensions
2.2.1 Temporal Scalability
2.2.2 SNR (Quality) Scalability
2.2.3 Spatial Scalability
2.2.4 Combined Scalability
3 Encoder Description
3.1 Intra Prediction
3.2 Inter Prediction
3.3 Inter-Layer Prediction
3.3.1 Inter-Layer Intra Texture Prediction
3.3.2 Inter-Layer Motion Prediction
3.4 Motion Compensated Temporal Filtering
3.4.1 Adaptive prediction/update steps
3.5 Residual Coding
3.5.1 4x4 and 8x8 Transforms
3.5.2 Quantization and scaling
3.6 Motion Coding
3.7 Entropy Coding
3.7.1 Quality base layer
3.7.2 Quality enhancement layer
3.8 Deblocking
3.8.1 Deblocking filter process
4 Operational Encoder Control
4.1 Scalability and Decomposition Structure
4.2 Fine grain SNR scalability
4.3 Motion estimation and mode decision process
4.4 Coding of Subband Pictures
4.5 Quantizer Selection
5 Syntax and Semantics
5.1 Specification of functions and variables
5.1.1 Specification of functions
5.1.2 Specification of variables
5.2 Syntax in tabular form
5.2.1 NAL unit syntax
5.2.2 Sequence parameter set RBSP syntax
5.2.3 Slice layer in scalable extension RBSP syntax
5.2.4 Slice header in scalable extension syntax
5.2.5 Slice data in scalable extension syntax
5.2.6 Macroblock layer in scalable extension syntax
5.2.7 Progressive refinement slice data syntax in scalable extension
5.3 Semantics
5.3.1 NAL unit semantics
5.3.2 Scalable slice header in scalable extension semantics
5.3.3 Macroblock layer in scalable extension semantics
5.3.4 Progressive refinement slice data semantics in scalable extension
6 Decoding Process
6.1 Parsing Process
6.2 Decoding process for prediction data
6.3 Decoding process for subband pictures
6.4 Decoding process of progressive refinements
6.5 Reconstruction process of a group of pictures
6.5.1 Inverse motion-compensated temporal filtering process
6.5.2 Reconstruction of a set of low-pass pictures
6.5.3 Reference list construction process
6.5.4 General prediction process
6.5.5 Derivation of prediction data for the update steps
6.5.6 Deblocking filter process
7 References
1 Glossary

ABT: Adaptive Block size Transforms
AVC: ITU-T Recommendation H.264 | ISO/IEC MPEG-4 Part 10: Advanced Video Coding
CABAC: Context-based Adaptive Binary Arithmetic Coding
FGS: Fine Grained Scalability
GOP: Group of Pictures
IDR: Instantaneous Decoder Refresh
MC: Motion Compensation
MCTF: Motion-Compensated Temporal Filtering
ME: Motion Estimation
MMCO: Memory Management Control Operation
RPLR: Reference Picture List Reordering
UMCTF: Unconstrained MCTF
SNR: Signal-to-Noise Ratio
SVC: Scalable Video Codec (or "Coding", depending on the context)
SVM: Scalable Video Model
2 Introduction

The SVM shall function as a guideline for the development of the future SVC reference software. Many basic building blocks of the SVM are related to MPEG-4 AVC. Further detail on these parts can be found in [1].

2.1 Framework

The SVM encoder is composed of the generic building blocks presented in Figure 1. The three main scalability aspects, i.e. temporal, spatial, and quality scalability, are controlled by the algorithms used to implement each of these building blocks:

- Temporal scalability: Mostly controlled by the temporal transform (or prediction). The texture and motion coding should not prevent it. This aspect is intrinsic if the texture and motion coding are restricted to the current frame.
- Spatial scalability: Mostly controlled by the pyramidal representation of the spatial scalability levels. The motion coding should be compatible with the spatial scalability for efficiency reasons.
- Quality scalability: Mostly controlled by the texture coding. The motion coding might also be compatible with quality scalability in order to optimize the trade-off between motion and texture coding.

Figure 1: Generic Encoder
2.1.1 Framework implementation

The overall structure of the encoder is presented in Figure 2. One of the main difficulties of this approach is caused by the spatial feedback which is necessary to reduce the redundancy of the transform. As both the encoder and the decoder must use the same prediction, the scalability features of the algorithm are reduced if decoded signals are used for the inter-scale prediction. On the other hand, if the whole range of scalability is exploited, the inter-scale prediction might differ between the encoder and the decoder side. This causes a "spatial" drift and degrades the quality of higher resolution videos.

Figure 2: Scalable codec using a multi-scale pyramid and a "2D+t" structure. Example with 3 levels of spatial scalability.
2.1.2 Motion-Compensated Temporal Filtering (MCTF)

The Motion-Compensated Temporal Filtering (MCTF) is based on the lifting scheme [6]. The lifting scheme has two main advantages: it provides an efficient way to compute the wavelet transform (up to 4 times fewer operations than a direct implementation of the wavelet), and it ensures perfect reconstruction of the input in the absence of quantization of the wavelet coefficients. This last property holds even if non-linear operations are used during the lifting operation.

The generic lifting scheme consists of three types of operations: polyphase decomposition, prediction(s), and update(s). In most cases the MCTF is restricted to a special case of the lifting scheme with only one prediction and one update step. Figure 3 illustrates the lifting representation of an analysis-synthesis filter bank.

Figure 3: Lifting representation of an analysis-synthesis filter bank.

At the analysis side (a), the odd samples s[2k+1] of a given signal s are predicted by a linear combination of the even samples s[2k] using a prediction operator P(s[2k]), and a high-pass signal h[k] is formed by the prediction residuals. A corresponding low-pass signal l[k] is obtained by adding a linear combination of the prediction residuals h[k] to the even samples s[2k] of the input signal s using the update operator U(h[k]):

h[k] = s[2k+1] - P(s[2k]),
l[k] = s[2k] + U(h[k]).
Since both the prediction and the update step are fully invertible, the corresponding transform can be interpreted as a critically sampled perfect reconstruction filter bank. The synthesis filter bank simply consists of the application of the prediction and update operators in reverse order with inverted signs in the summation process, followed by the reconstruction using the even and odd polyphase components. For a normalization of the low- and high-pass components, appropriately chosen scaling factors F_l and F_h are applied, respectively. However, in practice, these scaling factors are included into the quantization step sizes as described in Section 4.5.
Let s[x, k] be a video signal with the spatial coordinate x = (x, y)^T and the temporal coordinate k. The prediction and update operators for the temporal decomposition using the lifting representation of the Haar wavelet are given by

P(s)[x, k] = s[x, 2k],
U(h)[x, k] = (1/2) h[x, k],

so that h[x, k] = s[x, 2k+1] - P(s)[x, k] and l[x, k] = s[x, 2k] + U(h)[x, k]. For the 5/3 transform, the prediction and update operators are given by

P(s)[x, k] = (1/2) ( s[x, 2k] + s[x, 2k+2] ),
U(h)[x, k] = (1/4) ( h[x, k-1] + h[x, k] ).

The extension to motion-compensated temporal filtering is realized by modifying the prediction and update operators as follows:

Haar: P(s)[x, k] = s[x + m_P0, 2(k - r_P0)],
      U(h)[x, k] = (1/2) h[x + m_U0, k + r_U0],

5/3:  P(s)[x, k] = (1/2) ( s[x + m_P0, 2(k - r_P0)] + s[x + m_P1, 2(k + 1 + r_P1)] ),
      U(h)[x, k] = (1/4) ( h[x + m_U0, k - r_U0] + h[x + m_U1, k + 1 + r_U1] ),

where the reference indices r >= 0 allow a general frame-adaptive motion-compensated filtering. The motion vectors m are not restricted to sample-accurate displacements. In the case of sub-sample accurate motion vectors, the term s[x + m, k] has to be interpreted as a spatially interpolated value.
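As an illustration of the lifting principle above, the following Python sketch performs one 5/3 analysis stage and its synthesis. It is a minimal sketch under simplifying assumptions: motion vectors are zero and boundaries are handled by repeating the last even sample (the normative boundary handling is defined later in this draft); the function names are illustrative only.

```python
def analyze_53(s):
    """One 5/3 lifting stage (prediction step, then update step); zero motion."""
    n = len(s)                       # even number of samples assumed
    k_max = n // 2
    h = []
    for k in range(k_max):
        right = s[min(2 * k + 2, n - 2)]          # repeat last even sample at boundary
        h.append(s[2 * k + 1] - 0.5 * (s[2 * k] + right))   # h[k] = odd - P
    l = []
    for k in range(k_max):
        l.append(s[2 * k] + 0.25 * (h[max(k - 1, 0)] + h[k]))  # l[k] = even + U
    return l, h

def synthesize_53(l, h):
    """Inverse lifting: apply the operators in reverse order with inverted signs."""
    n = 2 * len(l)
    s = [0.0] * n
    for k in range(len(l)):                       # undo update -> even samples
        s[2 * k] = l[k] - 0.25 * (h[max(k - 1, 0)] + h[k])
    for k in range(len(h)):                       # undo prediction -> odd samples
        right = s[min(2 * k + 2, n - 2)]
        s[2 * k + 1] = h[k] + 0.5 * (s[2 * k] + right)
    return s
```

Because every lifting step is inverted exactly, reconstruction is perfect in the absence of quantization; the Haar case is the one-tap special case (P = s[2k], U = h[k]/2).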
As can be seen from the above equations, both the prediction and update operators for the motion-compensated filtering using the lifting representation of the Haar wavelet are equivalent to uni-directional motion-compensated prediction. For the 5/3 wavelet, the prediction and update operators specify bi-directional motion-compensated prediction.

Since bi-directional motion-compensated prediction generally reduces the energy of the prediction residual but increases the motion vector rate in comparison to uni-directional prediction, it is desirable to switch dynamically between uni- and bi-directional prediction, and thus between the lifting representations of the Haar and the 5/3 wavelet. The choice of the transform can be made on a macroblock basis and is signalled as part of the motion information as explained in Section 3.6.
2.1.3 Base Layer Compatibility with AVC Main Profile

Base layer compatibility with AVC Main Profile is achieved by using a representation with hierarchical B pictures at the lowest provided spatial resolution. With hierarchical B pictures, the temporal update step is omitted, leading to a purely predictive coding structure. This structure can be represented using the syntax of AVC.

The coding structure depicted in Figure 4 is employed. Similarly to the layers encoded with MCTF, the first picture is independently coded as an IDR picture, and all remaining pictures are coded in "B...BP" or "B...BI" groups of pictures using the concept of hierarchical B pictures. The coding order of the pictures inside a GOP is "A B1 B2 B2 B3 B3 B3 B3 ...". The B pictures of the first level (the picture labeled B1 in Figure 4) use only the surrounding anchor frames A for motion-compensated prediction. The B pictures Bi of level i > 1 can use the surrounding anchor frames A as well as the B pictures Bj of a level j < i that are located inside the same group of pictures for motion-compensated prediction. Thus, the same dependency structure as for an MCTF layer without update steps is used. In order to restrict the reference picture usage in such a way, reference picture list reordering (RPLR) commands are transmitted when necessary.

Furthermore, in order to reduce the required frame memory, the B pictures of the highest level (the pictures B3 in the example of Figure 4) are transmitted as non-reference pictures (all other B pictures need to be transmitted as reference pictures, since they are used as references for motion-compensated prediction of following pictures), and memory management control operation (MMCO) commands are coded in the slice headers of the anchor pictures A. With these MMCO commands, all B pictures of the previous GOP are marked as unused for reference, so that they can be removed from the decoded picture buffer.

Figure 4: Coding structure of the AVC Main profile compatible base layer
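The coding order "A B1 B2 B2 B3 B3 B3 B3" for a dyadic GOP can be sketched as follows; the helper name and the (display position, label) representation are illustrative, not part of the draft.

```python
def hierarchical_b_coding_order(gop_size):
    """Coding order of one hierarchical-B GOP (gop_size a power of two).

    Returns (display_position, label) pairs: the anchor picture A that closes
    the GOP is coded first, then the B pictures level by level, where a picture
    of level L sits at an odd multiple of gop_size / 2**L.
    """
    order = [(gop_size, 'A')]
    level, step = 1, gop_size // 2
    while step >= 1:
        for pos in range(step, gop_size, 2 * step):
            order.append((pos, 'B%d' % level))
        level, step = level + 1, step // 2
    return order
```

For a GOP of 8 pictures this reproduces the order given in the text, and it makes visible why only the highest level (B3 here) can be marked as non-reference: every lower-level picture is coded before pictures that may reference it.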
2.2 Scalability Dimensions

There are two different ways of introducing scalability in a codec: either by using a technique that is intrinsically scalable (such as bitplane arithmetic coding), or by using a layered approach (the same concept as used in many previous standards [7-9]). Here, a combination of the two approaches is used to enable a fully spatio-temporal and quality scalable codec. Temporal scalability is enabled by the Motion-Compensated Temporal Filtering [2-6] (MCTF), whereas spatial scalability is provided using a layered approach. For quality (SNR) scalability, an embedded quantization approach is pursued.

A detailed block diagram of the encoder is presented in Figure 5. As the codec is based on a layered approach to enable spatial scalability, the encoder provides a down-sampling filter stage that generates the lower resolution signal for each spatial layer. Depending on the application requirements in terms of spatio-temporal scalability, the inputs of the different layers might have different frame rates.

The input video signal of each layer is temporally transformed using MCTF. The output of this process is a set of temporal low-pass {L} and high-pass {H} frames with residual texture information and motion description information. The encoding of the motion and the residual texture is based on the AVC [1] algorithm, with few modifications to handle the spatial and SNR scalability aspects. The scalability of the motion information is enabled using multi-scale prediction of the motion vectors as described in 3.3.2, where each new spatial layer refines the block size and/or the pixel precision of the motion vectors.

As in the case of the AVC standard, the residual texture information is encoded on a macroblock basis. Each macroblock can be coded either intra or inter. Here, inter coding denotes the application of a motion prediction/compensation scheme to predict the current macroblock. In the case of intra coding, a prediction from surrounding macroblocks or from other spatial layers is possible. These prediction techniques do not employ motion information and hence are referred to as intra prediction techniques.
For each spatial resolution, the residual texture is encoded using an SNR base layer at a certain quantization level and enhancement layers that are provided in an embedded structure.

Figure 5: Example for the encoder structure providing three spatial resolution layers.
2.2.1 Temporal Scalability

The temporal decomposition framework of MCTF inherently provides temporal scalability. By using n decomposition stages, up to n levels of temporal scalability can be provided.

In Figure 6, an example for the temporal decomposition of a group of 12 pictures using 3 decomposition stages is illustrated. This structure provides a non-dyadic decomposition in the coarsest layer. Note that other decomposition structures are available providing dyadic decomposition for all layers. If only the low-pass pictures {L}3 that are obtained after the third (coarsest) decomposition stage are transmitted, the picture sequence {L}3* that can be reconstructed at the decoder side has 1/12 of the temporal resolution of the input sequence. The picture sequence {L}3 at the coarsest temporal layer is also referred to as the temporal base layer. By additionally transmitting the high-pass pictures {H}3, the decoder can reconstruct an approximation of the picture sequence {L}2 that has 1/4 of the temporal resolution of the input sequence. The high-pass pictures {H}3 are also referred to as the first temporal enhancement layer. By further adding the high-pass pictures {H}2, a picture sequence {L}1* with half the temporal resolution can be reconstructed. And finally, if the remaining high-pass pictures {H}1 are transmitted, a reconstructed version of the original input sequence with the full temporal resolution is obtained.

Figure 6: Illustration of temporal scalability.

In general, by using n decomposition stages, the decomposition structure can be designed in a way that n levels of temporal scalability are provided with temporal resolution conversion factors of 1/m0, 1/(m0*m1), ..., 1/(m0*m1*...*m(n-1)), where each mi represents an integer number greater than 1. Therefore, a picture sequence has to be coded in groups of N0 = j*m0*m1*...*m(n-1) pictures, with j being an integer number greater than 0. The GOP size does not need to be constant within the picture sequence.
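The conversion factors and admissible GOP sizes above can be sketched in a few lines of Python; the function names are illustrative, and m = [2, 2, 3] is chosen to match the Figure 6 example (12 pictures, coarsest layer at 1/12 of the input rate).

```python
def temporal_conversion_factors(m):
    """Cumulative decimation factors for decomposition stages m = [m0, m1, ...].

    After stage i the temporal resolution is 1/(m0*...*mi); each mi must be > 1.
    """
    factor, factors = 1, []
    for mi in m:
        factor *= mi
        factors.append(factor)
    return factors

def valid_gop_sizes(m, j_values):
    """GOP sizes N0 = j * m0 * m1 * ... * m(n-1) for integers j > 0."""
    n0 = 1
    for mi in m:
        n0 *= mi
    return [j * n0 for j in j_values]
```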
It should be mentioned that a similar degree of temporal scalability could be realized with standard AVC coding (or the presented coding scheme with GOPs of only 1 picture) by using sub-sequences and/or regularly inserted non-reference pictures.
2.2.2 SNR (Quality) Scalability

The open-loop structure of the presented subband approach provides the possibility to efficiently incorporate SNR scalability. As indicated before, the texture information is encoded in an AVC compatible texture base layer that provides a minimum quality at a given quantization level.

The texture base layer is encoded using AVC entropy coding, including the block transformation, quantization, and CABAC as specified in AVC. In the higher spatial resolution layers, the size of the block transform can be chosen adaptively between 4x4 and 8x8 pixels as specified in the Fidelity Range Extensions profiles of AVC. For the lowest spatial resolution, compatibility to the AVC Main Profile is retained.

Within each spatial resolution, SNR scalability is achieved by encoding successive refinements of the transform coefficients, starting with the minimum quality provided by AVC compatible texture encoding. This is done by repeatedly decreasing the quantization step size and applying a modified CABAC entropy coding process akin to sub-bitplane coding, as explained in Sections 3.5.2.2 and 3.7.2. This coding mode is referred to as progressive refinement.
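The embedded-quantization idea behind progressive refinement can be sketched on a single coefficient: each pass halves the step size and transmits only a correction to the previous reconstruction. This is a toy model of the principle, not the normative coding process (which operates on transform coefficients with CABAC, Sections 3.5.2.2 and 3.7.2); all names here are illustrative.

```python
def quantize(value, step):
    """Quantize to the nearest multiple of 'step'."""
    return round(value / step) * step

def progressive_refinement(coeff, base_step, num_passes):
    """Base-layer reconstruction followed by refinement passes with halved steps.

    Returns the reconstruction after the base layer and after each pass; the
    reconstruction error after a pass with step size d is bounded by d/2.
    """
    recon = quantize(coeff, base_step)
    history = [recon]
    step = float(base_step)
    for _ in range(num_passes):
        step /= 2.0                              # decrease the quantization step
        recon += quantize(coeff - recon, step)   # encode only the correction
        history.append(recon)
    return history
```

Truncating the refinement stream at any pass still yields a valid (coarser) reconstruction, which is what makes the representation embedded.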
2.2.3 Spatial Scalability

In the block transform based approach of MCTF, spatial scalability is provided by concepts used in the video coding standards H.262/MPEG-2 Video [7], H.263 [8], or MPEG-4 Visual [9]. Conceptually, a pyramid of spatial resolutions is provided. The AVC compatible base layer represents the lowest spatial resolution that can be decoded from an SVM bitstream.
2.2.4 Combined Scalability

The concepts of temporal, SNR, and spatial scalability presented in Sections 2.2.1, 2.2.2, and 2.2.3, respectively, can easily be combined into a general scalable coding scheme, which can provide a wide range of temporal, SNR, and spatial scalability.
Example

In Figure 7, an example for combined scalability is illustrated. In this example, the spatial base layer (QCIF) is coded at a frame rate of 15 Hz using standard AVC, where the structure with hierarchical B pictures depicted in Section 2.1.3 is used to provide temporal scalability. The spatial enhancement layer in CIF resolution is coded at a frame rate of 30 Hz, using the MCTF coding scheme with 4 decomposition stages. Each of these spatial layers provides SNR scalability by using FGS.

The QCIF layer (layer 0) is encoded at a maximum bitrate of 80 kbit/s. By exploiting the SNR scalability provided by progressive refinement, a QCIF 15 Hz bitstream can be extracted and transmitted at a bitrate ranging from 41 kbit/s (the minimum quality provided without any progressive refinements) to 80 kbit/s. A reduced frame rate bitstream can also be extracted by dropping the frames labeled B3. In combination with SNR scalability, a QCIF 7.5 Hz bitstream can be extracted and transmitted at a bitrate ranging from 32 to 66 kbit/s by exploiting progressive refinements on the remaining frames.

The CIF layer (layer 1), which includes layer 0, is encoded at a maximum bitrate of 256 kbit/s. By exploiting the SNR scalability, a CIF 30 Hz bitstream can be extracted with bitrates between 115 and 256 kbit/s. By exploiting the temporal scalability of the MCTF coding scheme, bitstreams with reduced frame rates of 15, 7.5, 3.75, and 1.875 Hz can be extracted by dropping the sets of temporal subbands {H}0, {{H}0,{H}1}, {{H}0,{H}1,{H}2}, and {{H}0,{H}1,{H}2,{H}3}, respectively. In combination with SNR scalability, the range of bitrates shown in Figure 7 can be extracted for each frame rate.

Figure 7: Example for combined scalability.
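The temporal extraction rule for the CIF layer (drop the k finest high-pass subband sets to halve the frame rate k times) can be sketched as follows; the helper name is illustrative.

```python
def temporal_extraction_points(full_hz, stages):
    """Frame rates obtainable from a dyadic MCTF layer by subband dropping.

    Dropping the k finest high-pass subband sets {H}0 ... {H}(k-1) halves the
    frame rate k times; returns (frame_rate, dropped_subbands) pairs.
    """
    points = []
    for k in range(stages + 1):
        dropped = ['{H}%d' % i for i in range(k)]
        points.append((full_hz / 2.0 ** k, dropped))
    return points
```

For the 30 Hz CIF layer with 4 decomposition stages this reproduces the extraction points 30, 15, 7.5, 3.75, and 1.875 Hz named in the text.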
3 Encoder Description

Single layer coding is mostly based on AVC technology with the MCTF extension.

3.1 Intra Prediction

Directional prediction may be employed for intra coding of macroblocks. For macroblocks coded in INTRA_4x4 mode or INTRA_8x8 mode, the directions shown in Figure 8 may be employed. For macroblocks coded in INTRA_16x16 mode, a reduced set of directions is used; see the specification of AVC [1].

Figure 8: Intra prediction mode directions for intra 4x4 and 8x8 blocks.
3.2 Inter Prediction

The inter macroblock modes specified in AVC are employed. The macroblock partitioning for the inter modes is shown in Figure 9. The prediction block sizes are organized in a tree structure starting from 16x16 down to 8x8. Each 8x8 block may be further divided into partitions of size 8x8 down to 4x4.

Figure 9: Prediction block sizes of the tree-structured inter prediction modes.

For motion compensation of partitions coded in predictive mode, one motion vector is employed. If the MCTF with Haar filters is used, this mode is always used. For motion compensation of partitions coded in bi-predictive mode, two motion vectors are employed. This mode is used in MCTF with 5/3 filters. Note that with 5/3 filters, the macroblock modes can be locally selected to be predictive or bi-predictive for adaptation to the signal properties.
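The tree-structured partitioning can be made concrete with a small enumeration; the mode names follow the AVC convention, while the helper and the (width, height) representation are illustrative.

```python
# Macroblock-level partitionings (16x16 down to four 8x8 blocks).
MB_MODES = {
    '16x16': [(16, 16)],
    '16x8':  [(16, 8), (16, 8)],
    '8x16':  [(8, 16), (8, 16)],
}
# Sub-macroblock partitionings of one 8x8 block (8x8 down to 4x4).
SUB_MODES = {
    '8x8': [(8, 8)],
    '8x4': [(8, 4), (8, 4)],
    '4x8': [(4, 8), (4, 8)],
    '4x4': [(4, 4), (4, 4), (4, 4), (4, 4)],
}

def mb_partitions(mb_mode, sub_modes=None):
    """List of (width, height) prediction blocks for one 16x16 macroblock."""
    if mb_mode in MB_MODES:
        return list(MB_MODES[mb_mode])
    # '8x8' mode: each of the four 8x8 blocks carries its own sub-mode
    assert mb_mode == '8x8' and len(sub_modes) == 4
    return [blk for sm in sub_modes for blk in SUB_MODES[sm]]
```

Every valid partitioning tiles the 16x16 macroblock exactly, so each partition list covers 256 luma samples; in predictive mode each partition carries one motion vector, in bi-predictive mode two.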
3.3 Inter-Layer Prediction

Although MCTF is independently applied in each spatial layer, a large degree of inter-layer prediction is incorporated. Intra macroblocks and residual macroblocks representing temporal high-pass signals can be predicted using the corresponding interpolated reconstruction signals of previous layers. The motion description of each MCTF layer can be used for a prediction of the motion description for following enhancement layers.
3.3.1 Inter-Layer Intra Texture Prediction

Intra texture prediction using information from the next lower spatial resolution is provided in the I_BL macroblock mode. The usage of the I_BL mode in a high-pass picture is only allowed for macroblocks for which the corresponding 8x8 block of the base layer is located inside an intra-coded macroblock. With this mode, the inverse MCTF is only required for the spatial layer that is actually decoded.
For generating the intra prediction signal for high-pass macroblocks coded in I_BL mode, the corresponding 8x8 blocks of the base layer high-pass signal are directly de-blocked and interpolated as illustrated in Figure 10. Therefore, after padding the corresponding 8x8 block, the de-blocking filter as specified in H.264/AVC is applied, and the interpolation is performed using the half-pel interpolation filter of AVC.

Figure 10: Low-complexity inter-layer prediction of intra macroblocks

For applying the de-blocking filter and performing the interpolation process, the intra macroblocks of the base layer are extended by a 4-pixel border in each direction using the padding process specified in Section 6.3.
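The pad-then-interpolate step can be sketched in one dimension: extend the block by a 4-sample border and upsample by two with the AVC 6-tap half-pel filter (1, -5, 20, 20, -5, 1)/32. This is a simplified sketch: edge replication stands in for the padding process of Section 6.3, de-blocking is omitted, and no clipping to the sample range is performed; the function names are illustrative.

```python
def pad_1d(row, border=4):
    """Extend a row of samples by 'border' samples on each side (edge replication)."""
    return [row[0]] * border + list(row) + [row[-1]] * border

def halfpel_upsample_1d(padded, border=4):
    """Upsample by 2: copy full-pel samples, derive half-pel samples with the
    AVC 6-tap filter (1, -5, 20, 20, -5, 1) / 32, rounded."""
    out = []
    for i in range(border, len(padded) - border):
        out.append(padded[i])                      # full-pel position
        acc = (padded[i - 2] - 5 * padded[i - 1] + 20 * padded[i]
               + 20 * padded[i + 1] - 5 * padded[i + 2] + padded[i + 3])
        out.append((acc + 16) >> 5)                # half-pel position
    return out
```

In the actual process the filtering is separable, i.e. it is applied along rows and then along columns of the padded, de-blocked 8x8 block.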
3.3.2
Inter

Layer Motion Prediction
For encoding the motion field of an enhancement layer, two macroblock modes are possible in addition to the modes
applicable in the base layer: “
BASE_LAYER_M
ODE
” and “
QPEL_REFINEMENT_MODE
” as depicted in
Figure
11
.
Figure
11
: Extension of prediction data syntax: Macroblock mode.
If the “BASE_LAYER_MODE” is used, no further motion information is transmitted for the corresponding macroblock. This macroblock mode indicates that the motion/prediction information, including the macroblock partitioning, of the corresponding macroblock of the base layer is used. When the base layer represents a layer with half the spatial resolution, the motion vector field including the macroblock partitioning is scaled accordingly (see Figure 12). In this case the current macroblock covers the same region as an 8x8 sub-macroblock of the base layer motion field. Thus, if the corresponding base layer macroblock is coded in Direct, 16x16, 16x8, or 8x16 mode, or if the corresponding base layer sub-macroblock is coded in 8x8 mode or in Direct8x8 mode, then the 16x16 mode is used for the current macroblock. Otherwise, if the base layer sub-macroblock is coded in 8x4, 4x8, or 4x4 mode, the macroblock mode for the current macroblock is set equal to 16x8, 8x16, or 8x8 (with all sub-macroblock modes equal to 8x8), respectively. If the base layer macroblock represents an intra macroblock, the current macroblock mode is set to I_BL (intra macroblock with prediction from base layer). For the macroblock partitions of the current macroblock, the same reference indices as for the corresponding macroblock/sub-macroblock partitions of the base layer block are used. The associated motion vectors are multiplied by a factor of 2.
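The derivation rules above can be sketched as follows for the dyadic case (base layer at half resolution). This is an illustrative sketch only: the function name, the string mode identifiers, and the tuple representation of motion vectors are hypothetical helpers, not part of the SVM syntax.

```python
# Illustrative sketch of the BASE_LAYER_MODE derivation for the dyadic case.
# Mode names and the helper itself are hypothetical, not SVM syntax.

def upsample_base_mode(base_mode, base_sub_mode, base_mvs):
    """Derive the current macroblock mode and motion vectors from the
    co-located 8x8 sub-macroblock of the base layer motion field."""
    if base_mode == "INTRA":
        return "I_BL", []                      # intra prediction from base layer
    if base_mode in ("Direct", "16x16", "16x8", "8x16") or \
       base_sub_mode in ("8x8", "Direct8x8"):
        cur_mode = "16x16"                     # whole 8x8 region -> one partition
    elif base_sub_mode == "8x4":
        cur_mode = "16x8"
    elif base_sub_mode == "4x8":
        cur_mode = "8x16"
    else:                                      # base_sub_mode == "4x4"
        cur_mode = "8x8"                       # with all sub-macroblock modes 8x8
    # Reference indices are reused as-is; motion vectors are scaled by 2.
    cur_mvs = [(2 * mx, 2 * my) for (mx, my) in base_mvs]
    return cur_mode, cur_mvs

print(upsample_base_mode("8x8", "4x8", [(1, -2)]))   # -> ('8x16', [(2, -4)])
```

For the “QPEL_REFINEMENT_MODE” described below, the same derivation applies, with an additional quarter-sample offset added per motion vector component.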
The “QPEL_REFINEMENT_MODE” is used only if the base layer represents a layer with half the spatial resolution of the current layer. The “QPEL_REFINEMENT_MODE” is similar to the “BASE_LAYER_MODE”. The macroblock partitioning as well as the reference indices and motion vectors are derived as for the “BASE_LAYER_MODE”. However, for each motion vector a quarter-sample motion vector refinement (-1, 0, or +1 for each motion vector component) is additionally transmitted and added to the derived motion vectors.
Figure 12: Upsampling of motion data.
If neither of the techniques above is used, the macroblock mode as well as the corresponding reference indices and motion vector differences are encoded according to the AVC syntax.
3.4 Motion-Compensated Temporal Filtering
In the SVM software [17] the temporal decomposition relies on the basic MCTF structure as described in Section 2.1.2. The MCTF is applied on a GOP basis, as depicted e.g. in Figure 13.
The temporal decomposition structure allows for prediction over GOP boundaries. In the prediction steps, the low-pass picture of the previous GOP that is obtained after performing all N decomposition stages is used as an additional reference picture for motion-compensated prediction. However, the motion-compensated update is only performed inside the GOP; i.e., the low-pass picture of the previous GOP that is used for prediction is not updated. An example of this structure is depicted in Figure 13. Note that this decomposition structure is conceptually similar to the open-GOP structure used in hybrid video coding schemes. The delay associated with this GOP structure is identical to the delay introduced by the independent GOP structure. However, the subjectively disturbing temporal blocking artefacts are significantly reduced.
Figure 13: MCTF structure for one GOP with prediction over GOP boundaries.
3.4.1 Adaptive prediction/update steps
The decomposition structure above is directly coupled to a structural delay of 2^N - 1 frames at the highest temporal resolution. This delay can be adjusted by limiting the reference picture lists (list 0 and list 1) that are used for the prediction and update steps.

Let f_in be the frame rate of the highest temporal level in Hz, and d_max be the maximum structural delay in seconds. Then, the delay in frames for the highest temporal resolution is given by

    d_f0 = floor( d_max * f_in ).
In order to enable structural encoding-decoding delays d_f0 that are less than 2^N - 1 frames, with N being the number of dyadic temporal decomposition stages, a group of pictures is partitioned into sub-groups. Neither backward prediction steps nor update steps (backward or forward) are allowed across the corresponding partition boundaries, so that these sub-groups of pictures can be encoded and decoded independently without influencing previous or following sub-groups.
Let l specify the temporal level. l is equal to zero for the first decomposition stage and is increased by one for each following dyadic temporal decomposition stage. The partitioning for the l-th decomposition level is controlled by two parameters: the partition size G_l and the sub-partition size C_l. These parameters are determined by the following algorithm, where D_l is an auxiliary variable:
    for( D_l = 0; d_f0 >> ( D_l + l ); D_l++ );
    D_l = min( D_l, N - l );
    G_l = ( 1 << D_l );
    C_l = max( 0, G_l - ( d_f0 >> l ) - 1 );
Figure 14 illustrates the partitioning and the corresponding decomposition structure for a group of 16 pictures (N = 4) and a structural encoding-decoding delay d_f0 of 4 frames. In that case, the parameters G_l and C_l with l = 0..3 are given by

    G_l[ l ] = { 8, 4, 2, 1 }
    C_l[ l ] = { 3, 1, 0, 0 }.
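The partitioning algorithm can be sketched as a small Python function; the function name is an illustrative helper.

```python
# Sketch of the GOP partitioning algorithm: compute the partition sizes G_l
# and sub-partition sizes C_l for all temporal levels l, given the structural
# delay d_f0 (in frames) and the number of decomposition stages N.

def gop_partition(d_f0, N):
    G, C = [], []
    for l in range(N):
        D = 0
        while d_f0 >> (D + l):            # for( D_l = 0; d_f0 >> (D_l + l); D_l++ );
            D += 1
        D = min(D, N - l)
        G.append(1 << D)
        C.append(max(0, (1 << D) - (d_f0 >> l) - 1))
    return G, C

# Reproduces the example above (16 pictures, N = 4, delay of 4 frames):
print(gop_partition(4, 4))   # -> ([8, 4, 2, 1], [3, 1, 0, 0])
```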
Figure 14: Illustration of the GOP partitioning for a group of 16 pictures and a structural encoding-decoding delay of 4 frames at the highest temporal resolution. For clarity, only the prediction and update steps using directly neighboring reference pictures are illustrated.
Note that the partitioning is always consistent across all decomposition stages. That means, the location of the partition and sub-partition boundaries is identical for all temporal decomposition stages. Furthermore, the decomposition structure for the temporal levels 1 to N - 1 and a delay of d_f frames is always identical to the decomposition structure for the temporal levels 0 to N - 2 and a delay of d_f / 2 frames.
Given the parameters G_l and C_l, and thus the partitioning of a group of pictures, the size of the reference picture lists that are used in the prediction and update steps is restricted in a way that any backward prediction or update across a sub-partition or partition boundary is discarded.
Let n_P0[ i ] specify the size (in frames) of prediction list 0 (the forward prediction list) for picture i, and let n_P1[ i ] specify the size of prediction list 1 (the backward prediction list) for picture i. Similarly, let n_U0[ i ] specify the size (in frames) of update list 0 (the forward update list) for picture i, and let n_U1[ i ] specify the size of update list 1 (the backward update list) for picture i. The picture index i is related to the low-pass sequence before the l-th decomposition stage is performed. The picture index 0 corresponds to the low-pass frame of the previous GOP, which is used as an additional reference frame in the prediction steps. The pictures 1, 3, 5, ... are predicted from the pictures 0, 2, 4, ... in the prediction steps and replaced by the corresponding high-pass pictures. Thereafter, the pictures 2, 4, 6, ... are updated using the high-pass pictures 1, 3, 5, ... Note that picture 0 is never updated, since it represents a picture of the previous GOP which is already encoded.
Given the variables G_l and C_l, the maximum sizes of the reference picture lists that are used in the prediction and update steps are determined as follows. For the following considerations, it is assumed that the default derivation process for reference picture lists as described in Section 6.5.4 is used. In case reference picture re-ordering is employed, the algorithm needs to be adjusted accordingly.
Prediction list 0:

    n_P0[ i ] = ( i + 1 ) >> 1

The prediction list 0 contains the pictures that are used for forward prediction, and thus its size is only limited by the “left” GOP boundary.
Prediction list 1:

    if( ( i % G_l ) > C_l )
        n_P1[ i ] = ( G_l - ( i % G_l ) + 1 ) >> 1
    else
        n_P1[ i ] = ( C_l - ( i % G_l ) + 1 ) >> 1
The prediction list 1 specifies the pictures that can be used for backward prediction. In case the current frame with index i is contained in the second sub-partition of a GOP partition ( ( i % G_l ) > C_l ), its size is limited by the “right” partition or GOP border, which is specified by G_l. Otherwise, when the current frame is contained in the first sub-partition of a GOP partition, the size of prediction list 1 is limited by the corresponding sub-partition border, which is specified by C_l.
Update list 0:

    w = ( i == 0 ? 0 : ( ( i - 1 ) % G_l ) + 1 )
    if( w > C_l )
        n_U0[ i ] = ( w - C_l ) >> 1
    else
        n_U0[ i ] = w >> 1
The update list 0 specifies the set of “preceding” high-pass pictures that can be used for updating the current low-pass picture. w represents a frame index inside the current GOP partition. If w is greater than C_l, that is, the current frame is located inside the second sub-partition of a GOP partition, the size of update list 0 is restricted by the corresponding sub-partition boundary. Otherwise, when the current frame is located inside the first sub-partition of a GOP partition, the size of update list 0 is restricted by the “left” partition or GOP boundary.
Update list 1:

    w = ( i == 0 ? 0 : ( ( i - 1 ) % G_l ) + 1 )
    if( w > C_l )
        n_U1[ i ] = ( G_l - w ) >> 1
    else
        n_U1[ i ] = ( i == 0 ? 0 : ( C_l - w + 1 ) >> 1 )
The update list 1 specifies the set of “following” high-pass pictures that can be used for updating the current low-pass picture. w represents a frame index inside the current GOP partition. If w is greater than C_l, that is, the current frame is located inside the second sub-partition of a GOP partition, the size of update list 1 is restricted by the “right” partition or GOP border. Otherwise, when the current frame is located inside the first sub-partition of a GOP partition, the size of update list 1 is restricted by the corresponding sub-partition boundary. Note that the low-pass frame with index i = 0 is never updated, since it represents the low-pass picture of the previous GOP, which is already coded and cannot be modified.
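The four list-size rules above can be collected into one sketch, assuming the default reference list derivation (no re-ordering); the function name is an illustrative helper.

```python
# Sketch of the reference picture list size restrictions for one temporal
# level, given the partition size G and sub-partition size C of that level.

def list_sizes(i, G, C):
    """Return (n_P0, n_P1, n_U0, n_U1) for picture index i."""
    n_P0 = (i + 1) >> 1                      # only limited by the "left" GOP boundary
    if (i % G) > C:                          # second sub-partition of a GOP partition
        n_P1 = (G - (i % G) + 1) >> 1
    else:                                    # first sub-partition
        n_P1 = (C - (i % G) + 1) >> 1
    w = 0 if i == 0 else ((i - 1) % G) + 1   # frame index inside the GOP partition
    if w > C:
        n_U0 = (w - C) >> 1
        n_U1 = (G - w) >> 1
    else:
        n_U0 = w >> 1
        n_U1 = 0 if i == 0 else (C - w + 1) >> 1
    return n_P0, n_P1, n_U0, n_U1

# Level 0 of the Figure 14 example (G = 8, C = 3); note picture 0 is never updated:
print(list_sizes(0, 8, 3))   # -> (0, 2, 0, 0)
print(list_sizes(5, 8, 3))   # -> (3, 2, 1, 1)
```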
3.5 Residual Coding
The entropy coding is mostly based on the AVC technology [12]. The video information is encoded on a macroblock basis (16x16 luma pixels). The encoding of a macroblock is divided into three phases: spatial transform, quantization, and arithmetic coding.

The transform size as well as the size of the spatial intra predictors (4x4 or 8x8) for the luminance component can be adaptively chosen on a macroblock basis. This choice is indicated by a flag for each macroblock. Inter-coded macroblocks with partitions smaller than 8x8 are restricted to a transform block size of 4x4. For the AVC standard, it was reported that the concept of adaptive block size transforms (ABT) improves the coding efficiency objectively (PSNR) and subjectively.
3.5.1 4x4 and 8x8 Transforms
The 2-D forward 4x4 and 8x8 transforms are computed in a separable way as a 1-D horizontal (row) transform followed by a 1-D vertical (column) transform, where the corresponding 1-D transforms are given by the AVC transform matrices

    H_4x4 = [  1   1   1   1 ]
            [  2   1  -1  -2 ]
            [  1  -1  -1   1 ]
            [  1  -2   2  -1 ]

for the 4x4 transform and

    H_8x8 = [  8   8   8   8   8   8   8   8 ]
            [ 12  10   6   3  -3  -6 -10 -12 ]
            [  8   4  -4  -8  -8  -4   4   8 ]
            [ 10  -3 -12  -6   6  12   3 -10 ]
            [  8  -8  -8   8   8  -8  -8   8 ]
            [  6 -12   3  10 -10  -3  12  -6 ]
            [  4  -8   8  -4  -4   8  -8   4 ]
            [  3  -6  10 -12  12 -10   6  -3 ]

for the 8x8 transform.
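The separable computation can be sketched as follows, using the 4x4 AVC core transform matrix; the helper names are illustrative and a plain matrix product stands in for the fast butterfly implementation.

```python
# Sketch of the separable 2-D transform: a 1-D row transform followed by a
# 1-D column transform, shown here with the 4x4 AVC core transform matrix H.

H = [[1,  1,  1,  1],
     [2,  1, -1, -2],
     [1, -1, -1,  1],
     [1, -2,  2, -1]]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(r) for r in zip(*A)]

def forward_4x4(X):
    # Y = H * X * H^T (column transform applied after the row transform)
    return matmul(matmul(H, X), transpose(H))

# A constant 4x4 block transforms into a single DC coefficient:
Y = forward_4x4([[1] * 4 for _ in range(4)])
print(Y[0][0])   # -> 16
```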
3.5.2 Quantization and scaling

The quantization and scaling process is carried out differently for a quality base layer and a quality enhancement layer, as follows.
3.5.2.1 Quality base layer

The following quantization formula is used:

    | l | = floor( | c | / Delta( QP ) + | f | ),   sign( l ) = sign( c ),

where c denotes the transform coefficient, l denotes the corresponding quantized value (level), QP is the quantization parameter, and f is the deadzone/offset parameter with an absolute value ranging between 0 and 1/2 and with the same sign as the coefficient that is being quantized. The quantization step size Delta( QP ) is specified by the quantization table CoeffQuant[] together with the matrix M_4x4 for 4x4 blocks and the matrix M_8x8 for 8x8 blocks.

The reconstruction c' of a transform coefficient using a given level l is calculated as

    c' = l * S( QP ),

with the same notations as given above, and where the dequantization table S( QP ) is given by the matrices S_4x4 and S_8x8.
Note that for the 8x8 blocks each row of S_8x8 represents scaled step sizes equivalent to the corresponding basic step size in {0.625, 0.6875, 0.8125, 1.0, 1.125}. The entries of M_8x8 were derived from the corresponding entries of S_8x8 by using a relation that involves n_i, the six different values of the squared norms of the underlying 2-D basis functions.
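The dead-zone quantizer described above can be sketched as follows. This is an illustrative sketch only: the step size delta is taken as a plain input here, whereas in the SVM it is derived from QP via the CoeffQuant[] and S scaling tables.

```python
# Illustrative dead-zone quantizer matching the description above: the offset
# f has the same sign as the coefficient and |f| lies in [0, 1/2]. The step
# size delta is an assumed input, standing in for the QP-derived tables.

def quantize(c, delta, f=1.0 / 3.0):
    level = int(abs(c) / delta + f)      # truncation toward zero (dead zone)
    return level if c >= 0 else -level

def reconstruct(level, delta):
    return level * delta

print(quantize(10.0, 4.0))    # -> 2   (10/4 + 1/3 = 2.83... truncated to 2)
print(reconstruct(2, 4.0))    # -> 8.0
```

A larger |f| shrinks the dead zone around zero; f = 1/2 turns the quantizer into plain rounding.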
3.5.2.2 Quality enhancement layer

Each enhancement layer contains the residue between the spatial (4x4 or 8x8) transform coefficients of the original subband picture obtained after the MCTF of the corresponding spatial layer and their reconstructed base layer representation (or the subordinate enhancement layer representation). Each enhancement layer contains a refinement signal that corresponds to a bisection of the quantization step size.
The quantization parameters QP_i for the macroblocks of the i-th enhancement layer (with i = 0 specifying the base layer), which are used in the inverse scaling process (see below), are determined as follows:

If the macroblock does not contain any transform coefficients for which a transform coefficient level not equal to zero has been transmitted in the base layer representation or any previous enhancement layer representation, the quantization parameter is calculated as specified in AVC [1] using the syntax element mb_qp_delta.

Otherwise (the macroblock contains at least one transform coefficient for which a transform coefficient level not equal to zero has been transmitted in the base layer or any previous enhancement layer representations), the quantization parameter is calculated as follows:

    QP_i = max( 0, QP_(i-1) - 6 )
At the decoder side, the reconstruction of a transform coefficient c_k at scanning position k is obtained by

    c_k = SUM_i InverseScaling( l_(i,k), QP_i ),

where l_(i,k) represents the transform coefficient level that has been coded in the i-th enhancement layer for the transform coefficient c_k and QP_i is the corresponding macroblock quantization parameter. The function InverseScaling(.) represents the coefficient reconstruction process specified in Section 3.5.2.1 above.
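The layered reconstruction can be sketched as follows. This is an illustrative sketch under two simplifying assumptions: the QP is always derived by the step-size-bisection rule (in the SVM, mb_qp_delta applies when no previous layer carried a non-zero level), and inverse_scaling() stands in for the process of Section 3.5.2.1, using the rule that the step size doubles every 6 QP units.

```python
# Sketch of the enhancement-layer reconstruction: the decoded coefficient is
# the sum of the inverse-scaled levels of all layers. inverse_scaling() is a
# stand-in for the reconstruction process of Section 3.5.2.1; the step size
# here illustratively doubles every 6 QP units.

def inverse_scaling(level, qp):
    return level * 2.0 ** (qp / 6.0)

def reconstruct_coefficient(levels, qp0):
    """levels[i] is the level coded in the i-th layer (i = 0: base layer)."""
    c, qp = 0.0, qp0
    for i, level in enumerate(levels):
        if i > 0:
            qp = max(0, qp - 6)       # each refinement layer bisects the step size
        c += inverse_scaling(level, qp)
    return c

print(reconstruct_coefficient([3, 1, -1], 24))   # -> 52.0  (48 + 8 - 4)
```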
3.6 Motion Coding

The encoding of the motion and the residual texture is based on the AVC [1] algorithm, with a few modifications to handle the spatial and SNR scalability aspects.
3.7 Entropy Coding

3.7.1 Quality base layer

The CABAC entropy coding is used for encoding the 4x4 and the 8x8 blocks in a quality base layer. For details, see the AVC specification [1].
3.7.2 Quality enhancement layer

The quantized levels of the 4x4 and 8x8 blocks in a quality enhancement layer are coded as specified below. For each quality enhancement layer, the coding process for the transform coefficient refinement levels is divided into 3 scans.

In the first scan, the refinement levels of all transform coefficients with the following properties are coded:
- The transform coefficient levels that have been coded in the base layer representation and all subordinate enhancement layer representations are equal to zero. Such transform coefficients are also referred to as non-significant transform coefficients in the following.
- The transform coefficient is located inside a transform block that includes at least one transform coefficient for which a transform coefficient level not equal to zero has been transmitted in the base layer representation or any subordinate enhancement layer representation. Those transform coefficient blocks are also referred to as significant transform coefficient blocks in the following.

In the second scan, the refinement levels of all transform coefficients with the following property are coded:
- A transform coefficient level not equal to zero has been coded in the base layer or any previous enhancement layer representation. Such transform coefficients are also referred to as significant transform coefficients in the following.

Finally, in the third scan, all remaining refinement levels are coded. The corresponding transform coefficients have the following properties:
- The transform coefficient levels that have been coded in the base layer representation and all subordinate enhancement layer representations are equal to zero (non-significant transform coefficients).
- The transform coefficient is located inside a transform block that does not include any transform coefficient for which a transform coefficient level not equal to zero has been transmitted in the base layer representation or any subordinate enhancement layer representation. Those transform coefficient blocks are also referred to as non-significant transform coefficient blocks.
In each scan, the corresponding transform coefficients are transmitted in the order that is specified by the following pseudo-code:
    for( scan_index = 0; scan_index < 16; scan_index++ )
    {
        //===== luma coefficients =====
        for( block_y = 0; block_y < 4*frame_height_in_mb; block_y++ )
        for( block_x = 0; block_x < 4*frame_width_in_mb;  block_x++ )
        {
            if( transform_size( MB[ block_y/4, block_x/4 ] ) == 8x8 )
            {
                b8x8_y  = block_y / 2
                b8x8_x  = block_x / 2
                scan8x8 = 4 * scan_index + 2 * ( block_y % 2 ) + ( block_x % 2 )
                encode_8x8luma_coefficient( b8x8_y, b8x8_x, scan8x8 )
            }
            else
            {
                encode_4x4luma_coefficient( block_y, block_x, scan_index )
            }
        }
        if( scan_index == 0 )
        {
            //===== chroma DC coefficients =====
            for( DC_index = 0;  DC_index < 4;  DC_index++ )
            for( component = 0; component < 2; component++ )
            for( mb_y = 0; mb_y < frame_height_in_mb; mb_y++ )
            for( mb_x = 0; mb_x < frame_width_in_mb;  mb_x++ )
            {
                encode_chromaDC_coefficient( component, mb_y, mb_x, DC_index )
            }
        }
        else
        {
            //===== chroma AC coefficients =====
            for( component = 0; component < 2; component++ )
            for( block_y = 0; block_y < 2*frame_height_in_mb; block_y++ )
            for( block_x = 0; block_x < 2*frame_width_in_mb;  block_x++ )
            {
                encode_chromaAC_coefficient( component, block_y, block_x, scan_index )
            }
        }
    }
The variables frame_width_in_mb and frame_height_in_mb specify the frame width and height in macroblock units, respectively. The function transform_size( MB[ y, x ] ) returns the transform size (8x8 or 4x4) of the macroblock at the macroblock location ( x, y ). The highest-level index scan_index specifies the frequency band of transform coefficients. In each scan, the frequency bands are transmitted in a global zig-zag scan order from low to high frequency bands, using the zig-zag scan that is specified in AVC for the scanning of transform coefficient levels inside a 4x4 transform block. Note that the transform coefficients of 8x8 transform blocks are mapped onto four neighboring 4x4 blocks. Within a frequency band, first all corresponding luma coefficient levels are transmitted in raster scan order; thereafter the corresponding chroma coefficient levels are coded.
The transmitted syntax elements for each transform coefficient level depend on the current scan. In the first and third scan, in which transform coefficient levels for non-significant transform coefficients (see above) are transmitted, the following syntax elements (cp. [1]) are coded for each transform coefficient in the specified order:

significant_coeff_flag: This syntax element specifies whether a transmitted transform coefficient level is equal to zero. If this flag is equal to zero, the transform coefficient level is equal to zero and no further syntax elements are transmitted for the transform coefficient level. In AVC [1], the syntax element significant_coeff_flag is never transmitted for the last transform coefficient (in scanning order) inside a block. For the progressive refinement packets, this syntax element is always transmitted (if no previously coded syntax elements indicate that the transform coefficient level is equal to zero). Therefore, we added 4 additional CABAC contexts, one for the last scanning position of each used block category (cp. [1]).

last_significant_coeff_flag: This syntax element is only transmitted when the flag significant_coeff_flag is equal to 1. It indicates whether the current transform coefficient level represents the last significant transform coefficient level in scanning order inside a transform block. If this flag is equal to 1, no further information is transmitted for all remaining transform coefficient levels of the transform block. That is, the corresponding transform coefficient levels are excluded from the first and third scan (coding of non-significant transform coefficient refinements).

coeff_sign_flag: This syntax element is only transmitted when the flag significant_coeff_flag is equal to 1. It specifies the sign of the transform coefficient level.

coeff_abs_level_minus1: This syntax element is only transmitted when the flag significant_coeff_flag is equal to 1. It specifies the absolute value minus 1 of the transform coefficient level.
In addition to the syntax elements specified above, several macroblock-based syntax elements can be transmitted just before the flag significant_coeff_flag. These syntax elements, including semantics and the conditions under which they are transmitted, are summarized in Table 1. Note that all transform coefficient levels that are signaled to be equal to zero by a bit of the coded_block_pattern or by the coded_block_flag are excluded from the first and third scan of the enhancement layer coding process. This also includes the current transform coefficient level. Thus, if for example a bit equal to zero of the syntax element coded_block_pattern is coded, no further syntax elements are transmitted for the current transform coefficient level.
Table 1: Macroblock-based syntax elements for encoding the first and third scan of a progressive refinement layer.
syntax element: 1st bit of coded_block_pattern
transmitted when: the current transform coefficient is the first non-significant transform coefficient inside the first 8x8 luma block of a macroblock
semantics (cp. AVC [1]): If this bit is equal to zero, it indicates that the transform coefficient levels for all non-significant transform coefficients inside the first 8x8 luma block of a macroblock are equal to zero. Thus, these transform coefficients are excluded from the first or third scan.

syntax element: 2nd bit of coded_block_pattern
transmitted when: the current transform coefficient is the first non-significant transform coefficient inside the second 8x8 luma block of a macroblock
semantics: If this bit is equal to zero, it indicates that the transform coefficient levels for all non-significant transform coefficients inside the second 8x8 luma block of a macroblock are equal to zero. Thus, these transform coefficients are excluded from the first or third scan.

syntax element: 3rd bit of coded_block_pattern
transmitted when: the current transform coefficient is the first non-significant transform coefficient inside the third 8x8 luma block of a macroblock
semantics: If this bit is equal to zero, it indicates that the transform coefficient levels for all non-significant transform coefficients inside the third 8x8 luma block of a macroblock are equal to zero. Thus, these transform coefficients are excluded from the first or third scan.

syntax element: 4th bit of coded_block_pattern
transmitted when: the current transform coefficient is the first non-significant transform coefficient inside the fourth 8x8 luma block of a macroblock
semantics: If this bit is equal to zero, it indicates that the transform coefficient levels for all non-significant transform coefficients inside the fourth 8x8 luma block of a macroblock are equal to zero. Thus, these transform coefficients are excluded from the first or third scan.

syntax element: 5th bit of coded_block_pattern
transmitted when: the current transform coefficient is the first non-significant chroma coefficient inside the macroblock
semantics: If this bit is equal to zero, it indicates that the transform coefficient levels for all non-significant chroma transform coefficients inside the macroblock are equal to zero. Thus, these transform coefficients are excluded from the first or third scan.

syntax element: 6th bit of coded_block_pattern
transmitted when: the current transform coefficient is the first non-significant chroma AC coefficient and the already coded 5th bit of the syntax element coded_block_pattern is equal to 1
semantics: If this bit is equal to zero, it indicates that the transform coefficient levels for all non-significant chroma AC transform coefficients inside the macroblock are equal to zero. Thus, these transform coefficients are excluded from the first or third scan.

syntax element: coded_block_flag
transmitted when: the current transform coefficient is the first non-significant transform coefficient of a 4x4 transform block or a 2x2 chroma DC transform block
semantics: If this bit is equal to zero, it indicates that the transform coefficient levels for all non-significant transform coefficients inside the current transform block are equal to zero. Thus, these transform coefficients are excluded from the first or third scan.

syntax element: mb_qp_delta
transmitted when: for the current transform coefficient a bit not equal to zero of the syntax element coded_block_pattern is transmitted, all previously transmitted bits of the syntax element coded_block_pattern are equal to zero, and all bits of the syntax element coded_block_pattern that have been coded for the base layer representation or previous enhancement layer representations are equal to zero
semantics: This syntax element specifies the quantization parameter for the current macroblock. The quantization parameter is computed as specified in AVC.

syntax element: transform_size_8x8_flag
transmitted when: the syntax element transform_8x8_mode_flag is equal to 1, for the current transform coefficient a bit not equal to zero of the syntax element coded_block_pattern is transmitted, all previously transmitted bits of the syntax element coded_block_pattern are equal to zero, and all bits of the syntax element coded_block_pattern that have been coded for the base layer representation or previous enhancement layer representations are equal to zero
semantics: This syntax element specifies the transform size (4x4 or 8x8) for the luminance signal of the current macroblock.
In the second scan of an enhancement layer representation, in which transform coefficient levels for significant transform coefficients (coefficients for which non-zero levels have been coded in the base layer or any subordinate enhancement layer representation) are transmitted, the following syntax elements are transmitted for each transform coefficient in the specified order:

coeff_refinement_flag: This syntax element indicates whether the transform coefficient refinement level is equal to zero. If the syntax element is equal to zero, no further information is transmitted for the current transform coefficient. For the encoding of this syntax element an additional CABAC context has been added.

coeff_refinement_direction_flag: This syntax element specifies the sign of the transform coefficient refinement level. If this syntax element is equal to 1, the sign of the transform coefficient refinement level is equal to the sign of its base layer representation (which is coded in the base layer or in a subordinate enhancement layer); otherwise, the refinement level has the opposite sign. For the encoding of this syntax element an additional CABAC context has been added.

Note that for a transform coefficient refinement level in the second scan only the values -1, 0, and 1 are supported by the syntax.
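The decoding of a second-scan refinement level from these two flags can be sketched as follows; the function name and the +1/-1 sign representation are illustrative helpers.

```python
# Sketch of how a second-scan refinement level is derived from the two flags
# described above. base_sign is the sign (+1 or -1) of the significant
# coefficient as reconstructed from the subordinate layers.

def decode_refinement(refinement_flag, direction_flag, base_sign):
    if refinement_flag == 0:
        return 0                      # refinement level is zero
    # direction_flag == 1: same sign as the base representation
    return base_sign if direction_flag == 1 else -base_sign

print(decode_refinement(1, 0, +1))   # -> -1  (opposite sign refinement)
```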
3.8 Deblocking

3.8.1 Deblocking filter process

The AVC deblocking filter is employed to reduce the blocking artefacts induced by block-based motion compensation and quantization of the block transform coefficients. This section describes the deblocking filter process.
Inputs to this process are a low-pass picture L_k, a prediction data array M_(P,k+1), and an array C_(H,k+1) specifying quantisation parameters and transform coefficient levels for each macroblock of the low-pass picture L_k.

Output of the process is a modified low-pass picture L_k.
The deblocking filter process for the picture L_k is applied as specified in the AVC standard [1], where
- the macroblock modes, the sub-macroblock modes, the reference indices, and the motion vectors are extracted from the given prediction data array M_(P,k+1), and
- the transform coefficient levels and quantisation parameters are extracted from the array C_(H,k+1).
4 Operational Encoder Control

In this section, the operational encoder control implemented in the SVM software [17] is described. Section 4.1 explains the general coding and decomposition structure, while Section 4.2 describes how fine granular scalability is realized. In Sections 4.3 and 4.4, the algorithm for determining the prediction data arrays used for motion-compensated temporal filtering and the mode decision algorithm used for encoding the subband representation are described, respectively. The concept for selecting the quantisation parameters that are used for encoding the subband pictures is presented in Section 4.5.
4.1 Scalability and Decomposition Structure

Two types of base layer coding can be applied. The base layer can either be encoded using MCTF as explained in Section 3.4, or it can be encoded using a single-layer AVC coding scheme as explained in Section 2.1.3.

Several types of enhancement layers can be included to achieve the desired combined spatio-temporal-SNR scalability. These scalability dimensions are explained in Section 2.2 and, in particular, their combination in Section 2.2.4.
4.2 Fine grain SNR scalability

Fine grain SNR scalability (FGS) is obtained using the progressive refinement representations presented in Sections 3.5.2.2 and 3.7.2. The corresponding NAL units can be truncated at any arbitrary point. For any spatio-temporal resolution, a minimum bit-rate, which represents the corresponding spatial base layer representation (including the base layer representations of the lower-resolution layers), must be transmitted; these bit-rates can be adjusted in a way that the corresponding reconstructions represent the minimally acceptable video quality. Above the minimum bit-rate for a spatio-temporal resolution, any bit-rate can be extracted by truncating the progressive refinement NAL units of the corresponding spatio-temporal layer and all lower resolution layers in a suitable way.
In order to extract a requested bit-rate from the overall bit-stream, the following algorithm is used. Let R_t specify the target bit-rate for a spatio-temporal resolution S_t x T_t, and let R_0 be the base layer bit-rate for this spatio-temporal resolution, i.e. the bit-rate that corresponds to the base layer representation including all spatially lower resolutions for the temporal resolution T_t.
If R_0 is greater than R_t, the requested spatio-temporal rate point cannot be extracted from the overall bit-stream. Otherwise, the following applies:
- The target rate is modified to R_t = R_t - R_0.
- The progressive refinement packets are processed from the lowest supported spatial resolution to the target spatial resolution S_t, and for each spatial resolution, the progressive refinement packets are processed from the lowest refinement layer to the highest refinement layer. For each progressive refinement layer of a spatio-temporal resolution S x T_t, the following applies:
  o Let R_F be the bit-rate of the i-th progressive refinement representation for the spatio-temporal resolution S x T_t.
  o If R_F is less than or equal to R_t, the corresponding progressive refinement packets are fully included into the extracted bit-stream, and the target rate is modified to R_t = R_t - R_F.
  o Otherwise, the corresponding progressive refinement packets are truncated. Let L be the original length of a progressive refinement packet; during truncation, the length is set to L * R_t / R_F. The target rate is then set to zero: R_t = 0.
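For a single spatial resolution, the extraction loop above can be sketched as follows; the function name and the fraction-based return value are illustrative helpers, and proportional truncation of the last packet is assumed.

```python
# Sketch of the FGS bit-stream extraction loop for one spatio-temporal
# resolution: subtract the base-layer rate, then include refinement packets
# from the lowest to the highest layer until the target rate is exhausted.

def extract(target_rate, base_rate, refinement_rates):
    """Return the fraction of each progressive refinement layer to keep,
    or None if the rate point cannot be extracted."""
    if base_rate > target_rate:
        return None                        # below the minimum bit-rate
    remaining = target_rate - base_rate
    fractions = []
    for r_f in refinement_rates:           # lowest to highest refinement layer
        if r_f <= remaining:
            fractions.append(1.0)          # packets fully included
            remaining -= r_f
        else:
            fractions.append(remaining / r_f)  # packets truncated
            remaining = 0
    return fractions

print(extract(100, 40, [30, 30, 30]))   # -> [1.0, 1.0, 0.0]
```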
4.3 Motion estimation and mode decision process
This section describes an example method for determining the prediction data arrays M_P used in the prediction steps. This method is implemented in the SVM software [17]. The algorithm employs Lagrangian optimisation techniques that are widely used to optimise the rate-distortion efficiency of hybrid video coders [11]. A similar algorithm was integrated into the test model JM-2 [12] for the AVC standard.
Inputs to this process are a variable idxLP, reference index lists refIdxList0 and refIdxList1, and an ordered set of low-pass pictures { L_k[ 0 ], ..., L_k[ N_k - 1 ] }.

Output of this process is a prediction data array M_P.
Furthermore, for controlling the motion estimation / mode decision process, the encoder control has to select the number of active entries for both reference index lists refIdxList0 and refIdxList1 as well as a quantisation parameter QP ∈ [0; 51]. The selected quantisation parameter determines the operating point of the encoder control.
Based on the given quantisation parameter QP, two Lagrangian multipliers λ_SAD and λ_SSD are derived by
  λ_SSD = 0.85 * 2^((QP - 12) / 3)
  λ_SAD = sqrt(λ_SSD)
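The multiplier derivation can be written directly in code; this informative sketch follows the JM-style rule reconstructed above.

```python
import math

def lagrange_multipliers(qp):
    """Derive the two Lagrangian multipliers from the quantisation
    parameter QP (0..51): lambda_SSD scales exponentially with QP,
    and lambda_SAD is its square root (used with the SAD distortion
    measure in the motion search)."""
    lambda_ssd = 0.85 * 2.0 ** ((qp - 12) / 3.0)
    lambda_sad = math.sqrt(lambda_ssd)
    return lambda_sad, lambda_ssd
```

Note how an increase of QP by 6 doubles λ_SSD twice (a factor of 4), shifting the operating point towards lower rate.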
Let R_0 and R_1 specify the sets of active entries of the reference index lists refIdxList0 and refIdxList1, respectively.
The prediction data array M_P is estimated in a macroblock-wise manner by using the following process.
1. For each possible macroblock partition P (and sub-macroblock mode p_sub-mb if applicable) of the current macroblock, the prediction method p_pred together with the associated reference indices r_0 and/or r_1 and motion vectors {m_0} and {m_1} is determined by the following algorithm.
For all sub-macroblock partitions P_i of the current macroblock partition P (and sub-macroblock modes p_sub-mb if applicable), list 0 and list 1 motion vector candidates m_0(r_0, i) and m_1(r_1, i) for all reference indices r_0 ∈ R_0 and r_1 ∈ R_1 are obtained by minimizing the Lagrangian functional
  m_0/1(r_0/1, i) = argmin_{m ∈ S} [ D_SAD(i, r_0/1, m) + λ_SAD * R(m) ]
with the distortion term being given as
  D_SAD(i, r_0/1, m) = Σ_{(x,y) ∈ P_i} | l_org[x, y] - l_ref,0/1[x + m_x, y + m_y] |
where l_org[] represents the luma sample array of the picture L_k[idxLP], and l_ref,0/1[] represents the luma sample array of the picture L_k[refIdxList0/1[r_0/1]], which is referenced by the reference index r_0/1. S is the motion vector search range.
The terms R(r_0/1) and R(m_0/1) specify the number of bits needed to transmit the reference index r_0/1 and all components of the motion vector m_0/1, respectively.
The motion search proceeds first over all integer-sample accurate motion vectors in the given search range S. Then, given the best integer motion vector, the eight surrounding half-sample accurate motion vectors are tested, and finally, given the best half-sample accurate motion vector, the eight surrounding quarter-sample accurate motion vectors are tested. For the half- and quarter-sample accurate motion vector refinement, the term l_ref,0/1[x + m_0/1,x, y + m_0/1,y] has to be interpreted as an interpolation operator.
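The three-stage search (integer, half-sample, quarter-sample) can be sketched as follows (informative only). The cost callback is a hypothetical stand-in for the Lagrangian functional (distortion plus rate term); motion vectors are represented in quarter-sample units.

```python
def motion_search(cost, search_range):
    """Sketch of the three-stage motion search described above.

    `cost(mx, my)` is a hypothetical Lagrangian cost functional
    evaluated at a motion vector given in quarter-sample units;
    `search_range` is the integer search range S.
    """
    # Stage 1: all integer-sample vectors (4 quarter-sample units each).
    best = min(((4 * mx, 4 * my)
                for mx in range(-search_range, search_range + 1)
                for my in range(-search_range, search_range + 1)),
               key=lambda m: cost(*m))
    # Stage 2: the eight surrounding half-sample vectors (step 2);
    # Stage 3: the eight surrounding quarter-sample vectors (step 1).
    for step in (2, 1):
        candidates = [best] + [(best[0] + dx * step, best[1] + dy * step)
                               for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                               if (dx, dy) != (0, 0)]
        best = min(candidates, key=lambda m: cost(*m))
    return best  # best motion vector in quarter-sample units
```

In the real encoder the fractional-sample costs require the interpolation operator mentioned above; here that detail is hidden inside the cost callback.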
Given the motion vector candidates for all sub-macroblock partitions P_i and reference indices r_0 ∈ R_0 and r_1 ∈ R_1, the list 0 and list 1 reference indices r_0 and r_1 and the associated motion vectors {m_0} and {m_1} for list 0 and list 1 prediction are selected by minimizing the Lagrangian functional
  r_0/1 = argmin_{r ∈ R_0/1} Σ_i [ D_SAD(i, r, m_0/1(r, i)) + λ_SAD * ( R(r) + R(m_0/1(r, i)) ) ]
where the summation proceeds over all sub-macroblock partitions P_i (with i being the sub-macroblock partition index) of a macroblock partition P.
Given the determined reference indices r_0 and r_1 and the associated motion vectors {m_0} and {m_1}, the Lagrangian costs for list 0 and list 1 prediction J_L0 and J_L1 are calculated by
  J_L0/L1 = Σ_i [ D_SAD(i, r_0/1, m_0/1(r_0/1, i)) + λ_SAD * ( R(r_0/1) + R(m_0/1(r_0/1, i)) ) ]
Given the reference indices r_0 and r_1 and the associated motion vectors {m_0} and {m_1} for list 0 and list 1 prediction, the reference indices r_B0 and r_B1 and the associated motion vectors {m_B0} and {m_B1} for bi-prediction are obtained by the following iterative algorithm.
Initially, the reference indices and motion vectors for bi-prediction are set equal to the reference indices and motion vectors that have been determined for list 0 and list 1 prediction,
  r_B0 = r_0,  m_B0 = m_0,  r_B1 = r_1,  m_B1 = m_1,
an iteration index iter is set equal to 0, iter = 0, and the Lagrangian cost for bi-prediction J_BI is set equal to J(r_B0, r_B1, {m_B0}, {m_B1}) with
  J(r_B0, r_B1, {m_B0}, {m_B1}) = Σ_i D_BI(i, r_B0, r_B1, m_B0(i), m_B1(i)) + λ_SAD * ( R(r_B0) + R(r_B1) + Σ_i ( R(m_B0(i)) + R(m_B1(i)) ) )
and the distortion term being given as
  D_BI(i, r_B0, r_B1, m_B0, m_B1) = Σ_{(x,y) ∈ P_i} | l_org[x, y] - ( ( l_ref,0[x + m_B0,x, y + m_B0,y] + l_ref,1[x + m_B1,x, y + m_B1,y] + 1 ) >> 1 ) |
Subsequently, in each iteration step, the following applies.
o If (iter % 2) is equal to 0, the following applies.
  A list 0 reference index r*_B0 and associated list 0 motion vectors {m*_B0} are determined by minimizing the following Lagrangian functional
    ( r*_B0, {m*_B0} ) = argmin over r ∈ R_0 and m(i) ∈ S*(m_B0(i)) of Σ_i D_BI(i, r, r_B1, m(i), m_B1(i)) + λ_SAD * ( R(r) + Σ_i R(m(i)) )
  where the search range S*(m_B0(i)) specifies a small area around the motion vector m_B0(i).
  If the associated cost measure J_iter = J(r*_B0, r_B1, {m*_B0}, {m_B1}) is less than the minimum cost measure J_BI, the list 0 reference index r*_B0 is assigned to r_B0, the associated list 0 motion vectors {m*_B0} are assigned to {m_B0}, and the minimum cost measure is updated, J_BI = J_iter.
o Otherwise ((iter % 2) is equal to 1), the following applies.
  A list 1 reference index r*_B1 and associated list 1 motion vectors {m*_B1} are determined by minimizing the following Lagrangian functional
    ( r*_B1, {m*_B1} ) = argmin over r ∈ R_1 and m(i) ∈ S*(m_B1(i)) of Σ_i D_BI(i, r_B0, r, m_B0(i), m(i)) + λ_SAD * ( R(r) + Σ_i R(m(i)) )
  If the associated cost measure J_iter = J(r_B0, r*_B1, {m_B0}, {m*_B1}) is less than the minimum cost measure J_BI, the list 1 reference index r*_B1 is assigned to r_B1, the associated list 1 motion vectors {m*_B1} are assigned to {m_B1}, and the minimum cost measure is updated, J_BI = J_iter.
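The alternating refinement can be sketched as follows (informative only). The callbacks and the fixed iteration count are assumptions of this sketch; this excerpt does not state the termination criterion of the iterative algorithm.

```python
def refine_biprediction(j_bi, refine_l0, refine_l1, state, max_iter=4):
    """Sketch of the alternating bi-prediction refinement.

    `state` holds the bi-prediction parameters (r_B0, {m_B0}, r_B1,
    {m_B1}) in some encoder-defined representation; `refine_l0` /
    `refine_l1` are hypothetical callbacks that re-optimise one list
    while the other list is held fixed and return a candidate state;
    `j_bi` evaluates the bi-prediction Lagrangian cost J_BI of a
    state.  Returns the best state and its cost.
    """
    best_cost = j_bi(state)              # initial J_BI
    for it in range(max_iter):
        # Even iterations refine list 0, odd iterations refine list 1.
        candidate = refine_l0(state) if it % 2 == 0 else refine_l1(state)
        cand_cost = j_bi(candidate)      # J_iter
        if cand_cost < best_cost:        # keep only improving updates
            state, best_cost = candidate, cand_cost
    return state, best_cost
```

Because only strictly improving candidates are accepted, the cost sequence J_BI is non-increasing, so the procedure is guaranteed not to degrade the initial list 0 / list 1 solution.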