Using Graphics Processor Units (GPUs) for Image Analysis: A
Comparison between GPU and Conventional Software Engineering


Peter Kehoe, 51040294, MSc Software Engineering

peter.kehoe3@mail.dcu.ie



ABSTRACT


The rapid pace of Graphics Processor Unit (GPU) development in recent years, in terms of both performance and programmability, has attracted the attention of computer scientists seeking to leverage alternative architectures for better performance than commodity CPUs can provide. In this paper, the potential of the GPU in image analysis is examined, specifically in shot boundary detection and keyframe selection techniques. We first introduce the programming model of the GPU and go on to explain in detail the implementation of histogram-based shot boundary detection and keyframe selection on both the CPU and GPU. We compare the approaches, discuss the specific challenges presented by the GPU, and present performance results for both processors. Overall, these results demonstrate the considerable potential of the GPU in this domain, with significant speedups relative to the CPU.


1. INTRODUCTION


Over the last few years, programmers and computer scientists have increasingly investigated the potential of Graphics Processor Units (GPUs) for a variety of computational tasks beyond graphics rendering. The motivation for such work is the promise of high potential speedups compared to commodity desktop CPUs, thanks to the very high parallelism employed by GPUs and the massive advantage this gives the GPU in floating-point computational capability. The GPU also presents unique challenges, with a distinct programming model.


This work examines the potential of the GPU compared to the CPU for image analysis, specifically in shot boundary detection and keyframe selection. Shot boundary detection is the process of segmenting a video into its component camera shots, which may be delineated by an opening and closing cut. A shot represents the unbroken sequence of frames from a single camera; a shot boundary occurs when the sequence switches to another shot from another camera or viewpoint. Shot boundary detection is commonly used in automated video content analysis. Aside from the boundary detection itself, a subsequent technique called keyframe selection may be used to identify a frame that will represent a given shot. This paper examines the implementation of both techniques on the CPU and GPU, and compares the approaches and their performance. Our overall results confirm the potential of the GPU for these techniques.


2. RELATED WORK


The shot boundary detection technique presented herein, like many, works with decompressed video frames. Thus the process of taking a compressed video file, such as an MPEG video file, and decompressing it is a first step in this process. The decompression itself takes a significant proportion of the time required for the entire process, and represents a good candidate for acceleration on the GPU. This work has already been undertaken by a number of graphics processor vendors, for example in nVidia's PureVideo technology and ATi's Avivo. Both companies' technologies offload a number of the most computationally intensive aspects of MPEG decoding to the GPU, in order to speed up the process over the CPU alone.


There has been a significant amount of research into shot boundary techniques; the proceedings of the TRECVid [TRECVid] conferences over the last five years or so present a good body of knowledge in the field. Many techniques exist for shot boundary detection, including pixel and histogram comparisons and statistical differences. Some approaches focus on different types of shot boundary, from hard cuts to gradual transitions such as fades and dissolves.


3. GPU BACKGROUND


GPUs, or Graphics Processor Units, emerged in the PC space in response to the growing demands placed on rendering capability, driven primarily by the videogames market and the breakneck pace at which advances are expected therein. At one point, virtually all rendering tasks were performed on the CPU. However, it soon became clear that dedicated co-processors in the form of the GPU would be required to accelerate the pace of improvement in real-time graphics rendering. In recent years there has been a growing interest in using GPUs for tasks beyond rendering, that is, for general computation. This interest has been stoked for a number of reasons. First, there has been a steady increase in GPU capability and programmability. Initially, GPUs were fixed-function pieces of silicon that simply took input on one end via an API and produced a picture as output, with little scope for programmer control in the process. This trade-off was made because of the optimisation possible in such dedicated processors, but in 2001 that all changed with the introduction of the first GPUs with programmable shaders. The word shader has a dual meaning here: it can refer to processors in the GPU hardware itself, and to the software programs that run on them. The first generation of programmable hardware was quite limited in its capability, and required shader programs to be written in an assembly-like language. Since then, however, there have been two major revisions to the programmable graphics pipeline (driven chiefly by Microsoft's DirectX API), each bringing increasingly general programming capability and more flexible input and output options, including, critically, 32-bit floating-point support. Also crucial has been the emergence of high-level languages, including nVidia's Cg ('C for Graphics') and Microsoft's HLSL (High Level Shading Language). These languages are quite similar to C in syntax, with support for branching, loops, and a wide variety of data types.


Though the increasing generality of GPUs has been a key enabler for general computation on the GPU, a second factor has been as much if not more influential in driving interest in the field, namely performance. A GPU today boasts a much higher capability in floating-point calculations than commodity CPUs. To look at an example from a couple of years ago, the nVidia GeForce FX 5900 Ultra GPU of the time could manage 20 billion floating-point multiplies per second, compared to a 3GHz Pentium 4, which peaks at 6 billion floating-point multiplies [Buck, I.]. This performance is due to the high level of parallelism used in GPUs. Aside from computational prowess, the GPU can also claim much higher bandwidth to its local memory. A high-end GPU currently can have as much as 512MB of RAM locally, with up to 50GB/s of bandwidth to that memory. By comparison, today's high-end CPU has up to 8.5GB/s of main memory bandwidth.


The numbers, on paper at least, make a compelling case for the GPU, and go some way toward explaining why the GPU has received increasing focus as a platform for computation beyond rendering. However, the GPU is not a silver bullet, suitable for any task one could throw at it. The GPU embodies a different programming model than that on the CPU, with limitations and performance pitfalls waiting to trip up the unwary programmer.


4. THE GPU PROGRAMMING MODEL


There is a key limitation in the GPU programming model that any programmer must first understand before using the GPU for computation: limited output. On the CPU, a programmer is used to being able to write to any location in memory at any time. This is a scatter capability. For example, we can write to a memory location using an expression such as a[x+2], where x is some integer variable and a is an array. On the GPU, this is not possible. In a shader, the number of outputs is limited to at most one RGBA colour value (i.e. a pixel), and the pixel that a shader can write to is fixed and pre-determined. This limitation is likely to be relaxed in future generations of hardware, but for now it is a key characteristic that must be understood.
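
To make the distinction concrete, the short C++ sketch below contrasts a scatter-style write with the gather-only style a shader is restricted to. The function names and the simulated "shader" are purely illustrative; they are not part of any GPU API.

    #include <cstddef>
    #include <vector>

    // CPU-style scatter: the write address depends on data computed at run time.
    void cpuScatter(std::vector<float>& a, int x, float value) {
        a[x + 2] = value;                    // we may write anywhere in the array
    }

    // GPU-style "shader": the invocation may read (gather) from anywhere in its
    // input texture, but its only output is the single value it returns for the
    // one pixel it has been assigned.
    float shaderGather(const std::vector<float>& texture, std::size_t width,
                       std::size_t x, std::size_t y) {
        return texture[y * width + x];       // gather is allowed; scatter is not
    }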


Input is read into shaders from 2D arrays of data called textures. In graphics, textures are used to apply 2D images to 3D surfaces, to give the appearance of texture, but any data can be stored in textures for the purposes of general computation. Texture coordinates are used to index into a texture and read a value from it.


The slow readback of results from the GPU to the CPU must also be considered. This is due to the relatively narrow bus between the CPU and GPU mentioned earlier. This can be one of the biggest limitations on effective GPU performance, and it encourages the programmer to place as much of a task's computation on the GPU as possible. A second, similar issue is passing input data to the GPU during computation. Ideally only a small amount of traffic on the bus between the CPU and GPU will be required during computation; otherwise, again, performance gains due to faster computation on the GPU could be reduced or wiped out entirely by the significant cost of transferring data to the GPU.


What we set out to do in this practicum was to investigate whether GPUs could be used to perform some video analysis and video structuring applications with a performance level greater than that of the same task on a general-purpose CPU. The specific tasks we chose for analysis were shot boundary detection and selection of keyframes from a shot. In the next section we introduce each of these applications and outline how we have implemented each using both a conventional CPU and a GPU approach.


All development and testing for the project was performed using a machine with the following specifications:

CPU: AMD 3800+
RAM: 512MB
GPU: nVidia 7800 GT (450MHz, 24 pixel shaders)
GPU RAM: 256MB (bandwidth: 32GB/s)

The API used for interfacing with the GPU was OpenGL, and the shader language chosen was nVidia's Cg. All other programming was in C++. For performance tests, timings were averaged over 10 runs.


5. SHOT BOUNDARY DETECTION


We begin by looking at shot boundary detection, examining the approaches on the CPU and GPU, and comparing performance characteristics and actual results. Shot boundary detection is an important pre-processing stage in video analysis. It involves determining the boundaries between one 'shot' and the next in a video sequence, where a 'shot' is defined as the video taken by a single camera over time. In our implementations we focussed on a histogram-based approach, popular due to its performance and accuracy [Zhang, H.]. Our focus was on hard cut detection rather than gradual shot transitions such as fades or dissolves.


The process for histogram-based shot boundary detection is made up of three parts (a compact sketch of the pipeline follows the list):

1. First, we compute colour histograms, taking the video's decoded frames as input and producing a histogram for each frame. A histogram can be thought of as a vector, where each component represents a count of the number of colours in a frame that fall within a certain range, called a bin. For example, a 32-bin histogram for an 8-bit RGB frame would hold a count of all RGB values between 0 and 7 in its first bin, 8 and 15 in its second, and so on.

2. Compute the difference between each frame's histogram and that of the frame immediately following it. Again, a comparison with vectors can be made: the difference between two frames' histograms can be calculated as the vector distance between them. The difference between neighbouring frames' histograms is calculated on the basis that a hard cut in a scene will often be revealed by a large difference in the histograms of neighbouring frames.

3. Identify candidate shot boundaries using the histogram differences calculated in step 2. This is usually done by comparing the differences with a threshold: if the difference exceeds the given threshold, the neighbouring frames are marked as representing a cut.
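
As referenced above, the sketch below shows in outline how the three steps chain together on the CPU. The first two steps are left as declarations (their implementations are discussed in Section 5.1) and only the simple thresholding of step 3 is filled in; all names are illustrative rather than taken from the actual implementation.

    #include <cstddef>
    #include <vector>

    using Histogram = std::vector<float>;
    using Frame = std::vector<unsigned char>;

    // Steps 1 and 2, declared only (see Section 5.1 for the CPU versions).
    std::vector<Histogram> computeHistograms(const std::vector<Frame>& frames, int bins);
    std::vector<float> histogramDifferences(const std::vector<Histogram>& histograms);

    // Step 3: mark a cut wherever the (normalised) difference exceeds a threshold.
    std::vector<std::size_t> detectCuts(const std::vector<float>& differences, float threshold) {
        std::vector<std::size_t> cuts;
        for (std::size_t i = 0; i < differences.size(); ++i)
            if (differences[i] > threshold)
                cuts.push_back(i);           // cut between frame i and frame i + 1
        return cuts;
    }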


We will start by looking at the approach taken on the CPU.


5.1 The CPU Approach


The implementation of a CPU approach to shot boundary detection is reasonably straightforward. We use a simple histogram class with functions for generating a histogram based on a provided array of frame data, and for calculating the distance between that histogram and a second one passed to it.


Histogram Computation


The Histogram class's constructor allows us to pass an array of image data, together with the number of bins we would like in the histogram. Assuming an 8-bit RGB image, it uses the number of bins provided to calculate the range each bin will represent. The computation itself looks at each image element once and increments the appropriate bin's count based on the element's value. This algorithm maps very intuitively and easily to the CPU's capabilities in terms of gather (reading from any memory location, or here, reading each image element sequentially) and scatter (writing to any memory location, here based on the image element's value). This allows for an effective approach with a minimal amount of code.
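
A minimal sketch of such a histogram class follows. The layout (a flat array of 8-bit colour components, one bin count per value range) and the names are assumptions made for illustration; they do not reproduce the project's actual class.

    #include <cstdint>
    #include <vector>

    class Histogram {
    public:
        // data: flat array of 8-bit colour components; bins: number of histogram bins.
        Histogram(const std::vector<std::uint8_t>& data, int bins)
            : counts_(bins, 0.0f) {
            const int range = 256 / bins;     // bin width; assumes bins divides 256 evenly
            for (std::uint8_t v : data)
                counts_[v / range] += 1.0f;   // scatter: the bin index depends on the data
        }

        const std::vector<float>& counts() const { return counts_; }

    private:
        std::vector<float> counts_;
    };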


Histogram Difference Computation


The Histogram class provides a function that calculates the distance between the histogram and a second one passed to the function as a parameter, as if they were points in a vector space. That is, we calculate a vector between the two histograms by subtracting one histogram from the other, and then calculate the length of this vector; this is the vector distance between the two histograms, which we use as a measure of their difference. We use the Euclidean norm to calculate the vector's length: we square each component of the vector, sum the components, and take the square root of the sum.


We simply cycle through the array of histograms, calculating the distance between one frame and the next, and store this in an array of differences.
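
A sketch of the distance function and of the loop over consecutive frames is shown below, with a histogram represented simply as a vector of bin counts. The names are illustrative and assume the two histograms have equal length.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Euclidean (L2) distance between two histograms of equal length.
    float distance(const std::vector<float>& a, const std::vector<float>& b) {
        float sum = 0.0f;
        for (std::size_t i = 0; i < a.size(); ++i) {
            const float d = a[i] - b[i];
            sum += d * d;                    // square each component and accumulate
        }
        return std::sqrt(sum);               // length of the difference vector
    }

    // Distance between each frame's histogram and the next frame's histogram.
    std::vector<float> histogramDifferences(const std::vector<std::vector<float>>& histograms) {
        std::vector<float> diffs;
        for (std::size_t i = 0; i + 1 < histograms.size(); ++i)
            diffs.push_back(distance(histograms[i], histograms[i + 1]));
        return diffs;
    }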


It would be possible to use this array directly to perform shot boundary detection, but a final pass over the array is performed to help reduce the number of false positives. It can sometimes occur that large difference values are recorded in the frames surrounding one cut, due to high levels of motion in the video, which could cause a single cut to be recorded multiple times. In order to prevent this, a technique borrowed from [Luo, M.] is used. It takes each difference value and divides it by the maximum of the difference values in a sliding window centred on that value. So if the sliding window has a width of 25, we look at the 12 values preceding the current difference value and the 12 values following it, take the maximum, and divide the difference value by it. The final array is then used in shot boundary detection with a preset threshold value.
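
The sketch below illustrates this normalisation pass. The window width of 25 matches the example above; the clamping of the window at the ends of the array and the handling of an all-zero window are assumptions made for illustration.

    #include <algorithm>
    #include <vector>

    // Divide each difference by the maximum difference in a window centred on it.
    std::vector<float> normaliseDifferences(const std::vector<float>& diffs, int window = 25) {
        const int half = window / 2;                       // 12 values either side for window = 25
        const int n = static_cast<int>(diffs.size());
        std::vector<float> result(diffs.size());
        for (int i = 0; i < n; ++i) {
            const int lo = std::max(0, i - half);
            const int hi = std::min(n - 1, i + half);
            float maxVal = 0.0f;
            for (int j = lo; j <= hi; ++j)
                maxVal = std::max(maxVal, diffs[j]);
            result[i] = (maxVal > 0.0f) ? diffs[i] / maxVal : 0.0f;   // avoid divide-by-zero
        }
        return result;
    }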


5.2 The GPU Approach


The approach taken to implement hard cut detection on the GPU differs quite significantly from that on the CPU. Let's look at each of the processes in turn.


Histogram Computation


Histogram computation is relatively straightforward on the CPU due to its competency with both gather and scatter operations. In contrast, a GPU shader lacks any scatter capability. As explained earlier, this means the location the shader writes to in memory is preset and cannot be changed within the shader. This presents some challenges, and initially it was not clear whether this would transition well to the GPU at all.


Our first approach considered a shader which took the frame as input as a texture, and drew to a buffer with a height of 1 and a width equal to the number of bins in the histogram. Thus, each bin would be handed to a different shader unit on the GPU and have the shader executed for it. This required the shader to read in every pixel in the frame once, and compare it to the range of values the bin represented. In order for every frame pixel to be read, a much larger number of texture coordinates were required than could be passed as parameters, and thus a non-trivial amount of computation of texture coordinates within the shader is required. Also, the bin's range must be calculated based on one of the texture coordinates passed, which indicates whether the shader is dealing with the first, second, third bin and so on. This could be avoided by having the shader draw to a buffer with width 1 and height 1 (in effect dealing with one bin only) and passing the bin range directly as parameters that change with each bin, but this approach would nullify the potential for parallelism. What this all means, in short, is that for n bins the entire frame must be examined n times, which seems significantly less efficient than the CPU implementation, which only examines a frame once for its entire histogram. That aside, the approach considered above has a number of other less desirable characteristics, including the high number of texture reads required for each bin (effectively the entire frame must be read), and the texture coordinate computation that must also be performed within the shader to accommodate this.


As it turns out, a somewhat cleaner approach is possible that is not immediately obvious or intuitive. The idea for this approach came from a sample in nVidia's developer SDK [nVidia]. It leverages the capability to query the GPU about a shader that is executing. This querying is exposed by the API, and can be used in rendering to determine, for example, whether an object is occluded or not (the application would execute a shader to draw a proxy object in place of the actual, more complex version, and use a query to determine if the object was drawn or discarded). This capability can be applied to histogram computation. In this case, each bin of the histogram is addressed in turn. For each bin, the shader takes the frame as input and draws it unchanged to another buffer, but first it checks whether the pixel to be drawn is within the range of the current bin, whose minimum and maximum values are passed as parameters. If the pixel is within the given range it is drawn; if not, it is discarded. The query over this shader simply counts the number of pixels drawn, effectively computing the value for the current bin. Note that in this approach we are still passing over every frame n times for n bins. However, in contrast to the previous approach considered, we can now pass the bin's minimum and maximum values to the shader directly as parameters with each pass, obviating the need for computation of the minimum and maximum values within the shader, without sacrificing parallel speedups. Moreover, each execution of the shader deals with only one pixel from the input frame, allowing a 1:1 mapping between texture coordinates and the location in the output buffer that is being processed. This also avoids the need for any texture coordinate manipulation in the shader. This approach also involves drawing to a larger buffer than in the first technique (it is the same size as the input texture), which maps more closely to the scale of work the GPU's parallelism is optimised for in the first place.
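
To make the structure of this approach concrete, the sketch below models on the CPU what each per-bin pass computes: every pixel of the frame is tested against the bin's range, and the "query result" is simply the count of pixels that would have been drawn. This models the data flow only; the real implementation issues one shader pass plus an occlusion-style query per bin rather than looping in C++, and the bin layout again assumes the bin count divides 256 evenly.

    #include <cstdint>
    #include <vector>

    // CPU model of one per-bin GPU pass: count the pixels whose value falls in
    // [binMin, binMax], i.e. the pixels the shader would draw rather than discard.
    int binPass(const std::vector<std::uint8_t>& frame, int binMin, int binMax) {
        int drawn = 0;
        for (std::uint8_t v : frame)
            if (v >= binMin && v <= binMax)
                ++drawn;                       // pixel passes the range test and is "drawn"
        return drawn;                          // the value the query returns for this bin
    }

    // The full histogram requires n passes over the frame, one per bin.
    std::vector<int> gpuStyleHistogram(const std::vector<std::uint8_t>& frame, int bins) {
        std::vector<int> counts(bins, 0);
        const int range = 256 / bins;          // assumes bins divides 256 evenly
        for (int b = 0; b < bins; ++b)
            counts[b] = binPass(frame, b * range, (b + 1) * range - 1);
        return counts;
    }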


Histogram Difference Computation


The input and end result of the difference computation on the GPU are the same as on the CPU: we want to take neighbouring frames' histograms and calculate the vector distance between them. However, the form of that input and the process of calculating the difference are very different.


Considering the input first, a simple approach might involve sending a single frame's histogram and its neighbour's as two separate textures to the GPU, executing the shader on that input, and repeating for every frame. Though it is intuitive and conceptually close to the CPU implementation, this approach does not reflect the kind of workloads a GPU is optimised to handle, which typically involve larger textures and output buffers much larger than 1 x n, where n is the number of bins, typically 32 or 64. This approach also incurs more overhead in the form of function calls to set up the input textures for each frame. Instead, we take a different, slightly less intuitive approach. We calculate the difference for every pair of frames in one shader pass by packing all the histograms into two textures: one contains histograms 0 to m-2, where m is the number of histograms, and the other contains histograms 1 to m-1. So, in short, the second texture is the same as the first, except shifted by one histogram. This allows the shader to access a frame's histogram and its neighbour's: if one instance of a shader has texture coordinates of (0.5, 0.5), that index will return the first histogram's bin from the first texture, and the same coordinates will retrieve the second histogram's bin from the second texture. The input textures take the form n x (m-1), where n is the number of bins and m is the number of histograms, meaning that each histogram occupies one 'row' in the texture.


Having considered input, we'll look at the process of computing the distance on the GPU. Although we previously referred to the computation as being performed by a single shader, in reality multiple shaders are used in sequence. That sequence is:

Shader 1: Takes the two input textures as outlined above and subtracts one from the other. This results in the vector between each pair of histograms, whose length we will subsequently calculate to get the distance between the histograms. Since it is easy to do so at this point, we also square the result, which takes care of part of the Euclidean norm calculation. The output is a new texture of the same size as the input textures.


Shader 2: Takes the output texture of Shader 1 and uses row reduction to sum the values in each row of the texture (i.e. each histogram), reducing the texture to one column.


Shader 3: Takes the output buffer of Shader 2 and simply calculates the square root of each element.

Thus, after these three passes, we are left with a column of values to be read back to the CPU, containing the vector distances between each frame and its successor. Figure 3 illustrates the process on the GPU.


Figure 3. The sequence of shader passes used to compute histogram differences on the GPU.
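
The following sketch models the three passes on the CPU, operating on the packed layout described above (one histogram per texture row). It is intended only to show what each pass produces; the types and names are illustrative.

    #include <cmath>
    #include <cstddef>
    #include <vector>

    using Texture = std::vector<std::vector<float>>;   // [row][column], one histogram per row

    // Pass 1: subtract the shifted texture from the original and square the result.
    Texture subtractAndSquare(const Texture& a, const Texture& b) {
        Texture out = a;
        for (std::size_t r = 0; r < a.size(); ++r)
            for (std::size_t c = 0; c < a[r].size(); ++c) {
                const float d = a[r][c] - b[r][c];
                out[r][c] = d * d;
            }
        return out;
    }

    // Pass 2: row reduction - sum each row (each histogram) down to a single column.
    std::vector<float> rowReduce(const Texture& t) {
        std::vector<float> column(t.size(), 0.0f);
        for (std::size_t r = 0; r < t.size(); ++r)
            for (float v : t[r])
                column[r] += v;
        return column;
    }

    // Pass 3: square root of each element, giving the distance for each frame pair.
    std::vector<float> sqrtEach(std::vector<float> column) {
        for (float& v : column)
            v = std::sqrt(v);
        return column;
    }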



We read the values back to the CPU in order to perform the final pass, because it is much more suited to computation on the CPU. Recall that the final pass divides each value in the array of histogram differences by the maximum difference in a sliding window centred on that value. In order to do this on the GPU, for each value we would have to read perhaps 25 values from a texture and compute the maximum of these. Moreover, the texture coordinates used to access those values would also need to be calculated. Such a high ratio of texture access to computation is not a good mix for the GPU, although with further work and time it may be possible to map this process well to the GPU.


After the distance values are read back to the CPU, they are compared to the threshold value in order to flag shot boundaries, in exactly the same way as in the CPU implementation. Unfortunately it is simply not feasible to do this part of the process on the GPU, given the lack of any file input/output capability. Even if such capability were available, the parallel nature of the GPU computation would complicate the file output: some cuts could be recorded in the file before others, even though chronologically they arrive later in the sequence. Regardless, this part of the process is not very expensive from a computational point of view.


5.3 Performance Comparison


The results for histogram-based shot boundary detection are presented in Table 1.


Table 1. Histogram-based Shot Boundary Detection (32 bins)

# Frames     500 (20s)    1000 (40s)    2000 (1min 20s)
CPU Result   9.679s       18.743s       35.443s
GPU Result   2.889s       5.303s        10.283s


There is a clear performance advantage for the GPU: it is roughly 3.3 to 3.5 times faster than the CPU. Performance scales roughly linearly on both the CPU and GPU with increasing numbers of frames. To see where the gain comes from, we'll look at the performance of histogram computation alone, one part of the process. The results are in Table 2.


Table 2. Histogram Computation (32 bins, frame resolution 352x240)

# Frames     500 (20s)    1000 (40s)    2000 (1min 20s)
CPU Result   8.480s       16.998s       34.012s
GPU Result   2.801s       5.272s        10.211s


As we can see, histogram computation represents the vast bulk of the work performed, and it is the GPU's gains here that result in the high overall gains. Recall from our discussion of the CPU and GPU approaches to histogram computation that the CPU approach intuitively seemed more elegant and efficient, needing to examine every frame only once, compared to the GPU, which needs to examine a frame once for each bin in the histogram. However, in this instance we see a case of brute computational force outweighing efficiency concerns. The GPU is optimised for the kind of drawing our approach engages in, and with two comparisons per frame element (against the minimum and maximum of the current bin) that can be parallelised across the multiple ALUs in each pixel shader, and the high level of parallelism across the pixel shaders, the GPU easily wins, if somewhat unexpectedly.


6. KEYFRAME SELECTION


A process often used subsequent to shot boundary detection is keyframe selection. With a set of video shots, each delineated by its opening and closing cuts, it is often desirable to select a keyframe to represent each shot. There are a number of ways this can be done. The cheapest way is simply to select the frame in the middle of the shot; however, this frame may not actually be representative of the rest of the frames in the shot. A more desirable approach is to select the frame that is most like every other frame in the shot. One way to implement such an approach is to calculate the difference between each frame and every other frame in the shot, using their histograms, and to average the differences for each frame. The frame with the lowest average distance between itself and every other frame is selected as the keyframe. The disadvantage of this approach is that it incurs a much greater computational cost, of the order n x n, where n is the number of frames in the scene. Thus, it was decided to attempt to perform this computation on the GPU and compare it with a CPU implementation. As with shot boundary detection, we will first look at the details of the implementation on both the CPU and GPU, and then compare performance.


6.1 CPU Approach


Assuming a prior step of shot boundary detection, histograms for every frame in a given scene will already have been computed, ready for use in keyframe selection. The process itself is very simple, and can be coded in a very straightforward manner on the CPU. We use a for loop to loop over every frame's histogram, and within that loop use a second for loop to loop over every frame again, calculating the vector distance between the current frame's histogram and every other frame's histogram. The results are accumulated into a float value, and finally divided by the number of frames in the scene to find the average. When all frames have been processed, the result is an array of average distances; we then simply traverse the array to find the minimum distance, and return the index at which that distance is located, which is the same as the keyframe's number (counting from zero).
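
A compact sketch of this nested-loop selection follows, reusing the hypothetical distance() helper from the earlier shot boundary sketches.

    #include <cstddef>
    #include <vector>

    float distance(const std::vector<float>& a, const std::vector<float>& b);  // as defined earlier

    // Returns the index of the frame whose histogram has the lowest average
    // distance to every other frame's histogram in the shot.
    std::size_t selectKeyframe(const std::vector<std::vector<float>>& histograms) {
        std::size_t best = 0;
        float bestAverage = -1.0f;
        for (std::size_t i = 0; i < histograms.size(); ++i) {
            float total = 0.0f;
            for (std::size_t j = 0; j < histograms.size(); ++j)
                total += distance(histograms[i], histograms[j]);   // O(n*n) comparisons overall
            const float average = total / histograms.size();
            if (bestAverage < 0.0f || average < bestAverage) {
                bestAverage = average;
                best = i;                                          // current best keyframe candidate
            }
        }
        return best;
    }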


6.2 GPU Approach


The GPU approach is again significantly different from that taken on the CPU. It uses some techniques similar to those used in the shot boundary detection GPU implementation. Again, how input is passed to the GPU is a key factor in achieving good performance, so we'll look at this first. As with the first step of shot boundary detection, we want to calculate the vector distances between histograms, but in this case we need to find the distance between one given histogram and every other histogram, and repeat this process for every frame. A logical extension of the techniques used in the shot boundary detection approach might thus see input in the form of two textures: one texture containing every frame's histogram (with each histogram on one 'row' of the texture, as before), and one texture containing a single frame's histogram copied across a texture of the same size as the first. See Figure 4 for an illustration. This seems sound initially, as it requires only one texture to be prepared for each frame being considered, since the texture containing every frame's histogram need only be prepared and transferred to the GPU once and reused for subsequent frames. However, as it turns out, the cost of preparing and, more particularly, transferring even one texture for each frame is extremely high relative to the amount of computation to be performed, and it scales linearly with the number of frames in the scene. Having tested this approach, it is significantly slower than the CPU implementation, sharply highlighting the need to pay attention to data transfer to and from the GPU.


Figure 4. Initial input layout for keyframe selection: one texture holding every frame's histogram and a second, equally sized texture holding one frame's histogram repeated.


Figure 5. Improved input layout: the single histogram is transferred once as a 1 x n texture and read repeatedly within the shader as a circular buffer.


However, there is a better way. The texture we were preparing and transferring to the GPU for each frame is simply one histogram repeated m times, where m is the number of frames in the scene. Intuitively this seems wasteful: do we really need to transfer all of this data to the GPU when only one row in the texture is actually unique? The answer, luckily, is no. Packing the histogram into the texture m times has the benefit of allowing a 1:1 mapping between the texture coordinates accessed in each of the two input textures, making it very easy to access the input data without altering the interpolated texture coordinates automatically passed into the shader. But with some relatively straightforward manipulation of texture coordinates within the shader, we can access the same histogram over and over, as if it were a circular buffer. This means we only need to transfer the histogram once, as a 1 x n texture (where n is the number of bins), which is vastly cheaper than transferring a texture with one row for each of the m frames, and this has the added bonus of being a constant cost regardless of the number of frames in the scene. See Figure 5 for an illustration.
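
The sketch below models the two indexing schemes on the CPU: the naive layout indexes into a texture that stores the histogram once per frame, while the circular-buffer layout wraps the row index so that a single stored histogram serves every read. In the real implementation the wrapping is done by manipulating texture coordinates in the shader; the helper names here are illustrative.

    #include <cstddef>
    #include <vector>

    // Naive layout: the same n-bin histogram is stored once per frame (m rows),
    // so m * n values must be transferred to the GPU.
    float readRepeated(const std::vector<float>& repeated, std::size_t bins,
                       std::size_t row, std::size_t bin) {
        return repeated[row * bins + bin];
    }

    // Circular-buffer layout: the histogram is stored (and transferred) once as a
    // 1 x n texture; the row index is wrapped so every row reads the same data.
    float readWrapped(const std::vector<float>& single, std::size_t bins,
                      std::size_t row, std::size_t bin) {
        const std::size_t rowsStored = single.size() / bins;   // 1 for a single histogram
        return single[(row % rowsStored) * bins + bin];
    }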


Now that we know what our input looks like, we'll look at the computation itself. As before, we have split it into a number of passes with different shaders. A breakdown of the shaders and what they do follows:

Shader 1: This pass subtracts the single histogram in the second texture from each histogram in the first texture, producing a buffer of the same size as the first texture. It also squares the result.

Shader 2: This pass takes the output of Shader 1 and uses row reduction to sum the values in each row of the texture (i.e. in each histogram), reducing the buffer to a column vector.

Shader 3: This pass takes the output of Shader 2 and simply takes the square root of each element; effectively we now have the vector distance between the current histogram and every other histogram.

Shader 4: This pass uses column reduction to sum every distance, reducing our column vector to a single value.


This process is illustrated in Figure 6.


Figure 6. The sequence of shader passes used for keyframe selection on the GPU.


The single output value of the final pass is then read back to the CPU, and subsequent processing is performed there. The value is divided by the number of frames to get the average vector distance between the current frame and every other frame. The average difference is stored in an array, and once every frame has been processed, we find the minimum of these values exactly as on the CPU, mark its associated frame as the keyframe, and print out the result. Since we need to print out the result, something shaders do not support, it would be impossible to carry the computation through exclusively on the GPU; hence we read back slightly earlier and perform the remainder on the CPU.


6.3 Performance Comparison


The results for keyframe selection are
presented in Table 3.



Table 3. Histogram-based Keyframe Selection (32 bins)

# Frames     500 (20s)   1000 (40s)   2000 (1min 20s)   3000 (2min)   4000 (2min 40s)
CPU Result   0.077s      1.089s       4.515s            9.729s        17.472s
GPU Result   0.648s      0.559s       1.234s            1.584s        1.926s



The table shows results for increasing numbers of frames. As we can see, the results bear some interesting characteristics.


The first result worth commenting on is that for 500 frames. As we can see, the GPU is significantly outperformed by the CPU in this case. The reason? There is a constant amount of one-off setup work that needs to be performed before any computation can take place on the GPU, including loading shader programs, any initial texture input, and so forth. All of this initial setup consists of several function calls, with a non-negligible level of overhead. The end result is that if we are only performing a relatively small amount of work on the GPU, the actual time to do that processing will be overshadowed by the time to set things up, which is what we are seeing happen here. It does not matter how fast we can do the actual work if the initial overhead is much larger than that. The amount of computation to be performed here with 500 frames is simply too low relative to the amount of setup required, and thus the GPU loses to the CPU, which does not require any such setup and can start computing immediately.


Beyond that first result, however, we see some dramatic numbers. With 1000 frames and beyond the GPU wins easily, with a speedup ranging from roughly 2x to roughly 9x. As we can see, the speedup the GPU achieves scales dramatically with an increasing number of frames. The reason for this is that the keyframe selection algorithm is of order n x n, where n is the number of frames: the amount of computation required scales quadratically with more frames, since every frame is compared to every other. This does not favour the CPU, where computation is relatively expensive, but on the GPU computation is cheap, and while the CPU starts to scale poorly at the frame counts above, the GPU's capability is not significantly tested. The approach we've taken seems quite ideally suited to the GPU, in fact: the GPU is strong at raw computation but vulnerable to high levels of traffic on the bus between the GPU and main memory, and in this instance computational demands scale quadratically, while the amount of data to be transferred to the GPU to perform that computation scales merely linearly in comparison. Hence, with increasing numbers of frames the GPU really starts to pull ahead of the CPU. It is worth noting again that if we had formed our input to the GPU as initially proposed above, the data transferred on the bus to the GPU would also have scaled quadratically, negating the computational advantage; as it is, that concern is neatly avoided.


7. PRODUCTIVITY CONSIDERATIONS


As demonstrated above, the GPU presents considerable opportunity for significantly better performance compared to the CPU. However, at the same time, the development experience is quite different, and it is worth noting how it compares to the CPU.


To begin, consider the experience of a programmer who has no prior experience with the GPU. For such a programmer there is a lot to learn in order to take best advantage of the GPU. Such a programmer may be used to certain approaches to development on the CPU, and to the somewhat forgiving nature of the CPU: we can code an algorithm for a CPU in any number of ways, and the CPU will try to make the best of our code and execute it as fast as possible. In contrast, the GPU is far less forgiving. Shader programming itself is not so far removed from what a programmer may be used to on the CPU; the syntax and constructs are very similar for anyone who has coded functions before. However, pitfalls await elsewhere, chiefly in how to feed data to the GPU and get data back from it. Here the experience diverges greatly from the CPU, where free and (relatively) easy memory access is the norm. Another consideration is that of algorithm design: with the GPU, we need to devise algorithms that suit the GPU, rather than algorithms that necessarily map easily from one's own thinking on a problem. This may require multiple attempts before a satisfactory approach is arrived at. In short, there is a reasonably steep learning curve for the inexperienced programmer.


Once experienced, however, things become easier. With practice, and more exposure to different examples of algorithms that work well on the GPU, it becomes easier to implement others, and to foresee potential problems and difficulties earlier. However, there is still the issue of the sheer volume of code required for a GPU implementation versus a CPU implementation. In order to do anything on the GPU, a series of function calls and other related code is required before any computation can begin. This is not a huge issue, but initially it helps to have some familiarity with the GPU API (such as OpenGL), and to package that code away such that it can be reused in subsequent projects. Despite this, in applications where performance is key, a GPU implementation is certainly well worth investigating.


8. CONCLUSIONS & FUTURE WORK


The GPU offers a powerful and unique platform for computation, with the potential for significant performance gains over traditional CPU computation. This paper examined two image analysis techniques, histogram-based shot boundary detection and keyframe selection, and demonstrated the performance gains of the GPU implementation in each case, ranging from 3 to 9 times the performance of the CPU implementations. We also examined the programming model of the GPU, and explained the areas of concern for the programmer when mapping algorithms to the GPU. As we showed, the GPU performs best when the transfer of data to and from the GPU is limited and the intensity of the computation is high. We also demonstrated the need to carefully examine the performance characteristics of GPU implementations, as approaches that intuitively should give better performance may not always do so, and vice versa; this is particularly true when weighing data readback against further computation on the GPU. This work also considered the learning curve and development issues around GPU coding, and concluded that the extra effort required is well worthwhile where performance is a key concern and gains are to be had on the GPU.


Future work in this area presents some interesting possibilities. First, some of the techniques presented here could likely be optimised further for greater performance improvement. The shot boundary detection techniques could also be expanded to incorporate further boundary types, such as fades and dissolves. Beyond the specifics of the techniques presented here, there will be emerging opportunities due to the general improvement of the hardware itself. Aside from the fast growth in GPU performance, a number of enabling capabilities should emerge in the near future. ATi, for example, have recently announced a virtual machine for general-purpose GPU computation that allows one to bypass graphics-orientated APIs like OpenGL and DirectX, and that exposes key capabilities including scatter in shaders [Segal, M.]. Such relaxation of limitations should continue into the future, allowing revision of existing implementations for performance improvement, and the implementation of new algorithms that may previously have been unsuited to GPU computation.

9. REFERENCES


[TRECVid] TREC Video Retrieval Evaluation Online Proceedings, http://www-nlpir.nist.gov/projects/tvpubs/tv.pubs.org.html

[Buck, I.] "A Toolkit for Computation on GPUs", Chapter 37, GPU Gems, Addison-Wesley, 2004

[Zhang, H.] "Automatic Partitioning of Full-motion Video", Multimedia Systems (1993), Vol. 1, No. 1, pages 10-28

[Segal, M.] "A Performance-Orientated Data Parallel Virtual Machine for GPUs", http://www.ati.com/developer/siggraph06/dpvm_e.pdf

[Luo, M.] "Shot Boundary Detection using Pixel-to-Neighbor Image Differences in Video", TRECVid 2004 Proceedings, http://www-nlpir.nist.gov/projects/tvpubs/tvpapers04/umd.pdf

[nVidia] "nVidia SDK, Featured Code Samples", http://download.developer.nvidia.com/developer/SDK/Individual_Samples/featured_samples.html, May 2005

[Harris, M.] "Mapping Computational Concepts to GPUs", Chapter 31, GPU Gems 2, Addison-Wesley, 2005

[Göddeke, D.] "GPGPU Basic Math Tutorial", http://www.mathematik.uni-dortmund.de/~goeddeke/gpgpu/tutorial.html

[Buck, I.] "Taking the Plunge into GPU Computing", Chapter 32, GPU Gems 2, Addison-Wesley, 2005

[nVidia] "Cg Toolkit User's Manual Release 1.4.1", http://download.nvidia.com/developer/cg/Cg_1.4/1.4.1/Cg-1.4.1_UsersManual.pdf, March 2006

[Horn, D.] "Stream Reduction Operations for GPGPU Applications", Chapter 36, GPU Gems 2, Addison-Wesley, 2005

[nVidia] "Cg Reference Manual Release 1.4.1", http://download.nvidia.com/developer/cg/Cg_1.4/1.4.1/Cg-1.4.1_ReferenceManual.pdf