Big Performance in Small Packages

Dec 2, 2013

Supercomputer Performance on a Chip Powers Next-Generation Embedded Image Processing


A new framework for the C language has been developed to move appropriate code to parallel graphics processors for fast execution in graphical and other numeric-intensive applications. A new generation of hybrid processors is emerging to take advantage of this capability.


by Dr. Vijay Reddi, Advanced Micro Devices


Current-generation imaging applications have exhausted every method for squeezing out additional performance. First, microprocessor vendors deepened on-die caches, created SIMD (single instruction multiple data) instruction set extensions to process media streams, and then implemented multi-threading and clock frequency scaling. However, imaging applications typically operate on data sets that are an order of magnitude larger than caches, which better serve traditional code and static (re-used) data. Although the rest of the performance-enhancing techniques help image processing, we will demonstrate here a much more efficient and scalable way to attack the problem, drawing from supercomputing pedigree.


Large racks populated with CPU blades and loud fans have been deployed in MRI and CT scanning equipment, often with ASICs and FPGAs as additional performance-boosting offload engines. Additional imaging applications have emerged, from surveillance to facial recognition, built around smaller box PCs. But the performance of such CPU-centric systems on these imaging applications leaves much to be desired.


Conventional sequential microprocessors and coding languages have run their course and are ill-prepared for the faster real-time performance and smaller system size demands of next-generation medical imaging, video surveillance, homeland security, and similar applications.


A fresh approach is needed for the embedded market, one that can now be realized thanks to advances in graphics processing units (GPUs) and an innovative programming language, OpenCL.


The OpenCL programming language was developed to provide developers with a platform to create C-based applications that run on the CPU but can also offload parallel kernels to the GPU. GPUs are implemented with dozens to hundreds of very powerful math engines with fast local RAM. PCs with dual- or quad-core desktop processors and discrete PCI Express graphics cards can certainly deliver the performance required for the embedded market; however, next-generation systems require smaller size and lower power consumption.


Big Performance in Small Packages

Graphics processing has evolved in a relatively short period of time from specialized supercomputers to powerful Graphics Processing Unit (GPU) add-in cards. High-end GPUs pack Teraflops of floating-point compute horsepower onto a single PCI Express graphics card.


At the same time, the massively parallel processing logic once dedicated to specific 3D graphics processing tasks has become more flexible and programmable. Enhancements to the GPUs have enabled these processors to address a wider range of applications. While GPUs are not suitable for accelerating every application, applications with characteristics similar to graphics workloads can achieve large improvements in performance and power efficiency by exploiting the GPU. The best-suited applications are characterized by many largely independent tasks, ideally operating on large, regularly structured data sets.

Smaller die geometries have enabled the first family of single-die CPU+GPU solutions, known as heterogeneous multi-core processors, which greatly reduce space and heat while increasing the data bandwidth between CPU cores and graphics cores. These Fusion processors or APUs (accelerated processing units) are positioned well to reduce the size and weight of imaging systems dramatically.

The low-power APUs integrate GPUs capable of tens to hundreds of Gigaflops onto the same die as multiple conventional CPU cores. AMD's new Fusion processors, such as the AMD Embedded G-Series dual-core 1.6GHz T56N CPU with integrated AMD Radeon HD 6310 GPU shown in Figure 1, are ushering in a new era of efficient supercomputing for embedded applications that are not well served by conventional CPUs.

GPUs are optimized primarily for graphics tasks, and are best suited to certain types of parallel workloads. Because of their data-parallel execution logic, GPUs are effective at tackling problems that can be decomposed into a large number of independent parallel tasks. With hundreds of computing cores, modern GPUs are much more scalable than the handful of cores offered in a CPU-centric paradigm. The term ‘GPGPU’ refers to the use of a GPU for general-purpose parallel computations.


GPUs are geared for processing many independent ‘work items’ in parallel. When it comes to graphics rendering, the particular work items are operations on vertices and pixels, including texturing and shading calculations. In a general-purpose GPU (GPGPU) program, a set of operations is typically executed in parallel on each item in a data set. Applications well suited for GPU execution, such as image processing, have large data sets, high parallelism, and minimal dependency between data elements.

A common form for a data set in a GPGPU application is a 2D grid, because this fits naturally with the rendering model built into GPUs. Many computations map naturally onto grids: matrix algebra, image processing, physically based simulation, and so on. OpenCL kernels can also take advantage of dedicated texture processing hardware in the GPU to perform various 2D filtering operations on certain memory reads.
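The grid mapping described above can be sketched in plain C. This is an illustrative host-side stand-in rather than OpenCL code: the names `threshold_item` and `threshold_image` are hypothetical, and the nested loops take the place of the runtime launching one work item per pixel. In an actual OpenCL C kernel the coordinates would come from `get_global_id(0)` and `get_global_id(1)`.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel body: one "work item" handles exactly one pixel at (x, y).
   In OpenCL C, x and y would be read from get_global_id(0) and (1). */
static void threshold_item(const unsigned char *src, unsigned char *dst,
                           size_t width, size_t x, size_t y,
                           unsigned char level)
{
    size_t i = y * width + x;               /* row-major index into the grid */
    dst[i] = (src[i] >= level) ? 255 : 0;   /* no dependency on other pixels */
}

/* Host-side stand-in for launching a width x height grid of work items.
   On a GPU the iterations would run in parallel, in no particular order. */
void threshold_image(const unsigned char *src, unsigned char *dst,
                     size_t width, size_t height, unsigned char level)
{
    assert(src != NULL && dst != NULL);
    for (size_t y = 0; y < height; ++y)
        for (size_t x = 0; x < width; ++x)
            threshold_item(src, dst, width, x, y, level);
}
```

Because each output pixel depends only on the matching input pixel, the work items can execute in any order, which is the property that makes this kind of workload a good fit for a GPU.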


Originated by Apple and turned over to the Khronos Group for standardization, Open Computing Language (OpenCL) is a framework for writing programs that can exploit heterogeneous platforms consisting of CPUs, GPUs, and other processors. OpenCL includes a language based on C99 for writing kernels (functions that execute on OpenCL devices), plus APIs that are used to define and then control the platforms. Each kernel is typically the main body of a loop or routine. The developer specifies the kernel, memory regions, and data set to process using OpenCL constructs.
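To illustrate the idea that a kernel is the body of a loop, the following plain C sketch shows the same computation twice: once as an ordinary sequential loop, and once split into a kernel-style function plus a stand-in for the runtime. All names here are hypothetical, and `run_1d` only imitates what an OpenCL runtime does when it enqueues a kernel over a 1D range; it is not part of any OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential form: the loop and its body live together. */
void scale_add_loop(const float *a, const float *b, float *out,
                    size_t n, float k)
{
    for (size_t i = 0; i < n; ++i)
        out[i] = k * a[i] + b[i];
}

/* Kernel form: only the loop body remains. gid identifies the work item;
   in OpenCL C this would be a __kernel function using get_global_id(0). */
void scale_add_kernel(const float *a, const float *b, float *out,
                      float k, size_t gid)
{
    out[gid] = k * a[gid] + b[gid];
}

/* Stand-in for the runtime: invokes the kernel once per work item.
   A real device would run these invocations concurrently. */
void run_1d(const float *a, const float *b, float *out,
            float k, size_t global_size)
{
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t gid = 0; gid < global_size; ++gid)
        scale_add_kernel(a, b, out, k, gid);
}
```

The loop bounds move out of the code and into the launch parameters (the "data set to process"), while the buffers passed in correspond to the memory regions the developer declares.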


OpenCL provides parallel computing using task-based and data-based parallelism, and presents the GPU resources in a very clean manner as having many instantiations of a single processor type, buffers and memory spaces. AMD provides an SDK for its Fusion series of APUs to allow embedded developers to get started with OpenCL.


OpenSURF: Community Development with a Vision

As an example of the use of OpenCL with a GPGPU, one group within the open-source community analyzed and profiled the components of the Speeded Up Robust Features (SURF) Computer Vision algorithm written in OpenCL. SURF analyzes an image like the one in Figure 2 and produces feature vectors for points of interest (“ipoints”).


SURF features have been used to perform operations like object recognition, feature comparison, and face recognition. Each feature vector describes an ipoint and consists of the location of the point in the image, the local orientation at the detected point, the scale at which the interest point was detected, and a descriptor vector (typically 64 values) that can be used to compare with the descriptors of other features.


A diagram of the application is shown in Figure 3. To find points of interest, SURF applies a Fast-Hessian Detector that uses approximated Gaussian Filters at different scales to generate a stack of Hessian matrices. SURF utilizes an integral image, which allows scaling of the filter instead of the image. The location of the ipoint is calculated by finding the local maxima or minima in the image at different scales using the generated Hessian matrices.
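The reason an integral image lets SURF scale the filter rather than the image is that the sum over any rectangular box costs four lookups, whatever the box size. The following plain C sketch shows that property; the function names and the padded layout (one extra row and column of zeros) are choices made for this illustration, not details taken from the OpenSURF code.

```c
#include <assert.h>
#include <stddef.h>

/* Build an integral image with one extra row and column of zeros, so that
   ii[(y+1)*(w+1) + (x+1)] holds the sum of img over rows 0..y, cols 0..x.
   ii must have room for (w+1)*(h+1) entries. */
void integral_image(const unsigned char *img, unsigned long *ii,
                    size_t w, size_t h)
{
    assert(img != NULL && ii != NULL);
    size_t stride = w + 1;
    for (size_t x = 0; x <= w; ++x)
        ii[x] = 0;                              /* zero top border */
    for (size_t y = 0; y < h; ++y) {
        unsigned long row = 0;                  /* running sum of this row */
        ii[(y + 1) * stride] = 0;               /* zero left border */
        for (size_t x = 0; x < w; ++x) {
            row += img[y * w + x];
            ii[(y + 1) * stride + (x + 1)] = ii[y * stride + (x + 1)] + row;
        }
    }
}

/* Sum of the box with top-left corner (x, y), width bw and height bh.
   Always four lookups, so enlarging the filter costs nothing extra. */
unsigned long box_sum(const unsigned long *ii, size_t w,
                      size_t x, size_t y, size_t bw, size_t bh)
{
    size_t stride = w + 1;
    return ii[(y + bh) * stride + (x + bw)] + ii[y * stride + x]
         - ii[y * stride + (x + bw)] - ii[(y + bh) * stride + x];
}
```

A box filter response at a larger scale is obtained by calling `box_sum` with larger `bw` and `bh` on the same integral image, so no resized copies of the input are needed.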

The local orientation at an ipoint maintains invariance to image rotation. Orientation (the 4th stage of the pipeline in Figure 3) is calculated using the wavelet response in the X and Y directions in the neighborhood of the detected ipoint. The dominant local orientation is selected by rotating a circle segment covering an angle of 60 degrees around the origin. At each position, the x and y-responses within the segment of the circle are summed and used to form a new vector. The orientation of the longest vector becomes the feature orientation.
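The sliding-segment search can be sketched in plain C as follows. This is a simplified illustration under stated assumptions: each sample's angle is supplied precomputed in degrees, the step between segment positions (`step_deg`) is a free parameter chosen by the caller, and the function returns the winning summed vector rather than an angle. The names are hypothetical and not taken from the OpenSURF code.

```c
#include <assert.h>
#include <stddef.h>

#define WINDOW_DEG 60.0f   /* angular width of the rotating segment */

/* Returns 1 if angle a (degrees, 0 <= a < 360) lies inside the segment
   [start, start + 60), with wrap-around at 360 degrees. */
static int in_window(float a, float start)
{
    float d = a - start;
    if (d < 0.0f)
        d += 360.0f;
    return d < WINDOW_DEG;
}

/* Slides the segment around the circle in steps of step_deg, sums the
   x and y responses that fall inside it, and keeps the longest sum. */
void dominant_response(const float *angle_deg, const float *dx,
                       const float *dy, size_t n, float step_deg,
                       float *best_dx, float *best_dy)
{
    assert(step_deg > 0.0f && best_dx != NULL && best_dy != NULL);
    float best_len2 = -1.0f;            /* squared length of best vector */
    *best_dx = 0.0f;
    *best_dy = 0.0f;
    for (float start = 0.0f; start < 360.0f; start += step_deg) {
        float sx = 0.0f, sy = 0.0f;
        for (size_t i = 0; i < n; ++i) {
            if (in_window(angle_deg[i], start)) {
                sx += dx[i];
                sy += dy[i];
            }
        }
        float len2 = sx * sx + sy * sy; /* compare squared lengths */
        if (len2 > best_len2) {
            best_len2 = len2;
            *best_dx = sx;
            *best_dy = sy;
        }
    }
}
```

The feature orientation is then the angle of the returned vector, for example via `atan2f(best_dy, best_dx)`. On a GPU, each segment position could be evaluated by a separate work item, with the final maximum found by a reduction.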


To demonstrate the power of familiar C-style structures in OpenCL with the data-parallel compute capability of GPGPUs, Figure 4 shows SURF code for calculating and normalizing descriptors.


The calculation of the largest response is done using a local memory-based reduction.
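A local memory-based reduction follows a tree pattern: at each step, half of the remaining work items combine their value with a partner's, until one value is left. The plain C sketch below emulates that pattern for a maximum. It is an illustration only: `GROUP_SIZE` is a hypothetical work-group size, the `scratch` array stands in for OpenCL `__local` memory, and the inner loop represents work items that would run in parallel on a GPU.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 64   /* hypothetical work-group size; power of two */

/* Emulates a work-group tree reduction that finds the maximum value.
   Handles up to GROUP_SIZE inputs, one per emulated work item. */
float group_max(const float *values, size_t n)
{
    float scratch[GROUP_SIZE];          /* plays the role of __local memory */
    assert(values != NULL && n > 0 && n <= GROUP_SIZE);

    /* Each work item loads one element; idle items reuse the first value
       so they cannot change the result. */
    for (size_t lid = 0; lid < GROUP_SIZE; ++lid)
        scratch[lid] = (lid < n) ? values[lid] : values[0];

    /* Halve the number of active work items on every pass. */
    for (size_t stride = GROUP_SIZE / 2; stride > 0; stride /= 2) {
        for (size_t lid = 0; lid < stride; ++lid)   /* parallel on a GPU */
            if (scratch[lid + stride] > scratch[lid])
                scratch[lid] = scratch[lid + stride];
        /* In OpenCL C, barrier(CLK_LOCAL_MEM_FENCE) would separate passes. */
    }
    return scratch[0];
}
```

With 64 work items the result is available after six passes instead of 63 sequential comparisons, and all intermediate traffic stays in fast local memory.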

The 64-element descriptor is calculated by dividing the neighborhood of the ipoint into 16 regular subregions. Haar wavelets are calculated in each region and each region contributes 4 values to the descriptor. Thus, 16 * 4 values are used in applications based on SURF to compare descriptors.
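The 16 * 4 layout can be made concrete with a short plain C sketch. It assumes, as in the commonly described form of SURF, that each subregion is sampled on a 5 x 5 grid and contributes the sums of the x and y wavelet responses plus the sums of their absolute values. The sample count, the ordering of the four values, and the function names are assumptions of this illustration; Gaussian weighting and the final normalization step are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>

#define GRID     4                     /* 4 x 4 = 16 subregions */
#define SAMPLES  5                     /* assumed 5 x 5 samples per subregion */
#define SIDE     (GRID * SAMPLES)      /* 20 x 20 responses overall */
#define DESC_LEN (GRID * GRID * 4)     /* 16 subregions * 4 values = 64 */

static float absf(float v) { return v < 0.0f ? -v : v; }

/* dx and dy hold the Haar wavelet responses on a SIDE x SIDE grid around
   the ipoint (row-major). desc receives 4 values per subregion. */
void build_descriptor(const float *dx, const float *dy, float *desc)
{
    assert(dx != NULL && dy != NULL && desc != NULL);
    for (size_t ry = 0; ry < GRID; ++ry) {
        for (size_t rx = 0; rx < GRID; ++rx) {
            float sdx = 0.0f, sdy = 0.0f, sadx = 0.0f, sady = 0.0f;
            for (size_t sy = 0; sy < SAMPLES; ++sy) {
                for (size_t sx = 0; sx < SAMPLES; ++sx) {
                    size_t i = (ry * SAMPLES + sy) * SIDE
                             + (rx * SAMPLES + sx);
                    sdx  += dx[i];
                    sdy  += dy[i];
                    sadx += absf(dx[i]);
                    sady += absf(dy[i]);
                }
            }
            float *out = desc + (ry * GRID + rx) * 4;
            out[0] = sdx;   /* sum of x responses  */
            out[1] = sdy;   /* sum of y responses  */
            out[2] = sadx;  /* sum of |x| responses */
            out[3] = sady;  /* sum of |y| responses */
        }
    }
}
```

Each subregion is independent of the others, so on a GPU the 16 subregions of every ipoint can be processed by separate work items.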


The goal of using OpenCL with GPGPUs is to extract as much parallelism out of the framework as possible. In SURF, execution performance is determined by the characteristics of the data set rather than the size of the data. This is because the number of ipoints detected in the non-max suppression stage of the algorithm helps to determine the workgroup dimensions for the orientation and descriptor kernels. Computer Vision frameworks like SURF also have a large number of tunable parameters that greatly impact the performance of an application; one example is the detection threshold, which changes the number of points detected in the suppression stage.


Optimization

In heterogeneous computing, knowledge about the architecture of the targeted set of devices is critical to reap the full benefits of the hardware. For example, selected kernels in an application may be able to exploit vectorized operations available on the targeted device, and if some of the kernels can be optimized with vectorization in mind, the overall application may be sped up significantly. However, it is important to gauge the contributions of each kernel to the overall application runtime. Then informed optimizations can be applied to obtain the best performance.
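One simple way to gauge whether a kernel is worth optimizing is Amdahl's law: if a kernel accounts for a fraction f of the total runtime and is made s times faster, the whole application speeds up by 1 / ((1 - f) + f / s). The small C helper below, with a hypothetical name, applies that formula; the fractions would come from profiling the application.

```c
#include <assert.h>

/* Overall application speedup when a kernel that accounts for `fraction`
   of the runtime (0..1) is made `kernel_speedup` times faster. */
double overall_speedup(double fraction, double kernel_speedup)
{
    assert(fraction >= 0.0 && fraction <= 1.0 && kernel_speedup > 0.0);
    return 1.0 / ((1.0 - fraction) + fraction / kernel_speedup);
}
```

For instance, making a kernel 8 times faster yields roughly a 1.1x overall gain if that kernel is 10% of the runtime, but about 3.3x if it is 80% of the runtime, which is why profiling should come before tuning.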

In a heterogeneous computing scenario, an application starts out executing on the CPU, and then the CPU launches kernels on a second device (e.g., a GPU). The data transferred between these devices must be managed efficiently to minimize the impact of communication. Data manipulated by multiple kernels should be kept on the same device where the kernels are run. In many cases, data is transferred back to the CPU host, or integrated into CPU library functions. Analysis of program flow can pinpoint sections of the application where it would be beneficial to modify data management, leading to more efficient use of the overall system.

GPGPUs can be programmed and optimized efficiently using OpenCL, a royalty-free language based upon C, in order to unlock the full potential of standardized processing hardware for the rapid development of next-generation imaging applications.


Advanced Micro Devices, Sunnyvale, CA. (408) 749-7000. [amd.com]

Khronos Group, Beaverton, OR. (415) 869-8627. [www.khronos.org]