Audio and the Graphics Processing Unit
Sean Whalen
March 10, 2005
In recent years, the development of programmable graphics pipelines has placed the power of parallel computation in the hands of consumers. Systems developers are now paying attention to the general-purpose computational ability of these graphics processing units, or GPUs, and are using them in novel ways. This paper examines using pixel shaders to execute audio algorithms. We compare GPU performance to CPU performance, discuss problems encountered, and suggest new directions for supporting the needs of the audio community.
The Problem With Audio
Audio signal processing is used by everyone from bedroom musicians to the largest studios for generating and shaping recorded sounds, both offline and in real time. There is demand to apply several high-quality effects to each available channel of audio. Each channel may contain a stereo pair of 44.1-192 kHz discrete audio signals at 16-32 bits per sample. This approaches 1 MB per second which must be processed, per effect, per channel.
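The data-rate claim above is easy to verify with a little arithmetic. The sketch below is our own illustration, not part of the original gpudsp code; it computes raw bytes per second for a channel pair at a given sample rate and bit depth.

```c
/* Raw audio bandwidth: samples/sec * bytes/sample * channels.
 * A 44.1 kHz, 16-bit stereo pair needs ~176 KB/s; a 192 kHz,
 * 32-bit pair needs ~1.5 MB/s, per effect, per channel. */
long bytes_per_second(long rate_hz, int bits_per_sample, int channels)
{
    return rate_hz * (bits_per_sample / 8) * channels;
}
```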
Modern effects do more than simple echo and delay: they model vintage instruments and amplifiers using convolution with impulse responses, offer time shifting independent of pitch, and create new sounds using modern granular synthesis techniques. Applying these effects taxes a processor, especially in real time where latency must be kept under 5-10 ms.
Ideally these computations would be offloaded to specialized audio hardware. This would increase the maximum number of simultaneous real-time effects and reduce computation time for offline effects. Such hardware has not found its way to consumers.
Graphics To The Rescue
Before dedicated graphics hardware, the CPU performed all graphics-related computation. As graphics architecture improved, more of these computations were supported by the graphics processor.

Figure 1: CPU vs GPU performance timeline, in gigaflops [ ]
First-generation (1998) graphics processors handled pre-transformed triangle rasterization and simple texturing, but vertex transformations and lighting still occurred on the CPU [ ]. The second generation (1999-2000) supported vertex transformations in a fixed pipeline, sacrificing programmability for performance. The third generation (2001) offered a programmable vertex pipeline, followed by a programmable fragment pipeline in fourth-generation hardware (2002-present).
This programmability allows creative use of the graphics processing unit, or GPU, for general-purpose computation. The use of the GPU is attractive for performance reasons: a modern GPU can outperform a CPU by an order of magnitude for certain tasks [ ]. For a simple performance comparison, see Figure 1.
Graphics chips double in performance every 6-9 months, sometimes called "Moore's Law Cubed" growth. A GeForce 6800 has 222 million transistors, compared to the Pentium 4's 55 million. The Pentium 4 contains more cache than data logic, and a single-threaded pipeline executes multiple processes all reading data from a single memory interface. In contrast, graphics architecture reads streams of data in parallel, with no contention and at full bandwidth.
General Purpose Computation
Recent work has exploited the parallelism of general-purpose computation on graphics processing units, termed GPGPU. Applications have been found in cryptography [ ] and signal processing [ ], among others. Interest in harnessing the GPU has increased in part due to the emergence of high-level languages for graphics hardware.
The OpenGL Shading Language (GLSL) [ ], nVidia's C for Graphics (Cg) [ ], and Microsoft's High Level Shading Language (HLSL) each enable vertex and fragment programs to be written in a C-like language. For example, consider the following Cg snippet, which calculates Phong shading:
float3 cSpec = pow(max(0, dot(Nf, H)), phongExp).xxx;  // specular term, smeared to a 3-vector
float3 cPlastic = Cd * (cAmbi + cDiff) + Cs * cSpec;   // ambient + diffuse, plus specular
These shading languages are portable, to an extent, across hardware and between generations [ ]. They also have a much larger potential user base than assembly-level shading. Several meta-shading languages have already emerged [ ], allowing direct integration with C and C++ code, removing the need for any graphics-specific knowledge for GPGPU development.
A typical GPGPU application loads data to be processed into a texture, one element per texel. This texture is bound to a full-screen quad, and the viewport is set to the dimensions of the texture. This causes the fragment program to be executed once for every input.

A new "color" is generated for each fragment and written to the frame buffer or directly to another texture. For complex computations, this process can be repeated using output from the current pass as input to the next. This process is visualized in Figure 2. When the computation is complete, the contents of the frame buffer or texture are copied to system memory for access by the application. For more details on the GPGPU framework see [ ].
Figure 2: The programmable graphics pipeline [ ]
Not every algorithm can take advantage of the parallel computation a GPU has to offer. Shader programs must be able to exploit data parallelism. This means local computation, high arithmetic intensity, and no data dependencies [ ]. In addition, the AGP bus has slow readback from GPU to main memory, which must be amortized by processing large chunks of data. The new PCI Express bus alleviates this bottleneck.
This work uses the Cg language to implement audio algorithms as fragment programs on the GPU. The software written, called gpudsp, was tested on a 3 GHz Pentium 4 running Slackware Linux 10.1 with kernel 2.6.10. The graphics card was an nVidia GeForce FX 5200/AGP, accelerated by nVidia's kernel driver 6629 and Cg 1.3. Both 8-bit and 16-bit audio files are supported, but experiments were run with 16-bit files since 8-bit is not used professionally. Uncompressed WAV files were decoded using the SDL sound library and played back by writing to the /dev/dsp device.
Samples are stored in the red channel of a square texture, sized to the nearest power of 2 which will fit all samples. This may leave unused space in the texture, wasting some computation. These unused fragments are not read back into the playback buffer. The full RGBA space was not used, in order to simplify translating between a 1D sample array and a 2D texture. This translation is detailed below.
Seven audio algorithms were implemented, both as fragment programs written in Cg and as C functions adapted from [ ]. The GPU-accelerated Cg programs were timed with calls to gettimeofday before and after drawing the full-screen quad, as were the CPU-executed C functions before and after the function call. The audio file contained 105,000 16-bit mono samples. Execution times are shown in Figure 3.
Figure 3: GPU versus CPU execution time, in microseconds
Although the GPU wins in overall execution time, it outperforms the CPU in less than half of the algorithms. This reinforces the notion that the GPU excels at certain tasks. To explore this further, a brief explanation of each algorithm follows.
Chorus mixes the current sample with a future sample determined by the sine of the current sample's index. This requires two texture lookups and one linear interpolation.
Figure 4: Audio samples run through a 1-second delay
Compress is a parametric logarithm function, allowing the shape of the logarithm to be "softer" or "harder". Compression in audio recording is not related to data compression. One texture lookup, several multiplies and divides, and exponentiation are needed.
Delay mixes the current sample with a previous sample determined by a parameter multiplied by the sample rate. The previous sample can be scaled down in volume. This requires two texture lookups, one linear interpolation, and one multiply if scaling is used.
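A CPU sketch of the delay, matching the description above. The equal-weight 0.5/0.5 blend and the parameter names are assumptions; the paper only specifies the offset computation (seconds times sample rate) and an optional volume scale on the delayed sample.

```c
/* Delay sketch: mix the current sample with one from `seconds` earlier,
 * with the earlier sample scaled by `gain`. */
void delay(const float *in, float *out, int n, int rate, float seconds, float gain)
{
    int offset = (int)(seconds * rate);
    for (int i = 0; i < n; i++) {
        float past = (i >= offset) ? in[i - offset] : 0.0f;  /* silence before start */
        out[i] = 0.5f * in[i] + 0.5f * gain * past;
    }
}
```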
Highpass averages 5 adjacent samples after weighting with the vector [-1, -1, 4, -1, -1]. This is a one-dimensional version of the highpass filter used in kernel-based image processing. Five texture lookups, five additions, one multiply, and one divide are needed.
Figure 5: Audio samples run through a highpass filter
Lowpass is identical to Highpass, but uses the vector [1, 2, 4, 2, 1]. Five texture lookups, five additions, three multiplies, and one divide are required.
Figure 6: Audio samples run through a lowpass filter
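Both kernel filters can be sketched on the CPU as one routine parameterized by the weight vector. The edge handling (clamping at the buffer ends) and the normalization constant passed in are our choices; the paper specifies only the weights and the final divide.

```c
/* 1-D 5-tap kernel filter: weight five adjacent samples, then divide
 * by `norm`.  Highpass uses weights {-1,-1,4,-1,-1}; lowpass {1,2,4,2,1}. */
void kernel5(const float *in, float *out, int n, const float w[5], float norm)
{
    for (int i = 0; i < n; i++) {
        float acc = 0.0f;
        for (int k = -2; k <= 2; k++) {
            int j = i + k;
            if (j < 0)  j = 0;       /* clamp at the start of the buffer */
            if (j >= n) j = n - 1;   /* clamp at the end */
            acc += w[k + 2] * in[j];
        }
        out[i] = acc / norm;
    }
}
```

On a constant signal the lowpass kernel passes the value through unchanged, while the highpass kernel (whose weights sum to zero) outputs silence, which is the expected behavior for each.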
Noisegate implements a hard gate, setting any sample below a threshold to zero. One texture lookup and one conditional are needed.
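This is the simplest of the seven effects; a per-sample CPU sketch is one conditional, exactly as described. Gating on magnitude (rather than the raw signed value) is our reading of "below a threshold".

```c
#include <math.h>

/* Hard noise gate: zero any sample whose magnitude is below the threshold. */
float noisegate(float x, float threshold)
{
    return (fabsf(x) < threshold) ? 0.0f : x;
}
```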
Nrml performs normalization, which in audio recording refers to scaling every sample so that the loudest single sample reaches the maximum loudness for a given bit depth. A single divide is required, after finding the maximum, to compute the gain.
To access any sample, the CPU can index into a one-dimensional array with an integer. For the GPU, this linear array is converted to a two-dimensional texture which is indexed by floats. A function was written to convert a one-dimensional integer index into a floating-point texture coordinate. This was necessary in order to implement audio algorithms designed for a one-dimensional array of samples.
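Such a translation might look like the sketch below. The paper does not show gpudsp's actual function, so the texel-center convention (the +0.5 offsets, which address the middle of each texel rather than its edge) is an assumption about how the mapping would typically be done for a width x width texture.

```c
/* Convert a 1-D sample index into 2-D texture coordinates in [0, 1]
 * for a square width x width texture: sample i sits at column i % width,
 * row i / width, offset by half a texel to hit the texel center. */
void index_to_texcoord(int i, int width, float *s, float *t)
{
    *s = ((float)(i % width) + 0.5f) / (float)width;
    *t = ((float)(i / width) + 0.5f) / (float)width;
}
```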
This translation adds significant overhead, and explains the poor performance of the highpass and lowpass filters. These algorithms require access to 4 neighboring fragments, adding 4 translations for each processed fragment. Using only 2 neighbors, GPU execution time was cut almost in half, whereas CPU time was reduced 15%.
OpenGL's glTexImage2D function converts input data into textures, stored as floats in the range [0, 1]. Audio samples are signed, in the range [-128, 127] for 8-bit and [-32768, 32767] for 16-bit. The samples must be range compressed before calling glTexImage2D, which does not handle signed values. This involves converting each sample to an unsigned value and dividing the result by the maximum unsigned value. For example: to range compress an 8-bit sample, add 128 then divide by 255. The same concept is used for normal maps in bump mapping [ ].
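The 8-bit example above, written out directly:

```c
/* Range compress a signed 8-bit sample into [0, 1] for glTexImage2D:
 * shift to unsigned (add 128), then divide by the maximum unsigned value. */
float range_compress_8bit(signed char sample)
{
    return ((float)sample + 128.0f) / 255.0f;
}
```

The 16-bit case is identical in shape, adding 32768 and dividing by 65535.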
Floating point rounding errors occurred when using half-precision 16-bit floats for performance in the fragment programs. Not all fragments were being addressed in the index translation, and severe distortion resulted (see Figure 7). This was resolved by using full-precision 32-bit floats, and is a known problem discussed in [ ].

Figure 7: Distortion resulting from half-precision indexing (left) versus full precision (right)
Cg's texture lookup function returns a 4-element vector. This means that for every 16-bit short the CPU computes in Figure 3, the GPU is calculating 4 32-bit floats.
Passing parameters to fragment programs requires explicit setup of each parameter in Cg, requiring several function calls and runtime binding. This makes switching between fragment programs with different parameters tedious. To speed development, most parameters were instead defined as constants in the fragment code. Meta-languages such as Brook [ ] and SH [ ] remove the hassle of parameter binding, transparently passing local variables to fragment programs.
Audio on the GPU is becoming a reality. Already, a startup is touting commercial GPU-accelerated audio software. It remains to be seen what limitations their product has, based on the caveats discovered here.
Many audio algorithms cannot be implemented with a single pass on the GPU. Algorithms such as bandpass, multi-tap delay, and reverb require output from previously computed samples. This poses no problem for serialized execution on the CPU, but conflicts with the parallel architecture of the GPU.
Intermediate values can be computed using multiple execution passes and textures, a task which is simplified by Brook [ ] and SH [ ]. However, the multipass approach may not be compatible with the latency requirements of real-time processing. Implementation of multipass algorithms is left for future work.
Current generations of hardware support limited for and while loops, which are unrolled automatically by the compiler. Shader programs have a hardware-limited maximum length, which limits loop iterations due to this unrolling. For our GeForce FX 5200, the iteration limit was 256. A CD-quality delay calculated from the previous second of audio must iterate through 44,100 samples. Clearly this restricts the complexity of audio algorithms implementable on the GPU.
Real-time effects processing on the AGP bus seems unlikely. To achieve low latency, short chunks of audio must be sent to the GPU and back to system memory. The readback performance of the AGP bus will likely limit GPU acceleration to offline audio processing. PCI Express should remove this bottleneck.
Based on these initial results, simple audio processing on the GPU seems practical given the proper bus. Nearly matching or outperforming the CPU for every implemented algorithm, GPU acceleration can increase the maximum number of effects used to record and master professional audio.
References

R. Fernando and M. Kilgard, "The Cg Tutorial: The Definitive Guide to Programmable Real-Time Graphics", Addison-Wesley, 2003.

I. Buck et al., "Brook for GPUs: Stream Computing on Graphics Hardware", ACM Transactions on Graphics, August 2004.

D. Cook, J. Ioannidis, A. Keromytis, and J. Luck, "Secret Key Cryptography Using Graphics Cards", Available HTTP:

K. Moreland and E. Angel, "The FFT on a GPU", SIGGRAPH/Eurographics Workshop on Graphics Hardware 2003 Proceedings, pp. 112-119, Jul. 2003.

J. Kessenich, D. Baldwin, and R. Rost, "The OpenGL Shading Language", Available HTTP:

M. McCool, Z. Qin, and T. Popa, "Shader Metaprogramming", SIGGRAPH/Eurographics Workshop on Graphics Hardware 2002 Proceedings, pp. 57-68, Sep. 2002.

I. Buck, A. Lefohn, J. Owens, and R. Strzodka, "GPGPU: General Purpose Computation on Graphics Processors", IEEE Visualization 2004 Tutorial, Oct. 2004, Available HTTP:

T. Kurien, "Audio Effects Algorithms", Available HTTP: