TRANSGAMING INC.
WHITE PAPER: SWIFTSHADER TECHNOLOGY
JANUARY 29, 2013

SwiftShader: Why the Future of 3D Graphics is in Software
For some time now, it has been clear that there is strong momentum for convergence between CPU and GPU technologies. Initially, each technology supported radically different kinds of processing, but over time GPUs have evolved to support more general purpose use while CPUs have evolved to include advanced vector processing and multiple execution cores. In 2013, even more key GPU features will make their way into mainstream CPUs.

At TransGaming, we believe that this convergence will continue to the point where typical systems have only one type of processing unit, with large numbers of cores and very wide vector execution units available for high performance parallel execution. In this kind of environment, all graphics processing will ultimately take place in software. Through our SwiftShader software GPU toolkit, TransGaming has long been a world leader in software based rendering, with widespread adoption of our technology by some of the world's top technology companies, and a US patent on key techniques issued in late 2012.

This whitepaper explores the past, present, and future of software rendering, and why TransGaming expects that the technology behind SwiftShader will be a critical component of future graphics systems.
SwiftShader Today
In 2005, TransGaming launched SwiftShader, for the first time providing a software-only implementation of a commonly used graphics API (Microsoft® Direct3D®), including shader support, at performance levels fast enough for real-time interactive use. Since then, SwiftShader has found an important niche in the graphics market as a fallback solution – ensuring that even in cases where available hardware or graphics drivers are inadequate, out of date, or unstable, our customers' software will still run.

This fallback case is a critical one for software that needs to run no matter what system an end user has in place. TransGaming has licensed SwiftShader to companies such as Adobe for use with Flash® as a fallback for the Stage3D® API, and to Google to implement the WebGL® API within Chrome® and Native Client®. Beyond this, SwiftShader has found customers in markets as diverse as medical imaging and the defense industry. All of these customers require a solution that will put the right pixels on the screen 100% of the time.

Another important area where SwiftShader is being used today is in cloud computing and virtualization systems. Servers in data centers that include GPU capabilities are currently substantially more expensive than normal servers. Using SwiftShader and software rendering thus allows substantial savings and flexibility for developers with server-oriented applications that require some degree of graphics capability.

A key part of the reason that SwiftShader is useful as a fallback option in situations where a hardware GPU is not available or not reliable is that it is capable of achieving performance that approaches that of dedicated hardware. With a 2010-era quad-core CPU, SwiftShader scores 620 points in the popular 3DMark06 DirectX 9 benchmark; this is higher than the scores for many previous generation integrated GPUs.

Figure 1: SwiftShader running 3DMark06
Software Rendering Future Advantages
While today's software rendering results are good enough for some applications, current generation integrated GPUs still have a substantial performance advantage. Why then does TransGaming believe that software rendering will have a more important role in the future, beyond a reliable fallback?

The answers are straightforward. As CPUs continue to increase their parallel processing performance, they become adequate for a wider range of graphics applications, thus saving the cost of additional unnecessary hardware. Hardware manufacturers can then focus resources into
optimizing and improving hardware with a single architecture, and thus avoid the costs of melding separate CPU and GPU architectures in a system. As graphics drivers and APIs get more complex and diverse, the issues of driver correctness and stability become ever more important. In today's world, software developers must test their applications on an almost infinite variety of different GPUs, drivers, and OS revisions. With a pure software approach, these problems all go away. There are no feature variations to worry about, other than performance, and developers can always ship applications with a fully stable graphics library, knowing that it will work as expected no matter what. Software rendering thus saves time and money for all participants in the platform and ecosystem during development, testing and maintenance.
Beyond cost savings, software rendering has numerous additional advantages. For example, graphics algorithms that today use a combination of CPU and GPU processing must split the workload in a suboptimal way, and developers must deal with the complexity of handling different bottlenecks to ensure that each pipeline remains balanced. Software rendering also simplifies optimization and debugging by using a single architecture, allowing the use of well established CPU-side profilers and debuggers. A simpler, uniform memory model also liberates developers from having to deal with multiple memory pools and inconsistent data access characteristics, creating additional freedom for developers to explore new graphics algorithms.
Most importantly however, software rendering allows for unlimited new capabilities to be used at any time. New graphics API releases can always be compatible with existing hardware, and developers can add new functionality at any layer of their graphics stack. The only limits become those of the developer's imagination.

All of this however can only become true if software rendering can close the performance gap. At TransGaming, we believe that this is very achievable, and that upcoming hardware advances will prove this out. To understand why requires a deeper dive into the technical side of SwiftShader.
SwiftShader: The State of the Art in Software Rendering
This section highlights some of the key technologies that differentiate SwiftShader from other renderers, and illustrates how the challenges posed by software rendering can be overcome.

One of the seemingly major advantages of dedicated 3D graphics hardware is that it can switch between different operations at no significant cost. This is particularly relevant to real-time 3D graphics because all of the graphics pipeline stages depend on a certain 'state' that determines which calculations are performed. For instance, take the alpha blending stage used to combine pixels from a new graphics operation with previously drawn pixels. This stage can use several different blending functions, each of which takes various input arguments. Handling this kind of work is a challenge for traditional software approaches that use conditional statements to perform different operations. The resulting CPU code ends up containing more test and branch instructions than arithmetic instructions, resulting in slower performance compared to code that has been specialized for just a single combination of states, or hardware with separate logic for each blending function. A naïve software solution that includes pre-built specialized routines for every combination of blending states is not feasible, because combinatorial explosion would result in excessive binary code size.
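To make the cost concrete, a conventional interpreted blend step might look like the following sketch (illustrative C++ only, not SwiftShader code), where the active blend factors are re-tested for every pixel drawn:

// Illustrative sketch: a traditional interpreted blend step that re-evaluates
// the active blend factors for every single pixel.
enum class BlendFactor { One, SrcAlpha, InvSrcAlpha };   // many more in practice

static float factor(BlendFactor f, float srcAlpha)
{
    switch (f) {                          // test-and-branch, repeated per pixel
    case BlendFactor::One:         return 1.0f;
    case BlendFactor::SrcAlpha:    return srcAlpha;
    case BlendFactor::InvSrcAlpha: return 1.0f - srcAlpha;
    }
    return 0.0f;
}

float blend(float src, float dst, float srcAlpha,
            BlendFactor srcFactor, BlendFactor dstFactor)
{
    return src * factor(srcFactor, srcAlpha) + dst * factor(dstFactor, srcAlpha);
}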
The practical solution that SwiftShader uses for this type of problem is to compile only the routines for state combinations that are needed at run-time. In other words, SwiftShader waits for the application to issue a drawing command and then compiles specialized routines which perform just those operations required by the states that are active at the time the drawing command is issued. The generated routines are then cached to avoid redundant recompilation. The end result is that SwiftShader can support all the graphics operations supported by traditional GPUs, with no render-state dependent branching code in the processing routines. This elimination of branching code also comes with secondary benefits such as much improved register reuse.

This technique of dynamic code generation with specialization has proven to be invaluable in making software rendering a viable choice for many applications today, and naturally extends to the run-time compilation of current types of programmable shaders. Most importantly, it opens up huge opportunities for future techniques.
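A rough sketch of the caching idea follows; the type names, state fields, and the compileSpecializedRoutine hook below are hypothetical, chosen only to illustrate keying a cache of generated code on the active draw state:

#include <cstdint>
#include <unordered_map>

// Hypothetical sketch of state-keyed routine caching; names are illustrative.
using DrawRoutine = void (*)(const void *vertices, int count);

struct DrawState
{
    std::uint32_t blendFunc;   // active blend function
    std::uint32_t depthMode;   // depth test configuration
    std::uint32_t shaderId;    // bound shader program

    std::uint64_t hash() const   // combine everything that affects code generation
    {
        std::uint64_t h = blendFunc;
        h = (h * 0x100000001B3ull) ^ depthMode;
        h = (h * 0x100000001B3ull) ^ shaderId;
        return h;
    }
};

DrawRoutine compileSpecializedRoutine(const DrawState &state);  // JIT back-end hook

DrawRoutine getDrawRoutine(const DrawState &state)
{
    static std::unordered_map<std::uint64_t, DrawRoutine> cache;

    std::uint64_t key = state.hash();
    auto it = cache.find(key);
    if (it != cache.end())
        return it->second;                                     // reuse cached code

    DrawRoutine routine = compileSpecializedRoutine(state);    // compile on first use only
    cache.emplace(key, routine);
    return routine;
}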
In addition to using dynamic code generation,
SwiftShader also achieves some of its performance through
the use of CPU SIMD instructions. SwiftShader pioneered the
implementation of Shader Model 3.0 software rendering by
using the SIMD instructions to process multiple elements
such as pixels and vertices in parallel. By contrast, the classic
way of using these instructions is to execute vector operations for only a single element. For example, other software renderers might implement a 3-component dot product using the following sequence of Intel x86 SSE2 instructions:

mulps   xmm0, xmm1     ; multiply the vector components
movhlps xmm1, xmm0     ; move the upper products into the lower lanes
addps   xmm0, xmm1     ; partial sums
pshufd  xmm1, xmm0, 1  ; bring the second partial sum into lane 0
addss   xmm0, xmm1     ; final dot product in lane 0
Note that this sequence requires five instructions to
compute a single 3-component dot product - a common
operation in 3D lighting calculations. This is no faster than
using scalar instructions, and thus many legacy software
renderers did not obtain an appreciable speedup from the
use of vector instructions. SwiftShader instead uses them to
compute multiple dot products in parallel:
mulps xmm0, xmm3       ; x products for four elements
mulps xmm1, xmm4       ; y products for four elements
mulps xmm2, xmm5       ; z products for four elements
addps xmm0, xmm1
addps xmm0, xmm2       ; four dot products, one per lane
The number of instructions is the same, but this sequence computes four dot products at once. Each vector register component contains a scalar variable (which itself can be a logical vector component) from one of four pixels or vertices. Although this is straightforward for an operation like a dot product, the challenge that TransGaming has solved with SwiftShader is to efficiently transform all data into and out of this format, while still supporting branch operations.
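The data layout this implies can be illustrated with a short sketch using SSE intrinsics (again illustrative, not SwiftShader's code): four xyzw vectors are transposed from the usual array-of-structures form into one register per component, after which each arithmetic instruction operates on four pixels or vertices at once.

#include <xmmintrin.h>

// Transpose four xyzw vectors (array-of-structures) into one register per
// component (structure-of-arrays): xxxx, yyyy, zzzz, wwww.
void transposeToSoA(const float v[4][4], __m128 &x, __m128 &y, __m128 &z, __m128 &w)
{
    __m128 r0 = _mm_loadu_ps(v[0]);
    __m128 r1 = _mm_loadu_ps(v[1]);
    __m128 r2 = _mm_loadu_ps(v[2]);
    __m128 r3 = _mm_loadu_ps(v[3]);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);   // 4x4 transpose macro from xmmintrin.h
    x = r0; y = r1; z = r2; w = r3;
}

// With the data in this form, four 3-component dot products take the same
// five multiply/add instructions shown above.
__m128 dot3x4(__m128 ax, __m128 ay, __m128 az,
              __m128 bx, __m128 by, __m128 bz)
{
    return _mm_add_ps(_mm_add_ps(_mm_mul_ps(ax, bx),
                                 _mm_mul_ps(ay, by)),
                      _mm_mul_ps(az, bz));
}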
Earlier versions of SwiftShader made use of our in-house developed dynamic code generator, SwiftAsm, which used a direct x86 assembly representation of the code to be generated. This offered excellent low-level control, but at the cost of having to deal with different sets of SIMD extensions and of determining code dependencies based on complex state interactions. We've since taken things to the next level by abstracting the operations into a high-level shader-like language, which integrates directly into C++. This layer, which we call Reactor, outputs an intermediate representation that can then be optimized and translated into binary code using a full compiler back-end. We chose to use the well-known LLVM framework due to its excellent support of SIMD instructions and straightforward use for run-time code generation. The combination of Reactor and LLVM forms a versatile tool for all dynamic code generation needs, exploiting the power of SIMD instructions while abstracting the complexities.
A simple example of how Reactor is used in the implementation of the cross product shader instruction illustrates this well:

void ShaderCore::crs(Vector4f &dst, Vector4f &src0, Vector4f &src1)
{
    dst.x = src0.y * src1.z - src0.z * src1.y;
    dst.y = src0.z * src1.x - src0.x * src1.z;
    dst.z = src0.x * src1.y - src0.y * src1.x;
}
This looks exactly like the calculation to perform a cross product in C++, but the magic is in the use of Reactor's C++ template system. The Reactor Vector4f data type is defined with overloaded arithmetic operators that generate the required instructions for SIMD processing in the output code.
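The mechanism behind this can be sketched as follows (a deliberately simplified illustration of the operator-overloading idea, not Reactor's actual implementation): arithmetic on such types appends instructions to the routine being generated instead of computing values immediately.

// Simplified illustration only: overloaded operators emit instructions into
// the routine under construction rather than performing arithmetic.
struct Value;                                  // handle to a generated SIMD value
Value *emitMul(Value *a, Value *b);            // appends a multiply to the routine
Value *emitSub(Value *a, Value *b);            // appends a subtract to the routine

class Float4
{
public:
    explicit Float4(Value *v) : value(v) {}
    Value *value;                              // the value in the IR being built
};

inline Float4 operator*(const Float4 &a, const Float4 &b)
{
    return Float4(emitMul(a.value, b.value));  // no arithmetic happens here
}

inline Float4 operator-(const Float4 &a, const Float4 &b)
{
    return Float4(emitSub(a.value, b.value));
}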
In addition to eliminating branches and making effective use of SIMD instructions, SwiftShader also achieves substantial speedups through the use of multi-core processing. While this may at first seem obvious and relatively straightforward, it poses both challenges and opportunities. Graphics workloads can be split into concurrently executable tasks in many different ways. One can choose between sort-first or sort-last rendering, or a hybrid approach. Each task can also be divided into more tasks through data parallelism, function parallelism, and/or instruction parallelism. Subdividing tasks and scheduling them onto cores/threads comes with some overhead, and the chosen policies are typically fixed
once a specific approach is chosen. TransGaming has identified opportunities to minimize the overhead and in some cases even exceed the theoretical speedup of using multiple cores, by combining dynamic code generation with the choice of subdivision/scheduling policy. Information about the processing routines, obtained during run-time code generation, can be used during task subdivision and scheduling, while information about the subdivision/scheduling can be used during the run-time code generation.

We believe this advantage to be unique to software rendering, because only CPU cores are versatile enough to do dynamic code generation, intelligent task subdivision/scheduling, and high throughput data processing.
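As one simple illustration of the data-parallel case (a sketch only, not SwiftShader's actual scheduler), a drawing command can be split into horizontal bands of the render target, with each band rasterized by its own worker thread; the rasterizeRows routine below is a hypothetical placeholder:

#include <algorithm>
#include <thread>
#include <vector>

// Sketch: split the framebuffer into horizontal bands and rasterize each band
// on its own thread. Real schedulers balance the work far more carefully.
void rasterizeRows(int yBegin, int yEnd);   // hypothetical per-band routine

void renderInParallel(int height, int workerCount)
{
    std::vector<std::thread> workers;
    int rowsPerWorker = (height + workerCount - 1) / workerCount;

    for (int i = 0; i < workerCount; i++) {
        int yBegin = i * rowsPerWorker;
        int yEnd = std::min(height, yBegin + rowsPerWorker);
        if (yBegin >= yEnd) break;
        workers.emplace_back(rasterizeRows, yBegin, yEnd);
    }

    for (std::thread &t : workers)
        t.join();
}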
Further information about these techniques can be found in TransGaming's patent filing, Patent #8,284,206: General purpose software parallel task engine. While the patent was granted in late 2012, the original provisional patent was filed in early 2006, well before other modern software rendering efforts such as Intel's Larrabee became public.
Convergence and Trends
The previous sections of this whitepaper show some
of the substantial advantages of software rendering, and
demonstrate that the technology to use the full computing
power of a modern CPU as efficiently as possible is already
here. In order to fully understand the coming convergence
between CPU and GPU however, we must also consider the
evolution of the GPU side of the equation.
Firstly, we must understand what makes modern GPUs
exceptionally fast parallel computation engines, and what
the limits of growth on the approaches used to provide that
speed may be.
Modern GPUs have two critical features that enable the
majority of their performance: they provide a large number of
heavily pipelined parallel computation units, and they drive
many execution threads through these units simultaneously.
This allows them to hide the long latencies that frequently
occur when executing operations such as texture fetching.
While one thread is waiting for a texture fetch result, another
thread occupies the computation units. Context switches are
therefore designed to be very efficient on a GPU.
Keeping many threads active simultaneously requires a
large number of registers to be available. The more registers
a given instruction sequence requires, the fewer threads can
be run simultaneously.

The lowest organizational level of computation on a
GPU is known as a ‘warp’ on NVIDIA GPUs, and a ‘wavefront’
on AMD GPUs. This is similar to the SIMD width in a CPU
vector unit. Current generation GPU hardware typically uses
1024-bit or 512-bit wide SIMD units, compared to the 256-bit wide SIMD units used by current generation CPUs.
The wide SIMD approach also has some important limitations. One is that control statements within a given instruction sequence cause divergence, which requires evaluating multiple code paths. With a wider SIMD width, this divergence becomes more common, eliminating some of the execution parallelism. Another limiting factor for graphics processing is that pixels are processed in rectangular tiles, so rendering triangles regularly results in leaving some lanes unused.

Another limitation is the number of registers available. Larger register files lower computational density, so GPU manufacturers must balance that against stalls caused by running out of storage for covering RAM access latencies.
By contrast, CPUs are optimized for low-latency operation. On a CPU core, a significant amount of die area is devoted to scheduling logic that allows many functional units to be used simultaneously through out-of-order execution. Branch prediction units and shorter SIMD widths reduce the penalties for branch-heavy code, and more die space is devoted to caches and memory-management functionality. CPUs typically support running at significantly higher clock frequencies as well.
CPUs are now evolving to support increased parallelism at the SIMD width level as well as with additional execution units available to simultaneous threads, and larger numbers of CPU cores per die. Intel's Haswell chips, available later this year, will include three 256-bit wide SIMD units per core, two of which are capable of a fused multiply-add operation. This arrangement will process up to 32 floating-point operations per cycle: with four cores on a mid-range version of this architecture, this provides about 450 raw GFLOPS at 3.5 GHz.
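The peak figure follows from straightforward arithmetic:

    2 FMA units × 8 single-precision lanes (256 bits / 32 bits) × 2 FLOPs per FMA = 32 FLOPs per cycle per core
    32 FLOPs/cycle × 4 cores × 3.5 GHz ≈ 448 GFLOPS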
Intel's AVX2 instruction set offers room to increase the SIMD width to 1024 bits, which would put the raw CPU GFLOPS at similar levels to the highest end GPUs currently available.
At the same time, GPUs are becoming more and more like CPUs, adding more advanced memory management features such as virtual memory and the corresponding MMU
complexity that is required. GPU instruction scheduling is becoming more complex as well, with out-of-order features such as register scoreboarding, and ILP extraction features such as superscalar execution. Furthermore, GPU-vendor sponsored research suggests that running fewer threads simultaneously might lead to better performance in many cases [1].

[1] Better Performance at Lower Occupancy: http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf
Die-level Integration and Bandwidth
One of the trends that displays the clearest indications of convergence between CPUs and GPUs is the increasing frequency of die-level integration of current-generation differentiated CPU and GPU units. This trend has become more and more important with the rise of mobile devices, which require both graphics and CPU performance in a single low-power chip. Most desktop chips sold today also include an on-die GPU.
The very existence of this important market shows the value of CPU / GPU convergence. While today the market is served by chips that integrate separate units on the same die, the potential advantages of a fully unified chip are clear – hardware manufacturers would be able to manufacture simpler macro-level designs with computation cores that could be used for either general purpose or graphics workloads as needs arise in the system.
While one of the traditional hallmarks of strength with GPU computing has been the use of high bandwidth dedicated memory, this advantage becomes moot in environments where the GPU must share memory with the CPU. While there is no question that the availability of high bandwidth memory will continue to be a strength of discrete GPU computing, there are ways in which both integrated CPU / GPU packages as well as potential future unified chips can offset this distinction.
One approach to provide increased performance for these chips is on-package memory. This approach has already proven useful in Microsoft's Xbox 360, which includes a 10 MB eDRAM framebuffer. Intel's Haswell integrated GPU will optionally include 128 MB of high performance memory for this purpose as well. AMD's next generation Kaveri Fusion architecture chip is designed to fully share memory and address space between its CPU and GPU components.

Clearly, any future unified architecture chip will not suffer from bandwidth limitations any more than existing integrated designs might.
Computational Efficiency
One argument often raised in favor of GPU computing is
that GPUs have greater computational efficiency than CPUs.
Device                                    Speed (MHz)   Area (mm²)   Performance (GFLOPS)   Power (W)   GFLOPS/Watt   GFLOPS/mm²

Discrete GPUs
NVidia Kepler GK104 (GTX 680)             1006          294          3090.4                 195         15.85         10.5
NVidia Kepler GK107 (GTX 650)             928           118          812.5                  64          12.7          6.88
ATI Radeon HD 7870 XT                     975           365          2995.2                 185         16.19         8.21

Integrated GPUs
NVidia Tegra 4 GPU only (est.)            520           ~30          74.8                   ~3.8        ~19.68        ~2.49
Intel Ivy Bridge HD 4000 GPU only (est.)  1150          ~57          294.4                  14.8        19.89         ~5.16

CPU Hardware
Intel Haswell Quad-Core, no GPU (est.)    3100          ~96          ~396.8                 ~65         ~6.2          ~4.13
Intel Haswell Single Core (est.)          3100          ~24          ~99.2                  ~16         ~6.2          ~4.13
Intel Haswell ULX Single Core (est.)      1500          ~24          ~48                    ~4          ~12.0         ~2.0

Table 1: GFLOPS per Watt and GFLOPS per mm²
While this is true today, the advantage is much less than
one might think, and there is every reason to believe that it
will disappear with future CPU designs as the convergence
trends described above continue.
Table 1 summarizes information about the performance per unit area and performance per Watt of discrete GPUs, integrated GPUs, and Intel's Haswell CPU. Estimates are drawn from the websites in the footnote below [2], with the following additional assumptions:

- Tegra 4 GPU area is estimated as 37.5% of overall SoC area
- Tegra 4 GPU TDP is estimated as 50% of overall SoC TDP, based on reported battery size and battery-life estimates for the NVidia Shield device (38 Watt-hour battery, ~5 hour battery life)
- Ivy Bridge HD 4000 GPU size is estimated as 31% of the overall die, based on visual estimates from die photographs
- Ivy Bridge HD 4000 GPU power consumption is taken from AnandTech measurements
- Haswell core size is estimated as 13% of the overall die, based on visual estimates of die photographs
- Haswell 3.1 GHz single-core power is estimated based on data from the Tom's Hardware article cited below
- Haswell ULX 1.5 GHz single-core power is estimated as 50% of the 10 Watt TDP reported in the AnandTech article cited below
[2] Data for Table 1 was compiled from the sites below:
http://en.wikipedia.org/wiki/GeForce_600_Series
http://www.zdnet.com/nvidia-claims-tegra-4-gpu-will-outperform-the-ipad-4s-a6x-7000009888/
http://www.anandtech.com/show/6550/more-details-on-nvidias-tegra-4-i500-5th-core-is-a15-28nm-hpm-ue-category-3-lte
http://i1247.photobucket.com/albums/gg628/mrob27/HSW-4c-GT2-rev2_zps3212a12b.png
http://www.extremetech.com/computing/144778-atom-vs-cortex-a15-vs-krait-vs-tegra-3-which-mobile-cpu-is-the-most-power-efficient
http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units#Southern_Islands_.28HD_7xxx.29_series
http://www.nordichardware.com/CPU-Chipset/intel-core-i7-3770k-ivy-bridge-and-the-3d-transistor-is-here/New-graphics-the-biggest-news-in-Ivy-Bridge.html
http://www.anandtech.com/show/5878/mobile-ivy-bridge-hd-4000-investigation-realtime-igpu-clocks-on-ulv-vs-quadcore
http://www.tomshardware.com/gallery/haswell-665x269,0101-364392-0-2-3-1-jpg-.html
http://www.anandtech.com/show/6655/intel-brings-core-down-to-7w-introduces-a-new-power-rating-to-get-there-yseries-skus-demystified
http://www.chip-architect.com/news/2012_04_19_Ivy_Bridges_GPU_2-25_times_Sandys.html

While the data above clearly shows GPUs as more efficient in raw GFLOPS performance, the result is hardly overwhelming. Given the potential for scaling raw GFLOPS on a CPU-style architecture at relatively low power cost by providing wider SIMD vectors, there is no reason to imagine that computational efficiency is a bar to future unified architectures.
Transitioning to Unified Computing
As we have seen above, there are strong trends towards convergence between the GPU and CPU, and no obvious obstacles to unified computing. Are there any other important factors required to complete a transition to fully unified architectures?
One caveat in the above compute density comparisons is that so far we've only looked at the programmable hardware. The relative amount of fixed-function GPU hardware has been shrinking, but it is worth considering the question of whether anything changes if we implement these functions in software.

It is a common misconception that replacing fixed-function hardware would require significant additional programmable hardware in order to achieve the same peak throughput. To illustrate why this is incorrect, we'll focus on the texture units, which remain the most prominent fixed-function logic on today's GPUs. These texture units are at any given time either a bottleneck or underutilized. GPU manufacturers try to prevent them from being a bottleneck by having more texture units than what is needed by the average shader (bandwidth and area permitting). As a result, these additional texture units are, more often than not, idle.
By contrast, with unified hardware, the additional programmable hardware required would be based on the average utilization. We have confirmed this experimentally by collecting statistics through the use of run-time performance counters in SwiftShader. The average TEX:ALU ratio in observed shaders is lower than the ratio of TEX:ALU hardware in contemporary GPUs. Software rendering on unified hardware thus has the potential to outperform dedicated hardware, by having more programmable logic available that does not suffer from underutilization. Even in the case where the GPU's texture units are a bottleneck, software rendering on a unified CPU may outperform it for simple sampling operations with high cache locality.
The second factor that makes it feasible to replace fixed-function texture samplers with programmable hardware is the fact that texture sampling is by nature pipelined,
consisting of several logical stages, most of which are configurable: address generation, mipmap level of detail (LOD) determination, texel gather, and filtering. Different functional units inside a CPU core can perform work at each of these stages. For example, texel gathering can be performed by a load/store unit while the SIMD FP ALUs are completing filtering on a previous sample.
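To make the stage decomposition concrete, a single bilinear 2D fetch can be sketched in scalar C++ as follows (an unoptimized, single-channel illustration; a real implementation generates specialized SIMD code for these stages):

#include <algorithm>
#include <cmath>

// Sketch of the logical stages of one bilinear 2D fetch in software.
struct Texture { const float *texels; int width; int height; };

float sampleBilinear(const Texture &tex, float u, float v)
{
    // 1. Address generation: map normalized coordinates to texel space.
    float x = u * tex.width  - 0.5f;
    float y = v * tex.height - 0.5f;
    int x0 = (int)std::floor(x), y0 = (int)std::floor(y);
    float fx = x - x0, fy = y - y0;

    // 2. Texel gather (handled by load/store units).
    auto fetch = [&](int xi, int yi) {
        xi = std::min(std::max(xi, 0), tex.width  - 1);
        yi = std::min(std::max(yi, 0), tex.height - 1);
        return tex.texels[yi * tex.width + xi];
    };
    float t00 = fetch(x0, y0),     t10 = fetch(x0 + 1, y0);
    float t01 = fetch(x0, y0 + 1), t11 = fetch(x0 + 1, y0 + 1);

    // 3. Filtering (the SIMD FP ALUs in the vectorized version).
    float top    = t00 + fx * (t10 - t00);
    float bottom = t01 + fx * (t11 - t01);
    return top + fy * (bottom - top);
}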
Furthermore, SwiftShader's dynamic code generation can completely eliminate the LOD determination stage when mipmapping is inactive – either when disabled explicitly or when the texture only has one level. Similarly, filtering can range from none at all, to trilinear anisotropic filtering, and beyond. Modern GPU hardware provides support for one trilinearly filtered sample per clock cycle, implementing anisotropic filtering using multiple cycles. This means that during anisotropic filtering, some of the other stages are idle. To implement more advanced filtering, a shader program is required, i.e., software. Likewise, on the CPU we can generate specialized routines for 1D, 2D and 3D texture addressing. All this leads to the general observation that as the usage diversifies, bottlenecks and underutilization can be addressed by new forms of programmability and unification.
This brings us to a third argument. Performing texture operations on unified hardware enables global optimizations. We have observed that many shaders sample multiple textures using the same or similar texture coordinates. This enables SwiftShader's run-time compiler back-end to eliminate common sub-expressions. For instance, if a shader uses a regular grid of sample locations to implement a higher order filter, fewer addresses have to be computed than if each of the samples computed both coordinates.
Unifying the CPU and GPU also means that some portions of the GPU's fixed-function hardware could be added to the CPU's architecture as new instructions, and these could then be used for new purposes as well. One major example of this is the addition of 'gather' support to commodity CPUs, which will become available with Intel's Haswell CPU later this year. This will speed up software rendering considerably, by transforming serial texel fetches into a parallel operation. But the gather instruction can also speed up a multitude of other graphics operations, such as vertex attribute fetches, table lookups for transcendental functions, arbitrarily indexed constant buffer accesses, and more. The possible uses go far beyond graphics. The decoupling of the gather operation from filtering also enables the optimization of texture sampling operations that require no filtering. In recent years the use of graphics algorithms requiring non-texture related memory lookups has increased, spawning the addition of a generic gather intrinsic in shader languages. Besides gather, several other generic instructions could be added to the CPU to ensure efficient texture sampling in software. For example, to efficiently pack and unpack small data fields, a vector version of the bit manipulation instructions (BMI1/2) could be added.
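As an illustration of what a gather instruction enables (a sketch using the AVX2 intrinsic with a hypothetical single-channel texel array), eight texel fetches that would otherwise be eight serial loads collapse into a single instruction:

#include <immintrin.h>

// Fetch eight single-channel texels addressed by eight (x, y) pairs with one
// AVX2 gather, instead of eight scalar loads.
__m256 gatherTexels(const float *texels, int pitch, __m256i x, __m256i y)
{
    // index = y * pitch + x for all eight lanes at once
    __m256i index = _mm256_add_epi32(_mm256_mullo_epi32(y, _mm256_set1_epi32(pitch)), x);
    return _mm256_i32gather_ps(texels, index, 4);   // scale of 4 bytes per float
}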
Note that our analysis above has covered the decoupling and unification of every major stage in texture sampling. This approach would eliminate both bottlenecks and underutilization, and would also enable new optimizations. It may seem like a large number of instructions would still be required to implement texture sampling, but it is important to keep in mind that on a unified architecture each core can execute any operation. When GPU hardware first became programmable and spent multiple cycles per pixel, vendors were still able to improve performance by adding additional arithmetic units running in parallel. Likewise, breaking up texture sampling into simpler operations allows spreading the work over more CPU functional units and cores.
A similar analysis can be made for fixed-function raster output operations (ROP). In fact, for GPU hardware that supports OpenGL's GL_EXT_shader_framebuffer_fetch extension [3], colour blending is already performed by the shader units. This is, at the time of writing, only supported by mobile GPUs, a fact that also illustrates that replacing dedicated hardware with programmable hardware doesn't have to be detrimental to performance and power consumption.

The ROP units are typically also responsible for anti-aliasing (AA). Interestingly, this functionality was broken in some versions of AMD's R600 architecture [4], but this did not prevent them from launching the product, as the drivers were able to implement anti-aliasing using the shader units. Note that the compute density has since increased significantly, so not having dedicated AA hardware would now have an even lower impact. Moreover, replacing dedicated hardware with more general-purpose compute units allows ROP unit die area to be used for many other purposes.
[3] An OpenGL ES extension: http://www.khronos.org/registry/gles/extensions/EXT/EXT_shader_framebuffer_fetch.txt
[4] Reported here: http://www.theinquirer.net/inquirer/news/1046479/ati-r600-manage-pixels-clock
Finally, there has recently been a great deal of successful research into screen-based AA algorithms, which do not require any dedicated hardware [5].

[5] See:
http://iryoku.com/aacourse/downloads/Filtering-Approaches-for-Real-Time-Anti-Aliasing.pdf
http://software.intel.com/en-us/articles/mlaa-efficiently-moving-antialiasing-from-the-gpu-to-the-cpu
This brings us to a more general discussion of dedicated versus programmable hardware. For decades, computing power has increased at a faster rate than memory bandwidth. It's easy to see why this will remain a universal truth as long as Moore's Law holds: with every halving of the semiconductor feature size, four times more logic can fit on the same area, but the contour can only fit two times more wires. Furthermore, the pin count of a chip does not scale linearly with semiconductor technology. Hitting the inevitable "memory wall" has been staved off several times by adding more metal layers, by building a hierarchy of caches, and by increasing the effective frequency per pin.

Going forward, other techniques will become essential to keep scaling the effective bandwidth. An extensive study [6] points out the best candidates, but probably the most interesting result is what won't work well: shrinking the core sizes. This is the least effective approach because when the majority of the die space is occupied by storage and communication structures to feed the execution logic, smaller execution logic only leads to a marginal increase in compute density. In essence, this is an argument against simple programmable cores and fixed-function logic in the long term. This has been one of the driving forces that has enabled graphics hardware to become programmable thus far, and it shows that even more programmability can be achieved in the future at a low cost. Eventually there won't be a significant advantage in using small GPU-like cores, and every core can instead have a more versatile CPU-like architecture.

[6] Rogers, et al. http://www.ece.ncsu.edu/arpers/Papers/isca09-bwwall.pdf
Conclusion
TransGaming believes that an eventual convergence of CPU and GPU computing into a revolutionary unified architecture is inevitable. This merger will give developers and end users the best of both worlds: highly parallel programming environments that interface easily with scalar code, full confidence that end users will always see the right pixels on the screen, regardless of drivers or operating systems, and the limitless potential for innovation that comes with software-based approaches [7].
Some of this convergence is already apparent with upcoming new hardware such as Intel's Haswell processor. AMD's Heterogeneous Computing architecture is another proof point on this roadmap, integrating CPU and GPU elements into a single processor, controlled through dynamically generated code. Non-graphics domains are also seeing benefits from similar dynamic, software-based approaches – for example, NVidia's Tegra 4 processor includes a fully software controlled radio.
TransGaming's SwiftShader technology offers a pioneering approach to software rendering, backed by powerful IP. SwiftShader is uniquely suited to providing TransGaming's customers with the ability to navigate the transition from today's mixed hardware through to future architectures that we can only speculate about. SwiftShader's dynamic code generation approach allows TransGaming to implement commonly used graphics APIs such as Direct3D 9 and OpenGL ES 2.0 on a variety of different contemporary systems, while paving the way towards a future where a graphics library is simply a set of building blocks that developers make use of on a piece by piece basis.
Many challenges remain to be overcome in order for this vision of unified computing to become a reality. TransGaming aims to play a key role in meeting these challenges and in helping to deliver on the resulting innovations.

[7] Some interesting ideas well suited to pure software approaches can be found here: http://graphics.cs.williams.edu/archive/SweeneyHPG2009/TimHPG2009.pdf
Copyright Notice
This document © 2013 TransGaming Inc. All
rights reserved.
Trademark Notice
SwiftShader, SwiftAsm, Reactor, and the
SwiftShader logo are trademarks of
TransGaming, Inc. in Canada and other
countries. Other company and product
names may be trademarks of the respective
companies with which they are associated.
Disclaimer and Limitation of Liability
ALL DESIGN SPECIFICATIONS, DRAWINGS,
PROGRAMS, SAMPLES, AND OTHER
DOCUMENTS (TOGETHER AND SEPARATELY,
“MATERIALS”) ARE BEING PROVIDED
“AS IS.” TRANSGAMING MAKES NO
WARRANTIES, EXPRESSED, IMPLIED,
STATUTORY, OR OTHERWISE WITH RESPECT
TO THE MATERIALS, AND EXPRESSLY
DISCLAIMS ALL IMPLIED WARRANTIES OF
NONINFRINGEMENT, MERCHANTABILITY,
AND FITNESS FOR A PARTICULAR PURPOSE.
Information furnished is believed to
be accurate and reliable. TransGaming,
Inc. assumes no responsibility for the
consequences of use of such information
or for any infringement of patents or other
rights of third parties that may result from
its use. No license is granted by implication
or otherwise under any patent or patent
rights of TransGaming Inc. Specifications
mentioned in this document are subject
to change without notice. SwiftShader and
Reactor technologies are not authorized
for use in devices or systems without
express written approval or license from
TransGaming, Inc.
For More Information: http://transgaming.com/swiftshader

About TransGaming Inc.
TransGaming Inc. (TSX-V: TNG) is the global leader in developing and delivering platform-defining social video game experiences to consumers around the world. From engineering essential technologies for the world's leading companies, to engaging audiences with truly immersive interactive experiences, TransGaming fuels disruptive innovation across the entire spectrum of consumer technology. TransGaming's core businesses span the digital distribution of games for Smart TVs, next-generation set-top boxes, and the connected living room, as well as technology licensing for cross-platform game enablement, software 3D graphics rendering, and parallel computing.
Website: http://transgaming.com