Architecture - UPC


Dec 14, 2013


Title: Implementation of a generic interface of a 3D graphic API for building concrete graphic API’s on top


Student: Vicente Escandell Miguel

Advisor/Faculty sponsor: Carlos González/Agustin Fernandez


Computer Architecture



Project title: Implementation of a generic interface of a 3D graphic API for building concrete graphic API’s on top

Student name: Vicente Escandell Miguel

Degree: Computer Engineering (Enginyeria informàtica)

Credits: 37.5

Faculty sponsor: Carlos González / Agustin Fernandez

Department: Computer Architecture


(name and signature)

President: Roger Espasa Sans

Member: Fernando Orejas Valdés

Secretary: Agustin Fernandez


Numerical grade:

Descriptive grade:









































If you compare how games have evolved over the last few years, you will be impressed by how much they have improved in every respect. These improvements are closely related to improvements in GPUs (graphics processing units). In recent years GPU FLOPS have increased rapidly because of the inherently independent, and therefore parallel, nature of the rasterization approach used in GPUs.

Given the importance GPUs have acquired in recent years, a group of researchers decided to study and develop high-performance microarchitectures for the next generations of GPUs.

To this end, a simulator and a set of tools have been developed. First of all, a complete and accurate simulator of a GPU based on the R580, G70 and G80 generations was designed and named ATTILA.

On top of it, a thin driver has been designed with one goal in mind: to be a thin layer that works directly with GPU transactions and with memory management and allocation. Other tasks are delegated to higher levels of abstraction.

On top of this driver, the two most popular graphics APIs, OpenGL and DirectX, were implemented independently of each other.

All these layers create a full 3D graphics stack capable of running traces taken from real games, allowing us to work with real workloads and obtain more realistic data from the research done on the simulator.

In order to obtain game traces from the different APIs, other tools have also been produced, such as tools to capture OpenGL traces and a player to replay them, among others.

Project characterization

When I started working with the ATTILA team, the existing API implementations were completely independent of each other. This was a reasonable decision given that in the real world each API is designed by different people who share nothing among themselves. The most commonly used are DirectX 9.0 (from now on D3D9) by Microsoft and OpenGL (from now on OGL), specified by the OpenGL ARB Working Group and implemented by each GPU designer.

Given that many of the main features of the different graphics APIs are very similar, it seems quite obvious that implementing each API independently of the others causes many problems and extra work. Much of the code does basically the same thing but is implemented several times, in fact once per supported API. For now only two APIs are implemented, but in the near future, as new APIs (D3D10 and D3D11) become more popular, we will be forced to implement each new API generation in order to keep our workloads up to date. This methodology creates overhead because we must write and debug much more code and, most importantly, because ATTILA is constantly growing to support new functionality, which forces us to modify the entire 3D graphics stack, including each API. That is too much work for the project to evolve quickly.

The ATTILA team took this problem into account and decided to create a graphics API abstraction layer that provides the common functionality found in state-of-the-art graphics APIs, in order to make the implementation of concrete APIs on top of it easier. This layer is called ATTILA Common Library (from now on ACD), and it contains all the basic functionality common to the different APIs. The ACD communicates with the GPU driver and exposes an upper interface that provides the functionality required to program specific APIs on top of it easily.

The ACD module provides an abstraction layer for the ATTILA architecture, offering a friendly high-level interface of a generic 3D graphics API, suitable for building concrete APIs on top. The ACD is more than a simple HAL as usually understood. It provides many features specially suited to making graphics API implementation easier. Among them, we highlight the following:

The ACD will manage all the resources provided by the simulator, so concrete APIs will not have to manage low-level data. To this end, the ACD layer will provide data container objects that can be used to store each object type.

GPU transactions will also be optimized. The ACD will be aware of the current hardware state, including register values and memory status, so it can avoid useless register updates. Data structures can also be optimized before being sent to the GPU, minimizing CPU-to-GPU bus traffic.
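The idea of avoiding useless register updates can be illustrated with a small shadow-state cache. This is only a sketch under stated assumptions: the class name `RegisterCache`, the `send` callback and the register name `"DEPTH_TEST"` are hypothetical and do not come from the ACD sources.

```python
_UNSET = object()  # sentinel: the register has never been written


class RegisterCache:
    """Shadow copy of GPU register state that filters redundant writes.

    `send` is a hypothetical callback standing in for the driver call
    that would actually issue a register write to the GPU.
    """

    def __init__(self, send):
        self._send = send
        self._shadow = {}  # last value written for each register

    def write(self, register, value):
        """Forward the write only if it changes the cached state."""
        if self._shadow.get(register, _UNSET) == value:
            return False  # redundant update: nothing is sent
        self._shadow[register] = value
        self._send(register, value)
        return True


# Usage: three API-level writes result in only two GPU transactions.
sent = []
cache = RegisterCache(lambda reg, val: sent.append((reg, val)))
cache.write("DEPTH_TEST", True)
cache.write("DEPTH_TEST", True)   # filtered out
cache.write("DEPTH_TEST", False)
```

The trade-off in a design like this is a little CPU-side bookkeeping in exchange for fewer transactions on the bus.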

Another important point is that the ACD will not include fixed-function management. This is delegated to a second, already implemented module (part of another student’s project) which will manage all that functionality. This second module is called ACDX (Attila Common Driver eXtensions). The reason for the separation is that fixed function is no longer supported in new graphics APIs, due to the emergence of fully programmable GPUs that use custom programs, called shaders, which replace fixed visual effects with programmable ones.

ACDX will be used only by older APIs (OpenGL and DirectX 9) in order to generate equivalent shader programs from the old fixed-function state. Therefore, the ACD will only work with shaders, coming either directly from the 3D application or from the ACDX module.

The second goal of this project is to implement the OpenGL module on top of the ACD layer (called AOGL) in order to test the ACD implementation. I will use the old monolithic OpenGL module of figure 1 (fully functional and already tested) as a reference for the new AOGL/ACD stack, comparing both the visual outcome (simulator-generated images) and the performance (using simulator statistics).

As a result of my project, the research group will have a graphics API abstraction layer on which to build many other concrete graphics APIs in the future.

Project organization

The documentation is organized as follows:

Chapter 1



Basic concepts

Base Graphics concepts

The rendering pipeline is responsible for creating a 2D image given a geometric description of the 3D world and a virtual camera that specifies the perspective from which the world is being viewed.

This section introduces the basic graphics pipeline concepts that most current graphics APIs use. Its purpose is not so much to explain in depth how the pipeline works as to give the reader the basic knowledge needed to understand the project. For readers who want to go deeper, good references are [][][].

If we look around us we can see objects with very different shapes. Being able to draw all of them directly is quite difficult, because many are irregular and therefore hard to define mathematically. Being able to draw every regular shape is also impossible, because there are infinitely many of them. Taking all this into account and thinking about how any object in the world can be represented leads to the conclusion that complex objects can be built from simple models; for example, we can build a pentagon by combining a square and a triangle. Going further, the most basic shape that has area is the triangle, and with it we can approximate any shape we want.

A triangle is nothing more than three vertices connected to each other, which also define a plane.

From now on, any object we want to use must be split into triangle meshes. As you might expect, using triangles to represent flat-sided shapes works fine and the result looks exactly the same, but for curved shapes triangles do not provide an accurate representation. The more triangles we use to define a curved shape, the better the approximation we obtain.

The triangle is the geometric shape we are going to use to define closed surfaces. Points and lines can also be used when they are needed.

Representing a 3D world forces us to use a coordinate system that lets us place the objects in that space. Computer graphics uses three-dimensional Euclidean space.

So, to represent a vertex we use three components that define its location.

Up to now, coordinates are the only attribute associated with a given vertex. But many other attributes can be associated with a vertex, such as color, material, texture coordinates, normal vector, and more.

Vertices are the most important information needed to render a game because they represent the geometry we are going to display. The way a set of vertices is transferred from the API to the GPU has changed a lot, from the earliest approach, where a single vertex was transferred in a single call, to the newer techniques explained next.

Newer APIs use large structures, usually referred to as vertex buffers, to store the information of a set of vertices so that they do not have to be sent one by one. Although each API works with vertex buffers in a different way, we will try to explain them independently of any API. Information such as vertex position, normal, texture coordinates and so on is saved in raw buffers. Each vertex attribute can be saved in its own buffer or several attributes can be combined in one. GPUs have an element called a stream, explained in more depth when the GPU architecture is described, which essentially reads buffers (vertex buffers) for use by later stages. The number of streams available must be taken into account because it is finite, so normally each stream contains several attributes.

The way each raw buffer is assigned to a stream varies from one API to another, and some of them use containers into which raw buffers are placed. As several attributes can be combined in a single stream, it is necessary to specify how each one can be found within the stream, so offset, stride and number of elements are information that every API needs in order to set the GPU state properly. Another important piece of information is which attributes are active and in which stream they can be found. This is also set by the API and is usually saved with the attribute information.
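How offset, stride and number of elements locate an attribute inside a combined stream can be sketched as follows. The layout (three 32-bit floats for position followed by four unsigned bytes for color) and the names `STRIDE`, `POSITION`, `COLOR`, `pack_vertices` and `fetch` are assumptions chosen for illustration; they do not correspond to any particular API.

```python
import struct

# Assumed interleaved layout: position (3 floats) then color (4 unsigned bytes).
STRIDE = 3 * 4 + 4  # bytes from the start of one vertex to the next
POSITION = {"offset": 0, "count": 3, "type": "f"}
COLOR = {"offset": 12, "count": 4, "type": "B"}


def pack_vertices(vertices):
    """Build one raw buffer holding position and color for every vertex."""
    buf = bytearray()
    for position, color in vertices:
        buf += struct.pack("<3f", *position)
        buf += struct.pack("<4B", *color)
    return bytes(buf)


def fetch(buffer, attribute, index):
    """Read one attribute of vertex `index` using offset, stride and count."""
    fmt = "<%d%s" % (attribute["count"], attribute["type"])
    start = index * STRIDE + attribute["offset"]
    return struct.unpack_from(fmt, buffer, start)


stream = pack_vertices([
    ((0.0, 0.0, 0.0), (255, 0, 0, 255)),
    ((1.0, 0.0, 0.0), (0, 255, 0, 255)),
    ((0.0, 1.0, 0.0), (0, 0, 255, 255)),
])
```

With this layout, `fetch(stream, POSITION, 1)` reads the second vertex position and `fetch(stream, COLOR, 2)` the third vertex color, both from the same buffer.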

If we look closely at a triangle mesh we will notice another fact: many triangles share vertices with others. If we treat them as independent triangles we will use much more space to store them, and as a consequence bus traffic increases, as do the API overhead and the GPU memory needed to store the geometric information.

To solve this problem, index buffers can be used. Index buffers store indices, which are nothing more than pointers to vertices stored inside a vertex buffer. This way, shared vertices do not need to be stored repeatedly.
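As a small illustration, a quad made of two triangles needs six vertices when drawn as independent triangles but only four vertices plus six indices when indexed. The byte sizes below (32-bit floats, 16-bit indices) are assumptions for the sake of the arithmetic.

```python
# A quad built from two triangles that share the diagonal v0-v2.
vertex_buffer = [
    (0.0, 0.0, 0.0),  # v0
    (1.0, 0.0, 0.0),  # v1
    (1.0, 1.0, 0.0),  # v2
    (0.0, 1.0, 0.0),  # v3
]
index_buffer = [0, 1, 2, 0, 2, 3]


def expand(vertices, indices):
    """Resolve the indices into the flat, non-indexed vertex list."""
    return [vertices[i] for i in indices]


flat = expand(vertex_buffer, index_buffer)

FLOAT_BYTES, INDEX_BYTES = 4, 2  # assumed sizes
indexed_size = len(vertex_buffer) * 3 * FLOAT_BYTES + len(index_buffer) * INDEX_BYTES
flat_size = len(flat) * 3 * FLOAT_BYTES
```

Here the saving is small because only positions are stored; it grows with every extra attribute attached to a vertex.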

Another way to decrease the number of vertices comes from noticing that the triangles being drawn are usually packed together in a mesh. Inside this mesh many triangles share vertices with their neighbors; in fact, one triangle can be defined using only one new vertex plus two vertices from an adjacent triangle. We then only need to specify the three vertices of the first triangle and one more vertex for each further triangle being drawn.

There are different ways the remaining vertices can be chosen, but the two most commonly used are the triangle strip and the triangle fan. A triangle strip takes the other two vertices from the last two vertices of the preceding triangle, while a triangle fan always uses the first vertex of the first triangle and the last vertex of the previous triangle, which is why it is called a fan.
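The two schemes can be made concrete by expanding a strip and a fan back into independent triangles. This is a sketch; the function names are illustrative, and the winding-order swap for odd strip triangles follows the usual OpenGL-style convention, which is an assumption about the API in use.

```python
def strip_to_triangles(indices):
    """Triangle i of a strip uses vertices i, i+1 and i+2."""
    triangles = []
    for i in range(len(indices) - 2):
        a, b, c = indices[i], indices[i + 1], indices[i + 2]
        # Swap two vertices on odd triangles so that every triangle keeps
        # the same winding order.
        triangles.append((a, b, c) if i % 2 == 0 else (b, a, c))
    return triangles


def fan_to_triangles(indices):
    """Every triangle of a fan shares the first vertex."""
    return [(indices[0], indices[i], indices[i + 1])
            for i in range(1, len(indices) - 1)]


strip = strip_to_triangles([0, 1, 2, 3, 4])
fan = fan_to_triangles([0, 1, 2, 3, 4])
```

In both cases five vertices produce three triangles, instead of the nine vertices an independent triangle list would need.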

So far we have been talking about vertices and their position in a given space. Objects that are going to be drawn are designed in a local space, which is centered on the object being designed; as a consequence, each object we want to draw is in its own local space. This is done because defining an object directly where it has to be in world space is much more difficult, as it implies working with awkward coordinates and positions. Moreover, many scenes contain objects that are used repeatedly, so if an object were defined using world coordinates we would be forced to send vertices for every instance we draw instead of issuing them only once and reusing them by performing matrix operations on them. Working in local space is therefore easier, because we work only with the figure we are designing, and it allows reuse, saving memory space, bandwidth and API overhead.

So when we want to use an object we must change its coordinate space, from the local space where it was designed to the world space that is used to draw the scene. World space contains all the geometry we are going to draw, so everything is relative to the same coordinate system.
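The reuse argument can be sketched with plain 4x4 matrices: one local-space triangle is placed twice in world space using two different world matrices. The helper names and the column-vector convention (the matrix multiplies the point from the left) are assumptions made for this example.

```python
def translation(tx, ty, tz):
    return [[1, 0, 0, tx],
            [0, 1, 0, ty],
            [0, 0, 1, tz],
            [0, 0, 0, 1]]


def scaling(s):
    return [[s, 0, 0, 0],
            [0, s, 0, 0],
            [0, 0, s, 0],
            [0, 0, 0, 1]]


def matmul(a, b):
    """4x4 matrix product a * b."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def transform_point(m, point):
    """Apply matrix m to a 3D point treated as the column vector (x, y, z, 1)."""
    v = (point[0], point[1], point[2], 1.0)
    out = [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]
    return tuple(out[:3])


# The triangle is defined once, in its own local space.
local_triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]

# Two instances: one scaled by 2 and moved to x = 10, one moved to (-5, 3, 0).
world_a = matmul(translation(10, 0, 0), scaling(2))
world_b = translation(-5, 3, 0)

instance_a = [transform_point(world_a, p) for p in local_triangle]
instance_b = [transform_point(world_b, p) for p in local_triangle]
```

Only the two matrices differ between the instances; the vertex data is sent once.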

Once all the geometry is placed, it is time to set the camera from which the viewer watches the scene. The camera specifies the volume of the world the viewer is able to see, and thus the volume of the world from which a 2D image must be generated.

Axonometric and perspective are the types of camera used. Perspective is the typical view a person has when watching something in real life; with this view all lines tend towards a central point where they meet. Axonometric is an artificial type of camera where the aim is to preserve the parallelism of lines that are parallel in real life, as well as object proportions; these goals are not achieved with the perspective camera. The axonometric camera is commonly used in CAD environments, where real proportions are needed.

There are many ways to build a camera, but one of the most common is to give the camera position, a target point the camera is looking at, and an up vector that indicates where the top of the camera points.
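A camera given by position, target and up vector can be turned into three orthogonal axes, which then express any world-space point relative to the camera. The sketch below assumes a left-handed convention (as in Direct3D-style look-at helpers) and an up vector that is not parallel to the view direction; the function names are illustrative.

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(v):
    length = math.sqrt(dot(v, v))
    return (v[0] / length, v[1] / length, v[2] / length)


def look_at_basis(position, target, up):
    """Return the camera's right, up and forward unit vectors."""
    forward = normalize(sub(target, position))
    right = normalize(cross(up, forward))  # left-handed convention (assumed)
    true_up = cross(forward, right)
    return right, true_up, forward


def to_camera_space(point, position, basis):
    """Express a world-space point relative to the camera axes."""
    right, true_up, forward = basis
    rel = sub(point, position)
    return (dot(rel, right), dot(rel, true_up), dot(rel, forward))


basis = look_at_basis((0.0, 0.0, -5.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0))
seen = to_camera_space((1.0, 2.0, 0.0), (0.0, 0.0, -5.0), basis)
```

For a camera at z = -5 looking at the origin, the point (1, 2, 0) ends up 5 units in front of the camera.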

Once the camera is placed, one more step is missing before the transformation is finished. We know where the camera is looking, but a person does not have an infinite range of view, and as in real life the camera’s range of view must be bounded; the bounded volume is called the frustum. The near and far planes are two parallel planes, perpendicular to the camera view direction, that determine where the viewer’s range starts and where it ends. Both can be defined by their distance from the camera, as they are parallel to the xy plane.

The near and far planes only bound the z coordinate. In order to do the same with the x and y coordinates we must define the projection window. This window bounds the xy coordinates and is also where the vertices in world coordinates are projected. Points are projected according to the camera chosen before. If a perspective camera was chosen, the projected point is computed by drawing a line between the point and the camera origin. As a single point defines the camera origin, the lines from all scene vertices end at that same point. This is why two identical objects that differ only in their distance to the camera look one bigger than the other. The orthogonal camera works differently: instead of drawing a line from the point to the camera origin, we draw a line perpendicular to the xy plane that goes through the desired point. Taking the two identical objects mentioned before, we can see that they will look exactly the same size.
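The difference between the two projections can be checked numerically. In this sketch the projection plane is assumed to sit at distance `d` in front of a camera located at the origin and looking down +z; the function names are illustrative.

```python
def project_perspective(point, d):
    """Project onto the plane z = d along the line to the camera origin."""
    x, y, z = point
    return (x * d / z, y * d / z)


def project_orthographic(point):
    """Project along a line perpendicular to the xy plane: z is dropped."""
    x, y, _ = point
    return (x, y)


def projected_height(edge, project):
    """Height on the projection window of a vertical edge (two points)."""
    (_, y0), (_, y1) = (project(p) for p in edge)
    return abs(y1 - y0)


# The same unit-high edge placed at two different depths.
near_edge = [(0.0, 0.0, 2.0), (0.0, 1.0, 2.0)]
far_edge = [(0.0, 0.0, 4.0), (0.0, 1.0, 4.0)]


def perspective(p):
    return project_perspective(p, 1.0)


persp_near = projected_height(near_edge, perspective)
persp_far = projected_height(far_edge, perspective)
ortho_near = projected_height(near_edge, project_orthographic)
ortho_far = projected_height(far_edge, project_orthographic)
```

With the perspective camera the edge twice as far away appears half as tall, while the orthogonal camera shows both at the same height.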

The projection window can be defined from a view angle (alpha) and an aspect ratio. By definition the aspect ratio is ar = width / height. Using trigonometry we can also obtain how far this window is from the camera, so this is all the information we need.
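The trigonometric step can be written out as follows, assuming that alpha is the vertical view angle and that the projection window has a height of 2, so that its half-height is 1 (the symbol d for the distance to the window is introduced here for illustration):

```latex
\tan\!\left(\frac{\alpha}{2}\right) = \frac{1}{d}
\quad\Longrightarrow\quad
d = \cot\!\left(\frac{\alpha}{2}\right),
\qquad
\text{window width} = 2\,ar
```

For example, a view angle of 90 degrees places the window at distance d = 1 from the camera.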

The result of all the transformations we have applied is the scene in view space, which is essentially the 2D image of the scene.

The choice of projection window is not a trivial decision and has its consequences when we have to map the window to the final back buffer, which contains what is sent to the screen. Although the aspect ratios of the two windows can be different, it is desirable that they be the same, because otherwise the transformation from one to the other introduces some deformation.

Another problem in the mapping process arises because the projection window dimensions depend on the aspect ratio, so to perform this transformation the hardware would need to be told what the aspect ratio is. To make it independent we perform another transformation, from view space to normalized device coordinates (NDC). The xy view space coordinates have a width of 2ar and a height of 2. The NDC transformation changes the x coordinate interval from [-ar, ar] to [-1, 1].
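A minimal sketch of this interval change, followed by the mapping to back-buffer pixels, is shown below. The function names and the pixel convention (origin at the top-left corner, y pointing down) are assumptions for illustration.

```python
def view_to_ndc(x, y, aspect_ratio):
    """Map x from [-ar, ar] to [-1, 1]; y already lies in [-1, 1]."""
    return (x / aspect_ratio, y)


def ndc_to_pixels(x_ndc, y_ndc, width, height):
    """Map NDC to back-buffer pixels, with y pointing down (assumed)."""
    return ((x_ndc + 1.0) * 0.5 * width, (1.0 - y_ndc) * 0.5 * height)


ar = 1280 / 720
right_edge = view_to_ndc(ar, 1.0, ar)
left_edge = view_to_ndc(-ar, -1.0, ar)
center_px = ndc_to_pixels(0.0, 0.0, 1280, 720)
```

After the division by ar, nothing downstream needs to know the aspect ratio of the projection window.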

Although we are now working in a 2D space we still need the Z coordinate, because it is used to determine which object is in front of another and for other optimizations that discard geometry early.

The Z transformation differs depending on which API we are working with. Depending on the API, Z is scaled to one of the following ranges: [0,1], [-1,1], or others. One important thing most APIs have in common is that the scale is not proportional: it is more precise when the vertices are near the viewer and less precise as they move away. This is done because more precision is needed for near objects, as errors there are more visible than errors far away.

Through those steps we transform all the geometry we want to draw from an individual definition to a 2D definition. The depth value must nevertheless be saved for subsequent operations in order to determine which element is in front of another. The depth value is also scaled to [0,1], but the scaling is non-linear in order to have more resolution at the front. The reason is quite simple: users will notice an error near the camera much more than one far away, so we can use fewer bits while obtaining similar results.
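The non-proportional depth scale can be seen in a common perspective depth mapping to [0, 1]. The formula below is the usual Direct3D-style mapping and is used here as an assumption; the text does not say which exact mapping the project uses.

```python
def depth_01(z, near, far):
    """Map a view-space z in [near, far] to a depth value in [0, 1]."""
    return (far / (far - near)) * (1.0 - near / z)


NEAR, FAR = 1.0, 100.0
at_near = depth_01(NEAR, NEAR, FAR)
at_far = depth_01(FAR, NEAR, FAR)
at_double_near = depth_01(2.0 * NEAR, NEAR, FAR)
```

With near = 1 and far = 100, a point only twice as far as the near plane already has a depth above 0.5, so about half of the representable depth values are spent on the first 1% of the view distance.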

The space transformations described here can be set up in two different ways. Old graphics pipelines set the GPU state by modifying some registers of the GPU; in other words, the GPU has all the space transformations built in and only needs the matrices that are going to be used. Other space transformations or tricky modifications therefore cannot be done. New GPUs have abandoned this way of working, and now a program called a vertex program is created and executed for each vertex.

Being able to program all the operations gives us much more flexibility, because with the fixed-function model we can only configure what the GPU was designed for, while with a vertex program we can perform any operation the instruction set allows, so there is no restriction on creativity. When vertex shader programs are used, the program is responsible for performing the appropriate space transformations so that the scene is viewed correctly.

A vertex shader is nothing more than a program written in one of the programming languages designed for shader programming. Some examples are ARB assembly and GLSL for OpenGL and HLSL for DirectX.



!!ARBvp1.0
TEMP vertexClip;
DP4 vertexClip.x, state.matrix.mvp.row[0], vertex.position;
DP4 vertexClip.y, state.matrix.mvp.row[1], vertex.position;
DP4 vertexClip.z, state.matrix.mvp.row[2], vertex.position;
DP4 vertexClip.w, state.matrix.mvp.row[3], vertex.position;
MOV result.position, vertexClip;
MOV result.color, vertex.color;
MOV result.texcoord[0], vertex.texcoord;
END



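#version 150

uniform Transformation {
    mat4 projection_matrix;
    mat4 modelview_matrix;
};

in vec3 vertex;

// Assumed body (lost in the source): standard model-view-projection transform
void main() { gl_Position = projection_matrix * modelview_matrix * vec4(vertex, 1.0); }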







// Very simple vertex shader

// These lines are just for EffectEdit:
string XFile = "tiger";   // model (the file name is truncated in the source)
int BCLR = 0xff202080;    // background

// transformations provided by the app:
float4x4 matWorldViewProj;

// the format of our vertex data
struct VS_OUTPUT { float4 Pos : POSITION; };

// Vertex Shader: simply carry out transformation
VS_OUTPUT VS(float4 Pos : POSITION)
{
    VS_OUTPUT Out = (VS_OUTPUT)0;
    Out.Pos = mul(Pos, matWorldViewProj);
    return Out;
}

// Effect technique to be used
technique TVertexAndPixelShader { pass P0 { VertexShader = compile vs_1_1 VS(); } }


Vertex shaders have access to a wide range of resources in order to perform their operations. The most basic one is a constant table that is provided by the programmer together with the shader code. Inside the code, vertex information can also be accessed from the streams. Which vertex information is accessible depends on what was declared for each vertex when the vertex buffers were declared. How vertex information is accessed depends on the shader language.

Another important task performed by the vertex program, or by the fixed pipeline when no shader program exists, is lighting the scene.

The next part introduces the color theory needed to understand how color is selected.

Color theory & lighting

There are many ways to represent a color, but graphics usually uses the RGB convention.

RGB stands for red, green and blue, which combined in different proportions generate practically any color a person can see.

This model was chosen because, according to the trichromatic theory, the retina contains three kinds of color receptors, each sensitive to red, green or blue light, with some overlap. The incoming RGB light stimulates the corresponding receptors, creating the picture we perceive.

Usually each component uses 1 byte, giving 256 possible values per component and 256^3 = 16,777,216 possible colors.

Colors in RGB form are written (R, G, B), where R, G and B are usually represented using values between 0 and 1.
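The relation between the 1-byte-per-component storage and the 0 to 1 representation can be sketched as below; the function names are illustrative.

```python
def rgb8_to_unit(r, g, b):
    """Convert 8-bit components (0..255) to values between 0 and 1."""
    return (r / 255.0, g / 255.0, b / 255.0)


def unit_to_rgb8(color):
    """Convert components between 0 and 1 back to 8-bit values, clamping first."""
    return tuple(int(round(max(0.0, min(1.0, c)) * 255.0)) for c in color)


TOTAL_COLORS = 256 ** 3  # three components with 256 values each

red = rgb8_to_unit(255, 0, 0)
round_trip = unit_to_rgb8(rgb8_to_unit(12, 200, 255))
```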

RGB is not the only format that can be used; the available formats are restricted by the API features, and some special ones by GPU capabilities, but current GPUs and APIs support a wide range of formats, from RGB with different component precisions (the more bits a component has, the greater its precision) to compressed formats, which are really useful when there are memory restrictions.

Another component a color format may have is alpha, giving RGBA. The alpha component expresses how transparent the color is, in other words how easy it is to see through a material.

The color of an object is the light our eyes’ receptors receive after it has been reflected by the object. The more light the object reflects, the brighter it looks. To test this you only have to stand in a room with a dimmable light and turn it up and down. As you increase the power the objects get brighter, and as you decrease it they get darker, until the light is off and nothing can be seen because no light is reflected.

The problem is that objects also reflect the light they receive, acting as light sources themselves, so objects that are supposed to be dark because they are not facing the light still receive some light reflected by other objects.

The model explained above is the global lighting model, which faithfully represents real lighting. But computing all possible light interactions, not only between each light source and each object but also all the light any object could reflect, is so computationally expensive that it cannot be achieved in real-time applications. To obtain credible illumination without such a cost, a local lighting model must be used.

The local model has some differences that make it much faster than the global model. While the global model computes all light interactions between each light source and each object and between different objects, the local model only computes the contribution of each light source to the object. This model does not take into account the possibility that other objects lie between the light source and the object being considered, so an object that is behind another one and is not supposed to receive light from the source is illuminated as if nothing were in front of it. The model therefore does not generate shadows; if they are desired, they must be generated using other techniques.

The absorption and reflection of light depend on many factors, so in order to compute them Lambert’s cosine law will be used. It is necessary to take into account that when light strikes a point head-on, the reflection is more intense than when it just glances the surface point. To determine the angle at which light strikes the surface, the dot product between the vertex normal (the vector that describes the direction a polygon is facing) and the light direction vector is used.

So, Lambert’s cosine law expresses the intensity with which light strikes an object with this formula:

Intensity = max(cos(angle), 0) = max(L∙n, 0)

where L is the light direction and n the normal vector, both of unit length. The intensity ranges from 0 to 1.
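Lambert's cosine law as stated above can be sketched in a few lines. The sketch assumes that the light direction points from the surface towards the light, and it normalizes both vectors so that the dot product equals the cosine of the angle; the function names are illustrative.

```python
import math


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def normalize(v):
    length = math.sqrt(dot(v, v))
    return (v[0] / length, v[1] / length, v[2] / length)


def lambert_intensity(light_dir, normal):
    """Intensity = max(L . n, 0) for a light direction pointing towards the light."""
    return max(dot(normalize(light_dir), normalize(normal)), 0.0)


head_on = lambert_intensity((0.0, 0.0, 1.0), (0.0, 0.0, 1.0))
grazing = lambert_intensity((1.0, 0.0, 0.0), (0.0, 0.0, 1.0))
behind = lambert_intensity((0.0, 0.0, -1.0), (0.0, 0.0, 1.0))
```

The `max` with 0 is what keeps surfaces facing away from the light from receiving a negative intensity.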

As in real life, light sources also have a color. It is also represented using the RGB convention: for example red light is (1, 0, 0), yellow is (1, 1, 0), and so on.

Objects do not have a color as such; in fact, color is specified through the material an object has. A material is just the way an object behaves when light strikes it. When an object is defined as having some color, for example red, (1, 0, 0) in the RGB convention, what is meant is that the object is made of a material that, under a particular light source, reflects this light color, which then reaches our eyes. In the case of red (1, 0, 0), 1 means that all the light received in that component is reflected, while 0 means that all of it is absorbed by the object. So when an object with this material is illuminated, it reflects red, which is the color a person looking at it sees. It is important to take into account that the light also matters: the material was defined as a property that says which light is reflected, but if the light used to illuminate the object does not emit that color, nothing is reflected. In the previous case, if a red material (1, 0, 0) is illuminated with a cyan light (0, 1, 1), the object does not reflect any light, because the light source has no red component and its blue and green components are absorbed by the object.

Another important topic is the different kinds of light that can be found. The local model describes three different ways light rays can be propagated:
Ambient light

As mentioned before, light reflections generated by other objects are not considered. If this rule were followed strictly, every area not facing the light source would be entirely black. This behavior is not realistic because, as mentioned before, objects reflect some amount of light, so normally all objects receive some illumination. To simulate this behavior without having to follow the global model, an ambient light is set. Ambient light is nothing more than a light that is added to every vertex of the scene regardless of where the light source and the object are.

So when ambient light is used, the following formula is applied to each vertex color. With c the light color and m the diffuse material color, the reflected light is computed, component by component, using the following formula:

Reflected ambient light = Light_ambient x material

As an example:

(0.5, 0.25, 0) = (1, 0.5, 0) x (0.5, 0.5, 1)
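The product of light and material is taken component by component, which a short sketch makes explicit. The function name `modulate` is illustrative; the first example reproduces the numbers above and the second the red material under cyan light discussed earlier.

```python
def modulate(light, material):
    """Component-wise product of a light color and a material color."""
    return tuple(l * m for l, m in zip(light, material))


ambient_example = modulate((1.0, 0.5, 0.0), (0.5, 0.5, 1.0))
red_under_cyan = modulate((0.0, 1.0, 1.0), (1.0, 0.0, 0.0))
```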

Diffuse light

Diffuse light occurs when a light ray hits a surface and scatters in various random directions. Because the light is scattered in random directions, the viewer receives the same amount of light regardless of position.

As mentioned before, when light strikes a surface point head-on the reflection is more intense than when it just glances the surface point. This must be taken into account when computing how diffuse light interacts with the object material. To compute the angle at which the ray strikes the surface, the dot product between the vertex normal (the vector that describes the direction a polygon is facing) and the light direction vector is used.

So, with n the surface normal and L the ray vector, the reflected diffuse light is computed using the following formula:

Reflected diffuse light = Kd x Light_diffuse x material, where Kd = max(L∙n, 0)

Specular light

Smooth surfaces behave differently when a ray hits them. In this case the light is reflected in one direction.

In real life, this is the kind of light that produces bright spots on surfaces.

As you can see, the most important difference between diffuse and specular light is that in one case light is scattered in all directions while in the other it is reflected in a single direction. The viewer will therefore only see this light when the reflection vector points at the viewer. This case is really rare, because the viewer must be in a very specific position to be able to see the specular light.

It is also unrealistic: while you are in the right position the specular light is perfectly visible, but once you move a little it suddenly disappears, something that does not happen in real life.

To make it more realistic, the light is visible not only to viewers exactly in those positions but also to those near them. So the reflection is modeled as a cone. As being on the reflection direction is not the same as being on the cone boundary, the closer the viewer is to the cone center, the more intense the light received. The cone size is determined by the angle between the specular reflection direction and the side of the cone.

To model this behavior, we modify the function used in Lambert’s cosine law by raising it to some power p. P is related to the cone angle: the bigger p is, the smaller the cone angle, and vice versa.

So, the specular coefficient can be defined as follows, where v is the direction to the viewer, r the reflection direction and angle the angle between them:

Ks = max(cos(angle), 0)^p = max(v∙r, 0)^p   if L∙n > 0
Ks = 0                                      if L∙n <= 0

Reflected specular light = Ks x Light_specular x material
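A sketch of the specular coefficient follows. It assumes unit-length vectors, with L pointing from the surface to the light and v from the surface to the viewer, and computes the reflection direction as r = 2(n∙L)n - L, which is the standard mirror reflection; these conventions and the function names are assumptions, since the text does not spell them out.

```python
import math


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def normalize(v):
    length = math.sqrt(dot(v, v))
    return (v[0] / length, v[1] / length, v[2] / length)


def reflect(light_dir, normal):
    """Mirror the surface-to-light direction about the unit normal."""
    d = dot(light_dir, normal)
    return tuple(2.0 * d * n - l for n, l in zip(normal, light_dir))


def specular_coefficient(light_dir, normal, view_dir, p):
    """Ks = max(v . r, 0)^p when the surface faces the light, else 0."""
    if dot(light_dir, normal) <= 0.0:
        return 0.0
    r = reflect(light_dir, normal)
    return max(dot(view_dir, r), 0.0) ** p


n = (0.0, 0.0, 1.0)
light = (0.0, 0.0, 1.0)
off_axis_view = normalize((0.3, 0.0, 1.0))

on_axis = specular_coefficient(light, n, (0.0, 0.0, 1.0), 32)
wide = specular_coefficient(light, n, off_axis_view, 1)
narrow = specular_coefficient(light, n, off_axis_view, 32)
```

For the same off-axis viewer, the larger exponent gives the smaller coefficient, which is the narrowing of the cone described above.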

Usually all three types of light are present in a scene, so the final formula combining the three lighting terms is:

Reflected light = (Kd x Light_diffuse + Ks x Light_specular + Light_ambient) x material

We have been talking about the different ways light can strike a surface. Now we turn to the different kinds of light source.

Parallel light

Parallel lights represent light sources that are so far away from the object that their rays arrive parallel to each other. An example of this kind of light is sunlight.

Point lights

Point lights are the typical lights that radiate spherically from a given point. Light bulbs are good examples of this kind of light. To make them more realistic, light intensity weakens as a function of distance following the inverse square law:

Intensity(distance) = Light intensity / distance^2

The results obtained from the formula above are not very good when used in computer graphics, so to improve them it is replaced by one that is more configurable and with which designers can obtain better results by changing the values:

Intensity(distance) = Light intensity / (a0 + a1∙distance + a2∙distance^2)

So if we incorporate attenuation into the light equation we get:

Reflected light = (Kd x Light_diffuse + Ks x Light_specular + Light_ambient) x material / (a0 + a1∙distance + a2∙distance^2)
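The configurable attenuation can be sketched as a single function. The coefficient names a0, a1 and a2 (constant, linear and quadratic terms) are an assumption made for this example; choosing a0 = 0, a1 = 0, a2 = 1 recovers the inverse square law.

```python
def attenuation(distance, a0, a1, a2):
    """Factor 1 / (a0 + a1*d + a2*d^2) applied to the light intensity."""
    return 1.0 / (a0 + a1 * distance + a2 * distance * distance)


def attenuated(color, distance, a0, a1, a2):
    """Scale every component of a reflected color by the attenuation factor."""
    k = attenuation(distance, a0, a1, a2)
    return tuple(c * k for c in color)


inverse_square = attenuation(2.0, 0.0, 0.0, 1.0)
no_falloff = attenuation(2.0, 1.0, 0.0, 0.0)
linear_mix = attenuation(4.0, 1.0, 0.5, 0.0)
dimmed = attenuated((1.0, 0.5, 0.0), 2.0, 0.0, 0.0, 1.0)
```

The constant and linear terms let a designer soften the falloff, which with the pure inverse square law tends to be too abrupt close to the light.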


A spotlight is more or less the same as a point light, with the difference that instead of radiating in all directions it radiates only through a cone.

The way a spotlight is calculated is similar to specular reflection. The light cone is defined by a vector d, which defines the cone center, and by its opening angle. Also, the closer a ray is to the cone boundary, the less intensity it has. So for a given ray, which forms some angle with the cone center, the light intensity is defined by the following formula:


Spotlights also take into consideration that the farther an object is from the light source the less
intensity it receives, so spotlights also use the same attenuation parameters that have been seen
with point lights. So the final formula to compute a spotlight is the following:

Reflected light = Kspot x (Kd x Light_diffuse + Ks x Light_specular + Light_ambient) x material
                  / (a0 + a1 ∙ distance + a2 ∙ distance^2)
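The exact expression for the spotlight factor is not preserved in the text. As a sketch only, and assuming a formulation analogous to the specular coefficient (the cosine of the angle between the ray and the cone center, raised to an exponent), the factor could be computed as follows. The names spot_factor, ray and exponent are hypothetical, and both vectors are assumed to be unit length.

```python
def spot_factor(ray, cone_center, exponent):
    """Assumed spotlight factor: max(cos(angle), 0)^exponent.

    ray         -- unit vector from the light towards the lit point
    cone_center -- unit vector d along the center of the cone
    exponent    -- positive number; larger values give a narrower cone
    """
    cos_angle = sum(a * b for a, b in zip(ray, cone_center))
    return max(cos_angle, 0.0) ** exponent
```

A ray aligned with the cone center keeps its full intensity, and the factor falls towards zero as the ray approaches the boundary of the cone, which matches the behavior described above.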

So, as a recap, we have seen that there are different light sources and how the light they radiate
interacts with the surfaces it strikes. In a complex scene with multiple light sources, for each
vertex we must calculate the light it receives from each light source, so the more complex the
scene is and the more lights it has, the more computationally expensive it will be.

After each vertex has gone through the vertex shading stage, it is time to group every set of vertices
into the appropriate primitive. So, in case we selected the triangle as our current primitive, each set
of 3 vertices is grouped into a triangle. From now on our working set will be primitives.

All the transformations described above do not reduce the amount of vertices, so even though some
vertices are outside the frustum they continue down the pipeline until this step. Now it is time to
remove the geometry that is outside the frustum area. Clipping is the stage that performs this task.

What clipping does is discard all the triangles that are outside the frustum area. Primitives that are
completely outside the frustum are easy to discard, but there are some triangles that are half in
and half out. In those cases the part of the triangle that is outside the frustum is clipped away
and smaller primitives are created from the part that is inside the frustum.

Another technique that is used in order to reduce the amount of primitives that are going to be
processed is backface culling. In a scene there is some geometry that is not facing the viewer, so
this geometry is not visible and may be discarded.

In order to determine how a triangle is facing the camera some conventions must be adopted. First of
all, the order in which the triangle vertices are introduced in the graphic pipeline is important. To
determine where the triangle is facing, the triangle normal is going to be computed.

This normal is going to be used to determine where the triangle is facing.
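A minimal sketch of this idea, assuming the convention that front faces list their vertices counter-clockwise when seen from the viewer in a right-handed coordinate system (the actual convention is configurable in the APIs). The normal is obtained from the cross product of two edges and compared with the direction from the eye to the triangle. All names are illustrative.

```python
def sub(a, b):
    """Component-wise difference a - b of two 3D vectors."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a, b):
    """Cross product of two 3D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    """Dot product of two 3D vectors."""
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def is_front_facing(v0, v1, v2, eye):
    """True when the triangle normal points towards the eye position."""
    normal = cross(sub(v1, v0), sub(v2, v0))  # depends on the vertex order
    view = sub(v0, eye)                       # direction from the eye to the triangle
    return dot(normal, view) < 0.0
```

Swapping two vertices flips the normal, which is why the vertex order matters.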

Up to now we have been talking about primitives and the way they are distributed in a given
viewer space. Remember that we are working with NDC coordinates, so our space is [-1,1], and
there is no relation between the pixels that are part of the viewport and the space we are
working with.

Once we have discarded all the primitives that we know for sure are not going to be displayed, it is
time to map the space we are working with to the final matrix of pixels that represents the viewport,
which is called the frame buffer.

This step is done by the rasterization stage, where given a primitive in the viewer space we
generate all the pixels that are mapped to the region described.

There are many algorithms that can be used to perform the rasterization process, but they are not
going to be explained.
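Purely as an illustration, and not necessarily the algorithm used by any particular GPU, the sketch below maps NDC coordinates to pixel coordinates and then tests every pixel center against the three triangle edges (the edge function approach). The names and the choice of flipping the y axis so that row 0 is the top of the viewport are assumptions.

```python
def ndc_to_pixel(x, y, width, height):
    """Map NDC coordinates in [-1, 1] to pixel coordinates (row 0 at the top)."""
    return (x + 1.0) * 0.5 * width, (1.0 - y) * 0.5 * height


def edge(a, b, p):
    """Signed area test: tells on which side of the edge a->b the point p lies."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(triangle_ndc, width, height):
    """Return the (column, row) of every pixel whose center is inside the triangle."""
    a, b, c = (ndc_to_pixel(x, y, width, height) for x, y in triangle_ndc)
    area = edge(a, b, c)
    if area == 0:
        return []  # degenerate triangle, nothing to draw
    fragments = []
    for row in range(height):
        for col in range(width):
            p = (col + 0.5, row + 0.5)  # pixel center
            w = (edge(b, c, p), edge(c, a, p), edge(a, b, p))
            # All three edge values must share the sign of the triangle area.
            if all(v >= 0 for v in w) if area > 0 else all(v <= 0 for v in w):
                fragments.append((col, row))
    return fragments
```

A real rasterizer would restrict the loops to the bounding box of the triangle and interpolate the vertex attributes using the same edge values.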

After the rasterization stage our working objects are fragments. DirectX terminology also refers to
them as pixels, but for us fragments are pixels that may or may not be visible in the final frame.

The next stage in the fragment life cycle is fragment shading. A fragment shader is for a fragment
what a vertex shader was for a vertex. As we previously said for vertex shaders, old graphic cards
are fixed function, so they do not use this kind of program.

The pixel shader’s main function is to generate the final color for the fragment that is being
rasterized. In order to do this, the fragment shader can access the attributes obtained from the
interpolation that took place during the rasterization stage. Other values that can be accessed
through fragment shaders are texture colors, which are accessed through the texture units. As in
vertex shaders, fragment shaders can also access constant values that have been defined in the
program. The flexibility that a fragment shader gives allows multiple combinations in order to obtain
incredible results.

How a texture can be accessed is what is going to be explained next.

As we said before, fragment shaders can perform texture accesses, but what is a texture access?

Having to define a unique color for each vertex does not give good results if we would like to draw
detail on the geometry with the basic tools we have, basically assigning a color to each vertex,
because in order to do it we must create big meshes and paint each triangle with the appropriate
colors. Imagine painting a gradient on a simple triangle: you would have to draw hundreds of
triangles in order to get this effect.

One solution to this problem is using textures. Textures are nothing more than an array of colors
that can be defined using the formats we saw before when we talked about lighting. So, what we do is
cover the geometry figure with one texture that holds the colors we would like to place on that
geometry.

Texture coordinates map the texture onto the primitive, as each vertex stores one or multiple
texture coordinates (modern GPUs can perform multiple texture accesses per vertex, being able to
combine them afterwards).

Texture coordinates are described using the u and v vectors and a range between 0 and 1, as the
following picture shows:

Each [u,v] point in a texture is called a texel.

The texture of a door is a 2D texture but there are more texture formats. Obvious ones are the 1D and
3D formats; they have 1 or 3 dimensions, so their texture coordinates have 1 or 3 values. Another
format is the cube map. A cube map is built using six 2D textures joined as a cube. This kind of
texture is mainly used to draw environment textures, as all the faces of the cube map are

Up to here it seems easy: there are some texture coordinates that are mapped to a texture. But there
is one problem: when we assign texture coordinates to a triangle, we do not know how big it will be
on the frame buffer. Imagine that we have a square onto which we map a texture. This texture is
built with 16 texels, but when the square is rasterized it generates 256 fragments. Then we have
many more fragments than the texture has information. The opposite case can also take place; in
fact it is rare that both sizes match. Both situations represent a problem: in the first one we do not
have enough information, while in the second one we have too much information and the appearance
would be bad if texels are not selected appropriately.


The solutions for both situations are the following:

In the first case, magnification, the region where the primitive is being mapped is bigger than the
texture size, so for each texel there will be more than one fragment with the same color. The best
way to solve the problem is to have higher resolution textures, but as monitor resolution increases
texture resolution must also increase, and nowadays monitor resolution makes it impossible to have
all the textures of that size.

There are two possible solutions to try to mitigate this problem: nearest or linear interpolation.

Nearest interpolation paints the fragment with the nearest texel to the texture coordinates.

Linear interpolation is a more complex way in which the final color is obtained by combining the
colors of the four nearest samples to the texture coordinates. These texel colors are combined by a
weighted average according to distance.
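Both interpolation modes can be sketched as follows for a single channel texture stored as a list of rows. The function names are hypothetical, texel centers are assumed to lie at (i + 0.5) / size, and samples that fall outside the texture are clamped to its border, which is only one of several possible choices.

```python
import math


def sample_nearest(texture, u, v):
    """Return the texel whose cell contains the coordinates (u, v) in [0, 1]."""
    height, width = len(texture), len(texture[0])
    x = min(int(u * width), width - 1)
    y = min(int(v * height), height - 1)
    return texture[y][x]


def sample_linear(texture, u, v):
    """Weighted average of the four texels nearest to (u, v)."""
    height, width = len(texture), len(texture[0])

    def texel(ix, iy):
        # Clamp indices so samples near the border stay inside the texture.
        ix = min(max(ix, 0), width - 1)
        iy = min(max(iy, 0), height - 1)
        return texture[iy][ix]

    x = u * width - 0.5
    y = v * height - 0.5
    x0, y0 = math.floor(x), math.floor(y)
    fx, fy = x - x0, y - y0  # distances to the top-left texel center
    top = texel(x0, y0) * (1 - fx) + texel(x0 + 1, y0) * fx
    bottom = texel(x0, y0 + 1) * (1 - fx) + texel(x0 + 1, y0 + 1) * fx
    return top * (1 - fy) + bottom * fy
```

For a 2x2 texture with a black column and a white column, sampling the middle with nearest interpolation returns one of the two colors, while linear interpolation returns the gray in between.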

(Example of nearest and linear interpolation)


On the other hand, minification takes place when you are trying to map a texture onto a primitive
which is smaller than the texture itself.

The problem here is that the texture has too much information, so there is more than one texel for
each fragment. The best way to solve this problem is working with smaller textures obtained from the
original. These textures are called mipmap levels. Each mipmap is half the size of the previous
mipmap level, the smallest one being a 1x1 mipmap.

Mipmaps can be generated by the API itself or designed by a graphic artist.
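The sizes of a mipmap chain follow directly from the halving rule. A small sketch (the function name is illustrative, and non-square textures are assumed to stop shrinking a dimension once it reaches 1):

```python
def mipmap_sizes(width, height):
    """List the (width, height) of every mipmap level down to 1x1."""
    sizes = [(width, height)]
    while width > 1 or height > 1:
        width = max(width // 2, 1)
        height = max(height // 2, 1)
        sizes.append((width, height))
    return sizes
```

For example, a 256x256 texture yields nine levels, from 256x256 down to 1x1.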

Now the problem is how, given a fragment, we select the appropriate mipmap. As occurs with
magnification, the most probable situation is that the best mipmap lies between two different
mipmaps. To solve this, interpolation between mipmaps is applied.



Point filtering chooses the closest mipmap to the fragment we are working with. The fragment color
is chosen using nearest interpolation.

Bilinear filtering

Bilinear filtering chooses the mipmap in the same way as point filtering does, but now the final
texel color is selected using the linear interpolation described for magnification.



Nearest and bilinear filters show an abrupt change when the mipmap level changes. To solve this
problem trilinear filtering selects the two closest mipmap levels. A bilinear filter is applied
over each of these levels and the results obtained from each mipmap are interpolated.
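A minimal sketch of the last step, assuming the level of detail is estimated as the base 2 logarithm of the number of texels covered by one fragment along one axis (a common approach, though the exact computation is hardware dependent) and that the bilinear result of each level is already available. All names are hypothetical.

```python
import math


def level_of_detail(texels_per_fragment, num_levels):
    """Fractional mipmap level: 0 is the full size texture."""
    lod = math.log2(max(texels_per_fragment, 1.0))
    return min(lod, num_levels - 1)


def trilinear(bilinear_per_level, lod):
    """Blend the bilinear results of the two mipmap levels closest to lod."""
    low = math.floor(lod)
    high = min(low + 1, len(bilinear_per_level) - 1)
    t = lod - low  # how far we are from the lower level
    return bilinear_per_level[low] * (1 - t) + bilinear_per_level[high] * t
```

When lod is an integer only one level contributes, so the result changes smoothly as a surface moves away from the viewer.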

Anisotropic filtering

All the texture filters presented until now have one problem: if the texture surface is at an oblique
angle from the viewer, the texture looks blurry. The problem arises because filtering samples a
square from the texture. The way to solve this problem is to sample a trapezoid. Anisotropic
filtering samples a trapezoid of the mipmaps and performs the trilinear filtering.

When we defined texture coordinates, we said that their range is between 0 and 1, so it seems that
no texture coordinate can be bigger than 1. In fact this is not true: we can use whatever number we
like, so the question is where numbers bigger than 1 point. When using textures we can set
something called wrapping. Wrapping is nothing more than specifying what is going to happen when we
use bigger numbers. The values that we can set here are many and depend on the API and the
underlying hardware. The most important one is repeat, which when 1 is reached starts again from
the beginning; for example, if we set 1.5 as a texcoord and repeat as the wrapping mode, the selected
texel would be 0.5. Another wrapping mode is mirroring, where when 1 is reached the texture is
mirrored, and this is done every time we reach a limit: 1, 2, 3, … Other wrapping modes do nothing or
always return a preconfigured plain color.
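The repeat and mirror modes described above can be sketched for a single coordinate as follows (the function names are illustrative):

```python
import math


def wrap_repeat(t):
    """Keep only the fractional part, so the texture tiles every unit."""
    return t - math.floor(t)


def wrap_mirror(t):
    """Like repeat, but every second tile is flipped."""
    tile = math.floor(t)
    fraction = t - tile
    return fraction if tile % 2 == 0 else 1.0 - fraction
```

With repeat, the coordinate 1.5 selects the texel at 0.5, as in the example above; with mirror, 1.25 selects the texel at 0.75.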

Finally, another important topic about textures is how they are set using the API. As with the
streams the GPU uses to hold the vertex buffers, textures are also held inside another structure
inside the GPU, the texture units.

So, to use a texture with an API the first thing to do is load it. Each API has its own tools to do it.
Once this is done, textures can be thought of as raw information, information that must be introduced
into a texture unit. Apart from that there is more information that must be set in the texture units,
such as which filter is going to be used in each case, which wrapping mode is used and all the other
characteristics that must be defined for the texture. Once this is done textures are ready to be used.
As we said, they are used through the fragment shader, and the way they are usually accessed is
through special registers that refer to each texture unit that has been set.

From the definition we gave above of what a fragment is, it is obvious that not all fragments
are going to be drawn in the frame buffer, so what the next steps try to do is discard some of these
fragments in order to reduce the amount of fragments to process. Fragments that are not going to
be displayed are those that are behind others, so the techniques we are going to use try to determine
if there is something in front of them. Here is where the way we determined each Z value becomes so
important, because the better we did there, the better the results we are going to obtain here.

To perform the tests we are going to explain, we must introduce a new buffer which will be used
by these tests, the Z-Buffer. This buffer holds, for each pixel of the frame buffer area, the Z value
of the last fragment written into the frame buffer. So checking this buffer allows us to know how
near or far the fragment being drawn is with respect to the fragment currently stored.

The Z-test allows us to discard the fragments that are not going to be drawn in the frame buffer
because their Z is further than the Z of the current fragment in the frame buffer.

The Z-test allows us to discard many fragments and reduces the load on our system.
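A minimal sketch of the Z-test, assuming that smaller Z values are closer to the viewer and that the comparison is "less than" (both are configurable in real APIs). For brevity the sketch updates the buffer as soon as the test passes. The function name and the list of rows layout are illustrative.

```python
def z_test(z_buffer, x, y, fragment_z):
    """Return True and update the buffer when the fragment is in front."""
    if fragment_z < z_buffer[y][x]:
        z_buffer[y][x] = fragment_z  # remember the nearest Z seen so far
        return True
    return False  # something nearer is already stored: discard
```

The buffer is normally cleared to the farthest value (1.0 here) before drawing a frame, so that the first fragment of every pixel passes.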

The stencil test is another test that can be performed to discard fragments. This test blocks certain
areas of the frame buffer from being drawn. To be able to perform the stencil test another buffer is
needed, the stencil buffer. This time the buffer contains a reference value that will be used
to evaluate whether the fragment must be killed from the pipeline.

The stencil test works by applying to every fragment one formula that decides if the fragment can
continue:

If ((reference value & mask) OPT (stored value & mask))

The operation OPT is performed between the stencil reference value, which was set through the API,
and the value stored in the stencil buffer. An AND mask is applied to both values. Typical stencil
operations are comparison operators (<, <=, …) and they can return two possible values, true or
false. In case the answer is true, the stencil buffer is updated with the reference value; otherwise
the fragment is killed from the pipeline. The stencil test is usually used to create shadows in a
scene using shadow volumes.
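The formula above can be sketched as follows. The comparison operator is passed in as a function, with equality as an assumed default, and the buffer update on success follows the simplified behavior described in the text (real APIs let the programmer choose a separate update operation). The names are hypothetical.

```python
import operator


def stencil_test(stencil_buffer, x, y, reference, mask, compare=operator.eq):
    """Apply (reference & mask) OPT (stored & mask) to one fragment."""
    stored = stencil_buffer[y][x]
    if compare(reference & mask, stored & mask):
        stencil_buffer[y][x] = reference  # update as described above
        return True
    return False  # the fragment is killed
```

Other comparisons are obtained by passing, for example, operator.lt or operator.le as the compare argument.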

The scissor test is another test that can be performed to discard more geometry. Its objective is
more or less the same as that of the stencil test: to paint only some portions of the frame buffer.
This time the way we select the area of the frame buffer is simpler, and it is done by defining two
points of the frame buffer that define a rectangular area. This area will be the only one that can be
updated; fragments that are mapped to pixels outside this area are discarded.

Another test is the alpha test. Until now we did not say anything about the alpha component that
colors sometimes have. This test discards geometry based on this value. As in the stencil test, the
user defines a reference value that is compared against the alpha value of each fragment. As in the
stencil test, many operations can be set between them. In case the result is true, the fragment
continues down the pipeline; otherwise it is discarded.
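Both tests reduce to very small predicates. A sketch, assuming the scissor rectangle is given by its two corner points with inclusive bounds and that the alpha comparison defaults to "greater than" (both are assumptions for illustration):

```python
import operator


def scissor_test(x, y, corner_a, corner_b):
    """True when pixel (x, y) lies inside the rectangle defined by two corners."""
    x0, x1 = sorted((corner_a[0], corner_b[0]))
    y0, y1 = sorted((corner_a[1], corner_b[1]))
    return x0 <= x <= x1 and y0 <= y <= y1


def alpha_test(fragment_alpha, reference, compare=operator.gt):
    """True when the fragment alpha passes the comparison with the reference."""
    return compare(fragment_alpha, reference)
```

A typical use of the alpha test is discarding the fully transparent texels of a texture, for example with a reference value of 0.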

After all these tests have been performed we know for sure that our fragment will be drawn in the
frame buffer, but what we do not know is how the color of the input fragment will be combined with
the current color of the fragment in the frame buffer. This combination, called blending, allows us
to create effects such as transparencies.

Inside the blending stage, fragments that are currently being rasterized are called source fragments
while fragments stored in the frame buffer are called destination fragments. So the formula used in
the blending stage is:

CF = Csrc x Fsrc blending_op Cdst x Fdst

Where Csrc and Cdst are the colors of the fragments and Fsrc and Fdst are the factors applied to each
color. The blending operations available depend on the graphic API itself, but common operations
are src color, dst color, add, subtract, min and max, among others.
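The blending formula can be sketched as follows for colors stored as tuples of channels in [0, 1]. The blending operation is passed in as a function, with addition as an assumed default, and the result is clamped to the valid range. The names are illustrative.

```python
import operator


def blend(c_src, c_dst, f_src, f_dst, blending_op=operator.add):
    """CF = Csrc x Fsrc blending_op Cdst x Fdst, evaluated per channel."""
    return tuple(
        min(max(blending_op(src * f_src, dst * f_dst), 0.0), 1.0)
        for src, dst in zip(c_src, c_dst)
    )
```

Classic transparency is obtained by using the source alpha as Fsrc and one minus the source alpha as Fdst: blending a red fragment with alpha 0.25 over a blue pixel gives blend((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), 0.25, 0.75).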

When the fragment exits the blending stage it has its final color and can be written to the frame
buffer, also updating the Z-Buffer with its Z value.

// Image of the pipeline

Basic graphic stack

Between the first simple games programmed to run on a CPU, which only had thousands of lines of
code, and current complex games, which use many more resources than most other tasks, with huge
detailed levels and characters that appear real, the basic graphic stack has changed a lot in order to
be able to support highly demanding graphics.

Over the years game programming has changed a lot, not only because of the increasing difficulty and
complexity of games, but also because of the need to reduce how much it costs to program a game. As
always in computing, to make something easier to do, levels of abstraction have been added to
simplify the work of creating new content. That is the reason why game graphics have evolved in
such a way.

As we previously discussed, the base for games is the GPU, which manages all the low level
operations needed to create games. Over this piece of hardware we find a driver that controls the
hardware itself, being the first level of abstraction. Over the driver we find one or more APIs
attached to it. Basically, APIs are the first abstraction which provides real drawing primitives,
and they are the first way game programmers can start creating graphic content. However, the graphic
primitives provided by such APIs are quite simple (points, lines, triangles, squares), and you need a
lot of work to get some result.

Moreover, since new graphic cards have become a unified architecture without fixed function, for
whatever effect we would like to create we must create a shader, which is not an easy thing for
complex effects. Here is where game engines come in. Game engines provide a higher level of
abstraction, capable of working with models (not triangles), effects, … and offer a friendly
interface to work with. Working with a game engine is much simpler than working with the API itself,
and most game programmers work with one to create their content.

Now we are going to describe each level more accurately.

3D Game engine

A 3D game engine is the highest level view, which provides basically all the services a game might
require to render a game. A 3D game engine deals with meshes, bones, effects, textures and so forth.
It offers a simple interface so that the user of the engine does little more than choose what object
to render with which materials and how. To do so, a modern 3D engine uses an API that will
communicate, through a driver, with the hardware.

3D game engines are normally included inside game engines, which provide all the services a game
might require to run, such as sound, network and I/O modules.

3D game engines are designed to be simple, fast, efficient and elegant.

A 3D engine’s task is only to render the world to the screen, and it might interact with the disk I/O
module to load data when needed. To be more accurate, it should be said that only what is of interest
to the player must be displayed. This obviously means that the 3D engine only has to render a subset
of the complete game world, which is the part visible in the viewport. So, one of the tasks of a 3D
engine is to find the visible subset of the game world as quickly as possible. To achieve that, the
world is divided into areas, which store the objects they fully enclose. Then the engine finds the
areas visible from the camera point of view, and so knows which objects to render. This process is
often referred to as culling.

Rendering is another task of the 3D engine. Once it has found the smallest subset of objects to be
drawn, it must render them as quickly as possible. The independent hardware vendors (IHVs),
AMD, NVIDIA, and others, have published many documents about the best methods to render a
scene quickly, out of which two major points always come up: minimising your state changes, and
batching your geometry.

There is yet one other task of the 3D engine, and that is to animate characters, which is mostly done
today through skinning. Skinning is a process in which the bones of a skeleton are hierarchically
transformed (so as to have a child bone move with its parent), and their resulting position and
orientation are used to place the vertices (the skin) of the characters at the right place. This process
is often performed during frame updating.

Those three steps are performed in the following order: updating, culling, rendering. To be
good, a 3D engine must perform all those steps efficiently, and doing that involves using a 3D API.

Normally a game engine performs all the graphical work, but some engines only do one thing and
do it more convincingly or more efficiently than general purpose engines. For example,
SpeedTree was used to render realistic trees and vegetation in the role-playing game The Elder Scrolls
IV: Oblivion.

Some examples of game engines used in famous games are CryEngine, which is used in its second
version in Crysis and Crysis Warhead. Another one is the Source Engine, developed by Valve, which is
mostly used by Valve games such as Half-Life.

Graphic API

The graphic API is a lower level of abstraction. At this level the API interface deals with vertex
buffers, index buffers, shaders, textures and render targets.

There are two main 3D APIs available today: Direct3D and openGL. Both provide an interface to the
same underlying hardware, with the differences being the quality and simplicity of the interface,
and the implementations (the drivers), rather than the feature set.

On one hand you have Direct3D, pushed by Microsoft, with a rather nice interface (in its 9 and 10
versions), but suffering from a severe draw call issue. Draw calls on certain Direct3D platforms
force a kernel context switch, which ultimately has a performance cost. The other downside is that
the API tends to change a lot from one version to the next, which is not too nice since it means
rewriting a lot of code to take advantage of any new version in any existing engine. The D3D9 to
D3D10 transition is a good example of that problem. It is available on Windows and Xbox 360.

On the other hand you have openGL, led by the Khronos Group (and formerly by the Architecture
Review Board (ARB)), which exists in different variants such as the ES (Embedded Systems) version,
or the standard openGL for workstations.

OpenGL 2.1 has many extensions, and many ways of doing the same thing, making its
implementation difficult and the engine writer’s task uneasy, as there is a need to look for the
optimal path, which evolves with time and new extensions. Or new hardware, since extensions are
meant to make hardware features available in the API without breaking the existing interface.
Hopefully, openGL 3.0 has a completely new streamlined interface, sometimes referred to as “Lean
& Mean”. It is available on MacOS X, Windows, PS3 and Wii.

There is the option of deciding to follow one of the APIs and use its strengths in the engine. While
this allows taking advantage of specific features provided by the chosen API, it also means
restricting the systems the engine will run on, and so the number of potential users of the engine. A
more interesting approach is to choose neither of them, and to write an abstraction layer which hides
all API specific code inside a module, making the engine API agnostic. With such a layer, the engine
will be able to use the best API for a given system, to ensure high performance. The drawback of
having an abstract renderer interface is that it must target the least common denominator of
the APIs it hides, or the engine will need some tweaks to target some platforms. Still, since the
code is nicely encapsulated, changes, even engine-wide ones, will be much easier to deal with.
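To make the idea of an abstraction layer more concrete, here is a minimal sketch. Every class and method name in it is hypothetical and chosen only for illustration; it is not the interface developed in this project. The engine code only sees the abstract interface, while each backend would hide the calls of one concrete API.

```python
from abc import ABC, abstractmethod


class Renderer(ABC):
    """Hypothetical API-agnostic interface seen by the engine."""

    @abstractmethod
    def create_vertex_buffer(self, vertices):
        """Upload the vertices and return an opaque handle."""

    @abstractmethod
    def draw(self, buffer_handle, primitive):
        """Issue one draw call for the given vertex buffer."""


class RecordingRenderer(Renderer):
    """Stand-in backend that only records calls.

    A real backend would translate these methods into Direct3D or openGL calls.
    """

    def __init__(self):
        self.buffers = []
        self.calls = []

    def create_vertex_buffer(self, vertices):
        self.buffers.append(list(vertices))
        return len(self.buffers) - 1

    def draw(self, buffer_handle, primitive):
        self.calls.append((primitive, len(self.buffers[buffer_handle])))


def render_mesh(renderer, vertices):
    """Engine-side code: it depends only on the Renderer interface."""
    handle = renderer.create_vertex_buffer(vertices)
    renderer.draw(handle, "triangles")
```

Switching the underlying API then amounts to passing a different Renderer implementation to the same engine code.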


Driver

A driver is the most hardware dependent piece of low-level software. Writing a driver requires an
in-depth understanding of the hardware we are going to write the driver for, as well as a good
knowledge of the OS on which we are going to build the driver. As you have probably noticed, a driver
is highly OS dependent code.

GPU drivers typically communicate with the GPU card through the computer bus, which in the case of
GPUs is an AGP bus for older computers or the PCI Express bus, with all its versions of
different speeds, for newer ones.

The main function of the GPU driver is to hide the specific characteristics of the underlying
hardware from the levels above. Computer architects often design hardware with some weaknesses
that may cause a great overhead if the hardware is not used properly. An example of this is when
a given functional unit of the hardware is terribly slow and, in order to maintain the throughput of
the system, the driver tries to replace all the operations involving this unit with other operations
that do not affect the system performance so much.

Another important task of the GPU driver is GPU memory management. The driver must allocate
and deallocate memory in a proper manner in order to reduce the bus overhead. Data
placement is also important in order to avoid moving data in and out of memory.

GPU (Graphic Processing Unit)

The GPU is a highly dedicated piece of hardware, designed as a specific purpose processor
implementing a specific 3D rendering algorithm.

This section explains the basic GPU pipeline. The explanation is not very in-depth and will only cover
the basic GPU architecture. Due to the rapid evolution of GPU architectures, we are also going to take
a glimpse at how this evolution took place.

Approach to existing architectures

New GPU architectures can be divided into two different parts: the fixed part and the programmable
part. The programmable part consists of some small processors called stream processors, capable of
executing the shader programs we described in the previous section. On the other hand we have the
fixed part, which has some functionalities that cannot be expanded: only some settings about how
they work can be changed, but no new functionalities can be added.

The next picture maps each API stage to the corresponding GPU stage.

The streamer is the first stage of a GPU and is responsible for reading all the data provided by the
API. This data consists of all the vertices that are going to be rendered in this batch as well as all
the information associated with them. Vertices can be selected using the index buffer in case it is an
indexed draw call. Once all the vertices have been read the next operation is vertex shading. Vertex
shading is performed over each vertex read by the streamer and it is done using the stream
processors. After vertex shading the next step is to join the vertices into the primitive chosen by the
programmer. Beyond this point we are working with primitives, usually triangles. The next step is
executing the geometry shader over each primitive; as we saw before, this is the first stage where
primitives can be discarded.

After all the work over the primitives is done it is time to transform primitives into fragments. This
operation is performed inside the rasterizer. The rasterizer has two main steps: first it computes all
the vectors that define the primitive edges, and once this is done it uses them to generate fragments
from the primitive we are working with. This process is done in fragment generation.

(FEATURES OF HOW RASTERIZATION IS PERFORMED)

After rasterization our working objects are fragments. The main goal of the following stages is to
discard as many fragments as they can in order to avoid extra computing costs. It is difficult to
know if a fragment will become a pixel because the only known information comes from the previous
pixels that have been written to the frame buffer. In fact, a fragment that has been written does not
necessarily become a pixel, because another fragment may be written over it before the draw call
has finished. So discarding information can only be obtained from the fragments that are in the
frame buffer, and all the information is obtained from the Z value they have. Even so, as we saw
when we were studying the API pipeline, other tests can be performed to discard more fragments, but
those tests are designed by the programmer.

The first test performed over the fragments is the early Z test. This test performs a fast comparison
between the Z of the fragment we are evaluating and that of the fragment written in the frame buffer.
As the test is performed in early stages, other fragments can be in flight inside the pipeline, so this
test only guarantees that if the result is to discard, it is correct; otherwise no final conclusion can be
reached. The next test is Hierarchical Z; this test is also done using the Z value as its main
information.

After these first tests are done, every fragment is processed by a fragment shader. Before this,
fragment attribute interpolation must be done because rasterization has not done it yet.

As we previously explained, the fragment shader decides the final color of the fragment. Like the
geometry shader, the fragment shader can also discard fragments. The fragment shader can perform
texture accesses through the Texture Unit, which returns the color mapped at the coordinates given
by the shader. The Texture Unit is responsible for performing all the texture accesses needed to obtain
the final color based on the filter type the programmer has chosen.

Once the fragment shader is done it is time to do other tests in order to discard more fragments.
These tests were explained when we described the API pipeline, so they will not be explained again.

Discarding fragments is not so easy to do in all situations. When we are working with alpha
blending active, fragments written in the frame buffer can have some degree of transparency, so
even if a fragment is behind another one we must not discard it, because as the written fragment has
some transparency we can see the fragment behind it. So in these situations, early Z and
Hierarchical Z are disabled to avoid discarding fragments by mistake.

Once all tests have been done, fragments are ready to be written to the frame buffer, and this is
done in the blending stage. This stage is responsible for writing the fragment to the frame buffer
and performs any operation between the fragment currently in the frame buffer and the new one in
order to obtain the final color. Writing the fragment to the frame buffer also implies a modification
of the Z buffer in order to keep it up to date.

From fixed pipeline to unified architectures

The first GPUs were designed to free the CPU from the intensive work of texture mapping and
rendering polygons. So, the functionalities included are those related to geometry rasterization of
2D images, which maps the basic geometry (usually triangles) to pixels of the viewport.

For a given fragment which has some texture coordinates, texture mapping samples the requested
texel from the texture. In order to improve the quality of the image obtained, different kinds of
sampling have been implemented.

Finally we find the raster operation unit, which performs some tests in order to determine
if the fragment has to be drawn to the frame buffer. Basic raster operations are the scissor, alpha,
stencil and Z-tests and alpha blending, and they will be explained in the next section.

Apart from the frame buffer we are working with, we can define a zone of the frame called the
scissor rectangle to define the region of the frame buffer we would like to write on. So, when we
enable the scissor test all the fragments that are outside the region defined by the scissor rectangle
are discarded.

The Z-test compares the Z of input fragments with the Z of the fragment stored in the frame buffer.
In case the input fragment is in front of the stored fragment, it continues through the pipeline;
otherwise it is discarded.

Alpha blending is the operation of combining the color of the input fragment with that of the
fragment stored in the frame buffer. This combination can be done in multiple ways, such as making
the input color the final fragment color, changing only some components, for example red and blue,
or more complex combinations which add some proportion of the input color and some other
proportion of the frame buffer color.

The next generation of GPUs included one new functionality: multitexturing. Now fragments can
perform multiple accesses to different textures. This new functionality allowed having a texture with
the object material and another one with the object lighting map. Combining both textures we can
obtain the …

One of the first GPUs which implemented this was the Riva TNT.

Until now the GPU always received 2D geometry, with all the attributes of each vertex previously
transformed and calculated. The next step in GPU design was to introduce a new stage, called the
geometry stage, which contains the transform and lighting unit. As we previously said,

Matrix multiplication


Material properties, light properties and vertex color

Example GeForce 256

Fragment Shader

Example GeForce 2

Vertex Shader

Bump Mapping

Cube Map

Volume Textures


Example GeForce 3

Multiple vertex and pixel shaders

Example GeForce 4

Programmable pixel shading

Example GeForce FX

Missing: Hierarchical Z, Early Z, Geometry shader, Tessellator

At this point we are capable of doing two different tests to discard some fragments: Early Z and
Hierarchical Z.



The previous sections have explained the basic graphics theory necessary to understand the whole
project. This section continues with more necessary background, but in this case the information is
about the environment used in the project. This environment is really big, as other people have
been working on it for many years, so this section only covers the essential things needed to
understand my project.


Attila is a cycle-accurate GPU simulator whose main objective is to test new architecture proposals
and evaluate them. To evaluate a proposal, a realistic workload is needed in order to be as precise
as possible and to be sure that, if the proposal is implemented in hardware and used in real
environments, it will behave as it does in the simulator. The best way to succeed in this task is to
feed the simulator with real workloads, as this tests the simulator on real cases.

Nowadays the most representative workloads for GPUs are PC games. Console games are also good
alternatives, but they run in closed environments with proprietary tools. As we saw previously,
most games use one of the two major APIs, D3D and OpenGL, so using real workloads involves
implementing the full graphic stack for each API on the simulator. This is a hard task, not only
because each API has hundreds of calls but also because the new APIs are so complex that debugging
them is difficult.

This section describes all the tools needed to run this workload and explains how the workload is
traced and how it is played on the simulator.

This first schema explains the main steps performed to play the workload on the simulator:

The workload is not executed directly on the simulator because the simulator is much slower than a
real GPU, so a game executed on it would be unplayable. Another important point is that the same
workload has to be run many times on the simulator. If the game were executed directly this could
not be achieved, because every time a game is played a different sequence of calls is produced, so
there would be no way to compare different architectures with the same workload, something that is
important.


To solve these two major problems, games are traced, creating files called traces that make it
possible to reproduce them afterwards. The first problem is how traces can be obtained from
commercial games. Games perform many operations, such as collisions, AI …, but we are only
interested in the graphics part, which consists of the calls made to the graphic APIs. These APIs
are exported as dynamic libraries, so the best way to capture all the calls made by a game is to
interpose, between the application we would like to trace and the graphic API, another dynamic
library that intercepts all the calls. The interceptor acts as a fake API library and games make
their calls on it; the interceptor then collects all the information needed to reproduce each call
later and forwards the call to the original API.

GLInterceptor is the tool developed by the group to perform this task and record traces of OpenGL
games, while the Microsoft Pix Tool, published by Microsoft, is used for DirectX games. Both
applications work in a similar way, and as a result of their execution one or several files are
obtained that contain the trace recorded from the game. This trace contains all the information
needed to play the traced workload another time.
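The interposition idea can be illustrated with a small Python sketch. It is not GLInterceptor
itself: FakeGL is a hypothetical stand-in for the real graphics library, and the trace is kept as
an in-memory list instead of a file. What it shows is the pattern described above, in which every
call is recorded first and then forwarded to the original API.

```python
class FakeGL:
    """Stand-in for the real graphics library (hypothetical; only two calls)."""

    def glClear(self, mask):
        return None

    def glDrawArrays(self, mode, first, count):
        return None


class Interceptor:
    """Exposes the same calls as the wrapped API, recording each one before forwarding it."""

    def __init__(self, real_api):
        self._real = real_api
        self.trace = []  # list of (call name, arguments) in call order

    def __getattr__(self, name):
        # Only reached for names not defined on Interceptor, i.e. the API calls.
        real_call = getattr(self._real, name)  # raises AttributeError if unknown

        def wrapper(*args):
            self.trace.append((name, args))    # record what is needed to replay
            return real_call(*args)            # then forward to the original API

        return wrapper
```

The application only sees an object with the same entry points as the original library, so it does
not need to be modified in order to be traced.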

Once the trace is obtained we no longer need the game from which it was extracted, as all the
information needed is contained in the trace. The next step is to reproduce the trace file. This
can be done on real GPU hardware, if we want to compare its behavior with the original one, or on
the simulator. The player works in more or less the same way in both cases. Its first task is to
read the trace file. The OpenGL trace file is easy to read because its format was created by the
group, but the Pix Tool trace format was defined by Microsoft, so it was not so easy to play. After
reading the trace, the commands are sent to the appropriate driver when the trace is played on a
real GPU, or to the simulator stack in order to simulate the behavior in Attila. The information
obtained in both cases is a sequence of images, called frames, that can be compared in order to see
whether the simulator works as a real GPU is supposed to.

Here we introduce a new term, frame: a frame is completed when the back buffer is exchanged with
the frame buffer and a new image appears on the screen. To be able to see movement in the image,
this has to take place at least 24 times per second. Another important term is batch. A batch is
understood as all the context surrounding a draw call, so every time a draw call is issued we count
one batch. Frames are therefore made up of batches.
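The relation between calls, batches and frames can be made concrete with a short sketch that splits
a flat list of call names into frames. The trace representation (a list of strings) and the
"SwapBuffers" entry used to mark the buffer swap are assumptions made for the example; real traces
carry the arguments as well, and the swap call has a platform-specific name.

```python
DRAW_CALLS = {"glDrawArrays", "glDrawElements"}
SWAP_CALL = "SwapBuffers"  # hypothetical name for the buffer-swap entry in the trace


def split_into_frames(trace):
    """Group a flat list of call names into frames, each a list of batches.

    A batch is closed by a draw call; a frame is closed by the buffer swap.
    State calls issued after the last draw call of a frame are dropped in
    this simplified model.
    """
    frames, batches, current = [], [], []
    for call in trace:
        if call == SWAP_CALL:
            frames.append(batches)
            batches, current = [], []
        else:
            current.append(call)
            if call in DRAW_CALLS:
                batches.append(current)
                current = []
    return frames
```

With this grouping, the number of batches per frame is simply the length of each inner list, a
figure that is often used to characterize a workload.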

Attila current stack

As said before, in order to execute a trace, which is in fact a real game, a functional graphic
stack has to be emulated on the simulator. This section explains how this stack is built on top of
the simulator. It is important in order to understand how everything is put together, what exists
now and why it needs to be modified.

The Attila graphic stack is composed of four different parts: the player, each implemented API, a
driver and the simulator itself.


The player has been introduced before. What it basically does is read the trace file, interpret it
and issue every API call. When the trace is played on the simulator, once a call has been read the
next step is to invoke the appropriate API function to do whatever is needed to execute that call.
This is done using tables that hold pointers to the implemented calls. By using such a table and
setting the appropriate pointers in it, all API functions can be called.
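A minimal sketch of such a call table is shown below. Python has no function pointers, so a
dictionary from call name to function plays that role here. The two handlers and the shape of the
state dictionary are hypothetical; they only illustrate how the player can dispatch each trace
record, and how a call that has not been implemented yet can be detected.

```python
def gl_clear(state, mask):
    state["cleared"] = mask


def gl_draw_arrays(state, mode, first, count):
    state["vertices_drawn"] = state.get("vertices_drawn", 0) + count


# Table mapping the call name stored in the trace to its implementation.
CALL_TABLE = {
    "glClear": gl_clear,
    "glDrawArrays": gl_draw_arrays,
}


def play(trace, table=CALL_TABLE):
    """Replay (name, args) records; return the final state and the unsupported call names."""
    state, unsupported = {}, []
    for name, args in trace:
        handler = table.get(name)
        if handler is None:
            unsupported.append(name)  # call not implemented yet
            continue
        handler(state, *args)
    return state, unsupported
```

Supporting a new call then amounts to writing its handler and adding one entry to the table.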


Attila has two different APIs, OpenGL and D3D. It is too much work to implement the full set of API
calls, because there are hundreds of them. As games often use only a small subset of the entire
list, there is no need to implement calls that are not going to be used, so calls are implemented
on demand.

The Attila OpenGL implementation covers a large subset of the OpenGL calls and, in its current
state, supports some of the most important API features from the OpenGL specifications 1.4 to 2.0.
When this project started, the OpenGL driver supported traces obtained from the following games:
Doom 3, Quake 4, Prey, Chronicles of Riddick and Unreal Tournament 2004. To be able to play them,
about 200 API calls were supported. The ARB shader language was supported, while GLSlang was not. A
wide range of texture formats was supported (LIST OF SUPPORTED FORMATS), including the S3 Texture
Compression (S3TC) modes DXT1, DXT3 and DXT5. The texture filters available are nearest, bilinear,
trilinear and 16X anisotropic.

OpenGL also supports Vertex Arrays and Vertex Buffer Objects (VBO), which are the structures used
to hold all the vertex data, as was explained before. Instead of specifying individual vertex data
in immediate mode (between glBegin() and glEnd() pairs), both of them allow us to store vertex data
in a set of arrays, including vertex coordinates, normals, texture coordinates and color
information, and then to issue draw calls that reference those arrays. Draw calls that use those
arrays include glDrawArrays() and glDrawElements().

Vertex data is stored in raw buffers, each of which can hold multiple kinds of vertex data. To
activate or deactivate one of the kinds of vertex data, the programmer only has to call
glEnableClientState() or glDisableClientState() for the corresponding type of array. Moreover, once
an array is enabled we need to set the buffer it is going to use. The buffers are set with the
gl***Pointer calls, where *** can be Vertex, Normal, Color, Index, TexCoord or EdgeFlag.

Vertex Arrays and VBOs work in different ways. With Vertex Arrays the gl***Pointer functions are
used on their own. The call takes as parameters the type of the data, the stride between elements
and the pointer to the raw data. Apart from the data nothing else must be set up, only as many
arrays as are needed.

When VBOs are used more steps are needed. First the raw arrays must be placed inside buffer objects
with the glBufferDataARB function. The parameters of this call basically characterize the buffer
content: the buffer size, the data pointer and the usage. Every buffer we would like to use must
have its data store created with this function. Once all the buffers have been created we can
switch between them with the glBindBufferARB call, which sets the current buffer (remember that
OpenGL works as a state machine). Then, as with Vertex Arrays, the gl***Pointer calls must be used
to assign the appropriate buffer to each attribute, but this time the last parameter, the pointer
to the data, does not work the same way as before. The buffer that is going to be used is the last
one set with glBindBufferARB, and the pointer parameter is the offset inside this buffer where the
data can be found.
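The different meaning of the pointer parameter in the two modes can be illustrated with a tiny
model of the relevant state. This is not real OpenGL and not the Attila implementation: the class
and method names are hypothetical, each method only mimics the call named in its comment, and the
attribute type is fixed to 32-bit floats to keep the example short.

```python
import struct


class GLState:
    """Tiny model of the state involved in vertex fetching (not real OpenGL)."""

    def __init__(self):
        self.buffers = {}       # buffer id -> bytes
        self.bound = 0          # 0 means "no buffer object bound"
        self.vertex_src = None  # (data, offset, size, stride)

    def bind_buffer(self, buffer_id):                 # models glBindBufferARB
        self.bound = buffer_id

    def buffer_data(self, data):                      # models glBufferDataARB
        self.buffers[self.bound] = bytes(data)

    def vertex_pointer(self, size, stride, pointer):  # models glVertexPointer
        if self.bound == 0:
            # Vertex Array: 'pointer' is the client data itself.
            self.vertex_src = (bytes(pointer), 0, size, stride)
        else:
            # VBO: 'pointer' is a byte offset into the bound buffer.
            self.vertex_src = (self.buffers[self.bound], pointer, size, stride)

    def fetch_vertex(self, index):
        data, offset, size, stride = self.vertex_src
        step = stride or size * 4   # stride 0 means tightly packed floats
        return struct.unpack_from("<%df" % size, data, offset + index * step)
```

In the Vertex Array case the raw data is passed directly to vertex_pointer, while in the VBO case
the same argument is an integer offset that is resolved against the buffer selected by the last
bind_buffer call.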



The driver is the lowest piece of the stack and sits just above the simulator. As it is low-level
code, its main function is to abstract some basic resources so that the API programmer does not
have to deal with them. In Attila, the driver only takes care of two important things: memory
management and AGP transactions.

The GPU has its own memory, and it needs to be tracked in order to avoid overwriting it. The task
of the driver is to track the memory and, when the upper levels ask for memory, to allocate it and
return an identifier called memory descriptor (md) that can be used to refer to that amount of
memory. So when the upper levels refer to memory they do not work with memory addresses but with
memory descriptors. Deallocation is also done with the memory descriptor: the driver adds the
memory region that the memory descriptor references to the pool of free memory.
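The descriptor-based interface can be sketched as follows. The first-fit policy, the class name and
the integer descriptors are assumptions chosen for the example and do not describe how the Attila
driver actually allocates memory; adjacent free regions are also not merged here, which a real
allocator would need to do.

```python
class DriverMemory:
    """First-fit allocator over a flat GPU address range (illustrative only)."""

    def __init__(self, size):
        self.free = [(0, size)]   # list of (start, length) free regions
        self.regions = {}         # memory descriptor -> (start, length)
        self.next_md = 1

    def allocate(self, length):
        """Reserve 'length' bytes and return a memory descriptor, or None if full."""
        for i, (start, free_len) in enumerate(self.free):
            if free_len >= length:
                if free_len == length:
                    del self.free[i]
                else:
                    self.free[i] = (start + length, free_len - length)
                md = self.next_md
                self.next_md += 1
                self.regions[md] = (start, length)
                return md
        return None

    def deallocate(self, md):
        """Return the region referenced by 'md' to the free pool."""
        self.free.append(self.regions.pop(md))
```

The upper levels only ever hold the value returned by allocate, so the driver remains free to
decide where each region is placed.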

Hiding AGP transactions is the second function of the driver. The driver has to manage AGP
transactions because the simulator is plugged into an AGP port. Working with AGP transactions,
registers and other low-level data containers is too dangerous to expose to the API programmer,
because GPU manufacturers do not want their users to know too many things about their architectures
and also because AGP transactions are too low level.

Finally, the Attila driver has a shader cache. Why does a driver need a shader cache? The Attila
simulator can only execute shaders that are located in some special memory regions. As shaders are
loaded into normal memory, before a shader is executed it must first be copied from the memory
region where it is defined to the memory region reserved for shader execution. This copy operation
becomes a bottleneck if it has to be done many times, which happens when shaders change a lot from
one batch to another. To avoid performing this operation many times, a shader cache was introduced
in the driver.
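The benefit of the shader cache can be illustrated with the sketch below, which counts how many
copies into the shader region are needed for a given sequence of shader uses. The number of slots,
the least-recently-used replacement policy and all the names are assumptions made for the example;
the text does not say which policy the Attila driver uses.

```python
from collections import OrderedDict


class ShaderCache:
    """Remembers which shaders already sit in the shader-execution region."""

    def __init__(self, slots):
        self.slots = slots
        self.resident = OrderedDict()   # shader id -> slot index, oldest use first
        self.copies = 0                 # number of copies into the shader region

    def use(self, shader_id):
        """Make shader_id executable and return its slot, copying only on a miss."""
        if shader_id in self.resident:
            self.resident.move_to_end(shader_id)          # hit: no copy needed
            return self.resident[shader_id]
        if len(self.resident) < self.slots:
            slot = len(self.resident)                     # a slot is still unused
        else:
            _, slot = self.resident.popitem(last=False)   # evict least recently used
        self.copies += 1                                  # miss: copy the shader in
        self.resident[shader_id] = slot
        return slot
```

When consecutive batches alternate between the same few shaders, only the first use of each one
pays for the copy.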

GPU: Simulator

Because the trace contains all the information needed to reproduce the behavior of the game, we can
analyze all the information related to API calls directly from this trace.