gpu-cudax

AI and Robotics

Nov 6, 2013

GPU History

CUDA Intro

Graphics Pipeline Elements

1. A scene description: vertices, triangles, colors, lighting
2. Transformations that map the scene to a camera viewpoint
3. "Effects": texturing, shadow mapping, lighting calculations
4. Rasterizing: converting geometry into pixels
5. Pixel processing: depth tests, stencil tests, and other per-pixel operations

A Fixed Function GPU Pipeline

[Figure: a fixed-function GPU pipeline. The CPU (host) feeds the GPU through the Host Interface; the GPU stages are Vertex Control (backed by a Vertex Cache), VS/T&L, Triangle Setup, Raster, Shader (backed by a Texture Cache), ROP, and FBI, writing to Frame Buffer Memory.]

Texture mapping example: painting a world map
texture image onto a globe object.

Texture Mapping Example

Programmable Vertex and Pixel Processors

[Figure: an example of separate vertex processor and fragment processor in a programmable graphics pipeline. A 3D application or game issues 3D API commands to the 3D API (OpenGL or Direct3D), which sends the GPU command and data stream across the CPU-GPU boundary to the GPU front end. A vertex index stream of pre-transformed vertices feeds the programmable vertex processor; transformed vertices go to primitive assembly; assembled polygons, lines, and points go to rasterization & interpolation; the pixel location stream of rasterized pre-transformed fragments feeds the programmable fragment processor; transformed fragments pass through raster ops, which write pixel updates to the framebuffer.]

What is (Historical) GPGPU?

- General-purpose computation using a GPU and graphics API in applications other than 3D graphics
- GPU accelerates the critical path of the application
- Data-parallel algorithms leverage GPU attributes
  - Large data arrays, streaming throughput
  - Model is SPMD
  - Low-latency floating point (FP) computation
- Applications
  - See http://gpgpu.org
  - Game effects (FX), physics, image processing
  - Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting

Tesla GPU

- NVIDIA developed a more general-purpose GPU
- Can be programmed like a regular processor
- Must explicitly declare the data-parallel parts of the workload
- Shader processors became fully programmable processors with instruction memory, cache, and sequencing logic
- Memory load/store instructions with random byte-addressing capability
- Parallel programming model primitives: threads, barrier synchronization, atomic operations


CUDA

- "Compute Unified Device Architecture"
- General-purpose programming model
  - User kicks off batches of threads on the GPU
  - GPU = dedicated super-threaded, massively data-parallel co-processor
- Targeted software stack
  - Compute-oriented drivers, language, and tools
- Driver for loading computation programs into the GPU
  - Standalone driver, optimized for computation
  - Interface designed for compute: graphics-free API
  - Data sharing with OpenGL buffer objects
  - Guaranteed maximum download & readback speeds
  - Explicit GPU memory management

CUDA Devices and Threads

- A compute device
  - Is a coprocessor to the CPU, or host
  - Has its own DRAM (device memory)
  - Runs many threads in parallel
  - Is typically a GPU but can also be another type of parallel processing device
- Data-parallel portions of an application are expressed as device kernels, which run on many threads
- Differences between GPU and CPU threads
  - GPU threads are extremely lightweight: very little creation overhead
  - A GPU needs thousands of threads for full efficiency
  - A multi-core CPU needs only a few

G80 CUDA Mode: A Device Example

- Processors execute computing threads
- New operating mode/HW interface for computing

[Figure: G80 in CUDA mode. The host feeds an Input Assembler and Thread Execution Manager; an array of processor clusters, each with a Parallel Data Cache and a Texture unit, performs load/store accesses to Global Memory.]

Arrays of Parallel Threads

- A CUDA kernel is executed by an array of threads
- All threads run the same code (SPMD)
- Each thread has an ID that it uses to compute memory addresses and make control decisions

Each thread (0..7) in each block runs the same body:

    float x = input[threadID];
    float y = func(x);
    output[threadID] = y;

The same code executes in Thread Block 0, Thread Block 1, ..., Thread Block N-1, each block with its own set of threadID values.

Thread Blocks: Scalable Cooperation

- Divide the monolithic thread array into multiple blocks
- Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization
- Threads in different blocks cannot cooperate
- Up to 65535 blocks, 512 threads/block


[Figure: the monolithic thread array divided into blocks, each block containing threads indexed 7..0.]

[Figure 3.2: An example of CUDA thread organization (courtesy: NVIDIA). The host launches Kernel 1 on Grid 1, which contains Blocks (0,0), (1,0), (0,1), and (1,1), then launches Kernel 2 on Grid 2. Block (1,1) contains Threads (0,0,0) through (3,1,0) and (0,0,1) through (3,0,1).]
Block IDs and Thread IDs

- We launch a "grid" of "blocks" of "threads"
- Each thread uses its IDs to decide what data to work on
  - Block ID: 1D or 2D
  - Thread ID: 1D, 2D, or 3D
- Simplifies memory addressing when processing multidimensional data
  - Image processing
  - Solving PDEs on volumes





CUDA Memory Model Overview

- Global memory
  - Main means of communicating R/W data between host and device
  - Contents visible to all threads
  - Long-latency access

[Figure: the CUDA memory model. The host reads and writes the grid's Global Memory; within the grid, each block has its own Shared Memory, and each thread has its own Registers.]

CUDA Device Memory Allocation

- cudaMalloc()
  - Allocates an object in device Global Memory
  - Requires two parameters:
    - Address of a pointer to the allocated object
    - Size of the allocated object
- cudaFree()
  - Frees an object from device Global Memory
  - Takes a pointer to the freed object

DON'T use a CPU pointer in a GPU function!

CUDA Device Memory Allocation (cont.)

- Code example:
  - Allocate a 64 x 64 single-precision float array
  - Attach the allocated storage to Md
  - "d" is often used to indicate a device data structure

    int TILE_WIDTH = 64;
    float *Md;
    int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);

    cudaMalloc((void**)&Md, size);
    cudaFree(Md);

CUDA Host-Device Data Transfer

- cudaMemcpy()
  - Memory data transfer
  - Requires four parameters:
    - Pointer to destination
    - Pointer to source
    - Number of bytes copied
    - Type of transfer: Host to Host, Host to Device, Device to Host, or Device to Device
- Asynchronous (non-blocking) transfers are available via cudaMemcpyAsync()

CUDA Host-Device Data Transfer (cont.)

- Code example:
  - Transfer a 64 x 64 single-precision float array
  - M is in host memory and Md is in device memory
  - cudaMemcpyHostToDevice and cudaMemcpyDeviceToHost are symbolic constants

    cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
    cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);

CUDA Function Declarations

                                    Executed on the:   Only callable from the:
    __device__ float DeviceFunc()   device             device
    __global__ void  KernelFunc()   device             host
    __host__   float HostFunc()     host               host

- __global__ defines a kernel function
  - Must return void
- __device__ and __host__ can be used together

__global__ void add(int a, int b, int *c)
{
    *c = a + b;
}

int main()
{
    int a, b, c;
    int *dev_c;

    a = 3;
    b = 4;
    cudaMalloc((void**)&dev_c, sizeof(int));
    add<<<1,1>>>(a, b, dev_c);   // 1 block and 1 thread/block
    cudaMemcpy(&c, dev_c, sizeof(int), cudaMemcpyDeviceToHost);
    printf("%d + %d is %d\n", a, b, c);
    cudaFree(dev_c);
    return 0;
}


Code Example

#define N 10

void add(int *a, int *b, int *c)
{
    int tID = 0;
    while (tID < N)
    {
        c[tID] = a[tID] + b[tID];
        tID += 1;
    }
}

int main()
{
    int a[N], b[N], c[N];

    // Fill arrays
    for (int i = 0; i < N; i++)
    {
        a[i] = i;
        b[i] = 1;
    }

    add(a, b, c);

    for (int i = 0; i < N; i++)
    {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }

    return 0;
}

Sequential Code


Adding Arrays

#include <stdio.h>
#define N 10

__global__ void add(int *a, int *b, int *c)
{
    int tID = blockIdx.x;
    if (tID < N)
    {
        c[tID] = a[tID] + b[tID];
    }
}

CUDA Code


Adding Arrays

int main()
{
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    cudaMalloc((void **)&dev_a, N*sizeof(int));
    cudaMalloc((void **)&dev_b, N*sizeof(int));
    cudaMalloc((void **)&dev_c, N*sizeof(int));

    // Fill arrays
    for (int i = 0; i < N; i++)
    {
        a[i] = i;
        b[i] = 1;
    }

    cudaMemcpy(dev_a, a, N*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, N*sizeof(int), cudaMemcpyHostToDevice);

    add<<<N,1>>>(dev_a, dev_b, dev_c);

    cudaMemcpy(c, dev_c, N*sizeof(int), cudaMemcpyDeviceToHost);

    for (int i = 0; i < N; i++)
    {
        printf("%d + %d = %d\n", a[i], b[i], c[i]);
    }

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}

Julia Fractal

- Evaluates an iterative equation for points in the complex plane
- A point is not in the set if iterating diverges and approaches infinity
- A point is in the set if iterating remains bounded
- Equation: Z_{n+1} = Z_n^2 + C, where Z is a point in the complex plane and C is a constant
- Our implementation uses the FreeImage library

CPU Fractal Implementation

- Structure to store, multiply, and add complex numbers

#include "FreeImage.h"
#include <stdio.h>
#define DIM 1000

struct cuComplex
{
    float r;
    float i;
    cuComplex( float a, float b ) : r(a), i(b) {}
    float magnitude2( void ) { return r * r + i * i; }
    cuComplex operator*(const cuComplex& a) {
        return cuComplex(r*a.r - i*a.i, i*a.r + r*a.i);
    }
    cuComplex operator+(const cuComplex& a) {
        return cuComplex(r+a.r, i+a.i);
    }
};


CPU Fractal Implementation

- Julia function

int julia(int x, int y)
{
    const float scale = 1.5;
    float jx = scale * (float)(DIM/2 - x)/(DIM/2);
    float jy = scale * (float)(DIM/2 - y)/(DIM/2);

    cuComplex c(-0.8, 0.156);
    cuComplex a(jx, jy);

    for (int i = 0; i < 200; i++)
    {
        a = a*a + c;
        if (a.magnitude2() > 1000) return 0;
    }
    return 1;
}


CPU Fractal Implementation

- What will become our kernel
- The array of char is 0 or 1 to indicate pixel or no pixel

void kernel(char *ptr)
{
    for (int y = 0; y < DIM; y++)
        for (int x = 0; x < DIM; x++)
        {
            int offset = x + y * DIM;
            ptr[offset] = julia(x, y);
        }
}


CPU Fractal Implementation

int main()
{
    FreeImage_Initialise();
    FIBITMAP *bitmap = FreeImage_Allocate(DIM, DIM, 32);

    char charmap[DIM][DIM];
    kernel(&charmap[0][0]);

    RGBQUAD color;
    for (int i = 0; i < DIM; i++){
        for (int j = 0; j < DIM; j++){
            color.rgbRed = 0;
            color.rgbGreen = 0;
            color.rgbBlue = 0;
            if (charmap[i][j] != 0)
                color.rgbBlue = 255;
            FreeImage_SetPixelColor(bitmap, i, j, &color);
        }
    }

    FreeImage_Save(FIF_BMP, bitmap, "output.bmp");
    FreeImage_Unload(bitmap);
    return 0;
}

GPU Fractal Implementation

- Assign the computation of each point to a processor
- Use a 2D grid of blocks, with the blockIdx.x and blockIdx.y variables determining which pixel we should be working on
GPU Fractal


__device__ makes this accessible from the compute
device

__device__
struct

cuComplex

{


float r;


float i;


__device__
cuComplex
( float a, float b ) : r(a), i(b) {}


__device__ float magnitude
2
( void ) { return r * r + i * i; }


__device__
cuComplex

operator*(
const

cuComplex
& a) {


return
cuComplex
(r*
a.r

-

i*
a.i
, i*
a.r

+ r*
a.i
);


}


__device__
cuComplex

operator+(
const

cuComplex
& a) {


return
cuComplex
(
r+a.r
,
i+a.i
);


}

};


GPU Fractal

__device__ int julia(int x, int y)
{
    // Same as CPU version
}

__global__ void kernel(char *ptr)
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + y * DIM;
    ptr[offset] = julia(x, y);
}



GPU Fractal

int main()
{
    FreeImage_Initialise();
    FIBITMAP *bitmap = FreeImage_Allocate(DIM, DIM, 32);

    char charmap[DIM][DIM];
    char *dev_charmap;
    cudaMalloc((void**)&dev_charmap, DIM*DIM*sizeof(char));

    dim3 grid(DIM, DIM);

    kernel<<<grid,1>>>(dev_charmap);

    cudaMemcpy(charmap, dev_charmap, DIM*DIM*sizeof(char),
               cudaMemcpyDeviceToHost);

GPU Fractal

    RGBQUAD color;
    for (int i = 0; i < DIM; i++){
        for (int j = 0; j < DIM; j++){
            color.rgbRed = 0;
            color.rgbGreen = 0;
            color.rgbBlue = 0;
            if (charmap[i][j] != 0)
                color.rgbBlue = 255;
            FreeImage_SetPixelColor(bitmap, i, j, &color);
        }
    }

    FreeImage_Save(FIF_BMP, bitmap, "output.bmp");
    FreeImage_Unload(bitmap);
    cudaFree(dev_charmap);
    return 0;
}