CUDA - TZI

Uploaded by birdsowl on 2 Dec 2013. Category: Software & software engineering.

Algorithm Engineering


„GPGPU“

Stefan Edelkamp

Graphics Processing Units

- GPGPU = (GP)²U: General Purpose Programming on the GPU
- Parallelism for the masses
- Applications: Fourier transformation, model checking, bioinformatics (see CUDA-ZONE)


Programming the Graphics Processing Unit with CUDA


Overview

- Cluster / Multicore / GPU comparison
- Computing on the GPU
- GPGPU languages
- CUDA
- Small Example

Cluster / Multicore / GPU

- Cluster system
  - many unique systems
  - each one has
    - one (or more) processors
    - internal memory
    - often an HDD
  - communication over network
    - slow compared to internal communication
    - no shared memory

[Figure: three nodes, each with CPU, RAM and HDD, connected through a switch]

Cluster / Multicore / GPU

- Multicore systems
  - multiple CPUs
  - RAM
  - external memory on HDD
  - communication over RAM

[Figure: CPU1 to CPU4 sharing one RAM and one HDD]

Cluster / Multicore / GPU

- System with a Graphics Processing Unit
  - Many (240) parallel processing units
  - Hierarchical memory structure
    - RAM
    - VideoRAM
    - SharedRAM
  - Communication
    - PCI bus

[Figure: CPU, RAM and hard disk drive on the host side; graphics card with GPU, SRAM and VRAM, attached via the PCI bus]

Computing on the GPU

- Hierarchical execution
  - Groups
    - executed sequentially
  - Threads
    - executed in parallel
    - lightweight (creation / switching nearly free)
  - One kernel function
    - executed by each thread

[Figure: Group 0]

Computing on the GPU

- Hierarchical memory
  - Video RAM
    - 1 GB
    - Comparable to RAM
  - Shared RAM in the GPU
    - 16 KB
    - Comparable to registers
    - parallel access by threads

[Figure: graphics card with the GPU, its SRAM, and the VideoRAM]
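
The sizes quoted above (1 GB of Video RAM, 16 KB of Shared RAM) depend on the card. As a minimal sketch, the values for the installed card can be read with `cudaGetDeviceProperties`. The helper name `print_memory_hierarchy` is made up for this example, and device 0 is assumed to be the card of interest.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: prints the memory sizes of device 0.
// Returns true if the query succeeded.
bool print_memory_hierarchy()
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess)
        return false;

    std::printf("device:               %s\n", prop.name);
    std::printf("multiprocessors:      %d\n", prop.multiProcessorCount);
    // Video RAM is called "global memory" in the CUDA API.
    std::printf("video RAM:            %zu MB\n", prop.totalGlobalMem >> 20);
    // Shared RAM available to one group (one "block" in CUDA terms).
    std::printf("shared RAM per group: %zu KB\n", prop.sharedMemPerBlock >> 10);
    return true;
}
```

Calling `print_memory_hierarchy()` once at program start is usually enough to check whether a data structure can fit into the shared RAM of a group.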

Example architecture: G200, e.g. in the 280GTX

Example problems

Ranking and unranking with parity

2-Bit BFS

1-Bit BFS

Sliding-tile puzzle

Some Results

Further results…

GPGPU Languages

- RapidMind
  - Supports multicore, ATI, NVIDIA and Cell
  - C++ is analysed and compiled for the target hardware
- Accelerator (Microsoft)
  - Library for .NET languages
- BrookGPU (Stanford University)
  - Supports ATI, NVIDIA
  - Own language, a variant of ANSI C

CUDA

- Programming language
  - Similar to C
  - File suffix .cu
  - Own compiler called nvcc
  - Can be linked to C

CUDA

[Figure: build flow. C++ code is compiled with GCC, CUDA code is compiled with nvcc, and both are linked with ld into one executable]

CUDA

- Additional variable types
  - dim3
  - int3
  - char3

CUDA

- Different types of functions
  - __global__: invoked from the host
  - __device__: called from the device
- Different types of variables
  - __device__: located in VRAM
  - __shared__: located in SRAM
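
The qualifiers listed above can be seen together in one small kernel. This is a sketch, not taken from the slides: `GROUP_SIZE`, `offset_d`, `add_offset`, `reverse_group` and the host wrapper `reverse_groups_on_gpu` are names invented for the example. The kernel assumes each group has exactly `GROUP_SIZE` threads and that the input length is a multiple of `GROUP_SIZE`.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define GROUP_SIZE 256            // assumed number of threads per group

__device__ int offset_d = 1;      // __device__ variable: located in VRAM

// __device__ function: can only be called from code running on the GPU.
__device__ int add_offset(int x)
{
    return x + offset_d;
}

// __global__ function: the kernel, invoked from the host.
// Adds offset_d to every element and reverses each group's slice.
__global__ void reverse_group(int *data)
{
    __shared__ int tile[GROUP_SIZE];   // __shared__ variable: located in SRAM

    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;
    tile[t] = add_offset(data[base + t]);
    __syncthreads();                   // wait until the whole group has written
    data[base + t] = tile[blockDim.x - 1 - t];
}

// Hypothetical host wrapper; h.size() must be a multiple of GROUP_SIZE.
std::vector<int> reverse_groups_on_gpu(std::vector<int> h)
{
    int *d = nullptr;
    size_t bytes = h.size() * sizeof(int);
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);
    unsigned int groups = static_cast<unsigned int>(h.size() / GROUP_SIZE);
    reverse_group<<<groups, GROUP_SIZE>>>(d);
    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

The `__syncthreads()` call matters: without it a thread could read a slot of `tile` that its partner thread has not written yet.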

CUDA

- Calling the kernel function
  - name<<<dim3 grid, dim3 block>>>(...)
  - Grid dimensions (groups)
  - Block dimensions (threads)


CUDA

- Memory handling
  - cudaMalloc(...) - allocates VRAM
  - cudaMemcpy(...) - copies memory
  - cudaFree(...) - frees VRAM
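
A minimal sketch of the three calls in sequence: allocate VRAM, copy data to the device, copy it back, free the VRAM. The helper name `round_trip` is an assumption made for this example; sizes are given in bytes, which is what the runtime API expects.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: copies `host` to VRAM and back again.
// Returns the data read back from the device (empty vector if allocation fails).
std::vector<int> round_trip(const std::vector<int> &host)
{
    size_t bytes = host.size() * sizeof(int);
    int *dev = nullptr;

    if (cudaMalloc(&dev, bytes) != cudaSuccess)   // allocate VRAM
        return {};

    std::vector<int> back(host.size());
    cudaMemcpy(dev, host.data(), bytes, cudaMemcpyHostToDevice);   // RAM -> VRAM
    cudaMemcpy(back.data(), dev, bytes, cudaMemcpyDeviceToHost);   // VRAM -> RAM
    cudaFree(dev);                                                 // free VRAM
    return back;
}
```

The last argument of `cudaMemcpy` states the direction of the copy; the pointer types alone do not tell the runtime which side is the device.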

CUDA

- Distinguish threads
  - blockDim
    - Number of threads in a group
  - blockIdx
    - Id of the group (starting with 0)
  - threadIdx
    - Id of the thread within its group (starting with 0)
  - Id = blockDim.x * blockIdx.x + threadIdx.x
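
To make the id formula concrete: with 4 threads per group, thread 2 of group 1 gets Id = 4 * 1 + 2 = 6. The sketch below lets every thread write its own id so the numbering can be inspected on the host. The names `write_ids` and `global_ids` are invented for this example.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Each thread stores its global id at the position given by that id.
__global__ void write_ids(int *out, int n)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < n)
        out[id] = id;
}

// Hypothetical host helper: launches `groups` groups of
// `threads_per_group` threads and returns what the threads wrote.
std::vector<int> global_ids(int groups, int threads_per_group)
{
    int n = groups * threads_per_group;
    int *out_d = nullptr;
    cudaMalloc(&out_d, n * sizeof(int));

    write_ids<<<groups, threads_per_group>>>(out_d, n);

    std::vector<int> out(n);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(out.data(), out_d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(out_d);
    return out;
}
```

For `global_ids(3, 4)` the ids run from 0 to 11 without gaps, so each thread can use its id directly as an array index.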

CUDA

    // CPU version: one loop visits all N elements in sequence
    void inc(int *a, int b, int N)
    {
        for (int i = 0; i < N; i++)
            a[i] = a[i] + b;
    }

    int main()
    {
        ...
        inc(a, b, N);
    }

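    // GPU version: each thread handles one element
    __global__ void inc(int *a, int b, int N)
    {
        int id = blockDim.x * blockIdx.x + threadIdx.x;
        if (id < N)               // the last group may have surplus threads
            a[id] = a[id] + b;
    }

    int main()
    {
        ...
        int *a_d;
        cudaMalloc(&a_d, N * sizeof(int));                // size in bytes
        cudaMemcpy(a_d, a, N * sizeof(int), cudaMemcpyHostToDevice);
        dim3 dimBlock(blocksize, 1, 1);                   // unused dimensions are 1, not 0
        dim3 dimGrid((N + blocksize - 1) / blocksize, 1, 1);   // round up
        inc<<<dimGrid, dimBlock>>>(a_d, b, N);
    }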
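
The GPU version stops after the kernel launch. A complete round trip also copies the result back and frees the VRAM. The following is a sketch under assumptions: the wrapper `inc_on_gpu` and the default `blocksize` of 256 are not from the slides, and error checking is left out for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

__global__ void inc(int *a, int b, int N)
{
    int id = blockDim.x * blockIdx.x + threadIdx.x;
    if (id < N)
        a[id] = a[id] + b;
}

// Hypothetical wrapper: adds b to every element of a on the GPU.
std::vector<int> inc_on_gpu(std::vector<int> a, int b, int blocksize = 256)
{
    int N = static_cast<int>(a.size());
    if (N == 0)
        return a;                      // nothing to launch

    size_t bytes = N * sizeof(int);
    int *a_d = nullptr;
    cudaMalloc(&a_d, bytes);
    cudaMemcpy(a_d, a.data(), bytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(blocksize);
    dim3 dimGrid((N + blocksize - 1) / blocksize);   // enough groups to cover N
    inc<<<dimGrid, dimBlock>>>(a_d, b, N);

    cudaMemcpy(a.data(), a_d, bytes, cudaMemcpyDeviceToHost);   // fetch the result
    cudaFree(a_d);
    return a;
}
```

Rounding the grid size up is what makes the `if (id < N)` guard in the kernel necessary: for N = 1000 and 256 threads per group, 4 groups are launched and the last 24 threads have nothing to do.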

Real-World Example

- LTL model checking
  - Traversing an implicit graph G=(V,E)
  - Vertices are called states
  - Edges are represented by transitions
  - Duplicate removal needed

Real-World Example

- External model checking
  - Generate the graph with external BFS
  - Each BFS layer needs to be sorted
  - GPU proven to be fast in sorting
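
Sorting a BFS layer is what makes duplicate removal cheap: after sorting, equal states are adjacent and one linear pass removes them. The sketch below shows only this step for a layer that fits into VRAM. It uses the Thrust library shipped with the CUDA toolkit as a stand-in, not the sorting approach of the slides, and it assumes states are encoded as 64-bit integers. The name `remove_duplicates` is invented for the example.

```cuda
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/unique.h>
#include <thrust/copy.h>

// Assumption: one state fits into a 64-bit integer.
typedef unsigned long long State;

// Hypothetical helper: sorts one BFS layer on the GPU and drops duplicates.
std::vector<State> remove_duplicates(const std::vector<State> &layer)
{
    // Copy the layer into VRAM.
    thrust::device_vector<State> d(layer.begin(), layer.end());

    // Sort on the GPU so that equal states become neighbours.
    thrust::sort(d.begin(), d.end());

    // Keep the first state of every run of equal states.
    thrust::device_vector<State>::iterator new_end =
        thrust::unique(d.begin(), d.end());

    // Copy the duplicate-free prefix back to the host.
    std::vector<State> result(new_end - d.begin());
    thrust::copy(d.begin(), new_end, result.begin());
    return result;
}
```

In an external BFS the layer is read from and written to disk around this step; that part is omitted here.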

Real-World Example

- Challenges
  - Millions of states in one layer
  - Huge state size
  - Fast access only in SRAM
  - Elements need to be moved

Real-World Example

- Solutions:
  - GPUQSort
    - Quicksort optimized for GPUs
    - Intensive swapping in VRAM
  - Bitonic-based sorting
    - Fast for subgroups
    - Concatenating groups is slow
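
The "fast for subgroups" case can be sketched as a bitonic sort that runs entirely in the shared RAM of one group. This is a textbook formulation, not the code behind the slides. `GROUP_SIZE` (a power of two), `bitonic_sort_groups` and the host wrapper `sort_each_group` are assumptions made for the example, and the input length must be a multiple of `GROUP_SIZE`. The slow step of merging the sorted groups is deliberately left out.

```cuda
#include <vector>
#include <cuda_runtime.h>

#define GROUP_SIZE 256   // assumed group size; must be a power of two

// Every group sorts its own slice of GROUP_SIZE elements in ascending order.
__global__ void bitonic_sort_groups(int *data)
{
    __shared__ int s[GROUP_SIZE];
    int t = threadIdx.x;
    int base = blockIdx.x * GROUP_SIZE;

    s[t] = data[base + t];           // load the slice into SRAM
    __syncthreads();

    for (int k = 2; k <= GROUP_SIZE; k <<= 1) {        // size of bitonic sequences
        for (int j = k >> 1; j > 0; j >>= 1) {         // compare distance
            int partner = t ^ j;
            if (partner > t) {                         // lower thread handles the pair
                bool ascending = (t & k) == 0;
                if ((s[t] > s[partner]) == ascending) {
                    int tmp = s[t];
                    s[t] = s[partner];
                    s[partner] = tmp;
                }
            }
            __syncthreads();                           // finish the step before the next
        }
    }
    data[base + t] = s[t];           // write the sorted slice back to VRAM
}

// Hypothetical host wrapper; h.size() must be a multiple of GROUP_SIZE.
std::vector<int> sort_each_group(std::vector<int> h)
{
    int *d = nullptr;
    size_t bytes = h.size() * sizeof(int);
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h.data(), bytes, cudaMemcpyHostToDevice);
    unsigned int groups = static_cast<unsigned int>(h.size() / GROUP_SIZE);
    bitonic_sort_groups<<<groups, GROUP_SIZE>>>(d);
    cudaMemcpy(h.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d);
    return h;
}
```

All compare-and-swap steps touch only SRAM, which is why this is fast inside a group. The result is a sequence of sorted runs, not one sorted array.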

Real-World Example

- Our solution
  - States S presorted by hash H(S)
  - Each bucket sorted in SRAM by a group

[Figure: buckets moving between VRAM and SRAM]

Real-World Example

- Our solution
  - Order given by H(S), S
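
The order "H(S), S" is a lexicographic order: states are compared by hash first, and ties are broken by the state itself. The sketch below shows only this ordering. The slides do not say which hash is used, so `H` here is a made-up multiplicative hash into 256 buckets; states are assumed to be 64-bit integers; and `thrust::sort` stands in for the bucket-wise sorting in SRAM described above.

```cuda
#include <vector>
#include <thrust/device_vector.h>
#include <thrust/sort.h>
#include <thrust/copy.h>

typedef unsigned long long State;   // assumed state encoding

// Hypothetical hash H(S): the top 8 bits of a multiplicative hash (256 buckets).
__host__ __device__ inline unsigned int H(State s)
{
    return static_cast<unsigned int>((s * 0x9E3779B97F4A7C15ULL) >> 56);
}

// Strict weak ordering: first by H(S), ties broken by S itself.
struct HashThenState {
    __host__ __device__ bool operator()(State a, State b) const
    {
        unsigned int ha = H(a), hb = H(b);
        return ha < hb || (ha == hb && a < b);
    }
};

// Hypothetical helper: sorts the states on the GPU by (H(S), S).
std::vector<State> sort_by_hash_then_state(const std::vector<State> &states)
{
    thrust::device_vector<State> d(states.begin(), states.end());
    thrust::sort(d.begin(), d.end(), HashThenState());
    std::vector<State> out(states.size());
    thrust::copy(d.begin(), d.end(), out.begin());
    return out;
}
```

Under this order all states with the same hash form one contiguous bucket, and equal states end up next to each other inside that bucket, which is what duplicate removal needs.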

Real-World Example

- Results


Questions???

Programming the GPU