GPU acceleration in Matlab

Jan Kamenický

UTIA Friday seminar





9.11.2012

GPU acceleration

CPU
  fast
  general-purpose

GPU
  highly parallel
  handles specific tasks with large amounts of data

memory transfers needed

GPU acceleration in Matlab

Built-in functions
  many Matlab functions support GPU acceleration natively
arrayfun
  specific element-wise processing
CUDA kernels
  write ".cu" files
  compile to ".ptx" (parallel thread execution)
  run using feval
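
The last two routes above can be sketched roughly as follows. This is only an illustration under assumptions: the element-wise formula, the file names addVectors.cu / addVectors.ptx and the kernel signature are hypothetical and not from the seminar, and the arrayfun call assumes a release that accepts anonymous function handles on GPU data (otherwise the body would go into a function file).

```matlab
% Illustrative sketch, not the seminar's code.
a = gpuArray(rand(1000, 1));
b = gpuArray(rand(1000, 1));

% --- arrayfun: custom element-wise processing on the GPU ---
% The handle is applied to each pair of elements; inside it x and y are scalars.
f = @(x, y) sqrt(x .* x + y .* y) ./ (1 + abs(x - y));
r = arrayfun(f, a, b);            % r stays in GPU memory
rHost = gather(r);                % copy the result back to the workspace

% --- CUDA kernel: hypothetical file addVectors.cu ---
%   __global__ void addVectors(double *out, const double *a,
%                              const double *b, int n)
%   {
%       int i = blockIdx.x * blockDim.x + threadIdx.x;
%       if (i < n) out[i] = a[i] + b[i];
%   }
% Compiled outside Matlab with:  nvcc -ptx addVectors.cu
if exist('addVectors.ptx', 'file') && exist('addVectors.cu', 'file')
    k = parallel.gpu.CUDAKernel('addVectors.ptx', 'addVectors.cu');
    n = numel(a);
    k.ThreadBlockSize = [256 1 1];        % threads per block
    k.GridSize = [ceil(n / 256) 1];       % enough blocks to cover n elements
    out = gpuArray(zeros(n, 1));          % output buffer on the GPU
    out = feval(k, out, a, b, n);         % non-const pointer arguments are returned
    s = gather(out);
end
```

The kernel part is skipped when the compiled files are absent, so the arrayfun part can be tried on its own.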

Prerequisites

Matlab 2010b or newer
Parallel Computing Toolbox
  ver

Prerequisites

>> ver
-------------------------------------------------------------------------------------
MATLAB Version 7.13.0.564 (R2011b)
MATLAB License Number: XXXXXX
Operating System: Microsoft Windows 7 Version 6.1 (Build 7601: Service Pack 1)
Java VM Version: Java 1.6.0_17-b04 with Sun Microsystems Inc. Java HotSpot(TM) 64-Bit Server VM mixed mode
-------------------------------------------------------------------------------------
MATLAB                                    Version 7.13      (R2011b)
Simulink                                  Version 7.8       (R2011b)
Computer Vision System Toolbox            Version 4.1       (R2011b)
Curve Fitting Toolbox                     Version 3.2       (R2011b)
DSP System Toolbox                        Version 8.1       (R2011b)
Data Acquisition Toolbox                  Version 3.0       (R2011b)
Filter Design HDL Coder                   Version 2.9       (R2011b)
Fixed-Point Toolbox                       Version 3.4       (R2011b)
Global Optimization Toolbox               Version 3.2       (R2011b)
Image Acquisition Toolbox                 Version 4.2       (R2011b)
Image Processing Toolbox                  Version 7.3       (R2011b)
MATLAB Compiler                           Version 4.16      (R2011b)
MATLAB Distributed Computing Server       Version 5.2       (R2011b)
Neural Network Toolbox                    Version 7.0.2     (R2011b)
Optimization Toolbox                      Version 6.1       (R2011b)
Parallel Computing Toolbox                Version 5.2       (R2011b)
Partial Differential Equation Toolbox     Version 1.0.19    (R2011b)
Signal Processing Toolbox                 Version 6.16      (R2011b)
Simulink 3D Animation                     Version 6.0       (R2011b)
Statistics Toolbox                        Version 7.6       (R2011b)
Symbolic Math Toolbox                     Version 5.7       (R2011b)
Wavelet Toolbox                           Version 4.8       (R2011b)


Prerequisites

Matlab 2010b or newer
Parallel Computing Toolbox
  ver

NVIDIA GPU with CUDA compute capability 1.3 or higher
  gpuDevice

Prerequisites

>> gpuDevice

ans =

  parallel.gpu.CUDADevice handle
  Package: parallel.gpu

  Properties:
                      Name: 'GeForce GTX 285'
                     Index: 1
         ComputeCapability: '1.3'
            SupportsDouble: 1
             DriverVersion: 5
        MaxThreadsPerBlock: 512
          MaxShmemPerBlock: 16384
        MaxThreadBlockSize: [512 512 64]
               MaxGridSize: [65535 65535]
                 SIMDWidth: 32
               TotalMemory: 2.1475e+009
                FreeMemory: 1.9656e+009
       MultiprocessorCount: 30
              ClockRateKHz: 1476000
               ComputeMode: 'Default'
      GPUOverlapsTransfers: 1
    KernelExecutionTimeout: 1
          CanMapHostMemory: 1
           DeviceSupported: 1
            DeviceSelected: 1

  Methods, Events, Superclasses
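
The manual checks with ver and gpuDevice could also be scripted. The following is a minimal sketch, assuming the toolbox is listed by ver under the name 'Parallel Computing Toolbox' (as in the output above) and using the 1.3 compute capability threshold from the slides; newer Matlab releases require a higher capability, so the threshold is an assumption tied to this era.

```matlab
% Sketch of an automated prerequisite check (thresholds taken from the slides).
v = ver;                                                  % struct array of installed products
hasPct   = any(strcmp({v.Name}, 'Parallel Computing Toolbox'));
matlabOk = ~verLessThan('matlab', '7.11');                % 7.11 corresponds to R2010b

gpuOk = false;
if hasPct && gpuDeviceCount > 0
    d = gpuDevice;                                        % currently selected GPU
    gpuOk = str2double(d.ComputeCapability) >= 1.3;       % capability is stored as text, e.g. '1.3'
    fprintf('GPU: %s (compute capability %s)\n', d.Name, d.ComputeCapability);
end

fprintf('Matlab version ok: %d, toolbox present: %d, usable GPU: %d\n', ...
        matlabOk, hasPct, gpuOk);
```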



Basic usage

Send data to GPU
  either allocate there or transfer from workspace

Run Matlab functions
  GPU acceleration is used automatically

Retrieve the output data
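
The three steps can be illustrated with a short sketch. The input data and the chosen functions (fft, abs, sqrt) are arbitrary picks for illustration, not taken from the seminar.

```matlab
% 1) Send data to the GPU (here: transfer from the workspace).
x  = rand(1000, 1);
xg = gpuArray(x);

% 2) Run ordinary Matlab functions; with GPU input they execute on the GPU.
yg = sqrt(abs(fft(xg)));

% 3) Retrieve the output data into the workspace.
y = gather(yg);
```

Intermediate results such as yg stay in GPU memory, so chaining several operations before a single gather keeps the number of transfers low.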

GPUArray class

parallel.gpu.GPUArray
  main data class for GPU computations
  stored in the GPU memory
  create directly using static methods

    zeros   nan     eye     rand    linspace
    ones    true    colon   randi   logspace
    inf     false           randn

  copy from existing data
    gpuArray(img)
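
A few of the static constructors in use, as a sketch written against the class name shown on the slide. This spelling is an assumption tied to the R2011b-era toolbox; in later releases the class is called gpuArray and the same arrays can be created with calls such as zeros(3, 4, 'single', 'gpuArray').

```matlab
% Allocate directly in GPU memory, without creating a workspace copy first.
Z  = parallel.gpu.GPUArray.zeros(3, 4, 'single');   % 3x4 single-precision zeros
I3 = parallel.gpu.GPUArray.eye(3);                  % 3x3 identity
t  = parallel.gpu.GPUArray.linspace(0, 1, 5);       % 5 equally spaced points in [0, 1]
c  = parallel.gpu.GPUArray.colon(1, 5);             % GPU counterpart of 1:5
R  = parallel.gpu.GPUArray.rand(2, 2);              % uniform random numbers generated on the GPU

% Copy from existing workspace data instead.
img = magic(4);
G   = gpuArray(img);
```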

GPUArray class

Supported data types:
  (u)int8, (u)int16, (u)int32, (u)int64, single, double, logical
  determine the type using
    classUnderlying(gpuVar)

Retrieve the data using
  workspaceVar = gather(gpuVar)
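
A small sketch of the difference between the container class and the element type; the variable names are illustrative only.

```matlab
gpuVar = gpuArray(single(magic(4)));   % single-precision data on the GPU

containerClass = class(gpuVar);            % class of the GPU container itself
elementClass   = classUnderlying(gpuVar);  % type of the stored elements: 'single'

workspaceVar = gather(gpuVar);         % ordinary single array in the workspace
```

After gather, class(workspaceVar) reports the element type directly, since the data no longer lives in a GPU container.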

GPU accelerated Matlab functions (2012b)

methods('parallel.gpu.GPUArray')

abs       cast             dot       ge         issparse  normest             sinh
acos      cat              double    gt         isvector  not                 size
acosh     ceil             eig       horzcat    kron      num2str             sort
acot      chol             eps       hypot      ldivide   numel               sprintf
acoth     circshift        eq        ifft       le        perms               sqrt
acsc      classUnderlying  erf       ifft2      length    permute             squeeze
acsch     colon            erfc      ifftn      log       plot (and related)  std
all       complex          erfcinv   ifftshift  log10     plus                sub2ind
angle     cond             erfcx     imag       log1p     pow2                subsasgn
any       conj             erfinv    ind2sub    log2      power               subsindex
arrayfun  conv             exp       int16      logical   prod                subsref
asec      conv2            expm1     int2str    lt        qr                  sum
asech     convn            fft       int32      lu        rank                svd
asin      cos              fft2      int64      mat2str   rdivide             tan
asinh     cosh             fftn      int8       max       real                tanh
atan      cot              fftshift  inv        mean      reallog             times
atan2     coth             filter    ipermute   meshgrid  realpow             trace
atanh     cov              filter2   iscolumn   min       realsqrt            transpose
beta      cross            find      isempty    minus     rem                 tril
betaln    csc              fix       isequal    mldivide  repmat              triu
bitand    csch             fliplr    isequaln   mod       reshape             uint16
bitcmp    ctranspose       flipud    isfinite   mpower    rot90               uint32
bitget    cumprod          flipdim   isinf      mrdivide  round               uint64
bitor     cumsum           floor     islogical  mtimes    sec                 uint8
bitset    det              fprintf   ismatrix   ndgrid    sech                uminus
bitshift  diag             full      isnan      ndims     shiftdim            uplus
bitxor    diff             gamma     isreal     ne        sign                var
blkdiag   disp             gammaln   isrow      nnz       sin                 vertcat
bsxfun    display          gather    issorted   norm      single

Simple example

Solve system of linear equations (Ax = b)

  A = gpuArray(A);
  b = gpuArray(b);
  x = A \ b;
  x = gather(x);
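
A self-contained variant of this example with a correctness check might look as follows. The test matrix is an assumption made for illustration: a random matrix with a strengthened diagonal, so that the system is well conditioned.

```matlab
n = 500;
A = rand(n) + n * eye(n);      % assumed test data: diagonally dominant matrix
b = rand(n, 1);

% Solve on the GPU and bring the solution back.
x = gather(gpuArray(A) \ gpuArray(b));

% Relative residual computed on the CPU; it should be close to machine precision.
relRes = norm(A * x - b) / norm(b);
fprintf('relative residual: %.2e\n', relRes);
```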

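Simple example

Compute convolution using FFT

  img = gpuArray(img);
  msk = padarray(msk, size(img) - size(msk), 0, 'post');
  msk = gpuArray(msk);
  I = fft2(img);
  M = fft2(msk, size(img,1), size(img,2));
  res = real(ifft2(I .* M));
  res = gather(res);

Because msk is already zero-padded to the size of img, the explicit size arguments are redundant and the transform can be shortened to

  M = fft2(msk);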
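
Padding the mask only to the size of the image yields a circular convolution, so values wrap around the image borders. If a linear convolution is wanted, one option is to pad both transforms to size(img) + size(msk) - 1. The sketch below does that and compares the result with conv2 on the CPU; the image and mask are made-up test data.

```matlab
img = rand(64, 48);            % assumed test image
msk = ones(5) / 25;            % assumed 5x5 averaging mask

% Size of the full linear convolution.
outSize = size(img) + size(msk) - 1;

% fft2 zero-pads its input to the requested size before transforming.
I = fft2(gpuArray(img), outSize(1), outSize(2));
M = fft2(gpuArray(msk), outSize(1), outSize(2));
res = gather(real(ifft2(I .* M)));

% CPU reference: full 2-D convolution.
ref = conv2(img, msk);
maxErr = max(abs(res(:) - ref(:)));
fprintf('max difference to conv2: %.2e\n', maxErr);
```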



Linear system solution benchmark

[Figure: speedup of computations on GPU compared to CPU. x-axis: matrix size (number of equations); y-axis: speedup (0 to 3.5); series: single-precision, double-precision.]

Convolution benchmark

[Figure: speedup of computations on GPU compared to CPU. x-axis: matrix size; y-axis: speedup (0 to 5); series: single-precision, double-precision.]

Profiling

Before optimizing (trying to use GPU), locate promising parts of the code, such as
  custom code consuming the majority of time
  built-in functions that support GPUArray (consuming the majority of time)
  large input/output data, simple data types
Test the speed afterwards

GPU code cannot be profiled
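
Since the profiler does not help with GPU code, the speed test can be done by hand with tic and toc. This is a minimal sketch; the workload (a single-precision matrix product) and the size are arbitrary choices, and the GPU time deliberately includes both memory transfers.

```matlab
n = 1500;
A = rand(n, 'single');

% CPU reference timing.
tic;
Bcpu = A * A;
tCpu = toc;

% GPU timing including transfers; gather blocks until the result is available.
tic;
Ag   = gpuArray(A);
Bgpu = gather(Ag * Ag);
tGpu = toc;

fprintf('CPU %.3f s, GPU %.3f s, speedup %.2fx\n', tCpu, tGpu, tCpu / tGpu);

% Sanity check: both results should agree up to single-precision rounding.
relErr = norm(Bcpu - Bgpu, 'fro') / norm(Bcpu, 'fro');
```

The first GPU call in a session also pays for device initialization, so it is worth running the measurement more than once and discarding the first run.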

Profiling