4th-Generation Mobile Communication Research, Phase 2, Quarter 1/2 Presentation - Wireless Network Research Lab (무선네트워크연구실)


Dec 2, 2013




Wireless Network Research Lab

2011-06-11

Ji-woo Kang

CUDA Preview






What is (Historical) GPGPU?


General Purpose computation using GPU and graphics API in
applications other than 3D graphics


GPU accelerates critical path of application



Data parallel algorithms leverage GPU attributes


Large data arrays, streaming throughput


Fine-grain SIMD parallelism


Low-latency floating point (FP) computation


Applications


see GPGPU.org


Game effects (FX) physics, image processing


Physical modeling, computational engineering, matrix algebra,
convolution, correlation, sorting






CUDA



“Compute Unified Device Architecture”


General purpose programming model


User kicks off batches of threads on the GPU


GPU = dedicated super-threaded, massively data parallel co-processor


Targeted software stack


Compute oriented drivers, language, and tools


Driver for loading computation programs into GPU


Standalone driver optimized for computation


Interface designed for compute


Graphics-free API


Data sharing with OpenGL buffer objects


Guaranteed maximum download & readback speeds


Explicit GPU memory management







Parallel Computing on a GPU


8-series GPUs deliver 25 to 200+ GFLOPS on compiled parallel C applications


Available in laptops, desktops, and clusters



GPU parallelism is doubling every year


Programming model scales transparently



Programmable in C with CUDA tools


Multithreaded SPMD model uses application

data parallelism and thread parallelism

GeForce 8800

Tesla S870

Tesla D870





CUDA Compilation


As a programming model, CUDA is a set of extensions to ANSI C


CPU code is compiled by the host C compiler, and GPU code (the kernel) is compiled by the CUDA compiler; separate binaries are produced







CUDA Stack





Limitations of CUDA


Tesla does not fully support the IEEE spec for double-precision floating point operations


(although double precision will be resolved with Fermi)



Code only supported on NVIDIA hardware


No use of recursive functions (can workaround)


Bus latency between host CPU and GPU





Thread Hierarchy


Thread



Distributed by the CUDA runtime


(identified by threadIdx)

Warp



A scheduling unit of up to 32 threads


Block



A user-defined group of 1 to 512 threads


(identified by blockIdx)








Grid



A group of one or more blocks. A grid is created for each CUDA kernel function

(identified by its grid dimensions)
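The hierarchy above can be sketched in a minimal kernel; the kernel name `scale` and its arguments are hypothetical, used only for illustration:

```cuda
// Each thread handles one array element; its global index is built from
// the block index, the block size, and the thread index within the block.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)               // guard: the last block may be only partially full
        data[i] *= factor;
}
```

A launch such as `scale<<<(n + 511) / 512, 512>>>(d_data, 2.0f, n);` creates a grid of blocks of 512 threads each, the per-block maximum quoted above.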






CUDA Memory Hierarchy

The CUDA platform has three primary memory types


Local Memory


per-thread memory for automatic variables and register spilling



Shared Memory


per-block low-latency memory that allows intra-block data sharing and synchronization. Threads can safely share data through this memory and can perform barrier synchronization through __syncthreads()



Global Memory


device-level memory that may be shared between blocks or grids
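The three spaces can be seen side by side in a small sketch kernel (the name `reverse_each_block` is hypothetical; it assumes a launch with 256 threads per block):

```cuda
__global__ void reverse_each_block(float *g_data)  // g_data: global memory
{
    __shared__ float s_buf[256];      // shared memory: one copy per block
    int t = threadIdx.x;              // automatic variable: register/local memory

    int base = blockIdx.x * blockDim.x;
    s_buf[t] = g_data[base + t];
    __syncthreads();                  // barrier: all writes to s_buf now visible

    g_data[base + t] = s_buf[blockDim.x - 1 - t];  // reverse within the block
}
```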





Moving Data…

CUDA allows us to copy data from one
memory type to another.


This includes dereferencing pointers,
even in the host’s memory (main system
RAM)


To facilitate this data movement, CUDA provides cudaMemcpy()





Code Example

Will be explained more in depth later…
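A minimal end-to-end sketch of the pattern described above (hypothetical names; error checking omitted for brevity):

```cuda
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

void add_on_gpu(const float *h_a, const float *h_b, float *h_c, int n)
{
    size_t bytes = n * sizeof(float);
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);

    // Host -> device copies of the inputs
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    vec_add<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);

    // Device -> host copy of the result
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
}
```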





Kernel Functions


A kernel function is the basic unit of work within a CUDA thread



Kernel functions are CUDA extensions to ANSI C that are compiled by the CUDA compiler and the object code generator
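A kernel is marked with the `__global__` qualifier; helper functions compiled for the device use `__device__`. The names below are hypothetical:

```cuda
__device__ float square(float x)    // compiled for the GPU, callable from kernels
{
    return x * x;
}

__global__ void square_all(float *data, int n)  // kernel: launched from host code
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] = square(data[i]);
}
```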






Kernel Limitations


There must be no recursion; there is no call stack


There must be no static variable declarations



Functions must have a fixed (non-variable) number of arguments





Myths About CUDA


GPUs are the only processors in a CUDA application


No; a CUDA application uses both the CPU and the GPU, with the GPU acting as a co-processor



GPUs have very wide (1000s) SIMD machines


No, a CUDA Warp is only 32 threads



Branching is not possible on GPUs


Incorrect; branching is supported, although divergent branches within a warp are serialized.



GPUs are power-inefficient


Nope, performance per watt is quite good



CUDA is only for C or C++ programmers


Not true, there are third party wrappers for Java, Python, and more





Different Types of CUDA Applications





OpenCL Commercial Objectives


Grow the market for parallel computing


For vendors of systems, silicon, middleware, tools and applications


Open, royalty-free standard for heterogeneous parallel computing


Unified programming model for CPUs,
GPUs
, Cell, DSP and other processors in a system


Cross-vendor software portability to a wide range of silicon and systems


HPC servers, desktop systems and handheld devices covered in one specification


Support for a wide diversity of applications


From embedded and mobile software through consumer applications to HPC solutions


Create a foundation layer for a parallel computing ecosystem


Close-to-the-metal interface to support a rich diversity of middleware and applications


Rapid deployment in the market


Designed to run on the latest generations of GPU hardware





OpenCL Working Group


Diverse industry participation


Processor vendors, system OEMs, middleware vendors, application developers


Many industry-leading experts involved in OpenCL’s design


A healthy diversity of industry perspectives


Apple initially proposed and is very active in the working group


Serving as specification editor


Many other companies also participate in the OpenCL working group





Interoperability


What is the common part? Data.


We will allocate a block of memory that will be accessible by both CUDA and OpenGL





Interoperability


As the same block of memory will be accessible from both CUDA and OpenGL, we need two different identifiers for it
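One common way to obtain the two identifiers is the CUDA graphics-interop runtime API. The sketch below assumes an already-created OpenGL buffer object; the names `vbo` and `cuda_res` are hypothetical:

```cuda
#include <cuda_gl_interop.h>

// Register an existing OpenGL buffer object with CUDA, obtaining a second,
// CUDA-side identifier (a cudaGraphicsResource) for the same memory.
void run_cuda_on_vbo(GLuint vbo)
{
    cudaGraphicsResource *cuda_res;
    cudaGraphicsGLRegisterBuffer(&cuda_res, vbo, cudaGraphicsMapFlagsNone);

    // Map the resource to get a raw device pointer usable by kernels.
    float *d_ptr;
    size_t bytes;
    cudaGraphicsMapResources(1, &cuda_res, 0);
    cudaGraphicsResourceGetMappedPointer((void **)&d_ptr, &bytes, cuda_res);

    // ... launch kernels that read/write d_ptr here ...

    // Unmap before OpenGL touches the buffer again.
    cudaGraphicsUnmapResources(1, &cuda_res, 0);
    cudaGraphicsUnregisterResource(cuda_res);
}
```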





Reconstructing MR Images

Cartesian Scan Data

Spiral Scan Data

Gridding

FFT

LS

Cartesian scan data + FFT:

Slow scan, fast reconstruction, images may be poor

(k-space trajectory diagrams: kx vs. ky)




Reconstructing MR Images

Cartesian Scan Data

Spiral Scan Data

Gridding

FFT

LS

Spiral scan data + Gridding + FFT:

Fast scan, fast reconstruction, better images





Reconstructing MR Images

Cartesian Scan Data

Spiral Scan Data

Gridding

FFT

Least-Squares (LS)

Spiral scan data + LS

Superior images at expense of significantly more computation





Summation


GPGPU


Massively Parallel Processor



CUDA

OpenCL


For multi-platform environments



Interoperability with OpenGL,
DirectX