CS420: Introduction to Accelerators and GPGPUs


Laxmikant V. Kale

Fall 2009

Definition


Specialized hardware for accelerating certain types of operations

Less general than CPUs: typically requires a CPU for control, and is incapable of emulating a CPU (or exceedingly slow when doing so)

May have limited functionality, e.g., digital signal processors or legacy graphics chips

Compared to a CPU, may have different performance characteristics: designed for high throughput rather than low latency

May have a restricted programming model


Why Do We Need Accelerators?



CPUs are Turing complete: they can emulate any type of computation performed by accelerator architectures

The emulation, however, may not be fast enough

Consider performing floating-point computation with libraries that operate on integer instructions: this was once the norm, and hardware floating-point units were sold as accelerator chips

Another example: graphics rendering using the CPU

Chip Size Characteristics



Chip area is determined by manufacturing efficiency and is roughly constant over time: typically 100-500 mm²

The number of transistors per unit area doubles roughly every two years, per Moore's Law

Pentium (1993): 3.1 million transistors

Intel Core i7 Nehalem (2008): 731 million transistors

CPU architects must decide how to use the additional transistors provided by each process shrink
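A quick check that the two data points above are consistent with the stated doubling rate (die area being roughly constant, the per-chip and per-area growth rates track each other):

\[
\frac{731 \times 10^6}{3.1 \times 10^6} \approx 236 \approx 2^{7.9},
\qquad
\frac{2008 - 1993\ \text{years}}{7.9\ \text{doublings}} \approx 1.9\ \text{years per doubling}
\]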



CPU Design Methodology



CPUs have traditionally been designed for fast execution of sequential applications

Under this methodology, transistors have been allocated to:

Decreasing the latency of memory access: large caches and cache hierarchies

Exploiting instruction-level parallelism: multiple instructions fetched per cycle, more execution units, larger register files, dynamic dependency resolution

Limiting the effects of branches: efficient branch prediction and speculative execution

An Alternative Design Approach



Only a small fraction of a CPU's total number of transistors is used for execution units

Most transistors are used to extract parallelism from sequential programs and to hide memory access latency through caching

What if typical applications had large amounts of explicitly defined independent computation?

We could then spend most of the CPU's transistors on execution units and a high-bandwidth memory interface

Graphics Processing Units



Hardware devices that produce an image from a higher-level description of its components

Unlike CPUs, designed explicitly for a massively parallel application area: large groups of pixels can be dealt with independently

Historically designed with specialized execution units organized into a graphics pipeline

More recently, shader units have been unified into a single type of execution unit, which improved load balancing in graphics applications and generalized the hardware

GPU Architecture Overview



Architectural features correspond to aspects of the graphics application area:

Ample parallelism in pixel-shading applications: a large number of computational units and high memory bandwidth

Data parallelism of pixel-shading applications: a SIMD model of computation

Lack of communication between data-parallel units (pixels): little support for data exchange between computational units, and few synchronization primitives

Single-precision floating point is accurate enough for graphics: most computational units support only single-precision floating-point arithmetic

GPU Performance



Peak floating-point performance

NVIDIA GTX 285: 1116 GFLOPS SP, 93 GFLOPS DP

3 GHz quad-core CPU with 4 instructions per cycle: 48 GFLOPS DP, 96 GFLOPS SP

Peak memory bandwidth

NVIDIA GTX 285: 166.4 GB/s

Core i7 (Nehalem): 32 GB/s, using three channels of 1333 MHz RAM
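The CPU figures follow from cores × clock × floating-point operations per cycle; as a hedged reconstruction, the per-cycle rates below assume SSE-width SIMD (4 DP or 8 SP operations per cycle per core):

\[
4\ \text{cores} \times 3\,\text{GHz} \times 4\ \tfrac{\text{DP ops}}{\text{cycle}} = 48\ \text{GFLOPS (DP)},
\qquad
4\ \text{cores} \times 3\,\text{GHz} \times 8\ \tfrac{\text{SP ops}}{\text{cycle}} = 96\ \text{GFLOPS (SP)}
\]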




General Purpose Programming on the GPU



In the past, GPUs only supported a graphics API: to use them for general-purpose computation, programs had to be written in terms of shading operations on pixels

Brook GPU: one of the first general-purpose abstractions for the GPU

Brook was a language for an experimental machine at Stanford; the GPU turned out to be a good fit for the language

Stream computation: operations performed on arrays (streams) of values

Operations corresponding to different indices can be computed concurrently

Streamed data can be coalesced, guaranteeing high bandwidth (see the sketch below)
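Brook code itself is not reproduced here; the following is a minimal CUDA sketch of the same stream idea, with the kernel name, operation, and launch parameters chosen purely for illustration. One kernel applies an element-wise operation, and every index can execute concurrently:

#include <cuda_runtime.h>

// Element-wise stream operation: each thread handles one stream index,
// so operations at different indices proceed concurrently. Consecutive
// threads touch consecutive addresses, so memory accesses coalesce.
__global__ void saxpy(int n, float a, const float* x, float* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                          // guard against overshooting n
        y[i] = a * x[i] + y[i];
}

// Launched with enough 256-thread blocks to cover all n elements:
//   saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);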


A Throughput-Based Model of Computation



Recall that GPUs lack caches, which on average leads to high memory latency

To mask the effects of memory latency, GPGPUs rely on computation-communication overlap: while an instruction is waiting for data to arrive from memory, other instructions for which data has arrived may execute

Effectively hiding latency requires massive parallelism: typically thousands of threads in flight simultaneously (a rough estimate follows below)
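Little's law gives a back-of-the-envelope sense of scale; the 500 ns memory latency below is an illustrative assumption, not a measured figure for any particular part:

\[
\text{data in flight} \approx \text{bandwidth} \times \text{latency}
= 166.4\ \text{GB/s} \times 500\,\text{ns} \approx 83\ \text{KB}
\]

At a handful of bytes outstanding per thread, keeping roughly 83 KB in flight does indeed take thousands of concurrent threads.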

CUDA



Compute Unified Device Architecture: a hardware/software model for general-purpose programming on the GPU, implemented on NVIDIA GPUs

Data-parallel computation model: the main idea is to execute the same program (or kernel) on multiple data locations concurrently

Based on C with extensions; kernels execute on the GPU, while the rest of the code executes on the CPU

Each of the threads executing a kernel is given a unique thread ID

The thread ID can be used to have each thread operate on a different index of an array, or in branches to allow threads to follow different branch paths (see the sketch below)
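To illustrate the branching point, here is a hypothetical kernel (its name and branch bodies are invented for illustration); note that under the SIMD execution model described earlier, threads in the same SIMD group that take different paths execute those paths serially:

__global__ void branchy(float* out)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid % 2 == 0)
        out[tid] = 0.0f;    // even-numbered threads take this path
    else
        out[tid] = 1.0f;    // odd-numbered threads take this one
}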


GTX 285 Computation Hardware


[Figure: GTX 285 computation hardware. Source: NVIDIA]

A Simple Example: Matrix Addition


Source: CUDA 2.1 Programming Guide
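The slide's listing is the guide's single-block MatAdd example; the sketch below follows it, with the host-side setup, flat array layout, and N = 16 filled in as illustrative assumptions:

#include <cstdio>
#include <cuda_runtime.h>

#define N 16

// One thread per matrix element; with a single block, threadIdx.x
// and threadIdx.y identify the element directly.
__global__ void MatAdd(const float* A, const float* B, float* C)
{
    int i = threadIdx.y * N + threadIdx.x;
    C[i] = A[i] + B[i];
}

int main()
{
    float hA[N * N], hB[N * N], hC[N * N];
    for (int i = 0; i < N * N; ++i) { hA[i] = (float)i; hB[i] = 2.0f * i; }

    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(hA));
    cudaMalloc(&dB, sizeof(hB));
    cudaMalloc(&dC, sizeof(hC));
    cudaMemcpy(dA, hA, sizeof(hA), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, sizeof(hB), cudaMemcpyHostToDevice);

    dim3 dimBlock(N, N);                  // one block of N x N threads
    MatAdd<<<1, dimBlock>>>(dA, dB, dC);
    cudaMemcpy(hC, dC, sizeof(hC), cudaMemcpyDeviceToHost);

    printf("C[5] = %.1f\n", hC[5]);       // 5 + 10 = 15.0
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

With dimBlock(N, N), each thread's (threadIdx.x, threadIdx.y) pair selects the single element it adds.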

CUDA Execution Model



The CUDA scheduler assigns work to clusters of GPU streaming processors (streaming multiprocessors, or SMs) in units of blocks

The previous example had just one block, so only 1 of the 30 SMs on the GTX 285 would be utilized

Having more blocks per kernel grid than there are SMs on a chip is a good idea: greater tolerance of memory latency (see the multi-block example on the next slide)


Matrix Addition Using Multiple Blocks


Source: CUDA 2.1 Programming Guide
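Again following the guide's multi-block example in spirit; the matrix size n = 1024 and the omitted host setup (initialization and copy-back, as in the earlier sketch) are assumptions:

#include <cuda_runtime.h>

// Multi-block version: the matrix is tiled into 16 x 16 thread blocks,
// and blockIdx plus threadIdx give each thread its global (row, col).
__global__ void MatAdd(const float* A, const float* B, float* C, int n)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n)               // the grid may overshoot the edge
        C[row * n + col] = A[row * n + col] + B[row * n + col];
}

int main()
{
    const int n = 1024;
    const size_t bytes = (size_t)n * n * sizeof(float);
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    // (Initialization and copy-back omitted; see the single-block sketch.)

    dim3 dimBlock(16, 16);
    dim3 dimGrid((n + dimBlock.x - 1) / dimBlock.x,
                 (n + dimBlock.y - 1) / dimBlock.y);   // 64 x 64 = 4096 blocks
    MatAdd<<<dimGrid, dimBlock>>>(dA, dB, dC, n);
    cudaDeviceSynchronize();

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}

4096 blocks vastly oversubscribe the 30 SMs of a GTX 285, giving the scheduler plenty of independent work with which to hide memory latency.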

Further Reading



CUDA 2.1 Programming Guide
http://developer.download.nvidia.com/compute/cuda/2_1/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.1.pdf

Detailed description of the GT200 architecture and the CUDA execution model
http://www.realworldtech.com/page.cfm?ArticleID=RWT090808195242

Fermi: recently announced by NVIDIA



http://www.nvidia.com/object/fermi_architecture.html

http://www.nvidia.com/object/pr_oakridge_093009.html#rssid?=cuda_oakridnat0930

It is not here yet.