CS420: Introduction to Accelerators and GPGPUs


Laxmikant V. Kale


Specialized hardware for accelerating certain types of computation

Less general than CPUs

Typically require a CPU for control

Incapable of emulating a CPU, or exceedingly slow when doing so

May have limited functionality

Digital Signal Processors

Legacy graphics chips

Compared to a CPU, may have different performance characteristics

Designed for high throughput rather than low latency

May have a restricted programming model


CS420: Accelerators and GPGPUs

Fall 2009

Why Do We Need Accelerators?



CPUs are Turing complete

They can emulate any type of computation performed by
accelerator architectures

May not be fast enough

Consider performing floating point computation by using
libraries which operate on integer instructions

This was once the norm, and hardware floating point units were
available as accelerator chips

Another example: graphics rendering using the CPU

Chip Size Characteristics



Chip area is determined by manufacturing efficiency

Roughly constant over time

Typically 100–500 mm²

Number of transistors per unit area doubles roughly every
two years due to Moore’s Law


Intel Pentium (1993)

3.1 million transistors

Intel Core i7 Nehalem (2008)

731 million transistors

CPU architects must decide how to use the additional
transistors provided by a process shrink

CPU Design Methodology



CPUs have traditionally been designed for fast execution
of sequential applications

Under this methodology, transistors have been allocated to:

Decrease latency of memory access

Large caches, cache hierarchies

Exploit instruction level parallelism

Multiple instructions fetched per cycle

More execution units, larger register files

Dynamic dependency resolution

Limit effects of branches

Efficient branch prediction

Speculative execution

An Alternative Design Approach



Only a small fraction of a CPU’s total number of
transistors is used for execution units

Most transistors are used to extract parallelism from
sequential programs and hide latency of memory access
through caching

What if typical applications had large amounts of
well-defined independent computation?

We could spend most of the CPU's transistors on execution
units and a high-bandwidth memory interface

Graphics Processing Units



Hardware devices which produce an image from a higher-level
description of its components

Unlike CPUs, designed explicitly for a massively parallel
application area

Large groups of pixels can be dealt with independently

Historically, designed by having specialized execution
units organized into a graphics pipeline

More recently, shader units have been unified into a
single type of execution unit

Improved load balancing in graphics applications

Generalized the hardware

GPU Architecture Overview



Architectural features correspond to aspects of graphics
application area

Ample parallelism in pixel shading applications

Large number of computational units

High memory bandwidth

Data parallelism of pixel shading applications

SIMD model of computation

Lack of communication between data parallel units (pixels)

Little support for data exchange between computational units

Few synchronization primitives

Single-precision floating point computation is accurate enough
for graphics

Most computational units only support single-precision
floating point

GPU Performance



Peak floating point performance

NVIDIA GTX 285: 1116 GFLOPS SP, 93 GFLOPS DP

3 GHz quad-core CPU with 4 inst/cycle: 48 GFLOPS DP, 96 GFLOPS SP

Peak memory bandwidth

NVIDIA GTX 285: 166.4 GB/s

Core i7 (Nehalem): 32 GB/s

Using three channels of 1333 MHz RAM

General Purpose Programming on the GPU



In the past, GPUs only supported a graphics API

In order to use GPUs for general purpose computation,
programs had to be written in terms of shading operations
Brook GPU

One of the first general-purpose abstractions for the GPU

Brook was a language for an experimental machine at Stanford

GPU turned out to be a good fit for the language

Stream computation

Operations performed on arrays (streams) of values

Operations corresponding to different indices can be computed
independently
Streamed data can be coalesced, guaranteeing high bandwidth
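Brook's stream kernels map naturally onto what later became CUDA kernels. A hypothetical elementwise stream operation might look like the following (CUDA-style sketch, not actual Brook syntax; the kernel name is illustrative):

```cuda
// Elementwise "stream" kernel: out[i] = a[i] * x + b[i] for every index i.
// Each element is independent, so all of them may be computed concurrently,
// and consecutive threads read consecutive addresses (coalesced access).
__global__ void saxpy_stream(int n, float x,
                             const float *a, const float *b, float *out)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = a[i] * x + b[i];
}
```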

A Throughput-Based Model of Computation



Recall that GPUs lack caches

On average, leads to high memory latency

In order to mask the effects of memory latency, GPGPUs
rely on computation-communication overlap

While an instruction is waiting for data to arrive from memory,
other instructions for which data has arrived may execute

Requires massive parallelism to effectively hide latency

Typically thousands of threads in flight simultaneously


CUDA


Compute Unified Device Architecture

A hardware/software model for general purpose programming on
the GPU

Implemented on NVIDIA GPUs

Data parallel computation model

Main idea: execute the same program (or kernel) on multiple data
locations concurrently

Based on C with extensions

Kernels execute on the GPU; the rest of the code executes on the
CPU
Each of the threads executing a kernel is given a unique index

Can be used to have each thread operate on a different index of an array

Can be used in branches to allow threads to follow different branch paths
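Both uses of the thread index mentioned above, array indexing and divergent branching, can be sketched in one kernel. The kernel name and the halves-of-array split are made up for illustration:

```cuda
__global__ void index_demo(float *data, int n)
{
    // Unique index: each thread touches a different element.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;

    // Index used in a branch: different threads take different paths.
    if (i < n / 2)
        data[i] *= 2.0f;   // threads in the first half double their element
    else
        data[i] += 1.0f;   // threads in the second half increment theirs
}
```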

GTX 285 Computation Hardware



Source: NVIDIA

A Simple Example: Matrix Addition



Source: CUDA 2.1 Programming Guide
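The code on this slide did not survive extraction. A sketch in the spirit of the guide's single-block example (N is a placeholder size; a block on this hardware generation holds at most 512 threads, so N×N must stay under that):

```cuda
#define N 16  // N*N threads must fit in a single block (512 max here)

// One thread per matrix element; indices come straight from threadIdx.
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

// Launched with a single block of N x N threads:
//   dim3 dimBlock(N, N);
//   MatAdd<<<1, dimBlock>>>(A, B, C);
```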

CUDA Execution Model



CUDA scheduler assigns work to clusters of GPU
streaming processors (streaming multiprocessors) in units
of blocks

Previous example had just one block: only 1 of the 30
SMs on the GTX 285 would be utilized

Having more blocks per kernel grid than there are SMs on
a chip is a good idea

Greater tolerance of memory latency

Matrix Addition Using Multiple Blocks



Source: CUDA 2.1 Programming Guide
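Again the code itself was lost in extraction. A sketch of the multi-block variant, where each thread derives a global index from its block and thread indices (the 16×16 block shape is illustrative):

```cuda
#define N 1024

__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N])
{
    // Global indices: block offset plus position within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i < N && j < N)             // guard threads in partial edge blocks
        C[i][j] = A[i][j] + B[i][j];
}

// Launched with enough 16x16 blocks to cover the matrix, so many blocks
// are available to spread across the SMs:
//   dim3 dimBlock(16, 16);
//   dim3 dimGrid((N + 15) / 16, (N + 15) / 16);
//   MatAdd<<<dimGrid, dimBlock>>>(A, B, C);
```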

Further Reading



CUDA 2.1 Programming Guide


Detailed description of GT 200 architecture and CUDA
execution model


Fermi: recently announced by NVIDIA





It is not here yet