GPU Computing


Low-cost High Performance Embedded Computing

David Kaeli


Department of Electrical and Computer Engineering

Northeastern University

Boston, MA


One end of the parallel computing spectrum: Multi-core computing


The CPU industry has elected to jump off the cycle-time scaling bandwagon
Power/thermal constraints have become a limiting factor
CPU vendors are placing multiple cores on a single chip
Clock speeds have not changed
The memory wall persists, and multi-core places further pressure on this problem
Software vendors are looking for new parallelization technology:
Multi-core aware operating systems
Parallelizing compilers
Programming frameworks

The other end of the spectrum: Graphics Processors


Graphics Processing Units


More than 65% of Americans played a video game in 2010
The global video game market was $51.7B in 2011
High-end GPUs are primarily used for 3-D rendering for video game graphics and movie animation
Manufacturers include NVIDIA, AMD/ATI, and IBM (Cell)
A very competitive commodity market





[Figure: NVIDIA comparison of CPU and GPU hardware architectures, plotting single-precision GFLOPS and memory bandwidth (GB/sec). Source: NVIDIA]

Comparison of CPU and GPU Hardware Architectures
(Source: NVIDIA, AMD, and Intel)

CPU/GPU             Single-precision TFLOPs   Cores   GFLOPs/Watt   $/GFLOP
NVIDIA 285          1.06                      240     5.8           $0.09
NVIDIA 295          1.79                      480     6.2           $0.08
NVIDIA 480          1.34                      480     5.4           $0.41
AMD HD 6990         5.10                      3072    11.3          $0.14
AMD HD 5870         2.72                      1600    14.5          $0.16
AMD HD 4890         1.36                      800     7.2           $0.18
Intel Core i7 965   0.051                     4       0.39          $11.02

CPU vs GPU Architectures

CPU:
Irregular data accesses
More cache and control logic
Focus on per-thread performance

GPU:
Regular data accesses
More ALUs, massively parallel
Throughput oriented (see the kernel sketch below)
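To make the contrast concrete, here is a minimal SAXPY sketch in CUDA: the CPU path is a single serial loop over the array, while the GPU path assigns one lightweight thread per element and relies on sheer thread count for throughput. The function names, problem size, and launch configuration are illustrative choices, not taken from the slides; the CPU function is included only for side-by-side comparison.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// CPU version: a single thread walks the whole array, relying on
// large caches and strong per-thread performance.
void saxpy_cpu(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
}

// GPU version: thousands of lightweight threads each handle one
// element, trading per-thread speed for aggregate throughput.
__global__ void saxpy_gpu(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];   // guard the array tail
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h_x = new float[n], *h_y = new float[n];
    for (int i = 0; i < n; ++i) { h_x[i] = 1.0f; h_y[i] = 2.0f; }

    float *d_x, *d_y;
    cudaMalloc(&d_x, bytes);
    cudaMalloc(&d_y, bytes);
    cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, h_y, bytes, cudaMemcpyHostToDevice);

    // Regular, data-parallel accesses: blocks of 256 threads cover all n elements.
    saxpy_gpu<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y);
    cudaMemcpy(h_y, d_y, bytes, cudaMemcpyDeviceToHost);

    printf("y[0] = %f\n", h_y[0]);   // expect 4.0
    cudaFree(d_x); cudaFree(d_y);
    delete[] h_x; delete[] h_y;
    return 0;
}
```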

A wide range of CUDA applications

3D image analysis
Adaptive radiation therapy
Acoustics
Astronomy
Audio
Automobile vision
Bioinformatics
Biological simulation
Broadcast
Cellular automata
Fluid dynamics
Computer vision
Cryptography
CT reconstruction
Data mining
Digital cinema / projections
Electromagnetic simulation
Equity trading
Film
Financial
GIS
Holographic cinema
Intrusion detection
Machine learning
Mathematics research
Military
Mine planning
Molecular dynamics
MRI reconstruction
Multispectral imaging
N-body simulation
Network processing
Neural networks
Oceanographic research
Optical inspection
Particle physics
Protein folding
Quantum chemistry
Ray tracing
Radar
Reservoir simulation
Robotic vision / AI
Robotic surgery
Satellite data analysis
Seismic imaging
Surgery simulation
Surveillance
Ultrasound
Video conferencing
Telescope
Video
Visualization
Wireless
X-ray

NVIDIA architecture

[Diagram: Streaming Processor Array composed of 10 Texture Processor Clusters (TPCs); each TPC contains Streaming Multiprocessors (SMs) and a Texture Unit, and each SM contains 8 streaming processors (SPs) and 2 special function units (SFUs).]

Grid of thread blocks
Multiple thread blocks, many warps of threads
Individual threads (see the thread-hierarchy sketch below)
240 shader cores
1.4B transistors
Up to 2GB onboard memory
~150GB/sec memory bandwidth
1.06 SP TFLOPS
CUDA and OpenCL support
Programmable memory spaces
Tesla S1070 provides 4 GPUs in a 1U unit
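In CUDA terms, the hierarchy above (a grid of thread blocks, blocks made of warps, and individual threads) is exactly what a launch configuration expresses. A minimal sketch; the kernel name, the image-scaling workload, and the 16x16 block shape are illustrative, not from the slides.

```cuda
#include <cuda_runtime.h>

// Each thread derives its global 2-D coordinates from the hierarchy:
// the grid supplies blockIdx, the block supplies threadIdx, and the
// hardware schedules each block's threads in warps of 32 on an SM.
__global__ void scale_image(float *img, int width, int height, float s) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        img[y * width + x] *= s;
}

// Host-side launch: a 2-D grid of 16x16-thread blocks covering the image.
void scale_on_gpu(float *d_img, int width, int height, float s) {
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x,
              (height + block.y - 1) / block.y);
    scale_image<<<grid, block>>>(d_img, width, height, s);
    cudaDeviceSynchronize();
}
```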

NVIDIA Fermi

Compute capability 2.0 / 2.1 devices
Better double precision
ECC support
Configurable cache hierarchy
Faster context switching
Faster atomic operations
Concurrent kernel execution
Dual DMA engines (see the streams sketch below)
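Concurrent kernel execution and the dual DMA engines are exposed to the programmer through CUDA streams: with pinned host memory and two streams, a copy queued in one stream can overlap a kernel (and a copy in the opposite direction) queued in the other. A hedged sketch; the placeholder kernel, buffer sizes, and stream count are illustrative.

```cuda
#include <cuda_runtime.h>
#include <cstring>

// Placeholder kernel standing in for real work (illustrative only).
__global__ void work(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 22;
    const size_t bytes = n * sizeof(float);

    // Pinned host memory is required for truly asynchronous copies.
    float *h0, *h1, *d0, *d1;
    cudaHostAlloc(&h0, bytes, cudaHostAllocDefault);
    cudaHostAlloc(&h1, bytes, cudaHostAllocDefault);
    memset(h0, 0, bytes);
    memset(h1, 0, bytes);
    cudaMalloc(&d0, bytes);
    cudaMalloc(&d1, bytes);

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // With two streams, the dual DMA engines can move one chunk of data
    // while the SMs execute the kernel that works on the other chunk.
    cudaMemcpyAsync(d0, h0, bytes, cudaMemcpyHostToDevice, s0);
    work<<<(n + 255) / 256, 256, 0, s0>>>(d0, n);
    cudaMemcpyAsync(h0, d0, bytes, cudaMemcpyDeviceToHost, s0);

    cudaMemcpyAsync(d1, h1, bytes, cudaMemcpyHostToDevice, s1);
    work<<<(n + 255) / 256, 256, 0, s1>>>(d1, n);
    cudaMemcpyAsync(h1, d1, bytes, cudaMemcpyDeviceToHost, s1);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);

    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    cudaFreeHost(h0); cudaFreeHost(h1);
    cudaFree(d0); cudaFree(d1);
    return 0;
}
```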

AMD/ATI Radeon HD 5870

Codename “Evergreen”
20 SIMD engines
1600 SIMD cores
L1/L2 memory architecture
153GB/sec memory bandwidth
2.72 TFLOPS SP
OpenCL and DirectX 11
Provides for vectorized operation

AMD Memory System

Distributed memory controller
Optimized for latency hiding and memory access efficiency
GDDR5 memory at 150GB/s
Up to 272 billion 32-bit fetches/second (see the vectorized-load sketch below)
Up to 1 TB/sec L1 texture fetch bandwidth
Up to 435 GB/sec between L1 & L2
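Fetch-rate figures like these reward wide, vectorized accesses. As a hedged illustration of the general idea, written in CUDA for consistency with the rest of this deck (on the Radeon the analogous code would use OpenCL's float4 vector type), processing four values per thread as a float4 turns four 32-bit fetches into a single 128-bit transaction.

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: each thread loads one float4 (a single aligned
// 128-bit transaction) instead of four separate 32-bit fetches, then
// operates on all four lanes, a vectorized access and compute pattern.
__global__ void scale_vec4(float4 *data, int n4, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n4) {
        float4 v = data[i];              // one wide load
        v.x *= s; v.y *= s; v.z *= s; v.w *= s;
        data[i] = v;                     // one wide store
    }
}

// Example launch over n/4 float4 elements (n assumed divisible by 4):
// scale_vec4<<<(n / 4 + 255) / 256, 256>>>(reinterpret_cast<float4 *>(d_data), n / 4, 2.0f);
```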

OpenCL

The future for many-core computing
Open Computing Language
A framework for writing programs that execute on heterogeneous systems
Very similar to CUDA
Presently runs on NVIDIA GPUs, AMD multi-core CPUs/GPUs, Intel CPUs, and some embedded CPUs
Being developed by the Khronos Group, a non-profit
Modeled as four parts:
Platform Model
Execution Model
Memory Model
Programming Model

Big Picture

GPUs in Biomedical Imaging

The biomedical imaging field is moving to extensive use of new 3-D and 4-D imaging technologies to improve patient outcomes
This has created an avalanche of image data
Image reconstruction and image analysis have become major bottlenecks
Accurate image reconstruction requires compute-intensive algorithms
The use of multi-modality imaging (e.g., CT and ultrasound) further exacerbates this problem
GPU computing is playing a large role in addressing these challenges


Experiences with migrating biomedical applications to a GPU

3-D Cardiac CT Imaging
Iterative Least Squares Back Projection (a simplified kernel sketch follows this list)
3-D Breast Cancer Screening
Maximum Likelihood Estimation
Diffuse Optical Tomography
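As a purely illustrative sketch of why reconstruction maps well onto GPUs (this is not the group's actual code, and the geometry is a simplified 2-D parallel-beam model rather than cardiac cone-beam CT or an iterative least-squares update), back projection assigns one thread per output pixel and lets each thread accumulate contributions over all projection angles.

```cuda
#include <cuda_runtime.h>

// Hypothetical, highly simplified back-projection kernel: one thread per
// pixel of a 2-D slice sums the detector bins it projects onto at every
// angle. Real reconstructions add filtering, interpolation, and iteration.
__global__ void backproject(float *image, const float *sinogram,
                            const float *angles, int nx, int ny,
                            int n_angles, int n_bins) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;

    float cx = x - nx / 2.0f, cy = y - ny / 2.0f;
    float acc = 0.0f;
    for (int a = 0; a < n_angles; ++a) {
        // Detector bin hit by this pixel at angle a (parallel-beam model).
        float t = cx * cosf(angles[a]) + cy * sinf(angles[a]);
        int bin = (int)(t + n_bins / 2.0f);
        if (bin >= 0 && bin < n_bins)
            acc += sinogram[a * n_bins + bin];
    }
    image[y * nx + x] = acc;
}

// Example launch: one thread per pixel of an nx x ny slice, e.g.
// dim3 block(16, 16), grid((nx + 15) / 16, (ny + 15) / 16);
// backproject<<<grid, block>>>(d_image, d_sino, d_angles, nx, ny, n_angles, n_bins);
```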


Developing a suite of Biomedical Image Reconstruction Libraries

CUDA/OpenCL
Target applications:
Deformable registration - radiation oncology
3-D iterative reconstruction - cardiovascular imaging
Maximum likelihood estimation - Digital Breast Tomosynthesis
Motion compensation in PET/CT images - cardiovascular imaging
Hyperspectral imaging - skin cancer screening
Image segmentation - brain imaging
$1.3M NSF Award EEC-0946463

A Look at the Future for GPUs

AMD Fusion
CPU/GPU on a single chip
Shared-memory model
Reduces communication overhead

Intel Sandy Bridge
CPU/GPU on a single chip
Less aggressive approach than Fusion



x86 CPU owns the Software World

Windows, MacOS and Linux franchises
Thousands of apps
Established programming and memory model
Mature tool chain
Extensive backward compatibility for applications and OSs
High barrier to entry


GPU Optimized for Modern Workloads

Enormous parallel computing capacity
Outstanding performance-per-watt-per-dollar
Very efficient hardware threading
SIMD architecture well matched to modern workloads: video, audio, graphics

AMD Fusion: The Future for CPU/GPU Computing

PC with only a Discrete GPU
[Diagram: CPU to system memory at ~15GB/sec; CPU to GPU over PCIe at ~12GB/sec; GPU to its on-board device memory at ~150GB/sec.]

PC with a Discrete GPU and an APU
[Diagram: CPU and Fusion GPU share system memory at ~20GB/sec each; a discrete GPU is still reached over PCIe at ~12GB/sec, with ~150GB/sec to its own device memory.]

The sketch below contrasts the explicit staging copies of the discrete-GPU path with zero-copy access to shared host memory.
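The ~12GB/sec PCIe link is the narrow part of the discrete-GPU picture, and it is exactly the traffic a shared-memory APU removes. A hedged CUDA sketch of how a discrete GPU reaches host data today; the kernel, names, and sizes are illustrative, and a Fusion-style APU would simply share one physical memory and skip the PCIe hop.

```cuda
#include <cuda_runtime.h>
#include <cstring>

__global__ void touch(float *p, int n) {   // trivial stand-in for real work
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1.0f;
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);  // allow mapped (zero-copy) host memory

    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Option 1: stage data across PCIe (~12GB/sec) into on-board device
    // memory (~150GB/sec), then copy results back.
    float *h = nullptr, *d = nullptr;
    cudaHostAlloc(&h, bytes, cudaHostAllocDefault);
    memset(h, 0, bytes);
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    touch<<<(n + 255) / 256, 256>>>(d, n);
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);

    // Option 2: zero-copy mapped host memory. The GPU dereferences system
    // memory directly over PCIe, so there is no explicit copy, but every
    // access still crosses the link; an APU avoids the hop entirely.
    float *h_mapped = nullptr, *d_alias = nullptr;
    cudaHostAlloc(&h_mapped, bytes, cudaHostAllocMapped);
    memset(h_mapped, 0, bytes);
    cudaHostGetDevicePointer(&d_alias, h_mapped, 0);
    touch<<<(n + 255) / 256, 256>>>(d_alias, n);
    cudaDeviceSynchronize();

    cudaFree(d);
    cudaFreeHost(h);
    cudaFreeHost(h_mapped);
    return 0;
}
```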


The Northeastern University GPU Crew

Zhongliang Chen: machine vision algorithms (DHS), nanomaterial modeling (MGHPCC), CUDA/OpenCL experience
Rodrigo Dominguez: compiler optimizations, binary translation (PTX on AMD), CUDA/OpenCL experience
Xiang Gong: physics phase field model, CUDA experience
Xiangyu Li: 3-D ultrasound (BK), compounding
Perhaad Mistry: data structure design (AMD), physics-based simulation (Simquest), CUDA/OpenCL experience
Dana Schaa: multi-GPU performance, 3-D ultrasound (BK), tomosynthesis reconstruction (MGH), CUDA/OpenCL experience
Matt Sellitto: hyperspectral imaging (HySpeed), OpenCL experience
Rafael Ubal: GPU modeling and simulation
Ayse Yilmazer: 3-D CT visualization (MGH), OpenCL experience




Summary

GPUs are revolutionizing embedded systems
A number of critical image reconstruction applications have been migrated successfully, with impressive speedups
GPU knowledge and GPU-specific algorithms are key to reaping the benefits of GPU computing
The Northeastern University GPU team is working on many exciting projects
You could be part of this exciting revolution!

And now a word from our sponsors