GPU Computing: Low-cost High Performance Embedded Computing
David Kaeli
Department of Electrical and Computer Engineering
Northeastern University
Boston, MA
One end of the parallel computing spectrum: Multi-core computing
The CPU industry has elected to jump off the cycle-time scaling bandwagon
Power/thermal constraints have become a limiting factor
CPU vendors are placing multiple cores on a single chip
Clock speeds have not changed
The memory wall persists, and multi-core places further pressure on this problem
Software vendors are looking for new parallelization technology
Multi-core aware operating systems
Parallelizing compilers
Programming frameworks
The other end of the spectrum: Graphics Processors
Graphics Processing Units
More than 65% of Americans played a video game in 2010
The global video game market was $51.7B in 2011
High-end GPUs are primarily used for 3-D rendering in video game graphics and movie animation
Manufacturers include NVIDIA, AMD/ATI and IBM (Cell)
Very competitive commodities market
Comparison of CPU and GPU Hardware Architectures
[Figure: CPU vs. GPU, SP GFLOPS and GS/sec. Source: NVIDIA]
Comparison of CPU and GPU Hardware Architectures

CPU/GPU             Single-precision TFLOPs   Cores   GFLOPs/Watt   $/GFLOP
NVIDIA 285          1.06                      240     5.8           $0.09
NVIDIA 295          1.79                      480     6.2           $0.08
NVIDIA 480          1.34                      480     5.4           $0.41
AMD HD 6990         5.10                      3072    11.3          $0.14
AMD HD 5870         2.72                      1600    14.5          $0.16
AMD HD 4890         1.36                      800     7.2           $0.18
Intel Core i7 965   0.051                     4       0.39          $11.02

Source: NVIDIA, AMD and Intel
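
To make the gap in the table concrete, the following is a minimal sketch that expresses each GPU row relative to the Intel CPU row. The rows are copied from the table; the function name `relative_to` and the tuple layout are illustrative choices, not part of the original material.

```python
# Rows copied from the comparison table:
# (name, single-precision TFLOPs, cores, GFLOPs/Watt, $/GFLOP)
TABLE = [
    ("NVIDIA 285", 1.06, 240, 5.8, 0.09),
    ("NVIDIA 295", 1.79, 480, 6.2, 0.08),
    ("NVIDIA 480", 1.34, 480, 5.4, 0.41),
    ("AMD HD 6990", 5.10, 3072, 11.3, 0.14),
    ("AMD HD 5870", 2.72, 1600, 14.5, 0.16),
    ("AMD HD 4890", 1.36, 800, 7.2, 0.18),
    ("Intel Core i7 965", 0.051, 4, 0.39, 11.02),
]


def relative_to(baseline_name, rows=TABLE):
    """Return {name: (throughput_ratio, efficiency_ratio, cost_ratio)}.

    Each ratio is "how many times better than the baseline":
    more TFLOPs, more GFLOPs/Watt, fewer dollars per GFLOP.
    """
    base = next(row for row in rows if row[0] == baseline_name)
    ratios = {}
    for name, tflops, _cores, gflops_per_watt, dollars_per_gflop in rows:
        if name == baseline_name:
            continue
        ratios[name] = (
            tflops / base[1],
            gflops_per_watt / base[3],
            base[4] / dollars_per_gflop,
        )
    return ratios


if __name__ == "__main__":
    for name, (speed, efficiency, cost) in relative_to("Intel Core i7 965").items():
        print(f"{name}: {speed:.0f}x TFLOPs, {efficiency:.0f}x GFLOPs/Watt, {cost:.0f}x cheaper per GFLOP")
```

These are ratios of peak figures only; sustained application performance depends on how well the workload maps to the hardware.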
CPU vs GPU Architectures
CPU
• Irregular data accesses
• More cache + control
• Focus on per-thread performance
GPU
• Regular data accesses
• More ALUs and massively parallel
• Throughput oriented
A wide range of CUDA applications
3D image analysis
Adaptive radiation therapy
Acoustics
Astronomy
Audio
Automobile vision
Bioinformatics
Biological simulation
Broadcast
Cellular automata
Fluid dynamics
Computer vision
Cryptography
CT reconstruction
Data mining
Digital cinema / projections
Electromagnetic simulation
Equity trading
Film
Financial
GIS
Holographic cinema
Intrusion detection
Machine learning
Mathematics research
Military
Mine planning
Molecular dynamics
MRI reconstruction
Multispectral imaging
N-body simulation
Network processing
Neural network
Oceanographic research
Optical inspection
Particle physics
Protein folding
Quantum chemistry
Ray tracing
Radar
Reservoir simulation
Robotic vision / AI
Robotic surgery
Satellite data analysis
Seismic imaging
Surgery simulation
Surveillance
Ultrasound
Video conferencing
Telescope
Video
Visualization
Wireless
X-Ray
NVIDIA architecture
[Figure: the Streaming Processor Array is built from Texture Processor Clusters (TPCs); each TPC holds a Texture Unit and Streaming Multiprocessors (SMs); each SM holds streaming processors (SPs) and special function units (SFUs)]
Grid of thread blocks
Multiple thread blocks,
many warps of threads
Individual threads
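
The grid / thread block / warp / thread hierarchy can be illustrated with a small serial emulation. This is a sketch only: the helper names (`launch`, `global_thread_id`, `vector_add`) are hypothetical, the launch is one-dimensional, and real hardware runs the threads of a warp in lockstep rather than in a Python loop. The warp size of 32 is the value used on NVIDIA GPUs.

```python
WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


def global_thread_id(block_idx, block_dim, thread_idx):
    """Flat index of a thread within the whole grid (1-D launch)."""
    return block_idx * block_dim + thread_idx


def warp_of(thread_idx):
    """Index of the warp, within its block, that a thread belongs to."""
    return thread_idx // WARP_SIZE


def warps_per_block(block_dim):
    """Number of warps needed to cover a block (last warp may be partial)."""
    return (block_dim + WARP_SIZE - 1) // WARP_SIZE


def launch(grid_dim, block_dim, kernel, *args):
    """Serially emulate a 1-D kernel launch: call the kernel once per thread."""
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            kernel(block_idx, block_dim, thread_idx, *args)


def vector_add(block_idx, block_dim, thread_idx, a, b, out):
    """Kernel body: each thread handles one element."""
    i = global_thread_id(block_idx, block_dim, thread_idx)
    if i < len(out):  # guard: the grid may be larger than the data
        out[i] = a[i] + b[i]


if __name__ == "__main__":
    n, block_dim = 1000, 256
    grid_dim = (n + block_dim - 1) // block_dim  # enough blocks to cover n elements
    a, b, out = list(range(n)), list(range(n)), [0] * n
    launch(grid_dim, block_dim, vector_add, a, b, out)
    print(grid_dim, "blocks,", warps_per_block(block_dim), "warps per block")
```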
• 240 shader cores
• 1.4B transistors
• Up to 2GB onboard memory
• ~150GB/sec memory bandwidth
• 1.06 SP TFLOPS
• CUDA and OpenCL support
• Programmable memory spaces
• Tesla S1070 provides 4 GPUs in a 1U unit
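
As a rough check on these figures, peak single-precision throughput can be estimated as cores × clock × floating-point operations per core per cycle. The sketch below assumes a 1.476 GHz shader clock and 3 operations per core per cycle; neither number appears in the slides, and they are used here only because they reproduce the quoted 1.06 TFLOPS. The second function uses only the slide's own numbers to estimate how many operations a kernel must perform per byte of memory traffic before it stops being bandwidth bound.

```python
def peak_gflops(cores, clock_ghz, flops_per_cycle):
    """Peak throughput in GFLOPS: cores x clock (GHz) x ops per core per cycle."""
    return cores * clock_ghz * flops_per_cycle


def flops_per_byte(peak_gflops_value, bandwidth_gb_per_s):
    """Operations per byte of memory traffic needed to keep the ALUs busy."""
    return peak_gflops_value / bandwidth_gb_per_s


if __name__ == "__main__":
    # 240 cores is from the slide; clock and ops/cycle are assumptions (see text).
    print(f"peak: {peak_gflops(240, 1.476, 3):.0f} GFLOPS")
    # 1.06 TFLOPS and ~150 GB/sec are both from the slide.
    print(f"balance point: {flops_per_byte(1060, 150):.1f} flops per byte")
```

A kernel that performs fewer operations per byte than the balance point is limited by memory bandwidth rather than by arithmetic throughput.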
NVIDIA Fermi
Compute 2.0 / 2.1 devices
Better double precision
ECC support
Configurable cache hierarchy
Faster context switching
Faster atomic operations
Concurrent kernel execution
Dual DMA Engines
AMD/ATI Radeon HD 5870
• Codename "Evergreen"
• 20 SIMD Engines
• 1600 SIMD cores
• L1/L2 memory architecture
• 153GB/sec memory bandwidth
• 2.72 TFLOPS SP
• OpenCL and DirectX11
• Provides for vectorized operation
AMD Memory System
Distributed memory controller
Optimized for latency hiding and memory access efficiency
GDDR5 memory at 150GB/s
Up to 272 billion 32-bit fetches/second
Up to 1 TB/sec L1 texture fetch bandwidth
Up to 435 GB/sec between L1 & L2
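
The fetch rate and the L1 bandwidth quoted above are two views of the same number, which a short calculation makes visible. This sketch uses only figures from the slide; the function name is illustrative.

```python
def fetch_bandwidth_gb_per_s(fetches_per_s, bits_per_fetch):
    """Convert a fetch rate into bandwidth in GB/sec (1 GB = 1e9 bytes)."""
    bytes_per_fetch = bits_per_fetch / 8
    return fetches_per_s * bytes_per_fetch / 1e9


# Figures from the slide.
L1_GB_PER_S = fetch_bandwidth_gb_per_s(272e9, 32)  # 272 billion 32-bit fetches/second
L1_L2_GB_PER_S = 435.0                             # between L1 and L2
GDDR5_GB_PER_S = 150.0                             # off-chip GDDR5

if __name__ == "__main__":
    print(f"L1 texture fetch: {L1_GB_PER_S:.0f} GB/sec (about 1 TB/sec)")
    print(f"L1 vs L1-L2 link: {L1_GB_PER_S / L1_L2_GB_PER_S:.1f}x")
    print(f"L1-L2 link vs GDDR5: {L1_L2_GB_PER_S / GDDR5_GB_PER_S:.1f}x")
```

Each level of the hierarchy offers a few times the bandwidth of the level below it, which is why data reuse in the caches matters for sustained throughput.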
OpenCL: The future for many-core computing
Open Computing Language
A framework for writing programs that execute on heterogeneous systems
Very similar to CUDA
Presently runs on NVIDIA GPUs, AMD multi-core CPUs/GPUs, Intel CPUs and some embedded CPUs
Developed by the Khronos Group, a non-profit consortium
Modeled as four parts
• Platform Model
• Execution Model
• Memory Model
• Programming Model
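
The execution model, in particular, is easy to picture with a small enumeration. In OpenCL an index space (NDRange) is divided into work-groups, each containing work-items, and a work-item's global ID in each dimension equals group ID × local size + local ID when no global offset is used. The sketch below is a plain Python illustration of that indexing, not OpenCL host code; the function name `work_items` is hypothetical, and the sketch requires the global size to be a multiple of the local size.

```python
from itertools import product


def work_items(global_size, local_size):
    """Yield (group_id, local_id, global_id) for every work-item in an N-D range.

    global_size and local_size are tuples with one entry per dimension.
    """
    if len(global_size) != len(local_size):
        raise ValueError("global and local sizes need the same number of dimensions")
    if any(g % l for g, l in zip(global_size, local_size)):
        raise ValueError("this sketch requires global size to be a multiple of local size")
    groups_per_dim = [g // l for g, l in zip(global_size, local_size)]
    for group_id in product(*(range(n) for n in groups_per_dim)):
        for local_id in product(*(range(l) for l in local_size)):
            # Same relation as get_global_id(d) with a zero global offset.
            global_id = tuple(
                g * l + i for g, l, i in zip(group_id, local_size, local_id)
            )
            yield group_id, local_id, global_id


if __name__ == "__main__":
    # A 4 x 6 index space split into 2 x 3 work-groups.
    for group_id, local_id, global_id in work_items((4, 6), (2, 3)):
        print(group_id, local_id, global_id)
```

In CUDA terms, the NDRange corresponds to the grid, a work-group to a thread block, and a work-item to a thread.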
Big Picture
GPUs in Biomedical Imaging
The Biomedical Imaging field is moving to extensive use of new 3-D and 4-D imaging technologies to improve patient outcomes
This has created an avalanche of image data
Image reconstruction and image analysis have become major bottlenecks
Accurate image reconstruction requires compute-intensive algorithms
The use of multi-modality imaging (e.g., CT and Ultrasound) further exacerbates this problem
GPU computing is playing a large role in addressing these challenges
Experiences with migrating biomedical applications to a GPU
3-D Cardiac CT Imaging
Iterative Least Squares Back Projection
3-D Breast Cancer Screening
Maximum Likelihood Estimation
Diffuse Optical Tomography
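
To give a feel for why iterative reconstruction is compute-intensive, here is a minimal sketch of a generic iterative least-squares scheme (the Landweber iteration). It is not the algorithm used in the projects listed above, whose details the slides do not give. Each iteration does a forward projection (A·x), compares it with the measurements b, and back-projects the residual (Aᵀ·r). The tiny matrix, the step size and the function names are all illustrative assumptions.

```python
def matvec(matrix, vector):
    """Dense matrix-vector product."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def transpose(matrix):
    """Transpose of a dense matrix stored as a list of rows."""
    return [list(column) for column in zip(*matrix)]


def landweber(A, b, step, iterations):
    """Minimize ||A x - b||^2 with the update x <- x + step * A^T (b - A x).

    Converges when 0 < step < 2 / (largest eigenvalue of A^T A).
    """
    At = transpose(A)
    x = [0.0] * len(A[0])
    for _ in range(iterations):
        projection = matvec(A, x)                               # forward projection
        residual = [bi - pi for bi, pi in zip(b, projection)]   # mismatch with the data
        correction = matvec(At, residual)                       # back projection
        x = [xi + step * ci for xi, ci in zip(x, correction)]
    return x


if __name__ == "__main__":
    # Toy system: 4 "rays" through 3 "voxels"; true image is [1, 2, 3].
    A = [[1, 1, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]]
    b = matvec(A, [1.0, 2.0, 3.0])
    print(landweber(A, b, step=0.25, iterations=200))
```

In a real scanner the matrix has millions of rows and columns, so the two matrix-vector products per iteration dominate the run time; both are regular, data-parallel operations of the kind that map well to a GPU.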
Developing a suite of Biomedical Image Reconstruction Libraries (CUDA/OpenCL)
Target applications:
Deformable registration - radiation oncology
3-D Iterative reconstruction - cardiovascular imaging
Maximum likelihood estimation - Digital Breast Tomosynthesis
Motion compensation in PET/CT images - cardiovascular imaging
Hyperspectral imaging - skin cancer screening
Image segmentation - brain imaging
$1.3M NSF Award EEC-0946463
A Look at the Future for GPUs
AMD Fusion
CPU/GPU on a single chip
Shared-memory model
Reduces communication overhead
Intel Sandy Bridge
CPU/GPU on a single chip
Less aggressive approach than Fusion
x86 CPU owns the Software World
• Windows, MacOS and Linux franchises
• Thousands of apps
• Established programming and memory model
• Mature tool chain
• Extensive backward compatibility for applications and OSs
• High barrier to entry
GPU Optimized for Modern Workloads
• Enormous parallel computing capacity
• Outstanding performance-per-watt-per-dollar
• Very efficient hardware threading
• SIMD architecture well matched to modern workloads: video, audio, graphics
AMD Fusion: The Future for CPU/GPU Computing
PC with only a Discrete GPU
[Diagram labels: Memory, 15GB/sec; PCIe, 12GB/sec; GPU Device Memory, 150GB/sec]
PC with a Discrete GPU and an APU
[Diagram labels: Memory, 20GB/sec (two links); Fusion GPU; PCIe, 12GB/sec; Discrete GPU Device Memory, 150GB/sec]
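
The bandwidth labels in these diagrams explain why communication overhead matters for a discrete GPU. The sketch below uses the 12GB/sec PCIe and 150GB/sec device-memory figures from the diagrams to estimate copy times and to test whether offloading pays off. It assumes the same volume of data is copied to the device and back, with no overlap of copies and computation; the function names and the example timings are hypothetical.

```python
def transfer_seconds(n_bytes, gb_per_s):
    """Time to move n_bytes over a link of the given bandwidth (1 GB = 1e9 bytes)."""
    return n_bytes / (gb_per_s * 1e9)


def offload_pays_off(n_bytes, cpu_seconds, gpu_seconds, pcie_gb_per_s=12.0):
    """True if copy-in + GPU compute + copy-out is faster than running on the CPU.

    Assumes n_bytes go to the device and n_bytes come back, without overlap.
    """
    copy_seconds = 2 * transfer_seconds(n_bytes, pcie_gb_per_s)
    return copy_seconds + gpu_seconds < cpu_seconds


if __name__ == "__main__":
    data = 1.2e9  # 1.2 GB working set (illustrative)
    print(f"over PCIe:          {transfer_seconds(data, 12):.3f} s")
    print(f"from device memory: {transfer_seconds(data, 150):.3f} s")
    print("offload worthwhile:", offload_pays_off(data, cpu_seconds=1.0, gpu_seconds=0.1))
```

With these figures, moving data across PCIe takes 12.5 times longer than reading it from device memory, so a kernel with a modest speedup can lose its advantage to the copies. A shared-memory design such as Fusion removes those copies.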
The Northeastern University GPU Crew
Zhongliang Chen
Machine vision algorithms (DHS)
Nanomaterial modeling (MGHPCC)
CUDA/OpenCL experience
Rodrigo Dominguez
Compiler optimizations
Binary translation (PTX on AMD)
CUDA/OpenCL experience
Xiang Gong
Physics Phase Field Model
CUDA experience
Xiangyu Li
3-D Ultrasound (BK)
Compounding
Perhaad Mistry
Data structure design (AMD)
CUDA/OpenCL experience
Physics-based simulation (Simquest)
Dana Schaa
Multi-GPU performance
CUDA/OpenCL experience
3-D Ultrasound (BK)
Tomosynthesis Recon (MGH)
Matt Sellitto
OpenCL experience
Hyperspectral Imaging (HySpeed)
Rafael Ubal
GPU modeling and simulation
Ayse Yilmazer
3-D CT Visualization (MGH)
OpenCL experience
Summary
GPUs are revolutionizing embedded systems
A number of critical image reconstruction applications have been migrated successfully
Impressive speedups
GPU knowledge and GPU-specific algorithms are key to reaping the benefits of GPU computing
The Northeastern University GPU team is working on many exciting projects
You could be part of this exciting revolution!
And now a word from our sponsors