• Key Note
• Original Purpose and History of GPU
• Why Develop Programs for GPU
• Architecture Difference Between CPU and GPU
• When to Develop Programs for GPU
• Side-Step
• Future of CPU / GPU Computing
• CPU / GPU Computing Hurdles
  – Hardware
  – Parallel Programming Concepts
  – CPU versus GPU Battle
  – Complexity of Current GPU Program APIs & Architecture
  – GPU Program Portability
  – Hybrid/Heterogeneous Computing
• Future Computing Challenges Tree
• Possible Future & Solutions Summary Table
• Breakdown of Parallelism
• Conclusion
• References

GPU Background
• This presentation covers GPU and CPU computing and programming.
• Kernel programming controls the GPU and CPU.
• Hardware abstraction: with higher-level languages, this kernel programming is typically taken care of by the language developers and hidden from the view of high-level programmers.
[Figure: abstraction layers. Labels: Mathematicians; Hardware Programmers & Program Language Developers; High Level Language Programmers; End Software Customer; Hardware Components]
• GPU = Graphics Processing Unit
• Initially designed to accelerate the memory-intensive work of texture mapping and rendering polygons, which would then be displayed on the user's computer screen. [1][2]
• Modern GPUs use most of their transistors to do calculations related to computer graphics. [1]
• In 2006 Nvidia introduced CUDA, which gave programmers access to GPU computing capabilities. [20]
• This evolution has continued to add flexibility to GPU usage. With this new, now fairly accessible computing capability, many engineers and scientists are starting to look into using the GPU for non-graphical calculations.

[Figure: Texture Mapping [3]]
• Speed (probably the sole reason)
  – As graphics, animation, and GUIs become an everyday occurrence in software, the software becomes more and more compute intensive. This makes the user experience slow and arduous.
  – R&D becomes more and more compute intensive.
  – In many machines, the GPU sits idle while the CPU does all the work.
  – GPUs are more efficient than CPUs for certain processes and programs that can take advantage of parallel programming.
  – Once GPU programming languages came along, people began to offload to the GPU work they had once forced the CPU to process.
• CAUTION: Exact Speed Difference: Comparing Apples and Oranges
  – People (companies) have gone to extreme measures to determine which is better and faster: the GPU or the CPU.
  – Unfortunately this is a very unfair comparison, because each has a different purpose.
    • CPU: much broader use; achieves good performance on a wide variety of workloads. CPU cores (the things you run a thread on) are much faster than GPU cores. [6]
    • GPU: very specific use, so the architecture can be maximized for that one use; has dozens of cores compared to the CPU's 4-8 cores.
  – Processes ideal for the GPU have been measured to run from only 2.5x faster (Intel) to 100x faster (Nvidia). [4][5]
                  CPU                                       GPU
Cores             ~4                                        Several dozen
Optimized for     Rapid sequential operations               Parallel/concurrent operations
Transistor use    More for flow control and data caching    More for data processing
Core speed        Faster than GPU (GHz)                     Slower than CPU (MHz)
Take Away
• GPU is a supplement, NOT a replacement, for the CPU
• Our goal as programmers should be to:
  – Make wise decisions about when to take advantage of the GPU's power
  – Help the CPU & GPU work together as efficiently as possible
• Converting a program to take advantage of the GPU is not a simple or cheap task. We therefore need to determine which code would be most efficient on the CPU and which would be more efficient if processed by the GPU.
• Graphics rendering
• Problems expressed as data-parallel computations with high arithmetic intensity (a high ratio of arithmetic operations to memory operations) [7]
• Computationally intensive tasks, ideal for GPU processing
  – Many scientific computing problems
  – Engineering computing problems
  – Simple structured-grid PDE methods in computational finance
  – Physical simulations
  – Matrix algebra
  – Image & volume processing
  – Global illumination
    • Ray tracing, photon mapping, radiosity
  – Non-grid streams (which can be mapped to grids)
  – XML parsing
  – Medical imaging
  – Photography
  – Grid computing
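
The "high arithmetic intensity" criterion in the list above can be made concrete with a rough count of operations. The sketch below is only an illustration: the function names and the idealized operation counts (every operand read from memory once, every result written once) are assumptions, not figures from the cited sources.

```python
def arithmetic_intensity(arithmetic_ops: int, memory_ops: int) -> float:
    """Ratio of arithmetic operations to memory operations."""
    if memory_ops <= 0:
        raise ValueError("memory_ops must be positive")
    return arithmetic_ops / memory_ops


def vector_add_counts(n: int) -> tuple[int, int]:
    # c[i] = a[i] + b[i]: one add, two loads and one store per element.
    return n, 3 * n


def matmul_counts(n: int) -> tuple[int, int]:
    # Naive n x n matrix multiply: roughly n^3 multiplies plus n^3 adds.
    # Idealized memory traffic: read A and B once, write C once (3 * n^2).
    return 2 * n**3, 3 * n**2


if __name__ == "__main__":
    for name, counts in [("vector add", vector_add_counts(1024)),
                         ("matrix multiply", matmul_counts(1024))]:
        print(f"{name}: {arithmetic_intensity(*counts):.2f} arithmetic ops per memory op")
```

Under these assumptions the intensity of vector addition stays constant as n grows, while that of matrix multiplication grows with n, which is consistent with matrix algebra appearing in the list of GPU-friendly workloads.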
Task (function/control) parallelism: Each processor executes a different thread (or process) on the same or different data. The threads can run the same or different code.

Data parallelism (loop-level parallelism) (SIMD): The data is distributed across different parallel computing nodes, and the same task is performed on different pieces of the distributed data. [8]
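
The two kinds of parallelism described above can be sketched side by side. This is a minimal illustration that assumes Python's standard thread pool; the function names are hypothetical, and Python threads show only the structure of each style, not a real speedup for CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def square(x: int) -> int:
    return x * x


def data_parallel(values: list[int], workers: int = 4) -> list[int]:
    # Data parallelism: the same task (square) runs on different pieces of data.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(square, values))


def task_parallel(values: list[int]) -> dict[str, int]:
    # Task parallelism: different tasks (sum, min, max) run at once on the same data.
    tasks = {"sum": sum, "min": min, "max": max}
    with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
        futures = {name: pool.submit(fn, values) for name, fn in tasks.items()}
        return {name: future.result() for name, future in futures.items()}


if __name__ == "__main__":
    print(data_parallel([1, 2, 3, 4]))
    print(task_parallel([3, 1, 2]))
```

The data-parallel form is the one that maps naturally onto a GPU, where many cores apply one operation to many data elements.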
GPUs are finding their way into the following fields [9][10]:
• Database
• Oil Exploration
• Web Search Engines
• Medical Imaging
• Pharmaceutical Design
• Financial Modeling
• Advanced Graphics
• Networked Video Tech
• Collaborative Work Environments
• Heterogeneous/hybrid computing
• Tasks split between GPU and CPU
• Parallel CPU/GPU processing will become the norm in all programs [11]

Do We Really Need to Switch to Heterogeneous Computing?
• Previously (1990s to early 2000s), advances in hardware technology increased performance without the immediate need for change or fundamental restructuring.
  – Hardware is starting to hit a quantum wall and a thermal/power wall, so tasks need to be spread out over several processors.
• Different processor architectures excel in different areas. Why make one architecture style do everything?
• Currently a lot of processor time is wasted. The CPU sits idle while the GPU does its task; the GPU sits idle while the CPU burns itself out trying to do almost everything. [11]
• In the end, GPUs provide a low-cost platform for accelerating high-performance computations. [13]
• Hardware: GPU to CPU data transfer bottleneck
  – The limitation of the heterogeneous computation model is the significant overhead of memory transfers between the host CPU and the GPU. [12]
• Parallel programming concepts
  – Multi-processor chip hardware requires dauntingly complex software that breaks up computing chores into simultaneously processed chunks of code. [21]
• CPU versus GPU battle
• Complexity of current GPU programming languages [13]
  – Fairly complex and error prone at times
  – Optimizing an algorithm for a specific GPU is a time-consuming task that currently requires thorough knowledge of both the algorithm and the hardware. [13]
  – Programmers should not have to concern themselves with intricate details of the hardware.
• Portability of current GPU programming languages [13]
  – GPU code lacks portability: code written for one GPU may not run as efficiently (or at all) on non-native GPU hardware.
  – Much GPU code cannot even be ported efficiently to different generations and/or models of the same GPU brand.
  – There is also a desire for GPU code to be able to fall back and run on CPUs if a GPU is not available; this feature is seen in only a very few GPU APIs.
• Complexity of hybrid optimization
  – An entire thesis has been done on CPU/GPU communication optimization.
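
The data transfer bottleneck in the first hurdle above can be put into rough numbers. The sketch below is a back-of-the-envelope model, not a measurement. It assumes the same number of bytes is copied to the GPU and back, a fixed per-copy latency, and a sustained link bandwidth. All names and parameters are hypothetical.

```python
def offload_time(data_bytes: float, bandwidth_bytes_per_s: float,
                 gpu_compute_s: float, latency_s: float = 0.0) -> float:
    """Estimated total time when work is offloaded: copy in, compute, copy out."""
    if bandwidth_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    # Two transfers: host -> GPU before the computation, GPU -> host after it.
    transfer_s = 2 * (latency_s + data_bytes / bandwidth_bytes_per_s)
    return transfer_s + gpu_compute_s


def worth_offloading(cpu_compute_s: float, data_bytes: float,
                     bandwidth_bytes_per_s: float, gpu_compute_s: float,
                     latency_s: float = 0.0) -> bool:
    """True when the GPU path, including transfers, beats the CPU-only time."""
    return offload_time(data_bytes, bandwidth_bytes_per_s,
                        gpu_compute_s, latency_s) < cpu_compute_s


if __name__ == "__main__":
    # Hypothetical job: 8 GB of data over an 8 GB/s link, GPU kernel takes 0.5 s.
    print(offload_time(8e9, 8e9, 0.5))           # transfers dominate the total
    print(worth_offloading(2.0, 8e9, 8e9, 0.5))  # CPU-only time of 2.0 s
```

In this hypothetical case the GPU kernel is four times faster than the CPU, yet offloading still loses because the two copies cost more than the computation saves.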
• CPU & GPU Hardware Constraints: Moore's Law Continues & Heisenberg Uncertainty Principle Altered
  – Feb 2012: physicists created a working transistor consisting of a single atom (transistors = the things that hold bits, making memory and information storage possible). [15]
  – After single-atom transistors, next will be photo transistors, replacing traces on circuit boards with optical signals.
  – In 2010 IBM and Intel joined forces, investing $4.4 billion in chip technology. [19]
• GPU to CPU data transfer bottleneck (Hardware): Optical guides (IBM and Intel)
  – The limitation of the heterogeneous computation model is the significant overhead of memory transfers between the host CPU and the GPU. [12]
  – Both IBM and Intel are investing money and time in photon data transfer technologies. [17][18]
  – The plan is to replace copper cables and backplanes. Photon data transfer would significantly reduce CPU-to-GPU communication overhead and, hopefully, bring transfer times down to negligible levels. [18]
• Currently, programs written for an architecture with n processors require a rewrite when migrated to an m-processor architecture to benefit from the additional resources. [22]
• Compiler-based parallelization techniques try to automatically find and use partial orders in sequential code but often fail to match manual optimization.
• Where various techniques fall short
  – POSIX: requires the programmer to specify the partial order between program operations in terms of constructs such as threads, locks, and semaphores
  – OpenMP: requires the programmer to specify the code they believe would perform better via parallel processing
  – OpenCL and CUDA: require the user to know the computational platform and to learn the libraries and how to implement them
• Solution: Automatic and Portable Parallel Programming
  – TripleP: uses synthesis at compile time to generate parallel binaries from declarative programs. It abstracts the execution order of the program away from the developer and allows for explicit parallelism without requiring architecture-specific annotations and structures (it determines the best way to parallelize the code). [22]
  – DARPA challenged companies/institutes to develop new parallel languages and programming tools back in 2001. [23]
  – PPmodel: helps separate the sequential and parallel parts of a program into blocks without modifying code. Also supports CUDA (identifies hotspots). [24]
  – MARPLE: helps businesses automatically migrate their legacy software systems to a data-parallel platform such as the Nvidia CUDA GPU. [25]
[Figure labels: Language Developers; parallel program; Software Developers; parallel program design; Software design; Software Developers having to think about parallel breakdown of program]
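
The first point on this slide, that code written for n processors needs a rewrite for m processors, comes from baking the processor count into the program's structure. The sketch below shows one hedged alternative in which the worker count is a runtime parameter. It assumes Python's standard thread pool, and the helper names are hypothetical. It is not how TripleP, PPmodel, or MARPLE work.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def chunk(values: list, parts: int) -> list:
    """Split values into at most `parts` nearly equal contiguous slices."""
    size, extra = divmod(len(values), parts)
    slices, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty slices when parts > len(values)
            slices.append(values[start:end])
        start = end
    return slices


def parallel_sum(values: list, workers=None) -> int:
    # The worker count is decided at run time, so moving from an n-processor
    # machine to an m-processor machine changes a parameter, not the code.
    workers = workers or os.cpu_count() or 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(sum, chunk(values, workers)))


if __name__ == "__main__":
    data = list(range(101))
    print(parallel_sum(data, workers=2), parallel_sum(data, workers=7))
```

The programmer still has to pick the decomposition (contiguous slices, partial sums) by hand, which is the burden the automatic tools listed above aim to remove.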
• Market demands as well as global demands will encourage the progress of technology.
• Mergers (AMD & ATI, 2006)
• Partially unbiased middlemen
  – Vendors such as IBM, Dell, and HP realize they need both GPU and CPU, and help facilitate the creation of heterogeneous systems.
  – Government
    • Nvidia and Intel work with DARPA on an exascale computing project in 2010. [30]
    • Nvidia, Intel, AMD, and Whamcloud work with the Department of Energy on the FastForward exascale computing program, Jul 2012. [26]
• Truly heterogeneous machines may be achievable without an intimate relationship and sharing of proprietary information between CPU and GPU companies.
• Conclusion:
  – We should not expect or hope for the separate companies to play ‘friendly’. There will always be lawsuits and fighting. Our main concern is that their bickering does not impede the overall progress of computing technology, but instead encourages growth.
  – No one disputes the need for heterogeneous computing. The disputes are over who should do what.
[Figure: cartoon, "GPU Hardware and Code Learning". Request: "I need a fast structural analysis tool." Reply: "Okay, that will be about a two-year wait; we have to learn the latest GPU hardware and libraries and write code for that specific GPU (which you must also purchase along with our software). When you upgrade hardware, you must update software to take maximum advantage of the hardware."]
• Solutions to: fairly complex and error prone due to parallel programming
  – Improve ease of parallel programming (see parallel programming solutions).
  – Program readability still needs work. Parallel programs are more difficult for humans to conceptualize, since it is more natural to think in series.
  – Work on creating a higher-level programming abstraction similar to the stream programming model. [13]
  – Programs are far from maximum efficiency when written in object-oriented programming languages (C++ is good; Java and everything else are not as close to maximum efficiency).
• OpenCL (Khronos; initially Apple, 2008) [28]
  – Khronos: ATI Technologies, Discreet, Evans & Sutherland, Intel Corporation, NVIDIA, Silicon Graphics (SGI), and Sun Microsystems. Today the Khronos Group has roughly 100 member companies, over 30 adopters, and twenty-four conforming members.
  – Can be implemented on a number of platforms (including cell phones)
  – When GPU hardware is not present, it can fall back on the CPU to perform the specified work * [28]
  – Supports synchronization over multiple devices
  – Easy to learn
  – Open standard & collaborative effort
  – Shares resources with OpenGL
  – GPUs: Nvidia, ATI, Ivy Bridge & others
• DirectCompute (2009?, Microsoft)
• C++ AMP, builds on DirectCompute (2011, Microsoft)
  – GPUs: Nvidia & ATI
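
The CPU fallback behavior listed for OpenCL above follows a simple pattern: try the preferred device and use the next one if it is missing. The sketch below illustrates only that pattern in plain Python. It is not the OpenCL API, and every name in it (`run_with_fallback`, the backend table, the device labels) is a hypothetical stand-in.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Kernel = Callable[[Sequence[float]], List[float]]


def scale_by_two(data: Sequence[float]) -> List[float]:
    # Stand-in "kernel": the same work is defined once for every device.
    return [2.0 * x for x in data]


def run_with_fallback(backends: Dict[str, Kernel], available: Sequence[str],
                      data: Sequence[float],
                      preference: Sequence[str] = ("gpu", "cpu")) -> Tuple[str, List[float]]:
    """Run the kernel on the first preferred device that is actually present."""
    for device in preference:
        if device in available and device in backends:
            return device, backends[device](data)
    raise RuntimeError("no usable device found")


if __name__ == "__main__":
    backends = {"gpu": scale_by_two, "cpu": scale_by_two}
    # Hypothetical machine without a GPU: the work falls back to the CPU.
    print(run_with_fallback(backends, available=["cpu"], data=[1.0, 2.0]))
```

The point of the pattern is that the calling code does not change when the hardware does, which is the portability property the hurdles slide asks for.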
• Sponge: a compilation framework for Nvidia GPUs using synchronous data flow streaming languages
  – Abstracts hardware details [13]
  – Creates write-once optimized CUDA code for a variety of GPU targets
  – Takes care of GPU-to-host and host-to-GPU communication
  – Also determines which parts of your code (StreamIt program) are better suited for the GPU and which are better suited for the CPU [13]
  – Improved performance by 3.2x compared to the GPU baseline benchmarks, which come from the StreamIt suite
• Software that can support hybrid computing: OpenCL, C++ AMP
• Parallel analyzers to aid in process distribution amongst CPU/GPU
  – All software mentioned in the pages above
Conclusion: We will not be able to get good-grade, dependable, and reliable software that will survive in this environment (new frontier) until many of these challenges have been confronted and the complexities somewhat removed.
NOTE: Solutions in red highlight are also part of the computing challenges
Conclusion: We need to decide where parallelism belongs and how to abstract the process (for the software programmer) as much as possible.
Conclusion: Need better models, guidelines and programs to help
determine where (processor) and how processes run most efficiently
• It would be in our best interest to pursue hybrid computing in order to keep up with market demands.
• The future of research depends heavily on computing power:
  – Space: predicting the future of the planet, the solar system, and the universe
  – Medical: techniques to find cures for cancer and other diseases are being taken out of the lab and designed into computer software
  – Environmental: collecting data on environmental and weather patterns and creating more eco-compatible human habitats
  – Science: aiding in solving complex mathematical computations to make further strides in scientific discoveries
• Information overload needs to be dealt with [31]
  – Increase available space for information
  – Increase focus on massive organization of information
• [1] http://en.wikipedia.org/wiki/Graphics_processing_unit
• [2] http://en.wikipedia.org/wiki/Texture_mapping
• [3] http://www.siggraph.org/education/materials/HyperGraph/mapping/r_wolfe/r_wolfe_mapping_1.htm
• [4] p451-lee.pdf (only 2.5x (Intel) faster)
• [5] http://blogs.nvidia.com/2010/06/gpus-are-only-up-to-14-times-faster-than-cpus-says-intel/
• [6] http://stackoverflow.com/questions/28147/feasability-of-gpu-as-a-cpu
• [7] http://wiki.accelereyes.com/wiki/index.php/Introduction_to_GPU_Computing
• [8] http://software.intel.com/en-us/articles/choose-the-right-threading-model-task-parallel-or-data-parallel-threading/
• [9] the_future_of_Massively_parallel_and_GPU_Computing (pdf)
• [10] https://computing.llnl.gov/tutorials/parallel_comp/
• [11] interact-16-paper-5.pdf
• [12] http://wiki.accelereyes.com/wiki/index.php/Introduction_to_GPU_Computing
• [13] Sponge_Portable_Stream_Programming_on_Graphics_Engines.pdf
• [14] http://www.nature.com/nphys/journal/vaop/ncurrent/full/nphys1734.html (The uncertainty principle in the presence of quantum memory, Nature Physics)
• [15] http://www.sciencedaily.com/releases/2012/02/120219191244.htm
• [16] http://www.sciencedaily.com/releases/2007/08/070826162731.htm
• [17] http://www.intel.com/pressroom/archive/releases/2010/20100727comp_sm.htm
• [18] ibm+opcb+roadmap+and+tech+-+jeff+kash.pdf
• [19] http://news.cnet.com/8301-13924_3-20112553-64/ibm-intel-group-to-invest-$4.4-billion-in-chip-tech/
• [20] http://www.youtube.com/watch?v=Cmh1EHXjJsk
• [21] ManyCore121707.pdf
• [22] p1922-zaraket.pdf
• [23] http://www.economist.com/node/18750706
• [24] p138-jacob.pdf
• [25] p131-sarkar.pdf
• [26] http://www.theverge.com/2012/7/14/3157985/nvidia-intel-amd-department-of-energy-fastforward
• [27] http://www.digitaltrends.com/computing/how-nvidias-kepler-chips-could-end-pcs-and-tablets-as-we-know-them/
• [28] 0112acij09.pdf
• [29] p91-song.pdf
• [30] http://www.informationweek.com/news/government/enterprise-architecture/226700040