The Future of GPU/CPU Computing and Programming



Key Note


Original Purpose and History of GPU


Why Develop Programs for GPU


Architecture Difference Between CPU and GPU


When to Develop Programs for GPU


Side-Step


Future of CPU / GPU computing


CPU / GPU Computing Hurdles


Hardware


Parallel Programming Concepts


CPU versus GPU Battle


Complexity of Current GPU program APIs & Architecture


GPU Program Portability


Hybrid/Heterogeneous Computing


Future Computing Challenges Tree


Possible Future & Solutions Summary Table


Breakdown of Parallelism


Conclusion


References




GPU Background

This presentation covers GPU and CPU computing and programming.


Kernel programming controls the GPU and CPU.


Hardware Abstraction - Typically, with higher-level languages, this kernel programming is taken care of by the language developers and hidden from the view of high-level programmers.
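To make "kernel programming" concrete, here is a minimal sketch of a CUDA kernel and its launch; the kernel name, sizes, and values are illustrative and are not taken from the presentation. This is roughly the level of detail that higher-level languages and libraries hide from high-level programmers.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// A "kernel": code that runs on the GPU, one thread per array element.
__global__ void addVectors(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host (CPU) data.
    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes), *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Device (GPU) data and explicit host-to-device copies.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    // Launch the kernel: grid and block sizes decide how the work maps onto GPU threads.
    int threads = 256, blocks = (n + threads - 1) / threads;
    addVectors<<<blocks, threads>>>(da, db, dc, n);

    // Copy the result back to the host.
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", hc[0]);   // expect 3.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(ha); free(hb); free(hc);
    return 0;
}
```

Compiled with nvcc, the host code allocates GPU memory, copies data to the device, launches the kernel across roughly a million threads, and copies the result back.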




Mathematicians

Hardware Programmers &
Program Language
Developers

High Level Language
Programmers

End Software Customer

Hardware Components

4

4


GPU = Graphical Processing Unit



Initially designed to accelerate the memory-intensive work of texture mapping and rendering polygons, which would then be displayed on the user's computer screen. [1][2]



Modern GPUs use most of their transistors to do calculations
related to computer graphics.
[1]



Back in 2006, Nvidia released CUDA 1.0, which allowed programmers access to GPU computing capabilities. [20]



This evolution has continued to add flexibility to GPU usage. With this new, now somewhat easy-to-access computing capability, many engineers and scientists are starting to look into using the GPU for non-graphical calculations.




Texture Mapping
[3]



Speed (probably the sole reason)


As graphics, animation and GUI interfaces become an everyday occurrence in
software … the software becomes more and more compute intensive

This makes the user experience slow and arduous.


R&D becomes more and more compute intensive


In many machines, the GPU sits idle while the CPU does all the work


GPUs are more efficient than CPUs for certain processes and programs which can take advantage of parallel programming.


Once GPU programming languages came along, people began to offload work they once forced the CPU to process over to the GPU.




CAUTION:

Exact Speed Difference: Comparing Apples and Oranges


People (companies) have gone to extreme measures to determine which is better and faster … the GPU or the CPU.


Unfortunately this is a very unfair comparison, because they each serve different purposes:


CPU:

Much broader use

achieve good performance on a wide variety of workloads

CPU cores (things you run a thread on) are much faster than GPU cores [6]


GPU:

very specific use, so the architecture can be maximized for that one use

has dozens of cores compared to the CPU's 4-8 cores


Processes ideal for the GPU have been measured to run from only 2.5x faster (Intel) to 100x faster (Nvidia). [4][5]




                  CPU                                       GPU
Cores             ~4                                        Several dozen
Optimized for     Rapid sequential operations               Parallel/concurrent operations
Transistor use    More for flow control and data caching    More for data processing
Core speed        Faster than GPU (GHz)                     Slower than CPU (MHz)

Take Away


GPU is a supplement, NOT a replacement, for the CPU


Our goal as programmers should be to:


Make wise decisions as to when to take advantage of the GPU power


Help CPU & GPU work together as efficiently as possible





Converting a program to take advantage of the GPU is not a
simple or cheap task.

Therefore we need to determine which code would be most efficient on the CPU and which would be more efficient if processed by the GPU.





Graphics Rendering


Problems expressed as data-parallel computations with high arithmetic intensity (a high ratio of arithmetic operations to memory operations) [7] (see the sketch after this list)


Computationally intensive tasks, ideal for GPU processing:


Many scientific computing problems


Engineering computing problems


Simple structured grid PDE methods in computational finance


Physical simulations


Matrix Algebra


Image & Volume processing


Global Illumination


Ray tracing, photon mapping, radiosity


Non-grid streams (which can be mapped to grids)


XML parsing


Medical Imaging


Photography


Grid Computing
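As a rough illustration of "high arithmetic intensity" (a hypothetical kernel, not one from the presentation): each element is read once and written once, but put through many arithmetic operations in between, so computation dominates memory traffic and the GPU's many cores stay busy.

```cuda
#include <cuda_runtime.h>

// Hypothetical data-parallel kernel with high arithmetic intensity:
// one load and one store per element, but many arithmetic operations
// in between (an iterated polynomial).
__global__ void iteratePolynomial(const float *in, float *out, int n, int iters) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i];                            // one memory read
    for (int k = 0; k < iters; ++k)             // many arithmetic operations per element
        x = 0.5f * x * x + 0.25f * x + 0.1f;
    out[i] = x;                                 // one memory write
}

int main() {
    const int n = 1 << 20;
    float *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemset(d_in, 0, n * sizeof(float));     // illustrative input: all zeros

    iteratePolynomial<<<(n + 255) / 256, 256>>>(d_in, d_out, n, 1000);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```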






Task (Function/Control) Parallelism

Each processor executes a different
thread (or process) on the same or
different data. The threads can be
the same or different code.




Data-Parallelism (loop-level parallelism) (SIMD)

Distributing the data across different
parallel computing nodes. Perform the
same task on different pieces of
distributed data.




[8]
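A minimal CUDA sketch of the two styles (kernel names, sizes, and the two-stream setup are illustrative assumptions): data parallelism runs the same kernel code across many elements, while task parallelism issues two independent kernels on separate streams so they may execute concurrently.

```cuda
#include <cuda_runtime.h>

// Data parallelism: every thread performs the same operation on a different element.
__global__ void scaleKernel(float *x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

// A second, unrelated task used to illustrate task parallelism.
__global__ void offsetKernel(float *y, int n, float c) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += c;
}

int main() {
    const int n = 1 << 20;
    float *dx, *dy;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMemset(dx, 0, n * sizeof(float));
    cudaMemset(dy, 0, n * sizeof(float));

    int threads = 256, blocks = (n + threads - 1) / threads;

    // Data parallelism (loop-level / SIMD): one kernel, same code, many data elements.
    scaleKernel<<<blocks, threads>>>(dx, n, 2.0f);

    // Task parallelism: two different kernels on different data, issued on
    // separate streams so the GPU may run them concurrently.
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    scaleKernel<<<blocks, threads, 0, s1>>>(dx, n, 0.5f);
    offsetKernel<<<blocks, threads, 0, s2>>>(dy, n, 1.0f);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(dx);
    cudaFree(dy);
    return 0;
}
```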





GPUs are finding their way into the following fields:




[9]




[10]





Database


Oil Exploration


Web Search Engines


Medical Imaging


Pharmaceutical design


Financial Modeling


Advanced Graphics


Networked Video tech


Collaborative Work
Environments






Heterogeneous/hybrid Computing


Tasks split between GPU and CPU


Parallel CPU/GPU processing will become the norm in all programs [11] (a minimal sketch of splitting work between the CPU and GPU follows this slide's bullets)


Do we really need to switch to heterogeneous computing?


Previously (1990s and early 2000s), hardware technology advances allowed increases in performance without the immediate need for change or fundamental restructuring.


Hardware is starting to hit a quantum wall and a thermal/power wall. Need to spread tasks out over several processors.


Different processor architectures excel in different areas. Why
make one architecture style do everything?


Currently there is a lot of wasted processor time. The CPU sits idle while the GPU does its task. The GPU sits idle while the CPU burns itself out trying to do almost everything. [11]



In the end … GPUs provide a low-cost platform for accelerating high-performance computations. [13]
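As referenced above, here is a rough sketch of splitting one job between the CPU and the GPU; the 50/50 split, kernel, and names are made up for illustration. The GPU processes its half while the CPU host thread processes the other half, and the two meet at a synchronization point.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void squareOnGpu(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * x[i];
}

int main() {
    const int n = 1 << 20;
    const int gpuPart = n / 2;                 // illustrative 50/50 split of the work

    float *h = (float *)malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc(&d, gpuPart * sizeof(float));
    cudaMemcpy(d, h, gpuPart * sizeof(float), cudaMemcpyHostToDevice);

    // The GPU works on its half (the kernel launch is asynchronous) ...
    int threads = 256, blocks = (gpuPart + threads - 1) / threads;
    squareOnGpu<<<blocks, threads>>>(d, gpuPart);

    // ... while the CPU works on the other half at the same time.
    for (int i = gpuPart; i < n; ++i) h[i] = h[i] * h[i];

    // Synchronization point: this copy waits for the kernel and gathers the GPU results.
    cudaMemcpy(h, d, gpuPart * sizeof(float), cudaMemcpyDeviceToHost);

    printf("h[1] = %.1f, h[n-1] = %.1f\n", h[1], h[n - 1]);
    cudaFree(d);
    free(h);
    return 0;
}
```

In practice the split ratio would be tuned, or chosen by a runtime/analyzer, based on the relative speed of the two processors and the cost of the transfers.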




Hardware: GPU to CPU data transfer bottleneck


The limitation with the heterogeneous computation model is the significant overhead of memory transfers between the host CPU and the GPU. [12]


Parallel Programming Concepts


Multi-processor chip hardware requires dauntingly complex software that breaks up computing chores into simultaneously processed chunks of code. [21]


CPU versus GPU battle


Complexity of current GPU programming languages [13]


Fairly complex and error-prone at times


Optimizing an algorithm for a specific GPU is a time-consuming task which currently requires thorough knowledge of both the algorithm as well as the hardware [13]


Programmers should not have to concern themselves with intricate details of
the hardware.


Portability of current GPU programming languages [13]


GPU code lacks portability, since code written for one GPU may not run as efficiently (or at all) on non-native GPU hardware.


Much of GPU coding is not even capable of being efficiently ported over to different generations and/or models of the same GPU brand.


There is also a desire for GPU code to be able to fall back and run on CPUs if a GPU is not available … this feature is only seen in a very few GPU APIs (a minimal fall-back sketch follows this list).


Complexity of Hybrid optimization


Entire theses have been written on CPU/GPU communication optimization.
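The CPU fall-back idea mentioned in the portability item above can be hand-rolled roughly as follows; this is only a sketch of the pattern in CUDA/C++ (names are illustrative), not a feature any particular API provides automatically. The program checks for a usable GPU at run time and runs an equivalent CPU loop when none is found.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void doubleOnGpu(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

// Equivalent CPU implementation used when no GPU is available.
static void doubleOnCpu(float *x, int n) {
    for (int i = 0; i < n; ++i) x[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float h[1024];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);

    if (err == cudaSuccess && deviceCount > 0) {
        // A GPU is present: run the kernel there.
        float *d;
        cudaMalloc(&d, n * sizeof(float));
        cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
        doubleOnGpu<<<(n + 255) / 256, 256>>>(d, n);
        cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d);
        printf("Ran on the GPU: h[0] = %.1f\n", h[0]);
    } else {
        // No usable GPU: fall back to the CPU path.
        doubleOnCpu(h, n);
        printf("Fell back to the CPU: h[0] = %.1f\n", h[0]);
    }
    return 0;
}
```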







CPU & GPU Hardware Constraints

Moore's Law Continues & Heisenberg Uncertainty Principle Altered


In Feb 2012, physicists created a working transistor (transistors = the things that hold bits, making memory and information storage possible) consisting of a single atom. [15]


After single-atom transistors, the next step will be photonic transistors, replacing traces on circuit boards with optical signals.


In 2010 IBM and Intel joined forces, investing $4.4 billion in chip technology. [19]



GPU to CPU data transfer bottleneck (Hardware)

Optical guides (IBM and Intel)


The limitation with the heterogeneous computation model is the significant overhead of memory transfers between the host CPU and the GPU [12] (a sketch of how this overhead is commonly reduced in software today follows this slide's bullets).


Both IBM and Intel are investing money and time into photon data
transfer technologies [17][18]


The plan is to replace copper cables and backplanes. Photon data transfer significantly reduces CPU to GPU communication cost & brings transfer times down to hopefully negligible levels.




[18]
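Until such hardware arrives, the transfer overhead mentioned above is usually attacked in software. Below is a minimal CUDA sketch of one common approach (pinned host memory plus asynchronous copies queued on a stream; sizes and names are illustrative): page-locked buffers transfer faster, and the queued copies and kernel can overlap with other work the CPU is doing.

```cuda
#include <cuda_runtime.h>

__global__ void incrementKernel(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Pinned (page-locked) host memory: faster transfers, and required for
    // cudaMemcpyAsync to be truly asynchronous.
    float *h;
    cudaMallocHost(&h, bytes);
    for (int i = 0; i < n; ++i) h[i] = 0.0f;

    float *d;
    cudaMalloc(&d, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Copy in, compute, copy out - all queued on one stream, leaving the CPU
    // free to do other work while the GPU processes the sequence.
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
    incrementKernel<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream);

    cudaStreamSynchronize(stream);   // wait only when the results are actually needed

    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```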



Currently, programs written for an architecture with n processors require a re-write when migrated to an m-processor architecture in order to benefit from the additional resources. [22]


Compiler-based parallelization techniques try to automatically find and use partial orders in sequential code, but often fail to match manual optimization.


Where various techniques fall short


POSIX


requires programmer to specify the partial order between
program operations in terms of constructs such as threads, locks and
semaphores


OpenMP



requires programmer to specify code which they believe
would perform better via parallel processing.


OpenCL and CUDA


require the user to know the computational platform and to learn the libraries and how to implement them


Solution


Automatic and Portable Parallel Programming


TripleP


uses synthesis at compile time to generate parallel binaries from declarative programs. It abstracts the execution order of the program away from the developer and allows for explicit parallelism without requiring architecture-specific annotations and structures (it determines the best way to parallelize the code) [22]


DARPA challenged companies/institutes to develop new parallel languages and programming tools back in 2001. [23]


PPmodel


helps separate out the sequential and parallel parts of a program into blocks without modifying the code. Also supports CUDA (identifies hotspots). [24]


MARPLE


helps businesses automatically migrate their legacy software systems to a data-parallel platform like the Nvidia CUDA GPU [25]

[Diagram: where the parallel breakdown of a program can live - with the language developers (parallel program), with software developers at design time (parallel program design / software design), or with software developers having to think about the parallel breakdown of the program themselves]



Market demands as well as global demands will encourage the
progress of technology.


Mergers (AMD & ATI 2006)


Partially non-biased middle persons


Vendors such as IBM, Dell, and HP realize they need both GPU and CPU, and help facilitate the creation of heterogeneous systems.


Government


Nvidia and Intel worked with DARPA on an exascale computing project in 2010 [30]


Nvidia, Intel, AMD, and Whamcloud began work with the Department of Energy on the FastForward exascale computing program in Jul 2012. [26]


Truly heterogeneous machines may be achievable without an intimate relationship & the sharing of proprietary information between CPU and GPU companies.



Conclusion:


We should not expect or hope for the separate companies to play 'friendly'; there will always be lawsuits and fighting. The main concern for us is that their bickering does not impede the overall progress of computing technology, but instead encourages growth.


No one disputes the need for heterogeneous computing; the disputes are over who should do what.



GPU Hardware and Code Learning

"I need a fast structural analysis tool."

"Okay, that will be about a two-year wait; we have to learn the latest GPU hardware and libraries and write code for that specific GPU (which you must also purchase along with our software). When you upgrade hardware, we must update the software to take maximum advantage of the hardware."



Solutions to: "Fairly complex and error-prone" due to parallel programming


Improve ease of parallel programming (see parallel programming solutions)


Program readability still needs work. It is more difficult for humans to conceptualize, since it is more natural to think in series.


Work on creating a higher level programming abstraction similar to
stream programming model [13]


Far from maximum efficiency when programming in object-oriented programming languages (C++ is good … Java and everything else is not as close to maximum efficiency)





OpenCL (Khronos … initially Apple 2008) [28]


Khronos - ATI Technologies, Discreet, Evans & Sutherland, Intel Corporation, NVIDIA, Silicon Graphics (SGI), and Sun Microsystems. Today the Khronos Group has roughly 100 member companies, over 30 adopters, and twenty-four conforming members.


Can be implemented on a number of platforms (including cell phones)


When GPU hardware is not present it can fall back on CPU
to perform the specified work * [28]


Supports synchronization over multiple devices


Easy to learn


Open standard & Collaborative Effort


Share resources with OpenGL


GPUs: Nvidia, ATI & Ivy Bridge & others


DirectCompute (2009, Microsoft)


C++ AMP … builds on DirectCompute (2011, Microsoft)


GPUs: Nvidia & ATI



Sponge: a compilation framework for Nvidia GPUs using synchronous data-flow streaming languages.


Abstraction of hardware details [13]


Creates write-once, optimized CUDA code for a variety of GPU targets


Takes care of the GPU to host and host to GPU communication


Also determines which parts of your code (StreamIt program) are better suited for the GPU and which are better suited for the CPU [13]


Improved performance 3.2x compared to the GPU baseline benchmarks, which come from the StreamIt suite





Software that can support Hybrid Computing

OpenCL, C++ AMP



Parallel Analyzers to aid in process distribution amongst
CPU/GPU


All software mentioned in the pages above


Conclusion: We will not be able to get good-grade, dependable, and reliable software that will survive in this environment (new frontier) until many of these challenges have been confronted and the complexities somewhat removed.


NOTE: Solutions highlighted in red (in the summary table) are also part of the computing challenges.


Conclusion:

We need to decide where parallelism belongs, and how to abstract the process (for the software programmer) as much as possible.




Conclusion: We need better models, guidelines, and programs to help determine where (on which processor) and how processes run most efficiently.



It would be in our best interest to pursue hybrid computing in order to keep up with market demands.



The future of Research Depends Heavily on Computing Power:


Space: predicting the future of the planet, the solar system, and the universe


Medical: techniques to find cures for cancer and other diseases are being taken out of the lab and designed into computer software


Environmental: collecting data on environmental and weather patterns and creating more eco-compatible human habitats


Science: aiding in solving complex mathematical computations to make further strides in scientific discoveries



Information overload needs to be dealt with [31]


Increase available space for information


Increase focus on massive organization of information



[1] http://en.wikipedia.org/wiki/Graphics_processing_unit


[2] http://en.wikipedia.org/wiki/Texture_mapping


[3] http://www.siggraph.org/education/materials/HyperGraph/mapping/r_wolfe/r_wolfe_mapping_1.htm


[4] p451-lee.pdf - only 2.5x (Intel) faster


[5] http://blogs.nvidia.com/2010/06/gpus-are-only-up-to-14-times-faster-than-cpus-says-intel/


[6] http://stackoverflow.com/questions/28147/feasability-of-gpu-as-a-cpu


[7] http://wiki.accelereyes.com/wiki/index.php/Introduction_to_GPU_Computing


[8] http://software.intel.com/en-us/articles/choose-the-right-threading-model-task-parallel-or-data-parallel-threading/


[9] the_future_of_Massively_parallel_and_GPU_Computing (pdf)


[10] https://computing.llnl.gov/tutorials/parallel_comp/


[11] interact-16-paper-5.pdf


[12] http://wiki.accelereyes.com/wiki/index.php/Introduction_to_GPU_Computing


[13] Sponge_Portable_Stream_Programming_on_Graphics_Engines.pdf


[14] http://www.nature.com/nphys/journal/vaop/ncurrent/full/nphys1734.html - The uncertainty principle in the presence of quantum memory (Nature Physics)


[15] http://www.sciencedaily.com/releases/2012/02/120219191244.htm


[16] http://www.sciencedaily.com/releases/2007/08/070826162731.htm


[17] http://www.intel.com/pressroom/archive/releases/2010/20100727comp_sm.htm


[18] ibm+opcb+roadmap+and+tech+-+jeff+kash.pdf


[19] http://news.cnet.com/8301-13924_3-20112553-64/ibm-intel-group-to-invest-$4.4-billion-in-chip-tech/


[20] http://www.youtube.com/watch?v=Cmh1EHXjJsk


[21] ManyCore121707.pdf


[22] p1922-zaraket.pdf


[23] http://www.economist.com/node/18750706


[24] p138-jacob.pdf


[25] p131-sarkar.pdf


[26] http://www.theverge.com/2012/7/14/3157985/nvidia-intel-amd-department-of-energy-fastforward


[27] http://www.digitaltrends.com/computing/how-nvidias-kepler-chips-could-end-pcs-and-tablets-as-we-know-them/


[28] 0112acij09.pdf


[29] p91-song.pdf


[30] http://www.informationweek.com/news/government/enterprise-architecture/226700040