
Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU

Presented by: Ahmad Lashgar

ECE Department, University of Tehran

Seminar of Parallel Processing. Instructor: Dr. Fakhraie

29 Dec 11

ISCA 2010

Original authors: Victor W. Lee et al., Intel Corporation

1

Some slides are taken from the original paper, for educational purposes only

Abstract


Is the GPU the silver bullet of parallel computing?


How large is the gap between peak and achievable performance?


2

Overview


Abstract


Architecture


CPU: Intel Core i7


GPU: NVIDIA GTX280


Implications for throughput computing applications


Methodology


Results


Analyzing the results


Platform optimization guides


Conclusion

3

Architecture (1)


Intel Core i7-960

4-core, 3.2 GHz

2-way multi-threading

4-wide SIMD

L1 32KB, L2 256KB, L3 8MB

32 GB/s memory bandwidth

4

[DIXON’2010]

Architecture (2)


NVIDIA GTX280

30 cores, 1.3 GHz

1024-way multi-threading

8-way SIMD

16KB software-managed cache (shared memory) per core

141 GB/s memory bandwidth

5

[LINDHOLM’2008]

Architecture (3)

                          Core i7-960      GTX280
Cores                     4                30
Frequency (GHz)           3.2              1.3
Transistors               0.7B (263 mm2)   1.4B (576 mm2)
Memory bandwidth (GB/s)   32               141
SP SIMD width             4                8
DP SIMD width             2                1
Peak SP scalar GFLOPS     25.6             116.6
Peak SP SIMD GFLOPS       102.4            311.1 (933.1)
Peak DP SIMD GFLOPS       51.2             77.8

Note: some figures (shown in red in the original slides) are not the authors' numbers.
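As a sanity check on the peak numbers (my arithmetic, not the slide's): the Core i7-960's SP SIMD peak is 4 cores × 3.2 GHz × 4-wide SSE × 2 flops/cycle (multiply + add) = 102.4 GFLOPS, and the GTX280's parenthesized 933.1 GFLOPS works out as 30 SMs × 8 SPs × 1.296 GHz (the actual shader clock, rounded to 1.3 in the table) × 3 flops/cycle (dual-issued MAD + MUL).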

6

Implications for throughput computing applications

1. Difference in number of cores

2. Cache size / multi-threading

3. Bandwidth difference

7

1. Difference in number of cores

It is all about core complexity:

The common goal: improving pipeline efficiency

CPU goal: single-thread performance

Exploiting ILP

Sophisticated branch prediction

Multiple-issue logic

GPU goal: throughput

Interleaving hundreds of threads

8

2. Cache size / multi-threading

CPU goal: reducing memory latency

Programmer-transparent data caching

Increasing the cache size to capture the working set

Prefetching (HW/SW)

GPU goal: hiding memory latency

Interleave the execution of hundreds of threads so they hide each other's latency

Notice the crossover:

CPU also uses multi-threading (SMT) for latency hiding

GPU also uses software-controlled caching (shared memory) to reduce memory latency, as in the sketch below
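A minimal sketch of that software-controlled caching, assuming a 1D three-point average as a stand-in workload (the kernel and its names are illustrative, not from the paper): each block stages its slice of the input into shared memory once, then every thread reuses it from on-chip storage.

// Launch with 256 threads per block; for simplicity, assume n is a multiple of 256.
// tile holds the block's slice plus a one-element halo on each side.
__global__ void blur3(const float *in, float *out, int n) {
    __shared__ float tile[256 + 2];
    int g = blockIdx.x * blockDim.x + threadIdx.x;    // global index
    int l = threadIdx.x + 1;                          // local index, offset past the left halo

    if (g < n) tile[l] = in[g];                       // one DRAM read per element
    if (threadIdx.x == 0)
        tile[0] = (g > 0) ? in[g - 1] : 0.0f;         // left halo
    if (threadIdx.x == blockDim.x - 1)
        tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f; // right halo
    __syncthreads();                                  // the tile is now "cached" on-chip

    if (g < n)                                        // each element is read once, used three times
        out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}

Without this staging, every output element would issue three DRAM reads; with it, each input element crosses the memory bus once.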

9

3. Bandwidth difference

Bandwidth versus latency

CPU goal: single-thread performance

Workloads do not demand many memory accesses

Bring the data in as soon as possible (minimize latency)

GPU goal: throughput

There are lots of memory accesses; provide high bandwidth

No matter the latency, the cores will hide it!

10

Methodology (1)


Hardware


Intel Core i7-960, 6GB DRAM, GTX280 1GB


Software


SUSE Enterprise 11


CUDA Toolkit 2.3

11

Methodology (2)


Optimizations


On CPU:


SGEMM, SpMV and FFT from Intel MKL 10


Always 2 threads per core


On GPU:


Best possible algorithm for SpMV, FFT and MC


Often 128 to 256 threads per core (to balance shared-memory and register-file usage)

Interleaving GPU execution with host-device (HD) / device-host (DH) memory transfers where possible (see the sketch below)
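As a rough illustration of that interleaving (a sketch under assumed names like process and CHUNKS, not the authors' code): the input is split into chunks, and each chunk's HD copy, kernel launch, and DH copy go on their own CUDA stream so transfers of one chunk overlap computation of another.

#include <cuda_runtime.h>

__global__ void process(float *d, int n) {            // stand-in for real work
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int N = 1 << 22, CHUNKS = 4, chunk = N / CHUNKS;
    float *h, *d;
    cudaMallocHost(&h, N * sizeof(float));             // pinned host memory: required for async copies
    cudaMalloc(&d, N * sizeof(float));

    cudaStream_t s[CHUNKS];
    for (int c = 0; c < CHUNKS; ++c) cudaStreamCreate(&s[c]);

    for (int c = 0; c < CHUNKS; ++c) {
        int off = c * chunk;
        cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s[c]);          // HD copy
        process<<<(chunk + 255) / 256, 256, 0, s[c]>>>(d + off, chunk);
        cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s[c]);          // DH copy
    }
    cudaDeviceSynchronize();                           // wait for all streams to drain

    for (int c = 0; c < CHUNKS; ++c) cudaStreamDestroy(s[c]);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}

On a GTX280, with its single copy engine, only one copy direction overlaps the kernels at a time, but transfer time is still largely hidden behind computation.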

12

Results


The HD/DH data transfer time is not considered


The GPU is only 2.5X faster on average

Far from the 100X reported by previous studies
13

Where did the speedups of previous studies come from?!

Which CPU and GPU were compared?

How much optimization was performed on the CPU and on the GPU?

Where both platforms were optimized, much lower speedups were reported (as in this paper)

14

Analyzing the results (1)

1. Bandwidth

2. Compute flops (single precision)

3. Compute flops (double precision)

4. Reduction and synchronization

5. Fixed function


15

Analyzing the results (2)

1. Bandwidth

Peak ratio: GTX280 / Core i7-960 ~ 4.7X

Feature: large working set; performance is bounded by bandwidth

Examples (see the SAXPY sketch below)

SAXPY (5.3X)

LBM (5X)

SpMV (1.9X)

CPU benefits from caching
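SAXPY makes the bandwidth argument concrete. The kernel below is the standard operation (a sketch, not the paper's exact code): each element costs 2 flops but roughly 12 bytes of DRAM traffic, so throughput is pinned to memory bandwidth and the measured 5.3X tracks the ~4.7X bandwidth ratio rather than the compute ratio.

__global__ void saxpy(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];   // 2 flops per ~12 bytes: read x, read y, write y
}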

16

Analyzing the results (3)

2. Compute flops (single precision)

Peak ratio: GTX280 / Core i7-960 ~ 3X

Feature: bounded by computation; benefits from more cores

Examples

SGEMM, Conv, and FFT (2.8-4X)

17

Analyzing the results (4)

3. Compute flops (double precision)

Peak ratio: GTX280 / Core i7-960 ~ 1.5X

Feature: bounded by computation; benefits from more cores

Examples

MC (1.8X)

Blitz (5X)

Uses transcendental operations

Sort (1.25X slower)

Due to reduced SIMD width usage

Depends on scalar performance

18

Analyzing the results (5)

4. Reduction and synchronization

Feature: the more threads, the higher the synchronization overhead

Examples (see the sketch below)

Hist (1.8X)

On CPU, 28% of the time is spent on atomic operations

On GPU, the atomic operations are much slower

Solv (1.9X slower)

Multiple kernel launches to preserve cache coherency on GPU
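The contended-update pattern behind Hist looks roughly like the sketch below (an illustrative 256-bin kernel, not the paper's code): every thread funnels its increment through an atomic, so synchronization cost, not flops or bandwidth, sets the pace on both platforms.

__global__ void hist256(const unsigned char *in, int n, unsigned int *bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[in[i]], 1u);   // global-memory atomic: contending threads serialize
}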

19

Analyzing the results (6)

5. Fixed function

Feature: interpolation, texturing, and transcendental operations are a bonus on GPU

Examples (see the sketch below)

Bilat (5.7X)

On CPU, 66% of the time is spent on transcendental operations

GJK (14.9X)

Uses texture lookup
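To see why transcendentals are a "bonus", consider a hypothetical Gaussian-weight kernel in the spirit of Bilat (not the paper's code): the GPU's special function units evaluate it with a single hardware intrinsic, where a CPU spends a long scalar instruction sequence per call.

__global__ void gauss_weight(const float *d2, float *w, int n, float inv2sig2) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        w[i] = __expf(-d2[i] * inv2sig2);   // one SFU instruction (slightly reduced precision)
}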

20

Platform optimization guides


CPU programmers have relied heavily on increasing clock frequency

Their applications do not benefit from TLP and DLP

Today's CPUs use wider SIMD, which sits idle unless exploited by the programmer (or compiler)

This paper showed that careful multi-threading can shrink the gap dramatically

For LBM, from 114X down to 5X

Let's learn some optimization tips from the authors

21

CPU optimization


Scalability (4X):

Scale the kernel with the number of threads

Blocking (5X):

Be aware of the cache hierarchy and use it efficiently (see the sketch below)

Regularizing (1.5X):

Align the data regularly to take advantage of SIMD
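A host-side sketch of the blocking tip (plain C++; the matrix size, tile size, and zero-initialized C are my assumptions, not the authors' code): multiply B×B tiles so each loaded element is reused B times while it is still cache-resident.

const int N = 1024;   // matrix dimension (illustrative)
const int B = 64;     // tile size: three 64x64 float tiles (~48KB) fit in the 256KB L2

// C += A * Bm, tile by tile; C is assumed zero-initialized.
void matmul_blocked(const float *A, const float *Bm, float *C) {
    for (int ii = 0; ii < N; ii += B)
        for (int kk = 0; kk < N; kk += B)
            for (int jj = 0; jj < N; jj += B)
                for (int i = ii; i < ii + B; ++i)
                    for (int k = kk; k < kk + B; ++k) {
                        float a = A[i * N + k];                 // stays in a register
                        for (int j = jj; j < jj + B; ++j)
                            C[i * N + j] += a * Bm[k * N + j];  // tile rows stay cache-hot
                    }
}

The unit-stride inner loop also lines up with the "Regularizing" tip: contiguous, aligned data lets the compiler vectorize it with SSE.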


22

GPU optimization


Global synchronization

Reduce the atomic operations

Shared memory

Use shared memory to reduce off-chip bandwidth demand

Shared memory is multi-banked and is efficient for gather/scatter operations (see the sketch below)
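A sketch combining both tips (illustrative kernel, not the paper's code; shared-memory atomics need compute capability 1.2+, which the GTX280 has): privatize the histogram into multi-banked shared memory so most atomics stay on-chip, then flush each block's counts with far fewer global atomics. Compare with the global-atomic version on slide 19.

__global__ void hist256_shared(const unsigned char *in, int n, unsigned int *bins) {
    __shared__ unsigned int local[256];                 // per-block private histogram
    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        local[b] = 0;
    __syncthreads();

    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)                   // grid-stride loop over the input
        atomicAdd(&local[in[i]], 1u);                   // on-chip atomic: much cheaper
    __syncthreads();

    for (int b = threadIdx.x; b < 256; b += blockDim.x)
        atomicAdd(&bins[b], local[b]);                  // one global atomic per bin per block
}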


23

Conclusion


This work analyzed the performance of important throughput computing kernels on CPU and GPU

The gap is much smaller than previous reports (~2.5X)

Recommendations for a throughput computing architecture:

High compute throughput

High bandwidth

Large caches

Gather/scatter support

Efficient synchronization

Fixed-function units


24





Thank you for your attention.

Any questions?

25

References

[LEE’2010] V. W. Lee et al., "Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU," ISCA 2010.

[DIXON’2010] M. Dixon et al., "The Next-Generation Intel® Core™ Microarchitecture," Intel Technology Journal, Volume 14, Issue 3, 2010.

[LINDHOLM’2008] E. Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro, 2008.


26