Attaining High Performance in General-Purpose Computations on Current Graphics Processors

Francisco Igual, Rafael Mayo, Enrique S. Quintana-Ortí
Departamento de Ingeniería y Ciencia de los Computadores
Universitat Jaume I
Castellón (Spain)
Motivation (I)
GPUs have become the first widely available HPC platform
Successfully used in linear algebra, image processing, data mining, simulations...
Recently introduced advances:
Hardware level: Unified Architecture
Software level: CUDA
Peak performance up to 10x (Core 2 Duo vs. NVIDIA 8800)
Is the use of GPUs always beneficial?
Motivation (II)
GPUs are special-purpose processors
The GPU architecture is focused on graphics performance
Not every general-purpose algorithm will fit well on the GPU
The newest GPU models introduce interesting features from the GPGPU point of view
Which algorithms fit best on the graphics processor?
Is using the GPU always better?
What is the impact of the improvements in the latest GPU generations?
Outline
1. Introduction
2. Classical GPU architecture
3. Unified GPU architecture
4. Experimental setup
5. Results on classical GPUs
6. Results on unified GPUs
7. Conclusions
Introduction
Introduction and goals
Methodology
Microbenchmarks from BLAS and image processing
Implement the benchmarks on the CPU and on both generations of GPUs
Use only optimized benchmarks on both architectures
Goals
Extract the features of GPU-friendly algorithms through the evaluation of different routines
Compare both generations of GPU architectures
Quantify the impact of data transfers on both generations
Decide which algorithms are CPU-like or GPU-like
Classical GPU architecture
Classical GPU pipeline
Each stage of the pipeline works with a different data type
Each stage is computed on a different type of processor (vertex processors and fragment processors)
But the processors are programmable...
Classical GPU architecture
Classical GPU overview
Programmability
Both vertex processors and fragment processors are programmable
Usually the fragment processors (FPs) are the ones programmed
The replication of FPs fits well with data-parallel routines
SIMD architecture
Disadvantages
No memory hierarchy at all
Only one memory address can be written per fragment
Hard to program (OpenGL + Cg)
Unified GPU architecture
Unified GPU shader
One unified shader type with multiple functions
Works with any type of graphical data
GPGPU: all programmable shaders are now available
Unified GPU architecture
Unified GPU architecture:G80
NVIDIA G80
Main unified implementation
Up to 128 cores in the unified shader
Introduction of memory hierarchy
Higher clock frequency in the shader
More flexibility in reading/writing to memory
CUDA library
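
As a rough illustration of the new on-chip memory hierarchy, the following CUDA sketch stages data in per-block shared memory before computing on it (the kernel and its names are hypothetical, written only to show the G80 feature):

    // Illustrative kernel: each 256-thread block stages a tile of the
    // input in on-chip shared memory (new in G80), so later accesses
    // hit fast per-block storage instead of device memory.
    __global__ void tile_scale(const float *in, float *out, float s, int n)
    {
        __shared__ float tile[256];           // on-chip, shared per block
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            tile[threadIdx.x] = in[i];        // global -> shared
        __syncthreads();                      // wait until the tile is loaded
        if (i < n)
            out[i] = s * tile[threadIdx.x];   // compute from shared memory
    }

Launch with blockDim.x = 256 to match the tile size; the classical architecture offered no equivalent of this user-managed storage.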
Experimental setup
Selection of benchmarks
Features to be evaluated:
1. Data parallelism
2. Input data reutilization
3. Computational intensity per stream element
Routines to be evaluated:
SGEMM: C = αAB + βC
SGEMV: y = αAx + βy
SAXPY: y = αx + y
SSCAL: y = αy
2D Convolution
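
As a reference for how these stream-oriented routines map onto a data-parallel processor, here is a minimal CUDA sketch of SAXPY (the kernel name and launch configuration are illustrative, not the CUBLAS implementation):

    // Minimal CUDA sketch of SAXPY: y = alpha*x + y.
    // One thread per stream element; note there is no input data
    // reutilization and only 2 flops per element.
    __global__ void saxpy_kernel(int n, float alpha,
                                 const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = alpha * x[i] + y[i];
    }

    // Launch with one thread per element, e.g. with 256-thread blocks:
    // saxpy_kernel<<<(n + 255) / 256, 256>>>(n, alpha, d_x, d_y);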
Experimental setup
Features of the benchmarks
Selected routines:

Routine       Type     Data parallelism   Input reutilization   Arithmetic intensity
SGEMM         BLAS 3   High               High                  High
SGEMV         BLAS 2   High               Medium                Medium
SAXPY         BLAS 1   High               None                  Low
SSCAL         BLAS 1   High               None                  Lowest
Convolution   Image    High               Low                   Medium

Hardware setup:
Classical architecture: AMD Athlon 2400+ with an NVIDIA GeForce 6200
Unified architecture: Intel Core 2 Duo (1.86 GHz) with an NVIDIA GeForce 8800 Ultra
Libraries: GotoBLAS / CUBLAS 1.0
Results on classical GPUs
Results on classical GPUs: SGEMM

[Figure: Test SGEMM. MFLOPS vs. matrix dimension (200 to 2000) for CPU, GPU, and GPU with transfers (GPU w. TX)]
Data reutilization impact
CPU outperforms GPU, with a speedup of up to 4x
Impact of the cache hierarchy on the CPU
Data transfers are not significant
Input data reutilization seems to be a key factor on the GPU
SGEMV exhibits less input data reutilization...
Results on classical GPUs
Results on classical GPUs: SGEMV

[Figure: Test SGEMV. MFLOPS vs. vector dimension (0 to 5000) for CPU, GPU, and GPU with transfers (GPU w. TX)]
Data reutilization impact
CPU outperforms GPU, but only by up to 2x
Data transfers are more significant
Input data reutilization is a key factor on the GPU
Results on classical GPUs
Results on classical GPUs: SAXPY and SSCAL
Arithmetic intensity

[Figures: Test SAXPY and Test SSCAL. MFLOPS vs. vector dimension (up to 5000) for CPU, GPU, and GPU with transfers (GPU w. TX)]
SAXPY and SSCAL are stream-oriented operations
They should therefore attain high performance on GPUs
However, the CPU outperforms the GPU for these operations
Arithmetic intensity per stream element is another key factor
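
A back-of-the-envelope count (ours, not taken from the measurements) shows why. In single precision:

    SAXPY: 2 flops per element vs. 3 floats (12 bytes) moved   ->  about 0.17 flops/byte, constant
    SGEMM: 2n^3 flops vs. 3n^2 floats (12n^2 bytes) moved      ->  about n/6 flops/byte, growing with n

With such a low, fixed ratio, SAXPY and SSCAL are bound by memory and bus bandwidth rather than by the arithmetic units.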
Results on classical GPUs
Results on classical GPUs: Convolution

[Figure: Test Convolution, 512x512 image. Time (ms) vs. filter width (2 to 16) for CPU, GPU, GPU4, and GPU4 with transfers (GPU4 w. TX)]
Convolution performance
Attained performance similar to that of the CPU
An implementation optimized for the vector capabilities of the GPU (GPU4) attains a 4x speedup
Required features on the GPU
High data parallelism
Low input data reutilization
High arithmetic intensity per stream element
Results on unified GPUs
Results on unified GPUs: SGEMM

[Figure: Test SGEMM (CUDA). GFLOPS vs. matrix dimension (200 to 2000) for CPU, GPU, and GPU with transfers (GPU w. TX)]
Data reutilization impact
GPU outperforms CPU, with a speedup of up to 10x
However, this is only about 20% of the peak performance of the GPU
Data transfers are more significant than on previous generations
Input data reutilization is still important on the GPU, but less so than on previous generations
Reason: a more sophisticated memory hierarchy
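
For reference, a minimal sketch of how an SGEMM measurement including transfers (the "GPU w. TX" curve) might look with the legacy CUBLAS 1.0 interface; error checking is omitted and the wrapper function is ours:

    #include <cublas.h>   // legacy CUBLAS interface

    // Sketch: C = alpha*A*B + beta*C on the GPU for n x n matrices.
    // Assumes cublasInit() has been called. The cublasSetMatrix and
    // cublasGetMatrix calls are the CPU<->GPU transfers whose cost
    // the "GPU w. TX" curves include.
    void gpu_sgemm(int n, float alpha, const float *A,
                   const float *B, float beta, float *C)
    {
        float *dA, *dB, *dC;
        cublasAlloc(n * n, sizeof(float), (void **)&dA);
        cublasAlloc(n * n, sizeof(float), (void **)&dB);
        cublasAlloc(n * n, sizeof(float), (void **)&dC);

        cublasSetMatrix(n, n, sizeof(float), A, n, dA, n);   // CPU -> GPU
        cublasSetMatrix(n, n, sizeof(float), B, n, dB, n);
        cublasSetMatrix(n, n, sizeof(float), C, n, dC, n);

        cublasSgemm('N', 'N', n, n, n, alpha, dA, n, dB, n, beta, dC, n);

        cublasGetMatrix(n, n, sizeof(float), dC, n, C, n);   // GPU -> CPU

        cublasFree(dA); cublasFree(dB); cublasFree(dC);
    }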
Results on unified GPUs
Results on unified GPUs: SGEMV

[Figure: Test SGEMV (CUDA). GFLOPS vs. vector dimension (0 to 5000) for CPU, GPU, and GPU with transfers (GPU w. TX)]
Data reutilization impact
GPU outperforms CPU for big matrices
Impact of the cache hierarchy on the CPU: faster for small streams of data
Data transfers are very significant
The data transfer stage is now a very important factor
Compute power grew by about 20x from the GeForce 6 to the G80, but bus bandwidth only by about 2x from AGP to PCI Express
Modern GPUs work better with big streams of data
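
As a sketch of how the transfer cost can be isolated, the CUDA runtime's event API can time a host-to-device copy on its own (the helper below is ours, not part of the benchmarks):

    #include <cuda_runtime.h>

    // Returns the time in ms spent copying n floats from host to device,
    // measured with CUDA events so that only the transfer is timed.
    float time_transfer(const float *h_x, float *d_x, int n)
    {
        cudaEvent_t start, stop;
        float ms = 0.0f;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start, 0);
        cudaMemcpy(d_x, h_x, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);          // wait for the copy to finish

        cudaEventElapsedTime(&ms, start, stop);
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return ms;
    }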
Results on unified GPUs
Results on unified GPUs: SAXPY and SSCAL
Arithmetic intensity

[Figures: Test SAXPY and Test SSCAL (CUDA). GFLOPS vs. vector dimension (0 to 5000) for CPU, GPU, and GPU with transfers (GPU w. TX)]
However, arithmetic intensity per element is still a key factor
The CPU outperforms the GPU for these operations
Results on unified GPUs
Results on unified GPUs: Convolution

[Figure: Test Convolution (CUDA), 512x512 image. GFLOPS vs. filter width (2 to 16) for CPU, GPU, and GPU with transfers (GPU w. TX)]
Convolution performance
Attained the highest speedup of all the benchmarks
Also the highest data transfer impact
Required features on unified GPUs (a kernel sketch follows this list)
High data parallelism, low reutilization, high arithmetic intensity
Low transfer/computation ratio
Big data streams
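
A naive CUDA sketch of the 2D convolution, one thread per output pixel (the kernel and its border policy are illustrative, not the benchmarked implementation):

    // Each thread computes one output pixel of a w x h image filtered
    // with a k x k mask; every pixel performs k*k multiply-adds, so the
    // arithmetic intensity per stream element grows with the filter width k.
    __global__ void conv2d(const float *img, const float *filt, float *out,
                           int w, int h, int k)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= w || y >= h) return;

        int r = k / 2;
        float acc = 0.0f;
        for (int j = -r; j <= r; j++)
            for (int i = -r; i <= r; i++) {
                int xi = min(max(x + i, 0), w - 1);   // clamp at the borders
                int yj = min(max(y + j, 0), h - 1);
                acc += img[yj * w + xi] * filt[(j + r) * k + (i + r)];
            }
        out[y * w + x] = acc;
    }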
Conclusions
Conclusions
The GPU is an interesting platform for HPC
However, not every algorithm can extract its full performance
Desirable features:
High data parallelism
Low input data reutilization
High arithmetic intensity
Operation on big streams of data
It is necessary to design algorithms that minimize data transfers
Conclusions
Thank you...
For more information...
http://www3.uji.es/~figual
figual@icc.uji.es