The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware
Nico Galoppo, Naga K. Govindaraju, Michael Henson, Dinesh Manocha
http://gamma.cs.unc.edu/LU-GPU
Goals
Demonstrate advantages of mapping linear algebra routines to graphics hardware:
Performance
Growth rate
LAPACK-compliant set of linear algebra algorithms on graphics hardware
Outline
LU Decomposition & Related Work
The potential of GPUs
LU-GPU algorithm
Results
Conclusions & Ongoing Work
LU decomposition
Sequence of row eliminations:
Scale and add: A(i,j) = A(i,j) − A(i,k) · A(k,j)
Input data mapping: 2 distinct memory regions
No data dependencies within a row elimination
Pivoting: pointer swap vs. data copy
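The scale-and-add update above drives the whole factorization; a minimal NumPy sketch of the row eliminations (a CPU-side illustration without pivoting, not the GPU implementation) looks like:

```python
import numpy as np

def lu_in_place(A):
    """Doolittle LU without pivoting, via repeated row eliminations.

    After the loop, the strict lower triangle of A holds the multipliers
    (L with an implied unit diagonal) and the upper triangle holds U.
    """
    n = A.shape[0]
    for k in range(n - 1):
        A[k+1:, k] /= A[k, k]                               # multipliers A(i,k) / A(k,k)
        # scale and add: A(i,j) = A(i,j) - A(i,k) * A(k,j)
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return A
```

Note that each update of the trailing submatrix touches independent elements, which is the "no data dependencies within a row elimination" property the GPU exploits.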
LU decomposition
Theoretical complexity (partial pivoting): (2/3)n³ + O(n²) flops
Performance depends on the architecture:
Order of operations
Memory access (latency)
Memory bandwidth
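The leading (2/3)n³ term follows by summing the ≈ 2(n−k)² multiply-add flops of the k-th elimination step:

```latex
\sum_{k=1}^{n-1} 2(n-k)^2 \;=\; 2\sum_{m=1}^{n-1} m^2 \;=\; \frac{(n-1)\,n\,(2n-1)}{3} \;=\; \tfrac{2}{3}n^3 + O(n^2)
```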
Outline
LU Decomposition & Related Work
The potential of GPUs
LU-GPU algorithm
Results
Conclusions & Ongoing Work
Commodity CPUs
LINPACK Benchmark:
Intel Pentium 4, 3.06 GHz: 2.88 GFLOPs
(Jack Dongarra, Oct 2005)
Streaming architectures
Specialized hardware
High bandwidth/compute ratio
Merrimac [Erez04]
Molecular modeling: 38 GFLOPs vs. 2.7 GFLOPs (P4)
$1,000/node
Imagine [Ahn04]
10.46 GFLOPs on QR decomposition
Research architectures, not commodity hardware
Outline
LU Decomposition & Related Work
The potential of GPUs
LU-GPU algorithm
Results
Conclusions & Ongoing Work
CPU vs. GPU
Pentium EE 840: 3.2 GHz dual core, 230M transistors, 90nm process, 206 mm², 2 × 1 MB cache, 25.6 GFLOPs, price $1,040
GeForce 7800 GTX: 430 MHz, 302M transistors, 110nm process, 326 mm², 512 MB onboard memory, 313 GFLOPs (shader), 1.3 TFLOPs (total), price $450
CPU vs. GPU
(Henry Moreton, NVIDIA, Aug. 2005)

                     PEE 840   7800 GTX   GPU/CPU
Graphics GFLOPs      25.6      1300       50.8
Shader GFLOPs        25.6      313        12.2
Die area (mm²)       206       326        1.6
Die area normalized  206       218        1.1
Transistors (M)      230       302        1.3
Power (W)            130       65         0.5
GFLOPS/mm²           0.1       6.0        47.9
GFLOPS/transistor    0.1       4.3        38.7
GFLOPS/W             0.2       20.0       101.6
CPU vs. GPU: Bandwidth
[Diagram: memory hierarchy]
CPU (3 GHz), 2 × 1 MB cache — system memory (2+ GB): 6.4 GB/s bandwidth
PCI-E bus: 4 GB/s (AGP memory: 512 MB)
GPU (500 MHz) — video memory (512 MB): 35.2 GB/s bandwidth
Bandwidth
Large high-bandwidth memory
512 MB video memory vs. 2 MB L2 cache on CPUs
High memory-to-compute clock ratio reduces memory stalls
Graphics pipeline
[Diagram: pipeline stages, data flowing to memory]
vertex: programmable vertex processing (fp32)
polygon: polygon setup, culling, rasterization
pixel: programmable per-pixel math (fp32)
texture: per-pixel texture, fp16 blending
image: Z-buffer, fp16 blending, anti-aliasing (MRT)
Stream processor (non-graphics)
(David Kirk, NVIDIA, May 2005)
[Diagram: the same pipeline, relabeled for general-purpose streams]
data: programmable MIMD processing (fp32)
lists: SIMD "rasterization" (setup, rasterizer)
data: programmable SIMD processing (fp32)
data: data fetch, fp16 blending
data: predicated write, fp16 blend, multiple output
Potential of graphics processors
Commodity horsepower
Parallel computation
Bandwidth
Programmable graphics pipeline
Stream processor
Exploit large growth rate
Exploiting technology moving faster than Moore's law
(Source: Anselmo Lastra)
[Chart: GPU growth rate vs. CPU growth rate]
General purpose computing on GPUs
Physical Simulation
Fluid Flow [Fan et al. 2004]
FEM [Rumpf and Strzodka 2001]
Cloud Dynamics [Harris et al. 2003]
Sparse Linear Algebra
Operators [Krüger and Westermann 2003]
Iterative Solvers [Bolz et al. 2003]
General purpose computing on GPUs
Matrix-Matrix Multiplication
Fixed graphics pipeline, fixed-point arithmetic [Larsen and McAllister 2001]
Floating point (SP) [Fatahalian et al. 2004]
High-level APIs
BrookGPU [Buck et al. 2004]
Sh [McCool et al. 2004]
Outline
LU Decomposition & Related Work
The potential of GPUs
LU-GPU algorithm
Results
Conclusions & Ongoing Work
Motivation for LU-GPU
LU decomposition maps well:
Stream program
Few data dependencies
Pivoting
Parallel pivot search
Exploit large memory bandwidth
GPU based algorithms
Data representation
Algorithm mapping
Data representation
Texture mapping hardware: Input data mapping
Data representation
Matrix elements
2D texture memory
One-to-one mapping
Texture memory = on-board memory
Exploit bandwidth
Avoid CPU-GPU data transfer
GPU based algorithms
Data representation
Algorithm mapping
Stream computation
Input data mapping
Fast row swaps
Algorithm mapping
Texture mapping hardware: Input data mapping
Stream computation
Rasterize quadrilaterals
Generates computation stream
Invokes SIMD units
Rasterization simulates blocking
Rasterization pass = row elimination
Alternating memory regions
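Each rasterized quadrilateral invokes one independent computation per covered element. A Python sketch of what one such pass computes (a hypothetical CPU-side stand-in for the fragment program, with src and dst playing the two alternating memory regions; dst starts as a copy of src):

```python
def eliminate_pass(src, dst, k):
    """One 'rasterization pass' = one row elimination over the trailing block.

    Every (i, j) output is computed independently, like a fragment shader:
    read from src (one memory region), write to dst (the other).
    """
    n = len(src)
    for i in range(k + 1, n):          # rows covered by the rasterized quad
        for j in range(k, n):          # columns covered by the quad
            if j == k:
                dst[i][j] = src[i][k] / src[k][k]   # multiplier column
            else:
                dst[i][j] = src[i][j] - (src[i][k] / src[k][k]) * src[k][j]
```

Running passes k = 0 … n−2 while swapping the roles of src and dst each time leaves the combined L/U factors in the final buffer.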
Input data mapping
Texture mapping hardware: Input data mapping
Input data mapping
Dedicated texture mapping hardware
Traditionally for color interpolation
Map input matrix elements to output elements
Eliminates computation of memory locations
25% performance improvement
Pivoting
Main issues:
Pivot search
Row/column swapping
Pivoting
Texture mapping hardware: Input data mapping
Partial pivoting
Fast row swap
Data copy: mapped rasterization
Texture mapping hardware
High memory bandwidth
Improvement over pointer swapping
[Figure: texture mapping hardware copies input rows to their swapped output positions]
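Functionally, partial pivoting adds a column search and a row swap (here done as a data copy, echoing the copy-based swap above) to each elimination step. A NumPy sketch of the idea, not the GPU code:

```python
import numpy as np

def lu_partial_pivot(A):
    """LU with partial pivoting: returns (p, LU) such that A[p] = L @ U,
    with the combined L/U factors stored in LU (unit diagonal of L implied)."""
    A = A.copy()
    n = A.shape[0]
    p = np.arange(n)
    for k in range(n - 1):
        piv = k + np.argmax(np.abs(A[k:, k]))   # pivot search in column k
        if piv != k:
            A[[k, piv], :] = A[[piv, k], :]     # row swap as a data copy
            p[[k, piv]] = p[[piv, k]]
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return p, A
```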
Full pivoting
Fast column/row swap
Parallel pivot search: divide-and-conquer approach
[Figure: search region for partial pivoting (column) vs. full pivoting (trailing submatrix)]
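The divide-and-conquer pivot search can be pictured as a tree reduction over candidate indices — a hypothetical sketch of the idea, with each loop iteration standing in for one parallel pass:

```python
def pivot_search(vals):
    """Return the index of the entry of largest magnitude by pairwise
    reduction; each pass halves the candidate set, as parallel SIMD
    units would compare pairs simultaneously."""
    idx = list(range(len(vals)))
    while len(idx) > 1:
        nxt = [a if abs(vals[a]) >= abs(vals[b]) else b
               for a, b in zip(idx[0::2], idx[1::2])]
        if len(idx) % 2:          # odd element carries over to the next pass
            nxt.append(idx[-1])
        idx = nxt
    return idx[0]
```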
Outline
LU Decomposition & Related Work
The potential of GPUs
LU-GPU algorithm
Results
Conclusions & Ongoing Work
Benchmarks

GPU          SIMD units   Core clock   Memory    Memory clock
6800 GT      12           350 MHz      256 MB    900 MHz
6800 Ultra   16           425 MHz      256 MB    1100 MHz
7800 GTX     24           430 MHz      256 MB    1200 MHz

Commodity CPU: 3.4 GHz Pentium IV with Hyper-Threading, 1 MB L2 cache
LAPACK sgetrf() (blocked algorithm, ATLAS library)
LAPACK sgetc2() (SSE-optimized IMKL library)
Results: No pivoting
[Plot: time (s, 0–9) vs. matrix size N (1000–3500) for ATLAS GETRF (partial pivot) and Ultra 6800, GT 6800, and 7800 LU without pivoting]
Results: Partial pivoting
[Plot: time (s, 0–12) vs. matrix size N (500–3500) for ATLAS GETRF (partial pivot) and GT 6800, Ultra 6800, and 7800 partial-pivot LU]
Results: Full pivoting
[Plot: time (s, 0–250) vs. matrix size N (500–3500) for LAPACK sgetc2 (IMKL) and Ultra 6800 and 7800 full-pivot LU]
Results: Number of computational units
[Plot: time (s, 0–45) vs. matrix size N (500–4000) on a 6800 Ultra (no pivoting), varying the number of SIMD units: 4, 8, 12, 16; annotated Jun 2003 and Mar 2004]
GPU-CPU data transfer overhead
[Plot: time (s, 0–12) vs. matrix size N (500–3500) for GT 6800, Ultra 6800, and 7800 partial-pivot LU, with the CPU-GPU transfer time shown separately]
Bandwidth efficiency
[Plot: bandwidth usage (GB/s, 0–35) vs. matrix size N (500–4000) for 6800 Ultra and 6800 GT]
6800 Ultra peak bandwidth: 35.2 GB/s
6800 GT peak bandwidth: 28.8 GB/s
Faster than Moore’s law
[Plot: time (s, 0–12) vs. matrix size N (500–3500) for ATLAS GETRF (partial pivot) and GT 6800 (Mar 2004), Ultra 6800, and 7800 (Jun 2005) partial-pivot LU]
Application: Fluid Simulation
Solve parallel sub-problems (N = 2048)
Diagonally dominant: no pivoting required
15% faster than ATLAS on a Pentium IV 3.06 GHz
Limitations
Maximum matrix size: 4096 × 4096
Larger matrices require block-partitioned LU decomposition
Precision: single-precision floating point, not 100% IEEE compliant
CPU-GPU data transfer overhead dominates for small matrices
Graphics hardware advancements
Improved floating point bandwidth
4 component vs. single component
Floating point blending
Use of non-programmable TFLOPs
Outline
LU Decomposition & Related Work
The potential of GPUs
LU-GPU algorithm
Results
Conclusions & Ongoing Work
Conclusions
Algorithm mapped to graphics pipeline
Novel mapping of row operations to rasterization
Stream computation
Blocking
Conclusions
Optimized with GPU architecture
Input data mapping
Fast pivoting
Conclusions
Performance
Comparable to industry-standard libraries
Relatively small development effort
GPUs are useful co-processors
Scientific computations
Many applications
Conclusions
LU-GPU Open Source library available:
http://gamma.cs.unc.edu/LUGPULIB/
Ongoing work
Sorting on GPUs
Linear algebra: GPU-LAPACK / QR decomposition
Sorting on GPUs
Goal: utilize the high parallelism and memory bandwidth of GPUs for fast sorting [Govindaraju et al., SIGMOD 2005]
GPUSort: Open Source library [http://gamma.cs.unc.edu/GPUSORT/]
Sorting on GPUs
6 times faster than Quicksort on a 3.4 GHz Pentium IV PC!
Linear algebra
LAPACK-compliant library for GPUs
QR decomposition in development (LAPACK SGEQRF)
Acknowledgements
Army Research Office
DARPA
National Science Foundation
Office of Naval Research
RDECOM
Intel Corporation
NVIDIA Corporation
UNC GAMMA Group
Thank you
For questions or comments:
nico@cs.unc.edu
naga@cs.unc.edu
henson@cs.unc.edu
http://gamma.cs.unc.edu/
http://gamma.cs.unc.edu/LUGPULIB/