TRIDIAGONAL Solvers on the GPU

Dec 1, 2013



IIIT Hyderabad

BTP Report

TRIDIAGONAL Solvers on the GPU

CUDA-optimized solutions

Prasoon Mishra 200730017
prasoon_mishra@students.iiit.ac.in

Suyesh Tiwari 200701080
suyesh@students.iiit.ac.in

Under the guidance of Prof. R. Govindarajulu








B.Tech Project submitted in partial fulfillment
of the requirements for the degree of
Bachelor of Technology in CSE by

Prasoon Mishra
200730017 - prasoon_mishra@students.iiit.ac.in

Suyesh Tiwari
200701080 - suyesh@students.iiit.ac.in

International Institute of Information Technology
Hyderabad - 500 032, INDIA

Acknowledgements

We are thankful to our faculty advisor, Prof. R. Govindarajulu, for offering us this exciting project and for his unconditional and invaluable guidance throughout its execution.























Contents

Executive Summary
Phase 1 Work (Till Mid Term Evaluation)
Phase 2 Work (Post Mid Term Evaluation)
    Parallel Cyclic Reduction (PCR)
    Parallel Cyclic Reduction (PCR) on CUDA
    Hybrid (CR + PCR)
    Performance
    Configuration of Systems used
Applications
Conclusion
References








Executive Summary

Problem Statement



Implement and study the performance of two parallel algorithms and their hybrid variant for solving tridiagonal linear systems on the GPU: cyclic reduction (CR), parallel cyclic reduction (PCR), and hybrid CR+PCR.

Performance is to be measured by solving a large number of small tridiagonal systems in parallel, as this maps well to the hierarchical and throughput-oriented architecture of the GPU.
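The serial CPU implementation used as the baseline in the performance tables is not spelled out in the report; a common baseline for this comparison is the O(n) Thomas algorithm. The sketch below is an illustrative assumption, not the report's actual code:

```python
def thomas_solve(a, b, c, d):
    """Serial O(n) tridiagonal solve (Thomas algorithm).
    a: sub-diagonal (a[0] unused), b: main diagonal,
    c: super-diagonal (c[-1] unused), d: right-hand side."""
    n = len(d)
    cp = [0.0] * n  # modified super-diagonal
    dp = [0.0] * n  # modified right-hand side
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):  # forward sweep eliminates the sub-diagonal
        m = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[n - 1] = dp[n - 1]
    for i in range(n - 2, -1, -1):  # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Its two fully sequential sweeps are what make the serial solver hard to parallelize directly, which motivates CR and PCR.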




Work

First Phase (Till Mid-Term Evaluation)

Implemented the Cyclic Reduction (CR) algorithm to solve tridiagonal systems in the first phase of the project.

Second Phase (Post Mid-Term Evaluation)

Implemented the Parallel Cyclic Reduction and hybrid (CR+PCR) algorithms.

Identified the intermediate system size for the hybrid algorithm (the point at which CR must switch to PCR).

Benchmarked the code to measure the performance of the various algorithms and the percentage speedup.

Challenges

Mapping tridiagonal solvers to the CUDA architecture: The primary challenge in solving any problem on CUDA lies in efficiently utilizing the two-level (block/thread) architecture of CUDA. Matrices and their corresponding equations acted as the two levels of our problem and were mapped to the CUDA architecture accordingly.

Determining the intermediate system size for the hybrid (CR-PCR) algorithm: The hybrid algorithm works by switching from CR to PCR when parallelism starts to die off in CR. Determining this switch point was an important phase, as the results varied based on the matrices that were supplied.

Overcoming bank conflicts: The efficiency of CUDA code can be improved if bank conflicts are tackled. We were able to do this well for PCR by shifting intermediate results in place.

Learnings

Optimizing parallel code: We found that an algorithm's performance is bottlenecked by a combination of factors such as bank conflicts, synchronization, and control overhead.

Parallel programming platforms: We got hands-on experience with various parallel programming platforms such as CUDA, OpenMP, and MPI, and became well-versed in the situations in which each of them is used.




Phase 1 Work (Till Mid Term Evaluation)

We implemented the Cyclic Reduction algorithm on various platforms and compared its performance to the non-parallel CPU implementation.

OpenMP & MPI:

We ventured into the parallel programming world by initially understanding the basics of the common parallel programming platforms: OpenMP and MPI.

o Implemented the Cyclic Reduction algorithm for solving tridiagonal systems on OpenMP & MPI.

o MPI (Message Passing Interface): MPI is a de facto standard for communication among processes that model a parallel program running on a distributed memory system or a Symmetric Multiprocessing (SMP) system.

o OpenMP: OpenMP (Open Multi-Processing) is an application programming interface (API) that supports shared-memory multiprocessing programming in C and C++.

CUDA Implementation of CR on the GPU:

CUDA virtualizes the multiprocessors of the GPU as blocks and its processors as threads, so the programmer can run threads and blocks irrespective of the number of physical processors.
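A CPU-side sketch of cyclic reduction may help make the mapping concrete; the inner loops over i are the ones that become one CUDA thread per equation. The array layout (a[0] = c[n-1] = 0) and the power-of-two system size are our assumptions, not necessarily the report's exact conventions:

```python
def cr_solve(a, b, c, d):
    """Cyclic reduction sketch for an n = 2^k tridiagonal system.
    a: sub-diagonal (a[0] = 0), b: diagonal, c: super-diagonal (c[-1] = 0), d: RHS."""
    n = len(d)
    a, b, c, d = a[:], b[:], c[:], d[:]  # work on copies
    s = 1
    while s < n:
        # Forward reduction: each i below is independent (one CUDA thread),
        # but the trip count halves at every step.
        for i in range(2 * s - 1, n, 2 * s):
            k1 = a[i] / b[i - s]
            k2 = c[i] / b[i + s] if i + s < n else 0.0
            d[i] -= k1 * d[i - s] + (k2 * d[i + s] if i + s < n else 0.0)
            b[i] -= k1 * c[i - s] + (k2 * a[i + s] if i + s < n else 0.0)
            a[i] = -k1 * a[i - s]
            c[i] = -k2 * c[i + s] if i + s < n else 0.0
        s *= 2
    x = [0.0] * n
    x[n - 1] = d[n - 1] / b[n - 1]  # the last surviving one-unknown system
    s = n // 2
    while s >= 1:
        # Backward substitution: parallelism doubles at every step.
        for i in range(s - 1, n, 2 * s):
            lo = a[i] * x[i - s] if i - s >= 0 else 0.0
            hi = c[i] * x[i + s] if i + s < n else 0.0
            x[i] = (d[i] - lo - hi) / b[i]
        s //= 2
    return x
```

Note how the forward loop's trip count halves each step while the backward loop's doubles; this shrinking parallelism is what the hybrid algorithm later addresses.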

CPU Execution    GPU Execution    GPU cudaMalloc    Net GPU Time    Speedup
Time (in ms)     Time (in ms)     Time (in ms)      (in ms)         Achieved
4.131            60.896           58.525            2.371           42.60%
5.530            54.147           51.547            2.600           52.98%
4.415            51.055           48.632            2.423           45.11%
5.512            53.652           51.112            2.540           53.91%
4.132            43.278           40.896            2.382           42.35%

Table 1: Execution time of CR & comparison with the sequential algorithm, for 256 x 256 systems.





Phase 2 Work (Post Mid Term Evaluation)

Our work constituted the implementation and study of other parallel algorithms for solving tridiagonal systems.

Parallel Cyclic Reduction:

Unlike CR, PCR consists only of the forward reduction phase.

Forward reduction phase: successively reduces a tridiagonal system of equations to a smaller system with half the number of unknowns, until a system of 2 unknowns is reached. We take a deeper look into forward reduction a little later.

In each reduction step, each of the current tridiagonal systems of equations is reduced to two systems of half the size using forward reduction.

Example: An 8-unknown system reduces to two 4-unknown systems in step 1 (see Figure 1). These further reduce to four 2-unknown systems in step 2, and we finally solve the four systems in step 3.


Figure 1



Forward Reduction:

o All even-indexed equations i of the matrix are written as a linear combination of equations i, i+1 and i-1.

o This is done in parallel for all the even-indexed equations, so that we derive a system of only even-indexed unknowns.

o Suppose equation i has the form a_i * x_(i-1) + b_i * x_i + c_i * x_(i+1) = d_i. The updated values of a_i, b_i, c_i and d_i in the new matrix of even-indexed unknowns are given in Figure 2.

Figure 2

o Time complexity: on the GPU, the reduction takes log2(n) steps, i.e., O(log2 n) steps.
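Since the formulas of Figure 2 survive only as an image, the standard update (equivalent to the one in the Zhang, Cohen & Owens paper cited in the references) can be sketched for a single equation, using 0-based indices and assuming both neighbours exist:

```python
def reduce_even(a, b, c, d, i):
    """One elimination step for equation i: combine equations i-1, i, i+1 so
    that x[i-1] and x[i+1] drop out, leaving a row that couples x[i-2], x[i],
    x[i+2]. These are the textbook CR/PCR formulas; i-1 and i+1 assumed valid."""
    k1 = a[i] / b[i - 1]
    k2 = c[i] / b[i + 1]
    new_a = -k1 * a[i - 1]                        # couples x[i] to x[i-2]
    new_b = b[i] - k1 * c[i - 1] - k2 * a[i + 1]  # updated diagonal
    new_c = -k2 * c[i + 1]                        # couples x[i] to x[i+2]
    new_d = d[i] - k1 * d[i - 1] - k2 * d[i + 1]  # updated right-hand side
    return new_a, new_b, new_c, new_d
```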

PCR implementation on CUDA:

CUDA virtualizes the multiprocessors of the GPU as blocks and its processors as threads, so the programmer can run threads and blocks irrespective of the number of physical processors (Figure 3).

The GPU has a 2-level hierarchical architecture of blocks (multiprocessors) and threads (processors).

o Matrix → Block (multiprocessor).

o Each equation → Single thread (runs on a processor).

Since parallelism in PCR does not decrease at any stage, the number of threads for each system of equations remains equal to the total number of equations throughout.

Figure 3
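A CPU-side sketch of PCR illustrates the point above: the per-equation loop runs over all n equations at every step, so the thread count never shrinks. The double buffering stands in for the shared-memory staging a CUDA kernel would use; the boundary conventions (a[0] = c[n-1] = 0) are our assumptions:

```python
def pcr_solve(a, b, c, d):
    """PCR sketch: at every step, every equation eliminates its neighbours at
    distance s; after ceil(log2 n) steps each equation stands alone."""
    n = len(d)
    s = 1
    while s < n:
        # Double-buffer so all n updates happen "simultaneously", mirroring
        # one CUDA thread per equation writing to a second shared-memory array.
        na, nb, nc, nd = a[:], b[:], c[:], d[:]
        for i in range(n):  # trip count never shrinks: constant parallelism
            lo, hi = i - s, i + s
            k1 = a[i] / b[lo] if lo >= 0 else 0.0
            k2 = c[i] / b[hi] if hi < n else 0.0
            na[i] = -k1 * a[lo] if lo >= 0 else 0.0
            nc[i] = -k2 * c[hi] if hi < n else 0.0
            nb[i] = b[i] - (k1 * c[lo] if lo >= 0 else 0.0) \
                         - (k2 * a[hi] if hi < n else 0.0)
            nd[i] = d[i] - (k1 * d[lo] if lo >= 0 else 0.0) \
                         - (k2 * d[hi] if hi < n else 0.0)
        a, b, c, d = na, nb, nc, nd
        s *= 2
    return [d[i] / b[i] for i in range(n)]  # every equation is now decoupled
```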

Hybrid CR + PCR:

Why hybrid?

In terms of complexity, CR is the quickest parallel algorithm. But CR has drawbacks:

o In the forward reduction stage, parallelism reduces as the algorithm progresses; the latter stages suffer from a lack of parallelism.

o In backward substitution, parallelism increases as the algorithm progresses, but the earlier stages suffer from a lack of it.

Solution

o Parallel Cyclic Reduction (PCR) has equal parallelism throughout.

o Use PCR when parallelism in CR starts to die out in the forward reduction phase.

o Get the output from PCR.

o Now run CR's backward substitution on the output obtained from PCR.
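Putting the steps above together, the switch can be sketched end-to-end on the CPU. The report's tuned switch point and CUDA kernels are not reproduced; switch_size, the power-of-two system size, and the boundary conventions a[0] = c[n-1] = 0 are our assumptions:

```python
def hybrid_solve(a, b, c, d, switch_size=4):
    """Hybrid CR+PCR sketch: CR forward reduction until only switch_size
    unknowns remain, PCR on that small system, then CR backward substitution."""
    n = len(d)
    a, b, c, d = a[:], b[:], c[:], d[:]

    # CR forward reduction (parallelism halves each step) until small enough.
    s = 1
    while n // s > switch_size:
        for i in range(2 * s - 1, n, 2 * s):
            k1 = a[i] / b[i - s]
            k2 = c[i] / b[i + s] if i + s < n else 0.0
            d[i] -= k1 * d[i - s] + (k2 * d[i + s] if i + s < n else 0.0)
            b[i] -= k1 * c[i - s] + (k2 * a[i + s] if i + s < n else 0.0)
            a[i] = -k1 * a[i - s]
            c[i] = -k2 * c[i + s] if i + s < n else 0.0
        s *= 2

    # Gather the surviving equations (spacing s) and solve them with PCR,
    # which keeps all of them active at every step.
    idx = list(range(s - 1, n, s))
    m = len(idx)
    ra = [a[i] for i in idx]; rb = [b[i] for i in idx]
    rc = [c[i] for i in idx]; rd = [d[i] for i in idx]
    t = 1
    while t < m:
        na, nb, nc, nd = ra[:], rb[:], rc[:], rd[:]
        for j in range(m):
            lo, hi = j - t, j + t
            k1 = ra[j] / rb[lo] if lo >= 0 else 0.0
            k2 = rc[j] / rb[hi] if hi < m else 0.0
            na[j] = -k1 * ra[lo] if lo >= 0 else 0.0
            nc[j] = -k2 * rc[hi] if hi < m else 0.0
            nb[j] = rb[j] - (k1 * rc[lo] if lo >= 0 else 0.0) \
                          - (k2 * ra[hi] if hi < m else 0.0)
            nd[j] = rd[j] - (k1 * rd[lo] if lo >= 0 else 0.0) \
                          - (k2 * rd[hi] if hi < m else 0.0)
        ra, rb, rc, rd = na, nb, nc, nd
        t *= 2
    x = [0.0] * n
    for j, i in enumerate(idx):
        x[i] = rd[j] / rb[j]  # scatter PCR's output back

    # CR backward substitution on the remaining unknowns.
    s //= 2
    while s >= 1:
        for i in range(s - 1, n, 2 * s):
            lo = a[i] * x[i - s] if i - s >= 0 else 0.0
            hi = c[i] * x[i + s] if i + s < n else 0.0
            x[i] = (d[i] - lo - hi) / b[i]
        s //= 2
    return x
```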


Example:

In this example, we use normal CR on an 8-unknown system (Figure 4). We switch to PCR when the forward reduction of CR has reached a 4-unknown system. We feed the output of PCR back into the backward substitution stage of CR.

Figure 4








Performance:

S.No    CR       PCR      Hybrid    Serial (CPU)
1       7.984    6.725    6.34      16.667
2       8.116    6.741    6.23      18.997
3       8.024    6.747    6.146     17.545

Table 2: Execution time of the algorithms in milliseconds, for 512 x 512 systems.

S.No    CR        PCR       Hybrid    Serial (CPU)
1       20.265    17.649    16.768    62.534
2       21.231    17.456    15.985    60.012
3       20.887    18.234    16.024    61.234

Table 3: Execution time of the algorithms in milliseconds, for 1024 x 1024 systems.
.

Size           CR     PCR    Serial (CPU)
512 x 512      26%    7%     163%
1024 x 1024    33%    9%     273%

Table 4: The hybrid (CR-PCR) algorithm's percentage improvement over the other algorithms.




Configuration of Systems used:

GPU: NVIDIA GeForce GTX 280 (Card Type: PCI-E)

CPU: Intel® Core™2 CPU, 2.13 GHz







Applications

Fast solutions to tridiagonal systems of linear equations are critical for many scientific and engineering problems.

They are also used in real-time and interactive applications in computer graphics, video games, and animated films, such as 3-D game engine design.

Other applications of tridiagonal solvers include alternating direction implicit (ADI) methods, spectral Poisson solvers, numerical ocean models, and preconditioners for iterative linear solvers.

Conclusion:

We studied three tridiagonal solvers that run on the GPU and measured the performance of the GPU programs in terms of memory access and computation. We have been able to show that the hybrid algorithm has the best performance: for solving 512 x 512-unknown systems, the hybrid solver achieves a 26% and 7% improvement over CR and PCR, respectively.

References:

o Yao Zhang, Jonathan Cohen, John D. Owens. Fast Tridiagonal Solvers on the GPU.

o Wikipedia.

o www.nvidia.com/object/cuda_education.html

o http://gpgpu.org/