IIIT Hyderabad
BTP Report

TRIDIAGONAL Solvers on the GPU
CUDA-optimized solutions

Prasoon Mishra (200730017)
prasoon_mishra@students.iiit.ac.in
Suyesh Tiwari (200701080)
suyesh@students.iiit.ac.in

Under the guidance of Prof. R. Govindarajulu

B.Tech Project submitted in partial fulfillment
of the requirements for the degree of
Bachelor of Technology in CSE by

Prasoon Mishra (200730017) - prasoon_mishra@students.iiit.ac.in
Suyesh Tiwari (200701080) - suyesh@students.iiit.ac.in

International Institute of Information Technology
Hyderabad 500 032, INDIA
Acknowledgements
We are thankful to our faculty advisor, Prof. R. Govindarajulu, for offering us this exciting project, and for his unconditional and invaluable guidance throughout its execution.
Contents
Executive Summary
Phase 1 Work (Till Mid Term Evaluation)
Phase 2 Work (Post Mid Term Evaluation)
    Parallel Cyclic Reduction (PCR)
    Parallel Cyclic Reduction (PCR) on CUDA
    Hybrid (CR + PCR)
    Performance
Configuration of Systems used
Applications
Conclusion
References
Executive Summary

Problem Statement
Implement and study the performance of two parallel algorithms and their hybrid variant for solving tridiagonal linear systems on the GPU: cyclic reduction (CR), parallel cyclic reduction (PCR), and hybrid CR+PCR.
Performance is to be measured by solving a large number of small tridiagonal systems in parallel, as this maps well to the hierarchical and throughput-oriented architecture of the GPU.
Work

First Phase (Till Mid-Term Evaluation)
Implemented the Cyclic Reduction (CR) algorithm to solve tridiagonal systems in the first phase of the project.

Second Phase (Post Mid-Term Evaluation)
Implemented Parallel Cyclic Reduction and the Hybrid (CR+PCR) algorithm.
Identified the intermediate system size for the Hybrid algorithm (the point at which CR must switch to PCR).
Benchmarked the code to measure the performance of the various algorithms and the percentage speedup.
Challenges

Mapping tridiagonal solvers to the CUDA architecture: The primary challenge in solving any problem on CUDA lies in efficiently utilizing its two-level architecture. Matrices and their corresponding equations acted as the two levels of our problem, and hence were mapped to the CUDA architecture.

Determining the intermediate system size for the Hybrid (CR-PCR) algorithm: The hybrid algorithm works by switching from CR to PCR when parallelism starts to die off in CR. Determining this switch point was an important phase, as the results varied based on the matrices that were supplied.

Overcoming bank conflicts: The efficiency of CUDA code can be improved if shared-memory bank conflicts are tackled. We were able to do this well for PCR by in-place shifting of intermediate results.
Learnings

Optimizing parallel codes: We found that an algorithm's performance is bottlenecked by a combination of factors such as bank conflicts, synchronization, and control overhead.

Parallel programming platforms: We got hands-on experience with various parallel programming platforms such as CUDA, OpenMP, and MPI, and became well-versed in the situations where each of them is used.
Phase 1 Work (Till Mid Term Evaluation)

We implemented the Cyclic Reduction algorithm on various platforms and compared its performance to the non-parallel CPU implementation.
OpenMP & MPI:
We ventured into the parallel programming world by initially understanding the basics of the common parallel programming platforms OpenMP and MPI.
o Implemented the Cyclic Reduction algorithm for solving tridiagonal systems on OpenMP & MPI.
o MPI (Message Passing Interface): MPI is a de facto standard for communication among processes that model a parallel program running on a distributed-memory system or a Symmetric Multiprocessing (SMP) system.
o OpenMP (Open Multi-Processing): an application programming interface (API) that supports shared-memory multiprocessing programming in C and C++.
CUDA Implementation of CR on the GPU:
CUDA virtualizes the multiprocessors of the GPU as blocks and its processors as threads, so the programmer can launch threads and blocks irrespective of the number of physical processors.
CPU Time (ms) | GPU Time (ms) | GPU cudaMalloc Time (ms) | Net GPU Time (ms) | Speedup Achieved
4.131         | 60.896        | 58.525                   | 2.371             | 42.60%
5.530         | 54.147        | 51.547                   | 2.600             | 52.98%
4.415         | 51.055        | 48.632                   | 2.423             | 45.11%
5.512         | 53.652        | 51.112                   | 2.540             | 53.91%
4.132         | 43.278        | 40.896                   | 2.382             | 42.35%

Table 1: Execution time of CR & comparison with the sequential algorithm, for 256 x 256 systems.
Phase 2 Work (Post Mid Term Evaluation)

Our work constituted the implementation and study of other parallel algorithms for solving tridiagonal systems.
Parallel Cyclic Reduction:
Unlike CR, PCR only consists of the forward reduction phase.
Forward reduction phase: successively reduces a tridiagonal system of equations to a smaller system with half the number of unknowns, until a system of 2 unknowns is reached. We take a deeper look into forward reduction just a little later.
In each reduction step, each current tridiagonal system of equations is reduced to two systems of half the size using forward reduction.
Example: An 8-unknown system reduces to two 4-unknown systems in step 1 (see Figure 1). The two 4-unknown systems are then further reduced to four 2-unknown systems in step 2, and finally the four systems are solved in step 3.
Figure 1
Forward Reduction:
o All even-indexed equations i of the matrix are written as a linear combination of equations i, i+1 and i-1.
o This is done in parallel for all the even-indexed equations, so that we derive a system of only even-indexed unknowns.
o If equation i has the form a_i x_(i-1) + b_i x_i + c_i x_(i+1) = d_i, the updated values of a_i, b_i, c_i and d_i in the new matrix of even-indexed unknowns are given as:
Figure 2
o Time Complexity: on the GPU, the reduction takes log2(n) steps. Hence, the algorithm runs in O(log2 n) steps.
PCR implementation on CUDA:
CUDA virtualizes the multiprocessors of the GPU as blocks and its processors as threads, so the programmer can launch threads and blocks irrespective of the number of physical processors (Figure 3).
The GPU has a 2-level hierarchical architecture of blocks (multiprocessors) and threads (processors):
o Matrix -> Block (multiprocessor).
o Each equation -> Single thread (runs on a processor).
Since parallelism in PCR does not decrease at any stage, the number of threads for each system of equations remains equal to the total number of equations and does not decrease at any stage.
Figure 3
Hybrid CR + PCR:

Why hybrid?
In terms of complexity, CR is the quickest parallel algorithm. But CR has drawbacks:
o In the forward reduction stage, parallelism reduces as the algorithm progresses; the latter stages suffer from a lack of it.
o In backward substitution, parallelism increases as the algorithm progresses; the earlier stages suffer from a lack of it.

Solution
Parallel Cyclic Reduction (PCR) has equal parallelism throughout.
Use PCR when parallelism in CR starts to die out in the forward reduction phase.
Get the output from PCR.
Now run CR's backward substitution on the obtained output of PCR.
Example:
In this example, we use normal CR on an 8-unknown system (Figure 4). We switch to PCR when the forward reduction of CR has reached a 4-unknown system. We substitute the output from PCR back into the backward substitution stage of CR.
Figure 4
Performance:

S.No | CR    | PCR   | Hybrid | Serial (CPU)
1    | 7.984 | 6.725 | 6.340  | 16.667
2    | 8.116 | 6.741 | 6.230  | 18.997
3    | 8.024 | 6.747 | 6.146  | 17.545

Table 2: Execution time of the algorithms in milliseconds, for 512 x 512 systems.

S.No | CR     | PCR    | Hybrid | Serial (CPU)
1    | 20.265 | 17.649 | 16.768 | 62.534
2    | 21.231 | 17.456 | 15.985 | 60.012
3    | 20.887 | 18.234 | 16.024 | 61.234

Table 3: Execution time of the algorithms in milliseconds, for 1024 x 1024 systems.

Size        | CR  | PCR | Serial (CPU)
512 x 512   | 26% | 7%  | 163%
1024 x 1024 | 33% | 9%  | 273%

Table 4: The Hybrid (CR-PCR) algorithm's percentage improvement over the other algorithms.
Configuration of Systems used:
GPU: NVIDIA GeForce GTX 280 (PCI-E card)
CPU: Intel® Core™ 2 CPU, 2.13 GHz
Applications
Fast solutions to tridiagonal systems of linear equations are critical for many scientific and engineering problems.
They are also used in real-time and interactive applications in computer graphics, video games, and animated films, e.g. 3-D game engine design.
Applications of tridiagonal solvers include alternating direction implicit (ADI) methods, spectral Poisson solvers, numerical ocean models, and preconditioners for iterative linear solvers.
Conclusion:
We studied three tridiagonal solvers that run on the GPU and measured the performance of the GPU programs in terms of memory access and computation. We have been able to show that the hybrid algorithm has the best performance: for 512 x 512 systems, the hybrid solver achieves a 26% and 7% improvement over CR and PCR, respectively.
References:
Fast Tridiagonal Solvers on the GPU: Yao Zhang, Jonathan Cohen, John D. Owens.
Wikipedia.
www.nvidia.com/object/cuda_education.html
http://gpgpu.org/