Tridiagonal Solvers on the GPU
Prasoon Mishra 200730017
Suyesh Tiwari 200701080
Under the guidance of Prof. R. Govindarajulu
B.Tech Project submitted in partial fulfillment
of the requirements for the degree of
Bachelor of Technology in CSE
International Institute of Information Technology
Hyderabad - 500 032, INDIA
We are thankful to our faculty advisor, Prof. R. Govindarajulu, for offering us this exciting project, and for his unconditional and invaluable guidance throughout its execution.
Phase 1 Work (Till Mid Term Evaluation)
Phase 2 Work (Post Mid Term Evaluation)
Parallel Cyclic Reduction (PCR)
Parallel Cyclic Reduction (PCR) on CUDA
Hybrid (CR + PCR)
Configuration of Systems used
Implement and study the performance of two parallel algorithms and their hybrid variant for solving tridiagonal linear systems on the GPU: cyclic reduction (CR), parallel cyclic reduction (PCR), and hybrid CR+PCR.
Performance is to be measured by solving a large number of small tridiagonal systems in parallel, as this maps well to the hierarchical and throughput-oriented architecture of the GPU.
Implemented the Cyclic Reduction (CR) algorithm to solve tridiagonal matrices in the first phase of the project.
Implemented the Parallel Cyclic Reduction and Hybrid (CR+PCR) algorithms.
Identified the intermediate matrix size for the Hybrid algorithm (the point at which CR must switch to PCR).
Benchmarked the code to measure the performance of the various algorithms and the percentage speedup (a minimal timing sketch follows this list).
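The net GPU time reported in the tables below can be measured with CUDA events. The listing that follows is a minimal sketch of such a measurement, assuming a solver kernel (here called solverKernel) and device arrays allocated elsewhere; the names are illustrative and are not taken from our actual benchmarking code.

#include <cuda_runtime.h>

// Hypothetical solver kernel (CR, PCR or hybrid), defined elsewhere.
__global__ void solverKernel(float* a, float* b, float* c, float* d, float* x);

// Returns the kernel's execution time in milliseconds.
float timeSolverMs(float* d_a, float* d_b, float* d_c, float* d_d, float* d_x,
                   int numSystems, int systemSize)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    solverKernel<<<numSystems, systemSize>>>(d_a, d_b, d_c, d_d, d_x);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);              // wait for the kernel to finish

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // elapsed GPU time in milliseconds

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}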
Mapping tridiagonal solvers to the CUDA architecture: The primary challenge in solving any problem on CUDA lies in efficiently utilizing the two-level architecture of CUDA. Matrices and their corresponding equations acted as the two levels of our problem, and hence were mapped to the CUDA architecture.
Determining the intermediate system size for the Hybrid solver: The hybrid solver works by switching from CR to PCR when parallelism starts to die off in CR. Determining this switch point was an important phase, as the results varied based on the matrices that were tested.
Overcoming bank conflicts: Code efficiency in CUDA can be improved if bank conflicts are tackled. We were able to do this well for PCR by shifting data in place (see the sketch after this list).
Optimizing parallel codes: We found that an algorithm's performance is bottlenecked by a combination of multiple factors such as bank conflicts, synchronization, and control overhead.
Parallel Programming Platforms: We got hands-on experience with various parallel programming platforms like CUDA, OpenMP and MPI, and gained an understanding of the situations in which each of them is used.
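As an illustration of the bank-conflict issue mentioned above, the sketch below shows one common remedy: padding the shared-memory arrays so that power-of-two strided accesses fall in different banks. It is a minimal, self-contained example with illustrative names (SYS_SIZE, PAD, sharedB); it is not the exact in-place shifting scheme used in our PCR code.

// Minimal sketch of shared-memory padding to avoid bank conflicts on
// power-of-two strided accesses. Launch with SYS_SIZE threads per block.
#define SYS_SIZE 256
#define PAD(i) ((i) + (i) / 32)   // one dummy element per 32 entries (32 banks)

__global__ void paddedCopy(const float* in, float* out)
{
    __shared__ float sharedB[SYS_SIZE + SYS_SIZE / 32];

    int i = threadIdx.x;
    sharedB[PAD(i)] = in[blockIdx.x * SYS_SIZE + i];   // conflict-free store
    __syncthreads();

    // Stride-2 read: without padding, the threads of a warp would hit only
    // half of the 32 banks (2-way conflicts); the padding spreads them out.
    int j = (2 * i) % SYS_SIZE;
    out[blockIdx.x * SYS_SIZE + i] = sharedB[PAD(j)];
}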
Phase 1 Work (Till Mid Term Evaluation)
We implemented the Cyclic Reduction algorithm on various platforms and compared its performance to the non-parallel CPU implementation.
OpenMP & MPI:
We ventured into the parallel programming world by initially understanding the basics of the two: OpenMP and MPI. We then implemented the Cyclic Reduction algorithm for solving tridiagonal matrices on OpenMP & MPI.
MPI (Message Passing Interface): MPI is a standard for communication among processes that model a parallel program running on a distributed memory system or cluster.
OpenMP (Open Multi-Processing): OpenMP is an application programming interface (API) that supports shared-memory multiprocessing programming in C and C++.
CUDA Implementation of CR on GPU
CUDA virtualizes the multiprocessors of the GPU as blocks and its processors as threads. So, the programmer can run threads and blocks irrespective of the number of physical processors.
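The sketch below shows how this virtualization can be used in a launch configuration, assuming one thread block per tridiagonal system and one thread per equation; the kernel name (crForwardReduction) and the array names are illustrative placeholders for a kernel defined elsewhere.

#include <cuda_runtime.h>

// Hypothetical CR kernel, defined elsewhere.
__global__ void crForwardReduction(float* a, float* b, float* c,
                                   float* d, float* x, int systemSize);

void launchSolver(float* d_a, float* d_b, float* d_c, float* d_d, float* d_x,
                  int numSystems, int systemSize)
{
    dim3 grid(numSystems);   // one block (virtual multiprocessor) per matrix
    dim3 block(systemSize);  // one thread (virtual processor) per equation
    crForwardReduction<<<grid, block>>>(d_a, d_b, d_c, d_d, d_x, systemSize);
    cudaDeviceSynchronize(); // wait until all systems have been processed
}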
Table 1: Execution time (net GPU time) of Cyclic Reduction and comparison with the sequential algorithm, for 256 × 256 systems.
Phase 2 Work (Post Mid Term Evaluation)
Our work in this phase constituted the implementation and study of other parallel algorithms to solve tridiagonal systems.
Parallel Cyclic Reduction:
Unlike CR, PCR only consists of the forward reduction phase: it successively reduces a system to smaller systems with half the number of unknowns, until systems of 2 unknowns are reached. We take a deeper look into forward reduction just a little later.
In each reduction step, each of the current tridiagonal systems of equations is reduced to two systems of half size using forward reduction.
Example: an 8-unknown system will reduce to two 4-unknown systems in step 1 (see Figure 1). We then further reduce the two 4-unknown systems to four 2-unknown systems in step 2, and finally solve the four systems in step 3.
Each even-indexed equation i of the matrix is written as a linear combination of equations i, i+1 and i-1. This is done in parallel for all the even-indexed equations, so that we are left with a system of only even-indexed unknowns. If equation i has the form $a_i x_{i-1} + b_i x_i + c_i x_{i+1} = d_i$, then, with $k_1 = a_i / b_{i-1}$ and $k_2 = c_i / b_{i+1}$, the updated values of $a_i$, $b_i$, $c_i$ and $d_i$ in the new matrix of even-indexed unknowns are
$a_i' = -a_{i-1} k_1$, $c_i' = -c_{i+1} k_2$, $b_i' = b_i - c_{i-1} k_1 - a_{i+1} k_2$, $d_i' = d_i - d_{i-1} k_1 - d_{i+1} k_2$.
Time Complexity: On the GPU, PCR takes log2(n) steps for an n-unknown system (for example, a 512-unknown system needs log2(512) = 9 steps). Hence, the parallel running time is O(log n).
PCR implementation on CUDA:
CUDA virtualizes the multiprocessors of the GPU as blocks and its processors as threads. So, the programmer can run threads and blocks irrespective of the number of physical processors. The GPU thus has a two-level hierarchical architecture of blocks (multiprocessors) and threads (processors).
Each matrix → one block (multiprocessor).
Each equation → one thread (runs on a processor).
Since parallelism in PCR does not decrease at any stage, the number of threads working on each system of equations remains equal to the total number of equations and does not decrease at any step.
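Under this mapping, a PCR kernel could look like the following minimal sketch, assuming one block per system, one thread per equation, and a power-of-two system size small enough to fit in shared memory. The kernel and variable names are illustrative; this is a sketch of the technique, not our exact implementation.

// Hedged sketch of a PCR solver: one block per tridiagonal system,
// one thread per equation. a, b, c are the sub-, main- and super-diagonals,
// d is the right-hand side; boundary terms are dropped via the index checks.
#define SYS_SIZE 256   // unknowns per system (assumed power of two)

__global__ void pcrSolve(const float* a, const float* b, const float* c,
                         const float* d, float* x)
{
    __shared__ float sa[SYS_SIZE], sb[SYS_SIZE], sc[SYS_SIZE], sd[SYS_SIZE];

    int i    = threadIdx.x;
    int base = blockIdx.x * SYS_SIZE;

    sa[i] = a[base + i]; sb[i] = b[base + i];
    sc[i] = c[base + i]; sd[i] = d[base + i];
    __syncthreads();

    // In each step, equation i is rewritten as a linear combination of
    // equations i-delta and i+delta; delta doubles until every equation
    // is decoupled from its neighbours.
    for (int delta = 1; delta < SYS_SIZE; delta *= 2) {
        int lo = i - delta;
        int hi = i + delta;

        float k1 = (lo >= 0)       ? sa[i] / sb[lo] : 0.0f;
        float k2 = (hi < SYS_SIZE) ? sc[i] / sb[hi] : 0.0f;

        float na = (lo >= 0)       ? -sa[lo] * k1 : 0.0f;
        float nc = (hi < SYS_SIZE) ? -sc[hi] * k2 : 0.0f;
        float nb = sb[i] - ((lo >= 0)       ? sc[lo] * k1 : 0.0f)
                         - ((hi < SYS_SIZE) ? sa[hi] * k2 : 0.0f);
        float nd = sd[i] - ((lo >= 0)       ? sd[lo] * k1 : 0.0f)
                         - ((hi < SYS_SIZE) ? sd[hi] * k2 : 0.0f);
        __syncthreads();   // all threads finish reading the old coefficients

        sa[i] = na; sb[i] = nb; sc[i] = nc; sd[i] = nd;
        __syncthreads();   // publish the new coefficients
    }

    // After log2(SYS_SIZE) steps each equation involves only x_i.
    x[base + i] = sd[i] / sb[i];
}

The kernel would be launched as pcrSolve<<<numSystems, SYS_SIZE>>>(d_a, d_b, d_c, d_d, d_x), with one block per system.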
Hybrid CR + PCR
In terms of complexity, CR is the quickest parallel algorithm. But CR has drawbacks:
In the forward reduction stage, parallelism reduces as the algorithm progresses; it suffers from a lack of parallelism towards the latter stages.
In backward substitution, parallelism increases as the algorithm progresses; but the earlier stages suffer from a lack of it.
Parallel Cyclic Reduction (PCR), on the other hand, has equal parallelism throughout.
The hybrid algorithm therefore works as follows:
Use PCR when parallelism in CR starts to die out in the forward reduction phase.
Get the output from PCR.
Now run CR's backward substitution on the obtained output of PCR.
As an example, we use normal CR on an 8-unknown system. We switch to PCR when the forward reduction of CR has reached a 4-unknown system, and then substitute the output from PCR back into the backward substitution stage of CR.
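The only tuning decision is how far to run CR before switching. A small host-side helper like the one below captures the logic: each CR forward-reduction step halves the active system until the intermediate size is reached. The function name and parameters are illustrative; the actual switch size is the one identified by benchmarking.

// Number of CR forward-reduction steps to run before switching to PCR,
// assuming systemSize and switchSize are powers of two.
int numCrSteps(int systemSize, int switchSize)
{
    int steps = 0;
    while (systemSize > switchSize) {   // each CR step halves the active system
        systemSize /= 2;
        ++steps;
    }
    return steps;   // e.g. numCrSteps(8, 4) == 1, matching the example above
}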
Table 2: Execution time of the algorithms in milliseconds, for 512 × 512 systems.
Table 3: Execution time of the algorithms in milliseconds, for 1024 × 1024 systems.
Table 4: Hybrid (CR + PCR) algorithm's percentage improvement over the other algorithms, for 512 × 512 systems.
Configuration of Systems used:
GPU: NVIDIA GeForce GTX 280
Card Type: PCI Express
CPU: Intel Core 2, 2.13 GHz
Fast solutions to a tridiagonal system of linear equations are critical for many scientific and engineering problems. They are also used in real-time or interactive applications in computer graphics, video games, and 3D game engine design. The applications of tridiagonal solvers include alternating direction implicit (ADI) methods, spectral Poisson solvers, numerical ocean models, and preconditioners for iterative linear solvers.
We studied three tridiagonal solvers that run on the GPU. We analyzed the performance of GPU programs in terms of memory access, computation and control. We have been able to show that the hybrid algorithm has the best performance. For solving 512-unknown systems, the hybrid solver achieves an improvement of 26% and 7% over CR and PCR respectively.
Fast Tridiagonal Solvers on the GPU: Yao Zhang, Jonathan Cohen, John D. Owens.