Experiences for computational biology


Chun-Yuan Lin

Assistant Professor

Department of Computer Science and Information Engineering

Chang Gung University

Experiences for computational biology on CUDA

2013/12/2


GPU Workshop

Introduction (1)



The fast-increasing power of the GPU (Graphics Processing Unit) and its streaming architecture open up a range of new possibilities for a variety of applications.



Previous work on GPGPU (General-Purpose computation on GPUs) has shown the design and implementation of algorithms for non-graphics applications (scientific computing, computational geometry, image processing, bioinformatics, etc.).



Introduction (2)



Some bioinformatics applications have been successfully ported to
GPGPU in the past.


Liu et al. (IPDPS 2006) implemented the Smith-Waterman algorithm (sequence alignment problem) to run on the nVidia GeForce 6800 GTO and GeForce 7800 GTX, and reported an approximate 16× speedup by computing the alignment scores of multiple cells simultaneously.



Charalambous et al. (LNCS 2005) ported an expensive loop from RAxML, an application for phylogenetic tree construction, and achieved a 1.2× speedup on the nVidia GeForce 5700 LE.

Introduction (3)


Sequence alignment


DNA/RNA sequences: 4-letter alphabet (ATGC, AUGC)

Protein sequences: 20-letter alphabet (or 23-letter alphabet)


High sequence similarity usually implies functional or structural similarity.




Introduction (4)

Introduction (5)

Introduction (6)

Introduction (7)

Introduction (8)

Introduction (9)

Introduction (10)

Introduction (11)


An evolutionary tree can be seen as a representation of the evolutionary histories of a set of species. It helps biologists observe existing species and evaluate their relationships in the taxonomy.

The real evolutionary histories (trees) are unknown in practice (root and internal nodes).

The majority of tree construction methods or models are based on two inputs: the sequences and the distance matrix.

However, most optimization problems for evolutionary tree construction have been shown to be NP-hard.

Introduction (12)

Introduction (13)

Introduction (14)


Liu et al. (IEEE TPDS 2007) presented a GPGPU approach to high-performance biological sequence alignment based on commodity PC graphics hardware (C++ and OpenGL Shading Language (GLSL)).

Pairwise sequence alignment (Smith-Waterman algorithm, database scan, no backtracking)

Multiple sequence alignment (MSA)
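The score-only database scan mentioned above (Smith-Waterman without backtracking) can be illustrated with a minimal Python sketch. The scoring values (`match`, `mismatch`) and the linear `gap` penalty are assumptions chosen for illustration; the cited work may use a substitution matrix and a different gap model.

```python
def sw_score(query, subject, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of query vs. subject.

    Only one previous row is kept, so the alignment itself cannot be
    recovered (no backtracking); this is enough for ranking database hits.
    Scoring values and the linear gap penalty are illustrative assumptions.
    """
    prev = [0] * (len(subject) + 1)
    best = 0
    for q in query:
        curr = [0]
        for j, s in enumerate(subject, start=1):
            diag = prev[j - 1] + (match if q == s else mismatch)
            score = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(score)
            best = max(best, score)
        prev = curr
    return best


def scan(query, database):
    """Score the query against every database sequence."""
    return {name: sw_score(query, seq) for name, seq in database.items()}
```

Because each database sequence is scored independently, the scan is naturally parallel across sequences.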

(from Liu et al. TPDS 2007) (intra-task parallel)

(from Liu et al. TPDS 2007)

(from Liu et al. TPDS 2007)

CUDA (1)


CUDA (Compute Unified Device Architecture) is an extension of C/C++ which enables users to write scalable multi-threaded programs for CUDA-enabled GPUs.

CUDA programs contain a part called a kernel, which is written as sequential code for a single thread and is executed by many threads concurrently on the GPU.

Readable and writable global memory (e.g. 1 GB). (The effective bandwidth of global memory depends heavily on the memory access pattern, e.g. coalesced access.)

Readable and writable per-thread local memory (16 KB per thread). (Access to local memory is as expensive as access to global memory.)


CUDA (2)


Read-only constant memory (64 KB, cached, 8 KB per multiprocessor). (The reading cost scales with the number of different addresses read by all threads. Reading from constant memory can be as fast as reading from a register.)

Read-only texture memory (size of global memory, cached, 8 KB per multiprocessor). (Reading from texture memory is generally faster than reading from global or local memory.)

Readable and writable per-block shared memory (16 KB per block). (Shared memory is divided into equally sized banks that can be accessed simultaneously by different threads.)

Readable and writable per-thread registers (e.g. 8192 per block). (The fastest memory.)


(Figure: CUDA memory model. Labels: Host; Grid with Global Memory, Constant Memory, and Texture Memory; Block (0, 0) and Block (1, 0), each with Shared Memory; Thread (0, 0) and Thread (1, 0) in each block, each with Registers and Local Memory.)

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007, ECE 498AL, University of Illinois, Urbana-Champaign

(from Schatz et al. BMC Bioinformatics 2007)

CUDA (3)


Some bioinformatics applications have now been successfully ported to CUDA.

Smith-Waterman algorithm (database scan, no alignment results): Manavski and Valle (BMC Bioinformatics 2008), Striemer and Akoglu (IPDPS 2009), Liu et al. (BMC Research Notes 2009)

Multiple sequence alignment (ClustalW): Liu et al. (IPDPS 2009) for Neighbor-Joining tree construction, Liu et al. (ASAP 2009)

Pattern matching (MUMmerGPU): Schatz et al. (BMC Bioinformatics 2007)


CUDA-Smith-Waterman algorithm (1)

Manavski and Valle present the first solution (a CUDA solution) based on commodity hardware that efficiently computes the exact Smith-Waterman alignment. It runs 2 to 30 times faster than any previous implementation on general-purpose hardware.

(from Schatz et al. BMC Bioinformatics 2007)

(inter-task parallel)

CUDA-Smith-Waterman algorithm (2)

Pre-compute a query profile parallel to the query sequence for each possible residue.

The CUDA implementation makes each GPU thread compute the whole alignment of the query sequence with one database sequence (the database sequences are pre-ordered by length).

The ordered database is stored in global memory, while the query profile is saved in texture memory.

For each alignment the matrix is computed column by column, in order parallel to the query sequence (the columns are stored in the local memory of the thread).
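As a rough illustration of the scheme described above, the following Python sketch builds a query profile, sorts the database by length, and scores each database sequence column by column, with one function call standing in for one GPU thread. The scoring function, the linear gap penalty, and all function names are assumptions made for this sketch, not the authors' code.

```python
def build_query_profile(query, alphabet, score):
    """profile[r][i] = substitution score of residue r against query[i]."""
    return {r: [score(r, q) for q in query] for r in alphabet}


def score_one_sequence(profile, query_len, subject, gap=-2):
    """Work of one 'thread': a full score-only alignment, column by column.

    Each column corresponds to one subject residue and runs parallel to the
    query, so a single profile row serves the whole column.
    """
    prev_col = [0] * (query_len + 1)
    best = 0
    for residue in subject:
        row = profile[residue]          # one lookup per column
        col = [0] * (query_len + 1)
        for i in range(1, query_len + 1):
            col[i] = max(0,
                         prev_col[i - 1] + row[i - 1],
                         prev_col[i] + gap,
                         col[i - 1] + gap)
            best = max(best, col[i])
        prev_col = col
    return best


def simple_score(a, b):
    """Hypothetical match/mismatch scoring used only for this sketch."""
    return 2 if a == b else -1


def scan_database(query, database, alphabet="ACGT"):
    profile = build_query_profile(query, alphabet, simple_score)
    ordered = sorted(database, key=len)  # similar lengths end up adjacent
    return [(s, score_one_sequence(profile, len(query), s)) for s in ordered]
```

Sorting by length means neighbouring threads receive sequences of similar size and therefore finish at roughly the same time.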

(from Schatz et al. BMC Bioinformatics 2007)

(no backtrack)

The GPU is able to read and write up to 128 bits of local memory with a single instruction.

(from Schatz et al. BMC Bioinformatics 2007)

CUPS: cell updates per second
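As a small worked example of this metric: one cell update is the computation of one entry of the alignment matrix, so a database scan performs (query length) × (total database residues) updates. The helper below is a hypothetical illustration that reports the rate in giga cell updates per second (GCUPS).

```python
def gcups(query_length, database_residues, seconds):
    """Giga cell updates per second for one database scan."""
    cells = query_length * database_residues
    return cells / seconds / 1e9
```

For instance, a 500-residue query scanned against 2,000,000 database residues in one second corresponds to 1 GCUPS.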

CUDA-Smith-Waterman algorithm (3)

Striemer and Akoglu further study the effect of memory organization and the instruction set architecture on GPU performance.

For both single and dual GPU configurations, Manavski utilizes the help of an Intel quad-core processor by distributing the workload among the GPU(s) and the quad-core processor.

They pointed out that the query profile in Manavski’s method has a major drawback in utilizing the texture memory of the GPU, which leads to unnecessary cache misses (when it is larger than 8 KB).

Long-sequence problem.

(inter-task parallel)

CUDA-Smith-Waterman algorithm (4)

They placed the substitution matrix in constant memory to exploit the constant cache, and created an efficient cost function to access it (the modulo operator (%) is extremely inefficient on CUDA, so no hash function is used).

The substitution matrix needs to be re-arranged in alphabetical order.

They mapped the query sequence as well as the substitution matrix to constant memory.

They calculated the SW score from the query sequence and database sequences column by column, four cells at a time due to the restrictions on the size of shared memory.
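One way to read the alphabetical re-arrangement described above is that a residue letter can then be turned into a matrix index by subtracting the character code of 'A', with no modulo or hashing involved. The Python sketch below assumes that interpretation; the flat 26 × 26 layout, the function names, and the toy scores are illustrative assumptions rather than details taken from the paper.

```python
ALPHABET_SIZE = 26          # one row and column per letter A..Z (assumed layout)
BASE = ord("A")


def build_flat_matrix(pair_scores, default=-1):
    """Store a symmetric substitution matrix in alphabetical order."""
    flat = [default] * (ALPHABET_SIZE * ALPHABET_SIZE)
    for (a, b), s in pair_scores.items():
        flat[(ord(a) - BASE) * ALPHABET_SIZE + (ord(b) - BASE)] = s
        flat[(ord(b) - BASE) * ALPHABET_SIZE + (ord(a) - BASE)] = s
    return flat


def lookup(flat, a, b):
    """Cost function: one subtraction per residue, one multiply-add."""
    return flat[(ord(a) - BASE) * ALPHABET_SIZE + (ord(b) - BASE)]


# Toy scores for three residue pairs, used only to exercise the lookup.
toy = build_flat_matrix({("A", "A"): 4, ("W", "W"): 11, ("A", "W"): -3})
```

The lookup costs only integer subtraction and a multiply-add, which is the kind of access function that avoids the modulo operator.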


CUDA-Smith-Waterman algorithm (5)

After the alignment is complete, the score is written to global memory.

They pointed out that the main drawback of the GPU is the limited on-chip memory (programs need to be designed carefully).


(from Striemer and Akoglu IPDPS 2009)

CUDA-Smith-Waterman algorithm (6)

Liu et al. proposed CUDASW++, implemented in two versions: a single-GPU version and a multi-GPU version.

The alignment can be computed in minor-diagonal order from the top-left corner to the bottom-right corner of the alignment matrix.

The optimal local alignment of a query sequence and a subject sequence is considered a task.

Inter-task parallelization: each task is assigned to exactly one thread, and dimBlock tasks are performed in parallel by different threads in a thread block.

Intra-task parallelization: each task is assigned to one thread block, and all dimBlock threads in the thread block cooperate to perform the task in parallel.
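The minor-diagonal order used for intra-task parallelization can be sketched in Python as follows. All cells with the same i + j depend only on the two previous diagonals, so on a GPU the inner loop could be spread over the threads of a block; here it simply runs sequentially. The scoring values and the linear gap penalty are assumptions for illustration.

```python
def sw_score_by_diagonals(query, subject, match=2, mismatch=-1, gap=-2):
    """Score-only Smith-Waterman, visiting cells in minor-diagonal order."""
    m, n = len(query), len(subject)
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for d in range(2, m + n + 1):            # d = i + j identifies a diagonal
        i_lo, i_hi = max(1, d - n), min(m, d - 1)
        # Cells on this diagonal are mutually independent: each reads only
        # H[i-1][j-1] (diagonal d-2) and H[i-1][j], H[i][j-1] (diagonal d-1).
        for i in range(i_lo, i_hi + 1):
            j = d - i
            sub = match if query[i - 1] == subject[j - 1] else mismatch
            H[i][j] = max(0,
                          H[i - 1][j - 1] + sub,
                          H[i - 1][j] + gap,
                          H[i][j - 1] + gap)
            best = max(best, H[i][j])
    return best
```

Diagonals near the corners contain few cells, which is one reason cooperating threads are not always fully occupied in this scheme.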


CUDA-Smith-Waterman algorithm (7)

Inter-task parallelization occupies more device memory but achieves better performance than intra-task parallelization.

Intra-task parallelization occupies significantly less device memory and therefore can support longer query/subject sequences (two-stage implementation; the threshold is set to 3,072).

In order to achieve high efficiency for inter-task parallelization, the runtime of all threads in a thread block should be roughly identical (database sequences are ordered by length).
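The two-stage idea above can be sketched as a simple partition of the length-sorted database. Whether a sequence of exactly the threshold length belongs to the first or the second stage is an assumption here (it is placed in the inter-task stage), and the function name is hypothetical.

```python
THRESHOLD = 3072  # sequence length threshold stated on the slide


def split_by_length(database, threshold=THRESHOLD):
    """Sort by length, then split into inter-task and intra-task stages."""
    ordered = sorted(database, key=len)
    inter_task = [s for s in ordered if len(s) <= threshold]  # one thread each
    intra_task = [s for s in ordered if len(s) > threshold]   # one block each
    return inter_task, intra_task
```

Because the first list is sorted, consecutive groups of dimBlock sequences have similar lengths, which keeps the per-thread runtimes within a block close to each other.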





CUDA-Smith-Waterman algorithm (8)

Coalesced subject sequence arrangement

For inter-task parallelization, sorted subject sequences are arranged in an array like a multi-layer bookcase, where all symbols of a sequence are stored in the same column from top to bottom and all sequences are arranged in increasing length order from left to right and top to bottom in the array (global memory).

Sorted subject sequences for intra-task parallelization are stored sequentially in an array, row by row from the top-left corner to the bottom-right corner.

A hash table records the location coordinate in the array and the length of each sequence, providing fast access to any sequence.
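One shelf of the "bookcase" for inter-task parallelization can be sketched as below: each sequence occupies one column, and the array is stored row by row, so the k-th symbols of neighbouring sequences sit at adjacent addresses. The padding symbol and the function names are assumptions made for this illustration.

```python
def arrange_layer(sequences, pad="-"):
    """Store one layer of sequences column-wise in a row-major flat array."""
    width = len(sequences)                    # one column per sequence
    height = max(len(s) for s in sequences)   # the longest sequence sets the depth
    flat = []
    for row in range(height):
        for col in range(width):
            seq = sequences[col]
            flat.append(seq[row] if row < len(seq) else pad)
    return flat, width


def symbol(flat, width, col, row):
    """Symbol `row` of the sequence stored in column `col`."""
    return flat[row * width + col]
```

When thread t reads symbol k of its own sequence, the threads of a group touch addresses k * width + t, which are consecutive and therefore friendly to coalescing.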



CUDA-Smith-Waterman algorithm (9)

Coalesced global memory access

During the execution of the SW algorithm, additional memory is required to store intermediate alignment data. To support much longer sequences, global memory is used to store the intermediate results.

A prerequisite for coalescing is that the words accessed by all threads in a half-warp must lie in the same segment.

For inter-task parallelization, a memory slot is allocated to each thread in a thread block and is indexed top to bottom. Access to MemSlot using the same index for all threads in a half-warp is coalesced into one or two memory transactions, depending on the compute capability of the device.
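The segment condition above can be made concrete with a simplified model that counts how many aligned segments a half-warp touches for a given index. The half-warp size of 16, the 4-byte word, and the 64-byte segment are assumptions for this sketch, and real hardware rules vary with compute capability.

```python
HALF_WARP = 16   # threads per half-warp (assumed)
WORD = 4         # bytes per word (assumed)


def segments_touched(addresses, segment_bytes=64):
    """Number of distinct aligned segments covered by the byte addresses."""
    return len({addr // segment_bytes for addr in addresses})


def interleaved_slots(index):
    """Slots interleaved across threads: thread t reads word index*16 + t."""
    return [(index * HALF_WARP + t) * WORD for t in range(HALF_WARP)]


def contiguous_slots(index, slot_words):
    """Each thread owns a contiguous slot: thread t reads word t*slot_words + index."""
    return [(t * slot_words + index) * WORD for t in range(HALF_WARP)]
```

In this model the interleaved layout touches a single segment per access, while giving each thread its own contiguous slot spreads the same access over sixteen segments.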






CUDA-Smith-Waterman algorithm (10)

For intra-task parallelization, a memory slot is allocated to each thread block and is indexed left to right. Coalesced access can be obtained using the common global memory access pattern.

(from Liu et al. BMC Research Notes 2009)

(Figure panels: Coalesced subject sequence arrangement; Coalesced global memory access)

CUDA-Smith-Waterman algorithm (11)

Cell block division method

To maximize performance and to reduce the bandwidth demand on global memory, they propose a cell block division method for inter-task parallelization, where the alignment matrix is divided into cell blocks of equal size.

A cell block is a square matrix of size n × n. If the length of the query or subject sequence is not a multiple of n, the sequence is padded with an appropriate number of dummy symbols (which are added to the scoring matrix).

However, the size of a cell block is limited by the number of registers available per thread (8 × 8 per thread).
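The padding step of the cell block division can be sketched as follows, using n = 8 as on the slide. The dummy symbol `#` and the helper names are assumptions for this illustration.

```python
def pad_to_multiple(seq, n=8, dummy="#"):
    """Append dummy symbols until len(seq) is a multiple of n."""
    remainder = len(seq) % n
    if remainder == 0:
        return seq
    return seq + dummy * (n - remainder)


def count_cell_blocks(query, subject, n=8):
    """Number of n x n cell blocks covering the padded alignment matrix."""
    q = pad_to_multiple(query, n)
    s = pad_to_multiple(subject, n)
    return (len(q) // n) * (len(s) // n)
```

For example, a 10-residue query and a 20-residue subject are padded to 16 and 24 symbols, giving 2 × 3 = 6 cell blocks.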



CUDA-Smith-Waterman algorithm (12)

Constant memory is exploited to store the gap penalties, the scoring matrix and the query sequence. (In their implementation, sequences of length up to 59K can be supported.)


(a single-GPU NVIDIA GeForce GTX 280 graphics card and a dual-GPU GeForce GTX 295 graphics card)

(from Liu et al. BMC Research Notes 2009)

CUDA-Multiple sequence alignment

Liu et al. present MSA-CUDA, a parallel MSA program, which parallelizes all three stages of the ClustalW processing pipeline using CUDA.

Pairwise distance computation:

a forward score-only pass using the Smith-Waterman (SW) algorithm

a reverse score-only pass using the SW algorithm

a traceback computation pass using the Myers-Miller algorithm

they have developed a new stack-based iterative implementation (CUDA did not support recursion at the time)

as in the work of Liu et al. (BMC Research Notes 2009)

Neighbor-Joining trees: as in the work of Liu et al. (IPDPS 2009)

reconstruction of the unrooted NJ tree

rooting the NJ tree and computing sequence weights

Progressive alignment: conducted iteratively in a multi-pass way.
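The stack-based iterative implementation mentioned for the traceback pass replaces recursion with an explicit stack. The toy Python sketch below is not the Myers-Miller algorithm itself; it only shows the general transformation on a divide-and-conquer routine that splits a range at its midpoint, with both versions visiting the subproblems in the same order. All names are hypothetical.

```python
def split_recursive(lo, hi, out):
    """Reference version: record midpoints using ordinary recursion."""
    if hi - lo < 2:
        return
    mid = (lo + hi) // 2
    out.append(mid)
    split_recursive(lo, mid, out)
    split_recursive(mid, hi, out)


def split_iterative(lo, hi):
    """Same traversal with an explicit stack instead of recursive calls."""
    out = []
    stack = [(lo, hi)]
    while stack:
        lo, hi = stack.pop()
        if hi - lo < 2:
            continue
        mid = (lo + hi) // 2
        out.append(mid)
        stack.append((mid, hi))   # pushed first, so it is handled second
        stack.append((lo, mid))   # pushed last, so it is handled next
    return out
```

Pushing the right half before the left half preserves the order in which the recursive version would solve the subproblems.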

(from Liu et al. ASAP 2009)

CUDA-Pattern matching (1)

Exact or approximate string matching problem:

Given a query string P of length m, a text string T, and a distance k (k is 0 for the exact string matching problem), find all substrings t of T that are within distance k from P.

More than a million query strings for a practical application.
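The problem statement above can be made concrete with a naive reference implementation. The slide does not say which distance is meant, so this sketch assumes the Hamming distance (mismatches only, substrings of length m); an edit-distance variant would also allow insertions and deletions.

```python
def find_matches(P, T, k=0):
    """Start positions of substrings of T within Hamming distance k of P."""
    m = len(P)
    hits = []
    for start in range(len(T) - m + 1):
        window = T[start:start + m]
        mismatches = sum(1 for a, b in zip(P, window) if a != b)
        if mismatches <= k:
            hits.append(start)
    return hits
```

With k = 0 this reduces to exact matching. Running it for millions of query strings is what motivates indexing the text and processing queries in parallel.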



CUDA-Pattern matching (2)

Schatz et al. proposed MUMmerGPU, an open-source high-throughput parallel pairwise local sequence alignment program (exact sequence alignment) that runs on commodity Graphics Processing Units (GPUs) in common workstations.

MUMmerGPU uses the new Compute Unified Device Architecture (CUDA) from nVidia to align multiple query sequences against a single reference sequence stored as a suffix tree.

(from Schatz et al. BMC Bioinformatics 2007)

CUDA-Pattern matching (3)

First, a suffix tree of the reference sequence is constructed on the CPU using Ukkonen's algorithm and transferred to the GPU (when the reference suffix tree, query sequences, and output buffers fit on the GPU).

For larger references, MUMmerGPU builds k smaller suffix trees from overlapping segments of the reference. The suffix tree is "flattened" into two 2D textures, the node texture and the child texture (32 × 32).

The queries are read from disk in blocks that will fill the remaining (global) memory, concatenated into a single large buffer (separated by null characters), and transferred to the GPU. An auxiliary 1D array, also transferred to the GPU, stores the offset of each query in the query buffer.
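The query buffer and its offset array described above can be sketched in a few lines of Python. The trailing null terminator after the last query and the function names are assumptions made for this illustration.

```python
def pack_queries(queries):
    """Concatenate queries into one null-separated buffer plus an offset array."""
    offsets = []
    position = 0
    for q in queries:
        offsets.append(position)
        position += len(q) + 1            # +1 for the null separator
    buffer = "".join(q + "\0" for q in queries)
    return buffer, offsets


def get_query(buffer, offsets, i):
    """Recover query i, as a thread would from its offset."""
    start = offsets[i]
    end = buffer.index("\0", start)
    return buffer[start:end]
```

A single large buffer needs only one host-to-device transfer, and the offset array lets each thread locate its own query without scanning the buffer.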

(from Schatz et al. BMC Bioinformatics 2007)

(k smaller suffix trees)

CUDA-Pattern matching (4)

Then the query sequences are transferred to the GPU and are aligned to the tree on the GPU using the alignment algorithm.

Each multiprocessor on the GPU is assigned a subset of queries to process in parallel, depending on the number of multiprocessors and processors available (inter- and intra-task parallel).

The data reordering scheme attempts to increase the cache hit rate for a single thread (alphabetical order).

Alignment results are temporarily written to the GPU's memory (global memory), and then transferred in bulk to host RAM once the alignment kernel is complete for all queries (the alignments are printed by the CPU).


(from Schatz et al. BMC Bioinformatics 2007)

The time for building the suffix tree, reading queries from disk, and printing alignment output is the same regardless of whether MUMmerGPU ran on the CPU or the GPU.

(from Schatz et al. BMC Bioinformatics 2007)