High-throughput sequence alignment using Graphics Processing Units


Dec 2, 2013



Michael C Schatz, Cole Trapnell, Arthur L Delcher, Amitabh Varshney

UMD


Presented by Steve Rumble

Motivation


NGS technologies produce a ton of data


AB SOLiD: 22e6 25-mers

Others are even worse…

How does 200e6 50-mers sound?



Algorithms have been pushed hard, but typically assume the same workstation CPU

Wozniak and others showed Smith-Waterman (S-W) could be parallelised well on special hardware.


What of other algorithms/hardware?

Motivation


GPUs have recently evolved general purpose programmability (GPGPU)



E.g.: nVidia 8800 GTX


16 multiprocessors


8 processors each


=> 128 stream processors


768MB onboard


1.35GHz clock


Almost a year old now…

Short GPU Overview


Highly parallel execution (hundreds of simultaneous operations)

Hundreds of gigaflops per chip!

Large on-board memories (up to 2GB)



Limitations:


No recursion (no stacks)


Each multiprocessor’s constituent processors execute the same instruction

Thread divergence due to conditionals hurts…

No direct host memory access

Small caches (locality is key)


High memory latency


No dynamic memory allocation (why one would ever do that, I don’t know)
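To make the lockstep limitation concrete, here is a toy cost model. It is a sketch only: `lockstep_cost`, the group of 8 threads, and the per-branch costs are illustrative assumptions, not figures from the paper or from nVidia. It assumes a group of processors sharing one instruction stream has to step through every branch that at least one of its threads takes.

```python
def lockstep_cost(branch_taken, branch_cost):
    """Toy model (assumed, simplified): a lockstep group pays for every
    distinct branch that at least one of its threads takes."""
    return sum(branch_cost[b] for b in set(branch_taken))

# Hypothetical per-branch costs, in arbitrary time units.
cost = {"match": 4, "mismatch": 6}

uniform = ["match"] * 8                    # all 8 threads agree
divergent = ["match"] * 7 + ["mismatch"]   # a single thread diverges

print(lockstep_cost(uniform, cost))    # 4
print(lockstep_cost(divergent, cost))  # 10
```

Under this model one divergent thread makes the whole group pay for both branches, which is why conditional-heavy code such as tree walking is a worry.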

Short GPU Overview


GPGPU environments



Previously had to reduce problems to graphics primitives… no more

Simplified C-like programming

Paper has very little detail, but they make it sound enticingly simple…



Each processor runs the same ‘kernel’
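A rough mental model of "every processor runs the same kernel", written as sequential Python rather than real GPU code. `launch` and `count_gc_kernel` are hypothetical names for illustration and do not correspond to any CUDA API; a real GPU would run the per-thread calls in parallel.

```python
def launch(kernel, n_threads, *args):
    """Sequential stand-in for a GPU launch: run the same kernel once
    per thread index."""
    for tid in range(n_threads):
        kernel(tid, *args)

def count_gc_kernel(tid, reads, out):
    """Hypothetical kernel: thread `tid` handles read number `tid`
    and writes only to its own output slot."""
    out[tid] = sum(1 for base in reads[tid] if base in "GC")

reads = ["ACGT", "GGCC", "ATAT"]
out = [0] * len(reads)   # preallocated up front: no allocation inside the kernel
launch(count_gc_kernel, len(reads), reads, out)
print(out)  # [2, 4, 0]
```

The only thing that differs between threads is the index, which selects the data each one works on.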

Muh-muh-muh… MUMmer!

Maximal Unique Match

Find longest match for each subsequence of a read (of reasonable length)



Employs Suffix Trees

MUMmerGPU


Plug-and-play replacement for MUMmer

MUMmer is not ‘arithmetic intensive’


Is the GPU a good fit?



Six-step process

1) Build Suffix Tree of reference genome (Ukkonen’s alg., O(n)) on host CPU

2) Suffix Tree -> GPU Memory

3) Queries -> GPU Memory

4) Kick off the GPU…

5) Results -> Host Memory


6) Final processing on Host CPU
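The six steps can be sketched end to end in plain Python. This is an assumption-laden stand-in, not the authors' code: the "index" is just the reference text, the "GPU memory" is a bytes copy, and `match_kernel` does a naive scan where the real kernel walks a suffix tree. All function names here are hypothetical.

```python
def match_kernel(tid, device_ref, device_queries, device_out):
    """Per-thread work: longest prefix of one query found in the reference.
    Naive scan; the real kernel would walk a suffix tree instead."""
    query = device_queries[tid]
    for length in range(len(query), 0, -1):
        pos = device_ref.find(query[:length])
        if pos != -1:
            device_out[tid] = (pos, length)
            return
    device_out[tid] = (-1, 0)

def run_pipeline(reference, queries, min_len):
    host_index = reference                                  # 1) build index on host (stand-in)
    device_ref = host_index.encode("ascii")                 # 2) index -> "GPU memory"
    device_queries = [q.encode("ascii") for q in queries]   # 3) queries -> "GPU memory"
    device_out = [None] * len(queries)
    for tid in range(len(queries)):                         # 4) kick off the "GPU"
        match_kernel(tid, device_ref, device_queries, device_out)
    host_out = list(device_out)                             # 5) results -> host memory
    return [(q, pos, n)                                     # 6) final processing on host
            for q, (pos, n) in zip(queries, host_out) if n >= min_len]

print(run_pipeline("TORONTO", ["ONTA", "RON", "GG"], min_len=3))
# [('ONTA', 3, 3), ('RON', 2, 3)]
```

Note that steps 1, 2, 3, 5 and 6 are all serial host work or transfers; only step 4 benefits from the GPU.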

Suffix Trees


We want to find the longest match of a string (query) quickly

Suffix Trees permit O(m) string search, m = query length

Space complexity is O(n), n = reference length

But constants are apparently pretty big


Suffix Trees


Definition:


Node edges have a node label

A string subsequence

Non-empty (but can be terminating)

A path label is the sequence formed by traversing from root to leaf

1-1 correspondence of suffixes of S to path labels

Internal nodes have at least 2 children

n leaf nodes, one for each suffix of S

Suffix Trees


O(n) space

n leaf nodes

=> at most n - 1 internal nodes

=> n + (n - 1) + 1 = 2n nodes (worst case)

[Figure: example tree with n = 3 leaves and n - 1 = 2 internal nodes; 3 + 2 + root = 6 nodes]

Suffix Trees


Example: TORONTO$


‘$’ is terminating character

[Figure: suffix tree for TORONTO$; leaf labels 6, 4, 0, 5, 2, 3, 1]

Suffix Trees


Example: TORONTO$


Searching for ‘ONT’

[Figure: four animation frames of the TORONTO$ suffix tree (leaf labels 6, 4, 0, 5, 2, 3, 1), tracing the search for ‘ONT’]

‘ONT’ at position 3 in S
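The TORONTO$ search can be reproduced with a small compressed suffix tree. This is a minimal sketch: it builds the tree by naive O(n^2) suffix insertion rather than Ukkonen's O(n) algorithm, and `Node`, `build_suffix_tree`, `leaves` and `find` are names made up for this example. It assumes the string ends in a unique terminator such as `$`.

```python
class Node:
    def __init__(self):
        self.children = {}   # first character -> (edge label, child Node)
        self.start = None    # suffix start position; set on leaves only

def build_suffix_tree(s):
    """Naive construction: insert each suffix in turn, splitting edges."""
    root = Node()
    for i in range(len(s)):
        node, rest = root, s[i:]
        while rest:
            entry = node.children.get(rest[0])
            if entry is None:                  # no such edge: add a new leaf
                leaf = Node()
                leaf.start = i
                node.children[rest[0]] = (rest, leaf)
                break
            label, child = entry
            k = 0                              # common prefix of label and rest
            while k < len(label) and k < len(rest) and label[k] == rest[k]:
                k += 1
            if k < len(label):                 # mismatch inside the edge: split it
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[rest[0]] = (label[:k], mid)
                child = mid
            node, rest = child, rest[k:]
    return root

def leaves(node):
    """Collect the suffix start positions below a node."""
    if not node.children:
        return [node.start]
    found = []
    for _, child in node.children.values():
        found.extend(leaves(child))
    return found

def find(root, pattern):
    """Sorted start positions of `pattern`; the walk reads at most len(pattern) characters."""
    node, rest = root, pattern
    while rest:
        entry = node.children.get(rest[0])
        if entry is None:
            return []
        label, child = entry
        k = min(len(label), len(rest))
        if label[:k] != rest[:k]:
            return []
        node, rest = child, rest[k:]
    return sorted(leaves(node))

tree = build_suffix_tree("TORONTO$")
print(find(tree, "ONT"))  # [3]
print(find(tree, "TO"))   # [0, 5]
```

The walk down from the root depends only on the pattern length, which is the O(m) search property; listing all occurrences additionally costs a visit to the leaves below the end point.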

Suffix Trees


MUMmer wants to find all maximal unique matches for all suffixes:


E.g., for query ACCGTGCGTC, we want:


ACCGTGCGTC



CCGTGCGTC



CGTGCGTC



GTGCGTC






Up to some reasonable limit…



Don’t want to go back to root of tree each time…

Suffix Trees


Suffix Links


All internal, non-root nodes have a suffix link to another node

If x is a single character and a is a (possibly empty) string (subsequence), then the node v whose path-label is xa has a suffix link to the node v’ whose path-label is a.



Got that?

Suffix Trees


Example: TORONTO$


Suffix Links… Don’t backtrack (bad ex.)

[Figure: TORONTO$ suffix tree with suffix links; leaf labels 6, 4, 0, 5, 2, 3, 1]

Suffix Trees


Example: BANANA$


Better example of Suffix Links

[Figure: suffix tree for BANANA$ with suffix links; leaf labels 1, 0, 5, 3, 2, 4]

Suffix Trees


Example: BANANA$


Searching for suffixes of ‘ANANA’

[Figure: six animation frames of the BANANA$ suffix tree (leaf labels 1, 0, 5, 3, 2, 4), tracing the search for suffixes of ‘ANANA’; in the last two frames the query is split as ‘A’ + ‘NANA’]
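The BANANA$ walk-through can be imitated in code. To keep it short, this sketch swaps in an uncompressed suffix trie (O(n^2) space) for the compressed suffix tree, because suffix links are trivial to compute there. It reports the longest match starting at each query position and does not check uniqueness, so it is not MUMmer's full maximal-unique-match logic. All names are hypothetical.

```python
from collections import deque

def build_suffix_trie(s):
    """Uncompressed suffix trie with suffix links (teaching stand-in only)."""
    children = [{}]   # children[v]: char -> node id; node 0 is the root
    depth = [0]       # depth[v] = length of the path label of v
    for i in range(len(s)):
        v = 0
        for ch in s[i:]:
            if ch not in children[v]:
                children.append({})
                depth.append(depth[v] + 1)
                children[v][ch] = len(children) - 1
            v = children[v][ch]
    # The node spelling x+a links to the node spelling a.
    link = [0] * len(children)
    queue = deque([0])
    while queue:                       # BFS: a parent's link is set before its children's
        v = queue.popleft()
        for ch, c in children[v].items():
            link[c] = 0 if v == 0 else children[link[v]][ch]
            queue.append(c)
    return children, depth, link

def matching_lengths(s, query):
    """Longest match with s starting at each query position, following
    suffix links instead of restarting from the root."""
    children, depth, link = build_suffix_trie(s)
    result, v, j = [], 0, 0            # j: end (exclusive) of the matched part
    for i in range(len(query)):
        while j < len(query) and query[j] in children[v]:
            v = children[v][query[j]]
            j += 1
        result.append(depth[v])
        if v == 0:
            j = i + 1                  # nothing matched here: skip this character
        else:
            v = link[v]                # drop the first character, keep the rest
    return result

print(matching_lengths("BANANA$", "ANANA"))  # [5, 4, 3, 2, 1]
```

After ‘ANANA’ is matched, the suffix link lands directly on the node for ‘NANA’, so the characters already matched are never re-read.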

Memory Limitations


Suffix trees take up a fair bit of memory

GPUs have 100s of MBs, but this is still small

Divide the target sequence into ‘k’ segments with overlaps
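A minimal sketch of the segmenting idea. The function name, the ceiling-division segment size, and the choice of overlap are my assumptions; the slides do not say how the overlap is chosen. An overlap of at least (longest match of interest - 1) characters guarantees that any such match lies wholly inside some segment.

```python
def split_with_overlap(reference, k, overlap):
    """Cut `reference` into at most k segments; every segment is extended
    by up to `overlap` characters into its successor."""
    n = len(reference)
    step = -(-n // k)                # ceiling division: base segment length
    segments = []
    for start in range(0, n, step):
        end = min(n, start + step + overlap)
        segments.append((start, reference[start:end]))
    return segments

for offset, seg in split_with_overlap("ACGTACGTAC", k=3, overlap=2):
    print(offset, seg)
# 0 ACGTAC
# 4 ACGTAC
# 8 AC
```

Keeping the start offset with each segment lets the host translate per-segment match positions back to reference coordinates.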

Cache Optimisation


Memory latency high, cache performance crucial

We’re walking a tree here, not crunching numbers down an array

Can store read-only data in 2D textures; nVidia caching scheme optimises access

Re-order and squish tree nodes into ‘texel blocks’ such that:

Nodes near root are level-ordered (BFS)

Nodes further down are ordered with descendants
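The two ordering rules can be sketched as a function that produces a storage order for the nodes. This only illustrates the idea; the cut-off `bfs_levels`, the pre-order choice for the deeper subtrees, and the example tree are assumptions, not the authors' tuned layout.

```python
from collections import deque

def layout_order(children, root, bfs_levels):
    """Storage order: level order (BFS) for the top `bfs_levels` levels,
    then each remaining subtree laid out contiguously (DFS pre-order)
    so that descendants sit near their ancestors."""
    order, deep_roots = [], []
    frontier = deque([(root, 0)])
    while frontier:
        node, level = frontier.popleft()
        if level >= bfs_levels:
            deep_roots.append(node)
            continue
        order.append(node)
        for child in children.get(node, []):
            frontier.append((child, level + 1))
    for sub in deep_roots:
        stack = [sub]                 # explicit stack, no recursion
        while stack:
            node = stack.pop()
            order.append(node)
            stack.extend(reversed(children.get(node, [])))
    return order

# Small hypothetical tree, given as node -> list of children.
tree = {0: [1, 2], 1: [3, 4], 2: [5, 6], 3: [7, 8], 4: [9, 10]}
print(layout_order(tree, 0, bfs_levels=2))
# [0, 1, 2, 3, 7, 8, 4, 9, 10, 5, 6]
```

Consecutive positions in this order would then be packed into texel blocks, so a walk from node 3 to its children 7 and 8 touches neighbouring memory.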

Cache Optimisation

[Figure: a tree with nodes numbered 1-29, alongside a 2D texture grid with cells numbered:]

0 2 4 6 8 10 12 14
1 3 5 7 9 11 13 15
16 18 20 22 24 26 28 30
17 19 21 23 25 27 29 31



Texture cache organized in 2x2 blocks.



Try to place all children of a node in the same cache block


Shamelessly cribbed from:

http://www.cbcb.umd.edu/software/cmatch/FastExactStringMatching.ppt

Cache Optimisation


Reference Sequence stored in 4x2^16 blocks of a 2D array

Sequence: A B C D E F G H …

……….
A E
B F
C G
D H
……….
α Φ
β Χ
Γ Ψ
Δ Ω

Why? It worked well.
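The column-wise layout in the figure corresponds to a simple index mapping. This sketch assumes blocks of 4 rows by 2^16 columns (my reading of the slide) and that blocks simply follow one another; `block_coords` is a hypothetical helper, not a function from MUMmerGPU.

```python
def block_coords(i, rows=4, cols=2**16):
    """Map linear sequence index i to (block, row, column), filling each
    block column by column, `rows` characters per column."""
    block, within = divmod(i, rows * cols)
    col, row = divmod(within, rows)
    return block, row, col

for i, ch in enumerate("ABCDEFGH"):
    print(ch, block_coords(i))
# A (0, 0, 0)
# B (0, 1, 0)
# C (0, 2, 0)
# D (0, 3, 0)
# E (0, 0, 1)
# F (0, 1, 1)
# G (0, 2, 1)
# H (0, 3, 1)
```

Characters A-D share a column and E-H the next one, matching the A E / B F / C G / D H picture.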

Cache Optimisation


Memory layouts heuristically determined

nVidia cache details not public

Cache optimisation improves execution speed ‘by several fold’.

Conclusions


GPGPU isn’t just good for ‘arithmetic intensive’ applications

5-11x speed-up for NGS data

Conclusions


Fine Print:


5-11x is for the Suffix Tree kernel on the GPU

Reality is different!

3.5x speed-up for real data in terms of total application runtime.

Pretty constant across read lengths (35-700+ bp)

Careful management of memory layout is crucial

Authors claim several-fold performance increase (could be difference between some improvement and none)

Conclusions


Runtime dominated by serial parts of MUMmer
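Amdahl's law shows why a 5-11x kernel speed-up can shrink to a few-fold overall gain. The 80% accelerated fraction below is purely an illustrative assumption; the slides give no breakdown of where MUMmer's runtime goes.

```python
def overall_speedup(parallel_fraction, kernel_speedup):
    """Amdahl's law: only `parallel_fraction` of the runtime gets faster."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / kernel_speedup)

# Illustrative only: assume 80% of the original runtime is in the kernel.
for s in (5, 11):
    print(s, round(overall_speedup(0.8, s), 2))
# 5 2.78
# 11 3.67
```

With these assumed numbers even an infinitely fast kernel could not beat 1 / (1 - 0.8) = 5x, because the serial 20% stays untouched.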

Food for Thought


8800 GTX costs ~$400, uses 100-150 watts

Quad Core 2 chip runs ~$250, uses 100-130 watts

Each core approx. 2x faster than their test CPU

MUMmerGPU maximally 3.5x faster than test CPU


What have we won here?
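A back-of-envelope version of that comparison, using only the numbers on this slide. The big assumption, labeled in the code, is that reads are independent and the quad-core scales perfectly across its four cores; `speedup_per_kilodollar` is just a helper for this sketch.

```python
def speedup_per_kilodollar(speedup, price_usd):
    """Speed-up over the test CPU bought per $1000 of hardware."""
    return speedup * 1000 / price_usd

gpu_speedup = 3.5     # MUMmerGPU vs the test CPU (from the slide)
core_speedup = 2.0    # one newer core vs the test CPU (from the slide)
cores = 4

# Assumption: perfect scaling across cores (independent reads).
quad_speedup = cores * core_speedup

print(quad_speedup)                               # 8.0
print(speedup_per_kilodollar(quad_speedup, 250))  # 32.0
print(speedup_per_kilodollar(gpu_speedup, 400))   # 8.75
```

Under that optimistic scaling assumption the CPU comes out ahead on both raw speed and speed per dollar, which is the point of the question.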

Food for Thought


Confusing reports



“Fast Exact String Matching on the GPU” (Schatz, Trapnell) claims up to 35x improvement

Earlier course paper (early/mid-2007)

Why from 35x down to 5-11x with MUMmerGPU?

My Impressions…


(…whatever they’re worth)



GPU is not a clear win (in this case)


Suffix trees seem unsuited:


Cache locality trouble


O(n) footprint, but multiplicative constants are still substantial

Host CPUs seem to be as good or better (in $ and watts)


My Impressions…


GPGPUs aren’t a great fit here



At least for this algorithm…



MUMmerGPU isn’t the order-of-magnitude win it claims to be

But this is a first-generation, general-purpose chip

geared toward number-crunching, not pointer-traversing

I don’t think we’ve seen the last (nor the best) of GPUs…