Genome assembly

voraciousdrabSoftware and s/w Development

Dec 14, 2013 (3 years and 9 months ago)

554 views

1


Module
5
. Genome assemb
ly:
Assembly
algorithms

Background

Three types of assembly algorithms include Greedy, Overlap/Layout/Consensus (OLC) and De Bruijn
Graphs See M
iller et al. (2010) for detail.

Greedy algorithms work by, for any given contig, using the
next highest
-
scoring overlap to join next. Scoring algorithms may vary. Greedy algorithms do not
“reconsider” whether a read would help another contig more, after adding to an earlier contig.
These
are fast algorithms but can miss the optimal contig construction. We won’t be using a Greedy assembler
here. OLC assemblers perform all
-
vs
-
all pairwise read comparison using shared kmers as alignment
seeds. Overlap graphs are constructed and mult
iple sequence alignments used to refine alignment
(Miller et al. 2010). The all
-
vs
-
all step is particularly computationally intensive as there are [n(n
-
1)]/2
pairwise comparisons among reads, and scores for each alignment must be stored.


Reflection Q:

How many pairwise comparisons are there among 100 million reads?


The Newbler assembler is a widely used OLC assembler used predominantly for Roche 454 data
(Margulies et al. 2005). First unitigs are generated from reads using the OLC method. Unitigs ar
e
contigs that do not overlap with reads in other unitigs. The unitigs seed generation of larger contigs that
are joined based on pairwise overlap among unitigs. Unitigs may be split in the process if their
beginning and ends align to different contigs.
On the 454 platform multiple calls of a nucleotide are
made at the same time and produce a light intensity that is proportional to the number of bases in a
row. As the number of identical bases repeated in a stretch (i.e. homopolymer like AAAAAAAAAA)
incr
eases, it becomes increasingly difficult for 454 technologies to resolve the exact number of repeats.
Consensus of the raw signal strength from multiple aligning nucleotides is used to help resolve
homopolymer length (Miller et al. 2010).


De Bruijn graphi
ng techniques are described in the figure below (Compeau et al. 2011).

2




F
IGURE
1
.

F
ROM
C
OMPEAU ET AL
.

2011


The
figure below shows how higher kmer size can result in fragmented assemblies if coverage isn’t high
enough. It also
shows that low kmer size may result in ambiguities.
3



F
IGURE
2
.

F
ROM
HTTP
://
GCAT
.
DAVIDSON
.
EDU
/
PHAST
/
DEBRUIJN
.
HTML
.

A)

E
XAMPLE READS FOR THE

CONSENSUS SEQUENCE S
HOWN
.

B)

L
ARGE KMERS
(5
BP
)

FROM THESE READS CAN

CAUSE FRAGMENTATION
(
AN UNRESOLVED GRAPH
)

IF NOT ALL KMERS ARE

PRESENT
.

T
HE READ IN RED IS TH
ROWN OUT
,

AND THE KMER
ATTAG

IS NOT CONNECTED
.

C)

S
MALL KMERS
(4
BP
)

MAY COVER THE WHOLE
GENOME BUT RESULT IN

PATH AMBIGU
ITIES
.

R
EADS IN GREEN CAN BE

USED TO RESOLVE THE
GRAPH
,

CAUSING THE RESOLUTI
ON
PLOTTED IN RED IN
(C).


Manual Exercise

Given the following set of 3bp reads {ATG, CAT, TGC, GCA}, construct a genome using a De Bruijn graph
with kmer size 2.


Errors cause m
any kmers to be affected, and cause “bulges” in de Bruijn graphs.

4




5



F
IGURE
3
.

M
ILLER ET AL
.

2010.


F
IGURE
4
.

M
ILLER ET AL
.

2010

6



SOAPden
ovo

SOAPd
enovo

(Li et al. 2010; Luo et al 2012)

is comprised of 4 distinct commands that typically run at the
same time.

P
regraph: construct kmer
-
graph

C
ontig: eliminate errors and output contigs

M
ap: map reads to contigs

S
caff: construct scaffolds

A
ll: do all of the above in turn

Error correction
in

SOAPdenovo
itself

includes calculating kmer frequencies and filtering kmers below a
certain frequency, correcting bubbles, and frayed robe patterns. It creates DeBruijn graphs and to
create scaffolds, maps all paired reads to contig consensus sequences,
including reads not used in the
graph.

Below we will try assembly with all SOAPdenovo modules, on both the raw data and Quake
-
corrected data.

Goals



De

novo

g
enome assembly in
L
inux
OS



Effects of key variables on assembly quality



Measuring assembly quality



Checking bioinformatic results against a standard

V&C core competencies addressed



1)
A
bility t
o apply the process of science:
Observational strategies, Hypothesis testing, Experimental
design, Evaluation of experimental evidence, Developing
problem
-
solving strategies

2) Ability to use quantitative reasoning:
Developing and interpreting graphs, Applying statistical

methods to diverse data,

Mathematical modeling,
Managing and analyzing large data sets

3)
U
se modeling and simulation to understan
d complex biological

systems:
Computational

modeling of dynamic systems, Applying informatics tools, Managing and analyzing large data sets,
Incorporating stochasticity into biological models

7



GCAT
-
SEEK sequencing requirements

any

Computer/program requirem
ents for data analysis

Linux OS,
SOAPdenovo
2

Optional Cluster: Qsub

If starting from Window OS: Putty

If starting from Mac
or Linux
OS: SSH


Protocols

Genome a
ssembly using SOAPdenovo
2

in the Linux environment

We will perform genome assembly on the HHMI
cluster. We will start off trying to repeat some work
that was published in the Genome Assembly Gold
-
standard Evaluations (GAGE) project (
Salzberg et al.
2012).

Our work will focus on bacterial genomes for purposes of brevity, but
this work, in this
co
mputing environment, can

be applied to even mammalian sized genomes (>1Gb). You now have 4
bacterial genome datasets uploaded in two forms, raw and quality filtered/error corrected. The raw
data includes a paired end fragment libraries of 101bp reads,
an “insert length” of 180bp, in “innie”
orientation, with 1,294,104 reads providing 45x genome coverage (two files: frag_1.fastq, frag_2.fastq).
The raw data also includes two shorter read (37bp) mate
-
paired jumping libraries with an “insert length”
of 35
00bp, in “outie” orientation, with 3,494,070 reads providing another 45x genome coverage (two
files: shortjump_1.fastq, shortjump_2.fastq).

A.

Log onto your Linux home directory on
the cluster
.

B.

Make a directory entitled, soap, check that it is there, and m
ove into it.

$mkdir soap

$ls

$cd soap


C.

Make the config file using nano. This
will tell Soap which files to use, where you can enter
important characteristics of the data.

Below y
ou can skip the comment lines starting with “#”.
When finished, hit
control
-
X, and you will be prompted to save the file.

$nano

___________________________

#maximal read length

8


max_rd_len=101

#below starts a new library

[LIB]

#average insert size

avg_ins=180

#if sequence needs to be reversed
put 1, otherwise 0

reverse_seq=
0

#in which part(s) the reads are used. Flag of 1 means only contigs.

asm_flags=1

#use only first 101 bps of each read

rd_len_cutoff=101

#in which order the reads are used while scaffolding
. Small fragments usually first, but
worth playing with.

rank=1

#

cutoff of pair number for a reliable connection (at least 3 for short insert size)
. Not sure
what this means.

pair_num_cutoff=3

#minimum aligned length to contigs for a reliable read location (at least 32 for short insert
size)

map_len=32

#a pair of fastq

file, read 1 file should always be followed by read 2 file

q1=frag_1.cor.fastq

q2=frag_2.cor.fastq

#now you will enter the information for the mate pair library

[LIB]

avg_ins=3500

reverse_seq=1

asm_flags=2

rank=2

9


# cutoff of pair number for a reliable con
nection (at least 5 for large insert size)

pair_num_cutoff=5

#minimum aligned length to contigs for a reliable read location (at least 35 for large insert
size)

map_len=35

q1=shortjump_1.cor.fastq

q2=shortjump_2.cor.fastq

Control
-
X to exit and save the
file as “config.txt”

D.

Move

your

Quake correct
ed

fastq files into this directory.

First navigate into the Quake folder
and then move the files.

$mv *.cor.fastq ../soap/


E.

Copy your Qsub script to your current

directory and
edit
it
for running SOAP
.
Please u
se the low
memory (63mer) version

of SOAPdenovo, a kmer (
-
K) of 31, 16

processors (
-
p), the config file
you just made (
-
s), and give all output files the prefix asm (
-
o).

Use the following text at the end
of your example Qsub control file.

NOTE: If you s
imply type the following into the command
prompt, you will not be running Qsub.

Make sure to save.

$SOAPdenovo
-
63mer all
-
K 31
-
p 16

-
s config.txt
-
o asm

Then r
un the Qsub script
using:

$qsub

p 100 NameofYourQsubScript

F.

Run GapCloser

using Qsub

with the following instructions at the end of the Qsub script
.
You will
use

the

config file you made above (
-
b), fill

gaps

in the sequence file asm.scafSeq (
-
a), make the

output file asm2.scafSeq, use 16 processers (
-
t 16
), and make
sure

there is an ovel
ap of 31 nt
before gap filling (
-
p).


$GapCloser
-
b config.txt
-
a
asm.scafSeq
-
o asm2.scafSeq
-
t 16

-
p 31

G.

Examine directory contents using
$ls

H.

$
Less

the
*.scafSeq

file. It will have the scaffold result files.

I.

$Less the
*.scafStatistics

file. This will co
ntain detailed information on scaffold
assembly.


Size_includeN

Total size of assembly in scaffolds, including Ns

Size_withoutN

Total size of assembly in scaffolds, not including Ns

Scaffold_Num

Number of scaffolds

Mean_Size


Mean size of scaffolds

10


Median_
Size


Median size of scaffolds

Longest_Seq


Longest scaffold

Shortest_Seq

Shortest scaffold

Singleton_Num

Number of singletons

Average_length_of_break(N)_in_scaffold Average length of unkno
wn nucleotides (N) in scaffolds

Also contained will be counts of
scaffolds above certain sizes, percent of each nucleotide and N (Gap)
values, and “N statistics.” An N50 is the size of the smallest scaffold such that 50% of the genome is
contained in scaffolds of size N50 or larger (Salzberg et al. 2012). A line here s
howing “N50 35836 28”
indicates that there are 28 scaffolds of at least 35836 nucleotides, and that they contain at least 50% of
the overall assembly length. Statistics for contigs (pre
-
scaffold assemblies) are also shown.

Assessment

Fill in the table
below and compare to values reported from the GAGE paper.

Q. Given the total number of scaffolds and the number greater than 100bp, how many were less than
100bp?

Note that the GAGE paper throws out “chaff” contigs that are less than 200bp in length and
we did not

do that. GAGE also used a SOAP
denovo module called GapCloser, which will use read data to substitute
Ns with bp data between reads used to generate scaffolds. This would not change the length and
number of the scaffolds or contigs, however. W
e are also using a later version of SOAPdenovo which
contructs fewer chimeric or misjoined scaffolds.


11



A.

Now re
-
run the analysis using the raw reads instead of Quake Filtered reads, without using the
jumping library, and with different kmer parameter choic
es to examine effects on some key
assembly statistics.

Time line of module

Two hours

Discussion topics for class

Discuss

effect
s

of changing each p
arameter on quality of assembly.

Relevant lecture topics include genome structure and sequencing, correctio
n of errors in sequence
reads, genome assembly approaches
,
and cited literature
.

References

Literature Cited

Compeau PEC, Pevzner PA, Tesler G. 2011. How to apply de Bruijn graphs to genome assembly. Nature
Biotechnology. 29:987
-
991.

Margulies M, Egholm
M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen Y
-
J,
Chen Z et al . 2005. Genome sequencing in open fabricated high density picoliter reactors. Nature 437:
376

380.

Miller JR, Koren S, Sutton G. 2010. Assembly algorithms for ne
xt
-
generation sequencing data.
Genomics 95: 315
-
327.

Salzberg SL, Phillippy AM, Zimin A, et al. 2012. GAGE: a critical evaluation of genome assemblies and
assembly algorithms. Genome Res 22: 557
-
567. [important online supplements!]