Supplementary Methods. PAGIT: two worked examples

brawnywinderΛογισμικό & κατασκευή λογ/κού

13 Δεκ 2013 (πριν από 3 χρόνια και 10 μήνες)

113 εμφανίσεις

Supplementary Methods.

PAGIT: two worked examples

Here we give a synopsis of the
work
-
flow used for the two examples discussed in the “Anticipated
Results” section.

E. coli

data and initial
assembly.

The reads for
Escherichia coli

K
-
12 strain MG1655 were obtained from the Short Read Archive at the
NCBI (run ERR008613 under study ERP000092). They were prepared by Illumina as a test data set for
development of de novo sequence assembly algorithms and related tools. They consisted of
101 bp
paired
-
end Illumina reads.


The NCBI was searched for “Escherichia coli” in order to find the identifier of a suitable reference
genome: “U00096” was found for strain K
-
12 substr. MG1655, and the sequence file U00096.fna was
downloaded. The EBI was

then searched with “U00096” and the annotation in EMBL format was
downloaded.


Before applying the
PAGIT
protocol
,

we performed a
de novo

assembly. Velvet version 1.1.06
compiled to support
k
-
mers of up to 99 bp was used to generate an initial assembly
.

T
he reads came
in two files (“s_6_1.fastq” and “s_6_2.fastq”) which we first “shuffled” together using the
appropriate Velvet perl script
. W
e then ran

velveth

, followed by

velvetg


using a number of
different
k
-
mers. The best results were obtained with

a
k
-
mer of 81. A synopsis of the commands we
used is as follows (for more detail please see the Velvet manual):


/path/to/velvet/shuffleSequences_fastq.pl
s_6
_1.fastq
s_6
_2.fastq
shuffled.fastq

/path/to/velvet/velveth K81 81
-
shortPaired
-
fastq shuffled.f
astq

/path/to/velvet/velvetg K81
-
exp_cov
auto


The result of this

assembly was 182 scaffolds with a combined sum of 4.58 Mb, a scaffold N50 of
70.3 Kb and an average size of 25.2 Kb. The largest scaffold was 209.8 Kb.


Applying PAGIT to the
E. coli

examp
le.

ABACAS was executed with the following command:


abacas.pl
-
b
-
m
-
r U00096.fna
-
q contigs.fa
-
p nucmer
-
o U96mapped


After joining the results (ordered contigs and bin),
IMAGE was initially executed with the following
command:


image.pl
-
scaffolds
ABACAS.fasta
-
prefix s_6
-
iteration 1
-
all_iteration 1
-
dir_prefix ite
-
kmer 81


The initial IMAGE run took about 4 hours to complete. The following commands were then executed
one after another
(they were
past
ed

into a file,
the file was made executable a
nd then executed)
.

For these runs, each iteration took about 20 minutes:


restartIMAGE.pl ite3 71 3 partitioned

restartIMAGE.pl ite6 61 3 partitioned

restartIMAGE.pl ite9 51 3 partitioned

restartIMAGE.pl ite12 41 3 partitioned

restartIMAGE.pl ite15 31 3

partitioned


The following command was used to execute ICORN:


icorn.start.sanger.sh scaffolds18.fa 1 6 s_6_1.fastq s_6_2.fastq
400,800 600


The following command was used to execute RATT:


$RATT_HOME/start.ratt.sh ./EMBL
scaffolds18.fa.7 ratted
Strain.G
l
obal


T
o generate
Figure 6

t
he following command
s were used to first create a BLAST comparison file, and
then to load the files into ACT
:


formatdb
-
p F
-
i
U00096.fna


blastall
-
p blastn
-
m 8
-
e 1e
-
40
-
d

U00096.fna

-
i
Sequences/

ordered_gi.48994873

-
o co
mp.blast

act EMBL/ U00096.embl comp.blast
ratted.ordered_gi.4
8994873.gb.U00096.2..final.embl


Initially
ACT

displayed the reference genome “U00096.embl” on top with the newly annotated
genome below, and syntenic blocks between them indicated in red. Subsequently, the original entry
“U00096.embl” was disabled and the file with non
-
transferred annotations,
“ratt.
U00096.NOTTransfered.embl”, was loaded in its place (this was performed via the following
ACT menus: Entries
-
> U00096.embl
-
> disable U00096.embl).

C. Trachomatis

data

The
data used here

(EMBL files, 454 assembly and the Illumina reads
) can be found on t
he PAGIT
website:
ftp://ftp.sanger.ac.uk/pub4/resources/software/pagit/PAGIT.ChlamydiaExample.tgz



We took 15% of the reads

to get coverage of around 100x

from

the library ERS001402

(available at
the NCBI in the Short Read Archive)
, which is a 54 bp Illumina
paired
-
end

library with around 40% of
mouse contamination.


T
he reference genome (AM884177) and its pla
s
mid (AM886279)
were downloaded
from
the EBI. We
saved the two EMBL

file
s into two different directories


dir1


and

dir2

. Next we extracted the
FASTA

sequence for each EMBL file with:


perl $PAGIT_HOME/RATT/main.ratt.pl Embl2Fasta dir1 AM884177.fasta

perl $PAGIT_HOME/RATT/main.ratt.pl Embl2Fasta dir2
AM886279.fasta


Applying PAGI
T to the
C. Trachomatis

example

We generated a working directory for

ABACAS and linked the
FASTA files

into it. As we have two
reference sequence
s
, we first mappe
d the contigs against the nuclear

reference
genome, and
then
orde
r
ed the contigs in
the bin against the plasmid genome:


ln
-
s ../*.fasta .

ln
-
s ../454LargeContigs.fna

perl $PAGIT_HOME/ABACAS/abacas.pl
-
r AM884177.fasta
-
q
454LargeContigs.fna
-
p nucmer
-
c
-
m
-
b
-
o Res.Nuc

perl $PAGIT_HOME/ABACAS/abacas.pl
-
r AM886279
.fasta
-
q
Res.Nuc.contigsInbin.fas
-
p nucmer
-
c
-
m
-
b
-
o Res.Plasmid


Most prokaryotic genomes consist of a single
circular chromosome.
Circular genomes are
represented linearly in sequence files with the origin of replication at the start: this can cause
problems when using them in mapping or alignment tasks due to the artificially created ends of the
linear sequence. When initially assembling a circular genome i
t is usually the case that the origin of
replication is
located i
n the middle of a contig
: henc
e there may be problems when aligning that
contig to the reference genome
.



For this use
-
case, based on the assumption that the origin of replication is annotated correctly in the
reference genome, we

used ACT to identify the origin of replication in our
query genome. On
e
xamin
ing

the synteny between the origin of
replication in
the reference to
our new sequence
, we
found the following
:


ATGACAAGGCTTCCATTACT”
. We then loaded the ABACAS output file
“Res.Nuc.fasta” into an editor and searched for those bases
.


On finding the
pattern
,
we inserted a
new line
before it
with a
FASTA

head
er (for example

“>manual Cut”
)
.
In this way

we manually
divided the sequence into two
according to
where we suspect the orig
i
n

of replication

to be.


N
ext we reran the first ABAC
AS

command
,

chang
ing

the input to
use the

new split sequence

of the
ABACAS pseudomolecule (just 2 sequences to map)
.

This new mapping will have gained an
additional gap, but the advantage is that the gaps can be more accurately closed by IMAGE.



The seven

unmapped contigs from the first ABACAS execution were mapped against the plasmid
sequence, giving a plasmid pseudomolecule. W
e joined the two pseudo contigs
(from the genome
and the plasmid)
and the
plasmid
bin

contigs
.
Now we have a pseudo sequence for t
he nucleus
genome and the plasmid, rather than a joined sequence. Note that we could use the plasmid bin
contigs for further analysis.


Here is a synopsis of
the commands used:


formatdb
-
p F
-
i AM884177.fasta

blastall
-
p blastn
-
m 8
-
e 1e
-
40
-
d
AM884177.fasta
-
i Res.Nuc.fasta
-
o comp.Nuc.crunch

act AM884177.fasta comp.Nuc.crunch Res.Nuc.fasta

perl $PAGIT_HOME/ABACAS/abacas.pl
-
r AM884177.fasta
-
q Res.Nuc.fasta
-
p nucmer
-
c
-
m
-
b
-
o Res.NucOrigin

cat Res.NucOrigin.fasta Res.Plasmid.fasta
Res.Plasmid.cont
igsInbin.fas > Res.abacas.fasta


IMAGE
was executed
with
two
different k
-
mers
. A synopsis of the commands used is as follows:


perl $PAGIT_HOME
/IMAGE/image.pl
-
scaffolds Res.abacas.fasta
-
prefix
3807_partial
-
iteration 1
-
all_iteration 3
-
dir_prefix ite
-
kmer 49
-
vel_ins_len 300
-
smalt_minScore 45 > output.txt

perl $
PAGIT
_HOME
/IMAGE/restartIMAGE.pl ite3 31 12 partitioned

image_run_summary.pl ite

perl $
PAGIT
_HOME
/IMAGE/
c
ontigs2scaffolds.pl ite9/new.fa
ite9/new.read.placed 300 10 Res.image


After IMAGE,
ICORN
was executed for 4

iterations
. A synopsis of the commands used is as follows:


icorn.start.sh Res.image.fa 1 4 3807_partial_1.fastq
3807_partial_2.fastq 80,600 300

icorn.CollectResults.sh Res.image.fa 3807


We ran RATT using the sequence
s output from the 4
th

(
and final
)

iteration of ICORN

using the
following command:


start.ratt.sh embl/ Res.image.fa.4 TransferPAGIT Strain >
ratt.output.TransferPAGIT.txt


For transferring onto the original 454 assembly we ran the following command:


start
.ratt.sh embl/ 454LargeContigs.fna Transfer454 Strain >
ratt.output.Transfer454.txt


To gather the general transfer statistics we used the following command:


grep
-
iB1
-
A1 "Gene model" ratt.out*txt


The command used for summarizing statistics on the
incorrect

models

was the following. It prin
ts all
the lines from the report files of the 454 transfer. The
grep
-
v Gene_ID

will delete the header
line of each file. The
wc
-
l

counts the amount of lines, therefore the amount of gene models that
need

correction after the transfer.


cat Transfer454*.Report.txt | grep
-
v Gene_ID | wc
-
l


And similarly for the PAGIT sequence:


cat TransferPAGIT*.Report.txt | grep
-
v Gene_ID | wc
-
l


To gather statistics on the transferred models with f
rameshift
s on th
e
454 contigs

we used the
following command.
The difference to the previous command is
just
to count where the
fifth column
has more than two counts. The
fifth
column
is the number

of
stop codons in the codon regions
:

therefore it is likely to be a framesh
ift if the number
of stop codons is higher than 1.


cat Transfer454*.Report.txt | awk '$5>2' | grep
-
v Gene_ID | wc


And similarly for the PAGIT sequence:


cat TransferPAGIT*.Report.txt | awk '$5>2' | grep
-
v Gene_ID | wc