NGS Bioinformatics Workshop

2.2 Tutorial: Whole Genome Assembly, Part I

May 9th, 2012
IRMACS 10900

Facilitator: Richard Bruskiewich
Adjunct Professor, MBB

Workflow for Today

Generate a synthetic NGS read data set
Genome assembly
  ABySS
  Velvet
  ALLPATHS-LG

Generate synthetic NGS read data for assembly


Try out a new program called “ART”, from NIEHS and Boston College


Huang W, Li L, Myers JR, Marth GT. 2012. ART: a next-generation sequencing read simulator. Bioinformatics. 28(4): 593-594



Available as open source and as binary programs for 32- or 64-bit Windows, Mac and Linux
http://www.niehs.nih.gov/research/resources/software/art



Notes:
  The binary archive names are a bit strange: each is really a .tar.gz in disguise (need to do a gunzip followed by a tar xvf)
  The fastq sequence line is *lower case*, which is not expected by some software (e.g. ABySS)
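The unpacking note above can also be handled from Python. This is only a sketch: `unpack_art_archive` and its arguments are hypothetical names, and it assumes the download really is a gzip-compressed tar file whatever its file name says.

```python
import tarfile


def unpack_art_archive(archive_path, dest_dir="."):
    """Extract a gzip-compressed tar archive, ignoring its file extension.

    Roughly what gunzip followed by tar xvf does on the command line.
    Returns the list of member names found in the archive.
    """
    # "r:gz" forces gzip + tar handling even if the name does not end in .tar.gz
    with tarfile.open(archive_path, mode="r:gz") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names
```

Calling `unpack_art_archive("downloaded_file", "art")` would unpack into an `art` directory; both names are placeholders.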


Simulated Illumina Paired End Reads

Using rice chloroplast genome (~134kb)

art_illumina -i Chloroplast.fasta -p -l 50 -f 20 -m 200 -s 10 -o Chloroplast -sam

Generates files:
  Chloroplast1.aln
  Chloroplast1.fq
  Chloroplast2.aln
  Chloroplast2.fq
  Chloroplast.sam


==============================================================================
ART (Q Version 1.3.6)
Copyright(c) 2008-2012, Weichun Huang, Jason Myers. All Rights Reserved.
==============================================================================

Paired-end Simulation

Total CPU time used: 2.48

Parameters used during run
  Read Length: 50
  Fold Coverage: 20X
  Mean Fragment Length: 200
  Standard Deviation: 10
  Profile Type: Combined
  ID Tag:

Quality Profile(s)
  First Read: EMP50R1 (built-in profile)
  Second Read: EMP50R2 (built-in profile)

Output files

  FASTQ Sequence Files:
    the 1st reads: Chloroplast1.fq
    the 2nd reads: Chloroplast2.fq

  ALN Alignment Files:
    the 1st reads: Chloroplast1.aln
    the 2nd reads: Chloroplast2.aln

  SAM Alignment File:
    Chloroplast.sam
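As a sanity check on the run parameters above, the expected number of reads can be estimated from the coverage definition. This is a rough sketch, assuming fold coverage means total sequenced bases divided by genome length; ART's own rounding may differ slightly, and `expected_read_counts` is a hypothetical helper name.

```python
def expected_read_counts(genome_length, fold_coverage, read_length):
    """Return (total_reads, read_pairs) needed to reach the requested coverage."""
    total_bases = genome_length * fold_coverage
    total_reads = total_bases // read_length
    return total_reads, total_reads // 2


# ~134 kb is the approximate genome size quoted earlier, so the result is approximate too
reads, pairs = expected_read_counts(genome_length=134000, fold_coverage=20, read_length=50)
print(reads, pairs)  # 53600 26800
```

So each of the two .fq files should hold somewhere around 26,800 reads for this genome.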

Unfortunately…

The ART program generates peculiar IDs (it doesn’t mark the paired-end reads…) and lower-case sequence letters, which causes some headaches…

So, I wrote a small Python script to fix this…

#!/usr/bin/python

# Fixes the output of the ART program
# art_illumina -i reference.fa -p -l 50 -f 20 -m 200 -s 10 -o outFile_prefix -sam
# Reads a FASTQ file on stdin and writes the fixed FASTQ to stdout.

from sys import stdin

seq = False   # True when the next line is a sequence line
qual = False  # True when the next line is a quality line

if __name__ == '__main__':
    for line in stdin:
        line = line.strip()
        if qual:
            qual = False  # to avoid treating rare quality score lines that start with '@' as id's
        elif line.startswith('+'):
            qual = True
        elif not seq and line.startswith('@'):
            # massage the ID
            part1 = line.split('|')
            part2 = part1[1].split('-')
            line = part1[0] + '_' + part2[0] + '-' + part2[1] + '/' + part2[2]
            seq = True
        elif seq:
            # convert sequence all to upper case to avoid downstream confusion...
            line = line.upper()
            seq = False
        print(line)  # parenthesised form runs under both Python 2 and Python 3
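To see what the script does to a single record, the same logic can be wrapped in a function and fed a made-up read. The function name `fix_art_fastq` and the sample ID are hypothetical; the ID is shaped only to match what the script's split calls expect (a '|' followed by three '-'-separated fields) and is not a confirmed example of real ART output.

```python
def fix_art_fastq(lines):
    """Yield FASTQ lines with massaged IDs and upper-case sequences."""
    seq = False
    qual = False
    for line in lines:
        line = line.strip()
        if qual:
            qual = False  # quality line: pass through untouched
        elif line.startswith('+'):
            qual = True
        elif not seq and line.startswith('@'):
            part1 = line.split('|')
            part2 = part1[1].split('-')
            line = part1[0] + '_' + part2[0] + '-' + part2[1] + '/' + part2[2]
            seq = True
        elif seq:
            line = line.upper()
            seq = False
        yield line


# Hypothetical record; the quality line deliberately starts with '@'
record = ["@ref|100-200-1", "acgtn", "+", "@IIII"]
print(list(fix_art_fastq(record)))
# ['@ref_100-200/1', 'ACGTN', '+', '@IIII']
```

The stdin-based script itself would be run along the lines of `python fix_art.py < Chloroplast1.fq > Chloroplast1.fastq`, where `fix_art.py` is an assumed file name.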

Getting ABySS

Installation:
  For Ubuntu: sudo apt-get install abyss
  Or visit BCGSC and download the tar.gz source, then configure and make (more up-to-date?)
  Perhaps put the abyss bin directory on your path…

To test run ABySS:

abyss-pe k=25 name=test se=https://raw.github.com/dzerbino/velvet/master/data/test_reads.fa





Try our test PE read data set

abyss-pe name=Chloroplast31 k=31 ABYSS_OPTIONS=--no-trim-masked in='Chloroplast1.fastq Chloroplast2.fastq'

The ‘no-trim-masked’ option is needed because the default behaviour of abyss is to trim lower-case letters in sequence (which designate identified vector sequences in 454 outputs…)

Try with other k-mer sizes…
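Trying several k-mer sizes is easier with the command lines generated rather than typed. The sketch below only builds and prints the commands and runs nothing. The helper name `abyss_commands` and the particular k values are assumptions chosen for illustration; the name=Chloroplast<k> pattern simply follows the k=31 example above.

```python
import shlex


def abyss_commands(k_values, reads=("Chloroplast1.fastq", "Chloroplast2.fastq")):
    """Build one abyss-pe argument list per k-mer size (nothing is executed)."""
    commands = []
    for k in k_values:
        commands.append([
            "abyss-pe",
            "name=Chloroplast%d" % k,
            "k=%d" % k,
            "ABYSS_OPTIONS=--no-trim-masked",
            "in=" + " ".join(reads),
        ])
    return commands


for argv in abyss_commands([21, 25, 31, 35]):
    # shlex.quote adds shell quoting where an argument contains spaces
    print(" ".join(shlex.quote(arg) for arg in argv))
```

Each argument list could also be handed to `subprocess.call` directly, in which case no shell quoting is needed.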

For more info about ABySS

http://www.bcgsc.ca/platform/bioinfo/software/abyss

Active list service to troubleshoot issues:

abyss-users@googlegroups.com


Velvet

http://www.ebi.ac.uk/~zerbino/velvet/

  download & tar -zxvf
  make
  sudo make install
  put velvet directory on your $PATH

Run velveth:

velveth outputdir k_mer -fastq readfile

Run velvetg:

velvetg outputdir -ins_length 200 -exp_cov 20
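The Velvet manual describes expected coverage in terms of k-mer coverage rather than read coverage, using Ck = C * (L - k + 1) / L, so the 20 above may be worth converting before tuning -exp_cov. A small sketch of that conversion follows; the function name is hypothetical, and k=31 is an assumption borrowed from the ABySS example, since the slide leaves k_mer as a placeholder.

```python
def kmer_coverage(read_coverage, read_length, k):
    """Convert read coverage C into k-mer coverage Ck = C * (L - k + 1) / L."""
    if k > read_length:
        raise ValueError("k cannot exceed the read length")
    return read_coverage * (read_length - k + 1) / float(read_length)


# 20X read coverage, 50 bp reads, k = 31 (assumed)
print(kmer_coverage(20, 50, 31))  # 8.0
```

With these numbers, 20X read coverage corresponds to roughly 8X k-mer coverage.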



ALLPATHS-LG

http://www.broadinstitute.org/software/allpaths-lg/blog/

  download and tar zxvf
  ./configure
  make
  sudo make install

Execute the program:

  PrepareAllPathsInputs.pl # needs some config files…
  RunAllPathsLG