Bioinformatics_Training_for_Illumina_RNAseqx - The GenePool

Oct 4, 2013

Bioinformatics for Illumina RNAseq - De-novo assembly and mapping of transcriptome sequencing data

Stephen Bridgett, The GenePool, 2012

Sequencing

Sample QC - to check DNA quality and quantity
Library preparation - add the Illumina adapters and barcode indexes
Cluster generation - on the cBot cluster station
Sequencing - on HiSeq 2000

HiSeq Flowcell (Image from Flickr.com - with researcher Nicole Shapiro. Image by Roy Kaltschmidt, LBNL)

HiSeq 2000 sequencer (Image from Illumina.com website)

Demultiplexing and RunQC

Demultiplexing - using the Illumina Casava version 1.8.2 pipeline, to split the samples (that were sequenced together in one lane) into separate fastq files for each sample.
http://www.illumina.com/software/genome_analyzer_software.ilmn
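
To illustrate the idea behind demultiplexing, here is a minimal Python sketch that groups reads by their barcode index. It is not how Casava works internally: the function name, the barcode table and the reads are all hypothetical, and each read is assumed to arrive already paired with its barcode.

```python
from collections import defaultdict

def demultiplex(reads, barcode_to_sample):
    """Group (barcode, sequence) pairs by sample name.

    Reads whose barcode is not in the table go to 'undetermined'.
    """
    by_sample = defaultdict(list)
    for barcode, sequence in reads:
        sample = barcode_to_sample.get(barcode, "undetermined")
        by_sample[sample].append(sequence)
    return dict(by_sample)

# Hypothetical barcodes and reads, for illustration only.
barcodes = {"ACGT": "sample_A", "TTAG": "sample_B"}
lane_reads = [("ACGT", "GATTACA"), ("TTAG", "CCGGTTA"),
              ("ACGT", "TTGACCA"), ("GGGG", "AAAAAAA")]
split = demultiplex(lane_reads, barcodes)
```

In a real run each group would be written to its own fastq file, and a production demultiplexer would typically also tolerate a small number of mismatches in the barcode.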


Read Quality checks - to check the number and quality of the reads, we use several programs:

- Fastx toolkit http://hannonlab.cshl.edu/fastx_toolkit/
- FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
- Usearch to count adapters http://www.drive5.com/usearch/

Adapter filtering and Quality Trimming

Adapter filtering - if adapters are present in the reads, the assembler will try to align the reads on the adapter sequences (instead of on the actual transcriptome sequence), which would produce an incorrect assembly or cause the assembly to fail. We use the ‘Scythe’ program with a fasta file of adapter sequences.
https://github.com/vsbuffalo/scythe
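
As a rough illustration of what adapter filtering does, the Python sketch below clips a 3' adapter using exact string matching. This is a deliberately simplified stand-in and not Scythe's method; the function name, the `min_overlap` parameter and the example adapter string are assumptions made for this example, so substitute the adapter sequences from your own library kit.

```python
def clip_adapter(read, adapter, min_overlap=5):
    """Remove a 3' adapter from a read using exact matching.

    Clips at a full copy of the adapter anywhere in the read, or at a
    prefix of the adapter (at least min_overlap bases) at the read's end.
    """
    pos = read.find(adapter)
    if pos != -1:
        return read[:pos]
    # Look for a partial adapter hanging off the 3' end, longest first.
    longest = min(len(adapter), len(read)) - 1
    for overlap in range(longest, min_overlap - 1, -1):
        if read.endswith(adapter[:overlap]):
            return read[:len(read) - overlap]
    return read

# Example adapter prefix, assumed here for illustration only.
ADAPTER = "AGATCGGAAGAGC"

full = clip_adapter("ACGTACGTAGATCGGAAGAGCTTT", ADAPTER)
partial = clip_adapter("ACGTACGTAGATCGG", ADAPTER)
clean = clip_adapter("ACGTACGTACGT", ADAPTER)
```

Exact matching misses adapters that contain sequencing errors, which is one reason to use a dedicated tool on real data.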



Quality trimming - the assembler doesn't use the quality scores, so it has to assume that all bases are accurately called. To improve the assembly, bases with low quality scores (i.e. less accurate bases) are trimmed from the ends of reads before assembly. We use the ‘Sickle’ program with parameters of: -t sanger -l 50 -q 20 -n -x
https://github.com/najoshi/sickle
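
The effect of the quality threshold (20) and minimum length (50) can be sketched in a few lines of Python. This is a simple base-by-base end trimmer written for illustration, not Sickle's own windowed algorithm; the function and parameter names are hypothetical, and Sanger (Phred+33) quality encoding is assumed.

```python
def trim_read(sequence, quality, min_quality=20, min_length=50, offset=33):
    """Trim low-quality bases from both ends of a read.

    quality is the FASTQ quality string; offset=33 is the Sanger encoding.
    Returns (sequence, quality), or None if the trimmed read is too short.
    """
    scores = [ord(ch) - offset for ch in quality]
    start, end = 0, len(scores)
    # Move inwards from the 5' end past low-quality bases.
    while start < end and scores[start] < min_quality:
        start += 1
    # Move inwards from the 3' end past low-quality bases.
    while end > start and scores[end - 1] < min_quality:
        end -= 1
    if end - start < min_length:
        return None
    return sequence[start:end], quality[start:end]

# Toy 10-base read: '#' is Phred 2, '+' is Phred 10, 'I' is Phred 40.
kept = trim_read("ACGTACGTAC", "##IIIIII#+", min_length=5)
dropped = trim_read("ACGTACGTAC", "##IIIIII#+")  # default min_length=50
```

With paired-end data, both reads of a pair need to be handled together so that the two fastq files stay in step.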

De-novo Assembly

There are several de-novo assemblers for RNAseq data:

- Oases, uses Velvet (EBI). http://www.ebi.ac.uk/~zerbino/oases/
- SOAPdenovo-Trans (BGI) - Fastest. http://soap.genomics.org.cn/SOAPdenovo-Trans.html
- Trans-ABySS (BCGSC). http://www.bcgsc.ca/platform/bioinfo/software/trans-abyss
- Trinity (Broad) - Slower but longest contigs, and more EST blast hits, fixed k-mer length. http://trinityrnaseq.sourceforge.net/

It would take a very long time to compare every read against every other read, so the assemblers use a k-mer (hash) index to find similar reads more quickly; they then construct a de Bruijn graph and find paths through this graph.
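
The three steps just described (k-mer index, de Bruijn graph, path finding) can be sketched on a toy example in Python. This is only an illustration under strong assumptions: error-free reads, a single strand, and no branches in the graph. It is not how any of the assemblers listed above is implemented, and all function names are hypothetical.

```python
from collections import defaultdict

def kmers(sequence, k):
    """Yield every k-mer (substring of length k) in a sequence."""
    for i in range(len(sequence) - k + 1):
        yield sequence[i:i + k]

def build_kmer_index(reads, k):
    """Map each k-mer to the set of read indices that contain it."""
    index = defaultdict(set)
    for read_id, read in enumerate(reads):
        for kmer in kmers(read, k):
            index[kmer].add(read_id)
    return index

def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge prefix -> suffix."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph

def walk(graph, start):
    """Follow unambiguous edges from start and spell out the contig."""
    contig, node, seen = start, start, {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop if the path loops back on itself
            break
        seen.add(node)
        contig += node[-1]
    return contig

# Three overlapping toy reads, for illustration only.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCAA"]
index = build_kmer_index(reads, 4)
graph = build_de_bruijn(reads, 4)
contig = walk(graph, "ATG")
```

Here the index shows that reads 0 and 1 share the k-mer GGCG without comparing them directly, and the walk through the graph merges all three reads into one contig. Real assemblers also have to resolve branches caused by sequencing errors, repeats and alternative transcripts.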

Mapping reads to the contigs

Several aligner programs, e.g.:

- BWA - Fast, can map DNA reads to a genomic reference, but not splice aware. http://bio-bwa.sourceforge.net/
- Stampy - Slower, more sensitive to indels. http://www.well.ox.ac.uk/project-stampy
- TopHat - uses Bowtie, fast, splice-junction aware. http://tophat.cbcb.umd.edu/
- GSNAP - Slower, splice-junction aware, claims to give more SNP-tolerant alignment. http://research-pub.gene.com/gmap/


GSNAP commands

Build the index for the contigs that the reads will be mapped to:

    gmap_build --db=$RefDB --dir=$RefDir $Ref

Where $Ref is the assembled contigs fasta file, $RefDir is the assembly directory, and $RefDB is the index to create.




Align the reads to the contigs:

    gsnap --db=$RefDB --dir=$RefDir --split-output=$Output \
        --pairexpect=180 --format=sam --npaths=5 --nthreads=12 \
        $Read1 $Read2

(optionally add: --novelsplicing=1)

Where $Output is the mapping output name, and $Read1 and $Read2 are the input paired read fastq files. ‘--pairexpect’ is the expected paired-end length; ‘--format’ selects SAM file output; ‘--nthreads’ means use 12 threads in parallel. The ‘\’ means the command continues on the next line in a bash script.



Analyse the resulting BAM alignment files.

Differential expression analysis

Several packages for differential expression:

- DESeq - an R package, via Bioconductor, uses “negative binomial distribution, with variance and mean linked by local regression”. http://bioconductor.org/packages/2.10/bioc/html/DESeq.html
- edgeR - an R package, uses “empirical Bayes estimation and exact tests based on the negative binomial distribution.” http://www.bioconductor.org/packages/2.10/bioc/html/edgeR.html
- Cuffdiff - part of Cufflinks, calculates transcript and gene differential FPKM. http://cufflinks.cbcb.umd.edu/manual.html
- GeneProf - has web-based interface. http://www.geneprof.org/GeneProf/
- Partek - needs license. http://www.partek.com/ngs
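
For readers unfamiliar with the FPKM unit mentioned for Cuffdiff above, the basic definition (Fragments Per Kilobase of transcript per Million mapped fragments) can be written as a short Python function. This shows only the unit itself, not the statistical model Cuffdiff uses to estimate abundances; the function name and the numbers are made up for illustration.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments.

    fragments / (length in kb) / (total mapped fragments in millions),
    which simplifies to fragments * 1e9 / (length_bp * total).
    """
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)

# Hypothetical transcript: 500 fragments on a 2 kb transcript,
# in a sample with 10 million mapped fragments.
value = fpkm(fragments=500, transcript_length_bp=2000,
             total_mapped_fragments=10_000_000)
```

Dividing by transcript length and by sequencing depth is what makes values comparable between long and short transcripts and between samples sequenced to different depths.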