NCBI SRA exercises

californiamandrillΛογισμικό & κατασκευή λογ/κού

13 Δεκ 2013 (πριν από 3 χρόνια και 4 μήνες)

165 εμφανίσεις

Quality control assessment
for next
-
generation
sequence data

Leonardo
Mariño
-
Ramírez

NCBI SRA Portal

NCBI SRA
Query

NCBI SRA
454 data

Getting data out of the NCBI SRA

Launching the Virtual Machine

Launching the Virtual Machine

The VM file structure for the class

Converting SRA to FASTQ

student
@linux
-
k5oc:~> cd
sra
/


student
@linux
-
k5oc:~/
sra
>
ls

H_pylory_sequence.fasta

SRR023794.sra


student
@linux
-
k5oc:~/
sra
>
fastq
-
dump
--
split
-
spot
--
skip
-
technical

--
clip SRR023794.sra

Written

231208 spots for SRR023794.
sra

Written

231208 spots
total


student
@linux
-
k5oc:~/
sra
> more SRR023794.
fastq

@
SRR023794.1 FMQS1PV02F25Z5
length
=
86

GGGTAGGCACAGCGACTGTTCTTATCTTTTTGTGCCTTATATGCATATCCCAGATAGCGTCAATATCCTTAAAGAAGTCGGCACGC

+
SRR023794.1 FMQS1PV02F25Z5
length
=
86

7779
@EEEFFFFFFFFFFFFFFFFEE55000@8E88::EEFFFFFFEDAAAEEFFFFFFFFFFFFFFF;==;=ADFFFFFFFFFFF

Converting FASTQ to Sequence and Quality

student
@linux
-
k5oc:~/
sra
>
ls

H_pylori_sequence.fasta

SRR023794.fastq SRR023794.
sra


student
@linux
-
k5oc:~/
sra
> cd ../bin
/


student
@linux
-
k5oc:~/bin>
wget

http://
marino
-
johnson.org
/~
marino
/
fastq2faqual

asking
libproxy

about
url

'http://
marino
-
johnson.org
/~
marino
/
fastq2faqual’

libproxy

suggest to use 'direct:/
/’

-
-
2011
-
10
-
21 22:07:43
--

http://
marino
-
johnson.org
/~
marino
/
fastq2faqual

Resolving
marino
-
johnson.org
...
72.66.65.252

Connecting
to marino
-
johnson.org|72.66.65.252|:80... connected
.

HTTP
request sent, awaiting response... 200
OK

Length
: 675 [text/plain
]

Saving
to: “fastq2faqual


100
%[============================================================================>] 675
--
.
-
K/s in 0s

2011
-
10
-
21 22:07:43 (105 MB/s)
-

“fastq2faqual” saved [675/675
]


student
@linux
-
k5oc:~/bin>
ls


l

total
4
-
rw
-
r
--
r
--

1 student users 675 Oct 21 22:07
fastq2faqual


student
@linux
-
k5oc:~/bin>
chmod

+x
fastq2faqual


student
@linux
-
k5oc:~/bin>
perl

-
cw

fastq2faqual

fastq2faqual
syntax
OK

student
@linux
-
k5oc:~/bin>

Converting FASTQ to Sequence and Quality

Converting FASTQ to Sequence and Quality

student
@linux
-
k5oc:~/bin>
cd

student
@linux
-
k5oc:~> cd
sra
/


student
@linux
-
k5oc:~/
sra
> fastq2faqual SRR023794.fastq


student
@linux
-
k5oc:~/
sra
>
ls


l

total 543876

-
rw
-
r
--
r
--

1 student users 1611041 Oct 5 22:15
H_pylori_sequence.fasta

-
rw
-
r
--
r
--

1 student users 135044444 Oct 15 01:22 SRR023794.
fastq

-
rw
-
r
--
r
--

1 student users 68356223 Oct 21 22:29 SRR023794.
fastq.fa

-
rw
-
r
--
r
--

1 student users 181679433 Oct 21 22:29 SRR023794.
fastq.qual

-
rw
-
r
--
r
--

1 student users 170219206 Oct 5 22:27 SRR023794.
sra

student
@linux
-
k5oc:~/
sra
>

Converting FASTQ to Sequence and Quality

=
=> SRR023794.fastq <=
=


@
SRR023794.1 FMQS1PV02F25Z5 length=
86

GGGTAGGCACAGCGACTGTTCTTATCTTTTTGTGCCTTATATGCATATCCCAGATAGCGTCAATATCCTTAAAGAAGTCGGCACGC

+
SRR023794.1 FMQS1PV02F25Z5 length=
86

7779
@EEEFFFFFFFFFFFFFFFFEE55000@8E88::EEFFFFFFEDAAAEEFFFFFFFFFFFFFFF;==;=
ADFFFFFFFFFFF



Result files


=
=> SRR023794.fastq.fa <=
=


>
SRR023794.1
FMQS1PV02F25Z5 length
=
86

GGGTAGGCACAGCGACTGTTCTTATCTTTTTGTGCCTTATATGCATATCCCAGATAGCGTCAATATCCTTAAAGAAGTCGGCACGC


=
=> SRR023794.fastq.qual <=
=


>
SRR023794.1 FMQS1PV02F25Z5 length=
86

22
22 22 24 31 36 36 36 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37
36

36
20 20 15 15 15 31 23 36 23 23 25 25 36 36 37 37 37 37 37 37 36 35 32
32

32
36 36 37 37 37 37 37 37 37 37 37 37 37 37 37 37 37 26 28 28 26 28 32
35

37
37 37 37 37 37 37 37 37 37 37


Converting FASTQ to Sequence and Quality

Quality control assessment
with FASTQC

http://
www.bioinformatics.bbsrc.ac.uk
/projects/
fastqc
/

Quality control assessment
with FASTQC

http://
www.bioinformatics.bbsrc.ac.uk
/projects/
fastqc
/

Downloading and extracting FASTQC

student
@linux
-
k5oc:~/
sra
>
wget

http://
www.bioinformatics.bbsrc.ac.uk
/projects/
fastqc
/fastqc_v0.10.0.
zip

asking
libproxy

about
url

'http://
www.bioinformatics.bbsrc.ac.uk
/projects/
fastqc
/fastqc_v0.10.0.
zip’

libproxy

suggest to use 'direct:/
/’

-
-
2011
-
10
-
21 23:07:42
--

http://
www.bioinformatics.bbsrc.ac.uk
/projects/
fastqc
/fastqc_v0.10.0.
zip

Resolving
www.bioinformatics.bbsrc.ac.uk
... 149.155.200.1,
149.155.200.2

Connecting
to www.bioinformatics.bbsrc.ac.uk|149.155.200.1|:80... connected
.

HTTP
request sent, awaiting response... 200
OK

Length
: 599007 (585K) [application/zip]Saving to:
“fastqc_v0.10.0.zip”100%[==============================================================================>
] 599,007 162K/s in 3.6s

2011
-
10
-
21 23:07:46 (162 KB/s)
-

“fastqc_v0.10.0.zip” saved [599007/599007
]


student
@linux
-
k5oc:~/
sra
> unzip fastqc_v0.10.0.
zip

Archive
: fastqc_v0.10.0.zip creating:
FastQC
/Contaminants/

inflating
:
FastQC
/Contaminants/
contaminant_list.txt


inflating
:
FastQC
/
fastqc


inflating
:
FastQC
/
fastqc_icon.ico

creating
:
FastQC
/Help/

creating
:
FastQC
/Help/1 Introduction/

inflating
:
FastQC
/Help/1 Introduction/1.1 What is
FastQC.html

...

http://
www.bioinformatics.bbsrc.ac.uk
/projects/
fastqc
/

Downloading and extracting FASTQC

student
@linux
-
k5oc:~/
sra
>
ls

FastQC

H_pylori_sequence.fasta

SRR023794.fastq.fa SRR023794.
sra

fastqc_v0.10.0
.zip SRR023794.fastq SRR023794.
fastq.qual


student
@linux
-
k5oc:~/
sra
> cd
FastQC
/

student
@linux
-
k5oc:~/
sra
/
FastQC
>
ls


l

total 656

drwxr
-
xr
-
x 2 student users 4096 Sep 9 09:10
Contaminants

-
rw
-
r
--
r
--

1 student users 7332 Sep 8 17:26
fastqc

-
rw
-
r
--
r
--

1 student users 2238 Apr 26 2010
fastqc_icon.ico

drwxr
-
xr
-
x 5 student users 4096 Sep 9 09:10
Help

-
rw
-
r
--
r
--

1 student users 5710 Nov 24 2010
INSTALL.txt

-
rw
-
r
--
r
--

1 student users 50147 Jul 22 12:52 jbzip2
-
0.9.
jar

-
rw
-
r
--
r
--

1 student users 35821 Apr 13 2010
LICENSE.txt

-
rw
-
r
--
r
--

1 student users 2143 Apr 26 2010
README.txt

-
rw
-
r
--
r
--

1 student users 24730 Sep 8 11:16
RELEASE_NOTES.txt

-
rw
-
r
--
r
--

1 student users 106 Aug 16 09:04
run_fastqc.bat

-
rw
-
r
--
r
--

1 student users 507539 Jul 22 12:52 sam
-
1.32.
jar

drwxr
-
xr
-
x 3 student users 4096 Sep 9 09:10
Templates

drwxr
-
xr
-
x 3 student users 4096 Sep 9 09:10
uk

student
@linux
-
k5oc:~/
sra
/
FastQC
>

http://
www.bioinformatics.bbsrc.ac.uk
/projects/
fastqc
/

Running FASTQC

student
@linux
-
k5oc:~/
sra
/
FastQC
>
chmod

+x
fastqc

student
@linux
-
k5oc:~/
sra
/
FastQC
> ./
fastqc

&

[
1]
10137

student
@linux
-
k5oc:~/
sra
/
FastQC
>

Running FASTQC

Running FASTQC

Running FASTQC

Running FASTQC

Running FASTQC

Running FASTQC

Running FASTQC

Running FASTQC

Running FASTQC

Running FASTQC

Running FASTQC

Running FASTQC

Converting SRA to SFF

student
@linux
-
k5oc:~/
sra
>
sff
-
dump SRR023794.sra

Written
231208 spots for SRR023794.
sra

Written
231208 spots
total

student
@linux
-
k5oc:~/
sra
>


5
Mbp

Genome, 500
bp

reads, 25
bp

overlap

# reads

coverage


%
sequenced


#
contigs

2500

0.25


22.12


1971

5000

0.5


39.35


3109

10000

1


63.21


3867

20000

2


86.47


2991

30000

3


95.02


1735

40000

4


98.17


895

50000

5


99.33


433

60000

6


99.75


201

70000

7


99.91


91

80000

8


99.97


40

90000

9


99.99


17

100000

10


100.00


7

Graph of previous data

LW calculations for 5 MB genome, 500 bp reads
0
500
1000
1500
2000
2500
3000
3500
4000
4500
2500
5000
10000
20000
30000
40000
50000
60000
70000
80000
90000
100000
number of reads
0
10
20
30
40
50
60
70
80
90
100
# contigs
% sequenced
Lander

Waterman Assumptions

1.
Sequencing reads will be randomly
distributed in the genome


2. The ability to detect an overlap
between two truly overlapping reads
does not vary from clone to clone


In practice…

Lander
-
Waterman is almost always an underestimate


-
cloning biases in shotgun libraries


-
repeats


-
GC/AT rich regions


-
other low complexity regions

Sequence assembly overview

Chromosome

Genomic DNA

GAPS

Fill in Gaps

Sequence the

pieces

Sequence all pieces


Smaller pieces

Library Construction

Order Unknown

Draft Sequence

(phase 1)

Order the pieces

(phase 2)

Finished Sequence

All Pieces Sequenced

and placed in order

(phase 3)

Sequence assembly

Assemblers:

PHRAP, CAP, TIGR,
CELERA, NEWBLER

Overlap:

find potentially overlapping reads

Layout:

merge reads into contigs and


contigs into supercontigs

Consensus:

derive the DNA
sequence and correct read errors

..ACGATTACAATAGGTT..

Assembling a jigsaw puzzle


The task of the assembly becomes the task of
assembling a giant jigsaw puzzle


We look for reads whose sequences suggest that
they came from the same place in the genome:


AGTGATTA
GATGATAGTAGA



||||||||||||


GATGATAGTAGA
GGATAGATTTA

Repetitive Regions cause errors

A

R

B

R

C

Repetitive regions

A

R

C

De novo sequence assembly

Traditional WGS assemblers (Sanger and 454):


1.
Amos (UMD)

2.
Arachne

(Broad)

3.
Celera assembler (JCVI/UMD)

Newbler

(454)



Short
-
read assemblers (
Illumina

and
Solid)
:


1.
SOAPdenovo

(BGI)

2.
SSAKE/
ABySS

(GSC)

3.
VCAKE (UNC)

4.
Velvet (EBI)

5.
ALLPATHS (Broad)

6.
Euler
-
SR (UCSD)

Comparative assembly with
Newbler

Comparative assembly with
Newbler

Comparative assembly with
Newbler

Comparative assembly with
Newbler

Comparative assembly with
Newbler

Comparative assembly with
Newbler

Comparative assembly with
Newbler

Comparative assembly with
Newbler

Comparative assembly with
Newbler

Celera Assembler Hybrid
Assembly exercise

student
@linux
-
k5oc:~/practice>
runCA

-
p
ca

-
d CA 454.
frg

student
@linux
-
k5oc:~/practice> cd
CA

student
@linux
-
k5oc:~/practice/CA> gatekeeper
-
dumpfrg

ca.gkpStore

>
ca.frg

student
@linux
-
k5oc:~/practice/CA>
toAmos

-
a
ca.asm

-
f
ca.frg

-
o
ca.afg

student
@linux
-
k5oc:~/practice/CA> bank
-
transact
-
m
ca.afg

-
b
ca.bnk


c

student
@linux
-
k5oc:~/practice/CA>
hawkeye

ca.bnk

&

Velvet Assembly exercise

student
@linux
-
k5oc:~/practice>
velveth

velvet_asm

31
-
fastq

-
shortPaired

illumina_pe.fq



student
@linux
-
k5oc:~/practice>
velvetg

velvet_asm


student
@linux
-
k5oc:~/practice>
more
velvet_asm
/
contigs.fa


student
@linux
-
k5oc:~> cd practice/
alignment


student
@linux
-
k5oc:~/practice/alignment>
nucmer

-
p
hybrid_vs_illumina

../
hybrid_CA
/9
-
terminator/
hybrid.scf.fasta

../
velvet_asm
/
contigs.fa


student
@linux
-
k5oc:~/practice/alignment>
mummerplot

-
layout
hybrid_vs_illumina.delta


student
@linux
-
k5oc:~/practice/alignment>
gnuplot

out.gp



Exome

capture alignment &
variant calling exercise

1. BWA alignment


student
@linux
-
k5oc:~> cd practice
/


student
@linux
-
k5oc:~/practice> cd
ExomeAnalysis
/


student
@linux
-
k5oc:~/practice/
ExomeAnalysis
>
bwa

aln

chr22 Example.1.fastq > Example.1.
sai


student
@linux
-
k5oc:~/practice/
ExomeAnalysis
>
bwa

aln

chr22 Example.2.fastq > Example.2.
sai


student
@linux
-
k5oc:~/practice/
ExomeAnalysis
>
bwa

sampe

chr22 Example.1.sai Example.2.sai
Example.1.fastq Example.2.fastq >
Example.sam


2. BAM conversion, sorting and indexing of files for variant detection


student
@linux
-
k5oc:~/practice/
ExomeAnalysis
>
samtools

view
-
S
-
b
-
o
Example.bam

Example.sam


student
@linux
-
k5oc:~/practice/
ExomeAnalysis
>
samtools

sort
Example.bam

Example.sorted


student
@linux
-
k5oc:~/practice/
ExomeAnalysis
>
samtools

index
Example.sorted.bam


Exome

capture alignment &
variant calling exercise

3
. Variant detection using
samtools

multi
-
way pileup


student
@linux
-
k5oc:~/practice/
ExomeAnalysis
>
samtools

mpileup

-
uf

chr22.fasta
Example.sorted.bam

|
bcftools

view
-
bvcg

-

>
Example.bcf


student
@linux
-
k5oc:~/practice/
ExomeAnalysis
>
bcftools

view
Example.bcf

|
vcfutils.pl

varFilter

-
D100 >
Example.vcf


student
@linux
-
k5oc:~/practice/
ExomeAnalysis
> more
Example.vcf


4
. Indexing VCF file for IGV display


student
@linux
-
k5oc:~/practice/
ExomeAnalysis
>
bgzip

-
c
Example.vcf

>
Example.vcf.gz

student
@linux
-
k5oc:~/practice/
ExomeAnalysis
>
tabix

-
p
vcf

Example.vcf.gz


5. IGV
display


student
@linux
-
k5oc:~/practice/
ExomeAnalysis
>
igv.sh

&

File


Load Genome…

Select
chr22.genome
and click Ok

File


Load
from File…

Select
Example.sorted.bam

and click Ok

File


Load from File…

Select
Example.vcf.gz

and click Ok

File


Load from File…

Select
nimbleGen_SeqCapEZ_exome_chr22
.bed

and click Ok




Exome

capture alignment &
variant calling exercise