Read Processing and Mapping: From Raw to Analysis-ready Reads

gorgeousvassalSoftware and s/w Development

Nov 7, 2013 (3 years and 9 months ago)

107 views

Read Processing and Mapping:

From Raw to Analysis
-
ready Reads


B E N P A S S A R E L L I

S T E M C E L L I N S T I T U T E G E N O ME
C E N T E R


N G S WO R K S H O P

3 1 MA Y 2 0 1 3


2

From Raw to Analysis
-
ready Reads

Session Topics



Overview of high
-
throughput sequencing platforms


Understand read data formats and quality scores


Identify and fix some common read data problems


Find a genomic reference for mapping


Mapping reads to a reference genome


Understand alignment
output


Sort
, merge, index
alignment
for further
analysis


Mark/eliminate duplicate
reads


Locally realign at
indels


Recalibrate base quality scores


How to get started


Raw reads

Read
assessment
and prep

Mapping

Duplicate

Marking

Local
realignment

Base quality
recalibration

Analysis
-
ready

reads

3

Sample to Raw Reads

Library
Construction

QC and
Quantification

Sample

Preparation


Sequencing

Raw
Reads

Instrument
Output

4

Illumina

MiSeq

Illumina

HiSeq

Ion

PGM

Ion Proton

Pacific Biosciences

RS

Images (.tiff)

Cluster intensity file (.
cif
)

Base call file (.
bcl
)

Standard
f
lowgram

file (.
sff
)

Movie

Trace (.trc.h5)

Pulse (.pls.h5)

Base (.bas.h5)

Sequence Data

(FASTQ Format)

Sequencing Platforms at a Glance

6

Solid Phase Amplification

V3
HiSeq

Library DNA binds to
Oligos

Immobilized on Glass
Flowcell

Surface

Sequencing Steps


Clusters are linearized


Sequencing primer annealed


All four
dNTPs

added at each cycle


Each with different
*
*
Fluorescent Tag
*
*


Intensity of different tags


b
ase call


Error
Profile: substitutions


FASTQ Format (Illumina Example)

@DJG84KN1:272:D17DBACXX:2:1101:12432:5554
1:N:0:AGTCAA

CAGGAGTCTTCGTACTGCTTCTCGGCCTCAGCCTGATCAGTCACACCGTT

+

BCCFFFDFHHHHHIJJIJJJJJJJIJJJJJJJJJJIJJJJJJJJJIJJJJ

@DJG84KN1:272:D17DBACXX:2:1101:12454:5610 1:N:0:AG

AAAACTCTTACTACATCAGTATGGCTTTTAAAACCTCTGTTTGGAGCCAG

+

@@@DD?DDHFDFHEHIIIHIIIIIBBGEBHIEDH=EEHI>FDABHHFGH2

@DJG84KN1:272:D17DBACXX:2:1101:12438:5704 1:N:0:AG

CCTCCTGCTTAAAACCCAAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTC

+

CCCFFFFFHHGHHJIJJJJJJJI@HGIJJJJIIIJGIGIHIJJJIIIIJJ

@DJG84KN1:272:D17DBACXX:2:1101:12340:5711 1:N:0:AG

GAAGATTTATAGGTAGAGGCGACAAACCTACCGAGCCTGGTGATAGCTGG

+

CCCFFFFFHHHHHGGIJJJIJJJJJJIJJIJJJJJGIJJJHIIJJJIJJJ

Read Record

Header

Read Bases

Separator

(with optional
repeated
header)

Read Quality
Scores

Flow Cell ID

Lane

Tile

Tile
Coordinates

Barcode

NOTE: for
paired
-
end runs, there
is a second file

with one
-
to
-
one
corresponding
headers and reads

Phred
* quality score
Q
with
base
-
calling
error
probability P

Q

=
-
10 log
10
P

* Name of first program to assign accurate base quality scores. From the Human Genome Project.


SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
.....................................................


...............................
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII......................


LLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLLL
....................................................


!"#$%&'()*
+
,
-
./01234
5
6789:;<=>
?
@ABCDEFGH
I
JKLMNOPQRSTUVWXYZ[
\
]^_`
abcdefghijklmnopqrstuvwxyz
{|}~


| | | | | |


33 59 64 73 104 126



S
-

Sanger
Phred+33 range: 0 to 40


I
-

Illumina 1.3+
Phred+64 range
: 0 to 40


L
-

Illumina 1.8+
Phred+33 range
: 0 to
41

Q

score

Probability of
base

error

Base
confidence

Sanger
-
encoded

(Q Score +
33) ASCII
character

10

0.1

90%

“+”

20

0.01

99%

“5”

30

0.001

99.9%

“?”

40

0.0001

99.99%

“I”

Base Call Quality:
Phred

Quality Scores

Initial Read Assessment and Processing

C
OMMON

PROBLEMS

THAT

CAN

AFFECT

ANALYSIS


L
OW

CONFIDENCE

BASE

CALLS


typically toward ends of reads


criteria vary by application


P
RESENCE

OF

ADAPTER

SEQUENCE

IN

READS


poor fragment size selection


protocol execution or artifacts


O
VER
-
ABUNDANT

SEQUENCE

DUPLICATES


L
IBRARY

CONTAMINATION



Raw reads

Read
assessment
and prep

Mapping

Duplicate

Marking

Local
realignment

Base quality
recalibration

Analysis
-
ready

reads

Quick Read Assessment:
FastQC

F
REE

D
OWNLOAD


Download:
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/


Tutorial :
http://www.youtube.com/watch?v=bz93ReOv87Y


S
AMPLES

READS

(200K
DEFAULT
):
FAST
,
LOW

RESOURCE

USE


Read Assessment Example (Cont’d)

Trim for base quality or adapters

(run or library issue)



Trim leading bases

(library artifact)

Read Assessment Example (Cont’d)

TruSeq

Adapter, Index 9

5’ GATCGGAAGAGCACACGTCTGAACTCCAGTCACGATCAGATCTCGTATGCCGTCTTCTGCTTG


13

Comprehensive Read Assessment:
Prinseq

http://prinseq.sourceforge.net
/





Fastx

toolkit*
http
://hannonlab.cshl.edu/fastx_toolkit
/

(partial list)

FASTQ Information: Chart
Quality Statistics and Nucleotide Distribution

FASTQ Trimmer: Shortening FASTQ/FASTA reads (
removing barcodes or noise).

FASTQ Clipper: Removing
sequencing
adapters

FASTQ
Quality
Filter: Filters
sequences based on quality

FASTQ Quality
Trimmer: Trims
(cuts) sequences based on quality

FASTQ
Masker: Masks
nucleotides with 'N' (or other character) based on
quality

*defaults to old Illumina
fastq

(ASCII offset 64). Use

Q33 option.


SepPrep

https://
github.com/jstjohn/SeqPrep


Adapter trimming


Merge overlapping paired
-
end read


Biopython

http://
biopython.org
,
http
://
biopython.org/DIST/docs/tutorial/Tutorial.html


(for python programmers)


Especially useful for implementing custom/complex sequence analysis/manipulation


Galaxy

http
://
galaxy.psu.edu


Great for beginners: upload data, point and click


Just about everything you’ll see in today’s presentations


SolexaQA2
http://
solexaqa.sourceforge.net


Dynamic trimming


Length sorting (resembles read grouping of
Prinseq
)





Selected Tools to Process Reads

15

Many Analysis Pipelines Start with Read Mapping

http://www.broadinstitute.org/gatk/guide/topic?name=best
-
practices

http://www.nature.com/nprot/journal/v7/n3/full/nprot.2012.016.html

Read Mapping

http
://www.broadinstitute.org/igv/

Raw reads

Read
assessment
and prep

Mapping

Duplicate

Marking

Local
realignment

Base quality
recalibration

Analysis
-
ready

reads

Sequence References and Annotations





http://www.ncbi.nlm.nih.gov/projects/genome/assembly/grc/data.shtml


http://www.ncbi.nlm.nih.gov/guide/howto/dwn
-
genome


Comprehensive reference information



http://hgdownload.cse.ucsc.edu/downloads.html


Comprehensive reference, annotation, and translation information



ftp://gsapubftp
-
anonymous@ftp.broadinstitute.org/bundle


References and SNP information data by GATK


Human only



http://cufflinks.cbcb.umd.edu/igenomes.html


Pre
-
indexed references and gene annotations for Tuxedo suite


Human, Mouse, Rat , Cow, Dog, Chicken, Drosophila, C.
elegans
,
Yeast




http://www.repeatmasker.org


Fasta Sequence Format

>chr1



TGGACTTGTGGCAGGAATgaaatccttagacctgtgctgtccaatatggt

agccaccaggcacatgcagccactgagcacttgaaatgtggatagtctga

attgagatgtgccataagtgtaaaatatgcaccaaatttcaaaggctaga

aaaaaagaatgtaaaatatcttattattttatattgattacgtgctaaaa

taaccatatttgggatatactggattttaaaaatatatcactaatttcat



>
chr2



>chr3




One or more sequences per file


“>” denotes beginning of sequence or
contig


Subsequent lines up to the next “>” define sequence


Lowercase base denotes repeat masked base


Contig

ID may have comments delimited by “|”


Read Mapping



Novoalign


(3.0)

SOAP3

(version

91)

BWA

(0.7.4)

Bowtie2

(2.1.0)

Tophat2

(2.0.8b)

STAR

(2.3.0e)

License

Commercial

GPL v3

GPL v3

Artistic

Artistic

GPL v3

Mismatch
allowed

up to
8

up

to
3

user

specified.
m
ax is function of
read length and
error rate

user specified

uses

Bowtie2



user specified


Alignments
reported per
read

random/all/none

random/all/none

user

selected

user

selected

uses

Bowtie2


user

selected


Gapped
alignment

up to 7bp

1
-
3bp gap

yes

yes

yes

splice junctions

introns

yes

splice junctions

introns

Pair
-
end reads

yes

yes

yes

yes

yes

yes

Best alignment

highest
alignment score

minimal number
of mismatches

minimal number
of
mismatches

highest
alignment score

uses

Bowtie2

highest
alignment score

Trim bases

3’
end

3’ end

3’ and 5’ end

3’ and 5’ end

畳敳

B潷ti攲

3’ and 5’ end

Comments

At

one time, best
performance and
alignment quality

Element of
B
road’s
“best practices”
来湯瑹灩湧
workflow

Smith
-
Waterman

quality alignments,
currently fastest

Currently

m
ost
popular

RNA
-
seq

aligner

Very fast;

uses
memory to achieve
performance

BWA Features


Uses Burrows Wheeler Transform


f
ast


modest
memory footprint
(<
4GB
)


Accurate


Tolerates base mismatches


increased sensitivity


reduces allele bias


Gapped
alignment for both
single
-

and paired
-
ended reads


Automatically adjusts
parameters based on read lengths and
error
rates


Native BAM/SAM
output
(the
de facto
standard)


Large installed base, well
-
supported


Open
-
source (no charge)


Read Mapping: BWA

21

Read Mapping: Bowtie 2

B
OWTIE
2


Uses dynamic programming (edit distance scoring)

o
Eliminates
need for realignment around indels

o
Can be tuned for different sequencing technologies


Multi
-
seed
search
-

adjustable
sensitivity


Input
read length limited only by available memory


Fasta or Fastq input


Caveats

o
Longer input reads require much more memory

o
Trade
-
off parallelism with memory
requirement





http://
bowtie
-
bio.sourceforge.net/bowtie2


Langmead

B,
Salzberg

S.
Fast gapped
-
read alignment with Bowtie
2
,
Nature
Methods. 2012, 9:357
-
359


Dynamic Programming Illustration

SAM (BAM) Format

S
EQUENCE

A
LIGNMENT
/M
AP

FORMAT


Universal standard


Human
-
readable (SAM) and compact (BAM) forms


S
TRUCTURE



Header


version, sort order, reference sequences, read groups, program/processing
history


Alignment records

[
benpass

align_genotype
]$
samtools

view
-
H
allY.recalibrated.merge.bam

@HD

VN:1.0

GO:none

SO:coordinate

@SQ

SN:chrM

LN:16571

@SQ

SN:chr1

LN:249250621

@SQ

SN:chr2

LN:243199373

@SQ

SN:chr3

LN:198022430



@SQ

SN:chr19

LN:59128983

@SQ

SN:chr20

LN:63025520

@SQ

SN:chr21

LN:48129895

@SQ

SN:chr22

LN:51304566

@SQ

SN:chrX

LN:155270560

@SQ

SN:chrY

LN:59373566



@RG

ID:86
-
191

PL:ILLUMINA

LB:IL500

SM:86
-
191
-
1

@RG

ID:BsK010

PL:ILLUMINA

LB:IL501

SM:BsK010
-
1

@RG

ID:Bsk136

PL:ILLUMINA

LB:IL502

SM:Bsk136
-
1

@RG

ID:MAK001

PL:ILLUMINA

LB:IL503

SM:MAK001
-
1

@RG

ID:NG87

PL:ILLUMINA

LB:IL504

SM:NG87
-
1



@RG

ID:SDH023

PL:ILLUMINA

LB:IL508

SM:SDH023

@PG

ID:GATK
IndelRealigner

VN:2.0
-
39
-
gd091f72

CL:knownAlleles
=[]
targetIntervals
=
tmp.intervals.list

LODThresholdForCleaning
=5.0
consensusDeterminationModel
=USE_READS
entropyThreshold
=0.15
maxReadsInMemory
=150000
maxIsizeForMovement
=3000
maxPositionalMoveAllowed
=200
maxConsensuses
=30
maxReadsForConsensuses
=120
maxReadsForRealignment
=20000
noOriginalAlignmentTags
=false
nWayOut
=null generate_nWayOut_md5s=false
check_early
=false
noPGTag
=false
keepPGTags
=false
indelsFileForDebugging
=null
statisticsFileForDebugging
=null
SNPsFileForDebugging
=null

@PG

ID:bwa

PN:bwa

VN:0.6.2
-
r126



SAM/BAM Format: Header

s
amtools

to view bam

header

s
ort order

r
eference sequence names
with lengths

r
ead groups with platform,
library and sample information

p
rogram (analysis) history

[
benpass

align_genotype
]$
samtools

view
allY.recalibrated.merge.bam


HW
-
ST605:127:B0568ABXX:2:1201:10933:3739


147

chr1

27675

60

101M

=

27588

-
188


TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC


=
7;:;<=??<=BCCEFFEJFCEGGEFFDF?BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB8@ABCCCDCBDA@>:/


RG:Z:86
-
191




HW
-
ST605:127:B0568ABXX:3:1104:21059:173553

83

chr1

27682

60

101M

=

27664

-
119


ATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGCTACAGTA


8;8.7
::<?=BDHFHGFFDCGDAACCABHCCBDFBE</BA4//BB@BCAA@CBA@CB@ABA>A??@B@BBACA>?;A@8??CABBBA@AAAA?AA??@BB0


RG:Z:SDH023


* Many fields after column 12 deleted (e.g., recalibrated base scores) have been deleted for improved readability


SAM/BAM Format: Alignment Records

http://samtools.sourceforge.net/SAM1.pdf

1

3

4

5

6

8

9

10

11

2

25

Preparing for Next Steps

S
UBSEQUENT

STEPS

REQUIRE

SORTED

AND

INDEXED

BAMS


Sort orders:
karyotypic
, lexicographical


Indexing improves analysis performance


P
ICARD

TOOLS
:
FAST
,
PORTABLE
,
FREE


http://picard.sourceforge.net/command
-
line
-
overview.shtml


Sort:


SortSam.jar


Merge:

MergeSamFiles.jar


Index:


BuildBamIndex.jar


O
RDER
:
SORT
,
MERGE

(
OPTIONAL
),
INDEX

Raw reads

Read
assessment
and prep

Mapping

Duplicate

Marking

Local
realignment

Base quality
recalibration

Analysis
-
ready

reads

Duplicate Marking



Raw reads

Read
assessment
and prep

Mapping

Duplicate

Marking

Local
realignment

Base quality
recalibration

Analysis
-
ready

reads

$java
-
Xmx4g
-
jar
<path to
picard
>
/
MarkDuplicates.jar

\

INPUT=
aligned.sorted.bam

\

OUTPUT=
aligned.sorted.dedup.bam

\

VALIDATION_STRINGENCY=LENIENT
\

METRICS_FILE=aligned.dedup.metrics.txt
\

REMOVE_DUPLICATES=false
\

ASSUME_SORTED=true



http://picard.sourceforge.net/command
-
line
-
overview.shtml#MarkDuplicates

[
benpass

align_genotype
]$
samtools

view
allY.recalibrated.merge.bam


HW
-
ST605:127:B0568ABXX:2:1201:10933:3739


147

chr1

27675

60

101M

=

27588

-
188


TCATTTTATGGCCCCTTCTTCCTATATCTGGTAGCTTTTAAATGATGACCATGTAGATAATCTTTATTGTCCCTCTTTCAGCAGACGGTATTTTCTTATGC


=
7;:;<=??<=BCCEFFEJFCEGGEFFDF?BEA@DEDFEFFDE>EE@E@ADCACB>CCDCBACDCDDDAB@@BCADDCBC@BCBB8@ABCCCDCBDA@>:/


RG:Z:86
-
191



SAM/BAM Format: Alignment Records

http://samtools.sourceforge.net/SAM1.pdf

http://picard.sourceforge.net/explain
-
flags.html

Local Realignment

BWT
-
BASED

ALIGNMENT

IS

FAST

FOR

MATCHING

READS

TO

REFERENCE


I
NDIVIDUAL

BASE

ALIGNMENTS

OFTEN

SUB
-
OPTIMAL

AT

INDELS


A
PPROACH


Fast read mapping with BWT
-
based aligner


Realign reads at
indel

sites using gold standard (but much slower)
Smith
-
Waterman algorithm


B
ENEFITS


Refines location of
indels


Reduces erroneous SNP calls


Very high alignment accuracy in significantly less time, with fewer
resources



1
Smith
, Temple F.; and Waterman, Michael S. (1981). "Identification of Common Molecular Subsequences". Journal
of Molecular Biology 147: 195

197. doi:10.1016/0022
-
2836(81)90087
-
5. PMID
7265238


Raw reads

Read
assessment
and prep

Mapping

Duplicate

Marking

Local
realignment

Base quality
recalibration

Analysis
-
ready

reads

Local Realignment

DePristo

MA, et al. A framework for variation discovery and
genotyping using next
-
generation DNA sequencing data. Nat Genet.
2011 May;43(5):491
-
8. PMID: 21478889

Post re
-
alignment at
indels

Raw BWA alignment

STEP 1: Find covariates at non
-
dbSNP

sites using:

Reported quality score

The position within the read

The preceding and current
nucleotide (sequencer properties)


java
-
Xmx4g
-
jar
GenomeAnalysisTK.jar
\

-
T
BaseRecalibrator

\

-
I
alignment.bam

\

-
R
hg19/ucsc.hg19.fasta
\

-
knownSites

hg19/dbsnp_135.hg19.vcf
\

-
o
alignment.recal_data.grp


STEP 2
: Generate BAM with recalibrated base scores:


java
-
Xmx4g
-
jar
GenomeAnalysisTK.jar
\

-
T
PrintReads

\

-
R
hg19/ucsc.hg19.fasta
\

-
I
alignment.bam

\

-
BQSR
alignment.recal_data.grp

\

-
o
alignment.recalibrated.bam


Base Quality Recalibration

Raw reads

Read
assessment
and prep

Mapping

Duplicate

Marking

Local
realignment

Base quality
recalibration

Analysis
-
ready

reads

Base Quality Recalibration (Cont’d)

32

Raw reads

Read
assessment
and prep

Mapping

Duplicate

Marking

Local
realignment

Base quality
recalibration

Analysis
-
ready

reads

Is there an easier way to get started?!

http://galaxyproject.org
/

Click on “Use Galaxy”

Getting Started

35

Raw reads

Read
assessment
and prep

Mapping

Duplicate

Marking

Local
realignment

Base quality
recalibration

Analysis
-
ready

reads