Next Generation Sequencing Bioinformatics


29 Sep 2013


Stephen Taylor
Computational Biology Research Group
History

Sanger

Dominant for last ~30 years

1000bp longest read

Based on primers, so not good for repetitive regions or SNP sites

Next Generation Sequencing

Much shorter reads, 25 to 300 bp

Higher throughput

Cheaper cost per Mb

Single molecule sequencing (no cloning step)

Since Jan 2008, more DNA has been sequenced than in all previous years
Hence We Need High Throughput Bioinformatics
Sanger

Fred Sanger (1980)

Dye-terminator sequencing

Amplify the DNA fragment by PCR

Separate into 2 strands

Polymerase elongates DNA

Incorporation of a fluorescently labelled ddNTP terminates elongation at each base

Run DNA fragments on gel/capillary

Peak generated for each base
Illumina (Solexa)
Illumina (Solexa) Applications
Resequencing

Characterise different related species or strains
Transcriptome analysis

No chip/array required!

random priming of RNA
DNA methylation analysis

sequencing bisulfite-converted DNA, methylation-sensitive restriction digest enriched fragments
Examine chromatin modifications

Quantify in vivo protein-DNA interactions using the combination of chromatin immunoprecipitation and sequencing (ChIP-Seq)
Price Comparison
Processing and management
Assemble Data - Illumina
Generates short reads (~35-75bp)
Good for resequencing
Difficult to do de novo assembly for all but the smallest organisms
Mapping Illumina Reads

Acquire and process images and convert to FASTQ*

Get data

Quality control**

Map to genome

Visualisation

Post Processing

Peak Finding

SNP Calling
* Not covered today?
FASTQ format
@HWUSI-EAS100R:6:73:941:1973#0/1
TATACAATGCACTTAGTCATCCGCGTATCACTTTAT
+
IIIIIIIIIIIIIIIIIIGIIIIIIIIII4IIII:I
1. HWUSI-EAS100R the unique instrument name
2. 6 flowcell lane
3. 73 tile number within the flowcell lane
4. 941 'x'-coordinate of the cluster within the tile
5. 1973 'y'-coordinate of the cluster within the tile
6. #0 index number for a multiplexed sample (0 for no indexing)
7. /1 the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
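The field breakdown above can be sketched as a small parser. This is only an illustration for the older Illumina identifier layout shown on this slide; the function name and the regular expression are assumptions made for the example, not part of any Illumina or CBRG tool.

```python
import re

# Assumed pattern for the identifier layout shown above:
# @<instrument>:<lane>:<tile>:<x>:<y>#<index>/<pair>
HEADER_RE = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)(?:/(?P<pair>[12]))?$"
)


def parse_illumina_header(header):
    """Split a read identifier into its named fields (hypothetical helper)."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError("unrecognised read identifier: %r" % header)
    fields = match.groupdict()
    # Numeric fields are converted to integers; the index stays a string.
    for key in ("lane", "tile", "x", "y"):
        fields[key] = int(fields[key])
    if fields["pair"] is not None:
        fields["pair"] = int(fields["pair"])
    return fields


info = parse_illumina_header("@HWUSI-EAS100R:6:73:941:1973#0/1")
print(info)
```

Identifiers from other instruments or later software versions may not match this pattern, in which case the sketch raises a ValueError rather than guessing.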
FASTQ format
Quality Score
ASCII representation of the score for each base, e.g. I
Convert the character to its ASCII code, e.g. 73
Subtract <a value> (the offset, which depends on the FASTQ variant)
Original Qphred = 40
See http://en.wikipedia.org/wiki/FASTQ_format
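As a rough illustration of the decoding steps above, the sketch below turns a quality string back into scores. The default offset of 33 is an assumption taken from the worked example on this slide (73 minus 33 gives 40); other FASTQ variants use a different offset. The function name is made up for the example.

```python
def decode_quality(quality_string, offset=33):
    """Convert each quality character to its numeric score.

    offset=33 is an assumption matching the slide's example (I -> 73 -> 40).
    """
    return [ord(character) - offset for character in quality_string]


# Quality line from the FASTQ record shown earlier in the slides.
scores = decode_quality("IIIIIIIIIIIIIIIIIIGIIIIIIIIII4IIII:I")
print(scores[0], min(scores), max(scores))
```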
Formats

warning!
FASTQ format appears ‘standard’ but there are 3 types based on
the probabilities of the base calls…
Qphred = -10 x log10(error_prob)
Qsolexa = -10 x log10(error_prob/(1-error_prob))
1. Standard fastq: ASCII( Qphred + 33 )
2. Illumina pre v1.3: ASCII( Qsolexa + 64 )
3. Illumina post v1.3: ASCII( Qphred + 64 )
Option 3 should be the main one for the foreseeable future!?
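The two score definitions and the three encodings above can be sketched in code. The Solexa-to-Phred formula follows by solving both definitions for error_prob; the dictionary keys and function names are labels invented for this example, and rounding to the nearest integer score is an assumption.

```python
import math


def phred_from_error(error_prob):
    """Qphred = -10 x log10(error_prob)."""
    return -10 * math.log10(error_prob)


def solexa_from_error(error_prob):
    """Qsolexa = -10 x log10(error_prob / (1 - error_prob))."""
    return -10 * math.log10(error_prob / (1 - error_prob))


def solexa_to_phred(q_solexa):
    """Derived by eliminating error_prob from the two definitions above."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


# (score scale, ASCII offset) for the three variants listed on the slide.
# The key names are hypothetical labels used only in this sketch.
ENCODINGS = {
    "standard": ("phred", 33),
    "illumina_pre_1.3": ("solexa", 64),
    "illumina_1.3+": ("phred", 64),
}


def to_standard_fastq(quality_string, source):
    """Re-encode a quality string as standard FASTQ (Qphred + 33)."""
    scale, offset = ENCODINGS[source]
    converted = []
    for character in quality_string:
        score = ord(character) - offset
        if scale == "solexa":
            score = solexa_to_phred(score)
        converted.append(chr(int(round(score)) + 33))
    return "".join(converted)


print(to_standard_fastq("hhh", "illumina_1.3+"))
```

The two scales agree closely at high scores and diverge at low ones, which is why the pre v1.3 variant cannot be converted by changing the offset alone.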
Convert between formats?
Use sol2std2
Get Data
May be supplied in a variety of formats
.prb .txt files

Contain probabilities for each base

Some SNP callers use this

Usually convert to FASTQ
FASTQ

Like FASTA but with quality score associated with each base
WTCHG

If your data is from WTCHG you are likely to get an email

E.g. http://www.well.ox.ac.uk/htseq/1T3qcHwk6jmlZeVtSnQO/

wget the FASTQ file in the GERALD directory

http://www.well.ox.ac.uk/htseq/1T3qcHwk6jmlZeVtSnQO/GERALD_24-09-2009_johnb/s_2_sequence.txt.gz
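Once the file has been downloaded, it can be read without unpacking it first. The sketch below is a minimal reader that assumes the simple four-line FASTQ layout (identifier, sequence, '+', quality); the function name is invented, and the demo writes a tiny stand-in file to a temporary directory rather than touching real data.

```python
import gzip
import os
import tempfile


def read_fastq(path):
    """Yield (identifier, sequence, quality) from a plain or gzipped FASTQ file.

    Assumes each record occupies exactly four lines.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        while True:
            identifier = handle.readline().rstrip("\n")
            if not identifier:
                break
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # the '+' separator line is skipped
            quality = handle.readline().rstrip("\n")
            yield identifier, sequence, quality


# Demo on a one-record stand-in file using the read shown in the slides.
demo_path = os.path.join(tempfile.mkdtemp(), "s_2_sequence.txt.gz")
with gzip.open(demo_path, "wt") as out:
    out.write("@HWUSI-EAS100R:6:73:941:1973#0/1\n")
    out.write("TATACAATGCACTTAGTCATCCGCGTATCACTTTAT\n")
    out.write("+\n")
    out.write("IIIIIIIIIIIIIIIIIIGIIIIIIIIII4IIII:I\n")

records = list(read_fastq(demo_path))
print(len(records), "record(s) read")
```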
Processing reads - Illumina
Mapping Tools

MAQ

Sanger

Uses quality scores

ELAND

Comes with the machine and runs as standard

Very fast

NOVOALIGN

Slower, more accurate

Output option includes pairwise (handy for following up SNP calls)

TOPHAT

For RNA-Seq

Can map splice junctions
Notes on Mapping

What genome?

Masking?

Some tools disregard multiple maps e.g. ELAND

Some tools map to one location and adjust probability score
e.g. MAQ

Can be confusing…

For ChIP-Seq we normally use DNA heavily masked for repeats (simple/complex/ribosomal)
Databanks Indices

We have many indexed databanks

Under /databank/indices/<tool> e.g. for maq

ens_human_chrs/

ens_human_chrs_ucsc_rmfull_2/

ens_mouse_chrs/

ens_mouse_chrs_ucsc_rmfull/

ens_human_cdna/

ens_mouse_masked_chrs/

Indices for both maq and novoalign

If an index you need is not there please ask; don't make a local one in your account!
ChIP-Seq Pipeline
ChIP-Sequencing Advantages

Less DNA needed

Not limited by microarray content

More precise site mapping

Increased reads increases sensitivity

Produces higher quality data
ChIP-Seq example

NGS reads

Map (maq)

Peak pick (cisgenome)

Extract sequences from features (Motif extract)

MEME

Weblogo
MAQ
For simple runs use ‘easyrun’ option…
nohup /proj/hts/bin/maq.pl easyrun <db> <fastq> -d <results-directory> maq.log
In <results-directory> the main file is all.map
To convert the binary to something usable:
maq pileup <db> all.map > all.pileup
These are quite large files…
Visualization
The all.map file is converted to wig using a CBRG custom tool
maq wig <db> all.map > all.wig
Then we convert to GFF format using custom scripts
GFF format

General Feature Format

Developed at the Sanger Institute

http://www.sanger.ac.uk/Software/formats/GFF/

Format for describing features associated with DNA, RNA and
Protein sequences

Easy to parse

More tools e.g. EMBOSS starting to use this as standard

GFF3 is more standard and works best with GBrowse
Example GFF3 instance
##gff-version 3
chr3 src exon 1300 1500 . + . ID=exon00001
chr3 src exon 1050 1500 . + . ID=exon00002
chr3 src exon 3000 3902 . + . ID=exon00003
chr3 src exon 5000 5500 . + . ID=exon00004
chr3 src exon 7000 9000 . + . ID=exon00005
The type column (exon) is a SOFA term
Note the '=' in the attributes column (e.g. ID=exon00001)
http://gmod.org/wiki/GFF3
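To illustrate the "easy to parse" point, here is a minimal sketch of reading one GFF3 feature line. It assumes the nine columns are tab-separated, as the GFF3 format requires (the slide text above shows them flattened to spaces); the function name and the returned dictionary layout are choices made for this example only.

```python
def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dictionary (hypothetical helper)."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError("expected 9 tab-separated columns, got %d" % len(columns))
    seqid, source, feature_type, start, end, score, strand, phase, raw = columns
    # Attributes are semicolon-separated tag=value pairs, e.g. ID=exon00001.
    attributes = dict(item.split("=", 1) for item in raw.split(";") if item)
    return {
        "seqid": seqid,
        "source": source,
        "type": feature_type,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "score": score,
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }


feature = parse_gff3_line("chr3\tsrc\texon\t1300\t1500\t.\t+\t.\tID=exon00001")
length = feature["end"] - feature["start"] + 1
print(feature["attributes"]["ID"], length)
```

Lines starting with '#', such as the ##gff-version directive, are not feature lines and would need to be skipped before calling this function.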
Wig binary files
Scripts and modules to handle:
UCSC wiggle format (1 column; 2 column; 4 column) or gff3
binary (.wib)
GMOD script wiggle_to_wigBinary.pl
gff file
Function:
wiggle_to_wigBinary.pl variables (source / method / trackname / paths / input & output filenames)
command line to load binary / gff data into GBrowse (bp_seqfeature_load.pl + all variables: database name, filenames, paths etc)
a conf file stanza to display the loaded data
construct an intermediate wiggle format file (if input was gff3 or maq binary)
Peak Calling

Lots of algorithms to do this

Problems with identifying a good cut off score

Over and under prediction
F-Seq

Based on a training set of peaks identified by the researcher in a specific region

Iterate over parameter space until achieve best TP/FP score
cisgenome

Uses IP and non-IP ChIP-Seq data, which increases accuracy of predictions
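The cut-off problem mentioned above can be seen with a toy example. The sketch below is a naive threshold scan over per-base read depth; it is not the algorithm used by F-Seq or cisgenome, and the function name and coverage values are invented for illustration.

```python
def call_peaks(coverage, cutoff):
    """Return (start, end, height) for each run with depth >= cutoff.

    Coordinates are 0-based and half-open; height is the maximum depth
    inside the run. A deliberately simple stand-in for a real peak caller.
    """
    peaks = []
    start = None
    for position, depth in enumerate(coverage):
        if depth >= cutoff and start is None:
            start = position
        elif depth < cutoff and start is not None:
            peaks.append((start, position, max(coverage[start:position])))
            start = None
    if start is not None:
        peaks.append((start, len(coverage), max(coverage[start:])))
    return peaks


# Made-up read depths along a short region.
coverage = [0, 1, 2, 8, 9, 7, 1, 0, 3, 4, 3, 0]
low_cutoff_peaks = call_peaks(coverage, 3)
high_cutoff_peaks = call_peaks(coverage, 5)
print(low_cutoff_peaks)
print(high_cutoff_peaks)
```

With the lower cut-off the weak second region is reported as a peak, and with the higher one it disappears, which is the over- and under-prediction trade-off in miniature.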
Motif Extraction

Extract underlying DNA from peak calls

Run using web based motif finders

Weeder

MEME

May need to do successive rounds to find weaker motifs
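Extracting the underlying DNA from peak calls might look like the sketch below, which cuts each peak (plus optional flanking bases) out of a reference held in memory and formats the result as FASTA for a motif finder. The function names, the in-memory genome dictionary, and the 1-based inclusive coordinates are all assumptions for this example; a real pipeline would read the reference from an indexed file.

```python
def extract_peak_sequences(genome, peaks, flank=0):
    """Return {name: sequence} for each (chrom, start, end) peak.

    Peak coordinates are assumed to be 1-based and inclusive; flank adds
    that many bases on each side, clipped to the chromosome ends.
    """
    sequences = {}
    for chrom, start, end in peaks:
        chrom_sequence = genome[chrom]
        left = max(0, start - 1 - flank)
        right = min(len(chrom_sequence), end + flank)
        name = "%s:%d-%d" % (chrom, left + 1, right)
        sequences[name] = chrom_sequence[left:right]
    return sequences


def to_fasta(sequences):
    """Format the extracted sequences as FASTA text."""
    return "".join(">%s\n%s\n" % (name, seq) for name, seq in sequences.items())


# Toy reference and peak list, invented for the example.
genome = {"chr3": "ACGTACGTAC"}
exact = extract_peak_sequences(genome, [("chr3", 3, 6)])
padded = extract_peak_sequences(genome, [("chr3", 3, 6)], flank=1)
print(to_fasta(padded))
```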
Quick note: SNP Calling
Often finds errors in the PCR amplification step

maq cns2snp (run during the easyrun option)

SNPseeker

Novoalign + CBRG script
Worth trying all of the above!
Molbiol Data Structure
Analyse your data on deva.molbiol.ox.ac.uk
CBRG set up /proj/hts/data/<username>
Suggested structure:
batch/
fastq/ dbname/
Contact us if you want a GBrowse database for your data
Future
Problem

In depth analysis after mapping = bottleneck

Need to empower the users to do their own analysis
Solution

Makefiles for bulk data analysis

Allow access to NGS data via GBrowse ‘workbench’

GBrowse plugins to export data to other tools

Galaxy (http://main.g2.bx.psu.edu/) looks promising