Introduction to Bioinformatics

Oct 2, 2013

Next Generation Sequencing Bioinformatics


Stephen Taylor

Computational Biology Research Group

History



Sanger


Dominant for last ~30 years


1000bp longest read


Primer-based, so not good for repetitive regions or SNP sites


Next Generation Sequencing


Much shorter reads, 25 to 300 bp


Higher throughput


Cheaper cost per Mb


Single molecule sequencing (no cloning step)


Since Jan 2008, more DNA has been sequenced than in all previous years combined





Hence We Need High-Throughput Bioinformatics


Sanger


Fred Sanger (1980)


Dye-terminator sequencing


Amplify the DNA fragment by PCR


Separate into 2 strands


Polymerase elongates DNA


Incorporation of a fluorescently labelled ddNTP terminates elongation at that base


Run DNA fragments on gel/capillary


Peak generated for each base












Illumina (Solexa)


Illumina (Solexa) Applications

Resequencing


Characterise different related species or strains

Transcriptome analysis


No chip/array required!


random priming of RNA

DNA methylation analysis


sequencing bisulfite-converted DNA

methylation-sensitive restriction digest enriched fragments

Examine chromatin modifications


Quantify in vivo protein-DNA interactions using the combination of chromatin immunoprecipitation and sequencing (ChIP-Seq)





Price Comparison


Processing and management


Assemble Data - Illumina

Generates short reads (~35-75bp)

Good for resequencing

Difficult to do de novo assembly for all but the smallest organisms




Mapping Illumina Reads



Acquire and process images and convert to FASTQ*


Get data


Quality control**


Map to genome


Visualisation


Post Processing


Peak Finding


SNP Calling



* Not covered today?


FASTQ format

@HWUSI-EAS100R:6:73:941:1973#0/1

TATACAATGCACTTAGTCATCCGCGTATCACTTTAT

+

IIIIIIIIIIIIIIIIIIGIIIIIIIIII4IIII:I

1. HWUSI-EAS100R: the unique instrument name

2. 6: flowcell lane

3. 73: tile number within the flowcell lane

4. 941: 'x'-coordinate of the cluster within the tile

5. 1973: 'y'-coordinate of the cluster within the tile

6. #0: index number for a multiplexed sample (0 for no indexing); /1: the member of a pair, /1 or /2 (paired-end or mate-pair reads only)
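
The field breakdown above can be sketched as a small parser. This is a minimal illustration in Python, assuming the identifier layout shown on this slide; the function name and the field names are made up for the example and are not part of any Illumina tool.

```python
import re

# Pattern for the identifier layout shown above:
# @<instrument>:<lane>:<tile>:<x>:<y>#<index>/<pair>
HEADER_RE = re.compile(
    r"^@(?P<instrument>[^:]+):(?P<lane>\d+):(?P<tile>\d+):"
    r"(?P<x>\d+):(?P<y>\d+)#(?P<index>[^/]+)(?:/(?P<pair>[12]))?$"
)


def parse_illumina_header(header):
    """Split a FASTQ header line into its named fields (hypothetical helper)."""
    match = HEADER_RE.match(header.strip())
    if match is None:
        raise ValueError(f"unrecognised header: {header!r}")
    fields = match.groupdict()
    for key in ("lane", "tile", "x", "y"):
        fields[key] = int(fields[key])
    # 'pair' stays None when the /1 or /2 suffix is absent (single-end reads)
    if fields["pair"] is not None:
        fields["pair"] = int(fields["pair"])
    return fields


fields = parse_illumina_header("@HWUSI-EAS100R:6:73:941:1973#0/1")
print(fields)
```

The index is kept as a string because multiplexed runs may carry a barcode there rather than a number.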





FASTQ format

Quality Score

ASCII representation of the score for each base, e.g. I

Convert the character to its ASCII code, e.g. 73

Subtract the encoding offset (33 for standard FASTQ)

Original Qphred = 73 - 33 = 40


See http://en.wikipedia.org/wiki/FASTQ_format
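
The decoding steps above can be tried out directly. The following is a small sketch, assuming the standard offset of 33; the function name is invented for the example.

```python
def decode_quality(quality, offset=33):
    """Return the integer score encoded by each character of a quality string."""
    return [ord(ch) - offset for ch in quality]


# Quality line from the example record on the earlier slide
quality = "IIIIIIIIIIIIIIIIIIGIIIIIIIIII4IIII:I"
scores = decode_quality(quality)
print(scores[0], min(scores))
```

Passing a different `offset` covers encodings that use another base value.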





Formats


warning!

FASTQ format appears ‘standard’ but there are 3 types based on
the probabilities of the base calls…


Qphred = -10 x log10(error_prob)

Qsolexa = -10 x log10(error_prob / (1 - error_prob))


1. Standard fastq: ASCII( Qphred + 33 )

2. Illumina pre v1.3: ASCII( Qsolexa + 64 )

3. Illumina post v1.3: ASCII( Qphred + 64 )


Option 3 should be the main one for the foreseeable future!?
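
To make the three encodings concrete, here is a hedged sketch of re-encoding a quality string into the standard form. Eliminating error_prob between the two definitions above gives Qphred = 10 x log10(10^(Qsolexa/10) + 1), which is what `solexa_to_phred` computes. The encoding labels and function names are made up for this example, and this is not the conversion script mentioned on the next slide.

```python
import math

# (score scale, ASCII offset) for the three variants listed above;
# the dictionary keys are labels invented for this sketch
ENCODINGS = {
    "standard": ("phred", 33),
    "illumina_pre_1.3": ("solexa", 64),
    "illumina_1.3": ("phred", 64),
}


def solexa_to_phred(q_solexa):
    """Convert a Solexa-scaled score to the Phred scale."""
    return 10 * math.log10(10 ** (q_solexa / 10) + 1)


def to_standard(quality, encoding):
    """Re-encode a quality string as standard FASTQ (Qphred + 33)."""
    scale, offset = ENCODINGS[encoding]
    converted = []
    for ch in quality:
        score = ord(ch) - offset
        if scale == "solexa":
            # Rounding to the nearest integer loses a little precision
            score = round(solexa_to_phred(score))
        converted.append(chr(score + 33))
    return "".join(converted)


print(to_standard("hhhh", "illumina_1.3"))
```

The two scales nearly coincide for high scores and diverge only for low-quality bases, so the conversion matters most for poor reads.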



Convert between formats?


Use sol2std2

Get Data

May be supplied in a variety of formats


.prb .txt files


Contain probabilities for each base


Some SNP callers use this


Usually convert to FASTQ


FASTQ


Like FASTA but with quality score associated with each base



WTCHG



If your data is from WTCHG you are likely to get an email with a link


E.g. http://www.well.ox.ac.uk/htseq/1T3qcHwk6jmlZeVtSnQO/


wget the FASTQ file in the GERALD directory


http://www.well.ox.ac.uk/htseq/1T3qcHwk6jmlZeVtSnQO/GERALD_24-09-2009_johnb/s_2_sequence.txt.gz
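
Once the gzipped file has been downloaded it can be read without unpacking it. The sketch below assumes the simple four-lines-per-record layout shown on the FASTQ slide; `read_fastq` is a hypothetical helper, and the demo writes a tiny temporary file rather than touching real data.

```python
import gzip
import os
import tempfile


def read_fastq(path):
    """Yield (name, sequence, quality) tuples from a gzipped FASTQ file."""
    with gzip.open(path, "rt") as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # the '+' separator line
            quality = handle.readline().rstrip("\n")
            if not header.startswith("@") or len(sequence) != len(quality):
                raise ValueError(f"malformed record near {header!r}")
            yield header[1:], sequence, quality


# Demo on a small temporary file; a real run would pass the downloaded file
record = "@HWUSI-EAS100R:6:73:941:1973#0/1\nTATACAATGC\n+\nIIIIIIIIII\n"
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.txt.gz")
    with gzip.open(demo_path, "wt") as out:
        out.write(record * 2)
    records = list(read_fastq(demo_path))
print(len(records), records[0][0])
```

Using a generator keeps memory use flat, which matters because these files are large.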


Processing reads - Illumina

Mapping Tools


MAQ


Developed at the Sanger Institute


Uses quality scores


ELAND


Comes with the machine and runs as standard


Very fast


NOVOALIGN


Slower, more accurate


Output option includes pairwise (handy for following up SNP calls)


TOPHAT


For RNA-Seq


Can map splice junctions






Notes on Mapping


What genome?


Masking?


Some tools discard reads that map to multiple locations, e.g. ELAND


Some tools map the read to one location and adjust the probability score, e.g. MAQ


Can be confusing…


For ChIP-Seq we normally use DNA heavily masked for repeats (simple/complex/ribosomal)




Databank Indices


We have many indexed databanks


Under /databank/indices/<tool> e.g. for maq


ens_human_chrs/


ens_human_chrs_ucsc_rmfull_2/


ens_mouse_chrs/


ens_mouse_chrs_ucsc_rmfull/


ens_human_cdna/


ens_mouse_masked_chrs/


Indices for both maq and novoalign


If an index you need is not there please ask; don’t make a local one in your account!




ChIP-Seq Pipeline



ChIP-Sequencing Advantages


Less DNA needed


Not limited by microarray content


More precise site mapping


More reads increase sensitivity


Produces higher quality data


ChIP-Seq example

1. NGS reads

2. Map (maq)

3. Peak pick (cisgenome)

4. Extract sequences from features (Motif extract)

5. MEME

6. Weblogo

MAQ

For simple runs use ‘easyrun’ option…


nohup /proj/hts/bin/maq.pl easyrun <db> <fastq> -d <results-directory> > maq.log


In <results-directory> the main file is all.map

To convert the binary file into something usable:



maq pileup <db> all.map > all.pileup


These are quite large files…






Visualization










The all.map file is converted to wig using a CBRG custom tool

maq wig <db> all.map > all.wig

Then we convert to GFF format using custom scripts
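
To give a feel for what a wig file contains, here is a minimal sketch that turns mapped read positions into per-base depth and prints it as UCSC variableStep wiggle text. It is not the CBRG tool; the function names and the (start, length) input form are assumptions made for the illustration.

```python
from collections import Counter


def coverage_from_reads(reads):
    """Count how many reads cover each 1-based position.

    `reads` is an iterable of (start, length) pairs on one chromosome.
    """
    depth = Counter()
    for start, length in reads:
        for pos in range(start, start + length):
            depth[pos] += 1
    return depth


def to_wiggle(chrom, depth, track_name="coverage"):
    """Format per-base depth as UCSC variableStep wiggle text."""
    lines = [
        f'track type=wiggle_0 name="{track_name}"',
        f"variableStep chrom={chrom}",
    ]
    for pos in sorted(depth):
        lines.append(f"{pos} {depth[pos]}")
    return "\n".join(lines) + "\n"


reads = [(100, 4), (102, 4)]  # two overlapping 4 bp reads
wig = to_wiggle("chr3", coverage_from_reads(reads))
print(wig)
```

Counting base by base is slow for whole genomes; it is used here only to keep the idea visible.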




GFF format



General Feature Format


Developed at the Sanger Institute


http://www.sanger.ac.uk/Software/formats/GFF/


Format for describing features associated with DNA, RNA and
Protein sequences


Easy to parse


More tools, e.g. EMBOSS, are starting to use this as standard


GFF3 is more standard and works best with GBrowse




Example GFF3 instance


##gff-version 3

chr3 src exon 1300 1500 . + . ID=exon00001

chr3 src exon 1050 1500 . + . ID=exon00002

chr3 src exon 3000 3902 . + . ID=exon00003

chr3 src exon 5000 5500 . + . ID=exon00004

chr3 src exon 7000 9000 . + . ID=exon00005

The feature type in column 3 (exon) is a SOFA term

Note the ‘=’ in the attributes column (ID=exon00001)

http://gmod.org/wiki/GFF3
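
Because the format is easy to parse, a feature line from the example can be read with a few lines of code. This is a sketch only: the GFF3 specification separates the nine columns with tabs, while the code below splits on any whitespace so that the flattened lines on this slide also parse. The function and column names are chosen for the example.

```python
from urllib.parse import unquote

COLUMNS = ("seqid", "source", "type", "start", "end",
           "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict (hypothetical helper).

    Splitting on whitespace is a simplification; it would break on
    columns that themselves contain spaces.
    """
    parts = line.split(None, 8)
    if len(parts) != 9:
        raise ValueError(f"expected 9 columns, got {len(parts)}")
    feature = dict(zip(COLUMNS, parts))
    feature["start"] = int(feature["start"])
    feature["end"] = int(feature["end"])
    attributes = {}
    for pair in feature["attributes"].strip().split(";"):
        if pair:
            tag, _, value = pair.partition("=")  # GFF3 uses tag=value
            attributes[tag] = unquote(value)
    feature["attributes"] = attributes
    return feature


exon = parse_gff3_line("chr3 src exon 1300 1500 . + . ID=exon00001")
print(exon["type"], exon["end"] - exon["start"] + 1, exon["attributes"]["ID"])
```

Coordinates are 1-based and inclusive, hence the `+ 1` when computing the feature length.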


Wig binary files

Scripts and modules to handle:

UCSC wiggle format (1 column; 2 column; 4 column) or gff3

GMOD script wiggle_to_wigBinary.pl

binary (.wib)

gff file

Function:

wiggle_to_wigBinary.pl variables (source / method / trackname / paths / input & output filenames)

command line to load binary / gff data into GBrowse (bp_seqfeature_load.pl + all variables: database name, filenames, paths etc)

a conf file stanza to display the loaded data

construct an intermediate wiggle format file (if input was gff3 or maq binary)

Peak Calling



Lots of algorithms to do this


Problems with identifying a good cut off score


Over and under prediction


F-Seq


Based on a training set of peaks identified by the researcher in a specific region


Iterate over the parameter space until the best TP/FP score is achieved


cisgenome


Uses IP and non-IP ChIP-Seq data, which increases the accuracy of predictions
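
The cut-off problem is easy to see with a toy example. The sketch below is a naive threshold caller written for illustration only; it is not the method used by F-Seq or cisgenome, and the depth values are invented.

```python
def call_peaks(depth, cutoff):
    """Return (start, end, max_depth) for runs where depth >= cutoff.

    `depth` is a list of per-base read depths; coordinates are 0-based
    and end-exclusive.
    """
    peaks = []
    start = None
    for pos, value in enumerate(depth):
        if value >= cutoff and start is None:
            start = pos  # a run begins
        elif value < cutoff and start is not None:
            peaks.append((start, pos, max(depth[start:pos])))
            start = None  # the run has ended
    if start is not None:
        peaks.append((start, len(depth), max(depth[start:])))
    return peaks


depth = [0, 1, 3, 8, 9, 4, 1, 0, 2, 5, 6, 2, 0]  # made-up coverage
for cutoff in (3, 5, 8):
    print(cutoff, call_peaks(depth, cutoff))
```

Raising the cut-off first narrows the peaks and then drops the weaker one entirely, which is the over- and under-prediction trade-off described above.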



Motif Extraction


Extract underlying DNA from peak calls


Run web-based motif finders on the extracted sequences


Weeder


MEME


May need to do successive rounds to find weaker motifs
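
Extracting the underlying DNA can be sketched as follows, assuming the genome is already held in memory as a dictionary of chromosome name to sequence and that peaks use 1-based inclusive coordinates as in GFF. The function name, the `flank` parameter and the toy sequence are all made up for this example.

```python
def peaks_to_fasta(genome, peaks, flank=0):
    """Return FASTA text for each peak interval.

    `genome` maps chromosome name to sequence string; `peaks` holds
    (chrom, start, end) tuples, 1-based and inclusive.
    """
    records = []
    for chrom, start, end in peaks:
        seq = genome[chrom]
        lo = max(start - flank, 1)          # clamp to chromosome start
        hi = min(end + flank, len(seq))     # clamp to chromosome end
        records.append(f">{chrom}:{lo}-{hi}\n{seq[lo - 1:hi]}\n")
    return "".join(records)


genome = {"chr3": "ACGTACGTTTGACGCAAACC"}  # toy sequence
fasta = peaks_to_fasta(genome, [("chr3", 9, 14)], flank=2)
print(fasta)
```

The resulting FASTA text is the kind of input that motif finders such as MEME accept.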



Quick note: SNP Calling

Often picks up errors introduced in the PCR amplification step




maq cns2snp (run during the easyrun option)


SNPseeker


Novoalign + CBRG script


Worth trying all of the above!


Molbiol Data Structure

Analyse your data on deva.molbiol.ox.ac.uk


CBRG set up /proj/hts/data/<username>


Suggested structure:


batch/


fastq/ dbname/


Contact us if you want a GBrowse database for your data




Future

Problem


In depth analysis after mapping = bottleneck


Need to empower the users to do their own analysis


Solution


Makefiles for bulk data analysis


Allow access to NGS data via GBrowse ‘workbench’


GBrowse plugins to export data to other tools


Galaxy (http://main.g2.bx.psu.edu/) looks promising


