8 Nov 2013

How will new sequencing technologies enable the HMP?

Elaine Mardis, Ph.D.
Associate Professor of Genetics
Co-Director, Genome Sequencing Center
Washington University School of Medicine

emardis@wustl.edu

Advantages of Next Gen Platforms

No sub-cloning, no use of E. coli as host
- cloning bias abolished
- one FTE can keep several instruments busy

Each sequence is from a unique DNA molecule
- quantitation is possible through “counting”
- enhanced dynamic range
- detection of rare variants

Multiple sequence-based assays on one platform
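The “counting” idea can be sketched in a few lines of Python. This is a minimal illustration, assuming each read has already been assigned a label such as a taxon or transcript name. The labels and the `relative_abundance` helper are hypothetical and are not taken from the talk.

```python
from collections import Counter

def relative_abundance(read_labels):
    """Return the fraction of reads assigned to each label.

    Each read derives from a single DNA molecule, so the number of
    reads per label acts as a digital measure of abundance.
    """
    counts = Counter(read_labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}

# Hypothetical read assignments, for illustration only.
reads = ["taxon_A"] * 6 + ["taxon_B"] * 3 + ["taxon_C"]
abundance = relative_abundance(reads)
print(abundance)
```

More reads widen the range between the rarest and the most common label that can be measured, which is one way to read the "enhanced dynamic range" and "rare variants" points.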

New Sequencing Platforms

Roche FLX Sequencer

Illumina 1G Analyzer

ABI SOLiD Sequencer

Helicos Single-molecule sequencer

Roche FLX: Vital Statistics

Currently:

>100 Mb data / 7 hours / $16K

Read lengths average 250 bp

Accuracy is hindered by homopolymer run in/dels

Coverage model is higher than for 3730 data

By year’s end:

Improved pipeline and read assembly software

Paired end reads

400 bp read lengths

Bar-code tagging of libraries

© Elaine Mardis, Ph.D.

Illumina 1G Analyzer: Vitals

Currently:

1 Gb / 4 days / $3-5,000

40 bp read lengths, 8 channel flow cell

Read accuracy is highest in 1st 25 bp, ~1% overall error rate

Biased representation of high AT regions

By year’s end:

Paired end read capability

50 bp read lengths

Improved short read mapping, assembly algorithms (?)

Cross-Platform Comparisons

Criterion         3730       Roche                            Illumina
Platform cost     $350K      $500K                            $395K
Read length       650 bp +   250 bp                           40-50 bp
Cost/run          $55        $16,000                          $3-5,000
Mbp/day           1.4        200                              333
Cost/Mbp          $880       $160                             $5
Accuracy          high       No subs, indels at homopolymers  high
Paired end reads  Yes        Coming                           Yes*
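The Roche and Illumina Cost/Mbp entries are consistent with dividing the cost per run by the yield per run quoted on the Vital Statistics slides. The sketch below shows that arithmetic. It assumes "$3-5,000" means $3,000 to $5,000 and that 1 Gb equals 1,000 Mbp. The 3730 entry is not recomputed, because the slides do not state a yield per run for that instrument.

```python
def cost_per_mbp(cost_per_run_usd, yield_mbp_per_run):
    """Cost per megabase pair = run cost / run yield."""
    return cost_per_run_usd / yield_mbp_per_run

# Run figures taken from the "Vital Statistics" slides.
roche = cost_per_mbp(16_000, 100)            # >100 Mb per run
illumina_low = cost_per_mbp(3_000, 1_000)    # 1 Gb per run, low end of range
illumina_high = cost_per_mbp(5_000, 1_000)   # 1 Gb per run, high end of range

print(roche, illumina_low, illumina_high)
```

Under these assumptions, the $5 listed for Illumina corresponds to the upper end of the run-cost range.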

AB SOLiD™: Vital Statistics

500 Mb-1 Gb / 5 days / ?$$

50 base pair read lengths / paired end or fragment reads

Ligation based sequencing with high accuracy due to 2-base encoding

Analysis software is unknown

Early access platform due Q3 of ’07
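To illustrate why 2-base encoding can help accuracy, here is a minimal sketch of color-space encoding as it is commonly described for SOLiD: each color reports on a pair of adjacent bases. The XOR formulation, the function names, and the example sequences are choices made for this illustration; none of them is the vendor's software.

```python
# 2-bit base codes chosen so that the color of a dinucleotide is the
# XOR of its two base codes (a common way to write the color table).
BASE_TO_CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
CODE_TO_BASE = {code: base for base, code in BASE_TO_CODE.items()}

def encode(seq):
    """Return (first_base, colors) for a DNA string."""
    colors = [BASE_TO_CODE[a] ^ BASE_TO_CODE[b] for a, b in zip(seq, seq[1:])]
    return seq[0], colors

def decode(first_base, colors):
    """Rebuild the base sequence from the first base and the colors."""
    bases = [first_base]
    for color in colors:
        bases.append(CODE_TO_BASE[BASE_TO_CODE[bases[-1]] ^ color])
    return "".join(bases)

def color_differences(seq_a, seq_b):
    """Indices at which the color strings of two sequences differ."""
    _, colors_a = encode(seq_a)
    _, colors_b = encode(seq_b)
    return [i for i, (x, y) in enumerate(zip(colors_a, colors_b)) if x != y]

reference = "ATGGCA"
first, colors = encode(reference)

# A real single-base substitution changes two adjacent colors.
snp_diffs = color_differences(reference, "ATGTCA")

# A single miscalled color changes every base decoded after it.
miscalled = list(colors)
miscalled[2] ^= 1
decoded_with_error = decode(first, miscalled)

print(colors, snp_diffs, decoded_with_error)
```

Because a true substitution alters two adjacent colors while an isolated miscall alters only one, a lone color mismatch against a reference can be treated as a likely measurement error rather than a variant.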

HeliScope sequencer

Single molecule detection obviates PCR amplification step

>25 Mbp/hour initial data rate, 1000 Mbp/hour ultimately with <1% error rate

Short read lengths, single molecule sequencing with high fidelity

Two 25 channel flow cells

Read mapping/assembly capability (?)

Comparative metagenomics: Cecal contents of obese mice (ob/ob) and lean littermates

EXPERIMENTAL DESIGN:

1) Remove cecal contents of 2 ob/ob, 2 +/+, and 1 ob/+ C57Bl/6J mice and isolate DNA.

2) 454 pyrosequencing of total DNA: 350,000 reads/mouse (one ob/ob, one +/+ mouse).

3) Compare data from each mouse to all known bacterial sequences.

4) Use data clustering methods to examine similarities and differences between all 5 mice that were sequenced.

5) Perform microbiota transplantation to test for ability to transfer phenotype to gnotobiotic mice.
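Step 4 does not name a clustering method. As one possible starting point, the sketch below computes pairwise Bray-Curtis dissimilarities between per-sample count profiles, a common input to hierarchical clustering. The choice of Bray-Curtis, the sample and group names, and the counts are all assumptions invented for illustration. They are not data or methods from the study.

```python
from itertools import combinations

def bray_curtis(profile_a, profile_b):
    """Bray-Curtis dissimilarity between two count profiles (dicts)."""
    groups = set(profile_a) | set(profile_b)
    diff = sum(abs(profile_a.get(g, 0) - profile_b.get(g, 0)) for g in groups)
    total = sum(profile_a.get(g, 0) + profile_b.get(g, 0) for g in groups)
    return diff / total if total else 0.0

def pairwise_distances(samples):
    """Dissimilarity for every unordered pair of sample names."""
    return {
        (a, b): bray_curtis(samples[a], samples[b])
        for a, b in combinations(sorted(samples), 2)
    }

# Invented read counts per sequence group; purely illustrative.
samples = {
    "sample_1": {"group_X": 70, "group_Y": 30},
    "sample_2": {"group_X": 65, "group_Y": 35},
    "sample_3": {"group_X": 40, "group_Y": 60},
}
distances = pairwise_distances(samples)
closest = min(distances, key=distances.get)
print(closest, distances[closest])
```

A value of 0 means identical profiles and 1 means no shared groups. Repeatedly merging the closest pair gives a simple agglomerative clustering.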

Next Gen RNA Sequencing

Our laboratory has developed a robust full-length cDNA process for 454-based sequencing of eukaryotic transcriptomes that features low input of total RNA, enzyme-based normalization and the ability to preferentially sequence the 5’ ends of cDNAs.

We are presently working to modify this approach for sequencing microbiota transcriptomes and clinical isolates likely to contain viral RNA genomes (e.g. nasal lavage samples).


Illumina ‘Mockagenomics’ Experiment

We created two mock metagenomic samples by combining known bacterial and human genomic DNAs and sequenced them on the Illumina platform to generate short (30 bp) reads.

We plan to compare the relative strengths of classification by assembly and alignment to those of “signature” characterization (GC content, kmer analysis) for short read data.
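The two “signature” features named above can be computed directly from a read. The following is a minimal sketch. The function names and the 30 bp example read are made up for illustration, and how such features would be fed into a classifier is left open.

```python
from collections import Counter

def gc_content(read):
    """Fraction of G and C bases in a read (0.0 for an empty read)."""
    read = read.upper()
    if not read:
        return 0.0
    return sum(base in "GC" for base in read) / len(read)

def kmer_counts(read, k):
    """Count every overlapping k-mer in a read."""
    read = read.upper()
    return Counter(read[i:i + k] for i in range(len(read) - k + 1))

read = "ATGCGCGCATATGCGCTTAAGGCCATGCAT"  # made-up 30 bp read
gc = gc_content(read)
tetramers = kmer_counts(read, 4)
print(round(gc, 3), tetramers.most_common(3))
```

A 30 bp read yields only 27 overlapping 4-mers, so per-read signatures are noisy. Pooling reads before comparing profiles is one way to stabilize them.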

Practical Issues

DNA quality and quantity

Value of paired end vs. fragment reads

Normalization vs. quantitation

Depth of “search space”

Sample prep

Fragment reads:

Evaluate DNA

Fragment (2-500bp)

Repair ends

Adapter ligate

Enrich

Amplify on bead (Roche/AB) or on glass slide (Illumina)

Paired end reads:

Evaluate DNA

Fragment (2.5kb)

Repair ends

Adapter ligate

Methylate

Restrict adapters

Circularize

2° restriction with type IIS enzyme

Purify tags+adapter

Amplify

Paired End Libraries

[Figure: Mate Pair Library. Labels: Internal Adapter; 25 base Tag #1; 25 base Tag #2; EcoP15I or fragmentation]

3-primer PE method

[Figure: 3-primer paired end method, in three panels.

Graft: P7 : P7diol : 9TUP5, with [P7 + P7diol] = [9TUP5]. P7diol & 9TUP5 linearisable; P7 non-linearisable.

Cluster formation: Heterogeneous clusters containing P7/9TUP5 bridges and P7diol/9TUP5 bridges.

Sequencing: PESP#1; PESP#2; NaIO4; U.S.E.R.; Read 1 (25 to 40 cycles); Read 2 (25-40 cycles); Total 50-80 cycles.

Other labels: P5, P7, P7diol, U, SBS3, SBS8.]

What are the issues?

Consented sample availability!!

Read length and accuracy

Sample complexity

Sensitivity to detect

Coverage and cost

DNA vs. RNA

Bioinformatics-based analyses

Bioinformatics Challenges

Most daunting issue: the ability to analyze enormous data sets intelligently and efficiently

Metagenomic analysis tools are now emerging for next gen sequence data

Testing and implementation into analysis pipelines will follow

Output is only as good as the depth of the search space and the depth of coverage for any given combination of sample & sequencer
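Depth of coverage for a given sample and sequencer can be estimated with the standard relation c = N × L / G (reads × read length / target size). Under an idealized Poisson model, the fraction of bases left uncovered is e^(-c). The sketch below is a rough estimate only. The read count and read length are figures quoted elsewhere in these slides, while the 5 Mbp target is a hypothetical single bacterial genome assumed for illustration.

```python
import math

def expected_coverage(num_reads, read_length_bp, target_size_bp):
    """Average depth of coverage: c = N * L / G."""
    return num_reads * read_length_bp / target_size_bp

def fraction_uncovered(coverage):
    """Poisson estimate of the fraction of bases with zero reads: e^(-c)."""
    return math.exp(-coverage)

# 350,000 reads and 250 bp appear in the slides; the 5 Mbp target
# size is an assumption made only for this example.
c = expected_coverage(350_000, 250, 5_000_000)
print(c, fraction_uncovered(c))
```

In a real metagenomic sample the reads are divided among many organisms of unequal abundance, so coverage of any single member is lower than this single-genome figure, and rare members may receive very few reads.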
