Next- and Third Generation Genome Sequencing Technologies, and ...


Oct 2, 2013 (3 years and 10 months ago)


Personalized Genomic Data, Whole Genome
Sequencing Technologies, and the Clinical
Diagnostic Genome.

Chris Bradburne, PhD.

Applied Physics Laboratory,

Johns Hopkins University

Lecture Topics


Genomic diseases and personalized medicine


SNP variation and risk estimation


Translating ‘Omics to the Clinic


Next generation sequencing technologies


Bioinformatics and analysis problems for human
genome sequence and clinical diagnostics


Genome sequencing technologies and the future
of personalized medicine.



Genomic Disease and Medicine

[Figure: The long road to realizing genome-informed medicine. Timeline marked 1990, 2000 and 2011, with stages labeled: structure and primary sequence of genomes; biology and variation of genomes; biology of disease. Annotations: ~1000 genome-wide association studies (GWAS); advent of next generation sequencing and exponential decrease in cost of full exome and genome sequencing.]

E. D. Green et al. Nature 470, 204-213 (2011) doi:10.1038/nature09764

Types of Genetic Diseases


Mendelian Diseases: single gene or single locus diseases

Polygenic Disorders: multiple genes/loci

Chromosomal Diseases: altered chromosomal structure or number

Complex Diseases: multiple loci and environmental factors

Influence of genetics on human disease.

For any condition, the overall balance of genetic and environmental determinants can be represented by a point somewhere within the triangle.

[Figure: triangle with corners labeled Single Locus / Mendelian, Multiple Loci or multi-chromosomal, and Environmental. Examples shown: Cystic Fibrosis (gene = CFTR), Hemophilia A (gene = F8), Alzheimer’s Disease, Type II Diabetes, Cardiovascular Disease, Lung Cancer. Environmental factors shown: Diet, Carcinogens, Infections, Stress, Radiation, Lifestyle.]

F8 = Coagulation Factor VIII

CFTR = Cystic Fibrosis Transmembrane Conductance Regulator

Mendelian disease genes are more likely to be involved in complex diseases.

Jin W et al. Hum. Mol. Genet. 2012;21:1611-1624

hOMIM: Database of Online Mendelian Inheritance in Man

GAD: Genetic Association Database

Mendelian alleles are not restricted to individual traits.


Sequence Variation

1 bp (single nucleotide): Single base change, Single base indel, SNP variation

2 bp to 1,000 bp: Microsatellites, Indels, Inversions, Repetitive Sequences, VNTRs

Structural Variation

1 kb to submicroscopic: Copy number variants (CNVs), Segmental duplications, Inversions, Translocations, Microdeletions, Microduplications

Microscopic to subchromosomal: Segmental autosomy, Chromosomal deletions, insertions, inversions, abnormality, Intrachromosomal translocations, Heteromorphisms, Fragile Sites

Whole chromosome to whole genome: Interchromosomal translocations, Ring chromosomes, Isochromosomes, Marker Chromosomes, Aneuploidy, Aneusomy

Ranges of structural variation in the genome influence inherited phenotypes and conditions

Rate of SNPs in human genome = 1 to 3 SNPs per 1000 bps.
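
As a rough worked example of the rate quoted above, the sketch below multiplies 1 to 3 SNPs per 1000 bp by a genome length of 3 billion bp (the figure the lecture quotes for the Human Genome Project). The function name is illustrative only, and the result is an order-of-magnitude estimate rather than a measured count.

```python
def expected_snps(genome_bp, snps_per_kb):
    """Expected number of SNP positions for a genome of the given length."""
    return genome_bp * snps_per_kb / 1000


GENOME_BP = 3_000_000_000  # ~3 billion bp, as quoted in the lecture

low = expected_snps(GENOME_BP, 1)   # lower bound: 1 SNP per 1000 bp
high = expected_snps(GENOME_BP, 3)  # upper bound: 3 SNPs per 1000 bp
print(f"Roughly {low:,.0f} to {high:,.0f} SNP positions")
```

Under these assumptions the estimate is on the order of millions of SNP positions, which is the scale that genome-wide SNP arrays sample from.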

Technology has enabled genome-wide screens using SNPs

Affymetrix 6.0 Genome-Wide SNP Chip: MAF = 23%, Variants: 900,000 SNPs, 900,000 CNVs

Illumina OmniBead 2.5 Arrays: 70% or more representation at MAFs of 5% or more for most haplogroups, Variants: 2,300,000 SNPs

The Genome-Wide Association Study (GWAS)

Manolio TA. N Engl J Med 2010;363:166-176.

As of 2011, 1200 human GWASs have been published on over 400 traits

Utilizing GWAS data to screen individuals for
disease conditions.



Age-related macular degeneration
Crohn’s Disease
Myocardial infarction
Inflammatory bowel disease
Type 1 and Type 2 Diabetes
Bipolar Disorder
Coronary artery disease
Hypertension
Rheumatoid arthritis
Human height
Others

Individual/patient is genotyped using a panel of SNPs

SNPs for each disease/trait yield an Odds ratio (OR) for each disease.

Relative risk for each disease can be assigned for that individual

Useful or not???

GWAS characteristics


Alleles are considered in Linkage Disequilibrium (LD) using a value of r² >= 0.8

Usually based on International HapMap data that used a Minor Allele Frequency of MAF > 5%.

MAF is the frequency at which the less common allele occurs in the population.

Odds ratio (OR) is calculated, which represents an ‘effect size’.

It compares the odds of disease if Allele A is carried vs. if Allele B is carried.

Population stratification is controlled for, to minimize the effect of variation common to specific ethnic groups and to increase power.

Most SNPs found in LD still do not yield large ORs, and so most lack significant predictive power.
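
The quantities named above (MAF, odds ratio, and the LD measure r²) can be sketched with their standard textbook definitions. This is a minimal illustration, not the pipeline used by any particular GWAS; all counts and frequencies are hypothetical, and the function names are invented for this example.

```python
def minor_allele_frequency(n_aa, n_ab, n_bb):
    """MAF from genotype counts at a biallelic SNP with alleles A and B."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    freq_a = (2 * n_aa + n_ab) / total_alleles
    return min(freq_a, 1 - freq_a)


def allelic_odds_ratio(case_a, case_b, control_a, control_b):
    """Odds of carrying allele A (vs B) in cases divided by the odds in controls."""
    return (case_a / case_b) / (control_a / control_b)


def ld_r_squared(freq_ab, freq_a, freq_b):
    """r^2 between two loci from the A-B haplotype and the two allele frequencies."""
    d = freq_ab - freq_a * freq_b  # linkage disequilibrium coefficient D
    return d * d / (freq_a * (1 - freq_a) * freq_b * (1 - freq_b))


# Hypothetical numbers, for illustration only
maf = minor_allele_frequency(n_aa=640, n_ab=320, n_bb=40)
odds_ratio = allelic_odds_ratio(case_a=300, case_b=700, control_a=200, control_b=800)
r2 = ld_r_squared(freq_ab=0.5, freq_a=0.6, freq_b=0.7)

print(f"MAF = {maf:.2f}, OR = {odds_ratio:.2f}, r^2 = {r2:.3f}")
print("In LD at r^2 >= 0.8:", r2 >= 0.8)
```

With these made-up inputs the SNP passes a MAF > 5% filter, the odds ratio is modest, and the pair of loci falls well below the r² >= 0.8 cutoff.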

Hindorff et al., 2009

The structural position of a SNP can influence the
effect size, but most SNPs are not causative.

TA Manolio et al. Nature 461, 747-753 (2009) doi:10.1038/nature08494

Feasibility of identifying genetic variants by risk allele frequency and strength of genetic effect (odds ratio).

Published Genome-Wide Associations through 06/2011: 1,449 published GWA at $p \leq 5 \times 10^{-8}$ for 237 traits

NHGRI GWA Catalog

www.genome.gov/GWAStudies
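
The genome-wide significance threshold of p ≤ 5x10^-8 quoted above is commonly explained as a Bonferroni correction of a 0.05 significance level over roughly one million independent tests. That one-million figure is an assumption of this sketch, not something stated on the slide, and the function names are illustrative.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    return alpha / n_tests


# Assumption: about one million independent common-variant tests
threshold = bonferroni_threshold(alpha=0.05, n_tests=1_000_000)


def is_genome_wide_significant(p_value, cutoff=threshold):
    """True if a SNP association p-value passes the corrected threshold."""
    return p_value <= cutoff


print(f"threshold = {threshold:.1e}")
print(is_genome_wide_significant(3e-9))   # a hypothetical strong association
print(is_genome_wide_significant(1e-5))   # a hypothetical weak association
```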

dbSNP: database of SNPs

NCBI dbSNP Build 135, Nov 14, 2011
New submissions: 167,019
New RefSNP Clusters: 12,632,873
# Validated: 12,539,846
New submission with frequency data: 125

dbVAR: database of Variants

Multiple entries. Late 2011

Example: 1000 genomes consortium:
# of variant regions: 218,039
# of variant calls: 2,150,366
# of SNPs: Approx 15 million
# of Indels: Approx 1 million
# structural variants: Approx 20,000

Per Individual:
# of Loss-of-function variants: 250-300 on average per individual.
# variants previously implicated in genetic disorders: 50 to 100 on average per individual

Cataloguing Human Genetic Variation

Direct to Consumer Genetic Testing

Copyright ©2011 American Association for Clinical Chemistry

Imai, K. et al. Clin Chem 2011;57:518-521

Concordance of relative risk estimations for 3 Direct-to-Consumer companies

Catherine Stack et al., Genetics in Medicine, 2011

[Flow chart: Genetic Variant / Health Condition Selection; Informed Cohort Oversight Board (ICOB) Assessment; Selection of Publication for Risk Reporting; Risk reporting]

Condition selection and risk estimation via a published literature curation process and oversight board selection

Careful literature curation and variant selection will have to be done.

Genetic Risk Assessment for the Coriell Personalized Medicine Collaborative (CPMC).

Estimates of heritability and number of loci for several complex traits

TA Manolio et al. Nature 461, 747-753 (2009) doi:10.1038/nature08494

For the majority of diseases, SNPs account for
very little of the heritability of complex traits.

Environment and the missing heritability.


Example of chromosomal evolution in human solid tumor progression

Scalar protein analysis of domains enriched in genetics (SPADE-gen): changing the paradigm of drug repositioning for complex diseases with genetically anchored biological mechanisms.

Regan K et al. J Am Med Inform Assoc 2012;19:306-316

New approaches are needed to combine systems level biology derived from Mendelian interactions with complex disease models.


Translating ‘Omics to the clinic

From: Summary of Evolution of Translational Omics meeting, NAS, 2012

Omics-based test development process. In the first stage of omics-based test development, there are two phases: discovery and test validation.

Usefulness of genomic screening devices and data in the clinic must be validated.

Analytic Validity
The accuracy of a genetic test in identifying the presence or absence of a specific mutation.
The probability that a test will be positive when a particular sequence (analyte) is present (analytical sensitivity) and the probability that the test will be negative when the sequence is absent (analytical specificity).

Clinical Validity
The accuracy with which a test predicts a clinical outcome.
The sensitivity, specificity, and predictive value of a test in relation to a particular phenotype (Holtzman & Watson, 1999)

Clinical Utility
The net value of the information gained from a genetic test in changing disease outcomes (Gwinn 2004)
Requires the collection of data demonstrating the benefits and risks that accrue from both positive and negative results.

ACCE Model Process for Evaluating Genetic Tests
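
The sensitivity, specificity and predictive values defined above all come from the same 2x2 table of test results against truth. The sketch below uses the standard definitions with hypothetical counts; it is not part of the ACCE framework itself, and the function name is illustrative.

```python
def validity_metrics(true_pos, false_pos, false_neg, true_neg):
    """Sensitivity, specificity and predictive values from a 2x2 table."""
    return {
        # P(test positive | sequence or phenotype present)
        "sensitivity": true_pos / (true_pos + false_neg),
        # P(test negative | sequence or phenotype absent)
        "specificity": true_neg / (true_neg + false_pos),
        # P(present | test positive)
        "ppv": true_pos / (true_pos + false_pos),
        # P(absent | test negative)
        "npv": true_neg / (true_neg + false_neg),
    }


# Hypothetical evaluation: 100 samples carry the variant, 900 do not
metrics = validity_metrics(true_pos=95, false_pos=9, false_neg=5, true_neg=891)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Note that the predictive values depend on how common the variant or phenotype is in the tested group, whereas sensitivity and specificity do not.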

Whole genome Sequencing…

Whole Genome Sequencing:

Why did the Human Genome Project Cost so much?

Sanger Sequencing


Dideoxy chain termination sequencing


10 years to complete



Cost $2.7 Billion, to sequence 3 billion bps


Approximately $1 / base


99% complete by 2001 (Draft sequence)


Still not exactly complete…



Sanger Sequencing: low throughput


700-1000 bps per run, so equivalent to 3-4 million individual experiments if not optimized for higher throughput.
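
The figures on this slide can be checked with simple arithmetic. The sketch below assumes single-pass coverage with no overlap between reads, which understates what a real shotgun project needs; the constant and function names are illustrative.

```python
import math

HGP_COST_USD = 2.7e9        # cost quoted on the slide
GENOME_BP = 3_000_000_000   # 3 billion bp quoted on the slide

# Cost per base: close to the "approximately $1 / base" on the slide
cost_per_base = HGP_COST_USD / GENOME_BP


def sanger_runs_needed(genome_bp, read_length_bp):
    """Minimum reads to cover the genome once, ignoring overlap and redundancy."""
    return math.ceil(genome_bp / read_length_bp)


runs_at_1000bp = sanger_runs_needed(GENOME_BP, 1000)
runs_at_700bp = sanger_runs_needed(GENOME_BP, 700)

print(f"${cost_per_base:.2f} per base")
print(f"{runs_at_1000bp:,} to {runs_at_700bp:,} runs at 1x coverage")
```

This reproduces the slide's range of roughly 3 to 4 million individual experiments.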

Sequencing cost is crashing, throughput soaring!

Impact on Metagenomics: Schmieder and Edwards, 2012

Next Generation Sequencing technologies

Based on massively parallel sequencing of fragmented templates.

Detects clonally amplified templates from a single DNA molecule (2nd generation technologies).

Amplification of clone in a massively parallel fashion is done by:
Emulsion PCR (Roche 454)
Bridge Amplification on a solid surface (Illumina)


Roche 454 Sequencer


Pyrosequencing, chemiluminescent signal generation, and optical detection
Signal processing to determine base sequence and quality score
Approximately 1 million reads obtained in parallel on a large microtiter plate format
400-700 base pairs / read
500-900 Mb/run
Run time of 10-20 hours
1% Error rate




Illumina Solexa

Reversible dye-labeled chain terminators
Signal processing to determine base sequence and quality score
Approximately 1 million reads obtained in parallel on a large microtiter plate format
150-200 base pairs / read
96K Mb/run
Run time of 14 days
0.1% Error rate


SOLiD Sequencing

Sequencing by Ligation
Emulsion PCR amplification
8-12 day run time
500-1400 reads per run
100,000 to 155,000 MB / run
0.01-0.06 error rates, but has an A-T bias

Comparison of the Roche 454 and Illumina Genome Analyzer II platforms workflows.

Illumina GA II:
Magnesium-catalyzed hydrolysis of cDNA to a 200 nt average length
Adapter ligation
Size selection on a gel of adapter/cDNA products
PCR amplification of adapter/cDNA products
Attachment to flow cells

Roche 454 FLX:
Hydrolysis of cDNA to a 500 nt average length
Adapter ligation
Attachment of single adapter/cDNA onto bead
PCR amplification of adapter/cDNA products on bead

Run on machines

[Figure: Mapping to the reference genome. Mapped reads are shown along the reference genome, with regions labeled 4 RPKM, 2 RPKM, 6 RPKM and 1 RPKM.]
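
RPKM, the unit used in the mapping figure, stands for reads per kilobase of transcript per million mapped reads. The slide does not spell out the formula, so the sketch below uses the standard definition with hypothetical read counts.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    kilobases = feature_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions_mapped)


# Hypothetical example: 200 reads on a 2,000 bp gene, 25 million mapped reads
value = rpkm(read_count=200, feature_length_bp=2_000, total_mapped_reads=25_000_000)
print(f"{value:.1f} RPKM")
```

Normalizing by both feature length and library size is what lets values from genes of different lengths, and from runs of different depths, be compared.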

DNA Template

Four types of reads generated in NGS

Sequencing workflows, biases and quality scores are influenced by their platforms, approaches and chemistries

Phred quality scores (Q): A quality score assigned to each base call in a consensus sequence. Represented by the equation $Q = -10 \log_{10} P$, where P is the probability that the base call is incorrect.

Quality scores can generally be improved by increasing depth.
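
The Phred relationship can be turned into a small converter. The sketch below implements Q = -10 log10 P and its inverse, and decodes a FASTQ quality string assuming the common Phred+33 ASCII encoding. That encoding is an assumption here, since the slides do not say which one each platform used.

```python
import math


def phred_from_error_prob(p_error):
    """Quality score Q for a given probability that the base call is wrong."""
    return -10 * math.log10(p_error)


def error_prob_from_phred(q):
    """Probability that the base call is wrong for a given quality score Q."""
    return 10 ** (-q / 10)


def decode_fastq_quality(quality_string, offset=33):
    """Per-base Q scores from a FASTQ quality line (assumes Phred+33)."""
    return [ord(char) - offset for char in quality_string]


print(phred_from_error_prob(0.001))   # a 1-in-1000 error chance is about Q30
print(error_prob_from_phred(20))      # Q20 is a 1-in-100 error chance
print(decode_fastq_quality("I5+"))    # hypothetical three-base quality string
```

On this scale the 1% error rate quoted for Roche 454 corresponds to about Q20, and the 0.1% quoted for Illumina to about Q30.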

3rd Generation Sequencing technologies

Single Molecule sequencers. Uses a single molecule DNA template and therefore does not require amplification.
PacBio
Oxford Nanopore

Sequencers with improved optical benches and chemistries
Illumina HiSeq, MiSeq
Roche 454 with Titanium chemistries

Sequencers without an optical bench
Ion Torrent

PacBio Single Molecule Sequencer

Single molecule sequencing
0.5 to 2 hour run time
10K reads/run
860-1100 Bases/read
5-10 MB / run
16-20% error rate

Illumina HiSeq, MiSeq

Updated chemistries for improved throughput, error rates, read length, and data generation.

MiSeq: 26 hour run time, 3.4 million reads / run, 150-200 bp reads, 1024 Mb / run, 0.1% error rate

HiSeq: 8 day runs, 500-3000 million reads / run, 100 bp read lengths, 100,000 to 600,000 MB / run, 0.1% error rate.

[Image labels: HiSeq, MiSeq]

Ion Torrent

pH change detection with release of charged ion (proton) during addition of nucleobase to growing chain.
2 hour run times
1-8 million reads /run
100 bases /read
100-1000 MB/run
Approximately 1% error rates
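
The per-run figures quoted for these platforms can be sanity-checked by multiplying reads per run by read length. The sketch below does that for three of them and converts the yield to fold coverage of a 3 billion bp genome. It assumes single-end reads of uniform length, so it will not match the quoted Mb/run figures exactly; for example, the MiSeq figure may reflect paired-end reads.

```python
GENOME_BP = 3_000_000_000  # human genome length used elsewhere in the lecture


def yield_mb(reads_per_run, read_length_bp):
    """Approximate sequence yield per run in megabases."""
    return reads_per_run * read_length_bp / 1_000_000


def fold_coverage(reads_per_run, read_length_bp, genome_bp=GENOME_BP):
    """Average number of times each genome position is covered by one run."""
    return reads_per_run * read_length_bp / genome_bp


# (reads per run, read length in bp), taken from the slide figures
platforms = {
    "MiSeq": (3_400_000, 150),
    "Ion Torrent": (8_000_000, 100),
    "PacBio": (10_000, 1_100),
}

yields = {name: yield_mb(*spec) for name, spec in platforms.items()}
for name, spec in platforms.items():
    print(f"{name}: {yields[name]:,.0f} Mb/run, {fold_coverage(*spec):.4f}x human coverage")
```

None of these single runs approaches 1x coverage of a human genome, which is why whole genome sequencing at the time relied on the much higher output of instruments such as the HiSeq.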

Oxford Nanopore, Single Molecule Sequencing

Single Molecule Sequencing

Alpha-hemolysin heptamer immobilized in a lipid bilayer, which separates 2 wells of buffered KCl solution.

Voltage is applied, causing negatively charged ssDNA to electrophorese through the pore.

Each base that passes through the nanopore results in a current blockade that is influenced by DNA strand length and base composition.

Thousands of BP reads…

Can differentiate modified bases…

Whole genome sequencing for personalized medicine
and for clinical diagnostics: Current Issues


Error rates, quality scores


Data Storage, Computational power


Up front database management (important for clinical diagnostics and metagenomics)


Assembling the genome


$1000 genome, $1,000,000 Analysis problem


Errors in the Data: Nucleotide Bias and Data Quality Issues

Illumina GAII data showing biases (Courtesy of Yurey Fofanov)

Computational and memory/storage requirements to run a sequencing center are large

Courtesy Dr. Yurey Fofanov, Director, Center for Biomedical and Environmental Genomics, University of Houston

Issues assembling a shotgun sequence


Issues: Repetitive sequences


Structural Variations leading to
over or underrepresented reads


Biased base calling or library
preparation issues.


Over or underrepresented
sequences in the database


Assembly and Mapping: de
novo? Or mapping and then
assembly?

Genome assembly and identification: Up front database management for metagenomics applications

Redundancies in the NCBI database can confound
causative pathogen ID and quantification

Gulf Oil Spill, summer of 2010

Human genome assembly and annotation can
be confounded because of structural variation

Assembly and comparison of an Asian and an African genome.

Non-SNP structural variants were difficult to assemble and call.

SNPs, however, were easy to identify!

Li et al., 2011, Nature Biotechnology

Use of combinatory technologies like multiple sequencing platforms, or Optical Mapping to help assemble a genome can address non-SNP structural variants.

Single DNA molecules are flowed through microfluidic channels and immobilized on a charged glass surface. The immobilized DNA is digested, maintaining the fragment order.

The DNA fragments are stained with fluorescent dye; fragment length is proportional to fluorescence intensity. By overlapping fragment patterns, the single-molecule maps are assembled to produce a Whole Genome Map that provides a minimum 30 X coverage.

Personalized medicine by personal ‘omics profiling

Closing Remarks


Initial basic questions and techniques still need to be proved out, but are promising.

Enormous SNP/GWAS effort has yielded data that, though not very predictive, will translate easily to the whole genome sequencing era and to personalized medicine.

New effort of GWAS to address non-SNP structural variants?

Personalized medicine will include other ‘omics and clinical data to provide systems biology-level, temporal monitoring of individual health.