IST 444 Bioinformatics


High Throughput Genomic DNA
Sequencing and Bioinformatics

The Human Genome Project


The Human genome is now
officially sequenced.


That was a big job.


How did they do it?


Is there anything that a
knowledge of bioinformatics tells
us that we should watch out for
in the human genome sequence?

What is DNA Sequencing?


A DNA sequence is the order of the
bases on one strand.


By convention, we order the DNA
sequence from 5’ to 3’, from left to
right.


Often, only one strand of the DNA
sequence is written, but usually both
strands have been sequenced as a check.
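
As a minimal sketch of the strand convention above, the second strand can be derived from the written one by complementing each base and reversing the order, so that it also reads 5' to 3'. The function name and the example sequence are illustrative assumptions, not part of the course material.

```python
# Illustrative helper (hypothetical name); handles only unambiguous A/C/G/T bases.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Return the opposite strand, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))

top_strand = "ATGCAGT"                        # written 5' to 3'
bottom_strand = reverse_complement(top_strand)
print(top_strand, bottom_strand)              # ATGCAGT ACTGCAT
```

Sequencing both strands as a check amounts to confirming that the read from one strand equals the reverse complement of the read from the other.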


DNA Sequencing was Awarded
the Nobel Prize


Walter Gilbert and Fred Sanger were
awarded the Nobel Prize in Chemistry
for the development of two different
methods of DNA sequencing.


http://www.nobel.se/chemistry/laureates/1980/


(Oh yes, and Paul Berg for Recombinant
DNA - a big year!)

Two Methods of DNA
Sequencing


Maxam-Gilbert Method, in which a DNA
sequence is end-labeled with [P-32] phosphate
and chemically cleaved to leave a signature
pattern of bands.


Sanger Method, in which a DNA sequence is
annealed to an oligonucleotide primer, which is
then extended by DNA polymerase using a
mixture of dNTP and ddNTP (chain
terminating) substrates. This is the main
method used now.

Sanger Method is a Form of
DNA Synthesis


The DNA to be sequenced acts as a
template for the enzymatic synthesis
of a new DNA strand, starting at a
defined primer.


Polymerases used are Pol I type
polymerases.


Incorporation of a dideoxynucleotide
blocks further synthesis of the new
DNA strand.

Remember the Rules of
In Vivo DNA Replication


How the Reaction Works


If the DNA is double stranded, the
reaction is started by heating until the
two strands of DNA separate.


Lower the temperature and the
primer sticks to its intended location by
hydrogen bonds.


DNA polymerase starts elongating the
primer.


If allowed to go to completion, a new
strand of DNA would be the result.


How the Reaction Works


If we start with a billion identical
pieces of template DNA, we'll get a
billion new copies of one of its strands.


We run the reactions, however, in the
presence of a dideoxyribonucleotide.


This is just like a regular nucleotide, except it
has no 3' hydroxyl group: once it's
added to the end of a DNA strand,
there's no way to continue elongating it.



Original Sanger Sequencing


A mixture of dNTPs and a single ddNTP
is used in the reaction tubes.


We can start with 4 different reaction
tubes, each with all four dNTPs (dATP,
dGTP, dTTP, dCTP) and ONLY one of
ddATP, ddCTP, ddGTP, or ddTTP (at only about 1%).


The key is MOST of the nucleotides are
regular ones, and just a fraction of
them are dideoxynucleotides.

An Example of a T tube:


MOST of the time when a 'T' is required to
make the new strand, the enzyme will get a
good one and it continues to elongate.


MOST of the time after adding a T, the
enzyme will go ahead and add more
nucleotides.


However, about 1% of the time, the enzyme
will get a dideoxy-T, and that strand can
never again be elongated.


It eventually breaks away from the enzyme,
leaving a dead end DNA that can’t be further
extended.


Original Sanger Sequencing


Sooner or later ALL of the copies will
get terminated by a T.


But each time the enzyme makes a new
strand, the place it gets stopped will be
random.


In millions of starts, there will be
strands stopping at every possible T
along the way.

Specific Primers Start the
Sequence


ALL of the strands we make started
at one exact position.


ALL of them end with a T. There are
billions of them ... many millions at
each possible T position.


To find out where all the T's are in
our newly synthesized strand, all we
have to do is find out the sizes of all
the terminated products!
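
The T-tube logic above can be sketched as a small simulation. This is a toy model under stated assumptions: the 1% ddT fraction comes from the slides, while the function name, the example strand, the number of copies, and the choice to count lengths from the first base after the primer are all illustrative.

```python
import random

def t_tube_fragment_lengths(new_strand, dd_fraction=0.01, copies=20_000, seed=0):
    """Simulate one 'T tube' and return the set of terminated fragment lengths.

    new_strand is the strand being synthesized (after the primer), 5' to 3'.
    Lengths are counted from the first base added after the primer.
    """
    rng = random.Random(seed)
    t_positions = [i + 1 for i, base in enumerate(new_strand) if base == "T"]
    lengths = set()
    for _ in range(copies):
        # Each copy is extended until it happens to pick up a dideoxy-T.
        for pos in t_positions:
            if rng.random() < dd_fraction:
                lengths.add(pos)      # chain terminated at this T
                break
        # Copies that never pick up a ddT run off the end of this short toy
        # template and are ignored here.
    return lengths

strand = "GATTCGTACT"                 # hypothetical newly synthesized strand
print(sorted(t_tube_fragment_lengths(strand)))
```

With enough copies, the observed fragment sizes are the positions of the T's in the new strand (3, 4, 7, and 10 for this example), which is the information a T lane on the gel provides.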


Non-Radioactive DNA Labels


Add a chemical tag to each ddNTP that can
emit a fluorescent color when excited by a
laser.


We can add a different dye to each ddNTP,
and each emits a different color when
excited by the laser.


Run the reactions in only one tube, not 4
tubes!


This is easier and faster. A big contribution
to high throughput sequencing.


Automated DNA Sequencing


We don't even have to 'read' the sequence
from the gel - the computer does that for us!


This is a plot of the colors detected in one
'lane' of a gel (one sample), scanned from
smallest fragments to largest.


The computer even interprets the colors by
printing the nucleotide sequence across the
top of the plot.


This is just a fragment of the entire file,
which would span around 700 or so nucleotides
of accurate sequence.
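
How the computer turns colors into letters can be illustrated with a deliberately naive sketch: at each peak, call the base whose dye channel is brightest. The channel order, the intensity values, and the function name are assumptions made for illustration; real base callers also have to find the peaks, correct for dye mobility, and estimate quality.

```python
# Assumed order of the four dye channels in each intensity tuple.
CHANNEL_BASES = ("A", "C", "G", "T")

def call_bases(peaks):
    """Call one base per peak: the channel with the highest intensity wins."""
    calls = []
    for intensities in peaks:
        strongest = max(range(4), key=lambda ch: intensities[ch])
        calls.append(CHANNEL_BASES[strongest])
    return "".join(calls)

# Hypothetical peak intensities, ordered from smallest fragment to largest.
trace = [
    (900, 30, 20, 10),
    (15, 40, 850, 25),
    (20, 10, 30, 700),
    (10, 800, 60, 20),
]
print(call_bases(trace))  # AGTC
```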

Automated DNA Sequence Readouts

The Biology of DNA Sequencing


Virtually all DNA sequencing (both automated
and manual) relies on the Sanger method:


DNA replication with dideoxy chain
termination


separation of the resulting molecules by
polyacrylamide gel electrophoresis.


The DNA fragment to be sequenced must
first be cloned into a vector (plasmid or
lambda).


Then the cloned DNA must be copied in a test tube
(in vitro) by a DNA polymerase enzyme to obtain a
sufficient quantity to be sequenced.

Sample DNA Sequence from ABI sequencer


Automated sequencing machines, particularly those
made by PE Applied Biosystems, use 4 colors of dye,
so they can read all 4 bases at once.

Challenges of DNA Sequencing


One technician with an automated DNA
sequencer can produce over 20 kb of
raw sequence data per day.


The real challenge of DNA sequencing is
in the analysis of the data


J. Craig Venter


Proposed a whole-genome shotgun sequencing
method to NIH in 1991. Proposal rejected.


Sets up The Institute for Genomic Research
(TIGR) in 1992 (private and non-profit)


TIGR publishes the first complete genome
sequence in 1995 (Haemophilus influenzae)


Forms Celera Genomics in 1998 to sequence
human genome in three years (private, for-profit)


The Sequence of the Human Genome is
published in Science. February 2001


Venter departs Celera. 2002

Human Genome Project Sequencing Strategy


Clone-based physical mapping


Digest genome and make Bacterial
Artificial Chromosomes (BACs, 150,000
bp each)


Digest BACs to create fingerprints


Organize BACs to form contigs


Select BAC clones for sequencing


Shear BACs and shotgun clone


Sequence clones and assemble overlaps

Celera Sequencing Strategy


Whole-genome shotgun sequencing of
five individuals with 5-fold coverage


Computer assembles overlapping
sequences to form contigs


Contigs are assembled into scaffolds


Scaffolds are mapped to the genome
by two or more Sequence Tagged Site
(STS) markers
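
What "5-fold coverage" implies can be estimated with a back-of-the-envelope sketch. It assumes reads land uniformly at random (the standard Poisson idealization, which real genomes with repeats and cloning bias do not strictly satisfy), a genome of roughly 3 billion bp, and ~500 bp reads; the function names are illustrative.

```python
import math

def reads_needed(genome_size, read_length, coverage):
    """Number of reads whose total length equals coverage x genome size."""
    return math.ceil(coverage * genome_size / read_length)

def fraction_uncovered(coverage):
    """Expected fraction of bases hit by no read, under the Poisson model."""
    return math.exp(-coverage)

genome = 3_000_000_000   # assumed approximate human genome size in bp
n_reads = reads_needed(genome, 500, 5)
missed = fraction_uncovered(5) * genome
print(n_reads)           # 30000000
print(round(missed))     # roughly 20 million bases with no read at all
```

Even in this idealized model, 5-fold coverage leaves well under 1% of bases unsequenced but still millions of bases in gaps, which is why contigs have to be linked into scaffolds.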

Technology Breakthroughs


Development of Expressed Sequence Tag
(EST) method to discover and map human
genes


Development of Bacterial Artificial
Chromosomes (BACs) to clone large DNA
fragments


Development of an automated high-throughput
capillary DNA sequencer in
1998 (Applied Biosystems ABI PRISM
3700 DNA Analyzer)


Development of powerful computers and
software to analyze sequence data

Genome Questions


Has every base in our genome been
sequenced?


What is the total number of genes and
where are they located?


How many genes have an unknown function?


What percent of our DNA encodes genes
and what is the remainder?


Do we share DNA sequences with other
organisms?


How much sequence variation is there
between individuals?


Genome Sequencing: HTG, GSS, (WGS)

[Figure: genome sequencing workflow. Labels: whole BAC insert (or genome); shredding; cloning/isolating; sequencing; assembly; draft sequence (HTG division); GSS division or trace archive.]

GSS Division: Genome Survey Sequences


Genomic equivalent of ESTs


BAC and other first pass surveys


BAC end sequences


Whole Genome Shotgun (some)


RAPIDS and other anonymous loci

[Figure: a genomic clone (BAC) with its T7 end and SP6 end, and a working draft sequence containing gaps.]

Technology Limitations


Sequences can only be determined in
approximately 400-800 base pair
sections known as “reads.”


This is due to both the biochemistry of the
DNA polymerase enzyme and the resolution
of polyacrylamide gel electrophoresis.


Most genes contain many thousands of bp
and many modern sequencing projects are
intended to produce complete sequences of
large genomic regions (millions of bp)

Assembly of Contigs


As a result, all sequencing projects
must involve the division of the
target DNA into a set of overlapping
~500 bp fragments.


Then these fragments are assembled
into complete sequences (contigs)


Contig = contiguous sequenced region


Assembly of overlapping fragments is
a computational problem
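
To make the computational problem concrete, here is a minimal greedy sketch: repeatedly merge the two fragments with the longest exact end overlap. It assumes error-free reads from a single strand and no repeats, which real data never satisfies, and it is not the algorithm of any particular assembler; all names and the toy fragments are illustrative.

```python
def overlap_len(a, b, min_len=3):
    """Length of the longest suffix of a that exactly equals a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(fragments, min_len=3):
    """Merge fragments pairwise by best overlap until no overlaps remain."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (0, None, None)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap_len(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break                     # remaining pieces are separate contigs
        merged = frags[i] + frags[j][n:]
        frags = [f for k, f in enumerate(frags) if k not in (i, j)] + [merged]
    return frags

reads = ["ATGCGTAC", "GTACGTTA", "GTTAGC"]   # toy fragments
print(greedy_assemble(reads))                # ['ATGCGTACGTTAGC']
```

Fragments that share no sufficient overlap are returned as separate contigs, which mirrors how gaps arise in a real assembly.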


Contig Assembly Problems

1) The 500 bp reads of sequence data have
errors of both incorrectly determined bases
and insertions/deletions.


2) The error rate is highest at the beginning
and ends of the reads - precisely the regions
that must be overlapped.


3) Some sequence from cloning vectors is often
included at the ends of sequence reads.

Sequence Assembly Algorithms


Different from similarity searching


Look for ungapped overlaps at end of
fragments


(method of Wilbur and Lipman, SIAM J. Appl. Math. 44:
557-567, 1984)


High degree of identity over a short region


Want to exclude chance matches, but not be
thrown off by sequencing errors


Vector removal uses similar approach, but less
stringent


should recognize small regions of identity and
tolerate more mismatches
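
The trade-off between excluding chance matches and tolerating sequencing errors can be sketched with a simple scan for an ungapped end overlap. This is a plain brute-force illustration, not the Wilbur and Lipman method cited above; the thresholds, names, and test sequences are assumptions.

```python
def best_end_overlap(left, right, min_len=10, max_mismatch_rate=0.1):
    """Longest ungapped overlap between the end of left and the start of right.

    Returns (overlap_length, mismatches), or (0, 0) if nothing qualifies.
    min_len guards against chance matches; max_mismatch_rate tolerates errors.
    """
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        tail, head = left[-n:], right[:n]
        mismatches = sum(1 for x, y in zip(tail, head) if x != y)
        if mismatches <= max_mismatch_rate * n:
            return n, mismatches
    return 0, 0

read_a = "ACGTTGCA" + "GGATCCTAGCTA"
read_b = "GGATCCTAGGTA" + "TTTTCCCC"   # same 12 bases with one miscalled base
print(best_end_overlap(read_a, read_b))  # (12, 1)
```

For vector removal, the same function could be run with a smaller `min_len` and a larger `max_mismatch_rate`, matching the "less stringent" setting described above.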

Celera Innovation: Clone End Tracking


Create 3 libraries with 2, 10, and 50 kb inserts


Use information from clone ends: distance and
orientation


Can span some gaps between contigs and determine the
size of gaps
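
How clone ends give the size of a gap can be sketched with simple arithmetic: if the two end reads of one clone fall in different contigs, the gap is roughly the insert size minus the parts of the insert that lie inside each contig. The function names and all numbers below are hypothetical, and real insert sizes vary around their nominal value.

```python
def estimate_gap(insert_size, inside_left_contig, inside_right_contig):
    """Gap estimate from one clone whose ends land in two adjacent contigs.

    inside_left_contig:  bases from the start of the clone to the end of the left contig
    inside_right_contig: bases from the start of the right contig to the end of the clone
    """
    return insert_size - inside_left_contig - inside_right_contig

def mean_gap(links):
    """Average the estimates from several clones spanning the same gap."""
    estimates = [estimate_gap(*link) for link in links]
    return sum(estimates) / len(estimates)

# Hypothetical clones from the 10 kb and 2 kb libraries spanning one gap.
links = [(10_000, 6_200, 2_300), (2_000, 150, 400), (10_000, 3_000, 5_450)]
print(mean_gap(links))  # 1500.0
```

The orientation of the two end reads, which this sketch does not model, tells the assembler which way round each contig sits in the scaffold.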

Overlap at ends, not internal

Software determines strategy


Based on their faith in the speed and
reliability of sequence
analysis/assembly software,
researchers have generally taken one
of three different approaches to
planning sequencing projects:



Ordered sub-cloning


Primer walking


Shotgun sequencing

Ordered cloning


People who don't trust software
generally put a lot of time into dividing
large pieces of DNA into small ordered
overlapping fragments


This strategy requires much more initial
cloning work in the laboratory,
but it minimizes the number of actual
sequencing reads required to complete a
project.


It is easy to assemble the reads since it is
known how they should fit together to
form the final contig.

Primer Walking


Make a new primer from the end of each new
sequence read


It requires very fast and accurate analysis of
sequence reads since each step uses information
from the previous read


Skips the sub-cloning step entirely since all sequencing
reactions can be done on one large clone


Expensive to make a lot of PCR primers,
but the price of primer synthesis keeps dropping
and there is an economy of scale


Assembly problems are minimized since both
the order and the amount of overlap of reads
are known
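
The primer-walking loop can be sketched as follows. This is a simulation under simplifying assumptions: each "read" is cut directly from a known template rather than produced by a sequencer, the next primer is taken from the very end of the read, and reads are error-free. In practice the primer would likely be placed a little upstream of the read end, where the sequence is more reliable. All names and sizes are illustrative.

```python
import random

def primer_walk(template, first_primer, read_len=500, primer_len=20):
    """Walk along template; return the assembled sequence and the primers used."""
    start = template.find(first_primer)
    if start < 0:
        raise ValueError("primer does not anneal to the template")
    primers = [first_primer]
    assembled = first_primer
    pos = start + len(first_primer)
    while pos < len(template):
        read = template[pos:pos + read_len]   # stand-in for one sequencing read
        assembled += read
        pos += len(read)
        if pos < len(template):
            # Design the next primer from the end of what has just been read.
            primers.append(assembled[-primer_len:])
    return assembled, primers

rng = random.Random(1)
clone = "".join(rng.choice("ACGT") for _ in range(1200))  # hypothetical large clone
sequence, primers = primer_walk(clone, clone[:20])
print(len(sequence), len(primers))  # 1200 3
```

Because each read starts from a primer at a known position, the order and overlap of the reads are known in advance, so no assembly search is needed.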

Shotgun Sequencing


Shotgun sequencing takes maximum
advantage of the speed and low cost of
automated sequencing


It relies totally on software to assemble a jumble
of essentially random sequence reads into a
coherent and accurate contig


TIGR demonstrated “proof of concept” on the
genomes of Haemophilus influenzae,
Methanococcus jannaschii, and
Mycoplasma genitalium



Celera Genomics demonstrated the ability to
shotgun sequence the entire human genome (?)



Human Genome Assembly


The HGP vs. Celera race to sequence the entire
human genome was a classic battle of different
strategies


The HGP used an ordered cloning approach


Breaking the genome into mapped BAC clones, then
shotgun sequencing the BACs


Celera used a modified shotgun method


Random clones of various sizes (size selected libraries)


Plus relative mapping of clone ends (they must be located
in the assembly at the correct distance and orientation)


Created custom software to handle the assembly


Celera did make use of the “scaffold” built by the HGP

Other Large Sequencing
Projects


Phylogenetic identification/analysis


medical studies of bacteria


environmental samples


EST sequencing - differential expression


cDNA studies


alternate splicing


full length transcripts


Genotyping


score known alleles


identify new mutations

Automation


The "pipeline" approach:


Vector removal


Assembly of identical and/or overlapping
fragments


Identify genes


Lookup on genome if fully sequenced organism


Or genome contigs for partially sequenced organisms


BLAST search of GenBank for similar genes


Lookup in specialized database of "predicted
genes"


e.g., Ensembl


Project specific analysis


differentials between sets


Phylogenetics

DATABASE!!


What these projects all share is a need to
keep track of a lot of data


Hundreds to thousands of sequences


Many fields of information about each one

» Organism, library, plate ID for each clone

» the sequence itself

» cluster/contig membership

» best BLAST hit (accession #, e-value, alignment)

» genome position


Can't keep track just using folders and
text files on your hard drive


Design the database to include all possible
fields



(it’s a lot harder to add info later)
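
One way to picture such a database is a single table with a column for each field listed above. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the table name, column names, and the inserted record (including the placeholder accession) are hypothetical, and a real project would likely split this into several related tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway database for illustration
conn.execute("""
    CREATE TABLE clone_read (
        read_id         INTEGER PRIMARY KEY,
        organism        TEXT NOT NULL,
        library         TEXT,
        plate_id        TEXT,
        sequence        TEXT NOT NULL,
        contig_id       TEXT,
        blast_accession TEXT,
        blast_evalue    REAL,
        genome_position INTEGER
    )
""")
# A made-up record; none of these values come from a real project.
conn.execute(
    "INSERT INTO clone_read (organism, library, plate_id, sequence, contig_id,"
    " blast_accession, blast_evalue, genome_position)"
    " VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
    ("Homo sapiens", "lib01", "plate_0042", "ACGTACGT", "contig_7",
     "XX000000", 1e-30, 123456),
)
rows = conn.execute(
    "SELECT plate_id, blast_accession FROM clone_read WHERE contig_id = ?",
    ("contig_7",),
).fetchall()
print(rows)  # [('plate_0042', 'XX000000')]
```

A question such as "which clones belong to contig 7, and what is their best BLAST hit?" then becomes a one-line query rather than a search through folders of text files.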

Computer tools for sequencing


A wide variety of different software tools
have been created to aid DNA sequencing
projects.


Each genome project lab has built its own custom
software


UNIX


Based on a particular workflow design


PHRED, PHRAP, and Consed


Many packages for the individual investigator -
included in most “comprehensive” molecular biology
products: MacVector, LaserGene, DNA*, etc.



I will focus on the assembly tools in GCG.

The GCG Fragment Assembly System


GCG has a complete set of programs that allow
data entry, and assembly of overlapping
nucleotide sequence fragments into one contig


SEQED: a single sequence editor

GELSTART: creates fragment assembly projects

GELENTER: adds sequences (reads) to an assembly project,
input of new sequences from keyboard, digitizer, or import of
existing text files

GELMERGE: assembles individual sequences into contigs, can
automatically remove vector sequences

GELASSEMBLE: multiple sequence editor for viewing and editing
contigs, allows manual alignment of fragments, insertion/deletion
of gaps, and changing of individual bases

GELDISASSEMBLE: breaks up contigs into individual sequences within a
project

GELVIEW: displays contigs as a schematic display of overlapping
fragments

SeqLab has a Chromatogram viewer

Other Chromatogram Viewers


Applied Biosystems has a free viewer/editor
program for sequence chromatograms


It is called EditView and it is a Macintosh-only program
(does not work in System 9.1 and newer)

http://cancer-seqbase.uchicago.edu/documents/EditView.hqx


There are a couple of viewers for Windows
machines


ABIView is free from David H. Klatte

http://bioinformatics.weizmann.ac.il/software/abiview/abiinfo.html



Chromas is $50 shareware from Conor
McCarthy, Technelysium Pty Ltd in
Australia

http://www.technelysium.com.au/chromas.html

The Genome Sequencing Era

[Figure: timeline of the genome sequencing era, with a year axis running from 1996 to 2002. Milestones shown: first microbial genome (H. influenzae); first eukaryote genome (yeast); E. coli; first multicellular animal (C. elegans); fruit fly; first higher plant (Arabidopsis); first mammal (Homo sapiens); malaria mosquito and parasite; first fish (Fugu); mouse and tunicate; running totals of 18, 40, and 100 microbial genomes.]

Complex Genomes Jan. 2003


Chordates


Human


Mouse


Rat


Pufferfish


Sea squirt (Ciona)


Arthropods


D. melanogaster


D. simulans


A. gambiae



Higher plants


Arabidopsis


Rice


Fungi


Aspergillus terreus

Coming soon …


In progress


purple sea urchin


zebrafish


NHGRI’s Priority Organisms


Chicken


Cow


Dog


Chimpanzee


Honeybee


Tetrahymena


Oxytricha


Several fungi


Over 100 bacterial genomes

Some Books on the Human
Genome Project


The Common Thread: A Story of Science,
Politics, Ethics and the Human Genome
by John Sulston, Georgina Ferry



The Gene Masters: How a New Breed of
Scientific Entrepreneurs Raced for the
Biggest Prize in Biology
by Ingrid Wickelgren



The Genome War: How Craig Venter Tried to
Capture the Code of Life and Save the World
by James Shreeve

Controversy and Issues


Does human DNA sequence information
belong to everyone?


Should publication require the release of
all data?


Did Celera use public information to
complete the human sequence?


Should a gene or life form be patented?


Should personal genetic information be
protected from public release?