Sequencing the Bergey's


Feb 20, 2013


Rob Edwards, Jonathan A. Eisen, Ross Overbeek, George Garrity,

Veronika Vonstein, Sveta Gerdes, Folker Meyer, Kevin White,

Tim Lilburn, Barney Whitman, et al.


Creating the Genomic Encyclopedia for Bacteria and Archaea


Rick Stevens
Argonne National Laboratory
The University of Chicago

Eddy Rubin
Joint Genome Institute
Berkeley Lab


The Basic Idea of the Project


To build an enterprise that can take advantage of the
expected exponential improvements of sequencing
capabilities to sequence “all known” cultured and described
prokaryotes


Ride the expected “Moore’s law” of sequencing capability


To develop a distributed high-throughput “industrial”
approach to the cultivation, characterization, sequencing,
annotation and analysis of prokaryotic genomes


Build a team from groups that have expertise and track records


To build and curate a database of genome sequences,
metabolic reconstructions, and standardized phenotype
assays associated with each target organism


Streamline the release of data, provide a foundation for derivative
projects


Concept of the Bergey’s/GEBA Sequencing Project


A fixed-cost annual investment


Each year more can be sequenced as sequencing costs decrease and
as cultivation efficiencies improve based on experience


Leverage the expected improvement of sequencing costs


Address the overall scope within 5 to 6 years


Increase the number of near-complete sequences per year


Optimize the choice of organisms to maximize diversity at
each stage


Exploit the Bergey’s Trust and the International Committee on Systematics
of Prokaryotes for taxonomic coverage (e.g. Garrity and Whitman)


Involve the microbiology community for prioritization


Industrialize the pipeline


Biological Resource Centers to produce and characterize type material


DOE JGI, NIAID/DMID Centers, NSF/USDA Centers for Sequencing


Laboratories for bioinformatics (Argonne, JGI, TIGR, ORNL, etc.)


Universities and Laboratories for modeling and analysis


The Question is not If, but When and How?


Why should we want to accelerate this transition?


Why not just let it happen as a matter of course?


What is in the current sequencing pipeline?


             Completed Genomes   Ongoing/In the Pipeline
Archaeal                    29                        56
Bacterial                  397                       991
Eukaryal                    44                       631



The existing process of bottom-up selection of organisms for
sequencing is leaving many important groups underrepresented;
closure will take a long time


There are groups that are well represented in the literature, but not in the
sequencing databases


Underrepresentation is also an issue in environmental sequencing
data

Tapping into prokaryotic biodiversity - Industrial Biotechnology

Hans E. Schoemaker, et al. 2003. Science 299:1694-97


Rapidly growing field




by 2010 biocatalysis will be used in production
of 60% of fine chemicals (McKinsey analysis)




In the US, coordinated by the USDA Biobased
Products and Bioenergy Coordination Council
(BBCC)



Applications:




pharmaceuticals



food ingredients (sweeteners, vitamins)



feed additives and other agrochemicals



organic solvents



polymer raw materials



biofuels



Advantages over chemical methods:



exquisite substrate specificity



excellent chemo-, regio- and stereoselectivity



environmentally friendly “green chemistry”
based on biorenewables



Needed:



novel enzymes and pathways



“Periodic table” of biochemical transformations

Straathof et al. 2002. Curr Opinion Biotech 13:548-56

~150 compounds are currently produced on
industrial scale using biocatalysts. Examples:

Analysis of 1000s of new bacterial genomes will likely yield completely novel
pathways and enzymes for industrial applications


Still to be discovered:

enzymes involved in the biosynthesis
or catabolism of approximately 40 naturally occurring
chemical functional groups are still not known

• Hydroxylaminobenzene mutase

• Aldoxime dehydratase

• Azetidine-2-carboxylate hydrolase

• Benzylsuccinate synthase

• Phenylboronic acid oxygenase

L.P. Wackett. 2004. Current Opinion in Biotechnology, 15:280-284

Examples of recently discovered biocatalytic
transformations of novel organic functional groups:


Current approaches to
discovery of new enzymes:




Screening environmental
samples by enrichment cultures
(BUT: only <<1% of prokaryotes
are currently culturable)



Metagenome approach: cloning
& expression of DNA samples in
a surrogate host, then screening
for desired function
(BUT: only
known functions can be
screened for, new biochemistry
cannot be discovered)




Sequence-based discovery
(growing explosively, generating
knowledge base for basic
sciences and biotechnological
applications)


Building the Case


There is a disparity between the literature and the existing genomes


We can’t fully exploit the community’s historical knowledge and investments
without closing this gap


There is a disparity between the rank/abundance curves from 16S
studies and from environmental sequencing projects and the existing
genomes


We can’t fully understand the new datasets without closing this gap (i.e. lack
of complete sequence coverage of known culturables is holding back future
work)


There are likely to be new biochemical pathways and novel enzymes in
the set of culturable but unsequenced organisms; sequencing
non-cultured organisms would further expand diversity


These represent the low-hanging fruit for discovery since the investment has
already been made in determining culture conditions


A comprehensive database produced under controlled conditions that
includes phenotype data and genotype data will accelerate research in
understanding the genotype-phenotype relationship


Genome-scale reconstruction and modeling will be dramatically accelerated
by comprehensive databases that include phenotype data

Estimated Sequencing Rates

Year                            2007    2008    2009    2010    2011    2012    2013    2014   Notes
Base pairs per dollar            200     300     450     675   1,013   1,519   2,278   3,417   50% improvement per year
Bacterial genome cost in $    20,000  13,333   8,889   5,926   3,951   2,634   1,756   1,171   ~4M bp per genome
Number of genomes for $5M        250     375     563     844   1,266   1,898   2,848   4,271
Cumulative genomes sequenced     250     625   1,188   2,031   3,297   5,195   8,043  12,314
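
The figures in the Estimated Sequencing Rates table appear to follow from three inputs stated on the slide: 200 base pairs per dollar in 2007, a 50% improvement per year, and ~4M bp per genome, applied to a $5M annual budget. The sketch below reproduces the projection under those assumptions. The function and parameter names are illustrative, and rounding half up is an assumption chosen because it matches the published values.

```python
import math


def round_half_up(x: float) -> int:
    """Round to the nearest integer, with .5 rounding up (assumed convention)."""
    return math.floor(x + 0.5)


def project_sequencing(start_year=2007, years=8, bp_per_dollar=200.0,
                       improvement=0.5, genome_bp=4_000_000,
                       annual_budget=5_000_000):
    """Return (year, bp per dollar, genome cost, genomes per year, cumulative)."""
    rows = []
    cumulative = 0.0
    for i in range(years):
        # Sequencing yield compounds by a fixed fraction each year.
        rate = bp_per_dollar * (1 + improvement) ** i
        cost_per_genome = genome_bp / rate
        genomes = annual_budget * rate / genome_bp
        cumulative += genomes
        rows.append((start_year + i,
                     round_half_up(rate),
                     round_half_up(cost_per_genome),
                     round_half_up(genomes),
                     round_half_up(cumulative)))
    return rows


if __name__ == "__main__":
    for year, rate, cost, genomes, total in project_sequencing():
        print(f"{year}  {rate:>6,} bp/$  ${cost:>6,}/genome  "
              f"{genomes:>5,} genomes  {total:>6,} cumulative")
```

Changing `improvement` or `annual_budget` shows how sensitive the 5 to 7 year horizon is to the assumed rate of cost decline.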
Selection of Targets
Produce DNA
Sequencing
Assembly
Rapid Annotation (24 Hours)
Metabolic Reconstruction
Model Generation
Phenotype Prediction
Database Repository

Technical Feasibility FAQ


How many genomes would the project propose to sequence?


About 5000 over 5-7 years


Who would produce the biomass needed for DNA extraction?


Type culture centers until enrichment and environmental methods mature


Will the biomass/DNA be available for distribution?


Yes, both the DNA and the libraries could be stored for distribution


What throughput is needed for DNA production?


About 300 taxa per year at the beginning of the project, rising to 2000 per year at the end


What combinations of sequencing technologies need to be employed?


Sanger and pyrosequencing initially, others as they come online


What throughput is needed for annotation?


24-hour turnaround from assembled sequence to initial availability; this has already been
achieved at Argonne, TIGR and elsewhere


Is it possible to have a standard set of phenotype assays given the broad
spectrum of organisms and conditions?


We are considering Biolog as a model, but it is too limited


How would the genomes be selected and prioritized?


At each cycle we choose genomes (e.g. via 16S) to minimize the diversity gaps


Community input would be solicited to ensure the project is tracking the community's
interests
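
One way to picture "choosing genomes to minimize the diversity gaps" is a greedy farthest-first selection: in each cycle, pick the candidate whose nearest already-sequenced relative is most distant. The sketch below is only an illustration of that idea, not the project's actual procedure. The taxon names and distance values are made up, and the distances stand in for whatever 16S-derived measure would be used in practice.

```python
def symmetric(pairs):
    """Build a symmetric lookup table from {(a, b): distance} pairs."""
    table = {}
    for (a, b), value in pairs.items():
        table.setdefault(a, {})[b] = value
        table.setdefault(b, {})[a] = value
    return table


def pick_targets(distance, sequenced, candidates, k):
    """Greedily pick up to k candidates that best fill diversity gaps.

    The "gap" of a candidate is its distance to the closest taxon that is
    already sequenced or already chosen in this cycle.
    """
    covered = list(sequenced)
    remaining = list(candidates)
    chosen = []
    for _ in range(min(k, len(remaining))):
        best = max(
            remaining,
            key=lambda c: min((distance[c][s] for s in covered),
                              default=float("inf")),
        )
        chosen.append(best)
        covered.append(best)
        remaining.remove(best)
    return chosen


# Hypothetical pairwise distances between five taxa (illustrative values).
toy = symmetric({
    ("A", "B"): 0.02, ("A", "C"): 0.10, ("A", "D"): 0.25, ("A", "E"): 0.24,
    ("B", "C"): 0.09, ("B", "D"): 0.26, ("B", "E"): 0.25,
    ("C", "D"): 0.20, ("C", "E"): 0.21, ("D", "E"): 0.03,
})

if __name__ == "__main__":
    # "A" is already sequenced; choose two more targets for this cycle.
    print(pick_targets(toy, sequenced=["A"], candidates=["B", "C", "D", "E"], k=2))
```

In this toy case "D" is chosen first because it is farthest from "A", and "E" is then skipped because it sits next to "D". Community priorities could be layered on top, for example by weighting the gap score.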


Is it necessary to “close” the genomes?


We think no.

Libraries would be archived for groups that might be interested in closing.


The Project Would Provide a Comprehensive
Set of Genome Sequences for:


Biofuels, and bioproduction of alternative feedstocks


Understanding and managing the microbial carbon cycle


Soil and subsurface microbial ecology


Bioremediation and bioconversion of waste streams


Evolution and microbial ecological dynamics


Context for environmental sequencing and metagenomics


Basis for developing predictive models of phenotypes


Source of components for synthetic biology


Improving our understanding of cultivability


Dramatically improving the reliability and quality of genome
annotations



How Many Known Cultured Organisms?


Latest version of the Prokaryotic Taxonomic Outline will
contain 7951 named species of Bacteria and Archaea.


Of these, 178 are non-cultivable or not represented by
viable type material.


An additional 1222 are synonyms.


Of the 6543 type strains for which viable material is
reportedly deposited, we have assembled a minimal set of
6389 strains that are available from 16 major public
culture collections or biological resource centers in the
US, Europe, and Asia.


The remaining 154 are in minor or non-public collections.



This information is derived from Release 6.1 of the Taxonomic Outline of the Prokaryotes which will
be published in 2007 and is current through May 2006.


What Has Been Sequenced or is In Play


Of the 6400 strains available from public sources


About 380 are human, animal or plant pathogens


On the order of 1/3 to 1/2 of the known pathogens have been sequenced


360 complete prokaryotic genomes published


56 archaeal and 940 bacterial genomes in progress


From 897 prokaryotic genomes in progress in GOLD


~400 are pathogens (many duplicate taxa)


~221 are supported by DOE (156 biotech, 51 environment)


Approximately 5000 prokaryotes not yet in play


We estimate about 4800 non-pathogen taxa

Strain Distribution in Collections

US Collections / BRCs                                   Strains
American Type Culture Collection (ATCC)                    4027
USDA ARS Collection (NRRL)                                  223

European Collections
Deutsche Sammlung von Mikroorganismen (DSMZ)               1302
Culture Collection University of Gothenburg (CCUG)          183
Pasteur Institute (CIP)                                     170
Laboratory for Microbiology, Gent (LMG)                     101
National Collection of Industrial and Marine Bacteria        25
French Collection of Phytopathogens (CFBP)                   15
National Collection of Type Cultures (NCTC)                  12
National Collection of Phytopathogenic Bacteria              11

Asia
Japan Collection of Microorganisms (JCM)                    185
Institute of Fermentation, Osaka (IFO)                       34
Korean Collection of Type Cultures (KCTC)                    28
Institute of Applied Microbiology, Tokyo (IAM)               26
National Institute of Technology and Evaluation (NBRC)       24
All-Russian Collection of Microorganisms (VKM)               13

Distribution of Genome Sizes in the Pipeline

[Figure: Microbial Genome Size. Axes: Taxa; Base pairs (1,000's). Average Sequence ~ 4Mbp]

Getting Value from the Genomes


Genomes would be assembled by the groups doing the
sequencing


Assembled contigs would be sent to the initial high-throughput
annotation server for draft annotations and
immediately published on-line


The accumulated (additional) genomes will be used to
improve annotations (gene calls, functional coupling)


Genomes will be integrated into databases to support
comparative analysis and evolutionary analysis


Annotated genomes can be used to semi-automatically
construct genome-scale models which could be used to make
metabolic phenotype predictions
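
To make the last step concrete, here is a deliberately small sketch of going from an annotated reaction list to a growth/no-growth prediction. Real genome-scale models typically use constraint-based methods such as flux balance analysis; this example substitutes a much simpler reachability check (sometimes called network expansion) so that it stays self-contained. The reactions, metabolite names, and biomass precursors are all hypothetical.

```python
def producible(reactions, medium):
    """Return every metabolite reachable from the medium.

    reactions maps a reaction name to (substrates, products), both sets.
    A reaction fires once all of its substrates are available.
    """
    available = set(medium)
    changed = True
    while changed:
        changed = False
        for substrates, products in reactions.values():
            if substrates <= available and not products <= available:
                available |= products
                changed = True
    return available


def predicts_growth(reactions, medium, precursors):
    """Predict growth if all biomass precursors can be produced."""
    return set(precursors) <= producible(reactions, medium)


# Hypothetical reaction set derived from a (made-up) annotated genome.
toy_reactions = {
    "glycolysis": ({"glucose"}, {"pyruvate", "atp"}),
    "lactose_hydrolysis": ({"lactose"}, {"glucose", "galactose"}),
    "amino_acid_synthesis": ({"pyruvate", "ammonia"}, {"amino_acids"}),
}
toy_precursors = {"atp", "amino_acids"}

if __name__ == "__main__":
    for carbon_source in ("glucose", "lactose", "acetate"):
        medium = {carbon_source, "ammonia"}
        result = predicts_growth(toy_reactions, medium, toy_precursors)
        print(f"{carbon_source}: growth predicted = {result}")
```

Predictions of this kind could be compared against standardized phenotype assays, which is where a database that pairs genotype and phenotype data becomes useful.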

Background



online at http://www.sequencingbergeys.org


login required (just ask us)


guest read-only access after the meeting?


make maximum information available

Bergey hierarchy, NCBI taxonomy, 16S rRNA,
strain collections, GOLD, SEED,


List of organisms for sequencing - based on 16S clusters

Cluster Page

select strain for cluster

“Bergey” Browser

Species Page