Automated Microbial Genome Annotation


4 Oct 2013


1


MICROBIAL GENOME
ANNOTATION

Loren Hauser

Miriam Land

Yun-Juan Chang

Frank Larimer

Doug Hyatt

Cynthia Jeffries


NEB Educational Support

2

http://www.neb.com/nebecomm/course_support.asp?

Why study Computational Biology and Bioinformatics?


DNA sequencing output is growing faster than
Moore’s law!


1 Illumina sequencing machine = 0.5 Tbp/week


There are hundreds of these and thousands of
other sequencing machines around the world.


New sequencing technology will conceivably
allow sequencing a human genome for less
than $1K in less than 1 day!

3

Why study Medical Bioinformatics?


In the near future, most cancer diagnostics
will involve DNA or RNA sequencing!


In the near future, every baby born in the
developed world will have their genome
sequenced. Protecting privacy and your doctor's
ability to use that information are the
only real impediments!


Hospitals are using DNA sequencing to track
antibiotic resistant bacterial infections.


4

DOE Undergraduate Research in
Microbial Genome Analysis and
Functional Genomics

5

http://www.jgi.doe.gov/education

6

Why Study Microbial Genomes?


Large biological mass (50% of total)


photosynthetic (Prochlorococcus)


fix N2 gas to NH3 (Rhodopseudomonas)


NH3 to NO2 (Nitrosomonas)


bioremediation (Shewanella, Burkholderia)


pathogens, BW (Yersinia pestis - plague)


food production (Lactobacillus)


CH4 production (Methanosarcina)


H2 production (Rhodopseudomonas)



Example of Current Microbial
Genome Projects


UC Davis: FDA-funded 100K bacterial genomes
project associated with food.


100K genomes over 5 years = 20K per year;
20K per year / 200 days/year = 100 genomes/day!
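
The throughput arithmetic above can be checked with a few lines of Python. This is only a worked calculation; the function name and the assumption of 200 sequencing days per year (taken from the slide) are illustrative.

```python
def genomes_per_day(total_genomes: int, years: int, days_per_year: int) -> float:
    """Average genomes that must be finished per working day."""
    per_year = total_genomes / years      # 100K over 5 years -> 20K per year
    return per_year / days_per_year       # spread over the working days in a year


if __name__ == "__main__":
    print(genomes_per_day(100_000, 5, 200))
```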

7

8

Web Resources and Contact Information


http://genome.ornl.gov/microbial/


http://www.jgi.doe.gov/


http://genome.jgi-psf.org/


http://www.jcvi.org/


http://www.ncbi.nlm.nih.gov/


http://www.sanger.ac.uk/


http://www.ebi.ac.uk/


ftp://ftp.lsd.ornl.gov/pub/JGI


Artemis-ready files for each scaffold
(feature table plus FASTA sequence file)


Contact: landml@ornl.gov; hauserlj@ornl.gov

9

Evolution of Sequencing
Throughput

11

Sequenced Microbial Genomes


ARCHAEAL GENOMES: 159 FINISHED; 218 IN PROGRESS


BACTERIAL GENOMES: 3363 FINISHED; 11831 IN PROGRESS


ENVIRONMENTAL COMMUNITIES: > 50,000 samples (see MG-RAST)


as of Sept 6, 2012


http://www.expasy.ch/alinks.html


http://www.genomesonline.org


http://metagenomics.anl.gov/


12

Published Genomes


Nitrosomonas europaea - J. Bac. 185(9):2759-2773 (2003)


Prochlorococcus MED4 & MIT9313 - Nature 424:1042-1047 (2003)


Synechococcus WH8102 - Nature 424:1037-1042 (2003)


Rhodopseudomonas palustris - Nat. Biotech. 22(1):55-61 (2004)


Yersinia pseudotuberculosis - PNAS 101(22):13826-31 (2004)


Nitrobacter winogradskyi - Appl. Envir. Micro. 72(3):2050-63 (2006)


Nitrosococcus oceani - Appl. Envir. Micro. 72(9):6299-315 (2006)


Burkholderia xenovorans - PNAS 103(42):15280-7 (2006)


Thiomicrospira crunogena - PLoS Biology 4(12):e383 (2006)


Nitrosomonas eutropha C91 - Env. Micro. 9(12):2993-3007 (2007)


Sulfuromonas denitrificans - Appl. Envir. Micro. 74(4):1145-56 (2008)


Nitrosospira multiformis - Appl. Envir. Micro. 74(11):3559-72 (2008)


Nitrobacter hamburgensis - Appl. Envir. Micro. 74(9):2852-63 (2008)


Saccharophagus degradans - PLoS Genetics 4(5):e1000087 (2008)


R. palustris 5 strain comparison - PNAS 105(47):18543-8 (2008)


L. rubarum and L. ferrodiazotrophum - Appl. Envir. Micro. (in press)


13

Basic Annotation Impacts


Design of oligonucleotide arrays


Design & prioritize protein expression constructs


Design & prioritize gene knockouts


Assessment of overall metabolic capacity


Database for proteomics


Allows visualization of whole genome

14

Additional Analysis Impacts


Revised functional assignments based on
domain fusions, functional clustering,
phylogenetic profile


Regulatory motif discovery


Operon and regulon discovery


Regulatory and protein association
network discovery

15

Microbial Genome Annotation Pipeline (flowchart)

Boxes in the diagram: Scaffolds or contigs; Prodigal; Model correction; Final Gene List; InterPro; COGs; Web Pages; Blast; Complex Repeats; Simple repeats; GC Content, GC skew; PRIAM; Function call; tRNAs; rRNA, Misc_RNAs; Feature table; TMHMM; SignalP

16

Prodigal (Prokaryotic Dynamic Programming Genefinding Algorithm)


Unsupervised: Automatically learns the statistical
properties of the genome.


Indifferent to GC Content: Prodigal performs well
irrespective of the GC content of the organism.


Draft: Prodigal can train on multiple sequences
then analyze individual draft sequences.


Open Source: Prodigal is freely available under
the GPL.


Reference: Hyatt D, Chen GL, Locascio PF, Land
ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic
gene recognition and translation initiation site
identification. BMC Bioinformatics. 2010 Mar
8;11(1):119. (Highly Accessed)



17

G+C Frame Plot Training


Takes all ORFs above a specified length in
the genome.


Examines the G+C bias in each frame
position of these ORFs.


Runs a dynamic programming algorithm
using G+C frame bias as its coding
scoring function to predict genes.


Takes those predicted genes and gathers
dicodon usage statistics.
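
The first two training steps (collect long ORFs, then measure G+C bias at each codon position) can be sketched roughly as below. This is an illustration only, not Prodigal's code: the function names, the forward-strand-only scan, the requirement that an ORF end at a stop codon, and the `min_len` default are all assumptions made for the example.

```python
from typing import List

STOPS = {"TAA", "TAG", "TGA"}


def forward_orfs(seq: str, min_len: int = 90) -> List[str]:
    """Stop-terminated open stretches on the forward strand, all three frames.

    Each returned string excludes the terminating stop codon.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                if i - start >= min_len:
                    orfs.append(seq[start:i])
                start = i + 3  # next open stretch begins after the stop
    return orfs


def gc_by_codon_position(orfs: List[str]) -> List[float]:
    """Fraction of G or C at codon positions 1, 2 and 3 across all ORFs."""
    gc = [0, 0, 0]
    total = [0, 0, 0]
    for orf in orfs:
        for i, base in enumerate(orf):
            pos = i % 3
            total[pos] += 1
            if base in "GC":
                gc[pos] += 1
    return [g / t if t else 0.0 for g, t in zip(gc, total)]


if __name__ == "__main__":
    orfs = forward_orfs("ATGGCCGCGGCATAA", min_len=6)
    print(orfs, gc_by_codon_position(orfs))
```

A real trainer would also scan the reverse strand and feed the per-position bias into the dynamic programming step; those parts are left out here.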

18

Gene Prediction



Dicodon usage coding score


Length factor added to coding score
(GC-content-dependent)


Coding/noncoding thresholds sharpened (a start
downstream of a start with a higher coding score is
penalized by the difference).


Dynamic programming to put genes together.


Bonuses for operon distances, larger bonus for
-1/-4 overlaps.


Same strand overlap allowed (up to 60 bases).


Opposite strand -->3'r 5'f<- allowed (up to 250
bases)
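
One common way to turn dicodon usage into a coding score is a log-odds table: how often each in-frame hexamer occurs in the training genes relative to a background set. The sketch below follows that general idea; it is not Prodigal's actual scoring function, and the function names, the choice of background, and the pseudocount are assumptions for illustration.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def dicodon_counts(seqs: Iterable[str]) -> Counter:
    """Count in-frame hexamers (dicodons), stepping one codon at a time."""
    counts: Counter = Counter()
    for s in seqs:
        for i in range(0, len(s) - 5, 3):
            counts[s[i:i + 6]] += 1
    return counts


def dicodon_log_odds(genes: Iterable[str], background: Iterable[str],
                     pseudocount: float = 1.0) -> Dict[str, float]:
    """Log ratio of dicodon frequency in genes versus background."""
    g, b = dicodon_counts(genes), dicodon_counts(background)
    keys = set(g) | set(b)
    g_total = sum(g.values()) + pseudocount * len(keys)
    b_total = sum(b.values()) + pseudocount * len(keys)
    return {k: math.log(((g[k] + pseudocount) / g_total) /
                        ((b[k] + pseudocount) / b_total)) for k in keys}


def coding_score(seq: str, table: Dict[str, float]) -> float:
    """Sum of log-odds over the in-frame dicodons of seq (unseen ones score 0)."""
    return sum(table.get(seq[i:i + 6], 0.0) for i in range(0, len(seq) - 5, 3))


if __name__ == "__main__":
    table = dicodon_log_odds(["ATGGCCGCC"], ["ATGAAAAAA"])
    print(coding_score("ATGGCCGCC", table), coding_score("ATGAAAAAA", table))
```

The length factor, start penalties and overlap rules listed above would be applied on top of a score like this; they are not modeled here.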


19

Start Site Scoring

Shine-Dalgarno Motif


Examines initially predicted genes and gathers
statistics on the starts (RBS motifs, ATG vs GTG
vs TTG frequency)


Moves starts based on these discoveries.


Gathers statistics on the new set of starts and
repeats this process until convergence (5-10
iterations).


RBS motifs based on AGGAGG sequence, 3-6
base motifs, with one mismatch allowed in 5
base or longer motifs (e.g. GGTGG, or AGCAG).


Does a final dynamic programming with the
start scoring function.
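
The RBS motif rule (3-6 base pieces of AGGAGG, one mismatch allowed at 5 bases or longer) can be sketched as a simple upstream scan. This is a hedged illustration rather than Prodigal's implementation: the function names, the ranking of candidate matches by number of matching bases, and the returned tuple layout are assumptions.

```python
from typing import List, Optional, Tuple

SD = "AGGAGG"


def sd_submotifs(min_len: int = 3) -> List[str]:
    """All contiguous pieces of AGGAGG of length min_len..6, longest first."""
    motifs = {SD[i:i + n] for n in range(min_len, len(SD) + 1)
              for i in range(len(SD) - n + 1)}
    return sorted(motifs, key=lambda m: (-len(m), m))


def mismatches(a: str, b: str) -> int:
    return sum(x != y for x, y in zip(a, b))


def best_rbs(upstream: str) -> Optional[Tuple[str, str, int, int]]:
    """Best SD-like match in the region upstream of a start codon.

    Returns (motif, matched_window, spacer_to_start, n_mismatches) or None.
    `upstream` must end immediately before the start codon.
    """
    best = None
    for motif in sd_submotifs():
        allowed = 1 if len(motif) >= 5 else 0  # mismatch only for 5+ base motifs
        for i in range(len(upstream) - len(motif) + 1):
            window = upstream[i:i + len(motif)]
            mm = mismatches(motif, window)
            if mm > allowed:
                continue
            rank = (len(motif) - mm, -mm)  # more matching bases first
            spacer = len(upstream) - (i + len(motif))
            if best is None or rank > best[0]:
                best = (rank, motif, window, spacer, mm)
    return None if best is None else best[1:]


if __name__ == "__main__":
    print(best_rbs("CCGGTGGTTTCAT"))
```

In the iterative scheme described above, counts of such matches (together with ATG/GTG/TTG frequencies) would be re-gathered after each round of start adjustments.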

20

Start Site Scoring

Other Motifs


If Shine-Dalgarno scoring is strong, use it;
this accounts for ~85% of genomes.


If Shine-Dalgarno scoring is weak, look for
other motifs


If a strong scoring motif is found, use it
(example: GGTG in A. pernix)


If no strong scoring motif is found, use
highest score of all found motifs (example:
Crenarchaea, where the transcription (Tc) and
translation (Tl) start sites are the
same, but internal operon genes use weak
Shine-Dalgarno motifs)

Annotated Gene Prediction

21

Prodigal Scoring

22

23

Gene Prediction Problems


Pseudogenes


24

Pseudogenes


Internal deletion

25

Pseudogenes


Premature stop codon

26

Pseudogenes


N-terminal deletion


27

Pseudogenes


Transposon insertion


28

Pseudogenes


Multiple frameshifts

29

Pseudogenes


Premature Stop and Frameshift


30

Pseudogenes


Dead Start Codon
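
Some of the pseudogene signatures named on these slides (dead start codon, premature stop, frameshift) leave marks that can be checked directly on a candidate coding sequence. The sketch below is a simplified illustration; in practice pseudogenes are usually recognized by comparison with intact homologs. The function name, the accepted start codons, and the message strings are assumptions.

```python
from typing import List

STOPS = {"TAA", "TAG", "TGA"}
STARTS = {"ATG", "GTG", "TTG"}


def cds_problems(cds: str) -> List[str]:
    """List simple defects in a candidate CDS given 5'->3', stop codon included."""
    cds = cds.upper()
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length not a multiple of 3 (possible frameshift)")
    # Only complete codons are considered; a trailing partial codon is ignored.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    if not codons or codons[0] not in STARTS:
        problems.append("dead start codon")
    internal = [i for i, c in enumerate(codons[:-1]) if c in STOPS]
    if internal:
        problems.append(f"premature stop codon at codon {internal[0] + 1}")
    if not codons or codons[-1] not in STOPS:
        problems.append("no terminal stop codon")
    return problems


if __name__ == "__main__":
    for seq in ("ATGAAATAA", "ATGTAAAAATAA", "ATAAAATAA"):
        print(seq, cds_problems(seq))
```

Deletions and transposon insertions cannot be seen from the sequence alone and need an alignment to a reference gene, so they are outside this check.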


31

32

GENE PAGE

33

34

35

36

ORGANISM’S (PSYC) COGS LIST

37

Taxonomic Distribution of Top
KEGG BLAST Hits

38

Frequency distance distributions


Salgado et al. PNAS (2000) 97:6652, Fig. 2

39

Frequency distance distributions

Salgado et al. PNAS (2000) 97:6652, Fig. 3b

40

Branched Chain Amino Acid
Transporter family

41

Probable Ancient Gene (Liv Operon)


42

Branched Chain Amino Acid Transporter
family


Rhodopseudomonas palustris

43

Example of Lateral Transfer


44

Transporter Gene Loss in Yersinia pestis


36 genes involved in transport from YPSE
are nonfunctional in YPES


13 lost due to frameshifts


11 lost due to deletions


6 lost due to IS element insertions


4 (2 pairs) lost due to recombination
causing deletions and frameshifts


2 lost due to premature stop codons

45

46

Nostoc punctiforme

Signal Transduction Histidine Kinases


47

Nostoc punctiforme

Signal Transduction Histidine Kinases


48

Nostoc punctiforme

Signal Transduction Histidine Kinases

49

Nostoc punctiforme

Signal Transduction Histidine Kinases

50

Nostoc punctiforme

Regulatory Proteins

51

Burkholderia xenovorans

Regulatory Proteins

52

Regulatory Protein Identification Scheme

53

Summary of automated transporter
annotation - Zymomonas

54

Zymomonas transporters

complete listing

Transcriptome Analysis Pipeline:

RNA sequences to GRN

Collect RNAseq data

Map reads to genomes

Calculate reads/bp

Display frequency plot

Determine operons from frequency plot

Compare operon determinations (genome co-ordinates)

Predict operons in silico

Improve algorithm

Determine orthologous operons

Determine orthologs with OrthoMCL

Align orthologous promoters

Determine TFBS from alignments

Determine TISs with 5' RACE.

Cluster analysis from gene expression arrays

Predict TFBS in silico

Cluster analysis of gene expression changes

GRN (genetic regulatory network)
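
The "Calculate reads/bp" step in the pipeline above amounts to computing per-base read depth from mapped read positions, which is what the frequency plot then displays. A minimal sketch follows, assuming reads are already mapped and given as 0-based half-open (start, end) intervals on a single sequence; the function names and input format are illustrative, not those of the actual pipeline.

```python
from typing import Iterable, List, Tuple


def per_base_coverage(genome_length: int,
                      reads: Iterable[Tuple[int, int]]) -> List[int]:
    """Read depth at every base, using a difference array and a running sum."""
    diff = [0] * (genome_length + 1)
    for start, end in reads:
        start, end = max(0, start), min(genome_length, end)  # clip to genome
        if start < end:
            diff[start] += 1   # depth rises where the read begins
            diff[end] -= 1     # and falls just past where it ends
    coverage, depth = [], 0
    for i in range(genome_length):
        depth += diff[i]
        coverage.append(depth)
    return coverage


def mean_reads_per_bp(coverage: List[int], start: int, end: int) -> float:
    """Average depth over a feature such as a gene or candidate operon."""
    region = coverage[start:end]
    return sum(region) / len(region) if region else 0.0


if __name__ == "__main__":
    cov = per_base_coverage(10, [(0, 5), (3, 8)])
    print(cov, mean_reads_per_bp(cov, 0, 5))
```

Adjacent genes with similar depth and no drop in the intergenic region are candidates for the same operon, which is the kind of call the frequency plot supports.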

Dynamic range and sensitivity

New gene, wrong start, riboswitch

Small Regulatory RNA ???

Differential gene expression

Operon with Internal Promoter

60

Long Term Vision


Develop TPing SOPs, and an automated analysis
pipeline.


Initially produce TPs and preliminary GRNs for
all important DOE microbial genomes (i.e. BESC),
and eventually all DOE microbial genomes.


Incorporate the TP analysis pipeline into ORNL’s
automated microbial annotation pipeline, and
eventually into IMG and GenBank files.


Add additional experimental methods to improve
the GRN determinations.
.