Nebraska Collaboration: Projects in Bioinformatics

Oct 1, 2013


For all projects here, Dan is assuming access to large
collections of complete genomes, their predicted translated
proteins with coordinates, HMM search results, Perl, some
background in biology, and some degree of contact with him
to resolve issues. As stated before, a strong local
contact/mentor in Omaha will be absolutely essential to the
success of any of these.


Estimating time required is an art form rather than a
science, and is plagued with inaccuracies. However, we
suspect that all should give fairly definitive answers
about practicality (go/no-go) within six weeks. On the
opposite end, literature searches, validation, and paper
writing may take any of them past six months. Many of
these, taken to completion, resemble candidate job talks,
so expanding one or two of these to thesis length is
clearly possible, if that is desirable. Iterative
implementation is desirable, when applicable.



BEST IDEAS (A-C)



A.  Transcription terminators in bacteria include
rho-dependent terminators and rho-independent terminators.
Each involves some RNA stem-loop formation, although the
characteristics (size, flanks) differ. These sites
generally are treated as if they arose either by convergent
evolution or from so many accumulated changes that only
stem-loop structure, not homology per se, is examined. We
sequenced a genome, Coxiella, where some kind of homologous
family of stem-loop structure (palindromic except in the
middle) clearly is copied and distributed throughout the
genome. Not only are they (usually) intergenic, but
operons more commonly point towards them rather than away
from them, suggesting they act as a family of
transcriptional terminators. This would be, potentially, a
third class to add to rho-dependent and rho-independent.


>sample
GTAGCCTGTATGGAGCGAAGCGGAATACAGGGGTTTCCATTTCTTTA
TAGTTAAAATTTCCCTGTATTCCGCTTCGCTCCATACAGGCTAC


The project is to

* develop bioinformatics tools for finding analogous or
homologous families of intergenic palindromic repeats

* compare their locations to operon structures and weigh
evidence for a terminator role

* see what other species have similar structures and how
the structures relate

* see what the relationship is to work in M. jannaschii
(PMID: 16182294).
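
As a starting point for the first bullet, here is a minimal
sketch (not a finished tool) that measures how far a
candidate element pairs with itself from both ends inward,
using the Coxiella sample above. The function names are
hypothetical, and requiring perfect base pairing is a
simplifying assumption; a real search would tolerate
mismatches and scan intergenic regions rather than a single
pre-cut element.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def stem_arm_length(seq):
    """Count bases, from both ends inward, that pair as a perfect stem."""
    rc = reverse_complement(seq)
    arm = 0
    # seq[i] can pair with seq[-1 - i] exactly when seq[i] == rc[i]
    while arm < len(seq) // 2 and seq[arm] == rc[arm]:
        arm += 1
    return arm

# The sample element quoted in the text (both lines joined).
sample = ("GTAGCCTGTATGGAGCGAAGCGGAATACAGGGGTTTCCATTTCTTTA"
          "TAGTTAAAATTTCCCTGTATTCCGCTTCGCTCCATACAGGCTAC")

arm = stem_arm_length(sample)
loop = sample[arm:len(sample) - arm]   # the non-palindromic middle
print("arm length:", arm, "middle length:", len(loop))
```

Elements whose arms match each other across the genome,
rather than merely each forming a hairpin, would be the
evidence for a homologous family as opposed to convergent
structure.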



(graduate level or advanced undergrad)


-----



B.  More families analogous to LPXTG / PEP-CTERM. A
recurring idea I have used is that a fairly mundane
sequence feature, a collection of transmembrane alpha
helices near one end of a sequence and flanked by some
recurring motif, may represent a previously undiscovered
paralogous family domain within one genome. That is, the
relationship that might have looked like convergent
evolution toward being a transmembrane helix may actually
represent local homology between proteins lacking homology
elsewhere. These transmembrane-region-with-motif homology
domains can represent families shared between species, and
conserved because they interact with some protein such as a
sortase or signal peptidase. The most common ones may have
been discovered already (pre-lipoprotein leader peptide,
prepilin leader peptide, LPXTG sequence, PEP-CTERM
sequence, YSIRK), but there are many more, undiscovered,
and there is no good pipeline for discovering them, as easy
as it may be. The trick is knowing where to look: either
the first TM helix or the last TM helix in a protein.
Cutting down sequence length and getting multiple alignments
rather than pairwise alignments can make these more
apparent. Some of the nicest informatics stories I’ve seen
have dealt with these. I recommend one program for
C-terminal (easy, because stop codons are easy) TM domains,
and another for N-terminal.
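
The "cutting down sequence length" step for the C-terminal
case could be sketched as below. This is only an
illustration: the Kyte-Doolittle sliding window (19
residues, mean hydropathy of at least 1.6) is a stand-in I
am assuming for whatever TM predictor is actually
available, and the function names and the 15-residue flank
are hypothetical choices.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def last_tm_start(protein, window=19, cutoff=1.6):
    """Start of the last window with mean hydropathy >= cutoff, else None."""
    # Scan from the C-terminus backwards so the first hit is the last helix.
    for start in range(len(protein) - window, -1, -1):
        segment = protein[start:start + window]
        if sum(KD.get(aa, 0.0) for aa in segment) / window >= cutoff:
            return start
    return None

def c_terminal_region(protein, flank=15, window=19):
    """Return upstream flank + last putative TM helix + tail, or None."""
    start = last_tm_start(protein, window)
    if start is None:
        return None
    return protein[max(0, start - flank):]

# Toy protein: soluble part, a made-up motif, a hydrophobic helix, basic tail.
toy = "MSDKTEEQRNGSDKTEEQRNGS" * 2 + "VPEPSS" + "LLALGLVALAGLLLVALLA" + "RRKR"
print(c_terminal_region(toy))
```

The trimmed regions from every protein in a genome would
then go into clustering and multiple alignment, where a
shared motif next to the helix should stand out.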



(This can be pursued to different extents; see our paper
on PEP-CTERM. Undergraduate to graduate)

http://www.pubmedcentral.nih.gov/articlerender.fcgi?tool=pmcentrez&artid=1569441&rendertype=abstract



-----



C.  Special classes of protein insert already known
include inteins (a pretty big literature) and
protein-coding palindromic elements as in Rickettsia (see
TIGR01045) and Wolbachia (TIGR02697), and possibly M.
jannaschii (see PMID: 16182294).

I note that proteins in some GC-rich species often have
short inserts into their proteins as compared with close
homologs in other species. What are the bioinformatic
characteristics of these inserts? Are they particularly
common in Frankia alni, as I feel them to be?


Project:

Develop a high-throughput method to find weird inserts (not
repeats) or weird tails to proteins across a genome, or
multiple genomes. See if there are new classes and/or
specialized DNA signals around them. Figure out what the
implications are for protein evolution in GC-rich species.
Here are some sample insert sequences.

e.g.

>insert_NTL01NF4221
RGPDGTVTALPE

>OMNI|NTL01FS0738/40-87
GHGSPASPGSAAVPLPSGSAASPGSAAVPLPSGSAASPGSAAPLGAAT

>OMNI|NTL01FS0004/235-259
TLPDILHASGPPGPPGQPEQPGAGR

>OMNI|NTL01FS0079/236-270
TDSLAVADPDATGVRVGDAGGGDAGGGVPTAAEDL

>OMNI|NTL01FA0110/215-256
AAAGPGEPGAVAAAPAGPSSGITPGTRSTSGSGHQSGFSPFG

>OMNI|NTL01FS0084/215-239
TGGEAGGEDRSSLCEERSTRLGEPR

>OMNI|NTL01FA0119/125-183
PPTGLTPPTEPTPPAERALGTGLTAPAALAPPAERASGAATPQGGAGA
WGPEAARRRAR

>OMNI|NTL01FS0090/42-94
AGRILESGLDVTTAGRAGVVCSVWDKAGGSAGRPGCVARAGPGERSPVRRRSD

>OMNI|NTL01FA0124/548-595
DLVDADLVDAAGLADAADLTDAADLTDVADLTDAADLTDGVDLTGSAQ

Several in the differences between NTL01FA0122 and
NTL01FS0092.

>OMNI|NTL01FA0125/366-389
ADGNHPAGESDEREGAGVRQAAAT

A couple of inserts in NTL01FA0149
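
One possible core for the high-throughput method is a
routine that reads a pairwise alignment of a protein
against a close homolog and reports internal stretches
that have no counterpart in the homolog. The sketch below
assumes the two gapped strings come from whatever aligner
is in use; the function name and the minimum length of 8
are hypothetical choices, not part of the project as
stated.

```python
def find_inserts(aligned_query, aligned_ref, min_len=8):
    """Yield (start, end, residues) for query runs aligned to reference gaps.

    Coordinates are 1-based positions in the ungapped query. Runs that lie
    before the first or after the last reference residue are skipped, so
    only internal inserts (not tails) are reported.
    """
    assert len(aligned_query) == len(aligned_ref)
    ref_cols = [i for i, c in enumerate(aligned_ref) if c != "-"]
    if not ref_cols:
        return
    first, last = ref_cols[0], ref_cols[-1]
    pos = 0     # query residues consumed so far
    run = []    # query residues in the current insert candidate
    for col, (q, r) in enumerate(zip(aligned_query, aligned_ref)):
        if q != "-":
            pos += 1
        if r == "-" and q != "-" and first < col < last:
            run.append(q)
        else:
            if len(run) >= min_len:
                end = pos - (1 if q != "-" else 0)
                yield end - len(run) + 1, end, "".join(run)
            run = []

# Toy alignment: the first sample insert placed between invented flanks.
query = "MKTAYIAKQR" + "GPDGTVTALPE" + "QISFVKSHFS"
ref   = "MKTAYIAKQR" + "-" * 11      + "QISFVKSHFS"
for start, end, residues in find_inserts(query, ref):
    print(f"{start}-{end}", residues)
```

Reporting tails would need the terminal runs that this
sketch deliberately skips, and separating inserts from
repeats would need a further check of each reported
segment against the rest of the protein.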



(advanced undergraduate, or possibly graduate)



OTHER IDEAS (D-K)




D.  GARBASE. The idea here is to create an EXPLICIT
collection of known wrong gene calls. The future of
gene-calling in prokaryotes is that homology methods will
mostly replace statistical methods, once sequence space is
well enough explored. However, the legacy of numerous bad
gene calls in the past has meant that many old bad gene
calls are supported by homology. The problem is especially
tricky for improper 5’-extensions to real proteins.
Establishing a balancing database of peptide sequences that
have been called as proteins, or part of proteins, but are
not, would greatly improve gene calling and break the cycle
of transitive homology-based gene-calling errors. Note
that GARBASE will also be useful in benchmarking methods
of gene finding, methods of protein annotation (everything
should be called “spurious translation”), and so on. I
have a starter set already.



(GARBASE in its simplest form could be an advanced
undergrad project. However, developing it fully, using it
in gene-calling pipelines, using it for metagenomics, and
using it to benchmark gene-calling projects makes it a full
graduate project)


---



E.  More families analogous to LPXTG / PEP-CTERM. See
project B above, which gives the full description.

(Undergraduate to graduate)


---



F.  Comparative genomics for prokaryotic start site
editing. Moving along two homologous protein sequences
from C-term to N-term, sequence similarity usually drops
sharply when an attempt is made to go further upstream than
the actual start codon. But a loss of protein-level
homology could also represent a frameshift, variable
protein architecture, or the lower levels of sequence
conservation found in signal sequences rather than mature
proteins. Conversely, homology may persist upstream of the
true start because of constraints from regulatory sites or
from conservation of a gene in a different frame. But it
seems to me that when multiple close homologs are compared,
characteristics of homology drop-off are so striking that
they offer one of the highest-resolution mechanisms
available for assigning start sites. A tool that uses this
trick for an RBS-independent predictor of starts would be
hugely valuable in any genomics pipeline.
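
To make the drop-off idea concrete, here is a minimal
sketch under stated assumptions: the input is a multiple
alignment of translations that have each been extended
upstream of the called start, and the boundary is taken as
the column where mean conservation rises most sharply,
reading N-term to C-term. That change-point rule, the
function names, and the min_flank parameter are my own
illustrative choices, not an established method.

```python
from collections import Counter

def column_conservation(alignment):
    """Per column, the fraction of rows sharing the most common residue."""
    scores = []
    for column in zip(*alignment):
        residues = [c for c in column if c != "-"]   # gaps never count
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(column))
    return scores

def dropoff_column(alignment, min_flank=3):
    """Column index where conservation rises most sharply, or None."""
    cons = column_conservation(alignment)
    best_col, best_gain = None, 0.0
    for k in range(min_flank, len(cons) - min_flank + 1):
        upstream = sum(cons[:k]) / k
        downstream = sum(cons[k:]) / (len(cons) - k)
        if downstream - upstream > best_gain:
            best_col, best_gain = k, downstream - upstream
    return best_col

# Toy alignment: six unrelated upstream columns, then a conserved protein.
toy_alignment = [
    "ARNDCQ" + "MKTLLVAGSG",
    "GHILKF" + "MKTLLVAGSG",
    "PSTWYV" + "MKTLLVAGSG",
    "EDQNRC" + "MKTLLVAGSG",
]
print(dropoff_column(toy_alignment))
```

A usable predictor would then look for the nearest
plausible start codon at or just downstream of the
reported column, and would have to rule out the
alternative explanations listed above (frameshifts,
signal sequences, variable architecture).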


(Advanced undergraduate project)


---



G.  Novel protein repeat finder (with iterator). Long
proteins with no HMM hits are likely to have some kind of
protein repeat. It takes too much evolutionary creativity
to create a long protein and have no recognized homologies,
unless the creation mechanism involved creating something
new but short, and repeating it many times. These novel
repeats frequently show homologous repeat proteins in other
species, and looking transitively from apparent homologs to
homologs to homologs may reach a known class of probable
beta-propeller repeat proteins, for example. Some of these
classes of repeat are interesting because they help mark
cell location (often, the cell surface) or function.


In addition to being high-interest, as the repeats reflect
surface location or modes of evolutionary variation for
virulence proteins, these repeats represent a difficult
corner of the protein annotation world. Taking any one
repeat out to the level of all apparent homologs can lead
to collisions and confusion. A given repeat will have a
mixture of high-scoring and low-scoring copies in the
proteins that contain it. It would therefore be good to
develop a combination iterative repeat finding and repeat
class database. The challenges therefore are two-fold.
First, develop a means to find repeats, cold, in long
proteins with no HMM-detectable domains or repeats. Second,
develop the means to collect repeats, build HMM models for
the repeats, and iterate. Iteration will require checks on
repeat length consistency and avoidance of collisions with
other repeat class definitions.
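
For the first challenge, one very crude way to find a
repeat "cold" is to compare a protein against itself at
every offset and keep the offset with the highest residue
identity. The sketch below assumes exact-identity scoring
and a fixed period range, both simplifications; the
function name is hypothetical, and a real finder would use
a substitution matrix and allow indels between copies.

```python
def best_repeat_period(protein, min_period=3, max_period=60):
    """Return (period, identity) for the best-scoring self-offset."""
    best = (None, 0.0)
    upper = min(max_period, len(protein) // 2)
    for period in range(min_period, upper + 1):
        pairs = len(protein) - period
        # Count positions that match the residue one period downstream.
        matches = sum(protein[i] == protein[i + period] for i in range(pairs))
        identity = matches / pairs
        if identity > best[1]:   # strict, so the shortest period wins ties
            best = (period, identity)
    return best

# A perfect toy repeat, then one of the sample inserts from project C.
print(best_repeat_period("DLTDAA" * 6))
print(best_repeat_period("DLVDADLVDAAGLADAADLTDAADLTDVADLTDAADLTDGVDLTGSAQ"))
```

Copies cut at the reported period could seed the second
challenge: align them, build an HMM, search again, and
check that the repeat length stays consistent between
rounds.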


Note that there may be some linkage between this project
and efforts to identify OMP proteins automatically.


(The first step could be an undergraduate project, but
iterating, making a database of different repeat classes,
validating, and creating a tool to add to annotation
pipelines makes it graduate-level).


---



H.  OMP (Outer Membrane Protein) signal refinement. The
rule that determines if a protein will be found in the
bacterial outer membrane may be as simple as “if a protein
is a transmembrane beta-barrel, it is an outer membrane
protein”, but this does not help too much because it may be
hard to distinguish transmembrane beta-barrels from other
beta-rich structures. There is, apparently, a decent
predictor signal. If the last 10 residues are alternating
hydrophobic/hydrophilic, with three aromatics including the
last residue, the protein probably is an OMP protein.
However, this signal has VERY high false-positive and
false-negative rates. I have done better with TIGRFAMs
model TIGR03304. I have a list of > 15 HMMs that ALWAYS
mark the end of an OMP protein, plus an HMM for the final
10 residues that is probably better than any
regular-expression tool. OMP proteins tend to have protein
repeats associated with transmembrane beta-barrel
structure. I picture a system with iteration of an OMP
anchor signal model, with a pretty stringent support system
ratifying/validating newly recruited OMPs. The OMP
recognition iterator can interact with literature-driven
submission of additional protein families. Because OMP
proteins represent the exposed surface proteins of many
important human pathogens, there is strong interest in
finding a good means to identify novel classes of OMP
protein.
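
The simple last-10-residue rule described above can be
written down directly, which is useful mainly as a
baseline to beat. In this sketch the hydrophobic alphabet,
the reading of "three aromatics" as "at least three", and
the max_mismatches tolerance are all my assumptions; the
text does not pin them down.

```python
HYDROPHOBIC = set("AVLIMFWYC")   # assumed hydrophobic alphabet
AROMATIC = set("FWY")

def has_omp_tail(protein, max_mismatches=0):
    """Test the last 10 residues for the alternating/aromatic rule."""
    tail = protein[-10:]
    if len(tail) < 10 or tail[-1] not in AROMATIC:
        return False
    if sum(aa in AROMATIC for aa in tail) < 3:
        return False
    mismatches = 0
    for offset, aa in enumerate(reversed(tail)):
        # The last residue and every second one before it should be
        # hydrophobic; the positions in between should not be.
        want_hydrophobic = offset % 2 == 0
        if (aa in HYDROPHOBIC) != want_hydrophobic:
            mismatches += 1
    return mismatches <= max_mismatches

# Invented sequences, used only to exercise the rule.
print(has_omp_tail("MKKTAIAIAVALAGFATVAQA" + "KYSVNFRWEF"))
print(has_omp_tail("MKKTAIAIAVALAGFATVAQA" + "MKKLLDDEEK"))
```

Given the very high error rates noted above, this check
would serve as one weak signature among several, alongside
HMM hits to the C-terminal models.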



(The simplest implementation is almost an exercise, one
motif within the general approach of project I, below. But
OMP detection in general is an open question of high
importance, needing a multi-pronged attack and deep
drilling in the literature. Building an OMP detection
database system that combines multiple types of signatures
and gives evidence codes with OMP assignments in an
annotation pipeline is a big graduate-level project)


---



I.  Species-tailoring of general PROSITE-like signals.
Once again, I note that features described as physically
determined, like signal sequences, often have a look of
homology to them within a species. This means that
high-multiplicity signals such as the lipoprotein signal
sequence can become much more accurate when tailored to a
species. First, all candidate sequences are identified.
Next, alignment establishes that a large subset of these
cluster tightly. These become the high-confidence calls
within the species, while the outliers (although they meet
regular expression requirements) become highly suspect.
This principle can lead to a generalizable tool in which a
fairly large number of standard motif definitions can get
tuned on the fly to species, genus, and so on. The
beauty is that the motifs that can be refined in this way
are already widely known, badly needed in annotation
pipelines, and recurringly infuriating because they perform
so poorly when not optimized. I don’t believe that such a
motif optimizer exists, but I think it could become an
industry standard.
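
The "cluster tightly versus outlier" step might look
something like the following sketch, which assumes the
candidate motif instances have already been extracted as
equal-length windows (for example around a regular
expression match). It builds a position-specific profile
from all candidates and ranks each one against it. The
function names, the pseudocount, and the uniform background
are illustrative assumptions, and no cutoff is chosen here.

```python
import math
from collections import Counter

def profile_scores(instances, pseudocount=0.5):
    """Log-odds score of each instance against a profile of all instances."""
    n, alphabet = len(instances), 20
    columns = [Counter(col) for col in zip(*instances)]
    scores = {}
    for inst in instances:
        total = 0.0
        for aa, counts in zip(inst, columns):
            freq = (counts[aa] + pseudocount) / (n + alphabet * pseudocount)
            total += math.log(freq * alphabet)   # vs. a uniform background
        scores[inst] = total
    return scores

def rank_most_suspect_first(instances):
    """Distinct instances sorted from lowest (most suspect) score upward."""
    scores = profile_scores(instances)
    return sorted(scores, key=scores.get)

# Invented five-residue windows ending at a cysteine, one of them odd.
candidates = ["LLAGC", "LLSGC", "LVAGC", "LLAGC", "ILAGC", "LLTGC", "MFWAC"]
print(rank_most_suspect_first(candidates))
```

Each instance contributes to the profile that scores it,
which flatters outliers slightly; a leave-one-out profile
or an iterated rebuild from the high scorers only would be
the natural refinements.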


Note that there can be DNA versions as well as protein
versions of this approach: turn a phylogenetically broad
definition of some DNA feature into a clade-specific
enhanced version for more accurate genome annotation of
that feature.


(advanced undergraduate or graduate project)


---



J.  Transfer RNA (tRNA) genes are found almost flawlessly
by tRNAscan-SE. The number is not simply one tRNA per
anticodon. tRNA counts have some interesting features:
clusters of tRNA genes, high variability in tRNA gene
numbers, etc. Many ideas can be tested here, including
whether having a large number of tRNAs of some type may
represent a rapid evolution to be able to express different
kinds of proteins. What do you get when you take a 16S rRNA
tree and overlay it with information about evolving tRNA
populations and codon/nucleotide frequencies? Are there
relationships between tRNA counts and codon bias? My bet
is that certain kinds of abrupt shift in evolutionary
direction are accompanied by changes in tRNA panels, so
that comparison of strains within a species or species
within a genus may be informative. In this case, I don’t
have a clear sense of where the story will go. I suspect
there tends to be paralogous expansion of particular tRNAs,
gene loss of many of the paralogs, expansion from one of
the remaining copies, and so on, which should leave an
odd-looking pattern. There is also lateral transfer. So
the first step is simply a tool for doing comparative
genomic tRNA content analysis, and the ideas can come after
inspection shows where the aberrations are.
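
The first-step tool can start very small. The sketch below
assumes tRNA predictions have already been parsed into
(amino acid, anticodon) pairs, one per gene; the function
names and the example genomes are invented for
illustration.

```python
from collections import Counter

def anticodon_counts(trna_genes):
    """Tally gene copies per (amino acid, anticodon) pair for one genome."""
    return Counter(trna_genes)

def copy_number_differences(genome_a, genome_b):
    """Map each tRNA whose copy number differs to (count_a, count_b)."""
    a, b = anticodon_counts(genome_a), anticodon_counts(genome_b)
    return {key: (a[key], b[key])
            for key in sorted(set(a) | set(b)) if a[key] != b[key]}

# Two invented strains that differ in Leu copy number and one Arg gene.
strain_1 = [("Leu", "CAG")] * 4 + [("Gly", "GCC")] * 2 + [("Met", "CAT")] * 3
strain_2 = ([("Leu", "CAG")] + [("Gly", "GCC")] * 2 + [("Met", "CAT")] * 3
            + [("Arg", "CCT")])
for trna, counts in copy_number_differences(strain_1, strain_2).items():
    print(trna, counts)
```

Laying such tables along a 16S tree, next to codon usage
for the same genomes, is where the inspection described
above would begin.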


(long-shot science. I’m sure there will be something
interesting, but there won’t necessarily be anything solid
and important)


---



K.  I once looked at a restriction enzyme, where I had a
guess at its recognition sequence. It turned out that, of
all allowable permutations of the nucleotides in the
pattern I proposed (keep the same composition but change
the order), the actual pattern was the rarest in the
genome. I suspect that this is a common feature of
restriction enzymes in genomes: their cognate sequences
become rare. It would be clever to test this for a number
of restriction enzymes with known cut patterns from
sequenced host genomes, and see whether one could suggest a
means to predict restriction sites from complete genomes
from a combination of homology and rarefied oligo
frequencies.


Project: test the statistical relationships between
cutting patterns and oligo frequencies, so that a
combination of best-guess recognition site by homology and
support by an oligo-frequency test among the top candidates
can lead to accurate restriction specificity predictions.
Use this technique to try to find new restriction enzyme
recognition sequences where one is not yet known.
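
The permutation test itself is short enough to sketch.
This version counts forward-strand occurrences only, which
I am assuming is adequate as a first pass for palindromic
sites; the function names and the toy genome are
hypothetical.

```python
from collections import Counter
from itertools import permutations

def permutation_counts(genome, site):
    """Count forward-strand occurrences of every reordering of `site`."""
    variants = {"".join(p) for p in permutations(site)}
    k = len(site)
    # Tally every overlapping k-mer once, then look up each variant.
    seen = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return {v: seen[v] for v in variants}

def rarity_rank(genome, site):
    """Number of same-composition reorderings strictly rarer than `site`."""
    counts = permutation_counts(genome, site)
    return sum(1 for c in counts.values() if c < counts[site])

# Toy genome that contains GATATC repeatedly but never GAATTC.
toy_genome = "GATATC" * 5
print(rarity_rank(toy_genome, "GAATTC"))   # rank of the absent pattern
print(rarity_rank(toy_genome, "GATATC"))   # rank of the abundant pattern
```

A rank of zero means no reordering is rarer than the
candidate, which is the signature described above. On real
genomes the counts would need correcting for background
composition, since some reorderings are rare for reasons
unrelated to restriction.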


(skilled undergraduate project)