Slide 1 - McClure lab homepage - Montana State University

breadloafvarious, Biotechnology

20 Feb 2013 (4 years and 3 months ago)

166 views


from the thesis studies of
Sean Bruce Cleveland

The Department of Microbiology

Montana State University

Marcella A McClure, P.I.

NIH/NIAID R21A1028309 high risk program

Host range of the order Mononegavirales

Infection of plants
Rhabdoviridae: Cytorhabdovirus, Nucleorhabdovirus

Infection of vertebrates
Rhabdoviridae: Lyssavirus, Vesiculovirus, Ephemerovirus
Paramyxoviridae
Filoviridae

Infection of invertebrates
Rhabdoviridae



“OLD” FOES



rabies (Rhabdoviridae)



measles, RSV, mumps (Paramyxoviridae)


“EMERGING” THREATS

Ebola, Marburg (Filoviridae)

equine morbillivirus, Nipah virus (Paramyxoviridae)



MODEL AGENT


vesicular stomatitis virus (Rhabdoviridae)


The CDC has included Ebola and Marburg
and Rabies viruses in their list of
Bioterrorism Agents/Diseases.



Most of these viruses have few treatment
options outside of vaccination.



Attempts to define the structure of the replication/transcription complex of viruses of the order Mononegavirales using physical methods like X-ray crystallography and NMR spectroscopy have just begun to produce some results.


Composed of the RNA genome and three proteins: N, P
and L.




The ~12,000-base RNA genome of this complex is always associated with the N protein. The N:RNA coupling protects the genome from ribonuclease digestion.




The L:P:N/RNA complex, made up of one RNA genome and approximately 1200 N, 400 P, and 50 L proteins, is beyond the limits of current structure determination methods (i.e., X-ray crystallography or NMR spectroscopy).


Due to the size of the L polymerase (~2100 AA), no structural information is available


A large multifunctional enzyme that is believed to possess all the catalytic activity of the RdRp complex.


viral genome replication,


mRNA transcription initiation, elongation, termination and


Three basic properties.




Binds to the RNA genome to protect it from ribonucleases.


Polymerizes to encapsidate the entire length of the genome.


N requires association with P to encapsidate the RNA; otherwise N aggregates.



Proposed mechanism for N's involvement in RNA synthesis:

A portion of the N protein temporarily dissociates from the RNA with the active
polymerase complex






Has a central role assisting N in recognition and encapsidation of the RNA genome and functions by
allowing L to specifically recognize the N:RNA template and progress along it.


Forms a dimer with the central oligomerization domain.


This interdimer interaction suggests the existence of a tetramer of these side by side dimers


- Disorder Predictions
- Correlated Mutations and Intra-residue Contact Predictions
- Correlated Mutations and Inter-residue predictions
- Bayesian Inference Network
- Creation of an SQL database to store all results
- Construction of a web accessible resource

Creation of Automated Consensus Predictor

Disorder Prediction

DISOPRED2

IUPred

PONDR/

DisEMBL

Multiple alignment

Evolutionary Dynamics

Phylogenetic reconstruction

Intra-CM

Inter-CM

Integration of Heterogeneous Data Sources in a Bayesian Framework


GUI Interface and Display for public access


Evolutionary Relationships of 63 N protein sequences from the three families of the
order Mononegavirales

MrBayes phylogenetic tree performed over 10 million generations of a mixed amino-acid model with posterior probabilities at each node.


Why?


Proteins that are intrinsically unstructured have their secondary structure tied to their function, i.e., upon binding they assume a secondary structure. Thus regions of disorder are likely portions of the protein important to forming the complex or important to its function.



Neural Network Trained: PONDR, DisoPRED

Pairwise Energy Calculation: IUPred

Secondary Structure Prediction: DisEMBL

[Figure: per-residue disorder predictions; x-axis: residue position, 50 to 1000. Rabies pdb 2GTT. Disorder 0.5 and up.]
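The slides name an automated consensus predictor and a disorder cutoff of 0.5, but they do not say how the individual predictor outputs are combined. The following is a minimal sketch, assuming each predictor yields one score per residue on a 0 to 1 scale and that the consensus is a plain average. The function names and the toy scores are hypothetical and are not the lab's implementation.

```python
from statistics import mean

def consensus_disorder(scores_by_predictor, threshold=0.5):
    """Average per-residue scores across predictors and flag residues
    whose mean score is at or above the threshold (assumed rule)."""
    lengths = {len(s) for s in scores_by_predictor.values()}
    if len(lengths) != 1:
        raise ValueError("all predictors must score the same number of residues")
    n = lengths.pop()
    averaged = [mean(s[i] for s in scores_by_predictor.values()) for i in range(n)]
    return [score >= threshold for score in averaged]

def disordered_regions(flags):
    """Collapse per-residue flags into 1-based inclusive (start, end) regions."""
    regions, start = [], None
    for pos, flag in enumerate(flags, start=1):
        if flag and start is None:
            start = pos
        elif not flag and start is not None:
            regions.append((start, pos - 1))
            start = None
    if start is not None:
        regions.append((start, len(flags)))
    return regions

# Toy scores for a five-residue sequence (made-up values).
toy_scores = {
    "predictor_a": [0.1, 0.6, 0.7, 0.2, 0.9],
    "predictor_b": [0.2, 0.5, 0.8, 0.1, 0.7],
    "predictor_c": [0.0, 0.7, 0.6, 0.3, 0.8],
}
flags = consensus_disorder(toy_scores)
print(disordered_regions(flags))
```

Averaging is only one option; a majority vote over per-predictor calls would be an equally plausible reading of "consensus".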



A.K.A. correlated mutation


Concept:


A mutation event is random and has equal
opportunity to strike at any place on a genome


A mutation event which is detrimental to survival
will not be observable in anything living


A mutation at one interactive site of a protein will
likely require a “compensatory” mutation to account
for the difference at the conjugate position
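The compensatory-mutation idea above can be illustrated by scoring how strongly two alignment columns vary together. The sketch below uses mutual information between columns, which is a generic covariation measure swapped in for illustration. It is not the algorithm used by Xdet, CORNET, ConSEQ or CAPS. The toy alignment, the threshold and the function names are assumptions.

```python
from collections import Counter
from itertools import combinations
from math import log2

def column(alignment, i):
    """Return the residues found at 0-based column i of an alignment."""
    return [seq[i] for seq in alignment]

def mutual_information(col_a, col_b):
    """Mutual information (in bits) between two alignment columns."""
    n = len(col_a)
    count_a, count_b = Counter(col_a), Counter(col_b)
    count_ab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), c in count_ab.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((count_a[a] / n) * (count_b[b] / n)))
    return mi

def covarying_pairs(alignment, min_mi=0.5):
    """List 1-based column pairs whose mutual information reaches min_mi."""
    length = len(alignment[0])
    pairs = []
    for i, j in combinations(range(length), 2):
        mi = mutual_information(column(alignment, i), column(alignment, j))
        if mi >= min_mi:
            pairs.append((i + 1, j + 1, round(mi, 3)))
    return pairs

# Toy alignment: column 1 (A/G) changes together with column 3 (L/V).
toy_alignment = ["AKLE", "AKLD", "ARLE", "GRVD", "GRVE", "GKVD"]
print(covarying_pairs(toy_alignment))
```

Raw mutual information is inflated by shared ancestry and small sample sizes, which is one reason dedicated tools apply phylogenetic corrections.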

INTRAMOLECULAR CM


Protein sequence has
minimal information


Structure can be
minimally inferred


Topography can be
minimally inferred


Secondary and tertiary
structures are
maintained by
interacting residues
WITHIN a polypeptide

INTERMOLECULAR CM


Protein sequence has
minimal information


Structure can be
minimally inferred


Topography can be
minimally inferred


Quaternary structures
are maintained by
interacting residues
BETWEEN two or more
polypeptides


What do these programs do?


Xdet, CORNET, ConSEQ, CAPS




What is needed to use these programs?



At least 10 sequences in a data set between 20-90% ID
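The input requirement above can be checked before running any of the predictors. This is a minimal sketch, assuming that "20-90% ID" refers to pairwise percent identity over aligned, ungapped columns and that every pair must fall inside that window. Both readings are assumptions, as are the function names.

```python
from itertools import combinations

def percent_identity(seq_a, seq_b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    compared = matches = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            continue
        compared += 1
        matches += a == b
    return 100.0 * matches / compared if compared else 0.0

def dataset_is_usable(alignment, min_seqs=10, low=20.0, high=90.0):
    """True if there are enough sequences and every pair is within [low, high] % ID."""
    if len(alignment) < min_seqs:
        return False
    return all(
        low <= percent_identity(a, b) <= high
        for a, b in combinations(alignment, 2)
    )

# Toy check with a relaxed sequence count, for illustration only.
print(dataset_is_usable(["ACDE", "ACDF", "ACGF"], min_seqs=3))
```

In practice, sequences that fall outside the window would usually be pruned one at a time rather than rejecting the whole data set.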

Intra-residue contact predictors: ConSEQ and CORNET.

ConSEQ makes predictions by estimating the rate of amino acid substitutions at each position in a MSA of homologous proteins. The underlying assumption of this approach is that, in general, structurally and functionally important residues are slowly evolving.

CORNET is a neural network-based method using correlated mutations, sequence conservation, predicted secondary structure, and evolutionary tree information.

Coevolving residue mutation predictors: XDET and CAPS.

CAPS compares the correlated variance of the evolutionary rates at two sites corrected by the time since the divergence of the protein sequence they belong to.

XDET compares the mutational behavior of a residue position with the mutational behaviors of the entire alignment, which assumes the positions showing a family-dependent conservation pattern will have similar mutational behaviors as the rest of the family.


The results of these four approaches were combined into a consensus prediction
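The slides do not state how the four outputs were combined. One simple possibility is a vote over residue positions, sketched here under the assumption that each program's output has already been reduced to a set of flagged positions. The positions, the vote threshold and the function name are hypothetical.

```python
from collections import Counter

def consensus_positions(predictions, min_votes=2):
    """Return sorted residue positions flagged by at least min_votes methods."""
    votes = Counter()
    for positions in predictions.values():
        votes.update(set(positions))
    return sorted(pos for pos, n in votes.items() if n >= min_votes)

# Made-up residue positions per method, for illustration only.
toy_predictions = {
    "ConSEQ": {12, 45, 101},
    "CORNET": {45, 101, 230},
    "CAPS": {45, 77},
    "XDET": {12, 45, 230},
}
print(consensus_positions(toy_predictions, min_votes=3))
```

Raising `min_votes` trades sensitivity for agreement; with four methods, a threshold of 3 keeps only positions supported by a clear majority.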

Rabies pdb 2GTT

Intra mapped

Rabies Chain A from pdb 2GTT

Disorder 0.5 and up mapped and Intra mapped blue, green is both

Note: residues 372-398, which are disordered, are missing

VSV N&P pdb

N in Green and Blue

P in Magenta and Purple

Disorder 0.5 and up in Yellow


The Mutualism Continuum of Human Retroid Agents

[Figure: continuum diagram. Labels recovered from the slide are listed below.]

Continuum: parasite, commensalism, beneficial symbiont

Diseases: Hemophilia A; Muscular dystrophies; Alport Syndrome-Diffuse Leiomyomatosis; Chronic Granulomatous; Insulin dependent diabetes; Rheumatoid arthritis; Schizophrenia; Multiple sclerosis; Systemic lupus erythematosus; Testicular tumors; AIDS; Human T-cell leukemia

Categories: Gene regulation; Chromosomal repair; Reproduction; Deadly disease; Disease association; Genetic disease

Agents: HIV; HTLV; TERT; Endogenous retroviruses; LINEs; Retroviral LTRs

Retroid agents:

Danio rerio

Chromosome 14

RepeatMasker

GPS/RASCAL

Retroviruses:


New fish retrovirus

81

2 small fragments of two different non-fish retroviruses

648

Yes, complete genome with novel LTRs

Retrotransposons:


DRR1


DREGG1


Size


Gene Components


LTRs

24 classified as gypsy

Yes

Misclassified: retrovirus nucleotide fragments


no info


no info

2301

Yes

Yes, copia-like

4994

GAG/PR/IN/RT/RH

5’ and 3’

Retroposons:


Babar


Size


Gene Components

980

LINE-like

100

no

812

Yes

2343

APE/RT

Other:

pararetroviruses, group II introns, archaea-RT like and the fish TERT

0

69

Results from RepeatMasker and the GPS/RDAP. Size is in nucleotides. The values found by the GPS/RDAP for the Retroid classes include all statistically significant RT occurrences, only a fraction of which are full-length Retroid agents. The GPS classification has revealed many smaller Retroid genomes, all of which have the RT gene, various other Retroid gene components, with and without LTRs or UTRs. Gene components are: the group specific antigen (GAG), protease (PR), integrase (IN), RT, ribonuclease H (RH) and apurinic-apyrimidinic endonuclease (APE). All other terms as defined in the text.


GPS/RASCAL*, REPbase, Retrobase

Database Size: 2 Million+, 7000+, 138

Refined Classification





Chromosome Positions






Blast Results



Excised Sequences







*Sequence alignments





Gene-by-gene Assessments:



Frame shift#



Stop Codon#



OSM Score



% Identity



Genome % Identity





Condition Differentiation

(i.e. partial, complete, LTR, etc.)





*New functionality:



Viability Score


Recombination


Complementation









The database size represents the total number of unique sequences available for analysis. For the GPS/RDAP system, this number is comprised of the total number of unique RT hits, with full genome analysis, from a variety of sequenced genomes already processed by the software. The size of the current GPS database is rapidly increasing as new genomes are analyzed. The asterisks indicate some of the new functionality proposed for the RDAP.

Retroid agent Database Feature Comparison.

WU-tBLASTn

RT queries against host chromosome

Retroid RT queries

Analyze all raw RT hits

Stage I GPS

Compound small hits

Correct for cross coverage

Correct for redundancy

Unique RT hits
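The Stage I steps above reduce raw hits to a unique set. The slides do not give the rules GPS applies, so the following is only a sketch of one plausible reading: hits are coordinate intervals on one chromosome strand, and hits that overlap, duplicate each other or sit within a small gap are merged into one. The function name and the `max_gap` parameter are assumptions.

```python
def merge_hits(hits, max_gap=0):
    """Merge (start, end) hits, 1-based inclusive, that overlap, duplicate
    each other, or are separated by at most max_gap bases."""
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1] + max_gap + 1:
            # Extend the previous hit instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Made-up coordinates: two overlapping hits, one exact duplicate, one distant hit.
raw_hits = [(100, 200), (150, 300), (100, 200), (500, 600)]
print(merge_hits(raw_hits))
```

A real pipeline would also need to track strand, reading frame and which query produced each hit before merging.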

Analysis for frame shifts and stop codons and ordered series of motifs


One stop codon, partial motif

Excises a 14 kb chromosomal segment inclusive of each RT hit

Stage II GPS

7 kb

7 kb
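The excision step can be sketched as a window calculation. The slide labels two 7 kb spans around the RT hit but does not say whether they are measured from the hit's midpoint or its edges. This sketch assumes the midpoint, so the window totals 14 kb, and clips the window at the chromosome ends. Coordinates are 0-based and half-open, and the function names are hypothetical.

```python
def excision_window(hit_start, hit_end, chrom_length, flank=7_000):
    """Return the (start, end) window of up to 2 * flank bases centred on the hit."""
    midpoint = (hit_start + hit_end) // 2
    start = max(0, midpoint - flank)
    end = min(chrom_length, midpoint + flank)
    return start, end

def excise(chromosome, hit_start, hit_end, flank=7_000):
    """Cut the window around a hit out of a chromosome sequence string."""
    start, end = excision_window(hit_start, hit_end, len(chromosome), flank)
    return chromosome[start:end]

# A hit in the middle of a chromosome, and one close to its start.
print(excision_window(50_000, 50_600, 1_000_000))
print(excision_window(1_000, 1_600, 1_000_000))
```

Hits near a chromosome end yield a shorter segment, which downstream steps would need to tolerate.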

Compares DNA cutout to query component library in a RT outward fashion thereby constructing the Retroid agent genome

Query component library: PRO, 5’LTR, GAG, IN, RH, ENV, 3’LTR

Full length Retroid agent

Perfect + one frame shift + one stop codon full-length Retroid genomes = Potentially active sequences

[Diagram: RT hits (repeated RT labels). Motif strings shown: LPQG*LFK and ..PKK..LDL..LPQG..YADDLL..FLG..FLG..]

Reports results for all RT extensions: number of and % identity to all query components, stop codons, and frame shifts.

Segments with all query components in the correct order are labeled full length with query as closest relative.
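The full-length rule above can be expressed as an order check over the components found in a segment. This is a minimal sketch, assuming each detected component is recorded with its start coordinate and that the expected order comes from the query genome. The retrovirus-style order used in the example is illustrative only, since gene order differs between Retroid classes, and the function name is hypothetical.

```python
def is_full_length(found, expected_order):
    """True if every expected component was found and the start
    coordinates increase in the expected order."""
    if any(name not in found for name in expected_order):
        return False
    starts = [found[name] for name in expected_order]
    return all(a < b for a, b in zip(starts, starts[1:]))

# Illustrative component order and made-up coordinates.
expected = ["5'LTR", "GAG", "PRO", "RT", "RH", "IN", "ENV", "3'LTR"]
segment = {
    "5'LTR": 120, "GAG": 700, "PRO": 2300, "RT": 2900,
    "RH": 4100, "IN": 4600, "ENV": 5900, "3'LTR": 8200,
}
print(is_full_length(segment, expected))
```

Passing the order as a parameter lets the same check serve queries whose layout differs, such as the GAG/PR/IN/RT/RH arrangement listed for the copia-like element.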

All six motifs of the OSM, no stop codons or frame shifts
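The motif and stop codon checks can be sketched on a translated RT hit. This example treats the six motif strings shown on the slide as exact substrings that must appear left to right, and counts `*` characters as stop codons. Real RT motifs are degenerate, so exact matching is a simplifying assumption. Frame shift detection is not covered, and the test sequences and function names are made up.

```python
def ordered_motifs_found(protein, motifs):
    """Count how many motifs, taken in order, can be located left to right.
    Stops at the first motif that cannot be found."""
    position, found = 0, 0
    for motif in motifs:
        index = protein.find(motif, position)
        if index == -1:
            break
        found += 1
        position = index + len(motif)
    return found

def assess_hit(protein, motifs):
    """Summarise motif coverage and stop codons for one translated hit."""
    return {
        "motifs_found": ordered_motifs_found(protein, motifs),
        "motifs_expected": len(motifs),
        "stop_codons": protein.count("*"),
    }

# Motif strings as they appear on the slide; sequences are invented.
OSM = ["PKK", "LDL", "LPQG", "YADDLL", "FLG", "FLG"]
intact = "MAPKKTTLDLSSLPQGAAYADDLLVVFLGKKFLGE"
broken = "MAPKKTTLDLSSLPQG*LFK"
print(assess_hit(intact, OSM))
print(assess_hit(broken, OSM))
```

The second sequence mirrors the "one stop codon, partial motif" case: only the first three motifs are found and one stop codon is counted.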

HGD Freeze
Date

[Figure: phylogenetic tree. Taxa: OLERV3, ERV3 Tet, OLERV2, DRERV4, GAERV2, GAERV3, DRERV2, ZFERV, GAERV1, OLERV1, SSSV, ERV4 Tet, ERV2 Tet, DRERV1, DRERV3, DRERV5, SnRV, WDSV, WEHV1, WEHV2, HIV1, MMTV. Scale bar: 0.5. Node labels range from 50 to 100, with some nodes marked * or *~. Groupings labeled Clade 1, Clade 2 and Lineage 1 to Lineage 4.]

Preliminary observations of consensus trees generated with a mixed amino acid model and eight category gamma distribution rate produced high posterior probabilities with a number of incorrect internodes even after 100,000s of iterations and apparent convergence.


MrBayes

100%

identical

Align all LTR pairs.

Measure distance (d) between each pair.

Estimate insertion time (t): t = d/(2r).

Rate (r) = neutral evolution (Tetraodon and Takifugu genera, using 5,802 orthologues).
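The dating steps above reduce to a short calculation. The sketch below assumes the distance d is an uncorrected proportion of differing sites between the aligned 5' and 3' LTRs, and that r is given in substitutions per site per year, so t comes out in years. The slides do not state the rate value or the distance model, so the rate and sequences in the example are placeholders and the function names are hypothetical.

```python
def ltr_distance(ltr5, ltr3):
    """Uncorrected p-distance between two aligned LTRs, skipping gap columns."""
    compared = differences = 0
    for a, b in zip(ltr5, ltr3):
        if a == "-" or b == "-":
            continue
        compared += 1
        differences += a != b
    if compared == 0:
        raise ValueError("no comparable positions in the LTR alignment")
    return differences / compared

def insertion_time(d, r):
    """Insertion time t = d / (2r). The factor 2 reflects that both LTRs
    accumulate substitutions independently after insertion."""
    return d / (2 * r)

# Toy LTR pair with one difference in ten sites; the rate is a placeholder.
d = ltr_distance("ACGTACGTAC", "ACGTACGTAA")
t_years = insertion_time(d, r=1e-8)
print(d, t_years / 1e6)
```

Identical LTR pairs give d = 0 and therefore t = 0, which is consistent with the zero entries in the insertion time plots.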

Figure 3

[Plots of insertion times per element; axis: Millions of Years (Log10), 0.0001 to 10. Per-element values cannot be matched to elements in the extracted text. Element labels by panel:]

A. DRERV1.1 (1/5), DRERV1.3 (1/3), DRERV1.4 (0/1), DRERV1.6 (0/1), DRERV1.8 (0/1), DRERV1.9 (0/1), DRERV2 (0/5), DRERV3.1 (1/2), DRERV3.2 (0/1), DRERV3.3 (1/1), DRERV3.4 (0/1), DRERV3.5 (0/1), DRERV4.1 (0/12), DRERV4.4 (1/1), DRERV4.5 (0/2), DRERV4.6 (0/1), DRERV4.7 (0/1), DRERV5 (2/2), ZFERV (2/3)

B. OLERV1.1 (1/1), OLERV1.2 (0/1), OLERV1.3 (0/2), OLERV1.4 (0/2), OLERV1.6 (0/1), OLERV1.8 (0/1), OLERV1.9 (0/1), OLERV3.1 (1/1), OLERV3.2 (0/1), OLERV3.3 (0/1)

C. GAERV1.1 (1/2), GAERV1.2 (0/1), GAERV2.1 (2/17), GAERV2.2 (0/2), GAERV2.3 (0/1), GAERV2.4 (0/1), GAERV2.5 (0/2), GAERV2.6 (0/1), GAERV2.7 (0/1), GAERV3.1 (0/19), GAERV3.2 (0/1), GAERV3.3 (0/1), GAERV3.4 (0/1), GAERV3.5 (0/2)

D. ERV2_Tet.1 (0/1), ERV2_Tet.2 (0/1), ERV3_Tet (0/1)

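[Figure: fish species tree with genome sizes and time labels. Labels recovered from the slide are listed below.]

Species: Japanese sardine, Carp, Zebrafish, Green Spotted pufferfish, Torafugu, Hilgendorf’s saucord, Three-spined stickleback, Bastard halibut, Medaka

Values shown: 84.8, 183.3, 155.5, 179.7, 191.2, 484.9, 304.1, 190.1

Genome sizes and species labels, in extracted order: 1700 Mbp, T. nigroviridis, 675 Mbp, 385 Mbp, D. rerio, 1000 Mbp, G. aculeatus, O. latipes

Time labels: ~4 mya, ~1 mya, ~4 mya, ~3 mya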

Marcella McClure, P.I. (Marcie)

Sean Cleveland, Ph.D. graduate student, Microbiology

Undergraduates:
Holly Basta, Microbiology
Ted Weatherwax, Microbiology
Alex Busak, Computer Science
Sabindra Katlia, Biotechnology
Robert Frost, Microbiology
Crystal Hepp, Microbiology




Software Engineer: Rochelle Clinton

Genome Parsing Suite (GPS) evolves into “a system for Retroid Agent diSCovery and anALysis,” RASCAL


Figure 4. Legacy Workflow. Current workflow of the GPS system including in-house analytical scripts (3a-b, & c-f) and external methods 2.d and 3.c.

Figure 5. Web 2.0 Application. The new Web 2.0 Application Architecture of the RASCAL will allow simpler management of the RASCAL data, more modular development of new functionality and a rich user interface incorporating the JBrowse genomic browser and rich web technologies including AJAX, RSS, and using data in either XML or JSON format.



Predictive features of RASCAL

Viability score

Complementarity Model


Networks of ERV co-expression to overcome inactive genes.

Translational Read Through Model



1) Selection of stop codons and frame shifts as regulators of ERV polycistronic messages.

2) Selection of stop codons and frame shifts as regulators of specific ERV genes.


RT provider

LTR provider

Host gene expression network

Integrase provider